StarPU Handbook - StarPU Extensions
Hardware errors, for example, can cause individual tasks or even whole nodes to fail. For now, StarPU only provides some support for the failure of tasks.
In case a task implementation notices that it failed to compute properly, it can call starpu_task_failed() to notify StarPU of the failure.
tests/fault-tolerance/retry.c is an example of coping with such a failure. The principle is that, when submitting the task, one sets its prologue callback to starpu_task_ft_prologue(). That prologue turns the task into a meta task, which manages the repeated submission of try-tasks to perform the computation until one of them succeeds. One can create a try-task for the meta task by using starpu_task_ft_create_retry().
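The retry principle can be illustrated with a small standalone model in plain C. This is only a sketch of the control flow and does not use StarPU: struct model_task, model_run_until_success() and flaky_increment() are hypothetical names invented for the illustration, and the attempts run in a simple loop, whereas StarPU submits each try-task as a separate task.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a task; this is NOT StarPU's struct starpu_task. */
struct model_task
{
	bool (*compute)(const int *in, int *out); /* returns false on failure */
	const int *in;  /* input, only read */
	int *out;       /* output, only written */
	unsigned tries; /* number of attempts run so far */
};

/* Plays the role of the meta task: run attempts until one succeeds,
 * and return how many attempts were needed. */
static unsigned model_run_until_success(struct model_task *meta)
{
	assert(meta != NULL && meta->compute != NULL);
	for (;;)
	{
		meta->tries++;
		if (meta->compute(meta->in, meta->out))
			return meta->tries;
	}
}

/* Example computation that reports a failure on its first two calls. */
static unsigned flaky_calls;
static bool flaky_increment(const int *in, int *out)
{
	*out = *in + 1;
	return ++flaky_calls > 2;
}
```

Running model_run_until_success() on a task whose compute field is flaky_increment takes three attempts. The input is never modified, so every attempt starts from the same value.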
By default, try-tasks are simply retried until one of them succeeds (i.e. the task implementation does not call starpu_task_failed()). One can change this behavior by passing a check_failsafe function as the prologue parameter; it is called at the end of each try-task attempt. It can look at starpu_task_get_current()->failed to determine whether the try-task succeeded, in which case it can call starpu_task_ft_success() on the meta task to notify success, or failed, in which case it can call starpu_task_ft_create_retry() to create another try-task and submit it with starpu_task_submit_nodeps().
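The decision taken by such a check function can be sketched with a standalone model in plain C. Nothing below is StarPU API: struct model_try, struct model_meta, model_check() and model_submit() are hypothetical names. The policy shown, switching to an implementation assumed to be reliable after a failure, is only one possible choice and is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are StarPU types or functions. */
struct model_try
{
	bool failed;                          /* set when the attempt reports failure */
	void (*impl)(struct model_try *self); /* implementation used for this attempt */
	int in;
	int out;
};

struct model_meta
{
	bool done;      /* set once an attempt has been accepted */
	int result;
	unsigned tries;
};

/* Fast implementation that always reports a failure in this model. */
static void impl_fast(struct model_try *self)
{
	self->out = 0;
	self->failed = true;
}

/* Slower implementation, assumed to be reliable. */
static void impl_safe(struct model_try *self)
{
	self->out = self->in * 2;
	self->failed = false;
}

/* Check hook run at the end of each attempt: either accept the result,
 * or prepare another attempt that uses the reliable implementation.
 * Returns true when *next has to be run. */
static bool model_check(struct model_meta *meta, const struct model_try *cur, struct model_try *next)
{
	if (!cur->failed)
	{
		meta->done = true;
		meta->result = cur->out;
		return false;
	}
	*next = *cur;
	next->failed = false;
	next->impl = impl_safe;
	return true;
}

/* Drives attempts until the check hook accepts one. */
static void model_submit(struct model_meta *meta, int in)
{
	struct model_try cur = { false, impl_fast, in, 0 };
	struct model_try next;
	assert(meta != NULL);
	meta->done = false;
	meta->tries = 0;
	for (;;)
	{
		meta->tries++;
		cur.impl(&cur);
		if (!model_check(meta, &cur, &next))
			break;
		cur = next;
	}
}
```

In this model the first attempt fails, the check hook prepares a second attempt with impl_safe, and that attempt is accepted.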
This can however only work if the task input is not modified, and is thus not supported for tasks with data access mode STARPU_RW.
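The reason for this restriction can be seen in a small standalone C sketch, which does not use StarPU and whose function names are hypothetical. It assumes that a failed attempt may already have written to its data before the failure is noticed.

```c
#include <assert.h>

/* In-place update: reads and writes the same buffer,
 * comparable to a read-write access. */
static void increment_in_place(int *data)
{
	*data += 1;
}

/* Separate buffers: the input is only read, the output only written. */
static void increment_copy(const int *in, int *out)
{
	*out = *in + 1;
}

/* Run the in-place variant `attempts` times, as if every attempt but
 * the last had failed after writing its result. */
static int retried_in_place(int start, unsigned attempts)
{
	int data = start;
	for (unsigned i = 0; i < attempts; i++)
		increment_in_place(&data);
	return data;
}

/* Same scenario with a read-only input and a write-only output. */
static int retried_copy(int start, unsigned attempts)
{
	const int in = start;
	int out = 0;
	for (unsigned i = 0; i < attempts; i++)
		increment_copy(&in, &out);
	return out;
}
```

With three attempts starting from 41, retried_copy() still yields 42, whereas retried_in_place() yields 44 because each retry starts from the value left behind by the previous attempt.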