Next: Cilk-NOW macroscheduling Up: Adaptive and Reliable Previous: Adaptive parallelism

Fault tolerance

With transparent fault tolerance built into the Cilk-NOW runtime system, Cilk jobs may survive machine crashes or network outages despite the fact that Cilk programs are fault oblivious, having been coded with no special provision for handling machine or network failures. If a worker crashes, then other workers automatically redo any work that was lost in the crash. In the case of a more catastrophic failure, such as a power outage, a total network failure, or a crash of the file server, then all workers may crash. For this case, Cilk-NOW provides automatic checkpointing, so when service is restored, the Cilk job may be restarted with minimal lost work. Recall that Cilk-NOW does not provide fault tolerance for I/O.

In this section, we show how the structure used to support adaptive parallelism--which leverages Cilk's tree structure and the work-stealing scheduler--may be further leveraged to build these fault tolerant capabilities in Cilk-NOW. As with adaptive parallelism, all of the overhead associated with fault tolerance (other than the cost of periodic checkpoints) can be amortized against the number of steals which grows at most linearly with the critical path and is not a function of the work.

Given adaptive parallelism, fault tolerance is only a short step away. With adaptive parallelism, a worker may leave a Cilk job, but before doing so, it first migrates all of its subcomputations to other workers. In contrast, when a worker crashes, all of its subcomputations are lost. To support fault tolerance, we add a mechanism that allows surviving workers to redo any work that was done by the lost subcomputations. Such a mechanism must address two fundamental issues. First, not all work is necessarily idempotent, so redoing work may present problems. We address this issue with a technique that we call a return transaction. Specifically, we ensure that the work done by any given subcomputation does not affect the state of any other subcomputations until the given subcomputation finishes. Thus, from the point-of-view of any other subcomputation, the work of a subcomputation appears as a transaction: either the subcomputation finishes and commits its work by making it visible to other subcomputations, or the subcomputation never happened. Second, the lost subcomputations may have done a large amount of work, and we would like to minimize the amount of work that needs to be redone. We address this issue by incorporating a transparent and fully distributed checkpointing facility. This checkpointing facility also allows a Cilk job to be restarted in the case of a total system failure in which every worker crashes.

To turn the work of a subcomputation into a return transaction, we modify the behavior of the subcomputation's result closure. In Cilk, returning a value is always the last operation performed by a Cilk procedure, so the result closure cannot be ready until the subcomputation is finished. In addition, recall that the execution of the result closure and the finishing of the subcomputation both warrant a message to the victim worker. Thus, we bundle these two messages into a single larger message sent to the victim worker. When the victim worker receives this message, it commits all of the thief subcomputation's work by sending the appropriate result value from the victim closure, freeing the victim closure (and its assignment information), and sending an acknowledgment back to the thief worker.

With subcomputations having this transactional nature, a Cilk job can tolerate individual worker crashes as follows. Suppose a worker crashes. Eventually, the clearinghouse will detect the crash, and the other living workers will learn of the crash at the next update from the clearinghouse. When a worker learns of a crash, it goes through all of its subcomputations, checking each assigned closure to see if it is assigned to the crashed worker. Each such closure is moved from the assigned pool back to the ready pool (and its assignment information is freed). Thus, all of the work done by the closure's thief subcomputation which has been lost in the crash will eventually be redone. Additionally, when a worker learns of a crash, it goes through all of its subcomputations to see if it has any that record the crashed worker as the subcomputation's victim. For each such subcomputation, the worker aborts it as follows. The worker goes through all of the subcomputation's assigned closures sending to each thief worker an abort message specifying the name of the thief subcomputation. Then the worker frees the subcomputation and all of its closures. When a worker receives an abort message, it finds the thief subcomputation named in the message and recursively aborts it. All of the work done by these aborted subcomputations must eventually be redone. In order to avoid aborting all of these subcomputations (which may comprise the entire job in the case when the root subcomputation is lost) and redoing potentially vast amounts of work, and in order to allow restarting when the entire job is lost, we need checkpointing.

Cilk-NOW performs automatic checkpointing without any synchronization among different workers and without any notion of global state. Specifically, each subcomputation is periodically checkpointed to a file named with the subcomputation's name. For example, a subcomputation named r:i would be checkpointed to a file named scomp_r_i. We assume that all workers in the job have access to a common file system (through NFS or AFS, for example), and all checkpoint files are written to a common checkpoint directory. To write a checkpoint file for a subcomputation r:i, the worker first opens a file named scomp_r_i.temp. Then, it writes the subcomputation record and all of the closures--including the assignment information for the assigned closures--into the file. Finally, it atomically renames the file scomp_r_i.temp to scomp_r_i, overwriting any previous checkpoint file. A checkpoint file can be read to recover the subcomputation. On writing a checkpoint file, the worker additionally prunes any no-longer-needed checkpoint files.

If workers crash, the lost subcomputations can be recovered from checkpoint files. In the case of a single worker crash, the lost subcomputations can be recovered automatically. When a surviving worker finds that it has a subcomputation with a closure assigned to the crashed worker, then it can recover the thief subcomputation by reading the checkpoint file. In the case of a large-scale failure in which every worker crashes, the Cilk job can be restarted from checkpoint files by setting the -Recover flag on the command line. Recovery begins with the root subcomputation whose checkpoint file is scomp_0_1. After recovering the root subcomputation, then every other subcomputation can be recovered by recursively recovering the thief subcomputation for each of the root subcomputation's assigned closures.

Next: Cilk-NOW macroscheduling Up: Adaptive and Reliable Previous: Adaptive parallelism

Robert D. Blumofe
Wed Nov 13 10:25:03 CST 1996