Next: The Cilk language and Up: Adaptive and Reliable Previous: Adaptive and Reliable

Introduction

A strong case argues for the use of networks of workstations (NOWs) as parallel-computation platforms [3], and Cilk-NOW [6] is a software system that has been designed and implemented to run parallel programs easily and efficiently on networks of UNIX workstations. Implemented entirely in user-level software on top of UNIX, Cilk-NOW is a runtime system for a functional subset of the parallel Cilk language [6, 8, 26], a multithreaded extension of C. Applications written in Cilk include graphics rendering, backtrack search, protein folding [37], and the *Socrates chess program [25] which won second prize at the 1995 ICCA World Computer Chess Championship running on the 1824-node Intel Paragon at Sandia National Labs. Like all runtime systems for Cilk, Cilk-NOW schedules threads using a provably efficient algorithm based on the technique of random "work stealing" [6, 9] in which processors with no threads steal threads from victims chosen at random. With this algorithm, Cilk delivers performance that is guaranteed to be both efficient and predictable [6, 8]. In addition to thread scheduling, Cilk-NOW also performs macroscheduling [30]. That is, Cilk-NOW automatically identifies idle workstations and assigns those idle workstations to help out with running Cilk programs.

Figure 1: (a) This plot shows the number of machines, out of the 50 machines in our network, that were idle at each point in time over the course of one typical week in March, 1995. (b) This histogram shows the number of idle processor-hours broken down by idle time-interval. When a machine remains idle for a period of t hours, it contributes t hours to the height of the bar plotted at position t rounded up to the nearest 10 minutes.

The Cilk-NOW runtime system is designed to execute Cilk programs efficiently in the highly dynamic environment of a NOW. Figure 1(a) plots the number of machines that were idle at each point in time over the course of a typical week for a network of 50 SPARCstations at the MIT Laboratory for Computer Science. As can be seen from this plot, though more machines are idle at night, a significant number of machines are idle at various times throughout the day. Therefore, by adaptively using idle machines both day and night, we can take advantage of significantly more machine resources than if we run our parallel jobs as batch jobs during the night. Figure 1(b) is a histogram giving the total idle processor-hours broken down by idle time-interval, from this experiment. This histogram shows that a significant percentage of idle time (1104 processors-hours, or 19.1% of the total 5776 processor-hours) comes from machines that are idle for less than 30 minutes at a time. Thus, the efficient exploitation of idle machines requires that machines are able to join and leave a computation quickly and without human intervention. These observations are consistent with those of others [5, 20, 27, 28, 31].

Cilk-NOW provides the following features for running Cilk programs on a network of workstations.

Ease of use.: A user can run a Cilk program in parallel on a NOW as if the program were only being run on the local workstation. The user simply types the program's command line, and then the Cilk-NOW runtime system automatically schedules the execution of the program in parallel across the network.
Adaptive parallelism.: The Cilk-NOW system adaptively executes Cilk programs on a dynamically changing set of otherwise-idle workstations [6, 10]. When a given workstation is not being used by its owner, the workstation automatically joins in and helps out with the execution of a Cilk program. When the owner returns to work, the machine automatically retreats from the Cilk program.
Fault tolerance.: The Cilk-NOW runtime system automatically performs checkpointing, detects failures, and performs recovery [6] while Cilk programs themselves remain fault oblivious. That is, Cilk-NOW provides fault tolerance without requiring that programmers code for fault tolerance.
Flexibility.: The Cilk-NOW system allows the conditions that are used to determine the idleness of workstations to be set dynamically, in accordance with the tastes of the users and the owners of the machines whose cycles are being stolen. This flexibility preserves the sovereignty of each workstation's owner which is essential to ensure that owners are willing to contribute their workstations for use by others.
Security.: The Cilk-NOW system uses secure protocols that do not open a workstation to unauthorized users running foreign code on a machine. The desired degree of security is that which a given system uses to authenticate its remote execution protocol.
Guaranteed performance.: The Cilk-NOW system executes Cilk programs using a work-stealing scheduler. This scheduler delivers performance that can be predicted accurately with a simple abstract model [6, 8]. Moreover this simple model can be adapted to the case of heterogeneous processors and networks [32].

Recently, we ran a Cilk protein-folding application pfold [37] using Cilk-NOW on a network of about 50 Sun SPARCstations connected by shared 10-Mb/s Ethernet to solve a large-scale protein-folding problem. The program ran for 9 days, surviving several machine crashes and reboots, utilizing 6566 processor-hours of otherwise-idle cycles, with no administrative effort on our part (besides typing pfold at the command-line to begin execution), while other users of the network went about their business unaware of the program's presence.

It is important to note that Cilk-NOW provides these features only for Cilk-2 programs which are essentially functional. Cilk-NOW does not support more recent versions of Cilk (Cilk-3 and Cilk-4) that incorporate virtual shared memory, and in particular, Cilk-NOW does not provide any kind of distributed shared memory. In addition, Cilk-NOW does not provide fault tolerance for its I/O facility.

In this paper, we present the design of Cilk-NOW, focusing on those features of Cilk-NOW that are particular to the NOW environment. The Cilk-2 language, work-stealing scheduler, MPP implementation, and guaranteed performance model have been covered at length in other papers [6, 8, 9, 26]. In this paper, we shall focus on adaptive parallelism and fault tolerance. Specifically, we will show how Cilk-NOW's end-to-end design [38] leverages algorithmic properties of the Cilk programming model and work-stealing scheduler in order to amortize all the overhead of adaptive parallelism and fault tolerance against the analytically and empirically bounded overhead of Cilk's work-stealing scheduler.

The remainder of this paper is organized as follows. In Section 2 we review the Cilk-2 language and work-stealing scheduler as first introduced in [8]. In Section 3 we describe the architecture of a Cilk job executing under the Cilk-NOW runtime system. Then, in Section 4 we explain how Cilk-NOW implements adaptive parallelism, and in Section 5 we explain how Cilk-NOW performs checkpointing, fault detection, and fault recovery. In Section 6 we describe the Cilk-NOW macroscheduling system architecture. In Section 7 we compare the Cilk-NOW system to related work. Finally, in Section 8 we outline plans for future work, and we conclude.

Next: The Cilk language and Up: Adaptive and Reliable Previous: Adaptive and Reliable

Robert D. Blumofe
Wed Nov 13 10:25:03 CST 1996