Next: The Cilk language and
Up: Adaptive and Reliable Previous: Adaptive
and Reliable
A strong case argues for the use of networks of workstations (NOWs) as parallel-computation platforms [3], and Cilk-NOW [6] is a software system that has been designed and implemented to run parallel programs easily and efficiently on networks of UNIX workstations. Implemented entirely in user-level software on top of UNIX, Cilk-NOW is a runtime system for a functional subset of the parallel Cilk language [6, 8, 26], a multithreaded extension of C. Applications written in Cilk include graphics rendering, backtrack search, protein folding [37], and the *Socrates chess program [25] which won second prize at the 1995 ICCA World Computer Chess Championship running on the 1824-node Intel Paragon at Sandia National Labs. Like all runtime systems for Cilk, Cilk-NOW schedules threads using a provably efficient algorithm based on the technique of random "work stealing" [6, 9] in which processors with no threads steal threads from victims chosen at random. With this algorithm, Cilk delivers performance that is guaranteed to be both efficient and predictable [6, 8]. In addition to thread scheduling, Cilk-NOW also performs macroscheduling [30]. That is, Cilk-NOW automatically identifies idle workstations and assigns those idle workstations to help out with running Cilk programs.
Figure 1: (a) This plot shows the number of
machines, out of the 50 machines in our network, that were idle at
each point in time over the course of one typical week in March,
1995. (b) This histogram shows the number of idle
processor-hours broken down by idle time-interval. When a machine
remains idle for a period of t hours, it contributes t
hours to the height of the bar plotted at position t rounded up
to the nearest 10 minutes.
The Cilk-NOW runtime system is designed to execute Cilk programs efficiently in the highly dynamic environment of a NOW. Figure 1(a) plots the number of machines that were idle at each point in time over the course of a typical week for a network of 50 SPARCstations at the MIT Laboratory for Computer Science. As can be seen from this plot, though more machines are idle at night, a significant number of machines are idle at various times throughout the day. Therefore, by adaptively using idle machines both day and night, we can take advantage of significantly more machine resources than if we run our parallel jobs as batch jobs during the night. Figure 1(b) is a histogram giving the total idle processor-hours broken down by idle time-interval, from this experiment. This histogram shows that a significant percentage of idle time (1104 processors-hours, or 19.1% of the total 5776 processor-hours) comes from machines that are idle for less than 30 minutes at a time. Thus, the efficient exploitation of idle machines requires that machines are able to join and leave a computation quickly and without human intervention. These observations are consistent with those of others [5, 20, 27, 28, 31].
Cilk-NOW provides the following features for running Cilk programs on a network of workstations.
Recently, we ran a Cilk protein-folding application pfold [37] using Cilk-NOW on a network of about 50 Sun SPARCstations connected by shared 10-Mb/s Ethernet to solve a large-scale protein-folding problem. The program ran for 9 days, surviving several machine crashes and reboots, utilizing 6566 processor-hours of otherwise-idle cycles, with no administrative effort on our part (besides typing pfold at the command-line to begin execution), while other users of the network went about their business unaware of the program's presence.
It is important to note that Cilk-NOW provides these features only for Cilk-2 programs which are essentially functional. Cilk-NOW does not support more recent versions of Cilk (Cilk-3 and Cilk-4) that incorporate virtual shared memory, and in particular, Cilk-NOW does not provide any kind of distributed shared memory. In addition, Cilk-NOW does not provide fault tolerance for its I/O facility.
In this paper, we present the design of Cilk-NOW, focusing on those features of Cilk-NOW that are particular to the NOW environment. The Cilk-2 language, work-stealing scheduler, MPP implementation, and guaranteed performance model have been covered at length in other papers [6, 8, 9, 26]. In this paper, we shall focus on adaptive parallelism and fault tolerance. Specifically, we will show how Cilk-NOW's end-to-end design [38] leverages algorithmic properties of the Cilk programming model and work-stealing scheduler in order to amortize all the overhead of adaptive parallelism and fault tolerance against the analytically and empirically bounded overhead of Cilk's work-stealing scheduler.
The remainder of this paper is organized as follows. In Section 2 we review the Cilk-2 language and work-stealing scheduler as first introduced in [8]. In Section 3 we describe the architecture of a Cilk job executing under the Cilk-NOW runtime system. Then, in Section 4 we explain how Cilk-NOW implements adaptive parallelism, and in Section 5 we explain how Cilk-NOW performs checkpointing, fault detection, and fault recovery. In Section 6 we describe the Cilk-NOW macroscheduling system architecture. In Section 7 we compare the Cilk-NOW system to related work. Finally, in Section 8 we outline plans for future work, and we conclude.