Transparent Fault Tolerance for Parallel Applications
on Networks of Workstations
Daniel J. Scales and Monica S. Lam
Computer Systems Laboratory
Stanford University
Abstract
This paper describes a new method for providing transparent fault
tolerance for parallel applications on a network of workstations. We
have designed our method in the context of a shared object system called
SAM, a portable run-time system which provides a global name space and
automatic caching of shared data. SAM incorporates a novel design
intended to address the problem of high communication overheads in
distributed memory environments and is implemented on a variety of
distributed memory platforms. Our fundamental approach to providing
fault tolerance is to ensure the replication of all data on more than
one workstation using the dynamic caching already provided by SAM. The
replicated data is accessible to the local processor like other cached
data, making access to shared data faster and potentially offsetting
some of the fault tolerance overhead. In addition, our method uses
information available in SAM applications on how processes access
shared data to enable several optimizations which reduce the
fault-tolerance overhead. We have built an implementation of our
fault tolerance method in SAM for heterogeneous networks of
workstations running PVM3. In this paper, we present our
fault-tolerance method and describe its implementation in detail. We
give performance results and overhead numbers for several large SAM
applications running on a cluster of Alpha workstations connected by
an ATM network. Our method is successful in providing transparent
fault tolerance for parallel applications running on a network of
workstations and is unique in requiring no global synchronizations and
no disk operations to a reliable file server.
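
The replication idea can be illustrated with a minimal sketch. This is
a toy in-memory model, not SAM's interface: the class ToyCluster, its
methods, and the min_copies parameter are hypothetical names introduced
only for illustration. It assumes immutable values and does not model
writes, messaging, or process recovery. It shows only that each object
is held in the caches of at least two nodes, that replicas serve local
reads, and that a surviving copy is re-replicated after a node fails.

```python
class ToyCluster:
    """Toy model (hypothetical, not SAM): each object lives in >= 2 node caches."""

    def __init__(self, node_names, min_copies=2):
        if len(node_names) < min_copies:
            raise ValueError("need at least min_copies nodes")
        self.min_copies = min_copies
        # node name -> local cache {object id -> value}
        self.caches = {name: {} for name in node_names}

    def holders(self, obj_id):
        """Return the sorted names of nodes whose cache holds obj_id."""
        return sorted(n for n, c in self.caches.items() if obj_id in c)

    def _ensure_replicated(self, obj_id, value):
        """Copy the object into more caches until min_copies nodes hold it."""
        for name in sorted(self.caches):
            if len(self.holders(obj_id)) >= self.min_copies:
                break
            self.caches[name].setdefault(obj_id, value)

    def create(self, node, obj_id, value):
        """Create an object on one node and replicate it immediately."""
        self.caches[node][obj_id] = value
        self._ensure_replicated(obj_id, value)

    def read(self, node, obj_id):
        """Read from the local cache, fetching a copy from a holder on a miss."""
        cache = self.caches[node]
        if obj_id not in cache:
            source = self.holders(obj_id)[0]
            cache[obj_id] = self.caches[source][obj_id]
        return cache[obj_id]

    def fail(self, node):
        """Drop a node, then restore the replica count from surviving copies."""
        lost = self.caches.pop(node)
        for obj_id in lost:
            survivors = self.holders(obj_id)
            if survivors:
                value = self.caches[survivors[0]][obj_id]
                self._ensure_replicated(obj_id, value)


cluster = ToyCluster(["a", "b", "c"])
cluster.create("b", "x", 42)
print(cluster.holders("x"))
cluster.fail("a")
print(cluster.holders("x"), cluster.read("c", "x"))
```

In this sketch no disk or central server is involved: recovery after
fail() relies only on a copy already cached on another node, which is
the property the abstract emphasizes.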