The following paper was originally published in the Proceedings of the USENIX 1996 Annual Technical Conference (San Diego, California, January 1996).

AFRAID - A Frequently Redundant Array of Independent Disks

Stefan Savage, University of Washington, Seattle, WA
John Wilkes, Hewlett-Packard Laboratories, Palo Alto, CA

Abstract

Disk arrays are commonly designed to ensure that stored data will always be able to withstand a disk failure, but meeting this goal comes at a significant cost in performance. We show that this is unnecessary. By trading away a fraction of the enormous reliability provided by disk arrays, it is possible to achieve performance that is almost as good as a non-parity-protected set of disks. In particular, our AFRAID design eliminates the small-update penalty that plagues traditional RAID 5 disk arrays. It does this by applying the data update immediately, but delaying the parity update to the next quiet period between bursts of client activity. That is, AFRAID makes sure that the array is frequently redundant, even if it isn't always so. By regulating the parity update policy, AFRAID allows a smooth trade-off between performance and availability. Under real-life workloads, the AFRAID design can provide close to the full performance of an array of unprotected disks, and data availability comparable to a traditional RAID 5. Our results show that AFRAID offers 42% better performance for only 10% less availability, 97% better for 23% less, and as much as a factor of 4.1 times better performance for giving up less than half RAID 5's availability. We explore here the detailed availability and performance implications of the AFRAID approach.

1. Introduction

In a RAID 5 disk array, small writes take a long time to complete [Patterson88]. This is known as the "small update problem". In such an array, redundancy for a stripe of data is provided by a parity block, computed as the xor of the data blocks in the stripe, in order to allow recovery if any disk fails. If a portion of a stripe is updated, the parity data must also be updated to preserve the recoverability property (Figure 1). To do this, it is necessary to (1) read the old value of the data to be overwritten, unless it is already cached in the array controller; (2) read the old parity; (3) xor the new data with the old, and xor the result with the old parity to generate the new parity data; (4) write the new data and (5) write the new parity. Thus, three or four disk I/Os are needed to achieve one small write - all of which are in the critical path.

In contemplating this problem we made the following observations:

* modern disks are extremely reliable-so much so that disk array reliability is limited more by its support components than its disks;
* many real workloads have slack periods between bursts of client activity;
* people are already well-used to the notion of time-limited exposure to risk.

These eventually led us to the idea of AFRAID (A Frequently Redundant Array of Independent Disks).
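For concreteness, the read-modify-write sequence described above can be sketched in a few lines. The sketch below is ours and purely illustrative, not taken from any actual array firmware; the in-memory "disks" merely stand in for devices whose every access would sit in the critical path of the client's write.

    # A minimal, self-contained sketch (ours) of the traditional RAID 5
    # small-write path, following steps (1)-(5) above.  "Disks" are plain
    # byte arrays; in a real array each read/write below is a disk I/O in
    # the critical path of the client's request.

    BLOCK = 8 * 1024  # an 8KB stripe unit, as in the examples later in the paper

    def xor_blocks(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    def read_block(disk, offset):
        return bytes(disk[offset:offset + BLOCK])

    def write_block(disk, offset, block):
        disk[offset:offset + BLOCK] = block

    def raid5_small_write(data_disk, parity_disk, offset, new_data):
        old_data   = read_block(data_disk, offset)        # (1) pre-read old data
        old_parity = read_block(parity_disk, offset)      # (2) pre-read old parity
        delta      = xor_blocks(old_data, new_data)       # (3) xor new data with old...
        new_parity = xor_blocks(delta, old_parity)        #     ...and with the old parity
        write_block(data_disk, offset, new_data)          # (4) write the new data
        write_block(parity_disk, offset, new_parity)      # (5) write the new parity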
AFRAID is a RAID 5 disk array that relaxes the coherency between data and parity for short periods of time; parity is made consistent again in the idle periods between bursts of client writes. Thus the stored data is frequently held redundantly, rather than always guaranteed to be so. In this approach, small updates are not required to wait for the parity to be updated, thereby reducing the four I/Os in the critical path of the traditional small-update protocol to just one: write the new data. The benefit is that performance approaches that of an unprotected array. The disadvantage is a slightly increased risk of data loss from a disk failure, but we will show that this increase is small in practice, and also that it can be bounded at the cost of some performance. That is, AFRAID allows a smooth trade-off between increased reliability and increased performance.

1.1. The AFRAID design

A write in AFRAID does two things: it updates the on-disk copy of the original data, and it causes the target stripes to be marked unredundant-i.e., that their parity needs rebuilding. This is indicated by setting a bit per stripe in a non-volatile memory in the array controller; attempting to re-mark an already-marked stripe does nothing. Once one or more stripes have been marked, the AFRAID controller waits until the array is idle and then starts to process the pending parity updates, so that they can be achieved at effectively zero performance cost to the clients of the array. Each parity update requires reading all the data blocks in the stripe and xoring them together to generate a new parity block. The new parity block is then written over the top of the old one, after which the unredundant mark for the stripe is removed. The rebuilding of adjacent unprotected stripes can be coalesced to increase the efficiency of disk accesses. Since the overhead of the parity update is linear with the number of disks in a stripe group, AFRAID is best suited to arrays with smaller numbers of disks.

The additional cost to build an AFRAID is just the cost of the marking memory: one bit per stripe. With an array that is 5 disks wide and has a stripe unit size, or stripe depth, of 8KB, this is ~3 bits per 100KB, or 3 KB of memory per 1GB of stored data-a trivial cost compared to the cost of the disk storage itself. The recovery technique for a failed marking memory is simply to rebuild parity for the whole array. This rebuilding can proceed in parallel with continued use.

Any write to a stripe unprotects it all-not just the data being written to. Somewhat counterintuitively, the data put at risk may include data that has not been written to recently. This failure mode is a natural consequence of RAIDs protecting whole stripes, rather than individual blocks. In practice, the exposure is quite small, because the likelihood of a disk failing during the brief unprotected period is minimal.

Rather than simply waiting for an idle period before starting to reconstruct parity, it is possible to configure AFRAID to be more aggressive about availability, at the possible cost of greater interference with foreground I/Os. Sample policies include:

* allowing parity rebuilding to start even when there is a non-zero client load on the array;
* giving parity rebuilding priority over foreground client I/Os;
* reverting to normal RAID 5 behavior.

These techniques can be enabled dynamically and adaptively to achieve specified long-term availability and performance goals. We explore the performance and availability effects of some of these policies below.
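The write path and background rebuild described above can be summarized in pseudocode. The following Python sketch is ours and deliberately ignores concurrency, caching, and failure handling; the names (unredundant, rebuild_pending_parity, and so on) are invented for illustration.

    # A simplified sketch (ours) of the AFRAID write path and the background
    # parity rebuild of section 1.1.  Stripes hold N data blocks plus a parity
    # block; the per-stripe "needs parity" bits live in the set 'unredundant',
    # standing in for the controller's non-volatile marking memory.

    def xor_all(blocks):
        parity = bytearray(len(blocks[0]))
        for block in blocks:
            parity = bytearray(p ^ b for p, b in zip(parity, block))
        return bytes(parity)

    class Stripe:
        def __init__(self, data_blocks):
            self.data = list(data_blocks)
            self.parity = xor_all(self.data)

    unredundant = set()   # stripe numbers whose parity needs rebuilding

    def afraid_write(stripes, stripe_no, block_no, new_data):
        stripes[stripe_no].data[block_no] = new_data   # the only I/O in the critical path
        unredundant.add(stripe_no)                     # re-marking a marked stripe is a no-op

    def rebuild_pending_parity(stripes, array_is_idle):
        # Background task, run once the array has gone idle (section 4.1 uses a
        # 100ms idleness threshold); it stops as soon as foreground work returns.
        while unredundant and array_is_idle():
            stripe_no = next(iter(unredundant))
            stripe = stripes[stripe_no]
            stripe.parity = xor_all(stripe.data)       # read whole stripe, recompute parity
            unredundant.discard(stripe_no)             # clear the mark only after the parity write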
1.2. AFRAID design assumptions

In this section we provide some additional information about the suppositions on which we based our work.

Disk reliability

Modern disks have published mean time to failure (MTTF) times of 0.5-1 million hours, and this number is increasing every year. As a result, small disk arrays have vanishingly small chances of experiencing a dual-disk failure, which would cause a data loss. The expected disk-related mean time to data loss (MTTDL) in a small RAID 5 with a half-dozen disks is measured in hundreds of millions of hours-several tens of thousands of years. This is far larger than the limits imposed by other "support" components such as the array's power supply, controller, and cabling. Small disk arrays with fewer than a dozen disks are the most common in practice, and their overall data availability may not be reduced much if full on-disk redundancy is not provided for short periods. As we will show, any data-redundancy scheme that produces a disk-related MTTDL of a few million hours or better will be dominated in practice by the array support components. An aggregate MTTDL of a million hours (114 years) translates into only a 2.6% likelihood of any data loss at all during a typical 3-year array lifetime. This is much lower than the rate of problems due to software failures, operator errors, and other environmental difficulties [Gray90, Gray91a]-that is, a small-to-medium sized array that achieves an overall MTTDL of 1M hours or better will probably be entirely adequate for the majority of its applications.

In addition to reduced failure rates, modern disks also provide feedback mechanisms for predicting when such failures will occur. These can warn of impending disk failures hours or days in advance by looking at soft-error logs (e.g., the number of retries required on reads), or the variation in head flying-height. [Lin90c] discusses one such experiment, which was able to predict 93.7% of system failures in a distributed file system, typically many hours before they happened. More recently, IBM disk drives have incorporated a scheme that has been shown to offer a mean of 10 days warning of disk failures [IBMpfa95], with an anticipated success rate of 50-60%. Other manufacturers are following suit. With such techniques available, the likelihood of experiencing an unexpected disk failure can be made very small.

Bursty access patterns

Many (if not most) real uses of disk arrays have bursty access patterns, with periods of relative inactivity between groups of client accesses. [Ruemmler93] offers one quantification of this, in some detail. Indeed, we believe that it is usually only in benchmarks or very large, carefully-tuned systems that arrays are driven to saturation for long periods of time. Given this, there will be spare I/O time available in the idle periods between bursts. If expensive operations such as parity updates can be delayed to these less busy periods, they can be achieved at little or no apparent performance cost to the client.

Time-limited exposure to data loss

This principle is already well understood and frequently exploited. For example, data in UNIX file systems is held unprotected in volatile memory buffers before it is written out to disk [Ritchie84a]. Because data is typically only held in volatile memory for short periods of time before being written to disk, this exposure is tolerated in exchange for the increased performance that results.
This idea has been extended still further by special file systems that deliberately store data in volatile memory [McKusick90, Ohta90]. Most systems that use non-volatile memory (NVRAM) assume that a single copy of the stored data is sufficient: providing true single-failure-tolerant NVRAM systems is difficult [Menon93], and so is rarely done. Examples of common systems that make this tradeoff include the popular PrestoServe card [Moran90] for NFS servers, as well as recent file system designs, such as LFS [Rosenblum92] and Zebra [Hartman95]. All these rely on assembling large amounts of data in NVRAM to obtain both good performance and acceptable reliability. Other studies have suggested ways to extend this use of NVRAM still further [Baker92b].

Together, these thoughts led us to the AFRAID notion: consciously sacrificing a small amount of data redundancy in order to achieve considerably better performance. The remainder of the paper explores the AFRAID idea in some detail. We begin with a short discussion of closely-related prior work. This is followed by a description of analytic availability models for AFRAID, and then a quantitative evaluation of the availability and performance effects of the AFRAID design as compared to a traditional RAID 5 and a set of unprotected disks. We conclude with some observations on what this study taught us.

2. Related work

The most obviously related prior solution to the small update problem is parity-logging [Stodolsky93]. A parity-logging array defers the parity-update cost to a later time, at which point it can be performed more efficiently. It does this by performing the traditional RAID 5 read-modify-write operation on the data block being updated, but then, instead of doing the same for the parity block, it writes the xor of the old and new data to a log-thereby preserving full redundancy all the time. At a later date, the log file is replayed against the disk array, and the parity updated in situ.

By comparison, AFRAID avoids a pre-read of the old data in the critical path for writes, and thus saves a complete disk revolution on most small writes. It also avoids potentially long delays during parity rebuilds. For efficiency, the parity logging scheme applies a batch of parity updates at a time, which can interfere with foreground I/O requests. Although some of the policies used in AFRAID to control availability can also generate interference with foreground I/Os, they are less intrusive because parity updates may be preempted between stripes. The parity logging scheme could be extended to apply its parity updates in idle periods, like AFRAID. This would improve its performance, except under workloads in which the parity log fills up, when either the pending parity updates must be applied immediately, interrupting foreground processing to do so, or the array must revert to a regular RAID 5 model of operation until it becomes idle enough to apply updates. Efficiency will drop for either approach. There is no parity log to fill up in AFRAID-all that happens is that the data becomes less well protected. Finally, parity logging is quite a complicated scheme; AFRAID is much simpler.

Another approach to the same problem is the floating-parity scheme [Menon89]. This reduces the cost of parity updates by writing the new parity data to an empty, rotationally-nearby slot in the array, rather than waiting for a full revolution to go by to update it in place. Such an array still needs to do the old-data and old-parity reads.
The floating-data scheme extends this placement optimization to data blocks, too, but this requires considerably larger amounts of non-volatile state information: a word or two per stored block. In contrast to these two schemes, AFRAID has a less efficient parity update scheme (reading all the data portions of the stripe and recalculating the parity from scratch), but it uses it during a time when the array is less utilized, so that the resulting client-visible cost is small or zero. The result is better performance when the array is active, at the cost of a small exposure to data loss if a (rare) disk failure happens before the new parity has been calculated and written. We quantify the degree of this exposure more precisely in Section 3.

The idea of allowing a file to become unredundant while it was being created was proposed in [Cormen93], in the context of parallel file systems for scientific computing. This paper also suggested the notion of paritypoints, by analogy to checkpoints, at which an application could ask for the parity to be computed for the file. Our AFRAID design takes these ideas several steps further: it automates the process of recomputing parity, and does so in idle periods rather than only on demand; it isolates the parity rebuilding inside the disk array where it need not be visible to application programs; and it is not limited to stripe-aligned files.

Determining when the array is going to be idle enough to do the rebuild without impacting incoming work is a problem that has been studied already. [Golding95] discusses this problem and a variety of solutions to it. All the well-known techniques that have been developed for performing stripe rebuilds in a recently repaired disk array can be applied to the problem of rebuilding the parity in AFRAID (e.g., [Muntz90, Holland92]). These include opportunistically piggybacking the parity updates on other "nearby" activity done in the foreground; batching together updates that are physically close together; or simply doing a single, linear sweep through the disks. Similarly, existing schemes for balancing disk traffic under failure conditions can be applied to AFRAID (e.g., [Gray90c, Muntz90, Blaum94, Reddy91]). For ease of exposition, however, we concentrate here on a straightforward left-symmetric RAID 5 data layout.

3. Availability model of AFRAID

In this section we develop analytic models of data-loss mechanisms for AFRAID, basing them on similar models for traditional RAIDs. In the next section we apply these models to the data from our simulation experiments to provide a quantitative evaluation of data availability in AFRAID. Following [Gibson93], we do not separate the cases of inaccessible data from data that has been lost irrevocably. We use the term availability in this paper to refer to the amount of time that data is accessible and/or not lost.

To make our discussion concrete, we apply a set of assumptions about typical failure rates for modern array components; these are summarized in Table 1. Because manufacturers do not yet publish MTTF_unexpected separately from overall MTTF, we have been fairly conservative and set the coverage factor C for disk failure predictions to 0.5 in our calculations. That is, we assume that half of the disk failures will not be predicted ahead of time. In what follows, we include the coverage factor in the MTTF_disk calculations:

    MTTF_disk = MTTF_disk-raw / (1 - C)

Table 1: values assumed for calculations in this paper.
    ----------------------------------------------------------------
    Parameter                                                  Value
    ----------------------------------------------------------------
    disk mean time to failure (MTTF_disk-raw)               1M hours
    support hardware mean time to data loss (MTTDL_support) 2M hours
    disk failure-prediction coverage (C)                         0.5
    mean time to repair (MTTR)                              48 hours
    stripe unit size (S)                                         8KB
    size of disk (V_disk)                                        2GB
    ----------------------------------------------------------------

3.1. Mean time to first data loss

Most RAID data-loss calculations look only at the time to the first catastrophic data loss due to disk failure, which occurs if two disk failures occur so close together that the first failure has not yet been recovered from. If this happens, two disks' worth of data is lost. A convenient measure for this kind of catastrophe is the mean time to data loss (MTTDL). The equation for a RAID 5 disk array with N+1 disks, assuming rare, independent, exponentially-distributed disk failures is:

    MTTDL_RAID_catastrophic = (MTTF_disk)^2 / (N * (N+1) * MTTR_disk)        (1)

With a 5-disk array, and the parameters of Table 1, this gives a theoretical MTTDL of ~4x10^9 hours, or about 475,000 years.

In addition to the regular RAID failure mode, AFRAID exhibits data loss if a single disk fails unexpectedly while there is some unprotected data. To determine the combined effect of these two modes, we look at the likelihood of data loss occurring when there is unprotected data (T_unprot) and when there is not (T_total - T_unprot). A conservative measure of the contribution to MTTDL for the period in which there is unprotected data is:

    MTTDL_AFRAID_unprotected = (T_total/T_unprot) * MTTF_disk/(N+1)        (2a)

This measure is conservative because we do not take account of the fact that in some cases only parity data will be lost: we just simplify and assume that there will always be some data loss. The rest of the time, when there is no unprotected data, AFRAID behaves just like a RAID for the MTTDL measure:

    MTTDL_AFRAID_RAID_catastrophic = T_total/(T_total - T_unprot) * MTTDL_RAID_catastrophic        (2b)

Summing these two contributions, which are best thought of as inverses of rates, gives:

    MTTDL_AFRAID = 1/(1/MTTDL_AFRAID_unprotected + 1/MTTDL_AFRAID_RAID_catastrophic)        (2c)

Section 4.3 presents the results of experimental determinations of this value over several workloads.

One more kind of multiple failure can afflict an AFRAID: if its NVRAM marking memory fails, the array will start reconstructing parity across all the stripes. This will take a little while (about ten minutes for an array using 2GB disks that can read at a sustained rate of 5MB/s). If a disk failure occurs before the parity has been completely rebuilt, the array has no way of knowing which stripes were still unprotected, if any, although it will be bounded by the knowledge of how far the reconstruction has progressed. The likelihood of this failure is exceedingly small, however, because of the small window of vulnerability (MTTDL > 10^11 hours), so we can safely ignore it here.

3.2. Mean data loss rate

The MTTDL measure indicates the expected rate of failures leading to any data loss. In any RAID 5-based system this occurs on a dual-disk failure, at which point a catastrophe occurs: two whole disks' worth of data vanishes. In addition, unprotected data in AFRAID is vulnerable to loss from a single-disk failure. However, the amount of data lost in this case is bounded by how much is unprotected-and we will show later that it is often quite small.
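Before going further, the MTTDL expressions of section 3.1 can be exercised numerically. The short sketch below is ours; it plugs in the Table 1 values, and the 1% unprotected-time fraction is a made-up example, since in practice that fraction is measured from the workload (section 4.3).

    # Numerical sketch (ours) of equations (1) and (2a)-(2c) using the Table 1
    # values.  The unprotected-time fraction is hypothetical; the paper measures
    # it per workload in section 4.3.
    import math

    N         = 4                       # data disks in a 5-disk (N+1) array
    MTTR      = 48.0                    # hours
    C         = 0.5                     # failure-prediction coverage
    MTTF_disk = 1.0e6 / (1 - C)         # only unpredicted failures count: 2M hours

    # Equation (1): dual-disk failures in a plain RAID 5
    mttdl_raid = MTTF_disk**2 / (N * (N + 1) * MTTR)            # ~4x10^9 hours

    unprot_frac = 0.01                  # hypothetical: unprotected 1% of the time

    # Equations (2a)-(2c): AFRAID's combined disk-related MTTDL
    mttdl_unprot    = (1.0 / unprot_frac) * MTTF_disk / (N + 1)
    mttdl_raid_part = mttdl_raid / (1.0 - unprot_frac)
    mttdl_afraid    = 1.0 / (1.0 / mttdl_unprot + 1.0 / mttdl_raid_part)

    # An MTTDL can also be read as a lifetime loss probability (section 1.2):
    three_years = 3 * 365.25 * 24                               # ~26k hours
    p_loss      = 1.0 - math.exp(-three_years / 1.0e6)          # ~2.6% at 1M hours

    print(f"RAID 5 MTTDL  ~ {mttdl_raid:.2e} hours")
    print(f"AFRAID MTTDL  ~ {mttdl_afraid:.2e} hours (1% unprotected)")
    print(f"3-year loss probability at a 1M-hour MTTDL ~ {p_loss:.1%}")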
There is an important qualitative difference between losing a block or two on a disk and losing the whole disk. For example, all disks have a few defective sectors or tracks, and new ones are occasionally added to this list over the disk's lifetime-but the occurrence of a bad block on a disk doesn't mean that the entire disk has been lost, or even that it should be discarded. Similarly, the effect of accidentally deleting a single small file is usually much less severe than that of losing an entire file system. The former may be merely tedious, while the latter can be a catastrophe. There are several reasons for this qualitative difference: not all data is equally valuable; some data can easily be reconstructed or recomputed; much data "dies young"-that is, it will be deleted or overwritten soon after it is created [Ousterhout85a]; recovering a single file is often simpler than rebuilding an entire disk set. Others have taken advantage of this difference before us. For example, the BSD fast file system [McKusick84] and its journalling file system successors take considerable care to maintain consistency of file system metadata, but are much more cavalier with user information.

As a result, we feel that it is important to measure the amount of data subject to loss, as well as the time to lose the first byte. A good metric for this is the mean data loss rate (MDLR): the product of the amount of data loss and the rate at which it is likely to occur. In addition to quantifying the effects of such losses, it has the advantage of being a reminder that mean time to failure values should be used only to define failure rates, not expectations of lifetimes.

The catastrophic data loss rate for a regular array due to a two-disk failure can be cast in these terms as:

    MDLR_RAID_catastrophic = 2 * V_disk * N/(N+1) * 1/MTTDL_RAID_catastrophic        (3)

where V_disk is the capacity of a single disk, which is reduced by the second term to reflect that some of the lost disk space held parity rather than data blocks. The RAID 5 array we considered earlier would have an MDLR of ~0.8 bytes/hour from this failure mode.

Analyzing the impact of single disk failures in AFRAID requires additional information. To provide a basis for this availability analysis, we first introduce the notion of parity lag, which is the amount of unredundant non-parity data present in the array at any time, measured in bytes. The mean parity lag is the average parity lag over some test period, such as the duration of a test workload. Note that parity lag is workload dependent. Data loss only occurs from a single disk failure if the parity lag is non-zero at the time of the failure. When this happens, one stripe unit (block) from each unredundant stripe is lost (the one on the failed disk), unless the lost block is a parity block, in which case no actual data loss occurs. The mean data loss rate for a single-disk failure on AFRAID with unprotected data is:

    MDLR_unprotected = (mean_parity_lag/N) * (N+1)/MTTF_disk        (4)

where the first term reflects the average amount of unprotected non-parity data vulnerable to a single disk failure, and the second term gives the total failure rate of all the disks in the array. We will present experimental determinations of these values in section 4.3.

Summing the different component contributions gives us the final mean data loss rate for the disk-related components of AFRAID:

    MDLR_AFRAID = MDLR_RAID_catastrophic + MDLR_unprotected        (5)
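The loss-rate expressions can be checked the same way. Again the sketch is ours; the mean parity lag is a hypothetical figure of the order reported in section 4.3, and decimal gigabytes are used so the results line up with the numbers quoted in the text.

    # Numerical sketch (ours) of equations (3)-(5).  The mean parity lag is a
    # hypothetical value; V_disk is in decimal gigabytes to match the quoted
    # figures (~0.8 bytes/hour for the catastrophic term).

    N          = 4
    V_disk     = 2.0e9                  # bytes (2GB disks)
    MTTF_disk  = 2.0e6                  # hours, coverage-adjusted as in section 3
    MTTDL_raid = MTTF_disk**2 / (N * (N + 1) * 48.0)

    # Equation (3): loss rate from dual-disk (catastrophic) failures
    mdlr_catastrophic = 2 * V_disk * (N / (N + 1)) / MTTDL_raid      # ~0.8 bytes/hour

    # Equation (4): loss rate from a single failure while some data is unprotected
    mean_parity_lag  = 64 * 1024        # hypothetical: 64KB unredundant on average
    mdlr_unprotected = (mean_parity_lag / N) * (N + 1) / MTTF_disk   # ~0.04 bytes/hour

    # Equation (5): total disk-related data loss rate for AFRAID
    mdlr_afraid = mdlr_catastrophic + mdlr_unprotected

    print(f"MDLR catastrophic ~ {mdlr_catastrophic:.2f} bytes/hour")
    print(f"MDLR unprotected  ~ {mdlr_unprotected:.3f} bytes/hour")
    print(f"MDLR AFRAID       ~ {mdlr_afraid:.2f} bytes/hour")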
3.3. The effect of support components

We have concentrated so far only on disk failures. This emphasis made sense when disks were much less reliable than support components such as cooling fans, power supplies, cabling, and other passive components. But one of our contentions is that disks are no longer the primary cause of problems in a disk array. The reliability of the support hardware and the array controller is little or no better than that of the disks. The data in [Schulze89] suggests that these support components would together lead to a mean time to failure of a small array of about 46k hours; [Gibson93] simply increases this to "a more reasonable value of 150k hours" without further discussion.

Fortunately, more recent designs and pressure by manufacturers to boost reliability seem to have increased the quality of these components, and although many array manufacturers disconcertingly consider the data either irrelevant or proprietary, a few do not, and we were quoted MTTF numbers of 20-35k hours, and MTTDL values of 270k to 5M hours. Some typical component MTTF examples are: 150k or 0.5-1M hours for the controller; 300-500k hours for a host bus adapter; 50-350k hours for a power supply module; and 1-3M hours for cabling and packaging.

It takes considerable engineering effort and use of redundant components to increase the overall MTTDL above 1M hours. For example, the HP AutoRAID array [Wilkes95] uses two redundant power supplies, three fans (any two of which can keep the system cool enough for continued operation), and can support a dual controller configuration; each controller has a separate NVRAM that uses dual rechargeable batteries that are periodically discharge-tested. The result: with a fully-populated system (12 disks and 2 controllers), the array's overall MTTDL is specified (probably conservatively) as 1.97M hours, together with an overall MTTF of 31k hours. Few designers of small arrays go to all this trouble: for example, Network Appliance's FAServer350 product has a predicted MTTF of around 20-30k hours with four disks, and disks are its only redundant components.

Together, these figures suggest that the current "more reasonable value" for the aggregated non-disk components of a conservatively-engineered array is probably about 2M hours. This is a far cry from the 4x10^9 hours calculated from the independent-disk failure model considered earlier. With a 2M hour MTTDL, our 5-disk array would suffer an MDLR of 4.0KB/hour; using the 150k hour figure from [Gibson93] would increase this to 53KB/hour. The lesson here is that it is the support components that determine the availability of a modern disk array, not its disks.

3.4. Non-volatile memory

Despite the extensive use of NVRAM in high-availability systems, remarkably little data has been published on its reliability. Integral lithium-cell-backed static RAM is probably one of the most reliable kinds of NVRAM: it offers retention lifetimes of 25-87k hours and extremely low failure rates [Dallas94, Dallas95], but it is quite expensive: ~$350/MB for a representative state-of-the-art part from Dallas Semiconductor. To avoid this expense, many systems use dynamic RAM backed by rechargeable batteries based on NiCd cells. The battery technology often limits the resulting availability: achieving MTTF values above a few tens of thousands of hours requires the use of redundant batteries whose status is periodically tested by controlled discharging, and careful attention to charging circuitry and battery conditioning.
The complexity and cost of this design mean that it is not often used, so most battery-backed NVRAM has a much lower MTTF than the Li-cell backed RAM. For example, the popular PrestoServe card has a predicted MTTF of 15k hours [Neary91]; with 1MB of vulnerable data, this corresponds to an MDLR of 67 bytes/hour. We will show that this means that single-copy NVRAM applications are already accepting significantly higher risk of data loss than results from the temporary lack of parity protection in AFRAID.

3.5. Power failure

One additional support component that is particularly important is external power. Up to this point, our discussion has assumed that external power failures simply don't happen. This matters because a power failure that happens while a RAID 5 is writing can lead to data loss unless a separate, non-volatile intentions log is kept. [Gibson93] reports an MTTF of 4300 hours for mains power (i.e. a power failure about every 6 months). This is probably reasonable for parts of North America and Europe, but would be optimistic in some other parts of the world. In our traces, we saw outstanding writes up to 59% of the time, with a mean of 20%. Even using a more conservative value of a 10% write duty cycle on a 5-disk RAID 5 gives an MTTDL of only 43k hours due to external power failures. The effect on MDLR is roughly to double it (0.7 bytes/hour), but the change in MTTDL represents losing about 98% of the availability that the array offers.

It might be thought that providing an uninterruptible power supply, or UPS, would be overkill for a small array, but it may be the single largest contributor to preventing data loss. Using a high-grade UPS with an MTTF of 200k hours [Best95] and a 10% write duty cycle returns the MTTDL for the array's external power components to 2M hours. The large variability in power and UPS reliability can obscure the other support contributions to MTTDL, so we have chosen not to include external power failure in the calculations in this paper.

3.6. How much availability is enough?

When RAIDs were first being discussed as a replacement for large disks, the MTTF for small disks was 20-30,000 hours, and the target was to match the reliability of a single, large disk with an MTTF of 30-100,000 hours [Patterson88]. Things have improved since then: modern small-form-factor disks typically have a published MTTF of 0.5-1.0M hours. Given that the expected useful lifetime of a disk or disk array is probably no more than 3 years, or about 26k hours, this is equivalent to a lifetime expected failure likelihood of 3-5%. If it held 2GB, its mean data loss rate would be 2-4KB/hour. This means that the best of the traditional 5-disk RAIDs, limited to an MTTDL of about 1-2M hours by their support components, are achieving an MDLR for the whole array roughly equivalent to that of a single disk.

We contend that the combination of modern, highly reliable disks with traditional RAID technology provides more than enough protection against disk failures, and that further efforts to increase data availability are attacking a solved problem for the vast majority of customers. Instead, we suggest that it may be worth exchanging some of the "excess" disk availability for better performance-which is precisely what AFRAID does.
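The back-of-the-envelope figures quoted in sections 3.3 through 3.6 come from the same sort of arithmetic. The sketch below is ours; it assumes, as the text does, that a support or power failure puts the array's full 8GB of user data at risk and that writes are outstanding 10% of the time.

    # Back-of-the-envelope checks (ours) for the figures in sections 3.3-3.6.
    import math

    LIFETIME = 3 * 365.25 * 24                   # ~26k hours of useful life

    # Section 3.6: a single modern 2GB disk with a 0.5-1M hour MTTF
    for mttf in (0.5e6, 1.0e6):
        p_fail = 1.0 - math.exp(-LIFETIME / mttf)          # ~5% and ~2.6%
        mdlr   = 2.0e9 / mttf / 1e3                        # ~4KB/h and ~2KB/h
        print(f"disk MTTF {mttf:.0e} h: lifetime loss {p_fail:.1%}, MDLR {mdlr:.1f} KB/h")

    # Section 3.3: support components assumed to lose all 8GB of user data
    for mttdl_support in (2.0e6, 1.5e5):
        print(f"support MTTDL {mttdl_support:.0e} h -> MDLR {8.0e9 / mttdl_support / 1e3:.0f} KB/h")

    # Section 3.5: external power with a 10% write duty cycle
    print(f"mains power (4300 h MTTF): MTTDL ~ {4300 / 0.10:.0f} h")      # ~43k hours
    print(f"with a 200k-hour UPS:      MTTDL ~ {200e3 / 0.10:.0f} h")     # ~2M hours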
4. Evaluation

To provide a quantitative evaluation of the AFRAID concept, we used a detailed event-driven simulator to compare the performance and availability of an AFRAID array with a non-AFRAID system under a variety of workloads. We report here on three aspects of this evaluation:

* the relative performance of AFRAID, RAID 5, and RAID 0 (an unprotected array);
* quantitative availability measures;
* the relationship between performance and availability under AFRAID.

We begin with a description of our experimental setup.

4.1. Experimental methodology

In order to evaluate whether real-life workloads are bursty enough for an AFRAID array to rebuild parity quickly, it was necessary to look at some real-life workloads. So we did. Here are the ones we used:

* hplajw - a single user HP-UX [Clegg86] system used mainly for email and document editing.
* snake - an HP-UX file server for a cluster of workstations at UC Berkeley.
* cello - an HP-UX timesharing system for about 20 people doing text editing and program development. We used two subsets of the full cello trace: cello-usr is the set of three disks holding the root file system, /usr, and /users; cello-news is a single disk holding the Usenet news database: it received half of all the disk I/Os in the system.
* netware - an intensive database-loading benchmark measured on a Novell NetWare server.
* ATT - a production telephone-company database system. On the real system, the entire dataset was mirrored; for our tests, we just used one copy of the data.
* IBM AS400 - four production AS400 systems. These traces were supplied to us by Bruce McNutt of IBM San Jose; we called them AS400-1 through AS400-4.

The workloads for the first three of these systems are described in great detail in [Ruemmler93]. We used one-day subsets of them for this work.

To evaluate our claims we constructed a detailed event-driven performance simulation of AFRAID using the Pantheon simulator, which includes the calibrated disk models discussed in [Ruemmler94]. To simplify the discussions and save space, we just consider spin-synchronized arrays here. To add AFRAID to Pantheon, we started with a detailed RAID 5 model and adapted it to support AFRAID. The changes were small: they consisted of adding the marking memory and updating it on writes, and not doing the read-modify-write cycle for the parity in AFRAID mode. We added a background idle-task to do the parity rebuilds, triggered by an idle-detection network [Golding95] or explicit foreground policies. By default, we used a timer-based idleness detector with a 100ms delay: that is, AFRAID started processing parity updates once the array had been completely idle for 100ms; the output from the idle-period predictor was ignored.

To make sure that we were seeing the effects of the AFRAID policies themselves rather than just the disk array's cache policies [Ruemmler94], we chose a small (256KB) write staging area with a write-through policy together with a small (256KB) read cache with no array-level readahead. Since our workloads came from systems with much larger file buffer caches, read hits in the array's cache were rare. We limited the number of concurrently active client requests inside the array to the number of physical disks it had; the host device driver used the C-LOOK policy [Worthington94a], and the back-end device drivers inside the array used FCFS. We modelled HP C3325 2GB 3.5" 5400 RPM disks in the array [HPC3324]. Multiple writes to the same stripe were allowed to proceed in parallel, but would block if a parity-rebuild on that stripe was in progress. Requests were never preempted: once started, they ran to completion.
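As an illustration of the default trigger, the following sketch (ours, not Pantheon code) captures the 100ms timer-based idleness test: parity updates start only after the array has been completely idle for 100ms, and stop as soon as foreground work arrives.

    # Sketch (ours) of a timer-based idleness trigger like the 100ms default
    # used in the simulations.  The simulator is event-driven; wall-clock time
    # is used here only to keep the example self-contained.
    import time

    IDLE_THRESHOLD = 0.100        # seconds of complete idleness before rebuilding

    class IdleTrigger:
        def __init__(self):
            self.outstanding = 0
            self.last_activity = time.monotonic()

        def request_started(self):
            self.outstanding += 1
            self.last_activity = time.monotonic()

        def request_completed(self):
            self.outstanding -= 1
            self.last_activity = time.monotonic()

        def array_is_idle(self):
            return (self.outstanding == 0 and
                    time.monotonic() - self.last_activity >= IDLE_THRESHOLD)

A background rebuild task like the rebuild_pending_parity() sketch in section 1.1 would simply poll array_is_idle() between stripes, which is what allows parity updates to yield to foreground I/Os.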
The I/O times we report in this paper start when a request is given to the device driver, and stop when the request is completed by the array. They include both the time spent in the array itself and any time spent queued in the device driver. Given that we are using an open-queueing, trace-driven workload, this provides the fairest assessment of the performance that would be seen by a user or file system.

We took no special action for synchronous writes: ones for which the file system waits until the data being written has been put onto non-volatile media. Such writes are designed to ensure resilience against power-failure, not against disk failure; for example, they are used to disable immediate-reporting in disks that allow this [Ruemmler93, Ruemmler94]. Even if we had chosen to force a parity-update on a stripe updated by a synchronous write, the redundancy would go away again on the next update to any block in the stripe-not just the one that had been written to synchronously-because parity protection is at the stripe level, not the block level.

Because almost all of the code was the same between the various array models, direct performance comparisons between them are possible. Indeed, to make sure that the exact same disk and cache algorithms were executed in all cases, we modelled RAID 0 as an AFRAID that simply never did parity updates. About the only things that we did not model were a few performance improvements for AFRAID, of which the most important were probably aggregation of adjacent stripes needing parity rebuilds and piggybacking parity updates on disk accesses to nearby blocks.

In addition to the baseline AFRAID design, which updated parity only in idle periods, we implemented a policy which directly traded off performance for improved availability. This MTTDL_x policy is designed to keep the disk-based MTTDL above a particular target value (x). To do this, it continuously calculates the MTTDL that has been achieved so far, and reverts to RAID 5 mode if the goal is not being met. (It also starts the parity update for any unprotected stripes at this time.) The policy attempts to limit MDLR by automatically starting a parity update when more than 20 stripes are unprotected, even if the array is not idle; we had found earlier that this was fairly effective and caused little performance degradation.

4.2. Performance evaluation of AFRAID

Figure 2 and Table 2 present the results of exploring the relative performance of AFRAID, RAID 5 and RAID 0 across a range of workloads and parity-update policies. The figure and table show that, as predicted, pure AFRAID performance is very close to that of RAID 0, with a smooth degradation in performance towards that of RAID 5 as AFRAID is configured to increase data availability. The performance of the baseline AFRAID was a geometric mean of 4.1 times that of RAID 5 across our test workloads. By comparison, RAID 0 performance was 4.2 times that of RAID 5. Thus, AFRAID is living up to the first part of its promise: performance comparable to non-protected arrays.

4.3. Availability measures for AFRAID

Our next experiments determined the availability delivered by the different parity update policies under real workloads. The results are shown in Table 3 and Table 4. The AFRAID contribution to MDLR from unprotected data is extremely low: with the exception of the heavy load from the ATT trace, MDLR_unprotected contributes less than one byte per hour to the overall MDLR.
This is tiny by comparison to the overall MDLR, which is dominated by support-component effects. Consequently, AFRAID and RAID 5 have essentially identical MDLRs. The MDLR_unprotected drops to less than 0.1 bytes/hour if any of the MTTDL_x policies are used.

Because the MDLR values for AFRAID are so close to those for RAID 5, the more interesting comparison is between the MTTDL values. The first thing to note is that even the baseline AFRAID design is uniformly better than an unprotected disk array. It delivers a geometric mean MTTDL 4.3 times better than RAID 0, and is only a factor of 1.8 worse than pure RAID 5. The MTTDL_x policy can bring the overall AFRAID MTTDL as close to RAID 5 as desired. Even the simple implementation of this policy that we used proved highly effective: the disk-related MTTDL was never more than 5% below its target, and usually far exceeded it. As with MDLR, the dominant factor in overall MTTDL comes from the support components, which limit overall MTTDL to 2 million hours for all but the baseline AFRAID with the busiest workloads. Thus, AFRAID is living up to the second part of its promise: availability comparable to RAID 5.

4.4. How changing availability affects performance

Figure 3 indicates just how little of the availability of a RAID 5 is relinquished by AFRAID in order to obtain better performance. The graph indicates relative performance and availability (MTTDL) by comparison to RAID 5 (the top left data point); it uses the geometric mean of the results obtained from all of our workloads. As the target MTTDL_x value is reduced (points further to the right), performance increases rapidly, while availability drops off much more slowly. For example, AFRAID offers 42% better performance for only 10% less availability, and 97% better for 23% less. By the time pure AFRAID is reached at the bottom right of the graph, performance is 4.1 times better than RAID 5, at a cost of less than half its availability. Thus, a great deal of performance improvement can be had for a small reduction in data availability.

Figure 4 shows how performance varies with the parity-update policy for each of the traces that we studied. This figure highlights what AFRAID is all about: providing a choice between more performance or more availability. The tradeoff between performance and availability is directly related to the characteristics of the workload. For instance, the highly bursty workloads such as snake, hplajw, and cello-usr show relatively little change in mean I/O time as availability is increased by choice of more conservative MTTDL_x policies. This is because the workloads have enough idle time to update unredundant stripes and therefore the amount of unprotected time usually stays low; in turn, this means that there is little need to revert to RAID 5 mode. In workloads with fewer idle periods and more write traffic, such as AS400-1 and ATT, there is a smooth decline in mean I/O time as the MTTDL target is relaxed across the entire range between RAID 5 and pure AFRAID.

This adaptability is one of the key features of AFRAID. Once a desired level of availability has been specified, an AFRAID array will translate any unneeded redundancy into performance. A typical bursty workload will show performance close to that of an unprotected RAID 0 disk array, while even the most highly utilized workload will deliver performance no worse than a RAID 5. The net result is that AFRAID lives up to the last part of its promise: it offers a smooth trade-off between performance and protection that a regular RAID cannot.
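To make the MTTDL_x mechanism of section 4.1 concrete, here is a much-simplified sketch (ours) of its two triggers: fall back to RAID 5 writes whenever the disk-related MTTDL achieved so far drops below the target, and force a parity update whenever more than 20 stripes are unprotected. The helper names are invented; estimate_mttdl_so_far stands in for the controller's running evaluation of equations (2a)-(2c) over the elapsed trace.

    # Much-simplified sketch (ours) of the MTTDL_x policy from section 4.1.
    # estimate_mttdl_so_far() and start_parity_rebuild() are hypothetical hooks.

    MAX_UNPROTECTED_STRIPES = 20      # bound on MDLR, as used in the experiments

    def mttdl_x_policy(target_mttdl, estimate_mttdl_so_far,
                       unredundant, start_parity_rebuild):
        """Return 'afraid' or 'raid5' as the mode for the next write."""
        if len(unredundant) > MAX_UNPROTECTED_STRIPES:
            start_parity_rebuild(unredundant)     # rebuild even if the array is not idle
        if estimate_mttdl_so_far() < target_mttdl:
            start_parity_rebuild(unredundant)     # clear the backlog of unprotected stripes...
            return 'raid5'                        # ...and keep parity current from now on
        return 'afraid'                           # otherwise defer parity to idle time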
5. Refinements of the AFRAID ideas

This section suggests some further applications and refinements of the AFRAID idea.

An array could begin in a "conservative" RAID 5 mode, and automatically switch into AFRAID behavior once it had determined that the I/O patterns included sufficient idle time to keep the redundancy deficit below some bound. This would be a more conservative scheme than the one we used in the MTTDL_x policy, which took the opposite approach, switching into RAID 5 when it felt that its target could not be achieved.

Stripe-aligned subsets of an AFRAID's storage space could be permanently flagged with different redundancy properties, from full RAID 5 redundancy-preservation to zero-redundancy RAID 0-style storage. Data could then be mapped to portions of the array that provided different redundancy guarantees, allowing fine-tuning of the array's availability properties according to user-specified goals [Wilkes91]. The host could then actively request that a set of stripes be made redundant, analogous to the traditional database commit operation.

The units of parity-reconstruction can have a smaller "height" than the stripes used for data layout if more marker memory can be provided. For example, if M memory bits can be afforded per stripe, then parity computations will still be efficient for small writes that update only 1/M of a stripe unit.

A RAID 6 array keeps two parity blocks for each stripe, and thus pays an even higher penalty for doing small updates than does RAID 5. The AFRAID technique could be combined with the RAID 6 parity scheme to delay either or both parity-block updates: if only one was deferred, partial redundancy protection would be available immediately, and full redundancy once the parity-rebuild happened for the other parity block.

6. Conclusions

The main AFRAID idea is the notion of allowing deliberate, controlled, temporary non-redundancy in a disk array in order to get significantly better performance. Because real-life workloads are very often bursty, these performance gains can be achieved with a minimally increased chance of data loss-and indeed, there may be less exposure to data loss than with existing single-point-of-failure solutions such as single-copy volatile or NVRAM caches. AFRAID also offers a choice that has not been possible before: that of selecting just how much availability is wanted in a particular situation.

The AFRAID design appears to be highly appropriate for workloads that have even moderate amounts of idle time between bursts of activity. Like other RAID designs, there are some workloads and applications for which it is not particularly well suited. For example, we would not advocate AFRAID for the cases where data must be protected at all costs, but it does offer a very good solution for the majority of people who want something between completely unprotected data and a fully-redundant, high-end disk array with its performance, purchase, and configuration costs. In particular, we believe AFRAID is an appropriate design for low-load environments where latency is important, such as systems with a small number of interactive users. We hypothesize that these applications are also the ones least likely to benefit from the full availability improvements of RAID 5.

What did we learn as a result of this study?
In addition to the performance and availability results we have described already, a few lessons stand out:

* Throughout this paper we have been attempting to reinforce a larger point that deserves more attention in system design: there is little value in bolstering the fault-tolerance of a single component to heroic levels if the rest of the system is less reliable. We call this the end-to-end availability argument, by analogy with [Saltzer81]. Making simplifying assumptions about end-to-end availability (for example, that complete data redundancy in the disk layer of an array is sacrosanct, or that NVRAM storage never fails) prevents taking advantage of performance opportunities like AFRAID.
* Real-life workloads really are bursty (we've been saying this for a while, but it bears repeating).
* Although the amount of unprotected data in the array is a function of the workload, there are several algorithms for bounding it, at the cost of some of the performance gains from pure AFRAID. Unbounded AFRAID and pure RAID 5 are simply different points on a continuum of allowed parity lag-and our design allows a user to choose where on this scale they would like their array to be.
* Thinking of different availability solutions in terms of data-loss-rate proved a useful way to unify a number of effects.

Finally, just because an idea has a strange acronym doesn't mean you should be worried by it:

"Always do what you are AFRAID to do." - Ralph Waldo Emerson

Acknowledgments

Richard Golding, Chris Ruemmler, Carl Staelin, Tim Sullivan, and Bruce Worthington built much of the experimental infrastructure on which this work was based. Denny Georg funded our pursuit of an oddball idea with commendable(?) alacrity. Terri Watson provided the transportation to Orcas Island's Mount Constitution in which the idea took shape, and gave us invaluable feedback on drafts of this paper. Bruce McNutt provided us with the IBM AS400 traces. Our USENIX paper shepherd, Miche Baker-Harvey, also provided many useful comments.

References

[Baker92b] Mary Baker, Satoshi Asami, Etienne Deprit, John Ousterhout, and Margo Seltzer. Non-volatile memory for fast, reliable file systems. Proceedings of 5th International Conference on Architectural Support for Programming Languages and Operating Systems (Boston, MA, 12-15 October 1992). Published as Computer Architecture News, 20(special issue):10-22, October 1992.

[Best95] Best Power Technology Incorporated, Necedah, WI. Product catalog, September 1995.

[Blaum94] Mario Blaum, Jim Brady, Jehoshua Bruck, and Jai Menon. EVENODD: an optimal scheme for tolerating double disk failures in RAID architectures. Proceedings of 21st International Symposium on Computer Architecture (Chicago, IL). Published as Computer Architecture News, 22(2):245-54, 18-21 April 1994.

[Clegg86] Frederick W. Clegg, Gary Shiu-Fan Ho, Steven R. Kusmer, and John R. Sontag. The HP-UX operating system on HP Precision Architecture computers. Hewlett-Packard Journal, 37(12):4-22, December 1986.

[Cormen93] Thomas H. Cormen and David Kotz. Integrating theory and practice in parallel file systems. Proceedings of the 1993 DAGS/PC Symposium (Hanover, NH), pages 64-74, Dartmouth Institute for Advanced Graduate Studies, June 1993.

[Dallas94] Dallas Semiconductor, Dallas, TX. Using nonvolatile static RAMs, Application note 63, September 1994.

[Dallas95] Dallas Semiconductor, Dallas, TX. DS1250Y/AB 4096K nonvolatile SRAM: data sheet, October 1995.
[Gibson93] Garth A. Gibson and David A. Patterson. Designing disk arrays for high data reliability. Journal of Parallel and Distributed Computing, 17(1-2):4-27. Academic Press, Incorporated, January/February 1993.

[Golding95] Richard Golding, Peter Bosch, Carl Staelin, Tim Sullivan, and John Wilkes. Idleness is not sloth. Proceedings of Winter USENIX 1995 Technical Conference (New Orleans, LA), pages 201-12. Usenix Association, Berkeley, CA, 16-20 January 1995.

[Gray90] Jim Gray. A census of Tandem system availability between 1985 and 1990. Technical Report 90.1. Tandem Computers Incorporated, September 1990.

[Gray90c] Jim Gray, Bob Horst, and Mark Walker. Parity striping of disc arrays: low-cost reliable storage with acceptable throughput. Proceedings of 16th International Conference on Very Large Data Bases (Brisbane, Australia), pages 148-59, 13-16 August 1990.

[Gray91a] Jim Gray and Daniel P. Siewiorek. High-availability computer systems. IEEE Computer, 24(9):39-48, September 1991.

[Hartman95] J. H. Hartman and J. K. Ousterhout. The Zebra striped network file system. ACM Transactions on Computer Systems, 13(3):274-310, August 1995.

[Holland92] Mark Holland and Garth A. Gibson. Parity declustering for continuous operation in redundant disk arrays. Proceedings of 5th International Conference on Architectural Support for Programming Languages and Operating Systems (Boston, MA). Published as Computer Architecture News, 20(special issue):23-35, 12-15 October 1992.

[HPC3324] Hewlett-Packard Company, Boise, Idaho. HP C3324/C3724, HP C3325/C3725 3.5-inch SCSI-2 disk drives: technical reference manual, Part number 5963-0277, edition 3, February 1995.

[IBMpfa95] Predictive Failure Analysis: advanced condition monitoring. International Business Machines Corporation, World-wide web page http://www.almaden.ibm.com/storage/oem/tech/predfail.htm, 21 August 1995.

[Lin90c] T.-T. Y. Lin and D. P. Siewiorek. Error log analysis: statistical modeling and heuristic trend analysis. IEEE Transactions on Reliability, 39(4):419-32, October 1990.

[McKusick84] Marshall K. McKusick, William N. Joy, Samuel J. Leffler, and Robert S. Fabry. A fast file system for UNIX. ACM Transactions on Computer Systems, 2(3):181-97, August 1984.

[McKusick90] Marshall Kirk McKusick, Michael J. Karels, and Keith Bostic. A pageable memory based filesystem. UKUUG Summer 1990 (London), pages 109-15. United Kingdom UNIX systems User Group, Buntingford, Herts, 9-13 July 1990.

[Menon89] Jai Menon and Jim Kasson. Methods for improved update performance of disk arrays. Technical report RJ 6928 (66034). IBM Almaden Research Center, San Jose, CA, 13 July 1989. Declassified 21 November 1990.

[Menon93] Jai Menon and Jim Courtney. The architecture of a fault-tolerant cached RAID controller. Proceedings of 20th International Symposium on Computer Architecture (San Diego, CA), pages 76-86, 16-19 May 1993.

[Moran90] J. Moran, R. Sandberg, D. Coleman, J. Kepecs, and B. Lyon. Breaking through the NFS performance barrier. Proceedings of EUUG Spring 1990 (Munich, Germany), pages 199-206, 23-27 April 1990.

[Muntz90] Richard R. Muntz and John C. S. Lui. Performance analysis of disk arrays under failure. Proceedings of 16th International Conference on Very Large Data Bases (Brisbane, Australia), pages 162-73, 13-16 August 1990.

[Neary91] Annette M. Neary. MM-1350 reliability prediction. Micro Memory Incorporated, Chatsworth, CA, 15 March 1991. Personal communication from Dennis Doe, 23 October 1995.
[Ohta90] Masataka Ohta and Hiroshi Tezuka. A fast /tmp file system by delay mount option. 1990 Summer USENIX Technical Conference (Anaheim, California, June 1990), pages 145-50. USENIX, June 1990.

[Ousterhout85a] John K. Ousterhout, Hervé Da Costa, David Harrison, John A. Kunze, Mike Kupfer, and James G. Thompson. A trace-driven analysis of the UNIX 4.2 BSD file system. Proceedings of 10th ACM Symposium on Operating Systems Principles (Orcas Island, Washington). Published as Operating Systems Review, 19(5):15-24, December 1985.

[Patterson88] David A. Patterson, Garth Gibson, and Randy H. Katz. A case for redundant arrays of inexpensive disks (RAID). Proceedings of SIGMOD (Chicago, Illinois), 1-3 June 1988.

[Reddy91] A. L. Narasimha Reddy and P. Banerjee. Gracefully degradable disk arrays. Proceedings of FTCS-21, pages 401-8. Institute of Electrical and Electronics Engineers, June 1991.

[Ritchie84a] D. M. Ritchie. The evolution of the UNIX time-sharing system. AT&T Bell Laboratories Technical Journal, 63(8, part 2):1577-93, October 1984.

[Rosenblum92] Mendel Rosenblum and John K. Ousterhout. The design and implementation of a log-structured file system. ACM Transactions on Computer Systems, 10(1):26-52, February 1992.

[Ruemmler93] Chris Ruemmler and John Wilkes. UNIX disk access patterns. Proceedings of Winter 1993 USENIX (San Diego, CA), pages 405-20, 25-29 January 1993.

[Ruemmler94] Chris Ruemmler and John Wilkes. An introduction to disk drive modeling. IEEE Computer, 27(3):17-28, March 1994.

[Saltzer81] J. H. Saltzer. End-to-end arguments in system design. Proceedings of 2nd International Symposium on Operating Systems (Paris), April 1981.

[Schulze89] Martin Schulze, Garth Gibson, Randy Katz, and David Patterson. How reliable is a RAID? Spring COMPCON'89 (San Francisco), pages 118-23. IEEE, March 1989.

[Stodolsky93] Daniel Stodolsky, Garth Gibson, and Mark Holland. Parity logging: overcoming the small write problem in redundant disk arrays. Proceedings of 20th International Symposium on Computer Architecture (San Diego, CA), pages 64-75, 16-19 May 1993.

[Wilkes91] John Wilkes and Raymie Stata. Specifying data availability in multi-device file systems. Position paper for 4th ACM-SIGOPS European Workshop (Bologna, 3-5 September 1990). Published as Operating Systems Review, 25(1):56-9, January 1991.

[Wilkes95] John Wilkes, Richard Golding, Carl Staelin, and Tim Sullivan. The HP AutoRAID hierarchical storage system technology. Proceedings of 15th ACM Symposium on Operating Systems Principles (Copper Mountain Resort, CO). Published as Operating Systems Review, 29(4), 3-6 December 1995.

[Worthington94a] Bruce L. Worthington, Gregory R. Ganger, and Yale N. Patt. Scheduling algorithms for modern disk drives. Proceedings of ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems (Nashville, TN), 16-20 May 1994.

Author information

John Wilkes graduated with degrees in Physics (BA 1978, MA 1980), and a Diploma (1979) and PhD (1984) in Computer Science from the University of Cambridge. He has worked since 1982 as a researcher and project manager at Hewlett-Packard Laboratories. His main research interest is in fast, highly available, distributed storage systems, although he also dabbles in network architectures (the Hamlyn sender-based message model), OS design (most recently in the Brevix project), harassing PhD students at random universities whenever he gets the chance, and learning about early Renaissance art and architecture.

Stefan Savage graduated with a degree in Applied History (BS 1991) from Carnegie Mellon University.
He worked there for three years developing real-time operating systems and then left for Seattle, where he learned to like coffee and became a graduate student at the University of Washington. His current research interests are in safe extensible operating systems (such as SPIN, which has been occupying most of his time for the last two years) although he also finds time to be harassed by visiting researchers from random industrial research labs. He prefers art with an explanation. The authors can be reached by electronic mail at savage@cs.washington.edu and wilkes@hpl.hp.com.