OSDI '04 Paper

Recovering Device Drivers

Michael M. Swift, Muthukaruppan Annamalai, Brian N. Bershad, and Henry M. Levy

Department of Computer Science and Engineering
University of Washington
Seattle, WA 98195 USA
`{mikesw,muthu,bershad,levy}@cs.washington.edu`

Abstract

This paper presents a new mechanism that enables applications to run correctly when device drivers fail. Because device drivers are the principal failing component in most systems, reducing driver-induced failures greatly improves overall reliability. Earlier work has shown that an operating system can survive driver failures [33], but the applications that depend on them cannot. Thus, while operating system reliability was greatly improved, application reliability generally was not.

To remedy this situation, we introduce a new operating system mechanism called a shadow driver. A shadow driver monitors device drivers and transparently recovers from driver failures. Moreover, it assumes the role of the failed driver during recovery. In this way, applications using the failed driver, as well as the kernel itself, continue to function as expected.

We implemented shadow drivers for the Linux operating system and tested them on over a dozen device drivers. Our results show that applications and the OS can indeed survive the failure of a variety of device drivers. Moreover, shadow drivers impose minimal performance overhead. Lastly, they can be introduced with only modest changes to the OS kernel and with no changes at all to existing device drivers.

1 Introduction

Improving reliability is one of the greatest challenges for commodity operating systems. System failures are commonplace and costly across all domains: in the home, in the server room, and in embedded systems, where the existence of the OS itself is invisible. At the low end, failures lead to user frustration and lost sales. At the high end, an hour of downtime from a system failure can result in losses in the millions [16].

Most of these system failures are caused by the operating system's device drivers. Failed drivers cause 85% of Windows XP crashes [30], while Linux drivers have seven times the bug rate of other kernel code [14]. A failed driver typically causes the application, the OS kernel, or both to crash or stop functioning as expected. Hence, preventing driver-induced failures improves overall system reliability.

Earlier failure-isolation systems within the kernel were designed to prevent driver failures from corrupting the kernel itself [33]. In these systems, the kernel unloads a failed driver and then restarts it from a safe initial state. While isolation techniques can reduce the frequency of system crashes, applications using the failed driver can still crash. These failures occur because the driver loses application state when it restarts, causing applications to receive erroneous results. Most applications are unprepared to cope with this. Rather, they reflect the conventional failure model: drivers and the operating system either fail together or not at all.

This paper presents a new mechanism, called a shadow driver, that improves overall system reliability by concealing a driver's failure from its clients while recovering from the failure. During normal operation, the shadow tracks the state of the real driver by monitoring all communication between the kernel and the driver. When a failure occurs, the shadow inserts itself temporarily in place of the failed driver, servicing requests on its behalf. While shielding the kernel and applications from the failure, the shadow driver restores the failed driver to a state where it can resume processing requests.

Our design for shadow drivers reflects four principles:

1. Device driver failures should be concealed from the driver's clients. If the operating system and applications using a driver cannot detect that it has failed, they are unlikely to fail themselves.
2. Recovery logic should be centralized in a single subsystem. We want to consolidate recovery knowledge in a small number of components to simplify the implementation.
3. Driver recovery logic should be generic. The increased reliability offered by driver recovery should not be offset by potentially destabilizing changes to the tens of thousands of existing drivers. Therefore, the architecture must enable a single shadow driver to handle recovery for a large number of device drivers.
4. Recovery services should have low overhead when not needed. The recovery system should impose relatively little overhead for the common case (that is, when drivers are operating normally).

Overall, these design principles are intended to minimize the cost required to make and use shadow drivers while maximizing their value in existing commodity operating systems.

We implemented the shadow driver architecture for sound, network, and IDE storage drivers on a version of the Linux operating system. Our results show that shadow drivers: (1) mask device driver failures from applications, allowing applications to run normally during and after a driver failure, (2) impose minimal performance overhead, (3) require no changes to existing applications and device drivers, and (4) integrate easily into an existing operating system.

This paper describes the design, implementation and performance of shadow drivers. The following section reviews general approaches to protecting applications from system faults. Section 3 describes device drivers and the shadow driver design and components. Section 4 presents the structure of shadow drivers and the mechanisms required to implement them in Linux. Section 5 presents experiments that evaluate the performance, effectiveness, and complexity of shadow drivers. The final section summarizes our work.

2 Related Work

This section describes previous research on recovery strategies and mechanisms. The importance of recovery has long been known in the database community, where transactions [19] prevent data corruption and allow applications to manage failure. More recently, the need for failure recovery has moved from specialized applications and systems to the more general arena of commodity systems [28].

A general approach to recovery is to run application replicas on two machines, a primary and a backup. All inputs to the primary are mirrored to the backup. After a failure of the primary, the backup machine takes over to provide service. The replication can be performed by the hardware [21], at the hardware-software interface [8], at the system call interface [2,5,7], or at a message passing or application interface [4]. Shadow drivers similarly replicate all communication between the kernel and device driver (the primary), sending copies to the shadow driver (the backup). If the driver fails, the shadow takes over temporarily until the driver recovers. However, shadows differ from typical replication schemes in several ways. First, because our goal is to tolerate only driver failures, not hardware failures, both the shadow and the "real" driver run on the same machine. Second, and more importantly, the shadow is not a replica of the device driver: it implements only the services needed to manage recovery of the failed driver and to shield applications from the recovery. For this reason, the shadow is typically much simpler than the driver it shadows.

Another common recovery approach is to restart applications after a failure. Many systems periodically checkpoint application state [26,27,29], while others combine checkpoints with logs [2,5,31]. These systems transparently restart failed applications from their last checkpoint (possibly on another machine) and replay the log if one is present. Shadow drivers take a similar approach by replaying a log of requests made to drivers. Recent work has shown that this approach is limited when recovering from application faults: applications often become corrupted before they fail; hence, their logs or checkpoints may also be corrupted [10,25]. Shadow drivers reduce this potential by logging only a small subset of requests. Furthermore, application bugs tend to be deterministic and recur after the application is restarted [11]. Driver faults, in contrast, often cause transient failures because of the complexities of the kernel execution environment [34].

Another approach is simply to reboot the failed component, for example, unloading and reloading failed kernel extensions, such as device drivers [33]. Rebooting has been proposed as a general strategy for building high-availability software [9]. However, rebooting forces applications to handle the failure, for example, reinitializing state that has been lost by the rebooted component. Few existing applications do this [9], and those that do not will share the fate of the failed driver. Shadow drivers transparently restore driver state lost in the reboot, invisibly to applications.

Shadow drivers rely on device driver isolation to prevent failed drivers from corrupting the OS or applications. Isolation can be provided in various ways. Vino [32] encapsulates extensions using software fault isolation [35] and uses transactions to repair kernel state after a failure. Nooks [33] and Palladium [13] isolate extensions in protection domains enforced by virtual memory hardware. Microkernels [23,38,39] and their derivatives [15,17,20] force isolation by executing extensions in user mode.

Rather than concealing driver failures, these systems all reflect a revealing strategy, one in which the application or user is made aware of the failure. The OS typically returns an error code, telling the application that a system call failed, but little else (e.g., it does not indicate which component failed or how the failure occurred). The burden of recovery then rests on the application, which must decide what steps to take to continue executing. As previously mentioned, most applications cannot handle the failure of device drivers [37], since driver faults typically crash the system. When a driver failure occurs, these systems expose the failure to the application, which may then fail. By impersonating device drivers during recovery, shadow drivers conceal errors caused by driver failures and thereby protect applications.

Several systems have narrowed the scope of recovery to focus on a specific subsystem or component. For example, the Rio file cache [12] provides high performance by isolating a single system component, the file cache, from kernel failures. Phoenix [3] provides transparent recovery after the failure of a single problematic component type, database connections in multi-tier applications. Similarly, our shadow driver research focuses on recovery for a single OS component type, the device driver, which is the leading cause of OS failure. By abandoning general-purpose recovery, we transparently resolve a major cause of application and OS failure while maintaining a low runtime overhead.

3 Device Drivers and Shadow Driver Design

A device driver is a kernel-mode software component that provides an interface between the OS and a hardware device¹. The driver converts requests from the kernel into requests to the hardware. Drivers rely on two interfaces: the interface that drivers export to the kernel that provides access to the device, and the kernel interface that drivers import from the operating system. For example, Figure 1 shows the kernel calling into a sound driver to play a tone; in response, the sound driver converts the request into a sequence of I/O instructions that direct the sound card to emit sound.

Figure 1: A sample device driver. The device driver exports the services defined by the device's class interface and imports services from the kernel's interface.

In practice, most device drivers are members of a class, which is defined by its interface. For example, all network drivers obey the same kernel-driver interface, and all sound-card drivers obey the same kernel-driver interface. This class orientation simplifies the introduction of new drivers into the operating system, since no OS changes are required to accommodate them.

In addition to processing I/O requests, drivers also handle configuration requests. Applications may configure the device, for example, by setting the bandwidth of a network card or the volume for a sound card. Configuration requests may change both driver and device behavior for future I/O requests.

3.1 Driver Faults

Most drivers fail due to bugs that result from unexpected inputs or events [34]. For example, a driver may corrupt a data structure if an interrupt arrives during a sensitive portion of request processing. Device drivers may crash in response to (1) the stream of requests from the kernel, both configuration and I/O, (2) messages to and from the device, and (3) the kernel environment, which may raise or lower power states, swap pages of memory, and interrupt the driver at arbitrary times. A driver bug triggered solely by a sequence of configuration or I/O requests is called a deterministic failure. No generic recovery technique can transparently recover from this type of bug, because any attempt to complete an offending request may trigger the bug [11]. In contrast, transient failures are triggered by additional inputs from the device or the operating system and occur infrequently.

A driver failure that is detected and stopped by the system before any OS, device, or application state is affected is termed fail-stop. More insidious failures may corrupt the system or application and never be detected. The system's response to failure determines whether a failure is fail-stop. For example, a system that detects and prevents accidental writes to kernel data structures exhibits fail-stop behavior for such a bug, whereas one that allows corruption does not.

Appropriate OS techniques can ensure that drivers execute in a fail-stop fashion [32,33,36]. For example, in earlier work we described Nooks [33], a kernel reliability subsystem that executes each driver within its own in-kernel protection domain. Nooks detects faults through memory protection violations, excessive CPU usage, and certain bad parameters passed to the kernel. When Nooks detects a failure, it stops execution within the driver's protection domain and triggers a recovery process. We reported that Nooks was able to detect approximately 75% of failures in synthetic fault-injection tests [33].

Shadow drivers can recover only from failures that are both transient and fail-stop. Deterministic failures may recur when the driver recovers, again causing a failure. In contrast, transient failures are triggered by environmental factors that are unlikely to persist during recovery. In practice, many drivers experience transient failures, caused by the complexities of the kernel execution environment (e.g., asynchrony, interrupts, locking protocols, and virtual memory) [1], which are difficult to find and fix. Deterministic driver failures, in contrast, are more easily found and fixed in the testing phase of development because the failures are repeatable [18]. Recoverable failures must also be fail-stop, because shadow drivers conceal failures from the system and applications. Hence, shadow drivers require a reliability subsystem to detect and stop failures before they are visible to applications or the operating system. Although shadow drivers may use any mechanism that provides these services, our implementation uses Nooks.

3.2 Shadow Drivers

A shadow driver is a kernel agent that improves reliability for a single device driver. It compensates for and recovers from a driver that has failed. When a driver fails, its shadow restores the driver to a functioning state in which it can process I/O requests made before the failure. While the driver recovers, the shadow driver services its requests.

Shadow drivers execute in one of two modes: passive or active. In passive mode, used during normal (non-faulting) operation, the shadow driver monitors all communication between the kernel and the device driver it shadows. This monitoring is achieved via replicated procedure calls: a kernel call to a device driver function causes an automatic, identical call to a corresponding shadow driver function. Similarly, a driver call to a kernel function causes an automatic, identical call to a corresponding shadow driver function. These passive-mode calls are transparent to the device driver and the kernel. They are not intended to provide any service to either party and exist only to track the state of the driver as necessary for recovery.

In active mode, which occurs during recovery from a failure, the shadow driver performs two functions. First, it "impersonates" the failed driver, intercepting and responding to calls from the kernel. Therefore, the kernel and higher-level applications continue operating in as normal a fashion as possible. Second, the shadow driver impersonates the kernel to restart the failed driver, intercepting and responding to calls from the restarted driver to the kernel. In other words, in active mode the shadow driver looks like the kernel to the driver and like the driver to the kernel. Only the shadow driver is aware of the deception. This approach hides recovery details from the driver, which is unaware that it is being restarted by a shadow driver after a failure.

Once the driver has restarted, the active-mode shadow reintegrates the driver into the system. It re-establishes any application configuration state downloaded into the driver and then resumes pending requests.

A shadow driver is a "class driver," aware of the interface to the drivers it shadows but not of their implementations. A single shadow driver implementation can recover from a failure of any driver in the class. The class orientation has three key implications. First, an operating system can leverage a few implementations of shadow drivers to recover from failures in a large number of device drivers. Second, implementing a shadow driver does not require a detailed understanding of the internals of the drivers it shadows. Rather, it requires only an understanding of those drivers' interactions with the kernel. Finally, if a new driver is loaded into the kernel, no new shadow driver is required as long as a shadow for that class already exists. For example, if a new network interface card and driver are inserted into a PC, the existing network shadow driver can shadow the new driver without change. Similarly, drivers can be patched or updated without requiring changes to their shadows. Shadow updating is required only to respond to a change in the kernel-driver programming interface.

3.3 Taps

As we have seen, a shadow driver monitors communication between a functioning driver and the kernel and impersonates one component to the other during failure and recovery. These activities are made possible by a new mechanism, called a tap. Conceptually, a tap is a T-junction placed between the kernel and its drivers. It can be set to replicate calls during passive mode and redirect them during recovery.

A tap operates in passive or active mode, corresponding to the state of the shadow driver attached to it. During passive-mode operation, the tap: (1) invokes the original driver, then (2) invokes the shadow driver with the parameters and results of the call. This operation is shown in Figure 2.

Figure 2: A sample shadow driver operating in passive mode. Taps inserted between the kernel and sound driver ensure that all communication between the two is passively monitored by the shadow driver.

On failure, the tap switches to active mode, shown in Figure 3. In this mode, it: (1) terminates all communication between the driver and kernel, and (2) redirects all invocations to their corresponding interface in the shadow. In active mode, both the kernel and the recovering device driver interact only with the shadow driver. Following recovery, the tap returns to its passive-mode state.

Figure 3: A sample shadow driver operating in active mode. The taps redirect communication between the kernel and the failed driver directly to the shadow driver.
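The two tap modes described above can be modeled in a few lines of user-space Python. This is a minimal sketch for illustration, not the Linux implementation: the names `Tap`, `Mode`, `driver_fn`, `shadow_passive_fn`, and `shadow_active_fn` are hypothetical, and a real tap wraps kernel procedure calls, not Python callables.

```python
from enum import Enum


class Mode(Enum):
    PASSIVE = "passive"
    ACTIVE = "active"


class Tap:
    """Hypothetical model of a tap on one kernel-to-driver call."""

    def __init__(self, driver_fn, shadow_passive_fn, shadow_active_fn):
        self.driver_fn = driver_fn                  # the real driver entry point
        self.shadow_passive_fn = shadow_passive_fn  # observes (args, result)
        self.shadow_active_fn = shadow_active_fn    # stands in for the driver
        self.mode = Mode.PASSIVE

    def __call__(self, *args):
        if self.mode is Mode.PASSIVE:
            # (1) Invoke the original driver, then
            # (2) invoke the shadow with the parameters and the result.
            result = self.driver_fn(*args)
            self.shadow_passive_fn(args, result)
            return result
        # Active mode: the driver is cut off and only the shadow sees the call.
        return self.shadow_active_fn(*args)
```

In this model, switching `tap.mode` to `Mode.ACTIVE` is all that is needed to redirect callers; they keep invoking the same `tap` object and cannot tell which side answered.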

Taps depend on the ability to dynamically dispatch all communication between the driver and the OS. Consequently, all communication into and out of a driver being shadowed must be explicit, such as through a procedure call or a message. Most drivers operate this way, but some do not and cannot be shadowed. For example, kernel video drivers often communicate with user-mode applications through shared memory regions [22].

3.4 The Shadow Manager

Recovery is supervised by the shadow manager, which is a kernel agent that interfaces with and controls all shadow drivers. The shadow manager instantiates new shadow drivers and injects taps into the call interfaces between the device driver and kernel. It also receives notification from the fault-isolation subsystem that a driver has stopped due to a failure.

When a driver fails, the shadow manager transitions its taps and shadow driver to active mode. In this mode, requests for the driver's services are redirected to an appropriately prepared shadow driver. The shadow manager then initiates the shadow driver's recovery sequence to restore the driver. When recovery ends, the shadow manager returns the shadow driver and taps to passive-mode operation so the driver can resume service.
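The mode transitions supervised by the shadow manager can be sketched as follows. This is an illustrative model under stated assumptions: the classes `Tap`, `Shadow`, and `ShadowManager`, the `attach` and `driver_failed` methods, and the string mode values are all hypothetical, and `recover` is a placeholder for the class-specific recovery sequence.

```python
class Tap:
    def __init__(self):
        self.mode = "passive"


class Shadow:
    def __init__(self):
        self.mode = "passive"
        self.recoveries = 0

    def recover(self):
        # Placeholder for the class-specific recovery sequence.
        assert self.mode == "active"
        self.recoveries += 1


class ShadowManager:
    def __init__(self):
        self._attached = {}  # driver name -> (shadow, list of taps)

    def attach(self, driver_name, shadow, taps):
        self._attached[driver_name] = (shadow, taps)

    def driver_failed(self, driver_name):
        """Assumed hook called by the fault-isolation subsystem."""
        shadow, taps = self._attached[driver_name]
        self._set_mode(shadow, taps, "active")   # redirect requests to the shadow
        shadow.recover()                         # restore the failed driver
        self._set_mode(shadow, taps, "passive")  # the driver resumes service

    @staticmethod
    def _set_mode(shadow, taps, mode):
        for tap in taps:
            tap.mode = mode
        shadow.mode = mode
```

The sketch returns to passive mode only after `recover` returns normally; what to do when recovery itself fails is outside its scope.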

3.5 Summary

Our design simplifies the development and integration of shadow drivers into existing systems. Each shadow driver is a single module written with knowledge of the behavior (interface) of a class of device drivers, allowing it to conceal a driver failure and restart the driver after a fault. A shadow driver, normally passive, monitors communication between the kernel and the driver. It becomes an active proxy when a driver fails and then manages its recovery.

4 Shadow Driver Implementation

This section describes the implementation of shadow drivers in the Linux operating system [6]. We have implemented shadow drivers for three classes of device drivers: sound card drivers, network interface drivers, and IDE storage drivers.

4.1 General Infrastructure

All shadow drivers rely on a generic service infrastructure that provides three functions. An isolation service prevents driver errors from corrupting the kernel by stopping a driver on detecting a failure. A transparent redirection mechanism implements the taps required for transparent shadowing and recovery. Lastly, an object tracking service tracks kernel resources created or held by the driver so as to facilitate recovery.

Our shadow driver implementation uses Nooks to provide these functions. Through its fault isolation subsystem, Nooks [33] isolates drivers within separate kernel protection domains. The domains use memory protection to trap driver faults and ensure the integrity of kernel memory. Nooks interposes proxy procedures on all communication between the device driver and kernel. We insert our tap code into these Nooks proxies to replicate and redirect communication. Finally, Nooks tracks kernel objects used by drivers to perform garbage collection of kernel resources during recovery.

Our implementation adds a shadow manager to the Linux operating system. In addition to receiving failure notifications from Nooks, the shadow manager handles the initial installation of shadow drivers. In coordination with the kernel's module loader, which provides the driver's class, the shadow manager creates a new shadow driver instance for a driver. Because a single shadow driver services a class of device drivers, there may be several instances of a shadow driver executing if there is more than one driver of a class present. The new instance shares the same code with all other instances of that shadow driver class.

Figure 4 shows the driver recovery subsystem, which contains the Nooks fault isolation subsystem, the shadow manager, and a set of shadow drivers, each of which can monitor one or more device drivers.

Figure 4: The Linux operating system with several device drivers and the driver recovery subsystem. New code components include the taps, the shadow manager and a set of shadow drivers, all built on top of the Nooks driver fault isolation subsystem.

4.2 Passive-Mode Monitoring

In passive mode, a shadow driver records several types of information. First, it tracks requests made to the driver, enabling pending requests to execute correctly after recovery. For connection-oriented drivers, the shadow driver records the state of each active connection, such as offset or positioning information. For request-oriented drivers, the shadow driver maintains a log of pending commands and arguments. An entry remains in the log until the corresponding request has been handled.
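For a request-oriented driver, the log of pending commands might look like the sketch below. The class name `PendingRequestLog`, its method names, and the use of a request identifier as the key are assumptions made for illustration.

```python
class PendingRequestLog:
    """Hypothetical log of requests the driver has accepted but not finished."""

    def __init__(self):
        # Request id -> (command, args). A dict preserves insertion order,
        # so outstanding requests come back in submission order.
        self._pending = {}

    def submitted(self, request_id, command, args):
        self._pending[request_id] = (command, args)

    def completed(self, request_id):
        # The entry stays in the log only until the request has been handled.
        self._pending.pop(request_id, None)

    def outstanding(self):
        return list(self._pending.items())
```

After a failure, `outstanding()` yields exactly the requests that were in flight, which is the set a shadow would have to resubmit or cancel.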

The shadow driver also records configuration and driver parameters that the kernel passes into the driver. During recovery, the shadow uses this information to act in the driver's place, returning the same information that was passed in previously. This information also assists in reconfiguring the driver to its pre-failure state when it is restarted. For example, the shadow sound driver keeps a log of ioctl calls (command numbers and arguments) that configure the driver. This log makes it possible to: (1) act as the device driver by remembering the sound formats it supports, and (2) recover the driver by resetting properties, such as the volume and sound format in use.
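The dual use of the ioctl log, both to answer queries in the driver's place and to reconfigure the driver later, can be modeled as below. This is a sketch only: `SoundConfigLog`, `record`, `impersonate`, and the command numbers `SET_VOLUME` and `GET_FORMATS` are hypothetical and do not correspond to real sound-driver ioctl codes.

```python
SET_VOLUME = 1   # hypothetical command numbers, not real ioctl codes
GET_FORMATS = 2


class SoundConfigLog:
    def __init__(self):
        self._entries = []   # (cmd, arg) in call order, for later replay
        self._results = {}   # cmd -> most recent result from the real driver

    def record(self, cmd, arg, result):
        """Called in passive mode with the parameters and result of an ioctl."""
        self._entries.append((cmd, arg))
        self._results[cmd] = result

    def impersonate(self, cmd):
        """Answer a query during recovery from what the driver said earlier."""
        if cmd not in self._results:
            raise KeyError(f"no recorded answer for command {cmd}")
        return self._results[cmd]

    def entries(self):
        return list(self._entries)
```

For example, once `record(GET_FORMATS, None, ["s16le", "u8"])` has been observed, `impersonate(GET_FORMATS)` can return the same list of formats while the real driver is unavailable.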

The shadow driver maintains only the configuration of the driver in its log. For stateful devices, such as frame buffers or storage devices, it does not create a copy of the device state. Instead, a shadow driver depends on the fail-stop assumption to preserve persistent state (e.g., on disk) from corruption. It can restore transient state (state that is lost when the device resets) if it can force the device's clients to recreate that state, for example, by redrawing the contents of a frame buffer.

Lastly, the shadow tracks all kernel objects that the driver allocated or received from the kernel. These objects would otherwise be lost when the driver fails, causing a memory leak. For example, the shadow must record all timer callbacks registered and all hardware resources owned, such as interrupt lines and I/O memory regions.

In many cases, passive-mode calls do no work and the shadow returns immediately to the caller. For example, the dominant calls to a sound-card driver are read and write, which record or play sound. In passive mode, the shadow driver implements these calls as no-ops, since there is no need to copy the real-time sound data flowing through the device driver. For an ioctl call, however, the sound-card shadow driver logs the command and data for the connection. Similarly, the shadow driver for an IDE disk does little or no work in passive mode, since the kernel and disk driver handle all I/O and request queuing. Finally, for the network shadow driver, much of the work is already performed by the Nooks object-tracking system, which keeps references to outstanding packets.

4.3 Active-Mode Recovery

A driver typically fails by generating an illegal memory reference or passing an invalid parameter across a kernel interface. The kernel-level failure detector notices the failure and invokes the shadow manager, which locates the appropriate shadow driver and directs it to recover the failed driver. The three steps of recovery are: (1) stopping the failed driver, (2) reinitializing the driver from a clean state, and (3) transferring relevant shadow driver state into the new driver.

4.3.1 Stopping the Failed Driver

The shadow manager begins recovery by informing the responsible shadow driver that a failure has occurred. It also switches the taps, isolating the kernel and driver from one another's subsequent activity during recovery. After this point, the tap redirects all kernel requests to the shadow until recovery is complete.

Informed of the failure, the shadow driver first disables execution of the failed driver. It also disables the hardware device to prevent it from interfering with the OS while not under driver control. For example, the shadow disables the driver's interrupt request line. Otherwise, the device may continuously interrupt the kernel and prevent recovery. On hardware platforms with I/O memory mapping, the shadow also removes the device's I/O mappings to prevent DMAs into kernel memory.

To prepare for restarting the device driver, the shadow garbage collects resources held by the driver. It retains objects that the kernel uses to request driver services, to ensure that the kernel does not see the driver "disappear" as it is restarted. The shadow releases the remaining resources.
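The split between retained and released objects can be illustrated with a short sketch. The `TrackedObject` record, its `kernel_visible` flag, and the `release` callback are hypothetical; in the implementation described here, the underlying object tracking is provided by Nooks.

```python
from dataclasses import dataclass


@dataclass
class TrackedObject:
    name: str
    kernel_visible: bool  # does the kernel use it to request driver services?


def garbage_collect(tracked, release):
    """Release driver-private objects; keep the ones the kernel still refers to."""
    retained = []
    for obj in tracked:
        if obj.kernel_visible:
            retained.append(obj)   # the driver must not appear to "disappear"
        else:
            release(obj)           # e.g., timers and other private resources
    return retained
```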

4.3.2 Reinitializing the Driver

The shadow driver next "reboots" the driver from a clean state. Normally, restarting a driver requires reloading the driver from disk. However, we cannot assume that the disk is functional during recovery. For this reason, when creating a new shadow driver instance, the shadow manager caches in the shadow instance a copy of the device driver's initial, clean data section. These sections tend to be small. The driver's code is kernel-read-only, so it is not cached and can be reused from memory.
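The idea of caching the clean data section at shadow-creation time can be modeled with byte buffers. This is a simplified sketch: `DriverData`, `ShadowInstance`, and `reset_data` are hypothetical names, and a Python `bytearray` stands in for the driver's writable data section.

```python
class DriverData:
    def __init__(self, initial_data: bytes):
        self.data = bytearray(initial_data)  # writable data section


class ShadowInstance:
    def __init__(self, driver: DriverData):
        # Cached when the shadow instance is created, before the driver runs,
        # so that recovery does not depend on reading the disk.
        self._clean_data = bytes(driver.data)

    def reset_data(self, driver: DriverData):
        # Overwrite whatever the failed driver left behind with the clean copy.
        driver.data[:] = self._clean_data
```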

The shadow restarts the driver by initializing the driver's state and then repeating the kernel's driver initialization sequence. For some driver classes, such as sound card drivers, this consists of a single call into the driver's initialization routine. Other drivers, such as network interface drivers, require additional calls to connect the driver into the network stack.

As the driver restarts, the shadow reattaches the driver to its pre-failure kernel resources. During driver reboot, the driver makes a number of calls into the kernel to discover information about itself and to link itself into the kernel. For example, the driver calls the kernel to register itself as a driver and to request hardware and kernel resources. The taps redirect these calls to the shadow driver, which reconnects the driver to existing kernel data structures. Thus, when the driver attempts to register with the kernel, the shadow intercepts the call and reuses the existing driver registration, avoiding the allocation of a new one. For requests that generate callbacks, such as a request to register the driver with the PCI subsystem, the shadow emulates the kernel, making the same callbacks to the driver with the same parameters. The driver also acquires hardware resources. If these resources were previously disabled at the first step of recovery, the shadow re-enables them, e.g., enabling interrupt handling for the device's interrupt line. In essence, the shadow driver initializes the recovering driver by calling and responding as the kernel would when the driver starts normally.
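The registration case can be sketched as follows. The names `KernelRegistry` and `RecoveringShadow` are hypothetical, and updating the `ops` field of the existing entry is an assumption about how a reused registration would be pointed at the restarted driver.

```python
class KernelRegistry:
    """Stand-in for a kernel table of registered drivers."""

    def __init__(self):
        self.entries = {}
        self.allocations = 0

    def register(self, name, ops):
        self.allocations += 1
        self.entries[name] = {"name": name, "ops": ops}
        return self.entries[name]


class RecoveringShadow:
    def __init__(self, registry):
        self.registry = registry

    def register(self, name, ops):
        """Reached through the tap in place of the kernel's register routine."""
        existing = self.registry.entries.get(name)
        if existing is None:
            return self.registry.register(name, ops)
        existing["ops"] = ops  # reuse the old registration for the new driver
        return existing
```

Because the restarted driver receives the same registration object the kernel already holds, kernel data structures that refer to it remain valid across the restart.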

4.3.3 Transferring State to the New Driver

The final recovery step restores the driver state that existed at the time of the fault, permitting it to respond to requests as if it had never failed. Thus, any configuration that either the kernel or an application had downloaded to the driver must be restored. The details of this final state transfer depend on the device driver class. Some drivers are connection oriented. For these, the state consists of the state of the connections before the failure. The shadow re-opens the connections and restores the state of each active connection with configuration calls. Other drivers are request oriented. For these, the shadow restores the state of the driver and then resubmits to the driver any requests that were outstanding when the driver crashed.

As an example, for a failed sound card driver, the shadow driver resets the sound driver and all its open connections back to their pre-failure state. Specifically, the shadow scans its list of open connections and calls the open function in the driver to reopen each connection. The shadow then walks its log of configuration commands and replays any commands that set driver properties.
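The reopen-then-replay sequence for a sound driver can be written as a short sketch. Everything named here is hypothetical: `FakeSoundDriver` only records the calls it receives, and the command names in `SETTER_COMMANDS` are placeholders, not real ioctl codes.

```python
SETTER_COMMANDS = {"set_volume", "set_format"}  # hypothetical command names


class FakeSoundDriver:
    """Records calls so the replay order can be inspected."""

    def __init__(self):
        self.calls = []

    def open(self, conn_id):
        self.calls.append(("open", conn_id))

    def ioctl(self, conn_id, cmd, arg):
        self.calls.append(("ioctl", conn_id, cmd, arg))


def restore_sound_state(driver, open_connections, config_log):
    # Step 1: reopen every connection that was open before the failure.
    for conn_id in open_connections:
        driver.open(conn_id)
    # Step 2: replay only the logged commands that set driver properties.
    for conn_id, cmd, arg in config_log:
        if cmd in SETTER_COMMANDS:
            driver.ioctl(conn_id, cmd, arg)
```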

For some driver classes, the shadow cannot completely transfer its state into the driver. However, it may be possible to compensate in other, perhaps less elegant, ways. For example, a sound driver that is recording sound stores the number of bytes it has recorded since the last reset. After recovery, the sound driver initializes this counter to zero. Because no interface call is provided to change the counter value, the shadow driver must insert its "true" value into the return argument list whenever the application reads the counter to maintain the illusion that the driver has not crashed. The shadow can do this because it receives control (on its replicated call) before the kernel returns to user space.
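The counter compensation can be sketched as an offset applied on the return path. The class and method names below are hypothetical. The sketch also assumes the shadow learns the counter only from values it has already returned to the application, so bytes recorded between the last read and the crash are not accounted for.

```python
class RecordingDriver:
    """Toy driver whose byte counter starts at zero on every (re)start."""

    def __init__(self):
        self.bytes_recorded = 0

    def record(self, n):
        self.bytes_recorded += n

    def get_byte_count(self):
        return self.bytes_recorded


class CounterShadow:
    """Toy shadow that patches the counter value returned to applications."""

    def __init__(self, driver):
        self.driver = driver
        self.offset = 0       # bytes counted by earlier driver instances
        self.last_seen = 0    # last value handed back to an application

    def get_byte_count(self):
        # The shadow gets control before the value reaches user space,
        # so it can substitute the value the application expects.
        value = self.driver.get_byte_count() + self.offset
        self.last_seen = value
        return value

    def recover(self, new_driver):
        self.offset = self.last_seen
        self.driver = new_driver


driver = RecordingDriver()
shadow = CounterShadow(driver)
driver.record(4096)
before = shadow.get_byte_count()

fresh = RecordingDriver()        # restarted driver, counter back at zero
shadow.recover(fresh)
fresh.record(1024)
after = shadow.get_byte_count()
```

From the application's point of view the counter keeps increasing across the restart, even though the new driver instance counts from zero.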

After resetting driver and connection state, the shadow must handle requests that were either outstanding when the driver crashed or arrived while the driver was recovering. Unfortunately, shadow drivers cannot guarantee exactly-once behavior for driver requests and must rely on devices and higher levels of software to absorb duplicate requests. For example, if a driver crashes after submitting a request to a device but before notifying the kernel that the request has completed, the shadow cannot know whether the request was actually processed. During recovery, the shadow driver has two choices: restart in-progress requests and risk duplication, or cancel the request and risk lost data. For some device classes, such as disks or networks, duplication is acceptable. However, other classes, such as printers, may not tolerate duplicates. In these cases, the shadow driver cancels outstanding requests, which may limit its ability to mask failures.
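The choice between restarting and cancelling in-progress requests can be expressed as a per-class policy. The sketch below is illustrative only: the class names in `POLICY` follow the examples in the text, while the function name, the callbacks, and the conservative default for unknown classes are assumptions.

```python
RESUBMIT = "resubmit"   # risk duplicates
CANCEL = "cancel"       # risk lost data

# Mapping taken from the examples in the text: disks and networks tolerate
# duplicates, printers may not.
POLICY = {"disk": RESUBMIT, "network": RESUBMIT, "printer": CANCEL}


def handle_outstanding(device_class, outstanding, submit, cancel):
    """Apply the class policy to every request that was in flight at the crash."""
    # Assumption: an unknown class is treated conservatively and cancelled.
    policy = POLICY.get(device_class, CANCEL)
    for request in outstanding:
        if policy == RESUBMIT:
            submit(request)
        else:
            cancel(request)
    return policy


submitted, cancelled = [], []
disk_policy = handle_outstanding(
    "disk", ["read block 7", "write block 9"], submitted.append, cancelled.append)
printer_policy = handle_outstanding(
    "printer", ["print page 1"], submitted.append, cancelled.append)
```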

After this final step, the driver has been reinitialized, linked into the kernel, reloaded with its pre-failure state, and is ready to process commands. At this point, the shadow driver notifies the shadow manager, which sets the taps to restore kernel-driver communication and reestablish passive-mode monitoring.
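The switch between passive monitoring and active proxying can be pictured as a flag on each tap. The following is a simplified sketch, not the actual tap mechanism: `Tap`, `ShadowManager`, and the callback names are hypothetical.

```python
class Tap:
    """Hypothetical tap on one kernel-to-driver entry point."""

    def __init__(self, driver_fn, shadow_observe, shadow_proxy):
        self.driver_fn = driver_fn            # the real driver function
        self.shadow_observe = shadow_observe  # passive mode: record call and result
        self.shadow_proxy = shadow_proxy      # active mode: answer for the driver
        self.active = False

    def call(self, *args):
        if self.active:
            return self.shadow_proxy(*args)
        result = self.driver_fn(*args)
        self.shadow_observe(args, result)
        return result


class ShadowManager:
    """Flips all taps for a driver when it fails and when recovery completes."""

    def __init__(self, taps):
        self.taps = taps

    def driver_failed(self):
        for tap in self.taps:
            tap.active = True

    def recovery_complete(self):
        for tap in self.taps:
            tap.active = False


observed = []
tap = Tap(driver_fn=lambda x: x * 2,
          shadow_observe=lambda args, result: observed.append((args, result)),
          shadow_proxy=lambda x: "busy")
manager = ShadowManager([tap])

normal = tap.call(21)        # driver answers, shadow records
manager.driver_failed()
during = tap.call(21)        # shadow answers in the driver's place
manager.recovery_complete()
restored = tap.call(4)       # driver answers again
```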

4.4 Active-Mode Proxying of Kernel Requests

While a shadow driver is restoring a failed driver, it is also acting in place of the driver to conceal the failure and recovery from applications and the kernel. The shadow driver's response to a driver request depends on the driver class and request semantics. In general, the shadow will take one of five actions: (1) respond with information that it has recorded, (2) silently drop the request, (3) queue the request for later processing, (4) block the request until the driver recovers, or (5) report that the driver is busy and the kernel or application should try again later. The choice of strategy depends on the caller's expectations of the driver.

Writing a shadow driver that proxies for a failed driver requires knowledge of the kernel-driver interface, interactions, and requirements. For example, the kernel may require that some driver functions never block, while others always block. Some kernel requests are idempotent (e.g., many ioctl commands), permitting duplicate requests to be dropped, while others return different results on every call (e.g., many read requests). The shadow for a driver class uses these requirements to select the response strategy.

Active proxying is simplified for driver interfaces that support a notion of "busy." By reporting that the device is currently busy, shadow drivers instruct the kernel or application to block calls to a driver. For example, network drivers in Linux may reject requests and turn themselves off if their queues are full. The kernel then refrains from sending packets until the driver turns itself back on. Our shadow network driver exploits this behavior during recovery by returning a "busy" error on calls to send packets. IDE storage drivers support a similar notion when request queues fill up. Sound drivers can report that their buffers are temporarily full.

Our shadow sound-card driver uses a mix of all five strategies for emulating functions in its service interface. The shadow blocks kernel read and write requests, which play or record sound samples, until the failed driver recovers. It processes ioctl calls itself, either by responding with information it captured or by logging the request to be processed later. For ioctl commands that are idempotent, the shadow driver silently drops duplicate requests. Finally, when applications query for buffer space, the shadow responds that buffers are full. As a result, many applications block themselves rather than blocking in the shadow driver.
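The sound-card shadow's choices can be summarized as a small decision function. This is a sketch of the behavior described above, not the shadow's code: the `Action` names, the call labels, and the keyword flags are assumptions.

```python
from enum import Enum, auto


class Action(Enum):
    REPLY_FROM_LOG = auto()   # (1) respond with recorded information
    DROP = auto()             # (2) silently drop the request
    QUEUE = auto()            # (3) log the request for later processing
    BLOCK = auto()            # (4) block until the driver recovers
    BUSY = auto()             # (5) report that the device is busy


def sound_shadow_action(call, *, have_recorded_answer=False,
                        duplicate_idempotent=False):
    """Pick the proxy action for one request that arrives during recovery."""
    if call in ("read", "write"):
        return Action.BLOCK
    if call == "query_buffer_space":
        return Action.BUSY            # reported to the caller as "buffers full"
    if call == "ioctl":
        if duplicate_idempotent:
            return Action.DROP
        if have_recorded_answer:
            return Action.REPLY_FROM_LOG
        return Action.QUEUE
    raise ValueError(f"unknown call: {call}")
```

For example, `sound_shadow_action("write")` yields `Action.BLOCK`, while an ioctl the shadow has no recorded answer for is queued for replay once the driver is back.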

4.5 Limitations

As previously described, shadow drivers have limitations. First, shadow drivers rely on dynamic unloading and reloading of device drivers. If a driver cannot be reloaded dynamically, or will not reinitialize properly, then a shadow cannot recover the driver. Second, shadow drivers rely on explicit communication between the device driver and kernel. If driver-kernel communication takes place through an ad-hoc interface, such as shared memory, the shadow driver cannot monitor it. Third, shadow drivers assume that driver failure does not cause irreversible side effects. If a corrupted driver stores persistent state (e.g., printing a bad check or writing bad data on a disk), the shadow driver will not be able to correct that action.

The effectiveness of shadow drivers is also limited by the abilities of the isolation and failure-detection subsystem. If this layer cannot prevent kernel corruption, then shadow drivers cannot facilitate system recovery. In addition, if the fault-isolation subsystem does not detect a failure, then shadow drivers will not be properly invoked to perform recovery, and applications may fail. Detecting failures is difficult because drivers are complex and may respond to application requests in many ways. It may be impossible to detect a valid but incorrect return value; for example, a sound driver may return incorrect sound data when recording. As a result, no failure detector can detect every device driver failure. However, we support class-based failure detectors that can detect violations of a driver's programming interface and reduce the number of undetected failures.

Finally, shadow drivers may not be suitable for applications with real-time demands. During recovery, a device may be unavailable for several seconds without notifying the application of a failure. These applications, which should be written to tolerate failures, would be better served by a solution that restarts the driver but does not perform active proxying.

4.6 Summary

This section presented the details of our Linux shadow driver implementation. The shadow driver concept is straightforward: passively monitor normal operations, proxy during failure, and reintegrate during recovery. Ultimately, the value of shadow drivers depends on the degree to which they can be implemented correctly, efficiently, and easily in an operating system. The following section evaluates some of these questions both qualitatively and quantitatively.

5 Evaluation

This section evaluates four key aspects of shadow drivers.

Performance. What is the performance overhead of shadow drivers during normal, passive-mode operation (i.e., in the absence of failure)? This is the dynamic cost of our mechanism.
Fault-Tolerance. Can applications that use a device driver continue to run even after the driver fails? We evaluate shadow driver recovery in the presence of simple failures to show the benefits of shadow drivers compared to a system that provides failure isolation alone.
Limitations. How reasonable is our assumption that driver failures are fail-stop? Using synthetic fault injection, we evaluate how likely it is that driver failures are fail-stop.
Code size. How much code is required for shadow drivers and their supporting infrastructure? We evaluate the size and complexity of the shadow driver implementation to highlight the engineering cost of integrating shadow drivers into an existing system.

Based on a set of controlled application and driver experiments, our results show that shadow drivers: (1) impose relatively little performance overhead, (2) keep applications running when a driver fails, (3) are limited by a system's ability to detect that a driver has failed, and (4) can be implemented with a modest amount of code.

The experiments were run on a 3 GHz Pentium 4 PC with 1 GB of RAM and an 80 GB, 7200 RPM IDE disk drive. We built and tested three Linux shadow drivers for three device-driver classes: network interface controller, sound card, and IDE storage device. To ensure that our generic shadow drivers worked consistently across device driver implementations, we tested them on thirteen different Linux drivers, shown in Table 1. Although we present detailed results for only one driver in each class (e1000, audigy, and ide-disk), behavior across all drivers was similar.

Class	Driver	Device

Network	e1000	Intel Pro/1000 Gigabit Ethernet
	pcnet32	AMD PCnet32 10/100 Ethernet
	3c59x	3COM 3c509b 10/100 Ethernet
	e100	Intel Pro/100 Ethernet
	epic100	SMC EtherPower 10/100 Ethernet

Sound	audigy	SoundBlaster Audigy sound card
	emu10k1	SoundBlaster Live! sound card
	sb	SoundBlaster 16 sound card
	es1371	Ensoniq sound card
	cs4232	Crystal sound card
	i810_audio	Intel 810 sound card

Storage	ide-disk	IDE disk
	ide-cd	IDE CD-ROM

Table 1: The three classes of shadow drivers and the Linux drivers tested. We present results for one driver in each class only (e1000, audigy, and ide-disk), as the others behaved similarly.

5.1 Performance

To evaluate performance, we produced three OS configurations based on the Linux 2.4.18 kernel:

Linux-Native is the unmodified Linux kernel.
Linux-Nooks is a version of Linux-Native that includes the Nooks fault isolation subsystem but no shadow drivers. When a driver fails, this system restarts the driver but does not attempt to conceal its failure.
Linux-SD is a version of Linux-Nooks that includes our entire recovery subsystem, including the Nooks fault isolation subsystem, the shadow manager, and our three shadow drivers.

We selected a variety of common applications that depend on our three device driver classes and measured their performance. The application names and behaviors are shown in Table 2.

Device Driver	Application Activity

Sound	mp3 player (zinf) playing 128kb/s audio
(audigy driver)	audio recorder (audacity) recording from microphone
	speech synthesizer (festival) reading a text file
	strategy game (Battle of Wesnoth)

Network	network send (netperf) over TCP/IP
(e1000 driver)	network receive (netperf) over TCP/IP
	network file transfer (scp) of a 1GB file
	remote window manager (vnc)
	network analyzer (ethereal) sniffing packets

Storage	compiler (make/gcc) compiling 788 C files
(ide-disk driver)	encoder (LAME) converting a 90 MB .wav file to mp3
	database (mySQL) processing the Wisconsin Benchmark

Table 2: The applications used for evaluating shadow drivers.

Different applications have different performance metrics of interest. For the disk and sound drivers, we ran the applications shown in Table 2 and measured elapsed time. For the network driver, throughput is a more useful metric; therefore, we ran the throughput-oriented network send and network receive benchmarks. For all drivers, we also measured CPU utilization while the programs ran. All measurements were repeated several times and showed a variation of less than one percent.

Figure 5: Comparative application performance, relative to Linux-Native, for three configurations. The X-axis crosses at 80%.

Figure 6: Absolute CPU utilization by application for three configurations.

Figure 5 shows the performance of Linux-Nooks and Linux-SD relative to Linux-Native. Figure 6 compares CPU utilization for execution of the same applications on the three OS versions. Both figures make clear that shadow drivers impose only a small performance penalty compared to running with no isolation at all, and no additional penalty beyond that imposed by isolation alone. Across all nine applications, performance of the system with shadow drivers averaged 99% of the system without, and was never worse than 97%.

The low overhead of shadow drivers can be explained in terms of its two constituents: fault isolation and the shadowing itself. As mentioned previously, fault isolation runs each driver in its own domain, leading to overhead caused by domain crossings. Each domain crossing takes approximately 3000 cycles, mostly to change page tables and execution stacks. As a side effect of changing page tables, the Pentium 4 processor flushes the TLB, resulting in TLB misses that can noticeably slow down drivers [33].

For example, the kernel calls the driver approximately 1000 times per second when running audio recorder. Each invocation executes only a small amount of code. As a result, isolating the sound driver adds only negligibly to CPU utilization, because there are not many crossings and not much code to slow down. For the most disk-intensive of the IDE storage applications, the database benchmark, the kernel and driver interact only 290 times per second. However, each call into the ide-disk driver results in substantial work to process a queue of disk requests. The TLB-induced slowdown doubles the time database spent in the driver relative to Linux-Native and increases the application's CPU utilization from 21% to 27%. On the other hand, the network send benchmark transmits 45,000 packets per second, causing 45,000 domain crossings. The driver does little work for each packet, but the overall impact is visible in Figure 6, where CPU utilization for this benchmark increases from 28% to 57% with driver fault isolation.
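A back-of-the-envelope calculation helps separate the direct cost of domain crossings from the other effects of isolation. The sketch below uses only figures quoted in this section (about 3000 cycles per crossing, a 3 GHz processor, and the per-second interaction rates). It assumes that each kernel-driver interaction counts as a single crossing, which the text states explicitly only for the network send benchmark.

```python
CYCLES_PER_CROSSING = 3000      # approximate cost quoted in the text
CPU_HZ = 3_000_000_000          # 3 GHz Pentium 4 used in the experiments


def crossing_cpu_fraction(crossings_per_second):
    """Fraction of one CPU spent purely on domain crossings."""
    return crossings_per_second * CYCLES_PER_CROSSING / CPU_HZ


# Interaction rates quoted in the text; one crossing per interaction is an
# assumption for the first two entries.
rates = {"audio recorder": 1000, "database": 290, "network send": 45_000}
direct_cost = {app: crossing_cpu_fraction(rate) for app, rate in rates.items()}

for app, fraction in direct_cost.items():
    print(f"{app}: {fraction:.2%} of CPU on crossings alone")
```

Under these assumptions the crossings themselves account for roughly 0.1% of the CPU for audio recorder and about 4.5% for network send. The latter is well below the measured rise from 28% to 57%, which is consistent with the observation above that TLB misses after each page-table switch add cost beyond the crossing itself.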

In the case of the actual shadowing, we see from a comparison of the Linux-Nooks and Linux-SD bars in Figures 5 and 6 that the cost is small or negligible. As noted in Section 4.2, many passive-mode shadow-driver functions are no-ops. As a result, the incremental passive-mode performance cost over basic fault isolation is low or unmeasurable in many cases.

In summary, then, the overall performance penalty of shadow drivers during failure-free operation is low, suggesting that shadow drivers could be used across a wide range of applications and environments.

5.2 Fault-Tolerance

Regardless of performance, the crucial question for shadow drivers is whether an application can continue functioning following the failure of a device driver on which it relies. To answer this question, we tested 10 applications on the three configurations, Linux-Native, Linux-Nooks, and Linux-SD. For the disk and sound drivers, we again ran the applications shown in Table 2. Because we were interested in the applications' response to driver failure, not performance, we substituted network file copy, remote window manager, and network analyzer for the networking benchmarks.

We simulated common bugs by injecting a software fault into a device driver while an application using that driver was running. Because both Linux-Nooks and Linux-SD depend on the same isolation and failure-detection services, we differentiate their recovery capabilities by simulating failures that are easily isolated and detected. To generate realistic synthetic driver bugs, we analyzed patches posted to the Linux Kernel Mailing List [24]. We found 31 patches that contained the strings "patch," "driver," and "crash" or "oops" (the Linux term for a kernel fault) in their subject lines. Of the 31 patches, we identified 11 that fix transient bugs (i.e., bugs that occur occasionally or only after a long delay from the triggering test). The most common cause of failure (three instances) was a missing check for a null pointer, often with a secondary cause of missing or broken synchronization. We also found missing pointer initialization code (two instances) and bad calculations (two instances) that led to endless loops and buffer overruns. Because these faults are detected by Nooks, they cause fail-stop failures on Linux-Nooks and Linux-SD.

We injected a null-pointer dereference bug derived from these patches into our three drivers. We ensured that the synthetic bug was transient by inserting the bug into uncommon execution paths, such as code that handles unusual hardware conditions. These paths are rarely executed, so we accelerated the occurrence of faults by also executing the bug at random intervals. The fault code remains active in the driver during and after recovery.

Table 3 shows the three application behaviors we observed. When a driver failed, each application either continued to run normally ("OK"), failed completely ("CRASH"), or continued to run but behaved abnormally ("MALFUNCTION"). In the last case, manual intervention was typically required to reset or terminate the program.

		Application Behavior
Device Driver	Application Activity	Linux-Native	Linux-Nooks	Linux-SD

Sound	mp3 player	CRASH	MALFUNCTION	OK
(audigy driver)	audio recorder	CRASH	MALFUNCTION	OK
	speech synthesizer	CRASH	OK	OK
	strategy game	CRASH	MALFUNCTION	OK

Network	network file transfer	CRASH	OK	OK
(e1000 driver)	remote window manager	CRASH	OK	OK
	network analyzer	CRASH	MALFUNCTION	OK

IDE	compiler	CRASH	CRASH	OK
(ide-disk driver)	encoder	CRASH	CRASH	OK
	database	CRASH	CRASH	OK

Table 3: The observed behavior of several applications following the failure of the device drivers on which they rely. There are three behaviors: OK indicates that the application continued to operate normally; CRASH indicates that the application failed completely (i.e., it terminated); MALFUNCTION indicates that the application continued to run, but with abnormal behavior.

This table demonstrates that shadow drivers (Linux-SD) enable applications to continue running normally even when device drivers fail. In contrast, all applications on Linux-Native failed when drivers failed. Most programs running on Linux-Nooks failed or behaved abnormally, illustrating that Nooks' kernel-focused recovery system does not extend to applications. For example, Nooks isolates the kernel from driver faults and reboots (unloads, reloads, and restarts) the driver. However, it lacks two key features of shadow drivers: (1) it does not advance the driver to its pre-failure state, and (2) it has no component to "pinch hit" for the failed driver during recovery. As a result, Linux-Nooks handles driver failures by returning an error to the application, leaving it to recover by itself. Unfortunately, few applications can do this.

Some applications on Linux-Nooks survived the driver failure but in a degraded form. For example, mp3 player, audio recorder and strategy game continued running, but they lost their ability to input or output sound until the user intervened. Similarly, network analyzer, which interfaces directly with the network device driver, lost its ability to receive packets once the driver was reloaded.

A few applications continued to function properly after driver failure on Linux-Nooks. One application, speech synthesizer, includes the code to reestablish its context within an unreliable sound card driver. Two of the network applications survived on Linux-Nooks because they access the network device driver through kernel services (TCP/IP and sockets) that are themselves resilient to driver failures.

Linux-SD recovers transparently from disk driver failures. Recovery is possible because the IDE storage shadow driver instance maintains the failing driver's initial state. During recovery the shadow copies back the initial data and reuses the driver code, which is already stored read-only in the kernel. In contrast, Linux-Nooks illustrates the risk of circular dependencies from rebooting drivers. Following these failures, Nooks, which had unloaded the ide-disk driver, was then required to reload the driver off the IDE disk. The circularity could only be resolved by a system reboot. While a second (non-IDE) disk would mitigate this problem, few machines are configured this way.

In general, programs that directly depend on driver state but are unprepared to deal with its loss benefit the most from shadow drivers. In contrast, those that do not directly depend on driver state or are able to reconstruct it when necessary benefit the least. Our experience suggests that few applications are as fault-tolerant as speech synthesizer. Were future applications to be pushed in this direction, software manufacturers would either need to develop custom recovery solutions on a per-application basis or find a general solution that could protect any application from the failure of a kernel device driver. Cost is a barrier to the first approach. Shadow drivers are a path to the second.

Application Behavior During Driver Recovery

Although shadow drivers can prevent application failure, they are not "real" device drivers and do not provide complete device services. As a result, we often observed a slight timing disruption while the driver recovered. At best, output was queued in the shadow driver. At worst, input was lost by the device. The length of the delay was primarily determined by the recovering device driver itself, which, on initialization, must first discover, and then configure, the hardware.

Few device drivers implement fast reconfiguration, which can lead to brief recovery delays. For example, the temporary loss of the e1000 network device driver prevented applications from receiving packets for about five seconds.² Programs using files stored on the disk managed by the ide-disk driver stalled for about four seconds during recovery. In contrast, the normally smooth sounds produced by the audigy sound card driver were interrupted by a pause of about one-tenth of one second, which sounded like a slight click in the audio stream.

Of course, the significance of these delays depends on the application. Streaming applications may become unacceptably "jittery" during recovery. Those processing input data in real-time might become lossy. Others may simply run a few seconds longer in response to a disk that appears to be operating more sluggishly than usual. In any event, a short delay during recovery is best considered in light of the alternative: application and system failure.

5.3 Limits to Recovery

The previous section assumed that failures were fail-stop. However, driver failures experienced in deployed systems may exhibit a wider variety of behaviors. For example, a driver may corrupt state in the application, kernel, or device without being detected. In this situation, shadow drivers may not be able to recover or mask failures from applications. This section uses fault injection experiments in an attempt to generate faults that may not be fail-stop.

Non-fail-stop Failures

If driver failures are not fail-stop, then shadow drivers may not be useful. To evaluate whether device driver failures are indeed fail-stop, we performed large-scale fault-injection tests of our drivers and applications running on Linux-SD. For each driver and application combination, we ran 350 fault-injection trials.³ In total, we ran 2100 trials across the three drivers and six applications. Between trials, we reset the system and reloaded the driver. For each trial, we injected five random errors into the driver while the application was using it. We ensured the errors were transient by removing them during recovery. After injection, we visually observed the impact on the application and the system to determine whether a failure or recovery had occurred. For each driver, we tested two applications with significantly different usage scenarios. For example, we chose one sound-playing application (mp3 player) and one sound-recording application (audio recorder).

If we observed a failure, we then assessed the trial on two criteria: whether the fault was detected, and whether the shadow driver could mask the failure and subsequent recovery from the application. For undetected failures, we triggered recovery manually. Note that a user may observe a failure that an application does not, for example, by testing the application's responsiveness.

Figure 7: Results of fault-injection experiments on Linux-SD. We show (1) the percentage of failures that are automatically detected by the fault isolation subsystem, and (2) the percentage of failures that shadow drivers successfully recovered. The total number of failures experienced by each application is shown at the top of the chart.

Figure 7 shows the results of our experiments. For each application, we show the percentage of failures that the Nooks subsystem detected and the percentage of failures from which shadow drivers correctly recovered. Only 18% of the injected faults caused a visible failure.

In our tests, 390 failures occurred across all applications. The system automatically detected 65% of the failures. In every one of these cases, shadow drivers were able to mask the failure and facilitate driver recovery. The system failed to detect 35% of the failures. In these cases, we manually triggered recovery. Shadow drivers recovered from nearly all of these failures (127 out of 135). Recovery was unsuccessful in the remaining 8 cases because either the system had crashed (5 cases) or the driver had corrupted the application beyond the possibility of recovery (3 cases). It is possible that recovery would have succeeded had these failures been detected earlier with a better failure detector.
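The counts and percentages in this paragraph can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted above and takes 135 as the exact number of undetected failures. The overall recovery total it derives is a consequence of those figures rather than a number reported separately.

```python
failures = 390                 # failures observed across all applications
undetected = 135               # failures the system did not detect
recovered_after_manual = 127   # undetected failures recovered after manual trigger

detected = failures - undetected
detected_pct = 100 * detected / failures
undetected_pct = 100 * undetected / failures

# Every detected failure was recovered, so the total recovered is the sum.
recovered_total = detected + recovered_after_manual
unrecovered = failures - recovered_total

print(f"detected:    {detected} ({detected_pct:.1f}%)")
print(f"undetected:  {undetected} ({undetected_pct:.1f}%)")
print(f"recovered:   {recovered_total} of {failures}")
print(f"unrecovered: {unrecovered}")
```

The rounded shares match the 65% and 35% quoted above, and the 8 unrecovered cases match the sum of the 5 system crashes and the 3 corrupted applications.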

Across all applications and drivers, we found three major causes of undetected failure. First, the system did not detect application hangs caused by I/O requests that never completed. Second, the system did not detect errors in the interactions between the device and the driver, e.g., incorrectly copying sound data to a sound card. Third, the system did not detect certain bad parameters, such as incorrect result codes or data values. Detecting these three error conditions would require that the system better understand the semantics of each driver class. For example, 68% of the sound driver failures with audio recorder went undetected. This application receives data from the driver in real time and is highly sensitive to driver output. A small error or delay in the results of a driver request may cause the application to stop recording or record the same sample repeatedly.

Our results demonstrate a need for class-based failure detectors that can detect violations of the driver interface to achieve high levels of reliability. However, driver failures need not be detected quickly to be fail-stop. There was a significant delay between the failure and the subsequent manual recovery in our tests, and yet the applications survived the vast majority of undetected failures. Thus, even a slow failure detector can be effective at improving application reliability.

Non-transient Failures

Shadow drivers can recover from transient failures only. In contrast, deterministic failures may recur during recovery when the shadow configures the driver. While unable to recover, shadow drivers are still useful for these failures. When a failure recurs during recovery, the sequence of shadow driver recovery events creates a detailed reproduction scenario that aids diagnosis. This record of recovery contains the driver's calls into the kernel, requests to configure the driver, and I/O requests that were pending at the time of failure. This information enables a software engineer to find and fix the offending bug more efficiently.

5.4 Code Size

The preceding sections evaluated the efficiency and effectiveness of shadow drivers. This section examines the size of the shadow driver code, which can serve as a proxy for its complexity.

Driver Class	Shadow Driver Lines of Code	Device Driver Shadowed Lines of Code	Class Size # of Drivers	Class Size Lines of Code
Sound	666	7,381 (audigy)	48	118,981
Network	198	13,577 (e1000)	190	264,500
Storage	321	5,358 (ide-disk)	8	29,000

Table 4: Size and quantity of shadows and the drivers they shadow.

Table 4 shows, for each class, the size in lines of code of the shadow driver for the class. For comparison, we show the size of the driver from the class that we tested and the total number and cumulative size of existing Linux device drivers in that class in the 2.4.18 kernel. The total code size is an indication of the leverage gained through the shadow's class-driver structure. Furthermore, the table shows that a shadow driver is significantly smaller than the device driver it shadows. For example, our sound-card shadow driver is only 9% of the size of the audigy device driver it shadows. The IDE storage shadow is only 6% of the size of the Linux ide-disk device driver.
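The size ratios can be recomputed directly from Table 4. The sketch below does only that. The network ratio and the shadow-to-class ratios it produces are derived from the table and are not quoted in the text.

```python
# (shadow driver LOC, shadowed device driver LOC, total LOC in the class),
# all taken from Table 4.
sizes = {
    "Sound":   (666, 7_381, 118_981),
    "Network": (198, 13_577, 264_500),
    "Storage": (321, 5_358, 29_000),
}

shadow_vs_driver = {
    cls: 100 * shadow / driver for cls, (shadow, driver, _) in sizes.items()
}
shadow_vs_class = {
    cls: 100 * shadow / total for cls, (shadow, _, total) in sizes.items()
}

for cls in sizes:
    print(f"{cls}: shadow is {shadow_vs_driver[cls]:.1f}% of the tested driver, "
          f"{shadow_vs_class[cls]:.2f}% of all driver code in the class")
```

The second ratio gives a rough sense of the leverage of the class-based structure: one shadow covers every driver in its class.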

The Nooks driver fault isolation subsystem we built upon contains about 23,000 lines of code. In total, we added about 3300 lines of new code to Nooks to support our three class drivers. Otherwise, we made no changes to the remainder of the Linux kernel. Shadow drivers required the addition of approximately 600 lines of code for the shadow manager, 800 lines of common code shared by all shadow drivers, and another 750 lines of code for general utilities. Of the 177 taps we inserted, only 31 required actual code; the remainder were no-ops.

5.5 Summary

This section examined the performance, fault-tolerance, limits, and code size of shadow drivers. Our results demonstrate that: (1) the performance overhead of shadow drivers during normal operation is small, particularly when compared to a purely isolating system, (2) applications that failed in any form on Linux-Native or Linux-Nooks ran normally with shadow drivers, (3) the reliability provided by shadow drivers is limited by the system's ability to detect failures, and (4) shadow drivers are small, even relative to a single device driver. Overall, these results indicate that shadow drivers have the potential to significantly improve the reliability of applications on modern operating systems with only modest cost.

6 Conclusions

Improving the reliability of modern systems demands that we increase their resilience. To this end, we designed and implemented shadow drivers, which mask device driver failures from both the operating system and applications.

Our experience shows that shadow drivers improve application reliability by concealing a driver's failure while facilitating recovery. A single shadow driver can enable recovery for an entire class of device drivers. Shadow drivers are also efficient, imposing little performance degradation. Finally, they are transparent, requiring no code changes to existing drivers.

Acknowledgments

This work was supported in part by the National Science Foundation under grants ITR-0085670 and CCR-0121341. We would also like to thank our shepherd, Peter Chen, who provided many valuable insights.

References

[1]: S. Arthur. Fault resilient drivers for Longhorn server. Technical Report WinHec 2004 Presentation DW04012, Microsoft Corporation, May 2004.
[2]: Ö. Babaoglu. Fault-tolerant computing based on Mach. In Proceedings of the USENIX Mach Symposium, Oct. 1990.
[3]: R. Barga, D. Lomet, and G. Weikum. Recovery guarantees for general multi-tier applications. In International Conference on Data Engineering, 2002. IEEE.
[4]: J. F. Bartlett. A NonStop kernel. In Proceedings of the 8th ACM Symposium on Operating Systems Principles, Dec. 1981.
[5]: A. Borg, W. Blau, W. Graetsch, F. Herrmann, and W. Oberle. Fault tolerance under UNIX. ACM Transactions on Computer Systems, 7(1):1-24, Feb. 1989.
[6]: D. P. Bovet and M. Cesati. Inside the Linux Kernel. O'Reilly & Associates, 2002.
[7]: T. C. Bressoud. TFT: A software system for application-transparent fault tolerance. In Proceedings of the 28th Symposium on Fault-Tolerant Computing, June 1998. IEEE.
[8]: T. C. Bressoud and F. B. Schneider. Hypervisor-based fault tolerance. ACM Transactions on Computer Systems, 14(1):80-107, Feb. 1996.
[9]: G. Candea and A. Fox. Recursive restartability: Turning the reboot sledgehammer into a scalpel. In Proceedings of the Eighth IEEE Workshop on Hot Topics in Operating Systems, May 2001.
[10]: S. Chandra and P. M. Chen. How fail-stop are faulty programs? In Proceedings of the 28th Symposium on Fault-Tolerant Computing, June 1998. IEEE.
[11]: S. Chandra and P. M. Chen. Whither generic recovery from application faults? A fault study using open-source software. In Proceedings of the 2000 IEEE International Conference on Dependable Systems and Networks, June 2000.
[12]: P. M. Chen, W. T. Ng, S. Chandra, C. Aycock, G. Rajamani, and D. Lowell. The Rio file cache: Surviving operating system crashes. In Proceedings of the Seventh ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Oct. 1996.
[13]: T. Chiueh, G. Venkitachalam, and P. Pradhan. Integrating segmentation and paging protection for safe, efficient and transparent software extensions. In Proceedings of the 17th ACM Symposium on Operating Systems Principles, Dec. 1999.
[14]: A. Chou, J. Yang, B. Chelf, S. Hallem, and D. Engler. An empirical study of operating system errors. In Proceedings of the 18th ACM Symposium on Operating Systems Principles, Oct. 2001.
[15]: D. R. Engler, M. F. Kaashoek, and J. O'Toole Jr. Exokernel: an operating system architecture for application-level resource management. In Proceedings of the 15th ACM Symposium on Operating Systems Principles, Dec. 1995.
[16]: W. Feng. Making a case for efficient supercomputing. ACM Queue, 1(7), Oct. 2003.
[17]: B. Ford, G. Back, G. Benson, J. Lepreau, A. Lin, and O. Shivers. The Flux OSKit: a substrate for kernel and language research. In Proceedings of the 16th ACM Symposium on Operating Systems Principles, Oct. 1997.
[18]: J. Gray. Why do computers stop and what can be done about it? Technical Report 85-7, Tandem Computers, June 1985.
[19]: J. Gray and A. Reuter. Transaction Processing: Concepts and Techniques. Morgan Kaufmann, 1993.
[20]: S. M. Hand. Self-paging in the Nemesis operating system. In Proceedings of the 3rd USENIX Symposium on Operating Systems Design and Implementation, Feb. 1999.
[21]: D. Jewett. Integrity S2: A fault-tolerant Unix platform. In Proceedings of the 21st Symposium on Fault-Tolerant Computing, June 1991. IEEE.
[22]: M. J. Kilgard, D. Blythe, and D. Hohn. System support for OpenGL direct rendering. In Proceedings of Graphics Interface, May 1995. Canadian Human-Computer Communications Society.
[23]: J. Liedtke. On µ-kernel construction. In Proceedings of the 15th ACM Symposium on Operating Systems Principles, Dec. 1995.
[24]: Linux Kernel Mailing List. Available at https://www.uwsg.indiana.edu/hypermail/linux/kernel.
[25]: D. E. Lowell, S. Chandra, and P. M. Chen. Exploring failure transparency and the limits of generic recovery. In Proceedings of the 4th USENIX Symposium on Operating Systems Design and Implementation, Oct. 2000.
[26]: D. E. Lowell and P. M. Chen. Discount checking: Transparent, low-overhead recovery for general applications. Technical Report CSE-TR-410-99, University of Michigan, Nov. 1998.
[27]: G. Muller, M. Banâtre, N. Peyrouze, and B. Rochat. Lessons from FTM: An experiment in design and implementation of a low-cost fault-tolerant system. IEEE Transactions on Software Engineering, 45(2):332-339, June 1996.
[28]: D. Patterson, A. Brown, P. Broadwell, G. Candea, M. Chen, J. Cutler, P. Enriquez, A. Fox, E. Kıcıman, M. Merzbacher, D. Oppenheimer, N. Sastry, W. Tetzlaff, J. Traupman, and N. Treuhaft. Recovery-Oriented Computing (ROC): Motivation, definition, techniques, and case studies. Technical Report CSD-02-1175, UC Berkeley Computer Science, Mar. 2002.
[29]: J. S. Plank, M. Beck, G. Kingsley, and K. Li. Libckpt: Transparent checkpointing under Unix. In Proceedings of the 1995 Winter USENIX Conference, Jan. 1995.
[30]: R. Short, Vice President of Windows Core Technology, Microsoft Corp. private communication, 2003.
[31]: M. Russinovich, Z. Segall, and D. Siewiorek. Application transparent fault management in Fault Tolerant Mach. In Proceedings of the 23rd Symposium on Fault-Tolerant Computing, June 1993. IEEE.
[32]: M. I. Seltzer, Y. Endo, C. Small, and K. A. Smith. Dealing with disaster: Surviving misbehaved kernel extensions. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation, Oct. 1996.
[33]: M. M. Swift, B. N. Bershad, and H. M. Levy. Improving the reliability of commodity operating systems. ACM Transactions on Computer Systems, 22(4), Nov. 2004.
[34]: V. Orgovan, Systems Crash Analyst, Windows Core OS Group, Microsoft Corp. private communication, 2004.
[35]: R. Wahbe, S. Lucco, T. E. Anderson, and S. L. Graham. Efficient software-based fault isolation. In Proceedings of the 14th ACM Symposium on Operating Systems Principles, Dec. 1993.
[36]: R. S. Wahbe and S. E. Lucco. Methods for safe and efficient implementation of virtual machines, June 1998. US Patent 5,761,477.
[37]: J. A. Whittaker. Software's invisible users. IEEE Software, 18(3):84-88, May 2001.
[38]: W. A. Wulf. Reliable hardware-software architecture. In Proceedings of the International Conference on Reliable Software, 1975.
[39]: M. Young, M. Accetta, R. Baron, W. Bolosky, D. Golub, R. Rashid, and A. Tevanian. Mach: A new kernel foundation for UNIX development. In Proceedings of the 1986 Summer USENIX Conference, June 1986.

Footnotes:

¹This paper uses the terms "device driver" and "driver" interchangeably; similarly, we use the terms "shadow driver" and "shadow" interchangeably.

²This driver is particularly slow at recovery. The other network drivers we tested recovered in less than a second.

³For details on the fault injector see [33].

This paper was originally published in the Proceedings of the 6th Symposium on Operating Systems Design and Implementation,
December 6–8, 2004, San Francisco, CA