Feng Qin, Joseph Tucek and Yuanyuan Zhou
Department of Computer Science, University of Illinois at Urbana-Champaign
{fengqin, tucek, yyzhou}@cs.uiuc.edu
Many applications demand availability. Unfortunately, software failures greatly reduce system availability. Previous approaches for surviving software failures suffer from several limitations, including requiring application restructuring, failing to address deterministic software bugs, unsafely speculating on program execution, and requiring a long recovery time.
This paper proposes an innovative, safe technique, called Rx, that can quickly recover programs from many types of common software bugs, both deterministic and non-deterministic. Our idea, inspired by allergy treatment in real life, is to roll back the program to a recent checkpoint upon a software failure, and then to reexecute the program in a modified environment. We base this idea on the observation that many bugs are correlated with the execution environment, and therefore can be avoided by removing the ``allergen'' from the environment. Rx requires few to no modifications to applications and provides programmers with additional feedback for bug diagnosis.
Many applications, especially critical ones such as process control or on-line transaction monitoring, require high availability [14]. For server applications, downtime leads to lost productivity and lost business. According to a report by Gartner Group [22], the average cost of an hour of downtime for a financial company exceeds six million US dollars. With the tremendous growth of e-commerce, almost every kind of organization increasingly depends on highly available systems.
Unfortunately, software failures greatly reduce system availability. A recent study showed that software failures account for up to 40% of system failures [16]. Among them, memory-related bugs and concurrency bugs are common and severe software defects, causing more than 60% of system vulnerabilities [9]. For this reason, software companies invest enormous effort and resources in software testing and bug detection before releasing software. However, software failures still occur during production runs since some bugs inevitably slip through even the strictest testing. Therefore, to achieve higher system availability, mechanisms must be devised to allow systems to survive the effects of remaining software bugs to the largest extent possible.
Previous work on surviving software failures can be classified into four categories. The first category encompasses various flavors of rebooting (restarting) techniques, including whole program rebooting [14,25], micro-rebooting of partial system components [7,6,8], and software rejuvenation [15,13,4]. Since many of these techniques were originally designed to handle hardware failures, most of them are ill-suited for surviving software failures. For example, they cannot deal with deterministic software bugs, a major cause of software failures [10], because these bugs will still occur even after rebooting. Another important limitation of these methods is service unavailability while restarting, which can take up to several seconds [26]. Servers that buffer significant amounts of state in main memory (e.g. data caches) also require a long period to warm up to full service capacity [5,27]. Micro-rebooting [8] addresses this problem to some extent by rebooting only the failed components. However, it requires legacy software to be restructured in a loosely coupled fashion.
The second category includes general checkpointing and recovery. The most straightforward method in this category is to checkpoint, roll back upon failure, and then reexecute either on the same machine [12,19] or on a different machine designated as the ``backup server'' (either active or passive) [14,3,5,27]. Similar to whole program rebooting, these techniques were also proposed to deal with hardware failures, and therefore suffer from the same limitations in addressing software failures. Progressive retry [28] is an interesting improvement over these works. It reorders messages to increase the degree of non-determinism. While this work proposes a promising direction, it limits the technique to message reordering. As a result, it cannot handle bugs unrelated to message order. For example, if a server receives a malicious request that exploits a buffer overflow bug, simply reordering messages will not solve the problem. The most aggressive approaches in this category include recovery blocks [18] and n-version programming [2,1], both of which rely on executing alternate implementation versions upon failure. These approaches may survive deterministic bugs under the assumption that different versions fail independently. But they are too expensive for most software companies to adopt because they double software development costs and effort.
The third category comprises application-specific recovery mechanisms, such as the multi-process model (MPM), exception handling, etc. Some multi-processed applications, such as the multi-processed version of the Apache HTTP server and the CVS server, can simply kill a failed process and start a new one to handle a failed request. While simple and capable of surviving certain software failures, this technique has several limitations. First, if the bug is deterministic, the new process will most likely fail again at the same place given the same request (e.g. a malicious request). Second, if a shared data structure is corrupted, simply killing the failed process and starting a new one will not restore the shared data to a consistent state, potentially causing subsequent failures in other processes. Other application-specific recovery mechanisms require software to be failure-aware, which increases programming difficulty and reduces code readability.
The fourth category includes several recent non-conventional proposals such as failure-oblivious computing [20,21] and the reactive immune system [23]. Failure-oblivious computing proposes to deal with buffer overflows by providing artificial values for out-of-bound reads, while the reactive immune system returns a speculative error code for functions that suffer software failures (e.g. crashes). While these approaches are inspiring and may work for certain types of applications or certain types of bugs, they are unsafe to use for correctness-critical applications (e.g. on-line banking systems) because they ``speculate'' on programmers' intentions, which can lead to program misbehavior. The problem becomes even more severe and harder to detect if the speculative ``fix'' introduces a silent error that does not manifest itself immediately. Such problems, if they occur, are very hard for programmers to diagnose since the application's execution has been forcefully and silently perturbed by those speculative ``fixes''.
Besides the above individual limitations, existing work provides insufficient feedback to developers for debugging. For example, the information provided to developers may include only a core dump, several checkpoints, and an event log for the deterministic replay of a few seconds of recent execution. To save debugging effort, it is desirable for the run-time system to provide information regarding the bug type, the conditions under which the bug is triggered, and how it can be avoided. Such diagnostic information can guide programmers during debugging and thereby improve their efficiency.
Our idea is inspired by real life. When a person suffers from an allergy, the most common treatment is to remove the allergens from their living environment. For example, patients allergic to milk should remove dairy products from their diet, and patients allergic to pollen may install air filters to remove pollen from the air. Additionally, when removing a candidate allergen from the environment successfully treats the symptoms, it helps diagnose the cause of those symptoms. Obviously, such treatment cannot, and should not, start before the patient shows allergic symptoms, since changing the living environment requires special effort and may itself be unhealthy.
In software, many bugs resemble allergies. That is, their manifestation can be avoided by changing the execution environment. According to a previous study by Chandra and Chen [10], around 56% of faults in Apache depend on the execution environment. Therefore, by removing the ``allergen'' from the execution environment, it is possible to avoid such bugs. For example, a memory corruption bug may disappear if the memory allocator delays the recycling of recently freed buffers or allocates buffers non-consecutively in isolated locations. A buffer overrun may not manifest itself if the memory allocator pads the ends of every buffer with extra space. Data races can be avoided by changing the timing of events such as thread scheduling, asynchronous events, etc. Bugs that are exploited by malicious users can be avoided by dropping such requests during program reexecution. Even though dropping requests may make a few users (hopefully the malicious ones) unhappy, it does not introduce incorrect behavior into program execution the way the failure-oblivious approaches do. Furthermore, given a spectrum of possible environmental changes, the least intrusive changes can be tried first, reserving the most extreme ones as a last resort for when all other changes have failed. Finally, the specific environmental change that cures the problem provides diagnostic information about what the bug might be.
As with allergies, it is difficult and expensive to apply these environmental changes from the very beginning of program execution because we do not know what bugs might occur later. For example, zero-filling newly allocated buffers imposes time overhead. Therefore, we should lazily apply environmental changes only when they are needed.
Compared to previous solutions, Rx has the following unique advantages:
(1) Comprehensive: Besides non-deterministic bugs, Rx can also survive deterministic bugs. We have evaluated our idea using several server applications with common software bugs, and our preliminary results show that Rx can successfully survive these software bugs.
(2) Safe: Rx does not speculatively ``fix'' bugs at run time. Instead, it prevents bugs from manifesting themselves by changing only the program's execution environment. Therefore, it does not introduce uncertainty or misbehavior into a program's execution, which would be difficult for programmers to diagnose.
(3) Noninvasive: Rx requires few to no modifications to applications' source code. Therefore, it can be easily applied to legacy software.
(4) Efficient: Because Rx requires no rebooting or warm-up, it significantly reduces system downtime and provides reasonably good performance during recovery. Additionally, Rx imposes only the minimal overhead of lightweight checkpointing during normal execution.
(5) Informative: Rx does not hide software bugs. Instead, bugs are still exposed. Furthermore, besides the usual bug report package (e.g. core dumps, checkpoints and event logs), Rx provides programmers with additional diagnostic information for postmortem analysis, including what conditions triggered the bug and which environmental changes can and cannot avoid the bug. Based on such information, programmers can more efficiently find the root cause of the bug. For example, if Rx successfully avoids a bug by padding newly allocated buffers, the bug is likely to be a buffer overflow. Similarly, if Rx avoids a bug by delaying the recycling of freed buffers, the bug is likely to be caused by a double free or a dangling pointer.
The main idea of Rx is to reexecute a failed code region in a new environment that has been modified based on the failure symptoms. If the bug's ``allergen'' is removed from the new environment, the bug will not occur during reexecution, and so the program will survive this software failure without rebooting the whole program. After the reexecution safely passes through the problematic code region, the environmental changes are disabled to reduce time and space overhead.
Figure 1 shows the process by which Rx survives software failures. Rx periodically takes light-weight checkpoints that are specially designed to survive software failures rather than hardware failures or OS crashes [24]. When a bug is detected, either through an exception or by the integrated dynamic defect detection tools called Rx sensors, the program is rolled back to a recent checkpoint. Rx then analyzes the failure based on its symptoms and the ``experiences'' accumulated from previous failures, and determines how to apply environmental changes to avoid it. Finally, the program reexecutes from the checkpoint in the modified environment. This process may repeat several times, each time with a different environmental change or from a different checkpoint, until either the failure disappears or a timeout occurs. If the failure does not recur in a reexecution attempt, the execution environment is reset to normal to avoid the time and space overhead imposed by some of the environmental changes.
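The cycle just described can be summarized by the following sketch. All types and helper functions (e.g. rollback_to, apply_change, reexecute_from) are hypothetical placeholders for the mechanisms discussed in this paper; the fragment illustrates only the control flow, not Rx's actual implementation.

```c
/* Hypothetical sketch of Rx's rollback-and-reexecute cycle.
 * Every type and helper below is a placeholder, not Rx's real API. */
typedef struct checkpoint checkpoint_t;
typedef int env_change_t;

extern env_change_t  next_change(int failure_kind, int attempt);
extern checkpoint_t *older_checkpoint(checkpoint_t *cp);
extern void          rollback_to(checkpoint_t *cp);
extern void          apply_change(env_change_t c);
extern int           reexecute_from(checkpoint_t *cp);  /* 1 = failure did not recur */
extern void          reset_environment(void);
extern void          record_outcome(int failure_kind, env_change_t c, int ok);
extern void          fall_back_to_restart(void);

#define MAX_ATTEMPTS 8

void rx_survive_failure(checkpoint_t *recent, int failure_kind)
{
    checkpoint_t *cp = recent;
    for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
        env_change_t change = next_change(failure_kind, attempt);
        rollback_to(cp);                /* reinstate state from the checkpoint */
        apply_change(change);           /* e.g. pad buffers, change event timing */
        int ok = reexecute_from(cp);
        record_outcome(failure_kind, change, ok);
        if (ok) {
            reset_environment();        /* drop the change once past the buggy region */
            return;
        }
        if (attempt == MAX_ATTEMPTS / 2)
            cp = older_checkpoint(cp);  /* retry from an earlier checkpoint */
    }
    fall_back_to_restart();             /* last resort: reboot-based recovery */
}
```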
In our approach, the execution environment includes almost everything that is external to the target application but can affect its execution. At the lowest level, it includes the hardware, such as processor architectures, devices, etc. At the middle level, it includes the OS kernel, such as scheduling, virtual memory management, device drivers, file systems, network protocols, etc. At the highest level, it includes standard libraries, third-party libraries, etc. This definition of the execution environment is much broader than the one used in previous work [10].
Obviously, the execution environment cannot be arbitrarily modified for reexecution. A useful reexecution environmental change should satisfy two properties. First, it should be correctness-preserving: the original program is still executed, and every step (e.g. instruction, library call, and system call) behaves according to its API specification. For example, in a memory allocation library call, we have the flexibility to decide where the buffer is allocated, but we cannot return a smaller buffer than requested. Second, a useful environmental change should be able to potentially avoid some bugs. For example, padding every allocated buffer can prevent some buffer overflows from manifesting during reexecution.
Examples of useful execution environmental changes include, but are not limited to, the following categories:
(1) Memory management based: Many software bugs are memory related, such as buffer overflows, dangling pointers, etc. These bugs may not manifest themselves if memory management is performed slightly differently. For example, each buffer allocated during reexecution can have padding added to both ends to prevent some buffer overflows. Delaying the recycling of freed buffers can reduce the probability that a dangling pointer causes memory corruption. In addition, buffers allocated during reexecution can be placed in isolated locations far away from existing memory buffers to avoid some memory corruption. Furthermore, zero-filling new buffers can avoid some uninitialized read bugs. Since none of the above changes violate memory allocation or deallocation interface specifications, they are safe to apply (see the allocator sketch after this list).
(2) Timing based: Most non-deterministic software bugs, such as data races, are related to the timing of asynchronous events. These bugs will likely disappear under different timing conditions. Therefore, Rx can forcefully change the timing of these events to avoid such bugs during reexecution. For example, increasing the length of a scheduling time slice will likely avoid context switches during buggy critical sections.
(3) User request based: Since it is infeasible to test every possible user request before releasing software, many bugs occur due to unexpected user requests. For example, malicious users may issue malformed requests that exploit buffer overflow bugs, as in stack smashing attacks [11]. These bugs can be avoided by dropping some users' requests during reexecution. Of course, since the user may not be malicious, this method should be used as a last resort after all other environmental changes fail.
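To make category (1) concrete, the sketch below shows how wrapped allocation calls might pad buffers, zero-fill them, and delay the recycling of freed buffers. This is a minimal illustration of the idea rather than Rx's actual allocator; the padding size and the length of the deferred-free queue are arbitrary values chosen for the example.

```c
#include <stdlib.h>
#include <string.h>

/* Illustration of memory-based environmental changes: pad and zero-fill
 * each allocation, and delay the reuse of freed buffers. Not Rx's actual
 * allocator; PAD_BYTES and DEFER_SLOTS are arbitrary example values. */
#define PAD_BYTES   128           /* extra space at the end of each buffer */
#define DEFER_SLOTS 1024          /* how many freed buffers to hold back   */

static void *deferred[DEFER_SLOTS];
static int   deferred_next;

void *rx_malloc(size_t size)
{
    /* Allocate more than requested so small overruns stay inside the block. */
    void *p = malloc(size + PAD_BYTES);
    if (p)
        memset(p, 0, size + PAD_BYTES);  /* zero-fill to mask uninitialized reads */
    return p;
}

void rx_free(void *p)
{
    if (!p)
        return;
    /* Hold the buffer back instead of freeing it immediately, so that a
     * dangling pointer dereferences stale but still-allocated memory. */
    void *oldest = deferred[deferred_next];
    deferred[deferred_next] = p;
    deferred_next = (deferred_next + 1) % DEFER_SLOTS;
    if (oldest)
        free(oldest);             /* recycle only the oldest deferred buffer */
}
```

Because the wrapper still returns a buffer at least as large as requested and eventually releases every freed buffer, it stays within the allocation interface's specification, i.e. it is correctness-preserving in the sense defined above.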
Table 1 lists some environmental changes and the types of bugs that can be potentially avoided by them. Although there are many such changes, due to space limitations, we only list a few examples for demonstration.
After a reexecution attempt successfully passes the problematic program region for a threshold amount of time, the environmental changes applied during the successful reexecution are disabled to reduce space and time overhead. Furthermore, the failure symptoms and the effects of the environmental changes applied are recorded. This speeds up the process of dealing with future failures with similar symptoms and code locations. Additionally, Rx provides all such diagnostic information to programmers together with core dumps and other basic postmortem bug analysis information.
If the failure still occurs during a reexecution attempt, Rx will roll back and reexecute the program again, either with a different environmental change or from an older checkpoint. For example, if one change (e.g. padding buffers) cannot avoid the bug during reexecution, Rx will roll back the program again and try another change (e.g. zero-filling new buffers) during the next reexecution. If none of the environmental changes work, Rx will roll back further and repeat the same process. If the failure still remains after a threshold number of rollback-and-reexecute iterations, Rx will resort to previous solutions, such as whole program rebooting [14,25] or micro-rebooting [7,6,8], as supported by the application.
Upon a failure, Rx follows several rules to determine the order in which environmental changes should be applied during the recovery process. First, if a similar failure has been successfully avoided by Rx before, the environmental change that worked previously will be tried first. If this does not work, or if no information from previous failures exists, changes with small overheads (e.g. padding buffers) are tried before those with large overheads (e.g. zero-filling new buffers). Changes with negative side effects (e.g. dropping requests) are tried last. Changes that do not conflict, such as padding buffers and changing event timing, can be applied simultaneously.
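These rules can be thought of as a simple ranking over candidate changes, as in the hypothetical sketch below; the change identifiers and cost values are invented for illustration and are not Rx's actual policy code.

```c
/* Hypothetical ranking of environmental changes: a change that cured a
 * similar failure before is tried first, then low-overhead changes, then
 * costly ones, with changes that have negative side effects tried last. */
enum change { PAD_BUFFERS, DELAY_FREE, CHANGE_TIMING, ZERO_FILL, DROP_REQUESTS };

static const int base_cost[] = {
    [PAD_BUFFERS]   = 1,    /* cheap */
    [DELAY_FREE]    = 1,
    [CHANGE_TIMING] = 2,
    [ZERO_FILL]     = 3,    /* expensive: touches every newly allocated buffer */
    [DROP_REQUESTS] = 100,  /* negative side effect: last resort */
};

/* Lower value means the change is attempted earlier in recovery. */
int change_priority(enum change c, int worked_for_similar_failure)
{
    return worked_for_similar_failure ? 0 : base_cost[c];
}
```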
There is a small possibility that a bug still occurs during reexecution but is not detected in time by Rx's sensors. In this case, Rx will claim a successful recovery when in fact the recovery failed. Addressing this problem requires more rigorous on-the-fly software defect checkers as sensors, which is currently an active research area. In addition, it is important to note that, unlike in failure-oblivious computing, this problem is caused by the application's bug rather than by Rx's environmental changes; the environmental changes merely make the bug manifest itself in a different way. Furthermore, since Rx logs every action it takes, including which environmental changes are applied and with what results, programmers can use this information to analyze the bug.
While the Rx implementation borrows ideas from previous work, many design issues need to be addressed differently due to differing goals. First, Rx targets software failures instead of hardware failures or OS crashes. Therefore, the checkpointing component does not need to be heavy-weight. Second, Rx does not require deterministic replay. Instead, Rx needs the exact opposite: non-determinism. Therefore, issues such as checkpoint management and the output commit problem [12] need to be addressed differently.
As shown in Figure 2, Rx consists of five components: (1) sensors for detecting failures and bugs, (2) a Checkpoint-and-Rollback (CR) component, (3) a proxy for making server recovery process transparent to clients, (4) environmental wrappers, and (5) a control unit that determines the recovery plan for an occurring failure.
Sensors detect software bugs and failures by dynamically monitoring applications' execution. There are two types of sensors. The first type detects software errors such as assertion failures, access violations, divide-by-zero exceptions, etc. This type of sensor is relatively easy to implement by simply taking over OS-raised exceptions. The second type of sensor detects software bugs, such as buffer overflows and accesses to freed memory, before they cause the program to crash. This type of sensor leverages existing dynamic bug detection tools, such as our previous work, SafeMem [17], that have low run-time overhead (only 1.6-14%) for detecting memory-related bugs in server programs.
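A minimal sketch of the first type of sensor is shown below, assuming the control unit exposes some notification hook (rx_notify_failure is a hypothetical name). The idea is simply to install handlers for the exceptions the OS would otherwise turn into a crash and report them to Rx instead.

```c
#include <signal.h>
#include <string.h>

/* Hypothetical hook into the control unit; not a real Rx interface. */
extern void rx_notify_failure(int signo, void *fault_addr);

static void sensor_handler(int signo, siginfo_t *info, void *ctx)
{
    (void)ctx;
    /* Report the failure (access violation, divide-by-zero, ...) so the
     * control unit can roll back and plan a reexecution attempt. */
    rx_notify_failure(signo, info ? info->si_addr : NULL);
}

void rx_install_sensors(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = sensor_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);   /* access violations */
    sigaction(SIGBUS,  &sa, NULL);
    sigaction(SIGFPE,  &sa, NULL);   /* divide-by-zero */
    sigaction(SIGABRT, &sa, NULL);   /* assertion failures */
}
```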
The CR (Checkpoint-and-Rollback) component takes checkpoints of the target application and rolls back the application to a previous checkpoint upon failure. Rx uses a light-weight checkpointing solution that is designed for surviving software failures. At a checkpoint, Rx stores a snapshot of the application into memory. Similar to the fork operation, Rx copies application memory in a copy-on-write fashion to minimize overhead. The details were discussed in our previous work [24]. Performing rollback is straightforward: simply reinstate the program from the shadow process associated with the specified checkpoint. The CR also supports multiple checkpoints and rollback to any of them.
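The copy-on-write snapshot can be illustrated with a fork-based sketch. This is a simplification for exposition rather than the implementation from [24]: it only captures the memory image in a shadow process, while file descriptors, sockets, and other kernel state must be handled separately (in Rx, largely with the proxy's help).

```c
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Simplified fork-based checkpoint: the child holds a copy-on-write
 * snapshot of the parent's memory and just sleeps until the checkpoint
 * is either used for rollback or discarded. */
pid_t rx_take_checkpoint(void)
{
    pid_t shadow = fork();
    if (shadow == 0) {
        /* Shadow process: keep the snapshot alive. */
        for (;;)
            pause();
    }
    return shadow;   /* parent records the shadow process id (or -1 on error) */
}

void rx_discard_checkpoint(pid_t shadow)
{
    if (shadow > 0)
        kill(shadow, SIGKILL);   /* drop the snapshot once it is no longer needed */
}
```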
The environmental wrappers perform environmental changes during reexecution. We implement different environmental changes in different components. For example, we implement memory-management-based changes by wrapping the memory allocation library calls. The kernel handles timing-based changes, such as thread scheduling, signal delay, and other asynchronous timing events. The proxy process, described next, manipulates user requests.
To provide the reexecution functionality, Rx uses a proxy to buffer messages between the server and its remote clients. The proxy runs as a separate process to avoid being corrupted by the server. During normal operation, the proxy simply bridges the server and its clients, and buffers user requests made since the oldest undeleted checkpoint. During a reexecution attempt from a checkpoint, the proxy replays all the user requests received since that checkpoint.
To address the output commit problem, the proxy ensures that every user request is replied to once and only once. For each request, the proxy records whether this request has been answered. If so, a reply made during reexecution is dropped silently. Otherwise, the reply is sent to the corresponding client. In other words, only the first reply goes to the client, no matter whether this first reply is made during the original execution or a successful reexecution attempt.
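This output-commit rule amounts to a per-request check in the proxy, as in the sketch below. The bookkeeping structure and the send_to_client helper are invented for the example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper that transmits a reply on the client connection. */
extern void send_to_client(int client_fd, const char *reply, size_t len);

struct request_entry {
    bool replied;               /* has a reply already been committed? */
};

/* Only the first reply for a request reaches the client; replies
 * regenerated during reexecution attempts are silently dropped. */
bool proxy_forward_reply(struct request_entry *req, int client_fd,
                         const char *reply, size_t len)
{
    if (req->replied)
        return false;           /* duplicate reply from reexecution: drop it */
    send_to_client(client_fd, reply, len);
    req->replied = true;        /* commit: suppress any later replies */
    return true;
}
```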
For applications such as on-line shopping or the SSL hand-shake protocol that require strict session consistency (i.e. later requests in the same session depend on previous replies), Rx can record the signatures (hash values) of all committed replies for each outstanding session, and perform MD5 hash-based consistency checks during reexecution. If a reexecution attempt generates a reply that does not match with the associated committed reply, the session can be aborted abnormally to avoid confusing users.
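The session consistency check can be sketched as comparing a digest of each reply produced during reexecution with the digest recorded when the original reply was committed. For brevity, the sketch below uses a simple FNV-1a hash as a stand-in for MD5.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for MD5 in this sketch: a 64-bit FNV-1a hash over the reply. */
static uint64_t reply_digest(const unsigned char *buf, size_t len)
{
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= buf[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* True if the reply regenerated during reexecution matches the committed
 * reply's digest; otherwise the session should be aborted rather than
 * exposing an inconsistent reply to the client. */
bool session_reply_consistent(uint64_t committed_digest,
                              const unsigned char *reexec_reply, size_t len)
{
    return reply_digest(reexec_reply, len) == committed_digest;
}
```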
The control unit analyzes occurring failures and determines which checkpoint to roll back to and which environmental changes to apply during reexecution. After each reexecution attempt, it records the effect (success or failure) in its failure table. This table is used as a reference for future failures and is also provided to programmers for postmortem bug analysis. The control unit also monitors the recovery time and, when it exceeds a threshold, resorts to program restart solutions.
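A minimal sketch of such a failure table is shown below; the field names and table size are invented for illustration. Each entry associates a failure signature with the environmental change attempted and the observed outcome, so a similar future failure can first retry the change that previously worked.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative failure-table entry; field names are invented. */
struct failure_record {
    uint64_t failure_sig;     /* e.g. hash of the failing location and failure type */
    int      change_applied;  /* which environmental change was tried */
    bool     succeeded;       /* did reexecution pass the problematic region? */
};

#define TABLE_SIZE 256
static struct failure_record table[TABLE_SIZE];
static int table_used;

/* Return a change that previously cured a failure with the same signature,
 * or -1 if Rx has no prior experience with this failure. */
int lookup_previous_cure(uint64_t sig)
{
    for (int i = 0; i < table_used; i++)
        if (table[i].failure_sig == sig && table[i].succeeded)
            return table[i].change_applied;
    return -1;
}

void record_attempt(uint64_t sig, int change, bool ok)
{
    if (table_used < TABLE_SIZE)
        table[table_used++] = (struct failure_record){ sig, change, ok };
}
```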
In the evaluation, we designed four sets of experiments to evaluate key aspects of Rx: (1) the functionality of Rx in surviving software failures caused by common software defects; (2) the performance overhead of Rx in terms of both server throughput and average response time; (3) how Rx behaves under malicious attacks that continuously send bug-exposing requests to trigger software defects; and (4) the benefit of Rx's mechanism of learning from previous failures to speed up recovery.
In particular, Figure 3 shows the overhead of Rx for Squid compared to the baseline (without Rx) for various checkpointing frequencies. Both throughput and response time remain very close to the baseline for all tested checkpoint rates. Results for other server applications are similar. In this experiment, we use a workload similar to the one used in [24].