Drew Dean
Computer Science Laboratory, SRI International
ddean@csl.sri.com

Alan J. Hu
Dept. of Computer Science, University of British Columbia
ajh@cs.ubc.ca
One particular race condition is especially infamous among developers of security-critical software, particularly setuid programs, on Unix and Unix-like systems: the race between a call to the access(2) system call and a subsequent call to open(2). Although this paradigm appears to have been the intended use of access(2), which first appeared in V7 Unix in 1979, it has always been subject to this race condition. Recall that individual Unix system calls are atomic, but sequences of system calls offer no guarantees as to atomicity. This is a long-standing problem: a 1993 CERT advisory [7] documents this exact race condition in xterm, and earlier exploits based on this problem are believed to exist. However, there is no generally available, highly portable, correct solution for providing the functionality of access(2). This paper remedies this unfortunate situation. It is commonly accepted that fixing this problem requires a kernel change. While this is true for a deterministic solution, we present a highly portable probabilistic solution that works under the existing system call interface. The technique used is reminiscent of hardness amplification as found in the cryptology literature [16], but applied to system calls rather than cryptologic primitives.
We first survey the problem, its history, partial solutions, and related work. We then prove the ``folk theorem'' that there is no (deterministic) solution to the problem short of a kernel modification. We present our probabilistic solution, and experimental data showing the exploitability of the problem across several generations of machines. Final thoughts are presented in the conclusion.
The setuid programmer expects the sequence

    if (access(pathname, R_OK) == 0)
        if ((fd = open(pathname, O_RDONLY)) >= 0)
            ...

to work in the obvious way: check whether pathname is readable by the real user id, and if so, open the file for reading on file descriptor fd.
Unfortunately, there is a classic time-of-check-to-time-of-use (TOCTTOU) [11] problem lurking here: the pair of access(2) and open(2) system calls is not a single, atomic operation. Hence, a clever attacker can change the file system in between the two system calls, to trick a setuid program into opening a file that it should not. Apple (Mac OS X 10.3) and FreeBSD (4.7) are very succinct in their manual pages for access(2): ``Access() is a potential security hole and should never be used.'' For naïve uses of access(2), this is true; however, we shall see that the real situation is more complicated.
Cowan et al. [10] cover a very similar problem, a race condition between the use of stat(2) and open(2), with their RaceGuard technology. They changed the kernel to maintain a small per-process cache of files that have been stat'd and found not to exist. If a subsequent open(2) finds an existing file, the open fails. This race condition is primarily found in the creation of temporary files. While it is essentially equivalent to the access(2)/open(2) race we consider in this paper, the solution is entirely different: they modify the kernel, we do not. Tsyrklevich and Yee [15] take a similar approach to RaceGuard, in that their solution involves a kernel modification. However, they have a richer policy language to express which race conditions they will intercept, and they suspend the putative attacker process (a process that interfered with another process's race-prone call sequence) rather than causing an open(2) call to fail. Again, Tsyrklevich and Yee modify the kernel to fix existing vulnerable applications, whereas we propose a user-level technique to solve the problem.
A widely held belief is that no such user-level solution is possible, but to our knowledge, this belief has never been precisely stated, let alone proven. Here, we state and prove this theorem. Furthermore, the assumptions needed to prove the theorem will suggest an alternative solution.

Informally, the theorem rests on three assumptions: first, that the real user id's permission to a file can be checked only via its pathname (as access(2) does); second, that a file can be opened only via its pathname (as open(2) does); and third, that the attacker can win every race against the victim. The first assumption means that the theorem ignores solutions based on juggling user ids and giving up privilege, which we rule out because of portability and efficiency concerns. The first two assumptions also imply ignoring various solutions based on kernel changes: for example, an faccess(2) call that determines access permissions given a file descriptor violates the first assumption, whereas an O_RUID option to open(2), as discussed in Section 2.2, violates the second assumption. Note that although fstat(2) at first glance appears to violate the first assumption, it actually doesn't, since the stat buffer contains permission information for the file only, but doesn't consider the permissions through all directories on the file's path. In general, the theorem applies to any combination of the typical accessors of the file system state: access(2), open(2), stat(2), fstat(2), lstat(2), read(2), getdents(2), etc. The third assumption is standard when analyzing security against race conditions.
Proof: Any attempted solution will perform a sequence of system calls. We can model this sequence as a string v over the alphabet {a, o}, where a represents a call to an access-checking function, and o represents any other call, e.g., open(2). (If the attempted solution has multiple control flow paths making different sequences of calls, we can model this as a finite set of strings, one for each path through the program, and the attacker can attack each of these strings separately.) Similarly, we can model the attacker's execution as a string w over the alphabet {g, b}, where g represents swapping in a good file (one for which the real user id has permission) for the file whose access is being checked, and b represents swapping in a bad file (one for which the real user id doesn't have permission). An attempted attack is an interleaving of the strings v and w.
The assumption that the attacker can win races against the setuid program means that the attacker can control what interleaving occurs (at least some fraction of the time -- we only need one success for a successful attack). Suppose the attempted solution contains n instances of a. The attack can then consist of the string w = (gb)^n, and the attacker can make the interleaving such that each a is immediately bracketed before by g and after by b. Therefore, every access-checking call checks the good file and grants permission, whereas all other operations will see the same bad file, and hence there will be no inconsistencies that can be detected. Therefore, under the assumptions, there is no secure solution.
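For concreteness (an illustration of the proof, not part of it): suppose the attempted solution makes two access-check/open pairs, so v = aoao and n = 2. The attacker plays w = (gb)^2 = gbgb, and forces the interleaving

    g a b o g a b o

Both a calls see the good file and grant access, both o calls open the same bad file, and any cross-check between the opened files compares the bad file against itself, revealing nothing.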
The above theorem actually generalizes to other TOCTTOU problem instances that satisfy similar assumptions. If there is no way to check something and acquire it in a single atomic operation, and if we assume the attacker can win races, then the attacker can always swap in the good thing before each check and swap in the bad thing before any other operations.
Notice how strongly the proof of the theorem relies on the assumption that the attacker can win races whenever needed. This assumption is reasonable and prudent when considering the security of ad hoc races that occurred by oversight, since a determined attacker can employ various means to increase the likelihood of winning races and can repeat the attack millions of times. However, is this assumption still reasonable if we carefully design an obstacle course of races that the attacker needs to win? By analogy to cryptology, an attacker that can correctly guess all the bits of a key can break any cryptosystem, but with enough key bits, the probability of guessing them all correctly is acceptably small. This insight leads to our probabilistic solution.
The other major assumption needed for our solution is that the calls to access(2) and open(2) must be idempotent and have no undesirable side effects. For typical usages of opening files for reading or writing, this assumption is reasonable. However, one must be careful with certain usages of open(2). In particular, some common flag combinations, like (O_CREAT | O_EXCL), are not idempotent and will not work with our solution. Similarly, calling open(2) on some devices may cause undesirable side effects (like rewinding a tape), which our solution will not prevent.
The probabilistic solution starts with the standard calls to access(2) followed by open(2). However, these two calls are then followed by k strengthening rounds, where k is a configurable strengthening parameter. Each strengthening round consists of an additional call to access(2) followed by open(2), and then a check to verify that the file that was opened is the same file that was opened previously (by comparing inodes, etc.). When k = 0, our solution degenerates into the standard, race-vulnerable access(2)/open(2) sequence. Figure 1 shows the code for our solution.
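A minimal sketch of this strengthening loop follows; the function name is ours, the same-file check via fstat(2) device and inode numbers is illustrative, and Figure 1 remains the authoritative version:

    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Sketch: open pathname for reading, checking real-uid access,
     * hardened with k strengthening rounds.  Returns an open fd, or -1
     * on failure or on any sign of tampering. */
    int access_open_strengthened(const char *pathname, int k)
    {
        struct stat first, cur;
        int fd, fd2, i;

        /* The classic pair: check with the real uid, then open. */
        if (access(pathname, R_OK) != 0)
            return -1;
        if ((fd = open(pathname, O_RDONLY)) < 0)
            return -1;
        if (fstat(fd, &first) != 0) {
            close(fd);
            return -1;
        }

        /* k strengthening rounds: repeat the pair, and verify that
         * each open(2) yielded the same file as the first one. */
        for (i = 0; i < k; i++) {
            if (access(pathname, R_OK) != 0) {
                close(fd);
                return -1;
            }
            if ((fd2 = open(pathname, O_RDONLY)) < 0) {
                close(fd);
                return -1;
            }
            if (fstat(fd2, &cur) != 0 ||
                cur.st_dev != first.st_dev || cur.st_ino != first.st_ino) {
                close(fd);
                close(fd2);
                return -1;
            }
            close(fd2);
        }
        return fd;    /* all checks passed */
    }

A caller treats a -1 return as either a benign failure or a detected attack; in either case, it does not proceed with the open file.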
The probabilistic solution adds some overhead over the simple, insecure access(2)/open(2) sequence. In particular, the runtime will grow linearly in k. How much improvement in security do we get for this cost in runtime?
If we assume that each race is an independent random event that the attacker wins with probability p, then the attacker succeeds with probability approximately p^(2k+1): there is one race in the initial access(2)/open(2) pair, plus two more races for each strengthening round. (This analysis ignores attacks that win more than one race between consecutive characters in v, because these contribute much smaller terms for reasonable values of p.) Hence, the attacker must work exponentially harder as we increase k linearly. This is the same trade-off behind modern cryptology.
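As a concrete instance of this model (the numbers are purely illustrative):

\[
\Pr[\text{attack succeeds}] \approx p^{2k+1}, \qquad p = 0.1,\; k = 2 \;\Longrightarrow\; p^{5} = 10^{-5}.
\]

That is, two strengthening rounds turn a one-in-ten race into roughly a one-in-a-hundred-thousand attack.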
We note that our solution should generalize to other TOCTTOU situations, provided the access check and use are side-effect free.
The assumption that each race is an independent, identically distributed random variable is obviously not realistic. Some amount of variation or dependence in the winnability of each race does not fundamentally change our analysis: the probability of a successful attack will still be the product of the probabilities of winning the individual races. The greatest threat is that an attacker might manage to synchronize exactly to the setuid program, so that it knows exactly when to insert its actions to win each race, making p = 1. A simple way to foil this threat is to add randomness to the delays before the access(2) and open(2) calls. We can add calls to a cryptographically strong random number generator and insert random delays into our code before each access(2) and open(2) call. The attacker would thereby have no way of knowing for sure when to win each race. On systems lacking the proc(5) file system, these delays can be calls to nanosleep(2). With the proc(5) file system, the attacker can quickly check whether the victim is running or sleeping, so the victim must always be seen to be running, even if it is only executing a delay loop. We note that applications implemented with user-level threads can execute code in another thread to accomplish useful work (if available) rather than simply spinning in a delay loop. Note that the stat(2) family of system calls returns time values with 1 second granularity, too coarse to be useful to the attacker. Nevertheless, despite our analysis, the only way to be certain how our solution performs in reality is to implement it and measure the results.
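A minimal sketch of such a randomized delay follows; the helper name is ours, and random() stands in for the cryptographically strong generator a real implementation should use:

    #include <stdlib.h>
    #include <time.h>

    /* Sleep for a random number of short intervals, so the attacker
     * cannot synchronize to us.  On systems with proc(5), replace the
     * nanosleep(2) calls with a busy-wait loop so that the victim
     * always appears to be running. */
    static void random_delay(void)
    {
        struct timespec onenano = { 0, 1 };  /* request a 1 ns sleep */
        int i, n;

        n = random() & 0x7;                  /* 0 to 7 short sleeps */
        for (i = 0; i < n; i++)
            nanosleep(&onenano, NULL);
    }

One such call would be inserted before each access(2) and open(2) call in the strengthening loop.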
CPU/Clock Speed | OS | Compiler
Pentium III/500 MHz | Linux 2.4.18 | GCC 2.95.3
Pentium III/500 MHz | FreeBSD 4.7 | GCC 2.95.4
Ultra-60/2 x 300 MHz | Solaris 8 | GCC 3.2
SS2/40 MHz | SunOS 4.1.4 | Sun /bin/cc
For convenience, we will refer to the machines by operating system name, as these are unique. We will use the traditional terminology, where SunOS means SunOS 4.1.4, and Solaris means Solaris 8 (aka SunOS 5.8).
We developed an attack program that repeatedly forks off a copy of a victim setuid program that uses our strengthening with parameter k, and then attempts to win all 2k+1 races. Figure 2 shows the core code of our attack program.
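In outline, one attempt against the unstrengthened access(2)/open(2) pair looks roughly like the following sketch (a hypothetical reconstruction, not Figure 2 itself): "good" and "bad" are links prepared in advance by the test harness and swapped over the victim's pathname with atomic rename(2) calls, and DELAY2 and DELAY3 are the tuning constants discussed below.

    #include <stdio.h>

    #define DELAY2 1000   /* tuning constants; values illustrative */
    #define DELAY3 1000

    /* Hypothetical calibrated spin loop. */
    static void busywait(long n)
    {
        volatile long i;
        for (i = 0; i < n; i++)
            ;
    }

    static void attack_once(const char *target)
    {
        rename("good", target);   /* victim's access(2) should see the good file */
        busywait(DELAY2);         /* wait out the access(2) call */
        rename("bad", target);    /* victim's open(2) should see the bad file */
        busywait(DELAY3);         /* wait out the open(2) call */
    }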
As a basis for comparison, our first task was to measure how hard the classic access(2)/open(2) race is to win in the absence of strengthening. Running on Linux and FreeBSD uniprocessors, and attacking files on the local disk, we were surprised by how hard the attack is to win. In fact, we did not observe any successful attacks in our initial experiments.
Eventually, after careful tuning to match the attacker's delays to the victim as best we could, and after extended experiments, we were able to observe some successful attacks against the unstrengthened access(2)/open(2) race: 14 successes out of one million trials on the 500 MHz FreeBSD box (13 hours of real time), 1 success out of 500,000 trials on the 500 MHz Linux box, and subsequently 0 successes out of an additional one million trials on the 500 MHz Linux box (22 hours).
Some reflection suggested a possible explanation. The typical scheduling quantum on mainstream operating systems has not changed much over the past decades: on the order of tens of milliseconds. Barring any other events that cause a process to yield the CPU, a process on a uniprocessor will execute for that long quasi-atomically. However, processor speeds have increased by a few orders of magnitude over the same period, so the number of instructions that execute during a scheduling quantum from a single process has gone up accordingly. This implies that code on a uniprocessor should behave far more ``atomically'' than before and that race conditions should be far harder to win. Conversely, during the 1980s, when the access(2)/open(2) race entered the folklore, it was likely much easier to win.
To test this hypothesis, we obtained the oldest Unix-running machine we were able to resurrect sufficiently to run our experiments, a Sun SPARCstation 2 from the early 1990s. Other than conversion of function prototypes to Kernighan and Ritchie C, the code compiled and ran unmodified. On an experiment of one million attempts (56hrs), we observed 1316 successful attacks.
The good news, then, is that on a modern uniprocessor, even the unstrengthened access(2)/open(2) race is extremely hard to win. Given the low success rate and the difficulty of tuning the attacker for k = 1, we were never able to observe a successful attack with k = 1.
Uniprocessor Baseline Results Summary

Machine | k | Attempts | Successes
Linux | 0 | 1,500,000 | 1
FreeBSD | 0 | 1,000,000 | 14
SunOS | 0 | 1,000,000 | 1,316
The scheduling quantum argument does not apply, of course, to multiprocessors, so the access(2)/open(2) race should be as easy to win as ever on a multiprocessor. To test this hypothesis, we experimented with our dual-processor Solaris machine.
Against the classic access(2)/open(2) race, we observed 117,573 successful attacks out of one million attempts. Clearly, the access(2)/open(2) race is still a major threat for multiprocessors. With the widespread introduction of multi-/hyper-threaded CPUs, this risk may exist even on ``uniprocessor'' machines.
Even with the 10% success rate at k = 0, we did not feel we were able to tune the attacker accurately for k = 1. Intuitively, the difficulty is that we derive information for adjusting the DELAY2 and DELAY3 constants in the attacker (Figure 2) only in the cases when the attack would have succeeded, so little data is available for tuning. This data is swamped by other interleavings that produce indistinguishable behavior by the victim program. Out of hundreds of thousands of attempts with presumably imperfect delay tunings, there were no successful attacks with k = 1.
Multiprocessor Baseline Results Summary

Machine | k | Attempts | Successes
Solaris | 0 | 1,000,000 | 117,573
So far, we have seen that without strengthening, the access(2)/open(2) race is very hard to win on a modern uniprocessor, but easy to win on a multiprocessor. However, in either case, with even one round of strengthening, the attack success rate (observed to be 0%) is too low for us to make meaningful statements. To measure the effect of the strengthening, therefore, we need a more sensitive experiment, in which races are easier to win.
Returning to our Linux and FreeBSD uniprocessors, we inserted calls to nanosleep(2), specifying a delay of 1ns, into the setuid program. These calls have the effect of yielding the CPU at that point in the program, making the races easily winnable.
As a sanity check, we first inserted a single nanosleep(2) call after each access(2) and open(2) call in the setuid program. We then tuned the attacker with nanosleep(2) calls as well, and observed that we could attain near 100% success rates even for moderately large values of k. This corresponds to the case where an attacker is able to synchronize perfectly to the victim, making p = 1.
Next, we randomized the delays, as described in Section 4, by changing the delay code to the following:
    nanosleep(&onenano, NULL);
    if (random() & 01)
        nanosleep(&onenano, NULL);

Note that we are using a less randomized delay than recommended in Section 4: we always sleep for at least one nanosleep, to ensure that every race is winnable on our uniprocessors.
The table below summarizes the results for these experiments, and Figure 3 plots the data versus the theoretical model.
Strengthening with Randomized Nanosleeps

Machine | k | Attempts | Successes
Linux | 0 | 100,000 | 99,992
Linux | 1 | 100,000 | 43,479
Linux | 2 | 100,000 | 16,479
Linux | 3 | 100,000 | 5,931
Linux | 4 | 100,000 | 1,773
Linux | 5 | 100,000 | 550
FreeBSD | 0 | 100,000 | 99,962
FreeBSD | 1 | 100,000 | 43,495
FreeBSD | 2 | 100,000 | 16,766
FreeBSD | 3 | 100,000 | 5,598
FreeBSD | 4 | 100,000 | 1,786
FreeBSD | 5 | 100,000 | 548
Several features immediately stand out from the data. First, the almost identical numbers for Linux and FreeBSD show that our modified code has made the probability of winning races dependent on the randomized delays, rather than on details of the respective operating systems and machines. Second, the numbers show that the first race is almost 100% winnable. This is because our randomized delay is at least one nanosleep long, so the attacker knows it can wait one nanosleep and always win the first race (except for rare instances when the two processes start up extremely out of sync). Finally, as k grows, the ratio of successive success rates drops slightly. This occurs because the attacker was tuned for smaller values of k, and as k grows, the attacker gradually slips out of phase with the victim.
As we can see, even in this extreme case, where the unstrengthened access(2)/open(2) race is almost 100% insecure and every race is constructed to be easy to win, each successive round of strengthening provides a multiplicative improvement in security, as predicted by the theoretical model.
From a practical perspective, with realistic victim programs (that don't go to sleep to wait for attackers), we have observed p to be on the order of 10^-6 to 10^-1. This suggests that for k = 7, the probability of a successful attack should be below 10^-15. Given that running one million attacks takes on the order of tens of hours, a successful attack probability of 10^-15 should provide adequate security in most situations. As there are 8760 hours in a year, it is unlikely that even a cluster of 100 machines would remain running long enough to expect to see a successful attack. We note that the speed of this attack appears to scale with disk speed, rather than CPU speed. The relatively long duration of a trial, especially as compared to the evaluation of a hash function or block cipher, means that we can allow a somewhat higher probability of attack than would be acceptable in other settings.
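As a rough sanity check on this claim, with round numbers taken from the experiments above (the arithmetic is illustrative):

\[
\frac{10^{6}\ \text{attempts}}{13\ \text{h}} \approx 8 \times 10^{4}\ \text{attempts/h},
\qquad
8 \times 10^{4} \times 8760\ \text{h/yr} \times 100\ \text{machines} \approx 7 \times 10^{10}\ \text{attempts/yr},
\]

so at a success probability of 10^-15 per attempt, such a cluster would expect on the order of 10^-4 successes per year.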
Implementation details, as always, are critical to the security of a system using our algorithm. So far, we have presented a highly portable design. If one is willing to trade off portability for stronger security, a number of improvements can be made. These improvements will generally serve to decrease the possible number of context switches that could occur in the critical section, thereby decreasing worst case (real) execution time, and thereby narrowing the attacker's window. We will discuss these optimizations from most portable to least portable.
First, if the setuid program (victim) is running as root, it should raise its scheduling priority with a nice(2) or setpriority(2) call with a negative argument. This optimization appears to be completely portable.
Second, the virtual memory page(s) containing the code to implement our algorithm should be pinned into real memory. The mlock(2) call is a portable way of accomplishing this across all the operating systems discussed in this paper, although one needs to be careful to balance mlock(2) and munlock(2) calls correctly, as different operating systems have different semantics for nested calls. This optimization will prevent a page fault from occurring and giving the attacker's process a chance to run.
Third, on Linux and other systems that implement POSIX process scheduling, one can use the sched_setscheduler(2) call to elevate the process priority above what can be accomplished with nice(2) or setpriority(2). If the setuid program is running as root, it can use SCHED_FIFO with an appropriate priority to make sure that it will run whenever it is runnable.
These optimizations further reduce the probability of attack by making it harder for an attacker to win races. The first and third optimizations are redundant with each other, so we highly recommend using whichever one portability considerations allow. The second optimization is fairly portable, and is recommended wherever it applies.
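On a system that supports all three, the optimizations combine into something like the following sketch for a root process; the function name and priority values are illustrative, and error handling is omitted:

    #include <stddef.h>
    #include <sys/mman.h>
    #include <sys/resource.h>
    #include <sched.h>

    static void harden_for_critical_section(void *code_start, size_t code_len)
    {
        struct sched_param sp;

        /* 1. Raise conventional scheduling priority (most portable).
         *    Subsumed by step 3 where SCHED_FIFO is available. */
        setpriority(PRIO_PROCESS, 0, -20);

        /* 2. Pin the pages holding the critical code, so a page fault
         *    cannot hand the CPU to the attacker.  Balance with
         *    munlock(2), minding per-OS nesting semantics. */
        mlock(code_start, code_len);

        /* 3. Where POSIX real-time scheduling exists: under SCHED_FIFO
         *    we run whenever runnable, with no quantum-expiry
         *    preemption by ordinary processes. */
        sp.sched_priority = sched_get_priority_min(SCHED_FIFO);
        sched_setscheduler(0, SCHED_FIFO, &sp);
    }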
In general, and for multiprocessor machines in particular, the probabilistic security we have achieved appears to be all that is possible: another CPU can alter the file system concurrently with the victim program's actions. However, on a uniprocessor system, we are aware of only five ways for a process to yield the CPU on most Unix-like operating systems: it can make a system call that blocks; it can take a page fault; its scheduling quantum can expire (the timer interrupt); it can be preempted by a higher-priority runnable process; or it can explicitly sleep or yield the CPU.
The prototypical victim program that we experimented with has 15 instructions in user mode between the actual system calls (i.e., the int 0x80 instructions) that implement access(2) and open(2), when using GCC 2.95.3 and glibc 2.2.5. The time required to execute these 15 user-mode instructions has, of course, decreased dramatically as CPU speeds have increased. This helps prevent exploitation of the race in two ways: first, it gives the timer interrupt an ever-shrinking window of time in which to occur, and second, the victim program will be able to run at least one round of the strengthening protocol without interference from the timer interrupt.
This analysis appears to support a conjecture that uniprocessor machines running Linux 2.4.18, with the victim running as root (and therefore able to use SCHED_FIFO and mlock(2)), achieve deterministic security with only one round of strengthening. While this analysis is intellectually interesting, we strongly urge that it not be relied upon, as it depends on the code never being run on a multiprocessor (very difficult to ensure as systems evolve over time) and on undocumented behavior of a particular kernel version, which is always subject to change.