The following paper was originally published in the Proceedings of the USENIX 1996 Annual Technical Conference, San Diego, California, January 1996.

A Comparison of OS Extension Technologies

Christopher Small and Margo Seltzer
Harvard University
{chris,margo}@eecs.harvard.edu

This research was supported by Sun Microsystems Laboratories.

Abstract

The current trend in operating systems research is to allow applications to dynamically extend the kernel to improve application performance or extend functionality, but the most effective approach to extensibility remains unclear. Some systems use safe languages to permit code to be downloaded directly into the kernel; other systems provide in-kernel interpreters to execute extension code; still others use software techniques to ensure the safety of kernel extensions. The key characteristics that distinguish these systems are the philosophy behind extensibility and the technology used to implement extensibility. This paper presents a taxonomy of the types of extensions that might be desirable in an extensible operating system, evaluates the performance cost of various extension technologies currently being employed, and compares the cost of adding a kernel extension to the benefit of having the extension in the kernel. Our results show that compiled technologies (e.g. Modula-3 and software fault isolation) are good candidates for implementing general-purpose kernel extensions, but that the overhead of interpreted languages is sufficiently high that they are inappropriate for this use.

1 Introduction

The motivation that led to the emergence of microkernels in the early 80's leads now to the emergence of extensible operating systems. It is unreasonable to expect any one system to possess all of the functions needed by all applications. Rather, vendors of sophisticated applications, which require application-specific operating system customization, sell these extensions as well. These kernel extensions enable new functionality, but they also introduce the possibility of compromising system reliability. If an application consistently brings a system down, its additional functionality is hardly worthwhile. Therefore, one of the major challenges in supporting easily extensible operating systems is doing so in a manner that does not compromise the reliability of the underlying system.

Commercial systems have traditionally taken the route of supporting dynamically loadable extensions (e.g. device drivers, new subsystems, and new file systems). This is a practical approach, but it does not address the reliability issue; it is assumed that it is acceptable for the kernel to crash if the module has bugs, and that the module does not contain security violations. This scenario makes sense when a small group of trusted users are the only ones who can load such extensions, but a general-purpose facility must be more robust.
Several current research projects are developing more general-purpose extensible systems that provide reliability and security. The goal of this paper is to describe the assortment of technologies available and the types of extensions one might envision, and then to assess the performance trade-offs of the various choices. In Section 2 we discuss several of the extensible systems currently under development. Section 3 describes the different types of kernel extensions that might be useful. Section 4 discusses the various extension technologies, and Section 5 introduces our test cases and presents our performance results and analysis.

2 Related Work

If one chooses to ignore the potential reliability problems introduced by adding code to the kernel, the no-overhead solution is to allow unprotected patching. Microcomputer operating systems (e.g. MS-DOS and the Macintosh OS) allow arbitrary code to be patched into the kernel, providing no protection from misbehaved applications. On the other hand, because applications have complete control, it is much easier to manipulate and extend the kernel to provide for the needs of applications. It may be easier to write a real-time system on a microcomputer operating system than on a workstation operating system because the kernel lacks multitasking and protection.

The move towards safe extensible operating systems is an evolutionary step on the path from conventional monolithic kernels to microkernel architectures. The Mach microkernel [ACCE86] was designed to allow kernel functionality to be moved out of the kernel address space into user-level external servers in order to increase safety, robustness, and flexibility. However, the expense of frequent upcalls to user-level code motivated the current generation of microkernel-based systems, which link server code directly into the kernel address space [GUIL91]. The CMU Bridge project follows this model, placing servers in the kernel, but protects the kernel using software fault isolation techniques such as sandboxing [WAHBE93]. Measurements of sandboxing have shown it to have a much lower overhead than hardware protection mechanisms (on the order of a few percent).

The Exokernel project [ENGL95] also strives to reduce the number of user-kernel protection boundary crossings, but it takes a different approach. Rather than support safe downloading of code into the kernel, it moves as much functionality as possible from kernel to user level. In this reductionist approach to kernel design, kernel abstractions are thought to be the reason that systems have poor performance, hence they are removed from the kernel and placed into the application. System abstractions are available from user-level libraries, but applications control which abstractions they use. The kernel is left only with responsibility for controlling access to physical devices. There is little or no reason to extend the kernel; nearly all functionality is under the control of the application.

Along with adding functionality, extensible systems can provide applications with the ability to override policy decisions. Cao et al. [CAO94] motivate this sort of extension by examining the performance improvement achieved by allowing an application to control the buffer cache eviction policy. Their system did not allow applications to add new policy code to the kernel; rather, multiple policies were compiled into the kernel and an application chose among them.
This work showed the benefit of allowing applications to control policy; however, we believe that it is not possible to determine (and implement) all policies a priori; a more general extensibility mechanism is required.

The HiPEC system [LEE94] is similar to, but more flexible than, Cao's system. HiPEC allows applications to control VM caching policy using programs written in a simple, assembler-like, interpreted language designed specifically for the task of managing a queue of VM pages. The performance impact of executing a program in this language is low, but the expressiveness of the HiPEC language is limited (it has only 20 basic instructions). The language would have to be augmented if it were to be used for other applications.

Packet filters are used to demultiplex a stream of network packets by examining the contents of each packet header. Often, packet filters are implemented in a simple interpreted language (e.g. [MOGUL87, MCCAN93, YUHARA94]). A special language, designed to describe packet headers efficiently, is used to write packet filters. The performance of interpreted packet filters is close to that of compiled code, but, like HiPEC, the expressiveness is limited to the specific domain.

Instead of starting with a minimal language and extending it, the SPIN system [BERS95] is written in, and uses as its extension language, Modula-3 [NELS91]. Modula-3 is a strongly-typed, garbage-collected language, designed so that it is impossible for a program to have a "dangling pointer" to deleted data or to construct a pointer to an arbitrary memory location. Because of the design of the language, code in "safe" modules is not able to reach outside its bounds and violate the integrity of the program in which it is running.

The µChoices operating system [CAMP95] proposes using "a simple flexible scripting language similar to Tcl" to aggregate multiple kernel calls or remove control traffic between user level and kernel level. Although Tcl was not designed with this application in mind, if extensions are relatively small, the raw performance of the extension technology will be of little or no consequence.

Extension technology is also useful outside the kernel. Some database servers allow clients to load query- or datatype-specific code into the server to improve performance. The Thor database server uses a typesafe language designed for writing extensions [LISK95]; the Illustra database server is extended by writing DataBlades, which add support for new data types to the server [ILLU94]. The Illustra server does not currently protect itself from misbehaved DataBlade code, although Illustra is evaluating software fault isolation techniques.

The HotJava Web browser from Sun Microsystems can be extended with "applets" written in Java, a Modula-3-like language with a C++ syntax [GOSL95]. Java code is compiled to a machine-independent byte code that is downloaded by the HotJava browser and interpreted or compiled into native code on the fly. Like Modula-3, Java is designed to reduce or eliminate dangling or stray pointers and safety problems.

In this work, we examine the extension technologies being proposed by these systems: unsafe C (the DOS approach), C in user-level servers (the microkernel approach), software fault isolation, Modula-3, Java, and Tcl. Our goal is to understand the performance implications of the extension technology choice.

3 Graft Taxonomy

We refer to a kernel extension as a graft, and to the process of adding a graft to the running kernel as grafting.
We have found three motivations for grafting code into the operating system kernel:

o Policy: the application wants to control kernel policy, e.g. how it manages the buffer cache, VM cache, or process scheduling.

o Performance: the application wants to migrate portions of itself into the kernel in order to improve overall performance. This technique can save data copies between kernel and user space (e.g. when copying data from the disk to the network), and upcalls into user space (e.g. to handle a mouse event).

o Functionality: the application wants to add general functionality to the kernel, e.g. support for access control lists, automatically compressed files, or new communication models.

Although there are an unlimited number of ways in which a graft can be structured, we have identified three basic structures into which the implementation for most grafts will fall: Prioritization grafts, Stream grafts, and Black Box grafts.

3.1 Prioritization

There are numerous places in a kernel where one of a set of entities is selected. Choosing a victim is the central policy decision made by the virtual memory system (which page to evict), the buffer cache manager (which buffer to evict), and the process scheduler (which process to schedule next). We call this a Prioritization policy decision, and a graft that replaces prioritization policy code a Prioritization graft.

Normally, the VM system and the buffer cache use a least-recently-used (LRU) policy (or a variant thereof), although in some cases a different policy works better. An application might know that each block of a file will be read once, in order, and not read again, in which case it makes sense to use a most-recently-used (MRU) strategy for that file. As another example, while processing a query, a database system knows which blocks of the database will be needed soon and which blocks are no longer needed.

The access pattern of a set of pages or file blocks is often a good heuristic for deciding which pages or blocks will be used in the near future, but it is not infallible. For example, a garbage collector copies live objects from partially filled pages with the goal of decreasing fragmentation. After a collection, a page from which live objects have been taken has been accessed recently, but contains no useful data; on the other hand, pages that are full of live data have not been accessed recently. Using an LRU strategy, the former will be retained in memory and the latter will be evicted, which is the opposite of what is desired.

Process scheduling is another example of a prioritization policy. At each scheduling point the kernel has a list of candidates, and chooses one to run. No scheduling algorithm is appropriate for all application mixes; the demands of interactive applications differ from those of applications with real-time deadlines, and a scheduler optimized for one class will not satisfy the other. Processes may wish to be scheduled as a group; a client-server application may not want the server to be scheduled unless there is an outstanding client request, in which case it should be scheduled ahead of any client.

In general, a Prioritization graft is one that is presented with a list of options and must select the item of highest priority. It uses some internally defined weighting function to choose a candidate. The weighting function may use just the information in the list, or it may have access to some data from the application (e.g. which blocks or pages will not be needed again).
Our Prioritization graft benchmark models a VM page eviction policy. We assume that the kernel maintains a list of pages in LRU order; when it comes time to evict a page, the kernel normally chooses the page at the head of the LRU queue as its candidate. In our model, instead of immediately evicting the candidate, the kernel determines which process owns the candidate page and allows the process to offer one of its other resident pages for eviction. Our model application keeps a "hot list" of pages that will be needed in the near future. The goal of the page eviction graft is to ensure that none of these pages are evicted.

The page eviction graft receives a pointer to the head of the LRU queue. The graft checks to see if the candidate (the head of the queue) is on the application's hot list; if it is not, it accepts the kernel's candidate. If the candidate is on the hot list, the graft searches through the queue for an acceptable page that is not on the application's hot list. This page is returned to the kernel.

For this benchmark we model a TPC-B transaction processing benchmark database [TPCB90]. The database holds 1,000,000 records in a four-level b-tree; the b-tree has approximately 400 internal pages (1.6MB) and 50,000 data pages (200MB). The b-tree is 50% full, and has one root page, four pages at the second level, 391 pages at the third level, and approximately 50,000 pages at the fourth level; each third-level page in the b-tree points to up to 128 fourth-level pages.

The server accesses the database by mapping it into its virtual address space. When the server does a non-keyed lookup, it traverses the b-tree in depth-first order, starting from the root. When it reaches a third-level page it knows which 128 fourth-level pages it will access next, hence which of the memory-mapped pages of the database should not be paged out. During such a search, it constructs a hot list of these 128 pages. If a page fault occurs during a search, our graft is called with the LRU chain head, and it uses the hot list to find an eviction candidate that is acceptable to the application. As each page is processed, its entry is removed from the hot list, so as the simulation runs, the hot list grows shorter. We presume that the kernel keeps track of candidate pages and graft-proposed alternates, as in Cao's system [CAO94], to ensure that an application does not manipulate the VM system to gain more physical memory than it would receive under the default strategy.

This test is not particularly compute-intensive. Instead, it is sensitive to the overhead associated with traversing a list of items. If the extension technology requires extensive pointer checking, or does not support traversal of linked lists, this test will highlight it. Because there are 50,000 data pages in the database, there is a low probability that a page the application will need is already in the cache (roughly 64/50,000, or once every 781 times). However, if one of the pages is in memory, we want to ensure that it is not evicted. For the graft to be successful, the overhead of checking on each eviction should be low relative to the cost of evicting and then re-faulting a page. Our test computes the break-even point, showing how often the graft will need to save an eviction in order to pay for its per-eviction cost.
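To make the structure of this graft concrete, the following C sketch shows the shape of the list traversal that our benchmark times. The kernel interface it assumes (the lru_page structure and the hot_list_contains predicate) is hypothetical and is meant only for illustration.

    /*
     * A minimal sketch of the page eviction graft, assuming a
     * hypothetical kernel interface: the kernel hands the graft the
     * head of its LRU queue and the graft returns the page it is
     * willing to give up.
     */
    struct lru_page {
        unsigned long    pageno;   /* page number of this resident page */
        struct lru_page *next;     /* next page in LRU order */
    };

    /* Application-supplied predicate: is this page on the hot list? */
    extern int hot_list_contains(unsigned long pageno);

    /*
     * Return the eviction candidate.  Accept the kernel's candidate
     * (the LRU head) unless it is on the hot list; otherwise walk the
     * queue for the first page the application can spare.
     */
    struct lru_page *choose_victim(struct lru_page *lru_head)
    {
        struct lru_page *p;

        if (!hot_list_contains(lru_head->pageno))
            return lru_head;           /* common case: accept candidate */
        for (p = lru_head->next; p != NULL; p = p->next)
            if (!hot_list_contains(p->pageno))
                return p;              /* first page not on the hot list */
        return lru_head;               /* everything is hot: defer to kernel */
    }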
3.2 Stream

Our second graft model is based on the idea of Unix filters and pipes. Frequently, it is useful to apply a set of filters to a data stream. For example, we might want the kernel to transparently compress a file when it is written and decompress it when it is read, or automatically encrypt a file when written and decrypt it when read by the appropriate user. For security reasons we might want to compute a secure checksum, or fingerprint, of an executable when it is loaded, to verify that it has not been compromised by a virus.

A stream graft is such a filter. It consists of filtering code that is inserted into a data stream, normally between the storage system and application level. A journaling file system is one that accepts a stream of I/O requests, saves a journal of the metadata changes, and passes along the original requests. A standard filesystem could be transformed into a journaling filesystem by inserting into the request stream a graft that journals the changes made to the metadata.

The Stream Input-Output System of UNIX [RITCH84] decomposed the character I/O system of UNIX into a set of filters. Network and terminal protocols were built up by linking filters into chains. Characters read from (or sent to) a device were passed to the first filter in the chain; each filter processed the characters in its input queue and moved them on to the next filter in the chain. This mechanism was used not only to handle erase and kill processing, but also to create pseudo-devices, such as the virtual displays and keyboards used to construct multiple virtual terminals from a single terminal.

Another example of a stream graft is one that takes a data source (e.g. the disk) and, rather than modifying it on the way to application level, writes it elsewhere (e.g. the network). Recent research into fast path connections, such as the x-kernel work at Arizona [DRUS93], the video server benchmark of the SPIN operating system [BERS95], and Fall's work in decreasing I/O time through the use of in-kernel copying [FALL93], shows that there is a substantial performance gain from saving copies to and from user level. A stream graft that takes its input and directs it to an output connection, perhaps after transforming the data, could be used to build this type of fast path connection.

Our representative stream graft is an implementation of the MD5 Message-Digest Algorithm [RFC1321], which produces a 128-bit fingerprint of a file. The MD5 fingerprint is both expensive to compute and computationally infeasible to forge. MD5 is useful for ensuring that a file has not been tampered with; a change to the contents of the file will result in a change to the fingerprint. If the fingerprint is kept separate from the file (say, on a small amount of safe media, such as a read-only floppy disk), a change to the file can be detected by computing its MD5 fingerprint and comparing it to the saved fingerprint.

Like many compression and encryption algorithms, MD5 is stream-based. The algorithm maintains a small amount of state as it processes the data; unlike compression and encryption algorithms, the data output is the same as the input. When the algorithm completes, the graft can be queried for the fingerprint, and the computed fingerprint can then be compared to the saved fingerprint.

If the MD5 fingerprint can be computed as quickly as data can be read from the disk, the time spent in the MD5 code can be overlapped with I/O activity. However, if it takes longer to compute a fingerprint than it takes to read the data from the disk, the overall processing time will increase. Our test measures whether a graft written in a given technology can keep up with the disk.
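The following C sketch shows how such a graft might be structured around the RFC 1321 reference implementation (whose MD5Init, MD5Update, and MD5Final routines we used in our tests). The three graft entry points are hypothetical; a real kernel would define its own stream-filter interface.

    /*
     * Sketch of an MD5 stream graft built on the RFC 1321 reference
     * implementation.  The graft_* entry points are illustrative only.
     */
    #include <string.h>
    #include "md5.h"            /* RFC 1321 reference implementation */

    static MD5_CTX ctx;         /* per-stream digest state */

    void graft_open(void)
    {
        MD5Init(&ctx);
    }

    /* Called on each buffer as it flows by; the data passes through
     * unmodified, so the filter only accumulates digest state. */
    void graft_data(unsigned char *buf, unsigned int len)
    {
        MD5Update(&ctx, buf, len);
    }

    /* Called at end of stream; the fingerprint is compared against
     * the one saved on trusted media. */
    int graft_close(const unsigned char saved[16])
    {
        unsigned char digest[16];

        MD5Final(digest, &ctx);
        return memcmp(digest, saved, 16) == 0;   /* 1 if file is intact */
    }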
3.3 Black Box

The Black Box graft structure is more general than that of a Prioritization or Stream graft. A Black Box graft has some number of inputs, some state, and a single output. We see it operating as a "black box" function, normally producing a single output value. For example, at the center of the code that implements Access Control Lists is a small database that (at an abstract level) accepts a triple containing a file access request, a user ID, and a file ID, and responds "yes" or "no."

File system read-ahead code, which determines how many (and which) blocks of a file to prefetch, is another example of a "black box" function. If the application knows ahead of time the order in which blocks of a file will be read, the kernel can use this information to make read-ahead decisions. In some cases, an application will read a subset of the blocks of a file in order, and then skip to another region of the file. If the kernel uses heuristics (rather than application knowledge) to choose a read-ahead policy, it can not cope with arbitrary application behavior. With the cooperation of the application, it can make more appropriate read-ahead decisions.

A Logical Disk facility (LD) [DEJON93] sits between the filesystem and the physical disk. The filesystem reads and writes logical blocks, and the LD maps the logical requests to locations on the physical disk. The LD can be used to transparently replicate data, by writing it in multiple places on the same disk or on multiple disks, and to speed write performance, by writing logically discontiguous blocks on a physically contiguous region. A log-structured file system [ROSE91] can be implemented using a logical disk facility; the filesystem lays out blocks as it sees fit, and the Logical Disk reorders and buffers writes to improve write performance.

Our black box test application is a simple logical disk facility that converts random writes to sequential writes. It accepts block write requests, batches them into physical segments, and maintains a mapping from logical block numbers to physical block numbers. If the savings in I/O time due to batching is greater than the cost of translating logical block numbers on each read and write, the graft is effective.
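A minimal C sketch of the bookkeeping such a graft performs appears below. The sizes match the simulation described in Section 5.6; the flat map layout and the I/O helpers are illustrative rather than taken from any real logical disk implementation.

    /*
     * Sketch of the black box Logical Disk graft: batch incoming
     * logical writes into the currently open segment and record the
     * logical-to-physical mapping.
     */
    #define BLOCKS_PER_SEG 16
    #define NBLOCKS        262144            /* 1GB disk / 4KB blocks */

    static unsigned long block_map[NBLOCKS]; /* logical -> physical */
    static unsigned long cur_seg_base;       /* physical base of open segment */
    static int           seg_fill;           /* blocks used in open segment */

    extern unsigned long alloc_segment(void);            /* hypothetical */
    extern void write_segment(unsigned long phys_base);  /* hypothetical */

    void ld_write(unsigned long logical_block)
    {
        if (seg_fill == 0)
            cur_seg_base = alloc_segment();

        /* Place the block at the next free slot in the open segment. */
        block_map[logical_block] = cur_seg_base + seg_fill++;

        if (seg_fill == BLOCKS_PER_SEG) {    /* segment full: one large write */
            write_segment(cur_seg_base);
            seg_fill = 0;
        }
    }

    unsigned long ld_lookup(unsigned long logical_block)
    {
        return block_map[logical_block];     /* translation on each read */
    }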
4 Extension Technologies

Each extension technology offers a different level of safety and imposes a different level of trust. An extension written in an unsafe language (e.g. C) can read, write, or jump to any location in the address space (using pointer arithmetic); one written in a safe language (e.g. Modula-3 or Java) is restricted to code addresses exported to the extension. In either case, we need a mechanism to ensure that extension code does not monopolize the CPU; we must be able to preempt an extension that runs too long. This means that extension code should not be able to disable interrupts.

For an operating system extension, read protection is important, even if it is not required for reasons of data privacy. Because an extension running in the kernel has access to memory-mapped devices, an extension can destructively read a device register.

Once safety is ensured, the performance of a technology determines its suitability for different applications. The overhead of a graft intended to increase performance should not cause performance to degrade. We also need to make sure that an extension technology is sufficiently expressive, so that it is possible to implement the grafts we want to write. A small, specialized language designed around a single problem domain may perform better than a general-purpose language, but will be difficult to reuse in other domains.

The size of the runtime environment can also have an impact on overall performance. The runtime code size of support environments ranges from tens of kilobytes to several megabytes. If we require that the support environment run in the kernel and remain resident, any memory used by the environment is memory unavailable for other uses.

The extension technologies we examine fall into three basic trust models: hardware protection, software protection, and interpretation.

4.1 Hardware Protection

The simplest, and perhaps most dependable, method for ensuring the safety of the kernel is to place extensions outside the kernel's address space in order to take advantage of the protection offered by the hardware. When the kernel wants to run an extension, it upcalls into user-level code; when the code returns, the kernel continues. Along with memory protection, the kernel can time-slice the extension to ensure that it does not monopolize the CPU. If the extension runs too long, the kernel can abort it and carry on without it. The primary disadvantage of this model is the cost associated with an upcall; for small extensions, this per-invocation overhead can be much larger than the cost of running the extension.

4.2 Software Protection

Using software protection, we place extension code in the same address space as the kernel, but restrict the instructions that are evaluated by the extension. By controlling the language (e.g. Modula-3), or the compiler, or by patching the binary code [WAHBE93], we can control the instructions evaluated by the extension. This allows us to ensure that the extension does not read or write outside its bounds, or jump to arbitrary kernel code. In addition, we can ensure that it does not issue instructions that disable interrupts or that initiate I/O operations.

Software protection can offer the highest performance of the three options, but requires control of, and trust in, the language translation tools used to process the extension source. The security of the kernel is a function of the correctness of the generated code; if the translation tool can be coaxed into generating unsafe code, the security of the kernel can not be guaranteed. If, for example, under certain circumstances the array bounds checking of a compiler can be subverted, it may be possible to write code that overwrites its stack and circumvents the kernel's security mechanisms.

Even assuming that the translation tools work correctly, the kernel still needs to verify that code loaded into the kernel was in fact processed by a trusted language tool. This can be accomplished by performing the translation at load time, or by marking the code (say, with a cryptographic checksum) at the time it is generated. The former technique imposes a high cost at extension load time; the latter implies that we are able to embed a secret key of some sort in the language tool and mark the generated code with this key. Methods for generating cryptographic checksums are well known; the social issues of key distribution and management are outside the scope of this paper.

The goal of software fault isolation (SFI) is to make it more efficient to ensure the validity of memory references using software than by using hardware. One type of SFI, sandboxing, ensures that the high bits of a memory address match those of the sandbox region assigned to the function or module. In this way, a module can, at worst, overwrite its own data with a stray pointer or jump to locations in its own code. Sandboxing can be done at compile time or performed as a post-processing phase on object files, allowing separation of the sandboxing tools from the compiler. At load time, a linear-time algorithm can be used to guarantee that all memory references in a piece of object code have been correctly sandboxed.
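Written out in C for clarity, the transformation a sandboxing tool applies to a store looks roughly as follows; a real SFI tool inserts the equivalent instructions into the object code, and the segment constants here are invented for illustration.

    /* High bits that identify a segment, and the module's assigned
     * data segment; both values are made up for this example. */
    #define SEG_MASK 0xfff00000UL
    #define SEG_BASE 0x42100000UL

    /* An unsafe store through an arbitrary pointer... */
    void unsafe_store(unsigned long *p, unsigned long v)
    {
        *p = v;
    }

    /* ...and its sandboxed equivalent: force the high bits of the
     * address into the module's segment before dereferencing.  A
     * stray pointer can now corrupt only the module's own data. */
    void sandboxed_store(unsigned long *p, unsigned long v)
    {
        p = (unsigned long *)(((unsigned long)p & ~SEG_MASK) | SEG_BASE);
        *p = v;
    }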
Omniware C++ [COLU95] is a commercial system that includes a compiler that generates machine-independent code. A run-time system translates the machine-independent code into native code with software fault isolation instructions and links the code into the executable.

Many modern languages, such as Modula-3 and ML, do not suffer from the safety problems of C. In a typesafe and pointer-safe language it is impossible to construct a pointer to an arbitrary memory location, or to be left with a "dangling" pointer to deallocated memory. If type casting is available, it is combined with compile-time analysis or a runtime type check to ensure that the cast is valid. Similarly, array bounds are checked on access to ensure that a program does not access memory outside the array.

Although the definition of a language may ensure type and pointer safety, the language tools written to compile and run the language may not correctly implement the definition. It is infeasible (using present technology) to verify the correctness of a Modula-3 compiler; because of this, we can not be sure that a compiler will not generate incorrect code. (In fact, while running our VM Page Eviction benchmark, we found a bug of this type in the Modula-3 compiler.)

4.3 Interpretation

Instead of depending on the safety and security of a compiler, we can create a virtual machine, an interpreter, to run grafts. The source language (which could be C, Modula-3, or anything else) would be compiled to intermediate or machine code for a real or virtual machine. The kernel would include an interpreter to run the code, ensuring its safety. This model allows complete control over the behavior of the extension, because only safe operations are implemented in the interpreter. The intermediate code could also be used as the input to a runtime code generator.

A reasonably fast interpreter runs 10 to 100 times more slowly than compiled code [MAY87]; however, using incremental code generation techniques, performance can approach that of compiled code [HOLZE94]. (Note that there is a flexible line between generating native code at load time, as above, and dynamically generating native code from interpreted code.)

Java is compiled to a compact byte code for the Java Virtual Machine. As interpreters go, the Java interpreter is fairly fast. In addition, Sun plans to release a runtime code generator for Java in the near future. The authors of Java expect that compiled Java will run at about the same speed as compiled C or C++ ([GOSL95], p. 48), which, based on current technology, is believable. The currently available Java system is Alpha-release software. Versions for Sparc/Solaris and Windows are available from Sun (on java.sun.com); Java has been ported to HP-UX and UnixWare (by OSF) and Linux (found on java.blackdown.org).

Another technique for building an interpreted language is not to transform the source into an intermediate format, but rather to interpret it directly.
This technique, used in awk, sh, and Tcl, leads to a smaller start-up time, with a higher overhead per statement. Because source-interpreted scripting languages are immensely popular, and have been proposed as a vehicle for writing grafts [CAMP95], we include Tcl as one of our tested technologies.

5 Performance Analysis

Given the wide range of extension technologies available, it is not obvious which is "best" in any dimension. In fact, extensible systems are being built that employ nearly every technology described. In this section, we measure and analyze the performance of our sample extensions.

5.1 Hardware

We ran tests on four hardware platforms: three commercial workstations and one Intel x86 "PC".

o Alpha: DEC AlphaStation 400 4/233, running DEC OSF/1 v3.2A, 64MB of memory.

o HP-UX: HP PA-RISC 9000/735 running HP-UX A.09.03, 80MB of memory.

o Linux: "PC"-class 90MHz Pentium, running Linux 1.1.95, 16MB of memory.

o Solaris: Sun SPARCstation 20 running SunOS 5.4, 128MB of memory.

5.2 Extension Technologies

We implemented our grafts on five different extension technologies. Not all tests were run on all platforms; for example, the current release of the Omniware compiler runs only on Solaris.

o C: compiled with gcc -O (version 2.6.3 on HP-UX, 2.7.0 on all other platforms).

o Java: release Alpha 3.

o Modula-3: version 3.5.3 of DEC SRC Modula-3.

o Omniware: compiled with Colusa's omniC++ compiler, version 1.0 beta, release 1.5. (This commercially available compiler generates machine-independent code that is translated to native, fault-isolated code at runtime. This pre-release version of the compiler supports write and jump protection, but no read protection, and does not include an optimizer for the SFI instructions.)

o Tcl: Tcl version 3.7.

5.3 Upcall Overhead

As a baseline for comparison, we implemented each benchmark in C. Since C is not a safe language, we estimated the cost of implementing the benchmark as a user-level server by determining the break-even point as a function of the time required to upcall to such a server; if an upcall is free (takes no time), the performance of a system written using user-level servers is equal to that of one with unsafe C linked into the kernel. As the cost of an upcall increases, the performance of a system written using user-level servers will decrease.

When the kernel makes an upcall to a user-level server, it pushes a new call frame on the server's stack and switches to the server. When the server is done handling the upcall, it returns to the kernel. This process is similar to (but simpler than) the one followed by the kernel when it posts a signal to a process. First, the kernel pushes a call frame for the signal handler on the application's stack, then schedules the application process. When the process returns from the signal handler, it re-enters the kernel. The kernel cleans up the state of the process and returns it to the point where the signal was handled.

To give a feeling for the cost of an upcall on each platform, we measured the time required to handle a signal sent to a process. The test program forks a child process, which registers handlers for a group of twenty signals and then suspends itself (by sending itself SIGTSTP). When the parent is notified that the child is suspended, it posts the handled signals to the child, then wakes it (by sending it SIGCONT). The child awakes, handles the signals, and suspends itself again. When the parent is notified that the child has once again been suspended, it knows that all signals have been handled. We then measured the time to post the signals to the child when the child ignores (rather than handles) the group of signals. The latter time is subtracted from the former; the result is divided by the number of signals handled, which gives an estimate of the time required to handle a single signal.
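A simplified C sketch of this measurement appears below. For brevity it omits error handling, the ignore-signals control run, and the repetition (thirty runs of 1000 iterations each) used to produce the reported means; the particular set of twenty signals is illustrative.

    #include <signal.h>
    #include <stdio.h>
    #include <sys/time.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define NSIGS 20
    static const int sigs[NSIGS] = {
        SIGHUP, SIGINT, SIGQUIT, SIGTRAP, SIGPIPE, SIGALRM, SIGTERM,
        SIGURG, SIGCHLD, SIGTTIN, SIGTTOU, SIGIO, SIGXCPU, SIGXFSZ,
        SIGVTALRM, SIGPROF, SIGWINCH, SIGUSR1, SIGUSR2, SIGSYS
    };

    static void handler(int sig) { /* empty: we time delivery only */ }

    int main(void)
    {
        struct timeval t0, t1;
        int i, status;
        pid_t pid = fork();

        if (pid == 0) {                     /* child */
            for (i = 0; i < NSIGS; i++)
                signal(sigs[i], handler);
            for (;;)
                kill(getpid(), SIGTSTP);    /* suspend; resume on SIGCONT */
        }
        waitpid(pid, &status, WUNTRACED);   /* wait for child to stop */
        gettimeofday(&t0, NULL);
        for (i = 0; i < NSIGS; i++)         /* post the handled signals */
            kill(pid, sigs[i]);
        kill(pid, SIGCONT);                 /* wake the child */
        waitpid(pid, &status, WUNTRACED);   /* it handles them, stops again */
        gettimeofday(&t1, NULL);
        printf("%.1f us per signal\n",
               ((t1.tv_sec - t0.tv_sec) * 1e6 +
                (t1.tv_usec - t0.tv_usec)) / NSIGS);
        kill(pid, SIGKILL);
        return 0;
    }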
The results of this test are shown in Table 1.

    Platform    Signal Handling Time
    Alpha       19.5µs (7.5%)
    HP-UX       25.8µs (1.4%)
    Linux       55.9µs (0.1%)
    Solaris     40.3µs (3.8%)

Table 1. Signal Handling Time: We measure the time required to send twenty signals to a child process that handles the signals, then subtract the time required to send twenty signals to a child process that ignores the signals. The difference is divided by the number of signals to give a per-signal handling time. Each time is the mean of thirty runs of 1000 iterations each (standard deviations in parentheses).

We implemented and measured the performance of a simple upcall mechanism on BSD/OS 2.0. On a 486-DX266 we measured a signal handling time of 63.1µs, and an upcall time of 37.2µs (about 40% quicker). These upcall estimates are fairly conservative. Work done to improve exception handling times [THEK94] and cross-domain procedure calls [FORD94] leads us to believe that it is feasible to implement an upcall mechanism that takes less than one fourth the time we measured for signal delivery.

5.4 VM Page Eviction

As described in Section 3.1, our sample Prioritization graft takes the LRU-ordered list of page eviction candidates and returns a candidate not on its hot list. We assume that the application keeps the hot list (of active pages) in its memory at a known location, so that it is accessible to the graft. In addition, we assume that the application keeps the hot list in a form that can be easily traversed by the graft; for example, the C graft searches a linked list of structs, while the Modula-3 graft searches a linked list of Modula-3 RECORDs.

In our model application, the hot list starts out with 128 entries (the number of data pages referenced by a level-three internal page), and as each page is faulted, the application removes that page from the hot list. On average, the hot list contains 64 pages, which is the number we simulate. We determined the time to check the 64-element hot list in each of the supported technologies, and present that time in Table 2. To give an intuitive feeling for the performance of each technology, we also normalized each time relative to unsafe C code.

To determine the break-even point for this graft, we measured the page fault time on each of our test platforms (Table 3). The times listed indicate how long it takes to handle a page fault event, not how long it takes to bring in a single faulted page; Alpha and HP-UX bring in multiple disk pages on each fault. We assume that the systems are performing read-ahead in order to take advantage of (expected) locality of reference. However, our model database server would not be able to take advantage of this behavior, as the faulted data pages are scattered throughout the database. (The page fault read-ahead policy exhibited here is an obvious candidate for grafting; if we were able to control how many pages the system brought in on a fault, we could reduce the per-fault time.)

Once we have computed the page fault time, we can determine the break-even point for this graft.
We divide the page fault time by the time required to run the graft; the result is the number of times we can run the graft for each page eviction saved and still be ahead of the game. Remember that our model application would find a page to save, on average, once every 781 invocations. If the break-even point is less than this, our model application would not benefit from this graft. Worse yet, if the number is less than one, the time to run the graft is greater than the page fault time, hence under no circumstances would the graft be beneficial.

In Table 2 we see that on Solaris the relative times of Omniware and Modula-3 are quite close, each running about 40% slower than unprotected C. (Remember, however, that this version of Omniware does not include read protection, which gives it a performance advantage over Modula-3.) On Alpha and HP-UX the Modula-3 code runs 10% slower than the C code, which is not a significant difference for this test.

    Platform             C             Java          Modula-3      Omniware
    Alpha    raw         2.9µs(0.2%)   N.A.          3.2µs(1.5%)   N.A.
             normalized  1.0           N.A.          1.1           N.A.
             break-even  8655          N.A.          7843          N.A.
    HP-UX    raw         6.0µs(0.4%)   159µs(0.8%)   6.8µs(1.7%)   N.A.
             normalized  1.0           26.5          1.1           N.A.
             break-even  2983          113           2632          N.A.
    Linux    raw         3.7µs(0.1%)   237µs(0.1%)   9.1µs(0.1%)   N.A.
             normalized  1.0           64            2.5           N.A.
             break-even  1270          20            516           N.A.
    Solaris  raw         4.5µs(0.1%)   141µs(0.3%)   6.3µs(2.8%)   6.3µs(0.2%)
             normalized  1.0           31.3          1.4           1.4
             break-even  1533          49            1095          1095

Table 2. VM Page Eviction Test: We measure the mean time required to search a 64-element "hot list" of page numbers. Raw times and times normalized to unprotected C code (on the same platform) are given. The break-even point is the number of times the graft can run in the time it takes to handle a page fault. Each time is the mean of 30 runs of 100,000 searches each (standard deviations in parentheses).

    Platform   Fault Time     Num Pages
    Alpha      25.1ms(5.0%)   16
    HP-UX      17.9ms(0.8%)   4
    Linux      4.7ms(0.5%)    1
    Solaris    6.9ms(3.2%)    1

Table 3. Page Fault Time: Measured using lmbench (standard deviations in parentheses). Alpha and HP-UX bring in more than one disk page on a fault, performing read-ahead, even though the test performs random accesses to memory.

On Linux, we see a 150% slowdown for Modula-3, a greater difference than we see on other platforms and with other tests. Examining the code generated by the Modula-3 compiler, we found that it includes a runtime check against NIL (location zero) on each pointer access. The code generated on the other platforms (Solaris, Alpha, and HP-UX) does not include explicit NIL checks. The Modula-3 language specification [NELS91, p. 50] states that dereferencing NIL should cause a runtime error; on Solaris and Alpha, dereferencing location zero causes a segmentation violation, which is trapped by the Modula-3 runtime system. This is not the case on Linux, so the runtime check is needed. (We found that although no runtime checks are generated by the HP-UX version of the Modula-3 compiler, dereferencing location zero does not cause a segmentation violation. This appears to be a bug in the Modula-3 compiler.) For our purposes (i.e., operating system extensions), the Alpha and Solaris slowdowns are the more appropriate comparison: because we can ensure that dereferencing location zero causes a fault, we would not need runtime NIL checks. (This fault would not be handled by a signal mechanism in the kernel, but would instead require some support by the normal kernel fault logic.)

On Solaris, the Java code runs at about 1/30th the speed of compiled C code, and about 1/20th that of Modula-3. On this platform we see that the break-even point for Modula-3 is about 1100 pages; for Java, the break-even point is about 50 pages, too low to benefit our model application.
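The break-even arithmetic, written out in C using the Solaris numbers from Tables 2 and 3 (the helper itself is ours, added for illustration):

    #include <stdio.h>

    int main(void)
    {
        double fault_us = 6900.0;   /* Solaris page fault time, 6.9ms */
        double graft_us = 6.3;      /* Solaris Modula-3 graft time, 6.3µs */

        /* The graft may run this many times per eviction it saves
         * before it costs more than the fault it avoids. */
        printf("break-even: %.0f runs per saved fault\n",
               fault_us / graft_us);            /* prints 1095 */

        /* Our model application saves a page once per 781 invocations,
         * so a break-even of 1095 > 781 means the graft pays off. */
        return 0;
    }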
<< Figure 1 not included >>

Figure 1. Break-Even vs. Upcall Time: The break-even point for the VM Page Eviction test. Break-even is inversely proportional to the upcall time. The break-even points for Modula-3 and Omniware are included, showing that a sub-10µs upcall time is needed for user-level servers to compete with compiled, downloaded code here.

To compare these results to the overhead of performing an upcall on each eviction, we computed the break-even point as a function of upcall time, allowing the upcall time to range from 0 to 50µs (see Figure 1). When compared with the break-even points for Modula-3 and Omniware, we find that we would need an upcall time on the order of 5µs for upcalls to compete with the compiled extension technologies, which would be difficult to achieve.

Unsurprisingly, we found that Tcl is not the appropriate tool for this job. Measurements of Tcl showed that it was four orders of magnitude slower than the compiled languages (C and Modula-3), taking 40ms on Solaris, as compared with the C version, which took 4.5µs. With a break-even point close to or less than one (i.e. the Tcl code would have to save a page each time it was called in order not to degrade performance), Tcl is not competitive here with the other technologies.

5.5 MD5 Fingerprinting

We took the standard C implementation of the MD5 algorithm (found in RFC 1321) and modified or reimplemented it for each of our test platforms. The MD5 algorithm makes heavy use of array access and unsigned 32-bit arithmetic. The standard implementation takes advantage of C's behavior of silently ignoring numeric overflow by performing computation modulo 2^32. The code worked correctly in C on all platforms, and was easily translated into Java and Modula-3.

This test is very compute-intensive. It uses almost none of the support facilities of the extension technology (e.g. no data types other than arrays), but instead gives an indication of the cost of performing array access and numerical computation. The results are shown in Table 5.

Because the cost of computing the fingerprint is so high, the cost of an upcall is comparatively low. If we assume that there would be one upcall to the user-level extension for every 64KB read from disk, we would need to add the cost of 1MB/64KB = 16 upcalls. Assuming a (pessimistic) upcall time of 50µs, this would add 0.8ms to the raw C time, which is not significant when added to compute times of 150ms to 800ms.

To compute a break-even point, we measured the write bandwidth of the disk on each system (Table 4). This measurement was used to estimate the time required to read 1MB of data. If the time needed to compute the checksum of 1MB of data is less than that required to read the data from the disk, the computation time can be hidden by I/O time. (This optimistically assumes that the disk read requires no CPU cycles, and is a best-case break-even calculation.)

    Platform   Bandwidth (KB/s)   1MB access time
    Alpha      4364 (1.2%)        235ms
    HP-UX      1855 (13%)         552ms
    Linux      1694 (5.7%)        604ms
    Solaris    3126 (11%)         320ms

Table 4. Disk I/O Time: Write bandwidth in KB/s on each platform, measured using lmbench. From this, the time to access 1MB of data is computed. Each time is the mean of 30 runs (standard deviations in parentheses).
    Platform             C             Java            Modula-3      Omniware
    Alpha    raw         159ms(1.8%)   N.A.            207ms(0.4%)   N.A.
             normalized  1.0           N.A.            1.3           N.A.
             MD5/disk    0.67          N.A.            0.9           N.A.
    HP-UX    raw         239ms(1.6%)   23987ms(2.5%)   352ms(0.3%)   N.A.
             normalized  1.0           100             1.5           N.A.
             MD5/disk    0.43          43              0.64          N.A.
    Linux    raw         202ms(0.3%)   22887ms(1.0%)   387ms(0.1%)   N.A.
             normalized  1.0           113             1.9           N.A.
             MD5/disk    0.33          38              0.64          N.A.
    Solaris  raw         146ms(1.7%)   10368ms(0.3%)   294ms(0.1%)   219ms(0%)
             normalized  1.0           71              2.0           1.5
             MD5/disk    0.46          32              0.92          0.68

Table 5. MD5 Fingerprinting: Mean time required to compute the MD5 fingerprint of 1MB of data. The time is compared to the time needed to read 1MB from the disk. If this number is less than one, the computation of the fingerprint can be overlapped with I/O. If it is greater than one, computing the fingerprint will decrease throughput. The mean of 30 runs is reported (standard deviations in parentheses).

The Omniware code is faster than the Modula-3 code, but slower than compiled C. To see if the size of the test was affecting the results, we ran a larger test (64MB) and saw a similar overhead (14480ms for Omniware vs. 9498ms for C, an overhead of 50%). We believe that this overhead is representative of this implementation of Omniware for MD5. (Once again, this version of Omniware does not include read protection, which gives it a performance advantage over Modula-3.)

Unlike C, Modula-3 does not ignore arithmetic overflow. The Word package supports computation modulo the native word size, so on the 32-bit platforms (HP-UX, Linux, and Solaris) we were able to implement MD5 efficiently. On the 64-bit Alpha, the Word package performs computation modulo 2^64, which produces incorrect results for MD5. We implemented two versions of MD5 on the Alpha: one uses 64-bit integers and the Word package, doing roughly the same amount of work as the 32-bit implementations (but computing an incorrect checksum); the other uses 32-bit integers, which produces the correct checksum, but generates more instructions (by a factor of four). For our performance measurements we used the 64-bit version, which gives a more accurate comparison of the relative performance of the technologies. We found that the 32-bit version took approximately ten times as long to run as the 64-bit version. (We see this as an artifact of the compiler implementation, and not a characteristic of the Alpha processor or of Modula-3.)

The Modula-3 code runs at from 1/2 to 3/4 the speed of compiled C. There is no reason for the numerical computation to be slower; we attribute the difference to run-time array bounds checking. On all platforms, the time required to compute the fingerprint was less than the time to read the data from the disk, so a Modula-3 implementation of MD5 could keep up with disk access and overlap its computation with I/O time.

On the other hand, we found that neither Java nor Tcl was able to keep up with the disk. The Java code, at best 1/30th disk speed, seems unlikely to be used for this application. And, as above, we found that our Tcl implementation was four orders of magnitude slower than one written in a compiled language, and hence too slow for this type of graft. (On Solaris it took 50 minutes to complete, as compared with 1.9 seconds for the C code.)

5.6 Logical Disk

Our simulation models a logical disk designed to support a log-structured layer between a filesystem and the physical disk. The simulation accepts write requests for logical blocks and maintains the mapping between these logical blocks and the physical blocks onto which they are stored. As with the system implemented by de Jonge et al. [DEJON93], our simulation maintains all data structures in main memory. We simulate a 1GB physical disk with 4KB blocks and 64KB (16-block) segments. Our simulation uses a stream of block write requests that are skewed so that 80% of the requests are for 20% of the blocks (a sketch of such a generator appears below). Because our simulation does not include a cleaner, we run it for 262,144 iterations (the number of blocks on the disk).
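The exact generator used by our simulator is not specified here, but an 80/20-skewed request stream can be sketched in a few lines of C; rand() stands in for whatever source of randomness the simulator uses.

    #include <stdlib.h>

    #define NBLOCKS 262144                     /* blocks on the 1GB disk */
    #define HOT     (NBLOCKS / 5)              /* the hot 20% of the blocks */

    /* Return the logical block number of the next write request:
     * 80% of requests fall in the hot region, 20% elsewhere. */
    unsigned long next_write(void)
    {
        if (rand() % 100 < 80)
            return rand() % HOT;               /* 80%: a hot block */
        return HOT + rand() % (NBLOCKS - HOT); /* 20%: a cold block */
    }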
We measured the absolute time required to maintain the mapping between logical and physical blocks for the entire run. To justify using this graft, it must save more time than it takes: the overhead incurred per write should be less than the time saved by batching writes into segments.

We found that the compiled technologies (Omniware and Modula-3) add a sub-10µs overhead per write, which is on the order of 1% of a typical disk seek time. The Java overhead is on the order of 100µs, roughly 10% of a typical seek, so we would need to save one seek for every ten blocks written. This is not an unreasonable assumption to make; interpreted Java would work for this task. (Because of the performance of Tcl on the first two tests, we did not take Tcl measurements for this test.)

When looking at a user-level server for managing the mapping, we assume that there is one upcall on each block write, with an (aggressive) upcall estimate of 10µs. This would double the overhead per write, but still keep it substantially lower than that of interpreted Java. The performance of a user-level server for this task would be relatively close to that of compiled code.

    Platform             C             Java           Modula-3      Omniware
    Alpha    raw         0.74s(1.9%)   N.A.           1.3s(2.0%)    N.A.
             normalized  1.0           N.A.           1.75          N.A.
             per block   2.8µs         N.A.           5.0µs         N.A.
    HP-UX    raw         1.3s(1.3%)    32.2s(0.6%)    2.1s(0.7%)    N.A.
             normalized  1.0           25             1.6           N.A.
             per block   5.0µs         123µs          8.0µs         N.A.
    Linux    raw         1.3s(0.8%)    46.5s(0.1%)    1.7s(0.9%)    N.A.
             normalized  1.0           36             1.3           N.A.
             per block   5.0µs         177µs          6.6µs         N.A.
    Solaris  raw         1.9s(0.2%)    24.6s(0.4%)    2.9s(0.4%)    2.2s(0.1%)
             normalized  1.0           13             1.5           1.16
             per block   7.2µs         94µs           11.1µs        8.4µs

Table 6. Logical Disk: Time to handle bookkeeping for 262,144 writes to a Logical Disk. The time is normalized to compiled C code. The per-block overhead is how much time must be saved on each write in order for the graft to break even. The mean of 30 runs is reported (standard deviations in parentheses).
6 Conclusions

We believe that our taxonomy of grafts and of graft architectures encompasses a significant share of the operating system extension space. Our sample grafts (VM page eviction, MD5 fingerprinting, and Logical Disk) are equivalent in structure and performance characteristics to the policy, performance, and functionality grafts we envision. Given the performance of these representative grafts, we are able to determine what kinds of extensions are worth building.

The anecdotal evidence has been that structuring a system with fine-grained user-level extensions and upcalls is not feasible; our VM page eviction test supports this position. On the other hand, the coarse-grained and computationally expensive MD5 test shows that there are some situations where the overhead does not matter, and the Logical Disk simulation shows that upcalls can be successfully hidden in the face of a large number of I/O operations.

Given that we want both safety and performance, a compiled technology such as Modula-3 or SFI is our best choice. Because Modula-3 is unfamiliar to many developers, it may be that a hybrid language with the syntax of C or C++ but the safety of Modula-3 would be more widely accepted. The two compelling candidates are compiled Java and SFI with full (read, write, and jump) protection. Neither is available today, but both are currently under development. In the near future we will be able to measure their performance and compare them directly.

In their current state, neither of the interpreted technologies is up to the task. While Tcl and Java are well suited for interactive applications, where the relevant metric is human perceptual time, they are not suitable for building kernel extensions, because system events occur at a finer timing granularity.

7 Status and Availability

All code is available on the World Wide Web, at https://www.eecs.harvard.edu/~chris.

Acknowledgments

The authors are grateful to the reviewers for providing valuable feedback. David Black of OSF, the shepherd, was invaluable in helping with the paper, especially in coming up with the scheme for measuring signal delivery time. Special thanks to OSF for allowing us to use a pre-release copy of their port of Java to HP-UX. David Holland kindly lent us his Linux box (and all its free disk space) to obtain the Linux measurements. Dawson Engler of MIT made the helpful suggestion of presenting the break-even times as a function of upcall time. And special thanks to Keith Smith and Chris Thorpe for their help brainstorming and polishing the paper. We are very grateful to Larry McVoy of SGI for the use of his lmbench suite [MCVOY96]. We used tests lmdd.c,v 1.12 and lat_pagefault.c,v 1.2.

Bibliography

[ACCE86] Accetta, M., Baron, R., Bolosky, W., Golub, D., Rashid, R., Tevanian, A., Young, M., "Mach: A New Kernel Foundation for UNIX Development," 1986 Summer USENIX Conference (July 1986).

[BERS95] Bershad, B., Savage, S., Pardyak, P., Sirer, E. G., Fiuczynski, M., Becker, D., Eggers, S., Chambers, C., "Extensibility, Safety, and Performance in the SPIN Operating System," Proceedings of the 15th SOSP, Copper Mountain, CO (December 1995).

[CAMP95] Campbell, R., Tan, S.-M., "µChoices: An Object-Oriented Multimedia Operating System," Proceedings of HotOS V, pp. 90-94, Orcas Island, WA (May 1995).

[CAO94] Cao, P., Felten, E., Li, K., "Implementation and Performance of Application-Controlled File Caching," Proceedings of the First USENIX Symposium on Operating System Design and Implementation, pp. 165-177, Monterey, CA (November 1994).

[COLU95] "Omniware Technical Overview," Colusa Software, https://www.colusa.com (1995).

[DEJON93] de Jonge, W., Kaashoek, M. F., Hsieh, W., "The Logical Disk: A New Approach to Improving File Systems," Proceedings of the 14th SOSP, pp. 15-28, Asheville, NC (December 1993).

[DRUS93] Druschel, P., Peterson, L., "Fbufs: A High-Bandwidth Cross-Domain Transfer Facility," Proceedings of the 14th SOSP, pp. 189-202, Asheville, NC (December 1993).

[ENGL95] Engler, D., Kaashoek, M. F., O'Toole, J., "Exokernel: An Operating System Architecture for Application-Level Resource Management," Proceedings of the 15th SOSP, Copper Mountain, CO (December 1995).

[FALL93] Fall, K., Pasquale, J., "Exploiting In-Kernel Data Paths to Improve I/O Throughput and CPU Availability," 1993 Winter USENIX Conference, pp. 327-334, San Diego, CA (January 1993).

[FORD94] Ford, B., Lepreau, J., "Evolving Mach 3.0 to a Migrating Thread Model," 1994 Winter USENIX Conference, pp. 97-114, San Francisco, CA (January 1994).

[GOSL95] Gosling, J., McGilton, H., "The Java Language Environment," available from https://java.sun.com (May 1995).

[GUIL91] Guillemont, M., Lipkis, J., Orr, D., Rozier, M., "A Second-Generation Micro-Kernel Based UNIX: Lessons in Performance and Compatibility," 1991 Winter USENIX Conference, pp. 13-21, Dallas, TX (January 1991).
[HOLZE94] Hölzle, U., Ungar, D., "Optimizing Dynamically-Dispatched Calls with Run-Time Type Feedback," Proceedings of the 1994 SIGPLAN Conference on Programming Language Design and Implementation, Orlando, FL (June 1994).

[ILLU94] "Illustra DataBlade Developer's Kit Architecture Manual, Release 1.1," Illustra Information Technologies, Oakland, CA (1994).

[LEE94] Lee, C.-H., Chen, M., Chang, R. C., "HiPEC: High Performance External Virtual Memory Caching," Proceedings of the First USENIX Symposium on Operating System Design and Implementation, pp. 153-164, Monterey, CA (November 1994).

[LISK95] Liskov, B., Curtis, D., Day, M., Ghemawat, S., Gruber, R., Johnson, P., Myers, A., "Theta Reference Manual," MIT LCS Programming Methodology Group Memo 88 (February 1995).

[MAY87] May, C., "MIMIC: A Fast System/370 Simulator," Proceedings of the SIGPLAN '87 Symposium on Interpreters and Interpretive Techniques, in SIGPLAN Notices, 22, 7, pp. 1-13, St. Paul, MN (July 1987).

[MCCAN93] McCanne, S., Jacobson, V., "The BSD Packet Filter: A New Architecture for User-level Packet Capture," 1993 Winter USENIX Conference, San Diego, CA (January 1993).

[MCVOY96] McVoy, L., Staelin, C., "lmbench: Portable Tools for Performance Analysis," 1996 USENIX Conference, San Diego, CA (January 1996).

[MOGUL87] Mogul, J., Rashid, R., Accetta, M., "The Packet Filter: An Efficient Mechanism for User-level Network Code," Proceedings of the 11th SOSP, pp. 39-52, Austin, TX (November 1987).

[NELS91] Nelson, G., ed., Systems Programming with Modula-3, Prentice Hall, Englewood Cliffs, NJ (1991).

[RFC1321] Rivest, R., "The MD5 Message-Digest Algorithm," Network Working Group RFC 1321 (April 1992).

[RITCH84] Ritchie, D., "A Stream Input-Output System," AT&T Bell Laboratories Technical Journal, 63, 8, pp. 1897-1910 (October 1984).

[ROSE91] Rosenblum, M., Ousterhout, J., "The Design and Implementation of a Log-Structured File System," Proceedings of the 13th SOSP, pp. 1-15, Pacific Grove, CA (October 1991).

[TPCB90] Transaction Processing Performance Council, "TPC Benchmark B," Standard Specification, Waterside Associates, Fremont, CA (1990).

[THEK94] Thekkath, C., Levy, H., "Hardware and Software Support for Efficient Exception Handling," Proceedings of ASPLOS VI, pp. 110-119, San Jose, CA (October 1994).

[WAHBE93] Wahbe, R., Lucco, S., Anderson, T., Graham, S., "Efficient Software-Based Fault Isolation," Proceedings of the 14th SOSP, pp. 203-216, Asheville, NC (December 1993).

[YUHARA94] Yuhara, M., Bershad, B., Maeda, C., Moss, J., "Efficient Packet Demultiplexing for Multiple Endpoints and Large Messages," 1994 Winter USENIX Conference, pp. 153-166, San Francisco, CA (January 1994).

Christopher A. Small is a Ph.D. candidate in Computer Science at Harvard University. His research interests include the interaction of language systems, database systems, and operating systems. He received his B.A. in Mathematics and his M.A. in Computer Science from Boston University in 1984, where he was a Trustee Scholar. He worked at an unreasonably large number of companies in the Boston area before entering Harvard's Ph.D. program in 1993.

Margo I. Seltzer is an Assistant Professor of Computer Science in the Division of Applied Sciences at Harvard University. Her research interests include file systems, databases, and transaction processing systems. She is the co-author of several widely-used software packages, including database and transaction libraries and the 4.4BSD log-structured file system.
Dr. Seltzer spent several years working at start-up companies designing and implementing file systems and transaction processing software and designing microprocessors. She is a Sloan Foundation Fellow in Computer Science and was the recipient of the University of California Microelectronics Scholarship, the Radcliffe College Elizabeth Cary Agassiz Scholarship, and the John Harvard Scholarship. Dr. Seltzer received an A.B. degree in Applied Mathematics from Harvard/Radcliffe College in 1983 and a Ph.D. in Computer Science from the University of California, Berkeley, in 1992.

Alpha and AlphaStation are trademarks of Digital Equipment Corporation. HotJava, Java, SPARCstation, and Solaris are trademarks of Sun Microsystems. Omniware and omniC++ are trademarks of Colusa Software. DataBlade is a trademark of Illustra Information Technologies. HP-UX and PA-RISC are trademarks of Hewlett-Packard. UnixWare is a trademark of Novell. UNIX is a trademark of X/Open.