The following paper was originally published in the Proceedings of the USENIX Microkernels and Other Kernel Architectures Symposium, San Diego, California, September 20-23, 1993.

Kernel Support for the Wisconsin Wind Tunnel

Steven K. Reinhardt, Babak Falsafi, and David A. Wood
Computer Sciences Department
University of Wisconsin--Madison
1210 West Dayton Street
Madison, WI 53706 USA
wwt@cs.wisc.edu

Abstract

This paper describes a kernel interface that provides an untrusted user-level process (an executive) with protected access to memory management functions, including the ability to create, manipulate, and execute within subservient contexts (address spaces). Page motion callbacks not only give the executive limited control over physical memory management, but also shift certain responsibilities out of the kernel, greatly reducing kernel state and complexity.

The executive interface was motivated by the requirements of the Wisconsin Wind Tunnel (WWT), a system for evaluating cache-coherent shared-memory parallel architectures. WWT uses the executive interface to implement a fine-grain user-level extension of Li's shared virtual memory on a Thinking Machines CM-5, a message-passing multicomputer. However, the interface is sufficiently general that an executive could act as a multiprogrammed operating system, exporting an alternative interface to the threads running in its subservient contexts.

The executive interface is currently implemented as an extension to CMOST, the standard operating system for the CM-5. In CMOST, policy decisions are made on a central, distinct control processor (CP) and broadcast to the processing nodes (PNs). The PNs execute a minimal kernel sufficient only to implement the CP's policy. While this structure efficiently supports some parallel application models, the lack of autonomy on the PNs restricts its generality. Adding the executive interface provides limited autonomy to the PNs, creating a structure that supports multiple models of application parallelism. This structure, with autonomy on top of centralization, is in stark contrast to most microkernel-based parallel operating systems in which the nodes are fundamentally autonomous.

* This work is supported in part by NSF PYI Award CCR-9157366, NSF Grant MIP-9225097, a Wisconsin Alumni Research Foundation Fellowship, an AT&T Bell Laboratories Ph.D. Fellowship, and donations from Xerox Corporation, Thinking Machines Corporation, and Digital Equipment Corporation. Our Thinking Machines CM-5 was purchased through NSF Institutional Infrastructure Grant No. CDA-9024618 with matching funding from the Univ. of Wisconsin Graduate School.

1 Introduction

This paper describes the kernel interface designed to support the Wisconsin Wind Tunnel (WWT) [13], a system for parallel simulation of parallel computers. WWT currently runs on the Thinking Machines CM-5 (a message-passing machine) and simulates cache-coherent shared-memory multiprocessors.
Shared memory applications execute directly on the CM-5 node processors, with WWT simulating references to remote data. Shared memory functionality is provided using a fine-grain user-level extension of Li's shared virtual memory [10], as described in Section 2.2. WWT uses a separate address space for each simulated (target) node and services all of its exceptions (e.g., MMU faults) and system call requests (e.g., file I/O). In order to study machines larger than the host, several target nodes timeshare a single CM-5 node. In many ways, WWT behaves like an operating system for shared-memory applications. Alternatively, WWT can be thought of as providing a virtual machine abstraction---with a shared-memory MIMD machine atop a message-passing pseudo-SIMD machine---for these applications [8].

WWT requires several unusual features from the underlying operating system. Specifically, the kernel must allow WWT to:

- Create subservient contexts (address spaces)
- Manipulate page mappings within sub-contexts
- Initiate execution in sub-contexts
- Handle traps generated during execution in sub-contexts
- Manage physical memory tags in sub-contexts (We have effectively extended the CM-5 architecture to support two bits of tag information for each 32-byte block of physical memory; see Section 2.3.)

All of these features must coexist with traditional memory management functions, including paging and/or swapping. We have defined and implemented an interface that provides these features, and we call any application that makes use of them an executive. While the interface is motivated by our specific application, it provides any untrusted user-level process with protected access to memory management functions, including the ability to create, manipulate, and execute within subservient contexts (address spaces). Page motion callbacks not only give the executive limited control over physical memory management, but also shift certain responsibilities out of the kernel, greatly reducing kernel state and complexity.

Because an executive creates contexts and controls them completely, it can act as the operating system for other applications, providing an execution model not available under the native operating system. For example, an executive can export various thread and memory abstractions without adding complexity to the kernel itself. While this flexibility is useful in a uniprocessor context, it is particularly important on the CM-5, since standard CMOST is a centralized, synchronous operating system, allowing little autonomy for the node kernels. CMOST's structure efficiently supports an important class of parallel applications, i.e. fine-grain data-parallel codes, but cannot take advantage of more autonomous execution models. By extending CMOST with the executive interface, we provide a kernel structure that can efficiently support both synchronous and asynchronous applications.

The combination of CMOST and the executive interface results in a unique kernel structure that provides autonomy on top of synchrony, rather than the more traditional approach of coordinating fundamentally autonomous nodes. This new kernel structure may prove superior because centralized control appears to have advantages for supporting fine-grain synchronous codes and managing global hardware resources, while the executive interface provides flexible support for other execution models.
The next section provides background on the Thinking Machines CM-5 system, the Wisconsin Wind Tunnel, and our method of synthesizing memory tags on the CM-5. While the interface is not tied to any of these, this section provides context and motivation for the rest of the paper. Section 3 defines the interface, consisting of context manipulation calls, page motion callbacks, and execution management calls. Section 4 describes our implementation of the interface in CMOST and its performance. Section 5 discusses the implications of this work for the structure of multiprocessor operating systems. Finally, we discuss related work and our conclusions. 2 Background 2.1 Thinking Machines CM-5 and CMOST The Thinking Machines CM-5 [16] is a distributed-memory message- passing multiprocessor. Each processing node consists of a 33 MHz SPARC microprocessor with a cache and memory management unit, up to 128 MB of memory, a custom network interface chip, and optional custom vector units. The processing nodes are grouped into partitions of 32 or more processors. Each partition is managed by a control processor (CP), distinct from the processing nodes (PNs). The standard operating system for the CM-5 is CMOST. Under CMOST, all policy decisions, including scheduling, swapping, and memory allocation, are made on the control processor. The processing nodes execute a minimal microkernel (the PN kernel) which provides the bare mechanisms required to implement the CP's policy. Because all processors in the partition are managed as a synchronous unit, CMOST gives the CM-5 some SIMD-like qualities. For example, when the CP decides to context switch, all nodes simultaneously switch to the new context. Similarly, when a new process is created, the CP selects the physical pages which the process will occupy on the PNs and broadcasts that process's memory map. CMOST and the CM-5 are optimized to run data-parallel applications, where all nodes synchronously apply similar operations to a local subset of a global data structure. In particular, the CM-5 contains a ``control network'', distinct from the message-passing network, which provides hardware support for global operations such as barriers, reductions, and broadcasts [9]. To efficiently utilize this control network, all nodes in a partition must concurrently execute the same user process. The centralized CMOST structure automatically satisfies this condition. 2.2 The Wisconsin Wind Tunnel The Wisconsin Wind Tunnel (WWT) provides a platform for evaluating parallel computer systems---specifically cache-coherent shared-memory computers---by accurately modeling the performance of real workloads on proposed hardware [13]. WWT helps computer engineers evaluate computer architectures much like a wind tunnel helps aeronautical engineers design aircraft. WWT uses the execution of a parallel shared-memory application to drive a distributed discrete-event simulation, accurately calculating the execution time of that application on a modeled hardware system (the target). Events generated by the simulation, such as lock acquisitions and memory reference completions, are used in scheduling the application, guaranteeing that the application's execution proceeds exactly as it would on the target system. We call WWT a virtual prototype because it uses direct execution to leverage similarities between the target system and the system on which it executes (the host) [5]. 
This means that the target application executes directly on the host hardware as much as possible---for example, a target floating-point multiply runs as a host floating-point multiply. Software simulation is required only for those features of the target system not provided by the host. Because WWT executes on a message-passing machine, the primary feature it must simulate is the shared memory abstraction. We do this using a fine-grain extension of Li's shared virtual memory [10]. Shared virtual memory constructs a distributed shared memory using standard address translation hardware to control memory access on each node. If a node has a copy of a shared data page, it is mapped into the address space on that node; if a node has no copy, the virtual page is not mapped. Multiple read-only copies are easily supported using the page protection facilities. Program accesses that require a data transfer to acquire a valid or exclusive copy are signaled as page faults. Unfortunately, relying on address translation hardware alone restricts the granularity of coherence to at least the virtual memory page size. The shared-memory machines we wish to model maintain coherence at a finer granularity, typically tens of bytes rather than thousands. We have synthesized the ability to tag each 32-byte block in physical memory as invalid, read-only, or writable (see Section 2.3). Using these tags in combination with the address translation hardware, we implement a distributed shared memory that maintains coherence at a 32-byte granularity. The first reference to a shared page causes a page fault, as with shared virtual memory. We allocate and map a physical page, but initially mark each cache block invalid. Cache blocks are marked valid (read-only or writable) only as they are referenced. Accesses to invalid blocks (and writes to read-only blocks) cause faults, and initiate software that fetches the data and marks the block valid. The distinction between read-only and writable tags allows read replication at the cache block granularity. WWT uses this fine-grain shared virtual memory to directly execute shared-memory applications as they would execute on a cache-coherent target machine. A context is allocated for each target node; the shared data accessible from this context reflects the contents of the simulated cache on that target node. Page and tag faults correspond to target cache misses, which invoke WWT and are handled according to the target's coherence protocol. Large target systems are studied by allocating several contexts per host node and multiplexing their execution. 2.3 Memory tags To implement fine-grain shared virtual memory, blocks in memory must have three states: invalid, read-only, and writable. Any access to an invalid block and write accesses to read-only blocks must provide restartable exceptions. To achieve this functionality, we have logically extended the CM-5 architecture to support two additional bits of information---writable and invalid---per 32-byte physical memory block. Although memory tags with access semantics have appeared in numerous machines, e.g., the Denelcor HEP [15], most contemporary commercial machines---including the CM-5---do not provide this capability. However, we are able to synthesize an invalid tag on the CM-5 by forcing uncorrectable errors in the memory's error correcting code (ECC) via a diagnostic mode. Using the SPARC cache in write-back mode causes all SPARC cache misses to appear to the memory as cache block fills. 
A fill that encounters an uncorrectable ECC error generates a precise exception. Synthesizing a read-only state is more convoluted, since it requires using the page tables to make entire pages read-only. On a write fault, we must distinguish between a write to a read-only block and a write to a writable block that resides on the same page as one or more read-only blocks. We make this distinction by maintaining a bit vector---one bit per block---to indicate whether the block is writable. The write fault handler checks this bit; if it is set, the handler performs the write and resumes the application, rather than signaling a fault.

Memory tags introduce extra state---two additional bits per 32-byte block---making paging and swapping more complex to implement. The tag bits make the ``extended'' physical page no longer a power of two, causing a mismatch with typical disk block sizes, and requiring more bookkeeping and I/O operations. In addition, since memory tags are unused for many pages, e.g., text and non-shared data, any overhead spent maintaining them is wasted. The executive interface reduces the kernel's complexity by shifting responsibility for maintaining memory tags to the executive.

3 The executive interface

The executive interface provides an executive---an untrusted user-level process---with protected access to memory management functions. An executive can use the interface's memory management calls to create subservient contexts (address spaces), and exert complete control over them, including adding, modifying, and deleting page-level mappings. An executive can invoke execution within a subcontext, and regain control on all faults and exceptions. Page motion callbacks not only give the executive limited control over physical memory management, but also shift certain responsibilities out of the kernel, greatly reducing kernel state and complexity.

A primary goal of the executive interface is to minimize kernel state and complexity. Besides the aesthetic appeal, keeping most of the code and complexity at user level makes bugs less catastrophic and easier to eliminate. In addition, because we are modifying a continuously developing system, minimizing and isolating the kernel source changes makes it easier for us to keep up with vendor revisions. A particular challenge is maintaining the address mappings and memory tag values installed by an executive in the face of paging/swapping activity by the kernel. The brute-force solution is to have the kernel remember all of the mapping requests made by the executive and transparently maintain them when a page is swapped out and back in at a different physical address, and to transparently swap tag information as well as page data. We have solved this problem with greatly reduced kernel state and complexity using page motion callbacks. These callbacks allow the kernel to notify an executive immediately before a page is to be swapped out and immediately after it is swapped back in, so that the executive itself can maintain address mappings and tag values.

Another requirement is to avoid trusting the executive. A fully protected interface makes the system robust through even the earliest phases of executive development, and means that normal users have the ability to write or modify executives. A protected interface also makes other sites more willing to adopt our kernel so that we can distribute the Wisconsin Wind Tunnel.
Two features of the interface contribute to this protection:

- The executive cannot even refer to resources not explicitly allocated to it by the kernel. The executive never sees physical addresses or hardware context numbers.
- The kernel guarantees that the executive never has an alias to a physical page it does not own by maintaining a count of aliases for each physical page. If the executive does not decrement this count to zero by deleting mappings before a page is removed from its control, the kernel will terminate it.

    int create_ctx();
    void *executive_brk(void *new_brk);
    void *executive_sbrk(int incr);
    void *executive_vbrk(void *new_brk);
    void *executive_vsbrk(int incr);
    int add_mapping(int cd, void *va, void *pp, int attr);
    int change_pg_attr(int cd, void *va, int attr);
    int delete_mapping(int cd, void *va);
    void jump_to_ctx(int cd, struct regs *p, void *stackp);

    (a) General kernel calls

    int set_page_motion_cbs(void *(*page_going)(), void (*page_coming)(), void *stackp);
    int set_ctx_fault_cb(void (*ctx_fault_cb)());

    (b) Callback registration calls

    void *page_going(void *pp);
    void page_coming(void *pp);
    void ctx_fault(int fault_code, struct regs *p, ...);

    (c) Callbacks

Table 1: The executive interface. Parts (a) and (b) list the functions exported by the kernel. Part (c) describes the callbacks exported by the executive.

Table 1 lists the calls and callbacks that comprise the executive interface. The kernel exports the general calls and callback registration functions, while the executive exports the callbacks, which it registers during initialization. The bulk of the interface is directly related to managing virtual and physical memory. The remaining functions, jump_to_ctx() and the ctx_fault() callback, provide the ability to execute within subcontexts.

3.1 Memory management

The executive manages virtual and physical memory resources via the context management calls and page motion callbacks. Pages managed by the executive are distinct from the executive's own text, data, and stack pages. This allows the kernel to easily distinguish the former for special handling while manipulating the latter as it would for any other user process. For example, the kernel can swap the executive's text segment, share it among multiple instances of the same executive, or allow a debugger to attach to an executive without interfering with the executive's memory management functions.

3.1.1 Context management calls

The context management calls allow an executive to create new contexts, allocate pages to map into them, and add, modify, and delete page-level mappings. Using these calls, an executive has complete control over these subservient address spaces. The create_ctx() call allocates a new context and returns an integer context descriptor (similar to a Unix file descriptor). We refer to these contexts as subcontexts when it is necessary to distinguish them from the executive's context. A new subcontext is completely empty, i.e. it contains no valid address mappings (except for the kernel mappings required by the SPARC architecture). Context descriptor 0 is never returned by create_ctx() and is used to indicate the executive's own address space. The notation cd:va refers to virtual address va in context cd. To keep executive-managed pages distinct from kernel-managed pages, the executive allocates pages from a special segment in the executive's own context, the executive-managed heap.
The executive_brk() and executive_sbrk() calls allow the executive to change the size of this segment in the same way that CMOST's Unix-like brk() and sbrk() work with the standard heap. Allocation on the executive-managed heap is always rounded up to the next multiple of the page size, so that an integral number of pages are allocated. The virtual addresses of these pages in the executive's context are the primary mappings (i.e., handles) which the executive uses to refer to these pages across the kernel interface. Only pages in the executive-managed heap---referred to as executive-managed pages---may be aliased via add_mapping() or have their memory tag values changed. The executive virtual heap allows the executive to manage a region of its own address space the same way that it manages subcontexts. The executive_vbrk() and executive_vsbrk() calls simply reserve a contiguous collection of virtual pages, but do not allocate physical pages behind them. The executive can then alias these virtual pages to pages in the executive-managed heap (possibly with different attributes, e.g. read-only or non-cacheable) without fear that the kernel will grow the standard heap or stack to conflict. This call is not necessary for subcontexts because the executive automatically has complete control of those. The add_mapping() call creates a secondary mapping from virtual address va in context cd to the page at virtual address pp, i.e. it aliases cd:va and 0:pp, where pp must point to an executive-managed page. The mapping attributes (protection and cacheability) are set according to attr. To prevent interference between the kernel and executive, if cd is zero then va must be in the executive virtual heap. The alias count for the corresponding physical page is incremented. The change_pg_attr() and delete_mapping() calls allow the executive to change the attributes of and delete mappings, respectively. Only mappings created via add_mapping() can be modified or deleted. delete_mapping() also decrements the physical page's alias count. 3.1.2 Page motion callbacks The two page-motion callbacks---page_going(), invoked when the kernel must reclaim an executive-managed page, and page_coming(), invoked when a page returns---serve a dual role. First, they allow the kernel and executive to cooperate in physical memory management, similar to the way scheduler activations allow management of physical processors in a shared-memory multiprocessor [2]. By explicitly saving and restoring data in response to page_going() and page_coming() calls, the executive can control exactly which data are resident in physical memory. Second, the callbacks significantly reduce the kernel's bookkeeping requirements, by giving the executive responsibility for maintaining secondary mappings and memory tags across physical page movements. The two page motion callbacks only affect executive-managed pages; the executive must register the callbacks before allocating pages on the executive-managed heap. When the kernel decides to reclaim an executive-managed page (e.g., to allocate it to a different process), it notifies the appropriate executive using the page_going() callback. In general, the kernel will call page_going() with the argument pp set to NULL, indicating that the executive can select any executive-managed page for reclamation. The executive can apply its own replacement policy and return the selected page-aligned pp as the return code. 
By default, the kernel discards the page contents; however, the executive may request that they be saved (i.e. moved to backing store) by overloading the return code (setting the least-significant bit to one). Occasionally, the kernel may need contiguous physical pages---e.g., for the CM-5 vector units--- requiring it to reclaim a specific executive-managed page. In this case, the argument pp points to the selected page, and only the least-significant bit of the return value is meaningful. The executive must use delete_mapping() to delete all secondary mappings for the selected page before returning, at which point the kernel removes the primary mapping from pp to the physical page. Note that virtual address pp is still ``in use'' as it uniquely identifies this page and will be provided as the argument to a future call to page_coming(). The kernel returns an executive-managed page via the page_coming() callback. The primary mapping is recreated, i.e. the virtual address pp again maps to a physical page, though not necessarily the same physical page as before the page_going() call. If the executive returned a zero in the LSB on the previous call to page_going(), the page has been zeroed; otherwise, the contents have been restored. In either case, the physical memory tags have been cleared. This call allows the executive to restore any secondary mappings and/or memory tags for the page. Secondary mappings could also be restored on demand. These callbacks allow user-level management of physical memory: even when the kernel reclaims a specific physical page, the executive can choose the data that get replaced at the expense of additional copying and page-table manipulation. Alternatively, the executive can let the kernel save and restore data in evicted pages. The executive can always regain access to a ``gone'' page by dereferencing pp and causing a fault. The kernel will handle the fault by obtaining a free page (possibly by calling page_going() on this or another executive), restoring the old data (if necessary), and calling page_coming() to signal the return of the needed page. Because the executive is untrusted, the kernel cannot rely on it to delete all secondary mappings on a page_going() callback. Conversely, it must guarantee that these mappings are removed---otherwise, the executive may retain an alias to a physical page re-allocated to a different process. A brute-force solution is for the kernel to automatically delete all secondary mappings. However, this approach requires that the kernel maintain all of the reverse translations, duplicating state already maintained by the executive. Instead, we require that the executive delete all secondary mappings to the selected page before returning from the page_going() call or face process termination. Process termination guarantees that all mappings are deleted, because the process and all its sub-contexts are destroyed. Thus the kernel only need know how many secondary mappings exist, but not their individual identities. A simple counter per physical page, incremented on each add_mapping() call and decremented on each delete_mapping call, is sufficient to maintain this state. In addition, to protect against deadlock or infinite loops in the executive, the kernel requires that all callbacks be completed within a fixed time. The kernel sets a virtual timer before invoking a callback; if the timer expires before the callback returns the executive's process is terminated. 
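To make the callback protocol concrete, the sketch below shows one way an executive might register the page motion callbacks and replay its secondary mappings when a page is taken away and later returned. This is a minimal illustration under our own assumptions: the struct exec_page bookkeeping, the lookup_page() helper, the ATTR_RW attribute value, the stack and table sizes, and the example addresses are all hypothetical, and only the kernel calls and callback signatures are taken from Table 1.

    /* Hypothetical executive-side bookkeeping; only the extern declarations
     * below come from Table 1. */
    extern int   create_ctx();
    extern void *executive_sbrk(int incr);
    extern int   add_mapping(int cd, void *va, void *pp, int attr);
    extern int   delete_mapping(int cd, void *va);
    extern int   set_page_motion_cbs(void *(*going)(), void (*coming)(), void *stackp);

    #define PAGE_SIZE   4096    /* assumed host page size */
    #define ATTR_RW     0       /* placeholder attribute encoding */
    #define MAX_ALIASES 4

    struct exec_page {                    /* one per executive-managed page */
        void *pp;                         /* primary mapping (handle) in the heap */
        int   n_aliases;                  /* secondary mappings currently installed */
        int   alias_cd[MAX_ALIASES];      /* context of each alias */
        void *alias_va[MAX_ALIASES];      /* virtual address of each alias */
        int   save_contents;              /* nonzero: ask kernel to preserve data */
    };

    static struct exec_page pages[1024];
    static int  n_pages;
    static char cb_stack[8192];           /* stack the callbacks run on */

    static struct exec_page *lookup_page(void *pp)
    {
        int i;
        for (i = 0; i < n_pages; i++)
            if (pages[i].pp == pp)
                return &pages[i];
        return 0;                         /* unknown page: should not happen */
    }

    /* The kernel is about to reclaim a page.  Pick a victim (or accept the
     * kernel's choice), delete every secondary mapping so the alias count
     * drops to zero, and overload the low bit of the return value to ask
     * that the contents be written to backing store. */
    void *page_going(void *pp)
    {
        /* pp == NULL: a real executive would run its replacement policy here. */
        struct exec_page *ep = pp ? lookup_page(pp) : &pages[0];
        int i;

        for (i = 0; i < ep->n_aliases; i++)
            delete_mapping(ep->alias_cd[i], ep->alias_va[i]);
        return (void *)((unsigned long)ep->pp | (ep->save_contents ? 1 : 0));
    }

    /* The page named by pp is resident again (zeroed or restored, tags
     * cleared); re-install the secondary mappings recorded above. */
    void page_coming(void *pp)
    {
        struct exec_page *ep = lookup_page(pp);
        int i;

        for (i = 0; i < ep->n_aliases; i++)
            add_mapping(ep->alias_cd[i], ep->alias_va[i], ep->pp, ATTR_RW);
    }

    void exec_init(void)
    {
        int cd;

        /* Callbacks must be registered before any executive-managed pages
         * are allocated. */
        set_page_motion_cbs(page_going, page_coming, cb_stack + sizeof(cb_stack));

        cd = create_ctx();                        /* new, empty subcontext */
        pages[0].pp = executive_sbrk(PAGE_SIZE);  /* sbrk-style: start of the new page */
        pages[0].alias_cd[0] = cd;                /* record the alias so the callbacks */
        pages[0].alias_va[0] = (void *)0x10000;   /*   above can replay it; the address */
        pages[0].n_aliases   = 1;                 /*   is an arbitrary page-aligned va  */
        n_pages = 1;
        add_mapping(cd, (void *)0x10000, pages[0].pp, ATTR_RW);
    }

Because the callbacks run on the registered stack and must finish within the kernel's time limit, a real executive would keep them roughly this simple.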
3.2 Execution management Creating and manipulating address spaces is uninteresting without the ability to execute within them. Two calls suffice to provide this functionality. The kernel call jump_to_ctx() causes the current thread of execution to switch into the specified context. When the thread executing in the subordinate context encounters a fault (either an instruction fault or an explicit software trap), control resumes in the executive's context via the ctx_fault() callback. Because the jump_to_ctx() call continues the current thread in a different context rather than creating a new thread, the contents of the register file are largely unchanged across the switch. The struct regs structure passed as an argument to jump_to_ctx(), though implementation-dependent, conceptually consists of only the program counter and stack pointer in the new context and the original contents of the registers required to pass the three arguments of jump_to_ctx() into the kernel. (On the SPARC architecture, this structure also contains the NPC (to resume after faults in delayed branch slots) and the condition codes (since these cannot be saved and restored from user mode).) The third argument, stackp, specifies a stack in the executive context to use when ctx_fault() is invoked. From the executive's perspective, jump_to_ctx() does not return. After a thread passes through a jump_to_ctx() call, it is in a subordinate execution context where all faults are handled by the executive. When the thread executes a trapping instruction (either due to a fault or an explicit software trap), control resumes in the executive's context at the ctx_fault() entry point. The first argument indicates the type of fault and the second passes back a pointer to the same structure originally passed to jump_to_ctx(). The state saved is exactly the state required by jump_to_ctx(), so supplying this buffer unmodified as the second argument to jump_to_ctx() restarts execution in the other context at the faulting instruction. Additional arguments are passed from the kernel to the executive depending on the type of the fault---for example, an MMU fault will also provide the virtual address of the access and the nature of the fault (invalid address, protection violation, etc.). We have, to the greatest extent possible, separated thread management issues from this interface. However, with the addition of thread management code, this simple interface is sufficient for the executive to behave as a multiprogrammed operating system. Implementing a threads package on top of this interface simply requires code to allocate and manage multiple stacks, and to save and restore the registers not contained in the struct regs structure. For example, a context switch is as simple as having the ctx_fault() handler save and restore the CPU registers and call jump_to_ctx() with a different context descriptor and struct regs pointer. Separating the thread and context abstractions gives the executive flexibility, e.g. to support a kernel thread abstraction within its sub-contexts. If the underlying kernel provides a programmable timer interrupt, the interrupt can be made to appear through the ctx_fault() entry as well, making the multithreading preemptive. The executive does not normally handle its own faults, except for those to the executive virtual heap. Having the kernel handle the executive's faults facilitates demand paging of the executive's text and data, and makes growing the executive's stack the same as for any other user process. 
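Pulling together the pieces of this section, the following sketch shows the skeleton of a ctx_fault() handler that multiplexes several subcontexts round-robin with jump_to_ctx(), along the lines of the threads package described above. The struct target bookkeeping, the conceptual layout shown for struct regs, and the save_callee_regs(), restore_callee_regs(), and handle_fault() helpers are our own illustrative assumptions; only the jump_to_ctx() and ctx_fault() signatures come from the interface.

    /* Conceptual layout only; the real struct regs is implementation-defined
     * (it holds the PC, NPC, SP, condition codes, and the three argument
     * registers, as described above). */
    struct regs {
        unsigned long pc, npc, sp, ccr;
        unsigned long arg[3];
    };

    extern void jump_to_ctx(int cd, struct regs *p, void *stackp);

    /* Hypothetical helpers: registers not covered by struct regs must be
     * saved and restored by the executive itself. */
    extern void save_callee_regs(void *buf);
    extern void restore_callee_regs(void *buf);
    extern void handle_fault(int target, int fault_code);  /* e.g., simulate a miss */

    #define NTARGETS 4                    /* assumed number of subcontexts */

    struct target {
        int          cd;                  /* descriptor from create_ctx() */
        struct regs  regs;                /* state at the last fault */
        char         other_regs[512];     /* registers outside struct regs */
    };

    static struct target targets[NTARGETS];
    static int  current;                  /* target now executing */
    static char fault_stack[8192];        /* executive stack for ctx_fault() */

    /* Registered with set_ctx_fault_cb().  The kernel filled *p with the
     * faulting thread's state; save it, service the fault, then resume the
     * next target where it last stopped.  Fault-specific trailing arguments
     * (e.g., a faulting address) are ignored in this sketch. */
    void ctx_fault(int fault_code, struct regs *p, ...)
    {
        targets[current].regs = *p;
        save_callee_regs(targets[current].other_regs);

        handle_fault(current, fault_code);

        current = (current + 1) % NTARGETS;
        restore_callee_regs(targets[current].other_regs);
        jump_to_ctx(targets[current].cd, &targets[current].regs,
                    fault_stack + sizeof(fault_stack));
        /* not reached: jump_to_ctx() does not return to the executive */
    }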
If the executive wants to handle its own faults, it can call jump_to_ctx() with the first argument set to zero (its own context). This creates a singular situation where the thread is in a subordinate execution context---so faults still invoke the ctx_fault() callback---but not a subordinate addressing context. If the executive decides that the kernel should handle a specific fault, it need only retry the faulting instruction from within its fault handler. There is no danger of a recursive call to ctx_fault() because any fault encountered by the executive's handler will be handled directly by the kernel. Subordinate execution contexts cannot be nested because jump_to_ctx() is a system call implemented as a software trap; ``recursive'' calls will show up in the executive via ctx_fault(). As with any other traps, system calls made in a subordinate execution context can be forwarded to the kernel simply by re-executing the call in the executive. The only complication occurs with pointer arguments, since the kernel will interpret these in the executive, rather than subordinate, context.

4 Implementation

    (a) C language additions

    Function                       Lines of C
                                   PN      CP
    executive_{v}{s}brk
      (Total, four calls)           10      35
    create_ctx                      134       0
    Page table initialization        17      11
    Process termination              90       0
    Page callback support           222     426
    -------------------------------------------
    Total                           473     472

    (b) PN assembly additions

    Function                Machine instructions
    add_mapping                     187
    change_pg_attr                  108
    delete_mapping                  123
    jump_to_ctx                      50
    ctx_fault                        40
    set_page_motion_cbs               7
    set_ctx_fault_cb                  5
    ---------------------------------------
    Total                           520

Table 2: Code added to CMOST to implement the executive interface.

We have implemented the executive interface in CMOST version 7.2 Beta 1. As shown in Table 2, the entire interface required less than one thousand lines of C and just over 500 machine instructions. Only jump_to_ctx() and ctx_fault() required assembly-level coding; the other functions listed in Table 2(b) were implemented in assembly only to improve performance. Although the majority of the additional code is in the PN kernel, most of the complexity lies in the CP portion. This follows from the centralized structure of CMOST: the PN kernel implements only mechanisms, while all policy decisions occur on the CP.

The executive interface requires very little additional kernel state. The PN kernel requires a few additions to the process control block (PCB) and an array with an entry per physical page. The PCB maintains the entry points and stack addresses for the callbacks and the struct regs and stackp pointers from the last jump_to_ctx() call (for use in the subsequent ctx_fault() callback). The array, indexed by physical page number, contains the alias count for each page and a pointer field. The pointer field is used to maintain a linked list, whose head is in the PCB, of all physical pages in the executive-managed heap to facilitate resetting the alias counts when the executive process terminates. (This state is sketched below, after the list.) We also added two new memory segments to every process: the executive-managed heap and the executive virtual heap. Both segments have special semantics:

- Any time the CP decides to move a page in the executive-managed heap, it must first invoke the executive's page motion callbacks.
- Allocations in the executive virtual heap segment are not backed by physical memory. This segment simply provides a region in the executive's virtual address space that is guaranteed not to conflict with regions used by the control processor. Also, faults to this segment are always handled by the executive.
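As a rough picture of how little state this adds, the sketch below gives one plausible shape for the PN kernel additions just described. The field names and types are ours, for illustration only; they are not taken from the CMOST sources.

    /* Illustrative shape of the PN-kernel state added for the executive
     * interface; names and types are ours, not CMOST's. */
    struct regs;                                  /* implementation-defined */

    struct phys_page_info {                       /* one entry per physical page */
        int                    alias_count;       /* secondary mappings via add_mapping() */
        struct phys_page_info *next_exec_page;    /* links pages in the executive-managed
                                                     heap so alias counts can be reset
                                                     when the executive terminates */
    };

    struct exec_pcb_additions {                   /* added to the process control block */
        void *(*page_going)(void *);              /* page motion callback entry points */
        void  (*page_coming)(void *);
        void   *page_cb_stack;                    /* stack the callbacks run on */
        void  (*ctx_fault)(int, struct regs *, ...);
        struct regs           *last_regs;         /* struct regs and stackp from the    */
        void                  *last_stackp;       /*   last jump_to_ctx(), for ctx_fault() */
        struct phys_page_info *exec_pages_head;   /* head of executive-managed page list */
    };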
On the CP, two fields were added to the per-physical-page structure to record the process ID and virtual address of each physical page that is allocated to an executive-managed heap segment. This information is required to perform the page_going() callback when the physical page needs to be moved.

Table 3 summarizes the performance of the executive interface calls, as measured using the CM-5's cycle counter and averaged over tens of iterations. (This information was derived using a pre-release version of CMOST (7.2 Beta 1 of Feb. 1993). Performance on released versions may be significantly different.) The executive_{v}{s}brk() and create_ctx() calls are implemented as full traps, where the user's register windows are flushed, execution switches to the kernel stack, and interrupts are re-enabled. All other calls are implemented as ``fast'' traps, and execute without flushing any windows or re-enabling interrupts. The overheads for the two types of traps are approximately 300 usec and 3 usec, respectively. The callback registration functions simply store their arguments in fields in the process's PCB. No validation is required since illegal values can at worst cause an immediate fault on the invocation of a callback, which will terminate the executive process.

    Function                          Time
                                   cycles  (usec)
    executive_sbrk
      with CP communication
        alloc 1 page                 1.6M  (48 ms)
        alloc 100 pages              4.9M  (148 ms)
      w/o CP communication            20K  (606)
    executive_vsbrk
      with CP communication          1.4M  (42 ms)
      w/o CP communication            20K  (606)
    create_ctx                        19K  (575)
    add_mapping
      no table allocation             359  (11)
      alloc level 3 table            1166  (35)
      alloc level 2 & 3 tables       2641  (80)
    change_pg_attr                    340  (10)
    delete_mapping                    855  (26)
    jump_to_ctx                       180  (5)
    ctx_fault                         154  (5)
    set_page_motion_cbs               117  (4)
    set_ctx_fault_cb                  108  (3)

Table 3: Performance of executive interface calls on the CM-5.

4.1 Memory management

4.1.1 Context management calls

The executive_{v}{s}brk() calls are implemented in the same fashion as CMOST's brk() and sbrk(). The first node requesting a page generates a system-wide interrupt, which causes the control processor to grow the appropriate segment on all nodes. Subsequent requests on other nodes can be satisfied locally by returning pages from the now-larger segment. The create_ctx() call allocates a SPARC hardware context number and an empty level-one table (the SPARC has a three-level page table structure). A call to add_mapping() validates its arguments, performs the page table insertion (allocating level two and three tables as necessary), and increments the alias counter for the physical page. Both change_pg_attr() and delete_mapping() validate arguments and do a page table walk, with the latter also decrementing the appropriate alias counter. None of these calls require communication from the node to the control processor. Because the CM-5 node has a virtually-tagged writeback cache, delete_mapping() must also flush data brought in using the deleted mapping. This requires iteratively flushing 128 cache lines, causing it to take significantly longer than change_pg_attr().

4.1.2 Page motion callbacks

The page motion callbacks account for most of the design complexity. Our current implementation focused on minimizing changes in the CP code. As a result, our implementation is influenced by several existing CMOST features:

- Swapping is supported, but not demand paging. All of a process's pages must be in memory before it is allowed to run.
- Memory management is performed entirely on the control processor, which assumes that the memory maps of all nodes are identical. When the CP reclaims a page, it must select a specific physical page and force all nodes to release it, even if there is no need for physical contiguity. In other words, in this implementation, page_going() is never called with a NULL argument because the CP cannot deal with different nodes freeing different pages. - Both the PN kernel and the CP portion of CMOST are single-threaded, i.e. there is only one stack in each. In addition, the PN kernel does not maintain any kernel stack state across communications with the CP. While this implementation satisfies our needs for WWT, and serves as a proof-of-concept for the executive interface, the CP code requires more radical changes for a clean and efficient implementation. In CMOST, pages are freed for two reasons: i) to satisfy an allocation request for a currently executing process or, ii) to swap in an idle process. In either case, the CP performs the necessary memory management operations while the PNs are idle, waiting for instructions to resume. Page moves not requiring callbacks are performed immediately, but those requiring callbacks are recorded and deferred. Before resuming the user process, the CP performs the deferred callbacks, scheduling the affected processes (executives) as needed. The callbacks are executed in the executive context by invoking the registered callback function using the registered stack pointer. The callback cannot be run on the current process stack because it could be in a different address space (i.e. if the process was suspended while running in a subcontext). A fault in the callback will result in the executive's termination. When the callback returns, the PN kernel either executes another callback pending for this executive, or waits for the other nodes to complete. Once all callbacks are done for this executive (across all nodes), the scheduler is re-invoked to run the next executive with pending callbacks. When all deferred callbacks have executed, the originally scheduled process can run. Because the executive is scheduled for callbacks the same way it is scheduled for normal execution, there are only a few constraints on callback execution. First, the callback cannot allocate memory of any kind since this could create a circular dependency. Second, each callback is provided at most one scheduling quantum; the executive is terminated if the quantum expires during a callback. Third, the callback also cannot call any blocking kernel function, since this would interfere with the callback timeout mechanism. Finally, the current implementation does not allow the callback to use the CM-5 network interface. This last restriction is not inherent to the architecture, but would add significant complexity and overhead for a feature our application would not use. Our implementation is designed to support swapping, but we have not yet tested this portion of the code. However, the CM-5 vector unit architecture severely constrains physical memory allocation, causing CMOST to frequently reallocate specific pages, moving data from one physical page to another. By treating these page moves as a swap out followed immediately by a swap in, we have completely exercised the callback mechanisms. 4.2 Execution management The subordinate execution context is simply the same CMOST process executing with a different trap vector and (perhaps) a different hardware MMU context. 
All trap vector entries, except hardware interrupts, jump to the ctx_fault() kernel stub. The jump_to_ctx() and ctx_fault() functions are implemented as ``fast'' traps, i.e. they execute without re-enabling interrupts in a partial SPARC register window. In addition to loading state from the struct regs structure, (or storing it in the case of ctx_fault()), both calls must change the hardware context, manipulate the register window mask, and change the trap vector base address. jump_to_ctx() requires ten extra instructions because it must read the processor status register, mask in the desired condition codes, and write it back. 4.3 Other implementation issues The executive needs to specify a stack for all interrupt or signal handlers, as it does for the page motion callbacks, since the current process stack may not exist in the executive's own address space. We have not added this extension yet, but it would be simple to do so. The cache controller used on the CM-5 node (Cypress 604) is a 64KB direct-mapped virtually-tagged cache. The hardware will handle aliases that map to the same cache block, but cannot guarantee consistency otherwise. To avoid cache flushing, aliases must be congruent modulo 64K. Rather than complicating the interface with this issue, we force the executive to deal with it on its own. We have added an additional call to the PN kernel, int cache_flush_page(int cd, void *pp), which flushes the specified page from the SPARC cache. This allows the executive to make the tradeoff between keeping aliases congruent and performing cache flushes according to its own needs. 5 Discussion While the executive interface is interesting in isolation, it is more striking when considered in the context of the CM-5 and CMOST. The resulting kernel structure is---we believe---unique. In most microkernels designed for parallel systems, nodes are fundamentally autonomous. Cooperation, e.g. for gang-scheduling, occurs as a policy at a higher level of abstraction. Our extended version of CMOST turns this structure on its head: the control processor maintains central, synchronous control of physical memory and scheduling. The control processor still forces all processors to context switch simultaneously; however, some of the processes may now be executives. Executives, because of the autonomy provided by the executive interface, can schedule execution in their subcontexts however they choose. This flexibility can be used to support applications or groups of applications which may benefit from more dynamic allocation and scheduling policies. Thus our extended CMOST provides autonomy on top of synchrony, rather than the more traditional alternative of synchrony on top of autonomy. The CMOST/executive structure was motivated by our implementation of the Wisconsin Wind Tunnel on the CM-5. However, the resulting structure is arguably the right way to structure an operating system for large-scale parallel machines. Efficiently supporting fine-grain parallel applications requires a global perspective for resource allocation, because a page fault or scheduling delay on one node can seriously impact the performance of the entire application. Centralizing control, as in CMOST, makes global resource allocation significantly easier. For example, CMOST's guarantee that one user process runs simultaneously on all nodes allows direct user access to the CM-5's network interface hardware, avoiding costly system calls for message operations. 
Operating systems that fail to efficiently manage global resources will have a particularly difficult time exploiting hardware features such as the CM-5's control network [9], which performs a global barrier or reduction in a few microseconds. Because hardware barriers are both cheap and fast---they are essentially AND-gates---we expect them to appear in most future parallel machines. (The Cray T3D also has hardware support for global barriers; barriers are actually faster than remote memory operations on this machine [14].) The operating systems for these machines must be able to exploit this hardware to efficiently execute fine-grain data-parallel codes. We believe this may prove easier with a synchronous microkernel structure, rather than a more traditional asynchronous kernel structure. While the CMOST/executive structure supports timesharing among different execution models, a hierarchical control structure can integrate space-sharing as well. For example, a central scheduler on a 128-node machine can reserve some time-slices for 128-node synchronous applications and some for 128-node applications or sets of applications with more dynamic executive-managed scheduling. The remaining time-slices can be delegated to two other schedulers, each of which can recursively do identical centralized scheduling within disjoint 64-node processor groups. In order for a single executive to manage multiple users' applications, the executive must be run with some additional privilege, e.g. as a Unix ``setuid root'' process, to access system resources with the effective permissions of the user on whose behalf the current application is being executed. Such an executive would also need a way to adjust its scheduling priority within the kernel so that processing resources can be fairly allocated across all user jobs, whether they are executing directly under the kernel or are one of several running under an executive. 6 Related Work The interface described in this paper was motivated by the needs of the Wisconsin Wind Tunnel. The centralized structure of CMOST, with all policy enacted on the control processor, was insufficient to support the user-level fine-grain distributed shared memory needed by WWT. The executive interface extends CMOST to provide nodes with limited autonomy in the way they manage their virtual address spaces and physical memory. This interface is interesting from two different perspectives: on its own, as a means of exporting memory-management functions to the user of a uniprocessor; and in conjunction with CMOST, as a means of supporting multiple models of application parallelism on a single machine. 6.1 Uniprocessor aspects The executive interface provides a complete set of low-level virtual memory functions. The interface is simpler and lower-level than the virtual memory interfaces of either Mach [12] or Chorus [1], which both impose significant semantics on the use of memory. To the first order, the executive interface merely exposes the underlying hardware mechanisms to the user in a protected manner. The executive interface is similar to the ``inferior spheres of protection'', described by Dennis and Van Horn [6]. Their execution model allowed processes to create subcontexts, initiate execution within them, and handle any resulting faults. The primary difference is our page orientation rather than their more general segments and capabilities. More recently, Probert, et al, proposed SPACE, an object-oriented operating system [11]. 
SPACE allows applications to create, manipulate, and execute within spaces, i.e., address spaces, thereby facilitating protected objects. However, SPACE is much more general than our interface, allowing different ``executives'' to manage different parts of a single address space. Appel and Li surveyed the most common uses of user-level virtual memory, and identified the set of primitives needed by these applications [3]. The set includes primitives to modify protection on pages and create aliases within an address space; however, they did not include the ability to create new address spaces, nor get callbacks when pages are reclaimed by the kernel. Using page motion callbacks to manage physical memory allocation is analogous to using scheduler activations to manage physical processor allocation [2]. Both provide the user with notification of kernel allocation decisions so that the application can adapt knowledgeably to its new circumstance. A key difference is that the page_going() callback notifies the user before the page is taken away, while the scheduler activation model notifies the user after a processor has been taken away. This adds some complexity to the page motion callbacks (i.e. the necessity for the kernel to enforce a finite completion time), but reflects a fundamental difference between memory and processors: it is reasonable to have the user allocate space to save a processor's state in case the kernel takes it away, but nonsensical to apply the same principle to memory pages. The Mach external memory manager interface is similar to our page motion callbacks. However, if an external memory manager does not remove a page in a timely fashion, the Mach kernel can always write the page to backing store using the default memory manager. Our interface does not permit this, because the kernel cannot clean up secondary mappings itself nor can it permit them to point into another process. (Our kernel could write the page to backing store, so long as it also saved and restored the memory tags and guaranteed to return it to the original physical page before resuming the process.) 6.2 Multiprocessor aspects User control over multiprocessor scheduling with the intent of supporting multiple models of parallelism (including gang-scheduling) is provided in Mach by a processor allocation server [4]. In this model, an application requests a certain number of processors to create a ``processor set'' to which threads can be bound. The binding of actual processors to processor sets is performed by a privileged user-level server. The server can be modified to support site- or usage-specific policies, but there can only be one per platform. In our scheme, a process requiring a fixed number of processors can be handled simply by scheduling it at the appropriate level in the hierarchy. A process with more dynamic needs could be served by an appropriate executive that balances its requirements by scheduling it in conjunction with other applications having similar dynamic parallelism. The fundamental differences are that our model provides gang-scheduling more as the rule than the exception, and allows for multiple executives running simultaneously to support multiple abstractions. The hierarchical integration of time- and space-sharing discussed in Section 5 is similar to Feitelson and Rudolph's distributed hierarchical control [7], except that they assume a hardware hierarchy of control processors, while we believe a software hierarchy of control processes may be just as effective. 
Also, their model does not have the executive interface to provide different scheduling models underneath the global hierarchical structure.

7 Conclusion

The executive interface exports a complete, abstract model of virtual memory management to a user process, including the ability to create, manipulate, and execute in multiple address spaces. The interface allows the user process to participate in physical memory management using page motion callbacks. The callbacks also serve to minimize the kernel complexity of implementing the interface. Giving a user-level executive the ability to define a complete virtual memory environment in a protected fashion allows multiple executives providing multiple process abstractions to coexist on a single system.

Though interesting from a uniprocessor perspective, this capability is more significant in the context of large-scale multiprocessors. Instead of having the operating system view the machine as a set of autonomous nodes upon which coordination mechanisms must be imposed, it can start with a global perspective and selectively delegate nodes in both space and time to executives which allow increasing amounts of autonomy. The resulting structure combines the advantages of centralization and decentralization: the underlying global perspective simplifies efficient support of fine-grained synchronous (e.g. data-parallel) applications and management of global resources such as barrier hardware, while executives provide the flexible support for other application models that a completely centralized system lacks.

Acknowledgements

Mark Hill, Frans Kaashoek, Jim Larus, Bart Miller, Yannis Schoinas, and Marv Solomon provided helpful comments that greatly improved this paper.

References

[1] Vadim Abrossimov and Marc Rozier. Generic Virtual Memory Management for Operating System Kernels. In Proceedings of the Twelfth ACM Symposium on Operating System Principles (SOSP), pages 123-136, December 1989.

[2] Thomas E. Anderson, Brian N. Bershad, Edward D. Lazowska, and Henry M. Levy. Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism. ACM Transactions on Computer Systems, 10(1):53-79, February 1992.

[3] Andrew W. Appel and Kai Li. Virtual Memory Primitives for User Programs. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS IV), pages 96-107, April 1991.

[4] David L. Black. Scheduling Support for Concurrency and Parallelism in the Mach Operating System. IEEE Computer, 23(5):35-43, May 1990.

[5] M. D. Canon, D. H. Fritz, J. H. Howard, T. D. Howell, M. F. Mitoma, and J. Rodriguez-Rosell. A Virtual Machine Emulator for Performance Evaluation. Communications of the ACM, 23(2):71-80, February 1980.

[6] Jack B. Dennis and Earl C. Van Horn. Programming Semantics for Multiprogrammed Computations. In ACM Programming Languages and Pragmatics Conference, August 1965.

[7] Dror G. Feitelson and Larry Rudolph. Distributed Hierarchical Control for Parallel Processing. IEEE Computer, 23(5):65-77, May 1990.

[8] Robert P. Goldberg. Survey of Virtual Machine Research. IEEE Computer, 7(6):34-45, June 1974.

[9] Charles E. Leiserson, Zahi S. Abuhamdeh, David C. Douglas, Carl R. Feynman, Mahesh N. Ganmukhi, Jeffrey V. Hill, W. Daniel Hillis, Bradley C. Kuszmaul, Margaret A. St. Pierre, David S. Wells, Monica C. Wong, Shaw-Wen Yang, and Robert Zak. The Network Architecture of the Connection Machine CM-5. In Proceedings of the Fifth ACM Symposium on Parallel Algorithms and Architectures (SPAA), July 1992.
[10] Kai Li and Paul Hudak. Memory Coherence in Shared Virtual Memory Systems. ACM Transactions on Computer Systems, 7(4):321-359, November 1989.

[11] D. Probert, J. Bruno, and M. Karaorman. SPACE: A New Approach to Operating System Abstraction. In Proceedings of the International Workshop on Object Orientation in Operating Systems, October 1991.

[12] Richard Rashid, Avadis Tevanian, Michael Young, David Golub, Robert Baron, David Black, William J. Bolosky, and Jonathan Chew. Machine-Independent Virtual Memory Management for Paged Uniprocessor and Multiprocessor Architectures. IEEE Transactions on Computers, 37(8):896-908, August 1988.

[13] Steven K. Reinhardt, Mark D. Hill, James R. Larus, Alvin R. Lebeck, James C. Lewis, and David A. Wood. The Wisconsin Wind Tunnel: Virtual Prototyping of Parallel Computers. In Proceedings of the 1993 ACM Sigmetrics Conference on Measurement and Modeling of Computer Systems, pages 48-60, May 1993.

[14] S. L. Scott. Personal communication, June 1993.

[15] Burton J. Smith. Architecture and Applications of the HEP Multiprocessor Computer System. In Proc. of the Int. Soc. for Opt. Engr., pages 241-248, 1982.

[16] Thinking Machines Corporation. Connection Machine CM-5 Technical Summary, October 1991.