We argue that a new OS for a multicore machine should be designed ground-up as a distributed system, using concepts from that field. Modern hardware resembles a networked system even more than past large multiprocessors: in addition to familiar latency effects, it exhibits node heterogeneity and dynamic membership changes.
Cache coherence protocols encourage OS designers to ignore this resemblance, acknowledging it only selectively and for performance reasons. Commodity OS designs have mostly assumed fixed, uniform CPU architectures and memory systems.
It is time for researchers to abandon this approach and engage fully with the distributed nature of the machine, carefully applying (but also modifying) ideas from distributed systems to the design of new operating systems. Our goal is to make it easier to design and construct robust OSes that effectively exploit heterogeneous, multicore hardware at scale. We approach this through a new OS architecture resembling a distributed system.
Heterogeneous multicore hardware in commodity computer systems, running dynamic workloads that rely increasingly on OS services, will pose new challenges not addressed by monolithic OSes in either general-purpose or high-performance computing.
It is possible that existing OS architectures can be evolved and scaled to address these challenges. However, we claim that stepping back and reconsidering OS structure is a better way to get insight into the problem, regardless of whether the goal is to retrofit new ideas to existing systems, or to replace them over time.
In the next section we elaborate on why modern computers should be thought of as networked systems. Section 3 discusses the implications of distributed systems principles that are pertinent to new OS architecture, and Section 4 describes a possible architecture for an OS following these ideas. Section 5 lays out open questions at the intersection of distributed systems and OS research, and Section 6 concludes.
A modern computer is undeniably a networked system of point-to-point links exchanging messages: Figure 1 shows a 32-core commodity PC server in our lab. But our argument goes further than this: distributed systems (applications, networks, P2P systems) are historically distinguished from centralized ones by three additional challenges: node heterogeneity, dynamic changes due to partial failures and other reconfigurations, and latency.
Modern computers exhibit all these features.
Managing diverse cores with the same kernel object code is clearly impossible. At present, some processors (such as GPUs) are special-cased and abstracted as devices, as a compromise to mainstream operating systems which cannot easily represent different processing architectures within the same kernel. In other cases, a programmable peripheral itself runs its own (embedded) OS.
So far, no coherent OS architecture has emerged which accommodates such processors in a single framework. Given their radical differences, a distributed systems approach, with well-defined interfaces between the software on these cores, seems the only viable one.
Nodes in a distributed system come and go, as a result of changes in provisioning, failures, network anomalies, etc. By contrast, the OS has traditionally not viewed the hardware of a single computer in this way.
Partial failure in a single computer is not a mainstream concern, though recently failure of parts of the OS has become one [15], a recognition that monolithic kernels are now too complex, and written by too many people, to be considered a single unit of failure.
However, other sources of dynamicity within a single OS are now commonplace. Hot-plugging of devices, and in some cases memory and processors, is becoming the norm. Increasingly sophisticated power management allows cores, memory systems, and peripheral devices to be put into a variety of low-power states, with important implications for how the OS functions: if a peripheral bus controller is powered down, all the peripherals on that bus become inaccessible. If a processor core is suspended, any processes on that core are unreachable until it resumes, unless they are migrated.
The problem of latency in cache-coherent NUMA machines is well-known. Table 1 shows the latencies of cache access on the machine in Figure 1, in line with those reported by Boyd-Wickizer et al. for a 16-core machine [5]. Accessing a cache line from a different core is up to 130 times slower than from local L1 cache, and 17 times slower than from local L2 cache. The trend is towards more cores and an increasingly complex memory hierarchy.
Because most operating systems coordinate data structures between cores using shared memory, OS designs focus on optimizing this in the face of memory latency. While locks can scale to large machines [12], locality of data becomes a performance problem [2], since fetching remote memory effectively causes the hardware to perform an RPC call. In Section 3 we suggest insights from distributed systems which can help mitigate this.
The machine is already a distributed system; why are we not programming it as such?
We now present some ideas, concepts, and principles from distributed systems which we feel are particularly relevant to OS design for modern machines. We draw parallels with existing ideas in operating systems, and where there is no corresponding notion, suggest how the idea might be applied. We have found that viewing OS problems as distributed systems problems either suggests new solutions to current problems, or fruitfully casts new light on known OS techniques.
In distributed systems, ``chatty'' RPCs are reduced by encoding the high-level operation more compactly; in the limit, this becomes a single RPC or code shipping. When communication is expensive (in latency or bandwidth), it is more efficient to send a compact message encoding a complex operation than to access the data remotely. While there was much research in the 1990s into distributed shared virtual memory on clusters (e.g. [1]), it is rarely used today.
In an OS, a message-passing primitive can make more efficient use of the interconnect and reduce latency compared with sharing data structures between cores. If an operation on a data structure and its results can each be compactly encoded in less than a cache line, a carefully written and efficient user-level RPC [3] implementation, which leaves the data where it is in a remote cache, can incur less overhead in total machine cycles than migrating the data between caches. Moreover, the use of message passing rather than shared data facilitates interoperation between heterogeneous processors.
Note that this line of reasoning is independent of the overhead of synchronization (e.g. through scalable locks). The performance issues arise less from lock contention than from poor data locality, an observation which has been made before [2].
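To make the claim concrete, the following is a minimal sketch of the kind of cache-line-granularity, user-level RPC transport we have in mind; the names, the 64-byte line size, and the polling strategy are illustrative assumptions rather than a description of an existing implementation. The sender fills a cache-line-sized slot and publishes it by writing a sequence number, so an operation and its arguments move between cores in roughly one cache-line transfer.

\begin{verbatim}
/* Sketch of a one-way, cache-line-sized message channel between two
 * cores, in the style of user-level RPC (URPC) [3].  Names and sizes
 * are illustrative assumptions, not an existing implementation. */
#include <stdint.h>
#include <stdatomic.h>

#define CACHE_LINE 64

struct urpc_msg {
    uint64_t payload[7];        /* operation code plus arguments/results */
    _Atomic uint64_t seq;       /* written last: publishes the message   */
} __attribute__((aligned(CACHE_LINE)));

/* Sender: fill the line, then bump the sequence number so the whole
 * operation reaches the receiver in (roughly) one cache-line transfer. */
static void urpc_send(struct urpc_msg *slot, const uint64_t arg[7],
                      uint64_t seq)
{
    for (int i = 0; i < 7; i++)
        slot->payload[i] = arg[i];
    atomic_store_explicit(&slot->seq, seq, memory_order_release);
}

/* Receiver: poll for the next sequence number, then read the payload.
 * The data structure being operated on never leaves its home cache. */
static void urpc_recv(struct urpc_msg *slot, uint64_t out[7],
                      uint64_t expected_seq)
{
    while (atomic_load_explicit(&slot->seq, memory_order_acquire)
           != expected_seq)
        ;                        /* spin; could instead monitor/mwait */
    for (int i = 0; i < 7; i++)
        out[i] = slot->payload[i];
}
\end{verbatim}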
Explicit message-based communication is amenable to both informal and formal analysis, using techniques such as process calculi and queueing theory. In contrast, although they have been seen by some as easier to program, it is notoriously difficult to prove correctness results about, or predict the performance of, systems based on implicit shared-memory communication.
Long ago, Lauer and Needham [11] pointed out the equivalence of shared-memory and message passing in OS structure, arguing that the choice should be guided by what is best supported by the underlying substrate. More recently, Chaves et al. [7] considered the same choice in the context of an operating system for an early multiprocessor, finding the performance tradeoff biased towards message passing for many kernel operations. We claim that hardware today strongly motivates a message-passing model, and refine this broad claim below.
The performance benefits of replicating in software have not been lost on OS designers. Tornado [10] replicated (as well as partitioned) OS state across cores to improve scalability and reduce memory contention, and fos [17] argues for ``fleets'' of replicated OS servers. Commodity OSes such as Linux now replicate cached pages and read-only data such as program text [4].
However, replication in current OSes is treated as an optimization of the shared data model. We suggest that it is useful to instead see replication as the default model, and that programming interfaces should reflect this: OS code should be written as if it accessed a local replica of the data rather than a single shared copy.
The principal impact on clients is that they now invoke an agreement protocol (propose a change to system state, and later receive agreement or failure notification) rather than modifying data under a lock or transaction. The change of model is important because it provides a uniform way to synchronize state across heterogeneous processors that may not coherently share memory.
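As a rough illustration of what such a programming interface might look like, the following is a hypothetical client-side API for replica maintenance; the names and types are ours and are not taken from any existing system. OS code proposes an update and supplies a continuation, which is invoked once the agreement protocol reports success or failure.

\begin{verbatim}
/* Hypothetical client-side interface to replica maintenance; the
 * names and types are illustrative only. */
#include <stddef.h>

struct replica;                 /* opaque handle to the local replica */

enum agree_outcome { AGREE_COMMITTED, AGREE_ABORTED };

/* Continuation invoked when the agreement protocol completes. */
typedef void (*agree_cont_t)(void *client_state, enum agree_outcome outcome);

/* Split-phase: queue a proposed update to replicated OS state and
 * return immediately; no data is modified under a lock here. */
int replica_propose(struct replica *r, const void *update, size_t len,
                    agree_cont_t cont, void *client_state);
\end{verbatim}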
At scale we expect the overhead of a relatively long-running, split-phase (asynchronous) agreement protocol over a lightweight transport such as URPC to be lower than that of the heavyweight global-barrier model of inter-processor interrupts (IPIs), particularly as the usual batching and pipelining optimizations also apply to agreement. In effect, we trade increased latency for lower per-operation overhead.
Of course, sometimes sharing is cheap. Between closely coupled cores, such as those sharing an L2 or L3 cache, replicating data is likely to perform substantially worse than simply sharing it under a spinlock. However, we can reintroduce sharing and locks (or transactions) for such groups of cores as an optimization behind the replication interface.
This is a reversal of the scalability trend in kernels, whereby replication and partitioning are used to optimize locks and sharing; in contrast, we advocate using locks and limited sharing to optimize replica maintenance.
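The following sketch shows one way such an optimization could sit behind the replication interface: cores sharing a cache are grouped around a single copy of the state protected by a spinlock, while callers still see the per-replica interface. The types are hypothetical, and the POSIX spinlock merely stands in for a kernel lock.

\begin{verbatim}
/* Sketch: a group of closely coupled cores shares one copy of the
 * state behind the replica interface.  Illustrative names only; the
 * pthread spinlock stands in for a kernel spinlock. */
#include <pthread.h>

struct os_state;                /* the replicated OS state (opaque here) */

struct shared_replica {
    pthread_spinlock_t lock;    /* protects the group's single copy      */
    struct os_state *state;     /* shared by all cores in the group      */
};

/* Applying an agreed update locally: the caller cannot tell whether
 * this replica is private to its core or shared within a core group. */
static void replica_apply(struct shared_replica *r,
                          void (*update)(struct os_state *))
{
    pthread_spin_lock(&r->lock);
    update(r->state);
    pthread_spin_unlock(&r->lock);
}
\end{verbatim}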
However, the design space for agreement and consensus protocols is large and mostly unexploited in OS design. Operations such as page mapping, file I/O, and network connection setup and teardown have varying consistency and ordering requirements, and distributed systems research has much insight to offer on these problems, particularly as systems become more concurrent and diverse.
Furthermore, the ability of consensus protocols to agree on ordering of operations even when some participants are (temporarily) unavailable seems relevant to the problem of ensuring that processors which have been powered down subsequently resume with their OS state consistent before restarting processes. This is tricky code to write, and the OS world currently lacks a good framework within which to think about such operations.
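A small sketch of what we mean: if updates to replicated OS state are totally ordered by the agreement protocol, a core that was powered down can bring its replica up to date simply by replaying the updates it missed, before it resumes running processes. The types and names below are hypothetical.

\begin{verbatim}
/* Sketch: catch up a replica after a core resumes from a low-power
 * state by replaying missed, agreed-upon updates in order.  All
 * names are hypothetical. */
#include <stdint.h>

struct os_state;                       /* this core's local replica */

struct update {
    void (*apply)(struct os_state *);  /* one agreed state change   */
};

/* 'log' is indexed by the global sequence number assigned by the
 * agreement protocol; replay everything between the last update this
 * replica applied and the current tail. */
static void replica_catch_up(struct os_state *local, uint64_t *applied,
                             const struct update log[], uint64_t tail)
{
    for (uint64_t i = *applied + 1; i <= tail; i++)
        log[i].apply(local);
    *applied = tail;
}
\end{verbatim}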
Just as importantly, reasoning about OS state as a set of consistent replicas with explicit constraints on the ordering of updates seems to hold out more hope of assuring correctness of a heterogeneous multiprocessor OS than low-level analysis of locking and critical sections.
Routing problems also arise at the level of system software, when considering where to place buffers in memory as data flows among processors, DMA controllers, memory, and peripherals. For example, data that arrives at a machine and is immediately forwarded back over the network should be placed in buffers close to the NIC, whereas data that will be read in its entirety should be DMAed to memory local to the computing core [14].
We have proposed an analogous, though simpler, approach based on constraint logic programming to allow an OS (and applications) to make sense of the diverse and complex hardware on which it finds itself running [14].
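As a minimal illustration of the buffer-placement rule above, the decision can be phrased as a function of how the data will be used; node_of_core() is a hypothetical topology query, not a real API.

\begin{verbatim}
/* Sketch of the placement rule: choose the NUMA node for a DMA
 * buffer from the intended use of the data.  node_of_core() is a
 * hypothetical topology query. */
enum buf_use { BUF_FORWARD,             /* forwarded straight back out */
               BUF_CONSUME };           /* read in full by some core   */

int node_of_core(int core);             /* hypothetical topology query */

static int choose_buffer_node(enum buf_use use, int nic_node,
                              int consumer_core)
{
    if (use == BUF_FORWARD)
        return nic_node;                /* keep the buffer near the NIC */
    return node_of_core(consumer_core); /* place it near the reader     */
}
\end{verbatim}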
In this section, we sketch out the multikernel architecture for an operating system built from the ground up as a distributed system, targeting modern multicore processors, intelligent peripherals, and heterogeneous multiprocessors, and incorporating the ideas above.
We do not believe that this is the only, or even necessarily the best, structure for such a system. However, it is useful for two related reasons. Firstly, it represents an extreme point in the design space, and hence serves as a useful vehicle for investigating the full consequences of viewing the machine as a networked system. Secondly, it is designed with total disregard for compatibility with either Windows or POSIX. In practice, we can recover compatibility, at some cost in performance, by running a VMM over the OS [13]; freedom from legacy interfaces lets us investigate OS APIs better suited to both modern hardware and the scheduling and I/O requirements of modern concurrent language runtimes.
In keeping with the message-passing model, all communication between cores is asynchronous (split-phase). On current commodity hardware, we use URPC [3] as our message transport, with recourse to IPIs only when necessary for synchronization. Other transports may be used over non-coherent links (such as PCIe), or for future hardware. Other than URPC buffers, no data structures whatsoever are shared between OS nodes. One of the consequences of this is that the OS code for different cores can be implemented entirely differently (and, for example, specialized for a given architecture).
For example, an OS node would change a user-space memory mapping by a distributed two-phase commit, first tentatively changing the local mapping, then initiating agreement among other cores possessing the mapping, and finally committing the change (or aborting if a conflicting mapping on another core preempted it).
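The following is a minimal, single-process simulation of that two-phase exchange, intended only to make the protocol steps concrete; in a real multikernel the votes and decisions would travel as URPC messages, and all names here are ours. For brevity, the initiating core's own tentative change is treated as one of the votes.

\begin{verbatim}
/* Minimal simulation of the two-phase mapping change described above.
 * "Messages" are direct calls; all names are illustrative. */
#include <stdbool.h>
#include <stdio.h>

#define NCORES 4

struct core {
    int  mapping_version;     /* committed version of the user mapping */
    int  tentative_version;   /* proposed but not yet committed        */
    bool busy;                /* a conflicting change is in flight     */
};

/* Phase 1: a core votes on a proposed new mapping version. */
static bool prepare(struct core *c, int new_version)
{
    if (c->busy)
        return false;                  /* conflict: vote to abort */
    c->busy = true;
    c->tentative_version = new_version;
    return true;
}

/* Phase 2: commit the tentative change, or roll it back. */
static void decide(struct core *c, bool commit)
{
    if (commit)
        c->mapping_version = c->tentative_version;
    c->busy = false;
}

/* Coordinator: gather votes from every core holding the mapping,
 * then send the commit/abort decision to all of them. */
static bool change_mapping(struct core cores[], int n, int new_version)
{
    bool ok = true;
    for (int i = 0; i < n; i++)        /* phase 1: collect votes   */
        ok = prepare(&cores[i], new_version) && ok;
    for (int i = 0; i < n; i++)        /* phase 2: commit or abort */
        decide(&cores[i], ok);
    return ok;
}

int main(void)
{
    struct core cores[NCORES] = {0};
    printf("mapping change %s\n",
           change_mapping(cores, NCORES, 1) ? "committed" : "aborted");
    return 0;
}
\end{verbatim}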
We do not advocate blindly importing distributed systems ideas into OS design, and applying such ideas usefully in an OS is rarely straightforward. However, many of the challenges lead to their own interesting research directions.
There are many more relevant areas in distributed computing than we have mentioned here. For example, leader election and group membership algorithms may be useful in handling hot-plugging of devices and CPUs.
A separate question concerns whether future multicore designs will remain cache-coherent, or opt instead for a different communication model (such as that used in the Cell processor). A multikernel seems to offer the best options here. As in some HPC designs, we may come to view scalable cache-coherency hardware as an unnecessary luxury with better alternatives in software.
Perhaps less radical is to look at how structuring a single-node OS as a distributed system might make it more suitable as part of a larger physically distributed system, in an environment such as a data center.
Modern computers are inherently distributed systems, and we miss opportunities to tackle the OS challenges of new hardware if we ignore insights from distributed systems research. We have tried to come out of denial by applying the resulting ideas to a new OS architecture, the multikernel.
An implementation, Barrelfish, is in progress.