An isolation kernel is a small-kernel operating system architecture targeted at hosting multiple untrusted applications that require little data sharing. We have formulated four principles that govern the design of isolation kernels.
1. Expose low-level resources rather than high-level
abstractions. In theory, one might hope to achieve isolation on a
conventional OS by confining each untrusted service to its own process
(or process group). However, OSs have proven ineffective at containing even buggy code, let alone untrusted or malicious services.
An OS exposes high-level abstractions, such as files and sockets, as
opposed to low-level resources such as disk blocks and network
packets. High-level abstractions entail significant complexity and
typically have a wide API, violating the security principle of economy
of mechanism [29]. They also invite ``layer below''
attacks, in which an attacker gains unauthorized access to a resource
by requesting it below the layer of enforcement [18].
An isolation kernel exposes hardware-level resources, pushing the burden of implementing operating system abstractions to user-level code. In this respect, an isolation kernel resembles other ``small kernel'' architectures such as microkernels [1], virtual machine monitors [6], and Exokernels [20]. Although small kernel architectures were once viewed as prohibitively inefficient, modern hardware improvements have made performance less of a concern.
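To make this concrete, the sketch below shows the flavor of interface an isolation kernel might expose to a guest; the call names and signatures are hypothetical illustrations of the principle, not Denali's actual API.

    /* Hypothetical guest-visible interface of an isolation kernel.
     * Every resource is raw and private to the calling guest; there are
     * no files, sockets, or other high-level abstractions in the API. */
    #include <stddef.h>
    #include <stdint.h>

    int vdisk_read(uint64_t block, void *buf, size_t nblocks);   /* raw virtual disk I/O    */
    int vdisk_write(uint64_t block, const void *buf, size_t nblocks);

    int vnet_send(const void *frame, size_t len);                /* raw packet send/receive */
    int vnet_recv(void *frame, size_t maxlen);

    int vmem_map(uint64_t guest_pfn, uint64_t flags);            /* page-granularity memory */

Because each call names a raw, per-guest resource, the kernel-enforced interface stays narrow, and there is no higher-level abstraction for a layer-below attack to bypass.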
2. Prevent direct sharing by exposing only private, virtualized
namespaces. Conventional OSs facilitate protected data sharing
between users and applications by exposing global namespaces, such as
file systems and shared memory regions. The presence of these sharing mechanisms introduces the need to specify a complex access control policy to protect the globally exposed resources.
Little direct sharing is needed across Internet services, and therefore an isolation kernel should prevent direct sharing by confining each application to a private namespace. Memory pages, disk blocks, and all other resources should be virtualized, eliminating the need for a complex access control policy: the only sharing allowed is through the virtual network.
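A minimal sketch of how such a private namespace might be realized for one resource class (disk blocks), assuming a simple per-guest translation table; the structure and names are illustrative, not taken from Denali.

    #include <stdbool.h>
    #include <stdint.h>

    #define BLOCKS_PER_GUEST 4096

    /* Each guest owns a private block-address space. The kernel maps guest
     * block numbers to physical blocks through a per-guest table, so a guest
     * cannot even name another guest's storage, let alone access it. */
    struct guest_disk {
        uint64_t phys_block[BLOCKS_PER_GUEST];  /* guest block -> physical block */
        bool     mapped[BLOCKS_PER_GUEST];      /* allocated yet? */
    };

    static bool translate_block(const struct guest_disk *gd,
                                uint64_t guest_block, uint64_t *phys_out)
    {
        if (guest_block >= BLOCKS_PER_GUEST || !gd->mapped[guest_block])
            return false;                       /* no such block in this namespace */
        *phys_out = gd->phys_block[guest_block];
        return true;
    }

Because the per-guest table is the only path from a guest-visible name to a physical resource, no access control policy is needed beyond the table itself.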
Both principles 1 and 2 are required to achieve strong isolation. For example, the UNIX chroot command discourages direct sharing by confining applications to a private file system namespace. However, because chroot is built on top of the file system abstraction, it has been compromised by a layer-below attack in which the attacker uses a cached file descriptor to subvert file system access control.
Although our discussion has focused on security isolation, high-level abstractions and direct sharing also reduce performance isolation. High-level abstractions create contention points where applications compete for resources and synchronization primitives. This leads to the effect of ``cross-talk'' [23], where application resource management decisions interfere with each other. The presence of data sharing leads to hidden shared resources like the file system buffer cache, which complicate precise resource accounting.
3. Zipf's Law implies the need for scale. An isolation kernel
must be designed to scale up to a large number of services. For
example, to support dynamic content in web caches and CDNs, each cache
or CDN node will need to store content from hundreds (if not
thousands) of dynamic web sites. Similarly, a wide-area research
testbed to simulate systems such as peer-to-peer content sharing
applications must scale to millions of simulated nodes. A testbed
with thousands of contributing sites would need to support thousands
of virtual nodes per site.
Studies of web documents, DNS names, and other network services show that popularity tends to be driven by Zipf distributions [5]. Accordingly, we anticipate that isolation kernels must be able to handle Zipf workloads. Zipf distributions have two defining traits: most requests go to a small set of popular services, but a significant fraction of requests go to a large set of unpopular services. Unpopular services are accessed infrequently, reinforcing the need to multiplex many services on a single machine.
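As a rough, purely illustrative calculation (the parameters are assumed, not measured): under a Zipf distribution with exponent 1 over N services, the r-th most popular service receives a fraction of requests

\[
  P(r) = \frac{1/r}{H_N}, \qquad H_N = \sum_{i=1}^{N} \frac{1}{i} \approx \ln N + 0.58 .
\]

With N = 10,000 services, the 100 most popular services capture only about H_100 / H_10000, or roughly 53%, of all requests, leaving nearly half of the traffic spread across the 9,900 rarely accessed services in the tail.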
To scale, an isolation kernel must employ techniques to minimize the memory footprint of each service, including metadata maintained by the kernel. Since the full set of unpopular services will not fit in memory, the kernel must treat memory as a cache of popular services, swapping inactive services to disk. Zipf workloads exhibit poor cache hit rates [5], implying that swapping must be rapid to reduce the miss penalty of touching disk.
4. Modify the virtualized architecture for simplicity, scale, and
performance. Virtual machine monitors (VMMs), such as
Disco [6] and VM/370 [9], adhere to our first two
principles. These systems also strive to support legacy OSs by
precisely emulating the underlying hardware architecture. In our
view, the two goals of isolation and hardware emulation are
orthogonal. Isolation kernels decouple these goals by allowing the
virtual architecture to deviate from the underlying physical
architecture. By so doing, we can enhance properties such as
performance, simplicity, and scalability, while achieving the strong
isolation that VMMs provide.
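As one hypothetical example of such a deviation (illustrative only, not a description of Denali's actual interface), a virtual architecture can expose an operation with no physical counterpart, such as an idle instruction that also carries a timeout; real hardware offers only a halt instruction plus separately programmed timer interrupts.

    #include <stdint.h>

    /* Hypothetical virtual instruction with no hardware counterpart: the
     * guest yields the CPU until either the timeout expires or an I/O event
     * arrives. Folding halt and timer programming into one operation
     * simplifies the guest and lets the kernel schedule many mostly idle
     * guests cheaply, at the cost of exact hardware fidelity. */
    void vcpu_idle_with_timeout(uint64_t microseconds);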
The drawback of this approach is that it gives up support for unmodified legacy operating systems. We have chosen to focus on the systems issues of scalability and performance rather than backward compatibility with legacy OSs. However, we are currently implementing a port of the Linux operating system to the Denali virtual architecture; this port is still a work in progress.