High development velocity is critical for modern systems, but for the most part hasn't reached operating systems development. It is especially important for file systems, which need to cope with new storage devices and new usage patterns. To this end, we have developed Bento, a framework for high-velocity development of Linux kernel file systems. It enables file systems written in safe Rust to be installed in the Linux kernel, with errors largely sandboxed to the file system. Bento file systems can be replaced with no disruption to running applications, allowing daily or weekly upgrades in a cloud server setting. Bento also supports userspace debugging.
Development and deployment velocity is a critical aspect of modern cloud software development. High velocity delivers new features to customers more quickly, reduces integration and debugging costs, and enables a quick response to security vulnerabilities. However, this push for rapid development has not fully reached operating systems. In Linux, the most widely used cloud operating system, release cycles are still measured in months and years. Elsewhere in the cloud, new features are deployed weekly or even daily.
Slow Linux development can be attributed to several factors. Linux has a large code base with relatively few guardrails, with complicated internal interfaces that are easily misused. Combined with the inherent difficulty of programming correct concurrent code in C, this means that new code is very likely to have bugs. The lack of isolation between kernel modules means that these errors often have non-intuitive effects and are difficult to track down. The difficulty of implementing kernel-level debuggers and kernel testing frameworks makes this worse. The restricted and different kernel programming environment also limits the number of trained developers. Finally, upgrading a kernel module requires either rebooting the machine or restarting the relevant module, either way rendering the machine or module unavailable during the upgrade and forcing programs relying on the module to be stopped or moved to other machines. In the cloud setting, this forces kernel upgrades to be batched to meet cloud-level availability goals.
Recent changes in storage hardware (e.g., low-latency SSDs and NVM, but also density-optimized QLC SSDs and shingled disks) have made it increasingly important to have an agile storage stack. Likewise, application workload diversity and system management requirements (e.g., the need for container-level SLAs, or provenance tracking for security forensics) make feature velocity essential. Indeed, the failure of file systems to keep pace has led to perennial calls to replace file systems with blob stores that would likely face many of the same challenges despite having a simplified interface [1].
An alternative is to accept reduced performance in exchange for higher velocity. FUSE [2] is a widely used system for user-space file system development and deployment. However, FUSE can incur a significant performance overhead, particularly for metadata-heavy workloads [6]. We show that the same file system runs 7x slower on ‘git clone’ via FUSE than as a native kernel file system.
Our goal is to enable high-velocity development of kernel file systems without sacrificing performance, for existing kernels like Linux. Our trust model is that of a slightly harried kernel developer. This means supporting a user-friendly development environment, safety both within the file system and across external interfaces, effective testing mechanisms, fast debugging, incremental live upgrade, high performance, and generality of file system designs.
We built Bento, a framework for high-velocity development of Linux kernel file systems. Bento hooks into Linux as a VFS file system, but allows file systems to be dynamically loaded and replaced without unmounting or affecting running applications except for a short performance lag. As Bento runs in the kernel, it enables file systems to reuse well-developed Linux features, such as VFS caching, buffer management, and logging, as well as network communication. File systems are written in Rust, a type-safe, performant, non-garbage-collected language. Bento interposes thin layers around the Rust file system to provide safe interfaces for both calling into the file system and calling out to other kernel functions. Leveraging the existing Linux FUSE interface, a Bento file system can be compiled to run in user space by changing a build flag. Thus, most testing and debugging can take place at user level, with type safety limiting the frequency and scope of bugs when code is moved into the kernel. Because the file system is written against this interface, porting to a new Linux version requires only changes to Bento and not the file system itself. Bento additionally supports networked file systems using the kernel TCP stack. The code for Bento is available at https://gitlab.cs.washington.edu/sm237/bento.
High-velocity kernel development (including kernel file system development) is hard to come by. To start with, kernel modifications are notoriously difficult to get right. Kernel code paths are complex, and kernel interfaces are easy to misuse by accident. Worse, debugging kernel source code is much harder than user-level debugging. Upgrading kernel modules is also an intrusive operation. Any programs using the module must be killed before the module can be unloaded for the upgrade. To meet four- or five-nines application uptime service-level objectives [4], kernel changes need to be batched and applied en masse.
A challenge is that new kernel code is often buggy, with impact that can spread throughout the kernel. Table 1 shows an analysis of bug-fix git commits from 2014-2018 for three modules that modify core Linux functionality used by Docker containers: OverlayFS, AppArmor, and Open vSwitch Datapath. We divide bugs in these systems into two types. The first set consists of semantic bugs in the high-level correctness properties of each module. These can range from mission-critical errors to configuration errors, but generally impair just the functionality of the module. These accounted for 50% of the total bugs fixed in these modules.
The second set concerns low-level bugs that apply to any C language module, but when found in the kernel can potentially undermine the correctness or operation of the rest of the kernel. We categorized these as (1) memory bugs, such as NULL pointer dereferences, out-of-bounds errors, and memory leaks (68%); (2) concurrency bugs, such as deadlocks and race conditions (15%); and (3) type errors, such as incorrect usage of kernel types (e.g., interpreting error values as valid data) (17%). Many of these low-level bugs, particularly memory and type errors, result from inherent challenges of C code and could be prevented when using programming languages like Rust with more safety checks.
Upcall (FUSE [2]): One common technique, particularly for file systems and I/O devices, is to implement new functionality as a userspace server. A stub is left in the kernel that converts system calls to upcalls into the server. Filesystem in Userspace (FUSE) does this for file systems. As opposed to implementing new file system functionality directly in the kernel, this isolates the impact of errors to the userspace process. (Bugs can still affect file system functionality, of course.) Development is faster because engineers can use familiar debugging tools like gdb. All this comes at a performance cost, particularly for metadata operations [6]. Additionally, FUSE file systems can’t reuse kernel functionality, such as the buffer cache.
In-Kernel Interpreter: Another approach is to use an interpreter inside the kernel for a dynamically loaded program in a safe language. Linux supports eBPF (extended Berkeley Packet Filter) [3], an in-kernel virtual machine that allows code to be dynamically loaded and executed in the kernel at prespecified points defined by the kernel. eBPF is used heavily for packet filtering, system-call filtering, and kernel tracing. The idea is to allow kernel customization in a safe manner. The Linux eBPF virtual machine validates memory safety and execution termination before it JIT-compiles the virtual machine instructions into native machine code. As such, eBPF can sandbox untrusted extensions. However, the restrictions placed on eBPF programs, such as limited program size and loop restrictions, make it difficult to implement larger or more complex pieces of functionality.
Rust is a strongly-typed, memory-safe, data-race-free, non-garbage-collected language. With these properties, Rust is able to provide strong safety guarantees without high performance overhead or the performance unpredictability caused by garbage collectors. These provide useful building blocks for Bento.
The Rust type system restricts how objects can be created and cast, so if an object exists and is of a certain type, this guarantees that the memory backing the object is valid and correctly represents that type. Since raw pointers can be NULL and can be cast to nonequivalent types, dereferencing pointers and creating strongly typed objects from pointers is restricted.
Rust prevents most memory leaks by tracking the lifetime of objects. All objects must be owned by one variable at a time. When the variable owning an object goes out of scope, the lifetime of the object is over and the memory backing the object can be safely reclaimed. References allow other variables to refer to data without claiming ownership of the memory. References are either immutable or mutable, enabling read-only or read-write accesses, respectively; references cannot outlive the owner. Developers can provide custom functionality to be performed when an object goes out of scope by implementing the drop method. Leaking memory is not a safety violation in Rust, but memory leaks must be explicit instead of accidental.
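The ownership rules above can be illustrated with a small sketch. The type Tracked and the functions drop_demo and consume are hypothetical names introduced only for this example; the sketch counts how often drop runs when an owner goes out of scope, when ownership is moved, and when a value is leaked on purpose.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical object that bumps a shared counter when it is dropped.
pub struct Tracked {
    drops: Rc<Cell<u32>>,
}

impl Drop for Tracked {
    // Custom cleanup that runs when the owning variable goes out of scope.
    fn drop(&mut self) {
        self.drops.set(self.drops.get() + 1);
    }
}

/// Takes ownership of its argument; the value is dropped when this returns.
fn consume(_t: Tracked) {}

/// Returns the drop count observed after each of the three steps.
pub fn drop_demo() -> (u32, u32, u32) {
    let drops = Rc::new(Cell::new(0));

    {
        let owner = Tracked { drops: Rc::clone(&drops) };
        // An immutable reference reads the data without taking ownership.
        let borrowed = &owner;
        let _count_so_far = borrowed.drops.get();
    } // `owner` goes out of scope here, so `drop` runs once.
    let after_scope = drops.get();

    let moved = Tracked { drops: Rc::clone(&drops) };
    consume(moved); // Ownership moves into `consume`, which drops the value.
    let after_move = drops.get();

    let leaked = Tracked { drops: Rc::clone(&drops) };
    std::mem::forget(leaked); // An explicit leak: `drop` never runs.
    let after_forget = drops.get();

    (after_scope, after_move, after_forget)
}
```

Calling drop_demo() yields (1, 2, 2): the scoped and the moved object are each dropped exactly once, while the leak has to be requested explicitly through std::mem::forget.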
Data races are avoided by enforcing that all objects, except those that can be safely modified concurrently, must only have one mutable reference at a time. For non-thread-safe objects shared between threads, a synchronization mechanism such as locking must be used to safely obtain references. Acquiring the lock gives the caller access to the underlying data. Lock acquisition methods generally return a guard that unlocks the lock in drop, preventing the caller from forgetting to unlock. However, deadlocks, such as circular waiting for locks, are possible in safe Rust code.
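A minimal sketch of the guard pattern, using the standard library's Mutex rather than any kernel lock; the function name parallel_count is an assumption made for this example. The counter can only be reached through the guard returned by lock, and the lock is released when that guard is dropped.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Spawns `threads` workers that each add 1 to a shared counter
/// `increments` times, and returns the final value.
pub fn parallel_count(threads: usize, increments: usize) -> usize {
    let counter = Arc::new(Mutex::new(0usize));

    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..increments {
                    // `lock` returns a guard; the data is reachable only through it.
                    let mut guard = counter.lock().unwrap();
                    *guard += 1;
                    // The guard is dropped at the end of each iteration,
                    // which releases the lock without an explicit unlock call.
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }

    let total = *counter.lock().unwrap();
    total
}
```

Sharing a plain usize between the threads without the Mutex would be rejected at compile time, which is the property the paragraph above describes.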
Figure 1 shows the Bento architecture; the shaded portions are the Bento framework. Bento is a thin layer that, to the rest of Linux, operates like a normal VFS file system. The Linux kernel is unmodified other than the introduction of Bento. In turn, like VFS, Bento defines a set of function calls that Bento file systems implement and provides a mechanism for file systems to register themselves with the framework by exposing the necessary function pointers. Unlike VFS, Bento is designed to support file systems written in safe Rust.
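The registration idea can be sketched in plain Rust. This is not Bento's actual API: the trait FileSystemOps, the Registry type, and the toy HelloFs are hypothetical, and a trait object stands in for the table of function pointers that a file system would expose.

```rust
use std::collections::HashMap;

/// Hypothetical subset of the calls a file system would implement.
pub trait FileSystemOps {
    fn name(&self) -> &str;
    /// Returns the size of the file with inode `ino`, if it exists.
    fn getattr_size(&self, ino: u64) -> Option<u64>;
}

/// Hypothetical framework-side table of registered file systems.
pub struct Registry {
    mounted: HashMap<String, Box<dyn FileSystemOps>>,
}

impl Registry {
    pub fn new() -> Self {
        Registry { mounted: HashMap::new() }
    }

    /// Registers a file system; returns false if the name is already taken.
    pub fn register(&mut self, fs: Box<dyn FileSystemOps>) -> bool {
        let name = fs.name().to_string();
        if self.mounted.contains_key(&name) {
            return false;
        }
        self.mounted.insert(name, fs);
        true
    }

    /// Dispatches a call to the named file system through its trait object.
    pub fn getattr_size(&self, fs: &str, ino: u64) -> Option<u64> {
        self.mounted.get(fs)?.getattr_size(ino)
    }
}

/// Toy file system containing only an empty root directory (inode 1).
pub struct HelloFs;

impl FileSystemOps for HelloFs {
    fn name(&self) -> &str {
        "hello"
    }

    fn getattr_size(&self, ino: u64) -> Option<u64> {
        if ino == 1 { Some(0) } else { None }
    }
}
```

A caller would register Box::new(HelloFs) once and then route requests by name; a second registration under the same name is refused.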
Bento consists of three components. BentoFS is a standalone kernel module written in C that sits between VFS and the file system module. LibBentoFS and libBentoKS are Rust libraries that are compiled into the file system module to provide safe interfaces to the file system.
BentoFS: BentoFS is a standalone kernel module written in C that is interposed between VFS and the file-system module. It acts as a controller that manages running Bento file systems. When a file system is inserted, it registers with BentoFS. BentoFS manages kernel data structures on behalf of the file system and translates kernel calls into safer calls. BentoFS must be careful to abide by Rust’s memory safety for any memory shared with the file system.
LibBentoFS: LibBentoFS is a Rust library compiled into the file-system kernel module. It receives calls from BentoFS and translates the C-style calls into safe Rust calls. For example, libBentoFS replaces potentially NULL values with the Rust Option type.
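As an illustration of this kind of translation, the following sketch converts a C-style call with a possibly NULL string pointer and an integer return code into a safe Rust call that takes an Option and returns a Result. The function names c_lookup and lookup, the single known file name, and the inode number are assumptions for this example, not libBentoFS's real interface; the errno values are the standard Linux ones.

```rust
use std::ffi::{c_char, CStr};

const ENOENT: i32 = 2; // no such file or directory
const EINVAL: i32 = 22; // invalid argument

/// Safe side (hypothetical signature): a NULL name has already become `None`.
pub fn lookup(name: Option<&CStr>) -> Result<u64, i32> {
    match name {
        None => Err(EINVAL),
        Some(n) if n.to_bytes() == b"hello" => Ok(2), // toy inode number
        Some(_) => Err(ENOENT),
    }
}

/// C-style shim (hypothetical): translates raw pointers and negative
/// errno-style return codes to and from the safe call above.
///
/// # Safety
/// `name`, if non-null, must point to a valid NUL-terminated string, and
/// `ino_out`, if non-null, must point to writable memory for a `u64`.
pub unsafe extern "C" fn c_lookup(name: *const c_char, ino_out: *mut u64) -> i32 {
    // Replace the potentially NULL pointer with an Option.
    let name = if name.is_null() {
        None
    } else {
        // SAFETY: non-null, and the caller guarantees a valid C string.
        Some(unsafe { CStr::from_ptr(name) })
    };

    // SAFETY: the caller guarantees `ino_out` is either NULL or valid.
    let out = unsafe { ino_out.as_mut() };

    match (lookup(name), out) {
        (Ok(ino), Some(out)) => {
            *out = ino;
            0
        }
        (Ok(_), None) => -EINVAL,
        (Err(errno), _) => -errno,
    }
}
```

All unsafe code is confined to the shim; the file system side only ever sees an Option and cannot dereference a NULL name.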
LibBentoKS: LibBentoKS provides safe wrappers around kernel functions so the file system can access existing kernel functionality, such as block device access, synchronization mechanisms, and memory allocation to enable use of Rust’s alloc crate. To create safe wrappers, libBentoKS, for example, adds types to interfaces, requires references instead of pointers, and enforces locks around racy functions.
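The three wrapper techniques named above (adding types, requiring references, and enforcing locks) can be sketched as follows. Everything here is a stand-in: raw_read_block imitates a C-style kernel function and is not a real kernel call, and BlockDevice and BLOCK_SIZE are hypothetical names for this example.

```rust
use std::sync::Mutex;

pub const BLOCK_SIZE: usize = 16;
const EIO: i32 = 5; // I/O error

/// Stand-in for a C-style kernel function (hypothetical): raw pointers,
/// an integer return code, and no protection against concurrent callers.
unsafe fn raw_read_block(disk: *const u8, disk_len: usize, blk: usize, buf: *mut u8) -> i32 {
    let end = match blk.checked_add(1).and_then(|b| b.checked_mul(BLOCK_SIZE)) {
        Some(end) if end <= disk_len => end,
        _ => return -EIO,
    };
    // SAFETY: [end - BLOCK_SIZE, end) lies inside the disk, and the caller
    // guarantees that `buf` has room for BLOCK_SIZE bytes.
    unsafe { std::ptr::copy_nonoverlapping(disk.add(end - BLOCK_SIZE), buf, BLOCK_SIZE) };
    0
}

/// Safe wrapper: the lock is part of the type, so it cannot be skipped.
pub struct BlockDevice {
    disk: Mutex<Vec<u8>>,
}

impl BlockDevice {
    /// Creates an in-memory disk in which every byte of block `i` equals `i`.
    pub fn new(blocks: usize) -> Self {
        let data: Vec<u8> = (0..blocks * BLOCK_SIZE)
            .map(|i| (i / BLOCK_SIZE) as u8)
            .collect();
        BlockDevice { disk: Mutex::new(data) }
    }

    /// Takes a typed, fixed-size reference instead of a raw pointer and
    /// returns a Result instead of an integer code.
    pub fn read_block(&self, blk: usize, buf: &mut [u8; BLOCK_SIZE]) -> Result<(), i32> {
        let disk = self.disk.lock().unwrap();
        // SAFETY: `disk` stays alive and locked for the whole call, and `buf`
        // is a valid, exclusive buffer of exactly BLOCK_SIZE bytes.
        let rc = unsafe { raw_read_block(disk.as_ptr(), disk.len(), blk, buf.as_mut_ptr()) };
        if rc == 0 { Ok(()) } else { Err(-rc) }
    }
}
```

The buffer length is checked by the type system rather than by convention, and an out-of-range block number surfaces as Err(5) instead of a bare negative integer.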
Live Upgrade: Live upgrade is implemented using a component in BentoFS that enables a new file system to upgrade a running file system. It supports passing custom state across the upgrade. When an upgrade file system module is inserted, the component begins the upgrade procedure. First, all new operations to the file system are stopped. Then, an upgrade function is called in the old file system to trigger it to prepare for removal and create the state transfer data structure and return it to BentoFS. BentoFS passes the struct to a different upgrade function in the new file system, and the new file system initializes itself from the state. BentoFS replaces references to the old file system with those to the new file system and finally allows new operations to proceed.
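The upgrade sequence described above can be sketched as a small state machine. This is a simplified userspace model under stated assumptions, not BentoFS code: the trait UpgradableFs, the TransferState struct and its fields, the Controller, and SimpleFs are all hypothetical, and in-flight operations are ignored.

```rust
/// State handed from the old file system to the new one (hypothetical fields).
pub struct TransferState {
    pub next_inode: u64,
    pub open_files: Vec<u64>,
}

/// Hypothetical upgrade hooks that every file system version implements.
pub trait UpgradableFs {
    fn version(&self) -> u32;
    fn alloc_inode(&mut self) -> u64;
    /// Called on the old instance: prepare for removal and export state.
    fn prepare_upgrade(&mut self) -> TransferState;
    /// Called on the new instance: initialize from the transferred state.
    fn init_from_state(&mut self, state: TransferState);
}

/// Hypothetical controller that owns the active file system.
pub struct Controller {
    active: Box<dyn UpgradableFs>,
    accepting_ops: bool,
}

impl Controller {
    pub fn new(fs: Box<dyn UpgradableFs>) -> Self {
        Controller { active: fs, accepting_ops: true }
    }

    pub fn version(&self) -> u32 {
        self.active.version()
    }

    /// A file system operation; refused while an upgrade is in progress.
    pub fn create(&mut self) -> Option<u64> {
        if self.accepting_ops { Some(self.active.alloc_inode()) } else { None }
    }

    pub fn upgrade(&mut self, mut new_fs: Box<dyn UpgradableFs>) {
        self.accepting_ops = false; // 1. stop new operations
        let state = self.active.prepare_upgrade(); // 2. old version exports state
        new_fs.init_from_state(state); // 3. new version initializes itself
        self.active = new_fs; // 4. swap references; the old version is dropped
        self.accepting_ops = true; // 5. allow operations to proceed
    }
}

/// Toy file system used for both the old and the new version.
pub struct SimpleFs {
    version: u32,
    next_inode: u64,
    open_files: Vec<u64>,
}

impl SimpleFs {
    pub fn new(version: u32) -> Self {
        SimpleFs { version, next_inode: 1, open_files: Vec::new() }
    }
}

impl UpgradableFs for SimpleFs {
    fn version(&self) -> u32 {
        self.version
    }

    fn alloc_inode(&mut self) -> u64 {
        let ino = self.next_inode;
        self.next_inode += 1;
        self.open_files.push(ino);
        ino
    }

    fn prepare_upgrade(&mut self) -> TransferState {
        TransferState {
            next_inode: self.next_inode,
            open_files: std::mem::take(&mut self.open_files),
        }
    }

    fn init_from_state(&mut self, state: TransferState) {
        self.next_inode = state.next_inode;
        self.open_files = state.open_files;
    }
}
```

After an upgrade from version 1 to version 2, inode numbering continues where the old version stopped, which is the role the state transfer structure plays in the procedure above.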
Userspace Debugging: To support userspace debugging, we export the same interfaces in userspace and in the kernel so developers can switch targets using a build flag. Most of the interfaces exposed to the file system mirror existing userspace interfaces, such as the FUSE low-level interface for libBentoFS and Rust libraries for libBentoKS. We provide additional userspace libraries for interfaces that don't match with existing userspace libraries, such as block device access.
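One common way to switch targets with a build flag in Rust is conditional compilation. The sketch below assumes a Cargo feature named kernel, which is a hypothetical name and not necessarily the flag Bento uses; both backends are in-memory stand-ins so that the example compiles on its own.

```rust
/// Backend used when the (hypothetical) `kernel` feature is enabled.
#[cfg(feature = "kernel")]
mod backend {
    pub const TARGET: &str = "kernel";

    /// Stand-in: a real build would call a kernel block-device wrapper here.
    pub fn read_block(blk: u64) -> Vec<u8> {
        vec![blk as u8; 4]
    }
}

/// Backend used by default, for userspace testing and debugging.
#[cfg(not(feature = "kernel"))]
mod backend {
    pub const TARGET: &str = "userspace";

    /// Stand-in: a real build would read from a file or a device node here.
    pub fn read_block(blk: u64) -> Vec<u8> {
        vec![blk as u8; 4]
    }
}

/// File system code is written once against the shared interface.
pub fn first_byte_of(blk: u64) -> u8 {
    backend::read_block(blk)[0]
}

/// Reports which backend this build was compiled against.
pub fn build_target() -> &'static str {
    backend::TARGET
}
```

Because both modules export the same names and signatures, first_byte_of compiles unchanged for either target; only the feature selection differs between builds.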
We evaluated Bento on the performance of Bento-fs, a file system written using Bento, and on the impact of live upgrade on the availability of the system. Bento-fs is structured like the MIT xv6 [7] code and includes some optimizations from ext4 to provide performance similar to ext4. As baselines, we use ext4 with both data journaling (data=journal mode) and the default metadata journaling (data=ordered mode). We focus our evaluation on ext4 with data journaling because Bento-fs also implements data journaling.
Applications: Figure 3 shows the results for application workloads. Here, Bento-fs outperforms Bento-user (the same file system run in userspace via FUSE) by 4-36x. The difference is particularly noticeable for ‘untar’ because it involves many creates, which are particularly affected by slow block I/O from userspace. Bento-fs performs similarly to ext4-j (ext4 with data journaling) on ‘untar’, ‘tar’, and ‘git clone’ and 19% worse on ‘grep’. Relative to ext4-o (ext4 in data=ordered mode), Bento-fs performs 13% worse on ‘untar’ due to data journaling and the lack of delayed allocation.
Live Upgrade: To evaluate the impact of live upgrade, we created a directory with 400,000 files. For the experiment in Figure 2, we ran 10 threads that repeatedly issued and synced 64KB writes to random files. We upgraded to the version with provenance tracking after 0.5 seconds and ended the test after another 0.5 seconds. Results in the figure are smoothed over 5ms intervals. The file system experienced around 15ms of downtime during the upgrade, after which performance recovered.
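To put the measured pause in context, the following back-of-the-envelope sketch compares it with an availability budget. It assumes one upgrade per day and counts only the roughly 15ms pause reported for this one experiment, so it is an illustration of the arithmetic rather than a measured availability result; the function names are hypothetical.

```rust
/// Yearly downtime (in seconds) allowed by an availability target with the
/// given number of nines, e.g. 5 for 99.999%.
pub fn downtime_budget_secs(nines: u32) -> f64 {
    const SECONDS_PER_YEAR: f64 = 365.0 * 24.0 * 3600.0;
    SECONDS_PER_YEAR * 10f64.powi(-(nines as i32))
}

/// Yearly downtime (in seconds) caused by upgrades alone, given the pause
/// per upgrade in milliseconds and the number of upgrades per year.
pub fn upgrade_downtime_secs(per_upgrade_ms: f64, upgrades_per_year: u32) -> f64 {
    per_upgrade_ms / 1000.0 * upgrades_per_year as f64
}
```

Under these assumptions, daily upgrades add about 15ms x 365 = 5.5 seconds of pause per year, against a five-nines budget of about 315 seconds per year.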
Our principal result is that it is possible to make large improvements to Linux kernel development velocity. Rust and userspace execution sped up our own development process. Originally, along with the Rust version of Bento-fs, we tried to develop an in-kernel C version to directly measure the performance impact of Bento. However, this dramatically slowed down our development. We were more likely to write bugs in C, and each of the bugs took much longer to debug. In some instances, bugs that would have been compiler errors in Rust took hours to debug in C. Bento’s design also makes it easy to change Bento as needed. Only the shim layer will need to be updated to remain compatible with future versions of Linux. Support for Rust in the kernel is growing from both sides. Rust is continually improving its support for kernel projects, which can’t use the Rust standard library, and Linux kernel maintainers are taking steps toward integrating Rust modules [5].
We benefited from Linux’s existing support for FUSE. Since FUSE communicates with userspace, it uses a message-passing API between the FUSE kernel module and the userspace file system. This message-passing API was a good starting point because strict memory separation means there can be no unsafe memory sharing. Rust’s linear type system allowed us to relax FUSE’s strict memory separation and pass pointers across the interfaces for performance while still ensuring safe memory sharing.
We also benefited from being able to reuse existing kernel modules such as the jbd2 journal module, buffer cache, and TCP stack. We found it fairly easy to write safe wrappers around these modules, often mirroring existing userspace Rust libraries. Even if our wrappers have some bugs, they likely have far fewer bugs than if we had tried to rewrite the modules from scratch. Reusing existing modules gives us an incremental path to developing Rust modules: a new safe module can be inserted without requiring other parts of the kernel to be modified.
There are also a number of open questions and avenues for future work. So far, we have been able to write safe interfaces around kernel functions without too much complexity and without a large impact on performance. However, it is not clear if that will hold for all parts of the kernel. Are there fundamental tradeoffs in safe kernel interfaces? Future work can explore this question as we apply the ideas from Bento to other interfaces across the kernel, such as the networking stack, the scheduler, and the memory manager.