We live in an age of rapid innovation: every year there is a new iPhone, three new cryptocurrencies, five new AWS services, and a hundred new JavaScript frameworks. We have been conditioned to expect shiny new things, to the point where many people equate “new” with “better”.
Under the shiny exterior of our technology, though, some old and dusty infrastructure is carrying a weight it was never designed to hold. This infrastructure, while proving itself remarkably more flexible and stable than our most optimistic predictions, is showing signs of rust and old age, making maintenance and development more challenging year after year. As concerning as that is, it pales in comparison with another problem: the world has changed, and the infrastructure has not kept up with it. Core design assumptions have been voided, rendering whole architectures obsolete and faulty.
If these statements seem exaggerated, take a minute to look up the age of several well-known projects: Linux is almost 30 years old; Postgres and MySQL are over 25. The core of “modern” clouds such as AWS, GCP, and Azure is 15 years old, if not older. In those years, many things have changed in both the hardware landscape and the use cases being served. To understand the ramifications of these changes, let’s dive into one interesting example: I/O.
Twenty years ago, at the dawn of the internet age, CPU speed was measured in MHz, RAM size in MB, and disk size in GB. Since then, there has been a roughly 1000-fold increase in compute power, RAM size, and disk space; now we measure in GHz and TB. But that is not the whole story. Bandwidth and latency have changed too, and far less evenly: storage latency dropped from ~10ms (magnetic disks) to ~10µs (NVMe), while RAM and network latency improved only by a relatively negligible factor of about 4x. Meanwhile, network bandwidth improved ~1000x and storage bandwidth ~10x, while RAM bandwidth only increased by a factor of about 4.
| Ratio (2001 → 2021) | Latency | Bandwidth |
| --- | --- | --- |
| RAM : storage | 1E6 → 1E3 | 40 → 2 |
| RAM : network | 2E5 → 1E5 | 40 → 1 |
| Network : storage | 20 → 0.1 | 2/3 → 4 |
The table above summarizes the changes in relative speeds and clearly shows how the balance has completely shifted, rendering design choices that made sense in 2001 practically senseless. For example, if in 2001 it was sensible to fetch data from the RAM of a remote server, in 2021 you want to replicate that data and read it locally from disk: not only will your latency be roughly 10x lower, you will also enjoy the superior bandwidth of network and disk while saving precious RAM bandwidth. This alone was enough to invalidate many traditional system designs and to spawn a generation of distributed systems.
On the other hand, a far more devious problem was lurking in the shadows: the change in the relative speeds of disk and RAM has invalidated the basic design of operating systems and file systems.
Linux, being modeled after UNIX, has long followed the traditional blocking, buffered, and cached I/O model. In essence, every I/O operation is indirect and served from a special memory pool known as the “page cache”. When we write, Linux first writes the data to a page in the page cache and asynchronously flushes it to disk later. This is why we have the notorious and costly fsync and sync syscalls, which force Linux to write the data to disk immediately. Likewise, when we read, Linux consults the page cache for an existing copy of the data and only fetches from disk if the data is stale or missing.
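As a concrete illustration, here is a minimal sketch using plain POSIX calls (the file path is arbitrary): a successful write() only means the data reached the page cache, and it takes an explicit fsync() to force it onto the disk.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    // Open (or create) a file for writing; buffered I/O goes through the page cache.
    int fd = open("/tmp/pagecache-demo", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char msg[] = "hello, page cache\n";
    // write() returns as soon as the data is copied into the page cache;
    // at this point it may still live only in RAM.
    if (write(fd, msg, sizeof(msg) - 1) < 0) { perror("write"); return 1; }

    // fsync() blocks until the kernel has flushed the file's dirty pages
    // (and metadata) to the underlying device -- this is the costly part.
    if (fsync(fd) < 0) { perror("fsync"); return 1; }

    close(fd);
    return 0;
}
```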
This design has a lot of advantages, like merging read and write operations, but the real benefit is that operations are executed against a much faster medium: RAM. With RAM so much faster than disk, why fuss over the complex control mechanisms and semantics of a non-blocking API? A simple blocking API made a lot of sense from both safety and ease-of-use perspectives. It kept making sense right up until disks became so much faster…
As usual, the devil was in the details: the blocking API came with a price, the overhead of a context switch. Context switches happen when a userspace program blocks and the kernel switches to another task (among other reasons). A context switch may take 1-5µs on a modern computer. That is completely negligible next to a 10ms magnetic disk seek, where it amounts to well under 0.1% of the operation, but disastrously expensive next to the 10µs latency of a modern NVMe drive, where the same 1-5µs is a sizeable fraction (10-50%) of the I/O itself.
This problem alone was enough to send myriad programmers scrambling to write async I/O frameworks, but to no avail: until recently, Linux did not have a proper non-blocking I/O API that worked with disks (it does now; check out io_uring!), leading to horrible hacks and compromises.
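To make that concrete, here is a minimal sketch of a non-blocking disk read using io_uring through the liburing helper library (the file path and buffer size are arbitrary, and error handling is trimmed for brevity):

```cpp
// Build with: g++ demo.cc -luring
#include <liburing.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    struct io_uring ring;
    // A tiny submission/completion queue pair is enough for this demo.
    if (io_uring_queue_init(8, &ring, 0) < 0) { perror("io_uring_queue_init"); return 1; }

    int fd = open("/etc/hostname", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    char buf[4096];
    // Queue a read of up to 4 KiB from offset 0; nothing blocks yet.
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
    io_uring_submit(&ring);

    // A real application would do other work here and reap completions later;
    // for the demo we simply wait for the single completion.
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    printf("read %d bytes\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}
```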
As if the context switch situation weren't bad enough, copying every block of data from disk to kernel memory, from there to userspace memory, and back again consumes large portions of our precious RAM bandwidth, which has now become the new bottleneck of many systems. It also means that all tasks implicitly share memory, with all the locking overhead that comes with it. With datasets growing rapidly, CPU core counts climbing, and disks and networks speeding up relative to RAM, using RAM as a scratch space for data has become a dead end. If only we could read directly from disk and bypass the page cache entirely...
But wait! Don’t we have O_DIRECT in Linux (direct I/O), which can be used in combination with non-blocking I/O? Well, not exactly. As it turns out, O_DIRECT was never properly defined, and no major file system supports it well except XFS; even in XFS, its non-blocking semantics are not supported for all operations. Database vendors started listing XFS as the recommended filesystem, and some went further still, using raw disks and bypassing the filesystem entirely.
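To give a feel for why O_DIRECT is awkward to use, here is a rough sketch that bypasses the page cache; the 4 KiB alignment and the file path are assumptions for the demo, and some filesystems (e.g. tmpfs) reject O_DIRECT outright:

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   // O_DIRECT is a GNU extension
#endif
#include <fcntl.h>
#include <unistd.h>
#include <cstdlib>
#include <cstring>
#include <cstdio>

int main() {
    // O_DIRECT bypasses the page cache: reads and writes go straight to the device.
    int fd = open("/tmp/direct-io-demo", O_RDWR | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open(O_DIRECT)"); return 1; }

    // Buffer, length, and file offset must all be aligned (typically to the
    // device's logical block size, assumed here to be 4 KiB) or the kernel
    // rejects the request with EINVAL.
    const size_t kAlign = 4096;
    void *buf = nullptr;
    if (posix_memalign(&buf, kAlign, kAlign) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }
    memset(buf, 'x', kAlign);

    if (pwrite(fd, buf, kAlign, 0) < 0) { perror("pwrite"); return 1; }

    free(buf);
    close(fd);
    return 0;
}
```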
These solutions pushed the limits of Linux performance for 20 years, but the quest to squeeze every last bit of speed from Linux I/O appears to have exhausted itself. With core counts now approaching 100 and the CPU-RAM gap still growing, the coordination cost between CPU cores has become another limiting factor for I/O; after all, storage is a shared resource. Technologies like NUMA aim to mitigate the RAM problem by sharding memory and giving each CPU ownership of a shard and the data in it.
If we follow this trend to its logical conclusion, with NVMe promising near-RAM speeds in the foreseeable future, it is not far-fetched to assume that the day when CPUs have dedicated disks is not far off.
In recent years there have been many attempts to improve the situation, ranging from relatively minor kernel changes to a complete redesign of the entire stack. Two projects that stand out are OSv and Seastar.
The OSv unikernel project dispensed with the userspace/kernel-space split entirely, solving the context switch problem by simply eliminating context switches. However, replacing the kernel and its common interfaces is a radical approach, and applications still need to be redesigned to reap the benefits.
The Seastar project, from the same authors, took a somewhat less controversial route by creating what can be functionally described as a limited-scope kernel-in-userspace. Seastar is an async C++ framework for low latency applications; it takes over all of the machine's resources and manages them itself, bypassing the kernel: it has its own CPU and I/O schedulers as well as its own memory management. The page cache is circumvented, as I/O is always O_DIRECT (with XFS as the only supported filesystem) and non-blocking (to the extent Linux allows).
The biggest architectural difference is that Seastar does not permit sharing resources between cores: every core is essentially a separate, isolated process, and communication between cores must be explicit, using messages. This shard-per-core architecture is NUMA-friendly, avoids locking, and enables further improvements when the hardware allows direct interaction (for example, DPDK). In principle, the monolithic machine is treated as a distributed cluster of individual computers; after all, a distributed system is characterized by the relative latency of messages and the probability of message faults, and there is no functional difference between a stall/pause and a network partition.
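As a rough sketch of what that explicit message passing looks like, here is a tiny Seastar program; the names used (app_template, smp::submit_to, this_shard_id) are current Seastar APIs, but treat the details as approximate rather than canonical:

```cpp
#include <seastar/core/app-template.hh>
#include <seastar/core/smp.hh>
#include <seastar/core/future.hh>
#include <iostream>

int main(int argc, char** argv) {
    seastar::app_template app;
    return app.run(argc, argv, [] {
        // Each shard owns its own data; to touch shard 1's state we send it
        // a message (here, a lambda) instead of sharing memory.
        // Assumes the app runs with at least two shards (e.g. --smp 2).
        return seastar::smp::submit_to(1, [] {
            std::cout << "hello from shard " << seastar::this_shard_id() << "\n";
        });
    });
}
```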
With hardware slowly transforming into a modern version of a blade server and usage patterns reminiscent of mainframes, perhaps it is time to rethink the architecture of our operating systems and the services built on top of them. Consider the POSIX APIs we hold so dear: how many of the world's estimated 30 million developers use them directly? How many would notice if we hijacked runtime calls and wrote to a remote blob store instead of disks?
Given that in many cases we are already doing exactly that at the driver level with virtual disks, my guess would be: not many. By taking these virtualization-based solutions up a level or two, into userspace and runtimes, we can gain much more in both performance and semantics; the POSIX API forces us to pretend our distributed systems are centralized and monolithic, so we must hide errors from users and deal with them in suboptimal ways. In our blob storage example, the blocking POSIX API will not allow proper timeouts, and even with io_uring we still lose the information about which errors are retryable and which are not.
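To make the semantic gap concrete, here is a purely hypothetical sketch (nothing like this exists in POSIX or io_uring today) of what a storage-aware read result could carry, compared to the single byte count or errno that read() gives us:

```cpp
#include <chrono>
#include <cstddef>
#include <string>

// Hypothetical: what a runtime talking to remote blob storage could surface.
// POSIX read() collapses all of this into a byte count or -1 plus errno.
struct ReadResult {
    std::size_t bytes_read = 0;            // how much data actually arrived
    bool ok = false;                       // did the operation succeed
    bool retryable = false;                // e.g. timeout/throttling vs. "object not found"
    std::chrono::milliseconds elapsed{0};  // observed latency, useful for deadline budgets
    std::string error_detail;              // the remote store's real error, not a squashed errno
};
```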
Computers have dealt with various distributed systems problems for a long time, but they mostly did so at the hardware level, presenting a monolithic facade to the programmer.
This design made sense when a sole CPU reigned as the uncontested ruler and coordinator of all operations, with RAM as its trusted servant. The days of absolute CPU monarchy are behind us, with many lesser CPU lieges known as “cores” taking their place, while the once peripheral devices now claim a throne of their own.
What will the future hold for these fallen monarchs? How much longer will they be able to delegate communication with the barbaric peripherals to their servant RAM? The time is nigh for a revolution!