All sessions will be held in Santa Clara Ballroom unless otherwise noted.
Papers are available for download below to registered attendees now and to everyone beginning Tuesday, February 27. Paper abstracts are available to everyone now. Copyright to the individual works is retained by the author[s].
Proceedings Front Matter
Proceedings Cover | Title Page and List of Organizers | Message from the Program Co-Chairs | Table of Contents
Papers and Proceedings
The full Proceedings published by USENIX for the conference are available for download below. Individual papers can also be downloaded from their respective presentation pages. Copyright to the individual works is retained by the author[s].
9:00 am–9:15 am
Opening Remarks and Awards
Program Co-Chairs: Xiaosong Ma, Qatar Computing Research Institute, Hamad Bin Khalifa University, and Youjip Won, Korea Advanced Institute of Science and Technology (KAIST)
9:15 am–10:15 am
Keynote Address
Lessons Learnt in Trying to Build New Storage Technologies
Dr. Antony Rowstron, Microsoft Research
A decade ago, cloud storage was dominated by tape, HDD, and flash. Today this is still true, but will it hold in 2034? For over a decade at Microsoft Research Cambridge, we have been trying to build new storage technologies for the cloud. This started with wondering how we could build extremely low-cost HDD-based archival storage; the challenges and frustrations of trying to do this became the opportunity to really think about how to build cloud-scale archival storage from the media up. We picked glass, and in Project Silica we have been working on the technologies to make glass-based archival storage real. If we had known how hard Project Silica was going to be, we might never have started, but nearly a decade on, we now have a set of principles and thoughts on how to build novel storage systems from the media up. I will share some principles that we learnt along the way and also talk about how we are thinking about creating other future storage technologies.
Antony Rowstron, Microsoft Research
Ant is a Distinguished Engineer at Microsoft Research, Cambridge, UK, leading a team looking at future hardware technologies for the cloud across storage, networking, and computing, most of which focus on new optical technologies. The best-known project is probably Project Silica, which is trying to use glass for long-term archival storage. Ant is a systems researcher at heart who has spent most of his career working at the intersection of storage, networking, and distributed systems, and he is best known as one of the original inventors of structured overlays, or Distributed Hash Tables (DHTs), with Pastry, and of the first large-scale key-value storage system (PAST, SOSP '01). In 2016, he was awarded the ACM SIGOPS Mark Weiser Award, and in 2021, the ACM EuroSys Lifetime Achievement Award. In September 2020, he was elected a Fellow of the Royal Academy of Engineering.
10:15 am–10:45 am
Break with Refreshments
Mezzanine East/West
10:45 am–12:00 pm
Distributed Storage
Session Chair: Raju Rangaswami, Florida International University
TeRM: Extending RDMA-Attached Memory with SSD
Zhe Yang, Qing Wang, Xiaojian Liao, and Youyou Lu, Tsinghua University; Keji Huang, Huawei Technologies Co., Ltd; Jiwu Shu, Tsinghua University
RDMA-based in-memory storage systems offer high performance but are restricted by the capacity of physical memory. In this paper, we propose TeRM to extend RDMA-attached memory with SSD. TeRM achieves fast remote access to the SSD-extended memory by eliminating RDMA NIC and CPU page faults from the critical path. We also introduce a set of techniques to reduce the consumption of CPU and network resources. Evaluation shows that TeRM performs close to the ideal upper bound, where all pages are pinned in physical memory. Compared with existing approaches, TeRM significantly improves the performance of unmodified RDMA-based storage systems, including a file system and a key-value system.
Combining Buffered I/O and Direct I/O in Distributed File Systems
Yingjin Qian, Data Direct Networks; Marc-André Vef, Johannes Gutenberg University Mainz; Patrick Farrell and Andreas Dilger, Whamcloud Inc.; Xi Li and Shuichi Ihara, Data Direct Networks; Yinjin Fu, Sun Yat-Sen University; Wei Xue, Tsinghua University and Qinghai University; André Brinkmann, Johannes Gutenberg University Mainz
Direct I/O allows I/O requests to bypass the Linux page cache and was introduced over 20 years ago as an alternative to the default buffered I/O mode. However, high-performance computing (HPC) applications still mostly rely on buffered I/O, even if direct I/O could perform better in a given situation. This is because users tend to use the I/O mode they are most familiar with. Moreover, with complex distributed file systems and applications, it is often unclear which I/O mode to use.
In this paper, we show under which conditions each I/O mode is beneficial and present a new transparent approach that dynamically switches between the two I/O modes within the file system. Its decision is based not only on the I/O size but also on file lock contention and memory constraints. As an exemplary case, we implemented our design in the Lustre client and server and extended it with additional features, e.g., delayed allocation. Under various conditions and real-world workloads, our approach achieved up to 3× higher throughput than the original Lustre and outperformed other distributed file systems that include varying degrees of direct I/O support by up to 13×.
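The following minimal, self-contained sketch (not from the paper, which works inside the Lustre client and server) illustrates the two Linux I/O modes being compared: a buffered read served through the page cache, and the same read issued with O_DIRECT, which bypasses the cache but requires aligned buffers, offsets, and sizes.

```c
/* Side-by-side sketch of the two I/O modes: a buffered read served through
 * the Linux page cache, and the same read issued with O_DIRECT, which
 * bypasses the cache. O_DIRECT typically requires the buffer, file offset,
 * and request size to be aligned to the device's logical block size. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    const size_t len = 1 << 20;                  /* 1 MiB, 4 KiB-aligned */
    char *buf;
    if (posix_memalign((void **)&buf, 4096, len) != 0) {
        perror("posix_memalign");
        return 1;
    }

    /* Buffered I/O: data is staged in (and may be served from) the page cache. */
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open (buffered)"); return 1; }
    ssize_t n = pread(fd, buf, len, 0);
    printf("buffered read: %zd bytes\n", n);
    close(fd);

    /* Direct I/O: the same read goes straight to storage, skipping the cache. */
    fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open (O_DIRECT)"); return 1; }
    n = pread(fd, buf, len, 0);                  /* aligned buffer, offset, size */
    printf("direct read:   %zd bytes\n", n);
    close(fd);

    free(buf);
    return 0;
}
```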
OmniCache: Collaborative Caching for Near-storage Accelerators
Jian Zhang and Yujie Ren, Rutgers University; Marie Nguyen, Samsung; Changwoo Min, Igalia; Sudarsun Kannan, Rutgers University
We propose OmniCache, a novel caching design for near-storage accelerators that combines near-storage and host memory capabilities to accelerate I/O and data processing. First, OmniCache introduces a "near-cache" approach, maximizing data access to the nearest cache for I/O and processing operations. Second, OmniCache presents collaborative caching for concurrent I/O and data processing using host and device caches. Third, OmniCache incorporates dynamic, model-driven offloading support, which actively monitors hardware and software metrics for efficient processing across host and device processors. Finally, OmniCache explores the extensibility of CXL, a newly introduced memory expansion technology. Evaluation of OmniCache demonstrates significant performance gains of up to 3.24× for I/O workloads and 3.06× for data processing workloads.
12:00 pm–2:00 pm
Lunch (on your own)
2:00 pm–3:15 pm
Caching
Session Chair: Carl Waldspurger, Carl Waldspurger Consulting
Symbiosis: The Art of Application and Kernel Cache Cooperation
Yifan Dai, Jing Liu, Andrea Arpaci-Dusseau, and Remzi Arpaci-Dusseau, University of Wisconsin—Madison
We introduce Symbiosis, a framework for key-value storage systems that dynamically configures application and kernel cache sizes to improve performance. We integrate Symbiosis into three production systems — LevelDB, WiredTiger, and RocksDB — and, through a series of experiments on various read-heavy workloads and environments, show that Symbiosis improves performance by 1.5× on average and over 5× at best compared to static configurations, across a wide range of synthetic and real-world workloads.
Optimizing File Systems on Heterogeneous Memory by Integrating DRAM Cache with Virtual Memory Management
Yubo Liu, Yuxin Ren, Mingrui Liu, Hongbo Li, Hanjun Guo, Xie Miao, and Xinwei Hu, Huawei Technologies Co., Ltd.; Haibo Chen, Huawei Technologies Co., Ltd. and Shanghai Jiao Tong University
This paper revisits the usage of DRAM cache in DRAM-PM heterogeneous memory file systems. With a comprehensive analysis of existing file systems with cache-based and DAX-based designs, we show that both suffer from suboptimal performance due to excessive data movement. To this end, this paper presents a cache management layer atop heterogeneous memory, namely FLAC, which integrates DRAM cache with virtual memory management. FLAC is further incorporated with two techniques, called zero-copy caching and parallel-optimized cache management, which facilitate fast data transfer between file systems and applications as well as efficient data synchronization/migration between DRAM and PM. We further design and implement a library file system upon FLAC, called FlacFS. Micro-benchmarks show that FlacFS provides up to two orders of magnitude performance improvement over existing file systems in file read/write. With real-world applications, FlacFS achieves up to 10.6 and 9.9 times performance speedup over state-of-the-art DAX-based and cache-based file systems, respectively.
Kosmo: Efficient Online Miss Ratio Curve Generation for Eviction Policy Evaluation
Kia Shakiba, Sari Sultan, and Michael Stumm, University of Toronto
In-memory caches play an important role in reducing the load on backend storage servers for many workloads. Miss ratio curves (MRCs) are an important tool for configuring these caches with respect to cache size and eviction policy. MRCs provide insight into the trade-off between cache size (and thus costs) and miss ratio for a specific eviction policy. Over the years, many MRC-generation algorithms have been developed. However, to date, only Miniature Simulations is capable of efficiently generating MRCs for popular eviction policies, such as Least Frequently Used (LFU), First-In-First-Out (FIFO), 2Q, and Least Recently/Frequently Used (LRFU), that do not adhere to the inclusion property. One critical downside of Miniature Simulations is that it incurs significant memory overhead, precluding its use for online cache analysis at runtime in many cases.
In this paper, we introduce Kosmo, an MRC generation algorithm that allows for the simultaneous generation of MRCs for a variety of eviction policies that do not adhere to the inclusion property. We evaluate Kosmo using 52 publicly accessible cache access traces with a total of roughly 126 billion accesses. Compared to Miniature Simulations configured with 100 simulated caches, Kosmo has lower memory overhead by a factor of 3.6 on average (and as high as 36) and higher throughput by a factor of 1.3, making it far more suitable for online MRC generation.
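For readers unfamiliar with MRC generation, the toy sketch below replays a key trace against several LRU caches of different sizes and prints one (size, miss ratio) point per cache, which is the brute-force idea behind simulation-based approaches such as Miniature Simulations. Kosmo's actual algorithm and data structures are different and far more memory-efficient; the cache sizes and the naive O(n) LRU here are illustrative assumptions only.

```c
/* Toy sketch of simulation-based MRC generation: replay one access trace
 * (one key per line on stdin) against several independently sized LRU
 * caches and print a (cache size, miss ratio) point for each. */
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NUM_SIZES 4
static const size_t cache_sizes[NUM_SIZES] = {64, 256, 1024, 4096};

typedef struct {
    char **keys;              /* keys[0] is the most recently used entry */
    size_t capacity, used;
    long hits, misses;
} lru_t;

static void lru_access(lru_t *c, const char *key)
{
    for (size_t i = 0; i < c->used; i++) {
        if (strcmp(c->keys[i], key) == 0) {   /* hit: move entry to front */
            char *k = c->keys[i];
            memmove(&c->keys[1], &c->keys[0], i * sizeof(char *));
            c->keys[0] = k;
            c->hits++;
            return;
        }
    }
    c->misses++;                              /* miss: evict LRU entry if full */
    if (c->used == c->capacity)
        free(c->keys[--c->used]);
    memmove(&c->keys[1], &c->keys[0], c->used * sizeof(char *));
    c->keys[0] = strdup(key);
    c->used++;
}

int main(void)
{
    lru_t caches[NUM_SIZES];
    for (int i = 0; i < NUM_SIZES; i++) {
        caches[i].keys = calloc(cache_sizes[i], sizeof(char *));
        caches[i].capacity = cache_sizes[i];
        caches[i].used = 0;
        caches[i].hits = caches[i].misses = 0;
    }

    char line[256];
    while (fgets(line, sizeof(line), stdin)) {
        line[strcspn(line, "\n")] = '\0';     /* strip trailing newline */
        for (int i = 0; i < NUM_SIZES; i++)
            lru_access(&caches[i], line);
    }

    /* Each (size, miss ratio) pair is one point on the MRC. */
    for (int i = 0; i < NUM_SIZES; i++) {
        long total = caches[i].hits + caches[i].misses;
        printf("size=%zu miss_ratio=%.4f\n", cache_sizes[i],
               total ? (double)caches[i].misses / total : 0.0);
    }
    return 0;
}
```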
3:15 pm–3:45 pm
Break with Refreshments
Mezzanine East/West
3:45 pm–5:00 pm
File Systems
Session Chair: Peter Macko, MongoDB
I/O Passthru: Upstreaming a flexible and efficient I/O Path in Linux
Kanchan Joshi, Anuj Gupta, Javier González, Ankit Kumar, Krishna Kanth Reddy, Arun George, and Simon Lund, Samsung Semiconductor; Jens Axboe, Meta Platforms Inc.
New storage interfaces continue to emerge rapidly for Non-Volatile Memory Express (NVMe) storage. Fitting these innovations into the general-purpose I/O stack of operating systems has been challenging and time-consuming. The NVMe standard is no longer limited to block I/O, but Linux I/O advances have historically centered on the block-I/O path. The lack of scalable OS interfaces risks slowing the adoption of new storage innovations.
We introduce I/O Passthru, a new I/O path that has made its way into the mainline Linux kernel. The key ingredients of this new path are the NVMe char interface and io_uring command. In this paper, we present our experience building and upstreaming I/O Passthru and report on how it helps consume new NVMe innovations without changes to the Linux kernel. We provide experimental results to (i) compare its efficiency against the existing io_uring block path and (ii) demonstrate its flexibility by integrating data placement into CacheLib. FIO peak-performance workloads show 16–40% higher IOPS than the block path.
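As a rough illustration of the passthrough path (not the authors' code), the sketch below submits a single NVMe read through the per-namespace character device via io_uring's IORING_OP_URING_CMD. It assumes a liburing and kernel version with uring_cmd support, device node /dev/ng0n1, namespace ID 1, and a 512-byte LBA size; error handling is minimal.

```c
/* Rough sketch (assumptions noted above): one NVMe Read submitted through
 * the NVMe char device via io_uring passthrough (IORING_OP_URING_CMD). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <linux/nvme_ioctl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define LBA_SIZE 512          /* assumed logical block size */
#define NUM_LBAS 8

int main(void)
{
    struct io_uring ring;
    struct io_uring_params p;
    memset(&p, 0, sizeof(p));
    p.flags = IORING_SETUP_SQE128 | IORING_SETUP_CQE32;   /* big SQEs/CQEs */
    if (io_uring_queue_init_params(8, &ring, &p) < 0) {
        fprintf(stderr, "io_uring_queue_init_params failed\n");
        return 1;
    }

    int fd = open("/dev/ng0n1", O_RDONLY);     /* char device, not /dev/nvme0n1 */
    if (fd < 0) { perror("open /dev/ng0n1"); return 1; }

    void *buf = NULL;
    if (posix_memalign(&buf, 4096, LBA_SIZE * NUM_LBAS) != 0)
        return 1;

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    memset(sqe, 0, sizeof(*sqe));
    sqe->opcode = IORING_OP_URING_CMD;
    sqe->fd = fd;
    sqe->cmd_op = NVME_URING_CMD_IO;

    /* The NVMe command itself is carried in the extended SQE area. */
    struct nvme_uring_cmd *cmd = (struct nvme_uring_cmd *)sqe->cmd;
    memset(cmd, 0, sizeof(*cmd));
    cmd->opcode   = 0x02;                      /* NVMe Read */
    cmd->nsid     = 1;                         /* assumed namespace ID */
    cmd->addr     = (uint64_t)(uintptr_t)buf;
    cmd->data_len = LBA_SIZE * NUM_LBAS;
    cmd->cdw10    = 0;                         /* starting LBA, low 32 bits */
    cmd->cdw11    = 0;                         /* starting LBA, high 32 bits */
    cmd->cdw12    = NUM_LBAS - 1;              /* number of LBAs, zero-based */

    io_uring_submit(&ring);

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    printf("NVMe status: %d\n", cqe->res);     /* 0 indicates success */
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    close(fd);
    free(buf);
    return 0;
}
```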
Metis: File System Model Checking via Versatile Input and State Exploration
Yifei Liu and Manish Adkar, Stony Brook University; Gerard Holzmann, Nimble Research; Geoff Kuenning, Harvey Mudd College; Pei Liu, Scott A. Smolka, Wei Su, and Erez Zadok, Stony Brook University
We present Metis, a model-checking framework designed for versatile, thorough, yet configurable file system testing in the form of input and state exploration. It uses a nondeterministic loop and a weighting scheme to decide which system calls and which arguments to execute. Metis features a new abstract state representation for file-system states in support of efficient and effective state exploration. While exploring states, it compares the behavior of a file system under test against a reference file system and reports any discrepancies; it also provides support to investigate and reproduce any that are found. We also developed RefFS, a small, fast file system that serves as a reference, with special features designed to accelerate model checking and enhance bug reproducibility. Experimental results show that Metis can flexibly generate test inputs, and the rate at which it explores file-system states scales nearly linearly across multiple nodes. RefFS explores states 3–28× faster than other, more mature file systems. Metis aided the development of RefFS, reporting 11 bugs that we subsequently fixed. Metis further identified 12 bugs in five other file systems; five of these were confirmed, and one was fixed and integrated into Linux.
RFUSE: Modernizing Userspace Filesystem Framework through Scalable Kernel-Userspace Communication
Kyu-Jin Cho, Jaewon Choi, Hyungjoon Kwon, and Jin-Soo Kim, Seoul National University
With the advancement of storage devices and the increasing scale of data, filesystem design has transformed in response to this progress. However, implementing new features within an in-kernel filesystem is a challenging task due to development complexity and code security concerns. As an alternative, userspace filesystems are gaining attention, owing to their ease of development and reliability. FUSE is a renowned framework that allows users to develop custom filesystems in userspace. However, the complex internal stack of FUSE leads to notable performance overhead, which becomes even more prominent in modern hardware environments with high-performance storage devices and a large number of cores.
In this paper, we present RFUSE, a novel userspace filesystem framework that utilizes scalable message communication between the kernel and userspace. RFUSE employs a per-core ring buffer structure as a communication channel and effectively minimizes transmission overhead caused by context switches and request copying. Furthermore, RFUSE enables users to utilize existing FUSE-based filesystems without making any modifications. Our evaluation results indicate that RFUSE demonstrates comparable throughput to in-kernel filesystems on high-performance devices while exhibiting high scalability in both data and metadata operations.
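To make the framework concrete, here is the classic "hello" filesystem written against the libfuse 2.x high-level API, i.e., the kind of unmodified FUSE-based filesystem that RFUSE is described as running as-is. It is a generic libfuse sketch, not code from the paper.

```c
/* Minimal read-only FUSE filesystem using the libfuse 2.x high-level API.
 * Build (assuming libfuse 2.x is installed):
 *   gcc hello_fs.c `pkg-config fuse --cflags --libs` -o hello_fs
 * Every getattr/readdir/read call below is forwarded from the kernel FUSE
 * module to this process; RFUSE replaces that kernel-userspace channel with
 * per-core ring buffers while keeping this programming interface unchanged. */
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <errno.h>
#include <string.h>
#include <sys/stat.h>

static const char *hello_path = "/hello";
static const char *hello_str  = "Hello from userspace!\n";

static int hello_getattr(const char *path, struct stat *st)
{
    memset(st, 0, sizeof(*st));
    if (strcmp(path, "/") == 0) {
        st->st_mode  = S_IFDIR | 0755;
        st->st_nlink = 2;
    } else if (strcmp(path, hello_path) == 0) {
        st->st_mode  = S_IFREG | 0444;
        st->st_nlink = 1;
        st->st_size  = strlen(hello_str);
    } else {
        return -ENOENT;
    }
    return 0;
}

static int hello_readdir(const char *path, void *buf, fuse_fill_dir_t filler,
                         off_t offset, struct fuse_file_info *fi)
{
    (void)offset; (void)fi;
    if (strcmp(path, "/") != 0)
        return -ENOENT;
    filler(buf, ".", NULL, 0);
    filler(buf, "..", NULL, 0);
    filler(buf, hello_path + 1, NULL, 0);     /* strip leading '/' */
    return 0;
}

static int hello_read(const char *path, char *buf, size_t size, off_t offset,
                      struct fuse_file_info *fi)
{
    (void)fi;
    if (strcmp(path, hello_path) != 0)
        return -ENOENT;
    size_t len = strlen(hello_str);
    if ((size_t)offset >= len)
        return 0;
    if (offset + size > len)
        size = len - offset;
    memcpy(buf, hello_str + offset, size);
    return (int)size;
}

static struct fuse_operations hello_ops = {
    .getattr = hello_getattr,
    .readdir = hello_readdir,
    .read    = hello_read,
};

int main(int argc, char *argv[])
{
    return fuse_main(argc, argv, &hello_ops, NULL);
}
```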
6:00 pm–7:30 pm
FAST '24 Poster Session and Reception
Mezzanine East/West
Check out the cool new ideas and the latest preliminary research on display at the Poster Session and Reception. Take part in discussions with your colleagues over complimentary food and beverages. View the complete list of accepted posters.
9:00 am–10:15 am
Flash Storage
Session Chair: Jooyoung Hwang, Samsung Electronics
The Design and Implementation of a Capacity-Variant Storage System
Ziyang Jiao and Xiangqun Zhang, Syracuse University; Hojin Shin and Jongmoo Choi, Dankook University; Bryan S. Kim, Syracuse University
We present the design and implementation of a capacity-variant storage system (CVSS) for flash-based solid-state drives (SSDs). CVSS aims to maintain high performance throughout the lifetime of an SSD by allowing storage capacity to gracefully reduce over time, thus preventing fail-slow symptoms. CVSS comprises three key components: (1) CV-SSD, an SSD that minimizes write amplification and gracefully reduces its exported capacity with age; (2) CV-FS, a log-structured file system for elastic logical partitions; and (3) CV-manager, a user-level program that orchestrates system components based on the state of the storage system. We demonstrate the effectiveness of CVSS with synthetic and real workloads, showing significant improvements in latency, throughput, and lifetime compared to a fixed-capacity storage system. Specifically, under real workloads, CVSS reduces latency, improves throughput, and extends lifetime by 8–53%, 49–316%, and 268–327%, respectively.
I/O in a Flash: Evolution of ONTAP to Low-Latency SSDs
Matthew Curtis-Maury, Ram Kesavan, Bharadwaj V R, Nikhil Mattankot, Vania Fang, Yash Trivedi, Kesari Mishra, and Qin Li, NetApp, Inc
Flash-based persistent storage media are capable of sub-millisecond latency I/O. However, a storage architecture optimized for spinning drives may contain software delays that make it impractical for use with such media. The NetApp® ONTAP® storage system was designed originally for spinning drives, and needed alterations before it was productized as an all-SSD system. In this paper, we focus on the changes made to the read I/O path over the last several years, which have been crucial to this transformation, and present them in chronological fashion together with the associated performance analysis.
We Ain't Afraid of No File Fragmentation: Causes and Prevention of Its Performance Impact on Modern Flash SSDs
Yuhun Jun, Sungkyunkwan University and Samsung Electronics Co., Ltd.; Shinhyun Park, Sungkyunkwan University; Jeong-Uk Kang, Samsung Electronics Co., Ltd.; Sang-Hoon Kim, Ajou University; Euiseong Seo, Sungkyunkwan University
Awarded Best Paper!
A few studies reported that fragmentation still adversely affects the performance of flash solid-state disks (SSDs), particularly through request splitting. This research investigates fragmentation-induced performance degradation across three levels: the kernel I/O path, the host-storage interface, and flash memory accesses inside SSDs. Our analysis reveals that, contrary to assertions in existing literature, the primary cause of the degraded performance is not request splitting but a significant increase in die-level collisions. In SSDs, when other writes come between writes of neighboring file blocks, the file blocks are not placed on consecutive dies, resulting in random die allocation. This randomness escalates the chances of die-level collisions, causing deteriorated read performance later. We also show that this can happen when a file is overwritten. To counteract this, we propose an NVMe command extension combined with a page-to-die allocation algorithm designed to ensure that contiguous blocks always land on successive dies, even in the face of file fragmentation or overwrites. Evaluations with commercial SSDs and an SSD emulator indicate that our approach effectively curtails the read performance drop arising from both fragmentation and overwrites, all without the need for defragmentation. For example, when a 162 MB SQLite database was fragmented into 10,011 pieces, our approach limited the performance drop to 3.5%, while the conventional system experienced a 40% decline.
10:15 am–10:45 am
Break with Refreshments
Mezzanine East/West
10:45 am–11:45 am
Work-in-Progress Reports (WiPs)
Session Chair: Ali R. Butt, Virginia Tech
View the list of accepted Work-in-Progress Reports.
11:45 am–12:00 pm
FAST '24 Test of Time Award Presentation
SFS: Random Write Considered Harmful in Solid State Drives
Changwoo Min, Kangnyeon Kim, Hyunjin Cho, Sang-Won Lee, and Young Ik Eom
Published in the Proceedings of the 10th USENIX Conference on File and Storage Technologies, February 2012
12:00 pm–2:00 pm
Conference Luncheon
Terra Courtyard
2:00 pm–3:40 pm
Key-Value Systems
Session Chair: Angela Demke Brown, University of Toronto
In-Memory Key-Value Store Live Migration with NetMigrate
Zeying Zhu, University of Maryland; Yibo Zhao, Boston University; Zaoxing Liu, University of Maryland
Distributed key-value stores today require frequent key-value shard migration between nodes to react to dynamic workload changes for load balancing, data locality, and service elasticity. In this paper, we propose NetMigrate, a live migration approach for in-memory key-value stores based on programmable network data planes. NetMigrate migrates shards between nodes with zero service interruption and minimal performance impact. During migration, the switch data plane monitors the migration process in a fine-grained manner and directs client queries to the right server in real time, eliminating the overhead of pulling data between nodes. We implement a NetMigrate prototype on a testbed consisting of a programmable switch and several commodity servers running Redis, and evaluate it under YCSB workloads. Our experiments demonstrate that NetMigrate improves query throughput by 6.5% to 416% and maintains low access latency during migration, compared to state-of-the-art migration approaches.
IONIA: High-Performance Replication for Modern Disk-based KV Stores
Yi Xu, University of California, Berkeley; Henry Zhu, University of Illinois Urbana-Champaign; Prashant Pandey, University of Utah; Alex Conway, Cornell Tech and VMware Research; Rob Johnson, VMware Research; Aishwarya Ganesan and Ramnatthan Alagappan, University of Illinois Urbana-Champaign and VMware Research
We introduce IONIA, a novel replication protocol tailored for modern SSD-based write-optimized key-value (WO-KV) stores. Unlike existing replication approaches, IONIA carefully exploits the unique characteristics of SSD-based WO-KV stores. First, it exploits their interface characteristics to defer parallel execution to the background, enabling high-throughput writes that complete in a single round trip (RTT). IONIA also exploits SSD-based KV stores' performance characteristics to scalably read at any replica without enforcing writes to all replicas, thus providing scalability without compromising write availability; further, it does so while completing most reads in one RTT. IONIA is the first protocol to achieve these properties, and it does so through its storage-aware design. We evaluate IONIA extensively to show that it achieves the above properties under a variety of workloads.
Physical vs. Logical Indexing with IDEA: Inverted Deduplication-Aware Index
Asaf Levi, Technion - Israel Institute of Technology; Philip Shilane, Dell Technologies; Sarai Sheinvald, Braude College of Engineering; Gala Yadgar, Technion - Israel Institute of Technology
In the realm of information retrieval, the need to maintain reliable term-indexing has grown more acute in recent years, with vast amounts of ever-growing online data searched by a large number of search-engine users and used for data mining and natural language processing. At the same time, an increasing portion of primary storage systems employ data deduplication, where duplicate logical data chunks are replaced with references to a unique physical copy.
We show that indexing deduplicated data with deduplication-oblivious mechanisms might result in extreme inefficiencies: the index size would increase in proportion to the logical data size, regardless of its duplication ratio, consuming excessive storage and memory and slowing down lookups. In addition, the logically sequential accesses during index creation would be transformed into random and redundant accesses to the physical chunks. Indeed, to the best of our knowledge, term indexing is not supported by any deduplicating storage system.
In this paper, we propose the design of a deduplication-aware term-index that addresses these challenges. IDEA maps terms to the unique chunks that contain them, and maps each chunk to the files in which it is contained. This basic design concept improves the index performance and can support advanced functionalities such as inline indexing, result ranking, and proximity search. Our prototype implementation based on Lucene (the search engine at the core of Elasticsearch) shows that IDEA can reduce the index size and indexing time by up to 73% and 94%, respectively, and reduce term-lookup latency by up to 82% and 59% for single and multi-term queries, respectively.
MIDAS: Minimizing Write Amplification in Log-Structured Systems through Adaptive Group Number and Size Configuration
Seonggyun Oh, Jeeyun Kim, and Soyoung Han, DGIST; Jaeho Kim, Gyeongsang National University; Sungjin Lee, DGIST; Sam H. Noh, Virginia Tech
Log-structured systems are widely used in various applications because of their high write throughput. However, high garbage collection (GC) cost is widely regarded as the primary obstacle to their wider adoption. There have been numerous attempts to alleviate GC overhead, but they rely on ad-hoc designs. This paper introduces MiDAS, which minimizes GC overhead in a systematic and analytic manner. It employs a chain-like structure of multiple groups, automatically segregating data blocks by age. It employs analytical models, the Update Interval Distribution (UID) and a Markov-Chain-based Analytical Model (MCAM), to dynamically adjust the number of groups as well as their sizes according to the workload I/O patterns, thereby minimizing the movement of data blocks. Furthermore, MiDAS isolates hot blocks into a dedicated HOT group, whose size is dynamically adjusted according to the workload to minimize overall WAF. Our experiments using simulations and a proof-of-concept prototype for flash-based SSDs show that MiDAS outperforms state-of-the-art GC techniques, offering 25% lower WAF and 54% higher throughput while consuming less memory and fewer CPU cycles.
3:40 pm–4:10 pm
Break with Refreshments
Mezzanine East/West
4:10 pm–5:10 pm
Panel
Storage Systems in the LLM Era
Moderator: Keith A. Smith, MongoDB
Panelists: Greg Ganger, Carnegie Mellon University; Dean Hildebrand, Google; Glenn Lockwood, Microsoft; Nisha Talagala, Pyxeda; Zhe Zhang, AnyScale
This one-hour panel discussion will be centered around the new challenges and opportunities brought by revolutionary AI technologies to the storage community in terms of research, system development, management, and education. Panelists will include experienced researchers and entrepreneurs from multiple storage-related systems fields. The panel will be moderated by Keith Smith from MongoDB.
Greg Ganger, Carnegie Mellon University
Greg Ganger is the Jatras Professor of ECE and CS (by courtesy) at Carnegie Mellon University (CMU). Since 2001, he has also served as the Director of CMU's Parallel Data Laboratory (PDL) research center focused on data storage and processing systems. He has broad research interests in computer systems, including storage/file systems, cloud computing, ML systems, distributed systems, and operating systems. He earned his collegiate degrees from the University of Michigan and did a postdoc at MIT before joining CMU. He still loves playing basketball... he's lost a step but developed a sweet 3-point shot.
Dean Hildebrand, Google
Dean Hildebrand is a Technical Director for storage in the Google Cloud Office of the CTO. His interests span the spectrum of distributed storage, spending an inordinate amount of time on NFS, GPFS, and most recently DAOS. Prior to Google, Dean was a Principal Research Staff Member at IBM Research. He completed his Ph.D. in computer science from the University of Michigan in 2007.
Glenn Lockwood, Microsoft
Glenn K. Lockwood is a Principal Engineer at Microsoft, where he is responsible for supporting Microsoft's largest AI supercomputers through workload-driven systems design. His work has focused on applied research and development in extreme-scale and parallel computing systems for high-performance computing, and he has specific expertise in scalable architectures, performance modeling, and emerging technologies for I/O and storage. Prior to joining Microsoft, Glenn led the design and validation of several large-scale storage systems, including the world's first 30+ PB all-NVMe Lustre file system for the Perlmutter supercomputer at NERSC. He has also authored numerous peer-reviewed papers and reports on HPC storage topics and contributed to various reviews and advisory roles in U.S. and European research programs. He holds a Ph.D. in Materials Science from Rutgers University.
Nisha Talagala, Pyxeda
Nisha Talagala is the CEO and founder of AIClub.World, which is bringing AI Literacy to K-12 students and individuals worldwide. Nisha has significant experience in introducing technologies like Artificial Intelligence to new learners from students to professionals. Previously, Nisha co-founded ParallelM (acquired by DataRobot), which pioneered the MLOps practice of managing Machine Learning in production for enterprises. Nisha is a recognized leader in the operational machine learning space, having also driven the USENIX Operational ML Conference, the first industry/academic conference on production AI/ML. Nisha was previously a Fellow at SanDisk and a Fellow/Lead Architect at Fusion-io, where she worked on innovation in non-volatile memory technologies and applications. Nisha has over 20 years of expertise in enterprise software development, distributed systems, technical strategy, and product leadership. She has worked as technology lead for server flash at Intel—where she led server platform non-volatile memory technology development, storage-memory convergence, and partnerships. Prior to Intel, Nisha was the CTO of Gear6, where she designed and built clustered computing caches for high-performance I/O environments. Nisha earned her Ph.D. at UC Berkeley where she did research on clusters and distributed systems. Nisha holds 75 patents in distributed systems and software, over 25 refereed research publications, is a frequent speaker at industry and academic events, and is a contributing writer to Forbes and other publications.
Zhe Zhang, AnyScale
Zhe Zhang is currently the Head of Open Source Engineering (Ray.io project) at Anyscale. Before Anyscale, Zhe spent 4.5 years at LinkedIn, where he managed the Hadoop/Spark infrastructure team. He has been working on open source for about 10 years; he's a committer and PMC member of the Apache Hadoop project and a member of the Apache Software Foundation.
9:00 am–10:15 am
Cloud Storage
Session Chair: Young-ri Choi, UNIST (Ulsan National Institute of Science and Technology)
What's the Story in EBS Glory: Evolutions and Lessons in Building Cloud Block Store
Weidong Zhang, Erci Xu, Qiuping Wang, Xiaolu Zhang, Yuesheng Gu, Zhenwei Lu, Tao Ouyang, Guanqun Dai, Wenwen Peng, Zhe Xu, Shuo Zhang, Dong Wu, Yilei Peng, Tianyun Wang, Haoran Zhang, Jiasheng Wang, Wenyuan Yan, Yuanyuan Dong, Wenhui Yao, Zhongjie Wu, Lingjun Zhu, Chao Shi, Yinhu Wang, Rong Liu, Junping Wu, Jiaji Zhu, and Jiesheng Wu, Alibaba Group
Awarded Best Paper!
In this paper, we qualitatively and quantitatively discuss the design choices, production experience, and lessons in building the Elastic Block Storage (EBS) at Alibaba Cloud over the past decade. To cope with hardware advancement and users' demands, we shift our focus from design simplicity in EBS1 to high performance and space efficiency in EBS2, and finally to reducing network traffic amplification in EBS3.
In addition to the architectural evolutions, we also summarize the lessons and experiences in development as four topics, including: (i) achieving high elasticity in latency, throughput, IOPS and capacity; (ii) improving availability by minimizing the blast radius of individual, regional, and global failure events; (iii) identifying the motivations and key tradeoffs in various hardware offloading solutions; and (iv) identifying the pros/cons of the alternative solutions and explaining why seemingly promising ideas would not work in practice.
ELECT: Enabling Erasure Coding Tiering for LSM-tree-based Storage
Yanjing Ren and Yuanming Ren, The Chinese University of Hong Kong; Xiaolu Li and Yuchong Hu, Huazhong University of Science and Technology; Jingwei Li, University of Electronic Science and Technology of China; Patrick P. C. Lee, The Chinese University of Hong Kong
Given the skewed nature of practical key-value (KV) storage workloads, distributed KV stores can adopt a tiered approach to support fast data access in a hot tier and persistent storage in a cold tier. To provide data availability guarantees for the hot tier, existing distributed KV stores often rely on replication and incur prohibitively high redundancy overhead. Erasure coding provides a low-cost redundancy alternative, but incurs high access performance overhead. We present ELECT, a distributed KV store that enables erasure coding tiering based on the log-structured merge tree (LSM-tree), by adopting a hybrid redundancy approach that carefully combines replication and erasure coding with respect to the LSM-tree layout. ELECT incorporates hotness awareness and selectively converts data from replication to erasure coding in the hot tier and offloads data from the hot tier to the cold tier. It also provides a tunable approach to balance the trade-off between storage savings and access performance through a single user-configurable parameter. We implemented ELECT atop Cassandra, which is replication-based. Experiments on Alibaba Cloud show that ELECT achieves significant storage savings in the hot tier, while maintaining high performance and data availability guarantees, compared with Cassandra.
MinFlow: High-performance and Cost-efficient Data Passing for I/O-intensive Stateful Serverless Analytics
Tao Li, Yongkun Li, and Wenzhe Zhu, University of Science and Technology of China; Yinlong Xu, Anhui Province Key Laboratory of High Performance Computing, University of Science and Technology of China; John C. S. Lui, The Chinese University of Hong Kong
Serverless computing has revolutionized application deployment, obviating traditional infrastructure management and dynamically allocating resources on demand. A significant use case is I/O-intensive applications such as data analytics, which widely employ the pivotal "shuffle" operation. Unfortunately, the shuffle operation poses severe challenges due to the massive PUT/GET requests to remote storage, especially in high-parallelism scenarios, leading to substantial performance degradation and high storage cost. Existing designs optimize data-passing performance from multiple aspects, but they operate in isolation, thus still introducing unforeseen performance bottlenecks and bypassing untapped optimization opportunities. In this paper, we develop MinFlow, a holistic data-passing framework for I/O-intensive serverless analytics jobs. MinFlow first rapidly generates numerous feasible multi-level data-passing topologies with far fewer PUT/GET operations; it then leverages an interleaved partitioning strategy to divide the topology DAG into small bipartite sub-graphs to optimize function scheduling, further reducing the data transmitted to remote storage by over half. Moreover, MinFlow develops a precise model to determine the optimal configuration, thus minimizing data-passing time under practical function deployments. We implement a prototype of MinFlow, and extensive experiments show that MinFlow significantly outperforms state-of-the-art systems, FaaSFlow and Lambada, in both job completion time and storage cost.
10:15 am–10:45 am
Break with Refreshments
Mezzanine East/West
10:45 am–12:00 pm
AI and Storage
Session Chair: Patrick P. C. Lee, The Chinese University of Hong Kong (CUHK)
COLE: A Column-based Learned Storage for Blockchain Systems
Ce Zhang and Cheng Xu, Hong Kong Baptist University; Haibo Hu, Hong Kong Polytechnic University; Jianliang Xu, Hong Kong Baptist University
Blockchain systems suffer from high storage costs, as every node needs to store and maintain the entire blockchain data. After investigating Ethereum's storage, we find that the storage cost mostly comes from the index, i.e., the Merkle Patricia Trie (MPT). To support provenance queries, MPT persists the index nodes during data updates, which adds substantial storage overhead. To reduce the storage size, an initial idea is to leverage the emerging learned index technique, which has been shown to have a smaller index size and more efficient query performance. However, directly applying it to blockchain storage results in even higher overhead, owing to the requirement of persisting index nodes and the learned index's large node size. To tackle this, we propose COLE, a novel column-based learned storage for blockchain systems. We follow the column-based database design to contiguously store each state's historical values, which are indexed by learned models to facilitate efficient data retrieval and provenance queries. We develop a series of write-optimized strategies to realize COLE in disk environments. Extensive experiments are conducted to validate the performance of the proposed COLE system. Compared with MPT, COLE reduces the storage size by up to 94% while improving the system throughput by 1.4×–5.4×.
Baleen: ML Admission & Prefetching for Flash Caches
Daniel Lin-Kit Wong, Carnegie Mellon University; Hao Wu, Meta; Carson Molder, UT Austin; Sathya Gunasekar, Jimmy Lu, Snehal Khandkar, and Abhinav Sharma, Meta; Daniel S. Berger, Microsoft and University of Washington; Nathan Beckmann and Gregory R. Ganger, Carnegie Mellon University
Flash caches are used to reduce peak backend load for throughput-constrained data center services, reducing the total number of backend servers required. Bulk storage systems are a large-scale example, backed by high-capacity but low-throughput hard disks, and using flash caches to provide a more cost-effective storage layer underlying everything from blobstores to data warehouses.
However, flash caches must address the limited write endurance of flash by limiting the long-term average flash write rate to avoid premature wearout. To do so, most flash caches must use admission policies to filter cache insertions and maximize the workload-reduction value of each flash write.
The Baleen flash cache uses coordinated ML admission and prefetching to reduce peak backend load. After learning painful lessons with our early ML policy attempts, we exploit a new cache residency model (which we call episodes) to guide model training. We focus on optimizing for an end-to-end system metric (Disk-head Time) that measures backend load more accurately than IO miss rate or byte miss rate. Evaluation using Meta traces from seven storage clusters shows that Baleen reduces Peak Disk-head Time (and hence the number of backend hard disks required) by 12% over state-of-the-art policies for a fixed flash write rate constraint. Baleen-TCO, which chooses an optimal flash write rate, reduces our estimated total cost of ownership (TCO) by 17%. Code and traces are available at https://www.pdl.cmu.edu/CILES/.
Seraph: Towards Scalable and Efficient Fully-external Graph Computation via On-demand Processing
Tsun-Yu Yang, Yizou Chen, Yuhong Liang, and Ming-Chang Yang, The Chinese University of Hong Kong
Fully-external graph computation systems exhibit optimal scalability by computing ever-growing, large-scale graphs with a constant amount of memory on a single machine. In particular, they keep the entire massive graph data in storage and iteratively load parts of it into memory for computation. Nevertheless, despite the merit of optimal scalability, their unreasonably low efficiency often makes them uncompetitive, and even impractical, compared with other types of graph computation systems. The key rationale is that most existing fully-external graph computation systems over-emphasize retrieving graph data from storage through sequential access. Although this principle achieves high storage bandwidth, it often leads to reading excessive and irrelevant data, which can severely degrade their overall efficiency.
Therefore, this work presents Seraph, a fully-external graph computation system that achieves optimal Scalability while delivering satisfactory Efficiency. Particularly, inspired by modern storage devices offering comparable sequential and random access speeds, Seraph adopts the principle of on-demand processing, accessing only the necessary graph data to save I/O while enjoying the decent speed of random access. On the basis of this principle, Seraph further devises three practical designs that bring an excellent performance leap to fully-external graph computation: 1) a hybrid format to represent the graph data, striking a good balance between I/O amount and access locality; 2) vertex passing to enable efficient vertex updates on top of the hybrid format; and 3) selective pre-computation to re-use loaded data for I/O reduction. Our evaluations reveal that Seraph notably outperforms other state-of-the-art fully-external systems under all the evaluated billion-scale graphs and representative graph algorithms, by up to two orders of magnitude.