All sessions will be held in Santa Clara Ballroom unless otherwise noted.
Papers are available for download below to registered attendees now and to everyone beginning Tuesday, February 27. Paper abstracts are available to everyone now. Copyright to the individual works is retained by the author[s].
Proceedings Front Matter
Proceedings Cover | Title Page and List of Organizers | Message from the Program Co-Chairs | Table of Contents
Papers and Proceedings
The full Proceedings published by USENIX for the conference are available for download below. Individual papers can also be downloaded from their respective presentation pages. Copyright to the individual works is retained by the author[s].
9:00 am–9:15 am
Opening Remarks and Awards
Program Co-Chairs: Xiaosong Ma, Qatar Computing Research Institute, Hamad Bin Khalifa University, and Youjip Won, Korea Advanced Institute of Science and Technology (KAIST)
9:15 am–10:15 am
Keynote Address
Lessons Learnt in Trying to Build New Storage Technologies
Dr. Antony Rowstron, Microsoft Research
A decade ago, cloud storage was dominated by tape, HDD, and flash. Today this is still true, but will it hold in 2034? For over a decade at Microsoft Research Cambridge, we have been trying to build new storage technologies for the cloud. This started with wondering how we could build extremely low-cost HDD-based archival storage; the challenges and frustrations of trying to do this became the opportunity to really think about how to build cloud-scale archival storage from the media up. We picked glass, and in Project Silica we have been working on the technologies to make glass-based archival storage real. If we had known how hard Project Silica was going to be, we might never have started, but nearly a decade on, we now have a set of principles and thoughts on how to build novel storage systems from the media up. I will share some principles that we learnt along the way and also talk about how we are thinking about creating other future storage technologies.
Antony Rowstron, Microsoft Research
Ant is a Distinguished Engineer at Microsoft Research, Cambridge, UK, leading a team looking at future hardware technologies for the cloud across storage, networking, and computing, most of which focus on new optical technologies. The best-known project is probably Project Silica, which is trying to use glass for long-term archival storage. Ant is a systems researcher at heart who has spent most of his career working at the intersection of storage, networking, and distributed systems, and he is best known as one of the original inventors of structured overlays, or Distributed Hash Tables (DHTs), with Pastry, and of the first large-scale key-value storage system (PAST, SOSP '01). In 2016, he was awarded the ACM SIGOPS Mark Weiser Award, and in 2021, the ACM EuroSys Lifetime Achievement Award. In September 2020, he was elected a Fellow of the Royal Academy of Engineering.
10:15 am–10:45 am
Break with Refreshments
Mezzanine East/West
10:45 am–12:00 pm
Distributed Storage
Session Chair: Raju Rangaswami, Florida International University
TeRM: Extending RDMA-Attached Memory with SSD
Zhe Yang, Qing Wang, Xiaojian Liao, and Youyou Lu, Tsinghua University; Keji Huang, Huawei Technologies Co., Ltd; Jiwu Shu, Tsinghua University
RDMA-based in-memory storage systems offer high performance but are restricted by the capacity of physical memory. In this paper, we propose TeRM to extend RDMA-attached memory with SSD. TeRM achieves fast remote access to the SSD-extended memory by eliminating RDMA NIC and CPU page faults from the critical path. We also introduce a set of techniques to reduce the consumption of CPU and network resources. Evaluation shows that TeRM performs close to the ideal upper bound, where all pages are pinned in physical memory. Compared with existing approaches, TeRM significantly improves the performance of unmodified RDMA-based storage systems, including a file system and a key-value system.
Combining Buffered I/O and Direct I/O in Distributed File Systems
Yingjin Qian, Data Direct Networks; Marc-André Vef, Johannes Gutenberg University Mainz; Patrick Farrell and Andreas Dilger, Whamcloud Inc.; Xi Li and Shuichi Ihara, Data Direct Networks; Yinjin Fu, Sun Yat-Sen University; Wei Xue, Tsinghua University and Qinghai University; André Brinkmann, Johannes Gutenberg University Mainz
Direct I/O allows I/O requests to bypass the Linux page cache and was introduced over 20 years ago as an alternative to the default buffered I/O mode. However, high-performance computing (HPC) applications still mostly rely on buffered I/O, even if direct I/O could perform better in a given situation. This is because users tend to use the I/O mode they are most familiar with. Moreover, with complex distributed file systems and applications, it is often unclear which I/O mode to use.
In this paper, we show under which conditions each I/O mode is beneficial and present a new transparent approach that dynamically switches between the two I/O modes within the file system. Its decision is based not only on the I/O size but also on file lock contention and memory constraints. As an exemplary case, we implemented our design in the Lustre client and server and extended it with additional features, e.g., delayed allocation. Under various conditions and real-world workloads, our approach achieved up to 3× higher throughput than the original Lustre and outperformed other distributed file systems that include varying degrees of direct I/O support by up to 13×.
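The following minimal, self-contained sketch (not from the paper, which works inside the Lustre client and server) illustrates the two Linux I/O modes being compared: a buffered read served through the page cache, and the same read issued with O_DIRECT, which bypasses the cache but requires aligned buffers, offsets, and sizes.

```c
/* Side-by-side sketch of the two I/O modes: a buffered read served through
 * the Linux page cache, and the same read issued with O_DIRECT, which
 * bypasses the cache. O_DIRECT typically requires the buffer, file offset,
 * and request size to be aligned to the device's logical block size. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    const size_t len = 1 << 20;                  /* 1 MiB, 4 KiB-aligned */
    char *buf;
    if (posix_memalign((void **)&buf, 4096, len) != 0) {
        perror("posix_memalign");
        return 1;
    }

    /* Buffered I/O: data is staged in (and may be served from) the page cache. */
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open (buffered)"); return 1; }
    ssize_t n = pread(fd, buf, len, 0);
    printf("buffered read: %zd bytes\n", n);
    close(fd);

    /* Direct I/O: the same read goes straight to storage, skipping the cache. */
    fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open (O_DIRECT)"); return 1; }
    n = pread(fd, buf, len, 0);                  /* aligned buffer, offset, size */
    printf("direct read:   %zd bytes\n", n);
    close(fd);

    free(buf);
    return 0;
}
```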
OmniCache: Collaborative Caching for Near-storage Accelerators
Jian Zhang and Yujie Ren, Rutgers University; Marie Nguyen, Samsung; Changwoo Min, Igalia; Sudarsun Kannan, Rutgers University
We propose OmniCache, a novel caching design for near-storage accelerators that combines near-storage and host memory capabilities to accelerate I/O and data processing. First, OmniCache introduces a "near-cache" approach, maximizing data access to the nearest cache for I/O and processing operations. Second, OmniCache presents collaborative caching for concurrent I/O and data processing using host and device caches. Third, OmniCache incorporates dynamic, model-driven offloading support, which actively monitors hardware and software metrics for efficient processing across host and device processors. Finally, OmniCache explores the extensibility of CXL, a newly introduced memory expansion technology. Evaluation of OmniCache demonstrates significant performance gains of up to 3.24× for I/O workloads and 3.06× for data processing workloads.
12:00 pm–2:00 pm
Lunch (on your own)
2:00 pm–3:15 pm
Caching
Session Chair: Carl Waldspurger, Carl Waldspurger Consulting
Symbiosis: The Art of Application and Kernel Cache Cooperation
Yifan Dai, Jing Liu, Andrea Arpaci-Dusseau, and Remzi Arpaci-Dusseau, University of Wisconsin—Madison
We introduce Symbiosis, a framework for key-value storage systems that dynamically configures application and kernel cache sizes to improve performance. We integrate Symbiosis into three production systems — LevelDB, WiredTiger, and RocksDB — and, through a series of experiments on various read-heavy workloads and environments, show that Symbiosis improves performance by 1.5× on average and over 5× at best compared to static configurations, across a wide range of synthetic and real-world workloads.
Optimizing File Systems on Heterogeneous Memory by Integrating DRAM Cache with Virtual Memory Management
Yubo Liu, Yuxin Ren, Mingrui Liu, Hongbo Li, Hanjun Guo, Xie Miao, and Xinwei Hu, Huawei Technologies Co., Ltd.; Haibo Chen, Huawei Technologies Co., Ltd. and Shanghai Jiao Tong University
This paper revisits the usage of DRAM cache in DRAM-PM heterogeneous memory file systems. With a comprehensive analysis of existing file systems with cache-based and DAX-based designs, we show that both suffer from suboptimal performance due to excessive data movement. To this end, this paper presents a cache management layer atop heterogeneous memory, namely FLAC, which integrates DRAM cache with virtual memory management. FLAC is further incorporated with two techniques, called zero-copy caching and parallel-optimized cache management, which facilitate fast data transfer between file systems and applications as well as efficient data synchronization/migration between DRAM and PM. We further design and implement a library file system upon FLAC, called FlacFS. Micro-benchmarks show that FlacFS provides up to two orders of magnitude performance improvement over existing file systems in file read/write. With real-world applications, FlacFS achieves up to 10.6 and 9.9 times performance speedup over state-of-the-art DAX-based and cache-based file systems, respectively.
Kosmo: Efficient Online Miss Ratio Curve Generation for Eviction Policy Evaluation
Kia Shakiba, Sari Sultan, and Michael Stumm, University of Toronto
In-memory caches play an important role in reducing the load on backend storage servers for many workloads. Miss ratio curves (MRCs) are an important tool for configuring these caches with respect to cache size and eviction policy. MRCs provide insight into the trade-off between cache size (and thus costs) and miss ratio for a specific eviction policy. Over the years, many MRC-generation algorithms have been developed. However, to date, only Miniature Simulations is capable of efficiently generating MRCs for popular eviction policies, such as Least Frequently Used (LFU), First-In-First-Out (FIFO), 2Q, and Least Recently/Frequently Used (LRFU), that do not adhere to the inclusion property. One critical downside of Miniature Simulations is that it incurs significant memory overhead, precluding its use for online cache analysis at runtime in many cases.
In this paper, we introduce Kosmo, an MRC generation algorithm that allows for the simultaneous generation of MRCs for a variety of eviction policies that do not adhere to the inclusion property. We evaluate Kosmo using 52 publicly accessible cache access traces with a total of roughly 126 billion accesses. Compared to Miniature Simulations configured with 100 simulated caches, Kosmo has lower memory overhead by a factor of 3.6 on average (and as high as 36) and higher throughput by a factor of 1.3, making it far more suitable for online MRC generation.
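For readers unfamiliar with MRC generation, the toy sketch below replays a key trace against several LRU caches of different sizes and prints one (size, miss ratio) point per cache, which is the brute-force idea behind simulation-based approaches such as Miniature Simulations. Kosmo's actual algorithm and data structures are different and far more memory-efficient; the cache sizes and the naive O(n) LRU here are illustrative assumptions only.

```c
/* Toy sketch of simulation-based MRC generation: replay one access trace
 * (one key per line on stdin) against several independently sized LRU
 * caches and print a (cache size, miss ratio) point for each. */
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NUM_SIZES 4
static const size_t cache_sizes[NUM_SIZES] = {64, 256, 1024, 4096};

typedef struct {
    char **keys;              /* keys[0] is the most recently used entry */
    size_t capacity, used;
    long hits, misses;
} lru_t;

static void lru_access(lru_t *c, const char *key)
{
    for (size_t i = 0; i < c->used; i++) {
        if (strcmp(c->keys[i], key) == 0) {   /* hit: move entry to front */
            char *k = c->keys[i];
            memmove(&c->keys[1], &c->keys[0], i * sizeof(char *));
            c->keys[0] = k;
            c->hits++;
            return;
        }
    }
    c->misses++;                              /* miss: evict LRU entry if full */
    if (c->used == c->capacity)
        free(c->keys[--c->used]);
    memmove(&c->keys[1], &c->keys[0], c->used * sizeof(char *));
    c->keys[0] = strdup(key);
    c->used++;
}

int main(void)
{
    lru_t caches[NUM_SIZES];
    for (int i = 0; i < NUM_SIZES; i++) {
        caches[i].keys = calloc(cache_sizes[i], sizeof(char *));
        caches[i].capacity = cache_sizes[i];
        caches[i].used = 0;
        caches[i].hits = caches[i].misses = 0;
    }

    char line[256];
    while (fgets(line, sizeof(line), stdin)) {
        line[strcspn(line, "\n")] = '\0';     /* strip trailing newline */
        for (int i = 0; i < NUM_SIZES; i++)
            lru_access(&caches[i], line);
    }

    /* Each (size, miss ratio) pair is one point on the MRC. */
    for (int i = 0; i < NUM_SIZES; i++) {
        long total = caches[i].hits + caches[i].misses;
        printf("size=%zu miss_ratio=%.4f\n", cache_sizes[i],
               total ? (double)caches[i].misses / total : 0.0);
    }
    return 0;
}
```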
3:15 pm–3:45 pm
Break with Refreshments
Mezzanine East/West
3:45 pm–5:00 pm
File Systems
Session Chair: Peter Macko, MongoDB
I/O Passthru: Upstreaming a flexible and efficient I/O Path in Linux
Kanchan Joshi, Anuj Gupta, Javier González, Ankit Kumar, Krishna Kanth Reddy, Arun George, and Simon Lund, Samsung Semiconductor; Jens Axboe, Meta Platforms Inc.
New storage interfaces continue to emerge rapidly for Non-Volatile Memory Express (NVMe) storage. Fitting these innovations into the general-purpose I/O stack of operating systems has been challenging and time-consuming. The NVMe standard is no longer limited to block I/O, but Linux I/O advances have historically centered on the block-I/O path. The lack of scalable OS interfaces risks slowing the adoption of new storage innovations.
We introduce I/O Passthru, a new I/O path that has made its way into the mainline Linux kernel. The key ingredients of this new path are the NVMe char interface and io_uring command. In this paper, we present our experience building and upstreaming I/O Passthru and report on how it helps consume new NVMe innovations without changes to the Linux kernel. We provide experimental results to (i) compare its efficiency against the existing io_uring block path and (ii) demonstrate its flexibility by integrating data placement into CacheLib. FIO peak-performance workloads show 16–40% higher IOPS than the block path.
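As a rough illustration of the passthrough path (not the authors' code), the sketch below submits a single NVMe read through the per-namespace character device via io_uring's IORING_OP_URING_CMD. It assumes a liburing and kernel version with uring_cmd support, device node /dev/ng0n1, namespace ID 1, and a 512-byte LBA size; error handling is minimal.

```c
/* Rough sketch (assumptions noted above): one NVMe Read submitted through
 * the NVMe char device via io_uring passthrough (IORING_OP_URING_CMD). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <linux/nvme_ioctl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define LBA_SIZE 512          /* assumed logical block size */
#define NUM_LBAS 8

int main(void)
{
    struct io_uring ring;
    struct io_uring_params p;
    memset(&p, 0, sizeof(p));
    p.flags = IORING_SETUP_SQE128 | IORING_SETUP_CQE32;   /* big SQEs/CQEs */
    if (io_uring_queue_init_params(8, &ring, &p) < 0) {
        fprintf(stderr, "io_uring_queue_init_params failed\n");
        return 1;
    }

    int fd = open("/dev/ng0n1", O_RDONLY);     /* char device, not /dev/nvme0n1 */
    if (fd < 0) { perror("open /dev/ng0n1"); return 1; }

    void *buf = NULL;
    if (posix_memalign(&buf, 4096, LBA_SIZE * NUM_LBAS) != 0)
        return 1;

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    memset(sqe, 0, sizeof(*sqe));
    sqe->opcode = IORING_OP_URING_CMD;
    sqe->fd = fd;
    sqe->cmd_op = NVME_URING_CMD_IO;

    /* The NVMe command itself is carried in the extended SQE area. */
    struct nvme_uring_cmd *cmd = (struct nvme_uring_cmd *)sqe->cmd;
    memset(cmd, 0, sizeof(*cmd));
    cmd->opcode   = 0x02;                      /* NVMe Read */
    cmd->nsid     = 1;                         /* assumed namespace ID */
    cmd->addr     = (uint64_t)(uintptr_t)buf;
    cmd->data_len = LBA_SIZE * NUM_LBAS;
    cmd->cdw10    = 0;                         /* starting LBA, low 32 bits */
    cmd->cdw11    = 0;                         /* starting LBA, high 32 bits */
    cmd->cdw12    = NUM_LBAS - 1;              /* number of LBAs, zero-based */

    io_uring_submit(&ring);

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    printf("NVMe status: %d\n", cqe->res);     /* 0 indicates success */
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    close(fd);
    free(buf);
    return 0;
}
```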
Metis: File System Model Checking via Versatile Input and State Exploration
Yifei Liu and Manish Adkar, Stony Brook University; Gerard Holzmann, Nimble Research; Geoff Kuenning, Harvey Mudd College; Pei Liu, Scott A. Smolka, Wei Su, and Erez Zadok, Stony Brook University
We present Metis, a model-checking framework designed for versatile, thorough, yet configurable file system testing in the form of input and state exploration. It uses a nondeterministic loop and a weighting scheme to decide which system calls and which arguments to execute. Metis features a new abstract state representation for file-system states in support of efficient and effective state exploration. While exploring states, it compares the behavior of a file system under test against a reference file system and reports any discrepancies; it also provides support to investigate and reproduce any that are found. We also developed RefFS, a small, fast file system that serves as a reference, with special features designed to accelerate model checking and enhance bug reproducibility. Experimental results show that Metis can flexibly generate test inputs, and the rate at which it explores file-system states scales nearly linearly across multiple nodes. RefFS explores states 3–28× faster than other, more mature file systems. Metis aided the development of RefFS, reporting 11 bugs that we subsequently fixed. Metis further identified 12 bugs in five other file systems; five of these were confirmed, and one was fixed and integrated into Linux.
RFUSE: Modernizing Userspace Filesystem Framework through Scalable Kernel-Userspace Communication
Kyu-Jin Cho, Jaewon Choi, Hyungjoon Kwon, and Jin-Soo Kim, Seoul National University
With the advancement of storage devices and the increasing scale of data, filesystem design has transformed in response to this progress. However, implementing new features within an in-kernel filesystem is a challenging task due to development complexity and code security concerns. As an alternative, userspace filesystems are gaining attention, owing to their ease of development and reliability. FUSE is a renowned framework that allows users to develop custom filesystems in userspace. However, the complex internal stack of FUSE leads to notable performance overhead, which becomes even more prominent in modern hardware environments with high-performance storage devices and a large number of cores.
In this paper, we present RFUSE, a novel userspace filesystem framework that utilizes scalable message communication between the kernel and userspace. RFUSE employs a per-core ring buffer structure as a communication channel and effectively minimizes transmission overhead caused by context switches and request copying. Furthermore, RFUSE enables users to utilize existing FUSE-based filesystems without making any modifications. Our evaluation results indicate that RFUSE demonstrates comparable throughput to in-kernel filesystems on high-performance devices while exhibiting high scalability in both data and metadata operations.
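To make the framework concrete, here is the classic "hello" filesystem written against the libfuse 2.x high-level API, i.e., the kind of unmodified FUSE-based filesystem that RFUSE is described as running as-is. It is a generic libfuse sketch, not code from the paper.

```c
/* Minimal read-only FUSE filesystem using the libfuse 2.x high-level API.
 * Build (assuming libfuse 2.x is installed):
 *   gcc hello_fs.c `pkg-config fuse --cflags --libs` -o hello_fs
 * Every getattr/readdir/read call below is forwarded from the kernel FUSE
 * module to this process; RFUSE replaces that kernel-userspace channel with
 * per-core ring buffers while keeping this programming interface unchanged. */
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <errno.h>
#include <string.h>
#include <sys/stat.h>

static const char *hello_path = "/hello";
static const char *hello_str  = "Hello from userspace!\n";

static int hello_getattr(const char *path, struct stat *st)
{
    memset(st, 0, sizeof(*st));
    if (strcmp(path, "/") == 0) {
        st->st_mode  = S_IFDIR | 0755;
        st->st_nlink = 2;
    } else if (strcmp(path, hello_path) == 0) {
        st->st_mode  = S_IFREG | 0444;
        st->st_nlink = 1;
        st->st_size  = strlen(hello_str);
    } else {
        return -ENOENT;
    }
    return 0;
}

static int hello_readdir(const char *path, void *buf, fuse_fill_dir_t filler,
                         off_t offset, struct fuse_file_info *fi)
{
    (void)offset; (void)fi;
    if (strcmp(path, "/") != 0)
        return -ENOENT;
    filler(buf, ".", NULL, 0);
    filler(buf, "..", NULL, 0);
    filler(buf, hello_path + 1, NULL, 0);     /* strip leading '/' */
    return 0;
}

static int hello_read(const char *path, char *buf, size_t size, off_t offset,
                      struct fuse_file_info *fi)
{
    (void)fi;
    if (strcmp(path, hello_path) != 0)
        return -ENOENT;
    size_t len = strlen(hello_str);
    if ((size_t)offset >= len)
        return 0;
    if (offset + size > len)
        size = len - offset;
    memcpy(buf, hello_str + offset, size);
    return (int)size;
}

static struct fuse_operations hello_ops = {
    .getattr = hello_getattr,
    .readdir = hello_readdir,
    .read    = hello_read,
};

int main(int argc, char *argv[])
{
    return fuse_main(argc, argv, &hello_ops, NULL);
}
```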
6:00 pm–7:30 pm
FAST '24 Poster Session and Reception
Mezzanine East/West
Check out the cool new ideas and the latest preliminary research on display at the Poster Session and Reception. Take part in discussions with your colleagues over complimentary food and beverages. View the complete list of accepted posters.
9:00 am–10:15 am
Flash Storage
Session Chair: Jooyoung Hwang, Samsung Electronics
The Design and Implementation of a Capacity-Variant Storage System
Ziyang Jiao and Xiangqun Zhang, Syracuse University; Hojin Shin and Jongmoo Choi, Dankook University; Bryan S. Kim, Syracuse University
We present the design and implementation of a capacity-variant storage system (CVSS) for flash-based solid-state drives (SSDs). CVSS aims to maintain high performance throughout the lifetime of an SSD by allowing storage capacity to gracefully reduce over time, thus preventing fail-slow symptoms. CVSS comprises three key components: (1) CV-SSD, an SSD that minimizes write amplification and gracefully reduces its exported capacity with age; (2) CV-FS, a log-structured file system for elastic logical partitions; and (3) CV-manager, a user-level program that orchestrates system components based on the state of the storage system. We demonstrate the effectiveness of CVSS with synthetic and real workloads, showing significant improvements in latency, throughput, and lifetime compared to a fixed-capacity storage system. Specifically, under real workloads, CVSS reduces latency, improves throughput, and extends lifetime by 8–53%, 49–316%, and 268–327%, respectively.
I/O in a Flash: Evolution of ONTAP to Low-Latency SSDs
Matthew Curtis-Maury, Ram Kesavan, Bharadwaj V R, Nikhil Mattankot, Vania Fang, Yash Trivedi, Kesari Mishra, and Qin Li, NetApp, Inc
Flash-based persistent storage media are capable of sub-millisecond latency I/O. However, a storage architecture optimized for spinning drives may contain software delays that make it impractical for use with such media. The NetApp® ONTAP® storage system was designed originally for spinning drives, and needed alterations before it was productized as an all-SSD system. In this paper, we focus on the changes made to the read I/O path over the last several years, which have been crucial to this transformation, and present them in chronological fashion together with the associated performance analysis.
We Ain't Afraid of No File Fragmentation: Causes and Prevention of Its Performance Impact on Modern Flash SSDs
Yuhun Jun, Sungkyunkwan University and Samsung Electronics Co., Ltd.; Shinhyun Park, Sungkyunkwan University; Jeong-Uk Kang, Samsung Electronics Co., Ltd.; Sang-Hoon Kim, Ajou University; Euiseong Seo, Sungkyunkwan University
Awarded Best Paper!
A few studies reported that fragmentation still adversely affects the performance of flash solid-state disks (SSDs), particularly through request splitting. This research investigates fragmentation-induced performance degradation across three levels: the kernel I/O path, the host-storage interface, and flash memory accesses inside SSDs. Our analysis reveals that, contrary to assertions in existing literature, the primary cause of the degraded performance is not request splitting but a significant increase in die-level collisions. In SSDs, when other writes come between writes of neighboring file blocks, the file blocks are not placed on consecutive dies, resulting in random die allocation. This randomness escalates the chances of die-level collisions, causing deteriorated read performance later. We also show that this can happen when a file is overwritten. To counteract this, we propose an NVMe command extension combined with a page-to-die allocation algorithm designed to ensure that contiguous blocks always land on successive dies, even in the face of file fragmentation or overwrites. Evaluations with commercial SSDs and an SSD emulator indicate that our approach effectively curtails the read performance drop arising from both fragmentation and overwrites, all without the need for defragmentation. For example, when a 162 MB SQLite database was fragmented into 10,011 pieces, our approach limited the performance drop to 3.5%, while the conventional system experienced a 40% decline.
10:15 am–10:45 am
Break with Refreshments
Mezzanine East/West
10:45 am–11:45 am
Work-in-Progress Reports (WiPs)
Session Chair: Ali R. Butt, Virginia Tech
View the list of accepted Work-in-Progress Reports.
11:45 am–12:00 pm
FAST '24 Test of Time Award Presentation
SFS: Random Write Considered Harmful in Solid State Drives
Changwoo Min, Kangnyeon Kim, Hyunjin Cho, Sang-Won Lee, and Young Ik Eom
Published in the Proceedings of the 10th USENIX Conference on File and Storage Technologies, February 2012
12:00 pm–2:00 pm
Conference Luncheon
Terra Courtyard
2:00 pm–3:40 pm
Key-Value Systems
Session Chair: Angela Demke Brown, University of Toronto
In-Memory Key-Value Store Live Migration with NetMigrate
Zeying Zhu, University of Maryland; Yibo Zhao, Boston University; Zaoxing Liu, University of Maryland
Distributed key-value stores today require frequent key-value shard migration between nodes to react to dynamic workload changes for load balancing, data locality, and service elasticity. In this paper, we propose NetMigrate, a live migration approach for in-memory key-value stores based on programmable network data planes. NetMigrate migrates shards between nodes with zero service interruption and minimal performance impact. During migration, the switch data plane monitors the migration process in a fine-grained manner and directs client queries to the right server in real time, eliminating the overhead of pulling data between nodes. We implement a NetMigrate prototype on a testbed consisting of a programmable switch and several commodity servers running Redis, and evaluate it under YCSB workloads. Our experiments demonstrate that NetMigrate improves query throughput by 6.5% to 416% and maintains low access latency during migration, compared to state-of-the-art migration approaches.
IONIA: High-Performance Replication for Modern Disk-based KV Stores
Yi Xu, University of California, Berkeley; Henry Zhu, University of Illinois Urbana-Champaign; Prashant Pandey, University of Utah; Alex Conway, Cornell Tech and VMware Research; Rob Johnson, VMware Research; Aishwarya Ganesan and Ramnatthan Alagappan, University of Illinois Urbana-Champaign and VMware Research
We introduce IONIA, a novel replication protocol tailored for modern SSD-based write-optimized key-value (WO-KV) stores. Unlike existing replication approaches, IONIA carefully exploits the unique characteristics of SSD-based WO-KV stores. First, it exploits their interface characteristics to defer parallel execution to the background, enabling high-throughput writes that complete in a single round trip (RTT). IONIA also exploits SSD-based KV stores' performance characteristics to scalably read at any replica without enforcing writes to all replicas, thus providing scalability without compromising write availability; further, it does so while completing most reads in one RTT. IONIA is the first protocol to achieve these properties, and it does so through its storage-aware design. We evaluate IONIA extensively to show that it achieves the above properties under a variety of workloads.
Physical vs. Logical Indexing with IDEA: Inverted Deduplication-Aware Index
Asaf Levi, Technion - Israel Institute of Technology; Philip Shilane, Dell Technologies; Sarai Sheinvald, Braude College of Engineering; Gala Yadgar, Technion - Israel Institute of Technology
In the realm of information retrieval, the need to maintain reliable term-indexing has grown more acute in recent years, with vast amounts of ever-growing online data searched by a large number of search-engine users and used for data mining and natural language processing. At the same time, an increasing portion of primary storage systems employ data deduplication, where duplicate logical data chunks are replaced with references to a unique physical copy.
We show that indexing deduplicated data with deduplication-oblivious mechanisms might result in extreme inefficiencies: the index size would increase in proportion to the logical data size, regardless of its duplication ratio, consuming excessive storage and memory and slowing down lookups. In addition, the logically sequential accesses during index creation would be transformed into random and redundant accesses to the physical chunks. Indeed, to the best of our knowledge, term indexing is not supported by any deduplicating storage system.
In this paper, we propose the design of a deduplication-aware term-index that addresses these challenges. IDEA maps terms to the unique chunks that contain them, and maps each chunk to the files in which it is contained. This basic design concept improves the index performance and can support advanced functionalities such as inline indexing, result ranking, and proximity search. Our prototype implementation based on Lucene (the search engine at the core of Elasticsearch) shows that IDEA can reduce the index size and indexing time by up to 73% and 94%, respectively, and reduce term-lookup latency by up to 82% and 59% for single and multi-term queries, respectively.
MIDAS: Minimizing Write Amplification in Log-Structured Systems through Adaptive Group Number and Size Configuration
Seonggyun Oh, Jeeyun Kim, and Soyoung Han, DGIST; Jaeho Kim, Gyeongsang National University; Sungjin Lee, DGIST; Sam H. Noh, Virginia Tech
Log-structured systems are widely used in various applications because of their high write throughput. However, high garbage collection (GC) cost is widely regarded as the primary obstacle to their wider adoption. There have been numerous attempts to alleviate GC overhead, but they rely on ad-hoc designs. This paper introduces MiDAS, which minimizes GC overhead in a systematic and analytic manner. It employs a chain-like structure of multiple groups, automatically segregating data blocks by age. It employs analytical models, the Update Interval Distribution (UID) and a Markov-Chain-based Analytical Model (MCAM), to dynamically adjust the number of groups as well as their sizes according to the workload I/O patterns, thereby minimizing the movement of data blocks. Furthermore, MiDAS isolates hot blocks into a dedicated HOT group, whose size is dynamically adjusted according to the workload to minimize overall WAF. Our experiments using simulations and a proof-of-concept prototype for flash-based SSDs show that MiDAS outperforms state-of-the-art GC techniques, offering 25% lower WAF and 54% higher throughput while consuming less memory and fewer CPU cycles.
3:40 pm–4:10 pm
Break with Refreshments
Mezzanine East/West
4:10 pm–5:10 pm
Panel
Storage Systems in the LLM Era
Moderator: Keith A. Smith, MongoDB
Panelists: Greg Ganger, Carnegie Mellon University; Dean Hildebrand, Google; Glenn Lockwood, Microsoft; Nisha Talagala, Pyxeda; Zhe Zhang, AnyScale
This one-hour panel discussion will be centered around the new challenges and opportunities brought by revolutionary AI technologies to the storage community in terms of research, system development, management, and education. Panelists will include experienced researchers and entrepreneurs from multiple storage-related systems fields. The panel will be moderated by Keith Smith from MongoDB.
Greg Ganger, Carnegie Mellon University
Greg Ganger is the Jatras Professor of ECE and CS (by courtesy) at Carnegie Mellon University (CMU). Since 2001, he has also served as the Director of CMU's Parallel Data Laboratory (PDL) research center focused on data storage and processing systems. He has broad research interests in computer systems, including storage/file systems, cloud computing, ML systems, distributed systems, and operating systems. He earned his collegiate degrees from the University of Michigan and did a postdoc at MIT before joining CMU. He still loves playing basketball... he's lost a step but developed a sweet 3-point shot.
Dean Hildebrand, Google
Dean Hildebrand is a Technical Director for storage in the Google Cloud Office of the CTO. His interests span the spectrum of distributed storage, spending an inordinate amount of time on NFS, GPFS, and most recently DAOS. Prior to Google, Dean was a Principal Research Staff Member at IBM Research. He completed his Ph.D. in computer science from the University of Michigan in 2007.
Glenn Lockwood, Microsoft
Glenn K. Lockwood is a Principal Engineer at Microsoft, where he is responsible for supporting Microsoft's largest AI supercomputers through workload-driven systems design. His work has focused on applied research and development in extreme-scale and parallel computing systems for high-performance computing, and he has specific expertise in scalable architectures, performance modeling, and emerging technologies for I/O and storage. Prior to joining Microsoft, Glenn led the design and validation of several large-scale storage systems, including the world's first 30+ PB all-NVMe Lustre file system for the Perlmutter supercomputer at NERSC. He has also authored numerous peer-reviewed papers and reports on HPC storage topics and contributed to various reviews and advisory roles in U.S. and European research programs. He holds a Ph.D. in Materials Science from Rutgers University.
Nisha Talagala, Pyxeda
Nisha Talagala is the CEO and founder of AIClub.World, which is bringing AI Literacy to K-12 students and individuals worldwide. Nisha has significant experience in introducing technologies like Artificial Intelligence to new learners from students to professionals. Previously, Nisha co-founded ParallelM (acquired by DataRobot), which pioneered the MLOps practice of managing Machine Learning in production for enterprises. Nisha is a recognized leader in the operational machine learning space, having also driven the USENIX Operational ML Conference, the first industry/academic conference on production AI/ML. Nisha was previously a Fellow at SanDisk and a Fellow/Lead Architect at Fusion-io, where she worked on innovation in non-volatile memory technologies and applications. Nisha has over 20 years of expertise in enterprise software development, distributed systems, technical strategy, and product leadership. She has worked as technology lead for server flash at Intel—where she led server platform non-volatile memory technology development, storage-memory convergence, and partnerships. Prior to Intel, Nisha was the CTO of Gear6, where she designed and built clustered computing caches for high-performance I/O environments. Nisha earned her Ph.D. at UC Berkeley where she did research on clusters and distributed systems. Nisha holds 75 patents in distributed systems and software, over 25 refereed research publications, is a frequent speaker at industry and academic events, and is a contributing writer to Forbes and other publications.
Zhe Zhang, AnyScale
Zhe Zhang is currently the Head of Open Source Engineering (Ray.io project) at Anyscale. Before Anyscale, Zhe spent 4.5 years at LinkedIn, where he managed the Hadoop/Spark infrastructure team. He has been working on open source for about 10 years; he's a committer and PMC member of the Apache Hadoop project and a member of the Apache Software Foundation.
9:00 am–10:15 am
Cloud Storage
Session Chair: Young-ri Choi, UNIST (Ulsan National Institute of Science and Technology)
What's the Story in EBS Glory: Evolutions and Lessons in Building Cloud Block Store
Weidong Zhang, Erci Xu, Qiuping Wang, Xiaolu Zhang, Yuesheng Gu, Zhenwei Lu, Tao Ouyang, Guanqun Dai, Wenwen Peng, Zhe Xu, Shuo Zhang, Dong Wu, Yilei Peng, Tianyun Wang, Haoran Zhang, Jiasheng Wang, Wenyuan Yan, Yuanyuan Dong, Wenhui Yao, Zhongjie Wu, Lingjun Zhu, Chao Shi, Yinhu Wang, Rong Liu, Junping Wu, Jiaji Zhu, and Jiesheng Wu, Alibaba Group
Awarded Best Paper!
In this paper, we qualitatively and quantitatively discuss the design choices, production experience, and lessons in building the Elastic Block Storage (EBS) at Alibaba Cloud over the past decade. To cope with hardware advancement and users' demands, we shift our focus from design simplicity in EBS1 to high performance and space efficiency in EBS2, and finally to reducing network traffic amplification in EBS3.
In addition to the architectural evolutions, we also summarize the lessons and experiences in development as four topics, including: (i) achieving high elasticity in latency, throughput, IOPS and capacity; (ii) improving availability by minimizing the blast radius of individual, regional, and global failure events; (iii) identifying the motivations and key tradeoffs in various hardware offloading solutions; and (iv) identifying the pros/cons of the alternative solutions and explaining why seemingly promising ideas would not work in practice.
ELECT: Enabling Erasure Coding Tiering for LSM-tree-based Storage
Yanjing Ren and Yuanming Ren, The Chinese University of Hong Kong; Xiaolu Li and Yuchong Hu, Huazhong University of Science and Technology; Jingwei Li, University of Electronic Science and Technology of China; Patrick P. C. Lee, The Chinese University of Hong Kong
Given the skewed nature of practical key-value (KV) storage workloads, distributed KV stores can adopt a tiered approach to support fast data access in a hot tier and persistent storage in a cold tier. To provide data availability guarantees for the hot tier, existing distributed KV stores often rely on replication and incur prohibitively high redundancy overhead. Erasure coding provides a low-cost redundancy alternative, but incurs high access performance overhead. We present ELECT, a distributed KV store that enables erasure coding tiering based on the log-structured merge tree (LSM-tree), by adopting a hybrid redundancy approach that carefully combines replication and erasure coding with respect to the LSM-tree layout. ELECT incorporates hotness awareness and selectively converts data from replication to erasure coding in the hot tier and offloads data from the hot tier to the cold tier. It also provides a tunable approach to balance the trade-off between storage savings and access performance through a single user-configurable parameter. We implemented ELECT atop Cassandra, which is replication-based. Experiments on Alibaba Cloud show that ELECT achieves significant storage savings in the hot tier, while maintaining high performance and data availability guarantees, compared with Cassandra.
MinFlow: High-performance and Cost-efficient Data Passing for I/O-intensive Stateful Serverless Analytics
Tao Li, Yongkun Li, and Wenzhe Zhu, University of Science and Technology of China; Yinlong Xu, Anhui Province Key Laboratory of High Performance Computing, University of Science and Technology of China; John C. S. Lui, The Chinese University of Hong Kong
Serverless computing has revolutionized application deployment, obviating traditional infrastructure management and dynamically allocating resources on demand. A significant use case is I/O-intensive applications such as data analytics, which widely employ the pivotal "shuffle" operation. Unfortunately, the shuffle operation poses severe challenges due to the massive PUT/GET requests to remote storage, especially in high-parallelism scenarios, leading to substantial performance degradation and high storage cost. Existing designs optimize data-passing performance from multiple aspects, but they operate in isolation, thus still introducing unforeseen performance bottlenecks and bypassing untapped optimization opportunities. In this paper, we develop MinFlow, a holistic data-passing framework for I/O-intensive serverless analytics jobs. MinFlow first rapidly generates numerous feasible multi-level data-passing topologies with far fewer PUT/GET operations; it then leverages an interleaved partitioning strategy to divide the topology DAG into small bipartite sub-graphs to optimize function scheduling, further reducing the data transmitted to remote storage by over half. Moreover, MinFlow develops a precise model to determine the optimal configuration, thus minimizing data-passing time under practical function deployments. We implement a prototype of MinFlow, and extensive experiments show that MinFlow significantly outperforms state-of-the-art systems, FaaSFlow and Lambada, in both job completion time and storage cost.
10:15 am–10:45 am
Break with Refreshments
Mezzanine East/West
10:45 am–12:00 pm
AI and Storage
Session Chair: Patrick P. C. Lee, The Chinese University of Hong Kong (CUHK)
COLE: A Column-based Learned Storage for Blockchain Systems
Ce Zhang and Cheng Xu, Hong Kong Baptist University; Haibo Hu, Hong Kong Polytechnic University; Jianliang Xu, Hong Kong Baptist University
Blockchain systems suffer from high storage costs, as every node needs to store and maintain the entire blockchain data. After investigating Ethereum's storage, we find that the storage cost mostly comes from the index, i.e., the Merkle Patricia Trie (MPT). To support provenance queries, MPT persists the index nodes during data updates, which adds substantial storage overhead. To reduce the storage size, an initial idea is to leverage the emerging learned index technique, which has been shown to have a smaller index size and more efficient query performance. However, directly applying it to blockchain storage results in even higher overhead, owing to the requirement of persisting index nodes and the learned index's large node size. To tackle this, we propose COLE, a novel column-based learned storage for blockchain systems. We follow the column-based database design to contiguously store each state's historical values, which are indexed by learned models to facilitate efficient data retrieval and provenance queries. We develop a series of write-optimized strategies to realize COLE in disk environments. Extensive experiments are conducted to validate the performance of the proposed COLE system. Compared with MPT, COLE reduces the storage size by up to 94% while improving the system throughput by 1.4×–5.4×.
Baleen: ML Admission & Prefetching for Flash Caches
Daniel Lin-Kit Wong, Carnegie Mellon University; Hao Wu, Meta; Carson Molder, UT Austin; Sathya Gunasekar, Jimmy Lu, Snehal Khandkar, and Abhinav Sharma, Meta; Daniel S. Berger, Microsoft and University of Washington; Nathan Beckmann and Gregory R. Ganger, Carnegie Mellon University
Flash caches are used to reduce peak backend load for throughput-constrained data center services, reducing the total number of backend servers required. Bulk storage systems are a large-scale example, backed by high-capacity but low-throughput hard disks, and using flash caches to provide a more cost-effective storage layer underlying everything from blobstores to data warehouses.
However, flash caches must address the limited write endurance of flash by limiting the long-term average flash write rate to avoid premature wearout. To do so, most flash caches must use admission policies to filter cache insertions and maximize the workload-reduction value of each flash write.
The Baleen flash cache uses coordinated ML admission and prefetching to reduce peak backend load. After learning painful lessons with our early ML policy attempts, we exploit a new cache residency model (which we call episodes) to guide model training. We focus on optimizing for an end-to-end system metric (Disk-head Time) that measures backend load more accurately than IO miss rate or byte miss rate. Evaluation using Meta traces from seven storage clusters shows that Baleen reduces Peak Disk-head Time (and hence the number of backend hard disks required) by 12% over state-of-the-art policies for a fixed flash write rate constraint. Baleen-TCO, which chooses an optimal flash write rate, reduces our estimated total cost of ownership (TCO) by 17%. Code and traces are available at https://www.pdl.cmu.edu/CILES/.
Seraph: Towards Scalable and Efficient Fully-external Graph Computation via On-demand Processing
Tsun-Yu Yang, Yizou Chen, Yuhong Liang, and Ming-Chang Yang, The Chinese University of Hong Kong
Fully-external graph computation systems exhibit optimal scalability by computing ever-growing, large-scale graphs with a constant amount of memory on a single machine. In particular, they keep the entire massive graph data in storage and iteratively load parts of it into memory for computation. Nevertheless, despite the merit of optimal scalability, their unreasonably low efficiency often makes them uncompetitive, and even impractical, compared with other types of graph computation systems. The key rationale is that most existing fully-external graph computation systems over-emphasize retrieving graph data from storage through sequential access. Although this principle achieves high storage bandwidth, it often leads to reading excessive and irrelevant data, which can severely degrade their overall efficiency.
Therefore, this work presents Seraph, a fully-external graph computation system that achieves optimal Scalability while delivering satisfactory Efficiency. Particularly, inspired by modern storage devices offering comparable sequential and random access speeds, Seraph adopts the principle of on-demand processing, accessing only the necessary graph data to save I/O while enjoying the decent speed of random access. On the basis of this principle, Seraph further devises three practical designs that bring an excellent performance leap to fully-external graph computation: 1) a hybrid format to represent the graph data, striking a good balance between I/O amount and access locality; 2) vertex passing to enable efficient vertex updates on top of the hybrid format; and 3) selective pre-computation to re-use loaded data for I/O reduction. Our evaluations reveal that Seraph notably outperforms other state-of-the-art fully-external systems under all the evaluated billion-scale graphs and representative graph algorithms, by up to two orders of magnitude.