USENIX ATC '24 Technical Sessions

Wednesday, July 10

8:00 am–9:00 am

Continental Breakfast

Grand Ballroom Foyer

9:00 am–10:00 am

USENIX ATC '24 and OSDI '24 Joint Keynote Address

Grand Ballroom ABGH

Scaling AI Sustainably: An Uncharted Territory

Carole-Jean Wu, Meta

The past 50 years have seen a dramatic increase in the amount of compute per person, in particular compute enabled by AI. Despite the positive societal benefits, AI technologies come with significant environmental implications. I will talk about the scaling trend and the operational carbon footprint of AI computing by examining the model development cycle, spanning data, algorithms, and system hardware. At the same time, we will consider the life cycle of system hardware from the perspective of hardware architectures and manufacturing technologies. I will highlight key efficiency optimization opportunities for cutting-edge AI technologies, from deep learning recommendation models to multi-modal generative AI tasks. To scale AI sustainably, we need to make AI and computing more broadly efficient and flexible. We must also go beyond efficiency and optimize across the life cycle of computing infrastructures, from hardware manufacturing to datacenter operation and end-of-life processing for the hardware. Based on industry experience and lessons learned, my talk will conclude with important development and research directions to advance the field of computing in an environmentally responsible and sustainable manner.

Carole-Jean Wu, Meta

Carole-Jean Wu is a Director at Meta. She is a founding member and a Vice President of MLCommons—a non-profit organization that aims to accelerate machine learning for the benefit of all. Dr. Wu also serves on the MLCommons Board as a Director, chaired the MLPerf Recommendation Benchmark Advisory Board, and co-chaired MLPerf Inference. Prior to Meta/Facebook, she was a tenured professor at ASU. She earned her M.A. and Ph.D. from Princeton and B.Sc. from Cornell.

Dr. Wu's expertise sits at the intersection of computer architecture and machine learning. Her work spans datacenter infrastructures and edge systems, such as developing energy- and memory-efficient systems and microarchitectures, optimizing systems for machine learning execution at scale, and designing learning-based approaches for system design and optimization. Dr. Wu's work has been recognized with several awards, including IEEE Micro Top Picks and ACM/IEEE Best Paper Awards. She was the Program Co-Chair of the Conference on Machine Learning and Systems (MLSys) in 2022, the Program Chair of the IEEE International Symposium on Workload Characterization (IISWC) in 2018, and the Editor for the IEEE MICRO Special Issue on Environmentally Sustainable Computing. She currently serves on the ACM SIGARCH/SIGMICRO CARES committee.

10:00 am–10:30 am

Break with Refreshments

Grand Ballroom Foyer

10:30 am–10:45 am

Opening Remarks, Awards, and Presentation of the 2024 USENIX Lifetime Achievement (Flame) Award

Grand Ballroom CD

Program Co-Chairs: Saurabh Bagchi, Purdue University; Yiying Zhang, University of California, San Diego

10:45 am–12:25 pm

Track 1

Cloud Computing

Grand Ballroom CD

Harmonizing Efficiency and Practicability: Optimizing Resource Utilization in Serverless Computing with Jiagu

Qingyuan Liu, Yanning Yang, Dong Du, and Yubin Xia, Institute of Parallel and Distributed Systems, SEIEE, Shanghai Jiao Tong University; Engineering Research Center for Domain-specific Operating Systems, Ministry of Education; Ping Zhang and Jia Feng, Huawei Cloud; James R. Larus, EPFL; Haibo Chen, Institute of Parallel and Distributed Systems, SEIEE, Shanghai Jiao Tong University; Engineering Research Center for Domain-specific Operating Systems, Ministry of Education; Key Laboratory of System Software (Chinese Academy of Science)

Current serverless platforms struggle to optimize resource utilization due to their dynamic and fine-grained nature. Conventional techniques like overcommitment and autoscaling fall short, often sacrificing utilization for practicability or incurring performance trade-offs. Overcommitment requires predicting performance to prevent QoS violations, introducing a trade-off between prediction accuracy and overhead. Autoscaling requires quickly scaling instances in response to load fluctuations to reduce resource waste, but more frequent scaling also leads to more cold-start overhead. This paper introduces Jiagu to harmonize efficiency with practicability through two novel techniques. First, pre-decision scheduling achieves accurate prediction while eliminating overheads by decoupling prediction and scheduling. Second, dual-staged scaling achieves frequent adjustment of instances with minimum overhead. We have implemented a prototype and evaluated it using real-world applications and traces from the public cloud platform. Our evaluation shows a 54.8% improvement in deployment density over commercial clouds (with Kubernetes) while maintaining QoS, and 81.0%–93.7% lower scheduling costs and a 57.4%–69.3% reduction in cold start latency compared to existing QoS-aware schedulers.

ALPS: An Adaptive Learning, Priority OS Scheduler for Serverless Functions

Yuqi Fu, University of Virginia; Ruizhe Shi, George Mason University; Haoliang Wang, Adobe Research; Songqing Chen, George Mason University; Yue Cheng, University of Virginia

FaaS (Function-as-a-Service) workloads feature unique patterns. Serverless functions are ephemeral, highly concurrent, and bursty, with an execution duration ranging from a few milliseconds to a few seconds. The workload behaviors pose new challenges to kernel scheduling. Linux CFS (Completely Fair Scheduler) is workload-oblivious and optimizes long-term fairness via proportional sharing. CFS neglects the short-term demands of CPU time from short-lived serverless functions, severely impacting the performance of short functions. Preemptive shortest job first—shortest remaining process time (SRPT)—prioritizes shorter functions in order to satisfy their short-term demands of CPU time and, therefore, serves as a best-case baseline for optimizing the turnaround time of short functions. A significant downside of approximating SRPT, however, is that longer functions might be starved.

In this paper, we propose a novel application-aware kernel scheduler, ALPS (Adaptive Learning, Priority Scheduler), based on two key insights. First, approximating SRPT can largely benefit short functions but may inevitably penalize long functions. Second, CFS provides necessary infrastructure support to implement user-defined priority scheduling. To this end, we design ALPS to have a novel, decoupled scheduler frontend and backend architecture, which unifies approximate SRPT and proportional-share scheduling. ALPS’ frontend sits in the user space and approximates SRPT-inspired priority scheduling by adaptively learning from an SRPT simulation on a recent past workload. ALPS’ backend uses eBPF functions hooked to CFS to carry out the continuously learned policies sent from the frontend to inform scheduling decisions in the kernel. This design adds workload intelligence to workload-oblivious OS scheduling while retaining the desirable properties of OS schedulers. We evaluate ALPS extensively using two production FaaS workloads (Huawei and Azure), and results show that ALPS achieves a reduction of 57.2% in average function execution duration compared to CFS.
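
The frontend/backend split above lends itself to a compact illustration. The following Python sketch is not ALPS code (the function names, history window, and bucket count are hypothetical); it only shows the flavor of the frontend's job: replay a recent window of invocations to estimate per-function run times under an SRPT-like assumption, then emit priority buckets that a backend scheduler could consume.

```python
# Illustrative sketch (not the ALPS implementation): derive per-function
# priorities by ranking functions by their estimated run time, shortest first,
# as an SRPT policy would, and map them to a small number of priority buckets.
from collections import defaultdict
from statistics import median

def learn_priorities(history, num_buckets=4):
    """history: list of (function_name, duration_seconds) from a recent window."""
    durations = defaultdict(list)
    for fn, dur in history:
        durations[fn].append(dur)
    # SRPT favors the shortest remaining time; lacking mid-run information,
    # approximate "remaining time" by the median of past durations.
    est = {fn: median(d) for fn, d in durations.items()}
    ranked = sorted(est, key=est.get)                 # shortest first
    bucket_size = max(1, len(ranked) // num_buckets)
    return {fn: min(i // bucket_size, num_buckets - 1)
            for i, fn in enumerate(ranked)}           # 0 = highest priority

history = [("thumbnail", 0.02), ("thumbnail", 0.03),
           ("video-encode", 4.1), ("ocr", 0.9), ("ocr", 1.1)]
print(learn_priorities(history))  # {'thumbnail': 0, 'ocr': 1, 'video-encode': 2}
```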

Starburst: A Cost-aware Scheduler for Hybrid Cloud

Michael Luo, Siyuan Zhuang, Suryaprakash Vengadesan, and Romil Bhardwaj, UC Berkeley; Justin Chang, UC Santa Barbara; Eric Friedman, Scott Shenker, and Ion Stoica, UC Berkeley

To efficiently tackle bursts in job demand, organizations employ hybrid cloud architectures to scale their batch workloads from their private clusters to the public cloud. This requires transforming cluster schedulers into cloud-enabled versions to navigate the tradeoff between cloud costs and scheduler objectives such as job completion time (JCT). However, our analysis over production-level traces shows that existing cloud-enabled schedulers incur inefficient cost-JCT trade-offs due to low cluster utilization.

We present Starburst, a system that maximizes cluster utilization to streamline the cost-JCT tradeoff. Starburst's scheduler dynamically controls jobs' waiting times to improve utilization—it assigns longer waits to large jobs to increase their chances of running on the cluster, and shorter waits to small jobs to increase their chances of running on the cloud. To offer configurability, Starburst provides system administrators a simple waiting budget framework to tune their position on the cost-JCT curve. In a departure from traditional cluster schedulers, Starburst operates as a higher-level resource manager over a private cluster and dynamic cloud clusters. Simulations over production-level traces and real-world experiments on a 32-GPU private cluster show that Starburst can reduce cloud costs by 54–91% over existing cluster managers, while increasing average JCT by at most 5.8%.
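
As a rough illustration of the waiting-budget idea described above (this is not Starburst's actual policy; the budget constant and job sizes are invented), a scheduler might scale each job's maximum wait with its GPU-hours, so large jobs linger for the private cluster while small ones spill to the cloud quickly:

```python
# Toy sketch of a waiting-budget policy: each job's wait deadline is
# proportional to its resource demand. If the private cluster cannot place
# the job before the deadline, it spills to the cloud.
def assign_wait(job_gpus, job_runtime_h, budget_hours_per_gpu_hour=0.25):
    # Larger jobs (more GPU-hours) wait longer, so they are more likely to run
    # on the cheaper private cluster; small jobs give up and go to the cloud fast.
    return budget_hours_per_gpu_hour * job_gpus * job_runtime_h

for gpus, runtime in [(1, 0.5), (8, 4), (64, 12)]:
    print(f"{gpus:>3} GPUs x {runtime:>4}h -> wait up to {assign_wait(gpus, runtime):.1f}h")
```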

StreamBox: A Lightweight GPU SandBox for Serverless Inference Workflow

Hao Wu, Yue Yu, and Junxiao Deng, Huazhong University of Science and Technology; Shadi Ibrahim, Inria; Song Wu and Hao Fan, Huazhong University of Science and Technology and Jinyinhu Laboratory; Ziyue Cheng, Huazhong University of Science and Technology; Hai Jin, Huazhong University of Science and Technology and Jinyinhu Laboratory

The dynamic workload and latency sensitivity of DNN inference drive a trend toward exploiting serverless computing for scalable DNN inference serving. Usually, GPUs are spatially partitioned to serve multiple co-located functions. However, existing serverless inference systems isolate functions in separate monolithic GPU runtimes (e.g., CUDA contexts), which is too heavy for short-lived and fine-grained functions, leading to high startup latency, a large memory footprint, and expensive inter-function communication. In this paper, we present StreamBox, a new lightweight GPU sandbox for serverless inference workflows. StreamBox unleashes the potential of streams and efficiently realizes them for serverless inference by implementing fine-grained and auto-scaling memory management, allowing transparent and efficient intra-GPU communication across functions, and enabling PCIe bandwidth sharing among concurrent streams. Our evaluations over real-world workloads show that StreamBox reduces the GPU memory footprint by up to 82% and improves throughput by 6.7× compared to state-of-the-art serverless inference systems.

Track 2

ML Inference

Grand Ballroom EF

Power-aware Deep Learning Model Serving with μ-Serve

Haoran Qiu, Weichao Mao, Archit Patke, and Shengkun Cui, University of Illinois Urbana-Champaign; Saurabh Jha, Chen Wang, and Hubertus Franke, IBM Research; Zbigniew Kalbarczyk, Tamer Başar, and Ravishankar K. Iyer, University of Illinois Urbana-Champaign

With the increasing popularity of large deep learning model-serving workloads, there is a pressing need to reduce the energy consumption of a model-serving cluster while satisfying throughput or model-serving latency requirements. Model multiplexing approaches such as model parallelism, model placement, replication, and batching aim to optimize model-serving performance. However, they fall short of leveraging the GPU frequency scaling opportunity for power saving. In this paper, we demonstrate (1) the benefits of GPU frequency scaling in power saving for model serving; and (2) the necessity for co-design and optimization of fine-grained model multiplexing and GPU frequency scaling. We explore the co-design space and present a novel power-aware model-serving system, µ-Serve. µ-Serve is a model-serving framework that optimizes the power consumption and model-serving latency/throughput of serving multiple ML models efficiently in a homogeneous GPU cluster. Evaluation results on production workloads show that µ-Serve achieves 1.2–2.6× power saving by dynamic GPU frequency scaling (up to 61% reduction) without SLO attainment violations.
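
The power/latency trade-off at the heart of such a co-design can be sketched in a few lines. The snippet below is only illustrative (the frequency, latency, and power numbers are invented, and µ-Serve's actual controller also co-optimizes model multiplexing): it picks the lowest-power GPU frequency whose predicted latency still meets the SLO.

```python
# Toy frequency selection under an SLO (not the µ-Serve algorithm).
FREQ_PROFILE = [  # (frequency_MHz, predicted_p99_latency_ms, power_W) - made-up values
    (1410, 38.0, 300),
    (1200, 44.0, 240),
    (1000, 52.0, 190),
    (810,  66.0, 150),
]

def pick_frequency(slo_ms):
    feasible = [(f, lat, p) for f, lat, p in FREQ_PROFILE if lat <= slo_ms]
    if not feasible:
        return max(FREQ_PROFILE)[0]                   # fall back to the highest frequency
    return min(feasible, key=lambda x: x[2])[0]       # lowest power among feasible choices

print(pick_frequency(slo_ms=55))   # -> 1000
```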

Fast Inference for Probabilistic Graphical Models

Jiantong Jiang, The University of Western Australia; Zeyi Wen, HKUST (Guangzhou) and HKUST; Atif Mansoor and Ajmal Mian, The University of Western Australia

Probabilistic graphical models (PGMs) have attracted much attention due to their firm theoretical foundation and inherent interpretability. However, existing PGM inference systems are inefficient and lack sufficient generality, due to issues with irregular memory accesses, high computational complexity, and modular design limitations. In this paper, we present Fast-PGM, a fast and parallel PGM inference system for importance sampling-based approximate inference algorithms. Fast-PGM incorporates careful memory management techniques to reduce memory consumption and enhance data locality. It also employs computation and parallelization optimizations to reduce computational complexity and improve overall efficiency. Furthermore, Fast-PGM offers high generality and flexibility, allowing easy integration with all the mainstream importance sampling-based algorithms. The system abstraction of Fast-PGM facilitates easy optimizations, extensions, and customization for users. Extensive experiments show that Fast-PGM achieves 3 to 20 times speedup over the state-of-the-art implementation. Fast-PGM source code is freely available at https://github.com/jjiantong/FastPGM.

Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention

Bin Gao, National University of Singapore; Zhuomin He, Shanghai Jiaotong University; Puru Sharma, Qingxuan Kang, and Djordje Jevdjic, National University of Singapore; Junbo Deng, Xingkun Yang, Zhou Yu, and Pengfei Zuo, Huawei Cloud

Interacting with humans through multi-turn conversations is a fundamental feature of large language models (LLMs). However, existing LLM serving engines executing multi-turn conversations are inefficient due to the need to repeatedly compute the key-value (KV) caches of historical tokens, incurring high serving costs. To address the problem, this paper proposes CachedAttention, a new attention mechanism that enables reuse of KV caches across multi-turn conversations, significantly reducing the repetitive computation overheads. CachedAttention maintains a hierarchical KV caching system that leverages cost-effective memory/storage mediums to save KV caches for all requests. To reduce KV cache access overheads from slow mediums, CachedAttention employs layer-wise pre-loading and asynchronous saving schemes to overlap the KV cache access with the GPU computation. To ensure that the KV caches to be accessed are placed in the fastest hierarchy, CachedAttention employs scheduler-aware fetching and eviction schemes to consciously place the KV caches in different layers based on the hints from the inference job scheduler. To avoid the invalidation of the saved KV caches incurred by context window overflow, CachedAttention enables the saved KV caches to remain valid via decoupling the positional encoding and effectively truncating the KV caches. Extensive experimental results demonstrate that CachedAttention significantly decreases the time to the first token (TTFT) by up to 87%, improves the prompt prefilling throughput by up to 7.8× for multi-turn conversations, and reduces the end-to-end inference cost by up to 70%.
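
A drastically simplified sketch of reusing KV caches across turns is shown below. This is not CachedAttention's implementation: a real system stores per-layer GPU tensors, overlaps loading with computation, and coordinates with the job scheduler, whereas here the cache is a plain byte blob keyed by a conversation ID and spilled between two tiers.

```python
# Toy hierarchical KV cache keyed by conversation ID (illustration only).
class HierarchicalKVCache:
    def __init__(self, fast_capacity=2):
        self.fast, self.slow = {}, {}              # e.g., host DRAM vs. SSD tiers
        self.fast_capacity = fast_capacity

    def save(self, conv_id, kv_blob):
        if len(self.fast) >= self.fast_capacity and conv_id not in self.fast:
            victim, blob = self.fast.popitem()     # evict one entry to the slow tier
            self.slow[victim] = blob
        self.fast[conv_id] = kv_blob

    def fetch(self, conv_id):
        if conv_id in self.fast:
            return self.fast[conv_id]              # hit in the fast tier
        return self.slow.pop(conv_id, None)        # a real system would promote it back

cache = HierarchicalKVCache()
cache.save("conv-1", b"kv-of-turn-1")
print(cache.fetch("conv-1") is not None)           # True: the next prefill can skip old tokens
```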

PUZZLE: Efficiently Aligning Large Language Models through Light-Weight Context Switch

Kinman Lei, Yuyang Jin, Mingshu Zhai, Kezhao Huang, Haoxing Ye, and Jidong Zhai, Tsinghua University

Aligning Large Language Models (LLMs) is currently the primary method to ensure AI systems operate in an ethically responsible and socially beneficial manner. Its paradigm differs significantly from standard pre-training or fine-tuning: it involves multiple models and workloads (contexts) and requires frequently switching execution among them, which introduces significant overhead, such as parameter updates and data transfer. This poses a critical challenge: efficiently switching between different models and workloads.

To address these challenges, we introduce PUZZLE, an efficient system for LLM alignment. We explore model orchestration as well as light-weight and smooth workload switching in aligning LLMs by considering the similarity between different workloads. Specifically, PUZZLE adopts a two-dimensional approach for efficient switching, focusing on both intra- and inter-stage switching. Within each stage, switching costs are minimized by exploring model affinities and overlapping computation via time-sharing. Furthermore, a similarity-oriented strategy is employed to find the optimal inter-stage switch plan with the minimum communication cost. We evaluate PUZZLE on various clusters with up to 32 GPUs. Results show that PUZZLE achieves up to 2.12× speedup compared with the state-of-the-art RLHF training system DeepSpeed-Chat.

12:25 pm–2:00 pm

Conference Luncheon

Santa Clara Ballroom

2:00 pm–3:40 pm

Track 1

Storage 1

Grand Ballroom CD

ScalaAFA: Constructing User-Space All-Flash Array Engine with Holistic Designs

Shushu Yi, Peking University and Zhongguancun Laboratory; Xiurui Pan, Peking University; Qiao Li, Xiamen University; Qiang Li, Alibaba; Chenxi Wang, University of Chinese Academy of Sciences; Bo Mao, Xiamen University; Myoungsoo Jung, KAIST and Panmnesia; Jie Zhang, Peking University and Zhongguancun Laboratory

All-flash array (AFA) is a popular approach to aggregate the capacity of multiple solid-state drives (SSDs) while guaranteeing fault tolerance. Unfortunately, existing AFA engines inflict substantial software overheads on the I/O path, such as user-kernel context switches and AFA internal tasks (e.g., parity preparation), and therefore fail to fully exploit next-generation high-performance SSDs.

Tackling this challenge, we propose ScalaAFA, a holistic AFA engine design that can scale the throughput of next-generation SSD arrays with low CPU cost. We incorporate ScalaAFA into user space to avoid user-kernel context switches while harnessing SSD built-in resources for handling AFA internal tasks. Specifically, in adherence to the lock-free principle of existing user-space storage frameworks, ScalaAFA substitutes traditional locks with an efficient message-passing-based permission management scheme to facilitate inter-thread synchronization. Considering the CPU burden imposed by background I/O and parity computation, ScalaAFA offloads these tasks to SSDs. To mitigate host-SSD communication overheads in offloading, ScalaAFA takes a novel data placement policy that enables transparent data gathering and in-situ parity computation. ScalaAFA also addresses two AFA-intrinsic issues, metadata persistence and write amplification, by thoroughly exploiting SSD architectural innovations. Comprehensive evaluation results indicate that ScalaAFA achieves 2.5× higher write throughput and reduces average write latency by 52.7% compared to state-of-the-art AFA engines.

FastCommit: resource-efficient, performant and cost-effective file system journaling

Harshad Shirwadkar, Saurabh Kadekodi, and Theodore Tso, Google

JBD2, the current physical journaling mechanism in Ext4, is bulky and resource-hungry. Specifically, for metadata-heavy workloads, fsyncs issued by applications cause JBD2 to write copies of changed metadata blocks, incurring high byte and IO overhead. When storing data in Ext4 via NFS (a popular setup), the NFS protocol issues fsyncs for every file metadata update, which further exacerbates the problem. In a simple multi-threaded mail-server workload, JBD2 consumed approximately 76% of the disk's write bandwidth. Higher byte and IO utilization of JBD2 results in reduced application throughput, higher wear-out of flash-based media, and increased performance provisioning costs in cloud-based storage services.

We present FastCommit: a hybrid journaling approach for Ext4 which performs logical journaling for simple and frequent file system modifications, while relying on JBD2 for more complex and rare modifications. The key design elements of FastCommit are compact logging, selective flushing, and inline journaling. The first two techniques work together to ensure that over 80% of commits are contained within a single 4KB block and are written to disk without requiring an expensive cache flush operation. Inline journaling minimizes context switching delays. With faster and more efficient fsyncs, FastCommit reduces the throughput interference of JBD2 by over 2× along with throughput improvements of up to 120%. We implemented FastCommit in Ext4 and successfully merged our code into the upstream Linux kernel.
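
The hybrid decision at the core of this design can be illustrated with a toy model. This is not the Ext4 code: the operation names and the 4KB threshold merely stand in for the real logical-record format. Simple metadata updates are encoded compactly into a fast-commit area, and anything else falls back to a full JBD2 commit.

```python
# Toy sketch of a hybrid (logical vs. physical) journaling decision.
SIMPLE_OPS = {"create", "unlink", "link", "rename", "extend"}   # hypothetical set

def commit(ops, fast_area_bytes=4096):
    records = [f"{op}:{inode}".encode() for op, inode in ops]
    payload = b";".join(records)
    if all(op in SIMPLE_OPS for op, _ in ops) and len(payload) <= fast_area_bytes:
        return ("fast-commit", payload)      # compact logical records in one 4KB block
    return ("jbd2-commit", None)             # complex/rare change: full physical journaling

print(commit([("create", 1042), ("extend", 1042)])[0])   # fast-commit
print(commit([("resize-fs", 0)])[0])                      # jbd2-commit
```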

ZMS: Zone Abstraction for Mobile Flash Storage

Joo-Young Hwang, Seokhwan Kim, Daejun Park, Yong-Gil Song, Junyoung Han, Seunghyun Choi, and Sangyeun Cho, Samsung Electronics; Youjip Won, Korea Advanced Institute of Science and Technology

We propose ZMS, an I/O stack for ZNS-based flash storage in the mobile environment. The zone interface is known to relieve flash storage of two fundamental issues that modern flash storage suffers from: logical-to-physical mapping table size and garbage collection overhead. Through extensive study, we find that realizing the zone interface in the mobile environment is a substantial challenge due to its unique characteristics: the lack of on-device memory in mobile flash storage and the frequent fsync() calls in mobile applications. Aligned with this, we identify the root causes that need to be addressed in realizing the zone interface in the mobile I/O stack: write buffer thrashing and tiny synchronous file updates. We develop filesystem, block I/O layer, and device firmware techniques to address these two issues. The three key techniques in ZMS are (i) IOTailor, (ii) budget-based in-place update, and (iii) multi-granularity logical-to-physical mapping. Evaluation on a real production platform shows that ZMS improves write amplification by 2.9–6.4× and random write performance by 5.0–13.6×. With the three techniques, ZMS shows significant performance improvement in writing to multiple zones concurrently, executing SQLite transactions, and launching applications.

Ethane: An Asymmetric File System for Disaggregated Persistent Memory

Miao Cai, College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics; Junru Shen, College of Computer Science and Software Engineering, Hohai University; Baoliu Ye, State Key Laboratory for Novel Software Technology, Nanjing University

Ultra-fast persistent memories (PMs) promise a practical path toward high-performance distributed file systems. This paper examines and reveals a cascade of three performance and cost issues in the current PM provision scheme, namely expensive cross-node interaction, weak single-node capability, and costly scale-out performance, which not only underutilize fast PM devices but also magnify the deficiencies of PM's limited storage capacity and high price. To remedy this, we introduce Ethane, a file system built on disaggregated persistent memory (DPM). Through resource separation using fast connectivity technologies, DPM achieves efficient and cost-effective PM sharing while retaining low-latency memory access. To unleash this hardware potential, Ethane incorporates an asymmetric file system architecture inspired by the imbalanced resource provision feature of DPM. It splits a file system into a control-plane FS and a data-plane FS and designs these two planes to make the best use of the respective hardware resources. Evaluation results demonstrate that Ethane reaps the DPM hardware benefits, performs up to 68× better than modern distributed file systems, and improves data-intensive application throughput by up to 17×.

Track 2

Networks 1

Grand Ballroom EF

PeRF: Preemption-enabled RDMA Framework

Sugi Lee and Mingyu Choi, Acryl Inc.; Ikjun Yeom, Acryl Inc. and Sungkyunkwan University; Younghoon Kim, Sungkyunkwan University

Remote Direct Memory Access (RDMA) provides high throughput, low latency, and minimal CPU usage for data-intensive applications. However, RDMA was initially designed for single-tenant use, and its application in a multi-tenant cloud environment poses challenges in terms of performance isolation, security, and scalability. This paper proposes a Preemption-enabled RDMA Framework (PeRF), which offers software-based performance isolation for efficient multi-tenancy in RDMA. PeRF leverages a novel RNIC preemption mechanism to dynamically control RDMA resource utilization for each tenant, while ensuring that RNICs remain busy, thereby enabling work conservation. PeRF outperforms existing approaches by achieving flexible performance isolation without compromising RDMA's bare-metal performance.

CyberStar: Simple, Elastic and Cost-Effective Network Functions Management in Cloud Network at Scale

Tingting Xu, Nanjing University; Bengbeng Xue, Yang Song, Xiaomin Wu, Xiaoxin Peng, and Yilong Lyu, Alibaba Group; Xiaoliang Wang, Chen Tian, Baoliu Ye, and Camtu Nguyen, Nanjing University; Biao Lyu and Rong Wen, Alibaba Group; Zhigang Zong, Alibaba Group and Zhejiang University; Shunmin Zhu, Alibaba Group and Tsinghua University

Network functions (NFs) facilitate network operations and have become a critical service offered by cloud providers. One of the key challenges is how to meet the elastic requirements of massive traffic and the diverse NF requests of tenants. This paper identifies the opportunity of leveraging cloud elastic compute services (ECS), i.e., containers or virtual machines, to provide cloud-scale network function services, and presents CyberStar. CyberStar introduces two key designs: (i) resource pooling based on a newly proposed three-tier architecture for scalable network functions; and (ii) on-demand resource assignment that maintains high resource utilization in terms of both tenant demands and operation cost. Compared to traditional NFs constructed over bare-metal servers, CyberStar achieves 100Gbps bandwidth (6.7×) and scales to millions of connections within one second (20×).

OSMOSIS: Enabling Multi-Tenancy in Datacenter SmartNICs

Mikhail Khalilov, Marcin Chrapek, Siyuan Shen, Alessandro Vezzu, Thomas Benz, Salvatore Di Girolamo, and Timo Schneider, ETH Zürich; Daniele De Sensi, ETH Zürich and Sapienza University of Rome; Luca Benini and Torsten Hoefler, ETH Zürich

Multi-tenancy is essential for unleashing SmartNICs' potential in datacenters. Our systematic analysis in this work shows that existing on-path SmartNICs have resource multiplexing limitations. For example, existing solutions lack multi-tenancy capabilities such as performance isolation and QoS provisioning for compute and IO resources. Compared to standard NIC data paths with a well-defined set of offloaded functions, the unpredictable execution times of SmartNIC kernels make conventional approaches to multi-tenancy and QoS insufficient. We fill this gap with OSMOSIS, a SmartNIC resource manager co-design. OSMOSIS extends existing OS mechanisms to enable dynamic hardware resource multiplexing of the on-path packet processing data plane. We integrate OSMOSIS within an open-source RISC-V-based 400Gbit/s SmartNIC. Our performance results demonstrate that OSMOSIS fully supports multi-tenancy and enables broader adoption of SmartNICs in datacenters with low overhead.

ETC: An Elastic Transmission Control Using End-to-End Available Bandwidth Perception

Feixue Han, Tsinghua Shenzhen International Graduate School and Peng Cheng Laboratory; Qing Li, Peng Cheng Laboratory; Peng Zhang, Tencent; Gareth Tyson, Hong Kong University; Yong Jiang, Tsinghua Shenzhen International Graduate School and Peng Cheng Laboratory; Mingwei Xu, Tsinghua University; Yulong Lan and ZhiCheng Li, Tencent

Researchers and practitioners have proposed various transport protocols to keep up with advances in networks and the applications that use them. Current Wide Area Network protocols strive to identify a congestion signal to make distributed but fair judgments. However, existing congestion signals such as RTT and packet loss can only be observed after congestion occurs. We therefore propose Elastic Transmission Control (ETC). ETC exploits the instantaneous receipt rate of N consecutive packets as the congestion signal. We refer to this as the pulling rate, as we posit that the receipt rate can be used to "pull" the sending rate towards a fair share of the capacity. Naturally, this signal can be measured prior to congestion, as senders can access it immediately after the acknowledgment of the first N packets. Exploiting the pulling rate measurements, ETC calculates the optimal rate update steps following a simple elastic principle: the further away from the pulling rate, the faster the sending rate increases. We conduct extensive experiments using both simulated and real networks. Our results show that ETC outperforms state-of-the-art protocols in terms of both throughput (15% higher than Copa) and latency (20% lower than BBR). ETC also shows superior convergence speed and fairness, with a 10× improvement in convergence time even compared to the protocol with the best convergence performance.
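
The elastic principle stated above can be expressed as a one-line update rule. The toy loop below is not the ETC protocol (the gain constant and rates are hypothetical); it only shows a sending rate moving toward the measured pulling rate with a step proportional to their distance:

```python
# Toy elastic rate update: step size grows with distance from the pulling rate.
def update_rate(sending_rate_mbps, pulling_rate_mbps, gain=0.5):
    # Further from the pulling rate -> larger step; at the pulling rate -> no change.
    return sending_rate_mbps + gain * (pulling_rate_mbps - sending_rate_mbps)

rate = 10.0
for pull in [100.0, 100.0, 100.0, 60.0]:   # receipt rate of the last N packets
    rate = update_rate(rate, pull)
    print(f"sending rate -> {rate:.1f} Mbps")
```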

3:40 pm–4:10 pm

Break with Refreshments

Grand Ballroom Foyer

4:10 pm–5:55 pm

Track 1

Edge Computing

Grand Ballroom CD

More is Different: Prototyping and Analyzing a New Form of Edge Server with Massive Mobile SoCs

Li Zhang, Beijing University of Posts and Telecommunications; Zhe Fu, Tsinghua University; Boqing Shi and Xiang Li, Beijing University of Posts and Telecommunications; Rujin Lai and Chenyang Yang, vclusters; Ao Zhou, Xiao Ma, Shangguang Wang, and Mengwei Xu, Beijing University of Posts and Telecommunications

Huge energy consumption poses a significant challenge for edge clouds. In response to this, we introduce a new type of edge server, namely SoC Cluster, that orchestrates multiple low-power mobile system-on-chips (SoCs) through an on-chip network. For the first time, we have developed a concrete SoC Cluster consisting of 60 Qualcomm Snapdragon 865 SoCs housed in a 2U rack, which has been successfully commercialized and extensively deployed in edge clouds. Cloud gaming emerges as the principal workload on these deployed SoC Clusters, owing to the compatibility between mobile SoCs and native mobile games.

In this study, we aim to demystify whether the SoC Cluster can efficiently serve more generalized, typical edge workloads. Therefore, we developed a benchmark suite that employs state-of-the-art libraries for two critical edge workloads, i.e., video transcoding and deep learning inference. This suite evaluates throughput, latency, power consumption, and other application-specific metrics like video quality. Following this, we conducted a thorough measurement study and directly compared the SoC Cluster with traditional edge servers with regard to electricity usage and monetary cost. Our results quantitatively reveal when and for which applications mobile SoCs exhibit higher energy efficiency than traditional servers, as well as their ability to proportionally scale power consumption with fluctuating incoming loads. These outcomes provide insightful implications and offer valuable direction for further refinement of the SoC Cluster to facilitate its deployment across wider edge scenarios.

HiP4-UPF: Towards High-Performance Comprehensive 5G User Plane Function on P4 Programmable Switches

Zhixin Wen and Guanhua Yan, Binghamton University

Due to better cost benefits, P4 programmable switches have been considered in a few recent works to implement the 5G User Plane Function (UPF). To circumvent the limited resources on P4 programmable switches, they either ignore some essential UPF features or resort to a hybrid deployment approach that requires extra resources. This work aims to improve the performance of UPFs with comprehensive features which, except for packet buffering, are deployable entirely on commodity P4 programmable switches. We build a baseline UPF based on prior work and analyze its key performance bottlenecks. We propose a three-tiered approach to optimize rule storage on the switch ASICs. We also develop a novel scheme that combines pendulum table access and selective usage pulling to reduce the operational latency of the UPF. Using a commodity P4 programmable switch, the experimental results show that our UPF implementation can support twice as many mobile devices as the baseline UPF and 1.9 times more than SD-Fabric. Our work also improves the throughput of three common types of 5G call flows by 9–619% over the UPF solutions in two open-source 5G network emulators.

KEPC-Push: A Knowledge-Enhanced Proactive Content Push Strategy for Edge-Assisted Video Feed Streaming

Ziwen Ye, Peng Cheng Laboratory and Tsinghua Shenzhen International Graduate School; Qing Li, Peng Cheng Laboratory; Chunyu Qiao, ByteDance; Xiaoteng Ma, Tsinghua Shenzhen International Graduate School; Yong Jiang, Peng Cheng Laboratory and Tsinghua Shenzhen International Graduate School; Qian Ma and Shengbin Meng, ByteDance; Zhenhui Yuan, University of Warwick; Zili Meng, HKUST

Video Feed Streaming (e.g., TikTok, Reels) is increasingly popular nowadays. Users are scheduled onto the distribution infrastructure, including content distribution network (CDN) and multi-access edge computing (MEC) nodes, to access content. Our observation is that the existing proactive content push algorithms, which are primarily based on historical access information and designed for on-demand videos, no longer meet the demands of video feed streaming. The main reason is that video feed streaming applications always push recently generated videos to attract users' interest, and thus lack historical information at push time. In this case, push mismatches and load imbalances arise, resulting in higher bandwidth costs and degraded user experience. To this end, we propose KEPC-Push, a Knowledge-Enhanced Proactive Content Push strategy that uses knowledge of video content features. KEPC-Push employs knowledge graphs to determine the popularity correlation among similar videos (with similar authors, content, length, etc.) and pushes content based on this guidance. Besides, KEPC-Push designs a hierarchical algorithm to optimize resource allocation across edge nodes with heterogeneous capabilities and runs at the regional level to shorten the communication distance. Trace-driven simulations show that KEPC-Push reduces peak-period CDN bandwidth costs by 20% and improves average download speeds by 7% against state-of-the-art solutions.

High-density Mobile Cloud Gaming on Edge SoC Clusters

Li Zhang, Shangguang Wang, and Mengwei Xu, Beijing University of Posts and Telecommunications

System-on-Chip (SoC) Clusters, i.e., servers consisting of many stacked mobile SoCs, have emerged as a popular platform for serving mobile cloud gaming. Sharing the underlying hardware and OS, these SoC Clusters enable native mobile games to be executed and rendered efficiently without modification. However, the number of deployed game sessions is limited due to conservative deployment strategies and high GPU utilization in current game offloading methods. To address these challenges, we introduce SFG, the first system that enables high-density mobile cloud gaming on SoC Clusters with two novel techniques: (1) It employs a resource-efficient game partitioning and cross-SoC offloading design that maximally preserves GPU optimization intents in the standard graphics rendering pipeline; (2) It proposes an NPU-enhanced game partition coordination strategy to adjust game performance when co-locating partitioned and complete game sessions. Our evaluation of five Unity games shows that SFG achieves up to 4.5× higher game density than existing methods with trivial performance loss. Equally important, SFG extends the lifespan of SoC Clusters, enabling outdated SoC Clusters to serve new games that are unfeasible on a single SoC due to GPU resource shortages.

Track 2

Operating Systems 1

Grand Ballroom EF

Limitations and Opportunities of Modern Hardware Isolation Mechanisms

Xiangdong Chen and Zhaofeng Li, University of Utah; Tirth Jain, Maya Labs; Vikram Narayanan and Anton Burtsev, University of Utah

A surge in the number, complexity, and automation of targeted security attacks has triggered a wave of interest in hardware support for isolation. Intel memory protection keys (MPK), ARM pointer authentication (PAC), ARM memory tagging extensions (MTE), and ARM Morello capabilities are just a few hardware mechanisms aimed at supporting low-overhead isolation in recent CPUs. These new mechanisms aim to bring practical isolation to a broad range of systems, e.g., browser plugins, device drivers and kernel extensions, user-defined database and network functions, serverless cloud platforms, and many more. However, as these technologies are still nascent, their advantages and limitations remain unclear. In this work, we take an in-depth look at modern hardware isolation mechanisms with the goal of understanding their suitability for isolating subsystems with the tightest performance budgets. Our analysis shows that while a huge step forward, the isolation mechanisms in commodity CPUs still lack several design principles critical for supporting low-overhead enforcement of isolation boundaries, zero-copy exchange of data, and secure revocation of access permissions.

FetchBPF: Customizable Prefetching Policies in Linux with eBPF

Xuechun Cao, Shaurya Patel, and Soo Yee Lim, University of British Columbia; Xueyuan Han, Wake Forest University; Thomas Pasquier, University of British Columbia

Monolithic operating systems are infamously complex. Linux in particular has a tendency to intermingle policy and mechanism in a manner that hinders modularity. This is especially problematic when developers aim to finely optimize performance, since a default policy in Linux, while performing well on average, often cannot achieve optimal performance in all circumstances. However, developing and maintaining a bespoke kernel to satisfy the needs of a specific application is usually an unrealistic endeavor due to the high software engineering cost. Therefore, we need a mechanism to easily customize kernel policies and behavior. In this paper, we design a framework called FetchBPF that addresses this problem in the context of memory prefetching. FetchBPF extends the widely used eBPF framework to allow developers to easily express, develop, and deploy prefetching policies without modifying the kernel codebase. We implement various memory prefetching policies from the literature and demonstrate that our deployment model incurs negligible overhead compared to the equivalent native kernel implementation.
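
Prefetching policies of the kind FetchBPF targets are typically small, state-machine-like functions. The user-space Python toy below does not use FetchBPF's or eBPF's APIs; it only sketches the logic of one such policy: a sequential readahead window that doubles on streaks of sequential faults and resets on random accesses.

```python
# Toy simulation of a customizable readahead policy (illustration only).
def make_sequential_policy(max_window=64):
    state = {"last_page": None, "window": 4}
    def on_fault(page):
        if state["last_page"] is not None and page == state["last_page"] + 1:
            state["window"] = min(state["window"] * 2, max_window)   # sequential streak: grow
        else:
            state["window"] = 4                                      # random access: reset
        state["last_page"] = page
        return list(range(page + 1, page + 1 + state["window"]))     # pages to prefetch
    return on_fault

policy = make_sequential_policy()
for faulting_page in [10, 11, 12, 500]:
    print(faulting_page, "->", policy(faulting_page)[:4], "...")
```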

Fast (Trapless) Kernel Probes Everywhere

Jinghao Jia, University of Illinois Urbana-Champaign; Michael V. Le and Salman Ahmed, IBM T.J. Watson Research Center; Dan Williams, Virginia Tech and IBM T.J. Watson Research Center; Hani Jamjoom, IBM T.J. Watson Research Center; Tianyin Xu, University of Illinois at Urbana-Champaign

The ability to efficiently probe and instrument a running operating system (OS) kernel is critical for debugging, system security, and performance monitoring. While efforts to optimize the widely used Kprobes in Linux over the past two decades have greatly improved its performance, many fundamental gaps remain that prevent it from being completely efficient. Specifically, we find that Kprobe is only optimized for ~80% of kernel instructions, leaving the remaining probe-able kernel code to suffer the severe penalties of double traps needed by the Kprobe implementation. In this paper, we focus on the design and implementation of an efficient and general trapless kernel probing mechanism (no hardware exceptions) that can be applied to almost all code in Linux. We discover that the main limitation of current probe optimization efforts comes from not being able to assume or change certain properties/layouts of the target kernel code. Our main insight is that by introducing strategically placed nops, thus slightly changing the code layout, we can overcome this main limitation. We implement our mechanism on Linux Kprobe, which is transparent to the users. Our evaluation shows a 10× improvement in probe performance over standard Kprobe while providing this level of performance for 96% of kernel code.

HydraRPC: RPC in the CXL Era

Teng Ma, Alibaba Group; Zheng Liu, Zhejiang University and Alibaba Group; Chengkun Wei, Zhejiang University; Jialiang Huang, Alibaba Group and Tsinghua University; Youwei Zhuo, Alibaba Group and Peking University; Haoyu Li, Zhejiang University; Ning Zhang, Yijin Guan, and Dimin Niu, Alibaba Group; Mingxing Zhang, Tsinghua University; Tao Ma, Alibaba Group

In this paper, we present HydraRPC, which utilizes CXL-attached HDM for data transmission. By leveraging CXL, HydraRPC can benefit from memory sharing, memory semantics, and high scalability. As a result, expensive network rounds, memory copying, and serialization/deserialization are eliminated. Since CXL.cache protocols are not fully supported, we employ non-cacheable sharing to bypass the CPU cache and design a busy-polling-free notification mechanism. This ensures efficient data transmission without the need for constant polling. We conducted evaluations of HydraRPC on real CXL hardware, which showcased the potential efficiency of utilizing CXL HDM to build RPC systems.

ExtMem: Enabling Application-Aware Virtual Memory Management for Data-Intensive Applications

Sepehr Jalalian, Shaurya Patel, Milad Rezaei Hajidehi, Margo Seltzer, and Alexandra Fedorova, University of British Columbia

For over forty years, researchers have demonstrated that operating system memory managers often fall short in supporting memory-hungry applications. The problem is even more critical today, with disaggregated memory and new memory technologies and in the presence of tera-scale machine learning models, large-scale graph processing, and other memory-intensive applications. Past attempts to provide application-specific memory management either required significant in-kernel changes or suffered from high overhead. We present ExtMem, a flexible framework for providing application-specific memory management. It differs from prior solutions in three ways: (1) It is compatible with today’s Linux deployments, (2) it is a general-purpose substrate for addressing various memory and storage backends, and (3) it is performant in multithreaded environments. ExtMem allows for easy and rapid prototyping of new memory management algorithms, easy collection of memory patterns and statistics, and immediate deployment of isolated custom memory management.

6:00 pm–7:30 pm

OSDI '24 Poster Session and Reception

Sponsored by Amazon

Santa Clara Ballroom

Thursday, July 11

8:00 am–9:00 am

Continental Breakfast

Grand Ballroom Foyer

9:00 am–10:40 am

Track 1

Operating Systems 2

Grand Ballroom CD

Telescope: Telemetry for Gargantuan Memory Footprint Applications

Alan Nair, Sandeep Kumar, and Aravinda Prasad, Intel Labs; Ying Huang, Intel Corporation; Andy Rudoff and Sreenivas Subramoney, Intel Labs

Data-hungry applications that require terabytes of memory have become widespread in recent years. To meet the memory needs of these applications, data centers are embracing tiered memory architectures with near and far memory tiers. Precise, efficient, and timely identification of hot and cold data and their placement in appropriate tiers is critical for performance in such systems. Unfortunately, the existing state-of-the-art telemetry techniques for hot and cold data detection are ineffective at terabyte scale.

We propose Telescope, a novel technique that profiles different levels of the application's page table tree for fast and efficient identification of hot and cold data. Telescope is based on the observation that for a memory- and TLB-intensive workload, higher levels of a page table tree are also frequently accessed during a hardware page table walk. Hence, the hotness of the higher levels of the page table tree essentially captures the hotness of its subtrees or address space sub-regions at a coarser granularity. We exploit this insight to quickly converge on even a few megabytes of hot data and efficiently identify several gigabytes of cold data in terabyte-scale applications. Importantly, such a technique can seamlessly scale to petabyte-scale applications.

Telescope's telemetry achieves 90%+ precision and recall at just 0.9% single CPU utilization for microbenchmarks with 5 TB memory footprint. Memory tiering based on Telescope results in 5.6% to 34% throughput improvement for real-world benchmarks with 1–2 TB memory footprint compared to other state-of-the-art telemetry techniques.
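
The core observation, that hotness at higher page-table levels summarizes the hotness of whole address-space sub-regions, can be illustrated with a toy trace. The Python sketch below is not Telescope's implementation (which profiles real page-table accesses during hardware walks); it simply aggregates a synthetic access trace at 1GB and 2MB granularities to show how coarse regions expose hot data quickly.

```python
# Toy coarse-grained hotness aggregation (illustration only).
from collections import Counter

PAGE, PMD, PUD = 4 << 10, 2 << 20, 1 << 30     # 4KB pages, 2MB and 1GB regions

def hot_regions(access_addresses, granularity, top=3):
    counts = Counter(addr // granularity for addr in access_addresses)
    return counts.most_common(top)

# Synthetic trace: most accesses fall inside one 2MB region, plus one outlier.
trace = [0x40000000 + (i % 512) * PAGE for i in range(10_000)] + [0x7f0000000000]
print("hot 1GB regions:", hot_regions(trace, PUD))
print("hot 2MB regions:", hot_regions(trace, PMD))
```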

An Empirical Study of Rust-for-Linux: The Success, Dissatisfaction, and Compromise

Hongyu Li, Beijing University of Posts and Telecommunications; Liwei Guo, University of Electronic Science and Technology of China; Yexuan Yang, Shangguang Wang, and Mengwei Xu, Beijing University of Posts and Telecommunications

Developed for over 30 years, Linux has already become the computing foundation of today's digital world; from gigantic, complex mainframes (e.g., supercomputers) to cheap, wimpy embedded devices (e.g., IoT devices), countless applications are built on top of it. Yet, such an infrastructure has been plagued by numerous memory and concurrency bugs since the day it was born, because the C language permits many rogue memory operations. A recent project, Rust-for-Linux (RFL), has the potential to address Linux's safety concerns once and for all: by embracing Rust's static ownership and type checkers in the kernel code, the kernel may finally be free from memory and concurrency bugs without hurting its performance. While RFL has gradually matured and has even been merged into the Linux mainline, it is rarely studied, and it remains unclear whether it has indeed reconciled the safety and performance dilemma for the kernel.

To this end, we conduct the first empirical study on RFL to understand its status quo and benefits, especially how Rust fuses with Linux and whether the fusion assures driver safety without overhead. We collect and analyze 6 key RFL drivers, which involve hundreds of issues and PRs, thousands of GitHub commits and mail exchanges on the Linux mailing list, as well as over 12K discussions on Zulip. We find that while Rust mitigates kernel vulnerabilities, fully eliminating them is beyond Rust's capability; moreover, if not handled properly, its safety assurance costs developers dearly in terms of both runtime overhead and development effort.

Scalable and Effective Page-table and TLB management on NUMA Systems

Bin Gao, Qingxuan Kang, and Hao-Wei Tee, National University of Singapore; Kyle Timothy Ng Chu, Horizon Quantum Computing; Alireza Sanaee, Queen Mary University of London; Djordje Jevdjic, National University of Singapore

Memory management operations that modify page-tables, typically performed during memory allocation/deallocation, are infamous for their poor performance in highly threaded applications, largely due to process-wide TLB shootdowns that the OS must issue due to the lack of hardware support for TLB coherence. We study these operations in NUMA settings, where we observe up to 40× overhead for basic operations such as munmap or mprotect. The overhead further increases if page-table replication is used, where complete coherent copies of the page-tables are maintained across all NUMA nodes. While eager system-wide replication is extremely effective at localizing page-table reads during address translation, we find that it creates additional penalties upon any page-table changes due to the need to maintain all replicas coherent.

In this paper, we propose a novel page-table management mechanism, called Hydra, to enable transparent, on-demand, and partial page-table replication across NUMA nodes in order to perform address translation locally, while avoiding the overheads and scalability issues of system-wide full page-table replication. We then show that Hydra's precise knowledge of page-table sharers can be leveraged to significantly reduce the number of TLB shootdowns issued upon any memory-management operation. As a result, Hydra not only avoids replication-related slowdowns, but also provides significant speedup over the baseline on memory allocation/deallocation and access control operations. We implement Hydra in Linux on x86_64, evaluate it on 4- and 8-socket systems, and show that Hydra achieves the full benefits of eager page-table replication on a wide range of applications, while also achieving a 12% and 36% runtime improvement on Webserver and Memcached respectively due to a significant reduction in TLB shootdowns.

UniMem: Redesigning Disaggregated Memory within A Unified Local-Remote Memory Hierarchy

Yijie Zhong, Minqiang Zhou, and Zhirong Shen, Xiamen University; Jiwu Shu, Xiamen University and Minjiang University

Disaggregated memory (DM) has been proposed as a feasible solution for scaling memory capacity. A variety of memory disaggregation approaches have been introduced to facilitate the practical use of DM. The cache-coherent-based DM system, which relies on a cache-coherent accelerator, can offer network-attached memory as NUMA memory. However, current cache-coherent-based DM systems introduce an extra address translation for each remote memory access. Meanwhile, the local cache mechanism of existing approaches overlooks the inherent issues of cache thrashing and pollution that arise in DM systems. This paper presents UniMem, a cache-coherent-based DM system that proposes a unified local-remote memory hierarchy to remove the extra indirection layer on the remote memory access path. To optimize local memory utilization, UniMem redesigns the local cache mechanism to prevent cache thrashing and pollution. Furthermore, UniMem puts forth a page migration mechanism that promotes frequently used pages from device-attached memory to host memory based not only on page hotness but also on hotness fragmentation. Compared to state-of-the-art systems, UniMem reduces the average memory access time by up to 76.4% and offers substantial improvement in terms of data amplification.

Track 2

Correctness

Grand Ballroom EF

WingFuzz: Implementing Continuous Fuzzing for DBMSs

Jie Liang, Zhiyong Wu, and Jingzhou Fu, Tsinghua University; Yiyuan Bai and Qiang Zhang, Shuimu Yulin Technology Co., Ltd.; Yu Jiang, Tsinghua University

Database management systems (DBMSs) are critical components within software ecosystems, and their security and stability are paramount. In recent years, fuzzing has emerged as a prominent automated testing technique, effectively identifying vulnerabilities in various DBMSs. Nevertheless, many of these fuzzers require specific adaptation for a DBMS with a particular version. Employing these techniques to test enterprise-level DBMSs continuously poses challenges due to the diverse specifications of DBMSs and the code changes in their rapid version evolution.

In this paper, we present the industry practice of implementing continuous DBMS fuzzing on enterprise-level DBMSs like ClickHouse. We summarize three main obstacles to this practice, namely the diverse SQL grammars in test case generation, the ongoing evolution of the codebase during continuous testing, and the disturbance of noise during anomaly analysis. We propose WingFuzz, which utilizes specification-based mutator generation, corpus-driven evolving code fuzzing, and noise-resilient anomaly assessment to address them. By working with the engineers in continuous DBMS fuzzing, we have found a total of 236 previously undiscovered bugs in 12 widely used enterprise-level DBMSs including ClickHouse, DamengDB, and TenDB. Owing to these favorable test results, our efforts have received recognition and cooperation invitations from some DBMS vendors. For example, ClickHouse's CTO praised: "Which tool did you use to find this test case? We need to integrate it into our CI." WingFuzz has since been successfully integrated into ClickHouse's development process.

Balancing Analysis Time and Bug Detection: Daily Development-friendly Bug Detection in Linux

Keita Suzuki, Keio University; Kenta Ishiguro, Hosei University; Kenji Kono, Keio University

Linux, a battle-tested codebase, is known to suffer from many bugs despite its extensive testing mechanisms. While many of these bugs require domain-specific knowledge for detection, a significant portion matches well-known bug patterns. Even though these bugs can be found with existing tools, our simple check of Linux kernel patches suggests that these tools are not used much in the developer's daily workflow. The lack of usage is probably due to the well-known trade-off between analysis time and bug detection capabilities: tools typically employ complex analysis to effectively and comprehensively find bugs in return for a long analysis time, or focus on a short analysis time by only employing elementary analyses and thus can only find a very limited number of bugs. Ideally, developers expect the tools to incur short analysis time, while still finding many bugs to use them in daily development.

This paper explores an approach that balances this trade-off by focusing on bugs that can be found with less computationally complex analysis methods and by limiting the analysis scope to individual source files. To achieve this, we propose a combination of computationally lightweight analyses and demonstrate our claim by designing FiTx, a framework for generating daily-development-friendly bug checkers that focus on well-known patterns. Despite its simplicity, FiTx successfully identified 47 new bugs in Linux kernel version 5.15 within 2.5 hours, outperforming Clang Static Analyzer and CppCheck in both speed and bug detection. It demonstrates that focusing on less complex bug patterns can still significantly contribute to the improvement of codebase health. FiTx can be embedded into the daily development routine, enabling early bug detection without sacrificing developers' time.
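
To give a sense of what a checker for a "well-known pattern" looks like, here is a toy state-machine checker in Python. FiTx itself generates checkers that run over compiler IR within each function, not over event tuples; the sketch only mirrors the idea of tracking a pointer's state and flagging double-free and use-after-free.

```python
# Toy intra-function state-machine checker for memory-lifetime bugs.
def check_function(events):
    state, bugs = {}, []
    for i, (op, var) in enumerate(events):
        if op == "alloc":
            state[var] = "valid"
        elif op == "free":
            if state.get(var) == "freed":
                bugs.append((i, f"double free of {var}"))
            state[var] = "freed"
        elif op == "use":
            if state.get(var) == "freed":
                bugs.append((i, f"use after free of {var}"))
    return bugs

events = [("alloc", "buf"), ("free", "buf"), ("use", "buf"), ("free", "buf")]
print(check_function(events))   # [(2, 'use after free of buf'), (3, 'double free of buf')]
```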

Kivi: Verification for Cluster Management

Bingzhe Liu and Gangmuk Lim, UIUC; Ryan Beckett, Microsoft; P. Brighten Godfrey, UIUC and Broadcom

Modern cloud infrastructure is powered by cluster management systems such as Kubernetes and Docker Swarm. While these systems seek to minimize users’ operational burden, the complex, dynamic, and non-deterministic nature of these systems makes them hard to reason about, potentially leading to failures ranging from performance degradation to outages.

We present Kivi, the first system for verifying controllers and their configurations in cluster management systems. Kivi focuses on the popular system Kubernetes, and models its controllers and events as processes whose interleavings are exhaustively checked via model checking. Central to handling autoscaling and large-scale deployments are our modeling optimizations and our design, which seeks to find violations in a smaller, reduced topology. We show that Kivi is effective and accurate in finding issues in realistic and complex scenarios and showcase two new issues in Kubernetes controller source code.

Monarch: A Fuzzing Framework for Distributed File Systems

Tao Lyu, EPFL; Liyi Zhang, University of Waterloo; Zhiyao Feng, Yueyang Pan, and Yujie Ren, EPFL; Meng Xu, University of Waterloo; Mathias Payer and Sanidhya Kashyap, EPFL

Distributed file systems (DFSes) are prone to bugs. Although numerous bug-finding techniques have been applied to DFSes, static analysis does not scale well with the sheer complexity of DFS codebases while dynamic methods (e.g., regression testing) are limited by the quality of test cases. Although both can be improved by pouring in manual effort, they are less practical when facing a diverse set of real-world DFSes. Fuzzing, on the other hand, has shown great success in local systems. However, several problems exist if we apply existing fuzzers to DFSes as they 1) cannot test multiple components of DFSes holistically; 2) miss the critical testing aspects of DFSes (e.g., distributed faults); 3) have not yet explored the practical state representations as fuzzing feedback; and 4) lack checkers for asserting semantic bugs unique to DFSes.

In this paper, we introduce MONARCH, a multi-node fuzzing framework to test all POSIX-compliant DFSes under one umbrella. MONARCH pioneers push-button fuzzing for DFSes with a new set of building blocks to the fuzzing toolbox: 1) A multi-node fuzzing architecture for testing diverse DFSes from a holistic perspective; 2) A two-step mutator for testing DFSes with syscalls and faults; 3) Practical execution state representations with a unified coverage collection scheme across execution contexts; 4) A new DFSes semantic checker SYMSC. We applied MONARCH to six DFSes and uncovered a total of 48 bugs, including a bug whose existence can be traced back to the initial release of the DFSes.

10:40 am–11:10 am

Break with Refreshments

Grand Ballroom Foyer

11:10 am–12:25 pm

Track 1

ML Training

Grand Ballroom CD

Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism

Tailing Yuan, Yuliang Liu, Xucheng Ye, Shenglong Zhang, Jianchao Tan, Bin Chen, Chengru Song, and Di Zhang, Kuaishou Technology

Recent advancements in training large-scale models have centered on optimizing activation strategies and exploring various parallel training options. One research avenue focuses on enhancing activation-related operations, such as offloading and recomputing. However, there is room for further refinement in these strategies to improve the balance between computation and memory utilization. Another line of work explores different training parallelisms, which often require extensive parameter tuning and achieve suboptimal combinations of parallel options.

To tackle these challenges, this paper introduces a novel method for losslessly accelerating the training of large language models. Specifically, two efficient activation rematerialization strategies are proposed: Pipeline-Parallel-Aware Offloading, which maximizes the use of host memory for storing activations, and Compute-Memory Balanced Checkpointing, which seeks a practical equilibrium between activation memory and computational efficiency. Additionally, the paper presents an extremely efficient search method for optimizing hybrid-parallelism parameters, considering both offloading and checkpointing, to achieve optimal performance. The efficacy of the proposed method is demonstrated through extensive experiments on public benchmarks with diverse model sizes and context window sizes. For example, the method significantly increases Model FLOPs Utilization (MFU) from 32.3% to 42.7% for a 175B Llama-like model with a context window size of 32,768 on 256 NVIDIA H800 GPUs.
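
For readers unfamiliar with the reported metric, the sketch below shows how Model FLOPs Utilization (MFU) is commonly computed for decoder-only transformers: achieved model FLOPs divided by aggregate peak hardware FLOPs, using the common ~6N FLOPs-per-token approximation. The function name and all numbers are hypothetical placeholders, not measurements or code from this paper.

```python
# Illustrative sketch of the MFU metric (hypothetical numbers, not the paper's code).

def model_flops_utilization(tokens_per_second: float,
                            n_params: float,
                            peak_flops_per_gpu: float,
                            n_gpus: int) -> float:
    """MFU = achieved model FLOPs per second / aggregate peak FLOPs.

    Uses the common ~6 * N FLOPs-per-token approximation for a decoder-only
    model with N parameters (forward + backward).
    """
    achieved = 6.0 * n_params * tokens_per_second
    peak = peak_flops_per_gpu * n_gpus
    return achieved / peak

# Hypothetical example: a 175e9-parameter model on 256 GPUs, each rated ~989 TFLOP/s.
print(model_flops_utilization(tokens_per_second=1.0e5,
                              n_params=175e9,
                              peak_flops_per_gpu=989e12,
                              n_gpus=256))
```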

Metis: Fast Automatic Distributed Training on Heterogeneous GPUs

Taegeon Um, Byungsoo Oh, Minyoung Kang, Woo-Yeon Lee, Goeun Kim, Dongseob Kim, Youngtaek Kim, and Mohd Muzzammil, Samsung Research; Myeongjae Jeon, UNIST

As deep learning model sizes expand and new GPUs are released every year, the need for distributed training on heterogeneous GPUs rises, both to fully harness under-utilized low-end GPUs and to reduce the cost of purchasing expensive high-end GPUs. In this paper, we introduce Metis, a system designed to automatically find efficient parallelism plans for distributed training on heterogeneous GPUs. Metis holistically optimizes several key system components, such as the profiler, cost estimator, and planner, which were previously limited to a single GPU type, so that they efficiently leverage the compute power and memory capacities of diverse GPU types. This enables Metis to achieve fine-grained distribution of training workloads across heterogeneous GPUs, improving resource efficiency. However, a search space of this complexity would be prohibitively expensive to navigate.

To address this issue, Metis develops a new search algorithm that efficiently prunes large search spaces and balances loads with heterogeneity-awareness, while preferring data parallelism over tensor parallelism within a pipeline stage to take advantage of its superior computation and communication trade-offs. Our evaluation with three large models (GPT-3, MoE, and Wide-ResNet) on combinations of three types of GPUs demonstrates that Metis finds better parallelism plans than traditional methods, with 1.05–8.43× training speed-up, while requiring less profiling and search time. Compared to the oracle planning that delivers the fastest parallel training, Metis finds near-optimal solutions while reducing profiling and search overheads by orders of magnitude.

FwdLLM: Efficient Federated Finetuning of Large Language Models with Perturbed Inferences

Mengwei Xu, Dongqi Cai, Yaozong Wu, Xiang Li, and Shangguang Wang, Beijing University of Posts and Telecommunications (BUPT)

Large Language Models (LLMs) are transforming the landscape of mobile intelligence. Federated Learning (FL), a method to preserve user data privacy, is often employed in fine-tuning LLMs for downstream mobile tasks, i.e., FedLLM. A vital challenge of FedLLM is the tension between LLM complexity and the resource constraints of mobile devices.

In response to this challenge, this work introduces FwdFL, an innovative FL protocol designed to enhance FedLLM efficiency. The key idea of FwdFL is to employ backpropagation (BP)-free training methods, requiring devices only to execute "perturbed inferences". Consequently, FwdFL delivers substantially better memory and time efficiency (expedited by mobile NPUs and an expanded array of participant devices). FwdFL centers around three key designs: (1) it combines BP-free training with parameter-efficient training methods, an essential way to scale the approach to the LLM era; (2) it systematically and adaptively allocates computational loads across devices, striking a careful balance between convergence speed and accuracy; (3) it discriminatively samples perturbed predictions that are more valuable to model convergence. Comprehensive experiments illustrate FwdFL's significant advantages over conventional methods, including up to three orders of magnitude faster convergence and a 4.6× reduction in memory footprint. Uniquely, FwdFL paves the way for federated billion-parameter LLMs such as LLaMA on COTS mobile devices, a feat previously unattained.
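
As a rough intuition for BP-free training with "perturbed inferences", the toy below estimates a gradient from two inference-only evaluations along a random perturbation, in the style of zeroth-order/forward-gradient methods. It is a minimal sketch under that assumption, not the authors' implementation; the quadratic loss and all parameters are stand-ins for an on-device forward pass.

```python
# Toy zeroth-order ("perturbed inference") training sketch; not the paper's code.
import random

def loss(w):
    # Hypothetical quadratic loss standing in for an on-device forward pass.
    return sum((wi - ti) ** 2 for wi, ti in zip(w, [1.0, -2.0, 0.5]))

def perturbed_gradient(w, eps=1e-3):
    """Estimate the gradient with two inference-only (BP-free) evaluations."""
    v = [random.gauss(0.0, 1.0) for _ in w]              # random perturbation
    l_plus = loss([wi + eps * vi for wi, vi in zip(w, v)])
    l_minus = loss([wi - eps * vi for wi, vi in zip(w, v)])
    scale = (l_plus - l_minus) / (2.0 * eps)
    return [scale * vi for vi in v]                       # directional estimate

w = [0.0, 0.0, 0.0]
for step in range(2000):
    g = perturbed_gradient(w)
    w = [wi - 0.01 * gi for wi, gi in zip(w, g)]
print(w)  # approaches [1.0, -2.0, 0.5] on average
```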

Track 2

Security 1

Grand Ballroom EF

A Secure, Fast, and Resource-Efficient Serverless Platform with Function REWIND

Jaehyun Song and Bumsuk Kim, Sungkyunkwan University; Minwoo Kwak, Yonsei University; Byoungyoung Lee, Seoul National University; Euiseong Seo, Sungkyunkwan University; Jinkyu Jeong, Yonsei University

Serverless computing often utilizes the warm container technique to improve response times. However, this method, which allows the reuse of function containers across different function requests of the same type, creates persistent vulnerabilities in memory and file systems. These vulnerabilities can lead to security breaches such as data leaks. Traditional approaches to address these issues often suffer from performance drawbacks and high memory requirements due to extensive use of user-level snapshots and complex restoration processes.

The paper introduces REWIND, an innovative and efficient serverless function execution platform designed to address these security and efficiency concerns. REWIND ensures that after each function request, the container is reset to an initial state free from any sensitive data, including a thorough restoration of the file system to prevent data leakage. It incorporates a kernel-level memory snapshot management system, which significantly lowers memory usage and accelerates the rewind process. Additionally, REWIND optimizes runtime performance by reusing memory regions and leveraging the temporal locality of function executions, enhancing performance while maintaining strict data isolation between requests. The REWIND prototype is implemented on OpenWhisk and Linux and evaluated with serverless benchmark workloads. The evaluation results demonstrate that REWIND provides substantial memory savings while delivering high function execution performance. In particular, the low memory usage allows more warm containers to be kept alive, improving both the throughput and the latency of function execution while preserving isolation between function requests.

SimEnc: A High-Performance Similarity-Preserving Encryption Approach for Deduplication of Encrypted Docker Images

Tong Sun and Bowen Jiang, Zhejiang University; Borui Li, Southeast University; Jiamei Lv, Yi Gao, and Wei Dong, Zhejiang University

Encrypted Docker images are becoming increasingly popular in Docker registries for privacy. As the Docker registry is tasked with managing an increasing number of images, it becomes essential to implement deduplication to conserve storage space. However, deduplication for encrypted images is difficult because deduplication exploits identical content, while encryption tries to make all contents look random. Existing state-of-the-art works try to decompress images and perform message-locked encryption (MLE) to deduplicate encrypted images. Unfortunately, our measurements uncover two limitations in current works: (i) even minor modifications to the image content can hinder MLE deduplication, (ii) decompressing image layers would increase the size of the storage for duplicate data, and significantly compromise user pull latency and deduplication throughput.

In this paper, we propose SimEnc, a high-performance similarity-preserving encryption approach for deduplication of encrypted Docker images. SimEnc is the first work that integrates the semantic hash technique into MLE to extract semantic information among layers for improving the deduplication ratio. SimEnc builds on a fast similarity space selection mechanism for flexibility. Unlike existing works completely decompressing the layer, we explore a new similarity space by Huffman decoding that achieves a better deduplication ratio and performance. Experiments show that SimEnc outperforms both the state-of-the-art encrypted serverless platform and plaintext Docker registry, reducing storage consumption by up to 261.7% and 54.2%, respectively. Meanwhile, SimEnc can surpass them in terms of pull latency.
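
For context on the MLE building block the abstract refers to, the toy below sketches message-locked (convergent) encryption: the key is derived from the content itself, so identical layers encrypt identically and can be deduplicated. It is illustrative only, is not SimEnc, and its hash-based keystream is not production-grade cryptography.

```python
# Toy message-locked encryption (convergent encryption) sketch, showing why
# identical content remains deduplicable after encryption. Not SimEnc, and
# the keystream below is NOT real crypto.
import hashlib

def mle_key(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()        # key = H(content)

def keystream(key: bytes, n: int) -> bytes:
    out = bytearray()
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:n])

def mle_encrypt(data: bytes) -> bytes:
    ks = keystream(mle_key(data), len(data))
    return bytes(a ^ b for a, b in zip(data, ks))

layer_a = b"identical docker layer contents"
layer_b = b"identical docker layer contents"
assert mle_encrypt(layer_a) == mle_encrypt(layer_b)   # dedup-friendly
```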

mmTLS: Scaling the Performance of Encrypted Network Traffic Inspection

Junghan Yoon, Seoul National University; Seunghyun Do and Duckwoo Kim, KAIST; Taejoong Chung, Virginia Tech; KyoungSoo Park, Seoul National University

Modern network monitoring TLS middleboxes play a critical role in fighting abuse hidden in encrypted network traffic. Unfortunately, operating a TLS middlebox often incurs a huge computational overhead, as it must translate and relay encrypted traffic from one endpoint to the other. We observe that even a simple TLS proxy drops the throughput of end-to-end TLS sessions by 43% to 73%. Worse, recent works on security-enhanced TLS middleboxes levy an even heavier computational tax.

In this paper, we present mmTLS, a scalable TLS middlebox development framework that significantly improves traffic inspection performance and provides a TLS event programming library with which one can write a TLS middlebox with ease. mmTLS eliminates the traffic relaying cost as it operates on a single end-to-end TLS session with secure session key sharing. This approach is not only beneficial to performance but also naturally guarantees all end-to-end TLS properties except confidentiality. To detect illegal content modification, mmTLS supplements each TLS record with a private tag whose key is kept secret to the TLS endpoints. We find that the extra overhead of private tag generation and verification is minimal when augmented with the first tag generation. Our evaluation demonstrates that mmTLS outperforms the nginx TLS proxy in split-connection mode by a factor of 2.7 to 41.2, and achieves 179 Gbps of traffic relaying throughput.
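
A minimal sketch of the private-tag idea, assuming a deliberately simplified construction: an extra MAC keyed with a secret known only to the TLS endpoints lets them detect modifications by a middlebox that can otherwise read (and rewrite) records. mmTLS's actual tag construction and key handling differ; the key and record below are placeholders.

```python
# Simplified per-record "private tag" sketch; illustrates the idea only.
import hmac, hashlib

endpoint_only_key = b"shared by client and server, never given to middlebox"

def private_tag(record: bytes) -> bytes:
    return hmac.new(endpoint_only_key, record, hashlib.sha256).digest()

record = b"TLS record plaintext"
tag = private_tag(record)

# A middlebox holding the session key could read or rewrite the record...
tampered = b"TLS record plaintext (modified)"
# ...but cannot produce a valid private tag for the modified record.
assert hmac.compare_digest(private_tag(record), tag)
assert not hmac.compare_digest(private_tag(tampered), tag)
```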

12:25 pm–2:00 pm

Conference Luncheon

Santa Clara Ballroom

2:00 pm–3:40 pm

Track 1

ML-System Co-Design

Grand Ballroom CD

Pecan: Cost-Efficient ML Data Preprocessing with Automatic Transformation Ordering and Hybrid Placement

Dan Graur, Oto Mraz, Muyu Li, and Sepehr Pourghannad, ETH Zurich; Chandramohan A. Thekkath, Google; Ana Klimovic, ETH Zurich

Input data preprocessing is a common bottleneck in machine learning (ML) jobs that can significantly increase training time and cost, as expensive GPUs or TPUs idle waiting for input data. Previous work has shown that offloading data preprocessing to remote CPU servers successfully alleviates data stalls and improves training time. However, remote CPU workers in disaggregated data processing systems comprise a significant fraction of total training costs. Meanwhile, current disaggregated solutions often underutilize the CPU and DRAM resources available on ML accelerator nodes. We propose two approaches to alleviate ML input data stalls while minimizing costs. First, we dynamically schedule data preprocessing workers on ML accelerator host resources to minimize the number of remote CPU workers needed to achieve peak data ingestion bandwidth. Second, we analyze the characteristics of input pipelines and automatically reorder transformations to increase data preprocessing worker throughput. We observe that relaxing commutativity increases throughput while maintaining high model accuracy for a variety of ML data pipelines. We build Pecan, an ML data preprocessing service that automates data preprocessing worker placement and transformation reordering decisions. Pecan reduces preprocessing costs by 87% on average, and reduces total training costs by up to 60% compared to training with state-of-the-art disaggregated data preprocessing and by 55% on average compared to collocated data preprocessing.
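
The toy below illustrates the intuition behind transformation reordering with hypothetical transforms (not Pecan's actual policy): when a cheap, selective step commutes with an expensive one for training purposes, running it first cuts the number of expensive calls and raises per-worker throughput.

```python
# Toy illustration of transformation reordering (hypothetical transforms).
import random

def expensive_decode(x):      # stands in for image decode / augmentation
    return x

def cheap_subsample(items, keep=0.1):
    return [x for x in items if random.random() < keep]

items = list(range(100_000))

# Original order: decode everything, then subsample.
decoded_then_sampled = cheap_subsample([expensive_decode(x) for x in items])

# Reordered (valid when the steps commute for training purposes):
sampled_then_decoded = [expensive_decode(x) for x in cheap_subsample(items)]
# The second pipeline performs ~10x fewer expensive_decode calls.
```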

OPER: Optimality-Guided Embedding Table Parallelization for Large-scale Recommendation Model

Zheng Wang, University of California, San Diego; Yuke Wang, Boyuan Feng, and Guyue Huang, University of California, Santa Barbara; Dheevatsa Mudigere and Bharath Muthiah, Meta; Ang Li, Pacific Northwest National Laboratory; Yufei Ding, University of California, San Diego

The deployment of Deep Learning Recommendation Models (DLRMs) involves the parallelization of extra-large embedding tables (EMTs) on multiple GPUs. Existing works overlook the input-dependent behavior of EMTs and parallelize them in a coarse-grained manner, resulting in unbalanced workload distribution and excessive inter-GPU communication.

To this end, we propose OPER, an algorithm-system co-design with OPtimality-guided Embedding table parallelization for large-scale Recommendation model training and inference. The core idea of OPER is to explore the connection between DLRM inputs and the efficiency of distributed EMTs, aiming to provide a near-optimal parallelization strategy for EMTs. Specifically, we conduct an in-depth analysis of various types of EMTs parallelism and propose a heuristic search algorithm to efficiently approximate an empirically near-optimal EMT parallelization. Furthermore, we implement a distributed shared memory-based system, which supports the lightweight but complex computation and communication pattern of fine-grained EMT parallelization, effectively converting theoretical improvements into real speedups. Extensive evaluation shows that OPER achieves 2.3× and 4.0× speedup on average in training and inference, respectively, over state-of-the-art DLRM frameworks.

MAGPY: Compiling Eager Mode DNN Programs by Monitoring Execution States

Chen Zhang, Rongchao Dong, Haojie Wang, Runxin Zhong, Jike Chen, and Jidong Zhai, Tsinghua University

Real-world deep learning programs are often developed with dynamic programming languages like Python, which usually have complex features, such as built-in functions and dynamic typing. These programs typically execute in eager mode, where tensor operators run without compilation, resulting in poor performance. Conversely, deep learning compilers rely on operator-based computation graphs to optimize program execution. However, complexities in dynamic languages often prevent the conversion of these programs into complete operator graphs, leading to sub-optimal performance.

To address this challenge, we introduce MAGPY to optimize the generation of operator graphs from deep learning programs. MAGPY generates more complete operator graphs by collecting key runtime information through monitoring program execution. MAGPY provides a reference graph to record program execution states and leverages reference relationships to identify state changes that can impact program outputs. This approach significantly reduces analysis complexity, leading to more complete operator graphs. Experimental results demonstrate that MAGPY accelerates complex deep learning programs by up to 2.88× (1.55× on average), and successfully instantiates 93.40% of 1191 real user programs into complete operator graphs.

Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs

Haojun Xia, University of Sydney; Zhen Zheng and Xiaoxia Wu, Microsoft; Shiyang Chen, Rutgers University; Zhewei Yao, Stephen Youn, Arash Bakhtiari, and Michael Wyatt, Microsoft; Donglin Zhuang and Zhongzhu Zhou, University of Sydney; Olatunji Ruwase, Yuxiong He, and Shuaiwen Leon Song, Microsoft

Six-bit quantization (FP6) can effectively reduce the size of large language models (LLMs) and preserve the model quality consistently across varied applications. However, existing systems do not provide Tensor Core support for FP6 quantization and struggle to achieve practical performance improvements during LLM inference. It is challenging to support FP6 quantization on GPUs due to (1) unfriendly memory access of model weights with non-power-of-two bit-width and (2) high runtime overhead of weight de-quantization. To address these problems, we propose TC-FPx, the first full-stack GPU kernel design scheme with unified Tensor Core support of 6-bit and arbitrary bit-width quantization (5-bit, etc.). We integrate TC-FPx kernel into an existing inference system, providing new end-to-end support (called Quant-LLM) for quantized LLM inference, where better trade-offs between inference cost and model quality are achieved with 6-bit quantization. Experiments show that Quant-LLM enables the inference of LLaMA-70b using only a single GPU, achieving 1.69×-2.65× higher normalized inference throughput than the FP16 baseline. The source code is publicly available at https://github.com/usyd-fsalab/fp6_llm.

Track 2

Networks 2

Grand Ballroom EF

QDSR: Accelerating Layer-7 Load Balancing by Direct Server Return with QUIC

Ziqi Wei, Tsinghua Shenzhen International Graduate School and Peng Cheng Laboratory; Zhiqiang Wang, Tencent and Peng Cheng Laboratory; Qing Li, Peng Cheng Laboratory; Yuan Yang, Tsinghua University; Cheng Luo and Fuyu Wang, Tencent; Yong Jiang, Tsinghua Shenzhen International Graduate School and Peng Cheng Laboratory; Sijie Yang, Tencent; Zhenhui Yuan, Northumbria University

Layer-7 (L7) load balancing is a crucial capability for cloud service providers to maintain stable and reliable services. However, the high flexibility of L7 load balancers (LBs) and the growing volume of downlink relaying traffic result in a heavy workload, which significantly increases the cost for cloud service providers and reduces end-to-end service quality. We propose QDSR, a new L7 load balancing scheme that uses QUIC and Direct Server Return (DSR) technology. QDSR divides each QUIC connection into independent streams and distributes them to multiple real servers (RSs), enabling the real servers to send data directly to the client simultaneously. Because no redundant relaying is involved, QDSR achieves high performance and low latency, and nearly eliminates the additional downlink relaying overhead.

To evaluate the performance of QDSR, we implemented all its components using Nginx and Apache Traffic Server, deployed them in a real environment testbed, and conducted large-scale simulation experiments using mahimahi. The experimental results show that QDSR can process an additional 4.8%-18.5% of client requests compared to traditional L7 proxy-based load balancing schemes. It can achieve a maximum throughput that is 12.2 times higher in high-load scenarios and significantly reduce end-to-end latency and first packet latency.

Evaluating Chiplet-based Large-Scale Interconnection Networks via Cycle-Accurate Packet-Parallel Simulation

Yinxiao Feng and Yuchen Wei, Institute for Interdisciplinary Information Sciences, Tsinghua University; Dong Xiang, School of Software, Tsinghua University; Kaisheng Ma, Institute for Interdisciplinary Information Sciences, Tsinghua University

The Chiplet architecture has achieved great success in recent years. However, chiplet-based networks are significantly different from traditional networks, thus presenting new challenges in evaluation. On the one hand, on-chiplet and off-chiplet networks are tightly coupled; therefore, the entire heterogeneous network must be designed and evaluated jointly rather than separately. On the other hand, existing network simulators cannot efficiently evaluate large-scale chiplet-based networks with cycle-accurate accuracy.

In this paper, we present the design and implementation of the Chiplet Network Simulator (CNSim), a cycle-accurate packet-parallel simulator supporting efficient simulation of large-scale chiplet-based (shared-memory) networks. In CNSim, a packet-centric simulation architecture and an atomics-based hyper-threading mechanism are adopted, accelerating simulation speed by 11×–14× compared with existing cycle-accurate simulators. In addition, we implement a heterogeneous router/link microarchitecture and many other features, including hierarchical topologies, adaptive routing, and integration of real workload traces. Based on CNSim, two typical chiplet-based networks, which cannot be efficiently simulated by existing simulators, are systematically evaluated. The advantages and limitations of chiplet-based networks are revealed through systematic cycle-accurate simulations. The simulator and evaluation framework are open-sourced to the community.

Config-Snob: Tuning for the Best Configurations of Networking Protocol Stack

Manaf Bin-Yahya, Yifei Zhao, and Hossein Shafieirad, Huawei Technologies Canada; Anthony Ho, Huawei Technologies Canada and University of Waterloo; Shijun Yin and Fanzhao Wang, Huawei Technologies China; Geng Li, Huawei Technologies Canada

Web servers usually use predefined configurations, yet empirical studies have shown that performance can be significantly improved when the configurations of the networking protocol stack (e.g., TCP, QUIC, and congestion control parameters) are carefully tuned due to the fact that a “one-size-fits-all” strategy does not exist. However, dynamically tuning the protocol stack's configurations is challenging: first, the configuration space is ample, and parameters with complex dependencies must be tuned jointly; second, the network condition space is also large, so an adaptive solution is needed to handle clients' diversity and network dynamics; and finally, clients endure unsatisfactory performance degradation due to learning exploration. To this end, we propose Config-Snob, a protocol tuning solution that selects the best configurations based on historical data. Config-Snob exploits the configuration space by tuning several configuration knobs and provides a practical fine-grained client grouping while handling the network environment dynamics. Config-Snob uses a controlled exploration approach to minimize the performance degradation. Config-Snob utilizes causal inference (CI) algorithms to boost the tuning optimization. Config-Snob is implemented in a QUIC-based server and deployed in a large-scale production environment. Our extensive experiments show that the proposed solution improves the completion time over the default configurations by 15% to 36% (mean) and 62% to 70% (median) in the real deployment.

Conspirator: SmartNIC-Aided Control Plane for Distributed ML Workloads

Yunming Xiao, Northwestern University; Diman Zad Tootaghaj, Aditya Dhakal, Lianjie Cao, and Puneet Sharma, Hewlett Packard Labs; Aleksandar Kuzmanovic, Northwestern University

Modern machine learning (ML) workloads heavily depend on distributing tasks across clusters of server CPUs and specialized accelerators, such as GPUs and TPUs, to achieve optimal performance. Nonetheless, prior research has highlighted the inefficient utilization of computing resources in distributed ML, leading to suboptimal performance. This inefficiency primarily stems from CPU bottlenecks and suboptimal accelerator scheduling. Although numerous proposals have been put forward to address these issues individually, none have effectively tackled both inefficiencies simultaneously. In this paper, we introduce Conspirator, an innovative control plane design aimed at alleviating both bottlenecks by harnessing the enhanced computing capabilities of SmartNICs. Following the evolving role of SmartNICs, which have transitioned from their initial function of offloading standard networking tasks to serving as programmable connectors between disaggregated computing resources, Conspirator facilitates efficient data transfer without the involvement of host CPUs and hence circumvents the potential bottlenecks there. Conspirator further integrates a novel scheduling algorithm that takes the heterogeneity of accelerators into consideration and adapts to changing workload dynamics, providing the flexibility to mitigate the second bottleneck. Our evaluation demonstrates that Conspirator may provide a 15% end-to-end completion time reduction compared to RDMA-based alternatives while being 17% more cost-effective and 44% more power-efficient. Our proposed scheduler also helps to save 33% of GPU hours compared to naive GPU-sharing schedulers by making close-to-optimal decisions while taking much less time than the optimal NP-hard scheduler.

3:40 pm–4:10 pm

Break with Refreshments

Grand Ballroom Foyer

4:10 pm–5:25 pm

Track 1

Memory

Grand Ballroom CD

FBMM: Making Memory Management Extensible With Filesystems

Bijan Tabatabai, James Sorenson, and Michael M. Swift, University of Wisconsin—Madison

New memory technologies like CXL promise diverse memory configurations such as tiered memory, far memory, and processing in memory. Operating systems must be modified to support these new hardware configurations for applications to make use of them. While many parts of operating systems are extensible, memory management remains monolithic in most systems, making it cumbersome to add support for a diverse set of new memory policies and mechanisms.

Rather than creating a whole new extensible interface for memory managers, we propose to instead use the memory management callbacks provided by the Linux virtual file system (VFS) to write memory managers, called memory management filesystems (MFSs). Memory is allocated by creating and mapping a file in an MFS's mount directory and freed by deleting the file. Use of an MFS is transparent to applications. We call this system File Based Memory Management (FBMM).

Using FBMM, we created a diverse set of standalone memory managers for tiered memory, contiguous allocations, and memory bandwidth allocation, each comprising 500–1500 lines of code. Unlike current approaches that require custom kernels, with FBMM, an MFS can be compiled separately from the kernel and loaded dynamically when needed. We measured the overhead of using filesystems for memory management and found it to be less than 8% when allocating a single page, and less than 0.1% when allocating as few as 128 pages. MFSs perform competitively with kernel implementations, and sometimes better due to simpler implementations.
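
A minimal user-space sketch of the allocation model described above: memory is obtained by creating and mapping a file under an MFS mount directory and freed by deleting the file. The directory below is a temporary stand-in (real MFSs are kernel-side filesystems and applications need no modification); the sketch only shows the application-visible mechanics.

```python
# Sketch of the FBMM-style allocation flow; the directory is a stand-in for a
# hypothetical MFS mount point (e.g., a tiered-memory or contiguous-allocation
# filesystem), not an actual MFS.
import mmap, os, tempfile

mfs_dir = tempfile.mkdtemp()                # stand-in for an MFS mount directory
path = os.path.join(mfs_dir, "region0")
size = 64 * 4096                            # 64 pages

fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
os.ftruncate(fd, size)
mem = mmap.mmap(fd, size)                   # allocation: map the file
mem[:5] = b"hello"                          # use it like ordinary memory
mem.close()
os.close(fd)
os.unlink(path)                             # free: delete the file
```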

Mangosteen: Fast Transparent Durability for Linearizable Applications using NVM

Sergey Egorov, Gregory Chockler, and Brijesh Dongol, University of Surrey, UK; Dan O'Keeffe, Royal Holloway, University of London, UK; Sadegh Keshavarzi, University of Surrey, UK

The advent of byte-addressable non-volatile memory (NVM) technologies has enabled the development of low-latency high-throughput durable applications, i.e., applications that are capable of recovering from full-system crashes. However, programming such applications is error-prone as efficiency gains often require fine-grained (programmer-controlled) management of low-level persistence instructions.

We propose Mangosteen, a high-level programming framework that allows developers to transform an existing linearizable in-memory application to a corresponding durably linearizable version using NVM. Our framework’s API consists of a set of callback hooks that interpose on an application’s request processing flow with minimal developer effort. Mangosteen executes client operations on DRAM and persists their effects using binary instrumentation and redo logging. Mangosteen’s concurrency control facilitates batching of read-write requests to minimize the cost of persistence, while allowing read-only requests to execute concurrently. A novel intra-batch deduplication mechanism further reduces persistence overheads for common OLTP workloads. Our empirical evaluation results show that Mangosteen-enabled applications outperform state-of-the-art solutions across the entire spectrum of read-write ratios. In particular, the Mangosteen-based version of Redis demonstrates throughput gains of between 2×–5× in comparison to prior work.

FlexMem: Adaptive Page Profiling and Migration for Tiered Memory

Dong Xu, University of California, Merced; Junhee Ryu, Jinho Baek, and Kwangsik Shin, SK hynix; Pengfei Su and Dong Li, University of California, Merced

Tiered memory, combining multiple memory components with different performance and capacity, provides a cost-effective solution to increase memory capacity and improve memory utilization. Existing system software for managing tiered memory often has limitations: (1) rigid memory profiling methods that cannot capture emerging memory access patterns in a timely fashion or that lose profiling quality, (2) rigid page demotion (i.e., the number of pages to demote is driven by an invariant requirement on free memory space), and (3) a rigid warm page range (i.e., for emerging hot pages) that leads to unnecessary page demotion from fast to slow memory. To address these limitations, we introduce FlexMem, a page profiling and migration system for tiered memory. FlexMem combines performance counter-based and page hinting fault-based profiling methods to improve profiling quality, dynamically decides the number of pages to demote based on the need to accommodate hot pages (i.e., frequently accessed pages), and dynamically decides the warm page range based on how often the pages in the range are promoted to hot pages. We evaluate FlexMem with common memory-intensive benchmarks. Compared to the state of the art (Tiering-0.8, TPP, and MEMTIS), FlexMem improves performance by 32%, 23%, and 27% on average, respectively.
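
A toy sketch of need-driven demotion (hypothetical policy and numbers, not FlexMem's algorithm): rather than demoting down to a fixed free-memory watermark, demote only as many pages as are needed to accommodate the hot pages about to be promoted.

```python
# Toy need-driven demotion decision (hypothetical policy, not FlexMem's).
def pages_to_demote(hot_pages_to_promote: int,
                    free_fast_pages: int) -> int:
    shortfall = hot_pages_to_promote - free_fast_pages
    return max(0, shortfall)       # demote only what promotion actually needs

print(pages_to_demote(hot_pages_to_promote=1200, free_fast_pages=800))   # 400
print(pages_to_demote(hot_pages_to_promote=300,  free_fast_pages=800))   # 0
```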

Track 2

Reliability

Grand Ballroom EF

SuperBench: Improving Cloud AI Infrastructure Reliability with Proactive Validation

Yifan Xiong, Yuting Jiang, Ziyue Yang, and Lei Qu, Microsoft Research; Guoshuai Zhao, Shuguang Liu, Dong Zhong, Boris Pinzur, Jie Zhang, Yang Wang, Jithin Jose, Hossein Pourreza, Jeff Baxter, Kushal Datta, Prabhat Ram, Luke Melton, and Joe Chau, Microsoft; Peng Cheng, Yongqiang Xiong, and Lidong Zhou, Microsoft Research

Reliability in cloud AI infrastructure is crucial for cloud service providers, prompting the widespread use of hardware redundancies. However, these redundancies can inadvertently lead to hidden degradation, so-called "gray failure," for AI workloads, significantly affecting end-to-end performance and concealing performance issues, which complicates root cause analysis for failures and regressions.

We introduce SuperBench, a proactive validation system for AI infrastructure that mitigates hidden degradation caused by hardware redundancies and enhances overall reliability. SuperBench features a comprehensive benchmark suite, capable of evaluating individual hardware components and representing most real AI workloads. It comprises a Validator which learns benchmark criteria to clearly pinpoint defective components. Additionally, SuperBench incorporates a Selector to balance validation time and issue-related penalties, enabling optimal timing for validation execution with a tailored subset of benchmarks. Through testbed evaluation and simulation, we demonstrate that SuperBench can increase the mean time between incidents by up to 22.61×. SuperBench has been successfully deployed in Azure production, validating hundreds of thousands of GPUs over the last two years.
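
A toy sketch of the timing trade-off the Selector navigates, under a deliberately simplified cost model that is not SuperBench's: validation is worthwhile when the expected incident penalty it would avoid exceeds the time the nodes spend out of production. The function name and all numbers are hypothetical.

```python
# Toy validate-or-not decision under a simplified expected-cost model.
def should_validate(p_defect: float,
                    incident_penalty_hours: float,
                    validation_hours: float) -> bool:
    expected_benefit = p_defect * incident_penalty_hours
    return expected_benefit > validation_hours

print(should_validate(p_defect=0.02,  incident_penalty_hours=400, validation_hours=2))  # True
print(should_validate(p_defect=0.001, incident_penalty_hours=400, validation_hours=2))  # False
```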

Removing Obstacles before Breaking Through the Memory Wall: A Close Look at HBM Errors in the Field

Ronglong Wu, Shuyue Zhou, Jiahao Lu, Zhirong Shen, and Zikang Xu, Xiamen University; Jiwu Shu, Xiamen University and Minjiang University; Kunlin Yang and Feilong Lin, Huawei Technologies Co., Ltd; Yiming Zhang, Xiamen University

High-bandwidth memory (HBM) is regarded as a promising technology for fundamentally overcoming the memory wall. It stacks up multiple DRAM dies vertically to dramatically improve the memory access bandwidth. However, this architecture also comes with more severe reliability issues, since HBM not only inherits error patterns of the conventional DRAM, but also introduces new error causes.

In this paper, we conduct the first systematic study of HBM errors, covering over 460 million error events collected from nineteen data centers and spanning over two years of deployment under a variety of services. Through error analyses and methodology validations, we confirm that HBM exhibits different error patterns from conventional DRAM in terms of spatial locality, temporal correlation, and sensor metrics, which makes empirical DRAM error prediction models ineffective for HBM. Based on our findings, we design and implement Calchas, a hierarchical failure prediction framework for HBM that integrates spatial, temporal, and sensor information from various device levels to predict upcoming failures. The results demonstrate the feasibility of failure prediction across hierarchical levels.

MSFRD: Mutation Similarity based SSD Failure Rating and Diagnosis for Complex and Volatile Production Environments

Yuqi Zhang, Tianyi Zhang, Wenwen Hao, Shuyang Wang, Na Liu, and Xing He, Samsung R&D Institute China Xi'an, Samsung Electronics; Yang Zhang, Weixin Wang, Yongguang Cheng, Huan Wang, Jie Xu, Feng Wang, and Bo Jiang, ByteDance Inc.; Yongwong Gwon, Jongsung Na, Zoe Kim, and Geunrok Oh, Samsung Electronics

SSD failures have an increasing impact on storage reliability and performance in data centers. Some manufacturers have customized fine-grained Telemetry attributes to analyze and identify SSD failures. Based on Telemetry data, this paper proposes the mutation similarity based failure rating and diagnosis (MSFRD) scheme to predict failures in the dynamic environment of data centers and improve failure handling efficiency. MSFRD dynamically detects the internal mutations of SSDs in real time and measures their similarity to the mutations of historical failed SSDs and healthy SSDs for failure prediction and early rating. Based on the rating, unavailable SSDs with serious failures are handled immediately, while available SSDs with less serious failures are continuously tracked and diagnosed. MSFRD is evaluated on real Telemetry datasets collected from large-scale SSDs in data centers. Compared with existing schemes, MSFRD improves precision by 23.8% and recall by 38.9% on average for failure prediction. The results also show the effectiveness of MSFRD for failure rating and progressive diagnosis.

6:00 pm–7:30 pm

USENIX ATC '24 Poster Session and Reception

Santa Clara Ballroom

Friday, July 12

8:00 am–9:00 am

Continental Breakfast

Grand Ballroom Foyer

9:00 am–10:15 am

Track 1

Deployed Systems

Grand Ballroom CD

Diagnosing Application-network Anomalies for Millions of IPs in Production Clouds

Zhe Wang, Shanghai Jiao Tong University; Huanwu Hu, Alibaba Cloud; Linghe Kong, Shanghai Jiao Tong University; Xinlei Kang and Teng Ma, Alibaba Cloud; Qiao Xiang, Xiamen University; Jingxuan Li and Yang Lu, Alibaba Cloud; Zhuo Song, Shanghai Jiao Tong University and Alibaba Cloud; Peihao Yang, Alibaba Cloud; Jiejian Wu, Shanghai Jiao Tong University; Yong Yang and Tao Ma, Alibaba Cloud; Zheng Liu, Alibaba Cloud and Zhejiang University; Xianlong Zeng and Dennis Cai, Alibaba Cloud; Guihai Chen, Shanghai Jiao Tong University

Timely detection and diagnosis of application-network anomalies is a key challenge of operating large-scale production clouds. We reveal three practical issues in the cloud-native era. First, impact assessment of anomalies at a (micro)service level is absent in currently deployed monitoring systems. Ping systems are oblivious to the "actual weights" of application traffic, e.g., traffic volume and the number of connections/instances. Failures of critical (micro)services with large weights can easily be overlooked by probing systems under prevalent network jitters. Second, the efficiency of anomaly routing (to a blamed application/network team) is still low when multiple attribution teams are involved. Third, collecting fine-grained metrics at a (micro)service level incurs considerable computational/storage overheads, yet it is indispensable for accurate impact assessment and anomaly routing.

We introduce the application-network diagnosing (AND) system in Alibaba cloud. AND exploits the single metric of TCP retransmission (retxs) to capture anomalies at (micro)service levels and correlates applications with networks end-to-end. To resolve deployment challenges, AND further proposes three core designs: (1) a collecting tool to perform filtering/statistics on massive retxs at the (micro)service level, (2) a real-time detection procedure to extract anomalies from ‘noisy’ retxs with millions of time series, (3) an anomaly routing model to delimit anomalies among multiple target teams/scenarios. AND has been deployed in Alibaba cloud for over three years and enables minute-level anomaly detection/routing and fast failure recovery.
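
As a rough illustration of extracting anomalies from noisy retransmission time series, the toy below flags points that deviate sharply from an exponentially weighted baseline. The thresholds are hypothetical, and AND's production pipeline additionally performs (micro)service-level filtering and anomaly routing.

```python
# Toy retx spike detector (hypothetical thresholds, not AND's pipeline).
def ewma_anomalies(retx_counts, alpha=0.2, k=4.0):
    mean, var, flags = retx_counts[0], 0.0, []
    for x in retx_counts:
        std = var ** 0.5
        flags.append(std > 0 and abs(x - mean) > k * std)   # spike vs. baseline
        mean = alpha * x + (1 - alpha) * mean                # update baseline
        var = alpha * (x - mean) ** 2 + (1 - alpha) * var
    return flags

series = [3, 4, 2, 5, 3, 4, 3, 90, 4, 3]   # a retx spike at index 7
print(ewma_anomalies(series))
```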

Data Caching for Enterprise-Grade Petabyte-Scale OLAP

Chunxu Tang and Bin Fan, Alluxio; Jing Zhao and Chen Liang, Uber, Inc; Yi Wang and Beinan Wang, Alluxio; Ziyue Qiu, Carnegie Mellon University and Uber, Inc.; Lu Qiu, Bowen Ding, Shouzhuo Sun, Saiguang Che, Jiaming Mai, Shouwei Chen, Yu Zhu, and Jianjian Xie, Alluxio; Yutian (James) Sun, Meta, Inc.; Yao Li and Yangjun Zhang, Uber, Inc.; Ke Wang, Meta, Inc.; Mingmin Chen, Uber, Inc.

With the exponential growth of data and evolving use cases, petabyte-scale OLAP data platforms are increasingly adopting a model that decouples compute from storage. This shift, evident in organizations like Uber and Meta, introduces operational challenges including massive, read-heavy I/O traffic with potential throttling, as well as skewed and fragmented data access patterns. Addressing these challenges, this paper introduces the Alluxio local (edge) cache, a highly effective architectural optimization tailored for such environments. This embeddable cache, optimized for petabyte-scale data analytics, leverages local SSD resources to alleviate network I/O and API call pressures, significantly improving data transfer efficiency. Integrated with OLAP systems like Presto and storage services like HDFS, the Alluxio local cache has demonstrated its effectiveness in handling large-scale, enterprise-grade workloads over three years of deployment at Uber and Meta. We share insights and operational experiences in implementing these optimizations, providing valuable perspectives on managing modern, massive-scale OLAP workloads.

Full Lifecycle Data Analysis on a Large-scale and Leadership Supercomputer: What Can We Learn from It?

Bin Yang, Tsinghua University and National Supercomputer Center in Wuxi; Hao Wei, Tsinghua University; Wenhao Zhu, Shandong University and National Supercomputer Center in Wuxi; Yuhao Zhang, Tsinghua University; Weiguo Liu, Shandong University; Wei Xue, Tsinghua University, Qinghai University and Intelligent Computing and Application Laboratory of Qinghai Province, and National Supercomputer Center in Wuxi

The system architecture of contemporary supercomputers is growing increasingly intricate with the ongoing evolution of system-wide network and storage technologies, making it challenging for application developers and system administrators to manage and utilize the escalating complexity of supercomputers effectively. Moreover, the limited experience of application developers and system administrators in conducting insightful analyses of diverse High-Performance Computing (HPC) workloads and the resulting array of resource utilization characteristics exacerbates the challenge. To address this issue, we undertake a comprehensive analysis of six years of data (40 TB, comprising I/O performance data and job running information) from Sunway TaihuLight, which has 41,508 nodes and is currently ranked as the world's 11th-fastest supercomputer. Our study provides valuable insights into operational management strategies for HPC systems (e.g., job hanging caused by heavy-load benchmark testing, job starvation caused by aggressive scheduling policies) and I/O workload characteristics (e.g., getattr operation spikes caused by massive access to grid files, and a large number of files accessed by many applications in a short period), shedding light on both challenges and opportunities for improvement in the HPC environment. This paper delineates our methodology, findings, and the significance of this study. Additionally, we discuss the potential of our research for future studies and practice within this domain.

Track 2

Wide Area Network

Grand Ballroom EF

Panorama: Optimizing Internet-scale Users’ Routes from End to End

Geng Li, Shuihai Hu, and Kun Tan, Huawei Technologies

Network performance is critical to the user experience of many real-time interactive applications, such as video conferencing and live streaming. Empirical studies show that transport latency above 300 ms becomes unacceptable, leading to a significant decline in user satisfaction. Unfortunately, due to the best-effort nature of the Internet, such strict performance requirements can hardly be fully met. Despite continuous efforts to improve the performance of the Internet (e.g., overlay routing optimization, traffic engineering, and content delivery networks), we are still far from delivering satisfying network performance for these applications. The stringent network requirements, the world-wide cross-continental network transfers, and the large scale of Internet-wide users together make it a complex challenge to deliver an ideal user experience for emerging real-time interactive applications.

In this paper, we present Panorama, a scalable system for delivering desired user experience to real-time interactive applications over a globally distributed overlay network. To achieve ideal user experience, Panorama takes a centralized approach to do global end-to-end traffic engineering optimization, and overcomes the scalability issue by intelligent measurement-based user grouping and scalable, parallelizable route computation. Panorama has been deployed in a large global real-time overlay network since 2021. We evaluate Panorama based on 81 million selected real-world traces in deployment environment with clients across 66 countries. The extensive evaluation demonstrates that Panorama can support a routing service for millions of users, while providing latency lower than 200ms for 96.34% of the communication sessions, and improving SLA satisfaction by up to 88.0%.

Enhancing Resource Management of the World's Largest PCDN System for On-Demand Video Streaming

Rui-Xiao Zhang, UIUC; Haiping Wang, Shu Shi, Xiaofei Pang, Yajie Peng, and Zhichen Xue, ByteDance; Jiangchuan Liu, Simon Fraser University

The rapid growth of video services has led to a significant demand for efficient content delivery. Traditional approaches mainly rely on Content Delivery Networks (CDNs), which unfortunately incur significant bandwidth costs for video providers. To resolve this problem, cost-efficient edge resources have emerged as a new solution to replace CDNs. However, their heterogeneous hardware and poor performance still present challenges to their effective utilization. In this paper, we present how ByteDance explores the use of these cost-efficient but less performant resources. Specifically, we first present an extensive overview of PCDN, ByteDance's alternative delivery network to CDNs. Second, as PCDN encountered significant resource imbalances after years of deployment, we further introduce PCDN+, the enhanced iteration of PCDN. Specifically, by integrating a well-designed centralized/decentralized framework, we evolve the previously "static" and "uncontrolled" PCDN into a "dynamic" and "controlled" system. Extensive A/B tests and real-world deployment have demonstrated that PCDN+ 1) effectively alleviates overloading issues, 2) significantly improves the utilization of low-cost resources, and 3) provides higher service speed.

TileClipper: Lightweight Selection of Regions of Interest from Videos for Traffic Surveillance

Shubham Chaudhary and Aryan Taneja, IIIT Delhi; Anjali Singh, Indira Gandhi Delhi Technology University for Women; Purbasha Roy, Sohum Sikdar, Mukulika Maity, and Arani Bhattacharya, IIIT Delhi

With traffic surveillance increasingly deployed, thousands of cameras on roads send video feeds to cloud servers to run computer vision algorithms, requiring high bandwidth. State-of-the-art techniques reduce the bandwidth requirement by either sending a limited number of frames/pixels/regions or relying on re-encoding the important parts of the video. This imposes significant overhead on both the camera-side and server-side compute, as re-encoding is expensive. In this work, we propose TILECLIPPER, a system that utilizes tile sampling, where a limited number of rectangular areas within the frames, known as tiles, are sent to the server. TILECLIPPER selects the tiles adaptively by utilizing their correlation with the tile bitrates. We evaluate TILECLIPPER on different datasets with 55 videos in total to show that, on average, our technique reduces the data sent to the cloud by ≈ 22% while providing a detection accuracy of 92% with minimal calibration and compute compared to prior works. We show real-time tile filtering of TILECLIPPER even on cheap edge devices like the Raspberry Pi 4 and NVIDIA Jetson Nano. We further create a live deployment of TILECLIPPER to show that it provides over 87% detection accuracy and over 55% bandwidth savings.
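
A toy sketch of bitrate-driven tile selection, assuming a simple thresholding rule rather than TileClipper's calibrated selector: tiles whose encoded bitrate rises well above their per-tile baseline likely contain activity and are the ones worth sending. The numbers and function name are hypothetical.

```python
# Toy bitrate-driven tile selection (hypothetical thresholds, not TileClipper's).
def select_tiles(tile_bitrates, baselines, ratio=1.5):
    """Return indices of tiles whose current bitrate exceeds ratio x baseline."""
    return [i for i, (b, base) in enumerate(zip(tile_bitrates, baselines))
            if b > ratio * base]

baselines     = [12.0, 10.0, 11.0, 9.0]     # long-run per-tile kbps (static scene)
tile_bitrates = [13.0, 35.0, 10.5, 31.0]    # current frame group
print(select_tiles(tile_bitrates, baselines))   # [1, 3]: tiles with motion
```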

10:15 am–10:50 am

Break with Refreshments

Grand Ballroom Foyer

10:50 am–12:05 pm

Track 1

Virtualization

Grand Ballroom CD

Expeditious High-Concurrency MicroVM SnapStart in Persistent Memory with an Augmented Hypervisor

Xingguo Pang, Yanze Zhang, and Liu Liu, University of Macau; Dazhao Cheng, WuHan University; Chengzhong Xu and Xiaobo Zhou, University of Macau

The industry has embraced snapshotting to tackle cold starts and efficiently manage numerous short-lived functions for microservice-native architectures, serverless computing, and machine learning inference. A cutting-edge research approach, FaaSnap, while innovative in reducing page faults during on-demand paging by prefetching profiled working-set pages into DRAM, incurs high caching overheads and I/O demands, potentially degrading system efficiency and limiting concurrent MicroVM executions.

This paper introduces PASS, a system leveraging byte-addressable persistent memory (PMEM) for cost-effective and highly concurrent MicroVM SnapStart execution. PASS, functioning as a PMEM-aware augmented hypervisor in the user space, revolutionizes MicroVM memory restoration. It constructs complete address indexing of the guest memory mapped to single-tier PMEM space, enabling zero-copy on-demand paging by exploiting PMEM's direct access feature. This approach bypasses the cache layer and maintains guest OS transparency, avoiding invasive modifications. Experimental results, derived from real-world applications, reveal that PASS substantially decreases SnapStart execution time, achieving up to 72% reduction compared to the Firecracker hypervisor on the PMEM filesystem, and 47% reduction compared to FaaSnap. Moreover, PASS achieves double the maximum concurrency compared to both Firecracker and FaaSnap. It improves the cost-effectiveness by 2.2x and 1.6x over the Firecracker and FaaSnap, respectively.

Taming Hot Bloat Under Virtualization with HUGESCOPE

Chuandong Li, National Key Laboratory for Multimedia Information Processing, School of CS, Peking University and Zhongguancun Laboratory; Sai Sha, National Key Laboratory for Multimedia Information Processing, School of CS, Peking University and Beijing Huawei Digital Technologies; Yangqing Zeng and Xiran Yang, National Key Laboratory for Multimedia Information Processing, School of CS, Peking University; Yingwei Luo and Xiaolin Wang, National Key Laboratory for Multimedia Information Processing, School of CS, Peking University and Zhongguancun Laboratory; Zhenlin Wang, Michigan Tech; Diyu Zhou, National Key Laboratory for Multimedia Information Processing, School of CS, Peking University and EPFL

Huge pages are effective in reducing the address translation overhead under virtualization. However, huge pages suffer from the hot bloat problem, where accesses to a huge page are skewed towards a few base pages (i.e., 4KB pages), making the hypervisor (mistakenly) classify the whole huge page as hot. Hot bloat renders several critical techniques used in virtualization ineffective, including tiered memory and page sharing. Prior work addressing hot bloat either requires hardware modification or targets a specific scenario and is not applicable to a hypervisor.

This paper presents HugeScope, a lightweight, effective, and generic system that addresses the hot bloat problem under virtualization on commodity hardware. HugeScope includes an efficient and precise page tracking mechanism that leverages the additional level of indirect memory translation performed by the hypervisor. HugeScope provides a generic framework to support page splitting and coalescing policies, considering memory pressure as well as the recency, frequency, and skewness of page accesses. Moreover, HugeScope is general and modular, and can thus be easily applied to various scenarios concerning hot bloat, including tiered memory management (HS-TMM) and page sharing (HS-Share). Evaluation shows that HugeScope incurs less than 4% overhead; by addressing hot bloat, HS-TMM improves performance by up to 61% over vTMM, and HS-Share saves 41% more memory than Ingens while offering comparable performance.
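
A toy sketch of a skew-based split decision (hypothetical thresholds, not HugeScope's policy): if a handful of base pages absorb most of a 2MB huge page's accesses, the page is hot-bloated and is a candidate for splitting.

```python
# Toy huge-page split decision based on access skew (hypothetical thresholds).
def should_split(base_page_hits, top_n=8, skew_threshold=0.9):
    total = sum(base_page_hits)
    if total == 0:
        return False
    top = sum(sorted(base_page_hits, reverse=True)[:top_n])
    return top / total >= skew_threshold      # few base pages absorb most hits

hits = [0] * 512                               # a 2MB huge page has 512 base pages
hits[5], hits[130], hits[7] = 4000, 3000, 100  # accesses concentrate on a few
print(should_split(hits))                      # True: the huge page is hot-bloated
```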

CrossMapping: Harmonizing Memory Consistency in Cross-ISA Binary Translation

Chen Gao and Xiangwei Meng, Lanzhou University; Wei Li, Tsinghua University; Jinhui Lai, Lanzhou University; Yiran Zhang, Beijing University of Posts and Telecommunications; Fengyuan Ren, Lanzhou University and Tsinghua University

The increasing prevalence of new Instruction Set Architectures (ISAs) necessitates the migration of closed-source binary programs across ISAs. Dynamic Binary Translation (DBT) stands out as a crucial technology for the cross-ISA emulation of binary programs. However, due to the mismatch in memory consistency between guest ISA and host ISA, DBT systems face substantial challenges in guaranteeing correctness and translation performance for concurrent programs. Despite several attempts to bridge the memory inconsistency between guest and host ISA, prior work is either not universal for cross-ISA DBT systems or inefficient and even error-prone in translation.

This work presents CrossMapping, a general primitive mapping framework to enhance existing DBT systems for cross-ISA translation. By harmonizing memory consistency across diverse ISAs, CrossMapping enables smooth cross-ISA translation and accomplishes correct emulation. CrossMapping introduces specification tables to describe memory models in a unified and precise format, which facilitates the derivation of concurrent primitive mapping schemes based on a convenient comparison and analysis of memory models. The correctness of cross-ISA emulation is guaranteed by harmoniously integrating the derived mapping schemes with existing DBT systems. We evaluate CrossMapping for x86, ARMv8, and RISC-V on top of QEMU using the PARSEC benchmark suite. The results show that the average performance improvement can reach 8.5% when emulating x86 on ARMv8 and 7.3% when emulating x86 on RISC-V.

Track 2

Security 2

Grand Ballroom EF

Efficient Decentralized Federated Singular Vector Decomposition

Di Chai, Junxue Zhang, Liu Yang, and Yilun Jin, Hong Kong University of Science and Technology; Leye Wang, Peking University; Kai Chen, Hong Kong University of Science and Technology; Qiang Yang, Hong Kong University of Science and Technology and Webank

Federated singular value decomposition (SVD) is a foundation for many real-world distributed applications. Existing federated SVD studies either require external servers which downgrade privacy protection or leverage homomorphic encryption (HE) to get rid of external servers (e.g., being decentralized) but suffer from significant inefficiencies caused by extensive computational and communication overhead.

This paper presents Excalibur, an efficient decentralized federated SVD system. At its core, Excalibur proposes a lightweight matrix protection method that reduces the computational degradation caused by cryptographic operations, improving computation performance. Furthermore, it designs a communication-efficient decentralized SVD workflow based on a quantitative analysis of the design space, optimizing communication performance. To validate the efficiency of Excalibur, we implement a fully functional Excalibur system and evaluate it with real-world applications. Our results show that Excalibur not only removes the external servers but also achieves 3.1×–6.0× faster performance than the state-of-the-art (SOTA) server-aided method on different shapes of billion-scale data. In addition, Excalibur exhibits >23,000× larger throughput than the SOTA HE-based system.

Models on the Move: Towards Feasible Embedded AI for Intrusion Detection on Vehicular CAN Bus

He Xu, Di Wu, Yufeng Lu, and Jiwu Lu, Hunan University and ExponentiAI Innovation; Haibo Zeng, Virginia Tech

The Controller Area Network (CAN) protocol is widely used in vehicles as an efficient standard enabling communication among Electronic Control Units (ECUs). However, the CAN bus is vulnerable to malicious attacks because of its lack of defense features. To achieve efficient and effective intrusion detection system (IDS) design for hardware and embedded system security in vehicles, we specifically tackle the challenge that existing IDS techniques rarely consider small-batch attacks. We propose a model, with a hardware implementation, that operates on the vehicular CAN bus, namely MULSAM, which employs multi-dimensional long short-term memory with a self-attention mechanism. The self-attention mechanism can enhance the characteristics of CAN bus-oriented attack behavior, and the multi-dimensional long short-term memory can effectively extract the in-depth features of time-series data. The MULSAM model has been compared with other baselines on five attacks generated from benign CAN data extracted from an actual vehicle. Our experimental results demonstrate that MULSAM has the best training stability and detection accuracy (98.98%) in identifying small-batch injection attacks. Furthermore, to speed up the inference of MULSAM as an embedded unit in vehicles, a hardware accelerator has been implemented on an FPGA, achieving better energy efficiency than other embedded platforms. Even with a certain degree of quantization, the accelerated MULSAM model still presents a high detection accuracy of 98.81% and a low latency of 1.88 ms, leading to a new cyber-physical system security solution towards feasible embedded AI for intrusion detection on the vehicular CAN bus.

CPC: Flexible, Secure, and Efficient CVM Maintenance with Confidential Procedure Calls

Jiahao Chen, Institute of Parallel and Distributed Systems, SEIEE, Shanghai Jiao Tong University; Engineering Research Center for Domain-specific Operating Systems, Ministry of Education, China; Shanghai Key Laboratory of Scalable Computing and Systems, Shanghai Jiao Tong University; Zeyu Mi and Yubin Xia, Institute of Parallel and Distributed Systems, SEIEE, Shanghai Jiao Tong University; Engineering Research Center for Domain-specific Operating Systems, Ministry of Education, China; Haibing Guan, Shanghai Key Laboratory of Scalable Computing and Systems, Shanghai Jiao Tong University; Haibo Chen, Institute of Parallel and Distributed Systems, SEIEE, Shanghai Jiao Tong University; Engineering Research Center for Domain-specific Operating Systems, Ministry of Education, China; Shanghai Key Laboratory of Scalable Computing and Systems, Shanghai Jiao Tong University

Confidential virtual machines (CVMs), while providing strong data privacy for cloud tenants, pose significant challenges to VM maintenance like live migration and snapshotting. Traditional host-based maintenance, while applicable to conventional VMs, is infeasible for CVMs due to the lack of trust in the host and the prevention of mandated intrusive access from the host. State-of-the-art approaches depend on non-trivial modifications to hardware and firmware and thus lead to notable compromises in security and/or performance. Furthermore, such approaches lack flexibility for upgrades and cross-platform compatibility, hindering the popularity of CVMs on the cloud.

In this paper, we introduce Confidential Procedure Calls (CPCs), a flexible approach to the efficient and secure execution of CVM maintenance modules from within the guest. We have implemented prototypes on two leading CVM platforms. Our prototype on AMD SEV showcases the high performance of CPCs, running 3× (resource reclamation) and up to 138× (live migration) faster than existing approaches. Our prototype on ARM CCA further confirms CPCs' outstanding security and flexibility.

12:25 pm–1:40 pm

Lunch (on your own)

1:40 pm–3:20 pm

Track 1

Storage 2

Grand Ballroom CD

RL-Watchdog: A Fast and Predictable SSD Liveness Watchdog on Storage Systems

Jin Yong Ha, Seoul National University; Sangjin Lee, Chung-Ang University; Heon Young Yeom, Seoul National University; Yongseok Son, Chung-Ang University

This paper proposes a reinforcement learning-based watchdog (RLW) that examines solid-state drive (SSD) liveness and detects failures caused by faults (e.g., controller/power faults and high temperature) quickly, precisely, and online, to minimize application data loss. To do this, we first provide a lightweight watchdog (LWW) that actively and cheaply examines SSD liveness by issuing a liveness-dedicated command to the SSD. Second, we introduce a reinforcement learning-based timeout predictor (RLTP), which predicts the timeout of the dedicated command, enabling the detection of a failure point regardless of the SSD model. Finally, we propose fast failure notification (FFN) to immediately notify applications of the failure and minimize their potential data loss. We implement RLW with these three techniques in Linux kernel 6.0.0 and evaluate it on a single SSD and in RAID using realistic power fault injection. The experimental results reveal that RLW reduces data loss by up to 96.7% compared with the existing scheme, and its accuracy in predicting failure points reaches up to 99.8%.
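
As a rough illustration of the watchdog loop described above, the following hypothetical sketch probes liveness and adapts its timeout estimate online. The real RLW uses a reinforcement-learning predictor rather than the simple exponentially weighted estimate used here, and probe_ssd is a stand-in for the liveness-dedicated command.

```python
# Hypothetical sketch (not the RLW implementation): a watchdog loop that issues
# a liveness probe and adapts its timeout estimate online.
import random
import time

def probe_ssd() -> None:
    """Placeholder for the liveness-dedicated command issued to the SSD."""
    time.sleep(random.uniform(0.001, 0.003))   # simulated command latency

def watchdog(rounds: int = 20, margin: float = 3.0) -> None:
    est = 0.005                                # current timeout estimate (seconds)
    for _ in range(rounds):
        start = time.monotonic()
        probe_ssd()
        elapsed = time.monotonic() - start
        if elapsed > est * margin:             # missed the predicted deadline
            print("SSD presumed failed; notifying applications")
            return
        est = 0.9 * est + 0.1 * elapsed        # refine the timeout prediction

watchdog()
```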

Exploit both SMART Attributes and NAND Flash Wear Characteristics to Effectively Forecast SSD-based Storage Failures in Clusters

Yunfei Gu and Chentao Wu, Shanghai Jiao Tong University; Xubin He, Temple University

Solid State Drives (SSDs) based on flash technology are extensively employed as high-performance storage solutions in supercomputing data centers. However, SSD failures are frequent in these environments, resulting in significant performance issues. To ensure the reliability and accessibility of HPC storage systems, it is crucial to predict failures in advance, enabling timely preventive measures. Although many failure prediction methods rely on SMART attributes and system telemetry logs, their predictive efficacy is constrained because these logs have limited capacity to directly elucidate the root causes of SSD failures at the device level. In this paper, we revisit the underlying causes of SSD failures and, for the first time, utilize the device-level flash wear characteristics of SSDs as a critical indicator instead of relying solely on SMART data. We propose a novel Aging-Aware Pseudo Twin Network (APTN) based SSD failure prediction approach, exploiting both SMART attributes and device-level NAND flash wear characteristics, to effectively forecast SSD failures. In practice, we also adapt APTN to an online learning framework. Our evaluation results demonstrate that APTN improves the F1-score by 51.2% and the TPR by 40.1% on average compared to existing schemes. This highlights the potential of leveraging device-level wear characteristics in conjunction with SMART attributes for more accurate and reliable SSD failure prediction.

StreamCache: Revisiting Page Cache for File Scanning on Fast Storage Devices

Zhiyue Li and Guangyan Zhang, Tsinghua University

Buffered I/O via the page cache is commonly used for file scanning because the page cache transparently provides buffering, data aggregation, I/O alignment, and prefetching. However, our study indicates that employing the page cache for file scanning on fast storage devices presents two performance issues: it delivers I/O bandwidth well below what fast storage devices can sustain, and the intensive background writeback onto fast storage devices can significantly interfere with foreground I/O requests.

In this paper, we propose StreamCache, a new page cache management system for file scanning on fast storage devices. StreamCache exploits three techniques to achieve high I/O performance. First, it uses a lightweight stream tracking method to record the states of cached pages at the granularity of sequential streams. Second, it uses a stream-based page reclaiming method to lower the interference to foreground I/O requests. Third, it uses a two-layer memory management method to accelerate page allocation by leveraging CPU cache locality.
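
The stream-granular bookkeeping can be pictured with the following hypothetical sketch (names and the reclamation policy are illustrative, not StreamCache's implementation): cached page indices are grouped into sequential streams, and reclamation operates on whole streams rather than individual pages.

```python
# Hypothetical sketch: track cached pages as sequential streams and reclaim a
# whole stream at a time, illustrating the granularity StreamCache works at.
class StreamTracker:
    def __init__(self):
        self.streams = []            # list of [start_page, end_page) intervals

    def record(self, page: int) -> None:
        for s in self.streams:
            if page == s[1]:         # extends an existing sequential stream
                s[1] += 1
                return
        self.streams.append([page, page + 1])    # start a new stream

    def reclaim_oldest_stream(self) -> list[int]:
        """Reclaim one whole stream instead of scanning individual pages."""
        if not self.streams:
            return []
        start, end = self.streams.pop(0)
        return list(range(start, end))

t = StreamTracker()
for p in [100, 101, 102, 500, 501, 103]:
    t.record(p)
print(t.streams)                     # [[100, 104], [500, 502]]
print(t.reclaim_oldest_stream())     # pages 100..103 reclaimed together
```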

We implement StreamCache in XFS. Experimental results show that compared with existing methods, StreamCache can increase the I/O bandwidth of scientific applications by 44%, and reduce the checkpoint/restart time of large language models by 15.7% on average.

Scalable Billion-point Approximate Nearest Neighbor Search Using SmartSSDs

Bing Tian, Haikun Liu, Zhuohui Duan, Xiaofei Liao, Hai Jin, and Yu Zhang, Huazhong University of Science and Technology

Approximate nearest neighbor search (ANNS) in high-dimensional vector spaces has become increasingly crucial in database and machine learning applications. Most previous ANNS algorithms require TB-scale memory to store indices of billion-scale datasets, making their deployment extremely expensive for high-performance search. The emerging SmartSSD technology offers an opportunity to achieve scalable ANNS via near-data processing (NDP). However, there remain challenges in directly adopting existing ANNS algorithms on multiple SmartSSDs.

In this paper, we present SmartANNS, a SmartSSD-empowered billion-scale ANNS solution based on a hierarchical indexing methodology. We propose several novel designs to achieve near-linear scaling with multiple SmartSSDs. First, we propose a "host CPUs + SmartSSDs" cooperative architecture incorporating hierarchical indices to significantly reduce data accesses and computations on SmartSSDs. Second, we propose dynamic task scheduling based on an optimized data layout to achieve both load balancing and data reuse across multiple SmartSSDs. Third, we further propose a learning-based shard pruning algorithm to eliminate unnecessary computations on SmartSSDs. We implement SmartANNS using Samsung's commercial SmartSSDs. Experimental results show that SmartANNS improves queries per second (QPS) by up to 10.7× compared with the state-of-the-art SmartSSD-based ANNS solution, CSDANNS. Moreover, SmartANNS achieves near-linear performance scalability for large-scale datasets using multiple SmartSSDs.
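
The shard-pruning idea can be illustrated with a small hypothetical sketch: rank shards by the distance from the query to their centroids and search only the closest ones, so most SmartSSD-resident shards never do any work. SmartANNS learns this pruning decision; a fixed top-k rule stands in for the learned model here.

```python
# Hypothetical sketch: prune shards by centroid distance so only a handful of
# SmartSSD-resident shards are searched per query (a stand-in for SmartANNS's
# learning-based pruning).
import numpy as np

def prune_shards(query: np.ndarray, centroids: np.ndarray, keep: int) -> np.ndarray:
    dists = np.linalg.norm(centroids - query, axis=1)
    return np.argsort(dists)[:keep]             # shard ids worth searching

rng = np.random.default_rng(0)
centroids = rng.random((32, 128))               # 32 shards, 128-d centroids
query = rng.random(128)
print(prune_shards(query, centroids, keep=4))   # only 4 of 32 shards searched
```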

Track 2

Hardware

Grand Ballroom EF

gVulkan: Scalable GPU Pooling for Pixel-Grained Rendering in Ray Tracing

Yicheng Gu, Yun Wang, Yunfan Sun, Yuxin Xiang, Yufan Jiang, Xuyan Hu, Zhengwei Qi, and Haibing Guan, Shanghai Jiao Tong University

Ray tracing rendering technology enhances scene realism and offers immersive experiences. However, it demands significant computational resources to trace and compute light-object interactions. As a result, traditional local GPU rendering might not meet the demands for high image quality and low latency. Moreover, many applications are tailored to utilize the resources of a single GPU, limiting their capacity to increase computational power through additional GPUs.

This paper presents gVulkan, the first transparent multi-GPU rendering acceleration solution for Vulkan-based ray tracing. To address the bottleneck caused by limited local GPU resources, gVulkan can offload ray tracing rendering to the cloud via API forwarding. In the cloud, gVulkan employs Split Frame Rendering (SFR) to enable an arbitrary number of GPUs to accelerate rendering in parallel, while dynamically rebalancing the workload across GPUs at pixel granularity. Experiments demonstrate that gVulkan can accelerate Vulkan-based ray tracing programs in an application-unaware manner. By dynamically rebalancing each GPU's workload, gVulkan achieves near-linear scaling, with an average 3.81× speedup across 4 GPUs.
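
A hypothetical sketch of pixel-grained rebalancing under SFR: each GPU receives a strip of scanlines sized in proportion to its measured speed in the previous frame. The proportional policy below is illustrative, not gVulkan's actual algorithm.

```python
# Hypothetical sketch: split a frame into per-GPU scanline strips whose sizes
# are proportional to each GPU's speed in the previous frame, so slower GPUs
# receive fewer pixels next frame.
def rebalance(frame_height: int, last_frame_ms: list[float]) -> list[int]:
    speeds = [1.0 / t for t in last_frame_ms]          # faster GPU => larger share
    total = sum(speeds)
    rows = [int(frame_height * s / total) for s in speeds]
    rows[-1] += frame_height - sum(rows)               # hand rounding leftover to last GPU
    return rows                                        # scanlines per GPU

# GPU at index 2 was slowest last frame (30 ms), so it gets the smallest strip.
print(rebalance(1080, [10.0, 12.0, 30.0, 11.0]))
```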

vFPIO: A Virtual I/O Abstraction for FPGA-accelerated I/O Devices

Jiyang Chen, Harshavardhan Unnibhavi, Atsushi Koshiba, and Pramod Bhatotia, Technical University of Munich

Modern cloud systems have adopted a variety of FPGA-accelerated I/O devices, such as SmartNICs and computational storage, yet these devices pose programmability and portability challenges. Existing FPGA frameworks either directly expose device-specific I/O interfaces to user logic or offer virtualized I/Os limited to a single device type. The lack of an I/O abstraction imposes high engineering costs, reduces design portability, and can even cause unexpected throughput degradation.

We introduce vFPIO, an FPGA-based I/O acceleration framework that brings better programmability and design portability. vFPIO extends modern FPGA OSes to expose virtual I/O ports to user logic, which abstracts device-dependent I/O specifications and makes the user logic design platform-agnostic. The connectivity between virtual and physical I/O ports can be easily configured by host applications using POSIX-like file APIs. vFPIO also offers a preemptive I/O transaction scheduler that alleviates the I/O throughput degradation caused by concurrent I/O requests from multiple accelerators in a multi-tenant environment.
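
The following hypothetical host-side model (not vFPIO's real API) illustrates the flavor of configuring virtual-to-physical port bindings through a file-like interface, so user logic remains unchanged when the physical back end changes.

```python
# Hypothetical model: bind a user design's virtual I/O port to a physical back
# end through a file-like object. The path, method names, and back-end labels
# are invented for illustration only.
class VirtualPort:
    def __init__(self, name: str):
        self.name, self.backend = name, None

    def bind(self, backend: str) -> None:      # e.g., "hbm", "dram", "network"
        self.backend = backend

    def write(self, data: bytes) -> int:
        if self.backend is None:
            raise IOError(f"port {self.name} not bound")
        print(f"{len(data)} bytes -> {self.backend} via {self.name}")
        return len(data)

port = VirtualPort("/dev/vfpio/port0")          # hypothetical device path
port.bind("hbm")
port.write(b"payload")    # user logic is unchanged if the back end is rebound
```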

We implement a prototype of the vFPIO framework on x86 servers equipped with AMD Xilinx Alveo U280 cards. Our prototype supports four different I/O interfaces: PCIe, DRAM, HBM, and network. Our evaluation highlights that vFPIO incurs negligible performance overheads compared to Coyote, one of the latest FPGA OSes, while preserving the maximum I/O throughput for high-priority tasks even under resource contention.

ScalaCache: Scalable User-Space Page Cache Management with Software-Hardware Coordination

Li Peng and Yuda An, Peking University; You Zhou, Huazhong University of Science and Technology; Chenxi Wang, University of Chinese Academy of Sciences; Qiao Li, Xiamen University; Chuanning Cheng, Huawei; Jie Zhang, Peking University and Zhongguancun Laboratory

Due to the host-centric design principle, existing page cache management suffers from high CPU consumption, communication costs, and garbage collection (GC) interference. To address these challenges, we propose ScalaCache, a scalable user-space page cache with software-hardware coordination. Specifically, to reduce host CPU overhead, we offload cache management into computational storage drives (CSDs) and further merge the indirection layers in both the cache and the flash firmware, which facilitates lightweight cache management. To further boost scalability, we build a lockless resource management framework that allows multiple CSD internal cores to manage the cache space concurrently. ScalaCache also aggregates the computing power of multiple CSDs to deliver scalable I/O performance. Moreover, ScalaCache reduces communication costs by trimming the I/O control path while mitigating GC interference via a GC-aware replacement policy, thereby enhancing its efficiency and performance stability. Our evaluation results reveal that ScalaCache delivers 5.12× and 1.70× higher bandwidth than the kernel page cache and the state-of-the-art user-space page cache, respectively. ScalaCache is open source and available at https://github.com/ChaseLab-PKU/ScalaCache.
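
One common way to obtain lock-free concurrent management, sketched hypothetically below, is to statically partition cache slots across CSD cores so each core allocates only from its own free list. This is an illustration of the general idea, not ScalaCache's actual data structures.

```python
# Hypothetical sketch: partition cache slots across CSD cores so that each core
# manages its own free list and no cross-core lock is needed for allocation.
class CorePartition:
    def __init__(self, slots):
        self.free = list(slots)          # touched by exactly one core

    def alloc(self):
        return self.free.pop() if self.free else None

def build_partitions(num_slots: int, num_cores: int) -> list:
    # Stripe slot ids across cores; each core sees a disjoint subset.
    return [CorePartition(range(c, num_slots, num_cores)) for c in range(num_cores)]

parts = build_partitions(num_slots=16, num_cores=4)
print([p.alloc() for p in parts])        # each core hands out one of its own slots
```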

Centimani: Enabling Fast AI Accelerator Selection for DNN Training with a Novel Performance Predictor

Zhen Xie, Binghamton University; Murali Emani, Argonne National Laboratory; Xiaodong Yu, Stevens Institute of Technology; Dingwen Tao, Indiana University; Xin He, Xidian University; Pengfei Su, University of California, Merced; Keren Zhou, George Mason University; Venkatram Vishwanath, Argonne National Laboratory

For an extended period, graphics processing units (GPUs) have stood as the exclusive choice for training deep neural network (DNN) models. Over time, to serve the growing demands in a more targeted manner, various artificial intelligence-specific hardware, referred to as AI accelerators, has been vigorously developed, aiming to provide more efficient DNN acceleration solutions. However, these solutions are heterogeneous and thus complicate accelerator selection. Given a DNN model and a training objective, such as throughput or price-performance ratio, it remains challenging to arrive at the optimal decision among many options due to high reimplementation costs and hard-to-predict performance.

To tackle this challenge, we propose Centimani, a performance predictor that accurately and rapidly predicts DNN training throughput on various AI accelerators, thereby facilitating the accelerator selection process. To achieve this goal, we first analyze typical AI accelerators and draw observations that abstract AI accelerator designs and guide our performance modeling approach. In particular, we construct a memory estimation model and decoupled performance models to select the most appropriate batch size and predict the execution time of DNN training. We validate our approach by applying Centimani to six common DNN models on four typical AI accelerators. Results show that Centimani predicts throughput with an average accuracy of 93.1% on single-device training and 90.4% on multi-device training, so the optimal accelerator for the user's training objective can be identified.
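
A hypothetical sketch of the two-step prediction flow: a memory model selects the largest batch size that fits, and decoupled compute and communication models estimate the step time and hence throughput. All coefficients below are invented for illustration, not measured values.

```python
# Hypothetical sketch: pick the batch size from a memory model, then predict
# throughput from decoupled compute and communication cost models.
def predict_throughput(mem_gb: float, per_sample_gb: float, fixed_gb: float,
                       compute_ms_per_sample: float, comm_ms_per_step: float):
    batch = int((mem_gb - fixed_gb) / per_sample_gb)             # memory model
    step_ms = batch * compute_ms_per_sample + comm_ms_per_step   # decoupled models
    return batch, 1000.0 * batch / step_ms                       # samples / second

batch, tput = predict_throughput(mem_gb=40, per_sample_gb=0.3, fixed_gb=8,
                                 compute_ms_per_sample=1.2, comm_ms_per_step=15)
print(batch, round(tput, 1))     # e.g., batch 106 at roughly 745 samples/s
```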

3:20 pm–3:40 pm

Break with Refreshments

Grand Ballroom Foyer

3:40 pm–5:10 pm

Potpourri

Grand Ballroom CD

A Difference World: High-performance, NVM-invariant, Software-only Intermittent Computation

Harrison Williams, Virginia Tech; Saim Ahmad, Amazon; Matthew Hicks, Virginia Tech

Supporting long-life, high-performance intermittent computation is an essential challenge in allowing energy-harvesting devices to fulfill the vision of smart dust. Intermittent computation is the extension of long-running computation across the frequent, unexpected power cycles that result from replacing batteries with harvested energy. The most promising intermittent computation support strategies combine programmer direction and compiler analysis to minimize run-time overhead and provide programmer control, without specialized hardware support. While such strategies succeed in reducing the size of the non-volatile memory writes needed for checkpointing, they must checkpoint continuously. Unfortunately, for Flash-based devices (by far the most ubiquitous), writing checkpoints is slow and gradually kills the device. Without intervention, Flash devices and software-only intermittent computation are fundamentally incompatible.

To enable ubiquitous programmer-guided intermittent computation, we design and implement Camel. The key idea behind Camel is the systematic bifurcation of program state into two "worlds" of differing volatility. Programmers compose intermittent programs by stitching together atomic units of computation called tasks. The Camel compiler ensures that all within-task data is placed in the volatile world and all between-task data is placed in the non-volatile world. Between tasks, Camel swaps the worlds, atomically locking in the forward progress of the preceding task. In preparation for the next task, Camel resolves differences in world view by copying only the differences due to the preceding task's updates. This systematic decomposition into a mixed-volatility memory allows, for the first time, long-life and high-performance programmer-guided intermittent computation on Flash devices: Camel outperforms the state-of-the-art checkpointing system for Flash-based devices by up to 5× while eliminating the need for hardware support. Beyond Flash, Camel's differential buffer system improves performance by 2× compared to existing task-based approaches on FRAM platforms.
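
A hypothetical model of the two-world idea: each task mutates a volatile working copy, and at the task boundary only the values it changed are committed to the non-volatile world. This omits the world swap and the compiler support of the real system.

```python
# Hypothetical model: run a task against a volatile copy of state, then commit
# only the differences into the non-volatile world at the task boundary.
def run_task(task, volatile: dict, nonvolatile: dict) -> dict:
    task(volatile)                                   # task mutates the volatile world
    diff = {k: v for k, v in volatile.items() if nonvolatile.get(k) != v}
    nonvolatile.update(diff)                         # commit only the differences
    return diff

nv_world = {"count": 0, "sum": 0}                    # survives power failure
v_world = dict(nv_world)                             # working copy in volatile memory

def task_increment(state: dict) -> None:
    state["count"] += 1

print(run_task(task_increment, v_world, nv_world))   # {'count': 1} written back
```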

Efficient Large Graph Processing with Chunk-Based Graph Representation Model

Rui Wang, Zhejiang University and Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security; Weixu Zong, Shuibing He, Xinyu Chen, Zhenxin Li, and Zheng Dang, Zhejiang University

Existing external graph processing systems face challenges of low I/O efficiency, expensive computation overhead, and high graph algorithm development costs when running on emerging NVMe SSDs, due to their reliance on complex loading and computing models that aim to convert numerous random I/Os into a few sequential I/Os. While in-memory graph systems working with memory-storage cache systems, such as the OS page cache or TriCache, offer a promising alternative for large graph processing with fine-grained I/Os and easy algorithm programming, they often overlook the specific characteristics of graph applications, resulting in inefficient graph processing. To address these challenges, we introduce ChunkGraph, an I/O-efficient graph system designed for processing large-scale graphs on NVMe SSDs. ChunkGraph introduces a novel chunk-based graph representation model, featuring classified and hierarchical vertex storage and efficient chunk layout optimization. Evaluations show that ChunkGraph outperforms existing external graph systems, as well as in-memory graph systems relying on general cache systems, running several times faster.
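
A hypothetical sketch of a chunk-based adjacency layout: neighbor lists are packed into fixed-size chunks so a scan issues a few large reads instead of many small random ones. The chunk size and packing rule below are illustrative, not ChunkGraph's.

```python
# Hypothetical sketch: pack neighbor lists into fixed-size chunks and keep a
# per-vertex index of (chunk id, offset, degree) for one-read lookups.
CHUNK_EDGES = 4                       # tiny chunk size for demonstration

def pack_chunks(adj: dict[int, list[int]]):
    chunks, index, cur = [], {}, []   # index: vertex -> (chunk id, offset, degree)
    for v, nbrs in adj.items():
        if len(cur) + len(nbrs) > CHUNK_EDGES and cur:
            chunks.append(cur)
            cur = []
        index[v] = (len(chunks), len(cur), len(nbrs))
        cur.extend(nbrs)
    chunks.append(cur)
    return chunks, index

chunks, index = pack_chunks({0: [1, 2], 1: [2, 3, 0], 2: [0]})
cid, off, deg = index[1]
print(chunks[cid][off:off + deg])     # neighbors of vertex 1 from one chunk read
```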

SlimArchive: A Lightweight Architecture for Ethereum Archive Nodes

Hang Feng, Yufeng Hu, and Yinghan Kou, Zhejiang University; Runhuai Li and Jianfeng Zhu, BlockSec; Lei Wu and Yajin Zhou, Zhejiang University

With the rapid development of Ethereum, archive nodes that record all historical states have become a critical component of the infrastructure. However, current archive nodes suffer from enormous storage requirements and poor performance due to the inefficient authenticated Merkle Patricia Trie and coarse-grained state granularity.

This paper presents a lightweight and high-performance architecture for Ethereum archive nodes to address the two limitations mentioned earlier. The core idea of our approach is to maintain compacted, flattened, and fine-grained (i.e., transaction-level) historical states by flattening the minimum state changes of each transaction required for the world state. Our method maintains an archive node with minimum storage requirements while providing high-performance state access. We have implemented a prototype system named SLIMARCHIVE for Ethereum. The evaluation results demonstrate that our approach reduces storage requirements by 98.1%, improves state access throughput by 19.0×, and speeds up transaction execution by an average of 1112.5×, compared to vanilla Geth.
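
A hypothetical sketch of flattened, transaction-granular historical state: for each state key, record the (transaction index, value) pairs at which it changed, so reading the state as of any transaction is a binary search rather than a Merkle Patricia Trie walk. The key format below is invented for illustration.

```python
# Hypothetical sketch: flattened per-transaction state history. Each key maps
# to the (tx index, value) pairs at which it changed; historical reads are a
# binary search over those indices.
import bisect
from collections import defaultdict

class FlatHistory:
    def __init__(self):
        self.changes = defaultdict(lambda: ([], []))   # key -> (tx indices, values)

    def record(self, tx_index: int, key: str, value: int) -> None:
        txs, vals = self.changes[key]
        txs.append(tx_index)
        vals.append(value)

    def state_at(self, tx_index: int, key: str):
        txs, vals = self.changes[key]
        i = bisect.bisect_right(txs, tx_index) - 1
        return vals[i] if i >= 0 else None

h = FlatHistory()
h.record(10, "0xabc/balance", 5)
h.record(42, "0xabc/balance", 7)
print(h.state_at(30, "0xabc/balance"))   # 5: the value in effect after tx 10
```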

Every Mapping Counts in Large Amounts: Folio Accounting

David Hildenbrand, Technical University of Munich and Red Hat GmbH; Martin Schulz, Technical University of Munich; Nadav Amit, Technion, Israel Institute of Technology

Operating systems can significantly enhance performance by utilizing large contiguous memory regions, even when the memory is not mapped using huge pages, by streamlining memory management. To harness these advantages, Linux has introduced "folios," representing multiple contiguous pages. Unlike traditional huge pages, folios can be partially mapped, which complicates folio accounting and hinders both performance and memory savings.

Accurate and efficient folio accounting is crucial for optimizing memory management operations, enforcing various memory management policies, and performing Unique Set Size accounting in the operating system. In particular, determining whether a folio is exclusively mapped in a single address space is essential for avoiding unnecessary Copy-On-Write operations when memory is no longer shared.

We introduce a novel tracking scheme to determine, with negligible overhead, whether a folio is exclusively mapped in a single address space. Our solution achieves a memory overhead that grows sublinearly with the number of pages per folio. By implementing our method in Linux, we demonstrate notable improvements in fork and unmap operations of 1.9× and 4.2×, respectively, and in the performance of fork-intensive workloads, such as Redis, achieving up to a 2.2× speedup.
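
A hypothetical model of the exclusivity check: track a folio's total mapping count together with one candidate owner and how many mappings that owner holds; the folio is exclusively mapped only if the candidate owner holds them all. The in-kernel scheme is far more compact; this only illustrates the invariant.

```python
# Hypothetical model: per-folio total mapcount plus one candidate owner (an mm
# id) and its mapping count; exclusivity holds iff the owner accounts for every
# mapping. Conservative by design: it may report "shared" after ownership moves.
class FolioTracker:
    def __init__(self):
        self.total = 0            # mappings across all address spaces
        self.owner = None         # candidate exclusive owner (an mm id)
        self.owner_count = 0

    def map_page(self, mm_id: int) -> None:
        self.total += 1
        if self.owner in (None, mm_id):
            self.owner = mm_id
            self.owner_count += 1

    def unmap_page(self, mm_id: int) -> None:
        self.total -= 1
        if mm_id == self.owner:
            self.owner_count -= 1

    def exclusively_mapped(self) -> bool:
        return self.total > 0 and self.owner_count == self.total

f = FolioTracker()
f.map_page(mm_id=1)
f.map_page(mm_id=1)
print(f.exclusively_mapped())     # True: only mm 1 maps this folio
f.map_page(mm_id=2)
print(f.exclusively_mapped())     # False: shared after a fork-like mapping
```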

5:10 pm–5:20 pm

Closing Remarks

Grand Ballroom CD

Program Co-Chairs: Saurabh Bagchi, Purdue University; Yiying Zhang, University of California, San Diego