Monday, April 28
7:30 am–8:55 am
Continental Breakfast
8:55 am–9:10 am
Opening Remarks and Awards
Program Co-Chairs: Theophilus A. Benson, Carnegie Mellon University; Radhika Niranjan Mysore, VMware Research Group
9:10 am–10:30 am
Data Centers Queuing and Routing
PRED: Performance-oriented Random Early Detection for Consistently Stable Performance in Datacenters
Xinle Du, Huawei Technologies; Tong Li, Renmin University of China; Guangmeng Zhou, Zhuotao Liu, Hanlin Huang, and Xiangyu Gao, Tsinghua University; Mowei Wang and Kun Tan, Huawei Technologies; Ke Xu, Tsinghua University
For decades, Random Early Detection (RED) has been integrated into datacenter switches as a fundamental Active Queue Management (AQM) scheme. Accurate configuration of RED parameters is crucial to achieving high throughput and low latency. However, due to the highly dynamic nature of workloads in datacenter networks, maintaining consistently high performance with statically configured RED thresholds poses a challenge. Prior art applies reinforcement learning to predict proper thresholds, but its real-world deployment has been hindered by poor tail performance caused by instability. In this paper, we propose PRED, a novel system that enables automatic and stable RED parameter adjustment in response to traffic dynamics. PRED uses two loosely coupled systems, Flow Concurrent Stabilizer (FCS) and Queue Length Adjuster (QLA), to overcome the challenges of dynamically setting RED parameters to adapt to the ever-changing traffic pattern. We perform extensive evaluations on our physical testbed and large-scale simulations. The results demonstrate that PRED can keep up with the real-time network dynamics generated by realistic workloads. For instance, compared with static-threshold-based methods, PRED keeps 66% lower switch queue length and obtains up to 80% lower Flow Completion Time (FCT). Compared with the state-of-the-art learning-based method, PRED reduces the tail FCT by 34%.
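For readers unfamiliar with the baseline mechanism, the sketch below shows classic RED marking: the drop probability grows linearly between two queue-length thresholds. This is the textbook algorithm whose static thresholds PRED adapts at runtime, not PRED's own logic; the parameter names are the conventional ones.

```python
import random

def red_should_drop(avg_queue, min_th, max_th, max_p):
    """Classic RED: no action below min_th, always drop/mark above max_th,
    and in between drop with a probability that rises linearly. PRED's
    contribution is adjusting min_th/max_th at runtime instead of fixing
    them at configuration time."""
    if avg_queue < min_th:
        return False
    if avg_queue >= max_th:
        return True
    p = max_p * (avg_queue - min_th) / (max_th - min_th)
    return random.random() < p

# Example: a statically configured queue with thresholds of 50/150 packets.
print(red_should_drop(avg_queue=80, min_th=50, max_th=150, max_p=0.1))
```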
Rajomon: Decentralized and Coordinated Overload Control for Latency-Sensitive Microservices
Jiali Xing, Akis Giannoukos, Paul Loh, Shuyue Wang, and Justin Qiu, University of Pennsylvania; Henri Maxime Demoulin, DBOS, Inc; Konstantinos Kallas, University of California, Los Angeles; Benjamin C. Lee, University of Pennsylvania
Microservices are increasingly central for cloud applications due to their flexibility and support for rapid integration and deployment. However, applications often experience overload or sudden traffic surges that exceed service capacity, resulting in increased latency or service failures. Moreover, microservices are decentralized, interdependent, and multiplexed, exacerbating risks from overload.
We present RAJOMON, a market-based overload control system for large microservice graphs. RAJOMON controls overload through distributed rate-limiting and load shedding. Clients attach tokens to requests and services charge a price for each API, dropping requests with insufficient tokens. Tokens and prices propagate through the entire call graph, piggybacking on requests and responses. Thus, RAJOMON is the first decentralized, end-to-end overload control system.
We implement and evaluate RAJOMON on a setup of up to 140 cores and on a variety of applications from academia and industry. Experiments indicate RAJOMON protects microservice goodput and tail latency from substantial demand spikes, even in the case of mixed request types and deeper service graphs. For high-load scenarios, RAJOMON reduces tail latency by 78% and increases goodput by 45% when compared against state-of-the-art overload control for microservices.
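As a rough illustration of the token-and-price admission idea described above, the following sketch shows a single service charging a per-API price and shedding under-funded requests. All names (price_table, Request, admit) are hypothetical, and the sketch omits how RAJOMON actually sets and propagates prices and tokens through the call graph.

```python
# Hypothetical sketch of price-based load shedding at one service; not RAJOMON's API.
price_table = {"/search": 5, "/checkout": 20}   # per-API prices (illustrative)

class Request:
    def __init__(self, api, tokens):
        self.api = api
        self.tokens = tokens

def admit(req):
    """Charge the API's price; shed the request if it cannot pay."""
    price = price_table.get(req.api, 1)
    if req.tokens < price:
        return None                   # load shedding: insufficient tokens
    req.tokens -= price               # remaining tokens travel downstream
    return req

print(admit(Request("/checkout", tokens=25)).tokens)   # -> 5
print(admit(Request("/checkout", tokens=10)))          # -> None (dropped)
```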
Learnings from Deploying Network QoS Alignment to Application Priorities for Storage Services
Matthew Buckley and Parsa Pazhooheshy, Google and University of Toronto; Z. Morley Mao, Nandita Dukkipati, Hamid Hajabdolali Bazzaz, Priyaranjan Jha, Yingjie Bi, and Steve Middlekauff, Google; Yashar Ganjali, University of Toronto
To ensure that application network traffic is prioritized correctly within data center networks, it is critical to align the QoS configuration carried in packets with the intended priority of the application. These QoS configurations, typically encoded in the DSCP bits of the IP header, are interpreted by network switches and routers to determine resources, such as buffer space and scheduling priorities, for network traffic. Conceptually, mapping application priorities within data center networks to network QoS configurations appears fairly straightforward, as long as the mapping is well defined. In this work, we describe our experience of aligning network QoS settings for intra-cluster storage traffic to application priorities on a per-RPC basis for a large data center network, with well-defined static mappings from priorities to QoS traffic classes. We describe some unexpected insights learned from the deployment, e.g., downgrading traffic to use a lower QoS does not always imply worse network latency, due to over-used QoS bands in the network. We also share some challenges encountered on the way to a fleet-wide deployment, including concerns about potential performance regressions due to QoS downgrades. These lessons provide guidance on the use of a QoS-based scheduling strategy to meet service guarantees and can be applied to networks of any scale.
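As background on the encoding the paper discusses, the sketch below sets DSCP on a socket using the standard IP_TOS option (DSCP occupies the upper six bits of the TOS/Traffic Class byte). This is generic Linux socket usage for illustration, not the mechanism used inside Google’s fleet.

```python
import socket

# DSCP occupies the upper 6 bits of the 8-bit TOS/Traffic Class field,
# so the DSCP codepoint is shifted left by 2 when written via IP_TOS.
DSCP_AF41 = 34                      # example codepoint (decimal)

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, DSCP_AF41 << 2)
# Switches and routers along the path read these bits to choose buffer
# space and scheduling priority for this flow's packets.
```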
DISC: Backpressure Mitigation In Multi-tier Applications With Distributed Shared Connection
Brice Ekane and Djob Mvondo, Univ. Rennes, Inria, CNRS, IRISA, France; Renaud Lachaize, Univ. Grenoble Alpes, CNRS, Inria, Grenoble INP, LIG, 38000 Grenoble, France; Yérom-David Bromberg, Univ. Rennes, Inria, CNRS, IRISA, France; Alain Tchana, Univ. Grenoble Alpes, CNRS, Inria, Grenoble INP, LIG, 38000 Grenoble, France; Daniel Hagimont, IRIT, Université de Toulouse, CNRS, Toulouse INP, UT3 Toulouse, France
Most data-center applications are based on a multi-tier architecture, involving either coarse-grained software components (e.g., traditional 3-tier web applications) or fine-grained ones (e.g., microservices). Such applications are prone to the backpressure problem, which introduces a strong performance coupling between tiers, thus degrading scalability and resource consumption. This problem is due to the fact that, on the response path towards the initial client, a significant fraction of the payloads in the messages exchanged between tiers correspond to “final” data that are simply relayed (i.e., without further modifications) from a backend tier such as a database. This traffic results in additional pressure on the intermediate and frontend tiers.
To address this problem, we introduce DISC, a system allowing several tiers within a multi-tier chain to jointly act as endpoints of the same TCP connection. This enables the selective bypass of one or several tiers on the response path. Unlike existing solutions, DISC is (1) flexible — it accommodates arbitrary multi-tier topologies and heterogeneous application-level protocols, (2) fine-grained — it allows multiple tiers to be involved in the generation and emission of a given response message (e.g., to decouple the network path of the response headers and footers from the path of the response body), (3) and non-intrusive — it requires only minor and localized/modular modifications to the code base of legacy applications and is transparent for external clients. Evaluation results with several micro- and macro-benchmarks show that DISC can reduce the cumulative CPU load on servers by up to 41.5%, decrease the average and tail latencies respectively by up to 74.1% and 5.71×, and also improve the request rate by up to 45%.
Data Plane Programmability 1
Enabling Silent Telemetry Data Transmission with InvisiFlow
Yinda Zhang, University of Pennsylvania; Liangcheng Yu, University of Pennsylvania and Microsoft Research; Gianni Antichi, Politecnico di Milano and Queen Mary University of London; Ran Ben Basat, University College London; Vincent Liu, University of Pennsylvania
Network applications from traffic engineering to path tracing often rely on the ability to transmit fine-grained telemetry data from network devices to a set of collectors. Unfortunately, prior work has observed—and we validate—that existing transmission methods for such data can result in significant overhead to user traffic and/or loss of telemetry data, particularly when the network is heavily loaded.
In this paper, we introduce InvisiFlow, a novel communication substrate to collect network telemetry data, silently. In contrast to previous systems that always push telemetry packets to collectors based on the shortest path, InvisiFlow dynamically seeks out spare network capacity by leveraging opportunistic sending and congestion gradients, thus minimizing both the loss rate of telemetry data and overheads on user traffic. In a FatTree topology, InvisiFlow can achieve near-zero loss rate even under high-load scenarios (around 33.8× lower loss compared to the state-of-the-art transmission methods used by systems like Everflow and Planck).
Unlocking ECMP Programmability for Precise Traffic Control
Yadong Liu, Tencent; Yunming Xiao, University of Michigan; Xuan Zhang, Weizhen Dang, Huihui Liu, Xiang Li, and Zekun He, Tencent; Jilong Wang, Tsinghua University; Aleksandar Kuzmanovic, Northwestern University; Ang Chen, University of Michigan; Congcong Miao, Tencent
ECMP (equal-cost multi-path) has become a fundamental mechanism in data centers, which distributes flows along multiple equivalent paths based on their hash values. Randomized distribution optimizes for the aggregate case, spreading load across flows over time. However, there exists a class of important Precise Traffic Control (PTC) tasks that are at odds with ECMP randomness. For instance, if an end host perceives that its flows are traversing a problematic switch/link, it often needs to change their paths before a fix can be rolled out. With randomized hashing, existing solutions resort to modifying flow tuples; since hashing mechanisms are unknown and they vary across switches/vendors, it may take many trials before yielding a new path. Many other similar cases exist where precise and timely response is critical to the network.
We propose programmable ECMP (P-ECMP), a programming model, compiler, and runtime that provides precise traffic control. P-ECMP leverages an oft-ignored feature, ECMP groups, which allows for a constrained set of capabilities that are nonetheless sufficiently expressive for our tasks. An operator supplies high-level descriptions of their topology and policies, and our compiler generates PTC configurations for each switch. End hosts can reconfigure specific flows to use different PTC policies precisely and quickly, addressing a range of important use cases. We have evaluated P-ECMP using simulation at scale, and deployed one use case to a real-world data center that serves live user traffic.
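For context, the toy model below captures the randomized ECMP behavior that motivates P-ECMP: the path is a hash of the flow 5-tuple, so steering a flow today means perturbing the tuple and re-rolling an opaque, vendor-specific hash. The hash function and path names here are illustrative only.

```python
import hashlib

def ecmp_next_hop(five_tuple, paths):
    """Toy ECMP: pick a path by hashing the 5-tuple. Real switches use
    proprietary hash functions and seeds, which is why an end host cannot
    predict which tuple change will move a flow off a problematic link."""
    digest = hashlib.sha256(repr(five_tuple).encode()).digest()
    return paths[int.from_bytes(digest[:4], "big") % len(paths)]

paths = ["spine-1", "spine-2", "spine-3", "spine-4"]
print(ecmp_next_hop(("10.0.0.1", "10.0.1.2", 6, 40000, 443), paths))
# Changing the source port merely re-rolls the dice rather than choosing a path:
print(ecmp_next_hop(("10.0.0.1", "10.0.1.2", 6, 40001, 443), paths))
```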
Enabling Portable and High-Performance SmartNIC Programs with Alkali
Jiaxin Lin, UT Austin; Zhiyuan Guo, UCSD; Mihir Shah, NVIDIA; Tao Ji, Microsoft; Yiying Zhang, UCSD; Daehyeok Kim and Aditya Akella, UT Austin
Trends indicate that emerging SmartNICs, either from different vendors or generations from the same vendor, exhibit substantial differences in hardware parallelism and memory interconnects. These variations make porting programs across NICs highly complex and time-consuming, requiring programmers to significantly refactor code for performance based on each target NIC’s hardware characteristics.
We argue that an ideal SmartNIC compilation framework should allow developers to write target-independent programs, with the compiler automatically managing cross-NIC porting and performance optimization. We present such a framework, Alkali, that achieves this by (1) proposing a new intermediate representation for building flexible compiler infrastructure for multiple NIC targets and (2) developing a new iterative parallelism optimization algorithm that automatically ports and parallelizes the input programs based on the target NIC’s hardware characteristics.
Experiments across a wide range of NIC applications demonstrate that Alkali enables developers to easily write portable, high-performance NIC programs. Our compiler optimization passes can automatically port these programs and make them run efficiently across all targets, achieving performance within 9.8% of hand-tuned expert implementations.
Scaling IP Lookup to Large Databases using the CRAM Lens
Robert Chang and Pradeep Dogga, University of California, Los Angeles; Andy Fingerhut, Cisco Systems; Victor Rios and George Varghese, University of California, Los Angeles
Wide-area scaling trends require new approaches to Internet Protocol (IP) lookup, enabled by modern networking chips such as Intel Tofino, AMD Pensando, and Nvidia BlueField, which provide substantial ternary content-addressable memory (TCAM) and static random-access memory (SRAM). However, designing and evaluating scalable algorithms for these chips is challenging due to hardware-level constraints. To address this, we introduce the CRAM (CAM+RAM) lens, a framework that combines a formal model for evaluating algorithms on modern network processors with a set of optimization idioms. We demonstrate the effectiveness of CRAM by designing and evaluating three new IP lookup schemes: RESAIL, BSIC, and MashUp. RESAIL enables Tofino-2 to scale to 2.25 million IPv4 prefixes—likely sufficient for the next decade—while a pure TCAM approach supports only 250k prefixes, just 27% of the current global IPv4 routing table. Similarly, BSIC scales to 390k IPv6 prefixes on Tofino-2, supporting 3.2 times as many prefixes as a pure TCAM implementation. In contrast, existing state-of-the-art algorithms, SAIL for IPv4 and Hi-BST for IPv6, scale to considerably smaller sizes on Tofino-2.
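As a refresher on the underlying problem, the sketch below is a minimal binary-trie longest-prefix match. RESAIL, BSIC, and MashUp are concerned with mapping this kind of lookup onto the TCAM/SRAM stages of programmable chips, which the sketch makes no attempt to model.

```python
class TrieNode:
    def __init__(self):
        self.children = {}          # '0'/'1' -> TrieNode
        self.next_hop = None

def insert(root, prefix_bits, next_hop):
    node = root
    for b in prefix_bits:
        node = node.children.setdefault(b, TrieNode())
    node.next_hop = next_hop

def lookup(root, addr_bits):
    """Longest-prefix match: remember the deepest next_hop seen on the walk."""
    node, best = root, None
    for b in addr_bits:
        if node.next_hop is not None:
            best = node.next_hop
        node = node.children.get(b)
        if node is None:
            return best
    return node.next_hop if node.next_hop is not None else best

root = TrieNode()
insert(root, "0000101", "hop-A")      # short prefix (a /7, in spirit)
insert(root, "00001010", "hop-B")     # longer, more specific prefix (a /8)
print(lookup(root, "00001010" + "0" * 24))   # -> hop-B (longest match wins)
```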
10:30 am–11:00 am
Coffee and Tea Break
11:00 am–12:20 pm
Data Center Resource Scheduling
Quicksand: Harnessing Stranded Datacenter Resources with Granular Computing
Zhenyuan Ruan, MIT CSAIL; Shihang Li, Brown University; Kaiyan Fan, MIT CSAIL; Seo Jin Park, University of Southern California; Marcos K. Aguilera, VMware Research by Broadcom; Adam Belay, MIT CSAIL; Malte Schwarzkopf, Brown University
Datacenters today waste CPU and memory, as resources demanded by applications often fail to match the resources available on machines. This leads to stranded resources because one resource that runs out prevents placing additional applications that could consume the other resources. Unusable stranded resources result in reduced utilization of servers, and wasted money and energy.
Quicksand is a new framework and runtime system that unstrands resources by providing developers with familiar, high-level abstractions (e.g., data structures, batch computing). Internally Quicksand decomposes them into resource proclets, granular units that each primarily consume resources of one type. Inspired by recent granular programming models, Quicksand decouples consumption of resources as much as possible. It splits, merges, and migrates resource proclets in milliseconds, so it can use resources on any machine, even if available only briefly.
Evaluation of our prototype with four applications shows that Quicksand uses stranded resources effectively; that Quicksand reacts to changing resource availability and demand within milliseconds, increasing utilization; and that porting applications to Quicksand requires moderate effort.
Beehive: A Scalable Disaggregated Memory Runtime Exploiting Asynchrony of Multithreaded Programs
Quanxi Li, Hong Huang, Ying Liu, and Yanwen Xia, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Jie Zhang, Peking University; Mosong Zhou, Huawei Cloud; Xiaobing Feng and Huimin Cui, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Quan Chen, Shanghai Jiao Tong University; Yizhou Shan, Huawei Cloud; Chenxi Wang, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Microsecond (µs)-scale I/O fabrics raise a tension between programming productivity and performance, especially in disaggregated memory systems. The multithreaded synchronous programming model is popular for developing memory-disaggregated applications due to its intuitive program logic. However, our key insight is that although thread switching can effectively mitigate µs-scale latency, it leads to poor data locality and non-trivial scheduling overhead, leaving significant opportunities to improve performance further. This paper proposes a memory-disaggregated framework, Beehive, which improves remote access throughput by exploiting the asynchrony within each thread. To improve programming usability, Beehive allows programmers to develop applications in the conventional multithreaded synchronous model and automatically transforms the code, via the Rust compiler, into asynchronous code based on pararoutines (a newly proposed computation and scheduling unit). Beehive outperforms the state-of-the-art memory-disaggregated frameworks, i.e., Fastswap, Hermit, and AIFM, by 4.26×, 3.05×, and 1.58× on average.
Making Serverless Pay-For-Use a Reality with Leopard
Tingjia Cao, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, and Tyler Caraza-Harter, University of Wisconsin-Madison
Serverless computing has gained traction due to its event-driven architecture and “pay for use” (PFU) billing model. However, our analysis reveals that current billing practices do not align with true resource consumption. This paper challenges the prevailing SLIM (static, linear, interactive-only model) assumptions that underpin existing billing models, demonstrating that current billing does not realize PFU for realistic workloads. We introduce the Nearly Pay-for-Use (NPFU) billing model, which accommodates varying CPU and memory demands, spot cores, and preemptible memory. We also introduce Leopard, an NPFU-based serverless platform that integrates billing awareness into several major subsystems: CPU scheduler, OOM killer, admission controller, and cluster scheduler. Experimental results indicate that Leopard benefits both providers and users, increasing throughput by more than 2x and enabling cost reductions.
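A toy illustration of the gap the NPFU model targets: billing one invocation at its peak (static) memory allocation versus its time-varying usage. The usage samples and price are made up, and the arithmetic is not Leopard's pricing model.

```python
# Hypothetical per-second memory usage (GB) of one serverless invocation.
usage = [0.1, 0.3, 1.8, 1.8, 0.2, 0.1]
PRICE_PER_GB_SECOND = 0.00001          # illustrative rate

static_bill = max(usage) * len(usage) * PRICE_PER_GB_SECOND   # billed at peak allocation
npfu_bill = sum(usage) * PRICE_PER_GB_SECOND                  # billed for actual use

print(f"static (SLIM-style) bill: {static_bill:.6f}")
print(f"nearly pay-for-use bill:  {npfu_bill:.6f}")
```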
GRANNY: Granular Management of Compute-Intensive Applications in the Cloud
Carlos Segarra, Simon Shillaker, Guo Li, and Eleftheria Mappoura, Imperial College London; Rodrigo Bruno, INESC-ID, Instituto Superior Técnico, University of Lisbon; Lluís Vilanova and Peter Pietzuch, Imperial College London
Parallel applications are typically implemented using multi-threading (with shared memory, e.g., OpenMP) or multi-processing (with message passing, e.g., MPI). While it seems attractive to deploy such applications in cloud VMs, existing cloud schedulers fail to manage these applications efficiently: they cannot scale multi-threaded applications dynamically when more vCPUs in a VM become available, and they cause fragmentation over time because of the static allocation of multi-process applications to VMs.
We describe GRANNY, a new distributed runtime that enables the fine-granular management of multi-threaded/process applications in cloud environments. GRANNY supports the vertical scaling of multi-threaded applications within a VM and the horizontal migration of multi-process applications between VMs. GRANNY achieves both through a single WebAssembly-based execution abstraction: Granules can execute application code with thread or process semantics and allow for efficient snapshotting. GRANNY scales up applications by adding more Granules at runtime, and de-fragments applications by migrating Granules between VMs. In both cases, it launches new Granules from snapshots efficiently. We evaluate GRANNY with dynamic scheduling policies and show that, compared to current schedulers, it reduces the makespan for OpenMP workloads by up to 60% and the fragmentation for MPI workloads by up to 25%.
Verification 1
On Temporal Verification of Stateful P4 Programs
Delong Zhang, Chong Ye, and Fei He, School of Software, BNRist, Tsinghua University, Beijing 100084, China; Key Laboratory for Information System Security, MoE, China
Stateful P4 programs offload network states from the control plane to the data plane, enabling unprecedented network programmability. However, existing P4 verifiers overapproximate the stateful nature of P4 programs and are inherently inadequate for verifying network functions that require stateful decision-making.
To overcome this limitation, this paper introduces an innovative approach to verify P4 programs while accounting for their stateful feature. We propose a specification language named P4LTL, tailored for describing temporal properties of stateful P4 programs at the packet processing level. Additionally, we introduce a novel concept called the Büchi transaction, representing the product of the P4 program and the P4LTL specification. The P4 program verification problem can be reduced to determining the existence of any fair and feasible trace within the Büchi transaction. To the best of our knowledge, our approach represents the first endeavor in temporal verification of stateful P4 programs at the packet processing level. We implemented a prototype tool called p4tv. Evaluation results demonstrate p4tv’s effectiveness and efficiency in temporal verification of stateful P4 programs.
NDD: A Decision Diagram for Network Verification
Zechun Li, Peng Zhang, and Yichi Zhang, Xi'an Jiaotong University; Hongkun Yang, Google
State-of-the-art network verifiers extensively use Binary Decision Diagram (BDD) as the underlying data structure to represent the network state and equivalence classes. Despite its wide usage, we find BDD is not ideal for network verification: verifiers need to handle the low-level computation of equivalence classes, and still face scalability issues when the network state has a lot of bits.
To this end, this paper introduces Network Decision Diagram (NDD), a new decision diagram customized for network verification. In a nutshell, NDD wraps BDD with another layer of decision diagram, such that each NDD node represents a field of the network, and each edge is labeled with a BDD encoding the values of that field. We designed and implemented a library for NDD, which features native support for equivalence classes and higher efficiency in memory and computation. Using the NDD library, we re-implemented five BDD-based verifiers with minor modifications to their original code, and observed a 100× reduction in memory cost and a 100× speedup. This indicates that NDD provides a drop-in replacement of BDD for network verifiers.
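A rough sketch of the layering the abstract describes: an outer diagram that branches on whole header fields, with each edge carrying a predicate over that field's values. The class below stands in a plain frozenset where a real NDD edge stores a BDD; the structure and names are our illustrative assumptions, not the NDD library's API.

```python
# Illustrative NDD-like node: branch on whole header fields; each edge
# carries a predicate over that field (a real NDD stores a BDD here).
class NDDNode:
    def __init__(self, field, edges):
        self.field = field          # e.g., "dst_ip" or "dst_port"
        self.edges = edges          # list of (value_set, child) pairs

TERMINAL_PERMIT = "permit"
TERMINAL_DENY = "deny"

ports = NDDNode("dst_port", [(frozenset({80, 443}), TERMINAL_PERMIT),
                             (frozenset(range(0, 80)), TERMINAL_DENY)])
root = NDDNode("dst_ip", [(frozenset({"10.0.0.1", "10.0.0.2"}), ports)])

def evaluate(node, packet):
    """Walk the field-level diagram until a terminal decision is reached."""
    while isinstance(node, NDDNode):
        value = packet[node.field]
        node = next((child for vals, child in node.edges if value in vals),
                    TERMINAL_DENY)
    return node

print(evaluate(root, {"dst_ip": "10.0.0.1", "dst_port": 443}))   # -> permit
```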
Smart Casual Verification of the Confidential Consortium Framework
Heidi Howard, Markus A. Kuppe, Edward Ashton, and Amaury Chamayou, Azure Research, Microsoft; Natacha Crooks, Azure Research, Microsoft and UC Berkeley
The Confidential Consortium Framework (CCF) is an open-source platform for developing trustworthy and reliable cloud applications. CCF powers Microsoft's Azure Confidential Ledger service and as such it is vital to build confidence in the correctness of CCF's design and implementation. This paper reports our experiences applying smart casual verification to validate the correctness of CCF's novel distributed protocols, focusing on its unique distributed consensus protocol and its custom client consistency model. We use the term smart casual verification to describe our hybrid approach, which combines the rigor of formal specification and model checking with the pragmatism of automated testing, in our case binding the formal specification in TLA+ to the C++ implementation. While traditional formal methods approaches require substantial buy-in and are often one-off efforts by domain experts, we have integrated our smart casual verification approach into CCF's CI pipeline, allowing contributors to continuously validate CCF as it evolves. We describe the challenges we faced in applying smart casual verification to a complex existing codebase and how we overcame them to find six subtle bugs in the design and implementation before they could impact production.
VEP: A Two-stage Verification Toolchain for Full eBPF Programmability
Xiwei Wu, Yueyang Feng, Tianyi Huang, Xiaoyang Lu, Shengkai Lin, Lihan Xie, Shizhen Zhao, and Qinxiang Cao, Shanghai Jiao Tong University
Extended Berkeley Packet Filter (eBPF) is a revolutionary technology that can safely and efficiently extend kernel capabilities. It has been widely used in networking, tracing, security, and more. However, existing eBPF verifiers impose strict constraints, often requiring repeated modifications to eBPF programs to pass verification. To enhance programmability, we introduce VEP, an annotation-guided eBPF program verification toolchain. VEP consists of three components: VEP-C, a verifier for annotated eBPF-C programs; VEP-compiler, a compiler targeting annotated eBPF bytecode; and VEP-eBPF, a lightweight bytecode-level proof checker. VEP allows users to verify the correctness of their programs with appropriate annotations, thus enabling full programmability. Our experimental results demonstrate that VEP addresses the limitations of existing verifiers, i.e., the Linux verifier and PREVAIL, and provides a more flexible and automated approach to kernel security.
12:20 pm–2:00 pm
Symposium Luncheon and Test of Time Award Presentation
2:00 pm–3:20 pm
Failure and Diagnosis
MeshTest: End-to-End Testing for Service Mesh Traffic Management
Naiqian Zheng, Tianshuo Qiao, Xuanzhe Liu, and Xin Jin, Peking University
We present MeshTest, the first end-to-end testing framework for traffic management of service mesh. The key idea of MeshTest is to automatically generate input configurations with end-to-end semantics, and then create real test request suites on each input. There are two technical challenges. First, the input space of service mesh configurations is large and complex. The input configurations should be carefully orchestrated to form end-to-end service flow paths. Second, the abstract output network behavior cannot be directly checked for correctness, and we need to generate a set of real requests that are capable of checking possible behaviors. To address these challenges, we model the service flows of traffic management in service mesh, and propose a novel Service Flow Exploration technique to enumerate all possible configuration resources and interactions between them in the input configuration. We design and implement MeshTest, which contains an automatic input configuration generator based on Service Flow Exploration and a Service Mesh Oracle which leverages formal methods to generate test request suites. MeshTest has found 23 new bugs (19 confirmed and 10 fixed) in two popular service mesh systems, Istio and Linkerd.
Preventing Network Bottlenecks: Accelerating Datacenter Services with Hotspot-Aware Placement for Compute and Storage
Hamid Hajabdolali Bazzaz, Yingjie Bi, and Weiwu Pang, Google; Minlan Yu, Harvard University; Ramesh Govindan, University of Southern California; Neal Cardwell, Nandita Dukkipati, Meng-Jung Tsai, Chris DeForeest, and Yuxue Jin, Google; Charles Carver, Columbia University; Jan Kopański, Liqun Cheng, and Amin Vahdat, Google
Datacenter network hotspots, defined as links with persistently high utilization, can lead to performance bottlenecks. In this work, we study hotspots in Google’s datacenter networks. We find that these hotspots occur most frequently at ToR switches and can persist for hours. They are caused mainly by bandwidth demand-supply imbalance, largely due to high demand from network-intensive services, or demand exceeding available bandwidth when compute/storage upgrades outpace ToR bandwidth upgrades. Compounding this issue is bandwidth-independent task/data placement by datacenter compute and storage schedulers. We quantify the performance impact of hotspots and find that they can degrade the end-to-end latency of some distributed applications by over 2× relative to low utilization levels. Finally, we describe simple improvements we deployed. In our cluster scheduler, adding hotspot-aware task placement reduced the number of hot ToRs by 90%; in our distributed file system, adding hotspot-aware data placement reduced p95 network latency by more than 50%. While congestion control, load balancing, and traffic engineering can efficiently utilize paths for a fixed placement, we find hotspot-aware placement – placing tasks and data under ToRs with higher available bandwidth – is crucial for achieving consistently good performance.
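A minimal sketch of the hotspot-aware placement rule described above: among feasible ToRs, place the task under the one with the most available uplink bandwidth. The bandwidth snapshot and scoring are hypothetical simplifications of the production schedulers.

```python
# Hypothetical snapshot: available ToR uplink bandwidth (Gbps) and free task slots.
tors = {
    "tor-a": {"avail_gbps": 12, "free_slots": 4},
    "tor-b": {"avail_gbps": 85, "free_slots": 2},
    "tor-c": {"avail_gbps": 40, "free_slots": 6},
}

def place_task(tors, demand_gbps):
    """Greedy hotspot-aware placement: among ToRs that can host the task,
    pick the one with the most available uplink bandwidth."""
    candidates = [(name, t) for name, t in tors.items()
                  if t["free_slots"] > 0 and t["avail_gbps"] >= demand_gbps]
    if not candidates:
        return None
    name, t = max(candidates, key=lambda kv: kv[1]["avail_gbps"])
    t["free_slots"] -= 1
    t["avail_gbps"] -= demand_gbps
    return name

print(place_task(tors, demand_gbps=10))   # -> "tor-b"
```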
Enhancing Network Failure Mitigation with Performance-Aware Ranking
Pooria Namyar and Arvin Ghavidel, University of Southern California; Daniel Crankshaw, Daniel S. Berger, Kevin Hsieh, and Srikanth Kandula, Microsoft; Ramesh Govindan, University of Southern California; Behnaz Arzani, Microsoft
Cloud providers install mitigations to reduce the impact of network failures within their datacenters. Existing network mitigation systems rely on simple local criteria or global proxy metrics to determine the best action. In this paper, we show that we can support a broader range of actions and select more effective mitigations by directly optimizing end-to-end flow-level metrics and analyzing actions holistically. To achieve this, we develop novel techniques to quickly estimate the impact of different mitigations and rank them with high fidelity. Our results on incidents from a large cloud provider show orders of magnitude improvements in flow completion time and throughput. We also show our approach scales to large datacenters.
One-Size-Fits-None: Understanding and Enhancing Slow-Fault Tolerance in Modern Distributed Systems
Ruiming Lu, University of Michigan and Shanghai Jiao Tong University; Yunchi Lu and Yuxuan Jiang, University of Michigan; Guangtao Xue, Shanghai Jiao Tong University; Peng Huang, University of Michigan
Recent studies have shown that various hardware components exhibit fail-slow behavior at scale. However, the characteristics of distributed software's tolerance of such slow faults remain ill-understood. This paper presents a comprehensive study that investigates the characteristics and current practices of slow-fault tolerance in modern distributed software. We focus on the fundamentally nuanced nature of slow faults. We develop a testing pipeline to systematically introduce diverse slow faults, measure their impact under different workloads, and identify the patterns. Our study shows that even small changes can lead to dramatically different reactions. While some systems have added slow-fault handling mechanisms, they are mostly controlled by static thresholds, which can hardly accommodate the highly sensitive and dynamic characteristics. To address this gap, we design ADR, a lightweight library to use within system code and make fail-slow handling adaptive. Evaluation shows ADR significantly reduces the impact of slow faults.
All Things Transport
Pyrrha: Congestion-Root-Based Flow Control to Eliminate Head-of-Line Blocking in Datacenter
Kexin Liu, Zhaochen Zhang, Chang Liu, and Yizhi Wang, Nanjing University; Vamsi Addanki and Stefan Schmid, TU Berlin; Qingyue Wang, Wei Chen, Xiaoliang Wang, and Jiaqi Zheng, Nanjing University; Wenhao Sun, Tao Wu, Ke Meng, Fei Chen, Weiguang Wang, and Bingyang Liu, Huawei, China; Wanchun Dou, Guihai Chen, and Chen Tian, Nanjing University
In modern datacenters, the effectiveness of end-to-end congestion control (CC) is quickly diminishing with the rapid bandwidth evolution. Per-hop flow control (FC) can react to congestion more promptly. However, a coarse-grained FC can result in Head-Of-Line (HOL) blocking. A fine-grained, per-flow FC can eliminate HOL blocking caused by flow control; however, it does not scale well. This paper presents Pyrrha, a scalable flow control approach that provably eliminates HOL blocking while using a minimum number of queues. In Pyrrha, flow control first takes effect on the root of the congestion, i.e., the port where congestion occurs. Then flows are controlled according to their contributed congestion roots. A prototype of Pyrrha is implemented on Tofino2 switches. Compared with state-of-the-art approaches, the average FCT of uncongested flows is reduced by 42%-98%, and 99th-percentile tail latency can be 1.6×-215× lower, without compromising the performance of congested flows.
eTran: Extensible Kernel Transport with eBPF
Zhongjie Chen, Tsinghua University; Qingkai Meng, Nanjing University; ChonLam Lao, Harvard University; Yifan Liu and Fengyuan Ren, Tsinghua University; Minlan Yu, Harvard University; Yang Zhou, UC Berkeley and UC Davis
Evolving datacenters with diverse application demands are driving network transport designs. However, few have successfully landed in the widely-used kernel networking stack to benefit broader users, and they take multiple years. We present eTran, a system that makes kernel transport extensible to implement and customize diverse transport designs agilely. To achieve this, eTran leverages and extends eBPF-based techniques to customize the kernel to support complex transport functionalities safely. Meanwhile, eTran carefully absorbs user-space transport techniques for performance gain without sacrificing robust protection. We implement TCP (with DCTCP congestion control) and Homa under eTran, and achieve up to 4.8×/1.8× higher throughput with 3.7×/7.5× lower latency compared to existing kernel implementation.
White-Boxing RDMA with Packet-Granular Software Control
Chenxingyu Zhao and Jaehong Min, University of Washington; Ming Liu, University of Wisconsin-Madison; Arvind Krishnamurthy, University of Washington
Driven by diverse workloads and deployments, numerous innovations emerge to customize RDMA transport, spanning congestion control, multi-tenant isolation, routing, and more. However, RDMA's hardware-offloading nature poses significant rigidity when landing these innovations. Prior workflows to deliver customizations have either waited for lengthy hardware iterations, developed bespoke hardware, or applied coarse-grained control over the black-box RDMA NIC. Despite considerable efforts, current customization workflows still lack flexibility, raw performance, and broad availability.
In this work, we advocate for White-Boxing RDMA, which provides control of the hardware transport to general-purpose software while preserving raw data path performance. To facilitate the white-boxing methodology, we design and implement Software-Controlled RDMA (SCR), a framework enabling packet-granular software control over the hardware transport. To address challenges stemming from granular control over high-speed line rates, SCR employs effective control models, boosts the efficiency of subsystems within the framework, and leverages emerging hardware capabilities. We implement SCR on the latest Nvidia BlueField-3 equipped with Datapath Accelerators, delivering a spectrum of new customizations not present in legacy RDMA transport, such as Multi-Tenant Fair Scheduler, User-Defined Congestion Control, Receiver-Driven Flow Control, and Multi-Path Routing Selection. Furthermore, we demonstrate SCR's applicability for GPU-Direct and NVMe-oF RDMA with zero modifications to machine learning or storage code.
SIRD: A Sender-Informed, Receiver-Driven Datacenter Transport Protocol
Konstantinos Prasopoulos, EPFL; Ryan Kosta, UCSD; Edouard Bugnion, EPFL; Marios Kogias, Imperial College London
Datacenter congestion control protocols are challenged to navigate the throughput-buffering trade-off while relative packet buffer capacity is trending lower year-over-year. In this context, receiver-driven protocols — which schedule packet transmissions instead of reacting to congestion — excel when the bottleneck lies at the ToR-to-receiver link. However, when multiple receivers must use a shared link (e.g., ToR to Spine), their independent schedules can conflict.
We present SIRD, a receiver-driven congestion control protocol designed around the simple insight that single-owner links should be scheduled, while shared links should be managed with reactive control algorithms. The approach allows receivers to both precisely schedule their downlinks and to coordinate over shared bottlenecks. Critically, SIRD also treats sender uplinks as shared links, enabling the flow of congestion feedback from senders to receivers, which then adapt their scheduling to each sender’s real time capacity. This results in tight scheduling, enabling high bandwidth utilization with little contention, and thus minimal latency-inducing buffering in the fabric.
We implement SIRD on top of the Caladan stack and show that SIRD’s asymmetric design can deliver 100Gbps in software while keeping network queuing minimal. We further compare SIRD to state-of-the-art receiver-driven protocols (Homa, dcPIM, and ExpressPass) and production-grade reactive protocols (Swift and DCTCP) and show that SIRD is uniquely able to simultaneously maximize link utilization, minimize queuing, and obtain near-optimal latency.
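To make the receiver-driven idea concrete, the toy loop below has a receiver schedule its own downlink by granting per-sender credits one chunk at a time. The credit sizes, round-robin policy, and class names are illustrative; SIRD's actual design, including the sender-informed feedback over shared links, is substantially richer.

```python
# Toy receiver-driven scheduler: the receiver owns its downlink and grants
# credits to senders chunk by chunk. Illustrative only, not SIRD's protocol.
from collections import deque

DOWNLINK_CREDIT_BYTES = 64 * 1024      # outstanding bytes the link can absorb

class Receiver:
    def __init__(self):
        self.pending = deque()          # (sender, remaining_bytes) with unsent data
        self.outstanding = 0

    def request(self, sender, remaining_bytes):
        self.pending.append((sender, remaining_bytes))

    def grant_loop(self, chunk=16 * 1024):
        grants = []
        while self.pending and self.outstanding + chunk <= DOWNLINK_CREDIT_BYTES:
            sender, remaining = self.pending.popleft()
            grant = min(chunk, remaining)
            grants.append((sender, grant))
            self.outstanding += grant
            if remaining - grant > 0:
                self.pending.append((sender, remaining - grant))
        return grants

rx = Receiver()
rx.request("sender-1", 40 * 1024)
rx.request("sender-2", 20 * 1024)
print(rx.grant_loop())   # interleaved grants keep the downlink busy but unqueued
```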
3:20 pm–3:50 pm
Coffee and Tea Break
3:50 pm–5:50 pm
LLM Training and Resilience
Accelerating Design Space Exploration for LLM Training Systems with Multi-experiment Parallel Simulation
Fei Gui, Tsinghua University; BNRist; Tsinghua Shenzhen International Graduate School; Kaihui Gao and Li Chen, Zhongguancun Laboratory; Dan Li, Tsinghua University; Vincent Liu, University of Pennsylvania; Ran Zhang and Hongbing Yang, Zhongguancun Laboratory; Dian Xiong, Tsinghua University
The rapid expansion of large language models (LLMs) requires the development of extensive GPU clusters, with companies deploying clusters with tens to hundreds of thousands of GPUs. This growth significantly expands the design space for LLM training systems, requiring thorough exploration of different parallelization strategies, communication parameters, congestion control, fabric topology, etc. Current methods require up to 10k simulation experiments to identify optimal configurations, with inadequate exploration leading to significant degradation of training performance.
In this paper, we tackle the overlooked problem of efficiently conducting parallel simulation experiments for design space exploration. Our analysis and experiments show that Single-process Multi-experiment (SPME) achieves superior performance by reducing scheduling overhead and optimizing resource utilization, yet remains insufficient for current AI cluster scales. To enhance SPME’s efficacy, we introduce Multiverse, a novel GPU-based AI training simulator. Multiverse leverages the computing throughput of GPUs efficiently with optimizations such as pull-based synchronization, high-fidelity intra-server communication, and a kernel-fusion technique. Extensive experiments validate the accuracy and efficiency of Multiverse, demonstrating less than 3.0% discrepancy with real-world LLM training on clusters of up to 54,000 GPUs, and achieving a 43.1-73.2× speedup over state-of-the-art CPU-based simulators in various use cases.
Optimizing RLHF Training for Large Language Models with Stage Fusion
Yinmin Zhong, Zili Zhang, Bingyang Wu, and Shengyu Liu, School of Computer Science, Peking University; Yukun Chen, Changyi Wan, Hanpeng Hu, Lei Xia, Ranchen Ming, and Yibo Zhu, StepFun; Xin Jin, School of Computer Science, Peking University
We present RLHFuse, an efficient training system with stage fusion for Reinforcement Learning from Human Feedback (RLHF). Due to the intrinsic nature of RLHF training, i.e., the data skewness in the generation stage and the pipeline bubbles in the training stage, existing RLHF systems suffer from low GPU utilization. RLHFuse breaks the traditional view of RLHF workflow as a composition of individual tasks, splitting each task into finer-grained subtasks, and performing stage fusion to improve GPU utilization. RLHFuse contains two key ideas. First, for generation and inference tasks, RLHFuse splits them into sample-level subtasks, enabling efficient inter-stage fusion to overlap the execution of generation and inference stages, thus mitigating the original generation bottleneck dominated by long-tailed samples. Second, for training tasks, RLHFuse breaks them into subtasks of micro-batches and performs intra-stage fusion to concurrently execute these subtasks in the training stage with a fused pipeline schedule, effectively mitigating the pipeline bubbles. The experiments show that RLHFuse increases the training throughput by up to 3.7×, compared to existing systems.
Minder: Faulty Machine Detection for Large-scale Distributed Model Training
Yangtao Deng, Tsinghua University; Xiang Shi and Zhuo Jiang, ByteDance; Xingjian Zhang, Tsinghua University; Lei Zhang, Zhang Zhang, Bo Li, Zuquan Song, Hang Zhu, and Gaohong Liu, ByteDance; Fuliang Li, Northeastern University; Shuguang Wang, Haibin Lin, and Jianxi Ye, ByteDance; Minlan Yu, Harvard University
Large-scale distributed model training requires simultaneous training on up to thousands of machines. Faulty machine detection is critical when an unexpected fault occurs in a machine. From our experience, a training task can encounter two faults per day on average, possibly leading to a halt for hours. To address the drawbacks of the time-consuming and labor-intensive manual scrutiny, we propose Minder, an automatic faulty machine detector for distributed training tasks. The key idea of Minder is to automatically and efficiently detect faulty distinctive monitoring metric patterns, which could last for a period before the entire training task comes to a halt. Minder has been deployed in our production environment for over one year, monitoring daily distributed training tasks where each involves up to thousands of machines. In our real-world fault detection scenarios, Minder can accurately and efficiently react to faults within 3.6 seconds on average, with a precision of 0.904 and F1-score of 0.893.
Holmes: Localizing Irregularities in LLM Training with Mega-scale GPU Clusters
Zhiyi Yao and Pengbo Hu, Fudan University and Tencent; Congcong Miao and Xuya Jia, Tencent; Zuning Liang and Yuedong Xu, Fudan University; Chunzhi He, Hao Lu, Mingzhuo Chen, Xiang Li, Zekun He, Yachen Wang, and Xianneng Zou, Tencent; Junchen Jiang, University of Chicago
Training Large Language Models (LLMs) on large-scale GPU clusters requires numerous iterations over several months. Existing works mainly focus on addressing failures that interrupt the iterative training process to improve the utilization of GPU clusters. However, our large-scale measurements over tens of thousands of GPUs show that the training process exhibits an unstable state with some irregular iterations taking even more than twice the time of a normal iteration. Surprisingly, we find that these irregular iterations greatly extend the time of LLM training, which is even more severe than the impact of failures. Meanwhile, the irregular phenomenon is silent, making it challenging to be accurately localized. In this paper, we propose a first-of-its-kind system called Holmes, leveraging communication operators to accurately localize these irregularities in real-time. The core of Holmes's approach is to employ an enhanced abnormal operator detection model and a novel communication operator graph to perform efficient irregularity localization. Furthermore, Holmes conducts cross-iteration analysis to improve localization accuracy. We evaluate Holmes using large-scale trace-driven simulations and a production-level prototype. Large-scale simulation results demonstrate that Holmes achieves irregularity localization accuracy of 97.21%. Production-level prototype evaluation results show Holmes can localize irregularity within 30.3 seconds, achieving a speedup of 6.52× as compared to traditional approaches.
SimAI: Unifying Architecture Design and Performance Tuning for Large-Scale Large Language Model Training with Scalability and Precision
Xizheng Wang, Alibaba Cloud and Tsinghua University; Qingxu Li, Yichi Xu, and Gang Lu, Alibaba Cloud; Dan Li, Tsinghua University; Li Chen, Zhongguancun Laboratory; Heyang Zhou, Alibaba Cloud; Linkang Zheng, Alibaba Cloud and South China University of Technology; Sen Zhang, Yikai Zhu, Yang Liu, Pengcheng Zhang, Kun Qian, Kunling He, Jiaqi Gao, and Ennan Zhai, Alibaba Cloud; Dennis Cai, Alibaba Group; Binzhang Fu, Alibaba Cloud
The large number of GPUs required for a single LLM training run significantly hinders the validation of new designs, tunings, and optimizations, calling for efficient simulators. Existing simulators, however, only target a specific granularity of the entire training, intrinsically leading to imprecision. This paper presents SimAI, a unified simulator aiming at precisely and efficiently simulating the LLM training procedure at scale. Through selective and high-fidelity integration of the training frameworks, the kernel computation, and the collective communication library into the simulating procedure, SimAI achieves high precision in simulations. SimAI further conducts multi-thread acceleration and implements lock-free global context-sharing to accelerate the execution speed. The effectiveness of SimAI is validated by its performance results, which show an average of 98.1% alignment to real-world results under various test scenarios and affirm its robustness and adaptability from small-scale labs to large-scale industrial environments. SimAI delivers meaningful guidelines for new host designs and parameter settings, directly benefiting in-production LLM training. We also share experiences and lessons learned during the evolution of SimAI. SimAI is open sourced at https://github.com/aliyun/SimAI.
ByteCheckpoint: A Unified Checkpointing System for Large Foundation Model Development
Borui Wan, The University of Hong Kong; Mingji Han, Yiyao Sheng, Yanghua Peng, Haibin Lin, Mofan Zhang, Zhichao Lai, Menghan Yu, Junda Zhang, Zuquan Song, and Xin Liu, ByteDance Inc.; Chuan Wu, The University of Hong Kong
Checkpointing to preserve training states is crucial during the development of Large Foundation Models (LFMs), for training resumption upon various failures or changes in GPU resources and parallelism configurations. In addition, saved checkpoints are dispatched to evaluation tasks or transferred across different training stages (e.g., from pre-training to post-training). All these scenarios require resharding distributed checkpoints from one parallelism to another. In production environments, different LFMs are trained with various frameworks and storage backends, depending on model sizes and training scales. A high performance checkpointing system is needed to enable efficient checkpoint management at scale throughout the lifecycle of LFM development. We introduce ByteCheckpoint, an industrial-grade checkpointing system for large-scale LFM training. ByteCheckpoint features: a parallelism-agnostic checkpoint representation that enables efficient load-time checkpoint resharding; a generic checkpoint saving/loading workflow to accommodate multiple training frameworks and support different storage backends; full-stack optimizations to ensure high I/O efficiency and scalability; a suite of monitoring tools to streamline large-scale performance analysis and bottleneck detection. Compared to existing open-source checkpointing systems [51, 57], ByteCheckpoint significantly reduces runtime checkpoint stalls, achieving an average reduction of 54.20×. For saving and loading times, ByteCheckpoint achieves improvements of up to 9.96× and 8.80×, respectively.
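To illustrate what resharding a checkpoint "from one parallelism to another" means at its simplest, the sketch below re-splits a flat parameter saved under 4-way sharding into 2-way shards with NumPy. Real checkpoints add nested parallelism dimensions, optimizer state, and storage-backend concerns that this toy example ignores.

```python
import numpy as np

# A flat parameter of 16 elements, saved as 4 shards (e.g., 4-way tensor parallel).
full = np.arange(16, dtype=np.float32)
saved_shards = np.split(full, 4)

def reshard(shards, new_degree):
    """Concatenate the saved shards into the logical tensor, then re-split
    for the new parallelism degree. A parallelism-agnostic checkpoint
    representation aims to make this load-time step cheap."""
    logical = np.concatenate(shards)
    return np.split(logical, new_degree)

new_shards = reshard(saved_shards, new_degree=2)
print([s.tolist() for s in new_shards])   # two shards of 8 elements each
```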
Video and Cloud Gaming
Mowgli: Passively Learned Rate Control for Real-Time Video
Neil Agarwal and Rui Pan, Princeton University; Francis Y. Yan, University of Illinois Urbana-Champaign; Ravi Netravali, Princeton University
Rate control algorithms are at the heart of video conferencing platforms, determining target bitrates that match dynamic network characteristics for high quality. Despite the promise that recent data-driven strategies have shown for this challenging task, the performance degradation that they introduce during training has been a nonstarter for many production services, precluding adoption. This paper aims to bolster the practicality of data-driven rate control by presenting an alternate avenue for experiential learning: using purely existing telemetry logs that we surprisingly observe embed performant decisions but often at the wrong times or in the wrong order. To realize this approach despite the inherent uncertainty that log-based learning brings (i.e., lack of feedback for new decisions), our system, Mowgli, combines a variety of robust learning techniques (i.e., conservatively reasoning about alternate behavior to minimize risk and using a richer model formulation to account for environmental noise). Across diverse networks (emulated and real-world), Mowgli outperforms the widely deployed GCC algorithm, increasing average video bitrates by 15–39% while reducing freeze rates by 60–100%.
Dissecting and Streamlining the Interactive Loop of Mobile Cloud Gaming
Yang Li, Jiaxing Qiu, Hongyi Wang, and Zhenhua Li, Tsinghua University; Feng Qian, University of Southern California; Jing Yang, Tsinghua University; Hao Lin, Tsinghua University and University of Illinois Urbana-Champaign; Yunhao Liu, Tsinghua University; Bo Xiao and Xiaokang Qin, Ant Group; Tianyin Xu, University of Illinois Urbana-Champaign
With cloud-side computing and rendering, mobile cloud gaming (MCG) is expected to deliver high-quality gaming experiences to budget mobile devices. However, our measurement on representative MCG platforms reveals that even under good network conditions, all platforms exhibit high interactive latency of 112–403 ms, from a user-input action to its display response, which critically affects users’ quality of experience. Moreover, jitters in network latency often lead to significant fluctuations in interactive latency.
In this work, we collaborate with a commercial MCG platform to conduct the first in-depth analysis on the interactive latency of cloud gaming. We identify VSync, the synchronization primitive of Android graphics pipeline, to be a key contributor to the excessive interactive latency; as many as five VSync events are intricately invoked, which serialize the complex graphics processing logic on both the client and cloud sides. To address this, we design an end-to-end VSync regulator, dubbed LoopTailor, which minimizes VSync events by decoupling game rendering from the lengthy cloud-side graphics pipeline and coordinating cloud game rendering directly with the client. We implement LoopTailor on the collaborated platform and commodity Android devices, reducing the interactive latency (by ∼34%) to stably below 100 ms.
Region-based Content Enhancement for Efficient Video Analytics at the Edge
Weijun Wang, Institute for AI Industry Research (AIR), Tsinghua University; Liang Mi, Shaowei Cen, and Haipeng Dai, State Key Laboratory for Novel Software Technology, Nanjing University; Yuanchun Li, Institute for AI Industry Research (AIR), Tsinghua University; Xiaoming Fu, University of Göttingen; Yunxin Liu, Institute for AI Industry Research (AIR), Tsinghua University
Video analytics is widespread in various applications serving our society. Recent advances in content enhancement for video analytics offer significant benefits for bandwidth saving and accuracy improvement. However, existing content-enhanced video analytics systems are excessively computationally expensive and provide extremely low throughput. In this paper, we present region-based content enhancement, which enhances only the important regions in videos to improve analytical accuracy. Our system, RegenHance, enables high-accuracy and high-throughput video analytics at the edge by 1) a macroblock-based region importance predictor that identifies the important regions fast and precisely, 2) a region-aware enhancer that stitches sparsely distributed regions into dense tensors and enhances them efficiently, and 3) a profile-based execution planner that allocates appropriate resources for enhancement and analytics components. We prototype RegenHance on five heterogeneous edge devices. Experiments on two analytical tasks reveal that region-based enhancement improves overall accuracy by 10-19% and achieves 2-3× throughput compared to state-of-the-art frame-based enhancement methods.
Tooth: Toward Optimal Balance of Video QoE and Redundancy Cost by Fine-Grained FEC in Cloud Gaming Streaming
Congkai An, Huanhuan Zhang, Shibo Wang, Jingyang Kang, Anfu Zhou, Liang Liu, and Huadong Ma, Beijing University of Posts and Telecommunications; Zili Meng, Hong Kong University of Science and Technology; Delei Ma, Yusheng Dong, and Xiaogang Lei, Well-Link Times Inc.
Despite the rapid rise of cloud gaming, real-world evaluations of its quality of experience (QoE) remain scarce. To fill this gap, we conduct a large-scale measurement campaign, analyzing over 60,000 sessions on an operational cloud gaming platform. We find that current cloud gaming streaming suffers from substantial bandwidth wastage and severe interaction stalls simultaneously. In-depth investigation reveals the underlying reason, i.e., existing streaming adopts coarse-grained Forward Error Correction (FEC) encoding, without considering the adverse impact of frame length variation, which results in over-protection of large frames (i.e., bandwidth waste) and under-protection of smaller ones (i.e., interaction stalls). To remedy the problem, we propose Tooth, a per-frame adaptive FEC that aims to achieve the optimal balance between satisfactory QoE and efficient bandwidth usage. To build Tooth, we design a dual-module FEC encoding strategy, which takes full consideration of both frame length variation and network dynamics, and hence determines an appropriate FEC redundancy rate for each frame. Moreover, we also circumvent the formidable per-frame FEC computational overhead by designing a lightweight Tooth, so as to meet the rigid latency bound of real-time cloud gaming. We implement, deploy, and evaluate Tooth in the operational cloud gaming system. Extensive field tests demonstrate that Tooth significantly outperforms existing state-of-the-art FEC methods, reducing stall rates by 40.2% to 85.2%, enhancing video bitrates by 11.4% to 29.2%, and lowering bandwidth costs by 54.9% to 75.0%.
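A back-of-the-envelope sketch of per-frame FEC sizing in the spirit described above: the number of parity packets is chosen per frame from its packet count and the current loss estimate, instead of applying one fixed redundancy ratio to every frame. The formula, safety margin, and packet size are illustrative, not Tooth's encoder.

```python
import math

MTU_PAYLOAD = 1200   # bytes of video payload per packet (illustrative)

def parity_packets(frame_bytes, loss_rate, margin=1.5):
    """Per-frame redundancy: expected losses on this frame's packets,
    padded by a safety margin and rounded up. A fixed-ratio FEC would
    instead apply one percentage to every frame, over-protecting large
    frames and under-protecting small ones."""
    k = math.ceil(frame_bytes / MTU_PAYLOAD)          # data packets
    r = math.ceil(k * loss_rate * margin)
    return k, max(r, 1 if loss_rate > 0 else 0)

for size in (3_000, 30_000, 120_000):                # small, medium, large frame
    k, r = parity_packets(size, loss_rate=0.02)
    print(f"frame={size}B data_pkts={k} parity_pkts={r} overhead={r/k:.1%}")
```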
AsTree: An Audio Subscription Architecture Enabling Massive-Scale Multi-Party Conferencing
Tong Meng, Wenfeng Li, Chao Yuan, Changqing Yan, and Le Zhang, ByteDance Inc.
While operating a multi-party video conferencing system (Lark) globally, we find that audio subscription alone may pose considerable challenges to the network, especially when scaling towards massive sizes. The traditional strategy of subscribing to all remote participants suffers from issues such as signaling storms and excessive bandwidth and resource consumption on both the server and client sides. Aimed at enhanced scalability, we share our design of AsTree, an audio subscription architecture. Through a cascading tree topology and media-plane-based audio selection, AsTree dramatically reduces the number of signaling messages and audio streams to forward. Practical deployment in Lark reduces audio and video stall ratios by more than 30% and 50%, respectively. We also receive 40% fewer negative client reviews, strongly demonstrating the value of AsTree.
Tuesday, April 29
8:00 am–9:00 am
Continental Breakfast
9:00 am–10:20 am
Infra For ML
AutoCCL: Automated Collective Communication Tuning for Accelerating Distributed and Parallel DNN Training
Guanbin Xu, Zhihao Le, Yinhe Chen, Zhiqi Lin, and Zewen Jin, University of Science and Technology of China; Youshan Miao, Microsoft Research; Cheng Li, University of Science and Technology of China; Anhui Province Key Laboratory of Biomedical Imaging and Intelligent Processing; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
The collective communication libraries are pivotal in optimizing the performance of distributed and parallel deep neural network (DNN) training. Most network optimizations assume that these libraries are well-tuned, ignoring their low-level parameter selection. In this paper, we present AutoCCL, a novel automated tuning method that significantly improves communication performance without incurring additional costs. One of the primary challenges we tackle is the state explosion in searching for the optimal configuration. To overcome this, we decouple implementation-related parameters from those sensitive to the search space size and propose a divide-and-conquer algorithm, minimizing the requirement for exhaustive trials. We further propose an online tuning approach that accounts for communication-computation interference to enhance accuracy in finding optimal configurations, while hiding tuning overhead within early iterations of training jobs. We implement AutoCCL atop NCCL, a leading and widely used communication library provided by NVIDIA. Our evaluation on both a 2-node cluster (16 A40 GPUs, intra-node NVLink, inter-node 2× 400Gbps InfiniBand) and a 4-node cluster (32 A40 GPUs, intra-node PCIe, inter-node 100Gbps InfiniBand) demonstrates that AutoCCL achieves 1.24-1.29× and 1.15-1.22× speedups on microbenchmarks compared to NCCL and another state-of-the-art NCCL tuner, respectively, and up to 1.80× and 1.49× with concurrent computation. End-to-end evaluations on three large language models and one vision model show 1.07-1.32× improvements in per-iteration training time.
OptiReduce: Resilient and Tail-Optimal AllReduce for Distributed Deep Learning in the Cloud
Ertza Warraich, Purdue University; Omer Shabtai and Khalid Manaa, Nvidia; Shay Vargaftik, VMware Research; Yonatan Piasetzky and Matty Kadosh, Nvidia; Lalith Suresh, Feldera; Muhammad Shahbaz, University of Michigan
We present OptiReduce, a new collective-communication system for the cloud with bounded, predictable completion times for deep-learning jobs in the presence of varying computation (stragglers) and communication (congestion and gradient drops) conditions. OptiReduce exploits the inherent resiliency and the stochastic nature of distributed deep-learning (DDL) training and fine-tuning to work with approximated (or lost) gradients, providing an efficient balance between (tail) performance and the resulting accuracy of the trained models.
Exploiting this domain-specific characteristic of DDL, OptiReduce introduces (1) mechanisms (e.g., unreliable bounded transport with adaptive timeout) to improve the DDL jobs’ tail execution time, and (2) strategies (e.g., Transpose AllReduce and Hadamard Transform) to mitigate the impact of gradient drops on model accuracy. Our evaluation shows that OptiReduce achieves 70% and 30% faster time-to-accuracy (TTA), on average, when operating in shared, cloud environments (e.g., CloudLab) compared to Gloo and NCCL, respectively.
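The Hadamard Transform mentioned above is a standard trick for spreading each gradient value across many coordinates, so that a dropped coordinate degrades every value slightly instead of erasing a few values entirely. The sketch below shows only that generic effect on a toy vector; OptiReduce's actual blocking, scaling, and integration with AllReduce are described in the paper.
    import numpy as np

    def hadamard(n):
        # Sylvester construction of an n x n Hadamard matrix (n a power of two).
        H = np.array([[1.0]])
        while H.shape[0] < n:
            H = np.block([[H, H], [H, -H]])
        return H

    rng = np.random.default_rng(0)
    g = rng.normal(size=8)           # a tiny "gradient" block
    H = hadamard(8) / np.sqrt(8)     # orthonormal, so H.T @ H = I

    y = H @ g                        # transform before transmission
    y[3] = 0.0                       # pretend one coordinate was dropped
    g_hat = H.T @ y                  # inverse transform at the receiver

    print("max per-element error:", np.max(np.abs(g - g_hat)))
Because the transform is orthonormal, the loss of one coordinate is smeared as a small error over all eight recovered values rather than wiping one value out.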
Efficient Direct-Connect Topologies for Collective Communications
Liangyu Zhao, University of Washington; Siddharth Pal, Raytheon BBN Technologies; Tapan Chugh, University of Washington; Weiyang Wang, MIT CSAIL; Jason Fantl, Prithwish Basu, and Joud Khoury, Raytheon BBN Technologies; Arvind Krishnamurthy, University of Washington
We consider the problem of distilling efficient network topologies for collective communications. We provide an algorithmic framework for constructing direct-connect topologies optimized for the latency vs. bandwidth trade-off associated with the workload. Our approach synthesizes many different topologies and communication schedules for a given cluster size and degree, then identifies the best option for a given workload. Our algorithms start from small, optimal base topologies and associated schedules, using techniques that can be iteratively applied to derive much larger topologies and schedules. Additionally, we incorporate well-studied large-scale graph topologies into our algorithmic framework by producing efficient communication schedules for them using a novel polynomial-time algorithm. Our evaluation uses multiple testbeds and large-scale simulations to demonstrate significant performance benefits from our derived topologies and schedules.
SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads
Alind Khare and Dhruv Garg, Georgia Institute of Technology; Sukrit Kalra, UC Berkeley; Snigdha Grandhi, Adobe; Ion Stoica, UC Berkeley; Alexey Tumanov, Georgia Institute of Technology
The increasing deployment of ML models on the critical path of production applications requires ML inference serving systems to serve these models under unpredictable and bursty request arrival rates. Serving many models under such conditions requires a careful balance between each application’s latency and accuracy requirements and the overall efficiency of utilization of scarce resources. Faced with this tension, state-of-the-art systems either choose a single model representing a static point in the latency-accuracy tradeoff space to serve all requests or incur latency target violations by loading specific models on the critical path of request serving. Our work instead resolves this tension through a resource-efficient serving of the entire range of models spanning the latency-accuracy tradeoff space. Our novel mechanism, SubNetAct, achieves this by carefully inserting specialized control-flow operators in pre-trained, weight-shared super-networks. These operators enable SubNetAct to dynamically route a request through the network to actuate a specific model that meets the request’s latency and accuracy target. Thus, SubNetAct can serve a vastly higher number of models than prior systems while requiring up to 2.6× lower memory. More crucially, SubNetAct’s near-instantaneous actuation of a wide range of models unlocks the design space of fine-grained, reactive scheduling policies. We design one such extremely effective policy, SlackFit, and instantiate both SubNetAct and SlackFit in a real system, SuperServe. On real-world traces derived from a Microsoft workload, SuperServe achieves 4.67% higher accuracy for the same latency targets and 2.85× higher latency target attainment for the same accuracy.
Fast Scalable Consensus
Pineapple: Unifying Multi-Paxos and Atomic Shared Registers
Tigran Bantikyan, Northwestern University; Jonathan Zarnstorff, unaffiliated; Te-Yen Chou, Carnegie Mellon University; Lewis Tseng, UMass Lowell; Roberto Palmieri, Lehigh University
Linearizable storage systems reduce the complexity of developing correct large-scale customer-facing applications, in the presence of concurrent operations and failures. A common approach for providing linearizability is to use consensus to order operations invoked by applications. This paper explores designs that offload operations (from the consensus component) to improve overall performance.
This paper presents Pineapple, which uses logical timestamps to unify Multi-Paxos and atomic shared registers so that any node in the system can serve read and write operations. Compared to Multi-Paxos (or leader-based consensus), Pineapple reduces bottlenecks at the leader. Compared to Gryff, which unifies EPaxos and atomic shared registers, Pineapple has better performance because Pineapple has “non-blocking operation execution.”
Our evaluation shows that Pineapple improves both throughput and tail latency, compared to state-of-the-art systems (e.g., Gryff, Multi-Paxos, EPaxos), in both wide-area networks and local-area networks. We also integrate Pineapple with etcd. In a balanced workload, Pineapple reduces median latency by more than 50%, compared to the original system that uses an optimized version of Raft.
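For context, the logical-timestamp idea that Pineapple builds on is the classic quorum-replicated register: values are ordered by a (timestamp, writer-id) pair, writes install a higher pair at a majority, and reads return the highest pair seen at a majority. The toy sketch below shows only that textbook pattern (with the read write-back phase omitted); it is not Pineapple's protocol or its integration with Multi-Paxos.
    from dataclasses import dataclass

    @dataclass
    class Cell:
        ts: int = 0          # logical timestamp
        wid: int = 0         # writer id, breaks ties
        val: object = None

    def newer(a, b):
        return (a.ts, a.wid) > (b.ts, b.wid)

    class Replica:
        def __init__(self):
            self.cell = Cell()
        def read_cell(self):
            return self.cell
        def write_cell(self, c):
            if newer(c, self.cell):   # keep only the newest (ts, wid) pair
                self.cell = c

    def majority(replicas):
        # Toy quorum: the first majority of replicas.
        return replicas[: len(replicas) // 2 + 1]

    def quorum_write(replicas, wid, value):
        # Phase 1: learn the highest timestamp from a majority.
        ts = max(r.read_cell().ts for r in majority(replicas)) + 1
        # Phase 2: store the new (ts, wid, value) at a majority.
        for r in majority(replicas):
            r.write_cell(Cell(ts, wid, value))

    def quorum_read(replicas):
        # Return the newest cell seen in a majority (write-back omitted).
        return max((r.read_cell() for r in majority(replicas)),
                   key=lambda c: (c.ts, c.wid)).val

    rs = [Replica() for _ in range(5)]
    quorum_write(rs, wid=1, value="x=1")
    print(quorum_read(rs))   # -> x=1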
Ladder: A Convergence-based Structured DAG Blockchain for High Throughput and Low Latency
Dengcheng Hu, Jianrong Wang, Xiulong Liu, and Hao Xu, Tianjin University; Xujing Wu, JD.com, Inc.; Muhammad Shahzad, North Carolina State University; Guyue Liu, Peking University; Keqiu Li, Tianjin University
Recent literature proposes the use of Directed Acyclic Graphs (DAG) to enhance blockchain performance. However, current block-DAG designs face three important limitations when fully utilizing parallel block processing: high computational overhead due to costly block sorting, a complex transaction confirmation process, and vulnerability to balance attacks when determining the pivot chain. To this end, we propose Ladder, a structured twin-chain DAG blockchain with a convergence mechanism that efficiently optimizes the parallel block processing strategy and enhances overall performance and security. In each round, a designated convergence node generates a lower-chain block, sorting the forked blocks from the upper chain, reducing computational overhead and simplifying transaction confirmation. To counter potential adversarial disruptions, a dynamic committee is selected to generate special blocks when faulty blocks are detected. We implemented and evaluated Ladder in a distributed network environment against several state-of-the-art methods. Our results show that Ladder achieves a 59.6% increase in throughput and a 20.9% reduction in latency.
Vegeta: Enabling Parallel Smart Contract Execution in Leaderless Blockchains
Tianjing Xu and Yongqi Zhong, Shanghai Jiao Tong University; Yiming Zhang, Shanghai Jiao Tong University and Shanghai Key Laboratory of Trusted Data Circulation, Governance and Web3; Ruofan Xiong, Xiamen University; Jingjing Zhang, Fudan University; Guangtao Xue and Shengyun Liu, Shanghai Jiao Tong University and Shanghai Key Laboratory of Trusted Data Circulation, Governance and Web3
Consensus and smart contract execution play complementary roles in blockchain systems. Leaderless consensus, as a promising direction in the blockchain context, can better utilize the resources of each node and/or avoid incurring the extra burden of timing assumptions. As modern Byzantine-Fault Tolerant (BFT) consensus protocols can order several hundred thousand transactions per second, contract execution is becoming the performance bottleneck. Adding concurrency to contract execution is a natural way to boost its performance, but none of the existing frameworks is a perfect fit for leaderless consensus.
We propose speculate-order-replay, a generic framework tailored to leaderless consensus protocols. Our framework allows each proposer to (pre-)process transactions prior to consensus, better utilizing its computing resources. We instantiate the framework with a concrete concurrency control protocol, Vegeta. Vegeta speculatively executes a series of transactions and analyzes their dependencies before consensus, and later deterministically replays the schedule. We ran experiments under a real-world Ethereum workload on 16-vCPU virtual machines. Our evaluation results show that Vegeta achieved up to 7.8× speedup compared to serial execution. When deployed on top of a leaderless consensus protocol with 10 nodes, Vegeta still achieved a 6.9× speedup.
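The speculate-order-replay pattern can be pictured as: pre-execute transactions to record their read/write sets and a schedule, ship the schedule through consensus, then have every node replay it deterministically. The sketch below illustrates that pattern in miniature; speculation is shown serially for brevity, all names are invented, and this is not Vegeta's concurrency control protocol.
    def speculate(txns, state):
        # Pre-execute transactions on a snapshot, recording read/write sets.
        snapshot = dict(state)
        schedule = []
        for tid, fn in txns:
            reads, writes = set(), {}
            def rd(k):
                reads.add(k)
                return writes.get(k, snapshot.get(k, 0))
            def wr(k, v):
                writes[k] = v
            fn(rd, wr)
            snapshot.update(writes)
            schedule.append((tid, reads, writes))
        return schedule            # shipped through consensus with the block

    def replay(schedule, state):
        # Deterministically apply the recorded writes in schedule order.
        for _tid, _reads, writes in schedule:
            state.update(writes)

    # Two toy "contracts": a transfer and a dependent computation.
    txns = [
        ("t1", lambda rd, wr: wr("a", rd("a") + 10)),
        ("t2", lambda rd, wr: wr("b", rd("a") * 2)),
    ]
    state = {"a": 5, "b": 0}
    sched = speculate(txns, state)
    replay(sched, state)
    print(state)                   # {'a': 15, 'b': 30}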
Shoal++: High Throughput DAG BFT Can Be Fast and Robust!
Balaji Arun and Zekun Li, Aptos Labs; Florian Suri-Payer, Cornell University; Sourav Das, UIUC; Alexander Spiegelman, Aptos Labs
Today's practical partially synchronous Byzantine Fault Tolerant consensus protocols trade off low latency and high throughput. On the one end, traditional BFT protocols such as PBFT and its derivatives optimize for latency. They require, in fault-free executions, only 3 message delays to commit, the optimum for BFT consensus. However, this class of protocols typically relies on a single leader, hampering throughput scalability. On the other end, a new class of so-called DAG-BFT protocols demonstrates how to achieve highly scalable throughput by separating data dissemination from consensus, and using every replica as proposer. Unfortunately, existing DAG-BFT protocols pay a steep latency premium, requiring on average 10.5 message delays to commit transactions.
This work aims to soften this tension, and proposes Shoal++, a novel DAG-based BFT consensus system that offers the throughput of DAGs while reducing end-to-end consensus commit latency to an average of 4.5 message delays. Our empirical findings are encouraging, showing that Shoal++ achieves throughput comparable to state-of-the-art DAG BFT solutions while reducing latency by up to 60%, even under less favorable network and failure conditions.
10:20 am–10:50 am
Coffee and Tea Break
10:50 am–12:10 pm
Operational Experiences
Learning Production-Optimized Congestion Control Selection for Alibaba Cloud CDN
Xuan Zeng, Alibaba Cloud; Haoran Xu, Sun Yat-sen University; Chen Chen and Xumiao Zhang, Alibaba Cloud; Xiaoxi Zhang and Xu Chen, Sun Yat-sen University; Guihai Chen, Nanjing University; Yubing Qiu, Yiping Zhang, Chong Hao, and Ennan Zhai, Alibaba Cloud
Today's content delivery networks (CDNs) typically use static congestion control (CC) configurations, yet the diverse network environments preclude a universally optimal CC for all geographical regions, as evidenced by our extensive measurements. Current CC algorithms, limited by narrow applicability or high maintenance costs, struggle in large-scale CDNs. This work introduces AliCCS, the first CC Selection (CCS) approach tailored for production CDNs, integrating fine-grained domain knowledge for learning to choose the best CC from existing, well-established ones. Through an over-one-year real-world deployment in Alibaba Cloud CDN, AliCCS has enhanced the Quality-of-Experience (QoE) by up to 9.31%, surpassing the competitive margin in the CDN market, and significantly reduced the retransmission rate by 25.51% to 174.36% across all provinces of China, leading to cost savings of over 10 million US dollars. We also share key insights and experiences from deploying AliCCS at scale, highlighting traffic patterns in Alibaba Cloud CDN.
GPU-Disaggregated Serving for Deep Learning Recommendation Models at Scale
Lingyun Yang, Hong Kong University of Science and Technology; Yongchen Wang and Yinghao Yu, Alibaba Group; Qizhen Weng, Hong Kong University of Science and Technology; Jianbo Dong, Kan Liu, Chi Zhang, Yanyi Zi, Hao Li, Zechao Zhang, Nan Wang, Yu Dong, Menglei Zheng, Lanlan Xi, Xiaowei Lu, Liang Ye, Guodong Yang, Binzhang Fu, Tao Lan, Liping Zhang, and Lin Qu, Alibaba Group; Wei Wang, Hong Kong University of Science and Technology
Online recommender systems use deep learning recommendation models (DLRMs) to provide accurate, personalized recommendations to improve customer experience. However, efficiently provisioning DLRM services at scale is challenging. DLRMs exhibit distinct resource usage patterns: they require a large number of CPU cores and a tremendous amount of memory, but only a small number of GPUs. Running them in multi-GPU servers quickly exhausts the servers' CPU and memory resources, leaving a large number of unallocated GPUs stranded and unable to be utilized by other tasks.
This paper describes Prism, a production DLRM serving system that eliminates GPU fragmentation by means of resource disaggregation. In Prism, a fleet of CPU nodes (CNs) interconnect with a cluster of heterogeneous GPU nodes (HNs) through RDMA, leading to two disaggregated resource pools that can independently scale. Prism automatically divides DLRMs into CPU- and GPU-intensive subgraphs and schedules them on CNs and HNs for disaggregated serving. Prism employs various techniques to minimize the latency overhead caused by disaggregation, including optimal graph partitioning, topology-aware resource management, and SLO-aware communication scheduling. Evaluations show that Prism effectively reduces CPU and GPU fragmentation by 53% and 27% in a crowded GPU cluster. During seasonal promotion events, it efficiently enables capacity loaning from training clusters, saving over 90% of GPUs. Prism has been deployed in production clusters for over two years and now runs on over 10k GPUs.
Evolution of Aegis: Fault Diagnosis for AI Model Training Service in Production
Jianbo Dong, Kun Qian, Pengcheng Zhang, Zhilong Zheng, Liang Chen, Fei Feng, Yichi Xu, Yikai Zhu, Gang Lu, Xue Li, Zhihui Ren, Zhicheng Wang, Bin Luo, Peng Zhang, Yang Liu, Yanqing Chen, Yu Guan, Weicheng Wang, Chaojie Yang, Yang Zhang, Man Yuan, Hanyu Zhao, Yong Li, Zihan Zhao, Shan Li, Xianlong Zeng, Zhiping Yao, Binzhang Fu, Ennan Zhai, Wei Lin, Chao Wang, and Dennis Cai, Alibaba Cloud
Despite the success of diagnosis systems in traditional cloud computing, these systems are not suitable for pinpointing faults in AI model training cloud scenarios due to the differences in computing paradigms between traditional cloud computing and model training. As one of the largest cloud providers, we present Aegis, a fault diagnosis system specifically designed for AI model training services. We share our experience in the motivation, design, and evolution of Aegis. Keeping ease of deployment as the primary principle, Aegis Phase-1 started by enhancing existing general-purpose diagnosis systems. After several months of evolution, Aegis Phase-2 deliberately chose to customize the collective communication library for sophisticated failure localization at runtime without modifying customer code. Beyond failure localization, we further equipped Aegis with the capability to handle performance degradation and to check for failures before delivery. Aegis has been deployed in our production training cloud service for one year. Aegis decreases the idle time wasted on diagnosis by more than 97%, the training task restart count by 84%, and performance degradation by 71%.
PAPAYA Federated Analytics Stack: Engineering Privacy, Scalability and Practicality
Harish Srinivas, Graham Cormode, Mehrdad Honarkhah, Samuel Lurye, Jonathan Hehir, Lunwen He, George Hong, Ahmed Magdy, Dzmitry Huba, Kaikai Wang, Shen Guo, and Shoubhik Bhattacharya, Meta
Cross-device Federated Analytics (FA) is a distributed computation paradigm designed to answer analytics queries about and derive insights from data held locally on users’ devices. On-device computations combined with other privacy and security measures ensure that only minimal data is transmitted off-device, achieving a high standard of data protection. Despite FA’s broad adoption, the applicability of existing FA systems is limited by compromised accuracy; lack of flexibility for data analytics; and an inability to scale effectively. In this paper, we describe our approach to combine privacy, scalability, and practicality to build a system that overcomes these limitations. The PAPAYA system at Meta leverages trusted execution environments (TEEs) and optimizes the use of on-device computing resources to facilitate federated data processing across large fleets of devices, while ensuring robust, defensible, and verifiable privacy safeguards. We focus on federated analytics (statistics and monitoring), in contrast to systems for federated learning (ML workloads), and we flag the key differences.
Middleboxes
HA/TCP: A Reliable and Scalable Framework for TCP Network Functions
Haoyu Gu, Ali José Mashtizadeh, and Bernard Wong, University of Waterloo
Layer 7 network functions (NFs) are a critical piece of modern network infrastructure. As a result, the scalability and reliability of these NFs are important but challenging because of the complexity of layer 7 NFs. This paper presents HA/TCP, a framework that enables migration and failover of layer 7 NFs. HA/TCP uses a novel replication mechanism to synchronize the state between replicas with low overhead, enabling seamless migration and failover of TCP connections. HA/TCP encapsulates the implementation details into our replicated socket interface to allow developers to easily add high availability to their layer 7 NFs such as WAN accelerators, load balancers, and proxies. Our benchmarks show that HA/TCP provides reliability for a 100 Gbps NF with as little as 0.2% decrease in client throughput. HA/TCP transparently migrates a connection between replicas in 38 µs, including the network latency. We provide reliability to a SOCKS proxy and a WAN accelerator with less than 2% decrease in throughput and a modest increase in CPU usage.
High-level Programming for Application Networks
Xiangfeng Zhu, Yuyao Wang, Banruo Liu, Yongtong Wu, and Nikola Bojanic, University of Washington; Jingrong Chen, Duke University; Gilbert Louis Bernstein and Arvind Krishnamurthy, University of Washington; Sam Kumar, University of Washington and UCLA; Ratul Mahajan, University of Washington; Danyang Zhuo, Duke University
Application networks facilitate communication between the microservices of cloud applications. They are built today using service meshes with low-level specifications that make it difficult to express application-specific functionality (e.g., access control based on RPC fields), and they can more than double the RPC latency. We develop AppNet, a framework that makes it easy to build expressive and high-performance application networks. Developers specify rich RPC processing in a high-level language with generalized match-action rules and built-in state management. We compile the specifications to high-performance code after optimizing where (e.g., client, server) and how (e.g., RPC library, proxy) each RPC processing element runs. The optimization uses symbolic abstraction and execution to judge if different runtime configurations of possibly-stateful RPC processing elements are semantically equivalent for arbitrary RPC streams. Our experiments show that AppNet can express common application network functions in only 7-28 lines of code. Its optimizations lower RPC processing latency by up to 82%.
State-Compute Replication: Parallelizing High-Speed Stateful Packet Processing
Qiongwen Xu, Rutgers University; Sebastiano Miano, Politecnico di Milano; Xiangyu Gao and Tao Wang, New York University; Adithya Murugadass and Songyuan Zhang, Rutgers University; Anirudh Sivaraman, New York University; Gianni Antichi, Queen Mary University of London and Politecnico di Milano; Srinivas Narayana, Rutgers University
With the slowdown of Moore’s law, CPU-oriented packet processing in software will be significantly outpaced by emerging line speeds of network interface cards (NICs). Single-core packet-processing throughput has saturated.
We consider the problem of high-speed packet processing with multiple CPU cores. The key challenge is state—memory that multiple packets must read and update. The prevailing method to scale throughput with multiple cores involves state sharding, processing all packets that update the same state, e.g., flow, at the same core. However, given the skewed nature of realistic flow size distributions, this method is untenable, since total throughput is limited by single-core performance.
This paper introduces state-compute replication, a principle to scale the throughput of a single stateful flow across multiple cores using replication. Our design leverages a packet history sequencer running on a NIC or top-of-the-rack switch to enable multiple cores to update state without explicit synchronization. Our experiments with realistic data center and wide-area Internet traces show that state-compute replication can scale total packet-processing throughput linearly with cores, independent of flow size distributions, across a range of realistic packet-processing programs.
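The following toy sketch illustrates the state-compute replication idea in the abstract: a sequencer attaches, to each packet, the packets a core has missed for that flow, so every core can rebuild the flow state locally without locks. The names, the round-robin spraying, and the byte-counter state are invented for illustration; the paper's sequencer runs on a NIC or top-of-the-rack switch.
    from collections import defaultdict

    class Sequencer:
        # Toy stand-in for the packet-history sequencer: it remembers, per
        # flow and per core, the packets that core has not yet seen, and
        # attaches that history to the next packet it sends to the core.
        def __init__(self, n_cores):
            self.n_cores = n_cores
            self.next_core = 0
            self.pending = defaultdict(lambda: [[] for _ in range(n_cores)])
        def dispatch(self, flow, pkt):
            core = self.next_core                  # spray packets across cores
            self.next_core = (core + 1) % self.n_cores
            history, self.pending[flow][core] = self.pending[flow][core], []
            for c in range(self.n_cores):          # other cores still owe this pkt
                if c != core:
                    self.pending[flow][c].append(pkt)
            return core, history, pkt

    class Core:
        # Each core keeps a private replica of per-flow state (a byte counter).
        def __init__(self):
            self.flow_bytes = defaultdict(int)
        def process(self, flow, history, pkt):
            for old in history:                    # replay missed updates first
                self.flow_bytes[flow] += old
            self.flow_bytes[flow] += pkt
            return self.flow_bytes[flow]

    seq, cores = Sequencer(n_cores=4), [Core() for _ in range(4)]
    for size in (100, 200, 300, 400, 500):
        c, hist, pkt = seq.dispatch("flowA", size)
        total = cores[c].process("flowA", hist, pkt)
    print("flowA bytes at the last core:", total)  # 1500, with no cross-core locks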
MTP: Transport for In-Network Computing
Tao Ji, UT Austin; Rohan Vardekar and Balajee Vamanan, University of Illinois Chicago; Brent E. Stephens, Google and University of Utah; Aditya Akella, UT Austin
In-network computing (INC) is being increasingly adopted to accelerate applications by offloading part of the applications’ computation to network devices. Such application-specific (L7) offloads have several attributes that the transport protocol must work with: they may mutate, intercept, reorder, and delay application messages that span multiple packets. At the same time, the transport must also work with the buffering and computation constraints of network devices hosting the L7 offloads. Existing transports and alternative approaches fall short in these regards. Therefore, we present MTP, the first transport to natively support INC. MTP is built around two major components: 1) a novel message-oriented reliability protocol and 2) a resource-specific congestion control framework. We implement a full-fledged prototype of MTP based on DPDK. We show the efficacy of MTP in a testbed with a real INC application as well as with comprehensive microbenchmarks and large-scale simulations.
12:10 pm–2:00 pm
Lunch (on your own)
2:00 pm–3:20 pm
Rethinking Data Center Efficiency
ONCache: A Cache-Based Low-Overhead Container Overlay Network
Shengkai Lin, Shizhen Zhao, Peirui Cao, and Xinchi Han, Shanghai Jiao Tong University; Quan Tian, Wenfeng Liu, Qi Wu, and Donghai Han, Broadcom; Xinbing Wang, Shanghai Jiao Tong University
Recent years have witnessed widespread adoption of containers. While containers simplify and accelerate application development, existing container network technologies either incur significant overhead, which hurts performance for distributed applications, or lose flexibility or compatibility, which hinders widespread deployment in production.
We carefully analyze the kernel data path of an overlay network, quantifying the time consumed by each segment of the data path and identifying the extra overhead in an overlay network compared to bare metal. We observe that this extra overhead generates repetitive results among packets, which inspires us to introduce caches within an overlay network.
We design and implement ONCache (Overlay Network Cache), a cache-based container overlay network, to eliminate the extra overhead while maintaining flexibility and compatibility. We implement ONCache using the extended Berkeley Packet Filter (eBPF) with only 524 lines of code, and integrate it as a plugin of Antrea. With ONCache, containers attain networking performance akin to that of bare metal. Compared to the standard overlay networks, ONCache improves throughput and request-response transaction rate by 12% and 36% for TCP (20% and 34% for UDP), respectively, while significantly reducing per-packet CPU overhead. Popular distributed applications also benefit from ONCache.
GREEN: Carbon-efficient Resource Scheduling for Machine Learning Clusters
Kaiqiang Xu and Decang Sun, iSING Lab, Hong Kong University of Science and Technology; Han Tian, USTC; Junxue Zhang and Kai Chen, iSING Lab, Hong Kong University of Science and Technology
This paper explores the problem of scheduling machine learning (ML) jobs while also taking into account the reduction of carbon emissions in the cluster. Traditional cluster schedulers for ML jobs mainly focus on optimizing job completion time (JCT), but do not consider the environmental impact of their decisions, resulting in a suboptimal carbon footprint. To address this issue, we propose GREEN, an ML cluster scheduler that is both time-efficient and carbon-efficient. At its core, GREEN uses a unique carbon-aware scheduling algorithm that reduces the carbon footprint with minimal impact on JCT. Additionally, it leverages the temporal flexibility of ML jobs to reduce carbon emissions by shifting workloads to less carbon-intensive times, while still maintaining overall daily capacity. Our experiments using real ML job workloads demonstrate that GREEN achieves up to a 41.2% reduction in cluster-wide carbon footprint and a 12% reduction in peak power consumption, while incurring a 3.6%-5.9% time-efficiency tradeoff compared to existing methods.
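The temporal-shifting idea can be illustrated with a toy greedy plan that pushes deferrable GPU-hours into the cleanest hours of the day, subject to an hourly capacity cap. The numbers and the greedy rule below are invented for illustration and are not GREEN's scheduling algorithm.
    # Hourly grid carbon intensity (gCO2/kWh), lowest around midday (solar-heavy).
    intensity = [620, 600, 590, 580, 560, 540, 500, 450, 400, 360, 330, 320,
                 310, 320, 340, 380, 440, 520, 600, 650, 670, 660, 650, 640]
    capacity = 100            # GPU-hours the cluster can absorb per hour
    work = 900                # deferrable GPU-hours due by end of day

    # Greedily place work into the cleanest hours first, up to capacity.
    plan = [0] * 24
    for hour in sorted(range(24), key=lambda h: intensity[h]):
        if work == 0:
            break
        plan[hour] = min(capacity, work)
        work -= plan[hour]

    shifted = sum(plan[h] * intensity[h] for h in range(24))
    baseline = sum(capacity * intensity[h] for h in range(9))  # run-ASAP fills hours 0-8
    print(f"carbon: shifted {shifted/1000:.0f} kgCO2 vs run-ASAP {baseline/1000:.0f} kgCO2")
In this toy profile, shifting the same 900 GPU-hours toward the cleaner midday hours cuts emissions by roughly a third while finishing within the same day.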
The Benefits and Limitations of User Interrupts for Preemptive Userspace Scheduling
Linsong Guo, Danial Zuberi, Tal Garfinkel, and Amy Ousterhout, UC San Diego
Preemptive scheduling promises to mitigate head-of-line blocking and enable flexible scheduling while retaining a simple programming model. Despite this, preemption is underutilized in server-side software today. Instead, high-performance datacenter systems and language runtimes often rely on cooperative concurrency, or else use preemption only at very coarse timescales, limiting its effectiveness. A key reason that preemption is underutilized today is that existing preemption mechanisms have high and unpredictable overheads.
Intel recently introduced support for user interrupts, a new feature that offers an opportunity to change this. By enabling interrupts to be sent and received entirely in user space, user interrupts can significantly lower the overhead of preemption. In this paper, we shed light on how user interrupts impact the landscape of preemption mechanisms. We build two user-level schedulers that leverage user interrupts for low-overhead preemption. We find that user interrupts are not a panacea. For example, they provide limited benefits when other software layers constrain the kinds of scheduling policies that can be used. Still, user interrupts can match or exceed the performance of existing mechanisms for all but the highest preemption rates, while achieving much more consistent overheads and retaining a user-friendly programming model.
Securing Public Cloud Networks with Efficient Role-based Micro-Segmentation
Sathiya Kumaran Mani and Kevin Hsieh, Microsoft; Santiago Segarra, Rice University; Ranveer Chandra, Microsoft; Yajie Zhou, University of Maryland; Srikanth Kandula, Microsoft
Securing network traffic within data centers is a critical and daunting challenge due to the increasing complexity and scale of modern public clouds. Micro-segmentation offers a promising solution by implementing fine-grained, workload-specific network security policies to mitigate potential attacks. However, the dynamic nature and large scale of deployments present significant obstacles in crafting precise security policies, limiting the practicality of this approach. To address these challenges, we introduce a novel system that efficiently processes vast volumes of network-flow logs and effectively infers the roles of network endpoints. Our method integrates domain knowledge and communication patterns in a principled manner, facilitating the creation of micro-segmentation policies at a large scale. Evaluations with real-world deployment demonstrate that our solution significantly surpasses existing algorithms in role inference accuracy. We implement our solution as an end-to-end system, and we show that our system is up to 21.5× more cost-efficient than Apache Flink, a widely used open-source stream processing system.
RDMA
Mitigating Scalability Walls of RDMA-based Container Networks
Wei Liu, Tsinghua University and Alibaba Cloud; Kun Qian, Alibaba Cloud; Zhenhua Li, Tsinghua University; Feng Qian, University of Southern California; Tianyin Xu, UIUC; Yunhao Liu, Tsinghua University; Yu Guan, Shuhong Zhu, Hongfei Xu, Lanlan Xi, Chao Qin, and Ennan Zhai, Alibaba Cloud
As a state-of-the-art technique, RDMA-offloaded container networks (RCNs) can provide high-performance data communications among containers. Nevertheless, this seems to be subject to the RCN scale—when there are millions of containers simultaneously running in a data center, the performance decreases sharply and unexpectedly. In particular, we observe that most performance issues are related to RDMA NICs (RNICs), whose design and implementation defects might constitute the "scalability wall" of the RCN. To validate the conjecture, however, we are challenged by the limited visibility into the internals of today's RNICs. To address the dilemma, a more pragmatic approach is to infer the most likely causes of the performance issues according to the common abstractions of an RNIC's components and functionalities.
Specifically, we conduct combinatorial causal testing to efficiently reason about an RNIC's architecture model, effectively approximate its performance model, and thereby proactively optimize the NF (network function) offloading schedule. We embody the design into a practical system dubbed ScalaCN. Evaluation on production workloads shows that the end-to-end network bandwidth increases by 1.4× and the packet forwarding latency decreases by 31%, after resolving 82% of the causes inferred by ScalaCN. We report the performance issues of RNICs and the most likely causes to relevant vendors, all of which have been encouragingly confirmed; we are now closely working with the vendors to fix them.
Eden: Developer-Friendly Application-Integrated Far Memory
Anil Yelam, Stewart Grant, and Saarth Deshpande, UC San Diego; Nadav Amit, Technion, Israel Institute of Technology; Radhika Niranjan Mysore, VMware Research Group; Amy Ousterhout, UC San Diego; Marcos K. Aguilera, VMware Research Group; Alex C. Snoeren, UC San Diego
Far memory systems are a promising way to address the resource stranding problem in datacenters. Far memory systems broadly fall into two categories. On one hand, paging-based systems use hardware guards at the granularity of pages to intercept remote accesses, which require no application changes but incur significant faulting overhead. On the other hand, app-integrated systems use software guards on data objects and apply application-specific optimizations to avoid faulting overheads, but these systems require significant application redesign and/or suffer from overhead on local accesses. We propose Eden, a new approach to far memory that combines hardware guards with a small number of software guards in the form of programmer annotations, to achieve performance similar to app-integrated systems with minimal developer effort. Eden is based on the insight that applications generate most of their page faults at a small number of code locations, and those locations are easy to find programmatically. By adding hints to such locations, Eden can notify the pager about upcoming memory accesses to customize read-ahead and memory reclamation. We show that Eden achieves 19.4–178% higher performance than Fastswap for memory-intensive applications including DataFrame and memcached. Eden achieves performance comparable to AIFM with almost 100× fewer code changes.
Achieving Wire-Latency Storage Systems by Exploiting Hardware ACKs
Qing Wang, Jiwu Shu, Jing Wang, and Yuhao Zhang, Tsinghua University
We present Juneberry, a low-latency communication framework for storage systems. Different from existing RPC frameworks, Juneberry provides a fast path for storage requests: they can be committed with a single round trip and server CPU bypass, thus delivering extremely low latency; the execution of these requests is performed asynchronously on the server CPU. Juneberry achieves it by relying on our proposed Ordered Queue abstraction, which exploits NICs’ hardware ACKs as commit signals of requests while ensuring linearizability of the whole system. Juneberry also supports durability by placing requests in persistent memory (PM). We implement Juneberry using commodity RDMA NICs and integrate it into two storage systems: Memcached (a widely used in-memory caching system) and PMemKV (a PM-based persistent key-value store). Evaluation shows that compared with RPC, Juneberry can significantly lower their latency under write-intensive workloads.
ODRP: On-Demand Remote Paging with Programmable RDMA
Zixuan Wang, Xingda Wei, Jinyu Gu, Hongrui Xie, Rong Chen, and Haibo Chen, Institute of Parallel and Distributed Systems, SEIEE, Shanghai Jiao Tong University
Memory disaggregation with OS swapping is becoming popular for next-generation datacenters. RDMA is a promising technique for achieving this. However, RDMA does not support dynamic memory management in the data path. Current systems rely on RDMA’s control path operations, which are designed for coarse-grained memory management. This results in a trade-off between performance and memory utilization and also requires significant CPU usage, which is a limited resource on memory nodes.
This paper introduces On-Demand Remote Paging (ODRP), the first system that smartly chains native RDMA data path primitives to offload all memory access and management operations onto the RDMA-capable NIC (RNIC). However, efficiently implementing these operations is challenging due to the limited capability of the RNIC. ODRP leverages the semantics of OS swapping and adopts a client-assisted principle to address the efficiency and functionality challenges. Compared to the state-of-the-art system, ODRP achieves significantly better memory utilization and no CPU usage, while introducing only a 0.8% to 14.6% performance overhead in real-world applications.
3:20 pm–3:50 pm
Coffee and Tea Break
3:50 pm–5:30 pm
Storage
Understanding and Profiling NVMe-over-TCP Using ntprof
Yuyuan Kang and Ming Liu, University of Wisconsin-Madison
NVMe-over-TCP (NVMe/TCP) is an emerging remote storage protocol, increasingly adopted in enterprises and clouds. It establishes a high-performance reliable data channel between clients and storage targets to deliver block I/Os. Understanding and analyzing the protocol execution details and how well storage workloads run atop are pivotal for system developers and infrastructure engineers. However, our community lacks such a profiling utility, whereas existing solutions are ad-hoc, tedious, and heuristic-driven. Realizing it is challenging due to the unpredictable I/O workload profile, intricate system layer interaction, and deep execution pipeline.
This paper presents ntprof, a systematic, informative, and lightweight NVMe/TCP profiler. Our key idea is to view the NVMe/TCP storage substrate as a lossless switched network and apply network monitoring techniques. We model each on-path system module as a software switch, equip it with a programmable profiling agent on the data plane, and develop a proactive query interface for statistics collection and analysis. ntprof, comprising a kernel module and a user-space utility, allows developers to define various profiling tasks, incurs marginal overhead when co-located with applications, and generates performance reports based on prescribed specifications. We build ntprof atop Linux kernel 5.15.143 and apply it in six cases, including end-to-end latency breakdown, interference analysis, SW/HW bottleneck localization, and application performance diagnosis. ntprof is available at https://github.com/netlab-wisconsin/ntprof.
Building an Elastic Block Storage over EBOFs Using Shadow Views
Sheng Jiang, Carnegie Mellon University; Ming Liu, University of Wisconsin-Madison
The EBOF (Ethernet-Bunch-Of-Flash) has emerged as an enticing and promising disaggregated storage platform due to its streamlined I/O processing, high scalability, and substantial energy/cost-efficiency improvement. An EBOF applies a smart-sender dumb-receiver design philosophy and provides backward-compatible storage volumes to expedite system deployment. Yet, the static and opaque internal I/O processing pipeline lacks resource allocation, I/O scheduling, and traffic orchestration capabilities, entailing bandwidth waste, workload non-adaptiveness, and performance interference.
This paper presents the design and implementation of a distributed telemetry system (called shadow view) to tackle the above challenges and facilitate the effective use of an EBOF. We model an EBOF as a two-layer multi-switch architecture and develop a view development protocol to construct the EBOF running snapshot and expose internal execution statistics at runtime. Our design is motivated by the observation that fast data center networks make the overheads of inter-server communication and synchronization negligible. We demonstrate the effectiveness of shadow view by building a block storage (dubbed Flint) atop EBOFs. The enhanced I/O data plane allows us to develop three new techniques: an elastic volume manager, a view-enabled bandwidth auction mechanism, and an eIO scheduler. Our evaluations using the Fungible FS1600 EBOF show that a Flint volume achieves 9.3/9.2 GB/s read/write bandwidth with no latency degradation, significantly outperforming the de facto EBOF volume. It achieves up to 2.9× throughput improvements when running an object store. Flint is tenant-aware and remote-target-aware, delivering efficient multi-tenancy and workload adaptiveness.
Pushing the Limits of In-Network Caching for Key-Value Stores
Gyuyeong Kim, Sungshin Women's University
We present OrbitCache, a new in-network caching architecture that can cache variable-length items to balance a wide range of key-value workloads. Unlike existing works, OrbitCache does not cache hot items in the switch memory. Instead, we make hot items revisit the switch data plane continuously by exploiting packet recirculation. Our approach keeps cached key-value pairs in the switch data plane while freeing them from item size limitations caused by hardware constraints. We implement an OrbitCache prototype on an Intel Tofino switch. Our experimental results show that OrbitCache can balance highly skewed workloads and is robust to various system conditions.
Cellular and Wireless
CellReplay: Towards accurate record-and-replay for cellular networks
William Sentosa, University of Illinois Urbana-Champaign; Balakrishnan Chandrasekaran, VU Amsterdam; P. Brighten Godfrey, University of Illinois Urbana-Champaign and Broadcom; Haitham Hassanieh, EPFL
The inherent variability of real-world cellular networks makes it hard to evaluate, reproduce, and debug the performance of networked applications running on these networks. A common approach is to record and replay a trace of observed cellular network performance. However, we show that the state-of-the-art record-and-replay technique produces empirically inaccurate results that can cause evaluation bias. This paper presents the design and implementation of CellReplay, a tool that records the time-varying performance of a live cellular network into traces using preset workloads and faithfully replays the observed performance for other workloads through an emulated network interface. The key challenge in achieving high accuracy is to replay varying network behavior in a way that captures its sensitivity to the workload. CellReplay records network behavior under two predefined workloads simultaneously and interpolates upon replay for other workloads. Across various challenging network conditions, our evaluation shows that real-world networked applications (e.g., web browsing or video streaming) running on CellReplay achieve similar performance (e.g., page load time or bitrate selection) to their live network counterparts, with significantly reduced error compared to the prior method.
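The workload-sensitive replay described above can be pictured with a toy interpolation: record the network under a light and a heavy preset workload, then estimate its behavior for an intermediate workload by blending the two recordings. The sketch below is purely illustrative; CellReplay's recorded quantities and interpolation are more sophisticated.
    def interpolate_trace(light, heavy, light_load, heavy_load, target_load):
        # Linear interpolation between two recorded per-interval throughput
        # traces, keyed by how the target workload sits between the presets.
        w = (target_load - light_load) / (heavy_load - light_load)
        w = max(0.0, min(1.0, w))                 # clamp to the recorded range
        return [l + w * (h - l) for l, h in zip(light, heavy)]

    # Recorded achievable throughput (Mbps) per 100 ms interval for two presets.
    light_trace = [42, 40, 45, 38, 41, 44]        # recorded under a 5 Mbps workload
    heavy_trace = [30, 26, 33, 22, 28, 31]        # recorded under a 40 Mbps workload

    replayed = interpolate_trace(light_trace, heavy_trace,
                                 light_load=5, heavy_load=40, target_load=20)
    print([round(x, 1) for x in replayed])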
Large Network UWB Localization: Algorithms and Implementation
Nakul Garg and Irtaza Shahid, University of Maryland, College Park; Ramanujan K Sheshadri, Nokia Bell Labs; Karthikeyan Sundaresan, Georgia Institute of Technology; Nirupam Roy, University of Maryland, College Park
Localization of networked nodes is an essential problem in emerging applications, including first-responder navigation, automated manufacturing lines, vehicular and drone navigation, asset tracking, the Internet of Things, and 5G communication networks. In this paper, we present Locate3D, a novel system for peer-to-peer node localization and orientation estimation in large networks. Unlike traditional range-only methods, Locate3D introduces angle-of-arrival (AoA) data as an added network topology constraint. The system addresses three key challenges: it uses angles to reduce the number of measurements required by 4×, it jointly uses range and angle data for location estimation, and it develops a spanning-tree approach for fast location updates that ensures the output graphs are rigid and uniquely realizable, even in occluded or weakly connected areas. Locate3D cuts down latency by up to 75% without compromising accuracy, surpassing standard range-only solutions. It has a 0.86-meter median localization error for building-scale multi-floor networks (32 nodes, 0 anchors) and 12.09 meters for large-scale networks (100,000 nodes, 15 anchors).
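Why do angles reduce the number of required measurements? One range plus one angle-of-arrival pins a neighbor to a single point, whereas a range alone only constrains it to a circle. The 2D toy example below shows that geometry; it is not Locate3D's estimator.
    import math

    def locate_from_anchor(anchor_xy, anchor_heading, rng, aoa):
        # Place a peer given one range (meters) and one angle-of-arrival
        # (radians, measured from the anchor's heading). With range alone the
        # peer could be anywhere on a circle; the angle collapses it to a point.
        ax, ay = anchor_xy
        theta = anchor_heading + aoa
        return ax + rng * math.cos(theta), ay + rng * math.sin(theta)

    anchor = (0.0, 0.0)
    true_peer = (3.0, 4.0)
    rng = math.hypot(3.0, 4.0)                        # 5 m
    aoa = math.atan2(4.0, 3.0)                        # anchor heading assumed 0 rad
    print(locate_from_anchor(anchor, 0.0, rng, aoa))  # ~(3.0, 4.0)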
Towards Energy Efficient 5G vRAN Servers
Anuj Kalia, Microsoft; Nikita Lazarev, MIT; Leyang Xue, The University of Edinburgh; Xenofon Foukas and Bozidar Radunovic, Microsoft; Francis Y. Yan, Microsoft Research and UIUC
We study the problem of improving energy efficiency in virtualized radio access network (vRAN) servers, focusing on CPUs. Two distinct characteristics of vRAN software—strict real-time sub-millisecond deadlines and its proprietary black-box nature—preclude the use of existing general-purpose CPU energy management techniques. This paper presents RENC, a system that saves energy by adjusting CPU frequency in response to sub-second variations in cellular workloads, using the following techniques. First, despite large fluctuations in vRAN CPU load at sub-ms timescales, RENC establishes safe low-load intervals, e.g., by coupling Media Access Control (MAC) layer rate limiting with CPU frequency changes. This prevents high traffic during low-power operation, which would otherwise cause deadline misses. Second, we design techniques to compute CPU frequencies that are safe for these low-load intervals, achieved by measuring the slack in vRAN threads' deadlines using Linux eBPF hooks, or minor binary rewriting of the vRAN software. Third, we demonstrate the need to handle CPU load spikes triggered by control operations, such as new users attaching to the network. Our evaluation in a state-of-the-art vRAN testbed shows that our techniques reduce a vRAN server's CPU power consumption by up to 45% (29% server-wide).
Building Massive MIMO Baseband Processing on a Single-Node Supercomputer
Xincheng Xie, Wentao Hou, Zerui Guo, and Ming Liu, University of Wisconsin-Madison
The rising deployment of massive MIMO coupled with the wide adoption of virtualized radio access networks (vRAN) poses an unprecedented computational demand on the baseband processing, hardly met by existing vRAN hardware substrates. The single-node supercomputer, an emerging computing platform, offers scalable computation and communication capabilities, making it a promising target to hold and run the baseband pipeline. However, realizing this is non-trivial due to the mismatch between (a) the diverse execution granularities and incongruent parallel degrees of different stages along the software processing pipeline and (b) the underlying evolving irregular hardware parallelism at runtime.
This paper closes the gap by designing and implementing MegaStation–an application-platform co-designed system that effectively harnesses the computing power of a single-node supercomputer for processing massive MIMO baseband. Our key insight is that one can adjust the execution granularity and reconstruct the baseband processing pipeline on the fly based on the monitored hardware parallelism status. Inspired by dynamic instruction scheduling, MegaStation models the single-node supercomputer as a tightly coupled microprocessor and employs a scoreboarding-like algorithm to orchestrate "baseband processing" instructions over GPU-instantiated executors. Our evaluations using the GigaIO FabreX demonstrate that MegaStation achieves up to 66.2% lower tail frame processing latency and 4× higher throughput than state-of-the-art solutions. MegaStation is a scalable and adaptive solution that can meet today’s vRAN requirements.
Efficient Multi-WAN Transport for 5G with OTTER
Mary Hogan, Oberlin College; Gerry Wan, Google; Yiming Qiu, University of Michigan; Sharad Agarwal and Ryan Beckett, Microsoft; Rachee Singh, Cornell University; Paramvir Bahl, Microsoft
In the ongoing cloudification of 5G, software network functions (NFs) are replacing fixed-function network hardware, allowing 5G network operators to leverage the benefits of cloud computing. The migration of NFs and their management to the cloud causes 5G traffic to traverse an operator’s wide-area network (WAN) to the cloud WAN that hosts the datacenters (DCs) running 5G NFs and applications. However, achieving end-to-end (E2E) performance for 5G traffic across two WANs is hard. Placing 5G flows across two WANs with different performance and reliability characteristics, edge and DC resource constraints, and interference from other flows is different and more challenging than single-WAN traffic engineering. We address this challenge and show that orchestrating E2E paths across a multi-WAN overlay allows us to achieve, on average, 13% more throughput, 15% less RTT, 45% less jitter, or reduce average loss from 0.06% to under 0.001%. We implement our multi-WAN 5G flow placement in a scalable optimization prototype that allocates 26%–45% more bytes on the network than greedy baselines while also satisfying the service demands of more flows.
6:00 pm–7:30 pm
NSDI '25 Poster Session and Reception
Wednesday, April 30
8:00 am–9:00 am
Continental Breakfast
9:00 am–10:20 am
Verification 2
Verifying maximum link loads in a changing world
Tibor Schneider, ETH Zürich; Stefano Vissicchio, University College London; Laurent Vanbever, ETH Zürich
To meet ever more stringent requirements, network operators often need to reason about worst-case link loads. Doing so involves analyzing traffic forwarding after failures and BGP route changes. State-of-the-art systems identify failure scenarios causing congestion, but they ignore route changes.
We present Velo, the first verification system that efficiently finds maximum link loads under failures and route changes. The key building block of Velo is its ability to massively reduce the gigantic space of possible route changes thanks to (i) a router-based abstraction for route changes, (ii) a theoretical characterization of scenarios leading to worst-case link loads, and (iii) an approximation of input traffic matrices. We fully implement and extensively evaluate Velo. Velo takes only a few minutes to accurately compute all worst-case link loads in large ISP networks. It thus provides operators with critical support to robustify network configurations, improve network management, and make business decisions.
A Layered Formal Methods Approach to Answering Queue-related Queries
Divya Raghunathan, Maria Apostolaki, and Aarti Gupta, Princeton University
Queue dynamics introduce significant uncertainty in network management tasks such as debugging, performance monitoring, and analysis. Despite numerous queue-monitoring techniques, many networks today continue to collect only per-port packet counts (e.g., using SNMP). Although queue lengths are correlated with packet counts, deriving the precise correlation between them is very challenging since packet counts do not specify many quantities (e.g., packet arrival order) which affect queue lengths.
This paper presents QuASI, a system that can answer many queue-related queries using only coarse-grained per-port packet counts. QuASI checks whether there exists a packet trace that is consistent with the packet counts and satisfies a query. To scale on large problem instances, QuASI relies on a layered approach and on a novel enqueue-rate abstraction, which is lossless for the class of queries that QuASI answers. The first layer employs a novel and efficient algorithm that generates a cover-set of abstract traces, constructs representative abstract traces from the cover-set, and efficiently checks each representative abstract trace by leveraging a known result on (0,1)-matrix existence. The first layer guarantees no false negatives: if the first layer says "No", there is no packet trace consistent with the observed packet counts that makes the query true. If it says "Yes", further verification is needed, which the second layer resolves using an SMT solver. As a result, QuASI has no false positives and no false negatives.
Our evaluations show that QuASI is up to 106X faster than state-of-the-art, and can answer non-trivial queries about queue metrics (e.g., queue length) using minute-granularity packet counts. Our work is the first step toward more practical formal performance analysis under given measurements.
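The "known result on (0,1)-matrix existence" is classically the Gale-Ryser theorem, which decides whether a 0/1 matrix with prescribed row and column sums exists. The checker below implements that classical test for illustration only; QuASI's abstraction, encoding, and use of the result are its own.
    def gale_ryser_exists(row_sums, col_sums):
        # True iff some 0/1 matrix has exactly these row and column sums:
        # sums must match, and every prefix of the sorted row sums must be
        # dominated by sum(min(c, k)) over the column sums (Gale-Ryser).
        m, n = len(row_sums), len(col_sums)
        if sum(row_sums) != sum(col_sums):
            return False
        if any(r < 0 or r > n for r in row_sums) or any(c < 0 or c > m for c in col_sums):
            return False
        r = sorted(row_sums, reverse=True)
        for k in range(1, m + 1):
            if sum(r[:k]) > sum(min(c, k) for c in col_sums):
                return False
        return True

    # Two toy instances of row/column sums.
    print(gale_ryser_exists([2, 2, 1], [1, 2, 1, 1]))  # True: such a matrix exists
    print(gale_ryser_exists([3, 3, 0], [3, 3, 0]))     # False: infeasible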
Runtime Protocol Refinement Checking for Distributed Protocol Implementations
Ding Ding, Zhanghan Wang, Jinyang Li, and Aurojit Panda, NYU
Despite significant progress in verifying protocols, services that implement distributed protocols (we refer to these as DPIs in what follows), e.g., Chubby or Etcd, can exhibit safety bugs in production deployments. These bugs are often introduced by programmers when converting protocol descriptions into code. This paper introduces Runtime Protocol Refinement Checking (RPRC), a runtime approach for detecting protocol implementation bugs in DPIs. RPRC systems observe a deployed DPI's runtime behavior and notify operators when this behavior evidences a protocol implementation bug, allowing operators to mitigate the bug's impact and developers to fix the bug. We have developed an algorithm for RPRC and implemented it in a system called Ellsberg that targets DPIs that assume fail-stop failures and the asynchronous (or partially synchronous) model. Our goal when designing Ellsberg was to make no assumptions about how DPIs are implemented and to avoid additional coordination or communication. Therefore, Ellsberg builds on the observation that in the absence of Byzantine failures, a protocol's safety properties are maintained if all live DPI processes correctly implement the protocol. Thus, Ellsberg checks RPRC by comparing messages sent and received by each DPI process to those produced by a simulated execution of the protocol. We apply Ellsberg to three open-source DPIs (Etcd, ZooKeeper, and Redis Raft) and show that we can detect previously reported protocol bugs in these DPIs.
CEGS: Configuration Example Generalizing Synthesizer
Jianmin Liu, Tsinghua University; Li Chen, Zhongguancun Laboratory; Dan Li, Tsinghua University; Yukai Miao, Zhongguancun Laboratory
Network configuration synthesis promises to increase the efficiency of network management by reducing human involvement. However, despite significant advances in this field, existing synthesizers still require much human effort in drafting configuration templates or coding in a domain-specific language. We argue that the main reason for this is that a core capability is missing for current synthesizers: identifying and following configuration examples in configuration manuals and generalizing them to arbitrary topologies.
In this work, we fill this capability gap with two recent advancements in artificial intelligence: graph neural networks (GNNs) and large language models (LLMs). We build CEGS, which can automatically identify appropriate configuration examples and follow and generalize them to fit target network scenarios. CEGS features a GNN-based Querier to identify relevant examples from device documentation, a GNN-based Classifier to generalize the examples to arbitrary topologies, and an efficient LLM-driven synthesis method to quickly and correctly synthesize configurations that comply with the intents. Evaluations on real-world networks and complex intents show that CEGS can automatically synthesize correct configurations for a network of 1094 devices without human involvement. In contrast, the state-of-the-art LLM-based synthesizer is more than 30 times slower than CEGS on average, even when human experts are in the loop.
10:20 am–10:50 am
Coffee and Tea Break
10:50 am–12:10 pm
Security
Suppressing BGP Zombies with Route Status Transparency
Yosef Edery Anahory, The Hebrew University of Jerusalem; Jie Kong, Nicholas Scaglione, and Justin Furuness, University of Connecticut; Hemi Leibowitz, The College of Management Academic Studies; Amir Herzberg and Bing Wang, University of Connecticut; Yossi Gilad, The Hebrew University of Jerusalem
Withdrawal suppression has been a known weakness of BGP for over a decade. It has a significant detrimental impact on both the reliability and security of inter-domain routing on the Internet. This paper presents Route Status Transparency (RoST), the first design that efficiently and securely thwarts withdrawal suppression misconfigurations and attacks. RoST allows ASes to efficiently verify whether a route has been withdrawn; it is compatible with BGP as well as with BGP security enhancements. We use simulations on the Internet’s AS-level topology to evaluate the benefits from adopting RoST. We use an extensive real-world BGP announcements dataset to show that it is efficient in terms of storage, bandwidth, and computational requirements.
ValidaTor: Domain Validation over Tor
Jens Frieß, National Research Center for Applied Cybersecurity ATHENE and Technische Universität Darmstadt; Haya Schulmann, National Research Center for Applied Cybersecurity ATHENE and Goethe-Universität Frankfurt; Michael Waidner, National Research Center for Applied Cybersecurity ATHENE and Technische Universität Darmstadt
Domain Validation (DV) is the primary method used by Certificate Authorities (CAs) to confirm administrative control over a domain before issuing digital certificates. Despite its widespread use, DV is vulnerable to various attacks, prompting the adoption of multiple vantage points to enhance security, such as the state-of-the-art DV mechanism supported by Let’s Encrypt. However, even distributed static vantage points remain susceptible to targeted attacks. In this paper, we introduce ValidaTor, an HTTP-based domain validation system that leverages the Tor network to create a distributed and unpredictable set of validators. By utilizing Tor’s exit nodes, ValidaTor significantly increases the pool of available validators, providing high path diversity and resilience against strong adversaries. Our empirical evaluations demonstrate that ValidaTor can achieve the validation throughput of a commercial CA and has the potential to scale to a validation volume comparable to Let’s Encrypt, while using minimal dedicated infrastructure and only a small fraction (~0.1%) of Tor’s available bandwidth. While unpredictable selection of validators makes ValidaTor fully resistant to targeted attacks on validators, we also show that the use of Tor nodes improves path diversity and thereby the resilience of DV to subversion by well-positioned ASes, reducing the number of Autonomous Systems (ASes) capable of issuing fraudulent certificates by up to 27% compared to Let’s Encrypt. Lastly, we show that the chance of subversion by malicious, colluding exit nodes is negligible (≤ 1% even with a quarter of existing exit nodes). We make the code of ValidaTor as well as the datasets and measurements publicly available for use, reproduction, and future research.
From Address Blocks to Authorized Prefixes: Redesigning RPKI ROV with a Hierarchical Hashing Scheme for Fast and Memory-Efficient Validation
Zedong Ni, Computer Network Information Center, Chinese Academy of Sciences; and School of Cyber Science & Engineering, Southeast University; Yinbo Xu, Hui Zou, and Yanbiao Li, Computer Network Information Center, Chinese Academy of Sciences; and University of Chinese Academy of Sciences; Guang Cheng, School of Cyber Science & Engineering, Southeast University; and Purple Mountain Laboratories; Gaogang Xie, Computer Network Information Center, Chinese Academy of Sciences; and University of Chinese Academy of Sciences
Route Origin Validation (ROV) with Route Origin Authorizations (ROAs), built on top of the Resource Public Key Infrastructure (RPKI), serves as the only formally standardized and production-grade defense mechanism against route hijackings in global interdomain routing infrastructures. However, the widespread adoption of RPKI has introduced escalating scalability challenges in validating high volumes of route messages against massive ROA entries.
In this paper, we attribute the performance bottleneck of existing ROV schemes to their underlying validation model, in which a route is matched against rules in the form of address blocks. To overcome this bottleneck, we propose the Authorized Prefix (AP) model, which enables validation at the prefix granularity, and redesign RPKI ROV based on this new model with a hierarchical hashing scheme named h2ROV. Extensive evaluations verify h2ROV's superiority over state-of-the-art approaches in IPv4, with a speedup of 1.7× ∼ 9.8× in validation and a reduction of 49.3% ∼ 86.6% in memory consumption. System emulations using real-world network topologies further demonstrate that h2ROV confines its impact on routing convergence to below 8.5% during update burst events, while reducing ROV-induced delays by 30.4% ∼ 64.7% compared to existing solutions.
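For readers unfamiliar with ROV, the sketch below is a minimal hash-table lookup with the standard RFC 6811 outcomes (valid/invalid/unknown); it illustrates validating a route via hashed prefix lookups rather than scanning address-block rules, and is not the h2ROV data structure itself:

# Minimal hash-based Route Origin Validation sketch (standard RFC 6811 outcome
# semantics; an illustration only, not the h2ROV hierarchical scheme).
import ipaddress
from collections import defaultdict

class RovTable:
    def __init__(self, roas):
        # Index ROAs by (network_address, prefix_length) for O(1) hash lookups.
        self.by_prefix = defaultdict(list)
        for prefix, max_length, asn in roas:
            net = ipaddress.ip_network(prefix)
            self.by_prefix[(int(net.network_address), net.prefixlen)].append((max_length, asn))

    def validate(self, prefix, origin_asn):
        route = ipaddress.ip_network(prefix)
        addr = int(route.network_address)
        covered = False
        # Check every covering prefix length, most specific first.
        for length in range(route.prefixlen, -1, -1):
            mask = ((1 << length) - 1) << (route.max_prefixlen - length)
            for max_length, asn in self.by_prefix.get((addr & mask, length), []):
                covered = True
                if route.prefixlen <= max_length and asn == origin_asn:
                    return "valid"
        return "invalid" if covered else "unknown"

table = RovTable([("192.0.2.0/24", 24, 64500), ("198.51.100.0/22", 24, 64501)])
print(table.validate("192.0.2.0/24", 64500))     # valid
print(table.validate("198.51.100.0/24", 64999))  # invalid (covered, wrong origin)
print(table.validate("203.0.113.0/24", 64500))   # unknown (no covering ROA)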
PreAcher: Secure and Practical Password Pre-Authentication by Content Delivery Networks
Shihan Lin, Duke University; Suting Chen, Northwestern University; Yunming Xiao, University of Michigan; Yanqi Gu, University of California, Irvine; Aleksandar Kuzmanovic, Northwestern University; Xiaowei Yang, Duke University
In today's Internet, websites widely rely on password authentication for user logins. However, the intensive computation required for password authentication exposes web servers to Application-layer DoS (ADoS) attacks that exploit the login interfaces. Existing solutions fail to simultaneously prevent such ADoS attacks, preserve password secrecy, and maintain good usability. In this paper, we present PreAcher, a system architecture that incorporates third-party Content Delivery Networks (CDNs) into the password authentication process and offloads the authentication workload to CDNs without divulging the passwords to them. At the core of PreAcher is a novel three-party authentication protocol that combines Oblivious Pseudorandom Function (OPRF) and Locality-Sensitive Hashing (LSH). This protocol allows CDNs to pre-authenticate users and thus filter out ADoS traffic without compromising password security. Our evaluations demonstrate that PreAcher significantly enhances the resilience of web servers against ADoS attacks and preserves password security, while introducing acceptable overheads. Notably, PreAcher can be deployed immediately by websites alone today, without modifications to client software or CDN infrastructure. We release the source code of PreAcher to facilitate its deployment and future research.
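The OPRF is the building block that lets a third party evaluate a function of the password without seeing it. As a self-contained toy illustration of that primitive only (not PreAcher's protocol, parameters, or security level), the sketch below runs a hashed Diffie-Hellman OPRF in a tiny Schnorr group: the client blinds a hash of the password, the server applies its secret key, and the client unblinds the result.

# Toy hashed-DH OPRF over a tiny Schnorr group (illustration of the OPRF
# building block only; not PreAcher's protocol or a secure parameter choice).
import hashlib
import secrets

P = 1019            # toy safe prime, P = 2Q + 1 (real systems use large groups or EC)
Q = (P - 1) // 2    # prime order of the quadratic-residue subgroup

def hash_to_group(password: str) -> int:
    """Map a password into the order-Q subgroup by hashing and squaring."""
    x = int.from_bytes(hashlib.sha256(password.encode()).digest(), "big")
    return pow(x % (P - 1) + 1, 2, P)   # nonzero square => element of order dividing Q

def client_blind(password: str):
    h = hash_to_group(password)
    r = secrets.randbelow(Q - 1) + 1          # blinding exponent in [1, Q-1]
    return pow(h, r, P), r

def server_evaluate(blinded: int, server_key: int) -> int:
    return pow(blinded, server_key, P)        # the server never sees the password

def client_unblind(evaluated: int, r: int) -> int:
    return pow(evaluated, pow(r, -1, Q), P)   # = H(pw)^server_key

server_key = secrets.randbelow(Q - 1) + 1
blinded, r = client_blind("hunter2")
oprf_output = client_unblind(server_evaluate(blinded, server_key), r)
# Sanity check: the unblinded output equals H(pw)^k computed directly.
assert oprf_output == pow(hash_to_group("hunter2"), server_key, P)
print("OPRF output:", oprf_output)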
12:10 pm–2:00 pm
Lunch (on your own)
2:00 pm–3:20 pm
Data Plane Programmability 2
ClubHeap: A High-Speed and Scalable Priority Queue for Programmable Packet Scheduling
Zhikang Chen, Tsinghua University; Haoyu Song, Futurewei Technologies; Zhiyu Zhang and Yang Xu, Fudan University; Bin Liu, Tsinghua University
While PIFO is a powerful priority queue abstraction to support programmable packet scheduling in network devices, the efficient implementation of PIFO faces multiple challenges in performance and scalability. The existing solutions all fall short of certain requirements. In this paper, we propose ClubHeap to address the problem. On the one hand, we develop a novel hardware-friendly heap data structure to support faster PIFO queue operations that can schedule a flow in every clock cycle, reaching the theoretical lower bound; on the other hand, the optimized hardware architecture reduces the circuit complexity and thus enables a higher clock frequency. The end result is the best scheduling performance in its class. Combined with its inherently better scalability and flexibility, ClubHeap is an ideal solution to be built into programmable switches and SmartNICs to support various scheduling algorithms. We build an FPGA-based hardware prototype and conduct a thorough evaluation by comparing ClubHeap with other state-of-the-art solutions. ClubHeap also allows graceful trade-offs between throughput and resource consumption through parameter adjustments, making it adaptable to different target devices.
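ClubHeap itself is a hardware design; as a behavioral reference only, the short sketch below models the PIFO abstraction it implements in software: packets are pushed with a rank and always popped in rank order, with FIFO tie-breaking within a rank.

# Software reference model of a PIFO queue (behavioral illustration of the
# abstraction ClubHeap implements in hardware; not the ClubHeap design itself).
import heapq
import itertools

class Pifo:
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()   # tie-breaker keeps FIFO order within a rank

    def push(self, rank, packet):
        heapq.heappush(self._heap, (rank, next(self._seq), packet))

    def pop(self):
        rank, _, packet = heapq.heappop(self._heap)
        return rank, packet

q = Pifo()
for rank, pkt in [(7, "C"), (3, "A"), (3, "B"), (9, "D")]:
    q.push(rank, pkt)
print([q.pop() for _ in range(4)])   # [(3, 'A'), (3, 'B'), (7, 'C'), (9, 'D')]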
Self-Clocked Round-Robin Packet Scheduling
Erfan Sharafzadeh, Johns Hopkins University and Hewlett Packard Labs; Raymond Matson, University of California Riverside; Jean Tourrilhes and Puneet Sharma, Hewlett Packard Labs; Soudeh Ghorbani, Johns Hopkins University and Meta
Deficit Round Robin (DRR) is the de facto fair packet scheduler in the Internet due to its superior fairness and scalability. We show that DRR can perform poorly due to its assumptions about packet size distributions and traffic bursts. Concretely, DRR performs best if (1) packet size distributions are known in advance, since its optimal performance depends on tuning a parameter based on the largest packet, and (2) all bursts are long and create backlogged queues. We show that neither assumption holds in today's Internet: packet size distributions are varied and dynamic, complicating the tuning of DRR, and Internet traffic consists of many short, latency-sensitive flows that create small bursts. These flows can experience high latency under DRR as it serves a potentially large number of flows in a round-robin fashion.
To address these shortcomings while retaining the fairness and scalability of DRR, we introduce Self-Clocked Round-Robin Scheduling (SCRR), a parameter-less, low-latency, and scalable packet scheduler that boosts short, latency-sensitive flows through careful adjustments to their virtual times without violating their fair-share guarantees. We evaluate SCRR using theoretical models and a Linux implementation on a physical testbed. Our results demonstrate that, while matching DRR in flow fairness, SCRR reduces average CPU overhead by 23% compared to DRR with a small quantum and improves application latency by 71% compared to DRR with a large quantum.
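For reference, textbook DRR (the baseline discussed above, not SCRR) works as sketched below: each backlogged flow gains a quantum of byte credit per round and may dequeue packets while its deficit covers them, which is where the largest-packet tuning dependency comes from.

# Textbook Deficit Round Robin (baseline illustration only; SCRR's virtual-time
# adjustments are not shown). Packets are represented by their sizes in bytes.
from collections import deque

def drr_schedule(flows, quantum):
    """flows: dict name -> deque of packet sizes (bytes). Yields (flow, size)."""
    deficit = {name: 0 for name in flows}
    active = deque(name for name, q in flows.items() if q)
    while active:
        name = active.popleft()
        deficit[name] += quantum          # one quantum of credit per round
        queue = flows[name]
        while queue and queue[0] <= deficit[name]:
            size = queue.popleft()
            deficit[name] -= size
            yield name, size
        if queue:
            active.append(name)           # still backlogged: re-enter the round robin
        else:
            deficit[name] = 0             # idle flows do not bank credit

flows = {"bulk": deque([1500] * 4), "voip": deque([200] * 4)}
print(list(drr_schedule(flows, quantum=1500)))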
Everything Matters in Programmable Packet Scheduling
Albert Gran Alcoz, ETH Zürich; Balázs Vass, BME-TMIT; Pooria Namyar, USC; Behnaz Arzani, Microsoft Research; Gábor Rétvári, BME-TMIT; Laurent Vanbever, ETH Zürich
Operators can deploy any scheduler they desire on existing switches through programmable packet schedulers: they tag packets with ranks (which indicate their priority) and schedule them in the order of these ranks. The ideal programmable scheduler is the Push-In First-Out (PIFO) queue, which schedules packets in a perfectly sorted order by "pushing" packets into any position of the queue based on their ranks. However, it is hard to implement PIFO queues in hardware due to their need to sort packets at line rate (based on their ranks).
Recent proposals approximate PIFO behaviors on existing data planes. While promising, they fail to capture both essential aspects of PIFO queues: their scheduling behavior and their admission control. We introduce PACKS, an approximate PIFO scheduler that addresses this problem. PACKS runs on top of a set of priority queues and uses packet-rank information and queue-occupancy levels during enqueue to determine whether to admit each incoming packet and to which queue it should be mapped.
We fully implement PACKS in P4 and evaluate it on real workloads. We show that PACKS approximates PIFO more closely than state-of-the-art approaches. Specifically, PACKS reduces rank inversions by up to 7× and 15× with respect to SP-PIFO and AIFO, respectively, and reduces the number of packet drops by up to 60% compared to SP-PIFO. Under pFabric ranks, PACKS reduces the mean FCT across small flows by up to 33% and 2.6×, compared to SP-PIFO and AIFO. We also show that PACKS runs at line rate on existing hardware (Intel Tofino).
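As a rough illustration of the kind of enqueue-time decision described above (not PACKS's actual mapping or its P4 implementation), the sketch below maps each packet's rank to one of a few FIFO queues using fixed rank boundaries, spills into lower-priority queues when the target is full, and drops the packet when no queue has room; the spill-over is exactly what produces rank inversions in such approximations.

# Simplified rank-to-queue admission sketch (illustrative only; not PACKS).
from collections import deque

class ApproxPifo:
    def __init__(self, rank_bounds, capacity):
        # rank_bounds[i] is the largest rank admitted to queue i (highest priority first).
        self.rank_bounds = rank_bounds
        self.queues = [deque() for _ in range(len(rank_bounds) + 1)]
        self.capacity = capacity

    def enqueue(self, rank, packet):
        start = next((i for i, b in enumerate(self.rank_bounds) if rank <= b),
                     len(self.queues) - 1)
        for i in range(start, len(self.queues)):
            if len(self.queues[i]) < self.capacity:
                self.queues[i].append((rank, packet))   # spill-over may invert ranks
                return True
        return False                      # admission control: drop the packet

    def dequeue(self):
        for q in self.queues:             # strict priority across the FIFO queues
            if q:
                return q.popleft()
        return None

sched = ApproxPifo(rank_bounds=[10, 100], capacity=2)
for rank in [5, 120, 8, 50, 7]:
    sched.enqueue(rank, f"pkt-rank-{rank}")
print([sched.dequeue() for _ in range(5)])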
When P4 Meets Run-to-completion Architecture
Hao Zheng, State Key Laboratory for Novel Software Technology, Nanjing University, China; Xin Yan, Huawei, China; Wenbo Li, Jiaqi Zheng, and Xiaoliang Wang, State Key Laboratory for Novel Software Technology, Nanjing University, China; Qingqing Zhao, Luyou He, Xiaofei Lai, Feng Gao, and Fuguang Huang, Huawei, China; Wanchun Dou, Guihai Chen, and Chen Tian, State Key Laboratory for Novel Software Technology, Nanjing University, China
P4 programmable data planes have significantly accelerated the evolution of various network technologies. Although the P4 language has gained wide acceptance, its further development encounters two obstacles: limited programmability and the cessation of the next-generation Tofino chip. As a hardware manufacturer, we try to address the above dilemmas by opening up the P4 programmability of our run-to-completion (RTC) chips. At present, there is no publicly available experience in this field. We introduce P4RTC, a comprehensive consolidation of our experiences applying the P4 language to the RTC architecture. P4RTC introduces a new P4 architecture model and a set of beneficial extern constructs to fully leverage the RTC architecture's programmability. In addition, we share the insights we have gained from designing and implementing compilers. We also provide a performance model to facilitate profiling P4RTC's performance on user-customized P4 code. We prototype P4RTC on an RTC chip with 1.2 Tbps bandwidth. Case-oriented evaluation demonstrates that P4RTC can enhance P4 programmability and reduce the burden of RTC development. The performance model can provide substantial insights into optimizing P4RTC programs.
3:20 pm–3:50 pm
Coffee and Tea Break
3:50 pm–5:10 pm
ML for Networks
Mutant: Learning Congestion Control from Existing Protocols via Online Reinforcement Learning
Lorenzo Pappone, Computer Science Department, Saint Louis University; Alessio Sacco, DAUIN, Politecnico di Torino; Flavio Esposito, Computer Science Department, Saint Louis University
Learning how to control congestion remains a challenge despite years of progress. Existing congestion control protocols have demonstrated efficacy within specific network conditions, inevitably behaving suboptimally or poorly in others. Machine learning solutions to congestion control have been proposed, though they rely on extensive training and specific network configurations. In this paper, we loosen such dependencies by proposing Mutant, an online reinforcement learning algorithm for congestion control that adapts to the behavior of the best-performing schemes, outperforming them in most network conditions. Design challenges included determining the best protocols to learn from, given a network scenario, and creating a system able to evolve to accommodate future protocols with minimal changes. Our evaluation on real-world and emulated scenarios shows that Mutant achieves lower delays and higher throughput than prior learning-based schemes while maintaining fairness by exhibiting negligible harm to competing flows, making it robust across diverse and dynamic network conditions.
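As a generic sketch of online selection among existing congestion-control schemes (not Mutant's reinforcement-learning design), an epsilon-greedy bandit that keeps a running reward per protocol, here assumed to be throughput penalized by delay, looks like this:

# Epsilon-greedy selection among existing congestion-control schemes
# (a generic online-learning sketch; the reward shaping is an assumption).
import random

class ProtocolSelector:
    def __init__(self, protocols, epsilon=0.1):
        self.protocols = protocols
        self.epsilon = epsilon
        self.counts = {p: 0 for p in protocols}
        self.values = {p: 0.0 for p in protocols}        # running mean reward

    def choose(self):
        if random.random() < self.epsilon:
            return random.choice(self.protocols)         # explore
        return max(self.protocols, key=self.values.get)  # exploit the best so far

    def update(self, protocol, throughput_mbps, delay_ms):
        reward = throughput_mbps - 0.1 * delay_ms        # assumed reward shaping
        self.counts[protocol] += 1
        self.values[protocol] += (reward - self.values[protocol]) / self.counts[protocol]

selector = ProtocolSelector(["cubic", "bbr", "vegas"])
for _ in range(100):                                     # one decision per interval
    proto = selector.choose()
    tput, delay = {"cubic": (95, 40), "bbr": (90, 10), "vegas": (60, 8)}[proto]
    selector.update(proto, tput, delay)
print("preferred protocol:", max(selector.values, key=selector.values.get))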
CATO: End-to-End Optimization of ML-Based Traffic Analysis Pipelines
Gerry Wan, Stanford University; Shinan Liu, University of Chicago; Francesco Bronzino, ENS Lyon; Nick Feamster, University of Chicago; Zakir Durumeric, Stanford University
Machine learning has shown tremendous potential for improving the capabilities of network traffic analysis applications, often outperforming simpler rule-based heuristics. However, ML-based solutions remain difficult to deploy in practice. Many existing approaches only optimize the predictive performance of their models, overlooking the practical challenges of running them against network traffic in real time. This is especially problematic in the domain of traffic analysis, where the efficiency of the serving pipeline is a critical factor in determining the usability of a model. In this work, we introduce CATO, a framework that addresses this problem by jointly optimizing the predictive performance and the associated systems costs of the serving pipeline. CATO leverages recent advances in multi-objective Bayesian optimization to efficiently identify Pareto-optimal configurations, and automatically compiles end-to-end optimized serving pipelines that can be deployed in real networks. Our evaluations show that compared to popular feature optimization techniques, CATO can provide up to 3600× lower inference latency and 3.7× higher zero-loss throughput while simultaneously achieving better model performance.
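The Pareto-optimality criterion that such a search targets is simple to state in code; the filter below keeps every configuration that no other configuration beats on both model error and inference latency (the multi-objective Bayesian optimization that explores the space, and all names and numbers, are not from the paper):

# Pareto-front filter over (error, latency) pairs: a minimal illustration of the
# Pareto-optimality criterion only; the Bayesian-optimization search is not shown.
def pareto_front(configs):
    """configs: dicts with 'error' and 'latency_us'; lower is better for both."""
    front = []
    for c in configs:
        dominated = any(
            o["error"] <= c["error"] and o["latency_us"] <= c["latency_us"]
            and (o["error"] < c["error"] or o["latency_us"] < c["latency_us"])
            for o in configs
        )
        if not dominated:
            front.append(c)
    return front

candidates = [
    {"name": "all-features", "error": 0.02, "latency_us": 900},
    {"name": "top-10-features", "error": 0.03, "latency_us": 120},
    {"name": "top-3-features", "error": 0.08, "latency_us": 40},
    {"name": "random-5", "error": 0.09, "latency_us": 150},   # dominated by top-10
]
print([c["name"] for c in pareto_front(candidates)])
# ['all-features', 'top-10-features', 'top-3-features']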
Resolving Packets from Counters: Enabling Multi-scale Network Traffic Super Resolution via Composable Large Traffic Model
Xizheng Wang, Tsinghua University and Zhongguancun Laboratory; Libin Liu and Li Chen, Zhongguancun Laboratory; Dan Li, Tsinghua University; Yukai Miao and Yu Bai, Zhongguancun Laboratory
Realistic fine-grained traffic traces are valuable to numerous applications in both academia and industry. However, obtaining them directly from devices is significantly challenging, while coarse-grained counters are readily available on almost all network devices. No existing work can restore fine-grained traffic traces from counters, a task we call network traffic super-resolution (TSR). To this end, we propose ZOOMSYNTH, the first TSR system that can achieve packet-level trace synthesis with counter traces as input. Following the basic structure of the TSR task, we design the Granular Traffic Transformer (GTT) model and the Composable Large Traffic Model (CLTM). CLTM is a tree of GTT models, and the GTT models in each layer perform upscaling at a particular granularity, which allows each GTT model to capture the traffic characteristics at that resolution. Using CLTM, we synthesize fine-grained traces from counters. We also leverage a rule-following model to comprehend counter rules (e.g., ACLs) when available, guiding the generation of fine-grained traces. We implement ZOOMSYNTH and perform extensive evaluations. Results show that, with only second-level counter traces, ZOOMSYNTH achieves synthesis quality comparable to that of existing solutions that take packet-level traces as input. CLTM can also be fine-tuned to support downstream tasks. For example, ZOOMSYNTH with fine-tuned CLTM outperforms the existing solution by 27.5% and 9.8% in anomaly detection and service recognition tasks, respectively. To promote future research, we release the pre-trained CLTM-1.8B model weights along with its source code.
BFTBrain: Adaptive BFT Consensus with Reinforcement Learning
Chenyuan Wu and Haoyun Qin, University of Pennsylvania; Mohammad Javad Amiri, Stony Brook University; Boon Thau Loo, University of Pennsylvania; Dahlia Malkhi, UC Santa Barbara; Ryan Marcus, University of Pennsylvania
This paper presents BFTBrain, a reinforcement learning (RL) based Byzantine fault-tolerant (BFT) system that provides significant operational benefits: it is a plug-and-play system suitable for a broad set of hardware and network configurations, and it adjusts effectively in real time to changing fault scenarios and workloads. BFTBrain adapts to system conditions and application needs by switching between a set of BFT protocols in real time. Two main advances contribute to BFTBrain's agility and performance. First, BFTBrain is based on a systematic, thorough modeling of metrics that correlate the performance of the studied BFT protocols with varying fault scenarios and workloads. These metrics are fed as features to BFTBrain's RL engine in order to choose the best-performing BFT protocols in real time. Second, BFTBrain coordinates RL in a decentralized manner that is resilient to adversarial data pollution, where nodes share local metering values and reach the same learning output by consensus. As a result, in addition to providing significant operational benefits, BFTBrain improves throughput over fixed protocols by 18% to 119% under dynamic conditions and outperforms state-of-the-art learning-based approaches by 44% to 154%.
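As a simplified illustration of resisting polluted metric reports when nodes share local metering values (not BFTBrain's consensus protocol or its RL engine), each node can aggregate the reported features with a coordinate-wise median before feeding them to a deterministic selection policy, so a bounded number of lying nodes cannot drag the agreed-upon features arbitrarily; the hand-written policy and the protocol names below are stand-ins for the learned policy.

# Coordinate-wise median aggregation of locally metered features (a simplified
# illustration of tolerating polluted reports; the policy is a hypothetical
# stand-in for a learned RL policy).
from statistics import median

def aggregate_features(reports):
    """reports: list of dicts mapping metric name -> locally measured value."""
    return {k: median(r[k] for r in reports) for k in reports[0]}

def choose_protocol(features):
    if features["fault_rate"] > 0.1:
        return "pbft"            # fall back to a robust protocol under faults
    return "zyzzyva" if features["payload_kb"] < 4 else "hotstuff"

reports = [
    {"fault_rate": 0.01, "payload_kb": 2.0},
    {"fault_rate": 0.02, "payload_kb": 2.2},
    {"fault_rate": 0.01, "payload_kb": 1.9},
    {"fault_rate": 9.99, "payload_kb": 999.0},   # adversarial outlier is voted out
]
agg = aggregate_features(reports)
print(agg, "->", choose_protocol(agg))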