Reasoning about Network Traffic Load Property at Production Scale

Authors: 

Ruihan Li, Peking University and Alibaba Cloud; Fangdan Ye, Yifei Yuan, Ruizhen Yang, Bingchuan Tian, Tianchen Guo, Hao Wu, Xiaobo Zhu, Zhongyu Guan, Qing Ma, and Xianlong Zeng, Alibaba Cloud; Chenren Xu, Peking University; Dennis Cai and Ennan Zhai, Alibaba Cloud

Abstract: 

This paper presents Jingubang, the first reported system for checking network traffic load properties (e.g., whether any link’s utilization would exceed 80% during a network change) in a production Wide Area Network (WAN). Driven by the needs of our network operators, Jingubang must meet three requirements: (R1) comprehensive support for complex traffic behavior under BGP, IS-IS, policy-based routes (PBR), and segment routes (SR); (R2) reasoning about the traffic load of billions of flows across a period of time; and (R3) real-time failure-tolerance analysis. These requirements make it challenging both to model the complex traffic behavior and to keep checking efficient. Jingubang addresses these challenges in three ways. First, we propose the traffic distribution graph (TDG), which can model equal-cost multi-path (ECMP), packet rewriting, and tunneling, introduced by BGP/IS-IS, PBR, and SR, respectively. Second, we design a TDG-based algorithm that simulates the traffic distribution of billions of flows across a time period both efficiently and accurately. Third, Jingubang uses an incremental traffic simulation approach that first computes an incremental TDG and then simulates only the differential traffic distribution, avoiding the need to simulate the entire network traffic distribution from scratch. Jingubang has been used in the daily checking of our WAN for more than one year and has prevented service downtime resulting from traffic load violations.
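
To make the kind of property being checked concrete, the following is a minimal sketch, not Jingubang's TDG or its algorithm. It assumes a single destination, an acyclic forwarding graph, and even ECMP splitting across next hops; the names `simulate_ecmp` and `violations`, the toy topology, and the capacities are all hypothetical. It recomputes loads from scratch after a link failure, which is exactly the cost the incremental approach in the abstract is designed to avoid.

```python
from collections import defaultdict, deque


def simulate_ecmp(next_hops, demands):
    """Propagate traffic over an acyclic forwarding graph with even ECMP splits.

    next_hops: node -> list of next-hop nodes toward one destination (assumed acyclic).
    demands:   ingress node -> offered traffic volume.
    Returns a dict mapping (u, v) links to the load they carry.
    """
    nodes = set(next_hops) | set(demands)
    indeg = defaultdict(int)
    for u, hops in next_hops.items():
        for v in hops:
            nodes.add(v)
            indeg[v] += 1

    node_load = defaultdict(float, demands)  # traffic arriving at each node
    link_load = defaultdict(float)

    # Kahn-style topological traversal: a node is processed only after
    # all of its upstream traffic has been accumulated.
    queue = deque(n for n in nodes if indeg[n] == 0)
    while queue:
        u = queue.popleft()
        hops = next_hops.get(u, [])
        for v in hops:
            share = node_load[u] / len(hops)  # even ECMP split (assumption)
            link_load[(u, v)] += share
            node_load[v] += share
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return dict(link_load)


def violations(link_load, capacity, threshold=0.8):
    """Return the links whose load exceeds threshold * capacity, sorted."""
    return sorted(l for l, load in link_load.items() if load > threshold * capacity[l])


# Hypothetical four-node topology: A splits traffic over B and C toward D.
next_hops = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
capacity = {("A", "B"): 100, ("A", "C"): 60, ("B", "D"): 100, ("C", "D"): 100}
demands = {"A": 100.0}

baseline = violations(simulate_ecmp(next_hops, demands), capacity)

# Failure scenario: link A-B goes down, so A forwards everything via C.
failed = dict(next_hops, A=["C"])
after_failure = violations(simulate_ecmp(failed, demands), capacity)

print(baseline)
print(after_failure)
```

In this toy case the 60-unit link A-C already exceeds 80% utilization with a 50-unit share, and failing A-B additionally pushes C-D over the threshold. A production checker would also have to handle packet rewriting, tunneling, many destinations, and time-varying demands, none of which this sketch attempts.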

NSDI '24 Open Access Sponsored by
King Abdullah University of Science and Technology (KAUST)

BibTeX
@inproceedings {295585,
author = {Ruihan Li and Fangdan Ye and Yifei Yuan and Ruizhen Yang and Bingchuan Tian and Tianchen Guo and Hao Wu and Xiaobo Zhu and Zhongyu Guan and Qing Ma and Xianlong Zeng and Chenren Xu and Dennis Cai and Ennan Zhai},
title = {Reasoning about Network Traffic Load Property at Production Scale},
booktitle = {21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24)},
year = {2024},
isbn = {978-1-939133-39-7},
address = {Santa Clara, CA},
pages = {1063--1082},
url = {https://www.usenix.org/conference/nsdi24/presentation/li-ruihan},
publisher = {USENIX Association},
month = apr
}