Bingzhe Liu, UIUC; Colin Scott, Mukarram Tariq, Andrew Ferguson, Phillipa Gill, Richard Alimi, Omid Alipourfard, Deepak Arulkannan, Virginia Jean Beauregard, and Patrick Conner, Google; P. Brighten Godfrey, UIUC; Xander Lin, Joon Ong, Mayur Patel, Amr Sabaa, Arjun Singh, Alex Smirnov, Manish Verma, Prerepa V Viswanadham, and Amin Vahdat, Google
Management operations are a major source of network outages. A number of best practices designed to reduce and mitigate such outages are well known, but enforcing them has been challenging, leaving the network vulnerable to inadvertent mistakes and gaps that repeatedly result in outages. We present our experiences with CAPA, Google’s “containment and prevention architecture” for regulating management operations on our cluster networking fleet. Our goal with CAPA is to limit the set of systems where strict adherence to best practices is required, so that the availability of the network does not depend on the good intentions of every engineer and operator. We enumerate the features of CAPA that we have found necessary to effectively enforce best practices within a thin “regulation” layer. We evaluate CAPA based on case studies of outages prevented, counterfactual analysis of past incidents, and known limitations. Management-plane-related outages have decreased substantially in both frequency and severity, with an 82% reduction in the cumulative duration of incidents, normalized to fleet size, over five years.
@inproceedings{liu2024capa,
author = {Bingzhe Liu and Colin Scott and Mukarram Tariq and Andrew Ferguson and Phillipa Gill and Richard Alimi and Omid Alipourfard and Deepak Arulkannan and Virginia Jean Beauregard and Patrick Conner and P. Brighten Godfrey and Xander Lin and Joon Ong and Mayur Patel and Amr Sabaa and Arjun Singh and Alex Smirnov and Manish Verma and Prerepa V Viswanadham and Amin Vahdat},
title = {{CAPA}: An Architecture For Operating Cluster Networks With High Availability},
booktitle = {21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24)},
year = {2024},
isbn = {978-1-939133-39-7},
address = {Santa Clara, CA},
pages = {1995--2010},
url = {https://www.usenix.org/conference/nsdi24/presentation/liu-bingzhe},
publisher = {USENIX Association},
month = apr
}