ServiceLab: Preventing Tiny Performance Regressions at Hyperscale through Pre-Production Testing

Authors: 

Mike Chow, Meta Platforms; Yang Wang, Meta Platforms and The Ohio State University; William Wang, Ayichew Hailu, Rohan Bopardikar, Bin Zhang, Jialiang Qu, David Meisner, Santosh Sonawane, Yunqi Zhang, Rodrigo Paim, Mack Ward, Ivor Huang, Matt McNally, Daniel Hodges, Zoltan Farkas, Caner Gocmen, Elvis Huang, and Chunqiang Tang, Meta Platforms

Awarded Best Paper!

Abstract: 

This paper presents ServiceLab, a large-scale performance testing platform developed at Meta. Currently, the diverse set of applications and ML models it tests consumes millions of machines in production, and each year it detects performance regressions that could otherwise lead to the wastage of millions of machines. A major challenge for ServiceLab is to detect small performance regressions, sometimes as tiny as 0.01%. These minor regressions matter due to our large fleet size and their potential to accumulate over time. For instance, the median regression detected by ServiceLab for our large serverless platform, running on more than half a million machines, is only 0.14%. Another challenge is running performance tests in our private cloud, which, like the public cloud, is a noisy environment that exhibits inherent performance variances even for machines of the same instance type. To address these challenges, we conduct a large-scale study with millions of performance experiments to identify machine factors, such as the kernel, CPU, and datacenter location, that introduce variance to test results. Moreover, we present statistical analysis methods to robustly identify small regressions. Finally, we share our seven years of operational experience in dealing with a diverse set of applications.
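To make the scale argument in the abstract concrete, here is a minimal back-of-the-envelope sketch. It is not ServiceLab's method: the abstract does not describe the statistical analysis in detail, so the sample-size formula below is the textbook two-sample z-test approximation, swapped in only for illustration. The fleet size (500,000) and the regression sizes (0.14% and 0.01%) come from the abstract; the 1% run-to-run noise level, the 52-regression count, and all function names are assumptions. The first function shows that 0.14% of half a million machines is about 700 machines.

```python
import math
from statistics import NormalDist


def machines_wasted(fleet_size: int, regression: float) -> float:
    """Extra machines needed to absorb a fractional regression (0.0014 == 0.14%)."""
    return fleet_size * regression


def compounded_regression(regression: float, count: int) -> float:
    """Total slowdown after `count` regressions of equal size stack multiplicatively."""
    return (1.0 + regression) ** count - 1.0


def samples_per_arm(delta: float, sigma: float,
                    alpha: float = 0.05, power: float = 0.8) -> int:
    """Textbook sample size per arm for a two-sided, two-sample z-test.

    delta: smallest relative difference to detect (e.g. 0.0014).
    sigma: standard deviation of one measurement, in the same relative
           units (ASSUMPTION: equal and known in both arms).
    """
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2.0 * ((z_alpha + z_beta) * sigma / delta) ** 2)


if __name__ == "__main__":
    fleet = 500_000   # "more than half a million machines" (from the abstract)
    median = 0.0014   # 0.14% median regression (from the abstract)
    tiny = 0.0001     # 0.01% smallest regression (from the abstract)
    noise = 0.01      # ASSUMPTION: 1% standard deviation per measurement

    print("machines lost to one median regression:", machines_wasted(fleet, median))
    # ASSUMPTION: 52 such regressions, chosen only to illustrate accumulation
    print("slowdown after 52 median regressions:", compounded_regression(median, 52))
    print("samples per arm for 0.14%:", samples_per_arm(median, noise))
    print("samples per arm for 0.01%:", samples_per_arm(tiny, noise))
```

Because the required sample count grows with the square of sigma/delta, halving measurement noise cuts the number of samples needed by roughly a factor of four, which is one way to see why identifying the machine factors that introduce variance matters for detecting regressions this small.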

OSDI '24 Open Access Sponsored by King Abdullah University of Science and Technology (KAUST)

BibTeX
@inproceedings {298722,
author = {Mike Chow and Yang Wang and William Wang and Ayichew Hailu and Rohan Bopardikar and Bin Zhang and Jialiang Qu and David Meisner and Santosh Sonawane and Yunqi Zhang and Rodrigo Paim and Mack Ward and Ivor Huang and Matt McNally and Daniel Hodges and Zoltan Farkas and Caner Gocmen and Elvis Huang and Chunqiang Tang},
title = {{ServiceLab}: Preventing Tiny Performance Regressions at Hyperscale through {Pre-Production} Testing},
booktitle = {18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)},
year = {2024},
isbn = {978-1-939133-40-3},
address = {Santa Clara, CA},
pages = {545--562},
url = {https://www.usenix.org/conference/osdi24/presentation/chow},
publisher = {USENIX Association},
month = jul
}