Gandalf: An Intelligent, End-To-End Analytics Service for Safe Deployment in Large-Scale Cloud Infrastructure

Authors: 

Ze Li, Qian Cheng, Ken Hsieh, and Yingnong Dang, Microsoft Azure; Peng Huang, Johns Hopkins University; Pankaj Singh and Xinsheng Yang, Microsoft Azure; Qingwei Lin, Microsoft Research; Youjiang Wu, Sebastien Levy, and Murali Chintalapati, Microsoft Azure

Abstract: 

Modern cloud systems have a vast number of components that continuously undergo updates. Deploying these frequent updates quickly without breaking the system is challenging. In this paper, we present Gandalf, an end-to-end analytics service for safe deployment in a large-scale system infrastructure. Gandalf enables rapid and robust impact assessment of software rollouts to catch bad rollouts before they cause widespread outages. Gandalf monitors and analyzes various fault signals. It will correlate each signal against all the ongoing rollouts using a spatial and temporal correlation algorithm. The core decision logic of Gandalf includes an ensemble ranking algorithm that determines which rollout may have caused the fault signals, and a binary classifier that assesses the impact of the fault signals. The analysis result will decide whether a rollout is safe to proceed or should be stopped.

By using a lambda architecture, Gandalf provides both real-time and long-term deployment monitoring with automated decisions and notifications. Gandalf has been running in production in Microsoft Azure for more than 18 months, serving both data-plane and control-plane components. It achieves 92.4% precision and 100% recall (no high-impact service outages in Azure Compute were caused by bad rollouts) for data-plane rollouts. For control-plane rollouts, Gandalf achieves 94.9% precision and 99.8% recall.

NSDI '20 Open Access Sponsored by NetApp

Open Access Media

USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.

BibTeX
@inproceedings {246504,
author = {Ze Li and Qian Cheng and Ken Hsieh and Yingnong Dang and Peng Huang and Pankaj Singh and Xinsheng Yang and Qingwei Lin and Youjiang Wu and Sebastien Levy and Murali Chintalapati},
title = {Gandalf: An Intelligent, {End-To-End} Analytics Service for Safe Deployment in {Large-Scale} Cloud Infrastructure },
booktitle = {17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20)},
year = {2020},
isbn = {978-1-939133-13-7},
address = {Santa Clara, CA},
pages = {389--402},
url = {https://www.usenix.org/conference/nsdi20/presentation/li},
publisher = {USENIX Association},
month = feb
}

Presentation Video