Qiao Zhang, University of Washington; Guo Yu, Cornell University; Chuanxiong Guo, Toutiao (Bytedance); Yingnong Dang, Nick Swanson, Xinsheng Yang, Randolph Yao, and Murali Chintalapati, Microsoft; Arvind Krishnamurthy and Thomas Anderson, University of Washington
In Infrastructure as a Service (IaaS), virtual machines (VMs) use virtual hard disks (VHDs) provided by a remote storage service via the network. Due to the separation of VMs and their VHDs, a new type of failure, called VHD failure, which may be caused by various components in the IaaS stack, has become the dominant factor reducing VM availability. Current state-of-the-art approaches fall short in localizing VHD failures because they look only at individual components.
In this paper, we designed and implemented a system called Deepview for VHD failure localization. Deepview composes a global picture of the system by connecting all the components together, using individual VHD failure events. It then uses a novel algorithm which integrates Lasso regression and hypothesis testing for accurate and timely failure localization.
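As a rough illustration of the Lasso step mentioned above, the following is a minimal sketch on hypothetical toy data, not Deepview's actual implementation. It assumes each VM-to-VHD path touches one compute cluster and one storage cluster (the names `C0`..`S1`, the failure rates, and the penalty `lam` are all made up for the example), fits a plain Lasso by coordinate descent, and blames the component with the largest coefficient. The hypothesis-testing stage is omitted.

```python
# Toy sketch: localize a faulty component with Lasso regression.
# All component names, failure rates, and the penalty value are
# hypothetical; this is not the Deepview implementation.

COMPONENTS = ["C0", "C1", "C2", "S0", "S1"]  # 3 compute, 2 storage clusters


def build_paths():
    """One row per (compute, storage) path, encoded as indicator columns."""
    rows, rates = [], []
    for c in range(3):
        for s in range(2):
            x = [0.0] * len(COMPONENTS)
            x[c] = 1.0          # compute cluster on this path
            x[3 + s] = 1.0      # storage cluster on this path
            rows.append(x)
            # Assumed observation: paths through S1 fail often, others never.
            rates.append(0.9 if s == 1 else 0.0)
    return rows, rates


def soft_threshold(value, lam):
    """Shrink value toward zero by lam (the Lasso proximal operator)."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, sweeps=100):
    """Minimize (1/2n)*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            rho = 0.0  # correlation of column j with the partial residual
            z = 0.0    # squared norm of column j
            for i in range(n):
                pred_without_j = sum(
                    X[i][k] * beta[k] for k in range(p) if k != j
                )
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (z / n) if z > 0 else 0.0
    return beta


X, y = build_paths()
beta = lasso_coordinate_descent(X, y, lam=0.05)
blamed = COMPONENTS[max(range(len(beta)), key=lambda j: beta[j])]
print(blamed, [round(b, 3) for b in beta])
```

The L1 penalty drives the coefficients of components that merely share paths with the faulty one to zero, which is why a sparse regression is a natural fit when few components are expected to be at fault at any one time.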
We have deployed Deepview at Microsoft Azure, one of the largest IaaS providers. Deepview reduced the number of unclassified VHD failure events from tens of thousands to several hundred. It unveiled new patterns including unplanned top-of-rack switch (ToR) reboots and storage gray failures. Deepview reduced the time-to-detection for incidents to under 10 minutes. Deepview further helped us quantify the implications of some key architectural decisions for the first time, including ToR switches as a single point of failure and the compute-storage separation.
@inproceedings{zhang2018deepview,
author = {Qiao Zhang and Guo Yu and Chuanxiong Guo and Yingnong Dang and Nick Swanson and Xinsheng Yang and Randolph Yao and Murali Chintalapati and Arvind Krishnamurthy and Thomas Anderson},
title = {Deepview: Virtual Disk Failure Diagnosis and Pattern Detection for Azure},
booktitle = {15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18)},
year = {2018},
isbn = {978-1-939133-01-4},
address = {Renton, WA},
pages = {519--532},
url = {https://www.usenix.org/conference/nsdi18/presentation/zhang-qiao},
publisher = {USENIX Association},
month = apr
}