Draining the Flood—A Combat against Alert Fatigue

Tuesday, May 23, 2017 - 9:00am–9:55am

Yu Chen, Baidu Inc.

Abstract: 

A monitoring system is an important tool for SREs to guarantee service stability and availability. Baidu’s monitoring system, Argus, tracks hundreds of services in a distributed system. As the services have grown more complex, the number of monitoring items and corresponding anomaly detectors has grown to the point that the generated alerts flood on-call engineers from time to time. On an average day, Argus detects millions of warning events and sends out thousands of SMS alerts to on-call engineers. This amounts to more than 100 alerts per person during the daytime and 30 during the night. When a severe failure occurs, the alerts arrive in a massive surge and are of little help to the engineers fixing the problem. Therefore, it is imperative to improve the accuracy of anomaly detection, reduce the number of alerts, and organize them in a meaningful way.

In this talk, we will introduce our practice of leveraging machine learning methods to detect anomalies and group alerts in order to address these issues. We will also share some successful experiences, such as alert-based datacenter-level failure detection and alert-triggered automatic recovery techniques.

Yu Chen is the data architect of the SRE team at Baidu. He previously worked at Microsoft Research Asia as a researcher. His work experience includes data mining, search relevance, and distributed systems.

BibTeX
@conference {202751,
author = {Yu Chen},
title = {Draining the {Flood{\textemdash}A} Combat against Alert Fatigue},
year = {2017},
publisher = {USENIX Association},
month = may
}
