Alerting for Distributed Systems—A Tale of Symptoms and Causes, Signals and Noise

Björn Rabenstein, SoundCloud

Abstract:

Noisy alerts are the deadly sin of monitoring. They obfuscate real issues and cause pager fatigue. Instead of reacting with the due sense of urgency, the person on-call will start to skim or even ignore alerts, not to mention the destruction of their sanity and work-life balance. Unfortunately, there are many monitoring pitfalls on the road to complex production systems, and most of them result in noisier alerts. In distributed systems, and in particular in a microservice architecture, there is usually a good understanding of local failure modes, while the behavior of the system as a whole is difficult to reason about. Thus, it is tempting to alert on the many possible causes; after all, finding the root cause of a problem is important. However, a distributed system is designed to tolerate local failures, and a human should only be paged on real or imminent problems of a service, ideally aggregated to one meaningful alert per problem. The definition of a problem should be clear and explicit rather than relying on some kind of automatic "anomaly detection." Detecting imminent problems does require taking historical trends into account, but those predictions should be simple rather than "magic." Alerting because "something seems weird" is almost never the right thing to do.

SoundCloud's long way from noisy pagers to much saner on-call rotations will serve as a case study, demonstrating how different monitoring technologies, among them most notably Prometheus, have affected alerting.

Björn is a Production Engineer at SoundCloud and one of the main Prometheus developers. Previously, he was a Site Reliability Engineer at Google and a number cruncher for science.

BibTeX

@conference {208544,
author = {Bj{\"o}rn Rabenstein},
title = {Alerting for Distributed {Systems{\textemdash}A} Tale of Symptoms and Causes, Signals and Noise},
year = {2016},
address = {Dublin},
publisher = {USENIX Association},
month = jul
}
