Watching the Watchers: Generating Absent Alerts for Prometheus

Note: Presentation times are in Coordinated Universal Time (UTC).

Wednesday, 13 October, 2021 - 04:00–04:15

Nick Spain, Stile Education

Abstract: 

You've written some great recording rules and alerts for your Prometheus monitoring system, and you've carefully recreated failure scenarios to check that the alerts fire. Awesome! Your app is never failing silently again! And yet, months later, you realize that your system has silently fallen over. How? The cron job that exports the metrics just didn't run, or the collector changed its labels: either way, the metrics are missing. Your Prometheus alerts aren't going to fire, and you won't know that the metrics have gone away. You could write alerts for the missing metrics manually, but that's a lot of toil and you don't trust yourself not to forget, so let's automate it! At Stile Education, we built a tool for generating these alerts automatically. Come along to find out what we did, why we did it, and how it's been useful in the 6 months since we introduced it.
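To make the idea concrete, here is a minimal sketch of what generating absent alerts could look like. It is not the tool built at Stile Education, whose design the abstract does not describe. It relies only on the PromQL absent() function, which returns a one-element vector when no series match its selector, and on the standard Prometheus rule-group layout. The function names, the metric names, the "10m" hold duration, and the severity label are all hypothetical choices made for illustration.

```python
import json
import re


def selector(metric, labels=None):
    """Render a PromQL instant-vector selector, e.g. up{job="api"}."""
    if not labels:
        return metric
    matchers = ",".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    return f"{metric}{{{matchers}}}"


def absent_rule(metric, labels=None, for_="10m"):
    """Build one alerting rule that fires when no series match the selector.

    The alert name, hold duration, and severity are illustrative assumptions.
    """
    sel = selector(metric, labels)
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", metric) if p]
    name = "Absent" + "".join(p.capitalize() for p in parts)
    return {
        "alert": name,
        "expr": f"absent({sel})",
        "for": for_,
        "labels": {"severity": "warning"},
        "annotations": {"summary": f"No series match {sel}"},
    }


def absent_group(metrics, group_name="absent-alerts"):
    """Wrap the generated rules in a Prometheus rule-group structure."""
    rules = [absent_rule(metric, labels) for metric, labels in metrics]
    return {"groups": [{"name": group_name, "rules": rules}]}


if __name__ == "__main__":
    # Hypothetical metrics that existing alerts are assumed to depend on.
    watched = [
        ("cron_job_last_run_timestamp_seconds", {"job": "exporter"}),
        ("up", {"job": "api"}),
    ]
    # Printed as JSON to avoid a third-party YAML dependency; convert to
    # YAML before using the output as a Prometheus rule file if preferred.
    print(json.dumps(absent_group(watched), indent=2))
```

In this sketch the list of watched metrics is written by hand. A real generator would more likely derive it from the selectors used in the existing alert and recording rules, which is the part that removes the toil.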

Nick Spain, Stile Education

Nick is a Software Engineer at Stile Education, where he helps build a platform that enables teachers to provide a world-class science education to their students. He loves automating things and getting out for a good hike.

SREcon21 Open Access Sponsored by Indeed

BibTeX
@conference {276725,
author = {Nick Spain},
title = {Watching the Watchers: Generating Absent Alerts for Prometheus},
year = {2021},
publisher = {USENIX Association},
month = oct
}
