Colin Douch, Cloudflare
With the SRE revolution, alert runbooks and dashboards have become vital tools for engineering teams hoping to adopt better incident response strategies. Unfortunately, these tools are often used in ways that make them ineffective at this task. In particular, they are often created as knee-jerk responses to incidents, without thought as to where they fit into the overall landscape of the incident response. This leads to hyper-specific tooling that often masks the root causes of incidents and hinders an incident response rather than helping it.
In this talk, I will cover why creating dashboards and runbooks is such an attractive proposition to engineering teams, why it's so easy to fall into the specificity trap, why having these hyper-specific runbooks and dashboards is such an issue, and where these tools should instead fit into your incident response structure.
Colin currently tech leads the Observability Platform team at Cloudflare, orchestrating and inventing solutions to monitor and debug Cloudflare's infrastructure. Having started out in mining, he has been working in the observability space for close to 10 years with companies both big and small.
author = {Colin Douch},
title = {Dashboards and Runbooks: Scrapbooking for Engineers},
year = {2022},
address = {Sydney},
publisher = {USENIX Association},
month = dec
}