Offsite Reliability Engineering: Towards the SRE Hivemind

Thursday, 31 October, 2024 - 11:45–12:05 GMT

Danny Kopping, coder.com

Abstract: 

The twin pillars of SRE and Observability have made the software world a more reliable place. And yet, our lives as SREs are growing more complicated by the day. We are responsible for an ever-increasing set of software we didn't write, usually popular OSS projects used by millions. Every project is different in the way it works, fails, and scales - and we have to find that out for each one the hard way!

What we need is an "SRE Hivemind": we need to codify our experiences of these projects into dashboards, alerts, and runbooks. We need to share our hard-earned experience with each other, to push back against complexity and embrace the collective. This is the next phase in the co-evolution of SRE and Observability; let's define it together.

Danny Kopping, coder.com

Danny is a Staff SWE at Coder.com, building a leading cloud development environment (CDE) product. In his previous role as an SRE at Grafana Labs, he was a maintainer of the Grafana Loki project and a contributor to Prometheus. He is based in Cape Town, South Africa.

BibTeX
@conference {302175,
author = {Danny Kopping},
title = {Offsite Reliability Engineering: Towards the {SRE} Hivemind},
year = {2024},
address = {Dublin},
publisher = {USENIX Association},
month = oct
}