Courtney Eckhardt and Lex Neva, Heroku
Is the root cause really “human error”? How did your environment let the human make the error? How did their error take down the service? How many outages did humans prevent? Can your dev teams’ priorities be aligned with reliability, instead of only with churning out features?
At Heroku, we do ops as a service—reliability is our product. If we go down, we take thousands of businesses with us. In SRE, we push for reliability and resiliency in designs, sure, but it’s more than that. We iterate on process, automation, tooling, and incident response, because people are at the heart of everything we do.
Courtney Eckhardt, Heroku
Courtney comes from a background in customer support and internet anti-abuse policy. She combines this human-focused experience with the principle of Conway’s Law and the work of Kathy Sierra and Don Norman into a wide-reaching and humane concept of operational reliability.
Lex Neva, Heroku
Lex Neva is probably not a super-villain. He has six years of experience keeping large services running, including Linden Lab's Second Life, DeviantArt.com, and his current position as a Heroku SRE. While originally trained in computer science, he’s found that he most enjoys applying his software engineering skills to operations. A veteran of many large incidents, he has strong opinions on incident response, on-call sustainability, and reliable infrastructure design, and he currently runs SRE Weekly (sreweekly.com).
author = {Courtney Eckhardt and Lex Neva},
title = {{SRE}: It{\textquoteright}s People All the Way Down},
year = {2016},
address = {Boston, MA},
publisher = {USENIX Association},
month = dec
}