Best Practices for When s*IT Hits the Fan
Dave Cliffe, PagerDuty
Outages suck; how you handle them shouldn’t. At PagerDuty, we talk to real customers experiencing real outages all the time. Operations escalations and downtime can be handled in many ways:
- During the incident: who to alert when, how to communicate, handling dependency and downstream failures, disclosure
- After the incident: post-mortems, public disclosure, formalizing process vs. investing in automation, preventative actions
There are also ways to keep engineers sane, customers happy, and the $$$ flowing. In this talk, come learn about best practices from across the industry, including how PagerDuty executes during an outage (but trust us, those never happen).
Dave Cliffe, PagerDuty
Dave is an engineer who has adopted a more peaceful role as "sherpa" on the Product team at PagerDuty, a company whose sole goal is to make the lives of DevOps engineers everywhere a calmer, sanity-filled reality. Before PagerDuty, Dave worked in cloud computing at Microsoft on the Windows Azure team. Frequently, he wonders which is scarier: being an on-call engineer responsible for an outage or being a parent. The debate rages on.
author = {Dave Cliffe},
title = {Best Practices for When {s*IT} Hits the Fan},
year = {2014},
address = {Seattle, WA},
publisher = {USENIX Association},
month = nov
}