A Tale of Two Postmortems: A Human Factors View

Wednesday, June 12, 2019 - 9:10 am–9:55 am

Tanner Lund, Microsoft

Abstract: 

Many companies become frustrated with their postmortem and incident review process, feeling that it is a burden, that it does not provide meaningful insights, or that the repairs and learnings it generates do not help prevent repeats or other incidents. Fortunately, there is a better way to do things, backed by decades of scientific rigor and proven in industries where outages can mean far worse than lost revenue.

Join our fictional company, "Potato Systems," as they deal with the aftermath of a catastrophic incident. As they struggle to learn from it and move forward, they, and we, will come to understand the stark contrast in outcomes and effectiveness of Safety I vs. Safety II thinking.

Tanner Lund has been a part of Azure's SRE organization from the beginning. He has worked in a variety of roles, including crisis management, developing SREBot, building data pipelines, and leading services through SRE/DevOps transitions. Throughout it all, his focus has been on understanding complex systems and how we achieve our goals through them, seeking to unlock their secrets.
