Blameless Incidents: Learning from Failure at Scale

Wednesday, October 30, 2019 - 2:45 pm–3:30 pm

Chip Turner, Facebook, Inc.

Abstract:

How a company handles outages is a conscious decision, and being intentional about the mindset you cultivate is critical to long-term reliability and operability. Building a culture that embraces crises as learning opportunities rather than failures is a crucial component of healthy Incident Management.

Facebook’s blameless, reflective approach tries to make the most of every outage, large and small. Our scalable Incident Management program is designed for incidents of all sizes, from full-site issues to minor, localized problems affecting small, non-critical services. This talk will discuss the cultural and technical challenges of maintaining an open culture that focuses on moving fast while keeping a high bar for operational excellence and reliability. We will explore the principles, tools, and processes we use to accomplish these goals, how we scale communication during incidents, and how our open-door review culture reinforces our blameless approach while still maintaining high standards.

Chip Turner is a Director of Engineering at Facebook, where he focuses on site reliability on the Web Foundation team. As a first responder for many years for incidents large and small, Chip has been involved with all phases of Incident Management. Chip has worked in both SWE and SRE roles, primarily on databases, storage, and caching systems in massively distributed environments.

BibTeX

@conference {240888,
author = {Chip Turner},
title = {Blameless Incidents: Learning from Failure at Scale},
year = {2019},
address = {Portland, OR},
publisher = {USENIX Association},
month = oct
}
