David D. Woods, Ohio State University and Adaptive Capacity Labs; Laura Nolan, Slack
Software systems are brittle in various ways, and prone to failures. We can sometimes improve the robustness of our software systems, but true resilience always requires human involvement: people are the only agents that can detect, analyze, and fix novel problems.
But this is not easy in practice. Woods' Theorem states that as the complexity of a system increases, the accuracy of any single agent's own model of that system (their 'process feel') decreases rapidly. This matters because we work in teams, and a sustainable on-call rotation requires several people.
This talk brings a researcher and a practitioner together to discuss some Resilience Engineering concepts as they apply to SRE, with a particular focus on how teams can systematically approach sharing experiences about anomalies in their systems and create ongoing learning from 'weak signals' as well as major incidents.
David D. Woods, Ohio State University
David Woods (Ph.D., Purdue University) has worked to improve systems safety in high-risk complex settings for 40 years. These include studies of human coordination with automated and intelligent systems and accident investigations in aviation, nuclear power, critical care medicine, crisis response, military operations, and space operations. Beginning in 2000-2003, as part of the response to several NASA accidents, he developed Resilience Engineering, which addresses the dangers of brittle systems and the need to invest in sustaining sources of resilience. His results on proactive safety and resilience are in the book Resilience Engineering (2006). He developed the first comprehensive theory on how systems can build the potential for resilient performance despite complexity. Recently, he started the "SNAFU Catchers Consortium," an industry-university partnership to build resilience in critical digital services.
Laura Nolan, Slack Technologies
Laura Nolan is a Senior Staff Engineer and tech lead at Slack, working mainly on service networking and ingress load balancing, as well as occasionally writing outage reports for the Slack Engineering blog. Laura has contributed to a number of books on SRE, including Site Reliability Engineering: How Google Runs Production Systems, Seeking SRE, and 97 Things Every SRE Should Know. She also regularly writes for USENIX's ;login: magazine, and is a member of the USENIX board and SREcon Steering Committee.
SREcon21 Open Access Sponsored by Indeed
author = {David D. Woods and Laura Nolan},
title = {You{\textquoteright}ve Lost That Process Feeling: Some Lessons from Resilience Engineering},
year = {2021},
publisher = {USENIX Association},
month = oct
}