Matt Provost, Yelp
The NHS is the United Kingdom's National Health Service, established in 1948 to provide free healthcare at point of service to all 64.6 million UK residents.
In England's National Health Service (NHS), a Never Event is a serious incident that "arise[s] from [the] failure of strong systemic protective barriers which can be defined as successful, reliable and comprehensive safeguards or remedies". The key criteria for defining Never Events are that they are preventable and have the potential to cause serious patient harm or death. All Never Events are reportable and undergo Root Cause Analysis to determine why the failure occurred and to prevent similar incidents from happening again.
Considering that the NHS is a healthcare service where incidents can have serious, life-threatening or life-changing consequences, and considering the scale of services provided (the NHS in England deals with over 1 million patients every 36 hours), its list of Never Events is quite short (14 events). It includes such items as “Wrong site surgery”, “Retained foreign object post-procedure”, and “Wrong route administration of medication”.
In our industry, the requirement for these events to be preventable would exclude things like DDoS attacks or security breaches, which are outside of the SRE team's direct control. Of course, steps should be taken to minimise or prevent these types of incidents, in the same way that doctors work to prevent patients from dying of cancer. But doctors don't cause cancer, so a patient dying of it is not a Never Event. However, a nurse administering the wrong type of cancer medication, giving cancer medication to the wrong patient, or delivering the medication via the wrong route (intravenous vs. spinal, etc.) can all be Never Events.
If there are insufficient processes in place to prevent such mistakes, then they cannot be Never Events. This system is designed to protect the staff as well as patients, so that they aren't put under pressure to be perfect. There must be procedures in place so that it doesn't come down to an individual to make all of the correct choices on their own.
Never Events are a fundamental part of the safety culture of the NHS, which is a "just culture that rejects blame as a tool." In recent years, modern systems safety concepts such as just culture and blameless postmortems have been introduced to the System Administration/Site Reliability Engineering/DevOps community from other fields (such as healthcare). However, the concept of defining specific Never Events has not been explored in this context, and it can bring benefits similar to those reported by the healthcare community, namely a reduction in the recurrence of such events.
Many systems engineering organisations already have their own formal or informal guidelines for reportable events. Publishing postmortems (either internally or public-facing) is now becoming standard practice in our industry, but not all of these events are Never Events. These incidents should be studied by each organisation after each postmortem to generate a list of failures that should never occur again because safety systems/protective barriers have been put in place to prevent them. Any occurrence of such an incident after the fact is therefore a Never Event.
The goal of implementing the Never Events system is firstly to reduce the number of these serious events, but also to protect staff and to provide a safe working environment. Repeated Never Events indicate that management has not addressed the underlying causes of these incidents, which shifts responsibility away from the front line staff who are operating in (clearly) unsafe conditions or with inadequate safety systems in place to prevent these events.
While each organisation will come up with its own list of Never Events for its specific environment, based on its examination and analysis of previous incidents, some generalisations can be made. For example, consider “Wrong Site Surgery” from the NHS list, where the wrong part of the body is operated on (left vs. right leg, etc.). This is a process failure, where the staff may perform the correct procedure but on the wrong location. Transferring that to the systems administration world, this is analogous to running the correct command on the wrong system.
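One possible protective barrier against running the correct command on the wrong system is to make the operator name the target before a destructive action proceeds. The following is a minimal sketch under that assumption, not a description of any existing tool: `confirm_target`, `guarded_reboot` and `WrongTargetError` are hypothetical names, and the wrapper only returns the command it would run rather than executing anything.

```python
import socket
from typing import Optional


class WrongTargetError(Exception):
    """Raised when the operator's confirmation does not match the host."""


def confirm_target(typed_name: str, actual_name: str) -> None:
    """Refuse to proceed unless the operator names the host they are on."""
    # Compare short hostnames only, ignoring case and surrounding whitespace.
    typed = typed_name.strip().lower().split(".")[0]
    actual = actual_name.strip().lower().split(".")[0]
    if not typed or typed != actual:
        raise WrongTargetError(
            f"confirmation {typed_name!r} does not match host {actual_name!r}"
        )


def guarded_reboot(typed_name: str, actual_name: Optional[str] = None) -> str:
    """Hypothetical wrapper: check the target, then return the command string."""
    # Default to the local hostname when no explicit target is supplied.
    host = actual_name if actual_name is not None else socket.gethostname()
    confirm_target(typed_name, host)
    # A real wrapper would execute the command here; this sketch only returns it.
    return f"reboot {host}"
```

The check lives in the tool rather than in the operator's memory, so avoiding the mistake does not depend on one individual making every correct choice on their own.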
During their careers, most (if not all) system administrators have made certain classes of similar mistakes such as rebooting the wrong server, removing the wrong directory (including the classic "rm -rf /") or executing a SQL DELETE statement without a WHERE clause. We will examine the steps the NHS has taken to prevent this type of "wrong site" incident, along with other Never Events. By learning from other industries we can come up with recommendations for preventing similar mistakes in our field.
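As an illustration of a barrier against the DELETE-without-WHERE class of mistake, the sketch below rejects such statements before they reach a database. It is a deliberately naive check built on regular expressions, not a SQL parser: a WHERE that appears only inside a string literal or comment would fool it. The names `is_unbounded_delete`, `check_statement` and `UnboundedDeleteError` are hypothetical.

```python
import re

# Matches a statement that begins with DELETE FROM (case-insensitive).
_DELETE_RE = re.compile(r"^\s*DELETE\s+FROM\b", re.IGNORECASE)
# Matches the keyword WHERE anywhere in the statement.
_WHERE_RE = re.compile(r"\bWHERE\b", re.IGNORECASE)


class UnboundedDeleteError(Exception):
    """Raised when a DELETE statement has no WHERE clause."""


def is_unbounded_delete(statement: str) -> bool:
    """Return True for a DELETE FROM statement with no WHERE keyword."""
    return bool(_DELETE_RE.match(statement)) and not _WHERE_RE.search(statement)


def check_statement(statement: str) -> str:
    """Return the statement unchanged, or raise if it is an unbounded DELETE."""
    if is_unbounded_delete(statement):
        raise UnboundedDeleteError(f"refusing to run: {statement.strip()!r}")
    return statement
```

A check like this would sit in whatever wrapper or review step an organisation already uses for ad hoc queries. Where it belongs depends on the environment.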
Matt Provost is an Engineering (SRE) Manager at Yelp, based in London. Prior to this he was the Systems Manager at Weta Digital in Wellington, New Zealand where he was responsible for the Top 500 supercomputers used to render such films as Avatar and the Hobbit trilogy. Matt has been a system and network administrator for over twenty years. He has a BA from Indiana University, Bloomington.
Matt Provost. Never Events. USENIX Association, San Francisco, CA, October 2017.