The widespread outage that occurred on Friday 19 July as a result of a CrowdStrike configuration push that put Windows machines into a boot loop may well have been the largest digital systems availability incident the world has ever seen. The event affected airlines, payment systems, hospitals, emergency phone services, and many other services. Many people had a very bad day. It is likely that some people died or suffered lasting harm as a result of the loss of hospital and emergency services capacity.
CrowdStrike posted a description of the event a few days after it occurred [1]. CrowdStrike Falcon Sensor is an EDR (Endpoint Detection and Response) agent, which monitors devices for cybersecurity purposes. An invalid configuration file caused a crash in the CrowdStrike Falcon Sensor code, which runs during the Windows boot process, resulting in a boot loop. But, as is unfortunately common with such summaries, many questions are left unanswered. In particular, there is no real detail about how CrowdStrike tests changes to these configuration files before they are pushed or how this problem evaded those tests, and there is no discussion of CrowdStrike’s apparent decision to push such changes globally at a single point in time rather than using a more progressive rollout mechanism.
A more gradual rollout would, of course, potentially lengthen the window during which CrowdStrike’s customers are unprotected from a novel attack. With canarying or other forms of gradual rollout, subtle regressions can be difficult to catch without a very long rollout duration (multiple hours at least), but severe issues like this one are generally visible quite quickly, within minutes. It is a tradeoff: a short delay can add a significant degree of reliability.
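As a rough illustration of what a staged rollout buys you, here is a minimal sketch in Python (my own, with hypothetical stage sizes, soak times, and thresholds; it does not describe CrowdStrike’s or anyone else’s actual deployment tooling). Each stage widens the fraction of the fleet that receives the new content, and the rollout halts automatically if hosts updated so far stop reporting as healthy.

```python
import time

# Hypothetical staged rollout. Each stage widens the fraction of hosts
# that receive the new configuration, but only if the hosts updated so
# far are still reporting healthy. A defect that reliably crashes
# machines shows up as a collapse in the health rate during the first,
# smallest stage, long before the whole fleet is touched.
STAGES = [0.001, 0.01, 0.1, 0.5, 1.0]   # fraction of the fleet per stage
SOAK_SECONDS = 15 * 60                   # time to wait between stages
HEALTH_THRESHOLD = 0.99                  # minimum fraction of healthy hosts

def healthy_fraction(hosts):
    """Fraction of updated hosts that have checked in as healthy."""
    return sum(1 for h in hosts if h.is_healthy()) / max(len(hosts), 1)

def rollout(fleet, push_update):
    """Push an update to `fleet` in stages, halting on bad health signals."""
    updated = set()
    for stage in STAGES:
        for host in fleet[:int(len(fleet) * stage)]:
            if host not in updated:
                push_update(host)
                updated.add(host)
        time.sleep(SOAK_SECONDS)  # let problems surface before widening
        if healthy_fraction(updated) < HEALTH_THRESHOLD:
            raise RuntimeError("rollout halted: updated hosts are unhealthy")
```

Even with only the first, tiny stage, a configuration that boot-loops machines would trip the health check within a single soak period, bounding the damage to a fraction of a percent of the fleet rather than all of it.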
CrowdStrike’s preliminary incident report describes the buggy configuration file as “Rapid Response Content that is designed to respond to the changing threat landscape at operational speed” [1]. Rolling out these changes quickly is a key part of CrowdStrike’s value proposition, and it is normal work for engineers at CrowdStrike. Does CrowdStrike track the speed of such rollouts? Do staff have targets that must be met for getting such updates onto client machines? We may never know the answers to these questions, but it would be very interesting to know whether staff at CrowdStrike have ever advocated for a more gradual deployment mechanism for these configuration files and, if so, how that discussion went.
Many readers of ;login: are likely primarily users of various flavours of Unix and may not be fans of Windows. However, those of us in the Linux world should not consider ourselves immune from this kind of event. As Mark Twain said, “history doesn’t repeat itself, but it often rhymes”. From a broad-strokes technical perspective, what happened on Friday wasn’t entirely different to the DataDog outage from March 2023 [2]. In both cases, an automatic update was pushed globally within a short timeframe, resulting in widespread unavailability of machines, and thus of the services that run on them, with the affected services unable to recover without significant intervention by operators. In both cases, it was this inability of services to self-heal that exacerbated the severity of the incident, a pattern common to most significant outages.
Of course, there are differences, too. Based on their report [2], DataDog seems to have left their “legacy security update channel” (the description of which matches the unattended-upgrade tool [3]) enabled as an oversight, rather than as a conscious decision. CrowdStrike, by contrast, pushes its configuration updates immediately as a matter of design and gives local administrators no control over when those updates are applied; as noted above, shipping configuration updates quickly is a key part of its value proposition.
The pattern of events that CrowdStrike encountered, a client crashing due to a bad configuration, is not new. I have seen this pattern play out, and I expect most other seasoned software engineers have seen it too. Andy Ellis has described how Akamai encountered a similar issue 20 years ago, and how they solved it with a technique they called crash rejection [9]. Of course, unlike many other professions, software engineering has no agreed-upon body of knowledge: we learn on the job, or from stories that other engineers tell us. If there were a standard body of knowledge, crash rejection should certainly be part of it. Lacking that standardised body of knowledge, it is difficult to blame engineers for failing to anticipate a coordinated client failure with no possibility of automatic recovery.
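To illustrate the general idea (this is my own sketch of the technique, not Akamai’s or CrowdStrike’s implementation, and the file paths and function names are hypothetical): before loading newly delivered content, the agent records that an attempt is in progress; if it crashes before marking the attempt as successful, the next start detects the leftover marker, discards the new content, and falls back to the last known-good version.

```python
import os

# Hypothetical state files; a real agent would use its own state directory.
NEW_CONFIG = "/var/lib/agent/config.new"
GOOD_CONFIG = "/var/lib/agent/config.good"
ATTEMPT_MARKER = "/var/lib/agent/loading.marker"

def load_config_with_crash_rejection(parse_and_apply):
    """Load the newest config, rejecting it if a previous attempt crashed."""
    if os.path.exists(ATTEMPT_MARKER):
        # A previous run crashed while loading the new config: reject it.
        os.remove(ATTEMPT_MARKER)
        if os.path.exists(NEW_CONFIG):
            os.remove(NEW_CONFIG)
        return parse_and_apply(GOOD_CONFIG)

    if os.path.exists(NEW_CONFIG):
        # Record that we are attempting the new config before touching it.
        open(ATTEMPT_MARKER, "w").close()
        result = parse_and_apply(NEW_CONFIG)   # a crash here leaves the marker
        os.replace(NEW_CONFIG, GOOD_CONFIG)    # promote to known-good
        os.remove(ATTEMPT_MARKER)
        return result

    return parse_and_apply(GOOD_CONFIG)
```

For a kernel component that runs during boot, the mechanics are harder than this user-space sketch suggests, but the principle is the same: detect that a newly arrived input crashed you, and refuse to load it again.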
Good Intentions
Of course, nobody at CrowdStrike intended this incident to occur. Nobody goes to work to write code that will make 8.5 million Windows machines inoperative [4], or to push an invalid configuration. People go to work and, in general, do their jobs as well as they can with the time, resources, and knowledge that they have. I am sure that this is true of the engineers at CrowdStrike. But do those engineers have sufficient support (access to appropriate specialist expertise, dedicated testers, time to implement measures such as static analysis and fuzz testing) to do the quality of work necessary to run their system well? It would not surprise me if CrowdStrike’s issue tracker held several items prior to July 19th that later appeared in their preliminary incident review as planned action items [1].
Charles Perrow’s incredibly influential book Normal Accidents [5] is now 40 years old, but the arguments Perrow makes remain relevant. Most people recall Perrow’s argument that perfect reliability in complex systems is impossible, because such systems will occasionally exhibit unintended and surprising behaviour. However, Perrow had another thesis: most accidents are not unpredictable Normal Accidents. Most serious accidents are the result of some combination of mismanagement, inadequate resourcing, and production pressure. Examples are numerous, but include the Boeing 737 Max crashes [6], the Grenfell Tower fire [7], and the Deepwater Horizon blowout [8].
So was this a Normal Accident or a standard organisational accident? Software is a challenging domain for reliability. We can’t see our software’s internal state directly; we only have the observability that we have the foresight and the time to build in. Software has many dependencies. The state of a digital system can change incredibly fast, before human operators have any hope of understanding and reacting. It is also a young domain: we are still building the techniques that allow us to run digital systems more reliably and, as mentioned previously, we have not yet built a core body of knowledge that software engineers should possess. Still, many software outages (and many security issues) are avoidable and preventable with sufficient investment, good management, and appropriate expertise, although sustaining these conditions over time, in the face of commercial pressures, is an enormous organisational challenge at which many organisations have failed.
Regardless of whether the specific incident on July 19th was preventable or a Normal Accident, digital systems are and will remain complex systems with the potential for Normal Accidents. As organisations, and as societies, it would be wise to consider this when designing our systems. Any piece of software (or hardware) is potentially a failure domain. There is value in diversity. At a former employer, we used multiple kinds of routers in our networks, running with different chips, in order to ensure that a router-specific issue could not disable the entire network.
Monoculture
A significant part of the issue on July 19th was that so many organisations in particular sectors, such as aviation and healthcare, were reliant on a single software stack. At the organisational level, it certainly adds cost and effort to run two sets of tools, so that may not make sense for some organisations. But does it make sense for all hospitals in a region to use the same software? Or does this create unnecessary risk, because neighbouring hospitals cannot assist in the event of system downtime? Similar arguments might be made with respect to other key systems, such as emergency telephone services: these seem critical enough to maintain fully redundant backups. There are very real tradeoffs to be made here between costs and human lives and health.
Another lesson from the incident is that organisations can be remarkably resilient and can often do quite a bit without their usual IT systems. European budget airline Ryanair, for example, used paper manifests to board its planes. Organisations that prepare for these eventualities will be better able to cope when an outage strikes. The best way to understand the impact of losing IT systems, and what would be needed to cope in their absence, is to run drills in conditions as close to real-world as possible. Drills train your staff in the processes to follow during an outage and, even more crucially, help your organisation find problems, gaps, and ways to improve its emergency plans. This, of course, has a cost in staff time, and potentially a cost in ongoing work, such as keeping printouts of essential data up to date.
Cutting staff to the bone, however, reduces resilience. A viral social media post on July 19th featured the plight of a solo DevOps engineer tasked with recovering 2000 servers affected by the issue as quickly as possible. Whether a real story or not, it is certainly the case that having extra colleagues to call upon to resolve major problems is extremely helpful in a crisis. Organisations that choose to run very lean on staff run the risk of extending outages. This is a subtle effect of layoffs that does not become apparent until a crisis occurs.
One thing that is not likely to fix software availability issues is the knee-jerk creation of new regulation. Consider the example of CrowdStrike: its major selling point is that it provides a turnkey way for customers to comply with myriad certification and compliance requirements, which can be quite burdensome for organisations. The challenge of complying with those requirements has effectively created a captive market for CrowdStrike and a handful of competitors. This has systemic effects. Firstly, it creates centralisation, meaning that a problem in one large service causes many issues downstream. Secondly, it adds complexity to systems through the addition of agents, which increases the potential for outages and, paradoxically, even adds some new security risks. Think of the SolarWinds supply-chain attack.
Once an organisation has met a compliance requirement by installing an EDR solution, it may be less likely to consider measures that do not add so much complexity and risk, such as segmenting networks into multiple zones, managing user permissions in line with the principle of least privilege, and using multi-factor authentication and zero-trust approaches. The problem is that it is much more work to demonstrate the correct implementation of such organisation-specific interventions than it is to purchase and install a turnkey solution that regulators will be satisfied with.
Thanks to Eric Dobbs, Josh Kaderlan, and Alan Kraft for their thoughtful comments and review.