“He who fights with monsters should be careful lest he thereby become a monster. And if thou gaze long into an abyss, the abyss will also gaze into thee.” – Friedrich Nietzsche
On October 4th 2021, Facebook experienced a total outage lasting over six hours. In scale and scope, this was one of the most significant Internet service outages ever seen. For many of us, the unavailability of Facebook, WhatsApp and Instagram is a minor inconvenience, but in parts of the world WhatsApp is the most widely-used communications platform, and it is also widely used for payments. In those places, the impact was severe.
Facebook’s outage was one of the most prolonged and complete outages we’ve seen for a major Internet service. In particular, Facebook’s authoritative nameservers were unreachable for a significant period, so Facebook’s DNS names could not be resolved. Facebook’s APIs are widely embedded in other websites and in mobile applications, and when Facebook disappeared from the Internet, all of these clients began to experience errors and repeatedly retried to resolve Facebook’s DNS names. This drove as much as a 30-fold increase in traffic to DNS resolvers. Some resolvers were unable to handle the load, and problems with DNS resolution became widespread. In effect, Facebook’s clients inadvertently mounted an enormous DDoS against the world’s DNS infrastructure, as a secondary effect of Facebook’s downtime. In general, the top-level DNS infrastructure and the large open resolvers such as OpenDNS and Quad9 were robust against this increased traffic.
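To make that amplification concrete, here is a minimal sketch (in Python, and purely illustrative: none of this is Facebook’s actual client code, and the function names and parameters are invented) of why clients that retry failed lookups immediately multiply the load on resolvers, and how exponential backoff with jitter limits the damage.

```python
import random
import time

# Illustrative sketch only -- not Facebook's client code. It models how a
# client that retries failed DNS lookups immediately multiplies the query
# load on resolvers, and how exponential backoff with jitter damps that.

def lookup_with_retries(resolve, max_attempts=5, base_delay=0.5, backoff=True):
    """Try a lookup up to max_attempts times, re-raising the final error."""
    for attempt in range(max_attempts):
        try:
            return resolve()
        except OSError:
            if attempt == max_attempts - 1:
                raise
            if backoff:
                # Exponential backoff with full jitter spreads retries out.
                time.sleep(random.uniform(0, base_delay * 2 ** attempt))


queries_sent = 0

def unreachable_nameserver():
    """Stand-in for querying an authoritative server that is offline."""
    global queries_sent
    queries_sent += 1
    raise OSError("SERVFAIL: authoritative nameservers unreachable")


if __name__ == "__main__":
    try:
        # No backoff: every failure is retried immediately.
        lookup_with_retries(unreachable_nameserver, max_attempts=5, backoff=False)
    except OSError:
        pass
    print(f"One failing client sent {queries_sent} queries instead of 1; "
          "multiply that by millions of apps and embedded widgets.")
```

Real DNS stacks and SDKs behave in more varied ways, but the shape of the problem is the same: every failing client quietly becomes several clients’ worth of load.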
Facebook is one of the largest distributed systems engineering organisations on the planet. Scale and redundancy cannot prevent all incidents — if anything, larger scale and the increased complexity and automation that go along with it introduce new failure modes and can make the ‘blast radius’ of failures that do happen larger. This is the dilemma of modern systems operation: our job now is to build automation that manages our systems, and when such automation goes awry, recovery may require manual processes that are rarely used and rusty (one of the Ironies of Automation identified by Lisanne Bainbridge in the 1980s).
When Nietzsche wrote about the perils of gazing into the abyss, he meant the danger of confronting evil and being changed by the experience. Our systems are not evil — even if they sometimes seem to be torturing us with problems. Nevertheless, we do have to confront the fact that there are some inherent problems in software operations that we have to live with and to continuously mitigate as best we can. Reliability is not a project to be completed: it is a program to be sustained and managed indefinitely. Furthermore, reliability requires humans: our ability to predict and avoid most failures — let’s not forget that for every incident that occurs, many more are avoided or detected and averted at an early stage — and our ability to mitigate, debug, and fix the ones that do slip through the net.
We need to gaze into our abyss. Because there is no silver bullet, the way we get better at avoiding and resolving incidents is by learning from each other. This has long been one of the significant benefits of working at a very large organisation such as Google, Facebook, or Amazon: simply because of the number and scale of the systems, you see a lot of incidents, bugs and problems, and you can learn a great deal about how software systems fail. A lot of what I know about software reliability I learned from attending incident reviews and reading internal post-incident reports while I was at Google.
Knowledge about software incidents can’t prevent all outages. I suspect it does help us proactively avoid a lot of problems, but we cannot count the incidents that do not occur. However, knowing a lot about past incidents can help us respond better when new ones occur. It’s very rare to see incidents repeat themselves, but they do often rhyme. As incident commanders or responders, we’re always under significant pressure to understand situations quickly and to make the best possible decisions, often under uncertainty. Understanding, for instance, that one is dealing with a cascading failure, and knowing how a variety of other cascading failures were resolved, can be very helpful.
Reading broadly about incidents and incident response can level up our skills in various ways, but this has been harder to do at smaller organisations, or at companies that don’t have a strong incident investigation culture. There has been no comprehensive public and searchable repository of incident reports and reviews, despite the efforts of people like Dan Luu to compile lists of outage reports, SRE Weekly’s regular outage highlights, and the occasional project-specific catalog such as Kubernetes Failure Stories.
Therefore, the launch this week of the Verica Open Incident Database (the VOID) is both timely and welcome. The VOID is a large repository of public incident reports and reviews, tagged with metadata about the impact of each incident, its duration, and the technologies involved. There’s a useful report about the broader trends apparent in the VOID’s data. It’s worth noting — as the report also points out — that public incident reports are not necessarily unbiased or complete. However, the VOID still has enormous value for practitioners as a way of learning about incidents.
The metadata and descriptions of the reports in the VOID are searchable, which is particularly useful — this is a great way to find reports of sharp edges or failure modes in technologies that you use. Reading reports of other organisations’ incidents involving the open source technologies or cloud services that your team relies on can be very helpful as part of a Production Readiness Review or when planning disaster testing. Many of the most detailed publicly available reports have been written by smaller organisations, and the VOID makes them far easier to find than ever before.
There’s no permanent or perfect solution for distributed systems reliability: only thousands of engineers, all around the world, doing their best to keep the wheels on, week after week and year after year. We can do this better when we learn together, and the VOID is now one of the essential tools for sharing our knowledge. All of us should gaze into the VOID.