Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems

;login: Enters a New Phase of Its Evolution

For over 20 years, ;login: has been a print magazine with a digital version; in the two decades previous, it was USENIX’s newsletter, UNIX News. Since its inception 45 years ago, it has served as a medium through which the USENIX community learns about useful tools, research, and events from one another. Beginning in 2021, ;login: will no longer be the formally published print magazine as we’ve known it most recently, but rather reimagined as a digital publication with increased opportunities for interactivity among authors and readers.

Since USENIX became an open access publisher of papers in 2008, ;login: has remained our only content behind a membership paywall. In keeping with our commitment to open access, all ;login: content will be open to everyone when we make this change. However, only USENIX members at the sustainer level or higher, as well as student members, will have exclusive access to the interactivity options. Rik Farrow, the current editor of the magazine, will continue to provide leadership for the overall content offered in ;login:, which will be released via our website on a regular basis throughout the year.

As we plan to launch this new format, we are forming an editorial committee of volunteers from throughout the USENIX community to curate content, meaning that this will be a formally peer-reviewed publication. This new model will increase opportunities for the community to contribute to ;login: and engage with its content. In addition to written articles, we are open to other ideas of what you might want to experience.

Large, production-quality distributed systems still fail periodically, sometimes catastrophically where most or all users experience an outage or data loss. Conventional wisdom has it that these failures can only manifest themselves on large production clusters and are extremely difficult to prevent a priori, because these systems are designed to be fault tolerant and are well-tested. By investigating 198 user-reported failures that occurred on production-quality distributed systems, we found that almost all (92%) of the catastrophic system failures are the result of incorrect handling of non-fatal errors, and, surprisingly, many of them are caused by trivial mistakes such as error handlers that are empty or that contain expressions like "FIXME" or "TODO" in the comments. We therefore developed a simple static checker, Aspirator, capable of locating trivial bugs in error handlers; it found 143 new bugs and bad practices that have been fixed or confirmed by the developers.

Download Article:

Simple Testing Can Prevent Most Critical Failures

Article Section:

PROGRAMMING