Improving Availability in Distributed Systems with Failure Informers
Joshua B. Leners and Trinabh Gupta, The University of Texas at Austin; Marcos K. Aguilera, Microsoft Research Silicon Valley; Michael Walfish, The University of Texas at Austin
This paper addresses a core question in distributed systems: how should applications be notified of failures? When a distributed system acts on failure reports, the system’s correctness and availability depend on the granularity and semantics of those reports. The system’s availability also depends on coverage (failures are reported), accuracy (reports are justified), and timeliness (reports come quickly). This paper describes Pigeon, a failure reporting service designed to enable high availability in the applications that use it. Pigeon exposes a new abstraction, called a failure informer, which allows applications to take informed, application-specific recovery actions, and which encapsulates uncertainty, allowing applications to proceed safely in the presence of doubt. Pigeon also significantly improves over the previous state of the art in the three-way trade-off among coverage, accuracy, and timeliness.
@inproceedings{leners,
author = {Trinabh Gupta and Joshua B. Leners and Marcos K. Aguilera and Michael Walfish},
title = {Improving Availability in Distributed Systems with Failure Informers},
booktitle = {10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13)},
year = {2013},
isbn = {978-1-931971-00-3},
address = {Lombard, IL},
pages = {427--441},
url = {https://www.usenix.org/conference/nsdi13/technical-sessions/presentation/leners},
publisher = {USENIX Association},
month = apr
}
by Katerina Argyraki
This paper addresses one of the quintessential questions in system design: how much and what kind of information should applications get about lower-layer failures? Existing systems do not typically expose such information in a systematic way, leaving applications to detect failures themselves through end-to-end timeouts. The rationale behind this design choice was that the benefit applications would gain from explicit failure reports was not worth the cost of the mechanism needed to provide them. The paper argues that things have changed, and it is time our perspective on failure reporting did, too: these days, failures that reduce application availability can bear a significant financial cost; hence, we need to provide applications with the information they need to mitigate the impact of these failures as much as possible.
The first piece of the proposed solution is a "failure informer" interface for exposing host and network failures to applications. What this interface reports about each failure is (1) whether it has certainly occurred or is imminent and (2) whether it is certainly permanent or not. This classification results in four failure types: a "stop" (the target process has stopped executing and lost its volatile state), an "unreachability" (the target process may still be running but the client cannot reach it), a "stop warning" (the target process may soon stop, as it is running out of a critical resource), and an "unreachability warning" (the target process may soon become unreachable). The interface also reports the expected duration of the failure and, in the case of a warning, the critical resource responsible for it.
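To make the two-axis classification concrete, here is a minimal sketch of the failure-informer interface in Python. The type, field, and function names (`FailureType`, `FailureReport`, `classify`) are hypothetical; the paper defines the semantics of the four failure types, not this exact API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class FailureType(Enum):
    # certain and permanent: process stopped, volatile state lost
    STOP = "stop"
    # certain but possibly transient: target cannot be reached
    UNREACHABILITY = "unreachability"
    # imminent and permanent: a critical resource is running out
    STOP_WARNING = "stop warning"
    # imminent and possibly transient
    UNREACHABILITY_WARNING = "unreachability warning"

@dataclass
class FailureReport:
    kind: FailureType
    expected_duration_s: float            # how long the failure is expected to last
    critical_resource: Optional[str] = None  # set only for warnings (e.g., "memory")

def classify(certain: bool, permanent: bool) -> FailureType:
    """Map the interface's two axes -- certain vs. imminent, and
    certainly permanent vs. not -- onto the four failure types."""
    if certain:
        return FailureType.STOP if permanent else FailureType.UNREACHABILITY
    return FailureType.STOP_WARNING if permanent else FailureType.UNREACHABILITY_WARNING
```

An application receiving `FailureReport(FailureType.STOP, 0.0)` can safely reclaim resources held by the target process, whereas an `UNREACHABILITY` report only licenses rerouting, since the process may still be running.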
The second piece of the proposed solution is "Pigeon," a service that implements the failure-informer interface in the context of a single administrative domain running Open Shortest Path First (OSPF). Pigeon consists of several components running at end-hosts and routers: "sensors" detect faults (e.g., process exits, host or router reboots, link failures or overloads), "relays" communicate these faults to the interested clients, and an "interpreter" processes these faults and turns them into failure reports.
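The sensor/relay/interpreter pipeline can be sketched as a small publish-subscribe flow. All names here (`Fault`, `Relay`, `Interpreter`, the rule table) are illustrative; the real Pigeon components are distributed across end-hosts and routers rather than running in one process as below.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Fault:
    source: str   # which sensor saw it, e.g., "host-sensor" or "router-sensor"
    kind: str     # raw fault, e.g., "process-exit", "link-down", "low-memory"
    target: str   # the process or link the fault concerns

class Interpreter:
    """Turns raw sensor faults into application-level failure reports."""
    RULES = {
        "process-exit": "stop",
        "link-down": "unreachability",
        "low-memory": "stop warning",
        "link-overload": "unreachability warning",
    }

    def interpret(self, fault: Fault) -> str:
        return self.RULES.get(fault.kind, "unknown")

class Relay:
    """Forwards faults from a sensor to the interpreters of interested clients."""
    def __init__(self) -> None:
        self.subscribers: List[Callable[[Fault], None]] = []

    def subscribe(self, callback: Callable[[Fault], None]) -> None:
        self.subscribers.append(callback)

    def publish(self, fault: Fault) -> None:
        for callback in self.subscribers:
            callback(fault)
```

In this sketch a client subscribes its interpreter to a relay; when a sensor publishes a "process-exit" fault, the client receives a "stop" report it can act on.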
Does the failure-informer interface provide information that is necessary and sufficient for applications to respond to lower-layer failures in the best possible way? Is the benefit for applications worth the complexity of embedding sensors and relays in every end-host and router? These questions are impossible to answer without real deployment experience—or, at the very least, real-world failure data, which is rarely available to researchers. But the authors have done the next best thing: They have tested Pigeon with three real applications (running in a small number of virtual machines interconnected in a fat tree). Their evaluation shows that Pigeon helps a storage system avoid unnecessary failovers, a key-value store recover from a link failure several seconds faster, and a lease-based replication system reclaim the lease held by a crashed process several seconds earlier. This is promising evidence that explicit failure reports can significantly improve application availability.