Bobtail: Avoiding Long Tails in the Cloud

Yunjing Xu; Zachary Musgrave; Brian Noble; Michael Bailey

Authors:

Yunjing Xu, Zachary Musgrave, Brian Noble, and Michael Bailey, University of Michigan

Abstract:

Highly modular data center applications such as Bing, Facebook, and Amazon’s retail platform are known to be susceptible to long tails in response times. Services such as Amazon’s EC2 have proven attractive platforms for building similar applications. Unfortunately, virtualization used in such platforms exacerbates the long tail problem by factors of two to four. Surprisingly, we find that poor response times in EC2 are a property of nodes rather than the network, and that this property of nodes is both pervasive throughout EC2 and persistent over time. The root cause of this problem is co-scheduling of CPU-bound and latency-sensitive tasks. We leverage these observations in Bobtail, a system that proactively detects and avoids these bad neighboring VMs without significantly penalizing node instantiation. With Bobtail, common communication patterns benefit from reductions of up to 40% in 99.9th percentile response times.

BibTeX

@inproceedings {180314,
author = {Yunjing Xu and Zachary Musgrave and Brian Noble and Michael Bailey},
title = {Bobtail: Avoiding Long Tails in the Cloud},
booktitle = {10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13)},
year = {2013},
isbn = {978-1-931971-00-3},
address = {Lombard, IL},
pages = {329--341},
url = {https://www.usenix.org/conference/nsdi13/technical-sessions/presentation/xu_yunjing},
publisher = {USENIX Association},
month = apr
}

Public Summary:

by George Porter

Shared cloud infrastructures provide application developers with numerous advantages, including the ability to seamlessly scale up or down resources in response to demand, and the ability to focus on developing their applications, without having to purchase, manage, or wait for dedicated data center resources. However, these advantages come at a cost, namely that the developer can no longer reason about dedicated compute, storage, and networking resources. This paper addresses a particularly important and problematic instance of this problem, namely high "tail latency," or the 99.9th percentile of service response time. For distributed applications that compose numerous network requests into a single, logical operation, tail latency is particularly important, since it often determines the overall response time of the application as a whole. Designing performant applications requires minimizing the latency variance of individual network operations, which is a challenge in shared cloud environments.
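To make concrete why the tail dominates a composed operation, here is a small illustrative calculation. It is not taken from the paper: it assumes each sub-request is independently "slow" with a fixed probability (0.1%, matching the 99.9th percentile), and the function name `prob_any_slow` is hypothetical.

```python
def prob_any_slow(fanout: int, per_request_slow_prob: float) -> float:
    """Probability that at least one of `fanout` sub-requests is slow.

    Assumption (not from the paper): sub-requests are independent and each
    exceeds the latency threshold with the same probability.
    """
    if fanout < 0:
        raise ValueError("fanout must be non-negative")
    if not 0.0 <= per_request_slow_prob <= 1.0:
        raise ValueError("per_request_slow_prob must be in [0, 1]")
    # Complement of "every sub-request is fast".
    return 1.0 - (1.0 - per_request_slow_prob) ** fanout


if __name__ == "__main__":
    # A request slower than the 99.9th percentile happens 0.1% of the time.
    for fanout in (1, 10, 100, 1000):
        p = prob_any_slow(fanout, 0.001)
        print(f"fan-out {fanout:>4}: P(at least one slow sub-request) = {p:.4f}")
```

Under these assumptions, an operation that waits on 100 sub-requests is delayed by a tail event in roughly one out of ten cases, which is why a rare per-request event can determine the response time of the application as a whole.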

In this paper, the authors have made two primary contributions. First, they have carried out a multi-week measurement study on the popular Amazon EC2 cloud platform to determine the source of tail latency. The study found that it is not the network topology or congestion that is to blame, but rather the node, and in particular the virtual machine scheduler. When latency-sensitive and CPU-intensive VMs are co-scheduled together, the result is high tail latency. The program committee was particularly impressed with the care that went into this measurement study, and how through the study the authors were able to determine that a simple node-local test (measuring the accuracy of timer events) predicts the expected latency of network operations. While it has generally been widely known that shared cloud environments exhibit higher tail latency than dedicated data centers, the program committee was pleased to see a detailed, quantitative analysis of the specific sources of that latency and their effect on observed application performance.
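The idea of a node-local timer-accuracy test can be sketched roughly as follows. This is only an illustration of the general approach, not the authors' actual test: the function names, sample count, sleep interval, and both thresholds are assumptions chosen for the example.

```python
import time
from typing import List


def timer_overshoots(samples: int = 200, interval_s: float = 0.001) -> List[float]:
    """Sleep repeatedly and record how late each wake-up was, in seconds.

    A wake-up that arrives much later than requested suggests the VM was
    descheduled while the timer was pending.
    """
    overshoots = []
    for _ in range(samples):
        start = time.perf_counter()
        time.sleep(interval_s)
        elapsed = time.perf_counter() - start
        # Clamp at zero so clock granularity never yields a negative delay.
        overshoots.append(max(0.0, elapsed - interval_s))
    return overshoots


def looks_noisy(overshoots: List[float],
                late_threshold_s: float = 0.010,
                max_late_fraction: float = 0.01) -> bool:
    """Flag a node when too many wake-ups were late.

    Both thresholds are hypothetical values for illustration only.
    """
    if not overshoots:
        raise ValueError("need at least one sample")
    late = sum(1 for o in overshoots if o > late_threshold_s)
    return late / len(overshoots) > max_late_fraction


if __name__ == "__main__":
    delays = timer_overshoots()
    print(f"worst wake-up delay: {max(delays) * 1000:.2f} ms")
    print("node looks noisy" if looks_noisy(delays) else "node looks quiet")
```

A probe like this needs no network traffic and no cooperation from the cloud provider, which is the property the summary highlights; any real deployment would have to calibrate the thresholds against measured latency.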

The resulting node-local test forms the basis of this paper's second major contribution, namely the Bobtail system. With Bobtail, cloud users are able to select a subset of their nodes that are likely to exhibit low tail latency, and do so without support from the cloud provider. This is critical for developers that target black-box cloud infrastructure or spread their applications across heterogeneous cloud providers. One concern raised by the program committee, and echoed by the authors themselves, is avoiding a "race to the bottom" scenario where multiple users relying on Bobtail pursue the same set of resources, reducing the overall utility of the infrastructure. The authors point out that their findings show that it is not VM co-scheduling that results in high tail latency, but specifically co-scheduling latency-sensitive and CPU-intensive workloads together. The authors' commendable efforts in measuring EC2 do not necessarily carry over to other cloud providers; however, their findings show that the primary sources of tail latency are hypervisor-specific, meaning that Bobtail is likely applicable to any cloud relying on Xen. Further experience with other platforms might bear out this hypothesis.

In summary, providing users with tools to better reason about the properties of shared cloud infrastructures is critical to their adoption, and Bobtail is a step in that direction.
