Effective Straggler Mitigation: Attack of the Clones
Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, and Ion Stoica, University of California, Berkeley
Small jobs, which are typically run for interactive data analyses in datacenters, continue to be plagued by disproportionately long-running tasks called stragglers. In the production clusters at Facebook and Microsoft Bing, even after applying state-of-the-art straggler mitigation techniques, these latency-sensitive jobs have stragglers that are on average 8 times slower than the median task in that job. Such stragglers increase the average job duration by 47%. This is because current mitigation techniques all involve an element of waiting and speculation. We instead propose full cloning of small jobs, avoiding waiting and speculation altogether. Cloning of small jobs only marginally increases utilization because workloads show that while the majority of jobs are small, they consume only a small fraction of the resources. The main challenge of cloning, however, is that extra clones can cause contention for intermediate data. We use a technique, delay assignment, which efficiently avoids such contention. Evaluation of our system, Dolly, using production workloads shows that small jobs speed up by 34% to 46% after state-of-the-art mitigation techniques have been applied, using just 5% extra resources for cloning.
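The intuition behind cloning can be illustrated with a minimal sketch. It assumes a job finishes when its slowest task finishes, and that a cloned task finishes as soon as its fastest copy does. The function name `job_duration` and the run times are hypothetical and chosen for illustration only; they are not taken from Dolly or from the production traces.

```python
from typing import Sequence


def job_duration(task_copy_times: Sequence[Sequence[float]]) -> float:
    """Return the completion time of a job.

    Each inner sequence holds the run times of all copies of one task.
    A task is done when its fastest copy finishes; the job is done when
    its slowest task is done.
    """
    return max(min(copies) for copies in task_copy_times)


# Hypothetical run times in seconds: three tasks, two copies each.
# The first copy of the second task is a straggler (80 s).
cloned = [[10.0, 12.0], [80.0, 11.0], [9.0, 10.0]]

# The same job without cloning: only the first copy of each task runs.
uncloned = [[copies[0]] for copies in cloned]

print(job_duration(uncloned))  # 80.0
print(job_duration(cloned))    # 11.0
```

In this toy case a single straggler dominates the uncloned job, while the cloned job is bounded by the fastest copy of each task. The sketch ignores the contention for intermediate data that the abstract identifies as the main challenge.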
@inproceedings{ananthanarayanan2013effective,
author = {Ganesh Ananthanarayanan and Ali Ghodsi and Scott Shenker and Ion Stoica},
title = {Effective Straggler Mitigation: Attack of the Clones},
booktitle = {10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13)},
year = {2013},
isbn = {978-1-931971-00-3},
address = {Lombard, IL},
pages = {185--198},
url = {https://www.usenix.org/conference/nsdi13/technical-sessions/presentation/ananthanarayanan},
publisher = {USENIX Association},
month = apr
}
by Jaeyeon Jung
This paper focuses on straggling tasks (tasks that run much longer than others, thus increasing the latency of the corresponding jobs) in cloud frameworks, and their adverse impact on small jobs. The paper first shows, using real production traces from Yahoo!, Facebook, and Bing, that most jobs have a small number of tasks and are therefore affected by stragglers. The paper then shows that existing straggler mitigation strategies are inefficient, especially when dealing with small jobs.
The new idea is to proactively clone at the task level, within a fixed resource utilization budget. The side effect of this approach is that cloned tasks can introduce additional contention within the job for intermediate data. Their system, Dolly, uses an approach called delay assignment to address this issue. The paper presents an extensive evaluation with Facebook and Bing traces, and shows impressive reductions in the overall running times of small jobs.
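Cloning within a fixed budget can be sketched as a simple admission rule. The policy below (smallest jobs first, a fixed number of copies per task, a budget expressed in extra task slots) is an assumption made for illustration; the summary does not describe Dolly's actual admission logic, and the names `plan_clones`, `budget_slots`, and `copies_per_task` are hypothetical.

```python
from typing import Dict


def plan_clones(
    job_sizes: Dict[str, int],
    budget_slots: int,
    copies_per_task: int = 2,
) -> Dict[str, int]:
    """Decide how many copies of each task to run for every job.

    job_sizes maps a job name to its number of tasks. Jobs are considered
    from smallest to largest (an assumed policy). A job is cloned only if
    the extra slots it needs still fit in the remaining budget; otherwise
    it runs with a single copy per task.
    """
    used = 0
    plan: Dict[str, int] = {}
    for name, num_tasks in sorted(job_sizes.items(), key=lambda kv: kv[1]):
        extra = num_tasks * (copies_per_task - 1)
        if used + extra <= budget_slots:
            used += extra
            plan[name] = copies_per_task
        else:
            plan[name] = 1
    return plan


# Hypothetical workload: two small jobs and one large job, with a budget
# of 10 extra slots reserved for clones.
plan = plan_clones({"a": 2, "b": 50, "c": 3}, budget_slots=10)
print(plan)  # {'a': 2, 'c': 2, 'b': 1}
```

Here the two small jobs are cloned using 5 of the 10 extra slots, while the large job would exceed the budget and runs uncloned. The sketch does not model delay assignment or contention for intermediate data.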
The reviewers uniformly felt that the paper was well executed with good ideas (intuitive, simple mechanism overall; empirically validated insight), and presented solid evaluation results.