Workshop Program

All sessions will be held in Hollywood Ballroom Studio A unless otherwise noted.

8:30 a.m.–9:30 a.m. Sunday

Keynote

To Err Is Human, to Log Divine: Expediting Production Failure Diagnosis with Better Logging

Ding Yuan, University of California, San Diego; University of Illinois at Urbana-Champaign; University of Toronto

When systems fail in the field, logged data are frequently the only evidence available for support engineers and developers to assess and diagnose the underlying cause. Consequently, the efficacy of such logging data is a matter of significant practical importance. We have empirically studied tens of thousands of log messages and hundreds of production failures from several widely used systems, and built several tools for log automation and postmortem log analysis. In this talk, I will summarize our experience exploring questions such as "How much do log messages really help in debugging?", "Are they good enough?", "What are the opportunities for improving log quality?", "Can we automatically improve log messages?", and "How can we automate log inference?" I will also discuss where the greatest opportunities for impact are likely to be found in the future.

This talk is based on joint work with: Y. Zhou, P. Huang, S. Park, J. Zheng, H. Mai, Y. Liu, M. Lee, X. Tang, W. Xiong, L. Tan, S. Savage, and S. Pasupathy.

Ding Yuan is a graduating Ph.D. candidate at the University of Illinois at Urbana-Champaign and a visiting student at the University of California, San Diego. He will join the University of Toronto as an assistant professor in 2013. His research focuses on practical approaches for failure diagnosis via log messages. He has received two ASPLOS best paper nominations, an ACM SIGSOFT Distinguished Paper Award, an Outstanding Teaching Assistant Award, and a Saburo Muroga Fellowship. His research systems for failure diagnosis have been requested for release by large vendors including Cisco, EMC, Huawei, NetApp, and Qualcomm.

9:30 a.m.–10:30 a.m. Sunday

Recommendation Systems

Towards a Data Analysis Recommendation System

Sara Alspaugh, University of California, Berkeley; Archana Ganapathi, Splunk, Inc.

System data is abundant, yet data-driven decision making is currently more of an art than a science. Many organizations rely on data analysis for problem detection and diagnosis, but the process remains custom and ad hoc. In this paper, we examine the analytics process users undertake to mine large data sets and try to characterize these searches by the operations performed. Furthermore, we take a first stab at a methodical process for automatically suggesting operations, based on statistical analysis of previously performed searches.
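One simple way to suggest operations from past searches is to count how often one operation has followed another in previously logged analysis pipelines and propose the most frequent successors. The sketch below illustrates that idea; the operation names and logged pipelines are invented for illustration, not drawn from the paper's data set.

```python
from collections import Counter, defaultdict

def build_transition_model(past_searches):
    """Count how often one operation follows another across logged pipelines."""
    transitions = defaultdict(Counter)
    for pipeline in past_searches:
        for current_op, next_op in zip(pipeline, pipeline[1:]):
            transitions[current_op][next_op] += 1
    return transitions

def suggest_next(transitions, current_op, k=3):
    """Return up to k operations that most frequently followed current_op."""
    return [op for op, _ in transitions[current_op].most_common(k)]

# Hypothetical logged search pipelines (operation names are illustrative).
past = [
    ["search", "filter", "stats"],
    ["search", "filter", "chart"],
    ["search", "stats", "sort"],
    ["search", "filter", "stats"],
]
model = build_transition_model(past)
print(suggest_next(model, "filter"))  # most common follow-ups to "filter"
```

A real system would weight suggestions by more than raw bigram frequency (e.g., by data characteristics), but the transition-counting core is the same.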

Mojave: A Recommendation System for Software Upgrades

Rekha Bachwani, Rutgers University; Olivier Crameri, EPFL; Ricardo Bianchini, Rutgers University; Willy Zwaenepoel, EPFL

Software upgrades are frequent. Unfortunately, many of them either fail or misbehave. We argue that many of these failures can be avoided for new users of each upgrade by exploiting the characteristics of the upgrade and feedback from the users who have already installed it. To demonstrate that this can be achieved, we build Mojave, the first recommendation system for software upgrades. Mojave leverages data from existing and new users, machine learning, and static and dynamic source analyses. For each new user, Mojave computes the likelihood that the upgrade will fail for that user and, based on this value, recommends for or against the upgrade. We evaluate Mojave on two real upgrade problems with the OpenSSH suite. Initial results show that it provides accurate recommendations.
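A minimal sketch of the recommendation idea: estimate a new user's failure likelihood from the reports of the most similar prior users. This is a similarity-based stand-in, not Mojave's actual model (which combines machine learning with static and dynamic source analyses), and the environment feature sets below are invented.

```python
def jaccard(a, b):
    """Similarity between two sets of environment features."""
    return len(a & b) / len(a | b) if a | b else 0.0

def failure_likelihood(new_user_env, reports, k=3):
    """Estimate P(upgrade fails) for a new user as the failure rate among
    the k prior users with the most similar environments."""
    ranked = sorted(reports, key=lambda r: jaccard(new_user_env, r[0]),
                    reverse=True)
    top = ranked[:k]
    return sum(failed for _, failed in top) / len(top)

# Hypothetical feedback: (environment feature set, did the upgrade fail?)
reports = [
    ({"linux", "openssl-1.0", "pam"}, 1),
    ({"linux", "openssl-1.0"}, 1),
    ({"bsd", "openssl-0.9"}, 0),
    ({"bsd", "kerberos"}, 0),
]
p = failure_likelihood({"linux", "openssl-1.0"}, reports)
print("recommend against upgrade" if p > 0.5 else "recommend upgrade")
```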

10:30 a.m.–11:00 a.m. Sunday

Break

Hollywood Ballroom Foyer

11:00 a.m.–12:30 p.m. Sunday

Managing the Cloud

Vayu: Learning to Control the Cloud

Ira Cohen, Ohad Assulin, Eli Mordechai, Yaniv Sayers, and Ruth Bernstein, HP Software

In this paper we describe Vayu, a system for managing cloud applications from the performance, availability, and capacity standpoints. The system automatically learns the behavior of cloud applications and the remediation actions required to avoid and resolve problems that may arise: it detects problems, creates signatures for them, and maps the signatures to a finite set of automated remediation actions available in cloud environments. Vayu is built on a set of algorithms: anomaly detection, problem-signature construction and similarity matching, and classification-based action learning.
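The signature-to-action mapping can be pictured as nearest-signature classification: represent a detected problem as a vector of anomaly scores and return the remediation action of the most similar known signature. The sketch below is illustrative only; the metrics, signatures, and action names are invented, and Vayu's actual classifiers are more involved.

```python
import math

def cosine(a, b):
    """Cosine similarity between two anomaly-score vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical known signatures: anomaly scores over
# (cpu, memory, latency, error_rate) -> remediation action.
known = [
    ((1.0, 0.1, 0.8, 0.2), "scale_out"),         # CPU saturation
    ((0.1, 1.0, 0.3, 0.1), "restart_instance"),  # memory leak
    ((0.2, 0.1, 0.9, 0.9), "rollback_deploy"),   # bad release
]

def recommend_action(signature):
    """Return the remediation action of the closest known signature."""
    return max(known, key=lambda kv: cosine(signature, kv[0]))[1]

print(recommend_action((0.9, 0.2, 0.7, 0.1)))
```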

A Framework for Thermal and Performance Management

Davide Basilio Bartolini, Politecnico di Milano; Filippo Sironi, Massachusetts Institute of Technology; Martina Maggio, Lund University; Riccardo Cattaneo and Donatella Sciuto, Politecnico di Milano; Marco Domenico Santambrogio, Politecnico di Milano and Massachusetts Institute of Technology

In modern computing facilities, the use of power-hungry devices drives operating temperatures ever higher, creating a need for cost-effective heat dissipation solutions that guarantee proper operating temperatures. Within this context, dynamic thermal management (DTM) techniques can be highly beneficial in proactively controlling heat dissipation and avoiding overheating. The large-scale adoption of DTM may eventually allow the use of more cost-effective heat dissipation systems, with significant power-consumption advantages for large datacenters.

Preventive thermal management is a technique for achieving long-term thermal control via performance degradation. However, this may result in impaired Quality of Service (QoS) and broken Service Level Agreements (SLAs). We address this problem by proposing a self-adaptive framework that combines performance and thermal management, targeting Chip Multi-Processors (CMPs). The proposed methodology harnesses control-theoretical controllers to drive idle-cycle injection and thread-priority adjustment, providing control over the processor temperature while taking applications' QoS (in terms of performance) into account. We implemented our framework in the FreeBSD operating system and evaluated it on real hardware, also comparing it with a previous framework for preventive DPTM.
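The flavor of control-driven idle-cycle injection can be sketched with a plain proportional controller: the further the measured temperature sits above its setpoint, the larger the fraction of idle cycles injected. The gains and the one-line thermal model below are invented for illustration; the paper's controllers are formally designed, not hand-tuned like this.

```python
def idle_cycle_controller(temp, setpoint, gain=0.05, min_idle=0.0, max_idle=0.8):
    """Proportional controller: map the temperature error to an idle-cycle
    fraction, clamped to a feasible range."""
    error = temp - setpoint
    idle_fraction = gain * error
    return min(max(idle_fraction, min_idle), max_idle)

# Simulate a core heating under load; injected idle cycles cool it down.
temp, setpoint = 85.0, 70.0
trace = []
for _ in range(20):
    idle = idle_cycle_controller(temp, setpoint)
    temp += 1.0 - 25.0 * idle   # crude thermal model: load heats, idling cools
    trace.append(round(temp, 2))
print(trace[-1])  # settles just above the 70-degree setpoint
```

The steady-state offset (the loop settles near 70.8 rather than exactly 70.0) is the classic limitation of purely proportional control; integral action would remove it.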

Transparent System Call Based Performance Debugging for Cloud Computing

Nikhil Khadke, Michael P. Kasick, Soila P. Kavulya, Jiaqi Tan, and Priya Narasimhan, Carnegie Mellon University

Problem diagnosis and debugging in distributed environments, such as the cloud and popular distributed systems frameworks, has long been a hard problem. We evaluate a novel way of debugging distributed systems, such as the MapReduce framework, using system calls. Performance problems in such systems can be hard to diagnose and to localize to a specific node or set of nodes. Additionally, most debugging systems rely on forms of instrumentation and signatures (logs or application traces, for example) that sometimes cannot truthfully represent the state of the system. We instead focus on performance debugging of these frameworks at a low level of abstraction: system calls. By focusing on a small set of system calls, we extrapolate meaningful information about the control flow and state of the framework, providing accurate and meaningful automated debugging.
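One common way to localize a performance problem from syscall timings is peer comparison: nodes in a MapReduce-style framework do similar work, so a node whose syscall durations deviate sharply from its peers is suspect. The sketch below flags such a node using a robust median-based deviation; the per-node durations and the specific statistic are illustrative, not the paper's exact method.

```python
from statistics import median

def flag_anomalous_nodes(durations_by_node, threshold=3.0):
    """Flag nodes whose median syscall duration deviates far from the
    cross-node median, scaled by the median absolute deviation."""
    medians = {node: median(d) for node, d in durations_by_node.items()}
    overall = median(medians.values())
    spread = median(abs(m - overall) for m in medians.values()) or 1e-9
    return [n for n, m in medians.items() if abs(m - overall) / spread > threshold]

# Hypothetical per-node read() durations (ms); node3 has a slow disk.
durations = {
    "node1": [1.0, 1.2, 0.9, 1.1],
    "node2": [1.1, 0.8, 1.0, 1.2],
    "node3": [9.5, 10.2, 8.9, 11.0],
    "node4": [0.9, 1.0, 1.1, 1.0],
}
print(flag_anomalous_nodes(durations))
```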

12:30 p.m.–2:00 p.m. Sunday

Workshop Luncheon

Hollywood Studios CDE

2:00 p.m.–3:30 p.m. Sunday

Tracing

Monitoring the Dynamics of Network Traffic by Recursive Multi-Dimensional Aggregation

Midori Kato, Keio University; Kenjiro Cho, IIJ/Keio University; Michio Honda, NEC Europe Ltd.; Hideyuki Tokuda, Keio University

A promising way to capture the characteristics of changing traffic is to extract significant flow clusters from the traffic. However, clustering flows by 5-tuple requires flow matching in huge flow-attribute spaces and is thus difficult to perform on the fly. We propose an efficient yet flexible flow aggregation technique for monitoring the dynamics of network traffic. Our scheme employs two-stage flow aggregation. The primary aggregation stage efficiently processes a huge volume of raw traffic records: it first aggregates each attribute of the 5-tuple separately, and then produces multi-dimensional flows by matching each attribute of a flow to the resulting aggregated attributes. The secondary aggregation stage provides flexible views to operators, performing multi-dimensional aggregation with the R-tree algorithm to produce concise summaries. We report our prototype implementation and preliminary results using traffic traces from backbone networks.
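The primary stage can be sketched as follows: tally bytes per value of each 5-tuple attribute independently, keep only values above a volume threshold, then re-key every flow using the significant values and a wildcard for the rest. This is a simplified reading of the scheme with invented flow records and thresholds; the real implementation is considerably more efficient.

```python
from collections import Counter

def primary_aggregate(flows, threshold):
    """Aggregate each 5-tuple attribute separately, then map every flow onto
    the significant attribute values, wildcarding insignificant ones."""
    fields = ["src", "dst", "proto", "sport", "dport"]
    # Per-attribute byte counts.
    volume = {f: Counter() for f in fields}
    for flow in flows:
        for f in fields:
            volume[f][flow[f]] += flow["bytes"]
    significant = {f: {v for v, b in volume[f].items() if b >= threshold}
                   for f in fields}
    # Multi-dimensional flows: each attribute is a significant value or '*'.
    aggregated = Counter()
    for flow in flows:
        key = tuple(flow[f] if flow[f] in significant[f] else "*" for f in fields)
        aggregated[key] += flow["bytes"]
    return aggregated

flows = [
    {"src": "10.0.0.1", "dst": "10.0.0.9", "proto": "tcp", "sport": 5501, "dport": 80, "bytes": 900},
    {"src": "10.0.0.2", "dst": "10.0.0.9", "proto": "tcp", "sport": 5502, "dport": 80, "bytes": 800},
    {"src": "10.0.0.3", "dst": "10.0.0.8", "proto": "udp", "sport": 5503, "dport": 53, "bytes": 50},
]
agg = primary_aggregate(flows, threshold=1000)
print(agg.most_common(1))  # dominant cluster: web traffic to 10.0.0.9
```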

A State-Machine Approach to Disambiguating Supercomputer Event Logs

Jon Stearley, Robert Ballance, and Lara Bauman, Sandia National Laboratories

Supercomputer components are inherently stateful and interdependent, so accurate assessment of an event on one component often requires knowledge of previous events on that component or others. Administrators who daily monitor and interact with the system generally possess sufficient operational context to accurately interpret events, but researchers with only historical logs are at risk of drawing incorrect conclusions. To address this risk, we present a state-machine approach for tracing context in event logs, a flexible implementation in Splunk, and an example of its use to disambiguate a frequently occurring event type on an extreme-scale supercomputer. Specifically, of 70,126 heartbeat-stop events over three months, we identify only 2% as indicating failures.
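The disambiguation idea can be sketched with a tiny per-node state machine: a heartbeat-stop on a node that was administratively ordered down is routine, while one arriving while the node is believed up indicates a failure. The event names below are invented for illustration, not Sandia's actual log vocabulary.

```python
def classify_events(events):
    """Walk an event log, tracking per-node state, and classify each
    heartbeat-stop as an expected shutdown or a real failure."""
    state = {}          # node -> "up" | "scheduled_down" | "down"
    verdicts = []
    for node, event in events:
        if event == "boot":
            state[node] = "up"
        elif event == "shutdown_request":
            state[node] = "scheduled_down"
        elif event == "heartbeat_stop":
            verdict = "failure" if state.get(node) == "up" else "expected"
            verdicts.append((node, verdict))
            state[node] = "down"
    return verdicts

log = [
    ("n1", "boot"),
    ("n2", "boot"),
    ("n2", "shutdown_request"),
    ("n2", "heartbeat_stop"),   # administratively stopped: not a failure
    ("n1", "heartbeat_stop"),   # no shutdown ordered: a real failure
]
print(classify_events(log))
```

Without the state, both heartbeat-stop events look identical, which is exactly the ambiguity the paper's approach removes.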

Uncertainty in Aggregate Estimates from Sampled Distributed Traces

Nate Coehlo, Arif Merchant, and Murray Stokely, Google, Inc.

Tracing mechanisms in distributed systems give important insight into system properties and are usually sampled to control overhead. At Google, Dapper [8] is the always-on system for distributed tracing and performance analysis, and it samples a fraction of all RPC traffic. Owing to implementation difficulties, excessive data volume, or a lack of perfect foresight, there are times when system quantities of interest have not been measured directly, and Dapper samples can be aggregated to estimate those quantities in the short or long term. Here we derive unbiased variance estimates of linear statistics over RPCs, taking into account all layers of sampling that occur in Dapper and allowing us to quantify the sampling uncertainty in the aggregate estimates. We apply this methodology to the problem of assigning jobs and data to Google datacenters, using estimates of the resulting cross-datacenter traffic as an optimization criterion, and also to the detection of change points in access patterns to certain data partitions.
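For intuition, consider the single-layer case the paper generalizes: under independent Bernoulli sampling at rate p, the Horvitz-Thompson estimator sums each sampled value weighted by 1/p, and its variance has a closed-form unbiased estimate. The sketch below shows only this one layer with invented RPC byte counts; the paper accounts for all of Dapper's sampling layers.

```python
def estimate_total(sampled_values, p):
    """Unbiased estimates of a total (a linear statistic over RPCs) and its
    variance, given values sampled independently with probability p."""
    total = sum(y / p for y in sampled_values)
    variance = sum(y * y * (1 - p) / (p * p) for y in sampled_values)
    return total, variance

# Hypothetical sampled RPC byte counts, sampled at 1-in-100 (p = 0.01).
sampled = [1200, 800, 1500]
total, var = estimate_total(sampled, p=0.01)
print(total, var ** 0.5)  # point estimate and its standard error
```

Note that at p = 1 (no sampling) the variance term vanishes, as it should: the total is then measured exactly.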

3:30 p.m.–4:00 p.m. Sunday

Break

Hollywood Ballroom Foyer

4:00 p.m.–5:00 p.m. Sunday

Demo Session

Service Health Analyzer
Ira Cohen, HP Labs
HP Service Health Analyzer (SHA) is the industry's first predictive analytics tool built on top of a real-time, dynamic service model, providing customers with a configuration-free system that proactively detects and decodes IT performance problems. We will provide a demo of SHA and describe the research behind it, involving machine learning applied to IT data.

Agurim: Multi-dimensional Flow Re-aggregation for Traffic Monitoring
Midori Kato, Keio University
A promising way to capture the characteristics of changing traffic is to extract significant flow clusters in traffic. However, clustering flows by 5-tuple requires flow matching in huge flow attribute spaces, and thus, is difficult to perform on the fly. We propose an efficient yet flexible flow aggregation technique for monitoring the dynamics of network traffic. In the demonstration, we present Agurim, our resulting software.

FDiag: A Failure Diagnostics Toolkit Based on the Analysis of Cluster System Logs
Edward Chuah, University of Texas at Austin

A goal of cluster system log analysis is to determine the sources and causes of system failures. Large cluster systems are composed of many hardware and software components, and they are used to execute jobs that require the immense computational power these systems provide. When nodes of a large cluster system crash or when jobs hang, the root causes of these failures must be identified. However, cluster system logs are huge, incomplete, and ambiguous enough that directly discovering the complete causal path of events leading to a failure is difficult. In this presentation, we will demonstrate how FDiag can be used to process the logs of the Ranger supercomputer and to generate diagnostics reports that systems administrators can use to determine where (the nodes and jobs), when (the times), and why (the causes) a system failure occurred.
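A rough flavor of this kind of diagnosis is to correlate message types with failure times: rank each message type by how concentrated it is in the window preceding a crash versus the log overall. The sketch below is a simplified stand-in for FDiag's analysis; the log format, timestamps, and message names are entirely invented.

```python
from collections import Counter

def rank_suspect_messages(log, failure_times, window=300):
    """Rank message types by lift: share of their occurrences that fall
    within `window` seconds before some failure time."""
    overall = Counter(msg for _, msg in log)
    before = Counter(
        msg for t, msg in log
        if any(0 <= ft - t <= window for ft in failure_times)
    )
    return sorted(before, key=lambda m: before[m] / overall[m], reverse=True)

# Hypothetical (timestamp, message-type) log entries from a cluster.
log = [
    (10, "job_start"), (50, "job_start"), (90, "lustre_timeout"),
    (120, "lustre_timeout"), (150, "job_start"), (170, "lustre_timeout"),
    (400, "job_start"), (500, "job_start"),
]
failures = [200]   # node crash observed at t=200
print(rank_suspect_messages(log, failures))  # most suspicious type first
```

Every occurrence of the timeout message falls before the crash, while job starts continue normally afterward, so the timeout ranks first as a candidate cause.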