Wednesday, 9:00 a.m.–10:30 a.m.
Wednesday, 10:30 a.m.–11:00 a.m.
Wednesday, 11:00 a.m.–12:30 p.m.
Session Chair: Marcos K. Aguilera, Microsoft Research
Zhe Zhang, Han Chen, and Hui Lei, IBM T. J. Watson Research Center
File cache management is among the most important factors affecting the performance of a cloud computing system. To achieve higher economies of scale, virtual machines are often overcommitted, which creates high memory pressure. It is therefore essential to eliminate duplicate data in the host and guest caches to boost performance. Existing cache deduplication solutions are based on complex algorithms or incur high runtime overhead, and are therefore not widely applicable. In this paper we present a simple and lightweight mechanism based on functional partitioning. In our mechanism, each cache is responsible for a smaller set of data, so the overall effective cache size becomes larger. Our method requires only a very small change to existing software (15 lines of new or modified code) yet achieves large performance improvements: more than 40% gains in high-memory-pressure settings.
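A minimal sketch of the functional-partitioning idea: cache each block in exactly one tier, so the host and guest caches never hold duplicate copies and the effective combined capacity is the sum of both. The hash-based routing rule and the capacities here are assumptions for illustration, not the paper's mechanism.

```python
from collections import OrderedDict

class LRUCache:
    """A minimal LRU cache keyed by block ID."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def lookup(self, block_id):
        if block_id in self.entries:
            self.entries.move_to_end(block_id)  # mark most recently used
            return True
        return False

    def insert(self, block_id):
        self.entries[block_id] = True
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)    # evict least recently used

# Functional partitioning: each block is owned by exactly one cache,
# so no block is ever stored twice across the host and guest tiers.
guest_cache = LRUCache(capacity=1000)
host_cache = LRUCache(capacity=1000)

def owner(block_id):
    # Assumed routing rule: hash-partition the block ID space.
    return guest_cache if hash(block_id) % 2 == 0 else host_cache

def read_block(block_id):
    cache = owner(block_id)
    if cache.lookup(block_id):
        return "hit"
    cache.insert(block_id)  # fetch from disk, cache in one tier only
    return "miss"
```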
Andrew Wang, Shivaram Venkataraman, Sara Alspaugh, Ion Stoica, and Randy Katz, University of California, Berkeley
Modern datacenters support a large number of applications with diverse performance requirements. These performance requirements are expressed at the application layer as high-level service-level objectives (SLOs). However, large-scale distributed storage systems are unaware of these high-level SLOs. This lack of awareness results in poor performance when workloads from multiple applications are consolidated onto the same storage cluster to increase utilization. In this paper, we argue that because SLOs are expressed at a high level, a high-level control mechanism is required. This is in contrast to existing approaches, which use block- or disk-level mechanisms. These require manual translation of high-level requirements into low-level parameters. We present Frosting, a request scheduling layer on top of a distributed storage system that allows application programmers to specify their high-level SLOs directly. Frosting improves over the state of the art by automatically translating high-level SLOs into internal scheduling parameters and by using feedback control to adapt these parameters to changes in the workload. Our preliminary results demonstrate that our overlay approach can multiplex both latency-sensitive and batch applications to increase utilization, while still maintaining a 100ms 99th percentile latency SLO for latency-sensitive clients.
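A feedback loop of the kind Frosting describes (measure the 99th-percentile latency, adjust a scheduling parameter) can be sketched as a simple proportional controller. The gain, update rule, and parameter names below are assumptions, not Frosting's internals.

```python
def percentile(samples, p):
    """Return the p-th percentile of a list of latency samples."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, int(len(ordered) * p / 100))
    return ordered[index]

class SLOController:
    """Proportional controller: translate a high-level latency SLO
    into a low-level scheduling weight for latency-sensitive requests."""
    def __init__(self, slo_ms=100.0, gain=0.002):
        self.slo_ms = slo_ms   # target 99th percentile latency
        self.gain = gain       # how aggressively to react to error
        self.weight = 0.5      # share given to the latency-sensitive queue

    def update(self, latency_samples_ms):
        p99 = percentile(latency_samples_ms, 99)
        error = p99 - self.slo_ms
        # Above the SLO: shift weight toward latency-sensitive requests;
        # below it: release capacity back to batch work.
        self.weight = min(0.95, max(0.05, self.weight + self.gain * error))
        return self.weight

controller = SLOController()
print(controller.update([20] * 99 + [250]))  # p99 over SLO: weight rises to 0.8
```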
Xing Lin, University of Utah, School of Computing; Yun Mao, AT&T Labs—Research; Feifei Li and Robert Ricci, University of Utah, School of Computing
A common problem with disk-based cloud storage services is that performance can vary greatly and become highly unpredictable in a multi-tenant environment. A fundamental reason is the interference between workloads co-located on the same physical disk. We observe that different IO patterns interfere with each other significantly, which makes the performance of different types of workloads unpredictable when they are executed concurrently. Unpredictability implies that users may not get a fair share of the system resources from the cloud services they are using. At the same time, replication is commonly used in cloud storage for high reliability. Connecting these two facts, we propose a cloud storage system designed to minimize workload interference without increasing storage costs or sacrificing the overall system throughput. Our design leverages log-structured disk layout, chain replication and a workload-based replica selection strategy to minimize interference, striking a balance between performance and fairness. Our initial results suggest that this approach is a promising way to improve the performance and predictability of cloud storage.
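One plausible reading of the replica-selection strategy: since replication already provides several copies of each block, requests can be segregated by workload type so that sequential streams and random I/O never share a spindle. A hedged sketch; the classifier and placement policy are assumptions, not the paper's design.

```python
# Sketch: route each request to the replica whose disk is already serving
# the same kind of workload. Assumed three-way replication.
REPLICAS = {"blk-42": ["disk-a", "disk-b", "disk-c"]}  # toy replica map
disk_mode = {}  # disk -> "seq" or "rand", set by the first workload routed there

def classify(request_offsets):
    """Crude classifier: monotonically increasing offsets look sequential."""
    return "seq" if request_offsets == sorted(request_offsets) else "rand"

def pick_replica(block_id, request_offsets):
    mode = classify(request_offsets)
    disks = REPLICAS[block_id]
    # Prefer a disk already dedicated to this workload type ...
    for disk in disks:
        if disk_mode.get(disk) == mode:
            return disk
    # ... otherwise claim an unassigned disk for it.
    for disk in disks:
        if disk not in disk_mode:
            disk_mode[disk] = mode
            return disk
    return disks[0]  # all disks claimed by the other type: fall back

print(pick_replica("blk-42", [0, 4096, 8192]))  # sequential: claims disk-a
print(pick_replica("blk-42", [7, 9000, 512]))   # random: claims disk-b
```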
Wednesday, 12:30 p.m.–1:30 p.m.
Wednesday, 1:30 p.m.–3:00 p.m.
Session Chair: Byung-Gon Chun, Yahoo! Research
James Horey, Edmon Begoli, Raghul Gunasekaran, Seung-Hwan Lim, and James Nutaro, Oak Ridge National Laboratory
Infrastructure-as-a-Service has revolutionized the manner in which users commission computing infrastructure. Coupled with Big Data platforms (Hadoop, Cassandra), IaaS has democratized the ability to store and process massive datasets. For users who need to customize or create new Big Data stacks, however, readily available solutions do not yet exist. Users must first acquire the necessary cloud computing infrastructure and manually install the prerequisite software. For complex distributed services this can be a daunting challenge. To address this issue, we argue that a distributed service should be viewed as a single application consisting of virtual machines. Users should no longer be concerned about individual machines or their internal organization. To illustrate this concept, we introduce Cloud-Get, a distributed package manager that enables the simple installation of distributed services in a cloud computing environment. Cloud-Get enables users to instantiate and modify distributed services, including Big Data services, using simple commands. Cloud-Get also simplifies the creation of new distributed services via standardized package definitions.
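The abstract does not give Cloud-Get's actual syntax or package format, so everything below is hypothetical, meant only to illustrate the idea of a package definition that names VM roles rather than individual machines.

```python
class FakeVM:
    """Stand-in for a launched virtual machine."""
    def __init__(self, image):
        self.image = image
    def run(self, command):
        print("run:", command)

class FakeProvisioner:
    """Stand-in for an IaaS provisioning API."""
    def launch(self, image):
        return FakeVM(image)

# Hypothetical package definition: roles and counts, not machine names.
cassandra_package = {
    "name": "cassandra",
    "roles": {
        "seed":   {"count": 1, "image": "ubuntu-12.04", "installs": ["cassandra"]},
        "worker": {"count": 4, "image": "ubuntu-12.04", "installs": ["cassandra"]},
    },
}

def install(package, provisioner):
    """Treat the whole distributed service as one application: acquire
    VMs for every role, then configure them as a unit."""
    vms = {role: [provisioner.launch(spec["image"]) for _ in range(spec["count"])]
           for role, spec in package["roles"].items()}
    for role, hosts in vms.items():
        for vm in hosts:
            for software in package["roles"][role]["installs"]:
                vm.run("apt-get install -y " + software)
    return vms

install(cassandra_package, FakeProvisioner())
```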
Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, and Ion Stoica, University of California, Berkeley
Despite prior research on outlier mitigation, our analysis of jobs from the Facebook cluster shows that outliers still occur, especially in small jobs. Small jobs are particularly sensitive to long-running outlier tasks because of their interactive nature. Outlier mitigation strategies rely on comparing different tasks of the same job and launching speculative copies for the slower tasks. However, small jobs execute all their tasks simultaneously, which does not leave sufficient time to observe and compare tasks. Building on the observation that clusters are underutilized, we take speculation to its logical extreme: run full clones of jobs to mitigate the effect of outliers. The heavy-tail distribution of job sizes implies that we can impact most jobs without using many resources. Trace-driven simulations show that cloning improves the average completion time of all small jobs by 47%, at the cost of just 3% extra resources.
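The resource arithmetic behind cloning rests on the heavy tail: if the smallest 90% of jobs account for only a few percent of total cluster work, cloning every small job is cheap. A toy calculation under an assumed Pareto-like size distribution (the real trace numbers differ):

```python
import random

random.seed(1)
# Assumed heavy-tailed job sizes (task counts); this only illustrates
# the shape of the argument, not the Facebook trace itself.
job_sizes = [int(random.paretovariate(1.2)) + 1 for _ in range(10000)]

threshold = sorted(job_sizes)[int(0.9 * len(job_sizes))]  # "small" = bottom 90%
small = [s for s in job_sizes if s <= threshold]

total = sum(job_sizes)
clone_cost = sum(small)  # one extra full clone per small job
print(f"jobs cloned: {len(small) / len(job_sizes):.0%}")
print(f"extra resources: {clone_cost / total:.1%} of cluster work")
```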
Edward Bortnikov, Yahoo! Labs, Haifa, Israel; Ari Frank, Affectivon Inc., Kiryat Tivon, Israel; Eshcar Hillel, Yahoo! Labs, Haifa, Israel; Sriram Rao, Yahoo! Labs, Santa Clara, CA
Extremely slow, or straggler, tasks are a major performance bottleneck in map-reduce systems. The Hadoop infrastructure makes an effort both to avoid them (by minimizing remote data accesses) and to handle them at runtime (through speculative execution). However, the mechanisms in place neither guarantee the avoidance of performance hotspots in task scheduling, nor provide any easy way to tune the timely detection of stragglers. We suggest a machine-learning approach to address these problems, and introduce a slowdown predictor: an oracle that forecasts how much slower a task will run on a given node, compared to similar tasks. Slowdown predictors can be embedded in the map-reduce infrastructure to improve the agility and timeliness of scheduling decisions. We provide an initial evaluation to demonstrate the viability of our approach, and discuss use cases for the new paradigm.
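A slowdown predictor is essentially a regression problem: from node and task features, forecast how much slower a task will run than its peers. A sketch using scikit-learn on synthetic data; the feature set is invented for illustration, and the paper does not specify its model.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Assumed per-task features (the paper does not enumerate its feature set).
# Columns: node CPU load, node disk busy fraction, fraction of remote
# input, concurrent tasks on the node.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(5000, 4))
# Synthetic ground truth: slowdown grows with load and remote reads.
y = 1.0 + 2.0 * X[:, 0] + 1.5 * X[:, 1] + 0.8 * X[:, 2] + rng.normal(0, 0.1, 5000)

model = GradientBoostingRegressor().fit(X[:4000], y[:4000])

# The scheduler can then ask: how much slower would this task run here?
candidate_node = np.array([[0.9, 0.7, 0.3, 0.5]])
predicted = model.predict(candidate_node)[0]
print(f"predicted slowdown: {predicted:.2f}x vs. similar tasks")
```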
Wednesday, 3:00 p.m.–3:15 p.m.
Wednesday, 3:15 p.m.–4:45 p.m.
Session Chair: Livio Soares, IBM Research
Jeongseob Ahn, Changdae Kim, and Jaeung Han, KAIST; Young-ri Choi, UNIST; Jaehyuk Huh, KAIST
Although virtual machine (VM) migration has been used to avoid conflicts on traditional system resources such as CPU and memory, micro-architectural resources, such as shared caches, memory controllers, and non-uniform memory access (NUMA) affinity, have relied only on intra-system scheduling to reduce contention. This study shows that live VM migration can be used to mitigate contention for micro-architectural resources. Such cloud-level VM scheduling widens the scope of VM selection for architectural shared resources beyond a single system, and thus improves the opportunity to further reduce conflicts. This paper proposes and evaluates two cluster-level virtual machine scheduling techniques, for cache sharing and NUMA affinity, that do not require any prior knowledge of the behavior of VMs.
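A cluster-level scheduler along these lines needs a placement score: estimate contention among co-located VMs and migrate when another placement lowers it. A toy sketch, assuming per-VM cache misses per kilo-instruction (MPKI) as the contention signal; the metric and the exhaustive search are illustrative, not the paper's algorithms.

```python
from itertools import combinations

# Assumed per-VM profile: last-level-cache misses per kilo-instruction.
vm_mpki = {"vm1": 30.0, "vm2": 28.0, "vm3": 2.0, "vm4": 3.0}

def host_contention(vms):
    """Toy metric: cache-hungry VMs sharing a socket hurt each other,
    so score a host by the sum of pairwise products of MPKI."""
    return sum(vm_mpki[a] * vm_mpki[b] for a, b in combinations(vms, 2))

def best_pairing(vms):
    """Search two-VM-per-host placements for the lowest total contention."""
    a, b, c, d = vms
    options = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]
    return min(options, key=lambda hosts: sum(host_contention(h) for h in hosts))

placement = best_pairing(list(vm_mpki))
print("migrate toward:", placement)  # pairs each cache-hungry VM with a quiet one
```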
Navraj Chohan, Anand Gupta, Chris Bunch, Sujay Sundaram, and Chandra Krintz, University of California, Santa Barbara
Cloud technology is evolving at a rapid pace with innovation occurring throughout the software stack. While updates to Software-as-a-Service (SaaS) products require a simple push of code to the production servers or platform, updates to the Infrastructure-as-a-Service (IaaS) or Platform-as-a-Service (PaaS) layers require more intricate procedures to prevent disruption to services at higher abstraction layers.
In this work we address the need for rolling upgrades to PaaS systems. We do so with the AppScale PaaS, a multi-application, multi-language, multi-infrastructure, and multi-datastore platform. Our design and implementation allow applications and tenants to be migrated live from one cloud deployment to another with guaranteed transaction semantics and minimal performance degradation. In this paper we motivate the need for PaaS migration support and empirically evaluate migrations between two AppScale deployments using highly scalable datastores.
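The abstract does not detail the migration mechanism; the standard live-migration recipe (bulk copy, double-write, replay, cut over) gives a feel for what is involved. A toy sketch, not necessarily AppScale's design:

```python
class Store:
    """Tiny stand-in for a datastore deployment."""
    def __init__(self):
        self.data = {}
        self.log = []  # writes applied after the snapshot

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

source, target = Store(), Store()
source.write("user:1", "alice")

# 1. Bulk-copy a snapshot while the source keeps serving writes.
snapshot = dict(source.data)
source.log.clear()
target.data.update(snapshot)

# 2. Double-write: new transactions go to both deployments.
def write_both(key, value):
    source.write(key, value)
    target.write(key, value)

write_both("user:2", "bob")

# 3. Replay any writes that raced the snapshot, then cut over traffic.
for key, value in source.log:
    target.data[key] = value
assert target.data == source.data  # target has caught up: switch traffic
```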
Raja R. Sambasivan and Gregory R. Ganger, Carnegie Mellon University
Automated management is critical to the success of cloud computing, given its scale and complexity. But most systems do not satisfy one of the key properties required for automation: predictability, which in turn relies on low variance. Most automation tools are not effective when variance is consistently high. Using automated performance diagnosis as a concrete example, this position paper argues that for automation to become a reality, system builders must treat variance as an important metric and make conscious decisions about where to reduce it. To help with this task, we describe a framework for reasoning about sources of variance in distributed systems and an example tool for helping to identify them.
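As one concrete way to treat variance as a first-class metric, a diagnosis tool can rank components by the squared coefficient of variation of their latencies and flag consistently noisy ones. A minimal sketch; the threshold is an assumption, not from the paper.

```python
import statistics

def squared_cov(samples):
    """Squared coefficient of variation: variance normalized by the
    squared mean, a scale-free measure of how unpredictable a
    component's latency is."""
    mean = statistics.fmean(samples)
    return statistics.pvariance(samples) / (mean * mean)

# Per-component response times (ms) extracted from request traces.
traces = {
    "metadata-server": [10, 11, 10, 12, 10, 11],
    "storage-node-7":  [8, 45, 9, 80, 10, 7],  # erratic: a diagnosis suspect
}

for component, samples in traces.items():
    c2 = squared_cov(samples)
    flag = "HIGH-VARIANCE" if c2 > 0.5 else "ok"
    print(f"{component:>16}: C^2 = {c2:.2f} ({flag})")
```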
Wednesday, 4:45 p.m.–5:00 p.m.
Wednesday, 5:00 p.m.–6:30 p.m.
Session Chair: Sharon Goldberg, Boston University
Justin Thaler, Mike Roberts, Michael Mitzenmacher, and Hanspeter Pfister, Harvard University, School of Engineering and Applied Sciences
As the cloud computing paradigm has gained prominence, the need for verifiable computation has grown increasingly urgent. Protocols for verifiable computation enable a weak client to outsource difficult computations to a powerful, but untrusted server, in a way that provides the client with a guarantee that the server performed the requested computations correctly. By design, these protocols impose a minimal computational burden on the client, but they require the server to perform a very large amount of extra bookkeeping to enable a client to easily verify the results. Verifiable computation has thus remained a theoretical curiosity, and protocols for it have not been implemented in real cloud computing systems.
In this paper, we assess the potential of parallel processing to help make practical verification a reality, identifying abundant data parallelism in a state-of-the-art general-purpose protocol for verifiable computation. We implement this protocol on the GPU, obtaining 40-120× server-side speedups relative to a state-of-the-art sequential implementation. For benchmark problems, our implementation thereby reduces the slowdown of the server to within factors of 100-500× relative to the original computations requested by the client. Furthermore, we reduce the already small runtime of the client by 100×. Our results demonstrate the immediate practicality of using GPUs for verifiable computation, and more generally, that protocols for verifiable computation have become sufficiently mature to deploy in real cloud computing systems.
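The interactive-proof protocol the authors implement is too involved to reproduce here, but Freivalds' classic check for outsourced matrix multiplication conveys the flavor of verifiable computation: the untrusted server does the O(n^3) work, and the client verifies in O(n^2) with a tunable error bound. This is a different, far simpler protocol than the paper's.

```python
import random

def freivalds_verify(A, B, C, rounds=20):
    """Check A @ B == C in O(n^2) per round, versus O(n^3) to recompute.
    A false claim survives each round with probability at most 1/2."""
    n = len(A)
    for _ in range(rounds):
        r = [random.randint(0, 1) for _ in range(n)]
        Br = [sum(B[i][j] * r[j] for j in range(n)) for i in range(n)]
        ABr = [sum(A[i][j] * Br[j] for j in range(n)) for i in range(n)]
        Cr = [sum(C[i][j] * r[j] for j in range(n)) for i in range(n)]
        if ABr != Cr:
            return False  # the server's answer is certainly wrong
    return True  # a wrong answer slips through with prob <= 2**-rounds

# The "server" computes the product; the "client" only runs the check.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[19, 22], [43, 50]]  # claimed result of A @ B
print(freivalds_verify(A, B, C))  # True
```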
Masoud Moshref and Minlan Yu, University of Southern California; Abhishek Sharma, University of Southern California and NEC; Ramesh Govindan, University of Southern California
Cloud operators increasingly need many fine-grained rules to better control individual network flows for various management tasks. While previous approaches have advocated placing rules either on hypervisors or switches, we argue that future data centers would benefit from leveraging rule processing capabilities at both for better scalability and performance. In this paper, we propose vCRIB, a virtualized Cloud Rule Information Base that allows operators to freely define different management policies without the need to consider underlying resource constraints. The challenge in our approach is the design of a vCRIB manager that automatically partitions and places rules at both hypervisors and switches to achieve a good trade-off between resource usage and performance.
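The placement problem at vCRIB's core can be pictured as packing rules into devices with different capacities and per-packet costs. A greedy toy version; the capacities, cost model, and policy are assumptions, not vCRIB's partitioning algorithm.

```python
# Toy trade-off: switches process rules fastest but hold few;
# hypervisors hold many but add per-packet CPU overhead.
DEVICES = {
    "switch-1":     {"capacity": 2000,  "cost_per_hit": 1.0},
    "hypervisor-a": {"capacity": 50000, "cost_per_hit": 4.0},
}

def place_rules(rules, devices):
    """Greedy sketch: hottest rules go to the cheapest (fastest) device
    until it fills, then overflow to the next. Returns rule -> device."""
    by_traffic = sorted(rules, key=lambda r: r["hits"], reverse=True)
    by_cost = sorted(devices, key=lambda d: devices[d]["cost_per_hit"])
    used = {d: 0 for d in devices}
    placement = {}
    for rule in by_traffic:
        for device in by_cost:
            if used[device] < devices[device]["capacity"]:
                placement[rule["id"]] = device
                used[device] += 1
                break
    return placement

rules = [{"id": i, "hits": 10000 // (i + 1)} for i in range(3000)]
placement = place_rules(rules, DEVICES)
print(sum(1 for d in placement.values() if d == "switch-1"), "rules on the switch")
```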
Bryan Ford, Yale University
The cloud model’s dependence on massive parallelism and resource sharing exacerbates the security challenge of timing side-channels. Timing Information Flow Control (TIFC) is a novel adaptation of IFC techniques that may offer a way to reason about, and ultimately control, the flow of sensitive information through systems via timing channels. With TIFC, objects such as files, messages, and processes carry not just content labels describing the ownership of the object’s “bits,” but also timing labels describing information contained in timing events affecting the object, such as process creation/termination or message reception. With two system design tools—deterministic execution and pacing queues—TIFC enables the construction of “timing-hardened” cloud infrastructure that permits statistical multiplexing, while aggregating and rate-limiting timing information leakage between hosted computations.
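Of the two design tools named, the pacing queue is the easier to sketch: it releases output only on fixed clock ticks, so an observer of message timing learns at most one quantized event per interval rather than fine-grained internal timing. A minimal illustration, with an assumed pacing interval.

```python
import time
from collections import deque

class PacingQueue:
    """Release queued messages only at fixed ticks, so output timing
    reveals at most one bit ("was a message ready?") per interval,
    rather than the sender's fine-grained internal timing."""
    def __init__(self, interval_s=0.010):
        self.interval_s = interval_s
        self.pending = deque()

    def send(self, message):
        self.pending.append(message)  # enqueue now, deliver on a tick

    def run(self, deliver, ticks):
        next_tick = time.monotonic()
        for _ in range(ticks):
            next_tick += self.interval_s
            time.sleep(max(0.0, next_tick - time.monotonic()))
            if self.pending:
                deliver(self.pending.popleft())  # at most one per tick

q = PacingQueue()
q.send("reply-1")
q.send("reply-2")
q.run(deliver=print, ticks=3)  # replies emerge on the 10ms clock edges
```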