8:00 a.m.–9:00 a.m., Wednesday
Continental Breakfast
Columbus Foyer

9:00 a.m.–10:00 a.m., Wednesday
Albert Greenberg, Director of Development, Microsoft Azure Networking
Large-scale cloud infrastructure requires many management applications to run concurrently in order to function smoothly. Traditional management applications fall short, tripped up by unexpected failures and unanticipated interference. We present a graph-based data model for cloud infrastructure, enabling architects and operators to describe large-scale, complex infrastructure. A goal-state-driven framework enables quick and safe application development, sustaining infrastructure growth and maintenance at huge scale.
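As a rough illustration of the goal-state-driven style the talk describes (a minimal sketch only; the device model, state names, and driver API below are hypothetical, not the actual Azure system), a management application can be structured as a reconciliation loop that repeatedly compares observed state against the declared goal state and issues idempotent corrective steps:

```python
# Hypothetical sketch of a goal-state reconciliation loop. The device model,
# states, and driver API are illustrative assumptions.
import time

GOAL = {"switch-1": "fw-v2", "switch-2": "fw-v2"}   # declared goal state

def observe():
    """Return the currently observed firmware per device (stubbed here)."""
    return {"switch-1": "fw-v2", "switch-2": "fw-v1"}

def drive(device, target):
    """Issue one idempotent, retry-safe step toward the goal."""
    print(f"upgrading {device} -> {target}")

def reconcile():
    observed = observe()
    for device, target in GOAL.items():
        if observed.get(device) != target:
            drive(device, target)   # failures are simply retried next pass

for _ in range(3):                  # a real controller would loop forever
    reconcile()
    time.sleep(1)
```

Because each pass re-derives its work from the goal state, unexpected failures and interference surface as nothing worse than an extra iteration.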

10:00 a.m.–10:30 a.m., Wednesday
Break with Refreshments
Columbus Foyer

10:30 a.m.–12:10 p.m., Wednesday
Session Chair: Jonathan Appavoo, Boston University
Wei Zhang and Timothy Wood, The George Washington University; K.K. Ramakrishnan, Rutgers University; Jinho Hwang, IBM T. J. Watson Research Center
A revolution is beginning in communication networks with the adoption of network function virtualization, which allows network services to be run on common off-the-shelf hardware—even in virtual machines—to increase flexibility and lower cost. An exciting prospect for cloud users is that these software-based network services can be merged with compute and storage resources to flexibly integrate all of the cloud’s resources.
We are developing an application-aware networking platform that can perform not only basic packet switching, but also typical functions left to compute platforms such as load balancing based on application-level state, localized data caching, and even arbitrary computation. Our prototype “memcached-aware smart switch” reduces request latency by half and increases throughput eightfold compared to Twitter’s TwemProxy. We also describe how a Hadoop-aware switch could automatically cache data blocks near worker nodes, or perform some computation directly on the data stream. This approach enables a new breed of application designs that blur the line between the cloud’s network and its servers.
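As a loose illustration of switching on application-level state (a sketch under assumptions; the prototype's actual design and TwemProxy's internals are not shown, and the parsing, backend list, and hot-key cache below are invented for the example), a memcached-aware element might inspect the key in each request and either answer from a local cache or pick a backend by key:

```python
# Hypothetical sketch of application-aware routing for memcached requests.
# Backends, parsing, and the in-switch hot-key cache are illustrative.
import hashlib

BACKENDS = ["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"]
hot_cache = {"user:42": b"cached-value"}   # tiny in-switch cache of hot keys

def route(request_line: str):
    op, key = request_line.split()[:2]         # e.g. "get user:42"
    if op == "get" and key in hot_cache:
        return ("LOCAL", hot_cache[key])       # served without any server hop
    digest = hashlib.md5(key.encode()).hexdigest()
    return ("FORWARD", BACKENDS[int(digest, 16) % len(BACKENDS)])

print(route("get user:42"))        # hits the switch-local cache
print(route("set user:7 0 0 5"))   # forwarded to a key-selected backend
```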
Paolo Costa, Hitesh Ballani, and Dushyanth Narayanan, Microsoft Research
The rack is increasingly replacing individual servers as the basic building block of modern data centers. Future rack-scale computers will comprise a large number of tightly integrated systems-on-chip, interconnected by a switch-less internal fabric. This design enables thousands of cores per rack and provides high bandwidth for rack-scale applications. Most of the benefits promised by these new architectures, however, can only be achieved with adequate support from the software stack.
In this paper, we take a step in this direction by focusing on the network stack for rack-scale computers. Using routing and rate control as examples, we show how the peculiarities of rack architectures allow for new approaches that are attuned to the underlying hardware. We also discuss other exciting research challenges posed by rack-scale computers.
William Culhane, Kirill Kogan, Chamikara Jayalath, and Patrick Eugster, Purdue University
Aggregation underlies the distillation of information from big data. Many well-known basic operations including top-k matching and word count hinge on fast aggregation across large datasets. Common frameworks including MapReduce support aggregation, but do not explicitly consider or optimize it. Optimizing aggregation, however, becomes even more relevant in recent “online” approaches to expressive big data analysis which store data in main memory across nodes. This shifts the bottlenecks from disk I/O to distributed computation and network communication and significantly increases the impact of aggregation time on total job completion time.
This paper presents LOOM, a (sub)system for efficient big data aggregation for use within big data analysis frameworks. LOOM efficiently supports two-phased (sub)computations consisting of a first phase performed on individual data subsets (e.g., word count, top-k matching) followed by a second aggregation phase which consolidates the individual results of the first phase (e.g., count sum, top-k). Using characteristics of an aggregation function, LOOM constructs a specifically configured aggregation overlay to minimize aggregation costs. We present optimality heuristics and experimentally demonstrate the benefits of such optimized aggregation overlays using microbenchmarks and real-world examples.
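To make the two-phase pattern concrete (a minimal sketch, not LOOM's overlay-construction algorithm; the fan-in parameter merely stands in for the function-specific tuning the paper describes), word count can be written as a per-partition first phase followed by a tree of merges:

```python
# Hypothetical sketch of two-phase aggregation over a tree-shaped overlay.
from collections import Counter

def phase1(partition):                  # first phase: per-node partial result
    return Counter(partition.split())

def merge(group):                       # second phase: consolidate partials
    total = Counter()
    for partial in group:
        total.update(partial)
    return total

def tree_aggregate(partials, fan_in=2):
    """Merge partials level by level, fan_in at a time; fan_in is the knob
    an overlay builder would tune using the aggregation function's traits."""
    while len(partials) > 1:
        partials = [merge(partials[i:i + fan_in])
                    for i in range(0, len(partials), fan_in)]
    return partials[0]

parts = ["a b a", "b c", "a c c"]
print(tree_aggregate([phase1(p) for p in parts]))   # Counter({'a': 3, 'c': 3, 'b': 2})
```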
Katrina LaCurts, MIT/CSAIL; Jeffrey C. Mogul, Google, Inc.; Hari Balakrishnan, MIT/CSAIL; Yoshio Turner, HP Labs
In cloud-computing systems, network-bandwidth guarantees have been shown to improve predictability of application performance and cost [1, 28]. Most previous work on cloud-bandwidth guarantees has assumed that cloud tenants know what bandwidth guarantees they want [1, 17]. However, as we show in this work, application bandwidth demands can be complex and time-varying, and many tenants might lack sufficient information to request a guarantee that is well-matched to their needs, which can lead to over-provisioning (and thus reduced cost-efficiency) or under-provisioning (and thus poor user experience).
We analyze traffic traces gathered over six months from an HP Cloud Services datacenter, finding that application bandwidth consumption is both time-varying and spatially inhomogeneous. This variability makes it hard to predict requirements. To solve this problem, we develop a prediction algorithm usable by a cloud provider to suggest an appropriate bandwidth guarantee to a tenant. When tenant VMs are placed using these predictive guarantees, we find that the inter-rack network utilization in certain datacenter topologies can be more than doubled.
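As a rough sketch of the provider-side suggestion step (the paper's prediction algorithm is more sophisticated; the high-percentile-plus-headroom rule below is an assumption made for illustration), a guarantee could be derived from a tenant's recent bandwidth history:

```python
# Hypothetical sketch: suggest a bandwidth guarantee from observed usage.
def suggest_guarantee(samples_mbps, percentile=95, headroom=1.2):
    """Pick a high percentile of the history, padded for growth."""
    ordered = sorted(samples_mbps)
    idx = min(len(ordered) - 1, int(len(ordered) * percentile / 100))
    return ordered[idx] * headroom

history = [120, 80, 95, 400, 110, 130, 90, 105]    # per-interval Mb/s
print(f"suggested guarantee: {suggest_guarantee(history):.0f} Mb/s")
```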

12:10 p.m.–1:30 p.m., Wednesday
FCW '14 Luncheon
Grand Ballroom ABC

1:30 p.m.–3:35 p.m., Wednesday
Session Chair: John Arrasjid, VMware
Owen Vallis, Jordan Hochenbaum, and Arun Kejariwal, Twitter Inc.
High availability and performance of a web service are key, amongst other factors, to the overall user experience (which in turn directly impacts the bottom line). Exogenic and/or endogenic factors often give rise to anomalies that make maintaining high availability and delivering high performance very challenging. Although there exists a large body of prior research in anomaly detection, existing techniques are not suitable for detecting long-term anomalies owing to a predominant underlying trend component in the time series data.
To this end, we developed a novel statistical technique to automatically detect long-term anomalies in cloud data. Specifically, the technique employs statistical learning to detect anomalies in both application and system metrics. Further, the technique uses robust statistical metrics, viz., median and median absolute deviation (MAD), and piecewise approximation of the underlying long-term trend to accurately detect anomalies even in the presence of intra-day and/or weekly seasonality. We demonstrate the efficacy of the proposed technique using production data and report Precision, Recall, and F-measure. Multiple teams at Twitter are currently using the proposed technique on a daily basis.
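A minimal sketch of the median/MAD core of the idea (the full technique also models seasonality and a piecewise long-term trend, which this toy example omits; the threshold and the 1.4826 normalizing constant are standard choices, not necessarily the paper's):

```python
# Hypothetical sketch of robust anomaly flagging with median and MAD.
import statistics

def mad_anomalies(series, threshold=3.5):
    med = statistics.median(series)
    mad = statistics.median(abs(x - med) for x in series) or 1e-9
    # 1.4826 scales MAD to the standard deviation for normal data
    return [i for i, x in enumerate(series)
            if abs(x - med) / (1.4826 * mad) > threshold]

cpu = [48, 51, 50, 49, 52, 95, 50, 47, 51]   # one injected spike
print(mad_anomalies(cpu))                    # -> [5]
```

Because median and MAD ignore extreme values, the spike itself does not inflate the baseline the way mean and standard deviation would.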
Daniel J. Dean, Hiep Nguyen, Peipei Wang, and Xiaohui Gu, North Carolina State University
Infrastructure-as-a-service (IaaS) clouds are becoming widely adopted. However, as multiple tenants share the same physical resources, performance anomalies have become one of the top concerns for users. Unfortunately, performance anomaly diagnosis in the production IaaS cloud often takes a long time due to its inherent complexity and sharing nature. In this paper, we present PerfCompass, a runtime performance anomaly fault localization tool using online system call trace analysis techniques. Specifically, PerfCompass tackles a challenging fault localization problem for IaaS clouds, that is, differentiating whether a production-run performance anomaly is caused by an external fault (e.g., interference from other co-located applications) or an internal fault (e.g., software bug). PerfCompass does not require any application source code or runtime instrumentation, which makes it practical for production IaaS clouds. We have tested PerfCompass using a set of popular software systems (e.g., Apache, MySQL, Squid, Cassandra, Hadoop) and a range of common cloud environment issues and real software bugs. The results show that PerfCompass accurately diagnoses all the faults while imposing low overhead during normal application execution time.
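One intuition such a tool can exploit, sketched loosely below (the scoring and thresholds are illustrative assumptions, not PerfCompass's actual analysis), is that external interference tends to slow nearly all of an application's execution units, while an internal bug tends to slow only a few:

```python
# Hypothetical sketch: classify a slowdown as external vs. internal from how
# broadly it impacts threads. The thresholds are illustrative assumptions.
def classify(per_thread_slowdown, slow_factor=2.0, broad_fraction=0.9):
    slowed = sum(1 for s in per_thread_slowdown if s >= slow_factor)
    impacted = slowed / len(per_thread_slowdown)
    return ("external (co-located interference?)" if impacted >= broad_fraction
            else "internal (software bug?)")

print(classify([2.5, 2.4, 2.6, 2.3]))   # every thread slowed -> external
print(classify([5.0, 1.0, 1.1, 1.0]))   # one thread slowed   -> internal
```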
Sherif Akoush, Lucian Carata, Ripduman Sohan, and Andy Hopper, University of Cambridge
Organisations are starting to publish datasets containing potentially sensitive information in the Cloud; hence it is important that there is a clear audit trail showing that the involved parties are respecting data sharing laws and policies.
Information Flow Control (IFC) has been proposed as a solution. However, fine-grained IFC has various deployment challenges and runtime overhead issues that have limited its wide adoption so far.
In this paper we present MrLazy, a system that practically addresses some of these issues for MapReduce. Within one trust domain, we relax the need to continuously check policies. We instead rely on lineage (information about the origin of a piece of data) as a mechanism to retrospectively apply policies on demand. We show that MrLazy imposes manageable temporal and spatial overheads while enabling fine-grained data regulation.
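A toy sketch of the lazy, lineage-based enforcement idea (the record format and policy check are invented for illustration; MrLazy's MapReduce integration is not shown): derived data carries the identifiers of its sources, and the policy is evaluated only when data is about to leave the trust domain:

```python
# Hypothetical sketch of lineage-carrying records with on-demand policy checks.
RESTRICTED = {"rec-2"}   # sources whose data must not leave the trust domain

def tag(value, sources):
    return {"value": value, "lineage": set(sources)}

def combine(a, b):
    """Derived data inherits the union of its inputs' lineage."""
    return tag(a["value"] + b["value"], a["lineage"] | b["lineage"])

def release(record):
    """Policy is checked once, at release time, not on every operation."""
    if record["lineage"] & RESTRICTED:
        raise PermissionError("output derived from restricted sources")
    return record["value"]

out = combine(tag(10, {"rec-1"}), tag(5, {"rec-3"}))
print(release(out))   # allowed: no restricted source in its lineage
```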
Qinghua Lu, China University of Petroleum and NICTA; Liming Zhu, Xiwei Xu, and Len Bass, NICTA; Shanshan Li, Weishan Zhang, and Ning Wang, China University of Petroleum
Paper Only/No Presentation
Conducting system operations (such as upgrade, reconfiguration, and deployment) for large-scale systems in the cloud is error prone and complex. These operations rely heavily on unreliable cloud infrastructure APIs to complete. The inherent uncertainties and inevitable errors cause a long tail in the completion-time distribution of operations. In this paper, we propose mechanisms and deployment architecture tactics to tolerate this long tail. We wrapped cloud provisioning API calls and implemented deployment tactics at the architecture level for system operations. Our initial evaluation shows that the mechanisms and deployment tactics can effectively reduce the long tail.
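A minimal sketch of one such tactic (re-issuing an idempotent provisioning call that misses a deadline; the API stub and timing constants are hypothetical, and this is not the paper's implementation):

```python
# Hypothetical sketch: cut off long-tail provisioning calls and retry them.
# Retried calls must be idempotent; abandoned stragglers keep running.
import concurrent.futures, random, time

def provision_vm():                          # stand-in for a cloud API call
    time.sleep(random.choice([0.1, 0.1, 0.1, 3.0]))  # occasional straggler
    return "vm-ok"

def with_tail_tolerance(call, timeout=0.5, retries=3):
    with concurrent.futures.ThreadPoolExecutor() as pool:
        for _ in range(retries):
            future = pool.submit(call)
            try:
                return future.result(timeout=timeout)   # fast path
            except concurrent.futures.TimeoutError:
                continue                                 # straggler: re-issue
        return future.result()               # last resort: wait it out

print(with_tail_tolerance(provision_vm))
```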
Junji Zhi, Sahil Suneja, and Eyal de Lara, University of Toronto
System testing is an essential part of software development. Unfortunately, comprehensive testing of large systems is often resource intensive and time-consuming. In this paper, we explore the possibility of leveraging hierarchical virtual machine (VM) fork to optimize system testing in the cloud. Testing using VM fork has the potential to save system configuration effort, obviate the need to run redundant common steps, and reduce disk and memory requirements by sharing resources across test cases. A preliminary experiment that uses VM fork to run a subset of the MySQL database test suite shows that the technique reduces the VM run time needed to complete all test cases by 60%.
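As a loose process-level analogy for hierarchical fork (real VM fork clones entire virtual machines; os.fork below merely illustrates running common setup once and sharing its state across test cases, on a POSIX system):

```python
# Hypothetical sketch: fork after one expensive shared setup so each test
# case inherits the prepared state instead of rebuilding it.
import os

def common_setup():
    print("install packages, load schema ...")   # executed exactly once

def run_test(name):
    print(f"[pid {os.getpid()}] running {name}")

common_setup()
tests = ["test_select", "test_insert", "test_join"]
for test in tests:
    if os.fork() == 0:          # child inherits the fully prepared state
        run_test(test)
        os._exit(0)             # exit child without running parent cleanup
for _ in tests:
    os.wait()                   # parent reaps each test child
```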

3:35 p.m.–4:05 p.m., Wednesday
Break with Refreshments
Columbus Foyer

4:05 p.m.–5:20 p.m., Wednesday
Session Chair: Michael Kozuch, Intel Labs
Li Chen and Kai Chen, The Hong Kong University of Science and Technology
Accounting and billing of cloud resources is vital to the operation of cloud service providers and their tenants. In this paper, we categorize the trust models of current industrial and academic cloud billing solutions, and discuss the problems with these models in terms of degree of trust, scalability, and robustness. Based on this analysis, we propose a novel public trust model to ensure natural and intuitive verification of billable events in the cloud. Leveraging a Bitcoin-like mechanism, we design BitBill, a scalable, robust, and mutually verifiable billing system for cloud computing. Our initial results show that BitBill has significantly better scalability (supporting 10x the concurrent tenants using the billing service) than the state-of-the-art third-party centralized billing system.
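A toy sketch of the mutually verifiable ledger idea (hash-chained billing events; BitBill's actual Bitcoin-like protocol, with distributed consensus among verifiers, is far more involved than this single-party chain):

```python
# Hypothetical sketch: a tamper-evident chain of billing events. Editing any
# earlier record changes every later hash, so both parties can audit it.
import hashlib, json

def add_event(chain, event):
    prev = chain[-1]["hash"] if chain else "0" * 64
    record = {"event": event, "prev": prev}
    record["hash"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()).hexdigest()
    chain.append(record)

def verify(chain):
    prev = "0" * 64
    for rec in chain:
        body = {"event": rec["event"], "prev": rec["prev"]}
        digest = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        if rec["prev"] != prev or rec["hash"] != digest:
            return False
        prev = rec["hash"]
    return True

ledger = []
add_event(ledger, {"tenant": "t1", "vm_hours": 3})
add_event(ledger, {"tenant": "t1", "gb_out": 12})
print(verify(ledger))   # -> True
```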
Robert Jellinek, Yan Zhai, Thomas Ristenpart, and Michael Swift, University of Wisconsin—Madison
Cloud computing platforms such as Amazon Web Services, Google Compute Engine, and Rackspace Public Cloud have been the subject of numerous measurement studies considering performance, reliability, and cost efficiency. However, little attention has been paid to billing. Cloud providers rely upon complex, large-scale billing systems that track customer resource usage at fine granularity and generate bills reflecting measured usage. Yet it is not known how visible such usage is to customers, and how closely provider charges correspond to customers’ view of their resource usage.
We initiate a study of cloud billing systems, focusing on Amazon EC2, Google Compute Engine, and Rackspace, and uncover a variety of issues, including: inherent difficulties in predicting charges; bugs that lead to free CPU time on EC2 and over-charging for storage in Rackspace; and long and unpredictable billing-update latency. Our measurements motivate further study on billing systems, and so we conclude with a brief discussion of open questions for future work.
Cheng Wang, Bhuvan Urgaonkar, George Kesidis, Uday V. Shanbhag, and Qian Wang, The Pennsylvania State University
Since energy-related costs make up an increasingly significant component of the overall costs of data centers run by cloud providers, it is important that these costs be propagated to tenants in ways that are fair and that promote workload modulation aligned with overall cost-efficacy. We argue that there is a big gap between how electric utilities charge data centers for their energy consumption (on the one hand) and the pricing interface cloud providers expose to their tenants (on the other). Whereas electric utilities employ complex features such as peak-based, time-varying, or tiered (load-dependent) pricing schemes, cloud providers charge tenants based on IT abstractions. This gap can create shortcomings such as unfairness in how tenants are charged and may also hinder overall cost-effective resource allocation. To overcome these shortcomings, we propose the novel idea of a virtual electric utility (VEU) that cloud providers should expose to individual tenants (in addition to their existing IT-based offerings). We discuss the initial ideas underlying VEUs and the challenges that must be addressed to turn them into a practical mechanism whose merits can be systematically explored.

6:30 p.m.–8:00 p.m., Wednesday
Wednesday Reception
Grand Ballroom AB