8:30 a.m.–9:00 a.m. |
Sunday |
Continental Breakfast
Centennial Foyer |
9:00 a.m.–10:00 a.m. |
Sunday |
Session Chair: Flavio Junqueira, Microsoft Research Cambridge
Robbert van Renesse, Department of Computer Science, Cornell University
The last decade or two has seen a wild growth of large-scale systems with strong responsiveness requirements. Such systems include cloud services as well as sensor networks. Their scalability and reliability requirements mandate that these systems be both sharded and replicated. These systems also evolve quickly as a result of changing workloads, added functionality, new hardware deployments, and so on. While such systems are useful, they can behave in erratic ways, and it is not clear that one can build mission- and life-critical systems this way.
This talk is inspired by work that I’m doing within the ARPA-E GENI program to modernize the power grid using cloud infrastructure. I will survey some of the techniques available to build scalable systems in a modular and reasoned fashion. I will discuss some of the experience I have working with formal methods to derive provably correct building blocks and provably correct transformations that improve trust in the responsiveness or reliability of distributed systems. Finally, I will talk about some of the open questions that remain.
Robbert van Renesse is a Principal Research Scientist in the Department of Computer Science at Cornell University. He received a Ph.D. from the Vrije Universiteit in Amsterdam in 1989. After working at AT&T Bell Labs in Murray Hill he joined Cornell in 1991. He was associate editor of IEEE Transactions on Parallel and Distributed Systems from 1997 to 1999, and he is currently associate editor for ACM Computing Surveys. His research interests include the fault tolerance and scalability of distributed systems. Van Renesse is an ACM Fellow.
|
10:00 a.m.–10:30 a.m. |
Sunday |
Break with Refreshments
Centennial Foyer |
10:30 a.m.–noon |
Sunday |
Session Chair: Flavio Junqueira, Microsoft Research Cambridge
Arjun Narayan, Antonis Papadimitriou, and Andreas Haeberlen, University of Pennsylvania
Building dependable federated systems is often complicated by privacy concerns: if the domains are not willing to share information with each other, a global or ‘systemic’ threat may not be detected until it is too late. In this paper, we study this problem using a somewhat unusual example: the financial crisis of 2008. Based on results from the economics literature, we argue that a) the impending crisis could have been predicted by performing a specific distributed computation on the financial information of each bank, but that b) existing tools, such as secure multiparty computation, do not offer enough privacy to make participation safe from the banks’ perspective. We then sketch the design of a system that can perform this (and possibly other) computation at scale with strong privacy guarantees. Results from an early prototype suggest that the computation and communication costs are reasonable.
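To make the kind of computation concrete, here is a minimal, hypothetical sketch (not the authors' protocol) of additive secret sharing, one standard building block of secure multiparty computation: each bank splits its private figure into random shares so that only the aggregate can be reconstructed. The modulus, bank count, and exposure values are illustrative.

```python
import random

MODULUS = 2**61 - 1  # illustrative prime modulus for arithmetic modulo p

def share(value, num_parties):
    """Split `value` into additive shares that sum to it modulo MODULUS."""
    shares = [random.randrange(MODULUS) for _ in range(num_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares

# Three hypothetical banks with private exposures they will not reveal:
exposures = [120, 75, 300]
n = len(exposures)
shares_by_sender = [share(v, n) for v in exposures]  # bank i sends its j-th share to bank j
held_by_receiver = list(zip(*shares_by_sender))      # what each bank ends up holding

# Each bank publishes only the sum of the shares it holds; the sum of these
# partial sums is the aggregate exposure, while individual inputs stay hidden.
partials = [sum(col) % MODULUS for col in held_by_receiver]
aggregate = sum(partials) % MODULUS
assert aggregate == sum(exposures) % MODULUS
print("aggregate exposure:", aggregate)
```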
Stefan Brenner, Colin Wulf, and Rüdiger Kapitza, Technische Universität Braunschweig
Cloud computing is a recent trend in computer science. However, privacy concerns and a lack of trust in cloud providers are an obstacle for many deployments. Maturing hardware support for implementing Trusted Execution Environments (TEEs) aims at mitigating these problems. Such technologies allow applications to run in a trusted environment, thereby protecting data from unauthorized access. To reduce the risk of security vulnerabilities, code executed inside a TEE should have a minimal Trusted Codebase. As a consequence, there is a trend toward partitioning an application’s logic into trusted and untrusted parts. Only the trusted parts, which handle the privacy-sensitive processing steps, are executed inside the TEE.
In this paper, we add a transparent encryption layer to ZooKeeper by means of a privacy proxy intended to run inside a TEE. Using ZooKeeper as an example, we show what measures are necessary to split an application into a trusted and an untrusted part in order to protect the data it stores. With our solution, ZooKeeper can be deployed at untrusted cloud providers, establishing confidential coordination for distributed applications. With our privacy proxy, all ZooKeeper functionality is retained with only a small degradation in throughput.
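As a rough sketch of what a transparent encryption layer might look like (this is not the paper's privacy proxy, and outside a TEE it offers none of the paper's guarantees), the following wrapper encrypts znode values before they reach an untrusted ZooKeeper ensemble. It assumes the kazoo client library and a symmetric key that never leaves the trusted side; the host string, paths, and class name are illustrative.

```python
from kazoo.client import KazooClient
from cryptography.fernet import Fernet

class EncryptingZk:
    """Toy wrapper that stores only ciphertext in ZooKeeper."""

    def __init__(self, hosts, key):
        self.zk = KazooClient(hosts=hosts)
        self.zk.start()
        self.cipher = Fernet(key)  # the key must stay on the trusted side

    def create(self, path, value: bytes):
        # Encrypt the znode value before it is sent to the untrusted ensemble.
        self.zk.create(path, self.cipher.encrypt(value), makepath=True)

    def get(self, path) -> bytes:
        data, _stat = self.zk.get(path)
        return self.cipher.decrypt(data)

# Usage (illustrative): only ciphertext ever reaches the ZooKeeper servers.
# key = Fernet.generate_key()
# zk = EncryptingZk("zk.example.com:2181", key)
# zk.create("/app/secret-config", b"db-password=...")
# print(zk.get("/app/secret-config"))
```

A real deployment would also have to consider znode paths, ACLs, and watches, which this toy wrapper ignores.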
Takeshi Yoshimura and Kenji Kono, Keio University
Static code checkers have been useful for finding bugs in large-scale C code. Domain-specific checkers are particularly effective in finding deep and subtle bugs because they can make use of domain-specific knowledge. To develop domain-specific checkers, however, typical bug patterns in certain domains must first be extracted. This paper explores the use of machine learning to help extract bug patterns from bug repositories. We used natural language processing to analyze over 370,000 Linux bug descriptions and classified them into 66 clusters. Our preliminary work with this approach is encouraging: by investigating one of the 66 clusters, we were able to identify typical bug patterns in PCI device drivers and developed static code checkers to find them. When applied to the latest version of Linux, the checkers found two previously unknown bugs.
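As a minimal illustration of the general technique (TF-IDF text features plus clustering), not the authors' actual pipeline, the following sketch groups a handful of invented bug descriptions with scikit-learn; the paper works at a far larger scale (over 370,000 reports, 66 clusters).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Tiny, invented corpus standing in for a real bug repository.
bug_descriptions = [
    "pci driver leaks memory on probe failure path",
    "missing pci_disable_device in error handling of probe",
    "null pointer dereference when device removed during suspend",
    "use after free in network driver teardown",
]

# Turn each description into a TF-IDF feature vector.
vectorizer = TfidfVectorizer(stop_words="english")
features = vectorizer.fit_transform(bug_descriptions)

# The paper derives 66 clusters from its corpus; 2 is only for this toy data.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
for text, label in zip(bug_descriptions, kmeans.labels_):
    print(label, text)
```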
Nuno Santos, INESC-ID and Instituto Superior Técnico, Universidade de Lisboa; Nuno P. Lopes, INESC-ID and Instituto Superior Técnico, Universidade de Lisboa and Microsoft Research In the last years, it has emerged a market of virtual appliances, i.e., virtual machine images specifically configured to provide a given service (e.g., web hosting). The virtual appliance model greatly reduces the burden of configuring virtual machines from scratch. However, the current model involves risks: security threats, misconfigurations, privacy loss, etc. In this paper, we propose an approach to build dependable virtual machines. It is based on trusted computing and model checking: trusted computing allows for low-level attestation of the software of a virtual appliance, and model checking provides for the automatic verification of the software’s high-level configuration properties. We present our approach, and discuss open research challenges.
|
Noon–1:00 p.m. |
Sunday |
Workshop Luncheon
Interlocken B |
1:00 p.m.–2:40 p.m. |
Sunday |
Session Chair: Andreas Haeberlen, University of Pennsylvania
Takeshi Miyamae, Takanori Nakao, and Kensuke Shiozawa, Fujitsu Laboratories Ltd.
The ever-growing importance and volume of digital content generated by ICT services have led to demand for highly durable and space-efficient content storage technology. Erasure codes can be an effective solution to such requirements, but current research outcomes do not efficiently handle simultaneous multiple disk failures. We propose Shingled Erasure Code (SHEC), an erasure code with local parity groups shingled with each other, to provide efficient recovery from multiple disk failures while ensuring that the conflicting properties of space efficiency and durability remain adjustable according to user requirements. We have confirmed that SHEC meets these design goals using the results of a numerical study of the relationships among the conflicting properties and a performance evaluation of an actual SHEC implementation on Ceph, an open-source scalable object storage system.
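The "shingled" idea can be pictured as overlapping XOR-based local parity groups; the sketch below is a toy example only and does not reproduce SHEC's actual layout, parameters, or recovery algorithm. The chunk sizes and group boundaries are made up for illustration.

```python
from functools import reduce

def xor_bytes(chunks):
    """Byte-wise XOR of equally sized chunks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))

data = [bytes([i]) * 4 for i in range(6)]    # d0..d5, 4-byte toy data chunks
groups = [(0, 1, 2), (2, 3, 4), (4, 5, 0)]   # overlapping ("shingled") local parity groups
parities = [xor_bytes([data[i] for i in g]) for g in groups]

# Recover a single lost chunk locally, from any group that covers it:
lost = 3
g_idx = next(i for i, g in enumerate(groups) if lost in g)
survivors = [data[i] for i in groups[g_idx] if i != lost]
recovered = xor_bytes(survivors + [parities[g_idx]])
assert recovered == data[lost]
```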
Shehbaz Jaffer, Mangesh Chitnis, and Ameya Usgaonkar, NetApp Inc.
A Virtual Storage Architecture (VSA) is a storage controller deployed as a virtual machine on a server with a hypervisor. The advantage of a VSA is that it leverages shared data storage services without procuring additional storage hardware, which makes it a cost-effective solution. In a VSA, high availability (HA) is achieved by restarting the failed virtual machine in the event of a software failure. Rebooting the VSA is a slow operation, which reduces overall service availability. In this paper, we describe the challenges and the approaches taken to decrease the reboot time of a VSA to achieve high availability. We have been able to reduce the VSA reboot time by 18% using our optimizations. We also explore the changes required in journal-based file systems for efficient operation in the cloud.
Xin Xu and H. Howie Huang, George Washington University
Hardware errors are no longer the exception in modern cloud data centers. Although virtualization provides software failure isolation across different virtual machines (VMs), the virtualization infrastructure, including the hypervisor and privileged VMs, remains vulnerable to hardware errors. Making matters worse, such errors are unlikely to be contained by the virtualization boundary and may lead to loss of work in multiple guest VMs due to unexpected and/or mishandled failures. To understand the reliability implications of hardware errors in virtualized systems, in this paper we develop a simulation-based framework that enables a comprehensive fault injection study on the hypervisor across a wide range of configurations. Our analysis shows that, in current systems, many hardware errors can propagate through various paths for an extended time before a failure is observed (e.g., a whole-system crash). We further discuss the challenges of designing error tolerance techniques for the hypervisor.
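A fault injection campaign can be pictured as repeatedly flipping a bit in simulated processor state and observing the outcome; the toy loop below is only a conceptual sketch, not the paper's framework, with an invented "workload" standing in for hypervisor code running in a simulator.

```python
import random

def flip_random_bit(value, width=64):
    """Flip one randomly chosen bit of a `width`-bit integer."""
    return value ^ (1 << random.randrange(width))

def run_workload(registers):
    # Stand-in for running hypervisor code in a simulator and watching for an
    # observable failure; real detectors are far more involved than this check.
    if registers["rsp"] % 16 != 0:
        return "crash"
    return "ok"

golden = {"rax": 0x1234, "rsp": 0x7FFFFFE0}  # made-up register snapshot
outcomes = {"ok": 0, "crash": 0}
for _ in range(1000):
    faulty = dict(golden)
    target = random.choice(list(faulty))          # pick a register to corrupt
    faulty[target] = flip_random_bit(faulty[target])
    outcomes[run_workload(faulty)] += 1
print(outcomes)  # rough distribution of observed outcomes
```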
Jonathan Mace, Brown University; Peter Bodik, Microsoft Research; Rodrigo Fonseca, Brown University; Madanlal Musuvathi, Microsoft Research
In distributed services shared by multiple tenants, managing resource allocation is an important prerequisite to providing dependability and quality-of-service guarantees. Many systems deployed today experience contention, slowdown, and even system outages due to aggressive tenants and a lack of resource management. Improperly throttled background tasks, such as data replication, can overwhelm a system; conversely, high-priority background tasks, such as heartbeats, can be subject to resource starvation. In this paper, we outline five design principles necessary for effective and efficient resource management policies that could provide guaranteed performance, fairness, or isolation. We present Retro, a resource instrumentation framework that is guided by these principles. Retro instruments all system resources and exposes detailed, real-time statistics of per-tenant resource consumption, and could serve as a base for the implementation of such policies.
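A bare-bones version of per-tenant resource accounting (not Retro's actual interface) might look like the sketch below: instrumented operations report their tenant and resource usage, and an aggregator exposes real-time totals that a policy could act on. All names and units are made up for illustration.

```python
from collections import defaultdict
import threading

class ResourceStats:
    """Thread-safe per-tenant, per-resource usage counters."""

    def __init__(self):
        self._lock = threading.Lock()
        self._usage = defaultdict(lambda: defaultdict(float))

    def record(self, tenant, resource, amount):
        # Called from instrumented operations throughout the system.
        with self._lock:
            self._usage[tenant][resource] += amount

    def snapshot(self):
        # A scheduling or throttling policy could poll this periodically.
        with self._lock:
            return {t: dict(r) for t, r in self._usage.items()}

stats = ResourceStats()
stats.record("tenant-A", "disk_bytes", 4096)
stats.record("tenant-B", "cpu_ms", 12.5)
stats.record("tenant-A", "cpu_ms", 3.0)
print(stats.snapshot())
```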
Johannes Behl, Technische Universität Braunschweig; Tobias Distler, Friedrich-Alexander-Universität Erlangen-Nürnberg; Rüdiger Kapitza, Technische Universität Braunschweig
To pave the way for Byzantine fault-tolerant (BFT) systems that can exploit the potential of modern multi-core platforms, we present a new parallelization scheme enabling BFT systems to scale with the number of available cores and to provide the performance required by critical central services. The main idea is to organize parallelism around complete instances of the underlying multi-phase BFT agreement protocols, and not around single tasks (e.g., authenticating messages), as realized in state-of-the-art systems. We implemented this consensus-oriented parallelization scheme on the basis of a BFT prototype that permits flexibly configured parallelism by relying on an actor decomposition. In an early evaluation conducted on machines with twelve cores, the consensus-oriented parallelization achieved over 200% higher throughput than a traditional approach while leaving the potential to utilize even more cores and exhibiting significantly greater efficiency in a single-core setup.
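The core idea can be sketched as follows (this is not the authors' prototype): each worker owns complete agreement instances end to end, instead of all workers sharing fine-grained tasks such as message authentication. The instance routing and the placeholder "protocol phases" are invented for the example.

```python
import queue
import threading

NUM_WORKERS = 12  # e.g., one worker per core on a twelve-core machine
inboxes = [queue.Queue() for _ in range(NUM_WORKERS)]
results = queue.Queue()

def run_instance(instance_id, request):
    # Stand-in for executing all phases of one BFT agreement instance
    # (pre-prepare, prepare, commit, ...) within a single worker.
    return (instance_id, f"ordered({request})")

def worker(inbox):
    while True:
        item = inbox.get()
        if item is None:          # shutdown signal
            break
        results.put(run_instance(*item))

threads = [threading.Thread(target=worker, args=(q,)) for q in inboxes]
for t in threads:
    t.start()

for i in range(100):
    # Route instance i to worker i mod NUM_WORKERS so that a single worker
    # handles the whole instance, rather than splitting it into shared tasks.
    inboxes[i % NUM_WORKERS].put((i, f"req-{i}"))

for q in inboxes:
    q.put(None)
for t in threads:
    t.join()
print(results.qsize(), "instances ordered")
```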
|
2:40 p.m.–3:00 p.m. |
Sunday |
Break
Centennial Foyer |
3:00 p.m.–4:00 p.m. |
Sunday |
Session Chair: Nalini Venkatasubramanian, University of California, Irvine
Panelists: David Corman, National Science Foundation; Sokwoo Rhee, National Institute of Standards and Technology (NIST); Amar Phanishayee, Microsoft Research, Redmond; Nalini Venkatasubramanian, University of California, Irvine
The Internet of Things (also known as the Internet of Everything, the Industrial Internet, or simply IoT) is seen by many as The Next Big Thing. The IoT is a technology, an economic ecosystem, and a market development that looks at the interconnection of objects (some everyday and some not) among themselves and to computational resources, as well as the applications that run on them. IoT has the promise of creating a rich and dynamic environment of smart applications and systems that will improve and enrich our lives. The IT research agency International Data Corporation estimated that the current IoT market is around $1.9 trillion and will grow to $7.1 trillion by 2020. At the same time, there are real worries that the IoT is being developed and deployed with little to no thought about security, privacy, and dependability. Some researchers have coined a new name—the Internet of Broken Things—to emphasize this lack of concern for such crosscutting issues.
This panel will feature presentations from researchers in industry, academia, and federal agencies who are involved in the Internet of Things. The speakers will address issues at different layers of the system stack and will also cover a diverse set of application domains. We will then open the floor for a Q&A session and group discussion on the issues, approaches, and roadblocks on the way to a dependable IoT.
|
4:00 p.m.–4:30 p.m. |
Sunday |
Break with Refreshments
Centennial Foyer |
4:30 p.m.–5:30 p.m. |
Sunday |
Session Chair: Flavio Junqueira, Microsoft Research Cambridge
Jim Kurose, School of Computer Science, University of Massachusetts Amherst
The notion of an Internet "control plane" has slowly evolved from its origins in specific routing protocols (e.g., the original ARPAnet routing algorithm) to a more general framework for providing forwarding in network routers and maintaining network state. Several distinct generations of the control plane can be identified, reflecting the different concerns (e.g., scalability, reliability, and openness), technology tradeoffs (e.g., forwarding versus processing speeds, memory), and envisioned use at the time. Along the way, fundamentally new concepts were pioneered and validated in practice, including a best-effort versus guaranteed service model; end-point control and soft state; traffic engineering capabilities; and an increasingly clear separation of routing from forwarding and policy from implementation. In this talk, we discuss this evolution and its driving factors, contrasting them with their counterparts in telephone and mobile cellular networks, and drawing lessons learned. We discuss the continuing evolution of the control plane through software-defined networking, and potential future control planes envisioned in Future Internet Architecture (FIA) projects, including MobilityFirst.
Jim Kurose received a B.A. in physics from Wesleyan University and a Ph.D. in computer science from Columbia University. He is a Distinguished University Professor in the School of Computer Science at the University of Massachusetts, where he has also served in a number of campus administrative roles, including department chair and dean of the College of Natural Sciences and Mathematics. He has been a Visiting Scientist at IBM Research, INRIA, LINCS, Institut EURECOM, the University of Paris, and Technicolor Research Labs.
His research interests include network protocols and architecture, network measurement, multimedia communication, and modeling and performance evaluation. Dr. Kurose has served as Editor-in-Chief of the IEEE Transactions on Communications and was the founding Editor-in-Chief of the IEEE/ACM Transactions on Networking. He has been active in the program committees for the IEEE Infocom, ACM SIGCOMM, ACM SIGMETRICS, and ACM Internet Measurement conferences for a number of years, and has served as Technical Program Co-Chair for these conferences. He has received several conference best paper awards, the ACM Sigcomm Test of Time Award, and the IEEE Infocom Award. He is also the recipient of a number of teaching awards, including the IEEE Taylor Booth Education Medal.
He currently serves on the Board of Directors of the Computing Research Association and the advisory council of the Computer and Information Science and Engineering (CISE) Directorate at the National Science Foundation. He is a Fellow of the IEEE and the ACM.
With Keith Ross, he is the co-author of the textbook Computer Networking, a Top-Down Approach (6th edition), published by Pearson.
|
5:30 p.m.–5:40 p.m. |
Sunday |
Session Chair: Flavio Junqueira, Microsoft Research Cambridge
|