USENIX ATC '12 Technical Sessions

All sessions will be held in Constitution A unless otherwise noted.

The full USENIX ATC '12 Proceedings are now available.

Wednesday, June 13, 2012

8:30 a.m.–9:00 a.m. Wednesday

Welcome and Awards

Program Co-Chairs: Gernot Heiser, NICTA and University of New South Wales; Wilson Hsieh, Google

9:00 a.m.–10:30 a.m. Wednesday

Cloud

Constitution Ballroom

Session Chair: Wolfgang Schröder-Preikschat, Friedrich-Alexander University Erlangen-Nuremberg

Demand Based Hierarchical QoS Using Storage Resource Pools

Ajay Gulati and Ganesha Shanmuganathan, VMware Inc.; Xuechen Zhang, Wayne State University; Peter Varman, Rice University

The high degree of storage consolidation in modern virtualized datacenters requires flexible and efficient ways to allocate IO resources among virtual machines (VMs). Existing IO resource management techniques have two main deficiencies: (1) they are restricted in their ability to allocate resources across multiple hosts sharing a storage device, and (2) they do not permit the administrator to set allocations for a group of VMs that are providing a single service or belong to the same application.

In this paper we present the design and implementation of a novel software system called Storage Resource Pools (SRP). SRP supports the logical grouping of related VMs into hierarchical pools. SRP allows reservations, limits, and proportional shares at both the VM and pool levels. Spare resources are allocated to VMs in the same pool in preference to other VMs. The VMs may be distributed across multiple physical hosts without consideration of their logical groupings. We have implemented a prototype of storage resource pools in the VMware ESX hypervisor. Our results demonstrate that SRP provides hierarchical performance isolation and sharing among groups of VMs running across multiple hosts, while maintaining high utilization of the storage device.
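
To make the pool abstraction concrete, the sketch below shows one way hierarchical allocation with shares, reservations, and limits could look. The names and the single-pass division are hypothetical simplifications, not VMware's implementation; SRP's actual allocator runs inside ESX and, among other things, redistributes any surplus freed by clamping.

    /* Minimal sketch of hierarchical proportional-share IO allocation
     * with reservations and limits, in the spirit of SRP's resource
     * pools.  All names are hypothetical. */
    #include <stdio.h>

    #define MAX_CHILDREN 8

    struct node {
        const char *name;
        double shares;        /* relative weight among siblings */
        double reservation;   /* minimum IOPS guaranteed */
        double limit;         /* maximum IOPS allowed */
        int nchildren;
        struct node *child[MAX_CHILDREN];
    };

    /* Split a pool's IOPS budget among its children in proportion to
     * shares, clamp each child to [reservation, limit], then recurse.
     * A real allocator would redistribute the surplus freed by the
     * clamping; this single pass only illustrates the hierarchy. */
    static void allocate(struct node *n, double budget)
    {
        printf("%-8s gets %6.0f IOPS\n", n->name, budget);
        double total = 0;
        for (int i = 0; i < n->nchildren; i++)
            total += n->child[i]->shares;
        for (int i = 0; i < n->nchildren; i++) {
            struct node *c = n->child[i];
            double grant = budget * c->shares / total;
            if (grant < c->reservation) grant = c->reservation;
            if (grant > c->limit)       grant = c->limit;
            allocate(c, grant);
        }
    }

    int main(void)
    {
        struct node web  = { "web-vm", 2, 100, 1e9, 0, {0} };
        struct node db   = { "db-vm",  1, 200, 400, 0, {0} };
        struct node pool = { "pool",   1,   0, 1e9, 2, { &web, &db } };
        allocate(&pool, 1200);   /* device currently sustains 1200 IOPS */
        return 0;
    }

With a 1200 IOPS budget, web-vm's two shares earn it 800 IOPS while db-vm is capped at its 400 IOPS limit, illustrating how shares and limits interact inside a pool.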

 

Available Media

Erasure Coding in Windows Azure Storage

Cheng Huang, Huseyin Simitci, Yikang Xu, Aaron Ogus, Brad Calder, Parikshit Gopalan, Jin Li, and Sergey Yekhanin, Microsoft Corporation
    Awarded Best Paper!

Windows Azure Storage (WAS) is a cloud storage system that provides customers the ability to store seemingly limitless amounts of data for any duration of time. WAS customers have access to their data from anywhere, at any time, and only pay for what they use and store. To provide durability for that data and to keep the cost of storage low, WAS uses erasure coding.

In this paper we introduce a new set of codes for erasure coding called Local Reconstruction Codes (LRC). LRC reduces the number of erasure coding fragments that need to be read when reconstructing data fragments that are offline, while still keeping the storage overhead low. The important benefits of LRC are that it reduces the bandwidth and I/Os required for repair reads over prior codes, while still allowing a significant reduction in storage overhead. We describe how LRC is used in WAS to provide low overhead durable storage with consistently low read latencies.
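
To illustrate the locality idea in the smallest possible setting, the toy below uses one local group of three data fragments protected by a plain XOR parity. Real LRC uses Reed-Solomon-style coefficients over a Galois field and adds global parities on top of the local groups, so this is only a sketch of the read-count benefit: rebuilding one lost fragment touches the two surviving group members plus the local parity, rather than every fragment in the stripe.

    /* Toy local-group reconstruction with XOR parity; not the actual
     * LRC construction, which uses Galois-field coefficients and
     * additional global parities. */
    #include <stdio.h>
    #include <stdint.h>

    #define GROUP 3

    static uint8_t xor_parity(uint8_t frag[][4], int n, int skip, int byte)
    {
        uint8_t p = 0;
        for (int i = 0; i < n; i++)
            if (i != skip)
                p ^= frag[i][byte];
        return p;
    }

    int main(void)
    {
        /* One local group: 3 data fragments of 4 bytes each. */
        uint8_t data[GROUP][4] = { {1,2,3,4}, {5,6,7,8}, {9,10,11,12} };
        uint8_t parity[4];
        for (int b = 0; b < 4; b++)
            parity[b] = xor_parity(data, GROUP, -1, b);

        /* Fragment 1 goes offline: rebuild it from the 2 surviving
         * group members plus the local parity -- 3 reads, where a code
         * spanning all 6 data fragments of the stripe would read 6. */
        uint8_t rebuilt[4];
        for (int b = 0; b < 4; b++)
            rebuilt[b] = parity[b] ^ xor_parity(data, GROUP, 1, b);

        for (int b = 0; b < 4; b++)
            printf("%d ", rebuilt[b]);   /* prints 5 6 7 8 */
        printf("\n");
        return 0;
    }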

 

Available Media

Composable Reliability for Asynchronous Systems

Sunghwan Yoo, Purdue University and HP Labs; Charles Killian, Purdue University; Terence Kelly, HP Labs; Hyoun Kyu Cho, HP Labs and University of Michigan; Steven Plite, Purdue University

Distributed system designs often employ replication to solve two different kinds of availability problems: first, to prevent the loss of data through the permanent destruction or disconnection of a distributed node, and second, to allow prompt retrieval of data when some distributed nodes respond slowly. For simplicity, many systems further handle crash-restart failures and timeouts by treating them as a permanent disconnection followed by the birth of a new node, relying on peer replication rather than persistent storage to preserve data. We posit that for applications deployed in modern managed infrastructures, delays are typically transient and failed processes and machines are likely to be restarted promptly, so it is often desirable to resume crashed processes from persistent checkpoints. In this paper we present MaceKen, a synthesis of complementary techniques including Ken, a lightweight and decentralized rollback-recovery protocol that transparently masks crash-restart failures by careful handling of messages and state checkpoints; and Mace, a programming toolkit supporting development of distributed applications and application-specific availability via replication. MaceKen requires near-zero additional developer effort—systems implemented in Mace can immediately benefit from the Ken protocol by virtue of following the Mace execution model. Moreover, this model allows multiple, independently developed application components to be seamlessly composed, preserving strong global reliability guarantees. Our implementation is available as open source software.

 

Available Media
10:30 a.m.–11:00 a.m. Wednesday

Break

Constitution Foyer

11:00 a.m.–12:30 p.m. Wednesday

Multicore

Session Chair: Alexandra Fedorova, Simon Fraser University

Managing Large Graphs on Multi-Cores with Graph Awareness

Vijayan Prabhakaran, Ming Wu, Xuetian Weng, Frank McSherry, Lidong Zhou, and Maya Haridasan, Microsoft Research

Grace is a graph-aware, in-memory, transactional graph management system, specifically built for real-time queries and fast iterative computations. It is designed to run on large multi-cores, taking advantage of the inherent parallelism to improve its performance. Grace contains a number of graph-specific and multicore-specific optimizations, including graph partitioning, careful in-memory vertex ordering, update batching, and load balancing. It supports queries, searches, iterative computations, and transactional updates. Grace scales to large graphs (e.g., a Hotmail graph with 320 million vertices) and performs up to two orders of magnitude faster than commercial key-value stores and graph databases.

 

Available Media

MemProf: A Memory Profiler for NUMA Multicore Systems

Renaud Lachaize, UJF; Baptiste Lepers, CNRS; Vivien Quéma, Grenoble INP

Modern multicore systems are based on a Non-Uniform Memory Access (NUMA) design. Efficiently exploiting such architectures is notoriously complex for programmers. One of the key concerns is to limit as much as possible the number of remote memory accesses (i.e., main memory accesses performed from a core to a memory bank that is not directly attached to it). However, in many cases, existing profilers do not provide enough information to help programmers achieve this goal.

This paper presents MemProf, a profiler that allows programmers to choose and implement efficient application-level optimizations for NUMA systems. MemProf builds temporal flows of interactions between threads and objects, which help programmers understand why and which memory objects are accessed remotely. We evaluate MemProf on Linux using four applications (FaceRec, Streamcluster, Psearchy, and Apache) on three different machines. In each case, we show how MemProf helps us choose and implement efficient optimizations, unlike existing profilers. These optimizations provide significant performance gains (up to 161%), while requiring very lightweight modifications (10 lines of code or less).
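
The fixes MemProf points to are typically of the following flavor: once the profile shows which object is read remotely and by whom, a few lines place the object and its consumers on the same node. This snippet is an illustration using Linux's libnuma (link with -lnuma), not code from the paper; the buffer name and node choice are made up.

    #include <numa.h>
    #include <string.h>

    static float *features;          /* hypothetical hot, read-mostly object */

    /* Place the object on the node whose cores the profile showed doing
     * the remote reads, instead of relying on first-touch placement. */
    int setup(size_t nbytes, int consumer_node)
    {
        if (numa_available() < 0)
            return -1;               /* no NUMA support on this kernel */
        features = numa_alloc_onnode(nbytes, consumer_node);
        if (!features)
            return -1;
        memset(features, 0, nbytes); /* touch pages after placement */
        return 0;
    }

    /* Keep the consuming thread on that node so its accesses stay local. */
    void worker_init(int node)
    {
        numa_run_on_node(node);
    }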

 

Available Media

Remote Core Locking: Migrating Critical-Section Execution to Improve the Performance of Multithreaded Applications

Jean-Pierre Lozi, Florian David, Gaël Thomas, Julia Lawall, and Gilles Muller, LIP6/INRIA

The scalability of multithreaded applications on current multicore systems is hampered by the performance of lock algorithms, due to the costs of access contention and cache misses. In this paper, we propose a new lock algorithm, Remote Core Locking (RCL), that aims to improve the performance of critical sections in legacy applications on multicore architectures. The idea of RCL is to replace lock acquisitions by optimized remote procedure calls to a dedicated server core. RCL limits the performance collapse observed with other lock algorithms when many threads try to acquire a lock concurrently and removes the need to transfer lock-protected shared data to the core acquiring the lock because such data can typically remain in the server core’s cache.

We have developed a profiler that identifies the locks that are bottlenecks in multithreaded applications and that can thus benefit from RCL, and a reengineering tool that transforms POSIX locks into RCL locks. We have evaluated our approach on 18 applications: Memcached, Berkeley DB, the 9 applications of the SPLASH-2 benchmark suite, and the 7 applications of the Phoenix 2 benchmark suite. Ten of these applications, including Memcached and Berkeley DB, are unable to scale because of locks and benefit from RCL. Using RCL locks, we obtain performance improvements of up to 2.6 times over POSIX locks on Memcached, and up to 14 times on Berkeley DB.
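
The core mechanism can be sketched in a few lines: each client owns a request slot, publishes a function pointer describing its critical section, and spins until the dedicated server core clears it. This is a hypothetical, unoptimized rendering of the idea; the paper's runtime adds multiple server threads, blocking, and the reengineering tool, none of which appear here.

    #include <stdatomic.h>
    #include <stddef.h>

    #define MAX_CLIENTS 64

    typedef void (*cs_fn)(void *);

    struct rcl_request {
        _Atomic cs_fn cs;   /* critical section to run; NULL = slot empty */
        void *arg;
    };

    static struct rcl_request slot[MAX_CLIENTS];

    /* Client side: replaces lock(l); f(arg); unlock(l). */
    void rcl_execute(int client_id, cs_fn f, void *arg)
    {
        slot[client_id].arg = arg;
        atomic_store(&slot[client_id].cs, f);        /* publish the request */
        while (atomic_load(&slot[client_id].cs) != NULL)
            ;                                        /* spin until served */
    }

    /* Dedicated server core: data protected by the lock stays hot in
     * this core's cache, so shared state never bounces between cores. */
    void rcl_server(void)
    {
        for (;;)
            for (int i = 0; i < MAX_CLIENTS; i++) {
                cs_fn f = atomic_load(&slot[i].cs);
                if (f != NULL) {
                    f(slot[i].arg);                  /* run the critical section */
                    atomic_store(&slot[i].cs, NULL); /* signal completion */
                }
            }
    }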

 

Available Media
12:30 p.m.–1:30 p.m. Wednesday

FCW Luncheon

Back Bay CD

1:30 p.m.–3:30 p.m. Wednesday

Packet Processing

Session Chair: Eddie Kohler, Harvard University

The Click2NetFPGA Toolchain

Teemu Rinta-aho and Mika Karlstedt, NomadicLab, Ericsson Research; Madhav P. Desai, Indian Institute of Technology (Bombay)

High Level Synthesis (HLS) is a promising technology where algorithms described in high-level languages are automatically transformed into a hardware design. Although many HLS tools exist, they mainly target developers who want to use a high-level programming language to design hardware modules. They are not designed to automatically compile a complete software system, such as a network packet processing application, into a hardware design.

In this paper, we describe a compiler toolchain that automatically transforms existing software in a limited domain to a functional hardware design. We have selected the Click Modular Router as the input system, and the Stanford NetFPGA as the target hardware platform. Our toolchain uses LLVM to transform Click C++ code into a form suitable for hardware implementation and then uses AHIR, a high level synthesis toolchain, to produce a VHDL netlist.

The resulting netlist has been verified with actual hardware on the NetFPGA platform. The resulting hardware achieves 20-50% of the performance of a version handwritten in Verilog. We expect that improvements to the toolchain could provide better performance, but for a first prototype the results are good. We feel that one of the biggest contributions of this work is that it shows some new principles of high-level synthesis that could also be applied to other domains, source languages, and targets.

 

Available Media

Building a Power-Proportional Software Router

Luca Niccolini, University of Pisa; Gianluca Iannaccone, RedBow Labs; Sylvia Ratnasamy, University of California, Berkeley; Jaideep Chandrashekar, Technicolor Labs; Luigi Rizzo, University of Pisa and University of California, Berkeley

We aim at improving the power efficiency of network routers without compromising their performance. Using server-based software routers as our prototyping vehicle, we investigate the design of a router that consumes power in proportion to the rate of incoming traffic. We start with an empirical study of power consumption in current software routers, decomposing the total power consumption into its component causes. Informed by this analysis, we develop software mechanisms that exploit the underlying hardware’s power management features for more energy-efficient packet processing. We incorporate these mechanisms into Click and demonstrate a router that matches the peak performance of the original (unmodified) router while consuming up to half the power at low loads, with negligible impact on the packet forwarding latency.

 

Available Media

netmap: A Novel Framework for Fast Packet I/O

Luigi Rizzo, Università di Pisa, Italy
    Awarded Best Paper!

Many applications (routers, traffic monitors, firewalls, etc.) need to send and receive packets at line rate even on very fast links. In this paper we present netmap, a novel framework that enables commodity operating systems to handle the millions of packets per second traversing 1-10 Gbit/s links, without requiring custom hardware or changes to applications.

In building netmap, we identified and successfully reduced or removed three main packet processing costs: per-packet dynamic memory allocations, removed by preallocating resources; system call overheads, amortized over large batches; and memory copies, eliminated by sharing buffers and metadata between kernel and userspace, while still protecting access to device registers and other kernel memory areas. Some of these techniques have been used separately in the past. The novelty of our proposal is not only that we exceed the performance of most previous work, but also that we provide an architecture that is tightly integrated with existing operating system primitives, not tied to specific hardware, and easy to use and maintain.

netmap has been implemented in FreeBSD and Linux for several 1 and 10 Gbit/s network adapters. In our prototype, a single core running at 900 MHz can send or receive 14.88 Mpps (the peak packet rate on 10 Gbit/s links). This is more than 20 times faster than conventional APIs. Large speedups (5x and more) are also achieved on user-space Click and other packet forwarding applications using a libpcap emulation library running on top of netmap.
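
A sender built on netmap looks roughly like the sketch below, which follows the API as the paper describes it (the ring fields shown, such as cur and avail, match the 2012 interface; later netmap releases renamed some of them). Error handling and NIC setup are omitted, so treat this as illustrative rather than a drop-in program.

    #include <fcntl.h>
    #include <poll.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <net/netmap.h>
    #include <net/netmap_user.h>

    void tx_loop(const char *ifname, const char *pkt, unsigned len)
    {
        struct nmreq req;
        int fd = open("/dev/netmap", O_RDWR);

        memset(&req, 0, sizeof(req));
        req.nr_version = NETMAP_API;
        strncpy(req.nr_name, ifname, sizeof(req.nr_name) - 1);
        ioctl(fd, NIOCREGIF, &req);          /* put the NIC in netmap mode */

        /* Rings and packet buffers are shared with the kernel: no
         * per-packet allocation and no copies across the boundary. */
        char *mem = mmap(NULL, req.nr_memsize, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        struct netmap_if *nifp = NETMAP_IF(mem, req.nr_offset);
        struct netmap_ring *ring = NETMAP_TXRING(nifp, 0);

        for (;;) {
            struct pollfd pfd = { .fd = fd, .events = POLLOUT };
            poll(&pfd, 1, -1);               /* one syscall per batch */
            while (ring->avail > 0) {        /* fill every free slot */
                struct netmap_slot *slot = &ring->slot[ring->cur];
                memcpy(NETMAP_BUF(ring, slot->buf_idx), pkt, len);
                slot->len = len;
                ring->cur = NETMAP_RING_NEXT(ring, ring->cur);
                ring->avail--;
            }
        }
    }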

 

Available Media

Toward Efficient Querying of Compressed Network Payloads

Teryl Taylor, UNC Chapel Hill; Scott E. Coull, RedJack; Fabian Monrose, UNC Chapel Hill; John McHugh, RedJack

Forensic analysts typically require access to application-layer information gathered over long periods of time to completely investigate network security incidents. Unfortunately, storing longitudinal network data is often at odds with maintaining detailed payload information due to the overhead associated with storing and querying such data. Thus, the analyst is left to choose between coarse information about long-term network activities and brief glimpses of detailed attack activity. In this paper, we take the first steps toward a storage framework for network payload information that provides a better balance between these two extremes. We take advantage of the redundancy found in network data to aggregate payload information into flexible and efficiently compressible data objects that are associated with network flows. To enable interactive querying, we introduce a hierarchical indexing structure for both the flow and payload information, which allows us to quickly prune irrelevant data and answer queries directly from the indexing information. Our empirical results on data collected from a campus network show that our approach can significantly reduce the volume of the stored data, while simultaneously preserving the ability to perform detailed queries with response times on the order of seconds.

 

Available Media
3:30 p.m.–4:00 p.m. Wednesday

Break

Constitution Foyer

4:00 p.m.–5:30 p.m. Wednesday

Plenary Session

Build a Linux-Based Mobile Robotics Platform (for Less than $500)

Mark Woodward, Actifio

Let's face it: Whether you are a child (of any age) or a serious researcher, robots are cool. And they can be extremely useful, as burgeoning work in ROV, UAV, search-and-rescue, mapping and surveying, and simple housecleaning has shown. In this talk we'll look at some of the nuts and bolts of building robots and show how to use basic technologies to build a mobile robotic platform for your application (or hobby) for less than $500. We'll compare available choices for batteries, power supplies, motors/wheels, and drive electronics. We will also discuss how to use the Arduino processor to implement closed-loop motor control, and we'll talk about user-space hardware I/O programming in Linux. If you've ever thought about building and/or using a robot, you'll not only be surprised at how easy it can be but will ask yourself why you haven't done it yet!

Mark is a software engineer and has worked in the industry for over 25 years. His first high-tech position was at Denning Mobile Robotics as an electrical engineer/technician. Not sticking to purely UNIX/Linux systems, he is also a contributing author of Tricks of the Windows 3.1 Masters and wrote assorted corporate publications on device driver design during his employ at Keithley Metrabyte. Since that time he has been a CTO during the dotcom boom and director of technology at a Web-based startup. He is currently working as a Principal Engineer at Actifio. While focusing mainly on software for his professional career, his passion remains robotics.

6:30 p.m.–8:00 p.m. Wednesday

Joint USENIX ATC '12 and HotStorage '12 Poster Session and Happy Hour

Grand Ballroom

Session Chair: Emil Sit, Hadapt

The joint USENIX ATC '12 and HotStorage '12 poster session will be held in conjunction with a happy hour and will allow researchers to present recent and ongoing projects. Join us for drinks and hors d'oeuvres. The poster session is an excellent forum to discuss new ideas and get useful feedback from the community.

ATC '12 Poster Session 
HotStorage '12 Poster Session

Thursday, June 14, 2012

8:30 a.m.–10:30 a.m. Thursday

Security

Session Chair: Andreas Haeberlen, University of Pennsylvania

Body Armor for Binaries: Preventing Buffer Overflows Without Recompilation

Asia Slowinska, Vrije Universiteit Amsterdam; Traian Stancescu, Google, Inc.; Herbert Bos, Vrije Universiteit Amsterdam

BinArmor is a novel technique to protect existing C binaries from memory corruption attacks on both control data and non-control data. Without access to source code, non-control data attacks cannot be detected with current techniques. Our approach hardens binaries against both kinds of overflow, without requiring the programs’ source or symbol tables. We show that BinArmor is able to stop real attacks—including the recent non-control data attack on Exim. Moreover, we did not incur a single false positive in practice. On the downside, the current overhead of BinArmor is high—although no worse than competing technologies like taint analysis that do not catch attacks on non-control data. Specifically, we measured an overhead of 70% for gzip, 16%-180% for lighttpd, and 190% for the nbench suite.

 

Available Media

Abstractions for Usable Information Flow Control in Aeolus

Winnie Cheng, IBM Research; Dan R.K. Ports and David Schultz, MIT CSAIL; Victoria Popic, Stanford; Aaron Blankstein, Princeton; James Cowling and Dorothy Curtis, MIT CSAIL; Liuba Shrira, Brandeis; Barbara Liskov, MIT CSAIL

Despite the increasing importance of protecting confidential data, building secure software remains as challenging as ever. This paper describes Aeolus, a new platform for building secure distributed applications. Aeolus uses information flow control to provide confidentiality and data integrity. It differs from previous information flow control systems in a way that we believe makes it easier to understand and use. Aeolus uses a new, simpler security model, the first to combine a standard principal-based scheme for authority management with thread-granularity information flow tracking. The principal hierarchy matches the way developers already reason about authority and access control, and the coarse-grained information flow tracking eases the task of defining a program’s security restrictions. In addition, Aeolus provides a number of new mechanisms (authority closures, compound tags, boxes, and shared volatile state) that support common design patterns in secure application design.

 

Available Media

Treehouse: Javascript Sandboxes to Help Web Developers Help Themselves

Lon Ingram, The University of Texas at Austin and Waterfall Mobile; Michael Walfish, The University of Texas at Austin

Many Web applications (meaning sites that employ JavaScript) incorporate third-party code and, for reasons rooted in today’s Web ecosystem, are vulnerable to bugs or malice in that code. Our goal is to give Web developers a mechanism that (a) contains included code, limiting (or eliminating) its influence as appropriate; and (b) is deployable today, or very shortly. While the goal of containment is far from new, the requirement of deployability leads us to a new design point, one that applies the OS ideas of sandboxing and virtualization to the JavaScript context. Our approach, called TreeHouse, sandboxes JavaScript code by repurposing a feature of current browsers (namely Web Workers). TreeHouse virtualizes the browser’s API to the sandboxed code (allowing the code to run with few or no modifications) and gives the application author fine-grained control over that code. Our implementation and evaluation of TreeHouse show that its overhead is modest enough to handle performance-sensitive applications and that sandboxing existing code is not difficult.

 

Available Media

Cloud Terminal: Secure Access to Sensitive Applications from Untrusted Systems

Lorenzo Martignoni, University of California, Berkeley; Pongsin Poosankam, University of California, Berkeley, and Carnegie Mellon University; Matei Zaharia, University of California, Berkeley; Jun Han, Carnegie Mellon University; Stephen McCamant, Dawn Song, and Vern Paxson, University of California, Berkeley; Adrian Perrig, Carnegie Mellon University; Scott Shenker and Ion Stoica, University of California, Berkeley

Current PC- and web-based applications provide insufficient security for the information they access, because vulnerabilities anywhere in a large client software stack can compromise confidentiality and integrity. We propose a new architecture for secure applications, Cloud Terminal, in which the only software running on the end host is a lightweight secure thin terminal, and most application logic is in a remote cloud rendering engine. The secure thin terminal has a very small TCB (23 KLOC) and no dependence on the untrusted OS, so it can be easily checked and remotely attested to. The terminal is also general-purpose: it simply supplies a secure display and input path to remote software. The cloud rendering engine runs an off-the-shelf application in a restricted VM hosted by the provider, but resource sharing between VMs lets one server support hundreds of users. We implement a secure thin terminal that runs on standard PC hardware and provides a responsive interface to applications like banking, email, and document editing. We also show that our cloud rendering engine can provide secure online banking for 5–10 cents per user per month.

 

Available Media
10:30 a.m.–11:00 a.m. Thursday

Break

Constitution Foyer

11:00 a.m.–Noon Thursday

Short Papers: Tools and Networking

Session Chair: Gernot Heiser, NICTA and University of New South Wales

Mosh: An Interactive Remote Shell for Mobile Clients

Keith Winstein and Hari Balakrishnan, M.I.T. Computer Science and Artificial Intelligence Laboratory

Mosh (mobile shell) is a remote terminal application that supports intermittent connectivity, allows roaming, and speculatively and safely echoes user keystrokes for better interactive response over high-latency paths. Mosh is built on the State Synchronization Protocol (SSP), a new UDP-based protocol that securely synchronizes client and server state, even across changes of the client’s IP address. Mosh uses SSP to synchronize a character-cell terminal emulator, maintaining terminal state at both client and server to predictively echo keystrokes. Our evaluation analyzed keystroke traces from six different users covering a period of 40 hours of real-world usage. Mosh was able to immediately display the effects of 70% of the user keystrokes. Over a commercial EV-DO (3G) network, median keystroke response latency with Mosh was less than 5 ms, compared with 503 ms for SSH. Mosh is free software, available from http://mosh.mit.edu. It was downloaded more than 15,000 times in the first week of its release.

 

Available Media

TROPIC: Transactional Resource Orchestration Platform in the Cloud

Changbin Liu, University of Pennsylvania; Yun Mao, Xu Chen, and Mary F. Fernández, AT&T Labs—Research; Boon Thau Loo, University of Pennsylvania; Jacobus E. Van der Merwe, AT&T Labs—Research

Realizing an Infrastructure-as-a-Service (IaaS) cloud requires a control platform to orchestrate cloud resource provisioning, configuration, and decommissioning across a distributed set of diverse physical resources. This orchestration is challenging due to the rapid growth of data centers, the high failure rate of commodity hardware, and the increasing sophistication of cloud services. This paper presents the design and implementation of TROPIC, a highly available, transactional resource orchestration platform for building IaaS cloud infrastructures. TROPIC’s orchestration procedures that manipulate physical resources are transactional, automatically guaranteeing atomicity, consistency, isolation, and durability of cloud operations. Through extensive evaluation of our prototype implementation, we demonstrate that TROPIC can meet production-scale cloud orchestration demands, while maintaining our design goals of safety, robustness, concurrency, and high availability.

 

Available Media

Trickle: Rate Limiting YouTube Video Streaming

Monia Ghobadi, University of Toronto; Yuchung Cheng, Ankur Jain, and Matt Mathis, Google

YouTube traffic is bursty. These bursts trigger packet losses and stress router queues, causing TCP’s congestion-control algorithm to kick in. In this paper, we introduce Trickle, a server-side mechanism that uses TCP to rate limit YouTube video streaming. Trickle paces the video stream by placing an upper bound on TCP’s congestion window as a function of the streaming rate and the round-trip time. We evaluated Trickle on YouTube production data centers in Europe and India and analyzed its impact on losses, bandwidth, RTT, and video buffer under-run events. The results show that Trickle reduces the average TCP loss rate by up to 43% and the average RTT by up to 28% while maintaining the streaming rate requested by the application.
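
The arithmetic behind the clamp is the bandwidth-delay product: a flow streaming at a target rate over a given round-trip time needs about rate x RTT bytes in flight, so capping cwnd near that value suppresses bursts without starving the stream. The sketch below shows only this proportionality; the paper's actual server-side mechanism sets the clamp inside the kernel and includes headroom we omit here.

    #include <stdio.h>

    /* Cap cwnd near the bandwidth-delay product of the stream. */
    static unsigned cwnd_clamp_pkts(double rate_bps, double rtt_s, unsigned mss)
    {
        double bdp_bytes = rate_bps * rtt_s;   /* bandwidth-delay product */
        unsigned pkts = (unsigned)(bdp_bytes / mss + 0.5);
        return pkts ? pkts : 1;                /* never clamp below 1 MSS */
    }

    int main(void)
    {
        /* 500 kB/s video stream, 100 ms RTT, 1460-byte segments:
         * cap cwnd around 34 packets instead of letting bursts grow it. */
        printf("clamp = %u packets\n", cwnd_clamp_pkts(500e3, 0.100, 1460));
        return 0;
    }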

 

Available Media

Tolerating Overload Attacks Against Packet Capturing Systems

Antonis Papadogiannakis, FORTH-ICS; Michalis Polychronakis, Columbia University; Evangelos P. Markatos, FORTH-ICS

Passive network monitoring applications such as intrusion detection systems are susceptible to overloads, which can be induced by traffic spikes or algorithmic singularities triggered by carefully crafted malicious packets. Under overload conditions, the system may consume all the available resources, dropping most of the monitored traffic until the overload condition is resolved. Unfortunately, such an awkward response to overloads may be easily capitalized by attackers who can intentionally overload the system to evade detection.

In this paper we propose Selective Packet Paging (SPP), a two-layer memory management design that gracefully responds to overload conditions by storing selected packets in secondary storage for later processing, while using randomization to avoid predictable evasion by sophisticated attackers. We describe the design and implementation of SPP within the widely used Libpcap packet capture library. Our evaluation shows that the detection accuracy of Snort on top of Libpcap is significantly reduced under algorithmic complexity and traffic overload attacks, while SPP makes it resistant to both algorithmic overloads and traffic bursts.
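
In outline, the two-layer design behaves like the sketch below: the memory queue absorbs normal traffic, overflow is paged to disk rather than dropped, and the packet that gets paged is chosen at random so a crafted overload cannot deterministically evade inspection. This is a hypothetical simplification (a real implementation also needs a disk reader and a stronger random number generator), not the Libpcap code from the paper.

    #include <stddef.h>
    #include <stdlib.h>

    #define MEM_SLOTS 4096

    struct pkt {
        unsigned char data[1514];
        size_t len;
    };

    static struct pkt *memq[MEM_SLOTS];
    static int memq_len;

    /* Stub: a real implementation appends to a disk-backed buffer that a
     * background thread drains once the memory layer has room again. */
    static void page_to_disk(struct pkt *p) { (void)p; }

    void enqueue(struct pkt *p)
    {
        if (memq_len < MEM_SLOTS) {
            memq[memq_len++] = p;        /* fast path: stay in memory */
            return;
        }
        /* Memory layer full: one of the MEM_SLOTS + 1 candidates
         * (residents plus the newcomer) is paged to disk, chosen
         * uniformly at random so an attacker cannot predict which
         * packets are delayed.  Use a proper RNG in real code. */
        int victim = rand() % (MEM_SLOTS + 1);
        if (victim == MEM_SLOTS)
            page_to_disk(p);             /* the newcomer goes to disk */
        else {
            page_to_disk(memq[victim]);
            memq[victim] = p;
        }
    }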

 

Available Media

Enforcing Murphy’s Law for Advance Identification of Run-time Failures

Zach Miller, Todd Tannenbaum, and Ben Liblit, University of Wisconsin—Madison

Applications do not typically view the kernel as a source of bad input. However, the kernel can behave in unusual (yet permissible) ways for which applications are badly unprepared. We present Murphy, a language-agnostic tool that helps developers discover and isolate run-time failures in their programs by simulating difficult-to-reproduce but completely legitimate interactions between the application and the kernel. Murphy makes it easy to enable or disable sets of kernel interactions, called gremlins, so developers can focus on the failure scenarios that are important to them. Gremlins are implemented using the ptrace interface, intercepting and potentially modifying an application’s system call invocations while requiring no invasive changes to the host machine.

We show how to use Murphy in a variety of modes to find different classes of errors, present examples of the kernel interactions that are tested, and explain how to apply delta debugging techniques to isolate the code causing the failure. While our primary goal was the development of a tool to assist in new software development, we successfully demonstrate that Murphy also has the capability to find bugs in hardened, widely-deployed software.
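
As an example of the mechanism (though not of Murphy's actual gremlin code), the following x86-64 Linux tracer halves the byte count of every read() the child issues, at syscall entry, producing the perfectly legal short reads that this style of testing exercises. Error and signal handling are omitted.

    #include <sys/ptrace.h>
    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <sys/user.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc < 2)
            return 1;
        pid_t pid = fork();
        if (pid == 0) {
            ptrace(PTRACE_TRACEME, 0, NULL, NULL);
            execvp(argv[1], argv + 1);   /* run the program under test */
            return 1;
        }

        int status, entering = 1;
        waitpid(pid, &status, 0);        /* initial stop at exec */
        while (1) {
            ptrace(PTRACE_SYSCALL, pid, NULL, NULL);  /* next syscall stop */
            waitpid(pid, &status, 0);
            if (WIFEXITED(status))
                break;

            struct user_regs_struct regs;
            ptrace(PTRACE_GETREGS, pid, NULL, &regs);
            if (regs.orig_rax == SYS_read) {
                /* Entry and exit stops for one syscall are adjacent,
                 * so a toggle distinguishes them. */
                if (entering && regs.rdx > 1) {
                    regs.rdx /= 2;       /* shrink the requested count:
                                            the kernel then performs a
                                            genuine, legal short read */
                    ptrace(PTRACE_SETREGS, pid, NULL, &regs);
                }
                entering = !entering;
            }
        }
        return 0;
    }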

 

Available Media
Noon–1:30 p.m. Thursday

FCW Luncheon

Back Bay CD

1:30 p.m.–3:30 p.m. Thursday

Distributed Systems

Session Chair: Jon Howell, Microsoft Research

A Scalable Server for 3D Metaverses

Ewen Cheslack-Postava, Tahir Azim, Behram F.T. Mistree, and Daniel Reiter Horn, Stanford University; Jeff Terrace, Princeton University; Philip Levis, Stanford University; Michael J. Freedman, Princeton University

Metaverses are three-dimensional virtual worlds where anyone can add and script new objects. Metaverses today, such as Second Life, are dull, lifeless, and stagnant because users can see and interact with only a tiny region around them, rather than a large and immersive world. Current metaverses impose this distance restriction on visibility and interaction in order to scale to large worlds, as the restriction avoids appreciable shared state in underlying distributed systems.

We present the design and implementation of the Sirikata metaverse server. The Sirikata server scales to support large, complex worlds, even as it allows users to see and interact with the entire world. It achieves both goals simultaneously by leveraging properties of the real world and 3D environments in its core systems, such as a novel distributed data structure for virtual object queries based on visible size. We evaluate core services in isolation as well as part of the entire system, demonstrating that these novel designs do not sacrifice performance. Applications developed by Sirikata users support our claim that removing the distance restriction enables new, compelling applications that are infeasible in today’s metaverses.

 

Available Media

Granola: Low-Overhead Distributed Transaction Coordination

James Cowling and Barbara Liskov, MIT CSAIL

This paper presents Granola, a transaction coordination infrastructure for building reliable distributed storage applications. Granola provides a strong consistency model, while significantly reducing transaction coordination overhead. We introduce specific support for a new type of independent distributed transaction, which we can serialize with no locking overhead and no aborts due to write conflicts. Granola uses a novel timestamp-based coordination mechanism to order distributed transactions, offering lower latency and higher throughput than previous systems that offer strong consistency.

Our experiments show that Granola has low overhead, is scalable and has high throughput. We implemented the TPC-C benchmark on Granola, and achieved 3× the throughput of a platform using a locking approach.

 

Available Media

High Performance Vehicular Connectivity with Opportunistic Erasure Coding

Ratul Mahajan, Jitendra Padhye, Sharad Agarwal, and Brian Zill, Microsoft Research

Motivated by poor network connectivity from moving vehicles, we develop a new loss recovery method called opportunistic erasure coding (OEC). Unlike existing erasure coding methods, which are oblivious to the level of spare capacity along a path, OEC transmits coded packets only during instantaneous openings in a path’s spare capacity. This transmission strategy ensures that coded packets provide as much protection as the level of spare capacity allows, without delaying or stealing capacity from data packets. OEC uses a novel encoding that greedily maximizes the amount of new data recovered by the receiver with each coded packet. We design and implement a system called PluriBus that uses OEC in the vehicular context. We deploy it on two buses for two months and show that PluriBus reduces the mean flow completion time by a factor of 4 for a realistic workload. We also show that OEC outperforms existing loss recovery methods in a range of lossy environments.

 

Available Media

Server-assisted Latency Management for Wide-area Distributed Systems

Wonho Kim, Princeton University; KyoungSoo Park, KAIST; Vivek S. Pai, Princeton University

Many Internet services now employ wide-area platforms to improve the end-user experience in the WAN. To maintain close control over their remote nodes, these wide-area systems require low-latency dissemination of new updates for system configurations, customer requirements, and task lists at runtime. However, we observe that existing data transfer systems focus on resource efficiency for open client populations, rather than on completion latency for a known set of nodes. In examining this problem, we find that optimizing for latency produces strategies radically different from existing systems, and can dramatically reduce latency across a wide range of scenarios.

This paper presents Lsync, a latency-sensitive file transfer system that can be used as a synchronization building block for wide-area systems where latency matters. Lsync performs novel node selection, scheduling, and adaptive policy switching that dynamically chooses the best strategy using information available at runtime. Our evaluation results from a PlanetLab deployment show that Lsync outperforms a wide variety of data transfer systems and achieves a significantly higher synchronization ratio even under frequent file updates.

 

Available Media
3:30 p.m.–4:00 p.m. Thursday

Break

Constitution Foyer

4:00 p.m.–5:30 p.m. Thursday

Deduplication

Session Chair: Haibo Chen, Shanghai Jiao Tong University

Generating Realistic Datasets for Deduplication Analysis

Vasily Tarasov and Amar Mudrankit, Stony Brook University; Will Buik, Harvey Mudd College; Philip Shilane, EMC Corporation; Geoff Kuenning, Harvey Mudd College; Erez Zadok, Stony Brook University

Deduplication is a popular component of modern storage systems, with a wide variety of approaches. Unlike traditional storage systems, deduplication performance depends on data content as well as access patterns and meta-data characteristics. Most datasets that have been used to evaluate deduplication systems are either unrepresentative, or unavailable due to privacy issues, preventing easy comparison of competing algorithms. Understanding how both content and meta-data evolve is critical to the realistic evaluation of deduplication systems.

We developed a generic model of file system changes based on properties measured on terabytes of real, diverse storage systems. Our model plugs into a generic framework for emulating file system changes. Building on observations from specific environments, the model can generate an initial file system followed by ongoing modifications that emulate the distribution of duplicates and file sizes, realistic changes to existing files, and file system growth. In our experiments we were able to generate a 4TB dataset within 13 hours on a machine with a single disk drive. The relative error of emulated parameters depends on the model size but remains within 15% of real-world observations.

 

Available Media

An Empirical Study of Memory Sharing in Virtual Machines

Sean Barker, University of Massachusetts Amherst; Timothy Wood, The George Washington University; Prashant Shenoy and Ramesh Sitaraman, University of Massachusetts Amherst

Content-based page sharing is a technique often used in virtualized environments to reduce server memory requirements. Many systems have been proposed to capture the benefits of page sharing. However, there have been few analyses of page sharing in general, both considering its real-world utility and typical sources of sharing potential. We provide insight into this issue through an exploration and analysis of memory traces captured from real user machines and controlled virtual machines. First, we observe that absolute sharing levels (excluding zero pages) generally remain under 15%, contrasting with prior work that has often reported savings of 30% or more. Second, we find that sharing within individual machines often accounts for nearly all (>90%) of the sharing potential within a set of machines, with inter-machine sharing contributing only a small amount. Moreover, even small differences between machines significantly reduce what little inter-machine sharing might otherwise be possible. Third, we find that OS features like address space layout randomization can further diminish sharing potential. These findings both temper expectations of real-world sharing gains and suggest that sharing efforts may be equally effective if employed within the operating system of a single machine, rather than exclusively targeting groups of virtual machines.

 

Available Media

Primary Data Deduplication—Large Scale Study and System Design

Ahmed El-Shimi, Ran Kalach, Ankit Kumar, Adi Oltean, Jin Li, and Sudipta Sengupta, Microsoft Corporation

We present a large scale study of primary data deduplication and use the findings to drive the design of a new primary data deduplication system implemented in the Windows Server 2012 operating system. File data was analyzed from 15 globally distributed file servers hosting data for over 2000 users in a large multinational corporation.

The findings are used to arrive at a chunking and compression approach that maximizes deduplication savings while minimizing the generated metadata and producing a uniform chunk size distribution. Scaling of deduplication processing with data size is achieved using a RAM-frugal chunk hash index and data partitioning, so that memory, CPU, and disk seek resources remain available to fulfill the primary workload of serving IO.
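
For readers unfamiliar with chunking, the sketch below shows the standard content-defined approach that systems in this space build on: a rolling hash over a small sliding window declares a chunk boundary whenever its low bits hit a fixed pattern, so boundaries survive insertions and deletions. The constants and the polynomial hash here are illustrative; the paper's system uses its own chunking algorithm and chunk-size controls.

    #include <stdint.h>
    #include <stddef.h>

    #define WIN   48              /* sliding window in bytes */
    #define MASK  0x1FFF          /* boundary every ~8 KB on average */
    #define PRIME 31

    /* Return the length of the next chunk starting at data[0]. */
    size_t next_chunk(const uint8_t *data, size_t len)
    {
        uint64_t h = 0, pw = 1;
        for (int i = 0; i < WIN - 1; i++)
            pw *= PRIME;          /* PRIME^(WIN-1), to drop old bytes */

        for (size_t i = 0; i < len; i++) {
            if (i >= WIN)
                h -= pw * data[i - WIN];  /* slide: remove oldest byte */
            h = h * PRIME + data[i];      /* slide: add newest byte */
            if (i + 1 >= WIN && (h & MASK) == 0)
                return i + 1;             /* content-defined boundary */
        }
        return len;                       /* tail chunk */
    }

Each chunk produced this way is fingerprinted and looked up in the chunk index; only chunks with unseen fingerprints are stored, which is where the deduplication savings come from.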

We present the architecture of a new primary data deduplication system and evaluate the deduplication performance and chunking aspects of the system.

 

Available Media

Friday, June 15, 2012

8:30 a.m.–10:30 a.m. Friday

Languages and Tools

Session Chair: Angela Demke Brown, University of Toronto

Design and Implementation of an Embedded Python Run-Time System

Thomas W. Barr, Rebecca Smith, and Scott Rixner, Rice University

This paper presents the design and implementation of a complete embedded Python run-time system for the ARM Cortex-M3 microcontroller. The Owl embedded Python run-time system introduces several key innovations, including a toolchain that is capable of producing relocatable memory images that can be utilized directly by the run-time system and a novel foreign function interface that enables the efficient integration of native C code with Python.

The Owl system demonstrates that it is practical to run high-level languages on embedded microcontrollers. Instrumentation within the system has led to an overall system design that enables Python code to be executed with low memory and speed overheads. Furthermore, this paper presents an evaluation of an autonomous RC car that uses a controller written entirely in Python. This demonstrates the ease with which complex embedded software systems can be built using the Owl infrastructure.

 

Available Media

AddressSanitizer: A Fast Address Sanity Checker

Konstantin Serebryany, Derek Bruening, Alexander Potapenko, and Dmitriy Vyukov, Google

Memory access bugs, including buffer overflows and uses of freed heap memory, remain a serious problem for programming languages like C and C++. Many memory error detectors exist, but most of them are either slow or detect a limited set of bugs, or both.

This paper presents AddressSanitizer, a new memory error detector. Our tool finds out-of-bounds accesses to heap, stack, and global objects, as well as use-after-free bugs. It employs a specialized memory allocator and code instrumentation that is simple enough to be implemented in any compiler, binary translation system, or even in hardware.

AddressSanitizer achieves efficiency without sacrificing comprehensiveness. Its average slowdown is just 73% yet it accurately detects bugs at the point of occurrence. It has found over 300 previously unknown bugs in the Chromium browser and many bugs in other software.
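
The instrumentation the paper describes reduces to a few instructions per access; in C it is essentially the check below. The direct-mapped shadow at (addr >> 3) + offset and the shadow-byte encoding follow the paper; the helper names and the extern offset are placeholders.

    #include <stdint.h>

    extern uintptr_t shadow_offset;      /* placeholder; platform-specific */
    extern void report_error(uintptr_t addr, int size);   /* placeholder */

    /* Check an up-to-8-byte access the way the compiler pass does. */
    static inline void check_access(uintptr_t addr, int size)
    {
        int8_t k = *(int8_t *)((addr >> 3) + shadow_offset);
        /* k == 0: all 8 bytes of this granule are addressable.
         * 1 <= k <= 7: only the first k bytes are addressable.
         * k < 0: redzone or freed memory -- any access is an error. */
        if (k != 0 && (int)(addr & 7) + size > k)
            report_error(addr, size);
    }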

 

Available Media

Software Persistent Memory

Jorge Guerra, Leonardo Mármol, Daniel Campello, Carlos Crespo, Raju Rangaswami, and Jinpeng Wei, Florida International University

Persistence of in-memory data is necessary for many classes of application and systems software. We propose Software Persistent Memory (SoftPM), a new memory abstraction which allows malloc style allocations to be selectively made persistent with relative ease. Particularly, SoftPM’s persistent containers implement automatic, orthogonal persistence for all in-memory data reachable from a developer-defined root structure. Writing new applications or adapting existing applications to use SoftPM only requires identifying such root structures within the code. We evaluated the correctness, ease of use, and performance of SoftPM using a suite of microbenchmarks and real world applications including a distributed MPI application, SQLite (an in-memory database), and memcachedb (a distributed memory cache). In all cases, SoftPM was incorporated with minimal developer effort, was able to store and recover data successfully, and provided significant performance speedups (e.g., up to 10X for memcachedb and 83% for SQLite).

 

Available Media

Rivet: Browser-agnostic Remote Debugging for Web Applications

James Mickens, Microsoft Research

Rivet is the first fully-featured, browser-agnostic remote debugger for web applications. Using Rivet, developers can inspect and modify the state of live web pages that are running inside unmodified end-user web browsers. This allows developers to explore real application bugs in the context of the actual machines on which those bugs occur. To make an application Rivet-aware, developers simply add the Rivet JavaScript library to the client-side portion of the application. Later, when a user detects a problem with the application, the user informs Rivet; in turn, Rivet pauses the application and notifies a remote debug server that a debuggable session is available. The server can launch an interactive debugger front-end for a human developer, or use Rivet’s live patching mechanism to automatically install a fix on the client or run diagnostics for offline analysis. Experiments show that Rivet imposes negligible overhead during normal application operation. At debug time, Rivet’s network footprint is small, and Rivet is computationally fast enough to support non-trivial diagnostics and live patches.

 

Available Media
10:30 a.m.–11:00 a.m. Friday

Break

Constitution Foyer

11:00 a.m.–Noon Friday

Short Papers: Performance

Session Chair: Wilson Hsieh, Google Inc.

Wimpy Nodes with 10GbE: Leveraging One-Sided Operations in Soft-RDMA to Boost Memcached

Patrick Stuedi, Animesh Trivedi, and Bernard Metzler, IBM Research, Zurich

Recently, various key/value stores have been proposed targeting clusters built from low-power CPUs. Typically, the nodes in those clusters are connected using 1 Gigabit Ethernet. During the last couple of years, 10 Gigabit Ethernet has become commodity and is increasingly used within data centers providing cloud computing services. The boost in network link speed, however, poses a challenge to the cluster nodes because filling the network link can be a CPU-intensive task. In particular for CPUs running in low-power mode, it is therefore important to spend the CPU cycles used for networking as efficiently as possible. In this paper, we propose a modified Memcached architecture that leverages the one-sided semantics of RDMA. We show how the modified Memcached is more CPU efficient and can serve up to 20% more GET operations than the standard Memcached implementation on low-power CPUs. While RDMA is a networking technology typically associated with specialized hardware, our solution uses soft-RDMA, which runs on standard Ethernet and does not require special hardware.

 

Available Media

Revisiting Software Zero-Copy for Web-caching Applications with Twin Memory Allocation

Xiang Song and Jicheng Shi, Shanghai Jiao Tong University and Fudan University; Haibo Chen, Shanghai Jiao Tong University; Binyu Zang, Shanghai Jiao Tong University and Fudan University

A key concern with zero copy is that the data to be sent out might be mutated by applications. In this paper, focusing specifically on web-caching applications, we observe that in most cases the data to be sent out is not supposed to be mutated by applications, while the metadata around it does get mutated. Based on this observation, we propose a lightweight software zero-copy mechanism that uses a twin memory allocator to allocate space for zero-copy data, and ensures such data is unchanged before being sent out with a lightweight data protection mechanism. The only change required to an application is to allocate zero-copy data through a specific ZCopy memory allocator. To demonstrate the effectiveness of ZCopy, we have designed and implemented a prototype based on Linux and ported two applications with very little effort. Experiments with Memcached and Varnish show that ZCopy can achieve up to 41% performance improvement over vanilla Linux with less CPU consumption.

 

Available Media

Seagull: Intelligent Cloud Bursting for Enterprise Applications

Tian Guo and Upendra Sharma, UMASS Amherst; Timothy Wood, The George Washington University; Sambit Sahu, IBM Watson; Prashant Shenoy, UMASS Amherst

Enterprises with existing IT infrastructure are beginning to employ a hybrid cloud model where the enterprise uses its own private resources for the majority of its computing, but then “bursts” into the cloud when local resources are insufficient. However, current approaches to cloud bursting cannot be effectively automated because they heavily rely on system administrator knowledge to make decisions. In this paper we describe Seagull, a system designed to facilitate cloud bursting by determining which applications can be transitioned into the cloud most economically, and automating the movement process at the proper time. We further optimize the deployment of applications into the cloud using an intelligent precopying mechanism that proactively replicates virtualized applications, lowering the bursting time from hours to minutes. Our evaluation illustrates how our prototype can reduce cloud costs by more than 45% when bursting to the cloud, and the incremental cost added by precopying applications is offset by a burst time reduction of nearly 95%.

 

Available Media

The Forgotten ‘Uncore’: On the Energy-Efficiency of Heterogeneous Cores

Vishal Gupta, Georgia Tech; Paul Brett, David Koufaty, Dheeraj Reddy, and Scott Hahn, Intel Labs; Karsten Schwan, Georgia Tech; Ganapati Srinivasa, Intel Corporation

Heterogeneous multicore processors (HMPs), consisting of cores with different performance/power characteristics, have been proposed to deliver higher energy efficiency than symmetric multicores. This paper investigates the opportunities and limitations in using HMPs to gain energy-efficiency. Unlike previous work focused on server systems, we focus on the client workloads typically seen in modern end-user devices. Further, beyond considering core power usage, we also consider the ‘uncore’ subsystem shared by all cores, which in modern platforms, is an increasingly important contributor to total SoC power. Experimental evaluations use client applications and usage scenarios seen on mobile devices and a unique testbed comprised of heterogeneous cores, with results that highlight the need for uncore-awareness and uncore scalability to maximize intended efficiency gains from heterogeneous cores.

 

Available Media
Noon–1:00 p.m. Friday

FCW Luncheon

Back Bay CD

1:00 p.m.–2:30 p.m. Friday

OS

Session Chair: Emil Sit, Hadapt

Software Techniques for Avoiding Hardware Virtualization Exits

Ole Agesen, Jim Mattson, Radu Rugina, and Jeffrey Sheldon, VMware

On modern processors, hardware-assisted virtualization outperforms binary translation for most workloads. But hardware virtualization has a potential problem: virtualization exits are expensive. While hardware virtualization executes guest instructions at native speed, guest/VMM transitions can sap performance. Hardware designers attacked this problem both by reducing guest/VMM transition costs and by adding architectural extensions such as nested paging support to avoid exits.

This paper proposes complementary software techniques for reducing the exit frequency. In the simplest form, our VMM inspects guest code dynamically to detect back-to-back pairs of instructions that both exit. By handling a pair of instructions when the first one exits, we save 50% of the transition costs. Then, we generalize from pairs to clusters of instructions that may include loops and other control flow. We use a binary translator to generate, and cache, custom translations for handling exits. The analysis cost is paid once, when the translation is generated, but amortized over all future executions.

Our techniques have been fully implemented and validated in recent versions of VMware products. We show that clusters consistently reduce the number of exits for all examined workloads. When execution is dominated by exit costs, this translates into measurable runtime improvements. Most importantly, clusters enable substantial gains for nested virtual machines, delivering speedups as high as 1.52x. Intuitively, this result stems from the fact that transitions between the inner guest and VMM are extremely costly, as they are implemented in software by the outer VMM.

 

Available Media

AppScope: Application Energy Metering Framework for Android Smartphone Using Kernel Activity Monitoring

Chanmin Yoon, Dongwon Kim, Wonwoo Jung, Chulkoo Kang, and Hojung Cha, Yonsei University, Korea

Understanding the energy consumption of a smartphone application is a key area of interest for end users, as well as application and system software developers. Previous work has only been able to provide limited information concerning the energy consumption of individual applications because of limited access to the underlying hardware and system software. The energy consumption of a smartphone application is, therefore, often estimated with low accuracy and granularity. In this paper, we propose AppScope, an Android-based energy metering system. AppScope monitors an application’s hardware usage at the kernel level and accurately estimates its energy consumption. It is implemented as a kernel module and uses an event-driven monitoring method that generates low overhead and provides high accuracy. The evaluation results indicate that AppScope accurately estimates the energy consumption of Android applications, incurring overheads of approximately 35mW in power consumption and 2.1% in CPU utilization.

 

Available Media

Gdev: First-Class GPU Resource Management in the Operating System

Shinpei Kato, Michael McThrow, Carlos Maltzahn, and Scott Brandt, UC Santa Cruz

Graphics processing units (GPUs) have become a very powerful platform embracing a concept of heterogeneous many-core computing. However, application domains of GPUs are currently limited to specific systems, largely due to a lack of “first-class” GPU resource management for general-purpose multi-tasking systems.

We present Gdev, a new ecosystem of GPU resource management in the operating system (OS). It allows the user space as well as the OS itself to use GPUs as first-class computing resources. Specifically, Gdev’s virtual memory manager supports data swapping for excessive memory resource demands, and also provides a shared device memory functionality that allows GPU contexts to communicate with other contexts. Gdev further provides a GPU scheduling scheme to virtualize a physical GPU into multiple logical GPUs, enhancing isolation among working sets of multi-tasking systems.

Our evaluation, conducted on Linux with an NVIDIA GPU, shows that the basic performance of our prototype implementation is reliable even when compared to proprietary software. Further detailed experiments demonstrate that Gdev achieves a 2x speedup for an encrypted file system using the GPU in the OS. Gdev can also improve the makespan of dataflow programs by up to 49% by exploiting shared device memory, while limiting the error in the utilization of virtualized GPUs to within 7%.

 

Available Media
2:30 p.m.–3:00 p.m. Friday

Break

Constitution Foyer

3:00 p.m.–5:00 p.m. Friday

Replication

Session Chair: Andrew Birrell, Microsoft Research

Gnothi: Separating Data and Metadata for Efficient and Available Storage Replication

Yang Wang, Lorenzo Alvisi, and Mike Dahlin, The University of Texas at Austin

This paper describes Gnothi, a block replication system that separates data from metadata to provide efficient and available storage replication. Separating data from metadata allows Gnothi to execute disk accesses on subsets of replicas while using fully replicated metadata to ensure that requests are executed correctly and to speed up recovery of slow or failed replicas.

Performance evaluation shows that Gnothi can achieve 40-64% higher write throughput than previous work and significantly save storage space. Furthermore, while a failed replica recovers, Gnothi can provide about 100-200% higher throughput, retaining the same recovery time and guaranteeing that recovery eventually completes.

 

Available Media

Dynamic Reconfiguration of Primary/Backup Clusters

Alexander Shraer and Benjamin Reed, Yahoo! Research; Dahlia Malkhi, Microsoft Research; Flavio Junqueira, Yahoo! Research

Dynamically changing (reconfiguring) the membership of a replicated distributed system while preserving data consistency and system availability is a challenging problem. In this paper, we show that reconfiguration can be simplified by taking advantage of certain properties commonly provided by Primary/Backup systems. We describe a new reconfiguration protocol, recently implemented in Apache ZooKeeper. It fully automates configuration changes and minimizes any interruption in service to clients while maintaining data consistency. By leveraging the properties already provided by ZooKeeper, our protocol is considerably simpler than the state of the art.

 

Available Media

Surviving Congestion in Geo-Distributed Storage Systems

Brian Cho, University of Illinois at Urbana-Champaign; Marcos K. Aguilera, Microsoft Research Silicon Valley

We present Vivace, a key-value storage system for web applications that span many geographically-distributed sites. Vivace provides strong consistency and replicates data across sites for access locality and disaster tolerance. Vivace is designed to cope well with network congestion across sites, which occurs because the bandwidth across sites is smaller than within sites. To deal with congestion, Vivace relies on two novel algorithms that prioritize a small amount of critical data to avoid delays due to congestion. We evaluate Vivace to show its feasibility and effectiveness.

 

Available Media

Practical Hardening of Crash-Tolerant Systems

Miguel Correia, IST-UTL/INESC-ID; Daniel Gómez Ferro, Flavio P. Junqueira, and Marco Serafini, Yahoo! Research

Recent failures of production systems have highlighted the importance of tolerating faults beyond crashes. The industry has so far addressed this problem by hardening crash-tolerant systems with ad hoc error detection checks, potentially overlooking critical fault scenarios. We propose a generic and principled hardening technique for Arbitrary State Corruption (ASC) faults, which specifically model the effects of realistic data corruptions on distributed processes. Hardening does not require the use of trusted components or the replication of the process over multiple physical servers. We implemented a wrapper library to transparently harden distributed processes. To exercise our library and evaluate our technique, we obtained ASC-tolerant versions of Paxos, of a subset of the ZooKeeper API, and of an eventually consistent storage by implementing crash-tolerant protocols and automatically hardening them using our library. Our evaluation shows that the throughput of our ASC-hardened state machine replication outperforms its Byzantine-tolerant counterpart by up to 70%.

 

Available Media