Technical Sessions

All sessions will be held in the Imperial Ballroom unless otherwise noted.

Conference full papers and full proceedings are available to conference registrants immediately and to everyone beginning Wednesday, June 26, 2013. Everyone can view the abstracts and the proceedings front matter immediately.

Proceedings Front Matter: 
Proceedings Covers | Title Page and List of Organizers | Table of Contents | Message from the Program Co-Chairs

 

Wednesday, June 26, 2013

8:15 a.m.–9:00 a.m. Wednesday

Continental Breakfast

Market Street Foyer

9:00 a.m.–9:15 a.m. Wednesday

Opening Remarks

Welcome to USENIX ATC '13

Program Co-Chairs: Andrew Birrell, Microsoft Research Silicon Valley, and Emin Gün Sirer, Cornell University

9:15 a.m.–10:45 a.m. Wednesday

Virtual Machine Implementation

Session Chair: Emin Gün Sirer, Cornell University

Optimizing VM Checkpointing for Restore Performance in VMware ESXi

Irene Zhang, University of Washington and VMware; Tyler Denniston, MIT CSAIL and VMware; Yury Baskakov, VMware; Alex Garthwaite, CloudPhysics and VMware

Cloud providers are increasingly looking to use virtual machine checkpointing for new applications beyond fault tolerance. Existing checkpointing systems designed for fault tolerance only optimize for saving checkpointed state, so they cannot support these new applications, which require better restore performance. Improving restore performance requires a predictive technique to reduce the number of disk accesses to bring in the VM’s memory on restore. However, complex VM workloads can diverge at any time due to external inputs, background processes, and timing variation, so predicting which pages the VM will access on restore to reduce faults to disk is impossible. Instead, we focus on predicting which pages the VM will access together on restore to improve the efficiency of disk accesses.

To reduce the number of faults to disk on restore, we group memory pages likely to be accessed together into locality blocks. On each fault, we can load a block of pages that are likely to be accessed with the faulting page, eliminating future faults and increasing disk efficiency. We implement support for locality blocks, along with several other optimizations, in a new checkpointing system for VMware ESXi Server called Halite. Our experiments show that Halite reduces restore overhead by up to 94% for a range of workloads.
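
The locality-block mechanism can be sketched in a few lines. This is an illustrative model only (the block size, trace, and function names are invented, not Halite's implementation): pages are grouped by first-touch order, and a fault on any page loads its whole block.

```python
BLOCK_SIZE = 4  # pages per locality block (hypothetical)

def build_locality_blocks(access_trace):
    """Assign each page to a locality block, grouping pages by first-touch order."""
    return {page: i // BLOCK_SIZE
            for i, page in enumerate(dict.fromkeys(access_trace))}

def count_restore_faults(access_trace, block_of):
    """A fault on any page loads its entire block, so later accesses to
    pages in the same block find them already resident."""
    resident, faults = set(), 0
    for page in access_trace:
        if page not in resident:
            faults += 1
            blk = block_of[page]
            resident.update(p for p, b in block_of.items() if b == blk)
    return faults

trace = [0, 1, 2, 3, 8, 9, 10, 11, 0, 1]
blocks = build_locality_blocks(trace)
print(count_restore_faults(trace, blocks))  # 2 block loads instead of 8 page faults
```

In this toy trace, grouping cuts disk accesses from eight single-page reads to two block reads, which is the efficiency effect the abstract describes.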

Hyper-Switch: A Scalable Software Virtual Switching Architecture

Kaushik Kumar Ram, Alan L. Cox, Mehul Chadha, and Scott Rixner, Rice University

In virtualized datacenters, the last hop switching happens inside a server. As the number of virtual machines hosted on the server goes up, the last hop switch can be a performance bottleneck. This paper presents the Hyper-Switch, a highly efficient and scalable software-based network switch for virtualization platforms that support driver domains. It combines the best of the existing I/O virtualization architectures by hosting device drivers in a driver domain to isolate faults and placing the packet switch in the hypervisor for efficiency. In addition, this paper presents several optimizations that enhance performance. They include virtual machine (VM) state-aware batching of packets to mitigate the costs of hypervisor entries and guest notifications, preemptive copying and immediate notification of blocked VMs to reduce packet arrival latency, and dynamic offloading of packet processing to idle processor cores whenever possible. These optimizations enable efficient packet processing, better utilization of the available CPU resources, and higher concurrency.

We implemented a Hyper-Switch prototype in the Xen virtualization platform. This prototype’s performance was then compared to Xen’s default network I/O architecture and KVM’s vhost-net architecture. The Hyper-Switch prototype performed better than both, especially for inter-VM network communication. For instance, in one scalability experiment measuring aggregate inter-VM network throughput, the Hyper-Switch achieved a peak of 81 Gbps as compared to only 31 Gbps under Xen and 47 Gbps under KVM.

MiG: Efficient Migration of Desktop VMs Using Semantic Compression

Anshul Rai and Ramachandran Ramjee, Microsoft Research India; Ashok Anand, Bell Labs India; Venkata N. Padmanabhan, Microsoft Research India; George Varghese, Microsoft Research US

We consider the problem of efficiently migrating desktop virtual machines. The key challenge is to migrate the desktop VM quickly and in a bandwidth-efficient manner. The idea of replaying computation to reconstruct state seems appealing. However, our detailed analysis shows that the match between the source memory and the memory reconstructed via replay at the destination is poor, even at the sub-page level; the ability to reconstruct memory state is stymied because modern OSes use address space layout randomization (ASLR) to improve security, and page prefetching to improve performance.

Despite these challenges, we show that desktop VM memory state can be efficiently compressed for transfer without relying on replay, using a suite of semantic techniques—collectively dubbed MiG—that are tailored to the type of each memory page. Our evaluation on Windows and Linux desktop VMs shows that MiG is able to compress the VM state effectively, requiring on average 51–65% fewer bytes to be transferred during migration compared to standard compression, and halving the migration time in a typical setting.
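
The idea of tailoring compression to the type of each page can be illustrated with a small dispatch sketch. The page classes and per-type techniques below are invented for illustration; they are not MiG's actual classifiers:

```python
import zlib

def compress_page(page_type, data):
    """Pick a compression technique based on the page's (assumed) type."""
    if page_type == "zero":
        return b""                      # zero pages need no bytes at all
    if page_type == "code":
        # code pages often match files already on the destination disk:
        # send a short reference instead of the contents (hypothetical scheme)
        return b"ref:" + data[:8]
    return zlib.compress(data)          # generic pages: standard compression

pages = [("zero", b"\x00" * 4096),
         ("code", b"\x55\x48\x89\xe5" + b"\x90" * 4092),
         ("heap", b"ab" * 2048)]
sizes = [len(compress_page(t, d)) for t, d in pages]
print(sizes)  # far fewer bytes than 3 x 4096
```

The benefit over one-size-fits-all compression comes from the first two cases, where knowing what a page *is* allows sending almost nothing.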

10:45 a.m.–11:15 a.m. Wednesday

Break with Refreshments

Market Street Foyer

11:15 a.m.–12:30 p.m. Wednesday

Computing in the Cloud

Session Chair: Steve Ko, University at Buffalo

Copysets: Reducing the Frequency of Data Loss in Cloud Storage

Asaf Cidon, Stephen Rumble, Ryan Stutsman, Sachin Katti, John Ousterhout, and Mendel Rosenblum, Stanford University
Awarded Best Student Paper!  

Random replication is widely used in data center storage systems to prevent data loss. However, random replication is almost guaranteed to lose data in the common scenario of simultaneous node failures due to cluster-wide power outages. Due to the high fixed cost of each incident of data loss, many data center operators prefer to minimize the frequency of such events at the expense of losing more data in each event.

We present Copyset Replication, a novel general-purpose replication technique that significantly reduces the frequency of data loss events. We implemented and evaluated Copyset Replication on two open source data center storage systems, HDFS and RAMCloud, and show it incurs a low overhead on all operations. Such systems require that each node’s data be scattered across several nodes for parallel data recovery and access. Copyset Replication presents a near optimal tradeoff between the number of nodes on which the data is scattered and the probability of data loss. For example, in a 5000-node RAMCloud cluster under a power outage, Copyset Replication reduces the probability of data loss from 99.99% to 0.15%. For Facebook’s HDFS cluster, it reduces the probability from 22.8% to 0.78%.
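
The core construction can be sketched briefly. This is an illustrative toy (cluster size, replication factor, and permutation count are made up, and the paper's scatter-width machinery is omitted): a few random permutations of the nodes are split into groups of R, and a chunk's replicas would then be placed only on one such group, so simultaneous failure of R nodes loses data only if those nodes form a copyset.

```python
import itertools
import random

N, R = 9, 3      # nodes and replication factor (hypothetical cluster)
P = 2            # number of random permutations to split into copysets

def build_copysets(n, r, permutations, seed=0):
    """Split each random permutation of the nodes into groups of r nodes."""
    rng = random.Random(seed)
    copysets = []
    for _ in range(permutations):
        nodes = list(range(n))
        rng.shuffle(nodes)
        copysets += [tuple(sorted(nodes[i:i + r])) for i in range(0, n, r)]
    return copysets

copysets = build_copysets(N, R, P)
all_triples = len(list(itertools.combinations(range(N), R)))  # 84
print(len(set(copysets)), "copysets instead of", all_triples, "possible replica triples")
```

Random replication effectively uses all 84 triples, so almost any 3-node failure loses some chunk; restricting placement to a handful of copysets makes that event rare, at the cost of more data lost when it does happen.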

TAO: Facebook’s Distributed Data Store for the Social Graph

Nathan Bronson, Zach Amsden, George Cabrera, Prasad Chakka, Peter Dimov, Hui Ding, Jack Ferris, Anthony Giardullo, Sachin Kulkarni, Harry Li, Mark Marchukov, Dmitri Petrov, Lovro Puzar, Yee Jiun Song, and Venkat Venkataramani, Facebook, Inc.

We introduce a simple data model and API tailored for serving the social graph, and TAO, an implementation of this model. TAO is a geographically distributed data store that provides efficient and timely access to the social graph for Facebook’s demanding workload using a fixed set of queries. It is deployed at Facebook, replacing memcache for many data types that fit its model. The system runs on thousands of machines, is widely distributed, and provides access to many petabytes of data. TAO can process a billion reads and millions of writes each second.
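
A toy in-memory version of the objects-and-associations model can make the data model concrete. Method names loosely follow the paper's description of the API; this sketch is in no way Facebook's implementation:

```python
import time
from collections import defaultdict

class TinyTAO:
    """Objects plus typed, time-ordered associations between them."""
    def __init__(self):
        self.objects = {}                # id -> (otype, data)
        self.assocs = defaultdict(list)  # (id1, atype) -> [(time, id2)], newest first

    def obj_add(self, oid, otype, data):
        self.objects[oid] = (otype, data)

    def assoc_add(self, id1, atype, id2, ts=None):
        lst = self.assocs[(id1, atype)]
        lst.append((ts if ts is not None else time.time(), id2))
        lst.sort(reverse=True)           # keep newest associations first

    def assoc_range(self, id1, atype, pos, limit):
        """Point query returning the newest associations first."""
        return [id2 for _, id2 in self.assocs[(id1, atype)][pos:pos + limit]]

g = TinyTAO()
g.obj_add(1, "user", {"name": "alice"})
g.obj_add(2, "user", {"name": "bob"})
g.assoc_add(1, "friend", 2, ts=100)
g.assoc_add(1, "friend", 3, ts=200)
print(g.assoc_range(1, "friend", 0, 10))  # newest first: [3, 2]
```

The fixed, small set of queries (the abstract's key point) is what lets the real system cache and replicate the graph aggressively.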

PIKACHU: How to Rebalance Load in Optimizing MapReduce On Heterogeneous Clusters

Rohan Gandhi, Di Xie, and Y. Charlie Hu, Purdue University

For power, cost, and pricing reasons, datacenters are evolving towards heterogeneous hardware. However, MapReduce implementations, which power a representative class of datacenter applications, were originally designed for homogeneous clusters and perform poorly on heterogeneous clusters. The natural solution, rebalancing load among the reducers running on heterogeneous nodes, has been explored in Tarazu, but was shown to be only mildly effective.

In this paper, we revisit the key design challenge in this important optimization for MapReduce on heterogeneous clusters and make three contributions. (1) We show that Tarazu estimates the target load distribution too early in MapReduce job execution, which leaves the rebalanced load far from optimal. (2) We articulate the delicate tradeoff between estimation accuracy and the work wasted by delayed load adjustment, and propose a load rebalancing scheme that strikes a balance between the two. (3) We implement our design in the PIKACHU task scheduler, which outperforms Hadoop by up to 42% and Tarazu by up to 23%.

12:30 p.m.–2:00 p.m. Wednesday

FCW Luncheon

Market Street Foyer

2:00 p.m.–3:30 p.m. Wednesday

Flash-based Storage

Session Chair: Hermann Härtig, Technische Universität Dresden

FlashFQ: A Fair Queueing I/O Scheduler for Flash-Based SSDs

Kai Shen and Stan Park, University of Rochester

On Flash-based solid-state disks (SSDs), different I/O operations (reads vs. writes, operations of different sizes) incur substantially different resource usage. This presents challenges for fair resource management in multi-programmed computer systems and multi-tenant cloud systems. Existing timeslice-based I/O schedulers achieve fairness at the cost of poor responsiveness, particularly when a large number of tasks compete for I/O simultaneously. At the same time, the diminished benefits of I/O spatial proximity on SSDs motivate fine-grained fair queueing approaches that do not enforce task-specific timeslices. This paper develops a new Flash I/O scheduler called FlashFQ. It enhances the start-time fair queueing schedulers with throttled dispatch to exploit restricted Flash I/O parallelism without losing fairness. It also employs I/O anticipation to minimize fairness violation due to deceptive idleness. We implemented FlashFQ in Linux and compared it with several existing I/O schedulers—Linux CFQ, an Argon-inspired quanta scheduler, FIOS timeslice scheduler, FIOS with short timeslices, and 4-Tag SFQ(D). Results on synthetic I/O benchmarks, the Apache web server, and Kyoto Cabinet key-value store demonstrate that only FlashFQ can achieve both fairness and high responsiveness on Flash-based SSDs.
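
The start-time fair queueing idea that FlashFQ builds on can be sketched compactly. This is generic SFQ, not the paper's code; FlashFQ's throttled dispatch and I/O anticipation are omitted:

```python
import heapq

class SFQ:
    """Start-time fair queueing: tag each request with a virtual start time
    and always dispatch the request with the smallest tag."""
    def __init__(self):
        self.vtime = 0.0
        self.last_finish = {}  # task -> finish tag of its last request
        self.queue = []        # (start_tag, seq, task, cost)
        self.seq = 0

    def submit(self, task, cost, weight=1.0):
        start = max(self.vtime, self.last_finish.get(task, 0.0))
        self.last_finish[task] = start + cost / weight  # finish tag
        heapq.heappush(self.queue, (start, self.seq, task, cost))
        self.seq += 1

    def dispatch(self):
        start, _, task, cost = heapq.heappop(self.queue)
        self.vtime = start     # virtual time follows the dispatched start tag
        return task, cost

q = SFQ()
for _ in range(3):
    q.submit("heavy", cost=4.0)  # expensive requests (e.g., large writes)
    q.submit("light", cost=1.0)  # cheap requests (e.g., small reads)
order = [q.dispatch()[0] for _ in range(6)]
print(order)  # light's cheap requests are not starved behind heavy's
```

Because "heavy" accumulates virtual service four times faster, its later requests carry larger start tags, so the cheap requests are interleaved promptly instead of waiting out a timeslice.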

The Harey Tortoise: Managing Heterogeneous Write Performance in SSDs

Laura M. Grupp, University of California, San Diego; John D. Davis, Microsoft Research; Steven Swanson, University of California, San Diego

Recent years have witnessed significant gains in the adoption of flash technology due to increases in bit density, enabling higher capacities and lower prices. Unfortunately, these improvements come at a significant cost to performance with trends pointing toward worst-case flash program latencies on par with disk writes.

We extend a conventional flash translation layer to schedule flash program operations to flash pages based on the operations’ performance needs and the pages’ performance characteristics. We then develop policies to improve performance in two scenarios: First, we improve peak performance for latency-critical operations of short bursts of intensive activity by 36%. Second, we realize steady-state bandwidth improvements of up to 95% by rate-matching garbage collection performance and external access performance.

Janus: Optimal Flash Provisioning for Cloud Storage Workloads

Christoph Albrecht, Arif Merchant, Murray Stokely, Muhammad Waliji, François Labelle, Nate Coehlo, Xudong Shi, and C. Eric Schrock, Google, Inc.

Janus is a system for partitioning the flash storage tier between workloads in a cloud-scale distributed file system with two tiers, flash storage and disk. The file system stores newly created files in the flash tier and moves them to the disk tier using either a First-In-First-Out (FIFO) policy or a Least-Recently-Used (LRU) policy, subject to per-workload allocations. Janus constructs compact metrics of the cacheability of the different workloads, using sampled distributed traces because of the large scale of the system. From these metrics, we formulate and solve an optimization problem to determine the flash allocation to workloads that maximizes the total reads sent to the flash tier, subject to operator-set priorities and bounds on flash write rates. Using measurements from production workloads in multiple data centers using these recommendations, as well as traces of other production workloads, we show that the resulting allocation improves the flash hit rate by 47–76% compared to a unified tier shared by all workloads. Based on these results and an analysis of several thousand production workloads, we conclude that flash storage is a cost-effective complement to disks in data centers.
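
The optimization step has a simple greedy reading when each workload's reads-versus-flash curve is concave: repeatedly give the next unit of flash to the workload with the largest marginal gain in flash reads. The curves and numbers below are invented for illustration and omit the paper's priorities and write-rate bounds:

```python
def allocate(curves, total_units):
    """curves[w]: flash units -> reads/s served from flash (assumed concave).
    Greedily assign one unit at a time to the best marginal workload."""
    alloc = {w: 0 for w in curves}
    for _ in range(total_units):
        best = max(curves,
                   key=lambda w: curves[w](alloc[w] + 1) - curves[w](alloc[w]))
        alloc[best] += 1
    return alloc

curves = {
    "logs":     lambda u: 10 * u,                # streaming: flat, low benefit
    "database": lambda u: 100 * (1 - 0.5 ** u),  # hot working set: steep, then flat
}
print(allocate(curves, 4))  # {'logs': 1, 'database': 3}
```

For concave curves this greedy procedure reaches the allocation that maximizes total flash reads, which is the shape of the problem the abstract formulates.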

3:30 p.m.–4:00 p.m. Wednesday

Break with Refreshments

Market Street Foyer

4:00 p.m.–5:45 p.m. Wednesday

Miscellanea #1

Session Chair: Rama Ramasubramanian, Microsoft Research Silicon Valley

Using One-Sided RDMA Reads to Build a Fast, CPU-Efficient Key-Value Store

Christopher Mitchell, New York University; Yifeng Geng, Tsinghua University; Jinyang Li, New York University

Recent technological trends indicate that future datacenter networks will incorporate High Performance Computing network features, such as ultra-low latency and CPU bypassing. How can these features be exploited in datacenter-scale systems infrastructure? In this paper, we explore the design of a distributed in-memory key-value store called Pilaf that takes advantage of Remote Direct Memory Access to achieve high performance with low CPU overhead.

In Pilaf, clients directly read from the server’s memory via RDMA to perform gets, which commonly dominate key-value store workloads. By contrast, put operations are serviced by the server to simplify the task of synchronizing memory accesses. To detect inconsistent RDMA reads with concurrent CPU memory modifications, we introduce the notion of self-verifying data structures that can detect read-write races without client-server coordination. Our experiments show that Pilaf achieves low latency and high throughput while consuming few CPU resources. Specifically, Pilaf can surpass 1.3 million ops/sec (90% gets) using a single CPU core compared with 55K for Memcached and 59K for Redis.
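
The self-verifying idea can be sketched with a checksum-carrying entry. This is an illustrative model, not Pilaf's actual memory layout: a client that reads raw bytes (as a one-sided RDMA read would) can detect a torn read that raced with a server-side write and simply retry.

```python
import zlib

def make_entry(key, value):
    """Store the value together with a CRC computed over the payload."""
    payload = f"{key}={value}".encode()
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def read_entry(raw):
    """Verify the trailing CRC; a mismatch means the read raced with a write."""
    payload, crc = raw[:-4], raw[-4:]
    if zlib.crc32(payload) != int.from_bytes(crc, "big"):
        return None  # torn read detected: the client retries the read
    return payload.decode()

entry = make_entry("user:7", "alice")
print(read_entry(entry))  # user:7=alice

# Simulate a read that raced with an in-place update to "bobby":
torn = entry[:8] + make_entry("user:7", "bobby")[8:]
print(read_entry(torn))   # None
```

The race is detected without any client-server coordination, which is what makes server-bypassing gets safe.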

Lightweight Memory Tracing

Mathias Payer, Enrico Kravina, and Thomas R. Gross, ETH Zurich

Memory tracing (executing additional code for every memory access of a program) is a powerful technique with many applications, e.g., debugging, taint checking, or tracking dataflow. Current approaches are limited: software-only memory tracing incurs high performance overhead (e.g., up to 10x for Libdft) because every single memory access of the application is checked by additional code that is not part of the original application, while hardware-based approaches are limited to a small set of watched locations.

This paper introduces memTrace, a lightweight memory tracing technique that builds on dynamic on-the-fly cross-ISA binary translation of 32-bit code to 64-bit code. Our software-only approach enables memory tracing for unmodified, binary-only x86 applications using the x64 extension that is available in current CPUs; no OS extensions or special hardware is required. The additional registers in x64 and the wider memory addressing enable a low-overhead tracing infrastructure that is protected from the application code (i.e., uses disjoint registers and memory regions). MemTrace handles multi-threaded applications. Two case studies discuss a framework for unlimited read and write watchpoints and an allocation-based memory checker similar in functionality to memgrind.

The performance evaluation of memTrace shows that the time overhead is between 1.3x and 3.1x for the SPEC CPU2006 benchmarks, with a geometric mean of 1.97x.

Flash Caching on the Storage Client

David A. Holland, Elaine Angelino, Gideon Wald, and Margo I. Seltzer, Harvard University

Flash memory has recently become popular as a caching medium. Most uses to date are on the storage server side. We investigate a different structure: flash as a cache on the client side of a networked storage environment. We use trace-driven simulation to explore the design space. We consider a wide range of configurations and policies to determine the potential benefit client-side caches might offer and how best to arrange them.

Our results show that the flash cache writeback policy does not significantly affect performance. Write-through is sufficient; this greatly simplifies cache consistency handling. We also find that the chief benefit of the flash cache is its size, not its persistence. Cache persistence offers additional performance benefits at system restart at essentially no runtime cost. Finally, for some workloads a large flash cache allows using minuscule amounts of RAM for file caching (e.g., 256 KB), leaving more memory available for application use.

Practical and Effective Sandboxing for Non-root Users

Taesoo Kim and Nickolai Zeldovich, MIT CSAIL

MBOX is a lightweight sandboxing mechanism for non-root users in commodity OSes. MBOX’s sandbox usage model executes a program in the sandbox and prevents the program from modifying the host filesystem by layering the sandbox filesystem on top of the host filesystem. At the end of program execution, the user can examine changes in the sandbox filesystem and selectively commit them back to the host filesystem. MBOX implements this by interposing on system calls and provides a variety of useful applications: installing system packages as a non-root user, running unknown binaries safely without network accesses, checkpointing the host filesystem instantly, and setting up a virtual development environment without special tools. Our performance evaluation shows that MBOX imposes CPU overheads of 0.1–45.2% for various workloads. In this paper, we present MBOX’s design, efficient techniques for interposing on system calls, our experience avoiding common system call interposition pitfalls, and MBOX’s performance evaluation.
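
The layered-filesystem model can be illustrated with a toy in which dicts stand in for the host and sandbox filesystems; the real system achieves this by interposing on system calls, which this sketch does not attempt:

```python
class Sandbox:
    """Writes land in a sandbox layer; reads fall through to the host;
    the user later commits selected changes back to the host."""
    def __init__(self, host):
        self.host = host   # host filesystem (never written directly)
        self.layer = {}    # sandbox layer holding all modifications

    def read(self, path):
        return self.layer.get(path, self.host.get(path))

    def write(self, path, data):
        self.layer[path] = data          # modification stays in the sandbox

    def changes(self):
        return {p: d for p, d in self.layer.items() if self.host.get(p) != d}

    def commit(self, paths):
        for p in paths:                  # selectively commit to the host
            self.host[p] = self.layer[p]

host = {"/etc/motd": "hello"}
sb = Sandbox(host)
sb.write("/etc/motd", "pwned?")
sb.write("/tmp/out", "result")
print(sb.changes())            # both edits visible inside the sandbox
sb.commit(["/tmp/out"])        # keep the output, drop the motd edit
print(host)
```

Untrusted modifications stay invisible to the host until the user explicitly approves them, which is the usage model the abstract describes.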

 

Thursday, June 27, 2013

8:30 a.m.–9:00 a.m. Thursday

Continental Breakfast

Market Street Foyer

9:00 a.m.–10:45 a.m. Thursday

Data Storage

Session Chair: Dave Presotto, Google

TABLEFS: Enhancing Metadata Efficiency in the Local File System

Kai Ren and Garth Gibson, Carnegie Mellon University

File systems that manage magnetic disks have long recognized the importance of sequential allocation and large transfer sizes for file data. Fast random access has dominated metadata lookup data structures, with increasing use of B-trees on-disk. Yet our experiments with workloads dominated by metadata and small file access indicate that even sophisticated local disk file systems like Ext4, XFS, and Btrfs leave a lot of opportunity for performance improvement in such workloads.

In this paper we present a stacked file system, TABLEFS, which uses another local file system as an object store. TABLEFS organizes all metadata into a single sparse table backed on disk using a Log-Structured Merge (LSM) tree, LevelDB in our experiments. By stacking, TABLEFS asks only for efficient large file allocation and access from the underlying local file system. By using an LSM tree, TABLEFS ensures metadata is written to disk in large, non-overwrite, sorted and indexed logs. Even an inefficient FUSE based user level implementation of TABLEFS can perform comparably to Ext4, XFS and Btrfs on data-intensive benchmarks, and can outperform them by 50% to as much as 1000% for metadata-intensive workloads. Such promising performance results from TABLEFS suggest that local disk file systems can be significantly improved by more aggressive aggregation and batching of metadata updates.
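
The payoff of keeping metadata in one sorted table can be sketched with a simplified key schema: rows keyed by (parent inode, name), so listing a directory becomes one contiguous range scan. A sorted Python list stands in for the LSM tree (LevelDB in the paper), and the schema details are illustrative:

```python
from bisect import bisect_left, bisect_right

# metadata rows keyed by (parent inode number, entry name)
table = sorted([
    ((1, "etc"),  {"ino": 2, "type": "dir"}),
    ((1, "usr"),  {"ino": 3, "type": "dir"}),
    ((2, "host"), {"ino": 4, "type": "file", "size": 10}),
    ((2, "motd"), {"ino": 5, "type": "file", "size": 6}),
])
keys = [k for k, _ in table]

def readdir(parent_ino):
    """List a directory with a single contiguous scan over the sorted keys."""
    lo = bisect_left(keys, (parent_ino, ""))
    hi = bisect_right(keys, (parent_ino, "\xff" * 255))  # sketch upper bound
    return [keys[i][1] for i in range(lo, hi)]

print(readdir(1))  # ['etc', 'usr']
print(readdir(2))  # ['host', 'motd']
```

Because siblings sort adjacently, directory scans and lookups touch one compact key range, and the LSM tree turns the resulting metadata updates into large sequential writes.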

Characterization of Incremental Data Changes for Efficient Data Protection

Hyong Shim, Philip Shilane, and Windsor Hsu, EMC Corporation

Protecting data on primary storage often requires creating secondary copies by periodically replicating the data to external target systems. We analyze over 100,000 traces from 125 customer block-based primary storage systems to gain a high-level understanding of I/O characteristics and then perform an in-depth analysis of over 500 traces from 13 systems that span at least 24 hours. Our analysis has the twin goals of minimizing overheads on primary systems and improving data replication efficiency. We compare our results with a study a decade ago and provide fresh insights into patterns of incremental changes on primary systems over time.

Primary storage systems often create snapshots as point-in-time copies in order to support host I/O while replicating changed data to target systems. However, creating standard snapshots on a primary storage system incurs overheads in terms of capacity and I/O, and we present a new snapshot technique called a replication snapshot that reduces these overheads. Replicated data also requires capacity and I/O on the target system, and we investigate techniques to significantly reduce these overheads. We also find that highly sequential or random I/O patterns have different incremental change characteristics. Where applicable, we present our findings as advice to storage engineers and administrators.

On the Efficiency of Durable State Machine Replication

Alysson Bessani, Marcel Santos, João Felix, and Nuno Neves, FCUL/LaSIGE, University of Lisbon; Miguel Correia, INESC-ID, IST, University of Lisbon

State Machine Replication (SMR) is a fundamental technique for ensuring the dependability of critical services in modern internet-scale infrastructures. SMR alone does not protect from full crashes, and thus in practice it is employed together with secondary storage to ensure the durability of the data managed by these services. In this work we show that the classical durability enforcing mechanisms—logging, checkpointing, state transfer—can have a high impact on the performance of SMR-based services even if SSDs are used instead of disks. To alleviate this impact, we propose three techniques that can be used in a transparent manner, i.e., without modifying the SMR programming model or requiring extra resources: parallel logging, sequential checkpointing, and collaborative state transfer. We show the benefits of these techniques experimentally by implementing them in an open-source replication library, and evaluating them in the context of a consistent key-value store and a coordination service.

Estimating Duplication by Content-based Sampling

Fei Xie, Michael Condict, and Sandip Shete, NetApp Inc.

We define a new technique for accurately estimating the amount of duplication in a storage volume from a small sample and we analyze its performance and accuracy. The estimate is useful for determining whether it is worthwhile to incur the overhead of deduplication. The technique works by scanning the fingerprints of every block in the volume, but only including in the sample a single copy of each fingerprint that passes a filter. The selectivity of the filter is repeatedly increased while reading the fingerprints, to produce the target sample size. We show that the required sample size for a reasonable accuracy is small and independent of the size of the volume. In addition, we define and analyze an on-line technique that, once an initial scan of all fingerprints has been performed, efficiently maintains an up-to-date estimate of the duplication as the file system is modified. Experiments with various real data sets show that the accuracy is as predicted by theory. We also prototyped the proposed technique in an enterprise storage system and measured the performance overhead using the IOzone micro-benchmark.
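
The filter-tightening scan can be sketched as follows. Integer fingerprints stand in for per-block hashes and the sample bound is tiny for illustration; this is the general shape of the technique, not the paper's implementation:

```python
def estimate_duplication(fingerprints, max_sample=4):
    """Sample fingerprints passing fp % m == 0, doubling m whenever the
    sample overflows; the duplicate fraction inside the sample is the
    estimate for the whole volume."""
    m, sample = 1, {}
    for fp in fingerprints:
        if fp % m == 0:
            sample[fp] = sample.get(fp, 0) + 1
        while len(sample) > max_sample:  # tighten the filter and re-filter
            m *= 2
            sample = {f: c for f, c in sample.items() if f % m == 0}
    return 1 - len(sample) / sum(sample.values())

# Made-up integer fingerprints (a real system hashes every block):
fps = [4, 6, 4, 12, 4, 6, 7, 9] * 10
print(round(estimate_duplication(fps), 2))  # 0.95 (true value: 1 - 5/80 = 0.94)
```

Because survivors of a tighter filter also passed every looser one, their counts remain valid across tightening, which is why the sample size can stay fixed regardless of volume size.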

10:45 a.m.–11:15 a.m. Thursday

Break with Refreshments

Market Street Foyer

11:15 a.m.–12:30 p.m. Thursday

Miscellanea #2

Session Chair: Kai Shen, University of Rochester

MutantX-S: Scalable Malware Clustering Based on Static Features

Xin Hu, IBM T.J. Watson Research Center; Sandeep Bhatkar and Kent Griffin, Symantec Research Labs; Kang G. Shin, University of Michigan

The current lack of automatic and speedy labeling of a large number (thousands) of malware samples seen every day delays the generation of malware signatures and has become a major challenge for the anti-virus industry. In this paper, we design, implement and evaluate a novel, scalable framework, called MutantX-S, that can efficiently cluster a large number of samples into families based on programs’ static features, i.e., code instruction sequences. MutantX-S is a unique combination of several novel techniques to address the practical challenges of malware clustering. Specifically, it exploits the instruction format of the x86 architecture and represents a program as a sequence of opcodes, facilitating the extraction of N-gram features. It also exploits the hashing trick recently developed in the machine learning community to reduce the dimensionality of extracted feature vectors, thus significantly lowering the memory requirement and computation costs. Our comprehensive evaluation on a MutantX-S prototype using a database of more than 130,000 malware samples has shown its ability to correctly cluster over 80% of samples within 2 hours, achieving a good balance between accuracy and scalability. Applying MutantX-S to malware samples created at different times, we also demonstrate that MutantX-S achieves high accuracy in predicting labels for previously unknown malware.
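
The two feature steps the abstract names, opcode N-grams plus the hashing trick, can be sketched together. The opcode sequences and vector size are invented for illustration; this is not MutantX-S's feature pipeline:

```python
import zlib

DIM = 64  # hashed feature-vector size (tiny here; a real system uses more)

def ngram_features(opcodes, n=2):
    """Fold opcode N-grams into a fixed-size vector via the hashing trick."""
    vec = [0] * DIM
    for i in range(len(opcodes) - n + 1):
        gram = " ".join(opcodes[i:i + n]).encode()
        vec[zlib.crc32(gram) % DIM] += 1  # gram -> bucket, no dictionary needed
    return vec

def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

v1 = ngram_features(["push", "mov", "call", "ret"])
v2 = ngram_features(["push", "mov", "call", "pop"])  # near-identical variant
v3 = ngram_features(["xor", "jmp", "xor", "jmp"])    # unrelated sample
# Variants typically land much closer together than unrelated code:
print(distance(v1, v2), distance(v1, v3))
```

Hashing bounds the vector dimension no matter how many distinct N-grams the corpus contains, which is what keeps memory and computation costs manageable at scale.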

Redundant State Detection for Dynamic Symbolic Execution

Suhabe Bugrara and Dawson Engler, Stanford University

Many recent tools use dynamic symbolic execution to perform tasks such as automatic test generation, finding security flaws, equivalence verification, and exploit generation. However, while symbolic execution is promising, it perennially struggles with the fact that the number of paths in a program increases roughly exponentially with both code and input size. This paper presents a technique that attacks this problem by eliminating paths that cannot reach new code before they are executed, and evaluates it on 66 system-intensive, complicated, and widely-used programs. Our experiments demonstrate that the analysis speeds up dynamic symbolic execution by an average of 50.5x, with a median of 10x, and increases coverage by an average of 3.8%.

packetdrill: Scriptable Network Stack Testing, from Sockets to Packets

Neal Cardwell, Yuchung Cheng, Lawrence Brakmo, Matt Mathis, Barath Raghavan, Nandita Dukkipati, Hsiao-keng Jerry Chu, Andreas Terzis, and Tom Herbert, Google

Testing today’s increasingly complex network protocol implementations can be a painstaking process. To help meet this challenge, we developed packetdrill, a portable, open-source scripting tool that enables testing the correctness and performance of entire TCP/UDP/IP network stack implementations, from the system call layer to the hardware network interface, for both IPv4 and IPv6. We describe the design and implementation of the tool, and our experiences using it to execute 657 test cases. The tool was instrumental in our development of three new features for Linux TCP—Early Retransmit, Fast Open, and Loss Probes—and allowed us to find and fix 10 bugs in Linux. Our team uses packetdrill in all phases of the development process for the kernel used in one of the world’s largest Linux installations.

12:30 p.m.–2:00 p.m. Thursday

FCW Luncheon

Market Street Foyer

2:00 p.m.–3:30 p.m. Thursday

Virtual Machine Performance

Session Chair: Terence Kelly, HP Labs

DeepDive: Transparently Identifying and Managing Performance Interference in Virtualized Environments

Dejan Novaković, Nedeljko Vasić, and Stanko Novaković, École Polytechnique Fédérale de Lausanne (EPFL); Dejan Kostić, Institute IMDEA Networks; Ricardo Bianchini, Rutgers University

We describe the design and implementation of DeepDive, a system for transparently identifying and managing performance interference between virtual machines (VMs) co-located on the same physical machine in Infrastructure-as-a-Service cloud environments. DeepDive successfully addresses several important challenges, including the lack of performance information from applications, and the large overhead of detailed interference analysis. We first show that it is possible to use easily-obtainable, low-level metrics to clearly discern when interference is occurring and what resource is causing it. Next, using realistic workloads, we show that DeepDive quickly learns about interference across co-located VMs. Finally, we show DeepDive’s ability to deal efficiently with interference when it is detected, by using a low-overhead approach to identifying a VM placement that alleviates interference.

Efficient and Scalable Paravirtual I/O System

Nadav Har’El, Abel Gordon, and Alex Landau, IBM Research–Haifa; Muli Ben-Yehuda, Technion IIT and Hypervisor Consulting; Avishay Traeger and Razya Ladelsky, IBM Research–Haifa

The most popular I/O virtualization method today is paravirtual I/O. Its popularity stems from its reasonable performance levels while allowing the host to interpose, i.e., inspect or control, the guest’s I/O activity.

We show that paravirtual I/O performance still significantly lags behind that of state-of-the-art non-interposing I/O virtualization, SR-IOV. Moreover, we show that in the existing paravirtual I/O model, both latency and throughput degrade significantly as the number of guests increases. This scenario is becoming increasingly important, as the current trend in multi-core systems is towards an increasing number of guests per host.

We present an efficient and scalable paravirtual I/O system that provides all of the benefits of paravirtual I/O. By running host functionality on separate cores dedicated to serving multiple guests’ I/O, combined with fine-grained I/O scheduling and exitless notifications, our I/O virtualization system provides performance that is 1.2x–3x better than the baseline, approaching and in some cases exceeding the performance of non-interposing I/O virtualization.

vTurbo: Accelerating Virtual Machine I/O Processing Using Designated Turbo-Sliced Core

Cong Xu, Sahan Gamage, Hui Lu, Ramana Kompella, and Dongyan Xu, Purdue University

In a virtual machine (VM) consolidation environment, it has been observed that CPU sharing among multiple VMs will lead to I/O processing latency because of the CPU access latency experienced by each VM. In this paper, we present vTurbo, a system that accelerates I/O processing for VMs by offloading I/O processing to a designated core. More specifically, the designated core—called turbo core—runs with a much smaller time slice (e.g., 0.1ms) than the cores shared by production VMs. Most of the I/O IRQs for the production VMs will be delegated to the turbo core for more timely processing, hence accelerating the I/O processing for the production VMs. Our experiments show that vTurbo significantly improves the VMs’ network and disk I/O throughput, which consequently translates into application-level performance improvement.

Available Media

3:30 p.m.–4:00 p.m. Thursday

Break with Refreshments

Market Street Foyer

4:00 p.m.–5:45 p.m. Thursday

Managing Resources

Session Chair: Manuel Costa, Microsoft Research Cambridge

When Slower Is Faster: On Heterogeneous Multicores for Reliable Systems

Tomas Hruby, Herbert Bos, and Andrew S. Tanenbaum, VU University Amsterdam

Breaking up the OS into many small components is attractive from a dependability point of view. If one of the components crashes or needs an update, we can replace it on the fly without taking down the system. The question is how to achieve this without sacrificing performance or wasting resources unnecessarily. In this paper, we show that heterogeneous multicore architectures allow us to run OS code efficiently by executing each OS component on the most suitable core. Thus, components that require high single-thread performance run on (expensive) high-performance cores, while components that are less performance-critical run on wimpy cores. Moreover, as current trends suggest there will be no shortage of cores, we can give each component its own dedicated core when performance is of the essence, and consolidate multiple functions on a single core (saving power and resources) when performance is less critical. Using frequency scaling to emulate different x86 cores, we evaluate our design on the most demanding subsystem of our operating system—the network stack. We show that less is sometimes more: we can deliver better throughput with slower and, likely, less power-hungry cores. For instance, we support network processing at close to 10 Gbps (the maximum speed of our NIC) while using an average of just 60% of the core speeds. Moreover, even if we scale all the cores of the network stack down to as little as 200 MHz, we still achieve 1.8 Gbps, which may be enough for many applications.
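The placement policy the abstract describes—dedicated fast cores for critical components, consolidation for the rest—can be sketched as a greedy assignment (a sketch of ours, not the paper’s policy; the component names are hypothetical):

```python
def place_components(components, num_fast_cores, wimpy_capacity):
    """Give each performance-critical component a dedicated fast core,
    and consolidate the rest onto shared wimpy cores,
    wimpy_capacity components per core."""
    fast, leftovers = {}, []
    for name, critical in components:
        if critical and len(fast) < num_fast_cores:
            fast[name] = "fast-core-%d" % len(fast)
        else:
            leftovers.append(name)
    wimpy = [leftovers[i:i + wimpy_capacity]
             for i in range(0, len(leftovers), wimpy_capacity)]
    return fast, wimpy

# Hypothetical OS components: (name, performance-critical?)
components = [("tcp", True), ("ip", True), ("pm", False),
              ("syslog", False), ("rand", False)]
fast, wimpy = place_components(components, num_fast_cores=2, wimpy_capacity=2)
print(fast)   # {'tcp': 'fast-core-0', 'ip': 'fast-core-1'}
print(wimpy)  # [['pm', 'syslog'], ['rand']]
```

The interesting design question, which the paper evaluates empirically, is where to draw the “critical” line so that consolidation saves power without throttling throughput.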

Available Media

IAMEM: Interaction-Aware Memory Energy Management

Mingsong Bi, Intel Corporation; Srinivasan Chandrasekharan and Chris Gniady, University of Arizona

Energy efficiency has become one of the most important factors in the development of computer systems. As applications become more data-centric and put more pressure on the memory subsystem, managing the energy consumption of main memory is becoming critical. Ideally, a system would exploit every memory idle period by placing memory in low-power modes, even while a process is actively executing. However, current solutions optimize energy only on a per-process basis and cannot exploit idle periods that occur during process execution. To enable accurate, fine-grained memory management during process execution, we propose Interaction-Aware Memory Energy Management (IAMEM). IAMEM correlates user-initiated tasks with the demand they place on the memory subsystem to accurately predict power-state transitions, maximizing energy savings while minimizing the impact on performance. Through detailed trace-driven simulation, we show that IAMEM reduces memory energy consumption by as much as 16% compared to state-of-the-art approaches, while maintaining user-perceived performance comparable to that of a system without any energy optimizations.
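The power-state decision at the heart of such schemes can be sketched as a break-even comparison (state names and break-even times below are illustrative assumptions of ours, not IAMEM’s actual tables):

```python
def choose_power_state(predicted_idle_us, states):
    """Pick the deepest memory low-power state whose break-even time fits
    within the idle gap predicted for the current user interaction."""
    chosen = "active"
    for name, break_even_us in states:       # ordered shallow to deep
        if predicted_idle_us >= break_even_us:
            chosen = name
    return chosen

# (state, break-even time in microseconds), shallow to deep
STATES = [("standby", 10), ("nap", 100), ("powerdown", 1000)]
print(choose_power_state(5, STATES))      # active
print(choose_power_state(50, STATES))     # standby
print(choose_power_state(5000, STATES))   # powerdown
```

The quality of the prediction is what matters: a too-deep state whose exit latency exceeds the actual idle gap costs both performance and energy.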

Available Media

XLH: More Effective Memory Deduplication Scanners Through Cross-layer Hints

Konrad Miller, Fabian Franz, Marc Rittinghaus, Marius Hillenbrand, and Frank Bellosa, Karlsruhe Institute of Technology

Limited main memory size is the primary bottleneck for consolidating virtual machines (VMs) on hosting servers. Memory deduplication scanners reduce the memory footprint of VMs by eliminating redundancy. Our approach extends main memory deduplication scanners through Cross Layer I/O-based Hints (XLH) to find and exploit sharing opportunities earlier without raising the deduplication overhead.

Prior work on memory scanners has shown great opportunity for memory deduplication. In our analyses, we have confirmed these results; however, we have found memory scanners to work well only for deduplicating fairly static memory pages. Current scanners need a considerable amount of time to detect new sharing opportunities (e.g., 5 min) and therefore do not exploit the full sharing potential. XLH’s early detection of sharing opportunities saves more memory by deduplicating otherwise missed short-lived pages and by increasing the time long-lived duplicates remain shared.

Compared to I/O-agnostic scanners such as KSM, our benchmarks show that XLH can merge equal pages that stem from the virtual disk image earlier by minutes and is capable of saving up to four times as much memory; e.g., XLH saves 290 MiB vs. 75 MiB of main memory for two VMs with 512 MiB assigned memory each.
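The hint mechanism can be sketched as a priority queue in front of an ordinary content-hash scanner (a sketch of ours, not XLH’s kernel implementation; class and method names are hypothetical):

```python
import hashlib
from collections import deque

class HintedDedupScanner:
    """Pages named in the I/O hint queue are hashed and merged before the
    background linear scan would reach them, catching short-lived
    duplicates that a pure scanner misses."""
    def __init__(self, pages):
        self.pages = pages        # page_id -> page contents (bytes)
        self.by_hash = {}         # content hash -> canonical page_id
        self.merged = {}          # duplicate page_id -> canonical page_id
        self.hints = deque()

    def hint(self, page_id):      # raised from the I/O path, e.g. a disk read
        self.hints.append(page_id)

    def scan_one(self, page_id):
        h = hashlib.sha256(self.pages[page_id]).hexdigest()
        if h in self.by_hash and self.by_hash[h] != page_id:
            self.merged[page_id] = self.by_hash[h]   # share the page
        else:
            self.by_hash[h] = page_id

    def step(self):
        # Prefer hinted pages over the linear background scan.
        if self.hints:
            self.scan_one(self.hints.popleft())

pages = {
    "a": b"\x00" * 4096,   # duplicate content
    "b": b"\x00" * 4096,   # same content, freshly read from disk
    "c": b"\x07" * 4096,   # unique
}
scanner = HintedDedupScanner(pages)
scanner.scan_one("a")   # the background scan has already covered page a
scanner.hint("b")       # the I/O path reports page b was just read in
scanner.step()          # the hinted page is merged right away
print(scanner.merged)   # {'b': 'a'}
```

Because virtual disk reads are exactly where cross-VM duplicates enter memory, hinting from the I/O path front-loads the scan to the pages most likely to merge.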

Available Media

Enabling OS Research by Inferring Interactions in the Black-Box GPU Stack

Konstantinos Menychtas, Kai Shen, and Michael L. Scott, University of Rochester

General-purpose GPUs now account for substantial computing power on many platforms, but the management of GPU resources—cycles, memory, bandwidth—is frequently hidden in black-box libraries, drivers, and devices, outside the control of mainstream OS kernels. We believe that this situation is untenable, and that vendors will eventually expose sufficient information about cross-black-box interactions to enable whole-system resource management. In the meantime, we want to enable research into what that management should look like.

In this paper, we systematize a methodology for uncovering the interactions within black-box GPU stacks. The product of this methodology is a state machine that captures interactions as transitions among semantically meaningful states. The uncovered semantics can be of significant help in understanding and tuning application performance. More importantly, they allow the OS kernel to intercept—and act upon—the initiation and completion of arbitrary GPU requests, affording it full control over scheduling and other resource management. While not yet robust enough for production use, our tools open whole new fields of exploration to researchers outside the GPU vendor labs.
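The kind of state machine such a methodology recovers might look like the following (a simplified illustration of ours; the states and event names are hypothetical, not any vendor’s actual protocol):

```python
# Inferred request life-cycle: observed kernel-boundary events
# (ioctls, MMIO writes, interrupts) drive transitions between
# semantically meaningful states.
TRANSITIONS = {
    ("idle", "ioctl_submit"): "queued",
    ("queued", "mmio_doorbell"): "running",
    ("running", "irq_complete"): "idle",
}

def advance(state, event):
    # Unrecognized (state, event) pairs leave the state unchanged.
    return TRANSITIONS.get((state, event), state)

state = "idle"
for ev in ["ioctl_submit", "mmio_doorbell", "irq_complete"]:
    state = advance(state, ev)
print(state)  # back to 'idle' after one full request
```

Once the kernel can recognize the `queued` → `running` edge, it has a hook at which to delay, reorder, or account for GPU requests without vendor cooperation.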

Available Media

 

Friday, June 28, 2013

8:30 a.m.–9:00 a.m. Friday

Continental Breakfast

Market Street Foyer

9:00 a.m.–10:30 a.m. Friday

Small Applications

Session Chair: Christopher Small, Quanta Research

Mantis: Automatic Performance Prediction for Smartphone Applications

Yongin Kwon, Seoul National University; Sangmin Lee, University of Texas at Austin; Hayoon Yi, Donghyun Kwon, and Seungjun Yang, Seoul National University; Byung-Gon Chun, Microsoft; Ling Huang and Petros Maniatis, Intel; Mayur Naik, Georgia Institute of Technology; Yunheung Paek, Seoul National University

We present Mantis, a framework for predicting the performance of Android applications on given inputs automatically, accurately, and efficiently. A key insight underlying Mantis is that program executions often contain features that correlate with performance and can be computed automatically and efficiently. Mantis synergistically combines techniques from program analysis and machine learning. It constructs concise performance models by choosing, from many program execution features, only a handful that are most correlated with the program’s execution time yet can be evaluated efficiently from the program’s input. We apply program slicing to accurately estimate the evaluation cost of a feature and automatically generate executable code snippets for efficiently evaluating features. Our evaluation shows that Mantis predicts the execution time of six Android apps with estimation error in the range of 2.2–11.9% by executing predictor code costing at most 1.3% of their execution time on a Galaxy Nexus.
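The spirit of such a performance model, reduced to one feature, is an ordinary regression fit (our illustration; Mantis itself selects among many features with machine learning, and the profiling numbers below are invented):

```python
def fit_line(xs, ys):
    """Least-squares fit y ~ a*x + b over profiled (feature, time) pairs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

# Hypothetical profiling data: a loop trip count (a cheap-to-evaluate
# feature, extracted by slicing) vs. measured execution time in ms.
trip_counts = [100, 200, 400, 800]
times_ms = [12.0, 22.0, 42.0, 82.0]
a, b = fit_line(trip_counts, times_ms)
print(round(a * 1600 + b, 1))  # predicted time for an unseen input
```

The framework’s contribution is automating the rest: finding which features correlate, and generating slices that evaluate them far more cheaply than running the program.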

Available Media

I/O Stack Optimization for Smartphones

Sooman Jeong, Hanyang University; Kisung Lee, Samsung Electronics; Seongjin Lee, Hanyang University; Seoungbum Son, Samsung Electronics; Youjip Won, Hanyang University
Awarded Best Paper!  

The Android I/O stack consists of elaborate and mature components (SQLite, the EXT4 filesystem, interrupt-driven I/O, and NAND-based storage) whose integrated behavior is not well-orchestrated, which leaves substantial room for improvement. We overhauled the block I/O behavior of five filesystems (EXT4, XFS, BTRFS, NILFS, and F2FS) under each of the five different journaling modes of SQLite. We found that the most significant inefficiency originates from the fact that the filesystem journals the database’s own journaling activity; we refer to this as the JOJ (Journaling of Journal) anomaly. The JOJ overhead compounds in EXT4 when the bulky EXT4 journaling activity is triggered by an fsync() call from SQLite. We propose (i) eliminating unnecessary filesystem metadata journaling, (ii) external journaling, and (iii) polling-based I/O to improve I/O performance, primarily the efficiency of filesystem journaling in the SQLite environment. We examined the performance trade-offs of each combination of the five database journaling modes, five filesystems, and three optimization techniques. Applying the three optimization techniques to the existing Android I/O stack improved SQLite performance (inserts/sec) by 130%. With the F2FS filesystem, the WAL journaling mode of SQLite, and the combination of our optimizations, we improved SQLite performance by 300%, from 39 inserts/sec to 157 inserts/sec, compared to the stock Android I/O stack.
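The write amplification behind the JOJ anomaly can be captured with a back-of-the-envelope model (our simplification with invented per-commit write counts, not the paper’s measurements):

```python
def physical_writes(commits, journaling_of_journal=True):
    """Each SQLite commit writes a database page and a rollback-journal
    page, each followed by fsync(); if the filesystem journal also records
    those synced writes, every commit pays extra journal-commit blocks."""
    writes = 2 * commits              # DB page + SQLite journal page
    if journaling_of_journal:
        writes += 2 * commits         # filesystem journal records per fsync
    return writes

print(physical_writes(1000))          # 4000 with journaling-of-journal
print(physical_writes(1000, False))   # 2000 without
```

Even this crude model shows why eliminating redundant journaling at one layer, or moving it off the critical path, directly multiplies insert throughput on NAND storage.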

Available Media

How to Run POSIX Apps in a Minimal Picoprocess

Jon Howell, Bryan Parno, and John R. Douceur, Microsoft Research

We envision a future where Web, mobile, and desktop applications are delivered as isolated, complete software stacks to a minimal, secure client host. This shift imbues app vendors with full autonomy to maintain their apps’ integrity. Achieving this goal requires shifting complex behavior out of the client platform and into the vendors’ isolated apps. We ported rich, interactive POSIX apps, such as Gimp and Inkscape, to a spartan host platform. We describe this effort in sufficient detail to support reproducibility.

Available Media

10:30 a.m.–11:00 a.m. Friday

Break with Refreshments

Market Street Foyer

11:00 a.m.–12:15 p.m. Friday

Packets

Session Chair: Jon Howell, Microsoft Research Redmond

Network Interface Design for Low Latency Request-Response Protocols

Mario Flajslik and Mendel Rosenblum, Stanford University

Ethernet network interfaces in commodity systems are designed with a focus on achieving high bandwidth at low CPU utilization, often at the expense of latency. This approach is viable only as long as interface latency remains overwhelmingly dominated by software request-processing time. However, recent efforts to lower software latency in request-response systems, such as memcached and RAMCloud, have promoted the network interface into a significant contributor to overall latency. We present a low-latency network interface design suitable for request-response applications. Evaluation on a prototype FPGA implementation demonstrates that our design more than halves latency without a meaningful negative impact on either bandwidth or CPU power. We also investigate latency-power tradeoffs between interrupts and polling, as well as the effects of the processor’s low-power states.
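The interrupt-versus-polling tradeoff mentioned above can be framed with a toy latency model (the costs below are illustrative numbers of ours, not measurements from the paper):

```python
def completion_latency_us(mode, irq_cost_us=6.0, poll_interval_us=0.5):
    """An interrupt pays a fixed wake-up and handler cost once per
    completion; a polling loop finds the completion within half a poll
    interval on average, but keeps a core busy (and out of its low-power
    states) while it spins."""
    if mode == "interrupt":
        return irq_cost_us
    return poll_interval_us / 2.0

print(completion_latency_us("interrupt"))  # 6.0
print(completion_latency_us("poll"))       # 0.25
```

Polling wins on latency whenever the poll interval is much smaller than the interrupt path, which is exactly the regime request-response workloads like memcached operate in; the cost shows up as power rather than time.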

Available Media

DEFINED: Deterministic Execution for Interactive Control-Plane Debugging

Chia-Chi Lin, Virajith Jalaparti, and Matthew Caesar, University of Illinois at Urbana-Champaign; Jacobus Van der Merwe, University of Utah

Large-scale networks are among the most complex software infrastructures in existence. Unfortunately, the extreme complexity of their basis, the control-plane software, leads to a rich variety of nondeterministic failure modes and anomalies. Research on debugging modern control-plane software has focused on designing comprehensive record and replay systems, but the large volumes of recordings often hinder the scalability of these designs. Here, we argue for a different approach. Namely, we take the position that deterministic network execution would vastly simplify the control-plane debugging process. This paper presents the design and implementation of DEFINED, a user-space substrate for interactive debugging that provides deterministic execution of networks in highly distributed and dynamic environments. We demonstrate our system’s advantages by reproducing discovery of known ordering and timing bugs in popular software routing platforms, XORP and Quagga. Using Rocketfuel topologies and routing data from a Tier-1 backbone, we show DEFINED is practical and scalable for interactive fault diagnosis in large networks.
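The essence of deterministic execution is replacing arbitrary event interleavings with a fixed total order (a sketch of ours, not DEFINED’s internals; the key fields and event contents are hypothetical):

```python
def deterministic_order(events):
    """Impose a total order on control-plane events (message deliveries,
    timer firings) from a deterministic key, so a replayed run makes the
    same scheduling decisions as the original and nondeterministic
    interleavings cannot occur."""
    return sorted(events,
                  key=lambda e: (e["logical_time"], e["node"], e["seq"]))

events = [
    {"logical_time": 2, "node": "r1", "seq": 0, "msg": "withdraw 10.0.0.0/8"},
    {"logical_time": 1, "node": "r2", "seq": 0, "msg": "announce 10.0.0.0/8"},
    {"logical_time": 1, "node": "r1", "seq": 1, "msg": "hold-timer"},
]
print([e["msg"] for e in deterministic_order(events)])
# ['hold-timer', 'announce 10.0.0.0/8', 'withdraw 10.0.0.0/8']
```

With the order fixed, an ordering or timing bug either always manifests or never does, which is what makes interactive, reproducible debugging of routing software feasible.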

Available Media

Improving Server Application Performance via Pure TCP ACK Receive Optimization

Michael Chan and David R. Cheriton, Stanford University

Network stack performance is critical to server scalability and user-perceived application experience. Per-packet overhead is a major bottleneck in scaling network I/O. While much effort is expended on reducing per-packet overhead for data-carrying packets, small control packets such as pure TCP ACKs have received relatively little attention. In this paper, we show that ACK receive processing can consume up to 20% of cycles in server applications. We propose a simple kernel-level optimization that reduces this overhead through fewer memory allocations and a simplified code path. Using this technique, we demonstrate cycle savings of 15% in a Web application and a 33% throughput improvement in reliable multicast.
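The shape of the optimization—classify pure ACKs early and route them around the allocation-heavy receive path—can be sketched as follows (our illustration, not the paper’s kernel patch; the flag values are standard TCP header bits):

```python
ACK, SYN, FIN, RST = 0x10, 0x02, 0x01, 0x04

def is_pure_ack(flags, payload_len):
    """A pure ACK carries the ACK flag, no payload, and no SYN/FIN/RST."""
    return bool(flags & ACK) and payload_len == 0 \
        and not flags & (SYN | FIN | RST)

def receive(packets):
    """Pure ACKs take a slim path that updates connection state in place,
    skipping the buffer allocation the normal receive path performs."""
    allocations = 0
    for flags, payload in packets:
        if is_pure_ack(flags, len(payload)):
            pass              # advance snd_una, free acked segments; no alloc
        else:
            allocations += 1  # normal path: allocate a receive buffer
    return allocations

# Two pure ACKs and one data segment (PSH|ACK): only one allocation.
print(receive([(0x10, b""), (0x18, b"payload"), (0x10, b"")]))  # 1
```

For a busy server streaming data out, pure ACKs dominate the receive mix, so shaving an allocation from each one compounds into the cycle savings the paper reports.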

Available Media

12:15 p.m.–12:30 p.m. Friday

Awards and Closing Remarks

Best Paper Awards

Program Co-Chairs: Andrew Birrell, Microsoft Research Silicon Valley, and Emin Gün Sirer, Cornell University

Available Media