8:30 a.m.–9:00 a.m., Thursday
Continental Breakfast
Market Street Foyer
9:00 a.m.–10:45 a.m., Thursday
Session Chair: Dave Presotto, Google
Kai Ren and Garth Gibson, Carnegie Mellon University

File systems that manage magnetic disks have long recognized the importance of sequential allocation and large transfer sizes for file data. Metadata lookup structures, by contrast, have been designed for fast random access, with increasing use of on-disk B-trees. Yet our experiments with workloads dominated by metadata and small-file access indicate that even sophisticated local disk file systems such as Ext4, XFS, and Btrfs leave substantial room for performance improvement on these workloads.
In this paper we present a stacked file system, TABLEFS, which uses another local file system as an object store. TABLEFS organizes all metadata into a single sparse table backed on disk by a Log-Structured Merge (LSM) tree (LevelDB in our experiments). By stacking, TABLEFS asks only for efficient large-file allocation and access from the underlying local file system. By using an LSM tree, TABLEFS ensures metadata is written to disk in large, non-overwrite, sorted, and indexed logs. Even an inefficient FUSE-based user-level implementation of TABLEFS can perform comparably to Ext4, XFS, and Btrfs on data-intensive benchmarks, and can outperform them by 50% to as much as 1000% on metadata-intensive workloads. Such promising results suggest that local disk file systems can be significantly improved by more aggressive aggregation and batching of metadata updates.
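To make the key idea concrete, here is a toy Python sketch (not TABLEFS’s actual code) of packing file metadata into one sorted key-value table, in the spirit of TABLEFS over LevelDB. The MetadataTable class is an invented stand-in for an LSM store; keying entries by (parent inode, name) turns readdir into a single contiguous range scan.

    import bisect, json

    class MetadataTable:
        """Toy sorted KV table standing in for an LSM store such as LevelDB."""
        def __init__(self):
            self._keys, self._vals = [], []

        def put(self, key, val):
            i = bisect.bisect_left(self._keys, key)
            if i < len(self._keys) and self._keys[i] == key:
                self._vals[i] = val
            else:
                self._keys.insert(i, key)
                self._vals.insert(i, val)

        def scan(self, lo, hi):
            i = bisect.bisect_left(self._keys, lo)
            while i < len(self._keys) and self._keys[i] < hi:
                yield self._keys[i], self._vals[i]
                i += 1

    # (parent inode, name) keys cluster each directory's entries together.
    table = MetadataTable()
    ROOT = 0
    table.put((ROOT, "etc"), json.dumps({"ino": 1, "type": "dir"}))
    table.put((ROOT, "home"), json.dumps({"ino": 2, "type": "dir"}))
    table.put((2, "alice"), json.dumps({"ino": 3, "type": "dir"}))

    def readdir(parent_ino):
        # One range scan returns the directory's entries, already sorted.
        return [(name, json.loads(v)) for (_, name), v in
                table.scan((parent_ino, ""), (parent_ino + 1, ""))]

    print(readdir(ROOT))  # [('etc', {...}), ('home', {...})]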
Hyong Shim, Philip Shilane, and Windsor Hsu, EMC Corporation

Protecting data on primary storage often requires creating secondary copies by periodically replicating the data to external target systems. We analyze over 100,000 traces from 125 customer block-based primary storage systems to gain a high-level understanding of I/O characteristics, and then perform an in-depth analysis of over 500 traces from 13 systems that span at least 24 hours. Our analysis has the twin goals of minimizing overheads on primary systems and improving data replication efficiency. We compare our results with a study from a decade ago and provide fresh insights into patterns of incremental changes on primary systems over time.
Primary storage systems often create snapshots as point-in-time copies in order to support host I/O while replicating changed data to target systems. However, creating standard snapshots on a primary storage system incurs overheads in terms of capacity and I/O, and we present a new snapshot technique called a replication snapshot that reduces these overheads. Replicated data also requires capacity and I/O on the target system, and we investigate techniques to significantly reduce these overheads. We also find that highly sequential or random I/O patterns have different incremental change characteristics. Where applicable, we present our findings as advice to storage engineers and administrators.
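As a concrete illustration of tracking incremental changes between replication cycles, here is a minimal Python sketch of change-block tracking. It shows the general idea the analysis builds on, not EMC’s replication-snapshot design; the block size and callback APIs are invented.

    BLOCK = 4096

    class ChangeTracker:
        def __init__(self):
            self.dirty = set()       # block numbers written since last cycle

        def on_write(self, offset, length):
            first, last = offset // BLOCK, (offset + length - 1) // BLOCK
            self.dirty.update(range(first, last + 1))

        def replicate(self, read_block, send):
            # Ship only blocks changed since the previous cycle, then reset.
            for blk in sorted(self.dirty):
                send(blk, read_block(blk))
            changed = len(self.dirty)
            self.dirty.clear()
            return changed

    tracker = ChangeTracker()
    tracker.on_write(0, 10)          # touches block 0
    tracker.on_write(8000, 500)      # touches blocks 1-2
    sent = tracker.replicate(lambda b: b"\0" * BLOCK, lambda b, d: None)
    print(sent)                      # 3 blocks instead of the whole volume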
Alysson Bessani, Marcel Santos, João Felix, and Nuno Neves, FCUL/LaSIGE, University of Lisbon; Miguel Correia, INESC-ID, IST, University of Lisbon

State Machine Replication (SMR) is a fundamental technique for ensuring the dependability of critical services in modern Internet-scale infrastructures. SMR alone does not protect against full crashes, so in practice it is employed together with secondary storage to ensure the durability of the data managed by these services. In this work we show that the classical durability-enforcing mechanisms—logging, checkpointing, state transfer—can have a high impact on the performance of SMR-based services even if SSDs are used instead of disks. To alleviate this impact, we propose three techniques that can be applied transparently, i.e., without modifying the SMR programming model or requiring extra resources: parallel logging, sequential checkpointing, and collaborative state transfer. We demonstrate the benefits of these techniques experimentally by implementing them in an open-source replication library and evaluating them in the context of a consistent key-value store and a coordination service.
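Of the three techniques, parallel logging is the simplest to sketch: a batch of requests is written to stable storage while the same batch is being executed, and the reply is released only once both finish. The Python sketch below illustrates that overlap under invented names; it is not the paper’s replication library.

    import os, tempfile, threading

    log = open(os.path.join(tempfile.mkdtemp(), "smr.log"), "ab")
    state = {}

    def stable_log(batch, done):
        for op in batch:
            log.write((repr(op) + "\n").encode())
        log.flush()
        os.fsync(log.fileno())       # durability point
        done.set()

    def execute(batch):
        for op, key, val in batch:
            if op == "put":
                state[key] = val

    def process(batch):
        logged = threading.Event()
        t = threading.Thread(target=stable_log, args=(batch, logged))
        t.start()
        execute(batch)               # runs concurrently with the disk write
        logged.wait()                # reply only after the batch is durable
        t.join()

    process([("put", "a", 1), ("put", "b", 2)])
    print(state)                     # {'a': 1, 'b': 2}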
Fei Xie, Michael Condict, and Sandip Shete, NetApp Inc.

We define a new technique for accurately estimating the amount of duplication in a storage volume from a small sample and we analyze its performance and accuracy. The estimate is useful for determining whether it is worthwhile to incur the overhead of deduplication. The technique works by scanning the fingerprints of every block in the volume, but only including in the sample a single copy of each fingerprint that passes a filter. The selectivity of the filter is repeatedly increased while reading the fingerprints, to produce the target sample size. We show that the required sample size for a reasonable accuracy is small and independent of the size of the volume. In addition, we define and analyze an on-line technique that, once an initial scan of all fingerprints has been performed, efficiently maintains an up-to-date estimate of the duplication as the file system is modified. Experiments with various real data sets show that the accuracy is as predicted by theory. We also prototyped the proposed technique in an enterprise storage system and measured the performance overhead using the IOzone micro-benchmark.
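The filtering idea can be sketched compactly: keep a fingerprint only if its low-order bits are zero, and add one more bit to the filter whenever the sample outgrows its budget, which consistently evicts part of the sample. The Python sketch below is illustrative only; the hash choice, budget, and packaging of the estimator are invented.

    import hashlib
    from collections import Counter

    def estimate_distinct_ratio(blocks, budget=16):
        """Estimate (distinct blocks / total blocks) via a content-based sample."""
        bits, counts = 0, Counter()      # keep fp iff its low `bits` bits are 0
        for blk in blocks:
            fp = int.from_bytes(hashlib.sha1(blk).digest()[:8], "big")
            if fp & ((1 << bits) - 1) == 0:
                counts[fp] += 1
            while len(counts) > budget:  # sample too big: tighten the filter
                bits += 1
                for f in [f for f in counts if f & ((1 << bits) - 1)]:
                    del counts[f]        # evict fps failing the new filter
        total = sum(counts.values())
        return len(counts) / total if total else 1.0

    # 50 copies of one block plus 50 unique blocks: true ratio is 51/100.
    blocks = [b"A" * 4096] * 50 + [bytes([i]) * 4096 for i in range(50)]
    print(estimate_distinct_ratio(blocks))

Because the filter decision depends only on content, each distinct fingerprint is either fully in or fully out of the sample, so the sampled distinct-to-total ratio is an unbiased estimate of the volume-wide one.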
10:45 a.m.–11:15 a.m., Thursday
Break with Refreshments
Market Street Foyer
11:15 a.m.–12:30 p.m., Thursday
Session Chair: Kai Shen, University of Rochester
Xin Hu, IBM T.J. Watson Research Center; Sandeep Bhatkar and Kent Griffin, Symantec Research Labs; Kang G. Shin, University of Michigan

The current lack of automatic and speedy labeling of the large number (thousands) of malware samples seen every day delays the generation of malware signatures and has become a major challenge for the anti-virus industry. In this paper, we design, implement, and evaluate a novel, scalable framework, called MutantX-S, that can efficiently cluster a large number of samples into families based on programs’ static features, i.e., code instruction sequences. MutantX-S is a unique combination of several novel techniques that address the practical challenges of malware clustering. Specifically, it exploits the instruction format of the x86 architecture and represents a program as a sequence of opcodes, facilitating the extraction of N-gram features. It also exploits the hashing trick recently developed in the machine learning community to reduce the dimensionality of extracted feature vectors, significantly lowering memory requirements and computation costs. Our comprehensive evaluation of a MutantX-S prototype on a database of more than 130,000 malware samples shows its ability to correctly cluster over 80% of the samples within 2 hours, achieving a good balance between accuracy and scalability. Applying MutantX-S to malware samples created at different times, we also demonstrate that it achieves high accuracy in predicting labels for previously unknown malware.
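For readers unfamiliar with the hashing trick, the Python sketch below folds N-gram opcode features into a fixed-size vector, in the spirit of MutantX-S; the opcode sequences, hash function, and dimensionality are illustrative, not the paper’s.

    import hashlib

    def ngram_hash_features(opcodes, n=3, dims=1024):
        """Map an opcode sequence to a fixed-size feature vector."""
        vec = [0] * dims
        for i in range(len(opcodes) - n + 1):
            gram = ",".join(opcodes[i:i + n])
            h = int(hashlib.md5(gram.encode()).hexdigest(), 16)
            vec[h % dims] += 1       # hash collisions are tolerated by design
        return vec

    sample_a = ["push", "mov", "call", "mov", "ret"]
    sample_b = ["push", "mov", "call", "pop", "ret"]
    fa, fb = ngram_hash_features(sample_a), ngram_hash_features(sample_b)
    # A similarity measure over the hashed vectors can then drive clustering.
    print(sum(x * y for x, y in zip(fa, fb)))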
Suhabe Bugrara and Dawson Engler, Stanford University

Many recent tools use dynamic symbolic execution for tasks ranging from automatic test generation and security-flaw discovery to equivalence verification and exploit generation. However, while symbolic execution is promising, it perennially struggles with the fact that the number of paths in a program increases roughly exponentially with both code and input size. This paper presents a technique that attacks this problem by eliminating, before they are executed, paths that cannot reach new code, and evaluates it on 66 system-intensive, complicated, and widely used programs. Our experiments demonstrate that the analysis speeds up dynamic symbolic execution by an average of 50.5x, with a median of 10x, and increases coverage by an average of 3.8%.
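The pruning idea reduces to a reachability question: from a path’s current block, can any not-yet-covered block still be reached? A minimal Python sketch with an invented control-flow graph:

    def reachable(cfg, start):
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(cfg.get(node, ()))
        return seen

    cfg = {"entry": ["a", "b"], "a": ["exit"], "b": ["c"], "c": ["exit"]}
    covered = {"entry", "a", "exit"}

    def worth_continuing(path_head):
        # Prune the path if every block it can still reach is already covered.
        return any(blk not in covered for blk in reachable(cfg, path_head))

    print(worth_continuing("a"))   # False: everything ahead is covered
    print(worth_continuing("b"))   # True: blocks b and c are still uncovered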
Neal Cardwell, Yuchung Cheng, Lawrence Brakmo, Matt Mathis, Barath Raghavan, Nandita Dukkipati, Hsiao-keng Jerry Chu, Andreas Terzis, and Tom Herbert, Google

Testing today’s increasingly complex network protocol implementations can be a painstaking process. To help meet this challenge, we developed packetdrill, a portable, open-source scripting tool that enables testing the correctness and performance of entire TCP/UDP/IP network stack implementations, from the system call layer to the hardware network interface, for both IPv4 and IPv6. We describe the design and implementation of the tool, and our experiences using it to execute 657 test cases. The tool was instrumental in our development of three new features for Linux TCP—Early Retransmit, Fast Open, and Loss Probes—and allowed us to find and fix 10 bugs in Linux. Our team uses packetdrill in all phases of the development process for the kernel used in one of the world’s largest Linux installations.
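For flavor, here is a small script in the general style of packetdrill’s published examples: relative timestamps on the left, system calls with expected return values, and injected (<) versus expected (>) packets. The exact TCP option syntax is reproduced from memory and may not match the current tool verbatim.

    // Set up a listening socket.
    0    socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
    +0   setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    +0   bind(3, ..., ...) = 0
    +0   listen(3, 1) = 0

    // Inject a SYN, expect a SYN/ACK, then complete the handshake.
    +0   < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
    +0   > S. 0:0(0) ack 1 <...>
    +.1  < . 1:1(0) ack 1 win 257
    +0   accept(3, ..., ...) = 4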
12:30 p.m.–2:00 p.m., Thursday
FCW Luncheon
Market Street Foyer
2:00 p.m.–3:30 p.m., Thursday
Session Chair: Terence Kelly, HP Labs
Dejan Novaković, Nedeljko Vasić, and Stanko Novaković, École Polytechnique Fédérale de Lausanne (EPFL); Dejan Kostić, Institute IMDEA Networks; Ricardo Bianchini, Rutgers University

We describe the design and implementation of DeepDive, a system for transparently identifying and managing performance interference between virtual machines (VMs) co-located on the same physical machine in Infrastructure-as-a-Service cloud environments. DeepDive successfully addresses several important challenges, including the lack of performance information from applications, and the large overhead of detailed interference analysis. We first show that it is possible to use easily-obtainable, low-level metrics to clearly discern when interference is occurring and what resource is causing it. Next, using realistic workloads, we show that DeepDive quickly learns about interference across co-located VMs. Finally, we show DeepDive’s ability to deal efficiently with interference when it is detected, by using a low-overhead approach to identifying a VM placement that alleviates interference.
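A minimal sketch of the first step, detecting interference from easily obtainable low-level metrics, is shown below in Python. The counter names, baseline values, and threshold are invented for illustration and are not DeepDive’s.

    # Compare a VM's current hardware-counter profile to a baseline
    # captured while the VM ran in isolation.
    BASELINE = {"cycles_per_instr": 1.1, "llc_misses_per_kinstr": 4.0}

    def interference_score(current):
        # Largest relative deviation from the isolated-run baseline.
        return max(abs(current[k] - v) / v for k, v in BASELINE.items())

    def interfering(current, threshold=0.3):
        return interference_score(current) > threshold

    print(interfering({"cycles_per_instr": 1.2,
                       "llc_misses_per_kinstr": 4.2}))   # False
    print(interfering({"cycles_per_instr": 1.9,
                       "llc_misses_per_kinstr": 9.5}))   # True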
Nadav Har’El, Abel Gordon, and Alex Landau, IBM Research–Haifa; Muli Ben-Yehuda, Technion IIT and Hypervisor Consulting; Avishay Traeger and Razya Ladelsky, IBM Research–Haifa

The most popular I/O virtualization method today is paravirtual I/O. Its popularity stems from its reasonable performance combined with the host’s ability to interpose on, i.e., inspect or control, the guest’s I/O activity.
We show that paravirtual I/O performance still significantly lags behind that of state-of-the-art non-interposing I/O virtualization, SR-IOV. Moreover, we show that in the existing paravirtual I/O model, both latency and throughput degrade significantly as the number of guests grows. This scenario is becoming increasingly important, as the current trend in multi-core systems is toward an increasing number of guests per host.
We present an efficient and scalable virtual I/O system that preserves all of the benefits of paravirtual I/O. By running host functionality on separate cores dedicated to serving multiple guests’ I/O, and by combining fine-grained I/O scheduling with exitless notifications, our I/O virtualization system delivers performance 1.2x–3x better than the baseline, approaching and in some cases exceeding the performance of non-interposing I/O virtualization.
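The core mechanism can be sketched as a single polling loop that serves the request rings of several guests, so guests need not exit to the host to signal I/O. In the Python sketch below, queues stand in for shared-memory rings; all names are invented.

    import queue, threading

    guest_rings = [queue.Queue() for _ in range(3)]   # one ring per guest
    completions = [queue.Queue() for _ in range(3)]
    stop = threading.Event()

    def io_core():
        # Fine-grained round-robin polling across all guests' rings.
        while not stop.is_set():
            for gid, ring in enumerate(guest_rings):
                try:
                    req = ring.get_nowait()
                except queue.Empty:
                    continue
                completions[gid].put(("done", req))   # process and complete

    t = threading.Thread(target=io_core, daemon=True)
    t.start()
    guest_rings[1].put("read block 7")                # exitless submission
    print(completions[1].get(timeout=1))              # ('done', 'read block 7')
    stop.set()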
Cong Xu, Sahan Gamage, Hui Lu, Ramana Kompella, and Dongyan Xu, Purdue University

In a virtual machine (VM) consolidation environment, it has been observed that CPU sharing among multiple VMs leads to I/O processing latency because of the CPU access latency experienced by each VM. In this paper, we present vTurbo, a system that accelerates I/O processing for VMs by offloading that processing to a designated core. More specifically, the designated core—called the turbo core—runs with a much smaller time slice (e.g., 0.1 ms) than the cores shared by production VMs. Most of the I/O IRQs for the production VMs are delegated to the turbo core for more timely processing, hence accelerating I/O processing for the production VMs. Our experiments show that vTurbo significantly improves the VMs’ network and disk I/O throughput, which consequently translates into application-level performance improvement.
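A back-of-the-envelope calculation shows why a tiny time slice matters: under round-robin scheduling, an IRQ routed to a shared core can wait up to (N−1)·q, where N is the number of VMs on the core and q is the slice length. The numbers below are illustrative, not from the paper.

    def worst_irq_delay_ms(num_vms, slice_ms):
        # An IRQ arriving just after a VM's slice ends waits for the others.
        return (num_vms - 1) * slice_ms

    print(worst_irq_delay_ms(4, 30))    # regular core: up to 90 ms
    print(worst_irq_delay_ms(4, 0.1))   # turbo core:   up to 0.3 ms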
3:30 p.m.–4:00 p.m., Thursday
Break with Refreshments
Market Street Foyer
4:00 p.m.–5:45 p.m., Thursday
Session Chair: Manuel Costa, Microsoft Research Cambridge
Tomas Hruby, Herbert Bos, and Andrew S. Tanenbaum, VU University Amsterdam

Breaking up the OS into many small components is attractive from a dependability point of view. If one of the components crashes or needs an update, we can replace it on the fly without taking down the system. The question is how to achieve this without sacrificing performance or wasting resources unnecessarily. In this paper, we show that heterogeneous multicore architectures allow us to run OS code efficiently by executing each OS component on the most suitable core. Thus, components that require high single-thread performance run on (expensive) high-performance cores, while less performance-critical components run on wimpy cores. Moreover, as current trends suggest that there will be no shortage of cores, we can give each component its own dedicated core when performance is of the essence, and consolidate multiple functions on a single core (saving power and resources) when performance is less critical. Using frequency scaling to emulate different x86 cores, we evaluate our design on the most demanding subsystem of our operating system: the network stack. We show that less is sometimes more and that we can deliver better throughput with slower and, likely, less power-hungry cores. For instance, we support network processing at close to 10 Gbps (the maximum speed of our NIC) while using an average of just 60% of the core speeds. Moreover, even if we scale all the cores of the network stack down to as little as 200 MHz, we still achieve 1.8 Gbps, which may be enough for many applications.
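A toy Python sketch of the placement policy the abstract describes: performance-critical components receive dedicated fast cores, and the rest are consolidated onto slow cores. The component names and core labels are invented.

    components = {"tcp_stack": "critical", "net_driver": "critical",
                  "power_mgr": "background", "logger": "background"}
    fast_cores, slow_cores = ["big0", "big1"], ["little0"]

    placement, i = {}, 0
    for name, kind in components.items():
        if kind == "critical" and fast_cores:
            placement[name] = fast_cores.pop(0)    # dedicated fast core
        else:
            # Consolidate the rest, round-robin, onto slow cores.
            placement[name] = slow_cores[i % len(slow_cores)]
            i += 1
    print(placement)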
Mingsong Bi, Intel Corporation; Srinivasan Chandrasekharan and Chris Gniady, University of Arizona

Energy efficiency has become one of the most important factors in the development of computer systems. As applications become more data-centric and put more pressure on the memory subsystem, managing the energy consumption of main memory is becoming critical. Ideally, every memory idle period would be exploited by placing memory in low-power modes, even while a process is actively executing. However, current solutions offer energy optimizations only on a per-process basis and cannot exploit memory idle times that occur while a process is running. To enable accurate, fine-grained memory management during process execution, we propose Interaction-Aware Memory Energy Management (IAMEM). IAMEM relies on accurately correlating user-initiated tasks with the demand they place on the memory subsystem to predict power-state transitions that maximize energy savings while minimizing the impact on performance. Through detailed trace-driven simulation, we show that IAMEM reduces memory energy consumption by as much as 16% compared to state-of-the-art approaches, while keeping user-perceived performance comparable to a system without any energy optimizations.
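A minimal sketch of the idea in Python: per-task history predicts how long memory will stay idle, and a low-power transition is taken only when the prediction beats the transition’s break-even time. The task names, gap values, and break-even constant are invented, not IAMEM’s.

    BREAK_EVEN_MS = 5.0              # below this, transitioning wastes energy
    history = {}                     # task name -> observed idle gaps (ms)

    def record_idle(task, gap_ms):
        history.setdefault(task, []).append(gap_ms)

    def should_power_down(task):
        gaps = history.get(task)
        if not gaps:
            return False             # no history yet: stay in the active state
        predicted = sum(gaps) / len(gaps)
        return predicted > BREAK_EVEN_MS

    record_idle("open_menu", 12.0)
    record_idle("open_menu", 9.0)
    record_idle("typing", 1.5)
    print(should_power_down("open_menu"))  # True
    print(should_power_down("typing"))     # False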
Konrad Miller, Fabian Franz, Marc Rittinghaus, Marius Hillenbrand, and Frank Bellosa, Karlsruhe Institute of Technology

Limited main memory size is the primary bottleneck for consolidating virtual machines (VMs) on hosting servers. Memory deduplication scanners reduce the memory footprint of VMs by eliminating redundancy. Our approach extends main memory deduplication scanners through Cross Layer I/O-based Hints (XLH) to find and exploit sharing opportunities earlier without raising the deduplication overhead.
Prior work on memory scanners has shown great opportunity for memory deduplication. In our analyses, we have confirmed these results; however, we have found memory scanners to work well only for deduplicating fairly static memory pages. Current scanners need a considerable amount of time to detect new sharing opportunities (e.g., 5 min) and therefore do not exploit the full sharing potential. XLH’s early detection of sharing opportunities saves more memory by deduplicating otherwise missed short-lived pages and by increasing the time long-lived duplicates remain shared.
Compared to I/O-agnostic scanners such as KSM, our benchmarks show that XLH can merge identical pages that stem from the virtual disk image minutes earlier, and that it can save up to four times as much memory; e.g., XLH saves 290 MiB vs. 75 MiB of main memory for two VMs with 512 MiB of assigned memory each.
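A toy Python sketch of an I/O-based hint path: pages just read from a virtual disk are checked for duplicates immediately rather than waiting for the periodic scan. The function names and structures are invented; XLH itself extends KSM inside the kernel.

    import hashlib
    from collections import deque

    hash_to_page = {}                # content hash -> canonical page id
    scan_queue = deque()             # regular scanner work, processed slowly

    def io_hint(page_id, content):
        # Fast path: deduplicate right after the disk read completes.
        h = hashlib.sha1(content).digest()
        if h in hash_to_page:
            return ("merged", hash_to_page[h])   # map onto the existing frame
        hash_to_page[h] = page_id
        scan_queue.appendleft(page_id)           # still revisit it eventually
        return ("kept", page_id)

    print(io_hint(1, b"linux kernel page"))      # ('kept', 1)
    print(io_hint(2, b"linux kernel page"))      # ('merged', 1) across two VMs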
Konstantinos Menychtas, Kai Shen, and Michael L. Scott, University of Rochester

General-purpose GPUs now account for substantial computing power on many platforms, but the management of GPU resources—cycles, memory, bandwidth—is frequently hidden in black-box libraries, drivers, and devices, outside the control of mainstream OS kernels. We believe that this situation is untenable, and that vendors will eventually expose sufficient information about cross-black-box interactions to enable whole-system resource management. In the meantime, we want to enable research into what that management should look like.
In this paper, we systematize a methodology for uncovering the interactions within black-box GPU stacks. The product of this methodology is a state machine that captures interactions as transitions among semantically meaningful states. The uncovered semantics can be of significant help in understanding and tuning application performance. More importantly, they allow the OS kernel to intercept—and act upon—the initiation and completion of arbitrary GPU requests, affording it full control over scheduling and other resource management. While insufficiently robust for production use, our tools open whole new fields of exploration to researchers outside the GPU vendor labs.
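A minimal Python sketch of the inference step: each traced event (an ioctl, an mmap write) observed from some state becomes a transition, and repeated traces reinforce the same machine. The event names and trace format are invented.

    def build_state_machine(traces):
        transitions = {}             # (state, event) -> next state
        for trace in traces:
            state = "IDLE"
            for event in trace:
                nxt = transitions.setdefault((state, event),
                                             f"S{len(transitions)}")
                state = nxt
        return transitions

    traces = [["ioctl_submit", "mmap_write_ring", "irq_complete"],
              ["ioctl_submit", "mmap_write_ring", "irq_complete"]]
    # Identical traces collapse onto one chain of states.
    print(build_state_machine(traces))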