Proceedings Front Matter:
Covers | Title Page and List of Organizers | Table of Contents | Message from the Program Co-Chairs

Full Proceedings PDFs
 FAST '14 Full Proceedings (PDF)
 FAST '14 Proceedings Interior (PDF, best for mobile devices)

Full Proceedings ePub (for iPad and most eReaders)
 FAST '14 Full Proceedings (ePub)

Full Proceedings Mobi (for Kindle)
 FAST '14 Full Proceedings (Mobi)

Download Proceedings Archive (Conference Attendees Only)

Attendee Files 
FAST '14 Proceedings Archive (ZIP, includes Conference Attendee list)

 

Tuesday, February 18, 2014

8:45 a.m.–9:00 a.m. Tuesday

Opening Remarks and Best Paper Award

Grand Ballroom ABGH

FAST '14 Opening Remarks and Awards

Session Chairs: Bianca Schroeder, University of Toronto, and Eno Thereska, Microsoft Research

 


Available Media

9:00 a.m.–10:30 a.m. Tuesday

Big Memory

Grand Ballroom ABGH

Session Chair: Hakim Weatherspoon, Cornell University

Log-structured Memory for DRAM-based Storage

Stephen M. Rumble, Ankita Kejriwal, and John Ousterhout, Stanford University
Awarded Best Paper!

Traditional memory allocation mechanisms are not suitable for new DRAM-based storage systems because they use memory inefficiently, particularly under changing access patterns. In contrast, a log-structured approach to memory management allows 80-90% memory utilization while offering high performance. The RAMCloud storage system implements a unified log-structured mechanism both for active information in memory and backup data on disk. The RAMCloud implementation of log-structured memory uses a two-level cleaning policy, which conserves disk bandwidth and improves performance up to 6x at high memory utilization. The cleaner runs concurrently with normal operations and employs multiple threads to hide most of the cost of cleaning.
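As a rough illustration of the cleaning decision such a system must make, the sketch below scores segments with the classic cost-benefit heuristic from the log-structured file system literature; it is a hypothetical example, not RAMCloud's code, and the Segment fields and scoring formula are assumptions.

# Hypothetical sketch of cost-benefit segment selection for a log cleaner.
# Not RAMCloud's implementation; it illustrates picking segments whose
# reclaimable space best justifies the cost of copying their live data.

from dataclasses import dataclass

@dataclass
class Segment:
    size: int        # total bytes in the segment
    live_bytes: int  # bytes still referenced by the hash table
    age: int         # rough age of the live data (higher = colder)

def cleaning_benefit(seg: Segment) -> float:
    """Cost-benefit score: free space gained, weighted by data age,
    divided by the cost of reading and rewriting the live data."""
    utilization = seg.live_bytes / seg.size
    if utilization >= 1.0:
        return 0.0                       # nothing to reclaim
    return ((1.0 - utilization) * seg.age) / (1.0 + utilization)

def pick_segments_to_clean(segments, count=2):
    """Choose the most profitable segments to compact or clean."""
    return sorted(segments, key=cleaning_benefit, reverse=True)[:count]

if __name__ == "__main__":
    segs = [Segment(8 << 20, 7 << 20, 10),   # hot, mostly live: poor candidate
            Segment(8 << 20, 1 << 20, 50),   # cold, mostly dead: great candidate
            Segment(8 << 20, 4 << 20, 30)]
    for s in pick_segments_to_clean(segs):
        print(f"clean segment with {s.live_bytes} live bytes "
              f"(score {cleaning_benefit(s):.2f})")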

Available Media

Strata: High-Performance Scalable Storage on Virtualized Non-volatile Memory

Brendan Cully, Jake Wires, Dutch Meyer, Kevin Jamieson, Keir Fraser, Tim Deegan, Daniel Stodden, Geoffrey Lefebvre, Daniel Ferstay, and Andrew Warfield, Coho Data

Strata is a commercial storage system designed around the high performance density of PCIe flash storage. We observe a parallel between the challenges introduced by this emerging flash hardware and the problems that were faced with underutilized server hardware about a decade ago. Borrowing ideas from hardware virtualization, we present a novel storage system design that partitions functionality into an address virtualization layer for high-performance network-attached flash and a hosted environment for implementing scalable protocols. Our system targets the storage of virtual machine images for enterprise environments, and we demonstrate dynamic scaling to over a million I/O operations per second using NFSv3 in 13U of rack space, including switching.

Available Media

Evaluating Phase Change Memory for Enterprise Storage Systems: A Study of Caching and Tiering Approaches

Hyojun Kim, Sangeetha Seshadri, Clement L. Dickey, and Lawrence Chiu, IBM Almaden Research Center

Storage systems based on Phase Change Memory (PCM) devices are beginning to generate considerable attention in both industry and academic communities. But whether the technology in its current state will be a commercially and technically viable alternative to entrenched technologies such as flash-based SSDs remains undecided. To address this question, it is important to consider PCM SSD devices not just from a device standpoint, but also from a holistic perspective.

This paper presents the results of our performance study of a recent all-PCM SSD prototype. The average latency for a 4 KiB random read is 6.7 µs, which is about 16x faster than a comparable eMLC flash SSD. The distribution of I/O response times is also much narrower than that of the flash SSD for both reads and writes. Based on the performance measurements and real-world workload traces, we explore two typical storage use cases: tiering and caching. For tiering, we model a hypothetical storage system that consists of flash, HDD, and PCM to identify the combinations of device types that offer the best performance within cost constraints. For caching, we study whether PCM can improve performance compared to flash in terms of aggregate I/O time and read latency. We report that the IOPS/$ of a tiered storage system can be improved by 12–66% and the aggregate elapsed time of a server-side caching solution can be improved by up to 35% by adding PCM.

Our results show that—even at current price points—PCM storage devices show promising performance as a new component in enterprise storage systems.
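To make the IOPS/$ framing concrete, here is a hypothetical back-of-the-envelope comparison of two tier mixes; the device counts, prices, and IOPS figures are invented placeholders, not numbers from the paper.

# Hypothetical IOPS/$ comparison for tiered configurations.
# All prices and IOPS figures are illustrative placeholders, not the paper's data.

def iops_per_dollar(devices):
    """devices: list of (count, iops_each, dollars_each) tuples for one tier mix."""
    total_iops = sum(n * iops for n, iops, _ in devices)
    total_cost = sum(n * cost for n, _, cost in devices)
    return total_iops / total_cost

hdd_flash = [(10, 200, 300), (2, 50_000, 2_000)]                            # HDD + flash tiers
hdd_flash_pcm = [(10, 200, 300), (1, 50_000, 2_000), (1, 150_000, 4_000)]   # add a PCM tier

print(f"HDD+flash     : {iops_per_dollar(hdd_flash):7.1f} IOPS/$")
print(f"HDD+flash+PCM : {iops_per_dollar(hdd_flash_pcm):7.1f} IOPS/$")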

Available Media

10:30 a.m.–11:00 a.m. Tuesday

Break

Grand Ballroom Foyer

11:00 a.m.–12:30 p.m. Tuesday

Flash and SSDs

Grand Ballroom ABGH

Session Chair: Steve Swanson, University of California, San Diego

Wear Unleveling: Improving NAND Flash Lifetime by Balancing Page Endurance

Xavier Jimenez, David Novo, and Paolo Ienne, Ecole Polytechnique Fédérale de Lausanne (EPFL)

Flash memory cells typically undergo a few thousand Program/Erase (P/E) cycles before they wear out. However, the programming strategy of flash devices and process variations cause some flash cells to wear out significantly faster than others. This paper studies this variability on two commercial devices, acknowledges its unavoidability, shows how to identify the weakest cells, and introduces a wear-unbalancing technique that lets the strongest cells relieve the weak ones in order to lengthen the overall lifetime of the device. Our technique periodically skips or relieves the weakest pages whenever a flash block is programmed. Relieving the weakest pages can lead to a lifetime extension of up to 60% for a negligible memory and storage overhead, while minimally affecting (and sometimes improving) the write performance. Future technology nodes will bring larger variance in page endurance, increasing the need for techniques similar to the one proposed in this work.
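A minimal sketch of the core idea, relieving the weakest pages of a block during a programming pass, might look like the following; the Block structure and relief schedule are illustrative assumptions rather than the authors' FTL.

# Hypothetical sketch of "relieving" the weakest pages when a block is programmed.
# Not the authors' FTL; how weak pages are identified and scheduled is assumed.

class Block:
    def __init__(self, pages_per_block, weak_pages):
        self.pages_per_block = pages_per_block
        self.weak_pages = set(weak_pages)   # indices identified as wearing out fastest
        self.relieve = False                # toggled periodically by the FTL

    def writable_pages(self):
        """Yield the page indices to program in this pass; weak pages are
        skipped (relieved) while the relief phase is active."""
        for idx in range(self.pages_per_block):
            if self.relieve and idx in self.weak_pages:
                continue                    # let the weak page rest this P/E cycle
            yield idx

blk = Block(pages_per_block=8, weak_pages=[2, 5])
blk.relieve = True
print("programming pages:", list(blk.writable_pages()))   # pages 2 and 5 are skipped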

Available Media

Lifetime Improvement of NAND Flash-based Storage Systems Using Dynamic Program and Erase Scaling

Jaeyong Jeong and Sangwook Shane Hahn, Seoul National University; Sungjin Lee, MIT/CSAIL; Jihong Kim, Seoul National University

The cost-per-bit of NAND flash memory has been continuously improved by semiconductor process scaling and multi-leveling technologies (e.g., a 10 nm-node TLC device). However, the decreasing lifetime of NAND flash memory as a side effect of recent advanced technologies is regarded as a main barrier to the wide adoption of NAND flash-based storage systems. In this paper, we propose a new system-level approach, called dynamic program and erase scaling (DPES), for improving the lifetime (particularly, endurance) of NAND flash memory. The DPES approach is based on our key observation that changing the erase voltage as well as the erase time significantly affects the NAND endurance. By slowly erasing a NAND block with a lower erase voltage, we can improve the NAND endurance very effectively. By modifying NAND chips to support multiple write and erase modes with different operation voltages and times, DPES enables flash software to exploit the new tradeoff relationships between the NAND endurance and erase voltage/speed under dynamic program and erase scaling. We have implemented the first DPES-aware FTL, called autoFTL, which improves the NAND endurance with a negligible degradation in the overall write throughput. Our experimental results using various I/O traces show that autoFTL can improve the maximum number of P/E cycles by 61.2% over an existing DPES-unaware FTL with less than a 2.2% decrease in the overall write throughput.
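A toy sketch of the kind of policy decision DPES implies, choosing between a slow, low-voltage erase and a fast, high-voltage one based on write pressure, is shown below; the mode table and pressure heuristic are invented for illustration and are not autoFTL's logic.

# Hypothetical sketch of dynamic erase scaling: erase slowly with a lower
# voltage when the workload leaves slack, and fall back to fast, high-voltage
# erases under pressure. Mode names, times, and wear factors are illustrative.

# (mode name, relative erase time, relative wear per erase) -- assumed values
ERASE_MODES = [
    ("slow_low_voltage", 4.0, 0.6),
    ("normal",           1.0, 1.0),
]

def choose_erase_mode(pending_writes, free_blocks):
    """Prefer the gentle mode whenever the free-block pool is not under pressure."""
    under_pressure = pending_writes > free_blocks // 2
    return ERASE_MODES[1] if under_pressure else ERASE_MODES[0]

print(choose_erase_mode(pending_writes=10, free_blocks=100))  # gentle erase
print(choose_erase_mode(pending_writes=90, free_blocks=100))  # fast erase
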
Available Media

ReconFS: A Reconstructable File System on Flash Storage

Youyou Lu, Jiwu Shu, and Wei Wang, Tsinghua University

Hierarchical namespaces (directory trees) in file systems are effective in indexing file system data. However, the update patterns of namespace metadata, such as intensive writeback and scattered small updates, amplify the writes to flash storage dramatically, which hurts both the performance and endurance (i.e., limited program/erase cycles of flash memory) of the storage system.

In this paper, we propose a reconstructable file system, ReconFS, to reduce namespace metadata writeback size while providing hierarchical namespace access. ReconFS decouples the volatile and persistent directory tree maintenance. Hierarchical namespace access is emulated with the volatile directory tree, and the consistency and persistence of the persistent directory tree are provided using two mechanisms in case of system failures. First, consistency is ensured by embedding an inverted index in each page, eliminating the writes of the pointers (indexing for directory tree). Second, persistence is guaranteed by compacting and logging the scattered small updates to the metadata persistence log, so as to reduce write size. The inverted indices and logs are used respectively to reconstruct the structure and the content of the directory tree on reconstruction. Experiments show that ReconFS provides up to 46.3% performance improvement and 27.1% write reduction compared to ext2, a file system with low metadata overhead.
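As a hypothetical illustration of reconstruction from embedded back-pointers, the sketch below rebuilds a directory tree from (inode, parent, name) records scanned off flash; the record format is invented and is not ReconFS's on-flash layout.

# Hypothetical sketch of rebuilding a directory tree from per-page inverted
# indices (child -> parent back-pointers). The scanned record format is
# invented for illustration.

from collections import defaultdict

# Pretend each scanned flash page yields (inode, parent_inode, name).
scanned_pages = [
    (2, 1, "usr"), (3, 2, "bin"), (4, 2, "lib"), (5, 1, "etc"),
]

def rebuild_tree(pages, root=1):
    children = defaultdict(list)
    for inode, parent, name in pages:
        children[parent].append((name, inode))   # invert the back-pointer
    def walk(inode, depth=0):
        for name, child in sorted(children[inode]):
            print("  " * depth + name)
            walk(child, depth + 1)
    walk(root)

rebuild_tree(scanned_pages)   # prints usr/{bin,lib} and etc under the root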

Available Media

12:30 p.m.–2:00 p.m. Tuesday

Conference Luncheon and Awards Presentations

Grand Ballroom CDEF

2:00 p.m.–3:30 p.m. Tuesday

Personal and Mobile

Grand Ballroom ABGH

Session Chair: Jay Lorch, Microsoft Research

Toward Strong, Usable Access Control for Shared Distributed Data

Michelle L. Mazurek, Yuan Liang, William Melicher, Manya Sleeper, Lujo Bauer, Gregory R. Ganger, and Nitin Gupta, Carnegie Mellon University; Michael K. Reiter, University of North Carolina at Chapel Hill

As non-expert users produce increasing amounts of personal digital data, usable access control becomes critical. Current approaches often fail, because they insufficiently protect data or confuse users about policy specification. This paper presents Penumbra, a distributed file system with access control designed to match users’ mental models while providing principled security. Penumbra’s design combines semantic, tag-based policy specification with logic-based access control, flexibly supporting intuitive policies while providing high assurance of correctness. It supports private tags, tag disagreement between users, decentralized policy enforcement, and unforgeable audit records. Penumbra’s logic can express a variety of policies that map well to real users’ needs. To evaluate Penumbra’s design, we develop a set of detailed, realistic case studies drawn from prior research into users’ access-control preferences. Using microbenchmarks and traces generated from the case studies, we demonstrate that Penumbra can enforce users’ policies with overhead less than 5% for most system calls.
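A minimal, hypothetical sketch of tag-based policy evaluation in this spirit is shown below; the policy encoding is an assumption for illustration and is not Penumbra's logic-based engine.

# Hypothetical sketch of tag-based access control: a principal may read a file
# if the file carries tags satisfying one of the policies granted to that
# principal. This is not Penumbra's logic; the encoding is invented.

policies = {
    # principal -> list of tag sets, any one of which grants read access
    "alice": [{"type:photo", "album:family"}],
    "bob":   [{"type:photo", "shared:bob"}, {"type:doc", "project:fast14"}],
}

def may_read(principal, file_tags):
    return any(required <= file_tags for required in policies.get(principal, []))

photo = {"type:photo", "album:family", "camera:raw"}
print(may_read("alice", photo))   # True: tags satisfy alice's family-album policy
print(may_read("bob", photo))     # False: no matching policy for bob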

Available Media

On the Energy Overhead of Mobile Storage Systems

Jing Li, University of California, San Diego; Anirudh Badam and Ranveer Chandra, Microsoft Research; Steven Swanson, University of California, San Diego; Bruce Worthington and Qi Zhang, Microsoft

Secure Digital cards and embedded multimedia cards are pervasively used as secondary storage devices in portable electronics, such as smartphones and tablets. These devices cost under 70 cents per gigabyte. They deliver more than 4,000 random IOPS and 70 MBps of sequential access bandwidth. Additionally, they operate at a peak power lower than 250 milliwatts. However, the software storage stack above the device level on most existing mobile platforms is not optimized to exploit the low-energy characteristics of such devices. This paper examines the energy consumption of the storage stack on mobile platforms.

We conduct several experiments on mobile platforms to analyze the energy requirements of their respective storage stacks. The software storage stack consumes up to 200 times more energy than the storage hardware, and the security and privacy requirements of mobile apps are a major cause. We propose a storage energy model for mobile platforms to help developers optimize the energy requirements of storage-intensive applications. Finally, we propose a few optimizations to reduce the energy consumption of storage systems on these platforms.
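A simple sketch of an energy model of this general shape, summing average power times time across the device and the software layers above it, appears below; the layer names and all power and timing numbers are illustrative placeholders, not the paper's measurements.

# Hypothetical storage-energy estimate: total energy per I/O is the sum, over
# the device and each software layer, of average power times time in that
# layer. All numbers below are placeholders for illustration only.

LAYERS = {                        # (average power in watts, time per I/O in seconds)
    "eMMC device":      (0.25, 0.000_5),
    "block + fs layer": (1.20, 0.001_0),
    "encryption":       (1.50, 0.004_0),   # illustrative; the study finds security features costly
}

def energy_per_io():
    return sum(p * t for p, t in LAYERS.values())    # joules per I/O

def breakdown():
    total = energy_per_io()
    for name, (p, t) in LAYERS.items():
        print(f"{name:18s} {100 * p * t / total:5.1f}% of {total * 1e3:.2f} mJ per I/O")

breakdown()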

Available Media

ViewBox: Integrating Local File Systems with Cloud Storage Services

Yupu Zhang, University of Wisconsin—Madison; Charlotte Dragga, University of Wisconsin—Madison and NetApp, Inc.; Andrea C. Arpaci-Dusseau and Remzi H. Arpaci-Dusseau, University of Wisconsin—Madison

Cloud-based file synchronization services have become enormously popular in recent years, both for their ability to synchronize files across multiple clients and for the automatic cloud backups they provide. However, despite the excellent reliability that the cloud back-end provides, the loose coupling of these services and the local file system makes synchronized data more vulnerable than users might believe. Local corruption may be propagated to the cloud, polluting all copies on other devices, and a crash or untimely shutdown may lead to inconsistency between a local file and its cloud copy. Even without these failures, these services cannot provide causal consistency.

To address these problems, we present ViewBox, an integrated synchronization service and local file system that provides freedom from data corruption and inconsistency. ViewBox detects these problems using ext4-cksum, a modified version of ext4, and recovers from them using a user-level daemon, cloud helper, to fetch correct data from the cloud. To provide a stable basis for recovery, ViewBox employs the view manager on top of ext4-cksum. The view manager creates and exposes views, consistent in-memory snapshots of the file system, which the synchronization client then uploads. Our experiments show that ViewBox detects and recovers from both corruption and inconsistency, while incurring minimal overhead.
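A hypothetical sketch of the recovery path, detecting a checksum mismatch on read and fetching a clean copy from the cloud, is shown below; the data structures are invented and do not reflect ext4-cksum or the cloud helper's actual interfaces.

# Hypothetical sketch of checksum-based detection and cloud-assisted repair.
# The block store, checksum layout, and "cloud" are invented for illustration.

import hashlib

cloud_copy = {0: b"hello world"}                                       # pretend remote copy
local_blocks = {0: (b"hello w0rld",                                    # locally corrupted data
                    hashlib.sha256(b"hello world").digest())}          # stored checksum

def read_block(idx):
    data, stored_sum = local_blocks[idx]
    if hashlib.sha256(data).digest() != stored_sum:    # checksum mismatch -> corruption
        data = cloud_copy[idx]                         # fetch a clean copy from the cloud
        local_blocks[idx] = (data, stored_sum)         # repair the local copy
    return data

print(read_block(0))   # returns the repaired b"hello world"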

Available Media

3:30 p.m.–4:00 p.m. Tuesday

Break

Grand Ballroom Foyer

4:00 p.m.–5:30 p.m. Tuesday

RAID and Erasure Codes

Grand Ballroom ABGH

Session Chair: James Plank, University of Tennessee

CRAID: Online RAID Upgrades Using Dynamic Hot Data Reorganization

Alberto Miranda, Barcelona Supercomputing Center (BSC-CNS); Toni Cortes, Barcelona Supercomputing Center (BSC-CNS) and Technical University of Catalonia (UPC)

Current algorithms used to upgrade RAID arrays typically require large amounts of data to be migrated, even those that move only the minimum amount of data required to keep a balanced data load. This paper presents CRAID, a self-optimizing RAID array that performs an online block reorganization of frequently used, long-term accessed data in order to reduce this migration even further. To achieve this objective, CRAID tracks frequently used, long-term data blocks and copies them to a dedicated partition spread across all the disks in the array. When new disks are added, CRAID only needs to extend this process to the new devices to redistribute this partition, thus greatly reducing the overhead of the upgrade process. In addition, the reorganized access patterns within this partition improve the array’s performance, amortizing the copy overhead and allowing CRAID to offer performance competitive with traditional RAID arrays.

We describe CRAID’s motivation and design and we evaluate it by replaying seven real-world workloads including a file server, a web server and a user share. Our experiments show that CRAID can successfully detect hot data variations and begin using new disks as soon as they are added to the array. Also, the usage of a dedicated partition improves the sequentiality of relevant data access, which amortizes the cost of reorganizations. Finally, we prove that a full-HDD CRAID array with a small distributed partition (<1.28% per disk) can compete in performance with an ideally restriped RAID-5 and a hybrid RAID-5 with a small SSD cache.
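As a rough, hypothetical sketch of the hot-data mechanism, the code below counts block accesses and mirrors blocks that cross a threshold into a small partition spread over all disks; the threshold, placement policy, and data structures are assumptions, not CRAID's implementation.

# Hypothetical sketch of hot-data placement into a small dedicated partition
# striped across every disk, including newly added ones. Thresholds and the
# round-robin placement are illustrative only.

from collections import Counter

class HotPartition:
    def __init__(self, num_disks, capacity_blocks, hot_threshold=5):
        self.num_disks = num_disks
        self.capacity = capacity_blocks
        self.hot_threshold = hot_threshold
        self.counts = Counter()
        self.placement = {}            # block -> disk holding its cached copy

    def on_access(self, block):
        self.counts[block] += 1
        if (block not in self.placement
                and self.counts[block] >= self.hot_threshold
                and len(self.placement) < self.capacity):
            # stripe cached copies round-robin across all current disks
            self.placement[block] = len(self.placement) % self.num_disks
        return self.placement.get(block)   # disk of the cached copy, or None

    def add_disks(self, n):
        self.num_disks += n                # only the small partition is re-spread

part = HotPartition(num_disks=4, capacity_blocks=100)
for _ in range(6):
    part.on_access(42)
print("block 42 cached on disk", part.placement[42])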

Available Media

STAIR Codes: A General Family of Erasure Codes for Tolerating Device and Sector Failures in Practical Storage Systems

Mingqiang Li and Patrick P. C. Lee, The Chinese University of Hong Kong

Practical storage systems often adopt erasure codes to tolerate device failures and sector failures, both of which are prevalent in the field. However, traditional erasure codes employ device-level redundancy to protect against sector failures, and hence incur significant space overhead. Recent sector-disk (SD) codes are available only for limited configurations due to the relatively strict assumption on the coverage of sector failures. By making a relaxed but practical assumption, we construct a general family of erasure codes called STAIR codes, which efficiently and provably tolerate both device and sector failures without any restriction on the size of a storage array and the numbers of tolerable device failures and sector failures. We propose the upstairs encoding and downstairs encoding methods, which provide complementary performance advantages for different configurations. We conduct extensive experiments to justify the practicality of STAIR codes in terms of space saving, encoding/decoding speed, and update cost. We demonstrate that STAIR codes not only improve space efficiency over traditional erasure codes, but also provide better computational efficiency than SD codes based on our special code construction.

Available Media

Parity Logging with Reserved Space: Towards Efficient Updates and Recovery in Erasure-coded Clustered Storage

Jeremy C. W. Chan, Qian Ding, Patrick P. C. Lee, and Helen H. W. Chan, The Chinese University of Hong Kong

Many modern storage systems adopt erasure coding to provide data availability guarantees with low redundancy. Log-based storage is often used to append new data rather than overwrite existing data so as to achieve high update efficiency, but introduces significant I/O overhead during recovery due to reassembling updates from data and parity chunks. We propose parity logging with reserved space, which comprises two key design features: (1) it takes a hybrid of in-place data updates and log-based parity updates to balance the costs of updates and recovery, and (2) it keeps parity updates in a reserved space next to the parity chunk to mitigate disk seeks. We further propose a workload-aware scheme to dynamically predict and adjust the reserved space size. We prototype an erasure-coded clustered storage system called CodFS, and conduct testbed experiments on different update schemes under synthetic and real-world workloads. We show that our proposed update scheme achieves high update and recovery performance, which cannot be simultaneously achieved by pure in-place or log-based update schemes.
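A minimal sketch of the update path, in-place data writes with parity deltas appended to a reserved region and merged when it fills, is given below; the chunk format and merge trigger are assumptions for illustration, not CodFS code.

# Hypothetical sketch of parity logging with reserved space: data chunks are
# updated in place, while parity deltas are appended to a reserved region next
# to the parity chunk and merged later. The layout is invented for illustration.

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

class ParityChunk:
    def __init__(self, parity, reserved_slots=4):
        self.parity = parity
        self.reserved = []                 # log of parity deltas, kept near the parity chunk
        self.reserved_slots = reserved_slots

    def log_update(self, old_data, new_data):
        self.reserved.append(xor(old_data, new_data))   # append delta, no read-modify-write of parity
        if len(self.reserved) == self.reserved_slots:
            self.merge()                   # reclaim reserved space when it fills

    def merge(self):
        for delta in self.reserved:
            self.parity = xor(self.parity, delta)
        self.reserved.clear()

p = ParityChunk(parity=bytes(4))
p.log_update(b"\x00\x00\x00\x00", b"\x01\x02\x03\x04")
p.merge()
print(p.parity)   # parity now reflects the in-place data update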

Available Media

5:30 p.m.–7:30 p.m. Tuesday

Poster Session and Reception I

Grand Ballroom EF

Check out the cool new ideas and the latest preliminary research on display at the Poster Sessions and Receptions. Take part in discussions with your colleagues over complimentary food and drinks.

The list of accepted posters is available here.

 

Wednesday, February 19, 2014

9:00 a.m.–10:00 a.m. Wednesday

Keynote Presentation

Grand Ballroom ABGH

Session Chair: Jason Flinn, University of Michigan

FireBox: A Hardware Building Block for 2020 Warehouse-Scale Computers

Krste Asanović, University of California, Berkeley

The first generation of Warehouse-Scale Computers (WSC) built everything from commercial off-the-shelf (COTS) components: computers, switches, and racks. The second generation, which is being deployed today, uses custom computers, custom switches, and even custom racks, albeit all built using COTS chips. We believe the third generation of WSC in 2020 will be built from custom chips. If WSC architects are free to design custom chips, what should they do differently?


FireBox is a new project at UC Berkeley proposing a new system architecture for these third-generation WSCs. FireBox is a 50 kW WSC building block containing a thousand compute sockets and 100 Petabytes (2^57 B) of non-volatile memory connected via a low-latency, high-bandwidth optical switch. We expect a 2020 WSC to be composed of 200 to 400 FireBoxes instead of 20,000 to 40,000 servers, thereby reducing management overhead. Each compute socket contains a System-on-a-Chip (SoC) with around 100 cores connected to high-bandwidth on-package DRAM. Fast SoC network interfaces reduce the software overhead of communicating between application services, and high-radix network backplane switches connected by Terabit/sec optical fibers reduce the network's contribution to tail latency. The very large non-volatile store directly supports in-memory databases, and pervasive encryption ensures that data is always protected in transit and in storage.

To explore the many design options before building FireBox, we are building on DIABLO-1 (Datacenter-in-a-Box at Low Cost), our prior work simulating a WSC using FPGAs. DIABLO-2 will simulate an entire FireBox, including the fiber-optic network, the switch, the NIC, and 1,000 SoCs, with every core running the full BDAS stack (from the AMP Lab) and the Linux OS, as well as interactive services and batch applications, with only a 1,000x slowdown relative to real time.

Krste Asanović received a B.A. in Electrical and Information Sciences from Cambridge University in 1987 and a Ph.D. in Computer Science from U.C. Berkeley in 1998. He was an Assistant and Associate Professor of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology, Cambridge, from 1998 to 2007. In 2007, he joined the faculty at the University of California, Berkeley, where he co-founded the Berkeley Parallel Computing Laboratory. He is currently a Professor of Electrical Engineering and Computer Sciences and Director of the Berkeley ASPIRE Laboratory, which is developing new techniques to increase computing efficiency above the transistor level. He is an IEEE Fellow and an ACM Distinguished Scientist.

Available Media

10:00 a.m.–10:30 a.m. Wednesday

Break

Grand Ballroom Foyer

10:30 a.m.–12:20 p.m. Wednesday

Experience from Real Systems

Grand Ballroom ABGH

Session Chair: Angela Demke Brown, University of Toronto

(Big)Data in a Virtualized World: Volume, Velocity, and Variety in Cloud Datacenters

Robert Birke, Mathias Bjoerkqvist, and Lydia Y. Chen, IBM Research Zurich Lab; Evgenia Smirni, College of William and Mary; Ton Engbersen, IBM Research Zurich Lab

Virtualization is the ubiquitous way to provide computation and storage services to datacenter end-users. Guaranteeing sufficient data storage and efficient data access is central to all datacenter operations, yet little is known of the effects of virtualization on storage workloads. In this study, we collect and analyze field data from production datacenters that operate within the private cloud paradigm, during a period of three years. The datacenters of our study consist of 8,000 physical boxes, hosting over 90,000 VMs, which in turn use over 22 PB of storage. Storage data is analyzed from the perspectives of volume, velocity, and variety of storage demands on virtual machines and of their dependency on other resources. In addition to the growth rate and churn rate of allocated and used storage volume, the trace data illustrates the impact of virtualization and consolidation on the velocity of IO reads and writes, including IO deduplication ratios and peak load analysis of co-located VMs. We focus on a variety of applications which are roughly classified as app, web, database, file, mail, and print, and correlate their storage and IO demands with CPU, memory, and network usage. This study provides critical storage workload characterization by showing usage trends and how application types create storage traffic in large datacenters.

Available Media

From Research to Practice: Experiences Engineering a Production Metadata Database for a Scale Out File System

Charles Johnson, Kimberly Keeton, and Charles B. Morrey III, HP Labs; Craig A. N. Soules, Natero; Alistair Veitch, Google; Stephen Bacon, Oskar Batuner, Marcelo Condotta, Hamilton Coutinho, Patrick J. Doyle, Rafael Eichelberger, Hugo Kiehl, Guilherme Magalhaes, James McEvoy, Padmanabhan Nagarajan, Patrick Osborne, Joaquim Souza, Andy Sparkes, Mike Spitzer, Sebastien Tandel, Lincoln Thomas, and Sebastian Zangaro, HP Storage

HP’s StoreAll with Express Query is a scalable commercial file archiving product that offers sophisticated file metadata management and search capabilities. A new REST API enables fast, efficient searches for all files that meet a given set of metadata criteria, as well as the ability to tag files with custom metadata fields. The product brings together two significant systems: a scale-out file system and a metadata database based on LazyBase. In designing and building the combined product, we identified several real-world issues in using a pipelined database system in a distributed environment, and overcame several interesting design challenges that were not contemplated by the original research prototype. This paper highlights our experiences.

Available Media

Analysis of HDFS Under HBase: A Facebook Messages Case Study

Tyler Harter, University of Wisconsin—Madison; Dhruba Borthakur, Siying Dong, Amitanand Aiyer, and Liyin Tang, Facebook Inc.; Andrea C. Arpaci-Dusseau and Remzi H. Arpaci-Dusseau, University of Wisconsin—Madison

We present a multilayer study of the Facebook Messages stack, which is based on HBase and HDFS. We collect and analyze HDFS traces to identify potential improvements, which we then evaluate via simulation. Messages represents a new HDFS workload: whereas HDFS was built to store very large files and receive mostly sequential I/O, 90% of files are smaller than 15 MB and I/O is highly random. We find hot data is too large to easily fit in RAM and cold data is too large to easily fit in flash; however, cost simulations show that adding a small flash tier improves performance more than equivalent spending on RAM or disks. HBase’s layered design offers simplicity, but at the cost of performance; our simulations show that network I/O can be halved if compaction bypasses the replication layer. Finally, although Messages is read-dominated, several features of the stack (i.e., logging, compaction, replication, and caching) amplify write I/O, causing writes to dominate disk I/O.

Available Media

Automatic Identification of Application I/O Signatures from Noisy Server-Side Traces

Yang Liu, North Carolina State University; Raghul Gunasekaran, Oak Ridge National Laboratory; Xiaosong Ma, Qatar Computing Research Institute and North Carolina State University; Sudharshan S. Vazhkudai, Oak Ridge National Laboratory

Competing workloads on a shared storage system cause I/O resource contention and application performance vagaries. This problem is already evident in today’s HPC storage systems and is likely to become acute at exascale. We need more interaction between application I/O requirements and system software tools to help alleviate the I/O bottleneck, moving towards I/O-aware job scheduling. However, this requires rich techniques to capture application I/O characteristics, which remain elusive in production systems.

Traditionally, I/O characteristics have been obtained using client-side tracing tools, with drawbacks such as non-trivial instrumentation/development costs, large trace traffic, and inconsistent adoption. We present a novel approach, I/O Signature Identifier (IOSI), to characterize the I/O behavior of data-intensive applications. IOSI extracts signatures from noisy, zero-overhead server-side I/O throughput logs that are already collected on today’s supercomputers, without interfering with the compiling/execution of applications. We evaluated IOSI using the Spider storage system at Oak Ridge National Laboratory, the S3D turbulence application (running on 18,000 Titan nodes), and benchmark-based pseudo-applications. Through our experiments we confirmed that IOSI effectively extracts an application’s I/O signature despite significant server-side noise. Compared to client-side tracing tools, IOSI is transparent, interface-agnostic, and incurs no overhead. Compared to alternative data alignment techniques (e.g., dynamic time warping), it offers higher signature accuracy and shorter processing time.

Available Media

12:20 p.m.–2:00 p.m. Wednesday

Lunch, on your own

2:00 p.m.–3:30 p.m. Wednesday

Work-in-Progress Reports (WiP)

Session Chairs: Bianca Schroeder, University of Toronto, and Eno Thereska, Microsoft Research

The list of accepted Work-in-Progress reports is available here.

Available Media

3:30 p.m.–4:00 p.m. Wednesday

Break

Grand Ballroom Foyer

4:00 p.m.–5:30 p.m. Wednesday

Performance and Efficiency

Grand Ballroom ABGH

Session Chair: Erez Zadok, Stony Brook University

Balancing Fairness and Efficiency in Tiered Storage Systems with Bottleneck-Aware Allocation

Hui Wang and Peter Varman, Rice University

Multi-tiered storage made up of heterogeneous devices raises new challenges in allocating throughput fairly among concurrent clients. The fundamental problem is finding an appropriate balance between fairness to the clients and maximizing system utilization.

In this paper we cast the problem within the broader framework of fair allocation for multiple resources. We present a new allocation model, BAA, based on the notion of per-device bottleneck sets. Clients bottlenecked on the same device receive throughputs in proportion to their fair shares, while allocation ratios between clients in different bottleneck sets are chosen to maximize system utilization. We show formally that BAA satisfies the fairness properties of Envy Freedom and Sharing Incentive. We evaluated the performance of our method using both simulation and an implementation on a Linux platform. The experimental results show that our method provides both high efficiency and fairness.
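A toy sketch of the within-bottleneck-set part of such an allocation, splitting each device's capacity among the clients bottlenecked on it in proportion to their weights, is shown below; it omits BAA's cross-set optimization, and the capacities and weights are invented.

# Hypothetical sketch of proportional allocation within per-device bottleneck
# sets. Cross-set ratios, which BAA chooses to maximize utilization, are not
# modeled here; all capacities and weights are illustrative.

def allocate(clients, capacity):
    """clients: {name: (bottleneck_device, weight)}; capacity: {device: IOPS}."""
    by_device = {}
    for name, (dev, w) in clients.items():
        by_device.setdefault(dev, []).append((name, w))
    alloc = {}
    for dev, members in by_device.items():
        total_w = sum(w for _, w in members)
        for name, w in members:
            alloc[name] = capacity[dev] * w / total_w   # proportional within the set
    return alloc

clients = {"c1": ("ssd", 1), "c2": ("ssd", 2), "c3": ("hdd", 1)}
print(allocate(clients, {"ssd": 30_000, "hdd": 300}))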

Available Media

SpringFS: Bridging Agility and Performance in Elastic Distributed Storage

Lianghong Xu, James Cipar, Elie Krevat, Alexey Tumanov, and Nitin Gupta, Carnegie Mellon University; Michael A. Kozuch, Intel Labs; Gregory R. Ganger, Carnegie Mellon University

Elastic storage systems can be expanded or contracted to meet current demand, allowing servers to be turned off or used for other tasks. However, the usefulness of an elastic distributed storage system is limited by its agility: how quickly it can increase or decrease its number of servers. Due to the large amount of data they must migrate during elastic resizing, state-of-the-art designs usually have to make painful tradeoffs among performance, elasticity and agility.

This paper describes an elastic storage system, called SpringFS, that can quickly change its number of active servers, while retaining elasticity and performance goals. SpringFS uses a novel technique, termed bounded write offloading, that restricts the set of servers where writes to overloaded servers are redirected. This technique, combined with the read offloading and passive migration policies used in SpringFS, minimizes the work needed before deactivation or activation of servers. Analysis of real-world traces from Hadoop deployments at Facebook and various Cloudera customers and experiments with the SpringFS prototype confirm SpringFS’s agility, show that it reduces the amount of data migrated for elastic resizing by up to two orders of magnitude, and show that it cuts the percentage of active servers required by 67–82%, outdoing state-of-the-art designs by 6–120%.

Available Media

Migratory Compression: Coarse-grained Data Reordering to Improve Compressibility

Xing Lin, University of Utah; Guanlin Lu, Fred Douglis, Philip Shilane, and Grant Wallace, EMC Corporation—Data Protection and Availability Division

We propose Migratory Compression (MC), a coarse-grained data transformation, to improve the effectiveness of traditional compressors in modern storage systems. In MC, similar data chunks are relocated together to improve compression factors. After decompression, migrated chunks return to their previous locations. We evaluate the compression effectiveness and overhead of MC, explore reorganization approaches on a variety of datasets, and present a prototype implementation of MC in a commercial deduplicating file system. We also compare MC to the more established technique of delta compression, which is significantly more complex to implement within file systems.

We find that Migratory Compression improves compression effectiveness compared to traditional compressors, by 11% to 105%, with relatively low impact on runtime performance. Frequently, adding MC to a relatively fast compressor like gzip results in compression that is more effective in both space and runtime than slower alternatives. In archival migration, MC improves gzip compression by 44–157%. Most importantly, MC can be implemented in broadly used, modern file systems.
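A small, hypothetical sketch of the reorder-then-compress idea follows; the similarity "sketch" function is a crude stand-in for the super-feature techniques such systems actually use, and the recipe format is invented.

# Hypothetical sketch of migratory compression: group similar chunks before
# handing them to a standard compressor, and keep a recipe so chunks can be
# migrated back to their original order after decompression.

import zlib

def sketch(chunk: bytes) -> int:
    return max(chunk) if chunk else 0          # toy similarity feature, not a real super-feature

def migratory_compress(chunks):
    order = sorted(range(len(chunks)), key=lambda i: sketch(chunks[i]))
    blob = b"".join(chunks[i] for i in order)  # similar chunks are now adjacent
    return zlib.compress(blob), order, [len(c) for c in chunks]

def migratory_decompress(blob, order, sizes):
    data = zlib.decompress(blob)
    out, pos = [None] * len(order), 0
    for i in order:                            # migrate chunks back to their homes
        out[i], pos = data[pos:pos + sizes[i]], pos + sizes[i]
    return out

chunks = [b"a" * 64, b"z" * 64, b"a" * 64, b"z" * 64]
blob, order, sizes = migratory_compress(chunks)
assert migratory_decompress(blob, order, sizes) == chunks
print(len(blob), "compressed bytes")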

Available Media

5:30 p.m.–7:30 p.m. Wednesday

Poster Session and Reception II

Grand Ballroom EF

Check out the cool new ideas and the latest preliminary research on display at the Poster Sessions and Receptions. Take part in discussions with your colleagues over complimentary food and drinks.

The list of accepted posters is available here.

 

Thursday, February 20, 2014

9:00 a.m.–10:20 a.m. Thursday

OS and Storage Interactions

Grand Ballroom ABGH

Session Chair: Raju Rangaswami, Florida International University

Resolving Journaling of Journal Anomaly in Android I/O: Multi-Version B-tree with Lazy Split

Wook-Hee Kim and Beomseok Nam, Ulsan National Institute of Science and Technology; Dongil Park and Youjip Won, Hanyang University

Misaligned interaction between SQLite and EXT4 in the Android I/O stack yields excessive random writes. In this work, we developed a multi-version B-tree with lazy split (LS-MVBT) to effectively address the Journaling of Journal anomaly in Android I/O. LS-MVBT is carefully crafted to minimize the write traffic caused by the fsync() calls of SQLite. The contribution of LS-MVBT consists of two key elements: (i) the multi-version B-tree effectively reduces the number of fsync() calls by weaving the crash recovery information into the database itself instead of maintaining a separate file, and (ii) it significantly reduces the number of dirty pages to be synchronized in a single fsync() call by optimizing the multi-version B-tree for Android I/O. The optimization of the multi-version B-tree consists of three elements: lazy split, metadata embedding, and disabling sibling redistribution. We implemented LS-MVBT on a Samsung Galaxy S4 with Android 4.3 Jelly Bean. The results are impressive: for SQLite, LS-MVBT exhibits performance improvements of 70% over WAL mode (704 insertions/sec vs. 416 insertions/sec) and 1,220% over TRUNCATE mode (704 insertions/sec vs. 55 insertions/sec).

Available Media

Journaling of Journal Is (Almost) Free

Kai Shen, Stan Park, and Meng Zhu, University of Rochester

Lightweight databases and key-value stores manage the consistency and reliability of their own data, often through rollback-recovery journaling or write-ahead logging. They further rely on file system journaling to protect the file system structure and metadata. Such journaling of journal appears to violate the classic end-to-end argument for optimal database design. In practice, we observe a significant cost (up to 73% slowdown) from adding Ext4 file system journaling to the SQLite database on a Google Nexus 7 tablet running a Ubuntu Linux installation. The cost of file system journaling is up to 58% on a conventional machine with an Intel 311 SSD.

In this paper, we argue that such cost is largely due to implementation limitations of the existing system. We apply two simple techniques—ensuring a single I/O operation on the synchronous commit path, and adaptively allowing each file to have a custom journaling mode (in particular, whether to journal the file data in addition to the metadata). Compared to SQLite without file system journaling, our enhanced journaling improves the performance or incurs minor (<6%) slowdown on all but one of our 24 test cases (with 14% slowdown in the exceptional case). On average, our enhanced journaling implementation improves the SQLite performance by 7%.

Available Media

Checking the Integrity of Transactional Mechanisms

Daniel Fryer, Dai Qin, Jack Sun, Kah Wai Lee, Angela Demke Brown, and Ashvin Goel, University of Toronto

Data corruption is the most common consequence of filesystem bugs, as shown by a recent study. When such corruption occurs, the file system’s offline check and recovery tools need to be used, but they are error prone and cause significant downtime. Previous work has shown that a runtime checker for the Ext3 journaling file system can verify that metadata updates within a transaction are mutually consistent, helping detect corruption in metadata blocks at commit time. However, corruption can still be caused when a bug in the file system’s transactional mechanism loses, misdirects, or corrupts writes. We show that a runtime checker needs to enforce the atomicity and durability properties of the file system on every write, in addition to checking transactions at commit time, to provide the strong guarantee that every block write will maintain file system consistency.

In this paper, we identify the invariants that need to be enforced on journaling and shadow paging file systems to preserve the integrity of committed transactions. We also describe the key properties that make it feasible to check these invariants for a file system. Based on this characterization, we have implemented runtime checkers for a modified version of the Ext3 file system and for the Btrfs file system. Our evaluation shows that both checkers detect data corruption effectively, and they can be used during normal operation with low overhead.

Available Media

10:20 a.m.–10:50 a.m. Thursday

Break

Grand Ballroom Foyer

10:50 a.m.–11:40 a.m. Thursday

OS and Peripherals

Grand Ballroom ABGH

Session Chair: Tom Talpey, Microsoft

DC Express: Shortest Latency Protocol for Reading Phase Change Memory over PCI Express

Dejan Vučinić, Qingbo Wang, Cyril Guyot, Robert Mateescu, Filip Blagojević, Luiz Franca-Neto, and Damien Le Moal, HGST San Jose Research Center; Trevor Bunker, Jian Xu, and Steven Swanson, University of California, San Diego; Zvonimir Bandić, HGST San Jose Research Center

Phase Change Memory (PCM) presents an architectural challenge: writing to it is slow enough to make attaching it to a CPU’s main memory controller impractical, yet reading from it is so fast that using it in a peripheral storage device would leave much of its performance potential untapped at low command queue depths, throttled by the high latencies of the common peripheral buses and existing device protocols.

Here we explore the limits of communication latency with a PCM-based storage device over PCI Express. We devised a communication protocol, dubbed DC Express, where the device continuously polls read command queues in host memory without waiting for host-driven initiation, and completion signals are eliminated in favor of a novel completion detection procedure that marks receive buffers in host memory with incomplete tags and monitors their disappearance. By eliminating superfluous PCI Express packets and context switches in this manner we are able to exceed 700,000 IOPS on small random reads at queue depth 1.
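A toy sketch of the completion-detection idea, pre-filling a receive buffer with an incomplete tag and polling for the tag to be overwritten, is shown below; the device is simulated with a thread, and real designs must also handle data that happens to match the tag.

# Hypothetical sketch of completion detection by tag disappearance: the host
# pre-fills the receive buffer with a sentinel ("incomplete tag") and polls for
# it to be overwritten by the device's DMA, avoiding interrupts and completion
# packets. The "device" here is just a thread for illustration.

import threading, time

INCOMPLETE = b"\xde\xad"                      # sentinel marking "not yet written"
buf = bytearray(INCOMPLETE + bytes(510))      # receive buffer, tag in the first bytes

def fake_device_dma():
    time.sleep(0.001)                         # pretend device read latency
    buf[0:4] = b"DATA"                        # DMA overwrites the tag with real data

threading.Thread(target=fake_device_dma).start()

while buf[0:2] == INCOMPLETE:                 # host polls for the tag to disappear
    pass                                      # no interrupt, no completion entry

print("read complete:", bytes(buf[0:4]))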

Available Media

MultiLanes: Providing Virtualized Storage for OS-level Virtualization on Many Cores

Junbin Kang, Benlong Zhang, Tianyu Wo, Chunming Hu, and Jinpeng Huai, Beihang University

OS-level virtualization is an efficient method for server consolidation. However, the sharing of kernel services among the co-located virtualized environments (VEs) causes them to interfere with one another. In particular, interference within the shared I/O stack can lead to severe performance degradation on many-core platforms incorporating fast storage technologies (e.g., non-volatile memories).

This paper presents MultiLanes, a virtualized storage system for OS-level virtualization on many cores. MultiLanes builds an isolated I/O stack on top of a virtualized storage device for each VE to eliminate contention on kernel data structures and locks between them, thus scaling them to many cores. Moreover, the overhead of storage device virtualization is tuned to be negligible so that MultiLanes can deliver competitive performance against Linux. Apart from scalability, MultiLanes also delivers flexibility and security to all the VEs, as the virtualized storage device allows each VE to run its own guest file system.

The evaluation of our prototype system, built for Linux Containers (LXC) on a 16-core machine with a RAM disk, demonstrates that MultiLanes outperforms Linux by up to 11.32x and 11.75x in micro- and macro-benchmarks, respectively, and exhibits nearly linear scalability.

Available Media