Technical Sessions

All sessions will be held in Grand Ballroom ABCFGH unless otherwise noted.

The full Proceedings published by USENIX for the conference are available for download below. Individual papers can also be downloaded from the presentation page. Copyright to the individual works is retained by the author[s].

Proceedings Front Matter 
Cover Page | Title Page and List of Organizers | Table of Contents | Message from the Program Co-Chairs

Full Proceedings PDFs
 FAST '16 Full Proceedings (PDF)
 FAST '16 Proceedings Interior (PDF, best for mobile devices)

Full Proceedings ePub (for iPad and most eReaders)
 FAST '16 Full Proceedings (ePub)

Full Proceedings Mobi (for Kindle)
 FAST '16 Full Proceedings (Mobi)

Download Proceedings and Attendee List (Conference Attendees Only)

Attendee Files 
FAST '16 Proceedings Archive (ZIP, includes attendee list)
FAST '16 Attendee List (PDF)
FAST '16 Attendee List (PDF, updated 2.24.16)

 

Tuesday, February 23, 2016

8:00 am–9:00 am Tuesday

Continental Breakfast

9:00 am–9:15 am Tuesday

Welcome to the Fishbowl

Opening Remarks and Awards

Program Co-Chairs: Angela Demke Brown, University of Toronto, and Florentina Popovici, Google

9:15 am–10:30 am Tuesday

Keynote Address

Spinning Disks and Their Cloudy Future

Eric Brewer, VP Infrastructure at Google

Modern hard disks are a miracle of engineering, worthy of celebration. However, solid-state disks already own mobile and performance use cases and will leave spinning disks only the role of storing large amounts of mostly cold data. Fortunately for disk vendors, this is an area with tremendous growth, primarily due to video. Such use implies multiple copies in different locations and professional maintenance, and thus drives those bytes to the Cloud. Yet, current disks were designed for enterprise servers, not for their coming dominant use as Cloud-based storage. This talk reviews this shift and provides a wish list for "Cloud disks"—the next, largest, and perhaps final market for spinning drives.

Available Media
10:30 am–11:00 am Tuesday

Break with Refreshments

11:00 am–12:30 pm Tuesday

The Blueprint: File and Storage System Designs

Session Chair: Geoff Kuenning, Harvey Mudd College

Optimizing Every Operation in a Write-optimized File System

Jun Yuan, Yang Zhan, William Jannen, Prashant Pandey, Amogh Akshintala, Kanchan Chandnani, and Pooja Deo, Stony Brook University; Zardosht Kasheff, Facebook; Leif Walsh, Two Sigma; Michael A. Bender, Stony Brook University; Martin Farach-Colton, Rutgers University; Rob Johnson, Stony Brook University; Bradley C. Kuszmaul, Massachusetts Institute of Technology; Donald E. Porter, Stony Brook University

Awarded Best Paper!

File systems that employ write-optimized dictionaries (WODs) can perform random-writes, metadata updates, and recursive directory traversals orders of magnitude faster than conventional file systems. However, previous WOD-based file systems have not obtained all of these performance gains without sacrificing performance on other operations, such as file deletion, file or directory renaming, or sequential writes.

Using three techniques, late-binding journaling, zoning, and range deletion, we show that there is no fundamental trade-off in write-optimization. These dramatic improvements can be retained while matching conventional file systems on all other operations.

BetrFS 0.2 delivers order-of-magnitude better performance than conventional file systems on directory scans and small random writes and matches the performance of conventional file systems on rename, delete, and sequential I/O. For example, BetrFS 0.2 performs directory scans 2.2x faster, and small random writes over two orders of magnitude faster, than the fastest conventional file system. But unlike BetrFS 0.1, it renames and deletes files commensurate with conventional file systems and performs large sequential I/O at nearly disk bandwidth. The performance benefits of these techniques extend to applications as well. BetrFS 0.2 continues to outperform conventional file systems on many applications, such as rsync, git-diff, and tar, but improves git-clone performance by 35% over BetrFS 0.1, yielding performance comparable to other file systems.

Available Media

The Composite-file File System: Decoupling the One-to-One Mapping of Files and Metadata for Better Performance

Shuanglong Zhang, Helen Catanese, and An-I Andy Wang, Florida State University

Traditional file system optimizations typically retain the one-to-one mapping of logical files to their physical metadata representations. This rigid mapping results in missed opportunities for an entire class of optimizations in which such coupling is removed.

We have designed, implemented, and evaluated a composite-file file system, which allows many-to-one mappings of files to metadata, and we have explored the design space of different mapping strategies. Under webserver and software development workloads, our empirical evaluation shows up to a 27% performance improvement. This result demonstrates the promise of decoupling files and their metadata.

Available Media

Isotope: Transactional Isolation for Block Storage

Ji-Yong Shin, Cornell University; Mahesh Balakrishnan, Yale University; Tudor Marian, Google; Hakim Weatherspoon, Cornell University

Existing storage stacks are top-heavy and expect little from block storage. As a result, new high-level storage abstractions—and new designs for existing abstractions—are difficult to realize, requiring developers to implement from scratch complex functionality such as failure atomicity and fine-grained concurrency control. In this paper, we argue that pushing transactional isolation into the block store (in addition to atomicity and durability) is both viable and broadly useful, resulting in simpler high-level storage systems that provide strong semantics without sacrificing performance. We present Isotope, a new block store that supports ACID transactions over block reads and writes. Internally, Isotope uses a new multi-version concurrency control protocol that exploits fine-grained, sub-block parallelism in workloads and offers both strict serializability and snapshot isolation guarantees. We implemented several high-level storage systems over Isotope, including two key-value stores that implement the LevelDB API over a hashtable and B-tree, respectively, and a POSIX filesystem. We show that Isotope’s block-level transactions enable systems that are simple (100s of lines of code), robust (i.e., providing ACID guarantees), and fast (e.g., 415 MB/s for random file writes). We also show that these systems can be composed using Isotope, providing applications with transactions across different high-level constructs such as files, directories and key-value pairs.
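
As a rough illustration of the kind of transactional block interface the paper argues for, here is a toy optimistic-concurrency block store in Python; the begin_tx/read/write/end_tx names and the validation scheme are hypothetical stand-ins, not Isotope's actual API or MVCC protocol.

    import threading

    class ToyTxBlockStore:
        """Toy block store with optimistic transactions; illustrative only."""
        def __init__(self, nblocks, block_size=4096):
            self.blocks = [bytes(block_size) for _ in range(nblocks)]
            self.lock = threading.Lock()
            self.tx = threading.local()

        def begin_tx(self):
            self.tx.reads, self.tx.writes = {}, {}

        def read(self, addr):
            if addr in self.tx.writes:            # read-your-own-writes
                return self.tx.writes[addr]
            data = self.blocks[addr]
            self.tx.reads[addr] = data            # remember what we saw, for validation
            return data

        def write(self, addr, data):
            self.tx.writes[addr] = data           # buffer until commit

        def end_tx(self):
            with self.lock:                       # validate read set, then apply atomically
                for addr, seen in self.tx.reads.items():
                    if self.blocks[addr] is not seen:
                        return False              # conflicting committed write: abort
                for addr, data in self.tx.writes.items():
                    self.blocks[addr] = data
                return True

    store = ToyTxBlockStore(nblocks=8)
    store.begin_tx()
    store.write(0, b"meta".ljust(4096, b"\0"))
    store.write(1, b"data".ljust(4096, b"\0"))
    assert store.end_tx()                         # commits if no conflict was detected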

Available Media

BTrDB: Optimizing Storage System Design for Timeseries Processing

Michael P Andersen and David E. Culler, University of California, Berkeley

The increase in high-precision, high-sample-rate telemetry timeseries poses a problem for existing timeseries databases which can neither cope with the throughput demands of these streams nor provide the necessary primitives for effective analysis of them. We present a novel abstraction for telemetry timeseries data and a data structure for providing this abstraction: a time-partitioning version-annotated copy-on-write tree. An implementation in Go is shown to outperform existing solutions, demonstrating a throughput of 53 million inserted values per second and 119 million queried values per second on a four-node cluster. The system achieves a 2.9x compression ratio and satisfies statistical queries spanning a year of data in under 200ms, as demonstrated on a year-long production deployment storing 2.1 trillion data points. The principles and design of this database are generally applicable to a large variety of timeseries types and represent a significant advance in the development of technology for the Internet of Things.

Available Media
12:30 pm–2:00 pm Tuesday

Conference Luncheon

2:00 pm–3:15 pm Tuesday

Emotional Rescue: Reliability

Session Chair: Haryadi Gunawi, University of Chicago

Environmental Conditions and Disk Reliability in Free-cooled Datacenters

Ioannis Manousakis, Rutgers University; Sriram Sankar, GoDaddy; Gregg McKnight, Microsoft; Thu D. Nguyen, Rutgers University; Ricardo Bianchini, Microsoft
Awarded Best Paper!

Free cooling lowers datacenter costs significantly, but may also expose servers to higher and more variable temperatures and relative humidities. It is currently unclear whether these environmental conditions have a significant impact on hardware component reliability. Thus, in this paper, we use data from nine hyperscale datacenters to study the impact of environmental conditions on the reliability of server hardware, with a particular focus on disk drives and free cooling. Based on this study, we derive and validate a new model of disk lifetime as a function of environmental conditions. Furthermore, we quantify the tradeoffs between energy consumption, environmental conditions, component reliability, and datacenter costs. Finally, based on our analyses and model, we derive server and datacenter design lessons.

We draw many interesting observations, including (1) relative humidity seems to have a dominant impact on component failures; (2) disk failures increase significantly when operating at high relative humidity, due to controller/adaptor malfunction; and (3) though higher relative humidity increases component failures, software availability techniques can mask them and enable free-cooled operation, resulting in significantly lower infrastructure and energy costs that far outweigh the cost of the extra component failures.

Available Media

Flash Reliability in Production: The Expected and the Unexpected

Bianca Schroeder, University of Toronto; Raghav Lagisetty and Arif Merchant, Google, Inc.

As solid state drives based on flash technology are becoming a staple for persistent data storage in data centers, it is important to understand their reliability characteristics. While there is a large body of work based on experiments with individual flash chips in a controlled lab environment under synthetic workloads, there is a dearth of information on their behavior in the field. This paper provides a large-scale field study covering many millions of drive days, ten different drive models, different flash technologies (MLC, eMLC, SLC) over 6 years of production use in Google’s data centers. We study a wide range of reliability characteristics and come to a number of unexpected conclusions. For example, raw bit error rates (RBER) grow at a much slower rate with wear-out than the exponential rate commonly assumed and, more importantly, they are not predictive of uncorrectable errors or other error modes. The widely used metric UBER (uncorrectable bit error rate) is not a meaningful metric, since we see no correlation between the number of reads and the number of uncorrectable errors. We see no evidence that higher-end SLC drives are more reliable than MLC drives within typical drive lifetimes. Comparing with traditional hard disk drives, flash drives have a significantly lower replacement rate in the field, however, they have a higher rate of uncorrectable errors.

Available Media

Opening the Chrysalis: On the Real Repair Performance of MSR Codes

Lluis Pamies-Juarez, Filip Blagojevic, Robert Mateescu, and Cyril Gyuot, WD Research; Eyal En Gad, University of Southern California; Zvonimir Bandic, WD Research

Large distributed storage systems use erasure codes to reliably store data. Compared to replication, erasure codes are capable of reducing storage overhead. However, repairing lost data in an erasure coded system requires reading from many storage devices and transferring over the network large amounts of data. Theoretically, Minimum Storage Regenerating (MSR) codes can significantly reduce this repair burden. Although several explicit MSR code constructions exist, they have not been implemented in real-world distributed storage systems. We close this gap by providing a performance analysis of Butterfly codes, systematic MSR codes with optimal repair I/O. Due to the complexity of modern distributed systems, a straightforward approach does not exist when it comes to implementing MSR codes. Instead, we show that achieving good performance requires to vertically integrate the code with multiple system layers. The encoding approach, the type of inter-node communication, the interaction between different distributed system layers, and even the programming language have a significant impact on the code repair performance. We show that with new distributed system features, and careful implementation, we can achieve the theoretically expected repair performance of MSR codes.

Available Media
3:15 pm–3:45 pm Tuesday

Break with Refreshments

3:45 pm–4:50 pm Tuesday

They Said It Couldn't Be Done: Writing to Flash

Session Chair: Hakim Weatherspoon, Cornell University

The Devil Is in the Details: Implementing Flash Page Reuse with WOM Codes

Fabio Margaglia, Johannes Gutenberg—Universität; Gala Yadgar and Eitan Yaakobi, Technion—Israel Institute of Technology; Yue Li, California Institute of Technology; Assaf Schuster, Technion—Israel Institute of Technology; André Brinkmann, Johannes Gutenberg—Universität

Flash memory is prevalent in modern servers and devices. Coupled with the scaling down of flash technology, the popularity of flash memory motivates the search for methods to increase flash reliability and lifetime. Erasures are the dominant cause of flash cell wear, but reducing them is challenging because flash is a write-once medium—memory cells must be erased prior to writing.

An approach that has recently received considerable attention relies on write-once memory (WOM) codes, designed to accommodate additional writes on write-once media. However, the techniques proposed for reusing flash pages with WOM codes are limited in their scope. Many focus on the coding theory alone, while others suggest FTL designs that are application specific, or not applicable due to their complexity, overheads, or specific constraints of MLC flash.

This work is the first that addresses all aspects of page reuse within an end-to-end implementation of a general-purpose FTL on MLC flash. We use our hardware implementation to directly measure the short and long-term effects of page reuse on SSD durability, I/O performance and energy consumption, and show that FTL design must explicitly take them into account.

Available Media

Reducing Solid-State Storage Device Write Stress through Opportunistic In-place Delta Compression

Xuebin Zhang, Jiangpeng Li, and Hao Wang, Rensselaer Polytechnic Institute; Kai Zhao, SanDisk Corporation; Tong Zhang, Rensselaer Polytechnic Institute

Inside modern SSDs, a small portion of MLC/TLC NAND flash memory blocks operate in SLC-mode to serve as write buffer/cache and/or store hot data. These SLC-mode blocks absorb a large percentage of write operations. To balance memory wear-out, such MLC/TLC-to-SLC configuration rotates among all the memory blocks inside SSDs. This paper presents a simple yet effective design approach to reduce write stress on SLC-mode flash blocks and hence improve the overall SSD lifetime. The key is to implement well-known delta compression without being subject to the read latency and data management complexity penalties inherent to conventional practice. The underlying theme is to leverage the partial programmability of SLC-mode flash memory pages to ensure that the original data and all the subsequent deltas always reside in the same memory physical page. To avoid the storage capacity overhead, we further propose to combine intra-sector lossless data compression with intra-page delta compression, leading to opportunistic in-place delta compression. This paper presents specific techniques to address important issues for its practical implementation, including data error correction, and intra-page data placement and management. We carried out comprehensive experiments, simulations, and ASIC (application-specific integrated circuit) design. The results show that the proposed design solution can largely reduce the write stress on SLC-mode flash memory pages without significant latency overhead and meanwhile incurs relatively small silicon implementation cost.
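
To make the delta path concrete, the sketch below XORs the old and new versions of a sector and compresses the mostly-zero difference with zlib; it is a minimal illustration of intra-page delta compression, not the paper's in-controller design, and the 4 KB sector size is an assumption.

    import zlib

    def delta_compress(old: bytes, new: bytes) -> bytes:
        assert len(old) == len(new)
        diff = bytes(a ^ b for a, b in zip(old, new))   # mostly zeros for small updates
        return zlib.compress(diff, level=9)

    def delta_decompress(old: bytes, delta: bytes) -> bytes:
        diff = zlib.decompress(delta)
        return bytes(a ^ b for a, b in zip(old, diff))

    old = b"A" * 4096
    new = old[:100] + b"B" * 8 + old[108:]              # a small in-place update
    delta = delta_compress(old, new)
    print(len(delta))                                   # far smaller than the 4096-byte sector
    assert delta_decompress(old, delta) == new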

Available Media

Access Characteristic Guided Read and Write Cost Regulation for Performance Improvement on Flash Memory

Qiao Li and Liang Shi, Chongqing University; Chun Jason Xue, City University of Hong Kong; Kaijie Wu, Chongqing University; Cheng Ji, City University of Hong Kong; Qingfeng Zhuge and Edwin H.-M. Sha, Chongqing University

The relatively high cost of write operations has become the performance bottleneck of flash memory. Write cost refers to the time needed to program a flash page using incremental-step pulse programming (ISPP), while read cost refers to the time needed to sense and transfer a page from the storage. If a flash page is written with a higher cost by using a finer step size during the ISPP process, it can be read with a relatively low cost due to the time saved in sensing and transferring, and vice versa.

We introduce AGCR, an access characteristic guided cost regulation scheme that exploits this tradeoff to improve flash performance. Based on workload characteristics, logical pages receiving more reads will be written using a finer step size so that their read cost is reduced. Similarly, logical pages receiving more writes will be written using a coarser step size so that their write cost is reduced. Our evaluation shows that AGCR incurs negligible overhead, while improving performance by 15% on average, compared to previous approaches.
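
The core policy can be pictured as a per-page decision driven by the observed read/write mix; the sketch below is a hypothetical illustration with an arbitrary 2:1 threshold and made-up step values, not AGCR's actual tuning.

    # Choose the ISPP step size for the next write of a logical page from its
    # observed access mix. Threshold and step values are illustrative only.
    def choose_step_size(reads, writes, fine_step=0.2, coarse_step=0.6):
        if writes == 0 or reads / max(writes, 1) >= 2.0:
            return fine_step     # read-hot page: pay more at write time, read faster
        return coarse_step       # write-hot page: write faster, accept slower reads

    print(choose_step_size(reads=120, writes=10))   # fine step
    print(choose_step_size(reads=5, writes=40))     # coarse step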

Available Media
6:00 pm–8:00 pm Tuesday

Poster Session and Reception I

Santa Clara Ballroom

Check out the cool new ideas and the latest preliminary research on display at the Poster Session and Reception. Take part in discussions with your colleagues over complimentary food and drinks.

View the list of accepted posters.

Wednesday, February 24, 2016

8:00 am–9:00 am Wednesday

Continental Breakfast

9:00 am–10:15 am Wednesday

Songs in the Key of Life: Key-Value Stores

Session Chair: Brent Welch, Google

 

WiscKey: Separating Keys from Values in SSD-conscious Storage

Lanyue Lu, Thanumalayan Sankaranarayana Pillai, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau, University of Wisconsin—Madison

We present WiscKey, a persistent LSM-tree-based key-value store with a performance-oriented data layout that separates keys from values to minimize I/O amplification. The design of WiscKey is highly SSD optimized, leveraging both the sequential and random performance characteristics of the device. We demonstrate the advantages of WiscKey with both microbenchmarks and YCSB workloads. Microbenchmark results show that WiscKey is 2.5x–111x faster than LevelDB for loading a database and 1.6x–14x faster for random lookups. WiscKey is faster than both LevelDB and RocksDB in all six YCSB workloads.
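
For readers unfamiliar with key-value separation, the toy sketch below stores only small (key, offset) entries in the index and appends full values to a separate value log; it is a minimal illustration of the idea, not WiscKey's implementation, which keeps the index in an LSM-tree and garbage-collects the log.

    class ToyKVSeparatedStore:
        """Keys map to offsets in an append-only value log (the core idea,
        greatly simplified)."""
        def __init__(self):
            self.index = {}          # key -> (offset, length); a real store keeps this sorted in an LSM-tree
            self.vlog = bytearray()  # append-only value log

        def put(self, key, value):
            off = len(self.vlog)
            self.vlog += value
            self.index[key] = (off, len(value))       # only this small entry enters the index

        def get(self, key):
            off, length = self.index[key]
            return bytes(self.vlog[off:off + length]) # one extra read into the value log

    store = ToyKVSeparatedStore()
    store.put(b"k1", b"a large value ..." * 64)
    assert store.get(b"k1").startswith(b"a large value")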

Available Media

Towards Accurate and Fast Evaluation of Multi-Stage Log-structured Designs

Hyeontaek Lim and David G. Andersen, Carnegie Mellon University; Michael Kaminsky, Intel Labs

Multi-stage log-structured (MSLS) designs, such as LevelDB, RocksDB, HBase, and Cassandra, are a family of storage system designs that exploit the high sequential write speeds of hard disks and flash drives by using multiple append-only data structures. As a first step towards accurate and fast evaluation of MSLS, we propose new analytic primitives and MSLS design models that quickly give accurate performance estimates. Our model can almost perfectly estimate the cost of inserts in LevelDB, whereas the conventional worst-case analysis gives 1.8–3.5x higher estimates than the actual cost. A few minutes of offline analysis using our model can find optimized system parameters that decrease LevelDB’s insert cost by up to 9.4–26.2%; our analytic primitives and model also suggest changes to RocksDB that reduce its insert cost by up to 32.0%, without reducing query performance or requiring extra memory.
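
For context, the "conventional worst-case analysis" the abstract refers to can be sketched as a back-of-the-envelope estimate: with leveled compaction, each inserted byte may be rewritten roughly growth-factor times at each level. The numbers below are illustrative assumptions, not the paper's model.

    # Rough worst-case write amplification for LevelDB-style leveled compaction,
    # i.e., the conventional estimate the paper's analytic model improves on.
    def worstcase_write_amplification(growth_factor=10, levels=6):
        # Each key can be rewritten up to ~growth_factor times as it is
        # compacted down through each level below L0.
        return growth_factor * levels

    print(worstcase_write_amplification())   # ~60 bytes written per byte inserted, worst case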

Available Media

Efficient and Available In-memory KV-Store with Hybrid Erasure Coding and Replication

Heng Zhang, Mingkai Dong, and Haibo Chen, Shanghai Jiao Tong University

In-memory key/value store (KV-store) is a key building block for many systems like databases and large websites. Two key requirements for such systems are efficiency and availability, which demand a KV-store to continuously handle millions of requests per second. A common approach to availability is using replication such as primary-backup (PBR), which, however, requires M+1 times memory to tolerate M failures. This renders scarce memory unable to handle useful user jobs.

This paper makes the first case of building highly available in-memory KV-store by integrating erasure coding to achieve memory efficiency, while not notably degrading performance. A main challenge is that an in-memory KV-store has much scattered metadata. A single KV put may cause excessive coding operations and parity updates due to numerous small updates to metadata. Our approach, namely Cocytus, addresses this challenge by using a hybrid scheme that leverages PBR for small-sized and scattered data (e.g., metadata and key), while only applying erasure coding to relatively large data (e.g., value). To mitigate well-known issues like lengthy recovery of erasure coding, Cocytus uses an online recovery scheme by leveraging the replicated metadata information to continuously serving KV requests. We have applied Cocytus to Memcached. Evaluation using YCSB with different KV configurations shows that Cocytus incurs low overhead for latency and throughput, can tolerate node failures with fast online recovery, yet saves 33% to 46% memory compared to PBR when tolerating two failures.
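
A minimal sketch of the hybrid idea, with XOR parity standing in for a general erasure code and in-memory dicts standing in for KV nodes: small metadata is replicated on every node, while the large value is split into shards plus parity so only one copy of the bulk data is kept.

    def xor_bytes(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    def store_kv(key, value, nodes):
        half = (len(value) + 1) // 2
        d0 = value[:half]
        d1 = value[half:].ljust(half, b"\0")      # pad the second shard to equal length
        parity = xor_bytes(d0, d1)
        meta = {"key": key, "len": len(value)}
        for n in nodes:                           # small metadata: replicate on every node
            n["meta"][key] = meta
        nodes[0]["data"][key] = d0                # large value: one shard or parity per node
        nodes[1]["data"][key] = d1
        nodes[2]["data"][key] = parity

    def rebuild_lost_shard(key, surviving, nodes):
        # With single-parity XOR, any one lost shard is the XOR of the two survivors.
        a, b = (nodes[i]["data"][key] for i in surviving)
        return xor_bytes(a, b)

    nodes = [{"meta": {}, "data": {}} for _ in range(3)]
    store_kv(b"user:42", b"a fairly large value " * 20, nodes)
    assert rebuild_lost_shard(b"user:42", surviving=[1, 2], nodes=nodes) == nodes[0]["data"][b"user:42"]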

Available Media

SNIA Industry Track Session 1

Grand Ballroom DE
Download tutorial materials from the SNIA Web site.

Session ends at 10:30 am

Utilizing VDBench to Perform IDC AFA Testing

Michael Ault, Oracle Guru, IBM, Inc.

9:00 am–9:45 am

IDC has released a document on testing all-flash arrays (AFAs) to provide a common framework for judging AFAs from various manufacturers. This paper provides procedures, scripts, and examples for carrying out the IDC test framework on AFAs using the free tool VDBench, producing a common set of results for comparing the suitability of multiple AFAs for cloud or other network-based storage.

Learning Objectives:

  • Understand the requirements of IDC testing
  • Provide guidelines and scripts for use with VDBench for IDC tests
  • Demonstrate a Framework for evaluating multiple AFAs using IDC guidelines

Mike Ault has worked with computers since 1979 and with Oracle databases since 1990. Mike has spent the last eight years working with Flash storage in relation to Oracle and other database storage needs. Mike is a frequent presenter at user conferences and has written over 24 Oracle-related books. Mike currently works as an Oracle expert for the STG flash group at IBM, Inc.

Practical Online Cache Analysis and Optimization

Carl Waldspurger, Research and Development, CloudPhysics, Inc., and Irfan Ahmad, CTO, CloudPhysics, Inc.

9:45 am–10:30 am

The benefits of storage caches are notoriously difficult to model and control, varying widely by workload, and exhibiting complex, nonlinear behaviors. However, recent advances make it possible to analyze and optimize high-performance storage caches using lightweight, continuously-updated miss ratio curves (MRCs). Previously relegated to offline modeling, MRCs can now be computed so inexpensively that they are practical for dynamic, online cache management, even in the most demanding environments.

After reviewing the history and evolution of MRC algorithms, we will examine new opportunities afforded by recent techniques. MRCs capture valuable information about locality that can be leveraged to guide efficient cache sizing, allocation, and partitioning, in order to support diverse goals such as improving performance, isolation, and quality of service. We will also describe how multiple MRCs can be used to track different alternatives at various timescales, enabling online tuning of cache parameters and policies.
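
As a concrete example of the classic offline approach that newer techniques approximate, the sketch below computes an exact LRU miss ratio curve from a reference trace using Mattson-style stack distances; production MRC systems replace this O(N·M) scan with trees or sampling, which this toy version omits.

    from collections import OrderedDict

    def lru_miss_ratio_curve(trace, max_cache_size):
        """Exact LRU MRC: a reference with stack distance d hits in any cache
        of at least d blocks."""
        stack = OrderedDict()              # most recently used block at the end
        hist = [0] * (max_cache_size + 1)
        for blk in trace:
            if blk in stack:
                keys = list(stack.keys())
                dist = len(keys) - keys.index(blk)   # 1 = re-reference of the MRU block
                if dist <= max_cache_size:
                    hist[dist] += 1
                stack.move_to_end(blk)
            else:
                stack[blk] = True          # cold miss: never a hit at any size
        total = len(trace)
        mrc, hits = [], 0
        for size in range(1, max_cache_size + 1):
            hits += hist[size]
            mrc.append((size, 1.0 - hits / total))
        return mrc

    trace = [1, 2, 3, 1, 2, 3, 4, 1]
    for size, miss_ratio in lru_miss_ratio_curve(trace, 4):
        print(size, round(miss_ratio, 2))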

Learning Objectives:

  • Storage cache modeling and analysis
  • Efficient cache sizing, allocation, and partitioning
  • Online tuning of commercial storage cache parameters and policies

Carl Waldspurger has been leading research at CloudPhysics since its inception. He is active in the systems research community, and serves as a technical advisor to several startups. For over a decade, Carl was responsible for core resource management and virtualization technologies at VMware. Prior to VMware, he was a researcher at the DEC Systems Research Center. Carl holds a Ph.D. in computer science from MIT.

Irfan Ahmad is the Chief Technology Officer of CloudPhysics, which he cofounded in 2011. Prior to CloudPhysics, Irfan was at VMware, where he was R&D tech lead for the DRS team and co-inventor for flagship products, including Storage DRS and Storage I/O Control. Irfan worked extensively on interdisciplinary endeavors in memory, storage, CPU, and distributed resource management, and developed a special interest in research at the intersection of systems. Irfan also spent several years in performance analysis and optimization, both in systems software and OS kernels. Before VMware, Irfan worked on a software microprocessor at Transmeta.

10:15 am–10:45 am Wednesday

Break with Refreshments

10:45 am–noon Wednesday

Master of Puppets: Adapting Cloud and Datacenter Storage

Session Chair: Theodore M. Wong, Human Longevity, Inc.

 

Slacker: Fast Distribution with Lazy Docker Containers

Tyler Harter, University of Wisconsin—Madison; Brandon Salmon and Rose Liu, Tintri; Andrea C. Arpaci-Dusseau and Remzi H. Arpaci-Dusseau, University of Wisconsin—Madison

Containerized applications are becoming increasingly popular, but unfortunately, current container-deployment methods are very slow. We develop a new container benchmark, HelloBench, to evaluate the startup times of 57 different containerized applications. We use HelloBench to analyze workloads in detail, studying the block I/O patterns exhibited during startup and compressibility of container images. Our analysis shows that pulling packages accounts for 76% of container start time, but only 6.4% of that data is read. We use this and other findings to guide the design of Slacker, a new Docker storage driver optimized for fast container startup. Slacker is based on centralized storage that is shared between all Docker workers and registries. Workers quickly provision container storage using backend clones and minimize startup latency by lazily fetching container data. Slacker speeds up the median container development cycle by 20x and deployment cycle by 5x.

Available Media

sRoute: Treating the Storage Stack Like a Network

Ioan Stefanovici and Bianca Schroeder, University of Toronto; Greg O'Shea, Microsoft Research; Eno Thereska, Confluent and Imperial College London

In a data center, an IO from an application to distributed storage traverses not only the network, but also several software stages with diverse functionality. This set of ordered stages is known as the storage or IO stack. Stages include caches, hypervisors, IO schedulers, file systems, and device drivers. Indeed, in a typical data center, the number of these stages is often larger than the number of network hops to the destination. Yet, while packet routing is fundamental to networks, no notion of IO routing exists on the storage stack. The path of an IO to an endpoint is predetermined and hard-coded. This forces IO with different needs (e.g., requiring different caching or replica selection) to flow through a one-size-fits-all IO stack structure, resulting in an ossified IO stack.

This paper proposes sRoute, an architecture that provides a routing abstraction for the storage stack. sRoute comprises a centralized control plane and “sSwitches” on the data plane. The control plane sets the forwarding rules in each sSwitch to route IO requests at runtime based on application-specific policies. A key strength of our architecture is that it works with unmodified applications and VMs. This paper shows significant benefits of customized IO routing to data center tenants (e.g., a factor of ten for tail IO latency, more than 60% better throughput for a customized replication protocol and a factor of two in throughput for customized caching).

Available Media

Flamingo: Enabling Evolvable HDD-based Near-Line Storage

Sergey Legtchenko, Xiaozhou Li, Antony Rowstron, Austin Donnelly, and Richard Black, Microsoft Research

Cloud providers and companies running large-scale data centers offer near-line, cold, and archival data storage, which trade access latency and throughput performance for cost. These often require physical rack-scale storage designs, e.g. Facebook/Open Compute Project (OCP) Cold Storage or Pelican, which co-design the hardware, mechanics, power, cooling and software to minimize costs to support the desired workload. A consequence is that the rack resources are restricted, requiring a software stack that can operate within the provided resources. The co-design makes it hard to understand the end-to-end performance impact of relatively small physical design changes and, worse, the software stacks are brittle to these changes.

Flamingo supports the design of near-line HDD-based storage racks for cloud services. It requires a physical rack design, a set of resource constraints, and some target performance characteristics. Using these Flamingo is able to automatically parameterize a generic storage stack to allow it to operate on the physical rack. It is also able to efficiently explore the performance impact of varying the rack resources. It incorporates key principles learned from the design and deployment of cold storage systems. We demonstrate that Flamingo can rapidly reduce the time taken to design custom racks to support near-line storage.

Available Media

SNIA Industry Track Session 2

Grand Ballroom DE

Session ends at 12:15 pm

SMB Remote File Protocol (Including SMB 3.x)

Tom Talpey, Architect, Microsoft

10:45 am–11:30 am

The SMB protocol evolved over time from CIFS to SMB1 to SMB2, with implementations by dozens of vendors including most major Operating Systems and NAS solutions. The SMB 3.0 protocol had its first commercial implementations by Microsoft, NetApp and EMC by the end of 2012, and many other implementations exist or are in progress. The SMB3 protocol is currently at 3.1.1 and continues to advance.

This SNIA Tutorial begins by describing the history and basic architecture of the SMB protocol and its operations. The second part of the tutorial covers the various versions of the SMB protocol, with details of improvements over time. The final part covers the latest changes in SMB3, and the resources available in support of its development by industry.

Learning Objectives:

  • Understand the basic architecture of the SMB protocol family
  • Enumerate the main capabilities introduced with SMB 2.0/2.1
  • Describe the main capabilities introduced with SMB 3.0 and beyond

Tom Talpey is an Architect in the File Server Team at Microsoft. His responsibilities include SMB 3, SMB Direct (SMB over RDMA), and all the protocols and technologies that support the SMB ecosystem. Tom has worked in the areas of network filesystems, network transports and RDMA for many years and recently has been working on storage traffic management, with application not only to SMB but in broad end-to-end scenarios. He is a frequent presenter at Storage Dev.

Object Drives: A New Architectural Partitioning

Mark Carlson, Principal Engineer, Industry Standards, Toshiba

11:30 am–12:15 pm

A number of scale out storage solutions, as part of open source and other projects, are architected to scale out by incrementally adding and removing storage nodes. Example projects include:

  • Hadoop’s HDFS
  • CEPH
  • Swift (OpenStack object storage)

The typical storage node architecture includes inexpensive enclosures with IP networking, CPU, Memory and Direct Attached Storage (DAS). While inexpensive to deploy, these solutions become harder to manage over time. Power and space requirements of Data Centers are difficult to meet with this type of solution. Object Drives further partition these object systems allowing storage to scale up and down by single drive increments.

This talk will discuss the current state and future prospects for object drives. Use cases and requirements will be examined and best practices will be described.

Learning Objectives:

  • What are object drives?
  • What value do they provide?
  • Where are they best deployed?

Mark A. Carlson has more than 35 years of experience with networking and storage development and more than 18 years experience with Java technology. Mark was one of the authors of the CDMI Cloud Storage standard. He has spoken at numerous industry forums and events. He is the co-chair of the SNIA Cloud Storage and Object Drive technical working groups, and serves as vice chair on the SNIA Technical Council.

Noon–2:00 pm Wednesday

Lunch (on your own)

2:00 pm–3:30 pm Wednesday

Magical Mystery Tour: Miscellaneous

Session Chair: Ethan L. Miller, University of California, Santa Cruz, and Pure Storage

 

PCAP: Performance-aware Power Capping for the Disk Drive in the Cloud

Mohammed G. Khatib and Zvonimir Bandic, WDC Research

Power efficiency is pressing in today’s cloud systems. Datacenter architects are responding with various strategies, including capping the power available to computing systems. Throttling bandwidth has been proposed to cap the power usage of the disk drive. This work revisits throttling and addresses its shortcomings. We show that, contrary to the common belief, the disk’s power usage does not always increase as the disk’s throughput increases. Furthermore, throttling unnecessarily sacrifices I/O response times by idling the disk. We propose a technique that resizes the queues of the disk to cap its power. Resizing queues not only imposes no delays on servicing requests, but also enables performance differentiation.

We present the design and implementation of PCAP, an agile performance-aware power capping system for the disk drive. PCAP dynamically resizes the disk’s queues to cap power. It operates in two performance-aware modes, throughput and tail-latency, making it viable for cloud systems with service-level differentiation. We evaluate PCAP for different workloads and disk drives. Our experiments show that PCAP reduces power by up to 22%. Further, under PCAP, 60% of the requests observe service times below 100 ms compared to just 10% under throttling. PCAP also reduces worst-case latency by 50% and increases throughput by 32% relative to throttling.
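
The queue-resizing idea can be pictured as a simple feedback loop: shrink the queue depth when measured power exceeds the cap and grow it back when there is slack. The sketch below is a hypothetical illustration (read_drive_power and set_queue_depth are assumed helper callbacks), not PCAP's actual controller or its performance-aware modes.

    import time

    def power_cap_loop(cap_watts, read_drive_power, set_queue_depth,
                       steps=60, qmin=1, qmax=32, period_s=1.0):
        depth = qmax
        for _ in range(steps):
            watts = read_drive_power()
            if watts > cap_watts and depth > qmin:
                depth = max(qmin, depth - 2)      # back off: fewer outstanding I/Os
            elif watts < 0.9 * cap_watts and depth < qmax:
                depth = min(qmax, depth + 1)      # reclaim performance while under the cap
            set_queue_depth(depth)
            time.sleep(period_s)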

Available Media

Mitigating Sync Amplification for Copy-on-write Virtual Disk

Qingshu Chen, Liang Liang, Yubin Xia, and Haibo Chen, Shanghai Jiao Tong University

Copy-on-write virtual disks (e.g., qcow2 images) provide many useful features like snapshot, de-duplication, and full-disk encryption. However, our study uncovers that they introduce additional metadata for block organization and notably more disk sync operations (e.g., more than 3X for qcow2 and 4X for VMDK images). To mitigate such sync amplification, we propose three optimizations, namely per virtual disk internal journaling, dual-mode journaling, and adaptive-preallocation, which eliminate the extra sync operations while preserving those features in a consistent way. Our evaluation shows that the three optimizations result in up to 110% performance speedup for varmail and 50% for TPCC.

Available Media

Uncovering Bugs in Distributed Storage Systems during Testing (Not in Production!)

Pantazis Deligiannis, Imperial College London; Matt McCutchen, Massachusetts Institute of Technology; Paul Thomson, Imperial College London; Shuo Chen, Microsoft; Alastair F. Donaldson, Imperial College London; John Erickson, Cheng Huang, Akash Lal, Rashmi Mudduluru, Shaz Qadeer, and Wolfram Schulte, Microsoft

Testing distributed systems is challenging due to multiple sources of nondeterminism. Conventional testing techniques, such as unit, integration and stress testing, are ineffective in preventing serious but subtle bugs from reaching production. Formal techniques, such as TLA+, can only verify high-level specifications of systems at the level of logic-based models, and fall short of checking the actual executable code. In this paper, we present a new methodology for testing distributed systems. Our approach applies advanced systematic testing techniques to thoroughly check that the executable code adheres to its high-level specifications, which significantly improves coverage of important system behaviors.

Our methodology has been applied to three distributed storage systems in the Microsoft Azure cloud computing platform. In the process, numerous bugs were identified, reproduced, confirmed and fixed. These bugs required a subtle combination of concurrency and failures, making them extremely difficult to find with conventional testing techniques. An important advantage of our approach is that a bug is uncovered in a small setting and witnessed by a full system trace, which dramatically increases the productivity of debugging.

Available Media

The Tail at Store: A Revelation from Millions of Hours of Disk and SSD Deployments

Mingzhe Hao, University of Chicago; Gokul Soundararajan and Deepak Kenchammana-Hosekote, NetApp, Inc.; Andrew A. Chien and Haryadi S. Gunawi, University of Chicago

We study storage performance in over 450,000 disks and 4,000 SSDs over 87 days for an overall total of 857 million (disk) and 7 million (SSD) drive hours. We find that storage performance instability is not uncommon: 0.2% of the time, a disk is more than 2x slower than its peer drives in the same RAID group (and 0.6% for SSD). As a consequence, disk and SSD-based RAIDs experience at least one slow drive (i.e., storage tail) 1.5% and 2.2% of the time. To understand the root causes, we correlate slowdowns with other metrics (workload I/O rate and size, drive event, age, and model). Overall, we find that the primary causes of slowdowns are the internal characteristics and idiosyncrasies of modern disk and SSD drives. We observe that storage tails can adversely impact RAID performance, motivating the design of tail-tolerant RAID. To the best of our knowledge, this work is the most extensive documentation of storage performance instability in the field.
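
The slowdown metric implied above can be sketched as each drive's per-interval latency divided by the median latency of its RAID peers, with values of 2 or more marking a storage tail; this is an illustration of the idea, not necessarily the paper's exact definition.

    from statistics import median

    def slowdowns(latencies_ms):
        """Per-drive slowdown relative to the median latency of its peers."""
        result = {}
        for drive, lat in latencies_ms.items():
            peers = [v for d, v in latencies_ms.items() if d != drive]
            result[drive] = lat / median(peers)
        return result

    group = {"d0": 4.1, "d1": 4.3, "d2": 9.8, "d3": 4.0}   # d2 is the slow drive
    print({d: round(s, 2) for d, s in slowdowns(group).items()})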

Available Media

SNIA Industry Track Session 3

Grand Ballroom DE

Fog Computing and Its Ecosystem

Ramin Elahi, Adjunct Faculty, UC Santa Cruz Silicon Valley

2:00 pm–2:45 pm

Fog computing brings computing and services to the edge of the network, in contrast to (and in cooperation with) cloud computing. Fog provides data, compute, storage, and application services to end users. The distinguishing Fog characteristics are its proximity to end users, its dense geographical distribution, and its support for mobility. Services are hosted at the network edge or even on end devices such as set-top boxes or access points. Thus, it can alleviate issues that the IoT (Internet of Things) is expected to produce, reducing service latency and improving QoS, resulting in a superior user experience. Fog Computing supports emerging Internet of Everything (IoE) applications that demand real-time/predictable latency (industrial automation, transportation, networks of sensors and actuators). Thanks to its wide geographical distribution, the Fog paradigm is well positioned for real-time big data and real-time analytics. Fog supports densely distributed data collection points, hence adding a fourth axis to the often-mentioned Big Data dimensions (volume, variety, and velocity).

Ramin Elahi, MSEE, is an Adjunct Professor and Advisory Board Member at UC Santa Cruz Silicon Valley. He has taught Data Center Storage, Unix Networking, and System Administration at the University of California, Santa Cruz and University of California, Berkeley Extensions since 1996. He is also a Senior Education Consultant at EMC Corp. He has also served as a Training Solutions Architect at NetApp, where he managed the engineering on-boarding and training curricula development. Prior to NetApp, he was Training Site Manager at Hitachi Data Systems Academy in charge of development and delivery of enterprise storage array certification programs. He also was the global network storage curricula manager at Hewlett-Packard. His areas of expertise are data center storage design and architecture, Data ONTAP, cloud storage, and virtualization. He also held a variety of positions at Cisco, Novell, and SCO as a consultant and escalation engineer. He implemented the first university-level Data Storage and Virtualization curriculum in Northern California back in 2007.

Privacy vs. Data Protection: The Impact of EU Data Protection Legislation

Thomas Rivera, Senior Technical Associate, HDS

2:45 pm–3:30 pm

After reviewing the diverging data protection legislation in the EU member states, the European Commission (EC) decided that this situation would impede the free flow of data within the EU zone. The EC response was to undertake an effort to "harmonize" the data protection regulations, and it started the process by proposing a new data protection framework. This proposal includes some significant changes like defining a data breach to include data destruction, adding the right to be forgotten, adopting the U.S. practice of breach notifications, and many other new elements. Another major change is a shift from a directive to a rule, which means the protections are the same for all 27 countries and includes significant financial penalties for infractions. This tutorial explores the new EU data protection legislation and highlights the elements that could have significant impacts on data handling practices.

Learning Objectives:

  • Highlight the major changes to the previous data protection directive
  • Review the differences between "Directives" versus "Regulations," as it pertains to the EU legislation
  • Learn the nature of the Reforms as well as the specific proposed changes—in both the directives and the regulations

Thomas Rivera has over 30 years of experience in the storage industry, specializing in file services and data protection technology, and is a senior technical associate with Hitachi Data Systems. Thomas is also an active member of the Storage Networking Industry Association (SNIA) as an elected member of the SNIA Board of Directors, and is co-chair of the Data Protection and Capacity Optimization (DPCO) Committee, as well as a member of the Security Technical Working Group, and the Analytics and Big Data Committee.

3:30 pm–4:00 pm Wednesday

Break with Refreshments

4:00 pm–5:30 pm Wednesday

The Magic Whip: Work-in-Progress Reports

Session Chairs: Haryadi Gunawi, University of Chicago; Daniel Peek, Facebook

SNIA Industry Track Session 4

Grand Ballroom DE

Converged Storage Technology

Liang Ming, Research Engineer, Development and Research, Distributed Storage Field, Huawei

4:00 pm–5:30 pm

First, we will introduce the current status and pain points of Huawei's distributed storage technology. Then, the next generation of key-value converged storage solutions will be presented. Following that, we will discuss the concept of key-value storage and show what we have done to promote the key-value standard.

Next, we will show how we build our block, file, and object services on top of the same key-value pool. Finally, the future of storage technology for VMs and containers will be discussed. The intended audience is storage engineers, and we look forward to discussing converged storage technology with storage peers.

Learning Objectives:

  • Convergence, Consolidation, and Virtualization of Infrastructure, Storage Devices, and Servers
  • Deployment: use cases and typical deployment or operational considerations
6:00 pm–8:00 pm Wednesday

Poster Session and Reception II

Grand Ballroom Foyer/TusCA Courtyard

Sponsored by NetApp
Check out the cool new ideas and the latest preliminary research on display at the Poster Session and Reception. Take part in discussions with your colleagues over complimentary food and drinks.

View the list of accepted posters.

Thursday, February 25, 2016

8:00 am–9:00 am Thursday

Continental Breakfast

9:00 am–10:20 am Thursday

Eliminator: Deduplication

Session Chair: Carl Waldspurger, CloudPhysics

 

Estimating Unseen Deduplication—from Theory to Practice

Danny Harnik, Ety Khaitzin, and Dmitry Sotnikov, IBM Research—Haifa

Estimating the deduplication ratio of a very large dataset is both extremely useful and genuinely very hard to perform. In this work we present a new method for accurately estimating deduplication benefits that runs 3X to 15X faster than the state of the art to date. The level of improvement depends on the data itself and on the storage media that it resides on. The technique is based on breakthrough theoretical work by Valiant and Valiant from 2011, which gives a provably accurate method for estimating various measures while seeing only a fraction of the data. However, for the use case of deduplication estimation, putting this theory into practice runs into significant obstacles. In this work, we find solutions and novel techniques to enable the use of this new and exciting approach. Our contributions include a novel approach for gauging the estimation accuracy, techniques to run it with low memory consumption, a method to evaluate the combined compression and deduplication ratio, and ways to perform the actual sampling in real storage systems in order to actually reap benefits from these algorithms. We evaluated our work on a number of real world datasets.

Available Media

OrderMergeDedup: Efficient, Failure-Consistent Deduplication on Flash

Zhuan Chen and Kai Shen, University of Rochester

Flash storage is commonplace on mobile devices, sensors, and cloud servers. I/O deduplication is beneficial for saving the storage space and reducing expensive Flash writes. This paper presents a new approach, called OrderMergeDedup, that deduplicates storage writes while realizing failure-consistency, efficiency, and persistence at the same time. We devise a soft updates-style metadata write ordering that maintains storage data consistency without consistency-induced additional I/O. We further explore opportunities of I/O delay and merging to reduce the metadata I/O writes. We evaluate our Linux device mapper-based implementation using several mobile and server workloads—package installation and update, BBench web browsing, vehicle counting, Hadoop, and Yahoo Cloud Serving Benchmark. Results show that OrderMergeDedup can realize 18–63% write reduction on workloads that exhibit 23–73% write content duplication. It has significantly less metadata write overhead than alternative I/O shadowing-based deduplication. Our approach has a slight impact on the application latency and may even improve the performance due to reduced I/O load.

Available Media

CacheDedup: In-line Deduplication for Flash Caching

Wenji Li, Arizona State University; Gregory Jean-Baptise, Juan Riveros, and Giri Narasimhan, Florida International University; Tony Zhang, Rensselaer Polytechnic Institute; Ming Zhao, Arizona State University

Flash caching has emerged as a promising solution to the scalability problems of storage systems by using fast flash memory devices as the cache for slower primary storage. But its adoption faces serious obstacles due to the limited capacity and endurance of flash devices. This paper presents CacheDedup, a solution that addresses these limitations using in-line deduplication. First, it proposes a novel architecture that integrates the caching of data and deduplication metadata (source addresses and fingerprints of the data) and efficiently manages these two components. Second, it proposes duplication-aware cache replacement algorithms (D-LRU, DARC) to optimize both cache performance and endurance. The paper presents a rigorous analysis of the algorithms to prove that they do not waste valuable cache space and that they are efficient in time and space usage. The paper also includes an experimental evaluation using real-world traces, which confirms that CacheDedup substantially improves I/O performance (up to 20% reduction in miss ratio and 51% in latency) and flash endurance (up to 89% reduction in writes sent to the cache device) compared to traditional cache management. It also shows that the proposed architecture and algorithms can be extended to support the combination of compression and deduplication for flash caching and improve its performance and endurance.
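
A minimal sketch of a deduplicated cache in the spirit of D-LRU, assuming SHA-1 fingerprints and plain LRU over addresses: addresses map to fingerprints, and each unique fingerprint's data is stored once with a reference count. The paper's actual algorithms additionally manage the metadata cache and flash endurance.

    import hashlib
    from collections import OrderedDict

    class ToyDedupCache:
        def __init__(self, capacity_blocks):
            self.capacity = capacity_blocks
            self.addr_to_fp = OrderedDict()   # address -> fingerprint, in LRU order
            self.fp_store = {}                # fingerprint -> [data, refcount]

        def _release(self, fp):
            entry = self.fp_store[fp]
            entry[1] -= 1
            if entry[1] == 0:
                del self.fp_store[fp]

        def insert(self, addr, data):
            fp = hashlib.sha1(data).hexdigest()
            if addr in self.addr_to_fp:
                self._release(self.addr_to_fp.pop(addr))
            if fp in self.fp_store:
                self.fp_store[fp][1] += 1     # duplicate content: no extra data write
            else:
                self.fp_store[fp] = [data, 1]
            self.addr_to_fp[addr] = fp
            while len(self.addr_to_fp) > self.capacity:
                _, old_fp = self.addr_to_fp.popitem(last=False)  # evict LRU address
                self._release(old_fp)

        def lookup(self, addr):
            fp = self.addr_to_fp.get(addr)
            if fp is None:
                return None                   # cache miss
            self.addr_to_fp.move_to_end(addr) # refresh recency
            return self.fp_store[fp][0]

    cache = ToyDedupCache(capacity_blocks=2)
    cache.insert(100, b"A" * 4096)
    cache.insert(200, b"A" * 4096)            # same content: one copy, two addresses
    assert cache.lookup(200) == b"A" * 4096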

Available Media

Using Hints to Improve Inline Block-layer Deduplication

Sonam Mandal, Stony Brook University; Geoff Kuenning, Harvey Mudd College; Dongju Ok and Varun Shastry, Stony Brook University; Philip Shilane, EMC Corporation; Sun Zhen, Stony Brook University and National University of Defense Technology; Vasily Tarasov, IBM Research; Erez Zadok, Stony Brook University

Block-layer data deduplication allows file systems and applications to reap the benefits of deduplication without requiring per-system or per-application modifications. However, important information about data context (e.g., data vs. metadata writes) is lost at the block layer. Passing such context to the block layer can help improve deduplication performance and reliability. We implemented a hinting interface in an open-source block-layer deduplication system, dmdedup, that passes relevant context to the block layer, and evaluated two hints, NODEDUP and PREFETCH. To allow upper storage layers to pass hints based on the available context, we modified the VFS and file system layers to expose a hinting interface to user applications. We show that passing the NODEDUP hint speeds up applications by up to 5.3× on modern machines because the overhead of deduplication is avoided when it is unlikely to be beneficial. We also show that the PREFETCH hint accelerates applications up to 1.8× by caching hashes for data that is likely to be accessed soon.
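
A hedged sketch of how a block-layer write path might honor a NODEDUP hint: hinted writes skip hashing and the fingerprint index entirely, which is where the avoided overhead comes from. The function and flag names are illustrative, not dmdedup's real interface.

    import hashlib

    fingerprint_index = {}   # fingerprint -> physical block number
    next_pbn = 0

    def dedup_write(lbn, data, mapping, hints=frozenset()):
        global next_pbn
        if "NODEDUP" in hints:                  # e.g., unique metadata: just allocate and write
            mapping[lbn] = ("unique", next_pbn)
            next_pbn += 1
            return
        fp = hashlib.sha256(data).hexdigest()
        pbn = fingerprint_index.get(fp)
        if pbn is None:                         # new content: write it and index it
            pbn = next_pbn
            next_pbn += 1
            fingerprint_index[fp] = pbn
        mapping[lbn] = ("dedup", pbn)           # duplicate content shares one physical block

    mapping = {}
    dedup_write(10, b"A" * 4096, mapping)
    dedup_write(11, b"A" * 4096, mapping)                       # deduplicated against lbn 10
    dedup_write(12, b"journal " * 512, mapping, hints={"NODEDUP"})
    print(mapping)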

Available Media
10:20 am–10:45 am Thursday

Break with Refreshments

10:45 am–noon Thursday

The Unforgettable Fire: Flash and NVM

Session Chair: Nisha Talagala, SanDisk

NOVA: A Log-structured File System for Hybrid Volatile/Non-volatile Main Memories

Jian Xu and Steven Swanson, University of California, San Diego

Fast non-volatile memories (NVMs) will soon appear on the processor memory bus alongside DRAM. The resulting hybrid memory systems will provide software with sub-microsecond, high-bandwidth access to persistent data, but managing, accessing, and maintaining consistency for data stored in NVM raises a host of challenges. Existing file systems built for spinning or solid-state disks introduce software overheads that would obscure the performance that NVMs should provide, but proposed file systems for NVMs either incur similar overheads or fail to provide the strong consistency guarantees that applications require.

We present NOVA, a file system designed to maximize performance on hybrid memory systems while providing strong consistency guarantees. NOVA adapts conventional log-structured file system techniques to exploit the fast random access that NVMs provide. In particular, it maintains separate logs for each inode to improve concurrency, and stores file data outside the log to minimize log size and reduce garbage collection costs. NOVA’s logs provide metadata, data, and mmap atomicity and focus on simplicity and reliability, keeping complex metadata structures in DRAM to accelerate lookup operations. Experimental results show that in write-intensive workloads, NOVA provides 22% to 216× throughput improvement compared to state-of-the-art file systems, and 3.1× to 13.5× improvement compared to file systems that provide equally strong data consistency guarantees.

Available Media

Application-Managed Flash

Sungjin Lee, Ming Liu, Sangwoo Jun, and Shuotao Xu, MIT CSAIL; Jihong Kim, Seoul National University; Arvind, MIT CSAIL

In flash storage, an FTL is a complex piece of code that resides completely inside the storage device and is provided by the manufacturer. Its principal virtue is providing interoperability with conventional HDDs. However, this virtue is also its biggest impediment in reaching the full performance of the underlying flash storage. We propose to refactor the flash storage architecture so that it relies on a new block I/O interface which does not permit overwriting of data without intervening erasures. We demonstrate how high-level applications, in particular file systems, can deal with this restriction efficiently by employing append-only segments. This refactoring dramatically reduces flash management overhead and improves performance of applications, such as file systems and databases, by permitting them to directly manage flash storage. Our experiments on a machine with the new block I/O interface show that DRAM in the flash controller is reduced by 128x and the performance of the file system improves by 80% over conventional SSDs.
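
The restricted block interface can be pictured as an append-only segment: the host may append pages and erase whole segments, but never overwrite a page in place. The sketch below is a toy illustration of that contract, not the paper's actual interface.

    class AppendOnlySegment:
        """Block interface that forbids in-place overwrite: append within a
        segment, erase the whole segment before reuse."""
        def __init__(self, pages, page_size=4096):
            self.page_size = page_size
            self.data = [None] * pages
            self.write_ptr = 0                 # next page to be appended

        def append(self, page_data):
            if self.write_ptr >= len(self.data):
                raise IOError("segment full: erase before reuse")
            self.data[self.write_ptr] = page_data
            self.write_ptr += 1
            return self.write_ptr - 1          # page index within the segment

        def read(self, page_index):
            return self.data[page_index]

        def erase(self):
            self.data = [None] * len(self.data)
            self.write_ptr = 0

    seg = AppendOnlySegment(pages=4)
    idx = seg.append(b"\x00" * 4096)
    seg.erase()                                # the only way to make pages writable again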

Available Media

CloudCache: On-demand Flash Cache Management for Cloud Computing

Dulcardo Arteaga and Jorge Cabrera, Florida International University; Jing Xu, VMware Inc.; Swaminathan Sundararaman, Parallel Machines; Ming Zhao, Arizona State University

Available Media