Workshop Program

All sessions will take place in the Santa Clara Ballroom unless otherwise noted.

The workshop papers are available for download below. Copyright to the individual works is retained by the author(s).

Downloads for Registered Attendees

Attendee Files 
HotStorage Paper Archive (ZIP)
HotStorage Attendee List (PDF)

 

Monday, July 6, 2015

8:00 am–9:00 am Monday

Continental Breakfast

Mezzanine East/West

9:00 am–9:15 am Monday

Opening Remarks

Program Co-Chairs: Ken Salem, University of Waterloo, and John Strunk, NetApp

9:15 am–10:45 am Monday

Visions and Visualizations

Session Chair: Daniel Ellard, Raytheon BBN Technologies

Parametric Optimization of Storage Systems

Erez Zadok, Aashray Arora, Zhen Cao, Akhilesh Chaganti, Arvind Chaudhary, and Sonam Mandal, Stony Brook University

Most storage systems come with a large set of parameters to directly or indirectly control a specific set of metrics that may include performance, energy, etc. Often, storage systems are deployed with default configurations, rendering them sub-optimal. Finding optimal configurations is difficult due to the numerous combinations of parameters and parameter sensitivity to workloads and deployed environments. Previous research on parameter optimization was either limited to narrow problems or not widely applicable to storage stack parameter optimization in general. Based on promising early results, we propose using meta-heuristic techniques such as genetic algorithms to efficiently find near-optimal configurations for storage systems.
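
As a rough illustration of the meta-heuristic idea (and not the authors' framework), the sketch below runs a tiny genetic algorithm over a made-up storage parameter space; the parameter names and the stand-in fitness function are hypothetical placeholders for real benchmark runs.

```python
import random

# Hypothetical storage-stack parameters; real search spaces are far larger.
SPACE = {
    "io_scheduler": ["noop", "deadline", "cfq"],
    "block_size_kb": [4, 16, 64, 256],
    "journal_mode": ["ordered", "writeback"],
}

def random_config():
    return {k: random.choice(v) for k, v in SPACE.items()}

def fitness(cfg):
    # Stand-in for running a benchmark against the configured system.
    score = cfg["block_size_kb"] ** 0.5
    score += 3 if cfg["io_scheduler"] == "deadline" else 0
    score += 2 if cfg["journal_mode"] == "writeback" else 0
    return score

def evolve(generations=20, pop_size=12, mutation_rate=0.2):
    pop = [random_config() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            child = {k: random.choice([a[k], b[k]]) for k in SPACE}   # crossover
            if random.random() < mutation_rate:                       # mutation
                k = random.choice(list(SPACE))
                child[k] = random.choice(SPACE[k])
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

print(evolve())
```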

Available Media

Enabling Automated, Rich, and Versatile Data Management for Android Apps with BlueMountain

Sharath Chandrashekhara, Kyle Marcus, Rakesh G. M. Subramanya, Hrishikesh S. Karve, Karthik Dantu, and Steven Y. Ko, SUNY-Buffalo

Today’s mobile apps often leverage cloud services to manage their own data as well as user data, enabling many desired features such as backup and sharing. However, this comes at a cost; developers have to manually craft their logic and potentially repeat a similar process for different cloud providers. In addition, users are restricted to the design choices made by developers; for example, once a developer releases an app that uses a particular cloud service, it is impossible for a user to later customize the app and choose a different service.

In this paper, we explore the design space of an app instrumentation tool that automatically integrates cloud storage services for Android apps. Our goal is to allow developers to treat all storage operations as local operations, and automatically enable cloud features customized for individual needs of users and developers. We discuss various scenarios that can benefit from such an automated tool, challenges associated with the development of it, and our ideas to address these challenges.

Available Media

It’s Not Where Your Data Is, It’s How It Got There

Gala Yadgar, Roman Shor, Eitan Yaakobi, and Assaf Schuster, Technion—Israel Institute of Technology

Modern flash devices, which perform updates ‘out of place’, require different optimization strategies than hard disks. The focus for flash devices is on optimizing data movement, rather than optimizing data placement. An understanding of the processes that cause data movement within a flash drive is crucial for analyzing and managing it.

While sequentiality on hard drives is easy to visualize, as is done by various defragmentation tools, data movement on flash is inherently dynamic. In the absence of suitable visualization tools, researchers and developers must rely on aggregated statistics and histograms from which the actual movement is derived. The complexity of this task increases with the complexity of state-of-the-art production and research FTL optimizations.

Adding visualization to existing research and analysis tools will greatly improve our understanding of modern, complex flash-based systems. We developed SSDPlayer, a graphical tool for visualizing the various processes that cause data movement on SSDs. We use SSDPlayer to demonstrate how visualization can help us shed light on the complex phenomena that cause data movement and expose new opportunities for optimization.
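
For readers unfamiliar with why data moves on flash at all, the toy model below shows out-of-place updates leaving stale pages behind, which is exactly the kind of movement SSDPlayer visualizes; the page counts and bookkeeping are simplifications, not the tool's internals.

```python
# Toy flash translation layer: logical pages map to physical pages.
# An update never overwrites in place; it goes to a fresh physical page
# and the old copy becomes stale, which is what later forces garbage
# collection to relocate still-valid neighbours.
PAGES = 8
free_pages = list(range(PAGES))
l2p = {}                         # logical page -> physical page
stale = set()                    # physical pages holding dead data

def write(lpn):
    if lpn in l2p:
        stale.add(l2p[lpn])      # out-of-place update invalidates the old copy
    ppn = free_pages.pop(0)
    l2p[lpn] = ppn
    print(f"logical {lpn} -> physical {ppn}   stale: {sorted(stale)}")

for lpn in [0, 1, 2, 0, 1, 0]:   # repeated updates to the same logical pages
    write(lpn)
# Physical pages 0, 1, and 3 now hold dead data; reclaiming them means
# erasing their block after moving any live pages that share it.
```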

Available Media

Toward Eidetic Distributed File Systems

Xianzheng Dou, Jason Flinn, and Peter M. Chen, University of Michigan

We propose a new point in the design space of versioning and provenance-aware file systems in which the entire operating system, not just the file system, supports such functionality. We leverage deterministic record-and-replay to substitute computation for data. This leads to a new file system design where the log of nondeterministic inputs, not file data, is the fundamental unit of persistent storage. We outline a distributed storage system design based on these principles and describe the challenges we foresee for achieving our vision.
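
The core idea of substituting computation for data can be illustrated with a deliberately tiny model: persist only the log of nondeterministic inputs and regenerate file contents by deterministic replay. The "application" below is a stand-in, not the authors' record-and-replay system.

```python
import hashlib, random

def app(inputs):
    # Deterministic given its inputs: the same log always yields the same file.
    data = bytearray()
    for x in inputs:
        data.extend(hashlib.sha256(str(x).encode()).digest())
    return bytes(data)

# "Record": persist only the nondeterministic inputs, not the output file.
log = [random.randint(0, 1 << 30) for _ in range(4)]
original = app(log)

# "Replay": regenerate the file content from the log on demand.
assert app(log) == original
print(f"stored ~{len(str(log))} bytes of log instead of {len(original)} bytes of data")
```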

Available Media
10:45 am–11:15 am Monday

Break with Refreshments

Mezzanine East/West

11:15 am–12:30 pm Monday

Poster Session

Mezzanine East/West

12:30 pm–2:15 pm Monday

Luncheon for Workshop Attendees

Terra Courtyard

2:15 pm–3:30 pm Monday

Compression and Coding

Session Chair: Nisha Talagala, SanDisk

Leveraging Progressive Programmability of SLC Flash Pages to Realize Zero-overhead Delta Compression for Metadata Storage

Xuebin Zhang, Jiangpeng Li, Kai Zhao, Hao Wang, and Tong Zhang, Rensselaer Polytechnic Institute

This paper presents a method to implement delta compression for metadata storage in flash memory. With the abundant temporal redundancy in metadata, it is intuitive to expect that flash-based metadata storage can significantly benefit from delta compression. However, straightforward realization of delta compression demands the storage of the original data and the deltas among different versions in different flash memory physical pages, which leads to significant overhead in terms of read/write latency and data management complexity. Through experiments with 20nm NAND flash memory chips, we observed that, when operating in SLC mode, a flash memory page can be programmed in a progressive manner, i.e., different portions of the same SLC flash memory page can be programmed at different times. This motivates us to propose a simple design approach that can realize delta compression for metadata storage without latency and data management complexity overheads. The key idea is to allocate SLC-mode flash memory pages for metadata, and store the original data and all the subsequent deltas in the same physical page through progressive programming. Experimental results show that this approach can significantly reduce the metadata write traffic without any latency overhead.
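
A minimal sketch of the delta-in-the-same-page idea follows; the page size, the naive byte-wise delta encoding, and the space accounting are illustrative assumptions, not the paper's implementation.

```python
PAGE_SIZE = 512          # illustrative SLC page size (bytes)
BASE_SIZE = 256          # illustrative metadata block size

def byte_delta(old, new):
    # Naive delta: (offset, new_byte) pairs; real designs use a proper
    # delta encoder, this only keeps the sketch short.
    return [(i, n) for i, (o, n) in enumerate(zip(old, new)) if o != n]

class MetadataPage:
    """One SLC-mode page: the original metadata goes in first, and later
    versions are appended to the same page as deltas via progressive
    programming instead of consuming fresh pages."""
    def __init__(self, base):
        self.base = bytes(base)
        self.deltas = []
        self.programmed = len(base)            # bytes written into the page

    def append_version(self, new_version):
        d = byte_delta(self.current(), new_version)
        cost = 3 * len(d)                      # 2-byte offset + 1-byte value each
        if self.programmed + cost > PAGE_SIZE:
            return False                       # page full: allocate a new one
        self.deltas.append(d)
        self.programmed += cost
        return True

    def current(self):
        buf = bytearray(self.base)
        for d in self.deltas:                  # replay deltas over the base
            for off, val in d:
                buf[off] = val
        return bytes(buf)

page = MetadataPage(b"\x00" * BASE_SIZE)
v2 = bytearray(page.current()); v2[10] = 0xFF
assert page.append_version(bytes(v2)) and page.current() == bytes(v2)
```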

Available Media

Beehive: Erasure Codes for Fixing Multiple Failures in Distributed Storage Systems

Jun Li and Baochun Li, University of Toronto

Distributed storage systems have been increasingly deploying erasure codes (such as Reed-Solomon codes) for fault tolerance. Though Reed-Solomon codes require much less storage space than replication, a significant amount of network transfer and disk I/O will be imposed when fixing unavailable data by reconstruction. Traditionally, it is expected that unavailable data are fixed separately. However, since failures in the data center are observed to be correlated, fixing unavailable data from multiple failures is unavoidable and even common. In this paper, we show that reconstructing data of multiple failures in batches can cost significantly less network transfer and disk I/O than fixing them separately. We present Beehive, a new design of erasure codes that can fix unavailable data of multiple failures in batches while consuming the optimal network transfer with nearly optimal storage overhead. Evaluation results show that Beehive codes can save network transfer by up to 69.4% and disk I/O by 75% during reconstruction.
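
The intuition for why batching helps can be seen with simple repair-traffic arithmetic for a (k, m) Reed-Solomon code; this back-of-the-envelope model is illustrative only and is not Beehive's actual construction or cost analysis.

```python
def separate_repair_blocks(k, failures):
    # Classic RS repair: every lost block is rebuilt on its own from k
    # surviving blocks, so the reads cannot be shared.
    return failures * k

def batched_repair_blocks(k, failures):
    # If one repair site rebuilds all lost blocks together, the same k
    # downloaded blocks can serve every reconstruction (ignoring the
    # cost of redistributing results; real schemes are more subtle).
    return k + (failures - 1)

k, failures = 10, 3
print(separate_repair_blocks(k, failures))   # 30 block transfers
print(batched_repair_blocks(k, failures))    # 12 block transfers
```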

Available Media

Edelta: A Word-Enlarging Based Fast Delta Compression Approach

Wen Xia and Chunguang Li, Huazhong University of Science and Technology; Hong Jiang, University of Nebraska–Lincoln; Dan Feng, Yu Hua, Leihua Qin, and Yucheng Zhang, Huazhong University of Science and Technology

Delta compression, a promising data reduction approach capable of finding the small differences (i.e., the delta) among very similar files and chunks, is widely used for optimizing replica synchronization, backup/archival storage, cache compression, etc. However, delta compression is costly because of its time-consuming word-matching operations for delta calculation. Our in-depth examination suggests that there exists strong word-content locality for delta compression, which means that contiguous duplicate words appear in approximately the same order in their similar versions. This observation motivates us to propose Edelta, a fast delta compression approach based on a word-enlarging process that exploits word-content locality. Specifically, Edelta will first tentatively find a matched (duplicate) word, and then greedily stretch the matched word boundary to find a likely much longer (enlarged) duplicate word. Hence, Edelta effectively reduces a potentially large number of the traditional time-consuming word-matching operations to a single word-enlarging operation, which significantly accelerates the delta compression process. Our evaluation based on two case studies shows that Edelta achieves an encoding speedup of 3X–10X over the state-of-the-art Ddelta, Xdelta, and Zdelta approaches without noticeably sacrificing the compression ratio.
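
The word-enlarging step can be sketched as: find one fixed-size matching word through a hash index, then greedily stretch the match byte by byte in both directions. The word size and indexing below are simplified relative to Edelta.

```python
WORD = 8  # illustrative fixed word size in bytes

def enlarge_match(base, target):
    """Find one duplicate word, then stretch it as far as it stays equal."""
    index = {base[i:i + WORD]: i for i in range(0, len(base) - WORD + 1)}
    for t in range(0, len(target) - WORD + 1):
        b = index.get(target[t:t + WORD])
        if b is None:
            continue
        # Greedily extend the matched region to the left and to the right.
        lo_b, lo_t = b, t
        while lo_b > 0 and lo_t > 0 and base[lo_b - 1] == target[lo_t - 1]:
            lo_b, lo_t = lo_b - 1, lo_t - 1
        hi_b, hi_t = b + WORD, t + WORD
        while hi_b < len(base) and hi_t < len(target) and base[hi_b] == target[hi_t]:
            hi_b, hi_t = hi_b + 1, hi_t + 1
        return lo_t, hi_t - lo_t, lo_b   # offset in target, match length, offset in base
    return None

base   = b"header AAAA shared-region-that-barely-changes BBBB tail"
target = b"header AAAX shared-region-that-barely-changes BBBY tail"
print(enlarge_match(base, target))
```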

Available Media
3:30 pm–4:00 pm Monday

Break with Refreshments

Mezzanine East/West

4:00 pm–5:15 pm Monday

High Performance

Session Chair: Raju Rangaswami, Florida International University

Scaling Out to a Single-Node 80Gbps Memcached Server with 40Terabytes of Memory

Michaela Blott, Ling Liu, and Kimon Karras, Xilinx Research, Dublin; Kees Vissers, Xilinx Research, San Jose

Current web infrastructure relies increasingly on distributed in-memory key-value stores such as memcached, yet typical x86-based implementations of TCP/IP-compliant memcached offer limited performance scalability. FPGA-based data-flow architectures exceed every other published and fully compliant implementation in throughput, scaling to 80Gbps while offering much higher power efficiency and lower latency. However, value store capacity remains limited by the DRAM available in today’s devices.

In this paper, we present and quantify novel hybrid memory systems that combine conventional DRAM and serial-attached flash to increase value store capacity to 40 Terabytes with up to 200 million entries while providing access at 80Gbps. This is achieved by distributing objects across DRAM and flash based on their size, and by data-flow architectures with customized memory controllers that compensate for large variations in access latencies and bandwidths. We present measured experimental proof points, mathematically validate these concepts for published value size distributions from Facebook, Wikipedia, Twitter, and Flickr, and compare them to existing solutions.
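
A minimal sketch of size-based object distribution across DRAM and flash is shown below; the size threshold and the dictionary standing in for serial-attached flash are placeholders, not the FPGA design.

```python
SIZE_THRESHOLD = 128   # illustrative cut-off in bytes

class HybridStore:
    """Small values stay in DRAM; large values go to the slower but much
    larger flash tier."""
    def __init__(self):
        self.dram = {}
        self.flash = {}        # stand-in for serial-attached flash

    def set(self, key, value):
        if len(value) <= SIZE_THRESHOLD:
            self.flash.pop(key, None)
            self.dram[key] = value
        else:
            self.dram.pop(key, None)
            self.flash[key] = value

    def get(self, key):
        if key in self.dram:
            return self.dram[key]
        return self.flash.get(key)    # slower path

store = HybridStore()
store.set(b"small", b"x" * 16)       # lands in DRAM
store.set(b"large", b"x" * 4096)     # lands in flash
```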

Available Media

On the Non-Suitability of Non-Volatility

John Bent, EMC Corporation; Brad Settlemyer and Nathan DeBardeleben, Los Alamos National Laboratory; Sorin Faibish, Uday Gupta, Dennis Ting, and Percy Tzelnic, EMC Corporation

For many emerging and existing architectures, NAND flash is the storage media used to fill the cost-performance gap between DRAM and spinning disk. However, while NAND flash is the best of the available options, for many workloads its specific design choices and trade-offs are not wholly suitable. One such workload is that of long-running scientific applications, which use checkpoint-restart for failure recovery. For these workloads, HPC data centers are deploying NAND flash as a storage acceleration tier, commonly called burst buffers, to provide high levels of write bandwidth for checkpoint storage. In this paper, we compare the costs of adding reliability to such a layer versus the benefits of not doing so. We find that, even though NAND flash is non-volatile, HPC burst buffers should not be reliable when the performance overhead of adding reliability is greater than 2%.
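
The cost/benefit question reduces to a break-even comparison; the simple expected-loss model and the numbers below are illustrative and are not the paper's analysis.

```python
def reliability_pays_off(checkpoint_hours, overhead, loss_prob, redo_hours):
    # Adding reliability slows every checkpoint by `overhead`; skipping it
    # risks redoing work whenever the buffer loses a checkpoint.
    cost_of_reliability = checkpoint_hours * overhead
    expected_loss       = loss_prob * redo_hours
    return cost_of_reliability < expected_loss

# Illustrative: 0.1 h of checkpointing per period, 2% overhead,
# 1-in-1000 chance of losing a checkpoint that costs 1 h to redo.
print(reliability_pays_off(0.1, 0.02, 0.001, 1.0))   # False: stay unreliable
```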

Available Media

MDHIM: A Parallel Key/Value Framework for HPC

Hugh Greenberg, Los Alamos National Laboratory; John Bent, EMC Corporation; Gary Grider, Los Alamos National Laboratory

The long-expected convergence of High Performance Computing and Big Data Analytics is upon us. Unfortunately, the computing environments created for each workload are not necessarily conducive to the other. In this paper, we evaluate the ability of traditional high performance computing architectures to run big data analytics. We discover and describe limitations which prevent the seamless utilization of existing big data analytics tools and software. Specifically, we evaluate the effectiveness of distributed key-value stores for manipulating large data sets across tightly coupled parallel supercomputers. Although existing distributed key-value stores have proven highly effective in cloud environments, we find their performance on HPC clusters to be degraded. Accordingly, we have built an HPC-specific key-value store called the Multi-Dimensional Hierarchical Indexing Middleware (MDHIM). Using standard big data benchmarks, we find that MDHIM performance more than triples that of Cassandra on HPC systems.
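
As a rough picture of a range-partitioned parallel key/value store (a generic sketch, not MDHIM's actual MPI-based design), keys are routed to the server whose range covers them, so ordered scans stay local to a few servers:

```python
import bisect

class RangePartitionedKV:
    """Keys go to the server whose key range covers them, so ordered
    range scans touch few servers (unlike hash partitioning)."""
    def __init__(self, split_points):
        self.splits = sorted(split_points)            # e.g. [b"g", b"p"]
        self.servers = [dict() for _ in range(len(self.splits) + 1)]

    def _rank(self, key):
        return bisect.bisect_right(self.splits, key)

    def put(self, key, value):
        self.servers[self._rank(key)][key] = value

    def get(self, key):
        return self.servers[self._rank(key)].get(key)

kv = RangePartitionedKV([b"g", b"p"])
kv.put(b"alpha", 1); kv.put(b"zulu", 2)   # land on ranks 0 and 2
```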

Available Media
6:00 pm–7:00 pm Monday

Joint Poster Session and Happy Hour with HotCloud

Mezzanine East/West

 

Tuesday, July 7, 2015

8:00 am–9:00 am Tuesday

Continental Breakfast

Mezzanine East/West

9:00 am–10:30 am Tuesday

Joint Keynote Address with HotCloud

Santa Clara Ballroom

Kubernetes and the Path to Cloud Native

Eric Brewer, Google

We are in the midst of an important shift to higher levels of abstraction than virtual machines. Kubernetes aims to simplify the deployment and management of services, including the construction of applications as sets of interacting but independent services. We explain some of the key concepts in Kubernetes and show how they work together to simplify evolution and scaling.

Eric Brewer is a vice president of infrastructure at Google. He pioneered the use of clusters of commodity servers for Internet services, based on his research at Berkeley. His “CAP Theorem” covers basic tradeoffs required in the design of distributed systems and followed from his work on a wide variety of systems, from live services, to caching and distribution services, to sensor networks. He is a member of the National Academy of Engineering, and winner of the ACM Infosys Foundation award for his work on large-scale services.

10:30 am–11:00 am Tuesday

Break with Refreshments

Mezzanine East/West

11:00 am–12:15 pm Tuesday

Deduplication

Session Chair: Geoff Kuenning, Harvey Mudd College

Metadata Considered Harmful…to Deduplication

Xing Lin, University of Utah; Fred Douglis and Jim Li, EMC Corporation; Xudong Li, Nankai University; Robert Ricci, University of Utah; Stephen Smaldone and Grant Wallace, EMC Corporation

Deduplication is widely used to improve space efficiency in storage systems. While much attention has been paid to making the process of deduplication fast and scalable, the effectiveness of deduplication can vary dramatically depending on the data stored. We show that many file formats suffer from a fundamental design property that is incompatible with deduplication: they intersperse metadata with data in ways that result in otherwise identical data being different. We examine three models for improving deduplication in the presence of embedded metadata: deduplication-friendly data formats, application-level post-processing, and format-aware deduplication. Working with real-world file formats and datasets, we find that by separating metadata from data, deduplication ratios are improved significantly—in some cases as dramatically as 5.6x.
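
The effect of interleaved metadata on chunk-level deduplication can be demonstrated with a toy fixed-size chunker; the "file format" and the per-file metadata byte below are invented for illustration.

```python
import hashlib, random

CHUNK = 64
random.seed(0)
payload = bytes(random.randrange(256) for _ in range(4096))   # identical user data

def with_metadata(file_id):
    # Toy "format": a per-file metadata byte every 32 bytes of data.
    out = bytearray()
    for i in range(0, len(payload), 32):
        out.append(file_id)
        out += payload[i:i + 32]
    return bytes(out)

def dedup_ratio(files):
    total, unique = 0, set()
    for f in files:
        for i in range(0, len(f), CHUNK):
            total += 1
            unique.add(hashlib.sha256(f[i:i + CHUNK]).digest())
    return total / len(unique)

print(dedup_ratio([with_metadata(i) for i in (1, 2, 3)]))  # ~1.0: metadata breaks chunk identity
print(dedup_ratio([payload] * 3))                          # 3.0: separated data dedups fully
```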

Available Media

Storage Efficiency Opportunities and Analysis for Video Repositories

Suganthi Dewakar, Sethuraman Subbiah, Gokul Soundararajan, and Mike Wilson, NetApp; Mark Storer, Box Inc.; Kishore Kasi Udayashankar, Exablox; Kaladhar Voruganti, Equinix; Minglong Shao

Conventional wisdom states that deduplication techniques do not yield good storage efficiency savings for video data. Indeed, even if the video content is the same, the video files differ from each other due to differences in closed-captioning, text overlay, language, and video resolution. As a result, deduplication techniques have traditionally not yielded good storage efficiency savings for such data. In this paper, we look at the effectiveness of four different deduplication algorithms in different scenarios. We evaluate fixed-sized, variable-sized, and two content-aware deduplication techniques. Our study shows that content-aware and variable-sized deduplication techniques do provide significant storage efficiency savings.
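
For contrast with fixed-size chunking, below is a minimal content-defined (variable-size) chunker of the kind such techniques rely on; the fingerprint condition and parameters are illustrative only.

```python
import hashlib, random

MASK, MIN_CHUNK, WINDOW = (1 << 10) - 1, 128, 16   # illustrative parameters

def chunks(data):
    """Content-defined chunking: cut where a fingerprint of the trailing
    window matches a pattern (a real system uses an O(1) rolling hash;
    rehashing the window here just keeps the sketch short)."""
    out, start = [], 0
    for i in range(WINDOW, len(data)):
        if i - start < MIN_CHUNK:
            continue
        fp = int.from_bytes(hashlib.blake2b(data[i - WINDOW:i], digest_size=4).digest(), "big")
        if fp & MASK == 0:
            out.append(hashlib.sha256(data[start:i]).digest())
            start = i
    out.append(hashlib.sha256(data[start:]).digest())
    return out

random.seed(1)
video = bytes(random.randrange(256) for _ in range(50000))
edited = b"overlay" + video                    # e.g. a small per-copy difference
shared = len(set(chunks(video)) & set(chunks(edited)))
print(shared, "of", len(chunks(video)), "chunks still dedup after the edit")
```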

Available Media

Accordion: Multi-Scale Recipes for Adaptive Detection of Duplication

Russell Lewis and John H. Hartman, University of Arizona

A recipe is metadata that describes the contents of a file as a sequence of blocks identified by their hash. Using recipes, one can rapidly compare the contents of two files without reading the files themselves. Unfortunately, recipes present a space/precision tradeoff: small block sizes will maximize the duplication that is discoverable, but large block sizes produce small recipes that can be compared more quickly. In this paper, we present Accordion, a toolset for the creation and use of multi-scale recipes—that is, recipes that include blocks at several different scales. We demonstrate two duplication-detection algorithms—one optimized for situations where lots of duplication is expected, and another for those where the existence of duplication is uncertain.
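
A recipe at two scales can be built and compared coarse-first, which is the gist of the multi-scale idea; the block sizes and comparison strategy below are illustrative, not Accordion's actual format.

```python
import hashlib

SCALES = [4096, 256]     # coarse first, then fine (illustrative sizes)

def recipe(data):
    return {s: [hashlib.sha256(data[i:i + s]).digest()
                for i in range(0, len(data), s)]
            for s in SCALES}

def shared_bytes(r1, r2, coarse=SCALES[0], fine=SCALES[1]):
    # Compare cheap coarse hashes first; drop down to fine blocks only
    # where the coarse blocks disagree.
    total, ratio = 0, coarse // fine
    for i, (a, b) in enumerate(zip(r1[coarse], r2[coarse])):
        if a == b:
            total += coarse
        else:
            f1 = r1[fine][i * ratio:(i + 1) * ratio]
            f2 = r2[fine][i * ratio:(i + 1) * ratio]
            total += sum(fine for x, y in zip(f1, f2) if x == y)
    return total

a = bytes(8192)
b = bytearray(a); b[5000] = 1
print(shared_bytes(recipe(a), recipe(bytes(b))))   # 4096 + 15*256 bytes still shared
```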

Available Media
12:15 pm–2:00 pm Tuesday

Luncheon for Workshop Attendees

Terra Courtyard

2:00 pm–3:30 pm Tuesday

Files, Caches, and Disks

Session Chair: Jason Flinn, University of Michigan

To ARC or Not to ARC

Ricardo Santana and Steven Lyons, Florida International University; Ricardo Koller, IBM T. J. Watson Research Center; Raju Rangaswami and Jason Liu, Florida International University

Cache replacement algorithms have focused on managing caches that are in the datapath. In datapath caches, every cache miss results in a cache update. Cache updates are expensive because they induce cache insertion and cache eviction overheads, which can be detrimental to both cache performance and cache device lifetime. Non-datapath caches, such as host-side flash caches, allow the flexibility of not having to update the cache on each miss. We propose the multi-modal adaptive replacement cache (mARC), a new cache replacement algorithm that extends the adaptive replacement cache (ARC) algorithm for non-datapath caches. Our initial trace-driven simulation experiments suggest that mARC improves cache performance over ARC while significantly reducing the number of cache updates for two sets of storage I/O workloads from MSR Cambridge and FIU.
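
The key non-datapath flexibility, choosing not to insert on a miss, can be sketched with a plain LRU cache plus a simple "seen twice" admission filter; this illustrates selective admission in general, not the mARC policy itself.

```python
from collections import OrderedDict

class SelectiveAdmissionCache:
    """A non-datapath cache may serve a miss from backing storage without
    updating the cache; here a block is admitted only on its second miss,
    sparing the flash device one-hit-wonder insertions and evictions."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()     # block -> data, in LRU order
        self.seen_once = set()         # missed once, not yet admitted

    def read(self, block, fetch):
        if block in self.cache:
            self.cache.move_to_end(block)
            return self.cache[block]
        data = fetch(block)            # miss: read from backing store
        if block in self.seen_once:    # admit only re-referenced blocks
            self.seen_once.discard(block)
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)    # evict LRU
            self.cache[block] = data
        else:
            self.seen_once.add(block)
        return data

cache = SelectiveAdmissionCache(capacity=2)
cache.read(7, lambda b: b"...")        # first miss: served, not inserted
cache.read(7, lambda b: b"...")        # second miss: admitted
```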

Available Media

Terra Incognita: On the Practicality of User-Space File Systems

Vasily Tarasov, Stony Brook University and IBM Research–Almaden; Abhishek Gupta and Kumar Sourav, Stony Brook University; Sagar Trehan, Stony Brook University and Nimble Storage; Erez Zadok, Stony Brook University

To speed up development and increase reliability, the microkernel approach advocated moving many OS services to user space. At that time, the main disadvantage of microkernels turned out to be their poor performance. In the last two decades, however, CPU and RAM technologies have improved significantly, and researchers demonstrated that by carefully designing and implementing a microkernel, its overhead can be reduced significantly. Storage devices often remain a major bottleneck in systems due to their relatively slow speed. Thus, user-space I/O services, such as file systems and the block layer, might see significantly lower relative overhead than other OS services. In this paper we examine the reality of a partial return of the microkernel architecture—but for I/O subsystems only. We observed that over 100 user-space file systems have been developed in recent years. However, performance analysis and careful design of user-space file systems have been disproportionately overlooked by the storage community. Through extensive benchmarks we present Linux FUSE performance for several systems and 45 workloads. We establish that in many setups FUSE already achieves acceptable performance, but further research is needed for file systems to comfortably migrate to user space.
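
For readers who have not written a user-space file system, this is roughly what the developer-facing side of FUSE looks like with the third-party fusepy bindings (assumed installed); it serves a single read-only file and is unrelated to the paper's benchmarks.

```python
import errno, stat, time
from fuse import FUSE, FuseOSError, Operations   # pip install fusepy

CONTENT = b"hello from user space\n"

class HelloFS(Operations):
    """Each kernel VFS call is routed through /dev/fuse to these Python
    methods; that round trip is the overhead the paper measures."""
    def getattr(self, path, fh=None):
        now = time.time()
        if path == "/":
            return dict(st_mode=stat.S_IFDIR | 0o755, st_nlink=2,
                        st_ctime=now, st_mtime=now, st_atime=now)
        if path == "/hello":
            return dict(st_mode=stat.S_IFREG | 0o444, st_nlink=1,
                        st_size=len(CONTENT),
                        st_ctime=now, st_mtime=now, st_atime=now)
        raise FuseOSError(errno.ENOENT)

    def readdir(self, path, fh):
        return [".", "..", "hello"]

    def read(self, path, size, offset, fh):
        return CONTENT[offset:offset + size]

if __name__ == "__main__":
    FUSE(HelloFS(), "/tmp/hellofs", foreground=True)   # mount point must exist
```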

Available Media

Caveat-Scriptor: Write Anywhere Shingled Disks

Saurabh Kadekodi, Swapnil Pimpale, and Garth A. Gibson, Carnegie Mellon University

In this paper we will present a simple model for Host-Managed Caveat-Scriptor, describe a simple FUSE-based file system for Host-Managed Caveat-Scriptor, construct and describe a file system aging tool, and report initial performance comparisons between Strict-Append and Caveat-Scriptor. We will show the potential for Caveat-Scriptor to help limit heavy tail response times for shingled disks.

Available Media

Suspend-aware Segment Cleaning in Log-structured File System

Dongil Park, Seungyong Cheon, and Youjip Won, Hanyang University

The suspend feature of modern smart devices practically suppresses the background segment cleaning of the log-structured file system. In this work, we develop Suspend-aware Segment Cleaning for the log-structured file system. We seamlessly integrate segment cleaning into the suspend module of the smartphone OS so that the log-structured file system can reclaim free segments without interfering with foreground user activity. Suspend-aware Segment Cleaning consists of two key ingredients: (i) Virtual Segment Cleaning and (ii) Utilization-based Segment Cleaning. We implement Suspend-aware Segment Cleaning on a commodity smartphone (Moto G) that uses the log-structured file system F2FS as its stock file system. F2FS with Suspend-aware Segment Cleaning consolidates 6 more segments than the stock smartphone does. With Suspend-aware Segment Cleaning, F2FS consolidates 2 more segments even with suspend mode on than when the phone is always on.
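
The utilization-based half of the idea amounts to picking the emptiest segments while the device sits in suspend; the sketch below is a generic illustration, not the F2FS cleaner.

```python
def segments_to_clean(segments, budget):
    """On the suspend hook, clean up to `budget` segments, preferring the
    ones with the fewest valid blocks (cheapest to relocate).

    `segments` maps segment id -> number of valid blocks."""
    victims = sorted(segments, key=segments.get)
    return victims[:budget]

# Illustrative utilization snapshot taken when the device suspends.
print(segments_to_clean({"s0": 510, "s1": 3, "s2": 120, "s3": 0}, budget=2))
# ['s3', 's1'] -> relocate only 3 valid blocks, reclaim 2 whole segments
```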

Available Media
3:30 pm–4:00 pm Tuesday

Break with Refreshments

Mezzanine East/West

4:10 pm–5:25 pm Tuesday

Not Enough Speed

Joint Session with HotCloud

Dynacache: Dynamic Cloud Caching

Asaf Cidon and Assaf Eisenman, Stanford University; Mohammad Alizadeh, MIT CSAIL; Sachin Katti, Stanford University

Web-scale applications are heavily reliant on memory cache systems such as Memcached to improve throughput and reduce user latency. Small performance improvements in these systems can result in large end-to-end gains; for example, a marginal increase in hit rate of 1% can reduce the application-layer latency by over 25%. Yet, surprisingly, many of these systems use generic first-come-first-serve designs with simple fixed-size allocations that are oblivious to the application’s requirements. In this paper, we use detailed empirical measurements from a widely used caching service, Memcachier, to show that these simple default policies can lead to significant performance penalties, in some cases increasing the number of cache misses by as much as 3x.

Motivated by these empirical analyses, we propose Dynacache, a cache controller that significantly improves the hit rate of web applications, by profiling applications and dynamically tailoring memory resources and eviction policies. We show that for certain applications in our real-world traces from Memcachier, Dynacache reduces the number of misses by more than 65% with a minimal overhead on the average request performance. We also show that Memcachier would need to more than double the number of Memcached servers in order to achieve the same reduction of misses that is achieved by Dynacache. In addition, Dynacache allows Memcached operators to better plan their resource allocation and manage server costs, by estimating the cost of cache hits as a function of memory.
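
The 1%-hit-rate-to-25%-latency claim follows from simple averaging when misses are orders of magnitude slower than hits; the latencies in the sketch below are illustrative, not Memcachier's measurements.

```python
def mean_latency(hit_rate, cache_us=0.1, backend_us=10_000):
    # Average request latency: fast cache hits plus slow backend misses.
    return hit_rate * cache_us + (1 - hit_rate) * backend_us

before, after = mean_latency(0.98), mean_latency(0.99)
print(f"{(before - after) / before:.0%} lower mean latency")   # ~50% with these numbers
```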

Available Media

Pricing Games for Hybrid Object Stores in the Cloud: Provider vs. Tenant

Yue Cheng and M. Safdar Iqbal, Virginia Tech; Aayush Gupta, IBM Almaden Research Center; Ali R. Butt, Virginia Tech

Cloud object stores are increasingly becoming the de facto storage choice for big data analytics platforms, mainly because they simplify the management of large blocks of data at scale. To ensure cost-effectiveness of the storage service, the object stores use hard disk drives (HDDs). However, the lower performance of HDDs affects tenants who have strict performance requirements for their big data applications. The use of faster storage devices such as solid state drives (SSDs) is thus desirable to tenants, but incurs significant maintenance costs for the provider. We design a tiered object store for the cloud, which comprises both fast and slow storage devices. The resulting hybrid store exposes the tiering to tenants with a dynamic pricing model that is based on the tenants’ usage and the provider’s desire to maximize profits. The tenants leverage knowledge of their workloads and current pricing information to select a data placement strategy that would meet the application requirements at the lowest cost. Our approach allows both a service provider and its tenants to engage in a pricing game, which our results show yields a win–win situation.
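
The tenant's side of the game boils down to picking, per dataset, the cheapest tier that still meets the performance target under the provider's current prices; the prices and throughput figures below are invented.

```python
# Illustrative tiers: (price per GB-hour, throughput per GB in MB/s).
TIERS = {"ssd": (0.010, 8.0), "hdd": (0.002, 0.5)}

def cheapest_feasible_tier(size_gb, needed_mbps):
    feasible = {name: price * size_gb
                for name, (price, perf) in TIERS.items()
                if perf * size_gb >= needed_mbps}
    return min(feasible, key=feasible.get) if feasible else None

print(cheapest_feasible_tier(100, needed_mbps=20))    # hdd is enough and cheaper
print(cheapest_feasible_tier(100, needed_mbps=200))   # must pay for ssd
```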

Available Media

The Cloud is Not Enough: Saving IoT from the Cloud

Ben Zhang, Nitesh Mor, John Kolb, Douglas S. Chan, Nikhil Goyal, Ken Lutz, Eric Allman, John Wawrzynek, Edward Lee, and John Kubiatowicz, University of California, Berkeley

The Internet of Things (IoT) represents a new class of applications that can benefit from cloud infrastructure. However, the current approach of directly connecting smart devices to the cloud has a number of disadvantages and is unlikely to keep up with either the growing speed of the IoT or the diverse needs of IoT applications.

In this paper we explore these disadvantages and argue that fundamental properties of the IoT prevent the current approach from scaling. What is missing is a well-architected system that extends the functionality of the cloud and provides seamless interplay among the heterogeneous components in the IoT space. We argue that raising the level of abstraction to a data-centric design—focused around the distribution, preservation and protection of information—provides a much better match to the IoT. We present early work on such a distributed platform, called the Global Data Plane (GDP), and discuss how it addresses the problems with the cloud-centric architecture.

Available Media