Workshop Program

All sessions will be held in Back Bay AB unless otherwise noted.


June 13, 2012

1:30 p.m.–1:45 p.m. Wednesday

Opening Remarks

Program Chair: Raju Rangaswami, Florida International University

1:45 p.m.–3:15 p.m. Wednesday

Dealing with Dynamics

Session Chair: Ajay Gulati, VMware

Multi-structured Redundancy

Eno Thereska, Phil Gosset, and Richard Harper, Microsoft Research, Cambridge, UK

One-size-fits-all solutions have not worked well in storage systems. This is true in the enterprise, where NoSQL, MapReduce, and column stores have added value to traditional database workloads. It is also true outside the enterprise: a recent paper [7] illustrated that even the single-desktop store is a rich mixture of file systems, databases, and key-value stores. Yet in research, one-size-fits-all solutions are always tempting and point optimizations keep emerging, with the current theme du jour being key-value stores [8].

Workloads naturally change their requirements over time (e.g., from update-intensive to query-intensive). This paper proposes research around a multi-structured storage architecture. Such an architecture is composed of many lightweight data structures such as BTrees, key-value stores, graph stores, and chunk stores. The call for modular storage systems is not dissimilar to the Exokernel [4] or Anvil [10] approaches. The key difference this paper argues for is that these data structures should co-exist in the same system. The system should then automatically use the right one during the right workload phase. To enable this technically, we propose to leverage the existing N-way redundancy in the data center and have each of the N replicas embody a different data structure.
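
To make the replica-routing idea concrete, here is a minimal Python sketch. All class names and phase labels are illustrative assumptions, not the paper's design: writes go to every replica, while reads are served by whichever structure suits the current workload phase.

    # Sketch: the same logical data is kept N ways, one data structure per replica;
    # a router sends reads to the structure that fits the current phase.
    import bisect

    class KVReplica:                          # favors point updates and lookups
        def __init__(self): self.d = {}
        def put(self, k, v): self.d[k] = v
        def get(self, k): return self.d.get(k)

    class BTreeReplica:                       # stand-in for a BTree; favors range queries
        def __init__(self): self.keys, self.vals = [], []
        def put(self, k, v):
            i = bisect.bisect_left(self.keys, k)
            if i < len(self.keys) and self.keys[i] == k:
                self.vals[i] = v
            else:
                self.keys.insert(i, k); self.vals.insert(i, v)
        def range(self, lo, hi):
            i, j = bisect.bisect_left(self.keys, lo), bisect.bisect_right(self.keys, hi)
            return list(zip(self.keys[i:j], self.vals[i:j]))

    class MultiStructuredStore:
        def __init__(self):
            self.replicas = {"update-intensive": KVReplica(),
                             "query-intensive": BTreeReplica()}
            self.phase = "update-intensive"   # would be detected from the workload
        def put(self, k, v):
            for r in self.replicas.values():  # N-way redundancy: every replica sees the write
                r.put(k, v)
        def reader(self):
            return self.replicas[self.phase]  # reads use the phase-appropriate structure

    store = MultiStructuredStore()
    for k in (3, 1, 2):
        store.put(k, f"v{k}")
    store.phase = "query-intensive"
    print(store.reader().range(1, 2))         # [(1, 'v1'), (2, 'v2')]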

 

Available Media

MixApart: Decoupled Analytics for Shared Storage Systems

Madalin Mihailescu, University of Toronto; Gokul Soundararajan, NetApp; Cristiana Amza, University of Toronto

Data analytics and enterprise applications have very different storage functionality requirements. For this reason, enterprise deployments of data analytics use a separate storage silo. This may generate additional costs and inefficiencies in data management, e.g., whenever data needs to be archived, copied, or migrated across silos. We introduce MixApart, a scalable data processing framework for shared enterprise storage systems. With MixApart, a single consolidated storage back-end manages enterprise data and services all types of workloads, thereby lowering hardware costs and simplifying data management. In addition, MixApart enables the local storage performance required by analytics through an integrated data caching and scheduling solution. Our preliminary evaluation shows that MixApart can be 45% faster than the traditional ingest-then-compute workflow used in enterprise IT analytics, while requiring only one third of the storage capacity of HDFS.
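
As a rough illustration of the decoupling, the sketch below (names and policy are assumptions, not MixApart's implementation) shows compute nodes reading through a local cache in front of a shared storage back end, with the scheduler preferring tasks whose input is already cached, so analytics see local-storage-like performance without a separate ingest copy.

    # Sketch: cache-aware task scheduling over a consolidated storage back end.
    class ComputeNode:
        def __init__(self):
            self.cache = {}                      # path -> data cached from shared storage

        def read(self, path, shared_storage):
            if path not in self.cache:           # miss: fetch once from the back end
                self.cache[path] = shared_storage[path]
            return self.cache[path]

    def pick_task(tasks, node):
        # prefer a task whose input is already cached on this node
        cached = [t for t in tasks if t["input"] in node.cache]
        return (cached or tasks)[0]

    shared = {"/logs/part-0": b"...", "/logs/part-1": b"..."}
    node = ComputeNode()
    node.read("/logs/part-1", shared)            # warms the cache for part-1
    tasks = [{"id": 0, "input": "/logs/part-0"},
             {"id": 1, "input": "/logs/part-1"}]
    print(pick_task(tasks, node)["id"])          # 1: its input is already local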

 

Available Media

LoadIQ: Learning to Identify Workload Phases from a Live Storage Trace

Pankaj Pipada, Achintya Kundu, K. Gopinath, and Chiranjib Bhattacharyya, Indian Institute of Science; Sai Susarla and P. C. Nagesh, NetApp

Storage infrastructure in large-scale cloud data center environments must support applications with diverse, time-varying data access patterns while observing quality-of-service requirements. Deeper storage hierarchies induced by solid-state and rotating media are enabling new storage management tradeoffs that do not apply uniformly to all application phases at all times. To meet service-level requirements in such heterogeneous application phases, storage management needs to be phase-aware and adaptive, i.e., it must identify specific storage access patterns of applications as they occur and customize their handling accordingly.

This paper presents LoadIQ, a novel, versatile, adaptive application phase detector for networked (file and block) storage systems. In a live deployment, LoadIQ analyzes traces and emits phase labels learnt on the fly using Support Vector Machines (SVMs), a state-of-the-art classifier. Such labels could be used to generate alerts or to trigger phase-specific system tuning. Our results show that LoadIQ is able to identify workload phases (such as in TPC-DS) with accuracy > 93%.
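
The sketch below shows the general shape of such a classifier, assuming scikit-learn and made-up window features and labels (LoadIQ's actual features and training procedure are described in the paper): trace windows are turned into feature vectors, an SVM is trained on labeled windows, and live windows are then tagged with a phase.

    # Sketch: SVM-based phase labeling of windowed trace features (illustrative only).
    import numpy as np
    from sklearn import svm

    def window_features(ops):
        # ops: (offset, size, is_read) records captured in one trace window
        offs  = np.array([o for o, _, _ in ops], dtype=float)
        sizes = np.array([s for _, s, _ in ops], dtype=float)
        reads = np.array([r for _, _, r in ops], dtype=float)
        jump  = np.abs(np.diff(offs)).mean() if len(offs) > 1 else 0.0
        return [reads.mean(), sizes.mean(), jump]

    rng = np.random.default_rng(0)
    def synth(read_frac, sequential):             # synthetic stand-in for real trace windows
        base = int(rng.integers(0, 1 << 20))
        return [(base + i if sequential else int(rng.integers(0, 1 << 20)),
                 4096, rng.random() < read_frac) for i in range(64)]

    train = [(synth(0.1, True),  "load")  for _ in range(20)] + \
            [(synth(0.9, False), "query") for _ in range(20)]
    X = [window_features(w) for w, _ in train]
    y = [label for _, label in train]
    clf = svm.SVC(kernel="rbf", gamma="scale").fit(X, y)

    print(clf.predict([window_features(synth(0.95, False))]))   # expected: ['query']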

 

Available Media
3:15 p.m.–3:30 p.m. Wednesday

Break

Constitution Foyer

3:30 p.m.–5:00 p.m. Wednesday

Cloudy with a Chance of QoS

Session Chair: Himabindu Pucha, Violin Memory

Gecko: A Contention-Oblivious Design for Cloud Storage

Ji Yong Shin, Cornell University; Mahesh Balakrishnan, Microsoft Research; Lakshmi Ganesh, UT Austin; Tudor Marian, Google; Hakim Weatherspoon, Cornell University

Disk contention is a fact of life in modern data centers, with multiple applications sharing the storage resources of a single physical machine. Log-structured storage designs are ideally suited for such high-contention settings, but historically they have suffered from performance problems due to cleaning overheads. In this paper, we introduce Gecko, a novel design for storage arrays in which a single log structure is distributed across a chain of drives, physically separating the tail of the log (where writes occur) from its body. This design provides the benefits of logging, namely fast, sequential writes for any number of contending applications, while eliminating the disruptive effect of log cleaning activity on application I/O.
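
A minimal sketch of the chained-log idea follows; segment size, drive names, and the cleaning policy are simplifications for illustration, not Gecko's actual mechanism. Writes are always appended on the drive currently holding the tail, while older segments form the body on other drives, so cleaning works against the body rather than against the write stream.

    # Sketch: a log chained across drives; the tail (writes) is kept apart from the body.
    SEGMENT_BLOCKS = 4

    class ChainedLog:
        def __init__(self, drives):
            self.drives = drives              # e.g., ["disk0", "disk1", "disk2"]
            self.tail_idx = 0                 # drive currently holding the log tail
            self.tail = []                    # blocks buffered for the tail segment
            self.body = []                    # (drive, segment) pairs, oldest first

        def append(self, block):
            self.tail.append(block)           # always a sequential write on the tail drive
            if len(self.tail) == SEGMENT_BLOCKS:
                self.body.append((self.drives[self.tail_idx], self.tail))
                self.tail = []
                self.tail_idx = (self.tail_idx + 1) % len(self.drives)

        def clean_oldest(self, live_blocks):
            # cleaning reads the oldest body segment rather than the tail's write stream
            drive, segment = self.body.pop(0)
            for block in segment:
                if block in live_blocks:
                    self.append(block)        # surviving blocks are re-appended at the tail
            return drive

    log = ChainedLog(["disk0", "disk1", "disk2"])
    for i in range(10):
        log.append(f"block{i}")
    print(log.clean_oldest(live_blocks={"block1", "block3"}))   # cleaned segment lived on disk0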

 

Available Media

A Parallel Page Cache: IOPS and Caching for Multicore Systems

Da Zheng, Randal Burns, and Alexander S. Szalay, Johns Hopkins University

We present a set-associative page cache for scalable parallelism of IOPS in multicore systems. The design eliminates lock contention and hardware cache misses by partitioning the global cache into many independent page sets, each requiring a small amount of metadata that fits in a few processor cache lines. We extend this design with message passing among processors in a non-uniform memory architecture (NUMA). We evaluate the set-associative cache on 12-core processors and a 48-core NUMA system to show that it realizes the scalable IOPS of direct I/O (no caching) and matches the cache hit rates of Linux's page cache. Set-associative caching maintains IOPS at scale, in contrast to Linux, for which IOPS crash beyond eight parallel threads.
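
The core of the design can be sketched as below; the set count, associativity, and eviction policy are illustrative assumptions, not the paper's parameters. Each page hashes to one small set with its own lock, so threads touching different sets never serialize on a global lock.

    # Sketch: a set-associative page cache with per-set locks and per-set LRU.
    import threading
    from collections import OrderedDict

    NUM_SETS, WAYS = 1024, 8                      # illustrative sizing

    class PageSet:
        def __init__(self):
            self.lock = threading.Lock()
            self.pages = OrderedDict()            # page_id -> data, in LRU order

        def lookup(self, page_id, load):
            with self.lock:                       # contention is confined to one set
                if page_id in self.pages:
                    self.pages.move_to_end(page_id)
                    return self.pages[page_id]
                data = load(page_id)              # miss: fetch from the backing store
                self.pages[page_id] = data
                if len(self.pages) > WAYS:
                    self.pages.popitem(last=False)   # evict the set's LRU page
                return data

    class SetAssociativeCache:
        def __init__(self):
            self.sets = [PageSet() for _ in range(NUM_SETS)]
        def read(self, page_id, load):
            return self.sets[hash(page_id) % NUM_SETS].lookup(page_id, load)

    cache = SetAssociativeCache()
    print(cache.read(42, load=lambda p: f"<page {p}>"))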

 

Available Media

Efficient QoS for Multi-Tiered Storage Systems

Ahmed Elnably and Hui Wang, Rice University; Ajay Gulati, VMware Inc.; Peter Varman, Rice University

Multi-tiered storage systems that combine SSD and traditional hard disk tiers are one of the fastest-growing trends in the storage industry. Although using multiple tiers provides a flexible trade-off in terms of IOPS performance and storage capacity, we believe that providing performance isolation and QoS guarantees among various clients gets significantly more challenging in such environments. Existing solutions focus mainly on either disk-based or SSD-based storage backends. In particular, the notion of IO cost used by existing solutions becomes very hard to estimate or use.

In this paper, we first argue that providing QoS in multi-tiered systems is quite challenging and that existing solutions are not adequate for such cases. To address their drawbacks, we use a model of storage QoS called reward scheduling and a corresponding algorithm, which favors clients whose IOs are less costly on the back-end storage array for reasons such as better locality, read-mostly sequential access, or a smaller working set relative to their SSD allocation. This allows for higher efficiency of the underlying system while providing desirable performance isolation. These results are validated using simulation-based modeling of a multi-tiered storage system. We make a case that QoS in multi-tiered storage is an open problem and hope to encourage future research in this area.
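
As a rough illustration of why cost awareness changes the allocation, here is a simplified tag-based scheduler; it is not the reward-scheduling algorithm from the paper, and all parameters are made up. Each client's tag advances by the measured cost of its IOs divided by its weight, and the request with the smallest tag is dispatched next, so clients whose IOs are cheap (for example, served from the SSD tier) naturally receive more IOPS.

    # Sketch: cost-aware proportional scheduling (illustrative, not the paper's algorithm).
    import heapq, itertools

    def schedule(request_costs, weights, n_dispatch):
        # request_costs: {client: iterator of per-IO service costs}
        counter = itertools.count()                       # tie-breaker for equal tags
        heap = [(0.0, next(counter), c) for c in request_costs]
        heapq.heapify(heap)
        order = []
        for _ in range(n_dispatch):
            tag, _, client = heapq.heappop(heap)
            cost = next(request_costs[client])            # serve one IO, observe its cost
            order.append(client)
            heapq.heappush(heap, (tag + cost / weights[client], next(counter), client))
        return order

    # Client A hits the SSD tier (cheap IOs); client B goes to disk (expensive IOs).
    costs = {"A": itertools.repeat(0.1), "B": itertools.repeat(1.0)}
    print(schedule(costs, weights={"A": 1.0, "B": 1.0}, n_dispatch=12))
    # With equal weights, A appears far more often: cheaper IOs translate into more IOPS.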

 

Available Media
6:30 p.m.–8:00 p.m. Wednesday

Joint Poster Session with USENIX ATC

Today's peer-reviewed HotStorage presentations will be represented by posters in the Poster Session.

June 14, 2012

9:00 a.m.–10:15 a.m. Thursday
10:15 a.m.–10:30 a.m. Thursday

Break

Constitution Foyer

10:30 a.m.–Noon Thursday

Dealing with Devices

Session Chair: Youjip Won, Hanyang University

Exploiting Peak Device Throughput from Random Access Workload

Young Jin Yu, Seoul National University; Dong In Shin, Taejin Infotec, Korea; Woong Shin, Nae Young Song, Hyeonsang Eom, and Heon Young Yeom, Seoul National University

In this work, we propose a new batching scheme called temporal merge, which dispatches discontiguous block requests using a single I/O operation. It overcomes the disadvantages of the narrow block interface and enables an OS to exploit the peak throughput of a storage device for small random requests as well as for a single large request. Temporal merge significantly enhances device and channel utilization regardless of the access sequentiality of a workload, which has not been achievable with traditional schemes.

We extended the block I/O interface of a DRAM-based SSD in cooperation with its vendor and implemented temporal merge in the I/O subsystem of Linux 2.6.32. The experimental results show that under a multi-threaded random-access workload, the proposed solution can achieve 87%–100% of the peak throughput of the SSD. We expect that the new temporal merge interface will lead to better designs of future host controller interfaces, such as NVMHCI, for next-generation storage devices.
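
A host-side sketch of the batching idea is shown below. The submit_batch callback stands in for the vendor-extended block interface described in the paper; no standard kernel call is assumed, and the batch size is arbitrary.

    # Sketch: gather small, discontiguous block requests and dispatch them as one command.
    class TemporalMerger:
        def __init__(self, submit_batch, max_batch=32):
            self.submit_batch = submit_batch     # stands in for the extended device interface
            self.max_batch = max_batch
            self.pending = []                    # queued (offset, length) extents

        def queue(self, offset, length):
            self.pending.append((offset, length))
            if len(self.pending) >= self.max_batch:
                self.flush()

        def flush(self):
            if self.pending:
                self.submit_batch(self.pending)  # one I/O carrying many discontiguous extents
                self.pending = []

    merger = TemporalMerger(lambda b: print(f"1 command, {len(b)} extents"), max_batch=4)
    for offset in (0, 7 << 20, 42 << 20, 3 << 20):   # random, non-contiguous offsets
        merger.queue(offset, 4096)                   # prints: 1 command, 4 extents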

 

Available Media

Finding Soon-to-Fail Disks in a Haystack

Moises Goldszmidt, Microsoft Research

This paper presents a detector of soon-to-fail disks based on a combination of statistical models. During operation, the detector takes as input a performance signal from each disk and sends an alarm when there is enough evidence (according to the models) that the disk is not healthy. The parameters of these models are automatically trained using signals from healthy and failed disks. In an evaluation on a population of 1190 production disks from a popular customer-facing Internet service, the detector was able to predict 15 out of the 17 failed disks (88.2% detection) with 30 false alarms (2.56% false positive rate).
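
A toy version of this style of detector is sketched below, assuming a single latency signal and Gaussian models; the paper's actual models, signals, and thresholds differ. Parameters are fit from healthy and failed disks, and an alarm fires when the log-likelihood ratio over a recent window favors the failed-disk model.

    # Sketch: likelihood-ratio alarm over a per-disk performance signal (illustrative).
    import math

    def fit_gaussian(samples):
        mu = sum(samples) / len(samples)
        var = sum((x - mu) ** 2 for x in samples) / len(samples) or 1e-9
        return mu, var

    def loglik(x, model):
        mu, var = model
        return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

    healthy = fit_gaussian([5.1, 4.8, 5.3, 5.0, 4.9])      # ms latencies from healthy disks
    failing = fit_gaussian([9.5, 12.0, 15.2, 11.1, 14.0])  # ms latencies before failures

    def alarm(window, threshold=3.0):
        evidence = sum(loglik(x, failing) - loglik(x, healthy) for x in window)
        return evidence > threshold

    print(alarm([5.2, 5.0, 5.4]))      # False: consistent with the healthy model
    print(alarm([10.8, 13.5, 12.2]))   # True: evidence favors the failing model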

 

Available Media

An Evaluation of Different Page Allocation Strategies on High-Speed SSDs

Myoungsoo Jung and Mahmut Kandemir, The Pennsylvania State University

Exploiting internal parallelism over hundreds NAND flash memory is becoming a key design issue in high-speed Solid State Disks (SSDs). In this work, we simulated a cycle-accurate SSD platform with twenty four page allocation strategies, geared toward exploiting both system-level parallelism and flash-level parallelism with a variety of design parameters. Our extensive experimental analysis reveals that 1) the previously-proposed channel-and-way striping based page allocation scheme is not the best from a performance perspective, 2) As opposed to the current perception that system and flash-level concurrency mechanisms are largely orthogonal, flash-level parallelism are interfered by the system-level concurrency mechanism employed, and 3) With most of the current parallel data access methods, internal resources are significantly under-utilized. Finally, we present several optimization points to achieve maximum internal parallelism.

 

Available Media
Noon–1:00 p.m. Thursday

FCW Luncheon

Back Bay CD

1:00 p.m.–2:30 p.m. Thursday

Panel: What's Next: Storage Evolution or Revolution?

Moderator: Ric Wheeler, Red Hat

Panelists: Mohit Aron, Nutanix; Michael Cornwell, Pure Storage; Kaladhar Voruganti, NetApp; Ed Lee, Tintri; Nisha Talagala, Fusion-IO

2:30 p.m.–2:45 p.m. Thursday

Break

Constitution Foyer

2:45 p.m.–3:45 p.m. Thursday

Archival Storage

Session Chair: Brandon Salmon, Tintri

Delta Compressed and Deduplicated Storage Using Stream-Informed Locality

Philip Shilane, Grant Wallace, Mark Huang, and Windsor Hsu, EMC Corporation

For backup storage, increasing compression allows users to protect more data without increasing their costs or storage footprint. Though removing duplicate regions (deduplication) and traditional compression have become widespread, further compression is attainable. We demonstrate how to efficiently add delta compression to deduplicated storage to compress similar (non-duplicate) regions. A challenge when adding delta compression is the large number of data regions to be indexed. We observed that stream-informed locality is effective for delta compression, so a dedicated index for delta compression is unnecessary, and we built the first storage system prototype to combine delta compression and deduplication using this technique. Beyond demonstrating extra compression benefits of 1.4–3.5X, we also investigate the throughput and data integrity challenges that arise.
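
The shape of such a pipeline is sketched below with stand-in chunking, sketching, and caching details (none of this is the authors' implementation): exact duplicates are dropped by fingerprint, a small cache of similarity features is primed from the neighborhood of recent duplicates, and chunks that match a cached feature are stored as deltas against the similar base.

    # Sketch: deduplication plus locality-primed delta compression (illustrative only).
    import difflib, hashlib

    def features(chunk, n=4):
        # crude similarity features: the n smallest hashes of 8-byte shingles
        shingles = sorted(hashlib.md5(chunk[i:i + 8]).digest()
                          for i in range(0, len(chunk) - 7, 8))
        return shingles[:n]

    def make_delta(base, chunk):
        ops = difflib.SequenceMatcher(None, base, chunk).get_opcodes()
        return [("copy", i1, i2) if tag == "equal" else ("literal", chunk[j1:j2])
                for tag, i1, i2, j1, j2 in ops]

    def apply_delta(base, ops):
        return b"".join(base[op[1]:op[2]] if op[0] == "copy" else op[1] for op in ops)

    store, locality_cache = {}, {}            # fingerprint -> chunk, feature -> fingerprint

    def write_chunk(chunk):
        fp = hashlib.sha1(chunk).digest()
        if fp in store:
            for f in features(chunk):         # duplicate: prime the locality cache
                locality_cache[f] = fp
            return ("dup", fp)
        store[fp] = chunk
        base_fp = next((locality_cache[f] for f in features(chunk) if f in locality_cache), None)
        if base_fp:
            return ("delta", base_fp, make_delta(store[base_fp], chunk))
        return ("unique", fp)

    old = b"header|" + bytes(range(256)) * 4
    new = old[:512] + b"DELTADEL" + old[520:]         # similar, but not identical
    print(write_chunk(old)[0])                        # unique
    print(write_chunk(old)[0])                        # dup (primes the cache)
    kind, base_fp, ops = write_chunk(new)
    print(kind, apply_delta(store[base_fp], ops) == new)   # delta True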

 

Available Media

Non-Linear Compression: Gzip Me Not!

Michael F. Nowlan, Bryan Ford, and Ramakrishna Gummadi, Yale University

Most compression algorithms used in storage systems today are based on an increasingly outmoded sequential processing model. Systems wishing to decompress blocks out-of-order or in parallel must reset the compressor’s state before each block, reducing adaptiveness and limiting compression ratios. To remedy this situation, we present Non-Linear Compression, a novel compression model enabling systems to impose an arbitrary partial order on inter-block dependencies. Mutually unordered blocks may be compressed and decompressed out-of-order or in parallel, and a compressor can adaptively compress each block based on all causally prior blocks. This graph structure captures the system’s data dependencies explicitly and completely, enabling the compressor to adapt using long-lived state without the constraint of sequential processing. Preliminary experiences with a simple Huffman compressor suggest that non-linear compression fits a diverse set of storage applications.
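
One way to approximate the model with a stock compressor is sketched below, using zlib preset dictionaries (the paper prototypes its own Huffman compressor, so this is only an analogy): each block names its causal parents and is compressed with a dictionary drawn from those parents alone, while blocks with no ordering between them share no compressor state and can be processed out of order or in parallel.

    # Sketch: per-block compression adapted to causally prior blocks via preset dictionaries.
    import zlib

    def compress_block(data, parents):
        zdict = b"".join(parents)[-32768:]       # adapt only to the block's causal ancestors
        c = zlib.compressobj(zdict=zdict) if zdict else zlib.compressobj()
        return c.compress(data) + c.flush()

    def decompress_block(blob, parents):
        zdict = b"".join(parents)[-32768:]
        d = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
        return d.decompress(blob)

    block_a = b"\n".join(b"user%04d balance=%06d" % (i, i * 37) for i in range(400))
    block_b = block_a.replace(b"balance", b"credits")   # causally depends on block_a
    block_c = b"an unrelated journal region"            # unordered w.r.t. block_a and block_b

    blob_b = compress_block(block_b, parents=[block_a]) # adapts to its parent's content
    blob_c = compress_block(block_c, parents=[])        # independent: no state reset needed
    assert decompress_block(blob_b, parents=[block_a]) == block_b
    print(len(compress_block(block_b, parents=[])), len(blob_b))   # sizes with and without the parent dictionary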

 

Available Media
3:45 p.m.–4:00 p.m. Thursday

Break

Constitution Foyer

4:00 p.m.–5:30 p.m. Thursday

A Coterie Collection

Session Chair: Jiri Schindler, NetApp

Don’t Trust Your Roommate, or, Access Control and Replication Protocols in “Home” Environments

Vassilios Lekakis, Yunus Basagalar, and Pete Keleher, University of Maryland

A “home” sharing environment consists of the data sharing relationships between family members, friends, and acquaintances. We argue that this environment, far from being simple, has sharing and trust relationships as complex as any general-purpose network.

Such environments need strong access control and privacy guarantees. We show that avoiding information leakage requires both to be integrated directly into (rather than layered on top of) replication protocols, and propose a system structure that meets these guarantees.

 

Available Media

Analyzing Compute vs. Storage Tradeoff for Video-aware Storage Efficiency

Atish Kathpal, Mandar Kulkarni, and Ajay Bakre, NetApp Inc.

Video content is quite unique from a storage footprint perspective. In a video distribution environment, a master video file needs to be transcoded into different resolutions, bitrates, codecs, and containers to enable distribution to a wide variety of devices and media players over different kinds of networks. Our experiments show that when 8 master videos are transcoded into the 376 most popular formats (derived from 8 resolutions and 6 containers), the transcoded versions occupy 8 times more storage than the master videos. One major challenge with efficiently storing such content is that traditional de-duplication algorithms cannot detect significant duplication between any two versions. Transcoding on-the-fly is a technique in which a distribution copy is created only when requested by a user. This technique saves storage, but at the expense of extra compute cost and latency resulting from transcoding after a user request is received. In this paper we develop cost metrics that allow us to compare storage vs. compute costs and suggest when a transcoding-on-the-fly solution can be cost effective. We also analyze how such a solution can be deployed in a practical storage system using access pattern information, or a variant of the ski-rental [1] online algorithm when such information is not available.
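
The flavor of such a policy can be shown with a small ski-rental style sketch; the cost numbers and the decision rule here are illustrative, not the paper's cost model. A rendition is re-transcoded on the fly while it is rarely requested, and it is materialized once the cumulative transcoding cost paid for it reaches its storage cost, which keeps total cost within roughly twice the offline optimum by the classic ski-rental argument.

    # Sketch: rent (transcode on the fly) vs. buy (store the rendition) per format.
    class RenditionPolicy:
        def __init__(self, storage_cost, transcode_cost):
            self.storage_cost = storage_cost      # cost of keeping this rendition stored
            self.transcode_cost = transcode_cost  # cost of one on-the-fly transcode
            self.spent = 0.0
            self.stored = False

        def serve_request(self):
            if self.stored:
                return "serve stored copy"
            self.spent += self.transcode_cost
            if self.spent >= self.storage_cost:   # "buy" once renting has cost as much
                self.stored = True
                return "transcode, then store rendition"
            return "transcode on the fly"

    policy = RenditionPolicy(storage_cost=10.0, transcode_cost=3.0)
    print([policy.serve_request() for _ in range(6)])
    # the fourth request triggers storing the rendition; later requests hit the stored copy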

 

Available Media

The TokuFS Streaming File System

John Esmet, Tokutek & Rutgers; Michael A. Bender, Tokutek & Stony Brook; Martin Farach-Colton, Tokutek & Rutgers; Bradley C. Kuszmaul, Tokutek & MIT

The TokuFS file system outperforms write-optimized file systems by an order of magnitude on microdata write workloads, and outperforms read-optimized file systems by an order of magnitude on read workloads. Microdata write workloads include creating and destroying many small files, performing small unaligned writes within large files, and updating metadata. TokuFS is implemented using Fractal Tree indexes, which are primarily used in databases. TokuFS employs block-level compression to reduce its disk usage.

 

Available Media
5:30 p.m.–5:35 p.m. Thursday

Concluding Remarks

Program Chair: Raju Rangaswami, Florida International University

6:30 p.m.–8:00 p.m. Thursday

FCW '12 Reception and HotStorage '12 Poster Session

Grand Ballroom

Today's peer-reviewed HotStorage presentations will be represented by posters in the Poster Session.