7:30 am–9:00 am, Monday
Continental Breakfast
Ballroom Foyer
8:45 am–9:00 am, Monday
Opening Remarks
Program Co-Chairs: Nitin Agrawal, Samsung Research, and Sam H. Noh, UNIST (Ulsan National Institute of Science and Technology)
9:00 am–10:40 am, Monday
Session Chair: Nitin Agrawal, Samsung Research
Francis Deslauriers, Peter McCormick, George Amvrosiadis, Ashvin Goel, and Angela Demke Brown, University of Toronto
Cluster computing frameworks such as Apache Hadoop and Apache Spark are commonly used to analyze large data sets. The analysis often involves running multiple, similar queries on the same data sets. This data reuse should improve query performance, but we find that these frameworks schedule query tasks independently of each other and are thus unable to exploit the data sharing across these tasks. We present Quartet, a system that leverages information on cached data to schedule together tasks that share data. Our preliminary results are promising, showing that Quartet can increase the cache hit rate of Hadoop and Spark jobs by up to 54%. Our results suggest a shift in the way we think about job and task scheduling today, as Quartet is expected to perform better as more jobs are dispatched on the same data.
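To make the co-scheduling idea concrete, here is a minimal sketch (a hypothetical scheduler, not Quartet's implementation; report_cache, submit, and next_task are invented names) that prefers to dispatch a waiting task to a node that already holds the task's input block in its cache:

```python
# Hypothetical cache-aware scheduler (illustrative only, not Quartet's code).
from collections import defaultdict

class CacheAwareScheduler:
    def __init__(self):
        self.cached = defaultdict(set)    # node -> block ids in its page cache
        self.pending = defaultdict(list)  # block id -> tasks waiting to read it

    def report_cache(self, node, block_id):
        # Workers periodically report which input blocks they hold in memory.
        self.cached[node].add(block_id)

    def submit(self, task, block_id):
        self.pending[block_id].append(task)

    def next_task(self, idle_node):
        # Prefer a task whose input block is already hot on this node, so
        # tasks sharing data end up co-scheduled and hit the cache.
        for block_id in self.cached[idle_node]:
            if self.pending[block_id]:
                return self.pending[block_id].pop()
        for tasks in self.pending.values():  # otherwise, any task (cold read)
            if tasks:
                return tasks.pop()
        return None
```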
William Jannen and Michael A. Bender, Stony Brook University; Martin Farach-Colton, Rutgers University; Rob Johnson, Stony Brook University; Bradley C. Kuszmaul, Massachusetts Institute of Technology; Donald E. Porter, Stony Brook University
We propose a class of query, called a derange query, that maps a function over a set of records and lazily aggregates the results. Derange queries defer work until it is either convenient or necessary, and, as a result, can reduce total I/O costs of the system. Derange queries operate on a view of the data that is consistent with the point in time that they are issued, regardless of when the computation completes. They are most useful for performing calculations where the results are not needed until some future deadline. When necessary, derange queries can also execute immediately. Users can view partial results of in-progress queries at low cost.
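As an illustration of the lazy-aggregation idea, the following sketch (an assumed API, not the paper's code) captures a snapshot at issue time, aggregates opportunistically in small steps, and can be forced to completion at a deadline:

```python
# Illustrative derange-query object; the API is assumed, not the paper's.
class DerangeQuery:
    def __init__(self, snapshot, map_fn, reduce_fn, init):
        self.snapshot = snapshot          # records as of the issue time
        self.map_fn, self.reduce_fn = map_fn, reduce_fn
        self.acc, self.done = init, 0     # partial aggregate, records consumed

    def step(self, budget):
        # Opportunistically process up to `budget` records, e.g. ones the
        # system is reading anyway, and fold them into the partial result.
        for rec in self.snapshot[self.done:self.done + budget]:
            self.acc = self.reduce_fn(self.acc, self.map_fn(rec))
        self.done = min(self.done + budget, len(self.snapshot))

    def partial(self):
        return self.acc                   # cheap view of the in-progress result

    def force(self):
        # Deadline reached: finish the remaining work immediately.
        self.step(len(self.snapshot) - self.done)
        return self.acc

# Example: total record size, computed lazily over a point-in-time snapshot.
# q = DerangeQuery(list(records), len, lambda acc, x: acc + x, 0)
```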
Kalapriya Kannan, Suparna Bhattacharya, Kumar Raj, Muthukumar Murugan, and Doug Voigt, Hewlett Packard Enterprise
The key to successful deployment of big data solutions lies in the timely distillation of meaningful information. This is made difficult by the mismatch between volume and velocity of data at scale and challenges posed by disparate speeds of IO, CPU, memory and communication links of data storage and processing systems. Instead of viewing storage as a bottleneck in this pipeline, we believe that storage systems are best positioned to discover and exploit intrinsic data properties to enhance information density of stored data. This has the potential to reduce the amount of new information that needs to be processed by an analytics workflow. Towards exploring this possibility, we propose SEeSAW, a Similarity Exploiting Storage for Accelerating Analytics Workflows that makes similarity a fundamental storage primitive. We show that SEeSAW transparently eliminates the need for applications to process uninformative data, thereby incurring substantially lower costs on IO, memory, computation and communication while speeding up (about 97% as observed in our experiment) the rate at which actionable outcomes can be derived by analyzing data. By increasing capacity of analytics workloads to absorb more data within the same resource envelope, SEeSAW can open up rich opportunities to reap greater benefits from machine and human generated data accumulated from various sources.
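A toy sketch of similarity as a storage primitive (purely illustrative; the fingerprint function and grouping scheme are assumptions, not SEeSAW's actual mechanism): the store collapses near-duplicate records into one representative per group, so the analytics workflow reads far less data:

```python
# Toy similarity-exploiting store; the fingerprint function is an assumption.
class SimilarityStore:
    def __init__(self, fingerprint):
        self.fingerprint = fingerprint    # e.g. a rounded/quantized feature
        self.groups = {}                  # fingerprint -> (representative, count)

    def put(self, record):
        fp = self.fingerprint(record)
        rep, count = self.groups.get(fp, (record, 0))
        self.groups[fp] = (rep, count + 1)

    def informative_view(self):
        # What analytics reads: one representative per similarity group,
        # plus how many near-duplicates it stands for.
        return list(self.groups.values())
```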
Erci Xu, The Ohio State University; Mohit Saxena and Lawrence Chiu, IBM Almaden Research Center
In-memory analytics frameworks such as Apache Spark are rapidly gaining popularity as they provide an order-of-magnitude performance speedup over disk-based systems for iterative workloads. For example, Spark uses the Resilient Distributed Dataset (RDD) abstraction to cache data in memory and iteratively compute on it in a distributed cluster.
In this paper, we make the case that existing abstractions such as RDD are coarse-grained and only allow discrete cache levels to be used for caching data. This results in inefficient memory utilization and lower than optimal performance. In addition, relying on the programmer to enforce caching decisions for an RDD makes it infeasible for the system to adapt to runtime changes. To overcome these challenges, we propose Neutrino, which employs fine-grained memory caching of RDD partitions and adapts to the use of different in-memory cache levels based on runtime characteristics of the cluster. First, it extracts a data flow graph to capture the data access dependencies between RDDs across different stages of a Spark application without relying on cache enforcement decisions from the programmer. Second, it uses a dynamic-programming-based algorithm to guide caching decisions across the cluster and adaptively convert or discard the RDD partitions from the different cache levels.
We have implemented a prototype of Neutrino as an extension to Spark and use four different machine-learning workloads for performance evaluation. Neutrino improves the average job execution time by up to 70% over the use of Spark’s native memory cache levels.
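The dynamic-programming flavor of the caching decision can be illustrated with a simplified, knapsack-style sketch (a stand-in under stated assumptions, not Neutrino's algorithm): given partition sizes in whole megabytes and reuse counts taken from the data flow graph, choose which partitions to keep deserialized in memory:

```python
# Simplified knapsack-style stand-in for a DP-guided caching decision.
def choose_partitions(partitions, mem_budget_mb):
    """partitions: list of (size_mb, reuse_count) with integer sizes.
    Returns the indices of partitions to keep deserialized in memory."""
    n = len(partitions)
    best = [[0] * (mem_budget_mb + 1) for _ in range(n + 1)]
    for i, (size, reuse) in enumerate(partitions, 1):
        for b in range(mem_budget_mb + 1):
            best[i][b] = best[i - 1][b]
            if size <= b:
                best[i][b] = max(best[i][b], best[i - 1][b - size] + reuse)
    chosen, b = [], mem_budget_mb     # walk the table back to recover the set
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= partitions[i - 1][0]
    return sorted(chosen)

# choose_partitions([(300, 5), (500, 2), (200, 8)], mem_budget_mb=512) -> [0, 2]
```

Partitions that do not fit the chosen set would fall back to a cheaper level (serialized in memory or on disk) or be discarded and recomputed.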
10:40 am–11:15 am, Monday
Break with Refreshments
Ballroom Foyer
11:15 am–12:30 pm, Monday
Session Chair: Vasily Tarasov, IBM Almaden Research Center
Richard Black, Austin Donnelly, and Dave Harper, Microsoft Research; Aaron Ogus, Microsoft; Antony Rowstron, Microsoft Research
Microsoft’s Pelican storage rack uses a new class of hard disk drive (HDD), known by vendors as archival class HDD. These HDDs are explicitly designed to store cooler and archival data, differing from existing HDDs by trading performance for cost. Our early Pelican experiences have helped some vendors define the particular characteristics of this class of drive. During the last twelve or so months we have gained considerable data on how these drives perform in Pelicans, and in this paper we present data gathered from a test and a production environment. A key design choice for Pelican was to have only a small fraction of the HDDs concurrently spun up, making Pelican a harsh environment in which to operate an HDD. We present data showing how the drives have been used, their power profile, their AFR, and conclude by discussing some issues for the future of these archive HDDs. As flash capacities increase, eventually all HDDs will be archive class, so understanding their characteristics is of wide interest.
Adam Manzanares, Western Digital Research; Noah Watkins, University of California, Santa Cruz; Cyril Guyot and Damien LeMoal, Western Digital Research; Carlos Maltzahn, University of California, Santa Cruz; Zvonimir Bandic, Western Digital Research
Digital data is projected to double every two years, creating the need for cost-effective and performant storage media [4]. Hard disk drives (HDDs) are a cost-effective storage medium that sits between speedy yet costly flash-based storage, and cheap but slower media such as tape drives. However, virtually all HDDs today use a technology called perpendicular magnetic recording, and the density achieved with this technology is reaching scalability limits due to physical properties of the technology [17]. While new technologies such as shingled magnetic recording (SMR) that further increase areal density are slated to enter the market [6], existing systems software is not prepared to fully utilize these devices because of the unique I/O constraints that they introduce.
SMR requires systems software to conform to the shingling constraint. The shingling constraint is an I/O ordering constraint imposed at the device level, requiring that writes be sequential and contiguous within a subset of the disk called a zone. Thus, software that requires random block updates must use a scheme to serialize writes to the drive. This scheme can be handled internally by the drive; alternatively, the zone abstraction and shingling constraint can be exposed to the host operating system. Host-level solutions are challenging because the shingling constraint is not compatible with software that assumes a random-write block device model, which has been in use for decades. The shingling constraint influences all layers of the I/O stack, and each layer must be made SMR compliant.
In order to manage the shingling write constraint of SMR HDDs, we have designed a zone-based extent allocator (ZEA) that maps ZEA logical blocks (ZBAs) to LBAs of the HDD. Figure 1a depicts how ZEA is mapped onto an SMR HDD composed of multiple types of zones, which are described in Table 1. ZEA writes logical extents, consisting of data and metadata, sequentially onto the SMR zone, maintaining the shingling constraint.
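A minimal sketch of a zone-based extent allocator in this spirit (structures and names are assumed, not the ZEA code): every extent is appended at the current zone's write pointer with a one-block metadata header, and the ZBA-to-LBA mapping is kept so reads can be redirected:

```python
# Assumed structures for a zone-based extent allocator (not the ZEA code).
class ZoneExtentAllocator:
    def __init__(self, zone_size_lbas, num_zones):
        self.zone_size = zone_size_lbas
        self.write_ptr = [z * zone_size_lbas for z in range(num_zones)]
        self.current_zone = 0
        self.zba_to_lba = {}              # logical extent address -> device LBA
        self.next_zba = 0

    def append(self, data_lbas):
        # Allocate one extent: a one-block metadata header plus the data,
        # always written at the zone's write pointer (shingling constraint).
        need = 1 + data_lbas
        zone = self.current_zone
        used = self.write_ptr[zone] - zone * self.zone_size
        if used + need > self.zone_size:  # zone full: advance to the next zone
            zone += 1
            if zone >= len(self.write_ptr):
                raise RuntimeError("no free zones left")
            self.current_zone = zone
        lba = self.write_ptr[zone]
        self.write_ptr[zone] += need
        zba, self.next_zba = self.next_zba, self.next_zba + 1
        self.zba_to_lba[zba] = lba        # reads are redirected via this map
        return zba, lba
```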
Fenggang Wu, University of Minnesota, Twin Cities; Ming-Chang Yang, National Taiwan University; Ziqi Fan, Baoquan Zhang, Xiongzi Ge, and David H.C. Du, University of Minnesota, Twin Cities
Shingled Magnetic Recording (SMR) technology increases the areal density of hard disk drives. Among the three types of SMR drives on the market today, Host Aware SMR (HA-SMR) drives look the most promising. In this paper, we carry out an evaluation to understand the performance of HA-SMR drives with the objective of building large-scale storage systems using this type of drive. We focus on evaluating the special features of HA-SMR drives, such as the open zone issue and media cache cleaning efficiency. Based on our observations we propose a novel host-controlled indirection buffer to enhance the drive’s I/O performance. Finally, we present a case study of the open zone issue to show the potential of this host-controlled indirection buffer for HA-SMR drives.
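One possible reading of a host-controlled indirection buffer, as a hedged sketch (not the authors' design): the host absorbs writes that would otherwise land in the drive's media cache and destages them in zone order:

```python
# One possible host-side indirection buffer (a sketch, not the authors' design).
class HostIndirectionBuffer:
    def __init__(self, zone_of, flush_threshold):
        self.zone_of = zone_of            # function mapping an LBA to its zone
        self.buffered = {}                # lba -> data absorbed on the host
        self.flush_threshold = flush_threshold

    def write(self, lba, data, drive):
        self.buffered[lba] = data         # absorb the non-sequential write
        if len(self.buffered) >= self.flush_threshold:
            self.destage(drive)

    def read(self, lba, drive):
        return self.buffered[lba] if lba in self.buffered else drive.read(lba)

    def destage(self, drive):
        # Issue the buffered writes grouped by zone and in LBA order, so the
        # drive sees friendlier patterns and its media cache needs less cleaning.
        for lba in sorted(self.buffered, key=lambda l: (self.zone_of(l), l)):
            drive.write(lba, self.buffered[lba])
        self.buffered.clear()
```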
12:30 pm–2:00 pm, Monday
Luncheon for Workshop Attendees
Colorado Ballroom E
2:00 pm–3:15 pm, Monday
Session Chair: Song Jiang, Wayne State University
Gala Yadgar and Moshe Gabel, Technion—Israel Institute of Technology
Storage systems are designed and optimized relying on wisdom derived from analysis studies of file-system and block-level workloads. However, while SSDs are becoming a dominant building block in many storage systems, their design continues to build on knowledge derived from analysis targeted at hard disk optimization. Though still valuable, it does not cover important aspects relevant for SSD performance. In a sense, we are “searching under the streetlight”, possibly missing important opportunities for optimizing storage system design.
We present the first I/O workload analysis designed with SSDs in mind. We characterize traces from four repositories and examine their ‘temperature’ ranges, sensitivity to page size, and ‘logical locality’. Our initial results reveal nontrivial aspects that can significantly influence the design and performance of SSD-based systems.
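For example, a per-page write 'temperature' can be computed from a block trace with a few lines (the (op, offset, length) trace format here is an assumption, not that of the four repositories):

```python
# Assumed trace format: (op, offset_bytes, length_bytes) tuples.
from collections import Counter

def page_write_temperature(trace, page_size=4096):
    writes = Counter()
    for op, offset, length in trace:
        if op != 'W':
            continue
        first = offset // page_size
        last = (offset + length - 1) // page_size
        for page in range(first, last + 1):
            writes[page] += 1
    return writes   # page id -> write count; the skew drives GC and wear

# Re-running with page_size=8192 or 16384 shows how sensitive the observed
# skew is to the flash page size, one of the questions the study raises.
```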
Hyeong-Jun Kim, Sungkyunkwan University; Young-Sik Lee, Korea Advanced Institute of Science and Technology (KAIST); Jin-Soo Kim, Sungkyunkwan University
The performance of storage devices has increased significantly due to emerging technologies such as Solid State Drives (SSDs) and the Non-Volatile Memory Express (NVMe) interface. However, the complex I/O stack of the kernel impedes utilizing the full performance of NVMe SSDs. Application-specific optimization is also difficult in the kernel because the kernel must provide generality and fairness.
In this paper, we propose a user-level I/O framework which improves the performance by allowing user applications to access commercial NVMe SSDs directly without any hardware modification. Moreover, the proposed framework provides flexibility where user applications can select their own I/O policies including I/O completion method, caching, and I/O scheduling. Our evaluation results show that the proposed framework outperforms the kernel-based I/O by up to 30% on microbenchmarks and by up to 15% on Redis.
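A hypothetical sketch of what such a user-level framework's interface could look like (names and calls are invented; this is not the paper's API): the application chooses its own completion and caching policy and submits commands directly to a mapped NVMe queue:

```python
# Hypothetical user-level I/O handle; names and calls are invented.
class PollingCompletion:
    # Spin on the completion queue instead of sleeping on an interrupt.
    def wait(self, queue):
        while not queue.has_completion():
            pass
        return queue.pop_completion()

class UserIO:
    def __init__(self, submit_queue, completion_policy, cache=None):
        self.sq = submit_queue            # memory-mapped NVMe submission queue
        self.completion = completion_policy
        self.cache = cache                # optional application-managed cache

    def read(self, lba, nblocks):
        if self.cache is not None:
            hit = self.cache.get((lba, nblocks))
            if hit is not None:
                return hit
        self.sq.submit(('READ', lba, nblocks))   # no kernel I/O stack in the path
        data = self.completion.wait(self.sq)
        if self.cache is not None:
            self.cache[(lba, nblocks)] = data    # cache policy is the app's choice
        return data

# io = UserIO(sq, PollingCompletion(), cache={})  # plain dict as a toy cache
```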
Zhaoyan Shen, Hong Kong Polytechnic University; Feng Chen and Yichen Jia, Louisiana State University; Zili Shao, Hong Kong Polytechnic University
Flash-based key-value cache systems, such as Facebook’s McDipper [1] and Twitter’s Fatcache [2], provide a cost-efficient solution for high-speed key-value caching. These cache solutions typically take commercial SSDs and adopt a Memcached-like scheme to store and manage key-value pairs in flash. Such a practice, though simple, is inefficient. We advocate reconsidering the hardware/software architecture design by directly opening device-level details to key-value cache systems. This co-design approach can effectively bridge the semantic gap and closely connect the two layers. Leveraging the domain knowledge of key-value caches and the unique device-level properties, we can maximize the efficiency of a key-value cache system on flash devices while minimizing its weaknesses. We are implementing a prototype based on the Open-channel SSD hardware platform. Our preliminary experiments show very promising results.
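As a rough illustration of the co-design (the open-channel device calls are assumed, not a real driver API): the host-side cache places values itself, keeps the index in host memory, and reclaims space by erasing whole flash blocks:

```python
# Rough co-design illustration; the open-channel device calls are assumed.
class OpenChannelKVCache:
    def __init__(self, num_channels, block_bytes):
        self.num_channels = num_channels
        self.block_bytes = block_bytes
        self.index = {}                       # key -> (channel, offset, length)
        self.fill = [0] * num_channels        # write offset in each active block

    def set(self, key, value, device):
        chan = hash(key) % self.num_channels  # spread keys across channels
        if self.fill[chan] + len(value) > self.block_bytes:
            device.erase_block(chan)          # reclaim by erasing a whole block
            self.index = {k: v for k, v in self.index.items() if v[0] != chan}
            self.fill[chan] = 0
        off = self.fill[chan]
        device.append(chan, value)            # sequential program within a block
        self.fill[chan] += len(value)
        self.index[key] = (chan, off, len(value))

    def get(self, key, device):
        loc = self.index.get(key)
        if loc is None:
            return None
        chan, off, length = loc
        return device.read(chan, off, length)
```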
3:15 pm–3:45 pm, Monday
Break with Refreshments
Ballroom Foyer
3:45 pm–5:25 pm, Monday
Session Chair: Marcos Aguilera, VMware Research
Ali Anwar and Yue Cheng, Virginia Polytechnic Institute and State University; Hai Huang, IBM T. J. Watson Research Center; Ali R. Butt, Virginia Polytechnic Institute and State University
The growing variety of data storage and retrieval needs is driving the design and development of an increasing number of distributed storage applications such as key-value stores, distributed file systems, object stores, and databases. We observe that, to a large extent, such applications implement their own way of handling features such as data replication, failover, consistency, cluster topology, and leadership election. We found that 45–82% of the code in six popular distributed storage applications can be classified as implementations of such common features. While such implementations allow for deeper optimizations tailored for a specific application, writing new applications to satisfy the ever-changing requirements of new types of data or I/O patterns is challenging, as it is notoriously hard to get all the features right in a distributed setting.
In this paper, we argue that for most modern storage applications, the common feature implementation (i.e., the distributed part) can be automated and offloaded, so developers can focus on the core application functions. We are designing a framework, ClusterOn, which aims to take care of the messy plumbing of distributed storage applications. The envisioned goal is that a developer simply “drops” a non-distributed application into ClusterOn, which will convert it into a scalable and highly configurable distributed application.
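A toy sketch of the envisioned "drop-in" model (a hypothetical API, not ClusterOn itself): the single-node store exposes only get/put, and a shim layers consistent-hash routing, replication, and read failover around it:

```python
# Hypothetical "drop-in" shim; ClusterOn's real interface may differ.
class ClusterShim:
    def __init__(self, local_store, peers, replicas=3):
        self.local = local_store          # unmodified single-node application
        self.peers = peers                # ordered list of peer shims (a ring)
        self.replicas = replicas

    def owners(self, key):
        start = hash(key) % len(self.peers)
        return [self.peers[(start + i) % len(self.peers)]
                for i in range(self.replicas)]

    def put(self, key, value):
        for node in self.owners(key):     # the framework handles replication
            node.local.put(key, value)

    def get(self, key):
        for node in self.owners(key):     # and failover on reads
            try:
                return node.local.get(key)
            except ConnectionError:
                continue
        raise KeyError(key)
```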
Michael Wei, VMware Research and University of California, San Diego; Amy Tai, VMware Research and Princeton University; Chris Rossbach, Ittai Abraham, and Udi Wieder, VMware Research; Steven Swanson, University of California, San Diego; Dahlia Malkhi, VMware Research
The storage needs of users have shifted from simply storing data to requiring a rich interface that enables efficient querying of versions and snapshots and the creation of clones. Providing these features in a distributed file system while maintaining scalability, strong consistency, and performance remains a challenge. In this paper we introduce Silver, a file system that leverages the Corfu distributed logging system not only to store data, but to provide fast, strongly consistent snapshots, clones, and multi-versioning while preserving the scalability and performance of the distributed shared log. We describe and implement Silver using a FUSE prototype and show its performance characteristics.
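A minimal sketch of why a shared log makes snapshots and clones cheap (illustrative only, not Silver's design): every update is an append, so a snapshot is just a log position:

```python
# Illustrative log-backed file store; Silver's actual design is richer.
class LogFS:
    def __init__(self, log):
        self.log = log                    # append-only list of (path, data)

    def write(self, path, data):
        self.log.append((path, data))
        return len(self.log)              # the log position doubles as a version

    def snapshot(self):
        return len(self.log)              # O(1): just remember the position

    def read(self, path, upto=None):
        upto = len(self.log) if upto is None else upto
        for p, data in reversed(self.log[:upto]):
            if p == path:
                return data               # latest write visible at `upto`
        raise FileNotFoundError(path)

# fs = LogFS([]); fs.write("/a", b"v1"); snap = fs.snapshot()
# fs.write("/a", b"v2"); fs.read("/a", upto=snap)  # returns b"v1"
```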
Richard P. Spillane, Wenguang Wang, Luke Lu, Maxime Austruy, Christos Karamanolis, and Rawlinson Rivera, VMware
Our key innovation, the exo-clone, allows volume snapshots in VDFS (our native hyper-converged distributed file system) to be exported to a stand-alone regular file that can be imported into another VDFS cluster efficiently (zero-copy when possible). Exo-clones carry provenance, policy, and, similar to git commits, the fingerprints of the parent clones from which they were derived. They are analogous to commits in a distributed source control system, and can be stored outside of VDFS, rebased, and signed. Although they can be unpacked to any directory, when used with VDFS they can be mounted directly with zero copying and are instantly available to all nodes mounting VDFS. VDFS with exo-clones provides the format and the tools necessary to both transfer and run encapsulated applications in public and private clouds, and in both test/dev and production environments.
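A small sketch of the kind of git-commit-like metadata an exo-clone file could carry (field names are hypothetical, not VDFS's format): the header records provenance and parent fingerprints, and its own hash serves as the clone's fingerprint:

```python
# Hypothetical exo-clone header; field names are illustrative, not VDFS's format.
import hashlib
import json

def make_exo_clone_header(volume_id, parent_fingerprints, policy, payload):
    header = {
        "volume": volume_id,
        "parents": parent_fingerprints,   # fingerprints of the parent clones
        "policy": policy,                 # e.g. placement or retention policy
        "payload_sha256": hashlib.sha256(payload).hexdigest(),
    }
    # The hash of the header itself is the clone's fingerprint, git-commit style.
    fingerprint = hashlib.sha256(
        json.dumps(header, sort_keys=True).encode()).hexdigest()
    return fingerprint, header
```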
Neville Carvalho, Hyojun Kim, Maohua Lu, Prasenjit Sarkar, Rohit Shekhar, Tarun Thakur, and Pin Zhou, Datos IO; Remzi H. Arpaci-Dusseau, University of Wisconsin—Madison
We present a new problem in data storage: how to build efficient backup and restore tools for increasingly popular Next-generation Eventually Consistent STorage systems (NECST). We show that the lack of a concise, consistent, logical view of data at a point-in-time is the key underlying problem; we suggest a deep semantic understanding of the data stored within the system of interest as a solution. We discuss research and productization challenges in this new domain, and present the status of our platform, Datos CODR (Consistent Orchestrated Distributed Recovery), which can extract consistent and deduplicated backups from NECST systems such as Cassandra, MongoDB, and many others.
6:00 pm–7:00 pm, Monday
Joint Poster Session and Happy Hour with HotCloud
Colorado Ballroom A–E