sponsors
help promote
usenix conference policies
Estimating Unseen Deduplication—from Theory to Practice
Danny Harnik, Ety Khaitzin, and Dmitry Sotnikov, IBM Research—Haifa
Estimating the deduplication ratio of a very large dataset is both extremely useful, but genuinely very hard to perform. In this work we present a new method for accurately estimating deduplication benefits that runs 3X to 15X faster than the state of the art to date. The level of improvement depends on the data itself and on the storage media that it resides on. The technique is based on breakthrough theoretical work by Valiant and Valiant from 2011, that give a provably accurate method for estimating various measures while seeing only a fraction of the data. However, for the use case of deduplication estimation, putting this theory into practice runs into significant obstacles. In this work, we find solutions and novel techniques to enable the use of this new and exciting approach. Our contributions include a novel approach for gauging the estimation accuracy, techniques to run it with low memory consumption, a method to evaluate the combined compression and deduplication ratio, and ways to perform the actual sampling in real storage systems in order to actually reap benefits from these algorithms. We evaluated our work on a number of real world datasets.
Open Access Media
USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.
author = {Danny Harnik and Ety Khaitzin and Dmitry Sotnikov},
title = {Estimating Unseen {Deduplication{\textemdash}from} Theory to Practice},
booktitle = {14th USENIX Conference on File and Storage Technologies (FAST 16)},
year = {2016},
isbn = {978-1-931971-28-7},
address = {Santa Clara, CA},
pages = {277--290},
url = {https://www.usenix.org/conference/fast16/technical-sessions/presentation/harnik},
publisher = {USENIX Association},
month = feb
}
connect with us