Generating Realistic Datasets for Deduplication Analysis
Vasily Tarasov and Amar Mudrankit, Stony Brook University; Will Buik, Harvey Mudd College; Philip Shilane, EMC Corporation; Geoff Kuenning, Harvey Mudd College; Erez Zadok, Stony Brook University
Deduplication is a popular component of modern storage systems, with a wide variety of approaches. Unlike in traditional storage systems, performance in deduplication systems depends on data content as well as on access patterns and meta-data characteristics. Most datasets that have been used to evaluate deduplication systems are either unrepresentative or unavailable due to privacy issues, which prevents easy comparison of competing algorithms. Understanding how both content and meta-data evolve is critical to the realistic evaluation of deduplication systems.
We developed a generic model of file system changes based on properties measured on terabytes of real, diverse storage systems. Our model plugs into a generic framework for emulating file system changes. Building on observations from specific environments, the model can generate an initial file system followed by ongoing modifications that emulate the distribution of duplicates and file sizes, realistic changes to existing files, and file system growth. In our experiments, we were able to generate a 4 TB dataset within 13 hours on a machine with a single disk drive. The relative error of emulated parameters depends on the model size but remains within 15% of real-world observations.
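To make the idea of "an initial file system followed by ongoing modifications" concrete, here is a minimal toy sketch in Python. It is not the authors' model or framework: the function names, the parameters (`dup_fraction`, `modify_prob`, `mean_chunks`), and the choice of an exponential file-size distribution are all assumptions made purely for illustration. Files are represented as lists of abstract chunk IDs, and the deduplication ratio is taken to be total chunks divided by unique chunks.

```python
import random

def generate_initial(rng, num_files, mean_chunks, dup_fraction):
    """Build a toy file system: file name -> list of chunk IDs (hypothetical model)."""
    files = {}
    next_id = 0
    seen = []  # unique chunk IDs emitted so far, candidates for duplication
    for i in range(num_files):
        # Assumption: file sizes (in chunks) follow an exponential distribution.
        n = max(1, int(rng.expovariate(1.0 / mean_chunks)))
        chunks = []
        for _ in range(n):
            if seen and rng.random() < dup_fraction:
                chunks.append(rng.choice(seen))  # reuse an existing chunk
            else:
                chunks.append(next_id)           # emit a brand-new chunk
                seen.append(next_id)
                next_id += 1
        files[f"file{i}"] = chunks
    return files, next_id

def mutate(rng, files, next_id, modify_prob, new_files, mean_chunks):
    """One snapshot step: rewrite some existing chunks, then add new files."""
    for chunks in files.values():
        for j in range(len(chunks)):
            if rng.random() < modify_prob:
                chunks[j] = next_id  # modified chunk gets fresh content
                next_id += 1
    base = len(files)
    for i in range(new_files):
        n = max(1, int(rng.expovariate(1.0 / mean_chunks)))
        files[f"file{base + i}"] = list(range(next_id, next_id + n))
        next_id += n
    return next_id

def dedup_ratio(files):
    """Total chunks divided by unique chunks (1.0 means no duplicates)."""
    total = sum(len(c) for c in files.values())
    unique = len({cid for c in files.values() for cid in c})
    return total / unique

if __name__ == "__main__":
    rng = random.Random(0)
    files, next_id = generate_initial(rng, num_files=50, mean_chunks=8, dup_fraction=0.3)
    print("initial dedup ratio:", dedup_ratio(files))
    next_id = mutate(rng, files, next_id, modify_prob=0.05, new_files=5, mean_chunks=8)
    print("after one mutation step:", dedup_ratio(files))
```

A real generator would need to match measured distributions of duplicates, file sizes, and change patterns rather than the fixed probabilities used here; the sketch only shows the generate-then-mutate structure.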
author = {Vasily Tarasov and Amar Mudrankit and Will Buik and Philip Shilane and Geoff Kuenning and Erez Zadok},
title = {Generating Realistic Datasets for Deduplication Analysis},
booktitle = {2012 USENIX Annual Technical Conference (USENIX ATC 12)},
year = {2012},
isbn = {978-1-931971-93-5},
address = {Boston, MA},
pages = {261--272},
url = {https://www.usenix.org/conference/atc12/technical-sessions/presentation/tarasov},
publisher = {USENIX Association},
month = jun
}