Nadav Elias, Technion - Israel Institute of Technology; Philip Shilane, Dell Technologies; Sarai Sheinvald, ORT Braude College of Engineering; Gala Yadgar, Technion - Israel Institute of Technology
Deduplication is widely used to effectively increase the logical capacity of large-scale storage systems, by replacing redundant chunks of data with references to their unique copies. As a result, the logical size of a storage system may be many multiples of the physical data size. The many-to-one relationship between logical references and physical chunks complicates many functionalities supported by traditional storage systems, but, at the same time, presents an opportunity to rethink and optimize others. We focus on the offline task of searching for one or more byte strings (keywords) in a large data repository.
The traditional, naïve, search mechanism traverses the directory tree and reads the data chunks in the order in which they are referenced, fetching them from the underlying storage devices repeatedly if they are referenced multiple times. We propose the DedupSearch algorithm that operates in two phases: a physical phase that first scans the storage sequentially and processes each data chunk only once, recording keyword matches in a temporary result database, and a logical phase that then traverses the system’s metadata in its logical order, attributing matches within chunks to the files that contain them. The main challenge is to identify keywords that are split between logically adjacent chunks. To do that, the physical phase records keyword prefixes and suffixes at chunk boundaries, and the logical phase matches these substrings when processing the file’s metadata. We limit the memory usage of the result database by offloading records of tiny (one-character) partial matches to the SSD/HDD, and ensure that it is rarely accessed.
We compare DedupSearch to the naïve algorithm on datasets of three different data types (text, code, and binaries), and show that it can reduce the overall search time by orders of magnitude.
Open Access Media
USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.
This content is available to:
author = {Nadav Elias and Philip Shilane and Sarai Sheinvald and Gala Yadgar},
title = {{DedupSearch}: {Two-Phase} Deduplication Aware Keyword Search},
booktitle = {20th USENIX Conference on File and Storage Technologies (FAST 22)},
year = {2022},
isbn = {978-1-939133-26-7},
address = {Santa Clara, CA},
pages = {233--246},
url = {https://www.usenix.org/conference/fast22/presentation/elias},
publisher = {USENIX Association},
month = feb
}