FINDING SIMILAR FILES IN A LARGE FILE SYSTEM
Udi Manber
Department of Computer Science
University of Arizona
Tucson, AZ 85721
udi@cs.arizona.edu
Supported in part by an NSF Presidential Young Investigator Award
(grant DCR-8451397), with matching funds from AT&T, by NSF grants
CCR-9002351 and CCR-9301129, and by the Advanced Research Projects
Agency under contract number DABT63-93-C-0052. Part of this work was
done while the author was visiting the University of Washington.
The information contained in this paper does not necessarily reflect
the position or the policy of the U.S. Government or other sponsors of
this research. No official endorsement should be inferred.
Abstract
We present a tool, called sif, for finding all similar files in a large
file system. Files are considered similar if they have significant
number of common pieces, even if they are very different otherwise.
For example, one file may be contained, possibly with some changes, in
another file, or a file may be a reorganization of another file. The
running time for finding all groups of similar files, even for as
little as 25% similarity, is on the order of 500MB to 1GB an hour. The
amount of similarity and several other customized parameters can be
determined by the user at a post-processing stage, which is very fast.
Sif can also be used to very quickly identify all similar files to a
query file using a preprocessed index. Application of sif can be found
in file management, information collecting (to remove duplicates),
program reuse, file synchronization, data compression, and maybe even
plagiarism detection.
Download the full text of this paper in
ASCII (32,888 bytes) form.
To Become a USENIX Member, please see our
Membership Information.