USENIX Symposium on Internet Technologies and Systems, 1997
SASE: Implementation of a Compressed Text Search Engine
Srinidhi Varadarajan and Tzi-cker Chiueh
State University of New York, Stony Brook
Abstract
Keyword based search engines are the basic building block of text retrieval systems. Higher level systems like content sensitive search engines and knowledge-based systems still rely on keyword search as the underlying text retrieval mechanism. With the explosive growth in content, Internet and Intranet information repositories require efficient mechanisms to store as well as index data. In this paper we discuss the implementation of the Shrink and Search Engine (SASE) framework which unites text compression and indexing to maximize keyword search performance while reducing storage cost. SASE features the novel capability of being able to directly search through compressed text without explicit decompression. The implementation includes a search server architecture, which can be accessed from a Java front-end to perform keyword search on the Internet.
The performance results show that the compression efficiency of SASE is within 7-17% of GZIP one of the best lossless compression schemes. The sum of the compressed file size and the inverted indices is only between 55-76% of the original database while the search performance is comparable to a fully inverted index. The framework allows a flexible trade-off between search performance and storage requirements for the search indices.
- USENIX Members may view the full text of this paper in
HTML form and
PDF form.
(Use your membership number as the username for access.)
- If you need the latest Adobe Acrobat Reader, you can download it from Adobe's site.
- To become a USENIX Member, please see our Membership Information.
- Current USENIX Members may change their password.
|