NSDI '06 Abstract
Pp. 143–153 of the Proceedings
OverCite: A Distributed, Cooperative CiteSeer
Jeremy Stribling, MIT Computer Science and Artificial Intelligence Laboratory; Jinyang Li, New York University and MIT Computer Science and Artificial Intelligence Laboratory via University of California, Berkeley; Isaac G. Councill, Pennsylvania State University; M. Frans Kaashoek and Robert Morris, MIT Computer Science and Artificial Intelligence Laboratory
Abstract
CiteSeer is a popular online resource for the computer science research
community, allowing users to search and browse a large archive of
research papers. CiteSeer is expensive: it generates 35 GB of network
traffic per day, requires nearly one terabyte of disk storage, and
needs significant human maintenance.
OverCite is a new digital research library system that aggregates
donated resources at multiple sites to provide CiteSeer-like document
search and retrieval. OverCite enables members of the community to
share the costs of running CiteSeer. The challenge facing OverCite is how
to provide scalable and load-balanced storage and query processing
with automatic data management. OverCite uses a three-tier design:
presentation servers provide a user interface identical to CiteSeer's;
application servers partition and replicate a search index to spread
the work of answering each query among several nodes; and a
distributed hash table stores documents and meta-data, and coordinates
the activities of the servers.
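
To make the middle tier concrete, here is a minimal sketch of how a search index might be partitioned and replicated so that each query is answered by several nodes. It is an illustration under stated assumptions, not OverCite's actual code: the names `partition_of`, `IndexNode`, and `Cluster`, the SHA-1 based partitioning, and the round-robin choice of replica are all hypothetical, and the distributed hash table tier is omitted.

```python
import hashlib
from collections import defaultdict


def partition_of(doc_id: str, k: int) -> int:
    """Map a document ID to one of k index partitions with a stable hash."""
    digest = hashlib.sha1(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % k


class IndexNode:
    """Hypothetical application server holding one index partition."""

    def __init__(self, name: str, partition: int):
        self.name = name
        self.partition = partition
        self.postings = defaultdict(set)  # word -> set of document IDs

    def add(self, doc_id: str, text: str) -> None:
        for word in text.lower().split():
            self.postings[word].add(doc_id)

    def search(self, word: str) -> set:
        # .get avoids creating empty entries for unknown words
        return set(self.postings.get(word.lower(), set()))


class Cluster:
    """Assumed layout: k partitions, each replicated on about n/k nodes."""

    def __init__(self, node_names, k: int):
        if len(node_names) < k:
            raise ValueError("need at least one node per partition")
        self.k = k
        self.replicas = defaultdict(list)  # partition -> replica nodes
        for i, name in enumerate(node_names):
            self.replicas[i % k].append(IndexNode(name, i % k))
        self.next_replica = [0] * k  # round-robin cursor per partition

    def index(self, doc_id: str, text: str) -> None:
        # A document is indexed on every replica of its partition.
        for node in self.replicas[partition_of(doc_id, self.k)]:
            node.add(doc_id, text)

    def query(self, word: str):
        # Contact one replica per partition and merge the partial results.
        hits, contacted = set(), []
        for p in range(self.k):
            group = self.replicas[p]
            node = group[self.next_replica[p] % len(group)]
            self.next_replica[p] += 1
            contacted.append(node.name)
            hits |= node.search(word)
        return hits, contacted


if __name__ == "__main__":
    cluster = Cluster(["n0", "n1", "n2", "n3"], k=2)
    cluster.index("doc-a", "distributed hash table")
    cluster.index("doc-b", "distributed search index")
    print(cluster.query("distributed"))
    print(cluster.query("distributed"))  # served by the other replicas
```

In this sketch each query touches k nodes, one per partition, while successive queries rotate across replicas, so both the per-query work and the overall query load are spread over the servers.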
Evaluation of a prototype shows that OverCite increases its query
throughput by a factor of seven with a nine-fold increase in the
number of servers. OverCite requires more total storage and network
bandwidth than centralized CiteSeer, but spreads these costs over all the
sites. OverCite can exploit the resources of these sites to support
new features such as document alerts and to scale to larger data sets.
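
As a rough reading of the throughput figures above, the scaling efficiency relative to the baseline configuration can be worked out directly. The symbols S (throughput ratio), N (server-count ratio), and E (efficiency) are introduced here for illustration only; the abstract does not state the baseline number of servers.

```latex
\[
E = \frac{S}{N} = \frac{7}{9} \approx 0.78
\]
```

Under this reading, each added server contributes on average about 78% of the throughput that perfectly linear scaling would give.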
The Proceedings are published as a collective work, © 2006 by the USENIX Association. All Rights Reserved. Rights to individual papers remain with the author or the author's employer. Permission is granted for the noncommercial reproduction of the complete work for educational or research purposes. USENIX acknowledges all trademarks within this paper.