USENIX Symposium on Internet Technologies and Systems, 1997
Using the Structure of HTML Documents to Improve Retrieval
Michal Cutler, Yungming Shih, and Weiyi Meng
State University of New York, Binghamton
Abstract
The World Wide Web (WWW) is a gigantic information resource,
which is growing daily. As more and more data are added to the WWW, it
is becoming increasingly difficult to effectively locate useful information
in this environment. In this paper, we propose a method for making use
of the structures and hyperlinks of HTML documents to improve the effectiveness
of retrieving HTML documents. Our study assigns the occurrences of terms
in a document collection to six classes according to the tags in which
a particular term appears (such as Title, H1-H6, and Anchor). Based on
the assignment, we extend the weighting schemes in traditional information
retrieval by assigning different importance factors to terms in different
classes. The rationale is that terms appearing in different places of a
document may have different significance in identifying the document. For
this research we have built a Web-based search tool, Webor, created a testbed,
and conducted extensive experiments to determine an optimal class importance
factor combination. Our study indicates that substantial improvement of
retrieval effectiveness can be achieved using this technique.
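
As a rough illustration of the idea described above, the following Python sketch counts term occurrences per tag class and scales each count by a class importance factor. It is a minimal sketch under stated assumptions, not the Webor implementation: the abstract names only Title, H1-H6, and Anchor among the six classes, so the four classes used here, the TAG_CLASS mapping, the IMPORTANCE values, and the function name weighted_term_frequencies are all hypothetical. The sketch also counts anchor text in the document that contains it, which is a simplification.

```python
from collections import Counter
from html.parser import HTMLParser
import re

# Hypothetical tag-to-class mapping; the abstract names only Title, H1-H6 and Anchor.
TAG_CLASS = {"title": "title", "a": "anchor",
             **{f"h{i}": "header" for i in range(1, 7)}}

# Illustrative importance factors (assumed values, not the ones found in the paper).
IMPORTANCE = {"plain": 1.0, "header": 4.0, "title": 6.0, "anchor": 8.0}


class ClassTermCounter(HTMLParser):
    """Counts term occurrences separately for each class."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags that map to a class
        self.counts = {cls: Counter() for cls in IMPORTANCE}

    def handle_starttag(self, tag, attrs):
        if tag in TAG_CLASS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        # Text outside any classified tag falls into the "plain" class.
        cls = TAG_CLASS[self.stack[-1]] if self.stack else "plain"
        self.counts[cls].update(re.findall(r"[a-z0-9]+", data.lower()))


def weighted_term_frequencies(html):
    """Return, per term, the sum over classes of importance factor times count."""
    parser = ClassTermCounter()
    parser.feed(html)
    parser.close()
    weights = Counter()
    for cls, counter in parser.counts.items():
        for term, tf in counter.items():
            weights[term] += IMPORTANCE[cls] * tf
    return dict(weights)
```

For example, calling weighted_term_frequencies on a page whose title and first-level heading both contain "retrieval" gives that term a larger weight than a term that appears only in body text. The resulting weights could then replace raw term frequencies in a conventional weighting scheme.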