Index Databases

While it is very expensive to store a copy of a large distributed hyperdocument at a single site, it is relatively easy to download the whole hyperdocument to a single site, building a limited index while downloading, and "forgetting" nodes after they have been retrieved in order to save space.
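The download-index-forget idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not a description of any particular system: the names NodeScanner and build_title_index, the fetch callable, and the example.org pages are all hypothetical, and the "limited index" is assumed here to hold title words only. Node contents are dropped after each step; only the addresses already seen and the index itself are kept.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class NodeScanner(HTMLParser):
    """Collects the title text and the outgoing link targets of one node."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def build_title_index(start_url, fetch):
    """Breadth-first download from start_url, keeping only title words."""
    index = {}                  # word -> set of URLs
    seen = {start_url}          # addresses only, never node contents
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        body = fetch(url)       # hypothetical: returns HTML text or None
        if body is None:
            continue
        scanner = NodeScanner()
        scanner.feed(body)
        for word in scanner.title.lower().split():
            index.setdefault(word, set()).add(url)
        for href in scanner.links:
            target = urljoin(url, href)
            if target not in seen:
                seen.add(target)
                queue.append(target)
        # body is not stored anywhere: the node is "forgotten" here
    return index


# A two-node stand-in for a distributed hyperdocument (made-up data).
pages = {
    "http://example.org/a": '<title>Hypertext Basics</title><a href="b">next</a>',
    "http://example.org/b": '<title>Index Databases</title><a href="a">back</a>',
}
index = build_title_index("http://example.org/a", pages.get)
```

In this sketch the memory needed grows with the number of addresses and indexed words, not with the total size of the downloaded nodes.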

Hypertext systems normally support more embedded codes than just the delimitation of anchors. The World Wide Web uses HTML, a language based on the SGML syntax. Other languages, such as HyTime and RTF, also support markup that could be used for building an index database.

Small index databases for the World Wide Web use information in the header or the title of nodes. When a word is given (as a query) to these databases, they return a list of addresses (URLs) of nodes with that word in their header or title. Although the databases cannot guarantee that they return the URLs of all documents about the requested subject (since they only know about titles), they generally provide a good starting point for further information retrieval.
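A minimal sketch of such a title lookup, assuming the titles have already been collected, might look like this. The function names build_index and lookup and the example.org addresses are made up for illustration. The third node shows the incompleteness: its body may well discuss the subject, but its title does not say so, and so it is never returned.

```python
def build_index(titles):
    """Map each lowercased title word to the URLs whose title contains it."""
    index = {}
    for url, title in titles.items():
        for word in set(title.lower().split()):
            index.setdefault(word, []).append(url)
    return index


def lookup(index, word):
    """Return the URLs of nodes with the given word in their title."""
    return index.get(word.lower(), [])


# Made-up titles; the database knows nothing about the node bodies.
titles = {
    "http://example.org/hytime": "HyTime overview",
    "http://example.org/sgml": "SGML and HyTime notes",
    "http://example.org/misc": "Miscellaneous notes",
}
index = build_index(titles)
hits = lookup(index, "HyTime")
```

The returned addresses are a starting point from which a reader can follow links to related nodes that the index itself missed.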

Building an index database can only be done by downloading nodes into the site containing the database. In order to find which nodes exist, links must be extracted from the nodes, and all nodes must be reachable from the node from which the search starts. In the World Wide Web the entire hyperdocument cannot easily be downloaded this way for the following reasons:

Despite their incompleteness, index databases are becoming increasingly popular. An example of an index database that used only address and title information was the World Wide Web Worm [MB94], developed at the University of Colorado (at Boulder). Lycos is the oldest "giant" database. It was developed at Carnegie Mellon University. It selects a number of words from the body of each document, so a document can be found by keyword only if Lycos happens to have selected that keyword from it. But when indexing millions of documents, even selecting only a few (between 20 and 100) words per node still leads to a very large database. Some full-text databases also exist, either covering part of the Web, like WebCrawler and InfoSeek, or covering almost the entire Web, like Alta Vista. Only full-text index databases can locate documents that do not contain certain words, since a partial index cannot tell whether a word it did not record is actually absent from a document.
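The difference between a partial and a full-text index for "does not contain" queries can be illustrated with a small sketch. The selection rule used here (the k most frequent words) is only an assumption made for the example; the text does not say how Lycos chooses its words, and the documents and names below are made up.

```python
from collections import Counter

# Two made-up documents standing in for node bodies.
docs = {
    "doc1": "the web worm indexes titles and the web grows",
    "doc2": "lycos selects words and lycos indexes the web",
}


def select_words(text, k):
    """Hypothetical selection rule: keep the k most frequent words."""
    counts = Counter(text.split())
    return {word for word, _ in counts.most_common(k)}


# Partial index: a few selected words per document.
partial = {name: select_words(text, 2) for name, text in docs.items()}
# Full-text index: every word of every document.
full = {name: set(text.split()) for name, text in docs.items()}


def without_word(index, word):
    """Documents that, according to this index, do not contain the word."""
    return sorted(name for name, words in index.items() if word not in words)


full_answer = without_word(full, "worm")
partial_answer = without_word(partial, "worm")
```

With the full-text index only doc2 is reported as lacking "worm". The partial index reports both documents, because "worm" was not among the words selected from doc1, so absence from the index cannot be read as absence from the document.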