Index Databases
While it is very expensive to store a copy of a large distributed hyperdocument
at a single site, it is relatively easy to download the whole hyperdocument
to a single site, build a limited index while downloading, and
"forget" each node after it has been retrieved, in order to save space.
Hypertext systems normally support more embedded codes than just the
delimitation of anchors.
The World Wide Web
uses HTML,
a language based on the SGML syntax.
Other languages, such as HyTime and RTF, also support markup
that could be used for building an index database.
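As an illustration, the embedded anchor markup in an HTML node can be extracted with a standard parser. The sketch below uses Python's html.parser on a made-up page fragment; it is only an example of reading the delimiting codes, not a full indexer.

```python
# Minimal sketch: extracting anchor (link) markup from an HTML node.
# The delimiting tags <a href="..."> ... </a> are embedded codes that
# an indexer can read without rendering the page.
from html.parser import HTMLParser

class AnchorExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Made-up page fragment for illustration.
page = '<p>See <a href="http://example.org/a">A</a> and <a href="/b">B</a>.</p>'
extractor = AnchorExtractor()
extractor.feed(page)
print(extractor.links)  # ['http://example.org/a', '/b']
```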
Small index databases for the World Wide Web use information in the header or
the title of nodes.
When a word is given (as a query) to these databases, they return a list of
addresses (URLs) of nodes with that word in their header or title.
Although the databases cannot guarantee to return URLs for all documents
about the requested subject (since they only know about titles),
they generally provide a good starting point for further information retrieval.
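Such a title-based index can be sketched as a simple mapping from title words to URLs. The URLs and titles below are made-up example data:

```python
# Minimal sketch of a small title-based index: each word of a node's
# title maps to the URLs of all nodes whose title contains that word.
from collections import defaultdict

titles = {  # made-up example data
    "http://a.example/": "Hypertext Systems Overview",
    "http://b.example/": "Indexing the World Wide Web",
    "http://c.example/": "Hypertext Markup Basics",
}

index = defaultdict(set)
for url, title in titles.items():
    for word in title.lower().split():
        index[word].add(url)

def lookup(word):
    """Return all URLs whose title contains the query word."""
    return sorted(index.get(word.lower(), set()))

print(lookup("hypertext"))  # ['http://a.example/', 'http://c.example/']
```

A query only matches words that actually occur in a title, which is why such databases cannot find every document on a subject.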
Building an index database can only be done by downloading nodes into
the site containing the database. In order to find which nodes exist,
links must be extracted from the nodes, and all nodes must be reachable
from the node from which the search starts.
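The crawl described above can be sketched as a breadth-first traversal that indexes each node and then discards its body. Here fetch, extract_links, and extract_words are assumed helper functions, not a real API:

```python
# Minimal sketch: crawl from a start node, index a few words per node,
# and "forget" each node body after indexing to save space.
from collections import deque

def build_index(start_url, fetch, extract_links, extract_words, limit=1000):
    """Breadth-first crawl; returns a word -> set-of-URLs index."""
    index = {}
    seen = {start_url}            # URLs already queued (avoids loops)
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        body = fetch(url)         # download the node
        for word in extract_words(body):
            index.setdefault(word, set()).add(url)
        for link in extract_links(body):
            if link not in seen and len(seen) < limit:
                seen.add(link)
                queue.append(link)
        del body                  # "forget" the node; keep only the index
    return index
```

Note that the crawl only ever sees nodes reachable by links from the start node, which is exactly the limitation discussed next.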
In the World Wide Web the entire hyperdocument cannot easily be downloaded
this way for the following reasons:
- The whole Web takes several weeks to transfer to a single site, under the
  optimistic assumption that all sites are available when needed.
- Nobody knows which nodes form good starting points from which the entire
  Web can be reached. There are nodes that are referenced by many documents,
  but what is needed to reach the entire Web is a node or set of nodes with
  a large number of interesting outgoing links.
- An increasing part of the Web can only be reached by following links that
  go through clickable images. These links are not embedded in a node,
  but are delivered by a program whose input is a set of coordinates
  in the image. Trying all possible coordinates would be very time-consuming.
- A lot of nodes are only reachable by filling out a form first.
  The pages in this adaptive course text are an excellent example.
  There are too many possible ways to fill out a form to try all possible
  inputs automatically. Besides, doing so might have undesirable side effects,
  such as ordering (unwanted) products.
Despite their incompleteness, index databases are becoming increasingly
popular.
An example of an index database that uses only address and title information
was the
World Wide Web Worm [MB94],
developed at the University of Colorado at Boulder.
Lycos, developed at Carnegie Mellon University, is the oldest "giant" database.
It selects a number of words from the body of each document,
so documents can only be found through keywords that Lycos
happened to select from them.
Even so, when indexing millions of documents, selecting only a few
words (between 20 and 100) per node still leads to a very large database.
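A word-selection scheme of this kind might, for instance, keep only the most frequent non-trivial words of each document. The stopword list and the default of 20 words below are illustrative assumptions, not Lycos's actual method:

```python
# Minimal sketch of partial indexing: keep only the n most frequent
# words of a document, ignoring common stopwords. Queries for words
# that were not selected will miss the document.
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}  # illustrative

def select_words(text, n=20):
    """Return up to n of the most frequent non-stopword words."""
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    return [word for word, _ in Counter(words).most_common(n)]
```

Even at 20 to 100 selected words per node, an index over millions of nodes still holds tens of millions of word-to-URL entries.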
Some full-text databases also exist, either covering part of
the Web, like WebCrawler and
InfoSeek, or covering almost
the entire Web, like Alta Vista.
Only full-text index databases can locate documents that do not
contain certain words: with a partial index, the absence of a word from
the index does not imply its absence from the document.
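This limitation can be sketched with a full-text inverted index over made-up documents; because every word of every document is recorded, the complement of a word's posting set is exactly the set of documents without that word:

```python
# Minimal sketch: a full-text inverted index supports negative queries.
# With a partial index, a word absent from the index might still occur
# in a document, so the same query could return wrong answers.
docs = {  # made-up example documents
    "u1": "apples and oranges",
    "u2": "oranges only",
}

# Full-text inverted index: record every word of every document.
full = {}
for url, text in docs.items():
    for word in text.split():
        full.setdefault(word, set()).add(url)

def without(word):
    """URLs of documents that provably do not contain the word."""
    return set(docs) - full.get(word, set())

print(without("apples"))  # {'u2'}
```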