Information Retrieval in Distributed Hypertexts

Two approaches towards supporting information retrieval in distributed hypertexts have been used: building an index database of node contents in advance, and searching the hyperdocument by navigation at query time. Either way, for a distributed hyperdocument as large and as loosely connected as the World Wide Web, the answers to queries will most likely be incomplete. An index database will probably not contain the information of all nodes, because the navigation algorithm that builds it cannot be certain to locate every node: parts of the Web may be disconnected, and some nodes may be hidden behind "clickable images" or forms. A navigational search will also be incomplete, because it does not have the time to scan the whole hyperdocument, and it too cannot find documents that are unreachable by navigation. A reasonable compromise is to start a navigational search from the answer given by a very large index database. For the World Wide Web such index databases exist, Alta Vista being a well-known example, while a navigational search algorithm, called the fish-search, is available from the Eindhoven University of Technology. The combination of the two is sketched below.
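
The following is a minimal sketch of that compromise, not the actual fish-search implementation: index-database answers act as seed URLs, and a breadth-first crawl follows links with a depth budget that, loosely following the fish-search idea, is replenished on relevant pages and reduced on irrelevant ones. The relevance test (a simple regular-expression match) and the depth policy are deliberate simplifications.

```python
import re
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collect the href targets of anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def navigational_search(seed_urls, query, depth=3, max_nodes=100):
    """Navigational search seeded with index-database answers.

    Pages that match the query pass their full depth budget on to
    their links; pages that do not match pass on a reduced budget,
    so the search starves in barren regions of the hyperdocument."""
    seen, results = set(), []
    queue = deque((url, depth) for url in seed_urls)
    while queue and len(seen) < max_nodes:
        url, budget = queue.popleft()
        if url in seen or budget <= 0:
            continue
        seen.add(url)
        try:
            with urlopen(url, timeout=5) as resp:
                page = resp.read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            continue  # unreachable nodes: one reason answers stay incomplete
        relevant = re.search(query, page, re.IGNORECASE) is not None
        if relevant:
            results.append(url)
        parser = LinkParser()
        parser.feed(page)
        child_budget = depth if relevant else budget - 1
        for link in parser.links:
            queue.append((urljoin(url, link), child_budget))
    return results
```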

Useful interfaces to index databases are SavvySearch, from Colorado State University, and the MetaCrawler, originally developed at the University of Washington. Both forward a search operation to several well-known index databases in parallel and combine the answers.
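
The essence of such a metasearcher can be sketched as follows. The query URLs below are hypothetical placeholders, not the interfaces of any real index database; a real metasearcher must also translate each engine's query syntax and merge the differently formatted result lists, which is omitted here.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import quote
from urllib.request import urlopen

# Hypothetical query templates; each real index database has its own
# interface and result format that a metasearcher has to translate.
INDEX_DATABASES = {
    "engine-a": "https://search-a.example/query?q={}",
    "engine-b": "https://search-b.example/find?terms={}",
}

def query_one(name, url_template, terms):
    """Send the query to a single index database and return its raw answer."""
    url = url_template.format(quote(terms))
    try:
        with urlopen(url, timeout=10) as resp:
            return name, resp.read()
    except OSError:
        return name, None  # a slow or dead engine must not block the others

def metasearch(terms):
    """Forward one query to all index databases in parallel."""
    with ThreadPoolExecutor(max_workers=len(INDEX_DATABASES)) as pool:
        futures = [pool.submit(query_one, name, template, terms)
                   for name, template in INDEX_DATABASES.items()]
        return dict(f.result() for f in futures)
```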

A more comprehensive information retrieval architecture for world-wide systems, covering the World Wide Web as well as Gopherspace, is proposed by the Harvest Information Discovery and Access System. It uses a distributed set of information gatherers and brokers, thereby spreading the load of world-wide access and search requests over many machines.
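
The division of labour can be illustrated with a toy sketch: gatherers summarize the content close to where it lives, and brokers merge those summaries and answer queries, so no single machine has to crawl or serve the whole world. The data structures here are purely illustrative; Harvest's actual summary format and protocols are considerably richer.

```python
def gather(site_documents):
    """Gatherer: build a compact inverted index for one site's documents,
    mapping each term to the set of local URLs that contain it."""
    summary = {}
    for url, text in site_documents.items():
        for term in set(text.lower().split()):
            summary.setdefault(term, set()).add(url)
    return summary

class Broker:
    """Broker: merge summaries from many gatherers and answer
    queries locally, without touching the original sites."""
    def __init__(self):
        self.index = {}

    def collect(self, summary):
        for term, urls in summary.items():
            self.index.setdefault(term, set()).update(urls)

    def query(self, term):
        return self.index.get(term.lower(), set())
```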

As distributed hypertexts are usually read much more frequently than they are written, their performance benefits greatly from replication. Just as a cache memory is used between a CPU and main memory, and between main memory and disk, a cache between a local hypertext browser and the actual (remote parts of the) hyperdocument can improve performance and reduce the network traffic caused by searching for information in a distributed hypertext. Several cache implementations for the World Wide Web exist, including Lagoon, the cache developed at the computing science department of the Eindhoven University of Technology, and Squid, a freely available cache which has its roots in the cache component of the Harvest project.
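
The core idea of such a cache can be sketched in a few lines: a document is fetched from the remote server only on the first request, and every later request is answered from local storage. The cache directory and keying scheme below are assumptions for the illustration; real caches such as Lagoon and Squid additionally handle expiry, consistency and sharing between many users, all omitted here.

```python
import hashlib
from pathlib import Path
from urllib.request import urlopen

CACHE_DIR = Path("./web-cache")  # illustrative location, not any real cache's layout

def cached_fetch(url, cache_dir=CACHE_DIR):
    """Return the document for `url`, serving it from the local cache
    when possible so that repeated reads cause no remote traffic."""
    cache_dir.mkdir(exist_ok=True)
    key = hashlib.sha1(url.encode()).hexdigest()
    entry = cache_dir / key
    if entry.exists():                      # cache hit: no network access
        return entry.read_bytes()
    with urlopen(url, timeout=10) as resp:  # cache miss: fetch and store
        data = resp.read()
    entry.write_bytes(data)
    return data
```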