Information Retrieval in Distributed Hypertexts
Two approaches to supporting information retrieval in distributed
hypertexts have been used:
- By building (and periodically updating) an index database for the
  whole hyperdocument, the first part of a query (finding candidate
  documents to be searched) can be supported. The database can deliver
  the addresses (URLs) of nodes that satisfy certain conditions, such as
  containing a given word in their title or header. (The database can be
  built manually or automatically.) A minimal sketch of such an index is
  given after this list.
- Searching can also be done by navigation, meaning that nodes are
  retrieved by following links and are scanned for the required
  information. From the links embedded in these nodes, new nodes to
  retrieve are chosen, and the links leading to them are followed. Since
  this search mechanism consumes both time and network resources, a
  clever selection algorithm and a good starting point are important.
  A sketch of such a link-following search also appears after this list.
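The index-database approach can be illustrated with a minimal sketch.
The node titles, URLs, and helper names below are invented for the
illustration; a real index database would be gathered automatically and
stored persistently, but the principle of answering the first part of a
query (delivering the URLs of candidate nodes) is the same.

    # Minimal sketch of an index database over node titles (invented data).
    # A query for a word returns the URLs of nodes whose title contains it,
    # i.e. the candidate documents for the first part of a query.

    from collections import defaultdict

    # Hypothetical nodes: URL -> title (a real index is built by a gatherer).
    nodes = {
        "http://www.example.edu/courses.html": "Hypertext Courses",
        "http://www.example.edu/ir.html":      "Information Retrieval Notes",
        "http://www.example.org/links.html":   "Hypertext Link Collection",
    }

    def build_title_index(nodes):
        """Map each lowercased title word to the set of URLs containing it."""
        index = defaultdict(set)
        for url, title in nodes.items():
            for word in title.lower().split():
                index[word].add(url)
        return index

    def candidate_urls(index, word):
        """Deliver the URLs of nodes whose title satisfies the condition."""
        return sorted(index.get(word.lower(), set()))

    index = build_title_index(nodes)
    print(candidate_urls(index, "hypertext"))
    # ['http://www.example.edu/courses.html', 'http://www.example.org/links.html']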
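Navigational search can be sketched in the same spirit. The sketch below
is a simplified, generic link-following search, not the fish-search
mentioned later in this section: it scans retrieved nodes for a query
word, scores them, and follows links from the most promising nodes
first, up to a fixed budget. The fetch function and the node contents
are invented stand-ins for real network retrieval.

    # Simplified sketch of a navigational search: retrieve nodes by following
    # links, scan them for the query, and choose which links to follow next.
    # This is a generic illustration, not the fish-search algorithm itself.

    import heapq

    # Hypothetical hyperdocument: URL -> (text, list of outgoing links).
    web = {
        "http://a.example/": ("start page about hypertext",
                              ["http://b.example/", "http://c.example/"]),
        "http://b.example/": ("notes on information retrieval",
                              ["http://c.example/"]),
        "http://c.example/": ("hypertext and retrieval combined", []),
    }

    def fetch(url):
        """Stand-in for retrieving a node over the network."""
        return web.get(url, ("", []))

    def navigational_search(start_urls, query, budget=10):
        """Follow links from promising nodes first; stop after `budget` fetches."""
        # Priority queue of (negative score, URL): higher scores are explored first.
        frontier = [(0, url) for url in start_urls]
        heapq.heapify(frontier)
        seen, hits, fetched = set(start_urls), [], 0
        while frontier and fetched < budget:
            _, url = heapq.heappop(frontier)
            text, links = fetch(url)
            fetched += 1
            score = text.lower().count(query.lower())   # crude relevance estimate
            if score > 0:
                hits.append(url)
            for link in links:
                if link not in seen:
                    seen.add(link)
                    # A clever selection algorithm would rank links more carefully;
                    # here each link simply inherits its parent's relevance score.
                    heapq.heappush(frontier, (-score, link))
        return hits

    print(navigational_search(["http://a.example/"], "hypertext"))
    # ['http://a.example/', 'http://c.example/']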
Either way, for a distributed hyperdocument as large and as loosely
connected as the World Wide Web, the answers to queries will most likely
be incomplete.
An index database will probably not contain information about all nodes,
because the algorithm that gathers it by navigation cannot be certain to
locate every node: parts of the Web may be disconnected, and some nodes
may be hidden behind "clickable images" or forms.
A navigational search will also be incomplete, both because it does not
have the time to scan the whole hyperdocument and because, for the same
reasons, some documents may not be reachable by navigation at all.
A reasonable compromise is to start a navigational search from the
answer given by a very large index database.
For the World Wide Web such index databases exist, Alta Vista being one
example, while a navigational search algorithm, called the fish-search,
is available from the Eindhoven University of Technology.
Useful interfaces to index databases are the "Savvy Search" from the
University of Colorado and the MetaCrawler, originally developed at the
University of Washington. They forward a search operation to several of
the well-known index databases in parallel.
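The idea behind these meta-search interfaces can be sketched as a simple
parallel fan-out of one query to several backends, followed by merging
the answers. The backend functions below are invented stand-ins; neither
Savvy Search's nor the MetaCrawler's actual interfaces are shown.

    # Sketch of a meta-search interface: forward one query to several index
    # databases in parallel and merge their answers. The backends here are
    # invented stand-ins, not the real Savvy Search or MetaCrawler code.

    from concurrent.futures import ThreadPoolExecutor

    def query_backend_a(query):
        """Stand-in for querying one index database over the network."""
        return ["http://a.example/hypertext.html"]

    def query_backend_b(query):
        """Stand-in for querying another index database."""
        return ["http://b.example/ir.html", "http://a.example/hypertext.html"]

    BACKENDS = [query_backend_a, query_backend_b]

    def metasearch(query):
        """Run all backend queries in parallel and merge the URL lists."""
        with ThreadPoolExecutor(max_workers=len(BACKENDS)) as pool:
            results = pool.map(lambda backend: backend(query), BACKENDS)
        merged = []
        for urls in results:
            for url in urls:
                if url not in merged:      # drop duplicate answers
                    merged.append(url)
        return merged

    print(metasearch("hypertext"))
    # ['http://a.example/hypertext.html', 'http://b.example/ir.html']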
A complex information retrieval architecture for world-wide systems,
including the World Wide Web and Gopherspace, is proposed by the
Harvest Information Discovery and Access System.
It uses a distributed set of information gatherers and brokers, thereby
spreading the load of world-wide access and search requests over many
machines.
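The division of labour between gatherers and brokers can be illustrated
with a toy sketch; it only shows the general idea of spreading the
gathering and query load over several components, not Harvest's actual
summary formats or protocols.

    # Toy sketch of a gatherer/broker architecture: each gatherer summarizes
    # one part of the document space, a broker indexes the summaries and
    # answers queries. This illustrates the division of labour only.

    class Gatherer:
        """Collects and summarizes the nodes of one (hypothetical) site."""
        def __init__(self, nodes):
            self.nodes = nodes           # URL -> full text

        def summaries(self):
            # A real gatherer extracts structured summaries; here a summary
            # is simply the set of words appearing in each node.
            return {url: set(text.lower().split())
                    for url, text in self.nodes.items()}

    class Broker:
        """Indexes summaries from several gatherers and answers queries."""
        def __init__(self, gatherers):
            self.index = {}
            for gatherer in gatherers:
                self.index.update(gatherer.summaries())

        def query(self, word):
            return sorted(url for url, words in self.index.items()
                          if word.lower() in words)

    gatherer1 = Gatherer({"http://site1.example/a.html": "hypertext retrieval notes"})
    gatherer2 = Gatherer({"http://site2.example/b.html": "gopher and hypertext archives"})
    broker = Broker([gatherer1, gatherer2])
    print(broker.query("hypertext"))
    # ['http://site1.example/a.html', 'http://site2.example/b.html']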
As distributed hypertexts are usually read much more frequently than
they are written, their performance benefits greatly from replication.
Just as a cache memory is used between a CPU and main memory, and
between main memory and disk, a cache between a local hypertext browser
and the actual (remote parts of the) hyperdocument can be used to
improve performance and reduce the network traffic caused by searching
for information in a distributed hypertext.
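A minimal sketch of such a cache, sitting between the browser and the
network, is given below; the fetch function and the expiry time are
invented, and real caches such as those named next also deal with
consistency, size limits, and shared use by many browsers.

    # Minimal sketch of a cache between a browser and the remote parts of a
    # hyperdocument: a repeated request for the same URL is served from the
    # local copy instead of causing network traffic.

    import time

    def fetch_remote(url):
        """Stand-in for retrieving a document over the network."""
        return f"<html>contents of {url}</html>"

    class DocumentCache:
        def __init__(self, max_age=300.0):
            self.max_age = max_age       # seconds a local copy stays fresh
            self.store = {}              # URL -> (fetch time, document)

        def get(self, url):
            entry = self.store.get(url)
            if entry is not None and time.time() - entry[0] < self.max_age:
                return entry[1]          # fresh local copy: no network access
            document = fetch_remote(url)             # miss or stale: go remote
            self.store[url] = (time.time(), document)
            return document

    cache = DocumentCache()
    cache.get("http://www.example.org/index.html")   # fetched from the network
    cache.get("http://www.example.org/index.html")   # answered from the cache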
Several cache implementations for the World Wide Web exist, including
Lagoon, the cache developed at the computing science department of the
TUE, and Squid, a freely available cache that has its roots in the
object cache of the Harvest project.