The Harvest Information Discovery and Access System
The Harvest
Information Discovery and Access System is a general-purpose
system for discovering and accessing information on the
World Wide Web (WWW),
in Gopherspace,
and on file servers (FTP servers).
It was developed at the
University of Colorado at Boulder.
Its (simplified) architecture is shown in the figure below.
The main components are:
- A provider is an information server, such as a WWW server,
a Gopher server, or an FTP server. It can deliver information directly to
clients, but access is preferably done through a cache that is
located nearer to the client than the provider is.
Since providers are existing information servers, they are not
really a part of the Harvest system.
(Neither are the clients, for that matter.)
- A gatherer collects information from one or more providers
and maintains an index database. In the Harvest system, a tool called
Essence extracts information from different kinds of
information sources. For example, it can extract files from a tar archive
and recognize structural elements in LaTeX documents.
- A broker offers information retrieval access. It answers
queries by means of information it gets from gatherers and from other
brokers. In many cases a broker and gatherer will run on the same machine.
A broker delivers addresses of the objects that meet the user's information
need (not the objects themselves).
- The desired objects are accessed through one or more caches.
To reduce network traffic, a number of caches need to be installed
all over the world, each containing a large amount of recently used data.
The index databases used in Harvest are currently based on
Glimpse [MW94].
A Glimpse index requires only about 3 to 7% of the size of the documents being
indexed (compared to at least 100% for other indexing techniques).
Glimpse also supports regular expression search with a configurable number
of typographical errors allowed.
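
To make "a configurable number of errors" concrete, the sketch below checks whether a literal pattern occurs anywhere in a text with at most a given number of insertions, deletions, or substitutions. It uses a plain dynamic-programming formulation of approximate substring matching and is not Glimpse's own search engine; the function name `approx_find` and its parameters are hypothetical, and regular expressions are not handled.

```python
def approx_find(pattern, text, max_errors):
    """Return True if pattern occurs somewhere in text with at most
    max_errors insertions, deletions, or substitutions."""
    m = len(pattern)
    # prev[i] is the fewest edits needed to match pattern[:i] against
    # some substring of the text that ends at the current position.
    prev = list(range(m + 1))
    if prev[m] <= max_errors:
        return True
    for ch in text:
        # A match may start at any position, so the empty prefix costs 0.
        curr = [0]
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            curr.append(min(prev[i - 1] + cost,  # match or substitution
                            prev[i] + 1,         # extra character in text
                            curr[i - 1] + 1))    # missing pattern character
        if curr[m] <= max_errors:
            return True
        prev = curr
    return False


found_with_one_error = approx_find("harvest", "the harvst system", 1)
found_exactly = approx_find("harvest", "the harvst system", 0)
```

Here the misspelling "harvst" is accepted when one error is allowed and rejected when none is, which is the kind of tolerance the error setting controls.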