The Harvest Information Discovery and Access System

The Harvest Information Discovery and Access System is a general-purpose system for gathering, indexing and searching information from the World Wide Web (WWW), from Gopherspace and from FTP file servers. It was developed at the University of Colorado at Boulder. Its (simplified) architecture is shown in the figure below:

The main components are:

The index databases used in Harvest are currently based on Glimpse [MW94]. A Glimpse index occupies only about 3 to 7% of the size of the documents being indexed (compared with at least 100% for other indexing techniques). Glimpse also supports regular expression search and approximate matching, in which the number of typographic errors allowed in a match is configurable.
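
To illustrate what searching "with a configurable number of typographic errors allowed" means, here is a minimal Python sketch of approximate substring matching by edit distance. It is not Glimpse's actual implementation; the function name approx_find and its parameters are hypothetical, and it uses a plain dynamic-programming scan of the text rather than an index.

```python
def approx_find(pattern: str, text: str, max_errors: int) -> list[int]:
    """Return the end positions (exclusive) in `text` where `pattern`
    matches some substring with at most `max_errors` edits.

    An edit is one inserted, deleted, or substituted character.
    Hypothetical helper for illustration only, not part of Glimpse.
    """
    m = len(pattern)
    # prev[i] = fewest edits needed to match pattern[:i] against some
    # substring of the text ending at the previous text position.
    prev = list(range(m + 1))
    hits = []
    for end, ch in enumerate(text, start=1):
        # cur[0] stays 0 because a match may start at any text position.
        cur = [0] * (m + 1)
        for i in range(1, m + 1):
            substitution = prev[i - 1] + (0 if pattern[i - 1] == ch else 1)
            extra_text_char = prev[i] + 1      # text has a character the pattern lacks
            missing_text_char = cur[i - 1] + 1  # pattern has a character the text lacks
            cur[i] = min(substitution, extra_text_char, missing_text_char)
        if cur[m] <= max_errors:
            hits.append(end)
        prev = cur
    return hits


if __name__ == "__main__":
    # With zero errors allowed, the misspelled word is not found.
    print(approx_find("harvest", "the harvset system", 0))
    # Allowing two errors tolerates the swapped letters.
    print(approx_find("harvest", "the harvset system", 2))
```

Raising max_errors makes the search more tolerant of misspellings at the cost of more spurious hits, which is the trade-off a configurable error count exposes to the user.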