Information retrieval is a large research area, mostly concerned with finding information in textual material [Bärtschi85], [Salton89]. The simplest form of information retrieval is full-text search, which finds occurrences of words or phrases specified by the user. Search terms can be combined with boolean operators, and words can be weighted according to their statistical properties. When a hyperdocument is regarded simply as a text database (ignoring the links), this type of information retrieval is the same as for other textual databases, such as dictionaries, encyclopedias and on-line library catalogs.
Finding information is a three-step process:
When a text database is large but centralized, special indexing mechanisms can be employed to speed up the search. For relatively static documents, a popular indexing mechanism is the use of inverted files. Because of the high space overhead of this technique, some alternatives have been developed for indexing huge hyperdocuments like the World Wide Web. Glimpse [MW94], the most popular of these techniques, is an indexing mechanism that allows regular expression search while its index takes up only about 3 to 7% of the size of the original textual content.
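The idea behind inverted files can be illustrated with a minimal Python sketch. It is not taken from the cited work and does not reflect how Glimpse is implemented; the function names (`build_inverted_index`, `search_and`, `search_or`) and the three sample nodes are assumptions made for illustration only.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search_and(index, words):
    """Ids of the documents that contain every query word."""
    postings = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*postings) if postings else set()


def search_or(index, words):
    """Ids of the documents that contain at least one query word."""
    postings = [index.get(w.lower(), set()) for w in words]
    return set.union(*postings) if postings else set()


# Hypothetical sample nodes, regarded as a plain text database.
docs = {
    "node1": "Hypertext links connect nodes",
    "node2": "Inverted files speed up full text search",
    "node3": "Full text search in hypertext ignores links",
}
index = build_inverted_index(docs)
print(sorted(search_and(index, ["hypertext", "links"])))  # ['node1', 'node3']
print(sorted(search_or(index, ["inverted", "nodes"])))    # ['node1', 'node2']
```

The index stores one entry per distinct word plus a posting set per entry, which hints at the space overhead mentioned above: for a large collection the postings can approach the size of the text itself.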
Bruza [Bruza-90] proposed a two-level hypertext architecture for hyperdocuments, containing a hyperindex used for information retrieval. The hyperindex would first be searched for the index term describing the required information. A "beam down" operation to the hyperdocument itself would then follow, in order to evaluate the selected nodes. Bruza also proposed measures to determine the effectiveness of index expressions in the hyperindex.
The result of a search may be either a pointer to the first match found, or a scored list of matches. Information retrieval is inherently uncertain: a very general query (like asking for one keyword) may yield too many answers, while a very specific query may result in no answers at all.
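A scored list of matches can be sketched as follows. The weighting used here (term frequency multiplied by a logarithmic inverse document frequency) is one common choice picked as an assumption for this example, not a scheme prescribed by the text; the function name `rank` and the sample documents are likewise hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(docs, query):
    """Return (doc_id, score) pairs, best match first, omitting zero scores."""
    n_docs = len(docs)
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}

    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    scores = {}
    for doc_id, counts in term_counts.items():
        score = 0.0
        for word in tokenize(query):
            if counts[word]:
                # Rare words weigh more than words found in most documents.
                idf = math.log(1 + n_docs / doc_freq[word])
                score += counts[word] * idf
        if score > 0:
            scores[doc_id] = score
    # Sort by descending score, breaking ties by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical sample documents.
docs = {
    "a": "search search index",
    "b": "search links",
    "c": "links anchors",
}
print([doc_id for doc_id, _ in rank(docs, "search index")])  # ['a', 'b']
print(rank(docs, "search index anchors zebra"))              # scored list
print(rank(docs, "zebra"))                                   # []
```

The last call shows the uncertainty described above in miniature: a query that matches nothing returns an empty list, while a single common keyword would return most of the collection.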
Structural querying is what distinguishes information retrieval in hypertext from that in ordinary text databases. Beeri and Kornatzky [BK90] have suggested a logical query language that allows a combination of structural and content-based queries. The logic is a combination of propositional calculus (without predicates or variables) and quantifiers such as many, most, at least m, exactly n, etc. Attribute/value pairs are used to denote content properties. Another attempt to develop structural querying facilities is the GraphLog language by Consens and Mendelzon [CM89]. GraphLog is a visual language, based on pattern matching in the graph structure of the hyperdocument.
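To make the combination of structural and content-based conditions concrete, here is a small Python sketch. It implements neither the language of [BK90] nor GraphLog; it only illustrates a query of the form "nodes with given attribute/value pairs that link to at least m nodes with other attribute/value pairs". The data layout (`nodes`, `links`) and the helper names are assumptions.

```python
def matches(attrs, required):
    """True if every required attribute/value pair is present in attrs."""
    return all(attrs.get(key) == value for key, value in required.items())


def at_least(m, source, links, nodes, required):
    """True if `source` links to at least m nodes that match `required`."""
    hits = sum(
        1 for target in links.get(source, []) if matches(nodes[target], required)
    )
    return hits >= m


def query(nodes, links, own, m, linked):
    """Nodes matching `own` (content) that link to >= m nodes matching `linked` (structure)."""
    return sorted(
        node_id
        for node_id, attrs in nodes.items()
        if matches(attrs, own) and at_least(m, node_id, links, nodes, linked)
    )


# Hypothetical hyperdocument: attribute/value pairs per node, outgoing links per node.
nodes = {
    "intro": {"type": "overview", "topic": "ir"},
    "glimpse": {"type": "detail", "topic": "indexing"},
    "inverted": {"type": "detail", "topic": "indexing"},
    "history": {"type": "overview", "topic": "hypertext"},
}
links = {
    "intro": ["glimpse", "inverted"],
    "history": ["glimpse"],
}

# Overview nodes that link to at least two nodes about indexing.
print(query(nodes, links, {"type": "overview"}, 2, {"topic": "indexing"}))  # ['intro']
```

A purely content-based search could find the overview nodes, but only the link structure separates "intro" from "history" here.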
Information retrieval in distributed hypertexts is inherently more complicated. Global queries are no longer possible, so all search activities have to be done by means of automated browsing. The so-called fish-search is an example of a search tool using this technique. When links carry information, such as attribute/value pairs, that helps in determining whether or not to follow them, this information can be used to significantly reduce the search space, as explained in [FS92].
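Automated browsing can be sketched in Python as a traversal that fetches one node at a time and decides per link whether to follow it. This is a simplified illustration, not the fish-search algorithm itself. The callbacks `fetch`, `is_relevant` and `follow_link`, the depth budget, and the rule that a relevant node renews the budget are all assumptions made for this example.

```python
from collections import deque


def automated_browse(start, fetch, is_relevant, follow_link, depth=2):
    """Browse outward from `start` and return relevant node ids in visit order.

    fetch(node_id) must return (text, [(target_id, link_attrs), ...]).
    """
    found = []
    visited = {start}
    queue = deque([(start, depth)])
    while queue:
        node_id, budget = queue.popleft()
        text, out_links = fetch(node_id)
        if is_relevant(text):
            found.append(node_id)
            budget = depth  # assumption: a relevant node renews the search budget
        if budget == 0:
            continue  # give up on this direction
        for target, attrs in out_links:
            # Link attributes prune the search space before anything is fetched.
            if target not in visited and follow_link(attrs):
                visited.add(target)
                queue.append((target, budget - 1))
    return found


# Hypothetical distributed hypertext, simulated with a dictionary.
web = {
    "home": ("welcome page", [("ir", {"kind": "topic"}), ("ads", {"kind": "advert"})]),
    "ir": ("information retrieval overview", [("fish", {"kind": "topic"})]),
    "ads": ("buy retrieval software", []),
    "fish": ("fish-search and retrieval", []),
}

result = automated_browse(
    "home",
    fetch=web.__getitem__,
    is_relevant=lambda text: "retrieval" in text,
    follow_link=lambda attrs: attrs.get("kind") == "topic",
)
print(result)  # ['ir', 'fish']
```

The "ads" node mentions retrieval but is never fetched, because the attribute on the link leading to it already rules it out. In a real distributed setting each `fetch` is a network request, so pruning on link attributes saves far more than it does in this in-memory simulation.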
To verify whether you have learned enough about information retrieval, you must complete a test on IR in hypertext.
A lot more information on the subject of information retrieval can be obtained from a course in Nijmegen.