Scoring Search Matches

Most search interfaces for text and hypertext show a list or menu of the nodes that contain the search string or otherwise satisfy the search condition. In this list, each node is given a score indicating how well it answers the information request. This contrasts with so-called "boolean" retrieval, where a document is either relevant or not, with no graded degree of relevance.

Determining the score of a node requires more than just counting "hits". A useful relevance measure also takes into account the probability of the words occurring (possibly together) in a text, and normalizes for the length of the document.
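
As an illustration, here is a minimal sketch of one common way to combine these ingredients; it is not necessarily the formula any particular system uses. The function names, the smoothing in the rarity weight, and the two toy documents are assumptions made for the example. The hit count is divided by the document length, and each word is weighted by how few documents contain it, which serves as a rough stand-in for the probability of the word occurring.

```python
import math
from collections import Counter

def document_frequencies(docs):
    """Count, for each word, the number of documents that contain it."""
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    return df

def score(query_terms, tokens, df, n_docs):
    """Length-normalized hit count, weighted so that rare words count more."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = 0.0
    for term in query_terms:
        tf = counts[term] / len(tokens)                      # normalize for length
        idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0  # assumed smoothing
        total += tf * idf
    return total

# Toy collection (hypothetical): the long document has more raw hits,
# but they are diluted by its length.
docs = {
    "short": "hypertext links nodes".split(),
    "long": ("hypertext " + "filler " * 20 + "hypertext").split(),
}
df = document_frequencies(docs)
ranked = sorted(
    docs,
    key=lambda name: score(["hypertext"], docs[name], df, len(docs)),
    reverse=True,
)
print(ranked)
```

Counting hits alone would rank the long document first (two hits against one); with length normalization the short document, in which a third of the words match, comes out on top.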

When the link structure is also used, the scoring function could take into account the relevance of the nodes that can be reached from the node being scored. This would favor clusters of relevant nodes over isolated relevant nodes.
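
A minimal sketch of such a link-aware score follows. It assumes that only directly linked nodes are considered, that their scores are averaged, and that the average is added with a hypothetical weight of 0.5; the function name and all of these choices are illustrative and would need tuning in practice.

```python
def link_aware_scores(base, links, weight=0.5):
    """Add to each node's own score a share of its neighbours' scores.

    base  maps node -> score computed from the node's own text
    links maps node -> list of nodes reachable in one step
    """
    result = {}
    for node, own in base.items():
        neighbours = links.get(node, [])
        if neighbours:
            bonus = sum(base.get(n, 0.0) for n in neighbours) / len(neighbours)
        else:
            bonus = 0.0
        result[node] = own + weight * bonus
    return result

# Nodes a and b are relevant and link to each other; c is equally
# relevant on its own text but links only to an irrelevant node d.
base = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 0.0}
links = {"a": ["b"], "b": ["a"], "c": ["d"]}
scores = link_aware_scores(base, links)
print(scores)
```

Here a and b, which form a small cluster of relevant nodes, end up above c even though all three have the same score when judged on their own text.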

Most information retrieval software that is currently available follows Luhn's assumption (as described in the Information Retrieval book of Van Rijsbergen [CJR]): the frequency of word occurrence in an article furnishes a useful measurement of word significance. There are a few pitfalls though:

Stripping, stemming and looking for synonyms must be done both on the words in the search string and on the words in the documents.
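
The sketch below shows the idea of running the search string and the document text through the same normalization before comparing them. The stop-word list, the synonym table, and the crude suffix stripper are all hypothetical stand-ins chosen for the example; a real system would use a proper stemming algorithm and a much larger vocabulary.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "and", "for", "in"}  # assumed, tiny list
SYNONYMS = {"hyperdocument": "hypertext"}                   # hypothetical table

def crude_stem(word):
    """Toy suffix stripper; stands in for a real stemming algorithm."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def normalize(text):
    """Strip punctuation and stop words, stem, then map synonyms."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    stems = [crude_stem(w) for w in words if w not in STOP_WORDS]
    return [SYNONYMS.get(s, s) for s in stems]

def matches(query, document):
    """True if every normalized query word occurs in the normalized document."""
    return set(normalize(query)) <= set(normalize(document))

print(normalize("Searching hyperdocuments"))
print(matches("linking nodes", "The node links of a hypertext"))
```

If only the search string were normalized, "linking nodes" would become "link node" and would fail to match a document that contains the word "links"; applying the same function to both sides avoids this.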