Scoring Search Matches
Most search interfaces for text and hypertext show a list or menu of
nodes that contain the search string or otherwise satisfy the search
condition. In this list, each node is given a score indicating how
well the node answers the information request.
This contrasts with so-called "boolean" retrieval, where documents are
either relevant or not, with no graded degree of relevance.
Determining the score of a node requires more than just counting "hits".
It means taking into account the probability of words occurring
(possibly together) in a text, and normalizing for the length of the
document.
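
As an illustration of such a score, here is a minimal sketch in Python. It
assumes a common tf-idf style weighting, which the text above does not
prescribe: a word's frequency in a document is divided by the document
length, and weighted by how improbable the word is across the collection.
The names tokenize and score_documents are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace; a deliberately naive tokenizer."""
    return text.lower().split()

def score_documents(query, documents):
    """Return one relevance score per document for the given query.

    Each query word contributes (count in document / document length),
    weighted by log(N / number of documents containing the word).
    This weighting is an assumption made for the sketch.
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for words in tokenized for word in set(words))
    query_terms = set(tokenize(query))
    scores = []
    for words in tokenized:
        counts = Counter(words)
        score = 0.0
        for term in query_terms:
            if counts[term] == 0:
                continue  # the word does not occur in this document
            tf = counts[term] / len(words)            # length-normalized frequency
            idf = math.log(n_docs / doc_freq[term])   # rarity across the collection
            score += tf * idf
        scores.append(score)
    return scores
```

With this weighting a short document that mentions a query word once scores
higher than a long document that mentions it once, and a word that occurs in
every document contributes nothing.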
When the link structure is also used, the scoring function could take
into account the relevance of nodes that can be reached from the node
being scored. This would favor clusters of relevant nodes over single nodes.
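
One possible way to do this is sketched below. It looks only one link ahead
and adds a fraction of the mean score of the directly linked nodes. The
function name link_aware_scores and the damping parameter are assumptions
made for the example, not part of any particular system.

```python
def link_aware_scores(base_scores, links, damping=0.5):
    """Add a fraction of the mean score of directly linked nodes.

    base_scores: dict mapping node -> content-only score
    links:       dict mapping node -> list of nodes it links to
    damping:     hypothetical weight given to the neighbours' relevance
    """
    adjusted = {}
    for node, score in base_scores.items():
        # Ignore links to nodes that were not scored at all.
        neighbours = [n for n in links.get(node, []) if n in base_scores]
        if neighbours:
            neighbour_mean = sum(base_scores[n] for n in neighbours) / len(neighbours)
        else:
            neighbour_mean = 0.0
        adjusted[node] = score + damping * neighbour_mean
    return adjusted
```

Two relevant nodes that link to each other end up ranked above an equally
relevant node with no links, which is the clustering effect described above.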
Most information retrieval software that is currently available
follows Luhn's assumption (as described in Van Rijsbergen's book
Information Retrieval [CJR]): "the frequency of word occurrence in an
article furnishes a useful measurement of word significance."
There are a few pitfalls, though:
- Words that occur very often (and in almost all documents) are
useless. Words that are very rare also have low discriminating power.
- Suffixes must be stripped. But beware: the suffix "ual" may be
removed from "factual", but not from "equal". Determining the right
context for removing a suffix is difficult.
- Words must be stemmed. Stems like "absorb" and "absorpt" (as in
"absorption") belong to the same word, and are thus treated as
equivalent. But beware: words like "neutron" and "neutralize" also
share a stem, but are not equivalent.
- A thesaurus of synonyms may be used. But when a word has several
meanings, another word may be a synonym for one of those meanings but
not for the others.
- When searching for several words, the user may find one word more
important than another, while the system would rank them differently
because of their discriminating power.
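
The first pitfall, for instance, could be handled with two cut-offs on
document frequency. The sketch below is only an illustration: the function
name discriminating_words and both threshold values are hypothetical, and
suitable thresholds depend on the collection.

```python
from collections import Counter

def discriminating_words(documents, min_fraction=0.05, max_fraction=0.5):
    """Return the words whose document frequency lies between two cut-offs.

    min_fraction and max_fraction are hypothetical thresholds: the fraction
    of documents a word must at least, and may at most, appear in.
    """
    n_docs = len(documents)
    # Count each word once per document, however often it occurs there.
    doc_freq = Counter(
        word for doc in documents for word in set(doc.lower().split())
    )
    return {
        word
        for word, df in doc_freq.items()
        if min_fraction <= df / n_docs <= max_fraction
    }
```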
Stripping, stemming and looking for synonyms must be done both on the
words in the search string and on the words in the documents.
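
A minimal sketch of this idea follows: one normalize function is applied to
the search string and to the documents alike, so both end up in the same
form. The suffix list, the exception list and the synonym table are tiny
hypothetical examples, not a real stemmer or thesaurus.

```python
# Hypothetical tables, for illustration only.
SUFFIXES = ("ual", "ing", "s")
DO_NOT_STRIP = {"equal"}          # words where a suffix match is misleading
SYNONYMS = {"automobile": "car"}  # maps a stripped word to a canonical form

def strip_suffix(word):
    """Remove the first matching suffix unless the word is an exception."""
    if word in DO_NOT_STRIP:
        return word
    for suffix in SUFFIXES:
        # Keep at least three characters so short words survive intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def normalize(text):
    """Apply the same normalization to any text, query or document."""
    normalized = []
    for word in text.lower().split():
        stripped = strip_suffix(word)
        normalized.append(SYNONYMS.get(stripped, stripped))
    return normalized

def matches(query, document):
    """True when every normalized query word occurs in the document."""
    return set(normalize(query)) <= set(normalize(document))
```

Here matches("automobiles", "Two cars passed") holds because both sides
reduce to "car", while "equal" is left untouched by the exception list.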