Abstract: This paper briefly describes an information retrieval perspective on building user models for web retrieval. It describes the very simple model that prevails, and some of the impediments to adopting a more complicated model. The key impediment may be the nature of information retrieval experiments, which are based on delivering lists of documents with associated judgements of the accuracy of the list. Finally, the paper argues that sophisticated modelling but simple implementation may be appropriate for low-cost solutions, and describes work being conducted to investigate this hypothesis.

Keywords: information retrieval; relevance feedback; user models
The typical model of a user retrieving information on the web is derived from the classical information retrieval model: the user is their query. Thus any user typing the same query gets the same response from the database. This has been the fundamental model of information retrieval for as long as information retrieval has been a computing discipline; see for example [7, 8]. In particular, models of information retrieval are based on the similarity of the information request and the objects in the information space. Most web retrieval engines are based on the vector space model, in which a query and a web document are each represented as vectors in a high-dimensional space and compared.
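To make this concrete, the following minimal Python sketch of vector-space matching uses an unweighted bag-of-words representation and a toy collection of our own invention; production engines use weighted vectors and inverted indexes, but the principle is the same, and it shows why every user who issues the same query receives the same ranking.

import math
from collections import Counter

def cosine(a, b):
    # Cosine of the angle between two sparse term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical toy collection: two short "documents" and one query.
docs = {
    "d1": Counter("cheap flights to rome".split()),
    "d2": Counter("rome travel guide museums".split()),
}
query = Counter("rome flights".split())

# Rank documents by similarity to the query alone: every user issuing
# this query sees exactly the same ranking.
ranking = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranking)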
Naturally there are severe limitations to this incredibly simplistic user model, but due to its computational simplicity, and the success of the model, it has survived. One of the background reasons for its enduring nature is the mechanism used to evaluate the success of information retrieval systems. Evaluation is done using a test collection and a set of queries, and then determining the ability of the search engine to place documents judged relevant ahead of those judged irrelevant. The most popular measurements are recall and precision [7]. Thus, we can see that success is based on the premise that relevance can be determined solely on the basis of the query, and not on the user, their past history, current circumstances, or the future use of the retrieved information.
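As a hedged illustration of why this evaluation style leaves the user out, the sketch below computes precision and recall at a cutoff from a ranked list and a set of binary relevance judgements; the document identifiers are invented for the example.

def precision_recall(ranked, relevant, k):
    # Precision and recall at cutoff k, given binary relevance judgements.
    retrieved = ranked[:k]
    hits = sum(1 for d in retrieved if d in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Invented example: the measures depend only on the list and the judgements,
# never on who issued the query or why.
ranked = ["d3", "d1", "d7", "d2", "d5"]
relevant = {"d1", "d2", "d9"}
print(precision_recall(ranked, relevant, k=5))  # (0.4, 0.666...)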
There have been important strands of information retrieval research that break down this model to some degree. The most mainstream of these investigations has been relevance feedback, which by and large has not been implemented in Web search engines. In this setting, after an initial query, the user indicates which of the presented documents are relevant and which are irrelevant. This allows a history to be built up and more knowledge of the user's requirements to be gathered. However, this information is generally used simply to modify the query, after which another round of matching takes place.
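The classical form of such query modification is Rocchio-style reweighting; the sketch below is our own minimal version, with conventional parameter values rather than anything specified here, and it shows how the accumulated feedback survives only as a reweighted query rather than as a richer user model.

from collections import Counter

def rocchio(query, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    # Fold relevance feedback back into the query vector: boost terms from
    # documents marked relevant, demote terms from those marked irrelevant.
    new_query = Counter()
    for term, weight in query.items():
        new_query[term] += alpha * weight
    for doc in rel_docs:
        for term, weight in doc.items():
            new_query[term] += beta * weight / len(rel_docs)
    for doc in nonrel_docs:
        for term, weight in doc.items():
            new_query[term] -= gamma * weight / len(nonrel_docs)
    # Keep only positively weighted terms; the "user model" is now just
    # this modified query, and matching proceeds exactly as before.
    return Counter({t: w for t, w in new_query.items() if w > 0})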
Another important piece of work is occurring in the context of TREC [9], an annual text retrieval conference. For the first time this year there is a track concentrating on web retrieval, and another track concentrates on retrieval where the task is to find different aspects of an answer. In this context, it is not enough simply to identify documents as relevant or irrelevant; it is necessary to find one piece of evidence for each aspect of a topic. We thus start to see that different users may have different information needs, but there is still no attempt to account for individual differences.
There has been significant research on user models for information retrieval, and in particular on searcher behaviour in a library environment. However, this work has rarely been adopted in unmediated environments such as web retrieval. A key reason for the limited activity in user modelling in information retrieval is that it is not clear what we would do with a complex model. We have algorithms for matching sets of queries against large sets of documents efficiently, but if the user model is represented as a set of constraints, dependencies, and a complex history, then we have no efficient method of matching such information needs against the document collection. We therefore need to build user models that are sufficiently representative to allow individual information needs to be expressed, and yet sufficiently simple that efficient matching is possible. How then do we move forward?
There may be several components needed to allow retrieval that more accurately reflects user needs. A first step is to recognise that a list of documents is often not what is desired, so different answer types are needed. We have seen that in TREC we may seek coverage across the aspects of a topic. One way of discovering these aspects may be to develop clusters of information so that the user can build their own map of the information space [11]. In separate work we explored the idea that users may not so much be seeking specific answers as seeking points at which to commence exploration of an information space, such as a well designed intranet [10]. It is thus important to model the nature of the user's desired answer.
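As one hedged illustration of the clustering idea, the snippet below groups a handful of retrieved snippets into aspect-like clusters; scikit-learn and the toy texts are our own choices for the sketch, not part of the cited work.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Invented snippets standing in for documents retrieved for "rome".
retrieved = [
    "rome hotels and accommodation",
    "rome museums vatican galleries",
    "cheap flights to rome airlines",
    "rome restaurants food and wine",
]
vectors = TfidfVectorizer().fit_transform(retrieved)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

# Each cluster can be presented as one "aspect" for the user to explore.
for label, text in zip(labels, retrieved):
    print(label, text)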
A second component of successful retrieval is to recognise that there may be no single document that provides the information needed. Parts of different documents, and information from databases, may need to be synthesised into a virtual document, created precisely to satisfy a particular user's needs at that time. We thus need to match on partial documents [5], taking into account the structure of the documents [1, 2]; to extract information from databases and express it in natural language [4]; and to deliver this information in virtual documents created from these components [6].
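A minimal sketch of the virtual-document idea follows, assuming a naive sentence split and term-overlap scoring as stand-ins for the structured matching and extraction techniques cited above; the function names and sources are hypothetical.

def term_overlap(query_terms, passage):
    # Crude passage score: number of query terms present in the passage.
    return len(query_terms & set(passage.lower().split()))

def virtual_document(query, sources, top_n=2):
    # Split each source into rough passages, score them against the query,
    # and stitch the best passages into a single synthesised answer.
    terms = set(query.lower().split())
    passages = [(term_overlap(terms, p), name, p)
                for name, text in sources.items()
                for p in text.split(". ") if p]
    best = sorted(passages, reverse=True)[:top_n]
    return "\n".join("[{}] {}".format(name, p) for _, name, p in best)

sources = {
    "guide": "Rome has many museums. The Vatican galleries are popular.",
    "flights": "Cheap flights to Rome are seasonal. Airlines vary widely.",
}
print(virtual_document("rome museums", sources))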
A third component is to recognise that the user model changes as the retrieval process takes place. This happens all the time in dialogue: as points in a dialogue are reached, some information ceases to be relevant and other information becomes important. It is thus important to understand how discourse and dialogue models influence user models during an interaction.
Finally, we come to pragmatics: sophisticated user models are simply not going to be implemented in a high-volume environment such as the web. However, the key elements of a user model that lead to significant gains in the effectiveness of a user's interaction with a retrieval system deserve careful analysis. Some elements are clear: identification of the user allows history to be taken into account; identification of the type of answer needed allows one of a variety of answer types to be selected; identification of the user's preferred language allows appropriate delivery; and identification of the volume of information needed allows tailoring and relevant abstracting to take place. Studies of these key factors of the user model are needed to substantially improve retrieval systems.
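To indicate how small such a pragmatic model could be, the record below captures only the elements listed above (identity and history, answer type, preferred language, and volume); the field names and defaults are our own illustration, not an implemented system.

from dataclasses import dataclass, field

@dataclass
class UserModel:
    user_id: str                        # lets history be accumulated across sessions
    answer_type: str = "document_list"  # e.g. "document_list", "cluster_map", "virtual_document"
    language: str = "en"                # preferred delivery language
    max_volume: int = 10                # how much information to deliver (items or passages)
    history: list = field(default_factory=list)  # past queries and judged-relevant items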
A team at CSIRO, including Cecile Paris, Maria Milosavljevic, François Paradis, Mingfang Wu and Ross Wilkinson, is investigating how to combine the disciplines of information retrieval, virtual documentation, natural language generation, and user modelling to understand how the delivery of information tailored to the particular needs of a user can help us take a substantial step beyond the vector space model as used in the web today. We are investigating particular case studies, including the tourist, who may be represented by an itinerary and other salient features, and who wishes to derive a relevant travel guide from Web resources. We are also investigating the construction of business analysis reports, delivered daily from a variety of business and news sources, that reflect the particular needs of each analyst. We believe it is important to understand how tailored information delivery can make a difference in particular circumstances, so that we can develop a deeper understanding of the key features of a user model that facilitate better retrieval.