Mike Perkowitz Oren Etzioni
Department of Computer Science and Engineering, Box 352350
University of Washington, Seattle, WA 98195
(206) 616-1845 Fax: (206) 543-2969
Designing a complex web site so that it readily yields its information is tricky. First, different visitors have distinct goals. Second, the same visitor may seek different information at different times. Third, many sites outgrow their original design, accumulating links and pages in unlikely places. Fourth, a site may be designed for a particular kind of use, but may be used in many different ways in practice; the designer's a priori expectations may be violated. Too often web site designs are fossils cast in HTML, while web navigation is dynamic, time-dependent, and idiosyncratic.
In [Perkowitz and Etzioni1997], we challenged the AI community to address this problem by creating adaptive web sites: sites that automatically improve their organization and presentation by learning from visitor access patterns. Many AI advances, both practical and theoretical, have come about in response to such challenges. The quest to build a chess-playing computer, for example, has led to many advances in search techniques (e.g., [Anantharaman et al.1990]). Similarly, the autonomous land vehicle project at CMU [Thorpe1990] resulted not only in a highway-cruising vehicle but also in breakthroughs in vision, robotics, and neural networks. We believe that the adaptive web sites challenge will also drive AI advances.
Much of the previous work on adaptive web sites has focused on fairly simple adaptations (e.g., automatically creating shortcuts in the site) and on customization -- ``personalizing'' a web site to suit the needs of each individual user. In contrast, our own work has been motivated by two goals: first, we seek to demonstrate that relatively sophisticated adaptations can be generated; second, we seek to aggregate information gleaned from a population of users to transform the site -- altering it to make navigation easier for a large number of users. These goals have led us to investigate the problem of index page synthesis: the automatic creation of navigational pages that consist of a comprehensive set of links to pages at the site on a particular topic (e.g., an index page on ``electric guitars'' at a musical instruments web site).
In [Perkowitz and Etzioni1998], we formally defined the index page synthesis problem and presented an algorithm called PageGather that discovers, from visitor access logs, candidate link sets which form the basis for new index pages. In that paper, we compared PageGather to classical clustering algorithms [Voorhees1986,Rasmussen1992,Willet1988] and to Apriori, the classical data mining algorithm for the discovery of frequent sets [Agrawal and Srikant1994], using the ``cohesiveness'' of the generated link sets as the basis for comparison. We found that PageGather produced substantially better candidates in our domain. Surprisingly, we also found that PageGather's candidates were better than human-authored index pages available at our experimental test site.
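To make the discovery step concrete, the following is a minimal sketch of a PageGather-style pipeline, not the published algorithm itself: it computes pairwise page co-occurrence frequencies over per-visitor page sets, links page pairs whose frequency exceeds a threshold, and returns the connected components of the resulting graph as candidate link sets. The function names, the use of connected components, and the threshold value are illustrative assumptions.

```python
# Hypothetical sketch of a PageGather-style candidate-discovery step.
# Assumes access logs have already been grouped into per-visitor page lists.
from collections import defaultdict
from itertools import combinations

def cooccurrence(visits):
    """For each pair of pages, the fraction of visits containing both."""
    counts = defaultdict(int)
    for visit in visits:
        for a, b in combinations(sorted(set(visit)), 2):
            counts[(a, b)] += 1
    n = len(visits)
    return {pair: c / n for pair, c in counts.items()}

def candidate_link_sets(visits, threshold=0.5):
    """Link page pairs whose co-occurrence frequency meets `threshold`;
    return the connected components of that graph as candidate link sets."""
    graph = defaultdict(set)
    for (a, b), f in cooccurrence(visits).items():
        if f >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    seen, components = set(), []
    for node in graph:                      # simple DFS over the page graph
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:
            p = stack.pop()
            if p not in comp:
                comp.add(p)
                stack.extend(graph[p] - comp)
        seen |= comp
        components.append(comp)
    return components
```

Note that nothing in this statistical step forces a component to correspond to an intuitive topic; a component is merely a set of pages that visitors tend to view together, which is exactly the gap discussed next.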
More detailed examination of the data reveals the reason for PageGather's ``super-human'' performance: human index page authors operate under a constraint that PageGather ignored -- for successful navigation, index pages typically correspond to a topic or concept that is intuitive to visitors; they cannot be composed of a cohesive, but arbitrary, set of links. For example, if the topic of the index page is ``electric guitars'', the page should contain no links to information about pianos or drums, and there should be no local pages about electric guitars that are not linked to from the index page.
PageGather relies on a statistical approach to discovering candidate link sets; its candidates do not correspond precisely to intuitive concepts, whereas human-authored index pages do. In our current work we formalize index page synthesis as a conceptual clustering problem and introduce a novel approach which we call conceptual cluster mining: we search for a small number of cohesive clusters that correspond to concepts in a given concept description language L.
We have presented SCML, an algorithm schema that combines a statistical clustering algorithm with a concept learning algorithm. The clustering algorithm is used to generate seed clusters, and the concept learning algorithm to describe these seed clusters using expressions in L. We have presented preliminary experimental evidence that instantiations of SCML outperform existing algorithms (e.g., COBWEB) in this domain. Our work also constitutes a novel approach to the long-standing problem of conceptual clustering [Michalski and Stepp1983].
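The SCML schema described above can be sketched as follows. This is an illustrative instantiation under assumed representations, not the algorithm from the paper: pages carry attribute-value features, the concept description language L is taken to be conjunctions of attribute=value pairs, the "statistical" step is assumed to have already produced seed clusters (e.g., PageGather candidates), and the "concept learning" step finds the most specific conjunction covering each seed. The final conceptual cluster is the extension of that concept over the whole site.

```python
# Minimal sketch of the SCML schema: statistical seeds in, conceptual
# clusters out. The feature representation and the choice of conjunctions
# as the language L are simplifying assumptions for illustration.

def learn_conjunction(seed, features):
    """Most specific conjunction of attribute=value pairs shared by all
    pages in the seed cluster (a simple stand-in for learning in L)."""
    pages = list(seed)
    concept = dict(features[pages[0]])
    for p in pages[1:]:
        concept = {k: v for k, v in concept.items()
                   if features[p].get(k) == v}
    return concept

def extension(concept, features):
    """All pages at the site whose features satisfy the concept."""
    return {p for p, attrs in features.items()
            if all(attrs.get(k) == v for k, v in concept.items())}

def scml(seed_clusters, features):
    """Describe each statistically derived seed in L, then return
    (concept, conceptual cluster) pairs."""
    return [(c, extension(c, features))
            for c in (learn_conjunction(s, features) for s in seed_clusters)]
```

The key effect, in the spirit of the text: a seed that statistically groups two electric-guitar pages is generalized to the concept "type=guitar, power=electric", whose extension pulls in every matching local page and excludes pages about pianos or drums, so the resulting index page corresponds to a concept rather than an arbitrary cohesive set.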