Stanford University Digital Libraries Project Working Paper SIDL-WP-1996-0041
Revised version to appear in First International Conference on Autonomous Agents, February 1997, Marina del Rey, CA

An Adaptive Web Page Recommendation Service

Marko Balabanovic
Department of Computer Science, Stanford University

Abstract

An adaptive recommendation service seeks to adapt to its users, providing increasingly personalized recommendations over time. In this paper we introduce the "Fab" adaptive web page recommendation service. There has been much research on analyzing document content in order to improve recommendations or search results. More recently researchers have begun to explore how the similarities between users can be exploited to the same ends. The Fab system strikes a balance between these two approaches, taking advantage of the shared interests among users without losing the benefits of the representations provided by content analysis. Running since March 1996, it has been populated with a collection of agents for the collection and selection of web pages, whose interaction fosters emergent collaborative properties. In this paper we explain the design of the system architecture and report the results of our first experiment, evaluating recommendations provided to a group of test users.

1 Introduction

When faced with an insurmountably large array of choices, people often turn to some kind of recommendation service. For instance they will look in their newspaper for movie reviews, or in a travel guide for tourist sights. We are interested in creating an automatic recommendation service which over time adapts to its users, so that they receive increasingly personalized recommendations. We can think of on-line recommendation as a three-stage process:

Collection First collect the items to be recommended.

Selection Next select from the collected items those best for a particular user.

Delivery Finally deliver the selected items to the user.

Most on-line recommendation or search services can be described in this way. For instance a web index (e.g., (Mauldin & Leavitt 1994)) will have an exhaustive collection component, a selection component which is an information retrieval (IR) system to match web pages to user-supplied queries, and will deliver the pages via the web. A news clipping service collects items by subscribing to news wires, selects items for users according to profiles they submit, and might deliver them by fax or e-mail.

Our work explores adaptive methods for the collection and selection of web pages. In order to facilitate this exploration we have designed the "Fab" architecture as a test-bed for trying out different collection and selection methods, supporting a population of users and recording experimental results.

The two major classes of adaptive recommendation service are content-based (recommending items based on some analysis of their content) and collaborative (recommending items based on the recommendations of other users). We have attempted to establish a middle ground, taking advantage of the shared interests among users without losing the benefits of the representations provided by content analysis.

In this paper we report results from our ongoing first experiment. The Fab system has been instantiated with classes of adaptive content-based collection and selection agents, whose interaction fosters emergent collaborative properties. By "agent" we simply mean a process which maintains a long-term state, in our case a profile which can be used to induce a ranking among web pages.
The operation of the system is as follows: users can request recommendations at any time, and will be shown the ten highest-ranking web pages according to their profile. They rate each page according to how well it matches their interests, and the collection and selection agents use this feedback to refine their profiles.

The following section places the work in context by describing related projects from both the IR and the AI communities. We then describe the design of the architecture and of the agents which populate it. Finally we describe our experimental design and give results from our first experiment.
2 Related Work

Two strands of related research are of particular relevance: work on document classification (which considers alternative techniques for selection) and collaborative systems (which propose a different architecture based on a novel selection technique).

2.1 Document Classification

Document classification lies at the intersection of machine learning (ML) and IR, and there is a large literature. To enable the application of ML techniques there has been much interest recently in dimensionality reduction (e.g., (Deerwester et al. 1990)). In the IR community variants of relevance feedback have been studied in the context of the routing task at the TREC conferences (Harman 1995), for example (Buckley et al. 1995; Allan 1995). There have also been numerous comparisons between these variants and non-incremental ML techniques (Schutze, Hull, & Pedersen 1995; Lang 1995; Pazzani, Muramatsu, & Billsus 1996). One of the disadvantages of such techniques is the large number of examples that are required before the learning algorithm can be applied. In contrast we assume an on-line model where the user gradually sees better and better pages, and we do not assume a fixed document collection: users' feedback influences which pages are collected and shown to them.

Some of the work on document filtering (where the classification is used to pick relevant documents from an incoming stream) shares our on-line model of learning (Sheth & Maes 1993; Foltz & Dumais 1992). Assisted browsing systems (Armstrong et al. 1995; Lieberman 1995) share our goal of recommending web pages, but restrict themselves to the section of the web just ahead of the user's current browser location, and then recommend appropriate links to follow.

2.2 Collaborative Systems

Collaborative recommendation services (Shardanand & Maes 1995; Resnick et al. 1994) present an interesting alternative to traditional IR techniques. Typically they work by identifying the nearest neighbors to a user in the space of past ratings, and then recommending something one of these neighbors has rated highly.

Domains with a constant stream of new items are difficult for collaborative systems: a single item may not be rated very often by different users, and much of the information gained by the system is rooted in the past and therefore unusable. Examples of such domains are web pages or news feeds. In contrast more static domains (e.g., music) or those with a much smaller number of new items appearing (e.g., movies) will enable collaborative systems to work well, as lots of different users will rate the same items. Additionally for these domains it is very hard to do any content analysis, so content-based systems will be harder to implement.

3 System Design

3.1 Background

For the content-based approach which we are taking, there are four essential requirements:

w A representation of a web page.

m A representation of the user's interests.

r(w, m) A function to determine the pertinence of a web page given a user's interests.

u(w, m, s) A function returning an updated user profile m given the user's feedback s on a page w.

The assumption underlying content-based systems is that the content (rather than appearance, interactivity, speed of loading, etc.) of a page is what determines the user's interest. Now we make the further assumption that we can represent the content of a page purely by considering the words contained in the text. We ignore all mark-up tags, images and other multimedia information.
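To make these four requirements concrete, the following sketch states them as Python signatures. The type and function names (Page, Profile, pertinence, update) are our own illustrative choices rather than anything from the Fab implementation; concrete instantiations of r and u follow in the next paragraphs.

from typing import Dict

# Pages and profiles are both sparse vectors of word weights,
# represented here as word -> weight dictionaries.
Page = Dict[str, float]
Profile = Dict[str, float]

def pertinence(w: Page, m: Profile) -> float:
    """r(w, m): the pertinence of page w to the interests encoded in m."""
    raise NotImplementedError   # instantiated below as a cosine-style measure

def update(w: Page, m: Profile, s: int) -> Profile:
    """u(w, m, s): the profile after the user gives score s to page w."""
    raise NotImplementedError   # instantiated below as relevance feedback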
The vector-space model of information retrieval (Salton & McGill 1983) provides us with an appropriate representation for documents based on their constituent words. This model has been used and studied extensively, forms the basis for commercial web search systems and has been shown to be competitive with alternative IR methods (Harman 1995).

In this model documents and queries are represented as vectors. Assume some dictionary vector d, where each element d_i is a word. Each document then has a vector w, where element w_i is the weight of word d_i for that document. If the document does not contain d_i then w_i = 0.

In our formulation we reduce words to their stems using the Porter algorithm (Porter 1980), ignore words from a standard stop list of 571 words, and calculate a TFIDF weight: the weight w_i of a word d_i in a document W is given by:

w_i = (0.5 + 0.5 * tf(i) / tf_max) * log(n / df(i))

where tf(i) is the number of times word d_i appears in document W (the term frequency), df(i) is the number of documents in the collection which contain d_i (the document frequency), n is the number of documents in the collection and tf_max is the maximum term frequency over all words in W.

The document frequencies are approximated by using a fixed dictionary of approximately 70,000 stemmed words created from a sample of 5,229 randomly picked web pages (so in the above equation n = 5,229). If d_i does not occur in the dictionary we set df(i) = 1.

In an attempt to avoid over-fitting, and to reduce memory and communications load, we use just the 100 highest-weighted words from any document. Recent experiments described in (Pazzani, Muramatsu, & Billsus 1996) have shown that using too many words leads to a decrease in performance when classifying web pages using supervised learning methods.
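As a rough illustration of this weighting scheme, here is a minimal sketch in Python. It assumes the document-frequency dictionary df is given, substitutes naive lowercase tokenization for the paper's mark-up stripping and Porter stemming, and takes the stop list as a parameter; these simplifications are ours, not the paper's.

import math
import re

def tfidf_vector(text, df, n=5229, stop=frozenset(), top_k=100):
    """Build the sparse TFIDF vector w for one document."""
    words = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in stop]
    tf = {}
    for t in words:
        tf[t] = tf.get(t, 0) + 1            # term frequencies
    if not tf:
        return {}
    tf_max = max(tf.values())               # maximum term frequency in W
    weights = {t: (0.5 + 0.5 * f / tf_max) * math.log(n / df.get(t, 1))
               for t, f in tf.items()}      # df defaults to 1 for unknown stems
    # keep only the 100 highest-weighted words, then normalize to unit length
    top = dict(sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_k])
    norm = math.sqrt(sum(v * v for v in top.values()))
    return {t: v / norm for t, v in top.items()}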
Our exploratory experiments have agreed with their findings that the optimum is between 30 and 100. We are using the figure of 100 pending more extensive experiments, which will be possible once more user ratings have been gathered.

Once the top 100 words have been picked we normalize w to be of unit length, to allow comparisons between documents of different lengths.

This vector representation is used both for pages, as explained, and for the model of the user's interests, the user profile m (which corresponds to the query in a retrospective IR system). In order to measure how well a page w matches a profile m, we use a variant of the standard IR cosine measure:

r(w, m) = w · m

Updating m also corresponds to a normal operation in retrospective IR: relevance feedback (Rocchio 1971). We use a simple update rule:

u(w, m, s) = m + sw

where s is the user's score for page w (we translate the scale shown in Figure 1 to the integers 3, 2, 1, 0, -1, -2, -3).

Figure 1: Ordinal scale users use to rank documents (Excellent, Very Good, Good, Neutral, Poor, Very Bad, Terrible).

Relevance feedback has been demonstrated to be a powerful technique: experiments have shown that queries automatically generated through relevance feedback can outperform human searchers (Foltz & Dumais 1992; Harman 1995).

We assume that a user's interests change over time, and furthermore that this happens in real time, regardless of the number of recommendations requested per day. This process is modelled using a simple decay function: each night all the weights in the profiles are multiplied by 0.97.
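Under the sparse-dictionary representation sketched earlier, r, u and the nightly decay are only a few lines each. This is a minimal sketch of our own: since page vectors are unit length but profiles are not, the plain dot product is presumably why the paper calls r a "variant" of the cosine measure, and s is assumed already translated to an integer in [-3, 3].

def r(w, m):
    """Pertinence of page w to profile m: dot product of two sparse vectors."""
    if len(w) > len(m):
        w, m = m, w                      # iterate over the shorter vector
    return sum(weight * m.get(word, 0.0) for word, weight in w.items())

def u(w, m, s):
    """Relevance feedback: m + s*w, with s an integer from 3 (Excellent)
    down to -3 (Terrible)."""
    updated = dict(m)
    for word, weight in w.items():
        updated[word] = updated.get(word, 0.0) + s * weight
    return updated

def nightly_decay(m, factor=0.97):
    """Model drifting interests by shrinking every profile weight each night."""
    return {word: weight * factor for word, weight in m.items()}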
3.2 Architecture

Figure 2 shows the main components of the architecture: selection agents, collection agents and the central repository. All agents maintain a profile: each user has a selection agent, which maintains their user profile; each collection agent maintains a search profile which is used to guide it in its collection of web pages.

The goals underlying the design of the architecture were twofold:

- To allow scaling up, so that it would be easy to add extra users, agents and resources.

- To take advantage of potential overlaps between users' interests.

The architecture is best explained by considering the path of a document W through the system. Let's say the document is found by agent A, one of the various types of collection agent (more details in section 3.3), and immediately converted to its vector representation w. At regular intervals collection agents submit the pages they have gathered which best match their search profiles to the central repository, replacing the pages they previously submitted. Thus at any time the repository contains each collection agent's best pages, in their own opinions.

When a user requests their Fab recommendations, their selection agent (of which there is one per user) picks, from the entire repository, those pages which best match the user's personal profile. The user then ranks these pages. These rankings are used both for evaluation purposes (discussed in section 4) and as feedback for the agents to learn from. The selection agent uses the feedback to update the user's personal profile (using the function u). It also forwards the feedback, via the central repository, to the originating agent A, which will update its search profile in the same way.

Figure 2: Overview of the Fab architecture.

There are several advantageous features of this architecture:

- A user's personal profile is a valuable resource, representing considerable effort and time spent ranking pages. Its representation as part of the user's selection agent means that the user's information will never be lost through some adaptation scheme or "drowned out" by other users (it is only updated with the user's own feedback). Furthermore it makes the profile available to the user for use with other applications (e.g., e-mail or news filtering), as opposed to a purely collaborative system, where there is no meaningful way to consider one user's profile in isolation.

- A brand new user to the system is shown a selection of pages which are randomly chosen from the repository. However the repository contains pages which various agents believe will best match the current user population. Thus the new user is already starting from a much higher level than would be expected from an empty profile, especially if the system is deployed in an organization or special interest group where there will be significant overlap between users' interests.

- It is possible to cheaply maintain a population of "parasitic" users. We can add every evaluation received (from all users) into an amalgamated profile, which represents an average of the user community. We can then pick pages from the repository according to this profile and make them available, thus allowing "parasites" to view pages preselected by an established group of users, without any requirement to provide feedback. (Top pages from Fab, selected according to its amalgamated profile, are available daily at http://fab.stanford.edu.)
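The following sketch shows the central repository's role in this flow: replacement submissions from collection agents, iteration over all submitted pages for selection, and the routing of feedback both to the originating agent and into the amalgamated profile. It reuses u from the sketch above; class and method names are illustrative choices of ours, not Fab's actual interfaces.

class CentralRepository:
    """Holds every collection agent's current best pages."""
    def __init__(self):
        self.pages_by_agent = {}        # agent id -> list of (url, page vector)
        self.amalgamated_profile = {}   # running average of all users' feedback

    def submit(self, agent_id, pages):
        # A new submission replaces the agent's previous one, so the repository
        # always contains each agent's best pages "in its own opinion".
        self.pages_by_agent[agent_id] = pages

    def all_pages(self):
        """Iterate over (agent_id, url, vector), so selection agents can
        rank the whole repository and route feedback back to the finder."""
        for agent_id, pages in self.pages_by_agent.items():
            for url, w in pages:
                yield agent_id, url, w

    def forward_feedback(self, agent_id, w, s, search_profiles):
        # A user's score s updates the originating agent's search profile
        # with the same relevance-feedback rule u, plus the amalgamated profile.
        search_profiles[agent_id] = u(w, search_profiles[agent_id], s)
        self.amalgamated_profile = u(w, self.amalgamated_profile, s)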
In addition to these features, our design relies on the hypothesis that the following two behaviors will be exhibited:

Specialization By only giving feedback to the collection agent which originally found a page, we are attempting to encourage specialization. If an agent begins to specialize to a topic, users not interested in that topic should no longer be shown pages from that agent, which should narrow its specialization further. Eventually we expect the agent will settle into a niche where there is a group of users interested in the topic which its profile represents. In future experiments we hope to see the overlap between users' interests leading to an economy of scale, whereby several users interested in one topic can be served by a single agent, and conversely users interested in several topics can be served by several agents.

Serendipity If any collection agent finds a page which matches a user's profile, the user should see it. This includes agents which have specialized to finding pages for other groups of users, agents which monitor various existing web page recommendation services, agents which merely produce random pages, etc. This property can be thought of as allowing for serendipitous "word-of-mouth" recommendation, whereby the user is shown pages from outside of their regular sources which happen to match their interests.

So far we have only discussed the short-term learning of the agents: updating profiles based on users' rankings of pages. However there also exists a need for a longer-term evolution of the agents, to ensure a spread of profiles in the space of topics. It is undesirable to maintain collection agents which have very similar profiles (it would be more efficient to combine them), or agents whose pages are rarely picked or receive very low scores. We have defined our architecture so that this long-term adaptation component is easily replaceable, as we intend to conduct further research to investigate different schemes.

3.3 Collection Agents

We have implemented two kinds of adaptive collection agents, along with a variety of non-adaptive kinds for comparative purposes.

Search Agents First described in (Balabanovic & Shoham 1995), these agents perform a best-first search of the web. For each agent the heuristic function used to evaluate a web page w is r(w, m), where m is the agent's search profile. In order to better cope with the increasing size and branching factor of the web, we limit the size of the search fringe, and restrict the number of nodes from any particular site. A record of all nodes visited is maintained separately, with a one-month timeout. In effect we have a hybrid best-first/beam search.
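A compressed sketch of such an agent's search loop appears below. The fetch and extract_links callbacks stand in for the crawler machinery, a child link inherits its parent page's score as its heuristic estimate (one plausible reading; the paper does not spell this detail out), and the numeric limits are placeholders rather than Fab's actual settings.

import heapq
from urllib.parse import urlparse

def best_first_search(seed_urls, m, fetch, extract_links,
                      steps=1000, fringe_limit=500, per_site_limit=20):
    """Hybrid best-first/beam search guided by r(w, m) with search profile m."""
    fringe = [(0.0, url) for url in seed_urls]   # min-heap of (-score, url)
    visited, per_site, found = set(), {}, []
    while fringe and steps > 0:
        _, url = heapq.heappop(fringe)
        site = urlparse(url).netloc
        if url in visited or per_site.get(site, 0) >= per_site_limit:
            continue                   # restrict nodes from any particular site
        visited.add(url)               # the real record persists with a 1-month timeout
        per_site[site] = per_site.get(site, 0) + 1
        w = fetch(url)                 # page as a TFIDF vector
        score = r(w, m)                # heuristic value of this page
        found.append((score, url))
        for link in extract_links(url):
            heapq.heappush(fringe, (-score, link))   # parent's score as estimate
        if len(fringe) > fringe_limit:               # beam aspect: cap the fringe
            fringe = heapq.nsmallest(fringe_limit, fringe)  # sorted list is a valid heap
        steps -= 1
    return sorted(found, reverse=True)               # the agent's best pages so far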
Index Agents Rather than actually searching the web, these agents attempt to construct queries for existing web indexes, in an attempt to avoid duplicating work. The indexes used are Alta Vista, Inktomi and Excite. (Links to all the on-line services mentioned can be found on the web at http://fab.stanford.edu/papers/aa97/.) Since little information about the IR models employed by these commercial systems is forthcoming, it is hard to design optimal queries. Furthermore the indexes are tailored mainly to short queries, and entering more than 20 search terms often results in zero recall. Thus the current implementation is fairly rudimentary: submit as a disjunctive query the top p words from the agent profile, where p has been optimized by hand for different indexes, and varies from 10 to 20.

Non-Adaptive Agents For purposes of comparison we have implemented some simpler agents which do not maintain their own profile:

no-memory index agents These agents work exactly like regular index agents, but they draw their words from the amalgamated profile. Thus these agents are not trying to specialize but are serving the "mass market".

random agents These agents retrieve pages from various sources of random pages available on the web (i.e. they are not truly randomly picked pages, but rather randomly picked from various preselected collections). Currently pages are drawn from Alta Vista, Yahoo and URouLette.

cool agents These agents retrieve human-recommended pages from nine "cool page of the day" sites around the web. These pages have been selected for their interest to the general community, not any particular user, and are often newly released sites.

Long-term evolution As explained in section 3.2, some mechanism to ensure that agents' profiles spread out in the vector space is required. This remains a topic for further research, but for the time being we have a preliminary implementation. The following procedure is followed separately for each agent type, once a week:

1. Find the best and worst agents over the last week, according to their median feedback score.

2. If they both scored worse than Neutral, restart the worst agent from scratch.

3. If the worst agent scored worse than Neutral but the best agent scored better, duplicate the best agent and kill the worst.
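This weekly procedure translates almost directly into code. In the sketch below, median_score, restart, duplicate and kill are stand-ins for the surrounding agent-management machinery, and Neutral is taken as 0 on the feedback scale of Figure 1.

NEUTRAL = 0   # the scale of Figure 1 maps Neutral to 0

def weekly_evolution(agents, median_score, restart, duplicate, kill):
    """One weekly long-term evolution step for a single agent type."""
    # Step 1: best and worst agents by median feedback over the last week.
    best = max(agents, key=median_score)
    worst = min(agents, key=median_score)
    if median_score(best) < NEUTRAL and median_score(worst) < NEUTRAL:
        restart(worst)        # step 2: both below Neutral, restart the worst
    elif median_score(worst) < NEUTRAL and median_score(best) > NEUTRAL:
        duplicate(best)       # step 3: clone the best agent...
        kill(worst)           # ...and kill the worst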
3.4 Selection Agents

Currently only one selection method is used, based on the comparison function r(w, m). We calculate the value of this function (using the user's profile) for every page in the repository. The highest-scoring pages are shown to the user, with the proviso that no two are identical or from the same site, and that the user has not seen an identical page in the last month.
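A minimal sketch of this selection step, reusing r and the repository iterator from earlier: score every page, then take the top k while enforcing the same-site and already-seen provisos. Deduplicating by URL is a simplification of ours; detecting truly "identical" pages presumably requires content comparison.

from urllib.parse import urlparse

def recommend(repository, profile, seen_last_month, k=10):
    """Rank the whole repository against one user's personal profile."""
    ranked = sorted(repository.all_pages(),
                    key=lambda entry: r(entry[2], profile), reverse=True)
    picked, sites = [], set()
    for agent_id, url, w in ranked:
        site = urlparse(url).netloc
        if url in seen_last_month or site in sites:
            continue          # no two from the same site; no repeats within a month
        picked.append((agent_id, url))    # agent_id lets feedback reach the finder
        sites.add(site)
        if len(picked) == k:
            break
    return picked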
4 Experimental Design

4.1 Background

We have decided to use a new performance measure developed by Yao (1995) in our experiments, in preference to the standard IR measures of precision and recall, or the standard machine learning measure of classification accuracy. The ndpm (normalized distance-based performance measure) is a distance between the user's ranking of a set of documents and the system's ranking of the same documents, normalized to lie between 0 and 1. Space precludes a definition, but we will summarize the advantages of this new measure.

The notion of relevance as used in IR is problematic in general, and in particular for our domain. User studies have shown that users have difficulty making consistent relevance judgments over a long period of time when asked to rate documents on an absolute relevance scale (Lesk & Salton 1971; Saracevic 1975), and in fact this is a general feature of human judgments (Rorvig 1988). There will also be considerable disagreement between the judgments of different users (Saracevic 1995). The usual way to circumvent this problem is to test IR systems on standardized collections of documents and queries, so as to make recall and precision figures comparable. Since our domain is the web, we cannot rely on using a standardized collection.

Furthermore, since there is no query involved in the use of our system, the concept of relevance is inappropriate. Saracevic (1975) has drawn a distinction between the user's unarticulated information need and the query which results from it. Relevance is relative to the query, whereas pertinence is relative to the information need. In our system it is unnecessary to formulate a query, so we can only attempt to measure how well the information need is met.

In contrast to the negative results from the user studies mentioned, it has been shown that human judges are good at making relative judgments given a collection of documents, and will be consistent over time and compared to other judges (Lesk & Salton 1971). The ndpm measure only requires relative judgments from users, and since there is no query involved the judgments relate to pertinence, not relevance. Thus it is more appropriate than precision and recall, which are based on absolute relevance judgments. Furthermore, directly comparing the user and system rankings gives us a single-valued measure, simpler to interpret than precision-recall graphs (interestingly, in the degenerate case where two-level rankings are used, ndpm can be shown to be a composite measure of recall and precision, and is related to several other measures proposed over the years; details in (Yao 1995)). Even apart from these considerations, recall is impossible to measure and difficult even to estimate when the document collection is the whole web.

An alternative approach would be to use classification accuracy as our performance measure. However, as with precision and recall, we would have to ask users to classify items on an absolute scale, which is problematic as already discussed.

4.2 Selection Performance

On a day-to-day basis the system supplies the user with a number of documents it thinks the user will rank highly. It uses the resulting scores in order to perform relevance feedback and improve the user profile. Thus the documents shown to the user represent a narrow segment of the entire set of documents, close to each other in terms of user preference, making them an unsuitable set for use with the measure described. Furthermore, during the normal operation of the system it has been observed that users will skew their rankings in order to influence the system's future behavior (i.e. the feedback they give does not represent their actual opinions).

Therefore we have used the following scheme. At regular intervals the user is given a special list of documents, and asked to rank them according to their own interests. This list is selected randomly from the repository, which contains documents produced by agents specializing in many different topics as well as agents which produce random pages. The rankings are not used for relevance feedback, so there is no ulterior motive for the user. The system also ranks the documents according to how well they match the current user profile (this is the system's prediction of the user's rankings).

We use an ordinal scale to obtain user rankings. The user places each document into one of seven categories, as shown in Figure 1 (Cox (1980) recommends 5, 7 or 9-point scales; the adjectives are selected from (Mittelstaedt 1971)).

The desired result for this experiment is for the ndpm distance between the user's and system's rankings (averaged over all users at the same evaluation point) to decrease gradually over time, as the system's model of the user becomes more accurate.

4.3 Collection Performance

When discussing the use of his distance measure, Yao assumes that the entire document collection is ranked by both the user and the system. This is clearly impractical, but our alternative formulation leaves us with a problem. It is possible for the system to learn a perfect model of the user without it ever recommending any pages the user wants to see (e.g., it could always correctly predict that the user's rating would be Terrible). Thus we need some way to measure the absolute performance of the system as well, to see how well the collection component is doing.

In order to do this we measure how well the system-recommended personal pages perform in comparison to pages from three other sources: random pages, cool pages and public pages. Random and cool pages are found by random and cool agents respectively; public pages are those pages from the repository which best match the amalgamated profile.

Figure 3: Topics chosen by users: Computer graphics/Game programming; Library cataloging and classification; Post-industrial music; Sports information and gaming; Native American culture; Cookery; 60's music; Hiking; Evolution.

There has been criticism of collaborative recommendation services (Mitchell 1995) that they have never been shown to do better than the "majority vote" rule, i.e. they would do as well just recommending the most popular items. This is not directly applicable in our case: as there is a constant stream of new items, no one item will necessarily get rated by many users. However, the comparison between personal and public pages gives us a measure of how much difference the personalization makes.
The methodology of the earlier experiment is easily adapted for this new comparison: we show users an equal number of pages from each source, randomly permuted. For each of the sources there is an ideal ranking consisting of two equivalence classes, whereby all of its pages are ranked above all other pages. Thus we can measure for each source the distance between this ideal ranking and the actual ranking supplied by the user (using the ndpm measure as before).

4.4 Declared Topics

Our most scarce resource is not computer speed or storage, but the attention and quantity of users willing to test the system. Thus we decided to try to combine the two experiments described into one, and furthermore to try to build up a dataset which could be used for future off-line experiments.

The results which follow describe a combined experiment. Each user was asked to declare a topic of interest in advance (Figure 3), and to rate pages according to that topic. This was to allow easier interpretation of the profiles learned, and to help us re-use the data for future experiments. Every five rounds of evaluation users were shown a selection of 16 pages, four each from the four sources described. The relative rankings of the sources and the difference between user and system rankings were recorded. We can think of these two measurements as recording the performance of the collection and selection agents, respectively.
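The paper omits the ndpm definition for space, so the sketch below follows the commonly cited form of Yao's (1995) measure and should be read as our reconstruction rather than Fab's exact code: over all document pairs the user strictly orders, contradictions by the system count twice, system ties count once, and the sum is normalized by twice the number of such pairs.

from itertools import combinations

def ndpm(user_rank, system_rank):
    """Distance between user and system rankings, normalized to [0, 1].
    Both arguments map document -> score, where higher means better."""
    contradicted = tied = total = 0
    for a, b in combinations(list(user_rank), 2):
        du = user_rank[a] - user_rank[b]       # user's preference direction
        ds = system_rank[a] - system_rank[b]   # system's preference direction
        if du == 0:
            continue                           # user indifferent: pair not counted
        total += 1
        if ds == 0:
            tied += 1                          # system ties a pair the user orders
        elif (du > 0) != (ds > 0):
            contradicted += 1                  # system reverses the user's order
    return (2 * contradicted + tied) / (2 * total) if total else 0.0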
5 Results

For this series of experiments we ran nine best-first search collection agents for eight hours per day each, during which time they would inspect between 1,000 and 5,000 web pages (depending on computer and network load). We also ran two of each type of index agent (once every 8 hours), two random agents (also once every 8 hours) and one cool agent (once a day). As well as the users participating in the experiment we supported around 30 parasitic users, whose feedback did not affect the collection agents. The results shown are from 25 evaluations, corresponding roughly to one month of usage. Nine users participated in the experiment, although only seven made it to the 15th evaluation, only five to the 20th and only two to the 25th. All the profiles were started empty.

Figure 4: Average ndpm distance between user and system rankings, over all users at evaluation points (x-axis: number of evaluations, approx. 1 per day; y-axis: ndpm, 0 to 1).

5.1 Selection Performance

Figure 4 shows the average ndpm at each evaluation point. The gradual downward progression indicates that the selection agents are successfully learning increasingly accurate profiles of the users over time, as the distance between the predicted and actual rankings decreases. Indeed by the 25th evaluation it is down to almost nothing.

The difficulty of the topics chosen by users varies enormously. A very rough idea of this can be gained by doing a simple keyword search of the Infoseek web index: 20,558 documents match "cooking" but only 158 match "library cataloging classification". One of the advantages of the vector space representation is that we can inspect the learned profiles: 90% of the top 400 terms in the profile learned for cooking are obviously cooking related, compared to only 2% for library cataloging and classification.

5.2 Collection Performance

Figure 5 shows how pages from the different sources are performing relative to one another. As expected the personal pages improve as adaptation occurs, and do significantly better than pages from the other sources. Given the small number of predeclared topics it is not surprising that the cool pages are not much better than the random pages. The public pages will provide a more interesting comparison when we allow additional users and arbitrary topics, as there will be a greater overlap of interests.

Figure 5: For each source, average ndpm distance between user rankings and its ideal ranking, over all users at evaluation points (x-axis: number of evaluations, approx. 1 per day; y-axis: ndpm, 0 to 1).

There was considerable variation in collection agent performance for individual users. For instance the user interested in cooking received between six and ten pertinent documents per day after only four days, whereas the user interested in library cataloging, despite receiving many documents about libraries, never received one on the exact topic.

A comparison of the success (median feedback score) and popularity (average percentage of users' pages supplied) scores for search and index agents yields no discernible trends as yet, with both classes of collection agent performing similarly. The index agents use negligible resources in comparison to the search agents, but we expect over a longer period that the finer adaptability of the search agents will be a dominant factor.
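As one concrete reading of these two scores, the sketch below computes them for a single collection agent. The exact bookkeeping is not spelled out in the paper, so the function name, data shapes and values here are assumptions made for illustration.

    from statistics import median

    def agent_scores(feedback_scores, share_per_user):
        # feedback_scores: all user feedback values given to pages this
        # agent supplied; the "success" score is their median.
        # share_per_user: for each user, the percentage of that user's
        # recommended pages supplied by this agent; the "popularity"
        # score is the average of these percentages.
        success = median(feedback_scores)
        popularity = sum(share_per_user.values()) / len(share_per_user)
        return success, popularity

    success, popularity = agent_scores(
        feedback_scores=[4, 5, 3, 6, 5],
        share_per_user={"u1": 40.0, "u2": 10.0, "u3": 0.0})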
5.3 Specialization

In some cases it is very easy to see the specialization that has gone on among the collection agents: for instance an inspection of all the profiles reveals that one agent is a "cooking expert", with 77% of the top 400 terms obviously cooking related, one other has a tangential interest and the rest have no apparent interest. Indeed the cooking user customarily receives between 50% and 90% of their recommendations from the "cooking expert" agent.
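Inspections of this kind reduce to sorting a profile by term weight and hand-checking the result for topic-relatedness. A minimal sketch, with a made-up profile, is below.

    def top_terms(profile, n=400):
        # profile: {term: weight} vector-space profile of a collection
        # agent. Returns the n highest-weighted terms, which can then
        # be checked by hand for topic-relatedness (e.g. 77%
        # cooking-related for the "cooking expert" agent).
        return sorted(profile, key=profile.get, reverse=True)[:n]

    profile = {"recipe": 0.9, "oven": 0.7, "library": 0.1}  # hypothetical
    print(top_terms(profile, n=2))  # ['recipe', 'oven']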
On the other hand "music" is a common theme for two of the topics, and this is reflected in the less strict specialization of the agents. Three agents have an approximately equal number of music-related terms in their profiles, and the two users interested in music-related topics typically receive their music-related pages from a mix of these agents.

5.4 Serendipity

An example of the serendipitous recommendations resulting from the interplay between selection and collection agents comes from a collection agent specializing in pages about India (mistakenly, as the intended topic was Native American cultures). Apart from attempting to serve the user interested in Native American cultures, at various times this agent provided pages on biodiversity in India to a user interested in evolution, and recipes for Indian food to the user interested in cooking.

A problem we encountered during this experiment was users being extremely strict, giving a high score only to pages exactly on their topic and a very low score otherwise. This leads to a lack of positive examples to learn from, and subsequently to fairly random behavior. The system works best when users "guide" it, by starting with a broad topic and slowly narrowing it down. We are investigating ways to inform users about effective ranking strategies.

6 Conclusions

Fab has been running "live" since March 1996, and provides a robust test-bed for comparing adaptation techniques and investigating agent interactions. Our design shows how, in a specialized domain, it is possible for multiple agents to coordinate their activities without explicit communication. As in a collaborative system, a user's recommendations are influenced by similar users of the system. However we retain the advantages of content-based systems: a user's profile exists as a stand-alone representation, and can be used to recommend items unseen by other users.

The results show that the two components of our system are both performing well. The selection agents are learning increasingly accurate profiles of their users, and the adaptive collection agents are collecting increasingly pertinent web pages, consistently rated as better than pages from the competing sources. The two hypothesized behaviors of specialization and serendipity have both been observed: after even a short time agents are converging on topics, and where common interests exist between users the agent profiles are overlapping. Our hope is to use this type of convergence to serve a greater number of users from a fixed pool of agents; our next experiment will investigate the effects of scaling up the number of users.

References

Allan, J. 1995. Relevance feedback with too much data. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.

Armstrong, R.; Freitag, D.; Joachims, T.; and Mitchell, T. 1995. WebWatcher: A learning apprentice for the World-Wide Web. In Proceedings of the AAAI Spring Symposium on Information Gathering from Heterogenous, Distributed Resources.

Balabanovic, M., and Shoham, Y. 1995. Learning information retrieval agents: Experiments with automated web browsing. In Proceedings of the AAAI Spring Symposium on Information Gathering from Heterogenous, Distributed Resources.

Buckley, C.; Salton, G.; Allan, J.; and Singhal, A. 1995. Automatic query expansion using SMART: TREC-3. In Proceedings of the 3rd Text REtrieval Conference.

Cox, III, E. P. 1980. The optimal number of response alternatives for a scale: A review. Journal of Marketing Research 17:407-422.

Deerwester, S.; Dumais, S. T.; Furnas, G. W.; Landauer, T. K.; and Harshman, R.
1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science 41(6):391-407.

Foltz, P. W., and Dumais, S. T. 1992. Personalized information delivery: An analysis of information filtering methods. Communications of the ACM 35(12):51-60.

Harman, D. 1995. Overview of the third Text REtrieval Conference (TREC-3). In Proceedings of the 3rd Text REtrieval Conference.

Lang, K. 1995. Newsweeder: Learning to filter netnews. In Proceedings of the 12th International Conference on Machine Learning.

Lesk, M. E., and Salton, G. 1971. Relevance assessments and retrieval system evaluation. In The Smart System: Experiments in Automatic Document Processing. Prentice Hall Inc. 506-527.

Lieberman, H. 1995. Letizia: An agent that assists web browsing. In International Joint Conference on Artificial Intelligence.

Mauldin, M. L., and Leavitt, J. R. 1994. Web-agent related research at the CMT. In Proceedings of the ACM Special Interest Group on Networked Information Discovery and Retrieval (SIGNIDR-94).

Mitchell, T. 1995. Personal communication.

Mittelstaedt, R. A. 1971. Semantic properties of selective evaluative adjectives: Other evidence. Journal of Marketing Research 8:236-237.

Pazzani, M.; Muramatsu, J.; and Billsus, D. 1996. Syskill & Webert: Identifying interesting web sites. In Proceedings of the 13th National Conference on Artificial Intelligence (to appear).
Porter, M. 1980. An algorithm for suffix stripping. Program 14(3):130-137.

Resnick, P.; Iacovou, N.; Suchak, M.; Bergstrom, P.; and Riedl, J. 1994. GroupLens: An open architecture for collaborative filtering of netnews. In Proceedings of the ACM Conference on Computer Supported Cooperative Work.

Rocchio, Jr., J. 1971. Relevance feedback in information retrieval. In The Smart System: Experiments in Automatic Document Processing. Prentice Hall Inc. 313-323.

Rorvig, M. E. 1988. Psychometric measurement and information retrieval. In Williams, M. E., ed., Annual Review of Information Science and Technology, volume 23. Elsevier Science Publishers B.V. 157-189.

Salton, G., and McGill, M. J. 1983. An Introduction to Modern Information Retrieval. McGraw-Hill.

Saracevic, T. 1975. Relevance: A review of and a framework for the thinking on the notion in information science. Journal of the American Society for Information Science 26(6):321-343.

Saracevic, T. 1995. Evaluation of evaluation in information retrieval. In 18th International ACM SIGIR Conference on Research and Development in Information Retrieval.

Schutze, H.; Hull, D. A.; and Pedersen, J. O. 1995. A comparison of classifiers and document representations for the routing problem. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.

Shardanand, U., and Maes, P. 1995. Social information filtering: Algorithms for automating "word of mouth". In Conference on Human Factors in Computing Systems (CHI '95).

Sheth, B., and Maes, P. 1993. Evolving agents for personalized information filtering. In Proceedings of the 9th IEEE Conference on Artificial Intelligence for Applications.

Yao, Y. Y. 1995. Measuring retrieval effectiveness based on user preference of documents. Journal of the American Society for Information Science 46(2):133-145.