Ontological User Profiling in Recommender Systems

STUART E. MIDDLETON, NIGEL R. SHADBOLT AND DAVID C. DE ROURE
Intelligence, Agents, Multimedia Group, University of Southampton
________________________________________________________________________

We explore a novel ontological approach to user profiling within recommender systems, working on the problem of recommending on-line academic research papers. Our two experimental systems, Quickstep and Foxtrot, create user profiles from unobtrusively monitored behaviour and relevance feedback, representing the profiles in terms of a research paper topic ontology. A novel profile visualization approach is taken to acquire profile feedback. Research papers are classified using ontological classes, and collaborative recommendation algorithms are used to recommend papers seen by similar people on their current topics of interest. Two small-scale experiments, with 24 subjects over 3 months, and a large-scale experiment, with 260 subjects over an academic year, are conducted to evaluate different aspects of our approach. Ontological inference is shown to improve user profiling, external ontological knowledge is used to successfully bootstrap a recommender system, and profile visualization is employed to improve profiling accuracy. The overall performance of our ontological recommender systems is also presented and favourably compared to other systems in the literature.

Categories and Subject Descriptors: I.2.6 [Artificial Intelligence]: Learning - Knowledge acquisition; I.2.11 [Artificial Intelligence]: Distributed Artificial Intelligence - Intelligent agents; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Information filtering, Relevance feedback
General Terms: Algorithms, Measurement, Design, Experimentation
Additional Key Words and Phrases: Agent, Machine learning, Ontology, Personalization, Recommender systems, User profiling, User modelling
________________________________________________________________________

This research was supported by EPSRC studentship award number 99308831 and the Interdisciplinary Research Collaboration In Advanced Knowledge Technologies (AKT) project GR/N15764/01.
Authors' addresses: Intelligence, Agents, Multimedia Group, Department of Electronics and Computer Science, University of Southampton, Southampton, SO17 1BJ, UK
Authors’ email: {sem99r,nrs,dder}@ecs.soton.ac.uk.
Submitted 3/10/02, Revision 6/4/03, Final revision 29/9/03
Permission to make digital/hard copy of part of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.
© 2001 ACM 1073-0516/01/0300-0034 $5.00

1. INTRODUCTION

The mass of content available on the World-Wide Web raises important questions over its effective use. The web is largely unstructured, with pages authored by many people on a diverse range of topics, making simple browsing too time-consuming to be practical. Web page filtering has thus become necessary for most web users.

Search engines are effective at filtering pages that match explicit queries. Unfortunately, people find articulating what they want explicitly difficult, especially if forced to use a limited vocabulary such as keywords. As such, search queries are often
poorly formulated, and result in large lists of search results that contain only a handful of useful pages.

The semantic web offers the potential for help, allowing more intelligent search queries based on web pages marked up with semantic metadata. Semantic web technology is, however, very dependent on the degree to which authors annotate their web pages, and automatic web page annotation is still in its infancy. Annotation requires selflessness in authors because the annotations provided will only help other people searching their web pages. Because of this, the vast majority of web pages are not annotated, and in the foreseeable future will remain so. The semantic web can thus only be of limited benefit to the problem of effective searching.

Recommender systems go some way to addressing these issues. We present a novel ontological approach to user profiling within recommender systems. Two recommender systems are built, called Quickstep and Foxtrot, and three experiments are conducted to evaluate different aspects of their performance. Quickstep uses ontological inference to improve profiling accuracy and integrates an external ontology for profile bootstrapping. Foxtrot enhances the Quickstep system by employing the novel idea of visualizing user profiles to acquire direct profile feedback.

This section discusses our chosen problem domain and our general approach to ontological recommendation, along with related work. In section 2 we describe the Quickstep recommender system and an experiment to show how inference can improve user profiling and hence recommendation accuracy. Section 3 details an integration between the Quickstep recommender system and an external ontology, along with an experiment to demonstrate its effectiveness at bootstrapping profiles. In section 4 the Foxtrot recommender system is described, with an experiment to demonstrate how profile visualization can be used to acquire feedback and hence improve profile accuracy. Lastly, in section 5 we bring this work together, collating the evidence found to support ontological approaches to user profiling within recommender systems, and discuss future work.

1.1 Recommender systems

People find articulating what they want hard, but they are very good at recognizing it when they see it. This insight has led to the utilization of relevance feedback, where people rate web pages as ‘interesting’ or ‘not interesting’ and the system tries to find pages that match the ‘interesting’, positive examples and do not match the ‘not interesting’, negative examples. With sufficient positive and negative examples, modern machine learning techniques can classify new pages with impressive accuracy; in some
cases text classification accuracy exceeding human capability has been demonstrated [Larkey 1998].

Obtaining sufficient examples is difficult, however, especially when trying to obtain negative examples. The problem with asking people for examples is that the cost, in terms of time and effort, of providing the examples generally outweighs the reward people will eventually receive. Negative examples are particularly unrewarding, since there could be many irrelevant items to any typical query.

Unobtrusive monitoring provides positive examples of what the user is looking for, without interfering with the user’s normal work activity. Heuristics can also be applied to infer negative examples from observed behaviour, although generally with less confidence. This idea has led to content-based recommender systems, which unobtrusively watch user behaviour and recommend new items that correlate with a user’s profile.

Another way to recommend items is based on the ratings provided by other people who have liked the item before. Collaborative recommender systems do this by asking people to rate items explicitly and then recommend new items that similar users have rated highly. An issue with collaborative filtering is that there is no direct reward for providing examples since they only help other people. This leads to initial difficulties in obtaining a sufficient number of ratings for the system to be useful, a problem known as the cold-start problem [Maltz and Ehrlich 1995].

Hybrid systems, attempting to combine the advantages of content-based and collaborative recommender systems, have proved popular to date. The feedback required for content-based recommendation is shared, allowing collaborative recommendation as well.

1.2 User profiling

User profiling is typically either knowledge-based or behaviour-based. Knowledge-based approaches engineer static models of users and dynamically match users to the closest model. Questionnaires and interviews are often employed to obtain this user knowledge. Behaviour-based approaches use the user’s behaviour as a model, commonly using machine-learning techniques to discover useful patterns in the behaviour. Behavioural logging is employed to obtain the data from which patterns are extracted. [Kobsa 1993] provides a good survey of user modelling techniques.

The user profiling approach used by most recommender systems is behaviour-based, commonly using a binary class model to represent what users find interesting and uninteresting. Machine-learning techniques are then used to find potential items of
interest with respect to the binary model. Many effective machine learning algorithms are based on two classes. A binary profile does not, however, lend itself to sharing examples of interest or integrating any domain knowledge that might be available. [Sebastiani 2002] provides a good survey of current machine learning techniques.

1.3 Ontologies

An ontology is a conceptualisation of a domain into a human-understandable, but machine-readable format consisting of entities, attributes, relationships, and axioms [Guarino and Giaretta 1995]. Ontologies can provide a rich conceptualisation of the working domain of an organisation, representing the main concepts and relationships of the work activities. These relationships could represent isolated information such as an employee’s home phone number, or they could represent an activity such as authoring a document, or attending a conference.

We use the term ontology to refer to the classification structure and instances within a knowledge base.

1.4 Problem domain

The web is increasingly becoming the primary source of research papers for the modern researcher. With millions of research papers available over the web from thousands of web sites, finding the right papers and being informed of newly available papers is a problematic task. Browsing this many web sites is too time-consuming, and search queries are only fully effective if an explicit search query can be formulated for what you need. All too often papers are missed.

We address the problem of recommending on-line research papers to the academic staff and students at the University of Southampton. Academics need to search for explicit research papers and be kept up-to-date on their own research areas when new papers are published. We examine an ontological recommender system approach to support these two activities. Unobtrusive monitoring methods are preferred because researchers have their normal work to perform and would not welcome interruptions from a new system. Very high accuracy on recommendations is not required since users will have the option to simply ignore poor recommendations.

Real world knowledge acquisition systems are both tricky and complex to evaluate [Shadbolt et al. 1999]. Many evaluations are performed with user log data, simulating real user activity, or with standard benchmark collections, such as newspaper articles over a period of one year, that provide a basis for comparison with other systems. Although these evaluations are useful, especially for technique comparison, it is important to back them up with real world studies so we can see how the benchmark tests generalize to the
real world setting. Similar problems are seen in the agent domain where, as Nwana [1996] argues, it has yet to be conclusively demonstrated that people really benefit from agent-based information systems. This is why a real problem has been chosen upon which to evaluate our work.

1.5 Related work

GroupLens [Konstan et al. 1997] is an example of a collaborative filter, recommending newsgroup articles based on a Pearson-r correlation of other users’ ratings. Fab [Balabanović and Shoham 1997] is a content-based recommender, recommending web pages based on a nearest-neighbour algorithm working with each individual user’s set of positive examples. The Quickstep and Foxtrot systems are hybrid recommender systems, combining both these types of approach.

Personal web-based agents such as NewsDude and Daily Learner [Billsus and Pazzani 2000], Personal WebWatcher [Mladenić 1996] and NewsWeeder [Lang 1995] build profiles from observed user behaviour. These systems filter news stories/web pages and recommend unseen ones based on content, using k-Nearest Neighbour, naïve Bayes and TF-IDF machine learning techniques. Individual sets of positive and negative examples are maintained for each user’s profile. In contrast, by using an ontology to represent user profiles we pool these limited training examples, sharing between users examples of each class.

Ontologies are used to improve content-based search, as seen in OntoSeek [Guarino et al. 1999]. Users of OntoSeek navigate the ontology in order to formulate queries. Ontologies are also used to automatically construct knowledge bases from web pages, such as in Web-KB [Craven et al. 1998]. Web-KB takes manually labelled examples of domain concepts and applies machine-learning techniques to classify new web pages. Neither system, however, captures dynamic information such as user interests.

Digital libraries, such as CiteSeer [Bollacker et al. 1998], classify and store research papers. Typically such libraries are manually created and manually categorized. While our systems are digital libraries, the content is dynamically and autonomously updated from the browsing behaviour of its users.

[Mladenić and Stefan 1999] provides a good survey of text-learning and agent systems, including content-based and collaborative approaches. The systems most related to Quickstep and Foxtrot are Entrée [Burke 2000], which uses a knowledge base and case-based reasoning to recommend restaurant data, and RAAP [Delgado et al. 1998], which uses simple categories to represent user profiles with unshared individual training sets for each user. None of these systems use an ontology to explicitly represent user profiles.
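As a rough illustration of the Pearson-r correlation mentioned earlier in this section for collaborative filtering, the following minimal Python sketch computes the correlation between two users' ratings over the items both have rated. The function name, the dictionary layout and the sample ratings are assumptions made for illustration only; this is not the GroupLens implementation.

```python
from math import sqrt

def pearson_r(ratings_a, ratings_b):
    """Pearson correlation over the items both users have rated.

    ratings_a, ratings_b: dicts mapping item id -> numeric rating.
    Returns 0.0 when fewer than two items are shared, or when either
    user's shared ratings have zero variance (correlation undefined).
    """
    shared = sorted(set(ratings_a) & set(ratings_b))
    if len(shared) < 2:
        return 0.0
    xs = [ratings_a[i] for i in shared]
    ys = [ratings_b[i] for i in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

# Hypothetical ratings: bob rates like alice, carol rates the opposite way.
alice = {"p1": 5, "p2": 3, "p3": 1}
bob = {"p1": 4, "p2": 2, "p3": 0, "p4": 5}
carol = {"p1": 1, "p2": 3, "p3": 5}
```

A collaborative filter of this kind would typically weight the ratings of other users by such a correlation, so that users with similar tastes contribute most to a prediction.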
Notably, very few systems in the recommender system literature perform trials with real users. To test classifier accuracy, most use either labelled benchmark document collections, such as the Reuters news feed collection, or logged user data, such as Usenet logs.

1.6 Overview of approach

Our ontological approach to recommender systems uses a hybrid recommender system, employing both collaborative and content-based recommendation techniques and representing user profiles in ontological terms. Two experimental systems have been built that follow this approach, called Quickstep and Foxtrot. Quickstep is a recommender system for a set of researchers within a computer science laboratory, while Foxtrot is a searchable database and recommender system for a computer science department. Figure 1 shows the generic structure of our ontological recommender systems.

[Figure 1 components: World-Wide Web, Web Proxy, Web Browser, Profiler, Classifier, Recommender, Research Paper Database, Ontology, Search, Recommendation Page, Email, Visualized Profile]

Fig. 1. Our ontological approach to recommender systems

A web proxy is used to unobtrusively monitor each user’s web browsing, adding new research papers to the central database as users discover them. The research paper database thus acts as a pool of shared knowledge, available to all users via search and recommendation. The database of research papers is classified using a research paper topic ontology and a set of training examples.

Recorded web browsing and relevance feedback elicited from users are used to compute daily profiles of users’ research interests. Interest profiles are represented in ontological terms, allowing other interests to be inferred that go beyond that just seen from directly observed behaviour.
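To make the idea of inferring interests from an ontological profile concrete, here is a minimal Python sketch. The text above does not specify the inference rule, so the rule used here (each topic passes a fixed share of its interest value up the is-a hierarchy to its superclasses) is an assumption, as are the topic names, the `share` parameter of 0.5 and all function names.

```python
# Hypothetical is-a hierarchy: child topic -> parent topic (None marks a root).
PARENT = {
    "Multi-agent systems": "Agents",
    "Recommender systems": "Agents",
    "Agents": "Artificial intelligence",
    "Artificial intelligence": None,
}

def infer_profile(observed, parent=PARENT, share=0.5):
    """Return a topic -> interest profile that includes inferred interests.

    observed: topic -> interest value computed from monitored behaviour.
    share:    assumed fraction of a topic's interest passed to its parent
              at each level of the hierarchy.
    """
    profile = dict(observed)
    for topic, value in observed.items():
        ancestor = parent.get(topic)
        passed = value * share
        while ancestor is not None:
            # Accumulate inferred interest on every superclass in turn.
            profile[ancestor] = profile.get(ancestor, 0.0) + passed
            ancestor = parent.get(ancestor)
            passed *= share
    return profile

def top_topics(profile, n=3):
    """Topics with the highest interest values, best first (ties by name)."""
    return sorted(profile, key=lambda t: (-profile[t], t))[:n]

# Interest observed only on two leaf topics.
observed = {"Multi-agent systems": 4.0, "Recommender systems": 2.0}
profile = infer_profile(observed)
```

In this example the user never browsed a paper classified directly under "Agents", yet that topic receives an inferred interest because both observed topics are its subclasses. A flat, unstructured list of topics could not support this step.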
The interest profiles are visualized to allow elicitation
of direct profile feedback, providing an additional source of information from which profiles can be computed.

Recommendations are compiled daily using collaborative filtering techniques to find sets of interesting papers. These papers are then constrained to match the top topics of interest within the content-based profiles. The papers left are used to create the recommendations.

Users can view their recommendations via a web page or weekly email message, look at and comment on visualizations of their profile via a web page or just search the research paper database for specific papers of interest. Quickstep, the earlier system, supports only web page recommendation, while Foxtrot supports all the interface features.

1.7 Empirical evaluation

This paper describes three experiments performed using our two recommender systems.

The first uses the Quickstep system to measure the effectiveness of using ontological inference in user profiling. Two 1.5-month trials were run using 24 members from the IAM research laboratory, comparing the use of ontological profiles and inference to that of unstructured profiles.

The second experiment integrated the Quickstep system with an external personnel and publication ontology. This experiment measured how effectively an external ontology can bootstrap a recommender system to reduce the recommender system cold-start problem. Behaviour logs from the previous experiment were used as the basis for this evaluation.

The third experiment took the Foxtrot recommender system and measured its overall effectiveness and the performance increase obtained when profiles are visualized and profile feedback acquired. A trial was run using 260 staff and students from the computer science department of the University of Southampton for an academic year, comparing the performance of those subjects who provided profile feedback to those who did not.

2. ONTOLOGICAL USER PROFILING AND PROFILE INFERENCE

Our ontological approach to recommender systems, shown in figure 2, involves various sub-processes. Our first experimental recommender system, called Quickstep [Middleton et al. 2001], implements all these processes but with just a web page interface. Quickstep is thus just a recommender system, without any search, email or visualization facilities. It was built to help researchers in a computer science laboratory setting, representing user profiling with a research topic ontology and using ontological inference to assist the profiling process. An experiment was run to compare the recommendation performance for subjects whose profiler used ontological inference with those whose profiler did not.
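To make the idea of ontological inference in profiling more concrete, the following minimal sketch propagates interest observed in a specific topic up to its more general super-classes. It is an illustration only, not the Quickstep implementation: the child-to-parent map `IS_A` is a hypothetical fragment, and the function name `infer_interests` and the propagation fraction `share` of 0.5 are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical is-a fragment (child -> parent); not the actual Quickstep ontology.
IS_A = {
    "Recommender Systems": "Agents",
    "Interface Agents": "Agents",
    "Agents": "Computer Science",
}

def infer_interests(observed, is_a=IS_A, share=0.5):
    """Add a fraction of each topic's observed interest to every ancestor.

    `share` (assumed to be 0.5 here) is applied once per is-a step, so a
    grandparent receives share**2 of the originally observed value.
    """
    profile = defaultdict(float)
    for topic, value in observed.items():
        profile[topic] += value
        parent, contribution = is_a.get(topic), value * share
        while parent is not None:
            profile[parent] += contribution
            parent, contribution = is_a.get(parent), contribution * share
    return dict(profile)

# Interest observed only in two specific topics.
profile = infer_interests({"Recommender Systems": 4.0, "Interface Agents": 2.0})
```

In this sketch a user who only ever browses papers on the two specific topics still ends up with a non-zero interest value for the general topic above them, which is the kind of general interest a profiler without inference would not record.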
2.1 Overview of the Quickstep recommender system

Quickstep unobtrusively monitors user browsing behaviour via a web proxy, logging each URL browsed during normal work activity. A machine-learning algorithm classifies browsed URLs overnight, using classes within a research paper topic ontology, and saves each classified paper in a central paper store. Explicit relevance feedback and browsed topics form the basis of the interest profile for each user. Is-a relationships within the research paper topic ontology are also exploited to infer general interests when specific topics are observed.

Each day a set of recommendations is computed, based on correlations between user interest profiles and classified paper topics. These recommendations are accessible to users via a web page. Any feedback offered on these recommendations is recorded when the user looks at them. Users can provide new examples of topics and correct paper classifications where wrong. In this way the training set improves over time as well as the profiles.

[Figure 2: system diagram with components World-Wide Web, Web Browser, Web Proxy, Profiler, Classifier, Recommender, Research Paper Database, Ontology and Recommendation Page]

Fig. 2. The Quickstep system

2.2 Approach of the Quickstep recommender system

The Quickstep system uses a Java-based web proxy, which records time-stamped URLs for each user. This proxy could handle about 30 users. The system ran on a Solaris platform and was mostly written in Java.

2.2.1 Ontology

The research paper topic ontology is based on the computer science classifications made by the dmoz open directory project [dmoz] and some minor customisations. We chose to re-use an existing taxonomy to speed development time and provide a potential
route for system integration with other external ontologies in the future. Our simple ontology holds is-a relationships between research paper topics, and has 27 classes; for the second trial this ontology was extended to 32 classes. Figure 3 shows a section from the ontology. Pre-trial interviews determined which additional topics were added to the ontology to customize it for the target researchers. A review by two domain experts validated the ontology for correctness before use in our experiment.

[Figure 3: is-a hierarchy of research paper topics. Topics shown: Artificial Intelligence, Hypermedia, E-Commerce, Interface Agents, Mobile Agents, Multi-Agent-Systems, Recommender Systems, Agents, Belief Networks, Fuzzy, Game Theory, Genetic Algorithms, Genetic Programming, Knowledge Representation, Information Filtering, Information Retrieval, Machine Learning, Natural Language, Neural Networks, Philosophy [AI], Robotics [AI], Speech [AI], Vision [AI], Text Classification, Ontologies, Adaptive Hypermedia, Hypertext Design, Industrial Hypermedia, Literature [hypermedia], Open Hypermedia, Spatial Hypertext, Taxonomic Hypertext, Visualization [hypertext], Web [hypermedia], Content-Based Navigation, Architecture [open hypermedia]]

Fig. 3. Section from the Quickstep research paper topic ontology

2.2.2 Research paper representation

Research papers are represented using term vectors. We use ‘term’ to mean a single word within the text of a paper; thus each word that appears in the training set of example papers adds one dimension to our term vectors. Term vector weights are computed from the term frequency (TF) divided by the total number of terms, representing the normalized frequency with which a word appears within a research paper. Since many words are either too common or too rare to have useful discriminating power to a classifier, we use a few dimensionality reduction techniques to reduce the number of dimensions of the term vectors.
Porter stemming [Porter 1980] is used to remove term suffixes and the SMART [SMART Staff 1974] stop list is used to remove very common words like “the” and “or”. Terms with a frequency below 2 are removed since they have little discriminating power. Dimensionality reduction is common in information systems; [Sebastiani 2002] provides a good discussion of the issues.
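The representation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used in Quickstep: `STOP_WORDS` is a tiny stand-in for the SMART stop list, `crude_stem` is a toy suffix stripper standing in for Porter stemming, the frequency cut is applied per document, and weights are divided by the number of tokens left after stop-word removal. All of these names and choices are hypothetical.

```python
import re
from collections import Counter

# Tiny stand-in for the SMART stop list (assumption: the real list is far longer).
STOP_WORDS = {"the", "or", "and", "a", "of", "to", "in", "is", "are"}

def crude_stem(word):
    """Toy suffix stripper standing in for Porter stemming; not the real algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def term_vector(text, min_count=2):
    """Return {term: count / total_terms} for terms seen at least min_count times."""
    # Lower-case, split into words, drop stop words, then stem what remains.
    tokens = [crude_stem(w) for w in re.findall(r"[a-z]+", text.lower())
              if w not in STOP_WORDS]
    counts = Counter(tokens)
    total = len(tokens)
    # Terms occurring fewer than min_count times are dropped from the vector.
    return {t: c / total for t, c in counts.items() if c >= min_count}

vec = term_vector("Agents recommend papers. The agent ranks papers or recommends papers.")
```

Here `vec` keeps only the stems that occur at least twice, each weighted by its share of the retained tokens; a real system would substitute a full Porter stemmer and the complete stop list.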
Most on-line research papers are in HTML, PS or PDF formats, with many papers being compressed. We support all these formats for maximum coverage in our problem domain, converting the papers to plain text and using this text to create the term vectors. Unusual or corrupt formats are ignored. Several heuristics are used to determine if the research papers are converted to text correctly and look like a typical research paper with terms such as ‘abstract’ and ‘references’. In the later experiments, term vectors for papers had around 15,000 dimensions after dimensionality reduction.

2.2.3 Classifier

Research papers in the central database are classified by an IBk [Aha et al. 1991] classifier, which is boosted by the AdaBoostM1 [Freund and Schapire 1996] algorithm. The IBk classifier is a k-Nearest Neighbour type classifier that uses example documents, called a training set, added to a term-vector space. Example documents in the training set are manually labelled using the class names within the research paper topic ontology. Figure 4 shows the basic k-Nearest Neighbour algorithm. The closeness of an unclassified vector to its neighbour vectors within the term-vector space determines its classification.

\[ w(d_a, d_b) = \sqrt{\sum_{j=1}^{T} (t_{ja} - t_{jb})^2} \]

$w(d_a, d_b)$: kNN distance between documents a and b
$d_a, d_b$: document vectors
$T$: number of terms in document set
$t_{ja}$: weight of term j in document a

Fig. 4. k-Nearest Neighbour algorithm

Classifiers like k-Nearest Neighbour allow more training examples to be added to their term-vector space without the need to re-build the entire classifier. They also degrade well, so even when incorrect the class returned is normally in the right “neighbourhood” and so at least partially relevant. This makes k-Nearest Neighbour a robust choice of algorithm for research paper classification.
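As a rough illustration of the distance measure in figure 4, the sketch below computes the Euclidean distance between two sparse term vectors and labels a new document by a vote among its nearest training examples. It is a plain k-Nearest Neighbour sketch, not the IBk implementation: the unweighted majority vote, the value k = 3, the function names and the example vectors are all assumptions made for the example.

```python
import math
from collections import Counter

def distance(da, db):
    """Euclidean distance between two sparse term vectors (dicts term -> weight)."""
    terms = set(da) | set(db)
    return math.sqrt(sum((da.get(t, 0.0) - db.get(t, 0.0)) ** 2 for t in terms))

def knn_classify(doc, training_set, k=3):
    """Label `doc` by majority vote among its k nearest training examples."""
    nearest = sorted(training_set, key=lambda ex: distance(doc, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labelled examples: (term vector, ontology class name).
training_set = [
    ({"agent": 0.5, "mobile": 0.5}, "Mobile Agents"),
    ({"agent": 0.6, "mobile": 0.4}, "Mobile Agents"),
    ({"hypertext": 0.7, "link": 0.3}, "Hypertext Design"),
    ({"hypertext": 0.5, "link": 0.5}, "Hypertext Design"),
]

label = knn_classify({"agent": 0.4, "mobile": 0.3, "link": 0.3}, training_set)
```

Adding a new labelled example only requires appending another pair to `training_set`, which mirrors the property noted above that training examples can be added without rebuilding the classifier.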
Boosting works by repeatedly running a weak learning algorithm on various distributions of the training set, and then combining the specialist classifiers produced by the weak learner into a single composite classifier. The “weak” learning algorithm here is the IBk classifier. Figure 5 shows the AdaBoostM1 algorithm.