Exploiting Synergy between Ontologies and Recommender Systems Stuart E. Middleton Harith Alani. David C. De roure Intelligence, Agents and Multimedia Group Department of Electronics and Computer Science University of Southampton Southampton, So17 1BJ, UK (sem99r, ha, dder)@ecssoton ac uk ABSTRACT Recommender systems [23 learn about user preferences over time Recommender systems learn about user preferences over time and automatically find things of similar interest, thus reducing the automatically finding things of similar interest. This reduces the burden of creating explicit queries. They dynamically track users burden of creating explicit queries. Recommender systems do, as their interests change. However, such systems require an initial however, suffer from cold-start problems where no initial learning phase where behaviour information is built up to form an formation is available early on upon which to base user profile. During this initial learning phase performance is often poor due to the lack of user information; this is known as the cold-start problem [17] Semantic knowledge stru can provide valuable domain knowl Howeve There has been increasing interest in developing and using tools h kn not a trivial for creating annotated content and making it available over the task and user interests acquire and semantic web Ontologies are one such tool, used to maintain and provide access to specific knowledge repositories. Such source could complement the behavioral information held within This paper investigates the synergy between a web-based research recommender systems, by providing some initial knowledge about aper recommender system and an ontology containing users and their domains of interest. It should thus be possible to information automatically extracted from departmental databases available on the web. The ontology is used to address the bootstrap the initial learning phase of a recommender system with such knowledge, easing the cold-start problem. system addresses the ontology's interest-acquisition problem. An In return for any bootstrap information the recommender system performance of the integrated systems measured This would reduce the effort involved in acquiring and maintaining knowledge of people's research interests. To this end General terms we investigate the integration of Quickstep, a web-based Design, Experimentation recommender system, an ontology for the academic domain and OntoCoPl, a community of practice identifier that can pick out Keywords Cold-start problem, interest-acquisition problem, ontology, 2. RECOMMENDER SYSTEMS recommender system. People may find articulating what they want hard, but they are good at recognizing it when they see it. This insight has led to the 1. INTRODUCTION utilization of relevance feedback [24], where people rate web The mass of content available on the World-Wide Web raises pages as interesting or not interesting and the system tries to find important questions over its effective use. Search engines filter pages that match the interesting, positive examples and do not web pages that match explicit queries, but most people find match the not interesting, negative examples. With sufficient articulating exactly what they want difficult. The result is large positive and negative examples, modern machine learning lists of search results that contain a handful of useful pages, techniques can classify new pages with impressive accuracy. Such defeating the purpose of filtering in the first place systems are called content-based recommender systems Another way to recommend pages is based on the ratings of other people who have seen the page before. Collaborative Permission to make d ersonal or classroom use is granted c. ork for recommender systems do this by asking people to rate explicitly not made or distributed for profit and that pages and then recommending new pages that similar users have rated highly. The problem with collaborative filtering is that there otherwise, to republish, to post on servers or to redistribute to lists, is no direct reward for providing examples since they only help ecific permission by the authors other people. This leads to initial difficulties in obtaining a Semantic Web Workshop 2002 Hawaii, USA ufficient number of ratings for the system to be useful. Copyright by the autho
Exploiting Synergy between Ontologies and Recommender Systems Stuart E. Middleton, Harith Alani, David C. De Roure Intelligence, Agents and Multimedia Group Department of Electronics and Computer Science University of Southampton Southampton, SO17 1BJ, UK {sem99r,ha,dder}@ecs.soton.ac.uk ABSTRACT Recommender systems learn about user preferences over time, automatically finding things of similar interest. This reduces the burden of creating explicit queries. Recommender systems do, however, suffer from cold-start problems where no initial information is available early on upon which to base recommendations. Semantic knowledge structures, such as ontologies, can provide valuable domain knowledge and user information. However, acquiring such knowledge and keeping it up to date is not a trivial task and user interests are particularly difficult to acquire and maintain. This paper investigates the synergy between a web-based research paper recommender system and an ontology containing information automatically extracted from departmental databases available on the web. The ontology is used to address the recommender systems cold-start problem. The recommender system addresses the ontology’s interest-acquisition problem. An empirical evaluation of this approach is conducted and the performance of the integrated systems measured. General Terms Design, Experimentation. Keywords Cold-start problem, interest-acquisition problem, ontology, recommender system. 1. INTRODUCTION The mass of content available on the World-Wide Web raises important questions over its effective use. Search engines filter web pages that match explicit queries, but most people find articulating exactly what they want difficult. The result is large lists of search results that contain a handful of useful pages, defeating the purpose of filtering in the first place. Recommender systems [23] learn about user preferences over time and automatically find things of similar interest, thus reducing the burden of creating explicit queries. They dynamically track users as their interests change. However, such systems require an initial learning phase where behaviour information is built up to form an user profile. During this initial learning phase performance is often poor due to the lack of user information; this is known as the cold-start problem [17]. There has been increasing interest in developing and using tools for creating annotated content and making it available over the semantic web. Ontologies are one such tool, used to maintain and provide access to specific knowledge repositories. Such sources could complement the behavioral information held within recommender systems, by providing some initial knowledge about users and their domains of interest. It should thus be possible to bootstrap the initial learning phase of a recommender system with such knowledge, easing the cold-start problem. In return for any bootstrap information the recommender system could provide details of dynamic user interests to the ontology. This would reduce the effort involved in acquiring and maintaining knowledge of people’s research interests. To this end we investigate the integration of Quickstep, a web-based recommender system, an ontology for the academic domain and OntoCoPI, a community of practice identifier that can pick out similar users. 2. RECOMMENDER SYSTEMS People may find articulating what they want hard, but they are good at recognizing it when they see it. This insight has led to the utilization of relevance feedback [24], where people rate web pages as interesting or not interesting and the system tries to find pages that match the interesting, positive examples and do not match the not interesting, negative examples. With sufficient positive and negative examples, modern machine learning techniques can classify new pages with impressive accuracy. Such systems are called content-based recommender systems. Another way to recommend pages is based on the ratings of other people who have seen the page before. Collaborative recommender systems do this by asking people to rate explicitly pages and then recommending new pages that similar users have rated highly. The problem with collaborative filtering is that there is no direct reward for providing examples since they only help other people. This leads to initial difficulties in obtaining a sufficient number of ratings for the system to be useful. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission by the authors. Semantic Web Workshop 2002 Hawaii, USA Copyright by the authors
Hybrid systems, attempting to combine the advantages of content- team (Advanced Knowledge Technologies (200). It models ased and collaborative recommender systems, have proved people, projects, papers, events and research interests. The pular to-date. The feedback required for content-based ontology itself is implemented in Protege 2000 [101, a graphical commendation is shared, allowing collaborative tool for developing knowledge-based systems. It is populated with ecommendation as well. We use the Quickstep [18 hybrid information extracted automatically from a departmental papers. is focused on people, projects, and publications 2.1 The cold-start problem One difficult problem commonly faced by recommender systems 3.1 The Interest-acquisition Problem the cold-start problem [17) where recommendations are Peoples areas of expertise and interests are an important type of quired for new items or users for whom little or no information knowledge for many applications, for example expert finders [ 9] has yet been acquired. Poor performance resulting from a cold Semantic web technology can be a good source of such start can deter user uptake of a recommender system. This effect is thus self-destructive, since the recommender never achieves good the web pages up-to-date. The majority of web pages receive little performance since users never use it for long enough. We will maintenance, holding information that does not date quickly examine two types of cold-start problem Since interests and areas of expertise are dynamic in nature they The new-system cold-start problem is where there are are not often held within web pages. It is thus particularly difficult ratings by users, and hence no profiles of users. In this for an ontology to acquire such information; this is the interest- most recommender systems have no basis Is sItuation acquisition problem. recommend, and hence perform very poorly Many existing systems force users to perform self-assessment to The new-user cold-start problem is where the system s exist, but gather such information, but this has numerous disadvantages [ 5] running for a while and a set of user profiles and ratings Lotus have developed a system that monitors user interaction with no information is available about a new user. Most recon a document to capture interests and expertise [16]. Their system systems perform poorly in this situation too does not, however, consider the online documents that users Collaborative recommender systems fail to help in cold-star ituations, as they cannot discover similar user behaviour because This paper investigates linking an ontology with a recommender there is not enough previously logged behaviour data upon whic system to help overcoming the interest acquisition problem. The to base any correlations. Content-based and hybrid recommender recommender system will regularly provide the ontology with systems perform a little better since they need just a few examp nterest profiles for users, obtained by monitoring user web of user interest in order to find similar items ysing feedback on recommended papers No recommender system can cope alone with a totally cold-start however, since even content-based recommenders require a small 4. Related work number of examples on which to base recommendations. We ropose to link together a recommender system and an ontology recommend items liked by similar people. PHOAKS [26 is an to address this problem. The ontology can provide a variety of example of a collaborative filtering, recommending web links information on users and their publications. Publications provide mentioned in newsgroups articles. Only newsgroups with at least mportant information about what interests a user has had in the 20 posted web links are considered by PHOAKs, avoiding the ast, so provide a basis upon which to create initial profiles that can address the new-system cold start problem. Personnel records cold-start problems associated with newer newsgroups containing allow similar users to be identified. This will address the new-user less messages. Group Lens [14 is an alternative example, recommending newsgroup articles. Group Lens reports two cold old-start problem by providing a set of similar users on which to start problems in their experimental analysis. Users abandoned the base a new-user profile. system before they had provided enough ratings to receive commendations and early adopters of the system received poor 3. ONTOLOGIES recommendations until enough ratings were gathered. These An ontology is a conceptualisation of a domain into a human- systems are typical of collaborative recommenders, where a cold- nderstandable, but machine-readable format consisting of start makes early recommendation poor until sufficient people ships, and axioms [12]. Ontologies can provide a rich conceptualisation of the working domain of an Content-based recommender systems recommend items with organisation, representing the main concepts and relationships of similar content to things the user has liked before. An example of information such as an employees home phone number, or they could represent an activity such as authoring a document,or pages. Fab needs a few early ratings from each user in order to recommender, recommending funding information from a In this paper we use the term ontology to refer to the classification database. elfI observes users using a database and infers both tructure and instances within the knowledge base positive and negative examples of interest from this behaviour The ontology used in our work is designed to represent the oth these systems are typical of content-based recommender academic domain, and was developed by Sou on's akt systems, requiring users to use the system for an initial period of time before the cold-start problem is overcome
Hybrid systems, attempting to combine the advantages of contentbased and collaborative recommender systems, have proved popular to-date. The feedback required for content-based recommendation is shared, allowing collaborative recommendation as well. We use the Quickstep [18] hybrid recommender system in this paper to recommend on-line research papers. 2.1 The Cold-start Problem One difficult problem commonly faced by recommender systems is the cold-start problem [17], where recommendations are required for new items or users for whom little or no information has yet been acquired. Poor performance resulting from a coldstart can deter user uptake of a recommender system. This effect is thus self-destructive, since the recommender never achieves good performance since users never use it for long enough. We will examine two types of cold-start problem. The new-system cold-start problem is where there are no initial ratings by users, and hence no profiles of users. In this situation most recommender systems have no basis on which to recommend, and hence perform very poorly. The new-user cold-start problem is where the system has been running for a while and a set of user profiles and ratings exist, but no information is available about a new user. Most recommender systems perform poorly in this situation too. Collaborative recommender systems fail to help in cold-start situations, as they cannot discover similar user behaviour because there is not enough previously logged behaviour data upon which to base any correlations. Content-based and hybrid recommender systems perform a little better since they need just a few examples of user interest in order to find similar items. No recommender system can cope alone with a totally cold-start however, since even content-based recommenders require a small number of examples on which to base recommendations. We propose to link together a recommender system and an ontology to address this problem. The ontology can provide a variety of information on users and their publications. Publications provide important information about what interests a user has had in the past, so provide a basis upon which to create initial profiles that can address the new-system cold start problem. Personnel records allow similar users to be identified. This will address the new-user cold-start problem by providing a set of similar users on which to base a new-user profile. 3. ONTOLOGIES An ontology is a conceptualisation of a domain into a humanunderstandable, but machine-readable format consisting of entities, attributes, relationships, and axioms [12]. Ontologies can provide a rich conceptualisation of the working domain of an organisation, representing the main concepts and relationships of the work activities. These relationships could represent isolated information such as an employee’s home phone number, or they could represent an activity such as authoring a document, or attending a conference. In this paper we use the term ontology to refer to the classification structure and instances within the knowledge base. The ontology used in our work is designed to represent the academic domain, and was developed by Southampton’s AKT team (Advanced Knowledge Technologies [20]). It models people, projects, papers, events and research interests. The ontology itself is implemented in Protégé 2000 [10], a graphical tool for developing knowledge-based systems. It is populated with information extracted automatically from a departmental personnel database and publication database. The ontology consists of around 80 classes, 40 slots, over 13000 instances and is focused on people, projects, and publications. 3.1 The Interest-acquisition Problem People’s areas of expertise and interests are an important type of knowledge for many applications, for example expert finders [9]. Semantic web technology can be a good source of such information, but usually requires substantial maintenance to keep the web pages up-to-date. The majority of web pages receive little maintenance, holding information that does not date quickly. Since interests and areas of expertise are dynamic in nature they are not often held within web pages. It is thus particularly difficult for an ontology to acquire such information; this is the interestacquisition problem. Many existing systems force users to perform self-assessment to gather such information, but this has numerous disadvantages [5]. Lotus have developed a system that monitors user interaction with a document to capture interests and expertise [16]. Their system does not, however, consider the online documents that users browse. This paper investigates linking an ontology with a recommender system to help overcoming the interest acquisition problem. The recommender system will regularly provide the ontology with interest profiles for users, obtained by monitoring user web browsing and analysing feedback on recommended research papers. 4. Related Work Collaborative recommender systems utilize user ratings to recommend items liked by similar people. PHOAKS [26] is an example of a collaborative filtering, recommending web links mentioned in newsgroups articles. Only newsgroups with at least 20 posted web links are considered by PHOAKS, avoiding the cold-start problems associated with newer newsgroups containing less messages. Group Lens [14] is an alternative example, recommending newsgroup articles. Group Lens reports two coldstart problems in their experimental analysis. Users abandoned the system before they had provided enough ratings to receive recommendations and early adopters of the system received poor recommendations until enough ratings were gathered. These systems are typical of collaborative recommenders, where a coldstart makes early recommendation poor until sufficient people have provided ratings. Content-based recommender systems recommend items with similar content to things the user has liked before. An example of a content-based recommender is Fab [4], which recommends web pages. Fab needs a few early ratings from each user in order to create a training set. ELFI [25] is another content-based recommender, recommending funding information from a database. ELFI observes users using a database and infers both positive and negative examples of interest from this behaviour. Both these systems are typical of content-based recommender systems, requiring users to use the system for an initial period of time before the cold-start problem is overcome
Personal web-based agents such as Letizia [15]. Syskill Webert Quick rofiles on an [21] and Personal Webwatcher [ 19] track the users browsing and research paper topics. This allows inferences from the ontology to formulate user profiles. Profiles are constructed from positive and assist profile generation; in our case topic inheritance is used to egative examples of interest, obtained from explicit feedback or infer interest in super-classes of specific topics. Sharing interest euristics analysing browsing behaviour. They then suggest which profiles with the AkT ontology is not difficult since they are ks are worth following from the current web page by explicitly represented using ontological terms commending page links most similar to the users profile. Just Previous trials [18] of Quickstep used hand-crafted initial profiles ke a content-based recommender system, a few examples of based on interview data, to cope with the cold-start problem interest must be observed or elicited from the user before a useful profile can be constructed Linking Quickstep with the Akt ontology automates this process, allowing a more realistic cold-start solution that will scale to Ontologies can be used to improve content-base larger numbers of users in OntoSeek [13]. Users of Onto Seek navigate order to formulate queries. Ontologies can 5.1 Paper classification algorithm automatically construct knowledge bases from web pages, such as Every research paper within Quicksteps central database is in Web-KB [8]. Web-kB takes manually labelled examples of represented using a term frequency vector. Terms are single words domain concepts and applies machine-learning techniques to within the document, so term frequency vectors are computed by lassify new web pages. Both systems do not, however, capture counting the number of times words appear within the paper, Each dynamic information such as user interest dimension within a vector represents a term. Dimensionality Also of relevance are systems such as CiteSeer [6], which use reduction on vectors is achieved by removing common words found op-list and stemming words using the Porter[22] content-based similarity matching to help search for interesting stemming algorithm. Quickstep uses vectors with 10-15,000 research papers within a digital librar dimensions 5. THE QUICKSTEP RECOMMENDER Once added to the database, papers are classified using an IBk [1] SYSTEM classifier boosted by the AdaBoostMl [11 algorithm. The IBk Quickstep [18] is a hybrid recommender system, addressing the classifier is a k-Nearest Neighbour type classifier that uses eal-world problem of recommending on-line research papers to researchers. User browsing behaviour is unobtrusively monitored Figure 2 shows the basic k-Nearest Neighbour algorithm. The via a proxy server, logging each URL browsed during normal closeness of an unclassified vector to its neighbours within the work activity. A nearest-neighbour algorithm classifies browsed ector space determines its classification URLs based on a training set of labelled example papers, storing each new paper in a central database. The database of known pers grows over time, building a shared pool of knowledge. w(d4)=y∑(-b) Explicit feedback and browsed URL's form the basis of the interest profile for each user. Figure 1 shows an overview of the Quickstep system w(da, db )knn distance between document a and b orld wie Users Profiles number of terms in document set Figure 2. k-Nearest Neighbour algorithm Classifie Recommender Classifiers like k-Nearest Neighbour allow more training examples to be added to their vector space without the need to re- build the entire classifier. They also degrade well, so even when incorrect the class returned is normally in the right Classified neighbourhood"and so at least partially relevant. This makes k- papers Nearest Neighbour a robust choice of algorithm for this task Boosting works by repeatedly running a weak learning algorith Figure 1. The Quickstep recommender system on various distributions of the training set, and then combining the classifiers produced by the weak learner into a single Cach day a set of recommendations is computed, based or composite classifier. The "weak "learning algorithm here is the IBk classifier. Figure 3 shows the Ada BoostMI algorithm correlations between user interest profiles and classified paper opics. Any feedback offered by users on these recommendations is recorded when the user looks at them. Users can provide new examples of topics and correct paper classifications where wrong. In this way the training set, and hence classification accuracy, Improves over time
Personal web-based agents such as Letizia [15], Syskill & Webert [21] and Personal Webwatcher [19] track the users browsing and formulate user profiles. Profiles are constructed from positive and negative examples of interest, obtained from explicit feedback or heuristics analysing browsing behaviour. They then suggest which links are worth following from the current web page by recommending page links most similar to the users profile. Just like a content-based recommender system, a few examples of interest must be observed or elicited from the user before a useful profile can be constructed. Ontologies can be used to improve content-based search, as seen in OntoSeek [13]. Users of OntoSeek navigate the ontology in order to formulate queries. Ontologies can also be used to automatically construct knowledge bases from web pages, such as in Web-KB [8]. Web-KB takes manually labelled examples of domain concepts and applies machine-learning techniques to classify new web pages. Both systems do not, however, capture dynamic information such as user interests. Also of relevance are systems such as CiteSeer [6], which use content-based similarity matching to help search for interesting research papers within a digital library. 5. THE QUICKSTEP RECOMMENDER SYSTEM Quickstep [18] is a hybrid recommender system, addressing the real-world problem of recommending on-line research papers to researchers. User browsing behaviour is unobtrusively monitored via a proxy server, logging each URL browsed during normal work activity. A nearest-neighbour algorithm classifies browsed URL’s based on a training set of labelled example papers, storing each new paper in a central database. The database of known papers grows over time, building a shared pool of knowledge. Explicit feedback and browsed URL’s form the basis of the interest profile for each user. Figure 1 shows an overview of the Quickstep system. World Wide Web Users Profiles Classifier Recommender Classified papers World Wide Web Users Profiles Classifier Recommender Classified papers Figure 1. The Quickstep recommender system Each day a set of recommendations is computed, based on correlations between user interest profiles and classified paper topics. Any feedback offered by users on these recommendations is recorded when the user looks at them. Users can provide new examples of topics and correct paper classifications where wrong. In this way the training set, and hence classification accuracy, improves over time. Quickstep bases its user interest profiles on an ontology of research paper topics. This allows inferences from the ontology to assist profile generation; in our case topic inheritance is used to infer interest in super-classes of specific topics. Sharing interest profiles with the AKT ontology is not difficult since they are explicitly represented using ontological terms. Previous trials [18] of Quickstep used hand-crafted initial profiles, based on interview data, to cope with the cold-start problem. Linking Quickstep with the AKT ontology automates this process, allowing a more realistic cold-start solution that will scale to larger numbers of users. 5.1 Paper classification algorithm Every research paper within Quickstep’s central database is represented using a term frequency vector. Terms are single words within the document, so term frequency vectors are computed by counting the number of times words appear within the paper. Each dimension within a vector represents a term. Dimensionality reduction on vectors is achieved by removing common words found in a stop-list and stemming words using the Porter [22] stemming algorithm. Quickstep uses vectors with 10-15,000 dimensions. Once added to the database, papers are classified using an IBk [1] classifier boosted by the AdaBoostM1 [11] algorithm. The IBk classifier is a k-Nearest Neighbour type classifier that uses example documents, called a training set, added to a vector space. Figure 2 shows the basic k-Nearest Neighbour algorithm. The closeness of an unclassified vector to its neighbours within the vector space determines its classification. w(da,db) = √ ____________ Σ j = 1..T (tja – tjb) 2 w(da,db) kNN distance between document a and b da,db document vectors T number of terms in document set tja weight of term j document a w(da,db) = √ ____________ Σ j = 1..T (tja – tjb) 2 w(da,db) kNN distance between document a and b da,db document vectors T number of terms in document set tja weight of term j document a Figure 2. k-Nearest Neighbour algorithm Classifiers like k-Nearest Neighbour allow more training examples to be added to their vector space without the need to rebuild the entire classifier. They also degrade well, so even when incorrect the class returned is normally in the right “neighbourhood” and so at least partially relevant. This makes kNearest Neighbour a robust choice of algorithm for this task. Boosting works by repeatedly running a weak learning algorithm on various distributions of the training set, and then combining the classifiers produced by the weak learner into a single composite classifier. The “weak” learning algorithm here is the IBk classifier. Figure 3 shows the AdaBoostM1 algorithm
ise all values ofd to 1/N Artificial- Agents or t=l.T Game Theory calculate errore calculate B e/(l-ey calculate D计1 nowledge Representation classifier= argmax ∑ Machine Learning ith result class c hilosophy (All ech [Al] D class weight distribution on iteration Vision[All number of classes T number of iterations weak-learn(D, weak learner with distribution D, weak learn error on iteration t error adjustment value on iteration t Content-Based Navigation classifier final boosted classifier alization[hypertext] Figure 3. AdaboostMl boosting algorithm Figure 5. Section of the research paper topic ontology AdaBoostMI has been shown to improve [11] the performance of 5.3 Recommendation algorithm algorithms, particularly for stronger learning Recommendations are formulated from a correlation between the e k-Nearest Neighbour. It is thus a sensible choice users current topics of interest and papers classified as belonging to boo k classifier to those topics. a paper is only recommended if it does not appear in the users browsed URL log, that recommendations 5.2 User profiling algorithm have not been seen before. For each user, the top three interesting The profiling algorithm performs correlation between paper topic topics are selected with 10 recommendations made in total, Papers classifications and user browsing logs. Whenever a research pape are ranked in order of the recommendation confidence before is browsed that has been classified as belonging to a topic, it being presented to the user. Figure 6 shows the recommendation accumulates an interest score for that topic. Explicit feedback or commendations also accumulates interest value for topics. The current interest of a topic is computed using the inverse time Recommendation confidence= classification confidence t weighting algorithm shown in Figure 4 pic interest value Figure 6. Recommendation algorithm opIc Interest T ∑ Interest value(n)/ days old(n) 6. ONTOCOP Interest values Paper browsed=1 The Ontology-based Communities of Practice Identifier (Onto CoPD)[2] is an experimental system that uses the AKT Recommendation followed= 2 ontology to help identifying communities of practice(CoP). The Topic rated interesting= 10 community of practice of a person is taken here to be the closest Topic rated not interesting=-10 group of people, based on specific features they have in common with that given person. A community of practice is thus Figure 4. Profiling algorithm informal group of people who share some common interest in a particular practice [7[27]. Workplace communities of practice An is-a hierarchy of research paper topics is held so that super- mprove organisational performance by maintaining implicit knowledge, helping the spread of new ideas and sol class relationships can be used to infer broader topic interest. as a focus for innovation and driving organisational strategy When a specific topic is browsed, fractional interest is inferred for ach super-class of that topic, using a 1/ eel weighting where Identifying communities of practice is an essential first step to level'refers to how many classes up the is-a tree the super-class understand the knowledge resources of an organization [28] is from the original gure ows a section from the Organisations can bring the right people together to help the identified communities of practice to flourish and expand, for example by providing them with appropriate infrastructure and give them support and recognition. However, community of practice identification is currently a resource-heavy process
Initialise all values of D to 1/N Do for t=1..T call weak-learn(Dt ) calculate error et calculate βt = et /(1-et ) calculate Dt+1 Dt class weight distribution on iteration t N number of classes T number of iterations weak-learn(Dt ) weak learner with distribution Dt et weak_learn error on iteration t βt error adjustment value on iteration t classifier final boosted classifier C all classes classifier = argmax Σ log t = all iterations with result class c c ∈ C βt 1 __ Initialise all values of D to 1/N Do for t=1..T call weak-learn(Dt ) calculate error et calculate βt = et /(1-et ) calculate Dt+1 Dt class weight distribution on iteration t N number of classes T number of iterations weak-learn(Dt ) weak learner with distribution Dt et weak_learn error on iteration t βt error adjustment value on iteration t classifier final boosted classifier C all classes classifier = argmax Σ log t = all iterations with result class c c ∈ C βt 1 __ classifier = argmax Σ log t = all iterations with result class c c ∈ C βt 1 __ Figure 3. AdaBoostM1 boosting algorithm AdaBoostM1 has been shown to improve [11] the performance of weak learner algorithms, particularly for stronger learning algorithms like k-Nearest Neighbour. It is thus a sensible choice to boost our IBk classifier. 5.2 User profiling algorithm The profiling algorithm performs correlation between paper topic classifications and user browsing logs. Whenever a research paper is browsed that has been classified as belonging to a topic, it accumulates an interest score for that topic. Explicit feedback on recommendations also accumulates interest value for topics. The current interest of a topic is computed using the inverse time weighting algorithm shown in Figure 4. ˇ n 1..no of instances Topic interest = Interest value(n) / days old(n) Interest values Paper browsed = 1 Recommendation followed = 2 Topic rated interesting = 10 Topic rated not interesting = -10 ˇ n 1..no of instances Topic interest = Interest value(n) / days old(n) Interest values Paper browsed = 1 Recommendation followed = 2 Topic rated interesting = 10 Topic rated not interesting = -10 Figure 4. Profiling algorithm An is-a hierarchy of research paper topics is held so that superclass relationships can be used to infer broader topic interest. When a specific topic is browsed, fractional interest is inferred for each super-class of that topic, using a 1/2level weighting where ‘level’ refers to how many classes up the is-a tree the super-class is from the original topic. Figure 5 shows a section from the research paper topic ontology. Artificial Intelligence Hypermedia E-Commerce Interface Agents Mobile Agents Multi-Agent-Systems Recommender Systems Agents Belief Networks Fuzzy Game Theory Genetic Algorithms Genetic Programming Knowledge Representation Information Filtering Information Retrieval Machine Learning Natural Language Neural Networks Philosophy [AI] Robotics [AI] Speech [AI] Vision [AI] Text Classification Ontologies Adaptive Hypermedia Hypertext Design Industrial Hypermedia Literature [hypermedia] Open Hypermedia Spatial Hypertext Taxonomic Hypertext Visualization [hypertext] Web [hypermedia] Content-Based Navigation Architecture [open hypermedia] Figure 5. Section of the research paper topic ontology 5.3 Recommendation algorithm Recommendations are formulated from a correlation between the users current topics of interest and papers classified as belonging to those topics. A paper is only recommended if it does not appear in the users browsed URL log, ensuring that recommendations have not been seen before. For each user, the top three interesting topics are selected with 10 recommendations made in total. Papers are ranked in order of the recommendation confidence before being presented to the user. Figure 6 shows the recommendation algorithm. Recommendation confidence = classification confidence * topic interest value Figure 6. Recommendation algorithm 6. ONTOCOPI The Ontology-based Communities of Practice Identifier (OntoCoPI) [2] is an experimental system that uses the AKT ontology to help identifying communities of practice (CoP). The community of practice of a person is taken here to be the closest group of people, based on specific features they have in common with that given person. A community of practice is thus an informal group of people who share some common interest in a particular practice [7] [27]. Workplace communities of practice improve organisational performance by maintaining implicit knowledge, helping the spread of new ideas and solutions, acting as a focus for innovation and driving organisational strategy. Identifying communities of practice is an essential first step to understand the knowledge resources of an organization [28]. Organisations can bring the right people together to help the identified communities of practice to flourish and expand, for example by providing them with appropriate infrastructure and give them support and recognition. However, community of practice identification is currently a resource-heavy process
largely based of such community structures that are normally hidden within and AKT across organisations. Ontology OntoCoPI is a tool that uses ontology-based network analysis to User interest breadth-first spreading activation algorithm is applied by User OntoCoPI to crawl the ontology network of instances and publications relationships to extract patterns of certain relations between entities relating to a community of practice. The crawl can be Quickstep OntoCoPl limited to a given set of ontology relationships. These relationships can be traced to find specific information, such as who attended the same events, who co-authored papers and who are members of the same project or organisation. Communities of Figure 7. Ontology and recommender system integration practice are based on informal sets of relationships while ntologies are normally made up of formal relationships. The Upon start-up, the ontology provides the recommender hypothesis underlying OntoCoPI is that some informa with an initial set of publications for each of its registere ationships can be inferred from the presence of formal ones Each user's known publications are then correlated For instance, if A and B have no formal relationships, but they recommender systems classified paper database, and a set of have both authored papers with C, then that could indicate a historical interests compiled for that user. These historical interests form the basis of an initial profile, overcoming the new One of the advantages of using an ontology to identify system cold-start problem. Figure 8 details the initial profile communities of practice, rather than other traditional information algorithm. As per the Quickstep profiling algorithm, fractional networks [3 is that relationships can be selected according to interest in a topic super-classes is inferred when a specific topic is their semantics, and can have different weights to reflect relative importance. For example the relations of document authorship and project membership can be selected if it is required to identify communities of practice based on publications and project work. ∑ 1/ publication age(n) OntoCoPI allows manual selection of relationships or automatic selection based on the frequency of relationship use within the knowledge base. Selecting the right relationships and weights is an experimental process that is dependent on the ontolog structure,the type and amount of information in the ontology, and new-system initial profile =(t, topic interest(D))* the type of community of practice required t= When working with a new community of practice some Figure 8. New-system initial profile algorithm experiments will be needed to see which relationships are relevant to the desired community of practice, and how to set relative weights. In the experiments described in this paper, certain When the recommender system is up and running and a new user relationships were selected manually and weighted based on our is added, the ontology provides the historical publication list of preferences. Further trials are needed to determine the most the new user and the OntoCoPI system provides a ranked list of effective selection similar users. The initial profile of the new user is formed from a correlation between historical publications and any similar user 7. INTEGRATION OF THE TWO profiles. This algorithm is detailed in figure 9, and addresses the TECHNOLOGIES new-user cold-start problem We have investigated the integration of the ontology, Onto CoPI and Quickstep recommender system to provide a solution to both the cold-start problem and interest acquisition problem. Figure 7 shows our experimental systems after integration
largely based on interviews, mainly because of the informal nature of such community structures that are normally hidden within and across organisations. OntoCoPI is a tool that uses ontology-based network analysis to support the task of community of practice identification. A breadth-first spreading activation algorithm is applied by OntoCoPI to crawl the ontology network of instances and relationships to extract patterns of certain relations between entities relating to a community of practice. The crawl can be limited to a given set of ontology relationships. These relationships can be traced to find specific information, such as who attended the same events, who co-authored papers and who are members of the same project or organisation. Communities of practice are based on informal sets of relationships while ontologies are normally made up of formal relationships. The hypothesis underlying OntoCoPI is that some informal relationships can be inferred from the presence of formal ones. For instance, if A and B have no formal relationships, but they have both authored papers with C, then that could indicate a shared interest. One of the advantages of using an ontology to identify communities of practice, rather than other traditional information networks [3] is that relationships can be selected according to their semantics, and can have different weights to reflect relative importance. For example the relations of document authorship and project membership can be selected if it is required to identify communities of practice based on publications and project work. OntoCoPI allows manual selection of relationships or automatic selection based on the frequency of relationship use within the knowledge base. Selecting the right relationships and weights is an experimental process that is dependent on the ontology structure, the type and amount of information in the ontology, and the type of community of practice required. When working with a new community of practice some experiments will be needed to see which relationships are relevant to the desired community of practice, and how to set relative weights. In the experiments described in this paper, certain relationships were selected manually and weighted based on our preferences. Further trials are needed to determine the most effective selection. 7. INTEGRATION OF THE TWO TECHNOLOGIES We have investigated the integration of the ontology, OntoCoPI and Quickstep recommender system to provide a solution to both the cold-start problem and interest acquisition problem. Figure 7 shows our experimental systems after integration. AKT Ontology User interest profiles User publications User and domain knowledge Communities of practice Quickstep OntoCoPI AKT Ontology User interest profiles User publications User and domain knowledge Communities of practice Quickstep OntoCoPI Figure 7. Ontology and recommender system integration Upon start-up, the ontology provides the recommender system with an initial set of publications for each of its registered users. Each user’s known publications are then correlated with the recommender systems classified paper database, and a set of historical interests compiled for that user. These historical interests form the basis of an initial profile, overcoming the newsystem cold-start problem. Figure 8 details the initial profile algorithm. As per the Quickstep profiling algorithm, fractional interest in a topic super-classes is inferred when a specific topic is added. ˇ n 1.. publications belonging to class t topic interest(t) = 1 / publication age(n) t = new-system initial profile = (t, topic interest(t))* ˇ n 1.. publications belonging to class t topic interest(t) = 1 / publication age(n) t = new-system initial profile = (t, topic interest(t))* Figure 8. New-system initial profile algorithm When the recommender system is up and running and a new user is added, the ontology provides the historical publication list of the new user and the OntoCoPI system provides a ranked list of similar users. The initial profile of the new user is formed from a correlation between historical publications and any similar user profiles. This algorithm is detailed in figure 9, and addresses the new-user cold-start problem
7.1 Example of system operation When the Quickstep recommender system is first initialised, it retrieves a list of people and their publication URLs from the N ontology. Quickstep analyses these publications and classifies them according to the research topic hierarchy in the ontology 1/ publication age(n) Paper topics are associated with their date of publication, and the new-system initial profile'algorithm used to compute a set of initial profiles for each user. profile interest(u, t)=interest of user u in topic t* CoP confidence Tables I and 2 shows an example of this for the user Nigel Shadbolt. His publications are analysed and a set of topics and new-user initial p (t, topic interest(t))* dates formulated. The 'new-system initial profile'algorithm then t=research paper topic computes the interest values for each topic. For example, Knowledge Acquisition'has one publication two year old(round y= weighting constant >=0 up) so its value is 1.0/2=0.5 Nsimilr= number of similar users pubs= number of publications belonging to class t Tablel. Publication list for Shad bolt CoP confidence= Communities of practice confidence Publication Date Topic Figure 9. New-user initial profile algorithm User Preferences: ontologies 2001 Recommender The task of populating and maintaining the ontology of user recommender systems search interests is left to the recommender system. The recommender system compiles user profiles on a daily basis, and Knowledge Technologies these profiles are asserted into the ontology when ready. Figure 10 2001Technology details the structure of these profiles. In this way up-to-date The Use of Ontologies for 2001 Ontology terests are maintained, providing a solution to the interest Knowledge Acquisition acquisition problem. The interest data is used alongside the more Certifying KBSs: Using atic information within the ontology to improve the accuracy of CommonKADS to provide the Onto CoPl system. Supporting Evidence for Fitness for2000Knowledge user profile=(topic, interest Purpose of KBSs topic =research topic Extracting Focused Knowledge from interest interest value the Semantic Web 2000Acquisition Figure 10. Daily profiles sent to the AKT ontology Knowledge Engineering and 2000 Knowledge Management Table 2. Example of new-system profile for Shadbolt Interest dge Technology\Ontology AIAgents\Recommender Systems Knowledge Technology\Knowledge Acquisition 0.5
t = research paper topic u = user γ = weighting constant >= 0 Nsimilar = number of similar users Npubs t = number of publications belonging to class t CoP confidence = Communities of practice confidence topic interest(t) = ˇ n 1.. Npubs t + 1 / publication age(n) ˇ u 1.. Nsimilar _____ profile interest(u,t) Nsimilar γ profile interest(u,t) = interest of user u in topic t * CoP confidence new-user initial profile = (t, topic interest(t))* Figure 9. New-user initial profile algorithm The task of populating and maintaining the ontology of user research interests is left to the recommender system. The recommender system compiles user profiles on a daily basis, and these profiles are asserted into the ontology when ready. Figure 10 details the structure of these profiles. In this way up-to-date interests are maintained, providing a solution to the interest acquisition problem. The interest data is used alongside the more static information within the ontology to improve the accuracy of the OntoCoPI system. user profile = (topic, interest)* topic = research topic interest = interest value Figure 10. Daily profiles sent to the AKT ontology 7.1 Example of system operation When the Quickstep recommender system is first initialised, it retrieves a list of people and their publication URLs from the ontology. Quickstep analyses these publications and classifies them according to the research topic hierarchy in the ontology. Paper topics are associated with their date of publication, and the ‘new-system initial profile’ algorithm used to compute a set of initial profiles for each user. Tables 1 and 2 shows an example of this for the user Nigel Shadbolt. His publications are analysed and a set of topics and dates formulated. The ‘new-system initial profile’ algorithm then computes the interest values for each topic. For example, ‘Knowledge Acquisition’ has one publication two year old (round up) so its value is 1.0 / 2 = 0.5. Table1. Publication list for Shadbolt Publication Date Topic Capturing Knowledge of User Preferences: ontologies on recommender systems 2001 Recommender systems Knowledge Technologies 2001 Knowledge Technology The Use of Ontologies for Knowledge Acquisition 2001 Ontology Certifying KBSs: Using CommonKADS to Provide Supporting Evidence for Fitness for Purpose of KBSs 2000 Knowledge Management Extracting Focused Knowledge from the Semantic Web 2000 Knowledge Acquisition Knowledge Engineering and Management 2000 Knowledge Management … Table2. Example of new-system profile for Shadbolt Topic Interest Knowledge Technology\Knowledge Management 1.5 Knowledge Technology\Ontology 1.0 AI\Agents\Recommender Systems 1.0 Knowledge Technology\Knowledge Acquisition 0.5 …
At a later stage, after Quickstep has bet en runnin new user registers with email address sem99r@ecs soton ac uk. OntoCoPI identifies this email account as that of stuart Table 5. New-user profile for Middleton Middleton, a PhD candidate within the department, and returns Topic the ranked and normalised communities of practise list displayed Al\AgentslRecommender Systems 176 in table 3. This communities of practise list is identified using relations on conference attendance, supervision, authorship, An\Agents\ Mobile Agents 0.77 search interest, and project membership, using the weights 0.4, ANDistributed Systems 7, 0.3, 0.8, and 0.5 respectively. De Roure was found to be the closest person as he is Middletons supervisor, and has one joint Knowledge Technology oNtology publication co-authored with Middleton and Shadbolt. The people nowledge Technology Knowledge Devices with 0.82 values are other supervisees of De Roure. Alani Knowledge Technology Knowledge Management attended the same conference that middleton went to in 2001 Knowledge Technology \Knowledge Management 0.16 Table 3. OntoCoPI results for Middleton Person Relevance Person Relevance DeTour Alani 0.47 Every day Quicksteps profiles are updated and automatically fed back to the ontology, where they are used to populate the research Revill Shadbolt interest relationships of the relevant people 8. EMPIRICAL EVALUATION to evaluate the effect both the ney The communities of practise list is then sent to Quickstep, which initial profiling algorithms have on our integrated system, we searches for matching user profiles. These profiles will be more onducted an experiment based around the browsing behaviour accurate and up to date than those initially created profiles based logs obtained from the Quickstep [18] user trials. The algorithm on publications. Quickstep manages to find the profiles in table 4 previously described are used, as per the example in the previous in its logs. section, and the average performance for all users calculated 8.1 Experimental approach Table 4. Profiles of similar people to Middleton Users were selected from the Quickstep trials whom had entries Person within the departmental publication database. We selected nine Topic Interest users in total, with each user typically having one or two AlDistributed Systems DeRoure Al\Agents\Recommender Systems 0.73 The URL browsing logs of these users, extracted from 3 months Al\AgentslMobile Agent of browsing behaviour recorded during the Quickstep trials, were Revill then broken up into weekly log entries. Seven weeks of browsing Al\Agents\Recommender Systems 0.4 behaviour were taken from the start of the Quickstep trials, and an Knowledge Technology Knowledge empty log created to simulate the very start of the trial Beals Devices Eight iterations of the integrated system were thus run, the first Al\Agents Mobile agents 0.87 simulating the start of the trial and others simulating the following weeks I to 7. Interest profiles were recorded after each iteration 1.8 Two complete runs were made, one with the ' new-system initial Alani Knowledge Technology Knowledge profiling algorithm and one without. The control run without the Management\ CoP new-system initial profiling'algorithm started with blank profiles for each of its users Knowledge TechnologyKnowledge An additional iteration was run to evaluate the effectiveness of the Shadbolt new-user initial profile' algorithm. We took the communities of Al\Agents\Recommender Systems practice for each user, based on data from week 7, and used the new-user initial profile algorithm to compute initial profiles for These profiles are merged to create a profile for the new user, each user as if they were being entered onto the system at t of the trial. These profiles were recorded. Because we are Middleton, using the 'new-user initial profile algorithm with ay early prototype version of Onto CoPl, communities of value of 2.5. For example, Middleton has a publication on confidence values were not available: we thus used co Recommender Systemsthat is I year old and DeRoure, Revill values of 1 throughout this experiment. and Shadbolt have interest in 'Recommender Systems'-this In order to evaluate our algorithms effect on the cold-start (1.0*0.73+0.82*0.4+0.46*1.0)=1.76. Table5 shows the resulting problem, we compared all recorded profiles to the benchmark week 7 profile. This allows us to measure how quickly profiles converge to the stable state existing after a reasonable amount of
At a later stage, after Quickstep has been running for a while, a new user registers with email address sem99r@ecs.soton.ac.uk. OntoCoPI identifies this email account as that of Stuart Middleton, a PhD candidate within the department, and returns the ranked and normalised communities of practise list displayed in table 3. This communities of practise list is identified using relations on conference attendance, supervision, authorship, research interest, and project membership, using the weights 0.4, 0.7, 0.3, 0.8, and 0.5 respectively. De Roure was found to be the closest person as he is Middleton’s supervisor, and has one joint publication co-authored with Middleton and Shadbolt. The people with 0.82 values are other supervisees of De Roure. Alani attended the same conference that Middleton went to in 2001. Table 3. OntoCoPI results for Middleton Person Relevance value Person Relevance value DeRoure 1.0 Alani 0.47 Revill 0.82 Shadbolt 0.46 Beales 0.82 The communities of practise list is then sent to Quickstep, which searches for matching user profiles. These profiles will be more accurate and up to date than those initially created profiles based on publications. Quickstep manages to find the profiles in table 4 in its logs. Table 4. Profiles of similar people to Middleton Person Topic Interest AI\Distributed Systems 1.2 DeRoure AI\Agents\Recommender Systems … 0.73 AI\Agents\Mobile Agents 1.0 Revill AI\Agents\Recommender Systems … 0.4 Knowledge Technology\Knowledge Devices 0.9 Beals AI\Agents\Mobile Agents … 0.87 Knowledge Technology\Ontology 1.8 Alani Knowledge Technology\Knowledge Management\ CoP … 0.7 Knowledge Technology\Knowledge Management 1.5 Shadbolt AI\Agents\Recommender Systems … 1.0 These profiles are merged to create a profile for the new user, Middleton, using the ‘new-user initial profile’ algorithm with a γ value of 2.5. For example, Middleton has a publication on ‘Recommender Systems’ that is 1 year old and DeRoure, Revill and Shadbolt have interest in ‘Recommender Systems’ – this topics value is therefore 1/1 + 2.5/5 * (1.0*0.73+0.82*0.4+0.46*1.0) = 1.76. Table 5 shows the resulting profile. Table 5. New-user profile for Middleton Topic Interest AI\Agents\Recommender Systems 1.76 AI\Agents\Mobile Agents 0.77 AI\Distributed Systems 0.6 Knowledge Technology\Ontology 0.42 Knowledge Technology\Knowledge Devices 0.37 Knowledge Technology\Knowledge Management 0.35 Knowledge Technology\Knowledge Management\ CoP 0.16 … Every day Quickstep’s profiles are updated and automatically fed back to the ontology, where they are used to populate the research interest relationships of the relevant people. 8. EMPIRICAL EVALUATION In order to evaluate the effect both the new-system and new-user initial profiling algorithms have on our integrated system, we conducted an experiment based around the browsing behaviour logs obtained from the Quickstep [18] user trials. The algorithms previously described are used, as per the example in the previous section, and the average performance for all users calculated. 8.1 Experimental approach Users were selected from the Quickstep trials whom had entries within the departmental publication database. We selected nine users in total, with each user typically having one or two publications. The URL browsing logs of these users, extracted from 3 months of browsing behaviour recorded during the Quickstep trials, were then broken up into weekly log entries. Seven weeks of browsing behaviour were taken from the start of the Quickstep trials, and an empty log created to simulate the very start of the trial. Eight iterations of the integrated system were thus run, the first simulating the start of the trial and others simulating the following weeks 1 to 7. Interest profiles were recorded after each iteration. Two complete runs were made, one with the ‘new-system initial profiling’ algorithm and one without. The control run without the ‘new-system initial profiling’ algorithm started with blank profiles for each of its users. An additional iteration was run to evaluate the effectiveness of the ‘new-user initial profile’ algorithm. We took the communities of practice for each user, based on data from week 7, and used the ‘new-user initial profile’ algorithm to compute initial profiles for each user as if they were being entered onto the system at the end of the trial. These profiles were recorded. Because we are using an early prototype version of OntoCoPI, communities of practice confidence values were not available; we thus used confidence values of 1 throughout this experiment. In order to evaluate our algorithms effect on the cold-start problem, we compared all recorded profiles to the benchmark week 7 profile. This allows us to measure how quickly profiles converge to the stable state existing after a reasonable amount of
behaviour data has been accumulated. The quicker the profiles tended to miss current interests. This is because publications are move to this state the quicker they will have overcome the cold. generally not available for up-to-date interests As we would expect, once the weekly behaviour logs become Week 7 was chosen as the cut-off point of our analysis since after available to the system the profiles adjust accordingly, moving about 7 weeks of use the behaviour data gathered by Quickstep away from the initial bootstrapping. On week 7 the profiles ll dominate the user profiles. The effects of bootstrapping yond this point would not be significant. If we were to run the system beyond week 7 we would simply see the profiles The new-user algorithm result show a more dramatic increase in precision to 0.84, but comes at the price of a significant error rate continually adjusting to the behaviour logged each week of 0.55. The profiles produced by the new-user algorithm tended 8.2 Experimental results to be very inclusive, taking the set of similar user interests and Two measurements were preformed when comparing profiles to producing a union of these interests. While this captures many of the benchmark week 7 profile. The first, profile precision, interests not relevant to the new user but which were interesting to measures how many topics were mentioned in both the current the people similar to the new user. profile and benchmark profile. Profile precision is an indication how quickly the profile is converging to the final state, and thus Profile precision relative to benchmark profile how quickly the effects of the cold-start are overcome. The second, profile error rate, measures how many topics appeared in the current profile that did not appear within the benchmark 0.75 profile. Profile error rate is an indication of the errors introduced by the two bootstrapping algorithms. Figure 11 describes these metrIcs It should be noted that we are not measuring the precision and error rate of the profiles -only the relative and error rate compared to the week 7 steady state Measuring absolute profile accuracy subjective matter ve do not attempt it here, we ar interested in ho ly profiles reach their stead Figure 12. Profile precision evaluation of Quicksteps overall profiling and recommendatio performance can be found in [18]. Profile error rate relative to benchmark profile Noted+ N new-system bootstrap user bootstrap profile error rate of user topics that appear in current and benchmark profile Week of user topics that appear in benchmark Figure 13. Profile error rate profile but not in current profile Nineorreet Number of user topics that appear in current profile but not in benchmark profile Since error rate is measured relative to the final benchmark profile Total number of users of week 7, all the topics seen in the behaviour logs will be present Figure 11. Evaluation metrics within the benchmark profile. Incorrect topics must thus come from another source- in this case bootstrapping on week 0. This causes error rates to be constant over the 7 weeks the The results of our experimental runs are detailed in figures 12 and incorrect topics introduced on week 0 remain for all subsequent 13. The new-user results consist of a single iteration, so appear on weeks the graphs as a single point 9. DISCUSSION At the start, week 0, no browsing behaviour log data is available Cold-starts in recommender systems and interest acquisition in to the system so the profiles without bootstrapping are empty. The ontologies are serious problems. If initial recommendations are new-system algorithm, however, can bootstrap the initial us inaccurate, user confidence in the recommender system may drop profiles and achieves a reasonable precision of 0.35 and a low with the result that not enough usage data is gathered to overcome error rate of 0.06. We found that the new-system profiles the cold-start. In regards to ontologies, up-to-date interests are not accurately captured interests users had a year or so ago, but
behaviour data has been accumulated. The quicker the profiles move to this state the quicker they will have overcome the coldstart. Week 7 was chosen as the cut-off point of our analysis since after about 7 weeks of use the behaviour data gathered by Quickstep will dominate the user profiles. The effects of bootstrapping beyond this point would not be significant. If we were to run the system beyond week 7 we would simply see the profiles continually adjusting to the behaviour logged each week. 8.2 Experimental results Two measurements were preformed when comparing profiles to the benchmark week 7 profile. The first, profile precision, measures how many topics were mentioned in both the current profile and benchmark profile. Profile precision is an indication of how quickly the profile is converging to the final state, and thus how quickly the effects of the cold-start are overcome. The second, profile error rate, measures how many topics appeared in the current profile that did not appear within the benchmark profile. Profile error rate is an indication of the errors introduced by the two bootstrapping algorithms. Figure 11 describes these metrics. It should be noted that we are not measuring the absolute precision and error rate of the profiles – only the relative precision and error rate compared to the week 7 steady state profiles. Measuring absolute profile accuracy is a very subjective matter, and we do not attempt it here; we are only interested in how quickly profiles reach their steady states. A more complete evaluation of Quickstep’s overall profiling and recommendation performance can be found in [18]. Ncorrect Number of user topics that appear in current profile and benchmark profile Nmissing Number of user topics that appear in benchmark profile but not in current profile Nincorrect Number of user topics that appear in current profile but not in benchmark profile Nusers Total number of users profile error rate = ______________________ Ncorrect + Nincorrect + Nmissing Nincorrect profile precision = Ncorrect + Nmissing Ncorrect _____________ Nusers 1 _____ ˇ 1.. Nusers user Nusers 1 _____ ˇ 1.. Nusers user Ncorrect Number of user topics that appear in current profile and benchmark profile Nmissing Number of user topics that appear in benchmark profile but not in current profile Nincorrect Number of user topics that appear in current profile but not in benchmark profile Nusers Total number of users profile error rate = ______________________ Ncorrect + Nincorrect + Nmissing Nincorrect ______________________ Ncorrect + Nincorrect + Nmissing Nincorrect profile precision = Ncorrect + Nmissing Ncorrect _____________ Ncorrect + Nmissing Ncorrect Ncorrect + Nmissing Ncorrect _____________ Nusers 1 _____ ˇ 1.. Nusers user Nusers 1 _____ ˇ 1.. Nusers user Figure 11. Evaluation metrics The results of our experimental runs are detailed in figures 12 and 13. The new-user results consist of a single iteration, so appear on the graphs as a single point. At the start, week 0, no browsing behaviour log data is available to the system so the profiles without bootstrapping are empty. The new-system algorithm, however, can bootstrap the initial user profiles and achieves a reasonable precision of 0.35 and a low error rate of 0.06. We found that the new-system profiles accurately captured interests users had a year or so ago, but tended to miss current interests. This is because publications are generally not available for up-to-date interests. As we would expect, once the weekly behaviour logs become available to the system the profiles adjust accordingly, moving away from the initial bootstrapping. On week 7 the profiles converge to the benchmark profile. The new-user algorithm result show a more dramatic increase in precision to 0.84, but comes at the price of a significant error rate of 0.55. The profiles produced by the new-user algorithm tended to be very inclusive, taking the set of similar user interests and producing a union of these interests. While this captures many of the new users real interests, it also included a large number of interests not relevant to the new user but which were interesting to the people similar to the new user. Profile precision relative to benchmark profile 0 0.25 0.5 0.75 1 01234567 Week Precision new-system bootstrap no bootstrap new-user bootstrap Figure 12. Profile precision Profile error rate relative to benchmark profile 0 0.25 0.5 0.75 1 01234567 Week Error rate new-system bootstrap no bootstrap new-user bootstrap Figure 13. Profile error rate Since error rate is measured relative to the final benchmark profile of week 7, all the topics seen in the behaviour logs will be present within the benchmark profile. Incorrect topics must thus come from another source – in this case bootstrapping on week 0. This causes error rates to be constant over the 7 weeks, since the incorrect topics introduced on week 0 remain for all subsequent weeks. 9. DISCUSSION Cold-starts in recommender systems and interest acquisition in ontologies are serious problems. If initial recommendations are inaccurate, user confidence in the recommender system may drop with the result that not enough usage data is gathered to overcome the cold-start. In regards to ontologies, up-to-date interests are not
generally available from periodically updated information sources The new-user file algorithm defines the constant h as web pages, personal records or publication databases which determi proportional significance of previous Our integration of the Quickstep recommender system, AKT publications an users. Factors such as the availability of ontology and Onto CoPl system has demonstrated one approach to relationship data within the ontology and quality of the reduce both the cold-start and interest-acquisition problems Our publication database will affect the choice of value for y We used practical work suggests that using an ontology to bootstrap user a value of 2.5, but empirical evaluation would be needed to files can significantly reduce the impact of the recommender determine the best value system cold-start problem. It is particularly useful for the new- There is an issue as to how best to calculate the system cold-start problem, where the alternative is to start with no distance"between topics within the is-a hierarchy. We formation at all. Regularly feeding the recommender systems simplifying assumption that all is-a links have equal terest profiles back to the ontology also clearly assists in the but the exact relevance will depend on each topic in question. If acquisition of up-to-date interests. A number of issues have, individual weightings were allowed for each topic, a method for however, arisen from our integration determination of these weights would have to be considered The new-system algorithm produced profiles with a low error rate Alternatively the is-a hierarchy could be carefully constructed to and a reasonable precision of 0.35. This reflects that previous ensure equal semantic distance publications are a good indication of users current interests, and A positive feedback loop exists between the recommender system so can produce a good starting point for a bootstrap profile. and ontology, making data incest a potential problem. For new Where the new-system algorithm fails is for more recent interests, users there are no initial interest entries within the ontology, so hich make up the remaining 65% of the topics in the final new user profiles are not incestuous. If the recommender system benchmark profile. To discover these more recent interests, it is were to use the communities of practice for more than just initial possible that the new-system al gorithm could be extended to take profiles, however, a self-confirming loop would exist and interest ome of the other information available within the ontology into calculations would be incestuous account, such as the projects a user is working on. To what degree Finally, a question still remains as to just how good an initial the ontology itself has great difficulty in acquiring knowledge of profile must be to fully overcome the effects of the cold-start problem. If initial recommendations are poor users will not us the recommender system and hence it will never get a chance to For the purposes of our experiment, we selected those users who had some entries within the universities on-line publication initial profiles, but further empirical evaluation would be needed database. There were some users who had not entered their to evaluate exactly how much improvement is needed before the publications into this database or who have yet to publish their system is"good enough"for users to give it a chance work. For these users there is little information within the ontology, making any new-system initial profiles of little use In a 10. FUTURE WORK arger scale system, more sources of information would be needed The next step for the integrated system is to continue to improve from the ontology to build the new-system profiles. This would the set of relationships and weights used to calculate communities low some redundancy, and hence improve robustness in the of practice, and find a more selective new-user initial profile realistic situation where information is sparsely available Igorithm. With more precise communities of practice the new The community of practice for a user was found not to be al ways user bootstrapping error rate should fall substantially. We could relevant based on our selection of relationships and weights. For then conduct a set of further user trials. This would allow the ample, Dave de Roure supervises Stuart Middleton, but Dave assessment of user up-take and use of the integrated system, and upervises a lot of other students interested in mobile agents. eveal how improving initial profiles affect overall system usage These topics are not relevant to Stuart, which raises the questio patterns of how relevant the supervision relations requirements, and how best to weight such a relationship. Further ep er me ded eo system is currently being extended experiments are needed to identify the most relevant settings for profiles. A large-scale trial is un der s ologies to represent user over a full academic year community of practice identification. The accuracy of our to evaluate the new system, which is called the Foxtrot communities of practice are also linked to the accuracy of the recommender system research interest information as identified by the recommender 11. ACKNOWLEDGEMENTS This work is funded by EPSrC studentship award number The new-user algorithm achieved good precision of 0.84 at the expense of a significant 0.55 error rate. This was because both the 99308831 and the AKT project communities of practice generated for users were not always 12. REFERENCES precise, and because of the new-user algorithm included all 1 Aha, D Kibler, D. Albert, M. Instance-based learnin use those interests held by the majority of the people within a algorithms, Machine Learning, 6: 37-66, 1991 community of practice. This would exclude some of the less [21 Alani, H, O'Hara, K, and Shadbolt, N ONTOCOPI ommon interests that would otherwise be included into the new Methods and Tools for Identifying Communities of Practice user profile Intelligent Information Processing Conference, IFIP World Computer Congress(WCC), Montreal, Canada, 2002
generally available from periodically updated information sources such as web pages, personal records or publication databases. Our integration of the Quickstep recommender system, AKT ontology and OntoCoPI system has demonstrated one approach to reduce both the cold-start and interest-acquisition problems. Our practical work suggests that using an ontology to bootstrap user profiles can significantly reduce the impact of the recommender system cold-start problem. It is particularly useful for the newsystem cold-start problem, where the alternative is to start with no information at all. Regularly feeding the recommender systems interest profiles back to the ontology also clearly assists in the acquisition of up-to-date interests. A number of issues have, however, arisen from our integration. The new-system algorithm produced profiles with a low error rate and a reasonable precision of 0.35. This reflects that previous publications are a good indication of users current interests, and so can produce a good starting point for a bootstrap profile. Where the new-system algorithm fails is for more recent interests, which make up the remaining 65% of the topics in the final benchmark profile. To discover these more recent interests, it is possible that the new-system algorithm could be extended to take some of the other information available within the ontology into account, such as the projects a user is working on. To what degree these relationships will help is difficult to predict however, since the ontology itself has great difficulty in acquiring knowledge of recent interests. For the purposes of our experiment, we selected those users who had some entries within the universities on-line publication database. There were some users who had not entered their publications into this database or who have yet to publish their work. For these users there is little information within the ontology, making any new-system initial profiles of little use. In a larger scale system, more sources of information would be needed from the ontology to build the new-system profiles. This would allow some redundancy, and hence improve robustness in the realistic situation where information is sparsely available. The community of practice for a user was found not to be always relevant based on our selection of relationships and weights. For example, Dave de Roure supervises Stuart Middleton, but Dave supervises a lot of other students interested in mobile agents. These topics are not relevant to Stuart, which raises the question of how relevant the supervision relationship is to our requirements, and how best to weight such a relationship. Further experiments are needed to identify the most relevant settings for community of practice identification. The accuracy of our communities of practice are also linked to the accuracy of the research interest information as identified by the recommender system. The new-user algorithm achieved good precision of 0.84 at the expense of a significant 0.55 error rate. This was because both the communities of practice generated for users were not always precise, and because of the new-user algorithm included all interests from the similar users. An improvement would be to only use those interests held by the majority of the people within a community of practice. This would exclude some of the less common interests that would otherwise be included into the newuser profile. The new-user initial profile algorithm defines the constant γ, which determines the proportional significance of previous publications and similar users. Factors such as the availability of relationship data within the ontology and quality of the publication database will affect the choice of value for γ. We used a value of 2.5, but empirical evaluation would be needed to determine the best value. There is an issue as to how best to calculate the “semantic distance” between topics within the is-a hierarchy. We make the simplifying assumption that all is-a links have equal relevance, but the exact relevance will depend on each topic in question. If individual weightings were allowed for each topic, a method for determination of these weights would have to be considered. Alternatively the is-a hierarchy could be carefully constructed to ensure equal semantic distance. A positive feedback loop exists between the recommender system and ontology, making data incest a potential problem. For new users there are no initial interest entries within the ontology, so new user profiles are not incestuous. If the recommender system were to use the communities of practice for more than just initial profiles, however, a self-confirming loop would exist and interest calculations would be incestuous. Finally, a question still remains as to just how good an initial profile must be to fully overcome the effects of the cold-start problem. If initial recommendations are poor users will not use the recommender system and hence it will never get a chance to improve. We have shown that improvements can be made to initial profiles, but further empirical evaluation would be needed to evaluate exactly how much improvement is needed before the system is “good enough” for users to give it a chance. 10. FUTURE WORK The next step for the integrated system is to continue to improve the set of relationships and weights used to calculate communities of practice, and find a more selective ‘new-user initial profile’ algorithm. With more precise communities of practice the newuser bootstrapping error rate should fall substantially. We could then conduct a set of further user trials. This would allow the assessment of user up-take and use of the integrated system, and reveal how improving initial profiles affect overall system usage patterns. The Quickstep recommender system is currently being extended to explore further the idea of using ontologies to represent user profiles. A large-scale trial is under way over a full academic year to evaluate the new system, which is called the Foxtrot recommender system. 11. ACKNOWLEDGEMENTS This work is funded by EPSRC studentship award number 99308831 and the AKT project. 12. REFERENCES [1] Aha, D. Kibler, D. Albert, M. Instance-based learning algorithms, Machine Learning, 6:37-66, 1991 [2] Alani, H., O’Hara, K., and Shadbolt, N. ONTOCOPI: Methods and Tools for Identifying Communities of Practice, Intelligent Information Processing Conference, IFIP World Computer Congress (WCC), Montreal, Canada, 2002
Albert. R. and Barabasi. AL. Statistical Mechanics of [15 Lieberman, H Letizia: An Agent That Assists Web omplex Networks. Review of Modern Physics, 74, 47 Conference on Artificial Intelligence, Montreal, Canada, [4]Balabanovi, M, and Shoham, Y. Fab: content-based collaborative recommendation, Communications of the ACM [16 Lotus. Locating Organisational Expertise with the Lotus Volume 40, No. 3(Mar 1997) Discovery Server, White Paper, 2001 5 Becerra-Fernandez, L. Facilitating the Online Seach of [17 Maltz, D. Ehrlich, E Pointing the way: Active collaborative Experts at NASA using Expert Seeker People-Finder iltering, CHI95 Human Factors in Computing Systems, Proceedings of the 3rd International Conference on Practical 1995 Aspects of Knowledge Management(PAK M), Basel, Switzerland 2000 [18 Middleton, SE, De Roure, D. C, and Shadbolt, N.R. Capturing Knowledge of User Preferences: ontologies on [6]Bollacker, K D, Lawrence, S, and Giles, C.L. CiteSeer:An recommender systems, In Proceedings of the First Autonomous Web Agent for Automatic Retrieval and International Conference on Knowledge Capture(K-CAP Identification of Interesting Publications, Proceedings of the 2001), Oct 2001, Victoria, B.C. Canada. Second International Conference on Autonomous agents Minneapolis MN, USA, 1998 [19 Mladenic, D. Personal WebWatcher: Implementation and Design, Technical Report IS-DP-7472, Department of [7 Brown and Duguid 2000. The social life of information Intelligent Systems, J.Stefan Institute, Slovenia, 1996 Harvard Buisness School Press [20]O'Hara, K, Shadbolt, N uckingham Shum S. The 8 Craven, M. DiPasquo, D Freitag, D. McCallum, A Mitchell, AKT Manifesto. 2001 T Nigam K, and Slattery, S Learning to Extract Symbolic Knowledge from the World wide Web, Proceedings of the www.aktors.org/publications/manifesto.doc Sth National Conference on Artificial Intelligence(AAAl [21]Pazzani, M. Muramatsu J Billsus, D Syskill Webert Identifying interesting web sites, Proceedings of the National Conference on Artificial Intelligence, Portland, Oregon, techniques for finding people. Proceedings of the 3rd International Conference on Practical Aspects of Knowledge 22 Porter, M. An algorithm for suffix stripping, Program 14(3), July 1980,pp.130-13 [10]Eriksson, H, Fergeson, R, Shahr, Y, and Musen, M. [23]Resnick, P. Varian, H.R. Recommender systems, 1999). Automatic generation of ontology editors. 12th Communications of the ACM, 40(3), 1997, 56-58 Workshop on Knowledge Acquisition, Modelling, and [ Roo J.J. Relevance feedback in information retrieval. in Management(KAW 99), Ban, Alberta, Canada the SMArT Retrieval System- Experiments in Aut [I1]Freund, Y Schapire, R.E. Experiments with a New Boosting Document Processing, Englewood Cliffs, NJ, 1971 Algorithm, Proceedings of the Thirteenth International Hall, Inc. pp. 313-323 Conference on Machine Learning, 1996 25 Schwab, I, Pohl, W, and Koychev, I. Learning to [12]Guarino, N, and Giaretta, P (1995). Ontologies and Recommend from Positive Evidence, Proceedings of Knowledge bases: towards a terminological clarification. Intelligent User Interfaces 2000, ACM Press, pp 241-247 Towards Very Large Knowledge Bases: Knowledge Building [26]Terveen, L, Hill, w, Amento, B, McDonald, D, and and Knwoledge Sharing N. Mars, IOS Press: 25-32 Creter, J PHOAKS: a system for sharing recommendations, [13]Guarino, N, Masolo, C and Vetere, G Onto Seek: Content- Communications of the ACM Volume 40, No. 3(Mar, 1997) Based Access to the Web, IEEE Intelligent Systems, Vol 14, [27Wenger, E. C, and Snyder, w. M. Communities of Practice No. 3, May/June 1999 The Organizational Frontier. Harvard Business Review S gordon . A, Miler, B N. Maltz, D, Herlocker, J.L. [14]Konstar January-February: 139-145. 2000 [28] Wenger, E. Communities of practice: the key to knowledge ollaborative filtering to Usenet news, Communications of strategy. Knowledge Directions: The Journal of the Institute the ACM Volume 40, No. 3(Mar. 1997) for Knowledge Management, 1, 48-93, 1999
[3] Albert, R. and Barabasi, AL. Statistical Mechanics of Complex Networks. Review of Modern Physics, 74, 47, 2002. [4] Balabanovi, M., and Shoham, Y. Fab: content-based, collaborative recommendation, Communications of the ACM Volume 40, No. 3 (Mar. 1997) [5] Becerra-Fernandez, I. Facilitating the Online Seach of Experts at NASA using Expert Seeker People-Finder. Proceedings of the 3rd International Conference on Practical Aspects of Knowledge Management (PAKM), Basel, Switzerland, 2000. [6] Bollacker, K.D., Lawrence, S., and Giles, C.L. CiteSeer: An Autonomous Web Agent for Automatic Retrieval and Identification of Interesting Publications, Proceedings of the Second International Conference on Autonomous Agents, Minneapolis MN, USA, 1998 [7] Brown and Duguid 2000. The social life of information Harvard Buisness School Press [8] Craven, M. DiPasquo, D. Freitag, D. McCallum, A. Mitchell, T. Nigam K. and Slattery, S. Learning to Extract Symbolic Knowledge from the World Wide Web, Proceedings of the 15th National Conference on Artificial Intelligence (AAAI- 98), 1998 [9] Dunlop, M. D. Development and evaluation of clustering techniques for finding people. Proceedings of the 3rd International Conference on Practical Aspects of Knowledge Management (PAKM), Basel, Switzerland, 2000. [10]Eriksson, H., Fergeson, R., Shahr, Y., and Musen, M. (1999). Automatic generation of ontology editors. 12th Workshop on Knowledge Acquisition, Modelling, and Management (KAW'99), Ban, Alberta, Canada. [11]Freund, Y. Schapire, R.E. Experiments with a New Boosting Algorithm, Proceedings of the Thirteenth International Conference on Machine Learning, 1996 [12]Guarino, N., and Giaretta, P. (1995). Ontologies and Knowledge bases: towards a terminological clarification. Towards Very Large Knowledge Bases: Knowledge Building and Knwoledge Sharing. N. Mars, IOS Press: 25-32. [13]Guarino, N., Masolo, C. and Vetere, G. OntoSeek: ContentBased Access to the Web, IEEE Intelligent Systems, Vol. 14, No. 3, May/June 1999 [14]Konstan, J.A., Miller, B.N., Maltz, D., Herlocker, J.L., Gordon, L.R., and Riedl, J. GroupLens: applying collaborative filtering to Usenet news, Communications of the ACM Volume 40, No. 3 (Mar. 1997) [15]Lieberman, H. Letizia: An Agent That Assists Web Browsing, Proceedings of the 1995 International Joint Conference on Artificial Intelligence, Montreal, Canada, August 1995 [16]Lotus. Locating Organisational Expertise with the Lotus Discovery Server, White Paper, 2001. [17]Maltz, D. Ehrlich, E. Pointing the way: Active collaborative filtering, CHI’95 Human Factors in Computing Systems, 1995 [18]Middleton, S.E., De Roure, D. C., and Shadbolt, N.R. Capturing Knowledge of User Preferences: ontologies on recommender systems, In Proceedings of the First International Conference on Knowledge Capture (K-CAP 2001), Oct 2001, Victoria, B.C. Canada. [19]Mladenic, D. Personal WebWatcher: Implementation and Design, Technical Report IJS-DP-7472, Department of Intelligent Systems, J.Stefan Institute, Slovenia, 1996 [20]O’Hara, K., Shadbolt, N., and Buckingham Shum, S. The AKT Manifesto, 2001. www.aktors.org/publications/Manifesto.doc [21]Pazzani, M. Muramatsu J. Billsus, D. Syskill & Webert: Identifying interesting web sites, Proceedings of the National Conference on Artificial Intelligence, Portland, Oregon, 1996 [22]Porter, M. An algorithm for suffix stripping, Program 14 (3), July 1980, pp. 130-137 [23]Resnick, P. Varian, H. R. Recommender systems, Communications of the ACM, 40(3), 1997, 56-58 [24]Rocchio, J.J. Relevance feedback in information retrieval, in the SMART Retrieval System -- Experiments in Automatic Document Processing, Englewood Cliffs, NJ, 1971. Prentice Hall, Inc. pp. 313-323 [25]Schwab, I., Pohl, W., and Koychev, I. Learning to Recommend from Positive Evidence, Proceedings of Intelligent User Interfaces 2000, ACM Press, pp 241-247. [26]Terveen, L., Hill, W., Amento, B., McDonald, D., and Creter, J. PHOAKS: a system for sharing recommendations, Communications of the ACM Volume 40, No. 3 (Mar. 1997) [27]Wenger, E. C., and Snyder, W. M. Communities of Practice: The Organizational Frontier. Harvard Business Review. January-February: 139-145. 2000 [28]Wenger, E. Communities of practice: the key to knowledge strategy. Knowledge Directions: The Journal of the Institute for Knowledge Management, 1, 48-93, 1999