USER: User-Sensitive Expert Recommendations for Knowledge-Dense Environments

Colin DeLong1, Prasanna Desikan2, and Jaideep Srivastava2

1 College of Liberal Arts, University of Minnesota, 101 Pleasant St. SE, 55455 Minneapolis, MN, United States of America
delo0041@umn.edu
2 Department of Computer Science, University of Minnesota, 200 Union Street SE, 55455 Minneapolis, MN, United States of America
{desikan, srivasta}@cs.umn.edu

Abstract. Traditional recommender systems tend to focus on e-commerce applications, recommending products to users from a large catalog of available items. The goal has been to increase sales by tapping into the user's interests, utilizing information from various data sources to make relevant recommendations. Education, government, and policy websites face parallel challenges, except that the product is information and their users may not be aware of what is relevant and what isn't. Given a large, knowledge-dense website and a non-expert user searching for information, making relevant recommendations becomes a significant challenge. This paper addresses the problem of providing recommendations to non-experts, helping them understand what they need to know, as opposed to what is popular among other users. The approach is user-sensitive in that it adopts a 'model of learning' whereby the user's context is dynamically interpreted as they browse, and that information is then leveraged to improve our recommendations.

1 Introduction

Current recommender systems typically ask the question "What does the user or groups of similar users find interesting?" and make recommendations, usually in the form of products or Web documents, on that basis. In some domains, however, the question of what is a relevant recommendation becomes a function of both user interest and expert opinion, a need recommender systems haven't traditionally dealt with. This paper is meant to address this very issue: providing expert-driven recommendations. To our knowledge, this topic has been little explored outside of a few tangentially-related recommender systems [24]. Essentially, the goal of an expert-driven recommendation system is to determine the user's context and provide recommendations by incorporating a mapping of expert knowledge. The idea is to direct users to what they need to know versus what is popular, as the two are not always in perfect alignment with each other. For instance, an important human resources policy affecting some subset of people in an organization may not receive much traffic online, but that in and of itself does not mean the policy is irrelevant or unimportant. Additionally, as more and more organizations put expert-oriented
content online, user navigational issues can be exacerbated to the point where such websites are essentially unusable for non-experts. Other examples include Web sites with dense policy information, such as those created for academic advising, tax law, and human resources.

This paper, which is an extension and update of a previous paper of ours [25], is a step towards providing a generalized framework for expert-driven recommendations. Our starting point is the "Model of Learning", the philosophical basis from which our system and its components are constructed. "User Intent" is modeled through usage data analysis, producing sets of sequential association rules, and "Expert Knowledge" through the structure and hypertext of the website itself, an expansion over our previous work, resulting in a concept graph mapped to individual URLs. Three separate rankings – UsageRank, ExpertRank, and GraphRank – are produced and incorporated into an overall USER Rank, which produces the final list of recommendations. To the best of our knowledge, this approach of providing expert-driven recommendations is novel.

Section 2 talks briefly about related work in this area. In Section 3 we discuss the key problems addressed by this paper. Section 4 describes our method from a philosophical and technical perspective. In Section 5 we give an overview of the architecture of the recommendation system that has been created for testing and evaluation purposes. We present our experiments and results in Section 6. Finally, we provide conclusions and some future research directions in Section 7.

2 Related Work

Research on recommendation systems started in the early 90s, when these systems were meant to aggregate past preferences of all users to provide recommendations to users with similar interests [18], [21], [8], [17]. Various possible classifications of recommendation techniques have been proposed [6], [21], [22]. There have also been various survey papers focusing on particular aspects of recommender systems, such as rating-based recommendations [4], hybrid recommendations [5], and e-commerce applications [22]. In general, recommendation systems can be categorized, based on their approach, into four categories, an extension of the initial categorization (noted by earlier work [6], [4]):

• Content-Based Approach: These recommendations are derived from the similarity of a product the user is interested in to other products the user has bought or referred to in the past, along with the preference level for those products as expressed by the user.
• Collaborative Approach: In such an approach, similarity is measured with respect to the end products purchased by other users. For instance, a shared subset of previously-purchased items between the user and other customers would be a driving factor for recommendations.
• Usage-Behavior Recommendations: Similarity based on the usage behavior of users is used as a criterion for recommendations. These recommendations are mapped to correlations in browsing behavior and mined from logs such as those generated by Web servers.
• Hybrid Approach: Any combination of the above three approaches.
Content-Based and Collaborative Filtering [6], [10] approaches require the availability of user profiles and their explicitly-solicited opinions for a set of targeted products. Collaborative approaches also suffer from well-known problems such as the 'Identification Requirement', the "Pump Priming" problem, scalability, sparseness, and rating bias [7]. Web mining techniques such as Web Usage Mining overcome some of these problems through the use of the implicitly-expressed opinion of the user, extracting information from Web usage logs for ranking relevant Web pages [11], [20], [14]. The goal of most existing recommendation systems that follow collaborative approaches [13] or hybrid approaches [1], [2], [3] is to recommend products in order to boost sales, as opposed to recommending 'products' – Web documents in this paper – necessary to the user.

Web page recommendation and personalization techniques, in which a user's preferences are automatically learned using Web usage mining, became popular in the late 90s [14], [19]. Li and Zaiane have illustrated an approach to combining content, structure, and usage for Web page recommendations [11]. Huang et al. make use of transitive associations and a bipartite graph to address the sparsity issue [27] and have conducted thorough research on other graph-based models as well [28]. An extensive study of Web personalization based on Web usage mining can be found in [12]. Ishikawa et al. [9] have empirically studied the effectiveness of Web usage mining as a recommendation technique. Most of these recommendation systems have been developed using data obtained from the explicit link structure of the Web and Web usage logs. For this extended paper, we use these same sets of data in our efforts to generalize the expert-recommendation problem and its potential solutions. Our previous work in this area [25] focused on a richer data set, whereby each Web document was assigned one or more concepts by an expert via a custom content management system (CMS) called Crimson. For this paper, a similar mapping of concepts to Web pages has been assembled using the link structure and link text, but the construction of this mapping is automated, a departure from traditional knowledge-based methods. In our approach, the web graph itself can be thought of as an unprocessed representation of domain knowledge. This information is then leveraged to generate a set of recommendations relevant to the user's context in a knowledge-dense Web environment.

3 The Problem

The primary reason for the creation of an expert-driven recommender system is to facilitate the navigation of non-expert surfers through websites containing expert-defined content, the numbers of which are increasing rapidly. Parallel with the growth of the World Wide Web, user demands for complete content availability have also increased. To keep up with the expectations of constituent users, expert-defined websites, such as those found in higher education, medicine, and government, are making available vast amounts of technically-oriented content, including policies, procedures, and other knowledge-dense information. Generally, this trend is good for users in that more information can be easily accessed by visiting websites, but difficulties arise when attempting to connect users to the "right" content. Here, two key problems arise.
First, the surfer is often unable to "ask the right question". Even if expert-defined websites are well-organized, the content language is often not intuitive to non-experts. Non-experts browsing such a website may quickly find themselves lost or uninterested because of this. Additionally, a site search using Google, for instance, will not always return relevant links if the surfer's query is inconsistent with the website's language. As such, surfer navigation can become a formidable task. This is further complicated when legal concerns must take precedence over the simplification of content wording, which might otherwise leave out important clauses or restrictions applicable to the surfer.

The second problem, related to the first, is how expert-defined websites can meet educational goals by connecting surfers to content relevant to their questions. The determination of relevance is at the core of all recommender systems. However, for expert-defined websites, such recommendations must depend not only on the surfer's definition of relevance, but also on the expert's opinion. Higher education websites face this issue in particular, where a student's context must be brought into alignment with certain academic outcomes, such as timely graduation. As such, a set of relevant recommendations might include links that Web usage mining alone would never have produced. Relevance in an expert recommender system is, therefore, a function of both surfer intent and expert knowledge. This lays the foundation for the model of learning on which our initial system is built.

4 Method

The philosophical groundwork of learning in general had to be examined prior to the implementation of our candidate expert recommender system. This gave us an intuitive base to work from, presenting a clear picture of what pieces would be necessary in order to build a prototype, and what future work could be done to improve upon it.

4.1 Philosophical

Here, we introduce a model of learning which is ultimately recursive in nature, and which represents the core philosophy at the foundation of our recommender system. This model is constructed from the interactions typically found in a conversation between an expert and a learner, such as those between a student and their academic advisor.

First, a question is formulated by the learner and asked of the expert, who then leverages their accumulated knowledge in order to answer the question and/or ask other questions which try to discern the intent of the learner. Next, the expert must align the question's language with the language the answer or answers are defined in. As a result, the learner is able to more properly formulate additional questions and the expert can be more helpful – and detailed – in their replies. Using the student/advisor interaction as an example, an advisor may be asked about a policy the student does not understand (such as the 13-credit policy, where students must file a petition in order to take fewer than 13 credits of coursework in a semester). The student's question might not be worded consistently with the name or content of the policy they are seeking to know more about, especially if it is technically worded for legal reasons. The advisor must then bridge this knowledge gap by trying to discover the intent of the
question and connect it to appropriate answers. Through conversation, the gap is further narrowed as the student's questions become more consistent with the language the policy is defined in.

Often, the advisor may have questions about the reason for the student's question ("Why do you want to file a petition?"), which may uncover additional information that can help the advisor in recommending the best course of action for the student. A student may not want to be asked why they want to take fewer than 13 credits, but it is the advisor's job to encourage a student's academic success, even if it means telling them to take a full credit load. This is a vital piece of our model: an expert is teaching the learner what they need to know, versus telling them what they want to know. This is the primary difference between the model presented in this work and those of other recommender systems.

Fig. 1. Philosophical Model of Learning (in the diagram, the Learner sends query words and a usage sequence to the Expert, who returns a list of recommendations)

Figure 1 is a visual representation of this model as demonstrated through the dynamic of an expert and a learner. Similar to the conversation between the student and their advisor, the model is recursive, a representation of learning at the meta level. We use this model as the foundation for our recommender system, which incorporates both learner intent and expert knowledge in order to generate a list of recommendations.

4.2 Technical

The different perspectives of the expert and the user are captured using different models built from the available data (described in more detail below). The problem then reduces to applying a set of ranking methods to the data representing expert and user opinions and finally merging them. The data contains the following descriptive elements:
• A set of concepts C = {C1, C2, C3, …}, where each document Di can be labeled by a subset of concepts from C
• A set of Web pages, S, generated by the content management system, which have a one-to-one mapping with the documents Di
• Web server usage logs

The views of an expert are captured explicitly by building a graph of Web pages connected by concepts, derived from a Web graph generated by a set of experts using a content management system. While our earlier work [25] concentrated on deriving expert knowledge from the content management system itself, in our current work we present a more generally-applicable case of dealing with a Web graph, thus removing the dependence on the content management system. The expert views are captured using this graph, and the expert opinion of the relative importance of any given Web page is captured using an ExpertRank. The navigational importance is captured from the Web graph by the StructureRank (referred to below as the GraphRank). The queries of the learner are captured by the UsageRank. By defining these three different kinds of rank, we are able to capture both what the learner is intending to learn and what the expert thinks the learner needs to know.

The most popular kind of information infrastructure for representing documents and their interconnections is a link graph. Such an information infrastructure can be modeled as a graph G(V, E), where V is a set of vertices representing units of information and E is a set of edges representing the interactions between them. For the scope of this paper, a vertex represents a single Web page and an edge represents a hyperlink between pages or a relation between two pages. As can be seen in Figure 2, two different graphs are constructed, each corresponding to a different type of edge relationship. In order to generate relevance ranks of nodes on these graphs, Google's PageRank [15] is used as a foundation due to its stability and success in the Web search domain. However, it should be noted that the best way to automatically capture an expert's advice on the whole set of documents, without an expert involved, remains an open issue of research.

ExpertRank: A concept graph is constructed from the Web graph. In this graph, a vertex is a single Web page and its edges correspond to its conceptual links with other Web pages. In the prototype recommender system, the concepts of a Web page are derived from the anchor text of the Web pages pointing to it. The concepts are represented by individual anchor text words and the information-retaining phrases that can be grown out of them. If two Web pages share a concept, but are not already explicitly linked, then an implicit link is introduced between them. Two pages are determined to share a concept if the intersection of the sets of concepts each represents is not empty. The sets of concepts themselves are determined by the anchor text of the pages pointing to them. On this constructed graph, PageRank is applied to obtain the importance ranking of the documents. Thus the rank of a document d is defined as:

ER(d) = α/N_D + (1 − α) · Σ_{(d′, d) ∈ E} ER(d′) / OutDeg(d′)    (1)

where d is a given Web page, the sum runs over all Web pages d′ that point to d, either by an explicit or an implicit link, N_D is the number of documents in the Web graph, and α is the dampening factor.
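To make the construction concrete, the following is a minimal Python sketch of the procedure described above: implicit links are added between pages whose concept sets intersect, and Equation (1) is then iterated to a fixed point. The function name, data structures, damping value, and iteration count are illustrative assumptions, not taken from the actual system.

```python
from collections import defaultdict

def expert_rank(pages, concepts, explicit_links, alpha=0.15, iters=50):
    """Rank pages on a concept graph built from explicit and implicit links.

    pages:          list of page identifiers (URLs)
    concepts:       dict mapping page -> set of concept strings (one entry per page)
    explicit_links: set of (source, target) hyperlink pairs
    """
    # Start from the explicit hyperlink structure.
    edges = set(explicit_links)

    # Add implicit links in both directions whenever two pages share at least
    # one concept (non-empty intersection of their concept sets); adding an
    # already-existing direction is a no-op because edges is a set.
    for u in pages:
        for v in pages:
            if u != v and concepts[u] & concepts[v]:
                edges.add((u, v))
                edges.add((v, u))

    out_deg = defaultdict(int)
    in_links = defaultdict(list)
    for u, v in edges:
        out_deg[u] += 1
        in_links[v].append(u)

    # Power iteration of Eq. (1): ER(d) = alpha/N_D + (1 - alpha) * sum ER(d')/OutDeg(d')
    n = len(pages)
    er = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        er = {
            d: alpha / n
            + (1 - alpha) * sum(er[d2] / out_deg[d2] for d2 in in_links[d])
            for d in pages
        }
    return er
```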
Fig. 2. Technical approach to obtain User Sensitive Expert Rank (USER). (Diagram: Web server logs feed an offline association rule generator producing usage confidence rules, UR; concept (expert) graph analysis and Web graph analysis over the documents D1–D5 and pages S1–S5, connected by explicit and implicit links, produce ER and PR; online, these are combined with the current query into the USER rank.)

GraphRank: The graph of the Web site is generated by the content management system. Here, the vertices represent individual Web pages and the edges represent the hyperlinks connecting them. Though the Web pages can be mapped to the document set, the edge set is different, primarily due to the difference in purpose and method of creating the edges. The structure graph contains explicit links that are created mainly for easy navigation. Applying PageRank to this graph gives a set of rankings indicating which Web pages are important in a very general sense. However, since the edges shared by two linked pages in the Web graph do not take context into account, this is not the optimal representation of the relationships between documents/pages defined by experts. This graph contains information about certain Web pages that are important, such as good hubs, but its edge set is vastly different from that of the concept graph and, hence, will not be reflected in the document rank in the same way. The PageRank of a page s is computed using the PageRank metric:

PR(s) = α/N_S + (1 − α) · Σ_{(s′, s) ∈ E} PR(s′) / OutDeg(s′)    (2)

where s is a given Web page, the sum runs over all Web pages s′ that point to s, either by an explicit or an implicit link, N_S is the number of documents in the Web graph, and α is the dampening factor.
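Equation (2) is the same fixed-point computation as Equation (1), applied to the explicit hyperlink graph alone. Assuming the expert_rank sketch above, the GraphRank values could be obtained by passing empty concept sets so that no implicit links are introduced; the URLs below are purely hypothetical.

```python
# Reusing the hypothetical expert_rank sketch above: with empty concept sets,
# no implicit links are added, so the iteration runs over the explicit Web
# graph only, which corresponds to Eq. (2).
pages = ["/advising", "/policies/13-credit", "/petitions"]   # hypothetical URLs
explicit_links = {("/advising", "/policies/13-credit"),
                  ("/policies/13-credit", "/petitions")}
graph_rank = expert_rank(pages, {p: set() for p in pages}, explicit_links)
```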
UsageRank: User interests are modeled based on user browsing patterns. User browsing patterns are mined and association rules are generated offline. For an online user browsing session, the current sequence of Web pages navigated by the user is considered the 'query' the user is interested in. Given this sequence and using the association rules, the set of pages the user would likely be interested in navigating to in the next step is generated. The generated list of Web pages is ranked according to the confidence metric of each sequential association rule. The UsageRank of a page is, however, dependent on the current browsing sequence of the user, and hence may have different values for different queries. Let us define a question, q, to be the sequence of pages that a user has browsed in the current session. Let A be the set of sequences {q→si} that have been logged from previous user sessions, where si is a URL that a previous user has traversed to after browsing the sequence of pages defined by q. Thus the UsageRank of the page si for a particular question q can be defined as:

UR(si) = confidence(q→si) = support(q→si) / support(q)    (3)

Thus the output is a set of Web pages, S, whose confidence metric is higher than a pre-defined threshold.

USER Rank: Once the sets of ranks are obtained, the task of obtaining an integrated score by combining the scores from each ranking metric intuitively translates into the integration of user interest with expert advice. These sets are merged to produce the final list of recommendations using the following function:

USER(s) = (p*(n1*UR(s)) + q*(n2*ER(s)) + r*(n3*PR(s))) / (p + q + r)    (4)

p, q, and r are floating point numbers between 0 and 1 used to weight their corresponding ranks. The weights p, q, and r can be varied, giving more (or less) preference to one set of rankings over another. n1, n2, and n3 are normalization values for each of the corresponding ranks. Since each of the ranking types has a different distribution of values, their results must be normalized prior to the application of p, q, and r. These values are pre-computed floating point numbers between 0 and 1, and can be thought of as distribution-normalizing share percentages. From the distribution of returned UR, ER, and PR scores, the average score value for each ranking type is calculated. These scores are used to compute each rank type's average "share" of the average total score (the sum of the average scores). Using these scores as "sample" values for each rank type and assuming each rank type contributes an equal proportion of the average total score, n1, n2, and n3 are computed. It should be noted that while a linear method of combining the rank scores is adopted, it is not claimed to be the best approach. The optimal combination of ranks obtained from different dimensions is an open research area, especially in this domain. Whether it is better to combine the ranks obtained from different models or to use a single combination model to obtain a unique rank is still to be examined. It is better to avoid a single weight that unduly overpowers the other parameters, and it seems unlikely that there is a perfect weight for each rank for all usage sequences. Thus, more elegant methods of weight balancing, such as dynamic weighting, are an important part of our future work.
However, the simple approach of linear combination of these ranks helps to begin the process of providing the user with recommendations reflecting what they need to know
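As a concrete reading of Equation (4), the sketch below combines the three per-URL score sets under weights p, q, and r. The equal-share computation of n1–n3 from the average scores is our interpretation of the normalization described above, and the function name and data layout are assumptions rather than the system's actual implementation.

```python
def user_rank(ur, er, pr, p=1.0, q=1.0, r=1.0):
    """Combine UsageRank (UR), ExpertRank (ER), and GraphRank (PR) per Eq. (4).

    ur, er, pr: dicts mapping candidate URL -> score under each ranking
    p, q, r:    weights giving more (or less) preference to each ranking
    """
    def mean(scores):
        return sum(scores.values()) / len(scores) if scores else 0.0

    # One plausible normalization: scale each rank type so that its average
    # score contributes an equal share of the average total score.
    averages = [mean(ur), mean(er), mean(pr)]
    target = sum(averages) / 3.0
    n1, n2, n3 = [(target / a) if a > 0 else 0.0 for a in averages]

    candidates = set(ur) | set(er) | set(pr)
    return {
        s: (p * n1 * ur.get(s, 0.0)
            + q * n2 * er.get(s, 0.0)
            + r * n3 * pr.get(s, 0.0)) / (p + q + r)
        for s in candidates
    }
```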
Fig. 3. System architecture of the CLASS Website Recommendation System

5 System Architecture

The recommender system architecture was designed to closely reflect a philosophical model of learning. In its initial implementation, the recommender system incorporates both the expert and learner portions of the learning model, as shown in Figure 3. However, analysis of queries from an expert to a learner is left for future work. The recommender system works by integrating the output of three offline analyses – the expert knowledge, the learner queries, and the PageRank of our Web graph – into a set of online recommendations determined by the user's current context.

Expert: As an introduction, the website data used to build and test the prototype recommender system was from the University of Minnesota's College of Liberal Arts Student Services (CLASS) website (http://www.class.umn.edu). The site is deployed and maintained through the use of an in-house content management system (CMS) called Crimson. Crimson distributes the workload of creating and maintaining websites to CLASS domain experts: the academic advisors. In this paper, conceptual information is derived from the Web graph and anchor tag text. For every Web page, we construct a term frequency vector using the anchor text from incoming links. Each term in this vector is then used as a seed for phrase growth, which builds an array of all possible left-to-right phrase candidates. These candidates are then pruned according to the following criteria (a sketch of this pruning follows the list):

• If a phrase occurs only once in the corpus, delete it.
• If a phrase is a stop word (using the MySQL 4.1 stop word list [26]), has a string length of 1, or is numeric, delete it.
• If a candidate phrase is a left-side perfect substring of a phrase one word longer than the candidate and their frequencies are equal, delete the candidate.
• If a candidate phrase is a right-side perfect substring of a phrase one word longer than the candidate and their frequencies are equal, delete the candidate.
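The following is a minimal Python sketch of the pruning criteria listed above, assuming the left-to-right phrase candidates have already been grown from anchor-text seeds and counted across the corpus; the function name and inputs are illustrative assumptions.

```python
def prune_phrases(freq, stop_words):
    """Keep only information-retaining phrase candidates.

    freq:       dict mapping candidate phrase (string) -> corpus frequency
    stop_words: set of stop words (e.g., the MySQL 4.1 stop word list)
    """
    kept = {}
    for phrase, count in freq.items():
        # Criteria 1-2: drop singletons, stop words, 1-character strings, numbers.
        if count <= 1:
            continue
        if phrase in stop_words or len(phrase) == 1 or phrase.isdigit():
            continue
        kept[phrase] = count

    pruned = dict(kept)
    for phrase, count in kept.items():
        words = phrase.split()
        for longer, longer_count in kept.items():
            longer_words = longer.split()
            # Criteria 3-4: drop a candidate that is a left- or right-side
            # perfect substring of a phrase one word longer with equal frequency.
            if len(longer_words) != len(words) + 1 or longer_count != count:
                continue
            if longer_words[:len(words)] == words or longer_words[-len(words):] == words:
                pruned.pop(phrase, None)
                break
    return pruned
```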
In this way, only information-retaining phrases are kept (i.e., a phrase and any phrase substring of it will have different frequency counts). From here, we create a directed graph where each vertex is a Web page and each edge is a link from one Web page to another along the previously-calculated concept links. This graph is then queried for any missing links between related Web pages (i.e., "implicit" links). Implicit links are formed when two Web pages share a concept but there are either 1 or 0 explicit links between them. If there are 0 links between them, then one implicit link is created for each direction (i.e., A→B and B→A) along the shared concept link. If there is one link between them already, then one implicit link is created for the direction that is missing. This is our concept graph. The concept graph is then ranked using a modified version of PageRank called ContentRank. The resulting ranked concept graph can then be queried by the recommender system for relevant content given the user's concept context.

Learner: The Learner portion begins by retrieving usage data from the CMS. This usage data is then cleaned to remove bad addresses (the CMS records all visit attempts, even if the Web addresses are invalid), server visits resulting from periodic refreshing of RSS newsfeed content, and spider/bot visits. For the remaining usage data, session id information is assigned using the IP address of the user and their visit timestamp. An integer indicating the numerical order in which each Web page was visited within its session is assigned as well, easing sequential association rule generation. Single-page user sessions are also deleted (i.e., the user visits one Web page on the site and then leaves). Finally, sequential association rules are generated from the cleaned, session-defined usage data according to the confidence metric. The resulting rules are stored as tuples in a rule database, easing rule recall for the recommender system.

USER System: The set of recommendations made to the learner while he/she browses the website is produced by determining the user's current context, creating a list of candidate recommendations, and merging their corresponding UsageRanks, ExpertRanks, and GraphRanks obtained from Web structure (the PageRank of a URL in the Web graph) to create a final list of recommendations.

To determine the learner's context, browsing history is used to query for all sequential association rules meeting the following conditions:

• The premise portion of the rule matches some ordered subset of the learner's browsing history
• The last URL in the premise portion of the rule is equal to the URL of the learner's current location

The rules returned from this query form the "usage recommendation set", the set of URLs retrieved from usage data. If we were making recommendations based on usage data alone, this is the set of URLs we would draw from – a set of predicted 'next' URLs for the learner to visit. These URLs are then resolved to their conceptual components, such as 'advising' and '13-credit policy'. We then query our concept graph using these concepts to return a set of URLs conceptually related to the set of 'next' URLs, including the set of 'next' URLs themselves (as they are obviously related to the learner's context). For each URL in this set, the URL's rank (from the concept graph) is normalized using its corresponding rule confidence. In this way