Decision Support Systems 51 (2011) 772–781
Journal homepage: www.elsevier.com/locate/dss
doi:10.1016/j.dss.2011.01.012

Collaborative user modeling for enhanced content filtering in recommender systems

Heung-Nam Kim a,⁎, Inay Ha b, Kee-Sung Lee b, Geun-Sik Jo b, Abdulmotaleb El-Saddik a

a School of Information Technology and Engineering, University of Ottawa, 800 King Edward, Ottawa, Ontario, K1N 6N5, Canada
b School of Computer and Information Engineering, Inha University, 253 Younghyun-dong, Nam-gu, Incheon (402–751), Korea

⁎ Corresponding author. Tel.: +1 613 562 5800x6248; fax: +1 613 562 5664.
E-mail addresses: nami4596@gmail.com (H.-N. Kim), inay@eslab.inha.ac.kr (I. Ha), lks@eslab.inha.ac.kr (K.-S. Lee), gsjo@inha.ac.kr (G.-S. Jo), abed@mcrlab.uottawa.ca (A. El-Saddik).

Article info

Available online 31 January 2011
Keywords: Collaborative user modeling; Recommender system; Personalization; Content-based user model

Abstract

Recommender systems, which have emerged in response to the problem of information overload, provide users with recommendations of content suited to their needs. To provide proper recommendations to users, personalized recommender systems require accurate user models of characteristics, preferences and needs. In this study, we propose a collaborative approach to user modeling for enhancing personalized recommendations to users. Our approach first discovers useful and meaningful user patterns, and then enriches the personal model with collaboration from other similar users. In order to evaluate the performance of our approach, we compare experimental results with those of a probabilistic learning model, a user model based on collaborative filtering approaches, and a vector space model. We present experimental results that show how our model performs better than existing alternatives.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

The prevalence of Web 2.0 technologies and services enables end-users to be producers as well as consumers of content. Even on a daily basis, an enormous amount of textual content, such as online news, research papers, blog articles, and wikis, is generated on the Web. It is getting more difficult to make automatic recommendations to a user related to his/her preferences, not only because of the huge amount of information but also because of the difficulty of automatically grasping his/her interests [7]. Recommender systems, which have emerged in response to the above challenges, provide users with recommendations of content suited to their needs. There are two widely used approaches among recommender systems: content-based filtering and collaborative filtering. The traditional task in collaborative filtering is to predict the utility of a certain item for the target user from the opinions of other similar users, and thereby make appropriate recommendations [21]. On the other hand, content-based filtering provides recommendations by comparing representations of content contained in an item to those of a user's interest content, ignoring opinions of other similar users [17]. Collaborative filtering has an advantage over content-based filtering in situations where it is hard to analyze the underlying content, e.g., music, videos, and photos. Because the collaborative filtering process is based only on historical information about whether or not a given target user has previously preferred an item, analysis of the actual content itself is not necessarily required.

Nevertheless, collaborative filtering suffers from a fundamental problem, namely the cold start problem, which can be divided into cold start items and cold start users [25]. Several researchers have offered proposals for addressing this problem [10,17,22,25]. In a collaborative filtering-based recommender system, an item cannot be recommended until a large number of users have previously rated it. This is known as a cold start item. This problem applies to new items generated every few minutes and can be partially alleviated by content-based technology. In the case of domains such as textual documents, content-based filtering has proven to be effective in locating textual content relevant to a specific information need [6,10]. However, content-based filtering also encounters limitations for a cold start user, similar to collaborative filtering. A cold start user is a new user who joins a recommender system and has presented few opinions (i.e., the user has insufficient preference history). In these situations, the system is generally unable to make high quality recommendations.

We address these issues by introducing a collaborative approach to user modeling for enhancing content filtering. Our goal is to build a robust user model that can be applied to personalized recommender systems. By capturing a user's content of interest, we can discover the preference patterns and terms existing in the user's content of interest. In addition, to partially overcome the cold start user problem, we propose an enrichment method of the personal model in collaboration with other similar users.

This paper presents three specific contributions toward user modeling in recommender systems. First, we propose a new method of building a user model, allowing understanding and filtering of the user's interests. We then present a method for collaborative enrichment of user interests in dealing with the cold start problem. Second, we propose how the individual model can be applied to personalized
recommendations relevant to the user's needs. We incorporate collaborative characteristics into a content-based approach. Third, we provide detailed experimental evaluations with real datasets and investigate how collaborative user models work in terms of improving the recommendation performance.

The subsequent sections are organized as follows: Section 2 summarizes previous studies related to user modeling and personalized recommendations. In Section 3, we describe the notations and method to build the initial user model. We then describe a collaborative approach for modeling user interests and recommending content in Section 4. Next, Section 5 describes the implemented system and interface. In Section 6, we present the effectiveness of our approach in terms of its performance. Finally, conclusions are presented and future work is discussed in Section 7.

2. Related work

In personalized recommender systems, two main approaches have been developed: a content-based filtering approach and a collaborative filtering approach. Following the proposal of GroupLens [21], the first system to generate automated recommendations, collaborative filtering approaches have seen the widest use in a large number of information filtering problems relating to such things as movies, books, music, online news, TV programs, and research papers. Despite their success and popularity, collaborative filtering approaches encounter several limitations, including the sparsity of the data, scalability, the cold start problem, and untrustworthy users. A number of researchers have addressed these problems using content-based filtering [4].

Content-based filtering methods, which are another well-known technique in recommender systems, have been developed using learning procedures. These procedures require training data to identify personal preferences (user model) from information objects and their content.
Webmate tracked documents of interest to the user and exploited the vector space model using the TF-IDF (Term Frequency–Inverse Document Frequency) method [6]. Schwab et al. [26] explored the use of a classification approach to recommend articles relevant to the user profile, such as NewsDude. In NewsDude, two types of user interests are used: short-term and long-term interests. To avoid recommendations of very similar documents, a short-term profile is used. For the long-term interests of a user, the probabilities of a document are calculated using a Naïve Bayes approach to classify a document as interesting or not. Instead of learning from users' explicit information, PVA [5] learned a user profile implicitly without user intervention. The user profile is represented as a keyword vector in the form of a hierarchical category structure. In Newsjunkie [11], a novelty-analysis algorithm is employed to present novel information for users by identifying the novelty of articles in the contexts of articles they previously reviewed. Lihua et al. [14] proposed a method of modeling multiple user interests by using a self-organizing map neural network with a changeable network structure. SiteIF [15] proposed using word sense-based document representation to build a model of the user's interests. A filtering procedure was employed to dynamically predict new documents based on a semantic network.

Collaborative and content-based filtering methods have unique advantages and disadvantages. Therefore, some studies combine these techniques in developing hybrid recommender systems [4]. Berkovsky et al. [22] presented a method of user modeling data integration for the purposes of a specific recommendation task, referred to as the mediation of a user model. By importing and integrating data collected from other recommender systems, four types of user model mediation are presented: cross-user, cross-item, cross-context, and cross-representation.
In [3], the same authors presented mediated user models that are transformed from collaborative filtering to content-based recommender systems. In [8], a content-collaborative hybrid recommender system is proposed that exploits WordNet-based user profiles to capture the semantics of user interests. Similar to our approach, the authors generated the neighborhood of a user through content-based methods. Melville et al. [17] followed a two-stage approach. First, they applied a naive Bayesian classifier as a content-based predictor to complete the rating matrix, and then they re-estimated ratings from this full rating matrix by collaborative filtering. CinemaScreen [22] reversed the stages: it executed content-based filtering on a result set generated through collaborative filtering.

Although the above-mentioned studies combine collaborative and content-based filtering approaches to exploit the benefits of each and lessen the disadvantages, our approach takes a different stance. Differing from earlier work, we automatically identify meaningful or useful patterns in building a user model. In addition, rather than utilizing explicit user feedback such as numeric ratings assigned to content, our aim is to build a robust user model implicitly inferred by the system from observing user behavior. Through the identification of useful patterns of a user in collaboration with other similar users, we discover content relevant to the user's needs.

3. Building a personal user model

The capability to learn users' preferences is at the heart of a personalized recommender system. In order to provide proper recommendations to users, personalized recommender systems require user models of characteristics, preferences, and needs. This information is typically referred to in the literature as a User Model (UM) [3].
Additionally, since every user can have different interests, feature selection for representing users' interests should be personalized and performed individually for each user [16]. In this section, we describe our approach to building a personal user model that is driven by the user's content of interest.

Before going into further detail, the notation and definitions required for understanding our approach are introduced. Let C = {c1, c2, …, cn} be the set of all content, T = {t1, t2, …, tm} be the set of all index terms, and U = {u1, u2, …, ul} be the set of distinct users. The content cj is a set of terms, each of which may appear in multiple content with different weights that quantify the importance of the term for describing the content. In our study, a weight wi,j associated with a pair (ti, cj) (i.e., a term ti of a content cj) is computed by a fairly common type of TF-IDF weighting scheme [23].

To build a personal user model, potentially representative of user interests, we initially need some information given by the user, called user feedback. The most common ways to obtain feedback are to use information given explicitly or to get information observed implicitly from the user's interaction [19]. Explicit feedback requires a user to evaluate content and indicate how relevant or interesting specific content is to him/her using like/dislike (a binary scale) or numerical ratings. Even though explicit feedback helps us to capture user preferences accurately, there is a serious drawback in that users do not tend to provide enough feedback. Users are generally not motivated to provide their feedback if they do not receive immediate benefits, even when they would profit in the long term [19]. Therefore, in our study, we take implicit feedback into consideration in the sense that the system automatically infers the user's preferences from the user's behaviors [2,7,19].
In general, the preference indicator of implicit Table 1 After mining user u's content of interest, five personalized term patterns are found. Pattern-id PTP PS Length p1 {t1, t2, t3} 0.56 3 p2 {t1, t2, t3, t4} 0.51 4 p3 {t1, t2, t5} 0.47 3 p4 {t4, t5} 0.41 2 p5 {t2, t3, t4} 0.32 3 H.-N. Kim et al. / Decision Support Systems 51 (2011) 772–781 773
H-N. Kim et aL / Decision Support Systems 51(2011)772-781 edback can be represented as a form of a co-occurrence pair (un, c)), a given pattern Pk E Fu the pattern weight of pk for user u, denoted here u, Uis a user and cE is specific content. The co-occurrence PWu(Pk), is computed by air implies that user un viewed, clicked, collected or bookmarked content c. While implicit feedback on specific content by a user does not necessarily mean that he/ she likes the content, we assume that (3) the co-occurrence pairs of the user are his/her interest content, implicitly where H u is the mean weight for term ti in Iu and is computed as 3. 1. Modeling user interests by text mining Our approach to modeling user interests mainly consists of three Tx2间0 steps: extracting terms, mining frequent patterns, and pruning patterns In this section, we present the steps to initially build a personal user where lu()is the set of interest content for user u containing term ti model in detail and wiy is the weight of term t in content c For any pattern Px in Fu The first step in user modeling is the extraction of the terms from we determine the patterns for which the pattern weight is greater than Interest hat have b reprocessed by removing the minimum pattern weight, min_pw, and model user preferences words aming words After extracting terms. based on the identified patterns, collectively called a Personalized C is represented as a vector of attribute-value pairs Term Pattern. In addition, terms that appear within personalized term as follows. called personalized terms 5={(w)(2m2)…(m,wn) (1) Definition 2(Personalized Term Pattem, PIP) A personalized term patten is defined as a frequent term pattern for where ty is the extracted term in c and wi is the weight of t in c Wi is which the pattern weight is greater than the minimum pattern weight computed by the static TF-iDF term-weighting scheme [23] and min_pw, i.e. Pk E Fu and PW(px)>min_pw. a set of personalized term defined as follows. 
feedback can be represented as a co-occurrence pair (uh, cj), where uh ∈ U is a user and cj ∈ C is a specific content item. The co-occurrence pair implies that user uh viewed, clicked, collected, or bookmarked content cj. While implicit feedback on specific content does not necessarily mean that the user likes the content, we assume that the co-occurrence pairs of a user implicitly identify his/her interest content.

3.1. Modeling user interests by text mining

Our approach to modeling user interests consists of three main steps: extracting terms, mining frequent patterns, and pruning patterns. In this section, we present in detail the steps used to build the initial personal user model.

The first step in user modeling is the extraction of terms from interest content that has been preprocessed by removing stop words and stemming words [20]. After extracting terms, each interest content cj is represented as a vector of attribute-value pairs as follows:

$$c_j = \{(t_{1,j}, w_{1,j}), (t_{2,j}, w_{2,j}), \ldots, (t_{m,j}, w_{m,j})\} \qquad (1)$$

where ti,j is the i-th term extracted from cj and wi,j is the weight of ti in cj. The weight wi,j is computed by the static TF-IDF term-weighting scheme [23] and is defined as follows:

$$w_{i,j} = \frac{f_{i,j}}{\max_{l} f_{l,j}} \times \log\frac{n}{n_i} \qquad (2)$$

where fi,j is the frequency of occurrence of term ti in content cj, n is the total number of content pieces in the collection, and ni is the number of content pieces in which term ti occurs. The weight indicates the importance of a term in representing the content.

The second step is to mine frequent term patterns from the interest content of each user. Since every user has different interests, the content used for the mining process must be selected individually for each user. A frequent pattern is a set of terms that appear frequently together in a user's set of interest content.
For example, if a set of terms {recommendation, collaborative, personalization, filtering} appears frequently together in a user's set of interest content, that set of terms is a frequent pattern for the user. In the data mining research literature, frequent patterns are typically defined as patterns that occur at least as frequently as a predetermined minimum support (min_sup) [12]. In our study, we apply the mining process under the following assumption: each transaction corresponds to one interest content of a user, the items in a transaction are the terms extracted from that content, and the transaction database corresponds to the user's set of interest content. Therefore, if the pattern support (Definition 1) of a pattern pk composed of at least l (l ≥ 2) different terms is above min_sup, i.e., PSu(pk) > min_sup, then pk is referred to as a frequent term pattern. We denote the set of frequent term patterns for user u as Fu.

Definition 1 (Pattern Support, PS)
Let Iu be user u's set of interest content and pattern pk = {t1, t2, ..., tn} be a set of terms such that pk ⊆ T and n ≥ 2. A content piece cj is said to contain pattern pk if and only if pk ⊆ cj. Pattern support for pattern pk in Iu, written as PSu(pk), is the ratio of content in Iu that contains pattern pk. That is, PSu(pk) = fu(pk)/|Iu|, where fu(pk) indicates the occurrence frequency of pattern pk in Iu.

Once the frequent patterns are mined, in the third step we remove the patterns containing unnecessary terms from the set of frequent term patterns. To this end, we define the importance of the terms in representing a certain pattern, called the pattern weight.
Formally, for a given pattern pk ∈ Fu, the pattern weight of pk for user u, denoted as PWu(pk), is computed by:

$$PW_u(p_k) = \frac{1}{|p_k|} \cdot \sum_{i \in p_k} \mu_{i,u} \qquad (3)$$

where μi,u is the mean weight for term ti in Iu and is computed as follows:

$$\mu_{i,u} = \frac{1}{|I_u(i)|} \times \sum_{j \in I_u(i)} w_{i,j} \qquad (4)$$

where Iu(i) is the set of interest content for user u containing term ti and wi,j is the weight of term ti in content cj. Among the patterns in Fu, we keep those whose pattern weight is greater than the minimum pattern weight, min_pw, and model user preferences based on these patterns, each of which is called a Personalized Term Pattern. In addition, terms that appear within personalized term patterns are called Personalized Terms.

Definition 2 (Personalized Term Pattern, PTP)
A personalized term pattern is defined as a frequent term pattern for which the pattern weight is greater than the minimum pattern weight min_pw, i.e., pk ∈ Fu and PWu(pk) > min_pw. The set of personalized term patterns for user u is denoted as PTPu such that PTPu = {(pk, PSu(pk)) | PWu(pk) > min_pw ∧ pk ∈ Fu}.

Definition 3 (Personalized Term, PT)
A personalized term is a term that occurs within personalized term patterns. The set of personalized terms for user u is denoted as PTu. In addition, the vector for PTu is represented by $\overrightarrow{PT_u} = (\mu_{1,u}, \mu_{2,u}, \ldots, \mu_{t,u})$, where t is the total number of personalized terms and μi,u is the mean weight for term ti, which is computed by Eq. (4).

The formal description of the model for user u, Mu, is as follows: Mu = ⟨PTPu, PTu⟩, where PTPu models the interest patterns (Definition 2) and PTu models the interest terms (Definition 3). The model is stored in a prefix tree structure, inspired by the frequent-pattern tree (FP-tree) [12], in order to save memory space, explore relationships between terms, and efficiently retrieve the PTPs that contain given PTs.
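The three steps above (term weighting, frequent pattern mining, and pruning by pattern weight) can be illustrated with a short Python sketch. It is not the authors' implementation: the function names, the natural logarithm in Eq. (2), the `max_len` cap, the toy collection, and the brute-force enumeration of term combinations (used here in place of an FP-growth style miner) are all assumptions made for brevity.

```python
import math
from itertools import combinations


def tfidf_weights(collection):
    """Eq. (2): w_ij = (f_ij / max_l f_lj) * log(n / n_i), one dict per content."""
    n = len(collection)
    doc_freq = {}
    for terms in collection:
        for term in set(terms):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    weights = []
    for terms in collection:
        freq = {}
        for term in terms:
            freq[term] = freq.get(term, 0) + 1
        max_freq = max(freq.values(), default=1)
        weights.append({t: (f / max_freq) * math.log(n / doc_freq[t])
                        for t, f in freq.items()})
    return weights


def build_initial_model(collection, interest_ids, min_sup, min_pw, max_len=3):
    """Return (PTP_u, PT_u) for the user whose interest content is interest_ids."""
    weights = tfidf_weights(collection)
    interest = [set(collection[j]) for j in interest_ids]

    # Eq. (4): mean weight of each term over the interest content containing it.
    sums, counts = {}, {}
    for j in interest_ids:
        for term, w in weights[j].items():
            sums[term] = sums.get(term, 0.0) + w
            counts[term] = counts.get(term, 0) + 1
    mu = {term: sums[term] / counts[term] for term in sums}

    ptp = {}
    for size in range(2, max_len + 1):  # max_len is an assumed cap, not in the paper
        for pattern in combinations(sorted(mu), size):
            # Definition 1: share of interest content containing the pattern.
            support = sum(doc.issuperset(pattern) for doc in interest) / len(interest)
            if support <= min_sup:
                continue  # not a frequent term pattern
            # Eq. (3): mean of the member terms' mean weights.
            weight = sum(mu[t] for t in pattern) / size
            if weight > min_pw:  # Definition 2
                ptp[frozenset(pattern)] = support
    pt = {t: mu[t] for pattern in ptp for t in pattern}  # Definition 3
    return ptp, pt


# Hypothetical toy collection: the user collected the first three content pieces.
collection = [
    ["filter", "recommend", "user"],
    ["filter", "recommend", "model"],
    ["filter", "recommend"],
    ["graph", "network"],
    ["graph", "model"],
]
ptp_u, pt_u = build_initial_model(collection, [0, 1, 2], min_sup=0.5, min_pw=0.3)
```

In this toy run only {filter, recommend} survives both thresholds, because it is the only term set contained in more than half of the user's interest content. Enumerating all combinations is exponential in the number of terms, so a real system would presumably rely on a dedicated frequent pattern miner.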
For example, if five personalized term patterns are found after mining the interest content of user u, as shown in Table 1, the tree structure of the model for user u is constructed as follows. All PTu are stored in the header table and sorted in descending order of term frequency, since this ordering gives a better chance that prefix terms can be shared [12]. First, we create the root of the tree, labeled "null". The first term pattern, {t1, t2, t3}, is inserted into the tree as a path from the root node, where t2 is linked as the child of the root, t1 is linked to t2, and t3 is linked to t1. The PS and length of the pattern (PS(p1) = 0.56, length = 3) are then attached to the last node t3. The nodes linked together in the path imply that the terms contained in the pattern co-occur frequently in the user's interest content. For the second pattern, since its term pattern, {t1, t2, t3, t4}, shares the common prefix {t2, t1, t3} with the existing path for the first term pattern, a new node t4 is created and linked as a child of node t3. Thereafter, PS(p2) and length(p2) are attached to the last node t4. The third, fourth, and fifth patterns are inserted in a manner similar to the first and second patterns. To facilitate tree traversal, a header table is built in which each term points to its occurrence in the tree via a node-link. Nodes with the same term name are linked in sequence via such node-links. Finally, the model for user u is constructed as shown in Fig. 1. Note that the resulting tree is a compact data structure that represents all interest patterns and terms of user u by sharing personalized terms among the personalized patterns.
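The construction just described can be sketched as a small prefix tree with a header table and node-links. This is an illustrative reading of the text rather than the authors' data structure: the class and method names are hypothetical, the header order is passed in explicitly, and apart from PS(p1) = 0.56 the support values used below are made up for the example.

```python
class Node:
    def __init__(self, term, parent=None):
        self.term = term
        self.parent = parent
        self.children = {}     # term -> child Node
        self.support = None    # PS, set only on the last node of a stored pattern
        self.length = None     # pattern length, set together with support
        self.next_link = None  # node-link to the next node with the same term


class PatternTree:
    def __init__(self, term_order):
        # term_order: header table order (descending term frequency in the paper).
        self.root = Node(None)
        self.rank = {term: i for i, term in enumerate(term_order)}
        self.header = {term: None for term in term_order}  # term -> first node
        self._tail = {}                                     # term -> last linked node

    def insert(self, pattern, support):
        """Insert a pattern as a path whose terms follow the header order."""
        node = self.root
        for term in sorted(pattern, key=self.rank.__getitem__):
            child = node.children.get(term)
            if child is None:
                child = Node(term, node)
                node.children[term] = child
                self._link(child)
            node = child
        node.support, node.length = support, len(pattern)

    def _link(self, node):
        tail = self._tail.get(node.term)
        if tail is None:
            self.header[node.term] = node
        else:
            tail.next_link = node
        self._tail[node.term] = node

    def patterns_with(self, term):
        """Yield (path, support) for every stored pattern that contains term."""
        node = self.header.get(term)
        while node is not None:
            stack = [node]  # patterns ending at this node or below contain the term
            while stack:
                current = stack.pop()
                if current.support is not None:
                    yield self._path(current), current.support
                stack.extend(current.children.values())
            node = node.next_link

    @staticmethod
    def _path(node):
        terms = []
        while node.parent is not None:
            terms.append(node.term)
            node = node.parent
        return tuple(reversed(terms))


tree = PatternTree(["t2", "t1", "t3", "t4", "t5"])
tree.insert({"t1", "t2", "t3"}, 0.56)        # p1, as in the text
tree.insert({"t1", "t2", "t3", "t4"}, 0.51)  # p2, support value assumed
tree.insert({"t4", "t5"}, 0.41)              # a further pattern, assumed
```

Following the node-links for t4 visits both t4 nodes (one under t3, one under the root), which is how patterns sharing a given personalized term can be retrieved without scanning the whole tree.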
4. Collaborative user modeling for content filtering

In this section we describe how to enrich the model for a specific user. The model Mu described in Section 3 is referred to as the initial user model for user u. This model can be applied immediately to generate content recommendations. However, diverse patterns for user u cannot be discovered via the mining process when the user has only a small amount of interest content. Such a user is known as a cold start user. In this situation, the initial personalized term patterns may not be sufficient to represent user preferences, and thus our approach is generally unable to make high quality recommendations. In addition, when we use only the initial model for recommendations, it is hard to recommend novel content of value beyond the user's usual set. For these reasons, we propose a method that enriches the user model via the personalized term patterns of like-minded users.

4.1. Content-based neighborhood formation

The main goal of neighborhood formation is to identify a set of user neighbors, the k nearest neighbors, defined as a group of users exhibiting interest terms similar to those of the target user. A typical collaborative filtering recommender system encounters a serious limitation in finding such a set of users, namely the sparsity problem [8,17]. The sparsity problem occurs when the available data is insufficient to identify similar users (neighbors) due to the immense amount of content. In practice, even when users are very active, the content they have rated is only a small proportion of the total content. Accordingly, it is often the case that a pair of users has nothing in common, and hence the similarity cannot be computed. Even when the computation of similarity is possible, it may not be very reliable, because insufficient information is processed. To address this, in our study, we select the best neighbors by using the personalized terms, PT, of each user.
In order to find the k nearest neighbors, the cosine similarity, which quantifies the similarity of a pair of vectors according to their angle, is employed to measure the similarity between a target user and every other user. As noted in Definition 3, the personalized terms of a pair of users, u and v, are represented as t-dimensional vectors, $\overrightarrow{PT_u}$ and $\overrightarrow{PT_v}$, respectively. Therefore, the similarity between a pair of users, u and v, is measured by Eq. (5).

$$sim(u, v) = \cos\left(\overrightarrow{PT_u}, \overrightarrow{PT_v}\right) = \frac{\sum_{k=1}^{t} \mu_{k,u} \times \mu_{k,v}}{\sqrt{\sum_{k=1}^{t} \mu_{k,u}^{2}} \times \sqrt{\sum_{k=1}^{t} \mu_{k,v}^{2}}} \qquad (5)$$

The similarity score between a pair of users is in the range [0, 1], and the higher a user's score, the more similar he/she is to the target user. After computing the all-to-all similarity between users, we define the set of nearest neighbors of each user u as an ordered list of k users N(u) = {v1, v2, …, vk} such that u ∉ N(u), sim(u,v1) is the maximum, sim(u,v2) is the next maximum, etc. [24].

4.2. Collaborative enrichment of user interests

Once we have identified the set of nearest neighbors for a certain user u, his/her initial model Mu = ⟨PTPu, PTu⟩ is enriched from the neighbors. The basic idea of enriching the model of user u starts from the assumption that the user is likely to prefer similar patterns that have been discovered from neighbors with similar tastes. The patterns discovered from more similar users contribute more to enriching the model of the target user. For example, if a pattern such as {personalization, recommender} frequently appears in the interest content of a user, he/she might also be interested in a pattern such as {personalization, recommender, collaborative, filtering} that frequently appears in the interest content of users similar to him/her.
This enrichment process is particularly effective for users whose user model contains few or no interest terms and patterns, such as cold start users.

We elaborate on the general idea of the enrichment process in the following. Let N(u) = {v1, v2, …, vk} be the sorted neighbor list of target user u, PTPu be the set of personalized term patterns for user u, and PTPv, v ∈ N(u), be the set of personalized term patterns for neighbor user v of user u. First, we choose neighbor users v in descending order of similarity to target user u. For each pattern pi in PTPu, the specific patterns of pi in PTPv are identified. Given two patterns pi and pj, pi is said to be a general pattern of pj if and only if pi is a subset of pj, i.e., pi ⊂ pj. Conversely, pj is said to be a specific pattern of pi. For example, let p1 = {t4, t5} be a personalized term pattern for user u such that p1 ∈ PTPu, and PTPv = {p2, p3, p4, p5} be the set of PTPs for user v such that p2 = {t2, t4, t5}, p3 = {t4, t5, t8}, p4 = {t2, t4, t5, t7}, and p5 = {t7, t8}, as shown in Fig. 2. Since patterns p2, p3, and p4 contain all the terms of pattern p1, they are specific patterns of p1. Several specific patterns may thus be found among the PTPs of neighbor v, PTPv. For efficient enrichment, we only consider specific patterns that have higher pattern support than that of the general pattern. Assume that the pattern support for p1, p2, p3, and p4 is 0.41, 0.5, 0.47, and 0.35, respectively (i.e., PSu(p1) = 0.41, PSv(p2) = 0.5, PSv(p3) = 0.47, and PSv(p4) = 0.35). In this case, only patterns p2 and p3 are used for enriching the model of user u, provided that they are not already PTPs for user u, as can be seen in Fig. 3. Patterns such as p2 and p3 are called Collaborative Term Patterns (CTPs) for target user u. The model for user u enriched by neighbor user v is built as shown in Fig. 4.
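The neighbor selection of Eq. (5) and the enrichment step for a single neighbor can be sketched as follows, using the worked example above. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, user models are plain dictionaries, terms missing from a user's PT vector are treated as weight 0, and the support of p5 (0.31) is an assumed value since the text does not state it.

```python
import math


def cosine(pt_u, pt_v):
    """Eq. (5) on sparse vectors: term -> mean weight, missing terms count as 0."""
    dot = sum(w * pt_v.get(term, 0.0) for term, w in pt_u.items())
    norm_u = math.sqrt(sum(w * w for w in pt_u.values()))
    norm_v = math.sqrt(sum(w * w for w in pt_v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for users without personalized terms
    return dot / (norm_u * norm_v)


def nearest_neighbors(target, pt_by_user, k):
    """Return the k users most similar to target as (user, similarity) pairs."""
    scored = [(cosine(pt_by_user[target], pt), user)
              for user, pt in pt_by_user.items() if user != target]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(user, sim) for sim, user in scored[:k]]


def collaborative_patterns(ptp_u, ptp_v):
    """Neighbor patterns that are specific patterns of some p_i in PTP_u,
    are not already in PTP_u, and satisfy PS_u(p_i) <= PS_v(p_j)."""
    ctp = {}
    for p_i, ps_i in ptp_u.items():
        for p_j, ps_j in ptp_v.items():
            if p_i < p_j and p_j not in ptp_u and ps_i <= ps_j:
                ctp[p_j] = ps_j
    return ctp


# Worked example from the text (p5's support is an assumption).
ptp_u = {frozenset({"t4", "t5"}): 0.41}              # p1
ptp_v = {
    frozenset({"t2", "t4", "t5"}): 0.5,              # p2
    frozenset({"t4", "t5", "t8"}): 0.47,             # p3
    frozenset({"t2", "t4", "t5", "t7"}): 0.35,       # p4
    frozenset({"t7", "t8"}): 0.31,                   # p5
}
ctp_u = collaborative_patterns(ptp_u, ptp_v)
```

As in the text, p2 and p3 are selected: p4 is a specific pattern of p1 but its support (0.35) is below PSu(p1) = 0.41, and p5 does not contain p1 at all.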
Finally, a set of collaborative patterns is identified from the k nearest neighbors with respect to target user u. Note that the collaborative term patterns for the target user are not allowed to be redundant. That is, if patterns that were previously enriched by neighbor v are also discovered from another neighbor h such that sim(u,v) ≥ sim(u,h), for v ≠ h, those patterns are pruned.

Fig. 1. A tree structure of Mu for personalized term patterns in Table 1.

Fig. 2. Initial model for user v who is a neighbor of target user u.

The enriched model for user u is defined as a triple M+u = ⟨PTPu, CTPu, PT+u⟩, where PTPu is the set of personalized term patterns for user u, CTPu
is the set of collaborative term patterns for user u, and PT+u is the set of interest terms that occur within either the personalized patterns or the collaborative patterns. In the enriched model M+u, PTPu models the interest patterns of user u, whereas CTPu models the interest patterns enriched by the neighbors of user u.

Definition 4 (Collaborative Term Pattern, CTP)
Let pi be a personalized term pattern for target user u, pi ∈ PTPu, and pj be a personalized term pattern for neighbor v such that pj ∈ PTPv and v ∈ N(u). We define the set of collaborative term patterns for user u, denoted as CTPu, as the set of neighbor patterns pj such that pi ⊂ pj, pj ∉ PTPu, and PSu(pi) ≤ PSv(pj).

4.3. Personalized content recommendation

After the model is enriched, we are ready to provide recommendations for new content that a user has not previously read. Based on the enriched model for each user, we recommend to the user the top-N ranked content that he/she might be interested in reading. To this end, the most important task in personalized recommendation is to generate a prediction, that is, an estimate of how much a certain user would prefer unseen content. In our study, we consider matched patterns, that is, how many interest patterns in a user model are contained in the new content. Formally, the numeric score of the target user u for the content cn, denoted as Pu,n, is obtained as follows:

$$P_{u,n} = \frac{\sum_{p_k \in (PTP_u \cup CTP_u)} B_n^{p_k}}{N_u} \cdot \frac{\sum_{p_k \in (PTP_u \cup CTP_u)} |p_k| \times \omega_u^{p_k} \times B_n^{p_k}}{\sum_{p_k \in (PTP_u \cup CTP_u)} \omega_u^{p_k}} \qquad (6)$$

where Nu is the total number of patterns in both PTPu and CTPu, and $B_n^{p_k}$ is a binary variable indicating whether or not pattern pk occurs in content cn.
That is, $B_n^{p_k}$ is 1 if pattern pk appears in content cn and 0 otherwise, and $\omega_u^{p_k}$ represents the weighted pattern support of pk for user u, which is given by:

$$\omega_u^{p_k} = \begin{cases} PS_u(p_k) & \text{if } p_k \in PTP_u \\ PS_v(p_k) \times sim(u, v) & \text{if } p_k \in CTP_u,\ p_k \in PTP_v \end{cases} \qquad (7)$$

The main concept of prediction is that the interest patterns in the model of the target user are a good estimate of the user's preference for the selected content. The more patterns from the model the content contains, the higher the rank it obtains. This scheme can also make recommendations for new content added regularly to the system, which addresses what is known as the new item problem in collaborative filtering [1], and can support serendipitous recommendations [13]. Recommender systems relying exclusively on a user's interest content can only recommend content highly related to that which the user has previously selected. It is hard for them to recommend novel content that is different from anything the user has previously read. This is known as the overspecialization problem [1,22]. In our approach, by utilizing the enriched patterns from neighbors with similar tastes, content that contains collaborative (enriched) patterns valuable to the target user can obtain a higher rank in the recommended set, even though those patterns are not directly discovered from the user's interest content.

Once the predictions for content that the target user has not previously read are computed, the content is sorted in descending order of predicted value Pu,n. Finally, the set of N ordered content elements with the highest values is identified for user u. This is the set of content recommended to user u (top-N recommendation).

Definition 5 (Top-N recommendation)
Let C be the set of all content, Xu be the content list that user u has previously collected or added to his/her preference list (interest content), and Yu be the content list not previously read by user u, so that Yu = C − Xu and Xu ∩ Yu = ∅.
Given a pair of content elements ci and cj, ci ∈ Yu and cj ∈ Yu, content ci will be of more interest to user u than content cj if and only if the prediction score Pu,i of the target user u for content ci is higher than that for content cj, i.e., Pu,i > Pu,j. Top-N recommendation for user u identifies an ordered set of N content elements, TopNu, that will be of interest to user u such that |TopNu| ≤ N, TopNu ∩ Xu = ∅, and TopNu ⊆ Yu.

Fig. 3. Specific patterns, general pattern, and enriched patterns. (Panels: specific patterns in neighbor user; general pattern in target user; enriched patterns for target user.)

Fig. 4. Enriched user u model, M+u, by neighbor user v.
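Eqs. (6) and (7) and the top-N selection of Definition 5 can be sketched in a few lines of Python. The sketch below is an illustration only: the function names, the dictionary-based representation, the similarity value 0.8, and the candidate content are hypothetical, and dropping content with a score of 0 from the ranked list is an added assumption rather than something the paper specifies.

```python
def weighted_supports(ptp_u, ctp_u):
    """Eq. (7). ptp_u: pattern -> PS_u; ctp_u: pattern -> (PS_v, sim(u, v))."""
    omega = dict(ptp_u)
    for pattern, (ps_v, sim_uv) in ctp_u.items():
        omega[pattern] = ps_v * sim_uv
    return omega


def predict(content_terms, omega):
    """Eq. (6): share of matched patterns times their length-weighted support."""
    terms = set(content_terms)
    total = sum(omega.values())
    if total == 0:
        return 0.0  # empty model: nothing to match
    matched = [p for p in omega if p <= terms]           # patterns with B = 1
    coverage = len(matched) / len(omega)                 # sum(B) / N_u
    weighted = sum(len(p) * omega[p] for p in matched) / total
    return coverage * weighted


def top_n(unseen, omega, n):
    """Rank unseen content (id -> terms) by descending prediction score."""
    scored = [(predict(terms, omega), cid) for cid, terms in unseen.items()]
    scored = [(s, cid) for s, cid in scored if s > 0]    # assumption: skip zero scores
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cid for _, cid in scored[:n]]


# Patterns from the enrichment example; sim(u, v) = 0.8 is a made-up value.
ptp_u = {frozenset({"t4", "t5"}): 0.41}
ctp_u = {
    frozenset({"t2", "t4", "t5"}): (0.5, 0.8),
    frozenset({"t4", "t5", "t8"}): (0.47, 0.8),
}
omega_u = weighted_supports(ptp_u, ctp_u)
unseen = {
    "c1": ["t2", "t4", "t5", "t9"],  # matches p1 and the collaborative pattern p2
    "c2": ["t4", "t5"],              # matches p1 only
    "c3": ["t1"],                    # matches nothing
}
recommended = top_n(unseen, omega_u, 3)
```

Here c1 outranks c2 because it also contains a collaborative pattern, which illustrates how enriched patterns can lift content that the user's own patterns would rank lower.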
5. System implementation

Based on the requirements defined in Sections 3 and 4, we developed a prototype system to support personalized content recommendations, named PRCUM (Personalized Recommendations via Collaborative User Model). The PRCUM system is divided into four main types of tasks: (a) observing relevance feedback of a given user, (b) modeling user interests from observed content, (c) enriching user interests from nearest neighbors, and (d) generating content recommendations for a given user. An overall system process for personalized content recommendations is shown in Fig. 5.

PRCUM first requires the user to sign in with his/her username and password; it then allows the user to add content to a preference list and monitors the user's browsing inside the system. Because PRCUM cannot make recommendations to the user before building the individual model, it delays recommendations until the model is of a sufficient size and has been successfully built. The user can adjust the desired model parameters, such as the minimum support (min_sup), the minimum pattern weight (min_pw) and the number of nearest neighbors (k). Once the model has been built, PRCUM allows the user to enter his/her personalized pages and proposes a list of recommended content. The GUI of PRCUM is implemented in C#, and the server side is implemented using MySQL 5.0 and PHP 5.2 in an Apache 2.2 environment.

The GUI mainly consists of four frames: a menu frame, a favorite frame, a recommendation frame and a main frame. By interacting with the menu frame, users can choose the functions of PRCUM rendered by the main frame. As one of the principal functions in PRCUM, the recommendation frame provides a list of recommended content, a list of nearest neighbors, and recently added interest content. The favorite frame is used for jumping to content in favorites previously registered in PRCUM.
Users can maximize (display) or minimize (hide) the recommendation frame and the favorite frame according to their preference. Fig. 6 shows a snapshot of the user interface for the PRCUM system.

6. Experimental evaluation

In this section, we empirically evaluate the proposed approach and compare its performance against that of the benchmark algorithms. All experiments were performed on a dual Xeon 3.0 GHz computer with 2.5 GB of RAM running MS Windows Server 2003.

6.1. Datasets

We use two test datasets for our comparative experiments. The first dataset is taken from NSF (National Science Foundation) research award abstracts [20]. The original dataset is too large to be used in practice, and thus we selected award abstracts with topics highly related to computer science. The selected dataset contains 974 unique abstracts (i.e., content), from which 9823 unique terms were obtained. In addition, we collected 9845 preference histories (i.e., interest content of users) from 78 users. We refer to this dataset as NSF.

The second dataset comes from MovieLens, a web-based research recommendation system (www.movielens.org). The original dataset does not contain any information about movie content, and thus we extracted the textual descriptions (i.e., genres, keywords, summary) for each movie from the IMDb database (www.imdb.com). Although the dataset contains numerical ratings, we did not use the rating values directly and instead binarized them as follows: if a user's rating of a movie is larger than his/her average rating, we set the rating to 1 (i.e., an interest movie of the user), and to 0 otherwise. Thereafter, we removed users who had fewer than 20 ratings. The binarized dataset consists of 50,318 ratings on 1682 movies from 658 users. We refer to this dataset as MLens. Table 2 briefly describes our datasets.

6.2. Evaluation design and metrics

To evaluate the performance of the recommendations, we randomly divided the dataset into a training set and a test set.
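The preprocessing and random split described in this part of the evaluation can be illustrated with a minimal sketch. It is not the authors' code: the function names, the dictionary layout, and the fixed random seed are assumptions made for illustration, and the filter is assumed to count a user's interest items (ratings binarized to 1), which is one possible reading of the text.

```python
import random
from statistics import mean


def binarize(ratings):
    """Map {user: {item: numeric rating}} to {user: set of interest items}.

    An item becomes an interest item (binary rating 1) only if its rating
    is strictly larger than that user's own average rating.
    """
    interests = {}
    for user, items in ratings.items():
        avg = mean(items.values())
        interests[user] = {item for item, r in items.items() if r > avg}
    return interests


def drop_sparse_users(interests, min_items=20):
    """Keep users with at least min_items interest items.

    Assumption: the threshold is applied to interest items, not raw ratings.
    """
    return {u: s for u, s in interests.items() if len(s) >= min_items}


def holdout_split(interests, n_test=10, seed=0):
    """Randomly hold out n_test interest items per user as the test set."""
    rng = random.Random(seed)
    train, test = {}, {}
    for user, items in interests.items():
        # Sort first so the sample is reproducible for a given seed.
        held_out = set(rng.sample(sorted(items), n_test))
        test[user] = held_out
        train[user] = items - held_out
    return train, test
```

Holding out a fixed number of items per user keeps the test set size proportional to the number of users, so every user contributes equally to the averaged metrics.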
The users' interest items were split into a test set with 10 items per user (i.e., 780 items for NSF and 6580 items for MLens) and a training set with the remaining content (i.e., 9065 items for NSF and 43,738 items for MLens), which was used to learn and build a model of each user.

Fig. 5. An overview of PRCUM for content recommendations. [Figure labels: Target User; Observation; Relevance feedback (interest content); User Modeling; Frequent Pattern Mining; PT; PTP; Collaborative Enrichment; Neighborhood; Filtering; Classify, Filter, Recommend; Contents for recommendation.]

Table 2. Datasets used in experimental evaluation.

Dataset   Number of users   Number of items   Number of interest items
NSF       78                974               9845
MLens     658               1682              50,318

In order to evaluate the performance of our approach, we implemented the following: i) a user-based collaborative filtering method, UCF [24]; ii) an item-based collaborative filtering method that employs cosine-based similarity, ICF [9]; iii) a probabilistic learning algorithm, termed NB, that applies the multinomial event model of a naïve Bayes assumption [16]; and iv) a TF-IDF vector-based algorithm, VT [6]. For the content recommendation process, in the case of NB, content items were ranked using the calculated probability values, whereas for VT they were ranked using the calculated cosine similarity. For UCF and ICF, the proximity between users or items was measured by cosine-based similarity, and items were ranked by a weighted sum that uses the similarity as the weight. Our top-N recommendation strategy (M+) was then compared with the benchmark algorithms. We adopted two evaluation measures that are defined as follows:
6.2.1. Hit Rate (HR)

In the context of top-N recommendations, the hit-rate, a measure of how often a list of recommendations contains items that the user is actually interested in, was used as an evaluation metric [9]. The hit-rate for user u is defined as:

$$\mathrm{HR}(u) = \frac{|\mathrm{Test}_u \cap \mathrm{TopN}_u|}{|\mathrm{Test}_u|} \qquad (8)$$

where $\mathrm{Test}_u$ is the item list of user u in the test data and $\mathrm{TopN}_u$ is the top-N recommended item list for user u. Finally, the overall HR of top-N recommendation for all users is computed by averaging the personal HR(u) in the test data.

6.2.2. Reciprocal Hit Rank (RHR)

One limitation of the hit-rate measure is that it treats all hits equally regardless of the ranking of recommended content. In other words, a content item that is recommended with top ranking is treated equally with an item that is recommended with Nth ranking. To address this limitation, we adopted the reciprocal hit-rank metric described in [9]. The reciprocal hit-rank for user u is defined as:

$$\mathrm{RHR}(u) = \sum_{i_n \in (\mathrm{Test}_u \cap \mathrm{TopN}_u)} \frac{1}{\mathrm{rank}(i_n)} \qquad (9)$$

where $\mathrm{rank}(i_n)$ refers to the recommended ranking of item $i_n$ within the hit set of user u. That is, hit items that appear earlier in the top-N list are given more weight than later ones. Finally, the overall RHR for all users is computed by averaging the personal RHR(u) in the test data. The higher the RHR, the more accurately the algorithm recommends items.

6.3. Experimental results

In this section, we present detailed experimental results. The performance evaluation is divided into three dimensions. The effect of the neighbor size on the performance of model enrichment is first evaluated, and then the effectiveness of model enrichment is evaluated in comparison with the initial user model. Finally, the accuracy of content recommendations is evaluated in comparison with the benchmark methods. In the experiments, min_sup and min_pw were set to 0.1 (10%) and 0.5, respectively.

Fig. 6. A snapshot of the user interface for PRCUM.
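The hit rate and reciprocal hit rank defined in Sections 6.2.1 and 6.2.2 can be computed as in the following minimal sketch. The function names are illustrative, and the sketch assumes that the rank of a hit is its 1-based position in the top-N list and that the list contains no duplicates.

```python
def hit_rate(test_items, top_n):
    """HR(u): fraction of the user's test items that appear in the top-N list."""
    if not test_items:
        return 0.0
    return len(set(test_items) & set(top_n)) / len(test_items)


def reciprocal_hit_rank(test_items, top_n):
    """RHR(u): sum of 1/rank over all hits.

    Assumption: rank is the 1-based position of the hit in top_n.
    """
    test_set = set(test_items)
    return sum(1.0 / rank
               for rank, item in enumerate(top_n, start=1)
               if item in test_set)


def average_over_users(metric, test_by_user, top_n_by_user):
    """Overall score: mean of the per-user metric over all users in the test data."""
    users = list(test_by_user)
    total = sum(metric(test_by_user[u], top_n_by_user.get(u, [])) for u in users)
    return total / len(users)


# One user with four test items; two of them are recommended, at ranks 2 and 4.
test_u = ["a", "b", "c", "d"]
top_n_u = ["x", "a", "y", "c"]
print(hit_rate(test_u, top_n_u))             # 0.5
print(reciprocal_hit_rank(test_u, top_n_u))  # 0.75
```

In the example, both metrics see the same two hits, but moving a hit from rank 4 to rank 1 would raise the RHR while leaving the HR unchanged.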
Table 3. HR and RHR with respect to increasing neighborhood size (NSF).

Neighbors   10       20       30       40       50       60
HR          0.1584   0.1636   0.1640   0.1651   0.1655   0.1643
RHR         0.4238   0.4891   0.4889   0.4745   0.4732   0.4732

Table 4. HR and RHR with respect to increasing neighborhood size (MLens).

Neighbors   10       20       30       40       50       60
HR          0.2043   0.2136   0.2255   0.2262   0.2262   0.2288
RHR         0.2822   0.3666   0.3881   0.3881   0.3876   0.3732
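As a small worked example, the values in Tables 3 and 4 can be scanned for the smallest neighborhood size at which RHR peaks. The selection rule below is an assumption made for illustration, not a procedure stated in the paper, and the helper name is hypothetical.

```python
NEIGHBORS = [10, 20, 30, 40, 50, 60]

# Values copied from Tables 3 and 4.
RESULTS = {
    "NSF": {
        "HR": [0.1584, 0.1636, 0.1640, 0.1651, 0.1655, 0.1643],
        "RHR": [0.4238, 0.4891, 0.4889, 0.4745, 0.4732, 0.4732],
    },
    "MLens": {
        "HR": [0.2043, 0.2136, 0.2255, 0.2262, 0.2262, 0.2288],
        "RHR": [0.2822, 0.3666, 0.3881, 0.3881, 0.3876, 0.3732],
    },
}


def smallest_k_at_peak(ks, scores):
    """Return the smallest neighborhood size whose score equals the maximum."""
    best = max(scores)
    return next(k for k, s in zip(ks, scores) if s == best)


for name, table in RESULTS.items():
    print(name, smallest_k_at_peak(NEIGHBORS, table["RHR"]))
# NSF 20
# MLens 30
```

Under this rule the peak-RHR sizes are 20 for NSF and 30 for MLens, which coincide with the neighborhood sizes the paper adopts for its subsequent experiments. Preferring the smallest size at the peak also keeps the computation cost low.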
6.3.1. Experiments with neighborhood size

The following experiment investigates the effect of the neighborhood on the enriched model. The number of recommended items N was set to 10 for each user in the test set. As noted in a number of previous studies, the size of the neighborhood influences the recommendation quality of neighborhood-based algorithms. Therefore, different numbers of user neighbors were used for model enrichment: 10, 20, 30, 40, 50, and 60.

Table 3 summarizes the results of RHR and HR for the NSF dataset. With respect to HR, we observe that HR tends to improve slightly as the neighborhood size increases from 10 to 20; beyond this point, further increases in the neighborhood size had little effect on performance. Interestingly, RHR was poorer for a neighborhood size of 30, 40, 50, and 60 than for a size of 20.

We further examined the performance on the MLens dataset. Similar results to NSF were obtained for MLens, as can be seen in Table 4. For example, a neighborhood size of 30 provides reasonably good performance for both HR and RHR.

Fig. 7. Comparison of HR and RHR obtained by the initial model and the enriched model.

Fig. 8. Comparisons of HR and RHR with respect to increasing N.

These results reflect the fact that even a small neighborhood provides enough collaborative term patterns for each user. Recall that patterns are selected for enriching collaborative term patterns in order of neighbor proximity, and thus redundant patterns generated by more distant neighbors are pruned. Another reason might be that we were only looking for a small number of recommended content items (i.e., N=10). That is, once the number of nearest neighbors is relatively large, the rank of recommended content for each user is barely changed by any further increases in the number of nearest neighbors. In practice, recommender systems make a trade-off between recommendation
accuracy and real-time performance by pre-selecting a number of nearest neighbors. In consideration of both accuracy and computation cost, we selected 20 and 30 as the neighborhood sizes for model enrichment on NSF and MLens, respectively, in subsequent experiments.

6.3.2. Effect of model enrichment

This section investigates the effect of the enriched model M+ of each user in more detail, by comparing its results with those obtained by the initial model M of each user. We performed an experiment with N values of 10, 20, and 30 and examined the average number of collaborative term patterns of users. In the case of NSF, we found that 192 patterns on average had been enriched for each user, whereas the average number was 234 for MLens.

Fig. 7 presents the results of the experiment. The results demonstrate that the enriched model provides considerably improved HR values on all occasions, compared to the initial model. For example, the enriched model M+ achieves 8.7% and 11.4% average improvement for NSF and MLens, respectively, in terms of HR, compared to the initial model M. Similar conclusions are implied by the RHR results as well. More importantly, we found that the enriched model outperforms the initial model in all cases where the number of recommended items is small. When N is 10, the enriched model obtains an RHR value of 0.489 and 0.388 for NSF and MLens, respectively, whereas the initial model obtains an RHR value of 0.358 and 0.281, respectively. This is particularly important, since users tend to click on content with higher ranks. We conclude that the collaborative model has significant advantages in terms of improving both the recommendation accuracy and the recommendation ranking.

6.3.3. Comparisons with other methods

To experimentally evaluate the performance of top-N recommendation, we calculated the hit rate (HR) and the reciprocal hit rank (RHR) obtained by NB, VT, UCF, ICF and M+.
We varied the number of returned items N from 10 to 30 in increments of 10. Following previous studies on collaborative filtering, the neighborhood size of UCF and ICF was set to 50.

Fig. 8 shows the results of RHR and HR for the NSF and MLens datasets, showing how M+ outperforms the benchmark methods. As the number of recommended items N increases, the HR and RHR values tend to increase. Comparing the results achieved by M+ and the benchmark algorithms, for both test sets, the HR value of the former was found to be superior to that of the benchmark methods in all cases. On the NSF dataset, averaged over all values of N, M+ outperforms VT, NB, UCF and ICF by 6.7%, 16.7%, 7% and 8.5%, respectively. For the MLens dataset, M+ obtains 11.1%, 12.7%, 4.2%, and 4.2% improvement compared to VT, NB, UCF, and ICF, respectively. With respect to RHR, similar results are demonstrated. More interestingly, M+ significantly outperforms the other methods when a relatively small number of content items are recommended. For the MLens dataset, in the case of N=10, our method outperforms all of the other methods, whereas for the NSF dataset, only VT achieves comparable results. That is, M+ provides more suitable content with a higher rank in the recommended content set, and thus can provide better quality of content for the target user than the other methods.

Ideally, recommender systems should provide a wide range of desirable content for users. Therefore, we continued to analyze the number of content items for which the methods, except for NB, could not provide any predictions for a user (i.e., the prediction value of the target user for the content was zero). Recall that NB and VT are content-based filtering methods, whereas UCF and ICF are collaborative filtering methods.
Strictly speaking, our approach is closely connected with content-based filtering due to its dependence on content characteristics (i.e., content-based user models, content-based neighbors, content-based enrichments, and content-based recommendations). For the NSF dataset, 2.6%, 7.1%, 7.1% and 2.9% of items could not be predicted by VT, UCF, ICF and M+, respectively. For the MLens dataset, 0.12%, 0.29%, 0.68% and 0.13% of items could not be predicted by VT, UCF, ICF and M+, respectively. As noted previously, such results are due to the fact that the collaborative filtering approaches, UCF and ICF, can only make predictions for items that at least a few users have rated. On the other hand, VT and M+ can only make predictions for items that contain terms in the target user model, although they never suffer from cold start items.

These comparison experiments show that our collaborative model effectively and consistently improves the recommendation quality.

7. Conclusions and future work

Automated recommender systems are becoming widely used as a solution for reducing information overload in diverse domains. In this paper we presented a new and unique method for modeling user interests through collaboration among users. It also provides enhanced recommendation accuracy. The major advantage of the proposed modeling method is that it supports not only identification of each user's useful patterns but also enrichment with valuable patterns from neighbors. As noted in our experimental results, our model obtained better recommendation accuracy compared to the benchmark methods. Moreover, we also observed that our method can provide more suitable content for user preferences, even when the number of recommended items is small. Keyword-based analysis suffers from two commonly noted issues: homonymy and synonymy. We expect to improve our user model further by incorporating word semantics through resources such as WordNet [18] or ontologies.
Therefore, we plan to further study semantic user models in recommender systems.

References

[1] G. Adomavicius, A. Tuzhilin, Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions, IEEE Transactions on Knowledge and Data Engineering 17 (6) (2005) 734–749.
[2] S. Berkovsky, T. Kuflik, F. Ricci, Mediation of user models for enhanced personalization in recommender systems, User Modeling and User-Adapted Interaction 18 (3) (2008) 245–286.
[3] S. Berkovsky, T. Kuflik, F. Ricci, Cross-representation mediation of user models, User Modeling and User-Adapted Interaction 19 (2009) 35–63.
[4] R. Burke, Hybrid recommender systems: survey and experiments, User Modeling and User-Adapted Interaction 12 (2002) 331–370.
[5] C.C. Chen, M.C. Chen, Y. Sun, PVA: a self-adaptive personal view agent, Journal of Intelligent Information Systems 18 (2002) 173–194.
[6] L. Chen, K. Sycara, WebMate: personal agent for browsing and searching, Proceedings of the 2nd international conference on autonomous agents and multi agent systems, 1998, pp. 132–139.
[7] A. Das, M. Datar, A. Garg, Google news personalization: scalable online collaborative filtering, Proceedings of the 16th international conference on World Wide Web, 2007, pp. 271–280.
[8] M. Degemmis, P. Lops, G. Semeraro, A content-collaborative recommender that exploits WordNet-based user profiles for neighborhood formation, User Modeling and User-Adapted Interaction 17 (2007) 217–255.
[9] M. Deshpande, G. Karypis, Item-based top-n recommendation algorithms, ACM Transactions on Information Systems 22 (1) (2004) 143–177.
[10] S. Flesca, S. Greco, A. Tagarelli, E. Zumpano, Mining user preferences, page content and usage to personalize website navigation, World Wide Web: Internet and Web Information System 8 (3) (2005) 317–345.
[11] E. Gabrilovich, S. Dumais, E. Horvitz, Newsjunkie: providing personalized news-feeds via analysis of information novelty, Proceedings of the 13th international conference on World Wide Web, 2004, pp. 482–490.
[12] J. Han, J. Pei, Y. Yin, Mining frequent patterns without candidate generation: a frequent-pattern tree approach, Data Mining and Knowledge Discovery 8 (2004) 53–87.
[13] J.L. Herlocker, J.A. Konstan, J. Riedl, Explaining collaborative filtering recommendations, Proceedings of the 2000 ACM conference on computer supported cooperative work, 2000, pp. 241–250.
[14] W. Lihua, L. Lu, L. Jing, Zongyong Li, Modeling user multiple interests by an improved GCS approach, Expert Systems with Applications 29 (2005) 757–767.
[15] B. Magnini, C. Strapparava, User modelling for news web sites with word sense based techniques, User Modeling and User-Adapted Interaction 14 (2004) 239–257.
[16] A. McCallum, K. Nigam, A comparison of event models for naïve Bayes text classification, Proceedings of AAAI-98 workshop on learning for text categorization, 1998, pp. 41–48.
[17] P. Melville, R.J. Mooney, R. Nagarajan, Content-boosted collaborative filtering for improved recommendations, Proceedings of the 18th national conference on artificial intelligence, 2002, pp. 187–192.
[18] G.A. Miller, WordNet: a lexical database for English, Communications of the ACM 38 (11) (1995) 39–41.
[19] M. Montaner, B. Lopez, J.L. de la Rosa, A taxonomy of recommender agents on the internet, Artificial Intelligence Review 19 (4) (2003) 285–330.
[20] M.J. Pazzani, A. Meyers, NSF research awards abstracts 1990–2003, http://kdd.ics.uci.edu/databases/nsfabs/nsfawards.html, 2003.
[21] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, J. Riedl, GroupLens: an open architecture for collaborative filtering of netnews, Proceedings of 1994 ACM conference on computer supported cooperative work, 1994, pp. 175–186.
[22] J. Salter, N. Antonopoulos, CinemaScreen recommender agent: combining collaborative and content-based filtering, IEEE Intelligent Systems 21 (2006) 35–41.
[23] G. Salton, C. Buckley, Term weighting approaches in automatic text retrieval, Information Processing and Management 24 (1988) 513–523.
[24] B.M. Sarwar, G. Karypis, J.A. Konstan, J.T. Riedl, Analysis of recommendation algorithms for e-commerce, Proceedings of the 2nd ACM conference on electronic commerce, 2000, pp. 158–167.
[25] A.I. Schein, A. Popescul, L.H. Ungar, D.M. Pennock, Methods and metrics for cold-start recommendations, Proceedings of the 25th annual international ACM SIGIR conference on research and development in information retrieval, 2002, pp. 253–260.
[26] I. Schwab, W. Pohl, I. Koychev, Learning to recommend from positive evidence, Proceedings of the 5th international conference on intelligent user interfaces, 2000, pp. 241–247.

Heung-Nam Kim is a postdoctoral fellow in the Multimedia Communications Research Laboratory (MCRLab) at the University of Ottawa, Canada. His research interests include collaborative filtering, recommender systems, semantic Web, data mining, user modeling, and social networking applications. He obtained his Ph.D. in Computer and Information Engineering from Inha University, Korea.

Inay Ha received the B.S. degree in Computer Science from University of Suwon and the M.Eng. degree in Computer and Information Engineering from Inha University, Korea, in 2007. She is currently working toward the Ph.D. degree at the Intelligent E-Commerce Systems Laboratory (IESL), Inha University. Her research interests include Web mining, social networks, recommender systems, and intelligent e-Learning systems.

Kee-Sung Lee received the B.S. degree in Computer Science from Cheon-An University and the M.Eng. degree in Computer and Information Engineering from Inha University, Korea, in 2005. He is a Ph.D. student in the Intelligent E-Commerce Systems Laboratory (IESL), Inha University. His research interests include semantic Web, image annotation and retrieval, information visualization and user interface design.

Geun-Sik Jo is a Professor in Computer and Information Engineering, Inha University, Korea. He is the chairman of the School of Computer and Information Engineering at Inha University. He received the B.S. degree in Computer Science from Inha University in 1982. He received the M.S. and the Ph.D. degrees in Computer Science from City University of New York in 1985 and 1991, respectively. He has been the General Chair and/or Technical Program Chair of more than 20 international conferences and workshops on artificial intelligence, knowledge management, and semantic applications. His research interests include knowledge-based scheduling, ontology, semantic Web, intelligent E-Commerce, constraint-directed scheduling, knowledge-based systems, decision support systems, and intelligent agents. He has authored and coauthored five books and more than 200 publications.
Abdulmotaleb El-Saddik is University Research Chair and Professor, SITE, University of Ottawa, and recipient of the Friedrich Wilhelm-Bessel Research Award from Germany's Alexander von Humboldt Foundation (2007), the Premier's Research Excellence Award (PREA 2004), and the National Capital Institute of Telecommunications (NCIT) New Professorship Incentive Award (2004). He is the director of the Multimedia Communications Research Laboratory (MCRLab). He is Associate Editor of the ACM Transactions on Multimedia Computing, Communications and Applications (ACM TOMCCAP), IEEE Transactions on Multimedia (IEEE TMM) and IEEE Transactions on Computational Intelligence and AI in Games (IEEE TCIAIG), and Guest Editor for several IEEE Transactions and Journals. Dr. El Saddik has served on the technical program committees of numerous IEEE and ACM events. He was the general co-chair of ACM MM 2008. He is a leading researcher in haptics, service-oriented architectures, collaborative environments and ambient interactive media and communications. He has authored and coauthored two books and more than 200 publications. His research has been selected for the Best Paper Award at "Virtual Concepts 2006" and "IEEE COPS 2007". Dr. El Saddik is a Senior Member of ACM, an IEEE Distinguished Lecturer and a Fellow of the IEEE (FIEEE), the Canadian Academy of Engineers (FCAE) and the Engineering Institute of Canada (FEIC).