EXPLORING INFORMATION HIDDEN IN TAGS: A SUBJECT-BASED ITEM RECOMMENDATION APPROACH

Jing Peng (1), Daniel Zeng (2,1)
(1) Institute of Automation, Chinese Academy of Sciences
(2) Department of Management Information Systems, The University of Arizona
jing.peng@ia.ac.cn, zeng@email.arizona.edu

Abstract

Collaborative tagging sites allow users to bookmark and annotate their favorite Web contents with tags. These tags provide a novel source of information for collaborative filtering (CF). Research on how to improve item recommendation quality by leveraging tags is emerging, yet the information hidden in tags is far from being fully exploited. In this paper, we aim at finding informative usage patterns from tags by consistent clustering on tags using nonnegative matrix factorization. The clustered subjects, represented by weighted tag vectors, can then be used to build a subject-centered user information seeking model for item recommendation. Experiments on two real-world datasets show that our subject-based algorithms substantially outperform the traditional CF methods as well as tag-enhanced recommendation approaches reported in the literature.

Keywords: Collaborative filtering, collaborative tagging, nonnegative matrix factorization, tag-enhanced

1. Introduction

Collaborative tagging or social bookmarking sites, such as Delicious (del.icio.us) and CiteULike (www.citeulike.org), allow users to bookmark their favorite Web contents (or items) online and provide tag-based annotations to facilitate future retrieval. Tagging data present interesting opportunities as well as challenges to CF, as tags are available as a novel source of data in addition to the typical user-item interaction information.

Recently, there has been a growing number of efforts aiming to improve Web content or item recommendation quality by leveraging tags. Zhao et al. (2008) proposed to compute the similarity of two users based on the semantic distance of their tag sets on common items they have bookmarked. Tso-Sutter et al. (2008) presented an interesting approach to integrating tags into traditional CF algorithms. They extended the item vectors for users and the user vectors for items with tags and then constructed the user/item neighborhoods based on the extended user/item profiles. Instead of exploring tags for similarity computation, the topic-based method (Peng et al. 2009) takes advantage of tags in a probabilistic framework, viewing each tag as an indicator of a topic and then estimating the probability of a user saving an item by summing up the transition probabilities through all tags.

However, none of the existing research has explicitly explored why and how a user chooses to bookmark an item and assign a certain set of tags. We hypothesize that exploiting such information embedded in tagging activities could lead to better recommendation performance. A record in tagging data is a tuple consisting of three fields in the form of <user, item, tag>. Some preliminary work (Halpin et al. 2007; Lambiotte et al. 2006) has started to examine a tripartite graph structure of collaborative tagging, as shown in Figure 1 (a). The topic-based method (Peng et al. 2009) assumes this structure implicitly. Nevertheless, this tripartite structure suffers from the following problems:
(i) A tag usually covers a wide range of topics rather than a specific subject.
(ii) Users may be linked to items that they are not interested in at all due to polysemy of tags.
(iii) Users may miss items they are actually interested in because synonyms are used.
(iv) There exists a large number of tags (e.g., "toread," "***") carrying no value to users other than their originators.
2. Consistent Nonnegative Matrix Factorization for Tag Clustering

Figure 1. (a) Tripartite graph structure (b) Subject-centered model

In order to overcome the above problems of associating each tag with a distinct topic under the tripartite structure, we propose to cluster tags to gain a more compact and informative representation of the underlying subjects that users actually intend with these tags. Common feature-based hard clustering algorithms (such as k-means) are not suitable for clustering tags for two reasons. First, tags do not have a well-defined feature space associated with them. Second, many tags can be used to represent different subjects. In our work, we employ Nonnegative Matrix Factorization (NMF) to perform soft clustering on tags. NMF has proved to be an effective technique for extracting hidden structures from data that can be represented as matrices, e.g., in image processing (Lee et al. 1999) and text mining (Lee et al. 1999; Xu et al. 2003). Given a nonnegative matrix $\mathbf{V}$, an NMF can be defined as $\mathbf{V} \approx \mathbf{WH}$, where $\mathbf{W}$ and $\mathbf{H}$ are constrained to be nonnegative. The columns of $\mathbf{W}$ are often restricted to unit length so as to make the factorization result unique. From the clustering perspective, the factor matrix $\mathbf{W}$ contains the cluster centroids and $\mathbf{H}$ contains the cluster membership indicators. For notational convenience, a variant of the standard NMF is adopted in this paper, that is, $\mathbf{V}^{T} \approx \mathbf{H}^{T}\mathbf{W}^{T}$, where each row of $\mathbf{W}^{T}$ indicates a cluster centroid and each row of $\mathbf{H}^{T}$ holds the cluster membership of the corresponding object.

Figure 2. (a) NMFs on user-tag and item-tag matrices (b) Consistent NMF on user-item-tag matrix

As shown in Figure 2 (a), two types of NMF can be applied to find the cluster centroids of tags, with each centroid representing a subject. UT and IT denote the user-tag and item-tag co-occurrence matrices, respectively, with each entry being the frequency of the corresponding <user, tag> or <item, tag> pair in the training set. Each row of STBM indicates a clustered subject, and the entire matrix constitutes the bases of the subject space. USCM/ISCM holds the coordinate values of users/items in the subject space.

Nevertheless, this NMF approach is only able to capture the cluster membership of one entity (either user or item) on the clustered subjects. We argue, however, that from both modeling and computational standpoints it is critical to consider these two types of cluster membership together. To this end, we propose a new method called Consistent NMF (CONMF), as shown in Figure 2 (b), to capture both memberships as well as obtain a consistent clustering result on tags. The computational definition of CONMF will be presented in Section 3.2. Here we provide the basic intuition. NMF takes one matrix as input and produces two matrices whose product approximates the input matrix.
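As a concrete illustration of this single-matrix factorization, the sketch below soft-clusters tags from a tiny, made-up item-tag count matrix using scikit-learn's NMF (a stand-in for any NMF solver; the matrix values, k, and parameter choices are illustrative assumptions, not the paper's setup).

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical item-tag co-occurrence matrix IT (3 items x 4 tags);
# entry (i, t) = how often tag t was assigned to item i.
IT = np.array([
    [5.0, 3.0, 0.0, 0.0],   # item 0 uses mostly the first two tags
    [4.0, 4.0, 1.0, 0.0],
    [0.0, 0.0, 6.0, 2.0],   # item 2 uses mostly the last two tags
])

k = 2  # desired number of subjects (clusters)
model = NMF(n_components=k, init="nndsvda", max_iter=500, random_state=0)

# Factor IT ~= ISCM * STBM (the paper's V^T ~= H^T W^T variant):
# rows of STBM are subject centroids in tag space (weighted tag vectors),
# rows of ISCM are each item's soft membership over the k subjects.
ISCM = model.fit_transform(IT)     # items x subjects
STBM = model.components_           # subjects x tags

print("subject bases (tag weights):\n", np.round(STBM, 2))
print("item memberships over subjects:\n", np.round(ISCM, 2))
```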
CONMF takes two matrices as input and produces two correlated factorizations to approximate the two input matrices. The imposed correlation guarantees the consistency needed to reinterpret these two resulting factorizations in a meaningful manner in the tagging domain.

Based on the CONMF results, we are able to build a computational model for tagging data as shown in Figure 1 (b), where USCM, ISCM and STBM denote the user-subject, item-subject and tag-subject inter-relationships, respectively. We call this model a "subject-centered" model. It can be viewed as a generalization of the tripartite graph: if no tag clustering is performed, the subjects are an identity mapping to tags and the subject-centered model degenerates into the tripartite graph. In our subject-centered model, subjects become the driving force motivating a user to save an item and assign tags. In other words, a user bookmarks an item because this item falls into her subject interests, and a user assigns a tag because this tag is able to describe her subject interest in the item being saved. The solid edges shown in Figure 1 (b) indicate the presumed structure of tagging data, whereas the dashed edges indicate the real-world observation of users' tagging behavior. Since the clustered subjects are represented by weighted tag vectors, the subject-centered model can help solve some of the key problems the tripartite graph-based method is facing. More specifically, (i) the range of subjects is meaningfully reduced; (ii) tag semantic ambiguity is lessened; (iii) synonymic tags are grouped together to describe the same subject; and (iv) the noise from meaningless tags is discounted.

3. Subject-based Recommendation

In this section, we present how to make recommendations based on the subject-centered model. We first extract the hidden subjects from users' tagging activities via CONMF, and then estimate the probabilities of items being saved by users based on the constructed subject-centered model.

3.1 Notation

We view users, items, tags, and subjects as four random variables and denote them by U, I, T, S, respectively. The joint distribution of any two random variables X and Y is denoted as F(X,Y). The co-occurrence matrix of two entities is represented by the combination of their corresponding letters; for example, UT stands for the user-tag co-occurrence matrix. We use different subscripts to distinguish different types of matrices. Subscript "PM" indicates a transition probability matrix (e.g., USPM stands for the probabilities of users getting interested in different subjects); subscript "BM" indicates a basis matrix (e.g., STBM stands for the basis matrix of the subject space); and subscript "CM" indicates a coordinate matrix (e.g., USCM holds the coordinate values of users in the subject space). In addition, the symbols $f_{url}(\mathbf{M})$ and $f_{urs}(\mathbf{M})$ denote normalizing matrix $\mathbf{M}$ to unit row length and unit row sum, respectively.

3.2 Subject Extraction

CONMF is employed to discover the subjects hidden in tagging data. In Figure 2 (b), the weights for matrices UT and IT are assumed to be equal, although different weights can be used as well. In our approach, UT and IT are normalized to unit row length so that all users are equally weighted within UT and all items are equally weighted within IT. We choose the matrix-level weights such that the weights for UT and IT sum to 2.
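Since the row normalizations $f_{url}$ and $f_{urs}$ from Section 3.1 are used repeatedly below, here is a minimal assumed implementation of the two helpers (not code from the paper):

```python
import numpy as np

def f_url(M, eps=1e-12):
    """Normalize each row of a nonnegative matrix to unit Euclidean length."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    return M / np.maximum(norms, eps)

def f_urs(M, eps=1e-12):
    """Normalize each row of a nonnegative matrix to unit sum (row-stochastic)."""
    sums = M.sum(axis=1, keepdims=True)
    return M / np.maximum(sums, eps)

# Example: a hypothetical user-tag matrix UT with two users.
UT = np.array([[2.0, 2.0, 0.0],
               [1.0, 0.0, 3.0]])
print(f_url(UT))  # rows have unit length
print(f_urs(UT))  # rows sum to 1
```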
CONMF can then be formally written as:

$$\begin{bmatrix} c \cdot f_{url}(\mathbf{UT}) \\ (2-c) \cdot f_{url}(\mathbf{IT}) \end{bmatrix} \approx \begin{bmatrix} \mathbf{US}_{CM} \\ \mathbf{IS}_{CM} \end{bmatrix} \cdot \mathbf{ST}_{BM} \qquad (1)$$

where $c$ ($0 < c < 2$) reflects the tradeoff between the contributions of UT and IT to the extracted subjects. Combining UT and IT for factorization not only guarantees a consistent clustering result on tags, but also enables the clustered subjects to aggregate information from both matrices. In our experiments, CONMF is implemented with the Matlab library function nnmf. The desired number of subjects, k, is selected manually beforehand, and the optimal value of this parameter depends on the dataset. However, according to our experiments, quite stable results can be obtained over a large range of k (e.g., 50-250).
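Equation (1) can be read as one ordinary NMF applied to the two weighted, row-normalized matrices stacked vertically, so that UT and IT share a single subject basis STBM. The sketch below follows that reading; scikit-learn's NMF stands in for the Matlab nnmf routine, and the toy matrices, k, and c are illustrative assumptions rather than the authors' setup.

```python
import numpy as np
from sklearn.decomposition import NMF

def f_url(M, eps=1e-12):
    # unit-row-length normalization (Section 3.1)
    return M / np.maximum(np.linalg.norm(M, axis=1, keepdims=True), eps)

def conmf(UT, IT, k, c=1.0, seed=0):
    """Consistent NMF sketch: jointly factor c*f_url(UT) and (2-c)*f_url(IT)
    against one shared subject-tag basis STBM, as in equation (1)."""
    n_users = UT.shape[0]
    V = np.vstack([c * f_url(UT), (2.0 - c) * f_url(IT)])  # stacked input
    model = NMF(n_components=k, init="nndsvda", max_iter=1000, random_state=seed)
    W = model.fit_transform(V)        # stacked [USCM; ISCM]
    STBM = model.components_          # shared subject bases over tags
    USCM, ISCM = W[:n_users], W[n_users:]
    return USCM, ISCM, STBM

# Tiny hypothetical matrices: 2 users, 3 items, 4 tags.
UT = np.array([[3.0, 1.0, 0.0, 0.0],
               [0.0, 0.0, 2.0, 2.0]])
IT = np.array([[4.0, 2.0, 0.0, 0.0],
               [1.0, 3.0, 0.0, 1.0],
               [0.0, 0.0, 5.0, 2.0]])
USCM, ISCM, STBM = conmf(UT, IT, k=2, c=1.0)
```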
3.3 Recommendation Algorithm

The basic intuition of subject-based recommendation is that a user's interest in an item can be expressed through subjects. The probability of a user visiting an item is given by $p(i|u) = \sum_{s} p(s|u) \cdot p(i|s)$, where $p(s|u)$ indicates the probability of user $u$'s interest in subject $s$ and $p(i|s)$ represents the probability of saving item $i$ when users are interested in subject $s$. In matrix form, this can be rewritten as $\mathbf{UI}_{PM} = \mathbf{US}_{PM} \cdot \mathbf{SI}_{PM}$. According to this formula, estimating USPM and SIPM becomes another crucial step of our approach besides CONMF.

Since we already have the coordinate values of users in the subject space, USPM can be computed by $\mathbf{US}_{PM} = f_{urs}(\mathbf{US}_{CM})$, i.e., the probability that a user is interested in a subject is proportional to her coordinate value on this subject. Likewise, ISPM can be calculated based on ISCM. However, simply transposing ISPM will not produce SIPM. Since ISPM and SIPM are conditional probability matrices derived from the joint distribution F(I,S) over items and subjects, we compute the joint distribution matrix of F(I,S) from ISPM and then estimate SIPM based on F(I,S). The details of computing SIPM are illustrated in Figure 3.

Figure 3. Estimation of transition probability matrix SIPM. The procedure is:
1. Compute ISPM by normalizing ISCM to unit row sum.
2. Multiply each row of ISPM by the frequency of this item being saved in the training set, which is equivalent to multiplying the diagonal matrix of the items' frequencies, diag(I), by ISPM; this yields IS.
3. Transpose IS so as to get SI.
4. Obtain SIPM by normalizing SI to unit row sum.

$$\mathbf{SI}_{PM} = f_{urs}\left( \left( \mathrm{diag}(I) \cdot f_{urs}(\mathbf{IS}_{CM}) \right)^{T} \right) \qquad (2)$$

In short, SIPM is calculated using equation (2). Finally, we can compute UIPM based on USPM and SIPM and then recommend to each user the items with the highest probabilities (provided the items have not been bookmarked before).

3.4 Tag Generalization

As discussed in Section 1, some users tend to use tags that are meaningless to other users. These tags are noise disturbing the matrices UT and IT. Since UT records the tag frequencies of individual users, it contains considerable noise. IT, however, is the aggregation of all users' tagging behavior on items and is thus much more resistant to noisy tags. As a result of this difference in noise ratio between UT and IT, combining them directly for factorization may lead to poor results. In order to generate a more reliable UT, one possible approach (not without the loss of user-specific information) is as follows. For each user and each item bookmarked by this user, we reset her tag vector to the corresponding tag vector from IT, reflecting crowd wisdom. In matrix form, $\mathbf{UT} = \mathbf{UI} \cdot f_{url}(\mathbf{IT})$. We call this idea tag generalization and test its effectiveness in Section 4.
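Putting Section 3.3 together, the following sketch (an assumed implementation reusing the f_urs helper and the CONMF outputs from the earlier sketches; item_freq denotes the per-item bookmark counts in the training set) computes USPM, SIPM via equation (2), UIPM, and the resulting top-N recommendations.

```python
import numpy as np

def f_urs(M, eps=1e-12):
    # unit-row-sum normalization (Section 3.1)
    return M / np.maximum(M.sum(axis=1, keepdims=True), eps)

def recommend(USCM, ISCM, item_freq, UI, n=5):
    """Score items per user via UI_PM = US_PM * SI_PM and return the top-n
    unbookmarked items for each user. UI is the 0/1 user-item matrix."""
    USPM = f_urs(USCM)                                  # p(s|u)
    ISPM = f_urs(ISCM)                                  # each item's distribution over subjects
    # equation (2): SI_PM = f_urs( (diag(item_freq) @ IS_PM)^T )
    SIPM = f_urs((np.diag(item_freq) @ ISPM).T)         # p(i|s)
    UIPM = USPM @ SIPM                                  # p(i|u)
    UIPM = np.where(UI > 0, -np.inf, UIPM)              # drop already-saved items
    return np.argsort(-UIPM, axis=1)[:, :n]             # indices of the top-n items

# Example with the toy matrices from the CONMF sketch (2 users, 3 items):
# UI = np.array([[1, 0, 0], [0, 0, 1]]); item_freq = UI.sum(axis=0)
# top_items = recommend(USCM, ISCM, item_freq, UI, n=2)
```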
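The tag generalization step of Section 3.4 is a single matrix product; a minimal assumed sketch:

```python
import numpy as np

def f_url(M, eps=1e-12):
    return M / np.maximum(np.linalg.norm(M, axis=1, keepdims=True), eps)

def generalize_tags(UI, IT):
    """Tag generalization: rebuild UT from the crowd's tagging of each
    user's bookmarked items, i.e., UT = UI @ f_url(IT)."""
    return UI @ f_url(IT)

# UI: 0/1 user-item matrix; IT: item-tag counts (hypothetical shapes).
UI = np.array([[1, 1, 0],
               [0, 0, 1]], dtype=float)
IT = np.array([[4.0, 2.0, 0.0, 0.0],
               [1.0, 3.0, 0.0, 1.0],
               [0.0, 0.0, 5.0, 2.0]])
UT_generalized = generalize_tags(UI, IT)
```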
4. An Empirical Study

The datasets used in our study were crawled from Delicious, the largest social bookmarking site, from June 2008 to December 2008. Two raw datasets, each consisting of bookmarking data from 5000 users, were used for our reported experiments. To reduce the size of the raw data, we removed users that had posted fewer than 15 URLs and items that had been saved by fewer than 15 users. To create an experimental testbed that can highlight differences between recommendation algorithms, we also removed the URLs that had been bookmarked by more than 75 users, which accounted for about 1-2% of all URLs. This removal is based on the observation that recommending extremely popular items is almost always correct, which makes algorithm comparison less sensitive. After all these preprocessing steps, two datasets, named Small and Large, were obtained. Table 1 provides key statistics of these two datasets.

Table 1. Dataset characteristics

                                      Small     Large
  Number of users                     485       1097
  Number of URLs                      999       1872
  Number of tags                      11307     9608
  Number of transactions              24612     44599
  Density level (%)                   5.08      2.17
  Average number of URLs per user     50.75     40.66
  Average number of users per URL     24.64     23.82

We randomly divided each dataset into a training set and a test set. The split was based on a training-testing ratio of 20%-80% and was done at the user level. In the prediction phase, all methods recommended 5 items for each user, which were then compared with the items in the test set. The evaluation metrics adopted in our experiments were the commonly used precision, recall, F-measure, and rank score for ranked list prediction.

Twelve algorithms were evaluated in our experiments. The RAND algorithm generates random recommendations for every user, while the POP algorithm recommends the most popular items to each user. The user-based (UB) and item-based (IB) algorithms, as well as their variants, namely tagging user-based (TUB) and tagging item-based (TIB) (Peng et al. 2009), were implemented as baselines. In addition to the topic-based (TB) method, the SVD dimensionality reduction method (SVD), which finds hidden structure using singular value decomposition, and the fusion algorithm (FUS), which extends user/item profiles with tags, are also included for comparison. Furthermore, to investigate the benefits of tag generalization, we integrate it into the TB and SB (subject-based) algorithms and rename them GTB and GSB, respectively. The experiment was rerun 10 times and the final results were averaged over all runs. Table 2 summarizes the results on both datasets.

Table 2. Experimental results on the Small and Large datasets

  Dataset  Algorithm  Precision (%)  Recall (%)  F-measure (%)  Rank Score
  Small    RAND       4.27           0.43        0.79           4.27
           POP        9.81           1.00        1.81           9.84
           UB         22.00          2.23        4.05           22.18
           IB         13.52          1.37        2.49           13.52
           SVD        19.10          1.93        3.51           19.25
           TUB        21.02          2.14        3.88           21.20
           TIB        22.31          2.27        4.12           22.27
           FUS        24.68          2.51        4.55           24.76
           TB         22.99          2.34        4.24           23.15
           SB         23.87          2.42        4.40           23.98
           GTB        28.34          2.88        5.22           28.50
           GSB        30.00          3.05        5.53           30.28
  Large    RAND       1.74           0.22        0.39           1.73
           POP        3.82           0.49        0.86           3.82
           UB         4.81           0.61        1.08           4.85
           IB         2.75           0.35        0.62           2.76
           SVD        5.15           0.65        1.16           5.17
           TUB        7.38           0.94        1.66           7.45
           TIB        7.06           0.89        1.59           7.06
           FUS        7.71           0.98        1.73           7.75
           TB         7.62           0.96        1.71           7.67
           SB         7.80           0.99        1.75           7.87
           GTB        7.89           1.00        1.77           7.99
           GSB        8.33           1.05        1.87           8.39
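As a side note on the evaluation protocol described above (5 recommendations per user, compared against the held-out test items), the sketch below shows one assumed way to compute precision, recall, and F-measure from the per-user top-5 lists; the rank score metric is omitted.

```python
def precision_recall_f1(recommended, test_items):
    """recommended: dict user -> list of 5 recommended item ids;
    test_items: dict user -> set of item ids held out for that user."""
    hits = sum(len(set(recs) & test_items[u]) for u, recs in recommended.items())
    n_rec = sum(len(recs) for recs in recommended.values())
    n_test = sum(len(test_items[u]) for u in recommended)
    precision = hits / n_rec if n_rec else 0.0
    recall = hits / n_test if n_test else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Example with hypothetical ids: two users, 5 recommendations each.
recommended = {"u1": [1, 2, 3, 4, 5], "u2": [9, 8, 7, 6, 5]}
test_items = {"u1": {2, 10, 11}, "u2": {5, 6}}
print(precision_recall_f1(recommended, test_items))
```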
It can be seen from Table 2 that the proposed GSB algorithm significantly outperforms the other algorithms on both datasets, which demonstrates the effectiveness of our approach. According to a t-test, GSB is significantly better than the second-best algorithm, GTB, with p < 0.001 on all evaluation metrics for both datasets. Despite its simplicity, tag generalization has led to quite sizable improvements for both the subject-based and topic-based algorithms, confirming our hypothesis about the reliability of the matrices UT and IT. The fusion method focuses on occurrences of tags rather than their weighting values, and thus is not able to benefit from tag generalization. The subject-based algorithms, with or without tag generalization, show superiority over the topic-based algorithms, demonstrating the advantages of extracting subjects for recommendation. In general, the tag-enhanced algorithms outperform the traditional CF approaches strikingly, especially on the sparser Large dataset. This implies that leveraging tagging information has great potential for improving item recommendation quality, especially when the dataset is sparse. The prediction accuracy of the implemented algorithms on the Small dataset is much higher than on the Large dataset, which may be due to the fact that the Small dataset is much denser than the Large dataset, so more data are available for training/profiling.

5. Conclusions and Future Research

This paper discusses a subject-centered model of collaborative tagging, in which subjects are assumed to play the key role in determining users' tagging behavior. We propose the concept of Consistent Nonnegative Matrix Factorization and use it to discover the hidden subjects and inner relations within this model from tagging data. We also propose a preprocessing technique called tag generalization to remove noise stemming from meaningless tags. An evaluation study using two real-world collaborative tagging datasets from Delicious has demonstrated the effectiveness of our approach. For future work, we plan to design more sophisticated techniques for tag generalization and apply it to the matrix IT as well. We will also investigate how to automatically optimize various control parameters of our approach (e.g., c in Equation 1, and k, the target number of subjects to be clustered).

Acknowledgements

The authors wish to acknowledge research support from the NNSFC (60621001, 60875049, 70890084), MOST (2006AA010106), and CAS (2F07C01, 2F08N03).

References

Halpin, H., Robu, V., and Shepherd, H. "The complex dynamics of collaborative tagging," in: Proceedings of the 16th International Conference on World Wide Web, ACM, Banff, Alberta, Canada, 2007.
Lambiotte, R., and Ausloos, M. "Collaborative Tagging as a Tripartite Network," in: Lecture Notes in Computer Science, ICCS 2006, 2006, pp. 1114-1117.
Lee, D.D., and Seung, H.S. "Learning the parts of objects by non-negative matrix factorization," Nature (401:6755), 1999, pp. 788-791.
Peng, J., and Zeng, D. "Topic-based Web Page Recommendation Using Tags," in: Proceedings of the 2nd International Workshop on Social Computing, IEEE, Dallas, Texas, 2009.
Tso-Sutter, K.H.L., Marinho, L.B., and Schmidt-Thieme, L. "Tag-aware recommender systems by fusion of collaborative filtering algorithms," in: Proceedings of the ACM Symposium on Applied Computing, ACM, Fortaleza, Ceara, Brazil, 2008.
"Document clustering based on non-negative matrix factorization," in: Proceedings of t he 26t h annu al i nternational ACM SI GIR c onference on R esearch and development in informaion retrieval, ACM, Toronto, Canada, 2003. Zhao, S., Du, N., Nauerz, A., Zhang, X., Yuan, Q., and Fu, R. "Improved recommendation based on collaborative tagging behaviors," in: Proceedings of the 13t h i nternational conference on Intelligent user interfaces, ACM, Gran Canaria, Spain, 2008. 78 19th Workshop on Information Technologies and Systems
TOWARD MORE DIVERSE RECOMMENDATIONS: ITEM RE-RANKING METHODS FOR RECOMMENDER SYSTEMS

Gediminas Adomavicius, YoungOk Kwon
Department of Information and Decision Sciences
Carlson School of Management, University of Minnesota
gedas@umn.edu, kwonx052@umn.edu

Abstract

Recommender systems are becoming increasingly important to individual users and businesses for providing personalized recommendations. However, while the majority of algorithms proposed in the recommender systems literature have focused on improving recommendation accuracy (as exemplified by the recent Netflix Prize competition), other important aspects of recommendation quality, such as the diversity of recommendations, have often been overlooked. In this paper, we introduce a number of item re-ranking methods that can generate substantially more diverse recommendations across all users while maintaining comparable levels of recommendation accuracy. Empirical results consistently show the diversity gains of the proposed re-ranking methods for several real-world rating datasets and different rating prediction techniques.

Keywords: recommender systems, collaborative filtering, recommendation diversity, ranking functions.

1. Introduction and Motivation

In recent years, recommender systems have become an important research topic in academia and industry (Adomavicius and Tuzhilin 2005). However, in most cases, new techniques are designed to improve the accuracy of recommendations, including the most recent algorithms from the Netflix Prize competition (netflixprize.com); other aspects, such as the diversity of recommendations, have often been overlooked in evaluating recommendation quality. The importance of diverse recommendations has been emphasized in several recent studies (Adomavicius and Kwon 2008, Bradley and Smyth 2001, Brynjolfsson et al. 2007, Fleder and Hosanagar 2009, Zhang and Hurley 2008, Ziegler et al. 2005). These studies argue that one of the goals of recommender systems is to provide a user with highly idiosyncratic or personalized items, and more diverse recommendations result in more opportunities for users to get recommended such items. With this motivation, some studies have proposed new recommendation methods that can increase the diversity of recommendation sets for a given individual user (i.e., individual diversity), often measured by the average dissimilarity between all pairs of recommended items (Bradley and Smyth 2001, Zhang and Hurley 2008, Ziegler et al. 2005).

More diverse recommendations could be beneficial for some businesses as well (Brynjolfsson et al. 2007, Fleder and Hosanagar 2009). For example, it would be profitable to Netflix if its recommender system could encourage users to rent more "long-tail" movies (i.e., more obscure items located in the tail of the sales distribution), because they are typically less costly to license and acquire from distributors than new-release or highly popular movies of big studios (Goldstein and Goldstein 2006). A few recent studies (Brynjolfsson et al. 2007, Fleder and Hosanagar 2009) have started examining the impact of recommender systems on sales diversity by considering the aggregate diversity of recommendations across all users, which will be the focus of this paper. Note that high individual diversity of recommendations does not necessarily imply high aggregate diversity.
Table 1. Accuracy-diversity tradeoff: example

  Top-1 recommendation of:                                             Accuracy   Diversity
  Popular Item (item with the largest number of known ratings)         82%        49 distinct items
  "Long-Tail" Item (item with the smallest number of known ratings)    68%        695 distinct items

  Note. Recommendations for 2828 users by a standard item-based collaborative filtering technique on MovieLens data.

For example, while recommending to all users the same five best-selling items that are not similar to each other will result in high individual diversity, the aggregate diversity will
be very low (because only five distinct items are recommended across all users). Higher diversity (both individual and aggregate), however, can come at the expense of accuracy. The example in Table 1 (based on the MovieLens dataset that is publicly available at grouplens.org) shows that it is possible to obtain higher diversity simply by recommending less popular items; however, the loss of recommendation accuracy in this case can be substantial. Some prior work (Adomavicius and Kwon 2008) has attempted to overcome this accuracy-diversity trade-off by filtering out less promising recommendations; as a result, however, such approaches often can offer only a fraction of possible recommendations to users. In contrast, we explore new recommendation approaches that can increase the diversity of recommendations with only a minimal (negligible) accuracy loss using different recommendation ranking techniques, without losing any recommendations. While traditional recommender systems typically rank the relevant items by their predicted rating and recommend the most highly predicted items to each user, resulting in high accuracy, the proposed approaches consider additional factors, such as item popularity, when ranking the recommended item list, in order to substantially increase recommendation diversity while maintaining comparable levels of accuracy.

2. Related Work

Recommender systems typically operate in a two-dimensional space of users and items. Let U be the set of users of a recommender system, and let I be the set of all possible items that can be recommended to users. Then the utility function that represents the preference of item i∈I by user u∈U is often defined as R: U×I→Rating, where Rating typically represents some numeric scale used by the users to evaluate each item. Also, we use the R(u, i) notation to represent a known rating (i.e., the actual rating that user u gave to item i), and the R*(u, i) notation to represent the system-predicted rating for item i and user u.

Numerous metrics have been employed for measuring the accuracy of recommendations (Herlocker et al. 2004). In particular, precision is one of the most popular decision-support metrics; it measures the percentage of truly "high" ratings among those that were predicted to be "high" by the recommender system. The ratings in the data that we used for our experiments are integers between 1 and 5, inclusive, and accordingly we define the items with ratings greater than 3.5 (the threshold for "high" ratings, denoted by TH) as "highly-ranked", and the items with ratings less than 3.5 as "non-highly-ranked." Furthermore, in real-world settings, recommender systems typically recommend the N most highly ranked items, since users are usually interested in only the several most relevant recommendations; this list of N items for user u can be defined as LN(u) = {i1, ..., iN}, where R*(u, ik) ≥ TH for all k∈{1, 2, ..., N}. Therefore, in our paper, we evaluate the recommendation accuracy based on the percentage of truly "highly-ranked" ratings, denoted by correct(LN(u)), among those that were predicted to be the N most relevant "highly-ranked" items for each user, i.e., using the precision-in-top-N metric (Herlocker et al. 2004), formally written as:

$$\text{precision-in-top-}N = \sum_{u \in U} |correct(L_N(u))| \Big/ \sum_{u \in U} |L_N(u)|$$

where correct(LN(u)) = {i∈LN(u) | R(u, i) ≥ TH}. However, as mentioned earlier, relying on the accuracy of recommendations alone may not be enough to find the most relevant items for a user (McNee et al. 2006), and another important aspect can be measured by the diversity of recommendations.
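A minimal sketch of the precision-in-top-N metric as just defined, assuming the top-N lists and known ratings are held in plain dictionaries (an illustrative harness, not the authors' code):

```python
def precision_in_top_n(top_n_lists, known_ratings, threshold=3.5):
    """top_n_lists: dict user -> list of recommended item ids (the L_N(u) lists);
    known_ratings: dict (user, item) -> actual rating R(u, i).
    Counts recommended items whose actual rating is >= threshold (T_H);
    items without a known rating are treated as not correct here."""
    correct = total = 0
    for u, items in top_n_lists.items():
        for i in items:
            total += 1
            if known_ratings.get((u, i), 0) >= threshold:
                correct += 1
    return correct / total if total else 0.0
```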
However, as mentioned earlier, relying on the accuracy of recommendations alone may not be enough to find the most relevant items for a user (McNee et al. 2006), and another important aspect of recommendation quality can be measured by the diversity of recommendations. Since we intend to measure recommendation quality based on the top-N recommendation lists that the system provides to its users, in this paper we use the total number of distinct items recommended across all users as an aggregate diversity measure, which we will refer to as diversity-in-top-N and which can be expressed formally as:

$$\textit{diversity-in-top-N} = \Big|\, \bigcup_{u \in U} L_N(u) \,\Big|.$$
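The aggregate diversity measure then reduces to counting the distinct items that appear in any user's top-N list; a minimal sketch under the same assumptions as the snippet above:

```python
def diversity_in_top_n(top_n_lists):
    """Number of distinct items appearing in any user's top-N list."""
    distinct_items = set()
    for items in top_n_lists.values():
        distinct_items.update(items)
    return len(distinct_items)
```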
3. Improving Diversity By Re-Ranking Recommendation List

As mentioned earlier, traditional recommender systems recommend to user u a list of top-N items, LN(u), selected according to some ranking criterion. More formally, item ix is ranked ahead of item iy (i.e., ix ≺ iy) if rank(ix) < rank(iy), where rank: I → ℝ is a function representing the ranking criterion. The vast majority of current recommender systems use the predicted rating value as the ranking criterion (i.e., rankStandard), which shares its motivation with the widely accepted probability ranking principle in the information retrieval literature, where documents are ranked in order of decreasing probability of relevance (Robertson 1997). Note that, by definition, recommending the most highly predicted items is designed to help improve recommendation accuracy, but not recommendation diversity. Therefore, new ranking criteria are needed in order to achieve diversity improvement. Recommending less popular items intuitively should have the effect of increasing recommendation diversity, as illustrated by the example in Section 1. Following this motivation, we explore the possibility of using item popularity as a recommendation ranking criterion and its impact on recommendation accuracy and diversity.

3.1 Re-Ranking Recommendation List By Item Popularity

We define item popularity as the number of known ratings for each item, and the item popularity-based ranking approach (denoted as rankItemPop) recommends the least popular items to users. The results in Figure 1 show that the item popularity-based ranking approach increased recommendation diversity from 385 to 1395 (i.e., 3.6 times); however, recommendation accuracy dropped from 89% to 69%, as compared to the standard ranking approach. Here, despite the significant diversity gain, such a substantial accuracy loss (20%) would not be acceptable in most real-life personalization applications. Therefore, in the next subsection we introduce a general technique for parameterizing recommendation ranking approaches, which makes it possible to achieve significant diversity gains while controlling accuracy losses (e.g., according to how much loss is tolerable in a given application).

3.2 Controlling Accuracy-Diversity Trade-Off: Parameterized Ranking Approaches

The item popularity-based ranking approach, as well as the other ranking approaches proposed in this paper, is parameterized with a "ranking threshold" TR∈[TH, Tmax] (where Tmax is the largest possible rating on the rating scale, e.g., Tmax=5) to provide the user with the ability to choose a certain level of recommendation accuracy. In particular, given any ranking function rankX(i), the ranking threshold TR is used to create the parameterized version of this ranking function, rankX(i, TR), which is formally defined as:

$$\mathrm{rank}_X(i, T_R) = \begin{cases} \mathrm{rank}_X(i), & \text{if } R^*(u,i) \in [T_R, T_{\max}],\\ \alpha_u + \mathrm{rank}_{\mathrm{Standard}}(i), & \text{if } R^*(u,i) \in [T_H, T_R), \end{cases}$$

$$\text{where } \alpha_u = \max_{i \in I_u^*(T_R)} \mathrm{rank}_X(i) \quad \text{and} \quad I_u^*(T_R) = \{\, i \in I \mid R^*(u,i) \ge T_R \,\}.$$

Simply put, items that are predicted above ranking threshold TR are ranked according to rankX(i), while items that are below TR are ranked according to the standard ranking approach rankStandard(i). In addition, all items that are above TR are ranked ahead of all items that are below TR (as ensured by αu in the above definition). Therefore, choosing different TR values between TH and Tmax allows the user to set the desired balance between accuracy and diversity, as shown in Figure 1. For example, the item popularity-based ranking approach with ranking threshold 4.4 could reduce the accuracy loss to 1.32% while still obtaining an 83% diversity gain (from 385 to 703), compared to the standard ranking approach. An even higher threshold of 4.7 still makes it possible to achieve a 20% diversity gain (from 385 to 462) with only a 0.06% accuracy loss. Also note that, even when there are fewer than N items above the ranking threshold TR, by definition all the items above TR are recommended to a user, and the remaining top-N items are selected according to the standard ranking approach.
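A minimal sketch of this parameterized re-ranking for a single user is given below (our own illustrative code, not the authors' implementation; `predictions` and `rank_x` are assumed inputs): candidates predicted at or above TR are ordered by the chosen ranking function, and the remaining candidates follow in standard order, which plays the role of the αu offset in the definition above.

```python
# Illustrative sketch of rank_X(i, T_R) applied to one user's candidate items.
# predictions: {item: predicted rating R*(u, i)}; rank_x: item -> ranking value
# (lower value = ranked earlier); returns the re-ranked top-N list.

def rerank_top_n(predictions, rank_x, t_r, t_h=3.5, n=5):
    candidates = [i for i, r in predictions.items() if r >= t_h]
    above = sorted((i for i in candidates if predictions[i] >= t_r), key=rank_x)
    below = sorted((i for i in candidates if predictions[i] < t_r),
                   key=lambda i: -predictions[i])  # standard order: highest R* first
    return (above + below)[:n]
```

For instance, passing `rank_x=lambda i: num_known_ratings[i]` (with a hypothetical lookup of known-rating counts) would give the item popularity-based variant rankItemPop(i, TR).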
This ensures that all the ranking approaches proposed in this paper provide exactly the same number of recommendations as their corresponding baseline technique, which is also important from the experimental-analysis point of view, in order to allow a fair performance comparison of the different ranking techniques.

Figure 1. Performance of the item popularity-based approach and its parameterized versions (MovieLens data, top-5 items, item-based CF, 50 neighbors)
3.3 Additional Ranking Approaches

Here we introduce four additional ranking approaches that can be used as alternatives to rankStandard to improve recommendation diversity; the formal definitions of each ranking approach, as well as of the standard and item popularity-based ranking approaches, are provided in Figure 2. As seen from the empirical analysis of the positive relationship between average item popularity and predicted ratings in (Adomavicius and Kwon 2008), we also consistently observed that popular items, on average, are likely to have higher predicted ratings than less popular items, across different traditional recommendation techniques. Therefore, recommending items that are not as highly predicted (but still predicted to be above TH) likely implies recommending, on average, less popular items, potentially leading to diversity improvements; following this observation, we propose to use the predicted rating value itself as an item ranking criterion. Based on similar empirical observations, we propose a number of alternative ranking approaches, including ones based on average rating, absolute likeability, and relative likeability, as defined in Figure 2.

4. Empirical Results

The proposed recommendation ranking approaches were tested with the MovieLens (grouplens.org) and Netflix (netflixprize.com) data sets. We pre-processed each dataset to include only users and movies with a significant rating history, which makes it possible to have a sufficient number of highly predicted items to recommend to each user (in the test data). For each dataset, we randomly chose 60% of the ratings as training data and used them to predict the remaining 40% (i.e., the test data).

4.1 Performance of Proposed Ranking Approaches

All five proposed ranking approaches were used in conjunction with one of three widely popular recommendation techniques for rating prediction, including two heuristic-based (user-based and item-based "nearest neighbor") and one model-based (matrix factorization) collaborative filtering (CF) techniques (Sarwar et al. 2001, Funk 2006), to generate top-N (N=1, 5, 10) recommendations for each user. We set the predicted rating threshold to TH = 3.5 (out of 5) to ensure that only relevant items are recommended to users, and the ranking threshold TR was varied from 3.5 to 5.0. The performance of each ranking approach was measured in terms of precision-in-top-N and diversity-in-top-N and, for comparison purposes, its diversity gain and precision loss with respect to the standard ranking approach were calculated. Consistent with the accuracy-diversity trade-off discussed in the introduction, all the proposed ranking approaches improved the diversity of recommendations by sacrificing some accuracy. However, with each ranking approach, as the ranking threshold TR increases, the accuracy loss is significantly reduced (smaller precision loss) while substantial diversity improvement is still exhibited. Therefore, with different ranking thresholds, one can obtain different diversity gains for different levels of tolerable precision loss, as compared to the standard ranking approach. Following this idea, in our experiments we compare the effectiveness (i.e., diversity gain) of different ranking techniques for various precision loss levels (0.1-10%).
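As an illustration of this comparison (reusing the `precision_in_top_n`, `diversity_in_top_n`, and `rerank_top_n` sketches from earlier; all identifiers are our own assumptions, not the authors' experimental code), the precision loss and diversity gain of a parameterized ranking approach relative to the standard ranking could be computed roughly as follows:

```python
def top_n_standard(predictions, t_h=3.5, n=5):
    """Standard ranking: the N most highly predicted relevant items."""
    candidates = [i for i, r in predictions.items() if r >= t_h]
    return sorted(candidates, key=lambda i: -predictions[i])[:n]

def compare_to_standard(preds_by_user, known_ratings, rank_x, t_r, n=5):
    """Precision loss and diversity gain of rank_X(., T_R) vs. the standard ranking."""
    standard = {u: top_n_standard(p, n=n) for u, p in preds_by_user.items()}
    reranked = {u: rerank_top_n(p, rank_x, t_r, n=n) for u, p in preds_by_user.items()}
    precision_loss = (precision_in_top_n(standard, known_ratings)
                      - precision_in_top_n(reranked, known_ratings))
    diversity_gain = diversity_in_top_n(reranked) - diversity_in_top_n(standard)
    return precision_loss, diversity_gain
```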
• Standard, i.e., ranking the candidate (highly predicted) items by their predicted rating value, from highest to lowest: rankStandard(i) = R*(u, i)⁻¹.
• Item Popularity, i.e., ranking by item popularity, from lowest to highest, where popularity is represented by the number of known ratings that each item has: rankItemPop(i) = |U(i)|, where U(i) = {u∈U | ∃R(u, i)}.
• Reverse Predicted Rating Value, i.e., ranking by predicted rating value, from lowest to highest: rankRevPred(i) = R*(u, i).
• Item Average Rating, i.e., ranking by the average of all known ratings for each item: rankAvgRating(i) = R̄(i) = Σu∈U(i) R(u, i) / |U(i)|.
• Item Absolute Likeability, i.e., ranking by how many users liked the item (i.e., rated it above TH): rankAbsLike(i) = |UH(i)|, where UH(i) = {u∈U(i) | R(u, i) ≥ TH}.
• Item Relative Likeability, i.e., ranking by the percentage of users who liked an item (among all users who rated it): rankRelLike(i) = |UH(i)| / |U(i)|.

Figure 2. Various ranking functions
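To make the definitions in Figure 2 concrete, the ranking functions can be sketched as follows (our own illustrative code; `known_ratings` and `predicted` are hypothetical inputs). Following the paper's convention that item ix is ranked ahead of iy when rank(ix) < rank(iy), a smaller returned value means the item is recommended first.

```python
from collections import defaultdict

T_H = 3.5  # threshold for "liked" (highly-ranked) items

def make_ranking_functions(known_ratings, predicted):
    """Build the Figure 2 ranking functions for one user.

    known_ratings: {(user, item): rating} over the training data;
    predicted: {item: R*(u, i)} for the current user's candidate items.
    """
    raters, likers, rating_sum = defaultdict(set), defaultdict(set), defaultdict(float)
    for (u, i), r in known_ratings.items():
        raters[i].add(u)
        rating_sum[i] += r
        if r >= T_H:
            likers[i].add(u)
    # Assumes every candidate item has at least one known rating.
    return {
        "Standard":  lambda i: 1.0 / predicted[i],               # highest R* first
        "ItemPop":   lambda i: len(raters[i]),                   # least-rated first
        "RevPred":   lambda i: predicted[i],                     # lowest R* first
        "AvgRating": lambda i: rating_sum[i] / len(raters[i]),   # lowest average rating first
        "AbsLike":   lambda i: len(likers[i]),                   # fewest "likers" first
        "RelLike":   lambda i: len(likers[i]) / len(raters[i]),  # lowest like ratio first
    }
```

Any of these can be passed as `rank_x` to the `rerank_top_n` sketch above, e.g., `rerank_top_n(predicted, make_ranking_functions(known_ratings, predicted)["ItemPop"], t_r=4.4)`.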