Relational Collaborative Topic Regression for Recommender Systems
Hao Wang and Wu-Jun Li, Member, IEEE

Abstract—Due to its successful application in recommender systems, collaborative filtering (CF) has become a hot research topic in data mining and information retrieval. In traditional CF methods, only the feedback matrix, which contains either explicit feedback (also called ratings) or implicit feedback on the items given by users, is used for training and prediction. Typically, the feedback matrix is sparse, which means that most users interact with few items. Due to this sparsity problem, traditional CF with only feedback information will suffer from unsatisfactory performance. Recently, many researchers have proposed to utilize auxiliary information, such as item content (attributes), to alleviate the data sparsity problem in CF. Collaborative topic regression (CTR) is one of these methods which has achieved promising performance by successfully integrating both feedback information and item content information. In many real applications, besides the feedback and item content information, there may exist relations (also known as networks) among the items which can be helpful for recommendation. In this paper, we develop a novel hierarchical Bayesian model called Relational Collaborative Topic Regression (RCTR), which extends CTR by seamlessly integrating the user-item feedback information, item content information, and network structure among items into the same model. Experiments on real-world datasets show that our model can achieve better prediction accuracy than the state-of-the-art methods with lower empirical training time. Moreover, RCTR can learn good interpretable latent structures which are useful for recommendation.

Index Terms—Collaborative filtering, topic models, recommender system, social network, relational learning

1 INTRODUCTION
RECOMMENDER systems (RS) play an important role to enable us to make effective use of information.
For example, Amazon [28] adopts RS for product recommendations, and Netflix [8] uses RS for movie recommendations. Existing RS methods can be roughly categorized into three classes [1], [4], [10], [41]: content-based methods, collaborative filtering (CF) based methods, and hybrid methods. Content-based methods [5], [26] adopt the profiles of the users or products for recommendation. CF based methods [7], [11], [13], [18], [27], [30], [33], [35], [37], [39], [42] use past activities or preferences, such as the ratings on items given by users, for prediction, without using any user or product profiles. Hybrid methods [2], [3], [6], [12], [15], [16], [17], [20], [32], [34], [40], [43], [47], [48] combine both content-based methods and CF based methods by ensemble techniques. Due to privacy issues, it is in general harder to collect user profiles than past activities. Hence, CF based methods have become more popular than content-based methods in recent years.

In most traditional CF methods, only the feedback matrix, which contains either explicit feedback (also called ratings) or implicit feedback [21] on the items given by users, is used for training and prediction. Typically, the feedback matrix is sparse, which means that most items are given feedback by few users or most users only give feedback to few items. Due to this sparsity problem, traditional CF with only feedback information will suffer from unsatisfactory performance. More specifically, it is difficult for CF methods to achieve good performance in both the item-oriented setting and the user-oriented setting when the feedback matrix is sparse. In an item-oriented setting where we need to recommend users to items, it is generally difficult to know which users could like an item if it has only been given feedback by one or two users. This adds to the difficulty companies face when promoting new products (items).
Moreover, users' ignorance of new items will result in less feedback on the new items, which will further harm the accuracy of their recommendations. For the user-oriented setting where we recommend items to users, it is also difficult to predict what a user likes if the user has only given feedback to one or two items. However, in the real world, it is common to find that most users provide only a little feedback. Actually, providing good recommendations for new users with little feedback is more important than for frequent users, since new users will only come back to the site (service) depending on how good the recommendation is, whereas frequent users are most likely already satisfied with the site (service). If we manage to boost the recommendation accuracy for new or infrequent users, more of them will become frequent users, and then better recommendations can be expected with more training data. Therefore, improving the recommendation accuracy in an extremely sparse setting is key to getting recommender systems working in a positive cycle.

H. Wang is with the Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong. E-mail: hwangaz@cse.ust.hk.
W.-J. Li is with the National Key Laboratory of Novel Software Technology, Department of Computer Science and Technology, Nanjing University, Nanjing 210023, China. E-mail: liwujun@nju.edu.cn.
Manuscript received 24 July 2013; revised 29 Aug. 2014; accepted 6 Oct. 2014. Date of publication 29 Oct. 2014; date of current version 27 Mar. 2015. Recommended for acceptance by Y. Koren.
For information on obtaining reprints of this article, please send e-mail to: reprints@ieee.org, and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TKDE.2014.2365789.
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 27, NO. 5, MAY 2015. 1041-4347 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

To overcome the sparsity problem of CF-based models, many researchers have proposed to integrate auxiliary
information into the model training and prediction procedures. Some methods [45], [51] utilize the item content (attributes) to facilitate the CF training. One representative of these methods is collaborative topic regression (CTR) [45], which jointly models the user-item feedback matrix and the item content information (texts of articles). CTR seamlessly incorporates topic modeling [9] with CF to improve the performance and interpretability. For new items, CTR is able to perform out-of-matrix prediction (cold-start prediction) [38], [51] using only the content information. Some other methods [22], [29], [31] try to use social networks among users to improve the performance. Among these methods, CTR-SMF [31] extends CTR by integrating the social network among users into CTR with social matrix factorization (SMF) [29] techniques, which has achieved better performance than CTR.

In many real applications, besides the feedback and item content information, there may exist relations (or networks) among the items [44], [46] which can also be helpful for recommendation. For example, if we want to recommend papers (references) to users in CiteULike,1 the citation relations between papers are informative for recommending papers with similar topics. Other examples of item networks can be found in hyperlinks among webpages, movies directed by the same director, and so on.

In this paper, we develop a novel hierarchical Bayesian model, called Relational Collaborative Topic Regression (RCTR), to incorporate item relations for recommendation. The main contributions of RCTR are outlined as follows:

● By extending CTR, RCTR seamlessly integrates the user-item feedback information, item content information, and relational (network) structure among items into a principled hierarchical Bayesian model.
● Even if a new item has been given feedback by only one or two users, RCTR can make effective use of the information from the item network to alleviate the data sparsity problem in CF, which will consequently improve the recommendation accuracy.
● In cases where a new user has given feedback to only one or two items, RCTR can also make effective use of the information from the item network to improve the recommendation accuracy.
● In RCTR, a family of link (relation) probability functions is proposed to model the relations between items. This extension from discrete link probability functions, like those in [14], to a family of link probability functions increases the modeling capacity of RCTR, with better performance.
● Compared with CTR, a smaller number of learning iterations is needed for RCTR to achieve satisfactory accuracy. As a consequence, the total empirically measured runtime of training RCTR is lower than that of training CTR, even though the time complexity of each RCTR iteration is slightly higher than that of CTR.
● RCTR can learn good interpretable latent structures which are useful for recommendation.
● Experiments on real-world datasets show that RCTR can achieve higher prediction accuracy than the state-of-the-art methods.

Note that CTR-SMF [31] tries to integrate user relations into the CTR model, which is different from the application scenario of our RCTR model.

The rest of this paper is organized as follows. In Section 2, we briefly introduce the background of CF and CTR. Section 3 presents the details of our proposed RCTR model. Experimental results are described in Section 4, and finally Section 5 concludes the paper.

2 BACKGROUND
In this section, we give a brief introduction to the background of RCTR, including CF based recommendation, matrix factorization (MF) (also called latent factor model) based CF methods [25], [35], [36], [49], [50], and CTR [45].
2.1 CF Based Recommendation
The task of CF is to recommend items to users based on the users' past feedback. For example, we can deploy a recommender system to recommend papers (references) to researchers in CiteULike. Here users are researchers and items are papers. As in [45], we assume there are I users and J items in this paper. The feedback on item j given by user i is denoted by $r_{ij}$. Although our model is general enough to be used for other settings with explicit feedback, such as the case with integer ratings ranging from 1 to 5, we assume $r_{ij} \in \{0, 1\}$ in this paper, which is the same setting as that in CTR [45]. Note that this is a setting with implicit feedback as introduced in [21]. This means that our model tries to predict whether a user likes an item or not. In the training data, $r_{ij} = 1$ means that user i likes item j, and $r_{ij} = 0$ means that the element is unobserved (missing), i.e., we do not know whether user i likes item j or not. As stated in Section 1, CF methods use only the feedback matrix $\{r_{ij} \mid i = 1, 2, \ldots, I;\ j = 1, 2, \ldots, J\}$ for training and prediction.

There are two different cases of prediction [45]: in-matrix prediction and out-of-matrix prediction. For the item-oriented setting, in-matrix prediction tries to make recommendations for items with at least one feedback from the users in the training data. On the contrary, out-of-matrix prediction tries to make recommendations for items without any feedback in the training data. The in-matrix prediction and out-of-matrix prediction for user-oriented settings are similar, except that we make recommendations for users rather than items. Out-of-matrix prediction is actually the so-called cold-start recommendation in some of the literature [38], [51].

2.2 Matrix Factorization for CF
The existing CF methods can be divided into two main categories [24]: memory-based methods [18], [28], [37] and model-based methods [19], [25], [35].
Memory-based methods adopt the (weighted) average of the feedback of similar (neighborhood) users or items for prediction, while model-based methods try to learn a statistical model from the training data. Many works have verified that model-based methods can outperform memory-based methods in general. Hence, model-based methods have become more popular in recent years.

1. http://www.citeulike.org/
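To make the contrast concrete, the memory-based approach described above can be sketched in a few lines: predict user i's feedback on item j as a similarity-weighted average of i's feedback on the other items. This is a minimal illustration only; the cosine-similarity weighting is one common choice, and the data used below are hypothetical.

```python
import numpy as np

def item_based_predict(R, i, j, eps=1e-8):
    """Memory-based (item-oriented) CF: predict user i's feedback on item j
    as the cosine-similarity-weighted average of i's feedback on other items.
    R is an I x J binary feedback matrix."""
    target = R[:, j]
    scores, weights = 0.0, 0.0
    for k in range(R.shape[1]):
        if k == j:
            continue
        other = R[:, k]
        # cosine similarity between item columns j and k
        sim = other @ target / (np.linalg.norm(other) * np.linalg.norm(target) + eps)
        scores += sim * R[i, k]
        weights += abs(sim)
    return scores / (weights + eps)

# Hypothetical 4-user x 4-item feedback matrix: items 0-2 are co-liked,
# item 3 is liked by a disjoint group of users.
R = np.array([[1, 0, 1, 0],
              [1, 1, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 0, 1]], dtype=float)
```

A model-based method would instead fit latent vectors to R, as described in the next section.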
Matrix factorization [25], [35] and its extensions, such as probabilistic matrix factorization (PMF) [35], are the most representative model-based methods, and they have achieved promising performance in practice. The basic idea of MF is to use latent vectors in a low-dimensional space to represent the users and items. More specifically, user i is represented by a latent vector $u_i \in \mathbb{R}^K$ of dimensionality K, and item j is represented by a latent vector $v_j \in \mathbb{R}^K$. The prediction of the feedback on item j given by user i can be computed as follows:

$$\hat{r}_{ij} = u_i^T v_j.$$

If we use two latent matrices $U = (u_i)_{i=1}^{I}$ and $V = (v_j)_{j=1}^{J}$ to denote the latent vectors for all the users and items respectively, MF learns the optimal U and V by optimizing the following objective function:

$$\min_{U, V} \sum_{i=1}^{I} \sum_{j=1}^{J} \left( r_{ij} - u_i^T v_j \right)^2 + \lambda_u \sum_{i=1}^{I} \| u_i \|^2 + \lambda_v \sum_{j=1}^{J} \| v_j \|^2, \qquad (1)$$

where $\|\cdot\|$ denotes the $\ell_2$ norm of a vector, and $\lambda_u$ and $\lambda_v$ are regularization parameters for controlling model complexity. The objective function in (1) corresponds to the maximum a posteriori (MAP) estimate of the PMF model in [35].

In [45], a generalization of the PMF model is proposed:

$$u_i \sim \mathcal{N}(0, \lambda_u^{-1} I_K), \quad v_j \sim \mathcal{N}(0, \lambda_v^{-1} I_K), \quad r_{ij} \sim \mathcal{N}(u_i^T v_j, c_{ij}^{-1}), \qquad (2)$$

where $\mathcal{N}(\cdot)$ denotes the normal distribution, $I_K$ is an identity matrix with K rows and columns, and $c_{ij}$ is defined as follows:

$$c_{ij} = \begin{cases} a, & \text{if } r_{ij} = 1, \\ b, & \text{if } r_{ij} = 0, \end{cases}$$

with a and b being tuning parameters and $a > b > 0$. Please note that all the (i, j) pairs with feedback 0 in the training set are used for training in this paper. We can also sample part of them for training in cases where the number of zeros in the feedback matrix is too large.

MF methods have achieved promising performance in practice. However, they also suffer from the sparsity problem. Furthermore, as stated in [45], it is not easy for MF methods to perform out-of-matrix prediction.
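MAP estimation for the weighted model in (2) can be sketched with alternating least squares: fixing V, each $u_i$ has the closed-form update $u_i = (V^T C_i V + \lambda_u I_K)^{-1} V^T C_i R_i$, where $C_i$ is the diagonal matrix of confidences $c_{ij}$, and symmetrically for each $v_j$. The numpy sketch below is a minimal illustration; the function name and all hyperparameter values are illustrative choices, not the paper's settings.

```python
import numpy as np

def fit_pmf_map(R, K=5, a=1.0, b=0.01, lam_u=0.1, lam_v=0.1, n_iters=20, seed=0):
    """MAP estimation for the weighted-MF model in Eq. (2) via alternating
    least squares. R is a binary I x J feedback matrix; the confidence is
    c_ij = a where r_ij = 1 and c_ij = b where r_ij = 0 (a > b > 0)."""
    rng = np.random.default_rng(seed)
    I, J = R.shape
    U = 0.1 * rng.standard_normal((I, K))
    V = 0.1 * rng.standard_normal((J, K))
    C = np.where(R == 1, a, b)  # confidence weights c_ij
    for _ in range(n_iters):
        # u_i = (V^T C_i V + lam_u I)^{-1} V^T C_i r_i
        for i in range(I):
            Ci = C[i]                                   # (J,)
            A = (V * Ci[:, None]).T @ V + lam_u * np.eye(K)
            rhs = (V * Ci[:, None]).T @ R[i]
            U[i] = np.linalg.solve(A, rhs)
        # symmetric update for each item vector v_j
        for j in range(J):
            Cj = C[:, j]                                # (I,)
            A = (U * Cj[:, None]).T @ U + lam_v * np.eye(K)
            rhs = (U * Cj[:, None]).T @ R[:, j]
            V[j] = np.linalg.solve(A, rhs)
    return U, V

# Predicted feedback matrix: r_hat_ij = u_i^T v_j
# R_hat = U @ V.T
```

Because $b$ is small, unobserved zeros only weakly pull predictions toward 0, which is what makes the implicit-feedback setting tractable with all (i, j) pairs included.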
2.3 Collaborative Topic Regression
Collaborative topic regression [45] is proposed to recommend documents (papers) to users by seamlessly integrating both the feedback matrix and the item (document) content information into the same model, which can address the problems faced by MF based CF. By combining MF and latent Dirichlet allocation (LDA) [9], CTR achieves better prediction performance than MF based CF, with more interpretable results. Moreover, with the item content information, CTR can predict feedback for out-of-matrix items. The graphical model of CTR is shown in Fig. 1.

CTR introduces an item latent offset $\epsilon_j$ between the topic proportions $\theta_j$ in LDA and the item latent vector $v_j$ in CF. The offset can be explained by the gap between what the article is about (represented by $\theta_j$) and what the users think of it (represented by $v_j$), as discussed in [45]. If we use $\beta = \beta_{1:K}$ to denote the K topics, the generative process of CTR can be listed as follows:

1) Draw a user latent vector for each user i: $u_i \sim \mathcal{N}(0, \lambda_u^{-1} I_K)$.
2) For each item j,
   a) Draw topic proportions $\theta_j \sim \text{Dirichlet}(\alpha)$.
   b) Draw item latent offset $\epsilon_j \sim \mathcal{N}(0, \lambda_v^{-1} I_K)$, then set the item latent vector to be: $v_j = \epsilon_j + \theta_j$.
   c) For each word in the document (item) $w_j$, which is denoted as $w_{jn}$,
      i) Draw topic assignment $z_{jn} \sim \text{Mult}(\theta_j)$.
      ii) Draw word $w_{jn} \sim \text{Mult}(\beta_{z_{jn}})$.
3) Draw the feedback $r_{ij}$ for each user-item pair (i, j): $r_{ij} \sim \mathcal{N}(u_i^T v_j, c_{ij}^{-1})$.

As mentioned in [45], the key to CTR lies in the item latent offset $\epsilon_j$, which keeps the item latent vector $v_j$ close enough to the topic proportions $\theta_j$ while allowing it to diverge when necessary. The parameter $\lambda_v$ controls how close $v_j$ is to $\theta_j$. Experiments on scientific article recommendation from CiteULike show that CTR can outperform MF based CF methods.

3 RELATIONAL COLLABORATIVE TOPIC REGRESSION
In this section, we describe the details of our proposed model, called Relational Collaborative Topic Regression.
Besides the feedback and item content information modeled by CTR, RCTR can also model the relations among the items, which are informative for recommendation.

3.1 Model Formulation

To better illustrate the graphical model of RCTR, we adopt a way different from that in Fig. 1, which is adopted by the authors of CTR [45]. The graphical model of RCTR is shown in Fig. 2, in which the component in the dashed rectangle is what differentiates RCTR from CTR. The generative process of RCTR is as follows:

1) Draw a user latent vector for each user i: u_i ~ N(0, λ_u^{-1} I_K).
2) For each item j,
   a) Draw topic proportions θ_j ~ Dirichlet(α).
   b) Draw item latent offset ε_j ~ N(0, λ_v^{-1} I_K), then set the item latent vector to be v_j = ε_j + θ_j.
   c) Draw item relational offset τ_j ~ N(0, λ_r^{-1} I_K), then set the item relational vector to be s_j = τ_j + v_j.

Fig. 1. The graphical model of collaborative topic regression.
Fig. 2. The graphical model of RCTR. The component in the dashed rectangle is what differentiates RCTR from CTR.
Fig. 3. The graphical model of the degenerated RCTR. The component in the dashed rectangle is what differentiates RCTR from CTR.

   d) For each word in the document (item) w_j, which is denoted as w_jn,
      i) Draw topic assignment z_jn ~ Mult(θ_j).
      ii) Draw word w_jn ~ Mult(β_{z_jn}).
3) Draw the parameter η⁺ ~ N(0, λ_e^{-1} I_{K+1}).
4) For each pair of items (j, j′), draw a binary link indicator l_{j,j′} | s_j, s_{j′} ~ ψ(· | s_j, s_{j′}, η⁺).
5) Draw the feedback r_ij for each user-item pair (i, j): r_ij ~ N(u_i^T v_j, c_ij^{-1}).

In the above generative process, the link probability function is defined as follows:

  ψ(l_{j,j′} = 1 | s_j, s_{j′}, η⁺) = [σ(η^T (s_j ∘ s_{j′}) + ν)]^ρ,   (3)

where l_{j,j′} is a binary value: l_{j,j′} = 1 means there exists a relation (link) between item j and item j′, and l_{j,j′} = 0 otherwise. ν is a scalar value representing the offset, η⁺ = ⟨η, ν⟩ with ⟨·,·⟩ being the vector-scalar concatenation, the operator ∘ denotes element-wise vector multiplication, and σ(·) represents the sigmoid function, which is defined as

  σ(x) = exp(x) / (1 + exp(x)).
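The link probability function in (3) is straightforward to evaluate; the only non-standard ingredients are the element-wise product and the ρ exponent. A minimal sketch with hypothetical names:

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, exp(x) / (1 + exp(x))."""
    return 1.0 / (1.0 + np.exp(-x))

def link_probability(s_j, s_jp, eta, nu, rho):
    """psi(l_{j,j'} = 1 | s_j, s_j', eta+) from (3):
    [sigma(eta^T (s_j o s_j') + nu)]^rho, with o the element-wise product."""
    return sigmoid(eta @ (s_j * s_jp) + nu) ** rho
```

With ρ = 1, ν = 0 and η the all-ones vector, this reduces to σ(s_j^T s_{j′}), the special case discussed in Section 3.5.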
Steps 2(c), 3, and 4 in the above generative process are what differentiates RCTR from CTR. The item relational offset τ_j in Step 2(c) is one of the key properties of RCTR. Similar to the item latent offset, τ_j makes it possible for s_j to diverge from the item latent vector v_j if necessary. While the item latent vector v_j represents what the users think item j is about, the item relational vector s_j represents the effect other items have on item j. The larger λ_r is, the more likely it is that v_j and s_j are close to each other. When λ_r goes to infinity, the model degenerates to the case with v_j = s_j, which is shown in Fig. 3. In the experiments presented in Section 4 (Fig. 21), we show that the RCTR model in Fig. 2 outperforms the degenerated model in Fig. 3, which verifies the effectiveness of the item relational offset τ_j. Note that for simplicity and fair comparison, we use the same Gaussian feedback model in Step 5 as in [45]. Since the feedback is binary, it is possible to further improve the performance by using a logistic feedback model.

3.2 Learning

Based on the RCTR model in Fig. 2, we may treat all the parameters as random variables and resort to fully Bayesian methods, such as Markov chain Monte Carlo (MCMC) methods and variational methods [23], for learning and inference. However, we do not adopt fully Bayesian methods here for RCTR because these methods typically incur high computational cost. Furthermore, because the focus of this paper is to illustrate the importance of the relations among items, which are not utilized in CTR, it is more reasonable to adopt the same learning strategy as that in CTR for learning and inference. Hence, as in CTR, a maximum a posteriori (MAP) estimate is adopted for parameter learning in RCTR. The MAP estimate tries to maximize the log-posterior of U, V, η⁺, s_{1:J}, θ_{1:J}, and β given the hyper-parameters ρ, λ_u, λ_v, λ_r, and λ_e:

  L = ρ Σ_{l_{j,j′}=1} log σ(η^T (s_j ∘ s_{j′}) + ν) − (λ_u/2) Σ_i u_i^T u_i − (λ_v/2) Σ_j (v_j − θ_j)^T (v_j − θ_j) − (λ_r/2) Σ_j (s_j − v_j)^T (s_j − v_j) − (λ_e/2) η⁺^T η⁺ + Σ_j Σ_n log Σ_k θ_{jk} β_{k,w_jn} − Σ_{i,j} (c_ij/2) (r_ij − u_i^T v_j)².   (4)

A constant is omitted and the hyper-parameter α of the topic model is set to 1 as in CTR. This objective function can be optimized using coordinate ascent. Since L is not jointly convex in all the variables, we design an alternating algorithm to learn the parameters. More specifically, each time we optimize one parameter with all the other parameters fixed.
For u_i and v_j, by setting the gradient of L to zero, we get the following update rules:

  u_i ← (V C_i V^T + λ_u I_K)^{-1} V C_i R_i,
  v_j ← (U C_j U^T + λ_v I_K + λ_r I_K)^{-1} (U C_j R_j + λ_v θ_j + λ_r s_j),
where C_i is a diagonal matrix with {c_ij | j = 1, ..., J} being its diagonal entries, and R_i = (r_ij | j = 1, 2, ..., J) is a column vector containing all the feedbacks of user i. Note that c_ij is defined in (2), which reflects the confidence controlled by a and b, as discussed in [21].

Fig. 4. A comparison of link probability functions with different ρ. The curves plot the probability of l_{j,j′} = 1 as a function of the inner product of s_j and s_{j′}. η is set to 1 and ν is adapted so that all functions have the same starting point.

For s_j and η⁺, since we cannot directly take the gradients of L with respect to s_j or η⁺ and set them to zero, gradient ascent is used to update these variables. The gradient of L with respect to s_j is

  ∇_{s_j} L = ρ Σ_{l_{j,j′}=1} (1 − σ(η^T (s_j ∘ s_{j′}) + ν)) (η ∘ s_{j′}) − λ_r (s_j − v_j).

The gradient of L with respect to η⁺ is

  ∇_{η⁺} L = ρ Σ_{l_{j,j′}=1} (1 − σ(η⁺^T π⁺_{j,j′})) π⁺_{j,j′} − λ_e η⁺,

where we denote π⁺_{j,j′} = ⟨s_j ∘ s_{j′}, 1⟩.
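The gradient-ascent update for s_j can be sketched as follows (hypothetical helper; only the links of item j contribute to the sum, mirroring the expression for ∇_{s_j} L above):

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, exp(x) / (1 + exp(x))."""
    return 1.0 / (1.0 + np.exp(-x))

def grad_s_j(s_j, v_j, linked, eta, nu, rho, lam_r):
    """Gradient of the objective with respect to s_j.

    linked: the relational vectors s_j' of the items linked to item j."""
    g = -lam_r * (s_j - v_j)  # pull s_j toward v_j
    for s_jp in linked:
        g = g + rho * (1.0 - sigmoid(eta @ (s_j * s_jp) + nu)) * (eta * s_jp)
    return g

# one gradient-ascent step with a (hypothetical) learning rate lr:
# s_j = s_j + lr * grad_s_j(s_j, v_j, linked, eta, nu, rho, lam_r)
```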
For θ_j, we first define q(z_jn = k) = φ_jnk as in [45], and then apply Jensen's inequality after the terms containing θ_j are separated:

  L(θ_j) ≥ − (λ_v/2) (v_j − θ_j)^T (v_j − θ_j) + Σ_n Σ_k φ_jnk (log θ_jk β_{k,w_jn} − log φ_jnk) = L(θ_j, φ_j).

Here φ_j = (φ_jnk)_{n=1,k=1}^{N,K}. Obviously L(θ_j, φ_j) is a tight lower bound of L(θ_j), and we can use projection gradient to optimize θ_j. The optimal φ_jnk satisfies

  φ_jnk ∝ θ_jk β_{k,w_jn}.

As for the parameter β, we follow the same M-step update as in LDA:

  β_kw ∝ Σ_j Σ_n φ_jnk 1[w_jn = w].

3.3 Prediction

Let D be the observed test data. Similar to [45], we use the point estimates of u_i, θ_j and ε_j to calculate the predicted feedback:

  E[r_ij | D] ≈ E[u_i | D]^T (E[θ_j | D] + E[ε_j | D]),

where E(·) denotes the expectation operation. For in-matrix prediction,

  r̂*_ij ≈ (u*_i)^T (θ*_j + ε*_j) = (u*_i)^T v*_j.

For out-of-matrix prediction, since E[ε_j] = 0, we have

  r̂*_ij ≈ (u*_i)^T θ*_j.
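The point-estimate predictions above reduce to inner products, so recommending the top M items for a user is a single ranking step. A sketch with hypothetical names (out-of-matrix items use θ_j only, since E[ε_j] = 0):

```python
import numpy as np

def recommend_top_m(u_i, Theta, Eps, in_matrix, M=300):
    """Rank items for one user by the point-estimate predicted feedback.

    Theta (J x K): topic proportions; Eps (J x K): item latent offsets;
    in_matrix: boolean mask, False for out-of-matrix items, whose
    offset expectation is zero."""
    V = Theta + np.where(in_matrix[:, None], Eps, 0.0)
    scores = V @ u_i  # predicted feedback u_i^T v_j for every item j
    return np.argsort(-scores)[:M]  # indices of the top-M items
```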
3.4 Time Complexity

According to the update rules in the RCTR learning procedure, for each iteration the time complexity of updating η⁺ is O(KL), where K is the dimensionality of the latent factor space and L is the total number of relations (links) in the relation network. The cost of updating the item relational matrix S = {s_j | j = 1, 2, ..., J} is also O(KL) for each iteration. The complexity of updating the other variables is the same as that in CTR. For U, the time complexity is O(IK³ + IJK²), and for V it is O(JK³ + IJK²), where I is the number of users and J is the number of items. In each iteration, RCTR only adds O(KL) of extra time complexity to CTR. Since the relation network is typically sparse, which means L can be treated as a constant multiple of J, the extra time cost of RCTR is minimal.

From our experiments, we find that RCTR needs fewer learning iterations than CTR to achieve satisfactory accuracy. As a consequence, the total empirically measured training time of RCTR is lower than that of CTR, even though the time complexity of each iteration of RCTR is slightly higher than that of CTR. This is verified by the experimental results in Section 4.7.

3.5 Discussion on Link Probability Function

Another key property of RCTR is the family of link probability functions, which is inspired by the relational topic model (RTM) [14]. The authors of RTM [14] find that different link probability functions can achieve different prediction performance. In RCTR, we use a single parameter ρ to control the choice of the link probability function. Since ρ is a non-negative real number, the family of link probability functions actually contains an infinite number of candidate link probability functions, whereas only two link probability functions are proposed in [14]. Hence, our new family of link probability functions can increase the modeling capacity of RCTR, and consequently better performance can be expected. From the perspective of optimization, ρ can simply be regarded as a necessary regularization hyper-parameter that controls the tradeoff between relations and the other observations, which can easily be seen from (4). A comparison between link probability functions with different ρ is shown in Fig. 4, from which we can see that our link probability functions are flexible enough to model different cases.

Note that if ρ = 1, our link probability function degenerates to one of the link probability functions mentioned in RTM [14]. Moreover, when ρ = 1, ν = 0 and η is a vector with all elements being one, ψ(l_{j,j′} = 1 | s_j, s_{j′}, η⁺) = σ(s_j^T s_{j′}), which is the link probability function in CTR-SMF [31].
The parameters ν and η make our link probability function more flexible than those in CTR-SMF. In our experiments we find that the elements of the learned vector η are not identical and ν ≠ 0, which means that the parameters η and ν are necessary for differentiating the importance of different elements of the latent vector in deciding whether or not two items are linked.

4 EXPERIMENTS

We design several experiments and compare the prediction performance between RCTR and the state-of-the-art methods on two real-world datasets. The questions we are trying to answer are:

- To what degree does RCTR outperform the state-of-the-art methods, especially when the data is extremely sparse?
- To what degree does the family of link probability functions help improve the prediction performance?
- How is the prediction performance affected by the relational parameter λ_r and other parameters?

TABLE 1
Description of Datasets

                     citeulike-a    citeulike-t
  #users                 5,551          7,947
  #items                16,980         25,975
  #tags                 19,107         52,946
  #citations            44,709         32,565
  #user-item pairs     204,987        134,860
  sparsity              99.78%         99.93%
  #relations           549,447        438,722

4.1 Datasets

We use two real-world datasets to conduct our experiments.
Both of them are from CiteULike,² but they are collected in different ways, with different scales and degrees of sparsity. For the feedback matrix in the datasets, if a user reads (or posts) a paper, the corresponding feedback is 1. Otherwise, if a user has not read (or posted) a paper, the corresponding feedback is missing (denoted by 0). The first dataset, citeulike-a, is from [45]. Note that the original dataset in [45] does not contain relations between items; we collect the items' relational information from CiteULike and Google Scholar.³ The second dataset, citeulike-t, we collect independently of the first one. We manually select 273 seed tags and collect all the articles with at least one of these tags. We also crawl the citations between the articles from Google Scholar. Note that the final number of tags associated with all the collected articles is far more than the number (273) of seed tags. Similar to [45], we remove any users with fewer than three articles. The description of these two datasets is shown in Table 1. We can see that the numbers of users and items in our collected citeulike-t dataset are larger than those of citeulike-a. Furthermore, the ratios of non-missing entries (equal to 1 − sparsity) in the user-item matrices of citeulike-a and citeulike-t are 0.0022 and 0.0007 respectively, which means that the second dataset is sparser than the first.

The text information (item content) of citeulike-a is preprocessed by following the same procedure as that in [45], and we also use the articles' titles and abstracts for the text information of citeulike-t. After removing the stop words, we choose the top 20,000 discriminative words according to their tf-idf values as our vocabulary.

Besides the citation relations between papers, we find that the tags associated with papers are also an informative signal about the relations between papers. Hence, we use both citations and tags to construct a single relation network (graph) for each dataset. For each dataset, we first construct a tag graph with a threshold of 4; that is, if two papers have four or more tags in common, they are linked in the tag graph. Then we impose an OR operation on the tag graph and the citation graph to get the final relation graph for the dataset. More specifically, if a relation exists in either the tag graph or the citation graph, it also exists in the final relation graph. The final numbers of relations (links) in the relation graphs are shown in the last row of Table 1.
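The OR-combination of the tag graph and the citation graph can be sketched as follows (the input representation and helper name are hypothetical):

```python
def build_relation_graph(tags, citations, threshold=4):
    """Combine the tag graph and the citation graph with an OR.

    tags: dict mapping item id -> set of its tags;
    citations: set of frozenset({j, j'}) pairs for citation links.
    Two items are linked in the tag graph if they share at least
    `threshold` tags; the final graph is the union of both graphs."""
    items = sorted(tags)
    relations = set(citations)
    for idx, j in enumerate(items):
        for jp in items[idx + 1:]:
            if len(tags[j] & tags[jp]) >= threshold:  # tag-graph edge
                relations.add(frozenset((j, jp)))
    return relations
```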
4.2 Evaluation Scheme

We design evaluation schemes to evaluate the models in both user-oriented and item-oriented settings. For the user-oriented setting:

- Select some percentage Q (e.g., 10 percent) of the users as test users. The training set contains two parts: one part includes all feedbacks of the other (1 − Q) of the users, and the other part includes P positive feedbacks (with value 1) for each test user.
- Perform prediction on the remaining feedbacks of the test users.
- Repeat the above procedure for 1/Q rounds. For each round, we select different test users. For example, if Q = 10%, we perform 1/Q = 10 rounds of tests. This is equivalent to a 10-fold cross validation procedure where each user appears exactly once in a test set. If P is small, the test set actually contains some new users with little feedback.

We evaluate the predictive performance in two cases: Q = 10% and Q = 100%. The case Q = 10% means that the recommender system has been running for a long time and only a small number of users are new. The case Q = 100% means that the system has been online for only a while and most of the users are new. As stated in Section 1, providing good recommendations for new users with little feedback is more important than for frequent users. Hence, it is more interesting to study the performance of recommendation algorithms in extremely sparse settings.

2. CiteULike allows users to create their own collections of articles. There are abstracts, titles, and tags for each article. Other information like authors, groups, posting time, and keywords is not used in this paper. The details can be found at http://www.citeulike.ort/faq/data.adp.
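One round of the user-oriented scheme can be sketched as follows (hypothetical helper; the scheme above repeats this for 1/Q rounds with disjoint test users, whereas this sketch draws one random test set):

```python
import numpy as np

def user_oriented_split(R, Q=0.1, P=1, seed=0):
    """One round of the user-oriented evaluation split.

    R: binary I x J feedback matrix. A Q-fraction of the users become
    test users; only P of each test user's positive feedbacks stay in
    the training set, and the rest are moved to the test set."""
    rng = np.random.default_rng(seed)
    I = R.shape[0]
    test_users = rng.choice(I, size=max(1, int(Q * I)), replace=False)
    train, test = R.copy(), np.zeros_like(R)
    for i in test_users:
        pos = np.flatnonzero(R[i])
        held_out = pos[P:]  # keep the first P positives in training
        train[i, held_out] = 0
        test[i, held_out] = 1
    return train, test, test_users
```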
We let P vary from 1 to 10 in our experiments; the smaller the P, the sparser the training set. Note that when P = 1 and Q = 100%, only 2.7 percent of the entries with value 1 are put in the training set for dataset citeulike-a, and the number for dataset citeulike-t is 5.8 percent.

3. Most of the articles in CiteULike do not provide citation information, which means we have to collect it by ourselves. In this paper, the citation information is collected from Google Scholar: http://scholar.google.com.
As in [45] and [31], we use recall as our evaluation metric, since zero feedback may be caused either by users who dislike an item or by users who do not know of the item's existence, which means precision is not a proper metric here. Like most recommender systems, we sort the predicted feedback of the candidate items, which are any remaining items that are not put into the training data, and recommend the top M items (articles) to the target user. For each user, recall@M is defined as

  recall@M = (number of items the user likes in the top M) / (total number of items the user likes).

The final reported result is the average of all the users' recall.

The above evaluation scheme is for the user-oriented setting. We can also use a similar evaluation scheme for the item-oriented setting, where we need to recommend users to target items. Under this setting, the item-oriented recall i-recall@M can be defined as follows:

  i-recall@M = (number of users interested in the item in the top M) / (total number of users interested in the item).

Fig. 5. User-oriented recall@300 when Q = 10% and P ranges from 1 to 10 on dataset citeulike-a.
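recall@M as defined above can be computed per user as follows (hypothetical helper; candidate items are those not in the training data):

```python
import numpy as np

def recall_at_m(scores, liked, train_mask, M=300):
    """recall@M for one user: fraction of the held-out liked items that
    appear in the top M of the candidate (non-training) items."""
    candidates = np.flatnonzero(~train_mask)
    ranked = candidates[np.argsort(-scores[candidates])]
    top_m = set(ranked[:M].tolist())
    held_out = [j for j in liked if not train_mask[j]]
    return sum(j in top_m for j in held_out) / len(held_out)
```

The reported metric is then the average of this quantity over all users.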
4.3 Baselines and Experimental Settings

The models we use for comparison are listed as follows:

- MP. The most-popular baseline, which orders users or items by how often they appear in the training set.
- MC. The most-cited baseline, which orders items (papers) by how often they are cited in the user-oriented setting. For the item-oriented setting, the MC baseline orders the users by the total number of citations of the items (papers) rated by each user.
- CF. The CF based on MF [25]. We use the same variant of [25] as that in [45]; it is essentially a probabilistic formulation of [25].
- CTR. The CTR model [45] with the text information as the input of the LDA.
- RCTR. The RCTR model with the text information as the input of the LDA.
- CT+. A variant of content-based methods that incorporates the citation and tag information. We first construct a dictionary containing the original words from the text information, with the citations and tags as additional words. The bag-of-words vector of an article is used as its feature vector. The feature vector of a user is calculated as the average of the feature vectors of the articles s/he gave feedback to. We recommend to users the items with the largest cosine similarities.
- CTR+. The CTR model with the citations and tags as additional words for the input of the LDA.
- RCTR+. The RCTR model with the citations and tags as additional words for the input of the LDA.

In the experiments, we first use a validation set to find the optimal parameters λ_u and λ_v for CTR and CF. Then we fix λ_u and λ_v to the optimal values for CTR and vary the other parameters. By using five-fold cross validation on the training set with grid search, we find that both CF and CTR achieve good performance when λ_v = 100, λ_u = 0.01, a = 1, b = 0.01. As in CTR [45], we choose K to be 200. Here a and b control the confidence parameters c_ij. For RCTR, we set the parameters λ_v = 100, λ_u = 0.01, a = 1, b = 0.01, K = 200 and vary the other parameters, including the parameter ρ that controls the link probability function, to see how the choice of link probability function affects the prediction performance and to study the sensitivity to the parameters λ_r and λ_e.

As in CTR, we choose M to be 50, 100, 150, 200, 250, 300 in recall@M and i-recall@M. Although a smaller M might be more reasonable in some applications, there also exist other scenarios where a large M makes sense. For example, a PhD student typically reads more than 100 papers when doing a literature survey on his or her thesis topic. Hence, M is set to be no smaller than 50 in this paper.

4.4 Performance

As stated in Section 4.2, we have two different recommendation settings: user-oriented recommendation and item-oriented recommendation. We evaluate both in this section.

4.4.1 User-Oriented Recommendation

User-oriented recommendation tries to recommend items to target users. Fig. 5 shows the recall@300 on dataset citeulike-a when Q = 10% and P is set to be 1, 2, 5, 8, 10.⁴ We can see that the baselines MP and MC perform poorly. For all other settings, MP and MC always achieve the worst performance. To avoid clutter and better demonstrate the differences between the other, stronger models, we choose to drop the corresponding lines for baselines MC and MP in the following experiments. Fig. 6 shows the recall@300 on dataset citeulike-t when Q = 10% with varying P. Figs. 7 and 8 show the recall@300 when Q = 100% with different P, on citeulike-a and citeulike-t, respectively. From these figures, we can find that CF performs poorly when P is extremely small and gets closer to CTR when P gets larger. The performance of CF is far from satisfactory due to the sparsity problem. CTR can outperform CF by using item content

4. Note that the standard deviations for all the settings are very small (they lie in the range 5.73 × 10⁻⁵ to 8.35 × 10⁻…). For better illustration of the figures, we do not separately report the standard deviations for all the figures in this paper.
value 1 are put in the training set for dataset citeulike-a, and the number for dataset citeulike-t is 5.8 percent.

As in [45] and [31], we use recall as our evaluation metric, since zero feedback may be caused either by users who dislike an item or by users who do not know of the item's existence, which means precision is not a proper metric here. Like most recommender systems, we sort the predicted feedback of the candidate items, which are all remaining items not put into the training data, and recommend the top M items (articles) to the target user. For each user, recall@M is defined as

  recall@M = (number of items the user likes in top M) / (total number of items the user likes).

The final reported result is the average over all users' recall. The above evaluation scheme is for the user-oriented setting. We can use a similar evaluation scheme for the item-oriented setting, where we need to recommend users to target items. Under this setting, the item-oriented recall i-recall@M is defined as

  i-recall@M = (number of users interested in the item in top M) / (total number of users interested in the item).

4.3 Baselines and Experimental Settings

The models we used for comparison are listed as follows:

MP. The most-popular baseline, which orders users or items by how often they appear in the training set.

MC. The most-cited baseline, which orders items (papers) by how often they are cited in the user-oriented setting. For the item-oriented setting, the MC baseline orders the users by the total number of citations of the items (papers) rated by each user.

CF. The CF based on MF [25]. We use the same variant of [25] as that in [45]; it is essentially a probabilistic formulation of [25].

CTR. The CTR model [45] with the text information as the input of the LDA.

RCTR. The RCTR model with the text information as the input of the LDA.

CT+. A variant of content-based methods that incorporates the citation and tag information.
We first construct a dictionary containing the original words from the text information, with the citations and tags as additional words. The bag-of-words of an article is used as its feature vector, and the feature vector of a user is calculated as the average of the feature vectors of the articles s/he gave feedback to. We recommend to each user the items with the largest cosine similarities.

CTR+. The CTR model with the citations and tags as additional words for the input of the LDA.

RCTR+. The RCTR model with the citations and tags as additional words for the input of the LDA.

In the experiments, we first use a validation set to find the optimal parameters λ_u and λ_v for CTR and CF. Then we fix λ_u and λ_v to the optimal values for CTR and vary the other parameters. Using five-fold cross validation on the training set with grid search, we find that both CF and CTR achieve good performance when λ_v = 100, λ_u = 0.01, a = 1, b = 0.01. As in CTR [45], we choose K to be 200. Here a and b control the confidence parameters c_ij. For RCTR, we set the parameters λ_v = 100, λ_u = 0.01, a = 1, b = 0.01, K = 200 and vary the other parameters, including the parameter ρ that controls the link probability function, to see how the choice of link probability function affects the prediction performance and to study the sensitivity to the parameters λ_r and λ_e.

As in CTR, we choose M to be 50, 100, 150, 200, 250, 300 in recall@M and i-recall@M. Although a smaller M might be more reasonable in some applications, there also exist scenarios where a large M makes sense. For example, a PhD student typically reads more than 100 papers when doing a literature survey on his or her thesis topic. Hence, M is set to be no smaller than 50 in this paper.

4.4 Performance

As stated in Section 4.2, we have two different recommendation settings: user-oriented recommendation and item-oriented recommendation. We evaluate both in this section.
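The CT+ baseline described above can be sketched in a few lines of Python. The pipeline is: build a bag-of-words vector per article with citations and tags appended as extra "words", average the vectors of a user's rated articles, and rank unseen articles by cosine similarity. The article contents below are hypothetical toy data, not drawn from the citeulike datasets:

```python
import math
from collections import Counter

def bow(words):
    # Bag-of-words feature vector: word -> count.
    return Counter(words)

def cosine(u, v):
    # Cosine similarity between two sparse count vectors.
    dot = sum(u[w] * v[w] for w in u if w in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def user_vector(rated_articles):
    # Average the feature vectors of the articles the user gave feedback to.
    total = Counter()
    for art in rated_articles:
        total.update(art)
    return Counter({w: c / len(rated_articles) for w, c in total.items()})

# Hypothetical articles: text words plus citations and tags as additional words.
a1 = bow(["topic", "model", "lda", "cite:45", "tag:bayesian"])
a2 = bow(["matrix", "factorization", "cite:25", "tag:cf"])
a3 = bow(["topic", "regression", "cite:45", "tag:bayesian"])

u = user_vector([a1])  # the user rated only a1
scores = {name: cosine(u, art) for name, art in [("a2", a2), ("a3", a3)]}
best = max(scores, key=scores.get)  # recommend the most similar unseen article
print(best)  # a3: shares words, a citation, and a tag with a1
```

The same scoring runs in the item-oriented direction by comparing an item's vector against each user's averaged vector.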
4.4.1 User-Oriented Recommendation

User-oriented recommendation tries to recommend items to target users. Fig. 5 shows the recall@300 on dataset citeulike-a when Q = 10% and P is set to 1, 2, 5, 8, 10 (see footnote 4).

Fig. 5. User-oriented recall@300 when Q = 10% and P ranges from 1 to 10 on dataset citeulike-a.

4. Note that the standard deviations for all the settings are very small (they lie in the range [5.73 × 10^-5, 8.35 × 10^-3]). For better illustration of the figures, we do not separately report the standard deviations for all the figures in this paper.

We can see that the baselines MP and MC perform poorly; in all other settings, MP and MC likewise achieve the worst performance. To avoid clutter and better demonstrate the differences between the other, stronger models, we drop the lines for the baselines MC and MP in the following experiments. Fig. 6 shows the recall@300 on dataset citeulike-t when Q = 10% with varying P. Figs. 7 and 8 show the recall@300 when Q = 100% with different P, on citeulike-a and citeulike-t, respectively. From these figures, we find that CF performs poorly when P is extremely small and gets closer to CTR as P grows. The performance of CF is far from satisfactory due to the sparsity problem. CTR can outperform CF by using item content
information, and our RCTR can further improve the performance by effectively integrating the item relations into the modeling. In most cases, CTR and RCTR outperform the content-based method CT+. By adding the tags and relations as extra words, CTR+ achieves better performance than CTR, and in some cases CTR+ is comparable to RCTR. Note that although both CTR+ and RCTR use the same information for modeling, part of the tag information is lost in RCTR because RCTR uses the tags for graph construction, and a tag-relation is kept only if two items have four or more tags in common. Hence, it is fairer to compare CTR+ with RCTR+. We find that our RCTR+ outperforms CTR+ in all cases even though RCTR+ uses the same information as CTR+, which verifies the effectiveness of our modeling procedure.

Figs. 9 and 10 show the recall of the models on dataset citeulike-t when we set P to 2 and M = 50, 100, 150, 200, 250, 300. A similar phenomenon can be observed for dataset citeulike-a and other values of P, which are omitted to save space. For RCTR, λ_r = 1, λ_e = 1,000, and ρ = 100. Again RCTR+ achieves the best performance, and RCTR is consistently better than CTR and CF.

4.4.2 Item-Oriented Recommendation

Item-oriented recommendation tries to recommend users to target items. In the scientific-article domain, it can be used for applications like recommending coauthors or reviewers. Fig. 11 shows the i-recall@300 on dataset citeulike-a when Q = 10% and P is set to 1, 2, 5, 8, 10. Similar to the user-oriented case, we can see that the baselines MP and MC perform poorly; in all other settings, MP and MC likewise achieve the worst performance. To avoid clutter and better demonstrate the differences between the other, stronger models, we drop the lines for the baselines MC and MP in the following experiments. Fig. 12 shows the i-recall@300 on dataset citeulike-t when Q = 10% with different values of P.

Fig. 6.
User-oriented recall@300 when Q = 10% and P ranges from 1 to 10 on dataset citeulike-t.
Fig. 7. User-oriented recall@300 when Q = 100% and P ranges from 1 to 10 on dataset citeulike-a.
Fig. 8. User-oriented recall@300 when Q = 100% and P ranges from 1 to 10 on dataset citeulike-t.
Fig. 9. User-oriented recall@M when M ranges from 50 to 300 on dataset citeulike-t. Q = 10% and P is set to 2.
Fig. 10. User-oriented recall@M when M ranges from 50 to 300 on dataset citeulike-t. Q = 100% and P is set to 2.
Fig. 11. Item-oriented i-recall@300 when Q = 10% and P ranges from 1 to 10 on dataset citeulike-a.
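As noted above, RCTR keeps a tag-relation only when two items share four or more tags. That graph-construction rule can be sketched as follows; the item tags are hypothetical toy data:

```python
from itertools import combinations

def build_tag_graph(item_tags, min_shared=4):
    """Link two items iff they share at least `min_shared` tags."""
    edges = set()
    for (i, ti), (j, tj) in combinations(item_tags.items(), 2):
        if len(set(ti) & set(tj)) >= min_shared:
            edges.add((i, j))
    return edges

items = {
    "p1": ["bayes", "topic-model", "cf", "lda", "recsys"],
    "p2": ["bayes", "topic-model", "cf", "lda", "nlp"],  # shares 4 tags with p1
    "p3": ["bayes", "svm", "kernel", "vision"],          # shares only 1 tag with p1
}
print(build_tag_graph(items))  # {('p1', 'p2')}
```

The threshold is what discards part of the tag information in RCTR: pairs sharing fewer than four tags contribute nothing to the item network, which is why CTR+ (which keeps every tag as a word) is the fairer comparison point for RCTR+.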
Figs. 13 and 14 show the i-recall@300 on citeulike-a and citeulike-t when Q = 100%, respectively. Figs. 15 and 16 show the item-oriented recall on dataset citeulike-t when we fix P to 2 and set M = 50, 100, 150, 200, 250, 300. From these figures, we find that in the item-oriented setting RCTR can also outperform CTR and CF in most cases, and once again RCTR+ achieves the best performance.

4.5 Comparison to the Adapted CTR-SMF

As mentioned in previous sections, CTR-SMF extends CTR by using social networks among users to improve performance. Hence, CTR-SMF cannot be directly used for any of the application scenarios in this paper. Here, we adapt the original CTR-SMF in [31] to our scenarios by adopting the same techniques as in CTR-SMF to model the item network. More specifically, compared with RCTR, the adapted CTR-SMF adopts a simpler link function, as stated in Section 3.5. Furthermore, the adapted CTR-SMF does not adopt relational offsets, which are key parts of RCTR. Figs. 17 and 18 show the results of CTR, CTR-SMF and RCTR. We find that the adapted CTR-SMF outperforms CTR, and our RCTR outperforms CTR-SMF.

4.6 Sensitivity to Parameters

Fig. 19 shows how the choice of link probability function (depending on ρ) affects the prediction performance. We set λ_r = 1, λ_e = 1,000, and P = 1, and then calculate the recall@M by setting ρ = 1, 10, 100, 1,000, 10,000. When ρ = 1, the link probability function is identical to the one proposed in RTM [14]. We can see that ρ = 100 gives the optimal link probability function among the five. When ρ is too small, the performance is close to that of CTR, as expected. Note that the link probability function proposed in RTM [14] has the worst prediction performance, which demonstrates the importance of choosing the optimal link

Fig. 12. Item-oriented i-recall@300 when Q = 10% and P ranges from 1 to 10 on dataset citeulike-t.
Fig. 13. Item-oriented i-recall@300 when Q = 100% and P ranges from 1 to 10 on dataset citeulike-a.
Fig. 14.
Item-oriented i-recall@300 when Q = 100% and P ranges from 1 to 10 on dataset citeulike-t.
Fig. 15. Item-oriented i-recall@M when M ranges from 50 to 300 on dataset citeulike-t. Q = 10% and P is set to 2.
Fig. 16. Item-oriented i-recall@M when M ranges from 50 to 300 on dataset citeulike-t. Q = 100% and P is set to 2.
Fig. 17. User-oriented recall@M when M ranges from 50 to 300 on dataset citeulike-a. Q = 100% and P is set to 1.
probability function. Our RCTR provides a flexible family of functions to choose from.

To examine the sensitivity of RCTR to the parameters λ_r and λ_e, we conduct two sets of experiments with the training data of P = 1. First, we set λ_r = 1 and see how the performance changes with λ_e. Fig. 20 shows the recall@300 on the dataset citeulike-t. We find that RCTR is stable with respect to changes in λ_e. We then set λ_e = 1,000 and see how λ_r affects the prediction performance. Fig. 21 shows the recall@300 on the dataset citeulike-t. We can see that the performance is relatively sensitive to λ_r. For a fixed ρ, the recall first increases with λ_r and then decreases somewhere near λ_r = 1. The performance remains relatively poor (near 24 percent) when λ_r is too large, for all three choices of the link probability function in the experiment. A larger λ_r means the item relational vectors s_j will be closer to the item latent vectors v_j. When λ_r = 0, RCTR degenerates to CTR, and when λ_r = ∞, RCTR degenerates to the model in Fig. 3. A nice property is that the best performance is always achieved near λ_r = 1 when we change other variables like ρ. Hence, we can fix λ_r = 1 in our experiments.

4.7 Computational Cost

Table 2 shows the number of iterations for training CTR and RCTR on dataset citeulike-t. Table 3 shows the average empirical measured time (in seconds) for training CTR and RCTR on dataset citeulike-t. As stated in Section 3.4, although more information is used and thus more time is needed for each iteration of training in RCTR, the total empirical measured time for training RCTR is still lower than that for training CTR. The main reason is that RCTR needs fewer learning iterations than CTR to achieve satisfactory accuracy.

Note that the time in Table 3 is measured with K = 200, which is the same setting as CTR in [45]. In our experiments, we find that the accuracy with K = 50 is comparable to that with K = 200.
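Returning briefly to the λ_r sensitivity discussed in Section 4.6: the claim that a larger λ_r pulls the relational vectors s_j toward the latent vectors v_j can be illustrated with a one-dimensional Gaussian-prior sketch. This is a simplified illustration in our own notation, not the paper's actual update equations: a prior s_j ~ N(v_j, 1/λ_r) combined with a unit-precision relational target t_j yields a precision-weighted MAP estimate.

```python
# Hypothetical 1-D illustration (NOT the paper's update equations):
# MAP estimate of s_j under prior N(v_j, 1/lambda_r) and unit-precision
# relational evidence t_j. As lambda_r grows, s_j is pinned to v_j.
def map_offset(v_j, t_j, lambda_r):
    return (lambda_r * v_j + t_j) / (lambda_r + 1.0)

v, t = 1.0, 0.0
print(map_offset(v, t, 0.0))  # 0.0  -> relational evidence dominates
print(map_offset(v, t, 1.0))  # 0.5  -> balanced, near the paper's sweet spot
print(map_offset(v, t, 1e6))  # ~1.0 -> s_j pinned to v_j (very large lambda_r)
```

This matches the two degenerate cases in the text: at one extreme the relational vectors decouple from the latent vectors, and at the other they coincide with them.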
As mentioned in Section 3.4, the time complexity for one iteration is cubic with respect to K. Hence, if we choose K to be 50, the empirical measured time would be about 1/64 of the time in Table 3, which is small.

Fig. 18. User-oriented recall@M when M ranges from 50 to 300 on dataset citeulike-t. Q = 100% and P is set to 1.
Fig. 19. The effect of ρ in RCTR when M ranges from 50 to 300 on dataset citeulike-t. Q = 100% and P is set to 1. λ_v = 100, λ_u = 0.01, λ_r = 1, λ_e = 1,000.
Fig. 20. The effect of λ_e in RCTR on dataset citeulike-t. Q = 100% and P is set to 1. λ_v = 100, λ_u = 0.01, and λ_r = 1.
Fig. 21. The effect of λ_r in RCTR on dataset citeulike-t. Q = 100% and P is set to 1. λ_v = 100, λ_u = 0.01, and λ_e = 1,000.

TABLE 2
Number of Iterations for Training CTR and RCTR on Dataset citeulike-t

P       1            2            5            8            10
CTR     48.8 ± 2.9   45.8 ± 2.7   73.4 ± 1.0   86.0 ± 1.8   87.0 ± 2.1
RCTR    14.6 ± 1.6   14.2 ± 1.4   19.6 ± 1.0   18.8 ± 1.1   19.6 ± 1.0

TABLE 3
Empirical Measured Time for Training CTR and RCTR on Dataset citeulike-t (in Seconds)

P       1                2                5              8              10
CTR     16,250 ± 1,114   15,251 ± 1,010   24,442 ± 379   28,638 ± 706   28,971 ± 706
RCTR    5,285 ± 657      5,140 ± 594      7,095 ± 412    6,806 ± 472    7,095 ± 412
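The 1/64 estimate above follows directly from the cubic dependence of the per-iteration cost on K:

```python
# One training iteration costs time cubic in K (Section 3.4), so shrinking
# K from 200 to 50 cuts the per-iteration time by a factor of (200/50)^3.
speedup = (200 / 50) ** 3
print(speedup)  # 64.0 -> roughly 1/64 of the times reported in Table 3
```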