Social Relations Model for Collaborative Filtering

Wu-Jun Li† and Dit-Yan Yeung‡
† Shanghai Key Laboratory of Scalable Computing and Systems, Department of Computer Science and Engineering, Shanghai Jiao Tong University, China
‡ Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, China
liwujun@cs.sjtu.edu.cn, dyyeung@cse.ust.hk

Copyright © 2011, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

We propose a novel probabilistic model for collaborative filtering (CF), called SRMCoFi, which seamlessly integrates both linear and bilinear random effects into a principled framework. The formulation of SRMCoFi is supported by both social psychological experiments and statistical theories. Not only can many existing CF methods be seen as special cases of SRMCoFi, but it also integrates their advantages while simultaneously overcoming their disadvantages. The solid theoretical foundation of SRMCoFi is further supported by promising empirical results obtained in extensive experiments using real CF data sets on movie ratings.

Introduction

Collaborative Filtering

Collaborative filtering (CF) (Breese, Heckerman, and Kadie 1998) does not exploit explicit user profiles but only past activities of the users, such as their transaction history or product satisfaction expressed in ratings, to predict the future activities of the users. In a typical CF task, we often use a sparse matrix $R$ of size $N \times M$ to denote the rating or preference values given to the items by the users. Here $N$ denotes the number of users, $M$ is the number of items, and each matrix element $R_{ij}$ represents the preference for item $j$ given by user $i$. The matrix $R$ is sparse because many elements are missing, and such elements $R_{ij}$ are assigned the value of 0 to indicate that item $j$ has not been rated by user $i$. The goal of CF is to learn a function to predict some of the missing elements in $R$ based on the observed elements.

Existing CF methods can be divided into two main categories (Sarwar et al. 2001; Koren 2008; Zhen, Li, and Yeung 2009): memory-based and model-based methods. Memory-based methods predict new ratings by (weighted) averaging the ratings of similar users or items. According to whether similarity is defined based on users or items, memory-based methods can be further divided into user-based methods (Breese, Heckerman, and Kadie 1998) and item-based methods (Sarwar et al. 2001). Unlike memory-based methods, model-based methods learn a model from data using statistical learning techniques. Methods based on matrix factorization (MF) (Takács et al. 2008), also called latent factor models (Koren 2008), are the most representative model-based methods. MF methods can make use of either probabilistic (Lim and Teh 2007; Salakhutdinov and Mnih 2008b) or non-probabilistic (Srebro, Rennie, and Jaakkola 2004; Rennie and Srebro 2005; Takács et al. 2008) formulations. Recently, some hybrid models combining both memory-based and model-based techniques have also emerged (Koren 2008).

Social Relations Model

The social relations model¹ (SRM) (Kenny 1994; Li and Loken 2002) has been widely used for modeling dyadic data appearing in many fields, including social psychology, behavioral science and statistical science.² SRM was first introduced in the context of interpersonal perception, the study of the beliefs that people have about others. During the perception process, there exist a perceiver (also called an actor) and a target (also called a partner).
For example, Bob might have beliefs that his friend, Mary, is intelligent. Here, Bob corresponds to a perceiver and Mary to a target.

¹ http://davidakenny.net/srm/soremo.htm
² http://davidakenny.net/doc/srmbiblio.pdf lists hundreds of references on various applications of SRM.

To collect dyadic data, there are two basic designs: the round-robin design and the block design. In the round-robin design, each person interacts with or rates every other person in a group. In the block design, a group is divided into two subgroups and members from each subgroup interact with or rate the members of the other subgroup. Sometimes only half of the data are gathered from the block design, which is referred to as a half-block design. More specifically, in a half-block design, a group is divided into two subgroups (say A and B), and only the members from one subgroup (say A) rate the members of the other subgroup (say B), but not vice versa. As stated in (Kenny 1994, page 24), a half-block design is typical when the targets are presented on videotapes or in movies. Hence, the CF problem can be seen as an example of the half-block design in the perception study because only users rate items, but not the other way around.

However, it should be noted that SRM cannot be directly applied to the CF problem because the focus of SRM is on the analysis of variance (ANOVA) over actors and partners, as well as the relationships between actors and partners.
For example, we can use SRM to analyze the actor variance to find out whether people see others as similar in terms of intelligence. Moreover, in general, it is assumed that we can collect all of the data for SRM and hence no elements are missing. For CF, however, the goal is to predict some of the missing ratings, making it fundamentally different from that of the original SRM.

Contributions

By adapting SRM, we propose in this paper a novel CF model called SRMCoFi. Some promising properties of SRMCoFi are highlighted here:

• SRMCoFi formulates the CF problem under a probabilistic framework, making it very convenient to learn the model parameters and subsequently control the model capacity automatically.
• SRMCoFi, which subsumes many existing model-based CF methods as special cases, provides a general framework for CF.
• SRMCoFi offers theoretical justifications for many empirical findings by previous CF methods.
• SRMCoFi is very efficient and easily implementable due to the closed-form updates for parameter learning.

Social Relations Model

Let $F_{ij}$ denote the dyadic measurement of the perception or rating from actor $i$ to partner $j$. In SRM (Kenny 1994), $F_{ij}$ is decomposed into random effects $a$, $b$, $e$ as follows:

$F_{ij} = \mu + a_i + b_j + e_{ij} + \epsilon_{ij},$  (1)

where $\mu$ is a constant representing the overall mean, $a_i$ is the actor effect representing the average level of rating of an actor in the presence of a variety of partners, $b_j$ is the partner effect representing the average level of a rating which a person elicits from a variety of actors, $e_{ij}$ is the relationship effect representing the rating of an actor toward a particular partner, above and beyond their actor and partner effects, and $\epsilon_{ij}$ is an error term representing the unstable aspect of the perception. In particular, for the movie rating application, actors refer to users and partners to movies.

In general, $a$, $b$, $e$, $\epsilon$ are assumed to be random variables with the following distributions:

$a_i \overset{\mathrm{iid}}{\sim} \mathcal{N}(\cdot \mid 0, \sigma_a^2), \quad b_j \overset{\mathrm{iid}}{\sim} \mathcal{N}(\cdot \mid 0, \sigma_b^2), \quad e_{ij} \sim \mathcal{N}(\cdot \mid 0, \sigma_e^2), \quad \epsilon_{ij} \overset{\mathrm{iid}}{\sim} \mathcal{N}(\cdot \mid 0, \sigma^2),$
$\mathrm{Cov}(a_i, b_i) \neq 0, \quad \mathrm{Cov}(e_{ij}, e_{ji}) \neq 0,$  (2)

where $\mathrm{Cov}(\cdot)$ denotes the covariance. In SRM, all the other covariances are assumed to be 0. $\mathrm{Cov}(a_i, b_i) \neq 0$ means that the actor effect and partner effect of the same individual $i$ have nonzero covariance. Since CF is a half-block design, one specific individual can only have either an actor effect or a partner effect, but not both. Hence, it is unnecessary to consider these covariances for CF.

The focus of SRM is not on estimating the effects for specific individuals and relationships but on estimating the variance due to effects. So, as said above, ANOVA (Li and Loken 2002) is the main goal of the original SRM. Moreover, SRM typically assumes that all of the data are observed. This is completely different from the goal of CF, which is to predict some of the missing elements in the rating matrix. Hence, SRM cannot be directly applied to CF.

SRMCoFi

Since CF is fundamentally different from traditional applications of SRM, we need to adapt SRM appropriately if we want to apply it to CF. The result is our social relations model for collaborative filtering, which we abbreviate as SRMCoFi in the sequel.
Model

SRMCoFi defines the likelihood function over the observed ratings as follows:

$F_{ij} = \mu + a_i + b_j + U_i^T V_j,$
$E(R_{ij} \mid F_{ij}) = g^{-1}(F_{ij}),$
$p(R \mid F, \phi) = \prod_{i=1}^{N} \prod_{j=1}^{M} \left[ p(R_{ij} \mid F_{ij}, \phi) \right]^{I_{ij}},$  (3)

where $U \in \mathbb{R}^{D \times N}$ is the latent user feature matrix and $V \in \mathbb{R}^{D \times M}$ is the latent movie feature matrix, with column vectors $U_i$ and $V_j$ denoting user-specific and movie-specific latent feature vectors respectively, $E(\cdot)$ is the expectation function, $g(\cdot)$ is the link function in a generalized linear model (GLM) (Gelman et al. 2003), $\phi$ is the variance or dispersion parameter of the GLM (Gelman et al. 2003), and $I_{ij}$ is an indicator variable which is equal to 1 if user $i$ rated movie $j$ and 0 otherwise. For example, if $g^{-1}(F_{ij}) = \exp(F_{ij})$ and $p(R_{ij} \mid F_{ij}) = \mathrm{Poisson}(R_{ij} \mid g^{-1}(F_{ij}))$, the model is equivalent to a Poisson regression model (Gelman et al. 2003) with the log-link function, which is commonly used to model data taking the form of counts. One representative application of the Poisson regression model is epidemiology, where the incidence of diseases is studied (Gelman et al. 2003, page 51).

The SRMCoFi model defined in (3) adapts SRM in two aspects:

• SRMCoFi adopts a bilinear term $U_i^T V_j$ to represent the relationship effect $e_{ij}$ of SRM. Hence, SRMCoFi seamlessly integrates the linear ($a_i$, $b_j$) and bilinear effects into a principled framework. This adaptation is crucial for SRMCoFi to achieve the goal of CF, which is to predict some missing elements in the preference matrix. After learning the model, each user $i$ is associated with a latent feature vector $U_i$ and each movie $j$ with a latent feature vector $V_j$. Based on these learned feature vectors, the missing elements in the preference matrix can be predicted easily. For example, the expected value $E(R_{ij} \mid F_{ij})$ computed based on (3) can be used as the predicted value for a missing element $R_{ij}$, which is the strategy used in this paper. On the other hand, it is not easy to directly apply SRM to predict the missing elements.
• By exploiting different GLMs, SRMCoFi generalizes SRM to accommodate different types of data, such as binary measurements and count data.
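To make the role of the link function in (3) concrete, the following minimal Python/NumPy sketch (the function and variable names are hypothetical stand-ins for the learned quantities, not part of the paper's notation) shows how a single rating is predicted from the linear and bilinear effects under the log link of the Poisson example above and the logistic and identity links introduced just below:

import numpy as np

def predict_rating(mu, a, b, U, V, i, j, link="identity"):
    """Predict E(R_ij | F_ij) = g^{-1}(mu + a_i + b_j + U_i^T V_j).

    mu: scalar overall mean; a: (N,) user effects; b: (M,) movie effects;
    U: (D, N) and V: (D, M) latent feature matrices with columns U_i, V_j.
    """
    F_ij = mu + a[i] + b[j] + U[:, i] @ V[:, j]
    if link == "identity":    # normal-linear model for continuous ratings
        return F_ij
    if link == "logistic":    # Bernoulli model for binary data
        return 1.0 / (1.0 + np.exp(-F_ij))
    if link == "log":         # Poisson model for count data
        return np.exp(F_ij)
    raise ValueError("unknown link")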
We gave an example above of how to model count data via a Poisson regression model based on the log-link function. Here we give two other examples among many possibilities:

• Logistic regression model: If $R_{ij}$ is binary, we can define $g^{-1}(F_{ij}) = \frac{\exp(F_{ij})}{1 + \exp(F_{ij})}$ and $p(R_{ij} \mid F_{ij}) = \mathrm{Bernoulli}(R_{ij} \mid g^{-1}(F_{ij}))$. Here, the link is the logit function and there is no dispersion parameter $\phi$. As for CF, if the task is to predict whether user $i$ has rated movie $j$, i.e., to predict $I_{ij}$, this model might be a good choice.
• Normal-linear model: If $g^{-1}(F_{ij}) = F_{ij}$ and $p(R_{ij} \mid F_{ij}, \phi) = \mathcal{N}(R_{ij} \mid F_{ij}, \sigma^2)$, it is a normal-linear model with $\phi = \sigma^2$, which is a typical choice for modeling continuous values.

Other combinations of $g^{-1}(F_{ij})$ and $p(R_{ij} \mid F_{ij}, \phi)$ can be used to devise different models for different data types.³ Hence, SRMCoFi provides a flexible framework for modeling data from diverse applications.

³ Theoretically, any combination may be adopted. Taking computational requirements into consideration, however, only some of these combinations are good choices for real applications. For example, we always constrain $g^{-1}(F_{ij})$ and $p(R_{ij} \mid F_{ij}, \phi)$ to choices that give a concave log-likelihood function.

In this paper, we only study the normal-linear model in detail, with $g^{-1}(F_{ij}) = F_{ij}$ and $p(R_{ij} \mid F_{ij}, \phi) = \mathcal{N}(R_{ij} \mid F_{ij}, \sigma^2)$. The learning and inference procedures for other variants are similar, except that the closed-form solutions available for the normal-linear model may not exist for other variants. Nevertheless, even if this is the case, we can still resort to a gradient method (Shewchuk 1994) to obtain solutions. Due to the page limit, such variants will not be described in detail in this paper.

We place Gaussian priors on the latent variables of SRMCoFi:

$p(a) = \prod_{i=1}^{N} \mathcal{N}(a_i \mid 0, \sigma_a^2), \quad p(b) = \prod_{j=1}^{M} \mathcal{N}(b_j \mid 0, \sigma_b^2),$
$p(U) = \prod_{i=1}^{N} \mathcal{N}(U_i \mid 0, \sigma_U^2 I), \quad p(V) = \prod_{j=1}^{M} \mathcal{N}(V_j \mid 0, \sigma_V^2 I),$
$p(\mu) = \mathcal{N}(\mu \mid 0, \sigma_\mu^2),$  (4)

where $a = (a_1, a_2, \ldots, a_N)^T$, $b = (b_1, b_2, \ldots, b_M)^T$, and $I$ is the identity matrix. Here, the latent variables $\mu, a, b, U, V$ are referred to as parameters and $\sigma^2, \sigma_\mu^2, \sigma_a^2, \sigma_b^2, \sigma_U^2, \sigma_V^2$ as hyperparameters.

We assume independence among the parameters and hence $p(\mu, a, b, U, V) = p(\mu)p(a)p(b)p(U)p(V)$. Furthermore, the likelihood of the normal-linear model is $p(R \mid \mu, a, b, U, V, \sigma^2) = p(R \mid F, \sigma^2) = \prod_{i=1}^{N} \prod_{j=1}^{M} \mathcal{N}(R_{ij} \mid F_{ij}, \sigma^2)^{I_{ij}}$. Thus, the log of the posterior distribution over the parameters can be written as follows:

$\ln p(\mu, a, b, U, V \mid R, \sigma^2, \sigma_\mu^2, \sigma_a^2, \sigma_b^2, \sigma_U^2, \sigma_V^2) = -\frac{1}{2\sigma^2} \sum_{i=1}^{N} \sum_{j=1}^{M} I_{ij} (R_{ij} - F_{ij})^2 - \frac{1}{2\sigma_U^2} \sum_{i=1}^{N} U_i^T U_i - \frac{1}{2\sigma_V^2} \sum_{j=1}^{M} V_j^T V_j - \frac{1}{2\sigma_\mu^2} \mu^2 - \frac{1}{2\sigma_a^2} \sum_{i=1}^{N} a_i^2 - \frac{1}{2\sigma_b^2} \sum_{j=1}^{M} b_j^2 - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{M} I_{ij} \ln \sigma^2 - \frac{ND}{2} \ln \sigma_U^2 - \frac{MD}{2} \ln \sigma_V^2 - \frac{1}{2} \ln \sigma_\mu^2 - \frac{N}{2} \ln \sigma_a^2 - \frac{M}{2} \ln \sigma_b^2 + C,$  (5)

where $C$ is a constant that does not depend on the parameters and hyperparameters.

For fixed hyperparameters, maximizing the log-posterior over the parameters is equivalent to minimizing the following objective function:

$\mathcal{L} = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{M} I_{ij} (R_{ij} - F_{ij})^2 + \frac{\lambda_U}{2} \mathrm{tr}(U^T U) + \frac{\lambda_V}{2} \mathrm{tr}(V^T V) + \frac{\lambda_\mu}{2} \mu^2 + \frac{\lambda_a}{2} a^T a + \frac{\lambda_b}{2} b^T b,$  (6)

where $\mathrm{tr}(\cdot)$ denotes the trace of a matrix, $\lambda_U = \sigma^2/\sigma_U^2$, $\lambda_V = \sigma^2/\sigma_V^2$, $\lambda_\mu = \sigma^2/\sigma_\mu^2$, $\lambda_a = \sigma^2/\sigma_a^2$, and $\lambda_b = \sigma^2/\sigma_b^2$.
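As a small illustration of (6), the following Python/NumPy sketch (hypothetical names; R is a dense array and I a 0/1 observation mask) evaluates the objective for a given set of parameters; it is also convenient for monitoring the monotone decrease of the alternating updates derived in the next subsection:

import numpy as np

def objective(R, I, mu, a, b, U, V, lam_U, lam_V, lam_mu, lam_a, lam_b):
    """Evaluate the regularized objective L in Eq. (6)."""
    # F_ij = mu + a_i + b_j + U_i^T V_j for all user-movie pairs
    F = mu + a[:, None] + b[None, :] + U.T @ V
    fit = 0.5 * np.sum(I * (R - F) ** 2)            # data-fitting term
    reg = (0.5 * lam_U * np.sum(U ** 2)             # (lambda_U / 2) tr(U^T U)
           + 0.5 * lam_V * np.sum(V ** 2)
           + 0.5 * lam_mu * mu ** 2
           + 0.5 * lam_a * np.sum(a ** 2)
           + 0.5 * lam_b * np.sum(b ** 2))
    return fit + reg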
Parameter Learning

The learning procedure for SRMCoFi consists of two parts: learning the parameters and learning the hyperparameters. We will discuss hyperparameter learning in detail in the following subsection. In this subsection, we only consider the parameter learning problem, which is to optimize the objective function in (6) with fixed hyperparameters (or, equivalently, with fixed $\lambda$'s).⁴

⁴ Here we overload the term 'hyperparameters' to refer to the $\lambda$'s. We can easily realize from the context whether 'hyperparameters' refers to the $\sigma^2$'s or the $\lambda$'s. In fact, they refer to the same thing because the $\lambda$'s can be computed from the $\sigma^2$'s.

Note that (6) is not jointly convex over $\mu, a, b, U, V$. However, it is convex with respect to each individual variable with the others held fixed. We exploit this property to perform alternating updates for each variable iteratively, the convergence of which can be guaranteed and will be proved later in this subsection.

To update $\mu$, we compute the gradient of $\mathcal{L}$ with respect to $\mu$ and set the gradient to 0, giving

$\mu = \frac{\sum_{i=1}^{N} \sum_{j=1}^{M} I_{ij} \left[ R_{ij} - (a_i + b_j + U_i^T V_j) \right]}{\lambda_\mu + \sum_{i=1}^{N} \sum_{j=1}^{M} I_{ij}}.$  (7)

Similarly, closed-form solutions exist for updating $a_i$ and $b_j$:

$a_i = \frac{\sum_{j=1}^{M} I_{ij} (R_{ij} - \mu - b_j - U_i^T V_j)}{\lambda_a + \sum_{j=1}^{M} I_{ij}},$  (8)

$b_j = \frac{\sum_{i=1}^{N} I_{ij} (R_{ij} - \mu - a_i - U_i^T V_j)}{\lambda_b + \sum_{i=1}^{N} I_{ij}}.$  (9)
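A minimal Python/NumPy sketch of one sweep of the closed-form updates (7)-(9), using the same hypothetical dense R and 0/1 mask I as in the earlier sketches (a practical implementation would instead iterate over the sparse rating list):

import numpy as np

def update_mu_a_b(R, I, mu, a, b, U, V, lam_mu, lam_a, lam_b):
    """One sweep of the closed-form updates (7)-(9) for mu, a_i and b_j."""
    B = U.T @ V                                   # bilinear terms U_i^T V_j
    # Eq. (7): overall mean mu
    resid = R - (a[:, None] + b[None, :] + B)
    mu = np.sum(I * resid) / (lam_mu + np.sum(I))
    # Eq. (8): user effects a_i (one per row)
    resid = R - (mu + b[None, :] + B)
    a = np.sum(I * resid, axis=1) / (lam_a + np.sum(I, axis=1))
    # Eq. (9): movie effects b_j (one per column)
    resid = R - (mu + a[:, None] + B)
    b = np.sum(I * resid, axis=0) / (lam_b + np.sum(I, axis=0))
    return mu, a, b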
To update $U$, we first rewrite (6) as follows:

$\mathcal{L} = \frac{1}{2} \sum_{i=1}^{N} \left\{ U_i^T \left[ \lambda_U I + \sum_{j=1}^{M} I_{ij} V_j V_j^T \right] U_i - 2 \sum_{j=1}^{M} I_{ij} (R_{ij} - \mu - a_i - b_j) V_j^T U_i \right\} + C_1,$  (10)

where $C_1$ is a constant independent of $U$. We can see that the vectors $U_i$ for different $i$ are decoupled, which means that we can solve a linear system of equations for each column $U_i$ of $U$ separately.

It is not difficult to see that the objective function in (10) is convex. If $D$ is small, we can directly solve the linear system to obtain

$U_i = \left( \lambda_U I + \sum_{j=1}^{M} I_{ij} V_j V_j^T \right)^{-1} \left[ \sum_{j=1}^{M} I_{ij} (R_{ij} - \mu - a_i - b_j) V_j \right].$

Otherwise, if $D$ is large, we can instead apply the steepest descent method (Shewchuk 1994) with convergence guaranteed. Similarly, updating $V$ corresponds to solving the following linear system:

$\left( \lambda_V I + \sum_{i=1}^{N} I_{ij} U_i U_i^T \right) V_j = \sum_{i=1}^{N} I_{ij} (R_{ij} - \mu - a_i - b_j) U_i.$

Theorem 1  The learning procedure above guarantees local convergence.

Proof: (Sketch) The objective function in (6) is lower bounded by 0. Furthermore, the objective function decreases its value monotonically in each iteration when the variables are updated. Hence, the learning procedure always converges to a local minimum.

Theorem 2  Both the time complexity and space complexity are linear in the number of training ratings available in $R$.

Proof: Let us denote the number of training ratings by $\gamma$. It is easy to see that the time needed for updating $\mu, a, b$ is $O(D\gamma)$. For updating $U$, the time for computing all the $\sum_{j=1}^{M}(\cdot)$ terms for all $U_i$ is $O(D\gamma)$, and for each $U_i$, $O(D^3 + D^2)$ is needed for computing the matrix inverse and multiplication. Hence, the total time for updating $U$ is $O(D\gamma + D^3 N)$. Similarly, $O(D\gamma + D^3 M)$ is needed for updating $V$. Hence, the time complexity for each iteration is $O(\gamma)$ because $D$ is typically a small constant and $N, M \ll \gamma$. Moreover, from our experiments, only a small number ($< 20$) of iterations is sufficient for the algorithm to achieve good performance. Hence, we can say that the time complexity of the whole algorithm is $O(\gamma)$.

The space complexity of $R, \mu, a, b, U, V$ is $O(\gamma), O(1), O(N), O(M), O(DN), O(DM)$ respectively. Hence, the overall space complexity is also $O(\gamma)$.
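Before turning to hyperparameter learning, here is a sketch of the per-user solve for $U_i$ derived from (10) (Python/NumPy, hypothetical names; the per-movie solve for $V_j$ is symmetric). A linear solve is used instead of forming the matrix inverse explicitly:

import numpy as np

def update_U(R, I, mu, a, b, U, V, lam_U):
    """Update each column U_i by solving the D x D linear system from (10)."""
    D, N = U.shape
    for i in range(N):
        rated = I[i] > 0                        # movies rated by user i
        Vr = V[:, rated]                        # (D, n_i) factors of rated movies
        A = lam_U * np.eye(D) + Vr @ Vr.T       # lambda_U I + sum_j I_ij V_j V_j^T
        r = R[i, rated] - mu - a[i] - b[rated]  # residual ratings of user i
        U[:, i] = np.linalg.solve(A, Vr @ r)    # right-hand side: sum_j I_ij r_ij V_j
        # (for large D a gradient method could be used here instead, as noted above)
    return U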
Hyperparameter Learning

How to choose appropriate regularization parameters (also called hyperparameters here), such as the $\lambda$'s in (6), is crucial to model capacity control. For models based on non-probabilistic formulations, a validation set or, more generally, cross validation is often used to determine the hyperparameters. However, in cross validation, only a small number of discrete values are considered as candidate choices. Hence, the values eventually chosen for the hyperparameters are not necessarily optimal.

A probabilistic formulation allows the hyperparameters to be learned automatically. In this paper, we adopt an alternating method to efficiently learn the hyperparameters. More specifically, in the learning procedure, we fix the hyperparameters and optimize over the parameters first, and then optimize over the hyperparameters with the parameters kept fixed; this procedure iterates until some termination condition is satisfied. We use a conjugate hyperprior which gives closed-form updates for inference. More specifically, we use an inverse-$\chi^2$ distribution as a conjugate hyperprior for the hyperparameters (the $\sigma^2$'s).

Assume that $p(y \mid 0, \sigma_y^2) = \mathcal{N}(y \mid 0, \sigma_y^2)$ with $\sigma_y^2$ unknown. The likelihood of $n$ i.i.d. observations is

$p(\mathbf{y} \mid \sigma_y^2) \propto \frac{1}{\sigma_y^n} \exp\left( -\frac{\sum_{i=1}^{n} y_i^2}{2\sigma_y^2} \right) = \frac{1}{\sigma_y^n} \exp\left( -\frac{n}{2\sigma_y^2} s \right),$

where $s$ is the sufficient statistic: $s(\sigma_y^2) = \frac{\sum_{i=1}^{n} y_i^2}{n}$.

If we use a scaled inverse-$\chi^2$ (Gelman et al. 2003) as the hyperprior for $\sigma_y^2$, i.e., $\sigma_y^2 \sim \text{Inv-}\chi^2(\nu_0, \sigma_0^2)$, we get

$\sigma_y^2 \mid \mathbf{y} \sim \text{Inv-}\chi^2\left( \nu_0 + n, \; \frac{\nu_0 \sigma_0^2 + n s}{\nu_0 + n} \right).$  (11)

Hence, the hyperprior can be thought of as providing prior information equivalent to $\nu_0$ observations with average squared deviation $\sigma_0^2$ (Gelman et al. 2003).

It is easy to derive the sufficient statistics of the $\sigma^2$'s for SRMCoFi:

$s(\sigma^2) = \frac{\sum_{i=1}^{N} \sum_{j=1}^{M} I_{ij} (R_{ij} - F_{ij})^2}{\sum_{i=1}^{N} \sum_{j=1}^{M} I_{ij}}, \quad s(\sigma_\mu^2) = \mu^2,$
$s(\sigma_a^2) = \frac{\sum_{i=1}^{N} a_i^2}{N}, \quad s(\sigma_b^2) = \frac{\sum_{j=1}^{M} b_j^2}{M},$
$s(\sigma_U^2) = \frac{\sum_{i=1}^{N} U_i^T U_i}{ND}, \quad s(\sigma_V^2) = \frac{\sum_{j=1}^{M} V_j^T V_j}{MD}.$

The mode of the posterior of each $\sigma^2$, which is $\frac{\nu}{\nu + 2} \sigma_y^2$ for $\text{Inv-}\chi^2(\nu, \sigma_y^2)$, is adopted to update the $\lambda$'s in (6). It is easy to see that the time and space complexity of hyperparameter learning is also linear in the number of training ratings.
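Putting these pieces together, one hyperparameter refresh can be sketched as follows (Python/NumPy; the names, the use of a single shared prior strength nu0 and prior scale s0 for all variance components, and their default values are assumptions of this sketch rather than choices prescribed by the paper):

import numpy as np

def update_lambdas(R, I, mu, a, b, U, V, nu0=2.0, s0=1.0):
    """Refresh the sigma^2's from the Inv-chi^2 posterior modes and return the lambda's."""
    D, N = U.shape
    M = V.shape[1]
    F = mu + a[:, None] + b[None, :] + U.T @ V      # current F_ij

    def posterior_mode(s, n):
        # Posterior (11): Inv-chi^2(nu0 + n, (nu0*s0 + n*s)/(nu0 + n));
        # the mode of Inv-chi^2(nu, scale) is nu/(nu + 2) * scale.
        nu = nu0 + n
        return nu / (nu + 2.0) * (nu0 * s0 + n * s) / nu

    n_obs = np.sum(I)
    sigma2    = posterior_mode(np.sum(I * (R - F) ** 2) / n_obs, n_obs)
    sigma2_mu = posterior_mode(mu ** 2, 1)
    sigma2_a  = posterior_mode(np.mean(a ** 2), N)
    sigma2_b  = posterior_mode(np.mean(b ** 2), M)
    sigma2_U  = posterior_mode(np.sum(U ** 2) / (N * D), N * D)
    sigma2_V  = posterior_mode(np.sum(V ** 2) / (M * D), M * D)
    # The lambda's in (6) are the ratios sigma^2 / sigma_*^2.
    return {"mu": sigma2 / sigma2_mu, "a": sigma2 / sigma2_a,
            "b": sigma2 / sigma2_b, "U": sigma2 / sigma2_U, "V": sigma2 / sigma2_V}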
Comparison with Related Work

Let us first consider MF methods based on non-probabilistic formulations. Maximum margin matrix factorization (MMMF) (Srebro, Rennie, and Jaakkola 2004; Rennie and Srebro 2005) can be seen as a special case of (6) obtained by setting $\mu = a_i = b_j = \lambda_\mu = \lambda_a = \lambda_b = 0$.⁵ It has been demonstrated empirically by many researchers (Takács et al. 2008; Weimer, Karatzoglou, and Smola 2008) that adding offset (or bias) terms to MMMF can improve its performance. These methods, MMMF+offset in (Weimer, Karatzoglou, and Smola 2008) and BRISMF in (Takács et al. 2008), are equivalent to (6) with $\mu = \lambda_\mu = 0$. All these non-probabilistic methods have shortcomings when it comes to choosing the hyperparameters.

⁵ The original MMMF (Srebro, Rennie, and Jaakkola 2004) only used the hinge loss, but many subsequent works also used other loss functions such as the smooth hinge (Rennie and Srebro 2005) and the squared loss (Weimer, Karatzoglou, and Smola 2008). Here we refer to the squared loss function.

Other MF methods (Lim and Teh 2007; Salakhutdinov and Mnih 2008a; 2008b) are based on probabilistic formulations. PMF (Salakhutdinov and Mnih 2008b) can be seen as a special case of SRMCoFi obtained by removing the terms for $\mu, a, b$. Bayesian PMF (BPMF) (Salakhutdinov and Mnih 2008a) extends PMF by putting hyperpriors on the hyperparameters and uses Gibbs sampling for learning and inference. The underlying formulation of VB (Lim and Teh 2007) is the same as Bayesian PMF except that VB adopts variational methods for Bayesian inference. Although SRMCoFi in this paper does not perform fully Bayesian inference, the techniques used in extending PMF to Bayesian PMF (or MMMF to VB) may also be applied here to extend SRMCoFi to its fully Bayesian counterpart, possibly incurring higher computation cost to take advantage of a fully Bayesian approach. In that case, BPMF and VB can be seen as special cases of the corresponding fully Bayesian counterparts of SRMCoFi. We will pursue this direction in our future work. One common characteristic of these probabilistic formulations is that they do not include the very important linear terms used in SRMCoFi. In fact, many variants of MF methods were empirically studied in (Takács et al. 2008), and MMMF with the added linear terms (called BRISMF) was found to perform the best. This shows that the importance of the linear terms, as supported by psychological experiments, is also verified by empirical findings in CF applications.

Modeling the offset terms (i.e., the linear random effects in SRMCoFi) and hyperparameter learning are two important characteristics of a promising CF model. However, existing methods do not simultaneously take both of them into consideration. Although SRMCoFi has been motivated by the study of social relations, it coincides with a model which seamlessly integrates these two aspects into a principled framework. The solid foundation validated by theories and experiments from psychological and statistical studies provides justification for the inclusion of the offset terms. Hence, compared with existing methods, better performance can be expected from SRMCoFi.

Experiments

Data Sets and Evaluation Metric

We evaluate SRMCoFi on two widely used data sets: MovieLens (Sarwar et al. 2001) and EachMovie (Breese, Heckerman, and Kadie 1998). Some information about the data sets is summarized in Table 1.

Table 1: Data sets for evaluation.

  Dataset     # Users   # Movies   # Ratings
  MovieLens   943       1,682      100,000
  EachMovie   61,265    1,623      2,811,718

Root mean squared error (RMSE) is used as the error measure. Suppose there are $H$ elements in total to be predicted, and let $R_h$ and $\tilde{R}_h$ denote the ground-truth and predicted values respectively for element $h$. Then the RMSE is defined as

$\text{RMSE} = \sqrt{\frac{\sum_{h=1}^{H} (R_h - \tilde{R}_h)^2}{H}}.$

The smaller the RMSE, the better the method.
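For completeness, RMSE can be computed in a couple of lines (hypothetical array names):

import numpy as np

def rmse(r_true, r_pred):
    """Root mean squared error over the H held-out ratings."""
    r_true = np.asarray(r_true, dtype=float)
    r_pred = np.asarray(r_pred, dtype=float)
    return np.sqrt(np.mean((r_true - r_pred) ** 2))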
Effect of Hyperparameter Learning

Here, we compare SRMCoFi with MMMF+offset (Weimer, Karatzoglou, and Smola 2008). We keep part of the training data as a validation set to determine the hyperparameters for MMMF+offset, while the hyperparameters of SRMCoFi are automatically learned with our proposed algorithm. Hence, the main difference between MMMF+offset and SRMCoFi lies in how the hyperparameters are determined.

We randomly select 20%, 40%, 60% and 80% of the data to form the training set and use the rest as the test set for evaluation. This random selection is carried out for 10 rounds independently for each splitting ratio. In all the experiments in this paper, we set $D = 30$. The mean and standard deviation of the RMSE are reported in Table 2. Two other approaches, UserMean and MovieMean, are also included for comparison. UserMean uses the sample mean of the same user's training ratings for prediction, and MovieMean uses that of the same movie's training ratings. From the results, we can see that hyperparameter learning is very important for CF and that SRMCoFi is very effective.

Table 2: Comparison between SRMCoFi and MMMF+offset to show that hyperparameter learning is effective. The best performance (with smallest RMSE) is shown in bold. All standard deviations are 0.002 or less.

              Method         20%     40%     60%     80%
  MovieLens   UserMean       1.060   1.049   1.044   1.043
              MovieMean      1.054   1.036   1.030   1.027
              MMMF+offset    0.974   0.954   0.945   0.940
              SRMCoFi        0.969   0.941   0.918   0.910
  EachMovie   UserMean       1.477   1.443   1.432   1.427
              MovieMean      1.389   1.388   1.387   1.387
              MMMF+offset    1.286   1.233   1.191   1.161
              SRMCoFi        1.245   1.195   1.160   1.136

Effect of Linear Random Effects

MMMF+offset vs. MMMF

The difference between MMMF+offset and MMMF lies in the linear offset terms. Here, we compare them to demonstrate the significance of these terms. Both MMMF+offset and MMMF use a validation set to determine the hyperparameters, with the same data set partitioning scheme as above. The results are shown in Table 3, from which we can see that the linear offset terms dramatically improve performance.
Table 3: Comparison between MMMF+offset and MMMF to show the effectiveness of the offset terms. All standard deviations are 0.002 or less.

              Method         20%     40%     60%     80%
  MovieLens   MMMF           1.040   0.981   0.964   0.955
              MMMF+offset    0.974   0.954   0.945   0.940
  EachMovie   MMMF           1.377   1.281   1.265   1.208
              MMMF+offset    1.286   1.233   1.191   1.161

Comparison with PMF

Both PMF (Salakhutdinov and Mnih 2008b) and SRMCoFi can perform hyperparameter learning automatically. Hence, the only difference between them lies in the linear terms. The results are shown in Table 4, which once again verifies that the linear terms are crucial for CF.

Table 4: Comparison between SRMCoFi and PMF. All standard deviations are 0.002 or less.

              Method     20%     40%     60%     80%
  MovieLens   PMF        1.029   0.977   0.956   0.952
              SRMCoFi    0.969   0.941   0.918   0.910
  EachMovie   PMF        1.324   1.233   1.224   1.178
              SRMCoFi    1.245   1.195   1.160   1.136

Comparison with VB and MVTM

We further compare SRMCoFi with other state-of-the-art methods, including VB (Lim and Teh 2007) and the matrix-variate t model (MVTM) (Zhu, Yu, and Gong 2008). As in (Zhu, Yu, and Gong 2008), we randomly select 80% of the EachMovie data for training and use the rest for testing. The mean and standard deviation over 10 rounds are reported in Table 5. The better performance of SRMCoFi can be directly attributed to its linear random effects, because VB only has bilinear terms.

Table 5: Comparison of SRMCoFi with VB and MVTM. All standard deviations are 0.002 or less.

  VB      MVTM (mode)   MVTM (mean)   SRMCoFi
  1.165   1.162         1.151         1.136

Computational Cost

In our implementation, we find that SRMCoFi is quite efficient. In the experiment on EachMovie with 80% of the data for training, it takes less than 10 minutes to complete one iteration of both hyperparameter learning and parameter learning, which is comparable with a C++ implementation of CoFiRank (Weimer et al. 2008), even though our SRMCoFi implementation is in MATLAB. Compared with the MMMF implementation in (Rennie and Srebro 2005), SRMCoFi is an order of magnitude faster.

Conclusion

SRMCoFi, which has been inspired by social psychological studies, can be seen as a model that seamlessly integrates both hyperparameter learning and the use of offset terms into a principled framework. As a result, SRMCoFi integrates the advantages of existing methods and simultaneously overcomes their disadvantages, achieving very promising performance in real CF tasks. To a certain extent, a major contribution of SRMCoFi is that it provides justifications from social psychological studies for many empirical findings of existing CF methods.

Acknowledgments

Li is supported by a grant from the "project of Arts & Science" of Shanghai Jiao Tong University (No. 10JCY09). Yeung is supported by General Research Fund 621310 from the Research Grants Council of Hong Kong.

References

Breese, J. S.; Heckerman, D.; and Kadie, C. M. 1998. Empirical analysis of predictive algorithms for collaborative filtering. In UAI, 43-52.
Gelman, A.; Carlin, J. B.; Stern, H. S.; and Rubin, D. B. 2003. Bayesian Data Analysis, Second Edition. Chapman & Hall/CRC.
Kenny, D. A. 1994. Interpersonal Perception: A Social Relations Analysis. Guilford Publications, Inc., New York.
Koren, Y. 2008. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In KDD, 426-434.
Li, H., and Loken, E. 2002. A unified theory of statistical analysis and inference for variance component models for dyadic data. Statistica Sinica 12:519-535.
Lim, Y. J., and Teh, Y. W. 2007. Variational Bayesian approach to movie rating prediction. In Proceedings of KDD Cup and Workshop.
Rennie, J. D. M., and Srebro, N. 2005. Fast maximum margin matrix factorization for collaborative prediction. In ICML, 713-719.
Salakhutdinov, R., and Mnih, A. 2008a. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In ICML, 880-887.
Salakhutdinov, R., and Mnih, A. 2008b. Probabilistic matrix factorization. In NIPS 20.
Sarwar, B. M.; Karypis, G.; Konstan, J. A.; and Riedl, J. 2001. Item-based collaborative filtering recommendation algorithms. In WWW, 285-295.
Shewchuk, J. R. 1994. An introduction to the conjugate gradient method without the agonizing pain. Technical report, Pittsburgh, PA, USA.
Srebro, N.; Rennie, J. D. M.; and Jaakkola, T. 2004. Maximum-margin matrix factorization. In NIPS.
Takács, G.; Pilászy, I.; Németh, B.; and Tikk, D. 2008. Investigation of various matrix factorization methods for large recommender systems. In Proc. of the 2nd KDD Workshop on Large Scale Recommender Systems and the Netflix Prize Competition.
Weimer, M.; Karatzoglou, A.; Le, Q.; and Smola, A. 2008. CofiRank - maximum margin matrix factorization for collaborative ranking. In NIPS 20.
Weimer, M.; Karatzoglou, A.; and Smola, A. 2008. Improving maximum margin matrix factorization. Machine Learning 72(3):263-276.
Zhen, Y.; Li, W.-J.; and Yeung, D.-Y. 2009. TagiCoFi: tag informed collaborative filtering. In RecSys, 69-76.
Zhu, S.; Yu, K.; and Gong, Y. 2008. Predictive matrix-variate t models. In NIPS 20.