We gave an example above on how to model counting data via a Poisson regression model based on the log-link function. Here we give two other examples among many possibilities (a short code sketch of the two inverse links follows the list):

• Logistic regression model: If $R_{ij}$ is binary, we can define $g^{-1}(F_{ij}) = \frac{\exp(F_{ij})}{1 + \exp(F_{ij})}$ and $p(R_{ij}|F_{ij}) = \mathrm{Bernoulli}(R_{ij}|g^{-1}(F_{ij}))$. Here, the link is the logit function and there is no dispersion parameter $\phi$. As for CF, if the task is to predict whether user $i$ has rated movie $j$, i.e., to predict $I_{ij}$, this model might be a good choice.

• Normal-linear model: If $g^{-1}(F_{ij}) = F_{ij}$ and $p(R_{ij}|F_{ij},\phi) = \mathcal{N}(R_{ij}|F_{ij},\sigma^2)$, it is a normal-linear model with $\phi = \sigma^2$, which is a typical choice for modeling continuous values.

Other combinations of $g^{-1}(F_{ij})$ and $p(R_{ij}|F_{ij},\phi)$ can be used to devise different models for different data types.³ Hence, SRMCoFi provides a flexible framework for modeling data from diverse applications.

³ Theoretically, any combination may be adopted. Taking computational requirements into consideration, however, only some of these combinations are good choices for real applications. For example, we always constrain $g^{-1}(F_{ij})$ and $p(R_{ij}|F_{ij},\phi)$ to choices that give a concave log-likelihood function.
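To make the two link choices concrete, here is a minimal sketch contrasting the logistic and identity inverse links. The function names are ours, not from the paper; the sketch simply evaluates $g^{-1}$ for one latent score.

```python
import numpy as np

def inverse_logit(f):
    """Inverse of the logit link: maps a real-valued score F_ij to a
    Bernoulli mean in (0, 1), as in the logistic regression model."""
    return 1.0 / (1.0 + np.exp(-f))

def inverse_identity(f):
    """Inverse of the identity link: the score F_ij is used directly as
    the Gaussian mean, as in the normal-linear model."""
    return f

# The same latent score interpreted under the two models.
f_ij = 0.8
print(inverse_logit(f_ij))     # P(R_ij = 1) in the logistic model
print(inverse_identity(f_ij))  # mean of N(R_ij | F_ij, sigma^2)
```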
In this paper, we only study the normal-linear model in detail, with $g^{-1}(F_{ij}) = F_{ij}$ and $p(R_{ij}|F_{ij},\phi) = \mathcal{N}(R_{ij}|F_{ij},\sigma^2)$. The learning and inference procedures for the other variants are similar, except that the closed-form solutions available for the normal-linear model may not exist for them. Nevertheless, even when this is the case, we can still resort to a gradient method (Shewchuk 1994) to obtain solutions. Due to the page limit, such variants will not be described in detail in this paper.

We place Gaussian priors on the latent variables of SRMCoFi:

$$p(a) = \prod_{i=1}^{N}\mathcal{N}(a_i|0,\sigma_a^2), \qquad p(b) = \prod_{j=1}^{M}\mathcal{N}(b_j|0,\sigma_b^2),$$
$$p(U) = \prod_{i=1}^{N}\mathcal{N}(U_i|0,\sigma_U^2 I), \qquad p(V) = \prod_{j=1}^{M}\mathcal{N}(V_j|0,\sigma_V^2 I),$$
$$p(\mu) = \mathcal{N}(\mu|0,\sigma_\mu^2), \tag{4}$$

where $a = (a_1, a_2, \ldots, a_N)^T$, $b = (b_1, b_2, \ldots, b_M)^T$, and $I$ is the identity matrix. Here, the latent variables $\mu, a, b, U, V$ are referred to as parameters, and $\sigma^2, \sigma_\mu^2, \sigma_a^2, \sigma_b^2, \sigma_U^2, \sigma_V^2$ as hyperparameters.

We assume independence among the parameters and hence $p(\mu,a,b,U,V) = p(\mu)\,p(a)\,p(b)\,p(U)\,p(V)$. Furthermore, the likelihood of the normal-linear model is

$$p(R|\mu,a,b,U,V,\sigma^2) = p(R|F,\sigma^2) = \prod_{i=1}^{N}\prod_{j=1}^{M}\mathcal{N}(R_{ij}|F_{ij},\sigma^2)^{I_{ij}}.$$
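As a quick sanity check on this masked likelihood, the following sketch evaluates it numerically. It assumes the prediction matrix $F$ has already been assembled from $\mu, a, b, U, V$; the array and function names are ours.

```python
import numpy as np

def masked_gaussian_loglik(R, F, I, sigma2):
    """Log-likelihood of the normal-linear model: each observed entry
    (I_ij = 1) contributes log N(R_ij | F_ij, sigma^2); unobserved
    entries (I_ij = 0) contribute nothing."""
    n_obs = I.sum()
    return (-0.5 * np.sum(I * (R - F) ** 2) / sigma2
            - 0.5 * n_obs * np.log(2.0 * np.pi * sigma2))

# Toy check: 2 users, 3 movies, two unobserved ratings.
R = np.array([[4.0, 0.0, 3.0],
              [5.0, 2.0, 0.0]])
I = np.array([[1, 0, 1],
              [1, 1, 0]])
F = np.full((2, 3), 3.5)   # stand-in for mu + a_i + b_j + U_i^T V_j
print(masked_gaussian_loglik(R, F, I, sigma2=1.0))
```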
Thus, the log of the posterior distribution over the parameters can be written as follows:

$$\begin{aligned}
\ln p(\mu,a,b,U,V \mid R,\sigma^2,\sigma_\mu^2,\sigma_a^2,\sigma_b^2,\sigma_U^2,\sigma_V^2)
&= -\frac{1}{2\sigma^2}\sum_{i=1}^{N}\sum_{j=1}^{M} I_{ij}(R_{ij}-F_{ij})^2
 - \frac{1}{2\sigma_U^2}\sum_{i=1}^{N} U_i^T U_i
 - \frac{1}{2\sigma_V^2}\sum_{j=1}^{M} V_j^T V_j \\
&\quad - \frac{1}{2\sigma_\mu^2}\mu^2
 - \frac{1}{2\sigma_a^2}\sum_{i=1}^{N} a_i^2
 - \frac{1}{2\sigma_b^2}\sum_{j=1}^{M} b_j^2
 - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{M} I_{ij}\ln\sigma^2 \\
&\quad - \frac{ND}{2}\ln\sigma_U^2
 - \frac{MD}{2}\ln\sigma_V^2
 - \frac{1}{2}\ln\sigma_\mu^2
 - \frac{N}{2}\ln\sigma_a^2
 - \frac{M}{2}\ln\sigma_b^2 + C,
\end{aligned} \tag{5}$$

where $C$ is a constant that does not depend on the parameters and hyperparameters.

For fixed hyperparameters, maximizing the log-posterior over the parameters is equivalent to minimizing the following objective function:

$$\mathcal{L} = \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{M} I_{ij}(R_{ij}-F_{ij})^2
+ \frac{\lambda_U}{2}\operatorname{tr}(U^T U)
+ \frac{\lambda_V}{2}\operatorname{tr}(V^T V)
+ \frac{\lambda_\mu}{2}\mu^2
+ \frac{\lambda_a}{2}a^T a
+ \frac{\lambda_b}{2}b^T b, \tag{6}$$

where $\operatorname{tr}(\cdot)$ denotes the trace of a matrix, $\lambda_U = \sigma^2/\sigma_U^2$, $\lambda_V = \sigma^2/\sigma_V^2$, $\lambda_\mu = \sigma^2/\sigma_\mu^2$, $\lambda_a = \sigma^2/\sigma_a^2$, and $\lambda_b = \sigma^2/\sigma_b^2$.

Parameter Learning

The learning procedure for SRMCoFi consists of two parts: learning the parameters and learning the hyperparameters. We will discuss hyperparameter learning in detail in the following subsection. In this subsection, we only consider the parameter learning problem, which is to optimize the objective function in (6) with fixed hyperparameters (or, equivalently, with fixed $\lambda$'s).⁴

Note that (6) is not jointly convex over $\mu, a, b, U, V$. However, it is convex with respect to each individual variable with the others held fixed. We exploit this property to perform alternating updates on the variables iteratively; the convergence of this procedure can be guaranteed and will be proved later in this subsection.

To update $\mu$, we compute the gradient of $\mathcal{L}$ with respect to $\mu$ and set it to 0, giving

$$\mu = \frac{\sum_{i=1}^{N}\sum_{j=1}^{M} I_{ij}\left[R_{ij} - (a_i + b_j + U_i^T V_j)\right]}{\lambda_\mu + \sum_{i=1}^{N}\sum_{j=1}^{M} I_{ij}}. \tag{7}$$

Similarly, closed-form solutions exist for updating $a_i$ and $b_j$:

$$a_i = \frac{\sum_{j=1}^{M} I_{ij}\,(R_{ij} - \mu - b_j - U_i^T V_j)}{\lambda_a + \sum_{j=1}^{M} I_{ij}}, \tag{8}$$

$$b_j = \frac{\sum_{i=1}^{N} I_{ij}\,(R_{ij} - \mu - a_i - U_i^T V_j)}{\lambda_b + \sum_{i=1}^{N} I_{ij}}. \tag{9}$$

⁴ Here we overload the term 'hyperparameters' to refer to the $\lambda$'s. Whether 'hyperparameters' refers to the $\sigma^2$'s or to the $\lambda$'s is easy to tell from the context; in fact, they amount to the same thing, since the $\lambda$'s can be computed from the $\sigma^2$'s.
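The sketch below implements one pass of the closed-form updates (7)-(9), assuming $U$ and $V$ are held fixed during the pass (the paper's updates for $U_i$ and $V_j$ would be interleaved in the same alternating loop, but are not reproduced here). Names are ours; each update uses the most recent values of the other variables, as the alternating scheme requires.

```python
import numpy as np

def update_mu_a_b(R, I, U, V, a, b, lam_mu, lam_a, lam_b):
    """One pass of the coordinate updates (7)-(9).

    R, I : N x M rating matrix and binary observation mask.
    U, V : N x D and M x D latent factor matrices (held fixed here).
    """
    UV = U @ V.T  # UV[i, j] = U_i^T V_j

    # Eq. (7): global offset mu.
    mu = np.sum(I * (R - (a[:, None] + b[None, :] + UV))) / (lam_mu + I.sum())

    # Eq. (8): user biases a_i, using the freshly updated mu.
    a = np.sum(I * (R - mu - b[None, :] - UV), axis=1) / (lam_a + I.sum(axis=1))

    # Eq. (9): movie biases b_j, using the freshly updated mu and a.
    b = np.sum(I * (R - mu - a[:, None] - UV), axis=0) / (lam_b + I.sum(axis=0))

    return mu, a, b
```

Because (6) is quadratic in each of $\mu$, $a_i$, and $b_j$ individually, every such update solves its subproblem exactly, so each pass can only decrease $\mathcal{L}$.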