To update $U$, we first rewrite (6) as follows:

\[
L = \frac{1}{2}\sum_{i=1}^{N}\left[ U_i^T\Big(\lambda_U I + \sum_{j=1}^{M} I_{ij} V_j V_j^T\Big)U_i - 2\Big(\sum_{j=1}^{M} I_{ij}(R_{ij}-\mu-a_i-b_j)V_j^T\Big)U_i \right] + C_1, \tag{10}
\]

where $C_1$ is a constant independent of $U$. We can see that the vectors $U_i$ for different $i$ are decoupled, which means that we can solve a linear system of equations for each row of $U$ separately.

It is not difficult to see that the objective function in (10) is convex. If $D$ is small, we can directly solve the linear system to obtain:
\[
U_i = \Big(\lambda_U I + \sum_{j=1}^{M} I_{ij} V_j V_j^T\Big)^{-1}\Big[\sum_{j=1}^{M} I_{ij}(R_{ij}-\mu-a_i-b_j)V_j\Big].
\]
Otherwise, if $D$ is large, we can instead apply the steepest descent method (Shewchuk 1994) with guaranteed convergence.

Similarly, updating $V$ corresponds to solving the following linear system:
\[
\Big(\lambda_V I + \sum_{i=1}^{N} I_{ij} U_i U_i^T\Big) V_j = \sum_{i=1}^{N} I_{ij}(R_{ij}-\mu-a_i-b_j)U_i.
\]

Theorem 1 The learning procedure above guarantees local convergence.

Proof: (Sketch) The objective function in (6) is lower bounded by 0. Furthermore, the objective function decreases monotonically in each iteration when the variables are updated. Hence, the learning procedure always converges to a local minimum.

Theorem 2 Both the time complexity and space complexity are linear in the number of training ratings available in $R$.

Proof: Let us denote the number of training ratings by $\gamma$. It is easy to see that the time needed for updating $\mu$, $a$, $b$ is $O(D\gamma)$.
For updating $U$, the time for computing all the $\sum_{j=1}^{M}(\cdot)$ terms for all $U_i$ is $O(D\gamma)$, and for each $U_i$, $O(D^3 + D^2)$ is needed for computing the matrix inverse and multiplication. Hence, the total time for updating $U$ is $O(D\gamma + D^3 N)$. Similarly, $O(D\gamma + D^3 M)$ is needed for updating $V$. Hence, the time complexity for each iteration is $O(\gamma)$ because $D$ is typically a small constant and $N, M \ll \gamma$. Moreover, in our experiments, only a small number ($< 20$) of iterations is sufficient for the algorithm to achieve good performance. Hence, we can say that the time complexity of the whole algorithm is $O(\gamma)$.

The space complexity of $R$, $\mu$, $a$, $b$, $U$, $V$ is $O(\gamma)$, $O(1)$, $O(N)$, $O(M)$, $O(DN)$, $O(DM)$, respectively. Hence, the overall space complexity is also $O(\gamma)$.

Hyperparameter Learning

How to choose appropriate regularization parameters (called hyperparameters here), such as the $\lambda$'s in (6), is crucial for model capacity control. For models based on non-probabilistic formulations, a validation set or, more generally, cross validation is often used to determine the hyperparameters. However, in cross validation, only a small number of discrete values are considered as candidate choices. Hence, the values eventually chosen for the hyperparameters are not necessarily optimal.

A probabilistic formulation allows the hyperparameters to be learned automatically. In this paper, we adopt an alternating method to efficiently learn the hyperparameters. More specifically, in the learning procedure, we fix the hyperparameters and optimize over the parameters first, and then optimize over the hyperparameters with the parameters kept fixed; this procedure iterates until some termination condition is satisfied. In this paper, we use a conjugate hyperprior which gives closed-form updates for inference. More specifically, we use an inverse-$\chi^2$ distribution as a conjugate hyperprior for the hyperparameters (the $\sigma^2$'s).

Assume that $p(y\,|\,0, \sigma_y^2) = \mathcal{N}(y\,|\,0, \sigma_y^2)$ with $\sigma_y^2$ unknown, the likelihood of $n$ i.i.d.
observations is
\[
p(\mathbf{y}\,|\,\sigma_y^2) \propto \frac{1}{\sigma_y^n}\exp\Big(-\frac{\sum_{i=1}^{n} y_i^2}{2\sigma_y^2}\Big) = \frac{1}{\sigma_y^n}\exp\Big(-\frac{n}{2\sigma_y^2}\,s\Big),
\]
where $s$ is the sufficient statistic: $s(\sigma_y^2) = \frac{\sum_{i=1}^{n} y_i^2}{n}$.

If we use a scaled inverse-$\chi^2$ (Gelman et al. 2003) as the hyperprior for $\sigma_y^2$, i.e., $\sigma_y^2 \sim \text{Inv-}\chi^2(\nu_0, \sigma_0^2)$, we can get
\[
\sigma_y^2\,|\,\mathbf{y} \sim \text{Inv-}\chi^2\Big(\nu_0 + n,\ \frac{\nu_0\sigma_0^2 + ns}{\nu_0 + n}\Big). \tag{11}
\]

Hence, the hyperprior can be thought of as providing prior information equivalent to $\nu_0$ observations with average squared deviation $\sigma_0^2$ (Gelman et al. 2003).

It is easy to derive the sufficient statistics of the $\sigma^2$'s for SRMCoFi:
\[
s(\sigma^2) = \frac{\sum_{i=1}^{N}\sum_{j=1}^{M} I_{ij}(R_{ij} - F_{ij})^2}{\sum_{i=1}^{N}\sum_{j=1}^{M} I_{ij}}, \quad
s(\sigma_\mu^2) = \mu^2, \quad
s(\sigma_a^2) = \frac{\sum_{i=1}^{N} a_i^2}{N}, \quad
s(\sigma_b^2) = \frac{\sum_{j=1}^{M} b_j^2}{M}, \quad
s(\sigma_U^2) = \frac{\sum_{i=1}^{N} U_i^T U_i}{ND}, \quad
s(\sigma_V^2) = \frac{\sum_{j=1}^{M} V_j^T V_j}{MD}.
\]

The mode, which is $\frac{\nu}{\nu+2}\sigma_y^2$ for $\text{Inv-}\chi^2(\nu, \sigma_y^2)$, of the posterior of each $\sigma^2$ is adopted to update the $\lambda$'s in (6). It is easy to see that the time and space complexity of hyperparameter learning is also linear in the number of training ratings.

Comparison with Related Work

Let us first consider MF methods based on non-probabilistic formulations. Maximum margin matrix factorization (MMMF) (Srebro, Rennie, and Jaakkola 2004; Rennie and Srebro 2005) can be seen as a special case of (6) by setting $\mu = a_i = b_j = \lambda_\mu = \lambda_a = \lambda_b = 0$.^5

^5 The original MMMF (Srebro, Rennie, and Jaakkola 2004) only used the hinge loss, but many subsequent works also used other loss functions such as smooth hinge (Rennie and Srebro 2005) and squared loss (Weimer, Karatzoglou, and Smola 2008). Here we refer to the squared loss function.
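As an illustration of the alternating updates described earlier, the closed-form row update $U_i = (\lambda_U I + \sum_j I_{ij} V_j V_j^T)^{-1}[\sum_j I_{ij}(R_{ij}-\mu-a_i-b_j)V_j]$ can be sketched in numpy. This is a minimal sketch, not the authors' implementation; the toy sizes, random data, and the helper name `update_U` are our own choices for illustration:

```python
import numpy as np

# Toy problem: N users, M items, rank D; random data for illustration only.
rng = np.random.default_rng(0)
N, M, D = 4, 5, 3
R = rng.integers(1, 6, size=(N, M)).astype(float)  # rating matrix
I = rng.integers(0, 2, size=(N, M)).astype(float)  # observed-entry indicator
mu = R[I == 1].mean()                              # global mean of observed ratings
a = np.zeros(N)                                    # user biases
b = np.zeros(M)                                    # item biases
V = rng.normal(size=(M, D))                        # item latent factors
lam_U = 0.1                                        # regularization hyperparameter

def update_U(R, I, mu, a, b, V, lam_U):
    """Solve the D x D linear system for each row of U independently."""
    N, M = R.shape
    D = V.shape[1]
    U = np.empty((N, D))
    for i in range(N):
        A = lam_U * np.eye(D)
        rhs = np.zeros(D)
        for j in range(M):
            if I[i, j]:
                A += np.outer(V[j], V[j])                       # lam_U*I + sum I_ij V_j V_j^T
                rhs += (R[i, j] - mu - a[i] - b[j]) * V[j]      # sum I_ij (R_ij - mu - a_i - b_j) V_j
        U[i] = np.linalg.solve(A, rhs)                          # O(D^3) per row
    return U

U = update_U(R, I, mu, a, b, V, lam_U)
```

The rows decouple exactly as the derivation states, so the loop over `i` could also be parallelized; the update for `V` is symmetric with the roles of users and items exchanged.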
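The conjugate hyperparameter update in (11) likewise has a closed form: compute the sufficient statistic $s$, form the posterior scale, and take the posterior mode $\frac{\nu}{\nu+2}$ times that scale. A minimal sketch, where the values of $\nu_0$, $\sigma_0^2$, and the sample vector `y` are illustrative only:

```python
import numpy as np

def posterior_mode(y, nu0, s2_0):
    """Mode of the scaled inverse-chi^2 posterior in (11).

    Prior: sigma_y^2 ~ Inv-chi2(nu0, s2_0).
    Posterior: Inv-chi2(nu0 + n, (nu0*s2_0 + n*s)/(nu0 + n)), s = sum(y_i^2)/n.
    The mode of Inv-chi2(nu, scale) is nu/(nu + 2) * scale.
    """
    n = len(y)
    s = np.sum(np.square(y)) / n          # sufficient statistic
    nu = nu0 + n
    scale = (nu0 * s2_0 + n * s) / nu     # posterior scale parameter
    return nu / (nu + 2) * scale          # posterior mode, used to update the lambda's

y = np.array([0.5, -1.0, 0.2, 0.8])       # e.g. residuals contributing to one sigma^2
s2_new = posterior_mode(y, nu0=1.0, s2_0=1.0)
```

As $n$ grows, the mode approaches the empirical mean square $s$, so the data dominate the $\nu_0$ pseudo-observations contributed by the hyperprior.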