For simplicity, here we adopt the maximum a posteriori (MAP) strategy to estimate the parameters. Due to the $L_1$ constraint on $W$, the MAP estimation of $W$ will naturally induce sparsity, which means that many entries of $W$ will be automatically driven to zero during the learning procedure. In this paper, we assume that $p(\mu)$ and $p(\sigma^2)$ are uniform.

Remark 1 Of course, we may also put non-uniform priors, such as conjugate priors, on $\mu$ and $\sigma^2$. Here, uniform priors for $\mu$ and $\sigma^2$ are adopted mainly for fair comparison because they are also adopted in PRPCA. Alternatively, we may also treat all the parameters as random variables and resort to fully Bayesian methods, such as variational methods (Jordan et al. 1999), for learning and inference. However, since the focus of this paper is on demonstrating the promising advantages of sparsity under the PRPCA framework, all these possible variants are left to future extensions.

It is not easy to directly optimize the objective function in (2) though. As in (Park and Casella 2008; Guan and Dy 2009), we adopt a hierarchical interpretation of the Laplace prior:

  $p(Z_{ij} \mid \lambda) = \frac{\lambda}{2} \exp\left(-\frac{\lambda}{2} Z_{ij}\right)$, for $Z_{ij} \ge 0$,   (3)
  $W_{ij} \mid Z_{ij} \sim \mathcal{N}(0, Z_{ij})$.   (4)

It is easy to show that this hierarchical reformulation is equivalent to the original Laplace prior, because

  $p(W_{ij} \mid \lambda) = \int p(W_{ij} \mid Z_{ij})\, p(Z_{ij} \mid \lambda)\, dZ_{ij} = \frac{\sqrt{\lambda}}{2} \exp\left\{-\sqrt{\lambda}\, |W_{ij}|\right\}$.   (5)

Figure 1(b) depicts the graphical model of SPRP as compared with that of PRPCA in Figure 1(a).
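As a quick numerical sanity check on (3)-(5), the following NumPy sketch draws $W_{ij}$ through the two-level hierarchy and compares sample moments against the closed-form moments of the Laplace marginal. The sample size and the value of $\lambda$ are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 4.0
n = 1_000_000

# Hierarchical sampling per (3)-(4):
# Z ~ Exponential with density (lam/2) * exp(-(lam/2) * Z), i.e. scale = 2/lam,
# then W | Z ~ N(0, Z) (variance Z).
Z = rng.exponential(scale=2.0 / lam, size=n)
W = rng.normal(0.0, np.sqrt(Z))

# The marginal (5) is Laplace with scale b = 1/sqrt(lam), so
# std(W) = sqrt(2)*b = sqrt(2/lam) and E|W| = b = 1/sqrt(lam).
print(W.std(), np.sqrt(2.0 / lam))
print(np.abs(W).mean(), 1.0 / np.sqrt(lam))
```

With a million samples the empirical standard deviation and mean absolute value agree with the Laplace predictions to roughly three decimal places.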
Learning. By setting the gradient of $\ln p(\Theta \mid T)$ with respect to $\mu$ to zero, we get the (closed-form) MAP estimate for $\mu$ as follows:

  $\hat{\mu} = \dfrac{T \Delta e}{e^T \Delta e}$.   (6)

For the other parameters ($W$ and $\sigma^2$) of SPRP, we derive an (iterative) EM (Dempster, Laird, and Rubin 1977) algorithm to learn them. For the rest of this paper, we still use $\Theta$ to refer to the parameters but they only contain $W$ and $\sigma^2$ because $\mu$ can be directly computed by (6). During the EM learning procedure, we treat $\{Z, X\}$ as missing data and $\{T, Z, X\}$ as complete data. The EM algorithm for MAP estimation operates by alternating between the following two steps:

- E-step: The expectation of the complete-data log-posterior with respect to the distribution of the missing variables $\{Z, X\}$ is computed. This expected value is often called the Q-function, which is defined as follows:
  $Q(\Theta \mid \Theta(t)) = \int dZ\, dX\; p(Z, X \mid \Theta(t), T)\, \ln p(\Theta \mid T, Z, X)$.
- M-step: The Q-function is maximized to update the parameters:
  $\Theta(t+1) = \arg\max_{\Theta} Q(\Theta \mid \Theta(t))$.

The whole EM learning procedure is summarized in Algorithm 1 and the detailed derivation can be found in (Li 2010). Note that as in (Li, Yeung, and Zhang 2009), we use $W$ and $\sigma^2$ to denote the old values and $\widetilde{W}$ and $\widetilde{\sigma}^2$ for the updated ones.

Algorithm 1 EM algorithm for SPRP
  Initialize $W$ and $\sigma^2$.
  for $t = 1$ to $T$
    E-step: Compute the sufficient statistics
      $M = W^T W + \sigma^2 I_q$,
      $\langle X \rangle = M^{-1} W^T (T - \mu e^T)$,
      $\langle X \Delta X^T \rangle = N \sigma^2 M^{-1} + \langle X \rangle \Delta \langle X \rangle^T$,
      $\langle 1/Z_{ij} \rangle = \sqrt{\lambda} / |W_{ij}|$.
    M-step: Update the parameters
      for $i = 1$ to $d$
        $\Sigma_i = \mathrm{diag}\left(\frac{|W_{i1}|}{\sqrt{\lambda}}, \cdots, \frac{|W_{iq}|}{\sqrt{\lambda}}\right)$,
        $\widetilde{W}_{i*} = H_{i*} W M^{-1} \Sigma_i \left[(\sigma^2 I_q + M^{-1} W^T H W) M^{-1} \Sigma_i + \frac{\sigma^2}{N} I_q\right]^{-1}$.
      end for
      $\widetilde{\sigma}^2 = \mathrm{tr}[H - H W M^{-1} \widetilde{W}^T] / d$.
  end for

To learn the hyperparameter $\lambda$, we may resort to the cross-validation strategy. Alternatively, we may treat $\lambda$ as one of the parameters, just like $W$, to get the following EM updating equation: $\lambda = \frac{2dq}{\sum_{i=1}^{d} \sum_{j=1}^{q} \langle Z_{ij} \rangle}$.

SPRP with Jeffreys Prior
The Jeffreys prior (Figueiredo 2003; Guan and Dy 2009) is a noninformative prior.
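The updates of Algorithm 1 can be sketched in NumPy as below. The function name `sprp_em_step` and the toy positive semi-definite matrix standing in for $H$ are illustrative assumptions only: the actual $H$ and $\Delta$ come from the PRPCA relational construction, which is not reproduced in this section. Note that the row update multiplies by $\Sigma_i$ rather than dividing by $|W_{ij}|$, so rows driven to zero cause no numerical trouble.

```python
import numpy as np

def sprp_em_step(W, sigma2, H, lam, N):
    """One EM iteration of Algorithm 1 (Laplace prior).

    W: d x q loading matrix, sigma2: noise variance,
    H: d x d statistic assumed given by the PRPCA construction,
    lam: Laplace hyperparameter, N: number of instances.
    """
    d, q = W.shape
    # E-step sufficient statistic
    M = W.T @ W + sigma2 * np.eye(q)
    Minv = np.linalg.inv(M)
    # Shared q x q factor of the M-step row update
    G = sigma2 * np.eye(q) + Minv @ W.T @ H @ W

    # M-step: update W row by row
    W_new = np.empty_like(W)
    for i in range(d):
        Sigma_i = np.diag(np.abs(W[i]) / np.sqrt(lam))
        A = G @ Minv @ Sigma_i + (sigma2 / N) * np.eye(q)
        W_new[i] = H[i] @ W @ Minv @ Sigma_i @ np.linalg.inv(A)

    # M-step: update the noise variance
    sigma2_new = np.trace(H - H @ W @ Minv @ W_new.T) / d
    return W_new, sigma2_new

# Toy usage with an illustrative PSD matrix in place of H
rng = np.random.default_rng(0)
d, q, N = 6, 2, 50
B = rng.normal(size=(d, d))
H = B @ B.T / d
W = 0.1 * rng.normal(size=(d, q))
W_new, sigma2_new = sprp_em_step(W, 1.0, H, lam=1.0, N=N)
print(W_new.shape, sigma2_new)
```

In practice the step would be wrapped in the outer `for t = 1 to T` loop, with $\mu$ fixed in advance via (6).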
To use the Jeffreys prior, we only need to change the density function of $Z_{ij}$ in (3) to $p(Z_{ij}) \propto \frac{1}{Z_{ij}}$. We can see that there exist no hyperparameters in the Jeffreys prior, but the Laplace prior does have the hyperparameter $\lambda$ which needs to be learned.

With the Jeffreys prior, the log-posterior can be computed as follows:

  $\ln p(\Theta \mid T) = \ln p(T \mid \Theta) + \ln p(\Theta) + c_2 = -\frac{N}{2}\left[\ln|C| + \mathrm{tr}(C^{-1}H)\right] - \ln\|W\|_1 + \ln p(\mu) + \ln p(\sigma^2) + c_3$,   (7)

where $c_2$ and $c_3$ are constants independent of the parameters. We can see that the only difference between (2) and (7) lies in the difference between the regularization terms $\sqrt{\lambda}\|W\|_1$ and $\ln\|W\|_1$.

To learn the parameters for SPRP with the Jeffreys prior, we only need to change $\langle \frac{1}{Z_{ij}} \rangle$ and $\Sigma_i$ in Algorithm 1 as follows: $\langle \frac{1}{Z_{ij}} \rangle = \frac{1}{W_{ij}^2}$, $\Sigma_i = \mathrm{diag}(W_{i1}^2, \ldots, W_{iq}^2)$.
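Since swapping priors only changes the diagonal scaling $\Sigma_i$ (together with the matching $\langle 1/Z_{ij} \rangle$), the rest of the EM loop is shared between the two variants. A minimal sketch, with `row_scale_matrix` a hypothetical helper name:

```python
import numpy as np

def row_scale_matrix(W_row, prior="laplace", lam=1.0):
    """Sigma_i of Algorithm 1: diag(|W_ij| / sqrt(lam)) under the Laplace
    prior, diag(W_ij^2) under the Jeffreys prior (no hyperparameter)."""
    if prior == "laplace":
        return np.diag(np.abs(W_row) / np.sqrt(lam))
    if prior == "jeffreys":
        return np.diag(W_row ** 2)
    raise ValueError(f"unknown prior: {prior}")

row = np.array([1.0, -2.0])
print(np.diag(row_scale_matrix(row, "laplace", lam=4.0)))  # [0.5, 1.0]
print(np.diag(row_scale_matrix(row, "jeffreys")))          # [1.0, 4.0]
```

Isolating the prior-dependent term this way makes the two variants of Algorithm 1 differ by a single line.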