Letting Φ = ∆^{-1}, we can get

\[
p(X) \;=\; \frac{\exp\{\operatorname{tr}[-\tfrac{1}{2}X\Delta X^{T}]\}}{(2\pi)^{qN/2}\,|\Delta|^{-q/2}}
\;=\; \frac{\exp\{-\tfrac{1}{2}\operatorname{tr}[X\Delta X^{T}]\}}{(2\pi)^{qN/2}\,|\Delta|^{-q/2}}.
\]

The first term ∑_{i=1}^{N} D̂_ii ‖X_{*i}‖² in (5) can be treated as a measure of the weighted variance of all the instances in the latent space. We can see that the larger D̂_ii is, the more weight will be put on instance i, which is reasonable because D̂_ii mainly reflects the degree of instance i in the graph.

It is easy to see that, for those latent representations having a fixed value of the weighted variance ∑_{i=1}^{N} D̂_ii ‖X_{*i}‖², the closer the latent representations of two linked entities are, the smaller is their contribution to tr[X∆X^T], and consequently the larger is their contribution to p(X). This means that under the latent space representation X, the closer the linked instances are, the higher is the probability density at X given by the prior. Hence, we can get an appropriate prior for X by setting Φ = ∆^{-1} in (4).
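To make the above argument concrete, the following minimal Python sketch (not from the paper) evaluates the log of the prior density p(X) given above. Since the construction of ∆ from the link structure appears earlier in the paper and is not reproduced in this excerpt, the Laplacian-style ∆ = γI + D − A used in the toy example is only an illustrative stand-in with the same qualitative structure (positive diagonal weights, negative entries for linked pairs); the helper name log_prior and the toy graph are likewise assumptions made for illustration. With the per-instance norms, and hence the weighted variance, held fixed, pulling the latent representations of the two linked instances together raises the prior log-density.

```python
import numpy as np

# Minimal sketch (not the paper's code). It evaluates
#   log p(X) = -1/2 tr[X Delta X^T] - (qN/2) ln(2 pi) + (q/2) ln|Delta|
# for a toy Delta. NOTE: the paper constructs Delta from the links in
# Section 4.1.1; the Laplacian-style Delta = gamma*I + D - A below is only
# an illustrative stand-in with the same qualitative structure.

def log_prior(X, Delta):
    """log N_{q,N}(X | 0, I_q (x) Delta^{-1}) for X of shape (q, N)."""
    q, N = X.shape
    sign, logdet = np.linalg.slogdet(Delta)    # assumes Delta is positive definite
    quad = np.trace(X @ Delta @ X.T)
    return -0.5 * quad - 0.5 * q * N * np.log(2 * np.pi) + 0.5 * q * logdet

# A tiny graph: instances 0 and 1 are linked, instance 2 is isolated.
A = np.array([[0., 1., 0.],
              [1., 0., 0.],
              [0., 0., 0.]])
gamma = 1e-6
D = np.diag(A.sum(axis=1))
Delta = gamma * np.eye(3) + D - A              # toy stand-in for the paper's Delta

# Two latent configurations (q = 2) with identical per-instance norms,
# hence identical weighted variance; they differ only in how close the
# two *linked* columns are.
X_far   = np.array([[ 1., -1., 0.5],           # linked columns point in opposite directions
                    [ 0.,  0., 0.5]])
X_close = np.array([[ 1.,  1., 0.5],           # linked columns coincide
                    [ 0.,  0., 0.5]])

print(log_prior(X_far, Delta))                 # lower prior log-density
print(log_prior(X_close, Delta))               # higher prior log-density
```

Under this toy ∆, the printed value for X_close exceeds that for X_far, matching the discussion above.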
4.1.2 Model

With the constructed relational covariance Φ, the generative model of PRPCA is defined as follows:

\[
\Upsilon \sim \mathcal{N}_{d,N}(0,\ \sigma^{2} I_{d}\otimes\Phi), \qquad
X \sim \mathcal{N}_{q,N}(0,\ I_{q}\otimes\Phi), \qquad
T = WX + \mu e^{T} + \Upsilon,
\]

where Φ = ∆^{-1}. We can further obtain the following results:

\[
T \mid X \sim \mathcal{N}_{d,N}(WX + \mu e^{T},\ \sigma^{2} I_{d}\otimes\Phi), \qquad
T \sim \mathcal{N}_{d,N}\big(\mu e^{T},\ (WW^{T} + \sigma^{2} I_{d})\otimes\Phi\big). \tag{6}
\]

The graphical model of PRPCA is illustrated in Figure 1(b), from which we can see that the difference between PRPCA and PPCA lies solely in the difference between Φ and I_N. Comparing (6) to (2), we can find that the observations of PPCA are sampled independently while those of PRPCA are sampled with correlation. In fact, PPCA may be seen as a degenerate case of PRPCA, as detailed below in Remark 1:

Remark 1  When the i.i.d. assumption holds, i.e., all A_ij = 0, PRPCA degenerates to PPCA by setting γ = 0. Note that the only role that γ plays is to make ∆ ≻ 0. Hence, in our implementation, we always set γ to a very small positive value, such as 10^{-6}. Actually, we may even set γ to 0, because ∆ does not have to be pd. When ∆ ⪰ 0, we say that T follows a singular matrix variate normal distribution [11], and all the derivations for PRPCA are still correct. In our experiments, we find that the performance under γ = 0 is almost the same as that under γ = 10^{-6}. Further deliberation is out of the scope of this paper.

As in PPCA, we set C = WW^T + σ²I_d. Then the log-likelihood of the observation matrix T in PRPCA is

\[
L_{1} = \ln p(T) = -\frac{N}{2}\Big[\, d\ln(2\pi) + \ln|C| + \operatorname{tr}(C^{-1}H) \,\Big] + c, \tag{7}
\]

where c = −(d/2) ln|Φ| can be seen as a constant independent of the parameters μ, W and σ², and H = (T − μe^T)∆(T − μe^T)^T / N. It is interesting to compare (7) with (3). We can find that, to learn the parameters W and σ², the only difference between PRPCA and PPCA lies in the difference between H and S. Hence, all the learning techniques derived previously for PPCA are also potentially applicable to PRPCA simply by substituting S with H.

4.2 Learning

By setting the gradient of L_1 with respect to μ to 0, we can get the maximum-likelihood estimator (MLE) for μ as follows: μ = T∆e / (e^T∆e). As in PPCA [21], we devise two methods to learn W and σ² in PRPCA, one based on a closed-form solution and the other based on EM.
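To illustrate the closed-form route, here is a minimal Python sketch, not the authors' implementation: it computes the MLE of μ and the matrix H defined above, and then applies the standard closed-form PPCA solution [21] (eigen-decomposition, with σ² estimated as the average of the d − q discarded eigenvalues and W built from the top q eigenvectors) with S replaced by H, as the text suggests. The function name prpca_closed_form is assumed for illustration, and the arbitrary rotation factor in W is dropped.

```python
import numpy as np

def prpca_closed_form(T, Delta, q):
    """Closed-form MLE sketch for PRPCA: the standard PPCA solution of [21]
    applied with the sample covariance S replaced by H, as suggested above.
    T is d x N, Delta is N x N, and q < d is the latent dimensionality."""
    d, N = T.shape
    e = np.ones((N, 1))

    # MLE of mu: mu = T Delta e / (e^T Delta e)
    mu = (T @ Delta @ e) / (e.T @ Delta @ e)

    # H = (T - mu e^T) Delta (T - mu e^T)^T / N
    Tc = T - mu @ e.T
    H = Tc @ Delta @ Tc.T / N

    # Eigen-decomposition of the symmetric matrix H, eigenvalues descending.
    eigvals, eigvecs = np.linalg.eigh(H)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # PPCA-style closed form with S -> H:
    #   sigma^2 = average of the d - q discarded eigenvalues,
    #   W = U_q (Lambda_q - sigma^2 I_q)^{1/2}  (arbitrary rotation omitted).
    sigma2 = eigvals[q:].mean()
    W = eigvecs[:, :q] @ np.diag(np.sqrt(np.maximum(eigvals[:q] - sigma2, 0.0)))
    return mu, W, sigma2
```

The EM alternative mentioned above would follow the same pattern, with the PPCA E- and M-steps applied to H in place of S.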