Latent Wishart Processes for Relational Kernel Learning

3.2 Learning

It is natural to consider the MAP estimate of A, namely,

$$\operatorname*{argmax}_{A}\ \log p(Z \mid A)\, p(A).$$

Theorem 1 shows that finding the MAP estimate of A is equivalent to finding the MAP estimate of B. In particular, we note that for the applications presented in Section 5, small values of q, typically no larger than 50, are sufficient for delivering good performance. This motivates us to instead find the MAP estimate of B, which will dramatically reduce the computation cost. Consequently, we attempt to maximize the following log posterior probability:

$$
\begin{aligned}
L(B) &= \log\{p(Z \mid B)\, p(B)\} \\
&= \sum_{i \neq k} \log p(z_{ik} \mid b_i, b_k) + \log p(B) \\
&= \sum_{i \neq k} \Big[ z_{ik} b_i' b_k / 2 - \log\big(1 + \exp(b_i' b_k / 2)\big) \Big] - \frac{1}{2}\,\mathrm{tr}\!\left( \frac{(K + \lambda I)^{-1}}{\beta}\, B B' \right) + C \\
&= \sum_{i \neq k} \Big[ z_{ik} b_i' b_k / 2 - \log\big(1 + \exp(b_i' b_k / 2)\big) \Big] - \frac{1}{2} \sum_{i,k} \sigma_{ik}\, b_i' b_k + C,
\end{aligned}
$$

where $[\sigma_{ik}]_{i,k=1}^{n} = \frac{(K + \lambda I)^{-1}}{\beta}$ and C is a constant independent of B. We employ a block quasi-Newton method to solve the maximization of L(B) w.r.t. B. The basic idea is to approximate the Hessian matrix $\frac{\partial^2 L}{\partial B\, \partial B'}$ by the block diagonal matrix $\mathrm{Block\,diag}\big( \frac{\partial^2 L}{\partial b_1 \partial b_1'}, \ldots, \frac{\partial^2 L}{\partial b_n \partial b_n'} \big)$. The Fisher score vector and Hessian matrix of L w.r.t. $b_i$ are given by

$$
\frac{\partial L}{\partial b_i} = \sum_{j \neq i} (z_{ij} - s_{ij} - \sigma_{ij})\, b_j - \sigma_{ii} b_i,
\qquad
\frac{\partial^2 L}{\partial b_i\, \partial b_i'} = -\frac{1}{2} \sum_{j \neq i} s_{ij}(1 - s_{ij})\, b_j b_j' - \sigma_{ii} I_q \triangleq -H_i.
$$

Given the initial values $b_i(0)$, the update equations for the $b_i$ are

$$
b_i(t+1) = b_i(t) + \gamma\, H_i(t)^{-1} \left. \frac{\partial L}{\partial b_i} \right|_{B = B(t)}, \qquad i = 1, \ldots, n, \tag{4}
$$

where $\gamma$ is the step size. One strategy for making the objective function increase monotonically is to learn $\gamma$ in each update step, but searching for a suitable $\gamma$ in every step might incur a high computation cost. In our experiments, we find that if $\gamma$ is simply set to a value less than 0.1, our algorithm works very well for all the experiments. Although in this case the objective function does not necessarily increase monotonically, its long-term trend is increasing. This makes our algorithm not only very fast but also very accurate.

Note that this iterative procedure works in a parallel manner. It may also work sequentially. However, our experiments show that the parallel scheme is more stable than the sequential scheme. As we have mentioned, the size of $H_i$ is $q \times q$ with $q$ always being a small number in our experiments, so the iterative procedure is very efficient. In this paper, we adopt KPCA (Schölkopf et al., 1998) to initialize the values $b_i(0)$.
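To make the update concrete, the following is a minimal NumPy sketch of one parallel block quasi-Newton sweep implementing the formulas above. It assumes that $\Sigma = \beta(K + \lambda I)$, so that $\Sigma^{-1}$ equals the $[\sigma_{ik}]$ matrix defined above, and that $s_{ij}$ is the logistic function of $b_i' b_j / 2$ introduced earlier in the paper; the function name, argument layout, and default hyperparameter values are illustrative choices rather than part of the original method description.

```python
import numpy as np


def sigmoid(x):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def lwp_newton_sweep(B, Z, K, lam=0.01, beta=1.0, gamma=0.05):
    """One parallel block quasi-Newton sweep over b_1, ..., b_n (update (4)).

    B     : (n, q) current latent factors, rows b_i' (e.g. initialized by KPCA).
    Z     : (n, n) observed binary link matrix with entries z_ij in {0, 1}.
    K     : (n, n) input kernel matrix.
    lam, beta : hyperparameters, assuming Sigma = beta * (K + lam * I).
    gamma : fixed step size; a value below 0.1 is reported to work well.
    """
    n, q = B.shape
    sigma = np.linalg.inv(K + lam * np.eye(n)) / beta  # [sigma_ik] = (K + lam I)^{-1} / beta
    S = sigmoid(B @ B.T / 2.0)                         # s_ij, assumed logistic in b_i' b_j / 2

    B_new = np.empty_like(B)
    for i in range(n):
        mask = np.arange(n) != i
        w = Z[i, mask] - S[i, mask] - sigma[i, mask]
        grad = B[mask].T @ w - sigma[i, i] * B[i]      # Fisher score dL/db_i
        d = S[i, mask] * (1.0 - S[i, mask])
        # H_i = (1/2) sum_j s_ij (1 - s_ij) b_j b_j' + sigma_ii I_q
        H_i = 0.5 * (B[mask].T * d) @ B[mask] + sigma[i, i] * np.eye(q)
        B_new[i] = B[i] + gamma * np.linalg.solve(H_i, grad)  # parallel update (4)
    return B_new
```

Running this sweep repeatedly with a fixed $\gamma$ mirrors the parallel scheme described above; a sequential variant would instead overwrite each row of B in place inside the loop.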
3.3 Out-of-Sample Extension

It is easy for LWP to perform out-of-sample extension (or induction), which means that we can use the learned LWP to predict the $b_i$ for new test data. Let

$$
Z = \begin{bmatrix} Z_{11} & Z_{12} \\ Z_{21} & Z_{22} \end{bmatrix}
\quad \text{and} \quad
\Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix},
$$

where $Z_{11}$ ($\Sigma_{11}$) is an $n_1 \times n_1$ matrix and $Z_{22}$ ($\Sigma_{22}$) is an $n_2 \times n_2$ matrix. Suppose the $n_1$ instances corresponding to $Z_{11}$ and $\Sigma_{11}$ are training data, and $Z_{22}$ and $\Sigma_{22}$ correspond to new test data. Similarly, we partition

$$
B = \begin{bmatrix} B_1 \\ B_2 \end{bmatrix}, \qquad
A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}
  = \begin{bmatrix} B_1 B_1' & B_1 B_2' \\ B_2 B_1' & B_2 B_2' \end{bmatrix}.
$$

Because $B \sim \mathcal{N}_{n,q}(0, \Sigma \otimes I_q)$, we have $B_1 \sim \mathcal{N}_{n_1,q}(0, \Sigma_{11} \otimes I_q)$ and

$$
B_2 \mid B_1 \sim \mathcal{N}_{n_2,q}\big( \Sigma_{21} \Sigma_{11}^{-1} B_1,\ \Sigma_{22 \cdot 1} \otimes I_q \big), \tag{5}
$$

where $\Sigma_{22 \cdot 1} = \Sigma_{22} - \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12}$.

For inductive inference, we first find the MAP estimate of $B_1$ based on $\operatorname*{argmax}_{B_1} \log p(Z_{11} \mid B_1)\, p(B_1)$ and then adopt the conditional expectation (or conditional mean) of $B_2$ given $B_1$ to estimate $B_2$, which is given by $\Sigma_{21} \Sigma_{11}^{-1} B_1$ based on (5).

Owing to the page limit, out-of-sample extension will not be evaluated via experiments in this paper. We will report the experiments in our future work.
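As a small illustration of the inductive step, the sketch below computes the conditional-mean estimate $\Sigma_{21} \Sigma_{11}^{-1} B_1$ from (5). The convention that the $n_1$ training instances occupy the leading rows and columns of $\Sigma$, as well as the function name, are our own assumptions for illustration.

```python
import numpy as np


def lwp_predict_test_factors(B1, Sigma, n1):
    """Conditional-mean estimate of the test factors B2 given training factors B1.

    B1    : (n1, q) MAP estimate of the latent factors for the training data.
    Sigma : (n, n)  covariance over training + test instances, with the
            training block Sigma11 in the top-left n1 x n1 corner.
    n1    : number of training instances.

    Returns the (n2, q) matrix Sigma21 @ inv(Sigma11) @ B1 from equation (5).
    """
    Sigma11 = Sigma[:n1, :n1]
    Sigma21 = Sigma[n1:, :n1]
    # Solve Sigma11 X = B1 rather than forming the explicit inverse.
    return Sigma21 @ np.linalg.solve(Sigma11, B1)
```

The induced kernel blocks for the test data then follow from the partition of A above, e.g. the test-test block $A_{22} = B_2 B_2'$.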
4 Related Work

The methods most closely related to our LWP model are RGP (Chu et al., 2007) and XGP (Silva et al., 2008). Both RGP and XGP also try to learn the covariance (kernel) matrix for Gaussian processes by exploiting the relationships between instances. Unlike LWP, which learns multiple (q) GPs, RGP and XGP learn only one GP. In fact, when q = 1, B