it is easy to check that L_{ij} = 0 if and only if W_{ij} = 0 for all i ≠ j. ■

Hence, we can say that p(X_{*k}) is a meaningful prior for X_{*k} because its underlying graph G effectively reflects the conditional independence relationships among the random variables. More specifically, from one perspective, by the definition of W, W_{ij} = 0 means that point i and point j are from different classes. From the other perspective, by the semantics implied by the MRF, W_{ij} = 0 means that X_{ik} is independent of X_{jk} given the other variables. Because it is reasonable to make the latent representations of two points from different classes independent, these two perspectives are coherent.

Furthermore, if we compute

p(X) = \prod_{k=1}^{q} p(X_{*k}) = \frac{1}{Z^q} \exp\Big[ -\frac{\alpha}{2} \mathrm{tr}(X^{T} L X) \Big],    (6)

we can also understand the effectiveness of the prior in (6). To see this, let us rewrite the term tr(X^T L X) as follows:

\mathrm{tr}(X^{T} L X) = \frac{1}{2} \sum_{k=1}^{q} \Big[ \sum_{i=1}^{N} \sum_{j=1}^{N} W_{ij} (X_{ik} - X_{jk})^{2} \Big]
                       = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \Big[ W_{ij} \sum_{k=1}^{q} (X_{ik} - X_{jk})^{2} \Big]
                       = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} W_{ij} \| x_i - x_j \|^{2},    (7)

where \|x_i - x_j\| is the Euclidean distance between x_i and x_j. We can see that tr(X^T L X) is the sum of the squared distances between points from the same class. The closer the points from the same class are, the smaller the term tr(X^T L X) becomes, and consequently the higher p(X) is. Hence, a latent representation that brings data points from the same class closer together is given higher probability. This is exactly what we need to learn – a discriminative latent space.
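To make this construction concrete, the following minimal sketch (our own illustration, not code from the paper; function names and the toy data are ours) builds the supervised weight matrix W, with W_{ij} = 1 when points i and j share a class and 0 otherwise, forms the standard unnormalized graph Laplacian L = D − W, and numerically checks the identity in (7).

```python
import numpy as np

def supervised_laplacian(labels):
    """Build W (W_ij = 1 if i != j share a class, else 0) and the Laplacian L = D - W."""
    labels = np.asarray(labels)
    W = (labels[:, None] == labels[None, :]).astype(float)
    np.fill_diagonal(W, 0.0)          # no self-edges
    D = np.diag(W.sum(axis=1))        # degree matrix
    return W, D - W

# Toy check of the identity tr(X^T L X) = 1/2 * sum_ij W_ij ||x_i - x_j||^2 from (7).
rng = np.random.default_rng(0)
labels = np.array([0, 0, 1, 1, 2])
X = rng.normal(size=(5, 2))           # N = 5 points in a q = 2 dimensional latent space

W, L = supervised_laplacian(labels)
lhs = np.trace(X.T @ L @ X)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
rhs = 0.5 * (W * sq_dists).sum()
assert np.isclose(lhs, rhs)           # the two sides agree up to numerical precision
```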
Model Formulation of GPLRF

With the prior in (6) and the likelihood in (1), we can obtain the posterior distribution

p(X | Y) \propto p(Y | X)\, p(X).    (8)

If we use the MAP strategy to estimate X, the supervisory information in p(X) will be seamlessly integrated into the model. We call this model Gaussian process latent random field (GPLRF) because p(Y | X) corresponds to Gaussian processes and the prior p(X) for the latent variables is a Gaussian Markov random field w.r.t. a graph constructed from the supervisory information.

The MAP estimate of X can then be obtained by minimizing the following objective function:

L_s = \frac{d}{2} \ln|K| + \frac{1}{2} \mathrm{tr}(K^{-1} Y Y^{T}) + \frac{\alpha}{2} \mathrm{tr}(X^{T} L X).    (9)

To minimize L_s, we can first compute the gradient of (9),

\frac{\partial L_s}{\partial X} = \frac{\partial L_r}{\partial X} + \alpha L X,

where L_r denotes the first two terms in (9) (the GPLVM negative log-likelihood), and then apply the SCG method (Møller 1993). This procedure is similar to that for GPLVM.

Discussions

The objective function in (9) can be interpreted as a regularized version of GPLVM, where the regularizer is a nonlinear variant of the supervised locality preserving projection (SLPP) model (Zheng et al. 2007). When α → +∞, (9) has a closed-form solution and is equivalent to a variant of the supervised kernel locality preserving projection (SKLPP) model (Zheng et al. 2007) or a variant of Laplacian eigenmap (Belkin and Niyogi 2001).

Although the objective function in (9) is only a simple combination of two components, this combination is very effective because the two components are intrinsically complementary to each other.

Although the Laplacian matrix L in this paper is constructed from the supervisory information, it may also be constructed from the similarity between the observed data Y (often called a similarity graph), which is similar to the Laplacian matrix used in Laplacian eigenmap. Another possible extension is to combine both supervisory information and a similarity graph to obtain a semi-supervised version of GPLRF. It is worth noting that all these extensions can be integrated into the GPLRF framework easily. Due to space limitations, they are left to future work.
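As a rough numerical illustration of the objective in (9) and its gradient (a sketch under our own assumptions, not the authors' implementation: we fix an RBF covariance with unit amplitude, inverse width gamma_x, and a small noise term, and keep the kernel hyperparameters fixed, whereas the paper also learns them and minimizes L_s with SCG; all function and parameter names are ours), the code below evaluates L_s and ∂L_s/∂X. The GPLVM part uses the standard gradient ∂L_r/∂K = (1/2)(d K^{-1} − K^{-1} Y Y^T K^{-1}) chained through the kernel, and the prior contributes the αLX term; a small finite-difference check is included.

```python
import numpy as np

def rbf_kernel(X, gamma_x=1.0, noise=1e-2):
    """Assumed covariance: K_nm = exp(-gamma_x/2 ||x_n - x_m||^2) + noise * delta_nm."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-0.5 * gamma_x * sq) + noise * np.eye(len(X))

def gplrf_objective(X, Y, L, alpha, gamma_x=1.0):
    """L_s = d/2 ln|K| + 1/2 tr(K^{-1} Y Y^T) + alpha/2 tr(X^T L X), as in (9)."""
    d = Y.shape[1]
    K = rbf_kernel(X, gamma_x)
    Kinv = np.linalg.inv(K)
    _, logdet = np.linalg.slogdet(K)
    return (0.5 * d * logdet
            + 0.5 * np.trace(Kinv @ Y @ Y.T)
            + 0.5 * alpha * np.trace(X.T @ L @ X))

def gplrf_gradient(X, Y, L, alpha, gamma_x=1.0):
    """dL_s/dX = dL_r/dX + alpha * L X, with dL_r/dX obtained by the chain rule."""
    N, d = Y.shape
    K = rbf_kernel(X, gamma_x)
    Kinv = np.linalg.inv(K)
    dL_dK = 0.5 * (d * Kinv - Kinv @ Y @ Y.T @ Kinv)     # standard GPLVM gradient w.r.t. K
    # dK_nm/dx_n = -gamma_x * K_nm * (x_n - x_m); the diagonal contributes nothing.
    A = -gamma_x * dL_dK * K                              # elementwise product
    grad_r = 2.0 * (A.sum(axis=1, keepdims=True) * X - A @ X)
    return grad_r + alpha * L @ X

# Tiny finite-difference check, which also serves as a usage example.
rng = np.random.default_rng(0)
labels = np.array([0, 0, 1, 1])
Y = rng.normal(size=(4, 3))                              # N = 4 observations, d = 3
X = rng.normal(size=(4, 2))                              # q = 2 latent dimensions
W = (labels[:, None] == labels[None, :]).astype(float)
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W
alpha, eps = 1.0, 1e-6

g = gplrf_gradient(X, Y, L, alpha)
Xp, Xm = X.copy(), X.copy()
Xp[0, 0] += eps
Xm[0, 0] -= eps
fd = (gplrf_objective(Xp, Y, L, alpha) - gplrf_objective(Xm, Y, L, alpha)) / (2 * eps)
assert np.isclose(g[0, 0], fd, rtol=1e-4, atol=1e-6)
```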
Out-of-Sample Prediction

For a new test point y_t, we exploit the back-constrained strategy in (Lawrence and Quiñonero-Candela 2006) and (Urtasun and Darrell 2007) to estimate its low-dimensional representation x_t. More specifically, we minimize (9) subject to the constraints

x_{tj} = g_j(y_t),

where x_{tj} is the jth dimension of x_t and g_j is a function of the input points. In particular, to obtain a smooth inverse mapping, we use an RBF kernel for the back-constraint in each dimension, g_j(y_t) = \sum_{m=1}^{N} \beta_{jm} k(y_t, y_m), where \{\beta_{jm}\} (m = 1, 2, ..., N; j = 1, 2, ..., q) are the parameters to learn and the kernel function is k(y_n, y_m) = \exp\big( -\frac{\gamma}{2} \| y_n - y_m \|^{2} \big), with γ being the inverse width parameter. The latent position of a test point can then be computed directly by evaluating the inverse mapping learned by the back-constraint at the test point, x_{tj} = g_j(y_t); a simplified sketch of this inverse mapping is given at the end of this section.

Experiments

To demonstrate the effectiveness of GPLRF, we compare it with related methods on both synthetic and real-world data sets. The baseline methods include: the 1-nearest-neighbor (1NN) classifier in the original space, PCA, LDA, LPP (He and Niyogi 2003), SLPP (Zheng et al. 2007) and their kernel counterparts, GPLVM, DGPLVM (Urtasun and Darrell 2007), LL-GPLVM (Urtasun et al. 2008), rankGPLVM (Geiger, Urtasun, and Darrell 2009), and tSNEGPLVM (van der Maaten 2009). Among them, PCA and LPP are unsupervised, while LDA and SLPP are supervised; LPP and SLPP aim to preserve similarity between points; PCA, LDA, LPP and SLPP are linear dimensionality reduction methods, whereas their kernel counterparts and all GPLVM-based methods are nonlinear.
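Returning to the out-of-sample prediction described above: the sketch below is a simplified, hypothetical illustration of the back-constraint inverse mapping, not the paper's procedure. In the paper the parameters β are learned by minimizing (9) under the constraints x_{tj} = g_j(y_t); here, purely for illustration, β is fitted by regularized least squares to an already-learned latent matrix X, and a test point is then mapped by evaluating g_j(y_t). All names (rbf_gram, fit_back_constraint, predict_latent, ridge) are ours.

```python
import numpy as np

def rbf_gram(Ya, Yb, gamma=1.0):
    """k(y_n, y_m) = exp(-gamma/2 ||y_n - y_m||^2)."""
    sq = ((Ya[:, None, :] - Yb[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-0.5 * gamma * sq)

def fit_back_constraint(Y, X, gamma=1.0, ridge=1e-6):
    """Fit beta in g_j(y) = sum_m beta_{jm} k(y, y_m) by regularized least squares.
    Simplification: the paper instead optimizes beta within the GPLRF objective (9)."""
    Kyy = rbf_gram(Y, Y, gamma)
    # beta has shape (N, q): column j holds the coefficients of g_j.
    return np.linalg.solve(Kyy + ridge * np.eye(len(Y)), X)

def predict_latent(Yt, Ytrain, beta, gamma=1.0):
    """x_t = g(y_t): evaluate the learned inverse mapping at test points."""
    return rbf_gram(Yt, Ytrain, gamma) @ beta

# Hypothetical usage with a stand-in for a previously learned latent matrix X (N x q).
rng = np.random.default_rng(0)
Ytrain = rng.normal(size=(10, 5))          # N = 10 observations in d = 5 dimensions
X = rng.normal(size=(10, 2))               # stand-in for the MAP estimate of X
beta = fit_back_constraint(Ytrain, X)
x_test = predict_latent(rng.normal(size=(1, 5)), Ytrain, beta)   # shape (1, 2)
```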