… experiments reported later in this paper.

In this paper, we propose a novel supervised GPLVM method, called Gaussian process latent random field (GPLRF), by enforcing the latent variables to be a Gaussian Markov random field (GMRF) (Rue and Held 2005) with respect to (w.r.t.) a graph constructed from the supervisory information. Some promising properties of GPLRF are highlighted below:

• GPLRF is nonlinear and hence can deal with complex data sets that linear methods, such as PCA and MDS, cannot handle.

• Compared with GPLVM, GPLRF can learn a more discriminative latent representation in which data points of the same class are clustered together and those of different classes form different clusters.

• The dimensionality of the latent space is no longer restricted to at most C − 1, making GPLRF more flexible than DGPLVM in applications.

• GPLRF achieves performance at least comparable with DGPLVM and other state-of-the-art methods on data sets with intrinsic dimensionality at most C − 1, and dramatically outperforms DGPLVM on data sets whose intrinsic dimensionality exceeds C − 1.

Gaussian Process Latent Variable Model

Given a set of N training examples represented as a matrix Y = [y_1, y_2, \ldots, y_N]^T with y_i ∈ R^d, let the matrix X = [x_1, x_2, \ldots, x_N]^T, where x_i ∈ R^q and q < d, denote their corresponding positions in the latent space. In the context of latent variable models, we often call Y the observed data and X the latent representation (or latent variables) of Y. GPLVM relates each high-dimensional observation y_i to its corresponding latent position x_i through a Gaussian process mapping from the latent space to the observation space. Given a covariance function k(x_i, x_j) for the Gaussian process, the likelihood of the observed data given the latent positions is

p(Y \mid X) = \frac{1}{\sqrt{(2\pi)^{Nd}\,|K|^{d}}} \exp\left(-\frac{1}{2}\,\mathrm{tr}(K^{-1} Y Y^T)\right),   (1)

where K is the kernel matrix with elements K_{ij} = k(x_i, x_j), and \mathrm{tr}(\cdot) denotes the trace of a matrix. We use a kernel defined as follows:

k(x_i, x_j) = \theta_1 \exp\left(-\frac{\theta_2}{2}\,\|x_i - x_j\|^2\right) + \theta_3 + \frac{\delta_{ij}}{\theta_4},   (2)

where \delta_{ij} is the Kronecker delta function and \theta = \{\theta_1, \theta_2, \theta_3, \theta_4\} are the kernel parameters. Maximizing the likelihood is equivalent to minimizing

L_r = \frac{d}{2} \ln|K| + \frac{1}{2}\,\mathrm{tr}(K^{-1} Y Y^T) + \frac{Nd}{2} \ln(2\pi).   (3)

The gradients of (3) w.r.t. the latent variables can be computed by combining

\frac{\partial L_r}{\partial K} = \frac{d}{2} K^{-1} - \frac{1}{2} K^{-1} Y Y^T K^{-1}   (4)

with \partial K / \partial x_{ij} via the chain rule, where x_{ij} is the jth dimension of x_i. Based on (3) and (4), a nonlinear optimizer such as scaled conjugate gradient (SCG) (Møller 1993) can be used to learn the latent variables. The gradients of (3) with respect to the parameters of the kernel function in (2) can also be computed and used to jointly optimize X and the kernel parameters \theta.
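As an illustration of how (2)–(4) are evaluated in practice, the following is a minimal NumPy sketch; the function and variable names are our own choices rather than code from the paper, and the per-coordinate gradients \partial K / \partial x_{ij} and the kernel-parameter gradients that SCG would also need are omitted.

```python
import numpy as np


def kernel(X, theta):
    """Kernel matrix K from Eq. (2): RBF term + bias + white noise.

    X: (N, q) array of latent positions; theta = (theta1, theta2, theta3, theta4).
    """
    t1, t2, t3, t4 = theta
    sq = np.sum(X ** 2, axis=1)
    # Pairwise squared distances ||x_i - x_j||^2 (clipped to avoid tiny negatives).
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return t1 * np.exp(-0.5 * t2 * d2) + t3 + np.eye(X.shape[0]) / t4


def gplvm_nll(X, Y, theta):
    """Objective L_r of Eq. (3), i.e., the negative log of the likelihood (1)."""
    N, d = Y.shape
    K = kernel(X, theta)
    _, logdetK = np.linalg.slogdet(K)
    KinvY = np.linalg.solve(K, Y)          # K^{-1} Y
    return (0.5 * d * logdetK
            + 0.5 * np.trace(Y.T @ KinvY)
            + 0.5 * N * d * np.log(2.0 * np.pi))


def dLr_dK(X, Y, theta):
    """Gradient of L_r w.r.t. K, Eq. (4); chained with dK/dx_ij for the optimizer."""
    d = Y.shape[1]
    K = kernel(X, theta)
    Kinv = np.linalg.inv(K)
    A = Kinv @ Y                           # K^{-1} Y
    return 0.5 * d * Kinv - 0.5 * A @ A.T
```

These quantities are exactly what a gradient-based optimizer such as SCG consumes when learning X and \theta jointly.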
Our Model

In this section, we introduce in detail our proposed model, GPLRF, including its formulation, learning algorithm and out-of-sample prediction. GPLRF is a supervised extension of GPLVM based on a GMRF (Rue and Held 2005).

Gaussian Process Latent Random Field

The basic idea of GPLRF is to enforce the latent variables X to be a GMRF (Rue and Held 2005) w.r.t. a graph constructed from the supervisory information. Based on the constructed GMRF, we obtain a prior for X and then apply maximum a posteriori (MAP) estimation to learn X.

GMRF Construction. We define an undirected graph G = (V, E), where the node set is V = \{V_1, V_2, \ldots, V_N\}, with each node V_i corresponding to a training example y_i (or x_i), and the edge set is E = \{(V_i, V_j) \mid i \neq j,\ y_i \text{ and } y_j \text{ belong to the same class}\}. If we associate each edge with a weight of 1, we obtain a weight matrix W with entries defined as follows:

W_{ij} = \begin{cases} 1 & \text{if } y_i \text{ and } y_j,\ i \neq j, \text{ belong to the same class,} \\ 0 & \text{otherwise.} \end{cases}

Hence, for any i \neq j, W_{ij} = 1 if and only if there exists an edge between nodes V_i and V_j.

We can see that the graph G is constructed from the label (supervisory) information. If we associate each node with a random variable, we obtain a Markov random field (MRF) (Rue and Held 2005) w.r.t. the graph G. Here, we associate the constructed graph G with the random vector X_{*k} = (X_{1k}, X_{2k}, \ldots, X_{Nk})^T (for any k = 1, 2, \ldots, q). Based on the weight matrix W, we can compute the graph Laplacian matrix (Chung 1997) L = D - W, where D is a diagonal matrix with D_{ii} = \sum_j W_{ij}. With L, we define a prior distribution on the latent variables X as

p(X) = \prod_{k=1}^{q} p(X_{*k}),  with  p(X_{*k}) = \frac{1}{Z} \exp\left[-\frac{\alpha}{2}\, X_{*k}^T L X_{*k}\right],   (5)

where Z is a normalization constant and \alpha > 0 is a scaling parameter. Then we have the following result.

Theorem 1. X_{*k} is a Gaussian Markov random field (GMRF) w.r.t. the graph G.

Proof: According to the properties of GMRFs (Rue and Held 2005), it suffices to prove that the missing edges in G correspond to the zero entries in the precision matrix (i.e., the inverse covariance matrix). Because in (5) the precision matrix is L (up to the positive scaling factor \alpha), and for i \neq j the entry L_{ij} = -W_{ij} vanishes exactly when there is no edge between V_i and V_j, the missing edges in G indeed correspond to zero entries in the precision matrix.
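To make the construction concrete, here is a short continuation of the earlier NumPy sketch (again with our own naming, reusing gplvm_nll from above). It builds W and L = D − W directly from class labels and evaluates the negative log of the prior (5), which reduces to (α/2) tr(X^T L X) up to the constant log Z. The final function adds this penalty to L_r from (3); the paper's exact combined MAP formula lies outside this excerpt, so the additive form shown is our assumption about what MAP estimation minimizes.

```python
import numpy as np  # continues the sketch above (reuses gplvm_nll)


def class_laplacian(labels):
    """Graph Laplacian L = D - W of the label-induced graph:
    W_ij = 1 iff i != j and examples i and j share a class label."""
    labels = np.asarray(labels)
    W = (labels[:, None] == labels[None, :]).astype(float)
    np.fill_diagonal(W, 0.0)            # no self-edges
    D = np.diag(W.sum(axis=1))          # D_ii = sum_j W_ij
    return D - W


def gmrf_neg_log_prior(X, L, alpha):
    """-log p(X) from Eq. (5), dropping the constant q * log Z:
    (alpha/2) * sum_k X_{*k}^T L X_{*k} = (alpha/2) * tr(X^T L X)."""
    return 0.5 * alpha * np.trace(X.T @ L @ X)


def gplrf_map_objective(X, Y, theta, L, alpha):
    """Assumed MAP objective: GPLVM term L_r (Eq. 3) plus the GMRF penalty (Eq. 5)."""
    return gplvm_nll(X, Y, theta) + gmrf_neg_log_prior(X, L, alpha)
```

Since X_{*k}^T L X_{*k} = \frac{1}{2} \sum_{i,j} W_{ij} (X_{ik} - X_{jk})^2, the prior simply penalizes latent distances between same-class points, which is what pulls points of the same class together and makes the learned representation discriminative.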