Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-10)

Gaussian Process Latent Random Field

Guoqiang Zhong†, Wu-Jun Li‡, Dit-Yan Yeung‡, Xinwen Hou†, Cheng-Lin Liu†
† National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
‡ Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, China
gqzhong@nlpr.ia.ac.cn, {liwujun, dyyeung}@cse.ust.hk, {xwhou, liucl}@nlpr.ia.ac.cn

Abstract

The Gaussian process latent variable model (GPLVM) is an unsupervised probabilistic model for nonlinear dimensionality reduction. A supervised extension, called discriminative GPLVM (DGPLVM), incorporates supervisory information into GPLVM to enhance classification performance. However, its limitation of the latent space dimensionality to at most C − 1 (where C is the number of classes) leads to unsatisfactory performance when the intrinsic dimensionality of the application is higher than C − 1. In this paper, we propose a novel supervised extension of GPLVM, called Gaussian process latent random field (GPLRF), by enforcing the latent variables to be a Gaussian Markov random field with respect to a graph constructed from the supervisory information. In GPLRF, the dimensionality of the latent space is no longer restricted to at most C − 1. This makes GPLRF much more flexible than DGPLVM in applications. Experiments conducted on both synthetic and real-world data sets demonstrate that GPLRF performs comparably with DGPLVM and other state-of-the-art methods on data sets with intrinsic dimensionality at most C − 1, and dramatically outperforms DGPLVM on data sets whose intrinsic dimensionality exceeds C − 1.

Introduction

In many artificial intelligence applications, one often has to deal with high-dimensional data. Such data require dimensionality reduction to reveal the low-dimensional latent structure of the data so that the underlying tasks, such as visualization, classification and clustering, can benefit from it. Many dimensionality reduction methods have been proposed over the past few decades. Nevertheless, classical linear dimensionality reduction methods such as principal component analysis (PCA) (Joliffe 1986) and multidimensional scaling (MDS) (Cox and Cox 2001) remain popular choices due to their simplicity and efficiency. However, they fail to discover the nonlinear latent structure in more complex data sets. Starting from about a decade ago, a number of nonlinear manifold learning methods such as isometric feature mapping (Isomap) (Tenenbaum, Silva, and Langford 2000) and locally linear embedding (LLE) (Roweis and Saul 2000) have been proposed. They can discover the low-dimensional manifold structure of data embedded in a high-dimensional space. However, for relatively sparse or noisy data sets, these methods have been found not to perform well (Geiger, Urtasun, and Darrell 2009).

The Gaussian process latent variable model (GPLVM) (Lawrence 2005) is a fully probabilistic, nonlinear latent variable model based on Gaussian processes (Rasmussen and Williams 2006). GPLVM can learn a nonlinear mapping from the latent space to the observation space. It has achieved very promising performance in many real-world applications, especially in situations with only a small number of training examples, i.e., sparse data sets.
As pointed out by Lawrence and Quiñonero-Candela (2006), GPLVM can preserve the dissimilarity between points, which means that points will be far apart in the latent space if they are far apart in the observation space. However, GPLVM cannot preserve the similarity between points, i.e., points that are close in the observation space are not necessarily close in the latent space. Typically, points that are close in the observation space are expected to belong to the same class. Consequently, in GPLVM there is no guarantee that data points from the same class are close in the latent space, which makes the learned latent representation not necessarily good for discriminative applications.

In many applications, we often have some supervisory information, such as class labels, even though that information may be rather limited. However, GPLVM is unsupervised in nature. If we can explicitly incorporate the supervisory information into the learning procedure of GPLVM to make points from the same class close in the latent space, we can obtain a more discriminative latent representation. To the best of our knowledge, only one work, called discriminative GPLVM (DGPLVM) (Urtasun and Darrell 2007), has integrated supervisory information into the GPLVM framework. However, since DGPLVM is based on the linear discriminant analysis (LDA) (Fukunnaga 1991) or generalized discriminant analysis (GDA) (Baudat and Anouar 2000) criterion, the dimensionality of the learned latent space in DGPLVM is restricted to at most C − 1, where C is the number of classes. For applications with intrinsic dimensionality equal to or higher than C, DGPLVM might not be able to deliver satisfactory performance. This will be verified by the experiments reported later in this paper.
In this paper, we propose a novel supervised GPLVM method, called Gaussian process latent random field (GPLRF), by enforcing the latent variables to be a Gaussian Markov random field (GMRF) (Rue and Held 2005) with respect to (w.r.t.) a graph constructed from the supervisory information. Some promising properties of GPLRF are highlighted below:

• GPLRF is nonlinear and hence can deal with complex data sets which linear methods, such as PCA and MDS, cannot handle.
• Compared with GPLVM, GPLRF can learn a more discriminative latent representation in which data points of the same class are clustered together and those of different classes form different clusters.
• The dimensionality of the latent space is no longer restricted to at most C − 1, making GPLRF more flexible than DGPLVM in applications.
• GPLRF achieves performance at least comparable with DGPLVM and other state-of-the-art methods on data sets with intrinsic dimensionality at most C − 1, and dramatically outperforms DGPLVM on data sets whose intrinsic dimensionality exceeds C − 1.

Gaussian Process Latent Variable Model

Given a set of N training examples represented as a matrix Y = [y_1, y_2, ..., y_N]^T, where y_i \in \mathbb{R}^d, let the matrix X = [x_1, x_2, ..., x_N]^T, where x_i \in \mathbb{R}^q with q < d, denote the corresponding latent representations. GPLVM places a Gaussian process prior on the mapping from the latent space to the observation space; marginalizing out the mapping gives the likelihood

    p(Y | X) = \prod_{j=1}^{d} \mathcal{N}(Y_{*j} | 0, K),    (1)

where Y_{*j} denotes the jth column of Y and K is the N × N kernel matrix over the latent points,

    K_{ij} = k(x_i, x_j),    (2)

defined by a kernel function k(·, ·) (typically an RBF kernel) with parameters θ. The latent variables X can be learned by minimizing the negative log-likelihood

    L_r = \frac{d}{2} \ln|K| + \frac{1}{2} \mathrm{tr}(K^{-1} Y Y^T) + \frac{Nd}{2} \ln(2\pi).    (3)

The gradients of (3) w.r.t. the latent variables can be computed by combining

    \frac{\partial L_r}{\partial K} = \frac{d}{2} K^{-1} - \frac{1}{2} K^{-1} Y Y^T K^{-1}    (4)

with \partial K / \partial x_{ij} via the chain rule, where x_{ij} is the jth dimension of x_i. Based on (3) and (4), a nonlinear optimizer such as scaled conjugate gradient (SCG) (Møler 1993) can be used to learn the latent variables. The gradients of (3) with respect to the parameters of the kernel function in (2) can also be computed and used to jointly optimize X and the kernel parameters θ.

Our Model

In this section, we introduce in detail our proposed model GPLRF, including its formulation, learning algorithm and out-of-sample prediction. GPLRF is a supervised extension of GPLVM based on a GMRF (Rue and Held 2005).

Gaussian Process Latent Random Field

The basic idea of GPLRF is to enforce the latent variables X to be a GMRF (Rue and Held 2005) w.r.t. a graph constructed from the supervisory information. Based on the constructed GMRF, we get a prior for X and then apply maximum a posteriori (MAP) estimation to learn X.

GMRF Construction  We define an undirected graph G = (V, E) with node set V = {V_1, V_2, ..., V_N}, where V_i corresponds to a training example y_i (or x_i), and edge set E = {(V_i, V_j) | i ≠ j, y_i and y_j belong to the same class}. If we associate each edge with a weight 1, we get a weight matrix W whose entries are W_{ij} = 1 if i ≠ j and y_i, y_j belong to the same class, and W_{ij} = 0 otherwise. Let D be the diagonal degree matrix with D_{ii} = \sum_j W_{ij}, and let L = D − W be the graph Laplacian (Chung 1997). For each dimension k of the latent space, we place on the column X_{*k} the prior

    p(X_{*k}) = \frac{1}{Z} \exp\left(-\frac{\alpha}{2} X_{*k}^T L X_{*k}\right),    (5)

where Z is a normalization constant and α > 0 is a scaling parameter. Then we have

Theorem 1  X_{*k} is a Gaussian Markov random field (GMRF) w.r.t. the graph G.

Proof: According to the property of GMRF (Rue and Held 2005), it suffices to prove that the missing edges in G correspond to the zero entries in the precision matrix (i.e., inverse covariance matrix). Because in (5) the precision matrix is L, it is easy to check that L_{ij} = 0 if and only if W_{ij} = 0 for all i ≠ j. ∎
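To make the construction above concrete, the following is a minimal sketch (NumPy assumed; helper names such as build_laplacian and the toy labels are illustrative choices, not from the paper) that forms W, D and L from class labels, evaluates the log of the prior (5) up to its normalizing constant, and numerically checks the pairwise-distance identity derived in (7) below.

```python
import numpy as np

def build_laplacian(labels):
    """Class-label graph Laplacian L = D - W used in the GMRF prior.

    W[i, j] = 1 if examples i and j share a class label (i != j), else 0.
    """
    labels = np.asarray(labels)
    W = (labels[:, None] == labels[None, :]).astype(float)
    np.fill_diagonal(W, 0.0)          # no self-edges
    D = np.diag(W.sum(axis=1))        # degree matrix
    return D - W, W

def log_gmrf_prior(X, L, alpha):
    """Log of the prior in (5) over all latent dimensions, up to the constant -ln Z."""
    return -0.5 * alpha * np.trace(X.T @ L @ X)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = np.array([0, 0, 1, 1, 1, 2])
    X = rng.normal(size=(6, 2))       # toy 2-D latent coordinates
    L, W = build_laplacian(labels)

    # Identity used in (7): tr(X^T L X) = 1/2 * sum_ij W_ij ||x_i - x_j||^2
    lhs = np.trace(X.T @ L @ X)
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    rhs = 0.5 * (W * D2).sum()
    print(np.allclose(lhs, rhs))      # True
    print(log_gmrf_prior(X, L, alpha=1.0))
```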
Hence, we can say that p(X_{*k}) is a meaningful prior for X_{*k}, because its underlying graph G effectively reflects the conditional independence relationships among the random variables. More specifically, from the perspective of the definition of W, W_{ij} = 0 means that point i and point j are from different classes. From the perspective of the semantics implied by the MRF, W_{ij} = 0 means that X_{ik} is independent of X_{jk} given the other variables. Because it is reasonable to make the latent representations of two points from different classes independent, these two perspectives are coherent.

Furthermore, if we compute

    p(X) = \prod_{k=1}^{q} p(X_{*k}) = \frac{1}{Z^q} \exp\left[-\frac{\alpha}{2} \mathrm{tr}(X^T L X)\right],    (6)

we can also understand the effectiveness of the prior in (6). To see this, let us rewrite the term tr(X^T L X) as follows:

    \mathrm{tr}(X^T L X) = \frac{1}{2} \sum_{k=1}^{q} \sum_{i=1}^{N} \sum_{j=1}^{N} W_{ij} (X_{ik} - X_{jk})^2
                         = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} W_{ij} \sum_{k=1}^{q} (X_{ik} - X_{jk})^2
                         = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} W_{ij} \|x_i - x_j\|^2,    (7)

where \|x_i − x_j\| is the Euclidean distance between x_i and x_j. We can see that tr(X^T L X) reflects the sum of the squared distances between points from the same class. The closer the points from the same class are, the smaller the term tr(X^T L X) is, and consequently the higher p(X) is. Hence, a latent representation that places data points from the same class closer together is given higher probability. This is exactly what we need to learn: a discriminative latent space.

Model Formulation of GPLRF  With the prior in (6) and the likelihood in (1), we can obtain the posterior distribution

    p(X | Y) ∝ p(Y | X) p(X).    (8)

If we use the MAP strategy to estimate X, the supervisory information in p(X) is seamlessly integrated into the model. We call this model Gaussian process latent random field (GPLRF) because p(Y | X) corresponds to Gaussian processes and the prior p(X) for the latent variables is a Gaussian Markov random field w.r.t. a graph constructed from the supervisory information.

The MAP estimate of X can then be obtained by minimizing the following objective function:

    L_s = \frac{d}{2} \ln|K| + \frac{1}{2} \mathrm{tr}(K^{-1} Y Y^T) + \frac{\alpha}{2} \mathrm{tr}(X^T L X).    (9)

To minimize L_s, we first compute the gradient of (9),

    \frac{\partial L_s}{\partial X} = \frac{\partial L_r}{\partial X} + \alpha L X,

and then apply the SCG method (Møler 1993). This procedure is similar to that for GPLVM.
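The objective (9) and its gradient are straightforward to implement. The sketch below is not the authors' code: it assumes NumPy/SciPy, uses an RBF-plus-noise kernel as a stand-in for the kernel in (2) with its hyperparameters held fixed, initializes X with PCA as in the experiments, and uses SciPy's conjugate-gradient optimizer in place of SCG.

```python
import numpy as np
from scipy.optimize import minimize

def rbf_kernel(X, sf2=1.0, ell=1.0, sn2=1e-2):
    """RBF kernel matrix over latent points, plus a small noise term on the diagonal."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    Krbf = sf2 * np.exp(-0.5 * sq / ell ** 2)
    return Krbf, Krbf + sn2 * np.eye(len(X))

def gplrf_objective_and_grad(x_flat, Y, L, alpha, q, ell=1.0):
    """Objective (9) and its gradient w.r.t. X (kernel hyperparameters held fixed)."""
    N, d = Y.shape
    X = x_flat.reshape(N, q)
    Krbf, K = rbf_kernel(X, ell=ell)
    Kinv = np.linalg.inv(K)
    _, logdet = np.linalg.slogdet(K)
    # L_s = d/2 ln|K| + 1/2 tr(K^{-1} Y Y^T) + alpha/2 tr(X^T L X)
    Ls = 0.5 * d * logdet + 0.5 * np.trace(Kinv @ Y @ Y.T) \
         + 0.5 * alpha * np.trace(X.T @ L @ X)
    # dL_r/dK from (4), then the chain rule through the RBF kernel
    G = 0.5 * d * Kinv - 0.5 * Kinv @ Y @ Y.T @ Kinv
    A = G * Krbf
    grad_X = (-2.0 / ell ** 2) * (np.diag(A.sum(axis=1)) - A) @ X + alpha * L @ X
    return Ls, grad_X.ravel()

def fit_gplrf(Y, labels, q=2, alpha=1.0):
    """MAP estimate of the latent X; CG is used here as a simple stand-in for SCG."""
    lab = np.asarray(labels)
    W = (lab[:, None] == lab[None, :]).astype(float)
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(axis=1)) - W                  # graph Laplacian from labels
    N = Y.shape[0]
    Yc = Y - Y.mean(axis=0)                         # PCA initialization
    _, _, Vt = np.linalg.svd(Yc, full_matrices=False)
    x0 = (Yc @ Vt[:q].T).ravel()
    res = minimize(gplrf_objective_and_grad, x0, args=(Y, L, alpha, q),
                   jac=True, method="CG")
    return res.x.reshape(N, q)
```

Joint optimization of the kernel parameters θ, as described above, would simply add the corresponding gradient terms to this sketch.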
Discussions  The objective function in (9) can be interpreted as a regularized version of GPLVM, where the regularizer is a nonlinear variant of the supervised locality preserving projection (SLPP) model (Zheng et al. 2007). When α → +∞, (9) has a closed-form solution and is equivalent to a variant of the supervised kernel locality preserving projection (SKLPP) model (Zheng et al. 2007) or a variant of the Laplacian eigenmap (Belkin and Niyogi 2001).

Although the objective function in (9) is only a simple combination of two components, this combination is very effective because the two components are intrinsically complementary to each other.

Although the Laplacian matrix L in this paper is constructed from the supervisory information, it may also be constructed based on the similarity between the observed data Y (often called a similarity graph), which is similar to the Laplacian matrix used in the Laplacian eigenmap. Another possible extension is to combine both supervisory information and a similarity graph to obtain a semi-supervised version of GPLRF. It is worth noting that all these extensions can be integrated into the GPLRF framework easily. Due to space limitations, these extensions are left for future work.

Out-of-Sample Prediction

For a new test point y_t, we exploit the back-constrained strategy of (Lawrence and Quiñonero-Candela 2006) and (Urtasun and Darrell 2007) to estimate its low-dimensional representation x_t. More specifically, we minimize (9) under the constraints

    x_{tj} = g_j(y_t),

where x_{tj} is the jth dimension of x_t and g_j is a function of the input points. In particular, to get a smooth inverse mapping, we use an RBF kernel for the back-constraint in each dimension, g_j(y_t) = \sum_{m=1}^{N} \beta_{jm} k(y_t, y_m), where {\beta_{jm}} (m = 1, 2, ..., N; j = 1, 2, ..., q) are the parameters to learn and the kernel function is k(y_n, y_m) = \exp(-\frac{\gamma}{2} \|y_n - y_m\|^2), with γ being the inverse width parameter. The latent position of a test point can then be computed directly by evaluating the inverse mapping learned through the back-constraint at the test point: x_{tj} = g_j(y_t).
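A minimal sketch of the back-constrained mapping follows (NumPy assumed; the function names are ours). In the paper the coefficients β_jm are learned by minimizing (9) under the constraints above; as a simplified stand-in, this sketch fits them by regularized least squares to latent coordinates X that have already been learned, and then evaluates the inverse mapping at new points.

```python
import numpy as np

def rbf_data_kernel(Ya, Yb, gamma):
    """Back-constraint kernel k(y_n, y_m) = exp(-gamma/2 * ||y_n - y_m||^2)."""
    sq = ((Ya[:, None, :] - Yb[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * gamma * sq)

def fit_back_constraint(Y, X, gamma=0.01, ridge=1e-6):
    """Fit beta so that X ~= K_y @ beta (simplified least-squares stand-in
    for the joint optimization used in the paper)."""
    Ky = rbf_data_kernel(Y, Y, gamma)
    beta = np.linalg.solve(Ky + ridge * np.eye(len(Y)), X)   # shape (N, q)
    return beta

def predict_latent(Y_train, beta, Y_test, gamma=0.01):
    """x_t = g(y_t): evaluate the learned inverse mapping at new points."""
    Kt = rbf_data_kernel(Y_test, Y_train, gamma)             # shape (M, N)
    return Kt @ beta                                         # shape (M, q)
```

The predicted coordinates can then be fed to a 1NN classifier, as in the experiments below.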
Experiments

To demonstrate the effectiveness of GPLRF, we compare it with some related methods on both synthetic and real-world data sets. The baseline methods include: the 1-nearest neighbor (1NN) classifier in the original space, PCA, LDA, LPP (He and Niyogi 2003), SLPP (Zheng et al. 2007) and their kernel counterparts, GPLVM, DGPLVM (Urtasun and Darrell 2007), LL-GPLVM (Urtasun et al. 2008), rankGPLVM (Geiger, Urtasun, and Darrell 2009), and tSNEGPLVM (van der Maaten 2009). Among them, PCA and LPP are unsupervised, LDA and SLPP are supervised, LPP and SLPP aim to preserve similarity between points, PCA, LDA, LPP and SLPP are linear dimensionality reduction methods, and their kernel counterparts and all GPLVM-based methods are nonlinear methods.

We use both visualization and classification settings to evaluate all these methods. For classification, we first use each method to perform dimensionality reduction to obtain the lower-dimensional representation (or latent representation in the case of latent variable models); a 1NN classifier is then employed to predict the labels of the test data in the latent space. For all methods based on GPLVM, testing is repeated with different parameter values: α ∈ {10^{-5}, 10^{-4}, ..., 10^{4}, 10^{5}} and γ ∈ {0.001, 0.01, 0.1}. The settings which yield the minimum mean errors over 20 random partitions are used. For the other methods compared, we also choose the best parameter settings over 20 random partitions and report the best classification results.

Three data sets are used in our experiments: the USPS handwritten digit data set (http://www-stat-class.stanford.edu/~tibs/ElemStatLearn/data.html), the Oil data set (http://is6.cs.man.ac.uk/~neill/datasets/), and the CMU motion capture (CMU mocap) data set (http://mocap.cs.cmu.edu/). The USPS data set contains 9298 handwritten digits from 10 classes. The size of each digit image is 16 × 16 pixels (so the dimensionality of the data space is 256). In our experiments, we use normalized digit images with pixel values in [0, 1]. For all the dimensionality reduction methods, we first project the data onto a linear subspace by performing PCA on the training data with 99% of the variance retained and then apply each method for dimensionality reduction. Other than PCA, no other preprocessing is applied. The Oil data set is a synthetic data set. It contains 1000 examples grouped into three classes, with dimensionality 12. Nevertheless, its intrinsic dimensionality is only two (http://www.ncrg.aston.ac.uk/GTM/3PhaseData.html). The CMU mocap data set includes three categories, namely jumping, running and walking. We choose 49 video sequences from four subjects. For each sequence, the features are generated using Lawrence's method (http://is6.cs.man.ac.uk/~neill/mocap/), with dimensionality 93.
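For reference, the shared evaluation pipeline described above (PCA retaining 99% of the variance on the training split, a dimensionality reduction step, then 1NN classification in the latent space) can be sketched as follows; scikit-learn and the helper name evaluate_embedding are our choices, not the paper's.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

def evaluate_embedding(Y_train, y_train, Y_test, y_test, reduce_fn):
    """PCA to 99% variance, then a reduction method, then 1NN in the latent space.

    reduce_fn(Z_train, y_train, Z_test) -> (X_train, X_test) stands for any of the
    compared methods (e.g., the GPLRF + back-constraint sketches above).
    """
    pca = PCA(n_components=0.99).fit(Y_train)       # keep 99% of the variance
    Z_train, Z_test = pca.transform(Y_train), pca.transform(Y_test)

    X_train, X_test = reduce_fn(Z_train, y_train, Z_test)

    knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
    return 1.0 - knn.score(X_test, y_test)          # classification error
```

In the paper, this error is averaged over 20 random train/test partitions for each parameter setting.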
Visualization

We compare our model with GPLVM and DGPLVM on visualization of the USPS data set in a 2-dimensional (2D) latent space. The training and test sets are randomly selected from the whole data set, with 50 examples per class for training and the rest for testing. For all three methods, PCA is used for initialization. Figure 1 shows the learned 2D latent spaces of GPLVM, DGPLVM and GPLRF.

Figure 1: 2D latent spaces learned by GPLVM, DGPLVM and GPLRF on the USPS data set. Panels (a)-(c) show the training data and panels (d)-(f) the test data for GPLVM, DGPLVM and GPLRF, respectively.

As we can see, for the training data, GPLRF learns a 2D latent space in which the 50 data points of each class are mapped to almost the same point, whereas the latent representations learned by GPLVM and DGPLVM are more scattered, with significant overlap between classes. For the test data, the regions for different digit classes in the latent space can be distinguished more easily with GPLRF, but there is more significant overlap when GPLVM or DGPLVM is used. Hence, we can say that the latent space learned by GPLRF is more discriminative than those of GPLVM and DGPLVM, which conforms to the theoretical analysis of GPLRF.

Effect of Dimensionality

In this experiment, we assess the performance of GPLRF in latent spaces of different dimensionality. Since DGPLVM is derived from the LDA or GDA criterion, the dimensionality of the DGPLVM latent space is at most C − 1, where C is the number of classes. However, GPLRF has no such restriction and can learn latent spaces of dimensionality higher than C − 1. Figure 2(a) and Figure 2(b) show the classification errors on the Oil data set and the CMU mocap data set, respectively. On the Oil data set, the best performance of DGPLVM is achieved when the dimensionality of the latent space is equal to C − 1.
Although GPLRF can learn a latent space of higher dimensionality, the performance cannot be further improved by more than 1%. Besides, we can see that DGPLVM and GPLRF achieve comparable performance when the dimensionality of the latent space is set to C − 1 (= 2), which is the true dimensionality of the data. However, for the CMU mocap data set, the best performance of GPLRF is achieved when the dimensionality is higher than C − 1, which means that the intrinsic dimensionality of this data set might be higher than C − 1. In this case, DGPLVM cannot achieve satisfactory results due to the dimensionality restriction. Hence, by comparing the results in Figure 2(a) and Figure 2(b), we can conclude that GPLRF can model more complex data sets than DGPLVM.

Figure 2: Mean errors obtained by DGPLVM and GPLRF in latent spaces of different dimensionality on (a) the Oil data set and (b) the CMU mocap data set.

Classification on the USPS Data Set

In this experiment, we compare our model with 12 other related methods on the USPS data set. For all methods based on GPLVM, we use PCA for initialization. For comparison, the dimensionality of the latent space is set to 9. Mean classification errors and their standard deviations are shown in Table 1, where the best results are shown in bold. We find that GPLRF achieves the best performance consistently. Although GDA performs well and obtains mean errors close to those of GPLRF, paired t-tests still support that, in most cases, GPLRF is significantly better than GDA.

Classification on the Oil Data Set

In this experiment, all the GPLVM-based methods are initialized with SLPP. The dimensionality of the latent space is set to 2. Mean classification errors and their standard deviations are shown in Figure 3. Paired t-tests show that, in most cases (with p-value less than 0.1), GPLRF is significantly better than the other methods. Even in cases where GPLRF is not the best, there is no significant difference between GPLRF and the best method. From Figure 2(a), we can see that this data set might be relatively simple. Hence, DGPLVM can achieve performance comparable with GPLRF under some settings on this data set.

Figure 3: Mean classification errors and standard deviations for nine methods on the Oil data set. The bars in each group represent the classification results in the original space and in the latent spaces learned by SLPP, SKLPP, GPLVM, LL-GPLVM, rankGPLVM, tSNEGPLVM, DGPLVM and GPLRF, respectively.

Classification on the CMU Mocap Data Set

To evaluate the performance of GPLRF when the input dimensionality is higher than the number of training examples, we carry out an experiment on the CMU mocap data set. We use PCA to initialize the latent variables of both DGPLVM and GPLRF. Based on Figure 2(b), we set the dimensionality of DGPLVM to 2 and that of GPLRF to 5. The mean classification errors and standard deviations are shown in Figure 4. We can see that GPLRF consistently outperforms DGPLVM and 1NN in the original space. Paired t-tests also support this conclusion.

Figure 4: Mean classification errors and standard deviations for three methods on the CMU mocap data set.
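The significance statements in the experiments above rely on paired t-tests over the per-partition errors of two methods evaluated on the same 20 random splits. A minimal sketch of such a test, using hypothetical error values purely for illustration, is given below (SciPy assumed).

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical per-split test errors for two methods over the same 20 random partitions.
errors_gplrf = np.array([0.17, 0.18, 0.16, 0.19, 0.18, 0.17, 0.18, 0.16, 0.17, 0.19,
                         0.18, 0.17, 0.16, 0.18, 0.17, 0.19, 0.18, 0.17, 0.16, 0.18])
errors_other = np.array([0.19, 0.20, 0.18, 0.21, 0.19, 0.20, 0.19, 0.18, 0.20, 0.21,
                         0.19, 0.20, 0.18, 0.19, 0.20, 0.21, 0.19, 0.20, 0.18, 0.19])

t_stat, p_value = ttest_rel(errors_gplrf, errors_other)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")   # small p: the paired difference is significant
```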
Conclusion

In this paper, we have proposed a novel supervised extension of GPLVM, called GPLRF, for nonlinear dimensionality reduction. GPLRF can learn a discriminative latent representation that is beneficial for classification. Extensive experiments on synthetic data and diverse applications demonstrate that GPLRF achieves performance comparable to other state-of-the-art methods, both linear and nonlinear, on simple data sets, and dramatically outperforms the other methods on complex data sets.
Table 1: Mean classification errors and standard deviations obtained by different methods on the USPS data set. The first row, 'No transformation', presents the 1NN classification results in the original space. The first four methods below 'No transformation' are linear methods and the rest are nonlinear methods. The blank entry for rankGPLVM is due to running out of memory (2GB).

Method            | 10train         | 20train         | 30train         | 40train         | 50train
No transformation | 0.1976 ± 0.0161 | 0.1463 ± 0.0069 | 0.1260 ± 0.0064 | 0.1107 ± 0.0056 | 0.0986 ± 0.0047
PCA               | 0.2497 ± 0.0154 | 0.2090 ± 0.0133 | 0.1849 ± 0.0122 | 0.1757 ± 0.0136 | 0.1631 ± 0.0054
LDA               | 0.2636 ± 0.0275 | 0.1937 ± 0.0103 | 0.1657 ± 0.0068 | 0.1520 ± 0.0093 | 0.1392 ± 0.0055
LPP               | 0.3433 ± 0.0356 | 0.2761 ± 0.0284 | 0.2470 ± 0.0154 | 0.2150 ± 0.0131 | 0.1929 ± 0.0125
SLPP              | 0.2448 ± 0.0204 | 0.2070 ± 0.0095 | 0.1792 ± 0.0067 | 0.1639 ± 0.0097 | 0.1498 ± 0.0063
KPCA              | 0.3515 ± 0.0260 | 0.2764 ± 0.0294 | 0.2490 ± 0.0179 | 0.2422 ± 0.0154 | 0.2204 ± 0.0115
GDA               | 0.1827 ± 0.0240 | 0.1455 ± 0.0205 | 0.0971 ± 0.0044 | 0.0921 ± 0.0070 | 0.0729 ± 0.0053
KLPP              | 0.5087 ± 0.0490 | 0.5024 ± 0.0414 | 0.4751 ± 0.0316 | 0.4645 ± 0.0257 | 0.4517 ± 0.0206
SKLPP             | 0.2155 ± 0.0402 | 0.1702 ± 0.0453 | 0.1450 ± 0.0440 | 0.1345 ± 0.0493 | 0.1059 ± 0.0374
GPLVM             | 0.2554 ± 0.0191 | 0.1940 ± 0.0090 | 0.1695 ± 0.0155 | 0.1422 ± 0.0070 | 0.1311 ± 0.0079
LL-GPLVM          | 0.2533 ± 0.0200 | 0.1908 ± 0.0158 | 0.1637 ± 0.0152 | 0.1393 ± 0.0094 | 0.1278 ± 0.0088
rankGPLVM         | 0.1930 ± 0.0146 | 0.1427 ± 0.0075 | 0.1186 ± 0.0067 | 0.1084 ± 0.0529 | –
tSNEGPLVM         | 0.2399 ± 0.0138 | 0.1932 ± 0.0096 | 0.1673 ± 0.0157 | 0.1418 ± 0.0072 | 0.1303 ± 0.0084
DGPLVM            | 0.2491 ± 0.0230 | 0.1922 ± 0.0102 | 0.1692 ± 0.0155 | 0.1442 ± 0.0122 | 0.1303 ± 0.0081
GPLRF             | 0.1769 ± 0.0165 | 0.1129 ± 0.0113 | 0.0946 ± 0.0100 | 0.0809 ± 0.0085 | 0.0690 ± 0.0042

Acknowledgments

This research is partially supported by the NSFC Outstanding Youth Fund (No. 60825301) and by General Research Fund 621407 from the Research Grants Council of the Hong Kong Special Administrative Region, China.

References

Baudat, G., and Anouar, F. 2000. Generalized discriminant analysis using a kernel approach. Neural Computation 12:2385–2404.
Belkin, M., and Niyogi, P. 2001. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems, 585–591.
Chung, F. 1997. Spectral Graph Theory. Number 92 in Regional Conference Series in Mathematics. American Mathematical Society.
Cox, T., and Cox, M. 2001. Multidimensional Scaling. Chapman and Hall, Boca Raton.
Fukunnaga, K. 1991. Introduction to Statistical Pattern Recognition, second edition. Academic Press.
Geiger, A.; Urtasun, R.; and Darrell, T. 2009. Rank priors for continuous nonlinear dimensionality reduction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 880–887.
He, X., and Niyogi, P. 2003. Locality preserving projections. In Advances in Neural Information Processing Systems.
Joliffe, I. 1986. Principal Component Analysis. Springer-Verlag.
Lawrence, N., and Quiñonero-Candela, J. 2006. Local distance preservation in the GP-LVM through back constraints. In Proceedings of the International Conference on Machine Learning, 96–103.
Lawrence, N. 2005. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research 6:1783–1816.
Møler, F. 1993. A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks 6(4):525–533.
Rasmussen, C. E., and Williams, C. 2006. Gaussian Processes for Machine Learning. MIT Press.
Roweis, S., and Saul, L. 2000. Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500):2323–2326.
Rue, H., and Held, L. 2005. Gaussian Markov Random Fields: Theory and Applications, volume 104 of Monographs on Statistics and Applied Probability. London: Chapman & Hall.
Tenenbaum, J.; Silva, V.; and Langford, J. 2000. A global geometric framework for nonlinear dimensionality reduction. Science 290(5500):2319–2323.
Urtasun, R., and Darrell, T. 2007. Discriminative Gaussian process latent variable models for classification. In Proceedings of the International Conference on Machine Learning, 927–934.
Urtasun, R.; Fleet, D. J.; Geiger, A.; Popovic, J.; Darrell, T.; and Lawrence, N. 2008. Topologically-constrained latent variable models. In Proceedings of the International Conference on Machine Learning, 1080–1087.
van der Maaten, L. 2009. Preserving local structure in Gaussian process latent variable models. In Proceedings of the 18th Annual Belgian-Dutch Conference on Machine Learning, 81–88.
Zheng, Z.; Yang, F.; Tan, W.; Jia, J.; and Yang, J. 2007. Gabor feature-based face recognition using supervised locality preserving projection. Signal Processing 87(10):2473–2483.