Least Squares Linear Discriminant Analysis

Jieping Ye  (jieping.ye@asu.edu)
Department of Computer Science and Engineering, Arizona State University, Tempe, AZ 85287 USA

Appearing in Proceedings of the 24th International Conference on Machine Learning, Corvallis, OR, 2007. Copyright 2007 by the author(s)/owner(s).

Abstract

Linear Discriminant Analysis (LDA) is a well-known method for dimensionality reduction and classification. LDA in the binary-class case has been shown to be equivalent to linear regression with the class label as the output. This implies that LDA for binary-class classifications can be formulated as a least squares problem. Previous studies have shown a certain relationship between multivariate linear regression and LDA for the multi-class case. Many of these studies show that multivariate linear regression with a specific class indicator matrix as the output can be applied as a preprocessing step for LDA. However, directly casting LDA as a least squares problem is challenging for the multi-class case. In this paper, a novel formulation for multivariate linear regression is proposed. The equivalence relationship between the proposed least squares formulation and LDA for multi-class classifications is rigorously established under a mild condition, which is shown empirically to hold in many applications involving high-dimensional data. Several LDA extensions based on the equivalence relationship are discussed.

1. Introduction

Linear Discriminant Analysis (LDA) is a well-known method for dimensionality reduction and classification that projects high-dimensional data onto a low-dimensional space where the data achieves maximum class separability (Duda et al., 2000; Fukunaga, 1990; Hastie et al., 2001). The derived features in LDA are linear combinations of the original features, where the coefficients are from the transformation matrix. The optimal projection or transformation in classical LDA is obtained by minimizing the within-class distance and maximizing the between-class distance simultaneously, thus achieving maximum class discrimination. It has been applied successfully in many applications (Belhumeour et al., 1997; Swets & Weng, 1996; Dudoit et al., 2002), including face recognition and microarray gene expression data analysis. The optimal transformation is readily computed by solving a generalized eigenvalue problem. The original LDA formulation, known as the Fisher Linear Discriminant Analysis (FLDA) (Fisher, 1936), deals with binary-class classifications. The key idea in FLDA is to look for a direction that separates the class means well (when projected onto that direction) while achieving a small variance around these means.

FLDA bears strong connections to linear regression with the class label as the output. It has been shown (Duda et al., 2000; Mika, 2002) that FLDA is equivalent to a least squares problem. Many real-world applications deal with multi-class classifications, and LDA is generally used to find a subspace with k − 1 dimensions for multi-class problems, where k is the number of classes in the training dataset (Fukunaga, 1990; Hastie et al., 2001). However, directly casting LDA as a least squares problem is challenging for multi-class problems (Duda et al., 2000; Hastie et al., 2001; Zhang & Riedel, 2005).

Multivariate linear regression with a specific class indicator matrix has been considered in (Hastie et al., 2001) for multi-class classifications. It follows the general framework of linear regression with multiple outputs.
As pointed out in (Pages 83–84, Hastie et al. (2001)), this approach has a serious problem when the number of classes k ≥ 3, especially prevalent when k is large. More specifically, classes can be masked by others due to the rigid nature of the regression model, which is not the case for LDA (Hastie et al., 2001). This multivariate linear regression model is in general different from LDA. However, there is a close connection between multivariate linear regression and LDA. More specifically, it can be shown (Hastie et al., 1994; Hastie et al., 2001) that LDA applied to the trans-
formed space by multivariate linear regression with a specific class indicator matrix as the output is identical to LDA applied to the original space. In this case, multivariate linear regression is applied as a preprocessing step for LDA. One natural question is whether LDA in the multi-class case can be directly formulated as a least squares problem.

In this paper, we propose a novel formulation for multivariate linear regression based on a new class indicator matrix. We establish the equivalence relationship between the proposed least squares formulation and LDA under a mild condition (see Section 5), which holds in many applications involving high-dimensional and undersampled data. We call the proposed LDA formulation Least Squares Linear Discriminant Analysis (or LS-LDA in short). We have conducted experiments using a collection of high-dimensional datasets from various data sources, including text documents, face images, and microarray gene expression data. Experimental results are consistent with the presented theoretical analysis.

2. Overview of Linear Discriminant Analysis

Given a dataset that consists of n samples {(x_i, y_i)}_{i=1}^n, where x_i ∈ IR^d and y_i ∈ {1, 2, ..., k} denotes the class label of the i-th sample, n is the sample size, d is the data dimensionality, and k is the number of classes. Let the data matrix X = [x_1, x_2, ..., x_n] be partitioned into k classes as X = [X_1, ..., X_k], where X_j ∈ IR^{d×n_j}, n_j is the size of the j-th class X_j, and \sum_{j=1}^{k} n_j = n. Classical LDA computes a linear transformation G ∈ IR^{d×ℓ} that maps x_i in the d-dimensional space to a vector x_i^L in the ℓ-dimensional space as follows: x_i ∈ IR^d → x_i^L = G^T x_i ∈ IR^ℓ (ℓ < d). In discriminant analysis (Fukunaga, 1990), three scatter matrices, called within-class, between-class and total scatter matrices, are defined as follows:

    S_w = \frac{1}{n} \sum_{j=1}^{k} \sum_{x \in X_j} (x − c^{(j)})(x − c^{(j)})^T,   (1)

    S_b = \frac{1}{n} \sum_{j=1}^{k} n_j (c^{(j)} − c)(c^{(j)} − c)^T,   (2)

    S_t = \frac{1}{n} \sum_{i=1}^{n} (x_i − c)(x_i − c)^T,   (3)

where c^{(j)} is the centroid of the j-th class, and c is the global centroid. It follows from the definition that S_t = S_b + S_w. Furthermore, trace(S_w) measures the within-class cohesion, while trace(S_b) measures the between-class separation. In the lower-dimensional space resulting from the linear transformation G, the scatter matrices become

    S_w^L = G^T S_w G,   S_b^L = G^T S_b G,   S_t^L = G^T S_t G.   (4)

An optimal transformation G would maximize trace(S_b^L) and minimize trace(S_w^L) simultaneously, which is equivalent to maximizing trace(S_b^L) and minimizing trace(S_t^L) simultaneously, since S_t^L = S_w^L + S_b^L. The optimal transformation, G^{LDA}, of LDA is computed by solving the following optimization problem (Duda et al., 2000; Fukunaga, 1990):

    G^{LDA} = \arg\max_G trace( S_b^L (S_t^L)^{-1} ).   (5)

The optimal G^{LDA} consists of the top eigenvectors of S_t^{-1} S_b corresponding to the nonzero eigenvalues (Fukunaga, 1990), provided that the total scatter matrix S_t is nonsingular. In the following discussion, we consider the more general case when S_t may be singular. G^{LDA} consists of the eigenvectors of S_t^+ S_b corresponding to the nonzero eigenvalues. Here S_t^+ denotes the pseudo-inverse of S_t (Golub & Van Loan, 1996). Note that when S_t is nonsingular, S_t^+ equals S_t^{-1}.

The above LDA formulation is an extension of the original Fisher Linear Discriminant Analysis (FLDA) (Fisher, 1936), which deals with binary-class problems, i.e., k = 2.
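To make the computation above concrete, the following minimal NumPy sketch (illustrative names only, not the implementation used in the paper) forms the scatter matrices of Eqs. (1)–(3) from a labeled data matrix and takes the eigenvectors of S_t^+ S_b with nonzero eigenvalues, as in the pseudo-inverse version of Eq. (5).

```python
import numpy as np

def lda_transform(X, y, tol=1e-10):
    """Classical LDA transformation (sketch).

    X: d x n data matrix (columns are samples), y: length-n label vector.
    Returns G whose columns are eigenvectors of pinv(S_t) @ S_b with
    nonzero eigenvalues, as in Eq. (5) with the pseudo-inverse.
    """
    d, n = X.shape
    c = X.mean(axis=1, keepdims=True)              # global centroid
    S_w = np.zeros((d, d))
    S_b = np.zeros((d, d))
    for j in np.unique(y):
        Xj = X[:, y == j]
        cj = Xj.mean(axis=1, keepdims=True)        # class centroid c^(j)
        S_w += (Xj - cj) @ (Xj - cj).T / n         # Eq. (1)
        S_b += Xj.shape[1] * (cj - c) @ (cj - c).T / n   # Eq. (2)
    S_t = S_b + S_w                                # equals Eq. (3)

    # Eigenvectors of S_t^+ S_b with nonzero eigenvalues.
    evals, evecs = np.linalg.eig(np.linalg.pinv(S_t) @ S_b)
    order = np.argsort(-evals.real)
    keep = (evals.real[order] > tol).sum()
    return evecs.real[:, order[:keep]]

# Example usage on random data with k = 3 classes.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 30))
y = np.repeat([0, 1, 2], 10)
G = lda_transform(X, y)        # at most k - 1 = 2 columns
X_low = G.T @ X                # reduced representation
```

The next paragraph returns to the binary-class case, where the transformation reduces to a single direction.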
The optimal transformation, G^F, of FLDA is of rank one and is given by (Duda et al., 2000)

    G^F = S_t^+ (c^{(1)} − c^{(2)}).   (6)

Note that G^F is invariant of scaling. That is, αG^F, for any α ≠ 0, is also a solution to FLDA.

3. Linear Regression versus Fisher LDA

Given a dataset of two classes, {(x_i, y_i)}_{i=1}^n, x_i ∈ IR^d and y_i ∈ {−1, 1}, the linear regression model with the class label as the output has the following form: f(x) = x^T w + b, where w ∈ IR^d is the weight vector, and b is the bias of the linear model. A popular approach for estimating w and b is least squares, in which the following objective function is minimized:

    L(w, b) = \frac{1}{2} ||X^T w + b e − y||^2 = \frac{1}{2} \sum_{i=1}^{n} ||f(x_i) − y_i||^2,   (7)

where X = [x_1, x_2, ..., x_n] is the data matrix, e is the vector of all ones, and y is the vector of class labels. Assume that both {x_i} and {y_i} have been centered, i.e., \sum_{i=1}^{n} x_i = 0 and \sum_{i=1}^{n} y_i = 0. It follows that y_i ∈ {−2n_2/n, 2n_1/n},
where n_1 and n_2 denote the number of samples from the negative and positive classes, respectively. The bias term b becomes zero and we look for a linear model f(x) = x^T w by minimizing

    L(w) = \frac{1}{2} ||X^T w − y||^2.   (8)

The optimal w is given by (Duda et al., 2000; Golub & Van Loan, 1996; Hastie et al., 2001) w = (X X^T)^+ X y. Note that X X^T = n S_t (the data matrix X has been centered) and X y = \frac{2 n_1 n_2}{n} (c^{(1)} − c^{(2)}). It follows that

    w = \frac{2 n_1 n_2}{n^2} S_t^+ (c^{(1)} − c^{(2)}) = \frac{2 n_1 n_2}{n^2} G^F,

where G^F is the optimal solution to FLDA in Eq. (6). Hence linear regression with the class label as the output is equivalent to Fisher LDA, as the projection in FLDA is invariant of scaling. More details on this equivalence relationship can be found in (Duda et al., 2000; Mika, 2002).

4. Multivariate Linear Regression with a Class Indicator Matrix

In the multi-class case, we are given a dataset that consists of n samples {(x_i, y_i)}_{i=1}^n, where x_i ∈ IR^d, and y_i ∈ {1, 2, ..., k} denotes the class label of the i-th sample, and k > 2. It is common to apply linear regression of a class membership indicator matrix Y ∈ IR^{n×k}, which applies a vector-valued class code for each of the samples (Hastie et al., 2001).

There are several well-known indicator matrices in the literature. Denote Y_1 = (Y_1(ij))_{ij} ∈ IR^{n×k} and Y_2 = (Y_2(ij))_{ij} ∈ IR^{n×k} as the class indicator matrices defined as follows: Y_1(ij) = 1 if y_i = j, and Y_1(ij) = 0 otherwise; and Y_2(ij) = 1 if y_i = j, and Y_2(ij) = −1/(k − 1) otherwise. The first indicator matrix Y_1 is commonly used in connecting multi-class classification with linear regression (Hastie et al., 2001), while the second indicator matrix has recently been used in extending Support Vector Machines (SVM) to multi-class classification (Lee et al., 2004) and in generalizing the kernel target alignment measure (Guermeur et al., 2004), originally proposed in (Cristianini et al., 2001).

In multivariate linear regression (MLR), a k-tuple of separating functions f(x) = (f_1(x), f_2(x), ..., f_k(x)), for any x ∈ IR^d, is considered. Denote X̃ = [x̃_1, ..., x̃_n] ∈ IR^{d×n} and Ỹ = (Ỹ_{ij}) ∈ IR^{n×k} as the centered data matrix X and the centered indicator matrix Y, respectively. That is, x̃_i = x_i − x̄ and Ỹ_{ij} = Y_{ij} − Ȳ_j, where x̄ = \frac{1}{n} \sum_{i=1}^{n} x_i and Ȳ_j = \frac{1}{n} \sum_{i=1}^{n} Y_{ij}. Then MLR determines the weight vectors, {w_j}_{j=1}^{k} ∈ IR^d, of the k linear models, f_j(x) = x^T w_j, for j = 1, ..., k, via the minimization of the following objective function:

    L(W) = \frac{1}{2} ||X̃^T W − Ỹ||_F^2 = \frac{1}{2} \sum_{j=1}^{k} \sum_{i=1}^{n} ||f_j(x̃_i) − Ỹ_{ij}||^2,   (9)

where W = [w_1, w_2, ..., w_k] is the weight matrix, and ||·||_F denotes the Frobenius norm of a matrix (Golub & Van Loan, 1996). The optimal W is given by (Hastie et al., 2001)

    W = (X̃ X̃^T)^+ X̃ Ỹ,   (10)

which is determined by X̃ and Ỹ.

Both Y_1 and Y_2 defined above, as well as the one in (Park & Park, 2005), could be used to define the centered indicator matrix Ỹ. An interesting connection between the linear regression model using Y_1 and LDA can be found in (Page 112, Hastie et al. (2001)). It can be shown that if X^L = W_1^T X̃ is the transformed data by W_1, where W_1 = (X̃ X̃^T)^+ X̃ Ỹ_1 is the least squares solution in Eq. (10) using the centered indicator matrix Ỹ_1, then LDA applied to X^L is identical to LDA applied to the original space. In this case, linear regression is applied as a preprocessing step before the classification, and is in general not equivalent to LDA.
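As a small illustration of this setup, the sketch below (hypothetical helper names; a dense pseudo-inverse, so small data only) builds the indicator matrices Y_1 and Y_2, centers X and Y, and computes the least squares solution of Eq. (10).

```python
import numpy as np

def indicator_Y1(y, k):
    """Y1(ij) = 1 if y_i = j, else 0 (classes labeled 0..k-1 here)."""
    Y = np.zeros((len(y), k))
    Y[np.arange(len(y)), y] = 1.0
    return Y

def indicator_Y2(y, k):
    """Y2(ij) = 1 if y_i = j, else -1/(k-1)."""
    Y = np.full((len(y), k), -1.0 / (k - 1))
    Y[np.arange(len(y)), y] = 1.0
    return Y

def mlr_weights(X, Y):
    """Optimal W of Eq. (10): W = (X~ X~^T)^+ X~ Y~, after centering."""
    Xc = X - X.mean(axis=1, keepdims=True)      # centered data (d x n)
    Yc = Y - Y.mean(axis=0, keepdims=True)      # centered indicator (n x k)
    return np.linalg.pinv(Xc @ Xc.T) @ Xc @ Yc, Xc

# Example: regression on Y1 as a preprocessing step before LDA.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 60))               # d = 50, n = 60
y = np.repeat([0, 1, 2], 20)
W1, Xc = mlr_weights(X, indicator_Y1(y, 3))
X_low = W1.T @ Xc                               # k-dimensional representation
```

The matrix Y_2 (or the indicator of Park & Park, 2005) can be plugged into mlr_weights in exactly the same way; as discussed next, none of these choices makes MLR equivalent to LDA in general.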
The second indicator matrix Y_2 has been used in SVM, and the resulting model using Y_2 is also not equivalent to LDA in general. This is also the case for the indicator matrix in (Park & Park, 2005). A natural question is whether there exists a class indicator matrix Ỹ ∈ IR^{n×k} with which the multivariate linear regression is equivalent to LDA. If this is the case, then LDA can be formulated as a least squares problem in the multi-class case.

Note that in multivariate linear regression, each x̃_i is transformed to (f_1(x̃_i), ..., f_k(x̃_i))^T = W^T x̃_i, and the centered data matrix X̃ ∈ IR^{d×n} is transformed to W^T X̃ ∈ IR^{k×n}, thus achieving dimensionality reduction if k < d. Note that W = (X̃ X̃^T)^+ X̃ Ỹ. A natural measure for evaluating Ỹ is the class discrimination used in LDA. We thus look for Ỹ which solves the following optimization problem (the pseudo-inverse is applied to deal with the singular scatter matrix):

    max_{Ỹ}  trace( (W^T S_b W)(W^T S_t W)^+ )
    subject to  W = (X̃ X̃^T)^+ X̃ Ỹ.   (11)

In the following, we construct a specific class indicator matrix Y_3 and show that it solves the
optimization problem in Eq. (11). More importantly, we show in Section 5 that multivariate linear regression using indicator matrix Y_3 is equivalent to LDA under a mild condition, which has been shown empirically to hold for most high-dimensional and undersampled data. The indicator matrix Y_3 = (Y_3(ij))_{ij} ∈ IR^{n×k} is constructed as follows:

    Y_3(ij) = \sqrt{n/n_j} − \sqrt{n_j/n}   if y_i = j,
    Y_3(ij) = −\sqrt{n_j/n}                 otherwise,   (12)

where n_j is the sample size of the j-th class, and n is the total sample size. It can be shown that Y_3 defined above has been centered (in terms of rows), and thus Ỹ_3 = Y_3.

Define matrices H_w, H_b, and H_t as follows:

    H_w = \frac{1}{\sqrt{n}} [X_1 − c^{(1)}(e^{(1)})^T, ..., X_k − c^{(k)}(e^{(k)})^T],   (13)

    H_b = \frac{1}{\sqrt{n}} [\sqrt{n_1}(c^{(1)} − c), ..., \sqrt{n_k}(c^{(k)} − c)],   (14)

    H_t = \frac{1}{\sqrt{n}} (X − c e^T),   (15)

where X_j is the data matrix of the j-th class, X is the data matrix, c^{(j)} is the centroid of the j-th class, c is the global centroid, e^{(j)} is the vector of all ones of length n_j, and e is the vector of all ones of length n. Then S_w, S_b, and S_t can be expressed as follows:

    S_w = H_w H_w^T,   S_b = H_b H_b^T,   S_t = H_t H_t^T.

Let H_t = U Σ V^T be the Singular Value Decomposition (SVD) (Golub & Van Loan, 1996) of H_t, where H_t is defined in Eq. (15), U and V are orthogonal, Σ = diag(Σ_t, 0), Σ_t ∈ IR^{t×t} is diagonal, and t = rank(S_t). Then

    S_t = H_t H_t^T = U Σ Σ^T U^T = U diag(Σ_t^2, 0) U^T.   (16)

Let U = (U_1, U_2) be a partition of U, such that U_1 ∈ IR^{d×t} and U_2 ∈ IR^{d×(d−t)}. That is, U_2 lies in the null space of S_t, i.e., U_2^T S_t U_2 = 0. Since S_t = S_b + S_w, we have 0 = U_2^T S_t U_2 = U_2^T S_b U_2 + U_2^T S_w U_2. Thus U_2^T S_b U_2 = 0, since S_w is positive semi-definite. It follows that

    U^T S_b U = \begin{pmatrix} U_1^T S_b U_1 & 0 \\ 0 & 0 \end{pmatrix}.   (17)

Denote

    B = Σ_t^{-1} U_1^T H_b ∈ IR^{t×k},   (18)

where H_b is defined in Eq. (14), and let

    B = P \hat{Σ} Q^T   (19)

be the SVD of B, where P and Q are orthogonal and \hat{Σ} ∈ IR^{t×k} is diagonal. Since S_b = H_b H_b^T, we have

    Σ_t^{-1} U_1^T S_b U_1 Σ_t^{-1} = B B^T = P \hat{Σ} \hat{Σ}^T P^T = P Σ_b P^T,   (20)

where

    Σ_b = \hat{Σ} \hat{Σ}^T = diag(α_1^2, ..., α_t^2),   (21)

    α_1^2 ≥ ... ≥ α_q^2 > 0 = α_{q+1}^2 = ... = α_t^2,   (22)

and q = rank(S_b). It follows from Eqs. (16) and (17) that

    S_t^+ S_b S_t^+ = U_1 Σ_t^{-2} U_1^T S_b U_1 Σ_t^{-2} U_1^T = U_1 Σ_t^{-1} P Σ_b P^T Σ_t^{-1} U_1^T,   (23)

where the last equality follows from Eq. (20). We have the following result:

Lemma 4.1. Let P, U_1, Σ_t, X̃, and Ỹ be defined as above. Then W^T S_b W = \frac{1}{n^2} F^T Σ_b F and W^T S_t W = \frac{1}{n^2} F^T F, where W is defined in Eq. (10), Σ_b is defined in Eq. (21), and F = P^T Σ_t^{-1} U_1^T (X̃ Ỹ).

We are now ready to present the main result of this section, that is, Ỹ = Y_3 solves the optimization problem in Eq. (11), where Y_3 is defined in Eq. (12), as summarized in the following theorem:

Theorem 4.1. Let S_b, S_t, Σ_b, W, and Ỹ be defined as above. Then for any Ỹ, the following inequality holds: trace( (W^T S_b W)(W^T S_t W)^+ ) ≤ trace(Σ_b). Furthermore, the equality holds when Ỹ = Y_3, where Y_3 is defined in Eq. (12).

With Ỹ = Y_3 as the class indicator matrix, the optimal weight matrix W^{MLR} for multivariate linear regression (MLR) in Eq. (10) becomes

    W^{MLR} = (X̃ X̃^T)^+ X̃ Ỹ = (n S_t)^+ n H_b = S_t^+ H_b.   (24)

5. Relationship between Multivariate Linear Regression and LDA

Recall that in LDA, the optimal transformation matrix, G^{LDA}, consists of the top eigenvectors of S_t^+ S_b corresponding to the nonzero eigenvalues. In this section, we study the relationship between W^{MLR} = S_t^+ H_b in Eq. (24) and the eigenvectors of S_t^+ S_b.
From Eqs. (16) and (17), we can decompose the matrix S_t^+ S_b as follows:

    S_t^+ S_b = U \begin{pmatrix} Σ_t^{-1} B B^T Σ_t & 0 \\ 0 & 0 \end{pmatrix} U^T
              = U \begin{pmatrix} Σ_t^{-1} P & 0 \\ 0 & I \end{pmatrix} \begin{pmatrix} Σ_b & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} P^T Σ_t & 0 \\ 0 & I \end{pmatrix} U^T,
where the equalities follow since B = Σ_t^{-1} U_1^T H_b = P \hat{Σ} Q^T is the SVD of B as in Eq. (19) and Σ_b = \hat{Σ} \hat{Σ}^T. Thus, the transformation matrix in LDA is given by G^{LDA} = U_1 Σ_t^{-1} P_q, where P_q consists of the first q columns of P, since only the first q diagonal entries of Σ_b are nonzero. On the other hand,

    S_t^+ H_b = U \begin{pmatrix} (Σ_t^2)^{-1} & 0 \\ 0 & 0 \end{pmatrix} U^T H_b
              = U_1 Σ_t^{-1} Σ_t^{-1} U_1^T H_b
              = U_1 Σ_t^{-1} B
              = U_1 Σ_t^{-1} P \hat{Σ} Q^T
              = U_1 Σ_t^{-1} P_q [\hat{Σ}_q, 0] Q^T
              = G^{LDA} [Σ_{bq}^{0.5}, 0] Q^T,   (25)

where \hat{Σ}_q, Σ_{bq} ∈ IR^{q×q} consist of the first q rows and the first q columns of \hat{Σ}, Σ_b, respectively, the fifth equality follows since only the first q rows and the first q columns of \hat{Σ} are nonzero, and the last equality follows since Σ_b = \hat{Σ} \hat{Σ}^T. It follows that

    W^{MLR} = G^{LDA} [Σ_{bq}^{0.5}, 0] Q^T,

where Q is orthogonal.

The K-Nearest-Neighbor (K-NN) algorithm (Duda et al., 2000) based on the Euclidean distance is commonly applied as the classifier in the dimensionality-reduced (transformed) space of LDA. If we apply W^{MLR} for dimensionality reduction before K-NN, the matrix W^{MLR} is invariant of an orthogonal transformation, since any orthogonal transformation preserves all pairwise distances. Thus W^{MLR} is essentially equivalent to G^{LDA} [Σ_{bq}^{0.5}, 0] or G^{LDA} Σ_{bq}^{0.5}, as the removal of zero columns does not change the pairwise distances either. The essential difference between W^{MLR} and G^{LDA} is thus the diagonal matrix Σ_{bq}^{0.5}.

Next, we show that the matrix Σ_{bq} is an identity matrix of size q, that is, W^{MLR} and G^{LDA} are essentially equivalent, under a mild condition that the rank difference of the three scatter matrices is zero, that is, rank(S_b) + rank(S_w) − rank(S_t) = 0, which holds in many applications involving high-dimensional and undersampled data (Ye & Xiong, 2006). The main result is summarized in the following theorem:

Theorem 5.1. Let Σ_{bq} ∈ IR^{q×q} consist of the first q rows and the first q columns of Σ_b as defined above, where Σ_b is defined in Eq. (21). Assume that the following equality holds: rank(S_b) + rank(S_w) − rank(S_t) = 0. Then Σ_{bq} = I_q, where I_q is the identity matrix of size q and q = rank(S_b).

6. Experiments

We performed our experimental studies using nine high-dimensional datasets, including text documents, face images, and gene expression data. DOC1, DOC2, and DOC3 are three text document datasets; ORL, AR, and PIX are three face image datasets; and GCM, ALL, and ALLAML are three gene expression datasets. The statistics of the datasets are summarized in Table 1 (the first column).

To compare LS-LDA and LDA, we use the K-NN algorithm with K = 1 as the classifier. For all datasets, we performed our study by repeated random splittings of the whole dataset into training and test sets as in (Dudoit et al., 2002). The data was partitioned randomly into a training set, where each class consists of two-thirds of the whole class, and a test set, with each class consisting of one-third of the whole class. The splitting was repeated 10 times and the resulting accuracies of the different algorithms for the ten splittings are summarized in Table 1. The rank difference of the three scatter matrices, i.e., rank(S_b) + rank(S_w) − rank(S_t), as well as the ratio of the largest to the smallest diagonal entries of Σ_{bq}^{0.5}, for each of the splittings is also reported. Recall from Theorem 5.1 that when the rank difference of the scatter matrices is zero, the matrix Σ_{bq} equals the identity matrix and the ratio of the largest to the smallest diagonal entries of Σ_{bq}^{0.5} is 1.
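The two per-splitting quantities reported in Table 1 can be computed directly from the scatter matrices: the rank difference rank(S_b) + rank(S_w) − rank(S_t), and the ratio of the largest to the smallest diagonal entry of Σ_{bq}^{0.5}; by the decomposition above, the nonzero diagonal entries of Σ_b are the nonzero eigenvalues of S_t^+ S_b. A minimal NumPy sketch with illustrative names and a simple numerical tolerance might look as follows.

```python
import numpy as np

def scatter_matrices(X, y):
    """Return (S_b, S_w, S_t) as in Eqs. (1)-(3); X is d x n."""
    d, n = X.shape
    c = X.mean(axis=1, keepdims=True)
    S_w = np.zeros((d, d))
    S_b = np.zeros((d, d))
    for j in np.unique(y):
        Xj = X[:, y == j]
        cj = Xj.mean(axis=1, keepdims=True)
        S_w += (Xj - cj) @ (Xj - cj).T / n
        S_b += Xj.shape[1] * (cj - c) @ (cj - c).T / n
    return S_b, S_w, S_b + S_w

def table1_diagnostics(X, y, tol=1e-8):
    """Rank difference and the ratio of diagonal entries of Sigma_bq^0.5."""
    S_b, S_w, S_t = scatter_matrices(X, y)
    diff = (np.linalg.matrix_rank(S_b, tol) + np.linalg.matrix_rank(S_w, tol)
            - np.linalg.matrix_rank(S_t, tol))
    # Nonzero eigenvalues of S_t^+ S_b are the alpha_i^2 in Eq. (21).
    evals = np.linalg.eigvals(np.linalg.pinv(S_t) @ S_b).real
    alpha_sq = evals[evals > tol]
    ratio = np.sqrt(alpha_sq.max() / alpha_sq.min())
    return diff, ratio

# Undersampled example (d >> n): the rank difference is typically 0 and,
# by Theorem 5.1, the ratio is then 1 up to numerical error.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 30))     # d = 500, n = 30
y = np.repeat([0, 1, 2], 10)
print(table1_diagnostics(X, y))
```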
We can observe from Table 1 that the rank difference, rank(S_b) + rank(S_w) − rank(S_t), equals zero in all cases except the DOC2 dataset. For most datasets, the n data points are linearly independent, i.e., rank(S_t) = n − 1. In this case, the k centroids are also linearly independent, i.e., rank(S_b) = k − 1, while in S_w, each data point has its class centroid subtracted and rank(S_w) = n − k. Hence, the rank difference is zero. Furthermore, LS-LDA and LDA achieve the same classification performance in all cases when the rank difference is zero. The empirical result confirms the theoretical analysis in Section 5. For DOC2, LS-LDA and LDA still achieve the same classification performance, although the rank difference is not zero.

Recall that the value of ratio denotes the ratio of the largest to the smallest diagonal entries of the matrix Σ_{bq}^{0.5}. From Table 1, the value of ratio equals 1 in all cases when the rank difference is zero. This is consistent with the theoretical result in Theorem 5.1. For DOC2, where the rank difference is not zero for several cases, the value of ratio is close to 1 in all cases. That is, the matrix Σ_{bq}^{0.5} is close to the identity matrix. This explains why LS-LDA and LDA achieve the same classification performance for DOC2, even though the rank difference is not zero.
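For readers who wish to repeat this comparison on their own data, the following self-contained NumPy sketch mirrors the protocol above on a single stratified 2:1 splitting: it computes the LS-LDA weight matrix of Eq. (10) with Ỹ = Y_3, the LDA transformation G^{LDA} = U_1 Σ_t^{-1} P_q via the construction of Section 4, and the resulting 1-NN accuracies. The function name, the dense pseudo-inverse, and the simple tolerances are illustrative choices, not the authors' implementation.

```python
import numpy as np

def lslda_vs_lda_accuracy(X, y, seed=0, tol=1e-10):
    """Compare LS-LDA (Eq. (10) with Y3) and LDA (G = U1 Sigma_t^{-1} P_q)
    with a 1-NN classifier on one stratified 2:1 splitting.

    X is d x n (columns are samples); y holds labels 0..k-1.
    """
    rng = np.random.default_rng(seed)
    k = len(np.unique(y))

    # Stratified 2:1 split: two-thirds of each class for training.
    train = np.zeros(len(y), dtype=bool)
    for j in range(k):
        idx = rng.permutation(np.flatnonzero(y == j))
        train[idx[: (2 * len(idx)) // 3]] = True
    Xtr, ytr, Xte, yte = X[:, train], y[train], X[:, ~train], y[~train]

    n = Xtr.shape[1]
    c = Xtr.mean(axis=1, keepdims=True)
    Xc = Xtr - c                                   # centered training data

    # H_b (Eq. (14)) and the indicator matrix Y3 (Eq. (12)).
    Hb = np.zeros((X.shape[0], k))
    Y3 = np.zeros((n, k))
    for j in range(k):
        nj = (ytr == j).sum()
        cj = Xtr[:, ytr == j].mean(axis=1, keepdims=True)
        Hb[:, [j]] = np.sqrt(nj / n) * (cj - c)
        Y3[:, j] = np.where(ytr == j, np.sqrt(n / nj) - np.sqrt(nj / n),
                            -np.sqrt(nj / n))

    # LS-LDA weight matrix, Eq. (10).
    W = np.linalg.pinv(Xc @ Xc.T) @ Xc @ Y3

    # LDA transformation: SVD of H_t, then B = Sigma_t^{-1} U_1^T H_b.
    Ht = Xc / np.sqrt(n)
    U, s, _ = np.linalg.svd(Ht, full_matrices=False)
    t = (s > tol).sum()
    U1, St = U[:, :t], s[:t]
    P, shat, _ = np.linalg.svd((U1.T @ Hb) / St[:, None])
    q = (shat > tol).sum()
    G = U1 @ (P[:, :q] / St[:, None])              # G^{LDA} = U1 Sigma_t^{-1} P_q

    def acc(T):
        Ztr, Zte = T.T @ Xc, T.T @ (Xte - c)
        d2 = ((Zte[:, :, None] - Ztr[:, None, :]) ** 2).sum(axis=0)
        return (ytr[d2.argmin(axis=1)] == yte).mean()

    return acc(W), acc(G)

# On undersampled data the rank condition of Theorem 5.1 typically holds
# and the two accuracies coincide.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 60))
y = np.repeat([0, 1, 2], 20)
print(lslda_vs_lda_accuracy(X, y))
```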
Table 1. Comparison of classification accuracy (in percentage) between LS-LDA and LDA. Ten different splittings into training and test sets of ratio 2:1 (for each of the k classes) are applied. The rank difference (Diff) of the three scatter matrices, i.e., rank(S_b) + rank(S_w) − rank(S_t), for each splitting, as well as the ratio of the largest to the smallest diagonal entries of Σ_{bq}^{0.5}, is reported. n is the total sample size, d is the data dimensionality, and k is the total number of classes.

DOC1 (n = 490, d = 3759, k = 5)
  LS-LDA  93.33  93.33  91.52  95.15  94.55  93.94  93.94  95.15  93.33  93.33
  LDA     93.33  93.33  91.52  95.15  94.55  93.94  93.94  95.15  93.33  93.33
  ratio   1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
  Diff    0      0      0      0      0      0      0      0      0      0

DOC2 (n = 320, d = 2887, k = 4)
  LS-LDA  75.93  67.59  78.70  78.70  80.56  76.85  81.48  85.19  84.26  81.48
  LDA     75.93  67.59  78.70  78.70  80.56  76.85  81.48  85.19  84.26  81.48
  ratio   1.010  1.023  1.013  1.019  1.019  1.013  1.010  1.000  1.010  1.000
  Diff    1      1      1      1      1      1      1      0      1      0

DOC3 (n = 210, d = 7455, k = 7)
  LS-LDA  95.71  100.0  100.0  97.14  97.14  97.14  97.14  100.0  98.57  92.86
  LDA     95.71  100.0  100.0  97.14  97.14  97.14  97.14  100.0  98.57  92.86
  ratio   1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
  Diff    0      0      0      0      0      0      0      0      0      0

ORL (n = 400, d = 10304, k = 40)
  LS-LDA  90.00  90.00  92.50  97.50  95.00  94.17  92.50  92.50  96.67  94.17
  LDA     90.00  90.00  92.50  97.50  95.00  94.17  92.50  92.50  96.67  94.17
  ratio   1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
  Diff    0      0      0      0      0      0      0      0      0      0

AR (n = 650, d = 8888, k = 50)
  LS-LDA  96.50  97.50  95.50  94.50  94.00  91.50  94.00  94.50  93.00  92.50
  LDA     96.50  97.50  95.50  94.50  94.00  91.50  94.00  94.50  93.00  92.50
  ratio   1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
  Diff    0      0      0      0      0      0      0      0      0      0

PIX (n = 300, d = 10000, k = 30)
  LS-LDA  94.44  95.56  92.22  96.67  100.0  98.89  95.56  95.56  95.56  94.44
  LDA     94.44  95.56  92.22  96.67  100.0  98.89  95.56  95.56  95.56  94.44
  ratio   1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
  Diff    0      0      0      0      0      0      0      0      0      0

GCM (n = 198, d = 16063, k = 14)
  LS-LDA  76.92  81.54  73.85  69.23  78.46  76.92  84.62  78.46  76.92  86.15
  LDA     76.92  81.54  73.85  69.23  78.46  76.92  84.62  78.46  76.92  86.15
  ratio   1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
  Diff    0      0      0      0      0      0      0      0      0      0

ALL (n = 248, d = 12559, k = 6)
  LS-LDA  97.56  100.0  96.34  97.56  96.34  98.78  98.78  97.56  98.78  97.56
  LDA     97.56  100.0  96.34  97.56  96.34  98.78  98.78  97.56  98.78  97.56
  ratio   1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
  Diff    0      0      0      0      0      0      0      0      0      0

ALLAML (n = 72, d = 4106, k = 4)
  LS-LDA  100.0  91.67  95.83  83.33  91.67  95.83  91.67  95.83  95.83  83.33
  LDA     100.0  91.67  95.83  83.33  91.67  95.83  91.67  95.83  95.83  83.33
  ratio   1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
  Diff    0      0      0      0      0      0      0      0      0      0

7. Conclusion and Discussion

In this paper, we analyze the relationship between multi-class LDA and multivariate linear regression. Specifically, we show that under a mild condition, which has been shown empirically to hold for many high-dimensional and undersampled data, multi-class LDA is equivalent to multivariate linear regression with a specific class indicator matrix. That is, under the given condition, multi-class LDA can be formulated as a least squares problem, which extends the previous equivalence result for the binary-class case. Our experimental studies on high-dimensional datasets confirm the presented theoretical analysis.
The presented analysis can be extended in several directions. Regularization is commonly applied to stabilize the sample covariance matrix estimation and improve the classification performance of LDA (Friedman, 1989). Regularization using the L2-norm penalty can also be applied in linear regression, which is known as ridge regression (Hoerl & Kennard, 1970). Based on the equivalence result established in this paper, we obtain the following formulation for regularized LDA:

    L_2(W, γ) = \frac{1}{2} \sum_{j=1}^{k} \left( \frac{1}{n} \sum_{i=1}^{n} ||x̃_i^T w_j − Y_3(ij)||_2^2 + γ ||w_j||_2^2 \right),   (26)

where γ > 0 is the regularization parameter, commonly estimated through cross-validation.

The geometry of the marginal distribution can be exploited through the Laplacian-based regularization (Belkin et al., 2006), which has been applied in regression and SVM. The equivalence relationship presented in this paper results in Laplacian regularized LDA. Furthermore, it naturally leads to semi-supervised dimensionality reduction by combining both labeled and unlabeled data (Belkin et al., 2006).

Sparsity has recently received much attention for extending Principal Component Analysis (d'Aspremont et al., 2004; Jolliffe & Uddin, 2003; Zou et al., 2006). The L1-norm penalty has been used in regression
(Tibshirani, 1996), known as LASSO, and SVM (Zhu et al., 2003) to achieve model sparsity. Sparsity often leads to easy interpretation and good generalization ability of the resulting model. Sparse Fisher Discriminant Analysis has been proposed in (Mika, 2002) for binary-class problems. Based on the equivalence relationship between LDA and regression established in this paper, multi-class sparse LDA can be formulated by minimizing an objective function similar to the one in Eq. (26) by replacing the 2-norm of w_j by the 1-norm ||w_j||_1. The optimal w_j can be computed by applying LASSO (Tibshirani, 1996). A solution path can also be obtained through the LARS algorithm (Efron et al., 2004). We plan to study the effectiveness of all these extensions in the future.

Acknowledgments

This research is sponsored by the Center for Evolutionary Functional Genomics of the Biodesign Institute at Arizona State University and by the National Science Foundation Grant IIS-0612069.

References

Belhumeour, P., Hespanha, J., & Kriegman, D. (1997). Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Trans. Pattern Analysis and Machine Intelligence, 19, 711–720.

Belkin, M., Niyogi, P., & Sindhwani, V. (2006). Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7, 2399–2434.

Cristianini, N., Kandola, J., Elisseeff, A., & Shawe-Taylor, J. (2001). On kernel target alignment. NIPS.

d'Aspremont, A., Ghaoui, L., Jordan, M., & Lanckriet, G. (2004). A direct formulation for sparse PCA using semidefinite programming. NIPS.

Duda, R., Hart, P., & Stork, D. (2000). Pattern classification. Wiley.

Dudoit, S., Fridlyand, J., & Speed, T. P. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association, 97, 77–87.

Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. Annals of Statistics (with discussion), 32, 407–499.

Fisher, R. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179–188.

Friedman, J. (1989). Regularized discriminant analysis. Journal of the American Statistical Association, 84, 165–175.

Fukunaga, K. (1990). Introduction to statistical pattern classification. USA: Academic Press.

Golub, G. H., & Van Loan, C. F. (1996). Matrix computations. USA: The Johns Hopkins University Press.

Guermeur, Y., Lifchitz, A., & Vert, R. (2004). A kernel for protein secondary structure prediction. Kernel Methods in Computational Biology, 193–206.

Hastie, T., Tibshirani, R., & Buja, A. (1994). Flexible discriminant analysis by optimal scoring. Journal of the American Statistical Association, 89, 1255–1270.

Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning: Data mining, inference, and prediction. Springer.

Hoerl, A., & Kennard, R. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12, 55–67.

Jolliffe, I., & Uddin, M. (2003). A modified principal component technique based on the lasso. Journal of Computational and Graphical Statistics, 12, 531–547.

Lee, Y., Lin, Y., & Wahba, G. (2004). Multicategory Support Vector Machines, theory, and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 99, 67–81.

Mika, S. (2002). Kernel fisher discriminants. PhD thesis, University of Technology, Berlin.
Acknowledgments

This research is sponsored by the Center for Evolutionary Functional Genomics of the Biodesign Institute at Arizona State University and by the National Science Foundation Grant IIS-0612069.

References

Belhumeour, P., Hespanha, J., & Kriegman, D. (1997). Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Trans. Pattern Analysis and Machine Intelligence, 19, 711–720.
Belkin, M., Niyogi, P., & Sindhwani, V. (2006). Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7, 2399–2434.
Cristianini, N., Kandola, J., Elisseeff, A., & Shawe-Taylor, J. (2001). On kernel target alignment. NIPS.
d'Aspremont, A., Ghaoui, L., Jordan, M., & Lanckriet, G. (2004). A direct formulation for sparse PCA using semidefinite programming. NIPS.
Duda, R., Hart, P., & Stork, D. (2000). Pattern classification. Wiley.
Dudoit, S., Fridlyand, J., & Speed, T. P. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association, 97, 77–87.
Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. Annals of Statistics (with discussion), 32, 407–499.
Fisher, R. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179–188.
Friedman, J. (1989). Regularized discriminant analysis. Journal of the American Statistical Association, 84, 165–175.
Fukunaga, K. (1990). Introduction to statistical pattern recognition. Academic Press.
Golub, G. H., & Van Loan, C. F. (1996). Matrix computations. The Johns Hopkins University Press.
Guermeur, Y., Lifchitz, A., & Vert, R. (2004). A kernel for protein secondary structure prediction. Kernel Methods in Computational Biology, 193–206.
Hastie, T., Tibshirani, R., & Buja, A. (1994). Flexible discriminant analysis by optimal scoring. Journal of the American Statistical Association, 89, 1255–1270.
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning: Data mining, inference, and prediction. Springer.
Hoerl, A., & Kennard, R. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12, 55–67.
Jolliffe, I., & Uddin, M. (2003). A modified principal component technique based on the lasso. Journal of Computational and Graphical Statistics, 12, 531–547.
Lee, Y., Lin, Y., & Wahba, G. (2004). Multicategory support vector machines, theory, and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 99, 67–81.
Mika, S. (2002). Kernel Fisher discriminants. PhD thesis, University of Technology, Berlin.
Park, C., & Park, H. (2005). A relationship between LDA and the generalized minimum squared error solution. SIAM Journal on Matrix Analysis and Applications, 27, 474–492.
Swets, D. L., & Weng, J. (1996). Using discriminant eigenfeatures for image retrieval. IEEE Trans. Pattern Analysis and Machine Intelligence, 18, 831–836.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 267–288.
Ye, J., & Xiong, T. (2006). Computational and theoretical analysis of null space and orthogonal linear discriminant analysis. Journal of Machine Learning Research, 7, 1183–1204.
Zhang, P., & Riedel, N. (2005). Discriminant analysis: A unified approach. ICDM.
Zhu, J., Rosset, S., Hastie, T., & Tibshirani, R. (2003). 1-norm support vector machines. NIPS.
Zou, H., Hastie, T., & Tibshirani, R. (2006). Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2), 265–286.
Appendix

A. Proof of Lemma 4.1

Proof. Note that $\tilde{X}\tilde{X}^T = nS_t$. From Eq. (23), we have
\begin{align*}
W^T S_b W &= \left((\tilde{X}\tilde{X}^T)^+ \tilde{X}\tilde{Y}\right)^T S_b \left((\tilde{X}\tilde{X}^T)^+ \tilde{X}\tilde{Y}\right)\\
&= (\tilde{X}\tilde{Y})^T (nS_t)^+ S_b (nS_t)^+ (\tilde{X}\tilde{Y})\\
&= \frac{1}{n^2} (\tilde{X}\tilde{Y})^T U_1 \Sigma_t^{-1} P \Sigma_b P^T \Sigma_t^{-1} U_1^T (\tilde{X}\tilde{Y})\\
&= \frac{1}{n^2} F^T \Sigma_b F.
\end{align*}
Since $M^+ M M^+ = M^+$ for any matrix $M$ (Golub & Van Loan, 1996), and $P$ is orthogonal, i.e., $PP^T = I_t$, we have
\begin{align*}
W^T S_t W &= \left((\tilde{X}\tilde{X}^T)^+ \tilde{X}\tilde{Y}\right)^T S_t \left((\tilde{X}\tilde{X}^T)^+ \tilde{X}\tilde{Y}\right)\\
&= (\tilde{X}\tilde{Y})^T (nS_t)^+ S_t (nS_t)^+ (\tilde{X}\tilde{Y})\\
&= \frac{1}{n^2} (\tilde{X}\tilde{Y})^T S_t^+ (\tilde{X}\tilde{Y})\\
&= \frac{1}{n^2} (\tilde{X}\tilde{Y})^T U_1 \Sigma_t^{-2} U_1^T (\tilde{X}\tilde{Y})\\
&= \frac{1}{n^2} F^T F.
\end{align*}

B. Proof of Theorem 4.1

Proof. From Lemma 4.1, we have
\begin{align*}
\mathrm{tr}\left((W^T S_b W)(W^T S_t W)^+\right) &= \mathrm{tr}\left((F^T \Sigma_b F)(F^T F)^+\right)\\
&= \mathrm{tr}\left((F^T \Sigma_b F) F^+ (F^T)^+\right)\\
&= \mathrm{tr}\left((F F^+)^T \Sigma_b (F F^+)\right),
\end{align*}
where the second equality follows since $(F^T F)^+ = F^+ (F^T)^+$ (Golub & Van Loan, 1996). Let
$$F = P_1\, \mathrm{diag}(\Sigma_1, 0)\, Q_1^T$$
be the SVD of $F$, where $P_1$ and $Q_1$ are orthogonal, $\Sigma_1 \in \mathbb{R}^{r \times r}$ is diagonal and nonsingular, and $r = \mathrm{rank}(F)$. It follows that
$$F F^+ = P_1\, \mathrm{diag}(\Sigma_1, 0)\, Q_1^T Q_1\, \mathrm{diag}(\Sigma_1^{-1}, 0)\, P_1^T = P_1\, \mathrm{diag}(I_r, 0)\, P_1^T = P_{1r} P_{1r}^T,$$
where $P_{1r}$ consists of the first $r$ columns of $P_1$ and thus has orthonormal columns, i.e., $P_{1r}^T P_{1r} = I_r$. It follows that
$$\mathrm{tr}\left((F F^+)^T \Sigma_b (F F^+)\right) = \mathrm{tr}\left(P_{1r}^T \Sigma_b P_{1r}\right) \le \mathrm{tr}(\Sigma_b),$$
where the first equality follows since $\mathrm{tr}(AB) = \mathrm{tr}(BA)$ for any two matrices $A$ and $B$ and $P_{1r}^T P_{1r} = I_r$, and the inequality follows since $P_{1r}$ has orthonormal columns. This completes the proof of the first part of the theorem.

When $\tilde{Y} = Y_3$, we have $\tilde{X}\tilde{Y} = n H_b$. It follows that
$$F = P^T \Sigma_t^{-1} U_1^T \tilde{X}\tilde{Y} = n P^T \Sigma_t^{-1} U_1^T H_b = n P^T B = n \hat{\Sigma} Q^T,$$
where $B$ is defined in Eq. (18) and $\hat{\Sigma} \in \mathbb{R}^{t \times k}$ is defined in Eq. (19). Since $\Sigma_b = \hat{\Sigma}\hat{\Sigma}^T$, we have
\begin{align*}
W^T S_b W &= \frac{1}{n^2} F^T \Sigma_b F = Q \hat{\Sigma}^T \Sigma_b \hat{\Sigma} Q^T = Q \Sigma_{bk}^2 Q^T,\\
W^T S_t W &= \frac{1}{n^2} F^T F = Q \hat{\Sigma}^T \hat{\Sigma} Q^T = Q \Sigma_{bk} Q^T,
\end{align*}
where $\Sigma_{bk}$ consists of the first $k$ rows and the first $k$ columns of $\Sigma_b$. It follows that
$$\mathrm{tr}\left((W^T S_b W)(W^T S_t W)^+\right) = \mathrm{tr}(\Sigma_{bk}) = \mathrm{tr}(\Sigma_b),$$
where the last equality follows since only the first $q$ diagonal entries of $\Sigma_b$ are nonzero.

C. Proof of Theorem 5.1

Proof. Let the matrix $H \in \mathbb{R}^{d \times d}$ be defined as follows:
$$H = U \begin{bmatrix} \Sigma_t^{-1} P & 0 \\ 0 & I_{d-t} \end{bmatrix}, \qquad (27)$$
where $U$ and $\Sigma_t$ are defined in Eq. (16), and $P$ is defined in Eq. (19). It follows from Eqs. (16)–(21) that
$$H^T S_b H = \mathrm{diag}(\Sigma_b, 0), \qquad H^T S_t H = \mathrm{diag}(I_t, 0). \qquad (28)$$
Since $S_w = S_t - S_b$, we have
$$H^T S_w H = \mathrm{diag}(\Sigma_w, 0), \qquad (29)$$
for some diagonal matrix $\Sigma_w = I_t - \Sigma_b$. From Eqs. (21), (22), (28) and (29), we have
\begin{align*}
H^T S_b H &= \mathrm{diag}(\alpha_1^2, \cdots, \alpha_t^2, 0, \cdots, 0),\\
H^T S_w H &= \mathrm{diag}(\beta_1^2, \cdots, \beta_t^2, 0, \cdots, 0),
\end{align*}
where $\alpha_1^2 \ge \cdots \ge \alpha_q^2 > 0 = \alpha_{q+1}^2 = \cdots = \alpha_t^2$, and $\alpha_i^2 + \beta_i^2 = 1$ for all $i$. Since $\mathrm{rank}(S_b) + \mathrm{rank}(S_w) - \mathrm{rank}(S_t) = 0$, we have
$$t = \mathrm{rank}(H^T S_t H) = \mathrm{rank}(H^T S_b H) + \mathrm{rank}(H^T S_w H).$$
Since $\alpha_i^2 + \beta_i^2 = 1$, at least one of $\alpha_i$ and $\beta_i$ is nonzero, so the inequality $\mathrm{rank}(H^T S_b H) + \mathrm{rank}(H^T S_w H) \ge t$ always holds. Equality holds only when, for each $i$, either $\alpha_i$ or $\beta_i$ is zero, that is, $\alpha_i \beta_i = 0$ for all $i$. Hence $\alpha_1^2 = \cdots = \alpha_q^2 = 1$; that is, $\Sigma_{bq}$, which consists of the first $q$ rows and the first $q$ columns of $\Sigma_b$, equals $I_q$.
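The two ingredients used in these proofs can also be examined numerically. The sketch below (Python with NumPy; the synthetic-data setup, the centered 0/1 indicator target, and all variable names are illustrative choices made here, not the paper's) checks the rank condition rank(St) = rank(Sb) + rank(Sw), which typically holds when n is much smaller than d, and prints the value of the trace criterion attained by the least-squares solution next to tr(St^+ Sb), the value attained by classical LDA. For data in general position the two are expected to agree up to round-off, since the centered indicator spans the same column space as Hb; this is an illustration, not a substitute for the proofs.

# A minimal numerical check (not from the paper) on synthetic data with n << d.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 60, 200, 3                           # n samples, d features, k classes
y = np.repeat(np.arange(k), n // k)            # balanced labels 0, 1, 2
shift = 2.0 * rng.standard_normal((d, k))      # class-specific mean shifts
X = rng.standard_normal((d, n)) + shift[:, y]  # data matrix, d x n (features by samples)

c = X.mean(axis=1, keepdims=True)
Xc = X - c                                     # centered data
means = np.stack([X[:, y == j].mean(axis=1) for j in range(k)], axis=1)
counts = np.array([(y == j).sum() for j in range(k)])

M = means - c                                  # class means minus global mean, d x k
St = Xc @ Xc.T / n                             # total scatter
Sb = (M * counts) @ M.T / n                    # between-class scatter
Sw = St - Sb                                   # within-class scatter

# (i) rank condition: rank(St) = rank(Sb) + rank(Sw), expected when n << d
print(np.linalg.matrix_rank(St),
      np.linalg.matrix_rank(Sb) + np.linalg.matrix_rank(Sw))

# (ii) least-squares solution with a centered 0/1 indicator target
Y = np.zeros((n, k))
Y[np.arange(n), y] = 1.0
Yc = Y - Y.mean(axis=0)
W = np.linalg.pinv(Xc @ Xc.T) @ Xc @ Yc        # d x k

lhs = np.trace(np.linalg.pinv(W.T @ St @ W) @ (W.T @ Sb @ W))
rhs = np.trace(np.linalg.pinv(St) @ Sb)        # value attained by classical LDA
print(lhs, rhs)                                # expected to agree up to round-off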