vectors perpendicular to $u_1$. Again, by the previous result, we know the answer is $b^* = \pm u_2$, and so forth. At the $k$th step, we look for lines $C_k$ perpendicular to the space spanned by $u_1, \ldots, u_{k-1}$ such that the projections of the points $x_t$ along $C_k$ have maximal dispersion. This is achieved by choosing $C_k$ as the line spanned by $u_k$.

After the completion of $p$ steps, we extract the first $p$ principal components $u_1'x_t, \ldots, u_p'x_t$ and reduce $x_t$ to its projection onto the hyperplane spanned by the first $p$ eigenvectors. One may ask whether this is the best possible data reduction of the kind under consideration, that is, the best possible projection of the data onto a $p$-dimensional hyperplane $\mathcal{H}$ in the sense that the projections of the data onto $\mathcal{H}$ have maximal variance. After all, a better result might have been achieved by choosing the hyperplane in a single step. This, however, is not the case.

Among all $p$-dimensional hyperplanes $\mathcal{H}$, the one spanned by the first $p$ principal vectors $u_1, \ldots, u_p$ is the hyperplane such that $\langle \|P_{\mathcal{H}} x\|^2 \rangle$ is maximal. Equivalently, it is the hyperplane $\mathcal{H}$ which minimizes the average projection error $\langle \|x - P_{\mathcal{H}} x\|^2 \rangle$.

It is therefore possible to incrementally build the PCA feature extractor. Since $\mathcal{H}$ is the best $p$-dimensional hyperplane we can fit to the $n$-dimensional point cloud, the "flatter" the cloud the better the fit. It is worth investigating how good the fit is, that is, how much of the variance in the data set is actually explained by the first $p$ principal components. This is easily computed, for the variance of the $i$th component is given by

$$\langle (u_i'x)^2 \rangle = \langle u_i'xx'u_i \rangle = u_i'\Sigma_{xx}u_i = \lambda_i.$$

The total variance being equal to the sum of all the eigenvalues of $\Sigma_{xx}$, the proportion of total variance explained by the first $p$ principal components equals $(\lambda_1 + \cdots + \lambda_p)/(\lambda_1 + \cdots + \lambda_n)$.
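As a rough illustration of the preceding paragraphs, the following NumPy sketch (not from the paper; the synthetic data and variable names are ours) extracts the first $p$ principal components of a centered point cloud, projects the points onto the hyperplane they span, and reports the proportion of total variance explained.

```python
# Minimal sketch (illustrative, not the paper's code): first p principal
# components of a centered cloud and the proportion of variance explained.
import numpy as np

rng = np.random.default_rng(0)
T, n, p = 500, 5, 2                       # T points in n dimensions, keep p components
X = rng.standard_normal((T, n)) @ rng.standard_normal((n, n))
X -= X.mean(axis=0)                       # center the cloud

Sigma = X.T @ X / T                       # sample covariance Sigma_xx
lam, U = np.linalg.eigh(Sigma)
lam, U = lam[::-1], U[:, ::-1]            # eigenvalues lambda_1 >= ... >= lambda_n

U_p = U[:, :p]                            # first p normalized eigenvectors u_1, ..., u_p
P_H = U_p @ U_p.T                         # orthogonal projection onto the hyperplane H
X_proj = X @ P_H                          # projections P_H x_t of the data points

# Variance of the i-th component is lambda_i, so the fraction of total variance
# explained by the first p components is (lambda_1 + ... + lambda_p) / sum(lambda).
print("variance explained:", lam[:p].sum() / lam.sum())
```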
In fact, PCA performs "best data compression" among a wider class of methods. Let us write $U_p = [u_1, \ldots, u_p]$ for the matrix having the first $p$ normalized eigenvectors of $\Sigma_{xx}$ as its columns and let us stack the first $p$ features $u_1'x_t, \ldots, u_p'x_t$ extracted by PCA into a column vector $z_t$. Then $z_t = U_p'x_t$ and $P_{U_p}x_t = U_pU_p'x_t = U_pz_t$. Hence, PCA is one method that linearly compresses $n$-dimensional inputs $x_t$ into $p$-dimensional vectors $z_t$ for some $p < n$, that is, $z = Ax$ for a suitable $p \times n$ matrix $A$. Linear reconstruction of the data can then be achieved by approximating $x_t$ by $Bz_t = BAx_t$ for some suitable $n \times p$ matrix $B$.

Among all $p \times n$ matrices $A$ and $n \times p$ matrices $B$, optimal linear data compression (in the sense that the average reconstruction error $\langle \|x - BAx\|^2 \rangle$ is minimized) is achieved if and only if the global map $W = BA$ equals the orthogonal projection $P_{U_p}$ onto the hyperplane spanned by the first $p$ eigenvectors of $\Sigma_{xx}$.

Finally, computing the covariance of two principal components gives that for $i \neq j$

$$\langle (u_i'x)(u_j'x) \rangle = \langle u_i'xx'u_j \rangle = u_i'\Sigma_{xx}u_j = u_i'\lambda_j u_j = 0.$$

Thus different components are uncorrelated, and we can think of the transformation of $x_t$ into the vector of $n$ principal components $[u_1'x_t, \ldots, u_n'x_t]'$ as an orthogonal transformation of the Euclidean space, such that in the new system of coordinates, the components of the points in the cloud are uncorrelated and with decreasing variance. Again, if only the first few coordinates in the new system vary significantly, we may approximately locate points by giving only these few coordinates.
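The compression/reconstruction view can be checked numerically with a short sketch of the same kind (again our own synthetic data, not the paper's code): taking $A = U_p'$ and $B = U_p$, the average reconstruction error reduces to the sum of the discarded eigenvalues, and the full set of principal components is uncorrelated.

```python
# Sketch (illustrative): PCA as linear compression z = A x with A = U_p' and
# reconstruction B z with B = U_p, so W = BA = U_p U_p' is the orthogonal projection.
import numpy as np

rng = np.random.default_rng(0)
T, n, p = 500, 5, 2
X = rng.standard_normal((T, n)) @ rng.standard_normal((n, n))
X -= X.mean(axis=0)

Sigma = X.T @ X / T
lam, U = np.linalg.eigh(Sigma)
lam, U = lam[::-1], U[:, ::-1]
U_p = U[:, :p]

A, B = U_p.T, U_p                         # p x n compression, n x p reconstruction
Z = X @ A.T                               # compressed vectors z_t = U_p' x_t
X_hat = Z @ B.T                           # reconstructions B z_t = U_p U_p' x_t

# The average reconstruction error equals the sum of the discarded eigenvalues.
print(np.mean(np.sum((X - X_hat) ** 2, axis=1)), lam[p:].sum())

# The n principal components are uncorrelated: U' Sigma_xx U is diagonal.
C = U.T @ Sigma @ U
print("largest off-diagonal entry:", np.abs(C - np.diag(np.diag(C))).max())
```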
PCA can also be examined from an information-theoretic standpoint and shown to be optimal, under simple assumptions, for a different measure. More precisely, consider a transmission channel (in our case, one can think of the network connecting the input units to the hidden units) with $n$-dimensional centered input vectors having a Gaussian distribution with covariance matrix $\Sigma_{xx} = \langle xx' \rangle$. The outputs of the channel are constrained to be $p$-dimensional vectors of the form $y = Lx$, for some $p \times n$ matrix $L$ (and, without any loss of generality, we can assume that $L$ has rank $p$, $p < n$). Hence, $y$ is also Gaussian with covariance matrix $L\Sigma_{xx}L'$. Classically, the differential entropy of $x$ is given by (see, for instance, [7] for more details)

$$H(x) = -\int p(x)\log p(x)\,dx = \tfrac{1}{2}\log\bigl[(2\pi e)^n \det(\Sigma_{xx})\bigr]$$

where $p(x)$ is the Gaussian density function, and similarly

$$H(y) = \tfrac{1}{2}\log\bigl[(2\pi e)^p \det(L\Sigma_{xx}L')\bigr].$$

The conditional distribution of $x$ given $y$ (see, for instance, [8]) is normal with mean

$$\mu_{x \cdot y} = \Sigma_{xy}\Sigma_{yy}^{-1}y$$

and covariance matrix

$$\Sigma_{xx \cdot y} = \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}.$$

It can be shown that $H(x|y)$, the conditional entropy of $x$ given $y$ (i.e., the entropy of the conditional distribution), is given by

$$H(x|y) = \tfrac{1}{2}\log\bigl((2\pi e)^{n-p}\,\gamma_1\cdots\gamma_{n-p}\bigr)$$

where $\gamma_1 \ge \cdots \ge \gamma_{n-p} > 0$ are the nonzero eigenvalues of $\Sigma_{xx \cdot y}$.

As the entropy is one way of measuring our uncertainty, it is desirable to choose $L$ so as to minimize $H(x|y)$. One can show that the optimal $L$ is of the form $L = CU_p'$ where $C$ is an invertible $p \times p$ matrix and $U_p = [u_1, \ldots, u_p]$. In particular, this choice also maximizes the information that $y$ conveys about $x$, measured by the mutual information $I(x, y)$ defined to be

$$I(x, y) = H(x) - H(x|y)$$

with value

$$I_{\mathrm{PCA}}(x, y) = \tfrac{1}{2}\log\bigl((2\pi e)^p \lambda_1\cdots\lambda_p\bigr).$$
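The information-theoretic statements can be probed numerically with the sketch below (our own construction, not the paper's code; in particular, the tolerance used to pick out the nonzero eigenvalues of $\Sigma_{xx \cdot y}$ is an implementation choice). For a synthetic Gaussian covariance, $I(x, y)$ with $L = U_p'$ matches the closed form $\tfrac{1}{2}\log((2\pi e)^p \lambda_1\cdots\lambda_p)$ and is not exceeded by a randomly chosen $L$.

```python
# Sketch (assumed setup): mutual information I(x, y) for Gaussian x and y = L x,
# comparing the PCA choice L = U_p' with a random p x n matrix L.
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 2
M = rng.standard_normal((n, n))
Sigma = M @ M.T                           # a synthetic covariance Sigma_xx
lam, U = np.linalg.eigh(Sigma)
lam, U = lam[::-1], U[:, ::-1]

def gaussian_entropy(cov):
    """Differential entropy 0.5 * log((2*pi*e)^k det(cov)), in nats."""
    k = cov.shape[0]
    return 0.5 * (k * np.log(2 * np.pi * np.e) + np.linalg.slogdet(cov)[1])

def conditional_entropy(cov, L, tol=1e-9):
    """H(x|y) for y = L x, from the nonzero eigenvalues of the conditional covariance."""
    Syy = L @ cov @ L.T
    Sxy = cov @ L.T
    S_cond = cov - Sxy @ np.linalg.solve(Syy, Sxy.T)   # Sigma_xx - Sigma_xy Sigma_yy^-1 Sigma_yx
    g = np.linalg.eigvalsh(S_cond)
    g = g[g > tol]                        # the n - p nonzero eigenvalues gamma_i
    return 0.5 * (len(g) * np.log(2 * np.pi * np.e) + np.log(g).sum())

H_x = gaussian_entropy(Sigma)
I_pca = H_x - conditional_entropy(Sigma, U[:, :p].T)           # L = U_p'
I_rand = H_x - conditional_entropy(Sigma, rng.standard_normal((p, n)))
I_closed = 0.5 * (p * np.log(2 * np.pi * np.e) + np.log(lam[:p]).sum())
print(f"I with L = U_p': {I_pca:.4f}  closed form: {I_closed:.4f}  random L: {I_rand:.4f}")
```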
Thus, at least in the Gaussian setting, up to trivial transformations, the optimal linear map maximizing the mutual information is the principal component analyzer. Finally, PCA can also be connected to optimal inference methods (see [9]).

To illustrate the PCA feature extraction technique, consider the "open/closed book" data set in Mardia et al. [10, Table 1.2.1, p. 3f]. The data consist of the scores of $T = 88$ students on examinations in mechanics, vectors, algebra,