analysis, and statistics (i.e., $n = 5$, where the first two exams were closed book and the other three were open book). For each exam, the best possible score was 100. It is found that the average score $\bar{x}$ equals $(39.0, 50.6, 50.6, 46.7, 42.3)'$ and that the eigenvalues of the covariance matrix $\Sigma_{xx}$ are given by $\lambda_1 = 679.2$, $\lambda_2 = 199.8$, $\lambda_3 = 102.6$, $\lambda_4 = 83.7$, and $\lambda_5 = 31.8$. Hence, the first two principal components already explain 80% of the variance in the data (and 91% is achieved with the first three). The first two eigenvectors are $u_1 = (0.51, 0.37, 0.35, 0.45, 0.53)'$ and $u_2 = (0.75, 0.21, -0.08, -0.30, -0.55)'$. These findings can easily be interpreted. The authors conclude that "...the first principal component gives positive weight to all the variables and thus represents an average grade. On the other hand, the second principal component represents a contrast between the open-book and closed-book examinations...." For example, the scores and first two principal components of the two best students are (77, 82, 67, 67, 81), 66.4, and 6.4 and (63, 78, 80, 70, 81), 63.7, and -6.4. Even without looking at the individual test scores, by considering only the first two principal components one would conclude that the overall performances of the two students are very similar, but the first student did better on closed-book and the second one better on open-book exams.
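The numbers quoted above can be checked directly: centering each student's score vector by the average score and projecting it onto $u_1$ and $u_2$ reproduces the reported principal component scores, and the ratio of the first two eigenvalues to their total gives the 80% figure. A minimal sketch in Python/NumPy, using only the rounded means and eigenvectors printed in the text (so agreement is only approximate):

    import numpy as np

    # Rounded values quoted in the text for the exam-score example.
    mean = np.array([39.0, 50.6, 50.6, 46.7, 42.3])
    u1 = np.array([0.51, 0.37, 0.35, 0.45, 0.53])
    u2 = np.array([0.75, 0.21, -0.08, -0.30, -0.55])
    eigvals = np.array([679.2, 199.8, 102.6, 83.7, 31.8])

    # Fraction of variance explained by the first two principal components (about 0.80).
    print(eigvals[:2].sum() / eigvals.sum())

    # Scores of the two best students.
    students = np.array([[77.0, 82, 67, 67, 81],
                         [63, 78, 80, 70, 81]])

    # Principal component scores: project the centered scores onto u1 and u2.
    centered = students - mean
    print(centered @ u1)   # roughly (66.4, 63.7), as quoted
    print(centered @ u2)   # roughly (6.4, -6.4); rounding of u2 and the mean shifts these a little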
In conclusion, PCA is optimal in the least-mean-square sense and can serve two purposes: data compression, by projecting high-dimensional data onto a lower-dimensional space, and feature extraction, by revealing, through the principal components, relevant but unexpected structure hidden in the data (although an interpretation of these features in terms of the original variables may not always be straightforward).

F. Mean Square Classifier and Discriminant Analysis

Consider now the problem where the patterns $x_t$ must be classified into $m$ classes $C_1, \ldots, C_m$, with, in general, $m \ll n$. Thus for every input pattern $x_t$ there is a binary target output pattern $y_t = (0, \ldots, 1, \ldots, 0)'$, where $y_{i,t} = 1$ if and only if $x_t$ belongs to $C_i$. One possible classification method consists in finding an $m \times n$ matrix $L$ such that $\langle\|y_t - L x_t\|^2\rangle$ is minimal. Needless to say, this is a special case of least-squares regression and, as we have seen, under the usual assumptions the optimal $L$ is given by $L = \Sigma_{yx}\Sigma_{xx}^{-1}$ and is called the mean-square classifier.
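As an illustration, the mean-square classifier can be assembled directly from the data. The sketch below is ours, not the paper's: it assumes the patterns $x_t$ are stored as the columns of an array X, the one-hot targets $y_t$ as the columns of Y, and that $\Sigma_{xx}$ is invertible.

    import numpy as np

    def mean_square_classifier(X, Y):
        """L = Sigma_yx Sigma_xx^{-1}, the least-squares minimizer of the error ||y_t - L x_t||^2.

        X : (n, T) array with the input patterns x_t as columns.
        Y : (m, T) array with the one-hot target patterns y_t as columns.
        """
        Sxx = X @ X.T      # Sigma_xx (any common 1/T factor cancels in the product)
        Syx = Y @ X.T      # Sigma_yx
        # Solve L Sxx = Syx instead of forming the explicit inverse.
        return np.linalg.solve(Sxx, Syx.T).T

    # Hypothetical toy data: T patterns of dimension n, m classes.
    rng = np.random.default_rng(0)
    n, m, T = 10, 3, 200
    X = rng.standard_normal((n, T))
    labels = rng.integers(0, m, size=T)
    Y = np.eye(m)[labels].T                 # one-hot columns
    L = mean_square_classifier(X, Y)
    predicted = np.argmax(L @ X, axis=0)    # classify by the largest output component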
In many applications $n$ is very large compared to $m$, and therefore it becomes useful to first reduce the dimensionality of the input data. One is thus led to find a linear subspace of dimension $p$ such that, when projected onto this subspace, the patterns $x_t$ fall as much as possible into well-defined, separated clusters facilitating the classification. This problem of finding an optimal projection is similar to the one encountered in PCA. Because of the clustering, however, a new measure must be introduced to compare different projections. Consider a projection $z = C'x$, where $C$ is an $n \times p$ matrix. The total dispersion (variation) in the $x$-sample can be decomposed into the sum of within-class dispersions and between-class dispersions. When the $x$'s are centered, the total dispersion is $\Sigma_{xx}$, and the dispersion between classes can be shown to be $\Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}$. Upon projecting the patterns, the corresponding total and between-class dispersions of the $z_t$ patterns become $C'\Sigma_{xx}C$ and $C'\Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}C$. A projection is optimal if the between-class variation of the $z$'s is as large as possible relative to the total variation. Different cost functions can be introduced at this stage. If the size of a variation matrix is measured by its determinant (the determinant of a matrix measures the volume of the image of a unit cube under the corresponding linear map), then we are led to the problem of finding an $n \times p$ matrix $C$ maximizing the ratio

$$E(C) = \frac{\det(C'\Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}C)}{\det(C'\Sigma_{xx}C)}. \qquad (3)$$

The solution is well known. All optimal matrices, the so-called discriminant analysis (DA) matrices, are of the form $H_p R$, where $R$ is an arbitrary $p \times p$ invertible matrix and $H_p$ has the first $p$ eigenvectors of $\Sigma_{xx}^{-1}\Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}$ as its columns.

It is not easy to see what the solutions look like in general. There is one case, however, where all optimal solutions can easily be described. When $p = r = \mathrm{rank}(\Sigma_{xy})$, an $n \times p$ matrix $C$ is a DA matrix if and only if the space spanned by the columns of $C$ coincides with the space spanned by the rows of the mean-square classifier $\Sigma_{yx}\Sigma_{xx}^{-1}$. See, for instance, Kshirsagar [8] for more details on DA.
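Concretely, such a projection can be obtained by taking the first $p$ eigenvectors of $\Sigma_{xx}^{-1}\Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}$ or, equivalently, by solving the generalized symmetric eigenproblem $\Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}v = \lambda\Sigma_{xx}v$. The following sketch is our own illustration (the variable names and the use of scipy.linalg.eigh are assumptions, not the paper's notation); it takes centered patterns as the columns of X and one-hot class indicators as the columns of Y, and assumes $\Sigma_{xx}$ is positive definite and $\Sigma_{yy}$ is invertible.

    import numpy as np
    from scipy.linalg import eigh

    def discriminant_matrix(X, Y, p):
        """Return an n x p matrix whose columns span an optimal (DA) projection.

        X : (n, T) centered input patterns as columns.
        Y : (m, T) one-hot class indicators as columns.
        """
        T = X.shape[1]
        Sxx = X @ X.T / T                              # total dispersion Sigma_xx
        Sxy = X @ Y.T / T
        Syy = Y @ Y.T / T
        between = Sxy @ np.linalg.solve(Syy, Sxy.T)    # Sigma_xy Sigma_yy^{-1} Sigma_yx

        # Generalized eigenproblem: between v = lambda Sxx v, i.e., the
        # eigenvectors of Sxx^{-1} Sxy Syy^{-1} Syx. eigh returns eigenvalues
        # in ascending order, so keep the top p eigenvectors.
        lam, V = eigh(between, Sxx)
        return V[:, ::-1][:, :p]

Any such matrix is determined only up to right-multiplication by an invertible $p \times p$ matrix $R$, consistent with the $H_p R$ characterization above.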
III. BACKPROPAGATION

A. The Landscape Properties of E

We now consider the setting described in the Introduction, where the learning procedure is based on the minimization of the cost function $E(A, B)$. A complete description of the landscape properties of $E$ is given in Baldi and Hornik [6]. We shall briefly review the most salient features. $E$ is best described in terms of its critical points, that is, the points where $\partial E/\partial a_{ij} = \partial E/\partial b_{ij} = 0$. It is first important to observe that if $C$ is any $p \times p$ invertible matrix, then $E(A, B) = E(CA, BC^{-1})$. Therefore, at any point, $E$ really depends on the global map $W = BA$ rather than on $A$ and $B$. For instance, there is an infinite family of pairs of matrices $(A, B)$ corresponding to any critical point. Unlike the simple case of linear regression, however, $W$ cannot be chosen arbitrarily: the network architecture constrains $W$ to have at most rank $p$.

The remarkable property of the landscape of $E$ is the absence of local minima, in spite of the fact that $E$ is not convex (nor is the set of all matrices of rank at most $p$). $E$ is characterized by a unique global minimum (up to multiplication by a matrix $C$). All other critical points are saddle points. The structure of the critical points can be described completely. More precisely, assume for simplicity that $p \le m \le n$ and that $\Sigma = \Sigma_{yx}\Sigma_{xx}^{-1}\Sigma_{xy}$, the covariance matrix of the linear estimates $\hat{y}_t$ (see Section II-D), is of full rank with $m$ distinct eigenvalues $\lambda_1 > \cdots > \lambda_m$ and corresponding orthonormal eigenvectors $u_1, \ldots, u_m$. If $\mathcal{I} = \{i_1, \ldots, i_p\}$ with $1 \le i_1 < \cdots < i_p \le m$ is any ordered $p$-index set, let $U_{\mathcal{I}}$ denote the matrix $[u_{i_1}, \ldots, u_{i_p}]$. Then two full-rank matrices