... is $ZZ'x = z_1z_1'x + \cdots + z_nz_n'x$. If the matrix $Z$ is not full rank, $P_Z$ can still be written as
$$P_Z = ZZ^+ = u_1u_1' + \cdots + u_ru_r'$$
where $u_1, \ldots, u_r$ are the first $r$ columns of the $U$ matrix in the SVD of $Z$ (and $r$ is the rank of $Z$).

Consider now the problem of finding a vector $w$ which minimizes $F(w) = \|c - Mw\|^2$ for a given vector $c$ and a matrix $M$. In other words, we are looking for the vector in the image space of $M$ (the space spanned by the columns of $M$) which is closest to $c$. Clearly, by the projection theorem, this is the orthogonal projection of $c$ onto the image of $M$. In particular, if $M$ is of full rank, then at the optimum $w$ we must have $M(M'M)^{-1}M'c = Mw$ or, equivalently, $w = (M'M)^{-1}M'c$. The Hessian of $F$ is $2M'M$; hence, if $M$ is full rank, the Hessian is positive definite and the problem is strictly convex without any other critical points.
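As a quick numerical illustration of this projection argument, the following NumPy sketch (the matrix $M$ and vector $c$ are arbitrary toy data, and the rank tolerance is an arbitrary choice) checks that the least-squares solution $w = (M'M)^{-1}M'c$ reproduces the orthogonal projection of $c$ onto the image of $M$, and that the same projector can be assembled from the left singular vectors, as in $P_Z = ZZ^+$:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 3))              # tall matrix, full column rank (a.s.)
c = rng.normal(size=6)

# Closed-form optimum for full-rank M: w = (M'M)^{-1} M'c
w = np.linalg.solve(M.T @ M, M.T @ c)

# M w is the orthogonal projection of c onto the image (column space) of M.
P_M = M @ np.linalg.inv(M.T @ M) @ M.T   # projector onto im(M)
assert np.allclose(M @ w, P_M @ c)

# For a rank-deficient matrix the projector can still be built from the SVD,
# P = u_1 u_1' + ... + u_r u_r' (cf. P_Z = Z Z^+ above).
U, s, Vt = np.linalg.svd(M, full_matrices=False)
r = int(np.sum(s > 1e-12))               # numerical rank (tolerance is arbitrary)
P_svd = U[:, :r] @ U[:, :r].T
assert np.allclose(P_svd, P_M)
```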
D. Least-Squares Regression

The problem of linear regression is the following. Given a set of $n$-dimensional input vectors $x_1, \ldots, x_T$ and a set of $m$-dimensional target vectors $y_1, \ldots, y_T$, find an $m \times n$ matrix $A$ which minimizes $E(A) = \langle\|y - Ax\|^2\rangle$. In other words, linear regression is exactly the usual learning problem in a linear network without any hidden units. Since the output units are completely uncoupled, the connection weights for each of them can be synthesized separately and therefore one needs only to consider the case $m = 1$, where we write $A = a'$. In this case, the problem has a simple geometrical interpretation: find a hyperplane through the origin in $(n+1)$-dimensional space which best fits (in the least-squares sense) a cloud of $T$ points with coordinates $(x_1', y_1)', \ldots, (x_T', y_T)'$. Now
$$E(a) = \langle(y - a'x)^2\rangle = a'\Sigma_{xx}a - 2\Sigma_{yx}a + \langle y^2\rangle$$
and the gradient of $E$ with respect to $a$ is
$$\nabla E = 2\Sigma_{xx}a - 2\Sigma_{xy}.$$
$E$ is continuous, differentiable, and bounded below by zero and therefore it must reach its minimum for a vector $a$ satisfying $\Sigma_{xx}a = \Sigma_{xy}$. If $\Sigma_{xx}$ is positive definite, then there is a unique solution given by
$$a = \Sigma_{xx}^{-1}\Sigma_{xy} \tag{2}$$
and, in addition, $E$ is strictly convex (with Hessian $2\Sigma_{xx}$) and so without any local minima (or even without any other critical point). The landscape is therefore as simple as possible, and this remains true even if some of the connections are forced in advance to take some fixed values, typically zero in the case of "local" connectivity (this introduces linear, thus convex, restrictions on the set of possible weights). When $m > 1$, everything goes through mutatis mutandis. In the case where $\Sigma_{xx}$ is positive definite, the unique optimal $A$ is called the slope matrix of the regression of $y$ on $x$ and is given by
$$A = \Sigma_{yx}\Sigma_{xx}^{-1}$$
which generalizes (2), taking into account that $A = a'$ in one dimension. (Formally, to reduce the $m$-dimensional case it is sufficient to notice that $E$ can be rewritten as $E(A) = \|\mathrm{vec}(Y) - (X' \otimes I)\,\mathrm{vec}(A)\|^2/T$, where $\otimes$ denotes the Kronecker product of two matrices and "vec" is an operator which transforms a matrix into a vector by stacking its columns one above the other. See [6] for details.)
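The slope-matrix formula can be checked on synthetic data. The sketch below (data and variable names are illustrative only) compares $A = \Sigma_{yx}\Sigma_{xx}^{-1}$, computed from empirical covariance matrices, with a direct least-squares fit of $Y \approx AX$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, T = 4, 2, 500
A_true = rng.normal(size=(m, n))                 # "ground-truth" slope matrix
X = rng.normal(size=(n, T))                      # columns are the inputs x_t
Y = A_true @ X + 0.1 * rng.normal(size=(m, T))   # noisy targets y_t

# Empirical (uncentered) covariance matrices as used in the text
Sigma_xx = X @ X.T / T                           # <x x'>
Sigma_yx = Y @ X.T / T                           # <y x'>

A_cov = Sigma_yx @ np.linalg.inv(Sigma_xx)       # slope matrix of the regression

# Equivalent direct least-squares solution of min_A <||y - A x||^2>
A_lsq = np.linalg.lstsq(X.T, Y.T, rcond=None)[0].T
assert np.allclose(A_cov, A_lsq)
```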
In particular, even if the connectivity between the input and the output is local, the problem remains convex and without local minima and therefore, in principle, easy to learn by a gradient-descent type of mechanism. Finally, notice that if for an input $x_t$ we approximate the corresponding output pattern $y_t$ by its linear estimate $\hat{y}_t = \Sigma_{yx}\Sigma_{xx}^{-1}x_t$, then the covariance matrix of the estimates is given by $\hat{\Sigma} = \langle\hat{y}\hat{y}'\rangle = \Sigma_{yx}\Sigma_{xx}^{-1}\Sigma_{xy}$.
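Both the vec/Kronecker rewriting of $E(A)$ quoted in the parenthetical above and the expression for the covariance of the estimates are algebraic identities, and a toy-sized numerical check (notation mine, arbitrary random data) confirms them:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, T = 3, 2, 50
A = rng.normal(size=(m, n))
X = rng.normal(size=(n, T))                  # columns are the x_t
Y = rng.normal(size=(m, T))                  # columns are the y_t

def vec(M):
    return M.reshape(-1, order="F")          # stack columns one above the other

# (i) the vec/Kronecker rewriting of the quadratic error
E_direct = np.sum((Y - A @ X) ** 2) / T      # <||y - A x||^2>
E_kron = np.sum((vec(Y) - np.kron(X.T, np.eye(m)) @ vec(A)) ** 2) / T
assert np.isclose(E_direct, E_kron)

# (ii) covariance of the linear estimates y_hat_t = S_yx S_xx^{-1} x_t
S_xx, S_yx = X @ X.T / T, Y @ X.T / T
Y_hat = S_yx @ np.linalg.inv(S_xx) @ X
assert np.allclose(Y_hat @ Y_hat.T / T, S_yx @ np.linalg.inv(S_xx) @ S_yx.T)
```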
E. Principal Component Analysis

Suppose we are given a collection of $T$ objects. For each object $x_t$, the measurements of the same $n$ characteristics $x_{1,t}, \ldots, x_{n,t}$ are available. Assume it is desired to extract some "structure" or "main features" from this collection of data. For efficient classification, it is obviously useful to compress the high-dimensional input data into something low dimensional without discarding too much relevant information. Of course, there are several different techniques for feature extraction and data compression. One of the simplest and most general-purpose ones is a statistical method known as principal component analysis (PCA).

By possibly subtracting the average $\langle x\rangle$, we can think of the data set $x_{i,t}$ ($1 \le i \le n$, $1 \le t \le T$) as a cloud of $T$ points in $n$-dimensional Euclidean space centered around the origin. To capture the main features of the data set, PCA is looking for directions along which the dispersion or variance of the point cloud is maximal, that is, looking for a subspace $\mathcal{C}$ such that the projection of the points $x_t$ onto $\mathcal{C}$ has maximal variance. If $\mathcal{C}$ is the line spanned by the unit vector $a$, the projection $P_{\mathcal{C}}x$ is given by $P_a x = aa'x$, with squared length $\|P_a x\|^2 = (a'x)^2 = a'xx'a$. Hence, the average dispersion of the data set in the direction of the line is $\langle\|P_a x\|^2\rangle = \langle a'xx'a\rangle = a'\langle xx'\rangle a = a'\Sigma_{xx}a$, where $\Sigma_{xx} = \langle xx'\rangle$ is the data covariance matrix. PCA looks for a unit vector $a^*$ which maximizes $a'\Sigma_{xx}a$ over the set of all unit vectors. If $\lambda_1 > \cdots > \lambda_n > 0$ are the eigenvalues of $\Sigma_{xx}$ with eigenvectors $u_1, \ldots, u_n$, then, by the previous result on quadratic forms in Section II-A, we know that $a^* = u_1$ (or equivalently $-u_1$) is the answer.

To sum up, PCA starts by finding the direction in which the dispersion of the cloud is maximal, which is the direction $u_1$ of the first eigenvector of the data covariance matrix. The first "feature" which is extracted is the first principal component $u_1'x_t$. The component of the data "explained" by the first principal component is the projection onto the line spanned by $u_1$. What remains unexplained is the dispersion of the residual $x_t - P_{u_1}x_t$, which is just the projection $Q_{u_1}x_t$ of $x_t$ onto the orthogonal complement of $u_1$. In a second step, we proceed as before, but with the points $x_t$ replaced by $Q_{u_1}x_t$. That is, we look for straight lines $\mathcal{C}$ perpendicular to the line spanned by $u_1$ such that the projections of the points $Q_{u_1}x_t$ onto $\mathcal{C}$ have maximal variance. This amounts to finding a unit vector $b^*$, perpendicular to $u_1$, which maximizes $b'\Sigma_{xx}b$ over all unit ...
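A small synthetic example may help fix ideas. The sketch below (data, sizes, and variable names are illustrative only) builds an anisotropic point cloud, forms the empirical covariance matrix, and verifies that its top eigenvector $u_1$ maximizes $a'\Sigma_{xx}a$ over unit vectors, yielding the first principal component $u_1'x_t$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, T = 5, 2000
stretch = np.diag([5.0, 2.0, 1.0, 0.5, 0.1])   # one clearly dominant direction
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))   # random orthogonal basis
X = Q @ stretch @ rng.normal(size=(n, T))      # columns are the points x_t
X = X - X.mean(axis=1, keepdims=True)          # center the cloud, as in the text

Sigma_xx = X @ X.T / T                         # data covariance matrix <x x'>
eigvals, eigvecs = np.linalg.eigh(Sigma_xx)    # eigenvalues in ascending order
u1 = eigvecs[:, -1]                            # eigenvector of the largest eigenvalue

# u1 attains the maximum of a' Sigma_xx a over (randomly sampled) unit vectors a
for _ in range(100):
    a = rng.normal(size=n)
    a /= np.linalg.norm(a)
    assert a @ Sigma_xx @ a <= u1 @ Sigma_xx @ u1 + 1e-12

first_pc = u1 @ X                              # first principal component u1' x_t
```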
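The deflation step can be sketched in the same way (again on synthetic data, with illustrative variable names): projecting the points onto the orthogonal complement of $u_1$ and maximizing the variance of the residual cloud recovers the second eigenvector $u_2$ of the original covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(4)
n, T = 5, 3000
X = np.diag([4.0, 3.0, 2.0, 1.0, 0.5]) @ rng.normal(size=(n, T))
X = X - X.mean(axis=1, keepdims=True)          # centered cloud, columns x_t

Sigma_xx = X @ X.T / T
_, U = np.linalg.eigh(Sigma_xx)                # eigenvectors, ascending eigenvalues
u1, u2 = U[:, -1], U[:, -2]                    # first and second eigenvectors

Q_u1 = np.eye(n) - np.outer(u1, u1)            # projector onto the complement of u1
X_resid = Q_u1 @ X                             # residuals Q_{u1} x_t

Sigma_resid = X_resid @ X_resid.T / T
_, V = np.linalg.eigh(Sigma_resid)
b_star = V[:, -1]                              # direction of maximal residual variance

assert np.isclose(abs(b_star @ u2), 1.0)       # b* = ±u2 ...
assert abs(b_star @ u1) < 1e-8                 # ... and b* is perpendicular to u1
```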