IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 6, NO. 4, JULY 1995

Learning in Linear Neural Networks: A Survey

Pierre F. Baldi and Kurt Hornik, Member, IEEE

Abstract— Networks of linear units are the simplest kind of networks, where the basic questions related to learning, generalization, and self-organization can sometimes be answered analytically. We survey most of the known results on linear networks, including: 1) backpropagation learning and the structure of the error function landscape, 2) the temporal evolution of generalization, and 3) unsupervised learning algorithms and their properties. The connections to classical statistical ideas, such as principal component analysis (PCA), are emphasized as well as several simple but challenging open questions. A few new results are also spread across the paper, including an analysis of the effect of noise on backpropagation networks and a unified view of all unsupervised algorithms.

Manuscript received September 25, 1992; revised June 19, 1994. This work was supported in part by grants from NSF, AFOSR, and ONR. P. F. Baldi is with the Jet Propulsion Laboratory, and Division of Biology, California Institute of Technology, Pasadena, CA 91109 USA. K. Hornik is with the Institut für Statistik und Wahrscheinlichkeitstheorie, Technische Universität Wien, A-1040 Vienna, Austria. IEEE Log Number 9409158.

I. INTRODUCTION

This paper addresses the problems of supervised and unsupervised learning in layered networks of linear units and, together with a few new results, reviews most of the recent literature on the subject. One may expect the topic to be fairly restricted, yet it is in fact quite rich and far from being exhausted. Since the first approximations of biological neurons using threshold gates [1], the nonlinear aspects of neural computations and hardware have often been emphasized and linear networks dismissed as uninteresting for being able to express linear input-output maps only. Furthermore, multiple layers of linear units can always be collapsed by multiplying the corresponding weight matrices. So why bother? Nonlinear computations are obviously extremely important, but these arguments should be considered as very suspicious; by stressing the input-output relations only, they miss the subtle problems of dynamics, structure, and organization that normally arise during learning and plasticity, even in simple linear systems. There are other reasons why linear networks deserve careful attention. General results in the nonlinear case are often absent or difficult to derive analytically, whereas the linear case can often be analyzed in mathematical detail. As in the theory of differential equations, the linear setting should be regarded as the first simple case to be studied. More complex situations can often be investigated by linearization, although this has not been attempted systematically in neural networks, for instance in the analysis of backpropagation learning. In backpropagation, learning is often started with zero or small random initial weights and biases. Thus, at least during the initial phase of training, the network is operating in its linear regime. Even when training is completed, one often finds several units in the network which are operating in their linear range.

Fig. 1. The basic network in the autoassociative case (m = n): n input units, p hidden units, and n output units, with connection matrices A (input to hidden) and B (hidden to output).
From the standpoint of theoretical biology, it has been argued that certain classes of neurons may be operating most of the time in a linear or quasi-linear regime, and linear input-output relations seem to hold for certain specific biological circuits (see [2] for an example). Finally, the study of linear networks leads to new interesting questions, insights, and paradigms which could not have been guessed in advance, and to new ways of looking at certain classical statistical techniques.

To begin with, we shall consider a linear network with an n-p-m architecture comprising one input layer, one hidden layer, and one output layer with n, p, and m units, respectively (Fig. 1). The more general case, with, for instance, multiple hidden layers, can be reduced to this simple setting, as we shall see. A will usually denote the p × n matrix connecting the input to the middle layer and B the m × p matrix of connection weights from the middle layer to the output. Thus, for instance, b_ij represents the strength of the coupling between the jth hidden unit and the ith output unit (double indexes are always in the post-presynaptic order). The network therefore computes the linear function y = BAx. In the usual learning from examples setting, we assume that a set of n-dimensional input patterns x_t (1 ≤ t ≤ T) is given together with a corresponding set of m-dimensional target output patterns y_t (1 ≤ t ≤ T) (all vectors are assumed to be column vectors). X = [x_1, ..., x_T] and Y = [y_1, ..., y_T] are the n × T and m × T matrices having the patterns as their columns. Because of the need for target outputs, this form of learning will also be called supervised. For simplicity, unless otherwise stated, all the patterns are assumed to be centered (i.e., ⟨x⟩ = ⟨y⟩ = 0). The symbol "⟨·⟩" will be used for averages over the set of patterns or sometimes over the pattern distribution, depending on the context. The approximation of one by the other is a central problem in statistics, but is not our main concern here. The environment is supposed to be stationary, but the results could be extended to a slowly varying environment to deal with plasticity issues. Throughout this paper, learning will often be based on the minimization of an error function E depending on the synaptic weights.
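As a concrete illustration of this setup (an added sketch, not part of the original paper), the following numpy fragment builds a random n-p-m linear network on synthetic centered patterns and evaluates the average squared error ⟨‖y_t − BAx_t‖²⟩ that is introduced as the backpropagation error function E(A, B) in (1) below. All dimensions and data are arbitrary placeholders.

import numpy as np

# Hypothetical dimensions and synthetic centered data (placeholders, not from the paper).
n, p, m, T = 8, 3, 5, 200
rng = np.random.default_rng(0)
X = rng.standard_normal((n, T))
X -= X.mean(axis=1, keepdims=True)           # center the input patterns
Y = rng.standard_normal((m, n)) @ X          # targets generated by some linear map, for illustration
Y -= Y.mean(axis=1, keepdims=True)

A = 0.1 * rng.standard_normal((p, n))        # p x n input-to-hidden weights
B = 0.1 * rng.standard_normal((m, p))        # m x p hidden-to-output weights

def error(A, B, X, Y):
    """E(A, B) = <||y_t - B A x_t||^2>, averaged over the T patterns."""
    R = Y - B @ (A @ X)
    return np.mean(np.sum(R**2, axis=0))

print(error(A, B, X, Y))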
In the main case of backpropagation, the error function is

E(A, B) = ⟨‖y_t − BAx_t‖²⟩   (1)

where ‖u‖ represents the Euclidean norm of the vector u. When no target outputs are provided, the learning (which then must be based on criteria to be specified, such as the maximization of the output variance) is unsupervised. An important special case of unsupervised learning is the case of autoassociation, when the input is used as a teacher (i.e., y_t = x_t). This is also called autoencoding or identity mapping in the literature.

Learning rules are algorithms for slowly altering the connection weights to achieve a desirable goal such as the minimization of an error function. Often, three different versions of the same rule have been given: the "on-line" version where the modification is calculated after the presentation of each pattern, the "off-line" version where the previous modifications are averaged over the cycle of all patterns, and the "continuous" version where the discrete changes induced by the "off-line" algorithm are approximated continuously by a differential equation governing the evolution of the weights in time. In some cases, the three formulations can be shown to lead to essentially the same results.

It will be convenient to use the notation Σ_uv = ⟨uv′⟩, where the prime denotes transposition of matrices. If both ⟨u⟩ and ⟨v⟩ are zero, Σ_uv is the covariance matrix of u and v. Σ_xx, for instance, is a real n × n symmetric nonnegative definite matrix. Hence, its eigenvalues can be ordered as λ_1 ≥ ... ≥ λ_n ≥ 0. For mathematical simplicity, we shall often assume that in fact λ_1 > ... > λ_n > 0. This should not be regarded as a very restrictive assumption, since this condition can always be enforced by, at worst, perturbing the data by infinitesimal amounts and attributing these perturbations to "noise." Many conclusions are only slightly different when some eigenvalues coincide.

A good familiarity with linear algebra and basic calculus on the part of the reader should be sufficient to follow the paper. All the statistical techniques required to understand some of the results are briefly reviewed in the second section. These include least-squares regression, principal component analysis, and discriminant analysis. In Section III, we treat the case of supervised learning with backpropagation and the corresponding autoassociative special case. We study the landscape of the error function E of (1), its connections to the previously mentioned statistical techniques, and several consequences and generalizations, including noisy and deep networks. In Section IV, we study the problems of validation, generalization, and overfitting in a simple one-layer network trained to learn the identity map. Under some assumptions, we give a complete description of the evolution of the validation error as a function of training time. Section V covers a variety of unsupervised learning algorithms, based on variance maximization/minimization by Hebbian or anti-Hebbian learning or other error functions. Some of the more technical proofs are deferred to the Appendix.

II. MATHEMATICAL BACKGROUND

A. Optimization of Quadratic Forms over Spheres

Let S be a symmetric n × n matrix. Then all the eigenvalues λ_i of S are real and can be ordered in the form λ_1 ≥ λ_2 ≥ ... ≥ λ_n with corresponding normalized eigenvectors u_1, ..., u_n.
Consider the problem of maximizing the quadratic form E(a) = a′Sa over the sphere of radius ρ centered at the origin (‖a‖ ≤ ρ). In geometry, it is well known (see, for instance, [3]) that the maximum of E is then reached on the surface of the sphere in the direction of the first eigenvector, that is, at the points ±ρu_1, where E(±ρu_1) = λ_1ρ². If λ_1 > λ_2, ±ρu_1 are the only two solutions. Similarly, the maximum of E over the intersection of the sphere with the linear space orthogonal to u_1 is reached at ±ρu_2, and so forth. Finally, the minimum of E over the entire sphere is obtained at ±ρu_n. All these properties are easily derived by decomposing a as a = Σ_i α_i u_i and noticing that E(a) = Σ_i λ_i α_i².

B. Singular Value Decomposition

Let Z be an arbitrary k × l matrix with rank r. Then there exist numbers σ_1 ≥ ... ≥ σ_r > 0, the singular values of Z, an orthogonal k × k matrix U, and an orthogonal l × l matrix V such that S = U′ZV is a k × l diagonal matrix of the form

S = ( D  0
      0  0 )

where D = diag(σ_1, ..., σ_r) is the diagonal matrix with entries σ_1, ..., σ_r. The decomposition

Z = USV′

is called the singular value decomposition (SVD) of Z (it is not necessarily unique). The matrices U and V in the SVD have the following meaning. As Z′ZV = VS′U′USV′V = VS′S = V diag(σ_1², ..., σ_r², 0, ..., 0), the columns of V are unit-length, mutually perpendicular eigenvectors of Z′Z, and σ_1², ..., σ_r² are the nonzero eigenvalues of Z′Z. Similarly, the columns of U are unit-length, mutually perpendicular eigenvectors of ZZ′. With the aid of the SVD, the pseudoinverse of Z can easily be given explicitly. If we write

S⁺ = ( D⁻¹  0
       0    0 )

then Z⁺ = VS⁺U′ is the pseudoinverse of Z (see, for instance, [4] and [5] for more details on SVD and pseudoinverses).
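These relations are easy to check numerically. The short sketch below (an added illustration, not from the paper) forms the SVD of a random rank-deficient matrix, assembles the pseudoinverse as VS⁺U′, and compares it with numpy's built-in pinv; the sizes are arbitrary.

import numpy as np

rng = np.random.default_rng(1)
k, ell, r = 6, 4, 2                                 # arbitrary sizes, rank r < min(k, ell)
Z = rng.standard_normal((k, r)) @ rng.standard_normal((r, ell))   # rank-r matrix

U, s, Vt = np.linalg.svd(Z)                         # Z = U S V', with S built from s
S_plus = np.zeros((ell, k))
S_plus[:r, :r] = np.diag(1.0 / s[:r])               # invert only the nonzero singular values
Z_plus = Vt.T @ S_plus @ U.T                        # pseudoinverse V S+ U'

print(np.allclose(Z_plus, np.linalg.pinv(Z)))                       # True
print(np.allclose(Z.T @ Z @ Vt.T, Vt.T @ np.diag(s**2)))            # columns of V are eigenvectors of Z'Z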
C. Orthogonal Projections

If L is a linear subspace, we denote by P_L x the orthogonal projection of a vector x onto L and by P_{L⊥} x = Q_L x = x − P_L x its projection onto the orthogonal complement of L. If L has dimension l and is spanned by the linearly independent vectors z_1, ..., z_l, then P_L x = P_Z x, where Z = [z_1, ..., z_l] and P_Z = Z(Z′Z)⁻¹Z′. In particular, if the vectors z_i are mutually perpendicular unit vectors, the projection of x is simply ZZ′x = z_1 z_1′x + ... + z_l z_l′x. If the matrix Z is not full rank, P_Z can still be written as

P_Z = ZZ⁺ = u_1 u_1′ + ... + u_r u_r′

where u_1, ..., u_r are the first r columns of the U matrix in the SVD of Z (and r is the rank of Z).

Consider now the problem of finding a vector w which minimizes F(w) = ‖c − Mw‖² for a given vector c and a matrix M. In other words, we are looking for the vector in the image space of M (the space spanned by the columns of M) which is closest to c. Clearly, by the projection theorem, this is the orthogonal projection of c onto the image of M. In particular, if M is of full rank, then at the optimum w we must have M(M′M)⁻¹M′c = Mw or, equivalently, w = (M′M)⁻¹M′c. The Hessian of F is 2M′M; hence, if M is full rank, the Hessian is positive definite and the problem is strictly convex without any other critical points.

D. Least-Squares Regression

The problem of linear regression is the following. Given a set of n-dimensional input vectors x_1, ..., x_T and a set of m-dimensional target vectors y_1, ..., y_T, find an m × n matrix A which minimizes E(A) = ⟨‖y − Ax‖²⟩. In other words, linear regression is exactly the usual learning problem in a linear network without any hidden units. Since the output units are completely uncoupled, the connection weights for each of them can be synthesized separately and therefore one needs only to consider the case m = 1, where we write A = a′. In this case, the problem has a simple geometrical interpretation: find a hyperplane through the origin in (n + 1)-dimensional space which best fits (in the least-squares sense) a cloud of T points with coordinates (x_1′, y_1)′, ..., (x_T′, y_T)′. Now

E(a) = ⟨(y − a′x)²⟩ = a′Σ_xx a − 2Σ_yx a + ⟨y²⟩

and the gradient of E with respect to a is

∇E = 2Σ_xx a − 2Σ_xy.

E is continuous, differentiable, and bounded below by zero and therefore it must reach its minimum for a vector a satisfying Σ_xx a = Σ_xy. If Σ_xx is positive definite, then there is a unique solution given by

a = Σ_xx⁻¹ Σ_xy   (2)

and, in addition, E is strictly convex (with Hessian 2Σ_xx) and so without any local minima (or even without any other critical point). The landscape is therefore as simple as possible, and this remains true even if some of the connections are forced in advance to take some fixed values, typically zero in the case of "local" connectivity (this introduces linear, thus convex, restrictions on the set of possible weights). When m > 1, everything goes through mutatis mutandis. In the case where Σ_xx is positive definite, the unique optimal A is called the slope matrix of the regression of y on x and is given by

A = Σ_yx Σ_xx⁻¹

which generalizes (2), taking into account that A = a′ in one dimension. (Formally, to reduce the m-dimensional case to the one-dimensional one, it is sufficient to notice that E can be rewritten as E(A) = ‖vec(Y) − (X′ ⊗ I) vec(A)‖²/T, where ⊗ denotes the Kronecker product of two matrices and "vec" is an operator which transforms a matrix into a vector by stacking its columns one above the other. See [6] for details.) In particular, even if the connectivity between the input and the output is local, the problem remains convex and without local minima and therefore, in principle, easy to learn by a gradient-descent type of mechanism. Finally, notice that if for an input x_t we approximate the corresponding output pattern y_t by its linear estimate ŷ_t = Σ_yx Σ_xx⁻¹ x_t, then the covariance matrix of the estimates is given by Σ̂ = ⟨ŷŷ′⟩ = Σ_yx Σ_xx⁻¹ Σ_xy.
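To make the slope-matrix formula concrete, here is a small numpy sketch (added for illustration; the data are synthetic placeholders) that estimates A = Σ_yx Σ_xx⁻¹ from centered samples and checks it against an ordinary least-squares solver.

import numpy as np

rng = np.random.default_rng(2)
n, m, T = 6, 3, 500
X = rng.standard_normal((n, T))
X -= X.mean(axis=1, keepdims=True)
A_true = rng.standard_normal((m, n))
Y = A_true @ X + 0.1 * rng.standard_normal((m, T))    # noisy linear targets
Y -= Y.mean(axis=1, keepdims=True)

Sxx = X @ X.T / T                                     # sample covariance Σ_xx
Syx = Y @ X.T / T                                     # sample cross-covariance Σ_yx
A_slope = Syx @ np.linalg.inv(Sxx)                    # slope matrix Σ_yx Σ_xx⁻¹

# The same answer comes out of a generic least-squares fit of Y ≈ A X.
A_lstsq = np.linalg.lstsq(X.T, Y.T, rcond=None)[0].T
print(np.allclose(A_slope, A_lstsq))                  # True (up to numerical error)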
E. Principal Component Analysis

Suppose we are given a collection of T objects. For each object x_t, the measurements of the same n characteristics x_{1,t}, ..., x_{n,t} are available. Assume it is desired to extract some "structure" or "main features" from this collection of data. For efficient classification, it is obviously useful to compress the high-dimensional input data into something low dimensional without discarding too much relevant information. Of course, there are several different techniques for feature extraction and data compression. One of the simplest and most general-purpose ones is a statistical method known as principal component analysis (PCA).

By possibly subtracting the average ⟨x⟩, we can think of the data set x_{i,t} (1 ≤ i ≤ n, 1 ≤ t ≤ T) as a cloud of T points in n-dimensional Euclidean space centered around the origin. To capture the main features of the data set, PCA is looking for directions along which the dispersion or variance of the point cloud is maximal, that is, looking for a subspace L such that the projection of the points x_t onto L has maximal variance. If L is the line spanned by the unit vector a, the projection P_L x is given by P_a x = aa′x, with squared length ‖P_a x‖² = (a′x)² = a′xx′a. Hence, the average dispersion of the data set in the direction of the line is ⟨‖P_a x‖²⟩ = ⟨a′xx′a⟩ = a′⟨xx′⟩a = a′Σ_xx a, where Σ_xx = ⟨xx′⟩ is the data covariance matrix. PCA looks for a unit vector a* which maximizes a′Σ_xx a over the set of all unit vectors. If λ_1 > ... > λ_n > 0 are the eigenvalues of Σ_xx with eigenvectors u_1, ..., u_n, then, by the previous result on quadratic forms in Section II-A, we know that a* = u_1 (or equivalently −u_1) is the answer.

To sum up, PCA starts by finding the direction in which the dispersion of the cloud is maximal, which is the direction u_1 of the first eigenvector of the data covariance matrix. The first "feature" which is extracted is the first principal component u_1′x_t. The component of the data "explained" by the first principal component is the projection onto the line spanned by u_1. What remains unexplained is the dispersion of the residual x_t − P_{u_1}x_t, which is just the projection Q_{u_1}x_t of x_t onto the orthogonal complement of u_1. In a second step, we proceed as before, but with the points x_t replaced by Q_{u_1}x_t. That is, we look for straight lines L perpendicular to the line spanned by u_1 such that the projections of the points Q_{u_1}x_t have maximal variance. This amounts to finding a unit vector b*, perpendicular to u_1, which maximizes b′Σ_xx b over all unit vectors perpendicular to u_1.
Again, by the previous result, we know the answer is b* = ±u_2, and so forth. At the kth step, we look for lines L_k perpendicular to the space spanned by u_1, ..., u_{k−1} such that the projections of the points x_t along L_k have maximal dispersion. This is achieved by choosing L_k as the line spanned by u_k.

After the completion of p steps, we extract the first p principal components u_1′x_t, ..., u_p′x_t and reduce x_t to its projection onto the hyperplane spanned by the first p eigenvectors. One may be interested in asking whether this is the best possible data reduction of the kind under consideration, that is, the best possible projection of the data onto a p-dimensional hyperplane H in the sense that the projections of the data onto H have maximal variance. After all, a better result might have been achieved by choosing the hyperplane in a single step. This, however, is not the case.

Among all p-dimensional hyperplanes H, the one spanned by the first p principal vectors u_1, ..., u_p is the hyperplane such that ⟨‖P_H x‖²⟩ is maximal. Equivalently, it is the hyperplane H which minimizes the average projection error ⟨‖x − P_H x‖²⟩.

It is therefore possible to incrementally build the PCA feature extractor. Since H is the best p-dimensional hyperplane we can fit to the n-dimensional point cloud, the "flatter" the cloud the better the fit. It is worth investigating how good the fit is, that is, how much of the variance in the data set actually is explained by the first p principal components. This is easily computed, for the variance of the ith component is given by ⟨(u_i′x)²⟩ = u_i′Σ_xx u_i = λ_i. The total variance being equal to the sum of all the eigenvalues of Σ_xx, the proportion of total variance explained by the first p principal components equals (λ_1 + ... + λ_p)/(λ_1 + ... + λ_n).

In fact, PCA performs "best data compression" among a wider class of methods. Let us write U_p = [u_1, ..., u_p] for the matrix having the first p normalized eigenvectors of Σ_xx as its columns and let us stack the first p features u_1′x_t, ..., u_p′x_t extracted by PCA into a column vector z_t. Then z_t = U_p′x_t and P_{U_p}x_t = U_p U_p′x_t = U_p z_t. Hence, PCA is one method that linearly compresses n-dimensional inputs x_t into p-dimensional vectors z_t for some p < n, that is, z_t = Ax_t for a suitable p × n matrix A. Linear reconstruction of the data can then be achieved by approximating x_t by Bz_t = BAx_t for some suitable n × p matrix B.

Among all p × n matrices A and n × p matrices B, optimal linear data compression (in the sense that the average reconstruction error ⟨‖x − BAx‖²⟩ is minimized) is achieved if and only if the global map W = BA equals the orthogonal projection P_{U_p} onto the hyperplane spanned by the first p eigenvectors of Σ_xx.

Finally, computing the covariance of two principal components gives, for i ≠ j,

⟨(u_i′x)(u_j′x)⟩ = ⟨u_i′xx′u_j⟩ = u_i′Σ_xx u_j = λ_j u_i′u_j = 0.

Thus different components are uncorrelated, and we can think of the transformation of x_t into the vector of n principal components u_1′x_t, ..., u_n′x_t as an orthogonal transformation of the Euclidean space, such that in the new system of coordinates, the components of the points in the cloud are uncorrelated and with decreasing variance. Again, if only the first few coordinates in the new system vary significantly, we may approximately locate points by giving only these few coordinates.

PCA can also be examined from an information-theoretic standpoint and shown to be optimal, under simple assumptions, for a different measure. More precisely, consider a transmission channel (in our case, one can think of the network connecting the input units to the hidden units) with n-dimensional centered input vectors x having a Gaussian distribution with covariance matrix Σ_xx = ⟨xx′⟩. The outputs of the channel are constrained to be p-dimensional vectors of the form y = Lx, for some p × n matrix L (and, without any loss of generality, we can assume that L has rank p, p < n). The conditional distribution of x given y is again Gaussian, with entropy H(x|y) = ½ log((2πe)^{n−p} ν_1 ··· ν_{n−p}), where ν_1 ≥ ... ≥ ν_{n−p} > 0 are the nonzero eigenvalues of Σ_xx·y. As the entropy is one way of measuring our uncertainty, it is desirable to choose L so as to minimize H(x|y). One can show that the optimal L is of the form L = CU_p′, where C is an invertible p × p matrix and U_p = [u_1, ..., u_p]. In particular, this choice also maximizes the information that y conveys about x, measured by the mutual information I(x, y), defined to be

I(x, y) = H(x) − H(x|y)

with value

I_PCA(x, y) = ½ log((2πe)^p λ_1 ··· λ_p).

Thus, at least in the Gaussian setting, up to trivial transformations, the optimal linear map maximizing the mutual information is the principal component analyzer. Finally, PCA can also be connected to optimal inference methods (see [9]).
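The compression and reconstruction view of PCA sketched above can be illustrated in a few lines of numpy (an added sketch on synthetic placeholder data, not part of the original): the inputs are projected onto the leading p eigenvectors of the sample covariance matrix, reconstructed as U_p U_p′x, and the explained-variance ratio (λ_1 + ... + λ_p)/(λ_1 + ... + λ_n) is reported.

import numpy as np

rng = np.random.default_rng(3)
n, p, T = 5, 2, 1000
# Synthetic centered data with a dominant two-dimensional structure (placeholder example).
X = rng.standard_normal((n, n)) @ np.diag([3.0, 2.0, 0.5, 0.3, 0.1]) @ rng.standard_normal((n, T))
X -= X.mean(axis=1, keepdims=True)

Sxx = X @ X.T / T
lam, U = np.linalg.eigh(Sxx)                  # eigenvalues in ascending order
lam, U = lam[::-1], U[:, ::-1]                # reorder so lam[0] >= lam[1] >= ...
Up = U[:, :p]

Z = Up.T @ X                                  # p-dimensional codes z_t = U_p' x_t
X_hat = Up @ Z                                # reconstruction P_{U_p} x_t = U_p U_p' x_t

explained = lam[:p].sum() / lam.sum()
mse = np.mean(np.sum((X - X_hat) ** 2, axis=0))
print(explained, mse, lam[p:].sum())          # mse equals the sum of the discarded eigenvalues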
To illustrate the PCA feature extraction technique, consider the "open/closed book" data set in Mardia et al. [10, Table 1.2.1, p. 3f]. The data consist of the scores of T = 88 students on examinations in mechanics, vectors, algebra, analysis, and statistics (i.e., n = 5, where the first two exams were closed book and the other three were open book). For each exam, the best possible score was 100. It is found that the average score ⟨x⟩ equals (39.0, 50.6, 50.6, 46.7, 42.3)′ and that the eigenvalues of the covariance matrix Σ_xx are given by λ_1 = 679.2, λ_2 = 199.8, λ_3 = 102.6, λ_4 = 83.7, and λ_5 = 31.8. Hence, the first two principal components already explain 80% of the variance in the data (and 91% is achieved with the first three). The first two eigenvectors are u_1 = (0.51, 0.37, 0.35, 0.45, 0.53)′ and u_2 = (0.75, 0.21, −0.08, −0.30, −0.55)′. These findings can easily be interpreted. The authors conclude that "...the first principal component gives positive weight to all the variables and thus represents an average grade. On the other hand, the second principal component represents a contrast between the open-book and closed-book examinations..." For example, the scores and first two principal components of the two best students are (77, 82, 67, 67, 81), 66.4, and 6.4 and (63, 78, 80, 70, 81), 63.7, and −6.4. Even without looking at the individual test scores, by considering only the first two principal components one would conclude that the overall performances of the two students are very similar, but the first student did better on closed-book and the second one better on open-book exams.

In conclusion, PCA is optimal in the least-mean-square sense and can serve two purposes: data compression by projecting high-dimensional data into a lower-dimensional space, and feature extraction by revealing, through the principal components, relevant but unexpected structure hidden in the data (although an interpretation of these features in terms of the original variables may not always be straightforward).

F. Mean Square Classifier and Discriminant Analysis

Consider now the problem where the patterns x_t must be classified into m classes C_1, ..., C_m, with, in general, m < n. Thus for every input pattern x_t, there is a binary target output pattern y_t = (0, ..., 1, ..., 0)′, where y_{i,t} = 1 if and only if x_t belongs to C_i. One possible classification method consists in finding an m × n matrix L such that ⟨‖y − Lx‖²⟩ is minimal. Needless to say, this is a special case of least-squares regression and, as we have seen, under the usual assumptions the optimal L is given by L = Σ_yx Σ_xx⁻¹ and is called the mean-square classifier.

In many applications n is very large compared to m, and therefore it becomes useful to first reduce the dimensionality of the input data. One is thus led to find a linear subspace of dimension p such that, when projected onto this subspace, the patterns fall as much as possible into well-defined separated clusters facilitating the classification. This problem of finding an optimal projection is similar to the one encountered in PCA. Because of the clustering, however, a new measure must be introduced to compare different projections. Consider a projection z = C′x, where C is an n × p matrix. The total dispersion (variation) in the z-sample can be decomposed into the sum of within-class dispersions and between-class dispersions. When the x's are centered, the total dispersion is Σ_xx and the dispersion between classes can be shown to be Σ_xy Σ_yy⁻¹ Σ_yx. Upon projecting the patterns, the corresponding total and between-classes dispersions of the z patterns become C′Σ_xx C and C′Σ_xy Σ_yy⁻¹ Σ_yx C. A projection is optimal if the between-classes variation of the z's is as large as possible relative to the total variation. Different cost functions can be introduced at this stage. If the size of a variation matrix is measured by its determinant (the determinant of a matrix measures the volume of the image of a unit cube under the corresponding linear map), then we are led to the problem of finding an n × p matrix C maximizing the ratio

E(C) = det(C′Σ_xy Σ_yy⁻¹ Σ_yx C) / det(C′Σ_xx C).   (3)

The solution is well known.

All optimal matrices, the so-called discriminant analysis (DA) matrices, are of the form H_p R, where R is an arbitrary p × p invertible matrix and H_p has the first p eigenvectors of Σ_xx⁻¹ Σ_xy Σ_yy⁻¹ Σ_yx as its columns.

It is not easy to see what the solutions look like in general. There is one case, however, where all optimal solutions can easily be described.

When p = r = rank(Σ_xy), an n × p matrix C is a DA matrix if and only if the space spanned by the columns of C coincides with the space spanned by the rows of the mean-square classifier Σ_yx Σ_xx⁻¹.

See, for instance, Kshirsagar [8] for more details on DA.

III. BACKPROPAGATION

A. The Landscape Properties of E

We now consider the setting described in the Introduction, where the learning procedure is based on the minimization of the cost function E(A, B). A complete description of the landscape properties of E is given in Baldi and Hornik [6]. We shall briefly review the most salient features. E is best described in terms of its critical points, that is, the points where ∂E/∂a_ij = ∂E/∂b_ij = 0. It is first important to observe that if C is any p × p invertible matrix, then E(A, B) = E(CA, BC⁻¹). Therefore, at any point, E really depends on the global map W = BA rather than on A and B. For instance, there is an infinite family of pairs of matrices (A, B) corresponding to any critical point. Unlike the simple case of linear regression, however, W cannot be chosen arbitrarily: the network architecture constrains W to have at most rank p.
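The invariance E(A, B) = E(CA, BC⁻¹) is easy to verify numerically; the following added sketch (synthetic data, arbitrary dimensions, not from the paper) checks it for a random invertible C.

import numpy as np

rng = np.random.default_rng(4)
n, p, m, T = 6, 3, 4, 100
X = rng.standard_normal((n, T))
Y = rng.standard_normal((m, T))
A = rng.standard_normal((p, n))
B = rng.standard_normal((m, p))
C = rng.standard_normal((p, p)) + 3 * np.eye(p)      # almost surely invertible

def E(A, B):
    R = Y - B @ (A @ X)
    return np.mean(np.sum(R**2, axis=0))

print(np.isclose(E(A, B), E(C @ A, B @ np.linalg.inv(C))))   # True: E depends only on W = BA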
The remarkable property of the landscape of E is the absence of local minima in spite of the fact that E is not convex (nor is the set of all matrices of rank at most p). E is characterized by a unique global minimum (up to multiplication by a matrix C). All other critical points are saddle points. The structure of the critical points can be described completely. More precisely, assume for simplicity that p ≤ m ≤ n and that Σ = Σ_yx Σ_xx⁻¹ Σ_xy, the covariance matrix of the linear estimates ŷ_t (see Section II-D), is full rank with m distinct eigenvalues λ_1 > ... > λ_m and corresponding orthonormal eigenvectors u_1, ..., u_m. If I = {i_1, ..., i_p} with 1 ≤ i_1 < ... < i_p ≤ m is any ordered p-index set, let U_I denote the matrix [u_{i_1}, ..., u_{i_p}]. Then two full-rank matrices A and B define a critical point of E if and only if there exist an ordered p-index set I and an invertible p × p matrix C such that

A = CU_I′ Σ_yx Σ_xx⁻¹   (4)

B = U_I C⁻¹.   (5)

For such a critical point we have

W = P_{U_I} Σ_yx Σ_xx⁻¹   (6)

and

E(A, B) = trace(Σ_yy) − Σ_{i∈I} λ_i.

Therefore, a critical W of rank p is always the product of the ordinary least-squares regression matrix followed by an orthogonal projection onto the subspace spanned by p eigenvectors of Σ. The critical map W associated with the index set {1, ..., p} is the unique local and global minimum of E. The remaining (m choose p) − 1 p-index sets correspond to saddle points. All additional critical points defined by matrices A and B which are not of full rank are also saddle points and can be characterized in terms of orthogonal projections onto subspaces spanned by q eigenvectors, with q < p.

In the autoassociative case, (4)-(6) become

A = CU_I′   (7)

B = U_I C⁻¹   (8)

W = P_{U_I}   (9)

and therefore the unique locally and globally optimal map W is the orthogonal projection onto the space spanned by the first p eigenvectors of Σ_xx.

This analysis links backpropagation in linear networks to several classical statistical techniques. In particular, at the global minimum of E, if C = I_p then the activities in the hidden layer are given by u_1′ŷ_t, ..., u_p′ŷ_t, the principal components of the least-squares estimators ŷ_t (see, for instance, [8]). In the autoassociative mode, these activities are given by u_1′x_t, ..., u_p′x_t, and correspond to the coordinates of the vector x_t along the first p eigenvectors of Σ_xx, as in the usual PCA. In general, if the initial conditions are random, one should not expect the backpropagation algorithm to converge to an optimum satisfying C = I_p. In the autoassociative case, this means that the rows of the final A and u_1, ..., u_p will span the same space but A′ ≠ [u_1, ..., u_p]. Although at first sight this may seem a drawback, it must be regarded as a property leading to more robust networks. Indeed, in a physical implementation where the compressed version of the data in the hidden layer is to be sent to further processing layers, it may not be desirable that one of the units, extracting the principal component, has a variance much larger than the other units (it is known, for instance, that in the case of random symmetric matrices, λ_2 ≪ λ_1 almost always; see [11]). A more balanced strategy, where all the variances in the hidden layer are comparable, is by far preferable and is commonly observed in simulations.

Since the optimal solution can be expressed analytically, it can also be obtained effectively with numerical analysis techniques without resorting to any descent procedure. As pointed out in the Introduction, however, this is not the most relevant point of view here, where the emphasis is on the learning behavior and emergent organizational principles of simple adaptive networks.

One of the central issues in learning from examples is the problem of generalization, that is, how does the network perform when exposed to a pattern never seen previously? In this setting, a precise quantitative answer can be given to this question. For instance, in the autoassociative case, the distortion of a new pattern is given by its distance to the subspace generated by the first p eigenvectors of Σ_xx.
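The claim that a linear autoassociative network trained by gradient descent ends up implementing the projection onto the top p eigenvectors, while A itself need not equal U_p′, can be checked by simulation. The sketch below is an added illustration with arbitrary dimensions; the learning rate and iteration count are placeholder choices that may need tuning. It compares the learned global map BA with P_{U_p} = U_p U_p′.

import numpy as np

rng = np.random.default_rng(5)
n, p, T = 6, 2, 300
# Synthetic data with a clear spectral gap after the second eigenvalue.
X = np.diag([3.0, 2.5, 1.0, 0.8, 0.5, 0.3]) @ rng.standard_normal((n, T))
X -= X.mean(axis=1, keepdims=True)

A = 0.01 * rng.standard_normal((p, n))
B = 0.01 * rng.standard_normal((n, p))
lr = 0.001
for _ in range(20000):                        # full-batch gradient descent on E(A,B) = <||x - BAx||^2>
    R = X - B @ (A @ X)                       # residuals
    gB = -2.0 / T * R @ X.T @ A.T
    gA = -2.0 / T * B.T @ R @ X.T
    A -= lr * gA
    B -= lr * gB

lam, U = np.linalg.eigh(X @ X.T / T)
Up = U[:, ::-1][:, :p]                        # top-p eigenvectors of the sample covariance
print(np.linalg.norm(B @ A - Up @ Up.T))      # should be small: BA is close to the projection P_{U_p}
print(np.allclose(A, Up.T))                   # typically False: A itself is not U_p'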
In the special case where rank(Σ_yx) = r = p, Gallinari et al. [12] have shown that if an n-p-m architecture is trained to classify n-dimensional inputs into m (m < n) classes, then the corresponding network performs discriminant analysis in the sense that, for an optimal W = BA, A' is a DA matrix. In other words, under these assumptions, the projection realized by A' maximizes the ratio given in (3). In this context, however, either p = r = m, in which case the architecture is n-m-m and there is no bottleneck, or r < m, and then full classification into m categories is not supported by the available data and there is no proper data compression (only a filtering out of linear dependencies). In any case, all the network ever learns is to be a mean-square classifier, and this can be achieved without any hidden layer.

B. Deep Networks, Local Connectivity, Nonlinearities, and Bias

In Baldi [13], the case of deep networks with multiple hidden layers is briefly examined. It is easy to see that, in this case, the main constraint on the network comes from its bottleneck, that is, from the hidden layer with smallest size p (clearly, p could be attained in more than one hidden layer). Although the expression for the critical points may now become more involved, the main features of the landscape are unchanged: a multiplicity of saddle points, an absence of local minima, and a unique optimal input/output map satisfying (6) with I = {1, ..., p}.

The bottleneck layer imposes a rank restriction on the map computed by the network. Additional important constraints can be introduced on the geometry of the connections. Often connections are assumed to be local, in the sense that a unit in one layer receives projections only from a restricted subset of elements in the previous layer, for instance according to a Gaussian distribution. These geometrical constraints play an essential role in self-organizing maps and in several models of "linear" cortical development; see for instance Linsker [14]-[16] and Miller et al. [17]. These topics deserve separate treatment and will not be addressed here. As mentioned in the previous section, however, in the case of a locally connected linear network without any hidden layer, the landscape of the usual quadratic error is again completely devoid of local minima. Learning by descent methods should then be efficient.
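To make the rank restriction imposed by the bottleneck layer concrete, here is a minimal sketch (the layer widths are arbitrary illustrative choices): whatever the weights, the composite map of a deep linear network has rank at most the smallest layer width.

```python
import numpy as np

rng = np.random.default_rng(1)
widths = [10, 7, 3, 8, 10]   # layer sizes; the bottleneck has 3 units

# One random weight matrix per layer-to-layer connection.
Ws = [rng.standard_normal((widths[i + 1], widths[i])) for i in range(len(widths) - 1)]

# Composite input-output map of the deep linear network.
W = np.eye(widths[0])
for Wl in Ws:
    W = Wl @ W

print(W.shape, np.linalg.matrix_rank(W))   # (10, 10), rank at most 3
```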
The landscape properties of the LMS (least mean square) error function of a linear locally connected multilayer network have not been carefully studied yet, and the previous results only give lower bounds. In particular, the question whether the error function has any local minimum remains open despite its disarming simplicity.

In the case of nonlinear units, few analytical results are known, but certainly local minima do appear. An important remark, however, has been made by Bourlard and Kamp [18]. In the autoassociative mode, it is natural to use linear units in the output layer. Under these conditions, nothing is to be gained by using nonlinear elements in the hidden layer. This is basically because the network is trying to approximate a linear map: the identity function. This result can be extended to any linear map. That is, if the set of pairs (x_t, y_t) of examples is such that y_t = F(x_t) for every t with linear F, then nonlinear units in the hidden layer can lead to an approximation of F which is at best equivalent to the approximation obtainable by using linear units exclusively. Reports of simulations in the literature confirm this point and sometimes seem to indicate that the solution found using nonlinear elements is "close" to PCA (Cottrell et al. [19]).

Finally, if it is not desirable to assume the existence of a preprocessing stage where the data are centered, then the theory can easily be extended to the case of linear units with bias (see, for instance, [18] and [20] for more details).

C. Noise Analysis

How robust are the previous results against the effects of noise? Different sorts of noise can be introduced, for instance at the level of the synaptic weights or of the activation functions. To fix the ideas, assume in our case that the activation functions in both the hidden layer and the output layer are "noisy." Hence for an input x, the output of the hidden layer is w = Ax + n and the activity in the output units is z = Bw + e = BAx + Bn + e. Assume that the noise terms n and e have mean zero, covariance matrices Σ_nn and Σ_ee, and that they are uncorrelated with each other and with the patterns x and y. It is also reasonable to assume for simplicity that Σ_nn is of full rank. We are now interested in the problem of minimizing

Ê(A, B) = ⟨‖y − (BAx + Bn + e)‖²⟩ = E(A, B) + trace(B Σ_nn B') + trace(Σ_ee).   (10)

Observe that the term trace(Σ_ee) is just an additive constant and has no influence on the variation of Ê with A and B. For any positive μ, Ê(μA, B/μ) − Ê(A, B) = trace(B Σ_nn B')(1 − μ²)/μ². Thus, if B ≠ 0 and μ > 1, then Ê(μA, B/μ) < Ê(A, B). As a result, without any additional constraints, there is no pair (A, B) which minimizes Ê. This is intuitively clear, as the network will try to make A as large as possible and/or B as small as possible so that the signal dominates the noise. It is therefore necessary to restrict the power of the signal Ax. One way of accomplishing this is to introduce "soft constraints" by adding penalty terms like ⟨‖Ax‖²⟩ or trace(AA') to E. Some results in this direction have been obtained in Plumbley [21].
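The scaling argument above is easy to reproduce numerically. The sketch below is only an illustration (the teacher map F, the noise covariances, and the sample size are hypothetical choices of ours): it estimates Ê(μA, B/μ) by Monte Carlo and shows it decreasing as μ grows, because the product BA, and hence E, is unchanged while the hidden-noise term shrinks like 1/μ².

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, m, N = 6, 2, 4, 200_000

F = rng.standard_normal((m, n))            # hypothetical teacher: y = F x
A = rng.standard_normal((p, n))
B = rng.standard_normal((m, p))
Sigma_nn = 0.3 * np.eye(p)                 # hidden-layer noise covariance
Sigma_ee = 0.1 * np.eye(m)                 # output-layer noise covariance

def E_hat(A, B):
    """Monte Carlo estimate of <||y - (B A x + B n + e)||^2>."""
    x = rng.standard_normal((n, N))
    y = F @ x
    noise_n = np.linalg.cholesky(Sigma_nn) @ rng.standard_normal((p, N))
    noise_e = np.linalg.cholesky(Sigma_ee) @ rng.standard_normal((m, N))
    z = B @ (A @ x + noise_n) + noise_e
    return np.mean(np.sum((y - z) ** 2, axis=0))

# Rescaling (A, B) -> (mu A, B/mu) leaves BA (hence E) unchanged but divides
# the hidden-noise term trace(B Sigma_nn B') by mu^2, so E_hat keeps decreasing.
for mu in (1.0, 2.0, 10.0):
    noise_part = np.trace((B / mu) @ Sigma_nn @ (B / mu).T) + np.trace(Sigma_ee)
    print(mu, round(E_hat(mu * A, B / mu), 3), round(noise_part, 3))
```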
The other possibility, which we shall consider here in more detail, is to explicitly restrict A to some compact subset A of the set of all p x n matrices, for instance, a sphere centered at zero (the case of "hard constraints"). This leads to the problem of minimizing (10) with A ∈ A and B arbitrary, which clearly has a well-defined solution. An optimal A must lie on the boundary ∂A of A (if not, we could find μ > 1 such that μA ∈ A).

Let us write Σ_nn = σR, where σ > 0 measures the noise level and R is some structure matrix (the simplest case is R = I, but if the units are physically close it may be unnatural to assume that the individual components of the noise are uncorrelated). The explicit dependence of Ê on σ can be taken into account by writing

Ê(A, B) = Ê_σ(A, B) = E(A, B) + σ trace(BRB') + trace(Σ_ee).   (11)

As soon as σ ≤ 1 (for example), it is straightforward to see that the solutions of the problem of minimizing Ê with A ∈ A are identical to the solutions of minimizing Ê with A ∈ A and B ∈ B, where B is some fixed compact set independent of σ. By (11), as σ → 0, Ê(A, B) converges uniformly to E(A, B) + trace(Σ_ee) over the compact set A x B. Since these two functions differ only by an additive constant, the solutions of the noisy constrained problem approach the set of all pairs of matrices (A, B) satisfying (4) and (5) with, in addition, A ∈ A (this automatically forces B to be in B, by restricting the matrices C). In other words, if A_σ is the set of all matrices A ∈ A which are optimal for noise level σ, then

lim_{σ→0} A_σ = {A ∈ ∂A : A = C U_I' Σ_yx Σ_xx^{-1} with invertible C}

(provided of course that the latter set is nonempty). That is, as is intuitively clear, if σ is very small, the solutions of the constrained noisy problem are essentially the same as the solutions of the nonnoisy problem.

In the Appendix we show that, as σ → ∞,

min_B Ê_σ(A, B) = trace(Σ_yy + Σ_ee) − σ^{-1} trace(M A' R^{-1} A) + O(σ^{-2})

uniformly over A, where M = Σ_yx Σ_xy. Hence, if σ is very large, minimizing Ê_σ over A is essentially equivalent to maximizing φ(A) = trace(M A' R^{-1} A) over A. The solution set A_+ of this "asymptotic problem" depends significantly on the choice of the constraint set A. For instance, if A = {A : ‖A‖_F² = trace(AA') ≤ p} (‖A‖_F is the so-called Frobenius norm of A), then A_+ consists of all A of the form √p v u', where u and v are normalized principal eigenvectors of M and R^{-1}, respectively. On the other hand, if A = {A : AA' = I_p} (i.e., the rows of A are orthonormal) and R = I, then the rows of the optimal A span the space of the first p principal eigenvectors of M (for details, see the Appendix). Now notice that AA' = I_p implies trace(AA') = p. Hence, in the high-noise case, full PCA of M is inferior to extraction of the first principal component of M only. In other words, it is better not to force the rows of A to orthonormality, but to allow them to cooperate (build "balanced" representations) instead. In this sense, if σ is very large and A is "rich" enough, the solutions of the constrained noisy problem are of maximum redundancy, where all the hidden units try to do the same thing.
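A quick numerical check of the high-noise comparison (with R = I and a generic symmetric positive definite M standing in for Σ_yx Σ_xy; sizes and names are ours) confirms that, under the Frobenius constraint trace(AA') ≤ p, a "balanced" rank-one A achieves φ = pλ_1, which is at least as large as the value Σ_{i≤p} λ_i achieved by orthonormal rows spanning the top p eigenvectors of M.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 6, 3

# Generic symmetric positive definite M (standing in for Sigma_yx Sigma_xy), R = I.
G = rng.standard_normal((n, n))
M = G @ G.T
lam, U = np.linalg.eigh(M)
lam, U = lam[::-1], U[:, ::-1]

def phi(A):
    """Asymptotic objective trace(M A' R^{-1} A) with R = I."""
    return np.trace(M @ A.T @ A)

# Orthonormal rows spanning the top-p eigenvectors of M ("full PCA" of M).
A_pca = U[:, :p].T

# "Balanced" matrix allowed by trace(AA') <= p: every row is a scaled copy
# of the first principal eigenvector of M.
v = np.ones(p) / np.sqrt(p)
A_bal = np.sqrt(p) * np.outer(v, U[:, 0])

print(phi(A_pca), lam[:p].sum())   # sum of the top p eigenvalues
print(phi(A_bal), p * lam[0])      # p times the largest eigenvalue (>= the above)
```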
Significantly refined results for the autoassociative case (where M = Σ_xx²) have been given in Diamantaras and Hornik [22]. They show that for arbitrary invertible Σ_xx and orthogonally right-invariant A (i.e., AY ∈ A if A ∈ A and Y is orthogonal), Ê is minimized for matrices A of the form CU' for suitable C (as usual, the columns of U are the eigenvectors of Σ_xx). Under the constraint AA' = I_p, the minima are attained at A = Σ_{i=1}^p v_i u_i', where v_1, ..., v_p are normalized eigenvectors of R^{-1} with the corresponding eigenvalues arranged in decreasing order. Under the Frobenius norm constraint trace(AA') ≤ p, the minima are attained at A of the form Σ_{i=1}^r √γ_i v_i u_i', where the γ_i and the rank r depend on the eigenvalues of Σ_xx and Σ_nn. In particular, if Σ_nn = σR as before, then r = r(σ) is nonincreasing, with r(σ) = p for all σ sufficiently small and r(σ) = 1 for all σ sufficiently large. This result formalizes the intuition that the units should increase "cooperation" along with the noise level. Further generalizations are possible by considering non-MSE measures of the "size" of the linear reconstruction errors w = y − (BAx + Bn + e); see Hornik [23]. In particular, in the Gaussian case, the determinant of ⟨ww'⟩ measures the amount of transmitted information, and its constrained maximization is intimately related to the INFOMAX principle of Linsker [9].

IV. GENERALIZATION

This section is written with Y. Chauvin and is a modified version of the article "Temporal Evolution of Generalization During Learning in Linear Networks" [24]. The material, copyrighted by MIT Press, was included here with permission from the publisher.

A. Formal Setting

In practice, the question to be answered is how should one allocate limited resources and parameters, such as network size and architecture, initial conditions, training time, and available examples, to optimize generalization performance? One conventional approach is to consider the problem of learning as a surface fitting problem.
Accordingly, neural networks should be very constrained, with a minimal number of parameters, to avoid the classical overfitting problem. In practice, however, not too much is known about overfitting and its onset, both as a function of network parameters and of training time. Furthermore, the conventional view can be challenged. It may be the case, for instance, that a suitable strategy consists rather in using networks with a few extra parameters. These larger networks must be used in conjunction with nontrivial priors in a Bayesian framework and/or trained for shorter times, based on a careful monitoring of the validation error ("early stopping").

Partial initial results on generalization problems have been obtained in recent years in terms of VC (Vapnik-Chervonenkis) dimension and statistical mechanics (see, for instance, [25]-[27]). Here, we propose a different and complementary approach consisting of a detailed analysis of generalization in simple feedforward linear networks. Even in this simple framework, the questions are far from trivial. Thus we have restricted the problem even further: learning the identity map in a single-layer feedforward linear network. With suitable assumptions on the noise, this setting turns out to be insightful and to yield analytical results which are relevant to what one observes in more complicated situations. Here, we first define our framework and derive the basic equations, first in the noiseless case and then in the case of noisy data. The basic point is to derive an expression for the validation function in terms of the statistical properties of the population and the training and validation samples. We then examine the main results, which consist of an analysis of the landscape of the validation error as a function of training time. Simple simulation results are also presented, and several interesting phenomena are described. The results are discussed in the conclusion, and some possible extensions are briefly mentioned. Mathematical proofs are deferred to the Appendix.

We consider a simple feedforward network with n input units connected by a weight matrix A to n linear output units.¹ The network is trained to learn the identity function (autoassociation) from a set of centered training patterns x_1, ..., x_T. The connection weights are adjusted by gradient descent on the usual LMS error function

E(A) = (1/T) Σ_t ‖x_t − A x_t‖².

The gradient of E with respect to the weights A is

∇E = (A − I)Σ

where Σ = Σ_xx is the covariance matrix of the training set. Thus, the gradient descent learning rule can be expressed as

A(k + 1) = A(k) − η(A(k) − I)Σ

where η is the constant learning rate. Simple induction shows that

A(k) = (I − (I − ηΣ)^k) + A(0)(I − ηΣ)^k.

Hence, if u_i and λ_i (λ_1 ≥ ... ≥ λ_n > 0) denote the eigenvectors and eigenvalues of Σ, then

A(k + 1)u_i = (1 − (1 − ηλ_i)^{k+1})u_i + (1 − ηλ_i)^{k+1} A(0)u_i.   (12)

The behavior of (12) is clear: provided the learning rate is less than the inverse of the largest eigenvalue (η < 1/λ_1), A(k) approaches the identity exponentially fast. This holds for any starting matrix A(0). The eigenvectors of Σ tend to become eigenvectors of A(k), and the corresponding eigenvalues approach one at different rates depending on λ_i (larger eigenvalues are learned much faster). As a result, it is not very restrictive to assume, for ease of exposition, that the starting matrix A(0) is diagonal in the u_i basis, that is, A(0) = U diag(α_i(0)) U', where, as usual, U = [u_1, ..., u_n].

¹Krogh and Hertz [28] have independently analyzed the evolution of generalization in the case of one single linear unit. It can be shown that the evolution curve can assume one of several possible shapes, depending on a number of parameters. Although in the absence of any hidden layer there is no coupling between the output units of an n-n network, it is still necessary to study the n-n case since the corresponding evolution function results from the summation of the evolution curves of each output unit, each such curve being capable of assuming a different shape with different characteristics.
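The recursion and its closed form above are easy to verify numerically. In the sketch below (synthetic centered data of our choosing, learning rate set below 1/λ_1, training started from the zero matrix), batch gradient descent on the autoassociative LMS error reproduces A(k) = (I − (I − ηΣ)^k) + A(0)(I − ηΣ)^k to machine precision.

```python
import numpy as np

rng = np.random.default_rng(4)
n, T, k_steps = 5, 500, 200

X = rng.standard_normal((n, T))
X -= X.mean(axis=1, keepdims=True)
Sigma = X @ X.T / T

eta = 0.5 / np.linalg.eigvalsh(Sigma).max()   # learning rate below 1/lambda_1

A = np.zeros((n, n))                          # training started from the zero matrix
A0 = A.copy()
for _ in range(k_steps):
    A = A - eta * (A - np.eye(n)) @ Sigma     # A(k+1) = A(k) - eta (A(k) - I) Sigma

# Closed form A(k) = (I - (I - eta Sigma)^k) + A(0)(I - eta Sigma)^k.
Mk = np.linalg.matrix_power(np.eye(n) - eta * Sigma, k_steps)
A_closed = (np.eye(n) - Mk) + A0 @ Mk

print(np.max(np.abs(A - A_closed)))           # agreement to machine precision
print(np.max(np.abs(A - np.eye(n))))          # A(k) has nearly reached the identity
```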
(In fact, learning is often started with the zero matrix.) In this case, with A(0) diagonal in the u_i basis, (12) becomes

A(k)u_i = [1 − (1 − ηλ_i)^k (1 − α_i(0))]u_i = α_i(k)u_i.

A simple calculation shows that the corresponding error can be written as

E(A(k)) = Σ_{i=1}^n λ_i(α_i(k) − 1)².

We now modify the setting so as to introduce noise. To fix the ideas, the reader may think, for instance, that we are dealing with handwritten realizations of single-digit numbers. In this case, there are 10 possible patterns but numerous possible noisy realizations. In general, we assume that there is a population of patterns of the form x + n, where x denotes the signal and n denotes the noise, characterized by the covariance matrices Σ̄ = Σ̄_xx, Σ̄_nn, and Σ̄_xn. Here, as everywhere else, we assume that the signal and the noise are centered. A sample x_t + n_t (1 ≤ t ≤ T) from this population is used as a training set. The training sample is characterized by the covariance matrices Σ = Σ_xx, Σ_nn, and Σ_xn calculated over the sample. Similarly, a different sample x̃_t + ñ_t from the population is used as a validation set. The validation sample is characterized by the covariance matrices Σ̃ = Σ̃_xx, Σ̃_nn, and Σ̃_xn. To make the calculations tractable, we shall make, when necessary, several assumptions. First, Σ̄ = Σ = Σ̃; thus there is a common basis of unit length eigenvectors u_i and corresponding eigenvalues λ_i for the signal in the population and in the training and validation samples. Then, with respect to this basis of eigenvectors, the noise covariance matrices are diagonal, that is, Σ_nn = U diag(ν_i)U' and Σ̃_nn = U diag(ν̃_i)U'. Finally, the signal and the noise are always uncorrelated, that is, Σ_xn = Σ̃_xn = 0. (Obviously, it also makes sense to assume that Σ̄_nn = U diag(ν̄_i)U' and Σ̄_xn = 0, although these assumptions are not needed in the main calculation.) Thus we make the simplifying assumptions that both on the training and validation patterns the covariance matrix of the signal is identical to the covariance of the signal over the entire population, that the signal and the noise are uncorrelated, and that the components of the noise are uncorrelated in the eigenbase of the signal. Yet we allow the estimates ν_i and ν̃_i of the variance of the components of the noise to be different in the training and validation sets.

For a given A, the LMS error function over the training patterns is now

E(A) = (1/T) Σ_t ‖x_t − A(x_t + n_t)‖².

As Σ_xn = Σ_nx = 0,

E(A) = trace((A − I)Σ(A − I)' + A Σ_nn A').

Hence, the gradient of E is

∇E = (A − I)Σ + A Σ_nn.
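The decomposition of the noisy training error can be checked by Monte Carlo. The sketch below (signal and noise spectra chosen arbitrarily, both diagonal in a common orthonormal basis, as assumed in the text) compares a sample estimate of ⟨‖x − A(x + n)‖²⟩ with trace((A − I)Σ(A − I)' + A Σ_nn A').

```python
import numpy as np

rng = np.random.default_rng(5)
n, N = 4, 400_000

U = np.linalg.qr(rng.standard_normal((n, n)))[0]   # common orthonormal eigenbasis
lam = np.array([3.0, 2.0, 1.0, 0.5])               # signal eigenvalues lambda_i
nu = np.array([0.4, 0.3, 0.2, 0.1])                # training-noise variances nu_i
Sigma = U @ np.diag(lam) @ U.T
Sigma_nn = U @ np.diag(nu) @ U.T

A = rng.standard_normal((n, n))

# Monte Carlo estimate of <||x - A(x + n)||^2> with independent signal and noise.
x = U @ (np.sqrt(lam)[:, None] * rng.standard_normal((n, N)))
noise = U @ (np.sqrt(nu)[:, None] * rng.standard_normal((n, N)))
E_mc = np.mean(np.sum((x - A @ (x + noise)) ** 2, axis=0))

# Analytic expression from the text.
E_an = np.trace((A - np.eye(n)) @ Sigma @ (A - np.eye(n)).T + A @ Sigma_nn @ A.T)

print(round(E_mc, 3), round(E_an, 3))              # agree up to Monte Carlo error
```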
To compute the image of any eigenvector u_i during training, we have

A(k + 1)u_i = ηλ_i u_i + (1 − ηλ_i − ην_i)A(k)u_i.

Thus by induction

A(k) = A(0)M^k − Σ(Σ + Σ_nn)^{-1}(M^k − I)

where M = I − η(Σ + Σ_nn), and

A(k)u_i = [λ_i/(λ_i + ν_i)][1 − (1 − ηλ_i − ην_i)^k]u_i + (1 − ηλ_i − ην_i)^k A(0)u_i.

If again we assume, as in the rest of the section, that the learning rate satisfies η < min_i 1/(λ_i + ν_i), the eigenvectors of Σ tend to become eigenvectors of A(k) and A(k) approaches the diagonal matrix diag(λ_i/(λ_i + ν_i)) exponentially fast. Assuming that A(0) is diag(α_i(0)) in the u_i basis, we get

A(k)u_i = [λ_i/(λ_i + ν_i)](1 − b_i a_i^k)u_i = α_i(k)u_i

where b_i = 1 − α_i(0)(λ_i + ν_i)/λ_i and a_i = 1 − ηλ_i − ην_i. Notice that 0 < a_i < 1. Using the fact that Σ_nn is diag(ν_i) and A(k) is diag(α_i(k)) in the u_i basis, we obtain

E(A(k)) = Σ_{i=1}^n λ_i(1 − α_i(k))² + ν_i α_i(k)².   (13)

It is easy to see that E(A(k)) is a monotonically decreasing function of k which approaches an asymptotic residual error value given by

E(A(∞)) = Σ_{i=1}^n λ_i ν_i/(λ_i + ν_i).

For any matrix A, we can define the validation error to be

E^V(A) = ⟨‖x̃_t − A(x̃_t + ñ_t)‖²⟩

where the average is taken over the validation sample. Using the fact that Σ̃_xn = 0 and Σ̃_nn = U diag(ν̃_i)U', a derivation similar to (13) shows that the validation error E^V(A(k)) is

E^V(A(k)) = Σ_{i=1}^n λ_i(1 − α_i(k))² + ν̃_i α_i(k)².   (14)

Clearly, as k → ∞, E^V(A(k)) approaches its horizontal asymptote, given by

E^V(A(∞)) = Σ_{i=1}^n λ_i(ν_i² + ν̃_i λ_i)/(λ_i + ν_i)².

It is the behavior of E^V before it reaches its asymptotic value, however, which is of most interest to us. This behavior, as we shall see, can be fairly complicated.
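Equations (13) and (14) make the whole learning curve computable from the spectral data alone. The sketch below uses arbitrary illustrative eigenvalues and noise variances (with ν̃_i > ν_i, so that the validation noise exceeds the training noise): it evaluates α_i(k), the training error, and the validation error, showing the training error settling at Σ_i λ_iν_i/(λ_i + ν_i) while the validation error passes through a minimum before rising to its asymptote.

```python
import numpy as np

lam      = np.array([3.0, 1.5, 0.5])     # signal eigenvalues lambda_i
nu       = np.array([0.2, 0.2, 0.2])     # training-noise variances nu_i
nu_tilde = np.array([0.8, 0.6, 0.4])     # validation-noise variances (larger here)

eta = 0.9 / (lam + nu).max()             # eta < min_i 1/(lambda_i + nu_i)
alpha0 = np.zeros_like(lam)              # training started from the zero matrix

a = 1 - eta * (lam + nu)                 # a_i
b = 1 - alpha0 * (lam + nu) / lam        # b_i

k = np.arange(400)[:, None]
alpha = lam / (lam + nu) * (1 - b * a ** k)                               # alpha_i(k)

E_train = np.sum(lam * (1 - alpha) ** 2 + nu * alpha ** 2, axis=1)        # eq. (13)
E_val   = np.sum(lam * (1 - alpha) ** 2 + nu_tilde * alpha ** 2, axis=1)  # eq. (14)

print(E_train[-1], np.sum(lam * nu / (lam + nu)))    # residual training error
print(E_val.argmin(), E_val.min(), E_val[-1])        # E^V dips, then rises again
```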
B. Validation Analysis

Obviously, dα_i(k)/dk = −λ_i b_i a_i^k log(a_i)/(λ_i + ν_i). Differentiating (14) and collecting terms then yields, in compact form,

dE^V(A(k))/dk = Σ_i a_i^k (β_i + γ_i a_i^k)

with

β_i = 2λ_i² b_i(ν_i − ν̃_i) log(a_i)/(λ_i + ν_i)²   and   γ_i = 2λ_i² b_i²(λ_i + ν̃_i) log(a_i)/(λ_i + ν_i)².

The behavior of E^V depends on the relative size of ν_i and ν̃_i and on the initial conditions α_i(0), which together determine the signs of b_i, β_i, and γ_i. The main result we can prove is as follows. Assume that learning is started with the zero matrix or with a matrix with sufficiently small weights satisfying, for every i,

α_i(0) ≤ min(λ_i/(λ_i + ν_i), λ_i/(λ_i + ν̃_i)).   (15)

If for every i, ν̃_i ≤ ν_i, then the validation function E^V decreases monotonically to its asymptotic value and training should be continued as long as possible.

If for every i, ν̃_i > ν_i, then the validation function E^V decreases monotonically to a unique minimum and then increases monotonically to its asymptotic value. The derivatives of all orders of E^V also have a unique zero crossing and a unique extremum. For optimal generalization, E^V should be monitored and training stopped as soon as E^V begins to increase. A simple bound on the optimal training time k_opt is

min_i (1/log a_i) log(−β_i/γ_i) ≤ k_opt ≤ max_i (1/log a_i) log(−β_i/γ_i).

In the most general case of arbitrary initial conditions and noise, the validation function E^V can have several local minima of variable depth before converging to its asymptotic value. The number of local minima is always at most n.

The main result is a consequence of the following statements, which are proved in the Appendix.

Case 1: For every i, ν̃_i ≥ ν_i, that is, the validation noise is bigger than the training noise.
a) If for every i, α_i(0) ≥ λ_i/(λ_i + ν_i), then E^V decreases monotonically to its asymptotic value.
b) If for every i, λ_i/(λ_i + ν̃_i) ≤ α_i(0) ≤ λ_i/(λ_i + ν_i), then E^V increases monotonically to its asymptotic value.
c) If for every i, α_i(0) ≤ λ_i/(λ_i + ν̃_i) and ν_i ≠ ν̃_i, then E^V decreases monotonically to a unique global minimum and then increases monotonically to its asymptotic value. The derivatives of all orders of E^V have a unique zero crossing and a unique extremum.

Case 2: For every i, ν̃_i ≤ ν_i, that is, the validation noise is smaller than the training noise.
a) If for every i, α_i(0) ≥ λ_i/(λ_i + ν̃_i) and ν_i ≠ ν̃_i, then E^V decreases monotonically to a unique global minimum and then increases monotonically to its asymptotic value. The derivatives of all orders of E^V have a unique zero crossing and a unique extremum.
b) If for every i, λ_i/(λ_i + ν_i) ≤ α_i(0) ≤ λ_i/(λ_i + ν̃_i), then E^V increases monotonically to its asymptotic value.
c) If for every i, α_i(0) ≤ λ_i/(λ_i + ν_i), then E^V decreases monotonically to its asymptotic value.
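With β_i and γ_i as defined above and the same illustrative spectra as before (arbitrary choices, training started from zero), the stopping-time bound can be checked directly: the sketch below computes each mode's zero crossing k_i = log(−β_i/γ_i)/log(a_i) and verifies that the minimizer of E^V(A(k)) over a grid of epochs falls between min_i k_i and max_i k_i.

```python
import numpy as np

lam      = np.array([3.0, 1.5, 0.5])
nu       = np.array([0.2, 0.2, 0.2])
nu_tilde = np.array([0.8, 0.6, 0.4])     # validation noise larger: unique minimum
eta      = 0.9 / (lam + nu).max()
alpha0   = np.zeros_like(lam)

a = 1 - eta * (lam + nu)
b = 1 - alpha0 * (lam + nu) / lam

# Per-mode coefficients of dE^V/dk = sum_i a_i^k (beta_i + gamma_i a_i^k).
beta  = 2 * lam**2 * b * (nu - nu_tilde) * np.log(a) / (lam + nu) ** 2
gamma = 2 * lam**2 * b**2 * (lam + nu_tilde) * np.log(a) / (lam + nu) ** 2

# Each mode taken alone is optimal at k_i = log(-beta_i/gamma_i) / log(a_i).
k_star = np.log(-beta / gamma) / np.log(a)
print(k_star.min(), k_star.max())        # bracket for the optimal stopping time

# Direct scan of E^V(A(k)) over integer epochs.
k = np.arange(400)[:, None]
alpha = lam / (lam + nu) * (1 - b * a ** k)
E_val = np.sum(lam * (1 - alpha) ** 2 + nu_tilde * alpha ** 2, axis=1)
print(E_val.argmin())                    # lies inside [min k_i, max k_i]
```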
Several remarks can be made on the previous statements. First, notice that in both b) cases, E^V increases because the initial A(0) is already too good for the given noise levels. The monotonicity properties of the validation function are not always strict in the sense that, for instance, at the common boundary of some of the cases E^V can be flat. These degenerate cases can be easily checked directly. The statement of the main result assumes that the initial matrix is the zero matrix or a matrix with a diagonal form in the basis of the eigenvectors u_i. A random initial nonzero matrix, however, will not satisfy these conditions. E^V is continuous and even infinitely differentiable in all of its parameters. Therefore, the results are also true for sufficiently small random matrices. If we use, for instance, an induced l² norm for the matrices, then the norm of a starting matrix is the same in the original or in the orthonormal u_i basis. Equation (15) yields a trivial upper bound of n^{1/2} for the initial norm, which roughly corresponds to having random initial weights of order at most n^{-1/2} in the original basis. Thus, heuristically, the variance of the initial random weights should be a function of the size of the network. This condition is not satisfied in many of the usual simulations found in the literature, where initial weights are generated randomly and independently using, for instance, a centered Gaussian distribution with fixed standard deviation.

When more arbitrary conditions are considered, in the initial weights or in the noise, multiple local minima can appear in the validation function. As can be seen in one of the curves of the example given in Fig. 2, there exist even cases where the first minimum is not the deepest one, although these may be rare. Also in this figure, better validation results seem to be obtained with smaller initial conditions. This can easily be understood, in this small-dimensional example, from some of the arguments given in the Appendix.

Another potentially interesting and relevant phenomenon is illustrated in Fig. 3. It is possible to have a situation where, after a certain number of training cycles, both the LMS and the validation functions appear to be flat and to have converged to their asymptotic values. If training is continued, however, one observes that these plateaus can come to an end.

Finally, we have made an implicit distinction between validation and generalization throughout most of the previous sections. If generalization performance is measured by the LMS error calculated over the entire population, it is clear that our main result can be applied to the generalization error by assuming that Σ̃_nn = U diag(ν̄_i)U' and ν̃_i = ν̄_i for every i. In particular, in the second statement of the main