Significantly refined results for the autoassociative case (where $M = \Sigma_{xx}$) have been given in Diamantaras and Hornik [22]. They show that for arbitrary invertible $\Sigma_{xx}$ and orthogonally right-invariant $\mathcal{A}$ (i.e., $AY \in \mathcal{A}$ if $A \in \mathcal{A}$ and $Y$ is orthogonal), $E$ is minimized for matrices $A$ of the form $CU'$ for suitable $C$ (as usual, the columns of $U$ are the eigenvectors of $\Sigma_{xx}$). Under the constraint $AA' = I_p$, the minima are attained at $A = \sum_{i=1}^{p} v_i u_i'$, where $v_1, \ldots, v_p$ are normalized eigenvectors of $R^{-1}$ with the corresponding eigenvalues arranged in decreasing order. Under the Frobenius norm constraint $\mathrm{trace}(AA') \le p$, the minima are attained at $A$ of the form $\sum_{i=1}^{r} \sqrt{\gamma_i}\, v_i u_i'$, where the $\gamma_i$ and the rank $r$ depend on the eigenvalues of $\Sigma_{xx}$ and $\Sigma_{nn}$. In particular, if $\Sigma_{nn} = \sigma R$ as before, then $r = r(\sigma)$ is nonincreasing, with $r(\sigma) = p$ for all $\sigma$ sufficiently small and $r(\sigma) = 1$ for all $\sigma$ sufficiently large. This result formalizes the intuition that the units should increase "cooperation" along with the noise level.

Further generalizations are possible by considering non-MSE measures of the "size" of the linear reconstruction errors $w = y - (BAx + Bn + e)$; see Hornik [23]. In particular, in the Gaussian case, the determinant of $\langle ww' \rangle$ measures the amount of transmitted information, and its constrained maximization is intimately related to the INFOMAX principle of Linsker [9].

IV. GENERALIZATION

This section is written with Y. Chauvin and is a modified version of the article "Temporal Evolution of Generalization During Learning in Linear Networks" [24]. The material, copyrighted by MIT Press, was included here with permission from the publisher.

A. Formal Setting

In practice, the question to be answered is: how should one allocate limited resources and parameters, such as network size and architecture, initial conditions, training time, and available examples, to optimize generalization performance? One conventional approach is to consider the problem of learning as a surface fitting problem. Accordingly, neural networks should be very constrained, with a minimal number of parameters, to avoid the classical overfitting problem. In practice, however, not too much is known about overfitting and its onset, both as a function of network parameters and training time. Furthermore, the conventional view can be challenged. It may be the case, for instance, that a suitable strategy consists rather in using networks with a few extra parameters. These larger networks must be used in conjunction with nontrivial priors in a Bayesian framework and/or trained for shorter times, based on a careful monitoring of the validation error ("early stopping").

Partial initial results on generalization problems have been obtained in recent years in terms of VC (Vapnik-Chervonenkis) dimension and statistical mechanics (see, for instance, [25]-[27]). Here, we propose a different and complementary approach consisting of a detailed analysis of generalization in simple feedforward linear networks. Even in this simple framework, the questions are far from trivial. Thus we have restricted the problem even further: learning the identity map in a single-layer feedforward linear network. With suitable assumptions on the noise, this setting turns out to be insightful and to yield analytical results which are relevant to what one observes in more complicated situations. Here, we first define our framework and derive the basic equations, first in the noiseless case and then in the case of noisy data. The basic point is to derive an expression for the validation function in terms of the statistical properties of the population and the training and validation samples. We then examine the main results, which consist of an analysis of the landscape of the validation error as a function of training time. Simple simulation results are also presented, and several interesting phenomena are described. The results are discussed in the conclusion, and some possible extensions are briefly mentioned. Mathematical proofs are deferred to the Appendix.

We consider a simple feedforward network with $n$ input units connected by a weight matrix $A$ to $n$ linear output units.¹ The network is trained to learn the identity function (autoassociation) from a set of centered training patterns $x_1, \ldots, x_T$. The connection weights are adjusted by gradient descent on the usual LMS error function

$$E(A) = \langle \lVert x - Ax \rVert^2 \rangle.$$

The gradient of $E$ with respect to the weights $A$ is

$$\nabla E = (A - I)\Sigma$$

where $\Sigma = \Sigma_{xx}$ is the covariance matrix of the training set. Thus, the gradient descent learning rule can be expressed as

$$A(k+1) = A(k) - \eta\,(A(k) - I)\,\Sigma$$

where $\eta$ is the constant learning rate. Simple induction shows that

$$A(k) = \bigl(I - (I - \eta\Sigma)^k\bigr) + A(0)(I - \eta\Sigma)^k.$$
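As a quick numerical sanity check (an illustrative addition, not code from the original paper), the following Python sketch iterates the learning rule on a covariance matrix built from random centered patterns and compares the result with the closed form above; all names (`Sigma`, `eta`, `k`) are assumptions of the sketch.

```python
import numpy as np

# Check of the closed form A(k) = (I - (I - eta*Sigma)^k) + A(0)(I - eta*Sigma)^k
# for the gradient descent rule A(k+1) = A(k) - eta*(A(k) - I)*Sigma.
# Illustrative sketch only.

rng = np.random.default_rng(0)
n, T = 5, 200

# Covariance matrix of a set of centered training patterns.
X = rng.standard_normal((n, T))
X -= X.mean(axis=1, keepdims=True)
Sigma = X @ X.T / T

eta = 0.9 / np.linalg.eigvalsh(Sigma).max()   # keep eta < 1/lambda_1
I = np.eye(n)
A0 = 0.1 * rng.standard_normal((n, n))        # arbitrary starting matrix A(0)

A = A0.copy()
k = 50
for _ in range(k):
    A = A - eta * (A - I) @ Sigma             # gradient descent on the LMS error

M = np.linalg.matrix_power(I - eta * Sigma, k)
A_closed = (I - M) + A0 @ M                   # closed form after k steps

print(np.allclose(A, A_closed))               # True: the induction formula holds
print(np.linalg.norm(A - I))                  # small: A(k) is already close to I
```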
Hence if $u_i$ and $\lambda_i$ ($\lambda_1 \ge \cdots \ge \lambda_n > 0$) denote the eigenvectors and eigenvalues of $\Sigma$, then

$$A(k)u_i = \bigl(1 - (1 - \eta\lambda_i)^k\bigr)u_i + (1 - \eta\lambda_i)^k A(0)u_i. \qquad (12)$$

The behavior of (12) is clear: provided the learning rate is less than the inverse of the largest eigenvalue ($\eta < 1/\lambda_1$), $A(k)$ approaches the identity exponentially fast. This holds for any starting matrix $A(0)$. The eigenvectors of $\Sigma$ tend to become eigenvectors of $A(k)$, and the corresponding eigenvalues approach one at different rates depending on $\lambda_i$ (larger eigenvalues are learned much faster). As a result, it is not very restrictive to assume, for ease of exposition, that the starting matrix $A(0)$ is diagonal in the $u_i$ basis, that is, $A(0) = U\,\mathrm{diag}(a_i(0))\,U'$, where, as usual, $U = [u_1, \ldots, u_n]$.

¹Krogh and Hertz [28] have independently analyzed the evolution of generalization in the case of one single linear unit. It can be shown that the evolution curve can assume one of several possible shapes, depending on a number of parameters. Although in the absence of any hidden layer there is no coupling between the output units of an $n$-$n$ network, it is still necessary to study the $n$-$n$ case since the corresponding evolution function results from the summation of the evolution curves of each output unit, each such curve being capable of assuming a different shape with different characteristics.
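To illustrate (12) numerically, the sketch below (again an illustrative addition, not code from the paper) starts from an $A(0)$ that is diagonal in the $u_i$ basis and tracks the coefficients $a_i(k) = u_i' A(k) u_i$, which relax to one at the predicted rates $(1 - \eta\lambda_i)^k$, faster for larger eigenvalues.

```python
import numpy as np

# Illustration of (12): in the eigenbasis of Sigma, each coefficient of A(k)
# relaxes to 1 at its own rate (1 - eta*lambda_i)^k, so directions with
# larger eigenvalues are learned faster.  Illustrative sketch only.

rng = np.random.default_rng(1)
n, T = 4, 500

X = rng.standard_normal((n, T))
X -= X.mean(axis=1, keepdims=True)
Sigma = X @ X.T / T

lam, U = np.linalg.eigh(Sigma)                # eigenvalues in ascending order
lam, U = lam[::-1], U[:, ::-1]                # sort so lambda_1 >= ... >= lambda_n

eta = 0.5 / lam[0]                            # eta < 1/lambda_1
a0 = np.full(n, 0.05)                         # A(0) diagonal in the u_i basis
A = U @ np.diag(a0) @ U.T

for k in range(1, 31):
    A = A - eta * (A - np.eye(n)) @ Sigma
    a_k = np.diag(U.T @ A @ U)                # a_i(k) = u_i' A(k) u_i
    predicted = 1 - (1 - eta * lam) ** k * (1 - a0)
    if k % 10 == 0:
        print(k, np.round(a_k, 4), np.allclose(a_k, predicted))
```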