(In fact, learning is often started with the zero matrix.) In this case, (12) becomes

$$A(k)\,u_i = \big[1 - (1 - \eta\lambda_i)^k\,(1 - a_i(0))\big]\,u_i = a_i(k)\,u_i.$$

A simple calculation shows that the corresponding error can be written as

$$E(A(k)) = \sum_{i=1}^{n} \lambda_i\,(a_i(k) - 1)^2.$$

We now modify the setting so as to introduce noise. To fix the ideas, the reader may think, for instance, that we are dealing with handwritten realizations of single-digit numbers. In this case, there are 10 possible patterns but numerous possible noisy realizations. In general, we assume that there is a population of patterns of the form $x + n$, where $x$ denotes the signal and $n$ denotes the noise, characterized by the covariance matrices $\bar\Sigma = \bar\Sigma_{xx}$, $\bar\Sigma_{nn}$, and $\bar\Sigma_{xn}$. Here, as everywhere else, we assume that the signal and the noise are centered. A sample $x_t + n_t$ ($1 \le t \le T$) from this population is used as a training set. The training sample is characterized by the covariance matrices $\Sigma = \Sigma_{xx}$, $\Sigma_{nn}$, and $\Sigma_{xn}$ calculated over the sample. Similarly, a different sample $\tilde x_t + \tilde n_t$ from the population is used as a validation set. The validation sample is characterized by the covariance matrices $\tilde\Sigma = \tilde\Sigma_{xx}$, $\tilde\Sigma_{nn}$, and $\tilde\Sigma_{xn}$.

To make the calculations tractable, we shall make, when necessary, several assumptions. First, $\bar\Sigma = \Sigma = \tilde\Sigma$; thus there is a common basis of unit-length eigenvectors $u_i$ and corresponding eigenvalues $\lambda_i$ for the signal in the population and in the training and validation samples. Then, with respect to this basis of eigenvectors, the noise covariance matrices are diagonal, that is, $\Sigma_{nn} = U\,\mathrm{diag}(\nu_i)\,U'$ and $\tilde\Sigma_{nn} = U\,\mathrm{diag}(\tilde\nu_i)\,U'$. Finally, the signal and the noise are always uncorrelated, that is, $\Sigma_{xn} = \tilde\Sigma_{xn} = 0$. (Obviously, it also makes sense to assume that $\bar\Sigma_{nn} = U\,\mathrm{diag}(\bar\nu_i)\,U'$ and $\bar\Sigma_{xn} = 0$, although these assumptions are not needed in the main calculation.) Thus we make the simplifying assumptions that, both on the training and validation patterns, the covariance matrix of the signal is identical to the covariance of the signal over the entire population, that the signal and the noise are uncorrelated, and that the components of the noise are uncorrelated in the eigenbasis of the signal. Yet we allow the estimates $\nu_i$ and $\tilde\nu_i$ of the variance of the components of the noise to be different in the training and validation sets.

For a given $A$, the LMS error function over the training patterns is now

$$E(A) = \frac{1}{T}\sum_{t} \|x_t - A(x_t + n_t)\|^2.$$

As $\Sigma_{xn} = \Sigma_{nx} = 0$,

$$E(A) = \mathrm{trace}\big((A - I)\,\Sigma\,(A - I)' + A\,\Sigma_{nn}\,A'\big).$$

Hence, the gradient of $E$ is

$$\nabla E = (A - I)\,\Sigma + A\,\Sigma_{nn}.$$
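As a purely illustrative check of the last two expressions (this sketch is ours, not part of the original text), the following Python/NumPy fragment draws a synthetic centered training sample whose signal and noise covariances are the hypothetical diagonal matrices $\Sigma = \mathrm{diag}(\lambda_i)$ and $\Sigma_{nn} = \mathrm{diag}(\nu_i)$, compares the sample LMS error with the trace formula, and forms the gradient; all numerical values are invented.

    import numpy as np

    rng = np.random.default_rng(0)
    n, T = 4, 50000

    # Hypothetical signal eigenvalues lambda_i and training-noise variances nu_i
    # (diagonal covariances, i.e. the u_i basis is taken to be the standard basis).
    lam = np.array([3.0, 2.0, 1.0, 0.5])
    nu = np.array([0.4, 0.3, 0.2, 0.1])
    Sigma, Sigma_nn = np.diag(lam), np.diag(nu)

    # Centered synthetic training sample x_t + n_t with these covariances.
    x = rng.normal(size=(T, n)) * np.sqrt(lam)
    noise = rng.normal(size=(T, n)) * np.sqrt(nu)

    A = rng.normal(scale=0.1, size=(n, n))  # an arbitrary connection matrix
    I = np.eye(n)

    # LMS error over the training patterns: E(A) = (1/T) sum_t ||x_t - A(x_t + n_t)||^2
    E_sample = np.mean(np.sum((x - (x + noise) @ A.T) ** 2, axis=1))

    # Trace formula, valid when the signal-noise cross-covariance vanishes:
    # E(A) = trace((A - I) Sigma (A - I)' + A Sigma_nn A')
    E_trace = np.trace((A - I) @ Sigma @ (A - I).T + A @ Sigma_nn @ A.T)

    # Gradient of E: (A - I) Sigma + A Sigma_nn
    grad_E = (A - I) @ Sigma + A @ Sigma_nn

    # The two error values agree up to sampling error, since the trace formula
    # here uses the population covariances rather than the sample ones.
    print(E_sample, E_trace)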
To compute the image of any eigenvector $u_i$ during training, we have

$$A(k+1)\,u_i = \eta\lambda_i\,u_i + (1 - \eta\lambda_i - \eta\nu_i)\,A(k)\,u_i.$$

Thus by induction

$$A(k) = A(0)\,M^k - \Sigma\,(\Sigma + \Sigma_{nn})^{-1}(M^k - I)$$

where $M = I - \eta(\Sigma + \Sigma_{nn})$, and

$$A(k)\,u_i = \frac{\lambda_i}{\lambda_i + \nu_i}\big[1 - (1 - \eta\lambda_i - \eta\nu_i)^k\big]\,u_i + (1 - \eta\lambda_i - \eta\nu_i)^k\,A(0)\,u_i.$$

If again we assume, as in the rest of the section, that the learning rate satisfies $\eta < \min_i 1/(\lambda_i + \nu_i)$, the eigenvectors of $\Sigma$ tend to become eigenvectors of $A(k)$, and $A(k)$ approaches the diagonal matrix $\mathrm{diag}(\lambda_i/(\lambda_i + \nu_i))$ exponentially fast. Assuming that $A(0)$ is $\mathrm{diag}(a_i(0))$ in the $u_i$ basis, we get

$$A(k)\,u_i = \frac{\lambda_i}{\lambda_i + \nu_i}\,(1 - b_i a_i^k)\,u_i = a_i(k)\,u_i$$

where $b_i = 1 - a_i(0)(\lambda_i + \nu_i)/\lambda_i$ and $a_i = 1 - \eta\lambda_i - \eta\nu_i$. Notice that $0 < a_i < 1$. Using the fact that $\Sigma_{nn}$ is $\mathrm{diag}(\nu_i)$ and $A(k)$ is $\mathrm{diag}(a_i(k))$ in the $u_i$ basis, we obtain

$$E(A(k)) = \sum_{i=1}^{n} \lambda_i\,(1 - a_i(k))^2 + \nu_i\,a_i(k)^2. \qquad (13)$$

It is easy to see that $E(A(k))$ is a monotonically decreasing function of $k$ which approaches an asymptotic residual error value given by

$$E(A(\infty)) = \sum_{i=1}^{n} \frac{\lambda_i \nu_i}{\lambda_i + \nu_i}.$$
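These dynamics are easy to reproduce numerically. In the sketch below (ours, under the same hypothetical diagonal setup as before; the spectra, the learning rate, and the zero start are all illustrative choices), the batch gradient rule $A \leftarrow A - \eta[(A - I)\Sigma + A\Sigma_{nn}]$ is iterated and checked against the closed-form $a_i(k)$ and against (13):

    import numpy as np

    # Hypothetical spectra: signal eigenvalues lambda_i, training-noise variances nu_i.
    lam = np.array([3.0, 2.0, 1.0, 0.5])
    nu = np.array([0.4, 0.3, 0.2, 0.1])
    eta = 0.9 / (lam + nu).max()   # satisfies eta < min_i 1/(lambda_i + nu_i)
    a0 = np.zeros_like(lam)        # learning started from the zero matrix, a_i(0) = 0

    Sigma, Sigma_nn, I = np.diag(lam), np.diag(nu), np.eye(len(lam))
    A = np.diag(a0)

    a = 1.0 - eta * (lam + nu)       # a_i = 1 - eta*lambda_i - eta*nu_i
    b = 1.0 - a0 * (lam + nu) / lam  # b_i = 1 - a_i(0)(lambda_i + nu_i)/lambda_i

    for k in range(1, 51):
        # Batch gradient step: A <- A - eta * [(A - I) Sigma + A Sigma_nn]
        A = A - eta * ((A - I) @ Sigma + A @ Sigma_nn)

        # Closed form: a_i(k) = lambda_i/(lambda_i + nu_i) * (1 - b_i * a_i^k)
        a_k = lam / (lam + nu) * (1.0 - b * a ** k)
        assert np.allclose(np.diag(A), a_k)

    # Training error (13) at k = 50 and its asymptote sum_i lambda_i nu_i / (lambda_i + nu_i)
    E_k = np.sum(lam * (1.0 - a_k) ** 2 + nu * a_k ** 2)
    E_inf = np.sum(lam * nu / (lam + nu))
    print(E_k, E_inf)

Because $\Sigma$, $\Sigma_{nn}$, and $A(0)$ are simultaneously diagonal here, the iteration stays diagonal and the agreement with the closed form is exact up to floating-point error.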
For any matrix $A$, we can define the validation error to be the corresponding LMS error over the validation patterns,

$$E^V(A) = \frac{1}{\tilde T}\sum_{t} \|\tilde x_t - A(\tilde x_t + \tilde n_t)\|^2$$

where $\tilde T$ is the size of the validation sample. Using the fact that $\tilde\Sigma_{xn} = 0$ and $\tilde\Sigma_{nn} = U\,\mathrm{diag}(\tilde\nu_i)\,U'$, a derivation similar to (13) shows that the validation error $E^V(A(k))$ is

$$E^V(A(k)) = \sum_{i=1}^{n} \lambda_i\,(1 - a_i(k))^2 + \tilde\nu_i\,a_i(k)^2. \qquad (14)$$

Clearly, as $k \to \infty$, $E^V(A(k))$ approaches its horizontal asymptote, given by

$$E^V(A(\infty)) = \sum_{i=1}^{n} \frac{\lambda_i(\nu_i^2 + \tilde\nu_i\lambda_i)}{(\lambda_i + \nu_i)^2}.$$

It is the behavior of $E^V$ before it reaches its asymptotic value, however, which is of most interest to us. This behavior, as we shall see, can be fairly complicated.
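To examine this pre-asymptotic behavior concretely, the closed-form $a_i(k)$ can be plugged into (13) and (14) for any chosen spectra. The short sketch below (ours; the values of $\lambda_i$, $\nu_i$, and the validation-noise estimates $\tilde\nu_i$ are invented) evaluates both curves and reports where the validation error is smallest; when the $\tilde\nu_i$ exceed the $\nu_i$, that minimum can occur at a finite $k$, well before the asymptote.

    import numpy as np

    # Hypothetical values: lambda_i, training-noise nu_i, validation-noise estimates nu_tilde_i.
    lam = np.array([3.0, 2.0, 1.0, 0.5])
    nu = np.array([0.1, 0.1, 0.1, 0.1])
    nu_tilde = np.array([0.1, 0.3, 0.6, 1.0])
    eta = 0.5 / (lam + nu).max()   # eta < min_i 1/(lambda_i + nu_i)
    a0 = np.zeros_like(lam)        # training started from the zero matrix

    ks = np.arange(0, 400)
    a = 1.0 - eta * (lam + nu)
    b = 1.0 - a0 * (lam + nu) / lam
    a_k = lam / (lam + nu) * (1.0 - b * a ** ks[:, None])   # a_i(k) for every k

    E_train = np.sum(lam * (1.0 - a_k) ** 2 + nu * a_k ** 2, axis=1)        # eq. (13)
    E_val = np.sum(lam * (1.0 - a_k) ** 2 + nu_tilde * a_k ** 2, axis=1)    # eq. (14)
    E_val_inf = np.sum(lam * (nu ** 2 + nu_tilde * lam) / (lam + nu) ** 2)  # asymptote of (14)

    print("training error decreases monotonically:", bool(np.all(np.diff(E_train) <= 1e-12)))
    print("validation error is smallest at k =", int(np.argmin(E_val)),
          "with value", float(E_val.min()), "versus asymptote", float(E_val_inf))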