B. Validation Analysis

Obviously, $d\alpha_i(k)/dk = -\lambda_i b_i a_i^k \log a_i/(\lambda_i + \nu_i)$. Substituting into (14) and collecting terms yields

$$\frac{dE^V}{dk} = \sum_{i=1}^{n} \frac{2\lambda_i^2 b_i \log a_i}{(\lambda_i + \nu_i)^2}\, a_i^k \bigl[(\nu_i - \tilde{\nu}_i) + (\lambda_i + \tilde{\nu}_i)\, b_i a_i^k\bigr]$$

or, in more compact form,

$$\frac{dE^V}{dk} = \sum_{i=1}^{n} a_i^k \left(\beta_i + \gamma_i a_i^k\right)$$

with

$$\beta_i = \frac{2\lambda_i^2 b_i}{(\lambda_i + \nu_i)^2}\,(\nu_i - \tilde{\nu}_i)\log a_i, \qquad
\gamma_i = \frac{2\lambda_i^2 b_i^2}{(\lambda_i + \nu_i)^2}\,(\lambda_i + \tilde{\nu}_i)\log a_i.$$

The behavior of $E^V$ depends on the relative size of $\nu_i$ and $\tilde{\nu}_i$ and on the initial conditions $\alpha_i(0)$, which together determine the signs of $b_i$, $\beta_i$, and $\gamma_i$. The main result we can prove is as follows.

Assume that learning is started with the zero matrix or with a matrix with sufficiently small weights satisfying, for every $i$,

$$\alpha_i(0) \le \min\!\left(\frac{\lambda_i}{\lambda_i + \nu_i},\ \frac{\lambda_i}{\lambda_i + \tilde{\nu}_i}\right). \qquad (15)$$

1) If for every $i$, $\tilde{\nu}_i \le \nu_i$, then the validation function $E^V$ decreases monotonically to its asymptotic value and training should be continued as long as possible.

2) If for every $i$, $\tilde{\nu}_i > \nu_i$, then the validation function $E^V$ decreases monotonically to a unique minimum and then increases monotonically to its asymptotic value. The derivatives of all orders of $E^V$ also have a unique zero crossing and a unique extremum. For optimal generalization, $E^V$ should be monitored and training stopped as soon as $E^V$ begins to increase. A simple bound on the optimal training time $k_{\mathrm{opt}}$ is

$$\min_i \frac{1}{\log a_i}\log\!\left(\frac{-\beta_i}{\gamma_i}\right) \;\le\; k_{\mathrm{opt}} \;\le\; \max_i \frac{1}{\log a_i}\log\!\left(\frac{-\beta_i}{\gamma_i}\right).$$
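To make the bound concrete, the following is a small numerical sketch (added for illustration; it is not part of the original analysis) for a hypothetical three-mode problem with a zero initial matrix and $\tilde{\nu}_i > \nu_i$, the situation of statement 2). It evaluates the compact form of $dE^V/dk$ given above, locates its zero crossing, and checks that this crossing lies between $\min_i$ and $\max_i$ of $\log(-\beta_i/\gamma_i)/\log a_i$. The contraction factors $a_i = 1 - \eta(\lambda_i + \nu_i)$ and the value $b_i = 1$ for a zero initial matrix are assumptions carried over from the earlier training analysis, which is not reproduced here; all numerical values are arbitrary.

```python
import numpy as np

# Hypothetical mode parameters (chosen for illustration, not from the paper):
# data-covariance eigenvalues lambda_i, training-noise eigenvalues nu_i, and
# validation-noise eigenvalues nu-tilde_i, with nu-tilde_i > nu_i (statement 2).
lam = np.array([3.0, 1.0, 0.3])      # lambda_i
nu  = np.array([0.10, 0.05, 0.02])   # nu_i        (training noise)
nut = np.array([0.40, 0.30, 0.10])   # nu-tilde_i  (validation noise)
eta = 0.05                           # learning rate (assumed)

# Assumed from the preceding LMS analysis: per-mode contraction factors a_i,
# and b_i = 1 for a zero initial matrix.
a = 1.0 - eta * (lam + nu)
b = np.ones_like(lam)

# beta_i and gamma_i exactly as defined in the text.
beta  = 2 * lam**2 * b    * (nu - nut)  * np.log(a) / (lam + nu)**2
gamma = 2 * lam**2 * b**2 * (lam + nut) * np.log(a) / (lam + nu)**2

def dEV_dk(k):
    """Compact form of dE^V/dk: sum_i a_i^k (beta_i + gamma_i a_i^k)."""
    return float(np.sum(a**k * (beta + gamma * a**k)))

# Per-mode zero crossings; their min and max give the bound on k_opt.
k_i = np.log(-beta / gamma) / np.log(a)
k_lo, k_hi = float(k_i.min()), float(k_i.max())

# Each per-mode term changes sign exactly once, at k_i, so dE^V/dk is
# non-positive at k_lo and non-negative at k_hi: bisect for its zero crossing.
lo, hi = k_lo, k_hi
for _ in range(100):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if dEV_dk(mid) < 0 else (lo, mid)
k_opt = 0.5 * (lo + hi)

print(f"bound: {k_lo:.2f} <= k_opt <= {k_hi:.2f};  numerical k_opt ~ {k_opt:.2f}")
```

In the monotone case of statement 1) ($\tilde{\nu}_i \le \nu_i$ for every $i$), the same derivative remains negative for all $k$ and no finite stopping time is preferred, in agreement with the text.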
In the most general case of arbitrary initial conditions and noise, the validation function $E^V$ can have several local minima of variable depth before converging to its asymptotic value. The number of local minima is always at most $n$.

The main result is a consequence of the following statements, which are proved in the Appendix.

Case 1: For every $i$, $\tilde{\nu}_i \ge \nu_i$, that is, the validation noise is bigger than the training noise.

a) If for every $i$, $\alpha_i(0) \ge \lambda_i/(\lambda_i + \nu_i)$, then $E^V$ decreases monotonically to its asymptotic value.

b) If for every $i$, $\lambda_i/(\lambda_i + \tilde{\nu}_i) \le \alpha_i(0) \le \lambda_i/(\lambda_i + \nu_i)$, then $E^V$ increases monotonically to its asymptotic value.

c) If for every $i$, $\alpha_i(0) \le \lambda_i/(\lambda_i + \tilde{\nu}_i)$ and $\nu_i \ne \tilde{\nu}_i$, then $E^V$ decreases monotonically to a unique global minimum and then increases monotonically to its asymptotic value. The derivatives of all orders of $E^V$ have a unique zero crossing and a unique extremum.

Case 2: For every $i$, $\tilde{\nu}_i \le \nu_i$, that is, the validation noise is smaller than the training noise.

a) If for every $i$, $\alpha_i(0) \ge \lambda_i/(\lambda_i + \tilde{\nu}_i)$ and $\nu_i \ne \tilde{\nu}_i$, then $E^V$ decreases monotonically to a unique global minimum and then increases monotonically to its asymptotic value. The derivatives of all orders of $E^V$ have a unique zero crossing and a unique extremum.

b) If for every $i$, $\lambda_i/(\lambda_i + \nu_i) \le \alpha_i(0) \le \lambda_i/(\lambda_i + \tilde{\nu}_i)$, then $E^V$ increases monotonically to its asymptotic value.

c) If for every $i$, $\alpha_i(0) \le \lambda_i/(\lambda_i + \nu_i)$, then $E^V$ decreases monotonically to its asymptotic value.
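As a compact summary, the case analysis above can be written as a small decision rule. The sketch below (added for illustration; not from the paper) returns the qualitative shape of $E^V$ implied by Cases 1 and 2 for given $\lambda_i$, $\nu_i$, $\tilde{\nu}_i$, and $\alpha_i(0)$; as in the text, the statements apply when the same case holds for every $i$.

```python
def validation_regime(lam_i, nu_i, nut_i, alpha0_i):
    """Qualitative behavior of E^V according to Cases 1 and 2.

    lam_i    : lambda_i, eigenvalue of the data covariance
    nu_i     : nu_i, training-noise eigenvalue
    nut_i    : nu-tilde_i, validation-noise eigenvalue
    alpha0_i : alpha_i(0), initial condition along eigenvector u_i

    The label describes one mode; the statements in the text require the
    same case to hold for every i.
    """
    thr_train = lam_i / (lam_i + nu_i)    # lambda_i / (lambda_i + nu_i)
    thr_valid = lam_i / (lam_i + nut_i)   # lambda_i / (lambda_i + nu-tilde_i)
    if nut_i >= nu_i:                     # Case 1: validation noise >= training noise
        if alpha0_i >= thr_train:                        # 1a
            return "decreases monotonically"
        if alpha0_i >= thr_valid:                        # 1b
            return "increases monotonically"
        if nut_i != nu_i:                                # 1c
            return "decreases to a unique minimum, then increases"
        return "decreases monotonically"                 # degenerate nu-tilde = nu
    else:                                 # Case 2: validation noise < training noise
        if alpha0_i >= thr_valid:                        # 2a (nu_i != nu-tilde_i holds here)
            return "decreases to a unique minimum, then increases"
        if alpha0_i >= thr_train:                        # 2b
            return "increases monotonically"
        return "decreases monotonically"                 # 2c

# Example: small initial weights with larger validation noise (Case 1c).
print(validation_regime(1.0, 0.05, 0.3, 0.0))
```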
Several remarks can be made on the previous statements. First, notice that in both b) cases, $E^V$ increases because the initial $A(0)$ is already too good for the given noise levels. The monotonicity properties of the validation function are not always strict in the sense that, for instance, at the common boundary of some of the cases $E^V$ can be flat. These degenerate cases can easily be checked directly. The statement of the main result assumes that the initial matrix be the zero matrix or a matrix with a diagonal form in the basis of the eigenvectors $u_i$. A random initial nonzero matrix, however, will not satisfy these conditions. $E^V$ is continuous and even infinitely differentiable in all of its parameters. Therefore, the results are also true for sufficiently small random matrices. If we use, for instance, an induced $l^2$ norm for the matrices, then the norm of a starting matrix is the same in the original, or in the orthonormal $u_i$, basis. Equation (15) yields a trivial upper bound of $n^{1/2}$ for the initial norm, which roughly corresponds to having random initial weights of order at most $n^{-1/2}$ in the original basis. Thus, heuristically, the variance of the initial random weights should be a function of the size of the network. This condition is not satisfied in many of the usual simulations found in the literature, where initial weights are generated randomly and independently using, for instance, a centered Gaussian distribution with fixed standard deviation.
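The scaling heuristic in this remark can be illustrated directly. The following sketch (added for illustration; the sizes and the 0.1 coefficient are arbitrary) compares the induced $l^2$ (spectral) norm of an $n \times n$ Gaussian initial matrix whose standard deviation is fixed with one whose standard deviation shrinks as $n^{-1/2}$: the first grows roughly like $\sqrt{n}$, while the second stays bounded, consistent with keeping the initial weights small in the sense required by (15).

```python
import numpy as np

rng = np.random.default_rng(0)

# Induced l2 (spectral) norm of an n x n Gaussian initial weight matrix:
# fixed standard deviation versus a standard deviation scaled as n^{-1/2}.
for n in (10, 100, 1000):
    W_fixed  = rng.normal(0.0, 0.1,              size=(n, n))  # fixed std
    W_scaled = rng.normal(0.0, 0.1 / np.sqrt(n), size=(n, n))  # std ~ n^{-1/2}
    norm_fixed  = np.linalg.norm(W_fixed, 2)    # grows roughly like 0.2 * sqrt(n)
    norm_scaled = np.linalg.norm(W_scaled, 2)   # stays roughly constant (about 0.2)
    print(f"n={n:5d}  fixed-std norm={norm_fixed:6.2f}  "
          f"n^(-1/2)-std norm={norm_scaled:4.2f}")
```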
When more arbitrary conditions are considered, in the initial weights or in the noise, multiple local minima can appear in the validation function. As can be seen in one of the curves of the example given in Fig. 2, there exist even cases where the first minimum is not the deepest one, although these may be rare. Also in this figure, better validation results seem to be obtained with smaller initial conditions. This can easily be understood, in this small-dimensional example, from some of the arguments given in the Appendix.

Another potentially interesting and relevant phenomenon is illustrated in Fig. 3. It is possible to have a situation where, after a certain number of training cycles, both the LMS and the validation functions appear to be flat and to have converged to their asymptotic values. If training is continued, however, one observes that these plateaus can come to an end.

Finally, we have made an implicit distinction between validation and generalization throughout most of the previous sections. If generalization performance is measured by the LMS error calculated over the entire population, it is clear that our main result can be applied to the generalization error by assuming that $\Sigma_{nn} = U\,\mathrm{diag}(\tilde{\nu}_i)\,U'$ and $\tilde{\nu}_i = \nu_i$ for every $i$. In particular, in the second statement of the main