give lower bounds. In particular, the question whether the error function has any local minimum remains open despite its disarming simplicity.

In the case of nonlinear units, few analytical results are known, but certainly local minima do appear. An important remark, however, has been made by Bourlard and Kamp [18]. In the autoassociative mode, it is natural to use linear units in the output layer. Under these conditions, nothing is to be gained by using nonlinear elements in the hidden layer. This is basically because the network is trying to approximate a linear map: the identity function. This result can be extended to any linear map. That is, if the set of pairs (x_t, y_t) of examples is such that y_t = F(x_t) for every t with linear F, then nonlinear units in the hidden layer can lead to an approximation of F which is at best equivalent to the approximation obtainable by using linear units exclusively. Reports of simulations in the literature confirm this point and sometimes seem to indicate that the solution found using nonlinear elements is "close" to PCA (Cottrell et al. [19]).

Finally, if it is not desirable to assume the existence of a preprocessing stage where the data are centered, then the theory can easily be extended to the case of linear units with bias (see, for instance, [18] and [20] for more details).
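As an illustration of why the identity target leaves nothing for a nonlinear hidden layer to exploit, the following sketch (not from the paper; it uses NumPy, synthetic centered Gaussian data, and arbitrary sizes n, p, T) computes the best rank-p linear autoassociative map, which is the PCA projector U_p U_p'. Its reconstruction error is the benchmark that a network with a nonlinear hidden layer can at best match.

```python
# Sketch (not from the paper): for centered data, the optimal rank-p linear
# autoassociative map is the PCA projector U_p U_p'.  A nonlinear hidden layer
# can at best match this reconstruction error (the Bourlard-Kamp remark).
import numpy as np

rng = np.random.default_rng(0)
n, p, T = 8, 3, 5000
X = rng.standard_normal((T, n)) @ rng.standard_normal((n, n))   # correlated inputs
X -= X.mean(axis=0)                                              # center the data

Sxx = X.T @ X / T                                                # sample covariance
eigval, eigvec = np.linalg.eigh(Sxx)                             # ascending eigenvalues
Up = eigvec[:, -p:]                                              # top-p eigenvectors

P = Up @ Up.T                                                    # rank-p PCA projector
err_pca = np.mean(np.sum((X - X @ P) ** 2, axis=1))
print("PCA reconstruction error     :", err_pca)
print("sum of discarded eigenvalues :", eigval[: n - p].sum())
```

The two printed numbers coincide: the optimal rank-p reconstruction error equals the sum of the discarded eigenvalues of the input covariance, the familiar PCA optimum.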
C. Noise Analysis

How robust are the previous results against the effects of noise? Different sorts of noise can be introduced, for instance at the level of the synaptic weights or of the activation functions. To fix the ideas, assume in our case that the activation functions in both the hidden layer and the output layer are "noisy." Hence, for an input x, the output of the hidden layer is w = Ax + n and the activity in the output units is z = Bw + e = BAx + Bn + e. Assume that the noise terms n and e have mean zero, covariance matrices Σ_nn and Σ_ee, and that they are uncorrelated with each other and with the patterns x and y. It is also reasonable to assume for simplicity that Σ_nn is full rank. We are now interested in the problem of minimizing

    Ẽ(A, B) = ⟨‖y − (BAx + Bn + e)‖²⟩ = E(A, B) + trace(B Σ_nn B') + trace(Σ_ee).   (10)

Observe that the term trace(Σ_ee) is just an additive constant and has no influence on the variation of Ẽ with A and B. For any positive μ, Ẽ(μA, B/μ) − Ẽ(A, B) = trace(B Σ_nn B')(1 − μ²)/μ². Thus, if B ≠ 0 and μ > 1, then Ẽ(μA, B/μ) < Ẽ(A, B). As a result, without any additional constraints, there is no pair (A, B) which minimizes Ẽ. This is intuitively clear, as the network will try to make A as large as possible and/or B as small as possible so that the signal dominates the noise. It is therefore necessary to restrict the power of the signal Ax. One way of accomplishing this is to introduce "soft constraints" by adding penalty terms like ⟨‖Ax‖²⟩ or trace(AA') to Ẽ. Some results in this direction have been obtained in Plumbley [21].
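The decomposition (10) and the scaling argument above can be checked with a small Monte Carlo sketch (illustrative only; the linear target F, the covariances, the sample size, and the helper names are arbitrary choices, not taken from the paper).

```python
# Sketch (not from the paper): Monte Carlo check of the decomposition
# E~(A,B) = E(A,B) + trace(B Snn B') + trace(See) in (10), and of the scaling
# argument E~(mu A, B/mu) < E~(A, B) for mu > 1 and B != 0.
import numpy as np

rng = np.random.default_rng(1)
n, p, m, T = 6, 2, 4, 100000
A = rng.standard_normal((p, n))
B = rng.standard_normal((m, p))
F = rng.standard_normal((m, n))                  # targets y = F x (any linear map)

Snn = 0.5 * np.eye(p)                            # hidden-layer noise covariance
See = 0.1 * np.eye(m)                            # output-layer noise covariance

x = rng.standard_normal((T, n))
y = x @ F.T
noise_n = rng.multivariate_normal(np.zeros(p), Snn, T)
noise_e = rng.multivariate_normal(np.zeros(m), See, T)

def E_noisy(A, B):
    # <|| y - (B A x + B n + e) ||^2>, estimated on the common sample above
    z = (x @ A.T + noise_n) @ B.T + noise_e
    return np.mean(np.sum((y - z) ** 2, axis=1))

E_clean = np.mean(np.sum((y - x @ A.T @ B.T) ** 2, axis=1))

lhs = E_noisy(A, B)
rhs = E_clean + np.trace(B @ Snn @ B.T) + np.trace(See)
print("decomposition (10):", lhs, "~", rhs)

mu = 2.0
print("E~(mu A, B/mu) < E~(A, B):", E_noisy(mu * A, B / mu) < E_noisy(A, B))
```

The comparison with μ > 1 comes out strictly smaller, which is the sense in which the unconstrained noisy problem has no minimizing pair (A, B).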
The other possibility, which we shall consider here in more detail, is to explicitly restrict A to some compact subset 𝒜 of the set of all p × n matrices, for instance, a sphere centered at zero (the case of "hard constraints"). This leads to the problem of minimizing (10) with A ∈ 𝒜 and B arbitrary, which clearly has a well-defined solution. An optimal A must lie on the boundary ∂𝒜 of 𝒜 (if not, we could find a μ > 1 such that μA ∈ 𝒜).

Let us write Σ_nn = σR, where σ > 0 measures the noise level and R is some structure matrix (the simplest case is R = I, but if the units are physically close it may be unnatural to assume that the individual components of the noise are uncorrelated). The explicit dependence of Ẽ on σ can be taken into account by writing

    Ẽ(A, B) = Ẽ_σ(A, B) = E(A, B) + σ trace(B R B') + trace(Σ_ee).   (11)

As soon as σ ≤ 1 (for example), it is straightforward to see that the solutions of the problem of minimizing Ẽ with A ∈ 𝒜 are identical to the solutions of minimizing Ẽ with A ∈ 𝒜 and B ∈ ℬ, where ℬ is some fixed compact set independent of σ. By (11), as σ → 0, Ẽ(A, B) converges uniformly to E(A, B) + trace(Σ_ee) over the compact set 𝒜 × ℬ. Since these two functions differ only by an additive constant, the solutions of the noisy constrained problem approach the set of all pairs of matrices (A, B) satisfying (4) and (5) with in addition A ∈ 𝒜 (this automatically forces B to be in ℬ, by restricting the matrices C). In other words, if 𝒜_σ is the set of all matrices A ∈ 𝒜 which are optimal for noise level σ, then

    lim_{σ→0} 𝒜_σ = {A ∈ ∂𝒜 : A = C U_p' Σ_yx Σ_xx⁻¹ with invertible C}

(provided of course that the latter set is nonempty). That is, as is intuitively clear, if σ is very small, the solutions of the constrained noisy problem are essentially the same as the solutions of the nonnoisy problem.

In the Appendix we show that as σ → ∞,

    min_B Ẽ_σ(A, B) = trace(Σ_yy + Σ_ee) − σ⁻¹ trace(M A' R⁻¹ A) + O(σ⁻²)

uniformly over A, where M = Σ_xy Σ_yx. Hence, if σ is very large, minimizing Ẽ_σ over 𝒜 is essentially equivalent to maximizing ψ(A) = trace(M A' R⁻¹ A) over 𝒜.
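For a fixed A, the minimization over B is an ordinary linear regression of y on w = Ax + n, so min_B Ẽ_σ(A, B) has a closed form that can be compared with the large-σ expansion above. The sketch below (illustrative; the particular covariances, the structure matrix R, and the helper names are ours, not the paper's) performs this comparison.

```python
# Sketch (not from the paper): for fixed A, the optimal B is the regression of
# y on w = Ax + n, giving
#   min_B E~_sigma(A,B) = trace(Syy + See)
#                          - trace(Syx A' (A Sxx A' + sigma R)^{-1} A Sxy),
# which for large sigma behaves like
#   trace(Syy + See) - (1/sigma) trace(M A' R^{-1} A),  with M = Sxy Syx.
import numpy as np

rng = np.random.default_rng(3)
n, p, m = 7, 3, 4
F = rng.standard_normal((m, n))                 # underlying linear map y = F x
Sxx = np.cov(rng.standard_normal((n, 5 * n)))   # an arbitrary input covariance
Sxy = Sxx @ F.T                                 # Sxy = <x y'>, Syx = Sxy'
Syy = F @ Sxx @ F.T
See = 0.05 * np.eye(m)
R = np.eye(p) + 0.1 * np.ones((p, p))           # correlated hidden-layer noise
A = rng.standard_normal((p, n))

def minB_cost(A, sigma):
    Sww = A @ Sxx @ A.T + sigma * R             # covariance of w = Ax + n
    return np.trace(Syy + See) - np.trace(Sxy.T @ A.T @ np.linalg.solve(Sww, A @ Sxy))

M = Sxy @ Sxy.T                                 # M = Sxy Syx
sigma = 1e3
exact = minB_cost(A, sigma)
approx = np.trace(Syy + See) - np.trace(M @ A.T @ np.linalg.solve(R, A)) / sigma
print("exact min_B cost :", exact)
print("1/sigma expansion:", approx)             # agreement up to O(sigma^-2)
```

At σ = 10³ the exact cost and the first-order expansion agree to several digits, which is the sense in which the high-noise problem reduces to maximizing ψ(A).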
The solution set 𝒜_∞ to this "asymptotic problem" depends significantly on the choice of the constraint set 𝒜. For instance, if 𝒜 = {A : ‖A‖_F² = trace(AA') ≤ p} (‖A‖_F is the so-called Frobenius norm of A), then 𝒜_∞ consists of all A of the form √p v u', where u and v are normalized principal eigenvectors of M and R⁻¹, respectively. On the other hand, if 𝒜 = {A : AA' = I_p} (i.e., the rows of A are orthonormal) and R = I, then the rows of the optimal A span the space of the first p principal eigenvectors of M (for details, see the Appendix). Now notice that AA' = I_p implies trace(AA') = p. Hence in the high-noise case, full PCA of M is inferior to extraction of the first principal component of M only. Or, in other words, it is better not to force the rows of A to orthonormality, but to allow them to cooperate (build "balanced" representations) instead. In this sense, if σ is very large and 𝒜 is "rich" enough, the solutions of the constrained noisy problem are of maximum redundancy, where all the hidden units try to do the same thing.
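The advantage of "balanced" rows over orthonormal rows in the high-noise regime is easy to check numerically. A minimal sketch with R = I (the matrix M below is an arbitrary symmetric positive semidefinite stand-in for Σ_xy Σ_yx; the names and sizes are ours):

```python
# Sketch (assumes R = I; not from the paper): in the high-noise limit one
# maximizes psi(A) = trace(M A' A).  Under trace(AA') <= p the optimum is the
# rank-one "balanced" matrix sqrt(p) v u1', giving p * lambda_1(M); under
# AA' = I_p (orthonormal rows, PCA of M) one only gets lambda_1 + ... + lambda_p.
import numpy as np

rng = np.random.default_rng(2)
n, p = 10, 3
G = rng.standard_normal((n, n))
M = G @ G.T                                    # symmetric PSD surrogate for Sxy Syx

lam, U = np.linalg.eigh(M)
order = np.argsort(lam)[::-1]
lam, U = lam[order], U[:, order]               # eigenvalues in decreasing order

def psi(A, M):
    return np.trace(M @ A.T @ A)

v = np.zeros(p)
v[0] = 1.0                                     # with R = I any unit vector works
A_balanced = np.sqrt(p) * np.outer(v, U[:, 0])  # rank one, ||A||_F^2 = p
A_pca = U[:, :p].T                              # rows = top-p eigenvectors of M

print("psi(balanced) =", psi(A_balanced, M), "= p*lambda_1 =", p * lam[0])
print("psi(PCA rows) =", psi(A_pca, M), "= lambda_1+...+lambda_p =", lam[:p].sum())
```

Since λ_1 + ... + λ_p ≤ p λ_1, the rank-one "balanced" solution always attains at least the value of the orthonormal (PCA-like) one, with equality only when the top p eigenvalues of M are all equal.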