$A$ and $B$ define a critical point of $E$ if and only if there exist an ordered $p$-index set $I$ and an invertible $p \times p$ matrix $C$ such that

$A = C U_I' \Sigma_{YX} \Sigma_{XX}^{-1}$   (4)

$B = U_I C^{-1}$.   (5)

For such a critical point we have

$W = P_{U_I} \Sigma_{YX} \Sigma_{XX}^{-1}$   (6)

and

$E(A, B) = \operatorname{trace}(\Sigma_{YY}) - \sum_{i \in I} \lambda_i$.

Therefore, a critical $W$ of rank $p$ is always the product of the ordinary least-squares regression matrix followed by an orthogonal projection onto the subspace spanned by $p$ eigenvectors of $\Sigma$. The critical map $W$ associated with the index set $\{1, \ldots, p\}$ is the unique local and global minimum of $E$. The remaining $\binom{m}{p} - 1$ $p$-index sets correspond to saddle points. All additional critical points defined by matrices $A$ and $B$ which are not of full rank are also saddle points and can be characterized in terms of orthogonal projections onto subspaces spanned by $q$ eigenvectors, with $q < p$.
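As a concrete illustration of (4)-(6), the minimal NumPy sketch below builds the globally optimal map directly from sample covariance estimates, taking $C = I_p$ and $I = \{1, \ldots, p\}$. Here $\Sigma$ is taken to be $\Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}$, the covariance matrix of the least-squares predictor of $y_t$; the synthetic data, dimensions, and variable names are illustrative assumptions only, not part of the original analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, m, T = 8, 3, 5, 2000          # illustrative sizes: n inputs, p hidden, m outputs

# Synthetic centered data: targets are a noisy linear function of the inputs.
X = rng.standard_normal((T, n))
Y = X @ rng.standard_normal((n, m)) + 0.1 * rng.standard_normal((T, m))

# Sample covariance matrices (the data are assumed centered).
Sxx = X.T @ X / T
Syx = Y.T @ X / T
Syy = Y.T @ Y / T

# Sigma = Syx Sxx^{-1} Sxy, whose eigenvectors u_i appear in (4)-(6).
Sigma = Syx @ np.linalg.solve(Sxx, Syx.T)
Sigma = (Sigma + Sigma.T) / 2        # symmetrize against round-off
lam, U = np.linalg.eigh(Sigma)
order = np.argsort(lam)[::-1]        # eigenvalues in decreasing order
lam, U = lam[order], U[:, order]

# Global minimum: index set {1, ..., p} and C = I_p in (4)-(5).
U_I = U[:, :p]
ols = Syx @ np.linalg.inv(Sxx)       # ordinary least-squares regression matrix
A = U_I.T @ ols                      # eq. (4)
B = U_I                              # eq. (5)
W = B @ A                            # eq. (6): P_{U_I} Syx Sxx^{-1}

# Value of E at the minimum versus the empirical quadratic error
# (both with the 1/T normalization used for the covariances).
E_min = np.trace(Syy) - lam[:p].sum()
E_emp = np.mean(np.sum((Y - X @ W.T) ** 2, axis=1))
print(E_min, E_emp)
```

With these conventions the two printed values coincide, illustrating the trace formula for $E$ at the global minimum.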
In the autoassociative case, (4)-(6) become

$A = C U_I'$   (7)

$B = U_I C^{-1}$   (8)

$W = P_{U_I}$   (9)

and therefore the unique locally and globally optimal map $W$ is the orthogonal projection onto the space spanned by the first $p$ eigenvectors of $\Sigma_{XX}$.

This analysis links backpropagation in linear networks to several classical statistical techniques. In particular, at the global minimum of $E$, if $C = I_p$ then the activities in the hidden layer are given by $u_1'\hat{y}_t, \ldots, u_p'\hat{y}_t$, the principal components of the least-squares estimators $\hat{y}_t$ (see, for instance, [8]). In the autoassociative mode, these activities are given by $u_1'x_t, \ldots, u_p'x_t$, and correspond to the coordinates of the vector $x_t$ along the first $p$ eigenvectors of $\Sigma_{XX}$, as in the usual PCA. In general, if the initial conditions are random, one should not expect the backpropagation algorithm to converge to an optimum satisfying $C = I_p$. In the autoassociative case, this means that the rows of the final $A$ and $u_1, \ldots, u_p$ will span the same space but $A' \neq [u_1, \ldots, u_p]$. Although at first sight this may seem a drawback, it must be regarded as a property leading to more robust networks. Indeed, in a physical implementation where the compressed version of the data in the hidden layer is to be sent to further processing layers, it may not be desirable that one of the units, extracting the principal component, has a variance much larger than the other units (it is known, for instance, that in the case of random symmetric matrices, $\lambda_2 \ll \lambda_1$ almost always; see [11]). A more balanced strategy, where all the variances in the hidden layer are comparable, is by far preferable and is commonly observed in simulations.

Since the optimal solution can be expressed analytically, it can also be obtained effectively with numerical analysis techniques without resorting to any descent procedure. As pointed out in the Introduction, however, this is not the most relevant point of view here, where the emphasis is on the learning behavior and emergent organizational principles of simple adaptive networks.

One of the central issues in learning from examples is the problem of generalization, that is, how does the network perform when exposed to a pattern never seen previously? In this setting, a precise quantitative answer can be given to this question. For instance, in the autoassociative case, the distortion of a new pattern is given by its distance to the subspace generated by the first $p$ eigenvectors of $\Sigma_{XX}$.
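To illustrate the autoassociative case, the sketch below (same caveats: synthetic data, arbitrary dimensions, learning rate, and iteration count chosen only for convenience) trains an $n$-$p$-$n$ linear autoencoder by full-batch gradient descent. It checks that the product $BA$ approaches the projection of (9), that the rows of the learned $A$ span the principal subspace without coinciding with the eigenvectors themselves, and that the distortion of a new pattern is its distance to that subspace.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, T = 10, 3, 4000                          # illustrative sizes

# Synthetic centered data with a dominant p-dimensional subspace
# (standard deviations 3, 2, 1 versus 0.1 in the remaining directions).
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
scales = np.concatenate([np.array([3.0, 2.0, 1.0]), 0.1 * np.ones(n - p)])
X = (rng.standard_normal((T, n)) * scales) @ Q.T

Sxx = X.T @ X / T
lam, U = np.linalg.eigh(Sxx)
order = np.argsort(lam)[::-1]
U_p = U[:, order[:p]]                          # first p eigenvectors of Sxx
P_opt = U_p @ U_p.T                            # optimal map (9): orthogonal projector

# Full-batch gradient descent on E(A, B) = (1/T) sum_t ||x_t - B A x_t||^2.
A = 0.1 * rng.standard_normal((p, n))          # input-to-hidden weights
B = 0.1 * rng.standard_normal((n, p))          # hidden-to-output weights
eta = 0.02
for _ in range(5000):
    G = (B @ A) @ Sxx - Sxx                    # dE/dW up to a constant factor
    gA, gB = B.T @ G, G @ A.T
    A -= eta * gA
    B -= eta * gB

# BA converges to the projector even though A itself is not made of eigenvectors.
print("||BA - P_opt||:", np.linalg.norm(B @ A - P_opt))
print("rows of A lie in the principal subspace:",
      np.allclose(P_opt @ A.T, A.T, atol=1e-2))
print("A' equals [u_1, ..., u_p]?", np.allclose(A.T, U_p, atol=1e-2))  # typically False

# Generalization: distortion of a new pattern = distance to the principal subspace.
x_new = (rng.standard_normal(n) * scales) @ Q.T
print("distortion:", np.linalg.norm(x_new - P_opt @ x_new))
```

The learned hidden representation is thus an invertible recombination of the first $p$ principal components rather than the components themselves, consistent with the balanced-variance behavior discussed above.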
In the special case where $\operatorname{rank}(\Sigma_{XY}) = r = p$, Gallinari et al. [12] have shown that if an $n$-$p$-$m$ architecture is trained to classify $n$-dimensional inputs into $m$ ($m < n$) classes, then the corresponding network performs discriminant analysis in the sense that, for an optimal $W = BA$, $A'$ is a DA matrix. In other words, under these assumptions, the projection realized by $A'$ maximizes the ratio given in (3). In this context, however, either $p = r = m$, in which case the architecture is $n$-$m$-$m$ and there is no bottleneck, or $r < m$ and then full classification into $m$ categories is not supported by the available data and there is no proper data compression (only filtering out of linear dependencies). In any case, all the network ever learns is to be a mean-square classifier, and this can be achieved without any hidden layer.

B. Deep Networks, Local Connectivity, Nonlinearities, and Bias

In Baldi [13], the case of deep networks with multiple hidden layers is briefly examined. It is easy to see that, in this case, the main constraint on the network comes from its bottleneck, that is, from the hidden layer with smallest size $p$ (clearly, $p$ could be attained in more than one hidden layer). Although the expression for the critical points may now become more involved, the main features of the landscape are unchanged: a multiplicity of saddle points, an absence of local minima, and a unique optimal input/output map satisfying (6) with $I = \{1, \ldots, p\}$.

The bottleneck layer imposes a rank restriction on the map computed by the network. Additional important constraints can be introduced on the geometry of the connections. Often connections are assumed to be local, in the sense that a unit in one layer receives projections only from a restricted subset of elements in the previous layer, for instance according to a Gaussian distribution. These geometrical constraints play an essential role in self-organizing maps and in several models of "linear" cortical development; see for instance Linsker [14]-[16] and Miller et al. [17]. These topics deserve separate treatment and will not be addressed here. As mentioned in the previous section, however, in the case of a locally connected linear network without any hidden layer the landscape of the usual quadratic error is again completely devoid of local minima. Learning by descent methods should then be efficient. The landscape properties of the LMS (least mean square) error function of a linear locally connected multilayer network have not been carefully studied yet, and the previous results only