on the synaptic weights. In the main case of backpropagation, the error function is

$$E(A, B) = \langle \, \lVert y_t - B A x_t \rVert^2 \, \rangle \qquad (1)$$

where $\lVert u \rVert$ represents the Euclidean norm of the vector $u$. When no target outputs are provided, the learning (which then must be based on criteria to be specified, such as the maximization of the output variance) is unsupervised. An important special case of unsupervised learning is the case of autoassociation, when the input is used as a teacher (i.e., $y_t = x_t$). This is also called autoencoding or identity mapping in the literature.

Learning rules are algorithms for slowly altering the connection weights to achieve a desirable goal such as the minimization of an error function. Often, three different versions of the same rule have been given: the "on-line" version, where the modification is calculated after the presentation of each pattern; the "off-line" version, where the previous modifications are averaged over the cycle of all patterns; and the "continuous" version, where the discrete changes induced by the "off-line" algorithm are approximated continuously by a differential equation governing the evolution of the weights in time. In some cases, the three formulations can be shown to lead to essentially the same results.

It will be convenient to use the notation $\Sigma_{uv} = \langle uv' \rangle$, where the prime denotes transposition of matrices. If both $\langle u \rangle$ and $\langle v \rangle$ are zero, $\Sigma_{uv}$ is the covariance matrix of $u$ and $v$. $\Sigma_{xx}$, for instance, is a real $n \times n$ symmetric nonnegative definite matrix. Hence, its eigenvalues can be ordered as $\lambda_1 \ge \cdots \ge \lambda_n \ge 0$. For mathematical simplicity, we shall often assume that in fact $\lambda_1 > \cdots > \lambda_n > 0$. This should not be regarded as a very restrictive assumption, since this condition can always be enforced by, at worst, perturbing the data by infinitesimal amounts and attributing these perturbations to "noise." Many conclusions are only slightly different when some eigenvalues coincide.

A good familiarity with linear algebra and basic calculus on the part of the reader should be sufficient to follow the paper. All the statistical techniques required to understand some of the results are briefly reviewed in the second section. These include least-squares regression, principal component analysis, and discriminant analysis. In Section III, we treat the case of supervised learning with backpropagation and the corresponding autoassociative special case. We study the landscape of the error function $E$ of (1), its connections to the previously mentioned statistical techniques, and several consequences and generalizations, including noisy and deep networks. In Section IV, we study the problems of validation, generalization, and overfitting in a simple one-layer network trained to learn the identity map. Under some assumptions, we give a complete description of the evolution of the validation error as a function of training time. Section V covers a variety of unsupervised learning algorithms, based on variance maximization/minimization by Hebbian or anti-Hebbian learning or other error functions. Some of the more technical proofs are deferred to the Appendix.
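As an informal illustration (not part of the original paper), the error function (1) and the covariance notation $\Sigma_{xx}$ can be evaluated directly with NumPy. The array names, shapes, and random data below are assumptions chosen only for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: T patterns, n inputs, m outputs, p hidden units (all assumed).
T, n, m, p = 500, 8, 8, 3
X = rng.standard_normal((T, n))          # rows are the input patterns x_t'
Y = X.copy()                             # autoassociation: y_t = x_t

A = rng.standard_normal((p, n)) * 0.1    # first-layer weights (n -> p)
B = rng.standard_normal((m, p)) * 0.1    # second-layer weights (p -> m)

# Error function (1): E(A, B) = < || y_t - B A x_t ||^2 >, averaged over patterns.
residual = Y - X @ A.T @ B.T
E = np.mean(np.sum(residual ** 2, axis=1))

# Covariance notation: Sigma_xx = <x x'> (x assumed zero-mean), a symmetric
# nonnegative definite n x n matrix with ordered eigenvalues lambda_1 >= ... >= lambda_n.
Sigma_xx = X.T @ X / T
eigvals = np.sort(np.linalg.eigvalsh(Sigma_xx))[::-1]
```

An "on-line" rule would update $A$ and $B$ after each row of `X`, while the "off-line" version averages the same updates over all $T$ patterns before applying them; the sketch only evaluates the objective itself.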
II. MATHEMATICAL BACKGROUND

A. Optimization of Quadratic Forms over Spheres

Let $S$ be a symmetric $n \times n$ matrix. Then all the eigenvalues $\lambda_i$ of $S$ are real and can be ordered in the form $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$, with corresponding normalized eigenvectors $u_1, \ldots, u_n$. Consider the problem of maximizing the quadratic form $E(a) = a'Sa$ over the sphere of radius $\rho$ centered at the origin ($\lVert a \rVert \le \rho$). In geometry, it is well known (see, for instance, [3]) that the maximum of $E$ is then reached on the surface of the sphere in the direction of the first eigenvector, that is, at the points $\pm\rho u_1$, where $E(\pm\rho u_1) = \lambda_1 \rho^2$. If $\lambda_1 > \lambda_2$, $\pm\rho u_1$ are the only two solutions. Similarly, the maximum of $E$ over the intersection of the sphere with the linear space orthogonal to $u_1$ is reached at $\pm\rho u_2$, and so forth. Finally, the minimum of $E$ over the entire sphere is obtained at $\pm\rho u_n$. All these properties are easily derived by decomposing $a$ as $a = \sum_i \alpha_i u_i$ and noticing that $E(a) = \sum_i \lambda_i \alpha_i^2$.
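The following short numerical check (not from the paper; the matrix $S$, dimension $n$, and radius $\rho$ are arbitrary choices) verifies that the quadratic form attains its maximum $\lambda_1 \rho^2$ at $\pm\rho u_1$.

```python
import numpy as np

rng = np.random.default_rng(1)

# A symmetric matrix S and a sphere radius rho (both assumed for the example).
n, rho = 5, 2.0
M = rng.standard_normal((n, n))
S = (M + M.T) / 2

# Eigenvalues in decreasing order lambda_1 >= ... >= lambda_n, eigenvectors u_1, ..., u_n.
lam, U = np.linalg.eigh(S)
lam, U = lam[::-1], U[:, ::-1]

def E(a):
    """Quadratic form E(a) = a' S a."""
    return a @ S @ a

# The maximum over ||a|| <= rho is attained at +/- rho * u_1 with value lambda_1 * rho^2.
a_star = rho * U[:, 0]
assert np.isclose(E(a_star), lam[0] * rho ** 2)

# A random point on the sphere never exceeds that value (up to numerical noise).
a_rand = rng.standard_normal(n)
a_rand = rho * a_rand / np.linalg.norm(a_rand)
assert E(a_rand) <= lam[0] * rho ** 2 + 1e-9
```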
B. Singular Value Decomposition

Let $Z$ be an arbitrary $k \times l$ matrix with rank $r$. Then there exist numbers $\sigma_1 \ge \cdots \ge \sigma_r > 0$, the singular values of $Z$, an orthogonal $k \times k$ matrix $U$, and an orthogonal $l \times l$ matrix $V$ such that $S = U'ZV$ is a $k \times l$ diagonal matrix of the form

$$S = \begin{pmatrix} D & 0 \\ 0 & 0 \end{pmatrix}$$

where $D = \mathrm{diag}(\sigma_1, \ldots, \sigma_r)$ is the diagonal matrix with entries $\sigma_1, \ldots, \sigma_r$. The decomposition

$$Z = USV'$$

is called the singular value decomposition (SVD) of $Z$ (it is not necessarily unique). The matrices $U$ and $V$ in the SVD have the following meaning. As $Z'ZV = VS'U'USV'V = VS'S = V\,\mathrm{diag}(\sigma_1^2, \ldots, \sigma_r^2, 0, \ldots, 0)$, the columns of $V$ are unit-length, mutually perpendicular eigenvectors of $Z'Z$, and $\sigma_1^2, \ldots, \sigma_r^2$ are the nonzero eigenvalues of $Z'Z$. Similarly, the columns of $U$ are unit-length, mutually perpendicular eigenvectors of $ZZ'$. With the aid of the SVD, the pseudoinverse of $Z$ can easily be given explicitly. If we write

$$S^+ = \begin{pmatrix} D^{-1} & 0 \\ 0 & 0 \end{pmatrix}$$

then $Z^+ = VS^+U'$ is the pseudoinverse of $Z$ (see, for instance, [4] and [5] for more details on SVD and pseudoinverses).
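A minimal sketch (not from the paper; the dimensions of $Z$ are assumptions) confirming the stated relations between the SVD factors, the eigenvectors of $Z'Z$, and the pseudoinverse $Z^+ = VS^+U'$:

```python
import numpy as np

rng = np.random.default_rng(2)

# An arbitrary k x l matrix Z (dimensions assumed for the example; full rank here).
k, l = 6, 4
Z = rng.standard_normal((k, l))

# SVD: Z = U S V', with U (k x k) and V (l x l) orthogonal.
U, sigma, Vt = np.linalg.svd(Z)          # sigma holds sigma_1 >= ... >= sigma_r
S = np.zeros((k, l))
S[:len(sigma), :len(sigma)] = np.diag(sigma)
assert np.allclose(Z, U @ S @ Vt)

# The columns of V are eigenvectors of Z'Z, with nonzero eigenvalues sigma_i^2.
eigvals = np.sort(np.linalg.eigvalsh(Z.T @ Z))[::-1]
assert np.allclose(eigvals[:len(sigma)], sigma ** 2)

# Pseudoinverse Z+ = V S+ U', where S+ inverts the nonzero singular values.
S_plus = np.zeros((l, k))
S_plus[:len(sigma), :len(sigma)] = np.diag(1.0 / sigma)
Z_plus = Vt.T @ S_plus @ U.T
assert np.allclose(Z_plus, np.linalg.pinv(Z))
```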
C. Orthogonal Projections

If $\mathcal{L}$ is a linear subspace, we denote by $P_{\mathcal{L}}x$ the orthogonal projection of a vector $x$ onto $\mathcal{L}$ and by $P_{\mathcal{L}^\perp}x = Q_{\mathcal{L}}x = x - P_{\mathcal{L}}x$ its projection onto the orthogonal complement of $\mathcal{L}$. If $\mathcal{L}$ has dimension $l$ and is spanned by the linearly independent vectors $z_1, \ldots, z_l$, then $P_{\mathcal{L}}x = P_Z x$, where $Z = [z_1, \ldots, z_l]$ and $P_Z = Z(Z'Z)^{-1}Z'$. In particular, if the vectors $z_i$ are mutually perpendicular unit vectors, the projection of $x$ simply becomes $P_Z x = ZZ'x$.
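A small numerical illustration (not from the paper; the subspace dimension and spanning vectors are assumed) of the projector $P_Z = Z(Z'Z)^{-1}Z'$, its complement $Q_Z = I - P_Z$, and the orthonormal special case:

```python
import numpy as np

rng = np.random.default_rng(3)

# A subspace of R^n spanned by l linearly independent vectors (assumed for the example).
n, l = 6, 3
Z = rng.standard_normal((n, l))          # columns z_1, ..., z_l
x = rng.standard_normal(n)

# Projection matrix P_Z = Z (Z'Z)^{-1} Z' and complementary projector Q_Z = I - P_Z.
P = Z @ np.linalg.inv(Z.T @ Z) @ Z.T
Q = np.eye(n) - P

Px, Qx = P @ x, Q @ x
assert np.allclose(Px + Qx, x)           # x splits into its two projections
assert np.isclose(Px @ Qx, 0.0)          # which are mutually orthogonal

# With an orthonormal basis for the same subspace, P_Z reduces to Z Z'.
Q_orth, _ = np.linalg.qr(Z)              # orthonormal columns spanning col(Z)
assert np.allclose(Q_orth @ Q_orth.T, P)
```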