SCIENCE CHINA Information Sciences
March 2021, Vol. 64, 132103:1–132103:13
https://doi.org/10.1007/s11432-020-3023-7

RESEARCH PAPER

On the convergence and improvement of stochastic normalized gradient descent

Shen-Yi ZHAO, Yin-Peng XIE & Wu-Jun LI*

National Key Laboratory for Novel Software Technology, Department of Computer Science and Technology, Nanjing University, Nanjing 210023, China
* Corresponding author (email: liwujun@nju.edu.cn)

Received 20 March 2020 / Revised 6 May 2020 / Accepted 3 June 2020 / Published online 8 February 2021

© Science China Press and Springer-Verlag GmbH Germany, part of Springer Nature 2021

Abstract  Non-convex models, such as deep neural networks, have been widely used in machine learning applications. Training non-convex models is a difficult task owing to their saddle points. Recently, stochastic normalized gradient descent (SNGD), which updates the model parameter by a normalized gradient in each iteration, has attracted much attention. Existing results show that SNGD escapes saddle points better than classical training methods such as stochastic gradient descent (SGD). However, no existing study has provided a theoretical proof of the convergence of SNGD for non-convex problems. In this paper, we give the first convergence proof of SNGD for non-convex problems. In particular, we prove that SNGD achieves the same computation complexity as SGD. In addition, our convergence proof reveals that SNGD needs to adopt a small constant learning rate for its convergence guarantee, which makes SNGD perform poorly when training large non-convex models in practice. Hence, we propose a new method, called stagewise SNGD (S-SNGD), to improve the performance of SNGD. Different from SNGD, in which a small constant learning rate is necessary for the convergence guarantee, S-SNGD can adopt a large initial learning rate and reduce the learning rate by stage. The convergence of S-SNGD can also be theoretically proved for non-convex problems. Empirical results on deep neural networks show that S-SNGD achieves better performance than SNGD in terms of both training loss and test accuracy.

Keywords  non-convex problems, stochastic normalized gradient descent, computation complexity

Citation  Zhao S-Y, Xie Y-P, Li W-J. On the convergence and improvement of stochastic normalized gradient descent. Sci China Inf Sci, 2021, 64(3): 132103, https://doi.org/10.1007/s11432-020-3023-7

1 Introduction

Many machine learning models can be formulated as the following optimization problem:

$$\min_{w\in\mathbb{R}^d} F(w) := \mathbb{E}[f(w;a)], \qquad (1)$$

where $w$ denotes the model parameter, $a$ denotes a stochastic sample, and $f(w;a)$ denotes the loss function. When $a$ is uniformly sampled from a finite set $\{a_1, a_2, \ldots, a_n\}$, $F(w)$ can also be written as $\frac{1}{n}\sum_{i=1}^{n} f(w;a_i)$, where $n$ is the number of samples. The formulation in (1) covers a broad family of machine learning models, such as logistic regression and deep learning models.

Stochastic gradient descent (SGD) [1, 2] has been one of the most efficient optimization tools for solving (1). In the $t$-th iteration, SGD randomly selects some samples $\mathcal{I}_t$ and calculates a mini-batch gradient to update the model parameter:

$$w_{t+1} = w_t - \alpha g_t, \qquad (2)$$

where $g_t = \frac{1}{|\mathcal{I}_t|}\sum_{a\in\mathcal{I}_t}\nabla f(w_t;a)$ is a mini-batch gradient and $\alpha > 0$ is the learning rate. Compared to traditional batch training methods like gradient descent (GD), SGD does not need to calculate the full gradient $\nabla F(w_t)$, which often costs much computation time.
Hence, SGD is more popular and has been widely used in machine learning applications.
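As a concrete point of reference for update (2), the following is a minimal PyTorch-style sketch of one mini-batch SGD step. It is an illustration only, not the authors' code: `loss_fn` (the per-sample loss $f(w;a)$) and `batch` (the sampled $\mathcal{I}_t$) are hypothetical placeholders.

```python
import torch

def minibatch_grad(loss_fn, w, batch):
    # g_t = (1/|I_t|) * sum_{a in I_t} grad f(w; a), computed by autograd
    # on the averaged per-sample loss.
    w = w.detach().requires_grad_(True)
    loss = torch.stack([loss_fn(w, a) for a in batch]).mean()
    loss.backward()
    return w.grad.detach()

def sgd_step(w, g, alpha):
    # Update (2): w_{t+1} = w_t - alpha * g_t.
    return w - alpha * g
```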
Figure 1 (Color online) Smooth loss curve, illustrating a flat minimum, a saddle point, and a sharp minimum. The flat minimum usually leads to a small generalization error [11]. In the region of the saddle point, the gradient is small. In the region of the sharp minimum, the gradient is large. Intuitively, because $\|w_{t+1} - w_t\| = \alpha$ in SNGD, with a suitable $\alpha$, the normalized gradient can yield a faster escape from saddle points and sharp minima than the unnormalized gradient.

Many variants of SGD have been proposed [3–5], and there are also many studies that analyze the convergence of SGD [6–8].

With the rapid growth of data, more and more large non-convex models are adopted in machine learning applications to improve generalization ability. One typical example is the deep neural network with multiple layers, which has achieved success in many areas like image classification [9]. However, training non-convex models is a difficult task, because there may be many saddle points and sharp local minima in non-convex models [10, 11], especially in large models. When $w_t$ falls into the region of a saddle point, the gradient is small. When $w_t$ falls into the region of a sharp local minimum, the gradient is relatively large. Both cases can make the update (2) inefficient. In contrast, when $w_t$ falls into the region of a flat minimum, it usually leads to a small generalization error [11]. Figure 1 illustrates a saddle point, a sharp local minimum, and a flat minimum. Usually, we need to carefully choose the initialization [12] to escape the saddle points and sharp local minima.

Recently, stochastic normalized gradient descent (SNGD) [13–16] has attracted much attention for solving non-convex problems. SNGD is the stochastic version of normalized gradient descent (NGD) [17, 18]. In each iteration, SNGD adopts a normalized gradient to update the model parameter. In particular, SNGD can be written as

$$w_{t+1} = w_t - \alpha \frac{g_t}{\|g_t\|}, \qquad (3)$$

where $g_t = \frac{1}{|\mathcal{I}_t|}\sum_{a\in\mathcal{I}_t}\nabla f(w_t;a)$, $\alpha > 0$ is the learning rate, and $\|\cdot\|$ is the Euclidean norm (a code sketch of this update is given at the end of this section). It is easy to see that $\|w_{t+1} - w_t\| = \alpha$, no matter whether the gradient is large or small. Hence, SNGD has better performance than SGD when $w_t$ is around saddle points and sharp local minima. Theoretical results show that the normalized gradient can yield a faster escape from saddle points than SGD [14]. Although the convergence of SNGD in (3) for strictly-locally-quasi-convex problems has been theoretically proved in [13], none of the existing studies has provided a theoretical proof of the convergence of SNGD in (3) for non-convex problems. Furthermore, we find that SNGD does not perform well when training large non-convex models in practice, owing to the requirement of a small constant learning rate.

In this paper, we study the convergence and improvement of SNGD. The main contributions of this paper are outlined as follows.

• We theoretically prove the convergence of SNGD for non-convex problems. In particular, we prove that SNGD achieves the same computation complexity (total number of gradient computations) as SGD for non-convex problems.

• We propose a new method, called S-SNGD, to improve the performance of SNGD. Different from SNGD, which necessarily adopts a small constant learning rate for its convergence guarantee, S-SNGD can adopt a large initial learning rate and reduce the learning rate by stage.
The convergence of S-SNGD can also be theoretically proved for non-convex problems.

• Empirical results on deep neural networks show that S-SNGD achieves better performance than SNGD in terms of both training loss and test accuracy.
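For concreteness, the normalized update (3), on which both SNGD and S-SNGD are built, differs from the SGD step sketched above only in rescaling the mini-batch gradient to unit norm. A minimal sketch follows, assuming `w` and `g` are torch tensors; the `eps` safeguard against a zero gradient is our practical addition, not part of the paper's update rule.

```python
def sngd_step(w, g, alpha, eps=1e-12):
    # Update (3): w_{t+1} = w_t - alpha * g_t / ||g_t||.
    # The move has length alpha (up to eps) regardless of ||g_t||,
    # which is what lets SNGD keep moving near saddle points.
    return w - alpha * g / (g.norm() + eps)
```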
2 Preliminary and related work

2.1 Preliminary

In this paper, we use $\|\cdot\|$ to denote the Euclidean norm. For a random variable $X$, $\mathbb{E}[X]$ denotes the expectation of $X$. For any two positive sequences $\{x_n\}$ and $\{y_n\}$, $y_n = O(x_n)$ denotes that there exist two positive constants $c$ and $N$ such that $y_n \le c x_n$, $\forall n \ge N$. For any two positive integers $a$ and $b$, $\mathrm{mod}(a,b) = 0$ denotes that $a$ is divisible by $b$. For a function $\phi(w)$, $\nabla\phi(w)$ denotes the gradient of $\phi(w)$ at $w$ and $\nabla^2\phi(w)$ denotes the Hessian matrix of $\phi(w)$ at $w$.

We also give the following two definitions.

Definition 1. A function $\phi(w)$ is called an $L$-smooth function ($L > 0$) if $\forall u, w$,
$$\phi(u) \le \phi(w) + \nabla\phi(w)^{\mathrm{T}}(u - w) + \frac{L}{2}\|u - w\|^2.$$

Definition 2. A function $\phi(w)$ is called an $(L_0, L_1)$-smooth function ($L_0 > 0$, $L_1 \ge 0$) if $\phi(w)$ is twice differentiable and $\forall w$,
$$\|\nabla^2\phi(w)\| \le L_0 + L_1\|\nabla\phi(w)\|.$$

It is easy to find that if a function $\phi(w)$ is $(L_0, 0)$-smooth, it is also $L_0$-smooth, and if a function $\phi(w)$ is twice differentiable and $L$-smooth, it is also $(L, 0)$-smooth.

In this paper, we adopt the norm of the gradient to measure convergence. In particular, we call $w$ an $\epsilon$-stationary point if $\|\nabla F(w)\| \le \epsilon$. The computation complexity of an algorithm is defined as the total number of gradient computations. The iteration complexity of an algorithm is defined as the total number of model parameter updates.

2.2 Related work

NGD [17, 18] and its variants have attracted much attention for solving non-convex problems. In [14], the authors proposed Saddle-NGD, which can be formulated as

$$w_{t+1} = \begin{cases} w_t - \alpha\dfrac{\nabla F(w_t)}{\|\nabla F(w_t)\|}, & \text{if } \mathrm{mod}(t,q) \neq 0, \\[2mm] w_t - \alpha\dfrac{\nabla F(w_t)}{\|\nabla F(w_t)\|} + \xi_t, & \text{if } \mathrm{mod}(t,q) = 0, \end{cases} \qquad (4)$$

where $q > 0$ is a constant integer and $\xi_t$ is Gaussian noise. Saddle-NGD is a variant of NGD that adds noise perturbations. The work in [10] shows that Saddle-NGD achieves better iteration complexity (total number of model updates) than Noisy-GD [19].

The first work that analyzes the convergence of SNGD in (3) appears in [13]. In particular, it proves that for strictly-locally-quasi-convex problems, SNGD needs $O(1/\epsilon^2)$ iterations to achieve $F(\bar{w}) - F(w^*) \le \epsilon$, where $\bar{w}$ is the output of SNGD and $w^*$ is the optimal solution of (1). Hence, the proved computation complexity to achieve $F(\bar{w}) - F(w^*) \le \epsilon$ is $O(1/\epsilon^4)$, owing to a batch size of $O(1/\epsilon^2)$.

Recently, in [15, 16], the authors proposed mixture methods that combine SGD and SNGD. In particular, their updates can universally be written as

$$w_{t+1} = w_t - \eta_t g_t, \qquad (5)$$

where $g_t$ is some stochastic gradient and $\eta_t = \min\{\alpha/\|g_t\|, \beta\}$ with $\alpha > 0$ and $\beta > 0$. We can observe that when $\|g_t\|$ is larger than $\alpha/\beta$, the update in (5) is the same as that of SNGD; otherwise, it is the same as that of SGD. In [15], the authors proposed a method called SPIDER-SFO, which benefits from the stochastic path-integrated differential estimator defined as

$$g_t = \begin{cases} \dfrac{1}{|\mathcal{I}_t|}\displaystyle\sum_{a\in\mathcal{I}_t}[\nabla f(w_t;a) - \nabla f(w_{t-1};a)] + g_{t-1}, & \text{if } \mathrm{mod}(t,q) \neq 0, \\[2mm] \dfrac{1}{|\mathcal{I}_t|}\displaystyle\sum_{a\in\mathcal{I}_t}\nabla f(w_t;a), & \text{if } \mathrm{mod}(t,q) = 0. \end{cases}$$
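In code, this estimator can be maintained recursively. The sketch below reuses the hypothetical `minibatch_grad` helper from Section 1 and is our illustration rather than the implementation of [15]; note that the same batch $\mathcal{I}_t$ is evaluated at both $w_t$ and $w_{t-1}$.

```python
def spider_estimate(loss_fn, w_t, w_prev, g_prev, batch, t, q):
    # mod(t, q) == 0: restart with a fresh mini-batch gradient at w_t.
    # Otherwise: g_t = mean_a[grad f(w_t; a) - grad f(w_{t-1}; a)] + g_{t-1},
    # evaluated on the SAME batch at both points.
    if t % q == 0:
        return minibatch_grad(loss_fn, w_t, batch)
    return (minibatch_grad(loss_fn, w_t, batch)
            - minibatch_grad(loss_fn, w_prev, batch)) + g_prev
```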
With this estimator, SPIDER-SFO achieves a computation complexity of $O(1/\epsilon^3)$ under the criterion $\|\nabla F(\bar{w})\| \le \epsilon$, which is the best convergence result among existing first-order stochastic methods. However, whether $g_t/\|g_t\|$ can be adopted for small $\|g_t\|$, and the corresponding computation complexity, are unknown. Furthermore, owing to the requirement of a small constant learning rate $\alpha$, we find that these mixture methods are not efficient in practice.

3 Convergence of SNGD

We briefly present SNGD [13] in Algorithm 1. We can observe that SNGD is as simple as SGD. In each iteration, SNGD only needs to calculate a normalized mini-batch gradient to update the model parameter with a constant learning rate $\alpha$. Different from SGD, in which the stochastic gradient is an unbiased estimate of $\nabla F(w)$, the update direction in SNGD is a biased one. That means

$$\mathbb{E}\left[\left.\frac{g_t}{\|g_t\|}\,\right|\,w_t\right] \neq \nabla F(w_t).$$

Hence, the update direction is not necessarily the direction of $-\nabla F(w_t)$ in expectation, and this makes the convergence analysis difficult.

Algorithm 1 SNGD
1: Initialization: $w_0$, $\alpha$, $b$, $T$;
2: for $t = 0, 1, \ldots, T-1$ do
3:   Randomly choose $b$ samples, denoted by $\mathcal{I}_t$;
4:   Calculate a mini-batch gradient $g_t = \frac{1}{b}\sum_{a\in\mathcal{I}_t}\nabla f(w_t;a)$;
5:   $w_{t+1} = w_t - \alpha\frac{g_t}{\|g_t\|}$;
6: end for
7: Return $\bar{w}$, which is randomly chosen from $\{w_0, w_1, \ldots, w_T\}$.

First, we make the following assumption, which is common in stochastic optimization.

Assumption 1. $F(w) \ge 0$ and the stochastic gradient of $F(w)$ has a $\sigma$-bounded variance ($\sigma > 0$), i.e., $\mathbb{E}\|\nabla f(w;a) - \nabla F(w)\|^2 \le \sigma^2$, $\forall w$.

If $F(w)$ is $L$-smooth, we have the following lemma, which captures a general property of normalized gradient descent.

Lemma 1. Assume $F(w)$ is $L$-smooth ($L > 0$). For any $w, g \in \mathbb{R}^d$, define $w^+$ as follows:
$$w^+ = w - \alpha\frac{g}{\|g\|}, \qquad (6)$$
where $\alpha > 0$. Then we have
$$\|\nabla F(w)\| \le \frac{F(w) - F(w^+)}{\alpha} + \frac{L\alpha}{2} + 2\|\nabla F(w) - g\|.$$

Proof. Since $F(w)$ is $L$-smooth, we obtain
$$F(w^+) \le F(w) - \alpha\nabla F(w)^{\mathrm{T}}g/\|g\| + \frac{L\alpha^2}{2} = F(w) - \alpha(\nabla F(w) - g)^{\mathrm{T}}g/\|g\| - \alpha\|g\| + \frac{L\alpha^2}{2} \le F(w) + \alpha\|\nabla F(w) - g\| - \alpha\|g\| + \frac{L\alpha^2}{2}.$$
Combining this with the fact that $\|\nabla F(w)\| \le \|g\| + \|\nabla F(w) - g\|$, we obtain
$$\|\nabla F(w)\| \le \frac{F(w) - F(w^+)}{\alpha} + \frac{L\alpha}{2} + 2\|\nabla F(w) - g\|.$$

According to Lemma 1, we can set $w$, $w^+$, $g$ in (6) to $w_t$, $w_{t+1}$, $g_t$, respectively. Then we obtain the convergence result in Theorem 1 below.
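Before stating the result, note that Algorithm 1 translates almost line for line into code. The following minimal sketch reuses the hypothetical `minibatch_grad` and `sngd_step` helpers from the earlier sketches and assumes `data` is a list of samples; it is an illustration, not the authors' implementation.

```python
import random

def sngd(loss_fn, data, w0, alpha, b, T):
    # Algorithm 1: T normalized steps, then return a uniformly sampled iterate.
    iterates = [w0]
    w = w0
    for t in range(T):
        batch = random.sample(data, b)          # line 3: I_t, b random samples
        g = minibatch_grad(loss_fn, w, batch)   # line 4: g_t = (1/b) sum grad f(w_t; a)
        w = sngd_step(w, g, alpha)              # line 5: w_{t+1} = w_t - alpha g_t/||g_t||
        iterates.append(w)
    return random.choice(iterates)              # line 7: w-bar, uniform over {w_0,...,w_T}
```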
Theorem 1. Assume $F(w)$ is $L$-smooth ($L > 0$). Let $\{w_t\}$ be the sequence produced by Algorithm 1 with $b = \sigma^2/\epsilon^2$. Then we have
$$\frac{1}{T+1}\sum_{t=0}^{T}\mathbb{E}\|\nabla F(w_t)\| \le \frac{F(w_1)}{\alpha(T+1)} + \frac{L\alpha}{2} + 2\epsilon. \qquad (7)$$
Furthermore, let $\bar{w}$ denote the output of Algorithm 1. By setting $\alpha = 2\epsilon/L$ and $T = LF(w_1)/(2\epsilon^2)$, we obtain $\mathbb{E}\|\nabla F(\bar{w})\| \le 4\epsilon$. Hence, the iteration complexity and computation complexity to achieve an $\epsilon$-stationary point of SNGD are $O(1/\epsilon^2)$ and $O(1/\epsilon^4)$, respectively.

Proof. According to Lemma 1, we obtain
$$\mathbb{E}\|\nabla F(w_t)\| \le \frac{\mathbb{E}[F(w_t) - F(w_{t+1})]}{\alpha} + \frac{L\alpha}{2} + 2\mathbb{E}\|\nabla F(w_t) - g_t\| \le \frac{\mathbb{E}[F(w_t) - F(w_{t+1})]}{\alpha} + \frac{L\alpha}{2} + 2\sqrt{\mathbb{E}\|\nabla F(w_t) - g_t\|^2} \le \frac{\mathbb{E}[F(w_t) - F(w_{t+1})]}{\alpha} + \frac{L\alpha}{2} + 2\sqrt{\frac{\sigma^2}{b}} = \frac{\mathbb{E}[F(w_t) - F(w_{t+1})]}{\alpha} + \frac{L\alpha}{2} + 2\epsilon, \qquad (8)$$
where the second inequality uses the fact that $(\mathbb{E}[x])^2 \le \mathbb{E}[x^2]$ and the third inequality uses the fact that $\mathbb{E}\|\nabla F(w_t) - g_t\|^2 = \mathbb{E}\|\nabla F(w_t) - \nabla f(w_t;a)\|^2/b \le \sigma^2/b$. Summing up (8) from $t = 0$ to $T$, we obtain
$$\frac{1}{T+1}\sum_{t=0}^{T}\mathbb{E}\|\nabla F(w_t)\| \le \frac{F(w_1)}{\alpha(T+1)} + \frac{L\alpha}{2} + 2\epsilon.$$
By setting $\alpha = 2\epsilon/L$ and $T = LF(w_1)/(2\epsilon^2)$, we obtain $\mathbb{E}\|\nabla F(\bar{w})\| \le 4\epsilon$. Hence, the iteration complexity is $T = O(1/\epsilon^2)$ and the computation complexity is $Tb = O(1/\epsilon^4)$.

The result in Theorem 1 implies that SNGD achieves the same computation complexity as SGD for smooth non-convex problems [6]. Furthermore, when $F(w)$ is $L$-smooth and $F(w) \ge 0$, we obtain that $\forall w$,
$$F(w) - F(w^*) \ge \frac{1}{2L}\|\nabla F(w)\|^2,$$
where $w^* \in \arg\min_w F(w)$. If $F(w)$ further satisfies the strictly-locally-quasi-convex property [13], then the computation complexity to achieve an $\epsilon$-stationary point in [13] is actually $O(1/\epsilon^8)$, which is much higher than $O(1/\epsilon^4)$ for small $\epsilon$. Hence, our convergence result for SNGD is better than that in [13] for strictly-locally-quasi-convex problems.

If $F(w)$ is $(L_0, L_1)$-smooth, we have the following result for normalized gradient descent.

Lemma 2. Assume $F(w)$ is $(L_0, L_1)$-smooth ($L_0 > 0$, $L_1 \ge 0$). For any $w, g \in \mathbb{R}^d$, define $w^+$ as follows:
$$w^+ = w - \alpha\frac{g}{\|g\|}, \qquad (9)$$
where $\alpha > 0$. Then we have
$$\left(1 - \frac{\alpha L_1}{2}e^{\alpha L_1}\right)\|\nabla F(w)\| \le \frac{F(w) - F(w^+)}{\alpha} + \frac{(1 + \alpha L_1 e^{\alpha L_1})L_0\alpha}{2} + 2\|\nabla F(w) - g\|.$$

Proof. Let $r(t) = t(w^+ - w) + w$ for $t \in [0,1]$, and let $p(t) = \|\nabla F(r(t))\|$. Since $\|w^+ - w\| = \alpha$, we have
$$p(t) = \|\nabla F(r(t))\| = \left\|\int_0^t \nabla^2 F(r(c))\,r'(c)\,\mathrm{d}c + \nabla F(r(0))\right\|$$
$$\le \int_0^t \|\nabla^2 F(r(c))\|\,\|w^+ - w\|\,\mathrm{d}c + \|\nabla F(r(0))\| \le \alpha\int_0^t \big(L_0 + L_1\|\nabla F(r(c))\|\big)\,\mathrm{d}c + \|\nabla F(w)\| \le \alpha L_0 + \|\nabla F(w)\| + \alpha L_1\int_0^t \|\nabla F(r(c))\|\,\mathrm{d}c,$$
i.e.,
$$p(t) \le \alpha L_0 + \|\nabla F(w)\| + \alpha L_1\int_0^t p(c)\,\mathrm{d}c.$$
According to Gronwall's inequality, we obtain, $\forall t \in [0,1]$,
$$p(t) \le (\alpha L_0 + \|\nabla F(w)\|)e^{\alpha L_1}.$$
Since $F(w)$ is twice differentiable, there exists $\xi \in [0, t]$ such that
$$F(r(t)) = F(r(0)) + t\nabla F(r(0))^{\mathrm{T}}(w^+ - w) + \frac{t^2}{2}(w^+ - w)^{\mathrm{T}}\nabla^2 F(r(\xi))(w^+ - w) \le F(r(0)) + t\nabla F(r(0))^{\mathrm{T}}(w^+ - w) + \frac{t^2}{2}\|w^+ - w\|^2\big(L_0 + L_1 p(\xi)\big) \le F(r(0)) + t\nabla F(r(0))^{\mathrm{T}}(w^+ - w) + \frac{t^2}{2}\|w^+ - w\|^2\big(L_0 + L_1(\alpha L_0 + \|\nabla F(w)\|)e^{\alpha L_1}\big).$$
Letting $t = 1$, we obtain
$$F(w^+) \le F(w) - \alpha\nabla F(w)^{\mathrm{T}}\frac{g}{\|g\|} + \frac{\alpha^2}{2}\big(L_0 + L_1(\alpha L_0 + \|\nabla F(w)\|)e^{\alpha L_1}\big) = F(w) - \alpha(\nabla F(w) - g)^{\mathrm{T}}g/\|g\| - \alpha\|g\| + \frac{\alpha^2}{2}\big(L_0 + L_1(\alpha L_0 + \|\nabla F(w)\|)e^{\alpha L_1}\big) \le F(w) + \alpha\|\nabla F(w) - g\| - \alpha\|g\| + \frac{\alpha^2}{2}\big(L_0 + \alpha L_0 L_1 e^{\alpha L_1} + L_1 e^{\alpha L_1}\|\nabla F(w)\|\big).$$
Combining this with the fact that $\|\nabla F(w)\| \le \|g\| + \|\nabla F(w) - g\|$, we obtain
$$\left(1 - \frac{\alpha L_1}{2}e^{\alpha L_1}\right)\|\nabla F(w)\| \le \frac{F(w) - F(w^+)}{\alpha} + \frac{(1 + \alpha L_1 e^{\alpha L_1})L_0\alpha}{2} + 2\|\nabla F(w) - g\|.$$

According to Lemma 2, we can set $w$, $w^+$, $g$ in (9) to $w_t$, $w_{t+1}$, $g_t$, respectively. Then we can obtain the following convergence result.

Theorem 2. Assume $F(w)$ is $(L_0, L_1)$-smooth ($L_0 > 0$, $L_1 \ge 0$). Let $\{w_t\}$ be the sequence produced by Algorithm 1 with $\alpha L_1 \le 1/2$ and $b = \sigma^2/\epsilon^2$. Then we have
$$\frac{1}{T+1}\sum_{t=0}^{T}\mathbb{E}\|\nabla F(w_t)\| \le \frac{2F(w_1)}{\alpha(T+1)} + 2L_0\alpha + 4\epsilon. \qquad (10)$$
Furthermore, let $\bar{w}$ denote the output of Algorithm 1. By setting $\alpha = \epsilon/L_0$ and $T = L_0F(w_1)/\epsilon^2$, we obtain $\mathbb{E}\|\nabla F(\bar{w})\| \le 8\epsilon$. Hence, the iteration complexity and computation complexity to achieve an $\epsilon$-stationary point of SNGD are $O(1/\epsilon^2)$ and $O(1/\epsilon^4)$, respectively.

Proof. According to Lemma 2 and $\alpha L_1 \le 1/2$, we obtain
$$\|\nabla F(w_t)\| \le \frac{2(F(w_t) - F(w_{t+1}))}{\alpha} + 2L_0\alpha + 4\|\nabla F(w_t) - g_t\|.$$
Similar to the proof of Theorem 1, we obtain
$$\mathbb{E}\|\nabla F(w_t)\| \le \frac{2\mathbb{E}[F(w_t) - F(w_{t+1})]}{\alpha} + 2L_0\alpha + 4\epsilon.$$
Summing up the above inequality from $t = 0$ to $T$, we obtain
$$\frac{1}{T+1}\sum_{t=0}^{T}\mathbb{E}\|\nabla F(w_t)\| \le \frac{2F(w_1)}{\alpha(T+1)} + 2L_0\alpha + 4\epsilon.$$
By setting $\alpha = \epsilon/L_0$ and $T = L_0F(w_1)/\epsilon^2$, we obtain $\mathbb{E}\|\nabla F(\bar{w})\| \le 8\epsilon$. Hence, the iteration complexity is $T = O(1/\epsilon^2)$ and the computation complexity is $Tb = O(1/\epsilon^4)$.

Compared to the $L$-smooth assumption in Theorem 1, the $(L_0, L_1)$-smooth assumption is more general. In particular, if $F(w)$ is twice differentiable and $L$-smooth, then it must be $(L, 0)$-smooth. Furthermore, the $(L_0, L_1)$-smooth assumption implies that when $L_1 \neq 0$, the gradient of $F(w)$ may change drastically, which is a common phenomenon in machine learning models [16]. When we apply SGD to these models in practice, we usually need to clip gradients to avoid gradient explosion. Existing theoretical results about the convergence of SGD need the gradient norm to be bounded if $F(w)$ does not satisfy the $L$-smooth property in Definition 1 [20]. On the contrary, the result in Theorem 2 shows that SNGD can deal with $(L_0, L_1)$-smooth problems without the bounded gradient assumption, and the corresponding computation complexity to achieve an $\epsilon$-stationary point is $O(1/\epsilon^4)$.

4 Stagewise stochastic normalized gradient descent

From both Theorems 1 and 2, we observe that SNGD needs a small constant learning rate $\alpha = O(\epsilon)$ to achieve an $\epsilon$-stationary point. The small constant learning rate may slow down the convergence and make SNGD inefficient in practice. Here, we propose a new method, called S-SNGD, to improve the performance of SNGD. The details of S-SNGD are presented in Algorithm 2.

Algorithm 2 S-SNGD
1: Initialization: $\tilde{w}_0$, $b$, $S$, $T$, $\{\alpha_t\}$, $\{p_t\}$. Here, $\{p_t\}$ is a positive increasing sequence.
2: for $t = 0, 1, \ldots, T-1$ do
3:   $w_{t,0} = \tilde{w}_t$;
4:   $M_t = S/\alpha_t$;
5:   for $m = 0, 1, \ldots, M_t-1$ do
6:     Randomly choose $b$ samples, denoted by $\mathcal{I}_{t,m}$;
7:     $g_{t,m} = \frac{1}{b}\sum_{a\in\mathcal{I}_{t,m}}\nabla f(w_{t,m};a)$;
8:     $w_{t,m+1} = w_{t,m} - \alpha_t\frac{g_{t,m}}{\|g_{t,m}\|}$;
9:   end for
10:  Choose $\bar{w}_{t+1}$ randomly from $\{w_{t,0}, \ldots, w_{t,M_t-1}\}$;
11:  $\tilde{w}_{t+1} = w_{t,M_t}$;
12: end for
13: Return $\bar{w}$, which equals $\bar{w}_i$ with probability $p_i/\sum_{t=1}^{T}p_t$, $i = 1, \ldots, T$.

S-SNGD consists of $T$ stages. In the $t$-th stage, S-SNGD runs SNGD for $M_t$ iterations with learning rate $\alpha_t$. After each stage, S-SNGD randomly chooses a vector from $\{w_{t,0}, w_{t,1}, \ldots, w_{t,M_t-1}\}$ to be $\bar{w}_{t+1}$ and sets $\tilde{w}_{t+1} = w_{t,M_t}$. Different from SNGD, in which a small constant learning rate is necessary for the convergence guarantee, S-SNGD allows a large initial learning rate and then reduces the learning rate by stage. Intuitively, according to (7), a larger initial learning rate reduces the value of $F(w_1)/(\alpha T)$ more quickly. After several model parameter updates, S-SNGD reduces the learning rate so as to reduce the value of $L\alpha/2$. In particular, for an $L$-smooth loss function, we have the following convergence result for S-SNGD.

Theorem 3. Assume $0 \le F(w) \le \delta$ for some constant $\delta > 0$, and assume $F(w)$ is $L$-smooth ($L > 0$). Let $\bar{w}$ denote the output of Algorithm 2, and let $S > 0$, $M_t = S/\alpha_t$, $b = \sigma^2/\epsilon^2$. Then we have
$$\mathbb{E}\|\nabla F(\bar{w})\| \le \frac{\delta p_T}{S\sum_{t=1}^{T}p_t} + \frac{L\sum_{t=1}^{T}\alpha_{t-1}p_t}{2\sum_{t=1}^{T}p_t} + 2\epsilon. \qquad (11)$$

Proof. According to Lemma 1, we obtain
$$\|\nabla F(w_{t,m})\| \le \frac{F(w_{t,m}) - F(w_{t,m+1})}{\alpha_t} + \frac{L\alpha_t}{2} + 2\|\nabla F(w_{t,m}) - g_{t,m}\|.$$
Summing up the above inequality from $m = 0$ to $M_t - 1$, we obtain
$$\frac{1}{M_t}\sum_{m=0}^{M_t-1}\mathbb{E}\|\nabla F(w_{t,m})\| \le \frac{\mathbb{E}[F(w_{t,0}) - F(w_{t,M_t})]}{M_t\alpha_t} + \frac{L\alpha_t}{2} + 2\epsilon.$$
Since $S = M_t\alpha_t$, we obtain
$$\mathbb{E}\|\nabla F(\bar{w}_{t+1})\| \le \frac{\mathbb{E}[F(\tilde{w}_t) - F(\tilde{w}_{t+1})]}{S} + \frac{L\alpha_t}{2} + 2\epsilon. \qquad (12)$$
Then we obtain
$$\sum_{t=1}^{T}p_t\mathbb{E}\|\nabla F(\bar{w}_t)\| \le \frac{1}{S}\sum_{t=0}^{T-1}p_{t+1}\mathbb{E}[F(\tilde{w}_t) - F(\tilde{w}_{t+1})] + \sum_{t=0}^{T-1}\left(\frac{L\alpha_t}{2} + 2\sqrt{\frac{\sigma^2}{b}}\right)p_{t+1} \le \frac{1}{S}\left[p_1F(\tilde{w}_0) + \sum_{t=1}^{T-1}(p_{t+1} - p_t)F(\tilde{w}_t)\right] + \sum_{t=0}^{T-1}\left(\frac{L\alpha_t}{2} + 2\sqrt{\frac{\sigma^2}{b}}\right)p_{t+1} \le \frac{\delta p_T}{S} + \sum_{t=0}^{T-1}\left(\frac{L\alpha_t}{2} + 2\sqrt{\frac{\sigma^2}{b}}\right)p_{t+1}.$$
Since $\bar{w} = \bar{w}_i$ with probability $p_i/\sum_{t=1}^{T}p_t$, we obtain
$$\mathbb{E}\|\nabla F(\bar{w})\| \le \frac{\delta p_T}{S\sum_{t=1}^{T}p_t} + \frac{L\sum_{t=1}^{T}\alpha_{t-1}p_t}{2\sum_{t=1}^{T}p_t} + 2\epsilon.$$

If $F(w)$ is $(L_0, L_1)$-smooth, we have the following convergence result for S-SNGD.

Theorem 4. Assume $0 \le F(w) \le \delta$ for some constant $\delta > 0$, and assume $F(w)$ is $(L_0, L_1)$-smooth ($L_0 > 0$, $L_1 \ge 0$). Let $\bar{w}$ denote the output of Algorithm 2, and let $\alpha_tL_1 \le 1/2$, $S > 0$, $M_t = S/\alpha_t$, $b = \sigma^2/\epsilon^2$. Then we have
$$\mathbb{E}\|\nabla F(\bar{w})\| \le \frac{2\delta p_T}{S\sum_{t=1}^{T}p_t} + \frac{2L_0\sum_{t=1}^{T}\alpha_{t-1}p_t}{\sum_{t=1}^{T}p_t} + 4\epsilon. \qquad (13)$$

Proof. The proof is similar to that of Theorem 3. According to Lemma 2, we obtain
$$\|\nabla F(w_{t,m})\| \le \frac{2(F(w_{t,m}) - F(w_{t,m+1}))}{\alpha_t} + 2L_0\alpha_t + 4\|\nabla F(w_{t,m}) - g_{t,m}\|.$$
Summing up the above inequality from $m = 0$ to $M_t - 1$, we obtain
$$\frac{1}{M_t}\sum_{m=0}^{M_t-1}\mathbb{E}\|\nabla F(w_{t,m})\| \le \frac{2\mathbb{E}[F(w_{t,0}) - F(w_{t,M_t})]}{M_t\alpha_t} + 2L_0\alpha_t + 4\sqrt{\frac{\sigma^2}{b}}.$$
Since $S = M_t\alpha_t$, we obtain
$$\mathbb{E}\|\nabla F(\bar{w}_{t+1})\| \le \frac{2\mathbb{E}[F(\tilde{w}_t) - F(\tilde{w}_{t+1})]}{S} + 2L_0\alpha_t + 4\sqrt{\frac{\sigma^2}{b}}. \qquad (14)$$
Then we obtain
$$\sum_{t=1}^{T}p_t\mathbb{E}\|\nabla F(\bar{w}_t)\| \le \frac{2}{S}\sum_{t=0}^{T-1}p_{t+1}\mathbb{E}[F(\tilde{w}_t) - F(\tilde{w}_{t+1})] + \sum_{t=0}^{T-1}\left(2L_0\alpha_t + 4\sqrt{\frac{\sigma^2}{b}}\right)p_{t+1} \le \frac{2}{S}\left[p_1F(\tilde{w}_0) + \sum_{t=1}^{T-1}(p_{t+1} - p_t)F(\tilde{w}_t)\right] + \sum_{t=0}^{T-1}\left(2L_0\alpha_t + 4\sqrt{\frac{\sigma^2}{b}}\right)p_{t+1} \le \frac{2\delta p_T}{S} + \sum_{t=0}^{T-1}\left(2L_0\alpha_t + 4\sqrt{\frac{\sigma^2}{b}}\right)p_{t+1}.$$
Since $\bar{w} = \bar{w}_i$ with probability $p_i/\sum_{t=1}^{T}p_t$, we obtain
$$\mathbb{E}\|\nabla F(\bar{w})\| \le \frac{2\delta p_T}{S\sum_{t=1}^{T}p_t} + \frac{2L_0\sum_{t=1}^{T}\alpha_{t-1}p_t}{\sum_{t=1}^{T}p_t} + 4\epsilon.$$

According to the second term on the right side of (11) or (13), i.e., $O\big(\sum_{t=1}^{T}p_t\alpha_t/\sum_{t=1}^{T}p_t\big)$, we observe that reducing the learning rate is necessary for the convergence guarantee. According to the first term on the right side of (11) or (13), i.e., $O\big(\delta p_T/(S\sum_{t=1}^{T}p_t)\big)$, we can set a relatively large $S$, which means S-SNGD can update the model parameter with a large learning rate for many iterations. Below, we give some specific $\{\alpha_t\}$ and $\{p_t\}$ to show the convergence of S-SNGD.

Corollary 1. With the assumptions and settings in Theorem 3 or Theorem 4, and for any $\epsilon > 0$, let $T = O(1/\epsilon)$, $\alpha_t = \rho/(t+1)$, and $p_t = 1/\alpha_{t-1}$, where $\rho > 0$ is a constant. Then we have $\mathbb{E}\|\nabla F(\bar{w})\| \le O(\epsilon)$.

Proof. Since $\alpha_t = \rho/(t+1)$ and $p_t = 1/\alpha_{t-1}$, where $\rho > 0$ is a constant, we obtain $\sum_{t=1}^{T}p_t = T(T+1)/(2\rho)$ and $\sum_{t=1}^{T}\alpha_{t-1}p_t = T$. By plugging the two summations into (11) or (13), we obtain
$$\mathbb{E}\|\nabla F(\bar{w})\| \le O\left(\frac{1}{T} + \epsilon\right) = O(\epsilon),$$
where the last step uses the fact that $T = O(1/\epsilon)$.

Corollary 2. With the assumptions and settings in Theorem 3 or Theorem 4, and for any $\epsilon > 0$, let $T = O(\log(1/\epsilon))$, $S = O(1/\epsilon)$, $\alpha_t = \rho^t$, and $p_t = 1/(2\rho)^t$, where $\rho \in (0, 0.5)$ is a constant. Then we have $\mathbb{E}\|\nabla F(\bar{w})\| \le O(\epsilon)$.

Proof. Since $\alpha_t = \rho^t$ and $p_t = 1/(2\rho)^t$, where $\rho \in (0, 0.5)$ is a constant, we obtain $\sum_{t=1}^{T}p_t = \frac{(1/(2\rho))^T - 1}{1 - 2\rho}$ and $\sum_{t=1}^{T}\alpha_{t-1}p_t = \frac{1}{\rho}\sum_{t=1}^{T}2^{-t} \le \frac{1}{\rho}$. Similar to the proof of Corollary 1, we plug the two summations into (11) or (13) and obtain $\mathbb{E}\|\nabla F(\bar{w})\| \le O(\epsilon)$.

Remark 1. In Theorems 3 and 4, we assume that the objective value $F(w)$ is bounded. In practice, we find that $F(w_{t,m})$ is indeed bounded. Furthermore, even if we do not make this assumption, convergence can still be guaranteed. In particular, we can directly sum up (12) or (14) (neither of which needs the bounded-value assumption) and set $\alpha_t = O(1/t)$. Then we obtain
$$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\|\nabla F(\bar{w}_t)\| \le \frac{2F(\tilde{w}_0)}{TS} + O\left(\frac{\log T}{T}\right) + O(\epsilon).$$
Although this convergence rate is slightly worse than that in Corollary 1 owing to the logarithm term $\log T$, it does not rely on the upper bound constant $\delta$ of the objective function defined in Theorems 3 and 4.

5 Comparison with SPIDER-SFO

We briefly present SPIDER-SFO [15] in Algorithm 3. SPIDER-SFO also consists of several stages, and it adopts the constant learning rates $\alpha$ and $\beta$ throughout the whole algorithm. SPIDER-SFO updates the model parameter with the normalized gradient when $\|\mu_{t,m}\|$ is larger than $\alpha/\beta$; otherwise, it updates the model parameter with the unnormalized gradient. Hence, SPIDER-SFO keeps the relation $\|w_{t,m} - w_{t,m-1}\| \le \alpha$. In [15], the authors theoretically proved that for $L$-smooth loss functions, SPIDER-SFO can achieve a computation complexity of $O(1/\epsilon^3)$ by setting $\alpha = O(\epsilon/L)$, $\beta = O(1/L)$, $B = O(1/\epsilon^2)$, $b = O(1/\epsilon)$, $M = O(1/\epsilon)$, $T = O(1/\epsilon)$. Owing to the technique of the stochastic path-integrated differential estimator, SPIDER-SFO achieves a faster convergence rate than other variance reduction methods like SVRG [21], SCSG [22], and Natasha2 [23]. The work in [24] also points out that those variance reduction methods are ineffective for some non-convex optimization problems like deep learning. In fact, SPIDER-SFO theoretically achieves the best computation complexity among all existing first-order stochastic methods [15].

Algorithm 3 SPIDER-SFO [15]
1: Initialization: $\tilde{w}_0$, $\alpha$, $\beta$, $B$, $b$;
2: for $t = 0, 1, \ldots, T-1$ do
3:   Randomly choose $B$ samples, denoted by $\mathcal{I}_t$;
4:   Calculate a mini-batch gradient $\tilde{\mu}_t = \frac{1}{B}\sum_{a\in\mathcal{I}_t}\nabla f(\tilde{w}_t;a)$;
5:   $w_{t,0} = w_{t,1} = \tilde{w}_t$; $\mu_{t,0} = \tilde{\mu}_t$;
6:   for $m = 1, 2, \ldots, M$ do
7:     Randomly choose $b$ samples, denoted by $\mathcal{I}_{t,m}$;
8:     $\mu_{t,m} = \frac{1}{b}\sum_{a\in\mathcal{I}_{t,m}}(\nabla f(w_{t,m};a) - \nabla f(w_{t,m-1};a)) + \mu_{t,m-1}$;
9:     if $\|\mu_{t,m}\| \le \alpha/\beta$ then $w_{t,m+1} = w_{t,m} - \beta\mu_{t,m}$;
10:    else $w_{t,m+1} = w_{t,m} - \alpha\frac{\mu_{t,m}}{\|\mu_{t,m}\|}$;
11:    end if
12:  end for
13:  $\tilde{w}_{t+1} = w_{t,M+1}$;
14: end for
15: Return $\bar{w}$, which is randomly chosen from $\{w_{t,m}\}_{t=0,1,\ldots,T-1;\,m=1,2,\ldots,M}$.
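The thresholded inner update of Algorithm 3 amounts to a single guarded step. A minimal sketch follows (our illustration, assuming `w` and `mu` are torch tensors); either branch moves the parameter by at most $\alpha$, which is the property the analysis relies on.

```python
def spider_sfo_step(w, mu, alpha, beta):
    # Inner update of Algorithm 3: plain step -beta*mu while ||mu|| <= alpha/beta,
    # normalized step of length alpha otherwise. Either way the move satisfies
    # ||w_{t,m+1} - w_{t,m}|| <= alpha.
    if mu.norm() <= alpha / beta:
        return w - beta * mu
    return w - alpha * mu / mu.norm()
```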
Comparing the convergence conditions of SPIDER-SFO and SNGD, we find that both require a small constant learning rate $\alpha = O(\epsilon/L)$. However, when training small non-convex models (e.g., neural networks with one hidden layer), we empirically find that SNGD can adopt a relatively large
constant learning rate with good convergence property, while SPIDER-SFO with a large constant learning rate does not perform well. This interesting phenomenon can be explained as follows. According to the proofs in [15], $\mu_{t,m}$ in Algorithm 3 is an unbiased estimate of $\nabla F(w_{t,m})$, i.e., $\mathbb{E}[\mu_{t,m}|w_{t,m}] = \nabla F(w_{t,m})$. Furthermore, if $f(w;a)$ is $L$-smooth, then the variance of $\mu_{t,m}$ is bounded as follows: $\forall t$, $1 \le m \le M$,
$$\mathbb{E}\|\mu_{t,m} - \nabla F(w_{t,m})\|^2 \le \frac{M\alpha^2L^2}{b} + \frac{\sigma^2}{B} = O\left(\alpha^2 + \frac{1}{B}\right), \qquad (15)$$
where $M = \Theta(b)$ for the $O(1/\epsilon^3)$ computation complexity guarantee. According to (15), the variance of the stochastic gradient $\mu_{t,m}$ is related to four hyper-parameters $B$, $b$, $\alpha$, and $M$, and a large learning rate $\alpha$ may yield a large variance in the learning process. Hence, a small learning rate is necessary for SPIDER-SFO in real applications. On the contrary, the variance of $g_t$ in Algorithm 1 is only related to the hyper-parameter $b$, and the learning rate does not affect the variance of $g_t$. Hence, SNGD can empirically adopt a relatively large constant learning rate. For the same reason that the learning rate does not affect the variance of $g_{t,m}$ in Algorithm 2, S-SNGD can successfully adopt stagewise learning-rate reduction strategies with a large initial learning rate.

6 Experiments

We conduct experiments on the PyTorch platform (version 1.4.0) with an NVIDIA V100 GPU (CUDA version 10.0). As stated in Section 5, the work in [15] has verified that SPIDER-SFO is better than existing stochastic optimization methods, including variance reduction methods like Natasha2. Hence, we mainly adopt SPIDER-SFO as the baseline for comparison. In addition, we also compare our method with SGD to show that SNGD can more easily escape the saddle points or sharp minima of non-convex problems.

6.1 Small non-convex model

First, we use the handwritten digits dataset MNIST 1) and a fully connected 3-layer network to compare SNGD and SPIDER-SFO. In MNIST, the training set includes 60000 samples and the test set includes 10000 samples. Since each sample in MNIST is a 28 × 28 gray-scale picture, we scale the pixels in each picture to [0, 1] as in the reference implementation 2). The dimensions of the input and output layers of this network are 784 and 10, respectively. The hidden layer of this network contains 100 units. We use the exponential linear unit (ELU) as the activation function. In SPIDER-SFO, we set $\beta = 1$, $B = 1024$, $M = B/(2b)$. For both SNGD and SPIDER-SFO, we set $\alpha \in \{0.1, 0.001\}$ and $b \in \{32, 128\}$. The results are presented in Figure 2. Comparing the blue dotted line to the yellow solid line in Figures 2(a) and (c), we can observe that when a small learning rate is adopted for these two methods, SPIDER-SFO converges faster than SNGD in

1) http://yann.lecun.com/exdb/mnist/.
2) https://github.com/pytorch/examples/tree/master/mnist.