Lock-Free Optimization for Non-Convex Problems

Shen-Yi Zhao, Gong-Duo Zhang and Wu-Jun Li
National Key Laboratory for Novel Software Technology
Department of Computer Science and Technology, Nanjing University, China
{zhaosy, zhanggd}@lamda.nju.edu.cn, liwujun@nju.edu.cn

Copyright © 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Stochastic gradient descent (SGD) and its variants have attracted much attention in machine learning due to their efficiency and effectiveness for optimization. To handle large-scale problems, researchers have recently proposed several lock-free strategy based parallel SGD (LF-PSGD) methods for multi-core systems. However, existing works have only proved the convergence of these LF-PSGD methods for convex problems. To the best of our knowledge, no work has proved the convergence of the LF-PSGD methods for non-convex problems. In this paper, we provide the theoretical proof about the convergence of two representative LF-PSGD methods, Hogwild! and AsySVRG, for non-convex problems. Empirical results also show that both Hogwild! and AsySVRG are convergent on non-convex problems, which successfully verifies our theoretical results.

Introduction

Many machine learning models can be formulated as the following optimization problem:
$$\min_{w}\ \frac{1}{n}\sum_{i=1}^{n} f_i(w), \qquad (1)$$
where $w$ is the parameter to learn (optimize), $n$ is the number of training instances, and $f_i(w)$ is the loss defined on instance $i$. For example, assuming we are given a set of labeled instances $\{(x_i, y_i) \mid i = 1, 2, \ldots, n\}$, where $x_i \in \mathbb{R}^d$ is the feature vector and $y_i \in \{1, -1\}$ is the label of $x_i$, $f_i(w)$ can be $\log(1 + e^{-y_i x_i^T w}) + \frac{\lambda}{2}\|w\|^2$, which is known as the regularized loss in logistic regression (LR). We can also take $f_i(w)$ to be $\max\{0, 1 - y_i x_i^T w\} + \frac{\lambda}{2}\|w\|^2$, which is known as the regularized loss in support vector machine (SVM). Here, $\lambda$ is the regularization hyper-parameter. Moreover, many other machine learning models, including neural networks (Krizhevsky, Sutskever, and Hinton 2012), matrix factorization (Koren, Bell, and Volinsky 2009), principal component analysis (PCA) (Shamir 2015) and so on, can also be formulated as that in (1).
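To make the finite-sum form in (1) concrete, the snippet below sketches the regularized logistic-regression loss $f_i(w)$ from the example above together with its stochastic gradient. It is only an illustration under the stated definitions; the function names and the NumPy-based setup are our own choices, not part of the paper.

```python
import numpy as np

def logistic_loss_i(w, x_i, y_i, lam):
    """f_i(w) = log(1 + exp(-y_i * x_i^T w)) + (lam / 2) * ||w||^2."""
    margin = y_i * x_i.dot(w)
    return np.log1p(np.exp(-margin)) + 0.5 * lam * w.dot(w)

def logistic_grad_i(w, x_i, y_i, lam):
    """Gradient of f_i(w); this is the stochastic gradient an SGD-style method samples."""
    margin = y_i * x_i.dot(w)
    return -y_i * x_i / (1.0 + np.exp(margin)) + lam * w

def objective(w, X, y, lam):
    """The objective in (1): the average of the n per-instance losses."""
    return np.mean([logistic_loss_i(w, x, t, lam) for x, t in zip(X, y)])
```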
When the problem in (1) is large-scale, i.e., $n$ is large, researchers have recently proposed stochastic gradient descent (SGD) and its variants like SVRG (Johnson and Zhang 2013) to solve it. Many works (Roux, Schmidt, and Bach 2012; Shalev-Shwartz and Zhang 2013; Johnson and Zhang 2013) have found that SGD-based methods can achieve promising performance in large-scale learning problems. According to the implementation platforms or systems, existing SGD-based methods can be divided into three categories: sequential SGD (SSGD) methods, parallel SGD (PSGD) methods, and distributed SGD (DSGD) methods. SSGD methods are designed for a single thread on a single machine, PSGD methods are designed for multi-core (multi-thread) systems, i.e., a single machine with a shared memory^1, and DSGD methods are designed for multiple machines.

^1 In some of the literature, PSGD refers to methods implemented on both multi-core and multi-machine systems. In this paper, PSGD only refers to methods implemented on multi-core systems with a shared memory.

When the problem in (1) is convex, the SGD methods, including SSGD (Roux, Schmidt, and Bach 2012; Shalev-Shwartz and Zhang 2013; Johnson and Zhang 2013), PSGD (Recht et al. 2011) and DSGD (Jaggi et al. 2014; Li et al. 2014; Xing et al. 2015; Zhang, Zheng, and Kwok 2016), have achieved very promising empirical performance. Furthermore, good theoretical results about the convergence of the SGD methods are also provided by these existing works.

In many real applications, the problems to optimize can be non-convex. For example, the problems for neural networks are typically non-convex. Because many researchers (Li et al. 2014; Xing et al. 2015) find that the SGD methods can also achieve good empirical results for non-convex problems, theoretical proof about the convergence of SGD methods for non-convex problems has recently attracted much attention, and some progress has been achieved. For example, the works in (Ghadimi and Lan 2013; Reddi et al. 2016; Li et al. 2016; Allen-Zhu and Hazan 2016; Allen-Zhu and Yuan 2016) have proved the convergence of the sequential SGD and its variants for non-convex problems. There are also some other theoretical results for particular non-convex problems, like PCA (Shamir 2015; 2016a; 2016b) and matrix factorization (Sa, Ré, and Olukotun 2015). But all these works are only for SSGD methods.

There have appeared only two works (Lian et al. 2015; Huo and Huang 2016) which propose PSGD methods for non-convex problems with theoretical proof of convergence.
However, the PSGD methods in (Lian et al. 2015) need a write-lock or an atomic operation for the memory to prove the convergence^2. Similarly, the work in (Huo and Huang 2016) does not prove the convergence for the lock-free case considered in our paper either. Recent works (Recht et al. 2011; Chaturapruek, Duchi, and Ré 2015; J. Reddi et al. 2015; Zhao and Li 2016) find that lock-free strategy based parallel SGD (LF-PSGD) methods can empirically outperform lock-based PSGD methods for multi-core systems. Although some existing works (Chaturapruek, Duchi, and Ré 2015; Zhao and Li 2016) have proved the convergence of these LF-PSGD methods for convex problems, no work has proved the convergence of the LF-PSGD methods for non-convex problems.

^2 Although the implementation of AsySG-incon in (Lian et al. 2015) is lock-free, the theoretical analysis about the convergence of AsySG-incon is based on an assumption that no over-writing happens, i.e., the theoretical analysis is not for the lock-free case.

In this paper, we provide the theoretical proof about the convergence of two representative LF-PSGD methods, Hogwild! (Recht et al. 2011; Chaturapruek, Duchi, and Ré 2015) and AsySVRG (Zhao and Li 2016), for non-convex problems. The contribution of this work can be outlined as follows:

• Theoretical results show that both Hogwild! and AsySVRG can converge with the lock-free strategy for non-convex problems.
• Hogwild! gets a convergence rate of $O(1/\sqrt{\tilde{T}})$ for non-convex problems, where $\tilde{T} = p \times T$ is the total iteration number of the $p$ threads.
• AsySVRG gets a convergence rate of $O(1/\tilde{T})$ for non-convex problems.
• To get an $\epsilon$-local optimal solution with AsySVRG, the computation complexity by all threads is $O(n^{2/3}/\epsilon)$, or equivalently the computation complexity of each thread is $O\big(\frac{n^{2/3}}{\epsilon p}\big)$. This is faster than traditional parallel gradient descent methods, whose computation complexity is $O\big(\frac{n}{\epsilon p}\big)$ for each thread.
• Empirical results also show that both Hogwild! and AsySVRG are convergent on non-convex problems, which successfully verifies our theoretical results.

Preliminary

We use $f(w)$ to denote the objective function in (1), which means $f(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w)$, and we use $\|\cdot\|$ to denote the $L_2$-norm $\|\cdot\|_2$.

Assumption 1. The function $f_i(\cdot)$ in (1) is smooth, which means that there exists a constant $L > 0$ such that $\forall a, b$,
$$f_i(b) \le f_i(a) + \nabla f_i(a)^T(b - a) + \frac{L}{2}\|b - a\|^2,$$
or equivalently
$$\|\nabla f_i(b) - \nabla f_i(a)\| \le L\|b - a\|.$$

This is a common assumption for the convergence analysis of most existing gradient-based methods.

Since we focus on non-convex problems in this paper, it is difficult to get the global solution of (1) based on gradient methods. Hence, we use $\|\nabla f(w)\|^2$ to measure the convergence instead of $f(w) - \min_{w} f(w)$.

Here, we give a lemma which is useful in the convergence analysis of both Hogwild! and AsySVRG.

Lemma 1. Assume $B$ is a positive semi-definite matrix whose largest eigenvalue is less than or equal to 1 and whose minimum eigenvalue is $\alpha > 0$. Then $\forall x, y$,
$$-\nabla f(x)^T B \nabla f(y) \le \frac{L^2}{2}\|x - y\|^2 - \frac{\alpha}{2}\|\nabla f(x)\|^2.$$

Proof.
$$
\begin{aligned}
\frac{\alpha}{2}\|\nabla f(x)\|^2 - \nabla f(x)^T B\nabla f(y)
&\le \frac{1}{2}\big\|B^{\frac{1}{2}}\nabla f(x)\big\|^2 - \nabla f(x)^T B\nabla f(y) \\
&\le \frac{1}{2}\big\|B^{\frac{1}{2}}\nabla f(x)\big\|^2 - \nabla f(x)^T B\nabla f(y) + \frac{1}{2}\big\|B^{\frac{1}{2}}\nabla f(y)\big\|^2 \\
&= \frac{1}{2}\big\|B^{\frac{1}{2}}(\nabla f(x) - \nabla f(y))\big\|^2 \\
&\le \frac{L^2}{2}\|x - y\|^2. \qquad \square
\end{aligned}
$$
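As a quick numerical sanity check of Lemma 1 (not part of the paper), the snippet below verifies the inequality for an $L$-smooth quadratic $f(w) = \frac{1}{2}w^T A w$ and a random positive definite $B$ with spectrum in $(0, 1]$; the dimension, seed and tolerance are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5

# L-smooth test function f(w) = 0.5 * w^T A w, so grad f(w) = A w and L = lambda_max(A).
A = rng.standard_normal((d, d))
A = A.T @ A
L = np.linalg.eigvalsh(A).max()
grad_f = lambda w: A @ w

# Random B with eigenvalues in (0, 1]; alpha is its smallest eigenvalue.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
eigs = rng.uniform(0.1, 1.0, d)
B = Q @ np.diag(eigs) @ Q.T
alpha = eigs.min()

for _ in range(1000):
    x, y = rng.standard_normal(d), rng.standard_normal(d)
    lhs = -grad_f(x) @ B @ grad_f(y)
    rhs = 0.5 * L**2 * np.sum((x - y) ** 2) - 0.5 * alpha * np.sum(grad_f(x) ** 2)
    assert lhs <= rhs + 1e-9   # Lemma 1 holds for every sampled pair (x, y)
```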
Hogwild! for Non-Convex Problems

The Hogwild! method (Recht et al. 2011) is listed in Algorithm 1. Each thread reads $w$ from the shared memory, computes a stochastic gradient and updates the $w$ in the shared memory. Please note that Hogwild! in (Recht et al. 2011) has several variants, with locks or lock-free. Here, we only focus on the lock-free variant of Hogwild!, which means that we do not use any locks, either read-lock or write-lock, for all threads.

Algorithm 1 Hogwild!
  Initialization: $p$ threads, initialize $w_0$, $\eta$;
  For each thread, do:
  for $l = 0, 1, 2, \ldots, T-1$ do
    Read the current $w$ in the shared memory, denoted as $\hat{w}$;
    Randomly pick up an $i$ from $\{1, \ldots, n\}$ and compute the gradient $\nabla f_i(\hat{w})$;
    $w \leftarrow w - \eta\nabla f_i(\hat{w})$;
  end for

As in (Zhao and Li 2016), we can construct an equivalent write sequence $\{w_t\}$:
$$w_{t+1} = w_t - \eta B_t \nabla f_{i_t}(\hat{w}_t), \qquad (2)$$
where $0 \le t \le p \times T$ and $B_t$ is a random diagonal matrix whose diagonal entries are 0 or 1. The matrix $B_t$ is used to denote whether over-writing happens.
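The following is a minimal sketch of the lock-free update loop in Algorithm 1 using Python threads over a shared NumPy vector; it is our own illustration rather than the authors' implementation, and all function and variable names are ours. There is deliberately no lock around the read or the in-place write, so threads can overwrite each other's updates, which is exactly what the random matrix $B_t$ in (2) models.

```python
import threading
import numpy as np

def hogwild(grad_fns, w, eta, T, p):
    """Lock-free parallel SGD: p threads update the shared vector w in place.

    grad_fns: list of n callables; grad_fns[i](w) returns the stochastic gradient
    of f_i at w as an array with the same shape as w.
    """
    n = len(grad_fns)

    def worker():
        rng = np.random.default_rng()
        for _ in range(T):
            w_hat = w.copy()            # read the current w from shared memory
            i = rng.integers(n)         # randomly pick an instance
            g = grad_fns[i](w_hat)      # stochastic gradient at the read value
            w[:] -= eta * g             # unlocked in-place write to shared memory

    threads = [threading.Thread(target=worker) for _ in range(p)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return w
```

Note that in CPython the global interpreter lock limits true parallelism of such a sketch; it is only meant to mirror the read/compute/write structure of Algorithm 1, not to reproduce the speed-ups reported later.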
If the $k$-th diagonal entry of $B_t$ is 0, it means that the $k$-th element of the gradient vector $\nabla f_{i_t}(\hat{w}_t)$ is overwritten by other threads. Otherwise, that element is not overwritten.

$\hat{w}_t$ is read by the thread that computes $\nabla f_{i_t}(\hat{w}_t)$ and has the following format:
$$\hat{w}_t = w_{a(t)} - \eta\sum_{j=a(t)}^{t-1} P_{t, j-a(t)}\nabla f_{i_j}(\hat{w}_j), \qquad (3)$$
where $a(t)$ means that some old stochastic gradients have been completely written on the $w$ in the shared memory, and $P_{t, j-a(t)}$ is a diagonal matrix whose diagonal entries are 0 or 1, which means $\hat{w}_t$ might include parts of new stochastic gradients.

In the lock-free strategy, we need the following assumptions to guarantee convergence:

Assumption 2. $a(t)$ is bounded by $0 \le t - a(t) \le \tau$. It means that the old stochastic gradients $\nabla f_{i_0}, \ldots, \nabla f_{i_{t-\tau-1}}$ have been completely written on $w$ in the shared memory.

Assumption 3. We consider the matrix $B_t$ as a random matrix and $\mathbb{E}[B_t \mid w_t, \hat{w}_t] = B \succ 0$ with the minimum eigenvalue $\alpha > 0$.

According to the definition of $B_t$, it is easy to find that $B_t$ and $B$ are positive semi-definite matrices and that the largest eigenvalue of $B$ is less than or equal to 1. Assumption 3 means that the probability that over-writing happens is at most $1 - \alpha < 1$ for each write step.

Assumption 4. $B_t$ and $i_t$ are independent.

Since $i_t$ is the random index selected by each thread while $B_t$ is highly affected by the hardware, the independence assumption is reasonable.

For Hogwild!, the following assumption is also necessary:

Assumption 5. There exists a constant $V$ such that $\|\nabla f_i(w)\| \le V$, $i = 1, \ldots, n$.

For convenience, in this section we denote
$$q(x) = \frac{1}{n}\sum_{i=1}^{n}\|\nabla f_i(x)\|^2.$$
It is easy to find that $\mathbb{E} q(\hat{w}_t) = \mathbb{E}[\|\nabla f_{i_t}(\hat{w}_t)\|^2]$, and note that when $x$ is close to some stationary point, $q(x)$ may still be far away from 0. Hence, Hogwild! is not a variance reduction method, and we need to control the variance of the stochastic gradient.

The difficulty of the analysis is that $w_t \ne \hat{w}_t$. Here, we give the following lemmas. (The proofs of some lemmas can be found in the supplementary material, which can be downloaded from http://cs.nju.edu.cn/lwj/paper/LFnonConvex_sup.pdf.)

Lemma 2. In Hogwild!, we have $\mathbb{E} q(\hat{w}_t) \le \rho\,\mathbb{E} q(\hat{w}_{t+1})$ if $\rho, \eta$ satisfy
$$\frac{1}{1 - \eta - \frac{9\eta(\tau+1)L^2(\rho^{\tau+1} - 1)}{\rho - 1}} \le \rho.$$

Lemma 3. With the condition about $\rho, \eta$ in Lemma 2, we have
$$\mathbb{E}\|w_t - \hat{w}_t\|^2 \le \frac{4\eta^2\tau\rho(\rho^\tau - 1)}{\rho - 1}\,\mathbb{E} q(\hat{w}_t). \qquad (4)$$

Combining with Assumption 5, we can find that the gap between the write sequence and the read sequence can always be bounded by the constant $\frac{4\eta^2 V^2\tau\rho(\rho^\tau - 1)}{\rho - 1}$.

Theorem 1. Let $A = \frac{2f(w_0)}{\alpha}$ and $B = 2V^2\big(\frac{2\tau L^2\eta\rho(\rho^\tau - 1)}{\alpha(\rho - 1)} + \frac{L}{2\alpha}\big)$. If we take the stepsize $\eta = \sqrt{\frac{A}{\tilde{T}B}}$, where $\tilde{T} = p \times T$, we can get the following result:
$$\frac{1}{\tilde{T}}\sum_{t=0}^{\tilde{T}-1}\mathbb{E}\|\nabla f(w_t)\|^2 \le \sqrt{\frac{AB}{\tilde{T}}}.$$

Proof. According to Assumption 1, we have
$$
\begin{aligned}
\mathbb{E}[f(w_{t+1}) \mid w_t, \hat{w}_t]
&\le f(w_t) - \eta\,\mathbb{E}[\nabla f(w_t)^T B_t\nabla f_{i_t}(\hat{w}_t) \mid w_t, \hat{w}_t] + \frac{L\eta^2}{2}\mathbb{E}[\|\nabla f_{i_t}(\hat{w}_t)\|^2 \mid w_t, \hat{w}_t] \\
&= f(w_t) - \eta\nabla f(w_t)^T B\nabla f(\hat{w}_t) + \frac{L\eta^2}{2}\mathbb{E}[\|\nabla f_{i_t}(\hat{w}_t)\|^2 \mid w_t, \hat{w}_t] \\
&\le f(w_t) - \frac{\alpha\eta}{2}\|\nabla f(w_t)\|^2 + \frac{L^2\eta}{2}\|w_t - \hat{w}_t\|^2 + \frac{L\eta^2}{2}\mathbb{E}[\|\nabla f_{i_t}(\hat{w}_t)\|^2 \mid w_t, \hat{w}_t],
\end{aligned}
$$
where the first equality uses Assumption 4 and the second inequality uses Lemma 1. Taking expectation on the above inequality, we obtain
$$
\begin{aligned}
\mathbb{E} f(w_{t+1})
&\le \mathbb{E} f(w_t) - \frac{\alpha\eta}{2}\mathbb{E}\|\nabla f(w_t)\|^2 + \frac{L^2\eta}{2}\mathbb{E}\|w_t - \hat{w}_t\|^2 + \frac{L\eta^2 V^2}{2} \\
&\le \mathbb{E} f(w_t) - \frac{\alpha\eta}{2}\mathbb{E}\|\nabla f(w_t)\|^2 + \eta^2 V^2\Big(\frac{2\tau L^2\eta\rho(\rho^\tau - 1)}{\rho - 1} + \frac{L}{2}\Big),
\end{aligned}
$$
where the first inequality uses Assumption 5 and the second inequality uses Lemma 3. Summing the above inequality from $t = 0$ to $\tilde{T} - 1$, we get
$$\sum_{t=0}^{\tilde{T}-1}\mathbb{E}\|\nabla f(w_t)\|^2 \le \frac{2}{\alpha\eta}f(w_0) + 2\eta\tilde{T}V^2\Big(\frac{2\tau L^2\eta\rho(\rho^\tau - 1)}{\alpha(\rho - 1)} + \frac{L}{2\alpha}\Big).$$
For convenience, let $A = \frac{2f(w_0)}{\alpha}$ and $B = 2V^2\big(\frac{2\tau L^2\eta\rho(\rho^\tau - 1)}{\alpha(\rho - 1)} + \frac{L}{2\alpha}\big)$, which are two bounded constants.
If we take the stepsize $\eta = \sqrt{\frac{A}{\tilde{T}B}}$, we get
$$\frac{1}{\tilde{T}}\sum_{t=0}^{\tilde{T}-1}\mathbb{E}\|\nabla f(w_t)\|^2 \le \sqrt{\frac{AB}{\tilde{T}}}. \qquad \square$$

Hence, our theoretical result shows that Hogwild! with the lock-free strategy gets a convergence rate of $O(1/\sqrt{\tilde{T}})$ for non-convex problems, where $\tilde{T} = p \times T$ is the total iteration number of the $p$ threads.

AsySVRG for Non-Convex Problems

The AsySVRG method (Zhao and Li 2016) is listed in Algorithm 2. AsySVRG provides a lock-free parallel strategy for the original sequential SVRG (Johnson and Zhang 2013). Compared with Hogwild!, AsySVRG includes the full gradient to get a variance-reduced stochastic gradient, which has been proved to give a linear convergence rate on strongly convex problems (Zhao and Li 2016). In this section, we will prove that AsySVRG is also convergent for non-convex problems, and has a faster convergence rate than Hogwild! on non-convex problems.

Algorithm 2 AsySVRG
  Initialization: $p$ threads, initialize $w_0$, $\eta$;
  for $t = 0, 1, 2, \ldots, T-1$ do
    $u_0 = w_t$;
    All threads compute the full gradient $\nabla f(u_0) = \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(u_0)$ in parallel;
    $u = w_t$;
    For each thread, do:
    for $j = 0$ to $M-1$ do
      Read the current value of $u$, denoted as $\hat{u}$, from the shared memory, and randomly pick up an $i$ from $\{1, \ldots, n\}$;
      Compute the update vector: $\hat{v} = \nabla f_i(\hat{u}) - \nabla f_i(u_0) + \nabla f(u_0)$;
      $u \leftarrow u - \eta\hat{v}$;
    end for
    Take $w_{t+1}$ to be the current value of $u$ in the shared memory;
  end for

Similar to the analysis in the last section, we construct an equivalent write sequence $\{u_{t,m}\}$ for the $t$-th outer-loop:
$$u_{t,0} = w_t, \qquad u_{t,m+1} = u_{t,m} - \eta B_{t,m}\hat{v}_{t,m}, \qquad (5)$$
where $\hat{v}_{t,m} = \nabla f_{i_{t,m}}(\hat{u}_{t,m}) - \nabla f_{i_{t,m}}(u_{t,0}) + \nabla f(u_{t,0})$ and $B_{t,m}$ is a diagonal matrix whose diagonal entries are 0 or 1. $\hat{u}_{t,m}$ is read by the thread that computes $\hat{v}_{t,m}$, and it has the following format:
$$\hat{u}_{t,m} = u_{t,a(m)} - \eta\sum_{j=a(m)}^{m-1} P^{(t)}_{m, j-a(m)}\hat{v}_{t,j},$$
where $P^{(t)}_{m, j-a(m)}$ is a diagonal matrix whose diagonal entries are 0 or 1. Note that according to (5), $u_{t,\tilde{M}} = w_{t+1}$, since all the stochastic gradients have been written on $w$ at the end of the $t$-th outer-loop. Here, we also need the assumptions that $0 \le m - a(m) \le \tau$, that $\mathbb{E}[B_{t,m} \mid u_{t,m}, \hat{u}_{t,m}] = B \succ 0$ with the minimum eigenvalue $\alpha > 0$, and that $B_{t,m}$ and $i_{t,m}$ are independent. These assumptions are similar to those in the previous section.

For convenience, let $p_i(x) = \nabla f_i(x) - \nabla f_i(u_{t,0}) + \nabla f(u_{t,0})$, and in this section we denote
$$q(x) = \frac{1}{n}\sum_{i=1}^{n}\|p_i(x)\|^2.$$
It is easy to find that $\mathbb{E} q(\hat{u}_{t,m}) = \mathbb{E}[\|\hat{v}_{t,m}\|^2]$.

The difference between Hogwild! and AsySVRG is the stochastic gradient, and we have the following lemmas, which lead to the fast convergence rate of AsySVRG:

Lemma 4. $\forall x$, we have
$$q(x) \le 2L^2\|x - u_{t,0}\|^2 + 2\|\nabla f(x)\|^2.$$

Proof.
$$
\begin{aligned}
q(x) &= \frac{1}{n}\sum_{i=1}^{n}\|\nabla f_i(x) - \nabla f_i(u_{t,0}) + \nabla f(u_{t,0})\|^2 \\
&\le \frac{2}{n}\sum_{i=1}^{n}\|\nabla f_i(x) - \nabla f_i(u_{t,0}) + \nabla f(u_{t,0}) - \nabla f(x)\|^2 + 2\|\nabla f(x)\|^2 \\
&\le \frac{2}{n}\sum_{i=1}^{n}\|\nabla f_i(x) - \nabla f_i(u_{t,0})\|^2 + 2\|\nabla f(x)\|^2 \\
&\le 2L^2\|x - u_{t,0}\|^2 + 2\|\nabla f(x)\|^2. \qquad \square
\end{aligned}
$$

According to Lemma 4, we can find that AsySVRG is a variance reduction method for non-convex problems, because when $\hat{u}_{t,m}$ and $u_{t,0}$ get close to some stationary point, $q(\hat{u}_{t,m})$ gets close to 0. Hence, we do not need the bounded gradient assumption for the convergence proof.

Since $u_{t,m} \ne \hat{u}_{t,m}$, the difficulty of the convergence analysis lies in the gap between $u_{t,m}$ and $\hat{u}_{t,m}$, and in the relation between $q(\hat{u}_{t,m})$ and $q(u_{t,m})$.

Lemma 5. In AsySVRG, we have $\mathbb{E} q(\hat{u}_{t,m}) < \rho\,\mathbb{E} q(\hat{u}_{t,m+1})$ if we choose $\rho$ and $\eta$ to satisfy
$$\frac{1}{1 - \eta - \frac{9\eta(\tau+1)L^2(\rho^{\tau+1} - 1)}{\rho - 1}} \le \rho.$$

Lemma 6. With the condition about $\rho, \eta$ in Lemma 5, we have
$$\mathbb{E}\|u_{t,m} - \hat{u}_{t,m}\|^2 \le \frac{4\eta^2\tau\rho(\rho^\tau - 1)}{\rho - 1}\,\mathbb{E} q(\hat{u}_{t,m}). \qquad (6)$$

Lemma 7. With the condition about $\rho, \eta$ in Lemma 5, we have $\mathbb{E} q(\hat{u}_{t,m}) < \rho\,\mathbb{E} q(u_{t,m})$.
Combining Lemma 6 and Lemma 7, we can directly obtain
$$\mathbb{E}\|\hat{u}_{t,m} - u_{t,m}\|^2 \le \frac{4\eta^2\tau\rho^2(\rho^\tau - 1)}{\rho - 1}\,\mathbb{E} q(u_{t,m}). \qquad (7)$$
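For concreteness, here is a minimal multi-thread sketch of the variance-reduced update in Algorithm 2, under the same caveats as the Hogwild! sketch above (our own illustration, not the authors' code; names are ours): each outer loop takes a snapshot $u_0$, computes the full gradient at the snapshot, and the threads then run lock-free updates with $\hat{v} = \nabla f_i(\hat{u}) - \nabla f_i(u_0) + \nabla f(u_0)$.

```python
import threading
import numpy as np

def asysvrg(grad_fns, w0, eta, T, M, p):
    """Lock-free parallel SVRG sketch: p threads share the inner iterate u in place."""
    n = len(grad_fns)
    w = w0.copy()
    for _ in range(T):                                  # outer loop
        u0 = w.copy()                                   # snapshot u_0 = w_t
        full_grad = sum(g(u0) for g in grad_fns) / n    # full gradient at the snapshot
        u = w.copy()                                    # shared inner iterate

        def worker():
            rng = np.random.default_rng()
            for _ in range(M):
                u_hat = u.copy()                        # unlocked read of the shared u
                i = rng.integers(n)
                v_hat = grad_fns[i](u_hat) - grad_fns[i](u0) + full_grad
                u[:] -= eta * v_hat                     # unlocked in-place write

        threads = [threading.Thread(target=worker) for _ in range(p)]
        for th in threads:
            th.start()
        for th in threads:
            th.join()
        w = u.copy()                                    # w_{t+1} = current value of u
    return w
```

In the experiments reported later, the inner iteration number is set to $M = n/p$, so each outer loop consumes roughly $n$ stochastic gradients in total on top of the one full gradient.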
Theorem 2. We define $c_m = c_{m+1}(1 + \beta\eta) + 2L^2\eta^2 h_{m+1}$ and $h_m = \big(\frac{\eta L^2}{2} + \frac{2c_m\eta}{\beta}\big)\frac{4\tau\rho^2(\rho^\tau - 1)}{\rho - 1} + \big(c_m\rho + \frac{L\rho}{2}\big)$ with $c_0, \beta > 0$. Furthermore, we choose $c_0, \eta, \beta$ such that $\gamma = \min_{m}\big(\frac{\alpha\eta}{2} - \frac{2c_{m+1}\eta}{\beta} - 2\eta^2 h_{m+1}\big) > 0$ and $c_{\tilde{M}} = 0$, where $\tilde{M} = M \times p$. Then we have
$$\frac{1}{T\tilde{M}}\sum_{t=0}^{T-1}\sum_{m=0}^{\tilde{M}-1}\mathbb{E}\|\nabla f(u_{t,m})\|^2 \le \frac{\mathbb{E} f(w_0) - \mathbb{E} f(w_T)}{T\tilde{M}\gamma}.$$

Proof. In the $t$-th outer-loop, similar to (Reddi et al. 2016), we define $R_{t,m}$ as follows:
$$R_{t,m} = f(u_{t,m}) + c_m\|u_{t,m} - u_{t,0}\|^2.$$
Then $\forall \beta > 0$,
$$
\begin{aligned}
\mathbb{E}[\|u_{t,m+1} - u_{t,0}\|^2 \mid u_{t,m}, \hat{u}_{t,m}]
&\le \mathbb{E}\|u_{t,m+1} - u_{t,m}\|^2 + \|u_{t,m} - u_{t,0}\|^2 - 2\eta(\mathbb{E} B_{t,m}\hat{v}_{t,m})^T(u_{t,m} - u_{t,0}) \\
&\le \eta^2\mathbb{E}\|\hat{v}_{t,m}\|^2 + (1 + \beta\eta)\|u_{t,m} - u_{t,0}\|^2 + \frac{\eta}{\beta}\|\nabla f(\hat{u}_{t,m})\|^2 \\
&\le \eta^2\mathbb{E}\|\hat{v}_{t,m}\|^2 + (1 + \beta\eta)\|u_{t,m} - u_{t,0}\|^2 + \frac{2\eta}{\beta}\big(\|\nabla f(u_{t,m})\|^2 + \|\nabla f(\hat{u}_{t,m}) - \nabla f(u_{t,m})\|^2\big) \\
&\le \eta^2\mathbb{E}\|\hat{v}_{t,m}\|^2 + (1 + \beta\eta)\|u_{t,m} - u_{t,0}\|^2 + \frac{2\eta}{\beta}\big(\|\nabla f(u_{t,m})\|^2 + L^2\|\hat{u}_{t,m} - u_{t,m}\|^2\big), \qquad (8)
\end{aligned}
$$
where the second inequality uses the fact $2ab \le \beta a^2 + \frac{1}{\beta}b^2$.

Since the objective function is $L$-smooth, we have
$$
\begin{aligned}
\mathbb{E}[f(u_{t,m+1}) \mid u_{t,m}, \hat{u}_{t,m}]
&\le f(u_{t,m}) - \eta\,\mathbb{E}[\nabla f(u_{t,m})^T B_{t,m}\nabla f_{i_{t,m}}(\hat{u}_{t,m}) \mid u_{t,m}, \hat{u}_{t,m}] + \frac{L\eta^2}{2}\mathbb{E}[\|\hat{v}_{t,m}\|^2 \mid u_{t,m}, \hat{u}_{t,m}] \\
&= f(u_{t,m}) - \eta\nabla f(u_{t,m})^T B\nabla f(\hat{u}_{t,m}) + \frac{L\eta^2}{2}\mathbb{E}[\|\hat{v}_{t,m}\|^2 \mid u_{t,m}, \hat{u}_{t,m}] \\
&\le f(u_{t,m}) - \frac{\alpha\eta}{2}\|\nabla f(u_{t,m})\|^2 + \frac{\eta L^2}{2}\|u_{t,m} - \hat{u}_{t,m}\|^2 + \frac{L\eta^2}{2}\mathbb{E}[\|\hat{v}_{t,m}\|^2 \mid u_{t,m}, \hat{u}_{t,m}], \qquad (9)
\end{aligned}
$$
where the first equality uses the independence of $B_{t,m}$ and $i_{t,m}$, and the second inequality uses Lemma 1. Combining (8) and (9), we have
$$
\begin{aligned}
\mathbb{E} R_{t,m+1} &= \mathbb{E} f(u_{t,m+1}) + c_{m+1}\mathbb{E}\|u_{t,m+1} - u_{t,0}\|^2 \\
&\le \mathbb{E} f(u_{t,m}) - \Big(\frac{\alpha\eta}{2} - \frac{2c_{m+1}\eta}{\beta}\Big)\mathbb{E}\|\nabla f(u_{t,m})\|^2 + \Big(\frac{\eta L^2}{2} + \frac{2c_{m+1}\eta}{\beta}\Big)\mathbb{E}\|u_{t,m} - \hat{u}_{t,m}\|^2 \\
&\quad + c_{m+1}(1 + \beta\eta)\mathbb{E}\|u_{t,m} - u_{t,0}\|^2 + \eta^2\Big(c_{m+1} + \frac{L}{2}\Big)\mathbb{E}\|\hat{v}_{t,m}\|^2 \\
&\le \mathbb{E} f(u_{t,m}) - \Big(\frac{\alpha\eta}{2} - \frac{2c_{m+1}\eta}{\beta}\Big)\mathbb{E}\|\nabla f(u_{t,m})\|^2 + \Big(\frac{\eta L^2}{2} + \frac{2c_{m+1}\eta}{\beta}\Big)\frac{4\tau\eta^2\rho^2(\rho^\tau - 1)}{\rho - 1}\mathbb{E} q(u_{t,m}) \\
&\quad + c_{m+1}(1 + \beta\eta)\mathbb{E}\|u_{t,m} - u_{t,0}\|^2 + \eta^2\Big(c_{m+1} + \frac{L}{2}\Big)\mathbb{E}\|\hat{v}_{t,m}\|^2,
\end{aligned}
$$
where the last inequality uses equation (7).

For convenience, we use $h_m = \big(\frac{\eta L^2}{2} + \frac{2c_m\eta}{\beta}\big)\frac{4\tau\rho^2(\rho^\tau - 1)}{\rho - 1} + \rho\big(c_m + \frac{L}{2}\big)$. Since $\mathbb{E}[\|\hat{v}_{t,m}\|^2] = \mathbb{E} q(\hat{u}_{t,m}) \le \rho\,\mathbb{E} q(u_{t,m})$, we have
$$
\begin{aligned}
\mathbb{E} R_{t,m+1}
&\le \mathbb{E} f(u_{t,m}) - \Big(\frac{\alpha\eta}{2} - \frac{2c_{m+1}\eta}{\beta}\Big)\mathbb{E}\|\nabla f(u_{t,m})\|^2 + c_{m+1}(1 + \beta\eta)\mathbb{E}\|u_{t,m} - u_{t,0}\|^2 + \eta^2 h_{m+1}\mathbb{E} q(u_{t,m}) \\
&\le \mathbb{E} f(u_{t,m}) + \big[c_{m+1}(1 + \beta\eta) + 2L^2\eta^2 h_{m+1}\big]\mathbb{E}\|u_{t,m} - u_{t,0}\|^2 - \Big(\frac{\alpha\eta}{2} - \frac{2c_{m+1}\eta}{\beta} - 2\eta^2 h_{m+1}\Big)\mathbb{E}\|\nabla f(u_{t,m})\|^2,
\end{aligned}
$$
where the second inequality uses Lemma 4. Then we can obtain
$$\Big(\frac{\alpha\eta}{2} - \frac{2c_{m+1}\eta}{\beta} - 2\eta^2 h_{m+1}\Big)\mathbb{E}\|\nabla f(u_{t,m})\|^2 \le \mathbb{E} R_{t,m} - \mathbb{E} R_{t,m+1},$$
where $c_m = c_{m+1}(1 + \beta\eta) + 2L^2\eta^2 h_{m+1}$. We set $c_0 > 0$. It is easy to see that $c_m > c_{m+1}$. We can choose $c_0, \eta, \beta$ to make $c_{\tilde{M}} = 0$. Then we have
$$\sum_{m=0}^{\tilde{M}-1}\mathbb{E}\|\nabla f(u_{t,m})\|^2 \le \frac{\mathbb{E} R_{t,0} - \mathbb{E} R_{t,\tilde{M}}}{\gamma} = \frac{\mathbb{E} f(w_t) - \mathbb{E} f(w_{t+1})}{\gamma},$$
which is equivalent to
$$\frac{1}{T\tilde{M}}\sum_{t=0}^{T-1}\sum_{m=0}^{\tilde{M}-1}\mathbb{E}\|\nabla f(u_{t,m})\|^2 \le \frac{\mathbb{E} f(w_0) - \mathbb{E} f(w_T)}{T\tilde{M}\gamma}. \qquad \square$$
Computation Complexity

In Theorem 2, we construct a sequence $\{c_m\}$ and need $\gamma > 0$. According to the definition of $h_m$, we can write $h_m$ as $h_m = g c_m + f$, where $g = \frac{2\eta}{\beta}\frac{4\tau\rho^2(\rho^\tau - 1)}{\rho - 1} + \rho$ and $f = \frac{\eta L^2}{2}\frac{4\tau\rho^2(\rho^\tau - 1)}{\rho - 1} + \frac{L\rho}{2}$ are constants.

First, we choose $\beta > \eta$; then both $g$ and $f$ are bounded positive constants. We have
$$c_m = c_{m+1}(1 + \beta\eta + 2L^2\eta^2 g) + 2L^2\eta^2 f.$$
Let $a = \beta\eta + 2L^2\eta^2 g$. Because $c_{\tilde{M}} = 0$, it is easy to get
$$c_0 = 2L^2\eta^2 f\,\frac{(1 + a)^{\tilde{M}} - 1}{a}.$$
We take $\tilde{M} = \lfloor\frac{1}{a}\rfloor \le \frac{1}{a}$; then we have $c_0 \le \frac{4L^2\eta^2 f}{a}$ and
$$\gamma = \frac{\alpha}{2}\Big(\eta - \frac{4c_0\eta}{\alpha\beta} - \frac{4g c_0}{\alpha}\eta^2 - \frac{4f}{\alpha}\eta^2\Big).$$
As recommended in (Reddi et al. 2016), we can take $\eta = \mu/n^{2/3}$ and $\beta = v/n^{1/3}$ with $\eta < \beta$ (assuming $n$ is large). Then we can get $f = O(1)$, $g = O(1)$ and $a = O(1/n)$. By choosing $\mu, v$ to satisfy $\gamma > 0$, we have $\tilde{M} = O(n)$. Hence, to get an $\epsilon$-local optimal solution, the computation complexity by all $p$ threads is $O(n^{2/3}/\epsilon)$, and the computation complexity of each thread is $O\big(\frac{n^{2/3}}{\epsilon p}\big)$.

Experiment

To verify our theoretical results about Hogwild! and AsySVRG, we use a fully-connected neural network to construct a non-convex function. The neural network has one hidden layer with 100 nodes, and the sigmoid function is used as the activation function. We use a soft-max output and $L_2$ regularization for training. The loss function is
$$f(w, b) = -\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K} 1\{y_i = k\}\log o_i^{(k)} + \frac{\lambda}{2}\|w\|^2,$$
where $w$ is the weights of the neural network, $b$ is the bias, $y_i$ is the label of instance $x_i$, $o_i^{(k)}$ is the output corresponding to $x_i$, and $K$ is the total number of class labels.

We use two datasets, connect-4 and MNIST (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/), for the experiments, and set $\lambda = 10^{-3}$. We initialize $w$ by randomly sampling from a Gaussian distribution with mean 0 and variance 0.01, and initialize $b = 0$. During training, we use a fixed stepsize for both Hogwild! and AsySVRG. The stepsize is chosen from {0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001}, and the best result is reported. For the iteration number of the inner-loop of AsySVRG, we set $M = n/p$, where $p$ is the number of threads. The experiments are conducted on a server with 12 Intel cores and 64 GB of memory.
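The snippet below is a minimal sketch of this training objective (one hidden layer of 100 sigmoid units, soft-max output, $L_2$ penalty on the weight matrices). It is our own illustrative reconstruction, not the authors' code: the function name, the array layout, and the choice not to regularize the biases are assumptions on our part.

```python
import numpy as np

def network_loss(W1, b1, W2, b2, X, y, lam):
    """Cross-entropy loss of a one-hidden-layer sigmoid network with soft-max output,
    plus (lam / 2) * ||w||^2 on the weight matrices (biases are not regularized).

    Shapes: X (n, d), W1 (d, 100), b1 (100,), W2 (100, K), b2 (K,),
    y: integer labels in {0, ..., K-1}."""
    h = 1.0 / (1.0 + np.exp(-(X @ W1 + b1)))        # hidden layer: 100 sigmoid units
    scores = h @ W2 + b2                            # K output scores per instance
    scores -= scores.max(axis=1, keepdims=True)     # stabilize the soft-max
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)       # soft-max outputs o_i^(k)
    n = X.shape[0]
    ce = -np.log(probs[np.arange(n), y]).mean()     # -(1/n) sum_i log o_i^(y_i)
    reg = 0.5 * lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
    return ce + reg
```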
Figure 1 illustrates the convergence property of both Hogwild! and AsySVRG. The x-axis denotes the CPU time, where we set the CPU time it takes Hogwild! to pass through the whole dataset once with one thread as 1 unit. The y-axis denotes the training loss. In this experiment, we run Hogwild! and AsySVRG with 10 threads; Hogwild!-10 and AsySVRG-10 denote the corresponding methods with 10 threads. It is easy to see that both Hogwild! and AsySVRG are convergent. Furthermore, AsySVRG is faster than Hogwild!. This is consistent with our theoretical results in this paper.

Figure 1: Hogwild! vs AsySVRG. Training loss versus CPU time on (a) MNIST and (b) connect-4.

Figure 2 reports the results of Hogwild! and AsySVRG with different numbers of threads, where the number of threads is $p = 1, 4, 10$. We can find that in most cases the two methods become faster with the increase of threads. The only outlier is the case of Hogwild! on the dataset connect-4, where Hogwild! using 4 threads is slower than using 1 thread. One possible reason is that we have two CPUs in our server, with 6 cores for each CPU. In the 4-thread case, different threads may be allocated on different CPUs, which will cause extra cost.

Figure 2: Comparison between different numbers of threads. Training loss versus CPU time for (a) Hogwild! on MNIST, (b) AsySVRG on MNIST, (c) Hogwild! on connect-4, and (d) AsySVRG on connect-4.
Conclusion

In this paper, we have provided theoretical proof about the convergence of two representative lock-free strategy based parallel SGD methods, Hogwild! and AsySVRG, for non-convex problems. Empirical results also show that both Hogwild! and AsySVRG are convergent on non-convex problems, which successfully verifies our theoretical results. To the best of our knowledge, this is the first work to prove the convergence of lock-free strategy based parallel SGD methods for non-convex problems.

Acknowledgements

This work is partially supported by NSFC (No. 61472182) and a fund from Tencent.

References

Allen-Zhu, Z., and Hazan, E. 2016. Variance reduction for faster non-convex optimization. In Proceedings of the 33rd International Conference on Machine Learning.
Allen-Zhu, Z., and Yuan, Y. 2016. Improved SVRG for non-strongly-convex or sum-of-non-convex objectives. In Proceedings of the 33rd International Conference on Machine Learning.
Chaturapruek, S.; Duchi, J. C.; and Ré, C. 2015. Asynchronous stochastic convex optimization: the noise is in the noise and SGD don't care. In Proceedings of the Advances in Neural Information Processing Systems.
Ghadimi, S., and Lan, G. 2013. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization 23(4):2341–2368.
Huo, Z., and Huang, H. 2016. Asynchronous stochastic gradient descent with variance reduction for non-convex optimization. CoRR abs/1604.03584.
J. Reddi, S.; Hefny, A.; Sra, S.; Poczos, B.; and Smola, A. J. 2015. On variance reduction in stochastic gradient descent and its asynchronous variants. In Proceedings of the Advances in Neural Information Processing Systems.
Jaggi, M.; Smith, V.; Takac, M.; Terhorst, J.; Krishnan, S.; Hofmann, T.; and Jordan, M. I. 2014. Communication-efficient distributed dual coordinate ascent. In Proceedings of the Advances in Neural Information Processing Systems.
Johnson, R., and Zhang, T. 2013. Accelerating stochastic gradient descent using predictive variance reduction. In Proceedings of the Advances in Neural Information Processing Systems.
Koren, Y.; Bell, R. M.; and Volinsky, C. 2009. Matrix factorization techniques for recommender systems. IEEE Computer 42(8):30–37.
Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems.
Li, M.; Andersen, D. G.; Park, J. W.; Smola, A. J.; Ahmed, A.; Josifovski, V.; Long, J.; Shekita, E. J.; and Su, B. 2014. Scaling distributed machine learning with the parameter server. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation.
Li, X.; Zhao, T.; Arora, R.; Liu, H.; and Haupt, J. 2016. Stochastic variance reduced optimization for nonconvex sparse learning. In Proceedings of the 33rd International Conference on Machine Learning.
Lian, X.; Huang, Y.; Li, Y.; and Liu, J. 2015. Asynchronous parallel stochastic gradient for nonconvex optimization. In Proceedings of the Advances in Neural Information Processing Systems.
Recht, B.; Ré, C.; Wright, S.; and Niu, F. 2011. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. In Proceedings of the Advances in Neural Information Processing Systems.
Reddi, S. J.; Hefny, A.; Sra, S.; Poczos, B.; and Smola, A. J. 2016. Stochastic variance reduction for nonconvex optimization. In Proceedings of the 33rd International Conference on Machine Learning.
Roux, N. L.; Schmidt, M. W.; and Bach, F. 2012. A stochastic gradient method with an exponential convergence rate for finite training sets. In Proceedings of the Advances in Neural Information Processing Systems.
Sa, C. D.; Ré, C.; and Olukotun, K. 2015. Global convergence of stochastic gradient descent for some non-convex matrix problems. In Proceedings of the 32nd International Conference on Machine Learning.
Shalev-Shwartz, S., and Zhang, T. 2013. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research 14(1):567–599.
Shamir, O. 2015. A stochastic PCA and SVD algorithm with an exponential convergence rate. In Proceedings of the 32nd International Conference on Machine Learning.
Shamir, O. 2016a. Convergence of stochastic gradient descent for PCA. In Proceedings of the 33rd International Conference on Machine Learning.
Shamir, O. 2016b. Fast stochastic algorithms for SVD and PCA: Convergence properties and convexity. In Proceedings of the 33rd International Conference on Machine Learning.
Xing, E. P.; Ho, Q.; Dai, W.; Kim, J. K.; Wei, J.; Lee, S.; Zheng, X.; Xie, P.; Kumar, A.; and Yu, Y. 2015. Petuum: A new platform for distributed machine learning on big data. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Zhang, R.; Zheng, S.; and Kwok, J. T. 2016. Asynchronous distributed semi-stochastic gradient optimization. In Proceedings of the AAAI Conference on Artificial Intelligence.
Zhao, S.-Y., and Li, W.-J. 2016. Fast asynchronous parallel stochastic gradient descent: A lock-free approach with convergence guarantee. In Proceedings of the AAAI Conference on Artificial Intelligence.