However, the PSGD methods in (Lian et al. 2015) need a write-lock or an atomic operation on the memory to prove convergence.² Similarly, the work in (Huo and Huang 2016) also does not prove convergence for the lock-free case considered in our paper. Recent works (Recht et al. 2011; Chaturapruek, Duchi, and Ré 2015; J. Reddi et al. 2015; Zhao and Li 2016) find that lock-free parallel SGD (LF-PSGD) methods can empirically outperform lock-based PSGD methods on multi-core systems. Although some existing works (Chaturapruek, Duchi, and Ré 2015; Zhao and Li 2016) have proved the convergence of these LF-PSGD methods for convex problems, no work has proved the convergence of LF-PSGD methods for non-convex problems.

² Although the implementation of AsySG-incon in (Lian et al. 2015) is lock-free, the theoretical analysis of the convergence of AsySG-incon is based on the assumption that no over-writing happens, i.e., the theoretical analysis is not for the lock-free case.

In this paper, we provide theoretical proofs of the convergence of two representative LF-PSGD methods, Hogwild! (Recht et al. 2011; Chaturapruek, Duchi, and Ré 2015) and AsySVRG (Zhao and Li 2016), for non-convex problems. The contributions of this work can be outlined as follows:

• Theoretical results show that both Hogwild! and AsySVRG can converge with the lock-free strategy for non-convex problems.

• Hogwild! achieves a convergence rate of $O(1/\sqrt{\tilde{T}})$ for non-convex problems, where $\tilde{T} = p \times T$ is the total number of iterations over all $p$ threads.

• AsySVRG achieves a convergence rate of $O(1/\tilde{T})$ for non-convex problems.

• To get an $\epsilon$-local optimal solution with AsySVRG, the computation complexity over all threads is $O(n^{2/3}/\epsilon)$, or equivalently the computation complexity of each thread is $O(n^{2/3}/(p\epsilon))$. This is faster than traditional parallel gradient descent methods, whose computation complexity is $O(n/(p\epsilon))$ for each thread.

• Empirical results also show that both Hogwild! and AsySVRG are convergent on non-convex problems, which successfully verifies our theoretical results.

Preliminary

We use $f(\mathbf{w})$ to denote the objective function in (1), which means $f(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n} f_i(\mathbf{w})$. And we use $\|\cdot\|$ to denote the $L_2$-norm $\|\cdot\|_2$.

Assumption 1. Each function $f_i(\cdot)$ in (1) is smooth, which means that there exists a constant $L > 0$ such that $\forall \mathbf{a}, \mathbf{b}$,
$$f_i(\mathbf{b}) \leq f_i(\mathbf{a}) + \nabla f_i(\mathbf{a})^T(\mathbf{b}-\mathbf{a}) + \frac{L}{2}\|\mathbf{b}-\mathbf{a}\|^2,$$
or equivalently
$$\|\nabla f_i(\mathbf{b}) - \nabla f_i(\mathbf{a})\| \leq L\|\mathbf{b}-\mathbf{a}\|.$$

This is a common assumption in the convergence analysis of most existing gradient-based methods.
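To make Assumption 1 concrete, the following sketch (our illustration, not from the paper) numerically checks the gradient-Lipschitz form for a least-squares component $f_i(\mathbf{w}) = \frac{1}{2}(\mathbf{x}_i^T\mathbf{w} - y_i)^2$, whose smoothness constant is $L = \|\mathbf{x}_i\|^2$; the data and all variable names are our own choices.

```python
import numpy as np

# Illustrative check of Assumption 1 (our example, not from the paper):
# for f_i(w) = 0.5 * (x_i^T w - y_i)^2, the gradient is
# grad f_i(w) = (x_i^T w - y_i) * x_i, and since
# grad f_i(b) - grad f_i(a) = x_i x_i^T (b - a),
# the smoothness constant is L = ||x_i||^2.
rng = np.random.default_rng(0)
d = 5
x_i = rng.standard_normal(d)
y_i = rng.standard_normal()

def grad(w):
    return (x_i @ w - y_i) * x_i

L = x_i @ x_i  # largest eigenvalue of the rank-one Hessian x_i x_i^T

for _ in range(1000):
    a, b = rng.standard_normal(d), rng.standard_normal(d)
    assert np.linalg.norm(grad(b) - grad(a)) <= L * np.linalg.norm(b - a) + 1e-9
print("gradient-Lipschitz condition holds with L = ||x_i||^2")
```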
Since we focus on non-convex problems in this paper, it is difficult to get the global solution of (1) with gradient-based methods. Hence, we use $\|\nabla f(\mathbf{w})\|^2$ to measure the convergence instead of $f(\mathbf{w}) - \min_{\mathbf{w}} f(\mathbf{w})$.

Here, we give a lemma which is useful in the convergence analysis of both Hogwild! and AsySVRG.

Lemma 1. Assume $B$ is a positive semi-definite matrix whose largest eigenvalue is less than or equal to $1$ and whose minimum eigenvalue is $\alpha > 0$. Then $\forall \mathbf{x}, \mathbf{y}$,
$$-\nabla f(\mathbf{x})^T B \nabla f(\mathbf{y}) \leq \frac{L^2}{2}\|\mathbf{x}-\mathbf{y}\|^2 - \frac{\alpha}{2}\|\nabla f(\mathbf{x})\|^2.$$

Proof.
$$\begin{aligned}
\frac{\alpha}{2}\|\nabla f(\mathbf{x})\|^2 - \nabla f(\mathbf{x})^T B \nabla f(\mathbf{y})
&\leq \frac{1}{2}\left\|B^{\frac{1}{2}}\nabla f(\mathbf{x})\right\|^2 - \nabla f(\mathbf{x})^T B \nabla f(\mathbf{y}) \\
&\leq \frac{1}{2}\left\|B^{\frac{1}{2}}\nabla f(\mathbf{x})\right\|^2 - \nabla f(\mathbf{x})^T B \nabla f(\mathbf{y}) + \frac{1}{2}\left\|B^{\frac{1}{2}}\nabla f(\mathbf{y})\right\|^2 \\
&= \frac{1}{2}\left\|B^{\frac{1}{2}}\left(\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\right)\right\|^2 \\
&\leq \frac{L^2}{2}\|\mathbf{x}-\mathbf{y}\|^2.
\end{aligned}$$
The first inequality holds because the minimum eigenvalue of $B$ is $\alpha$, so $\alpha\|\mathbf{v}\|^2 \leq \|B^{\frac{1}{2}}\mathbf{v}\|^2$; the second adds the non-negative term $\frac{1}{2}\|B^{\frac{1}{2}}\nabla f(\mathbf{y})\|^2$; and the last uses the fact that the largest eigenvalue of $B$ is at most $1$, together with the $L$-smoothness of $f$. $\square$
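As a sanity check, Lemma 1 can be verified numerically on a simple quadratic objective. The sketch below is our illustration, not part of the paper: it takes $f(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T A\mathbf{w}$, whose smoothness constant is the largest eigenvalue of $A$, and draws a random diagonal $B$ satisfying the eigenvalue conditions of the lemma.

```python
import numpy as np

# Numerical sanity check of Lemma 1 (our example, not from the paper).
rng = np.random.default_rng(0)
d = 10

M = rng.standard_normal((d, d))
A = M @ M.T                       # PSD Hessian of f(w) = 0.5 * w^T A w
L = np.linalg.eigvalsh(A)[-1]     # smoothness constant: largest eigenvalue of A

entries = rng.uniform(0.2, 1.0, size=d)
B = np.diag(entries)              # diagonal B with eigenvalues in (0, 1]
alpha = entries.min()             # minimum eigenvalue of B

def grad(w):
    return A @ w                  # gradient of f(w) = 0.5 * w^T A w

for _ in range(1000):
    x, y = rng.standard_normal(d), rng.standard_normal(d)
    lhs = -grad(x) @ B @ grad(y)
    rhs = (0.5 * L**2 * np.linalg.norm(x - y)**2
           - 0.5 * alpha * np.linalg.norm(grad(x))**2)
    assert lhs <= rhs + 1e-8      # the inequality of Lemma 1
print("Lemma 1 verified on 1000 random (x, y) pairs")
```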
Hogwild! for Non-Convex Problems

The Hogwild! method (Recht et al. 2011) is listed in Algorithm 1. Each thread reads $\mathbf{w}$ from the shared memory, computes a stochastic gradient, and updates the $\mathbf{w}$ in the shared memory. Please note that Hogwild! in (Recht et al. 2011) has several variants, with locks or lock-free. Here, we only focus on the lock-free variant of Hogwild!, which means that we do not use any locks, either read-locks or write-locks, in any thread.

Algorithm 1 Hogwild!
Initialization: $p$ threads, initialize $\mathbf{w}_0$, $\eta$;
For each thread, do:
for $l = 0, 1, 2, \ldots, T-1$ do
    Read the current $\mathbf{w}$ in the shared memory, denoted as $\hat{\mathbf{w}}$;
    Randomly pick an $i$ from $\{1, \ldots, n\}$ and compute the gradient $\nabla f_i(\hat{\mathbf{w}})$;
    $\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla f_i(\hat{\mathbf{w}})$;
end for

As in (Zhao and Li 2016), we can construct an equivalent write sequence $\{\mathbf{w}_t\}$:
$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta B_t \nabla f_{i_t}(\hat{\mathbf{w}}_t), \qquad (2)$$
where $0 \leq t \leq p \times T$ and $B_t$ is a random diagonal matrix whose diagonal entries are $0$ or $1$. The $B_t$ is used to denote whether over-writing happens: if the $k$th diagonal entry of $B_t$ is $0$, it means that the $k$th element of the gradient vector $\nabla f_{i_t}(\hat{\mathbf{w}}_t)$ is over-written by other threads.
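To connect Algorithm 1 with the write sequence in (2), the following minimal sketch (our illustration; the least-squares loss with a non-convex regularizer and all names are our assumptions, not from the paper) runs $p$ Python threads that read and write a shared parameter vector without any locks, so concurrent updates may over-write each other, which is exactly what the random matrices $B_t$ in (2) model.

```python
import threading
import numpy as np

# Lock-free Hogwild! sketch in the spirit of Algorithm 1 (illustrative only).
# Objective (our choice): f_i(w) = 0.5*(x_i^T w - y_i)^2
#                                  + lam * sum_k w_k^2 / (1 + w_k^2),
# i.e., least squares with a common non-convex regularizer.
n, d, p, T = 1000, 20, 4, 5000
eta, lam = 0.01, 0.1
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

w = np.zeros(d)  # the shared parameter vector ("shared memory")

def worker(w, seed):
    local_rng = np.random.default_rng(seed)
    for _ in range(T):
        w_hat = w.copy()                  # lock-free read of the current w
        i = local_rng.integers(n)         # randomly pick a component f_i
        g = ((X[i] @ w_hat - y[i]) * X[i]            # data-fit gradient
             + lam * 2 * w_hat / (1 + w_hat**2)**2)  # regularizer gradient
        w -= eta * g                      # lock-free write; may be over-written

threads = [threading.Thread(target=worker, args=(w, s)) for s in range(p)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Report the convergence measure used in the paper, ||grad f(w)||^2.
g_full = X.T @ (X @ w - y) / n + lam * 2 * w / (1 + w**2)**2
print("||grad f(w)||^2 after lock-free training:", np.linalg.norm(g_full) ** 2)
```

Note that this sketch only illustrates the access pattern: CPython's global interpreter lock serializes individual bytecode operations, but the read-compute-write races between threads remain, because reading $\hat{\mathbf{w}}$ and writing the update back are separate, unsynchronized steps.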