If we take the stepsize $\eta = \sqrt{\frac{A}{\tilde{T}B}}$, we get
$$\frac{1}{\tilde{T}} \sum_{t=0}^{\tilde{T}-1} \mathbb{E}\|\nabla f(w_t)\|^2 \le \sqrt{\frac{AB}{\tilde{T}}}.$$
Hence, our theoretical result shows that Hogwild! with the lock-free strategy achieves a convergence rate of $O(1/\sqrt{\tilde{T}})$ for non-convex problems, where $\tilde{T} = p \times T$ is the total number of iterations over the $p$ threads.
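As a quick numerical illustration of this stepsize rule, the short sketch below plugs placeholder values of $A$ and $B$ (the constants from the Hogwild! analysis, which depend on quantities such as $f(w_0) - f^*$, $L$ and the delay bound $\tau$) into $\eta = \sqrt{A/(\tilde{T}B)}$ and the resulting bound $\sqrt{AB/\tilde{T}}$; the specific numbers are assumptions chosen only for illustration.

import math

# Placeholder constants from the Hogwild! analysis; their true values depend on
# f(w_0) - f*, the smoothness constant L, the delay bound tau, etc.
A, B = 10.0, 4.0
p, T = 8, 10_000
T_tilde = p * T                       # total iterations across the p threads

eta = math.sqrt(A / (T_tilde * B))    # stepsize eta = sqrt(A / (T_tilde * B))
bound = math.sqrt(A * B / T_tilde)    # bound on (1/T_tilde) * sum_t E||grad f(w_t)||^2

print(f"eta = {eta:.3e}, gradient-norm bound = {bound:.3e}")
# Quadrupling T_tilde halves the bound, matching the O(1/sqrt(T_tilde)) rate.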
AsySVRG for Non-Convex Problems

The AsySVRG method (Zhao and Li 2016) is listed in Algorithm 2. AsySVRG provides a lock-free parallel strategy for the original sequential SVRG (Johnson and Zhang 2013). Compared with Hogwild!, AsySVRG uses the full gradient to construct a variance-reduced stochastic gradient, which has been proved to achieve a linear convergence rate on strongly convex problems (Zhao and Li 2016). In this section, we prove that AsySVRG is also convergent for non-convex problems and has a faster convergence rate than Hogwild! on non-convex problems.

Algorithm 2 AsySVRG
Initialization: $p$ threads, initialize $w_0$, $\eta$;
for $t = 0, 1, 2, \ldots, T-1$ do
    $u_0 = w_t$;
    All threads compute the full gradient $\nabla f(u_0) = \frac{1}{n}\sum_{i=1}^{n} \nabla f_i(u_0)$ in parallel;
    $u = w_t$;
    For each thread, do:
    for $j = 0$ to $M-1$ do
        Read the current value of $u$, denoted as $\hat{u}$, from the shared memory, and randomly pick an $i$ from $\{1, \ldots, n\}$;
        Compute the update vector: $\hat{v} = \nabla f_i(\hat{u}) - \nabla f_i(u_0) + \nabla f(u_0)$;
        $u \leftarrow u - \eta \hat{v}$;
    end for
    Take $w_{t+1}$ to be the current value of $u$ in the shared memory;
end for

Similar to the analysis in the last section, we construct an equivalent write sequence $\{u_{t,m}\}$ for the $t$-th outer-loop:
$$u_{t,0} = w_t, \qquad u_{t,m+1} = u_{t,m} - \eta B_{t,m}\hat{v}_{t,m}, \qquad (5)$$
where $\hat{v}_{t,m} = \nabla f_{i_{t,m}}(\hat{u}_{t,m}) - \nabla f_{i_{t,m}}(u_{t,0}) + \nabla f(u_{t,0})$ and $B_{t,m}$ is a diagonal matrix whose diagonal entries are 0 or 1. Here $\hat{u}_{t,m}$ is the value read by the thread that computes $\hat{v}_{t,m}$, and it has the following form:
$$\hat{u}_{t,m} = u_{t,a(m)} - \eta \sum_{j=a(m)}^{m-1} P^{(t)}_{m,j-a(m)} \hat{v}_{t,j},$$
where $P^{(t)}_{m,j-a(m)}$ is a diagonal matrix whose diagonal entries are 0 or 1. Note that according to (5), $u_{t,\tilde{M}} = w_{t+1}$, since all the stochastic gradients have been written into $w$ at the end of the $t$-th outer-loop. Here, we also need the assumptions: $0 \le m - a(m) \le \tau$; $\mathbb{E}[B_{t,m} \mid u_{t,m}, \hat{u}_{t,m}] = B \succ 0$ with minimum eigenvalue $\alpha > 0$; and $B_{t,m}$ and $i_{t,m}$ are independent. These assumptions are similar to those in the previous section.

For convenience, let $p_i(x) = \nabla f_i(x) - \nabla f_i(u_{t,0}) + \nabla f(u_{t,0})$, and in this section we denote
$$q(x) = \frac{1}{n}\sum_{i=1}^{n} \|p_i(x)\|^2.$$
It is easy to find that $\mathbb{E} q(\hat{u}_{t,m}) = \mathbb{E}[\|\hat{v}_{t,m}\|^2]$.

The difference between Hogwild! and AsySVRG lies in the stochastic gradient, and we have the following lemmas, which lead to the faster convergence rate of AsySVRG:

Lemma 4. $\forall x$, we have
$$q(x) \le 2L^2\|x - u_{t,0}\|^2 + 2\|\nabla f(x)\|^2.$$

Proof.
$$\begin{aligned}
q(x) &= \frac{1}{n}\sum_{i=1}^{n} \|\nabla f_i(x) - \nabla f_i(u_{t,0}) + \nabla f(u_{t,0})\|^2 \\
&\le \frac{2}{n}\sum_{i=1}^{n} \|\nabla f_i(x) - \nabla f_i(u_{t,0}) + \nabla f(u_{t,0}) - \nabla f(x)\|^2 + 2\|\nabla f(x)\|^2 \\
&\le \frac{2}{n}\sum_{i=1}^{n} \|\nabla f_i(x) - \nabla f_i(u_{t,0})\|^2 + 2\|\nabla f(x)\|^2 \\
&\le 2L^2\|x - u_{t,0}\|^2 + 2\|\nabla f(x)\|^2.
\end{aligned}$$
Here the first inequality uses $\|a+b\|^2 \le 2\|a\|^2 + 2\|b\|^2$, the second uses the fact that $\nabla f(u_{t,0}) - \nabla f(x)$ is the average of $-(\nabla f_i(x) - \nabla f_i(u_{t,0}))$ over $i$, and the last uses the smoothness of each $f_i$.

According to Lemma 4, we can see that AsySVRG is a variance reduction method for non-convex problems, because when $\hat{u}_{t,m}$ and $u_{t,0}$ get close to some stationary point, $q(\hat{u}_{t,m})$ gets close to 0. Hence, we do not need the bounded gradient assumption for the convergence proof.

Since $u_{t,m} \neq \hat{u}_{t,m}$, the difficulty of the convergence analysis lies in the gap between $u_{t,m}$ and $\hat{u}_{t,m}$, and in the relation between $q(\hat{u}_{t,m})$ and $q(u_{t,m})$.

Lemma 5. In AsySVRG, we have $\mathbb{E} q(\hat{u}_{t,m}) < \rho\, \mathbb{E} q(\hat{u}_{t,m+1})$ if we choose $\rho$ and $\eta$ to satisfy
$$\frac{1}{1 - \eta - \frac{9\eta(\tau+1)L^2(\rho^{\tau+1}-1)}{\rho-1}} \le \rho.$$

Lemma 6. With the condition on $\rho, \eta$ in Lemma 5, we have
$$\mathbb{E}\|u_{t,m} - \hat{u}_{t,m}\|^2 \le \frac{4\eta^2\tau\rho(\rho^{\tau} - 1)}{\rho - 1}\, \mathbb{E} q(\hat{u}_{t,m}). \qquad (6)$$

Lemma 7. With the condition on $\rho, \eta$ in Lemma 5, we have $\mathbb{E} q(\hat{u}_{t,m}) < \rho\, \mathbb{E} q(u_{t,m})$.

Combining Lemma 6 and Lemma 7, we can directly obtain
$$\mathbb{E}\|\hat{u}_{t,m} - u_{t,m}\|^2 \le \frac{4\eta^2\tau\rho^2(\rho^{\tau} - 1)}{\rho - 1}\, \mathbb{E} q(u_{t,m}). \qquad (7)$$
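To make the structure of Algorithm 2 concrete, the following sketch simulates the lock-free inner loop with Python threads on a small least-squares objective. The objective, the data, and the hyperparameters ($p$, $T$, $M$, $\eta$) are illustrative assumptions rather than settings from the paper, and Python threads only approximate true lock-free hardware writes; the point is to show the inconsistent reads $\hat{u}$ from shared memory and the variance-reduced update $\hat{v} = \nabla f_i(\hat{u}) - \nabla f_i(u_0) + \nabla f(u_0)$, whose norm shrinks as the iterates approach a stationary point (the property stated in Lemma 4).

import threading
import numpy as np

# Sketch of Algorithm 2 (AsySVRG) on a toy least-squares problem
# f(w) = (1/n) sum_i 0.5 * (x_i^T w - y_i)^2.  Data and hyperparameters are
# illustrative assumptions; a real implementation would perform the lock-free
# writes coordinate-wise in a lower-level language.

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def grad_i(w, i):
    # gradient of the i-th component function f_i
    return (X[i] @ w - y[i]) * X[i]

def full_grad(w):
    # full gradient (1/n) sum_i grad f_i(w)
    return X.T @ (X @ w - y) / n

def asysvrg(p=4, T=10, M=200, eta=0.02):
    w = np.zeros(d)                       # parameter vector in shared memory
    for t in range(T):
        u0 = w.copy()                     # snapshot u_{t,0} = w_t
        g0 = full_grad(u0)                # full gradient at the snapshot

        def worker(seed):
            local_rng = np.random.default_rng(seed)
            for _ in range(M):
                u_hat = w.copy()          # inconsistent read of shared memory
                i = local_rng.integers(n)
                # variance-reduced stochastic gradient v_hat
                v_hat = grad_i(u_hat, i) - grad_i(u0, i) + g0
                w[:] -= eta * v_hat       # lock-free write; may race with others

        threads = [threading.Thread(target=worker, args=(t * p + k,))
                   for k in range(p)]
        for th in threads:
            th.start()
        for th in threads:
            th.join()
        # w_{t+1} is simply the value left in shared memory after all writes
    return w

w_out = asysvrg()
print("||grad f(w)|| after AsySVRG:", np.linalg.norm(full_grad(w_out)))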