正在加载图片...
is overwritten by other threads.Otherwise,that element is Lemma 3.With the condition about p,n in Lemma 2,we not overwritten. have wt is read by the thread who computes Vfi,(wt)and has the following format: E4-,P≤4rpp'-业g) p-1 (4) Combining with Assumption 5,we can find that the gap of wt=Wa(t) Pt.j-a(t)Vfi,(wj) (3) the write sequence and read sequence can always be bound- j=a(t) ed by a constantnr where a(t)means that some old stochastic gradients have been completely written on the w in the shared memory. Theorem 1.Let A=21(w)and B=2V2( a(p-1) Pjt)is a diagonal matrix whose diagonal entries are 0 会If we take the stepsize n=√where T-pxT, 「4 or 1,which means w:might include parts of new stochastic gradients. we can get the following result: In the lock-free strategy,we need the following assump- 一1 tions to guarantee convergence: AB Ivf(wP≤V Assumption2.a(t)is bounded by:0≤t-a(t)≤T It means that the old stochastic gradients Proof.According to Assumption 1,we have Vfio,...,Vfihave been completely written on w in the shared memory. E[f(wt+i)小wt,w] Assumption 3.We consider the matrix Bt as a random ma- <f(w:)-nE[Vf(w:)TB:Vfi(w:)lwt,w:] trix and EBwt,wt=B 0 with the minimum eigen- value a 0. +牙v,副 According to the definition of B:,it is easy to find B:,B =f(wt)-nVf(w:)TBVf(w:) are positive semi-definite matrices and the largest eigenvalue of B is less than or equal to 1.Assumption 3 means that the +牙v.(时 probability that over-writing happens is at most 1-a <1 for each write step. ≤fw)-受Ivf0wP+w-a4P Assumption 4.Bt and it are independent. Since it is the random index selected by each thread while E了j.()w,, 十2 B:is highly affected by the hardware,the independence as- sumption is reasonable. where the first equality uses Assumption 4,the second in- For Hogwild!,the following assumption is also necessary: equality uses Lemma 1.Taking expectation on the above in- equality,we obtain Assumption 5.There exists a constant V,Vfi(w) V,i=1,,n. Ef(wt+1) For convenience.in this section.we denote ≤B-受wP+空2-aP q(x)= ∑f(x)2 Ln2v2 + n i=1 2 It is easy to find that Eg(w:)=ElVfi (w)2]and note ≤gf)-罗NwP that when x is close to some stationary point,g(x)may still be far away from 0.Hence,it is not a variance reduction +ve2Bnm-D+克 method and we need to control the variance of the stochastic p-1 gradient. where the first inequality uses Assumption 5 and second in- The difficulty of the analysis is wt wt.Here,we give equality uses Lemma 3.Summing the above inequality from the following Lemmas 3: t=0 to T-1,we get Lemma 2.In Hogwild!,we have Eq(wt)<pEq(wt+1)if 7-1 p,n satisfy ∑ElVf(w)I2 t=0 1-7-物+1L2(or+1-D≤p. 2 p-1 ≤2f0wo)+2niv2(2rIor-卫+ an a(p-1) 20 3The proof of some Lemmas can be found in the supplemen- For convenience,let A 2f(wo)and B tary material,which can be downloaded from http://cs.nju. a edu.cn/1wj/paper/LFnonConvex_sup.pdf. 2).which are two bounded constants. a(p-1)is overwritten by other threads. Otherwise, that element is not overwritten. wˆ t is read by the thread who computes ∇fit (wˆ t) and has the following format: wˆ t = wa(t) − η Xt−1 j=a(t) Pt,j−a(t)∇fij (wˆ j ), (3) where a(t) means that some old stochastic gradients have been completely written on the w in the shared memory. Pt,j−a(t) is a diagonal matrix whose diagonal entries are 0 or 1, which means wˆ t might include parts of new stochastic gradients. In the lock-free strategy, we need the following assump￾tions to guarantee convergence: Assumption 2. a(t) is bounded by: 0 ≤ t − a(t) ≤ τ It means that the old stochastic gradients ∇fi0 , . . . , ∇fit−τ−1 have been completely written on w in the shared memory. Assumption 3. We consider the matrix Bt as a random ma￾trix and E[Bt|wt, wˆ t] = B  0 with the minimum eigen￾value α > 0. According to the definition of Bt, it is easy to find Bt, B are positive semi-definite matrices and the largest eigenvalue of B is less than or equal to 1. Assumption 3 means that the probability that over-writing happens is at most 1 − α < 1 for each write step. Assumption 4. Bt and it are independent. Since it is the random index selected by each thread while Bt is highly affected by the hardware, the independence as￾sumption is reasonable. For Hogwild!, the following assumption is also necessary: Assumption 5. There exists a constant V , k∇fi(w)k ≤ V, i = 1, . . . , n. For convenience, in this section, we denote q(x) = 1 n Xn i=1 kfi(x)k 2 . It is easy to find that Eq(wˆ t) = E[k∇fit (wˆ t)k 2 ] and note that when x is close to some stationary point, q(x) may still be far away from 0. Hence, it is not a variance reduction method and we need to control the variance of the stochastic gradient. The difficulty of the analysis is wt 6= wˆ t. Here, we give the following Lemmas 3 : Lemma 2. In Hogwild!, we have Eq(wˆ t) ≤ ρEq(wˆ t+1) if ρ, η satisfy 1 1 − η − 9η(τ+1)L2(ρτ+1−1) ρ−1 ≤ ρ. 3The proof of some Lemmas can be found in the supplemen￾tary material, which can be downloaded from http://cs.nju. edu.cn/lwj/paper/LFnonConvex_sup.pdf. Lemma 3. With the condition about ρ, η in Lemma 2, we have Ekwt − wˆ tk 2 ≤ 4η 2 τ ρ(ρ τ − 1) ρ − 1 Eq(wˆ t) (4) Combining with Assumption 5, we can find that the gap of the write sequence and read sequence can always be bound￾ed by a constant 4η 2V 2 τρ(ρ τ −1) ρ−1 . Theorem 1. Let A = 2f(w0) α and B = 2V 2 ( 2τL2ηρ(ρ τ −1) α(ρ−1) + L 2α ). If we take the stepsize η = q A T B˜ , where T˜ = p × T, we can get the following result: 1 T˜ T X˜−1 t=0 Ek∇f(wt)k 2 ≤ r AB T˜ . Proof. According to Assumption 1, we have E[f(wt+1)|wt, wˆ t] ≤f(wt) − ηE[∇f(wt) T Bt∇fit (wˆ t)|wt, wˆ t] + Lη2 2 E[k∇fit (wˆ t)k 2 |wt, wˆ t] =f(wt) − η∇f(wt) T B∇f(wˆ t) + Lη2 2 E[k∇fit (wˆ t)k 2 |wt, wˆ t] ≤f(wt) − αη 2 k∇f(wt)k 2 + L 2η 2 kwt − wˆ tk 2 + Lη2 2 E[k∇fit (wˆ t)k 2 |wt, wˆ t], where the first equality uses Assumption 4, the second in￾equality uses Lemma 1. Taking expectation on the above in￾equality, we obtain Ef(wt+1) ≤Ef(wt) − αη 2 Ek∇f(wt)k 2 + L 2η 2 Ekwt − wˆ tk 2 + Lη2V 2 2 ≤Ef(wt) − αη 2 Ek∇f(wt)k 2 + η 2V 2 ( 2τL2ηρ(ρ τ − 1) ρ − 1 + L 2 ), where the first inequality uses Assumption 5 and second in￾equality uses Lemma 3. Summing the above inequality from t = 0 to T˜ − 1, we get T X˜−1 t=0 Ek∇f(wt)k 2 ≤ 2 αη f(w0) + 2ηT V˜ 2 ( 2τL2ηρ(ρ τ − 1) α(ρ − 1) + L 2α ). For convenience, let A = 2f(w0) α and B = 2V 2 ( 2τL2ηρ(ρ τ −1) α(ρ−1) + L 2α ), which are two bounded constants
<<向上翻页向下翻页>>
©2008-现在 cucdc.com 高等教育资讯网 版权所有