parallel version of SVRG. Furthermore, in Algorithm 1, all threads read and write the shared memory without any locks. Hence, Algorithm 1 is a lock-free asynchronous approach to parallelize SVRG, which is called AsySVRG in this paper.

An Equivalent Synthetic Process

In the lock-free case, we do not use any lock when a thread reads $u$ from the shared memory or updates (writes) $u$ in the shared memory. Hence, the values taken by $u$ appear totally disordered, which makes the convergence analysis very difficult.

In this paper, we find that there exists a synthetic process which generates the final value of $u$ after all threads have completed their updates in the inner loop of Algorithm 1. That is, we can generate a sequence of synthetic values of $u$, in a well-defined order, whose last element is the final $u$; based on this sequence we can prove the convergence of the lock-free AsySVRG in Algorithm 1.

Synthetic Write (Update) Sequence

The key step in the inner loop of Algorithm 1 is $u \leftarrow u - \eta \hat{v}$, which can be rewritten as

$u \leftarrow u + \Delta$,    (3)

where $\Delta = -\eta \hat{v}$ is the update vector computed by each thread.

Let $u = (u^{(1)}, u^{(2)}, \ldots, u^{(d)})$, with $u^{(i)}$ denoting the $i$th element of $u$. Since each thread keeps a local count, we use $\Delta_{i,j}$ to denote the $j$th update vector produced by the $i$th thread ($i = 1, 2, \ldots, p$; $j = 1, 2, \ldots, M$), and $t^{(k)}_{i,j}$ to denote the time at which $u^{(k)}$ is changed by $\Delta^{(k)}_{i,j}$. Without loss of generality, we assume that every thread updates the elements of $u$ in the order from $u^{(1)}$ to $u^{(d)}$, which can easily be enforced by the program. Hence, we have

$\forall i, j: \quad t^{(1)}_{i,j} < t^{(2)}_{i,j} < \cdots < t^{(d)}_{i,j}$,    (4)

$\forall i, k, l, \ m < n: \quad t^{(k)}_{i,m} < t^{(l)}_{i,n}$,    (5)

and the $\{t^{(1)}_{i,j}\}$ are distinct, since $u^{(1)}$ can be changed by only one thread at any instant. We sort the $\{t^{(1)}_{i,j}\}$ in increasing order and use $\Delta_0, \Delta_1, \ldots, \Delta_m, \ldots, \Delta_{\tilde{M}-1}$, where $\tilde{M} = p \times M$, to denote the corresponding update vectors. Sorting by $t^{(1)}_{i,j}$ is more useful than sorting by $t^{(k)}_{i,j}$ ($k > 1$) because before $t^{(1)}_{i,j}$ the update vector $\Delta_{i,j}$ has not been applied to $u$ at all. Furthermore, this ordering will be convenient when we discuss what is read by each thread.

Since we do not use any lock, over-writing may happen while a thread is performing the update in (3). The real update vector can be written as $B_m \Delta_m$, where $B_m$ is a diagonal matrix whose diagonal entries are 0 or 1. If $B_m(k,k) = 0$, then $\Delta^{(k)}_m$ is over-written by other threads; if $B_m(k,k) = 1$, then $u^{(k)}$ is updated by $\Delta^{(k)}_m$ successfully, without over-writing. However, we do not know the exact $B_m$; it can be seen as a random variable. Then (3) can be rewritten as

$u \leftarrow u + B_m \Delta_m$.
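The following is a minimal sketch, not taken from the paper, that simulates this over-writing behavior with NumPy. The two updates, the three-dimensional $u$, and the hand-picked interleaving schedule are all illustrative assumptions; real threads interleave nondeterministically, but the effect on each coordinate is the same kind of 0/1 mask $B_m$.

```python
import numpy as np

# Deterministic simulation of the lock-free coordinate-wise update (3).
# Two simulated threads both perform the read-modify-write
# u[k] = u[k] + delta[k] on a shared 3-dimensional u, coordinate by
# coordinate, without any lock.  The hand-picked schedule below makes
# thread 1 over-write thread 0's contribution to the first coordinate.
u = np.zeros(3)                                  # shared memory, u0 = 0
delta = [np.array([1.0, 1.0, 1.0]),              # Delta_0 (thread 0)
         np.array([2.0, 2.0, 2.0])]              # Delta_1 (thread 1)

# Each event is (thread id, coordinate, phase).
schedule = [(0, 0, "read"), (1, 0, "read"),      # both read the old u[0]
            (0, 0, "write"), (1, 0, "write"),    # thread 1 over-writes thread 0
            (0, 1, "read"), (0, 1, "write"),     # remaining coordinates do not race
            (1, 1, "read"), (1, 1, "write"),
            (0, 2, "read"), (0, 2, "write"),
            (1, 2, "read"), (1, 2, "write")]

local = {0: {}, 1: {}}                           # per-thread registers holding u[k]
for tid, k, phase in schedule:
    if phase == "read":
        local[tid][k] = u[k]
    else:
        u[k] = local[tid][k] + delta[tid][k]

print(u)                                         # [2. 3. 3.]: Delta_0's first coordinate is lost

# Same final value as u0 + B_0 Delta_0 + B_1 Delta_1 with
# B_0 = diag(0, 1, 1) (coordinate 0 over-written) and B_1 = I.
B0, B1 = np.diag([0.0, 1.0, 1.0]), np.eye(3)
print(B0 @ delta[0] + B1 @ delta[1])             # [2. 3. 3.]
```

In this run, thread 1's write to the first coordinate lands after thread 0's, so thread 0's contribution to that coordinate is lost, i.e. $B_0 = \mathrm{diag}(0, 1, 1)$ and $B_1 = I$.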
When the inner loops of all threads are completed, the current $u$ in the shared memory can be written as

$u = u_0 + \sum_{i=0}^{\tilde{M}-1} B_i \Delta_i$.    (6)

According to (6) and the definition of $\Delta_m$, we can define a synthetic write (update) sequence $\{u_m\}$ with $u_0 = w_t$ and, for $m = 1, \ldots, \tilde{M}$,

$u_m = u_0 + \sum_{i=0}^{m-1} B_i \Delta_i$.    (7)

It is easy to see that $u_{m+1} = u_m + B_m \Delta_m$. We can also see that $u_m$ is the value obtained after the update vectors $\Delta_0, \ldots, \Delta_{m-1}$ have been completely applied to $u$ in the shared memory.

Please note that the sequence $\{u_m\}$ is synthetic: the whole vector $u_m = (u^{(1)}_m, u^{(2)}_m, \ldots, u^{(d)}_m)$ may never occur in the shared memory, which means that we cannot obtain any $u_m$ ($m = 1, 2, \ldots, \tilde{M}-1$), nor the average of these $u_m$, while the inner loop is running. All we can obtain is the final value $u_{\tilde{M}}$ after all threads have completed their updates in the inner loop of Algorithm 1. Hence, we have found an equivalent synthetic update process, with a well-defined order, that generates the same value as the disordered lock-free update process.
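As a sanity check on (6) and (7), here is a small sketch, again not from the paper, that builds the synthetic sequence from recorded masks and updates. The dimension, the number of updates $\tilde{M}$, and the random 0/1 masks standing in for the unknown over-writing pattern are all assumptions made for illustration.

```python
import numpy as np

# Synthetic write sequence of (7), with the masks B_i and updates Delta_i
# replaced by random stand-ins (the real masks are induced by the thread
# interleaving and are unknown in advance).
rng = np.random.default_rng(0)
d, M_tilde = 4, 6                                   # dimension and number of updates (illustrative)
u0 = np.zeros(d)
deltas = [rng.normal(size=d) for _ in range(M_tilde)]
masks = [np.diag(rng.integers(0, 2, size=d).astype(float))   # B_i: 0/1 diagonal
         for _ in range(M_tilde)]

# u_{m+1} = u_m + B_m Delta_m, starting from u_0.
u_synthetic = [u0]
for m in range(M_tilde):
    u_synthetic.append(u_synthetic[-1] + masks[m] @ deltas[m])

# Equation (6): the final shared-memory value is u0 + sum_i B_i Delta_i.
u_final = u0 + sum(B @ dlt for B, dlt in zip(masks, deltas))
assert np.allclose(u_synthetic[-1], u_final)
```

The assert confirms that applying the masked updates one at a time in the synthetic order reproduces the final shared-memory value in (6), even though the intermediate $u_m$ are never materialized during a real run.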
Read Sequence

We use $\hat{u}_m$ to denote the parameter read from the shared memory and used by a thread to compute $\Delta_m$. Based on the synthetic write sequence $\{u_m\}$, $\hat{u}_m$ can be written as $\hat{u}_m = u_{a(m)} + \sum_{i=a(m)}^{b(m)} P_{m,i-a(m)} \Delta_i$. Here, $a(m) < m$ denotes a time point at which the update vectors $\Delta_0, \ldots, \Delta_{a(m)-1}$ have been completely applied to $u$ in the shared memory; the $\{P_{m,i-a(m)}\}$ are diagonal matrices whose diagonal entries are 0 or 1; and $b(m) \geq a(m)$ denotes another time point. The sum $\sum_{i=a(m)}^{b(m)} P_{m,i-a(m)} \Delta_i$ means that, besides $u_{a(m)}$, $\hat{u}_m$ also includes some newer update vectors applied between time points $a(m)$ and $b(m)$. According to the definition of $\Delta_i$, none of the $\{\Delta_i\}$ with $i \geq m$ has been applied to $u$ at the time point when a thread reads $\hat{u}_m$, so we can take $b(m) < m$ and reformulate $\hat{u}_m$ as

$\hat{u}_m = u_{a(m)} + \sum_{i=a(m)}^{m-1} P_{m,i-a(m)} \Delta_i$.    (8)

The matrix $P_{m,i-a(m)}$ indicates that $\hat{u}_m$ may read some components of the newer $\{\Delta_i\}$ ($a(m) \leq i < m$), including those that might be over-written by other threads. It can also be seen as a random variable. $P_{m,i-a(m)}$ and $B_i$ are not necessarily equal to each other.
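To make (8) concrete, here is a minimal sketch of how one read $\hat{u}_m$ is assembled. The read index $m$, the point $a(m)$, and the random masks $P_{m,i-a(m)}$ are arbitrary illustrative choices; in a real execution they are induced by the thread interleaving rather than sampled.

```python
import numpy as np

# One read u_hat_m under (8).  All indices and masks below are made up
# for illustration; in a real run they are determined by the interleaving.
rng = np.random.default_rng(1)
d, m, a_m = 4, 5, 2                                  # dimension, read index, a(m) < m
u0 = np.zeros(d)
deltas = [rng.normal(size=d) for _ in range(m)]      # Delta_0, ..., Delta_{m-1}
B = [np.diag(rng.integers(0, 2, size=d).astype(float)) for _ in range(m)]

# u_{a(m)}: the first a(m) updates have been completely applied, as in (7).
u_a = u0 + sum(B[i] @ deltas[i] for i in range(a_m))

# On top of u_{a(m)}, the reader also picks up some coordinates of the
# newer updates Delta_{a(m)}, ..., Delta_{m-1}.
u_hat_m = u_a.copy()
for i in range(a_m, m):
    P = np.diag(rng.integers(0, 2, size=d).astype(float))    # P_{m, i-a(m)}
    u_hat_m += P @ deltas[i]

print(u_hat_m)   # the (possibly inconsistent) parameter used to compute Delta_m
```

Because each $P_{m,i-a(m)}$ keeps only some coordinates of the newer updates, $\hat{u}_m$ is in general not equal to any synthetic iterate $u_j$; this is the kind of inconsistent read the convergence analysis has to account for.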