Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee

Shen-Yi Zhao and Wu-Jun Li
National Key Laboratory for Novel Software Technology
Department of Computer Science and Technology, Nanjing University, China
zhaosy@lamda.nju.edu.cn, liwujun@nju.edu.cn

Abstract

Stochastic gradient descent (SGD) and its variants have become more and more popular in machine learning due to their efficiency and effectiveness. To handle large-scale problems, researchers have recently proposed several parallel SGD methods for multicore systems. However, existing parallel SGD methods cannot achieve satisfactory performance in real applications. In this paper, we propose a fast asynchronous parallel SGD method, called AsySVRG, by designing an asynchronous strategy to parallelize the recently proposed SGD variant called stochastic variance reduced gradient (SVRG). AsySVRG adopts a lock-free strategy which is more efficient than other strategies with locks. Furthermore, we theoretically prove that AsySVRG is convergent with a linear convergence rate. Both theoretical and empirical results show that AsySVRG can outperform existing state-of-the-art parallel SGD methods like Hogwild! in terms of convergence rate and computation cost.

Copyright 2016, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Introduction

Assume we have a set of labeled instances $\{(x_i, y_i)\,|\,i = 1, \ldots, n\}$, where $x_i \in \mathbb{R}^d$ is the feature vector for instance $i$, $d$ is the feature size and $y_i \in \{1, -1\}$ is the class label of $x_i$. In machine learning, we often need to solve the following regularized empirical risk minimization problem:

$$\min_{w} \; \frac{1}{n} \sum_{i=1}^{n} f_i(w), \qquad (1)$$

where $w$ is the parameter to learn and $f_i(w)$ is the loss function defined on instance $i$, which often includes a regularization term to avoid overfitting. For example, $f_i(w)$ can be $\log(1 + e^{-y_i x_i^T w}) + \frac{\lambda}{2}\|w\|^2$, which is known as the logistic loss plus a regularization term, or $\max(0, 1 - y_i x_i^T w) + \frac{\lambda}{2}\|w\|^2$, which is known as the regularized hinge loss in support vector machine (SVM). Besides $\frac{\lambda}{2}\|w\|^2$, the regularization term can also be $\lambda\|w\|_1$ or some other forms.
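To make the objective in (1) concrete, the following is a minimal NumPy sketch of the two regularized losses mentioned above; the helper names (logistic_loss, hinge_loss, objective) are ours and are not part of the paper.

```python
import numpy as np

def logistic_loss(w, x_i, y_i, lam):
    # f_i(w) = log(1 + exp(-y_i * x_i^T w)) + (lambda / 2) * ||w||^2
    margin = y_i * x_i.dot(w)
    return np.log1p(np.exp(-margin)) + 0.5 * lam * w.dot(w)

def hinge_loss(w, x_i, y_i, lam):
    # f_i(w) = max(0, 1 - y_i * x_i^T w) + (lambda / 2) * ||w||^2
    return max(0.0, 1.0 - y_i * x_i.dot(w)) + 0.5 * lam * w.dot(w)

def objective(w, X, y, lam, loss=logistic_loss):
    # f(w) = (1/n) * sum_i f_i(w), the objective in (1)
    return np.mean([loss(w, X[i], y[i], lam) for i in range(len(y))])
```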
Due to their efficiency and effectiveness, stochastic gradient descent (SGD) and its variants (Xiao 2009; Duchi and Singer 2009; Roux, Schmidt, and Bach 2012; Johnson and Zhang 2013; Mairal 2013; Shalev-Shwartz and Zhang 2013; Liu et al. 2014; Nitanda 2014; Zhang and Kwok 2014) have recently attracted much attention for solving machine learning problems like that in (1). Many works have proved that SGD and its variants can outperform traditional batch learning algorithms such as gradient descent or Newton methods in real applications.

In many real-world problems, the number of instances n is typically very large. In this case, the traditional sequential SGD methods might not be efficient enough to find the optimal solution of (1). On the other hand, clusters and multicore systems have become popular in recent years. Hence, to handle large-scale problems, researchers have recently proposed several distributed SGD methods for clusters and parallel SGD methods for multicore systems. Although distributed SGD methods for clusters like those in (Zinkevich, Smola, and Langford 2009; Duchi, Agarwal, and Wainwright 2010; Zinkevich et al. 2010; Zhang, Zheng, and T. Kwok 2015) are meaningful for handling very large-scale problems, there also exist a lot of problems which can be solved by a single machine with multiple cores. Furthermore, even in distributed settings with clusters, each machine (node) of the cluster typically has multiple cores. Hence, how to design effective parallel SGD methods for multicore systems has become a key issue in solving large-scale learning problems like that in (1).

There have appeared some parallel SGD methods for multicore systems. The round-robin scheme proposed in (Zinkevich, Smola, and Langford 2009) tries to order the processors, and then each processor updates the variables in order. Hogwild! (Recht et al. 2011) is an asynchronous approach for parallel SGD. Experimental results in (Recht et al. 2011) have shown that Hogwild! can outperform the round-robin scheme in (Zinkevich, Smola, and Langford 2009). However, Hogwild! can only achieve a sub-linear convergence rate. Hence, Hogwild! is not efficient (fast) enough to achieve satisfactory performance. PASSCoDe (Hsieh, Yu, and Dhillon 2015) and CoCoA (Jaggi et al. 2014) are also asynchronous approaches for parallel SGD. However, both of them are formulated from the dual coordinate descent (ascent) perspective, and hence they can only be used for problems whose dual functions can be computed. Although some works, such as Hogwild! and PASSCoDe, have empirically found that in some cases the lock-free strategies are much
more efficient than other strategies with locks, no works have theoretically proved the convergence of the lock-free strategies. For example, both Hogwild! and PASSCoDe are proved to be convergent under the assumption that there are some locks to guarantee that over-writing operations never happen. However, neither of them has provided a theoretical guarantee of convergence in the lock-free case.

In this paper, we propose a fast asynchronous parallel SGD method, called AsySVRG, by designing an asynchronous strategy to parallelize the recently proposed SGD variant called stochastic variance reduced gradient (SVRG) (Johnson and Zhang 2013). The contributions of AsySVRG can be outlined as follows:

• AsySVRG is a lock-free asynchronous parallel approach, which means that there is no read lock or update (write) lock in AsySVRG. Hence, AsySVRG can be expected to be more efficient than other approaches with locks.

• Although AsySVRG is lock-free, we can still theoretically prove that it is convergent with a linear convergence rate[1], which is faster than that of Hogwild!.

• The implementation of AsySVRG is simple.

• Empirical results on real datasets show that AsySVRG can outperform Hogwild! in terms of convergence rate and computation cost.

Preliminary

We use $f(w)$ to denote the objective function in (1), which means $f(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w)$. In this paper, we use $\|\cdot\|$ to denote the L2-norm $\|\cdot\|_2$ and $w_*$ to denote the optimal solution of the objective function.

Assumption 1. Each function $f_i(\cdot)$ ($i = 1, \ldots, n$) in (1) is convex and L-smooth, which means that $\exists L > 0$, $\forall a, b$,

$$f_i(a) \leq f_i(b) + \nabla f_i(b)^T (a - b) + \frac{L}{2}\|a - b\|^2,$$

or equivalently

$$\|\nabla f_i(a) - \nabla f_i(b)\| \leq L\|a - b\|,$$

where $\nabla f_i(\cdot)$ denotes the gradient of $f_i(\cdot)$.

Assumption 2. The objective function $f(\cdot)$ is $\mu$-strongly convex, which means that $\exists \mu > 0$, $\forall a, b$,

$$f(a) \geq f(b) + \nabla f(b)^T (a - b) + \frac{\mu}{2}\|a - b\|^2,$$

or equivalently

$$\|\nabla f(a) - \nabla f(b)\| \geq \mu\|a - b\|.$$

Please note that Assumptions 1 and 2 are often satisfied by most objective functions in machine learning models, such as logistic regression (LR) and linear regression with L2-norm regularization.

[1] In our early work, the parallel SVRG algorithms with locks have been proved to be convergent (Zhao and Li 2015).

Approach

Assume that we have p threads (processors) which can access a shared memory, and $w$ is stored in the shared memory. Furthermore, we assume each thread has access to the shared data structure for the vector $w$ and can randomly choose any instance $i$ to compute the gradient $\nabla f_i(w)$.

Our AsySVRG algorithm is presented in Algorithm 1. We can find that in the $t$-th iteration, each thread completes the following operations:

• By using a temporary variable $u_0$ to store $w_t$ (i.e., $u_0 = w_t$), all threads parallelly compute the full gradient $\nabla f(u_0) = \frac{1}{n}\sum_{i=1}^{n} \nabla f_i(u_0) = \frac{1}{n}\sum_{i=1}^{n} \nabla f_i(w_t)$. Assume the gradients computed by thread $a$ are denoted by $\phi_a$, which is a subset of $\{\nabla f_i(w_t)\,|\,i = 1, \ldots, n\}$. We have $\phi_a \cap \phi_b = \emptyset$ if $a \neq b$, and $\bigcup_{a=1}^{p} \phi_a = \{\nabla f_i(w_t)\,|\,i = 1, \ldots, n\}$.

• Then each thread parallelly runs an inner loop, in each iteration of which the thread reads the current value of $u$, denoted as $\hat{u}$, from the shared memory and randomly chooses an instance indexed by $i$ to compute the vector

$$\hat{v} = \nabla f_i(\hat{u}) - \nabla f_i(u_0) + \nabla f(u_0), \qquad (2)$$

and then updates the vector

$$u \leftarrow u - \eta\hat{v},$$

where $\eta > 0$ is a step size (also called the learning rate).

Here, we use $w$ to denote the parameter in the outer loop and $u$ to denote the parameter in the inner loop.
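To make the inner-loop computation in (2) concrete, here is a minimal sequential sketch of one such step for the regularized logistic loss defined in the Introduction; the names grad_i and svrg_step are ours, and a real run would execute many such steps per thread as in Algorithm 1 below.

```python
import numpy as np

def grad_i(w, x_i, y_i, lam):
    # Gradient of f_i(w) = log(1 + exp(-y_i * x_i^T w)) + (lam/2) * ||w||^2
    sigma = 1.0 / (1.0 + np.exp(y_i * x_i.dot(w)))
    return -y_i * sigma * x_i + lam * w

def svrg_step(u, u0, full_grad_u0, X, y, lam, eta, rng):
    # One inner-loop step following (2):
    #   v_hat = grad f_i(u_hat) - grad f_i(u0) + grad f(u0),  then  u <- u - eta * v_hat
    i = rng.integers(len(y))
    v_hat = grad_i(u, X[i], y[i], lam) - grad_i(u0, X[i], y[i], lam) + full_grad_u0
    return u - eta * v_hat

# Usage sketch: rng = np.random.default_rng(); u = svrg_step(u, u0, g0, X, y, lam, eta, rng)
```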
Before running the inner-loops, u will be initialized by the current value of wt in the shared memory. After all the threads have completed the inner-loops, we take wt+1 to be the current value of u in the shared memory. Algorithm 1 will degenerate to stochastic variance reduced gradient (SVRG) (Johnson and Zhang 2013) if there exists only one thread (i.e., p = 1). Hence, Algorithm 1 is a Algorithm 1 AsySVRG Initialization: p threads, initialize w0, η; for t = 0, 1, 2, ... do u0 = wt; All threads parallelly compute the full gradient ∇f(u0) = 1 n Pn i=1 ∇fi(u0); u = wt; For each thread, do: for m = 1 to M do Read current value of u, denoted as uˆ, from the shared memory. And randomly pick up an i from {1, . . . , n}; Compute the update vector: vˆ = ∇fi(uˆ) − ∇fi(u0) + ∇f(u0); u ← u − ηvˆ; end for Take wt+1 to be the current value of u in the shared memory; end for
Algorithm 1 will degenerate to stochastic variance reduced gradient (SVRG) (Johnson and Zhang 2013) if there exists only one thread (i.e., p = 1). Hence, Algorithm 1 is a parallel version of SVRG. Furthermore, in Algorithm 1, all threads read and write the shared memory without any locks. Hence, Algorithm 1 is a lock-free asynchronous approach to parallelize SVRG, which is called AsySVRG in this paper.

An Equivalent Synthetic Process

In the lock-free case, we do not use any lock whenever one thread reads $u$ from the shared memory or updates (writes) $u$ in the shared memory. Hence, the results of $u$ seem to be totally disordered, which makes the convergence analysis very difficult.

In this paper, we find that there exists a synthetic process to generate the final value of $u$ after all threads have completed their updates in the inner loop of Algorithm 1. It means that we can generate a sequence of synthetic values of $u$ with some order to get the final $u$, based on which we can prove the convergence of the lock-free AsySVRG in Algorithm 1.

Synthetic Write (Update) Sequence

The key step in the inner loop of Algorithm 1 is $u \leftarrow u - \eta\hat{v}$, which can be rewritten as follows:

$$u \leftarrow u + \Delta, \qquad (3)$$

where $\Delta = -\eta\hat{v}$ is the update vector computed by each thread.

Let $u = (u^{(1)}, u^{(2)}, \ldots, u^{(d)})$ with $u^{(i)}$ denoting the $i$-th element of $u$. Since each thread has a local count, we use $\Delta_{i,j}$ to denote the $j$-th update vector produced by the $i$-th thread ($i = 1, 2, \ldots, p$, $j = 1, 2, \ldots, M$), and $t^{(k)}_{i,j}$ to denote the time that $u^{(k)}$ is changed by $\Delta^{(k)}_{i,j}$. Without loss of generality, we assume all threads update the elements of $u$ in the order from $u^{(1)}$ to $u^{(d)}$, which can be easily implemented by the program. Hence, we have $\forall i, j$, $t^{(1)}_{i,j} < t^{(2)}_{i,j} < \cdots < t^{(d)}_{i,j}$. We sort these update vectors by $\{t^{(1)}_{i,j}\}$ in increasing order and use $\Delta_0, \Delta_1, \ldots, \Delta_m, \ldots, \Delta_{\tilde{M}-1}$, where $\tilde{M} = p \times M$, to denote the corresponding update vectors. Sorting by $\{t^{(1)}_{i,j}\}$ is more useful than sorting by $\{t^{(d)}_{i,j}\}$ because before $t^{(1)}_{i,j}$, the update vector $\Delta_{i,j}$ has not been applied to $u$ at all. Furthermore, we will benefit from such a sort when we discuss what would be read by one thread.

Since we do not use any lock, over-writing may happen when one thread is performing the update in (3). The real update vector can be written as $B_m\Delta_m$, where $B_m$ is a diagonal matrix whose diagonal entries are 0 or 1. If $B_m(k, k) = 0$, then $\Delta^{(k)}_m$ is over-written by other threads. If $B_m(k, k) = 1$, then $u^{(k)}$ is updated by $\Delta^{(k)}_m$ successfully without over-writing. However, we do not know the exact $B_m$; it can be seen as a random variable. Then (3) can be rewritten as

$$u \leftarrow u + B_m\Delta_m.$$

When all the inner loops of all threads are completed, we can get the current $u$ in the shared memory, which can be presented as

$$u = u_0 + \sum_{i=0}^{\tilde{M}-1} B_i\Delta_i. \qquad (6)$$

According to (6) and the definition of $\Delta_m$, we can define a synthetic write (update) sequence $\{u_m\}$ with $u_0 = w_t$, and for $m = 1, \ldots, \tilde{M}$,

$$u_m = u_0 + \sum_{i=0}^{m-1} B_i\Delta_i. \qquad (7)$$

It is easy to see that $u_{m+1} = u_m + B_m\Delta_m$. We can also find that $u_m$ is the value which can be got after the update vectors $\Delta_0, \ldots, \Delta_{m-1}$ have been completely applied to $u$ in the shared memory.

Please note that the sequence $\{u_m\}$ is synthetic, and the whole $u_m = (u^{(1)}_m, u^{(2)}_m, \ldots, u^{(d)}_m)$ may never occur in the shared memory, which means that we cannot obtain any $u_m$ ($m = 1, 2, \ldots, \tilde{M}-1$) or the average of these $u_m$ during the running of the inner loop. What we can get is only the final value $u_{\tilde{M}}$ after all threads have completed their updates in the inner loop of Algorithm 1. Hence, we can find an equivalent synthetic update process with some order to generate the same value as that of the disordered lock-free update process.

Read Sequence

We use $\hat{u}_m$ to denote the parameter read from the shared memory which is used to compute $\Delta_m$ by a thread. Based on the synthetic write sequence $\{u_m\}$, $\hat{u}_m$ can be written as $\hat{u}_m = u_{a(m)} + \sum_{i=a(m)}^{b(m)} P_{m,i-a(m)}\Delta_i$.
Here, $a(m) < m$ denotes some time point when the update vectors $\Delta_0, \ldots, \Delta_{a(m)-1}$ have been completely applied to $u$ in the shared memory. $\{P_{m,i-a(m)}\}$ are diagonal matrices whose diagonal entries are 0 or 1, and $b(m) \geq a(m)$ denotes another time point. The term $\sum_{i=a(m)}^{b(m)} P_{m,i-a(m)}\Delta_i$ means that besides $u_{a(m)}$, $\hat{u}_m$ also includes some other new update vectors from time point $a(m)$ to $b(m)$. According to the definition of $\Delta_i$, all $\{\Delta_i\}$ ($i \geq m$) have not been applied to $u$ at the time point when one thread gets $\hat{u}_m$. Then, we can set $b(m) < m$. Hence, we can reformulate $\hat{u}_m$ as

$$\hat{u}_m = u_{a(m)} + \sum_{i=a(m)}^{m-1} P_{m,i-a(m)}\Delta_i. \qquad (8)$$

The matrix $P_{m,i-a(m)}$ means that $\hat{u}_m$ may read some components of these new $\{\Delta_i\}$ ($a(m) \leq i < m$), including those which might be over-written by some other threads. It can also be seen as a random variable. $P_{m,i-a(m)}$ and $B_i$ are not necessarily equal to each other.
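The following NumPy sketch instantiates the synthetic write sequence (7) and one possible inconsistent read (8), using the same numbers as the hand-worked example in the next subsection; the code itself is ours.

```python
import numpy as np

# Two update vectors and their write masks (1 = written successfully, 0 = over-written).
deltas = [np.array([0.1, 0.1]), np.array([1.0, 1.0])]
B = [np.diag([0, 1]), np.diag([1, 1])]

u0 = np.array([1.0, 1.0])

# Synthetic write sequence (7): u_m = u_0 + sum_{i<m} B_i Delta_i.
u = [u0]
for m in range(len(deltas)):
    u.append(u[m] + B[m] @ deltas[m])
print(u)          # [array([1., 1.]), array([1. , 1.1]), array([2. , 2.1])]

# One possible inconsistent read (8): u_hat = u_{a(m)} + sum P_{m,i-a(m)} Delta_i,
# here with a(m) = 0 and masks P exposing only some freshly written components.
P = [np.diag([1, 1]), np.diag([0, 1])]
u_hat = u0 + P[0] @ deltas[0] + P[1] @ deltas[1]
print(u_hat)      # [1.1 2.1], a value that may never exist in shared memory as a whole
```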
An Illustrative Example

We give a lock-free example to illustrate the above synthetic process. Assume $u = (1, 1)$ is stored in the shared memory. There are three threads, which we denote by Ta, Tb and Tc. The task of Ta is to add 1 to each component of $u$, the task of Tb is to add 0.1 to each component of $u$, and the task of Tc is to read $u$ from the shared memory. If the final result is $u = (2, 2.1)$, one of the possible update sequences of $u$ can be presented as follows:

Time   u
0      (1, 1)
1      (1.1, 1)
2      (2, 1)
3      (2, 2)
4      (2, 2.1)

It is easy to see that over-writing happens at Time 2. At Time 1, Tb is faster than Ta. At Time 3, Ta is faster than Tb. Then we can define

$$u_0 = (1, 1), \quad u_1 = u_0 + B_0\Delta_0 = (1, 1.1), \quad u_2 = u_1 + B_1\Delta_1 = (2, 2.1),$$

where $\Delta_0 = (0.1, 0.1)^T$, $\Delta_1 = (1, 1)^T$, and

$$B_0 = \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}, \qquad B_1 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.$$

We can find that the whole $u_1 = (1, 1.1)$ never occurs in the shared memory, which means that it is synthetic. The result read by Tc can be the $u$ at Time 0, Time 1, ..., Time 4, and even some other value such as $(1.1, 2.1)$. The $u$ at Time 1 can also be read by Tc although the $u^{(1)}$ written by Tb at Time 1 was over-written by Ta. It is easy to verify that any result of Tc can be presented in the format of (8). For example, if Tc reads $u = (1.1, 2.1)$, then the read value can be written as

$$(1.1, 2.1) = u_0 + P_0\Delta_0 + P_1\Delta_1, \quad \text{where} \quad P_0 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad P_1 = \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}.$$

Convergence Analysis

In this section, we focus on the convergence analysis for the lock-free AsySVRG. Let $\hat{v}_{i,j}$ denote the $j$-th stochastic gradient produced by the $i$-th thread ($i = 1, 2, \ldots, p$, $j = 1, 2, \ldots, M$); then $\Delta_{i,j} = -\eta\hat{v}_{i,j}$. We can also give an order to these $\hat{v}_{i,j}$ and define a synthetic sequence $\{u_m\}$ according to the discussion in the above section:

$$u_0 = w_t, \qquad u_{m+1} = u_m - \eta B_m\hat{v}_m. \qquad (9)$$

In (9), the stochastic gradient $\hat{v}_m = \nabla f_{i_m}(\hat{u}_m) - \nabla f_{i_m}(u_0) + \nabla f(u_0)$, where $\hat{u}_m$ is read from the shared memory by the thread which computes $\hat{v}_m$, and $i_m$ is the random index of the instance chosen by this thread. $B_m$ is a diagonal matrix whose diagonal entries are 0 or 1.

To get our convergence result, we give some assumptions and definitions.

Assumption 3. (Bounded delay assumption) $0 \leq m - a(m) \leq \tau$.

Assumption 4. The conditional expectation of $B_m$ on $u_m$ and $\hat{u}_m$ is strictly positive definite, i.e., $E[B_m\,|\,u_m, \hat{u}_m] = B \succ 0$ with minimum eigenvalue $\alpha > 0$.

Assumption 5. (Dependence assumption) $B_m$ and $i_m$ are conditionally independent given $u_m$ and $\hat{u}_m$, where $i_m$ is the index of the randomly chosen instance.

Assumption 3 means that when one thread gets $\hat{u}_m$, at least $\Delta_0, \Delta_1, \ldots, \Delta_{m-\tau-1}$ have been completely applied (updated) to $u$. The $\tau$ is a parameter for the bounded delay. In real applications, we cannot control the bounded delay since we do not have any lock. In our experiments of logistic regression, we find that we do not need the bounded delay. This phenomenon can be explained by the fact that our threads are stable and the process of computing the full gradient can be seen as a "delay", although such a "delay" is relatively large.

According to the definition of $u_m$ and $\hat{u}_m$, both of them are determined by the random variables $B_l, \Delta_l, i_l$ ($l \leq m - 1$) and $\{P_{m,j}\}$ ($0 \leq j < m - a(m)$). So we can take the conditional expectation of $B_m$ given $u_m$ and $\hat{u}_m$, which is $E[B_m\,|\,u_m, \hat{u}_m] = B$ in Assumption 4. Since $B_m$ is a diagonal positive semi-definite matrix, the expectation of $B_m$ is still a diagonal positive semi-definite matrix. Assumption 4 only needs $B$ to be a strictly positive definite matrix, which means that the minimum eigenvalue of $B$ is strictly positive.
According to Assumption 4, for each thread, after it has got a $\hat{u}$ from the shared memory, the probability that over-writing happens when updating the $k$-th component of $u$ is $1 - B(k, k)$. If one of the eigenvalues of $B$ is zero, it means that over-writing always happens on that component of $u$, which is not common. Hence, Assumption 4 is reasonable. In most modern hardware, $B(k, k)$ is close to 1. Moreover, if we use an atomic operation or an update lock, $B(k, k) = 1$. Assumption 5 is also reasonable since $B_m$ is highly affected by the hardware but $i_m$ is independent of the hardware.

Definition 1.

$$p_i(x) = \nabla f_i(x) - \nabla f_i(u_0) + \nabla f(u_0), \qquad (10)$$

$$q(x) = \frac{1}{n}\sum_{i=1}^{n}\|p_i(x)\|^2. \qquad (11)$$

According to (10) and (11), it is easy to prove that $E_i[\|p_i(x)\|^2] = q(x)$ and $\hat{v}_m = p_{i_m}(\hat{u}_m)$.

In the following, we prove the convergence of our algorithm. All the proofs can be found in the supplementary material[2].

[2] The supplementary material can be downloaded from http://cs.nju.edu.cn/lwj/paper/AsySVRG_sup.pdf
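As a quick sanity check of Definition 1, the sketch below computes $p_i(x)$ and $q(x)$ for a toy squared loss (our own choice, not the paper's experimental setting) and verifies the unbiasedness property $\frac{1}{n}\sum_i p_i(x) = \nabla f(x)$, which is used implicitly in the convergence analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
X, y = rng.normal(size=(n, d)), rng.normal(size=n)

def grad_i(w, i):
    # Gradient of a simple squared loss f_i(w) = 0.5 * (x_i^T w - y_i)^2.
    return (X[i].dot(w) - y[i]) * X[i]

def full_grad(w):
    return np.mean([grad_i(w, i) for i in range(n)], axis=0)

u0 = rng.normal(size=d)
x = rng.normal(size=d)

def p(i, x):
    # Definition 1, equation (10): p_i(x) = grad f_i(x) - grad f_i(u0) + grad f(u0).
    return grad_i(x, i) - grad_i(u0, i) + full_grad(u0)

q = np.mean([np.linalg.norm(p(i, x)) ** 2 for i in range(n)])   # equation (11)
assert np.allclose(np.mean([p(i, x) for i in range(n)], axis=0), full_grad(x))
```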
Lemma 1. $\forall x, y, r > 0, i$, we have

$$\|p_i(x)\|^2 - \|p_i(y)\|^2 \leq \frac{1}{r}\|p_i(x)\|^2 + rL^2\|x - y\|^2.$$

Lemma 2. For any constant $\rho > 1$, we have $Eq(\hat{u}_m) < \rho Eq(\hat{u}_{m+1})$ if we choose $r$ and $\eta$ to satisfy

$$\frac{1}{1 - \frac{1}{r} - 9r(\tau+1)L^2\eta^2} < \rho, \qquad \frac{1}{1 - \frac{1}{r} - \frac{9r(\tau+1)L^2\eta^2(\rho^{\tau+1}-1)}{\rho-1}} < \rho.$$

Lemma 3. With the assumption in Lemma 2 about $r, \rho, \eta$, we have $Eq(\hat{u}_m) < \rho Eq(u_m)$.

Theorem 1. With Assumptions 1, 2, 3, 4, 5, and taking $w_{t+1}$ to be the last one of $\{u_m\}$, we have

$$Ef(w_{t+1}) - f(w_*) \leq \Big(c_1^{\tilde{M}} + \frac{c_2}{1 - c_1}\Big)\big(Ef(w_t) - f(w_*)\big),$$

where $c_1 = 1 - \alpha\eta\mu + c_2$ and $c_2 = \eta^2\big(\frac{8\tau L^3\eta\rho^2(\rho^{\tau}-1)}{\rho-1} + 2L^2\rho\big)$, and $\tilde{M} = p \times M$ is the total number of iterations of the inner loop.

In Theorem 1, the constant $c_2 = O(\eta^2)$. Since $1 - c_1 = \alpha\eta\mu - c_2$, for a sufficiently small $\eta$ we have $1 - c_1 = O(\eta) > 0$ and hence $\frac{c_2}{1-c_1} = O(\eta)$. We can therefore choose $\eta$ such that $\frac{c_2}{1-c_1} = O(\eta) < 1$ and $c_1 < 1$. Hence, our algorithm gets a linear convergence rate with a lock-free strategy.

Experiment

We choose logistic regression (LR) with an L2-norm regularization term to evaluate our AsySVRG. Hence, $f(w)$ is defined as follows:

$$f(w) = \frac{1}{n}\sum_{i=1}^{n}\log(1 + e^{-y_i x_i^T w}) + \frac{\lambda}{2}\|w\|^2.$$

The experiments are conducted on a server with 12 Intel cores and 64G memory.

Dataset

Four datasets are used for evaluation: rcv1, real-sim, news20, and epsilon, which can be downloaded from the LibSVM website[3]. Detailed information is shown in Table 1, where sparse means the features have many zeros and dense means the features have few zeros. Since we only consider the speedup and convergence rate on the training data, we simply set the hyper-parameter $\lambda = 10^{-4}$ in $f(w)$ for all the experiments.

Table 1: Dataset
dataset     #instances   #features    memory   type
rcv1        20,242       47,236       36M      sparse
real-sim    72,309       20,958       90M      sparse
news20      19,996       1,355,191    140M     sparse
epsilon     400,000      2,000        11G      dense

[3] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

Baselines

Hogwild! and our AsySVRG are formulated from the primal perspective, but PASSCoDe and CoCoA are formulated from the dual perspective. Hence, the most related work to our AsySVRG is Hogwild!, which is chosen as the baseline for comparison. In particular, we compare AsySVRG with the following four variants:

• Hogwild!-lock: The lock version of Hogwild!, which needs a lock (write-lock) when each thread is going to update (write) the parameter.

• Hogwild!: The lock-free version of Hogwild!, which does not use any lock during the whole algorithm.

• AsySVRG-lock: The lock version of AsySVRG, which needs a lock (write-lock) when each thread is going to update (write) the parameter.

• AsySVRG: The lock-free version of AsySVRG, which does not use any lock during the whole algorithm.

Please note that we do not use any lock (read-lock) when each thread is reading the parameter from the shared memory, for AsySVRG and all the other baselines, which means that all the threads perform inconsistent reads.

We set M in Algorithm 1 to be $\frac{2n}{p}$, where $n$ is the number of training instances and $p$ is the number of threads. When $p = 1$, this setting of M is the same as that in SVRG (Johnson and Zhang 2013). According to our theorems, the step size should be small. However, we can also get good performance with a relatively large step size in practice. For Hogwild!, in each epoch, each thread runs $\frac{n}{p}$ iterations. We use a constant step size $\gamma$, and we set $\gamma \leftarrow 0.9\gamma$ after every epoch. These settings are the same as those in the experiments of Hogwild! (Recht et al. 2011).

Result

Convergence Rate  We get a suboptimal solution by stopping the algorithms when the gap between the training loss and the optimal value $\min\{f(w)\}$ is less than $10^{-4}$.
For each epoch, our algorithm visits the whole dataset three times while Hogwild! visits the whole dataset only once. To make a fair comparison of the convergence rate, we study the change of the objective function value versus the number of effective passes. One effective pass of the dataset means the whole dataset is visited once.

Figure 1 shows the convergence rate with respect to effective passes on the four datasets. Here, AsySVRG-lock-10 denotes AsySVRG-lock with 10 threads, and similar notations are used for the other variants of AsySVRG and Hogwild!. In particular, AsySVRG-1 is actually the original non-parallel version of SVRG (Johnson and Zhang 2013). We can find that the convergence rate of AsySVRG and its variants is almost linear (please note that the vertical axis is in a log scale), which is much faster than that of Hogwild! and its variants. Hence, the empirical results successfully verify the correctness of our theoretical results.

There also exists another interesting phenomenon: the curves of AsySVRG-1, AsySVRG-lock-10 and AsySVRG-10 are close to each other. The number of effective passes
actually measures the computation cost. Hence, these curves mean that the corresponding methods need almost the same computation cost to get the same objective function value. For example, AsySVRG-1 and AsySVRG-10 need the same computation cost on the news20 dataset according to the results in Figure 1. Hence, if there did not exist any other cost, AsySVRG-10 could be expected to be 10 times faster than AsySVRG-1 because AsySVRG-10 has 10 threads to parallelly (simultaneously) perform computation. However, AsySVRG-10 typically cannot be 10 times faster than AsySVRG-1 because there also exist other costs, such as inter-thread communication cost, in the multiple-thread case. Similarly, AsySVRG-lock-10 and AsySVRG-10 cannot have the same speed because AsySVRG-lock-10 needs extra lock cost compared with AsySVRG-10. This will be verified in the next section about speedup.

As mentioned above, besides the computation cost mainly reflected by the number of effective passes, there are other costs like communication cost and lock cost that affect the total CPU time in the multiple-thread case. We also compare the total CPU time between AsySVRG and Hogwild!, both run with 10 threads. Since different hardware would lead to different CPU time, we use relative time for comparison. More specifically, we take the time that Hogwild!-10 needs to get a sub-optimal solution with error $f(w) - f(w_*) < 0.01$ as one unit. Here, we use $f(w) - f(w_*) < 0.01$ rather than $f(w) - f(w_*) < 10^{-4}$ because Hogwild!-10 cannot achieve the accuracy of $f(w) - f(w_*) < 10^{-4}$ on some datasets. The convergence result with respect to total CPU time is shown in Figure 2. It is easy to see that AsySVRG has an almost linear convergence rate, which is much faster than Hogwild!.

[Figure 1: Convergence rate with respect to the number of effective passes (the vertical axis is in a log scale), with panels (a) rcv1, (b) realsim, (c) news20, (d) epsilon. Please note that in (a) and (b), the curves of AsySVRG-lock-10 and AsySVRG-10 overlap with each other.]

[Figure 2: Convergence rate with respect to CPU time (the vertical axis is in a log scale, and the horizontal axis is the ratio to the CPU time of Hogwild!-10 with the stopping condition f(w) − f(w∗) < 0.01), with panels (a) rcv1, (b) realsim, (c) news20, (d) epsilon.]

Speedup  We compare AsySVRG-lock with the lock-free AsySVRG in terms of speedup to demonstrate the advantage of the lock-free strategy. The definition of speedup is as follows:

$$\text{speedup} = \frac{\text{CPU time with 1 thread}}{\text{CPU time with } p \text{ threads}}.$$

Here, the stopping condition is $f(w) - f(w_*) < 10^{-4}$.
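As a sketch of how such a speedup number can be measured, the harness below times a user-supplied training routine at several thread counts; run_until and train_asysvrg are placeholder names of ours, and wall-clock time is used as a stand-in for the CPU time in the definition above.

```python
import time

def measure_speedup(run_until, thread_counts):
    # run_until(p) should train with p threads until f(w) - f(w*) < 1e-4 and then return.
    times = {}
    for p in thread_counts:
        start = time.perf_counter()
        run_until(p)
        times[p] = time.perf_counter() - start
    base = times[thread_counts[0]]          # time with 1 thread (first entry assumed to be 1)
    return {p: base / t for p, t in times.items()}

# Example: speedups = measure_speedup(lambda p: train_asysvrg(n_threads=p), [1, 2, 4, 8, 10])
```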
The speedup results are shown in Figure 3, where we can find that our lock-free AsySVRG achieves an almost linear speedup. However, the speedup of AsySVRG-lock is much worse than that of AsySVRG. The main reason is that besides the computation cost, AsySVRG-lock needs extra lock cost compared with AsySVRG.

Conclusion

In this paper, we have proposed a novel asynchronous parallel SGD method with a lock-free strategy, called AsySVRG, for multicore systems. Both theoretical and empirical results show that AsySVRG can outperform other state-of-the-art methods.

Future work will design asynchronous distributed SVRG methods on clusters of multiple machines by using techniques similar to those in AsySVRG.

Acknowledgements

This work is supported by the NSFC (No. 61472182), and the Fundamental Research Funds for the Central Universities (No. 20620140510).
[Figure 3: Speedup results, with panels (a) rcv1, (b) realsim, (c) news20, (d) epsilon; each panel plots speedup versus the number of threads for AsySVRG-lock and AsySVRG.]

References

Duchi, J. C.; Agarwal, A.; and Wainwright, M. J. 2010. Distributed dual averaging in networks. In Proceedings of the Advances in Neural Information Processing Systems, 550–558.
Duchi, J. C., and Singer, Y. 2009. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research 10:2899–2934.
Hsieh, C.; Yu, H.; and Dhillon, I. S. 2015. PASSCoDe: Parallel asynchronous stochastic dual co-ordinate descent. In Proceedings of the 32nd International Conference on Machine Learning, 2370–2379.
Jaggi, M.; Smith, V.; Takac, M.; Terhorst, J.; Krishnan, S.; Hofmann, T.; and Jordan, M. I. 2014. Communication-efficient distributed dual coordinate ascent. In Proceedings of the Advances in Neural Information Processing Systems, 3068–3076.
Johnson, R., and Zhang, T. 2013. Accelerating stochastic gradient descent using predictive variance reduction. In Proceedings of the Advances in Neural Information Processing Systems, 315–323.
Liu, J.; Wright, S. J.; Re, C.; Bittorf, V.; and Sridhar, S. 2014. An asynchronous parallel stochastic coordinate descent algorithm. In Proceedings of the 31st International Conference on Machine Learning, 469–477.
Mairal, J. 2013. Optimization with first-order surrogate functions. In Proceedings of the 30th International Conference on Machine Learning, 783–791.
Nitanda, A. 2014. Stochastic proximal gradient descent with acceleration techniques. In Proceedings of the Advances in Neural Information Processing Systems, 1574–1582.
Recht, B.; Re, C.; Wright, S.; and Niu, F. 2011. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. In Proceedings of the Advances in Neural Information Processing Systems, 693–701.
Roux, N. L.; Schmidt, M. W.; and Bach, F. 2012. A stochastic gradient method with an exponential convergence rate for finite training sets. In Proceedings of the Advances in Neural Information Processing Systems, 2672–2680.
Shalev-Shwartz, S., and Zhang, T. 2013. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research 14(1):567–599.
Xiao, L. 2009. Dual averaging method for regularized stochastic learning and online optimization. In Proceedings of the Advances in Neural Information Processing Systems, 2116–2124.
Zhang, R., and Kwok, J. T. 2014. Asynchronous distributed ADMM for consensus optimization. In Proceedings of the 31st International Conference on Machine Learning, 1701–1709.
Zhang, R.; Zheng, S.; and T. Kwok, J. 2015. Fast distributed asynchronous SGD with variance reduction. arXiv:1508.01633.
Zhao, S.-Y., and Li, W.-J. 2015. Fast asynchronous parallel stochastic gradient decent. arXiv:1508.05711.
Zinkevich, M.; Weimer, M.; Smola, A. J.; and Li, L. 2010. Parallelized stochastic gradient descent. In Proceedings of the Advances in Neural Information Processing Systems, 2595–2603.
Zinkevich, M.; Smola, A. J.; and Langford, J. 2009. Slow learners are fast. In Proceedings of the Advances in Neural Information Processing Systems, 2331–2339.
Appendix

Proof of Lemma 1

Proof.
\begin{align*}
\|p_i(x)\|^2 - \|p_i(y)\|^2
&\leq 2 p_i(x)^T \big(p_i(x) - p_i(y)\big) \\
&\leq \frac{1}{r}\|p_i(x)\|^2 + r\|p_i(x) - p_i(y)\|^2 \\
&= \frac{1}{r}\|p_i(x)\|^2 + r\|\nabla f_i(x) - \nabla f_i(y)\|^2 \\
&\leq \frac{1}{r}\|p_i(x)\|^2 + r L^2 \|x - y\|^2.
\end{align*}

Proof of Lemma 2

Proof. According to Lemma 1, we have
\[
\|p_l(\hat{u}_m)\|^2 - \|p_l(\hat{u}_{m+1})\|^2 \leq \frac{1}{r}\|p_l(\hat{u}_m)\|^2 + r L^2 \|\hat{u}_m - \hat{u}_{m+1}\|^2.
\]
Furthermore, we can get
\begin{align*}
\|\hat{u}_m - \hat{u}_{m+1}\|
&= \Big\| u_{a(m)} - \eta \sum_{i=a(m)}^{m-1} P_{m,i-a(m)} \hat{v}_i - \Big( u_{a(m+1)} - \eta \sum_{i=a(m+1)}^{m} P_{m+1,i-a(m+1)} \hat{v}_i \Big) \Big\| \\
&\leq \big\| u_{a(m)} - u_{a(m+1)} \big\| + \eta \sum_{i=a(m)}^{m-1} \|\hat{v}_i\| + \eta \sum_{i=a(m+1)}^{m} \|\hat{v}_i\| \\
&\leq \sum_{i=a(m)}^{a(m+1)-1} \|u_i - u_{i+1}\| + \eta \sum_{i=a(m)}^{m-1} \|\hat{v}_i\| + \eta \sum_{i=a(m+1)}^{m} \|\hat{v}_i\| \\
&\leq \eta \sum_{i=a(m)}^{a(m+1)-1} \|\hat{v}_i\| + \eta \sum_{i=a(m)}^{m-1} \|\hat{v}_i\| + \eta \sum_{i=a(m+1)}^{m} \|\hat{v}_i\| \\
&\leq 3\eta \sum_{i=m-\tau}^{m} \|\hat{v}_i\|.
\end{align*}
Then, we have
\[
\|p_l(\hat{u}_m)\|^2 - \|p_l(\hat{u}_{m+1})\|^2 \leq \frac{1}{r}\|p_l(\hat{u}_m)\|^2 + 9r(\tau+1)L^2\eta^2 \sum_{j=m-\tau}^{m} \|\hat{v}_j\|^2.
\]
For any fixed $l$, we can take expectation on both sides of the above inequality and then sum $l$ from $1$ to $n$, which gives
\[
\mathbb{E} q(\hat{u}_m) - \mathbb{E} q(\hat{u}_{m+1}) \leq \frac{1}{r}\mathbb{E} q(\hat{u}_m) + 9r(\tau+1)L^2\eta^2 \sum_{j=m-\tau}^{m} \mathbb{E} q(\hat{u}_j).
\]
Here, we use the fact that $\mathbb{E}[\|\hat{v}_j\|^2 \mid \hat{u}_j] = q(\hat{u}_j)$.

Then, we prove the conclusion by induction. For convenience, we use $q_i$ to denote $\mathbb{E} q(\hat{u}_i)$. When $m = 0$, we have
\[
q_0 \leq \frac{1}{1 - \frac{1}{r} - 9r(\tau+1)L^2\eta^2}\, q_1 \leq \rho q_1.
\]
Assume that $q_{m-1} \leq \rho q_m$ holds for all $m \leq M_0$. Then, for $m = M_0$, we have
\begin{align*}
q_m - q_{m+1} &\leq \frac{1}{r} q_m + 9r(\tau+1)L^2\eta^2 \sum_{j=m-\tau}^{m} q_j \\
&\leq \frac{1}{r} q_m + 9r(\tau+1)L^2\eta^2 q_m \sum_{j=m-\tau}^{m} \rho^{m-j} \\
&= \frac{1}{r} q_m + \frac{9r(\tau+1)L^2\eta^2(\rho^{\tau+1}-1)}{\rho-1}\, q_m,
\end{align*}
which means that
\[
q_m \leq \frac{1}{1 - \frac{1}{r} - \frac{9r(\tau+1)L^2\eta^2(\rho^{\tau+1}-1)}{\rho-1}}\, q_{m+1} < \rho q_{m+1}.
\]
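As a quick numerical sanity check of the elementary Young-type inequality behind the second step in the proof of Lemma 1 (which Lemmas 2 and 3 then invoke), the following standalone Python snippet, which is not part of the paper, verifies $2a^T(a-b) \leq \frac{1}{r}\|a\|^2 + r\|a-b\|^2$ on random vectors; the dimension, the number of trials, and the range of $r$ are arbitrary choices.

```python
# Sanity check (not part of the paper): the Young-type inequality used in Lemma 1,
#   2 a^T (a - b) <= (1/r) ||a||^2 + r ||a - b||^2   for any r > 0,
# with a and b playing the roles of p_i(x) and p_i(y).
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    a = rng.normal(size=10)        # stand-in for p_i(x)
    b = rng.normal(size=10)        # stand-in for p_i(y)
    r = rng.uniform(0.1, 10.0)     # arbitrary positive constant
    lhs = 2.0 * a @ (a - b)
    rhs = (1.0 / r) * (a @ a) + r * ((a - b) @ (a - b))
    assert lhs <= rhs + 1e-10, (lhs, rhs)
print("Young-type inequality held on 1000 random draws.")
```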
Proof of Lemma 3

Proof. According to Lemma 1, we have
\[
\|p_l(\hat{u}_m)\|^2 - \|p_l(u_m)\|^2 \leq \frac{1}{r}\|p_l(\hat{u}_m)\|^2 + r L^2 \|\hat{u}_m - u_m\|^2.
\]
Similar to the proof of Lemma 2, we can get
\begin{align*}
\|\hat{u}_m - u_m\|
&= \Big\| u_{a(m)} - \eta \sum_{i=a(m)}^{m-1} P_{m,i-a(m)} \hat{v}_i - u_m \Big\| \\
&\leq \big\| u_{a(m)} - u_m \big\| + \eta \sum_{i=a(m)}^{m-1} \|\hat{v}_i\| \\
&\leq \sum_{i=a(m)}^{m-1} \|u_i - u_{i+1}\| + \eta \sum_{i=a(m)}^{m-1} \|\hat{v}_i\| \\
&\leq \eta \sum_{i=a(m)}^{m-1} \|\hat{v}_i\| + \eta \sum_{i=a(m)}^{m-1} \|\hat{v}_i\| \\
&\leq 2\eta \sum_{i=a(m)}^{m-1} \|\hat{v}_i\|. \tag{12}
\end{align*}
Then, we have
\[
\|p_l(\hat{u}_m)\|^2 - \|p_l(u_m)\|^2 \leq \frac{1}{r}\|p_l(\hat{u}_m)\|^2 + 4r\tau L^2\eta^2 \sum_{j=a(m)}^{m-1} \|\hat{v}_j\|^2.
\]
For any fixed $l$, we can take expectation on both sides of the inequality and then sum $l$ from $1$ to $n$, which gives
\begin{align*}
\mathbb{E} q(\hat{u}_m) - \mathbb{E} q(u_m)
&\leq \frac{1}{r}\mathbb{E} q(\hat{u}_m) + 4r\tau L^2\eta^2 \sum_{j=a(m)}^{m-1} \mathbb{E} q(\hat{u}_j) \\
&\leq \frac{1}{r}\mathbb{E} q(\hat{u}_m) + 4r\tau L^2\eta^2\, \mathbb{E} q(\hat{u}_m) \sum_{j=a(m)}^{m-1} \rho^{m-j} \\
&\leq \frac{1}{r}\mathbb{E} q(\hat{u}_m) + \frac{4r\tau L^2\eta^2\rho(\rho^{\tau}-1)}{\rho-1}\, \mathbb{E} q(\hat{u}_m),
\end{align*}
which means that
\[
\mathbb{E} q(\hat{u}_m) \leq \frac{1}{1 - \frac{1}{r} - \frac{4r\tau L^2\eta^2\rho(\rho^{\tau}-1)}{\rho-1}}\, \mathbb{E} q(u_m) < \rho\, \mathbb{E} q(u_m).
\]

Proof of Lemma 4

Lemma 4. With Assumption 4, we have
\[
-\nabla f(u_m)^T B \nabla f(\hat{u}_m) \leq \frac{L^2}{2}\|u_m - \hat{u}_m\|^2 - \frac{\alpha}{2}\|\nabla f(u_m)\|^2.
\]

Proof.
\begin{align*}
\frac{1}{2}\big\|B^{\frac{1}{2}}\nabla f(u_m)\big\|^2 - \nabla f(u_m)^T B \nabla f(\hat{u}_m)
&\leq \frac{1}{2}\big\|B^{\frac{1}{2}}\nabla f(u_m)\big\|^2 - \nabla f(u_m)^T B \nabla f(\hat{u}_m) + \frac{1}{2}\big\|B^{\frac{1}{2}}\nabla f(\hat{u}_m)\big\|^2 \\
&= \frac{1}{2}\big\|B^{\frac{1}{2}}\big(\nabla f(u_m) - \nabla f(\hat{u}_m)\big)\big\|^2 \\
&\leq \frac{L^2}{2}\|u_m - \hat{u}_m\|^2.
\end{align*}
Rearranging and noting that $\|B^{\frac{1}{2}}\nabla f(u_m)\|^2 \geq \alpha \|\nabla f(u_m)\|^2$ under Assumption 4 yields the claimed inequality.

Proof of Theorem 1

Proof.
\begin{align*}
\mathbb{E}[f(u_{m+1}) \mid u_m, \hat{u}_m]
&= \mathbb{E}[f(u_m - \eta B_m \hat{v}_m) \mid u_m, \hat{u}_m] \\
&\leq \mathbb{E}[f(u_m) \mid u_m, \hat{u}_m] + \mathbb{E}[\nabla f(u_m)^T(-\eta B_m \hat{v}_m) \mid u_m, \hat{u}_m] + \frac{L\eta^2}{2}\mathbb{E}[\|B_m \hat{v}_m\|^2 \mid u_m, \hat{u}_m].
\end{align*}
Taking expectation on the above inequality, we have
\begin{align*}
\mathbb{E} f(u_{m+1}) - f(w_*)
&\leq \mathbb{E} f(u_m) - f(w_*) - \eta\, \mathbb{E}\big(\nabla f(u_m)^T B \nabla f(\hat{u}_m)\big) + \frac{L\eta^2}{2}\mathbb{E}\|\hat{v}_m\|^2 \\
&\leq \mathbb{E} f(u_m) - f(w_*) - \frac{\alpha\eta}{2}\mathbb{E}\|\nabla f(u_m)\|^2 + \frac{\eta L^2}{2}\mathbb{E}\|u_m - \hat{u}_m\|^2 + \frac{L\eta^2}{2}\mathbb{E}\|\hat{v}_m\|^2 \\
&\leq (1 - \alpha\eta\mu)\big(\mathbb{E} f(u_m) - f(w_*)\big) + \frac{\eta L^2}{2}\mathbb{E}\|u_m - \hat{u}_m\|^2 + \frac{L\eta^2}{2}\mathbb{E}\|\hat{v}_m\|^2 \\
&\leq (1 - \alpha\eta\mu)\big(\mathbb{E} f(u_m) - f(w_*)\big) + \frac{2\tau L^2\eta^3\rho^2(\rho^{\tau}-1)}{\rho-1}\mathbb{E} q(u_m) + \frac{L\rho\eta^2}{2}\mathbb{E} q(u_m) \\
&\leq c_1\big(\mathbb{E} f(u_m) - f(w_*)\big) + c_2\big(\mathbb{E} f(u_0) - f(w_*)\big),
\end{align*}
where $c_1 = 1 - \alpha\eta\mu + c_2$ and $c_2 = \eta^2\Big(\frac{8\tau L^3\eta\rho^2(\rho^{\tau}-1)}{\rho-1} + 2L^2\rho\Big)$. Here, the first inequality uses Assumption 5. The second inequality uses Lemma 4. The third inequality uses Assumption 2. The fourth inequality uses Lemma 3 and (12). The last inequality uses the fact that
\[
q(u_m) \leq 4L\big(f(u_m) - f(w_*) + f(u_0) - f(w_*)\big),
\]
which has been proved in (Johnson and Zhang 2013).

From the above inequality, we can get
\[
\mathbb{E} f(u_m) - f(w_*) \leq \Big(c_1^m + \frac{c_2}{1 - c_1}\Big)\big(\mathbb{E} f(u_0) - f(w_*)\big),
\]
which means that
\[
\mathbb{E} f(w_{t+1}) - f(w_*) \leq \Big(c_1^{\tilde{M}} + \frac{c_2}{1 - c_1}\Big)\big(\mathbb{E} f(w_t) - f(w_*)\big),
\]
where $w_{t+1} = u_{\tilde{M}}$, $w_t = u_0$, and $\tilde{M} = p \times M$ is the total number of iterations of the inner loop.
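For intuition, the final recursion $\mathbb{E} f(u_{m+1}) - f(w_*) \leq c_1(\mathbb{E} f(u_m) - f(w_*)) + c_2(\mathbb{E} f(u_0) - f(w_*))$ can be checked numerically. The sketch below is not part of the paper and uses hypothetical values of $c_1$ and $c_2$; it simply iterates the worst case of the recursion and confirms that it stays below the closed-form bound $(c_1^m + c_2/(1-c_1))(\mathbb{E} f(u_0) - f(w_*))$ stated above.

```python
# Illustration (hypothetical constants, not the paper's tuned values) of the
# recursion from the proof of Theorem 1:
#   e_{m+1} <= c1 * e_m + c2 * e_0,  with 0 < c1 < 1,
# which unrolls to  e_m <= (c1**m + c2 / (1 - c1)) * e_0.
c1, c2 = 0.9, 0.01   # hypothetical values satisfying 0 < c1 < 1
e0 = 1.0             # stands for Ef(u_0) - f(w_*)
e = e0
for m in range(1, 101):
    e = c1 * e + c2 * e0                        # worst case of the recursion
    bound = (c1 ** m + c2 / (1.0 - c1)) * e0    # closed-form bound from the proof
    assert e <= bound + 1e-12
print(f"after 100 inner iterations: e = {e:.4f}, bound = {bound:.4f}")
```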