Parallel and Distributed Stochastic Learning
- Towards Scalable Learning for Big Data Intelligence

Wu-Jun Li (http://cs.nju.edu.cn/lwj)
LAMDA Group
Department of Computer Science and Technology, Nanjing University
National Key Laboratory for Novel Software Technology
liwujun@nju.edu.cn

Dec 10, 2016
Outline

  1 Introduction
  2 AsySVRG
  3 SCOPE
  4 Conclusion
Machine Learning

- Supervised Learning: Given a set of training examples $\{(x_i, y_i)\}_{i=1}^{n}$, supervised learning tries to solve the following regularized empirical risk minimization problem:

      $\min_{w} f(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w)$,

  where $f_i(w)$ is the loss function (plus some regularization term) defined on example $i$, and $w$ is the parameter to learn.

  Examples:
    - Logistic regression: $f(w) = \frac{1}{n} \sum_{i=1}^{n} \left[\log(1 + e^{-y_i x_i^T w}) + \frac{\lambda}{2} \|w\|^2\right]$
    - SVM: $f(w) = \frac{1}{n} \sum_{i=1}^{n} \left[\max\{0, 1 - y_i x_i^T w\} + \frac{\lambda}{2} \|w\|^2\right]$
    - Deep learning models

- Unsupervised Learning: Many unsupervised learning models, such as PCA and matrix factorization, can also be reformulated as similar problems.
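To make the finite-sum form concrete, here is a minimal sketch (not from the slides; the function name, the dense NumPy matrix X with rows $x_i$, labels y in {-1, +1}, and the parameter lam are assumptions for illustration) that evaluates the regularized logistic regression objective above together with its gradient:

```python
import numpy as np

def logistic_objective_and_grad(w, X, y, lam):
    # f(w) = (1/n) * sum_i [ log(1 + exp(-y_i x_i^T w)) + (lam/2) ||w||^2 ]
    n = X.shape[0]
    margins = y * (X @ w)                       # y_i * x_i^T w for each example
    loss = np.mean(np.logaddexp(0.0, -margins)) + 0.5 * lam * np.dot(w, w)
    # d/dw log(1 + exp(-m_i)) = -y_i * x_i * sigmoid(-m_i); the regularizer adds lam * w
    sig = 1.0 / (1.0 + np.exp(margins))         # sigmoid(-m_i)
    grad = -(X.T @ (y * sig)) / n + lam * w
    return loss, grad
```

A gradient like this, restricted to a single example, is exactly the $\nabla f_i$ that the stochastic methods on the following slides evaluate.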
Machine Learning for Big Data

For big data applications, first-order methods have become much more popular than higher-order methods for learning (optimization). Gradient descent methods are the most representative first-order methods.

- (Deterministic) gradient descent (GD):

      $w_{t+1} \leftarrow w_t - \eta_t \left[\frac{1}{n} \sum_{i=1}^{n} \nabla f_i(w_t)\right]$,

  where $t$ is the iteration number.
    - Linear convergence rate: $O(\rho^t)$
    - Iteration cost is $O(n)$

- Stochastic gradient descent (SGD): In the $t$-th iteration, randomly choose an example $i_t \in \{1, 2, \ldots, n\}$, then update

      $w_{t+1} \leftarrow w_t - \eta_t \nabla f_{i_t}(w_t)$

    - Iteration cost is $O(1)$
    - The convergence rate is sublinear: $O(1/t)$
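The two update rules above differ only in how much data each step touches. A minimal sketch, assuming a hypothetical per-example gradient function grad_fn(w, x_i, y_i) (for instance a single-example version of the logistic gradient sketched earlier):

```python
import numpy as np

def gd_step(w, X, y, eta, grad_fn):
    # Deterministic GD: average the gradient over all n examples, O(n) per step.
    n = X.shape[0]
    g = sum(grad_fn(w, X[i], y[i]) for i in range(n)) / n
    return w - eta * g

def sgd_step(w, X, y, eta, grad_fn, rng):
    # SGD: gradient of one randomly chosen example i_t, O(1) per step.
    i = rng.integers(X.shape[0])
    return w - eta * grad_fn(w, X[i], y[i])
```

Here rng is a NumPy random generator, e.g. np.random.default_rng().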
Stochastic Learning for Big Data

Researchers recently proposed improved versions of SGD: SAG [Roux et al., NIPS 2012], SDCA [Shalev-Shwartz and Zhang, JMLR 2013], SVRG [Johnson and Zhang, NIPS 2013].

Number of gradient ($\nabla f_i$) evaluations needed to reach accuracy $\epsilon$ for smooth and strongly convex problems:
  - GD: $O(n\kappa \log(1/\epsilon))$
  - SGD: $O(\kappa/\epsilon)$
  - SAG: $O(n \log(1/\epsilon))$ when $n \geq 8\kappa$
  - SDCA: $O((n + \kappa) \log(1/\epsilon))$
  - SVRG: $O((n + \kappa) \log(1/\epsilon))$

where $\kappa = L/\mu > 1$ is the condition number of the objective function.

Stochastic Learning:
  - Stochastic GD
  - Stochastic coordinate descent
  - Stochastic dual coordinate ascent
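SVRG reaches the $O((n + \kappa)\log(1/\epsilon))$ rate by correcting each stochastic gradient with a periodically recomputed full gradient. A minimal serial sketch, assuming hypothetical helpers grad_fn(w, i) (per-example gradient) and full_grad_fn(w) (full gradient), and using the "last iterate" snapshot option that also appears in the AsySVRG pseudocode later in this deck:

```python
import numpy as np

def svrg(grad_fn, full_grad_fn, w0, n, eta, outer_iters, M, rng):
    # Each outer iteration: one O(n) full gradient, then M cheap O(1) corrected steps.
    w = w0.copy()
    for _ in range(outer_iters):
        u0 = w.copy()
        mu = full_grad_fn(u0)            # full gradient at the snapshot u0
        u = u0.copy()
        for _ in range(M):
            i = rng.integers(n)
            # Variance-reduced estimate: unbiased, with variance shrinking as u, u0 approach the optimum.
            v = grad_fn(u, i) - grad_fn(u0, i) + mu
            u = u - eta * v
        w = u                            # take the last inner iterate as the new snapshot
    return w
```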
Parallel and Distributed Stochastic Learning

To further improve the learning scalability (speed):
  - Parallel stochastic learning: one machine with multiple cores and a shared memory
  - Distributed stochastic learning: a cluster with multiple machines

Key issue: cooperation
  - Parallel stochastic learning:
    lock vs. lock-free: waiting cost and lock cost
  - Distributed stochastic learning:
    synchronous vs. asynchronous: waiting cost and communication cost
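To illustrate the shared-memory cooperation issue, the toy sketch below contrasts a lock-based write with a lock-free write to a shared parameter vector. Python threads are used only to show the two styles (because of the GIL this is not a performance-faithful simulation), and the vector w, its dimension, and the update delta are made up for illustration:

```python
import threading
import numpy as np

w = np.zeros(1000)            # parameter vector shared by all worker threads
lock = threading.Lock()

def locked_update(delta):
    # Lock-based: writers serialize on the lock, so updates never collide,
    # but every thread pays waiting cost and lock cost.
    with lock:
        np.subtract(w, delta, out=w)

def lock_free_update(delta):
    # Lock-free: writers touch w without any coordination; there is no waiting,
    # but concurrent writes to the same coordinates can overwrite each other,
    # which is exactly what a lock-free convergence analysis has to account for.
    np.subtract(w, delta, out=w)
```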
Our Contributions

- Parallel stochastic learning: AsySVRG
  Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee

- Distributed stochastic learning: SCOPE
  Scalable Composite Optimization for Learning
Outline

  1 Introduction
  2 AsySVRG
  3 SCOPE
  4 Conclusion
Motivation and Contribution

Motivation:
  - Existing asynchronous parallel SGD: Hogwild! [Recht et al. 2011] and PASSCoDe [Hsieh, Yu, and Dhillon 2015]
  - No parallel methods for SVRG
  - Lock-free: empirically effective, but no theoretical proof

Contribution:
  - A fast asynchronous method to parallelize SVRG, called AsySVRG
  - A lock-free parallel strategy for both read and write
  - Linear convergence rate with theoretical proof
  - Outperforms Hogwild! in experiments

AsySVRG is the first lock-free parallel SGD method with theoretical proof of convergence.
AsySVRG: a multi-thread version of SVRG

Initialization: $p$ threads, initialize $w_0$, $\eta$;
for $t = 0, 1, 2, \ldots$ do
    $u_0 = w_t$;
    All threads compute the full gradient $\nabla f(u_0) = \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(u_0)$ in parallel;
    $u = w_t$;
    For each thread, do:
        for $m = 1$ to $M$ do
            Read the current value of $u$, denoted $\hat{u}$, from the shared memory,
            and randomly pick an $i$ from $\{1, \ldots, n\}$;
            Compute the update vector: $\hat{v} = \nabla f_i(\hat{u}) - \nabla f_i(u_0) + \nabla f(u_0)$;
            $u \leftarrow u - \eta \hat{v}$;
        end for
    Take $w_{t+1}$ to be the current value of $u$ in the shared memory;
end for
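A minimal runnable sketch of this loop, again with hypothetical helpers: grad_fn(w, x_i, y_i) is an assumed per-example gradient (e.g. a single-example logistic gradient as sketched earlier), and Python threads with NumPy stand in for the p worker threads, so this only illustrates the control flow and the lock-free read/write pattern, not real shared-memory performance:

```python
import threading
import numpy as np

def asysvrg(grad_fn, X, y, w0, eta, p, outer_iters, M, seed=0):
    # p threads repeatedly: read u (possibly inconsistently), form the
    # variance-reduced update, and write back to the shared u without any lock.
    n = X.shape[0]
    w = w0.copy()
    for t in range(outer_iters):
        u0 = w.copy()
        full_grad = sum(grad_fn(u0, X[i], y[i]) for i in range(n)) / n
        u = w.copy()                            # shared iterate for this outer iteration

        def worker(thread_id):
            rng = np.random.default_rng(seed + t * p + thread_id)
            for _ in range(M):
                u_hat = u.copy()                # lock-free (possibly inconsistent) read
                i = rng.integers(n)
                v_hat = grad_fn(u_hat, X[i], y[i]) - grad_fn(u0, X[i], y[i]) + full_grad
                np.subtract(u, eta * v_hat, out=u)   # lock-free in-place write

        threads = [threading.Thread(target=worker, args=(k,)) for k in range(p)]
        for th in threads:
            th.start()
        for th in threads:
            th.join()
        w = u.copy()                            # w_{t+1}: current value of u in shared memory
    return w
```

Passing a single-example logistic gradient as grad_fn turns this into the multi-threaded counterpart of the serial SVRG sketch shown earlier.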