SCOPE: Scalable Composite Optimization for Learning on Spark

Shen-Yi Zhao, Ru Xiang, Ying-Hao Shi, Peng Gao and Wu-Jun Li
National Key Laboratory for Novel Software Technology
Department of Computer Science and Technology, Nanjing University, China
{zhaosy,xiangr,shiyh,gaop}@lamda.nju.edu.cn, liwujun@nju.edu.cn

Copyright © 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Many machine learning models, such as logistic regression (LR) and support vector machine (SVM), can be formulated as composite optimization problems. Recently, many distributed stochastic optimization (DSO) methods have been proposed to solve large-scale composite optimization problems, and they have shown better performance than traditional batch methods. However, most of these DSO methods might not be scalable enough. In this paper, we propose a novel DSO method, called scalable composite optimization for learning (SCOPE), and implement it on the fault-tolerant distributed platform Spark. SCOPE is both computation-efficient and communication-efficient. Theoretical analysis shows that SCOPE is convergent with a linear convergence rate when the objective function is strongly convex. Furthermore, empirical results on real datasets show that SCOPE can outperform other state-of-the-art distributed learning methods on Spark, including both batch learning methods and DSO methods.

Introduction

Many machine learning models can be formulated as composite optimization problems, which have the following form with a finite sum of functions:

    min_{w∈R^d} P(w) = (1/n) ∑_{i=1}^n f_i(w),

where w is the parameter to learn (optimize), n is the number of training instances, and f_i(w) is the loss function on training instance i. For example, f_i(w) = log(1 + e^{−y_i x_i^T w}) + (λ/2)‖w‖² in logistic regression (LR), and f_i(w) = max{0, 1 − y_i x_i^T w} + (λ/2)‖w‖² in support vector machine (SVM), where λ is the regularization hyper-parameter and (x_i, y_i) is training instance i, with x_i ∈ R^d being the feature vector and y_i ∈ {+1, −1} being the class label. Other cases like matrix factorization and deep neural networks can also be written as similar forms of composite optimization.
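To make the composite form concrete, here is a small Scala sketch of the per-instance LR loss f_i(w) and its gradient ∇f_i(w), which are the only ingredients the algorithms below need from the model. It is an illustration written for this article (dense Array[Double] vectors, no attention to sparsity or performance), not code from the paper's implementation.

    object LogisticLoss {
      // f_i(w) = log(1 + exp(-y_i * x_i^T w)) + (lambda / 2) * ||w||^2
      def loss(w: Array[Double], x: Array[Double], y: Double, lambda: Double): Double = {
        val margin = y * dot(x, w)
        math.log1p(math.exp(-margin)) + 0.5 * lambda * dot(w, w)
      }

      // ∇f_i(w) = -y_i * x_i / (1 + exp(y_i * x_i^T w)) + lambda * w
      def grad(w: Array[Double], x: Array[Double], y: Double, lambda: Double): Array[Double] = {
        val coeff = -y / (1.0 + math.exp(y * dot(x, w)))
        Array.tabulate(w.length)(j => coeff * x(j) + lambda * w(j))
      }

      private def dot(a: Array[Double], b: Array[Double]): Double =
        a.zip(b).map { case (u, v) => u * v }.sum
    }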
Due to its efficiency and effectiveness, stochastic optimization (SO) has recently attracted much attention for solving composite optimization problems in machine learning (Xiao 2009; Bottou 2010; Duchi, Hazan, and Singer 2011; Schmidt, Roux, and Bach 2013; Johnson and Zhang 2013; Zhang, Mahdavi, and Jin 2013; Shalev-Shwartz and Zhang 2013; 2014; Lin, Lu, and Xiao 2014; Nitanda 2014). Existing SO methods can be divided into two categories. The first category is stochastic gradient descent (SGD) and its variants, such as stochastic average gradient (SAG) (Schmidt, Roux, and Bach 2013) and stochastic variance reduced gradient (SVRG) (Johnson and Zhang 2013), which perform optimization on the primal problem. The second category, such as stochastic dual coordinate ascent (SDCA) (Shalev-Shwartz and Zhang 2013), performs optimization with the dual formulation. Many advanced SO methods, such as SVRG and SDCA, are more efficient than traditional batch learning methods in both theory and practice for large-scale learning problems.

Most traditional SO methods are sequential, which means that the optimization procedure is not performed in parallel. However, with the increase of data scale, traditional sequential SO methods may not be efficient enough to handle large-scale datasets. Furthermore, in this big-data era, many large-scale datasets are distributively stored on a cluster of multiple machines. Traditional sequential SO methods cannot be directly used for these kinds of distributed datasets. To handle large-scale composite optimization problems, researchers have recently proposed several parallel SO (PSO) methods for multi-core systems and distributed SO (DSO) methods for clusters of multiple machines.

PSO methods perform SO on a single machine with multiple cores (multiple threads) and a shared memory. Typically, synchronous strategies with locks will be much slower than asynchronous ones. Hence, recent progress in PSO mainly focuses on designing asynchronous or lock-free optimization strategies (Recht et al. 2011; Liu et al. 2014; Hsieh, Yu, and Dhillon 2015; J. Reddi et al. 2015; Zhao and Li 2016).

DSO methods perform SO on clusters of multiple machines. DSO can be used to handle extremely large problems which are beyond the processing capability of a single machine. In many real applications, especially industrial applications, the datasets are typically distributively stored on clusters. Hence, DSO has recently become a hot research topic. Many DSO methods have been proposed, including distributed SGD methods based on the primal formulation and distributed methods based on the dual formulation. Representative distributed SGD methods include PSGD (Zinkevich et al. 2010), BAVGM (Zhang, Wainwright, and Duchi 2012) and Splash (Zhang and Jordan 2015). Representative distributed dual formulations include DisDCA (Yang 2013), CoCoA (Jaggi et al. 2014) and CoCoA+ (Ma et al. 2015). Many of these methods provide nice theoretical proofs of convergence and promising empirical evaluations. However, most of these DSO methods might not be scalable enough.

In this paper, we propose a novel DSO method, called scalable composite optimization for learning (SCOPE), and implement it on the fault-tolerant distributed platform Spark (Zaharia et al. 2010). SCOPE is both computation-efficient and communication-efficient. Empirical results on real datasets show that SCOPE can outperform other state-of-the-art distributed learning methods on Spark, including both batch learning methods and DSO methods, in terms of scalability.

Please note that some asynchronous methods or systems, such as Parameter Server (Li et al. 2014), Petuum (Xing et al. 2015) and the methods in (Zhang and Kwok 2014; Zhang, Zheng, and Kwok 2016), have also been proposed for distributed learning with promising performance. But these methods or systems cannot be easily implemented on Spark with the MapReduce programming model, which is actually a bulk synchronous parallel (BSP) model. Hence, asynchronous methods are not the focus of this paper. We will leave the design of an asynchronous version of SCOPE and the corresponding empirical comparison for future study.

SCOPE

Framework of SCOPE

SCOPE is based on a master-slave distributed framework, which is illustrated in Figure 1. More specifically, there is a master machine (called Master) and p (p ≥ 1) slave machines (called Workers) in the cluster. These Workers are called Worker_1, Worker_2, ..., and Worker_p, respectively.

[Figure 1: Distributed framework of SCOPE — the Master holds the parameter w and communicates with the Workers Worker_1, Worker_2, ..., Worker_p, each of which holds one data partition.]

Data Partition and Parameter Storage

• For Workers: The whole dataset D is distributively stored on all the Workers. More specifically, D is partitioned into p subsets, which are denoted as {D_1, D_2, ..., D_p} with D = ∪_{k=1}^p D_k. D_k is stored on Worker_k. The data stored on different Workers are different from each other, which means that if i ≠ j, then D_i ∩ D_j = ∅.

• For Master: The parameter w is stored on the Master, and the Master always keeps the newest version of w.

Different Workers cannot communicate with each other. This is similar to most existing distributed learning frameworks, such as MLlib (Meng et al. 2016), Splash, Parameter Server and CoCoA.

Algorithm 1: Task of Master in SCOPE
    Initialization: p Workers, w_0;
    for t = 0, 1, 2, ..., T do
        Send w_t to the Workers;
        Wait until it receives z_1, z_2, ..., z_p from the p Workers;
        Compute the full gradient z = (1/n) ∑_{k=1}^p z_k, and then send z to each Worker;
        Wait until it receives ũ_1, ũ_2, ..., ũ_p from the p Workers;
        Compute w_{t+1} = (1/p) ∑_{k=1}^p ũ_k;
    end for

Optimization Algorithm. The whole optimization (learning) algorithm is completed cooperatively by the Master and the Workers:
• Task of Master: The operations completed by the Master are outlined in Algorithm 1. We can find that the Master has two main tasks. The first task is to compute the full gradient after all the local gradient sums {z_k} have been received from all Workers, and then send the full gradient to all Workers. The second task is to update the parameter w after all the locally updated parameters {ũ_k} have been received, and then send the updated parameter to all Workers. It is easy to see that the computation load of the Master is lightweight.

• Task of Workers: The operations completed by the Workers are outlined in Algorithm 2. We can find that each Worker has two main tasks. The first task is to compute the sum of the gradients on its local data (called the local gradient sum), i.e., z_k = ∑_{i∈D_k} ∇f_i(w) for Worker_k, and then send the local gradient sum to the Master. The second task is to train w by only using the local data, after which the Worker will send the locally updated parameter, denoted as ũ_k for Worker_k, to the Master and wait for the newest w from the Master.

Here, w_t denotes the global parameter at the t-th iteration and is stored on the Master. u_{k,m} denotes the local parameter at the m-th iteration on Worker_k.

SCOPE is inspired by SVRG (Johnson and Zhang 2013), which tries to utilize the full gradient to speed up the convergence of stochastic optimization. However, the original SVRG in (Johnson and Zhang 2013) is sequential. To design a distributed SVRG method, one natural strategy is to adapt the mini-batch SVRG (Zhao et al. 2014) to distributed settings, which is a typical strategy in most distributed SGD frameworks like Parameter Server (Li et al. 2014) and Petuum (Xing et al. 2015). In the appendix [1], we briefly outline the sequential SVRG and the mini-batch based distributed SVRG (called DisSVRG).

[1] All the appendices and proofs of this paper can be found in the arXiv version of this paper (Zhao et al. 2016).
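To make the cooperation between Algorithm 1 and the Worker procedure of Algorithm 2 (listed below) concrete, here is a minimal single-process Scala simulation written for this article: the p "Workers" are just in-memory data partitions, the Master's two Reduce steps are plain averages, and gradFi (the per-instance gradient ∇f_i) is assumed to be supplied by the caller, for example the LR gradient sketched in the Introduction. It illustrates the control flow only and is not the authors' Spark implementation; the names (eta for η, c, M, parts) are choices made here.

    import scala.util.Random

    object ScopeSimulation {
      type Vec = Array[Double]

      private def axpy(a: Double, x: Vec, y: Vec): Vec =            // a * x + y
        Array.tabulate(y.length)(j => a * x(j) + y(j))
      private def sum(vs: Seq[Vec], d: Int): Vec =
        vs.foldLeft(Array.fill(d)(0.0))((acc, v) => axpy(1.0, v, acc))

      // One outer iteration (Algorithm 1 + Algorithm 2), simulated in a single process.
      // parts(k) holds Worker k's local instances D_k; gradFi(w, instance) computes ∇f_i(w).
      def outerIteration[I](wt: Vec, parts: IndexedSeq[IndexedSeq[I]],
                            gradFi: (Vec, I) => Vec,
                            eta: Double, c: Double, M: Int,
                            rnd: Random = new Random(0)): Vec = {
        val d = wt.length
        val n = parts.map(_.size).sum

        // Master: full gradient z = (1/n) Σ_k z_k, where z_k is Worker k's local gradient sum.
        val zk = parts.map(Dk => sum(Dk.map(i => gradFi(wt, i)), d))
        val z  = sum(zk, d).map(_ / n)

        // Each Worker: M local SVRG-style steps with the extra c * (u - w_t) term,
        // using the rewritten update with zHat = z - c * w_t precomputed once.
        val zHat = Array.tabulate(d)(j => z(j) - c * wt(j))
        val locallyUpdated = parts.map { Dk =>
          var u = wt.clone()
          for (_ <- 0 until M) {
            val i    = Dk(rnd.nextInt(Dk.size))
            val corr = axpy(-1.0, gradFi(wt, i), gradFi(u, i))       // ∇f_i(u) - ∇f_i(w_t)
            u = Array.tabulate(d)(j => (1.0 - c * eta) * u(j) - eta * (corr(j) + zHat(j)))
          }
          u                                                          // ũ_k = u_{k,M}
        }

        // Master: w_{t+1} = (1/p) Σ_k ũ_k.
        sum(locallyUpdated, d).map(_ / parts.size)
      }
    }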

Algorithm 2: Task of Workers in SCOPE
    Initialization: initialize η and c > 0;
    For Worker_k:
    for t = 0, 1, 2, ..., T do
        Wait until it gets the newest parameter w_t from the Master;
        Let u_{k,0} = w_t, compute the local gradient sum z_k = ∑_{i∈D_k} ∇f_i(w_t), and then send z_k to the Master;
        Wait until it gets the full gradient z from the Master;
        for m = 0 to M − 1 do
            Randomly pick up an instance with index i_{k,m} from D_k;
            u_{k,m+1} = u_{k,m} − η(∇f_{i_{k,m}}(u_{k,m}) − ∇f_{i_{k,m}}(w_t) + z + c(u_{k,m} − w_t));
        end for
        Send u_{k,M} or (1/M) ∑_{m=1}^M u_{k,m}, which is called the locally updated parameter and denoted as ũ_k, to the Master;
    end for

We can find that there exist three major differences between SCOPE and SVRG (or DisSVRG).

The first difference is that in SCOPE each Worker locally performs stochastic optimization by only using its native data (refer to the update of u_{k,m+1} for each Worker_k in Algorithm 2). On the contrary, SVRG or DisSVRG performs stochastic optimization on the Master (refer to the update of u_{m+1}) based on the whole dataset, which means that we need to randomly pick an instance or a mini-batch from the whole dataset D in each iteration of stochastic optimization. The locally stochastic optimization in SCOPE can dramatically reduce the communication cost, compared with DisSVRG with the mini-batch strategy.

The second difference is the update rule for w_{t+1} on the Master. There are no locally updated parameters in DisSVRG with the mini-batch strategy, and hence the update rule for w_{t+1} on the Master in DisSVRG cannot be written in the form of Algorithm 1, i.e., w_{t+1} = (1/p) ∑_{k=1}^p ũ_k.

The third difference is the update rule for u_{k,m+1} in SCOPE and u_{m+1} in SVRG or DisSVRG. Compared to SVRG, SCOPE has an extra term c(u_{k,m} − w_t) in Algorithm 2 to guarantee convergence, where c > 0 is a parameter related to the objective function. The rigorous theoretical proof will be provided in the following section on convergence. Here, we just give some intuition about the extra term c(u_{k,m} − w_t). Since SCOPE puts no constraints on how the training data are partitioned across Workers, the data distributions on different Workers may be totally different from each other. That means the local gradient in each Worker cannot necessarily approximate the full gradient. Hence, the term ∇f_{i_{k,m}}(u_{k,m}) − ∇f_{i_{k,m}}(w_t) + z is a biased estimate of the full gradient. This is different from SVRG, whose stochastic gradient is an unbiased estimate of the full gradient. The biased estimate ∇f_{i_{k,m}}(u_{k,m}) − ∇f_{i_{k,m}}(w_t) + z in SCOPE may lead u_{k,m+1} to be far away from the optimal value w*. To avoid this, we use the technique of the proximal stochastic gradient, which adds an extra term c(u_{k,m} − w_t) to keep u_{k,m+1} from moving far away from w_t. If w_t is close to w*, u_{k,m+1} will also be close to w*. So the extra term in SCOPE is reasonable for the convergence guarantee. At the same time, it does not bring extra computation, since the update rule in SCOPE can be rewritten as

    u_{k,m+1} = (1 − cη) u_{k,m} − η(∇f_{i_{k,m}}(u_{k,m}) − ∇f_{i_{k,m}}(w_t) + ẑ),

where ẑ = z − c w_t can be pre-computed and fixed as a constant for different m.
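As a quick check that this rewritten rule is the same update as the one in Algorithm 2, expand the right-hand side:

    (1 − cη) u_{k,m} − η(∇f_{i_{k,m}}(u_{k,m}) − ∇f_{i_{k,m}}(w_t) + ẑ)
        = u_{k,m} − η(∇f_{i_{k,m}}(u_{k,m}) − ∇f_{i_{k,m}}(w_t) + (z − c w_t) + c u_{k,m})
        = u_{k,m} − η(∇f_{i_{k,m}}(u_{k,m}) − ∇f_{i_{k,m}}(w_t) + z + c(u_{k,m} − w_t)),

so computing ẑ once per outer iteration is all that is needed, and the proximal-style term adds no per-step cost.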
Besides the above mini-batch based strategy (DisSVRG) for distributed SVRG, there also exist some other distributed SVRG methods, including DSVRG (Lee et al. 2016), KroMagnon (Mania et al. 2015), SVRGfoR (Konečný, McMahan, and Ramage 2015) and the distributed SVRG in (De and Goldstein 2016). DSVRG needs communication between Workers, and hence it cannot be directly implemented on Spark. KroMagnon focuses on an asynchronous strategy, which cannot be implemented on Spark either. SVRGfoR can be implemented on Spark, but it provides no theoretical results about convergence. Furthermore, SVRGfoR is proposed for cases with unbalanced data partitions and sparse features. On the contrary, our SCOPE can be used for any kind of features with a theoretical guarantee of convergence. Moreover, in our experiments, we find that SCOPE can outperform SVRGfoR. The distributed SVRG in (De and Goldstein 2016) cannot be guaranteed to converge, because it is similar to the version of SCOPE with c = 0.

EASGD (Zhang, Choromanska, and LeCun 2015) also adopts a parameter like c to control the difference between the local update and the global update. However, EASGD assumes that each worker has access to the entire dataset, while SCOPE only requires that each worker has access to a subset. The local learning strategy is also adopted in other problems, such as probabilistic logic programs (Riguzzi et al. 2016).

Communication Cost

Traditional mini-batch based distributed SGD methods, such as the DisSVRG in the appendix, need to transfer the parameter w and stochastic gradients frequently between Workers and Master. For example, the number of communication rounds is O(TM) for DisSVRG. Other traditional mini-batch based distributed SGD methods have the same number of communication rounds. Typically, M = Θ(n). Hence, traditional mini-batch based methods have O(Tn) communication rounds, which may lead to high communication cost.

Most of the training (computation) load of SCOPE comes from the inner loop of Algorithm 2, which is done at the local Workers without any communication. It is easy to find that the number of communication rounds in SCOPE is O(T), which is dramatically less than the O(Tn) of traditional mini-batch based distributed SGD or distributed SVRG methods. In the following section, we will prove that SCOPE has a linear convergence rate in terms of the iteration number T. It means that to achieve an ε-optimal solution [2], T = O(log(1/ε)).

[2] ŵ is called an ε-optimal solution if E‖ŵ − w*‖² ≤ ε, where w* is the optimal solution.
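For a rough sense of scale under these counts (the numbers here are illustrative, not taken from the paper's experiments): with n = 10⁶ training instances, M = Θ(n) inner steps and T = 10 outer iterations, a mini-batch based method would need on the order of 10⁷ communication rounds, whereas SCOPE needs on the order of 10.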

Hence, T is typically not large for many problems. For example, in most of our experiments, we can achieve convergent results with T ≤ 10. Hence, SCOPE is communication-efficient. SCOPE is a synchronous framework, which means that some waiting time is also needed for synchronization. Because the number of synchronizations is also O(T) and T is typically a small number, the waiting time is also small.

SCOPE on Spark

One interesting thing is that the computing framework of SCOPE is quite suitable for the popular distributed platform Spark. The programming model underlying Spark is MapReduce, which is actually a BSP model. In SCOPE, the task of the Workers that computes the local gradient sum z_k, together with the training procedure in the inner loop of Algorithm 2, can be seen as the Map process, since both of them only use local data. The task of the Master that computes the averages for both the full gradient z and w_{t+1} can be seen as the Reduce process. The MapReduce programming model is essentially a synchronous model, which needs some synchronization cost. Fortunately, the number of synchronizations is very small, as stated above. Hence, both the communication cost and the waiting time are very small for SCOPE. In this paper, we implement SCOPE on Spark since Spark has been widely adopted in industry for big data applications, and our SCOPE can be easily integrated into the data processing pipelines of those organizations using Spark.
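To illustrate this Map/Reduce view, the following is a hedged Scala sketch of one outer iteration expressed with basic Spark RDD operations; it was written for this article and is not taken from the released SCOPE code, whose internals may differ. mapPartitions plays the Map role for both the local gradient sums and the local training loops, while reduce plus averaging on the driver plays the Reduce role. The helpers gradSum (the local gradient sum of Algorithm 2) and localLoop (its inner loop) are assumed to be provided, for example along the lines of the earlier single-process simulation.

    import org.apache.spark.rdd.RDD

    object ScopeOnSparkSketch {
      type Vec = Array[Double]

      private def add(a: Vec, b: Vec): Vec = Array.tabulate(a.length)(j => a(j) + b(j))

      // One outer iteration of SCOPE expressed with basic Spark operations.
      // `data` is assumed to be partitioned into p partitions, one per Worker.
      def outerIteration[I](data: RDD[I], wt: Vec, n: Long,
                            gradSum: (Vec, Iterator[I]) => Vec,            // z_k = Σ_{i in D_k} ∇f_i(w_t)
                            localLoop: (Vec, Vec, Seq[I]) => Vec): Vec = {  // inner loop of Algorithm 2 -> ũ_k
        val sc = data.sparkContext
        val p  = data.partitions.length

        // Map: each partition computes its local gradient sum z_k.
        // Reduce: the driver (Master) forms the full gradient z = (1/n) Σ_k z_k.
        val wtB = sc.broadcast(wt)
        val z = data.mapPartitions(it => Iterator(gradSum(wtB.value, it)))
                    .reduce(add).map(_ / n)

        // Map: each partition runs the M local SVRG-style steps and returns ũ_k.
        // Reduce: the driver forms w_{t+1} = (1/p) Σ_k ũ_k.
        val zB = sc.broadcast(z)
        data.mapPartitions(it => Iterator(localLoop(wtB.value, zB.value, it.toSeq)))
            .reduce(add).map(_ / p)
      }
    }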
Convergence of SCOPE

In this section, we prove the convergence of SCOPE when the objective functions are strongly convex. We only list some Lemmas and Theorems; the detailed proofs can be found in the appendices (Zhao et al. 2016).

For convenience, we use w* to denote the optimal solution and ‖·‖ to denote the L2 norm ‖·‖₂. We assume that n = pq, which means that each Worker has the same number of training instances and |D_1| = |D_2| = ... = |D_p| = q. In practice, we cannot necessarily guarantee that these |D_k| are the same. However, it is easy to guarantee that ∀i, j, ||D_i| − |D_j|| ≤ 1, which will not affect the performance. We define p local functions as F_k(w) = (1/q) ∑_{i∈D_k} f_i(w), where k = 1, 2, ..., p. Then we have P(w) = (1/p) ∑_{k=1}^p F_k(w).

To prove the convergence of SCOPE, we first give two assumptions which have also been widely adopted by most existing stochastic optimization algorithms for convergence proofs.

Assumption 1 (Smooth Gradient). There exists a constant L > 0 such that ∀a, b ∈ R^d and i = 1, 2, ..., n, we have ‖∇f_i(a) − ∇f_i(b)‖ ≤ L‖a − b‖.

Assumption 2 (Strongly Convex). For each local function F_k(·), there exists a constant µ > 0 such that ∀a, b ∈ R^d, we have F_k(a) ≥ F_k(b) + ∇F_k(b)^T(a − b) + (µ/2)‖a − b‖².

Please note that these assumptions are weaker than those in (Zhang and Jordan 2015; Ma et al. 2015; Jaggi et al. 2014), since we do not need each f_i(w) to be convex and we do not make any assumption about the Hessian matrices either.

Lemma 1. Let γ_m = (1/p) ∑_{k=1}^p E‖u_{k,m} − w*‖². If c > L − µ, then we have γ_{m+1} ≤ [1 − η(2µ + c)] γ_m + (cη + 3L²η²) γ_0.

Let α = 1 − η(2µ + c) and β = cη + 3L²η². Given L and µ, which are determined by the objective function, we can always guarantee 0 < α < 1 and β < 1 − α by taking c > L − µ according to Lemma 1 and a sufficiently small step size η. Under these conditions, when the number of inner iterations satisfies M > log_α((1 − α − β)/(1 − α)), SCOPE converges with a linear rate: taking ũ_k = u_{k,M} contracts E‖w_t − w*‖² by a factor of α^M + β/(1 − α) < 1 per outer iteration, and taking the average ũ_k = (1/M) ∑_{m=1}^M u_{k,m} gives a corresponding factor of 1/(M(1 − α)) + β/(1 − α). The formal theorem statements and proofs are given in the appendices (Zhao et al. 2016).
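To see how Lemma 1 leads to the linear rate (for the choice ũ_k = u_{k,M}), note that u_{k,0} = w_t gives γ_0 = E‖w_t − w*‖², and unrolling the recursion over the M inner iterations gives

    γ_M ≤ α^M γ_0 + β(1 + α + ... + α^{M−1}) γ_0 ≤ (α^M + β/(1 − α)) γ_0.

Since w_{t+1} = (1/p) ∑_{k=1}^p u_{k,M}, the convexity of ‖·‖² gives E‖w_{t+1} − w*‖² ≤ γ_M, so each outer iteration shrinks the expected squared distance to w* by the factor α^M + β/(1 − α), which is below 1 exactly when M exceeds the threshold stated above. This is only a sketch of the argument; the complete proofs are in the appendices (Zhao et al. 2016).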

Here, we discuss the necessity of c. We first assume c = 0 and try to find whether Algorithm 2 will converge or not. It means that in the following derivation, we always assume c = 0.

Let us define another local function

    F_k^{(t)}(w) = F_k(w) + (z − ∇F_k(w_t))^T (w − w*)

and denote w*_{k,t} = argmin_w F_k^{(t)}(w). Let v_{k,m} = ∇f_{i_{k,m}}(u_{k,m}) − ∇f_{i_{k,m}}(w_t) + z + c(u_{k,m} − w_t). When c = 0, v_{k,m} = ∇f_{i_{k,m}}(u_{k,m}) − ∇f_{i_{k,m}}(w_t) + z. Then, we have E[v_{k,m} | u_{k,m}] = ∇F_k^{(t)}(u_{k,m}) and ∇F_k^{(t)}(w_t) = z. Hence, we can find that each local Worker actually tries to optimize the local function F_k^{(t)}(w) with SVRG based on the local data D_k. It means that if we set a relatively small η and a relatively large M, u_{k,m} will converge to w*_{k,t}.

Since F_k^{(t)}(w) is strongly convex, we have ∇F_k^{(t)}(w*_{k,t}) = 0. Then, we can get

    ∇F_k(w*_{k,t}) − ∇F_k(w*) = ∇F_k(w_t) − ∇F_k(w*) − z.

For the left-hand side, we have ∇F_k(w*_{k,t}) − ∇F_k(w*) ≈ ∇²F_k(w*)(w*_{k,t} − w*). For the right-hand side, we have

    ∇F_k(w_t) − ∇F_k(w*) − z = ∇F_k(w_t) − ∇F_k(w*) − (z − ∇P(w*)) ≈ ∇²F_k(w*)(w_t − w*) − ∇²P(w*)(w_t − w*).

Combining the two approximations, we can get w*_{k,t} − w* ≈ (I − A_k^{-1} A)(w_t − w*), where A_k = ∇²F_k(w*) and A = ∇²P(w*) are the Hessian matrices of the local function F_k and the global function P at w*, respectively. Assuming that in each iteration we can always get the local optimal values for all local functions, we have

    w_{t+1} − w* ≈ (I − (1/p) ∑_{k=1}^p A_k^{-1} A)(w_t − w*).    (1)

Please note that all the above derivations assume that c = 0. From (1), we can find that Algorithm 2 will not necessarily converge if c = 0, and the convergence property depends on the Hessian matrices of the local functions. Here, we give a simple example for illustration. We set n = p = 2, F_1(w) = (w − 1)² and F_2(w) = 100(w − 10)². We set a small step size η = 10⁻⁵ and a large M = 4000. The convergence results of SCOPE with different c are presented in Table 1.

Table 1: Impact of c.
    c           0     1     5     10
    Converge?   No    No    No    Yes
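The toy study behind Table 1 is simple enough to script. The sketch below, written for this article, sets up exactly the example above (two Workers, F_1(w) = (w − 1)², F_2(w) = 100(w − 10)², η = 10⁻⁵, M = 4000) and runs the SCOPE updates for a given c; since each Worker holds a single instance, the sampled gradient in the inner loop is just its local gradient. The starting point w_0 = 0 and the number of outer iterations T = 50 are choices made here for illustration, so the script is meant for experimenting with different values of c rather than as a faithful reproduction of the authors' protocol; Table 1 reports which values of c converge.

    object ToyNecessityOfC {
      // Two "Workers", each holding one 1-D quadratic:
      //   F1(w) = (w - 1)^2,  F2(w) = 100 * (w - 10)^2,  P(w) = (F1(w) + F2(w)) / 2.
      val grads: Vector[Double => Double] =
        Vector(w => 2.0 * (w - 1.0), w => 200.0 * (w - 10.0))
      val wStar: Double = 1001.0 / 101.0              // minimizer of P

      // T outer iterations of SCOPE (Algorithms 1 and 2) on the toy problem.
      def run(c: Double, eta: Double = 1e-5, M: Int = 4000, T: Int = 50): Double = {
        var w = 0.0                                   // w_0, chosen here for illustration
        for (_ <- 0 until T) {
          val z = grads.map(g => g(w)).sum / grads.size        // full gradient (n = 2)
          val locals = grads.map { g =>               // each Worker's inner loop
            var u = w
            for (_ <- 0 until M)
              u -= eta * (g(u) - g(w) + z + c * (u - w))
            u                                         // ũ_k = u_{k,M}
          }
          w = locals.sum / locals.size                // w_{t+1} = average of the ũ_k
        }
        w
      }

      def main(args: Array[String]): Unit =
        for (c <- Seq(0.0, 1.0, 5.0, 10.0))
          println(f"c = $c%4.1f  ->  w_T = ${run(c)}%.4f   (w* = $wStar%.4f)")
    }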
Separating Data Uniformly

If we separate the data uniformly, which means that the local data distribution on each Worker is similar to the global data distribution, then we have A_k ≈ A and ‖I − (1/p) ∑_{k=1}^p A_k^{-1} A‖ ≈ 0. From (1), we can find that c = 0 can make SCOPE converge in this special case.

Experiment

We choose logistic regression (LR) with an L2-norm regularization term to evaluate SCOPE and the baselines. Hence, P(w) is defined as

    P(w) = (1/n) ∑_{i=1}^n [ log(1 + e^{−y_i x_i^T w}) + (λ/2)‖w‖² ].

The code can be downloaded from https://github.com/LIBBLE/LIBBLE-Spark/.

Dataset

We use four datasets for evaluation: MNIST-8M, epsilon, KDD12 and Data-A. The first two datasets can be downloaded from the LibSVM website [3]. MNIST-8M contains 8,100,000 handwritten digits. We set the instances of digits 5 to 9 as positive, and the instances of digits 0 to 4 as negative. KDD12 is the dataset of Track 1 for KDD Cup 2012, which can be downloaded from the KDD Cup website [4]. Data-A is a dataset from a data mining competition [5]. The information about these datasets is summarized in Table 2. All the data is normalized before training. The regularization hyper-parameter λ is set to 10⁻⁴ for the first three datasets, which are relatively small, and to 10⁻⁶ for the largest dataset Data-A. Similar phenomena can be observed for other values of λ, which are omitted due to space limitation. For all datasets, we set c = λ × 10⁻².

Table 2: Datasets for evaluation.
                 #instances     #features    memory    λ
    MNIST-8M     8,100,000      784          39G       1e-4
    epsilon      400,000        2,000        11G       1e-4
    KDD12        73,209,277     1,427,495    21G       1e-4
    Data-A       106,691,093    320          260G      1e-6

[3] https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
[4] http://www.kddcup2012.org/
[5] http://www.yiban.cn/project/2015ccf/comp-detail.php?cid=231

Experimental Setting and Baseline

Distributed Platform. We have a Spark cluster of 33 machines (nodes) connected by 10GB Ethernet. Each machine has 12 Intel Xeon E5-2620 cores and 64GB of memory. We construct two clusters, a small one and a large one, from the original 33 machines for our experiments. The small cluster contains 9 machines, one master and eight slaves, and we use 2 cores for each slave. The large cluster contains 33 machines, 1 master and 32 slaves, and we use 4 cores for each slave. In both clusters, each machine has access to the 64GB of memory on the corresponding machine, and one core corresponds to one Worker. Hence, the small cluster has one Master and 16 Workers, and the large cluster has one Master and 128 Workers. The small cluster is used for experiments on the three relatively small datasets, MNIST-8M, epsilon and KDD12. The large cluster is used for experiments on the largest dataset, Data-A. We use Spark 1.5.2 for our experiments and implement SCOPE in Scala.

Baseline. Because the focus of this paper is to design distributed learning methods for Spark, we compare SCOPE with distributed learning baselines which can be implemented on Spark. More specifically, we adopt the following baselines for comparison:

• MLlib [6] (Meng et al. 2016): MLlib is an open-source library for distributed machine learning on Spark. It is mainly based on two optimization methods: mini-batch based distributed SGD and distributed lbfgs. We find that the distributed SGD method is much slower than distributed lbfgs on Spark in our experiments. Hence, we only compare our method with distributed lbfgs for MLlib, which is a batch learning method.

• LibLinear [7] (Lin et al. 2014): LibLinear is a distributed Newton method, which is also a batch learning method.

• Splash [8] (Zhang and Jordan 2015): Splash is a distributed SGD method that uses the local learning strategy to reduce communication cost (Zhang, Wainwright, and Duchi 2012), which is different from the mini-batch based distributed SGD method.

• CoCoA [9] (Jaggi et al. 2014): CoCoA is a distributed dual coordinate ascent method that uses the local learning strategy to reduce communication cost, and it is formulated from the dual problem.

• CoCoA+ [10] (Ma et al. 2015): CoCoA+ is an improved version of CoCoA. Different from CoCoA, which combines the local updates into the global parameter by averaging, CoCoA+ combines them by adding.

[6] http://spark.apache.org/mllib/
[7] https://www.csie.ntu.edu.tw/~cjlin/liblinear/
[8] http://zhangyuc.github.io/splash
[9] https://github.com/gingsmith/cocoa
[10] https://github.com/gingsmith/cocoa

We can find that the above baselines include state-of-the-art distributed learning methods with different characteristics. All the authors of these methods have shared the source code of their methods with the public. We use the source code provided by the authors for our experiments. For all baselines, we try several parameter values and choose the best performance.

Efficiency Comparison with Baselines

We compare SCOPE with the other baselines on the four datasets. The results are shown in Figure 2. Each marked point on the curves denotes one update of w by the Master, which typically corresponds to one iteration of the outer loop. For SCOPE, good convergence results can be obtained with the number of updates (i.e., the T in Algorithm 1) less than five. We can find that Splash vibrates on some datasets since it introduces variance into the training process. On the contrary, SCOPE is stable, which reflects the fact that SCOPE is a variance reduction method like SVRG. It is easy to see that SCOPE has a linear convergence rate, which also conforms to our theoretical analysis. Furthermore, SCOPE is much faster than all the other baselines.

[Figure 2: Efficiency comparison with baselines — objective value minus the optimum versus CPU time (milliseconds) for SCOPE, LibLinear, CoCoA, MLlib (lbfgs), Splash and CoCoA+ on (a) MNIST-8M with 16 cores, (b) epsilon with 16 cores, (c) KDD12 with 16 cores and (d) Data-A with 128 cores.]

SCOPE can also outperform SVRGfoR (Konečný, McMahan, and Ramage 2015) and DisSVRG. The experimental comparison can be found in the appendix (Zhao et al. 2016).

Speedup

We use the dataset MNIST-8M for the speedup evaluation of SCOPE. Two cores are used for each machine. We evaluate speedup by increasing the number of machines. The training process stops when the gap between the objective function value and the optimal value is less than 10⁻¹⁰. The speedup is defined as follows:

    speedup = (time with 16 cores by SCOPE) / (time with 2π cores),

where π is the number of machines and we choose π = 8, 16, 24, 32. The experiments are performed 5 times and the average time is reported for the final speedup result.
[Figure 3: Speedup of SCOPE compared with the ideal linear speedup as the number of cores grows from 16 to 64.]

The speedup results are shown in Figure 3, where we can find that SCOPE achieves a super-linear speedup. This might be reasonable due to the higher cache hit ratio with more machines (Yu et al. 2014). This speedup result is quite promising in our multi-machine setting, since the communication cost is much larger than that of a multi-thread setting. The good speedup of SCOPE can be explained by the fact that most of the training work can be completed locally by each Worker, and SCOPE does not need much communication.

SCOPE is based on the synchronous MapReduce framework of Spark. One shortcoming of a synchronous framework is the synchronization cost, which includes both communication time and waiting time. We also perform experiments to show the low synchronization cost of SCOPE, which can be found in the appendix (Zhao et al. 2016).

Conclusion

In this paper, we propose a novel DSO method, called SCOPE, for distributed machine learning on Spark. Theoretical analysis shows that SCOPE is convergent with a linear convergence rate for strongly convex cases. Empirical results show that SCOPE can outperform other state-of-the-art distributed methods on Spark.


Acknowledgements
This work is partially supported by the “DengFeng” project of Nanjing University.

References
Bottou, L. 2010. Large-scale machine learning with stochastic gradient descent. In International Conference on Computational Statistics.
De, S., and Goldstein, T. 2016. Efficient distributed SGD with variance reduction. In IEEE International Conference on Data Mining.
Duchi, J. C.; Hazan, E.; and Singer, Y. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12:2121–2159.
Hsieh, C.-J.; Yu, H.-F.; and Dhillon, I. S. 2015. Passcode: Parallel asynchronous stochastic dual co-ordinate descent. In International Conference on Machine Learning.
J. Reddi, S.; Hefny, A.; Sra, S.; Poczos, B.; and Smola, A. J. 2015. On variance reduction in stochastic gradient descent and its asynchronous variants. In Neural Information Processing Systems.
Jaggi, M.; Smith, V.; Takac, M.; Terhorst, J.; Krishnan, S.; Hofmann, T.; and Jordan, M. I. 2014. Communication-efficient distributed dual coordinate ascent. In Neural Information Processing Systems.
Johnson, R., and Zhang, T. 2013. Accelerating stochastic gradient descent using predictive variance reduction. In Neural Information Processing Systems.
Konečný, J.; McMahan, B.; and Ramage, D. 2015. Federated optimization: Distributed optimization beyond the datacenter. arXiv:1511.03575.
Lee, J. D.; Lin, Q.; Ma, T.; and Yang, T. 2016. Distributed stochastic variance reduced gradient methods and a lower bound for communication complexity. arXiv:1507.07595v2.
Li, M.; Andersen, D. G.; Park, J. W.; Smola, A. J.; Ahmed, A.; Josifovski, V.; Long, J.; Shekita, E. J.; and Su, B. 2014. Scaling distributed machine learning with the parameter server. In USENIX Symposium on Operating Systems Design and Implementation.
Lin, C.-Y.; Tsai, C.-H.; Lee, C.-P.; and Lin, C.-J. 2014. Large-scale logistic regression and linear support vector machines using Spark. In IEEE International Conference on Big Data.
Lin, Q.; Lu, Z.; and Xiao, L. 2014. An accelerated proximal coordinate gradient method. In Neural Information Processing Systems.
Liu, J.; Wright, S. J.; Ré, C.; Bittorf, V.; and Sridhar, S. 2014. An asynchronous parallel stochastic coordinate descent algorithm. In International Conference on Machine Learning.
Ma, C.; Smith, V.; Jaggi, M.; Jordan, M. I.; Richtárik, P.; and Takáč, M. 2015. Adding vs. averaging in distributed primal-dual optimization. In International Conference on Machine Learning.
Mania, H.; Pan, X.; Papailiopoulos, D. S.; Recht, B.; Ramchandran, K.; and Jordan, M. I. 2015. Perturbed iterate analysis for asynchronous stochastic optimization. arXiv:1507.06970.
Meng, X.; Bradley, J.; Yavuz, B.; Sparks, E.; Venkataraman, S.; Liu, D.; Freeman, J.; Tsai, D.; Amde, M.; Owen, S.; Xin, D.; Xin, R.; Franklin, M. J.; Zadeh, R.; Zaharia, M.; and Talwalkar, A. 2016. MLlib: Machine learning in Apache Spark. Journal of Machine Learning Research 17(34):1–7.
Nitanda, A. 2014. Stochastic proximal gradient descent with acceleration techniques. In Neural Information Processing Systems.
Recht, B.; Ré, C.; Wright, S. J.; and Niu, F. 2011. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. In Neural Information Processing Systems.
Riguzzi, F.; Bellodi, E.; Zese, R.; Cota, G.; and Lamma, E. 2016. Scaling structure learning of probabilistic logic programs by MapReduce. In European Conference on Artificial Intelligence.
Schmidt, M. W.; Roux, N. L.; and Bach, F. R. 2013. Minimizing finite sums with the stochastic average gradient. CoRR abs/1309.2388.
Shalev-Shwartz, S., and Zhang, T. 2013. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research 14(1):567–599.
Shalev-Shwartz, S., and Zhang, T. 2014. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. In International Conference on Machine Learning.
Xiao, L. 2009. Dual averaging method for regularized stochastic learning and online optimization. In Neural Information Processing Systems.
Xing, E. P.; Ho, Q.; Dai, W.; Kim, J. K.; Wei, J.; Lee, S.; Zheng, X.; Xie, P.; Kumar, A.; and Yu, Y. 2015. Petuum: A new platform for distributed machine learning on big data. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Yang, T. 2013. Trading computation for communication: Distributed stochastic dual coordinate ascent. In Neural Information Processing Systems.
Yu, Z.-Q.; Shi, X.-J.; Yan, L.; and Li, W.-J. 2014. Distributed stochastic ADMM for matrix factorization. In International Conference on Information and Knowledge Management.
Zaharia, M.; Chowdhury, M.; Franklin, M. J.; Shenker, S.; and Stoica, I. 2010. Spark: Cluster computing with working sets. In USENIX Workshop on Hot Topics in Cloud Computing.
Zhang, Y., and Jordan, M. I. 2015. Splash: User-friendly programming interface for parallelizing stochastic algorithms. CoRR abs/1506.07552.
Zhang, R., and Kwok, J. T. 2014. Asynchronous distributed ADMM for consensus optimization. In International Conference on Machine Learning.
Zhang, S.; Choromanska, A.; and LeCun, Y. 2015. Deep learning with elastic averaging SGD. In Neural Information Processing Systems.
Zhang, L.; Mahdavi, M.; and Jin, R. 2013. Linear convergence with condition number independent access of full gradients. In Neural Information Processing Systems.
Zhang, Y.; Wainwright, M. J.; and Duchi, J. C. 2012. Communication-efficient algorithms for statistical optimization. In Neural Information Processing Systems.
Zhang, R.; Zheng, S.; and Kwok, J. T. 2016. Asynchronous distributed semi-stochastic gradient optimization. In AAAI Conference on Artificial Intelligence.
Zhao, S.-Y., and Li, W.-J. 2016. Fast asynchronous parallel stochastic gradient descent: A lock-free approach with convergence guarantee. In AAAI Conference on Artificial Intelligence.
Zhao, T.; Yu, M.; Wang, Y.; Arora, R.; and Liu, H. 2014. Accelerated mini-batch randomized block coordinate descent method. In Neural Information Processing Systems.
Zhao, S.-Y.; Xiang, R.; Shi, Y.-H.; Gao, P.; and Li, W.-J. 2016. SCOPE: Scalable composite optimization for learning on Spark. CoRR abs/1602.00133.
Zinkevich, M.; Weimer, M.; Li, L.; and Smola, A. J. 2010. Parallelized stochastic gradient descent. In Neural Information Processing Systems.
