the distributed SGD method is much slower than distributed lbfgs on Spark in our experiments. Hence, we only compare our method with distributed lbfgs for MLlib, which is a batch learning method (a minimal invocation sketch is given after this list).

• LibLinear⁷ (Lin et al. 2014): LibLinear is a distributed Newton method, which is also a batch learning method.

• Splash⁸ (Zhang and Jordan 2015): Splash is a distributed SGD method that uses a local learning strategy to reduce communication cost (Zhang, Wainwright, and Duchi 2012), which is different from mini-batch based distributed SGD methods.

• CoCoA⁹ (Jaggi et al. 2014): CoCoA is a distributed dual coordinate ascent method that uses a local learning strategy to reduce communication cost, and it is formulated from the dual problem.

• CoCoA+¹⁰ (Ma et al. 2015): CoCoA+ is an improved version of CoCoA. Different from CoCoA, which averages the local updates to form the global parameter, CoCoA+ adds the local updates together.

The above baselines cover state-of-the-art distributed learning methods with different characteristics. The authors of all these methods have released their source code, and we use the code provided by the authors in our experiments. For all baselines, we try several parameter values and report the best performance.
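For reference, the snippet below sketches how the MLlib (lbfgs) baseline can be invoked through the spark.mllib API. It assumes a binary logistic regression objective and a LIBSVM-formatted input file; the dataset path, regularization parameter, and iteration budget are illustrative placeholders rather than the settings tuned in the experiments above.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.util.MLUtils

object MLlibLbfgsBaseline {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MLlib-lbfgs-baseline"))

    // Hypothetical input path; in a real run the training set (e.g. MNIST-8M)
    // would be stored in LIBSVM format on HDFS.
    val training = MLUtils.loadLibSVMFile(sc, "hdfs:///data/mnist8m.libsvm").cache()

    // Batch L-BFGS solver shipped with spark.mllib. The regularization parameter
    // and iteration budget are illustrative values, not the tuned settings.
    val lbfgs = new LogisticRegressionWithLBFGS().setNumClasses(2)
    lbfgs.optimizer.setRegParam(1e-4).setNumIterations(100)

    val model = lbfgs.run(training)
    println(s"Learned ${model.weights.size}-dimensional weight vector")

    sc.stop()
  }
}
```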
Efficiency Comparison with Baselines

We compare SCOPE with the other baselines on the four datasets. The results are shown in Figure 2. Each marked point on the curves denotes one update of w by the Master, which typically corresponds to one iteration of the outer loop. For SCOPE, good convergence can be obtained with fewer than five updates (i.e., T in Algorithm 1 is less than five). We can find that Splash oscillates on some datasets since it introduces variance into the training process. On the contrary, SCOPE is stable, which reflects the fact that SCOPE is a variance reduction method like SVRG. It is easy to see that SCOPE has a linear convergence rate, which also conforms to our theoretical analysis. Furthermore, SCOPE is much faster than all the other baselines.

[Figure 2: Efficiency comparison with baselines. Panels (a) MNIST-8M, (b) epsilon, (c) KDD12, and (d) Data-A plot the objective value minus the optimal value (log scale) against CPU time in milliseconds for SCOPE, LibLinear, CoCoA, MLlib (lbfgs), Splash, and CoCoA+; (a)-(c) use 16 cores and (d) uses 128 cores.]

SCOPE can also outperform SVRGfoR (Konečný, McMahan, and Ramage 2015) and DisSVRG. The experimental comparison can be found in the appendix (Zhao et al. 2016).

Speedup

We use the dataset MNIST-8M for the speedup evaluation of SCOPE. Two cores are used for each machine, and we evaluate speedup by increasing the number of machines. The training process stops when the gap between the objective function value and the optimal value is less than 10^-10. The speedup is defined as

speedup = (time with 16 cores by SCOPE) / (time with 2π cores),

where π is the number of machines and we choose π = 8, 16, 24, 32. The experiments are performed 5 times and the average time is reported for the final speedup result. A small sketch of this computation is given below.
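To make the speedup definition concrete, the following sketch computes it from measured wall-clock times. The timing values are placeholders inserted only to illustrate the formula; they are not the measurements behind Figure 3.

```scala
object SpeedupCalc {
  def main(args: Array[String]): Unit = {
    // Placeholder wall-clock times (seconds) for SCOPE to reach the 1e-10
    // optimality gap, averaged over 5 runs; real values come from experiments.
    // Key = number of cores (2 cores per machine, so pi = 8, 16, 24, 32 machines).
    val timePerCores = Map(16 -> 400.0, 32 -> 190.0, 48 -> 120.0, 64 -> 90.0)

    val baseline = timePerCores(16) // time with 16 cores by SCOPE

    // speedup(2 * pi cores) = time(16 cores) / time(2 * pi cores)
    timePerCores.toSeq.sortBy(_._1).foreach { case (cores, t) =>
      println(f"$cores%2d cores: speedup = ${baseline / t}%.2f")
    }
  }
}
```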
[Figure 3: Speedup. Speedup of SCOPE compared with the ideal linear speedup as the number of cores increases.]

The speedup result is shown in Figure 3, where we can find that SCOPE achieves super-linear speedup. This might be reasonable due to the higher cache hit ratio with more machines (Yu et al. 2014). This speedup result is quite promising in our multi-machine setting, since the communication cost is much larger than that in a multi-thread setting. The good speedup of SCOPE can be explained by the fact that most of the training work can be completed locally by each Worker, so SCOPE does not need much communication.

SCOPE is based on the synchronous MapReduce framework of Spark. One shortcoming of a synchronous framework is the synchronization cost, which includes both communication time and waiting time. We also perform experiments to show the low synchronization cost of SCOPE, which can be found in the appendix (Zhao et al. 2016).

Conclusion

In this paper, we propose a novel DSO method, called SCOPE, for distributed machine learning on Spark. Theoretical analysis shows that SCOPE is convergent with a linear convergence rate for strongly convex cases. Empirical results show that SCOPE can outperform other state-of-the-art distributed methods on Spark.

⁷ https://www.csie.ntu.edu.tw/~cjlin/liblinear/
⁸ http://zhangyuc.github.io/splash
⁹ https://github.com/gingsmith/cocoa
¹⁰ https://github.com/gingsmith/cocoa