SCOPE: Scalable Composite Optimization for Learning on Spark

Shen-Yi Zhao, Ru Xiang, Ying-Hao Shi, Peng Gao and Wu-Jun Li
National Key Laboratory for Novel Software Technology
Department of Computer Science and Technology, Nanjing University, China
{zhaosy,xiangr,shiyh,gaop}@lamda.nju.edu.cn, liwujun@nju.edu.cn

Copyright © 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Many machine learning models, such as logistic regression (LR) and support vector machine (SVM), can be formulated as composite optimization problems. Recently, many distributed stochastic optimization (DSO) methods have been proposed to solve large-scale composite optimization problems, and they have shown better performance than traditional batch methods. However, most of these DSO methods might not be scalable enough. In this paper, we propose a novel DSO method, called scalable composite optimization for learning (SCOPE), and implement it on the fault-tolerant distributed platform Spark. SCOPE is both computation-efficient and communication-efficient. Theoretical analysis shows that SCOPE is convergent with a linear convergence rate when the objective function is strongly convex. Furthermore, empirical results on real datasets show that SCOPE can outperform other state-of-the-art distributed learning methods on Spark, including both batch learning methods and DSO methods.

Introduction

Many machine learning models can be formulated as composite optimization problems with the following finite-sum form: $\min_{w \in \mathbb{R}^d} P(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w)$, where $w$ is the parameter to learn (optimize), $n$ is the number of training instances, and $f_i(w)$ is the loss function on training instance $i$. For example, $f_i(w) = \log(1 + e^{-y_i x_i^T w}) + \frac{\lambda}{2}\|w\|^2$ in logistic regression (LR), and $f_i(w) = \max\{0, 1 - y_i x_i^T w\} + \frac{\lambda}{2}\|w\|^2$ in support vector machine (SVM), where $\lambda$ is the regularization hyper-parameter and $(x_i, y_i)$ is training instance $i$, with $x_i \in \mathbb{R}^d$ being the feature vector and $y_i \in \{+1, -1\}$ being the class label. Other cases like matrix factorization and deep neural networks can also be written in similar forms of composite optimization.
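To make the finite-sum objective above concrete, the following is a minimal NumPy sketch (our own illustration; the names lr_loss and lr_grad are not from the paper) that evaluates P(w) and its full gradient for the L2-regularized logistic regression loss defined above on a toy dataset.

```python
import numpy as np

def lr_loss(w, X, y, lam):
    """P(w) = (1/n) * sum_i [ log(1 + exp(-y_i x_i^T w)) + (lam/2) * ||w||^2 ]."""
    margins = y * (X @ w)                     # y_i * x_i^T w for every instance i
    return np.log1p(np.exp(-margins)).mean() + 0.5 * lam * np.dot(w, w)

def lr_grad(w, X, y, lam):
    """Full gradient of P(w): each f_i contributes -y_i x_i / (1 + exp(y_i x_i^T w)) + lam * w."""
    margins = y * (X @ w)
    coef = -y / (1.0 + np.exp(margins))
    return (X.T @ coef) / X.shape[0] + lam * w

# Toy usage: n = 100 instances, d = 5 features, labels in {+1, -1}.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = np.where(rng.standard_normal(100) > 0, 1.0, -1.0)
w = np.zeros(5)
print(lr_loss(w, X, y, lam=0.01), np.linalg.norm(lr_grad(w, X, y, lam=0.01)))
```

The SVM case fits the same template by replacing the per-instance loss with the hinge loss $\max\{0, 1 - y_i x_i^T w\}$ while keeping the same regularization term.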
Due to its efficiency and effectiveness, stochastic optimization (SO) has recently attracted much attention for solving composite optimization problems in machine learning (Xiao 2009; Bottou 2010; Duchi, Hazan, and Singer 2011; Schmidt, Roux, and Bach 2013; Johnson and Zhang 2013; Zhang, Mahdavi, and Jin 2013; Shalev-Shwartz and Zhang 2013; 2014; Lin, Lu, and Xiao 2014; Nitanda 2014). Existing SO methods can be divided into two categories. The first category is stochastic gradient descent (SGD) and its variants, such as stochastic average gradient (SAG) (Schmidt, Roux, and Bach 2013) and stochastic variance reduced gradient (SVRG) (Johnson and Zhang 2013), which perform optimization on the primal problem. The second category, such as stochastic dual coordinate ascent (SDCA) (Shalev-Shwartz and Zhang 2013), performs optimization with the dual formulation. Many advanced SO methods, such as SVRG and SDCA, are more efficient than traditional batch learning methods, both in theory and in practice, for large-scale learning problems.
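The variance-reduction idea behind SVRG mentioned above can be sketched in a few lines. The code below is our own simplified illustration (not the SCOPE algorithm and not the authors' implementation), reusing the L2-regularized LR loss from the previous sketch: each outer iteration computes the full gradient once at a snapshot point, and every inner stochastic update corrects the sampled gradient with that snapshot so the update stays unbiased while its variance shrinks.

```python
import numpy as np

def grad_fi(w, x_i, y_i, lam):
    """Gradient of a single f_i(w) = log(1 + exp(-y_i x_i^T w)) + (lam/2) * ||w||^2."""
    return -y_i * x_i / (1.0 + np.exp(y_i * np.dot(x_i, w))) + lam * w

def svrg_outer_iteration(w, X, y, lam, eta, m, rng):
    """One outer iteration of an SVRG-style update (Johnson and Zhang 2013), simplified."""
    n = X.shape[0]
    w_snap = w.copy()
    # Full gradient at the snapshot, computed once per outer iteration.
    mu = np.mean([grad_fi(w_snap, X[i], y[i], lam) for i in range(n)], axis=0)
    for _ in range(m):
        i = rng.integers(n)
        # Variance-reduced stochastic gradient: unbiased, and its variance shrinks
        # as both w and w_snap approach the optimum.
        v = grad_fi(w, X[i], y[i], lam) - grad_fi(w_snap, X[i], y[i], lam) + mu
        w = w - eta * v
    return w

# Toy usage: a few outer iterations on random data.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = np.where(rng.standard_normal(100) > 0, 1.0, -1.0)
w = np.zeros(5)
for _ in range(10):
    w = svrg_outer_iteration(w, X, y, lam=0.01, eta=0.1, m=200, rng=rng)
```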
Most traditional SO methods are sequential, which means that the optimization procedure is not performed in parallel. However, with the increase of data scale, traditional sequential SO methods may not be efficient enough to handle large-scale datasets. Furthermore, in this big data era, many large-scale datasets are distributively stored on a cluster of multiple machines. Traditional sequential SO methods cannot be directly used for these kinds of distributed datasets. To handle large-scale composite optimization problems, researchers have recently proposed several parallel SO (PSO) methods for multi-core systems and distributed SO (DSO) methods for clusters of multiple machines.

PSO methods perform SO on a single machine with multiple cores (multiple threads) and a shared memory. Typically, synchronous strategies with locks are much slower than asynchronous ones. Hence, recent progress in PSO mainly focuses on designing asynchronous or lock-free optimization strategies (Recht et al. 2011; Liu et al. 2014; Hsieh, Yu, and Dhillon 2015; J. Reddi et al. 2015; Zhao and Li 2016).

DSO methods perform SO on clusters of multiple machines. DSO can be used to handle extremely large problems which are beyond the processing capability of one single machine. In many real applications, especially industrial applications, the datasets are typically distributively stored on clusters. Hence, DSO has recently become a hot research topic. Many DSO methods have been proposed, including distributed SGD methods based on the primal formulation and distributed methods based on the dual formulation. Representative distributed SGD methods include PSGD (Zinkevich et al. 2010), BAVG-