a relatively small $\eta$ and a relatively large $M$, the $u_{k,m}$ will converge to $w^*_{k,t}$. Since $F_k^{(t)}(w)$ is strongly convex, we have $\nabla F_k^{(t)}(w^*_{k,t}) = 0$. Then, we can get

$$\nabla F_k(w^*_{k,t}) - \nabla F_k(w^*) = \nabla F_k(w_t) - \nabla F_k(w^*) - z.$$

For the left-hand side, we have

$$\nabla F_k(w^*_{k,t}) - \nabla F_k(w^*) \approx \nabla^2 F_k(w^*)\,(w^*_{k,t} - w^*).$$

For the right-hand side, we have

$$\nabla F_k(w_t) - \nabla F_k(w^*) - z = \nabla F_k(w_t) - \nabla F_k(w^*) - \big(z - \nabla P(w^*)\big) \approx \nabla^2 F_k(w^*)\,(w_t - w^*) - \nabla^2 P(w^*)\,(w_t - w^*).$$

Combining the two approximations, we can get

$$w^*_{k,t} - w^* \approx \big(I - A_k^{-1} A\big)(w_t - w^*),$$

where $A_k = \nabla^2 F_k(w^*)$ and $A = \nabla^2 P(w^*)$ are the two Hessian matrices of the local function $F_k$ and the global function $P$ at $w^*$, respectively. Assuming that in each iteration we can always get the local optimal values for all local functions, we have

$$w_{t+1} - w^* \approx \Big(I - \frac{1}{p}\sum_{k=1}^{p} A_k^{-1} A\Big)(w_t - w^*). \qquad (1)$$

Please note that all the above derivations assume that $c = 0$. From (1), we can find that Algorithm 2 will not necessarily converge if $c = 0$, and the convergence property is dependent on the Hessian matrices of the local functions. Here, we give a simple example for illustration. We set $n = p = 2$ and $F_1(w) = (w - 1)^2$, $F_2(w) = 100(w - 10)^2$. We set a small step-size $\eta = 10^{-5}$ and a large $M = 4000$. The convergence results of SCOPE with different $c$ are presented in Table 1.

Table 1: Impact of c.

  c           0    1    5    10
  Converge?   No   No   No   Yes
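To make the role of (1) concrete, the following short Scala snippet (a sketch, not part of the paper's code) evaluates the scalar iteration factor for the above example, under the assumption that the global objective is the average of the two local functions, i.e., each Worker holds one of the two instances:

```scala
object ToyDivergence {
  def main(args: Array[String]): Unit = {
    // Second derivatives (Hessians) of the two scalar local functions:
    // F1(w) = (w - 1)^2        =>  A1 = 2
    // F2(w) = 100 * (w - 10)^2 =>  A2 = 200
    val a1 = 2.0
    val a2 = 200.0
    // Global Hessian, assuming P(w) = (F1(w) + F2(w)) / 2.
    val a = (a1 + a2) / 2.0
    // Scalar version of the iteration matrix in Eq. (1) with p = 2 and c = 0.
    val factor = 1.0 - 0.5 * (a / a1 + a / a2)
    println(f"iteration factor = $factor%.4f, magnitude = ${math.abs(factor)}%.4f")
    // The magnitude is about 24.5 > 1, so the error |w_t - w*| grows across
    // iterations and SCOPE diverges when c = 0, matching the "No" entries in Table 1.
  }
}
```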
Separating Data Uniformly

If we separate the data uniformly, which means that the local data distribution on each Worker is similar to the global data distribution, then we have $A_k \approx A$ and $\big\|I - \frac{1}{p}\sum_{k=1}^{p} A_k^{-1} A\big\| \approx 0$, so the right-hand side of (1) contracts towards zero. From (1), we can find that $c = 0$ can make SCOPE converge for this special case.

Experiment

We choose logistic regression (LR) with an L2-norm regularization term to evaluate SCOPE and the baselines. Hence, $P(w)$ is defined as

$$P(w) = \frac{1}{n}\sum_{i=1}^{n}\Big[\log\big(1 + e^{-y_i x_i^T w}\big) + \frac{\lambda}{2}\|w\|^2\Big].$$

The code can be downloaded from https://github.com/LIBBLE/LIBBLE-Spark/.
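For concreteness, the following plain-Scala sketch shows one way to evaluate this objective on dense vectors; the helper and variable names are illustrative and are not taken from the LIBBLE-Spark code base:

```scala
object LogisticObjective {
  // One term of the sum: log(1 + exp(-y * x^T w)) + (lambda / 2) * ||w||^2,
  // with labels y in {-1, +1}.
  def pointLoss(x: Array[Double], y: Double, w: Array[Double], lambda: Double): Double = {
    val margin = x.zip(w).map { case (xi, wi) => xi * wi }.sum  // x^T w
    val reg = 0.5 * lambda * w.map(v => v * v).sum              // (lambda / 2) * ||w||^2
    math.log1p(math.exp(-y * margin)) + reg
  }

  // Full objective P(w): average of the per-instance terms over the data set.
  def objective(data: Seq[(Array[Double], Double)], w: Array[Double], lambda: Double): Double =
    data.map { case (x, y) => pointLoss(x, y, w, lambda) }.sum / data.size
}
```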
Dataset  We use four datasets for evaluation: MNIST-8M, epsilon, KDD12 and Data-A. The first two datasets can be downloaded from the LibSVM website (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). MNIST-8M contains 8,100,000 handwritten digits. We set the instances of digits 5 to 9 as positive, and the instances of digits 0 to 4 as negative. KDD12 is the dataset of Track 1 for KDD Cup 2012, which can be downloaded from the KDD Cup website (http://www.kddcup2012.org/). Data-A is a dataset from a data mining competition (http://www.yiban.cn/project/2015ccf/comp-detail.php?cid=231).

The information about these datasets is summarized in Table 2. All the data is normalized before training. The regularization hyper-parameter $\lambda$ is set to $10^{-4}$ for the first three datasets, which are relatively small, and to $10^{-6}$ for the largest dataset Data-A. Similar phenomena can be observed for other $\lambda$, which are omitted due to space limitations. For all datasets, we set $c = \lambda \times 10^{-2}$.

Table 2: Datasets for evaluation.

              #instances    #features    memory    lambda
  MNIST-8M     8,100,000          784       39G    1e-4
  epsilon        400,000        2,000       11G    1e-4
  KDD12       73,209,277    1,427,495       21G    1e-4
  Data-A     106,691,093          320      260G    1e-6

Experimental Setting and Baseline

Distributed Platform  We have a Spark cluster of 33 machines (nodes) connected by 10Gb Ethernet. Each machine has 12 Intel Xeon E5-2620 cores and 64GB of memory. We construct two clusters, a small one and a large one, from the original 33 machines for our experiments. The small cluster contains 9 machines: one master and eight slaves, with 2 cores used on each slave. The large cluster contains 33 machines: 1 master and 32 slaves, with 4 cores used on each slave. In both clusters, each machine has access to its 64GB of memory and one core corresponds to one Worker. Hence, the small cluster has one Master and 16 Workers, and the large cluster has one Master and 128 Workers. The small cluster is used for experiments on the three relatively small datasets, MNIST-8M, epsilon and KDD12. The large cluster is used for experiments on the largest dataset, Data-A. We use Spark 1.5.2 for our experiments and implement SCOPE in Scala.

Baseline  Because the focus of this paper is to design distributed learning methods for Spark, we compare SCOPE with distributed learning baselines that can be implemented on Spark. More specifically, we adopt the following baselines for comparison:

• MLlib (http://spark.apache.org/mllib/) (Meng et al. 2016): MLlib is an open source library for distributed machine learning on Spark. It is mainly based on two optimization methods: mini-batch based distributed SGD and distributed LBFGS. We find that …
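As an illustration of this baseline, the following is a minimal sketch of how MLlib's LBFGS-based logistic regression can be invoked on Spark 1.5.x; the data path, application name and hyper-parameter values are placeholders rather than the exact settings used in the experiments:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.optimization.SquaredL2Updater
import org.apache.spark.mllib.util.MLUtils

object MLlibLRBaseline {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("mllib-lr-baseline"))
    // Load a LibSVM-format data set (e.g., epsilon or KDD12); the path is a placeholder.
    val data = MLUtils.loadLibSVMFile(sc, "hdfs:///path/to/dataset")
    val lr = new LogisticRegressionWithLBFGS().setNumClasses(2)
    lr.optimizer
      .setRegParam(1e-4)                 // regularization strength, cf. Table 2
      .setUpdater(new SquaredL2Updater)  // L2-norm regularization
      .setNumIterations(100)
    val model = lr.run(data)
    println(s"trained a model with ${model.weights.size} weights")
    sc.stop()
  }
}
```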