Hence, $T$ is typically not large for many problems. For example, in most of our experiments, we can achieve convergent results with $T \le 10$. Hence, SCOPE is communication-efficient. SCOPE is a synchronous framework, which means that some waiting time is also needed for synchronization. Because the number of synchronizations is also $O(T)$ and $T$ is typically a small number, the waiting time is also small.

SCOPE on Spark

One interesting thing is that the computing framework of SCOPE is quite suitable for the popular distributed platform Spark. The programming model underlying Spark is MapReduce, which is actually a BSP model. In SCOPE, the task of the Workers that computes the local gradient sums $z_k$, together with the training procedure in the inner loop of Algorithm 2, can be seen as the Map process, since both use only local data. The task of the Master that computes the averages for both the full gradient $z$ and $w_{t+1}$ can be seen as the Reduce process. The MapReduce programming model is essentially a synchronous model, which incurs some synchronization cost. Fortunately, the number of synchronizations is very small, as stated above. Hence, both the communication cost and the waiting time are very small for SCOPE. In this paper, we implement SCOPE on Spark since Spark has been widely adopted in industry for big data applications, and SCOPE can be easily integrated into the data processing pipelines of those organizations using Spark.
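To make this Map/Reduce structure concrete, the following is a minimal PySpark sketch of one outer iteration. Algorithm 2 itself is not reproduced in this excerpt, so the function names (grad_fi, scope_outer_iteration), the per-instance squared-loss gradient, the initialization of the inner iterate at $w_t$, and the SVRG-style inner update with step size $\eta$ and constant $c$ (as used in the analysis below) are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
from pyspark import SparkContext

# sc = SparkContext.getOrCreate()  # an existing SparkContext is assumed below

def grad_fi(x_i, y_i, w):
    # Hypothetical per-instance gradient; a squared loss is used purely for illustration.
    return (np.dot(w, x_i) - y_i) * x_i

def scope_outer_iteration(sc, data_rdd, w_t, n, M, eta, c):
    """One outer iteration of SCOPE expressed as Map/Reduce steps on Spark.

    data_rdd holds (x_i, y_i) pairs and is assumed to be split into p partitions,
    one partition D_k per Worker.
    """
    p = data_rdd.getNumPartitions()
    w_bc = sc.broadcast(w_t)

    # Map: each Worker computes its local gradient sum z_k over D_k.
    z_k = data_rdd.mapPartitions(
        lambda part: [sum(grad_fi(x, y, w_bc.value) for x, y in part)])
    # Reduce: the Master averages the z_k to obtain the full gradient z at w_t.
    z = z_k.reduce(lambda a, b: a + b) / n
    z_bc = sc.broadcast(z)

    # Map: each Worker runs the inner loop of Algorithm 2 using only its local data.
    def local_inner_loop(part):
        D_k = list(part)
        u = np.array(w_bc.value, dtype=float)
        for _ in range(M):
            x, y = D_k[np.random.randint(len(D_k))]
            v = (grad_fi(x, y, u) - grad_fi(x, y, w_bc.value)
                 + z_bc.value + c * (u - w_bc.value))
            u = u - eta * v
        yield u

    u_kM = data_rdd.mapPartitions(local_inner_loop)
    # Reduce: the Master averages the local iterates to obtain w_{t+1}
    # (the averaging choice analysed in Theorem 1 below).
    return u_kM.reduce(lambda a, b: a + b) / p
```

With mapPartitions, each Worker touches only its own partition $D_k$, and only the two reduce calls per outer iteration cross the network, which matches the $O(T)$ communication argument above.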
Convergence of SCOPE

In this section, we will prove the convergence of SCOPE when the objective functions are strongly convex. We only list some lemmas and theorems; the detailed proofs can be found in the appendices (Zhao et al. 2016).

For convenience, we use $w^*$ to denote the optimal solution, and $\|\cdot\|$ denotes the $L_2$ norm $\|\cdot\|_2$. We assume that $n = pq$, which means that each Worker has the same number of training instances and $|D_1| = |D_2| = \cdots = |D_p| = q$. In practice, we cannot necessarily guarantee that these $|D_k|$ are the same. However, it is easy to guarantee that $\forall i, j,\ \bigl| |D_i| - |D_j| \bigr| \le 1$, which will not affect the performance. We define $p$ local functions as $F_k(w) = \frac{1}{q}\sum_{i \in D_k} f_i(w)$, where $k = 1, 2, \ldots, p$. Then we have $P(w) = \frac{1}{p}\sum_{k=1}^{p} F_k(w)$.

To prove the convergence of SCOPE, we first give two assumptions, which have also been widely adopted by most existing stochastic optimization algorithms for their convergence proofs.

Assumption 1 (Smooth Gradient). There exists a constant $L > 0$ such that $\forall a, b \in \mathbb{R}^d$ and $i = 1, 2, \ldots, n$, we have $\|\nabla f_i(a) - \nabla f_i(b)\| \le L \|a - b\|$.

Assumption 2 (Strongly Convex). For each local function $F_k(\cdot)$, there exists a constant $\mu > 0$ such that $\forall a, b \in \mathbb{R}^d$, we have $F_k(a) \ge F_k(b) + \nabla F_k(b)^T (a - b) + \frac{\mu}{2}\|a - b\|^2$.

Please note that these assumptions are weaker than those in (Zhang and Jordan 2015; Ma et al. 2015; Jaggi et al. 2014), since we do not need each $f_i(w)$ to be convex and we do not make any assumption about the Hessian matrices either.

Lemma 1. Let $\gamma_m = \frac{1}{p}\sum_{k=1}^{p}\mathbb{E}\|u_{k,m} - w^*\|^2$. If $c > L - \mu$, then we have $\gamma_{m+1} \le [1 - \eta(2\mu + c)]\gamma_m + (c\eta + 3L^2\eta^2)\gamma_0$.

Let $\alpha = 1 - \eta(2\mu + c)$ and $\beta = c\eta + 3L^2\eta^2$. Given $L$ and $\mu$, which are determined by the objective function, we can always guarantee $0 < \alpha < 1$, $0 < \beta < 1$, and $\alpha + \beta < 1$ by setting $\eta < \min\{\frac{2\mu}{3L^2}, \frac{1}{2\mu + c}\}$. We have the following theorems:

Theorem 1. If we take $w_{t+1} = \frac{1}{p}\sum_{k=1}^{p} u_{k,M}$, then we can get the following convergence result:
$$\mathbb{E}\|w_{t+1} - w^*\|^2 \le \Bigl(\alpha^M + \frac{\beta}{1-\alpha}\Bigr)\mathbb{E}\|w_t - w^*\|^2.$$
When $M > \log_{\alpha}\frac{1-\alpha-\beta}{1-\alpha}$, we have $\alpha^M + \frac{\beta}{1-\alpha} < 1$, which means we can get a linear convergence rate if we take $w_{t+1} = \frac{1}{p}\sum_{k=1}^{p} u_{k,M}$.

Theorem 2. If we take $w_{t+1} = \frac{1}{p}\sum_{k=1}^{p} \tilde{u}_k$ with $\tilde{u}_k = \frac{1}{M}\sum_{m=1}^{M} u_{k,m}$, then we can get the following convergence result:
$$\mathbb{E}\|w_{t+1} - w^*\|^2 \le \Bigl(\frac{1}{M(1-\alpha)} + \frac{\beta}{1-\alpha}\Bigr)\mathbb{E}\|w_t - w^*\|^2.$$
When $M > \frac{1}{1-\alpha-\beta}$, we have $\frac{1}{M(1-\alpha)} + \frac{\beta}{1-\alpha} < 1$, which means we can also get a linear convergence rate if we take $w_{t+1} = \frac{1}{p}\sum_{k=1}^{p} \tilde{u}_k$ with $\tilde{u}_k = \frac{1}{M}\sum_{m=1}^{M} u_{k,m}$.

According to Theorem 1 and Theorem 2, we can find that SCOPE gets a linear convergence rate when $M$ is larger than some threshold. To achieve an $\epsilon$-optimal solution, the computation complexity of each Worker is $O((\frac{n}{p} + M)\log\frac{1}{\epsilon})$. In our experiments, we find that good performance can be achieved with $M = \frac{n}{p}$. Hence, SCOPE is computation-efficient.
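To make these conditions concrete, the following short check uses purely illustrative constants (the values of $L$, $\mu$, $c$, and $\eta$ below are assumptions, not values from the paper) and verifies that $\alpha + \beta < 1$ and that both contraction factors drop below 1 once $M$ exceeds the stated thresholds.

```python
import math

# Illustrative constants only: L and mu are determined by the objective function
# and are not given here; c and eta are chosen to satisfy the stated conditions.
L, mu = 10.0, 1.0
c = L - mu + 1.0                                          # any c > L - mu (Lemma 1)
eta = 0.5 * min(2 * mu / (3 * L**2), 1.0 / (2 * mu + c))  # eta < min{2mu/(3L^2), 1/(2mu+c)}

alpha = 1 - eta * (2 * mu + c)
beta = c * eta + 3 * L**2 * eta**2
assert 0 < alpha < 1 and 0 < beta < 1 and alpha + beta < 1

# Thresholds on M from Theorem 1 and Theorem 2, respectively.
M1 = math.log((1 - alpha - beta) / (1 - alpha), alpha)    # M > log_alpha((1-alpha-beta)/(1-alpha))
M2 = 1.0 / (1 - alpha - beta)                             # M > 1/(1-alpha-beta)

M = int(max(M1, M2)) + 1
rho1 = alpha**M + beta / (1 - alpha)                      # per-iteration factor in Theorem 1
rho2 = 1.0 / (M * (1 - alpha)) + beta / (1 - alpha)       # per-iteration factor in Theorem 2
assert rho1 < 1 and rho2 < 1
print(f"alpha={alpha:.3f} beta={beta:.3f} M1={M1:.1f} M2={M2:.1f} rho1={rho1:.4f} rho2={rho2:.4f}")
```

In this regime the threshold from Theorem 2 is noticeably larger than the one from Theorem 1, reflecting the looser $\frac{1}{M(1-\alpha)}$ term in its bound.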
Impact of Parameter c

In Algorithm 2, we need the parameter $c$ to guarantee the convergence of SCOPE. Specifically, we need $c > L - \mu$ according to Lemma 1. Here, we discuss the necessity of $c$.

We first assume $c = 0$ and try to find out whether Algorithm 2 will still converge; that is, in the following derivation we always assume $c = 0$.

Let us define another local function
$$F_k^{(t)}(w) = F_k(w) + (z - \nabla F_k(w_t))^T (w - w^*)$$
and denote $w_{k,t}^* = \arg\min_{w} F_k^{(t)}(w)$.

Let $v_{k,m} = \nabla f_{i_{k,m}}(u_{k,m}) - \nabla f_{i_{k,m}}(w_t) + z + c(u_{k,m} - w_t)$. When $c = 0$, $v_{k,m} = \nabla f_{i_{k,m}}(u_{k,m}) - \nabla f_{i_{k,m}}(w_t) + z$. Then, we have $\mathbb{E}[v_{k,m} \mid u_{k,m}] = \nabla F_k^{(t)}(u_{k,m})$ and $\nabla F_k^{(t)}(w_t) = z$.
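These two identities follow directly from the definitions above, assuming (as in standard SVRG) that the index $i_{k,m}$ is drawn uniformly from $D_k$:
$$\mathbb{E}[v_{k,m} \mid u_{k,m}] = \frac{1}{q}\sum_{i \in D_k}\bigl[\nabla f_i(u_{k,m}) - \nabla f_i(w_t)\bigr] + z = \nabla F_k(u_{k,m}) - \nabla F_k(w_t) + z = \nabla F_k^{(t)}(u_{k,m}),$$
since differentiating the definition of $F_k^{(t)}$ gives $\nabla F_k^{(t)}(w) = \nabla F_k(w) + z - \nabla F_k(w_t)$; evaluating this at $w = w_t$ then gives $\nabla F_k^{(t)}(w_t) = z$.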
Hence, we can find that each local Worker actually tries to optimize the local function $F_k^{(t)}(w)$ with SVRG based on the local data $D_k$. It means that if we set