problems has a closed-form solution in ALS:

U_{*i} \leftarrow (\lambda_1 m_i I_k + V_{\Omega_i} V_{\Omega_i}^T)^{-1} V R_{i*}^T,        (2)
V_{*j} \leftarrow (\lambda_2 n_j I_k + U_{\tilde{\Omega}_j} U_{\tilde{\Omega}_j}^T)^{-1} U R_{*j},

where V_{\Omega_i} denotes the sub-matrix formed by the columns of V indexed by \Omega_i, U_{\tilde{\Omega}_j} is similarly defined, m_i = |\Omega_i| and n_j = |\tilde{\Omega}_j|. Please note that all the missing entries in R have been filled with zeros. The columns of both U and V can be updated independently by following (2). Hence, it is easy to design a parallel strategy for ALS, which has been implemented in [23].

Instead of optimizing the whole vector U_{*i} or V_{*j} at one time, CCD [13] adopts the coordinate descent method to optimize each element of U_{*i} or V_{*j} separately, which avoids the matrix inversion in (2). CCD++ [21] further improves CCD's performance by changing the updating sequence of CCD. It rewrites U^T V = \sum_{d=1}^{k} U_{d*}^T V_{d*}, and updates one element of U_{d*} or V_{d*} at a time by using a coordinate descent method similar to that in CCD. Changing the updating sequence may improve the convergence rate, which has been verified by the experimental results of CCD++ [21].

2.4 SGD-based Parallel MF Models

The idea of SGD is to randomly select one rating index (i, j) from \Omega each time, and then update the corresponding variables U_{*i} and V_{*j} as follows:

U_{*i} \leftarrow U_{*i} + \eta(\epsilon_{ij} V_{*j} - \lambda_1 U_{*i}),        (3)
V_{*j} \leftarrow V_{*j} + \eta(\epsilon_{ij} U_{*i} - \lambda_2 V_{*j}),

where \epsilon_{ij} = R_{ij} - U_{*i}^T V_{*j}, and \eta is the learning rate.

To meet the demands of many large-scale problems, several parallel SGD models have been proposed. Some of them, such as those described in [25] and [20], are not for MF problems. Here, we focus only on the parallel SGD models for MF, including Hogwild! [11], DSGD [4] and FPSGD [24].
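For concreteness, the serial update in (3) can be sketched as follows. This is a minimal, hypothetical illustration (the function name, parameter values and the (i, j, r) triple representation of the observed ratings are our own choices, not part of any of the cited systems); it also makes visible where the conflicts discussed next come from, since two workers whose sampled ratings share a user or item index would touch the same column of U or V.

```python
import numpy as np

def sgd_epoch(ratings, U, V, eta=0.01, lam1=0.05, lam2=0.05, rng=None):
    """One pass of the serial SGD update in (3) over the observed ratings.

    ratings : list of (i, j, r) triples, i.e. the index set Omega with rating values.
    U       : k x m user factor matrix (column U[:, i] is U_*i).
    V       : k x n item factor matrix (column V[:, j] is V_*j).
    eta, lam1, lam2 : learning rate and regularization weights (hypothetical values).
    """
    rng = np.random.default_rng() if rng is None else rng
    for idx in rng.permutation(len(ratings)):
        i, j, r = ratings[idx]
        err = r - U[:, i] @ V[:, j]          # epsilon_ij = R_ij - U_*i^T V_*j
        u_old = U[:, i].copy()               # use the pre-update U_*i in V's step
        U[:, i] += eta * (err * V[:, j] - lam1 * U[:, i])
        V[:, j] += eta * (err * u_old - lam2 * V[:, j])
    return U, V

# Toy usage: k = 8 latent factors, 100 users, 200 items, a few synthetic ratings.
rng = np.random.default_rng(0)
U = 0.1 * rng.standard_normal((8, 100))
V = 0.1 * rng.standard_normal((8, 200))
ratings = [(0, 5, 4.0), (3, 7, 2.0), (0, 7, 5.0)]
U, V = sgd_epoch(ratings, U, V, rng=rng)
```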
From (3), it is easy to find that conflicts exist between two processes or nodes when their randomly selected ratings share either the same user index or the same item index. Hogwild! [11] allows processes to overwrite each other's work when conflicts happen. It also shows that if the optimization problem is sparse enough, Hogwild! achieves a nearly optimal rate of convergence.

DSGD [4] divides the whole rating matrix into P strata, and each stratum contains P mutually independent sub-blocks that do not share any column or row indices. Consequently, the sub-blocks in the same stratum can be processed in parallel since they do not share any U_{*i} or V_{*j}. One iteration of DSGD is divided into P steps, in each of which DSGD picks a stratum containing P independent sub-blocks and then processes these sub-blocks in parallel on a cluster of P nodes, with each node responsible for one sub-block. After all the P sub-blocks of a step have been processed, the whole U and V have been updated separately and must be synchronized so that all nodes get the latest U and V. It is obvious that during one iteration over all the ratings in the whole matrix, P synchronization operations must be performed for DSGD. This frequent synchronization makes DSGD inefficient because the slowest node becomes the bottleneck of the whole system.

FPSGD [24], which is proposed for shared-memory systems, tries to improve the performance by changing the scheduler of DSGD into an asynchronous one. Its experiments show that FPSGD can achieve better efficiency and accuracy than DSGD.

Both Hogwild! and FPSGD are designed only for shared-memory systems with a single node, and thus their scalability is limited. DSGD can be used for distributed systems, but it spends too much time on synchronization.

2.5 ADMM

ADMM [3] is used to solve constrained problems of the following form:

\min_{x,z}  f(x) + g(z)        (4)
s.t.  Ax + Bz = c,

where f(\cdot) and g(\cdot) are functions, x and z are variables, and A, B and c are known values.

To solve the problem in (4), ADMM first forms the augmented Lagrangian:

L(x, z, y) = f(x) + g(z) + y^T(Ax + Bz - c) + \frac{\rho}{2}\|Ax + Bz - c\|^2,        (5)

where y is the Lagrangian multiplier and \rho is the penalty parameter. The ADMM solution can be obtained by repeating the following three steps:

x_{t+1} \leftarrow \arg\min_x L(x, z_t, y_t);
z_{t+1} \leftarrow \arg\min_z L(x_{t+1}, z, y_t);
y_{t+1} \leftarrow y_t + \rho(Ax_{t+1} + Bz_{t+1} - c),

where x_t denotes the value of x at the t-th iteration, and y_t and z_t are similarly defined. If f(x) or g(z) is separable, the corresponding ADMM step can be carried out in parallel. Hence, ADMM can be used to design distributed learning algorithms for large-scale problems [3].

In recent years, ADMM has attracted more and more attention, with wide applications such as matrix completion [5], compressive sensing [19], image restoration [6] and response prediction [1]. Moreover, many variants of ADMM have been devised, including stochastic and online extensions [15, 12, 17]. However, to the best of our knowledge, few works have applied stochastic ADMM to distributed MF problems.
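As a concrete illustration of the three-step iteration above, the following is a minimal sketch for a toy consensus problem with a quadratic f and an l1-penalty g, so that A = I, B = -I, c = 0 and both the x-step and the z-step have closed forms. The problem, parameter values and function names are illustrative assumptions and are not the MF formulation developed in the next section.

```python
import numpy as np

def soft_threshold(v, tau):
    """Elementwise soft-thresholding: the proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def admm_toy(a, lam=0.1, rho=1.0, n_iters=100):
    """Minimize 0.5*||x - a||^2 + lam*||z||_1 subject to x - z = 0
    by repeating the three ADMM steps above (here A = I, B = -I, c = 0)."""
    x = np.zeros_like(a, dtype=float)
    z = np.zeros_like(a, dtype=float)
    y = np.zeros_like(a, dtype=float)
    for _ in range(n_iters):
        x = (a - y + rho * z) / (1.0 + rho)         # x-step: argmin_x L(x, z_t, y_t)
        z = soft_threshold(x + y / rho, lam / rho)  # z-step: argmin_z L(x_{t+1}, z, y_t)
        y = y + rho * (x - z)                       # dual update y_{t+1}
    return x, z

# Example: the small second entry is driven to zero by the l1 penalty on z.
x, z = admm_toy(np.array([1.5, -0.05, 0.3]), lam=0.1, rho=1.0)
```

Here the z-step reduces to elementwise soft-thresholding, which is why the penalty parameter \rho appears in the threshold \lambda/\rho; in more general problems each step is an optimization sub-problem of its own.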
3. DISTRIBUTED STOCHASTIC ADMM FOR MATRIX FACTORIZATION

In this section, we present the details of our DS-ADMM model. We first introduce our data split strategy to divide the whole problem into several sub-problems. Then we propose a distributed ADMM framework to handle these sub-problems in parallel. After that, a stochastic learning algorithm is designed to speed up the distributed ADMM framework. Subsequently, we compare the scheduler of DS-ADMM with those of DSGD and CCD++. Finally, the complexity analysis of DS-ADMM will be provided.

3.1 Data Split Strategy

In our data split strategy, we divide R and U into P sub-blocks according to users. More specifically, each sub-block contains m/P rows of R and m/P columns of U. From (1), we find that U and V are coupled together in the loss function
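To make the split concrete, the following is a minimal sketch of the user-based partition described above. The function name, the (i, j, r) triple representation of the ratings and the contiguous assignment of users to blocks are illustrative assumptions; any disjoint partition of the m users gives the same kind of split. Node p would then own rating_blocks[p] together with the corresponding columns of U, while every block may still touch columns of V, reflecting the coupling between U and V noted above.

```python
import numpy as np

def split_by_users(ratings, m, P):
    """Partition the observed ratings and the user ids into P blocks by user,
    so that block p holds roughly m/P users (rows of R) and can own the
    matching m/P columns of U.

    ratings : iterable of (i, j, r) triples over the observed entries of R.
    Returns (user_blocks, rating_blocks): user_blocks[p] are the user ids of
    block p, and rating_blocks[p] the ratings whose user falls in that block.
    """
    user_blocks = np.array_split(np.arange(m), P)   # P disjoint user id ranges
    owner = np.empty(m, dtype=int)                  # owner[i] = block holding user i
    for p, ids in enumerate(user_blocks):
        owner[ids] = p
    rating_blocks = [[] for _ in range(P)]
    for i, j, r in ratings:                         # route each rating to its user's block
        rating_blocks[owner[i]].append((i, j, r))
    return user_blocks, rating_blocks
```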