Distributed Stochastic ADMM for Matrix Factorization

Zhi-Qin Yu
Shanghai Key Laboratory of Scalable Computing and Systems
Department of Computer Science and Engineering
Shanghai Jiao Tong University, China
yzqfqyd@sjtu.edu.cn

Xing-Jian Shi
Department of Computer Science and Engineering
Hong Kong University of Science and Technology, China
xshiab@connect.ust.hk

Ling Yan
Shanghai Key Laboratory of Scalable Computing and Systems
Department of Computer Science and Engineering
Shanghai Jiao Tong University, China
yling0718@sjtu.edu.cn

Wu-Jun Li
National Key Laboratory for Novel Software Technology
Department of Computer Science and Technology
Nanjing University, China
liwujun@nju.edu.cn

ABSTRACT

Matrix factorization (MF) has become the most popular technique for recommender systems due to its promising performance. Recently, distributed (parallel) MF models have received much attention from researchers in the big data community. In this paper, we propose a novel model, called distributed stochastic alternating direction methods of multipliers (DS-ADMM), for large-scale MF problems. DS-ADMM is a distributed stochastic variant of ADMM. In particular, we first devise a new data split strategy to make the distributed MF problem fit for the ADMM framework. Then, a stochastic ADMM scheme is designed for learning. Finally, we implement DS-ADMM based on the message passing interface (MPI), which can run on clusters with multiple machines (nodes). Experiments on several data sets from recommendation applications show that our DS-ADMM model can outperform other state-of-the-art distributed MF models in terms of both efficiency and accuracy.

Keywords

Matrix Factorization, Recommender Systems, ADMM, Distributed Computing, Stochastic Learning

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
CIKM'14, November 3-7, 2014, Shanghai, China.
Copyright 2014 ACM 978-1-4503-2598-1/14/11 ...$15.00.
http://dx.doi.org/10.1145/2661829.2661996

1. INTRODUCTION

Recommender systems try to recommend products (items) to customers (users) by utilizing the customers' historic preferences. Matrix factorization (MF) [8] and its extensions [9, 22, 16, 14, 10, 18] have become the most popular models for recommender systems due to their promising performance [8]. In this big data era, more and more large-scale data sets have emerged in many real-world recommender systems. Hence, parallel or distributed^1 MF models with the potential of high scalability have recently captured much attention from researchers.

The basic idea of MF is to use the multiplication of two latent matrices, the user matrix and the item matrix, to approximate the original rating matrix. The least square method is usually used to find a solution. In recent years, several parallel models have been proposed for MF. These existing models can be roughly divided into two main categories: alternating least square (ALS) [23] based models and stochastic gradient descent (SGD) based models.
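To make the basic idea of MF concrete, the following is a minimal NumPy sketch (an illustration, not code from the paper) of how a rating is approximated by the inner product of a user's and an item's latent feature vectors; the toy sizes are assumptions, and the matrix shapes follow the notation introduced later in Section 2.1.

```python
import numpy as np

# Hypothetical toy sizes; in the paper there are m users, n items, k latent factors.
m, n, k = 4, 5, 3
rng = np.random.default_rng(0)

U = rng.normal(scale=0.1, size=(k, m))  # users' latent factors, one column per user
V = rng.normal(scale=0.1, size=(k, n))  # items' latent factors, one column per item

# MF approximates the rating of user i on item j by an inner product:
i, j = 2, 4
predicted_rating = U[:, i] @ V[:, j]    # R_ij ~ U_{*i}^T V_{*j}

# The full approximation of the rating matrix is U^T V (an m x n matrix).
R_hat = U.T @ V
```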
ALS [23] adopts the alternating learning strategy to update one matrix with the other one fixed. With one of the matrices fixed, the optimization problem of MF can be reduced to a least square problem on the other matrix, which can be further decomposed into several independent least square problems on the latent feature vector of each user or item. Hence, it is easy to design parallel strategies for ALS, which has been implemented in [23]. However, the time complexity of each iteration in ALS is cubic in k, where k is the number of latent features for each user or item.

The cyclic coordinate descent (CCD) method [13] adopts a coordinate descent strategy to improve the ALS method by decreasing the time complexity of each iteration to be linear in k. CCD++ [21] further improves the efficiency of CCD by using a similar coordinate descent strategy but a different updating sequence of the variables. Because both CCD and CCD++ are based on ALS, they can also be easily parallelized [21].

^1 In the existing literature, distributed models refer to those implemented on clusters with multiple machines (nodes), while parallel models refer to those implemented either on multicore systems with a single node or on clusters. We will also follow this tradition in this paper. The specific meaning of parallel can be determined from the context in the paper.
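To illustrate why each ALS iteration is cubic in k, here is a minimal NumPy sketch (an assumption-laden illustration, not the paper's code) of the per-user least-squares solve described above; it matches the closed-form update that appears later as Eq. (2), where a k x k linear system is solved for every user.

```python
import numpy as np

def als_update_user(i, R, V, lam1):
    """One ALS update for user i's latent vector (cf. Eq. (2) in the paper).

    R: m x n rating matrix with missing entries filled by 0.
    V: k x n item factors, lam1: regularization hyper-parameter.
    """
    omega_i = np.nonzero(R[i, :])[0]          # items rated by user i (Omega_i)
    V_omega = V[:, omega_i]                   # k x |Omega_i| sub-matrix
    k = V.shape[0]
    A = lam1 * len(omega_i) * np.eye(k) + V_omega @ V_omega.T   # k x k system matrix
    b = V_omega @ R[i, omega_i]               # k-dimensional right-hand side
    # Solving this k x k linear system costs O(k^3), hence ALS is cubic in k.
    return np.linalg.solve(A, b)
```

Looping this solve over all users (and the analogous solve over all items) gives one ALS iteration; because the per-user problems are independent, they can be processed in parallel, which is exactly what makes ALS easy to parallelize.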
Due to its efficiency and ease of implementation, SGD has become one of the most popular optimization strategies for MF in recommender systems [8]. The basic idea of SGD is to randomly select one rating at a time from the rating matrix and then use the gradient based on this selected rating to update the latent features. It is easy to see that SGD is essentially a sequential method, which makes it difficult to parallelize.

The main reason that SGD cannot be directly parallelized is that two randomly selected ratings may share the same latent features corresponding to the same user or item. Hence, there exist conflicts between two processes or nodes which simultaneously update the same latent features. Recently, several strategies have been proposed to parallelize SGD for MF. The Hogwild! [11] model simply ignores the conflicts by assuming that the probability of conflict is small when two ratings are randomly selected from a sparse rating matrix. However, the conflicts do exist in the learning procedure, which makes Hogwild! not effective enough [21, 24]. Moreover, Hogwild! requires all the processes to share the whole training set, which is hard to satisfy in distributed systems. Hence, Hogwild! cannot be directly used in distributed systems.

Distributed SGD (DSGD) [4] utilizes the property that there exist several sub-blocks without overlapping rows and columns in the rating matrix. These sub-blocks are mutually independent of each other, and thus can be processed in parallel by different processes or nodes at the same time. Experiments in [21, 24] have shown that DSGD can outperform Hogwild! in terms of both efficiency and accuracy. However, after a set of independent sub-blocks has been processed, the updated variables from all processes or nodes should be synchronized before processing the other sets of independent sub-blocks. It is these frequent synchronization operations that make DSGD inefficient, because the slowest node will become the bottleneck of the whole system. Things get even worse if data skew exists, which is not rare in real applications. Very recently, fast parallel SGD (FPSGD) [24] tries to solve the issues in DSGD by changing the scheduler into an asynchronous one, which has achieved better performance than DSGD. However, FPSGD can only be used in shared-memory systems with a single node. Hence, FPSGD is still not scalable enough to handle large-scale problems.

In this paper, a novel model, called distributed stochastic alternating direction methods of multipliers (DS-ADMM), is proposed for large-scale MF problems. DS-ADMM is a distributed stochastic variant of ADMM [3]. The main contributions of DS-ADMM are briefly outlined as follows:

• In DS-ADMM, a new data split (partition) strategy called LocalMFSplit is proposed to assign subsets of the whole set of ratings to different nodes in a cluster and consequently divide the large-scale problem into several sub-problems. Our split strategy can make the distributed MF problem fit for the ADMM framework. Furthermore, compared with the existing split strategies in DSGD and CCD++, our split strategy can reduce synchronization and scheduling cost to improve efficiency.

• A stochastic ADMM method is designed to perform efficient learning for the parameters.

• DS-ADMM is implemented with the message passing interface (MPI), which can run on clusters with multiple machines (nodes). Hence, DS-ADMM is scalable to handle large-scale data sets.
• Experiments on several data sets from recommendation applications show that not only can DS-ADMM outperform other SGD-based models, but it can also outperform ALS-based models like CCD++ in terms of both efficiency and accuracy.

2. BACKGROUND

In this section, we introduce the background of this paper, including notations, MF formulation, ALS-based models, SGD-based models and ADMM.

2.1 Notations

We use boldface uppercase letters like M to denote matrices and boldface lowercase letters like m to denote vectors. M_{i*} and M_{*j} denote the ith row and the jth column of M, respectively. M_{ij} denotes the element at the ith row and jth column in M. M^T denotes the transpose of M, and M^{-1} denotes the inverse of M. tr(·) denotes the trace of a matrix. I_k is an identity matrix of size k × k. Assume there are m users and n items in the data set. We use R ∈ R^{m×n} to denote the rating matrix. Please note that there exist many missing entries in R. All the missing entries are filled with 0. We use Ω ⊂ {1, 2, ..., m} × {1, 2, ..., n} to denote the set of indices for the observed ratings. Ω_i denotes the column indices of the observed ratings in the ith row of R, and Ω̃_j denotes the row indices of the observed ratings in the jth column of R. U ∈ R^{k×m} denotes the users' latent factors (matrix) with each column U_{*i} representing the latent feature vector for user i, where k is the number of latent factors for each user or item. V ∈ R^{k×n} denotes the items' latent factors (matrix) with each column V_{*j} representing the latent feature vector for item j. P denotes the total number of nodes in the cluster, and we use the letter p on the superscript, as in M^p, to denote the computer node id. ‖·‖_F denotes the Frobenius norm of a matrix or a vector.

2.2 Matrix Factorization

Matrix factorization (MF) can be formulated as the following optimization problem:

  \min_{U,V} \frac{1}{2} \sum_{(i,j)\in\Omega} \left[ (R_{i,j} - U_{*i}^T V_{*j})^2 + \lambda_1 U_{*i}^T U_{*i} + \lambda_2 V_{*j}^T V_{*j} \right],    (1)

where λ_1 and λ_2 are hyper-parameters for regularization.

There are two categories of parallel models to solve the above MF problem, i.e., the ALS-based models and SGD-based models, which will be briefly reviewed in the following subsections.
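As a concrete companion to Eq. (1), the following NumPy sketch (an illustration under the stated notation, not code from the paper) evaluates the regularized squared loss over the observed ratings.

```python
import numpy as np

def mf_objective(R, U, V, lam1, lam2):
    """Evaluate the MF objective of Eq. (1) on the observed entries of R.

    R: m x n ratings with missing entries stored as 0 (as in Section 2.1).
    U: k x m user factors, V: k x n item factors.
    """
    rows, cols = np.nonzero(R)                 # the observed index set Omega
    loss = 0.0
    for i, j in zip(rows, cols):
        err = R[i, j] - U[:, i] @ V[:, j]
        loss += 0.5 * (err ** 2
                       + lam1 * U[:, i] @ U[:, i]
                       + lam2 * V[:, j] @ V[:, j])
    return loss
```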
2.3 ALS-based Parallel MF Models

By adopting the alternating learning strategy, ALS [23] alternately switches between updating U and updating V with the other latent matrix fixed. With U fixed, the MF problem can be decomposed into n independent least square problems, each of which corresponds to a column of the matrix V. Similarly, m independent least square problems can be obtained by fixing V. Furthermore, each of these independent problems has a closed-form solution in ALS:

  U_{*i} \leftarrow (\lambda_1 m_i I_k + V_{\Omega_i} V_{\Omega_i}^T)^{-1} V R_{i*}^T,    (2)
  V_{*j} \leftarrow (\lambda_2 n_j I_k + U_{\tilde{\Omega}_j} U_{\tilde{\Omega}_j}^T)^{-1} U R_{*j},

where V_{Ω_i} denotes the sub-matrix formed by the columns of V indexed by Ω_i, U_{Ω̃_j} is similarly defined, m_i = |Ω_i| and n_j = |Ω̃_j|. Please note that all the missing entries in R have been filled with zeros. The columns of both U and V can be independently updated by following (2). Hence, it is easy to design a parallel strategy for ALS, which has been implemented in [23].

Instead of optimizing the whole vector U_{*i} or V_{*j} at one time, CCD [13] adopts the coordinate descent method to optimize each element of U_{*i} or V_{*j} separately, which can avoid the matrix inverse operation in (2). CCD++ [21] further improves CCD's performance by changing the updating sequence of CCD. It rewrites U^T V = \sum_{d=1}^{k} U_{d*}^T V_{d*}, and updates one element of U_{d*} or V_{d*} at a time by using a coordinate descent method similar to that of CCD. Changing the updating sequence may improve the convergence rate, which has been verified by the experimental results of CCD++ [21].

2.4 SGD-based Parallel MF Models

The idea of SGD is to randomly select one rating index (i, j) from Ω each time, and then update the corresponding variables U_{*i} and V_{*j} as follows:

  U_{*i} \leftarrow U_{*i} + \eta(\epsilon_{ij} V_{*j} - \lambda_1 U_{*i}),    (3)
  V_{*j} \leftarrow V_{*j} + \eta(\epsilon_{ij} U_{*i} - \lambda_2 V_{*j}),

where \epsilon_{ij} = R_{ij} - U_{*i}^T V_{*j}, and η is the learning rate.

Driven by the demands of many large-scale problems, several parallel SGD models have been proposed. Some of them, such as those described in [25] and [20], are not for MF problems. Here, we just focus on the parallel SGD models for MF, including Hogwild! [11], DSGD [4] and FPSGD [24]. From (3), it is easy to find that conflicts exist between two processes or nodes when their randomly selected ratings share either the same user index or the same item index. Hogwild! [11] allows processes to overwrite each other's work when conflicts happen. It also shows that if the optimization problem is sparse enough, Hogwild! will achieve a nearly optimal rate of convergence.

DSGD [4] divides the whole rating matrix into P strata, and each stratum contains P mutually independent sub-blocks that do not share any column or row indices. Consequently, sub-blocks in the same stratum can be processed in parallel since they don't share any U_{*i} or V_{*j}. One iteration of DSGD is divided into P steps, in each of which DSGD picks a stratum containing P independent sub-blocks and then processes these sub-blocks in parallel in a cluster of P nodes, with each node responsible for one sub-block. After all the P sub-blocks in each step are processed, the whole U and V have been updated separately. They should be synchronized in order to let all nodes get the latest U and V. It is obvious that during one iteration of processing all the ratings in the whole matrix, P synchronization operations should be performed for DSGD. This frequent synchronization will make DSGD inefficient because the slowest node will become the bottleneck of the whole system.

FPSGD [24], which is proposed for shared-memory systems, tries to improve the performance by changing the scheduler of DSGD into an asynchronous one. Its experiments show that FPSGD can achieve better efficiency and accuracy than DSGD.

Both Hogwild! and FPSGD are only for shared-memory systems with one single node, and thus their scalability is limited. DSGD can be used for distributed systems, but it costs too much on synchronization.
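For concreteness, a minimal NumPy sketch of the SGD update in Eq. (3) is given below; it is illustrative only, and the random visiting order and data layout are assumptions rather than details taken from the paper.

```python
import numpy as np

def sgd_epoch(R, U, V, eta, lam1, lam2, rng):
    """One SGD pass over the observed ratings, following Eq. (3)."""
    rows, cols = np.nonzero(R)                 # observed index set Omega
    for idx in rng.permutation(len(rows)):     # visit ratings in random order
        i, j = rows[idx], cols[idx]
        u_old = U[:, i].copy()                 # both updates use the old U_{*i}
        eps = R[i, j] - u_old @ V[:, j]        # prediction error eps_ij
        U[:, i] = u_old + eta * (eps * V[:, j] - lam1 * u_old)
        V[:, j] += eta * (eps * u_old - lam2 * V[:, j])
    return U, V
```

Because two ratings that share a user or an item touch the same columns of U or V, running this loop concurrently without coordination causes exactly the conflicts discussed above.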
2.5 ADMM

ADMM [3] is used to solve constrained problems of the following form:

  \min_{x,z} f(x) + g(z)    (4)
  s.t. \; Ax + Bz = c,

where f(·) and g(·) are functions, x and z are variables, and A, B and c are known values. To solve the problem in (4), ADMM first forms the augmented Lagrangian:

  L(x, z, y) = f(x) + g(z) + y^T (Ax + Bz - c) + \frac{\rho}{2} \|Ax + Bz - c\|^2,    (5)

where y is the Lagrangian multiplier and ρ is the penalty parameter. The ADMM solution can be obtained by repeating the following three steps:

  x_{t+1} \leftarrow \arg\min_{x} L(x, z_t, y_t);
  z_{t+1} \leftarrow \arg\min_{z} L(x_{t+1}, z, y_t);
  y_{t+1} \leftarrow y_t + \rho(Ax_{t+1} + Bz_{t+1} - c),

where x_t denotes the value of x at the tth iteration, and y_t and z_t are similarly defined. If f(x) or g(z) is separable, the corresponding steps of ADMM can be done in parallel. Hence, ADMM can be used to design distributed learning algorithms for large-scale problems [3].

In recent years, ADMM has captured more and more attention with wide applications, such as matrix completion [5], compressive sensing [19], image restoration [6] and response prediction [1]. Moreover, many variants of ADMM have also been devised, including stochastic and online extensions [15, 12, 17]. However, to the best of our knowledge, few works have been proposed to use stochastic ADMM for distributed MF problems.
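As an illustration of the three ADMM steps above, here is a small NumPy sketch for a toy quadratic instance of problem (4); the choice of f, g, A, B and c is hypothetical and only meant to show the update structure.

```python
import numpy as np

# Toy instance of (4): f(x) = 0.5*||x - a||^2, g(z) = 0.5*||z - b||^2,
# with the consensus constraint x - z = 0 (i.e., A = I, B = -I, c = 0).
a = np.array([1.0, 2.0, 3.0])
b = np.array([3.0, 0.0, 1.0])
rho = 1.0

x = np.zeros_like(a)
z = np.zeros_like(b)
y = np.zeros_like(a)                      # Lagrangian multiplier

for t in range(100):
    # Step 1: x-update, the closed-form minimizer of L(x, z_t, y_t)
    x = (a - y + rho * z) / (1.0 + rho)
    # Step 2: z-update, the closed-form minimizer of L(x_{t+1}, z, y_t)
    z = (b + y + rho * x) / (1.0 + rho)
    # Step 3: dual update with step size rho
    y = y + rho * (x - z)

# x and z converge to the consensus solution (a + b) / 2.
```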
3. DISTRIBUTED STOCHASTIC ADMM FOR MATRIX FACTORIZATION

In this section, we present the details of our DS-ADMM model. We first introduce our data split strategy to divide the whole problem into several sub-problems. Then we propose a distributed ADMM framework to handle these sub-problems in parallel. After that, a stochastic learning algorithm is designed to speed up the distributed ADMM framework. Subsequently, we compare the scheduler of DS-ADMM with those of DSGD and CCD++. Finally, the complexity analysis of DS-ADMM will be provided.

3.1 Data Split Strategy

In our data split strategy, we divide R and U into P sub-blocks according to users. More specifically, each sub-block will contain m/P rows of R and m/P columns of U. From (1), we find that U and V are coupled together in the loss function. Updating one of them needs the other's latest value, which makes the problem hardly separable. To decouple U and V, we keep a local item latent matrix for all items in each node, which is denoted as V^p. Please note that V^p is not a sub-block of V, but has the same size as V. We also have a global item latent matrix, which is denoted as V. Because only the local V^p couples with U, we can independently update U and V^p on each node. This split strategy can make the MF problem fit for the distributed ADMM framework, which will be introduced in the following subsection.

Our split strategy is called LocalMFSplit, which is briefly summarized in Algorithm 1. Note that the size of V^p is k × n, but that of U^p is k × m_p, with m_p being the number of columns (about m/P) assigned to node p.

Algorithm 1 LocalMFSplit
1: Input: R, P
2: for i = 1 : m do
3:   Generate a random number p from {1, 2, ..., P}, and distribute row i of R to node p.
4: end for
5: for p = 1 : P parallel do
6:   Allocate memory for U^p, V^p and V
7: end for

3.2 Distributed ADMM

Based on our split strategy LocalMFSplit, the MF problem in (1) can be reformulated as follows:

  \min_{U, \mathcal{V}, V} \frac{1}{2} \sum_{p=1}^{P} \sum_{(i,j)\in\Omega^p} \left[ (R_{i,j} - U_{*i}^T V_{*j}^p)^2 + \lambda_1 U_{*i}^T U_{*i} + \lambda_2 [V_{*j}^p]^T V_{*j}^p \right]    (6)
  s.t. \; V^p - V = 0, \;\; \forall p \in \{1, 2, ..., P\},

where \mathcal{V} = \{V^p\}_{p=1}^{P} and Ω^p denotes the (i, j) indices of the ratings located in node p. Note that here we omit the p in U^p for simplicity. It is not hard to determine from the context whether U refers to the whole latent matrix or to the sub-block U^p located in node p.

If we define

  f(U, \mathcal{V}) = \sum_{p=1}^{P} f^p(U, V^p),    (7)

where

  f^p(U, V^p) = \sum_{(i,j)\in\Omega^p} \hat{f}_{i,j}(U_{*i}, V_{*j}^p),    (8)
  \hat{f}_{i,j}(U_{*i}, V_{*j}^p) = \frac{1}{2}\left[ (R_{i,j} - U_{*i}^T V_{*j}^p)^2 + \lambda_1 U_{*i}^T U_{*i} + \lambda_2 [V_{*j}^p]^T V_{*j}^p \right],

we can transform the constrained problem in (6) into an unconstrained problem with the augmented Lagrangian method, and get the following objective function:

  L(U, \mathcal{V}, \mathcal{O}, V) = f(U, \mathcal{V}) + l(\mathcal{V}, V, \mathcal{O}),    (9)

where

  l(\mathcal{V}, V, \mathcal{O}) = \sum_{p=1}^{P} l^p(V^p, V, \Theta^p),
  l^p(V^p, V, \Theta^p) = \frac{\rho}{2} \|V^p - V\|_F^2 + tr\left([\Theta^p]^T (V^p - V)\right).

Here, ρ is a hyper-parameter and \mathcal{O} = \{\Theta^p\}_{p=1}^{P} denotes the Lagrangian multipliers.

If we define

  L^p(U, V^p, \Theta^p, V) = f^p(U, V^p) + l^p(V^p, V, \Theta^p)
    = \sum_{(i,j)\in\Omega^p} \hat{f}_{i,j}(U_{*i}, V_{*j}^p) + \frac{\rho}{2}\|V^p - V\|_F^2 + tr\left([\Theta^p]^T (V^p - V)\right),

we can get

  L(U, \mathcal{V}, \mathcal{O}, V) = \sum_{p=1}^{P} L^p(U, V^p, \Theta^p, V).

ADMM will solve this problem by repeating the following steps:

  U_{t+1}, V_{t+1}^p \leftarrow \arg\min_{U, V^p} L^p(U, V^p, \Theta_t^p, V_t), \quad \forall p \in \{1, 2, ..., P\},    (10a)
  V_{t+1} \leftarrow \arg\min_{V} L(U_{t+1}, \mathcal{V}_{t+1}, \mathcal{O}_t, V),    (10b)
  \Theta_{t+1}^p \leftarrow \Theta_t^p + \rho(V_{t+1}^p - V_{t+1}), \quad \forall p \in \{1, 2, ..., P\}.    (10c)

It is easy to see that U, V^p and Θ^p can be locally updated on each node. Because the whole MF problem has been divided into P sub-problems which can be solved in parallel, our method is actually a distributed ADMM framework.

3.3 Stochastic Learning for Distributed ADMM

To learn the parameters in (6), we just need to find the solutions of (10a), (10b) and (10c). After getting the optimal U_{t+1} and \{V_{t+1}^p\}, it is easy to solve the problem in (10b). More specifically, if we set \Theta_0^p = 0, we can prove that \sum_{p=1}^{P} \Theta_t^p = 0. Hence, the update rule for V is:

  V_{t+1} = \frac{1}{P} \sum_{p=1}^{P} V_{t+1}^p.    (11)

The problem in (10c) directly gives the update rule, which can be computed locally and efficiently. Therefore, the key learning part lies in how to efficiently solve the problem in (10a).
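To make the two global steps (11) and (10c) concrete, here is a small NumPy sketch (an illustration, not the paper's implementation) that averages the local item matrices and updates the multipliers; in the distributed setting the average would be realized with an MPI reduction, as sketched later.

```python
import numpy as np

def global_admm_steps(V_locals, Thetas, rho):
    """Steps (11) and (10c): average the local item matrices and update the multipliers.

    V_locals: list of P local item matrices V^p, each of shape k x n.
    Thetas:   list of P multiplier matrices Theta^p, same shape.
    """
    V_global = sum(V_locals) / len(V_locals)          # Eq. (11)
    Thetas = [Theta + rho * (Vp - V_global)           # Eq. (10c)
              for Theta, Vp in zip(Thetas, V_locals)]
    return V_global, Thetas
```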
In the remainder of this subsection, we first design a batch learning algorithm for the problem in (10a), and then a stochastic learning algorithm inspired by the batch algorithm is designed to further improve the efficiency.

3.3.1 Batch Learning

With \Theta_t^p and V_t fixed, (10a) is an MF problem. However, we cannot easily get the solution because U and V^p are coupled together and the objective function of the MF problem is non-convex. To get an efficient solution, we use a technique similar to that in [12] to construct a surrogate
objective function, which is convex and decouples U and V^p from each other. For each iteration of minimizing the constructed function, we can easily get the closed-form solutions of U and V^p by setting their gradients to zero. The surrogate objective function is defined as follows:

  G^p(U, V^p, \Theta_t^p, V_t, \tau_t \,|\, U_t, V_t^p) = g^p(U, V^p, \tau_t \,|\, U_t, V_t^p) + l^p(V^p, V_t, \Theta_t^p),    (12)

where

  g^p(U, V^p, \tau_t \,|\, U_t, V_t^p) = f^p(U_t, V_t^p) + tr[\nabla_U^T f^p(U_t, V_t^p)(U - U_t)] + tr[\nabla_{V^p}^T f^p(U_t, V_t^p)(V^p - V_t^p)] + \frac{1}{2\tau_t}\left(\|U - U_t\|_F^2 + \|V^p - V_t^p\|_F^2\right),    (13)

with τ_t being a value which will be useful for specifying the step-size in the stochastic learning method introduced later, and the function f^p(U, V^p) being defined in (8).

Lemma 1. For an arbitrary positive value δ², we can always find a τ_t that makes G^p(·) satisfy the following two properties within the domain D = \{U, V^p \,|\, \|U_{*i} - [U_{*i}]_t\|_F^2 \le \delta^2, \; \|V_{*j}^p - [V_{*j}^p]_t\|_F^2 \le \delta^2\}:

  G^p(U, V^p, \Theta_t^p, V_t, \tau_t \,|\, U_t, V_t^p) \ge L^p(U, V^p, \Theta_t^p, V_t),
  G^p(U_t, V_t^p, \Theta_t^p, V_t, \tau_t \,|\, U_t, V_t^p) = L^p(U_t, V_t^p, \Theta_t^p, V_t).

The proof of Lemma 1 can be found in Appendix A.

From Lemma 1, we can find that G^p(·) is an upper bound of L^p(·), and G^p(·) = L^p(·) at the point (U_t, V_t^p). Compared with L^p(·), U and V^p are decoupled in G^p(·), and G^p(·) is convex in (U, V^p). Hence, it is much easier to optimize G^p(·) than L^p(·). Instead of optimizing the original function L^p(·), we optimize the surrogate function G^p(·) in the first step of the ADMM:

  U_{t+1}, V_{t+1}^p \leftarrow \arg\min_{U, V^p} G^p(U, V^p, \Theta_t^p, V_t, \tau_t \,|\, U_t, V_t^p).    (14a)

The objective function in (14a) is convex in both U and V^p. Hence, we can easily get the solution by setting the gradients to zero. The optimal solution is computed as follows:

  U_{t+1} = U_t - \tau_t \nabla_U f^p(U_t, V_t^p),    (15)
  V_{t+1}^p = \frac{\tau_t}{1 + \rho\tau_t}\left[ \frac{V_t^p}{\tau_t} + \rho V_t - \Theta_t^p - \nabla_{V^p} f^p(U_t, V_t^p) \right].    (16)

Lemma 2. By following the update rules in (15) and (16), the original objective function L^p(·) will not increase in each step. That is to say,

  L^p(U_{t+1}, V_{t+1}^p, \Theta_t^p, V_t) \le L^p(U_t, V_t^p, \Theta_t^p, V_t).    (17)

The proof of Lemma 2 can be found in Appendix B.

By combining the update rules in (15), (16), (11) and (10c), we can get a batch learning algorithm for the problem in (6) within the distributed ADMM framework.

Theorem 1. Our batch learning algorithm will converge.

Proof. Based on Lemma 2, we can prove that the objective function L(·) in (9) will decrease in each iteration of ADMM. Furthermore, L(·) is lower bounded by -\sum_{p=1}^{P} \|\Theta^p\|_F^2 / (2\rho). Hence, our batch learning algorithm will converge. Because L(·) is not convex, it might converge to a local minimum.

3.3.2 Stochastic Learning

From (15), we can find that all ratings related to U_{*i} must be accessed to update each U_{*i}, and the same also goes for updating each V_{*j}^p in (16). Hence, the batch learning algorithm presented above is not efficient, especially when the number of ratings becomes very large. To further improve the efficiency, we propose a stochastic learning strategy for the distributed ADMM, which is called DS-ADMM. In particular, the update rules for DS-ADMM are as follows:

  (U_{*i})_{t+1} = (U_{*i})_t + \tau_t\left(\epsilon_{ij} (V_{*j}^p)_t - \lambda_1 (U_{*i})_t\right),    (18)
  (V_{*j}^p)_{t+1} = \frac{\tau_t}{1 + \rho\tau_t}\left[ \frac{1 - \lambda_2\tau_t}{\tau_t} (V_{*j}^p)_t + \epsilon_{ij} (U_{*i})_t + \rho (V_{*j})_t - (\Theta_{*j}^p)_t \right],    (19)

where \epsilon_{ij} = R_{ij} - [(U_{*i})_t]^T (V_{*j}^p)_t. It is easy to see that the stochastic learning algorithm is derived from the batch learning algorithm by treating only U_{*i} and V_{*j}^p as variables in (14a).
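A minimal sketch of the per-rating updates (18)-(19) is given below; it is illustrative only, and the variable names and the way the local rating list is stored are assumptions rather than the paper's implementation.

```python
import numpy as np

def ds_admm_local_pass(ratings, U, Vp, V, Theta, tau, rho, lam1, lam2, rng):
    """One pass over a node's local ratings using the stochastic updates (18)-(19).

    ratings: list of (i, j, r) triples stored on this node (the set Omega^p).
    U: k x m_p local user factors (columns for the users assigned to this node).
    Vp: k x n local item factors, V: k x n global item factors,
    Theta: k x n Lagrangian multipliers of this node.
    """
    for idx in rng.permutation(len(ratings)):
        i, j, r = ratings[idx]
        u, v = U[:, i].copy(), Vp[:, j].copy()
        eps = r - u @ v                                     # eps_ij
        U[:, i] = u + tau * (eps * v - lam1 * u)            # Eq. (18)
        Vp[:, j] = (tau / (1.0 + rho * tau)) * (            # Eq. (19)
            ((1.0 - lam2 * tau) / tau) * v
            + eps * u + rho * V[:, j] - Theta[:, j]
        )
    return U, Vp
```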
By combining the split strategy and the update rules stated above, we can get our DS-ADMM algorithm. The whole procedure of DS-ADMM is briefly listed in Algorithm 2.

Algorithm 2 DS-ADMM
1: Input: R, P, ρ, MaxIter, λ_1, λ_2, τ_0
2: Use Algorithm 1 to distribute R to P different nodes.
3: Randomly initialize U_0, V_0^p
4: Calculate V_0 by (11)
5: Set Θ_0^p = 0
6: for t = 1 : MaxIter do
7:   for p = 1 : P parallel do
8:     for each R_{i,j} in node p do
9:       Update U_{*i} and V_{*j}^p by (18) and (19)
10:    end for
11:  end for
12:  Update V by (11)
13:  for p = 1 : P parallel do
14:    Update Θ^p by (10c)
15:  end for
16:  Update τ_t
17: end for

3.4 Scheduler Comparison

CCD++ and DSGD are two state-of-the-art distributed MF models. We compare the scheduler of our DS-ADMM with those of CCD++ and DSGD to illustrate the synchronization cost.

Figure 1 (a), (b) and (c) show the number of synchronization operations in one iteration of CCD++, DSGD and DS-ADMM, respectively. Here, one iteration means that all the training ratings are processed once. We can find that CCD++ needs 2k synchronizations per iteration and DSGD needs P. From Algorithm 2, we can easily find that DS-ADMM needs only one synchronization for each iteration, which is shown in line 12. This synchronization step is used to gather all V^p. Hence, it is obvious that the synchronization cost of our DS-ADMM is much less than those of CCD++ and DSGD.

[Figure 1: Operations in one iteration of CCD++, DSGD, and DS-ADMM. It shows that CCD++ and DSGD need multiple synchronizations while DS-ADMM needs only one synchronization for each iteration.]
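The following is a compact sketch of one DS-ADMM iteration from the perspective of a single node, assuming an mpi4py environment; it reuses the ds_admm_local_pass function from the earlier sketch, and all names are illustrative rather than the paper's actual code.

```python
import numpy as np
from mpi4py import MPI

def ds_admm_iteration(comm, local_ratings, U, Vp, V, Theta,
                      tau, rho, lam1, lam2, rng):
    """One iteration of Algorithm 2 as executed on node p."""
    # Lines 7-11: stochastic pass over the local ratings, Eqs. (18)-(19).
    U, Vp = ds_admm_local_pass(local_ratings, U, Vp, V, Theta,
                               tau, rho, lam1, lam2, rng)

    # Line 12: the single synchronization per iteration, Eq. (11):
    # the global V is the average of the P local item matrices.
    V_new = np.empty_like(Vp)
    comm.Allreduce(Vp, V_new, op=MPI.SUM)
    V_new /= comm.Get_size()

    # Lines 13-15: local multiplier update, Eq. (10c).
    Theta = Theta + rho * (Vp - V_new)
    return U, Vp, V_new, Theta
```

The single Allreduce here corresponds to the one synchronization per iteration discussed in Section 3.4.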
3.5 Complexity Analysis

DS-ADMM updates all variables once in three steps. Step one updates U and V^p. For each rating R_{ij}, the time complexity to update U_{*i} and V_{*j}^p is O(k²). Because the total number of observed ratings is |Ω|, the time complexity of step one is O(k²|Ω|). Step two is actually a summation of P matrices of size k × n, so its time complexity is O(Pnk). Step three needs to update P matrices of size k × n, and the update operation only involves a constant number of additions, so its time complexity is O(Pnk). In total, the time complexity of DS-ADMM for each iteration is O(k²|Ω| + Pnk).

4. EXPERIMENTS

All the experiments are run on an MPI-cluster with twenty nodes, each of which is a 24-core server with a 2.2GHz Intel(R) Xeon(R) E5-2430 processor and 96GB of RAM. To evaluate the scale-out performance of our model, we use only one core (thread) and 10GB of memory for each node.

4.1 Data Sets

We run our experiments on three public collaborative filtering data sets: Netflix^2, Yahoo! Music R1, and Yahoo! Music R2^3. The Netflix data set contains users' ratings of movies. The Yahoo! Music R1 data set contains users' ratings of artists. Yahoo! Music R2 contains users' ratings of songs.

As the original ratings of Yahoo! Music R1 range from 0 to 100 and include the value Never play again, we treat the ratings with value 0 and Never play again as explicit negative feedback and filter them out. For the other ratings, we normalize them by multiplying each rating by 0.05. After preprocessing, all the ratings lie in the range of [0.05, 5].

The Netflix and Yahoo! Music R2 data sets also contain public test data sets, which are used in our experiments. For the Yahoo! Music R1 data set, we randomly select 10% of the ratings for testing and the remaining ratings are used for training. Detailed information on these data sets is shown in the first four rows of Table 1.

^2 http://www.netflixprize.com/
^3 http://webscope.sandbox.yahoo.com/catalog.php?datatype=r

Table 1: Data sets and parameter settings

  Data Set    Netflix       Yahoo! Music R1   Yahoo! Music R2
  m           480,190       1,938,361         1,823,179
  n           17,770        49,995            136,736
  #Train      99,072,112    73,578,902        699,640,226
  #Test       1,408,395     7,534,592         18,231,790
  k           40            40                40
  η_0/τ_0     0.1           0.1               0.1
  λ_1         0.05          0.05              0.05
  λ_2         0.05          0.05              0.05
  ρ           0.05          0.05              0.1
  α           0.002         0.002             0.006
  β           0.7           0.7               0.7
  P           8             10                20

4.2 Baselines and Parameter Settings

FPSGD and Hogwild! can only run on multi-core systems, while we focus on distributed algorithms in this paper. So CCD++, DSGD and DSGD-Bias are adopted as our baselines.

CCD++ is implemented by referring to the public OpenMP version^4. DSGD is implemented according to [4] with an asynchronous scheduler to improve its performance.

^4 http://www.cs.utexas.edu/~rofuyu/libpmf/

We also implement a variant of DSGD called DSGD-Bias by using the prediction function R_{i,j} = µ + b_i + b_j + U_{*i}^T V_{*j}, where b_i and b_j are the user and item rating biases and µ is the global mean of the ratings. The model with bias is an
0.6 -CCD++ -CCD++ -CCD++ DSGD -DSGD -DSGD SGD-Bi -DSGD DSGGD-aia -DS-ADMM DS-ADVM -DS-ADMM 00 300 Time(s) Time(s) Tm倒 (a)Netflix (b)Yahoo!Music RI (c)Yahoo!Music R2 Figure 2:Test RMSE curve for CCD++,DSGD,DSGD-Bias and DS-ADMM.(a)Netflix data set;(b)Yahoo!Music RI data set;(c)Yahoo!Music R2 data set.The vertical axis is the RMSE on the test set,and the horizontal axis is the running time easy and natural extension of MF and has been proved to baselines.We can easily find that DS-ADMM performs the be more accurate than those without bias [8.DSGD-Bias best on all three data sets.RMSE value of CCD++and model is optimized and paralleled by using the same strategy the DSGD-Bias model decreases fast at first for some da- as that in DSGD. ta sets because CCD++updates the parameters by their During our experiments,we find that good results can closed-form solutions and the DSGD-Bias model extracts be achieved by setting A1 =A2 for all the algorithms.So more explicit information from the training data set.How- we simply set A1 =A2 in our experiments.All the hyper- ever,our DS-ADMM decreases much faster afterwards and parameters for each model in our experiments are selected finally converges to the best accuracy,which is much better by ten-fold cross validation on the training set except the than the DSGD-Bias model and CCD++.DSGD performs latent factor number k and the number of computing nodes the worst among all the models in most cases.Moreover P.k is selected by following the CCD++21.Actually. as the scale of the data set grows,the difference between C- we find that on our data sets larger k doesn't achieve much CD++,DSGD and DSGD-bias's convergence value becomes better accuracy for CCD++while increasing the running smaller,but their difference to DS-ADMM becomes larger. time.The node number P is set according to the size of the Note that Yahoo!Music R2 has about 0.7 billion ratings data sets.All the algorithms have the same P for the same which is much larger than many real world applications data set. In some application scenarios,the learning process stops We use a general and simple update rule for DSGD's learn- when the RMSE value reaches some threshold.So we de- ing rate.More specifically,we first initialize the learning velop experiments to test those algorithms'running time rate n with a relatively large value,then decrease it by to reach the given threshold with different number of com- nt+1 nt*B(0<B 1)after each iteration,and stop puting nodes.When conducting experiments on the Netflix decreasing when n becomes smaller than some threshold a. data set,we set the threshold of RMSE value as 0.922.This This strategy results in a fast convergence rate and avoids value is chosen because it is the place where the RMSE early stopping. curves of DS-ADMM and CCD++become smooth.and it DS-ADMM has a similar step-size parameter T.From is also a value that DSGD and DSGD-Bias can reach if pro- the detailed proof in Appendix A,we find that Te should be vided with enough running time.We report the log-scale smaller than some threshold computed based on U.and VP of the running time in Figure 3.Results on the other two Because the exact threshold value for r is hard to calculate. 
During our experiments, we find that good results can be achieved by setting $\lambda_1 = \lambda_2$ for all the algorithms, so we simply set $\lambda_1 = \lambda_2$ in our experiments. All the hyper-parameters for each model are selected by ten-fold cross validation on the training set, except the latent factor number $k$ and the number of computing nodes $P$. $k$ is selected by following CCD++ [21]; in fact, we find that on our data sets a larger $k$ does not achieve much better accuracy for CCD++ while it increases the running time. The node number $P$ is set according to the size of the data sets, and all the algorithms use the same $P$ for the same data set.

We use a general and simple update rule for DSGD's learning rate. More specifically, we first initialize the learning rate $\eta$ with a relatively large value, then decrease it by $\eta_{t+1} = \eta_t \beta$ ($0 < \beta < 1$) after each iteration, and stop decreasing when $\eta$ becomes smaller than some threshold $\alpha$. This strategy results in a fast convergence rate and avoids early stopping.

DS-ADMM has a similar step-size parameter $\tau_t$. From the detailed proof in Appendix A, we find that $\tau_t$ should be smaller than some threshold computed from $U_t$ and $V^p_t$. Because the exact threshold value for $\tau_t$ is hard to calculate, we approximately update it as $\tau_{t+1} = \tau_t \beta$ ($0 < \beta < 1$) at the $t$-th iteration. We also set a threshold $\alpha$: when $\tau_t \leq \alpha$, we stop decreasing $\tau_t$. This update rule for $\tau_t$ is the same as that for DSGD's learning rate $\eta$. The parameter settings are shown in the last eight rows of Table 1.

4.3 Accuracy and Efficiency

The root mean squared error (RMSE) is a widely used metric to measure an MF model's performance [21]. The test RMSE is defined as $\sqrt{\frac{1}{Q}\sum_{(i,j)}(R_{i,j} - U_{*i}^T V_{*j})^2}$, where $Q$ is the number of test ratings.

[Figure 2: Test RMSE curves for CCD++, DSGD, DSGD-Bias and DS-ADMM on (a) the Netflix, (b) the Yahoo! Music R1, and (c) the Yahoo! Music R2 data sets. The vertical axis is the RMSE on the test set, and the horizontal axis is the running time.]

Figure 2 shows the test RMSE versus running time for our DS-ADMM and the other baselines. We can easily see that DS-ADMM performs the best on all three data sets. The RMSE values of CCD++ and the DSGD-Bias model decrease quickly at first on some data sets, because CCD++ updates the parameters by their closed-form solutions and the DSGD-Bias model extracts more explicit information from the training data. However, our DS-ADMM decreases much faster afterwards and finally converges to the best accuracy, which is much better than the DSGD-Bias model and CCD++. DSGD performs the worst among all the models in most cases. Moreover, as the scale of the data set grows, the differences among the convergence values of CCD++, DSGD and DSGD-Bias become smaller, but their gap to DS-ADMM becomes larger. Note that Yahoo! Music R2 has about 0.7 billion ratings, which is much larger than the data in many real-world applications.
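For reference, the test RMSE used throughout this section can be computed from a factor pair and a set of held-out ratings as in the sketch below; the variable names and toy data are illustrative.

```python
import numpy as np

def test_rmse(U, V, test_ratings):
    """RMSE over held-out ratings given as (i, j, r) triples.

    U is k x m (user factors), V is k x n (item factors).
    """
    sq_err = 0.0
    for i, j, r in test_ratings:
        pred = U[:, i] @ V[:, j]
        sq_err += (r - pred) ** 2
    return np.sqrt(sq_err / len(test_ratings))

# Toy usage with random factors and synthetic test triples.
rng = np.random.default_rng(1)
k, m, n = 40, 100, 50
U = rng.normal(scale=0.1, size=(k, m))
V = rng.normal(scale=0.1, size=(k, n))
test = [(rng.integers(m), rng.integers(n), rng.uniform(1, 5)) for _ in range(1000)]
print(test_rmse(U, V, test))
```

In a distributed run, each node can accumulate the squared error over its local test ratings, and the partial sums can be combined with a single reduction.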
In some application scenarios, the learning process stops when the RMSE reaches some threshold. We therefore design experiments to measure each algorithm's running time to reach a given threshold with different numbers of computing nodes. For the experiments on the Netflix data set, we set the RMSE threshold to 0.922. This value is chosen because it is where the RMSE curves of DS-ADMM and CCD++ become smooth, and it is also a value that DSGD and DSGD-Bias can reach if provided with enough running time. We report the running time on a log scale in Figure 3. Results on the other two data sets are similar and are omitted due to space limitations.

[Figure 3: Average running time to reach a test RMSE of 0.922 on the Netflix data set for different methods.]

From Figure 3, we can see that our DS-ADMM outperforms all the other algorithms no matter how many nodes are used. DS-ADMM needs relatively fewer iterations to reach a test RMSE of 0.922, and its running time per iteration is the smallest among all algorithms. DSGD performs the worst, since it needs more iterations to reach a test RMSE of 0.922 and its running time per iteration is also relatively large.

4.4 Speedup

Another important metric for a distributed algorithm is its speedup, or scalability, which measures the performance when more machines are used or larger data sets are processed. In general, more machines and larger data sets increase the communication and scheduling costs. Algorithms with poor scalability consume more computing resources without offering a good reward, so they are not suitable for large-scale applications.
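As a concrete reading of this metric, the speedup factors used below are computed relative to the 2-node running time; a small sketch with purely illustrative timings (not measured values from our experiments):

```python
def speedup(times_by_nodes, base_nodes=2):
    """Speedup relative to the running time with `base_nodes` nodes.

    times_by_nodes maps node count -> running time (e.g., per iteration).
    """
    base = times_by_nodes[base_nodes]
    return {p: base / t for p, t in times_by_nodes.items()}

# Illustrative timings in seconds.
times = {2: 100.0, 4: 48.0, 6: 31.0, 8: 23.0}
print(speedup(times))  # {2: 1.0, 4: ~2.08, 6: ~3.23, 8: ~4.35}
```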
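The communication/synchronization share reported below can be obtained by timing the local update phase and the collective communication separately. A minimal mpi4py sketch, assuming the per-iteration synchronization is a single Allreduce over local updates (the loop body is a stand-in, not our actual DS-ADMM update):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
compute_time, sync_time = 0.0, 0.0

local_update = np.zeros(10)                 # stand-in for a local factor block
for _ in range(5):                          # stand-in for training iterations
    t0 = MPI.Wtime()
    local_update += np.random.rand(10)      # local (stochastic) updates
    t1 = MPI.Wtime()
    global_sum = np.empty_like(local_update)
    comm.Allreduce(local_update, global_sum, op=MPI.SUM)  # one synchronization
    t2 = MPI.Wtime()
    compute_time += t1 - t0
    sync_time += t2 - t1

if comm.rank == 0:
    print(f"synchronization fraction: {sync_time / (compute_time + sync_time):.1%}")
```

Launched with, e.g., `mpiexec -n 8 python sync_fraction.py` (the file name is hypothetical), this reports the fraction of wall-clock time spent in synchronization.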
To study the scalability of our algorithm, we compute the speedup factor relative to the running time with 2 nodes, varying the number of nodes from 2 to 8. Speedup factors of CCD++ and DSGD are provided for comparison. The results on the Netflix data set are shown in Figure 4. Results on the other data sets are similar and are omitted to save space.

[Figure 4: Speedup comparison among CCD++, DSGD, and DS-ADMM on the Netflix data set.]

It is obvious from Figure 4 that DS-ADMM achieves the best speedup among the three algorithms. Figure 4 also shows that both DSGD and DS-ADMM achieve a super-linear speedup (see http://en.wikipedia.org/wiki/Speedup). This might be reasonable: as the number of computing nodes increases, the scale of the data on each node decreases. With a relatively large cache size, most of the data or even the whole data set can be loaded into the caches, which reduces the memory access cost and yields extra speedup [2, 7].

By separately measuring the computation cost and the synchronization cost in our experiments, we find that DSGD costs much more time on communication and synchronization than DS-ADMM. Taking the Netflix experiment as an example, although the total running time per iteration is nearly the same for DSGD and DS-ADMM, DSGD spends about 18% of the total time on communication and synchronization, while DS-ADMM spends only 8%. This result verifies our previous analysis of the synchronization issue in DSGD.

4.5 Sensitivity to Hyper-parameter ρ

We vary the value of ρ to study its effect on the performance of DS-ADMM; the results are shown in Figure 5. We find that very good performance can be achieved when ρ is around 0.05 for all data sets, and DS-ADMM is not sensitive to ρ in the range [0.03, 0.1]. This relatively stable property of ρ makes hyper-parameter selection much easier.

5. CONCLUSION

In this paper, we propose a new distributed algorithm called DS-ADMM for large-scale matrix factorization in recommender systems. We first design a data split strategy to divide the large-scale problem into several sub-problems, and then derive a distributed stochastic variant of ADMM for efficient learning. Experiments on real-world data sets show that our DS-ADMM algorithm can outperform state-of-the-art methods like DSGD and CCD++ in terms of accuracy, efficiency and scalability.

6. ACKNOWLEDGEMENTS

This work is supported by the NSFC (No. 61100125, No. 61472182), the 863 Program of China (No. 2012AA011003), and the Program for Changjiang Scholars and Innovative Research Team in University of China (IRT1158, PCSIRT).

7. REFERENCES

[1] D. Agarwal. Computational advertising: the linkedin way. In CIKM, pages 1585–1586, 2013.
[2] J. Benzi and M. Damodaran. Parallel three dimensional direct simulation monte carlo for simulating micro flows. In Parallel Computational Fluid Dynamics 2007, pages 91–98. Springer, 2009.
[3] S. P. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, pages 1–122, 2011.
[4] R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis. Large-scale matrix factorization with distributed stochastic gradient descent. In KDD, pages 69–77, 2011.
[Figure 5: The effect of ρ in DS-ADMM on the three data sets: (a) Netflix; (b) Yahoo! Music R1; (c) Yahoo! Music R2. Each panel plots test RMSE versus running time for DSGD and for DS-ADMM with several values of ρ.]

[5] D. Goldfarb, S. Ma, and K. Scheinberg. Fast alternating linearization methods for minimizing the sum of two convex functions. Math. Program., pages 349–382, 2013.
[6] T. Goldstein and S. Osher. The split bregman method for l1-regularized problems. SIAM J. Imaging Sciences, pages 323–343, 2009.
[7] B. John, X.-J. Gu, and D. R. Emerson. Investigation of heat and mass transfer in a lid-driven cavity under nonequilibrium flow conditions. Numerical Heat Transfer, Part B: Fundamentals, 58(5):287–303, 2010.
[8] Y. Koren, R. M. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. IEEE Computer, 42(8):30–37, 2009.
[9] W.-J. Li and D.-Y. Yeung. Relation regularized matrix factorization. In IJCAI, pages 1126–1131, 2009.
[10] Y. Li, M. Yang, and Z. M. Zhang. Scientific articles recommendation. In CIKM, pages 1147–1156, 2013.
[11] F. Niu, B. Recht, C. Re, and S. J. Wright. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. In NIPS, pages 693–701, 2011.
[12] H. Ouyang, N. He, L. Tran, and A. G. Gray. Stochastic alternating direction method of multipliers. In ICML, pages 80–88, 2013.
[13] I. Pilászy, D. Zibriczky, and D. Tikk. Fast als-based matrix factorization for explicit and implicit feedback datasets. In RecSys, pages 71–78, 2010.
[14] S. Purushotham and Y. Liu. Collaborative topic regression with social matrix factorization for recommendation systems. In ICML, 2012.
[15] T. Suzuki. Dual averaging and proximal gradient descent for online alternating direction multiplier method. In ICML, pages 392–400, 2013.
[16] C. Wang and D. M. Blei. Collaborative topic modeling for recommending scientific articles. In KDD, pages 448–456, 2011.
[17] H. Wang and A. Banerjee. Online alternating direction method. In ICML, 2012.
[18] H. Wang, B. Chen, and W.-J. Li. Collaborative topic regression with social regularization for tag recommendation. In IJCAI, 2013.
[19] J. Yang and Y. Zhang. Alternating direction algorithms for problems in compressive sensing. SIAM J. Scientific Computing, pages 250–278, 2011.
[20] T. Yang. Trading computation for communication: Distributed stochastic dual coordinate ascent. In NIPS, pages 629–637, 2013.
[21] H.-F. Yu, C.-J. Hsieh, S. Si, and I. S. Dhillon. Scalable coordinate descent approaches to parallel matrix factorization for recommender systems. In ICDM, pages 765–774, 2012.
[22] Y. Zhen, W.-J. Li, and D.-Y. Yeung. Tagicofi: tag informed collaborative filtering. In RecSys, pages 69–76, 2009.
[23] Y. Zhou, D. M. Wilkinson, R. Schreiber, and R. Pan. Large-scale parallel collaborative filtering for the netflix prize. In AAIM, pages 337–348, 2008.
[24] Y. Zhuang, W.-S. Chin, Y.-C. Juan, and C.-J. Lin. A fast parallel sgd for matrix factorization in shared memory systems. In RecSys, pages 249–256, 2013.
[25] M. Zinkevich, M. Weimer, A. J. Smola, and L. Li. Parallelized stochastic gradient descent. In NIPS, pages 2595–2603, 2010.

APPENDIX

A. PROOF OF LEMMA 1

Proof.
We can rewrite
$$g^p(U, V^p, \tau_t \mid U_t, V^p_t) = \sum_{(i,j)\in\Omega^p} g^p_{i,j}(U_{*i}, V^p_{*j}, \tau_t \mid U_t, V^p_t),$$
where
$$\begin{aligned}
g^p_{i,j}(U_{*i}, V^p_{*j}, \tau_t \mid U_t, V^p_t)
&= \hat{f}_{i,j}([U_{*i}]_t, [V^p_{*j}]_t) \\
&\quad + \nabla^T_{U_{*i}} \hat{f}_{i,j}([U_{*i}]_t, [V^p_{*j}]_t)\,(U_{*i} - [U_{*i}]_t) \\
&\quad + \nabla^T_{V^p_{*j}} \hat{f}_{i,j}([U_{*i}]_t, [V^p_{*j}]_t)\,(V^p_{*j} - [V^p_{*j}]_t) \\
&\quad + \frac{1}{2 m_i \tau_t}\,\|U_{*i} - [U_{*i}]_t\|_F^2
      + \frac{1}{2 n_j \tau_t}\,\|V^p_{*j} - [V^p_{*j}]_t\|_F^2
\end{aligned}$$
with $m_i = |\Omega^p_i|$ and $n_j = |\tilde{\Omega}^p_j|$. Then, we have
$$\begin{aligned}
L^p(U, V^p, \Theta^p_t, V_t) - G^p(U, V^p, \Theta^p_t, V_t, \tau_t \mid U_t, V^p_t)
&= f^p(U, V^p) - g^p(U, V^p, \tau_t \mid U_t, V^p_t) \\
&= \sum_{(i,j)\in\Omega^p} \Big( \hat{f}_{i,j}(U_{*i}, V^p_{*j}) - g^p_{i,j}(U_{*i}, V^p_{*j}, \tau_t \mid U_t, V^p_t) \Big).
\end{aligned}$$
For clarity, we denote $U_{*i}$, $V^p_{*j}$, $[U_{*i}]_t$ and $[V^p_{*j}]_t$ as $u$, $v$, $u_t$ and $v_t$, respectively. Then we have
$$\begin{aligned}
\hat{f}_{i,j}(u, v)
&= \frac{1}{2}\big(R_{i,j} - (u - u_t + u_t)^T (v - v_t + v_t)\big)^2
   + \lambda_1 \|u - u_t + u_t\|_F^2 + \lambda_2 \|v - v_t + v_t\|_F^2 \\
&= \hat{f}_{i,j}(u_t, v_t)
   + \nabla^T_u \hat{f}_{i,j}(u_t, v_t)(u - u_t)
   + \nabla^T_v \hat{f}_{i,j}(u_t, v_t)(v - v_t)
   + h(u, v) + o(u, v),
\end{aligned}$$
where
$$\nabla^T_u \hat{f}_{i,j}(u_t, v_t) = \lambda_1 u_t - (R_{i,j} - u_t^T v_t)\,v_t, \qquad
  \nabla^T_v \hat{f}_{i,j}(u_t, v_t) = \lambda_2 v_t - (R_{i,j} - u_t^T v_t)\,u_t,$$
$h(u, v)$ contains all the second-order terms, and the third-order and fourth-order terms are contained in $o(u, v)$. Hence, we can get
$$h(u, v) = \frac{\lambda_1}{2}\|u - u_t\|_F^2 + \frac{\lambda_2}{2}\|v - v_t\|_F^2
 - (R_{i,j} - u_t^T v_t)(u - u_t)^T(v - v_t)
 + \frac{1}{2}\big(u_t^T(v - v_t) + (u - u_t)^T v_t\big)^2,$$
and
$$o(u, v) = \frac{1}{2}\big[2(u - u_t)^T v_t + 2(v - v_t)^T u_t + (u - u_t)^T(v - v_t)\big]\,(u - u_t)^T(v - v_t).$$
Because
$$-(R_{i,j} - u_t^T v_t)(u - u_t)^T(v - v_t) \le \frac{1}{2}\,|R_{i,j} - u_t^T v_t|\,\big(\|u - u_t\|_F^2 + \|v - v_t\|_F^2\big),$$
and
$$\frac{1}{2}\big(u_t^T(v - v_t) + (u - u_t)^T v_t\big)^2
 \le \|u_t^T(v - v_t)\|_F^2 + \|(u - u_t)^T v_t\|_F^2
 \le \|u_t\|_F^2\,\|v - v_t\|_F^2 + \|v_t\|_F^2\,\|u - u_t\|_F^2,$$
we have
$$h(u, v) \le \Big(\frac{\lambda_1}{2} + \frac{1}{2}|R_{i,j} - u_t^T v_t| + \|v_t\|_F^2\Big)\|u - u_t\|_F^2
 + \Big(\frac{\lambda_2}{2} + \frac{1}{2}|R_{i,j} - u_t^T v_t| + \|u_t\|_F^2\Big)\|v - v_t\|_F^2.$$
Because $|(u - u_t)^T(v - v_t)| \le (\|u - u_t\|_F^2 + \|v - v_t\|_F^2)/2$, we have
$$\begin{aligned}
\big|2(u - u_t)^T v_t + 2(v - v_t)^T u_t + (u - u_t)^T(v - v_t)\big|
&\le |2(u - u_t)^T v_t| + |2(v - v_t)^T u_t| + |(u - u_t)^T(v - v_t)| \\
&\le \|u - u_t\|_F^2 + \|v_t\|_F^2 + \|v - v_t\|_F^2 + \|u_t\|_F^2
   + \frac{1}{2}\big[\|u - u_t\|_F^2 + \|v - v_t\|_F^2\big] \\
&\le \|u_t\|_F^2 + \|v_t\|_F^2 + 3\delta^2.
\end{aligned}$$
Then we can prove
$$o(u, v) \le |o(u, v)| \le \frac{1}{4}\big(\|u_t\|_F^2 + \|v_t\|_F^2 + 3\delta^2\big)\big(\|u - u_t\|_F^2 + \|v - v_t\|_F^2\big).$$
Because $\|u - u_t\|_F^2 \le \delta^2$ and $\|v - v_t\|_F^2 \le \delta^2$, we can get
$$\begin{aligned}
\hat{f}_{i,j}(u, v) \le{}& \hat{f}_{i,j}(u_t, v_t)
 + \nabla^T_u \hat{f}_{i,j}(u_t, v_t)(u - u_t)
 + \nabla^T_v \hat{f}_{i,j}(u_t, v_t)(v - v_t) \\
&+ \Big(\frac{5}{4}\|v_t\|_F^2 + \frac{1}{4}\|u_t\|_F^2 + \frac{3}{4}\delta^2 + \frac{1}{2}\lambda_1 + \frac{1}{2}|R_{i,j} - u_t^T v_t|\Big)\|u - u_t\|_F^2 \\
&+ \Big(\frac{5}{4}\|u_t\|_F^2 + \frac{1}{4}\|v_t\|_F^2 + \frac{3}{4}\delta^2 + \frac{1}{2}\lambda_2 + \frac{1}{2}|R_{i,j} - u_t^T v_t|\Big)\|v - v_t\|_F^2.
\end{aligned}$$
So if we let
$$\frac{1}{\tau_t} \ge \max\Big\{
 2 m_i \Big(\frac{5}{4}\|v_t\|_F^2 + \frac{1}{4}\|u_t\|_F^2 + \frac{3}{4}\delta^2 + \frac{1}{2}\lambda_1 + \frac{1}{2}|R_{i,j} - u_t^T v_t|\Big),\;
 2 n_j \Big(\frac{5}{4}\|u_t\|_F^2 + \frac{1}{4}\|v_t\|_F^2 + \frac{3}{4}\delta^2 + \frac{1}{2}\lambda_2 + \frac{1}{2}|R_{i,j} - u_t^T v_t|\Big)\Big\},$$
we can prove that
$$G^p(U, V^p, \Theta^p_t, V_t, \tau_t \mid U_t, V^p_t) \ge L^p(U, V^p, \Theta^p_t, V_t).$$
That is to say, $\tau_t$ should be smaller than some value which depends on the current $U_t$ and $V^p_t$. The second equation in Lemma 1 can be easily proved.

B. PROOF OF LEMMA 2

Proof. From Lemma 1, we have
$$G^p(U_{t+1}, V^p_{t+1}, \Theta^p_t, V_t, \tau_t \mid U_t, V^p_t) \ge L^p(U_{t+1}, V^p_{t+1}, \Theta^p_t, V_t),$$
$$G^p(U_t, V^p_t, \Theta^p_t, V_t, \tau_t \mid U_t, V^p_t) = L^p(U_t, V^p_t, \Theta^p_t, V_t).$$
Furthermore, we have
$$G^p(U_{t+1}, V^p_{t+1}, \Theta^p_t, V_t, \tau_t \mid U_t, V^p_t) \le G^p(U_t, V^p_t, \Theta^p_t, V_t, \tau_t \mid U_t, V^p_t).$$
Hence, we can get
$$L^p(U_{t+1}, V^p_{t+1}, \Theta^p_t, V_t) \le L^p(U_t, V^p_t, \Theta^p_t, V_t).$$