Due to its efficiency and ease of implementation, SGD has become one of the most popular optimization strategies for MF in recommender systems [8]. The basic idea of SGD is to randomly select one rating at a time from the rating matrix and then use the gradient based on this selected rating to update the latent features. It is easy to see that SGD is essentially a sequential method, which makes it difficult to parallelize.
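For concreteness, one such SGD step on a single observed rating can be sketched as follows. This is only a minimal NumPy illustration, not the implementation used in the paper; the function name, learning rate, regularization constants and toy data are illustrative choices, and the objective is the regularized squared error given later in Eq. (1).

```python
import numpy as np

def sgd_step(R_ij, U, V, i, j, lr=0.01, lam1=0.05, lam2=0.05):
    """One SGD update on a single observed rating R_ij.

    U: k x m user latent matrix, V: k x n item latent matrix.
    Only the columns U[:, i] and V[:, j] are touched, which is why two
    updates conflict exactly when they share a user or an item.
    """
    err = R_ij - U[:, i] @ V[:, j]            # prediction error on this rating
    grad_u = -err * V[:, j] + lam1 * U[:, i]  # gradient w.r.t. U[:, i]
    grad_v = -err * U[:, i] + lam2 * V[:, j]  # gradient w.r.t. V[:, j]
    U[:, i] -= lr * grad_u
    V[:, j] -= lr * grad_v

# Toy usage: k = 4 latent factors, 5 users, 6 items, update one rating.
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(4, 5))
V = rng.normal(scale=0.1, size=(4, 6))
sgd_step(R_ij=3.0, U=U, V=V, i=2, j=4)
```

Because each step touches only one user column and one item column, the difficulty of parallelization comes entirely from steps that happen to touch the same column, as discussed next.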
The main reason that SGD cannot be directly parallelized is that two randomly selected ratings may share the same latent features corresponding to the same user or item. Hence, there exist conflicts between two processes or nodes that simultaneously update the same latent features. Recently, several strategies have been proposed to parallelize SGD for MF. The Hogwild! [11] model simply ignores the conflicts by assuming that the probability of conflict is small when two ratings are randomly selected from a sparse rating matrix. However, the conflicts do exist in the learning procedure, which makes Hogwild! not effective enough [21, 24]. Moreover, Hogwild! requires all the processes to share the whole training set, which is hard to satisfy in distributed systems. Hence, Hogwild! cannot be directly used in distributed systems.

Distributed SGD (DSGD) [4] utilizes the property that there exist several sub-blocks without overlapping rows and columns in the rating matrix. These sub-blocks are mutually independent and can thus be processed in parallel by different processes or nodes at the same time. Experiments in [21, 24] have shown that DSGD can outperform Hogwild! in terms of both efficiency and accuracy. However, after a set of independent sub-blocks has been processed, the updated variables from all processes or nodes should be synchronized before processing the other sets of independent sub-blocks. It is these frequent synchronization operations that make DSGD inefficient, because the slowest node becomes the bottleneck of the whole system. Things get even worse if data skew exists, which is not rare in real applications. Very recently, fast parallel SGD (FPSGD) [24] tries to solve the issues in DSGD by changing the scheduler into an asynchronous one, and it achieves better performance than DSGD. However, FPSGD can only be used in shared-memory systems with a single node. Hence, FPSGD is still not scalable enough to handle large-scale problems.

In this paper, a novel model, called distributed stochastic alternating direction method of multipliers (DS-ADMM), is proposed for large-scale MF problems. DS-ADMM is a distributed stochastic variant of ADMM [3]. The main contributions of DS-ADMM are briefly outlined as follows:

• In DS-ADMM, a new data split (partition) strategy called LocalMFSplit is proposed to assign subsets of the whole set of ratings to different nodes in a cluster and consequently divide the large-scale problem into several sub-problems. Our split strategy makes the distributed MF problem fit the ADMM framework. Furthermore, compared with the existing split strategies in DSGD and CCD++, our split strategy can reduce synchronization and scheduling cost to improve efficiency.

• A stochastic ADMM method is designed to perform efficient learning of the parameters.

• DS-ADMM is implemented with the message passing interface (MPI), so it can run on clusters with multiple machines (nodes). Hence, DS-ADMM is scalable to handle large-scale data sets.

• Experiments on several data sets from recommendation applications show that not only can DS-ADMM outperform other SGD-based models, but it can also outperform ALS-based models like CCD++ in terms of both efficiency and accuracy.

2. BACKGROUND

In this section, we introduce the background of this paper, including notations, the MF formulation, ALS-based models, SGD-based models and ADMM.

2.1 Notations

We use boldface uppercase letters like M to denote matrices and boldface lowercase letters like m to denote vectors.
M_{i*} and M_{*j} denote the ith row and the jth column of M, respectively. M_{ij} denotes the element at the ith row and jth column of M. M^T denotes the transpose of M, and M^{-1} denotes the inverse of M. tr(·) denotes the trace of a matrix, and I_k is the identity matrix of size k × k. Assume there are m users and n items in the data set. We use R ∈ R^{m×n} to denote the rating matrix. Please note that there exist many missing entries in R; all the missing entries are filled with 0. We use Ω ⊂ {1, 2, ..., m} × {1, 2, ..., n} to denote the set of indices of the observed ratings. Ω_i denotes the column indices of the observed ratings in the ith row of R, and Ω̃_j denotes the row indices of the observed ratings in the jth column of R. U ∈ R^{k×m} denotes the users' latent factor matrix, with each column U_{*i} representing the latent feature vector of user i, where k is the number of latent factors for each user or item. V ∈ R^{k×n} denotes the items' latent factor matrix, with each column V_{*j} representing the latent feature vector of item j. P denotes the total number of nodes in the cluster, and we use the letter p as a superscript, as in M^p, to denote the id of a computer node. ‖·‖_F denotes the Frobenius norm of a matrix or a vector.

2.2 Matrix Factorization

Matrix factorization (MF) can be formulated as the following optimization problem:

    min_{U,V} (1/2) Σ_{(i,j)∈Ω} [ (R_{ij} − U_{*i}^T V_{*j})^2 + λ_1 U_{*i}^T U_{*i} + λ_2 V_{*j}^T V_{*j} ],    (1)

where λ_1 and λ_2 are hyper-parameters for regularization. There are two categories of parallel models for solving the above MF problem, i.e., ALS-based models and SGD-based models, which are briefly reviewed in the following subsections.

2.3 ALS-based Parallel MF Models

By adopting the alternating learning strategy, ALS [23] alternately switches between updating U and updating V with the other latent matrix fixed. With U fixed, the MF problem can be decomposed into n independent least squares problems, each of which corresponds to a column of the matrix V. Similarly, m independent least squares problems can be obtained by fixing V. Furthermore, each of these independent
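Each of these per-column problems is a small ridge-regression problem with a closed-form solution. The following is a minimal sketch of the update for a single column V_{*j} with U fixed, not the implementation from [23]; the function name, hyper-parameter value and toy data are our own, and the |Ω̃_j| factor on the regularizer simply reflects that Eq. (1) sums the λ_2 term once per observed rating in column j.

```python
import numpy as np

def als_update_item(j, R, U, observed_rows, lam2=0.05):
    """Closed-form least-squares update for one item column V[:, j] with U fixed.

    observed_rows: the indices i with (i, j) in Omega, i.e. the set written
    as Omega~_j in the text. Because Eq. (1) sums the regularizer over all
    observed ratings, V[:, j] is effectively regularized by
    lam2 * |Omega~_j| * I_k.
    """
    k = U.shape[0]
    U_j = U[:, observed_rows]                  # k x |Omega~_j| user factors
    r_j = R[observed_rows, j]                  # observed ratings in column j
    A = U_j @ U_j.T + lam2 * len(observed_rows) * np.eye(k)
    b = U_j @ r_j
    return np.linalg.solve(A, b)               # new V[:, j]

# Toy usage: 4 latent factors, 5 users, 6 items; users 0 and 3 rated item 2.
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(4, 5))
R = np.zeros((5, 6))
R[0, 2], R[3, 2] = 4.0, 2.5
V_col = als_update_item(j=2, R=R, U=U, observed_rows=[0, 3])
```

Since the n column problems do not share any variables, different columns (and hence different nodes) can be solved in parallel without conflicts, which is what the ALS-based parallel models exploit.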