Figure 2: Test RMSE curves for CCD++, DSGD, DSGD-Bias and DS-ADMM. (a) Netflix data set; (b) Yahoo! Music R1 data set; (c) Yahoo! Music R2 data set. The vertical axis is the RMSE on the test set, and the horizontal axis is the running time.

easy and natural extension of MF and has been proved to be more accurate than those without bias [8]. The DSGD-Bias model is optimized and parallelized by using the same strategy as that in DSGD.

During our experiments, we find that good results can be achieved by setting λ1 = λ2 for all the algorithms. So we simply set λ1 = λ2 in our experiments. All the hyperparameters for each model in our experiments are selected by ten-fold cross-validation on the training set, except the latent factor number k and the number of computing nodes P. k is selected by following CCD++ [21].
Actually, we find that on our data sets a larger k does not achieve much better accuracy for CCD++ while increasing the running time. The node number P is set according to the size of the data sets. All the algorithms use the same P for the same data set.

We use a general and simple update rule for DSGD's learning rate. More specifically, we first initialize the learning rate η with a relatively large value, then decrease it by η_{t+1} = η_t · β (0 < β < 1) after each iteration, and stop decreasing when η becomes smaller than some threshold α. This strategy results in a fast convergence rate and avoids early stopping.

DS-ADMM has a similar step-size parameter τ_t. From the detailed proof in Appendix A, we find that τ_t should be smaller than some threshold computed based on U_t and V^p_t. Because the exact threshold value for τ_t is hard to calculate, we approximately update it as τ_{t+1} = τ_t · β (0 < β < 1) for the t-th iteration. We also set a threshold α: when τ_t ≤ α, we stop decreasing τ_t. It is easy to see that the update rule for τ_t is the same as that for DSGD's learning rate η.

The parameter settings are shown in the last eight rows of Table 1.

4.3 Accuracy and Efficiency

The root mean squared error (RMSE) is a widely used metric to measure an MF model's performance [21]. The test RMSE is defined as $\sqrt{\frac{1}{Q}\sum_{(i,j)}\left(R_{i,j} - U_{*i}^{T} V_{*j}\right)^{2}}$, where the sum runs over the test ratings and Q is the number of test ratings. Figure 2 shows the test RMSE versus running time for our DS-ADMM and the other baselines. We can easily find that DS-ADMM performs the best on all three data sets. The RMSE values of CCD++ and the DSGD-Bias model decrease fast at first on some data sets, because CCD++ updates the parameters by their closed-form solutions and the DSGD-Bias model extracts more explicit information from the training data set. However, our DS-ADMM decreases much faster afterwards and finally converges to the best accuracy, which is much better than the DSGD-Bias model and CCD++.
DSGD performs the worst among all the models in most cases. Moreover, as the scale of the data set grows, the difference between the convergence values of CCD++, DSGD and DSGD-Bias becomes smaller, but their difference from DS-ADMM becomes larger. Note that Yahoo! Music R2 has about 0.7 billion ratings, which is much larger than many real-world applications.

In some application scenarios, the learning process stops when the RMSE value reaches some threshold. So we design experiments to test those algorithms' running time to reach a given threshold with different numbers of computing nodes. When conducting experiments on the Netflix data set, we set the RMSE threshold to 0.922. This value is chosen because it is the place where the RMSE curves of DS-ADMM and CCD++ become smooth, and it is also a value that DSGD and DSGD-Bias can reach if provided with enough running time. We report the running time in log scale in Figure 3. Results on the other two data sets are similar and are omitted due to limited space. From Figure 3, we can find that our DS-ADMM outperforms all the other algorithms no matter how many nodes are used. DS-ADMM needs relatively fewer iterations to reach a test RMSE of 0.922, and its running time for each iteration is the smallest among all algorithms. DSGD performs the worst since it needs more iterations to reach a test RMSE of 0.922 and its running time for each iteration is also relatively large.

4.4 Speedup

Another important metric that can be used to measure a distributed algorithm is its speedup or scalability. It measures the performance when more machines are used or larger data sets are processed. In general, more machines and