Table 1. Characteristics of the three data sets, which contain training data of 4 days, 10 days, and 7 days from different time periods, respectively. The log information of the day following each training period is used as the corresponding test set.

DATA SET   # INSTANCES (IN BILLION)   CTR (IN %)   # ADS     # USERS (IN MILLION)   STORAGE (IN TB)
TRAIN 1    1.011                      1.62         21,318    874.7                  1.895
TEST 1     0.295                      1.70         11,558    331.0                  0.646
TRAIN 2    1.184                      1.61         21,620    958.6                  2.203
TEST 2     0.145                      1.64         6,848     190.3                  0.269
TRAIN 3    1.491                      1.75         33,538    1119.3                 2.865
TEST 3     0.126                      1.70         9,437     183.7                  0.233

5.2. Evaluation Metrics and Baseline

5.2.1. METRIC

We can regard CTR prediction as a binary classification problem. Because the data set is highly unbalanced, with only a small proportion of positive instances, prediction accuracy is not a good evaluation metric; neither precision nor recall is a good metric either. In this paper, we adopt the area under the receiver operating characteristic curve (AUC) (Bradley, 1997) to measure prediction accuracy, which has been widely used in the existing literature on CTR prediction (Chapelle et al., 2013). For a random guesser, the AUC value is 0.5, which means a total lack of discrimination. In order to have a fair comparison with baseline models, we first remove this constant part (0.5) from the AUC value and then compute the relative improvement (RelaImpr) of our model, which has the following mathematical form:

    RelaImpr = (AUC(model) − 0.5) / (AUC(baseline) − 0.5) × 100%.

This RelaImpr metric has been widely adopted in industry for comparing the discrimination of models³.

Our CGL model has the effect of selecting or eliminating features for both users and ads. We introduce group sparsity (GSparsity) to measure the capability of our model in feature elimination: GSparsity = ν / (l + s) × 100%, where ν is the total number of all-zero rows in the parameter matrices W and V, and l and s are the numbers of rows in W and V, respectively.
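To make the two metrics concrete, the following is a minimal Python sketch of how RelaImpr and GSparsity could be computed. It assumes NumPy and scikit-learn; the function and variable names (rela_impr, g_sparsity, y_true, model_scores, W, V) are illustrative, not taken from the paper's own implementation.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def rela_impr(y_true, model_scores, baseline_scores):
        # Remove the 0.5 AUC of a random guesser from both models
        # before taking the ratio, as in the definition above.
        auc_model = roc_auc_score(y_true, model_scores)
        auc_baseline = roc_auc_score(y_true, baseline_scores)
        return (auc_model - 0.5) / (auc_baseline - 0.5) * 100.0

    def g_sparsity(W, V):
        # An all-zero row of W (l x k) or V (s x k) means the
        # corresponding user or ad feature has been eliminated by
        # the group-lasso penalty.
        nu = np.sum(~W.any(axis=1)) + np.sum(~V.any(axis=1))
        return nu / (W.shape[0] + V.shape[0]) * 100.0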
5.2.2. BASELINE

Because the LR model (with L2-norm regularization) has been widely used for CTR prediction and has achieved state-of-the-art performance, especially in industrial systems (Chapelle et al., 2013), we adopt LR as the baseline for comparison. Please note that LR refers to the model in (1) with L2-norm regularization, and the model in (1) with L1-norm regularization is called Lasso in this paper.

³ http://en.wikipedia.org/wiki/Receiver_operating_characteristic

5.3. Accuracy of Lasso

Table 2 shows the relative improvement (RelaImpr) of Lasso w.r.t. the baseline (LR). We can see that there is no significant difference between LR and Lasso in terms of prediction accuracy.

Table 2. Relative improvement of Lasso w.r.t. the baseline (LR).

DATA SET    DATASET-1   DATASET-2   DATASET-3
RELAIMPR    −0.019%     −0.096%     +0.086%

5.4. Accuracy of CGL

Please note that in Algorithm 1, W and V are randomly initialized, which may affect the performance. We perform six independent rounds of experiments with different initializations. The mean and variance of the relative improvement of our CGL model (with k = 50) w.r.t. the baseline LR are reported in Figure 2. It is easy to see that our CGL model significantly outperforms LR on all three data sets. Furthermore, the random initialization has a negligible influence on the performance. Hence, in the following experiments, we do not report the variance of the values.
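This repeated-initialization protocol can be sketched as follows. The paper does not provide code, so train_fn and score_fn are hypothetical placeholders for an implementation of Algorithm 1 and for RelaImpr evaluation on the test set.

    import numpy as np

    def repeated_init_experiment(train_fn, score_fn, l, s, k=50, rounds=6):
        # Run `rounds` independent trainings, each with a different
        # random initialization of W (l x k) and V (s x k), and report
        # the mean and variance of the resulting RelaImpr values.
        # train_fn(W0, V0) -> model and score_fn(model) -> RelaImpr
        # are supplied by the caller; they stand in for Algorithm 1.
        values = []
        for seed in range(rounds):
            rng = np.random.default_rng(seed)
            W0 = rng.normal(scale=0.01, size=(l, k))
            V0 = rng.normal(scale=0.01, size=(s, k))
            values.append(score_fn(train_fn(W0, V0)))
        return np.mean(values), np.var(values)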
[Figure 2. The relative improvement (RelaImpr, in %) of CGL w.r.t. the baseline (LR) on the three real-world data sets, Dataset-1 through Dataset-3.]

5.5. Sensitivity to Hyper-Parameters

In this subsection, we study the influence of the two key hyper-parameters, k and λ, in our CGL model.

Experiments are conducted for different values of k, and the results are shown in Figure 3(a). We can find that, as k increases, the performance generally becomes better. But a larger k implies more parameters, which can make the