Coupled Group Lasso for Web-Scale CTR Prediction in Display Advertising
…non-click, which makes an instance in the data set. Given a training set $\{(\mathbf{x}^{(i)}, y^{(i)}) \mid i = 1, \dots, N\}$, in which $\mathbf{x}^T = (\mathbf{x}_u^T, \mathbf{x}_a^T, \mathbf{x}_o^T)$ and $y \in \{0, 1\}$ with $y = 1$ denoting click and $y = 0$ denoting non-click in an impression, the CTR prediction problem is to learn a function $h(\mathbf{x}) = h(\mathbf{x}_u, \mathbf{x}_a, \mathbf{x}_o)$ which can be used to predict the probability that user $u$ clicks on ad $a$ in a specific context $o$.

2.2. Logistic Regression

The likelihood of LR is defined as $h_1(\mathbf{x}) = \Pr(y = 1 \mid \mathbf{x}, \mathbf{w}) = \frac{1}{1 + \exp(-\mathbf{w}^T \mathbf{x})}$, where $\mathbf{w}$ is the parameter (weight vector) to learn. Please note that the bias term of LR has been integrated into $\mathbf{w}$ by adding an extra feature with constant value 1 to the feature vector. Given a training set $\{(\mathbf{x}^{(i)}, y^{(i)}) \mid i = 1, \dots, N\}$, the weight vector $\mathbf{w}$ is found by minimizing the following regularized loss function:

$$\min_{\mathbf{w}} \; \lambda \Omega_1(\mathbf{w}) + \sum_{i=1}^{N} \xi_1(\mathbf{w}; \mathbf{x}^{(i)}, y^{(i)}), \tag{1}$$

$$\xi_1(\mathbf{w}; \mathbf{x}^{(i)}, y^{(i)}) = -\log\!\left([h_1(\mathbf{x}^{(i)})]^{y^{(i)}} [1 - h_1(\mathbf{x}^{(i)})]^{1 - y^{(i)}}\right),$$

where $\Omega_1(\mathbf{w})$ is the regularization term.

In real applications, we can use the following $L_2$-norm for regularization (Golub et al., 1999): $\Omega_1(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|_2^2 = \frac{\mathbf{w}^T \mathbf{w}}{2}$. The resulting model is the standard LR model. We can also use the following $L_1$-norm for regularization: $\Omega_1(\mathbf{w}) = \|\mathbf{w}\|_1 = \sum_{i=1}^{z} |w_i|$, where $z$ is the length of the vector $\mathbf{w}$. The resulting model is Lasso, which can be used for feature selection or elimination (Tibshirani, 1996).

The optimization problem in (1) is easy to implement and achieves promising performance, which makes LR very popular in industry. Please note that in the following content, LR refers to the LR model with $L_2$-norm regularization, and LR with $L_1$-norm regularization will be called Lasso, as in much of the literature (Tibshirani, 1996).

2.3. Group Lasso

The group lasso is a technique for variable selection on (predefined) groups of variables (Yuan & Lin, 2006; Meier et al., 2008).
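Before formalizing the group penalty, the regularized LR objective in (1) is simple enough to evaluate directly. The following is a minimal NumPy sketch under both regularizers; the data is synthetic and the function name is our own, not the paper's implementation:

```python
import numpy as np

def lr_objective(w, X, y, lam, reg="l2"):
    """Regularized negative log-likelihood of LR, as in Eq. (1).

    X: (N, z) feature matrix (with a constant-1 bias column appended),
    y: (N,) labels in {0, 1}, lam: regularization weight lambda.
    """
    p = 1.0 / (1.0 + np.exp(-X @ w))           # h_1(x) for every instance
    # Sum of xi_1 over instances: the negative log-likelihood
    nll = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    if reg == "l2":                             # standard LR
        omega = 0.5 * w @ w
    else:                                       # "l1": Lasso
        omega = np.abs(w).sum()
    return lam * omega + nll

rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(8, 3)), np.ones((8, 1))])  # bias column
y = rng.integers(0, 2, size=8).astype(float)
w = np.zeros(4)
# With w = 0, p = 0.5 for every instance, so the objective is N * log 2
print(lr_objective(w, X, y, lam=0.1))  # ≈ 5.545 (= 8 * log 2)
```

Minimizing this objective with any gradient-based solver recovers the standard LR (L2) or Lasso (L1) model.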
For a parameter vector $\boldsymbol{\beta} \in \mathbb{R}^z$, the regularization term in group lasso is defined as follows:

$$\sum_{g=1}^{G} \|\boldsymbol{\beta}_{I_g}\|_2, \tag{2}$$

where $I_g$ is the index set belonging to the predefined $g$th group of variables, $g = 1, 2, \dots, G$. The group lasso can be used together with linear regression (Yuan & Lin, 2006) or logistic regression (Meier et al., 2008) as a penalty. It is attractive for its property of performing variable selection at the group level, where all the variables in some groups become zero after learning.

3. Coupled Group Lasso

Although LR has been widely used for CTR prediction, it cannot capture the conjunction information between user features and ad features. One possible solution is to manually construct conjunction features from the original input features as the input of LR. However, as stated in (Chapelle et al., 2013), manual feature conjunction results in a quadratic number of new features, which makes it extraordinarily difficult to learn the parameters. Hence, the modeling ability of LR is too weak to capture the complex relationships in the data.

In this section, we introduce our coupled group lasso (CGL) model, which can easily model the conjunction information between users and ads to achieve better performance than LR.

3.1. Model

The likelihood of CGL is formulated as follows:

$$h(\mathbf{x}) = \Pr(y = 1 \mid \mathbf{x}, \mathbf{W}, \mathbf{V}, \mathbf{b}) = \sigma\!\left((\mathbf{x}_u^T \mathbf{W})(\mathbf{x}_a^T \mathbf{V})^T + \mathbf{b}^T \mathbf{x}_o\right), \tag{3}$$

where $\mathbf{W}$ is a matrix of size $l \times k$, $\mathbf{V}$ is a matrix of size $s \times k$, $\mathbf{b}$ is a vector of length $d$, and $\sigma(x)$ is the sigmoid function with $\sigma(x) = \frac{1}{1 + \exp(-x)}$. Here, $\mathbf{W}$, $\mathbf{V}$, and $\mathbf{b}$ are the parameters to learn, and $k$ is a hyper-parameter.

Furthermore, we put regularization on the negative log-likelihood to get the following optimization problem of CGL:

$$\min_{\mathbf{W}, \mathbf{V}, \mathbf{b}} \; \sum_{i=1}^{N} \xi\!\left(\mathbf{W}, \mathbf{V}, \mathbf{b}; \mathbf{x}^{(i)}, y^{(i)}\right) + \lambda \Omega(\mathbf{W}, \mathbf{V}), \tag{4}$$

with

$$\xi(\mathbf{W}, \mathbf{V}, \mathbf{b}; \mathbf{x}^{(i)}, y^{(i)}) = -\log\!\left([h(\mathbf{x}^{(i)})]^{y^{(i)}} [1 - h(\mathbf{x}^{(i)})]^{1 - y^{(i)}}\right), \tag{5}$$

$$\Omega(\mathbf{W}, \mathbf{V}) = \|\mathbf{W}\|_{2,1} + \|\mathbf{V}\|_{2,1}. \tag{6}$$

Here, $\|\mathbf{W}\|_{2,1} = \sum_{i=1}^{l} \sqrt{\sum_{j=1}^{k} W_{ij}^2} = \sum_{i=1}^{l} \|\mathbf{W}_{i*}\|_2$ is the $L_{2,1}$-norm of the matrix $\mathbf{W}$.
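To make Eqs. (3)–(6) concrete, here is a minimal NumPy sketch of the CGL likelihood and its regularized objective. The dimensions and data are synthetic and all function names are our own; this is an illustration of the formulation, not the paper's large-scale implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cgl_prob(xu, xa, xo, W, V, b):
    """Eq. (3): h(x) = sigma((x_u^T W)(x_a^T V)^T + b^T x_o)."""
    return sigmoid((xu @ W) @ (xa @ V) + b @ xo)

def l21_norm(M):
    """L_{2,1}-norm used in Eq. (6): sum of the L2-norms of the rows."""
    return np.sqrt((M ** 2).sum(axis=1)).sum()

def cgl_objective(data, W, V, b, lam):
    """Eq. (4): sum of the losses in Eq. (5) plus lambda * Omega(W, V)."""
    nll = 0.0
    for xu, xa, xo, y in data:
        p = cgl_prob(xu, xa, xo, W, V, b)
        nll += -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return nll + lam * (l21_norm(W) + l21_norm(V))

l, s, d, k = 4, 3, 2, 5              # user/ad/context dims, latent dim k
rng = np.random.default_rng(1)
W, V = rng.normal(size=(l, k)), rng.normal(size=(s, k))
b = rng.normal(size=d)
data = [(rng.normal(size=l), rng.normal(size=s), rng.normal(size=d),
         float(rng.integers(0, 2))) for _ in range(6)]
print(cgl_objective(data, W, V, b, lam=0.1))
```

Note that $(\mathbf{x}_u^T \mathbf{W})(\mathbf{x}_a^T \mathbf{V})^T$ is a dot product of two length-$k$ vectors, which is how CGL couples every user feature with every ad feature through only $(l + s)k$ parameters instead of $l \times s$ explicit conjunctions.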
Similarly, $\|\mathbf{V}\|_{2,1}$ is the $L_{2,1}$-norm of the matrix $\mathbf{V}$. From (2), it is easy to see that the $L_{2,1}$-norm is actually a group lasso regularization with each row being a group. Please note that we do not put regularization on $\mathbf{b}$, because our experiments show that such regularization does not affect the performance.
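The row-wise group sparsity described above comes from the proximal operator of the $L_{2,1}$-norm, which shrinks entire rows (groups) and sets weak rows exactly to zero. A small illustration of that operator (our own sketch; the paper does not prescribe this particular solver step):

```python
import numpy as np

def row_group_soft_threshold(M, tau):
    """Proximal operator of tau * ||M||_{2,1}: scale each row (group)
    toward zero, and zero out rows whose L2-norm is below tau."""
    norms = np.sqrt((M ** 2).sum(axis=1, keepdims=True))
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))
    return M * scale

W = np.array([[3.0, 4.0],    # row norm 5.0: kept, shrunk by factor 0.8
              [0.3, 0.4],    # row norm 0.5: the whole group is zeroed
              [0.0, 2.0]])   # row norm 2.0: kept, shrunk by factor 0.5
print(row_group_soft_threshold(W, tau=1.0))
# The weak second row comes out exactly zero, i.e. its group is eliminated
```

This is exactly the group-level selection behavior of (2): after learning, whole rows of $\mathbf{W}$ (features of a user group) or $\mathbf{V}$ (features of an ad group) can vanish.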