Coupled Group Lasso for Web-Scale CTR Prediction in Display Advertising

We can find that there are two group lassos in (4), one for user features and the other for ad features. Furthermore, the two group lassos are coupled together to determine the CTR of an impression. Hence, our model is called coupled group lasso (CGL).

Because $(\mathbf{x}_u^T\mathbf{W})(\mathbf{x}_a^T\mathbf{V})^T = \mathbf{x}_u^T\mathbf{W}\mathbf{V}^T\mathbf{x}_a = \mathbf{x}_u^T(\mathbf{W}\mathbf{V}^T)\mathbf{x}_a$, it is interesting to see that the term $(\mathbf{x}_u^T\mathbf{W})(\mathbf{x}_a^T\mathbf{V})^T$ in (3) can effectively model the conjunction information between user features and ad features. The number of parameters in $\mathbf{W}$ and $\mathbf{V}$ is only $(l+s)k$, where $k$ is typically a small number (less than 50 in our experiments). On the contrary, if we choose to manually construct the conjunction features for LR (Chapelle et al., 2013), the number of manual conjunction features is $l \times s$, with both $l$ and $s$ typically being tens of thousands in CTR prediction. Hence, the number of parameters of CGL is much smaller than that of LR with manual conjunction features, which makes CGL more scalable than LR for modeling conjunction features. Furthermore, the smaller number of parameters lowers the risk of overfitting.

Another nice property of CGL comes from the regularization term of group lasso. It is easy to find that some rows of $\mathbf{W}$ and $\mathbf{V}$ will be all-zero after learning. Because each row corresponds to a feature of users or ads, we can eliminate the features corresponding to all-zero parameter values. Hence, when we use the learned model for online prediction, it is not necessary to collect those eliminated features for users and ads. This can not only save memory, but also speed up the online prediction procedure.
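To make the parameter-count comparison concrete, the following minimal sketch computes the coupled conjunction term $\mathbf{x}_u^T\mathbf{W}\mathbf{V}^T\mathbf{x}_a$ and contrasts $(l+s)k$ with $l \times s$. The prediction function `predict_ctr` assumes the form $h(\mathbf{x}) = \sigma(\mathbf{x}_u^T\mathbf{W}\mathbf{V}^T\mathbf{x}_a + \mathbf{b}^T\mathbf{x}_o)$, which is consistent with the gradients in (7)-(9) below but is not spelled out in this section; all names and the sample dimensions are illustrative only.

```python
import numpy as np

def predict_ctr(x_u, x_a, x_o, W, V, b):
    """Hypothetical CGL prediction, assuming (3) has the form
    h(x) = sigmoid(x_u^T W V^T x_a + b^T x_o)."""
    score = x_u @ W @ (V.T @ x_a) + b @ x_o   # coupled conjunction term + linear term
    return 1.0 / (1.0 + np.exp(-score))       # sigmoid

# Illustrative magnitudes consistent with the text (l, s in the tens of
# thousands, k below 50):
l, s, k = 20000, 30000, 50
print("CGL parameters :", (l + s) * k)   # (l + s) k = 2,500,000
print("LR conjunctions:", l * s)         # l * s     = 600,000,000
```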
3.2. Learning

The goal of our learning algorithm is to find the optimal values $\mathbf{b}^* \in \mathbb{R}^d$, $\mathbf{W}^* \in \mathbb{R}^{l\times k}$, $\mathbf{V}^* \in \mathbb{R}^{s\times k}$ that minimize the objective function in (4). The coupled part $(\mathbf{x}_u^T\mathbf{W})(\mathbf{x}_a^T\mathbf{V})^T$ makes the objective function non-convex. We adopt an alternating learning method to learn the parameters: each time we optimize one group of parameters with the other parameters fixed, and this procedure is repeated for several iterations until some termination condition is satisfied. More specifically, we first fix the ad parameter $\mathbf{V}$ and use limited-memory BFGS (L-BFGS) (Malouf, 2002; Andrew & Gao, 2007) to optimize the objective function with respect to (w.r.t.) $\mathbf{W}$ and $\mathbf{b}$ until convergence. Then we fix the user parameter $\mathbf{W}$ and optimize w.r.t. $\mathbf{V}$ and $\mathbf{b}$ until convergence. Note that the objective function is convex in either parameter matrix $\mathbf{W}$ or $\mathbf{V}$ alone, with the other fixed.

The L-BFGS algorithm belongs to the family of quasi-Newton methods (Broyden, 1970). L-BFGS stores only a few gradient vectors to approximate the Hessian matrix, so it is well suited to optimization problems with a large number of variables (Nocedal, 1980; Byrd et al., 1994). To use the L-BFGS algorithm, we only need to compute the gradient of the parameters.
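As an illustration of this alternating schedule, the sketch below uses SciPy's L-BFGS implementation. It only shows the optimization loop, not the authors' actual implementation; `loss_and_grad_W` and `loss_and_grad_V` are hypothetical helpers that should return the objective in (4) and the corresponding group gradient from (12) for the flattened $(\mathbf{W}, \mathbf{b})$ and $(\mathbf{V}, \mathbf{b})$ parameter vectors, respectively.

```python
import numpy as np
from scipy.optimize import minimize

def alternating_fit(W0, V0, b0, loss_and_grad_W, loss_and_grad_V, n_outer=10):
    """Alternate L-BFGS over (W, b) with V fixed and (V, b) with W fixed."""
    W, V, b = W0.copy(), V0.copy(), b0.copy()
    for _ in range(n_outer):
        # Optimize (W, b) with V fixed; the group is flattened into one vector.
        x0 = np.concatenate([W.ravel(), b])
        res = minimize(loss_and_grad_W, x0, args=(V,), jac=True, method="L-BFGS-B")
        W, b = res.x[:W.size].reshape(W.shape), res.x[W.size:]

        # Optimize (V, b) with W fixed.
        x0 = np.concatenate([V.ravel(), b])
        res = minimize(loss_and_grad_V, x0, args=(W,), jac=True, method="L-BFGS-B")
        V, b = res.x[:V.size].reshape(V.shape), res.x[V.size:]

        # A real implementation would check a termination condition here,
        # e.g. the relative decrease of the objective over the outer iteration.
    return W, V, b
```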
For ease of presentation, we use $\xi(\mathbf{x}, y)$ to denote $\xi(\mathbf{W}, \mathbf{V}, \mathbf{b}; \mathbf{x}, y)$ in (5), omitting the parameters. For each instance $(\mathbf{x}, y)$, the gradient contributed by this instance can be derived as follows:

$\frac{\partial \xi(\mathbf{x}, y)}{\partial b_i} = (h(\mathbf{x}) - y)\, x_{o_i}$,   (7)

$\frac{\partial \xi(\mathbf{x}, y)}{\partial W_{ij}} = x_{u_i} (h(\mathbf{x}) - y)\, \mathbf{x}_a^T \mathbf{V}_{*j}$,   (8)

$\frac{\partial \xi(\mathbf{x}, y)}{\partial V_{ij}} = x_{a_i} (h(\mathbf{x}) - y)\, \mathbf{x}_u^T \mathbf{W}_{*j}$,   (9)

where $x_{u_i}$, $x_{a_i}$, and $x_{o_i}$ denote the $i$th element of the vectors $\mathbf{x}_u$, $\mathbf{x}_a$, and $\mathbf{x}_o$, respectively.

The regularization part in (6) can be expanded as follows:

$\Omega(\mathbf{W}, \mathbf{V}) = \|\mathbf{W}\|_{2,1} + \|\mathbf{V}\|_{2,1} = \sum_{i=1}^{l}\sqrt{\sum_{j=1}^{k} W_{ij}^2} + \sum_{i=1}^{s}\sqrt{\sum_{j=1}^{k} V_{ij}^2} \approx \sum_{i=1}^{l}\sqrt{\sum_{j=1}^{k} W_{ij}^2 + \epsilon} + \sum_{i=1}^{s}\sqrt{\sum_{j=1}^{k} V_{ij}^2 + \epsilon}$,

where $\epsilon$ is a very small positive number that makes the regularization term differentiable. In practice, this approximation works well in our application.

The gradient of $\Omega(\mathbf{W}, \mathbf{V})$ can be derived as follows:

$\frac{\partial \Omega(\mathbf{W}, \mathbf{V})}{\partial W_{ij}} = \frac{W_{ij}}{\sqrt{\sum_{j'=1}^{k} W_{ij'}^2 + \epsilon}}$,   (10)

$\frac{\partial \Omega(\mathbf{W}, \mathbf{V})}{\partial V_{ij}} = \frac{V_{ij}}{\sqrt{\sum_{j'=1}^{k} V_{ij'}^2 + \epsilon}}$.   (11)

We can concatenate the parameter group $(\mathbf{W}, \mathbf{b})$ into a single parameter vector and then compute the gradient vector $g(\mathbf{W}, \mathbf{b})$. Similarly, we can compute the gradient vector $g(\mathbf{V}, \mathbf{b})$ of the parameter group $(\mathbf{V}, \mathbf{b})$. Assuming $t$ is the $\tau$-th parameter in a parameter group, the $\tau$-th element of the gradient vector $g(\cdot)$ has the following form:

$g_\tau(\cdot) = \sum_{i=1}^{N} \frac{\partial \xi(\mathbf{x}^{(i)}, y^{(i)})}{\partial t} + \lambda \frac{\partial \Omega(\mathbf{W}, \mathbf{V})}{\partial t}$,   (12)

where $\frac{\partial \xi(\mathbf{x}^{(i)}, y^{(i)})}{\partial t}$ is computed by using (7), (8), or (9), and $\frac{\partial \Omega(\mathbf{W}, \mathbf{V})}{\partial t}$ is computed by using (10) or (11), depending on the value of $t$. Then we can construct the approximate Hessian matrix $\tilde{\mathbf{H}}$ for L-BFGS.
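As a concrete illustration of (7)-(12), the sketch below computes the per-instance gradients and the smoothed group-lasso gradient with NumPy, and assembles the gradient of the $(\mathbf{W}, \mathbf{b})$ group. It reuses the assumed prediction form $h(\mathbf{x}) = \sigma(\mathbf{x}_u^T\mathbf{W}\mathbf{V}^T\mathbf{x}_a + \mathbf{b}^T\mathbf{x}_o)$ from the earlier sketch; the vectorized expressions are equivalent to the element-wise formulas above, and all function names are illustrative.

```python
import numpy as np

def instance_gradients(x_u, x_a, x_o, y, W, V, b):
    """Per-instance gradients of xi(x, y), vectorizing (7)-(9):
    grad_b[i]   = (h(x) - y) * x_o[i]
    grad_W[i,j] = x_u[i] * (h(x) - y) * (x_a^T V)[j]
    grad_V[i,j] = x_a[i] * (h(x) - y) * (x_u^T W)[j]
    """
    score = x_u @ W @ (V.T @ x_a) + b @ x_o       # assumed form of h(x) in (3)
    err = 1.0 / (1.0 + np.exp(-score)) - y        # h(x) - y
    grad_b = err * x_o                            # (7)
    grad_W = err * np.outer(x_u, x_a @ V)         # (8): outer product over (i, j)
    grad_V = err * np.outer(x_a, x_u @ W)         # (9)
    return grad_W, grad_V, grad_b

def group_lasso_gradient(M, eps=1e-8):
    """Gradient of the smoothed L2,1 norm, as in (10)/(11):
    M[i,j] / sqrt(sum_j' M[i,j']^2 + eps), row by row."""
    row_norms = np.sqrt((M ** 2).sum(axis=1, keepdims=True) + eps)
    return M / row_norms

def full_gradient_W_b(data, W, V, b, lam):
    """Gradient of the objective w.r.t. the (W, b) group, as in (12):
    sum of per-instance gradients plus lam times the regularization gradient."""
    gW, gb = np.zeros_like(W), np.zeros_like(b)
    for x_u, x_a, x_o, y in data:
        dW, _, db = instance_gradients(x_u, x_a, x_o, y, W, V, b)
        gW += dW
        gb += db
    gW += lam * group_lasso_gradient(W)   # Omega in (6) involves only W and V, so b gets no such term
    return gW, gb
```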