Coupled Group Lasso for Web-Scale CTR Prediction in Display Advertising

Ling Yan  YLING0718@SJTU.EDU.CN
Shanghai Key Laboratory of Scalable Computing and Systems, Department of Computer Science and Engineering, Shanghai Jiao Tong University, China

Wu-Jun Li  LIWUJUN@NJU.EDU.CN
National Key Laboratory for Novel Software Technology, Department of Computer Science and Technology, Nanjing University, China

Gui-Rong Xue  GRXUE@ALIBABA-INC.COM
Alibaba Group, China

Dingyi Han  DINGYI.HAN@ALIBABA-INC.COM
Alibaba Group, China

Abstract

In display advertising, click through rate (CTR) prediction is the problem of estimating the probability that an advertisement (ad) is clicked when displayed to a user in a specific context. Due to its easy implementation and promising performance, the logistic regression (LR) model has been widely used for CTR prediction, especially in industrial systems. However, it is not easy for LR to capture nonlinear information, such as conjunction information, from user features and ad features. In this paper, we propose a novel model, called coupled group lasso (CGL), for CTR prediction in display advertising. CGL can seamlessly integrate the conjunction information from user features and ad features for modeling. Furthermore, CGL can automatically eliminate useless features for both users and ads, which may facilitate fast online prediction. Scalability of CGL is ensured through feature hashing and distributed implementation. Experimental results on real-world data sets show that our CGL model can achieve state-of-the-art performance on web-scale CTR prediction tasks.

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

1. Introduction

Recently, online advertising has become the most popular and effective approach to brand promotion and product marketing. It is a multi-billion-dollar business on the web and accounts for the majority of the income of major internet companies such as Google, Yahoo and Alibaba. Display advertising is a big part of online advertising, in which advertisers pay publishers for placing graphical advertisements (ads) on the publishers' web pages (Chapelle et al., 2013). The publishers allocate some positions on their web pages and sell them to different advertisers. Users visit the web pages and can view the published ads. There are some other roles, such as ad agencies and publisher networks, that make up the complex advertising ecosystem (Muthukrishnan, 2009), but they are not the focus of this paper. We focus on scenarios with a user-advertiser-publisher tripartite business, in which the three parties have separate goals that can be reduced to a unified task in the end. Advertisers pay most attention to the desired user actions, such as clicks on the ads, subscriptions to a mailing list, or purchases of products. Different advertisers target different kinds of users. For example, a basketball company will be interested in users who bought a lot of sports equipment recently, and a hotel would prefer to display its ads to people who travel frequently. There are different payment options for advertisers, such as cost-per-click (CPC), cost-per-mille (CPM), and cost-per-conversion (CPA) (Mahdian & Tomak, 2007). For the publishers, the goal is to maximize the revenue from the advertisers and to attract more users to their web pages. Therefore, they had better precisely display suitable ads to a specific user and avoid hurting the user
experience on the web pages. From the users' perspective, they want to find useful information on the web pages and find ads that they are really interested in.

To satisfy the desires of all three parties, accurate targeting in the advertising system is of great importance, in which the click through rate (CTR) prediction of a user for a specific ad plays the key role (Chapelle et al., 2013). CTR prediction is the problem of estimating the probability that the display of an ad to a specific user will lead to a click. This challenging problem is at the heart of display advertising and has to deal with several hard issues, such as very large-scale data sets, frequently updated users and ads, and the inherently obscure connection between user profiles (features) and ad features.

Recently, many models have been proposed for CTR prediction in display advertising. Some models train standard classifiers, such as logistic regression (LR) (Neter et al., 1996) or generalized linear models, on a simple concatenation of user and ad features (Richardson et al., 2007; Graepel et al., 2010). Some other models use prior knowledge, like the inherent hierarchical information, for statistical smoothing in log-linear models (Agarwal et al., 2010) or LR models (Kuang-chih et al., 2012). In (Menon et al., 2011), a matrix factorization method is proposed, but it does not make use of user features. In (Stern et al., 2009), a probabilistic model is proposed to use user and item meta data together with collaborative filtering information, in which user and item feature vectors are mapped into a lower-dimensional space and the inner product is used to measure similarity. However, it does not have the effect of automatic feature selection from user and item meta features. In addition, inference in the model is too complicated for it to be used in a large-scale scenario. In (Chapelle et al., 2013), a highly scalable framework based on LR is proposed, and terabytes of data from real applications are used for evaluation. Due to its easy implementation and state-of-the-art performance, the LR model has become the most popular one for CTR prediction, especially in industrial systems (Chapelle et al., 2013). However, LR is a linear model, in which the features contribute to the final prediction independently. Hence, LR cannot capture the nonlinear information, such as the conjunction (Cartesian product) information, between user features and ad features. In real applications, the conjunction information is very important for CTR prediction. For example, people who have high buying power may have more interest in luxury products than those with low buying power, and college students may be more likely to buy machine learning books than high-school students. Better performance can be expected by exploiting the user-ad two-part hybrid features through feature conjunction.

In this paper, we propose a novel model, called coupled group lasso (CGL), for CTR prediction in display advertising. The main contributions are outlined as follows:

• CGL can seamlessly integrate the conjunction information from user features and ad features for modeling, which makes it capture the underlying connection between users and ads better than LR.

• CGL can automatically eliminate useless features for both users and ads, which may facilitate fast online prediction.

• CGL is scalable by exploiting feature hashing and a distributed implementation.
2. Background

In this section, we introduce the background of our model, including the description of the CTR prediction task, the LR model, and group lasso (Yuan & Lin, 2006; Meier et al., 2008).

2.1. Notation and Task

We use boldface lowercase letters, such as $\mathbf{v}$, to denote column vectors and $v_i$ to denote the $i$th element of $\mathbf{v}$. Boldface uppercase letters, such as $\mathbf{M}$, are used to denote matrices, with the $i$th row and the $j$th column of $\mathbf{M}$ denoted by $\mathbf{M}_{i*}$ and $\mathbf{M}_{*j}$, respectively. $M_{ij}$ is the element at the $i$th row and $j$th column of $\mathbf{M}$. $\mathbf{M}^T$ is the transpose of $\mathbf{M}$, and $\mathbf{v}^T$ is the transpose of $\mathbf{v}$.

While some display advertising systems have access to only some id information for users or ads, in this paper we focus on the scenarios where we can collect both user and ad features. Actually, the publishers can often collect user actions on the web pages, such as clicking on an ad, buying a product, or typing in some query keywords. They can analyze these historical behaviors and then construct user profiles (features). On the other hand, when advertisers submit some ads to the publishers, they often choose some description words, the groups of people to whom the ads should be displayed, or some other useful features.

We refer to a display of an ad to a particular user in a particular page view as an ad impression. Each impression is a case in which a user meets an ad in a specific context, such as daytime, weekday, and publishing position. Hence, each impression contains information of three aspects: the user, the ad, and the context. We use $\mathbf{x}_u$ of length $l$ to denote the feature vector of user $u$, and $\mathbf{x}_a$ of length $s$ to denote the feature vector of ad $a$. The context information, together with some advertiser id or ad id information, is composed into a feature vector $\mathbf{x}_o$ of length $d$. $\mathbf{x}$ is used to denote the feature vector of an impression, with $\mathbf{x}^T = (\mathbf{x}_u^T, \mathbf{x}_a^T, \mathbf{x}_o^T)$. Hence, if we use $z$ to denote the length of vector $\mathbf{x}$, we have $z = l + s + d$.
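To make the notation concrete, the following is a minimal sketch in Python/NumPy (the dimensions and array names are made up for illustration and are not from the paper) of how an impression's feature vector is assembled from its user, ad, and context parts:

    import numpy as np

    l, s, d = 8, 6, 4           # lengths of the user, ad, and context feature vectors
    x_u = np.random.rand(l)     # user features x_u
    x_a = np.random.rand(s)     # ad features x_a
    x_o = np.random.rand(d)     # context (and id) features x_o

    x = np.concatenate([x_u, x_a, x_o])   # impression feature vector x
    z = x.shape[0]                        # z = l + s + d
    assert z == l + s + d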
The result of an impression is click or non-click, which makes an instance in the data set. Given a training set $\{(\mathbf{x}^{(i)}, y^{(i)}) \mid i = 1, \ldots, N\}$, in which $\mathbf{x}^T = (\mathbf{x}_u^T, \mathbf{x}_a^T, \mathbf{x}_o^T)$ and $y \in \{0, 1\}$ with $y = 1$ denoting click and $y = 0$ denoting non-click in an impression, the CTR prediction problem is to learn a function $h(\mathbf{x}) = h(\mathbf{x}_u, \mathbf{x}_a, \mathbf{x}_o)$ which can be used to predict the probability that user $u$ clicks on ad $a$ in a specific context $o$.

2.2. Logistic Regression

The likelihood of LR is defined as $h_1(\mathbf{x}) = Pr(y = 1 \mid \mathbf{x}, \mathbf{w}) = \frac{1}{1 + \exp(-\mathbf{w}^T \mathbf{x})}$, where $\mathbf{w}$ is the parameter (weight vector) to learn. Please note that the bias term of LR has been integrated into $\mathbf{w}$ by adding an extra feature with constant value 1 to the feature vector. Given a training set $\{(\mathbf{x}^{(i)}, y^{(i)}) \mid i = 1, \ldots, N\}$, the weight vector $\mathbf{w}$ is found by minimizing the following regularized loss function:

\[
\min_{\mathbf{w}} \; \lambda \Omega_1(\mathbf{w}) + \sum_{i=1}^{N} \xi_1(\mathbf{w}; \mathbf{x}^{(i)}, y^{(i)}), \qquad (1)
\]
\[
\xi_1(\mathbf{w}; \mathbf{x}^{(i)}, y^{(i)}) = -\log\Big( [h_1(\mathbf{x}^{(i)})]^{y^{(i)}} [1 - h_1(\mathbf{x}^{(i)})]^{1 - y^{(i)}} \Big),
\]

where $\Omega_1(\mathbf{w})$ is the regularization term.

In real applications, we can use the following $L_2$-norm for regularization (Golub et al., 1999): $\Omega_1(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|_2^2 = \frac{\mathbf{w}^T \mathbf{w}}{2}$. The resulting model is the standard LR model. We can also use the following $L_1$-norm for regularization: $\Omega_1(\mathbf{w}) = \|\mathbf{w}\|_1 = \sum_{i=1}^{z} |w_i|$, where $z$ is the length of vector $\mathbf{w}$. The resulting model will be Lasso, which can be used for feature selection or elimination (Tibshirani, 1996). The optimization problem in (1) is easy to implement with promising performance, which makes LR very popular in industry. Please note that in the following content, LR refers to the LR model with $L_2$-norm regularization, and the LR model with the $L_1$-norm is called Lasso, as in much of the literature (Tibshirani, 1996).

2.3. Group Lasso

The group lasso is a technique to do variable selection on (predefined) groups of variables (Yuan & Lin, 2006; Meier et al., 2008). For a parameter vector $\boldsymbol{\beta} \in \mathbb{R}^z$, the regularization term in group lasso is defined as follows:

\[
\sum_{g=1}^{G} \|\boldsymbol{\beta}_{I_g}\|_2, \qquad (2)
\]

where $I_g$ is the index set belonging to the predefined $g$th group of variables, $g = 1, 2, \cdots, G$. The group lasso can be used together with linear regression (Yuan & Lin, 2006) or logistic regression (Meier et al., 2008) as a penalty. It is attractive for its property of doing variable selection at the group level, where all the variables in some groups will be zero after learning.

3. Coupled Group Lasso

Although LR has been widely used for CTR prediction, it cannot capture the conjunction information between user features and ad features. One possible solution is to manually construct the conjunction features from the original input features as the input of LR. However, as stated in (Chapelle et al., 2013), manual feature conjunction results in a quadratic number of new features, which makes it extraordinarily difficult to learn the parameters. Hence, the modeling ability of LR is too weak to capture the complex relationships in the data.

In this section, we introduce our coupled group lasso (CGL) model, which can easily model the conjunction information between users and ads to achieve better performance than LR.

3.1. Model

The likelihood of CGL is formulated as follows:

\[
h(\mathbf{x}) = Pr(y = 1 \mid \mathbf{x}, \mathbf{W}, \mathbf{V}, \mathbf{b}) = \sigma\big( (\mathbf{x}_u^T \mathbf{W})(\mathbf{x}_a^T \mathbf{V})^T + \mathbf{b}^T \mathbf{x}_o \big), \qquad (3)
\]

where $\mathbf{W}$ is a matrix of size $l \times k$, $\mathbf{V}$ is a matrix of size $s \times k$, $\mathbf{b}$ is a vector of length $d$, and $\sigma(x)$ is the sigmoid function with $\sigma(x) = \frac{1}{1 + \exp(-x)}$. Here, $\mathbf{W}$, $\mathbf{V}$ and $\mathbf{b}$ are the parameters to learn, and $k$ is a hyper-parameter.
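As a concrete illustration of Eq. (3), here is a minimal NumPy sketch (the dimensions, random initialization, and variable names are illustrative assumptions, not the paper's implementation) of computing the CGL likelihood $h(\mathbf{x})$ for one impression:

    import numpy as np

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    l, s, d, k = 8, 6, 4, 3                 # feature lengths and latent dimension k
    rng = np.random.default_rng(0)
    W = rng.normal(size=(l, k))             # user parameter matrix W (l x k)
    V = rng.normal(size=(s, k))             # ad parameter matrix V (s x k)
    b = rng.normal(size=d)                  # context parameter vector b

    x_u, x_a, x_o = rng.random(l), rng.random(s), rng.random(d)

    # Eq. (3): h(x) = sigma((x_u^T W)(x_a^T V)^T + b^T x_o)
    score = (x_u @ W) @ (x_a @ V) + b @ x_o
    h = sigmoid(score)                      # predicted click probability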
Furthermore, we put regularization on the negative log-likelihood to get the following optimization problem of CGL:

\[
\min_{\mathbf{W}, \mathbf{V}, \mathbf{b}} \; \sum_{i=1}^{N} \xi\big(\mathbf{W}, \mathbf{V}, \mathbf{b}; \mathbf{x}^{(i)}, y^{(i)}\big) + \lambda \Omega(\mathbf{W}, \mathbf{V}), \qquad (4)
\]

with

\[
\xi(\mathbf{W}, \mathbf{V}, \mathbf{b}; \mathbf{x}^{(i)}, y^{(i)}) = -\log\Big( [h(\mathbf{x}^{(i)})]^{y^{(i)}} [1 - h(\mathbf{x}^{(i)})]^{1 - y^{(i)}} \Big), \qquad (5)
\]
\[
\Omega(\mathbf{W}, \mathbf{V}) = \|\mathbf{W}\|_{2,1} + \|\mathbf{V}\|_{2,1}. \qquad (6)
\]

Here, $\|\mathbf{W}\|_{2,1} = \sum_{i=1}^{l} \sqrt{\sum_{j=1}^{k} W_{ij}^2} = \sum_{i=1}^{l} \|\mathbf{W}_{i*}\|_2$ is the $L_{2,1}$-norm of the matrix $\mathbf{W}$. Similarly, $\|\mathbf{V}\|_{2,1}$ is the $L_{2,1}$-norm of the matrix $\mathbf{V}$. From (2), it is easy to find that the $L_{2,1}$-norm is actually a group lasso regularization with each row being a group. Please note that we do not put regularization on $\mathbf{b}$ because from experiments we find that this regularization does not affect the performance.
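The regularized objective in Eqs. (4)-(6) can be evaluated as follows; this is a minimal sketch (reusing sigmoid, W, V, b, and rng from the previous sketch, with toy data and helper names that are assumptions for illustration):

    def log_loss(h, y, eps=1e-12):
        # Eq. (5): negative log-likelihood of one impression (eps added only for numerical stability)
        return -(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))

    def l21_norm(M):
        # L_{2,1}-norm: sum of the L2 norms of the rows of M (each row is one group)
        return np.sqrt((M ** 2).sum(axis=1)).sum()

    def cgl_objective(W, V, b, X_u, X_a, X_o, y, lam):
        # Eq. (4): sum of per-impression losses plus lambda * (||W||_{2,1} + ||V||_{2,1})
        scores = np.einsum('nk,nk->n', X_u @ W, X_a @ V) + X_o @ b
        h = sigmoid(scores)
        return log_loss(h, y).sum() + lam * (l21_norm(W) + l21_norm(V))

    # toy batch of N = 5 impressions
    N = 5
    X_u, X_a, X_o = rng.random((N, l)), rng.random((N, s)), rng.random((N, d))
    y = rng.integers(0, 2, size=N)
    print(cgl_objective(W, V, b, X_u, X_a, X_o, y, lam=0.1))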
We can find that there are two group lassos in (4), one for user features and the other for ad features. Furthermore, the two group lassos are coupled together to determine the CTR of an impression. Hence, our model is called coupled group lasso (CGL).

Because $(\mathbf{x}_u^T \mathbf{W})(\mathbf{x}_a^T \mathbf{V})^T = \mathbf{x}_u^T \mathbf{W} \mathbf{V}^T \mathbf{x}_a = \mathbf{x}_u^T (\mathbf{W}\mathbf{V}^T) \mathbf{x}_a$, it is very interesting to see that the term $(\mathbf{x}_u^T \mathbf{W})(\mathbf{x}_a^T \mathbf{V})^T$ in (3) can effectively model the conjunction information between user features and ad features. The number of parameters in $\mathbf{W}$ and $\mathbf{V}$ is only $(l + s)k$, where $k$ is typically a small number (less than 50 in our experiments). On the contrary, if we choose to manually construct the conjunction features for LR (Chapelle et al., 2013), the number of manual conjunction features is $l \times s$, with both $l$ and $s$ typically being tens of thousands in CTR prediction. Hence, the number of parameters of CGL is much smaller than that of LR with manual conjunction features, which makes CGL more scalable than LR for modeling conjunction features. Furthermore, the smaller number of parameters results in a lower probability of overfitting.

Another nice property of CGL comes from the regularization term of group lasso. It is easy to find that some rows of $\mathbf{W}$ and $\mathbf{V}$ will be all-zero after learning. Because each row corresponds to a feature of users or ads, we can eliminate the features corresponding to all-zero parameter values. Hence, when we use the learned model for online prediction, it is not necessary to collect those eliminated features for users and ads. This can not only save memory, but also speed up the online prediction procedure.

3.2. Learning

The goal of our learning algorithm is to find the optimal values $\mathbf{b}^* \in \mathbb{R}^d$, $\mathbf{W}^* \in \mathbb{R}^{l \times k}$, $\mathbf{V}^* \in \mathbb{R}^{s \times k}$ that minimize the objective function in (4). The coupled part $(\mathbf{x}_u^T \mathbf{W})(\mathbf{x}_a^T \mathbf{V})^T$ makes the objective function non-convex. We adopt an alternating learning method to learn the parameters. Each time we optimize one parameter with the other parameters fixed. This procedure is repeated for several iterations until some termination condition is satisfied. More specifically, we first fix the ad parameter $\mathbf{V}$ and use limited-memory BFGS (L-BFGS) (Malouf, 2002; Andrew & Gao, 2007) to optimize the objective function with respect to (w.r.t.) $\mathbf{W}$ and $\mathbf{b}$ until convergence. Then we fix the user parameter $\mathbf{W}$, and optimize w.r.t. $\mathbf{V}$ and $\mathbf{b}$ until convergence. Obviously, the objective function is convex in any one of its parameter matrices $\mathbf{W}$ or $\mathbf{V}$.

The L-BFGS algorithm is in the family of quasi-Newton methods (Broyden, 1970). L-BFGS stores only a few gradient vectors to approximate the Hessian matrix. Hence, it is more suitable for optimization problems with a large number of variables (Nocedal, 1980; Byrd et al., 1994). To use the L-BFGS algorithm, we only need to compute the gradient of the parameters.

For ease of presentation, we use $\xi(\mathbf{x}, y)$ to denote $\xi(\mathbf{W}, \mathbf{V}, \mathbf{b}; \mathbf{x}, y)$ in (5) by omitting the parameters. For each instance $(\mathbf{x}, y)$, the gradient contributed by this instance can be derived as follows:

\[
\frac{\partial \xi(\mathbf{x}, y)}{\partial b_i} = (h(\mathbf{x}) - y)\, x_{o_i}, \qquad (7)
\]
\[
\frac{\partial \xi(\mathbf{x}, y)}{\partial W_{ij}} = x_{u_i} (h(\mathbf{x}) - y)\, \mathbf{x}_a^T \mathbf{V}_{*j}, \qquad (8)
\]
\[
\frac{\partial \xi(\mathbf{x}, y)}{\partial V_{ij}} = x_{a_i} (h(\mathbf{x}) - y)\, \mathbf{x}_u^T \mathbf{W}_{*j}, \qquad (9)
\]

where $x_{u_i}$, $x_{a_i}$, and $x_{o_i}$ denote the $i$th element in vectors $\mathbf{x}_u$, $\mathbf{x}_a$, and $\mathbf{x}_o$, respectively.
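The per-instance gradients in Eqs. (7)-(9) have a simple vectorized form. The following minimal sketch (continuing the toy NumPy variables from the earlier sketches) computes all three at once:

    def instance_gradients(W, V, b, x_u, x_a, x_o, y):
        # shared residual (h(x) - y)
        r = sigmoid((x_u @ W) @ (x_a @ V) + b @ x_o) - y
        grad_b = r * x_o                         # Eq. (7)
        grad_W = r * np.outer(x_u, x_a @ V)      # Eq. (8): entry (i, j) is x_ui * r * (x_a^T V_*j)
        grad_V = r * np.outer(x_a, x_u @ W)      # Eq. (9): entry (i, j) is x_ai * r * (x_u^T W_*j)
        return grad_W, grad_V, grad_b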
The regularization part in (6) can be expanded as follows:

\[
\Omega(\mathbf{W}, \mathbf{V}) = \|\mathbf{W}\|_{2,1} + \|\mathbf{V}\|_{2,1}
= \sum_{i=1}^{l} \sqrt{\sum_{j=1}^{k} W_{ij}^2} + \sum_{i=1}^{s} \sqrt{\sum_{j=1}^{k} V_{ij}^2}
\approx \sum_{i=1}^{l} \sqrt{\sum_{j=1}^{k} W_{ij}^2 + \epsilon} + \sum_{i=1}^{s} \sqrt{\sum_{j=1}^{k} V_{ij}^2 + \epsilon},
\]

where $\epsilon$ is a very small positive number that makes the regularization term differentiable. Practically, it works well in our application.

The gradient of $\Omega(\mathbf{W}, \mathbf{V})$ can be derived as follows:

\[
\frac{\partial \Omega(\mathbf{W}, \mathbf{V})}{\partial W_{ij}} = \frac{W_{ij}}{\sqrt{\sum_{j=1}^{k} W_{ij}^2 + \epsilon}}, \qquad (10)
\]
\[
\frac{\partial \Omega(\mathbf{W}, \mathbf{V})}{\partial V_{ij}} = \frac{V_{ij}}{\sqrt{\sum_{j=1}^{k} V_{ij}^2 + \epsilon}}. \qquad (11)
\]

We can concatenate the parameter group $(\mathbf{W}, \mathbf{b})$ into a parameter vector, and then compute the gradient vector $\mathbf{g}(\mathbf{W}, \mathbf{b})$. Similarly, we can also compute the gradient vector $\mathbf{g}(\mathbf{V}, \mathbf{b})$ of the parameter group $(\mathbf{V}, \mathbf{b})$. Assume $t$ is the $\tau$-th parameter in each parameter group; the $\tau$-th element in the gradient vector $\mathbf{g}(\cdot)$ has the following form:

\[
g_\tau(\cdot) = \sum_{i=1}^{N} \frac{\partial \xi(\mathbf{x}^{(i)}, y^{(i)})}{\partial t} + \lambda \frac{\partial \Omega(\mathbf{W}, \mathbf{V})}{\partial t}, \qquad (12)
\]

where $\frac{\partial \xi(\mathbf{x}^{(i)}, y^{(i)})}{\partial t}$ is computed by using (7), (8), or (9), and $\frac{\partial \Omega(\mathbf{W}, \mathbf{V})}{\partial t}$ is computed by using (10) or (11), depending on the value of $t$. Then we can construct the approximate Hessian matrix $\tilde{\mathbf{H}}$ for L-BFGS.
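A minimal sketch (NumPy, continuing the earlier toy variables; eps plays the role of the smoothing constant $\epsilon$) of the smoothed regularizer gradients in Eqs. (10)-(11) and of assembling the full gradient in Eq. (12):

    def omega_gradient(M, eps=1e-8):
        # Eqs. (10)/(11): M_ij / sqrt(sum_j M_ij^2 + eps), computed row by row
        row_norms = np.sqrt((M ** 2).sum(axis=1, keepdims=True) + eps)
        return M / row_norms

    def full_gradient(W, V, b, X_u, X_a, X_o, y, lam):
        # Eq. (12): sum of per-instance gradients plus lambda times the regularizer gradient
        gW, gV, gb = np.zeros_like(W), np.zeros_like(V), np.zeros_like(b)
        for i in range(len(y)):
            dW, dV, db = instance_gradients(W, V, b, X_u[i], X_a[i], X_o[i], y[i])
            gW += dW
            gV += dV
            gb += db
        gW += lam * omega_gradient(W)
        gV += lam * omega_gradient(V)
        return gW, gV, gb   # b is not regularized (see the end of Section 3.1)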
The whole learning procedure for CGL is summarized in Algorithm 1. In each iteration, the alternating learning algorithm ensures that the objective function value always decreases. Furthermore, the objective function is bounded below by 0. In practice, when the loss function tends to be flat and its decrease becomes negligible, we can regard that as convergence. Because the objective function is not jointly convex in both W and V, the solution is a local optimum. In our implementation, the convergence condition of the algorithm is that the relative decrease of the objective function value becomes less than a threshold.

Algorithm 1 Alternate Learning for CGL
  Input: data set {(x^(i), y^(i)) | i = 1, ..., N}, and hyper-parameters k ∈ N+ and λ ∈ R+.
  Output: W*, V*, b*
  Initialize b = 0.
  Initialize W = random(R^{l×k}), V = random(R^{s×k}).
  repeat
    Fix V.
    repeat
      Compute gradient g(W, b) using (12).
      Compute the approximate Hessian H̃_{W,b} w.r.t. (W, b).
      d(W, b) = −H̃_{W,b} · g(W, b).
      Perform a line search in the direction of d(W, b) and update W, b.
    until convergence on W, b
    Fix W.
    repeat
      Compute gradient g(V, b) using (12).
      Compute the approximate Hessian H̃_{V,b} w.r.t. (V, b).
      d(V, b) = −H̃_{V,b} · g(V, b).
      Perform a line search in the direction of d(V, b) and update V, b.
    until convergence on V, b
  until convergence

3.3. Complexity Analysis

Let $q = (l + s)k + d$ denote the total number of parameters in $\mathbf{W}$, $\mathbf{V}$ and $\mathbf{b}$. To train the model, in each iteration we need $O(qN)$ time to compute the gradient $\mathbf{g}(\cdot)$, $O(q^2)$ time to compute the approximate Hessian matrix, and $O(q^2)$ time for the matrix multiplication and parameter update. Hence, the total time complexity is $O\big((qN + q^2)\mu\big)$ for $\mu$ iterations.

4. Web-Scale Implementation

Web-scale applications always contain a huge number of users and ads, with billions of impression instances. Hence, we need a scalable learning framework. In this section, we first introduce the hashing and sub-sampling techniques used for memory saving and class-imbalance handling. Furthermore, we propose a distributed learning framework based on the message passing interface (MPI), which can run on clusters with hundreds of computing nodes. By combining these techniques, our learning algorithm is highly scalable for web-scale applications.

4.1. Hashing and Sub-Sampling

We use the hashing technique (Weinberger et al., 2009) for efficient feature mapping and instance generating. The original features used in real CTR prediction systems are mainly categorical, and their number is typically very large. In order to make the feature mapping (coding) results uniform and to enable fast instance generating, we hash the user, ad and other (context) features into three separate subspaces of bit vectors. The structure of the hashing framework is illustrated in Figure 1, in which the raw representation of an impression is hashed to the instance representation with bit vectors. The raw representation of each impression is composed of several triples, with each triple being (domain, feature name, feature value). The domain can be user, ad, or other, which refers to the type of the feature. For example, the impression in Figure 1 contains k triples, (d1, f1, v1), (d2, f2, v2), ..., (dk, fk, vk). The instance representation of each impression is composed of three subspaces of bit vectors, which are denoted as User features, Ad features and Other features in Figure 1.
Given a triple, we can quickly get the index (also called position or key) of its coding in the feature space with a hash function. For example, if an impression has a user (domain) feature 'buypower' (feature name) with a value '5' (feature value), the hash function maps this triple (user, buypower, 5) to one position in the subspace of User features, and sets the corresponding position to 1. The resulting (key, triple) pairs are stored in the feature map for later searching and checking. The hashing technique enables efficient feature engineering and fast prediction together with a straightforward implementation. According to (Weinberger et al., 2009), hashing can not only contract the feature space by collision, but also bring a regularization effect.

Figure 1. The hashing framework for fast feature mapping and instance generating. (Figure not reproduced; it depicts the raw representation of (domain, feature, value) triples being hashed into the User, Ad and Other bit-vector subspaces of the instance representation.)
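The hashing scheme described above can be sketched as follows (a minimal Python illustration; the subspace sizes, the use of Python's built-in hash, and the helper names are assumptions rather than the production implementation, where a stable hash function such as MurmurHash would be used):

    import numpy as np

    SUBSPACE_SIZES = {"user": 2**18, "ad": 2**18, "other": 2**16}   # assumed subspace sizes

    def hash_triple(domain, feature_name, feature_value):
        # map a (domain, feature name, feature value) triple to a position in its subspace
        key = f"{domain}|{feature_name}|{feature_value}"
        return hash(key) % SUBSPACE_SIZES[domain]

    def encode_impression(triples):
        # build the three bit-vector subspaces of one impression
        vecs = {dom: np.zeros(n, dtype=np.int8) for dom, n in SUBSPACE_SIZES.items()}
        feature_map = {}
        for domain, name, value in triples:
            pos = hash_triple(domain, name, value)
            vecs[domain][pos] = 1
            feature_map[(domain, pos)] = (domain, name, value)  # kept for later searching and checking
        return vecs, feature_map

    vecs, fmap = encode_impression([("user", "buypower", 5), ("ad", "category", "hotel")])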
The data sets are typically highly unbalanced, with only a very small proportion of positive instances. In order to reduce the learning complexity with a minimal influence on accuracy, we sample negative instances with a probability of γ = 10% and keep all the positive instances. After sampling, we give a weight 1/γ to each negative instance during learning to make the objective calculation unbiased (McMahan et al., 2013).

4.2. Distributed Learning

In each iteration of Algorithm 1, we need to compute the gradient of all the parameters by using (12) and compute the current value of the objective function in (4). Both of these two equations contain summations over all instances, which can be easily parallelized on one machine (node) with multi-thread techniques or distributed across several machines in a cluster with MPI.

We implement a distributed learning framework for CGL based on an MPI cluster with hundreds of nodes (machines) and make full use of the parallel nature of the gradient computation. Let P be the number of nodes. We first evenly distribute the whole data set to the nodes, and the allocation of data will not be changed as the algorithm proceeds, which minimizes data movement and communication cost.

The AllReduce and BroadCast interfaces in MPI are used to communicate between the master and slave nodes. Furthermore, a synchronized routine is used to ensure the correctness of our system. We denote the gradient of the parameters computed locally on node $p$ as $\mathbf{g}'_p$. Please note that $\mathbf{g}'_p$ only contains the gradient of the instance part, with the regularization part discarded. The $\tau$-th element in the gradient vector $\mathbf{g}'_p$ has the following form:

\[
g'_{p\tau} = \sum_{i=1}^{p_n} \frac{\partial \xi(\mathbf{x}^{(i)}, y^{(i)})}{\partial t}, \qquad (13)
\]

where $p_n$ is the number of instances allocated to node $p$, and $t$ is the $\tau$-th parameter as stated in (12).

The sketch of the distributed learning framework is summarized in Algorithm 2.

Algorithm 2 Distributed Learning Framework for CGL
  Input: data set {(x^(i), y^(i)) | i = 1, ..., N}, and hyper-parameters k ∈ N+ and λ ∈ R+.
  Initialize the number of nodes: P.
  Initialize parameters: W, V, b.
  Split the data set across the P nodes evenly.
  repeat
    for all nodes {p = 1, 2, ..., P} do in parallel
      Compute gradient g'_p locally on node p using (13).
    end for
    Compute gradient g' = Σ_{p=1}^{P} g'_p with AllReduce.
    Add the gradient of the regularization term to g' in the master node.
    Take an L-BFGS step in the master node.
    BroadCast the updated parameters to each slave node.
  until convergence
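The per-iteration communication pattern of Algorithm 2 can be sketched with mpi4py as follows (a minimal illustration; the use of mpi4py, the toy data, and the reuse of full_gradient and omega_gradient from the earlier sketches are assumptions, not the paper's actual MPI implementation):

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, P = comm.Get_rank(), comm.Get_size()   # P nodes; this process is node `rank`

    l, s, d, k, lam = 8, 6, 4, 3, 0.1
    rng = np.random.default_rng(0)               # same seed, so all nodes start from identical W, V, b
    W, V, b = rng.normal(size=(l, k)), rng.normal(size=(s, k)), rng.normal(size=d)

    # toy local data shard for this node (in the real system each node holds its own part of the data)
    local = np.random.default_rng(rank)
    n_p = 100
    X_u, X_a, X_o = local.random((n_p, l)), local.random((n_p, s)), local.random((n_p, d))
    y = local.integers(0, 2, size=n_p)

    # local gradient of the instance part only (Eq. (13)): call full_gradient with lam = 0
    gW_p, gV_p, gb_p = full_gradient(W, V, b, X_u, X_a, X_o, y, lam=0.0)

    # AllReduce: sum the local gradients over all P nodes
    gW, gV, gb = np.empty_like(gW_p), np.empty_like(gV_p), np.empty_like(gb_p)
    comm.Allreduce(gW_p, gW, op=MPI.SUM)
    comm.Allreduce(gV_p, gV, op=MPI.SUM)
    comm.Allreduce(gb_p, gb, op=MPI.SUM)

    if rank == 0:
        # the master adds the regularization gradient and would take an L-BFGS step here (omitted)
        gW += lam * omega_gradient(W)
        gV += lam * omega_gradient(V)

    # BroadCast the (updated) parameters from the master back to all slave nodes
    W, V, b = comm.bcast((W, V, b), root=0)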
5. Experiment

We have an MPI cluster with hundreds of nodes, each of which is a 24-core server with a 2.2GHz Intel(R) Xeon(R) E5-2430 processor and 96GB of RAM. However, we only use 80 nodes of the cluster for our experiments. One reason is that 80 nodes are enough to handle real web-scale CTR applications. Another reason is that the whole cluster is shared by many groups who are running different jobs, and we do not want to interrupt other jobs. However, our method is highly scalable and can use more nodes for larger-scale problems, which will be verified by our following experiments (refer to Figure 4).

5.1. Data Set

We conduct our experiments on three real-world data sets collected from Taobao of Alibaba Group¹. The training sets of these three data sets contain the log information of display ads across different time periods with different time window sizes, and the subsequent (next) day's log information of each training set is used for testing to evaluate the performance of our model. We compose the data sets on weekdays or holidays from different months to make the data sets vary from each other and make the results of our model more convincing. There are billions of impression instances in each data set, and the dimensionality of the feature vector for each impression is tens of thousands. These three data sets are named Dataset-1, Dataset-2 and Dataset-3, respectively. Train 1 and Test 1 are the training set and test set of Dataset-1; similar names are used for the other two data sets. The characteristics of these three data sets are briefly summarized in Table 1². Please note that the CTR in Table 1 for a data set is computed as Number of Clicks / Number of Impressions, which is actually the proportion of positive instances in the data set. We can find that all the data sets are highly unbalanced, with only a very small proportion of positive instances. It is easy to see that the data set sizes of our experiments are of web-scale.

We sample 20% of each training set for validation to specify the hyper-parameters of our CGL model and the other baselines. The k in CGL is fixed to 50 in our experiments unless otherwise stated.

¹ We build our data sets from the logs of the advertisements displayed on http://www.taobao.com, one of the most famous C2C e-commerce web sites in China. The business model of Taobao is similar to that of eBay (http://www.ebay.com).
² Dataset-2 contains the log information of a long holiday, during which people may go outside and have no time for surfing the internet, so the numbers of instances and users are relatively small.
Table 1. Characteristics of the three data sets, which contain training data of 4 days, 10 days and 7 days from different time periods, respectively. The log of the day following each training period is used as the corresponding test set.

DATA SET   # INSTANCES (BILLION)   CTR (%)   # ADS    # USERS (MILLION)   STORAGE (TB)
TRAIN 1    1.011                   1.62      21,318   874.7               1.895
TEST 1     0.295                   1.70      11,558   331.0               0.646
TRAIN 2    1.184                   1.61      21,620   958.6               2.203
TEST 2     0.145                   1.64      6,848    190.3               0.269
TRAIN 3    1.491                   1.75      33,538   1119.3              2.865
TEST 3     0.126                   1.70      9,437    183.7               0.233

5.2. Evaluation Metrics and Baseline

5.2.1. METRIC

We can regard CTR prediction as a binary classification problem. Because the data set is highly unbalanced, with only a small proportion of positive instances, prediction accuracy is not a good evaluation metric; neither precision nor recall is a good metric either. In this paper, we adopt the area under the receiver operating characteristic curve (AUC) (Bradley, 1997) to measure prediction accuracy, which has been widely used in the existing literature on CTR prediction (Chapelle et al., 2013). For a random guesser, the AUC value is 0.5, which means a total lack of discrimination. To allow a fair comparison with baseline models, we first remove this constant part (0.5) from the AUC value and then compute the relative improvement (RelaImpr) of our model, defined as:

    RelaImpr = ( (AUC(model) − 0.5) / (AUC(baseline) − 0.5) − 1 ) × 100%.

This RelaImpr metric has been widely adopted in industry for comparing the discrimination of models (see http://en.wikipedia.org/wiki/Receiver_operating_characteristic).

Our CGL model has the effect of selecting or eliminating features for both users and ads. We introduce group sparsity (GSparsity) to measure the capability of our model in feature elimination: GSparsity = ν / (l + s) × 100%, where ν is the total number of all-zero rows in the parameter matrices W and V, and l and s are the numbers of rows in W and V, respectively.

5.2.2. BASELINE

Because the LR model (with L2-norm regularization) has been widely used for CTR prediction and has achieved state-of-the-art performance, especially in industrial systems (Chapelle et al., 2013), we adopt LR as the baseline for comparison. Please note that LR refers to the model in (1) with L2-norm regularization; the model in (1) with L1-norm regularization is called Lasso in this paper.

5.3. Accuracy of Lasso

Table 2 reports the relative improvement (RelaImpr) of Lasso w.r.t. the baseline (LR). We can see that there is no significant difference between LR and Lasso in terms of prediction accuracy.

Table 2. Relative improvement of Lasso w.r.t. the baseline (LR).

DATA SET    DATASET-1   DATASET-2   DATASET-3
RELAIMPR    −0.019%     −0.096%     +0.086%

5.4. Accuracy of CGL

Please note that in Algorithm 1, W and V are randomly initialized, which may affect the performance. We perform six independent rounds of experiments with different initializations. The mean and variance of the relative improvement of our CGL model (with k = 50) w.r.t. the baseline LR are reported in Figure 2. It is easy to see that our CGL model significantly outperforms LR on all three data sets. Furthermore, the random initialization has a negligible influence on the performance. Hence, in the following experiments, we do not report the variance of the values.
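Both quantities are easy to compute from held-out predictions and the learned parameter matrices. The listing below is a minimal sketch, assuming NumPy arrays and scikit-learn's roc_auc_score; the helper names are illustrative and not part of the paper's implementation, and the final ranking function is an added assumption (one plausible way to read off the "important" features discussed in Section 5.5), not something the paper specifies.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def rela_impr(y_true, scores_model, scores_baseline):
        # Relative improvement after removing the 0.5 AUC of a random guesser.
        auc_model = roc_auc_score(y_true, scores_model)
        auc_base = roc_auc_score(y_true, scores_baseline)
        return ((auc_model - 0.5) / (auc_base - 0.5) - 1.0) * 100.0

    def group_sparsity(W, V, tol=1e-12):
        # Percentage of all-zero rows in W (l x k, user part) and V (s x k, ad part);
        # an all-zero row means the corresponding feature has been eliminated.
        zero_w = np.sum(np.all(np.abs(W) <= tol, axis=1))
        zero_v = np.sum(np.all(np.abs(V) <= tol, axis=1))
        return 100.0 * (zero_w + zero_v) / (W.shape[0] + V.shape[0])

    def rank_features_by_row_norm(M):
        # Illustrative assumption: order features by the L2 norm of their parameter
        # rows, largest first; rows with (near-)zero norm are eliminated features.
        norms = np.linalg.norm(M, axis=1)
        return np.argsort(-norms), norms

For example, applying group_sparsity to the learned W and V gives GSparsity values of the kind reported in Table 3 below.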
Figure 2. The relative improvement (RelaImpr, in %) of CGL w.r.t. the baseline (LR) on the three real-world data sets.

5.5. Sensitivity to Hyper-Parameters

In this subsection, we study the influence of the two key hyper-parameters, k and λ, in our CGL model.

Experiments are conducted for different values of k, and the results are shown in Figure 3 (a). We can find that, as k increases, the performance generally improves. But a larger k implies more parameters, which makes learning much harder in terms of both memory and speed. We find that k = 50 is a suitable value for our experiments. Hence, we choose k = 50 for all the experiments in this paper.
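A sensitivity study of this kind is simply a sweep over candidate values of k around a training routine. The sketch below shows the pattern only; train_cgl and its predict method are hypothetical stand-ins for the paper's distributed solver, and rela_impr is the helper sketched earlier.

    def sweep_k(train_data, test_features, y_test, baseline_scores,
                ks=(10, 20, 30, 40, 50, 60), lam=1.0):
        # Train one model per candidate k and record its RelaImpr over the LR baseline.
        results = {}
        for k in ks:
            model = train_cgl(train_data, k=k, lam=lam)   # hypothetical training routine
            scores = model.predict(test_features)         # hypothetical predicted CTRs
            results[k] = rela_impr(y_test, scores, baseline_scores)
        return results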
Figure 3. The influence of hyper-parameters k and λ: (a) RelaImpr (%) versus k; (b) RelaImpr (%) versus the regularization parameter λ (log scale), on the three data sets.

We vary the values of the hyper-parameter λ and show its influence on the performance in Figure 3 (b). We can find that very good performance is achieved when λ is around 1 for all data sets, and our CGL is not sensitive to λ in a relatively large range (from 0.1 to 10).

Actually, λ controls the tradeoff between the prediction accuracy and the number of eliminated features (GSparsity). We choose Dataset-2 for demonstration. The relationship between relative improvement and GSparsity on Dataset-2 is shown in Table 3. We can find that our CGL does have the ability to eliminate some features. For this data set, a GSparsity of 3%–15% is a good tradeoff between feature elimination and prediction accuracy.

Table 3. Tradeoff between relative improvement of performance w.r.t. the baseline (LR) and GSparsity on Dataset-2.

GSPARSITY   2%      3%      5%      15%     20%
RELAIMPR    3.90%   3.42%   3.02%   2.50%   1.97%

We take a closer look at the parameter matrices for users and ads, and show the most important and the most useless features in Table 4. They are all categorical features. The most important features for the ad part are popular or hot product categories, such as clothes, skirt and dress, which seems reasonable. The useless features for the ad part include Movie, Act, Takeout, and Food booking service. This is also reasonable because few users buy such products, e.g., food booking services, from Taobao. The most important features for the user part are categories in which users show great interest, such as daily necessities and clothes. The useless features for users are some cold categories and some rare items, such as stage costume and flooring.

Table 4. Feature selection results.

                       AD PART                               USER PART
IMPORTANT FEATURES     WOMEN'S CLOTHES, SKIRT, DRESS,        WATCH, UNDERWEAR, FUR CLOTHING,
                       CHILDREN'S WEAR, SHOES, CELLPHONE     FURNITURE
USELESS FEATURES       MOVIE, ACT, TAKEOUT,                  STAGE COSTUME, FLOORING, PENCIL,
                       FOOD BOOKING SERVICE                  OUTDOOR SOCK

5.6. Scalability

To study the scalability of our distributed learning framework, we compute the speedup factor relative to the running time with 20 nodes by varying the number of nodes from 20 to 80. Experiments are repeated several times, and the mean and variance of the speedup factor are reported in Figure 4. We can find that the speedup is close to linear, and close to the ideal speedup factors. Hence, our CGL is very scalable for web-scale applications.

Figure 4. The speedup of the distributed learning framework versus the number of computing nodes (20 to 80) on the three data sets, compared with the ideal linear speedup.

The whole training time (in seconds) of CGL on 80 nodes is shown in Table 5, from which we can find that CGL is fast enough to handle web-scale applications.

Table 5. Training time (in seconds).

DATA SET    DATASET-1     DATASET-2     DATASET-3
TIME        3,184 ± 431   3,296 ± 387   4,281 ± 541
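For completeness, the speedup factors plotted in Figure 4 can be derived directly from measured wall-clock training times. A minimal sketch follows; the timing numbers in the usage example are placeholders for illustration, not measurements from the paper.

    def speedup_factors(seconds_by_nodes, base_nodes=20):
        # Speedup of each configuration relative to the run with base_nodes nodes.
        base_time = seconds_by_nodes[base_nodes]
        return {n: base_time / t for n, t in sorted(seconds_by_nodes.items())}

    def ideal_speedup(node_counts, base_nodes=20):
        # Ideal linear speedup: p nodes are p / base_nodes times faster.
        return {n: n / base_nodes for n in sorted(node_counts)}

    # Placeholder timings (seconds), for illustration only:
    measured = {20: 12000.0, 40: 6300.0, 80: 3300.0}
    print(speedup_factors(measured))   # {20: 1.0, 40: 1.90..., 80: 3.63...}
    print(ideal_speedup(measured))     # {20: 1.0, 40: 2.0, 80: 4.0}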
6. Conclusion and Future Work

In this paper, a novel model called CGL is proposed to capture the conjunction information between features, which can outperform state-of-the-art models in web-scale CTR prediction for display advertising. Moreover, our CGL is general enough to model other similar applications whose outputs are determined by two interacting roles. One direction for future work is to pursue new applications of our model.

7. Acknowledgement

This work is supported by the NSFC (No. 61100125), the 863 Program of China (No. 2012AA011003), and the Program for Changjiang Scholars and Innovative Research Team in University of China (IRT1158, PCSIRT). Wu-Jun Li is the corresponding author.
References

Agarwal, Deepak, Agrawal, Rahul, Khanna, Rajiv, and Kota, Nagaraj. Estimating rates of rare events with multiple hierarchies through scalable log-linear models. In KDD, pp. 213–222, 2010.

Andrew, Galen and Gao, Jianfeng. Scalable training of l1-regularized log-linear models. In ICML, pp. 33–40, 2007.

Bradley, Andrew P. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7):1145–1159, 1997.

Broyden, C. G. The convergence of a class of double-rank minimization algorithms 1. General considerations. IMA Journal of Applied Mathematics, 6(1):76–90, 1970.

Byrd, Richard H., Nocedal, Jorge, and Schnabel, Robert B. Representations of quasi-Newton matrices and their use in limited memory methods. Mathematical Programming, 63:129–156, 1994.

Chapelle, Olivier, Manavoglu, Eren, and Rosales, Romer. Simple and scalable response prediction for display advertising. ACM Transactions on Intelligent Systems and Technology, 2013.

Golub, Gene H., Hansen, Per Christian, and O'Leary, Dianne P. Tikhonov regularization and total least squares. SIAM Journal on Matrix Analysis and Applications, 21(1):185–194, 1999.

Graepel, Thore, Candela, Joaquin Quiñonero, Borchert, Thomas, and Herbrich, Ralf. Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine. In ICML, pp. 13–20, 2010.

Lee, Kuang-chih, Orten, Burkay, Dasdan, Ali, and Li, Wentong. Estimating conversion rate in display advertising from past performance data. In KDD, pp. 768–776, 2012.

Mahdian, Mohammad and Tomak, Kerem. Pay-per-action model for online advertising. In WINE, pp. 549–557, 2007.

Malouf, Robert. A comparison of algorithms for maximum entropy parameter estimation. In CoNLL, pp. 49–55, 2002.

McMahan, H. Brendan, Holt, Gary, Sculley, David, Young, Michael, Ebner, Dietmar, Grady, Julian, Nie, Lan, Phillips, Todd, Davydov, Eugene, Golovin, Daniel, Chikkerur, Sharat, Liu, Dan, Wattenberg, Martin, Hrafnkelsson, Arnar Mar, Boulos, Tom, and Kubica, Jeremy. Ad click prediction: a view from the trenches. In KDD, pp. 1222–1230, 2013.

Meier, Lukas, van de Geer, Sara, and Bühlmann, Peter. The group lasso for logistic regression. Journal of the Royal Statistical Society, Series B, 70(1):53–71, 2008.

Menon, Aditya Krishna, Chitrapura, Krishna Prasad, Garg, Sachin, Agarwal, Deepak, and Kota, Nagaraj. Response prediction using collaborative filtering with hierarchies and side-information. In KDD, pp. 141–149, 2011.

Muthukrishnan, S. Ad exchanges: Research issues. In WINE, pp. 1–12, 2009.

Neter, John, Wasserman, William, and Kutner, Michael H. Applied Linear Statistical Models, volume 4. Irwin, Chicago, 1996.

Nocedal, Jorge. Updating quasi-Newton matrices with limited storage. Mathematics of Computation, 35(151):773–782, 1980.

Richardson, Matthew, Dominowska, Ewa, and Ragno, Robert. Predicting clicks: estimating the click-through rate for new ads. In WWW, pp. 521–530, 2007.

Stern, David H., Herbrich, Ralf, and Graepel, Thore. Matchbox: large scale online Bayesian recommendations. In WWW, pp. 111–120, 2009.

Tibshirani, Robert. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, pp. 267–288, 1996.

Weinberger, Kilian Q., Dasgupta, Anirban, Langford, John, Smola, Alexander J., and Attenberg, Josh. Feature hashing for large scale multitask learning. In ICML, pp. 140, 2009.

Yuan, Ming and Lin, Yi. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68:49–67, 2006.