Efficient Bayesian Hierarchical User Modeling for Recommendation Systems

Yi Zhang, Jonathan Koren
School of Engineering, University of California Santa Cruz
Santa Cruz, CA, USA
{yiz, jonathan}@soe.ucsc.edu

ABSTRACT

A content-based personalized recommendation system learns user-specific profiles from user feedback so that it can deliver information tailored to each individual user's interest. A system serving millions of users can learn a better user profile for a new user, or a user with little feedback, by borrowing information from other users through the use of a Bayesian hierarchical model. Learning the model parameters to optimize the joint data likelihood from millions of users is very computationally expensive. The commonly used EM algorithm converges very slowly due to the sparseness of the data in IR applications. This paper proposes a new fast learning technique to learn a large number of individual user profiles. The efficacy and efficiency of the proposed algorithm are justified by theory and demonstrated on actual user data from Netflix and MovieLens.

1. INTRODUCTION

Personalization is the future of the Web, and it has achieved great success in industrial applications. For example, online stores, such as Amazon and Netflix, provide customized recommendations for additional products or services based on a user's history. Recent offerings such as My MSN, My Yahoo!, My Google, and Google News have attracted much attention due to their potential ability to infer a user's interests from his/her history.

One major personalization topic studied in the information retrieval community is content-based personal recommendation systems.¹ These systems learn user-specific profiles from user feedback so that they can recommend information tailored to each individual user's interest without requiring the user to make an explicit query. Learning the user profiles is the core problem for these systems.

¹ Content-based recommendation is also called adaptive filtering, or item-based collaborative filtering. In this paper, the words "filtering" and "recommendation" are used interchangeably.

A user profile is usually a classifier that can identify whether a document is relevant to the user or not, or a regression model that tells how relevant a document is to the user. One major challenge of building a recommendation or personalization system is that the profile learned for a particular user is usually of low quality when the amount of data from that particular user is small. This is known as the "cold start" problem. This means that any new user must endure poor initial performance until sufficient feedback from that user is provided to learn a reliable user profile.

There has been much research on improving classification accuracy when the amount of labeled training data is small. The semi-supervised learning approach combines unlabeled and labeled data together to achieve this goal [26]. Another approach is using domain knowledge.
Researchers have modified different learning algorithms, such as Naïve Bayes [17], logistic regression [7], and SVMs [22], to integrate domain knowledge into a text classifier. The third approach is borrowing training data from other resources [5][7]. The effectiveness of these different approaches is mixed, depending on how well the underlying model assumption fits the data.

One well-received approach to improving recommendation system performance for a particular user is borrowing information from other users through a Bayesian hierarchical modeling approach. Several researchers have demonstrated that this approach effectively trades off between shared and user-specific information, thus alleviating poor initial performance for each user [27][25].

In order to learn a Bayesian hierarchical model, the system usually tries to find the most likely model parameters for the given data. A mature recommendation system usually works for millions of users. It is well known that learning the optimal parameters of a Bayesian hierarchical model is computationally expensive when there are thousands or millions of users. The EM algorithm is a commonly used technique for parameter learning due to its simplicity and convergence guarantee. However, a content-based recommendation system often handles documents in a very high dimensional space, in which each document is represented by a very sparse vector. With careful analysis of the EM algorithm in this scenario (Section 4), we find that the EM algorithm converges very slowly due to the sparseness of the input variables. We also find that updating the model parameters at each EM iteration is expensive, with a computational complexity of O(MK), where M is the number of users and K is the number of dimensions.

This paper modifies the standard EM algorithm to create an improved learning algorithm, which we call the "Modified EM algorithm." The basic idea is that instead of calculating
the numerical solution for all the user profile parameters, we derive the analytical solution of the parameters for some feature dimensions, and at the M step use the analytical solution instead of the numerical solution estimated at the E step for those parameters. This greatly reduces the computation at a single EM iteration, and also has the benefit of increasing the convergence speed of the learning algorithm. The proposed technique is not only well supported by theory, but also by experimental results.

The organization of the remaining parts of this paper is as follows: Section 3 describes the Bayesian hierarchical linear regression modeling framework used for content-based recommendations. Section 4 describes how to learn the model parameters using the standard EM algorithm, along with using the new technique proposed in this paper. The experimental setting and results used to validate the proposed learning technique are reported in Sections 5 and 6. Section 7 summarizes and offers concluding remarks.

2. RELATED WORK

Providing personalized recommendations to users has been identified as a very important problem in the IR community since the 1970s. The approaches that have been used to solve this problem can be roughly classified into two major categories: content-based filtering versus collaborative filtering. Content-based filtering studies the scenario where a recommendation system monitors a document stream and pushes documents that match a user profile to the corresponding user. The user may read the delivered documents and provide explicit relevance feedback, which the filtering system then uses to update the user's profile using relevance feedback retrieval models (e.g. Boolean models, vector space models, traditional probabilistic models [20], inference networks [3], and language models [6]) or machine learning algorithms (e.g. Support Vector Machines (SVM), K nearest neighbors (K-NN), clustering, neural networks, logistic regression, or Winnow [16][4][23]). Collaborative filtering goes beyond merely using document content to recommend items to a user by leveraging information from other users with similar tastes and preferences in the past. Memory-based heuristics and model-based approaches have been used in the collaborative filtering task [15][8][2][14][12][11].

This paper contributes to content-based recommendation research by improving the efficiency and effectiveness of Bayesian hierarchical linear models, which have a strong theoretical basis and good empirical performance on recommendation tasks [27][25]. This paper does not intend to compare content-based filtering with collaborative filtering or claim which one is a better solution to the problem. We think each complements the other, and that content-based filtering is extremely useful for handling new documents/items with little or no user feedback. Similar to some other researchers [18][1][21], we found that a recommendation system will be more effective when both techniques are combined. However, this is beyond the scope of this paper and thus not discussed here.

3. BAYESIAN HIERARCHICAL LINEAR REGRESSION

Assume there are M users in the system. The task of the system is to recommend documents that are relevant to each user. For each user, the system learns a user model from the user's history. In the rest of this paper, we will use the following notations to represent the variables in the system.

m = 1, 2, ..., M: The index for each individual user. M is the total number of users.
w_m: The user model parameter associated with user m. w_m is a K dimensional vector.

j = 1, 2, ..., J_m: The index for a set of data for user m. J_m is the number of training data for user m.

D_m = {(x_{m,j}, y_{m,j})}: A set of data associated with user m. x_{m,j} is a K dimensional vector that represents the mth user's jth training document.² y_{m,j} is a scalar that represents the label of document x_{m,j}.

k = 1, 2, ..., K: The dimensional index of input variable x.

² The first dimension of x is a dummy variable that always equals 1.

The Bayesian hierarchical modeling approach has been widely used in real-world information retrieval applications. Generalized Bayesian hierarchical linear models, one of the simplest Bayesian hierarchical models, are commonly used and have achieved good performance on collaborative filtering [25] and content-based adaptive filtering [27] tasks.

Figure 1 shows the graphical representation of a Bayesian hierarchical model. In this graph, each user model is represented by a random vector w_m. We assume a user model is sampled randomly from a prior distribution P(w|Φ). The system can predict the user label y of a document x given an estimation of w_m (or w_m's distribution) using a function y = f(x, w). The model is called a generalized Bayesian hierarchical linear model when y = f(w^T x) is any generalized linear model such as logistic regression, SVM, or linear regression. To reliably estimate the user model w_m, the system can borrow information from other users through the prior Φ = (µ, Σ).

Now we look at one commonly used model where y = w^T x + ε, with ε ∼ N(0, σ_ε²) a random noise term [25][27]. Assume that each user model w_m is an independent draw from a population distribution P(w|Φ), which is governed by some unknown hyperparameter Φ. Let the prior distribution of user model w be a Gaussian distribution with parameter Φ = (µ, Σ), which is the commonly used prior for linear models. µ = (µ_1, µ_2, ..., µ_K) is a K dimensional vector that represents the mean of the Gaussian distribution, and Σ is the covariance matrix of the Gaussian. Usually, a Normal distribution N(0, aI) and an Inverse Wishart distribution P(Σ) ∝ |Σ|^(-b/2) exp(-(c/2) tr(Σ⁻¹)) are used as hyperpriors to model the prior distributions of µ and Σ, respectively. I is the K dimensional identity matrix, and a, b, and c are real numbers.

With these settings, we have the following generative model for the system:

1. µ and Σ are sampled from N(0, aI) and IW_ν(aI), respectively.

2. For each user m, w_m is sampled randomly from a Normal distribution: w_m ∼ N(µ, Σ²).

3. For each item x_{m,j}, y_{m,j} is sampled randomly from a Normal distribution: y_{m,j} ∼ N(w_m^T x_{m,j}, σ_ε²).
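To make the generative process above concrete, here is a minimal Python/NumPy sketch that samples synthetic data following steps 1-3. It assumes a diagonal Σ for simplicity, uses a per-dimension inverse-gamma draw as a stand-in for the Inverse Wishart hyperprior, and all sizes and hyperparameter values are illustrative rather than the paper's settings.

    import numpy as np

    rng = np.random.default_rng(0)
    K, M = 20, 5            # illustrative: feature dimensions, number of users
    a, nu = 0.1, K + 2      # illustrative hyperparameter values
    sigma2_eps = 1.0        # observation noise variance, assumed given

    # Step 1: draw the population prior Phi = (mu, Sigma^2).
    # Diagonal stand-in for the Inverse Wishart hyperprior on Sigma.
    mu = rng.normal(0.0, np.sqrt(a), size=K)
    sigma2 = 1.0 / rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=K)

    users = []
    for m in range(M):
        # Step 2: each user model w_m ~ N(mu, Sigma^2).
        w_m = rng.normal(mu, np.sqrt(sigma2))
        # Step 3: each label y_{m,j} ~ N(w_m^T x_{m,j}, sigma_eps^2).
        J_m = rng.integers(5, 15)
        X = (rng.random((J_m, K)) < 0.1).astype(float)  # sparse binary features
        X[:, 0] = 1.0   # first dimension is the dummy intercept feature
        y = X @ w_m + rng.normal(0.0, np.sqrt(sigma2_eps), size=J_m)
        users.append((X, y))

The sparse binary feature matrix mimics the IR setting discussed later: most (user, feature) pairs never co-occur.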
Figure 1: Illustration of dependencies of variables in the hierarchical model. The rating, y, for a document, x, is conditioned on the document and the user model, w_m, associated with the user m. Users share information about their models through the prior, Φ = (µ, Σ).

Let θ = (Φ, w_1, w_2, ..., w_M) represent the parameters of this system that need to be estimated. The joint likelihood for all the variables in the probabilistic model, which includes the data and the parameters, is:

P(D, θ) = P(Φ) ∏_m P(w_m|Φ) ∏_j P(y_{m,j}|x_{m,j}, w_m)    (1)

For simplicity, we assume a, b, c, and σ_ε are provided to the system.

4. MODEL PARAMETER LEARNING

If the prior Φ is known, finding the optimal w_m is straightforward: it is a simple linear regression. Therefore, we will focus on estimating Φ. The maximum a posteriori solution of Φ is given by

Φ_MAP = arg max_Φ P(Φ|D)    (2)
      = arg max_Φ P(Φ, D) / P(D)    (3)
      = arg max_Φ P(D|Φ) P(Φ)    (4)
      = arg max_Φ ∫_w P(D|w, Φ) P(w|Φ) P(Φ) dw    (5)

Finding the optimal solution for the above problem is challenging, since we need to integrate over all w = (w_1, w_2, ..., w_M), which are unobserved hidden variables.

4.1 EM Algorithm for Bayesian Hierarchical Linear Models

In Equation 5, Φ is the parameter that needs to be estimated, and the result depends on the unobserved latent variables w. This kind of optimization problem is usually solved by the EM algorithm. Applying EM to the above problem, the set of user models w are the unobservable hidden variables and we have:

Q = ∫_w P(w|µ, Σ², D) log P(µ, Σ², w, D) dw

Based on the derivation of the EM formulas presented in [24], we have the following Expectation-Maximization steps for finding the optimal hyperparameters. For space considerations, we omit the derivation in this paper since it is not the focus of our work.

E step: For each user m, estimate the user model distribution P(w_m|D_m, Φ) = N(w_m; w̄_m, Σ²_m) based on the current estimation of the prior Φ = (µ, Σ²).

w̄_m = ((Σ²)⁻¹ + S_{xx,m}/σ_ε²)⁻¹ (S_{xy,m}/σ_ε² + (Σ²)⁻¹ µ)    (6)

Σ²_m = ((Σ²)⁻¹ + S_{xx,m}/σ_ε²)⁻¹    (7)

where S_{xx,m} = ∑_j x_{m,j} x_{m,j}^T and S_{xy,m} = ∑_j x_{m,j} y_{m,j}.

M step: Optimize the prior Φ = (µ, Σ²) based on the estimation from the last E step.

µ = (1/M) ∑_m w̄_m    (8)

Σ² = (1/M) ∑_m [Σ²_m + (w̄_m − µ)(w̄_m − µ)^T]    (9)

Many machine learning driven IR systems use a point estimate of the parameters at different stages in the system. However, we are estimating the posterior distribution of the variables at the E step. This avoids overfitting w_m to a particular user's data, which may be small and noisy. A detailed discussion about this subject appears in [10].

4.2 New Algorithm: Modified EM

Although the EM algorithm is widely studied and used in machine learning applications, using the above EM process to solve Bayesian hierarchical linear models in large-scale information retrieval systems is still too computationally expensive. In this section, we describe why the learning rate of the EM algorithm is slow in our application and introduce a new technique to make the learning of the Bayesian hierarchical linear model scalable. The derivation of the new learning algorithm will be based on the EM algorithm described in the previous section.

First, the covariance matrices Σ², Σ²_m are usually too large to be computationally feasible. For simplicity, and as a common practice in IR, we do not model the correlation between features. Thus we approximate these matrices with K dimensional diagonal matrices. In the rest of the paper, we use these symbols to represent their diagonal approximations:

Σ² = diag(σ_1², σ_2², ..., σ_K²)
Σ²_m = diag(σ_{m,1}², σ_{m,2}², ..., σ_{m,K}²)
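For concreteness, the following sketch implements one standard EM iteration (Equations 6-9) with everything reduced to per-dimension variances. It assumes the synthetic users list and hyperparameters from the earlier sketch, and, for brevity, also treats S_{xx,m} by its diagonal, whereas Equation 6 uses the full matrix; it is an illustration, not the paper's exact implementation.

    import numpy as np

    def em_iteration(users, mu, sigma2, sigma2_eps):
        """One standard EM iteration with diagonal covariances.

        users: list of (X, y) with X of shape (J_m, K);
        mu, sigma2: length-K arrays. Returns updated (mu, sigma2).
        """
        M, K = len(users), len(mu)
        w_bar = np.zeros((M, K))
        s2_m = np.zeros((M, K))
        for m, (X, y) in enumerate(users):
            sxx = (X ** 2).sum(axis=0)           # diagonal of S_xx,m
            sxy = (X * y[:, None]).sum(axis=0)   # S_xy,m
            # E step, Equations 6-7 in diagonal form:
            s2_m[m] = 1.0 / (1.0 / sigma2 + sxx / sigma2_eps)
            w_bar[m] = s2_m[m] * (sxy / sigma2_eps + mu / sigma2)
        # M step, Equations 8-9: averages over *all* users, including
        # those for whom a feature never occurs. This O(MK) update is
        # the source of the slow convergence discussed next.
        mu_new = w_bar.mean(axis=0)
        sigma2_new = (s2_m + (w_bar - mu_new) ** 2).mean(axis=0)
        return mu_new, sigma2_new

Repeating this update until µ and Σ² stabilize gives the standard estimate that the modified algorithm of this section accelerates.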
Secondly, and most importantly, the input space is very sparse and there are many dimensions that are not "related" to a particular user in a real IR application. For example, let us consider a movie recommendation system, with the input variable x representing a particular movie. For the jth movie that the user m has seen, let x_{m,j,k} = 1 if the director of the movie is "Jean-Pierre Jeunet" (indexed by k). Here we assume that whether or not this director directed a specific movie is represented by the kth dimension. If the user m has never seen a movie directed by "Jean-Pierre Jeunet", then the corresponding dimension is always zero (x_{m,j,k} = 0 for all j).

One major drawback of the EM algorithm is that the importance of a feature, µ_k, may be greatly dominated by users who have never encountered this feature (i.e. ∑_j x_{m,j,k} = 0) at the M step (Equation 8). Assume that 100 out of 1 million users have viewed the movies directed by "Jean-Pierre Jeunet", and that the viewers have rated all of his movies as "excellent". Intuitively, he is a good director and the weight for him (µ_k) should be high. Before the EM iteration, the initial value of µ is usually set to 0. Since the other 999,900 users have not seen his movies, their corresponding weights (w_{1,k}, w_{2,k}, ..., w_{m,k}, ..., w_{999900,k}) for that director would be very small initially. Thus the corresponding weight of the director in the prior µ_k at the first M step would be very low, and the variance σ_{m,k} will be large (Equations 8 and 7). It is undesirable that users who have never seen any movie produced by the director influence the importance of the director so much. This makes the convergence of the standard EM algorithm very slow.

Now let's look at whether we can improve the learning speed of the algorithm. Without loss of generality, let us assume that the kth dimension of the input variable x is not related to a particular user m, by which we mean x_{m,j,k} = 0 for all j = 1, ..., J_m. It is straightforward to prove that the kth row and kth column of S_{xx,m} are then completely filled with zeros, and that the kth dimension of S_{xy,m} is zero as well. Thus the corresponding kth dimension of the user model's mean, w̄_m, should be equal to that of the prior: w̄_{m,k} = µ_k, with the corresponding variance σ_{m,k} = σ_k.

At the M step, the standard EM algorithm uses the numerical solution of the distribution P(w_m|D_m, Φ) estimated at the E step (Equations 6 and 7). However, the numerical solutions are very unreliable for w̄_{m,k} and σ_{m,k} when the kth dimension is not related to the mth user. A better approach is using the analytical solutions w̄_{m,k} = µ_k and σ_{m,k} = σ_k for the unrelated (m, k) pairs, along with the numerical solution estimated at the E step for the other (m, k) pairs. Thus we get the following new EM-like algorithm:

Modified E step: For each user m, estimate the user model distribution P(w_m|D_m, Φ) = N(w_m; w̄_m, Σ²_m) based on the current estimation of σ_ε, µ, Σ².

w̄_m = ((Σ²)⁻¹ + S_{xx,m}/σ_ε²)⁻¹ (S_{xy,m}/σ_ε² + (Σ²)⁻¹ µ)    (10)

σ²_{m,k} = ((σ_k²)⁻¹ + s_{xx,m,k}/σ_ε²)⁻¹    (11)

where s_{xx,m,k} = ∑_j x²_{m,j,k} and s_{xy,m,k} = ∑_j x_{m,j,k} y_{m,j}.

Modified M step: Optimize the prior Φ = (µ, Σ²) based on the estimation from the last E step for related user-feature pairs. The M step implicitly uses the analytical solution for unrelated user-feature pairs.

µ_k = (1/M_k) ∑_{m:related} w̄_{m,k}    (12)

σ_k² = (1/M_k) ∑_{m:related} [σ²_{m,k} + (w̄_{m,k} − µ_k)²]    (13)

where M_k is the number of users that are related to feature k.

We only estimate the diagonals of Σ²_m and Σ² since we are using the diagonal approximation of the covariance matrices. To estimate w̄_m, we only need to calculate the numerical solutions for dimensions that are related to user m. To estimate σ_k² and µ_k, we only sum over users that are related to the kth feature.

There are two major benefits of the new algorithm. First, because only the related (m, k) pairs are needed at the modified M step, the computational complexity of a single EM iteration is much smaller when the data is sparse and many of the (m, k) pairs are unrelated. Second, the parameters estimated at the modified M step (Equations 12-13) are more accurate than at the standard M step described in Section 4.1, because the exact analytical solutions w̄_{m,k} = µ_k and σ_{m,k} = σ_k for the unrelated (m, k) pairs are used in the new algorithm instead of an approximate solution as in the standard algorithm.
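A matching sketch of one modified iteration, under the same diagonal treatment and synthetic data as the earlier sketches: the numerical E-step solution is computed only for the feature dimensions that actually occur in a user's documents, and the M step averages over the M_k related users per feature (Equations 10-13). Features with no related users keep their prior values here, an implementation choice the paper does not specify.

    import numpy as np

    def modified_em_iteration(users, mu, sigma2, sigma2_eps):
        """One modified EM iteration (Equations 10-13, diagonal form).

        For unrelated (m, k) pairs the analytical solution
        w_bar[m,k] = mu[k], s2_m[m,k] = sigma2[k] holds, so those
        pairs are simply skipped.
        """
        K = len(mu)
        sum_w = np.zeros(K)    # sum of w_bar[m,k] over related users
        sum_s2 = np.zeros(K)   # sum of s2_m[m,k] + (w_bar[m,k] - mu_k)^2
        n_rel = np.zeros(K)    # M_k: number of users related to feature k
        updates = []
        for X, y in users:
            related = np.flatnonzero((X != 0).any(axis=0))
            Xr = X[:, related]
            sxx = (Xr ** 2).sum(axis=0)           # s_xx,m,k
            sxy = (Xr * y[:, None]).sum(axis=0)   # s_xy,m,k
            s2 = 1.0 / (1.0 / sigma2[related] + sxx / sigma2_eps)          # Eq. 11
            w = s2 * (sxy / sigma2_eps + mu[related] / sigma2[related])    # Eq. 10
            updates.append((related, w, s2))
            n_rel[related] += 1
            sum_w[related] += w
        # Eq. 12: average w_bar over related users only.
        mu_new = np.where(n_rel > 0, sum_w / np.maximum(n_rel, 1), mu)
        # Eq. 13: variance update over related users, using the new mu.
        for related, w, s2 in updates:
            sum_s2[related] += s2 + (w - mu_new[related]) ** 2
        sigma2_new = np.where(n_rel > 0, sum_s2 / np.maximum(n_rel, 1), sigma2)
        return mu_new, sigma2_new

Because each user contributes work only for its related dimensions, one pass costs on the order of the number of related (m, k) pairs rather than O(MK).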
5. EXPERIMENTAL METHODOLOGY

5.1 Evaluation Data Set

To evaluate the proposed technique, we used the following three major data sets (Table 1):

Table 1: Data set statistics. On Reuters, the number of ratings for a simulated user is the number of documents relevant to the corresponding topic.

Data          Users    Docs     Ratings per User
MovieLens     6,040    3,057    151
Netflix-all   480,189  17,770   208
Netflix-1000  1,000    17,770   127
Reuters-C     34       100,000  3,949
Reuters-E     26       100,000  1,632
Reuters-G     33       100,000  2,222
Reuters-M     10       100,000  6,529

MovieLens Data: This data set was created by combining the relevance judgments from the MovieLens [9] data set with documents from the Internet Movie Database (IMDB). MovieLens allows users to rank how much he/she enjoyed a specific movie on a scale from 1 to 5. This "likeability" rating was used as a measurement of how relevant the document representing the corresponding movie is to the user. We considered documents with likeability scores of 4 or 5 as relevant, and documents with a score of 1 to 3 as irrelevant to the user. MovieLens provided relevance judgments on 3,057 documents from 6,040 separate users. On average, each user rated 151 movies, of which 87 were judged to be relevant. The average score for a document was 3.58. Documents representing each movie were constructed from the portion of the IMDB database that is available for public download [13]. Based on this database, we created one document per movie that contained the relevant information about it (e.g. directors, actors, etc.).

Netflix Data: This data set was constructed by combining documents about movies crawled from the web with a set of actual movie rental customer relevance judgments from Netflix [19]. Netflix publicly provides the relevance judgments of 480,189 anonymous customers. There are around 100 million ratings on a scale of 1 to 5 for 17,770 documents. Similar to MovieLens, we considered documents with likeability scores of 4 or 5 as relevant. This number of customers was reduced to 1,000 through random sampling. The average customer on the reduced data set provided 127 judgments, with 70 being deemed relevant. The average score for documents is 3.55.

Reuters Data: This is the Reuters Corpus, Volume 1. It covers 810,000 Reuters English language news stories from August 20, 1996 to August 19, 1997. Only the first 100,000 news stories were used in our experiments. The Reuters corpus comes with a topic hierarchy. Each document is assigned to one of several locations on the hierarchical tree. The first level of the tree contains four topics, denoted as C, E, M, and G. For the experiments in this paper, the tree was cut at level 1 to create four smaller trees, each of which corresponds to one smaller data set: Reuters-C, Reuters-E, Reuters-G, and Reuters-M. For each small data set, we created several profiles, one profile for each node in a sub-tree, to simulate multiple users, each with a related, yet separate definition of relevance. All the user profiles on a sub-tree are supposed to share the same prior model distribution. Since this corpus explicitly indicates only the relevant documents for a topic (user), all other documents are considered irrelevant.

5.2 Evaluation

We designed the experiments to answer the following three questions:

1. Do we need to take the effort to use a Bayesian approach and learn a prior from other users?

2. Does the new algorithm work better than the standard EM algorithm for learning the Bayesian hierarchical linear model?

3. Can the new algorithm quickly learn many user models?

To answer the first question, we compared the Bayesian hierarchical models with commonly used Norm-2 regularized linear regression models. In fact, the commonly used approach is equivalent to the model learned at the end of the first EM iteration. To answer the second question, we compared the proposed new algorithm with the standard EM algorithm to see whether the new learning algorithm is better. To answer the third question, we tested the efficiency of the new algorithm on the entire Netflix data set, where about half a million user models need to be learned together.

For the MovieLens and Netflix data sets, algorithm effectiveness was measured by mean square error, while on the Reuters data set classification error was used because it was more informative. We first evaluated the performance on each individual user, and then estimated the macro average over all users. Statistical tests (t-tests) were carried out to see whether the results are significant.

For the experiments on the MovieLens and Netflix data sets, we used a random sample of 90% of each user's data for training, and the rest for testing. On the Reuters data set, because there are too many relevant documents for each topic in the corpus, we used a random sample of 10% of each topic for training, and 10% of the remaining documents for testing. For all runs, we set (a, b, c, σ_ε) = (0.1, 10, 0.1, 1) manually.
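As an illustration of the evaluation protocol just described, a small helper that computes the macro average of per-user mean square error on held-out data; the names and data layout follow the earlier sketches and are not from the paper.

    import numpy as np

    def macro_mse(users_test, w_bars):
        """Macro-averaged mean square error over users.

        users_test: list of held-out (X, y) per user;
        w_bars: the corresponding posterior mean user models.
        """
        per_user = [np.mean((X @ w - y) ** 2)
                    for (X, y), w in zip(users_test, w_bars)]
        return float(np.mean(per_user))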
6. EXPERIMENTAL RESULTS

Figure 2, Figure 3, and Figure 4 show that on all data sets, the Bayesian hierarchical modeling approach achieves a statistically significant improvement over the regularized linear regression model, which is equivalent to the Bayesian hierarchical models learned at the first iteration. Further analysis shows a negative correlation between the amount of training data for a user and the improvement the system gets. This suggests that borrowing information from other users yields more significant improvements for users with less training data, which is as expected. However, the strength of the correlation differs over data sets, and the amount of training data is not the only characteristic that influences the final performance.

Figure 2 and Figure 3 show that the proposed new algorithm works better than the standard EM algorithm on the Netflix and MovieLens data sets. This is not surprising, since the number of related user-feature pairs is much smaller than the number of unrelated user-feature pairs on these two data sets, and thus the proposed new algorithm is expected to work better.

Figure 4 shows that the two algorithms work similarly on the Reuters-E data set. The accuracy of the new algorithm is similar to that of the standard EM algorithm at each iteration. The general patterns are very similar on the other Reuters subsets. Further analysis shows that only 58% of the user-feature pairs are unrelated on this data set. Since the number of unrelated user-feature pairs is not extremely large, the sparseness is not a serious problem on the Reuters data set. Thus the two learning algorithms perform similarly. The results suggest that only on a corpus where the number of unrelated user-feature pairs is much larger than the number of related pairs, such as on the Netflix data set, will the proposed technique get a significant improvement over standard EM. However, the experiments also show that when this assumption does not hold, the new algorithm does not hurt performance.

Although the proposed technique is faster than standard EM, can it really learn millions of user models quickly? Our results show that the modified EM algorithm converges quickly, and 2-3 modified EM iterations result in a reliable estimation.
Figure 2: Performance on a Netflix subset with 1,000 users. The new algorithm is statistically significantly better than the EM algorithm at iterations 2-10. Norm-2 regularized linear models are equivalent to the Bayesian hierarchical models learned at the first iteration, and are statistically significantly worse than the Bayesian hierarchical models.

Figure 3: Performance on a MovieLens subset with 1,000 users. The new algorithm is statistically significantly better than the EM algorithm at iterations 2 to 17 (evaluated with mean square error).

Figure 4: Performance on a Reuters-E subset with 26 profiles. Performance on Reuters-C, Reuters-M, and Reuters-G is similar.

[Each figure plots mean square error (left panel) and classification error (right panel) against EM iterations, for the new algorithm and the traditional EM algorithm.]
hole Netflix data set(480, 189 users, features, and ers for valuable feedback on the work described in this pa- 100 million ratings)running on a single C (2GB mem- per. Part of the work was supported by Yahoo, Google, the ry, P4 3GHz). The system finished one modified EM itera- Petascale Data Storage Institute and the Institute for Scal- tion in about 4 hours. This demonstrates that the proposed able Scientific Data Management. Any opinions, findings echnique can efficiently handle large-scale system like Net conclusions, or recommendations expressed in this material are those of the authors, and do not necessarily reflect those of the 7. CONCLUSION 9. REFERENCES Content-based user profile learning is an important prob- lem and is the key to providing personal recommendations 11 C Basu, H Hirsh, and W. Cohen. Recommendation to a user, especially for recommending new items with a as classification: Using social and content-based small number of ratings. The Bayesian hierarchical model- information in recommendation. In Proceedings of the ing approach is becoming an important user profile learning fteenth National Conference on Artificial approach due to its theoretically justified ability to help one Intelligence, 1998 ser through information transfer from the other users by 2J S Breese, D Heckerman, and C Kadie Empirical way of hyperpriors analysis of predictive algorithms for collaborative This paper examined the weakness of the popular EM filtering. Technical report, Microsoft Research, One based learning approach for Bayesian hierarchical linear mod- Microsoft Way, Redmond, WA 98052, 1998 els and prop 3 J. Callan. Document filtering with inference network EM. We showed that the new technique is theoretically more In Proceedings of the Nineteenth Annual International computationally efficient than the standard EM algorithm ACM SIGIR Conference on Research and Evaluation on the movielens and netfix data sets denon- Development in Information Retrieval, pages 262-269 trated the effectiveness of the new technique when the data 1996 is sparse, by which we mean the ratio of related user-feature 4 N Cancedda, N Cesa-Bianchi, A Conconi pairs to unrelated pairs is small. Evaluation on the Reuters C. Gentile, C. Goutte, T. Graepel, Y. Li, J M data set showed that the new technique performed similar to Renders, J S. Taylor, and A. Vinokourov. Kernel the standard EM algorithm when the sparseness condition method for document filtering. In The Eleventh Tert does not hold. In general, it is better to use the new alge REtrieval Conference(TREC11) National Institute of rithm since it is as simple as standard eM, the performance Standards and Technology, special publication either better or similar to EM, and the computation com- 500-249,2 plexity is lower at each iteration. It is worth mentioning that 5 C Chelba and A Acero Adaptation of maxim even if the original problem space is not sparse, sparsene entropy capitalizer: Little data can help a lot. In an be created artificially when a recommendation syster D. Lin and D. Wu, editors, Proceedings of EMnLP ses user-specific feature selection techniques to reduce the 2004, pages 285-292, Barcelona, Spain, July 2004 noise and user model complexity. The proposed technique Association for Computational linguistics can also be adapted to improve the learning in such a sce- [6 B. Croft and J. Lafferty, editors. Language Modeling nario. We also demonstrated that the proposed technique for Information Retrieval. 
whole Netflix data set (480,189 users, 159,836 features, and 100 million ratings) running on a single-CPU PC (2GB memory, P4 3GHz). The system finished one modified EM iteration in about 4 hours. This demonstrates that the proposed technique can efficiently handle a large-scale system like Netflix.

7. CONCLUSION

Content-based user profile learning is an important problem and is the key to providing personal recommendations to a user, especially for recommending new items with a small number of ratings. The Bayesian hierarchical modeling approach is becoming an important user profile learning approach due to its theoretically justified ability to help one user through information transfer from the other users by way of hyperpriors.

This paper examined the weakness of the popular EM-based learning approach for Bayesian hierarchical linear models and proposed a better learning technique called Modified EM. We showed that the new technique is theoretically more computationally efficient than the standard EM algorithm. Evaluation on the MovieLens and Netflix data sets demonstrated the effectiveness of the new technique when the data is sparse, by which we mean the ratio of related user-feature pairs to unrelated pairs is small. Evaluation on the Reuters data set showed that the new technique performs similarly to the standard EM algorithm when the sparseness condition does not hold. In general, it is better to use the new algorithm, since it is as simple as standard EM, its performance is either better than or similar to that of EM, and its computational complexity at each iteration is lower. It is worth mentioning that even if the original problem space is not sparse, sparseness can be created artificially when a recommendation system uses user-specific feature selection techniques to reduce noise and user model complexity; the proposed technique can also be adapted to improve learning in such a scenario. We also demonstrated that the proposed technique can learn half a million user profiles from 100 million ratings in a few hours with a single CPU.

This research is important because scalability is a major concern for researchers using the Bayesian hierarchical linear modeling approach to build practical large-scale systems, even though the literature has demonstrated the effectiveness of these models in many applications. Our work is a major step toward making Bayesian hierarchical linear models more practical. The proposed technique can easily be adapted to run on a cluster of machines, further speeding up learning to handle a larger-scale system with hundreds of millions of users.

The research also has much potential to benefit people using the EM algorithm on many other IR and machine learning problems. The EM algorithm is a commonly used machine learning technique, and it is used to find model parameters in many IR problems where the training data is very sparse. Although we focus on Bayesian hierarchical linear models for recommendation and filtering, the new idea of using an analytical solution instead of a numerical solution for unrelated user-feature pairs at the M step could be adapted to many other problems.
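To make this idea concrete, the following Python sketch shows one Modified-EM-style iteration under simplifying assumptions (a Gaussian hierarchical linear model with a diagonal prior covariance and known noise variance). It is an illustration of the technique, not the exact implementation evaluated above, and all names (modified_em_iteration, users, noise_var) are ours. Each user's posterior is solved numerically only over the features that user actually touched; every unrelated user-feature pair keeps its prior value, and its contribution to the M step is folded in with a single closed-form term.

import numpy as np

def modified_em_iteration(users, mu, sigma2, noise_var):
    """One iteration: per-user E-step on related features only, then a
    sparse M-step. users is a list of (X_m, y_m, idx) triples, where X_m
    (n_m x k_m) holds only the k_m feature columns user m actually rated,
    indexed into the full K-dimensional feature space by idx."""
    K, M = mu.shape[0], len(users)
    d_mean = np.zeros(K)   # sparse corrections to sum_m E[w_m]
    d_sq = np.zeros(K)     # sparse corrections to sum_m E[w_m^2]
    for X_m, y_m, idx in users:
        # E-step on the related dimensions only (Gaussian conjugacy).
        # With a diagonal prior, unrelated dimensions keep the prior exactly.
        A = X_m.T @ X_m / noise_var + np.diag(1.0 / sigma2[idx])
        A_inv = np.linalg.inv(A)
        m_post = A_inv @ (X_m.T @ y_m / noise_var + mu[idx] / sigma2[idx])
        v_post = np.diag(A_inv)
        # Accumulate only the deviations from the prior's contribution:
        d_mean[idx] += m_post - mu[idx]
        d_sq[idx] += (v_post + m_post**2) - (sigma2[idx] + mu[idx]**2)
    # M-step: all M*K unrelated pairs enter analytically, in O(K) total.
    mu_new = mu + d_mean / M                   # (1/M) sum_m E[w_m]
    second = (sigma2 + mu**2) + d_sq / M       # (1/M) sum_m E[w_m^2]
    return mu_new, second - mu_new**2          # new prior mean and variance

Each pass costs time proportional to the number of related user-feature pairs rather than O(MK). Note also that the per-user loop body is independent across users, which is what makes the cluster adaptation mentioned above straightforward: map the E-step over users and reduce the two correction vectors.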
8. ACKNOWLEDGMENTS
We thank Wei Xu, David Lewis, and the anonymous reviewers for valuable feedback on the work described in this paper. Part of the work was supported by Yahoo, Google, the Petascale Data Storage Institute, and the Institute for Scalable Scientific Data Management. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors, and do not necessarily reflect those of the sponsors.

9. REFERENCES
[1] C. Basu, H. Hirsh, and W. Cohen. Recommendation as classification: Using social and content-based information in recommendation. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, 1998.
[2] J. S. Breese, D. Heckerman, and C. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. Technical report, Microsoft Research, One Microsoft Way, Redmond, WA 98052, 1998.
[3] J. Callan. Document filtering with inference networks. In Proceedings of the Nineteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 262–269, 1996.
[4] N. Cancedda, N. Cesa-Bianchi, A. Conconi, C. Gentile, C. Goutte, T. Graepel, Y. Li, J. M. Renders, J. S. Taylor, and A. Vinokourov. Kernel method for document filtering. In The Eleventh Text REtrieval Conference (TREC-11). National Institute of Standards and Technology, special publication 500-249, 2003.
[5] C. Chelba and A. Acero. Adaptation of maximum entropy capitalizer: Little data can help a lot. In D. Lin and D. Wu, editors, Proceedings of EMNLP 2004, pages 285–292, Barcelona, Spain, July 2004. Association for Computational Linguistics.
[6] B. Croft and J. Lafferty, editors. Language Modeling for Information Retrieval. Kluwer, 2002.
[7] A. Dayanik, D. D. Lewis, D. Madigan, V. Menkov, and A. Genkin. Constructing informative prior distributions from domain knowledge in text classification. In SIGIR ’06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 493–500, New York, NY, USA, 2006. ACM Press.
[8] J. Delgado and N. Ishii. Memory-based weighted-majority prediction for recommender systems. In ACM SIGIR ’99 Workshop on Recommender Systems, 1999.
[9] GroupLens. MovieLens. http://www.grouplens.org/taxonomy/term/14, 2006.
[10] D. Heckerman. A tutorial on learning with Bayesian networks. In M. Jordan, editor, Learning in Graphical Models. Kluwer Academic, 1998.
[11] J. L. Herlocker, J. A. Konstan, A. Borchers, and J. Riedl. An algorithmic framework for performing collaborative filtering. In SIGIR ’99: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pages 230–237, New York, NY, USA, 1999. ACM Press.
[12] T. Hofmann and J. Puzicha. Latent class models for
collaborative filtering. In IJCAI ’99: Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, pages 688–693, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.
[13] Internet Movie Database (IMDB). http://www.imdb.com/interfaces/, 2006.
[14] R. Jin, J. Y. Chai, and L. Si. An automatic weighting scheme for collaborative filtering. In SIGIR ’04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, pages 337–344, New York, NY, USA, 2004. ACM Press.
[15] J. A. Konstan, B. N. Miller, D. Maltz, J. L. Herlocker, L. R. Gordon, and J. Riedl. GroupLens: Applying collaborative filtering to Usenet news. Communications of the ACM, 40(3):77–87, 1997.
[16] D. Lewis. Applying support vector machines to the TREC-2001 batch filtering and routing tasks. In Proceedings of the Eleventh Text REtrieval Conference (TREC-11), 2002.
[17] B. Liu, X. Li, W. S. Lee, and P. Yu. Text classification by labeling words. In Proceedings of the Nineteenth National Conference on Artificial Intelligence (AAAI-2004), July 25–29, 2004.
[18] P. Melville, R. J. Mooney, and R. Nagarajan. Content-boosted collaborative filtering for improved recommendations. In Proceedings of the Eighteenth National Conference on Artificial Intelligence (AAAI-2002), Edmonton, Canada, 2002.
[19] Netflix. Netflix Prize. http://www.netflixprize.com (visited on Nov. 30, 2006), 2006.
[20] S. Robertson and K. Sparck-Jones. Relevance weighting of search terms. Journal of the American Society for Information Science, 27:129–146, 1976.
[21] J. Wang, A. P. de Vries, and M. J. T. Reinders. Unifying user-based and item-based collaborative filtering approaches by similarity fusion. In SIGIR ’06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 501–508, New York, NY, USA, 2006. ACM Press.
[22] X. Wu and R. K. Srihari. Incorporating prior knowledge with weighted margin support vector machines. In Proc. ACM Knowledge Discovery and Data Mining Conf. (ACM SIGKDD 2004), Aug. 2004.
[23] Y. Yang, S. Yoo, J. Zhang, and B. Kisiel. Robustness of adaptive filtering methods in a cross-benchmark evaluation. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2005.
[24] K. Yu, V. Tresp, and A. Schwaighofer. Learning Gaussian processes from multiple tasks. In ICML ’05: Proceedings of the 22nd international conference on Machine learning, pages 1012–1019, New York, NY, USA, 2005. ACM Press.
[25] K. Yu, V. Tresp, and S. Yu. A nonparametric hierarchical Bayesian framework for information filtering. In SIGIR ’04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, pages 353–360. ACM Press, 2004.
[26] X. Zhu. Semi-supervised learning literature survey. Technical report, University of Wisconsin–Madison, December 9, 2006.
[27] P. Zigoris and Y. Zhang. Bayesian adaptive user profiling with explicit & implicit feedback. In Conference on Information and Knowledge Management 2006, 2006.