Leveraging Aggregate Ratings for Better Recommendations

Akhmed Umyarov
New York University, Stern School of Business
44 West 4th Street, 8th floor, New York, NY, USA
aumyarov@stern.nyu.edu

Alexander Tuzhilin
New York University, Stern School of Business
44 West 4th Street, 8th floor, New York, NY, USA
atuzhili@stern.nyu.edu

ABSTRACT
The paper presents a method that uses aggregate ratings provided by various segments of users for various categories of items to derive better estimations of unknown individual ratings. This is achieved by converting the aggregate ratings into constraints on the parameters of a rating estimation model presented in the paper. The paper also demonstrates theoretically that these additional constraints reduce rating estimation errors, resulting in better rating predictions.

Categories and Subject Descriptors: H.1.2 [Models and Principles]: User/Machine Systems - Human information processing; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Information filtering

General Terms: Algorithms, Design, Theory

Keywords: Recommender systems, Hierarchical Bayesian models, predictive models, aggregate ratings, OLAP

1. INTRODUCTION
Consider a movie recommender system, such as the one provided by Netflix, and assume that we know, from a reliable external source, an average rating that graduate students provide for action movies. Can we use this type of aggregate rating information to improve the quality of individual recommendations? More generally, ratings of individual items provided by individual users can be aggregated into OLAP-based aggregation hierarchies [1], and various aggregate ratings for different groups of users and groups of items at different levels of the OLAP hierarchy can be known to the recommender system. For example, the IMDB database provides average ratings of movies by various categories of users, such as Male vs. Female ratings. In this paper, we describe how this aggregate rating information from external sources can be leveraged to provide better recommendations of individual items to individual users.

We study this problem in the context of hierarchical regression models, both Bayesian and frequentist, that were independently proposed by statisticians [5] and marketers [3] studying recommender systems. We decided to use this type of hierarchical regression models [12] for the following reasons. First, they constitute hybrid models integrating both user and item characteristics into a single recommendation model, and hybrid models tend to outperform collaborative and content-based recommendation methods in many cases [2]; in fact, the Hierarchical Bayesian model presented in [3] outperformed the collaborative filtering model considered there. Second, these models are based on strong statistical theory and have nice statistical properties that can be analyzed theoretically, as is done in this paper.
However, the general approach presented in this paper is not limited to this particular type of models and can be generalized to various other statistical and data mining models and approaches. For example, [4] presents a method for using aggregate information about the traversal of hypertext pages by a group of users in order to provide better recommendations of hypertext pages to individual members of the group. In contrast to this top-down approach, [11] presents a bottom-up approach in which the goal is to provide recommendations to a group of users; these group recommendations are based on aggregate ratings computed from the individual ratings of the members of the group.

In this paper we show theoretically that the extra knowledge of the external aggregate ratings indeed leads to more accurate recommendations. We also show how this aggregate rating information can be converted into additional constraints on model parameters, leading to better estimations of individual unknown ratings. Finally, we present a particular semi-parametric frequentist method for estimating the parameters of hierarchical regression models and show how this method incorporates the aggregate information.

Before presenting the aggregate method, we first describe hierarchical regression models in Sections 2 and 3 and how they are used for estimating unknown individual ratings. Due to space limitations, we present a compressed version of our results; a complete description is presented in [13].

2. HIERARCHICAL BAYESIAN REGRESSION MODEL
As explained in Section 1, [3] describes a hybrid approach to rating estimation that uses the following Hierarchical Bayesian (HB) linear regression model:

\[
\begin{cases}
r_{ij} = x_{ij}'\mu + z_i'\gamma_j + w_j'\lambda_i + \varepsilon_{ij},\\
\varepsilon_{ij} \sim N(0, \sigma^2), \quad \gamma_j \sim N(0, \Gamma), \quad \lambda_i \sim N(0, \Lambda),
\end{cases}
\tag{1}
\]
where the observed values of the model are the ratings r_ij assigned by user i to item j, z_i is a vector of attributes of user i, such as age, gender, etc., w_j is a vector of attributes of item j, such as price, weight, etc., and the vector x_ij = z_i ⊗ w_j, where ⊗ is the Kronecker product. Intuitively, x_ij is a long vector containing all possible cross-products between the individual elements of z_i and w_j. (Vectors are typed in bold font, as opposed to matrices and scalars, which are typed in regular font; we also include a constant term both in z_i as a user attribute and in w_j as an item attribute.)

The vector µ represents the unobserved slope of the regression, and the vectors γ_j and λ_i represent unobserved item heterogeneity and user heterogeneity effects, respectively. Moreover, model (1) assumes that the vectors γ_j ∼ N(0, Γ) and λ_i ∼ N(0, Λ), where Γ and Λ are unobserved covariance matrices, and that each observation r_ij also has an i.i.d. disturbance ε_ij ∼ N(0, σ²), where σ is also an unobserved parameter.

Thus, the vectors µ, {γ_j} and {λ_i}, the scalar σ, and the covariance matrices Γ and Λ constitute the unknown parameters of model (1). Prior beliefs about these parameters are introduced in [3], and the parameters are estimated from the known ratings r_ij and the known user/item data using the Markov Chain Monte Carlo (MCMC) method [6], which constitutes one of the Bayesian estimation techniques for finding the expected value of the posterior distributions of the parameters. Moreover, [3] compared the predictive performance of their Hierarchical Bayesian model (1) against classical collaborative filtering methods and demonstrated that model (1) outperformed the collaborative filtering considered in [3].

In most practical cases, the number of parameters to be estimated for model (1) is very large. For example, for 1000 users defined by 5 user attributes and 1000 movies defined by 20 movie attributes, we will need to estimate more than 25,000 free parameters in the model: the 1000 × 20 user-heterogeneity coefficients and the 1000 × 5 item-heterogeneity coefficients alone account for 25,000, in addition to the 5 × 20 = 100 slope coefficients and the covariance parameters. In general, according to [7], constrained Bayesian estimation techniques are notorious for their computational difficulty, especially in high-dimensional parameter spaces, as is the case with model (1). To address this difficulty, we propose to use a frequentist semi-parametric approach, presented in the next section, for solving the aggregate rating problem instead of the Bayesian parametric method defined by (1).

3. GENERALIZED LINEAR REGRESSION MODEL
Consider the same model as in (1), but from a frequentist semi-parametric perspective. (The semi-parametric perspective makes no assumptions about the shape of the distributions; for example, we do not assume that the residuals are normally distributed. Instead, we make assumptions only about the moments of the residual distribution, not about its shape.)

\[
\begin{cases}
r_{ij} = x_{ij}'\mu + z_i'\gamma_j + w_j'\lambda_i + \varepsilon_{ij},\\
E[\varepsilon_{ij}] = 0, \quad \operatorname{Var}[\varepsilon_{ij}] = \sigma^2,\\
E[\gamma_j] = 0, \quad \operatorname{Var}[\gamma_j] = \Gamma,\\
E[\lambda_i] = 0, \quad \operatorname{Var}[\lambda_i] = \Lambda.
\end{cases}
\tag{2}
\]

For a frequentist, γ_j and λ_i constitute random effects, so that model (2) constitutes a mixed-effects model [8]. We introduce the notion of a compound disturbance η_ij by grouping together all the random effects in (2) as follows:

\[
r_{ij} = x_{ij}'\mu + \underbrace{z_i'\gamma_j + w_j'\lambda_i + \varepsilon_{ij}}_{\eta_{ij}},
\tag{3}
\]

thus making it a Generalized Least Squares (GLS) linear regression model. Moreover, µ can be consistently (although not efficiently [8]) estimated using the ordinary least squares (OLS) estimator if we assume that γ_j and λ_i are not correlated with x_ij.
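To make the structure of models (2) and (3) concrete, here is a minimal simulation sketch; the dimensions, variance components and seed are made-up illustrative values, not taken from the paper. It generates ratings from the mixed-effects specification with x_ij = z_i ⊗ w_j and computes the first-stage OLS estimate of µ on the stacked design.

```python
import numpy as np

rng = np.random.default_rng(0)

n_users, n_items = 50, 40
p, q = 3, 4                                   # dim(z_i), dim(w_j)
sigma = 0.5                                   # sd of the i.i.d. disturbance ε_ij
Gamma, Lambda = 0.2 * np.eye(p), 0.3 * np.eye(q)   # Var(γ_j), Var(λ_i)

Z = rng.normal(size=(n_users, p))             # user attributes z_i
W = rng.normal(size=(n_items, q))             # item attributes w_j
mu = rng.normal(size=p * q)                   # true slope vector µ

gamma = rng.multivariate_normal(np.zeros(p), Gamma, size=n_items)  # item effects γ_j
lam = rng.multivariate_normal(np.zeros(q), Lambda, size=n_users)   # user effects λ_i

X, r, rows = [], [], []
for i in range(n_users):
    for j in range(n_items):
        x_ij = np.kron(Z[i], W[j])            # x_ij = z_i ⊗ w_j
        r_ij = (x_ij @ mu + Z[i] @ gamma[j] + W[j] @ lam[i]
                + rng.normal(scale=sigma))    # model (2)
        X.append(x_ij); r.append(r_ij); rows.append((i, j))
X, r = np.array(X), np.array(r)

# OLS on the stacked regression: consistent (though not efficient) for µ,
# since the compound disturbance η_ij is uncorrelated with x_ij here.
mu_ols, *_ = np.linalg.lstsq(X, r, rcond=None)
print("max |µ_OLS - µ| =", np.max(np.abs(mu_ols - mu)))
```

The list `rows` of (user, item) index pairs is kept because the covariance structure developed next is indexed by these pairs.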
The covariance structure of the residuals η_ij can be determined from equations (2) and (3) as follows:

\[
\begin{cases}
E[\eta_{ij}] = 0,\\
E[\eta_{ij}\eta_{kl}] = 0, & \text{if } i \neq k \text{ and } j \neq l,\\
E[\eta_{ij}\eta_{ik}] = w_j'\Lambda w_k, & \text{if } j \neq k,\\
E[\eta_{ij}\eta_{kj}] = z_i'\Gamma z_k, & \text{if } i \neq k,\\
E[\eta_{ij}^2] = \sigma^2 + z_i'\Gamma z_i + w_j'\Lambda w_j,
\end{cases}
\tag{4}
\]

where the expected value E(·) is taken over ε_ij, λ_i and γ_j.

Let Ω be the covariance matrix of the very long vector of residuals η = ||η_ij||, that is, Ω = Var(η). From (4), we conclude that Ω depends on just a few unknown parameters: σ, Γ and Λ. Thus σ, Γ and Λ can be consistently estimated from the OLS residuals. For example, we can use the following (overdetermined) system of linear equations:

\[
\begin{cases}
\displaystyle \sum_{\substack{i,j,k\\ j \neq k,\; i \in S_U}} w_j'\Lambda w_k = \sum_{\substack{i,j,k\\ j \neq k,\; i \in S_U}} e_{ij} e_{ik}, \quad \forall S_U,\\[3ex]
\displaystyle \sum_{\substack{i,j,k\\ i \neq k,\; j \in S_I}} z_i'\Gamma z_k = \sum_{\substack{i,j,k\\ i \neq k,\; j \in S_I}} e_{ij} e_{kj}, \quad \forall S_I,\\[3ex]
\displaystyle \sigma^2 = \frac{1}{N} \sum_{i,j} \left[ e_{ij}^2 - z_i'\Gamma z_i - w_j'\Lambda w_j \right],\\[2ex]
\Lambda' = \Lambda, \quad \Gamma' = \Gamma,
\end{cases}
\tag{5}
\]

where e_ij is the OLS residual corresponding to observation r_ij, N is the total number of observations, and S_U and S_I are some subsets of users and items, respectively.

The parameter µ of model (2) can be estimated asymptotically efficiently using the Feasible GLS (FGLS) estimator approach [8] as follows:

\[
\hat{\mu} = \left( X'\hat{\Omega}^{-1} X \right)^{-1} X'\hat{\Omega}^{-1} r,
\tag{6}
\]

where r is a column vector of the observed scalars r_ij stacked on top of each other, so that the first element of the vector is the scalar r_{i_1 j_1}, the second element is r_{i_2 j_2}, and so on; X is a matrix of the row vectors x_ij' stacked on top of each other one by one, so that the first row of X is the row vector x_{i_1 j_1}' corresponding to observation r_{i_1 j_1}, the second row of X is the row vector x_{i_2 j_2}', and so on; and Ω̂ is an estimate of Ω.

Once we have consistently estimated the parameters σ, Γ and Λ, we can consistently estimate the expressions X'Ω̂⁻¹X and X'Ω̂⁻¹r using the estimates σ̂, Γ̂, Λ̂ and expression (4), and then obtain a consistent and asymptotically efficient estimate of µ using expression (6).
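To illustrate how the pieces fit together, the following sketch assembles Ω̂ entry by entry from the covariance structure (4) and then applies the FGLS formula (6). It is illustrative only: it assumes that estimates of σ², Γ and Λ have already been obtained from the OLS residuals (e.g., from a system like (5)), it reuses the X, r, rows, Z, W variables from the sketch above, and the dense double loop is meant for small problems, not as a scalable implementation.

```python
import numpy as np

def fgls_mu(X, r, rows, Z, W, sigma2, Gamma, Lambda):
    """FGLS estimate of µ per (6), with Ω̂ assembled from the covariance
    structure (4). Dense construction: only suitable for small N."""
    N = len(rows)
    Omega = np.zeros((N, N))
    for a, (i, j) in enumerate(rows):
        for b, (k, l) in enumerate(rows):
            if a == b:
                Omega[a, b] = sigma2 + Z[i] @ Gamma @ Z[i] + W[j] @ Lambda @ W[j]
            elif i == k and j != l:
                Omega[a, b] = W[j] @ Lambda @ W[l]   # same user, different items
            elif j == l and i != k:
                Omega[a, b] = Z[i] @ Gamma @ Z[k]    # same item, different users
            # otherwise zero: i != k and j != l
    Oinv_X = np.linalg.solve(Omega, X)               # Ω̂⁻¹X without an explicit inverse
    Oinv_r = np.linalg.solve(Omega, r)               # Ω̂⁻¹r
    return np.linalg.solve(X.T @ Oinv_X, X.T @ Oinv_r)

# Example, using the true variance components as stand-ins for σ̂², Γ̂, Λ̂:
# mu_fgls = fgls_mu(X, r, rows, Z, W, sigma**2, Gamma, Lambda)
```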
As long as we assume, for the asymptotic analysis, that for each user N_i → ∞ and for each item N_j → ∞ as N → ∞, we are also able to consistently estimate the individual heterogeneities {γ_j} and {λ_i} from the following (overdetermined) system of linear equations:

\[
\hat{\eta}_{ij} = z_i'\gamma_j + w_j'\lambda_i + \varepsilon_{ij} \quad \forall \text{ observations } (i, j),
\tag{7}
\]

where η̂_ij is a consistent estimator of η_ij; for example, it can be the OLS residual η̂_ij = e_ij.

System (7) can be interpreted as an ordinary linear regression with dependent variables η̂_ij, regressors z_i and w_j, and i.i.d. disturbances ε_ij. Since η̂_ij is a consistent estimator of η_ij, the OLS estimators λ̂_i and γ̂_j consistently estimate λ_i and γ_j, given our assumption about the asymptotic behavior of the model. Thus, the frequentist model (2) gives as much individual heterogeneity information as the Bayesian model (1).

As follows from (6), the estimation of µ requires inverting the matrix Ω̂, which is of size N × N, where N is the total number of observations. The matrix Ω̂ is sparse, symmetric and positive-semidefinite, and one can use the Cholesky decomposition for sparse matrices, Ω̂ = LL', where L is a lower-triangular matrix, in order to calculate and store the inverse. Note that we do not have to store Ω̂⁻¹ itself; we only need to calculate X'Ω̂⁻¹X and X'Ω̂⁻¹r. We also notice that Ω̂⁻¹ = (L⁻¹)'L⁻¹ and that L⁻¹ is itself lower-triangular. Thus,

\[
X'\hat{\Omega}^{-1}X = X'(L^{-1})'L^{-1}X = (L^{-1}X)'(L^{-1}X),
\tag{8}
\]
\[
X'\hat{\Omega}^{-1}r = X'(L^{-1})'L^{-1}r = (L^{-1}X)'(L^{-1}r).
\tag{9}
\]

Unfortunately, the matrix Ω̂ is not a band matrix, so the required storage for the Cholesky factor L can be as large as O(N²), which is too high for large problems, and the computational complexity of naive algorithms can be as large as O(N³). However, the problem is parallelizable; for example, the inversion of a triangular matrix takes O(log² N) operations with O(N³ / log N) processors [10]. Determining how to invert the sparse matrix Ω̂ more efficiently, and thus make the whole aggregate rating problem scalable, constitutes one of our future research topics.
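Equations (8) and (9) say that, once the factorization Ω̂ = LL' is available, X'Ω̂⁻¹X and X'Ω̂⁻¹r can be formed from triangular solves without ever materializing Ω̂⁻¹. A minimal dense sketch of this computation follows; for large, sparse Ω̂ a sparse Cholesky factorization would be needed instead, as discussed above.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def gls_normal_equations(Omega, X, r):
    """Compute X'Ω⁻¹X and X'Ω⁻¹r via the Cholesky factor, as in (8) and (9).
    Assumes Ω is positive definite, which holds here because Ω equals σ²I plus a
    positive-semidefinite part (see (4)) with σ² > 0."""
    L = cholesky(Omega, lower=True)                 # Ω = L L'
    LinvX = solve_triangular(L, X, lower=True)      # L⁻¹X
    Linvr = solve_triangular(L, r, lower=True)      # L⁻¹r
    return LinvX.T @ LinvX, LinvX.T @ Linvr         # (L⁻¹X)'(L⁻¹X), (L⁻¹X)'(L⁻¹r)
```

Solving for µ̂ from these two pieces then reduces to one small P × P linear system, where P is the length of µ.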
4. INTRODUCING AGGREGATE RATING
The main research question addressed in this paper is how to use the aggregate ratings in our statistical models to provide better estimators of individual ratings.

Formally, assume that in addition to the classical individual ratings r_ij, the user data z_i and the item data w_j used in equations (1) and (2), we also know the expected value of an average rating across some segment S of user-item pairs (here and below this expected value is taken only over ε, not over γ_j and λ_i). Also assume that there are k total possible user-item pairs in the segment S; thus

\[
E_{\varepsilon}\!\left[ \frac{\sum_{i,j} r_{ij}}{k} \right] = a,
\tag{10}
\]

where the sum is taken over all user-item pairs (i, j) ∈ S. For example, assume that the expected average rating of some 100 action movies provided by 20 graduate CS students, based on k = 2000 possible user-item pairs, is a = 7.8.

Substituting the expression for r_ij from our model equations (1) or (2), we conclude that

\[
E\!\left[ \frac{\sum r_{ij}}{k} \right]
= E\!\left[ \frac{\sum \left( x_{ij}'\mu + z_i'\gamma_j + w_j'\lambda_i + \varepsilon_{ij} \right)}{k} \right]
\tag{11}
\]
\[
= \frac{\sum x_{ij}'}{k}\,\mu + \frac{\sum z_i'\gamma_j}{k} + \frac{\sum w_j'\lambda_i}{k} = a.
\tag{12}
\]

Note that both the Bayesian model (1) and the frequentist model (2) have the same expression for r_ij; thus equation (12) has the same form for both. However, the interpretation of equation (12) can be different for the two approaches. For the Bayesian model (1), the new information from equation (12) about the expected average rating is interpreted as a linear equality constraint on the unknown parameters µ, {γ_j}, {λ_i}. For the frequentist model (2), the new information from equation (12) about the expected average rating is interpreted as an additional observation. To see this, denote

\[
\tilde{x} = \frac{\sum x_{ij}}{k}
\qquad\text{and}\qquad
\tilde{\eta} = \frac{\sum z_i'\gamma_j}{k} + \frac{\sum w_j'\lambda_i}{k}.
\]

Then equation (12) is equivalent to having an additional observation in the model,

\[
a = \tilde{x}'\mu + \tilde{\eta},
\tag{13}
\]

where the residual η̃ has a known covariation structure with the other residuals η_ij defined in (3):

\[
E[\tilde{\eta}\,\eta_{ij}] = \sum_{t:\,(i,t)\in S} \frac{w_j'\Lambda w_t}{k} + \sum_{t:\,(t,j)\in S} \frac{z_i'\Gamma z_t}{k},
\tag{14}
\]
\[
E[\tilde{\eta}^2] = \sum_{\substack{i,j,t:\\ (i,j)\in S,\ (i,t)\in S}} \frac{w_j'\Lambda w_t}{k^2} + \sum_{\substack{i,j,t:\\ (i,j)\in S,\ (t,j)\in S}} \frac{z_i'\Gamma z_t}{k^2}.
\tag{15}
\]

Therefore, the constrained model still fits the GLS paradigm presented in Section 3. Note that for the FGLS estimator, equations (14) and (15) introduce an additional row and an additional column to the matrix Ω̂, corresponding to the covariances (14) and (15). By including this additional observation, we create the corresponding matrix Ω̃ from the matrix Ω̂.
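A sketch of this augmentation step follows; the function name, the representation of the observations and of the segment S as lists of (user, item) index pairs, and the use of a dense Ω are illustrative choices rather than anything prescribed by the text. It appends the aggregate rating as the extra observation (13) and extends Ω to Ω̃ using (14) and (15).

```python
import numpy as np

def augment_with_aggregate(X, r, rows, Z, W, Gamma, Lambda, Omega, S, a):
    """Append an aggregate rating a for segment S (a list of (user, item) pairs)
    as one extra observation, per (13), and extend Omega to Omega-tilde
    using the covariances (14) and the variance (15)."""
    k = len(S)
    # x̃ = (1/k) Σ x_ij over the segment, with x_ij = z_i ⊗ w_j.
    x_tilde = sum(np.kron(Z[i], W[j]) for (i, j) in S) / k

    # Covariance of η̃ with each observed residual η_ij, per (14).
    cov = np.zeros(len(rows))
    for n, (i, j) in enumerate(rows):
        c = sum(W[j] @ Lambda @ W[t] for (u, t) in S if u == i)
        c += sum(Z[i] @ Gamma @ Z[t] for (t, v) in S if v == j)
        cov[n] = c / k

    # Variance of η̃, per (15).
    var = 0.0
    for (i, j) in S:
        var += sum(W[j] @ Lambda @ W[t] for (u, t) in S if u == i)
        var += sum(Z[i] @ Gamma @ Z[t] for (t, v) in S if v == j)
    var /= k ** 2

    X_tilde = np.vstack([X, x_tilde])
    r_tilde = np.append(r, a)
    Omega_tilde = np.block([[Omega, cov[:, None]],
                            [cov[None, :], np.array([[var]])]])
    return X_tilde, r_tilde, Omega_tilde
```

Running the FGLS step of Section 3 on (X_tilde, r_tilde, Omega_tilde) then uses the aggregate rating exactly as one more, correlated, observation.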
5. MULTIPLE AGGREGATE RATINGS
In the previous section, we considered only one true aggregate rating a for one particular segment of ratings. In this section, we assume that there is a whole aggregation hierarchy defined for the ratings matrix. One example would be an OLAP-based hierarchy [9] of aggregate ratings.

Given an OLAP hierarchy for users and items, where the ratings constitute measures defined for the OLAP cells [9], consider a particular category of items C_p, a particular segment of users S_q and the cell CELL_pq in the OLAP hierarchy corresponding to C_p and S_q. Also let D_pq be all the ratings that users in segment S_q provided for items in category C_p, and let R^aggr_pq be the aggregate rating for CELL_pq that was independently assigned by the expert to that cell.

Clearly, the expert can assign numerous ratings R^aggr_pq to various aggregate cells CELL_pq at different levels of the OLAP hierarchy. Using the results from Section 4, each aggregate rating R^aggr_pq produces a constraint of the form (12). This means that various aggregate ratings R^aggr_pq produce multiple constraints for different values of p and q, and that these constraints come from various levels of aggregation in the OLAP hierarchy.

In fact, we may introduce so many such constraints that the estimator itself will be largely determined by the constraints and not by the real observation data. The solution to this problem for the aggregation model presented in this paper is that we may have different levels of confidence in the aggregate ratings. For example, we may be more sure that the average rating provided by graduate CS students from University of XYZ for action movies is 6.5 than that the average rating provided by physics students for drama movies is 7.8.

To model this "degree of confidence" in aggregate ratings, we assume that the aggregate ratings are "noisy," which can be formally represented as:

\[
\begin{cases}
E_{\varepsilon}\!\left[ \dfrac{\sum_{i,j} r_{ij}}{k} \right] = \alpha,\\
a = \alpha + \xi, \quad E[\xi] = 0, \quad \operatorname{Var}(\xi) = \sigma_{\xi}^2,
\end{cases}
\tag{16}
\]

where ξ is an unknown noise component, α is the unknown true value, a is the observed value of the aggregate rating, and σ²_ξ is some known parameter.

Including this noise in expression (10) results in the following fuzzy constraint rather than the crisp constraint (12):

\[
a = \underbrace{\frac{\sum x_{ij}'}{k}}_{\tilde{x}'}\mu
  + \underbrace{\frac{\sum z_i'\gamma_j}{k} + \frac{\sum w_j'\lambda_i}{k} + \xi}_{\tilde{\eta}}.
\tag{17}
\]

From the frequentist perspective, the model can still be interpreted as having an additional observation of type (13). Therefore, the multiple aggregation model with different degrees of certainty in the various aggregate ratings can still be defined within the GLS framework, and the same analysis presented in Sections 3 and 4 still holds. By including this additional observation (17), we create the corresponding matrix Ω̃ from the matrix Ω̂, as in Section 4. It can be shown that Ω̃ is not singular.

The parameter σ²_ξ in (16) has the following intuition: it can be interpreted as the weight that we place on the corresponding constraint. It can be shown that the larger σ²_ξ is, the less the FGLS method will try to satisfy the constraint. When we consider multiple constraints, we can put different weights on different constraints by assigning to each constraint i its own "weight" σ²_ξi. In this way, we can accommodate the realistic situation in which some external rating information is more reliable than other information.
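The mechanics behind this weighting can be made explicit with a short calculation that is not spelled out in the text: assuming that the noise ξ in (16) is independent of the random effects γ_j, λ_i and of ε_ij, the covariances of η̃ with the individual residuals are the same as before, and only the variance of the constraint row grows,

\[
E[\tilde{\eta}\,\eta_{ij}] \text{ as in (14)},
\qquad
E[\tilde{\eta}^2] = \bigl(\text{right-hand side of (15)}\bigr) + \sigma_{\xi}^2,
\]

so a larger σ²_ξ inflates the diagonal entry of Ω̃ that corresponds to the constraint, and the GLS weighting discounts that row accordingly.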
The following proposition demonstrates that the constrained models using aggregate ratings, such as FGLS, provide better individual rating estimations than the unconstrained ones.

Proposition 1. The expected mean squared error (MSE) on a test set of the constrained FGLS estimator is smaller than that of the unconstrained FGLS estimator.

Sketch of Proof. The complete proof is available in [13]. The proof is based on the idea that specifying an aggregate rating is equivalent to adding a new observation (17), and on the idea that the sample size matters, i.e., the expected MSE on the test set of the estimator trained on the bigger sample will be smaller than the expected MSE on the test set of the estimator trained on a subset of that sample.

6. CONCLUSIONS
In this paper, we replaced the Bayesian approach previously deployed in [3, 5] with a corresponding frequentist estimation method, Feasible GLS (FGLS), and demonstrated how aggregate ratings can be used to produce additional constraints on the parameters of the FGLS model. We also showed that these additional constraints reduce the rating estimation errors of the FGLS model, resulting in theoretically better rating estimation methods and thus demonstrating how aggregate ratings can improve individual recommendations.

The main issue with the FGLS method is that it works mainly on small to medium-sized problems because of the difficulty of inverting the matrix Ω̂ for large problems. Therefore, as future research, we plan to work on developing more scalable methods for estimating the FGLS and other types of frequentist estimation models that work well for large problems. Also, the next step in our research would be to test our theoretically based conclusions about the superior performance of the constrained models on real data and try to show that the empirical results confirm our theoretical analysis. Finally, we intend to extend Proposition 1 from the FGLS to more general types of estimators.

7. REFERENCES
[1] G. Adomavicius, R. Sankaranarayanan, S. Sen, and A. Tuzhilin. Incorporating contextual information in recommender systems using a multidimensional approach. ACM Trans. on Inf. Systems, 23(1), 2005.
[2] G. Adomavicius and A. Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Trans. on Knowledge and Data Engineering, 17(6), 2005.
[3] A. Ansari, S. Essegaier, and R. Kohli. Internet recommendation systems. Journal of Marketing Research, 37(3), 2000.
[4] J. Bollen. Group user models for personalized hyperlink recommendations. Procs of the International Conference on Adaptive Hypermedia and Adaptive Web-Based Systems, 2000.
[5] M. Condliff, D. Lewis, D. Madigan, and C. Posse. Bayesian mixed-effects models for recommender systems. ACM SIGIR'99 Workshop on Recommender Systems: Algorithms and Evaluation, 15(5), 1999.
[6] A. Gelfand and A. Smith. Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85(410), 1990.
[7] A. Gelfand, A. Smith, and T. Lee. Bayesian analysis of constrained parameter and truncated data problems using Gibbs sampling. Journal of the American Statistical Association, 87(418), 1992.
[8] W. Greene. Econometric Analysis. Prentice Hall, 2002.
[9] R. Kimball. The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling. John Wiley and Sons, Inc., 2002.
[10] Y. C. Kwong. Annual Review of Scalable Computing. Singapore University Press, 2001.
[11] M. O'Connor, D. Cosley, J. A. Konstan, and J. Riedl. PolyLens: A recommender system for groups of users. Procs of ECSCW, 2001.
[12] S. W. Raudenbush and A. S. Bryk. Hierarchical Linear Models: Applications and Data Analysis Methods. Sage Publications, Inc., 2001.
[13] A. Umyarov and A. Tuzhilin. Leveraging aggregate ratings for better recommendations. Working paper CeDER-07-03, Stern School of Business, New York University, 2007.