Electronic Commerce Research, 5: 51–74 (2005)
© 2005 Springer Science + Business Media, Inc. Manufactured in the Netherlands.

CROC: A New Evaluation Criterion for Recommender Systems ∗

ANDREW I. SCHEIN, ALEXANDRIN POPESCUL and LYLE H. UNGAR    {ais;popescul;ungar}@cis.upenn.edu
University of Pennsylvania, Department of Computer and Information Science, Levine Hall, 3330 Walnut Street, Philadelphia, PA 19104-6389, USA

DAVID M. PENNOCK ∗∗    david.pennock@overture.com
Overture Services, Inc., 74 N. Pasadena Ave, 3rd floor, Pasadena, CA 91103, USA

∗ Portions of this work appeared earlier: [Schein et al., 32], © 2002 ACM, http://doi.acm.org/10.1145/564376.564421
∗∗ This work was conducted while at NEC Laboratories America, Princeton, NJ.

Abstract

Evaluation of a recommender system algorithm is a challenging task due to the many possible scenarios in which such systems may be deployed. We have designed a new performance plot called the CROC curve with an associated statistic: the area under the curve. Our CROC curve supplements the widely used ROC curve in recommender system evaluation by discovering performance characteristics that standard ROC evaluation often ignores. Empirical studies on two domains and including several recommender system algorithms demonstrate that combining ROC and CROC curves in evaluation can lead to a more informed characterization of performance than using either curve alone.

Keywords: recommender systems, collaborative filtering, performance evaluation

1. Introduction

Recommender systems (RS) suggest items of interest to users based on their explicit and implicit preferences, the preferences of other users, and user and item attributes. Methods for conveying recommendations to users are virtually unlimited. Recommendations can be sent through Email solicitation (push mechanism), through explicit user request (pull mechanism) or as a suggestion in a web page the user is viewing (push mechanism). In many applications the number of recommendations made to each user must be limited to a handful in order to prevent information fatigue among the user base. However, traditional evaluation metrics for recommender systems—including ROC curves and mean absolute error (MAE) statistics—do not take this phenomenon into account, and may reward recommender systems that skew the number of recommendations to the users who rate most frequently or most highly.

In this paper we will explore a new performance plot, called the CROC curve, which evaluates system performance in situations where each user receives the same number of
recommendations. While traditional ROC curves uncover the raw performance of an RS algorithm, our CROC curve measures the ability of the RS to provide reasonable recommendations across a wide user base; thus we find significant advantages in using both CROC and ROC metrics together. ROC curves are formed by creating one large ranked list of recommendations (p_i, m_j), indicating user i likes/rates/purchases item j. Optimizing performance on an ROC curve can mean dramatically skewing recommendations to the most active users or the users who rate more highly on average. In contrast, the CROC curve creates one ranked list for each user, interleaving recommendations evenly among users. In this way recommendations are constrained in the CROC evaluation so that each user receives the same number of recommendations, though different users will receive a separate set of recommendations according to their predicted preferences.

We demonstrate our two-metric evaluation strategy by recommending web pages and movies: predicting both explicit ratings on a 1–5 scale and implicit preference information such as a web site visitation. In order to ascertain the value of the CROC curve, we evaluate three machine learning algorithms as well as baseline popularity-ranked recommendations. We evaluate in settings where we must recommend items nobody has rated before (i.e., recommending from a cold-start) as well as in the hot-start setting. In all of these situations we pinpoint advantages and disadvantages of the different recommender system algorithms tested, leading to some surprising conclusions about RS performance. Our insights into algorithm performance demonstrate the advantage of combining ROC and CROC metrics in evaluation, compared to current practices that ignore the consequences of recommendation constraints in performance summarization.

2. Background and related work

To date, most comparisons among algorithms have been empirical or qualitative in nature [Herlocker et al., 13; Sarwar et al., 27], though some worst-case performance bounds have been derived [Freund et al., 8; Nakamura and Abe, 20], some general principles advocated [Freund et al., 8], and some fundamental limitations explicated [Pennock et al., 21]. Techniques suggested for evaluating recommender system performance include mean absolute error, receiver operator characteristic (ROC) curves, ranked list techniques [Breese et al., 3; Herlocker et al., 13; Schein et al., 31] and variants of precision/recall statistics [Sarwar et al., 27]. Our CROC curve, defined and developed in this paper, can be viewed as an ROC curve created through constraints on the recommendations. Sarwar et al. [27] define analogous constraints on precision/recall statistics. Our contribution adds to the work of Sarwar et al. by demonstrating why traditional metrics such as ROC curves and precision/recall must be adjusted for the recommender system domain; we show that ROC curves and CROC curves frequently make conflicting statements about which recommender systems are best. Breese et al. [3] propose another method to combine performance information about individual users. Their R metric averages performance over the entire user base using an exponential decay in the user's attention as the user peruses further and further down a list of recommendations. Unlike the R rank metric, the ROC and CROC methods
employed in our evaluation have no parameters that must be set prior to evaluation, such as a decay rate.

Recommender system algorithms can be classified into collaborative filtering, content filtering, or methods that combine both forms of information. Pure collaborative filtering methods [Breese et al., 3; Hill et al., 14; Konstan et al., 17; Resnick and Varian, 25; Shardanand and Maes, 34] base their recommendations on community preferences (e.g., user ratings and purchase histories), ignoring user and item attributes (e.g., demographics and product descriptions). On the other hand, pure content-based filtering or information filtering methods [Mooney and Roy, 19; Salton and McGill, 26] typically match query words or other user data with item attribute information, ignoring data from other users. Several hybrid algorithms combine both techniques [Basu et al., 1; Claypool et al., 4; Condliff et al., 6; Good et al., 10; Popescul et al., 23; Schein et al., 32]. Though “content” usually refers to descriptive words associated with an item, we use the term more generally to refer to any form of item attribute information including, for example, the list of actors in a movie.

Early recommender systems were pure collaborative filters that computed pairwise similarities among users and recommended items according to a similarity-weighted average [Resnick et al., 24; Shardanand and Maes, 34]. Breese et al. [3] refer to this class of algorithms as memory-based algorithms. Subsequent authors employed a variety of techniques for collaborative filtering, including hard-clustering users into classes [Breese et al., 3], simultaneously hard-clustering users and items [Ungar and Foster, 36], soft-clustering users and items [Hofmann and Puzicha, 16; Popescul et al., 23], singular value decomposition [Sarwar et al., 28], inferring item-item similarities [Sarwar et al., 29], probabilistic modeling [Breese et al., 3; Condliff et al., 6; Heckerman et al., 12; Pennock et al., 22; Popescul et al., 23; Schein et al., 31, 32], machine learning [Basu et al., 1; Billsus and Pazzani, 2; Nakamura and Abe, 20], and list-ranking [Cohen et al., 5; Freund et al., 8; Pennock et al., 21]. More recently, authors have turned toward designing hybrid recommender systems that combine both collaborative and content information in various ways [Basu et al., 1; Claypool et al., 4; Condliff et al., 6; Good et al., 10; Popescul et al., 23; Schein et al., 31, 32].

One difficult, though common, problem for a recommender system is the cold-start problem, where recommendations are required for items that no one (in our data set) has yet rated.1 Cold-start recommending is particularly hard since pure collaborative filtering approaches fail: new users have no history, yet history is the sole training information of a pure collaborative filtering system. In this work we will benchmark recommender systems in hot-start settings in addition to the new-item cold-start setting. Benchmarking in the cold-start setting requires slight changes in the experimental design, but exposes convenient theoretical properties of the CROC metric.

In the sections that follow we define the ROC and CROC curves for recommender system evaluation, and then describe the algorithms used in the empirical evaluation and the experimental setup. Results and discussion follow.
3. Evaluation metrics

In this section we define constrained and unconstrained modes of allocating recommendations and explore their consequences for evaluation of algorithms. Formally, we define a recommendation as a pair (p, m) interpreted to mean “recommend m to p”. We can imagine creating a ranked list out of all possible recommendations (a |P||M| sized list), and pulling off the top n to actually recommend. There are no constraints introduced as of yet; theoretically a recommender system may choose to recommend all items to a particular user before recommending items to any other user. In fact, we will find in practice that certain algorithms will recommend more often to some users than to others if unconstrained. Alternatively, we may wish to know the performance of our system under circumstances where we recommend the same number of items to each user. We refer to these two models of recommender system usage as unconstrained and constrained recommending respectively, and develop evaluation metrics for each case.

In the constrained mode of recommending, we create |P| separate ranked lists of recommendations: one list per user. To make n total constrained recommendations we pull n/|P| recommendations off of the top of each of the |P| lists. The following sections describe metrics for both unconstrained and constrained recommending.

3.1. Unconstrained evaluation: the ROC curve

For unconstrained recommending, we employ the receiver operator characteristic (ROC) curve [Swets, 35] advocated for recommender system evaluation by Herlocker et al. [13]. ROC curves are suited for tracking performance in binary classification tasks while varying a classification threshold. RS applications are cast as binary classification when we classify a person/movie pair as like/does-not-like (rating prediction) or purchased/did-not-purchase (implicit rating prediction). Instead of varying a threshold, we vary the number of top (p, m) tuples in a ranked list that we use as recommendations. ROC curves plot the false-alarm rate on the x-axis against the hit rate2 on the y-axis, where we have:

hit rate = tp / (tp + fn),
false-alarm rate = fp / (fp + tn).

The number of true positives, denoted tp, is the number of positive examples we correctly identify as such. The number of false positives, denoted fp, is the number of negatives that we misclassify as positive. The definitions for true negatives (tn) and false negatives (fn) are analogous. ROC curves have the convenient property that random recommendation is characterized by an expected plot of a forty-five degree line from the lower-left to upper-right corners of the graph, and perfect performance is characterized by a horizontal line one unit above the origin. A useful statistic for summarizing the curve is the area under the curve. Here, perfect performance is characterized by area 1.0 and random performance by an expected area of 0.5. ROC curves are constructed in the following manner:

1. Order the predictions pred(p_i, m_j) in a list by magnitude, imposing an ordering: (p, m)_k.
2. Pick k′, calculate the hit/false-alarm rates obtained by predicting the top k′ pairs (p, m)_k in the list, and plot the point.

By selecting different k′ (e.g., incrementing k′ by a fixed amount) we draw a curve on the graph.
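As a concrete illustration, the following Python sketch traces an unconstrained ROC curve and its area from a single global ranked list. The data layout (a dictionary of predicted scores keyed by (user, item) pairs and a set of positive test pairs) is an assumption made for this example, not the implementation used in our experiments.

```python
# Minimal sketch (illustrative, not the experimental code): global ROC curve.
# `scores` maps (user, item) -> predicted affinity; `positives` is the set of
# test-set pairs the user actually liked/purchased.

def roc_curve(scores, positives):
    ranked = sorted(scores, key=scores.get, reverse=True)   # one big ranked list
    n_pos = len(positives)
    n_neg = len(ranked) - n_pos
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for pair in ranked:                  # sweep the cutoff k' one pair at a time
        if pair in positives:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))   # (false-alarm rate, hit rate)
    return points

def area_under_curve(points):
    # Trapezoidal integration over the swept curve.
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(points, points[1:]))
```

Sweeping the cutoff one pair at a time yields the same curve as a coarser grid of k′ values, only at finer resolution.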
3.2. Constrained evaluation: the CROC curve

For the constrained mode of recommending we introduce the CROC, or Customer ROC curve. The curve plots hit rates and false-alarm rates as defined in the standard ROC curve setting; however, a constraint is imposed so that each user gets the same number of recommendations. This is achieved by creating one separate ranked list for each user in the data set and plotting points along the curve by recommending the top k items on each list. Note that the |P| ranked lists can have independent orderings of items, so users are potentially recommended separate sets of items.

In most forms of recommender system evaluation, users have different numbers of observations in the testing set; it is common practice to remove training set data from consideration in testing. In this case the |P| lists have different lengths. There are exceptions to this rule, for instance when performing certain types of cold-start evaluation where none of the test set events occur in training. Also, there may be alternative scenarios where a data set contains information about repeated purchases of an item, and therefore it should be permitted to recommend all person/item pairs in testing. However, the evaluations in this paper assume that we do not want to recommend an item if we know that a user has previously seen, purchased or rated that item.

Let n(p) be the number of items available to user p in the test set observations. According to the CROC curve evaluation, the recommendation problem is essentially to guess which k of the n(p) items each user p will like/purchase. When n(p) varies by p (i.e., the |P| lists have different lengths), we make at most n(p) recommendations for user p. The algorithm for generating the CROC curve is:

1. For each person p_i, order the predictions pred(p_i, m_j) in a list by magnitude, imposing an ordering: (m)_k.
2. Pick k′, calculate the hit/false-alarm rates obtained by recommending the top predicted min(k′, n(p)) movies to each person p from their own list, and plot the hit rate against the false-alarm rate.

By varying k′ we generate the different points along the curve.
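The sketch below mirrors this constrained construction under the same illustrative data layout as the ROC sketch above: one ranked list is built per user, and every point on the curve corresponds to giving each user the same recommendation budget k′. It is a sketch under assumed data structures, not the experimental implementation.

```python
# Minimal sketch (illustrative) of the constrained (CROC) construction.
# `scores` maps (user, item) -> predicted affinity, `positives` is the set of
# liked/purchased test pairs, and `candidates[user]` lists that user's n(p)
# test-set items.

def croc_curve(scores, positives, candidates):
    # One ranked list per user, ordered by predicted affinity.
    ranked = {u: sorted(items, key=lambda m: scores[(u, m)], reverse=True)
              for u, items in candidates.items()}
    n_pos = len(positives)
    n_neg = sum(len(items) for items in candidates.values()) - n_pos
    max_k = max(len(items) for items in candidates.values())
    points = [(0.0, 0.0)]
    for k in range(1, max_k + 1):        # give every user the same budget k
        tp = fp = 0
        for u, items in ranked.items():
            for m in items[:min(k, len(items))]:
                if (u, m) in positives:
                    tp += 1
                else:
                    fp += 1
        # Recomputing each point from scratch keeps the sketch simple, not fast.
        points.append((fp / n_neg, tp / n_pos))
    return points
```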
There are some important differences between the CROC curve and the ROC curve. For example, the perfect recommender in a CROC curve is not necessarily a horizontal line one unit above the x-axis. To see why, imagine evaluating an omniscient (i.e., perfect) recommender on a data set with three people, each with six observations: person a likes only four out of six movies, person b likes only two out of six movies, and person c likes all six movies. When we recommend four movies to each person, we end up with two false-positives from person b, lowering the area of the curve. However, for any particular data set, we can plot the curve and calculate the area of the omniscient recommender in order to facilitate comparison.

Random performance in the CROC curve is another important behavior to understand. A natural question to ask is whether a random recommender gives an expected plot consisting of a forty-five degree line as in the ROC curve. When each user has the same number of test set observations this quality holds for the CROC plot, as we will prove. An experiment where k items are sampled without replacement from a population of size N with s successes in the population follows the hypergeometric distribution [Feller, 7]. Consequently, the expected number of successes is sk/N. Assume for the moment that each user has the same number of observations (items) for recommendation in a test set: n(p) = N. Further, we denote the number of items each user has rated highly (success events) in the entire test set pool as s(p). In what follows, |P| is the total number of users we make recommendations for, and S is the total number of successes, summed over all users.

Theorem 1 (CROC random recommendation). The expected CROC curve of a random recommender on a test set where each user has the same number of observations is a forty-five degree line.

Proof: Making k recommendations to user p leads to an expected number of hits ks(p)/N. Summing over all users we obtain the expected number of hits:

(k/N) Σ_p s(p) = kS/N,    (1)

leading to expected hit rate k/N. Making k recommendations to user p leads to an expected number of false alarms k(N − s(p))/N. Summing over all users we obtain:

(k/N) Σ_p [N − s(p)] = k(|P|N − S)/N,    (2)

leading to expected false-alarm rate k/N. The expected CROC point coordinates after k random recommendations are therefore (k/N, k/N), leading to a forty-five degree line. □

The theorem applies for cold-start implicit rating and cold-start explicit rating prediction, as defined in Section 5. Section 6.2 provides an opportunity to confirm that the forty-five degree line described in the theorem holds up well in practice. Our other evaluations do not invoke this theorem due to having different test set observation counts for different users. In cases where the number of observations varies for each user, we can provide no theoretical guarantee on the random recommender at the moment. With a little thought it becomes clear that in circumstances where having fewer observations is correlated with having a greater number of positive outcomes, the expected curve will be above the forty-five degree line. The lack of theoretical guarantee has not been an impediment to other performance measures such as mean absolute error or precision/recall curves. In cases where we cannot invoke the theorem, we can simulate random recommendation and plot the CROC curve to provide this baseline.
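As a sanity check of Theorem 1, a random recommender can be simulated directly: score every candidate pair with a uniform random number and trace the resulting curve with the illustrative croc_curve helper sketched above. The helper below is a sketch under the same assumed data layout, not part of the original implementation.

```python
# Minimal sketch (illustrative): empirical check of Theorem 1 by scoring items
# at random and plotting the resulting CROC curve via the croc_curve() sketch.
import random

def random_croc(positives, candidates, seed=0):
    rng = random.Random(seed)
    scores = {(u, m): rng.random()        # a random recommender: random scores
              for u, items in candidates.items() for m in items}
    return croc_curve(scores, positives, candidates)

# When every user has the same number of test items, the returned points
# scatter around the diagonal (x, x), matching the hypergeometric argument.
```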
3.3. Interpreting ROC and CROC curves

On both ROC and CROC curves the portion of the curve of special interest in evaluation is the left-hand side. The left-hand side corresponds to making k recommendations where k is small in comparison to the total number of recommendations we could make. In most applications we have the opportunity to recommend only a few items before a user gets tired of looking at the list of suggestions, and so performance on the left-hand side of the curve is critical. In our present experiments we plot the entire curve and calculate the area under the entire curve. Alternatively, we might have truncated the curve at a low false-alarm rate such as 0.3 and calculated the area under this portion of the curve to lend greater emphasis to the region of interest.

4. Recommender system algorithms tested

In this paper, we apply and evaluate three probabilistic methods for describing the relationship between people and items: a logistic principal component analysis, an aspect model, and a naïve Bayes model. We will also employ popularity-ranked recommendation methods that effectively recommend the most popular items or recommend only to users that seem to like everything. Popularity rank methods are interesting because they are the simplest to implement and, as we will see in the evaluations, sometimes outperform other algorithms. For consistency's sake, the models are denoted in the form of a movie recommendation task with the symbol m standing for a particular movie. However, in our evaluation we will recommend web pages as well.
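For concreteness, popularity-style baselines of the kind described above can be sketched as follows. The two scoring rules (overall item popularity, and a per-user positivity rate that favors users who seem to like everything) are one reading of such baselines, and the data layout is assumed for illustration rather than taken from the original implementation.

```python
# Minimal sketch (illustrative) of two popularity-style baseline scorers.
# `train_pairs`: iterable of (user, item) positive training events.
# `train_counts[user]`: number of training observations for that user (> 0).
# `candidates[user]`: the user's test-set items to be scored.
from collections import Counter

def item_popularity_scores(train_pairs, candidates):
    counts = Counter(item for _, item in train_pairs)       # overall popularity
    return {(u, m): counts[m] for u, items in candidates.items() for m in items}

def user_positivity_scores(train_pairs, train_counts, candidates):
    likes = Counter(user for user, _ in train_pairs)
    rate = {u: likes[u] / train_counts[u] for u in train_counts}
    return {(u, m): rate.get(u, 0.0) for u, items in candidates.items() for m in items}
```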
4.1. Logistic principal component analysis

Logistic principal component analysis (LPCA) [Schein et al., 33] is a dimensionality reduction technique for binary data that is derived from a generalization of standard principal component analysis in the same way that logistic regression is derived from the generalized linear model framework. In the movie recommendation domain we use |P|, |M| and |L| to denote the number of people, movies and the latent dimensions, respectively. Given a Person × Movie matrix X of binary occurrence data indicating whether person p sees movie m, we hypothesize a latent space of dimension |L| where |L| ≪ |M|. We reduce the original data X to a set of coordinates (U = |P| × |L| matrix) in a lower dimensional axis (V = |L| × |M| matrix). We calculate coordinates and axis (U and V) by maximizing the likelihood:

L = Σ_{p=1}^{|P|} Σ_{m=1}^{|M|} [ X_pm log σ(Θ_pm) + (1 − X_pm) log σ(−Θ_pm) ],    (3)

where Θ factors into U and V as described in Table 1 and the logistic function σ(θ) is defined as

σ(θ) = 1 / (1 + exp(−θ)).    (4)

Table 1. Notation for the logistic PCA model.

  |P|   number of observations (people)
  |M|   dimensionality of data (movies)
  |L|   dimensionality of latent space
  U     coefficients (|P| × |L| matrix)
  V     basis vectors (|L| × |M| matrix)
  Δ     bias vector (1 × |M| vector)

  Θ_pm = (UV)_pm + Δ_m

In our benchmarking experiments we estimate the parameters through the alternating least squares (ALS) algorithm described in Schein et al. [33], but fit the dimensions in a stagewise fashion (e.g., determine dimension k − 1, fix it and then determine dimension k). We set the number of dimensions by performing validation on a portion of the training data, using a separate test set of observations only at final evaluation.
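A minimal sketch of the quantities in Equations (3) and (4) is given below, assuming the factors U, V and Δ have already been fit (the ALS fitting procedure itself is described in Schein et al. [33] and is not reproduced here); the array names are illustrative.

```python
# Minimal sketch (illustrative): the logistic PCA log-likelihood of Equation (3)
# and the scores used for ranking, given fitted factors.
# U is |P| x |L|, V is |L| x |M|, and delta has shape (|M|,); all are assumed
# to be NumPy arrays produced by some fitting routine (ALS is not shown here).
import numpy as np

def lpca_scores(U, V, delta):
    theta = U @ V + delta                  # Theta_pm = (UV)_pm + Delta_m
    return 1.0 / (1.0 + np.exp(-theta))    # sigma(Theta_pm) = P(X_pm = 1)

def lpca_log_likelihood(X, U, V, delta):
    # X is a binary |P| x |M| occurrence matrix.
    p = lpca_scores(U, V, delta)
    return np.sum(X * np.log(p) + (1 - X) * np.log(1 - p))
```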
4.2. The two-way aspect model

The aspect model, a latent class variable framework designed for contingency table smoothing [Hofmann, 15], appears natural for the task of predicting an association between p and m. Figure 1(a) shows a graphical model description of the aspect model for a person/movie contingency table, and Table 2 explains our notation used in the graphical model as well as in other descriptions of the aspect model applied to the movie recommendation task.

Figure 1. Graphical model of (a) person/movie aspect model and (b) person/actor aspect model. These graphs can be interpreted precisely as Bayesian belief networks.

Table 2. Notation used in our aspect model descriptions.

  Random variable   Object   Interpretation
  P                 p        person
  M                 m        movie
  A                 a        actor
  Z                 z        latent class

4.2.1. Pure collaborative filtering model

The aspect model of Figure 1(a) encodes a probability distribution over each person/movie pair. In other RS domains we replace the movie symbol M with the item of interest. Observations consist of tuples (p, m) recording that person p has seen/rated movie m. We store observations in a count matrix or contingency table with rows ranging over people and columns ranging over movies (or vice versa). In some domains the data may include multiple observations that are identical (e.g., Lyle saw Memento twice). With each observation we increment by one the count of the appropriate contingency table cell (or matrix entry). A naïve probability estimate for each cell is simply the observed frequency of events in that cell. However, notice that using this method of assigning probabilities, an empty cell implies that there is zero probability of the corresponding person seeing the corresponding movie, clearly an unrealistic inference.

An aspect model hypothesizes the existence of a hidden or latent variable z (e.g., an affinity for a particular style of movies) that motivates person p to watch movie m. According to the generative model semantics, person p chooses a latent class z, which in turn determines the movie m watched. The choice of movie m is assumed independent of p given knowledge of z. Since z is hidden, we sum over possible choices to define the distribution over (p, m):

P(p, m) = Σ_z P(p) P(z|p) P(m|z).    (5)

Parameters P(z|p) and P(m|z) correspond to the processes of p stochastically choosing a latent class z, and z stochastically choosing m. The P(p, m) values can be thought of as smoothed estimates of the probability distribution of the contingency table. The latent variables perform the smoothing in a manner that maximizes the model likelihood (by keeping estimates of P(p, m) close to the empirical distribution). The model also creates smoothed estimates for the values P(p) and P(m), both taking their interpretations from contingency table analysis. The parameters are calculated using the tempered EM algorithm as described in [Hofmann, 15]. We choose the number of latent classes using performance on a partition of training data as the criterion. Our own source code for fitting the two-way aspect model is available online [Schein et al., 30].3
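The sketch below implements plain (untempered) EM updates for the two-way aspect model of Equation (5); the tempering and the validation loop over the number of latent classes are omitted, and the data layout (a dictionary of co-occurrence counts) is an assumption made for illustration rather than the released fitting code.

```python
# Minimal sketch (illustrative): plain EM for the person/movie aspect model.
# `counts` maps (p, m) -> n(p, m), the contingency-table counts.
import random
from collections import defaultdict

def fit_aspect_model(counts, n_classes, n_iter=50, seed=0):
    rng = random.Random(seed)
    people = {p for p, _ in counts}
    movies = {m for _, m in counts}

    def rand_dist(keys):                      # random normalized distribution
        w = {k: rng.random() for k in keys}
        s = sum(w.values())
        return {k: v / s for k, v in w.items()}

    p_z_given_p = {p: rand_dist(range(n_classes)) for p in people}
    p_m_given_z = {z: rand_dist(movies) for z in range(n_classes)}
    total = sum(counts.values())
    p_p = defaultdict(float)
    for (p, m), n in counts.items():
        p_p[p] += n / total                   # empirical P(p)

    for _ in range(n_iter):
        new_zp = {p: defaultdict(float) for p in people}
        new_mz = {z: defaultdict(float) for z in range(n_classes)}
        for (p, m), n in counts.items():
            # E-step: P(z | p, m) proportional to P(z|p) P(m|z)
            post = [p_z_given_p[p][z] * p_m_given_z[z].get(m, 0.0)
                    for z in range(n_classes)]
            norm = sum(post) or 1.0
            for z, q in enumerate(post):
                q = n * q / norm
                new_zp[p][z] += q             # expected counts for the M-step
                new_mz[z][m] += q
        # M-step: renormalize expected counts.
        p_z_given_p = {p: {z: c / sum(d.values()) for z, c in d.items()}
                       for p, d in new_zp.items()}
        p_m_given_z = {z: {m: c / sum(d.values()) for m, c in d.items()}
                       for z, d in new_mz.items()}
    return p_p, p_z_given_p, p_m_given_z

def aspect_score(p, m, p_p, p_z_given_p, p_m_given_z):
    # Smoothed P(p, m) from Equation (5).
    return p_p[p] * sum(p_z_given_p[p][z] * p_m_given_z[z].get(m, 0.0)
                        for z in p_z_given_p[p])
```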
4.2.2. Adding content information

The recommender system described so far is a pure collaborative filtering algorithm developed by Hofmann and Puzicha [16]. In our benchmarking, we will apply the pure collaborative filtering aspect model to the prediction of web page visitation. In addition, we will apply a similar model to cold-start recommendations of movies. Since it is impossible to make cold-start recommendations with any pure collaborative filtering model, we seek a way to implant content information into the aspect model. The person/actor aspect model of Figure 1(b) combines collaborative with content data in one model by exploiting lists of actors starring in a movie as “content”:

P(p, a) = Σ_z P(p) P(z|p) P(a|z).    (6)

Person/content models of this form have been shown to improve performance when the pure collaborative techniques suffer from sparse data [Popescul et al., 23].

We generate a dataset from the collaborative filtering model by taking the collaborative observations (p, m) and creating a set of observations (p, a_i) for each actor i in movie m. These newly formed observations are not independent of each other, explicitly breaking a modeling assumption of the aspect model. In practice the broken assumption does not prevent us from obtaining good results from this model.

4.2.3. Folding in

Notice that the person/actor aspect model does not have a movie object in the event space. In order to recommend a movie, we must create a new movie object out of the set of actors that appear in that movie. This pseudo-movie is then placed in the latent space based on the content information. We use Hofmann's [15] folding-in algorithm (originally used to fold term-queries into a document-word aspect model). For example, suppose we have fit a person/actor model and want to fold in a new movie. We create a new set of parameters P(z|m) and use the actors in the movie {(a, m)} as evidence for placing the movie in latent space in a manner that maximizes the likelihood of the movie. All of the original parameters from Equation (6) are held constant during the process. The exact EM algorithm operates as follows.

E-Step:
P(z|a, m) ∝ P(a|z) P(z|m).    (7)

M-Step:
P(z|m) ∝ Σ_a n(a, m) P(z|a, m).    (8)

Recommendations are made using:

P(p|m) = Σ_z P(p|z) P(z|m).    (9)

If we desire an estimated value of P(p, m), we will first need to estimate or define P(m). To generate the experimental results in this paper, we use a uniform prior probability assumption over movies.
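A minimal sketch of the folding-in procedure of Equations (7)–(9) appears below, assuming a previously fit person/actor model supplies P(a|z) and P(p|z); the parameter names and data structures are illustrative rather than those of the original implementation.

```python
# Minimal sketch (illustrative): folding a new movie into a fitted person/actor
# aspect model. p_a_given_z[z][a] = P(a|z) and p_p_given_z[z][p] = P(p|z) are
# assumed outputs of a prior fit; `actors` lists the actors in the new movie.

def fold_in_movie(actors, p_a_given_z, n_iter=25):
    classes = list(p_a_given_z)
    p_z_given_m = {z: 1.0 / len(classes) for z in classes}  # new parameters P(z|m)
    for _ in range(n_iter):
        expected = {z: 0.0 for z in classes}
        for a in actors:
            # E-step: P(z|a, m) proportional to P(a|z) P(z|m)   (Equation (7))
            post = {z: p_a_given_z[z].get(a, 0.0) * p_z_given_m[z] for z in classes}
            norm = sum(post.values()) or 1.0
            for z in classes:
                expected[z] += post[z] / norm
        # M-step: P(z|m) proportional to sum_a n(a, m) P(z|a, m) (Equation (8))
        total = sum(expected.values()) or 1.0
        p_z_given_m = {z: expected[z] / total for z in classes}
    return p_z_given_m

def cold_start_score(p, p_z_given_m, p_p_given_z):
    # P(p|m) = sum_z P(p|z) P(z|m)                              (Equation (9))
    return sum(p_p_given_z[z].get(p, 0.0) * p_z_given_m[z] for z in p_z_given_m)
```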