Evaluating Collaborative Filtering Recommender Systems

JONATHAN L. HERLOCKER
Oregon State University
and
JOSEPH A. KONSTAN, LOREN G. TERVEEN, and JOHN T. RIEDL
University of Minnesota

Recommender systems have been evaluated in many, often incomparable, ways. In this article, we review the key decisions in evaluating collaborative filtering recommender systems: the user tasks being evaluated, the types of analysis and datasets being used, the ways in which prediction quality is measured, the evaluation of prediction attributes other than quality, and the user-based evaluation of the system as a whole. In addition to reviewing the evaluation strategies used by prior researchers, we present empirical results from the analysis of various accuracy metrics on one content domain where all the tested metrics collapsed roughly into three equivalence classes. Metrics within each equivalency class were strongly correlated, while metrics from different equivalency classes were uncorrelated.

Categories and Subject Descriptors: H.3.4 [Information Storage and Retrieval]: Systems and Software—Performance evaluation (efficiency and effectiveness)

General Terms: Experimentation, Measurement, Performance

Additional Key Words and Phrases: Collaborative filtering, recommender systems, metrics, evaluation

This research was supported by the National Science Foundation (NSF) under grants DGE 95-54517, IIS 96-13960, IIS 97-34442, IIS 99-78717, IIS 01-02229, and IIS 01-33994, and by Net Perceptions, Inc.
Authors’ addresses: J. L. Herlocker, School of Electrical Engineering and Computer Science, Oregon State University, 102 Dearborn Hall, Corvallis, OR 97331; email: herlock@cs.orst.edu; J. A. Konstan, L. G. Terveen, and J. T. Riedl, Department of Computer Science and Engineering, University of Minnesota, 4-192 EE/CS Building, 200 Union Street SE, Minneapolis, MN 55455; email: {konstan, terveen, riedl}@cs.umn.edu.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 1515 Broadway, New York, NY 10036 USA, fax: +1 (212) 869-0481, or permissions@acm.org.
© 2004 ACM 1046-8188/04/0100-0005 $5.00
ACM Transactions on Information Systems, Vol. 22, No. 1, January 2004, Pages 5–53.

1. INTRODUCTION

Recommender systems use the opinions of a community of users to help individuals in that community more effectively identify content of interest from a potentially overwhelming set of choices [Resnick and Varian 1997]. One of
the most successful technologies for recommender systems, called collaborative filtering, has been developed and improved over the past decade to the point where a wide variety of algorithms exist for generating recommendations. Each algorithmic approach has adherents who claim it to be superior for some purpose. Clearly identifying the best algorithm for a given purpose has proven challenging, in part because researchers disagree on which attributes should be measured, and on which metrics should be used for each attribute. Researchers who survey the literature will find over a dozen quantitative metrics and additional qualitative evaluation techniques.

Evaluating recommender systems and their algorithms is inherently difficult for several reasons. First, different algorithms may be better or worse on different data sets. Many collaborative filtering algorithms have been designed specifically for data sets where there are many more users than items (e.g., the MovieLens data set has 65,000 users and 5,000 movies). Such algorithms may be entirely inappropriate in a domain where there are many more items than users (e.g., a research paper recommender with thousands of users but tens or hundreds of thousands of articles to recommend). Similar differences exist for ratings density, ratings scale, and other properties of data sets.

The second reason that evaluation is difficult is that the goals for which an evaluation is performed may differ. Much early evaluation work focused specifically on the “accuracy” of collaborative filtering algorithms in “predicting” withheld ratings. Even early researchers recognized, however, that when recommenders are used to support decisions, it can be more valuable to measure how often the system leads its users to wrong choices. Shardanand and Maes [1995] measured “reversals”—large errors between the predicted and actual rating; we have used the signal-processing measure of the Receiver Operating Characteristic curve [Swets 1963] to measure a recommender’s potential as a filter [Konstan et al. 1997]. Other work has speculated that there are properties different from accuracy that have a larger effect on user satisfaction and performance. A range of research and systems have looked at measures including the degree to which the recommendations cover the entire set of items [Mobasher et al. 2001], the degree to which recommendations made are nonobvious [McNee et al. 2002], and the ability of recommenders to explain their recommendations to users [Sinha and Swearingen 2002]. A few researchers have argued that these issues are all details, and that the bottom-line measure of recommender system success should be user satisfaction. Commercial systems measure user satisfaction by the number of products purchased (and not returned!), while noncommercial systems may just ask users how satisfied they are.

Finally, there is a significant challenge in deciding what combination of measures to use in comparative evaluation. We have noticed a trend recently—many researchers find that their newest algorithms yield a mean absolute error of 0.73 (on a five-point rating scale) on movie rating datasets. Though the new algorithms often appear to do better than the older algorithms they are compared to, we find that when each algorithm is tuned to its optimum, they all produce similar measures of quality. We—and others—have speculated that we may be reaching some “magic barrier” where natural variability may prevent us from getting much more accurate.
In support of this, Hill et al. [1995] have shown that users provide inconsistent ratings when asked to rate the same movie at different times. They suggest that an algorithm cannot be more accurate than the variance in a user’s ratings for the same item. Even when accuracy differences are measurable, they are usually tiny. On a five-point rating scale, are users sensitive to a change in mean absolute error of 0.01?
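To make the metric concrete, mean absolute error is simply the average magnitude of the difference between predicted and withheld ratings. The following sketch illustrates the computation on a handful of invented ratings on a five-point scale; the values and variable names are ours, for illustration only, and do not come from any of the systems discussed here.

```python
# Mean absolute error (MAE) on a five-point rating scale.
# The rating values below are invented for illustration only.
withheld = [4, 2, 5, 3, 1, 4]                 # actual ratings held out from the algorithm
predicted = [3.6, 2.4, 4.1, 3.3, 1.9, 4.4]    # the algorithm's predictions for the same items

mae = sum(abs(p - a) for p, a in zip(predicted, withheld)) / len(withheld)
print(f"MAE = {mae:.2f}")  # a shift of 0.01 in this number is the change referred to above
```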
These observations suggest that algorithmic improvements in collaborative filtering systems may come from different directions than just continued improvements in mean absolute error. Perhaps the best algorithms should be measured in accordance with how well they can communicate their reasoning to users, or with how little data they can yield accurate recommendations. If this is true, new metrics will be needed to evaluate these new algorithms.

This article presents six specific contributions towards evaluation of recommender systems.

(1) We introduce a set of recommender tasks that categorize the user goals for a particular recommender system.
(2) We discuss the selection of appropriate datasets for evaluation. We explore when evaluation can be completed off-line using existing datasets and when it requires on-line experimentation. We briefly discuss synthetic data sets and more extensively review the properties of datasets that should be considered in selecting them for evaluation.
(3) We survey evaluation metrics that have been used to evaluate recommender systems in the past, conceptually analyzing their strengths and weaknesses.
(4) We report on experimental results comparing the outcomes of a set of different accuracy evaluation metrics on one data set. We show that the metrics collapse roughly into three equivalence classes.
(5) By evaluating a wide set of metrics on a dataset, we show that for some datasets, while many different metrics are strongly correlated, there are classes of metrics that are uncorrelated.
(6) We review a wide range of nonaccuracy metrics, including measures of the degree to which recommendations cover the set of items, the novelty and serendipity of recommendations, and user satisfaction and behavior in the recommender system.

Throughout our discussion, we separate out our review of what has been done before in the literature from the introduction of new tasks and methods. We expect that the primary audience of this article will be collaborative filtering researchers who are looking to evaluate new algorithms against previous research and collaborative filtering practitioners who are evaluating algorithms before deploying them in recommender systems.

There are certain aspects of recommender systems that we have specifically left out of the scope of this paper. In particular, we have decided to avoid the large area of marketing-inspired evaluation. There is extensive work on evaluating marketing campaigns based on such measures as offer acceptance and sales lift [Rogers 2001]. While recommenders are widely used in this area, we cannot add much to existing coverage of this topic. We also do not address general
usability evaluation of the interfaces. That topic is well covered in the research and practitioner literature (e.g., Helander [1988] and Nielsen [1994]). We have also chosen not to discuss the computational performance of recommender algorithms. Such performance is certainly important, and in the future we expect there to be work on the quality of time-limited and memory-limited recommendations. This area is just emerging, however (see for example Miller et al.’s recent work on recommendation on handheld devices [Miller et al. 2003]), and there is not yet enough research to survey and synthesize. Finally, we do not address the emerging question of the robustness and transparency of recommender algorithms. We recognize that recommender system robustness to manipulation by attacks (and transparency that discloses manipulation by system operators) is important, but substantially more work needs to occur in this area before there will be accepted metrics for evaluating such robustness and transparency.

The remainder of the article is arranged as follows:

—Section 2. We identify the key user tasks from which evaluation methods have been determined and suggest new tasks that have not been evaluated extensively.
—Section 3. A discussion regarding the factors that can affect selection of a data set on which to perform evaluation.
—Section 4. An investigation of metrics that have been used in evaluating the accuracy of collaborative filtering predictions and recommendations. Accuracy has been by far the most commonly published evaluation method for collaborative filtering systems. This section also includes the results from an empirical study of the correlations between metrics.
—Section 5. A discussion of metrics that evaluate dimensions other than accuracy. In addition to covering the dimensions and methods that have been used in the literature, we introduce new dimensions on which we believe evaluation should be done.
—Section 6. Final conclusions, including a list of areas where we feel future work is particularly warranted.

Sections 2–5 are ordered to discuss the steps of evaluation in roughly the order that we would expect an evaluator to take. Thus, Section 2 describes the selection of appropriate user tasks, Section 3 discusses the selection of a dataset, and Sections 4 and 5 discuss the alternative metrics that may be applied to the dataset chosen. We begin with the discussion of user tasks—the user task sets the entire context for evaluation.

2. USER TASKS FOR RECOMMENDER SYSTEMS

To properly evaluate a recommender system, it is important to understand the goals and tasks for which it is being used. In this article, we focus on end-user goals and tasks (as opposed to goals of marketers and other system stakeholders). We derive these tasks from the research literature and from deployed systems. For each task, we discuss its implications for evaluation. While the tasks we’ve identified are important ones, based on our experience in recommender systems research and from our review of published research, we recognize that
the list is necessarily incomplete. As researchers and developers move into new recommendation domains, we expect they will find it useful to supplement this list and/or modify these tasks with domain-specific ones. Our goal is primarily to identify domain-independent task descriptions to help distinguish among different evaluation measures.

We have identified two user tasks that have been discussed at length within the collaborative filtering literature:

Annotation in Context. The original recommendation scenario was filtering through structured discussion postings to decide which ones were worth reading. Tapestry [Goldberg et al. 1992] and GroupLens [Resnick et al. 1994] both applied this to already structured message databases. This task required retaining the order and context of messages, and accordingly used predictions to annotate messages in context. In some cases the “worst” messages were filtered out. This same scenario, which uses a recommender in an existing context, has also been used by web recommenders that overlay prediction information on top of existing links [Wexelblat and Maes 1999]. Users use the displayed predictions to decide which messages to read (or which links to follow), and therefore the most important factor to evaluate is how successfully the predictions help users distinguish between desired and undesired content. A major factor is whether the recommender can generate predictions for the items that the user is viewing.

Find Good Items. Soon after Tapestry and GroupLens, several systems were developed with a more direct focus on actual recommendation. Ringo [Shardanand and Maes 1995] and the Bellcore Video Recommender [Hill et al. 1995] both provided interfaces that would suggest specific items to their users, providing users with a ranked list of the recommended items, along with predictions for how much the users would like them. This is the core recommendation task and it recurs in a wide variety of research and commercial systems. In many commercial systems, the “best bet” recommendations are shown, but the predicted rating values are not.

While these two tasks can be identified quite generally across many different domains, there are likely to be many specializations of the above tasks within each domain. We introduce some of the characteristics of domains that influence those specializations in Section 3.3.

While Annotation in Context and Find Good Items are overwhelmingly the most commonly evaluated tasks in the literature, there are other important generic tasks that are not well described in the research literature. Below we describe several other user tasks that we have encountered in our interviews with users and our discussions with recommender system designers. We mention these tasks because we believe that they should be evaluated, but because they have not been addressed in the recommender systems literature, we do not discuss them further.

Find All Good Items. Most recommender tasks focus on finding some good items. This is not surprising; the problem that led to recommender systems was one of overload, and most users seem willing to live with overlooking some
good items in order to screen out many bad ones. Our discussions with firms in the legal databases industry, however, led in the opposite direction. Lawyers searching for precedents feel it is very important not to overlook a single possible case. Indeed, they are willing to invest large amounts of time (and their client’s money) searching for that case. To use recommenders in their practice, they first need to be assured that the false negative rate can be made sufficiently low. As with Annotation in Context, coverage becomes particularly important in this task.

Recommend Sequence. We first noticed this task when using the personalized radio web site Launch (launch.yahoo.com), which streams music based on a variety of recommender algorithms. Launch has several interesting factors, including the desirability of recommending “already rated” items, though not too often. What intrigued us, though, is the challenge of moving from recommending one song at a time to recommending a sequence that is pleasing as a whole. This same task can apply to recommending research papers to learn about a field (read this introduction, then that survey, ...). While data mining research has explored product purchase timing and sequences, we are not aware of any recommender applications or research that directly address this task.

Just Browsing. Recommenders are usually evaluated based on how well they help the user make a consumption decision. In talking with users of our MovieLens system, of Amazon.com, and of several other sites, we discovered that many of them use the site even when they have no purchase imminent. They find it pleasant to browse. Whether one models this activity as learning or simply as entertainment, it seems that a substantial use of recommenders is simply using them without an ulterior motive. For those cases, the accuracy of algorithms may be less important than the interface, the ease of use, and the level and nature of information provided.

Find Credible Recommender. This is another task gleaned from discussions with users. It is not surprising that users do not automatically trust a recommender. Many of them “play around” for a while to see if the recommender matches their tastes well. We’ve heard many complaints from users who are looking up their favorite (or least favorite) movies on MovieLens—they don’t do this to learn about the movie, but to check up on us. Some users even go further. Especially on commercial sites, they try changing their profiles to see how the recommended items change. They explore the recommendations to try to find any hints of bias. A recommender optimized to produce “useful” recommendations (e.g., recommendations for items that the user does not already know about) may fail to appear trustworthy because it does not recommend movies the user is sure to enjoy but probably already knows about. We are not aware of any research on how to make a recommender appear credible, though there is more general research on making websites evoke trust [Bailey et al. 2001].

Most evaluations of recommender systems focus on the recommendations; however, if users don’t rate items, then collaborative filtering recommender systems can’t provide recommendations. Thus, evaluating if and why users would
contribute ratings may be important to communicate that a recommender system is likely to be successful. We will briefly introduce several different rating tasks.

Improve Profile. The rating task that most recommender systems have assumed. Users contribute ratings because they believe that they are improving their profile and thus improving the quality of the recommendations that they will receive.

Express Self. Some users may not care about the recommendations—what is important to them is that they be allowed to contribute their ratings. Many users simply want a forum for expressing their opinions. We conducted interviews with “power users” of MovieLens that had rated over 1000 movies (some over 2000 movies). What we learned was that these users were not rating to improve their recommendations. They were rating because it felt good. We particularly see this effect on sites like Amazon.com, where users can post reviews (ratings) of items sold by Amazon. For users with this task, issues may include the level of anonymity (which can be good or bad, depending on the user), the feeling of contribution, and the ease of making the contribution. While recommender algorithms themselves may not evoke self-expression, encouraging self-expression may provide more data, which can improve the quality of recommendations.

Help Others. Some users are happy to contribute ratings in recommender systems because they believe that the community benefits from their contribution. In many cases, they are also entering ratings in order to express themselves (see previous task). However, the two do not always go together.

Influence Others. An unfortunate fact that we and other implementers of web-based recommender systems have encountered is that there are users of recommender systems whose goal is to explicitly influence others into viewing or purchasing particular items. For example, advocates of particular movie genres (or movie studios) will frequently rate movies high on the MovieLens web site right before the movie is released to try and push others to go and see the movie. This task is particularly interesting, because we may want to evaluate how well the system prevents this task.

While we have briefly mentioned tasks involved in contributing ratings, we will not discuss them in depth in this paper, and rather focus on the tasks related to recommendation.

We must once again say that the list of tasks in this section is not comprehensive. Rather, we have used our experience in the field to filter out the task categories that (a) have been most significant in the previously published work, and (b) that we feel are significant, but have not been considered sufficiently.

In the field of Human-Computer Interaction, it has been strongly argued that the evaluation process should begin with an understanding of the user tasks that the system will serve. When we evaluate recommender systems from the perspective of benefit to the user, we should also start by identifying the most important task for which the recommender will be used. In this section, we have provided descriptions of the most significant tasks that have been
identified. Evaluators should consider carefully which of the tasks described may be appropriate for their environment.

Once the proper tasks have been identified, the evaluator must select a dataset to which evaluation methods can be applied, a process that will most likely be constrained by the user tasks identified.

3. SELECTING DATA SETS FOR EVALUATION

Several key decisions regarding data sets underlie successful evaluation of a recommender system algorithm. Can the evaluation be carried out offline on an existing data set, or does it require live user tests? If a data set is not currently available, can evaluation be performed on simulated data? What properties should the dataset have in order to best model the tasks for which the recommender is being evaluated? A few examples help clarify these decisions:

—When designing a recommender algorithm to recommend word processing commands (e.g., Linton et al. [1998]), one can expect users to have experienced 5–10% (or more) of the candidates. Accordingly, it would be unwise to select recommender algorithms based on evaluation results from movie or e-commerce datasets where ratings sparsity is much worse.
—When evaluating a recommender algorithm in the context of the Find Good Items task where novel items are desired, it may be inappropriate to use only offline evaluation. Since the recommender algorithm is generating recommendations for items that the user does not already know about, it is probable that the data set will not provide enough information to evaluate the quality of the items being recommended. If an item was truly unknown to the user, then it is probable that there is no rating for that user in the database. If we perform a live user evaluation, ratings can be gained on the spot for each item recommended.
—When evaluating a recommender in a new domain where there is significant research on the structure of user preferences, but no data sets, it may be appropriate to first evaluate algorithms against synthetic data sets to identify the promising ones for further study.

We will examine in the following subsections each of the decisions that we posed in the first paragraph of this section, and then discuss the past and current trends in research with respect to collaborative filtering data sets.

3.1 Live User Experiments vs. Offline Analyses

Evaluations can be completed using offline analysis, a variety of live user experimental methods, or a combination of the two. Much of the work in algorithm evaluation has focused on off-line analysis of predictive accuracy. In such an evaluation, the algorithm is used to predict certain withheld values from a dataset, and the results are analyzed using one or more of the metrics discussed in the following section. Such evaluations have the advantage that it is quick and economical to conduct large evaluations, often on several different datasets or algorithms at once. Once a dataset is available, conducting such an experiment simply requires running the algorithm on the appropriate subset of that data. When the dataset includes timestamps, it is even possible to “replay” a series of ratings and recommendations offline. Each time a rating was made, the researcher first computes the prediction for that item based on all prior data; then, after evaluating the accuracy of that prediction, the actual rating is entered so the next item can be evaluated.
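As a rough illustration of this replay protocol, the sketch below steps through timestamp-ordered rating events, predicting each rating from only the ratings seen so far and then revealing the actual value. The event layout, the predict callback, and the baseline predictor are hypothetical placeholders of our own, not part of any dataset or system described in this article.

```python
# Sketch of timestamp-ordered "replay" evaluation over an offline dataset.
# Each event is a (timestamp, user, item, rating) tuple; `predict` stands in
# for the recommender algorithm under test (a hypothetical callback, not a
# real library function).

def replay_evaluation(events, predict):
    """Replay rating events in time order, scoring each prediction before revealing the rating."""
    history = []   # ratings revealed so far, available to the algorithm
    errors = []
    for timestamp, user, item, rating in sorted(events):
        prediction = predict(history, user, item)   # may return None if it cannot predict
        if prediction is not None:
            errors.append(abs(prediction - rating))
        history.append((user, item, rating))        # only now is the actual rating revealed
    coverage = len(errors) / len(events) if events else 0.0
    mae = sum(errors) / len(errors) if errors else float("nan")
    return mae, coverage

# Example baseline: predict a user's running mean rating, declining until data exists.
def user_mean_predictor(history, user, item):
    past = [r for (u, _, r) in history if u == user]
    return sum(past) / len(past) if past else None
```

Because each prediction uses only data that preceded the rating, this protocol approximates what the system could actually have shown the user at that moment in time.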
Offline analyses have two important weaknesses. First, the natural sparsity of ratings data sets limits the set of items that can be evaluated. We cannot evaluate the appropriateness of a recommended item for a user if we do not have a rating from that user for that item in the dataset. Second, they are limited to objective evaluation of prediction results. No offline analysis can determine whether users will prefer a particular system, either because of its predictions or because of other less objective criteria such as the aesthetics of the user interface.

An alternative approach is to conduct a live user experiment. Such experiments may be controlled (e.g., with random assignment of subjects to different conditions), or they may be field studies where a particular system is made available to a community of users that is then observed to ascertain the effects of the system. As we discuss later in Section 5.5, live user experiments can evaluate user performance, satisfaction, participation, and other measures.

3.2 Synthesized vs. Natural Data Sets

Another choice that researchers face is whether to use an existing dataset that may imperfectly match the properties of the target domain and task, or to instead synthesize a dataset specifically to match those properties. In our own early work designing recommender algorithms for Usenet News [Konstan et al. 1997; Miller et al. 1997], we experimented with a variety of synthesized datasets. We modeled news articles as having a fixed number of “properties” and users as having preferences for those properties. Our data set generator could cluster users together, spread them evenly, or present other distributions. While these simulated data sets gave us an easy way to test algorithms for obvious flaws, they in no way accurately modeled the nature of real users and real data. In their research on horting as an approach for collaborative filtering, Aggarwal et al. [1999] used a similar technique, noting however that such synthetic data is “unfair to other algorithms” because it fits their approach too well, and that this is a placeholder until they can deploy their trial.

Synthesized data sets may be required in some limited cases, but only as early steps while gathering data sets or constructing complete systems. Drawing comparative conclusions from synthetic datasets is risky, because the data may fit one of the algorithms better than the others.

On the other hand, there is new opportunity now to explore more advanced techniques for modeling user interest and generating synthetic data from those models, now that there exists data on which to evaluate the synthetically generated data and tune the models. Such research could also lead to the development of more accurate recommender algorithms with clearly defined theoretical properties.
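To give a flavor of the property-based generators described above, the sketch below draws users as preference weights over a fixed number of hypothetical “properties” and articles as property sets, with an optional cluster center so that users can be grouped or spread evenly. The specific rating rule, parameter values, and sparsity level are simplifying assumptions of ours, not the generator actually used in the Usenet experiments.

```python
import random

# Property-based synthetic ratings generator (simplified assumptions of ours,
# not the generator used in the Usenet News experiments cited above).
NUM_PROPERTIES = 10

def make_user(cluster_center=None, spread=0.3):
    """A user is one preference weight per property; clustered users share a center."""
    if cluster_center is None:                       # spread users evenly
        return [random.random() for _ in range(NUM_PROPERTIES)]
    return [min(1.0, max(0.0, c + random.gauss(0.0, spread))) for c in cluster_center]

def make_article():
    """An article either has or lacks each property."""
    return [random.randint(0, 1) for _ in range(NUM_PROPERTIES)]

def rate(user, article):
    """Map the user's affinity for the article's properties onto a 1-5 rating."""
    affinity = sum(w for w, present in zip(user, article) if present) / NUM_PROPERTIES
    return 1 + round(4 * affinity)

# Example: one cluster of 100 similar users sparsely rating 500 articles.
center = [random.random() for _ in range(NUM_PROPERTIES)]
users = [make_user(center) for _ in range(100)]
articles = [make_article() for _ in range(500)]
ratings = [(u_id, a_id, rate(u, a))
           for u_id, u in enumerate(users)
           for a_id, a in enumerate(articles)
           if random.random() < 0.05]               # keep the ratings matrix sparse
```

Even this toy generator makes the weakness noted above visible: the data follow exactly the linear property-preference model, so any algorithm sharing that assumption will look artificially strong.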
3.3 Properties of Data Sets

The final question we address in this section on data sets is “what properties should the dataset have in order to best model the tasks for which the recommender is being evaluated?” We find it useful to divide data set properties into three categories: Domain features reflect the nature of the content being recommended, rather than any particular system. Inherent features reflect the nature of the specific recommender system from which data was drawn (and possibly from its data collection practices). Sample features reflect distribution properties of the data, and often can be manipulated by selecting the appropriate subset of a larger data set. We discuss each of these three categories here, identifying specific features within each category.

Domain features of interest include
(a) the content topic being recommended/rated and the associated context in which rating/recommendation takes place;
(b) the user tasks supported by the recommender;
(c) the novelty need and the quality need;
(d) the cost/benefit ratio of false/true positives/negatives;
(e) the granularity of true user preferences.

Most commonly, recommender systems have been built for entertainment content domains (movies, music, etc.), though some testbeds exist for filtering document collections (Usenet news, for example). Within a particular topic, there may be many contexts. Movie recommenders may operate on the web, or may operate entirely within a video rental store or as part of a set-top box or digital video recorder.

In our experience, one of the most important generic domain features to consider lies in the tradeoff between desire for novelty and desire for high quality. In certain domains, the user goal is dominated by finding recommendations for things she doesn’t already know about. McNee et al. [2002] evaluated recommenders for research papers and found that users were generally happy with a set of recommendations if there was a single item in the set that appeared to be useful and that the user wasn’t already familiar with. In some ways, this matches the conventional wisdom about supermarket recommenders: it would be almost always correct, but useless, to recommend bananas, bread, milk, and eggs. The recommendations might be correct, but they don’t change the shopper’s behavior. Opposite the desire for novelty is the desire for high quality. Intuitively, this end of the tradeoff reflects the user’s desire to rely heavily upon the recommendation for a consumption decision, rather than simply as one decision-support factor among many. At the extreme, the availability of high-confidence recommendations could enable automatic purchase decisions such as personalized book- or music-of-the-month clubs. Evaluations of recommenders for this task must evaluate the success of high-confidence recommendations, and perhaps consider the opportunity costs of excessively low confidence.

Another important domain feature is the cost/benefit ratio faced by users in the domain from which items are being recommended. In the video recommender domain, the cost of false positives is low ($3 and two to three hours of