Knowledge-Based Systems 23 (2010) 520–528

A new collaborative filtering metric that improves the behavior of recommender systems

J. Bobadilla *, F. Serradilla, J. Bernal
Universidad Politécnica de Madrid, Computer Science, Crta. de Valencia, Km 7, 28031 Madrid, Spain

* Corresponding author. Tel.: +34 913365133; fax: +34 913367522. E-mail address: jesus.bobadilla@upm.es (J. Bobadilla).
Abbreviations: RS, recommender systems; CF, collaborative filtering.
doi:10.1016/j.knosys.2010.03.009

Article history: Received 21 July 2009; received in revised form 18 March 2010; accepted 19 March 2010; available online 23 March 2010.

Keywords: Collaborative filtering; Recommender systems; Metric; Jaccard; Mean squared differences; Similarity

Abstract

Recommender systems are typically provided as Web 2.0 services and are part of the range of applications that give support to large-scale social networks, enabling on-line recommendations to be made based on the use of networked databases. The operating core of recommender systems is based on the collaborative filtering stage, which, in current user-to-user recommender processes, usually uses the Pearson correlation metric. In this paper, we present a new metric which combines the numerical information of the votes with information that is independent of those values, based on the proportions of the common and uncommon votes between each pair of users. Likewise, we define the reasoning and experiments on which the design of the metric is based and the restriction of applying it to recommender systems where the range of possible votes is not greater than 5. In order to demonstrate the superiority of the proposed metric, we provide the comparative results of a set of experiments based on the MovieLens, FilmAffinity and NetFlix databases. In addition to the traditional levels of accuracy, results are also provided on the metrics' coverage, the percentage of hits obtained and the precision/recall.

© 2010 Elsevier B.V. All rights reserved.

1. Introduction

Recommender systems (RS) cover an important field within collaborative services developed in the Web 2.0 environment [21,19,26] and enable user-generated opinions to be exploited in a sophisticated and powerful way. Recommender systems can be considered social networking tools that provide dynamic and collaborative communication, interaction and knowledge.

Recommender systems cover a wide variety of applications [20,4,10,28,5,14,27], although those related to movie recommendations are the most well known and most widely used in the research field [23,3,25]. The recommender systems stage that normally has the greatest influence on the quality of the results obtained is the collaborative filtering (CF) phase [1,17,6,34,33]. Collaborative filtering is based on making predictions about a user's preferences or tastes from the preferences of a group of users considered similar to this user. A substantial part of the research in the area of collaborative filtering centers on how to determine which users are similar to the given one; to tackle this task, there are fundamentally three approaches: memory-based methods, model-based methods and hybrid approaches.

Memory-based methods [22,37,35,40] use similarity metrics and act directly on the ratings matrix that contains the ratings of all users who have expressed their preferences on the collaborative service; these metrics mathematically express a distance between two users based on their ratings. Model-based methods [1] use the ratings matrix to create a model from which the sets of similar users will be established. Among the most widely used models we have: Bayesian classifiers [8], neural networks [18] and fuzzy systems [39].
Generally, commercial recommender systems use memory-based methods, whilst model-based methods are usually associated with research recommender systems.

Regardless of the method used in the collaborative filtering stage, the technical objective generally pursued is to minimize the prediction errors, making the accuracy [12,11,29] of the recommender systems as high as possible; nevertheless, there are other objectives that need to be taken into account [38]: avoiding overspecialization phenomena, finding good items, credibility of recommendations, precision and recall measures, etc.

To date, various publications have been written which tackle the way recommender systems are evaluated; among the most significant we have [17], which reviews the key decisions in evaluating collaborative filtering recommender systems: the user tasks, the type of analysis and datasets being used, the ways in which prediction quality is measured and the user-based evaluation of the system as a whole. Hernández and Gaudioso [9] is a current study which proposes a recommendation filtering process based on the distinction between interactive and non-interactive subsystems. General
publications and reviews also exist which include the most commonly accepted metrics, aggregation approaches and evaluation measures: mean absolute error, coverage, precision, recall and derivatives of these: mean squared error, normalized mean absolute error, ROC and fallout. Goldberg et al. [13] focus on the aspects not related to the evaluation; Breese et al. [6] compare the predictive accuracy of various methods in a set of representative problem domains. Candillier et al. [7] and Schafer et al. [36] review the main collaborative filtering methods proposed in the literature.

The rest of the paper is structured as follows:

- In Section 2 we provide the basis for the principles on which the design of the new metric will be based: we present graphs which show the way in which the users vote, we carry out experiments which support the decisions made, we establish the best way of selecting numerical and non-numerical information from the votes and, finally, we establish the hypothesis on which the paper and its proposed metric are based.
- In Section 3 we establish the mathematical formulation of the metric.
- In Sections 4 and 5, respectively, we list the experiments that will be carried out and we present and discuss the results obtained.
- Section 6 presents the most relevant conclusions of the publication.

2. Approach and design of the new similarity metric

2.1. Introduction

Collaborative filtering methods work on a table of U users who can rate I items. The prediction of a non-rated item i for a user u is computed as an aggregate of the ratings of the k most similar users (k-neighborhoods) for the same item i, where $K_u$ denotes the set of k-neighborhoods of u and $r_{n,i}$ denotes the value of the rating of user n on item i (• if there is no rating value).

Once the set of K users (neighborhoods) similar to the active user u has been calculated, in order to obtain the prediction of item i for user u, one of the following aggregation approaches is often used: the average (2), the weighted sum (3) and the adjusted weighted aggregation (deviation-from-mean) (4). We will use the auxiliary set $G_{u,i}$ in order to define Eqs. (2)-(5):
$$G_{u,i} = \{\, n \in K_u \mid \exists\, r_{n,i} \neq \bullet \,\} \qquad (1)$$

$$p_{u,i} = \frac{1}{\#G_{u,i}} \sum_{n \in G_{u,i}} r_{n,i} \iff G_{u,i} \neq \emptyset \qquad (2)$$

$$p_{u,i} = \lambda_{u,i} \sum_{n \in G_{u,i}} \mathrm{sim}(u,n)\, r_{n,i} \iff G_{u,i} \neq \emptyset \qquad (3)$$

$$p_{u,i} = \bar{r}_u + \lambda_{u,i} \sum_{n \in G_{u,i}} \mathrm{sim}(u,n)\,(r_{n,i} - \bar{r}_n) \iff G_{u,i} \neq \emptyset \qquad (4)$$

where $\lambda$ serves as a normalizing factor, usually computed as:

$$\lambda_{u,i} = 1 \Big/ \sum_{n \in G_{u,i}} \mathrm{sim}(u,n) \iff G_{u,i} \neq \emptyset \qquad (5)$$

The most popular similarity metrics are Pearson correlation (6), cosine (7), constrained Pearson's correlation (8) and Spearman rank correlation (9):

$$\mathrm{sim}(x,y) = \frac{\sum_i (r_{x,i} - \bar{r}_x)(r_{y,i} - \bar{r}_y)}{\sqrt{\sum_i (r_{x,i} - \bar{r}_x)^2 \sum_i (r_{y,i} - \bar{r}_y)^2}} \qquad (6)$$

$$\mathrm{sim}(x,y) = \frac{\sum_i r_{x,i}\, r_{y,i}}{\sqrt{\sum_i r_{x,i}^2} \sqrt{\sum_i r_{y,i}^2}} \qquad (7)$$

$$\mathrm{sim}(x,y) = \frac{\sum_i (r_{x,i} - r_{med})(r_{y,i} - r_{med})}{\sqrt{\sum_i (r_{x,i} - r_{med})^2} \sqrt{\sum_i (r_{y,i} - r_{med})^2}}, \quad r_{med}:\ \text{median value of the rating scale} \qquad (8)$$

$$\mathrm{sim}(x,y) = \frac{\sum_i (rank_{x,i} - \overline{rank}_x)(rank_{y,i} - \overline{rank}_y)}{\sqrt{\sum_i (rank_{x,i} - \overline{rank}_x)^2 \sum_i (rank_{y,i} - \overline{rank}_y)^2}} \qquad (9)$$

Although Pearson correlation is the most commonly used metric in the process of memory-based CF (user to user), this choice is not always backed by the nature and distribution of the data in the RS. Formally, in order to be able to apply this metric with guarantees, the following assumptions must be met:

- Linear relationship between x and y.
- Continuous random variables.
- Both variables must be normally distributed.

These conditions are not normally met in real RS, and Pearson correlation presents some significant cases of erroneous operation that should not be ignored in RS. Despite the deficiencies of Pearson correlation, this similarity measure presents the best prediction and recommendation results in CF-based RS [15,16,31,7,35]; furthermore, it is the most commonly used, and therefore, any alternative metric proposed must improve on its results.

On accepting that Pearson correlation is the metric whose results must be improved, but not necessarily the most appropriate to be taken as a base, it is advisable to focus on the information that is obtained in the different research processes and which can sometimes be overlooked when pursuing objectives other than improving the accuracy of RS (cold-start problem, trust and novelty, sparsity, etc.).
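To make the prediction pipeline of Eqs. (1)-(6) concrete, the following is a minimal Python sketch, assuming ratings stored as nested dictionaries ({user: {item: vote}}); it is our own illustration under those assumptions, not the authors' implementation, and all names are ours.

```python
from math import sqrt

def pearson(rx, ry):
    """Pearson correlation (Eq. 6) over the items rated by both users."""
    common = set(rx) & set(ry)
    if len(common) < 2:
        return 0.0  # not enough overlap to correlate
    mx = sum(rx[i] for i in common) / len(common)
    my = sum(ry[i] for i in common) / len(common)
    num = sum((rx[i] - mx) * (ry[i] - my) for i in common)
    den = sqrt(sum((rx[i] - mx) ** 2 for i in common)
               * sum((ry[i] - my) ** 2 for i in common))
    return num / den if den else 0.0

def predict(ratings, u, i, neighbors):
    """Deviation-from-mean aggregation (Eqs. 1, 4 and 5).

    neighbors: the k users most similar to u; G_{u,i} is the subset of
    them that actually rated item i. Returns None when no prediction
    can be made (which is what the coverage measure counts).
    """
    g = [n for n in neighbors if i in ratings[n]]            # Eq. (1)
    if not g:
        return None
    mean = lambda r: sum(r.values()) / len(r)
    sims = {n: pearson(ratings[u], ratings[n]) for n in g}
    norm = sum(sims.values())                                # Eq. (5)
    if norm == 0:
        return mean(ratings[u])
    dev = sum(sims[n] * (ratings[n][i] - mean(ratings[n])) for n in g)
    return mean(ratings[u]) + dev / norm                     # Eq. (4)
```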
The simplest information to give us an idea of the nature of the RS is to find out how users usually vote: do they always tend to vote for the same values? Do they always tend to vote for the same items? Is there much difference between the votes of some users and others? Fig. 1 shows the distribution of the votes cast in the MovieLens 1M and NetFlix RS (where you can vote in the interval [1..5]). We can see how, on average, the users focus their votes on the higher levels of the interval, but avoiding the extremes, particularly the lower extremes. The distribution of the votes is not balanced, and particularly negative or particularly positive votes are avoided.

Fig. 1. Distribution of votes in RS: (A) MovieLens 1M and (B) NetFlix.

Fig. 2 shows the arithmetic average and the standard deviation of the votes cast in the MovieLens 1M and NetFlix databases. Graphs (A) and (B) of Fig. 2 show the number of items that display the arithmetic average specified on the x axis; we can see that there are hardly any items rated, on average, below 2 or above 4, and most of the cases lie between the values 3 and 4. Graphs (C) and (D) of Fig. 2 show the number of items that display the standard deviation specified on the x axis; we can see that most of the items have been voted by the users, on average, with a maximum difference of 1.2 votes.

Fig. 2. Arithmetic average and standard deviation of the MovieLens 1M and NetFlix ratings of the items: (A) MovieLens arithmetic average, (B) NetFlix arithmetic average, (C) MovieLens standard deviation, (D) NetFlix standard deviation.

According to the figures analyzed, we find that traditional metrics must often achieve results by operating on a set of discrete ratings with very little variation (the majority of votes between 3 and 4) and with the obligation of improving simpler and quicker estimations, such as always predicting with the value of the arithmetic average of the votes of each item (for which we know there is seldom a standard deviation higher than 1.2).

2.2. Basic experimentation

After ascertaining the low diversity of the votes cast by the users, it seems reasonable to consider that the votes mainly tend
to represent a positive or non-positive rating of the items, and to a lesser extent a range within these ratings; for instance, in a RS with possible votes situated in the interval [1..5], a 4 will generally represent a positive rating, which in some cases will be reinforced with the rating 5. Similarly, a 3 will represent a non-positive rating, which in some cases will be reinforced with the rating 2 or 1.

In order to test this hypothesis, we have designed an experiment on the MovieLens 1M database: we transformed all 4 and 5 votes into P votes (positive) and all 1, 2 and 3 votes into N votes (non-positive), in such a way that we aim to measure the impact made on the recommendations by doing without the detailed information provided by the numerical values of the votes.

In the experiment we compare the precision/recall obtained in a regular way (using the numerical values of the votes) with that obtained using only the discretized values P and N; for this purpose, we establish the relevance threshold at value 4 (θ = 4), assimilating "relevant" with "positive"; we use Pearson correlation, the deviation-from-mean aggregation approach, 20% of test users, 20% of test items, a number of recommendations from 2 to 20, and K = 150. The experiment has been repeated for values between K = 100 and K = 200, obtaining equivalent results.

Fig. 3 displays the results, which show that the "positive/non-positive" discretization not only does not worsen the precision/recall measurements, but rather improves them both, particularly the precision when the number of recommendations (N) is high. The numerical key to this improvement lies in the improved capacity of the discrete calculation to determine whether an item is recommended (based on the number of k-neighborhoods with value P for that item), as opposed to the calculation with numerical values (a prediction obtained by applying the selected aggregation approach to the numerical values of the votes and its subsequent thresholding).

Fig. 3. Precision/recall obtained by transforming all 4 and 5 votes into P votes (positive) and all 1, 2 and 3 votes into N votes (non-positive), compared to the results obtained using the numerical values. 20% of test users, 20% of test items, K = 150, Pearson correlation, θ = 4.
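The discretization used in this experiment is straightforward to express in code. The sketch below is our own illustration: the mapping of votes to P/N follows the text, and the paper decides on recommendations from the number of P votes among the k-neighborhoods, but the specific ratio threshold shown here is an assumption of ours.

```python
def discretize(vote, theta=4):
    """Votes 4 and 5 become P (positive); 1, 2 and 3 become N."""
    return 'P' if vote >= theta else 'N'

def recommend_discrete(neighbor_votes, min_positive_ratio=0.5):
    """Decide on an item from the votes its k-neighborhoods gave it.

    Instead of thresholding an aggregated numeric prediction, the item
    is recommended when enough neighbors marked it P; the 0.5 ratio is
    illustrative, not taken from the paper.
    """
    marks = [discretize(v) for v in neighbor_votes]
    return marks.count('P') / len(marks) >= min_positive_ratio

# 7 of these 10 neighbor votes are positive, so the item is recommended
print(recommend_discrete([5, 4, 4, 3, 5, 4, 2, 4, 5, 3]))  # True
```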
2.3. Non-numerical information: Jaccard

As the numerical values of the votes appear to lose relevance in the recommendation process (not in the prediction process), we are obliged to search for similarity information between users by looking beyond the specific values of their votes. In this sense, it is reasonable to focus our attention on two related aspects:

(1) Not to grant too much credibility to the similarity of two users based only on the similitude of a very limited set of common items: traditional metrics provide their measure of similarity without taking into account whether it has been obtained from few or many items voted by both users; this way, it is not improbable to find the highest similarity measurements associated with pairs of users with very few commonly voted items.

(2) Users who have voted for a large number of items should be compared, in so far as is possible, with other users with whom they have a large number of items voted in common. This way, for example, between users with around one thousand votes cast, a similarity of 0.78 calculated on some 150 common items is more convincing than a similarity of 0.82 calculated on 60 common items. In short, the proportion between the common votes and the total votes should be taken very much into consideration.

The most direct way to quantify the above-mentioned aspects is by using the Jaccard metric [24], which calculates the proportion between the number of items that two users have voted in common and the number of different items that both users have voted for in total, i.e. the intersection divided by the union of the items voted. In order to design our new metric correctly, we must discover the impact that Jaccard can have as a factor of similarity between users. As Jaccard does not operate with the numerical values of the votes, it seems improbable that, on its own, it will be able to have a positive impact on the MAE of the RS; however, it is more apparent that the coverage can improve by selecting users with more common votes and, therefore, in general, with more votes cast.

On the other hand, it is reasonable to suspect that users who have voted for a sufficient proportion of items in common display common tastes: we know that negative ratings are not very widely used (Figs. 1 and 2), and therefore we can theorize that part of the common absences of votes between two users can mean common non-positive ratings that they preferred not to cast; likewise, we know that most of the votes cast have a positive rating (Figs. 1 and 2), and therefore many votes in common could denote a large number of positive ratings in common.

In order to try to test this theory, we have designed the following experiment (a sketch of its core measure follows below): for every possible pair of users of MovieLens 1M we have calculated the Jaccard value, the MAE and the coverage obtained by establishing the first user of the pair as the active user and the second one as their only neighbor. On the x axis we represent the possible Jaccard values, represented in the interval [0..1], where 0 indicates that there are no items in common and 1 indicates that all the items voted are in common.
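A minimal sketch of the Jaccard measure on the same sparse-dictionary representation used earlier; the figures in the closing comment rework the 150-versus-60 common-items example from the list above.

```python
def jaccard(rx, ry):
    """Items voted in common divided by items voted in total (union).

    rx, ry: {item: vote} dictionaries; only *which* items were voted
    matters here, never the numerical values of the votes.
    """
    voted_x, voted_y = set(rx), set(ry)
    union = voted_x | voted_y
    if not union:
        return 0.0
    return len(voted_x & voted_y) / len(union)

# For two users with ~1000 votes each:
#   150 common items -> 150 / (1000 + 1000 - 150) = 0.081
#    60 common items ->  60 / (1000 + 1000 - 60)  = 0.031
# so the better-overlapping pair is weighted far more heavily.
```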
In Fig. 4, graph 4A shows the number of pairs of users (y axis) which display the Jaccard values indicated on the x axis. As is to be expected, most of the cases present a very low overlap between the items voted by each pair of users. Graph 4B shows a direct relationship between the increase in the value of Jaccard and the accuracy obtained in the interval [0..0.4], in which the great majority of the cases are grouped together (graph 4A). Graph 4C shows a direct relationship between the Jaccard value and an improvement in the coverage.

Fig. 4. Measurements related to the Jaccard metric on MovieLens: (A) number of pairs of users that display the Jaccard values represented on the x axis, (B) averaged MAE obtained for the pairs of users with the Jaccard values represented on the x axis, (C) averaged coverage obtained for the pairs of users with the Jaccard values represented on the x axis.

The reasoning given and the results obtained justify the incorporation of the Jaccard metric as an integral part of a new metric for the similarity between users, which aims to improve the results provided by traditional metrics.

2.4. Numerical information: mean squared differences

By unifying the concepts and results set out in Section 2, we find the following:

(1) The similarity between two users (the core of the CF RS) is being based on numerical metrics of statistical origin (such as Pearson correlation) which should be applied to continuous variables, but which in fact are applied to discrete variables which, for the purposes of recommendation, only contain two values of use (positive/non-positive).

(2) We are not making use of non-numerical information from the votes which could be valuable in order to complement the numerical information and provide a metric which satisfactorily groups these two sources of information.

The most commonly used metrics (constrained Pearson correlation, Spearman rank correlation, cosine, Pearson correlation, etc.) display, to a greater or lesser extent, the deficiencies set out in reference to Pearson correlation; however, mean squared differences (MSD), based on the geometrical principles of the Euclidean distance, provides different characteristics which could be suitably complemented with Jaccard. This metric has been tested in [35]; this study highlights the good accuracy results obtained using MSD, but at the cost of coverage values that make it unviable in general RS applications. The following paragraph from the conclusions of [35] sums up its possibilities: "the MSD metric offers very interesting results due to its different behavior compared to the other two studied (cosine and Pearson correlation) and its good results in all aspects except in the coverage, which is undoubtedly its weak point".

2.5. Hypothesis

The hypothesis on which this paper is based is that a suitable combination of Jaccard and MSD could complement Jaccard with the numerical values of the votes, and could mitigate the deficiencies in coverage entailed in the use of MSD, in such a way that their joint use would enable an improvement on the results of traditional metrics in general and of Pearson correlation in particular, which will be used as the metric of reference whose results must be improved.

Although, a priori, the choice of the similarity measure MSD seems the most suitable, it is advisable to test Jaccard combined not only with MSD, but also with the most common metrics: Pearson correlation (PC) (6), cosine (COS) (7), constrained Pearson's correlation (CPC) (8) and Spearman rank correlation (SRC) (9). In Fig. 5, graph 5A shows the evolution of the MAE using the metrics PC, Jaccard*PC, Jaccard*COS, Jaccard*CPC, Jaccard*SRC and Jaccard*(1 − MSD) applied to the MovieLens 1M database. As we expected, the best results are obtained by MSD. Graph 5B shows the coverage obtained by applying these same metrics; in this case, COS and MSD give the best results up to K = 300 and SRC gives the best results from this value of K. In short, the similarity measure MSD is confirmed as the best option in order to test the hypothesis of this paper.

Fig. 5. MAE and coverage obtained with Pearson correlation and by combining Jaccard with Pearson correlation, cosine, constrained Pearson's correlation, Spearman rank correlation and mean squared differences: (A) MAE, (B) coverage. MovieLens 1M, 20% of test users, 20% of test items, k ∈ [2..1500] step 25.
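The candidate combinations compared in Fig. 5 all share the same shape: the base similarity is scaled by Jaccard, with MSD inverted (1 − MSD) so that larger values always mean greater similarity. A self-contained sketch of that composition follows; msd assumes votes already normalized to [0..1], and base_sim stands for any of the metrics defined earlier (PC, COS, CPC, SRC).

```python
def jaccard(rx, ry):
    """Proportion of commonly voted items (as in Section 2.3)."""
    union = set(rx) | set(ry)
    return len(set(rx) & set(ry)) / len(union) if union else 0.0

def msd(rx, ry):
    """Mean squared differences over the common items (None if no overlap)."""
    common = set(rx) & set(ry)
    if not common:
        return None
    return sum((rx[i] - ry[i]) ** 2 for i in common) / len(common)

def combined(rx, ry, base_sim=None):
    """Jaccard*base candidates of Fig. 5; base_sim may be a Pearson or
    cosine function, etc. With base_sim=None we get Jaccard*(1 - MSD)."""
    j = jaccard(rx, ry)
    if base_sim is None:
        d = msd(rx, ry)
        return None if d is None else j * (1 - d)
    return j * base_sim(rx, ry)
```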
3. Formalization of the new metric

The metric presented in this paper takes as the reference to be improved the most widely used metric in user-to-user memory-based CF: Pearson correlation; however, the operating principles that rule this metric will not be taken as a base; rather, we will use the mean squared differences (MSD) metric, which is much less
commonly used due to its low capacity to produce new recommendations.

MSD offers both a great advantage and a great disadvantage at the same time. The advantage is that it generates very good general results: low average error, a high percentage of correct predictions and a low percentage of incorrect predictions. The disadvantage is that it has an intrinsic tendency to select, as the users most similar to a given user, those who have rated a very small number of items [35]; e.g. if we have 7 items that can be rated from 1 to 5 and three users u1, u2, u3 with the following ratings: u1: (•, •, 4, 5, •, •, •), u2: (3, 4, 5, 5, 1, 4, •), u3: (3, 5, 4, 5, •, 3, •) (• means a not-rated item), the MSD metric will indicate that (u1, u3) have total similarity (0), (u1, u2) have a similarity of 0.5 and (u2, u3) have a lower similarity (0.6). This situation is not convincing, as intuitively we realize that u2 and u3 are very similar, whilst u1 is only similar to u2 and u3 in 2 ratings, and, therefore, it is not logical to choose it as the most similar to them; what is worse, if it is chosen, it will not provide us with possibilities to recommend new items. A numerical check of this example is sketched below.

The strategy to follow in designing the new metric is to considerably raise MSD's capacity to generate predictions, without losing along the way its good behavior as regards accuracy and quality of the results. The metric designed is based on two factors:

- The similarity between two users calculated as the mean of the squared differences (MSD): the smaller these differences, the greater the similarity between the two users. This part of the metric enables very good accuracy results to be obtained.
- The number of items in which both one user and the other have made a rating, relative to the total number of items which have been rated between the two users. E.g. given users u1: (3, 2, 4, •, •, •) and u2: (•, 4, 4, 3, •, 1), a common rating has been made in two items as regards a joint rating of five items. This factor enables us to greatly improve the metric's capacity to make predictions.

An important design aspect is the decision not to use any parameter whose value would have to be set arbitrarily, i.e. the result provided by the metric should be obtained by taking only the values of the ratings provided by the users of the RS.
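The 7-item example given earlier can be checked numerically; the sketch below is our own illustration, with None standing in for the • symbol, and it reproduces the three MSD values quoted in the text.

```python
def msd_raw(rx, ry):
    """Mean squared differences over the items rated by both users."""
    pairs = [(a, b) for a, b in zip(rx, ry)
             if a is not None and b is not None]
    if not pairs:
        return None
    return sum((a - b) ** 2 for a, b in pairs) / len(pairs)

u1 = (None, None, 4, 5, None, None, None)
u2 = (3, 4, 5, 5, 1, 4, None)
u3 = (3, 5, 4, 5, None, 3, None)

print(msd_raw(u1, u3))  # 0.0 -> "total similarity", from only 2 common items
print(msd_raw(u1, u2))  # 0.5
print(msd_raw(u2, u3))  # 0.6 -> intuitively the most similar pair scores worst
```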
By working on the two factors with standardized values in [0..1], the metric obtained is as follows. Given the lists of ratings of two generic users x and y,

$$r_x, r_y : \left(r_x^1, r_x^2, r_x^3, \ldots, r_x^I\right), \left(r_y^1, r_y^2, r_y^3, \ldots, r_y^I\right),$$

where I is the number of items of our RS and one of the possible values of each vote is "not voted", which we represent with the symbol •. All the lists have the same number of elements, I, and $r_a^b \in \{1..5\} \cup \{\bullet\}$.

Example:

$$r_x : (4, 5, \bullet, 3, 2, \bullet, 1, 1), \quad r_y : (4, 3, 1, 2, \bullet, 3, 4, \bullet);$$

using standardized values in [0..1]:

$$r_x : (0.75, 1, \bullet, 0.5, 0.25, \bullet, 0, 0), \quad r_y : (0.75, 0.5, 0, 0.25, \bullet, 0.5, 0.75, \bullet).$$

We define the cardinality of a list, #l, as the number of elements in the list l different to •.

(1) We obtain the list $d_{x,y} : \left(d_{x,y}^1, d_{x,y}^2, \ldots, d_{x,y}^I\right)$:

$$d_{x,y}^i = \left(r_x^i - r_y^i\right)^2 \ \forall i \mid r_x^i \neq \bullet \wedge r_y^i \neq \bullet, \qquad d_{x,y}^i = \bullet \ \forall i \mid r_x^i = \bullet \vee r_y^i = \bullet; \qquad (10)$$

in our example: $d_{x,y} = (0, 0.25, \bullet, 0.0625, \bullet, \bullet, 0.5625, \bullet)$.

(2) We obtain the MSD(x,y) measure by computing the arithmetic average of the values in the list $d_{x,y}$:

$$\mathrm{MSD}(x,y) = \overline{d_{x,y}} = \frac{\sum_{i=1..I,\ d_{x,y}^i \neq \bullet} d_{x,y}^i}{\# d_{x,y}}; \qquad (11)$$

in our example: $\overline{d_{x,y}} = (0 + 0.25 + 0.0625 + 0.5625)/4 = 0.218$.

MSD(x,y) (11) tends towards zero as the ratings of users x and y become more similar and tends towards 1 as they become more different (we assume that the votes are normalized in the interval [0..1]).

(3) We obtain the Jaccard(x,y) measure by computing the proportion between the number of positions [1..I] in which there are elements different to • in both $r_x$ and $r_y$ and the number of positions [1..I] in which there are elements different to • in $r_x$ or in $r_y$:

$$\mathrm{Jaccard}(x,y) = \frac{\left|r_x \cap r_y\right|}{\left|r_x \cup r_y\right|} = \frac{\# d_{x,y}}{\# r_x + \# r_y - \# d_{x,y}}; \qquad (12)$$

in our example: 4/(6 + 6 − 4) = 0.5.

(4) We combine the above elements in the final equation:

$$\mathrm{newmetric}(x,y) = \mathrm{Jaccard}(x,y) \times \left(1 - \mathrm{MSD}(x,y)\right); \qquad (13)$$

in the running example: newmetric(x,y) = 0.5 × (1 − 0.218) = 0.391.

If the values of the votes are normalized in the interval [0..1], then:

$$\left(1 - \mathrm{MSD}(x,y)\right) \in [0..1] \wedge \mathrm{Jaccard}(x,y) \in [0..1] \Rightarrow \mathrm{newmetric}(x,y) \in [0..1].$$
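Eqs. (10)-(13) translate directly into a few lines of Python; the sketch below is our own rendering of the formalization and reproduces the running example, with votes in [1..5] normalized to [0..1] via (v − 1)/4 and None standing in for •.

```python
def normalize(r, vmin=1, vmax=5):
    """Map raw votes onto [0..1]; None ("not voted") is preserved."""
    return tuple(None if v is None else (v - vmin) / (vmax - vmin) for v in r)

def newmetric(rx, ry):
    # Eq. (10): squared differences on the positions voted by both users
    d = [(a - b) ** 2 for a, b in zip(rx, ry)
         if a is not None and b is not None]
    if not d:
        return 0.0                          # no common votes
    msd = sum(d) / len(d)                   # Eq. (11)
    nx = sum(v is not None for v in rx)     # #r_x
    ny = sum(v is not None for v in ry)     # #r_y
    jac = len(d) / (nx + ny - len(d))       # Eq. (12)
    return jac * (1 - msd)                  # Eq. (13)

rx = normalize((4, 5, None, 3, 2, None, 1, 1))
ry = normalize((4, 3, 1, 2, None, 3, 4, None))
print(round(newmetric(rx, ry), 3))          # 0.391 = 0.5 * (1 - 0.218...)
```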
4. Planning the experiments

The RS databases [2,30,32] that we use in our experiments present the general characteristics summarized in Table 1.

Table 1. Main parameters of the databases used in the experiments.

                        MovieLens    FilmAffinity    NetFlix
    Number of users     4382         26447           480189
    Number of movies    3952         21128           17770
    Number of ratings   1000209      19126278        100480507
    Min and max values  1-5          1-10            1-5

The experiments have been grouped in such a way that the following can be determined:

- Accuracy.
- Coverage.
- Number of perfect predictions.
- Precision/recall.

We consider a perfect prediction to be each situation in which the prediction of the rating recommended to a user for one film matches the value rated by that user for that film.

The above experiments were carried out, depending on the size of the database, for each of the following k-neighborhood values: MovieLens [2..1500] step 50, FilmAffinity [2..2000] step 100, NetFlix [2..10000] step 100; depending on the size of each particular RS database, it is necessary to use a different number of k-neighborhoods in order to display tendencies in the graphs of their results. The precision/recall recommendation quality results have been obtained using a range [2..20] of recommendations and relevance thresholds θ = 5 using MovieLens and NetFlix and θ = 9 using FilmAffinity.

When we use MovieLens and FilmAffinity we use 20% of test users taken at random from all the users of the database; with the remaining 80% we carry out the training. When we use NetFlix, given the huge number of users in the database, we only use 5% of its users as test users. In all cases we use 20% of test items. Table 2 shows the numerical data given in this section (an evaluation sketch follows below).

Table 2. Main parameters used in the experiments.

                    K (MAE, coverage, perfect predictions)    Precision/recall      Test users (%)    Test items (%)
                    Range         Step                        N          θ
    MovieLens 1M    [2..1500]     50                          [2..20]    5          20                20
    FilmAffinity    [2..2000]     100                         [2..20]    9          20                20
    NetFlix         [2..10000]    100                         [2..20]    5          5                 20
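The measurement protocol of Table 2 can be summarized in the following hedged sketch. The loop structure and split handling are our own simplification (a faithful implementation would hide the test items when computing similarities); predict is the aggregation function sketched in Section 2.1, passed in as a parameter here, and sim is any user-to-user similarity over the same rating representation.

```python
def evaluate(ratings, test_users, test_items, k, sim, predict):
    """Accumulate MAE and coverage for one (k, sim) configuration."""
    errors, attempted, predicted = [], 0, 0
    for u in test_users:
        # rank the remaining users by similarity and keep the k closest
        others = [n for n in ratings if n != u]
        neighbors = sorted(others, reverse=True,
                           key=lambda n: sim(ratings[u], ratings[n]))[:k]
        for i in set(ratings[u]) & set(test_items):
            attempted += 1
            p = predict(ratings, u, i, neighbors)
            if p is not None:               # a prediction could be made
                predicted += 1
                errors.append(abs(p - ratings[u][i]))
    mae = sum(errors) / len(errors) if errors else None
    coverage = predicted / attempted if attempted else None
    return mae, coverage
```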
The comparative results in Graph 6B show improvements of up to 9% when applying the new metric as regards the correlation. This is a very good result, as higher values of accuracy normally imply smaller capabilities for recommendation.

Graph 6C shows the percentage of perfect estimations as regards the total estimations made. Perfect estimations are those which match the value voted by the test user, taking as an estimation the rounded value of the aggregation of the k-neighborhoods. The values obtained in Graph 6C show a convincing improvement in the results of the new metric regarding correlation, even by 15% in some cases.

Graph 6D shows the recommendation quality measure: precision versus recall. Although the prediction results (graphs A and C) of the new metric greatly improve on the Pearson correlation ones, that improvement is not transferred to the same extent to the recommendation quality results (approximate improvement of 0.02).

Fig. 6. Pearson correlation and new metric comparative results using MovieLens: (A) accuracy, (B) coverage, (C) percentage of perfect predictions, (D) precision/recall. 20% of test users, 20% of test items, k ∈ [2..1500] step 50, N ∈ [2..20], θ = 5.

Fig. 7. Correlation and new metric comparative results using NetFlix: (A) accuracy, (B) coverage, (C) percentage of perfect predictions, (D) precision/recall. 5% of test users, 20% of test items, k ∈ [2..10000] step 100, N ∈ [2..20], θ = 5.
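The precision/recall values of Graph 6D follow the usual top-N formulation with a relevance threshold; the sketch below is our own rendering of that computation (argument names are hypothetical), with θ = 5 for the 1-5 vote ranges and θ = 9 for FilmAffinity's 1-10 range.

```python
def precision_recall_at_n(ranked_items, actual, n, theta):
    """Top-N precision and recall for one test user.

    ranked_items: item ids ordered by predicted rating, best first
    actual:       {item_id: rating cast by the test user}
    An item counts as relevant when the user's own vote reaches theta.
    """
    recommended = ranked_items[:n]
    relevant = {item for item, rating in actual.items() if rating >= theta}
    hits = sum(1 for item in recommended if item in relevant)
    precision = hits / n if n else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```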
In order to better understand this detachment between prediction quality and recommendation quality, we must remember that with precision and recall we are using the concept of relevant recommendations (determined by the threshold θ = 5 in our experiment). Based on an improvement in the MAE of 0.2 stars, we will be capable, on many occasions, of suitably classifying the items which obtained a prediction of 4.3–4.49 (considered irrelevant) with Pearson correlation and which, with the new metric, we will place, often correctly, above 4.5, and therefore consider relevant. As we can see, in the recommendation quality measurements we are dealing with more restrictive numerical margins than those used in the prediction quality measurements, and therefore it is advisable to consider all kinds of improvements positively.

In short, using the MovieLens database, the proposed metric improves the prediction quality measures and the coverage. The recommendation quality measures are slightly improved.

The MAE and perfect prediction results obtained with NetFlix (graphs 7A and 7C) are similar (although slightly lower) to those of MovieLens. Nevertheless, the coverage drops using NetFlix (graph 7B); this behavior is logical and is to be expected, as the proposed metric is capable of finding more similar neighbors (which improves the accuracy measure): the more similar the neighbors are to the test user, in general, not only will they have more similar vote values, but they will also have a greater tendency to vote for the same subset of the total films rated (the same genres, in the same years, etc.); although this factor has been alleviated by the use of Jaccard, its impact has not been completely eliminated.

The NetFlix precision/recall quality measure (graph 7D) improves significantly when the number of recommendations is not high (between 2 and 5), remaining similar to Pearson correlation when the number of recommendations is high.

Fig. 8 shows the results obtained using the FilmAffinity database. In summary, the new metric offers results which cannot be considered better than those provided by Pearson correlation. These results do not negate the integrity of the proposed metric, but rather they confirm its design principles and restrict its field of application. Part of the design of the proposed metric is based on the low level of variety and on the similarity of the numerical votes cast by the different users of RS which offer a small range of voting possibilities (in the study cases, from 1 to 5 stars). FilmAffinity does not adapt to this standard, primarily because its users can vote in a range from 1 to 10 stars, and to a lesser extent because certain "itineraries" for voting on films, established by the managers of the RS, have become popular and make Jaccard's contribution to the metric less effective.

Table 3 summarizes the most relevant aspects of the results obtained.

Experimentally, the proposed metric is executed in approximately a third of the time required by Pearson correlation. Indeed, Pearson correlation requires 2 subtraction operations and 3 multiplication-addition operations (assimilating the squaring operation with the multiplication) for every pair of items voted in common (inside the summations), whilst the MSD only requires one subtraction operation and one multiplication-addition operation. Outside the summation (and therefore less significant in the average times), Pearson correlation requires a square-root calculation whilst the proposed metric only requires 2 divisions and one multiplication.
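To make this operation count concrete, the following sketch (our own textbook-style Pearson implementation, not the authors' code) shows the per-pair inner loop; rx and ry follow the same None-for-"not voted" convention as the earlier jmsd sketch.

```python
from math import sqrt

def pearson_similarity(rx, ry):
    """Pearson correlation over co-voted items. Inside the loop each common
    item costs 2 subtractions and 3 multiply-adds; the square root falls
    outside the loop, as discussed in the text."""
    common = [(x, y) for x, y in zip(rx, ry)
              if x is not None and y is not None]
    if len(common) < 2:
        return 0.0
    mx = sum(x for x, _ in common) / len(common)
    my = sum(y for _, y in common) / len(common)
    num = sxx = syy = 0.0
    for x, y in common:
        dx, dy = x - mx, y - my  # 2 subtractions
        num += dx * dy           # multiply-add
        sxx += dx * dx           # multiply-add
        syy += dy * dy           # multiply-add
    denom = sqrt(sxx) * sqrt(syy)
    return num / denom if denom else 0.0
```

By comparison, with the votes normalized in advance, the MSD loop reduces to one subtraction and one multiply-add per common item, which is consistent with the roughly threefold speed-up reported above.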
Finally, the time required for the Jaccard calculation has proven to have very little relevance, as the values are obtained efficiently in the same process (loop) used in the summations.

Fig. 8. Correlation and new metric comparative results using FilmAffinity: (A) accuracy, (B) coverage, (C) percentage of perfect predictions, (D) precision/recall. 20% of test users, 20% of test items, k ∈ [2..2000] step 100, N ∈ [2..20], θ = 9.

Table 3
Quality of the results obtained by applying the quality measurements on the selected databases.

                      MovieLens   NetFlix   FilmAffinity
MAE                   ++          ++        Not applicable
Perfect predictions   ++          ++
Coverage              +
Precision/recall      +           +

6. Conclusions

In recommender systems which base their votes on small ranges of values (e.g. from 1 to 5), users tend to cast their ratings as "positive" or "non-positive", transferring those conceptual levels towards numerical values; this is reflected in low variety in the range of votes cast and little variation in the ratings of each item by the set of users of the recommender system.
To complement the numerical values with additional non-numerical information, focused on the arrangement of the votes of each pair of users, we manage to improve the results of Pearson correlation; for this purpose we use the Jaccard measure and the traditional metric that best adapts to its characteristics: mean squared differences. The new metric only operates with the data (ratings) provided by the users of the recommender systems, and does not require any arbitrary parameters for adjustment or weighting.

With the aim of achieving representative results, the experiments have been carried out on three different recommender system databases (MovieLens, FilmAffinity and NetFlix) which provide a sufficient volume and variety of data to offer reliable comparative results and general conclusions. The results confirm the integrity of the proposed metric when applied to MovieLens and NetFlix (with range of votes [1..5]), whilst the results do not improve on those of Pearson correlation when applied to FilmAffinity (with range of votes [1..10]).

Acknowledgements

Our acknowledgement to the GroupLens Research Group, and to the FilmAffinity and NetFlix companies.

References

[1] G. Adomavicius, A. Tuzhilin, Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions, IEEE Transactions on Knowledge and Data Engineering 17 (6) (2005) 734–749.
[2] AICU, , .
[3] N. Antonopoulus, J. Salter, CinemaScreen recommender agent: combining collaborative and content-based filtering, IEEE Intelligent Systems (2006) 35–41.
[4] R. Baraglia, F. Silvestri, An online recommender system for large web sites, in: Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, 2004, pp. 199–205, doi:10.1109/WI.2004.10158.
[5] J. Bobadilla, F. Serradilla, A. Hernando, Collaborative filtering adapted to recommender systems of e-learning, Knowledge-Based Systems 22 (2009) 261–265, doi:10.1016/j.knosys.2009.01.008.
[6] J.S. Breese, D. Heckerman, C. Kadie, Empirical analysis of predictive algorithms for collaborative filtering, in: Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann, 1998, pp. 43–52.
[7] L. Candillier, F. Meyer, M. Boullé, Comparing state-of-the-art collaborative filtering systems, in: LNAI 4571, 2007, pp. 548–562.
[8] S.B. Cho, J.H. Hong, M.H. Park, Location-based recommendation system using bayesian user's preference model in mobile devices, Lecture Notes on Computer Science 4611 (2007) 1130–1139.
[9] F. Hernández, E. Gaudioso, Evaluation of recommender systems: a new approach, Expert Systems with Applications (35) (2008) 790–804, doi:10.1016/j.eswa.2007.07.047.
[10] D.R. Fesenmaier, U. Gretzel, C. Knoblock, C. Paris, F. Ricci, S. Stabb, et al., Intelligent systems for tourism, Intelligent Systems 17 (6) (2002) 53–66.
[11] I. Fuyuki, T.K. Quan, H. Shinichi, Improving accuracy of recommender systems by clustering items based on stability of user similarity, in: Proceedings of the IEEE International Conference on Intelligent Agents, Web Technologies and Internet Commerce, 2006, pp. 61–61, doi:10.1109/CIMCA.2006.123.
[12] G.M. Giaglis, G. Lekakos, Improving the prediction accuracy of recommendation algorithms: approaches anchored on human factors, Interacting with Computers 18 (3) (2006) 410–431.
[13] K. Goldberg, T. Roeder, D. Gupta, C. Perkins, Eigentaste: a constant time collaborative filtering algorithm, Information Retrieval 4 (2) (2001) 133–151.
[14] C. Hayes, P.
Cunningham, Context boosting collaborative recommendations, Knowledge-Based Systems 17 (2–4) (2004) 131–138.
[15] J.L. Herlocker, J.A. Konstan, A.I. Borchers, J.T. Riedl, An algorithmic framework for performing collaborative filtering, in: SIGIR, 1999, pp. 230–237.
[16] J.L. Herlocker, J.A. Konstan, J.T. Riedl, An empirical analysis of design choices in neighborhood-based collaborative filtering algorithms, Information Retrieval 5 (2002) 287–310.
[17] J.L. Herlocker, J.A. Konstan, J.T. Riedl, L.G. Terveen, Evaluating collaborative filtering recommender systems, ACM Transactions on Information Systems 22 (1) (2004) 5–53.
[18] H. Ingoo, J.O. Kyong, H.R. Tae, The collaborative filtering recommendation based on SOM cluster-indexing CBR, Expert Systems with Applications 25 (2003) 413–423.
[19] T. Janner, C. Schroth, Web 2.0 and SOA: converging concepts enabling the internet of services, in: IT Professional, 2007, pp. 36–41.
[20] H. Jinghua, W. Kangning, F. Shaohong, A survey of e-commerce recommender systems, in: Proceedings of the International Conference on Service Systems and Service Management, 2007, pp. 1–5, doi:10.1109/ICSSSM.2007.4280214.
[21] M. Knights, Web 2.0, IET Communications Engineer (2007) 30–35.
[22] F. Kong, X. Sun, S. Ye, A comparison of several algorithms for collaborative filtering in startup stage, in: Proceedings of the IEEE Networking, Sensing and Control, 2005, pp. 25–28.
[23] J.A. Konstan, B.N. Miller, J. Riedl, PocketLens: toward a personal recommender system, ACM Transactions on Information Systems 22 (3) (2004) 437–476.
[24] G. Koutrica, B. Bercovitz, H. Garcia, FlexRecs: expressing and combining flexible recommendations, in: SIGMOD, 2009, pp. 745–757.
[25] P. Li, S. Yamada, A movie recommender system based on inductive learning, in: Proceedings of the IEEE Conference on Cybernetics and Intelligent Systems, vol. 1, 2004, pp. 318–323, doi:10.1109/ICCIS.2004.1460433.
[26] K.J. Lin, Building Web 2.0, Computer (2007) 101–102.
[27] G. Linden, B. Smith, J. York, Amazon.com recommendations, IEEE Internet Computing (January–February) (2003) 76–80.
[28] F. Loll, N. Pinkwart, Using collaborative filtering algorithms as eLearning tools, in: 42nd Hawaii International Conference on System Sciences HICSS'09, 2009, pp. 1–10.
[29] Y. Manolopoulos, A. Nanopoulos, A.N. Papadopoulos, P. Symeonidis, Collaborative recommender systems: combining effectiveness and efficiency, Expert Systems with Applications (34) (2007) 2995–3013, doi:10.1016/j.eswa.2007.05.013.
[30] MovieLens, .
[31] R. Nayak, L.T. Weng, Y. Xu, An improvement to collaborative filtering for recommender systems, in: Proceedings of the IEEE International Conference on Computational Intelligence for Modelling, Control and Automation, vol. 1, 2005, pp. 792–795, doi:10.1109/CIMCA.2005.1631361.
[32] NetFlix, .
[33] P. Pu, L. Chen, Trust-inspiring explanation interfaces for recommender systems, Knowledge-Based Systems 20 (6) (2007) 542–556.
[34] P.B. Ryan, D. Bridge, Collaborative recommending using formal concept analysis, Knowledge-Based Systems 19 (5) (2006) 309–315.
[35] J.L. Sanchez, F. Serradilla, E. Martinez, J. Bobadilla, Choice of metrics used in collaborative filtering and their impact on recommender systems, in: Proceedings of the IEEE International Conference on Digital Ecosystems and Technologies DEST, 2008, pp. 432–436, doi:10.1109/DEST.2008.4635147.
[36] J.B. Schafer, D. Frankowski, J.L. Herlocker, S. Sen, Collaborative filtering recommender systems, LNCS 4321, 2007, pp. 291–324.
[37] P. Symeonidis, A. Nanopoulos, Y.
Manolopoulos, Providing justifications in recommender systems, IEEE Transactions on Systems, Man and Cybernetics, Part A 38 (6) (2008) 1262–1272.
[38] W. Yuan, D. Guan, Y.K. Lee, S. Lee, S.J. Hur, Improved trust-aware recommender system using small-worldness of trust networks, Knowledge-Based Systems 23 (3) (2010) 232–238.
[39] R.R. Yager, Fuzzy logic methods in recommender systems, Fuzzy Sets and Systems 136 (2) (2003) 133–149.
[40] J.M. Yang, K.F. Li, Recommendation based on rational inferences in collaborative filtering, Knowledge-Based Systems 22 (1) (2009) 105–114.