Expert Systems with Applications 38(2011)14609-14623 Contents lists available at ScienceDirect Expert Systems with Applications ELSEVIER journalhomepagewww.elsevier.com/locate/eswa A framework for collaborative filtering recommender systems Jesus Bobadilla Antonio Hernando, Fernando ortega, Jesus Bernal Universidad Politecnica de Madrid E FilmAffinity. com research team, Crta De valencia, Km. 7, 28031 Madrid, Spain ARTICLE FO A BSTRACT As the use of recommender systems becomes more consolidated on the Net, an increasing need arises to recommender systems evelop some kind of evaluation framework for collaborative filtering measures and methods which is capable of not only testing the prediction and recommendation results, but also of other purposes which imilarity measures until now were considered secondary, such as novelty in the recommendations and the users'trust in these. This paper provides: (a)measures to evaluate the novelty of the users 'recommendations and trust uality in their neighborhoods, (b)equations that formalize and unify the collaborative filtering process and its collaborative fltering evaluation, (c)a framework based on the above- mentioned elements that enables the evaluation of the quality results of any collaborative filtering applied to the desired recommender systems, using four graphs: quality of the predictions, the recommendations, the novelty and the trust. 2011 Elsevier Ltd. All rights reserved 1 Introduction of"Canary Islands"of an important number of individuals who also rated destinations in the Caribbean very highly. This suggestion Recommender systems(RS )are developed to attempt to reduce (recommendation) will often provide the user of the service part of the information overload problem produced on the Net As inspiring information from the collective knowledge of all o opposed to other traditional help systems, such as search engines users of the service. (Google, Yahoo, etc. ) RS generally base their operation on a collab- RS cover a wide variety of applications(Baraglia Silvestri, ative Filtering( CF)process, which provides personalized recom- 2004: Bobadilla, Serradilla, Hernando, 2009: Fesenmaier et al mendations to active users of websites where different elements 2002: Jinghua, Kangning, Shaohong, 2007: Serrano, Viedma, (products, films, holidays, etc. ) can be rated. Olivas, Cerezo, Romero, 2011), although those related to movie RS are inspired by human social behavior, where it is common recommendations are by far the best and most widely-used in to take into account the tastes, opinions and experiences of our the research field(Antonopoulus Salter, 2006: Konstan, Miller, acquaintances when making all kinds of decisions(choosing films riedl, 2004 to watch, selecting schools for our children, choosing products to A substantial part of the research in the area of CF focuses on how buy, etc. ) Obviously, our decisions are modulated according to to determine which users are similar to the given one: in order to our interpretation of the similarity that exists between us and tackle this task, there are fundamentally three approaches: mem- our group of acquaintances, in such a way that we rate the opin- ory-based methods, model-based methods and hybrid approaches ions and experiences of some more highly than othe Memory-based methods(Bobadilla, Ortega, Hernando, in By emulating each step of our own behavior insofar as is press: Bobadilla, Serradilla, Bernal, 2010: Kong, Sun, & Ye ble, the CF process of RS firstly selects the group of users from the Rs 2005: Sanchez, Serradilla, Martinez, Bobadilla, 2008: Symeonidis, website that is most similar to us, and then provides us with a group Nanopoulos, Manolopoulos, 2008) use similarity metrics and act of recommendations of elements that we have not rated yet directly on the ratio matrix that contains the ratings of all users (assuming in this way that they are new to us)and which have been who have expressed their preferences on the collaborative service ated the best by the group of users with similar tastes to us. This these metrics mathematically express a distance between two ay, a trip to the Canary islands could be recommended to an indi- users based on each of their ratios. Model-based methods vidual who has rated different destinations in the Caribbean very ( Adomavicius tuzhilin, 2005) use the ratio matrix to create a highly, based on the positive ratings about the holiday destination model from which the sets of similar users will be established Among the most widely used models we have: Bayesian classifiers (Cho, Hong, Park, 2007), neural networks(Ingoo, Kyong, Tac Corresponding author. Address: Universidad Politecnica de Madrid,, Crta De 2003)and fuzzy systems(Yager, 2003).Generally, commercial RS Valencia, Km. 7, 28031 Madrid, Spain Tel. +34 3365133: fax: +34 913367522. use memory-based methods (Giaglis Lekakos, 2006), whilst model-based methods are usually associated with research RS 0957-4174/s- see front o 2011 Elsevier Ltd. All rights doi:10.1016/eswa2011.05.021
A framework for collaborative filtering recommender systems Jesus Bobadilla ⇑ , Antonio Hernando, Fernando Ortega, Jesus Bernal Universidad Politecnica de Madrid & FilmAffinity.com research team, Crta. De Valencia, Km. 7, 28031 Madrid, Spain article info Keywords: Recommender systems Framework Similarity measures Trust Novelty Quality Collaborative filtering abstract As the use of recommender systems becomes more consolidated on the Net, an increasing need arises to develop some kind of evaluation framework for collaborative filtering measures and methods which is capable of not only testing the prediction and recommendation results, but also of other purposes which until now were considered secondary, such as novelty in the recommendations and the users’ trust in these. This paper provides: (a) measures to evaluate the novelty of the users’ recommendations and trust in their neighborhoods, (b) equations that formalize and unify the collaborative filtering process and its evaluation, (c) a framework based on the above-mentioned elements that enables the evaluation of the quality results of any collaborative filtering applied to the desired recommender systems, using four graphs: quality of the predictions, the recommendations, the novelty and the trust. 2011 Elsevier Ltd. All rights reserved. 1. Introduction Recommender systems (RS) are developed to attempt to reduce part of the information overload problem produced on the Net. As opposed to other traditional help systems, such as search engines (Google, Yahoo, etc.), RS generally base their operation on a Collaborative Filtering (CF) process, which provides personalized recommendations to active users of websites where different elements (products, films, holidays, etc.) can be rated. RS are inspired by human social behavior, where it is common to take into account the tastes, opinions and experiences of our acquaintances when making all kinds of decisions (choosing films to watch, selecting schools for our children, choosing products to buy, etc.). Obviously, our decisions are modulated according to our interpretation of the similarity that exists between us and our group of acquaintances, in such a way that we rate the opinions and experiences of some more highly than others. By emulating each step of our own behavior insofar as is possible, the CF process of RS firstly selects the group of users from the RS website that is most similar to us, and then provides us with a group of recommendations of elements that we have not rated yet (assuming in this way that they are new to us) and which have been rated the best by the group of users with similar tastes to us. This way, a trip to the Canary Islands could be recommended to an individual who has rated different destinations in the Caribbean very highly, based on the positive ratings about the holiday destination of ‘‘Canary Islands’’ of an important number of individuals who also rated destinations in the Caribbean very highly. This suggestion (recommendation) will often provide the user of the service with inspiring information from the collective knowledge of all other users of the service. RS cover a wide variety of applications (Baraglia & Silvestri, 2004; Bobadilla, Serradilla, & Hernando, 2009; Fesenmaier et al., 2002; Jinghua, Kangning, & Shaohong, 2007; Serrano, Viedma, Olivas, Cerezo, & Romero, 2011), although those related to movie recommendations are by far the best and most widely-used in the research field (Antonopoulus & Salter, 2006; Konstan, Miller, & Riedl, 2004). A substantial part of the research in the area of CF focuses on how to determine which users are similar to the given one; in order to tackle this task, there are fundamentally three approaches: memory-based methods, model-based methods and hybrid approaches. Memory-based methods (Bobadilla, Ortega, & Hernando, in press; Bobadilla, Serradilla, & Bernal, 2010; Kong, Sun, & Ye, 2005; Sanchez, Serradilla, Martinez, & Bobadilla, 2008; Symeonidis, Nanopoulos, & Manolopoulos, 2008) use similarity metrics and act directly on the ratio matrix that contains the ratings of all users who have expressed their preferences on the collaborative service; these metrics mathematically express a distance between two users based on each of their ratios. Model-based methods (Adomavicius & Tuzhilin, 2005) use the ratio matrix to create a model from which the sets of similar users will be established. Among the most widely used models we have: Bayesian classifiers (Cho, Hong, & Park, 2007), neural networks (Ingoo, Kyong, & Tae, 2003) and fuzzy systems (Yager, 2003). Generally, commercial RS use memory-based methods (Giaglis & Lekakos, 2006), whilst model-based methods are usually associated with research RS. 0957-4174/$ - see front matter 2011 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2011.05.021 Abbreviations: RS, recommender systems; CF, collaborative filtering. ⇑ Corresponding author. Address: Universidad Politecnica de Madrid, Crta. De Valencia, Km. 7, 28031 Madrid, Spain. Tel.: +34 3365133; fax: +34 913367522. E-mail address: jesus.bobadilla@upm.es (J. Bobadilla). Expert Systems with Applications 38 (2011) 14609–14623 Contents lists available at ScienceDirect Expert Systems with Applications journal homepage: www.elsevier.com/locate/eswa
rman, and Kadie(1998) compare the predictive accuracy of various methods in a set of representa- ier, Meyer, and Boulle(2007)and main n, 2007 review the ma a CF frame predic der to be able me at differ- orovided to help ng ideas in each group of
Regardless of the method used in the CF stage, the technical aim generally pursued is to minimize the prediction errors, by making the accuracy (Fuyuki, Quan, & Shinichi, 2006; Giaglis & Lekakos, 2006; Li & Yamada, 2004; Manolopoulus, Nanopoulus, Papadopoulus, & Symeonidis, 2007; Su & Khoshgoftaar, 2009) of the RS as high as possible; nevertheless, there are other purposes that need to be taken into account: avoid overspecialization phenomena, find good items, trust of recommendations, novelty, precision and recall measures, sparsity, cold start issues, etc. The framework proposed in the paper gives special importance to the quality of the predictions and the recommendations, as well as to the novelty and trust results. Whilst the importance of the quality obtained in the predictions and recommendations has been studied in detail since the start of the RS, the quality results in novelty and trust provided by the different methods and metrics used in CF have not been evaluated in depth. Measuring the quality of the trust results in recommendations becomes even more complicated as we are entering a particularly subjective field, where each specific user can grant more or less importance to various aspects that are selected as relevant to gain their trust in the recommendations offered (recommendation of recent elements, such as film premieres, introduction of novel elements, etc.). Another additional problem is the number of nuances that can be taken into account together with the lack of consensus to define them; in this way we can find studies on trust, reputation, credibility, importance, expertise, competence, reliability, etc. which sometimes pursue the same objective and other times do not. In Buhwan, Jaewook, and Hyunbo (2009) we can see some novel memory-based methods that incorporate the level of a user credit instead of using similarity between users. In Kwiseok, Jinhyung, and Yongtae (2009) they employ a multidimensional credibility model, source credibility from consumer psychology, and provide a credible neighbor selection method, although the equations involved require a great number of parameters of difficult or arbitrary adjustment. O’Donovan and Smyth (2005) presents two computational models of trust and show how they can be readily incorporated into CF frameworks. Kitisin and Neuman (2006) propose an approach to include the social factors e.g. user’s past behaviors and reputation together as an element of trust that can be incorporated into the RS. Zhang (2008) and Hijikata et al., 2009 tackle the novelty issue: in the first paper they propose a novel topic diversity metric which explores hierarchical domain knowledge, whilst in the second paper they infer items that a user does not know by calculating the similarity of users or items based on information about what items users already know. An aspect related to the trust measures is the capacity to provide justifications for the recommendations made; in Symeonidis et al. (2008) they propose an approach that attains both accurate and justifiable recommendations, constructing a feature profile for the users to reveal their favorite features. To date, various publications have been written which tackle the way the RS are evaluated, among the most significant we have Herlocker, Konstan, Riedl, and Terveen (2004) which reviews the key decisions in evaluating CF RS: the user tasks, the type of analysis and datasets being used, the ways in which prediction quality is measured and the user-based evaluation of the system as a whole. Hernández and Gaudioso (2008) is a current study which proposes a recommendation filtering process based on the distinction between interactive and non-interactive subsystems. General publications and reviews also exist which include the most commonly accepted metrics, aggregation approaches and evaluation measures: mean absolute error, coverage, precision, recall and derivatives of these: mean squared error, normalized mean absolute error, ROC and fallout; Goldberg, Roeder, Gupta, and Perkins (2001) focus on the aspects not related to the evaluation, Breese, Heckerman, and Kadie (1998) compare the predictive accuracy of various methods in a set of representative problem domains. Candillier, Meyer, and Boullé (2007) and Schafer, Frankowski, Herlocker, and Sen, 2007 review the main CF methods proposed in the literature. Among the most significant papers that propose a CF framework is Herlocker, Konstan, Borchers, and Riedl (1999) which evaluates the following: similarity weight, significance weighting, variance weighting, selecting neighborhood and rating normalization; Hernández and Gaudioso (2008) propose a framework in which any RS is formed by two different subsystems, one of them to guide the user and the other to provide useful/interesting items. Koutrika, Bercovitz, and Garcia (2009) is a recent and very interesting framework which introduces levels of abstraction in CF process, making the modifications in the RS more flexible. The RS frameworks proposed until now present two deficiencies which we aim to tackle in this paper. The first of these is the lack of formalization in the evaluation methods; although the quality metrics are well defined, there are a variety of details in the implementation of the methods which, in the event they are not specified, can lead to the generation of different results in similar experiments. The second deficiency is the absence of quality measures of the results in aspects such as novelty and trust of the recommendations. The following section of this paper develops a complete series of mathematical formalizations based on sets theory, backed by a running example which aids understanding and by cases of studies which show clarifying results of the aspects and alternatives shown; in this section, we also obtain the combination of metric, aggregation approach and standardization method which provides the best results, enabling it to be used as a reference to evaluate metrics designed by the scientific community. In Section 3 we specify the evaluation measures proposed in the framework, which include the quality analysis of the following aspects: predictions (estimations), recommendations, novelty and trust; this same section shows the results obtained by using MovieLens 1M and NetFlix. Finally, we set our most relevant conclusions. 2. Framework specifications This section provides both the equations on which the prediction/recommendation process in the CF stage is based and the equations that support the quality evaluation process offered in the proposed framework; between these last two we have the traditional MAE, coverage, precision, recall and those developed specifically to complete the framework: novelty-precision, novelty-recall, trust-precision, trust-recall. The objective of formalizing the prediction, recommendation and evaluation processes is to ensure that the experiments carried out by different researchers can be reproduced and are not altered by different decisions made on behalf of different implementation details: e.g. deciding how to act when no k-neighborhoods have voted for a specific item (we could say not predict, or predict with the average votes of all users on that item), whether we apply a standardization process to the input data or to the weightings of the aggregation approach, whether on finding an error in a prediction we take the decimal values of the prediction or round them off to the nearest whole value, etc. The formalization presented here is fundamental when specifying a framework, where the same experiments carried out by different researchers must give the same results, in order to be able to compare the metrics and methods developed over time at different research centers. Throughout the section, a running example is provided to help to understand and follow the underlying ideas in each group of 14610 J. Bobadilla et al. / Expert Systems with Applications 38 (2011) 14609–14623
Applications38(2011)14609-14623 directly interrelated equations. In the same way, various real re- Table\ 0 ye uom on to the use m he user sults are provided(obtained with MovieLens)grouped into"case Meas of study"subsections where the integrities and defects of each of Name Measures descriptions the alternatives mentioned can be compared. The main subsections in which this section is structured al ime of the user on the ite preliminary definitions, similarity measures, obtaining a users k neighborhoods, prediction of the value of an item, obtaining the Mean a Prediction quality recommendations, quality of the recommendation: precision and (u Cover age on the user or the Rs recall, quality of the novelty: novelty-precision and novelty-recall erage of the rs and quality of the trust: trust-precision and trust-recall Recommendation precision on the user Recommendation quality Recommendation recall on the user Recommendation recall of the rs In this subsection we specify the definitions, parameters, mea- Novelty quality sures and sets used in the equations, as well as the values of the ning example. In order to simplify the equations of the other Novelty recall of the Rs subsections, we do not specify here the different learning and test wu Trust precision on the user Trust quality groups which must be used in the framework operation. Trust precision of the rs 2.1.1. Formalization Trust recall of the rs Given an rs with a database of l users and m items rated range min,., max), where the absence of ratings will be repre- ented by the symbol·. (1,1),(2,·),③3.·),(4,2),(5.4,(6.1),(7,· We define the basic sets: (N: natural numbers, R: real numbers) (8.·),(9,·),(10,-(11,),(12,·,(13.4),(14.1 U={u∈N1≤u≤L},set users ={i∈N1≤i≤M}, set of items (1,4),·,(3,·),4.3),5·),(6,·,7,·), efine vote v of user u on item i as rui=v (8.·),9,5).(10.4)11.·),(12.·),(13,-,(14,·) 'e define the average of the valid votes of user u as Tu We define the cardinality of a set c as its number of B=01…,(2)(3,,4.(5-,(6(7,3 valid elements (8.3),(9,4),(10.5),(11,-,(12-),(13,5),(14.· Below we present the tables of parameters(Table 1), measures 2.2.1. Introduction (Table 2)and sets(table 3)used in the formalizations made in the 6. The proposed framework will allow to compare future similar- stantiate the behavior of well-known similarity measures and 2.1.2. Running exampl propose the one that gives the best results, so that it can act as a reference for future comparisons. The user to user similarity mea- U={u∈N1≤u≤5},I={∈N1≤i≤14}, sures most commonly-used in RS are: Pearson Correlation, cosine, V={U∈N1≤v≤5Vv=·} Constrained Pearsons Correlation and spearman rank correlation. (1.5),(2,·),(3.·,(4.3),(5,·,(6.4),(7,1) tween two users x and y: sim(x, y) based on their ratings of items (8-),(9,-),(10.4,、(11),(12.,2),(13,4,(14. hat both users have rated(9). Axy={∈lrx≠·Aryi≠· 2.2. 2 Running example Parameters In order to make the example easier to follow we will use a sim- Square Difference(MSD) of two users x and y We represent the votes issued in table format (Table 4): We obtain the table of similarities between users (Table 5), tak # Ratings received ing into account that MSD(x, y)= MSD(, x). The maximum similar- Relevant item threshold #Trust users #Trust items Calculation example: MSD(U1, U2)=i((5-1)+(3-2)+
directly interrelated equations. In the same way, various real results are provided (obtained with MovieLens) grouped into ‘‘case of study’’ subsections where the integrities and defects of each of the alternatives mentioned can be compared. The main subsections in which this section is structured are: preliminary definitions, similarity measures, obtaining a user’s kneighborhoods, prediction of the value of an item, obtaining the accuracy, standardization process, obtaining the coverage, top N recommendations, quality of the recommendation: precision and recall, quality of the novelty: novelty-precision and novelty-recall and quality of the trust: trust-precision and trust-recall. 2.1. Preliminary definitions In this subsection we specify the definitions, parameters, measures and sets used in the equations, as well as the values of the running example. In order to simplify the equations of the other subsections, we do not specify here the different learning and test groups which must be used in the framework operation. 2.1.1. Formalization Given an RS with a database of L users and M items rated in the range {min,...,max}, where the absence of ratings will be represented by the symbol . We define the basic sets: (N: natural numbers, R: real numbers) U ¼ fu 2 Nj1 6 u 6 Lg; set of users ð1Þ I ¼ fi 2 Nj1 6 i 6 Mg; set of items ð2Þ V ¼ fv 2 Nj min 6 v 6 maxg [ fg; set of possible votes ð3Þ Ru ¼ fði; vÞji 2 I; v 2 Vg; ratings of user u ð4Þ We define vote v of user u on item i as ru;i ¼ v ð5Þ We define the average of the valid votes of user u as ru ð6Þ We define the cardinality of a set C as its number of valid elements #C ¼ #fx 2 Cjx – g ð7Þ #Ru ¼ #fi 2 Ijru;i – g ð8Þ Below we present the tables of parameters (Table 1), measures (Table 2) and sets (Table 3) used in the formalizations made in the paper. 2.1.2. Running example U ¼ fu 2 Nj1 6 u 6 5g; I ¼ fi 2 Nj1 6 i 6 14g; V ¼ fv 2 Nj1 6 v 6 5 _ v ¼ g R1 ¼ h1; 5i;h2; i;h3; i;h4; 3i;h5; i;h6; 4i;h7; 1i; h8; i;h9; i;h10; 4i;h11; i;h12; 2i;h13; 4i;h14; i R2 ¼ h1; 1i;h2; i;h3; i;h4; 2i;h5; 4i;h6; 1i;h7; i h8; i;h9; i;h10; i;h11; i;h12; i;h13; 4i;h14; 1i R3 ¼ h1; 5i;h2; 2i;h3; i;h4; 4i;h5; i;h6; i;h7; i; h8; 3i;h9; 5i;h10; 4i;h11; i;h12; i;h13; 4i;h14; i ; R4 ¼ h1; 4i;h2; i;h3; i;h4; 3i;h5; i;h6; i;h7; i; h8; i;h9; 5i;h10; 4ih11; i;h12; i;h13; i;h14; i R5 ¼ h1; i;h2; i;h3; i;h4; i;h5; i;h6; i;h7; 3i; h8; 3i;h9; 4i;h10; 5i;h11; i;h12; i;h13; 5i;h14; i 2.2. Similarity measures 2.2.1. Introduction The proposed framework will allow to compare future similarity measures and methods, in the meantime, it is advisable to substantiate the behavior of well-known similarity measures and propose the one that gives the best results, so that it can act as a reference for future comparisons. The user to user similarity measures most commonly-used in RS are: Pearson Correlation, cosine, Constrained Pearson’s Correlation and Spearman rank correlation. The similarity approaches usually compute the similarity between two users x and y: sim(x,y) based on their ratings of items that both users have rated (9). Ax;y ¼ fi 2 Ijrx;i – ^ ry;i – g: ð9Þ 2.2.2. Running example In order to make the example easier to follow we will use a similarity measure that is very easy to calculate manually: the Mean Square Difference (MSD) of two users x and y. MSDðx; yÞ ¼ 1 #Ax;y X i2Ax;y ðrx;i ry;iÞ 2 : We represent the votes issued in table format (Table 4): We obtain the table of similarities between users (Table 5), taking into account that MSD(x,y) = MSD(y,x). The maximum similarity is reached at value 0. Calculation example: MSDðU1; U2Þ ¼ 1 4 ½ð5 1Þ 2 þ ð3 2Þ 2 þ ð4 1Þ 2 þ ð4 4Þ 2 ¼ 6:5. Table 1 Parameters. Name Parameters descriptions L #Users M #Items min #min rating value max #max rating value K #Neighborhoods N #Recommendations b #Range to the current time c #Ratings received h Relevant item threshold q #Trust users h #Trust items Table 2 Measures. Name Measures descriptions Usage ru, i Rating of the user on the item General use tu, i Rating time of the user on the item pu, i Prediction to the user on the item mu Mean absolute error on the user Prediction quality m Mean absolute error of the RS cu Coverage on the user c Coverage of the RS tu Recommendation precision on the user Recommendation quality t Recommendation precision of the RS xu Recommendation recall on the user x Recommendation recall of the RS nu Novelty precision on the user Novelty quality n Novelty precision of the RS lu Novelty recall on the user l Novelty recall of the RS wu Trust precision on the user Trust quality w Trust precision of the RS zu Trust recall on the user z Trust recall of the RS J. Bobadilla et al. / Expert Systems with Applications 38 (2011) 14609–14623 14611
J. Bobadilla et aL/ Expert Systems with Applications 38(2011)14609-14623 sle 3 Table 5 Running example: users similarities. Name Sets description MSD 6.66 75 user k userk e Items voted of the most by y user Table 6 Tu Users neighborhoods taking into account B er,k,阝.q Running example each user user,k, B, h Trust pairs (user, item) Axy Items rated simultaneously by users x and y K=3{U3,U4U (Ul U4. U2) 2 rated item i sers who have excepts er item Items that the user has voted for and on which sers from whom a mae can be obtained Once the set of K users (neighbors similar to active u has been Items that the user has not voted for and on which calculated(Ku), in order to obtain the prediction of item i on user predictions exist u( 12), one of the following aggregation approaches is often used ems that the user has user. k the average(15 ). the weighted sum(16)and the adjusted weighted Pairs(user, item) that have not been voted for and k aggregation(Deviation-From-Mean)(17). ems that have recently been voted for by both user x B, user1 etGu={n∈KBrn≠ Users recent votes user,阝 pa1=u∑simu.nrn台Cu≠ pa1=n+{u∑simu.n(rn-r)Gu≠ Running example: RS database. where u serves as a normalizing factor, usually computed u1=1/∑smn)Cu≠ (18) When it is not possible to make the prediction of an item as none of the K-neighbors has voted for this item, we can decide to 2.3. Obtaining a user's K-neighbors make use of the average ratings given to that item by all users of the rs who have voted for it; in this case, eqs. (14)-(18)are com- plemented with Eqs. (19)-(23) 23. 1. Formalization We define Ku as the set of K neighbors of the user u. where bui={n∈Un≠u,rn≠·} (19) The following must be true rn→Gu=ABu≠必 wx∈Ku,wy∈(U-Ku),sim(u,x)≥sim(u,y Pa={u∑simu.n)nCu=ABu1≠ pa1=7u+{∑sim(u.n)(ma-)台Gu=必ABu≠必(22) ng K=2 and K=3 2. 4. Prediction of the value of an item 1=1/∑sim(u,n)Cn=0ABn1≠ (23) Based on the information provided by the K-neighbors of a user diction o 'n rs cases exist in which it is impossible to make pre- 2. 4.1. Formalization Finally, some items that any other user has voted for: u, the CF process enables the value of an item to be predicted as P=·Gui=ABu= LetP={(ip)i∈I,p∈R} set of prediction to the user u(R: real numbers) 2. 4.2. Running example We will assign the value of the prediction p By using the simplest prediction Eq (15)we obtain the predic- tions that the users can receive using K=3 neighbors. Table 7 made to user u on item i as pui=p (13) shows these predictions
2.3. Obtaining a user’s K-neighbors 2.3.1. Formalization We define Ku as the set of K neighbors of the user u. The following must be true: Ku U ^ #Ku ¼ k ^ u R Ku; ð10Þ 8x 2 Ku; 8y 2 ðU KuÞ; simðu; xÞ P simðu; yÞ: ð11Þ 2.3.2. Running example Table 6 shows the sets of neighbors using K = 2 and K = 3: 2.4. Prediction of the value of an item 2.4.1. Formalization Based on the information provided by the K-neighbors of a user u, the CF process enables the value of an item to be predicted as follows: Let Pu ¼ fði; pÞji 2 I; p 2 Rg; set of prediction to the user u ðR : real numbersÞ ð12Þ We will assign the value of the prediction p made to user u on item i as pu;i ¼ p ð13Þ Once the set of K users (neighbors) similar to active u has been calculated (Ku), in order to obtain the prediction of item i on user u(12), one of the following aggregation approaches is often used: the average (15), the weighted sum (16) and the adjusted weighted aggregation (Deviation-From-Mean) (17). Let Gu;i ¼ fn 2 Kuj9rn;i – g ð14Þ pu;i ¼ 1 #Gu;i X n2Gu;i rn;i () Gu;i – £ ð15Þ pu;i ¼ lu;i X n2Gu;i simðu; nÞrn;i () Gu;i – £ ð16Þ pu;i ¼ ru þ lu;i X n2Gu;i simðu; nÞðrn;i rnÞ () Gu;i – £ ð17Þ where l serves as a normalizing factor, usually computed: lu;i ¼ 1 X n2Gu;i simðu; nÞ () Gu;i – £ , ð18Þ When it is not possible to make the prediction of an item as none of the K-neighbors has voted for this item, we can decide to make use of the average ratings given to that item by all users of the RS who have voted for it; in this case, Eqs. (14)–(18) are complemented with Eqs. (19)–(23): where Bu;i ¼ fn 2 Ujn – u;rn;i – g ð19Þ pu;i ¼ 1 #Bu;i X n2Bu;i rn;i () Gu;i ¼ £ ^ Bu;i – £ ð20Þ pu;i ¼ lu;i X n2Bu;i simðu; nÞrn;i () Gu;i ¼ £ ^ Bu;i – £ ð21Þ pu;i ¼ ru þ lu;i X n2Bu;i simðu; nÞðrn;i rnÞ () Gu;i ¼ £ ^ Bu;i – £ ð22Þ lu;i ¼ 1= X n2Bu;i simðu; nÞ () Gu;i ¼ £ ^ Bu;i – £ ð23Þ Finally, in RS cases exist in which it is impossible to make predictions on some items that any other user has voted for: pu;i ¼ () Gu;i ¼ £ ^ Bu;i ¼ £ ð24Þ 2.4.2. Running example By using the simplest prediction Eq. (15) we obtain the predictions that the users can receive using K = 3 neighbors. Table 7 shows these predictions. Table 3 Sets. Name Sets descriptions Parameters U Users L I Items M V Rating values min, max Ru User ratings user Ku Neighborhoods of the user user, k Pu Predictions to the user user, k Xu Top recommended items to the user user, k, h Zu Top N recommended items to the user user, k, N, h Y Items voted of the most by c users c Tu User’s neighborhoods taking into account b user, k, b, q Qu Trust users user, k, b, h Hu Trust pairs (user, item) user, k, b Ax,y Items rated simultaneously by users x and y user1, user2 Gu,i User’s neighborhoods which have rated item i user, k Bu,i Users who have voted for item i, exceptuser user, item Ou Items that the user has voted for and on which predictions exist user, k O Users from whom a MAE can be obtained k Cu Items that the user has not voted for and on which predictions exist user, k Du Items that the user has not voted for user, k C Pairs (user, item) that have not been voted for and accept predictions k D Pairs (user, item) that have not been voted for Ex,y Items that have recently been voted for by both user x and user y b, user1, user2 Su User’s recent votes user, b Table 4 Running example: RS database. ru,i I1 I2 I3 I4 I5 I6 I7 I8 I9 I10 I11 I12 I13 I14 U1 5 3 4 1 4 2 4 U2 1 241 4 1 U3 5 2 4 354 4 U4 4 3 5 4 U5 3345 5 Table 6 Running example: 2 and 3 neighbors of each user. Ku U1 U2 U3 U4 U5 K =2 {U3,U4} {U5,U4} {U1,U4} {U1,U3} {U3,U2} K =3 {U3,U4,U5} {U5,U4,U1} {U1,U4,U2} {U1,U3,U5} {U3,U2,U4} Table 5 Running example: users similarities. MSD U1 U2 U3 U4 U5 U1 0 6.5 0.25 0.33 2 U2 6.5 0 6.66 5 1 U3 0.25 6.66 0 0.5 0.75 U4 0.33 5 0.5 0 1 U5 2 1 0.75 1 0 14612 J. Bobadilla et al. / Expert Systems with Applications 38 (2011) 14609–14623
J Bobadilla et al /Expert Systems with Applications 38(2011)14609-14623 14613 2.5. Obtaining the mean absolute error-accuracy Table 8 Mean absolute errors of each user(mu)and of the system(m)using K=3 In order to measure the accuracy of the results of an RS, it is usual to use the calculation of some of the (3.5+1+3+0.5)/4=2 metrics, amongst which the mean absolute (167+134+0+0+0)/5=0.6 related metrics, mean squared error, root and normalized mean absolute error stand out (0.76+2+06+0.58+0.75)5=0.938 Let Ou={∈|u≠·Nu≠· Fig. 2 shows the MAE results obtained on MovieLens 1M using fine the mae of a user u as various similarity measures and two aggregations approache commonly used in CF(Eqs.(16)and (17). The calculations have pu-r叫却O≠必 been made in the range K=2 to K= 1500, by averaging their results: as we can see, the lowest error values are obtained using (27) Pearson Correlation(PC), particularly when Deviation From Mean (DFM)is used as the aggregation approach. These results lead us 2. The MAE of the RS can be obtained as the average of the user's to use PC-DFM as the reference combination which acts as a way Leto={u∈Um≠· (28) although it still needs to be tested with standardization methods. analysis of its coverage quality of recommendations, etc. We define the systems MAE as When selecting a similarity measure we must take into account that the averaged results may lead to a false idea of the integrity of m=0∑m台0≠必 (29) the real results, as can be seen in Fig. 3 where we can notice that, although PC-DFM presents a lower global MAE, when we use val- (30) ues of K-neighbors under 350 (which is quite common). CPC-WS offers better error measures this situation must be considered in The accuracy is defined as the inverse of the error (1/ m), the accuracy analysis obtained in the RS but more specifically it can be established as: accuracy 1-momm, accuracy E[0, 1. 2.6. Standardization process 2.5.2. Running example 2.6.1. Introduction Cable 8 shows the mean absolute errors of each user(mu)and of When using CF, at times it maybe a good idea to carry out a data the system(m)using K=3 standardization process. The z-scores, or normal scores distribute a group of data in line with a normal distribution Often, the systems MAE is implemented in such a way that < x 2.5.3. Case of study when there are no neighbors capable of making a prediction on where x is a raw score to be standardized, u is the mean of the pop- n item, the average for that item of all the training users(except ulation and o is the standard deviation of the population. z is neg- the active user) is used as the prediction. this behavior is reflected ative when the raw score is below the mean and positive when Eqs. (19)-(23) as opposed to Eqs. (14)-(18) which are used above. when there is at least one neighbor capable of making a prediction Although the most obvious application is the standardization of n the item considered Fig. 1 shows the result obtained using both the users' votes (of the input values), it is also possible to apply this approaches applied to Pearson Correlation and making use of the process to improve the predictions: the similarity values sim(u, n) average aggregation approaches(15),(20). Database: MovieLens obtained by applying the selected similarity measure are used to weight the importance of the votes of each K-neighbors(Eqs In graphs 1a(computed using Eq (15))and Ic(computed using (16 -(18). In some cases, most of the neighbors show very high Eq(20)), a horizontal line appears at 0. 797 which indicates the similarity values, and therefore, the weighting process loses effec- value of the mae obtained using K= all the training users. Fig. 1c tiveness; in these cases it is effective to make use of z-scores to bet ows values that tend towards this limit when low values of k ter differentiate the contribution that each neighbor will have in are selected, due to the fact that the lower the value of K the fewer the prediction results. the neighbors available in order to rate the items that the active user has voted for and therefore the greater probability of having 2.6. 2 Case of study to make use of the votes of all the training users of the rs in order g. 4 shows the result of applying z-scores to the input data or to make a prediction; in this case, when the mae increases, the to the similarity values sim(u, n). Except in the case of cosine, which prediction capacity(coverage)decreases drasticallygraph 1b). is greatly improved by applying z-scores to the input data, no Predictions that each user can receive using 3-neighbors 2:22 4.33 4.33 4.33
2.5. Obtaining the mean absolute error-accuracy 2.5.1. Formalization In order to measure the accuracy of the results of an RS, it is usual to use the calculation of some of the most common error metrics, amongst which the mean absolute error (MAE) and its related metrics, mean squared error, root mean squared error, and normalized mean absolute error stand out. Let Ou ¼ fi 2 Ijpu;i – ^ru;i – g ð25Þ We define the MAE of a user u as: mu ¼ 1 #Ou X i2Ou jpu;i ru;ij () Ou – £ ð26Þ mu ¼() Ou ¼ £ ð27Þ The MAE of the RS can be obtained as the average of the user’s MAE: Let O ¼ fu 2 Ujmu – g ð28Þ We define the system’s MAE as: m ¼ 1 #O X u2O mu () O – £ ð29Þ m ¼() O ¼ £ ð30Þ The accuracy is defined as the inverse of the error (1/m), but more specifically it can be established as: accuracy ¼ 1 m maxmin ; accuracy 2 ½0; 1. 2.5.2. Running example Table 8 shows the mean absolute errors of each user (mu) and of the system (m) using K = 3. 2.5.3. Case of study Often, the system’s MAE is implemented in such a way that when there are no neighbors capable of making a prediction on an item, the average for that item of all the training users (except the active user) is used as the prediction. This behavior is reflected in Eqs. (19)–(23), as opposed to Eqs. (14)–(18) which are used when there is at least one neighbor capable of making a prediction on the item considered. Fig. 1 shows the result obtained using both approaches applied to Pearson Correlation and making use of the average aggregation approaches (15), (20). Database: MovieLens 1M. In graphs 1a (computed using Eq. (15)) and 1c (computed using Eq. (20)), a horizontal line appears at 0.797 which indicates the value of the MAE obtained using K = all the training users. Fig. 1c shows values that tend towards this limit when low values of K are selected, due to the fact that the lower the value of K the fewer the neighbors available in order to rate the items that the active user has voted for and therefore, the greater probability of having to make use of the votes of all the training users of the RS in order to make a prediction; in this case, when the MAE increases, the prediction capacity (coverage) decreases drastically (graph 1b). Fig. 2 shows the MAE results obtained on MovieLens 1M using various similarity measures and two aggregations approaches commonly used in CF (Eqs. (16) and (17)). The calculations have been made in the range K = 2 to K = 1500, by averaging their results; as we can see, the lowest error values are obtained using Pearson Correlation (PC), particularly when Deviation From Mean (DFM) is used as the aggregation approach. These results lead us to use PC-DFM as the reference combination which acts as a way of testing future metrics proposed by the scientific community, although it still needs to be tested with standardization methods, analysis of its coverage, quality of recommendations, etc. When selecting a similarity measure we must take into account that the averaged results may lead to a false idea of the integrity of the real results, as can be seen in Fig. 3 where we can notice that, although PC-DFM presents a lower global MAE, when we use values of K-neighbors under 350 (which is quite common), CPC-WS offers better error measures. This situation must be considered in the accuracy analysis obtained in the RS. 2.6. Standardization process 2.6.1. Introduction When using CF, at times it maybe a good idea to carry out a data standardization process. The z-scores, or normal scores distribute a group of data in line with a normal distribution. z ¼ x l r ð31Þ where x is a raw score to be standardized, l is the mean of the population and r is the standard deviation of the population. z is negative when the raw score is below the mean and positive when above. Although the most obvious application is the standardization of the users’ votes (of the input values), it is also possible to apply this process to improve the predictions: the similarity values sim(u,n) obtained by applying the selected similarity measure are used to weight the importance of the votes of each K-neighbors (Eqs. (16)–(18)). In some cases, most of the neighbors show very high similarity values, and therefore, the weighting process loses effectiveness; in these cases it is effective to make use of z-scores to better differentiate the contribution that each neighbor will have in the prediction results. 2.6.2. Case of study Fig. 4 shows the result of applying z-scores to the input data or to the similarity values sim(u,n). Except in the case of cosine, which is greatly improved by applying z-scores to the input data, no Table 7 Predictions that each user can receive using 3-neighbors. Pu,i I1 I2 I3 I4 I5 I6 I7 I8 I9 I10 I11 I12 I13 I14 U1 4.5 2 3.5 3 3 4.66 4.33 4.5 U2 4.5 3 4 2 3 4.5 4.33 2 4.5 U3 3.33 2.66 4 2.5 1 5 4 24 1 U4 5 2 3.5 4 2 3 4.5 4.33 2 4.33 U5 3.33 2 3 41 35 4 4 1 Table 8 Mean absolute errors of each user (mu) and of the system (m) using K = 3. mu U1 (0.5 + 0.5 + 2 + 0.33 + 0.5)/5 = 0.76 U2 (3.5 + 1 + 3 + 0.5)/4 = 2 U3 (1.67 + 1.34 + 0 + 0 + 0)/5 = 0.6 U4 (1 + 0.5 + 0.5 + 0.33)/4 = 0.58 U5 (0 + 1 + 1 + 1)/ 4 = 0.75 m (0.76 + 2 + 0.6 + 0.58 + 0.75)/5 = 0.938 J. Bobadilla et al. / Expert Systems with Applications 38 (2011) 14609–14623 14613
J. Bobadilla et aL/ Expert Systems with Applications 38(2011)14609-14623 08 correlation 162432486496 2346812162432464s12812263b45127610241538 correlation 0423468116343441281634512n415 k-neighbohoods Fig. 1. (a)Mae obtained by only using the votes of the K-neighbors of each active user; (b)coverage; (c) MaE obtained using the votes of all the training users when the k ghbors cannot make a prediction. using Pearson Correlation DFM the effects of using z-scores on ■DFM■Ws the similarity values(PC-DFM-z) begin to produce improvements after a certain value of k: in the of movielens IM from K=500 and with movieLens 100 K K=100. 2.7. obtaining the coverage 0.8 2. 7.1. Formalization The coverage could be defined as the capacity of predicting from a metric applied to a specific RS. In short, it calculates the percent age of situations in which at least one K-neighbor of each active user can rate an item that has not been rated yet by that active user. Cu={i∈Ia Let du={i∈lra CPC SPR COS Coverage of user u Ca=100×#Dn =·台→Du=0 E results obtained on movielens ed Pearson Correlation(CPC) Pearso (SPR), cosine( Cos)and Mean Squared Coverage of the system egation approaches: Weighted Sum (V etC={(uu∈U,i∈l,rui=·,Gui≠必 significant improvements can be seen in the other metrics, how- LetD={(u)∈U,i∈l ever, by studying the details of the impact of the standardization processes for different K-neighbors (Fig. 5), we can see that by 100X#D
significant improvements can be seen in the other metrics, however, by studying the details of the impact of the standardization processes for different K-neighbors (Fig. 5), we can see that by using Pearson Correlation DFM the effects of using z-scores on the similarity values (PC-DFM-Z) begin to produce improvements after a certain value of K: in the case of MovieLens 1M from K = 500 and with MovieLens 100 K from K = 100. 2.7. Obtaining the coverage 2.7.1. Formalization The coverage could be defined as the capacity of predicting from a metric applied to a specific RS. In short, it calculates the percentage of situations in which at least one K-neighbor of each active user can rate an item that has not been rated yet by that active user. Let Cu ¼ fi 2 Ijru;i ¼^ Gu;i – £g ð32Þ Let Du ¼ fi 2 Ijru;i ¼ g ð33Þ Coverage of user u: cu ¼ 100 #Cu #Du () Du – £; cu ¼() Du ¼ £ ð34Þ Coverage of the system: Let C ¼ fðu; iÞju 2 U; i 2 I;ru;i ¼ ; Gu;i – £g ð35Þ Let D ¼ fðu; iÞju 2 U; i 2 I;ru;i ¼ g ð36Þ c ¼ 100x #C #D ð37Þ Fig. 1. (a) MAE obtained by only using the votes of the K-neighbors of each active user; (b) coverage; (c) MAE obtained using the votes of all the training users when the Kneighbors cannot make a prediction. Fig. 2. MAE results obtained on MovieLens 1M using the similarity measures: Constrained Pearson Correlation (CPC), Pearson Correlation (PC), Spearman rank correlation (SPR), cosine (COS) and Mean Squared Differences (MSD), and making use of the aggregation approaches:Weighted Sum (WS) and Deviation From Mean (DFM). 14614 J. Bobadilla et al. / Expert Systems with Applications 38 (2011) 14609–14623
J Bobadilla et aL /Expert Systems with Applications 38(2011)14609-14623 14615 CPC-WS -PC-DFM Fig. 3. Breakdown of the Mae obtained on MovieLens 1M using the similarity measures: Constrained Pearson Correlation (CPC)combined with Weighted Sum(ws)and Pearson Correlation(PC) combined with Deviation From Mean(DFM). 2.7. 2 Running example If we want to impose a minimum recommendation value: OER, Table 9 shows the coverage measures using MSD and values we add pui>0 K=2 and K=3 2.8.2. Running example 2.7.3. Case of study By making use of Eqs. 38)and(39). as an example, we obtain By comparing Fig. 6 with Fig 4 we can see that there is a reverse the recommendations that can be made to user U3 with N=2 to trend between accuracy and coverage, to the extent that when N=5, using K=2. Table 10 shows these values. choosing a metric we must not take only one of these measures as a reference In Fig. 6 the similarity measure Mean Square Differ- 2.9. Quality of the recommendation: precision and recall ences(MsD) shows much better results than the other metrics however, as we have seen, it also has the worst accuracy. Along 29.1 Formalization lines, Pearson Correlation using z-scores provides us with First, we very low coverage values, in contrast to its good accuracy results Fig7 shows the breakdown of the coverage results using Pear- Xu CIAⅵ∈Xn,ru≠· son correlation with and without z-scores. as we can see. the use of We will use tu to represent the quality preci this standardization process is justified in order to improve the recommendations obtained by making test accuracy, due to its minimum impact on the coverage to the user u, taking a 0 relevancy threshold Simil resent the recall measure obtained by making the 2.8.1 Formalization Assuming that all users accept N test recommendation th. We define Xu as the set of recommendations to user u, and Z as set of n recommendations to user u Zuru≥卟 The following must be true: HCIAⅵi∈Xl,rui=·,pui≠·, #{i∈ Furui≥+#{i∈u≠·Au≥ zgXa,#u=N,Wx∈Zu,Wy∈Xu:Pax≥Pay 9 口DFM■ws■ Z-DFM ZWS口DFM2zwsZ INIAL PR Fig. 4. MAE results obtained on Moviel g the similarity measures: Constrained Pearson Correlation(CPC), Pearson Correlation(PC). Sp SPR), cosine(Cos) and Mean Squared Differences(MSD). making use of the aggregation approaches: Weighted Sum (ws)and Deviation From Mean (DFm)and using z-scores in the input data(Z-)or in similarity values during the prediction process(-2
2.7.2. Running example Table 9 shows the coverage measures using MSD and values K = 2 and K = 3. 2.7.3. Case of study By comparing Fig. 6 with Fig. 4 we can see that there is a reverse trend between accuracy and coverage, to the extent that when choosing a metric we must not take only one of these measures as a reference. In Fig. 6 the similarity measure Mean Square Differences (MSD) shows much better results than the other metrics, however, as we have seen, it also has the worst accuracy. Along the same lines, Pearson Correlation using z-scores provides us with very low coverage values, in contrast to its good accuracy results. Fig. 7 shows the breakdown of the coverage results using Pearson correlation with and without z-scores. As we can see, the use of this standardization process is justified in order to improve the accuracy, due to its minimum impact on the coverage. 2.8. Top N recommendations 2.8.1. Formalization We define Xu as the set of recommendations to user u, and Z u as the set of N recommendations to user u. The following must be true: Xu I ^ 8i 2 Xu; ru;i ¼ ; pu;i – ; ð38Þ Zu # Xu; #Zu ¼ N; 8x 2 Zu; 8y 2 Xu : pu;x P pu;y ð39Þ If we want to impose a minimum recommendation value: h 2 R, we add pu,i P h 2.8.2. Running example By making use of Eqs. (38) and (39), as an example, we obtain the recommendations that can be made to user U3 with N = 2 to N = 5, using K = 2. Table 10 shows these values. 2.9. Quality of the recommendation: precision and recall 2.9.1. Formalization First, we redefine the Eq. (38) Xu I ^ 8i 2 Xu; ru;i – ; pu;i – We will use tu to represent the quality precision measure for recommendations obtained by making N test recommendations to the user u, taking a h relevancy threshold. Similarly, xu will represent the recall measure obtained by making the same N recommendations to user u. Assuming that all users accept N test recommendations: tu ¼ #fi 2 Zujru;i P hg N ð40Þ xu ¼ #fi 2 Zujru;i P hg #fi 2 Zujru;i P hg þ #fi 2 Zc ujru;i – ^ru;i P hg ð41Þ Fig. 3. Breakdown of the MAE obtained on MovieLens 1M using the similarity measures: Constrained Pearson Correlation (CPC) combined with Weighted Sum (WS) and Pearson Correlation (PC) combined with Deviation From Mean (DFM). Fig. 4. MAE results obtained on MovieLens 1M using the similarity measures: Constrained Pearson Correlation (CPC), Pearson Correlation (PC), Spearman rank correlation (SPR), cosine (COS) and Mean Squared Differences (MSD), making use of the aggregation approaches: Weighted Sum (WS) and Deviation From Mean (DFM) and using z-scores in the input data (Z) or in similarity values during the prediction process (Z). J. Bobadilla et al. / Expert Systems with Applications 38 (2011) 14609–14623 14615
J. Bobadilla et aL/ Expert Systems with Applications 38(2011)14609-14623 Fig. 5. Breakdown of the MAE obtained on MovieLens IM using Deviation From Mean(DFM)Pearson Correlation(PC) using and ng z-scores in the similarity values Table 9 Z scores -- PC-DFM 100.0 Coverage measures using msd and values k= 2 and K K=2Cu K=3Cu [2,lg,lg]=3 #(l, 8. Ig. ho. 112)=5 #[lnl,l12]=3 #{l,ll,lz,l14=542,85%71.42 U410群[l2,l,llB,l12.l13}=6#{l2,ll,lg,l12.l13]=660%60 U59群[1,l,l,lsls,l14]=6#{l1,z,l4,ls,ls,l14}=6 日 No Z-Scores■z Fig. 7. Breakdown of the coverage obtained on Movielens IM using Pearson 81.0 Table 10 Sets of recommendations that user U3 could receive. K=2. s I6, I1) (s. 6. I12, 17] [Is, 6, I12,I7, 114) Square Differences(MsD )has been rejected due to its poor genera elation CPC SPR cos results As may be seen, there is a direct between the pre- cision and recall values obtained in each of the cases. it is also important to highlight the fact that, by using similarity measures, Fig.6. Coverage results obtained on Movielens IM using the similarity measures: high values of accuracy(low values of MAE)do not guarantee Constrained Pearson Correlation(CPC), Pearson Correlation(PC). Spearman rank the best values for recommendation quality(as is the case in our correlation(SPR) cosine(COS) and Mean Squared Differences(MSD) case study with CPC-WS). in the same way that bad accuracy val- ues can be combined with good results for recommendation qual ∑t (42) ity(see PC-Z-WS). We must take into account that mae provides us with a measure of the quality of the predictions, whilst precision and recall provide us with a measure of the quality of a small sub (43) group of the predictions: the n with the highest rating and which are over a certain threshold 2.9.2. Running example 2. 10. Quality of novelty: novelty-precision and novelty-recall In this example we will give parameters N and 0 values 4 and 4. Table 11 shows the recommendations Zu made to each U,(bott 2.101 Formalization ow of each user) and the votes issued (top row of each user)and We will define the novelty group Y as the group of items which Table 12 shows the precision and recall values obtained using N=4 have been voted at the most by users. Y={1∈#{u∈Urui≠≤ 9.3. Case of study We to Fig. 8 shows the average results for precision and recall novelty recommendations to user obtained using the most common similarity measures. Mean and requ a novelty measure Similarly. Iu will represent the
t ¼ 1 #U X u2U tu ð42Þ x ¼ 1 #U X u2U xu ð43Þ 2.9.2. Running example In this example we will give parameters N and h values 4 and 4. Table 11 shows the recommendations Zu made to each Ui (bottom row of each user) and the votes issued (top row of each user) and Table 12 shows the precision and recall values obtained using N = 4 and h = 4. 2.9.3. Case of study Fig. 8 shows the average results for precision and recall obtained using the most common similarity measures. Mean Square Differences (MSD) has been rejected due to its poor general results. As may be seen, there is a direct relation between the precision and recall values obtained in each of the cases. It is also important to highlight the fact that, by using similarity measures, high values of accuracy (low values of MAE) do not guarantee the best values for recommendation quality (as is the case in our case study with CPC-WS), in the same way that bad accuracy values can be combined with good results for recommendation quality (see PC-Z-WS). We must take into account that MAE provides us with a measure of the quality of the predictions, whilst precision and recall provide us with a measure of the quality of a small subgroup of the predictions: the N with the highest rating and which are over a certain threshold. 2.10. Quality of novelty: novelty-precision and novelty-recall 2.10.1. Formalization We will define the novelty group Y as the group of items which have been voted at the most by c users. Y ¼ fi 2 Ij#fu 2 Ujru;i – g 6 cg ð44Þ We will use nu to represent the quality precision measure for novelty obtained by making N test recommendations to user u and requiring a novelty measure c. Similarly, lu will represent the Fig. 5. Breakdown of the MAE obtained on MovieLens 1M using Deviation From Mean (DFM) Pearson Correlation (PC) using and not using z-scores in the similarity values during the prediction process (Z). Table 9 Coverage measures using MSD and values K = 2 and K = 3. #Du K = 2#Cu K = 3#Cu K = 2cu K = 3cu U1 7 #{I2,I8,I9} = 3 #{I2,I8,I9} = 3 42, 85% 42, 85% U2 8 #{I7,I8,I9,I10} = 4 #{I7,I8,I9,I10,I12} = 5 50% 62.5% U3 7 #{I6,I7,I12} = 3 #{I5,I6,I7,I12,I14} = 5 42, 85% 71, 42% U4 10 #{I2,I6,I7,I8,I12,I13} = 6 #{I2,I6,I7,I8,I12,I13} = 6 60% 60% U5 9 #{I1,I2,I4,I5,I6,I14} = 6 #{I1,I2,I4,I5,I6,I14} = 6 66, 66% 66, 66% c 22/41% 25/41% Fig. 7. Breakdown of the coverage obtained on MovieLens 1M using Pearson Correlation (PC). Table 10 Sets of recommendations that user U3 could receive, K = 2. eZu N = 2 N = 3 N = 4 N = 5 U3 {I5,I6} {I5,I6,I12} {I5,I6,I12,I7} {I5,I6,I12,I7,I14} Fig. 6. Coverage results obtained on MovieLens 1M using the similarity measures: Constrained Pearson Correlation (CPC), Pearson Correlation (PC), Spearman rank correlation (SPR), cosine (COS) and Mean Squared Differences (MSD). 14616 J. Bobadilla et al. / Expert Systems with Applications 38 (2011) 14609–14623
J Bobadilla et al Expert Systems with Applications 38(2011)14609-14623 recall measure obtained by making the sameN recommendations Table 12 to user u Values of precision and recall using N=4. 0=4. Assuming that all the users accept N test recommendations h{i∈Zui∈Y} 33+1)1(1+1)4(4+1)3/3+0)3/3+0)x=0.81 #{i∈zu∈Y} #{i∈Zui∈Y}+#{i∈Zi∈Y} Number of items that both x and y have voted for (rx≠·Arya≠) in relation to the total number of items voted for by both ∑l In order to include time in our model, we extend formulas(4) and(5)to contain a time value in timestamp format. 2.10.2. Running example irstly, we find the number of votes received for each item. Ru={(i.U,D)i∈l,v∈v,t∈·l which we represent in the last row of Table 13: later we establish Tui =vALui=t a threshold of novelty(y=3). The set of items belonging to the We define Ex, y as the group of items that both x and y have voted novelty set is as follows: for most recently. Most recently means within a period of B days as Y={2.3.56,7,8.9,11,12,14 regards the current time(t) Table14 shows the recommendations made to each of the users Exy={i∈lx≠·Nry≠·^t ng N=4 and 0=4(Zu): the items belonging to y (in the first We define Su as the group of votes of user u which have beer ow)are framed. Table 15 shows the novelty-precision and nov- made in the time interval B as regards the current time Ity-recall results obtained by each of the users and the total nov- elty-precision and novelty-recall obtained in the example. S={(,v,D)∈l,v∈V,t∈·.t-tu≤B If the votes' time information is not available, Eq (51)can be 2.10.3. Case of study mplified in the following way Firstly, in order to be able to adjust parameter y to a suitable value in the rs used, it is valuable to know the distribution of Exy=Axy={i∈|xi≠·^yi≠·h the votes regarding the items. As an example, Fig 9 shows this data From the group of items defined in(51), or failing that, in(53). obtained in MovieLens 100 K. Thus, we can determine, for instance, we use the similarity measure which each user will intuitively use that 600 items of the rs have been voted for by 13 or less users. Figs.10 and 11, respectively, show the novelty-precision and to compare their votes with those of each of their neighbors: the novelty-recall results obtained using MovieLens 100 K with values Mean Absolute Difference(MaD) of 7: 13, 17, 21 and 25. a general increase in the precision may be MAD(x, y B)=MAD,X, B) ted as we take higher values of 7, due to the gradual increase that this implies in the number of relevant recommended elements. E∑-列E≠中 (54) 2.11. Quality of trust: trust-precision and trust-recall As a list of common votes among users, as regards the total, we use Jaccard 2.11.1. Formalization As follows from actual results obtained in an experiment carried Jaccard(x, y, B)=Jaccard(y, x, B) out on a group of users of the filmaffinity. com website, the trust of s, OS ser x towards another user y could be based on the following 3 Willing to obtain similar importance to metrics(54)and(55). aspects we place the MAD results on the scale [0. 1. where 1 represents the greatest possible similitude and 0 the least possible. We com- votes bine both metrics by multiplying them, so that when either of Greater importance to the last items voted. them is low the total similitude is highly affected. 11 nt recommended: values with diagonal lines, relevant not recommended: values with horizontal lines. 33 234 3.33 555445 444454 3
recall measure obtained by making the sameN recommendations to user u. Assuming that all the users accept N test recommendations: nu ¼ #fi 2 Zuji 2 Yg N ð45Þ lu ¼ #fi 2 Zuji 2 Yg #fi 2 Zuji 2 Yg þ #fi 2 Zc uji 2 Yg ð46Þ n ¼ 1 #U X u2U nu ð47Þ l ¼ 1 #U X u2U lu ð48Þ 2.10.2. Running example Firstly, we find the number of votes received for each item, which we represent in the last row of Table 13; later we establish a threshold of novelty (c = 3). The set of items belonging to the novelty set is as follows: Y ¼ f2; 3; 5; 6; 7; 8; 9; 11; 12; 14g Table 14 shows the recommendations made to each of the users using N = 4 and h = 4(Zu); the items belonging to Y (in the first row) are framed. Table 15 shows the novelty-precision and novelty-recall results obtained by each of the users and the total novelty-precision and novelty-recall obtained in the example. 2.10.3. Case of study Firstly, in order to be able to adjust parameter c to a suitable value in the RS used, it is valuable to know the distribution of the votes regarding the items. As an example, Fig. 9 shows this data obtained in MovieLens 100 K. Thus, we can determine, for instance, that 600 items of the RS have been voted for by 13 or less users. Figs. 10 and 11, respectively, show the novelty-precision and novelty-recall results obtained using MovieLens 100 K with values of c:13, 17, 21 and 25. A general increase in the precision may be noted as we take higher values of c, due to the gradual increase that this implies in the number of relevant recommended elements. 2.11. Quality of trust: trust-precision and trust-recall 2.11.1. Formalization As follows from actual results obtained in an experiment carried out on a group of users of the filmaffinity.com website, the trust of user x towards another user y could be based on the following 3 aspects: Similarity in the votes. Greater importance to the last items voted. Number of items that both x and y have voted for (rx,i – ^ ry,i – ) in relation to the total number of items voted for by both. In order to include time in our model, we extend formulas (4) and (5) to contain a time value in timestamp format. Ru ¼ fði; v;tÞji 2 I; v 2 V;t 2 g ð49Þ ru;i ¼ v ^ tu;i ¼ t ð50Þ We define Ex,y as the group of items that both x and y have voted for most recently. Most recently means within a period of b days as regards the current time (tc) Ex;y ¼ fi 2 Ijrx;i – ^ry;i – ^tc tx;i 6 b ^ tc ty;i 6 bg ð51Þ We define Su as the group of votes of user u which have been made in the time interval b as regards the current time. Su ¼ fði; v;tÞji 2 I; v 2 V;t 2 ;tc tu;i 6 bg ð52Þ If the votes’ time information is not available, Eq. (51) can be simplified in the following way: Ex;y ¼ Ax;y ¼ fi 2 Ijrx;i – ^ry;i – g ð53Þ From the group of items defined in (51), or failing that, in (53), we use the similarity measure which each user will intuitively use to compare their votes with those of each of their neighbors: the Mean Absolute Difference (MAD): MADðx; y; bÞ ¼ MADðy; x; bÞ ¼ 1 #Ex;y X i2Ex;y jrx;i ry;ij () Ex;y – / ð54Þ As a list of common votes among users, as regards the total, we use Jaccard: Jaccardðx; y; bÞ ¼ Jaccardðy; x; bÞ ¼ Sx \ Sy Sx [ Sy ð55Þ Willing to obtain similar importance to metrics (54) and (55), we place the MAD results on the scale [0, 1], where 1 represents the greatest possible similitude and 0 the least possible. We combine both metrics by multiplying them, so that when either of them is low the total similitude is highly affected. Table 11 Relevant recommended: values with diagonal lines; relevant not recommended: values with horizontal lines. I1 I2 I3 I4 I5 I6 I7 I8 I9 I10 I11 I12 I13 I14 U1 r1,i 5 3 4 1 4 2 4 eZ1 4.5 3.5 4.33 4.5 U2 r2,i 1 2 41 4 1 eZ2 4.5 3 4 4.5 U3 r3,i 5 2 4 35 4 4 eZ3 3.33 5 4 4 U4 r4,i 4 3 5 4 eZ4 5 3.5 4.5 4.33 U5 r5,i 334 5 5 eZ5 35 4 4 Table 12 Values of precision and recall using N = 4, h = 4. U1 U2 U3 U4 U5 Average tu 3/4 1/4 4/4 3/4 3/4 t = 0.70 xu 3/(3 + 1) 1/(1 + 1) 4/(4 + 1) 3/(3 + 0) 3/(3 + 0) x = 0.81 J. Bobadilla et al. / Expert Systems with Applications 38 (2011) 14609–14623 14617
J. Bobadilla et aL/ Expert Systems with Applications 38(2011)14609-14623 口DFM■ WS Z-DFN回 Z-WS O DFM-Z WS-Z CPC PR DS 0.0 CPC Os on and Recall results obtained on MovieLens 1M using the similarity measures: Constrained Pearson Correlation(CPC). Pearson Correlation(PC), Spearman rank correla PR)and cosine(COS), making use of the aggregation approaches: Weighted Sum(ws)and Deviation From Mean(DFM) using and not using z-scores in the similarity values during the prediction process (-z) Using the ratings times information Table 13 Number of users who have voted for each of the items trust(x, y, B)=trust(, x B) Pui I1 I2 I3 l4 Is I6 1, IgIg I0 I11 I1 113 114 (57) Without the ratings times information: Recommended novelty: values with diagonal lines to the right: not recommended trust(x, y)=trust(y, x)=RgnR 减况 novelty: values with horizontal lines. 台→Axy≠φ (58) U14.5· 3.5 4.33 4.5 tu(x,y)=tst(y,x)=·→Axy=中 (59) 45433 Eqs. (10)and (11) can be adapted as: We define Tu as the set of test K-neighbors of user u taking into account factor B of proximity in the issuing of votes to the current time The following must be true: TCU∧并Tu= aUnt U U4 Average Wx∈Tu,wy∈U-Tu),tust(u,x,B)≥tust(u,y,B) 0+10)1/(1+9)11+9)11+9)2/(2+8)l=0.10 The set of test K-neighbors of a given user(Tu)can be compared with the group of K-neighbors obtained using the similarity metric
Using the ratings times information: trustðx; y; bÞ ¼ trustðy; x; bÞ ¼ Sx \ Sy Sx [ Sy 1 1 #Ex;y P i2Ex;y jrx;i ry;ij max min 2 6 4 3 7 5 () Ex;y–/ ð56Þ trustðx; y; bÞ ¼ trustðy; x; bÞ¼() Ex;y ¼ / ð57Þ Without the ratings times information: trustðx; yÞ ¼ trustðy; xÞ ¼ Rx \ Ry Rx [ Ry 1 1 #Ax;y P i2Ax;y jrx;i ry;ij max min 2 6 4 3 7 5 () Ax;y – / ð58Þ trustðx; yÞ ¼ trustðy; xÞ¼() Ax;y ¼ / ð59Þ Eqs. (10) and (11) can be adapted as: We define Tu as the set of test K-neighbors of user u taking into account factor b of proximity in the issuing of votes to the current time. The following must be true: Tu U ^ #Tu ¼ k ^ u R Tu ð60Þ 8x 2 Tu; 8y 2 ðU TuÞ; trustðu; x; bÞ P trustðu; y; bÞ ð61Þ The set of test K-neighbors of a given user (Tu) can be compared with the group of K-neighbors obtained using the similarity metric Fig. 8. Precision and Recall results obtained on MovieLens 1M using the similarity measures: Constrained Pearson Correlation (CPC), Pearson Correlation (PC), Spearman rank correlation (SPR) and cosine (COS), making use of the aggregation approaches: Weighted Sum (WS) and Deviation From Mean (DFM), using and not using z-scores in the similarity values during the prediction process (Z). Table 13 Number of users who have voted for each of the items. Pu,i I1 I2 I3 I4 I5 I6 I7 I8 I9 I10 I11 I12 I13 I14 U1 5 3 4 1 4 2 4 U2 1 241 4 1 U3 5 2 4 354 4 U4 4 3 5 4 U5 3345 5 # 4104122234 0 1 4 1 Table 14 Recommended novelty: values with diagonal lines to the right; not recommended novelty: values with horizontal lines. Zu I1 I2 I3 I4 I5 I6 I7 I8 I9 I10 I11 I12 I13 I14 U1 4.5 3.5 4.33 4.5 U2 4.5 3 4 4.5 U3 3.33 5 4 4 U4 5 3.5 4.5 4.33 U5 35 4 4 Table 15 Values of novelty-precision and novelty-recall using N = 4, h = 4, c = 3. U1 U2 U3 U4 U5 Average nu 0/4 1/4 1/4 1/4 2/4 n = 0.25 lu 0/(0 + 10) 1/(1 + 9) 1/(1 + 9) 1/(1 + 9) 2/(2 + 8) l = 0.10 14618 J. Bobadilla et al. / Expert Systems with Applications 38 (2011) 14609–14623