for our CF algorithms. This procedure generates a set of 258,701 user-item pairs, with 2,539 unique users, and 87,908 unique items. When generating predictions, we withhold 10 randomly selected items from each user, and generate predictions by using the remaining data as training material. Like retrieval algorithms, recommender algorithms tend to be controlled by several parameters. One way of optimizing these parameters is by maximizing performance on a given data set. Such tuning, however, tends to overestimate the expected performance of the system, so in order to prevent this kind of overfitting we used 10-fold cross-validation [9].

We first divided our data set into a training and a test set by randomly selecting 10% of the users to be in our test set. Final performance was evaluated on this 10% by withholding 10 items from each user, and using the remaining profile items together with the training set to generate the recommendations for those 10%.
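To make the splitting procedure concrete, the following is a minimal Python sketch of the protocol just described, assuming the user profiles are held in memory as a dictionary mapping user IDs to sets of item IDs. The function name, the data representation, and the fixed random seed are illustrative assumptions, not the code actually used in our experiments.

```python
import random

def split_users(profiles, test_fraction=0.1, n_withheld=10, seed=42):
    """Split users into training and test sets and withhold items per test user.

    `profiles` is assumed to be a dict mapping user IDs to sets of item IDs
    (an illustrative representation, not the format used in our experiments).
    """
    rng = random.Random(seed)
    users = list(profiles)
    rng.shuffle(users)

    # Randomly assign 10% of the users to the test set.
    n_test = int(len(users) * test_fraction)
    test_users, train_users = users[:n_test], users[n_test:]

    # Training data: the full profiles of the remaining 90% of the users.
    train_data = {u: set(profiles[u]) for u in train_users}

    # For each test user, withhold 10 randomly selected items as the relevant
    # items to be predicted; the rest of the profile stays available as input.
    withheld, test_input = {}, {}
    for u in test_users:
        items = list(profiles[u])
        rng.shuffle(items)
        withheld[u] = set(items[:n_withheld])
        test_input[u] = set(items[n_withheld:])

    return train_data, test_input, withheld
```

The same per-user withholding of 10 items is applied to the validation-fold users during the cross-validation described next.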
In order to properly optimize parameters we divided our training set (containing 90% of the users) by randomly dividing the users over 10 folds, each containing 10% of the training users. Each fold is used as a validation set once, with 10 items being withheld for each validation fold user. The remaining profile items and the data in the other 9 folds are then used to generate the predictions. The final values for our evaluation metrics on the withheld items were then averaged over the 10 folds.

4.2 Evaluation

In our evaluation, we adopt an IR perspective by treating each of the users as a separate query or topic. The 10 withheld items for each user make up the relevant items for which we have relevance judgments. For each user, a ranked list of items is produced and evaluated on whether these withheld items show up in the result list. While it is certainly possible and very likely that the recommendation lists contain other recommendations that the user would find relevant or interesting, we cannot know this without the user judging them. This means that because our relevance judgments correspond to items added to a user's profile, we can never have any items judged as being not relevant without user intervention.

Herlocker et al. [5] assess the usefulness of different metrics for each of the six recommendation tasks they identified. For our "Find Good Items" task, they find that metrics taking into account the ranking of the items are most appropriate. We therefore evaluated our recommender system using Mean Average Precision (MAP), Mean Reciprocal Rank (MRR), and Precision @ 10 (P@10). In addition to these IR metrics, we also measured coverage: certain recommendation algorithms need enough data on a user or item to be able to reliably generate recommendations. Not all algorithms will be able to cover every user or item. We therefore measure user coverage (UCOV), which is the percentage of all users for which we were able to generate any predictions at all.

4.3 Algorithms

In the preliminary experiments described in this paper we compare three different CF algorithms. The term collaborative filtering was first used by Goldberg et al. [3] and describes a class of algorithms that, instead of looking at the content, use data about all users' preferences for items to generate recommendations for the so-called 'active' user. We use and briefly describe the two simplest variants: user-based filtering and item-based filtering (see [14] for an overview of more sophisticated CF algorithms). In user-based filtering, the active user is matched against the database to find the neighboring users that the active user has a history of agreeing with. Once this neighborhood has been identified, all objects in the neighbors' profiles unknown to the active user are considered as possible recommendations and sorted by their frequency in that neighborhood. A weighted aggregate of these frequencies is used to generate the recommendations [4]. Item-based filtering turns this around by matching items against the database to find the neighborhood of similar items [13].

We test three different algorithms implemented in the freely available Suggest recommendation engine [7]. The first model is an item-based filtering approach that uses the cosine similarity metric to determine the similarity between items. Model 2 is an item-based algorithm that calculates similarity based on conditional probability. Finally, model 3 is a user-based algorithm that uses the cosine similarity metric to determine similarity between users.
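As an illustration of the user-based variant described above (roughly the computation behind model 3), the sketch below scores candidate items for an active user with cosine similarity over binary profiles and a similarity-weighted count over the k nearest neighbors. This is a minimal sketch under those assumptions, not the Suggest implementation, and all names are illustrative.

```python
import math
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity between two binary profiles represented as sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def user_based_recommend(active_profile, train_data, k=5, n=10):
    """Recommend items for an active user via user-based CF.

    `train_data` maps user IDs to sets of item IDs (illustrative format).
    Returns the top-n items the active user does not yet have, ranked by a
    similarity-weighted aggregate of their frequency in the k-neighborhood.
    """
    # Find the k training users most similar to the active user.
    neighbors = sorted(
        ((cosine(active_profile, items), u) for u, items in train_data.items()),
        reverse=True,
    )[:k]

    # Aggregate items from the neighborhood, weighting each occurrence by
    # the neighbor's similarity to the active user.
    scores = defaultdict(float)
    for sim, u in neighbors:
        if sim <= 0.0:
            continue
        for item in train_data[u] - active_profile:
            scores[item] += sim

    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n]
```

An item-based variant would instead precompute item-item similarities (using cosine or conditional probability over the columns of the user-item matrix, as in models 1 and 2) and score candidate items by their similarity to the items already in the active user's profile.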
Item-based filtering tends to perform well when there are more users than items in the database, whereas user-based filtering works better in the reverse situation. We therefore expect the user-based filtering algorithm to outperform the other two algorithms.

5. RESULTS

The CF algorithms we employ in our study have one important parameter that can be tuned: the neighborhood size k, i.e. the number of similar users (or items) used to generate the predictions. We tune this parameter for each of the models, using our 10-fold cross-validation setup described in Section 4.1, varying k between 1 and 500 and evaluating performance at each step. Figure 1 displays the effect of neighborhood size on the MAP scores of the three algorithms for k up to 140; MAP values stay the same for k > 140. The other metrics show a similar trend over the different k values.

[Figure 1: The effect of neighborhood size k on MAP for the three algorithms: model 1 (item-based w/ cosine), model 2 (item-based w/ cond. prob.), and model 3 (user-based w/ cosine). MAP is plotted on the y-axis against k on the x-axis.]

The two item-based algorithms, models 1 and 2, both have an optimal k of 500. Performance increases for models 1 and 2 as k increases, but the recommender implementation ran out of memory when we tried increasing k beyond 500. The optimal value of k for model 3 lies between 4 and 8, with none of the differences in this range being statistically significant. We therefore picked a k of 5 as the optimal value for model 3. However, for all values of k the user-based filtering algorithm significantly outperforms the item-based filtering algorithms (p < 10^-10). Final performance of the three models on the test set is displayed in Table 1. Model 3 outperforms the other models significantly on all measures (p < 10^-6) and shows acceptable performance for a preliminary approach.

6. TEMPORAL ANALYSIS

CiteULike's daily database dumps offer us the unique opportunity of analyzing the influence of growth of a social bookmarking