3. EXPERIMENTS

3.1 Dataset
We evaluate our approach on a corpus of originally 142 million bookmarks from the delicious bookmarking service. These bookmarks were collected between September 19, 2007 and January 22, 2008. This is the same corpus as described in [16]. A previous analysis unveiled that the original corpus was highly polluted by spam [16]. In order to get meaningful results, we limit the impact of spam users on the initial corpus, as their anomalous behavior would strongly interfere with our analysis. To identify spam users we employ a common spam usage pattern. As was shown in [16], many spam users try to heighten the visibility of their web domains and consequently post a very high number of URLs to very few domains. To reduce the spam ratio within the data, we excluded the top 10 percent of users with the highest URLs-per-domain rate from our analysis. The filtered data set consists of 109 million bookmarks.
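The spam heuristic above can be made concrete with a short sketch. This is only an illustration, not the authors' code: it assumes the raw data is available as (user, URL) records, and the function name and the urlparse-based domain extraction are our own choices.

from collections import defaultdict
from urllib.parse import urlparse

def filter_spam_users(bookmarks, cut_fraction=0.10):
    # Drop the cut_fraction of users with the highest URLs-per-domain rate.
    urls = defaultdict(set)     # user -> distinct bookmarked URLs
    domains = defaultdict(set)  # user -> distinct domains of those URLs
    for user, url in bookmarks:
        urls[user].add(url)
        domains[user].add(urlparse(url).netloc)

    # Spam users post many URLs under very few domains, so their
    # URLs-per-domain rate is unusually high.
    rate = {u: len(urls[u]) / max(len(domains[u]), 1) for u in urls}
    ranked = sorted(rate, key=rate.get, reverse=True)
    spam = set(ranked[: int(len(ranked) * cut_fraction)])
    return [(user, url) for user, url in bookmarks if user not in spam]

Domain normalization details (e.g. stripping "www." prefixes) are left out for brevity and would slightly change which users fall above the cutoff.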
3.2 Experimental setup
Our experiments are performed on a 6-month section, July–December 2007, of the spam-filtered corpus. We remove items, users and tags occurring less than 10 times within these 6 months, thus generating the p-core 10 of the initial tripartite graph. We then split the remaining data into 6 monthly snapshots, each containing approximately 1.6 million bookmarks corresponding to more than 5.6 million tag assignments. For each month, the numbers of elements in each dimension, I, T, U, roughly amount to 200,000, 95,000 and 200,000 respectively. The corresponding co-occurrence matrices IU and IT are very sparse; only around 0.004 and 0.012 percent of their entries are non-zero. All results presented in this paper are averaged over all 6 months.

For each month we randomly select 80% of all bookmarks for training and the remaining bookmarks are saved for testing. This split is done on a per-user basis. The bookmarks from the training period are then used to create the co-occurrence matrices IU and IT on which the recommenders are trained. After training we select a random set of 1000 users with at least 10 test items each. For every user we recommend all items sorted by P(i_m|u_l), where items bookmarked by the user during the training period or before the evaluated month are weighted with P(i_m|u_l) = 0.
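The p-core filtering and the per-user 80/20 split described in this section can be outlined as follows. This is a hedged sketch rather than the original pipeline: it assumes each post is a (user, item, tags) tuple, and the iteration details, function names and random seed are ours.

import random
from collections import Counter

def p_core(posts, p=10):
    # Iteratively drop users, items and tags occurring fewer than p times,
    # until every remaining element occurs at least p times (the p-core).
    posts = list(posts)
    while True:
        users = Counter(u for u, i, ts in posts)
        items = Counter(i for u, i, ts in posts)
        tags = Counter(t for u, i, ts in posts for t in ts)
        kept = []
        for u, i, ts in posts:
            ts = tuple(t for t in ts if tags[t] >= p)
            if users[u] >= p and items[i] >= p and ts:
                kept.append((u, i, ts))
        if len(kept) == len(posts):
            return kept
        posts = kept

def split_per_user(posts, train_ratio=0.8, seed=0):
    # Put 80% of each user's bookmarks into training, the rest into test.
    rng = random.Random(seed)
    by_user = {}
    for post in posts:
        by_user.setdefault(post[0], []).append(post)
    train, test = [], []
    for user_posts in by_user.values():
        rng.shuffle(user_posts)
        cut = int(len(user_posts) * train_ratio)
        train.extend(user_posts[:cut])
        test.extend(user_posts[cut:])
    return train, test

Splitting each user's bookmarks separately, as in split_per_user, is what the per-user basis mentioned above refers to.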
The quality of the recommended item list is evaluated using performance measures commonly found in the relevant literature [6], such as the area under curve (AUC) value of receiver operating characteristic (ROC) curves or the precision measure. Results are averaged over all test users.

Multiple variables have to be taken into consideration when evaluating recommender systems. Among these is the question whether items that do not appear in the training data should be included in the evaluation. As we are only interested in the relative improvement of our approach, we remove all previously unseen items. For the same reasons, we also exclude items which appear in the training but not in the test data.

All obtained results are compared with the performance of a baseline recommender (most-popular) that weights items by how often they were bookmarked during the training period. These item weights are global and do not take user preferences into account. However, as for the PLSA recommender, we set the weight of previously bookmarked items to 0. Most-popular recommenders have become a standard feature of Web 2.0 resource sharing communities.

Figure 1: Magnified ROC curves for the item recommendation task on the delicious dataset. The number of latent topics (k) is set to 80 for the annotation-based PLSA recommender (α = 0.0) and to 5 for the collaborative version (α = 1.0). The MP line represents the performance of a most-popular baseline classifier.

4. RESULTS
Figure 1 presents a section of the ROC curves for the collaborative filtering (α = 1.0) and the annotation-based (α = 0.0) PLSA recommenders, with the number of latent topics k set to 5 and 80 respectively. All values are averaged over the 6 evaluation months. The figure shows a significant boost in recommendation quality when using an annotation-based PLSA recommender (α = 0), reaching AUC values of 0.9022 compared to 0.8425 for the most-popular recommender. For the collaborative method (α = 1) with an optimal k set to 5, we obtain an AUC result only slightly above the baseline performance (0.8467). However, the collaborative recommender performs better for small numbers of recommended items.

Table 1: Area under curve (AUC) for different parameter settings. Bold entries indicate the best AUC value for a given number of latent topics k.

α/k     1        5        10       20       40       80
0.0     0.8402   0.8736   0.8877   0.8936   0.9004   0.9022
0.2     0.8416   0.8491   0.8936   0.8975   0.9009   0.9023
0.4     0.8430   0.8419   0.8944   0.8986   0.8954   0.8935
0.6     0.8437   0.8423   0.8720   0.8916   0.8848   0.8722
0.8     0.8438   0.8418   0.8727   0.8678   0.8461   0.8178
1.0     0.8435   0.8467   0.8348   0.8110   0.7766   0.7466

Table 1 compares the resulting AUC values for the PLSA recommender and different choices of α and k. Once again
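For completeness, the evaluation protocol described above can also be sketched in a few lines: previously bookmarked items receive score 0, the remaining items are ranked by the recommender's score, and the AUC is computed against the held-out test items, with a most-popular baseline that scores items by their training-period bookmark counts. This is an illustrative outline only; the helper names are ours and the recommender scores are treated as a given input rather than the actual PLSA model.

from collections import Counter

def most_popular_scores(train_posts):
    # Global item weights: how often each item was bookmarked in the
    # training period (posts are (user, item, tags) tuples as above).
    return Counter(item for _, item, _ in train_posts)

def auc_for_user(scores, seen_items, test_items, all_items):
    # AUC of a ranking in which items already bookmarked by the user get score 0.
    ranked = {i: (0.0 if i in seen_items else float(scores.get(i, 0.0)))
              for i in all_items}
    positives = [ranked[i] for i in test_items if i in ranked]
    negatives = [ranked[i] for i in all_items
                 if i not in test_items and i not in seen_items]
    if not positives or not negatives:
        return None
    # Probability that a random test item outranks a random other item
    # (ties count one half), which equals the area under the ROC curve.
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

Averaging auc_for_user over the sampled test users, once with the model's P(i_m|u_l) scores and once with most_popular_scores, yields the kind of relative comparison reported in Table 1.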