…each other. The baseline in row 1 (left) shows Kea's performance, using TF×IDF, first occurrence, keyphraseness and Naïve Bayes to combine them (same as row 4 in Table 2). Using decision trees with these three features does not improve the performance (row 1, right). The following row combines the three original features with length, node degree and Wikipedia-based keyphraseness. In contrast to previous research (Medelyan et al., 2008), in this setting we do not observe an improvement with either Naïve Bayes or bagged decision trees. In row 3 we combine the three original features with the three new ones introduced in this work. While Naïve Bayes' values are lower than the baseline, with bagged decision trees Maui's F-Measure improves from 41.2 to 44.9%. The best results are obtained by combining all nine features, again using bagged decision trees, giving in row 4 (right) a notably improved F-Measure of 47.1%. The recall of 48.6% shows that we match nearly half of all tags on which at least two human taggers have agreed.

                        Naïve Bayes            Bagged decision trees
                        P     R     F          P     R     F
1  Features 1–3         41.1  43.1  42.1       40.3  42.2  41.2
2  Features 1–6         38.9  41.1  40.0       40.3  42.6  41.4
3  Features 1–3, 7–9    39.3  41.1  40.2       43.7  46.2  44.9
4  Features 1–9         37.6  39.6  38.6       45.7  48.6  47.1

Table 4. Combining all features in Maui
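Maui itself is built in Java on top of the Weka toolkit, so the following is only an illustrative Python/scikit-learn sketch of the comparison behind Table 4. The matrix X and labels y are placeholders (random data standing in for the nine feature values per candidate tag and the gold-standard labels), and the real evaluation additionally ranks candidates per document before measuring precision, recall and F-Measure.

```python
# Hypothetical sketch (not Maui's actual Weka-based code): comparing the
# two classifiers from Table 4 for combining candidate-tag features.
# X and y are placeholders for the real feature matrix and gold labels.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_recall_fscore_support

rng = np.random.default_rng(0)
X = rng.random((1000, 9))      # one row per candidate tag, 9 feature columns
y = rng.integers(0, 2, 1000)   # 1 = candidate matches a human-assigned tag

models = {
    "Naive Bayes": GaussianNB(),
    "Bagged decision trees": BaggingClassifier(
        estimator=DecisionTreeClassifier(), n_estimators=10),
}
for name, model in models.items():
    pred = cross_val_predict(model, X, y, cv=10)   # 10-fold cross-validation
    p, r, f, _ = precision_recall_fscore_support(
        y, pred, average="binary", zero_division=0)
    print(f"{name}: P={p:.3f} R={r:.3f} F={f:.3f}")
```

Bagging trains each tree on a bootstrap sample and averages their votes, which reduces the variance of a single decision tree; this is presumably why it copes better with the extra, partly redundant features where Naïve Bayes does not.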
Given this best combination of features, we eliminate each feature one by one, starting from the individually weakest feature, in order to determine the contribution of each feature to this overall result. Table 5 compares the values; only bagged decision trees are used this time. The 'Difference' column quantifies the difference between the best F-Measure achieved with all 9 features and the F-Measure obtained when the feature examined in that row is excluded. Interestingly, one of the strongest features, TF×IDF, is the one that contributes the least when all features are combined, while the strongest feature, keyphraseness, contributes, as expected, the most, adding 16.9 points. The second most important feature is Wikipedia keyphraseness, contributing 4 percentage points to the overall result.

Since some of the features in the best performing combination rely on Wikipedia as a knowledge source, it is interesting to determine Wikipedia's exact contribution. The last row of Table 5 combines the following features: TF×IDF, first occurrence, keyphraseness, length and spread. The F-Measure is 5.4 points lower than that of Maui with all 9 features combined. Therefore, the contribution of Wikipedia-based features is significant.

Features                    F-Measure   Difference
All 9 features              47.1        –
– Length                    45.0        2.1
– 1st occurrence            45.6        1.5
– Inverse Wikip. linkage    45.1        2.0
– Semantic relatedness      45.4        1.7
– Node degree               46.0        1.1
– Spread                    46.4        0.7
– TF×IDF                    46.8        0.3
– Wikip. keyphraseness      43.1        4.0
– Keyphraseness             30.2        16.9
Non-Wikip. features         41.7        5.4

Table 5. Evaluation using feature elimination
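The analysis in Table 5 is, in effect, a leave-one-out ablation against the all-features baseline. Below is a hedged sketch of that loop, reusing the same kind of placeholder data as above; the feature order 1–9 is the one implied by the row descriptions of Table 4 and the names in Table 5, not something stated explicitly in the excerpt.

```python
# Sketch of the Table 5 ablation: drop one feature column at a time and
# report how far the F-Measure falls below the 9-feature baseline.
# Feature names/order are inferred from Tables 4 and 5 (an assumption).
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import f1_score

FEATURES = ["TF×IDF", "1st occurrence", "keyphraseness",          # 1-3 (Kea)
            "length", "node degree", "Wikip keyphraseness",       # 4-6
            "inverse Wikip linkage", "semantic relatedness",      # 7-8
            "spread"]                                             # 9

def ablate(X, y):
    model = BaggingClassifier(n_estimators=10)  # decision trees by default
    baseline = f1_score(y, cross_val_predict(model, X, y, cv=10))
    print(f"All 9 features: F = {baseline:.3f}")
    for i, name in enumerate(FEATURES):
        X_minus = np.delete(X, i, axis=1)       # exclude feature i
        f = f1_score(y, cross_val_predict(model, X_minus, y, cv=10))
        print(f"- {name}: F = {f:.3f} (difference {baseline - f:.3f})")

rng = np.random.default_rng(0)
ablate(rng.random((1000, 9)), rng.integers(0, 2, 1000))  # placeholder data
```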
4.4 Maui's consistency with human taggers

In Section 2.3 we discussed the indexing consistency of CiteULike users on our data. There are a total of 332 taggers, and their consistency with each other is 18.5%. Now we use the results obtained with Maui during the cross-validation, when all 9 features and bagged decision trees are used (Table 4, row 4, right; see examples in Table 5), and compute how consistent Maui is with each human user, based on whatever documents this user has tagged. Then we average the results to obtain the overall consistency with all 332 users.
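The excerpt does not restate the consistency measure defined in Section 2.3; inter-indexer consistency in this line of work is usually Rolling's measure, 2·|A ∩ B| / (|A| + |B|) for two tag sets A and B, so the sketch below assumes it. The names maui_tags and user_tags are hypothetical mappings from document IDs to tag sets.

```python
# Assumed consistency measure (Rolling): 2*|A & B| / (|A| + |B|).
# maui_tags / user_tags are hypothetical {document_id: set_of_tags} maps.
from statistics import mean

def consistency(a: set[str], b: set[str]) -> float:
    """Consistency of two tag sets assigned to the same document."""
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0

def consistency_with_user(maui_tags: dict[str, set[str]],
                          user_tags: dict[str, set[str]]) -> float:
    """Average Maui-vs-user consistency over the documents the user tagged."""
    shared = maui_tags.keys() & user_tags.keys()
    return mean(consistency(maui_tags[d], user_tags[d]) for d in shared)

# Overall figure reported in the text: average the per-user scores, e.g.
# mean(consistency_with_user(maui_tags, tags) for tags in users.values())
```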
Maui's consistency with the 332 human taggers ranges from 0 to 80%, with an average of 23.8%. The only cases where very low consistency was achieved are those where the human has only assigned a few tags per document (one to three), or has some idiosyncratic tagging behavior (for example, one tagger adds the word key in front of most tags). Still, with an average of 23.8%, Maui's performance is over 5 points higher than that of an average CiteULike tagger (18.5%), and note that this group only includes taggers who have at least two co-taggers.

In Section 2.3 we were also able to determine a smaller group of users who perform best and are most prolific. This group consists of 36 taggers whose consistency exceeds the average of the original 332 users. These 36 taggers have tagged a total of 143 documents with an average consistency of 37.6%. Maui's consistency with…