0.22. We compute the node degree of the corresponding Wikipedia article as the number of hyperlinks that connect it to other Wikipedia pages that have been identified for other candidate tags from the same document. A document that describes a particular topic will cover many related concepts, so a high node degree, which indicates strong connectivity to other phrases in the same document, means that a candidate is more likely to be significant.

6. Wikipedia-based keyphraseness is the likelihood of a phrase being a link in the Wikipedia corpus. It divides the number of Wikipedia pages in which the phrase appears in the anchor text of a link by the total number of Wikipedia pages containing it. We multiply this number by the phrase's document frequency.

The new features proposed in this paper are the following:

7. Spread of a phrase is the distance between its first and last occurrences in a document. Both values are computed relative to the length of the document (see feature 2). High values help to determine phrases that are mentioned both at the beginning and at the end of a document.

8. Semantic relatedness of a phrase has already been captured as the node degree (see feature 5). However, recent research allows us to compute semantic relatedness with better techniques than mere hyperlink counts.
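As a minimal sketch of the spread computation (feature 7), assuming a whitespace-tokenized document; the function name and the tokenization scheme are illustrative, not Maui's actual (Java/WEKA-based) implementation:

```python
def spread(tokens, phrase):
    """Spread (feature 7): distance between the first and last
    occurrence of a candidate phrase, relative to document length.
    `tokens` is the document as a list of tokens; `phrase` is a
    candidate n-gram given as a tuple of tokens."""
    n = len(phrase)
    positions = [i for i in range(len(tokens) - n + 1)
                 if tuple(tokens[i:i + n]) == phrase]
    if not positions:
        return 0.0
    return (positions[-1] - positions[0]) / len(tokens)

doc = "semantic web ontologies help the semantic web grow".split()
print(spread(doc, ("semantic", "web")))  # (5 - 0) / 8 = 0.625
print(spread(doc, ("ontologies",)))      # single occurrence -> 0.0
```

A phrase that occurs only once gets spread 0, while a phrase mentioned both early and late approaches 1, matching the intuition stated above.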
Milne and Witten (2008) propose an efficient Wikipedia-based approach that is nearly as accurate as human subjects at quantifying the relationship between two given concepts. Given a set of candidate phrases, we determine the most likely Wikipedia articles that describe them (as explained in feature 5), and then compute the total relatedness of a given phrase to all other candidates. The higher the value, the more likely the phrase is to be a tag.

9. Inverse Wikipedia linkage is another feature that utilizes Wikipedia as a source of language-usage statistics. Here, again given the most likely Wikipedia article for a given phrase, we count the number of other Wikipedia articles that link to it and normalize this value as in inverse document frequency:

    -log2(linksTo(A_P) / N)

where linksTo(A_P) is the number of incoming links to the article A_P representing the candidate phrase P, and N is the total number of links in our Wikipedia snapshot (52M). This feature highlights phrases that refer to concepts commonly used to describe other concepts.

3.2 Machine learning in Maui

In order to build the model, we use the subset of the CiteULike collection described in Section 3.1. For each document we know a set of tags that at least two users have agreed on. This is used as ground truth for building the model. For each training document, candidate phrases (i.e. n-grams) are identified and their feature values are calculated as described above.

Each candidate is then marked as a positive or negative example, depending on whether users have assigned it as a tag to the corresponding document. The machine-learning model is constructed automatically from these labeled training examples using the WEKA machine-learning workbench. Kea (Frank et al., 1999) uses the Naïve Bayes classifier, which implicitly assumes that the features are independent of each other given the classification.
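The independence assumption can be made concrete with a toy Naïve Bayes calculation: under it, the class-conditional likelihoods of the individual features simply multiply. The discretized feature values and probabilities below are invented for illustration and are not taken from the paper:

```python
def nb_probability(feature_values, likelihoods, prior_tag):
    """P(tag | features) via Bayes' rule under the Naive Bayes
    independence assumption. `likelihoods[f][v]` maps a feature's
    discretized value v to the pair (P(v | tag), P(v | not tag))."""
    p_tag, p_not = prior_tag, 1.0 - prior_tag
    for f, v in feature_values.items():
        l_tag, l_not = likelihoods[f][v]
        p_tag *= l_tag   # joint likelihood factorizes over features
        p_not *= l_not
    return p_tag / (p_tag + p_not)

# Hypothetical conditional probabilities for two of Maui's features.
likelihoods = {
    "keyphraseness": {"high": (0.7, 0.2), "low": (0.3, 0.8)},
    "spread":        {"high": (0.6, 0.3), "low": (0.4, 0.7)},
}
p = nb_probability({"keyphraseness": "high", "spread": "high"},
                   likelihoods, prior_tag=0.1)
print(p)  # 0.042 / (0.042 + 0.054) = 0.4375
```

Because the model treats each feature as an independent vote, correlated features (such as first occurrence and spread) effectively get counted twice, which motivates the bagged decision trees considered next.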
However, Kea uses only two or three features, whereas Maui combines nine features amongst which there are many obvious relationships, e.g. first occurrence and spread, or node degree and semantic relatedness. Consequently, we also consider bagged decision trees, which can model attribute interactions and do not require parameter tuning to yield good results. Bagging learns an ensemble of classifiers and uses them in combination, thereby often achieving significantly better results than the individual classifiers (Breiman, 1996). Different trees are generated by sampling from the original dataset with replacement. Like Naïve Bayes, bagged trees yield probability estimates that can be used to rank candidates.

To select tags from a new document, Maui determines candidate phrases and their feature values, and then applies the classifier built during training. This classifier determines the probability that a candidate is a tag based on relative frequencies observed from the training data.

4 Evaluation

Here we describe the data used in the experiments and the results obtained, addressing the following questions:

1. How does a state-of-the-art keyphrase extraction method perform on collaboratively tagged data, compared to a baseline automatic tagging method?
2. What is the performance of Maui with old and new features?
3. How consistent are Maui's tags compared to those assigned by human taggers?
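The bagging procedure described in Section 3.2 can be sketched with one-level decision stumps standing in for full decision trees (a deliberate simplification; Maui itself uses WEKA's bagged-tree implementation). The candidate feature vectors and labels below are invented:

```python
import random

def train_stump(X, y):
    """Fit a one-level decision stump: pick the (feature, threshold)
    pair whose rule 'predict tag if x[feature] >= threshold' has the
    best accuracy on the given sample."""
    best, best_acc = None, -1.0
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            preds = [1 if row[f] >= t else 0 for row in X]
            acc = sum(p == lbl for p, lbl in zip(preds, y)) / len(y)
            if acc > best_acc:
                best_acc, best = acc, (f, t)
    return best

def bagged_stumps(X, y, n_estimators=25, seed=0):
    """Bagging (Breiman, 1996): train each classifier on a bootstrap
    sample drawn from the training data with replacement."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_estimators):
        idx = [rng.randrange(len(X)) for _ in X]
        stumps.append(train_stump([X[i] for i in idx],
                                  [y[i] for i in idx]))
    return stumps

def tag_probability(stumps, x):
    """Probability estimate used for ranking: fraction of ensemble
    members that vote 'tag' for this candidate."""
    return sum(1 for f, t in stumps if x[f] >= t) / len(stumps)

# Toy candidates: [keyphraseness, spread, node_degree]; label 1 means
# at least two users assigned the phrase as a tag.
X = [[0.9, 0.8, 5], [0.1, 0.0, 0], [0.8, 0.6, 4], [0.2, 0.1, 1]]
y = [1, 0, 1, 0]
model = bagged_stumps(X, y)
print(tag_probability(model, [0.9, 0.8, 5]))  # high, close to 1.0
print(tag_probability(model, [0.1, 0.0, 0]))  # low, close to 0.0
```

Averaging over bootstrap replicates is what lets the ensemble exploit interacting features without the per-feature independence assumption that Naïve Bayes makes.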