…consistency in this group is 92.3% and the average is 18.5%. The average consistency of the most prolific 70 indexers (those who have indexed at least five documents) is in the same range, namely 18.4%. The consistency of traditional approaches to free indexing is reported to be between 4% and 67%, with an average of 27%, depending on what aids are used (Leininger, 2000).

It is instructive to consider the group of best taggers. We define these as the ones who (a) exhibit greater than average consistency with all others, and (b) are sufficiently prolific, i.e. have tagged at least five documents. There are 36 such taggers; Table 1 lists their consistency within this group. The average consistency they achieve as a group is 37.7%, which is similar to the average consistency of professionals (Leininger, 2000).

The above consistency analysis provides insight into the tagging quality of the best CiteULike users, based on HighWire and Nature articles. For the purposes of this paper, it shows how the tagging community can be restricted to a best-performing group of taggers by measuring their consistency. This is helpful for testing the performance of automatic tagging (Section 4.4).

3 Automatic tagging with Maui

Maui is a general algorithm for automatic topical indexing based on the Kea system (Frank et al., 1999).¹ It works in two stages: candidate selection and machine-learning-based filtering. In this paper, we apply it to automatic tagging. In the candidate selection stage, Maui first determines textual sequences defined by orthographic boundaries and splits these sequences into tokens. Then all n-grams up to a maximum length of 3 words that do not begin or end with a stopword are extracted as candidate tags. To reduce the number of candidates, all those that appear only once are discarded. This speeds up the training and the extraction process without impacting the results.
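As a concrete illustration, here is a minimal sketch of the candidate selection stage in Python. The regular-expression tokenizer and the tiny stopword list are simplifying assumptions of ours; Maui's actual boundary detection and stopword handling may differ.

```python
import re
from collections import Counter

# Illustrative stopword list; Maui's actual list is larger.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "or", "to", "is"}

def candidate_tags(text, max_len=3):
    """Extract n-grams of up to max_len tokens that neither begin nor
    end with a stopword, discarding candidates that occur only once."""
    counts = Counter()
    # Approximate orthographic boundaries: punctuation ends a sequence,
    # so n-grams never span sentence or clause breaks.
    for sequence in re.split(r"[^\w\s'-]+", text.lower()):
        tokens = sequence.split()
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                    continue
                counts[" ".join(gram)] += 1
    # Candidates appearing only once are discarded, which speeds up
    # training and extraction.
    return {phrase: f for phrase, f in counts.items() if f > 1}
```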
In the filtering stage, several features are computed for each candidate, which are then input to a machine learning model to obtain the probability that the candidate is indeed a tag. Maui's architecture resembles that of many other supervised keyphrase extraction systems (Turney, 2000; Hulth, 2004; Medelyan et al., 2008). However, this architecture has not previously been applied to the task of automatic tagging.

¹ Maui is open-source and available for download at http://maui-indexer.googlecode.com
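The filtering stage described above can be pictured as follows. Maui itself is a Java system, so the scikit-learn Naive Bayes model below is only a stand-in for its learner (Kea originally used Naive Bayes as well), and all feature values are invented for illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB  # stand-in for Maui's Java-based learner

# One row per candidate: [tf_idf, first_occurrence, keyphraseness,
# phrase_length, node_degree]; y records whether human taggers
# actually assigned the candidate as a tag. Values are invented.
X_train = np.array([
    [0.12, 0.05, 4.0, 2, 5],   # early in the document, often a tag elsewhere
    [0.01, 0.47, 0.0, 1, 0],   # mid-document, never seen as a tag
    [0.08, 0.95, 2.0, 3, 3],   # appears in the conclusion
])
y_train = np.array([1, 0, 1])

model = GaussianNB().fit(X_train, y_train)

# At tagging time, candidates are ranked by their probability of being a tag.
p_tag = model.predict_proba(np.array([[0.09, 0.10, 1.0, 2, 2]]))[0, 1]
```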
3.1 Features indicating significance

We now describe the features used in the classification model to determine whether a phrase is likely to be a tag. We begin with three baseline features used in Kea (Frank et al., 1999), and extend the set with three features that have been found useful in previous work. We also add three new features that have not been evaluated before: spread, semantic relatedness and inverse Wikipedia linkage. All Wikipedia-based features are computed using the WikipediaMiner toolkit.²

1. TF×IDF combines the frequency of a phrase in a particular document with its inverse occurrence frequency in general use (Salton and McGill, 1983). This score is high for rare phrases that appear frequently in a document and are therefore more likely to be significant. (Features 1-4 are sketched in code after this list.)

2. Position of the first occurrence is computed as the relative distance of the first occurrence of the candidate tag from the beginning of the document. Candidates with very high or very low values are likely to be tags, because they appear either in the opening parts of the document, such as the title, abstract, table of contents, and introduction, or in its final sections, such as the conclusion and reference list.

3. Keyphraseness quantifies how often a candidate phrase appears as a tag in the training corpus. Automatic tagging approaches utilize the same information: Mishne (2006) and Sood et al. (2006) automatically suggest tags previously assigned to similar documents. However, in Maui (as in Kea) this feature is just one component of the overall model. Thus, if a candidate never appears as a keyphrase in the training corpus, it can still be extracted if its other feature values are significant enough.

4. Phrase length is measured in words. Generally speaking, the longer the phrase, the more specific it is. Training captures and quantifies this specificity preference in a given training corpus.

5. Node degree quantifies the semantic relatedness of a candidate tag to other candidates. Turney (2003) computes semantic relatedness using search engine statistics. Instead, following Medelyan et al. (2008), we utilize Wikipedia hyperlinks for this task. We first map each candidate phrase to its most common Wikipedia page. For example, the word Jaguar appears as a link anchor in Wikipedia 927 times. In 466 cases it links to the article Jaguar cars, so the commonness of this mapping is 0.5. In 203 cases it links to the article about the animal, a commonness of 0.22. (See the commonness sketch after this list.)

² http://wikipedia-miner.sourceforge.net/
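To make features 1-4 concrete, the sketch below shows one plausible way to compute each of them. The function names and the exact TF×IDF normalization are our own assumptions; Kea's and Maui's implementations may differ in detail.

```python
import math

def tf_idf(count_in_doc, doc_length, docs_with_phrase, num_docs):
    """Feature 1: frequency in this document weighted by rarity in
    general use (Salton and McGill, 1983)."""
    tf = count_in_doc / doc_length
    idf = math.log((num_docs + 1) / (docs_with_phrase + 1))
    return tf * idf

def first_occurrence(tokens, phrase):
    """Feature 2: relative distance of the first occurrence from the
    start of the document (0 = first word, close to 1 = last word)."""
    words = phrase.split()
    for i in range(len(tokens) - len(words) + 1):
        if tokens[i:i + len(words)] == words:
            return i / len(tokens)
    raise ValueError("phrase not found in document")

def keyphraseness(phrase, training_tag_counts):
    """Feature 3: how often the phrase was assigned as a tag in the
    training corpus (0 for unseen phrases, which remain extractable)."""
    return training_tag_counts.get(phrase, 0)

def phrase_length(phrase):
    """Feature 4: length in words; longer phrases are more specific."""
    return len(phrase.split())
```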
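Finally, a sketch of the commonness statistic behind feature 5, using the Jaguar counts quoted above. In Maui these counts come from the WikipediaMiner toolkit; the helper below merely reproduces the arithmetic.

```python
def commonness(links_to_page, anchor_occurrences):
    """Fraction of an anchor's Wikipedia link occurrences that point to a
    given page; higher values mean a more common sense of the phrase."""
    return links_to_page / anchor_occurrences

# "Jaguar" appears as a link anchor 927 times in Wikipedia:
print(round(commonness(466, 927), 2))  # 0.5  -> article "Jaguar cars"
print(round(commonness(203, 927), 2))  # 0.22 -> article about the animal
```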