tency in this group is 92.3% and the average is 18.5%. The average consistency of the most prolific 70 indexers (those who have indexed at least five documents) is in the same range, namely 18.4%. The consistency of traditional approaches to free indexing is reported to be between 4% and 67%, with an average of 27%, depending on what aids are used (Leininger, 2000). It is instructive to consider the group of best taggers. We define these as the ones who (a) exhibit greater than average consistency with all others, and (b) are sufficiently prolific, i.e. have tagged at least five documents. There are 36 such taggers; Table 1 lists their consistency within this group. The average consistency they achieve as a group is 37.7%, which is similar to the average consistency of professionals (Leininger, 2000). The above consistency analysis provides insight into the tagging quality of the best CiteULike users, based on HighWire and Nature articles.
For the purposes of this paper, it shows how the tagging community can be restricted to a best-performing group of taggers by measuring their consistency. This is helpful for testing the performance of automatic tagging (Section 4.4).

3 Automatic tagging with Maui

Maui is a general algorithm for automatic topical indexing based on the Kea system (Frank et al., 1999).¹ It works in two stages: candidate selection and machine-learning-based filtering. In this paper, we apply it to automatic tagging. In the candidate selection stage, Maui first determines textual sequences defined by orthographic boundaries and splits these sequences into tokens. Then all n-grams up to a maximum length of 3 words that do not begin or end with a stopword are extracted as candidate tags. To reduce the number of candidates, all those that appear only once are discarded. This speeds up the training and the extraction process without impacting the results. In the filtering stage, several features are computed for each candidate, which are then input to a machine learning model to obtain the probability that the candidate is indeed a tag. Maui's architecture resembles that of many other supervised keyphrase extraction systems (Turney, 2000; Hulth, 2004; Medelyan et al., 2008). However, this architecture has not previously been applied to the task of automatic tagging.

¹ Maui is open-source and available for download at http://maui-indexer.googlecode.com

3.1 Features indicating significance

We now describe the features used in the classification model to determine whether a phrase is likely to be a tag. We begin with three baseline features used in Kea (Frank et al., 1999), and extend the set with three features that have been found useful in previous work. We also add three new features that have not been evaluated before: spread, semantic relatedness and inverse Wikipedia linkage. All Wikipedia-based features are computed using the WikipediaMiner toolkit.²
1. TF×IDF combines the frequency of a phrase in a particular document with its inverse occurrence frequency in general use (Salton and McGill, 1983). This score is high for rare phrases that appear frequently in a document and therefore are more likely to be significant.

2. Position of the first occurrence is computed as the relative distance of the first occurrence of the candidate tag from the beginning of the document. Candidates with very high or very low values are likely to be tags, because they appear either in the opening parts of the document, such as the title, abstract, table of contents and introduction, or in its final sections, such as the conclusion and reference list.

3. Keyphraseness quantifies how often a candidate phrase appears as a tag in the training corpus. Automatic tagging approaches utilize the same information: Mishne (2006) and Sood et al. (2006) automatically suggest tags previously assigned to similar documents. However, in Maui (as in Kea) this feature is just one component of the overall model. Thus, if a candidate never appears as a keyphrase in the training corpus, it can still be extracted if its other feature values are significant enough.

4. Phrase length is measured in words. Generally speaking, the longer the phrase, the more specific it is. Training captures and quantifies the specificity preference in a given training corpus.

5. Node degree quantifies the semantic relatedness of a candidate tag to other candidates. Turney (2003) computes semantic relatedness using search engine statistics. Instead, following Medelyan et al. (2008), we utilize Wikipedia hyperlinks for this task. We first map each candidate phrase to its most common Wikipedia page. For example, the word Jaguar appears as a link anchor in Wikipedia 927 times. In 466 cases it links to the article Jaguar cars, thus the commonness of this mapping is 0.5. In 203 cases it links to the animal description, a commonness of

² http://wikipedia-miner.sourceforge.net/
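The candidate selection stage described in Section 3 can be sketched as follows. This is a simplified illustration only, not Maui's actual code: the stopword list is a placeholder subset, and treating punctuation as the orthographic boundary between textual sequences is our assumption.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}  # illustrative subset

def candidate_tags(text, max_len=3):
    """Extract n-grams (up to max_len words) that do not begin or end
    with a stopword and occur more than once in the document."""
    # Split into textual sequences at orthographic boundaries
    # (here: any punctuation), then tokenize each sequence.
    sequences = [s for s in re.split(r"[^\w\s]", text.lower()) if s.strip()]
    counts = Counter()
    for seq in sequences:
        tokens = seq.split()
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                    continue
                counts[" ".join(gram)] += 1
    # Discard candidates that appear only once.
    return {c: f for c, f in counts.items() if f > 1}
```

Note that n-grams never cross a sequence boundary, and a stopword may still occur inside a longer candidate (e.g. "quality of service") as long as it is not at either end.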
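Several of the features above reduce to simple ratios. The sketch below gives one common formulation of TF×IDF, the first-occurrence position, and the Wikipedia commonness score; these are textbook formulations under our own function names, not necessarily the exact ones used in Maui.

```python
import math

def tf_idf(phrase_count, doc_length, docs_with_phrase, num_docs):
    """TF×IDF: frequency within this document times inverse document
    frequency over a reference corpus (one common formulation)."""
    tf = phrase_count / doc_length
    idf = -math.log2(docs_with_phrase / num_docs)
    return tf * idf

def first_occurrence(token_offset, doc_length_tokens):
    """Relative distance of the first occurrence from the start of the
    document; values near 0 or 1 suggest the title/abstract or the
    conclusion/reference sections."""
    return token_offset / doc_length_tokens

def commonness(link_count, total_anchor_count):
    """Fraction of a phrase's Wikipedia anchor occurrences that point
    to a particular article, e.g. 'Jaguar' links to the car article
    in 466 of its 927 anchor occurrences, giving roughly 0.5."""
    return link_count / total_anchor_count
```

For instance, `commonness(466, 927)` reproduces the Jaguar example: about half of all anchors with that text point to the car article.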