Domain-Specific Keyphrase Extraction

Eibe Frank and Gordon W. Paynter and Ian H. Witten
Department of Computer Science, University of Waikato, Hamilton, New Zealand

Carl Gutwin
Department of Computer Science, University of Saskatchewan, Saskatoon, Canada

Craig G. Nevill-Manning
Department of Computer Science, Rutgers University, Piscataway, New Jersey, USA

Abstract

Keyphrases are an important means of document summarization, clustering, and topic search. Only a small minority of documents have author-assigned keyphrases, and manually assigning keyphrases to existing documents is very laborious. Therefore it is highly desirable to automate the keyphrase extraction process. This paper shows that a simple procedure for keyphrase extraction based on the naive Bayes learning scheme performs comparably to the state of the art. It goes on to explain how this procedure's performance can be boosted by automatically tailoring the extraction process to the particular document collection at hand. Results on a large collection of technical reports in computer science show that the quality of the extracted keyphrases improves significantly when domain-specific information is exploited.

1 Introduction

Keyphrases give a high-level description of a document's contents that is intended to make it easy for prospective readers to decide whether or not it is relevant for them. But they have other applications too. Because keyphrases summarize documents very concisely, they can be used as a low-cost measure of similarity between documents, making it possible to cluster documents into groups by measuring overlap between the keyphrases they are assigned. A related application is topic search: upon entering a keyphrase into a search engine, all documents with this particular keyphrase attached are returned to the user. In summary, keyphrases provide a powerful means for sifting through large numbers of documents by focusing on those that are likely to be relevant.

Unfortunately, only a small fraction of documents have keyphrases assigned to them, mostly because authors only provide keyphrases when they are explicitly instructed to do so, and manually attaching keyphrases to existing documents is a very laborious task. Therefore, ways of automating this process using artificial intelligence, more specifically machine learning techniques, are of interest. There are two different ways of approaching the problem: keyphrase assignment and keyphrase extraction.

In keyphrase assignment, also known as text categorization [Dumais et al., 1998], it is assumed that all potential keyphrases appear in a predefined controlled vocabulary: the categories. The learning problem is to find a mapping from documents to categories using a set of training documents, which can be accomplished by training a classifier for each category, using documents that belong to it as positive examples and the rest as negative ones. A new document is then processed by each of the classifiers and assigned to those categories whose classifiers identify it as a positive example.

The second approach, keyphrase extraction, which we pursue in this paper, does not restrict the set of possible keyphrases to a selected vocabulary. On the contrary, any phrase in a new document can be identified, that is extracted, as a keyphrase. Using a set of training documents, machine learning is used to determine which properties distinguish phrases that are keyphrases from ones that are not.
Turney [1999] describes a system for keyphrase extraction, GenEx, based on a set of parametrized heuristic rules that are fine-tuned using a genetic algorithm. The genetic algorithm optimizes the number of correctly identified keyphrases in the training documents by adjusting the rules' parameters. Turney compares GenEx to the straightforward application of a standard machine learning technique, bagged decision trees [Breiman, 1996], and concludes that it gives superior performance. He also shows that GenEx generalizes well across collections: when trained on a collection of journal articles it successfully extracts keyphrases from web pages on a different topic. This is an important feature because training GenEx on a new collection is computationally very expensive.

This paper briefly summarizes the Kea keyphrase extraction algorithm, and goes on to show that it generalizes as well as GenEx across collections.
In contrast to GenEx, however, it does not employ a special-purpose genetic algorithm for training and keyphrase extraction: it is based on the well-known naive Bayes machine learning technique. Training is therefore much quicker. The main finding of this paper is that performance can be boosted significantly if Kea is trained on documents that are from the same domain as those from which keyphrases are to be extracted. This allows us to capitalize on speedy training, because deriving domain-specific models would be less practical with the original lengthy genetic algorithm approach.

Section 2 summarizes the Kea algorithm for keyphrase extraction, and shows that it performs comparably to GenEx if used in the same domain-independent setting. Section 3 explains a simple enhancement that enables Kea to exploit collection-specific information about keyphrases, and shows how this addition boosts performance on a large collection of computer science technical reports. The main findings of this paper are summarized in Section 4.

2 Keyphrase Extraction using Naive Bayes

Keyphrase extraction is a classification task: each phrase in a document is either a keyphrase or not, and the problem is to correctly classify a phrase into one of these two categories. Machine learning provides off-the-shelf tools for this kind of situation. In machine learning terminology, the phrases in a document are "examples" and the learning problem is to find a mapping from the examples to the two classes "keyphrase" and "not-keyphrase". Machine learning techniques can automatically generate this mapping if they are provided with a set of training examples, that is, examples with class labels assigned to them. In our context, these are simply phrases which have been identified as either being keyphrases or not. Once the learning method has generated the mapping given the training data, it can be applied to unlabeled data; in other words, it can be used to extract keyphrases from new documents.

2.1 Generating Candidate Phrases

Not all phrases in a document are equally likely to be keyphrases a priori. In order to facilitate the learning process, most phrases that appear can be eliminated from the set of examples that are presented to the learning scheme. First, the input text is split up according to phrase boundaries (punctuation marks, dashes, brackets, and numbers). Non-alphanumeric characters (apart from internal periods) and all numbers are deleted. Kea takes all subsequences of these initial phrases up to length three as candidate phrases. It then eliminates those phrases that begin, or end, with a stopword. It also deletes phrases that consist merely of a proper noun. In the next step, all words are case-folded and stemmed using the iterated Lovins stemmer [Lovins, 1968], and stemmed phrases that occur only once in the document are removed.
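To make this concrete, the following minimal sketch (Python; not the authors' code) mirrors the candidate-generation steps. The stopword list is a small placeholder, the proper-noun filter is omitted, and a trivial stand-in replaces the iterated Lovins stemmer.

    import re
    from collections import Counter

    STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "for"}  # placeholder list

    def stem(word):
        # Stand-in for the iterated Lovins stemmer [Lovins, 1968] used by Kea.
        return word.lower()

    def candidate_phrases(text, max_len=3):
        # Split the input at phrase boundaries: punctuation, dashes,
        # brackets, and numbers (periods internal to a word survive).
        fragments = re.split(r"[^\w.\s]|\d+|_", text)
        candidates = []
        for fragment in fragments:
            words = fragment.split()
            # Take every subsequence of one to max_len words.
            for i in range(len(words)):
                for j in range(i + 1, min(i + max_len, len(words)) + 1):
                    phrase = words[i:j]
                    # Eliminate phrases that begin or end with a stopword.
                    if phrase[0].lower() in STOPWORDS or phrase[-1].lower() in STOPWORDS:
                        continue
                    # Case-fold and stem.
                    candidates.append(" ".join(stem(w) for w in phrase))
        # Remove stemmed phrases that occur only once in the document.
        counts = Counter(candidates)
        return [p for p in counts if counts[p] > 1]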
2.2 Building the Model

So far we have shown how candidate phrases are generated. However, in conventional machine learning terms, phrases by themselves are useless: it is their properties, or "attributes," that are important. Several plausible attributes immediately spring to mind: the number of words in a phrase, the number of characters, the position of the phrase in the document, etc. However, in our experiments, only two attributes turned out to be useful in discriminating between keyphrases and non-keyphrases: the TF×IDF score of a phrase, and the distance into the document of the phrase's first appearance. In the following we explain how these attributes are computed and how a naive Bayes model [Domingos and Pazzani, 1997] is built from them.

The TF×IDF score of a phrase is a standard metric in information retrieval. It is designed to measure how specific a phrase P is to a given document D:

    TF×IDF(P, D) = Pr[phrase in D is P] × (−log Pr[P in a document]).

The first probability in this equation is estimated by counting the number of times the phrase P occurs in the document D, and the second one by counting the number of documents in the training corpus that contain P (excluding D). (To avoid taking the logarithm of zero, the corresponding counters are initialized to one.)

The distance of a phrase from the beginning of a document is calculated as the number of words that precede its first appearance, divided by the number of words in the document. The resulting feature is a number between 0 and 1 that represents the proportion of the document preceding the phrase's first appearance.

Both these attributes are real numbers. The naive Bayes learning method can process numeric attributes by assuming, for example, that they are normally distributed. However, we obtained better results by discretizing the attributes prior to applying the learning scheme [Domingos and Pazzani, 1997]. This indicates that the normal distribution is not appropriate in this application. Discretization quantizes a numeric attribute into ranges so that the resulting new attribute can be treated as a nominal one: each value represents a range of values of the original numeric attribute. Kea uses Fayyad and Irani's [1993] discretization scheme, which is based on the Minimum Description Length principle. It recursively splits the attribute into intervals, at each stage minimizing the entropy of the class distribution. It stops splitting when the total cost for encoding both the discretization and the class distribution cannot be reduced further.
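Before discretization, the two raw attribute values could be computed along the following lines (a sketch, not Kea's implementation; doc_phrases, doc_freq, and num_docs are assumed to come from the candidate-generation step and the training corpus):

    import math

    def tf_idf(phrase, doc_phrases, doc_freq, num_docs):
        # Pr[phrase in D is P]: relative frequency of the phrase in the document.
        pr_phrase = doc_phrases.count(phrase) / len(doc_phrases)
        # Pr[P in a document]: fraction of training documents containing P.
        # Counts are initialized to one so the logarithm of zero is never taken.
        pr_doc = doc_freq.get(phrase, 1) / num_docs
        return pr_phrase * -math.log(pr_doc)

    def distance(phrase_words, doc_words):
        # Number of words preceding the first appearance, divided by the
        # document length: a value between 0 and 1 (both inputs stemmed).
        n = len(phrase_words)
        for i in range(len(doc_words) - n + 1):
            if doc_words[i:i + n] == phrase_words:
                return i / len(doc_words)
        return 1.0  # phrase not found (should not happen for candidates)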
The naive Bayes learning scheme is a simple application of Bayes' formula. It assumes that the attributes, in this case TF×IDF and distance, are independent given the class. Making this assumption, the probability that a phrase is a keyphrase given that it has discretized TF×IDF value T and discretized distance D is:

    Pr[key | T, D] = (Pr[T | key] × Pr[D | key] × Pr[key]) / Pr[T, D],

where Pr[T | key] is the probability that a keyphrase has TF×IDF score T, Pr[D | key] the probability that it has distance D, Pr[key] the a priori probability that a phrase is a keyphrase, and Pr[T, D] a normalization factor that makes Pr[key | T, D] lie between zero and one. All these probabilities can be estimated reliably by counting the number of times the corresponding event occurs in the training data. (The naive Bayes implementation used by Kea initializes all counts to one.)

It has been shown that naive Bayes can be a very accurate classification method even if the independence assumption is not correct [Domingos and Pazzani, 1997]. However, it can be argued that the two attributes we use, TF×IDF and distance, are close to being independent given the class. This implies that naive Bayes is close to being the optimum classification method for this application, and might be the reason why it performs better than all other learning methods that we have investigated. (In particular it performs better than bagged decision trees, as we show in Section 2.4.)

2.3 Extracting Keyphrases

Kea uses the procedure described above to generate a naive Bayes model from a set of training documents for which keyphrases are known (typically because the author provided them). The resulting model can then be applied to a new document from which keyphrases are to be extracted. First, Kea computes TF×IDF scores and distance values for all phrases in the new document using the procedure described above, taking the discretization obtained from the training documents. (Both attributes can be computed without knowing whether a phrase is a keyphrase or not.) The naive Bayes model is then applied to each phrase, computing the estimated probability of it being a keyphrase. The result is a list of phrases ranked according to their associated probabilities. Assuming that the user wants to extract r keyphrases, Kea then outputs the r highest ranked phrases.

There are two special cases that have to be addressed in order to achieve optimum performance. First, if two phrases have equal probability, which is quite likely to happen due to the discretization, they are ranked according to their TF×IDF score (in its pre-discretized form). Second, if a phrase is a subphrase of another phrase, it is only accepted as a keyphrase if it is ranked higher; otherwise it is deleted from the list before the r top-ranking phrases are output.
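The scoring and post-processing steps can be sketched as follows (all names are illustrative rather than Kea's actual interface; the probability tables in model hold the counts estimated from the training data, and a substring test approximates the subphrase relation):

    def pr_key(t, d, model):
        # Naive Bayes estimate of Pr[key | T, D] for discretized values t, d.
        p_yes = model["t_given_key"][t] * model["d_given_key"][d] * model["p_key"]
        p_no = model["t_given_not"][t] * model["d_given_not"][d] * (1 - model["p_key"])
        return p_yes / (p_yes + p_no)  # normalization by Pr[T, D]

    def extract_keyphrases(phrases, model, r):
        # phrases: list of dicts with keys "text", "t", "d", "raw_tfidf".
        # Rank by probability; break ties with the pre-discretized TFxIDF score.
        ranked = sorted(phrases,
                        key=lambda p: (pr_key(p["t"], p["d"], model), p["raw_tfidf"]),
                        reverse=True)
        output = []
        for p in ranked:
            # A subphrase of a higher-ranked phrase is deleted from the list.
            if any(p["text"] != q and p["text"] in q for q in output):
                continue
            output.append(p["text"])
            if len(output) == r:
                break
        return output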
2.4 Experimental Results

We have evaluated Kea on several different document collections with author-assigned keyphrases. Our criterion for success is the extent to which Kea produces the same stemmed phrases as authors do. (Author-assigned keyphrases are, of course, deleted from the documents before they are given to Kea.) Because this method of evaluation is the same as used by Turney [1999], we can directly compare Kea's performance to his results.

Comparison to GenEx

We compared Kea and GenEx using two experimental settings from Turney's paper. (We could not compare Kea on the other document collections used by Turney because we did not have access to his corpus of email messages, which contains confidential information.) The first one involves training and testing on journal articles. In this setting, 55 articles are used for training (6 from the Journal of the International Academy of Hospitality Research, 2 from The Neuroscientist, 14 from the Journal of Computer-Aided Molecular Design, and 33 from Behavioral and Brain Sciences), and 20 for testing (all from Psycoloquy). In the second setting, the same documents are used for training but 35 FIPS web pages are used for testing.

Table 1 shows the number of correctly identified author-provided keyphrases among the five and fifteen top-ranking phrases output by the extraction algorithms. Four extraction algorithms are represented: GenEx, fifty bagged C4.5 decision trees [Quinlan, 1992] as used by Turney, Kea, and Kea using fifty bagged C4.5 trees instead of the naive Bayes learning scheme. Results for the first two methods are from Turney's paper. (To get the number of correctly identified keyphrases, Turney's "precision" figures were multiplied by the cutoff employed, five or fifteen.) The third scheme is the standard Kea algorithm that we have described. In the fourth, bagged C4.5 trees were used instead of discretization and naive Bayes, with all the standard pre- and post-processing done by Kea. This variation of Kea is computationally much more expensive (by a factor of at least fifty).

Turney found bagged C4.5 trees to perform universally worse than GenEx, but in only one of the four experimental settings from Table 1, Journal/FIPS with a cutoff of five, was the difference statistically significant. Kea sometimes performs worse than GenEx and sometimes better; the differences are not statistically significant (at the 5% level, according to a t-test). Moreover, Kea-C4.5 performs much better than Turney's C4.5 in the case where the latter does significantly worse than GenEx. We conclude that GenEx and Kea perform at about the same level; Kea-C4.5 seems slightly worse, but the difference is not statistically significant on these datasets. The only statistically significant result is the poor performance that Turney observed in one case with C4.5.

The difference between Turney's findings for bagged C4.5 trees and ours deserves some explanation. Turney uses many more attributes, among them distance, but he does not use TF×IDF. Moreover, he performs no post-processing for C4.5 (although he does for GenEx),
whereas we remove subphrases if they do not perform better than their superphrases. These appear to be the main differences between his way of applying C4.5 and ours.

    Experimental conditions       Turney [1999]'s results         Kea results
    Training/testing    Cutoff    GenEx        C4.5         Kea          Kea-C4.5
    Journal/Journal        5      1.45±1.24    1.40±1.28    1.35±0.93    1.20±0.83
                          15      2.65±1.95    2.55±1.70    2.75±1.25    2.70±1.38
    Journal/FIPS           5      1.43±0.85    0.77±0.81    1.46±0.98    1.40±0.95
                          15      2.46±1.17    2.12±0.99    2.20±1.35    2.26±1.32

    Table 1: Experimental results for different extraction algorithms

Changing the Amount of Training Data

An interesting question is how Kea's performance scales with the amount of training data available. In order to investigate this, we performed experiments with a large collection of computer science technical reports (CSTR) from the New Zealand Digital Library (www.nzdl.org). The documents in CSTR are fairly noisy, partly because the source files have been extracted automatically from PostScript. Also, they contain on average fewer keyphrases than the other collections. This makes keyphrase extraction in this domain more difficult than in the other corpuses.

There are two potential ways in which the corpus of documents that is available can influence Kea's performance on fresh data. First, training documents are used when computing both the discretization of the attributes, and the corresponding counts for the naive Bayes model. It is essential that these documents have keyphrases assigned to them because the learning method needs labeled examples. Second, the document corpus supports the learning process when each phrase's "document frequency" is calculated; this is used for deriving its TF×IDF score. In this case the documents need not be labeled. Our experiments showed that no further performance improvement was gained by increasing the number of documents used to compute the document frequencies beyond 50.

To illustrate the effect of training set size, Figure 1 shows Kea's performance on an independent set of 500 test documents. It plots the number of "correct" keyphrases, for both five and fifteen phrases extracted, against the number of documents used for training, from 1 through 130 files. The error bars give 99% confidence intervals derived by training Kea on ten different training sets of the same size. We used the same independent 100 documents for calculating the document frequencies throughout this particular experiment.

[Figure 1: Performance on the CSTR corpus for different numbers of training files; number of "correct" keyphrases against number of training documents, for 5 and 15 phrases output (error bars show 99% confidence intervals).]

It can be seen from Figure 1 that if more than twenty documents are used for training, little is gained by increasing the number further. With 50 documents, there is no further performance improvement. These results show that Kea's performance is close to optimum if about 50 training documents are used; in other words, 50 labeled documents are sufficient to push performance to the limit. However, Section 3 demonstrates that this is not the case if domain-specific information is exploited in the learning and extraction process. In that case, much larger amounts of labeled training documents prove beneficial.

Subject Area of Training Documents

Now we investigate the extent to which models formed by Kea transfer from one subject domain to another. To this end we use the collection of journal articles described above, and two collections of web pages also used by Turney [1999], Aliweb and NASA, all of which have keyphrases assigned. The basic procedure was to train on one of the collections and test on another, producing nine combinations.
For each collection we chose 55 training documents at random and used the rest for testing: 20 for the journal articles, 35 for Aliweb, and 86 for NASA. The training documents were used to compute the document frequencies; thus the entire keyphrase assignment model was based on the training documents alone. For the journal articles, as well as the randomly chosen test set, we ran experiments with the same training/testing division that Turney [1999] used, the test set comprising 20 articles in the journal Psycoloquy.

Figure 2 shows the average number of correct keyphrases returned when five keyphrases are retrieved, for twelve cases. The first nine represent every combination of training and testing sets drawn from one of the three collections, and the last represents the Psycoloquy test set with the same three training sets (except
that all but Psycoloquy articles were used for the Journal training set, so it has a different composition from the Journal training set used for the other three cases). The bars give 95% confidence intervals for the mean performance over the test sets. The dark bars highlight the cases where the test and training documents are drawn from the same population.

[Figure 2: Effect of training documents being on subjects different from test documents; average number of correct keyphrases (of five retrieved) for each combination of training set (Aliweb, NASA, Journals) and test set (Aliweb, NASA, Journals, Psycoloquy).]

Note first that none of the four cases shown in Figure 2 shows any statistically significant difference between the results for the three training sets: the confidence intervals overlap substantially. However, in the first three cases, slightly better results seem to be obtained when test and training documents are drawn from the same population (dark bars). The fourth case (which corresponds to Turney's [1999] experiment) is anomalous in this regard, which suggests that the test set, in this case Psycoloquy journals, differs in character from the training set, in this case the other journals. Also, the Psycoloquy test set produces considerably (though not statistically significantly) better results whatever set it is trained on. This is because authors specified more keyphrases for these papers (an average of 8.4).

We conclude that although it makes little difference whether the training and testing documents come from different subject areas (certainly, the differences we have found are not statistically significant, perhaps because the test sets are too small), to be on the safe side we recommend using documents from the same subject area if that is possible. This finding leads to the idea of directly exploiting the fact that different keyphrases are used in different subject areas. We explore this, using Kea, next.

3 Exploiting Domain-Specific Information

A simple modification of Kea enables it to exploit collection-specific knowledge about the likelihood of a particular phrase being a keyphrase. All that is necessary is to keep track of the number of times a candidate phrase occurs as a keyphrase in the training documents, and use this information in the form of an additional, third attribute during the learning and extraction processes.

3.1 Extending the Model

For a given phrase P in document D, the new attribute, which we call keyphrase-frequency, is simply the number of times P occurs as an author-assigned keyphrase in all training documents other than D. Because this new attribute is integer-valued, we discretize it using the procedure from Section 2.2. Making the naive Bayes assumption of independence, with K being the discretized value of the keyphrase-frequency attribute, the probability of a phrase being a keyphrase becomes:

    Pr[key | K, T, D] = (Pr[K | key] × Pr[T | key] × Pr[D | key] × Pr[key]) / Pr[K, T, D],

where Pr[K | key] is the probability that a keyphrase has discretized keyphrase-frequency value K. Like the other probabilities, discussed in Section 2.2, Pr[K | key] can be estimated reliably by counting the number of times the corresponding event occurs in the training data.

The new attribute only makes sense if the documents for which keyphrases are to be extracted come from the same domain as the training documents. Otherwise, there is no reason to bias the extraction algorithm towards choosing phrases that have occurred as author-assigned keyphrases during training.
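Computing the new attribute is a matter of counting. A minimal sketch, assuming each training document carries its author-assigned keyphrases in stemmed form (the dictionary layout is illustrative):

    from collections import Counter

    def keyphrase_frequencies(training_docs):
        # Global count of how often each stemmed phrase occurs as an
        # author-assigned keyphrase across the training documents.
        freq = Counter()
        for doc in training_docs:
            freq.update(set(doc["keyphrases"]))
        return freq

    def keyphrase_frequency(phrase, doc, freq):
        # Occurrences of the phrase as a keyphrase in all training
        # documents other than doc.
        k = freq[phrase]
        if phrase in doc["keyphrases"]:
            k -= 1
        return k

The resulting integer is discretized with the same scheme as the other two attributes, and Pr[K | key] is multiplied into the naive Bayes product exactly as in the formula above.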
In order to make use of the information provided by the new attribute, it is necessary to re-train the extraction algorithm if keyphrases are to be extracted from documents on a different topic. Training time becomes a critical factor. Kea can generate a model for a new set of training documents far faster than GenEx because of the simple learning methods it employs. (GenEx and Kea both extract keyphrases very quickly once a model has been generated.) Asymptotically, Kea spends most of its training time sorting the attribute values for discretization. Since sorting is O(n log n) in the number of values, Kea is O(n log n) in the number of phrases contained in the training documents. On the collection of journal articles from Section 2.4, Kea needs 8 minutes for training, whereas GenEx needs approximately 48 hours [Turney, 1999]. Note that these times are measured on different computers. However, Kea is implemented in a combination of Perl and Java, whereas GenEx is written in C: we expect that the difference would be even more pronounced if the systems were compared on a level footing.
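For illustration, the entropy-minimizing recursion can be sketched as follows; the sort at the top is the O(n log n) step referred to above. Note that Fayyad and Irani's [1993] method stops splitting with an MDL test, which this sketch abbreviates to a minimum-gain threshold, so it is an approximation rather than the exact criterion.

    import math

    def entropy(labels):
        # Entropy of a binary class distribution (labels are 0 or 1).
        n = len(labels)
        result = 0.0
        for count in (sum(labels), n - sum(labels)):
            if count:
                p = count / n
                result -= p * math.log2(p)
        return result

    def cut_points(values, labels, min_gain=0.01):
        # Recursively choose the boundary that minimizes the entropy of
        # the class distribution on either side of the cut.
        pairs = sorted(zip(values, labels))  # the dominant O(n log n) sort
        values = [v for v, _ in pairs]
        labels = [c for _, c in pairs]
        base = entropy(labels)
        best_gain, best_i = 0.0, None
        for i in range(1, len(values)):
            if values[i] == values[i - 1]:
                continue  # only cut between distinct attribute values
            gain = base - (i * entropy(labels[:i]) +
                           (len(labels) - i) * entropy(labels[i:])) / len(labels)
            if gain > best_gain:
                best_gain, best_i = gain, i
        if best_i is None or best_gain < min_gain:  # stand-in for the MDL stopping test
            return []
        cut = (values[best_i - 1] + values[best_i]) / 2
        return (cut_points(values[:best_i], labels[:best_i], min_gain) +
                [cut] +
                cut_points(values[best_i:], labels[best_i:], min_gain))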
3.2 Experimental Evaluation

We empirically verified that exploiting domain-specific information increases the number of correctly extracted keyphrases by performing experiments with the CSTR collection described above. In order to isolate the effect of changing the number of documents for computing the keyphrase-frequency attribute, we used a separate set of documents, the keyphrase frequency corpus, for counting the number of times a phrase occurs as a keyphrase. The actual set of 130 training documents was held constant. Also, the same set of 500 test documents was used throughout this experiment.

Figure 3 shows how the number of correctly identified keyphrases varies with the amount of domain-specific information available. The worst performance is obtained when this information is not used, in other words, when the keyphrase-frequency attribute is excluded from the model. The performance improves as more documents are included in the keyphrase frequency corpus. Results are shown for corpuses of size 100 and 1000. Error bars are included for the case where no keyphrase frequency corpus is used, and for a corpus of size 1000: these give 95% confidence intervals on the number of keyphrases correctly extracted from a test document, and show that it is indeed possible to get significantly better results by exploiting domain-specific information about keyphrases. In contrast to the results from Section 2.4, it pays to have more than 50 documents with author-assigned keyphrases available; in fact, moving from 100 to 1000 documents improves results remarkably.

[Figure 3: Performance on the CSTR corpus for different numbers of phrases output. Results are given using no keyphrase frequencies, and using keyphrase frequency corpuses of 100 and 1000 documents. (Error bars are 95% confidence intervals.)]

4 Conclusions

We have evaluated a simple algorithm for keyphrase extraction, called Kea, which is based on the naive Bayes machine learning method, and shown that it performs comparably to the state of the art, represented by Turney's GenEx algorithm. We then proceeded to show how Kea's performance can be boosted by exploiting domain-specific information about the likelihood of keyphrases. Kea is particularly well suited for making use of this information because it can be trained very quickly in a new domain. Experiments on a large collection of computer science technical reports confirm that the modification significantly improves the quality of the keyphrases extracted.

Making use of knowledge about which keyphrases are used frequently in a particular domain has the additional advantage that the extracted keyphrases are more uniform. This property makes it easier to categorize documents using the keyphrases extracted, and should be beneficial if they are used for topic search or document clustering.

5 Acknowledgments

Many thanks to Peter Turney for making his document collections and drafts of his paper available to us, and to John Cleary for independently suggesting the use of the keyphrase-frequency attribute.

References

[Breiman, 1996] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.

[Domingos and Pazzani, 1997] P. Domingos and M. Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29(2/3):103-130, 1997.

[Dumais et al., 1998] S. T. Dumais, J. Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms and representations for text categorization. In Proceedings of the 7th International Conference on Information and Knowledge Management, 1998.

[Fayyad and Irani, 1993] Usama M. Fayyad and Keki B. Irani.
Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence, pages 1022-1027. Morgan Kaufmann, 1993.

[Lovins, 1968] J. B. Lovins. Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11:22-31, 1968.

[Quinlan, 1992] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, Los Altos, CA, 1992.

[Turney, 1999] P. D. Turney. Learning to extract keyphrases from text. Technical Report ERB-1057, National Research Council, Institute for Information Technology, 1999.