
NRC Publications Archive (NPArC) / Archives des publications du CNRC (NPArC)

Learning Algorithms for Keyphrase Extraction
Turney, Peter D.

Web page / page Web:
http://nparc.cisti-icist.nrc-cnrc.gc.ca/npsi/ctrl?action=rtdoc&an=8913713&lang=en
http://nparc.cisti-icist.nrc-cnrc.gc.ca/npsi/ctrl?action=rtdoc&an=8913713&lang=fr

Access and use of this website and the material on it are subject to the Terms and Conditions set forth at http://nparc.cisti-icist.nrc-cnrc.gc.ca/npsi/jsp/nparc_cp.jsp?lang=en (French version: http://nparc.cisti-icist.nrc-cnrc.gc.ca/npsi/jsp/nparc_cp.jsp?lang=fr). READ THESE TERMS AND CONDITIONS CAREFULLY BEFORE USING THIS WEBSITE.

Contact us / Contactez nous: nparc.cisti@nrc-cnrc.gc.ca


National Research Council Canada, Institute for Information Technology / Conseil national de recherches Canada, Institut de technologie de l'information

Learning Algorithms for Keyphrase Extraction*
P. Turney
July 2000

Copyright 2001 by National Research Council of Canada. Permission is granted to quote short excerpts and to reproduce figures and tables from this report, provided that the source of such material is fully acknowledged.

*published in J. Information Retrieval, 2(4): 303-336; 2000. NRC 44105


Submitted to Information Retrieval — INRT 34-99, September 8, 1999. © 1999 National Research Council Canada

Learning Algorithms for Keyphrase Extraction

Peter D. Turney
Institute for Information Technology
National Research Council of Canada
Ottawa, Ontario, Canada, K1A 0R6
peter.turney@iit.nrc.ca
Phone: 613-993-8564 Fax: 613-952-7151

Abstract

Many academic journals ask their authors to provide a list of about five to fifteen keywords, to appear on the first page of each article. Since these key words are often phrases of two or more words, we prefer to call them keyphrases. There is a wide variety of tasks for which keyphrases are useful, as we discuss in this paper. We approach the problem of automatically extracting keyphrases from text as a supervised learning task. We treat a document as a set of phrases, which the learning algorithm must learn to classify as positive or negative examples of keyphrases. Our first set of experiments applies the C4.5 decision tree induction algorithm to this learning task. We evaluate the performance of nine different configurations of C4.5. The second set of experiments applies the GenEx algorithm to the task. We developed the GenEx algorithm specifically for automatically extracting keyphrases from text. The experimental results support the claim that a custom-designed algorithm (GenEx), incorporating specialized procedural domain knowledge, can generate better keyphrases than a general-purpose algorithm (C4.5). Subjective human evaluation of the keyphrases generated by Extractor suggests that about 80% of the keyphrases are acceptable to human readers. This level of performance should be satisfactory for a wide variety of applications.

Keyphrases: machine learning, summarization, indexing, keywords, keyphrase extraction


1. Introduction

Many journals ask their authors to provide a list of keywords for their articles. We call these keyphrases, rather than keywords, because they are often phrases of two or more words, rather than single words. We define a keyphrase list as a short list of phrases (typically five to fifteen noun phrases) that capture the main topics discussed in a given document. This paper is concerned with the automatic extraction of keyphrases from text.

Keyphrases are meant to serve multiple goals. For example, (1) when they are printed on the first page of a journal article, the goal is summarization. They enable the reader to quickly determine whether the given article is in the reader’s fields of interest. (2) When they are printed in the cumulative index for a journal, the goal is indexing. They enable the reader to quickly find a relevant article when the reader has a specific need. (3) When a search engine form has a field labelled keywords, the goal is to enable the reader to make the search more precise. A search for documents that match a given query term in the keyword field will yield a smaller, higher quality list of hits than a search for the same term in the full text of the documents. Keyphrases can serve these diverse goals, and others, because the goals share the requirement for a short list of phrases that captures the main topics of the documents.

We define automatic keyphrase extraction as the automatic selection of important, topical phrases from within the body of a document. Automatic keyphrase extraction is a special case of the more general task of automatic keyphrase generation, in which the generated phrases do not necessarily appear in the body of the given document. Section 2 discusses criteria for measuring the performance of automatic keyphrase extraction algorithms.
In the experiments in this paper, we measure the performance by comparing machine-generated keyphrases with human-generated keyphrases. In our document collections, an average of about 75% of the author’s keyphrases appear somewhere in the body of the corresponding document. Thus, an ideal keyphrase extraction algorithm could (in principle) generate


phrases that match up to 75% of the author’s keyphrases.

There is a need for tools that can automatically create keyphrases. Although keyphrases are very useful, only a small minority of the many documents that are available on-line today have keyphrases. There are already some commercial software products that use automatic keyphrase extraction algorithms. For example, Microsoft uses automatic keyphrase extraction in Word 97, to fill the Keywords field in the document metadata template (metadata is meta-information for document management).1 Verity uses automatic keyphrase extraction in Search 97, their search engine product line. In Search 97, keyphrases are highlighted in bold to facilitate skimming through a list of search results.2 Tetranet uses automatic keyphrase extraction in their Metabot product, which is designed for maintaining metadata for web pages. Tetranet also uses automatic keyphrase extraction in their Wisebot product, which builds an index for a web site.3

Although the applications for keyphrases mentioned above share the requirement for a short list of phrases that captures the main topics of the documents, the precise size of the list will vary, depending on the particular application and the inclinations of the users. Therefore the algorithms that we discuss allow the users to specify the desired number of phrases.

We discuss related work by other researchers in Section 3. The most closely related work involves the problem of automatic index generation (Fagan, 1987; Salton, 1988; Ginsberg, 1993; Nakagawa, 1997; Leung and Kan, 1997). One difference between keyphrase extraction

1. To access the metadata template in Word 97, select File and then Properties. To automatically fill the Keywords field, select Tools and then AutoSummarize. (This is not obvious from the Word 97 documentation.) Microsoft and Word 97 are trademarks or registered trademarks of Microsoft Corporation.
2.
Microsoft and Verity use proprietary techniques for keyphrase extraction. It appears that their techniques do not involve machine learning. Verity and Search 97 are trademarks or registered trademarks of Verity Inc.
3. Tetranet has licensed our keyphrase extraction software for use in their products. Tetranet, Metabot, and Wisebot are trademarks or registered trademarks of Tetranet Software. For experimental comparisons of Word 97 and Search 97 with our own work, see Turney (1997, 1999).


and index generation is that, although keyphrases may be used in an index, keyphrases have other applications, beyond indexing. Another difference between a keyphrase list and an index is length. Because a keyphrase list is relatively short, it must contain only the most important, topical phrases for a given document. Because an index is relatively long, it can contain many less important, less topical phrases. Also, a keyphrase list can be read and judged in seconds, but an index might never be read in its entirety. Automatic keyphrase extraction is thus a more demanding task than automatic index generation.

Keyphrase extraction is also distinct from information extraction, the task that has been studied in depth in the Message Understanding Conferences (MUC-3, 1991; MUC-4, 1992; MUC-5, 1993; MUC-6, 1995). Information extraction involves extracting specific types of task-dependent information. For example, given a collection of news reports on terrorist attacks, information extraction involves finding specific kinds of information, such as the name of the terrorist organization, the names of the victims, and the type of incident (e.g., kidnapping, murder, bombing). In contrast, keyphrase extraction is not specific. The goal in keyphrase extraction is to produce topical phrases, for any type of factual, prosaic document.

We approach automatic keyphrase extraction as a supervised learning task. We treat a document as a set of phrases, which must be classified as either positive or negative examples of keyphrases. This is the classical machine learning problem of learning from examples. In Section 5, we describe how we apply the C4.5 decision tree induction algorithm to this task (Quinlan, 1993). There are several unusual aspects to this classification problem. For example, the positive examples constitute only 0.2% to 2.4% of the total number of examples. C4.5 is typically applied to more balanced class distributions.
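The supervised framing described above can be made concrete with a small sketch. Everything here (the helper names, the toy document, the unigram-to-trigram candidate generator) is invented for illustration; the paper's actual preprocessing also involves stemming and candidate filtering:

```python
def candidate_phrases(text, max_len=3):
    """Enumerate the word n-grams (n <= max_len) of a document.
    These n-grams play the role of the 'phrases' to be classified."""
    words = text.lower().split()
    return {tuple(words[i:i + n])
            for n in range(1, max_len + 1)
            for i in range(len(words) - n + 1)}

def label_examples(text, author_keyphrases):
    """Build (phrase, label) training examples: a phrase is a positive
    example exactly when it matches one of the author's keyphrases."""
    positives = {tuple(p.lower().split()) for p in author_keyphrases}
    return [(p, p in positives) for p in candidate_phrases(text)]

doc = ("we approach keyphrase extraction as a supervised learning task "
       "and treat a document as a set of phrases")
examples = label_examples(doc, ["keyphrase extraction", "supervised learning"])

n_pos = sum(label for _, label in examples)
print(n_pos, len(examples))  # the positives are a small fraction of all phrases
```

Even in this tiny document only 2 of a few dozen candidate phrases are positive, which echoes the skewed class distributions (0.2% to 2.4% positive) that make this an unusual task for C4.5.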
The experiments in this paper use five collections of documents, with a combined total of 652 documents. The collections are presented in Section 4. In our first set of experiments (Section 6), we evaluate nine different ways to apply C4.5. In preliminary experiments with the training documents, we found that bagging seemed to improve the performance of C4.5


(Breiman, 1996a, 1996b; Quinlan, 1996). Bagging works by generating many different decision trees and allowing them to vote on the classification of each example. We experimented with different numbers of trees and different techniques for sampling the training data. The experiments support the hypothesis that bagging improves the performance of C4.5 when applied to automatic keyphrase extraction.

During our experiments with C4.5, we came to believe that a specialized algorithm, developed specifically for learning to extract keyphrases, might achieve better results than a general-purpose learning algorithm, such as C4.5. Section 7 introduces the GenEx algorithm. GenEx is a hybrid of the Genitor steady-state genetic algorithm (Whitley, 1989) and the Extractor parameterized keyphrase extraction algorithm (Turney, 1997, 1999).4 Extractor works by assigning a numerical score to the phrases in the input document. The final output of Extractor is essentially a list of the highest scoring phrases. The behaviour of the scoring function is determined by a dozen numerical parameters. Genitor tunes the setting of these parameters, to maximize the performance of Extractor on a given set of training examples.

The second set of experiments (Section 8) supports the hypothesis that a specialized algorithm (GenEx) can generate better keyphrases than a general-purpose algorithm (C4.5). Both algorithms incorporate significant amounts of domain knowledge, but we avoided embedding specialized procedural knowledge in our application of C4.5. It appears that some degree of specialized procedural knowledge is necessary for automatic keyphrase extraction.

The third experiment (Section 9) looks at subjective human evaluation of the quality of the keyphrases produced by GenEx. On average, about 80% of the automatically generated keyphrases are judged to be acceptable and about 60% are judged to be good.
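The voting scheme behind bagging can be illustrated with a toy sketch. It bags one-dimensional decision stumps rather than C4.5 trees, and the data set, the stump learner, and all names are invented for this example:

```python
import random

def train_stump(sample):
    """Fit a decision stump on one bootstrap sample: choose the
    threshold t that maximizes accuracy of the rule 'x > t is positive'."""
    best_t, best_acc = None, -1.0
    for t, _ in sample:
        acc = sum((x > t) == y for x, y in sample) / len(sample)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def bagged_predict(stumps, x):
    """Classify x by majority vote over the ensemble, as in bagging."""
    votes = sum(x > t for t in stumps)
    return votes * 2 > len(stumps)

random.seed(0)
# Toy training data: feature x in [0, 1], label y = (x > 0.6).
data = [(x / 10.0, x / 10.0 > 0.6) for x in range(10)]

# Bagging: each stump is trained on its own bootstrap sample, i.e. a
# sample of the training data drawn with replacement.
stumps = [train_stump([random.choice(data) for _ in data]) for _ in range(25)]

print(bagged_predict(stumps, 0.9), bagged_predict(stumps, 0.1))
```

Because every stump sees a slightly different sample, individual stumps disagree near the decision boundary, but the majority vote is more stable than any single stump, which is the effect bagging exploits.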
Section 10 discusses the experimental results and Section 11 presents our plans for future work. We conclude (in Section 12) that GenEx is performing at a level that is suitable for

4. Extractor is an Official Mark of the National Research Council of Canada. Patent applications have been submitted for Extractor.


many practical applications.

2. Measuring the Performance of Keyphrase Extraction Algorithms

We measure the performance of keyphrase extraction algorithms by the number of matches between the machine-generated phrases and the human-generated phrases. A handmade keyphrase matches a machine-generated keyphrase when they correspond to the same sequence of stems. A stem is what remains when we remove the suffix from a word. By this definition, “neural networks” matches “neural network”, but it does not match “networks”. The order in the sequence is important, so “helicopter skiing” does not match “skiing helicopter”.

The Porter (1980) and Lovins (1968) stemming algorithms are the two most popular algorithms for stemming English words. Both algorithms use heuristic rules to remove or transform English suffixes. The Lovins stemmer is more aggressive than the Porter stemmer. That is, the Lovins stemmer is more likely to recognize that two words share the same stem, but it is also more likely to incorrectly map two distinct words to the same stem (Krovetz, 1993). We have found that aggressive stemming is better for keyphrase extraction than conservative stemming. In our experiments, we have used an aggressive stemming algorithm that we call the Iterated Lovins stemmer. The algorithm repeatedly applies the Lovins stemmer, until the word stops changing. Iterating in this manner will necessarily increase (or leave unchanged) the aggressiveness of any stemmer. Table 1 shows some examples of the behaviour of the three stemming algorithms.5

We may view keyphrase extraction as a classification problem.

5. We used an implementation of the Porter (1980) stemming algorithm written in Perl, by Jim Richardson, at the University of Sydney, Australia. This implementation includes some extensions to Porter’s original algorithm, to handle British spelling. It is available at http://www.maths.usyd.edu.au:8000/jimr.html.
For the Lovins (1968) stemming algorithm, we used an implementation written in C, by Linh Huynh. This implementation is part of the MG (Managing Gigabytes) search engine, which was developed by a group of people in Australia and New Zealand. The MG code is available at http://www.cs.mu.oz.au/mg/.
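The iterate-until-the-word-stops-changing construction is independent of the underlying stemmer. The sketch below illustrates it with a toy suffix stripper; the suffix list is an invented stand-in, not the Lovins rule set:

```python
def toy_stem(word):
    """One pass of a toy suffix stripper (a stand-in for the Lovins
    stemmer, which uses a much larger set of rules).  Strips at most
    one suffix per call, longest suffix first."""
    for suffix in ("ness", "ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def iterated_stem(word):
    """Apply the stemmer repeatedly until the word stops changing,
    mirroring the Iterated Lovins stemmer described above."""
    prev = None
    while word != prev:
        prev, word = word, toy_stem(word)
    return word

print(toy_stem("witnesses"))       # -> "witness": one pass strips one suffix
print(iterated_stem("witnesses"))  # -> "wit": iteration reaches the fixpoint
```

Each iteration can only remove a further suffix or leave the word unchanged, so the iterated stemmer is at least as aggressive as a single pass, matching the claim in the text.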


The task is to classify each word or phrase in the document into one of two categories: either it is a keyphrase or it is not a keyphrase. We evaluate automatic keyphrase extraction by the degree to which its classifications correspond to human-generated classifications. Our performance measure is precision (the number of matches divided by the number of machine-generated keyphrases), using a variety of cut-offs for the number of machine-generated keyphrases.

3. Related Work

Although there are several papers that discuss automatically extracting important phrases, as far as we know, we are the first to treat this problem as supervised learning from examples. Krulwich and Burkey (1996) use heuristics to extract keyphrases from a document. The heuristics are based on syntactic clues, such as the use of italics, the presence of phrases in section headers, and the use of acronyms. Their motivation is to produce phrases for use as features when automatically classifying documents. Their algorithm tends to produce a relatively large list of phrases, with low precision. Muñoz (1996) uses an unsupervised learning algorithm to discover two-word keyphrases. The algorithm is based on Adaptive Resonance Theory (ART) neural networks. Muñoz’s algorithm tends to produce a large list of phrases, with low precision. Also, the algorithm is not applicable to one-word or more-than-two-word keyphrases. Steier and Belew (1993) use the mutual information statistic to discover two-word keyphrases.

Table 1: Samples of the behaviour of three different stemming algorithms.

Word           Porter Stem   Lovins Stem   Iterated Lovins Stem
believes       believ        belief        belief
belief         belief        belief        belief
believable     believ        belief        belief
jealousness    jealous       jeal          jeal
jealousy       jealousi      jealous       jeal
police         polic         polic         pol
policy         polici        polic         pol
assemblies     assembli      assembl       assembl
assembly       assembli      assemb        assemb
probable       probabl       prob          prob
probability    probabl       prob          prob
probabilities  probabl       probabil      probabil
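The matching criterion and the precision measure defined in Section 2 can be sketched as follows. The `stem` function here is a deliberately crude placeholder (lowercase and strip a plural "s") standing in for the Iterated Lovins stemmer, and the two phrase lists are invented:

```python
def stem(word):
    """Placeholder stemmer: lowercase and strip a final plural 's'.
    The paper uses the far more aggressive Iterated Lovins stemmer."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def stem_sequence(phrase):
    """Two phrases match when they reduce to the same ordered
    sequence of stems."""
    return tuple(stem(w) for w in phrase.split())

def precision(machine_phrases, author_phrases):
    """Number of matches divided by the number of machine-generated
    keyphrases, as defined in Section 2."""
    author = {stem_sequence(p) for p in author_phrases}
    matches = sum(stem_sequence(p) in author for p in machine_phrases)
    return matches / len(machine_phrases)

machine = ["neural network", "helicopter skiing", "keyphrase extraction"]
author = ["neural networks", "skiing helicopters", "supervised learning"]

# "neural network" matches "neural networks" (same stem sequence), but
# "helicopter skiing" does not match "skiing helicopters": order matters.
print(precision(machine, author))  # -> 0.3333...
```

In the experiments this precision is computed at several cut-offs for the number of machine-generated keyphrases.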
