Submitted to Information Retrieval - INRT 34-99
September 8, 1999
© 1999 National Research Council Canada

Learning Algorithms for Keyphrase Extraction

Peter D. Turney
Institute for Information Technology
National Research Council of Canada
Ottawa, Ontario, Canada, K1A 0R6
peter.turney@iit.nrc.ca
Phone: 613-993-8564
Fax: 613-952-7151

Abstract

Many academic journals ask their authors to provide a list of about five to fifteen keywords, to appear on the first page of each article. Since these key words are often phrases of two or more words, we prefer to call them keyphrases. There is a wide variety of tasks for which keyphrases are useful, as we discuss in this paper. We approach the problem of automatically extracting keyphrases from text as a supervised learning task. We treat a document as a set of phrases, which the learning algorithm must learn to classify as positive or negative examples of keyphrases. Our first set of experiments applies the C4.5 decision tree induction algorithm to this learning task. We evaluate the performance of nine different configurations of C4.5. The second set of experiments applies the GenEx algorithm to the task. We developed the GenEx algorithm specifically for automatically extracting keyphrases from text. The experimental results support the claim that a custom-designed algorithm (GenEx), incorporating specialized procedural domain knowledge, can generate better keyphrases than a general-purpose algorithm (C4.5). Subjective human evaluation of the keyphrases generated by Extractor suggests that about 80% of the keyphrases are acceptable to human readers. This level of performance should be satisfactory for a wide variety of applications.

Keyphrases: machine learning, summarization, indexing, keywords, keyphrase extraction
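
The supervised framing described in the abstract (a document as a set of phrases, each labelled as a positive or negative example of a keyphrase) can be illustrated with a minimal Python sketch. It rests on assumptions that are not taken from the paper: candidates are word n-grams of up to three words that do not cross punctuation, a candidate counts as positive only when it exactly matches an author-supplied keyphrase, and the function names `candidate_phrases` and `label_examples` are hypothetical. It does not reproduce the features, stemming, or configurations used with C4.5 or GenEx.

```python
import re


def candidate_phrases(text, max_words=3):
    """Return the set of lowercase word n-grams (1..max_words words).

    Assumption for illustration: a candidate phrase never spans punctuation.
    """
    phrases = set()
    # Split on punctuation so that n-grams stay inside one clause.
    for fragment in re.split(r"[^\w\s-]+", text.lower()):
        words = fragment.split()
        for n in range(1, max_words + 1):
            for i in range(len(words) - n + 1):
                phrases.add(" ".join(words[i:i + n]))
    return phrases


def label_examples(text, author_keyphrases, max_words=3):
    """Map each candidate phrase to True (positive) or False (negative).

    Assumption for illustration: a phrase is positive only if it exactly
    matches one of the author-supplied keyphrases, ignoring case.
    """
    gold = {k.lower() for k in author_keyphrases}
    return {p: (p in gold) for p in candidate_phrases(text, max_words)}


if __name__ == "__main__":
    doc = "We study keyphrase extraction. Machine learning helps."
    labels = label_examples(doc, ["Keyphrase Extraction", "machine learning"])
    for phrase in sorted(p for p, is_key in labels.items() if is_key):
        print(phrase)
```

Labelled examples of this kind are what a classifier would be trained on. In this sketch most candidates are negative, since only the phrases that match an author-supplied keyphrase are marked positive.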