Turney 6

many practical applications.

2. Measuring the Performance of Keyphrase Extraction Algorithms

We measure the performance of keyphrase extraction algorithms by the number of matches between the machine-generated phrases and the human-generated phrases.
A handmade keyphrase matches a machine-generated keyphrase when they correspond to the same sequence of stems. A stem is what remains when we remove the suffix from a word. By this definition, “neural networks” matches “neural network”, but it does not match “networks”. The order in the sequence is important, so “helicopter skiing” does not match “skiing helicopter”.

The Porter (1980) and Lovins (1968) stemming algorithms are the two most popular algorithms for stemming English words. Both algorithms use heuristic rules to remove or transform English suffixes. The Lovins stemmer is more aggressive than the Porter stemmer. That is, the Lovins stemmer is more likely to recognize that two words share the same stem, but it is also more likely to incorrectly map two distinct words to the same stem (Krovetz, 1993).

We have found that aggressive stemming is better for keyphrase extraction than conservative stemming. In our experiments, we have used an aggressive stemming algorithm that we call the Iterated Lovins stemmer. The algorithm repeatedly applies the Lovins stemmer, until the word stops changing. Iterating in this manner will necessarily increase (or leave unchanged) the aggressiveness of any stemmer. Table 1 shows some examples of the behaviour of the three stemming algorithms.⁵

We may view keyphrase extraction as a classification problem. The task is to classify

5. We used an implementation of the Porter (1980) stemming algorithm written in Perl, by Jim Richardson, at the University of Sydney, Australia. This implementation includes some extensions to Porter’s original algorithm, to handle British spelling. It is available at http://www.maths.usyd.edu.au:8000/jimr.html. For the Lovins (1968) stemming algorithm, we used an implementation written in C, by Linh Huynh. This implementation is part of the MG (Managing Gigabytes) search engine, which was developed by a group of people in Australia and New Zealand. The MG code is available at http://www.cs.mu.oz.au/mg/.
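
As an illustration of the matching criterion and of iterated stemming described in this section, the following Python sketch may be helpful. It is not the implementation used in the experiments. The suffix list in `toy_stem` is a hypothetical stand-in for a real stemmer and reproduces neither the Porter nor the Lovins rules. The function names are likewise assumptions, as is the choice in `count_matches` to count distinct stem sequences rather than individual phrases.

```python
from typing import Callable, Iterable, List

# Hypothetical toy rules: NOT the Porter or Lovins rule sets.
TOY_SUFFIXES = ("ing", "es", "s")


def toy_stem(word: str) -> str:
    """Strip at most one suffix, keeping a stem of at least three letters."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def iterate_stemmer(stem: Callable[[str], str], word: str) -> str:
    """Apply `stem` repeatedly until the word stops changing.

    Assumes the stemmer eventually reaches a fixed point, which holds
    for toy_stem because each pass either shortens the word or leaves
    it unchanged.
    """
    while True:
        stemmed = stem(word)
        if stemmed == word:
            return word
        word = stemmed


def stem_sequence(phrase: str, stem: Callable[[str], str] = toy_stem) -> List[str]:
    """Map a phrase to its ordered sequence of iterated stems."""
    return [iterate_stemmer(stem, w) for w in phrase.lower().split()]


def phrases_match(a: str, b: str, stem: Callable[[str], str] = toy_stem) -> bool:
    """Two phrases match when their stem sequences are identical, in order."""
    return stem_sequence(a, stem) == stem_sequence(b, stem)


def count_matches(machine: Iterable[str], human: Iterable[str],
                  stem: Callable[[str], str] = toy_stem) -> int:
    """Count distinct stem sequences shared by the two phrase lists."""
    machine_keys = {tuple(stem_sequence(p, stem)) for p in machine}
    human_keys = {tuple(stem_sequence(p, stem)) for p in human}
    return len(machine_keys & human_keys)


if __name__ == "__main__":
    print(phrases_match("neural networks", "neural network"))
    print(phrases_match("helicopter skiing", "skiing helicopter"))
    print(toy_stem("meetings"), iterate_stemmer(toy_stem, "meetings"))
```

Comparing ordered lists of stems, rather than sets, is what makes word order significant. Passing the stemmer in as a parameter lets the same iteration wrap any single-pass stemmer, in the way the Iterated Lovins stemmer wraps the Lovins stemmer.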