Human-competitive tagging using automatic keyphrase extraction

Olena Medelyan, Eibe Frank, Ian H. Witten
Computer Science Department, University of Waikato
{olena,eibe,ihw}@cs.waikato.ac.nz

Abstract

This paper connects two research areas: automatic tagging on the web and statistical keyphrase extraction. First, we analyze the quality of tags in a collaboratively created folksonomy using traditional evaluation techniques. Next, we demonstrate how documents can be tagged automatically with a state-of-the-art keyphrase extraction algorithm, and further improve performance in this new domain using a new algorithm, "Maui", that utilizes semantic information extracted from Wikipedia. Maui outperforms existing approaches and extracts tags that are competitive with those assigned by the best performing human taggers.

1 Introduction

Tagging is the process of labeling web resources based on their content. Each label, or tag, corresponds to a topic in a given document. Unlike metadata assigned by authors, or by professional indexers in libraries, tags are assigned by end-users for organizing and sharing information that is of interest to them. The organic system of tags assigned by all users of a given web platform is called a folksonomy.

In contrast to traditional taxonomies painstakingly constructed by experts, a user can add any tags to a folksonomy. This leads to the greatest downside of tagging, inconsistency, which originates in the synonymy and polysemy of human language, as well as in the varying degrees of specificity used by taggers (Golder and Huberman, 2006). In traditional libraries, consistency is the primary evaluation criterion of indexing (Rolling, 1981). Much work has been done on describing the statistical properties of folksonomies, such as tag distribution and co-occurrences (Halpin et al., 2007; Sigurbjörnsson et al., 2008; Sood et al., 2007), but to our knowledge there has been none on assessing the actual quality of tags. How well do human taggers perform? How consistent are they with each other?
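Rolling's inter-indexer consistency, referenced above, is conventionally computed as twice the number of terms two indexers share, divided by the total number of terms they assigned. A minimal Python sketch of this measure applied to tag sets; the function name and example tags are illustrative, and it assumes tags have already been normalized (e.g. lower-cased), which the measure itself does not prescribe:

```python
def rolling_consistency(tags_a, tags_b):
    """Inter-indexer consistency (Rolling, 1981): twice the number of
    tags in common, divided by the total number of tags assigned."""
    a, b = set(tags_a), set(tags_b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

# Two hypothetical CiteULike users tagging the same paper:
user1 = {"tagging", "folksonomy", "keyphrase_extraction"}
user2 = {"tagging", "folksonomy", "web", "indexing"}
print(rolling_consistency(user1, user2))  # 2*2 / (3+4) = 0.571...
```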
One potential solution to inconsistency in folksonomies is to use suggestion tools that automatically compute tags for new documents (e.g. Mishne, 2006; Sood et al., 2007; Heymann et al., 2008). Interestingly, the blooming research on automatic tagging has so far not been connected to work on keyphrase extraction (e.g. Frank et al., 1999; Turney, 2003; Hulth, 2004), which can be used as a tool for the same task (note: we use tag and keyphrase as synonyms). Instead of simple heuristics based on term frequencies and co-occurrence of tags, keyphrase extraction methods apply machine learning to determine typical distributions of properties common to manually assigned phrases, and can include analysis of semantic relations between candidate tags (Turney, 2003).
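To make the contrast with simple heuristics concrete: the classic learning approach of Frank et al. (1999) represents each candidate phrase by features such as its TF-IDF score and the relative position of its first occurrence, and trains a classifier on phrases humans did and did not assign. The sketch below computes these two features; it is a simplified illustration under our own naming, not the authors' implementation, and omits candidate filtering, stemming, and the learning step itself:

```python
import math
import re
from collections import Counter

def candidates(words, max_len=3):
    """All word n-grams up to max_len, with the word position of each
    occurrence. (Real systems also filter by stopwords, part of speech,
    stemming, and so on.)"""
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n]), i

def phrase_features(text, doc_freq, n_docs):
    """Per candidate phrase: (TF-IDF, relative first occurrence).
    doc_freq maps each phrase to the number of training documents
    containing it; both arguments are assumed to be given."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    tf, first = Counter(), {}
    for phrase, pos in candidates(words):
        tf[phrase] += 1
        first.setdefault(phrase, pos)  # earlier occurrence = more topical
    return {
        phrase: (count * math.log(n_docs / (1 + doc_freq.get(phrase, 0))),
                 first[phrase] / max(1, len(words)))
        for phrase, count in tf.items()
    }
```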
How well do state-of-the-art keyphrase extraction systems perform compared to simple tagging techniques? How consistent are they with human taggers? These are questions we address in this paper.

Until now, keyphrase extraction methods have primarily been evaluated using a single set of keyphrases for each document, thereby largely ignoring the subjective nature of the task. Collaboratively tagged documents, on the other hand, offer multiple tag assignments by independent users, a unique basis for evaluation that we capitalize upon in this paper.

The experiments reported in this paper fill these gaps in the research on automatic tagging and keyphrase extraction. First, we analyze tagging consistency on the CiteULike.org platform for organizing academic citations. Methods traditionally used for the evaluation of professional indexing will provide insight into the quality of this folksonomy. Next, we extract a high quality corpus from CiteULike, containing documents that have been tagged consistently by the best human taggers.
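One plausible way to operationalize "tagged consistently by the best human taggers" is to score every user by their average pairwise consistency with co-taggers of the same documents, then keep documents whose taggers all score highly. The sketch below follows this reading, reusing Rolling's measure from above; it is an illustration of the idea, not necessarily the exact corpus-construction procedure used in the paper:

```python
from collections import defaultdict
from itertools import combinations

def consistency(a, b):
    """Rolling's measure again, over two sets of tags."""
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0

def rank_taggers(assignments):
    """assignments: {doc_id: {user_id: set_of_tags}}. Returns each
    user's mean consistency with co-taggers of the same documents."""
    scores = defaultdict(list)
    for users in assignments.values():
        for (u, tags_u), (v, tags_v) in combinations(users.items(), 2):
            c = consistency(tags_u, tags_v)
            scores[u].append(c)
            scores[v].append(c)
    return {u: sum(cs) / len(cs) for u, cs in scores.items()}

# A corpus could then keep only documents whose taggers all rank well:
# ranked = rank_taggers(assignments)
# corpus = {d for d, users in assignments.items()
#           if all(ranked[u] > threshold for u in users)}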