Following that, our goal is to build a system that matches the performance of these taggers. We first apply an existing approach proposed by Brooks and Montanez (2006) and compare it to the keyphrase extraction algorithm Kea (Frank et al., 1999). Next we create a new algorithm, called Maui, that enhances Kea's successful machine learning framework with semantic knowledge retrieved from Wikipedia, new features, and a new classification model. We evaluate Maui using tag sets assigned to the same documents by several users and show that it is as consistent with CiteULike users as they are with each other.

Most of the computation required for automatic tagging with this method can be performed offline. In practice, it can be used as a tag suggestion tool that provides users with tags describing the main topics of newly added documents, which can then be corrected or enhanced by personal tags if required. This will improve consistency in the folksonomy without compromising its flexibility.

2 Collaboratively-tagged Data

CiteULike.org is a bookmarking service that resembles the popular del.icio.us, but concentrates on scholarly papers. Rather than replicating the full text of tagged papers, it simply points to them on the web (e.g. PubMed, CiteSeer, ScienceDirect, Amazon) or in journals (e.g. HighWire, Nature). This avoids violating copyright but means that the full text of articles is not necessarily available. When entering new resources, users are encouraged to assign tags describing their content or reflecting their own grouping of the information. However, the system does not suggest tags. Moreover, users do not see other users' tags and are thus not biased in their tag choices.

2.1 Extracting a high quality tagged corpus

The CiteULike data set is freely available and contains information about which documents were tagged with what tags by which users (although identities are not provided). CiteULike's 22,300 users have tagged 713,600 documents with 2.4M "tag assignments": single applications of a tag by a user to a document. The two most popular tags, bibtex-import and no-tag, indicate an information source and a missing tag respectively. Most of the remainder describe particular concepts relevant to the documents. We exclude non-content tags from our experiments, e.g. personal tags like to-read or todo. Note that spam entries have been eliminated from the data set.

Because CiteULike taggers are not professional indexers, high quality of the assigned topics cannot be guaranteed. In fact, manual assessment of users' tags by human evaluators shows precision of 59% (Mishne, 2006) and 49% (Sood et al., 2006). However, why is the opinion of human evaluators valued more than the opinion of taggers? We propose an alternative, automatic way of determining ground truth based on reliable tags: we concentrate on a subset of CiteULike containing documents that have been indexed with at least three tags on which at least two users have agreed.

In order to measure the tagging consistency between the users, and then compare it to the algorithm's consistency, we need taggers who have tagged documents that some others had tagged. We say that two users are "co-taggers" if they have both tagged at least one common document. As well as restricting the document set, we only include taggers who have at least two co-taggers.
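The following Python sketch is not the authors' code; it only illustrates one way the filtering just described could be implemented, assuming the tag assignments are available as (user, document, tag) triples. The sample data, function name and thresholds are hypothetical, and co-taggers are computed over the retained documents, which is one possible reading of the criterion above.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (user, document, tag) triples standing in for the real
# CiteULike tag-assignment dump; names and values are illustrative only.
assignments = [
    ("u1", "d1", "clustering"), ("u2", "d1", "clustering"),
    ("u1", "d1", "genomics"),   ("u3", "d1", "genomics"),
    ("u2", "d1", "microarray"), ("u3", "d1", "microarray"),
    ("u1", "d2", "networks"),   # only one tagger, so d2 is discarded
]

def filter_corpus(assignments, min_agreed_tags=3, min_cotaggers=2):
    """Keep documents with >= min_agreed_tags tags agreed on by >= 2 users,
    then keep only taggers who have >= min_cotaggers co-taggers."""
    users_per_tag = defaultdict(lambda: defaultdict(set))  # doc -> tag -> users
    users_per_doc = defaultdict(set)                       # doc -> users
    for user, doc, tag in assignments:
        users_per_tag[doc][tag].add(user)
        users_per_doc[doc].add(user)

    # Documents indexed with at least three tags on which >= 2 users agreed.
    docs = {doc for doc, tags in users_per_tag.items()
            if sum(1 for u in tags.values() if len(u) >= 2) >= min_agreed_tags}

    # Two users are co-taggers if they have tagged at least one common document.
    cotaggers = defaultdict(set)
    for doc in docs:
        for a, b in combinations(users_per_doc[doc], 2):
            cotaggers[a].add(b)
            cotaggers[b].add(a)
    taggers = {u for u, others in cotaggers.items() if len(others) >= min_cotaggers}
    return docs, taggers

docs, taggers = filter_corpus(assignments)
print(docs, taggers)  # {'d1'} {'u1', 'u2', 'u3'} (set order may vary)
```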
Figure 1 shows the proportions of CiteULike documents that are discarded in order to produce our high quality data set. The final set contains only 2,100 documents (0.3% of the original). Unfortunately, many of these are unavailable for download; for example, books at Amazon.com and ArXiv.org references cannot be crawled. We further restrict attention to two sources: HighWire and Nature, both of which provide easily accessible PDFs of the full text.

Figure 1. Quality control of CiteULike data: 713,600 all CiteULike articles; 513,100 with at least 1 tag; 367,600 with at least 2 tags; 16,600 with at least 3 taggers; 2,100 with at least 3 agreed tags.

The result is a set of 180 documents indexed by 332 taggers. A total of 4,638 tags were assigned by all taggers to documents in this set; however, the number of tags on which at least two users agreed is significantly smaller, namely 946. Still, this results in accurate tag sets that contain an average of five tags per document.
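The agreed tags act as the ground-truth tag set for each document in this final set. As a rough illustration, the sketch below shows how such consensus tag sets and the per-document average could be computed; the data, names and helper function are made up, not taken from the actual corpus, where this step yields 946 agreed tags, about five per document.

```python
from collections import defaultdict

# Hypothetical tag assignments for two documents of the final set:
# document -> list of (user, tag) pairs; all names are made up.
doc_assignments = {
    "d1": [("u1", "clustering"), ("u2", "clustering"),
           ("u1", "genomics"), ("u3", "genomics"), ("u2", "pubmed")],
    "d2": [("u4", "networks"), ("u5", "networks"), ("u4", "graphs")],
}

def agreed_tags(pairs, min_users=2):
    """Tags assigned to a document by at least `min_users` distinct users."""
    users_per_tag = defaultdict(set)
    for user, tag in pairs:
        users_per_tag[tag].add(user)
    return {tag for tag, users in users_per_tag.items() if len(users) >= min_users}

consensus = {doc: agreed_tags(pairs) for doc, pairs in doc_assignments.items()}
n_agreed = sum(len(tags) for tags in consensus.values())
print(n_agreed, n_agreed / len(consensus))  # 3 agreed tags, 1.5 per document here
```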