reduce overload [8]. Since then, several types of agents have been prototyped and developed for many different fields, such as the Web, music, and academic writing. See Montaner et al. (2003) for a comprehensive overview of agents available on the Web.

Yet there have been only a handful of approaches to recommending interesting academic articles to users, McNee et al. (2006) arguably being the most prominent one. McNee frames the interaction between users and recommender systems, focusing on recommending interesting research papers from a user-centric perspective.
He identifies the different tasks a recommender system could perform to assist the user, such as finding a starting point for research on a particular topic, and maintaining awareness of a research field. Recommendations are generated on the basis of citations in scientific papers [10].

Basu et al. (2001) focus on the related problem of recommending conference paper submissions to reviewing committee members [1]. They use a content-based approach to paper recommendation, using the Vector Space model with tf·idf weighting. Another related area of research is the development of recommender systems that employ folksonomies. Most of the work so far has focused on recommending tags for bookmarks. Jäschke et al. (2007), for instance, compared two different CF algorithms with a graph-based algorithm for recommending tags in BibSonomy. They found that the graph-based algorithm outperforms the CF algorithms only for the top 3 ranks [6]. Mishne (2006) performs similar experiments when predicting tags associated with blog posts [11]. In our experiments we focus on CiteULike as the social reference manager. Capocci et al. analyze the small-world properties of the CiteULike folksonomy [2].

3. CITEULIKE

CiteULike is a website that offers "a free service to help you to store, organise, and share the scholarly papers you are reading".2 It allows its users to add their academic reference library to their online profile on the CiteULike website. At the time of writing, CiteULike contains around 885,310 unique items, annotated by 27,489 users with 174,322 unique tags. Articles can be stored with their metadata (in various formats), abstracts, and links to the papers at the publishers' websites. Users can also add reading priorities, personal comments, and tags to their papers. CiteULike also offers the possibility of users setting up and joining groups that connect users sharing academic or topical interests. These group pages report on recent activity, and offer the possibility of maintaining discussion fora or blogs. The full text of articles is not accessible from CiteULike, although links to online articles can be added.

2See http://www.citeulike.org/faq/data.adp.

3.1 Constructing a test collection

CiteULike offers daily dumps of their core database.2 We used the dump of November 2, 2007 as the basis for our experiments. A dump contains all information on which articles were posted by whom, with which tags, and at what point in time. It does not, however, contain any of the other metadata described above, so we crawled this metadata ourselves from the CiteULike website using the article IDs. We collected the following five types of metadata (a sketch of one possible record layout follows the list):

Topic-related metadata including all metadata descriptive of the article's topic, such as the title and the publication information.

Person-related metadata such as the authors of the article as well as the editors of the journal or conference proceedings it was published in.

Temporal metadata such as the year and, if available, month of the article's publication.

Miscellaneous metadata such as the article type. The extracted data also includes the publisher details, volume and number information, and the number of pages. DOI and ISSN/ISBN identifiers were also extracted, as well as URLs pointing to the online whereabouts of the article.

User-specific metadata including the tags assigned by each user, comments by users on an article, and reading priorities.
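As an illustration of how these five metadata types could be grouped into a single article record, the following minimal Python sketch shows one possible layout. The field names are our own illustrative assumptions, not the actual CiteULike schema or the format of our crawl.

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class ArticleRecord:
        # Hypothetical layout of one crawled article; all field names
        # are illustrative assumptions.
        article_id: str

        # Topic-related metadata
        title: str = ""
        publication: str = ""  # journal or conference the article appeared in

        # Person-related metadata
        authors: list = field(default_factory=list)
        editors: list = field(default_factory=list)

        # Temporal metadata
        year: Optional[int] = None
        month: Optional[int] = None  # only when available

        # Miscellaneous metadata
        article_type: str = ""
        publisher: str = ""
        volume: Optional[str] = None
        number: Optional[str] = None
        pages: Optional[str] = None
        doi: Optional[str] = None
        issn_isbn: Optional[str] = None
        urls: list = field(default_factory=list)  # online whereabouts

        # User-specific metadata, keyed by the posting user
        tags: dict = field(default_factory=dict)        # user_id -> list of tags
        comments: dict = field(default_factory=dict)    # user_id -> comment text
        priorities: dict = field(default_factory=dict)  # user_id -> reading priority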
As CiteULike offers the possibility of users setting up groups that connect users that share similar academic and topical interests, for each group we collected the group name, a short textual description, and a list of its members.

3.2 Characteristics of the collection

After crawling and data clean-up, our collection contained a total of 1,012,898 different postings, where we define a posting as a user-item pair in the database, i.e. an item that was added to a CiteULike user profile. These postings comprised 803,521 unique articles posted by 25,375 unique users using 232,937 unique tags. Metadata was available for 543,433 of the 803,521 articles.3 CiteULike contained 1,243 different groups, with 2,301 different users being a member of one or more groups, corresponding to 9.1% of all users. We did not crawl the full text of publications, but 33.7% of the articles included the abstract in their metadata.

4. RECOMMENDING USING CITEULIKE

McNee identifies eight different tasks that a recommender system could fulfill in a digital library environment [10]. Not all of these tasks are equally applicable in the CiteULike environment, and not all of them can be fulfilled using the collection we created. However, a social reference manager could arguably fulfill additional, new tasks not applicable in a digital library environment. In this paper we focus on the task of generating lists of related papers based on a user's reference library. This task corresponds most closely to McNee's tasks of Fill Out Reference Lists and Maintain Awareness [10]. In contrast to McNee's approach of using citations, we use the direct user-item preference relations to generate our recommendations.

4.1 Experimental setup

In order to evaluate different recommender algorithms on the CiteULike data and to compare the usefulness of the different information we have available, we need a proper framework for experimentation and evaluation. Recommender systems evaluation, and the differences with IR evaluation, have been addressed by, among others, Herlocker et al. [4, 5], the latter identifying six discernible recommendation tasks. The recommendation task we evaluate here is the "Find Good Items" task,4 where users are provided with a ranked list of recommended items, based on their personal profile.

Following common practice in recommender system evaluation [4, 5, 10], to ensure that we would be able to generate reliable recommendations, we select a realistic subset of the CiteULike data set by only keeping the users who have added 20 items or more to their personal profile. In addition, we filter out all articles that occur only once, since these items do not contain sufficiently reliable ties to the rest of the data set, and thus would only introduce noise. A sketch of this filtering step is given below.

3The overwhelming majority of the articles with missing metadata were spam articles. How we detected this is beyond the focus of this paper.
4Also known as Top-N recommendation.
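To make this pre-processing step concrete, the following is a minimal Python sketch of the subset selection, assuming the postings are available as (user, item) pairs; the 20-item threshold and the removal of items occurring only once follow the text above, while the function and variable names are our own.

    from collections import Counter

    MIN_USER_ITEMS = 20  # keep users with at least 20 postings

    def filter_postings(postings):
        """postings: iterable of (user_id, item_id) pairs.
        Returns the subset used for experimentation."""
        postings = list(postings)

        # Drop items that occur only once: they have no reliable ties
        # to the rest of the data set and would only introduce noise.
        item_counts = Counter(item for _, item in postings)
        postings = [(u, i) for u, i in postings if item_counts[i] > 1]

        # Keep only users who have added 20 items or more to their profile.
        user_counts = Counter(user for user, _ in postings)
        postings = [(u, i) for u, i in postings if user_counts[u] >= MIN_USER_ITEMS]

        return postings

Note that removing rare items can push some users below the 20-item threshold (and vice versa); whether to apply the two filters once, in this order, or to iterate them to a fixed point is a design choice the text does not specify.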