Www2010·Fu! Paper Aprl!26-30· Raleigh·Nc·USA Context-aware Citation recommendation Jian pei Daniel Kifer Penn State University Simon Fraser University Penn State University ghe ist psu. edu jpi@ cs. sfu.ca dan+www10@cse psu. edu Prasenjit Mitra C. Lee Giles Penn State University Penn State Univer mitra ist psu. edu giles ist psu.e ABSTRACT nk-PLSA-LDA [? simplifies+I The missing lin When you write papers, how many times do you want to nake some citations at a place but you are not sure which models for papers to cite? Do you wish to have a recommendation here citations are 2. Mixed membership models of system which can recommend a small number of good can- a sample from a probability scientific publications(2004) didates for every place that you want to make some cita- distribution associated with tions? In this paper, we present our initiative of building a topIc. context-aware citation recommendation system. High qual ity citation recommendation is challenging: not only should Figure 1: A manuscript with citation placeholde the citations recommended be relevant to the paper under and recommended citations. The text is from Sec- composition, but also should match the local contexts of the. tion 2.2 places citations are made. Moreover, it is far from trivial to model how the topic of the whole paper and the contexts of the citation places should affect the selection and ranking of citations should be added. In order to fill in those cita- citations.To tackle the problem, we develop a context-aware tion placeholders, one needs to search the relevant literature approach. The core idea is to design a novel non-parametric and find a small number of proper citations. Searching for relevance between a citation context and a document, Our proper citations is often a labor-intensive task in research pa- proach can recommend citations for a context effectively. ot familiar with the very extensive literature. Moreover Moreover, it can recommend a set of citations for a paper the volume of research undertaken and information avail- with high quality. We implement a prototype system in CiteSeerX. An extensive empirical evaluation in the Cite- able make citation search hard even for senior researchers Do you wish to have a recommendation system which can the effectiveness and the scalability of our approach recommend a small number of good candidates for every place that you want to make some citations? High quality citation recommendation is a challenging problem for many Categories and Subject Descriptors reasons 1. 3. 3 Information Storage and Retrieval Information For each citation placeholder, we can collect the words sur- earch and retrieval rounding as the contert of the placeholder. One may think we can use some key words in the context of a placeholder General terms to search a literature search engine like Google scholar or CiteSeerx to obtain a list of documents as the candidates Algorithms, Design, Experimentation for citations. However, such a method, based on keyword matching, is often far from satisfactory. For example, us- Keywords ing query "frequent itemset mining one may want to search Bibliometrics, Context, Gleasons Theorem, Recommender for the first paper proposing the concept of frequent itemset Systems Inning, e.g. However, Google Scholar returns a paper about frequent closed itemset mining published in 2000 as the first result, and a paper on privacy preserving frequent 1. INTRODUCTION itemset mining as the second choice. [1 does not appear in When you write papers, how many times do you want to the first page of the results. CiteSeerX also lists a paper make some citations at a place but you are not sure which on privacy preserving frequent itemset mining as the first apers to cite? For example, the left part of Figure 1 shows result. CiteSeerX fails to return [1] on the first page, either a segment of a query manuscript containing some citation One may wonder, as we can model citations as links from placeholders(placeholders for short)marked as"?", where citing documents to cited ones, can we use graph-based link prediction techniques to recommend citations? Graph-based mittee(Iw3C2). Distribution of these papers is limited to classroom use, link prediction techniques often require a user to provide and pew 2010, April 26-30, 2010, Raleigh, North Carolina, USA. sample citations for each placeholder, and thus shifts much of the burden to the user. And, graph-based link prediction ACM978-1-60558-799-8/10/04
Context-aware Citation Recommendation Qi He Penn State University qhe@ist.psu.edu Jian Pei Simon Fraser University jpei@cs.sfu.ca Daniel Kifer Penn State University dan+www10@cse.psu.edu Prasenjit Mitra Penn State University pmitra@ist.psu.edu C. Lee Giles Penn State University giles@ist.psu.edu ABSTRACT When you write papers, how many times do you want to make some citations at a place but you are not sure which papers to cite? Do you wish to have a recommendation system which can recommend a small number of good candidates for every place that you want to make some citations? In this paper, we present our initiative of building a context-aware citation recommendation system. High quality citation recommendation is challenging: not only should the citations recommended be relevant to the paper under composition, but also should match the local contexts of the places citations are made. Moreover, it is far from trivial to model how the topic of the whole paper and the contexts of the citation places should affect the selection and ranking of citations. To tackle the problem, we develop a context-aware approach. The core idea is to design a novel non-parametric probabilistic model which can measure the context-based relevance between a citation context and a document. Our approach can recommend citations for a context effectively. Moreover, it can recommend a set of citations for a paper with high quality. We implement a prototype system in CiteSeerX. An extensive empirical evaluation in the CiteSeerX digital library against many baselines demonstrates the effectiveness and the scalability of our approach. Categories and Subject Descriptors H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval General Terms Algorithms, Design, Experimentation Keywords Bibliometrics, Context, Gleason’s Theorem, Recommender Systems 1. INTRODUCTION When you write papers, how many times do you want to make some citations at a place but you are not sure which papers to cite? For example, the left part of Figure 1 shows a segment of a query manuscript containing some citation placeholders (placeholders for short) marked as “[?]”, where Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others. WWW 2010, April 26–30, 2010, Raleigh, North Carolina, USA. ACM 978-1-60558-799-8/10/04. Figure 1: A manuscript with citation placeholders and recommended citations. The text is from Section 2.2. citations should be added. In order to fill in those citation placeholders, one needs to search the relevant literature and find a small number of proper citations. Searching for proper citations is often a labor-intensive task in research paper composition, particularly for junior researchers who are not familiar with the very extensive literature. Moreover, the volume of research undertaken and information available make citation search hard even for senior researchers. Do you wish to have a recommendation system which can recommend a small number of good candidates for every place that you want to make some citations? High quality citation recommendation is a challenging problem for many reasons. For each citation placeholder, we can collect the words surrounding as the context of the placeholder. One may think we can use some keywords in the context of a placeholder to search a literature search engine like Google Scholar or CiteSeerX to obtain a list of documents as the candidates for citations. However, such a method, based on keyword matching, is often far from satisfactory. For example, using query “frequent itemset mining” one may want to search for the first paper proposing the concept of frequent itemset mining, e.g. [1]. However, Google Scholar returns a paper about frequent closed itemset mining published in 2000 as the first result, and a paper on privacy preserving frequent itemset mining as the second choice. [1] does not appear in the first page of the results. CiteSeerX also lists a paper on privacy preserving frequent itemset mining as the first result. CiteSeerX fails to return [1] on the first page, either. One may wonder, as we can model citations as links from citing documents to cited ones, can we use graph-based link prediction techniques to recommend citations? Graph-based link prediction techniques often require a user to provide sample citations for each placeholder, and thus shifts much of the burden to the user. And, graph-based link prediction WWW 2010 • Full Paper April 26-30 • Raleigh • NC • USA 421
Www2010·Fu! Paper Aprl!26-30· Raleigh·Nc·USA text is a snippet of the text around a citation or a place- holder. The core idea is to design a novel non-parametric probabilistic model which can measure the context-based relevance between a citation context and a document. Our approach can recommend citations for a context effectively, which is the innovative part of the paper. Moreover, it can recommend a set of citations for a paper with high quality. In addition to the theoretical contribution, we also imple- ment a prototype system in CiteSeerX. Figure 2 shows a real example in our demo system, where a user submits an cita- tion context with a placeholder and the title/atract. Among the top 6 results, the 1st, 4th, 5th and 6th ones are proper recommendations relevant and new to the manuscript; the 2nd one is the original citation: the 3rd one is a similar but irrelevant recommendation We also present an extensive empirical evaluation in the CiteSeerX digital library against many baselines. Our em- pirical study demonstrates the effectiveness and the scala bility of our approach. Figure 2: a demo of our context-aware citation rec- The rest of this paper is organized as follows. We discuss ommendation system related work in Section 2. We formalize the problem in Sec- tion 3. We discuss candidate retrieval methods in ection 4 We present our context-aware relevance model for ranking in Section 5. We report our empirical evaluation in Section 6. methods may encounter difficulties to make proper citations Section 7 concludes the paper. across multiple communities because a community may not be aware of the related work in some other community a detailed natural language processing analysis of the full- 2. RELATED WORK text for each document may help to make good citation In this section, we discuss the related work on document ommendations, but unfortunately has to incur serious scala- ecommendation, topic-based link prediction, and analysis bility issues. There may be hundreds of thousands of papers of citation contexts. that need to be compared with the given manuscript. Thus the natural language processing methods cannot be straight 2.1 Document Recommendation Systems forwardly scaled up to large digital libraries and electronic There are some previous efforts on recommending a bib- document archives liography list for a manuscript, or recommending papers to The recommended citations for placeholders should satisfy reviewers. The existing techniques generally rely on a user two requirements. First, a citation recommendation needs profile or a partial list of citations. to be explainable. Our problem is different from generat. Basu et al. [4 focused on recommending conference pa- ing a bibliography list for a paper where a recommendation er submissions to reviewers based on paper abstracts and should discuss some ideas related to the query manuscript. reviewer profiles. Reviewer profiles are extracted from the In our problem, a recommended citation for a placeholder Web. This is Web. This is a specific step in a more general problem known in a query manuscript should be relevant and authoritative as the Reviewer Assignment Problem, surveyed by Wang et to the particular idea that is being discussed at that point al. 26]. Chandrasekaran et al. [6 presented a technique to in the query manuscript. Different placeholders in the same recommend technical papers to readers whose profile infor- uery manuscript may need different citations. mation is stored in CiteSeer. A user's publication records Second, the recommendations for a manuscript need to are used to model her profile. User profiles and documents ake into account the various ideas discussed in the manuscript. are presented as hierarchial concept trees with predefined For example, suppose we are asked to recommend citations concepts from the ACM Computing Classification System. for a query manuscript in which one section discusses mix- The similarity between a user profile and a document is mea- ture models and another section discusses nonparametric sured by the weighed tree edit distance. Our work can also distributions. Citations to nonparametric mixture moc be seen as a profile-based system, where a query manuscript may be ranked high since they are related to both sections is a profile. However, our system uses richer information in the same manuscript than just predefined concepts or paper abstracts. In summary, citation recommendation for place Shaparenko and Joachims 22 proposed a technique based and for the overall bibliography) is a challenging task. The on language modeling and convex optimization to recom- recommendations need to consider many factors: the query mend documents. For a large corpus, the k-most similar manuscript in whole, the contexts of the citation placehold documents based on cosine similarity are retrieved. How ers individually and collectively, and the literature articles ever, similarity based on full-text is too slow for large digital individually. We need to construct an elegant mathematical libraries. Furthermore, according to our experiments, simi- framework for relevance and develop an efficient and scalable larity based on document abstract results in poor recall( Table 1 In this paper, we present an effective solution to the prob- Some previous studies recommend citations for a manuscript m of citation recommendations for placeholders in query already containing a partial list of citations. Specifically, manuscripts. Our approach is context-aware, where a con- given a document d and its partial citation list r, those
Figure 2: A demo of our context-aware citation recommendation system. methods may encounter difficulties to make proper citations across multiple communities because a community may not be aware of the related work in some other community. A detailed natural language processing analysis of the fulltext for each document may help to make good citation recommendations, but unfortunately has to incur serious scalability issues. There may be hundreds of thousands of papers that need to be compared with the given manuscript. Thus, the natural language processing methods cannot be straightforwardly scaled up to large digital libraries and electronic document archives. The recommended citations for placeholders should satisfy two requirements. First, a citation recommendation needs to be explainable. Our problem is different from generating a bibliography list for a paper where a recommendation should discuss some ideas related to the query manuscript. In our problem, a recommended citation for a placeholder in a query manuscript should be relevant and authoritative to the particular idea that is being discussed at that point in the query manuscript. Different placeholders in the same query manuscript may need different citations. Second, the recommendations for a manuscript need to take into account the various ideas discussed in the manuscript. For example, suppose we are asked to recommend citations for a query manuscript in which one section discusses mixture models and another section discusses nonparametric distributions. Citations to nonparametric mixture models may be ranked high since they are related to both sections in the same manuscript. In summary, citation recommendation for placeholders (and for the overall bibliography) is a challenging task. The recommendations need to consider many factors: the query manuscript in whole, the contexts of the citation placeholders individually and collectively, and the literature articles individually. We need to construct an elegant mathematical framework for relevance and develop an efficient and scalable implementation. In this paper, we present an effective solution to the problem of citation recommendations for placeholders in query manuscripts. Our approach is context-aware, where a context is a snippet of the text around a citation or a placeholder. The core idea is to design a novel non-parametric probabilistic model which can measure the context-based relevance between a citation context and a document. Our approach can recommend citations for a context effectively, which is the innovative part of the paper. Moreover, it can recommend a set of citations for a paper with high quality. In addition to the theoretical contribution, we also implement a prototype system in CiteSeerX. Figure 2 shows a real example in our demo system, where a user submits an citation context with a placeholder and the title/atract. Among the top 6 results, the 1st, 4th, 5th and 6th ones are proper recommendations relevant and new to the manuscript; the 2nd one is the original citation; the 3rd one is a similar but irrelevant recommendation. We also present an extensive empirical evaluation in the CiteSeerX digital library against many baselines. Our empirical study demonstrates the effectiveness and the scalability of our approach. The rest of this paper is organized as follows. We discuss related work in Section 2. We formalize the problem in Section 3. We discuss candidate retrieval methods in Section 4. We present our context-aware relevance model for ranking in Section 5. We report our empirical evaluation in Section 6. Section 7 concludes the paper. 2. RELATED WORK In this section, we discuss the related work on document recommendation, topic-based link prediction, and analysis of citation contexts. 2.1 Document Recommendation Systems There are some previous efforts on recommending a bibliography list for a manuscript, or recommending papers to reviewers. The existing techniques generally rely on a user profile or a partial list of citations. Basu et al. [4] focused on recommending conference paper submissions to reviewers based on paper abstracts and reviewer profiles. Reviewer profiles are extracted from the Web. This is a specific step in a more general problem known as the Reviewer Assignment Problem, surveyed by Wang et al. [26]. Chandrasekaran et al. [6] presented a technique to recommend technical papers to readers whose profile information is stored in CiteSeer. A user’s publication records are used to model her profile. User profiles and documents are presented as hierarchial concept trees with predefined concepts from the ACM Computing Classification System. The similarity between a user profile and a document is measured by the weighed tree edit distance. Our work can also be seen as a profile-based system, where a query manuscript is a profile. However, our system uses richer information than just predefined concepts or paper abstracts. Shaparenko and Joachims [22] proposed a technique based on language modeling and convex optimization to recommend documents. For a large corpus, the k-most similar documents based on cosine similarity are retrieved. However, similarity based on full-text is too slow for large digital libraries. Furthermore, according to our experiments, similarity based on document abstract results in poor recall (cf. Table 1). Some previous studies recommend citations for a manuscript already containing a partial list of citations. Specifically, given a document d and its partial citation list r , those WWW 2010 • Full Paper April 26-30 • Raleigh • NC • USA 422
Www2010·Ful! Paper Aprl!26-30· Raleigh·Nc·USA studies try to recover the complete citation list denoted by automatically learn the motivation functions(e. g, compare, rDr. Collaborative filtering techniques have been widely contrast, use of the citation, etc. ) of citations by using applied. For example, MeNee et built various rat- tation context based features. The second group tries to matrices including a author-citation matrix, a paper- enhance topical similarity between citations. For example citation matrix, and a co-citation matrix. Pap Huang et al. [11] observed that citation contexts can effec o-cited often with citations in r are potential candidates. tively help to avoid "topic drifting" in clustering citations Zhou et al.[27] propagated the positive labels (i.e, the ex- into topics. Ritchie[21 extensively examined the impact of isting citations in r) in multiple graphs such as the paper- various citation context extraction methods on the perfo paper citation graph, the author-paper bipartite graph, and mance of information retrieval the rest documents for a given testing document in a seml- The previous studies clearly indicate that citation con- the paper-venue bipartite graph, and learned the labels of upervised manner. Torres et aL.[25]used a combination of paper and clearly reflect the information needs (i.e,moti- context-based and collaborative filtering algorithms to build vations) of citations. Our work moves one step ahead to a recommendation system, and reported that the hybrid al- recommend citations according to contexts gorithms performed better than individual ones. Strohman et al.(23] experimented with a citation recommendation sys- 3. PROBLEM DEFINITION AND ARCHITEC. tem where the relevance between two documents is mea sured by a linear combination of text features and citation TURE graph features. They concluded that similarity between bib In this section, we formulate the problem of context-based biographies and Katz distance [15 are the most important citation recommendation, and discuss the preprocessing step features. Tang and Zhang 24] explored recommending cita- tions for placeholders in a very limited extent. In particular, 3.1 Problem Definition a user must provide a bibliography with papers relevant to Let d be a document, and D be a document corpus. ach citation context, as this information is used to compute features in the hidden layer of a Restricted Boltzmann Ma- DEFINITIon 3. 1. In a document d, a conteat c is a bag chine before predictions can be made. We feel this require- of words. The global conteat is the title and abstract of ment negates the need for requesting recommendations. d. The local contert is the tert surrounding a citation or In our study, we do not require a partial list of citation placeholder:. If document di cites document d2, the local since creating such a list for each placeholder shifts most of contert of this citation is called an out-link conteat with the burden on the user. Thus. we can recommend citations respect to di and an in-link conteact writh respect to d both globally (i.e, for the bibliography)and also locally (i.e innovative part of this paper) A user can submit either a manuscript (i.e, a global con- text and a set of out-link local contexts)or a few sentences 2.2 Topic-based Citation Link Prediction (i.e, an out-link local context)as the query to our system Topic models are unsupervised techniques that analyze There are two types of citation recommendation tasks, which the text of a large document corpus and provide a low. happen in different application scenarios dimensional representation of documents in terms of au tomatically discovered and comprehensible "topics". Topic DEFINITION 3.2 (GLOBAL RECOMMENDATION) models have been extended to handle citations as well as query manuscript d without a bibliography, a global recof a text, and thus can be used to predict citations for bibliogra mendation is a ranked list of citations in a corpus D that hies. The aforementioned work of Tang and Zhang 24 are recommended as candidates for the bibliography of d fits in this framework. Nallapati et al. [18 introduced a Note that different citation contexts in d may express dif- called Pairwise-Link-LDA which models the presence sence of a link between every pair of documents and trent information needs. The bibliography candidates vided by a global recommendation should collectively satisfy does not scale to large digital libraries. Nallapati et the citation information needs of all out-link local contexts al. [18 also introduced a simpler model that is similar to the work of Cohn and Hofmann 7 and Erosheva et al. [9 in the query manuscript d Here citations are modeled as a sample from a probability DEFINITION 3.3(LOCAL RECOMMENDATION ) Given an distribution associated with a topic. Thus, a paper can be out-link local contert c. with respect to d, a local recom associated with topics when it is viewed as a citation. It mendation is a ranked list of citations in a corpus D that an also be associated with topics from the analysis of its are recommended as candidates for the placeholder associ- text. However, there is no mechanism to enforce consistency between the topics assigned in those two ways. els require a long training process For local recommendations, the query manuscript d is an such as Gibbs Sampling or variational inference. In addition te.tional input and it is not required to already contain a because they are typically trained using iterative technique to this, they must be retrained as new documents are added To the best of our knowledge, global recommendations 2.3 Citation Context Analysis are only tackled by document-citation graph methods (e. g. 23)and topic models(e. g,24, 18). However, the contert- The prior work on analyzing citation contexts mainly be- aware approaches have not been considered for global longs to two groups. The first group tries to understand local recommendations(except in a limited case where a the motivation functions of an existing citation. For ex- bibliography with papers relevant to each citation context is ample, Aya et al. 2 built a machine learning algorithm to required as the input)
studies try to recover the complete citation list denoted by r ⊃ r . Collaborative filtering techniques have been widely applied. For example, McNee et al. [16] built various rating matrices including a author-citation matrix, a papercitation matrix, and a co-citation matrix. Papers which are co-cited often with citations in r are potential candidates. Zhou et al. [27] propagated the positive labels (i.e., the existing citations in r ) in multiple graphs such as the paperpaper citation graph, the author-paper bipartite graph, and the paper-venue bipartite graph, and learned the labels of the rest documents for a given testing document in a semisupervised manner. Torres et al. [25] used a combination of context-based and collaborative filtering algorithms to build a recommendation system, and reported that the hybrid algorithms performed better than individual ones. Strohman et al. [23] experimented with a citation recommendation system where the relevance between two documents is measured by a linear combination of text features and citation graph features. They concluded that similarity between bibliographies and Katz distance [15] are the most important features. Tang and Zhang [24] explored recommending citations for placeholders in a very limited extent. In particular, a user must provide a bibliography with papers relevant to each citation context, as this information is used to compute features in the hidden layer of a Restricted Boltzmann Machine before predictions can be made. We feel this requirement negates the need for requesting recommendations. In our study, we do not require a partial list of citations, since creating such a list for each placeholder shifts most of the burden on the user. Thus, we can recommend citations both globally (i.e., for the bibliography) and also locally (i.e., for each placeholder, the innovative part of this paper). 2.2 Topic-based Citation Link Prediction Topic models are unsupervised techniques that analyze the text of a large document corpus and provide a lowdimensional representation of documents in terms of automatically discovered and comprehensible “topics”. Topic models have been extended to handle citations as well as text, and thus can be used to predict citations for bibliographies. The aforementioned work of Tang and Zhang [24] fits in this framework. Nallapati et al. [18] introduced a model called Pairwise-Link-LDA which models the presence or absence of a link between every pair of documents and thus does not scale to large digital libraries. Nallapati et al. [18] also introduced a simpler model that is similar to the work of Cohn and Hofmann [7] and Erosheva et al. [9]. Here citations are modeled as a sample from a probability distribution associated with a topic. Thus, a paper can be associated with topics when it is viewed as a citation. It can also be associated with topics from the analysis of its text. However, there is no mechanism to enforce consistency between the topics assigned in those two ways. In general, topic models require a long training process because they are typically trained using iterative techniques such as Gibbs Sampling or variational inference. In addition to this, they must be retrained as new documents are added. 2.3 Citation Context Analysis The prior work on analyzing citation contexts mainly belongs to two groups. The first group tries to understand the motivation functions of an existing citation. For example, Aya et al. [2] built a machine learning algorithm to automatically learn the motivation functions (e.g., compare, contrast, use of the citation, etc.) of citations by using citation context based features. The second group tries to enhance topical similarity between citations. For example, Huang et al. [11] observed that citation contexts can effectively help to avoid “topic drifting” in clustering citations into topics. Ritchie [21] extensively examined the impact of various citation context extraction methods on the performance of information retrieval. The previous studies clearly indicate that citation contexts are a good summary of the contributions of a cited paper and clearly reflect the information needs (i.e., motivations) of citations. Our work moves one step ahead to recommend citations according to contexts. 3. PROBLEM DEFINITION AND ARCHITECTURE In this section, we formulate the problem of context-based citation recommendation, and discuss the preprocessing step. 3.1 Problem Definition Let d be a document, and D be a document corpus. Definition 3.1. In a document d, a context c is a bag of words. The global context is the title and abstract of d. The local context is the text surrounding a citation or placeholder. If document d1 cites document d2, the local context of this citation is called an out-link context with respect to d1 and an in-link context with respect to d2. A user can submit either a manuscript (i.e., a global context and a set of out-link local contexts) or a few sentences (i.e., an out-link local context) as the query to our system. There are two types of citation recommendation tasks, which happen in different application scenarios. Definition 3.2 (Global Recommendation). Given a query manuscript d without a bibliography, a global recommendation is a ranked list of citations in a corpus D that are recommended as candidates for the bibliography of d. Note that different citation contexts in d may express different information needs. The bibliography candidates provided by a global recommendation should collectively satisfy the citation information needs of all out-link local contexts in the query manuscript d. Definition 3.3 (Local Recommendation). Given an out-link local context c∗ with respect to d, a local recommendation is a ranked list of citations in a corpus D that are recommended as candidates for the placeholder associated with c∗. For local recommendations, the query manuscript d is an optional input and it is not required to already contain a representative bibliography. To the best of our knowledge, global recommendations are only tackled by document-citation graph methods (e.g., [23]) and topic models (e.g., [24, 18]). However, the contextaware approaches have not been considered for global or local recommendations (except in a limited case where a bibliography with papers relevant to each citation context is required as the input). WWW 2010 • Full Paper April 26-30 • Raleigh • NC • USA 423
Www2010·Ful! Paper Aprl!26-30· Raleigh·Nc·USA authors (b)author (a) single-context-based (b) hybrid method Figure 3: Context-oblivious methods for gl rec- Figure 4: Context-aware methods. The single- context-based method is for local recommendation with the absence of query manuscript di. The h 3.2 Preprocessing brid method is for global recommendation and local Our proposed context-based citation recommendation sys- recommendation with the presence of d1, where the tem can take two types of inputs, a query manuscript d, or final candidate set is the union of the candidate sets just a single out-link local context c. We preprocess the derived from each local out-link context in d1 query manuscript di by extracting its global context (i.e he title and abstract )and all of its out-link local contexts Extracting the local contexts from di is not a trivial task 3. The papers cited by documents already in a candidate Ritchie [21 conducted extensive experiments to study the set, generated by some other method (e. g, GN or Au- impact of different lengths of local citation contexts on in- thor ). We call this method CitHop formation retrieval performance, and concluded that fixed 4. The documents written by the authors whose papers window contexts(e. g. size of 100 words) are simple and rea- re already in a candidate set generated by some other onably effective. Thus, before removing all stop words, for ethod. We call this method AuthHop each placeholder we extract the citation context by taking 50 words before and 50 words after the placeholder. This Note that retrieval based on the similarity between the entire text content of documents would be too slow Thus and 20 citations, preprocessing takes on average less than these methods appear to provide a reasonable alternative 0.1 seconds However, the body of an academic paper covers many ideas The critical steps of the system are:(1)quickly retrievin that would not fit in its abstract and so many relevant doc- a large candidate set which has good coverage over the pos- uments will not be retrieved. Retrieval by author similarity sible citations, and(2) for each placeholder associated with have a broad range of research interests and if CitHop the citations by relevance and returning the top K. The or AuthHop is used to further expand the candidate se next two sections focus on these critical step Context-aware methods can avoid such probler 4.2 Context-aware Methods 4. THE CANDIDATE SET The contert-aware methods help improve coverage for rec- Citation recommendation systems, including our syste ommendations by considering local contexts in the query and especially those that compute intricate graph-based fea- manuscript. In a query manuscript di, for each context tures[ 23, 27], first quickly retrieve a large candidate set of we can retrieve papers. Then, features are computed for papers in this set and ranking is performed. This is done solely for scalabilit 5. The top N papers whose in-link contexts are most Techniques for retrieving this candidate set are illustrated ilar to c,. We call this method LN(e. g, L100 in Figures 3 and 4. Here, the query manuscript is rel 6. The papers containing the top-N out-link contexts most sented by a partially shaded circle. Retrieved documents similar to c.(these papers cite papers retrieved b are represented as circles with numbers corresponding to the LN). When used in conjunction with LN, we call this numbered items in the lists of retrieval methods discusse method LCN (e. g, LC100) below. In this section. we discuss two kinds of methods for retrieving a candidate set. Those methods will be evaluated We found that method LCN is needed because frequently a document from a digital library may have an out-link con- text that describes how it differs from related work(e. g 4.1 Context-oblivious Methods prior work required full bibliographies but we do not") The contert-oblivious methods do not consider local con- Thus, while an out-link context usually describes a cited texts in a query manuscript that has no bibliography. For a paper, sometimes it may also describe the citing paper, and query manuscript di, we can retrieve this description may be what best matches an out-link con- text c. from a query manuscript 1. The top N documents with abstract and title m The above 6 methods can be combined in some obvious similar to di. We call this method GN (e.g, G100, ways. For example,(L100+CitHop)+G1000 is the can- G1000 didate set formed by the following process: for each context c, in a query manuscript dl, add the top 100 documents 2. The documents that share authors with di. We call with in-link local context most similar to c. (i.e. L100)and this method Author all of their citations (i.e. CitHop)and then add the top
Figure 3: Context-oblivious methods for global recommendation. 3.2 Preprocessing Our proposed context-based citation recommendation system can take two types of inputs, a query manuscript d1 or just a single out-link local context c∗. We preprocess the query manuscript d1 by extracting its global context (i.e. the title and abstract) and all of its out-link local contexts. Extracting the local contexts from d1 is not a trivial task. Ritchie [21] conducted extensive experiments to study the impact of different lengths of local citation contexts on information retrieval performance, and concluded that fixed window contexts (e.g. size of 100 words) are simple and reasonably effective. Thus, before removing all stop words, for each placeholder we extract the citation context by taking 50 words before and 50 words after the placeholder. This preprocessing is efficient. For a PDF document of 10 pages and 20 citations, preprocessing takes on average less than 0.1 seconds. The critical steps of the system are: (1) quickly retrieving a large candidate set which has good coverage over the possible citations, and (2) for each placeholder associated with an out-link local context and for the bibliography, ranking the citations by relevance and returning the top K. The next two sections focus on these critical steps. 4. THE CANDIDATE SET Citation recommendation systems, including our system and especially those that compute intricate graph-based features [23, 27], first quickly retrieve a large candidate set of papers. Then, features are computed for papers in this set and ranking is performed. This is done solely for scalability. Techniques for retrieving this candidate set are illustrated in Figures 3 and 4. Here, the query manuscript is represented by a partially shaded circle. Retrieved documents are represented as circles with numbers corresponding to the numbered items in the lists of retrieval methods discussed below. In this section, we discuss two kinds of methods for retrieving a candidate set. Those methods will be evaluated in Section 6.2. 4.1 Context-oblivious Methods The context-oblivious methods do not consider local contexts in a query manuscript that has no bibliography. For a query manuscript d1, we can retrieve 1. The top N documents with abstract and title most similar to d1. We call this method GN (e.g., G100, G1000). 2. The documents that share authors with d1. We call this method Author. Figure 4: Context-aware methods. The singlecontext-based method is for local recommendation with the absence of query manuscript d1. The hybrid method is for global recommendation and local recommendation with the presence of d1, where the final candidate set is the union of the candidate sets derived from each local out-link context in d1. 3. The papers cited by documents already in a candidate set, generated by some other method (e.g., GN or Author). We call this method CitHop. 4. The documents written by the authors whose papers are already in a candidate set generated by some other method. We call this method AuthHop. Note that retrieval based on the similarity between the entire text content of documents would be too slow. Thus, these methods appear to provide a reasonable alternative. However, the body of an academic paper covers many ideas that would not fit in its abstract and so many relevant documents will not be retrieved. Retrieval by author similarity will add many irrelevant papers, especially if the authors have a broad range of research interests and if CitHop or AuthHop is used to further expand the candidate set. Context-aware methods can avoid such problems. 4.2 Context-aware Methods The context-aware methods help improve coverage for recommendations by considering local contexts in the query manuscript. In a query manuscript d1, for each context c∗, we can retrieve: 5. The top N papers whose in-link contexts are most similar to c∗. We call this method LN (e.g., L100). 6. The papers containing the top-N out-link contexts most similar to c∗ (these papers cite papers retrieved by LN). When used in conjunction with LN, we call this method LCN (e.g., LC100). We found that method LCN is needed because frequently a document from a digital library may have an out-link context that describes how it differs from related work (e.g., “prior work required full bibliographies but we do not”). Thus, while an out-link context usually describes a cited paper, sometimes it may also describe the citing paper, and this description may be what best matches an out-link context c∗ from a query manuscript. The above 6 methods can be combined in some obvious ways. For example, (L100+CitHop)+G1000 is the candidate set formed by the following process: for each context c∗ in a query manuscript d1, add the top 100 documents with in-link local context most similar to c∗ (i.e. L100) and all of their citations (i.e. CitHop) and then add the top WWW 2010 • Full Paper April 26-30 • Raleigh • NC • USA 424
Www2010·Fu! Paper Aprl!26-30· Raleigh·Nc·UsA 1000 documents with abstract and title most similar to d, d is then pa(c)=Trace(Tacc)=c Tac. Note that similar (ie.G1000) atomic concepts(as measured by the dot product)will have similar relevance scores 5. MODELING CONTEXTBASED CITATION Now, the probability distribution characterized by Glea son's theorem is not a generative distribution-it cannot be RELEVANCE used to sample concepts. Instead, it is a distribution over In this section, we propose a non-parametric probabilis. yes/no answers(e.g. is concept c relevant or not? ).Thus tic model to measure context-based(and overall) relevance to estimate Td document d we will need some (un- between a manuscript and a candidate citation, for rank known) generative distribution Pgen over unit vectors(cor ing retrieved candidates. To improve scalability, we use an cepts). Our evidence about the properties of Td come fror pproximate inference technique which yields a closed form the following process solution. Our model is general and simple so that it can be cepts c1,..., Ck and these concepts are then independently used to efficiently and effectively measure the similarity be- judged to be relevant to d. This happens with probability tween any two documents with respect to certain contexts 1 Pgen(ci)pa(ci). We seek to find a density matrix Ta r concepts in information retrieval that maximizes the likelihood. The log likelihood is 5.1 Context-Based relevance model Recall that a query manuscript di can have a global con- ∑ogp()+∑log(re) text cI(e. g, a title and abstract, which describes the prob- lem to be addressed) and the out-link local contexts c2 Now, if there is only one concept (i.e. k= 1), then the maximum likelihood estimator is easy to compute: it is Td= e. g, text surrounding citations)which compactly express deas that may be present in related work. a document d2 cici. In the general case, however, numerical methods are in an existing corpus D has a global context b1(e.g needed 3. The computational cost of estimating Td can be tle and abstract), and local in-link contexts b proposition: ext that is used by papers citing d)which compactly ex PROPOSITION 5. 2. The density matrit Td can be repre- press ideas covered in d2 sentedas 2is,t? where the t, are a set of at most r or. In this section, we describe a principled and consistent way thogonal column vectors with 2(ti- ti)= 1, and r is the of measuring sim(di, d2), defined as the overall relevance of dimension of the space spanned by the ci(the number of lin- d2 to di and sim(d1, d2; c), defined as the relevance of da early independent conterts). There are O(r )parameters to determine and numerical (iterative) techniques will scale could be any of the out-link contexts c ). Our techniques a polynomial inr2 are based on Gleasons Theorem specialized to finite dimen The detailed proof is in Appendix A Furthermore, the addition of new documents to the corpus ill cause the addition of new in-link contexts, requiring a THEOREM 5.1(GLEASON 10D. For an n-dir recomputation of all the Td. Thus the likelihood approach real vector space V (with n> 3), let p be a function that will not scale for our system signs a number in 0, 1 to each subspace of V such that p(u1 6 u2)=p(un)+p(u2) whenever v and v2 are orth 5.2 Scalable closed form solutions mal subspaces and p(v)= l. Then p(u)= Trace(TP) Since hundreds of thousands of density matrices need to where P. is the projection matri for subspace v(e.g. P be estimated (one for each document)and recomputed is the projection of vector w onto the subspace v), and T new documents are added to the corpus, we opt to replace a density matriz -a symmetric positive semidefinite matrix the exact but slow maximum likelihood computation with an whose trace is 1 approximate but fast closed form solution which we derive Note that van Rijsbergen 20 proposed using Gleason's The- in this section orem as a model for infornation retrieval. and this was als We begin with the following observation. For each con- extensively studied by Melucci [17. However, our frame- cept ci, the maximum likelihood estimate of Td given that is substantially different from their proposals since it concept ci is relevant is cici. It stands to reason that our overall estimate of Td should be similar to each of the Cic on comparisons between density matrices Let w be the set of all words in our corpus and let WI be We will measure similarity by the Frobenius norm(square- the number of words. The vector space V is W'l-dimensiona root of the sum of the squared matrix entries)and thus set with one dimension for each word up the following optimization problem: In this framework, atomic concepts will be represented as one-dimensional vector spaces and we will treat each context minimize(ra)=∑‖a-cct ink )as an atomic concept. Each atomic subject to the constraint that Td is a density ma shall also denote by c (one such representation can be de- derivatives, we get rived from normalizing tf-idf scores into a unit vector). The projection matrix for c is then cc. Our goal is to measure the probability that c is relevant to a document d and so 7=2∑(a-cc)=0 i=1 oy Gleasons Theorem, each document d is associated with a density matrix Td. The probability that c is relevant to leading to the solution Ta=fis cic. Now that we have a closed form estimate for Td, we can discuss how to measure Iu e u2 is the linear span of v1 and v2 the relevance between documents with respect to a concept
1000 documents with abstract and title most similar to d1 (i.e. G1000). 5. MODELING CONTEXT-BASED CITATION RELEVANCE In this section, we propose a non-parametric probabilistic model to measure context-based (and overall) relevance between a manuscript and a candidate citation, for ranking retrieved candidates. To improve scalability, we use an approximate inference technique which yields a closed form solution. Our model is general and simple so that it can be used to efficiently and effectively measure the similarity between any two documents with respect to certain contexts or concepts in information retrieval. 5.1 Context-Based Relevance Model Recall that a query manuscript d1 can have a global context c1 (e.g., a title and abstract, which describes the problem to be addressed) and the out-link local contexts c2,...,ck1 (e.g., text surrounding citations) which compactly express ideas that may be present in related work. A document d2 in an existing corpus D has a global context b1 (e.g. title and abstract), and local in-link contexts b1,...,bk2 (e.g., text that is used by papers citing d2) which compactly express ideas covered in d2. In this section, we describe a principled and consistent way of measuring sim(d1, d2), defined as the overall relevance of d2 to d1 and sim(d1, d2; c∗), defined as the relevance of d2 to d1 with respect to a specific context c∗ (in particular, c∗ could be any of the out-link contexts ci). Our techniques are based on Gleason’s Theorem specialized to finite dimensional real vector spaces. Theorem 5.1 (Gleason [10]). For an n-dimensional real vector space V (with n ≥ 3), let p be a function that assigns a number in [0, 1] to each subspace of V such that p(v1 ⊕ v2) = p(v1) + p(v2) whenever v1 and v2 are orthogonal subspaces 1 and p(V )=1. Then p(v) = T race(T Pv) where Pv is the projection matrix for subspace v (e.g. Pvw is the projection of vector w onto the subspace v), and T is a density matrix – a symmetric positive semidefinite matrix whose trace is 1. Note that van Rijsbergen [20] proposed using Gleason’s Theorem as a model for information retrieval, and this was also extensively studied by Melucci [17]. However, our framework is substantially different from their proposals since it relies on comparisons between density matrices. Let W be the set of all words in our corpus and let |W| be the number of words. The vector space V is |W|-dimensional with one dimension for each word. In this framework, atomic concepts will be represented as one-dimensional vector spaces and we will treat each context (global, in-link, out-link) as an atomic concept. Each atomic concept c will be associated a unit column vector which we shall also denote by c (one such representation can be derived from normalizing tf-idf scores into a unit vector). The projection matrix for c is then ccT . Our goal is to measure the probability that c is relevant to a document d and so, by Gleason’s Theorem, each document d is associated with a density matrix Td. The probability that c is relevant to 1v1 ⊕ v2 is the linear span of v1 and v2. d is then pd(c) = T race(TdccT ) = cT Tdc. Note that similar atomic concepts (as measured by the dot product) will have similar relevance scores. Now, the probability distribution characterized by Gleason’s theorem is not a generative distribution – it cannot be used to sample concepts. Instead, it is a distribution over yes/no answers (e.g. is concept c relevant or not?). Thus to estimate Td for a document d we will need some (unknown) generative distribution pgen over unit vectors (concepts). Our evidence about the properties of Td come from the following process. pgen independently generates k concepts c1,...,ck and these concepts are then independently judged to be relevant to d. This happens with probability k i=1 pgen(ci)pd(ci). We seek to find a density matrix Td that maximizes the likelihood. The log likelihood is: k i=1 log pgen(ci) +k i=1 log c T i Tdci . Now, if there is only one concept (i.e. k = 1), then the maximum likelihood estimator is easy to compute: it is Td = c1ct 1. In the general case, however, numerical methods are needed [3]. The computational cost of estimating Td can be seen from the following proposition: Proposition 5.2. The density matrix Td can be represented as r i=1 tit T i where the ti are a set of at most r orthogonal column vectors with (ti · ti)=1, and r is the dimension of the space spanned by the ci (the number of linearly independent contexts). There are O(r2) parameters to determine and numerical (iterative) techniques will scale as a polynomial in r2. The detailed proof is in Appendix A. Furthermore, the addition of new documents to the corpus will cause the addition of new in-link contexts, requiring a recomputation of all the Td. Thus the likelihood approach will not scale for our system. 5.2 Scalable Closed Form Solutions Since hundreds of thousands of density matrices need to be estimated (one for each document) and recomputed as new documents are added to the corpus, we opt to replace the exact but slow maximum likelihood computation with an approximate but fast closed form solution which we derive in this section. We begin with the following observation. For each concept ci, the maximum likelihood estimate of Td given that concept ci is relevant is cicT i . It stands to reason that our overall estimate of Td should be similar to each of the cicT i . We will measure similarity by the Frobenius norm (squareroot of the sum of the squared matrix entries) and thus set up the following optimization problem: minimize L(Td) = k i=1 ||Td − cic T i ||2 F , subject to the constraint that Td is a density matrix. Taking derivatives, we get ∂L ∂Td = 2k i=1 (Td − cic T i )=0, leading to the solution Td = 1 k k i=1 cicT i . Now that we have a closed form estimate for Td, we can discuss how to measure the relevance between documents with respect to a concept. WWW 2010 • Full Paper April 26-30 • Raleigh • NC • USA 425
Www2010·Ful! Paper Apr26-30· Raleigh·Nc·USA Global recommendation nodes2, each of which has 8 CPU processors of 2.67GHz and Let Td, and Td, be the respective density manuscript di and a document d2 from the define sim(d1, d2), the overall relevance of d2 to di to 6.1 Performance measures robability that a random context drawn from the uniform The performance ommendation can be measured distribution over contexts is relevant to both di and d2. Af- a wide range of metrics and means, including user studies ter much mathematical manipulation, we get the following and click-through monitoring. For experimental purpose, we focus on four performance measures as follows in this paper. PROPOSiTion 5.3. Let Td Recall (R): We removed original citations from the test- T2=∑1b1b· Then the relevance of d2 to d1 ing documents. The recall is defined as the percentage of original citations that appear in the top K recommended ci- tations. Furthermore, we categorize recall into global recall sim(d1,d2)=k1k21=1 ∑∑(c·b and local recall for global and local recommendation respec- tively. The global recall is first computed as the percentage of original bibliography of each testing document d that ap- The detailed proof is given in Appendix B. Given query pears in the top K recommended citations of d, and then manuscript ql, we use Equation I to rank documents for averaged over all 1, 612 testing documents. The local recall is first computed as, for each citation placeholder, the per- Local recommendation centage of the original citations cited by c, that appear in the top K recommended citations of c, and then averaged If we are given a single context c instead of a manuscript, we over all 7,049 testing out-link citation contexts can still compute sim(c, d2), the relevance of a document Co-cited probability(P): We may recommend some rel- d2∈ D to the context c. Letting Td2=∑1bb,then evant or even better recommendations other than those ol by Gleasons Theorem, we have inal ones among the top K results, which cannot be capture sim(ce,d2)=Trace(lag,C=N by the traditional metric like precision. The previous work (y3c)2.(2) usually conducted user studies for this kind of relevance eval- uation [ 16, 6. In this paper, we instead use the wisdom of he population as the ground truth to define a systematic We define sim(d1, d2; cs), the relevance of d2 to di with metric. For each pair of documents(di, di) where di is an respect to context c, as the probability that c. is relevant original citation and the d, is a recommended one, we cal- to both documents. Applying Gleasons Theorem twice ulate the probability that these two documents have been CO-cited by the popularity in the past sim(d1, d2: c)= Trace(Td, c.c. )Trace(Td, c.c.) of pa ng both di and dj ∑c)∑的c).() umber of papers citing di or dj The co-cited probability is then averaged over all K-I unique Given query manuscript di, we use Equation 3 to rank doc- of original citations. Again, we categorize this probability ments for recommendations at the placeholder associated into a global version and a local version: the former is av- with context c, in di. If a citation context c, is given with- eraged over all testing documents and the latter is averaged out dl, then we use Equation 2. over all testing citation contexts NDCG: The effectiveness of a recommendation system 6. EXPERIMENTS is also sensitive to the positions of relevant citations, which We built a real system in the CiteSeerX staging server cannot be evaluated by recall and co-cited probability. In- to evaluate context-aware citation recommendations. We tuitively, it is desirable that highly relevant citations appear used all research papers published and crawled before year earlier in the top K list. We use normalized discounted 2008 as the document corpus D. After removing duplicate cumulative gain(NDCg)to measure the ranked recommen- apers and the papers missing abstract/title and cita dation list. The NDCG value of a ranking list at position contexts, we obtained 456, 787 unique documents in the cor- is calculated as pus. For each paper, we extracted its title and abstract as the global citation context or text content. Within a NDCGoi= Z aper we simply took 50 words before and after each cita- tion placeholder as its local citation context. We removed ome popular stop words. In order to preserve the time where r( is the rating of the j-th document in the rank sensitive past/present/future tenses of verbs and the singt ing list, and the normalization constant Zi is chosen so that lar/plural styles of named entities, no stemming was done the perfect list gets a NDCG score of 1. Given a testing All words were transferred to lower-cases. Finally, we ob- document di and any other document d2 from our corpus tained 1, 810, 917 unique local citation contexts and 716. 92 D, we use the average co-cited probability of d2 with all ue word terms. We used all 1, 612 papers published and original citations of di to weigh the citation relevance score of d2 to dI. Then, we sort all d2 w.r. t. this crawled in early 2008 as the testing data set We implemented all algorithms in C++ and integrate Pmar is the highest score) and define 5-scale relevance num- them into the ,ava running environment of CiteSeerX. All ber for them as the ground truth: 4, 3. 2, 1, 0 for documents experiments were conducted on a Linux cluster with 128 Due to the psu policy, we could only use 8 of them 426
Global Recommendation Let Td1 and Td2 be the respective density matrices of the manuscript d1 and a document d2 from the corpus D. We define sim(d1, d2), the overall relevance of d2 to d1 to be the probability that a random context drawn from the uniform distribution over contexts is relevant to both d1 and d2. After much mathematical manipulation, we get the following: Proposition 5.3. Let Td1 = 1 k1 k1 i=1 cicT i and Td2 = 1 k2 k2 i=1 bibT i . Then the relevance of d2 to d1 is: sim(d1, d2) = 1 k1k2 k1 i=1 k2 j=1 (ci · bj ) 2 . (1) The detailed proof is given in Appendix B. Given query manuscript q1, we use Equation 1 to rank documents for global recommendation. Local Recommendation If we are given a single context c∗ instead of a manuscript, we can still compute sim(c∗, d2), the relevance of a document d2 ∈ D to the context c∗. Letting Td2 = 1 k2 k2 i=1 bibT i , then by Gleason’s Theorem, we have: sim(c∗, d2) = T race(Td2 c∗c T ∗ ) = 1 k2 k2 j=1 (bj · c∗) 2 . (2) We define sim(d1, d2; c∗), the relevance of d2 to d1 with respect to context c∗ as the probability that c∗ is relevant to both documents. Applying Gleason’s Theorem twice: sim(d1, d2; c∗) = T race(Td1c∗c T ∗ )T race(Td2c∗c T ∗ ) = 1 k1k2 k1 i=1 (ci · c∗) 2k2 j=1 (bj · c∗) 2 . (3) Given query manuscript d1, we use Equation 3 to rank documents for recommendations at the placeholder associated with context c∗ in d1. If a citation context c∗ is given without d1, then we use Equation 2. 6. EXPERIMENTS We built a real system in the CiteSeerX staging server to evaluate context-aware citation recommendations. We used all research papers published and crawled before year 2008 as the document corpus D. After removing duplicate papers and the papers missing abstract/title and citation contexts, we obtained 456, 787 unique documents in the corpus. For each paper, we extracted its title and abstract as the global citation context or text content. Within a paper, we simply took 50 words before and after each citation placeholder as its local citation context. We removed some popular stop words. In order to preserve the timesensitive past/present/future tenses of verbs and the singular/plural styles of named entities, no stemming was done. All words were transferred to lower-cases. Finally, we obtained 1, 810, 917 unique local citation contexts and 716, 927 unique word terms. We used all 1, 612 papers published and crawled in early 2008 as the testing data set. We implemented all algorithms in C++ and integrated them into the Java running environment of CiteSeerX. All experiments were conducted on a Linux cluster with 128 nodes2, each of which has 8 CPU processors of 2.67GHz and 32G main memory. 6.1 Performance Measures The performance of recommendation can be measured by a wide range of metrics and means, including user studies and click-through monitoring. For experimental purpose, we focus on four performance measures as follows in this paper. Recall (R): We removed original citations from the testing documents. The recall is defined as the percentage of original citations that appear in the top K recommended citations. Furthermore, we categorize recall into global recall and local recall for global and local recommendation respectively. The global recall is first computed as the percentage of original bibliography of each testing document d that appears in the top K recommended citations of d, and then averaged over all 1, 612 testing documents. The local recall is first computed as, for each citation placeholder, the percentage of the original citations cited by c∗ that appear in the top K recommended citations of c∗, and then averaged over all 7, 049 testing out-link citation contexts. Co-cited probability (P): We may recommend some relevant or even better recommendations other than those original ones among the top K results, which cannot be captured by the traditional metric like precision. The previous work usually conducted user studies for this kind of relevance evaluation [16, 6]. In this paper, we instead use the wisdom of the population as the ground truth to define a systematic metric. For each pair of documents di, dj where di is an original citation and the dj is a recommended one, we calculate the probability that these two documents have been co-cited by the popularity in the past as P = number of papers citing both di and dj number of papers citing di or dj . The co-cited probability is then averaged over all K·l unique document pairs for the top K results, where l is the number of original citations. Again, we categorize this probability into a global version and a local version: the former is averaged over all testing documents and the latter is averaged over all testing citation contexts. NDCG: The effectiveness of a recommendation system is also sensitive to the positions of relevant citations, which cannot be evaluated by recall and co-cited probability. Intuitively, it is desirable that highly relevant citations appear earlier in the top K list. We use normalized discounted cumulative gain (NDCG) to measure the ranked recommendation list. The NDCG value of a ranking list at position i is calculated as NDCG@i = Zi i j=1 2r(j) − 1 log(1 + j) , where r(j) is the rating of the j-th document in the ranking list, and the normalization constant Zi is chosen so that the perfect list gets a NDCG score of 1. Given a testing document d1 and any other document d2 from our corpus D, we use the average co-cited probability of d2 with all original citations of d1 to weigh the citation relevance score of d2 to d1. Then, we sort all d2 w.r.t. this score (suppose Pmax is the highest score) and define 5-scale relevance number for them as the ground truth: 4, 3, 2, 1, 0 for documents 2Due to the PSU policy, we could only use 8 of them. WWW 2010 • Full Paper April 26-30 • Raleigh • NC • USA 426
Www2010·Ful! Paper April26-30· Raleigh·Nc·USA Table 1: Compare different candidate sets. Num- Recommendation Quality bers are averaged over 1, 612 test documents. We compare CRM with 7 other context-oblivious baselines HITs [13 the candidates are ranked w.r. t. their author- ity scores in the candidate set subgraph. We choose to com- pare with HITs because it is interesting to see the difference between the popularity (link-based methods) and the rele- vance(context or document based methods L100+CitHo Katz [15]: the candidates are ranked w.r.t. the Katz distance, >i B Ni, where N is the number of unique paths of length i between the query manuscript/context and the C1000-+Cit candidate and the path should go through the top N similar Author +CitHop documents/contexts of the query manuscript/context, and B is a decay parameter between 0 and 1. We choose to compare Katz because this feature has bee to be L100+CitHop)+G1000 the most effective among all text/citation features in 23 for document-based global recommendation. Note that [2 used the ground truth to calculate the Katz distance, which I-count and g-count: the candidates are ranked accord ng to the number of citations in the candidate set subgraph in(3Pma: /4, Pmar],(Pmar/2, 3Pma:/4), (Pmaz/4, Pma /2), (I-count) or the whole corpus(g-count (0, Pmar/4 and 0 respectively. Finally, the NDCG over all textsim: the candidates are ranked according to similar- testing documents(the global version)or all testing cita- query manuscript using only title and abstra tion contexts(the local version) is averaged to yield a single This allows us to see the benefit of a context-aware approach qualitative metric for each recommendation problem diffusion: the candidates are ranked according to their Time: We use the running time as an important metric to topical similarities which are generated by the multinomial neasure the efficiency of the recommendation approaches diffusion kernel [14 to the query manuscript, K(01, 02) (4mt)-exp(-larccos2(ver.V02), where 01 and 02 are 6.2 Retrieving Candidate Sets opic distributions of the query manuscript dl and the can- Table l evaluates the quality of different candidate set re- didate d2 respectively, t is the decay factor. Since we only (see Section 4 for notation and detailed care about the ranking, we can ignore the first item and t descriptions). Here we measure coverage(the average recall We choose to compare with it because topic-based citation of the candidate set with respect to the true bibliography recommendation 24] is one of related work and the multi- of a testing document)and candidate set size. A good can- nomial diffusion kernel is the state-of-the-art tool in topic- high recall and a small size based text similarity 8]. We run LDA 5] on each candidate large candidate sets slow down the final ranking for all rec- set online(by setting the number of topics as 60)to get the ommendation systems. For context-aware methods, we feel topic distributions for documents LC100+G1000 achieves the best tradeoff between candi- mix-features: the candidates are ranked according to date set size and coverage. For co the weighted linear combination of the above 6 features. We G1000+CitHop works reasonably well(but not as well as choose to compare it because in[23 a mixture approach LC100+G1000). However, the retrieval time of context- considering both text-based features and citation-based fea- oblivious methods is around 0.28 seconds on an 8-node clus- tures is the most effective ter. On the other hand. the retrieval time of context-aware methods ranges from 2 to 10 seconds on the same clus- recall,co-cited probability and NDCG, respective/f: nces on ter(depending on the number of out-link contextsof a Among all methods, g-count is the worst one simply be- query manuscript). Our goal for the final system is to use cause it is measured over the whole corpus, not the candidate LC100-+G1000 and to speed up retrieval time using a cor set. Context-oblivious content-based methods like textsim bination of indexing tricks and more machines(since this and diffusion come to heel. indicating that abstract/title trieval is highly parallelizable). Note, however, that ranking only are too sparse to portray the specific citation functions echniques are orthogonal to candidate set retrieval meth- well. Moreover, they cannot find the proper related work ods. Thus, when we compare our ranking methods for rec- that uses different conceptual words. The diffusion is better with baselines and related work, we will use than textsim, indicating that topic-based similarity can cap- LC100+G1000 as the common candidate set, since our ture the citation relations more than raw text similarity. The eventual goal is to use this retrieval method in our system. social phenomenon that the rich get richer is also common in the citation graph, since the citation features including 6.3 Global recommendation I-count, HITs, and Katz work better than the abstract/ title- In this section, we compare our context-aware relevance based text features. Interestingly, 23 claimed that citation model(CRM for short, Section 5)with other baselines in features did a poor job at coverage. But on our data, they global recommendation performance, since the related work have higher recall values than text features. A combination only focused on recommending the bibliography of these features(mix- features)can further improve the per formance, especially on the recall and NDCG, which means Due to OCR and crawling issues, there were an average of that if a candidate is recommended by multiple features, it 5 out-link contexts per testing document
Table 1: Compare different candidate sets. Numbers are averaged over 1,612 test documents. Methods Coverage Candidate Set Size G1000 0.44 1000 L100 0.55 341 LC100 0.63 674 L1000 0.69 2,844 LC1000 0.78 5,692 Author 0.05 17 L100+CitHop 0.61 1,327 L1000+CitHop 0.72 9,049 LC100+CitHop 0.73 3,561 G1000+CitHop 0.73 3,790 LC1000+CitHop 0.83 22,629 Author+CitHop 0.15 63 L100+G1000 0.75 1,312 LC100+G1000 0.79 1,618 (L100+CitHop)+G1000 0.79 2,279 (LC100+CitHop)+G1000 0.85 4,460 (LC1000+G1000)+CitHop 0.92 24,793 LC100+G1000+(Author+CitHop) 0.79 1,674 (LC100+G1000)+AuthHop 0.88 39,496 in (3Pmax/4, Pmax], (Pmax/2, 3Pmax/4], (Pmax/4, Pmax/2], (0, Pmax/4] and 0 respectively. Finally, the NDCG over all testing documents (the global version) or all testing citation contexts (the local version) is averaged to yield a single qualitative metric for each recommendation problem. Time: We use the running time as an important metric to measure the efficiency of the recommendation approaches. 6.2 Retrieving Candidate Sets Table 1 evaluates the quality of different candidate set retrieval techniques (see Section 4 for notation and detailed descriptions). Here we measure coverage (the average recall of the candidate set with respect to the true bibliography of a testing document) and candidate set size. A good candidate set should have high recall and a small size since large candidate sets slow down the final ranking for all recommendation systems. For context-aware methods, we feel LC100+G1000 achieves the best tradeoff between candidate set size and coverage. For context-oblivious methods, G1000+CitHop works reasonably well (but not as well as LC100+G1000). However, the retrieval time of contextoblivious methods is around 0.28 seconds on an 8-node cluster. On the other hand, the retrieval time of context-aware methods ranges from 2 to 10 seconds on the same cluster (depending on the number of out-link contexts3 of a query manuscript). Our goal for the final system is to use LC100+G1000 and to speed up retrieval time using a combination of indexing tricks and more machines (since this retrieval is highly parallelizable). Note, however, that ranking techniques are orthogonal to candidate set retrieval methods. Thus, when we compare our ranking methods for recommendations with baselines and related work, we will use LC100+G1000 as the common candidate set, since our eventual goal is to use this retrieval method in our system. 6.3 Global Recommendation In this section, we compare our context-aware relevance model (CRM for short, Section 5) with other baselines in global recommendation performance, since the related work only focused on recommending the bibliography. 3Due to OCR and crawling issues, there were an average of 5 out-link contexts per testing document. Recommendation Quality We compare CRM with 7 other context-oblivious baselines: HITs [13]: the candidates are ranked w.r.t. their authority scores in the candidate set subgraph. We choose to compare with HITs because it is interesting to see the difference between the popularity (link-based methods) and the relevance (context or document based methods). Katz [15]: the candidates are ranked w.r.t. the Katz distance, i βi Ni, where Ni is the number of unique paths of length i between the query manuscript/context and the candidate and the path should go through the top N similar documents/contexts of the query manuscript/context, and βi is a decay parameter between 0 and 1. We choose to compare Katz because this feature has been shown to be the most effective among all text/citation features in [23] for document-based global recommendation. Note that [23] used the ground truth to calculate the Katz distance, which is impractical. l-count and g-count: the candidates are ranked according to the number of citations in the candidate set subgraph (l-count) or the whole corpus (g-count). textsim: the candidates are ranked according to similarity with the query manuscript using only title and abstract. This allows us to see the benefit of a context-aware approach. diffusion: the candidates are ranked according to their topical similarities which are generated by the multinomial diffusion kernel [14] to the query manuscript, K(θ1, θ2) = (4πt) − |W| 2 exp −1 t arccos2( √θ1 · √θ2) , where θ1 and θ2 are topic distributions of the query manuscript d1 and the candidate d2 respectively, t is the decay factor. Since we only care about the ranking, we can ignore the first item and t. We choose to compare with it because topic-based citation recommendation [24] is one of related work and the multinomial diffusion kernel is the state-of-the-art tool in topicbased text similarity [8]. We run LDA [5] on each candidate set online (by setting the number of topics as 60) to get the topic distributions for documents. mix-features: the candidates are ranked according to the weighted linear combination of the above 6 features. We choose to compare it because in [23], a mixture approach considering both text-based features and citation-based features is the most effective. Figures 5 (a), (b) and (c) illustrate their performances on recall, co-cited probability and NDCG, respectively. Among all methods, g-count is the worst one simply because it is measured over the whole corpus, not the candidate set. Context-oblivious content-based methods like textsim and diffusion come to heel, indicating that abstract/title only are too sparse to portray the specific citation functions well. Moreover, they cannot find the proper related work that uses different conceptual words. The diffusion is better than textsim, indicating that topic-based similarity can capture the citation relations more than raw text similarity. The social phenomenon that the rich get richer is also common in the citation graph, since the citation features including l-count, HITs, and Katz work better than the abstract/titlebased text features. Interestingly, [23] claimed that citation features did a poor job at coverage. But on our data, they have higher recall values than text features. A combination of these features (mix-features) can further improve the performance, especially on the recall and NDCG, which means that if a candidate is recommended by multiple features, it WWW 2010 • Full Paper April 26-30 • Raleigh • NC • USA 427
Www2010·Fu! Paper Aprl!26-30· Raleigh·Nc·UsA 0.65 0.20 ◆HIIs v Katz divisio 80.02 255075100125150175200225250 55075100125150175200225250 NACLa? NDCGla75 top K recommendations per document top K recommendations per document K recommendations per document (a)recal (b)co-cited probability (c)NDCG Figure 5: Compare performances for global recommendation hould be ranked higher. An interesting finding is that the 10000 performance of HITs(especially on NDCG)increases signif- hit citations antly after more candidates are returned, indicating that some candidates with moderate authoritative scores are our targets(people may stop citing the most well-known papers 10 after they become the standard techniques in their domains and shift the attentions to other recent good papers). Katz t historical citation works the best among single features, partially because it im (a)scatter distribution (b) probability distribution plicitly combines the text similarity with the citation count It works like the collaborative filtering, where candidates of- Figure 7: Correlation between count/probability of en cited by similar documents are recommended. Finally missed/hit citations and their number of historical our CRM method leads the pack on all three metrics in-link citation contexts plying that after considering all historical in-link citation contexts of candidates(then the problem of using different conceptual words for the same concept would be alleviated) CRM is effective in recommending bibliography crawling and OCR issues, our testing documents have an av- Ranking Time erage of 5 out-link citation contexts that point to documents in our corpus, so"top 5 per context"corresponds to"top Time is not a major issue for most ranking algorithms except 25 per document"), the performance of CRM-crosscontex for topic-based methods, where LDA usually took tens of is very close to CRM in global recommendation. We know seconds (50 100)for each new candidate set. Thus, topi CRM-crosscontext and CRM have the same input(the same based methods including diffusion and mix-features are not amount of information to use). However, CRM-crosscontext suitable for online recommendation systems. All other rank- tackles a much harder problem than CRM, and has to as- ing algorithms need less than 0.1 seconds. Limited by space, sign citations to each placeholder. Thus, CRM-crosscontext the detailed comparisons are omitted here is more capable than CRM. CRM-crosscontext outperforms all baselines of global recommendation and thus is also su- 6. 4 Local recommendation perior than their local versions. Limited by space, we omit Local recommendation is a novel task proposed by our the details here Here. we CRM-crosscontext is able to effectively rank candidates do not compare with the above baselines because they are for placeholders. For example, more than 42% original ci- not tailored for local recommendation. Instead, we evaluate tations can be found in the top 5 recommendations, the impact of the absence/presence of the global context and frequently co-cited papers(. r.t. original citations) also ap- ther local contexts, and analyze the problem of context pear early as indicated by ndcG. On the other hand, CRM- aware methods singlecontext uses much less information(without global If a user only inputs a bag of words as the single context context and other coupling contexts)but still achieves rea- to request recommendations, we can then only use Equa- sonable performance. For example, if a user only inputs 100 ion 2 to rank candidates. We name this method as Crm words as the query, around 34% original citations can be singlecontext. If a user inputs a manuscript with place. found in the top 5 list holders inside, we can then rank candidates for each place- One may wonder what would happen to some doct holder using Equation 3. We name this method as CRM- the corpus which do not have enough in-link citati crosscontext. Figure 6 illustrates the performance of these texts in history. Given an original citation from two kinds of local recommendations on recall, co-cited prob- manuscript, we examine the correlation of its missi ability and NdCg probability to its number of historical in-link citation con- CRM-crosscontext is rather effective In Figure 5(due to texts in CRM, shown in Figure 7
(a) recall (b) co-cited probability (c) NDCG Figure 5: Compare performances for global recommendation. should be ranked higher. An interesting finding is that the performance of HITs (especially on NDCG) increases significantly after more candidates are returned, indicating that some candidates with moderate authoritative scores are our targets (people may stop citing the most well-known papers after they become the standard techniques in their domains and shift the attentions to other recent good papers). Katz works the best among single features, partially because it implicitly combines the text similarity with the citation count. It works like the collaborative filtering, where candidates often cited by similar documents are recommended. Finally, our CRM method leads the pack on all three metrics, implying that after considering all historical in-link citation contexts of candidates (then the problem of using different conceptual words for the same concept would be alleviated), CRM is effective in recommending bibliography. Ranking Time Time is not a major issue for most ranking algorithms except for topic-based methods, where LDA usually took tens of seconds (50 ∼ 100) for each new candidate set. Thus, topicbased methods including diffusion and mix-features are not suitable for online recommendation systems. All other ranking algorithms need less than 0.1 seconds. Limited by space, the detailed comparisons are omitted here. 6.4 Local Recommendation Local recommendation is a novel task proposed by our context-aware citation recommendation system. Here, we do not compare with the above baselines because they are not tailored for local recommendation. Instead, we evaluate the impact of the absence/presence of the global context and other local contexts, and analyze the problem of contextaware methods. If a user only inputs a bag of words as the single context to request recommendations, we can then only use Equation 2 to rank candidates. We name this method as CRMsinglecontext. If a user inputs a manuscript with placeholders inside, we can then rank candidates for each placeholder using Equation 3. We name this method as CRMcrosscontext. Figure 6 illustrates the performance of these two kinds of local recommendations on recall, co-cited probability and NDCG. CRM-crosscontext is rather effective. In Figure 5 (due to (a) scatter distribution (b) probability distribution Figure 7: Correlation between count/probability of missed/hit citations and their number of historical in-link citation contexts. crawling and OCR issues, our testing documents have an average of 5 out-link citation contexts that point to documents in our corpus, so “top 5 per context” corresponds to “top 25 per document”), the performance of CRM-crosscontex is very close to CRM in global recommendation. We know CRM-crosscontext and CRM have the same input (the same amount of information to use). However, CRM-crosscontext tackles a much harder problem than CRM, and has to assign citations to each placeholder. Thus, CRM-crosscontext is more capable than CRM. CRM-crosscontext outperforms all baselines of global recommendation and thus is also superior than their local versions. Limited by space, we omit the details here. CRM-crosscontext is able to effectively rank candidates for placeholders. For example, more than 42% original citations can be found in the top 5 recommendations, and frequently co-cited papers (w.r.t. original citations) also appear early as indicated by NDCG. On the other hand, CRMsinglecontext uses much less information (without global context and other coupling contexts) but still achieves reasonable performance. For example, if a user only inputs 100 words as the query, around 34% original citations can be found in the top 5 list. One may wonder what would happen to some documents in the corpus which do not have enough in-link citation contexts in history. Given an original citation from a query manuscript, we examine the correlation of its missing/hit probability to its number of historical in-link citation contexts in CRM, shown in Figure 7. WWW 2010 • Full Paper April 26-30 • Raleigh • NC • USA 428
Www2010·Ful! Paper Aprl!26-30· Raleigh·Nc·USA 吾CRM- singlecontext 0.03 ◆CRM- crosscontext 8016◆ 0°备自鲁 NDCG'a75 top K recommendations per local context top K recommendations per local context (a) recall (b)co-cited probability NDCG Figure 6: Evaluate local recommendations The correlation results clearly indicate that the missing/hit [5] D. Blei, A Ng and M. Jordan. Latent dirichlet allocation probability of an original citation declines/raises proportion- ally to its number of historical in-link contexts. In fact [6] K Chandrasekaran, S Gauch, P Lakkaraju and H Luong for those new corpus documents without link con- Concept-Based Document Recommendations for CiteSee exts. our context-aware met hods can still conduct context- Authors. Adaptive Hypermedia and Adaptive Web-Base Systems, Springer, 2008 oblivious document-based recommendation, except that we still enhance the specific citation motivations of a query [7 D Cohn and T Hofmann. The missing link-a probabilistic model of document content and hypertext manuscript using its out- link local contexts connectivity. NIPs01 [8 F Diaz. Regularizing Ad Hoc Retrieval Scores. CIKM'05. 7. CONCLUSIONS 9 E Erosheva, S. Fienberg and J. Lafferty. Mixed membership models of scientific publications. PNAS 2004 In this paper, we tackled the novel problem of [10) A. Gleason. Measures on the Closed Subspaces of a Hilbert ware citation recommendation. and built a conte Space. J. of Mathematics and Mechanics, 1957. citation recommendation prototype in CiteSeerX [11] S. Huang, G. Xue, B. Zhang, Z. Che Y. Yu and W. Ma tem is capable of recommending the bibliography to a manuscript TSSP: A Reinforcement Algorithm to Find Related Papers and providing a ranked set of citations to a specific cita- Wr04. ion placeholder. We developed a mathematically soune [12 A Jeffrey and H. Dai. Handbook of Mathematical context-aware relevance model. A non-parametric proba- Formulas and Integrals. Academic Press, 2008. bilistic model as well as its scalable closed form solutions [ 13J. Kleinberg. Authoritative sources in a hyperlinked environment.J. of the ACM, 1999 are then introduced. We also conducted extensive experi- [14] J. Lafferty and G. Lebanon. Diffusion Kernels on Statistical ments to examine the performance of our approach Manifolds. J. of Machine Learning Research, 2005. In the future, we plan to make our citation recommer [15]D Liben-Nowell and J Kleinberg. The link prediction dation system publicly available. Moreover, we plan to problem for social networks. CIKM'03 velop scalable techniques for integrating various sources of [16] S. McNee, I. Albert, D Cosley, P. Gopalkrishnan, S. Lam, features, and explore semi-supervised learning on the partial list of citations in manuscripts f Citations for Research Papers. CSCw02. [17 M. Melucci. A basis for information retrieval in context. TOS,2008 8. ACKNOWLEDGEMENT [18 R Nallapati, A Ahmed, E. Xing and W. Cohen. Je This work is supported in part by the National Sciene latent topic models for text and citations. SIGKDD 08 [19 Z Nie, Y. Zhang, J. Wen and w. Ma Object-Level Foundation Grants 0535656, 0845487, and 0454052, an NSERC Ranking: Bringing Order to Web Objects. wwwo Discovery grant and an NSERC Discovery Accelerator Sup- [20 C. Rijsbergen. The Geometry of Information Retrieval plements grant. All opinions, findings, conclusions and rec- Cambridge University Press, 2004. ommendations in this paper are those of the authors and do [21] A. Ritchie. Citation context analysis for information not necessarily reflect the views of the funding agencies retrieval. PhD thesis, University of Cambridge, 2008 2 B Shaparenko and T Joachims. Identifying the Original REFERENCES ECML. 2009 23T Strohman, W. Croft and D Jensen. R SIGIR07: ze od mending [1 R. Ag Association Rules Between Sets of Items in Large p: //ciir-publications cs umass. Databases. SIGMOD. 1993. 2 S. Aya, C. Lagoze and T Joachims Citation Classification d its Applications. ICKM05 3 K. Banaszek, G. D'Ariano, M. Paris and M. Sacchi Maximum-likelihood estimation of the density matrix. Physical Review A, 1999 F. Wang, B. Chen and Z. Miao. A Survey on Reviewer Assignment Problem. IEA/AIE08 [4 C Basu, H Hirsh, W. Cohen and C Nevill-Manning echnical Paper Recommendation: A Study in Combining 27 D. Zhou, S Zhu, K. Yu, X Song, B. Tseng, H Zha and multiple Information Sources. J. of Artificial Intelligence L. Giles. Learning Multiple Graphs for Document Recommendations. www08 Research. 2001
(a) recall (b) co-cited probability (c) NDCG Figure 6: Evaluate local recommendations. The correlation results clearly indicate that the missing/hit probability of an original citation declines/raises proportionally to its number of historical in-link contexts. In fact, for those new corpus documents without any in-link contexts, our context-aware methods can still conduct contextoblivious document-based recommendation, except that we still enhance the specific citation motivations of a query manuscript using its out-link local contexts. 7. CONCLUSIONS In this paper, we tackled the novel problem of contextaware citation recommendation, and built a context-aware citation recommendation prototype in CiteSeerX. Our system is capable of recommending the bibliography to a manuscript and providing a ranked set of citations to a specific citation placeholder. We developed a mathematically sound context-aware relevance model. A non-parametric probabilistic model as well as its scalable closed form solutions are then introduced. We also conducted extensive experiments to examine the performance of our approach. In the future, we plan to make our citation recommendation system publicly available. Moreover, we plan to develop scalable techniques for integrating various sources of features, and explore semi-supervised learning on the partial list of citations in manuscripts. 8. ACKNOWLEDGEMENT This work is supported in part by the National Science Foundation Grants 0535656, 0845487, and 0454052, an NSERC Discovery grant and an NSERC Discovery Accelerator Supplements grant. All opinions, findings, conclusions and recommendations in this paper are those of the authors and do not necessarily reflect the views of the funding agencies. 9. REFERENCES [1] R. Agrawal, T. Imielinski and A. Swami. Mining Association Rules Between Sets of Items in Large Databases. SIGMOD, 1993. [2] S. Aya, C. Lagoze and T. Joachims. Citation Classification and its Applications. ICKM’05. [3] K. Banaszek, G. D’Ariano, M. Paris and M. Sacchi. Maximum-likelihood estimation of the density matrix. Physical Review A, 1999. [4] C. Basu, H. Hirsh, W. Cohen and C. Nevill-Manning. Technical Paper Recommendation: A Study in Combining Multiple Information Sources. J. of Artificial Intelligence Research, 2001. [5] D. Blei, A. Ng and M. Jordan. Latent dirichlet allocation. J. Machine Learning Research 2003. [6] K. Chandrasekaran, S. Gauch, P. Lakkaraju and H. Luong. Concept-Based Document Recommendations for CiteSeer Authors. Adaptive Hypermedia and Adaptive Web-Based Systems, Springer, 2008. [7] D. Cohn and T. Hofmann. The missing link - a probabilistic model of document content and hypertext connectivity. NIPS’01. [8] F. Diaz. Regularizing Ad Hoc Retrieval Scores. CIKM’05. [9] E. Erosheva, S. Fienberg and J. Lafferty. Mixed membership models of scientific publications. PNAS 2004. [10] A. Gleason. Measures on the Closed Subspaces of a Hilbert Space. J. of Mathematics and Mechanics, 1957. [11] S. Huang, G. Xue, B. Zhang, Z. Chen, Y. Yu and W. Ma. TSSP: A Reinforcement Algorithm to Find Related Papers. WI’04. [12] A. Jeffrey and H. Dai. Handbook of Mathematical Formulas and Integrals. Academic Press, 2008. [13] J. Kleinberg. Authoritative sources in a hyperlinked environment. J. of the ACM, 1999. [14] J. Lafferty and G. Lebanon. Diffusion Kernels on Statistical Manifolds. J. of Machine Learning Research, 2005. [15] D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. CIKM’03. [16] S. McNee, I. Albert, D. Cosley, P. Gopalkrishnan, S. Lam, A. Rashid, J. Konstan and J. Riedl. On the Recommending of Citations for Research Papers. CSCW’02. [17] M. Melucci. A basis for information retrieval in context. TOIS, 2008. [18] R. Nallapati, A. Ahmed, E. Xing and W. Cohen. Joint latent topic models for text and citations. SIGKDD’08. [19] Z. Nie, Y. Zhang, J. Wen and W. Ma. Object-Level Ranking: Bringing Order to Web Objects. WWW’05. [20] C. Rijsbergen. The Geometry of Information Retrieval. Cambridge University Press, 2004. [21] A. Ritchie. Citation context analysis for information retrieval. PhD thesis, University of Cambridge, 2008. [22] B. Shaparenko and T. Joachims. Identifying the Original Contribution of a Document via Language Modeling. ECML, 2009. [23] T. Strohman, W. Croft and D. Jensen. Recommending Citations for Academic Papers. SIGIR’07 and Technical Report, http://ciir-publications.cs.umass.edu/getpdf.php?id=610. [24] J. Tang and J. Zhang. A Discriminative Approach to Topic-Based Citation Recommendations PAKDD’09. [25] R. Torres, S. McNee, M. Abel, J. Konstan and J. Riedl. Enhancing Digitial Libraries with Techlens. JCDL’04. [26] F. Wang, B. Chen and Z. Miao. A Survey on Reviewer Assignment Problem. IEA/AIE’08. [27] D. Zhou, S. Zhu, K. Yu, X. Song, B. Tseng, H. Zha and L. Giles. Learning Multiple Graphs for Document Recommendations. WWW’08. WWW 2010 • Full Paper April 26-30 • Raleigh • NC • USA 429
Www2010·Fu! Paper Aprl!26-30· Raleigh·Nc·USA APPENDIX Cartesian coordinates to hyperspherical coordinates as A. PROOF OF PROPOSITION 5.2 1=rcos 01, w2=rsin 01 cos 82, wa=rsin @i sin ]2 cos 6: By the spectral theorem, we can choose all the ti to be u1w|-1= rsin 01 sin62……sin61w|-2cos61 Each ti can also be expresse aia Sa+Bilu1+.+Bib ub where the si and ui are orthogonal vectors such that the si span the same space as the ci g likelihood is then u2≥0 e1∈|,ml,的∈|0,],……,w-2∈0,可],w-1∈0 ∑ log pgen()+∑logp() the absolute value of the determinant of the jacobian is ∑log∑y=1(a15+…+ aisa+au1+…+ Bibubl-ce) Let us also define 2 log Pgen(co)+2 log Ej-1(lais+..+aiaA]. ca) fJ-… fo Joo rIM erdre- the ui are orthogonal to the ci, we can increase the D)W12 The moment calculations are thus as follows. Bibub with ti= aili t..+ aia sa to get the matrix Td=2, t; without changing the likelihood. However, -e-(=2)/dr1.dr JJn…Ja∫ Trace(4)=∑:tt ∑4(a1+…+a2a+B+…+B)= Trace(Ta)= bSo Jo(1-sin(01))2sinwI-2(01)sinw1-(@2)de1d92 Thus we need to multiply each t by a constant y>l to o Jo Jo sinlW1-2(01)-2sinIW1(@)+sinIw1+(01) x increase the trace of Td to 1. Multiplication by such a y will sinlwI-(02)de,de2 then increase each of the quantities in the log terms of the log =2中 likelihood. Thus Id would have a higher log likelihood than 24x[H Ta. Thus in the maximum likelihood solution Ta=>itit ach ti is in the subspace spanned by the contexts ci and representing all of the ti requires O(r) parameters. So by Wallis's formula [12 where n!! is the double factorial(1 numerical solution will have polynomial complexity in r if n=0 or 1; n!!=n x(n-2)!! otherwise). Also E|w2w21= ①)/2dx B. PROOF OF PROPOSITION 5.3 a2wzJ"J…JJ Recall that we treat concepts as one-dimensional vector II sin wI-1-(0 )drdo.. de wt e the uniform distribution over unit vectors. It =ofo fo cos2(01)sin2(01)cos(82)sinW1-2(01)sin w1-3(@)d01de2 known that sampling from p is equivalent to sampling w dependent standard Gaussian random variables(one for b Jo Jo sin/W1(01)-sin WI+a(e) sin wI-a(e-2)-sin 1-1(02) de1 de2 each dimension)and dividing them by the square root of their sum of squares. The similarity between di and d2 is: HP-出#[ =2丌 =24r[#H=[:H (cm)14,m2 Thus we have E 3ElwFwd. Thus sim(d1; d2)is pro- +(出 portional to Eww, a universal constant. We drop this ct,pwAci,Uyx sim(d, nt to simplify calculations. Thus our similarity is: C4, &Ci,, 26,, 6b Now, if all coordinates of w other than w'i (for some i) 时,+424A ed, it is easy to see that the proba d e2,a6 Dme coordinate will become 0 in tI value. Thus sim(d1; d2)simplifies to 品息,门+1,mm与mm sim(di; d2)=k 2 Finally, after removing the additive and multiplicative con- stants(they don't affect ranking), we can use the following equivalent formula: To compute the necessary moments, we change basis from
APPENDIX A. PROOF OF PROPOSITION 5.2 By the spectral theorem, we can choose all the ti to be orthogonal. Each ti can also be expressed as αi1si + ··· + αiasa+βi1u1+···+βibub where the si and uj are orthogonal unit vectors such that the si span the same space as the ci. The log likelihood is then k i=1 log pgen(ci) + k i=1 log pd(ci) = k i=1 log pgen(ci) + k i=1 log r j=1 cT i tj t T j ci = k i=1 log pgen(ci)+ k i=1 log r j=1([αi1si + ··· + αiasa + βi1u1 + ··· + βibub] · ci)2 = k i=1 log pgen(ci) + k i=1 log r j=1([αi1si + ··· + αiasa] · ci)2. Since the ui are orthogonal to the ci, we can increase the likelihood by replacing each ti = αi1si+···+αiasa+βi1u1+ ··· + βibub with t i = αi1si + ··· + αiasa to get the matrix T d = i t it T i without changing the likelihood. However, T race(T d) = i t i · t i = i(α2 i1 + ··· + α2 ia) since the αi and βj are all orthogonal 1 to increase the trace of T d to 1. Multiplication by such a γ will then increase each of the quantities in the log terms of the log likelihood. Thus γT d would have a higher log likelihood than Td. Thus in the maximum likelihood solution Td = i tit T i , each ti is in the subspace spanned by the contexts ci and representing all of the ti requires O(r2) parameters. So, a numerical solution will have polynomial complexity in r2. B. PROOF OF PROPOSITION 5.3 Recall that we treat concepts as one-dimensional vector spaces and thus can represent them as unit vectors. Let p be the uniform distribution over unit vectors. It is wellknown that sampling from p is equivalent to sampling |W| independent standard Gaussian random variables (one for each dimension) and dividing them by the square root of their sum of squares. The similarity between d1 and d2 is: sim(d1; d2) = T race(Td1wwT )T race(Td2wwT )p(w)dw = 1 k1k2 k1 i=1 k2 j=1 (ci · w)2(bj · w)2p(w)dw = 1 k1k2 E k1 i=1 k2 j=1 | W| α=1 c2 i,αw2 α + 2 | W| β=γ+1 | W| γ=1 ci,βwβci,γwγ × | W| δ=1 b2 j,δw2 δ + 2 | W| ω=+1 | W| =1 bj,ω wωbj,w . Now, if all coordinates of w other than wi (for some i) are fixed, it is easy to see that the probability of wi = x and wi = −x are the same, and so the coordinates of w are jointly uncorrelated. This means that all terms with an odd power of some coordinate will become 0 in the expected value. Thus sim(d1; d2) simplifies to: sim(d1; d2) = 1 k1k2 k1 i=1 k2 j=1 E | W| α=1 c2 i,αb2 i,αw4 α + β=γ c2 i,βb2 j,γw2 βw2 γ + 4 δ> ci,δci,bj,δbj,w2 δw2 . To compute the necessary moments, we change basis from Cartesian coordinates to hyperspherical coordinates as: w1 = r cos θ1, w2 = r sin θ1 cos θ2, w3 = r sin θ1 sin θ2 cos θ3, ..., w|W|−1 = r sin θ1 sin θ2 ... sin θ|W|−2 cos θ|W|−1, w|W| = r sin θ1 sin θ2 ... sin θ|W|−2 sin θ|W|−1; r = | W| i=1 w2 i ≥ 0; θ1 ∈ [0, π], θ2 ∈ [0, π], ..., θ|W|−2 ∈ [0, π], θ|W|−1 ∈ [0, 2π); the absolute value of the determinant of the Jacobian is r|W|−1 |W |−2 i=1 sin|W|−1−i (θi). Let us also define Φ ≡ 2π0 π 0 ... π 0 ∞0 r|W|−1 |W |−2 i=3 sin|W|−1−i(θi)drdθ3...dθ|W|−1 (2π)|W|/2 . The moment calculations are thus as follows: E[w4 1] = 1 (2π)|W|/2 ∞ −∞ ... ∞ −∞ x4 1 ( x2 i )2 e−( x2 i )/2dx1 . . . dx|W| = 1 (2π)|W |/2 2π 0 π 0 ... π 0 ∞ 0 r4 cos4(θ1) r4 r|W|−1× |W |−2 i=1 sin|W|−1−i(θi)drdθ1 . . . dθ|W|−1 = Φ π 0 π 0 (1 − sin2(θ1))2 sin|W|−2(θ1) sin|W|−3(θ2)dθ1dθ2 = Φ π 0 π 0 sin|W|−2(θ1) − 2 sin|W| (θ1) + sin|W|+2(θ1) × sin|W|−3(θ2)dθ1dθ2 = 2Φπ (|W|−3)!! (|W|−2)!! − 2 (|W|−1)!! (|W|)!! + (|W|+1)!! (|W|+2)!! (|W|−4)!! (|W|−3)!! = 2Φπ (|W|−4)!! (|W|−3)!! (|W|−3)!! (|W|−2)!! 1 − 2 |W|−1 |W| + (|W|+1)(|W|−1) (|W|+2)|W| = 2Φπ (|W|−4)!! (|W|−3)!! (|W|−3)!! (|W|−2)!! 3 (|W|+2)|W| , by Wallis’s formula [12] where n!! is the double factorial (1 if n = 0 or 1; n!! = n × (n − 2)!! otherwise). Also, E[w2 1w2 2] = 1 (2π)|W|/2 ∞ −∞ ... ∞ −∞ x2 1x2 2 ( x2 i )2 e−( x2 i )/2dx1 . . . dx|W| = 1 (2π)|W |/2 2π 0 π 0 ... π 0 ∞ 0 r4 cos2(θ1) sin2(θ1) cos2(θ2) r4 r|W|−1× |W |−2 i=1 sin|W|−1−i(θi)drdθ1 . . . dθ|W|−1 = Φ π 0 π 0 cos2(θ1) sin2(θ1) cos2(θ2) sin|W|−2(θ1) sin|W|−3(θ2)dθ1dθ2 = Φ π 0 π 0 sin|W| (θ1) − sin|W|+2(θ1) sin|W|−3(θ2) − sin|W|−1(θ2) dθ1dθ2 = 2Φπ (|W|−1)!! |W|!! − (|W|+1)!! (|W|+2)!! (|W|−4)!! (|W|−3)!! − (|W|−2)!! (|W|−1)!! = 2Φπ (|W|−4)!! (|W|−3)!! (|W|−3)!! (|W|−2)!! |W|−1 |W| − (|W|+1)(|W|−1) (|W|+2)|W| 1 − |W|−2 |W|−1 = 2Φπ (|W|−4)!! (|W|−3)!! (|W|−3)!! (|W|−2)!! 1 |W|(|W|+2) . Thus we have E[w4 i ]=3E[w2 i w2 j ]. Thus sim(d1; d2) is proportional to E[w2 i w2 j ], a universal constant. We drop this constant to simplify calculations. Thus our similarity is: sim(d1; d2) = 1 k1k2 k1 i=1 k2 j=1 3 | W| α=1 c2 i,αb2 i,α + β=γ c2 i,βb2 j,γ + 4 δ> ci,δci,bj,δbj, = 1 k1k2 k1 i=1 k2 j=1 2 | W| α=1 c2 i,αb2 i,α + | W| β=1 | W| γ=1 c2 i,βb2 j,γ + 4 δ> ci,δci,bj,δbj, = 1 k1k2 k1 i=1 k2 j=1 2(ci · bj )2 + | W| β=1 | W| γ=1 c2 i,βb2 j,γ = 1 k1k2 k1 i=1 k2 j=1 2(ci · bj )2 + 1, since the ci and bj are unit vectors. Finally, after removing the additive and multiplicative constants (they don’t affect ranking), we can use the following equivalent formula: sim(d1; d2) = 1 k1k2 k1 i=1 k2 j=1 (ci · bj )2. WWW 2010 • Full Paper April 26-30 • Raleigh • NC • USA 430