Identifying Related Documents For Research Paper Recommender By CPA and COA

Bela Gipp and Jöran Beel
Otto-von-Guericke University Magdeburg, Department of Computer Science, ITI and SciPlore.org
gipp|beel@sciplore.org

Abstract—This work-in-progress paper introduces two new approaches called Citation Proximity Analysis (CPA) and Citation Order Analysis (COA). They can be applied to identify related documents for research paper recommender systems. CPA is a variant of co-citation analysis that additionally considers the proximity of citations to each other within an article's full text. The underlying idea is that the closer citations are to each other in a document, the more likely it is that the cited documents are related. For example, citations listed in the same sentence are more likely to express related thoughts than citations listed only in the same section. In COA, the order of citations is considered, which allows a text to be identified as similar to another even if it has been translated from language A to language B, as the citations would still occur in the same order. However, it is also shown that CPA and COA cannot replace text analysis and existing citation analysis approaches for research paper recommender systems, since they all have their own strengths and weaknesses.

Index Terms—Bibliometrics, citation proximity analysis, citation order analysis, related documents, research paper recommender

I. INTRODUCTION

The search for related work is a time-consuming procedure that, even if performed by experienced scientists, often leads to unsatisfying results. To alleviate the problem, search engines such as Google Scholar and CiteSeer offer to display “similar” documents based on text and citation analysis. Superior results are usually achieved by hybrid research paper recommender systems.
By combining further techniques such as co-word analysis, collaborative filtering, Subject-Action-Object (SAO) structures, etc., more precise recommendations can be given. However, these approaches are only suitable to a limited extent for identifying related work [2-8]. Taking everything into account, our examination suggests that in the case of scientific documents, usually the best results can be achieved by applying co-citation analysis. Citation proximity analysis is a further development of co-citation analysis.

[Figure 1: GUI SciPlore – clustering similar documents. Screenshot of the graphical view, in which relevant documents are drawn larger, with filters for publication date, impact factor, relevance, publication type, language and collaborative rating.]

In the research paper recommender SciPlore.org this approach is mainly used for two purposes: first, to cluster similar documents as shown in Figure 1; and second, to give recommendations for further related documents based on one or more documents the user has been interested in, as shown in Figure 2.

In the first part of this paper related work is presented and the commonly applied citation analysis approaches are discussed with the focus on co-citation analysis.
In the following section the CPA approach is introduced. Afterwards, the existing citation analysis approaches are compared to CPA and their suitability for research paper recommender systems is examined. The paper concludes with a summary and an outlook, which includes how this new approach is going to be integrated into the research paper recommender SciPlore.org.

Preprint of: Bela Gipp and Jöran Beel. Identifying Related Documents For Research Paper Recommender By CPA And COA. In S. I. Ao, C. Douglas, W. S. Grundfest, and J. Burgstone, editors, International Conference on Education and Information Technology (ICEIT'09), volume 1 of Lecture Notes in Engineering and Computer Science, pages 636–639, Berkeley (USA), October 2009. International Association of Engineers (IAENG), Newswood Limited. ISBN 978-988-17012-6-8. Downloaded from http://www.sciplore.org
[Figure 2: Similar paper recommendation. Screenshot of recommendations that Scienstein derives from document usage mining, grouped under “Papers similar to the last papers you have read” and “Papers recently published by authors you have read”.]

II. RELATED WORK

The usefulness of a research paper recommender system depends to a large extent on its ability to automatically determine related work to one or more documents. Various approaches exist to determine the degree of similarity of documents in order to identify related work.

Whereas text-mining approaches are used in cases in which references are not stated, citation analysis approaches usually deliver superior results, as e.g. synonyms and unclear nomenclature do not lead to misleading results [3, 4, 5]. Many citation analysis approaches exist and they all have their own strengths and weaknesses for identifying similar documents. Among the most widely used are the easily applicable 'cited by' approach, which considers papers as relevant that cite the same input document, and the 'reference list' approach, which considers papers as relevant that were referenced by the input document. The best results can usually be obtained by bibliographic coupling and co-citation analysis, which allow calculating the coupling strength [6]. These approaches, which were already invented in the 60s and 70s, are used by scientists and on academic search engine websites like CiteSeer (footnote 1) [9].

[Figure 3: Bibliographic coupling. Doc A and Doc B both cite Doc C, Doc D and Doc E.]

Documents are bibliographically coupled if they cite one or more documents in common.
Figure 3 illustrates this approach: Papers A and B are related because they both cite papers C, D and E. In contrast, two documents are “co-cited” when at least one paper cites both. This approach is illustrated in Figure 4: Papers A and B are related because they are both cited by papers C, D and E. The more co-citations two papers receive, the more related they are [6].

[Figure 4: Co-citation analysis. Doc A and Doc B are both cited by Doc C, Doc D and Doc E.]

Although both approaches are suitable to identify similar papers, they serve different purposes. Whereas bibliographic coupling is retrospective, co-citation is essentially a forward-looking perspective [9]. However, both approaches often deliver unsatisfying results, since they only make use of the bibliography at the end of the document without analyzing the constellation of citations. Therefore it is not possible to determine in which part of a related document the content of interest can be found.

Footnote 1: http://citeseer.ist.psu.edu

III. CITATION PROXIMITY ANALYSIS AND CITATION ORDER ANALYSIS

Instead of just using the bibliography, in CPA the information derived from the proximity of the citations to each other in the full text is used to calculate the Citation Proximity Index (CPI) in three steps.

1. The document is parsed and a series of heuristics is used to process the citations, including their position within the document. (Footnote 2: The citations were parsed using a modified version of parsCit (http://wing.comp.nus.edu.sg/parsCit) in combination with exclusively developed software, which is available upon request from the authors.)
2. The citations are assigned to their corresponding items in the bibliography. The overall margin of error with the system we have developed equals nearly three percent for the first and second step.
3. In the third step the proximity among each citation pair is examined. The underlying assumption is that the closer the citations are to each other, the more likely it is that they are
related. Based on this proximity analysis, the CPI is calculated. If, for example, two citations are given in the same sentence, the probability that they are very similar is higher (CPI = 1) than if they were only in the same paragraph (CPI = 1/2). See Figure 5.
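The proximity rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: it assumes the citations have already been extracted and matched to the bibliography, that each citation carries hypothetical (chapter, paragraph, sentence) indices that are global within the document, and that the CPI values for the sentence, paragraph and chapter levels are those listed in Table 1. Keeping only the closest co-occurrence of each reference pair and ignoring pairs that do not share a chapter are simplifications chosen for this sketch.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import combinations

# CPI values for the three in-document levels, as listed in Table 1.
CPI_SENTENCE = Fraction(1)
CPI_PARAGRAPH = Fraction(1, 2)
CPI_CHAPTER = Fraction(1, 4)


def pair_cpi(pos_a, pos_b):
    """CPI for two citation positions given as (chapter, paragraph, sentence).

    The indices are assumed to be global within the document, so equal
    sentence indices imply the same paragraph and chapter.
    """
    if pos_a[2] == pos_b[2]:
        return CPI_SENTENCE
    if pos_a[1] == pos_b[1]:
        return CPI_PARAGRAPH
    if pos_a[0] == pos_b[0]:
        return CPI_CHAPTER
    return Fraction(0)  # assumption: more distant pairs are ignored here


def document_cpi(citations):
    """Return {(ref_a, ref_b): CPI} for one citing document.

    `citations` is a list of (reference_id, (chapter, paragraph, sentence)).
    For each pair of distinct references, the closest co-occurrence wins
    (an assumption of this sketch).
    """
    best = defaultdict(Fraction)
    for (ref_a, pos_a), (ref_b, pos_b) in combinations(citations, 2):
        if ref_a == ref_b:
            continue
        cpi = pair_cpi(pos_a, pos_b)
        if cpi > 0:
            key = tuple(sorted((ref_a, ref_b)))
            best[key] = max(best[key], cpi)
    return dict(best)


# Hypothetical citing document with four in-text citations.
example = [
    ("ref1", (0, 0, 0)),
    ("ref2", (0, 0, 0)),  # same sentence as ref1
    ("ref3", (0, 0, 1)),  # same paragraph, next sentence
    ("ref4", (0, 1, 5)),  # same chapter, different paragraph
]
result = document_cpi(example)
for pair, cpi in sorted(result.items()):
    print(pair, cpi)
```

Exact fractions are used so that the values stay comparable with the table. Combining the per-document values across several citing documents would be a separate step that this sketch does not cover.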
[Figure 5: Illustration CPA. A citing document with in-text citations [1], [2] and [3] that refer to Document 1, Document 2 and Document 3. The figure labels one citation pair with CPI = 1 and another with CPI = 1/4.]

However, further research needs to be performed to identify the appropriate weighting of the CPI values according to their occurrence, which also seems to depend on the publication's research field and the publication's type. For example, it seems that different weightings are suitable for analyzing a technical report or a patent specification. First empirical evaluations have led to the values shown in Table 1 for calculating the CPI.

The results delivered by CPA can be improved by evaluating as many sources as possible. This can be the case due to multiple occurrences of the same citation and due to multiple documents citing a certain document. In our series of tests we experienced the best results by calculating the weighted average of the CPIs. By automating the process described above, we have calculated the CPI for publications contained in the SciPlore database. The results show that, in comparison to the results delivered by co-citation analysis, CPA delivers considerably better results in identifying similar documents [1].

Similar to the idea of CPA is another approach currently under development that we call Citation Order Analysis (COA). In contrast to CPA, in COA only the order of citations is considered.
The main advantage in comparison to the usually applied text analysis approaches is that even if documents are translated or paraphrased they can still be identified as similar. Depending on the level of tolerance, summarized documents can be identified even if citations were omitted. This way a digital fingerprint of documents can be created that can be used not only for recommender systems but also to identify plagiarized work. In some regard, this approach is similar to bibliographic coupling. However, by additionally considering the order of citations, this approach is more precise and robust. Figure 6 illustrates the concept.
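One way to picture the idea is to treat each document as the sequence of references it cites and to compare these sequences. The paper does not specify how COA measures similarity, so the sketch below is an assumption rather than the authors' method: it uses the longest common subsequence (LCS) of two citation sequences as a hypothetical score, so that omitted citations lower the score without breaking the match. The function names and reference identifiers are made up for illustration.

```python
def lcs_length(seq_a, seq_b):
    """Length of the longest common subsequence of two citation sequences."""
    prev = [0] * (len(seq_b) + 1)
    for a in seq_a:
        curr = [0]
        for j, b in enumerate(seq_b, start=1):
            if a == b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def order_similarity(seq_a, seq_b):
    """Hypothetical COA score in [0, 1]: LCS length over the longer sequence."""
    if not seq_a or not seq_b:
        return 0.0
    return lcs_length(seq_a, seq_b) / max(len(seq_a), len(seq_b))


# Citation sequences in reading order (hypothetical reference identifiers).
original = ["r1", "r2", "r3", "r4", "r5"]
translated = ["r1", "r2", "r3", "r4", "r5"]  # other language, same citation order
summarized = ["r1", "r3", "r5"]              # some citations omitted
reversed_order = ["r5", "r4", "r3", "r2", "r1"]

for name, seq in [("translated", translated),
                  ("summarized", summarized),
                  ("reversed", reversed_order)]:
    print(name, order_similarity(original, seq))
```

The reversed sequence cites exactly the same references as the original, so a purely set-based comparison such as bibliographic coupling would not distinguish the two, while an order-based score does. A tolerance threshold on the score would be one way to express the "level of tolerance" mentioned above.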
[Figure 6: Illustration Citation Order Analysis. Two texts, Document A and Document B, are compared by the order of their in-text citations.]

Table 1: CPI values

Occurrence                            CPI value
Sentence                              1
Paragraph                             1/2
Chapter                               1/4
Same journal / same book              1/8
Same journal but different edition    1/16

IV. OUTLOOK

Besides identifying related work, the authors work on applying the idea behind CPA to automatic document classification for the research paper recommender SciPlore [11]. The aim is to automatically analyze the topics within documents by analyzing the distribution of references within research papers. So instead of knowing, for instance, only that a certain publication focuses on the theory of relativity, CPA makes it possible to identify the document sections focusing, for example, on 'Time dilation', 'Length contraction' or 'Mass-energy equivalence' and then to give specific recommendations within documents or books.

Moreover, it is possible to combine CPA with text mining algorithms in order to automatically detect, e.g., contradicting studies: “The author A has shown in his recent study [reference A] that in contrast to a previous
study [reference B]...” So by analyzing the words between two references, it is often possible to determine automatically how the two references relate and compare to each other.

Knowing the position of each citation within a document often also makes it possible to draw conclusions about the document type, e.g. state-of-the-art publications. The gathered information can be used to classify further documents and to develop a more sophisticated 'Web of Science'. We believe that these technologies, in combination with collaborative filtering, will be the future for identifying related work and will open the doors for powerful research paper recommender systems.

V. DISCUSSION & CONCLUSION

As shown, CPA and COA offer substantial advantages over existing approaches in identifying related documents. However, it should also be taken into account that the effort is considerable. It is not sufficient to evaluate the bibliography of a document. Instead, the complete document must be processed, and each reference must be identified and mapped to the corresponding entry in the bibliography. In practice this is not always possible, and it leads to mismatches in ca. 3% of cases. The reasons are that sometimes only an abstract and the bibliography can be accessed, that documents cannot be parsed because OCR fails, or that a reference style is used that makes it infeasible to link references automatically to the corresponding items in the bibliography. This leads to the conclusion that although these new approaches deliver superior results, they cannot completely replace the existing approaches, but should be used in combination with them.

REFERENCES

[1] Gipp, B., & Beel, J. (2009). Citation Proximity Analysis (CPA) - A new approach for identifying related work based on Co-Citation Analysis. In Proceedings of the 12th International Conference on Scientometrics and Informetrics, pp. 571-575.
[2] Rip, A., & Courtial, J. (1984). Co-Word Maps of Biotechnology: An Example of Cognitive Scientometrics. Scientometrics, 6(6), 381-400.
[3] Fano, R. M. (1956). Information theory and the retrieval of recorded information. In Shera, J. H., Kent, A., & Perry, J. W. (Eds.), Documentation in Action, New York: Reinhold Publ. Co., pp. 238-244.
[4] Marshakova, I. V. (1973). System of document connections based on references. Nauchno-Tekhnicheskaya Informatsiya, vol. 2, no. 6, pp. 3-8.
[5] Beel, J., & Gipp, B. (2008). The Potential of Collaborative Document Evaluation for Science. 11th International Conference on Asian Digital Libraries (ICADL 2008), December 2-5, Kuta, Indonesia. In G. Buchanan, M. Masoodian, & S. Cunningham (Eds.), Digital Libraries: Universal and Ubiquitous Access to Information, Lecture Notes in Computer Science, vol. 5362, DOI 10.1007/978-3-540-89533-6, ISSN 0302-9743, pp. 375-378, Springer-Verlag Berlin Heidelberg.
[6] Small, H. (1973). Co-citation in the scientific literature: a new measure of the relationship between two documents. Journal of the American Society for Information Science, vol. 24, pp. 265-269.
[7] Klavans, R., & Boyack, K. (2006). Identifying a better measure of relatedness for mapping science. Journal of the American Society for Information Science and Technology, vol. 57, no. 2, pp. 251-263.
[8] Sternitzke, C., & Bergmann, I. (2009). Similarity measures for document mapping: A comparative study on the level of an individual scientist. Scientometrics, vol. 78, no. 1, pp. 113-130.
[9] Garfield, E. (2001, November 27). From Bibliographic Coupling to Co-Citation Analysis Via Algorithmic Historio-Bibliography: A Citationist's Tribute to Belver C. Griffith. Paper presented at Drexel University, Philadelphia, PA.
[10] Giles, C. L., Bollacker, K. D., & Lawrence, S. (1998). CiteSeer: an automatic citation indexing system. In Digital Libraries 98 - The Third ACM Conference on Digital Libraries, pp. 89-98.
[11] Gipp, B., Beel, J., & Hentschel, C. (2009). Scienstein - A Research Paper Recommender System. In Proceedings of IEEE International Conference on Emerging Trends in Computing, Tamil Nadu, India.
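
As a supplementary illustration of the CPI values listed in the text (1 for the same sentence, 1/2 for the same paragraph, 1/4 for the same chapter, 1/8 for the same journal or book), the following minimal Python sketch shows how such weights could be assigned to pairs of citations once each in-text citation has been located. It is not the SciPlore implementation. The `Citation` class, the function names, the summation of CPI values per pair, and the decision to treat citations in different chapters of one document as the "same journal / same book" case are all assumptions made for illustration. The 1/16 row is omitted because it requires journal edition metadata.

```python
from collections import defaultdict
from dataclasses import dataclass
from fractions import Fraction
from itertools import combinations

# CPI values from the table in the text; the dictionary keys are our own labels.
CPI_BY_LEVEL = {
    "sentence": Fraction(1),
    "paragraph": Fraction(1, 2),
    "chapter": Fraction(1, 4),
    "same_work": Fraction(1, 8),  # "same journal / same book" row (assumed mapping)
}


@dataclass(frozen=True)
class Citation:
    """One in-text citation and its position (hypothetical structure)."""
    ref: str        # identifier of the cited document
    chapter: int    # chapter index within the citing document
    paragraph: int  # paragraph index within the chapter
    sentence: int   # sentence index within the paragraph


def cpi(a: Citation, b: Citation) -> Fraction:
    """Return the CPI value for two citations in the same citing document."""
    if a.chapter != b.chapter:
        return CPI_BY_LEVEL["same_work"]
    if a.paragraph != b.paragraph:
        return CPI_BY_LEVEL["chapter"]
    if a.sentence != b.sentence:
        return CPI_BY_LEVEL["paragraph"]
    return CPI_BY_LEVEL["sentence"]


def pair_scores(citations):
    """Sum CPI values per pair of distinct cited documents (assumed aggregation)."""
    scores = defaultdict(Fraction)
    for a, b in combinations(citations, 2):
        if a.ref == b.ref:
            continue  # a document is not paired with itself
        key = tuple(sorted((a.ref, b.ref)))
        scores[key] += cpi(a, b)
    return dict(scores)


example = [
    Citation("A", chapter=1, paragraph=1, sentence=1),
    Citation("B", chapter=1, paragraph=1, sentence=1),
    Citation("C", chapter=1, paragraph=1, sentence=2),
    Citation("D", chapter=2, paragraph=1, sentence=1),
]
scores = pair_scores(example)
```

In this example the pair (A, B) shares a sentence and therefore receives the highest value, while D, cited in a different chapter, is only weakly linked to the others. Exact fractions are used so that the values match the table without rounding.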