E-business course reading: Topic Extraction from Scientific Literature for Competency Management


Topic Extraction from Scientific Literature for Competency Management

Paul Buitelaar, Thomas Eigner
DFKI GmbH, Language Technology Lab & Competence Center Semantic Web
Stuhlsatzenhausweg 3, 66123 Saarbrücken, Germany
paulb@dfki.de

Abstract

We describe an approach towards automatic, dynamic and time-critical support for competency management and expertise search through topic extraction from scientific publications. In the use case we present, we focus on the automatic extraction of scientific topics and technologies from publicly available publications using web sites like Google Scholar. We discuss an experiment for our own organization, DFKI, as an example of a knowledge organization. The paper presents evaluation results over a sample of 48 DFKI researchers who responded to our request for a-posteriori evaluation of automatically extracted topics. The results of this evaluation are encouraging and provided us with useful feedback for further improving our methods. The extracted topics can be organized in an association network that can be used further to analyze how competencies are interconnected, thereby also enabling a better exchange of expertise and competence between researchers.

1 Introduction

Competency management, the identification and management of experts and of their knowledge in certain competency areas, is a growing area of research as knowledge has become a central factor in achieving commercial success. It is of fundamental importance for any organization to keep up to date with the competencies it covers, in the form of experts among its workforce. Identification of experts will be based mostly on recruitment information, but this is not sufficient, as competency coverage (competencies of interest to the organization) and structure (interconnections between competencies) change rapidly over time. The automatic identification of competency coverage and structure, e.g. from publications, is therefore of increasing importance, as this allows for a sustainable, dynamic and time-critical approach to competency management.

In this paper we present a pattern-based approach to the extraction of competencies in a knowledge-based research organization (scientific topics, technologies) from publicly available scientific publications. The core assumption of our approach is that such topics will not occur in random fashion across documents, but instead occur only in specific scientific discourse contexts that can be precisely defined and used as patterns for topic extraction.

The remainder of the paper is structured as follows. In section 2 we describe related work in competency management and argue for an approach based on natural language processing and ontology modeling. We describe our specific approach to topic extraction for competency management in detail in section 3. The paper then continues with the description of an experiment that we performed on topic extraction for competency management in our own organization, DFKI. Finally, we close the paper with conclusions that can be drawn from our research and ideas for future work that arise from them.

2 Related Work

Competency management is a growing area of knowledge management that is concerned with the “identification of skills, knowledge, behaviors, and capabilities needed to meet current and future personnel selection needs, in alignment with the differentiations in strategies and organizational priorities.” [1] Our particular focus here is on aspects of competency management relating to the identification and management of knowledge about scientific topics and technologies, which is at the basis of competency management.

Most of the work on competency management has been focused on the development of methods for the identification, modeling, and analysis of skills and skills gaps and on training solutions to help remedy the latter. An important initial step in this process is the identification of skills and knowledge of interest, which is mostly done through interviews, surveys and manual analysis of existing competency models. Recently, ontology-based approaches have been proposed that aim at modeling the domain model of particular organization types (e.g. computer science, health-care) through formal ontologies, over which matchmaking services can be defined for bringing together skills and organization requirements (e.g. [2], [3]).
The development of formal ontologies for competency management is important, but there is an obvious need for automated methods in the construction and dynamic maintenance of such ontologies. Although some work has been done on developing automated methods for competency management through text and web mining (e.g. [4]), this is mostly restricted to the extraction of associative networks between people according to documents or other data they are associated with. Instead, for the purpose of automated and dynamic support of competency management, a richer analysis of competencies and semantic relations between them is needed, as can be extracted from text through natural language processing.

3 Approach

Our approach towards the automatic construction and dynamic maintenance of ontologies for competency management is based on the extraction of relevant competencies and semantic relations between them through a combination of linguistic patterns, statistical methods as used in information retrieval and machine learning, and background knowledge if available.

Central to the approach as discussed in this paper is the use of domain-specific linguistic patterns for the extraction of potentially relevant competencies, such as scientific topics and technologies, from publicly available scientific publications. In this text type, topics and technologies will occur in the context of cue phrases such as ‘developed a tool for XY’ or ‘worked on methods for YZ’, where XY and YZ are possibly relevant competencies that the authors of the scientific publication are or have been working on. Consider for instance the following excerpts from three scientific articles in chemistry:

    …profile refinement method for nuclear and magnetic structures…
    …continuum method for modeling surface tension…
    …a screening method for the crystallization of macromolecules…

In all three cases a method is discussed for addressing a particular problem that can be interpreted as a competency topic: ‘nuclear and magnetic structures’, ‘modeling surface tension’, ‘crystallization of macromolecules’. The pattern that we can thus establish from these examples is as follows:

    method for [TOPIC]

as in:

    method for [nuclear and magnetic structures]
    method for [modeling surface tension]
    method for [(the) crystallization of macromolecules]

Other patterns that we manually identified in this way are:

    approach for [TOPIC]
    approaches for [TOPIC]
    approach to [TOPIC]
    approaches to [TOPIC]
    methods for [TOPIC]
    solutions for [TOPIC]
    tools for [TOPIC]

We call these the ‘context patterns’, which as their name suggests provide the lexical context for the topic extraction. The topics themselves can be described by so-called ‘topic patterns’, which describe the linguistic structure of possibly relevant topics that can be found in the right context of the defined context patterns.
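The context patterns listed above can be illustrated with a small Python sketch. This is not the authors' implementation: the regular expression, the function name `extract_topic_segments` and the decision to end a segment at the first punctuation mark are assumptions made for illustration. The 10-word cap follows the segment length reported in section 4.1.

```python
import re

# Context patterns as listed in the text.
CONTEXT_PATTERNS = [
    "method for", "methods for",
    "approach for", "approaches for",
    "approach to", "approaches to",
    "solutions for", "tools for",
]

MAX_WORDS = 10  # segment length cap, as reported in section 4.1

_cue = "|".join(re.escape(p) for p in CONTEXT_PATTERNS)
_word = r"\w[\w-]*"  # assumption: a word is letters/digits, possibly hyphenated

# A cue phrase followed by up to MAX_WORDS whitespace-separated words.
# Assumption: the segment ends at the first punctuation mark.
SEGMENT_RE = re.compile(
    rf"\b(?P<cue>{_cue})\s+"
    rf"(?P<segment>{_word}(?:\s+{_word}){{0,{MAX_WORDS - 1}}})",
    re.IGNORECASE,
)


def extract_topic_segments(text):
    """Return (cue phrase, right-context segment) pairs found in text."""
    return [
        (m.group("cue").lower(), m.group("segment"))
        for m in SEGMENT_RE.finditer(text)
    ]


if __name__ == "__main__":
    sample = ("We present a continuum method for modeling surface tension. "
              "We also built tools for ontology learning.")
    for cue, segment in extract_topic_segments(sample):
        print(f"{cue} [{segment}]")
```

The segments returned here are only candidate text; selecting the actual topic from each segment is the job of the topic pattern, which this sketch does not attempt.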
Topic patterns are defined in terms of part-of-speech tags that indicate whether a word is, for instance, a noun, a verb, etc. For now, we define only one topic pattern, which defines a topic as a noun (optional) followed by a sequence of zero or more adjectives followed by a

4 Experiment

To evaluate our methods we developed an experiment based on the methods discussed in the previous section, involving researchers from our own organization, DFKI. For all of these, we downloaded their scientific publications, extracted and ranked topics as explained above and then asked a randomly selected subset of this group to evaluate the topics assigned to them. Details of the data set used, the evaluation procedure, the results obtained and a discussion of results and evaluation procedure are provided in the following.

4.1 Data Set

The data set we used in this experiment consists of 3253 downloaded scientific publications for 199 researchers at DFKI. The scientific content of these publications is all concerned with computer science in general, but still varies significantly, as we include researchers from all departments at DFKI³ with a range of scientific work in natural language processing, information retrieval, knowledge management, business informatics, image processing, robotics, agent systems, etc.

The documents were downloaded by use of the Google API, in HTML format as provided by Google Scholar. The HTML content is generated automatically by Google from PDF, Postscript or other formats, and unfortunately contains a fair number of errors: among others the contraction of ‘fi’ in words like ‘specification’ (resulting in ‘specication’ instead), the contraction of separate words into nonsensical compositions such as ‘stemmainlyfromtwo’ and the appearance of strange character combinations such as ‘â✂✁’. Although such errors potentially introduce noise into the extraction, we assume that the statistical relevance assignment will largely normalize this, as such errors do not occur in any systematic way. Needless to say, this situation is not ideal, and we are looking for ways to improve this aspect of the extraction process.

The document collection was used to extract topics as discussed above, which resulted first in the extraction of 7946 topic text segments by running the context patterns over the text sections of the HTML documents⁴. The extracted topic text segments (each up to 10 words long) were then part-of-speech tagged with TnT, after which we applied the defined topic pattern to extract one topic from each topic text⁵. Finally, to compute the weight of each topic for each researcher (a topic can be assigned to several researchers but potentially with different weights) and to assign a

³ See http://www.dfki.de/web/welcome?set_language=en&cl=en for an overview of DFKI departments and the corresponding range in scientific topics addressed.
⁴ For this purpose we stripped off HTML tags and removed page numbering, new-lines and dashes at end-of-line (to normalize for instance ‘as-signed’ to ‘assigned’).
⁵ In theory it could also occur that no topic can be identified in a topic text, but this will almost never occur as the topic text will contain at least one noun (that matches the topic pattern as defined in section 3).
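The clean-up described in the footnote on HTML stripping can be sketched roughly as follows. This is an illustrative approximation rather than the procedure actually used: the function name `normalize_text`, the regular expressions, and the rule that a line holding only digits is a page number are all assumptions.

```python
import html
import re


def normalize_text(raw_html):
    """Approximate the clean-up steps named in the footnote (assumed details)."""
    # Strip HTML tags (a simple regex; a real HTML parser would be more robust).
    text = re.sub(r"<[^>]+>", " ", raw_html)
    # Decode character entities such as &amp;.
    text = html.unescape(text)
    # Assumption: a line containing only digits is page numbering.
    text = re.sub(r"(?m)^\s*\d+\s*$", "", text)
    # Rejoin words split by a dash at end of line: 'as-\nsigned' -> 'assigned'.
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Remove new-lines and collapse runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    page = "<p>a topic can be as-\nsigned to several</p>\n12\n<p>researchers</p>"
    print(normalize_text(page))
```

One known limitation of this kind of rule is that a genuine compound broken at a line end (for example ‘part-of-’ followed by ‘speech’) is rejoined without its hyphen as well.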

Level of Correctness    Number of Researchers
0-10%                   7
11-20%                  1
21-30%                  3
31-40%                  9
41-50%                  6
51-60%                  9
61-70%                  10
71-80%                  3
81-100%                 0
Total                   48

Table 1: Evaluation results

4.3 Discussion

Results of the evaluation vary strongly between researchers: almost half of them judge their assigned topics as more than 50% correct and 13 judge them more than 60% correct. On the other hand, 7 researchers are very critical of the topics extracted for them (less than 10% correct) and slightly more than half judge their assigned topics less than 50% correct.

Additionally, in discussing evaluation results with some of the researchers involved we learned that it was sometimes difficult for them to decide on the appropriateness of an extracted topic, mainly because a topic may be appropriate in principle but: i) is too specific or too general; ii) is slightly misspelled; iii) occurs in capitalized form as well as in small letters; iv) is not entirely appropriate for the researcher in question. We also learned that researchers would like to rank (or rather re-rank) extracted topics, although we did not explicitly tell them they were ranked in any order.

In summary, we take the evaluation results as a good basis for further work on topic extraction for competency management, in which we will address a number of the smaller and bigger issues that we learned from the evaluation.

5 Applications

The overall application of the work presented here is management of competencies in knowledge organizations such as research institutes like DFKI. As mentioned, we will therefore make the extracted topics available as an ontology and corresponding knowledge base, on which further services can be defined and implemented, such as expert finding and matching. For this purpose we need to organize the extracted topics further by extracting relations between topics and thus indirectly between researchers or groups of researchers working on these topics.
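One simple relation between topics is how often two of them are found in the same document. The Python sketch below counts such document-level co-occurrences and ranks the pairs by frequency. It is a minimal illustration under assumptions, not the procedure used in the paper: the function name `rank_cooccurring_topics`, the input layout (document id mapped to extracted topics) and the example topics are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def rank_cooccurring_topics(topics_by_document):
    """Rank topic pairs by the number of documents in which both occur.

    topics_by_document maps a document id to the topics extracted from it
    (a hypothetical input layout chosen for this sketch).
    """
    pair_counts = Counter()
    for topics in topics_by_document.values():
        # Sort the distinct topics so each pair has one canonical order.
        for pair in combinations(sorted(set(topics)), 2):
            pair_counts[pair] += 1
    # Most frequent pairs first.
    return pair_counts.most_common()


if __name__ == "__main__":
    # Illustrative topics only, not data from the experiment.
    docs = {
        "d1": ["ontology learning", "information extraction", "machine translation"],
        "d2": ["information extraction", "ontology learning"],
        "d3": ["machine translation"],
    }
    for pair, count in rank_cooccurring_topics(docs):
        print(count, pair)
```

Raw counts favour frequent topics; a normalized association measure could be substituted in the same loop if that turns out to matter.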
We took a first step in this direction by analyzing the co-occurrence of positively judged topics (380 in total) from our evaluation set in the documents that they were extracted from. This resulted in a ranked list of pairs of topics co-occurring more or less frequently. The following
