…ideal for keyword extraction with TF-IDF. Second, the proposals have a variety of meta-data that is useful in other aspects of the process. This meta-data includes the PI's name, e-mail address and other contact information, and an NSF ID for the PI's university. This meta-data simplifies contacting the PI and checking for conflicts of interest between proposals and reviewers. Third, NSF has a strong preference for using people with Ph.D. degrees as reviewers, and one cannot distinguish new graduate students from professors on the basis of published papers. By using people who have submitted to NSF as the reviewer pool, this problem is avoided, since those eligible to apply to NSF are also eligible to review. Finally, using proposals avoids the problem of disambiguating people with common names, and it automatically creates a large pool of potential reviewers. A disadvantage of this approach is that it does not include people who do not submit to NSF, such as researchers from industry or from outside the US. Of course, program directors may identify such people through the usual means, such as checking the editorial boards of journals and the program committees of conferences.
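To make the representation concrete, the following is a minimal Python sketch of building the normalized TF-IDF term vectors that the rest of this section assumes. It takes pre-tokenized input and uses the standard log IDF; Revaide's actual tokenization, stop-word handling, and weighting details are not specified here, and the function name is illustrative.

    import math
    from collections import Counter

    def tf_idf_vectors(docs):
        # docs: list of token lists, one per proposal.
        n = len(docs)
        df = Counter()                      # document frequency per term
        for doc in docs:
            df.update(set(doc))
        vectors = []
        for doc in docs:
            tf = Counter(doc)
            vec = {t: tf[t] * math.log(n / df[t]) for t in tf}
            # L2-normalize so cosine similarity reduces to a dot product.
            norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
            vectors.append({t: w / norm for t, w in vec.items()})
        return vectors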
In practice, we restrict Revaide's pool of reviewers to those authors of proposals that have been judged "fundable" by the review process, to ensure that the reviewers were thought by their peers to have expertise in the area. We also leave out proposals with more than one author, so that it is clear who holds the expertise in a proposal. When more than one past proposal is available for a given author, all of the proposals are combined by adding and then re-normalizing the term vectors to form a model of the author's expertise. The example proposal representation in the previous section would also serve as the expertise representation of the author that submitted the proposal.

3.3 Cluster Checking

The first task we consider is assisting groups of program directors in forming panels. The most help is needed in large competitions, where 500-1500 proposals may be submitted at a time. NSF's system produces a spreadsheet with columns containing information such as the author's name, institution, and the title of the proposal, along with links to the abstract and the PDF of the entire proposal. Teams of program directors manually sort these proposals first into general areas and then into panels of 20-30 proposals. Given the short time and the large number of proposals, it is possible for a proposal to be put into a panel with only a loose relationship to the majority of the panel's proposals. Due to the distributed nature of the work, it is also possible that no one claims responsibility for a proposal.

As described earlier, attempts to use automated clustering failed at this task because program directors didn't accept the results of the clustering system. Instead of clustering automatically, Revaide checks the clusters produced by program directors for coherence and suggests improvements. In addition, Revaide suggests panels for "orphan" proposals that are not assigned to a panel. Furthermore, before program directors form panels, the spreadsheet they use is augmented with the terms that have the highest TF-IDF weights in each proposal. (Although the weights themselves are not included, the terms are ordered by weight.)

The first step in cluster checking is to form a representation of the important terms of each cluster. In Revaide, this is done by finding the centroid [10] of the proposals that are in each cluster, essentially creating a term vector for each cluster that is the "average" of the term vectors of its proposals. Next, the cosine similarity [10] is found between each proposal's term vector and each cluster's term vector. Revaide also produces a summary of the important terms in each cluster, chosen by their weighted TF-IDF scores. The example below illustrates such a summary; in addition to the TF-IDF weight of each term, Revaide prints out the number of proposals in the cluster that contain each term. (This example comes from an earlier version of Revaide that used stemming [9], perhaps also illustrating why we turned stemming off in later versions.)

The top 20 terms of panel ROB are:
    robot: 0.267 (in 24/28)   sensor: 0.203 (in 28/28)
    vehicl: 0.144 (in 22/28)  imag: 0.114 (in 22/28)
    motion: 0.107 (in 22/28)  intellig: 0.104076 (in 25/28)
    mobil: 0.102 (in 23/28)   agent: 0.094 (in 18/28)
    autom: 0.091 (in 25/28)   movement: 0.078 (in 17/28)
    action: 0.077 (in 23/28)  sens: 0.068554 (in 26/28)
    autonom: 0.068 (in 25/28) self: 0.068 (in 21/28)
    assembl: 0.064 (in 18/28)

If the most similar cluster to a proposal is not the cluster to which the proposal has been assigned, that is a sign that the proposal is potentially in the wrong cluster. Such discrepancies are pointed out to the program director, along with a suggestion to move the proposal to another panel.
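The check just described is a nearest-centroid test over the term vectors. The sketch below is our illustration of that logic, not Revaide's code; the panels data structure and function names are assumptions made for the example.

    import math
    from collections import defaultdict

    def centroid(vectors):
        # Average sparse term vectors, then re-normalize to unit length.
        acc = defaultdict(float)
        for vec in vectors:
            for term, w in vec.items():
                acc[term] += w / len(vectors)
        norm = math.sqrt(sum(w * w for w in acc.values())) or 1.0
        return {t: w / norm for t, w in acc.items()}

    def cosine(u, v):
        # Cosine similarity of two sparse term vectors.
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def check_clusters(panels):
        # panels: {panel name: {proposal id: term vector}}.
        # Yields (proposal, assigned panel, better panel) for each proposal
        # whose nearest centroid is not the panel it was assigned to.
        centroids = {name: centroid(list(props.values()))
                     for name, props in panels.items()}
        for name, props in panels.items():
            for pid, vec in props.items():
                best = max(centroids, key=lambda c: cosine(vec, centroids[c]))
                if best != name:
                    yield pid, name, best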
Below, the output of cluster checking is shown, omitting any identifying information.

The top 20 terms of panel CIP-SC are:
    sensor: 0.355 (in 31/32)    vehicl: 0.2493 (in 22/32)
    wireless: 0.178 (in 29/32)  monitor: 0.157 (in 32/32)
    node: 0.147 (in 27/32)      transport: 0.136 (in 29/32)
    devic: 0.132 (in 30/32)     signal: 0.129 (in 30/32)
    traffic: 0.129 (in 22/32)   grid: 0.119 (in 21/32)
    event: 0.116937 (in 32/32)  energi: 0.107 (in 29/32)
    transmiss: 0.105 (in 25/32) protocol: 0.103 (in 27/32)
    flow: 0.103 (in 26/32)      layer: 0.100317 (in 25/32)
    mobil: 0.100 (in 26/32)     rout: 0.096 (in 23/32)
    agent: 0.092 (in 17/32)     safeti: 0.091 (in 25/32)

Panel DSP is a better match for proposal NSF04XXXX1 than cluster CIP-SC.

In our experience, Revaide recommends a better panel for approximately 5% of the proposals. We have received comments from program directors such as, "Thanks, I don't know how I overlooked that," in response to Revaide's cluster checking. Often, the better panel Revaide finds is a matter of emphasis within a proposal, e.g., determining that a proposal will make a contribution to computer vision for astronomical applications as opposed to making a contribution to astronomy using existing computer vision techniques.

A special case of cluster checking arises when a proposal has not been put into any panel. This can occur if no member of the distributed team of program directors has identified that the proposal falls within the scope of a panel. In this case, the panel that is most similar to the proposal is found, together with the next three, as determined by the cosine similarity between the orphan proposal's term vector and each panel's centroid.
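For orphan proposals, the same machinery yields a ranked list of candidate panels. A minimal sketch under the same assumptions as the previous example (cosine() is repeated here so the snippet stands alone; k defaults to 4, matching the best panel plus the next three):

    import math

    def cosine(u, v):
        # Cosine similarity of two sparse term vectors (as above).
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def suggest_panels(orphan_vec, centroids, k=4):
        # Rank all panels by similarity to the unassigned proposal's
        # term vector and return the top k candidates.
        ranked = sorted(centroids,
                        key=lambda c: cosine(orphan_vec, centroids[c]),
                        reverse=True)
        return ranked[:k]

Returning several candidates rather than a single assignment keeps the program director in control, consistent with Revaide's advisory role described earlier.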