CAAI Transactions on Intelligent Systems, Vol.11 No.3, Jun. 2016
DOI: 10.11992/tis.201603048
Online publication address: http://www.cnki.net/kcms/detail/23.1538.TP.20160513.0958.036.html

A novel method using word vector and graphical models for entity disambiguation in specific topic domains
(Chinese title: 一种结合词向量和图模型的特定领域实体消歧方法)

WANG Pei 1, XIAN Yantuan 1,2, GUO Jianyi 1,2, WEN Yonghua 1,2, CHEN Wei 1,2, WANG Hongbin 1,2
(1. School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China; 2. Key Laboratory of Intelligent Information Processing, Kunming University of Science and Technology, Kunming 650500, China)

Abstract: This paper proposes a method that combines word vectors with a graph model to perform entity disambiguation in a specific domain, taking tourism as an example. First, page content under the tourism category of an offline Wikipedia database is used to build a domain knowledge base. Next, a word vector model is trained with Word2Vec on the knowledge-base text together with tourism text crawled from major tourism websites. A manually annotated entity-relation graph is then combined with this model, and a graph-based random walk algorithm assists the similarity computation so that word-to-word similarity within the tourism domain can be calculated more accurately. Finally, several keywords are extracted from the background text of the entity to be disambiguated and from the knowledge-base text of each candidate entity; the trained word vector model and the graph model are used to compute cross-similarities between the two keyword sets, and the candidate entity with the highest mean similarity is selected as the target entity. Experimental results show that this similarity measure effectively captures the similarity between an entity mention and its target entity, and therefore achieves fairly accurate entity disambiguation in a specific domain.

Keywords: entity disambiguation; entity linking; Word2Vec; graph model; random walk; Wikipedia

CLC number: TP393  Document code: A  Article ID: 1673-4785(2016)03-0366-09

Citation format: WANG Pei, XIAN Yantuan, GUO Jianyi, et al. A novel method using word vector and graphical models for entity disambiguation in specific topic domains[J]. CAAI Transactions on Intelligent Systems, 2016, 11(3): 366-375.

Received: 2016-03-19. Published online: 2016-05-13.
Funding: National Natural Science Foundation of China (61262041, 61472168, 61462054, 61562052); Key Project of the Natural Science Foundation of Yunnan Province (2013FA030).
Corresponding author: GUO Jianyi. E-mail: gjade86@hotmail.com.

Entity linking is one of the key techniques in knowledge base construction. Its purpose is to link named entities that have already been extracted from text to an existing
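The final selection step described in the abstract (cross-similarity between the mention's keywords and each candidate's keywords, followed by choosing the candidate with the highest mean) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the similarity function here is plain cosine similarity over hand-made toy vectors, standing in for the trained Word2Vec model combined with the graph-based random walk, whose details are not given on this page. The names `TOY_VECTORS`, `CANDIDATES`, `toy_sim`, `mean_cross_similarity`, and `disambiguate`, as well as the example keywords and entities, are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]

# Hypothetical toy vectors; a real system would look these up in a trained model.
TOY_VECTORS: Dict[str, Vector] = {
    "old_town": (1.0, 0.1),
    "snow_mountain": (0.9, 0.2),
    "naxi": (0.8, 0.3),
    "cruise": (0.1, 1.0),
    "karst": (0.2, 0.9),
}

# Hypothetical candidate entities, each described by a few keywords.
CANDIDATES: Dict[str, List[str]] = {
    "Lijiang (Yunnan)": ["old_town", "naxi"],
    "Li River (Guangxi)": ["cruise", "karst"],
}


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def toy_sim(a: str, b: str) -> float:
    """Stand-in word similarity (assumption: cosine only, no graph component)."""
    return cosine(TOY_VECTORS[a], TOY_VECTORS[b])


def mean_cross_similarity(
    mention_keywords: List[str],
    candidate_keywords: List[str],
    sim: Callable[[str, str], float],
) -> float:
    """Average similarity over every (mention keyword, candidate keyword) pair."""
    pairs = [(m, c) for m in mention_keywords for c in candidate_keywords]
    if not pairs:
        return 0.0
    return sum(sim(m, c) for m, c in pairs) / len(pairs)


def disambiguate(
    mention_keywords: List[str],
    candidates: Dict[str, List[str]],
    sim: Callable[[str, str], float],
) -> str:
    """Return the candidate whose keywords have the highest mean cross-similarity."""
    return max(
        candidates,
        key=lambda name: mean_cross_similarity(mention_keywords, candidates[name], sim),
    )


if __name__ == "__main__":
    context_keywords = ["snow_mountain", "old_town"]
    print(disambiguate(context_keywords, CANDIDATES, toy_sim))
```

Passing the similarity function as a parameter keeps the averaging and selection logic separate from how word similarity is computed, so a function that blends word-vector and graph-based scores could be substituted without changing the rest.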