
CAAI Transactions on Intelligent Systems, Vol. 5 No. 2, Apr. 2010
doi: 10.3969/j.issn.1673-4785.2010.02.005

K-means dynamic web topic detection method based on named entities

LIU Su-qin¹, CHAI Song¹,²
(1. College of Computer & Communication Engineering, China University of Petroleum, Qingdao 266555, China; 2. Automation Workstation, Military District, Shandong Province, Ji'nan 250013, China)

CLC number: TP18  Document code: A  Article ID: 1673-4785(2010)02-0122-05

Abstract: Traditional web topic detection methods are weak in how they represent text features, and the K-means clustering algorithm has known drawbacks of its own. To address both, the authors propose a dynamic K-means detection method for web topics based on named entities. The method improves the feature representation used in traditional topic detection: text features are represented by a combination of named entities and text feature words, and the weight of a named entity is given by the size of its contribution to the text representation. In addition, an adaptive technique makes the number of clusters K in the K-means algorithm self-converge, which optimizes the algorithm, and the dynamic selection of K is used to achieve dynamic detection of web topics. Experimental results show that the method distinguishes similar topics well and effectively improves topic detection performance.
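Keywords: named entity; web topics; dynamic detection; K-means clustering method; self-similarity; topic vector

Web topic detection and tracking [1] (TDT) aims to develop a new technology that can automatically identify the topics in a news data stream without human intervention [2]. Topic detection mainly studies how to assign the reports in data streams from sources such as news stories and newswires to different topics, creating new topics when necessary. Reports on similar topics share a large amount of vocabulary, which easily causes topics to be misjudged, and traditional incremental clustering methods have difficulty solving this problem [3]; Kumaran used named entities to address it. Closer analysis shows that although named entities can distinguish similar topics to some extent, the number of named entities in a news report is limited. Relying on named entities alone, while discarding the many other keywords that describe the topic content, inevitably gives an incomplete summary of the topic framework and so degrades topic detection performance [4].

This paper extracts the named entities in a text and the feature words other than named entities separately, assigns them different weights, and represents each news document as a dual feature vector based on named entities and feature words. On this basis, the K-means clustering method [5] is studied and combined with a self-similarity strategy to determine the value of K. This solves the problem of self-convergence of K in the clustering algorithm, and the dynamic selection of K is ultimately used to achieve dynamic detection of web topics. Experimental results show that, compared with traditional topic detection methods, the method handles the difficulty of distinguishing similar topics in massive web data well and effectively achieves dynamic detection of web topics; this topic detection method outperforms traditional …

Received: 2009-12-04.
Corresponding author: LIU Su-qin. E-mail: liusq@upc.edu.cn.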
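The dual feature vector idea from the introduction can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names (`dual_vector`, `dual_similarity`), the use of raw term counts, the weighted-sum combination, and the weight `alpha = 0.6` are all assumptions made for illustration, and the named-entity set is supplied by hand rather than produced by a recognizer.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def dual_vector(tokens, entities):
    """Split tokens into a named-entity vector and a feature-word vector.

    `entities` is a hypothetical, pre-computed set of named-entity strings;
    a real system would obtain it from a named-entity recognizer.
    """
    ne = Counter(t for t in tokens if t in entities)
    fw = Counter(t for t in tokens if t not in entities)
    return ne, fw


def dual_similarity(doc_a, doc_b, alpha=0.6):
    """Weighted sum of entity similarity and feature-word similarity.

    alpha is an assumed weight for the named-entity part, not a value
    taken from the paper.
    """
    ne_a, fw_a = doc_a
    ne_b, fw_b = doc_b
    return alpha * cosine(ne_a, ne_b) + (1.0 - alpha) * cosine(fw_a, fw_b)


# Toy reports: identical ordinary words, different or identical entities.
entities = {"Qingdao", "Beijing"}
d1 = dual_vector(["earthquake", "rescue", "Qingdao"], entities)
d2 = dual_vector(["earthquake", "rescue", "Beijing"], entities)
d3 = dual_vector(["earthquake", "rescue", "Qingdao"], entities)

print(dual_similarity(d1, d2))  # same words, different entity
print(dual_similarity(d1, d3))  # same words, same entity
```

In this toy case the two reports that share every ordinary word but name different places score about 0.4, while the pair naming the same place scores about 1.0, which is the kind of separation between similar topics that the dual representation is meant to provide.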
