Chinese Journal of Engineering, Vol. 41, No. 9: 1208-1214, 2019

Chinese Journal of Engineering, Vol. 41, No. 9: 1208-1214, September 2019
DOI: 10.13374/j.issn2095-9389.2019.09.013; http://journals.ustb.edu.cn

A topic detection method for network long text (一种面向网络长文本的话题检测方法)

ZHENG Heng-yi (郑恒毅) 1), LIAO Cheng-lin (廖城霖) 2) ✉, LI Tian-zhu (李天柱) 2)
1) College of Mechanical Engineering, Chongqing University, Chongqing 400044, China
2) College of Automation, Chongqing University, Chongqing 400044, China
✉ Corresponding author, E-mail: liaochenglin1127@gmail.com

Abstract (translated from the Chinese) A topic detection method for long texts on the web is proposed. To address the high dimensionality and sparsity of text representations and their neglect of latent semantics, a Word2vec & LDA (latent Dirichlet allocation) text representation is proposed. The latent topics of text feature words extracted by LDA are combined, by weighted fusion, with the feature word vectors mapped by Word2vec; this both reduces dimensionality and represents the text information fairly completely. To address the sensitivity of traditional topic discovery methods to the input order of long texts, a text-clustering-based Single-Pass & HAC (hierarchical agglomerative clustering) topic discovery method is proposed. By introducing a time window and agglomerative hierarchical clustering, it is more robust to the input order of texts and also improves clustering precision and efficiency. To evaluate the proposed method, a real-world multi-source dataset was collected from a university social platform, and extensive experiments were carried out on it. The results show that the proposed method outperforms existing methods such as VSM (vector space model) and Single-Pass, improving topic detection precision by 10% to 20%.

Keywords: network long text; topic detection; text representation; topic discovery; text clustering
Classification number: TP391.4
Received: 2019-01-03

ABSTRACT Internet public opinion is an important source of people's views on social hotspots and national current affairs. Topic detection in network long text contributes to the analysis of network public opinion. Based on the results of topic detection, policymakers can make timely, reliable, and well-founded decisions. In general, topic detection can be divided into two steps: representation learning and topic discovery. However, common representation learning methods, such as the vector space model (VSM) and term frequency-inverse document frequency (TF-IDF), often lead to high dimensionality, sparsity, and loss of latent semantics, whereas traditional topic discovery methods depend heavily on the input order of the texts. To overcome these problems, a novel topic detection method is presented. First, a Word2vec & latent Dirichlet allocation (LDA)-based method for representation learning is proposed to avoid high-dimensional sparsity and the neglect of latent semantics. Weighted fusion of the latent topics of text feature words extracted by LDA with the feature word vectors mapped by Word2vec both reduces dimensionality and represents the text information fairly completely. Furthermore, combining Single-Pass with hierarchical agglomerative clustering (HAC) for topic discovery makes the method more robust to input order. To evaluate the effectiveness and efficiency of the proposed method, extensive experiments were conducted on a real-world multi-source dataset collected from university social platforms. The experimental results show that the proposed method outperforms other methods, such as VSM and Single-Pass, improving clustering accuracy by 10%-20%.

KEY WORDS network long text; topic detection; text representation; topic discovery; text clustering
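
To make the input-order sensitivity mentioned in the abstract concrete, the following is a minimal sketch of a plain Single-Pass clustering baseline. It is not the authors' Single-Pass & HAC method: it has no time window and no hierarchical merging step. The function names, the cosine-similarity measure, the running-mean centroid update, the threshold of 0.7, and the three toy vectors are all assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def single_pass(vectors, threshold):
    """Plain Single-Pass clustering (illustrative baseline, assumed details).

    Each vector is processed once, in arrival order. It joins the cluster
    whose centroid is most similar if that similarity reaches `threshold`;
    otherwise it opens a new cluster. Returns one cluster label per vector.
    """
    centroids = []  # running mean vector of each cluster
    sizes = []      # number of members in each cluster
    labels = []
    for vec in vectors:
        best, best_sim = -1, -1.0
        for idx, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:  # ties keep the earliest cluster
                best, best_sim = idx, sim
        if best >= 0 and best_sim >= threshold:
            n = sizes[best]
            # Update the centroid as the mean of all members seen so far.
            centroids[best] = [(c * n + x) / (n + 1)
                               for c, x in zip(centroids[best], vec)]
            sizes[best] = n + 1
            labels.append(best)
        else:
            centroids.append(list(vec))
            sizes.append(1)
            labels.append(len(centroids) - 1)
    return labels


# Toy document vectors (hypothetical): b lies halfway between a and c.
a, b, c = [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]

# Same three documents, two arrival orders.
print(single_pass([a, b, c], threshold=0.7))  # b is grouped with a
print(single_pass([c, b, a], threshold=0.7))  # b is grouped with c
```

In this toy case the document `b` ends up with `a` in one arrival order and with `c` in the other, so the resulting topics depend on the order alone. This is the behavior the abstract says the time window and the agglomerative merging step are intended to reduce.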
