Chinese Journal of Engineering (工程科学学报), Vol. 41, No. 5: 682-693, May 2019


DOI: 10.13374/j.issn2095-9389.2019.05.015; http://journals.ustb.edu.cn

A new internal clustering validation index for categorical data based on concentration of attribute values

FU Li-wei, WU Sen
Donlinks School of Economics and Management, University of Science and Technology Beijing, Beijing 100083, China
Corresponding author, E-mail: wusen@manage.ustb.edu.cn

Abstract (摘要) For categorical data, a new within-cluster similarity based on the concentration of attribute values (similarity based on concentration of attribute values, CONC) is defined from the degree to which data objects concentrate on each attribute value; it measures the similarity among the data objects within each cluster of a clustering result. A between-cluster dissimilarity based on the discrepancy of strength vectors (dissimilarity based on discrepancy of SVs, DCRP) is defined from the degree to which the characteristic attribute values of different clusters differ; it measures the dissimilarity between two clusters. Building on CONC and DCRP, a new internal clustering validation index for categorical data (clustering validation based on concentration of attribute values, CVC) is proposed. It has three features: (1) when evaluating the similarity within each cluster, it relies not only on the features of the data objects in that cluster but also on information from the whole dataset; (2) it evaluates the dissimilarity of two clusters from the differences of several characteristic attribute values, which ensures that no useful clustering information is lost in the evaluation while eliminating the influence of noise; (3) when evaluating within-cluster similarity and between-cluster dissimilarity, it eliminates the influence of the number of data objects on the evaluation. Experiments are conducted on datasets from the machine learning repository of the University of California, Irvine (UCI). CVC is compared with other internal validation indices, namely the category utility (CU) index, the categorical data clustering with subjective factors (CDCS) index, and the information entropy (IE) index, and the quality of the internal evaluation is verified with the external validation index normalized mutual information (NMI). The experiments show that CVC evaluates clustering results more effectively than the other internal indices. In addition, unlike NMI, the CVC index requires no information from outside the dataset, which makes it more practical.

Keywords clustering analysis; internal clustering validation index; categorical data; high-dimensional data; similarity; dissimilarity
Classification number TP301
Received: 2018-04-18
Funding: National Natural Science Foundation of China (71271027)

ABSTRACT Clustering is a main task of data mining, and its purpose is to identify natural structures in a dataset. The results of cluster analysis are related not only to the nature of the data itself but also to some a priori conditions, such as clustering algorithms, similarity/dissimilarity, and parameters. For data without a clustering structure, clustering results need to be evaluated. For data with a clustering structure, different results obtained under different algorithms and parameters also need to be further optimized by clustering validation. Moreover, clustering validation is vital to clustering applications, especially when external information is not available. It is applied in algorithm selection, parameter determination, and determination of the number of clusters. Most traditional internal clustering validation indices for numerical data fail to measure categorical data. Categorical data is a popular data type, and its attribute values are discrete and cannot be ordered. For categorical data, the existing measures have their limitations in different application circumstances. In this paper, a new similarity based on the concentration ratio of every attribute value, called CONC, which can evaluate the similarity
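
The abstract uses normalized mutual information (NMI) as the external reference against which the internal indices are judged. As a rough illustration of what that reference computes, the sketch below scores a clustering against known class labels. The function names (`entropy`, `mutual_information`, `nmi`) are hypothetical, and the normalization by the geometric mean of the two entropies is an assumption: this excerpt does not state which NMI variant the authors used.

```python
from collections import Counter
from math import log, sqrt


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two labelings of the same objects."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    # Sum over label pairs that actually co-occur (zero cells contribute 0).
    return sum(
        (c / n) * log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def nmi(true_labels, cluster_labels):
    """NMI with geometric-mean normalization (an assumed variant)."""
    if len(true_labels) != len(cluster_labels):
        raise ValueError("both labelings must cover the same objects")
    h_true, h_clu = entropy(true_labels), entropy(cluster_labels)
    if h_true == 0.0 or h_clu == 0.0:
        # Degenerate case: at least one labeling has a single group.
        return 1.0 if h_true == h_clu else 0.0
    return mutual_information(true_labels, cluster_labels) / sqrt(h_true * h_clu)


# A clustering that matches the classes up to renaming scores about 1;
# one that is independent of the classes scores about 0.
same_up_to_renaming = nmi([0, 0, 1, 1], ["b", "b", "a", "a"])
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

Because `nmi` needs `true_labels`, it can only be computed when class labels are available, which is the practical limitation the abstract contrasts with an internal index such as CVC.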
