Chinese Journal of Engineering, Vol. 38, No. 7: 1017-1024, 2016

July 2016. DOI: 10.13374/j.issn2095-9389.2016.07.018; http://journals.ustb.edu.cn

HABOS clustering algorithm for categorical data

WU Sen, JIANG Dan-dan, WANG Qiang

Donlinks School of Economics and Management, University of Science and Technology Beijing, Beijing 100083, China
Corresponding author, E-mail: wusen@manage.ustb.edu.cn

ABSTRACT The clustering algorithm based on sparse feature vector for categorical attributes (CABOSFV_C) is an efficient clustering method for high-dimensional categorical data. It uses sparse feature dissimilarity (SFD) to calculate distance and a sparse feature vector to compress the data. However, the clustering quality of CABOSFV_C depends on the SFD upper limit parameter, and there is no clear guidance on how to choose that parameter. To address this sensitivity, this paper proposes a heuristic hierarchical clustering algorithm of categorical data based on sparse feature dissimilarity (HABOS). Under the constraint of an upper limit on the number of clusters, the algorithm applies agglomerative hierarchical clustering and uses a new internal clustering validation index based on sparse feature dissimilarity (CVISFD) as a heuristic measure, so that the clustering level is selected automatically. Three UCI benchmark data sets were used to compare the improved algorithm with the traditional ones. The experimental results show that HABOS effectively improves clustering accuracy and stability.

KEY WORDS data mining; clustering algorithms; categorical data; attributes

Classification number: TP311

Received: 2016-01-05
Funding: National Natural Science Foundation of China (71271027); Specialized Research Fund for the Doctoral Program of Higher Education (20120006110037)

Cluster analysis is an important part of data mining. It is the process of partitioning a set of physical or abstract objects into classes of similar objects, so that objects within the same class are highly similar to one another while objects in different classes are less similar [1]. Clustering algorithms can be applied to customer segmentation, outlier detection, pattern recognition, document classification, and other fields. Clustering algorithms for numerical data have been studied in depth, for example the K-means, BIRCH, DBSCAN, and PAM algorithms and their improved variants [2-4]. In addition, the real world contains a large amount of categorical data; clustering algorithms that handle categorical data include the CABOSFV_C algorithm [5], the K-modes algorithm [6], and the COBWEB algorithm [7].

The CABOSFV_C algorithm is a clustering method for high-dimensional categorical data; it applies sparse feature dissimilarity (sparse fea-
