《工程科学学报》：基于近邻的不均衡数据聚类算法

针对经典K–means算法对不均衡数据进行聚类时产生的“均匀效应”问题，提出一种基于近邻的不均衡数据聚类算法（Clustering algorithm for imbalanced data based on nearest neighbor，CABON）。CABON算法首先对数据对象进行初始聚类，通过定义的类别待定集来确定初始聚类结果中类别归属有待进一步核定的数据对象集合；并给出一种类别待定集的动态调整机制，利用近邻思想实现此集合中数据对象所属类别的重新划分，按照从集合边缘到中心的顺序将类别待定集中的数据对象依次归入其最近邻居所在的类别中，得到最终的聚类结果，以避免“均匀效应”对聚类结果的影响。将该算法与K–means、多中心的非平衡K_均值聚类方法（Imbalanced K–means clustering method with multiple centers，MC_IK）和非均匀数据的变异系数聚类算法（Coefficient of variation clustering for non-uniform data，CVCN）在人工数据集和真实数据集上分别进行实验对比，结果表明CABON算法能够有效消减K–means算法对不均衡数据聚类时所产生的“均匀效应”，聚类效果明显优于K–means、MC_IK和CVCN算法。

团购合买资源类别：文库，文档格式：PDF，文档页数：11，文件大小：1.37MB

工程科学学报.第42卷，第9期：1209-1219.2020年9月 Chinese Journal of Engineering,Vol.42,No.9:1209-1219,September 2020 https://doi.org/10.13374/j.issn2095-9389.2019.10.09.003;http://cje.ustb.edu.cn 基于近邻的不均衡数据聚类算法武森，汪玉枝，高晓楠⑧ 北京科技大学经济管理学院.北京100083 ☒通信作者，E-mail:gaoxiaonan(0001@163.com 摘要针对经典K-meas算法对不均衡数据进行聚类时产生的“均匀效应”问题，提出一种基于近邻的不均衡数据聚类算法(Clustering algorithm for imbalanced data based on nearest neighbor,.CABON).CABON算法首先对数据对象进行初始聚类，通过定义的类别待定集来确定初始聚类结果中类别归属有待进一步核定的数据对象集合：并给出一种类别待定集的动态调整机制.利用近邻思想实现此集合中数据对象所属类别的重新划分，按照从集合边缘到中心的顺序将类别待定集中的数据对象依次归入其最近邻居所在的类别中，得到最终的聚类结果.以避免“均匀效应”对聚类结果的影响.将该算法与K-meas、多中心的非平衡K均值聚类方法(Imbalanced K-means clustering method with multiple centers,MCIK)和非均匀数据的变异系数聚类算法(Coefficient of variation clustering for non--uniform data,CVCN)在人工数据集和真实数据集上分别进行实验对比，结果表明CABON算法能够有效消减K-means算法对不均衡数据聚类时所产生的“均匀效应”，聚类效果明显优于 K-means、MC_IK和CVCN算法关键词K-means;均匀效应：类别待定集：近邻：基于近邻的不均衡数据聚类算法分类号TP311.13 Clustering algorithm for imbalanced data based on nearest neighbor WU Sen,WANG Yu-zhi,GAO Xiao-nan School of Economics and Management,University of Science and Technology Beijing.Beijing 100083.China Corresponding author,E-mail:gaoxiaonan0001@163.com ABSTRACT Clustering is an important task in the field of data mining.Most clustering algorithms can effectively deal with the clustering problems of balanced datasets,but their processing ability is weak for imbalanced datasets.For example,K-means,a classical partition clustering algorithm,tends to produce a"uniform effect"when dealing with imbalanced datasets,i.e.,the K-means algorithm often produces clusters that are relatively uniform in size when clustering unbalanced datasets with the data objects in small clusters "swallowing"the part of the data objects in large clusters.This means that the number and density of the data objects in different clusters tend to be the same.To solve the problem of "uniform effect"generated by the classical K-means algorithm in the clustering of imbalanced data,a clustering algorithm based on nearest neighbor (CABON)is proposed for imbalanced data.Firstly,the initial clustering of data objects is performed to obtain the undetermined-cluster set,which is defined as a set that consists of the data objects that must be checked further regarding the clusters in which they belong.Then,from the edge to the center of the set,the nearest- neighbor method is used to reassign the data objects in the undetermined-cluster set to the clusters of their nearest neighbors.Meanwhile the undetermined-cluster set is dynamically adjusted,to obtain the final clustering result,which prevents the influence of the "uniform effect"on the clustering result.The clustering results of the proposed algorithm is compared with that of K-means,the imbalanced K-means clustering method with multiple centers(MC_IK),and the coefficient of variation clustering for non-uniform data(CVCN)on synthetic and real datasets.The experimental results reveal that the CABON algorithm effectively reduces "uniform effect"generated by 收稿日期：2019-10-09 基金项目：国家自然科学基金资助项目(71971025,71271027)

基于近邻的不均衡数据聚类算法武森，汪玉枝，高晓楠苣北京科技大学经济管理学院，北京 100083 苣通信作者，E-mail: gaoxiaonan0001@163.com 摘要针对经典 K–means 算法对不均衡数据进行聚类时产生的“均匀效应”问题，提出一种基于近邻的不均衡数据聚类算法（Clustering algorithm for imbalanced data based on nearest neighbor，CABON）. CABON 算法首先对数据对象进行初始聚类，通过定义的类别待定集来确定初始聚类结果中类别归属有待进一步核定的数据对象集合；并给出一种类别待定集的动态调整机制，利用近邻思想实现此集合中数据对象所属类别的重新划分，按照从集合边缘到中心的顺序将类别待定集中的数据对象依次归入其最近邻居所在的类别中，得到最终的聚类结果，以避免“均匀效应”对聚类结果的影响. 将该算法与 K–means、多中心的非平衡 K_均值聚类方法（Imbalanced K–means clustering method with multiple centers，MC_IK）和非均匀数据的变异系数聚类算法（Coefficient of variation clustering for non-uniform data，CVCN）在人工数据集和真实数据集上分别进行实验对比，结果表明 CABON 算法能够有效消减 K –means 算法对不均衡数据聚类时所产生的“均匀效应” ，聚类效果明显优于 K–means、MC_IK 和 CVCN 算法. 关键词 K–means；均匀效应；类别待定集；近邻；基于近邻的不均衡数据聚类算法分类号 TP311.13 Clustering algorithm for imbalanced data based on nearest neighbor WU Sen，WANG Yu-zhi，GAO Xiao-nan苣 School of Economics and Management, University of Science and Technology Beijing, Beijing 100083, China 苣 Corresponding author, E-mail: gaoxiaonan0001@163.com ABSTRACT Clustering is an important task in the field of data mining. Most clustering algorithms can effectively deal with the clustering problems of balanced datasets, but their processing ability is weak for imbalanced datasets. For example, K–means, a classical partition clustering algorithm, tends to produce a “uniform effect” when dealing with imbalanced datasets, i.e., the K–means algorithm often produces clusters that are relatively uniform in size when clustering unbalanced datasets with the data objects in small clusters “swallowing” the part of the data objects in large clusters. This means that the number and density of the data objects in different clusters tend to be the same. To solve the problem of “uniform effect” generated by the classical K–means algorithm in the clustering of imbalanced data, a clustering algorithm based on nearest neighbor (CABON) is proposed for imbalanced data. Firstly, the initial clustering of data objects is performed to obtain the undetermined-cluster set, which is defined as a set that consists of the data objects that must be checked further regarding the clusters in which they belong. Then, from the edge to the center of the set, the nearestneighbor method is used to reassign the data objects in the undetermined-cluster set to the clusters of their nearest neighbors. Meanwhile the undetermined-cluster set is dynamically adjusted, to obtain the final clustering result, which prevents the influence of the “uniform effect” on the clustering result. The clustering results of the proposed algorithm is compared with that of K –means, the imbalanced K–means clustering method with multiple centers (MC_IK), and the coefficient of variation clustering for non-uniform data (CVCN) on synthetic and real datasets. The experimental results reveal that the CABON algorithm effectively reduces “uniform effect” generated by 收稿日期: 2019−10−09 基金项目: 国家自然科学基金资助项目（71971025, 71271027）工程科学学报，第 42 卷，第 9 期：1209−1219，2020 年 9 月 Chinese Journal of Engineering, Vol. 42, No. 9: 1209−1219, September 2020 https://doi.org/10.13374/j.issn2095-9389.2019.10.09.003; http://cje.ustb.edu.cn

·1210 工程科学学报，第42卷，第9期 the K-means algorithm on imbalanced data,and its clustering result is superior to that of the K-means,MC IK,and CVCN algorithms. KEY WORDS K-means;uniform effect;undetermined-cluster set;neighbor:clustering algorithm for imbalanced data based on nearest neighbor(CABON) 聚类分析是数据挖掘领域最为常见的技术之一，法的有效性降低.第三类方法是优化目标函数的用于发现数据集中未知的对象类四聚类分析在客方法，此类方法从目标函数优化的角度提出新的户细分)、模式识别阗、医疗决策、异常检测阿等算法，通过推导出相应的聚类优化目标函数，以解诸多领域有着广泛的应用前景.传统的聚类算法决“均匀效应”问题，如杨天鹏通过优化数据离散能够很好地处理均衡数据的聚类问题，但是现实程度进而优化目标函数，提出了一种基于变异系生活中存在许多不均衡数据，例如医疗诊断阿、故数的聚类算法(CVCN)阿这类方法从目标函数直障诊断)等领域的数据中表现正常的数据量要远接切入，相比于之前的聚类算法是一种较为直接远大于表现异常的数据量，这类不均衡数据集的的新方法且有一定的实用性，但是此类方法一般特点是同一数据集中归属于某一类别的数据对象涉及目标函数参数的求解，属于非线性函数优化的数量和密度与其他类别数据对象的数量和密度问题，难以得到全局最优解，这决定了该类算法的有较大差异，通常数据对象数量较多的类称之为聚类结果具有相对较大的随机性，影响算法的聚大类，数据对象数量较少的类称之为小类.对于不类精度均衡数据的聚类问题，传统的聚类算法处理能力针对上述问题，本文借助近邻思想，提出基于较弱.以著名的K-means算法为例，K-means算近邻的不均衡数据聚类算法(CABON).首先运用法是经典的划分式聚类算法，在处理不均衡数据 K-means算法对数据对象进行初始聚类，针对聚集时，容易出现“均匀效应”图，即聚类时往往会产类结果中部分数据对象类别可能划分错误的问生相对均匀尺寸的类，小类中的数据对象会“吞题，从数据对象与其最近的两个类中心距离差值掉”大类中的部分数据对象，造成聚类结果中不同的计算出发，给出了类别待定集的定义及构造规类的数据对象数量和密度趋于一致. 则，用以确定初始聚类结果中类别归属有待进一为解决不均衡数据的聚类问题，研究者从不步核定的数据对象集合.其次针对类别待定集中同角度提出了多种方法，大致可以分成以下三类：数据对象所属类别的重新划分问题，提出了一种 (1)数据预处理21：(2)多中心点-：(3)优化目新的类的划分规则，通过查找类别待定集中每个标函数6第一类方法是数据预处理，此类方法数据对象的最近邻居，借助近邻思想，将类别待定对数据集进行欠采样和过采样处理后再进行聚集的数据对象按照从集合边缘到中心的顺序依次类，但是欠采样方法仅仅采用了属于大类中的一归入其最近邻居所在的类别中，能够将初次聚类部分具有代表性的子集，导致大类中大量的有效结果中类别错分的数据对象校正回正确的类，并信息被忽略，影响了聚类效果；过采样方法通过且定义了一种类别待定集的动态调整机制，提高增加小类中对象数量来进行数据分析，使原有数数据对象类别划分的准确率，从而进一步消减K-means 据集达到均衡状态，但是这样做一方面可能导致算法的“均匀效应”，得到最终的聚类结果过拟合，另一方面也可能给数据集带来噪声.第二 1相关概念与规则类方法是多中心点的方法，此类方法基于多中心的角度解决K-means的“均匀效应”问题，其思想 1.1K-means算法的“均匀效应” 是用多个类中心代替单个类中心代表一个类，在 K-means算法是划分式聚类的经典算法，在某些情况下，借助该思想，K-means算法在迭代过 K-means算法迭代过程中，采用一个类中所有对程中根据距离“中心”最近的原则，能够让部分被象的平均值作为该类新的中心，然后根据距离“中错分到小类中的数据对象校正回大类中.如亓慧心”最近原则，重新确定所有对象的类别，这会导提出了一种多中心的不均衡K均值聚类方法致K-means算法在处理不均衡数据集时，出现部 (MCK),具有一定的有效性和可行性.但此类分数据对象与多个类中心距离相近的情况，从而方法对于一些大类分布极其不均匀的不均衡数据导致这部分的数据对象归属类别被错分.文献[8] 聚类问题，不能全面地反映数据分布特征，导致算介绍了K-means算法的“均匀效应”，即K-means

the K–means algorithm on imbalanced data, and its clustering result is superior to that of the K–means, MC_IK, and CVCN algorithms. KEY WORDS K –means； uniform effect； undetermined-cluster set； neighbor； clustering algorithm for imbalanced data based on nearest neighbor (CABON) 聚类分析是数据挖掘领域最为常见的技术之一，用于发现数据集中未知的对象类[1] . 聚类分析在客户细分[2]、模式识别[3]、医疗决策[4]、异常检测[5] 等诸多领域有着广泛的应用前景. 传统的聚类算法能够很好地处理均衡数据的聚类问题，但是现实生活中存在许多不均衡数据，例如医疗诊断[6]、故障诊断[7] 等领域的数据中表现正常的数据量要远远大于表现异常的数据量. 这类不均衡数据集的特点是同一数据集中归属于某一类别的数据对象的数量和密度与其他类别数据对象的数量和密度有较大差异，通常数据对象数量较多的类称之为大类，数据对象数量较少的类称之为小类. 对于不均衡数据的聚类问题，传统的聚类算法处理能力较弱. 以著名的 K–means 算法为例，K–means 算法是经典的划分式聚类算法，在处理不均衡数据集时，容易出现“均匀效应” [8] ，即聚类时往往会产生相对均匀尺寸的类，小类中的数据对象会“吞掉”大类中的部分数据对象，造成聚类结果中不同类的数据对象数量和密度趋于一致. 为解决不均衡数据的聚类问题，研究者从不同角度提出了多种方法，大致可以分成以下三类：（1）数据预处理[9−12] ；（2）多中心点[13−14] ；（3）优化目标函数[15−16] . 第一类方法是数据预处理，此类方法对数据集进行欠采样和过采样处理后再进行聚类，但是欠采样方法仅仅采用了属于大类中的一部分具有代表性的子集，导致大类中大量的有效信息被忽略，影响了聚类效果[17] ；过采样方法通过增加小类中对象数量来进行数据分析，使原有数据集达到均衡状态，但是这样做一方面可能导致过拟合，另一方面也可能给数据集带来噪声. 第二类方法是多中心点的方法，此类方法基于多中心的角度解决 K–means 的“均匀效应”问题，其思想是用多个类中心代替单个类中心代表一个类，在某些情况下，借助该思想，K–means 算法在迭代过程中根据距离“中心”最近的原则，能够让部分被错分到小类中的数据对象校正回大类中. 如亓慧提出了一种多中心的不均衡 K_均值聚类方法 (MC_IK)[14] ，具有一定的有效性和可行性. 但此类方法对于一些大类分布极其不均匀的不均衡数据聚类问题，不能全面地反映数据分布特征，导致算法的有效性降低. 第三类方法是优化目标函数的方法，此类方法从目标函数优化的角度提出新的算法，通过推导出相应的聚类优化目标函数，以解决“均匀效应”问题. 如杨天鹏通过优化数据离散程度进而优化目标函数，提出了一种基于变异系数的聚类算法（CVCN） [15] . 这类方法从目标函数直接切入，相比于之前的聚类算法是一种较为直接的新方法且有一定的实用性，但是此类方法一般涉及目标函数参数的求解，属于非线性函数优化问题，难以得到全局最优解，这决定了该类算法的聚类结果具有相对较大的随机性，影响算法的聚类精度. 针对上述问题，本文借助近邻思想，提出基于近邻的不均衡数据聚类算法（CABON）. 首先运用 K–means 算法对数据对象进行初始聚类，针对聚类结果中部分数据对象类别可能划分错误的问题，从数据对象与其最近的两个类中心距离差值的计算出发，给出了类别待定集的定义及构造规则，用以确定初始聚类结果中类别归属有待进一步核定的数据对象集合. 其次针对类别待定集中数据对象所属类别的重新划分问题，提出了一种新的类的划分规则，通过查找类别待定集中每个数据对象的最近邻居，借助近邻思想，将类别待定集的数据对象按照从集合边缘到中心的顺序依次归入其最近邻居所在的类别中，能够将初次聚类结果中类别错分的数据对象校正回正确的类，并且定义了一种类别待定集的动态调整机制，提高数据对象类别划分的准确率，从而进一步消减K–means 算法的“均匀效应”，得到最终的聚类结果. 1 相关概念与规则 1.1 K–means 算法的“均匀效应” K–means 算法是划分式聚类的经典算法，在 K–means 算法迭代过程中，采用一个类中所有对象的平均值作为该类新的中心，然后根据距离“中心”最近原则，重新确定所有对象的类别，这会导致 K–means 算法在处理不均衡数据集时，出现部分数据对象与多个类中心距离相近的情况，从而导致这部分的数据对象归属类别被错分. 文献 [8] 介绍了 K–means 算法的“均匀效应”，即 K–means · 1210 · 工程科学学报，第 42 卷，第 9 期

武森等：基于近邻的不均衡数据聚类算法 ·1211 习惯于将大类中的部分对象划分到小类中，从而据对象进行聚类，得到初始聚类结果.由于传统的使获得的类拥有相对均匀的尺度.如图1(a)所示， K-means算法不能很好地处理不均衡数据的聚类展示了一个由Python人工合成的二维不均衡数据问题，因此在初始聚类结果中会造成部分数据对象集的真实分布，两个类的数据对象的数量和密度在算法迭代过程中与多个类中心距离都较为相近，有较大差别：图1(b)为K-means算法对图1(a)中以至于无法准确确定其类别，出现类别错分的情况. 数据集的聚类结果，可以看到，聚类后小类“吞掉”了基于此，本文定义了一种构造规则来确定类别待定大类中的部分数据对象，使得两个类的数据对象的集，并进一步提出一种新的类的划分规则实现对数量和密度趋于相近，此为K-means算法所产生类别待定集中数据对象的重新划分，得到最终聚的“均匀效应”现象.从类中心的角度来说，因为单类结果，从而处理不均衡数据的“均匀效应”问题. 个类中心不能充分反映一个大类的特征，导致大类已有研究MCIK算法提出过一种模糊工作集中的部分数据对象被划分到小类中l]在K-means 的定义：对于任意一个数据对象，若与初次聚类结算法迭代过程中，当大类中心与小类中心的距离果中任意两类或两类以上的类中心距离差小于预越来越近，大类中的部分对象极易被错分到小类设阈值，则将此数据对象归入模糊工作集4但中，最终的聚类结果产生的“均匀效应”就越显著是，这种构造方法可能将普通数据对象划归到模 1.2类别待定集糊工作集中，存在模糊工作集构造不精准的缺点，类别待定集是数据集中所属类别有待进一步影响算法的聚类效果.如图2(a)所示，三角形、圆核定的数据对象构成的集合，对应地，数据集中所圈、正方形和四角星分别代表了四种类别，其类中属类别确定的数据对象构成的集合称之为类别确心分别是c1、c2、c3、c4,图2(b)是图2(a)中数据集定集.在CABON算法中，首先运用K-means对数的初次聚类结果，可以看到，Cluster1和Cluster 30 20 a (b) 15 15 10 10 5 5 0 A Cluster 1 Cluster 1 口Cluster2 Cluster 2 -5 0 10 15 10 0 5 10 15 Attribute I Attribute】图1二维不均衡数据集的真实分布和K-means的聚类结果.(a)数据集真实分布：(b)K-means的聚类结果 Fig.1 Real distributions of two-dimensional unbalanced data sets and clustering results of the K-means algorithm:(a)real distribution of datasets,(b)the clustering result of the K-means algorithm (a) b A 35 ◆ 009 % 25 0 3 20 15 A Cluster A Cluster 1 O Cluster 2 10 O Cluster 2 ■luster5 ▣Cluster3 5 ◇Cluster4 Centroids 8 品 51015202530354045505560 0 51015202530354045505560 Attribute 1 Attribute 1 因2MCK算法中模糊工作集构造规则存在缺陷示意图.(a)数据集真实分布：(b)MCIK对数据集初次聚类结果 Fig.2 Schematic diagram of the defects in the construction rules of the MC IK algorithm's fuzzy working set:(a)real distribution of data sets;(b)initial clustering result of the MC IK algorithm on datasets

习惯于将大类中的部分对象划分到小类中，从而使获得的类拥有相对均匀的尺度. 如图 1（a）所示，展示了一个由 Python 人工合成的二维不均衡数据集的真实分布，两个类的数据对象的数量和密度有较大差别；图 1（b）为 K–means 算法对图 1（a）中数据集的聚类结果，可以看到，聚类后小类“吞掉”了大类中的部分数据对象，使得两个类的数据对象的数量和密度趋于相近，此为 K–means 算法所产生的“均匀效应”现象. 从类中心的角度来说，因为单个类中心不能充分反映一个大类的特征，导致大类中的部分数据对象被划分到小类中[18] . 在 K–means 算法迭代过程中，当大类中心与小类中心的距离越来越近，大类中的部分对象极易被错分到小类中，最终的聚类结果产生的“均匀效应”就越显著. 1.2 类别待定集类别待定集是数据集中所属类别有待进一步核定的数据对象构成的集合，对应地，数据集中所属类别确定的数据对象构成的集合称之为类别确定集. 在 CABON 算法中，首先运用 K–means 对数据对象进行聚类，得到初始聚类结果. 由于传统的 K–means 算法不能很好地处理不均衡数据的聚类问题，因此在初始聚类结果中会造成部分数据对象在算法迭代过程中与多个类中心距离都较为相近，以至于无法准确确定其类别，出现类别错分的情况. 基于此，本文定义了一种构造规则来确定类别待定集，并进一步提出一种新的类的划分规则实现对类别待定集中数据对象的重新划分，得到最终聚类结果，从而处理不均衡数据的“均匀效应”问题. 已有研究 MC_IK 算法提出过一种模糊工作集的定义：对于任意一个数据对象，若与初次聚类结果中任意两类或两类以上的类中心距离差小于预设阈值，则将此数据对象归入模糊工作集[14] . 但是，这种构造方法可能将普通数据对象划归到模糊工作集中，存在模糊工作集构造不精准的缺点，影响算法的聚类效果. 如图 2（a）所示，三角形、圆圈、正方形和四角星分别代表了四种类别，其类中心分别是 c1、c2、c3、c4，图 2（b）是图 2（a）中数据集的初次聚类结果，可以看到，Cluster 1 和 Cluster 20 −10 15 −5 10 0 Attribute 1 (a) Attribute 2 5 5 0 10 −5 15 −10 −15 20 −10 15 −5 10 0 Attribute 1 (b) Attribute 2 5 5 0 10 −5 15 −10 −15 Cluster 1 Cluster 2 Cluster 1 Cluster 2 图 1 二维不均衡数据集的真实分布和 K–means 的聚类结果. （a）数据集真实分布；（b） K–means 的聚类结果 Fig.1 Real distributions of two-dimensional unbalanced data sets and clustering results of the K–means algorithm: (a) real distribution of datasets; (b) the clustering result of the K–means algorithm (a) Attribute 2 Attribute 1 35 0 30 5 25 10 20 15 15 20 10 25 5 30 0 35 40 45 50 55 60 (b) Attribute 2 Attribute 1 35 0 30 5 25 10 20 15 15 20 10 25 5 30 0 35 40 45 50 55 60 c1 c1 x1 c2 c2 c4 c4 c3 c3 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Centroids Cluster 1 Cluster 2 Cluster 3 Cluster 4 Centroids 图 2 MC_IK 算法中模糊工作集构造规则存在缺陷示意图. （a）数据集真实分布；（b）MC_IK 对数据集初次聚类结果 Fig.2 Schematic diagram of the defects in the construction rules of the MC_IK algorithm’s fuzzy working set: (a) real distribution of data sets; (b) initial clustering result of the MC_IK algorithm on datasets 武森等：基于近邻的不均衡数据聚类算法 · 1211 ·

1212 工程科学学报，第42卷.第9期 2之间发生了“均匀效应”.对于对象，根据 (1)查找类别待定集中每个数据对象在类别确定 MCK算法定义的模糊工作集的构造规则，其与集中的最近邻居，这些最近邻居数据对象构成一初次聚类结果中类中心c3、c4的距离差值最小，若个集合，称之为最近邻居集；(2)依据距离最近邻此距离差值小于预设阈值可将数据对象：归入模居距离最小原则，确定类别待定集的边界对象，并糊工作集.但实际上对象x1与初次聚类结果中最将边界对象归入其最近邻居所在类别中：(3)将已近的两个类中心为C1、c2,且距离c1、c2的距离差经重新划分过类别的边界对象从类别待定集中删值较大，这种情况下可以把对象x直接归入距离除，并添加到类别确定集中，动态调整类别待定最近的类别c2中，不属于模糊工作集.因此，集；(4)重复第(2)、(3)步，直到类别待定集为空 MCK算法定义的构造规则会导致部分对象被误在上述规则中可以看出，CABON算法在实现分到模糊工作集，影响算法的聚类效果过程中，类别待定集中的对象不是固定不变的，因因此，在MCK算法定义的模糊工作集的基为类别待定集固定的思路仍会导致数据对象类别础上，本文将数据集中所属类别有待进一步核定错分的情况，利用上述规则中类别待定集的动态的数据对象构成的集合定义为类别待定集，构造调整机制可以避免这一情况的发生.如图4所示，规则为：使用K-means算法对数据对象进行聚类三角形和圆圈分别代表初始聚类得到的两种类得到初始聚类结果后，对于任意一个数据对象，若别，虚线内对象代表已构造好的类别待定集.按照与初始聚类结果中最近的两个类中心距离差小于上述类别待定集中数据对象的划分规则，随着类预设阈值（即类别待定集构造阈值），则将此数据别待定集的动态调整，对象，应归入大类Cluster1 对象归入类别待定集.图3(a)为用K-means算法中.若类别待定集是固定的，即虚线内对象在算法对原始数据对象聚类得到的初始聚类结果，产生的迭代过程中不发生变化，CABON算法在计算类了“均匀效应”：图3(b)根据本文定义的构造规则别待定集中数据对象与类别确定集中数据对象之确定了类别待定集，虚线内数据对象距离最近的间的距离时，对象1与对象3的距离略大于对象两个类中心距离差小于预设阈值6，构成了类别待 1与对象的距离(dx,3)Pdx1,x2),且dx1, 定集.相比于MCIK算法中定义的模糊工作集， )为距离集合里的最小值，因此对象：的最近邻本文提出的类别待定集构造方法可以有效避免归居为小类中的对象2，其类别也将与对象x2一致属类别确定的数据对象被错分到类别模糊不确定如此一来，类别待定集中原本应归入大类中的对的集合里的情形，能够准确识别可能被错分的数象会被错分到小类中.故在此基础上，我们定义类据对象，为后续数据对象的重新划分打好基础别待定集和类别确定集是动态变化的：当类别待 1.3类别待定集中数据对象的划分规则定集中的边界对象类别确定后，将此边界对象从初始类别待定集构造完成后，为了将该集合类别待定集中别除并添加至类别确定集中，此对中数据对象准确地划分到正确的类中，CABON算象的类别确定后就有可能成为类别待定集其他数法定义了一种新的类的划分规则.规则描述如下：据对象的最近邻居.将类别待定集的每一数据对 △Cluster1 P 0 (a) A Cluster 1 (b) 35 O Cluster 2 35 O Cluster 2 ▣Cluster3 0xg马 ▣Cluster3 ◆Cluster4 ◆Cluster4 阳马 30 Centroids 30 Centroids 25 25 099 女 20 20 15 15 o89 0o80 △会△ 10 △△ 0 A△ 0 0 5 1015202530 3540 45 50 0 5 10 15202530 35404550 Attribute 1 Attribute 1 图3类别待定集的构造过程示意图.(a)K-means算法对数据集的初始聚类结果：(b)CABON算法构造类别待定集图示 Fig.3 Schematic of the construction process of the undetermined-cluster set:(a)initial clustering result of the K-means algorithm on data sets;(b) undetermined-cluster set constructed by the CABON algorithm

2 之间发生了 “ 均匀效应 ” . 对于对象 x1，根据 MC_IK 算法定义的模糊工作集的构造规则，其与初次聚类结果中类中心 c3、c4 的距离差值最小，若此距离差值小于预设阈值可将数据对象 x1 归入模糊工作集. 但实际上对象 x1 与初次聚类结果中最近的两个类中心为 c1、c2，且距离 c1、c2 的距离差值较大，这种情况下可以把对象 x1 直接归入距离最近的类别 c2 中，不属于模糊工作集 . 因此， MC_IK 算法定义的构造规则会导致部分对象被误分到模糊工作集，影响算法的聚类效果. 因此，在 MC_IK 算法定义的模糊工作集的基础上，本文将数据集中所属类别有待进一步核定的数据对象构成的集合定义为类别待定集，构造规则为：使用 K–means 算法对数据对象进行聚类得到初始聚类结果后，对于任意一个数据对象，若与初始聚类结果中最近的两个类中心距离差小于预设阈值 δ（即类别待定集构造阈值），则将此数据对象归入类别待定集. 图 3（a）为用 K–means 算法对原始数据对象聚类得到的初始聚类结果，产生了“均匀效应”；图 3（b）根据本文定义的构造规则确定了类别待定集，虚线内数据对象距离最近的两个类中心距离差小于预设阈值 δ，构成了类别待定集. 相比于 MC_IK 算法中定义的模糊工作集，本文提出的类别待定集构造方法可以有效避免归属类别确定的数据对象被错分到类别模糊不确定的集合里的情形，能够准确识别可能被错分的数据对象，为后续数据对象的重新划分打好基础. 1.3 类别待定集中数据对象的划分规则初始类别待定集构造完成后，为了将该集合中数据对象准确地划分到正确的类中，CABON 算法定义了一种新的类的划分规则. 规则描述如下：（1）查找类别待定集中每个数据对象在类别确定集中的最近邻居，这些最近邻居数据对象构成一个集合，称之为最近邻居集；（2）依据距离最近邻居距离最小原则，确定类别待定集的边界对象，并将边界对象归入其最近邻居所在类别中；（3）将已经重新划分过类别的边界对象从类别待定集中删除，并添加到类别确定集中，动态调整类别待定集；（4）重复第（2）、（3）步，直到类别待定集为空. 在上述规则中可以看出，CABON 算法在实现过程中，类别待定集中的对象不是固定不变的，因为类别待定集固定的思路仍会导致数据对象类别错分的情况，利用上述规则中类别待定集的动态调整机制可以避免这一情况的发生. 如图 4 所示，三角形和圆圈分别代表初始聚类得到的两种类别，虚线内对象代表已构造好的类别待定集. 按照上述类别待定集中数据对象的划分规则，随着类别待定集的动态调整，对象 x1 应归入大类 Cluster1 中. 若类别待定集是固定的，即虚线内对象在算法的迭代过程中不发生变化，CABON 算法在计算类别待定集中数据对象与类别确定集中数据对象之间的距离时，对象 x1 与对象 x3 的距离略大于对象 x1 与对象 x2 的距离 (d(x1 , x3 )>d(x1 , x2 ))，且 d(x1 , x2 ) 为距离集合里的最小值，因此对象 x1 的最近邻居为小类中的对象 x2，其类别也将与对象 x2 一致. 如此一来，类别待定集中原本应归入大类中的对象会被错分到小类中. 故在此基础上，我们定义类别待定集和类别确定集是动态变化的：当类别待定集中的边界对象类别确定后，将此边界对象从类别待定集中剔除并添加至类别确定集中，此对象的类别确定后就有可能成为类别待定集其他数据对象的最近邻居. 将类别待定集的每一数据对 (a) Attribute 2 Attribute 1 35 0 30 40 5 25 10 20 15 15 20 10 25 5 30 0 35 40 45 50 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Centroids (b) Attribute 2 Attribute 1 35 0 30 40 5 25 10 20 15 15 20 10 25 5 30 0 35 40 45 50 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Centroids 图 3 类别待定集的构造过程示意图. （a）K–means 算法对数据集的初始聚类结果；（b）CABON 算法构造类别待定集图示 Fig.3 Schematic of the construction process of the undetermined-cluster set: (a) initial clustering result of the K-means algorithm on data sets; (b) undetermined-cluster set constructed by the CABON algorithm · 1212 · 工程科学学报，第 42 卷，第 9 期

武森等：基于近邻的不均衡数据聚类算法 ·1213 30 O Cluster 1 有数据对象已实现了类别的重新划分，得到了最 25 △Cluster2 后的聚类结果 00 。 Centroids 2 CABON算法过程 0 15 CABON算法的输入数据包括数据集X={xI 10 88 △△ 聚类个数参数k、类别待定集界定阈值6，输出数 3 △A 据为数据集X的聚类标签Label..CABON算法具 X △ 体步骤如下： 10 152025 0 3540 Step 1:对数据集X用经典的K-means算法聚 Attribute 1 类，得到初始聚类结果；图4 CABON算法固定类别待定集的情况示意图 Step2:构造类别待定集.对于任意一个数据 Fig.4 Schematic of the CABON algorithm's fixed of the undetermined- 对象xi=1,2,,n),选择距离最近的两个类X,与X cluster set (s,仁1,2，，k,s≠)，且它们的类中心为c和c,若与象按照从集合边缘到中心的顺序依次归入其最近数据对象x的距离符合如下关系：邻居所在的类别中，实现类别待定集动态调整的 d(xi,cs)-d(xi.c)<6 过程，以避免上述类别待定集固定所导致的数据则将对象x归入类别待定集心={二m是类对象类别错分的后果别待定集中数据对象的数量)：数据集X中别除类图5展示了类别待定集中数据对象根据CABON 别待定集后的对象构成了类别确定集；算法所定义的类划分规则实现类划分的过程.首 Step3:查找最近邻居集.根据类别待定集先，图5(a)在图3(b)构造的类别待定集的基础上，中数据对象与类别确定集中数据对象的距离大确定了类别待定集中每一个数据对象在类别确定小，在类别确定集中选取距离最小的对象集合作集中的最近邻居；其次，根据二者距离的大小选择为x的最近邻居集X=(对应类别待定集距离最小的数据对象作为边界对象，即图5(a)中中数据对象x在类别确定集中的最近邻居)：虚线内有颜色填充的对象为此次迭代中的边界对 Step4:确定边界对象.查找到距离最近邻居象，而虚线外有相同颜色填充的对象为边界对象 x(p=1,2,,m)最小的数据对象xp,确定xp为类别对应的最近邻居：最后，根据近邻思想将边界对象待定集的边界对象，将此边界对象x归入其最归入其最近邻居所在的类中，由此确定边界对象近邻居x所在的类别中；同时将边界对象x从类别的类别，进而将边界对象由类别待定集划分到类待定集x中别除并添加到类别确定集中；别确定集中（图5(b)).图5展示了类别待定集中 Step5:重复Step3至Step4,直到类别待定集数据对象类别重新划分的一次迭代过程，经过多为空集、类别确定集为全集时停止次迭代后，类别待定集中的数据对象为空，表示所由CABON算法的具体实现步骤可知CABON 40 40 △Cluster (a) A Cluster (b) 35 OCluster 2 O Cluster 2 口Cluster3 35 ▣Cluster3 30 ◆Cluster4 阳B 品 ◆Cluster4 0始马 Centroids 30 Centroids 品 25 25 个 0 15 15 ǒ9 10 44 A△ △会△ 10 94 A△ △△ 5 A△ X△△ A△ △ 0 0 5 10 15202530 3540 4550 5 10 15202530 35 40 4550 Attribute 1 Attribute 1 图5类别待定集中数据对象类别重新划分过程示意图.(a)边界对象确定过程：(b)类别待定集调整过程 Fig.5 Schematic of the reclassification process of data objects in the undetermined-cluster set:(a)determination of boundary objects;(b)the adjustment process of undetermined-cluster set

象按照从集合边缘到中心的顺序依次归入其最近邻居所在的类别中，实现类别待定集动态调整的过程，以避免上述类别待定集固定所导致的数据对象类别错分的后果. 图 5 展示了类别待定集中数据对象根据 CABON 算法所定义的类划分规则实现类划分的过程. 首先，图 5（a）在图 3（b）构造的类别待定集的基础上，确定了类别待定集中每一个数据对象在类别确定集中的最近邻居；其次，根据二者距离的大小选择距离最小的数据对象作为边界对象，即图 5（a）中虚线内有颜色填充的对象为此次迭代中的边界对象，而虚线外有相同颜色填充的对象为边界对象对应的最近邻居；最后，根据近邻思想将边界对象归入其最近邻居所在的类中，由此确定边界对象的类别，进而将边界对象由类别待定集划分到类别确定集中（图 5（b））. 图 5 展示了类别待定集中数据对象类别重新划分的一次迭代过程，经过多次迭代后，类别待定集中的数据对象为空，表示所有数据对象已实现了类别的重新划分，得到了最后的聚类结果. 2 CABON 算法过程 X = {xi} n CABON 算法的输入数据包括数据集 i=1、聚类个数参数 k、类别待定集界定阈值 δ，输出数据为数据集 X 的聚类标签 Label. CABON 算法具体步骤如下： Step 1：对数据集 X 用经典的 K–means 算法聚类，得到初始聚类结果； xi Xs Xt cs ct xi Step 2：构造类别待定集. 对于任意一个数据对象 (i =1,2,…,n)，选择距离最近的两个类与 (s, t=1,2,…,k, s≠k)，且它们的类中心为和，若与数据对象的距离符合如下关系： |d(xi ,cs)−d(xi ,ct)| < δ xi X 0 = { x 0 i }m 则将对象归入类别待定集 i=1 (m 是类别待定集中数据对象的数量)；数据集 X 中剔除类别待定集后的对象构成了类别确定集； X 0 X 0 X 1 = { x 1 i }m i=1 x 1 i x 0 i Step 3：查找最近邻居集. 根据类别待定集中数据对象与类别确定集中数据对象的距离大小，在类别确定集中选取距离最小的对象集合作为的最近邻居集 ( 对应类别待定集中数据对象在类别确定集中的最近邻居)； x 1 p (p = 1,2,...,m) xp xp X 0 xp x 1 p xp X 0 Step 4：确定边界对象. 查找到距离最近邻居最小的数据对象，确定为类别待定集的边界对象，将此边界对象归入其最近邻居所在的类别中；同时将边界对象从类别待定集中剔除并添加到类别确定集中； X 0 Step 5：重复 Step 3 至 Step 4，直到类别待定集为空集、类别确定集为全集时停止. 由 CABON 算法的具体实现步骤可知 CABON Attribute 2 Attribute 1 0 30 5 25 10 20 15 15 20 10 25 5 30 0 35 40 x1 x2 x3 Cluster 1 Cluster 2 Centroids 图 4 CABON 算法固定类别待定集的情况示意图 Fig.4 Schematic of the CABON algorithm’s fixed of the undeterminedcluster set (a) Attribute 2 Attribute 1 35 0 30 40 5 25 10 20 15 15 20 10 25 5 30 0 35 40 45 50 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Centroids (b) Attribute 2 Attribute 1 35 0 30 40 5 25 10 20 15 15 20 10 25 5 30 0 35 40 45 50 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Centroids 图 5 类别待定集中数据对象类别重新划分过程示意图. （a）边界对象确定过程；（b）类别待定集调整过程 Fig.5 Schematic of the reclassification process of data objects in the undetermined-cluster set: (a) determination of boundary objects; (b) the adjustment process of undetermined-cluster set 武森等：基于近邻的不均衡数据聚类算法 · 1213 ·

.1214 工程科学学报，第42卷，第9期算法具有两个显著特点：(1)从计算数据对象与其上界均为1，指标值越接近1，聚类效果越好最近的两个类中心距离差的角度，提出了类别待设k为聚类个数，N为数据集中数据对象的个定集的构造方法，弥补已有研究构造模糊工作集数，P={P1,P2,P为数据集的实际类分布，C= 时部分对象被误分这一不足；(2)基于近邻思想， (C1,C2,Ck为数据集的聚类结果，=C:nP伪为类别待定集中对象重新定义一种新的类的划分 C,和P的共同部分的基数，则聚类精度(Accuracy) 规则.利用类别待定集对象的最近邻居作为确定的公式如(1)所示：其类别的参照依据，可以更精确地划分类别待定 I k 集对象所归属的类别，并且CABON算法中类别待 Accuracy N乙mx (1) 定集的动态调整机制大大降低了对象归属类别错分的概率，从而保证聚类结果的质量 F-measure的计算公式如(2)所示： F-measure= (1+B2).Recall.Precision 3实验分析 (2) B2.Recall+Precision 3.1实验数据其中：Recall和Precision分别表示数据集中真实聚为了验证CABON算法解决K-means“均匀效类结果与聚类结果相比的召回率和准确率；B表应”问题的有效性，实验分别在6个人工数据集和示Recall和Precision的相对重要性，通常设为1. 4个UCI(University of California Irvine)真实数据集 NMI的计算公式如(3)所示：上进行，详细信息见表1，编号1~6为人工数据集 ln(N)/mnj)》具体信息，真实分布如图6所示，其中编号1~3描述的是测试聚类常用的Flamel9、Aggregation2o、 NMI= (3) Jain2数据集，编号4~6描述的是Python合成的 3个数据集，分别为DS1、DS2、DS3.另外，表1编号7~10为真实数据集的具体信息，实验采用了其中：N为数据集的数据对象数量，n:为属于类的加州大学欧文分校提出的UCI标准数据库中的数据对象数量；n为属于类的数据对象数量：为 Wine2a、Newthyroid!、Ionospherel2和Heart这真实数据集中类与聚类结果类相一致的数据对 4个数据集. 象数量；R、K分别对应真实数据集和聚类结果类 3.2评价指标中的类个数为了评估CABON算法及对比算法的聚类效 RI的计算公式如(4)所示：果，本文引入了4个评价指标，分别是聚类精度 RI= a+b (Accuracy)2a、F值(F-measure)图、标准化互信息 a+b+h+8 (4) Normalized mutual information,NMI)5.27.281 其中：α表示被聚类结果和真实结果判定为同一类德指数(Rand index,RI)2930,4个评价指标的取值别的数据对(，x)的数目；b表示聚类结果和真实表1数据集参数信息 Table 1 Parameter information for the data set No Datasets Data Sources Distribution Dimension Class Instance Flame Synthesis 147:93 2 2 240 2 Aggregation Synthesis 272:170:127:105:45:35:34 2 7 788 3 Jain Synthesis 276:97 2 2 373 4 DSI Synthesis 1000:200 2 2 1200 5 DS2 Synthesis 2823:529 2 2 3352 6 DS3 Synthesis 1000:500:400:200:200 2 5 2300 7 Wine UCI 59:71:48 13 3 178 8 Newthyroid UCI 150:35:30 5 3 215 9 Ionosphere UCI 225:126 2 351 10 Heart UCI 150:120 13 2 270

算法具有两个显著特点：（1）从计算数据对象与其最近的两个类中心距离差的角度，提出了类别待定集的构造方法，弥补已有研究构造模糊工作集时部分对象被误分这一不足；（2）基于近邻思想，为类别待定集中对象重新定义一种新的类的划分规则. 利用类别待定集对象的最近邻居作为确定其类别的参照依据，可以更精确地划分类别待定集对象所归属的类别，并且 CABON 算法中类别待定集的动态调整机制大大降低了对象归属类别错分的概率，从而保证聚类结果的质量. 3 实验分析 3.1 实验数据为了验证 CABON 算法解决 K–means“均匀效应”问题的有效性，实验分别在 6 个人工数据集和 4 个 UCI（University of California Irvine）真实数据集上进行，详细信息见表 1，编号 1～6 为人工数据集具体信息，真实分布如图 6 所示，其中编号 1～3 描述的是测试聚类常用的 Flame[19]、 Aggregation[20]、 Jain[21] 数据集，编号 4～6 描述的是 Python 合成的 3 个数据集，分别为 DS1、DS2、DS3. 另外，表 1 编号 7～10 为真实数据集的具体信息，实验采用了加州大学欧文分校提出的 UCI 标准数据库中的 Wine[22]、 Newthyroid[23]、 Ionosphere[24] 和 Heart[25] 这 4 个数据集. 3.2 评价指标为了评估 CABON 算法及对比算法的聚类效果，本文引入了 4 个评价指标，分别是聚类精度（Accuracy） [26]、F 值（F-measure） [8]、标准化互信息（ Normalized mutual information， NMI） [15, 27, 28] 和兰德指数（Rand index，RI） [29−30] ，4 个评价指标的取值上界均为 1，指标值越接近 1，聚类效果越好. P = {P1, P2,...Pk} C = {C1,C2,...,Ck} ni j = |Ci ∩ Pj | Ci Pj 设 k 为聚类个数，N 为数据集中数据对象的个数，为数据集的实际类分布，为数据集的聚类结果，为和的共同部分的基数，则聚类精度（Accuracy）的公式如（1）所示： Accuracy = 1 N ∑ k i=1 k max j=1 ni j （1） F-measure 的计算公式如（2）所示： F−measure = (1+β 2 )·Recall·Precision β 2 ·Recall+Precision （2）其中：Recall 和 Precision 分别表示数据集中真实聚类结果与聚类结果相比的召回率和准确率；β 表示 Recall 和 Precision 的相对重要性，通常设为 1. NMI 的计算公式如（3）所示： NMI = ∑ R i=1 ∑ K j=1 ni j ln((N · ni j)/(ni · nj)) vut∑ R i=1 ni ln( ni N ) · ∑ K j=1 nj ln( nj N ) （3） ni i nj j ni j i j 其中：N 为数据集的数据对象数量，为属于类的数据对象数量；为属于类的数据对象数量；为真实数据集中类与聚类结果类相一致的数据对象数量；R、K 分别对应真实数据集和聚类结果类中的类个数. RI 的计算公式如（4）所示： RI = a+b a+b+h+g （4） (xi , x j) 其中：a 表示被聚类结果和真实结果判定为同一类别的数据对的数目；b 表示聚类结果和真实表 1 数据集参数信息 Table 1 Parameter information for the data set No Datasets Data Sources Distribution Dimension Class Instance 1 Flame Synthesis 147∶93 2 2 240 2 Aggregation Synthesis 272∶170∶127∶105∶45∶35∶34 2 7 788 3 Jain Synthesis 276∶97 2 2 373 4 DS1 Synthesis 1000∶200 2 2 1200 5 DS2 Synthesis 2823∶529 2 2 3352 6 DS3 Synthesis 1000∶500∶400∶200∶200 2 5 2300 7 Wine UCI 59∶71∶48 13 3 178 8 Newthyroid UCI 150∶35∶30 5 3 215 9 Ionosphere UCI 225∶126 34 2 351 10 Heart UCI 150∶120 13 2 270 · 1214 · 工程科学学报，第 42 卷，第 9 期

武森等：基于近邻的不均衡数据聚类算法 1215 28 688 30 26 Cluster 2 25 25 94 24 20 449 15 10 16 5 (a) (c) 1 0 0 0 5 10 15 0 510152025 303540 0 10 2030 4050 Attribute I Attribute 1 Attribute】 20 20 20 10 0 0 5 0 0 -10 -20 5 -20 -10 -30 (d) Cluster 2 ( -15 -30 (e) -40 10 0 10 15 -20 -10 0 10 20 15 -10 10 15 Attribute 1 Attribute 1 Attribute 1 图6人工数据集真实分布.(a)Flame:(b)Aggregation:(c)Jain:(d)DS1:(e)DS2:(f)DS3 Fig.6 True distribution of synthetic data sets:(a)Flame;(b)Aggregation;(c)Jain;(d)DS1;(e)DS2;(f)DS3 结果都将(，x)判定为不同类别的数据对的数目； 01为步长确定一个最优取值区间后，缩短步长为 h表示聚类结果认为同类，但真实结果认为不同类 0.01,在此区间内再次选择一个最优值作为此数据的数据对的数目；g表示聚类结果认为不同类，但集的类别待定集构造阈值真实结果认为同类的数据对的数目 3.4实验结果 3.3实验设计 (1)人工数据集实验结果实验将CABON算法与K-means、MC IK和表2~表5给出了CABON、K-means、MCIK和 CVCNUS算法进行对比.K-means算法是经典的 CVCN算法对人工数据集的聚类结果.从表2 聚类算法，作为对比算法可比较不同算法降低“均表5中可以看出，CABON算法在大多数情况下明显匀效应”影响的程度：MCIK和CVCN算法分别从优于其他三种算法，这在一定程度上说明CABON 多中心和优化目标函数的角度有效降低了不均衡算法能够消减K-means算法产生的“均匀效应” 数据聚类中“均匀效应”问题的影响.对比实验通对于Flame、Aggregation数据集，CABON的各个指过将CABON算法和这些算法进行对比，以衡量标值均优于对比算法，但是CABON在处理Jain数 CABON算法处理不均衡数据聚类问题降低“均匀据集时的F-measure、NMl值略低于CVCN算法，效应”影响的能力.在设定实验参数方面，参数其原因在于Jain为非球形不均衡数据集，CABON k是输入的聚类个数，可参考实验数据集类的数量对部分此类数据集中的数据对象聚类时无法准确设定.6为构造类别待定集事先设定的参数，取值表2人工数据集不同算法聚类结果(Accuracy指标) 范围根据数据集的大小而确定.在本实验中，若数 Table 2 Clustering results of artificial data set with different 据集较小，阈值6的取值范围设定为0.5~1.5，以 algorithms(Accuracy indicators) 0.5为初始值、0.1为步长确定一个最优取值区间， Index Datasets CABON K-means MC IK CVCN 表现为阈值δ在此区间内时聚类评价指标值最 Flame 0.985 0.8292 0.82420.8583 高；最优取值区间确定后，缩短步长为0.01，在此 Aggregation 0.9505 0.9156 0.93640.9010 区间内选择一个最优值作为此数据集的类别待定 Jain 0.8801 0.7819 0.7943 0.8729 Accuracy 集构造阈值，表现为阈值δ取到此参数值时聚类 DSI 0.9817 0.8628 0.8724 0.866 评价指标值最高.同样地，若数据集较大，阈值 DS2 0.9852 0.8422 0.8640 0.8765 DS3 0.9917 0.9032 0.92120.6687 δ的取值范围设定为1.0~2.0，以1.0为初始值

(xi 结果都将 , x j) 判定为不同类别的数据对的数目； h 表示聚类结果认为同类，但真实结果认为不同类的数据对的数目；g 表示聚类结果认为不同类，但真实结果认为同类的数据对的数目. 3.3 实验设计实验将 CABON 算法与 K–means、MC_IK[14] 和 CVCN[15] 算法进行对比. K–means 算法是经典的聚类算法，作为对比算法可比较不同算法降低“均匀效应”影响的程度；MC_IK 和 CVCN 算法分别从多中心和优化目标函数的角度有效降低了不均衡数据聚类中“均匀效应”问题的影响. 对比实验通过将 CABON 算法和这些算法进行对比，以衡量 CABON 算法处理不均衡数据聚类问题降低“均匀效应”影响的能力. 在设定实验参数方面，参数 k 是输入的聚类个数，可参考实验数据集类的数量设定. δ 为构造类别待定集事先设定的参数，取值范围根据数据集的大小而确定. 在本实验中，若数据集较小，阈值 δ 的取值范围设定为 0.5～1.5，以 0.5 为初始值、0.1 为步长确定一个最优取值区间，表现为阈值 δ 在此区间内时聚类评价指标值最高；最优取值区间确定后，缩短步长为 0.01，在此区间内选择一个最优值作为此数据集的类别待定集构造阈值，表现为阈值 δ 取到此参数值时聚类评价指标值最高. 同样地，若数据集较大，阈值 δ 的取值范围设定为 1.0～2.0，以 1.0 为初始值、 0.1 为步长确定一个最优取值区间后，缩短步长为 0.01，在此区间内再次选择一个最优值作为此数据集的类别待定集构造阈值. 3.4 实验结果（1）人工数据集实验结果. 表2～表5 给出了CABON、K–means、MC_IK 和 CVCN 算法对人工数据集的聚类结果. 从表 2～表 5 中可以看出，CABON 算法在大多数情况下明显优于其他三种算法，这在一定程度上说明 CABON 算法能够消减 K–means 算法产生的“均匀效应”. 对于 Flame、Aggregation 数据集，CABON 的各个指标值均优于对比算法，但是 CABON 在处理 Jain 数据集时的 F-measure、NMI 值略低于 CVCN 算法，其原因在于 Jain 为非球形不均衡数据集，CABON 对部分此类数据集中的数据对象聚类时无法准确 28 0 26 24 Attribute 1 (a) Attribute 2 22 5 20 10 18 15 16 14 20 −10 15 −5 10 0 Attribute 1 (d) Attribute 2 5 5 0 10 −5 15 −10 −15 20 −20 −10 10 0 Attribute 1 (e) Attribute 2 0 10 −10 20 −20 −30 20 −15 −10 −5 10 0 Attribute 1 (f) Attribute 2 0 5 −10 10 −20 15 −30 −40 Cluster 1 Cluster 2 Cluster 1 Cluster 2 (c) Attribute 2 Attribute 1 0 30 25 10 20 15 20 10 5 30 0 40 50 (b) Attribute 2 Attribute 1 0 30 5 25 10 20 15 15 20 10 25 5 30 0 35 40 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 1 Cluster 2 Cluster 1 Cluster 2 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 图 6 人工数据集真实分布. （a） Flame；（b） Aggregation；（c） Jain；（d） DS1；（e） DS2；（f） DS3 Fig.6 True distribution of synthetic data sets: (a) Flame; (b) Aggregation; (c) Jain; (d) DS1; (e) DS2; (f) DS3 表 2 人工数据集不同算法聚类结果（Accuracy 指标） Table 2 Clustering results of artificial data set with different algorithms (Accuracy indicators) Index Datasets CABON K–means MC_IK CVCN Accuracy Flame 0.985 0.8292 0.8242 0.8583 Aggregation 0.9505 0.9156 0.9364 0.9010 Jain 0.8801 0.7819 0.7943 0.8729 DS1 0.9817 0.8628 0.8724 0.866 DS2 0.9852 0.8422 0.8640 0.8765 DS3 0.9917 0.9032 0.9212 0.6687 武森等：基于近邻的不均衡数据聚类算法 · 1215 ·

1216 工程科学学报，第42卷，第9期表3人工数据集不同算法聚类结果(F-measure指标) 表5人工数据集不同算法聚类结果(RI指标) Table 3 Clustering results of artificial data set with different Table 5 Clustering results of artificial data set with different algorithms(F-measure indicators) algorithms(RI indicators) Index Datasets CABON K-means MC IK CVCN Index Datasets CABON K-means MC IK CVCN Flame 0.9748 0.8313 0.8264 0.859 Flame 0.945 0.7155 0.7091 0.7558 Aggregation 0.9148 0.8498 0.8504 0.8858 Aggregation 0.9473 0.9192 0.9259 0.9301 Jain 0.862 0.7953 0.8069 0.876 Jain 0.7832 0.6581 0.6724 0.7795 F-measure 0.982 RI DSI 0.8768 0.8864 0.9074 DSI 0.9641 0.7641 0.7776 0.7693 DS2 0.9851 0.8574 0.8785 0.8889 DS2 0.9426 0.7292 0.7649 0.7834 DS3 0.9918 0.9061 0.9247 0.6371 DS3 0.9913 0.9065 0.9278 0.7383 表4人工数据集不同算法聚类结果(NM指标) 异，CABON处理这两个数据集时取得了较好的聚 Table 4 Clustering results of artificial data set with different 类效果；对于DS3数据集，K-means算法对其聚类 algorithms (NMI indicators) 时在多个类之间产生了“均匀效应”，CABON算法 Index Datasets CABON K-means MC IK CVCN 的聚类精度、F-measure、.RI指标值均接近于l,验 Flame 0.842 0.3543 0.3461 0.407 证了该算法处理此类问题的有效性，同时， Aggregation 0.8635 0.8134 0.8271 0.8129 CVCN算法在DS3数据集上的各个指标值均低于 Jain 0.3561 0.3332 0.3661 0.4272 K-means算法，这在一定程度上说明CVCN算法 NMI DS1 0.816 0.4696 0.4025 0.3105 在处理多类非均匀数据聚类问题上处理能力较 DS2 0.5301 0.3430 0.3850 0.4092 弱.另外，MCIK算法在DS1、DS2、DS3数据集上 DS3 0.9651 0.8089 0.9123 0.5014 的实验表现较K-means而言有不同程度的提升，表明MC_IK算法对K-means在DSl、DS2、DS3上构建类别待定集，从而影响最后的聚类结果产生的“均匀效应”有消减作用对于3个合成的数据集(DS1、DS2、DS3), 图7~图l2是CABON、K-means、MCIK和 CABON算法的聚类精度均优于对比算法.DSI和 CVCN算法对人工数据聚类结果的图示，这与表2~ DS2数据集在数据对象的数量和密度上有较大差表5中的数据基本吻合.综上，人工数据集的实验 28 28 28 28 (a) (b) (d) 26 (c) 26 26 26 24 N24 24 24 3 22 22 22 之 20 20 18 18 16 ster 16 Cluster 6 1 0 10 15 0 10 15 10 15 5 10 5 Attribute 1 Attribute 1 Attribute 1 Attribute I 图7 Flame数据集不同算法聚类结果图示.(a)CABON:(b)K- means:(c)MC IK:(d)CVCN Fig.7 Graphical representations of clustering results with different algorithms on Flame data sets:(a)CABON;(b)K-means;(c)MC IK;(d)CVCN 30 (a) 30 c 30 dd) 20 20 20 20 15 15 10 10 15 15 15 15 0 0 89 89 10 20 0 40 10 20 30 40 10 20 30 40 1020 30 40 Attribute 1 Attribute Attribute 1 图8 Aggregation数据集不同算法聚类结果图示.(a)CABON:(b)K-means:(c)MCIK:(d)CVCN Fig.8 Graphical representations of clustering results with different algorithms on Aggregation data sets: (a)CABON;(b)K-means;(c)MC IK;(d) CVCN

构建类别待定集，从而影响最后的聚类结果. 对于 3 个合成的数据集（ DS1、 DS2、 DS3）， CABON 算法的聚类精度均优于对比算法. DS1 和 DS2 数据集在数据对象的数量和密度上有较大差异，CABON 处理这两个数据集时取得了较好的聚类效果；对于 DS3 数据集，K–means 算法对其聚类时在多个类之间产生了“均匀效应”，CABON 算法的聚类精度、F-measure、RI 指标值均接近于 1，验证了该算法处理此类问题的有效性 . 同时， CVCN 算法在 DS3 数据集上的各个指标值均低于 K–means 算法，这在一定程度上说明 CVCN 算法在处理多类非均匀数据聚类问题上处理能力较弱. 另外，MC_IK 算法在 DS1、DS2、DS3 数据集上的实验表现较 K–means 而言有不同程度的提升，表明 MC_IK 算法对 K–means 在 DS1、DS2、DS3 上产生的“均匀效应”有消减作用. 图 7～图 12 是 CABON、K–means、MC_IK 和 CVCN 算法对人工数据聚类结果的图示，这与表 2～表 5 中的数据基本吻合. 综上，人工数据集的实验 Attribute 1 Attribute 2 28 0 26 5 24 10 22 15 (a) 20 18 16 14 Cluster 1 Cluster 2 Attribute 1 Attribute 2 28 0 26 5 24 10 22 15 (b) 20 18 16 14 Cluster 1 Cluster 2 Attribute 1 Attribute 2 28 0 26 5 24 10 22 15 (c) 20 18 16 14 Cluster 1 Cluster 2 Attribute 1 Attribute 2 28 0 26 5 24 10 22 15 (d) 20 18 16 14 Cluster 1 Cluster 2 图 7 Flame 数据集不同算法聚类结果图示. （a） CABON；（b） K–means；（c） MC_IK；（d） CVCN Fig.7 Graphical representations of clustering results with different algorithms on Flame data sets: (a) CABON; (b) K–means; (c) MC_IK; (d) CVCN Attribute 1 Attribute 2 30 0 10 25 20 30 20 40 (a) 15 10 15 0 Attribute 1 Attribute 2 30 0 10 25 20 30 20 40 (b) 15 10 15 0 Attribute 1 Attribute 2 30 0 10 25 20 30 20 40 (c) 15 10 15 0 Attribute 1 Attribute 2 30 0 10 25 20 30 20 40 (d) 15 10 15 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 图 8 Aggregation 数据集不同算法聚类结果图示. （a） CABON；（b） K–means；（c） MC_IK；（d） CVCN Fig.8 Graphical representations of clustering results with different algorithms on Aggregation data sets: (a) CABON; (b) K –means; (c) MC_IK; (d) CVCN 表 3 人工数据集不同算法聚类结果（F-measure 指标） Table 3 Clustering results of artificial data set with different algorithms (F-measure indicators) Index Datasets CABON K–means MC_IK CVCN F-measure Flame 0.9748 0.8313 0.8264 0.859 Aggregation 0.9148 0.8498 0.8504 0.8858 Jain 0.862 0.7953 0.8069 0.876 DS1 0.982 0.8768 0.8864 0.9074 DS2 0.9851 0.8574 0.8785 0.8889 DS3 0.9918 0.9061 0.9247 0.6371 表 4 人工数据集不同算法聚类结果（NMI 指标） Table 4 Clustering results of artificial data set with different algorithms (NMI indicators) Index Datasets CABON K–means MC_IK CVCN NMI Flame 0.842 0.3543 0.3461 0.407 Aggregation 0.8635 0.8134 0.8271 0.8129 Jain 0.3561 0.3332 0.3661 0.4272 DS1 0.816 0.4696 0.4025 0.3105 DS2 0.5301 0.3430 0.3850 0.4092 DS3 0.9651 0.8089 0.9123 0.5014 表 5 人工数据集不同算法聚类结果（RI 指标） Table 5 Clustering results of artificial data set with different algorithms (RI indicators) Index Datasets CABON K–means MC_IK CVCN RI Flame 0.945 0.7155 0.7091 0.7558 Aggregation 0.9473 0.9192 0.9259 0.9301 Jain 0.7832 0.6581 0.6724 0.7795 DS1 0.9641 0.7641 0.7776 0.7693 DS2 0.9426 0.7292 0.7649 0.7834 DS3 0.9913 0.9065 0.9278 0.7383 · 1216 · 工程科学学报，第 42 卷，第 9 期

武森等：基于近邻的不均衡数据聚类算法 .1217 30 30 (a) (b) (c) 30 (d) 25 25 20 20 20 15 15 15 10 15 15 00 102030 40 102030 40 102030 40 102030 40 Attribute 1 Attribute 1 Attribute 1 Attribute 1 图9 Jain数据集不同算法聚类结果图示.(a)CABON:(b)K-means;(c)MC_IK;(d)CVCN Fig.9 Graphical representations of clustering results with different algorithms on Jain data sets:(a)CABON;(b)K-means,(c)MC IK;(d)CVCN 2 20 15 (a) 15 (b) 20 15 (c) 20, 1 (d) 10 10 10 10 5 5 0 0 0 5 5 -10 -10 -10 -10 -15005,10 15 -005 1015 -1510-50,5, -15 1015 10-5051015 Attribute 1 Attribute 1 Attribute 1 Attribute I 图10 DS1数据集不同算法聚类结果图示.(a)CABON:(b)K-means:(c)MCIK:(d)CVCN Fig.10 Graphical representations of clustering results with different algorithms on DSI data sets:(a)CABON;(b)K-means;(c)MC IK;(d)CVCN 20 20 (b) 20 20 (c) 15 (a 15 15 (d) 10 10 0 10 -5 -5 -10 -10 10 -10 00 -15 -20 -20 -20 -20 -25 Cluster 2 -25 -25 15-10-505101520 -15-10-505101520 15-10-505101520 15-10-505101520 Attribute 1 Attribute I Attribute 1 Attribute 1 图11 DS2数据集不同算法聚类结果图示.(a)CABON:(b)K-means:(c)MCIK:(d)CVCN Fig.11 Graphical representations of clustering results with different algorithms on DS2 data sets:(a)CABON:(b)K-means;(c)MC IK;(d)CVCN 20 (a) 20 (b) 20 (c) 20 (d) 10 10 10 0 0 0 0 -10 -10 0 -10 20 汤 3 -30 -30 -30 -30 0 83 3 -4 -40 15-10-5051015 -15-10-505 1015 15-10-5051015 15-10-5051015 Attribute 1 Attribute 1 Attribute I Attribute 1 图12DS3数据集不同算法聚类结果图示.(a)CABON:(b)K-means;(c)MCIK;(d)CVCN Fig.12 Graphical representations of clustering results with different algorithms on DS3 data sets:(a)CABON;(b)K-means;(c)MC IK;(d)CVCN 结果表明CABON算法能够有效消减不均衡数据的NM值略低于MC_IK算法，但是其他三个指标聚类的“均匀效应” 值均高于对比算法；对于Newthyroid、Ionosphere (2)真实数据集实验结果数据集，CABON算法的各个指标值均优于对比算表6~表9给出了CABON、K-means、MC_IK和法；对于Heart数据集，CABON算法聚类精度和 CVCN算法分别对4个真实数据集进行聚类的结 F-measure值略低于CVCN算法，但是NMI值和果，从表6~表9中可以看出，对于UCI上的4个 RI值均高于对比算法.以上结果表明，CABON算真实数据集，CABON算法在大多数情况下明显优法对于UCI中的真实不均衡数据集也能有效解决于其他三种算法.对于Wine数据集而言，CABON算法 “均匀效应”问题，提高聚类效果

结果表明 CABON 算法能够有效消减不均衡数据聚类的“均匀效应”. （2）真实数据集实验结果. 表6～表9 给出了CABON、K–means、MC_IK 和 CVCN 算法分别对 4 个真实数据集进行聚类的结果，从表 6～表 9 中可以看出，对于 UCI 上的 4 个真实数据集，CABON 算法在大多数情况下明显优于其他三种算法. 对于Wine 数据集而言，CABON 算法的 NMI 值略低于 MC_IK 算法，但是其他三个指标值均高于对比算法；对于 Newthyroid、 Ionosphere 数据集，CABON 算法的各个指标值均优于对比算法；对于 Heart 数据集，CABON 算法聚类精度和 F-measure 值略低于 CVCN 算法，但是 NMI 值和 RI 值均高于对比算法. 以上结果表明，CABON 算法对于 UCI 中的真实不均衡数据集也能有效解决 “均匀效应”问题，提高聚类效果. Attribute 1 Attribute 2 20 −10 −5 10 15 0 5 5 10 15 (a) 0 −5 −10 −15 Cluster 1 Cluster 2 Attribute 1 Attribute 2 20 −10 −5 10 15 0 5 5 10 15 (b) 0 −5 −10 −15 Cluster 1 Cluster 2 Attribute 1 Attribute 2 20 −10 −5 10 15 0 5 5 10 15 (c) 0 −5 −10 −15 Cluster 1 Cluster 2 Attribute 1 Attribute 2 20 −10 −5 10 15 0 5 5 10 15 (d) 0 −5 −10 −15 Cluster 1 Cluster 2 图 10 DS1 数据集不同算法聚类结果图示. （a） CABON；（b） K–means；（c） MC_IK；（d） CVCN Fig.10 Graphical representations of clustering results with different algorithms on DS1 data sets: (a) CABON; (b) K–means; (c) MC_IK; (d) CVCN Attribute 1 Attribute 2 20 −15 −5 −10 10 15 0 5 5 10 15 20 (a) 0 −5 −10 −15 Cluster 1 Cluster 2 −20 −25 Attribute 1 Attribute 2 20 −15 −5 −10 10 15 0 5 5 10 15 20 (b) 0 −5 −10 −15 Cluster 1 Cluster 2 −20 −25 Attribute 1 Attribute 2 20 −15 −5 −10 10 15 0 5 5 10 15 20 (c) 0 −5 −10 −15 Cluster 1 Cluster 2 −20 −25 Attribute 1 Attribute 2 20 −15 −5 −10 10 15 0 5 5 10 15 20 (d) 0 −5 −10 −15 Cluster 1 Cluster 2 −20 −25 图 11 DS2 数据集不同算法聚类结果图示. （a） CABON；（b） K–means；（c） MC_IK；（d） CVCN Fig.11 Graphical representations of clustering results with different algorithms on DS2 data sets: (a) CABON; (b) K–means; (c) MC_IK; (d) CVCN Attribute 1 Attribute 2 20 −15 −10 −5 10 0 10 5 15 (a) 0 −10 −40 −20 −30 Attribute 1 Attribute 2 20 −15 −10 −5 10 0 10 5 15 (b) 0 −10 −40 −20 −30 Attribute 1 Attribute 2 20 −15 −10 −5 10 0 10 5 15 (c) 0 −10 −40 −20 −30 Attribute 1 Attribute 2 20 −15 −10 −5 10 0 10 5 15 (d) 0 −10 −40 −20 −30 Cluster 1 Cluster 3 Cluster 2 Cluster 4 Cluster 5 Cluster 1 Cluster 3 Cluster 2 Cluster 4 Cluster 5 Cluster 1 Cluster 3 Cluster 2 Cluster 4 Cluster 5 Cluster 1 Cluster 3 Cluster 2 Cluster 4 Cluster 5 图 12 DS3 数据集不同算法聚类结果图示. （a） CABON；（b） K–means；（c） MC_IK；（d） CVCN Fig.12 Graphical representations of clustering results with different algorithms on DS3 data sets: (a) CABON; (b) K–means; (c) MC_IK; (d) CVCN Attribute 1 Attribute 2 30 0 10 25 20 30 20 40 (a) 15 10 15 0 Attribute 1 Attribute 2 30 0 10 25 20 30 20 40 (b) 15 10 15 0 Attribute 1 Attribute 2 30 0 10 25 20 30 20 40 (c) 15 10 15 0 Attribute 1 Attribute 2 30 0 10 25 20 30 20 40 (d) 15 10 15 0 Cluster 1 Cluster 2 Cluster 1 Cluster 2 Cluster 1 Cluster 2 Cluster 1 Cluster 2 图 9 Jain 数据集不同算法聚类结果图示. （a） CABON；（b） K–means；（c） MC_IK；（d） CVCN Fig.9 Graphical representations of clustering results with different algorithms on Jain data sets: (a) CABON; (b) K–means; (c) MC_IK; (d) CVCN 武森等：基于近邻的不均衡数据聚类算法 · 1217 ·

1218 工程科学学报，第42卷，第9期表6UCI数据集不同算法聚类结果(Accuracy指标) 构造模糊工作集时部分数据对象类别被错分这一 Table 6 Clustering results of UCI dataset with different algorithms 问题，间接提高了算法的聚类质量. (Accuracy indicators) (2)针对不均衡数据的聚类问题，提出了基于 Index Datasets CABON K-means MC IK CVCN 近邻的不均衡数据聚类算法(CABON).CABON Wine 0.7542 0.6972 0.7032 0.7331 算法在对给定的数据集进行初始聚类后，针对初 Newthyroid 0.8884 0.8233 0.8512 0.8279 始聚类结果按照预设阈值δ构造类别待定集，基 Accurary Ionosphere 0.7322 0.7123 0.698 0.7193 于近邻思想将类别待定集中数据对象归入其最近 Heart 0.6037 0.5889 0.6011 0.6235 邻居所在的类别中，并动态更新类别待定集，得到最终的聚类结果表7UCI数据集不同算法聚类结果(F-measure指标) (3)人工数据集和UCI真实数据集的实验结 Table 7 Clustering results of UCI dataset with different algorithms 果表明，CABON算法能够有效消减K-means算 (F-measure indicators) 法所产生的“均匀效应”.与K-means以及专门针对 Index Datasets CABON K-means MC IK CVCN 不均衡数据聚类的MCIK和CVCN算法的实验 Wine 0.7559 0.6912 0.71480.7265 对比表明，CABON算法的聚类精度有明显提升 Newthyroid 0.8835 0.8175 0.8507 0.8251 F-measure 0.7317 0.7177 0.70210.6906 参考文献 Heart 0.6102 0.5853 0.59070.6305 [1]Wu S,Feng X D.Zhou W J.Spectral clustering of high- dimensional data exploiting sparse representation vectors. 表8UCI数据集不同算法聚类结果(NMI指标) Neurocomputing,2014,135:229 Table 8 Clustering results of UCI dataset with different algorithms [2] Wilson J,Chaudhury S,Lall B.Clustering short temporal (NMI indicators) behaviour sequences for customer segmentation using LDA. Index Datasets CABON K-means MC_IK CVCN Expert Syst,.2018,35(3):el2250 [3] Zhao L B,Shi G Y.A trajectory clustering method based on Wine 0.4201 0.4197 0.4298 0.3947 Douglas-Peucker compression and density for marine traffic Newthyroid 0.5308 0.3918 0.4249 0.4067 NMI pattern recognition.Ocean Eng,2019,172:456 Ionosphere 0.1421 0.1312 0.1008 0.1268 [4] Al-Shammari A,Zhou R,Naseriparsaa M,et al.An effective Heart 0.0213 0.0176 0.0222 0.0688 density-based clustering and dynamic maintenance framework for evolving medical data streams.Int J Med Inform,019,126:176 表9UCI数据集不同算法聚类结果(RI指标) [5] Hu Y,Li H,Chen M.Taxi abnormal trajectory detection based on Table 9 Clustering results of UCI dataset with different algorithms density clustering.Comput Modernization,2019(6):49 (RI indicators) (胡圆，李晖，陈梅.基于密度聚类的出租车异常轨迹检测.计算 Index Datasets CABON K-means MC IK CVCN 机与现代化，2019(6)：49) [6] Wine 0.7223 Han W H,Huang ZZ,Li S D,et al.Distribution-sensitive 0.7106 0.7128 0.7183 unbalanced data oversampling method for medical diagnosis.J Newthyroid 0.8199 0.7402 0.7867 0.7501 RI Med Syst,.2019,43(2):39 0.6067 0.5889 0.5772 0.5963 [7]Chen L T,Xu G H,Zhang Q,et al.Learning deep representation of Heart 0.5806 0.5144 0.5182 0.5330 imbalanced SCADA data for fault detection of wind turbines. Meas,2019,139:370 总体来说，本文提出的CABON算法明显优于「81 Xiong H,Wu JJ,Chen J.K-means clustering versus validation 其他三种对比算法.人工数据集和真实数据集的 measures:A data-distribution perspective.IEEE Trans Syst Man 实验结果，验证了CABON算法在解决不均衡数据 Cybern Part B(Cybern),2009,39(2):318 聚类的“均匀效应”问题和提高聚类效果上的可行 [91 Luo Z C,Jin S,Qiu X F.Spectral clustering based oversampling: 性和有效性 oversampling taking within class imbalance into consideration. Comput Eng Appl,2014,50(11):120 4结论 (骆自超，金隼，邱雪峰.考虑类内不平衡的谱聚类过抽样方法计算机工程与应用，2014,50(11)：120) (1)针对K-means算法的“均匀效应”问题， [10]Kumar N S,Rao K N,Govardhan A,et al.Undersampled 从计算数据对象与其最近的两个类中心距离差的 K-means approach for handling imbalanced distributed data.Prog 角度，给出了类别待定集的定义，解决了已有研究 Artif Intelligence,2014,3(1):29

总体来说，本文提出的 CABON 算法明显优于其他三种对比算法. 人工数据集和真实数据集的实验结果，验证了 CABON 算法在解决不均衡数据聚类的“均匀效应”问题和提高聚类效果上的可行性和有效性. 4 结论（1）针对 K–means 算法的“均匀效应”问题，从计算数据对象与其最近的两个类中心距离差的角度，给出了类别待定集的定义，解决了已有研究构造模糊工作集时部分数据对象类别被错分这一问题，间接提高了算法的聚类质量. （2）针对不均衡数据的聚类问题，提出了基于近邻的不均衡数据聚类算法（CABON）. CABON 算法在对给定的数据集进行初始聚类后，针对初始聚类结果按照预设阈值 δ 构造类别待定集，基于近邻思想将类别待定集中数据对象归入其最近邻居所在的类别中，并动态更新类别待定集，得到最终的聚类结果. （3）人工数据集和 UCI 真实数据集的实验结果表明，CABON 算法能够有效消减 K–means 算法所产生的“均匀效应”. 与 K–means 以及专门针对不均衡数据聚类的 MC_IK 和 CVCN 算法的实验对比表明，CABON 算法的聚类精度有明显提升. 参考文献 Wu S, Feng X D, Zhou W J. Spectral clustering of highdimensional data exploiting sparse representation vectors. Neurocomputing, 2014, 135: 229 [1] Wilson J, Chaudhury S, Lall B. Clustering short temporal behaviour sequences for customer segmentation using LDA. Expert Syst, 2018, 35（3）: e12250 [2] Zhao L B, Shi G Y. A trajectory clustering method based on Douglas-Peucker compression and density for marine traffic pattern recognition. Ocean Eng, 2019, 172: 456 [3] Al-Shammari A, Zhou R, Naseriparsaa M, et al. An effective density-based clustering and dynamic maintenance framework for evolving medical data streams. Int J Med Inform, 2019, 126: 176 [4] Hu Y, Li H, Chen M. Taxi abnormal trajectory detection based on density clustering. Comput Modernization, 2019（6）: 49 （胡圆, 李晖, 陈梅. 基于密度聚类的出租车异常轨迹检测. 计算机与现代化, 2019（6）：49） [5] Han W H, Huang Z Z, Li S D, et al. Distribution-sensitive unbalanced data oversampling method for medical diagnosis. J Med Syst, 2019, 43（2）: 39 [6] Chen L T, Xu G H, Zhang Q, et al. Learning deep representation of imbalanced SCADA data for fault detection of wind turbines. Meas, 2019, 139: 370 [7] Xiong H, Wu J J, Chen J. K –means clustering versus validation measures: A data-distribution perspective. IEEE Trans Syst Man Cybern Part B (Cybern), 2009, 39（2）: 318 [8] Luo Z C, Jin S, Qiu X F. Spectral clustering based oversampling: oversampling taking within class imbalance into consideration. Comput Eng Appl, 2014, 50（11）: 120 （骆自超, 金隼, 邱雪峰. 考虑类内不平衡的谱聚类过抽样方法. 计算机工程与应用, 2014, 50（11）：120） [9] Kumar N S, Rao K N, Govardhan A, et al. Undersampled K–means approach for handling imbalanced distributed data. Prog Artif Intelligence, 2014, 3（1）: 29 [10] 表 6 UCI 数据集不同算法聚类结果（Accuracy 指标） Table 6 Clustering results of UCI dataset with different algorithms (Accuracy indicators) Index Datasets CABON K–means MC_IK CVCN Accurary Wine 0.7542 0.6972 0.7032 0.7331 Newthyroid 0.8884 0.8233 0.8512 0.8279 Ionosphere 0.7322 0.7123 0.698 0.7193 Heart 0.6037 0.5889 0.6011 0.6235 表 7 UCI 数据集不同算法聚类结果（F-measure 指标） Table 7 Clustering results of UCI dataset with different algorithms (F-measure indicators) Index Datasets CABON K–means MC_IK CVCN F-measure Wine 0.7559 0.6912 0.7148 0.7265 Newthyroid 0.8835 0.8175 0.8507 0.8251 Ionosphere 0.7317 0.7177 0.7021 0.6906 Heart 0.6102 0.5853 0.5907 0.6305 表 8 UCI 数据集不同算法聚类结果（NMI 指标） Table 8 Clustering results of UCI dataset with different algorithms (NMI indicators) Index Datasets CABON K–means MC_IK CVCN NMI Wine 0.4201 0.4197 0.4298 0.3947 Newthyroid 0.5308 0.3918 0.4249 0.4067 Ionosphere 0.1421 0.1312 0.1008 0.1268 Heart 0.0213 0.0176 0.0222 0.0688 表 9 UCI 数据集不同算法聚类结果（RI 指标） Table 9 Clustering results of UCI dataset with different algorithms (RI indicators) Index Datasets CABON K–means MC_IK CVCN RI Wine 0.7223 0.7106 0.7128 0.7183 Newthyroid 0.8199 0.7402 0.7867 0.7501 Ionosphere 0.6067 0.5889 0.5772 0.5963 Heart 0.5806 0.5144 0.5182 0.5330 · 1218 · 工程科学学报，第 42 卷，第 9 期

点击进入文档下载页（PDF格式）

共11页，试读已结束，阅读完整版请下载

点击下载（PDF格式）

浏览记录