Chinese Journal of Engineering, Vol. 39, No. 8: 1244-1253, August 2017

DOI: 10.13374/j.issn2095-9389.2017.08.015; http://journals.ustb.edu.cn

Imbalanced data ensemble classification based on cluster-based under-sampling algorithm

WU Sen*, LIU Lu, LU Dan

Donlinks School of Economics and Management, University of Science and Technology Beijing, Beijing 100083, China
* Corresponding author, E-mail: wusen@manage.ustb.edu.cn

ABSTRACT Most traditional classification algorithms assume that the data set is balanced and aim for high overall classification accuracy. Real data sets, however, are often imbalanced, so traditional classification algorithms tend to produce a high classification error rate on minority class samples. Existing methods for improving classification on imbalanced data fall into two main groups. The first works at the data level, using over-sampling to add minority class samples or under-sampling to remove majority class samples. The second works at the algorithm level, improving the classification algorithm itself. Building on the existing cluster-based under-sampling method and on ensemble learning, this paper combines the two to classify imbalanced data. First, in the data processing stage, cluster-based under-sampling is used to form a balanced data set. The AdaBoost ensemble algorithm is then trained on the new data set. During the ensemble process, weights are introduced to distinguish the contributions of minority class data and majority class data to the ensemble error rate. This makes the algorithm pay more attention to the minority class and improves the classification accuracy on minority class data.

KEY WORDS imbalanced data; under-sampling; clustering; classification; ensemble learning

Classification number: TP311

Received: 2016-12-30
Funding: National Natural Science Foundation of China (71271027); Specialized Research Fund for the Doctoral Program of Higher Education (20120006110037)

Classification is a common task in data mining. Classical classification algorithms achieve good accuracy on balanced data sets, whereas real-world data sets are often imbalanced. Research on classifying imbalanced data has become an important topic in machine learning, intelligent information systems, and related fields [1-4]. It focuses mainly on data-level improvements that use resampling and on algorithm-level research.
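
The two ideas named in the abstract, cluster-based under-sampling followed by a class-weighted error rate inside the boosting loop, can be illustrated with a small standard-library Python sketch. It is not the authors' implementation: the abstract does not say how clusters are sampled or how the class weights enter the error rate, so the proportional per-cluster quota, the multiplicative class weight, and all function names (kmeans, cluster_undersample, weighted_error, learner_weight) are assumptions made here for illustration. Only the learner-weight formula is the standard discrete AdaBoost expression.

```python
import math
import random
from collections import defaultdict


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on a list of equal-length tuples.

    Returns one cluster index per point.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, p in enumerate(points):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Update step: move each non-empty cluster's center to its mean.
        groups = defaultdict(list)
        for p, c in zip(points, assign):
            groups[c].append(p)
        for c, members in groups.items():
            centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign


def cluster_undersample(majority, n_keep, k=3, seed=0):
    """Keep at most n_keep majority points.

    Assumption: each cluster contributes in proportion to its size,
    so the retained subset roughly preserves the majority class layout.
    """
    rng = random.Random(seed)
    assign = kmeans(majority, k, seed=seed)
    groups = defaultdict(list)
    for p, c in zip(majority, assign):
        groups[c].append(p)
    kept = []
    for members in groups.values():
        quota = max(1, round(n_keep * len(members) / len(majority)))
        kept.extend(rng.sample(members, min(quota, len(members))))
    # Rounding can overshoot; trim so the result never exceeds n_keep.
    return kept[:n_keep]


def weighted_error(y_true, y_pred, sample_w, class_w):
    """Error rate where a mistake on sample i counts sample_w[i] * class_w[label].

    Assumption: class weights multiply the boosting sample weights.
    """
    total = sum(w * class_w[y] for w, y in zip(sample_w, y_true))
    wrong = sum(
        w * class_w[y]
        for w, y, y_hat in zip(sample_w, y_true, y_pred)
        if y != y_hat
    )
    return wrong / total


def learner_weight(err):
    """Standard discrete AdaBoost learner weight; requires 0 < err < 1."""
    return 0.5 * math.log((1 - err) / err)


# One mistake on a minority sample (label 1), uniform sample weights.
y_true = [1, 1, -1, -1]
y_pred = [-1, 1, -1, -1]
uniform = [0.25] * 4
print(weighted_error(y_true, y_pred, uniform, {1: 1.0, -1: 1.0}))  # 0.25
print(weighted_error(y_true, y_pred, uniform, {1: 2.0, -1: 1.0}))  # about 0.333
```

With the minority class weighted twice as heavily, the same single mistake raises the error rate from 0.25 to about 0.333. That lowers the weight the weak learner receives in the ensemble, which is one way a boosting procedure can be pushed to pay more attention to the minority class.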
