Resampling algorithm for imbalanced data based on their neighbor relationship
LI Rui-feng, LI Wen-hai, SUN Yan-li, WU Yang-yong

Chinese Journal of Engineering, Vol. 43, No. 6: 862-869, June 2021
https://doi.org/10.13374/j.issn2095-9389.2020.04.05.002; http://cje.ustb.edu.cn

Naval Aviation University, Yantai 264001, China
Corresponding author, E-mail: dongzhi1110@foxmail.com

Received: 2020-04-05
Funding: Supported by the military research project "Research on Key Testing Technologies for New-Generation Avionics Equipment" (4172122113R)

ABSTRACT The classification of imbalanced data has become a crucial research issue in many data-intensive applications, where the minority samples usually carry information that is important for data analysis. At present, two kinds of methods are used in machine learning and data mining to address data set imbalance: improving the algorithm and reconstructing the data set. Data set reconstruction, also known as resampling, modifies the proportion of each class in the training set without modifying the classification algorithm, and has been widely used. However, artificially adding or removing samples inevitably introduces noise and loses original data information, which reduces classification accuracy, so reasonable oversampling and undersampling algorithms are the core of the resampling method. To improve the classification accuracy on imbalanced data sets, a resampling algorithm based on the neighbor relationship in sample space is proposed. The method first evaluates a safety level for the minority samples according to their spatial neighbor relationship, and oversamples them with the synthetic minority oversampling technique (SMOTE) guided by that safety level. It then computes the local density of the majority samples from their spatial neighbor relationship and undersamples the majority samples in dense regions. Together, these two steps balance the data set and control its size to prevent overfitting, so that classification of the two classes is equalized. Training and test sets were generated by 5 × 10-fold cross-validation. After the training set was resampled, a kernel extreme learning machine (KELM) was trained as the classifier and verified on the test set. Experimental results on UCI imbalanced data sets and measured circuit fault diagnosis data show that the proposed method is superior overall to other resampling algorithms.

KEY WORDS imbalanced data; neighbor relationship; resampling; local density; classification
CLC number TP206.1
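The two resampling steps named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's formulas, so everything concrete here is an assumption made for illustration: the safety level is taken to be the fraction of minority points among a sample's k nearest neighbours, SMOTE seeds are drawn in proportion to that level, and local density is taken to be the inverse of the mean distance to the k nearest majority neighbours. The function names (knn_indices, safety_levels, guided_smote, density_undersample) and parameters are hypothetical and are not the authors' implementation.

```python
import math
import random


def knn_indices(points, i, k):
    """Indices of the k points nearest to points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]


def safety_levels(X, y, minority=1, k=5):
    """ASSUMED score: fraction of minority points among the k nearest
    neighbours (searched over both classes) of each minority sample."""
    return {
        i: sum(y[j] == minority for j in knn_indices(X, i, k)) / k
        for i, label in enumerate(y) if label == minority
    }


def guided_smote(X, y, n_new, minority=1, k=5, rng=None):
    """SMOTE-style interpolation with seeds drawn in proportion to the
    assumed safety level; level-0 (isolated) points are never seeds."""
    rng = rng or random.Random(0)
    levels = safety_levels(X, y, minority, k)
    seeds = [i for i, s in levels.items() if s > 0]
    if not seeds:
        return []
    weights = [levels[i] for i in seeds]
    min_idx = [i for i, label in enumerate(y) if label == minority]
    min_pts = [X[i] for i in min_idx]
    synthetic = []
    for _ in range(n_new):
        i = rng.choices(seeds, weights=weights)[0]
        # Interpolate toward one of the seed's nearest minority neighbours.
        pos = min_idx.index(i)
        nbrs = knn_indices(min_pts, pos, min(k, len(min_pts) - 1))
        other = min_pts[rng.choice(nbrs)]
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(X[i], other)])
    return synthetic


def density_undersample(X, y, n_remove, majority=0, k=5):
    """Drop the n_remove majority samples with the highest ASSUMED density
    (inverse mean distance to k nearest majority neighbours).
    Returns the indices of the samples that are kept."""
    maj_idx = [i for i, label in enumerate(y) if label == majority]
    if len(maj_idx) < 2:
        return list(range(len(X)))
    maj_pts = [X[i] for i in maj_idx]
    density = {}
    for pos, i in enumerate(maj_idx):
        nbrs = knn_indices(maj_pts, pos, k)
        mean_d = sum(math.dist(maj_pts[pos], maj_pts[j]) for j in nbrs) / len(nbrs)
        density[i] = 1.0 / (mean_d + 1e-12)
    ranked = sorted(maj_idx, key=lambda i: density[i], reverse=True)
    dropped = set(ranked[:n_remove])
    return [i for i in range(len(X)) if i not in dropped]


# Toy data: a tight majority grid, three sparse majority points,
# a small minority cluster, and one minority point buried in the grid.
X = [[0.1 * a, 0.1 * b] for a in range(5) for b in range(5)]
X += [[5.0, 0.0], [0.0, 5.0], [5.0, 5.0]]
X += [[10.0, 10.0], [10.1, 10.0], [10.0, 10.1], [10.1, 10.1], [0.25, 0.25]]
y = [0] * 28 + [1] * 5

synthetic = guided_smote(X, y, n_new=10, k=3)
kept = density_undersample(X, y, n_remove=10, k=3)
print(len(synthetic), "synthetic minority points;", len(kept), "original points kept")
```

In this toy setup the minority point buried in the majority grid gets safety level 0 and is never used as a SMOTE seed, while the sparse majority points survive undersampling because only the dense grid is thinned. The sketch ranks densities once and removes the top n_remove samples in a single pass; recomputing density after each removal would thin dense regions more evenly, at higher cost. As in the abstract, resampling of this kind would be applied to the training folds only, not to the test data.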
