·262· CAAI Transactions on Intelligent Systems, Vol. 11

As Figure 1 shows, both the positive-class classification performance and the overall classification accuracy rise as the proportion of the data set used for training increases, but the gains become limited as the proportion grows further. In addition, at data proportions of 20% and 40%, the F-measure and G-mean values of the three algorithms increase almost linearly. This indicates that sampling too small a proportion of the data loses too much information about the original data distribution and severely degrades the classification performance of the algorithms.

4 Conclusion

To address the classification of class-imbalanced data, this paper proposes an ensemble classification method that combines hybrid data sampling with Boosting. The method uses undersampling and oversampling together: while keeping the size of the training set unchanged, it flexibly adjusts the proportion of samples in each class and largely preserves the original data distribution. Boosting is then applied over multiple learning iterations to obtain a stronger classifier. Experimental results show that the method effectively improves classification performance on positive-class samples.

Because data sets are diverse and complex, factors such as overlapping class distributions and noisy samples also affect classification performance on imbalanced data. Targeted data preprocessing would make the data distribution produced by dynamic balanced sampling more reasonable and would further improve classification performance on the positive class. Applying the method to multi-class imbalanced data classification is another direction for future work.
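The size-preserving hybrid sampling step and the two evaluation measures named in the text (F-measure and G-mean) can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: the helper names `hybrid_resample` and `f_measure_and_g_mean` are hypothetical, oversampling is done by plain random duplication of minority samples (this section does not state the paper's own oversampling rule), the F-measure uses beta = 1, and the Boosting stage is omitted entirely.

```python
import math
import random


def hybrid_resample(majority, minority, minority_ratio, rng):
    """Resample to the original total size with the requested minority share.

    Hypothetical helper: the majority class is undersampled without
    replacement and the minority class is oversampled by random duplication.
    """
    total = len(majority) + len(minority)
    n_min = round(total * minority_ratio)   # target minority count
    n_maj = total - n_min                   # target majority count
    if n_min < len(minority):
        raise ValueError("ratio would shrink the minority class")
    kept_majority = rng.sample(majority, n_maj)          # undersampling
    extra = [rng.choice(minority) for _ in range(n_min - len(minority))]
    return kept_majority, minority + extra               # oversampling


def f_measure_and_g_mean(tp, fn, fp, tn):
    """F-measure (beta = 1 assumed) and G-mean from binary confusion counts.

    Assumes every denominator below is non-zero.
    """
    recall = tp / (tp + fn)        # true positive rate
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)   # true negative rate
    f_measure = 2 * precision * recall / (precision + recall)
    g_mean = math.sqrt(recall * specificity)
    return f_measure, g_mean


rng = random.Random(0)
majority = [("neg", i) for i in range(90)]
minority = [("pos", i) for i in range(10)]

# Move from a 90:10 split to a 60:40 split without changing the total size.
maj, mino = hybrid_resample(majority, minority, 0.4, rng)
print(len(maj), len(mino))  # 60 40

f, g = f_measure_and_g_mean(tp=8, fn=2, fp=2, tn=88)
print(round(f, 3), round(g, 3))
```

Because the total size is fixed, raising the minority share necessarily lowers the number of majority samples kept, which is the trade-off the text describes as adjusting class proportions while preserving as much of the original distribution as possible.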