CAAI Transactions on Intelligent Systems (智能系统学报), Vol. 14, No. 2, Mar. 2019
DOI: 10.11992/tis.201809007
Online publication: http://kns.cnki.net/kcms/detail/23.1538.TP.20181214.1001.002.html

Parallel anomaly algorithm based on MapReduce

QI Xiaogang1, HU Qiuqiu1, LIU Lifang2
(1. School of Mathematics and Statistics, Xidian University, Xi'an 710071, China; 2. School of Computer Science and Technology, Xidian University, Xi'an 710071, China)

Abstract: To improve the accuracy, sensitivity, and efficiency of anomaly detection in data mining as the amount of data grows, a parallel anomaly detection algorithm (MR-DLOF) based on the MapReduce framework and the local outlier factor (LOF) algorithm is proposed in this paper. First, the dataset, stored in the Hadoop distributed file system (HDFS), is logically divided into multiple data blocks. Then, the MapReduce principle is used to process the data in each block in parallel, so that the k-distance and LOF value of each data point are calculated only within a single block, which greatly improves the efficiency of the algorithm. At the same time, the concept of the k-distance is redefined, which avoids the situation in which the local density becomes infinite because k or more duplicate points exist in the dataset. Finally, the data points whose LOF values are larger than a threshold are merged and their LOF values are recalculated, which effectively improves accuracy and sensitivity. Experiments with real-world datasets demonstrate the validity, high efficiency, and scalability of the MR-DLOF algorithm.

Keywords: data mining; anomaly detection; local outlier factor; Hadoop; MapReduce; distributed file system; parallel computing; local density

CLC number: TP311    Document code: A    Article ID: 1673-4785(2019)02-0224-07

Chinese citation: 齐小刚, 胡秋秋, 刘立芳. 基于MapReduce的并行异常检测算法[J]. 智能系统学报, 2019, 14(2): 224-230.
English citation: QI Xiaogang, HU Qiuqiu, LIU Lifang. Parallel anomaly algorithm based on MapReduce[J]. CAAI transactions on intelligent systems, 2019, 14(2): 224-230.

Received: 2018-09-04. Published online: 2018-12-18.
Foundation items: National Natural Science Foundation of China (61572435, 61472305, 61473222); Ministry of Education-China Mobile joint fund (MCM20170103); basic research fund of the Key Laboratory of Complex Electronic System Simulation (DXZT-JC-ZZ-2015-015); Natural Science Foundation of Ningbo (2016A610035, 2017A610119).
Corresponding author: HU Qiuqiu. E-mail: huqiuq@yeah.net.

With the continuing growth of data volumes, applying effective data mining techniques during data processing and analysis can greatly increase the speed of processing and analysis while also improving its accuracy. Data mining is the process of extracting implicit, previously unknown, but potentially useful information and knowledge from large, incomplete, noisy, fuzzy, and random datasets [1].
Anomaly detection is one of the important tasks of data mining; its goal is to discover anomalous data objects in a given dataset [2]. Anomaly detection, also called outlier detection, deviation detection, or isolated-point detection, is used in fields such as anti-cheating, fault diagnosis, and financial fraud detection. With the rapid development of mobile communication, cloud computing, and related technologies, data volumes keep increasing [3], and traditional anomaly detection algorithms designed for the memory of a single machine face great challenges. Research on anomaly detection algorithms suitable for large-scale data therefore has important application value.

In recent years many anomaly detection algorithms have appeared, falling mainly into two categories [4]: supervised [5-6] and unsupervised [7-13]. Supervised anomaly detection algorithms need a large number of samples to train a model before anomalies can be detected, but in practical applications it is often impossible to obtain large numbers of training samples in advance. Unsupervised anomaly detection algorithms therefore have higher practical value.

Among unsupervised methods, the local outlier factor (LOF) algorithm proposed by Breunig et al. [10] judges the degree of abnormality of a data object by computing a local outlier factor (LOF value) for every point. Compared with other algorithms it is theoretically simple, highly adaptable, and able to detect both global and local anomalies effectively. However, LOF is designed on the basis of local density; its complexity is relatively high, and it assumes that no k or more duplicate points exist. Building on this algorithm, Bhatt et al. [11] proposed an improved LOF algorithm that replaces the k-distance with an m-distance to improve performance, where the k-distance is the distance between a data point and its k-th nearest data point, and the m-distance is the average distance between a data point and the points in its k-neighborhood. Miao et al. [12] proposed a kernel-density-based local outlier factor algorithm (KLOF) to calculate the degree of outlierness of each data point. Tang et al. [13] introduced a relative-density outlier score to measure the local abnormality of data objects, in which the density distribution of a data object is estimated from its nearest neighbors; reverse nearest neighbors and shared nearest neighbors are further considered when estimating the density distribution. Although these algorithms improve the effectiveness of LOF to some extent, the scale of data they can handle is limited by memory capacity and data complexity. It is therefore meaningful to design an anomaly detection algorithm that preserves the advantages of LOF while processing large amounts of data efficiently and effectively.

This paper analyzes the LOF anomaly detection algorithm in depth with respect to two aspects, its computational cost and the influence of duplicate points on the local outlier factor, and designs a parallel anomaly detection algorithm based on MapReduce and the LOF idea (MR-DLOF) according to the Hadoop job scheduling mechanism and the MapReduce computing framework. In MR-DLOF, the k-distance is redefined to avoid the situation in which k or more duplicate points in the dataset make the local density of a data point infinite. At the same time, to reduce the amount of computation, the whole dataset is logically divided into multiple data blocks, and the MapReduce principle is used to process the data of each block in parallel, so that the k-distance and LOF value of each data point are computed only within a single block; the data points with higher LOF values in each block are then merged and processed further to obtain the final anomalous points, which improves the accuracy and sensitivity of the algorithm.

1 Hadoop technology

Hadoop is a distributed system infrastructure developed by the Apache Software Foundation. Its theoretical basis is the three papers published by Google on MapReduce [14], GFS [15], and BigTable [16]. Hadoop consists of components such as HDFS, MapReduce, HBase, Hive, and ZooKeeper, among which HDFS and MapReduce are the two most important. HDFS integrates data across different nodes and different types of computers, and MapReduce achieves reliable distributed computing by modeling computations with Map and Reduce functions. Practice has shown that implementing parallel algorithms on MapReduce effectively improves their computing capability [17-20].

1.1 Hadoop MapReduce

Hadoop MapReduce is implemented with a master/slave structure. The MapReduce distributed computing framework mainly contains two processing phases: the Map phase and the Reduce phase. The Map function of the Map phase and the Reduce function of the Reduce phase are both defined by the user according to the application's needs. The Map function processes the input dataset and produces intermediate output, which is then combined by the Reduce function to derive the final result. Throughout this process, data are input and output as key/value pairs, which guarantees data safety and reliability. When the Reduce function has output the last key/value pair, the MapReduce job is complete. Figure 1 shows the detailed MapReduce workflow: HDFS first splits the data physically according to the data size and the block size, and users can further split the data logically according to their needs; in the Map phase a Map task is executed on each data block, the intermediate data are then sorted and passed to the Reduce phase, where the results are merged and finally written back to HDFS.

Fig. 1  MapReduce framework process
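To make the key/value flow of Section 1.1 concrete, the following is a minimal Hadoop MapReduce sketch in Java (the language the paper states its algorithms are implemented in). It is purely illustrative and is not part of MR-DLOF: the class names and the task of counting records per category key are our own assumptions.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class CategoryCount {
        /** Map phase: one call per input record; emits intermediate <key, value> pairs. */
        public static class CategoryMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text outKey = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                // Assumption for the demo: the first comma-separated field is the record's category.
                String[] fields = line.toString().split(",");
                outKey.set(fields[0]);
                context.write(outKey, ONE);                // intermediate output
            }
        }

        /** Reduce phase: all intermediate values sharing the same key arrive in one call. */
        public static class CategoryReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));  // final <key, value> written to HDFS
            }
        }
    }

The mapper produces intermediate <key, value> pairs and the framework groups them by key before the reduce phase, which is exactly the shuffle-and-merge pattern depicted in Fig. 1.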
1.2 Hadoop HDFS

The Hadoop Distributed File System (HDFS) also adopts a master/slave structure. As a highly fault-tolerant and easily scalable distributed file system, it mainly contains three types of nodes: NameNode, Secondary NameNode, and DataNode [21]. Figure 2 describes the HDFS structure in detail.

Fig. 2  HDFS structure

The NameNode is the single master node; it receives users' operation requests and records the location and replica information of every file block on each DataNode. The Secondary NameNode is an optional node that communicates with the NameNode and periodically saves snapshots of the HDFS metadata. DataNodes are slave nodes, of which there can be many, and they are mainly responsible for storing the data. To keep data safe, each file is stored as multiple replicas (three by default). DataNodes periodically send heartbeats to the NameNode, and the NameNode controls the DataNodes through its heartbeat responses.

2 Algorithm design

2.1 LOF algorithm

LOF is a classical density-based algorithm whose core is the characterization of the density of data points. From the k-distance of each data point, the local reachability density and the local outlier factor of every point are computed, and the degree of abnormality of a point is judged by the size of its local outlier factor, which yields the outliers. The basic concepts of the algorithm are as follows.

1) k-distance: for any positive integer k, the k-distance of a point p is defined as k-distance(p) = d(p,o) if the following conditions hold:
① in the sample space there are at least k points q such that d(p,q) ≤ d(p,o);
② in the sample space there are at most k-1 points q such that d(p,q) < d(p,o).
Here d(p,o) is the distance between points p and o.

2) k-distance neighborhood: the k-distance neighborhood N_k(p) of a point p is the set of objects whose distance from p does not exceed k-distance(p).

3) Reachability distance: given the parameter k, the reachability distance reach-dis_k(p,o) of a data point p with respect to a data point o is the maximum of the k-distance of o and the distance between p and o, i.e.
reach-dis_k(p,o) = max{k-distance(o), d(p,o)}

4) Local reachability density: the local reachability density of a data point p is the reciprocal of its average reachability distance to the points in its k-distance neighborhood, i.e.
$\mathrm{lrd}_k(p) = \dfrac{|N_k(p)|}{\sum_{o \in N_k(p)} \text{reach-dis}_k(p,o)}$    (1)

5) Local outlier factor: the local relative density (local outlier factor) of a data point p is the average of the ratios between the local reachability densities of the points in its neighborhood and the local reachability density of p itself, i.e.
$\mathrm{LOF}_k(p) = \dfrac{\sum_{o \in N_k(p)} \dfrac{\mathrm{lrd}(o)}{\mathrm{lrd}(p)}}{|N_k(p)|}$    (2)

According to the definition of the local outlier factor, if the LOF value of a data point p is close to 1, the local density of p is similar to that of its neighbors; if the LOF value of p is less than 1, p lies in a relatively dense region and is not an outlier; if the LOF score of p is much larger than 1, p is relatively far from the other points and is very likely an outlier.
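As a concrete, single-machine illustration of definitions 1)-5) and Eqs. (1) and (2), the following self-contained Java sketch computes LOF values for a toy two-dimensional dataset. It follows the textbook definitions directly (distance ties and duplicate points are not handled, and there is no MapReduce), so it illustrates the plain LOF of this subsection rather than MR-DLOF; the data and names are invented for the example.

    import java.util.Arrays;

    public class LofToyExample {
        static double dist(double[] a, double[] b) {
            double s = 0;
            for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
            return Math.sqrt(s);
        }

        public static void main(String[] args) {
            // Toy dataset: a small dense cluster plus one far-away point (index 4).
            double[][] x = {{0, 0}, {0, 1}, {1, 0}, {1, 1}, {8, 8}};
            int n = x.length, k = 2;

            double[][] d = new double[n][n];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++) d[i][j] = dist(x[i], x[j]);

            // k-distance of each point: distance to its k-th nearest other point.
            double[] kDist = new double[n];
            int[][] neigh = new int[n][];            // k-distance neighbourhood N_k(p)
            for (int i = 0; i < n; i++) {
                Integer[] idx = new Integer[n];
                for (int j = 0; j < n; j++) idx[j] = j;
                final int p = i;
                Arrays.sort(idx, (a, b) -> Double.compare(d[p][a], d[p][b]));
                kDist[i] = d[i][idx[k]];             // idx[0] is the point itself
                neigh[i] = new int[k];
                for (int j = 0; j < k; j++) neigh[i][j] = idx[j + 1];
            }

            // Local reachability density, Eq. (1). Note: with k or more exact duplicates
            // of a point the sum below would be zero and lrd infinite - the failure case
            // that MR-DLOF's redefined k-distance is designed to avoid.
            double[] lrd = new double[n];
            for (int i = 0; i < n; i++) {
                double sum = 0;
                for (int o : neigh[i]) sum += Math.max(kDist[o], d[i][o]); // reach-dis_k(p,o)
                lrd[i] = neigh[i].length / sum;
            }

            // Local outlier factor, Eq. (2): average ratio lrd(o)/lrd(p) over o in N_k(p).
            for (int i = 0; i < n; i++) {
                double ratio = 0;
                for (int o : neigh[i]) ratio += lrd[o] / lrd[i];
                System.out.printf("point %d  LOF = %.3f%n", i, ratio / neigh[i].length);
            }
        }
    }

Running it prints LOF values close to 1 for the four clustered points and a value far above 1 for point 4, matching the interpretation given above.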
2.2 MR-DLOF algorithm

LOF is a density-based anomaly detection algorithm with a heavy computational load, and its definition of the local reachability density relies on the assumption that no k or more duplicate points exist. When such duplicate points do exist, their average reachability distance is zero, the local reachability density becomes infinite, and the effectiveness of the algorithm suffers.

Algorithm 1  MR-DLOF algorithm
Input   X = x1, x2, ..., xN: data set
        k: number of nearest neighbors
        N: number of data blocks
        θ: threshold for LOF
Output  abnormal data and LOF values
    get the data points with lof_i > θ from Algorithm 2 and add them to XX
    calculate lof_j for every x_j ∈ XX
    return the data points with lof_j > θ and their LOF values

Because of this shortcoming of LOF, this paper redefines the concept of the k-distance and, combining it with the MapReduce framework, proposes a parallel anomaly detection algorithm. The redefined k-distance (k-distinct-distance) is as follows.

k-distinct-distance: for any positive integer k, the k-distinct-distance of a point p is defined as k-distinct-distance(p) = d(p,o) if the following conditions hold:
1) in the sample space there are at least k points q such that 0 < d(p,q) ≤ d(p,o);
2) in the sample space there are at most k-1 points q such that 0 < d(p,q) < d(p,o).
Correspondingly, the k-distinct-neighbourhood of p used below is the set of its k nearest points at a nonzero distance from p, so duplicate points can no longer drive the local reachability density to infinity.

Algorithm 2  Parallel computation of LOF within data blocks
Input   data set X = x1, x2, ..., xN
        k: number of nearest neighbors
        θ: threshold for LOF
        N: number of data blocks
Output  data points with lof_i > θ
    initialize a Hadoop job and set the TaskMapReduce class
    logically divide X into multiple data blocks D1, D2, ..., DN
    in the j-th TaskMapReduce:
    FirstMapper
        Input   D = d1, d2, ..., dm
        Output  <key = di, value = (dj, dis_ij)>
        for each data point di, i = 1, 2, ..., m do
            calculate dis_ij = distance(di, dj), j = 1, ..., m
            sort the dis_ij of di
            for each dis_ij of di do
                if dis_ij > 0 and |k-distinct-neighbour| < k then output
    FirstReduce
        Input   <key = di, value = (dj, dis_ij)>
        Output  <key = di, value = k-distinct-neighbour of di>
    SecondMapper
        Input   <key = di, value = k-distinct-neighbour of di>
        Output  <key = di, value = reach-dis_k(di, ok)>
        for ok ∈ k-distinct-neighbour do
            reach-dis_k(di, ok) = max{k-dis(ok), dis(di, ok)}
    SecondReduce
        Input   <key = di, value = reach-dis_k(di, ok)>
        Output  <key = di, value = lrd(di)>
        for each value do
            lrd(di) = k / Σ_{ok ∈ k-distinct-neighbour} reach-dis_k(di, ok)
    ThirdMapper
        Input   <key = di, value = lrd(di)>
        Output  <key = (lof(di) > θ), value = lof(di)>
        for ok ∈ k-distinct-neighbour do
            lof(di) = (Σ_{ok ∈ k-distinct-neighbour} lrd(ok) / k) / lrd(di)
        if lof(di) > θ then output
    ThirdReduce
        Input   <key = (lof(di) > θ), value = lof(di)>
        Output  <key = di*, value = lof(di*)>
        for each value do
            sort lof(di) over the di and record the outliers di*
    end

This subsection has introduced the basic idea and steps of the MR-DLOF algorithm. First, the dataset is stored on HDFS and the original dataset is logically divided into multiple data blocks. Then, the data of each block are processed in parallel according to the MapReduce principle, so that the k-distance and LOF value of every data point are computed only within a single block. Finally, the points of each block whose local outlier factor is below the threshold are discarded, the points whose value is at or above the threshold are merged into a new dataset, and their k-distances and LOF values are updated, which improves the accuracy and sensitivity of the algorithm. Algorithms 1 and 2 give the pseudocode.
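The paper does not give source code for the MapReduce stages, so the following is only a hypothetical Java sketch of what the first stage (FirstMapper in Algorithm 2) could look like: each map call parses one record of its block, computes its distances to the other records of the same block (assumed here to be available to the task as a side file, e.g. via the distributed cache), keeps the k nearest distinct (nonzero-distance) neighbours, and emits them keyed by the point's identifier. The class name, the input format, the configuration key, and the text encoding of the output are all assumptions rather than the authors' implementation.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    /** Hypothetical sketch of the first MR-DLOF stage: for every point of a block,
     *  emit its k nearest distinct neighbours (zero distances are skipped, in the
     *  spirit of the redefined k-distinct-distance). */
    public class FirstStageMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final List<double[]> block = new ArrayList<>();  // all points of this block
        private int k;

        @Override
        protected void setup(Context context) throws IOException {
            k = context.getConfiguration().getInt("mrdlof.k", 10);
            // Assumption: the block's records were also shipped as a side file
            // so that each map call can see the whole block.
            try (BufferedReader r = new BufferedReader(new FileReader("block.csv"))) {
                String line;
                while ((line = r.readLine()) != null) block.add(parse(line));
            }
        }

        @Override
        protected void map(LongWritable offset, Text value, Context context)
                throws IOException, InterruptedException {
            double[] p = parse(value.toString());
            List<double[]> cand = new ArrayList<>();             // entries of {index, distance}
            for (int j = 0; j < block.size(); j++) {
                double d = dist(p, block.get(j));
                if (d > 0) cand.add(new double[]{j, d});          // skip duplicates of p
            }
            cand.sort(Comparator.comparingDouble(a -> a[1]));
            StringBuilder out = new StringBuilder();
            for (int j = 0; j < Math.min(k, cand.size()); j++)    // k nearest distinct neighbours
                out.append((int) cand.get(j)[0]).append(':').append(cand.get(j)[1]).append(' ');
            // The record's byte offset serves as its identifier in this sketch.
            context.write(new Text(Long.toString(offset.get())), new Text(out.toString().trim()));
        }

        private static double[] parse(String line) {
            String[] f = line.split(",");
            double[] v = new double[f.length];
            for (int i = 0; i < f.length; i++) v[i] = Double.parseDouble(f[i]);
            return v;
        }

        private static double dist(double[] a, double[] b) {
            double s = 0;
            for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
            return Math.sqrt(s);
        }
    }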
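Algorithm 1 orchestrates the per-block jobs and the final merge-and-recompute step. The following hedged Java sketch shows what a Hadoop driver for this flow might look like; it shows only the job wiring (configuration keys, input/output paths, chaining on success), and every class name, path, and key is an assumption rather than the authors' actual implementation.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    /** Hypothetical driver wiring for the two-level flow of Algorithm 1:
     *  job 1 computes per-block LOF candidates (Algorithm 2), job 2 recomputes
     *  LOF over the merged candidates. */
    public class MrDlofDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.setInt("mrdlof.k", 10);          // neighbourhood size k (illustrative key)
            conf.setDouble("mrdlof.theta", 1.2);  // LOF threshold used in the experiments

            // Job 1: per-block candidate detection; each input split is one logical block.
            Job blockJob = Job.getInstance(conf, "mr-dlof per-block LOF");
            blockJob.setJarByClass(MrDlofDriver.class);
            blockJob.setMapperClass(FirstStageMapper.class);   // sketched above
            // ... the further mapper/reducer pairs of Algorithm 2 would be configured here ...
            blockJob.setOutputKeyClass(Text.class);
            blockJob.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(blockJob, new Path("/mrdlof/input"));
            FileOutputFormat.setOutputPath(blockJob, new Path("/mrdlof/candidates"));
            if (!blockJob.waitForCompletion(true)) {
                System.exit(1);                   // stop if the per-block stage failed
            }

            // Job 2: merge all candidates with lof > theta and recompute their LOF values.
            Job mergeJob = Job.getInstance(conf, "mr-dlof merge and recompute");
            mergeJob.setJarByClass(MrDlofDriver.class);
            // a mapper/reducer recomputing LOF over the merged candidate set would be set here
            mergeJob.setOutputKeyClass(Text.class);
            mergeJob.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(mergeJob, new Path("/mrdlof/candidates"));
            FileOutputFormat.setOutputPath(mergeJob, new Path("/mrdlof/outliers"));
            System.exit(mergeJob.waitForCompletion(true) ? 0 : 1);
        }
    }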
3 Experiments and analysis

Experimental platform: three PCs connected through a local area network. Each node runs CentOS-7 under VMware Workstation Pro 12.0.0 for Windows, with JDK 1.8 and Hadoop 2.7.4. All algorithms in this paper are implemented in Java and built in the Eclipse environment. The experimental environment is a Hadoop cluster on a cloud platform with three nodes: one control node with 32 GB of memory and two computing nodes with 8 GB of memory each. Table 1 lists the node information.

Table 1  Node information
    Node name    IP address       Role
    Master       192.168.0.120    NameNode, ResourceManager
    Slave0       192.168.0.121    DataNode, NodeManager
    Slave1       192.168.0.122    DataNode, NodeManager

Experimental dataset: to verify the effectiveness and efficiency of the MR-DLOF algorithm, the network intrusion dataset KDD-CUP1999 [22] is used. In KDD-CUP1999 each connection is described by 41 features and 1 label, three of the features being written as character values in the CSV records. The 41 features comprise 7 discrete variables and 34 continuous variables, and the data of the 20th variable are all zero.

Both LOF and MR-DLOF compute with distances. Because each feature is measured on a different scale, the dataset must be preprocessed to avoid the phenomenon of "large numbers swallowing small ones" and to eliminate the influence of the differing measurement scales on the results. In this paper the 37 variables that remain after removing the all-zero variable and the character-valued variables are standardized.

3.1 Verification of the algorithm's effectiveness

Performance metrics: the possible outcomes of an anomaly detection algorithm on normal and abnormal data are shown in Table 2.

Table 2  Detection outcomes
    Result          Detected as normal     Detected as abnormal
    Normal data     TP (true positive)     FP (false positive)
    Abnormal data   FN (false negative)    TN (true negative)

Accuracy:
$\text{Accuracy} = \dfrac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}}$    (3)

Sensitivity, i.e. the true recognition rate, is the ratio of the number of correctly detected abnormal records to the actual number of abnormal records:
$\text{Sensitivity} = \dfrac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FN}}$    (4)

To verify the effectiveness of MR-DLOF, this paper compares the accuracy and sensitivity of LOF and MR-DLOF on datasets of the same scale. For each dataset size, 10 different datasets are randomly drawn from the standardized KDD-CUP1999 database, such that attack records (i.e., anomalies) account for 1%-2% of each selected dataset. Accuracy and sensitivity are computed with both LOF and MR-DLOF and their averages are used as the evaluation metrics; the threshold is set to θ = 1.2.

As Figs. 3 and 4 show, when LOF and MR-DLOF process the same datasets, MR-DLOF greatly improves the accuracy while preserving the sensitivity.

Fig. 3  Accuracy comparison of the algorithms (accuracy/% versus dataset size)
Fig. 4  Sensitivity comparison of the algorithms (sensitivity/% versus dataset size)
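To make the conventions of Table 2 and Eqs. (3) and (4) explicit (in this paper "positive" refers to normal data and "negative" to abnormal data), a minimal evaluation helper could be written as follows; the tiny label arrays are invented purely for illustration and the code is not from the paper.

    /** Confusion-matrix tallies following Table 2: normal data detected as normal is TP,
     *  abnormal data detected as abnormal is TN. */
    public class DetectionMetrics {
        public static void main(String[] args) {
            boolean[] isAbnormalTruth = {false, false, false, true,  true};   // ground truth
            boolean[] isAbnormalPred  = {false, true,  false, true,  false};  // detector output

            int tp = 0, fp = 0, fn = 0, tn = 0;
            for (int i = 0; i < isAbnormalTruth.length; i++) {
                if (!isAbnormalTruth[i] && !isAbnormalPred[i]) tp++;  // normal kept normal
                if (!isAbnormalTruth[i] &&  isAbnormalPred[i]) fp++;  // normal flagged abnormal
                if ( isAbnormalTruth[i] && !isAbnormalPred[i]) fn++;  // abnormal missed
                if ( isAbnormalTruth[i] &&  isAbnormalPred[i]) tn++;  // abnormal caught
            }
            double accuracy    = (double) (tp + tn) / (tp + tn + fp + fn);   // Eq. (3)
            double sensitivity = (double) tn / (tn + fn);                    // Eq. (4)
            System.out.printf("accuracy = %.2f, sensitivity = %.2f%n", accuracy, sensitivity);
        }
    }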
3.2 Verification of the algorithm's efficiency

To verify the execution efficiency of MR-DLOF, this paper compares the execution time of LOF and MR-DLOF on datasets of the same scale. As Fig. 5 shows, when the amount of data is relatively large the execution efficiency of MR-DLOF is clearly better than that of LOF. LOF is more efficient than MR-DLOF when the amount of data is small because Hadoop needs a certain amount of time to schedule the Map and Reduce tasks; but when the amount of data is large, the data are divided into blocks and Hadoop schedules multiple MapReduce tasks to run in parallel, so efficiency increases markedly.

Fig. 5  Efficiency comparison of the algorithms (running time/s versus dataset size)

3.3 Verification of the algorithm's scalability

To verify the scalability of MR-DLOF, this paper enlarges the data scale and compares the execution efficiency of MR-DLOF under different numbers of computing nodes. As Fig. 6 shows, for the same dataset size the execution efficiency of the algorithm improves as the number of computing nodes in the cluster increases. Therefore, MR-DLOF is scalable as the dataset grows: its execution efficiency can be improved by adding computing nodes to the Hadoop cluster.

Fig. 6  Execution efficiency under different numbers of computing nodes (2, 3, and 4 computing nodes; running time/s versus dataset size)

4 Conclusion

By analyzing the shortcomings of the LOF algorithm, this paper designs a parallel anomaly detection algorithm based on MapReduce and the LOF algorithm. The algorithm revises the concept of the k-distance, which avoids the situation in which the reachability distance of some points is zero and the local reachability density becomes infinite, and thus improves the effectiveness of the algorithm. At the same time, to reduce the amount of computation, the data are divided into blocks and the data points of each block are processed in parallel following the MapReduce framework, which greatly improves the execution efficiency. Finally, the validity, efficiency, and scalability of the algorithm are verified on real-world datasets.

References:
[1] HAN Jiawei, KAMBER M. Data mining: concepts and techniques[M]. San Francisco: Morgan Kaufmann Publishers Inc., 2006: 1-18.
[2] AGGARWAL C C. Outlier analysis[M]. New York: Springer, 2013: 75-99.
[3] CHEN Feng, DENG Pan, WAN Jiafu, et al. Data mining for the internet of things: literature review and challenges[J]. International journal of distributed sensor networks, 2015, 2015: 431047.
[4] WU Jingfeng, JIN Weidong, TANG Peng. Survey on monitoring techniques for data abnormalities[J]. Computer science, 2017, 44(S2): 24-28. (in Chinese)
[5] STEINWART I, HUSH D, SCOVEL C. A classification framework for anomaly detection[J]. The journal of machine learning research, 2005, 6(1): 211-232.
[6] DENG Hongli, YANG Tao. An anomaly detection method based on deep learning[J]. Information and communications, 2015(3): 3-4. (in Chinese)
[7] ZHAO Xuanqiang, WANG Guoying, LI Zhixing. Unsupervised network anomaly detection based on abnormality weights and subspace clustering[C]//Proceedings of the 6th International Conference on Information Science and Technology. Dalian, China, 2016: 482-486.
[8] ZUO Jin, CHEN Zemao. Anomaly detection algorithm based on improved K-means clustering[J]. Computer science, 2016, 43(8): 258-261. (in Chinese)
[9] ZOU Yunfeng, ZHANG Xin, SONG Shiyuan, et al. Fast outlier detection algorithm based on local density[J]. Journal of computer applications, 2017, 37(10): 2932-2937. (in Chinese)
[10] BREUNIG M M, KRIEGEL H P, NG R T, et al. LOF: identifying density-based local outliers[C]//Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. Dallas, Texas, USA, 2000: 93-104.
[11] BHATT V, SHARMA K G, RAM A. An enhanced approach for LOF in data mining[C]//Proceedings of the 2013 International Conference on Green High Performance Computing. Nagercoil, India, 2013: 1-3.
[12] MIAO Dandan, QIN Xiaowei, WANG Weidong. Anomalous cell detection with kernel density-based local outlier factor[J]. China communications, 2015, 12(9): 64-75.
[13] TANG Bo, HE Haibo. A local density-based approach for outlier detection[J]. Neurocomputing, 2017, 241: 171-180.
[14] DEAN J, GHEMAWAT S. MapReduce: simplified data processing on large clusters[J]. Communications of the ACM, 2008, 51(1): 107-113.
[15] GHEMAWAT S, GOBIOFF H, LEUNG S T. The Google file system[C]//Proceedings of the 19th ACM Symposium on Operating Systems Principles. Bolton Landing, NY, USA, 2003: 29-43.
[16] CHANG F, DEAN J, GHEMAWAT S, et al. Bigtable: a distributed storage system for structured data[C]//Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation. Seattle, WA, 2006: 15.
[17] MU Yashuang, LIU Xiaodong, YANG Zhihao, et al. A parallel C4.5 decision tree algorithm based on MapReduce[J]. Concurrency and computation: practice and experience, 2017, 29(8): e4015.
[18] HOU Yongxu, DUAN Lei, QIN Jianglong, et al. Parallel anomaly detection based on isolation forest[J]. Computer engineering & science, 2017, 39(2): 236-244. (in Chinese)
[19] YE Haiqin, MENG Caixia, WANG Yifeng, et al. Frequent pattern mining algorithm based on MapReduce[J]. Journal of Nanjing University of Science and Technology, 2018, 42(1): 62-67. (in Chinese)
[20] LIU Yi, JING Ning, CHEN Luo, et al. Algorithm for processing k-nearest join based on R-tree in MapReduce[J]. Journal of software, 2013, 24(8): 1836-1851. (in Chinese)
[21] WHITE T. Hadoop: the definitive guide[M]. 4th ed. Sebastopol, CA: O'Reilly Media, Inc., 2015: 1-4.
[22] KDD Cup 1999 dataset[EB/OL]. http://archive.ics.uci.edu/ml/machine-learning-databases/kddcup99-mld/.

About the authors:
QI Xiaogang, male, born in 1973, professor and doctoral supervisor. His main research interests are system modeling and fault diagnosis, and network optimization and algorithm design. He has published more than 50 academic papers, of which more than 10 are indexed by SCI and more than 50 by EI, applied for 18 patents (9 granted), and registered 3 software copyrights.
HU Qiuqiu, female, born in 1995, master's student. Her main research interests are distributed systems and data processing and analysis.
LIU Lifang, female, born in 1972, professor, with a doctorate. Her main research interests are data processing and intelligent computing. She has published more than 40 academic papers, of which 9 are indexed by SCI and more than 30 by EI.