

CAAI Transactions on Intelligent Systems (智能系统学报), Vol. 5, No. 2, Apr. 2010
doi: 10.3969/j.issn.1673-4785.2010.02.009
CLC number: TP311  Document code: A  Article ID: 1673-4785(2010)02-0150-06

An outlier mining algorithm based on information entropy

ZHANG He¹, CAI Jiang-hui, ZHANG Ji-fu¹, QIAO Kan²
(1. School of Computer Science and Technology, Taiyuan University of Science & Technology, Taiyuan 030024, China; 2. Automation Science and Electrical Engineering College, Beijing University of Aeronautics and Astronautics, Beijing 100191, China)

Abstract: Outlier mining aims to discover anomalous data patterns that are relatively sparse and isolated within massive volumes of data, but traditional outlier mining methods are strongly influenced by subjective, human-set factors. This paper introduces an outlier measurement factor based on information entropy and presents a new outlier mining algorithm built on it. The algorithm first uses information entropy to compute the outlier measurement factor of each data object, then uses that factor to gauge how outlying each object is, and thereby detects the outliers. This effectively removes the influence of subjective human factors on outlier detection, and the factor gives a clear interpretation of what makes a point an outlier. Experiments on UC Irvine (UCI) data sets and high-dimensional stellar spectrum data confirm the feasibility and effectiveness of the algorithm.
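Keywords: outlier; information entropy; outlier measure factor; data mining

Received: 2008-12-30. Foundation item: Supported by the Shanxi Province Youth Science Foundation (2008021028).
Corresponding author: ZHANG He. E-mail: zhanghe_.helen@126.com.

Outliers are data that deviate markedly from the rest of the data, do not conform to its general patterns or behavior, and are inconsistent with the other data present [1]. To date, however, no definition of an outlier has been universally adopted. The definition given by the statistician Hawkins [2] in 1980 captures the essence of an outlier to some extent: "An outlier is so different from the other points as to arouse suspicion that it was generated by a different mechanism." In fact, "one person's noise may be another person's signal": rare events are more worth studying than common ones, because tens of thousands of data records may yield only a single piece of information, whereas 10 anomalous records may well yield 10 different pieces of information. Discovering outliers often reveals knowledge that is real yet unexpected, so studying outliers to uncover anomalous behaviors and patterns is of great significance. Outlier detection is now widely applied in many fields, such as financial fraud detection, telecom billing, medical insurance, and network security.

Existing classical outlier detection algorithms fall mainly into the following categories: statistical-based methods [3], depth-based methods [4], deviation-based methods [5], distance-based methods [6], and density-based methods [7]. These methods have the following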
