AUSTRIAN JOURNAL OF STATISTICS
Volume 34 (2005), Number 2, 127–138

Identification of Multivariate Outliers: A Performance Study

Peter Filzmoser
Vienna University of Technology, Austria

Abstract: Three methods for the identification of multivariate outliers (Rousseeuw and Van Zomeren, 1990; Becker and Gather, 1999; Filzmoser et al., 2005) are compared. They are based on the Mahalanobis distance, which is made resistant against outliers and model deviations by robust estimation of location and covariance. The comparison is made by means of a simulation study. Not only multivariate normally distributed data, but also heavy-tailed and asymmetric distributions are considered. The simulations focus on low-dimensional (p = 5) and high-dimensional (p = 30) data.

Keywords: Outlier Detection, MCD Estimator, Mahalanobis Distance, Robustness.

1 Introduction

The increasing size of data sets makes it more and more difficult to identify common structures in the data. Especially for high-dimensional data it is often impossible to see data structures by visualization, even with highly sophisticated graphical tools (e.g. Swayne et al., 1998; Doleisch et al., 2003). Data mining algorithms, as an answer to these difficulties, try to fit a variety of different models to the data in order to get an idea of the relations in the data, but usually another problem arises: multivariate outliers.

Many papers and studies with real data have demonstrated that data without any outliers ("clean data") are rather an exception. Outliers can, and very often do, influence the fit of statistical models, and it is not desirable that parameter estimates are biased by the outliers. This problem can be avoided either by using a robust method for model fitting or by first cleaning the data of outliers and then applying classical statistical methods for model fitting.
Removing outliers does not mean throwing away measured information. Outliers usually carry important information about certain phenomena, artifacts, or substructures in the data. Knowledge about this deviating behavior is important, although it might not always be easy for the practitioner to find the reasons for the existence of outliers in the data, or to interpret them.

Multivariate outliers are not necessarily characterized by extremely high or low values along single coordinates. Rather, their univariate projection on certain directions separates them from the mass of the data (this projection approach to outlier detection was introduced by Gnanadesikan and Kettenring, 1972). Standard methods for multivariate outlier detection are based on the robust Mahalanobis distance, which is defined as

    MD_i = ((x_i − t)^T C^(−1) (x_i − t))^(1/2)    (1)

where t is a robust estimate of location and C a robust estimate of the covariance matrix.
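To make the projection idea concrete, the following sketch constructs a bivariate point that is unremarkable in each single coordinate but clearly separated from the data when projected onto a suitable direction. The data, the outlying point, and the projection direction are invented for illustration and do not come from the paper:

```python
import numpy as np

rng = np.random.default_rng(42)

# Bivariate normal sample with strong positive correlation (illustrative data)
n = 500
cov = np.array([[1.0, 0.9],
                [0.9, 1.0]])
X = rng.multivariate_normal([0.0, 0.0], cov, size=n)

# A point that is moderate in each single coordinate but violates the
# correlation structure -- a multivariate outlier
outlier = np.array([2.0, -2.0])

# Coordinate-wise the point lies within the range of the data ...
print("inside coordinate ranges:",
      (np.abs(outlier) < np.abs(X).max(axis=0)).all())

# ... but its projection onto the direction (1, -1)/sqrt(2) is far outside
d = np.array([1.0, -1.0]) / np.sqrt(2.0)
proj = X @ d
print("projected outlier:", outlier @ d)
print("projected data range:", proj.min(), proj.max())
```

Here the projected data have standard deviation sqrt(0.1) ≈ 0.32, so the projected outlier at about 2.83 lies many standard deviations outside the bulk, even though neither of its coordinates is extreme.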
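As a minimal sketch of how equation (1) can be used in practice, the robust estimates t and C can be obtained, for instance, from scikit-learn's MinCovDet, one available implementation of the MCD estimator named in the keywords. The simulated data set and the chi-squared cutoff below are illustrative assumptions, not taken from the paper:

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(0)

# Illustrative data: p = 5, a clean normal bulk plus a shifted outlier group
p, n_clean, n_out = 5, 200, 20
X = np.vstack([
    rng.standard_normal((n_clean, p)),
    rng.standard_normal((n_out, p)) + 4.0,   # shifted outliers
])

# Robust location t and covariance C via the MCD estimator
mcd = MinCovDet(random_state=0).fit(X)

# Squared robust Mahalanobis distances MD_i^2 = (x_i - t)^T C^(-1) (x_i - t)
md2 = mcd.mahalanobis(X)

# A common (assumed) rule: flag observations beyond the 97.5% quantile of
# the chi-squared distribution with p degrees of freedom
flagged = md2 > chi2.ppf(0.975, df=p)
print(flagged.sum(), "observations flagged as outliers")
```

Note that `mahalanobis` returns squared distances, so the cutoff is applied on the scale of MD_i^2 rather than MD_i.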