128 Austrian Journal of Statistics, Vol. 34 (2005), No. 2, 127-138

for a p-dimensional observation $x_i$ and $i = 1, \dots, n$. Here $t$ and $C$ are robust estimates of location and scatter, respectively. For normally distributed data (and if the arithmetic mean and the sample covariance matrix were used), the squared Mahalanobis distance is approximately chi-square distributed with $p$ degrees of freedom ($\chi^2_p$).
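The chi-square approximation can be checked numerically. The following is a minimal sketch, not taken from the paper: it assumes the usual definition of the squared Mahalanobis distance, $(x_i - t)^T C^{-1} (x_i - t)$, uses the arithmetic mean and sample covariance matrix rather than robust estimates, and restricts itself to $p = 2$, where the $\chi^2_2$ quantile has the closed form $-2\ln(1-q)$. The function names and the simulated data are hypothetical.

```python
import math
import random


def squared_mahalanobis_2d(points):
    """Squared Mahalanobis distances for 2-D points, based on the
    arithmetic mean and the sample covariance matrix (non-robust)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    dists = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # Quadratic form with the inverse [[syy, -sxy], [-sxy, sxx]] / det
        dists.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return dists


def chi2_quantile_2df(q):
    """Quantile of the chi-square distribution with 2 degrees of freedom."""
    return -2.0 * math.log(1.0 - q)


# Assumed test data: correlated bivariate normal sample without outliers.
rng = random.Random(0)
data = []
for _ in range(5000):
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    data.append((z1, 0.6 * z1 + 0.8 * z2))

md2 = squared_mahalanobis_2d(data)
cutoff = chi2_quantile_2df(0.975)
share = sum(d > cutoff for d in md2) / len(md2)
print(f"97.5% quantile of chi-square(2): {cutoff:.4f}")
print(f"share of squared distances above it: {share:.4f}")
```

For clean normal data the printed share should lie close to 0.025; a clearly larger share would hint at observations that do not fit the normal model.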
Potential multivariate outliers $x_i$ will typically have large values $MD_i$, and a comparison with the $\chi^2_p$ distribution can be made.

Garrett (1989) introduced the chi-square plot, which plots the empirical distribution function of the robust Mahalanobis distances against the $\chi^2_p$ distribution. A break in the tail of the distributions indicates outliers, and values beyond this break are iteratively deleted until a straight line appears.

Rousseeuw and Van Zomeren (1990) use a cut-off value for distinguishing outliers from non-outliers. This value is a certain quantile (e.g., the 97.5% quantile) of the $\chi^2_p$ distribution. For $t$ and $C$ the MVE (minimum volume ellipsoid) estimator (Rousseeuw, 1985) was used. However, several years later the MVE was replaced for this purpose by the MCD (minimum covariance determinant) estimator, because it has better statistical properties and because a fast algorithm exists for its computation (Rousseeuw and Van Driessen, 1999).

Various other concepts for multivariate outlier detection exist in the literature (e.g. Barnett and Lewis, 1994; Rocke and Woodruff, 1996; Becker and Gather, 1999; Peña and Prieto, 2001), and other robust estimators for multivariate location and scatter can be considered (e.g. Maronna, 1976; Davies, 1987; Tyler, 1991; Maronna and Yohai, 1995; Kent and Tyler, 1996).

Recently, Filzmoser et al. (2005) introduced a multivariate outlier detection method that can be seen as an automation of the method of Garrett (1989). The principle is to measure the deviation of the data distribution from multivariate normality in the tails. In Section 2 we will briefly introduce this method. A comparison with other outlier identification methods is carried out on simulated data in Section 3. Throughout the paper we restrict ourselves to the p-dimensional normal distribution $N_p(\mu, \Sigma)$ with mean $\mu$ and positive definite covariance matrix $\Sigma$ as model distribution.
However, we also simulate data from other distributions in order to get an idea of the performance in different situations. Section 4 provides conclusions.

2 Methods

The method of Filzmoser et al. (2005) follows an idea of Gervini (2003) for increasing the efficiency of the robust estimation of multivariate location and scatter. Let $G_n(u)$ denote the empirical distribution function of the squared robust Mahalanobis distances $MD_i^2$, and let $G(u)$ be the distribution function of $\chi^2_p$. For multivariate normally distributed samples, $G_n$ converges to $G$. Therefore the tails of $G_n$ and $G$ can be compared to detect outliers. The tails are defined by the quantile $\delta = \chi^2_{p,1-\beta}$ for a certain small $\beta$ (e.g., $\beta = 0.025$), and

$$p_n(\delta) = \sup_{u \ge \delta} \bigl( G(u) - G_n(u) \bigr)^{+} \qquad (2)$$

is considered, where "+" indicates the positive differences. In this way, $p_n(\delta)$ measures the departure of the empirical from the theoretical distribution only in the tails, defined by the value of $\delta$. If $p_n(\delta)$ is larger than a critical value $p_{crit}(\delta, n, p)$, it can be considered
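The tail statistic in (2) can be illustrated with a short sketch. It is an illustration under stated assumptions, not the authors' implementation: it fixes $p = 2$ so that $G$ and its quantiles have closed forms, and it takes the squared distances as given (here simulated with known location and scatter standing in for robust estimates). The critical value $p_{crit}(\delta, n, p)$ is not specified in this excerpt and is therefore not computed. All function names are hypothetical.

```python
import math
import random


def chi2_cdf_2df(u):
    """Distribution function G(u) of chi-square with p = 2 degrees of freedom."""
    return 1.0 - math.exp(-u / 2.0) if u > 0 else 0.0


def chi2_quantile_2df(q):
    """Quantile function of chi-square with p = 2 degrees of freedom."""
    return -2.0 * math.log(1.0 - q)


def tail_departure(sq_dists, beta=0.025):
    """p_n(delta) = sup over u >= delta of (G(u) - G_n(u))^+,
    with delta the (1 - beta) quantile of chi-square(2)."""
    n = len(sq_dists)
    delta = chi2_quantile_2df(1.0 - beta)
    best = 0.0  # the positive part keeps the statistic at or above zero
    # rank = number of values sorted before d
    for rank, d in enumerate(sorted(sq_dists)):
        if d > delta:
            # G_n is a step function and G is increasing, so G - G_n is
            # largest just before each jump, where G_n still equals rank / n.
            best = max(best, chi2_cdf_2df(d) - rank / n)
    return best


# Assumed example: squared distances of 2-D standard normal points from the
# origin, once clean and once with 10% of the points shifted by 5 units.
rng = random.Random(1)
clean = [rng.gauss(0, 1) ** 2 + rng.gauss(0, 1) ** 2 for _ in range(1000)]
shifted = [(rng.gauss(0, 1) + 5) ** 2 + rng.gauss(0, 1) ** 2 for _ in range(100)]
contaminated = clean[:900] + shifted

print("p_n(delta), clean sample:       ", tail_departure(clean))
print("p_n(delta), contaminated sample:", tail_departure(contaminated))
```

A clean sample should give a value near zero, whereas the contaminated sample should give a value close to the contamination share, because the empirical distribution function lags behind $G$ beyond $\delta$.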
