582 P. Filzmoser et al. / Computers & Geosciences 31 (2005) 579–587

identify the different geochemical processes that influence the data. Only in doing so can the true outliers be identified and differentiated from extreme members of the one or more background populations in the data. This distinction should also be made in the multivariate case.
In the previous section the assumption of multivariate normality was implicitly used because this led to chi-square distributed Mahalanobis distances. Also for the RD this assumption was used, at least for the majority of data (depending on the choice of h for the MCD estimator). Defining outliers by using a fixed threshold value (e.g., $\chi^2_{p;0.98}$) is rather subjective because

(1) If the data should indeed come from a single multivariate normal distribution, the threshold would be infinity because there are no observations from a different distribution (only extremes);
(2) There is no reason why this fixed threshold should be appropriate for every data set; and
(3) The threshold has to be adjusted to the sample size (see Reimann et al., 2005; and simulations below).

A better procedure than using a fixed threshold is to adjust the threshold to the data set at hand. Garrett (1989) used the chi-square plot for this purpose, by plotting the squared Mahalanobis distances (which have to be computed on the basis of robust estimations of location and scatter) against the quantiles of $\chi^2_p$; the most extreme points are deleted until the remaining points follow a straight line. The deleted points are the identified outliers; the multivariate threshold corresponds to the distance of the closest outlier, the farthest background individual, or some intermediate distance. Alternately, the cube root of the squared Mahalanobis distances may be plotted against normal quantiles (e.g., Chork, 1990). This procedure (Garrett, 1989) is not automatic; it needs user interaction and experience on the part of the analyst. Moreover, especially for large data sets, it can be time consuming, and also to some extent it is subjective. In the next section a procedure that does not require analyst intervention, is reproducible and therefore objective, and takes the above points, (1)–(3), into consideration is introduced.

4. Adaptive outlier detection

The chi-square plot is useful for visualising the deviation of the data distribution from multivariate normality in the tails. This principle is used in the following. Let $G_n(u)$ denote the empirical distribution function of the squared robust distances $RD_i^2$, and let $G(u)$ be the distribution function of $\chi^2_p$. For multivariate normally distributed samples, $G_n$ converges to $G$. Therefore the tails of $G_n$ and $G$ can be compared to detect outliers. The tails will be defined by $\delta = \chi^2_{p;1-\alpha}$ for a certain small $\alpha$ (e.g., $\alpha = 0.02$), and

$$p_n(\delta) = \sup_{u \ge \delta} \left( G(u) - G_n(u) \right)^{+} \qquad (2)$$

is considered, where $+$ indicates the positive differences. In this way, $p_n(\delta)$ measures the departure of the empirical from the theoretical distribution only in the tails, defined by the value of $\delta$. $p_n(\delta)$ can be considered as a measure of outliers in the sample. Gervini (2003) used this idea as a reweighting step for the robust estimation of multivariate location and scatter. In this way, the efficiency (in terms of statistical precision) of the estimator could be improved considerably.

$p_n(\delta)$ will not be directly used as a measure of outliers. As mentioned in the previous section, the threshold should be infinity in the case of multivariate normally distributed background data. This means that if the data are coming from a multivariate normal distribution, no observation should be declared as an outlier. Instead, observations with a large RD should be seen as extremes of the distribution. Therefore a critical value $p_{crit}$ is introduced, which helps to distinguish between outliers and extremes. The measure of outliers in the sample is then defined as

$$\alpha_n(\delta) = \begin{cases} 0 & \text{if } p_n(\delta) \le p_{crit}(\delta, n, p), \\ p_n(\delta) & \text{if } p_n(\delta) > p_{crit}(\delta, n, p). \end{cases} \qquad (3)$$

The threshold value is then determined as $c_n(\delta) = G_n^{-1}(1 - \alpha_n(\delta))$.

The critical value $p_{crit}$ for distinguishing between outliers and extremes can be derived by simulation.
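Once squared robust distances are available, Eqs. (2) and (3) are straightforward to compute. The following is a minimal sketch, not the authors' software: the function names are illustrative, the critical value $p_{crit}(\delta, n, p)$ is taken as given, and scipy's `chi2` supplies the theoretical distribution $G$.

```python
import numpy as np
from scipy.stats import chi2

def pn_delta(rd2, p, alpha=0.02):
    """Eq. (2): p_n(delta) = sup_{u >= delta} (G(u) - G_n(u))^+ ,
    where rd2 holds the squared robust distances and G is the chi^2_p cdf."""
    rd2 = np.sort(np.asarray(rd2, dtype=float))
    n = rd2.size
    delta = chi2.ppf(1.0 - alpha, df=p)          # delta = chi^2_{p,1-alpha}
    # The supremum is attained either at u = delta or just below a tail
    # data point, where the step function G_n has not yet jumped.
    u = np.concatenate(([delta], rd2[rd2 >= delta]))
    G = chi2.cdf(u, df=p)                        # theoretical G(u)
    Gn_left = np.searchsorted(rd2, u, side="left") / n   # G_n(u^-)
    return float(max(np.max(G - Gn_left), 0.0))  # keep only positive part

def adaptive_cutoff(rd2, p, p_crit, alpha=0.02):
    """Eq. (3) and the threshold c_n(delta): the outlier measure is zero
    (threshold = infinity) unless p_n(delta) exceeds p_crit."""
    p_n = pn_delta(rd2, p, alpha)
    a_n = p_n if p_n > p_crit else 0.0           # alpha_n(delta)
    # c_n(delta) = G_n^{-1}(1 - alpha_n): empirical quantile of the RD^2
    c_n = np.inf if a_n == 0.0 else float(np.quantile(rd2, 1.0 - a_n))
    return a_n, c_n
```

For background data that follow the $\chi^2_p$ distribution closely, $p_n(\delta)$ stays below $p_{crit}$ and the threshold is infinite; contaminated tails push $p_n(\delta)$ above $p_{crit}$ and yield a finite cutoff.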
For different sample sizes $n$ and different dimensions (numbers of variables) $p$, data from a multivariate normal distribution are simulated. Then Eq. (2) is applied for computing the value $p_n(\delta)$ for a fixed value $\delta$ (in the simulations $\delta = \chi^2_{p;0.98}$ is used). The procedure is repeated 1000 times for every considered value of $n$ and $p$.

To directly compute the limiting distribution of the statistic defined by Eq. (2) would be a more elegant way for determining the critical value. However, even for related simpler problems Csörgő and Révész (1981, Chapter 5) note that this is analytically extremely difficult, and they recommend simulation.

The resulting values give an indication of the differences between the theoretical and the empirical distributions, $G(u) - G_n(u)$, if the data are sampled from multivariate normal distributions. To be on the safe side, the 95% percentile of the 1000 simulated values can be used for every $n$ and $p$, and these percentiles are shown for $p = 2, 4, 6, 8, 10$ by different symbols in Fig. 3. By transforming the x-axis by the inverse of $\sqrt{n}$ it can be seen that, at least for larger sample sizes, the points lie on a line (see Fig. 3). The lines in Fig. 3 are estimated by least trimmed sum of squares (LTS) regression (Rousseeuw, 1984). Using LTS regression the less precise
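The simulation of $p_{crit}$ described above can be sketched as follows. This is a simplified illustration rather than the paper's implementation: classical (non-robust) Mahalanobis distances stand in for the MCD-based robust distances, the replication count defaults to 200 rather than 1000, and the function name is invented for the example.

```python
import numpy as np
from scipy.stats import chi2

def simulate_p_crit(n, p, alpha=0.02, reps=200, level=0.95, seed=0):
    """Monte Carlo sketch of p_crit(delta, n, p): simulate clean
    multivariate normal samples, compute p_n(delta) for each via Eq. (2),
    and return the 95% percentile of the replicated values.
    Simplification: classical Mahalanobis distances are used here; the
    paper's procedure is based on robust (MCD) distances."""
    rng = np.random.default_rng(seed)
    delta = chi2.ppf(1.0 - alpha, df=p)
    vals = np.empty(reps)
    for r in range(reps):
        x = rng.standard_normal((n, p))
        d = x - x.mean(axis=0)
        cov_inv = np.linalg.inv(np.cov(x, rowvar=False))
        # squared Mahalanobis distances d_i' S^{-1} d_i, sorted for G_n
        rd2 = np.sort(np.einsum("ij,jk,ik->i", d, cov_inv, d))
        u = np.concatenate(([delta], rd2[rd2 >= delta]))
        G = chi2.cdf(u, df=p)
        Gn_left = np.searchsorted(rd2, u, side="left") / n
        vals[r] = max(np.max(G - Gn_left), 0.0)   # p_n(delta), Eq. (2)
    return float(np.quantile(vals, level))
```

For a given data set one would simulate (or look up) $p_{crit}$ for the actual $n$ and $p$ and compare $p_n(\delta)$ against it, as in Eq. (3); consistent with Fig. 3, the simulated critical values shrink as the sample size grows.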