586 P. Filzmoser et al. / Computers & Geosciences 31 (2005) 579–587

achieve this. It is possible to use the same symbols as in the multivariate outlier plot to provide important information about the structure of these outliers.
For exploratory investigations, however, it is informative to have an overview of the position of the multivariate outliers within the distribution of the single elements. To achieve this we can simply plot the values of the elements and use the same symbols and colours as in the multivariate outlier plot. See Fig. 11 for the Kola O-horizon data. All variables are presented as a series of vertically scaled parallel bars, where the values are scattered randomly in the horizontal direction (one-dimensional scatter plot). Since the original values of the variables have very different data ranges, the data were first centred and scaled for this presentation by using the robust multivariate estimates of location and scatter. In this way the different variables can be easily compared. This visualisation provides insight into the data structure and quality. As in the multivariate outlier plot, the multivariate outliers are presented by large symbols (+) for every variable. Not surprisingly in the light of the previous discussion, the multivariate outliers occur over the complete univariate data ranges, and not only at the extremes. Moreover, extremely low values, e.g., for Pb, which seem to be univariate outliers, are not necessarily multivariate outliers. The explanation can be found by looking again at the simulation example, Fig. 9, where the lowest values for the x-axis are not multivariate outliers but members of the main data structure.

8. Conclusions

An automated method to identify outliers in multivariate space was developed and demonstrated with real data. In the univariate case it is often very difficult to identify data outliers originating from a second or other rare process, rather than extreme values in relation to the underlying data of the more common process(es). Extreme values can be easily detected due to their distance from the core of the data.
If they originate from the underlying data they are of little interest to the exploration or environmental geochemist because they will neither identify mineralisation nor contamination. In contrast, in the multivariate case it is necessary also to consider the shape of the data, its structure, in the multivariate space and all the dependencies between the variables. Thus the really interesting data outliers, caused by additional, rare processes, can be easily identified.

Not surprisingly the identified multivariate outliers in the test data set consisting of seven variables and 617 samples are often not the univariate extreme values. In the context of Fig. 1, they are equivalent to the distant off-axis individuals in the middle of the data range, e.g., the individual at (−1, 1). The map of the multivariate outliers clearly identifies contaminated sites and those affected by the input of marine aerosols near the coast as regionally important processes causing different data outlier populations.

Although multivariate outlier identification is important for thorough data analysis, the task of interpretation goes beyond that first step as the researcher is also interested in identifying the geochemical processes leading to the data structure. A crucial point, however, is that multivariate outliers are not simply excluded from further analysis, but that after applying robust procedures which reduce the impact of the outliers the outliers are actually left in the data set. Working in this way permits the outliers to be viewed in the context of the main mass of the data, which facilitates an appreciation of their relationship to the core data. In this context, the data analyst should use a variety of procedures, often graphical, to gain as great an insight as possible into the data structure and the controlling processes behind the observations.
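The observation above, that an individual such as (−1, 1) can be a multivariate outlier without being a univariate extreme, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the core population is synthetic (points close to the line y = x), the location and covariance are estimated classically from that clean core as a stand-in for a robust estimator, and the 97.5% chi-square quantile with two degrees of freedom serves as a fixed cut-off rather than an adaptive one.

```python
import math

def mean_and_cov(points):
    """Classical mean and covariance (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def mahalanobis_sq(point, center, cov):
    """Squared Mahalanobis distance, using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - center[0], point[1] - center[1]
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

# Hypothetical core population: strongly correlated, close to y = x.
xs = [-3 + 0.5 * i for i in range(13)]
core = [(x, x + (0.3 if i % 2 == 0 else -0.3)) for i, x in enumerate(xs)]

# Stand-in for a robust fit: estimate from the clean core only.
center, cov = mean_and_cov(core)

# For 2 degrees of freedom the chi-square quantile is -2 ln(1 - p).
cutoff = -2.0 * math.log(1.0 - 0.975)

d2_extreme = mahalanobis_sq((3.0, 3.0), center, cov)    # univariate extreme, on-axis
d2_off_axis = mahalanobis_sq((-1.0, 1.0), center, cov)  # mid-range, off-axis

print("cut-off:", cutoff)
print("extreme on-axis point (3, 3):", d2_extreme, d2_extreme > cutoff)
print("off-axis point (-1, 1):", d2_off_axis, d2_off_axis > cutoff)
```

In this sketch the point at the edge of both univariate ranges stays below the cut-off because it follows the correlation structure, whereas the mid-range point that violates that structure exceeds it.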
For example, since factor analysis (like many other multivariate methods) is based on the covariance matrix, a robust estimation of the covariance matrix will reduce the effect of (multivariate) outlying observations (Chork and Salminen, 1993; Reimann et al., 2002) and lead to a data interpretation centred on the dominant process(es). Furthermore, when a single dominant process is present the factor loadings may be interpretable in the context of that process. When non-robust procedures are used in the presence of multiple processes factor analysis often behaves more like a cluster analysis procedure. In such cases the factor loadings provide little or no information on the internal structure of the processes, but define a framework for differentiating between them. Both applications have merit, the latter in exploratory data analysis, and the

[Figure: elements As, Cd, Co, Cu, Mg, Pb, Zn along the horizontal axis; vertical axis "Centered and scaled data".]
Fig. 11. Plot of single elements for Kola O-horizon data, with same symbols as used in Fig. 10.
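The kind of presentation used for Fig. 11 (robust centring and scaling, followed by a one-dimensional scatter plot with random horizontal offsets) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the median and the MAD are used per variable as a simple stand-in for the robust multivariate estimates of location and scatter, the concentrations are made up, and the outlier flags are assumed to come from a preceding multivariate outlier detection step. The function names are hypothetical.

```python
import random
import statistics

def robust_standardise(values):
    """Centre by the median and scale by the MAD (simple robust stand-in)."""
    med = statistics.median(values)
    # 1.4826 makes the MAD consistent with the standard deviation
    # for normally distributed data.
    mad = 1.4826 * statistics.median([abs(v - med) for v in values])
    return [(v - med) / mad for v in values]

def strip_coordinates(values, column, flags, rng, width=0.3):
    """Return (x, y, symbol_size) triples for one variable.

    x is the column position plus a random horizontal offset,
    y is the robustly centred and scaled value, and flagged
    samples get a larger symbol in every column.
    """
    scaled = robust_standardise(values)
    return [
        (column + rng.uniform(-width, width), y, 3.0 if flag else 1.0)
        for y, flag in zip(scaled, flags)
    ]

# Made-up concentrations for illustration only (not the Kola data).
data = {
    "Pb": [12, 15, 14, 200, 13, 16, 2],
    "Zn": [40, 45, 43, 44, 46, 42, 41],
}
# Hypothetical flags from a prior multivariate outlier detection step.
flags = [False, False, False, True, False, False, False]

rng = random.Random(0)  # fixed seed so the horizontal offsets are reproducible
coords = {
    name: strip_coordinates(values, column, flags, rng)
    for column, (name, values) in enumerate(data.items())
}

for name, triples in coords.items():
    for x, y, size in triples:
        print(name, round(x, 2), round(y, 2), size)
```

The resulting coordinates can be passed to any plotting library. In this made-up example the flagged sample is extreme for Pb but lies in the middle of the Zn range, while the very low Pb value is not flagged, which mirrors the behaviour described for Fig. 11.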