《多元统计分析》课程教学资源（阅读材料）Cluster Validation.pdf_大学文库

Combining the results of different algorithms in order to obtain "ro- bust"clusterings [17]. Intregration of already existing clusterings that have been built be- fore but that cannot be reconstructed (e.g.because the algorithms or features that were used to build them are not known)[16]. Many companies store their data not only in one database but the data is geographically distributed.Often,it is unfeasible to transfer all data to one place for performing data analysis there because of the high computational,bandwidth and storage costs.Thus,it is desirable to have methods for combining decentrally performed clusterings to one clustering that represents the whole data. Legal restrictions force the companies to have several copies of their data,each copy with a different feature set (certain features must not be stored together).For cluster analysis,they have to perform feature distributed clustering and afterwards join the clusterings into one "mean value"clustering. In social sciences often arise clustering problems with multiple opti- mization criteria:a typical example is the "second world war politi- cians"problem [19],in which many persons were asked to rate the dissimilarities of second world war politicians.Each person corre- sponds to an optimization criterion.A good clustering of the politi- cians should be as close as possible to all the personal ratings.A common approach to these multiple criteria clustering problems is the calculation of a "mean value"of the single criterion clusterings. These are only some applications in which a"mean value"of multiple clus- terings is needed.For this purpose we need a distance (or similarity)measure for clusterings.This paper gives an overview and some analysis of the mea- sures that we find in the literature. In the second section we introduce the basic definitions and some notations. In section three,four and five we present the measures for comparing clus- terings that have been presented in the literature so far.The subdivision into three sections corresponds to two "natural"divisions of the measures: Even though the measures can all be derived from the confusion matrix, they base on different ideas:counting of pairs of elements(Section 3),sum- mation of set overlaps (Section 4),and the use of the information-theoretical mutual information (Section 5). 2

• Combining the results of different algorithms in order to obtain ”robust” clusterings [17]. • Intregration of already existing clusterings that have been built before but that cannot be reconstructed (e.g. because the algorithms or features that were used to build them are not known)[16]. • Many companies store their data not only in one database but the data is geographically distributed. Often, it is unfeasible to transfer all data to one place for performing data analysis there because of the high computational, bandwidth and storage costs. Thus, it is desirable to have methods for combining decentrally performed clusterings to one clustering that represents the whole data. • Legal restrictions force the companies to have several copies of their data, each copy with a different feature set (certain features must not be stored together). For cluster analysis, they have to perform feature distributed clustering and afterwards join the clusterings into one ”mean value” clustering. • In social sciences often arise clustering problems with multiple optimization criteria: a typical example is the ”second world war politicians” problem [19], in which many persons were asked to rate the dissimilarities of second world war politicians. Each person corresponds to an optimization criterion. A good clustering of the politicians should be as close as possible to all the personal ratings. A common approach to these multiple criteria clustering problems is the calculation of a ”mean value” of the single criterion clusterings. These are only some applications in which a ”mean value” of multiple clusterings is needed. For this purpose we need a distance (or similarity) measure for clusterings. This paper gives an overview and some analysis of the measures that we find in the literature. In the second section we introduce the basic definitions and some notations. In section three, four and five we present the measures for comparing clusterings that have been presented in the literature so far. The subdivision into three sections corresponds to two ”natural” divisions of the measures: Even though the measures can all be derived from the confusion matrix, they base on different ideas: counting of pairs of elements (Section 3), summation of set overlaps (Section 4), and the use of the information-theoretical mutual information (Section 5). 2

However, the division also reflects the chronological development of the measures: the counting pairs measures date from the 1970s and 1980s, the measures based on set overlaps from the 1990s and the information-theoretical measures have been developed in the last years (2002/2003). Till this day, there is no formalization of the problem of comparing clusterings. We think that a set of axioms would be helpful in detecting and defining ”good” measures. As a first step towards such a set of axioms, we give in Section 6 aspects and properties that have to be taken into account. 2 Definitions and notations Let X be a finite set with cardinality |X| = n. A clustering C is a set {C1, . . . , Ck} of non-empty disjoint subsets of X such that their union equals X. The set of all clusterings of X is denoted by P(X). For a clustering C = {C1, . . . , Ck} we assume |Ci | > 0 for all i = 1, . . . , k. A trivial clustering is either the one-clustering that consist of just one cluster or the singleton clustering in which every element forms its own cluster. Let C 0 = {C 0 1 , . . . , C0 ` } ∈ P(X) denote a second clustering of X. The confusion matrix M = (mij ) (or contingency table) of the pair C, C 0 is a k × `- matrix whose ij-th entry equals the number of elements in the intersection of the clusters Ci and C 0 j : mij = |Ci ∩ C 0 j |, 1 ≤ i ≤ k, 1 ≤ j ≤ `. Clustering C 0 is a refinement of C (and C is a coarsening of C 0 ), if each cluster of C 0 is contained in a cluster of C, formally: ∀C 0 j ∈ C0 ∃Ci ∈ C : C 0 j ⊆ Ci . The product C×C0 of two clusterings C, C 0 is the coarsest common refinement of the two clusterings: C × C0 = {Ci ∩ C 0 j | Ci ∈ C, C0 j ∈ C0 , Ci ∩ C 0 j 6= ∅}. The product C × C0 is again a clustering, and if C 0 is a refinement of C, then C × C0 = C 0 . 3 Measures based on counting pairs A very intuitional approach to comparing clusterings is counting pairs of objects that are ”classified” in the same way in both clusterings, i.e. pairs of 3

cation. The most common performance measure for this problem calculates the fraction of correctly classified (respectively misclassified) elements to all elements. For Rand, comparing two clusterings was just a natural extension of this problem which has a corresponding extension of the performance measure: instead of counting single elements he counts correctly classified pairs of elements. Thus, the Rand Index is defined by: R(C, C 0 ) = 2(n11 + n00) n(n − 1) R ranges from 0 (no pair classified in the same way under both clusterings) to 1 (identical clusterings). The value of R depends on both, the number of clusters and the number of elements. Morey and Agresti showed that the Rand Index is highly dependent upon the number of clusters [2]. In [4], Fowlkes and Mallows show that in the (unrealistic) case of independant clusterings the Rand Index converges to 1 as the number of clusters increases which is undesirable for a similarity measure. 3.2.2 Adjusted Rand Index The expected value of the Rand Index of two random partitions does not take a constant value (e.g. zero). Thus, Hubert and Arabie proposed an adjustment [3] which assumes a generalized hypergeometric distribution as null hypothesis: the two clusterings are drawn randomly with a fixed number of clusters and a fixed number of elements in each cluster (the number of clusters in the two clusterings need not be the same). Then the adjusted Rand Index is the (normalized) difference of the Rand Index and its expected value under the null hypothesis. It is defined as follows [6]: Radj (C, C 0 ) = Pk i=1 P` j=1 mij 2 − t3 1 2 (t1 + t2) − t3 where t1 = X k i=1 |Ci | 2 , t2 = X ` j=1 |C 0 j | 2 , t3 = 2t1t2 n(n − 1) This index has expected value zero for independant clusterings and maximum value 1 (for identical clusterings). The significance of this measure has to be put into question because of the strong assumptions it makes on the distribution. Meila [7] notes, that some pairs of clusterings may result in negative index values. 5

3.3 Fowlkes-Mallows Index Fowlkes and Mallows introduced their index as a measure for comparing hierarchical clusterings2 [4].However,it can also be used for flat clusterings since it consists in calculating an index Bi for each level i=2,...,n-1 of the hierarchies in consideration and plotting Bi against i.The measure B is easily generalized to a measure for clusterings with different numbers of clusters.The generalized Fowlkes-Mallows Index is defined by FM(C,C')= ∑1f=1m喝-n n11 V(∑：lCP-n)(∑，lCgP-n)Vm1+no)m1+no In the context of Information Retrieval this measure can be interpreted as the geometric mean of precision (ratio of the number of retrieved relevant docmnsto the total mber of retrieved docmentand recall (ratio of the number of retrieved relevant documents to the total number of relevant documents n11) n11十n01 Like for the adjusted Rand Index,the "amount"of similarity of two clus- terings corresponds to the deviation from the expected value under the null hypothesis of independant clusterings with fixed cluster sizes.Again,the strong assumptions on the distribution make the result hard to interpret. Futhermore,this measure has the undesirable property that for small num- bers of clusters,the value is very high,even for independant clusterings (which even achieve the maximum value for small numbers of clusters).Wal- lace proposed to attenuate this effect by substracting the number of pairs whose match is forced by the cluster overlaps from the number of"good" pairs and from the number of all pairs [9]. 3.4 Mirkin Metric The Mirkin Metric which is also known as Equivalence Mismatch Distance [11]is defined by MC,C=】 It corresponds to the Hamming distance for binary vectors if the set of all pairs of elements is enumerated and a clustering is represented by a 2A hierarchical clustering of a set X is a hierarchy ofXI clusterings,with the two trivial clusterings at the top and bottom,respectively,and each level of the hierarchy is a refinement of all the levels above. 6

3.3 Fowlkes–Mallows Index Fowlkes and Mallows introduced their index as a measure for comparing hierarchical clusterings2 [4]. However, it can also be used for flat clusterings since it consists in calculating an index Bi for each level i = 2, . . . , n − 1 of the hierarchies in consideration and plotting Bi against i. The measure Bi is easily generalized to a measure for clusterings with different numbers of clusters. The generalized Fowlkes–Mallows Index is defined by FM(C, C 0 ) = Pk i=1 P` j=1 m2 ij − n q ( P i |Ci | 2 − n)(P j |C0 j | 2 − n) = p n11 (n11 + n10)(n11 + n01) In the context of Information Retrieval this measure can be interpreted as the geometric mean of precision (ratio of the number of retrieved relevant documents to the total number of retrieved documents = n11 n11+n10 ) and recall (ratio of the number of retrieved relevant documents to the total number of relevant documents = n11 n11+n01 ). Like for the adjusted Rand Index, the ”amount” of similarity of two clusterings corresponds to the deviation from the expected value under the null hypothesis of independant clusterings with fixed cluster sizes. Again, the strong assumptions on the distribution make the result hard to interpret. Futhermore, this measure has the undesirable property that for small numbers of clusters, the value is very high, even for independant clusterings (which even achieve the maximum value for small numbers of clusters). Wallace proposed to attenuate this effect by substracting the number of pairs whose match is forced by the cluster overlaps from the number of ”good” pairs and from the number of all pairs [9]. 3.4 Mirkin Metric The Mirkin Metric which is also known as Equivalence Mismatch Distance [11] is defined by M(C, C 0 ) = X k i=1 |Ci | 2 + X ` j=1 |C 0 j | 2 − 2 X k i=1 X l j=1 m2 ij . It corresponds to the Hamming distance for binary vectors if the set of all pairs of elements is enumerated and a clustering is represented by a 2A hierarchical clustering of a set X is a hierarchy of |X| clusterings, with the two trivial clusterings at the top and bottom, respectively, and each level of the hierarchy is a refinement of all the levels above. 6

3.6 General remarks As mentioned before,the measures presented in this section can all be cal- culated by means of the confusion matrix M(and the cluster sizes);this is either obvious from the formula (e.g.for the Fowlkes-Mallow Index)or can be seen after some transformation (e.g.for the Rand Index,which can be transformed into a variation of the Mirkin Metric,see 3.4). For different reasons,these measures do not seem to be very appealing. Some of them are sensitive to certain parameters(cluster sizes,number of clusters);think of a pair of clusterings with similarity a E 0,1]and replace each element in the underlying set by two elements.Why should the result- ing pair of clusterings have a similarity other than a?This behavior,as well as sensitivity to the number of clusters,are undesirable. Other measures,like the Fowlkes-Mallows Index,suffer from another draw- back:they make use of a very strong null hypothesis,that is,independance of the clusterings,fixed number of clusters,and fixed cluster sizes.When comparing results provided by clustering algorithms these assumptions are apart from the number of clusters that is fixed for some algorithms-vio- lated.None of the algorithms works with fixed cluster sizes.Furthermore, in practice it would be against the intuition to compare two clusterings when assuming that there is no relationship between them.In fact,we compare clusterings because we suppose a certain relationship and we want to know how strong it is [12]. 4 Measures based on set overlaps Another kind of measure tries to match clusters that have a maximum ab- solute or relative overlap.This is also a quite intuitional approach,however the assymetry of some of the measures makes them difficult to use. 4.1 F-Measure The F-Measure has its origin in the field of document clustering where it is used to evaluate the accuracy of a clustering solution.Each cluster of the first clustering is a(predefined)class of documents and each cluster of the second clustering is treated as the result of a query [14,[13].The F- Measure for a cluster C with respect to a certain class Ci indicates how "good"the cluster C describes the class Ci by calculating the harmonic 8

3.6 General remarks As mentioned before, the measures presented in this section can all be calculated by means of the confusion matrix M (and the cluster sizes); this is either obvious from the formula (e.g. for the Fowlkes-Mallow Index) or can be seen after some transformation (e.g. for the Rand Index, which can be transformed into a variation of the Mirkin Metric, see 3.4). For different reasons, these measures do not seem to be very appealing. Some of them are sensitive to certain parameters (cluster sizes, number of clusters); think of a pair of clusterings with similarity α ∈ [0, 1] and replace each element in the underlying set by two elements. Why should the resulting pair of clusterings have a similarity other than α? This behavior, as well as sensitivity to the number of clusters, are undesirable. Other measures, like the Fowlkes-Mallows Index, suffer from another drawback: they make use of a very strong null hypothesis, that is, independance of the clusterings, fixed number of clusters, and fixed cluster sizes. When comparing results provided by clustering algorithms these assumptions are - apart from the number of clusters that is fixed for some algorithms - violated. None of the algorithms works with fixed cluster sizes. Furthermore, in practice it would be against the intuition to compare two clusterings when assuming that there is no relationship between them. In fact, we compare clusterings because we suppose a certain relationship and we want to know how strong it is [12]. 4 Measures based on set overlaps Another kind of measure tries to match clusters that have a maximum absolute or relative overlap. This is also a quite intuitional approach, however the assymetry of some of the measures makes them difficult to use. 4.1 F-Measure The F-Measure has its origin in the field of document clustering where it is used to evaluate the accuracy of a clustering solution. Each cluster of the first clustering is a (predefined) class of documents and each cluster of the second clustering is treated as the result of a query [14], [13]. The FMeasure for a cluster C 0 j with respect to a certain class Ci indicates how ”good” the cluster C 0 j describes the class Ci by calculating the harmonic 8

mean of precision pij = mij |C0 j | and recall rij = mij |Ci| for C 0 j and Ci : F(Ci , C0 j ) = 2 · rij · pij rij + pij = 2|Ci ||C 0 j | |Ci | + |C0 j | The overall F-Measure is then defined as the weighted sum of the maximum F-Measures for the clusters in C 0 : F(C, C 0 ) = F(C 0 ) = 1 n X k i=1 ni ` max j=1 {F(Ci , C0 j )} It can easily be seen that this measure is not symmetric. Thus, this may be an appropriate index for comparing a clustering with an optimal clustering solution. However, in general the optimal solution is not know, which makes an assymetric measure hard to interpret. In [7], Meila claims, that in [13], Larsen uses a variation of this measure which is normalized by the number of clusters instead of the number of elements. She gives an example where this ”Larsen-measure” has a very strange behavior. However, as can be seen in [13], Larsen does not use this measure, but also the F-Measure as defined above. Actually, other authors, as for example Steinbach, Kapyris and Kumar in [15], or Fung in [14], refer to Larsen when introducing the F-Measure. 4.2 Meila-Heckerman- and Maximum-Match-Measure In [8], Meila and Heckerman use another assymetric measure, which they apply to comparing clustering algorithms. For their study, they do not compare the results of the different clustering methods among each other, but they compare each clustering result with an optimal clustering solution (their study is on synthetic data). For this purpose they use the following measure: MH(C, C 0 ) = 1 n X k i=1 max C0 j∈C mij where C is the clustering that is provided by the algorithm and C 0 is the optimal clustering. As for the preceding measure, its assymetry makes it inappropriate for the general task of comparing clusterings. However, it can be generalized to the symmetric Maximum-Match-Measure M(C, C 0 ) which can be determined as follows: look for the largest entry mab of the confusion matrix M and match the corresponding clusters Ca and C 0 b (this is the cluster pair with the largest (absolute) overlap). Afterwards cross out the a-th row 9