Comparing Clusterings - An Overview∗

Silke Wagner    Dorothea Wagner

January 12, 2007

1 Introduction

As the amount of data we nowadays have to deal with becomes larger and larger, the methods that help us to detect structures in the data and to identify interesting subsets become more and more important. One of these methods is clustering, i.e. segmenting a set of elements into subsets such that the elements in each subset are somehow "similar" to each other and elements of different subsets are "dissimilar". In the literature we find a large variety of clustering algorithms, each having certain advantages but also certain drawbacks. Typical questions that arise in this context include:

• Is the algorithm sensitive to small perturbations, i.e. can small changes in the data (so-called "noise") entail large changes in the clustering?

• Is the algorithm sensitive to the order of the data, i.e. can another order of the data result in a very different clustering?

• How similar are the solutions of two different algorithms?

• If an optimal solution is available: how close is the clustering solution to the optimal one?

For examining these aspects, it would be desirable to have a "measure" for the similarity between two clusterings or for their distance¹. In a more general context, it can be necessary to combine different clusterings into a single one, i.e. to calculate a "mean value" of the clusterings. Possible applications are:

∗The authors gratefully acknowledge financial support from the European Commission within DELIS (contract no. 001907).

¹Every similarity measure can be transformed into a distance measure and vice versa.
• Combining the results of different algorithms in order to obtain "robust" clusterings [17].

• Integration of already existing clusterings that have been built before but that cannot be reconstructed (e.g. because the algorithms or features that were used to build them are not known) [16].

• Many companies store their data not in a single database; rather, the data is geographically distributed. Often it is infeasible to transfer all data to one place for performing data analysis there, because of the high computational, bandwidth and storage costs. Thus, it is desirable to have methods for combining decentrally performed clusterings into one clustering that represents the whole data.

• Legal restrictions may force companies to keep several copies of their data, each copy with a different feature set (certain features must not be stored together). For cluster analysis, they have to perform feature-distributed clustering and afterwards join the clusterings into one "mean value" clustering.

• In the social sciences, clustering problems with multiple optimization criteria often arise: a typical example is the "second world war politicians" problem [19], in which many persons were asked to rate the dissimilarities of Second World War politicians. Each person corresponds to an optimization criterion. A good clustering of the politicians should be as close as possible to all the personal ratings. A common approach to these multiple-criteria clustering problems is the calculation of a "mean value" of the single-criterion clusterings.

These are only some applications in which a "mean value" of multiple clusterings is needed. For this purpose we need a distance (or similarity) measure for clusterings. This paper gives an overview and some analysis of the measures that we find in the literature.

In the second section we introduce the basic definitions and some notations.
In Sections 3, 4 and 5 we present the measures for comparing clusterings that have been proposed in the literature so far. The subdivision into three sections corresponds to two "natural" divisions of the measures: even though the measures can all be derived from the confusion matrix, they are based on different ideas: counting pairs of elements (Section 3), summation of set overlaps (Section 4), and the use of the information-theoretical mutual information (Section 5).
However, the division also reflects the chronological development of the measures: the counting-pairs measures date from the 1970s and 1980s, the measures based on set overlaps from the 1990s, and the information-theoretical measures have been developed in recent years (2002/2003).

To this day, there is no formalization of the problem of comparing clusterings. We think that a set of axioms would be helpful in detecting and defining "good" measures. As a first step towards such a set of axioms, we give in Section 6 aspects and properties that have to be taken into account.

2 Definitions and notations

Let X be a finite set with cardinality |X| = n. A clustering C is a set {C_1, ..., C_k} of non-empty disjoint subsets of X such that their union equals X. The set of all clusterings of X is denoted by P(X). For a clustering C = {C_1, ..., C_k} we assume |C_i| > 0 for all i = 1, ..., k. A trivial clustering is either the one-clustering that consists of just one cluster or the singleton clustering in which every element forms its own cluster.

Let C' = {C'_1, ..., C'_ℓ} ∈ P(X) denote a second clustering of X. The confusion matrix M = (m_ij) (or contingency table) of the pair C, C' is a k × ℓ matrix whose ij-th entry equals the number of elements in the intersection of the clusters C_i and C'_j:

    m_ij = |C_i ∩ C'_j|,  1 ≤ i ≤ k, 1 ≤ j ≤ ℓ.

Clustering C' is a refinement of C (and C is a coarsening of C') if each cluster of C' is contained in a cluster of C, formally:

    ∀ C'_j ∈ C' ∃ C_i ∈ C : C'_j ⊆ C_i.

The product C × C' of two clusterings C, C' is the coarsest common refinement of the two clusterings:

    C × C' = {C_i ∩ C'_j | C_i ∈ C, C'_j ∈ C', C_i ∩ C'_j ≠ ∅}.

The product C × C' is again a clustering, and if C' is a refinement of C, then C × C' = C'.

3 Measures based on counting pairs

A very intuitive approach to comparing clusterings is counting pairs of objects that are "classified" in the same way in both clusterings, i.e. pairs of
elements of X that are in the same cluster (in different clusters, respectively) under both clusterings. The set of all (unordered) pairs of elements of X is the disjoint union of the following sets:

    S_11 = {pairs that are in the same cluster under both C and C'},
    S_00 = {pairs that are in different clusters under both C and C'},
    S_10 = {pairs that are in the same cluster under C but in different ones under C'},
    S_01 = {pairs that are in different clusters under C but in the same under C'}.

Let n_ab := |S_ab|, a, b ∈ {0, 1}, denote the respective sizes. We have

    n_11 + n_00 + n_10 + n_01 = n(n−1)/2.

3.1 Chi Squared Coefficient

The oldest measures for comparing clusterings were originally developed for statistical purposes. We present the Chi Squared Coefficient as a representative, since it is one of the best-known measures of this kind. It is defined as

    χ(C, C') = ∑_{i=1}^{k} ∑_{j=1}^{ℓ} (m_ij − E_ij)² / E_ij,   where E_ij = |C_i| |C'_j| / n.

As can be seen in [20], there are several variations of this measure. Originally, it was suggested in 1900 by Pearson for testing independence in a bivariate distribution, not for evaluating association (which, in the context of clustering, corresponds to evaluating similarity). The problem in transferring such a measure to the purpose of comparing clusterings lies in the fact that we have to assume independence of the two clusterings. In general, this is not true, and therefore the result of a comparison with such a measure has to be challenged (see Sect. 3.6).

3.2 Rand Index

3.2.1 General Rand Index

Rand's Index [1] was motivated by standard classification problems in which the result of a classification scheme has to be compared to a correct classification. The most common performance measure for this problem calculates the fraction of correctly classified (respectively misclassified) elements among all elements. For Rand, comparing two clusterings was just a natural extension of this problem, with a corresponding extension of the performance measure: instead of counting single elements, he counts correctly classified pairs of elements. Thus, the Rand Index is defined by

    R(C, C') = 2(n_11 + n_00) / (n(n−1)).

R ranges from 0 (no pair classified in the same way under both clusterings) to 1 (identical clusterings). The value of R depends on both the number of clusters and the number of elements. Morey and Agresti showed that the Rand Index is highly dependent upon the number of clusters [2]. In [4], Fowlkes and Mallows show that in the (unrealistic) case of independent clusterings the Rand Index converges to 1 as the number of clusters increases, which is undesirable for a similarity measure.

3.2.2 Adjusted Rand Index

The expected value of the Rand Index of two random partitions does not take a constant value (e.g. zero). Thus, Hubert and Arabie proposed an adjustment [3] which assumes a generalized hypergeometric distribution as the null hypothesis: the two clusterings are drawn randomly with a fixed number of clusters and a fixed number of elements in each cluster (the number of clusters in the two clusterings need not be the same). The adjusted Rand Index is then the (normalized) difference of the Rand Index and its expected value under the null hypothesis. It is defined as follows [6]:

    R_adj(C, C') = ( ∑_{i=1}^{k} ∑_{j=1}^{ℓ} m_ij(m_ij − 1)/2 − t_3 ) / ( (t_1 + t_2)/2 − t_3 ),

where

    t_1 = ∑_{i=1}^{k} |C_i|(|C_i| − 1)/2,   t_2 = ∑_{j=1}^{ℓ} |C'_j|(|C'_j| − 1)/2,   t_3 = 2 t_1 t_2 / (n(n−1)).

This index has expected value zero for independent clusterings and maximum value 1 (for identical clusterings). The significance of this measure has to be questioned because of the strong assumptions it makes on the distribution.
Meila [7] notes that some pairs of clusterings may result in negative index values.
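The Rand and adjusted Rand indices can be computed directly from the confusion matrix, using n_11 = ∑_ij m_ij(m_ij − 1)/2 together with the row and column sums. The following minimal Python sketch illustrates this; the function names are ours, not from the paper, and clusterings are represented as lists of disjoint sets:

```python
from math import comb

def confusion_matrix(c1, c2):
    # m[i][j] = |C_i ∩ C'_j|; clusterings are given as lists of disjoint sets
    return [[len(ci & cj) for cj in c2] for ci in c1]

def pair_counts(c1, c2):
    # Derive (n11, n00, n10, n01) from the confusion matrix: n11 counts pairs
    # inside a single cell m_ij, n10/n01 subtract n11 from the within-cluster
    # pairs of C and C', and n00 is the remainder of all C(n, 2) pairs.
    n = sum(len(ci) for ci in c1)
    m = confusion_matrix(c1, c2)
    n11 = sum(comb(mij, 2) for row in m for mij in row)
    n10 = sum(comb(len(ci), 2) for ci in c1) - n11
    n01 = sum(comb(len(cj), 2) for cj in c2) - n11
    n00 = comb(n, 2) - n11 - n10 - n01
    return n11, n00, n10, n01

def rand_index(c1, c2):
    n = sum(len(ci) for ci in c1)
    n11, n00, _, _ = pair_counts(c1, c2)
    return 2 * (n11 + n00) / (n * (n - 1))

def adjusted_rand(c1, c2):
    # (index - expected index) / (max index - expected index); assumes the
    # clusterings are not both trivial (otherwise the denominator is zero)
    n = sum(len(ci) for ci in c1)
    m = confusion_matrix(c1, c2)
    index = sum(comb(mij, 2) for row in m for mij in row)
    t1 = sum(comb(len(ci), 2) for ci in c1)
    t2 = sum(comb(len(cj), 2) for cj in c2)
    t3 = 2 * t1 * t2 / (n * (n - 1))
    return (index - t3) / (0.5 * (t1 + t2) - t3)
```

For example, `rand_index([{1,2},{3,4,5}], [{1,2},{3,4},{5}])` gives 0.8, while identical clusterings give the maximum value 1 under both indices.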
3.3 Fowlkes–Mallows Index

Fowlkes and Mallows introduced their index as a measure for comparing hierarchical clusterings² [4]. However, it can also be used for flat clusterings, since it consists in calculating an index B_i for each level i = 2, ..., n−1 of the hierarchies in consideration and plotting B_i against i. The measure B_i is easily generalized to a measure for clusterings with different numbers of clusters. The generalized Fowlkes–Mallows Index is defined by

    FM(C, C') = ( ∑_{i=1}^{k} ∑_{j=1}^{ℓ} m_ij² − n ) / √( (∑_i |C_i|² − n)(∑_j |C'_j|² − n) ) = n_11 / √( (n_11 + n_10)(n_11 + n_01) ).

In the context of Information Retrieval this measure can be interpreted as the geometric mean of precision (the ratio of the number of retrieved relevant documents to the total number of retrieved documents, = n_11/(n_11 + n_10)) and recall (the ratio of the number of retrieved relevant documents to the total number of relevant documents, = n_11/(n_11 + n_01)).

As for the adjusted Rand Index, the "amount" of similarity of two clusterings corresponds to the deviation from the expected value under the null hypothesis of independent clusterings with fixed cluster sizes. Again, the strong assumptions on the distribution make the result hard to interpret. Furthermore, this measure has the undesirable property that for small numbers of clusters the value is very high, even for independent clusterings (which even achieve the maximum value for small numbers of clusters). Wallace proposed to attenuate this effect by subtracting the number of pairs whose match is forced by the cluster overlaps from the number of "good" pairs and from the number of all pairs [9].

3.4 Mirkin Metric

The Mirkin Metric, which is also known as the Equivalence Mismatch Distance [11], is defined by

    M(C, C') = ∑_{i=1}^{k} |C_i|² + ∑_{j=1}^{ℓ} |C'_j|² − 2 ∑_{i=1}^{k} ∑_{j=1}^{ℓ} m_ij².
It corresponds to the Hamming distance for binary vectors if the set of all pairs of elements is enumerated and a clustering is represented by a binary vector defined on this enumeration. An advantage is the fact that this distance is a metric on P(X). However, this measure is very sensitive to cluster sizes, such that two clusterings that are "at right angles" to each other (i.e. each cluster in one clustering contains the same amount of elements of each of the clusters of the other clustering) are closer to each other than two clusterings for which one is a refinement of the other [11]. The Mirkin Metric is a variation of the Rand Index [7], since it can be rewritten as

    M(C, C') = 2(n_01 + n_10) = n(n−1)(1 − R(C, C')).

3.5 Other measures

3.5.1 Jaccard Index

The Jaccard Index is very common in geology and ecology, e.g. for measuring the species diversity between two different communities [10]. It is very similar to the Rand Index; however, it disregards the pairs of elements that are in different clusters under both clusterings. It is defined as follows:

    J(C, C') = n_11 / (n_11 + n_10 + n_01).

3.5.2 Partition Difference

The Partition Difference [19] simply counts the pairs of elements that belong to different clusters under both clusterings:

    PD(C, C') = n_00.

According to [19], this measure is commonly used. In our opinion, it has too many drawbacks and should therefore not be used: the measure is meant to express a distance, but it is not a distance in the mathematical sense, since it fulfills neither the identity of indiscernibles (we can have PD(C, C') = 0 but C ≠ C', e.g. for the trivial one-clustering C and an arbitrary clustering C' ≠ C) nor the triangle inequality (take two arbitrary non-trivial clusterings C ≠ C' and the trivial one-clustering C''; then PD(C, C'') + PD(C'', C') = 0 < PD(C, C')). The measure is sensitive to cluster sizes and the number of clusters, and since it is not normalized, the values are hard to interpret (what does a distance of 5 mean when we do not know the total number of pairs of elements?).

²A hierarchical clustering of a set X is a hierarchy of |X| clusterings, with the two trivial clusterings at the top and bottom, respectively, and each level of the hierarchy being a refinement of all the levels above.
3.6 General remarks

As mentioned before, the measures presented in this section can all be calculated by means of the confusion matrix M (and the cluster sizes); this is either obvious from the formula (e.g. for the Fowlkes–Mallows Index) or can be seen after some transformation (e.g. for the Rand Index, which can be transformed into a variation of the Mirkin Metric, see 3.4).

For different reasons, these measures do not seem to be very appealing. Some of them are sensitive to certain parameters (cluster sizes, number of clusters); think of a pair of clusterings with similarity α ∈ [0, 1] and replace each element in the underlying set by two elements. Why should the resulting pair of clusterings have a similarity other than α? This behavior, as well as sensitivity to the number of clusters, is undesirable.

Other measures, like the Fowlkes–Mallows Index, suffer from another drawback: they make use of a very strong null hypothesis, that is, independence of the clusterings, a fixed number of clusters, and fixed cluster sizes. When comparing results provided by clustering algorithms, these assumptions are violated (apart from the number of clusters, which is fixed for some algorithms). None of the algorithms works with fixed cluster sizes. Furthermore, in practice it would be against intuition to compare two clusterings under the assumption that there is no relationship between them. In fact, we compare clusterings because we suppose a certain relationship, and we want to know how strong it is [12].

4 Measures based on set overlaps

Another kind of measure tries to match clusters that have a maximum absolute or relative overlap. This is also a quite intuitive approach; however, the asymmetry of some of the measures makes them difficult to use.

4.1 F-Measure

The F-Measure has its origin in the field of document clustering, where it is used to evaluate the accuracy of a clustering solution.
Each cluster of the first clustering is a (predefined) class of documents and each cluster of the second clustering is treated as the result of a query [14], [13]. The F-Measure for a cluster C'_j with respect to a certain class C_i indicates how "good" the cluster C'_j describes the class C_i by calculating the harmonic
mean of precision p_ij = m_ij / |C'_j| and recall r_ij = m_ij / |C_i| for C'_j and C_i:

    F(C_i, C'_j) = 2 · r_ij · p_ij / (r_ij + p_ij) = 2 · m_ij / (|C_i| + |C'_j|)

The overall F-Measure is then defined as the weighted sum of the maximum F-Measures for the clusters in C', with n_i = |C_i|:

    F(C, C') = F(C') = (1/n) · Σ_{i=1}^{k} n_i · max_{j=1,…,ℓ} F(C_i, C'_j)

It can easily be seen that this measure is not symmetric. Thus, it may be an appropriate index for comparing a clustering with an optimal clustering solution. However, in general the optimal solution is not known, which makes an asymmetric measure hard to interpret.

In [7], Meila claims that, in [13], Larsen uses a variation of this measure that is normalized by the number of clusters instead of the number of elements. She gives an example in which this "Larsen measure" behaves very strangely. However, as can be seen in [13], Larsen does not use that variation but the F-Measure as defined above. In fact, other authors, for example Steinbach, Karypis and Kumar in [15] and Fung in [14], refer to Larsen when introducing the F-Measure.

4.2 Meila-Heckerman- and Maximum-Match-Measure

In [8], Meila and Heckerman use another asymmetric measure, which they apply to the comparison of clustering algorithms. In their study, they do not compare the results of the different clustering methods with each other; instead, they compare each clustering result with an optimal clustering solution (their study is on synthetic data). For this purpose they use the following measure:

    MH(C, C') = (1/n) · Σ_{i=1}^{k} max_{C'_j ∈ C'} m_ij

where C is the clustering provided by the algorithm and C' is the optimal clustering. As for the preceding measure, its asymmetry makes it inappropriate for the general task of comparing clusterings.
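Both asymmetric measures are straightforward to compute from the confusion matrix m_ij = |C_i ∩ C'_j|. The following Python sketch is purely illustrative (it is not code from [8] or [13]); it assumes each clustering is given as a list of disjoint, non-empty sets over the same element set:

```python
def confusion_matrix(C, Cp):
    """m[i][j] = |C_i ∩ C'_j| for clusterings given as lists of sets."""
    return [[len(Ci & Cj) for Cj in Cp] for Ci in C]

def f_measure(C, Cp):
    """Overall F-Measure F(C, C'): for each class C_i take the best
    per-cluster F-value 2*m_ij / (|C_i| + |C'_j|), weight it by n_i = |C_i|,
    and normalize by the total number of elements n."""
    m = confusion_matrix(C, Cp)
    n = sum(len(Ci) for Ci in C)
    total = 0.0
    for i, Ci in enumerate(C):
        best = max(2 * m[i][j] / (len(Ci) + len(Cj))  # = 2*r*p / (r + p)
                   for j, Cj in enumerate(Cp))
        total += len(Ci) * best
    return total / n

def meila_heckerman(C, Cp):
    """MH(C, C') = (1/n) * sum over i of max_j m_ij."""
    m = confusion_matrix(C, Cp)
    n = sum(len(Ci) for Ci in C)
    return sum(max(row) for row in m) / n
```

Swapping the two arguments of f_measure generally changes the result, which makes the asymmetry discussed above concrete.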
However, it can be generalized to the symmetric Maximum-Match-Measure MM(C, C'), which can be determined as follows: look for the largest entry m_ab of the confusion matrix M and match the corresponding clusters C_a and C'_b (this is the cluster pair with the largest (absolute) overlap). Afterwards, cross out the a-th row
and the b-th column and repeat this step (searching for the maximum entry, matching the corresponding clusters, and deleting the corresponding row and column) until the matrix has size 0. Afterwards, sum up the matches and divide by the total number of elements:

    MM(C, C') = (1/n) · Σ_{i=1}^{min{k,ℓ}} m_{i,i'}

where i' is the index of the cluster in C' that is matched to cluster C_i ∈ C. Note that in the case k ≠ ℓ, this measure completely disregards the |k − ℓ| "remaining" clusters in the clustering with the higher cardinality.

4.3 Van Dongen-Measure

In [11], van Dongen introduces a symmetric measure that is also based on maximum intersections of clusters. It is defined as follows:

    D(C, C') = 2n − Σ_{i=1}^{k} max_j m_ij − Σ_{j=1}^{ℓ} max_i m_ij

This measure has the nice property of being a metric on the space of all clusterings of the underlying set X. However, it ignores the parts of the clusters outside the intersections (see 4.4).

4.4 General remarks

The preceding measures have the common property of taking only the overlaps into account. They completely disregard the unmatched parts of the clusters (or even complete clusters, as the Maximum-Match-Measure does). In [7], Meila presents a telling example that points out the negative effect of this behavior of a measure: take a clustering C with k clusters of equal size and derive two variations C' and C'' as follows: C' is obtained from C by shifting a fraction α of the elements in each cluster C_i to the "next" cluster C_{i+1 mod k}. The clustering C'' is obtained from C by redistributing a fraction α of the elements in each cluster C_i evenly among the other clusters. If α < 0.5, then F(C, C') = F(C, C''), MH(C, C') = MH(C, C''), MM(C, C') = MM(C, C'') and D(C, C') = D(C, C''), which means that, for all these measures, C'' is as similar to C as C'. This contradicts our intuition that C' is a less modified version of C than C'' and is therefore undesirable.
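The greedy matching behind the Maximum-Match-Measure and the van Dongen measure can likewise be sketched in a few lines of Python. As before, this is only an illustration (not code from [11]), under the assumption that clusterings are lists of disjoint sets:

```python
def confusion_matrix(C, Cp):
    """m[i][j] = |C_i ∩ C'_j| for clusterings given as lists of sets."""
    return [[len(Ci & Cj) for Cj in Cp] for Ci in C]

def maximum_match(C, Cp):
    """MM(C, C'): repeatedly match the still-unmatched cluster pair with
    the largest overlap, remove its row and column, sum the matched
    overlaps over min{k, l} rounds, and divide by n."""
    m = confusion_matrix(C, Cp)
    n = sum(len(Ci) for Ci in C)
    rows, cols = set(range(len(C))), set(range(len(Cp)))
    total = 0
    while rows and cols:
        a, b = max(((i, j) for i in rows for j in cols),
                   key=lambda ij: m[ij[0]][ij[1]])
        total += m[a][b]
        rows.discard(a)
        cols.discard(b)
    return total / n

def van_dongen(C, Cp):
    """D(C, C') = 2n - sum_i max_j m_ij - sum_j max_i m_ij."""
    m = confusion_matrix(C, Cp)
    n = sum(len(Ci) for Ci in C)
    row_part = sum(max(row) for row in m)
    col_part = sum(max(m[i][j] for i in range(len(C))) for j in range(len(Cp)))
    return 2 * n - row_part - col_part
```

Running these on an instance of the example from Section 4.4 (k = 3 clusters of eight elements each, α = 0.25) confirms that MM and D each assign C' and C'' the same value.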