正在加载图片...
Comparing Clusterings-An Overview Silke Wagner Dorothea Wagner January 12,2007 1 Introduction As the amount of data we nowadays have to deal with becomes larger and larger,the methods that help us to detect structures in the data and to identify interesting subsets in the data become more and more important. One of these methods is clustering,i.e.segmenting a set of elements into subsets such that the elements in each subset are somehow "similiar"to each other and elements of different subsets are "unsimilar".In the literature we can find a large variety of clustering algorithms,each having certain advantages but also certain drawbacks.Typical questions that arise in this context comprise: Is the algorithm sensitive to small perturbations,i.e.can small changes in the data (so-called "noise")entail large changes in the clustering? Is the algorithm sensitive to the order of the data,i.e.can another order of the data result in a very different clustering? How similar are the solutions of two different algorithms? If an optimal solution is available:How close is the clustering solution to the optimal one? For examining these aspects,it would be desirable to have a "measure"for the similarity between two clusterings or for their distance'.In a more gen- eral context,it can be necessary to combine different clusterings to a single one,i.e.calculating a"mean value"of the clusterings.Possible applications are: *The authors gratefully acknowledge financial support from the European Commission within DELIS (contract no.001907). Every similarity measure can be transformed into a distance measure and vice versa. 1Comparing Clusterings - An Overview ∗ Silke Wagner Dorothea Wagner January 12, 2007 1 Introduction As the amount of data we nowadays have to deal with becomes larger and larger, the methods that help us to detect structures in the data and to identify interesting subsets in the data become more and more important. One of these methods is clustering, i.e. segmenting a set of elements into subsets such that the elements in each subset are somehow ”similiar” to each other and elements of different subsets are ”unsimilar”. In the literature we can find a large variety of clustering algorithms, each having certain advantages but also certain drawbacks. Typical questions that arise in this context comprise: • Is the algorithm sensitive to small perturbations, i.e. can small changes in the data (so-called ”noise”) entail large changes in the clustering? • Is the algorithm sensitive to the order of the data, i.e. can another order of the data result in a very different clustering? • How similar are the solutions of two different algorithms? • If an optimal solution is available: How close is the clustering solution to the optimal one? For examining these aspects, it would be desirable to have a ”measure” for the similarity between two clusterings or for their distance1 . In a more gen￾eral context, it can be necessary to combine different clusterings to a single one, i.e. calculating a ”mean value” of the clusterings. Possible applications are: ∗The authors gratefully acknowledge financial support from the European Commission within DELIS (contract no. 001907). 1Every similarity measure can be transformed into a distance measure and vice versa. 1
向下翻页>>
©2008-现在 cucdc.com 高等教育资讯网 版权所有