Data Clustering • 265

CONTENTS

1. Introduction
   1.1 Motivation
   1.2 Components of a Clustering Task
   1.3 The User's Dilemma and the Role of Expertise
   1.4 History
   1.5 Outline
2. Definitions and Notation
3. Pattern Representation, Feature Selection and Extraction
4. Similarity Measures
5. Clustering Techniques
   5.1 Hierarchical Clustering Algorithms
   5.2 Partitional Algorithms
   5.3 Mixture-Resolving and Mode-Seeking Algorithms
   5.4 Nearest Neighbor Clustering
   5.5 Fuzzy Clustering
   5.6 Representation of Clusters
   5.7 Artificial Neural Networks for Clustering
   5.8 Evolutionary Approaches for Clustering
   5.9 Search-Based Approaches
   5.10 A Comparison of Techniques
   5.11 Incorporating Domain Constraints in Clustering
   5.12 Clustering Large Data Sets
6. Applications
   6.1 Image Segmentation Using Clustering
   6.2 Object and Character Recognition
   6.3 Information Retrieval
   6.4 Data Mining
7. Summary
ACM Computing Surveys, Vol. 31, No. 3, September 1999

1. INTRODUCTION

1.1 Motivation

Data analysis underlies many computing applications, either in a design phase or as part of their on-line operations.
Data analysis procedures can be dichotomized as either exploratory or confirmatory, based on the availability of appropriate models for the data source, but a key element in both types of procedures (whether for hypothesis formation or decision-making) is the grouping, or classification of measurements based on either (i) goodness-of-fit to a postulated model, or (ii) natural groupings (clustering) revealed through analysis. Cluster analysis is the organization of a collection of patterns (usually represented as a vector of measurements, or a point in a multidimensional space) into clusters based on similarity. Intuitively, patterns within a valid cluster are more similar to each other than they are to a pattern belonging to a different cluster. An example of clustering is depicted in Figure 1. The input patterns are shown in Figure 1(a), and the desired clusters are shown in Figure 1(b). Here, points belonging to the same cluster are given the same label. The variety of techniques for representing data, measuring proximity (similarity) between data elements, and grouping data elements has produced a rich and often confusing assortment of clustering methods.

It is important to understand the difference between clustering (unsupervised classification) and discriminant analysis (supervised classification). In supervised classification, we are provided with a collection of labeled (preclassified) patterns; the problem is to label a newly encountered, yet unlabeled, pattern. Typically, the given labeled (training) patterns are used to learn the descriptions of classes which in turn are used to label a new pattern. In the case of clustering, the problem is to group a given collection of unlabeled patterns into meaningful clusters. In a sense, labels are associated with clusters also, but these category labels are data driven; that is, they are obtained solely from the data.
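
The contrast between supervised classification and clustering can be made concrete with a small sketch. The code below is an illustration only, not a method taken from the text: the function names (`classify`, `cluster`), the sample points, the nearest-centroid rule for the supervised case, and the simple k-means-style loop (initialized with the first k points) for the unsupervised case are all assumptions chosen for brevity.

```python
from math import dist  # Euclidean distance (Python 3.8+)

def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def nearest(point, centers):
    """Index of the center closest to `point`."""
    return min(range(len(centers)), key=lambda i: dist(point, centers[i]))

def classify(labeled, new_point):
    """Supervised: learn one centroid per given class label, then label a new pattern."""
    classes = sorted({label for _, label in labeled})
    centers = [centroid([p for p, l in labeled if l == c]) for c in classes]
    return classes[nearest(new_point, centers)]

def cluster(points, k, iterations=10):
    """Unsupervised: group unlabeled patterns; labels 0..k-1 come solely from the data."""
    centers = list(points[:k])  # naive initialization (an assumption of this sketch)
    for _ in range(iterations):
        labels = [nearest(p, centers) for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = centroid(members)
    return [nearest(p, centers) for p in points]

# Supervised: the training patterns arrive with class labels attached.
training = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.0), "B"), ((5.1, 4.9), "B")]
predicted = classify(training, (4.7, 5.3))

# Unsupervised: the same kind of patterns, but no labels are given.
patterns = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.1, 4.9), (0.9, 1.1), (4.8, 5.2)]
labels = cluster(patterns, k=2)
```

In the supervised call the label names "A" and "B" are supplied with the training patterns, whereas in the clustering call the integer labels are arbitrary identifiers produced by the grouping itself, which is the sense in which cluster labels are data driven.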

Clustering is useful in several exploratory pattern-analysis, grouping, decision-making, and machine-learning situations, including data mining, document retrieval, image segmentation, and pattern classification. However, in many such problems, there is little prior information (e.g., statistical models) available about the data, and the decision-maker must make as few assumptions about the data as possible. It is under these restrictions that clustering methodology is particularly appropriate for the exploration of interrelationships among the data points to make an assessment (perhaps preliminary) of their structure.

The term “clustering” is used in several research communities to describe