Chapter 8. Cluster Analysis: Basic Concepts and Methods
◼ Cluster Analysis: Basic Concepts
◼ Partitioning Methods
◼ Hierarchical Methods
◼ Density-Based Methods
◼ Grid-Based Methods
◼ Evaluation of Clustering
◼ Summary

What is Cluster Analysis?
◼ Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
[Figure: intra-cluster distances are minimized; inter-cluster distances are maximized]

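The idea of minimizing intra-cluster distances while maximizing inter-cluster distances can be illustrated with a small sketch. It assumes Euclidean distance and uses made-up 2-D points. The helper name `mean_intra_inter` and both labelings are hypothetical and are not part of the slides.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def _mean(values):
    """Arithmetic mean, or NaN for an empty list."""
    return sum(values) / len(values) if values else float("nan")


def mean_intra_inter(points, labels):
    """Return (mean intra-cluster distance, mean inter-cluster distance).

    Every unordered pair of points is counted once: pairs sharing a label
    contribute to the intra-cluster mean, all other pairs to the
    inter-cluster mean.
    """
    intra, inter = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        (intra if lp == lq else inter).append(dist(p, q))
    return _mean(intra), _mean(inter)


# Toy data (assumed for illustration): two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
good_labels = [0, 0, 0, 1, 1, 1]  # follows the visible group structure
bad_labels = [0, 1, 0, 1, 0, 1]   # mixes the two groups

good_intra, good_inter = mean_intra_inter(points, good_labels)
bad_intra, bad_inter = mean_intra_inter(points, bad_labels)
print(f"good: intra={good_intra:.2f} inter={good_inter:.2f}")
print(f"bad:  intra={bad_intra:.2f} inter={bad_inter:.2f}")
```

On this toy data, the labeling that follows the group structure has the smaller intra-cluster mean and the larger inter-cluster mean, which is the behavior the slide describes.
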
What is Cluster Analysis?
◼ Cluster: a collection of data objects that are
  ◼ similar (or related) to one another within the same group
  ◼ dissimilar (or unrelated) to the objects in other groups
◼ Cluster analysis (or clustering, data segmentation, …)
  ◼ Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
◼ Unsupervised learning: no predefined classes (learning by observation, as opposed to supervised learning by examples)
◼ Typical applications
  ◼ As a stand-alone tool to get insight into data distribution
  ◼ As a preprocessing step for other algorithms

Clustering for Data Understanding and Applications
◼ Biology: taxonomy of living things (kingdom, phylum, class, order, family, genus, and species)
◼ Information retrieval: document clustering
◼ Land use: identification of areas of similar land use in an earth observation database
◼ Marketing: helping marketers discover distinct groups in their customer bases, and then using this knowledge to develop targeted marketing programs
◼ City planning: identifying groups of houses according to their house type, value, and geographical location
◼ Earthquake studies: observed earthquake epicenters should be clustered along continental faults
◼ Climate: understanding Earth's climate by finding patterns in the atmosphere and ocean
◼ Economic science: market research

Clustering as a Preprocessing Tool (Utility)
◼ Summarization
  ◼ Preprocessing for regression, PCA, classification, and association analysis
◼ Compression
  ◼ Image processing: vector quantization
◼ Finding k-nearest neighbors
  ◼ Localizing search to one or a small number of clusters
◼ Outlier detection
  ◼ Outliers are often viewed as those “far away” from any cluster

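Two of the uses above, vector quantization and outlier detection, both reduce to finding each point's nearest cluster representative. The following is a minimal sketch under stated assumptions: the centroids are taken as already computed by some clustering method, and the distance is Euclidean. The function names, the toy data, and the `threshold` value are hypothetical illustrations, not values from the slides.

```python
from math import dist  # Euclidean distance, Python 3.8+


def nearest_centroid(point, centroids):
    """Return (index, distance) of the centroid closest to `point`."""
    distances = [dist(point, c) for c in centroids]
    idx = min(range(len(centroids)), key=distances.__getitem__)
    return idx, distances[idx]


def flag_outliers(points, centroids, threshold):
    """Return the points lying farther than `threshold` from every centroid."""
    return [p for p in points
            if nearest_centroid(p, centroids)[1] > threshold]


# Assumed inputs: centroids from an earlier clustering run, plus new points.
centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.5, -0.2), (9.7, 10.4), (5.0, 4.0), (0.1, 0.3)]

# Vector quantization: replace each point by the index of its nearest centroid.
codes = [nearest_centroid(p, centroids)[0] for p in points]

# Outlier detection: a point is "far away" if no centroid is within the threshold.
outliers = flag_outliers(points, centroids, threshold=2.0)
print(codes, outliers)
```

The threshold is application-specific. In practice it might be derived from the spread of each cluster rather than fixed globally, which this sketch does not attempt.
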
Applications of Cluster Analysis
◼ Understanding
  ◼ Group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations
◼ Summarization
  ◼ Reduce the size of large data sets
[Figure: Clustering precipitation in Australia]

Clustering: Rich Applications and Multidisciplinary Efforts
◼ Pattern recognition
◼ Spatial data analysis
  ◼ Create thematic maps in GIS by clustering feature spaces
  ◼ Detect spatial clusters, or support other spatial mining tasks
◼ Image processing
◼ Economic science (especially market research)
◼ WWW
  ◼ Document classification
  ◼ Cluster Weblog data to discover groups of similar access patterns

Quality: What Is Good Clustering?
◼ A good clustering method will produce high-quality clusters
  ◼ high intra-class similarity: cohesive within clusters
  ◼ low inter-class similarity: distinctive between clusters
◼ The quality of a clustering method depends on
  ◼ the similarity measure used by the method,
  ◼ its implementation, and
  ◼ its ability to discover some or all of the hidden patterns

What is not Cluster Analysis?
◼ Supervised classification
  ◼ Has class label information
◼ Simple segmentation
  ◼ Dividing students into different registration groups alphabetically, by last name
◼ Results of a query
  ◼ Groupings are a result of an external specification
◼ Graph partitioning
  ◼ Some mutual relevance and synergy, but the areas are not identical

Measure the Quality of Clustering
◼ Dissimilarity/similarity metric
  ◼ Similarity is expressed in terms of a distance function, typically a metric: d(i, j)
  ◼ The definitions of distance functions are usually rather different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables
  ◼ Weights should be associated with different variables based on applications and data semantics
◼ Quality of clustering
  ◼ There is usually a separate “quality” function that measures the “goodness” of a cluster
  ◼ It is hard to define “similar enough” or “good enough”
  ◼ The answer is typically highly subjective

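To make the points about type-specific distance definitions and per-variable weights concrete, here is one possible sketch of a weighted dissimilarity d(i, j) over mixed variable types. It is a Gower-style combination chosen for illustration, not a definition taken from the slides. Numeric attributes are assumed to be scaled by a known range, and categorical attributes use a simple 0/1 mismatch. The function name, the `kinds`, `weights`, and `ranges` parameters, and the sample records are all hypothetical.

```python
def mixed_dissimilarity(a, b, kinds, weights, ranges):
    """Weighted dissimilarity in [0, 1] between two records `a` and `b`.

    kinds[f]   -- "numeric" or "categorical" for attribute f
    weights[f] -- non-negative importance of attribute f
    ranges[f]  -- value range (max - min) for numeric attributes, else None
    """
    total, weight_sum = 0.0, 0.0
    for x, y, kind, w, r in zip(a, b, kinds, weights, ranges):
        if kind == "numeric":
            d = abs(x - y) / r          # scaled to [0, 1] by the attribute range
        elif kind == "categorical":
            d = 0.0 if x == y else 1.0  # simple mismatch indicator
        else:
            raise ValueError(f"unsupported attribute kind: {kind!r}")
        total += w * d
        weight_sum += w
    return total / weight_sum


# Made-up customer records: (age, income, region).
kinds = ("numeric", "numeric", "categorical")
ranges = (50.0, 100_000.0, None)
weights = (1.0, 2.0, 1.0)  # assumed: income matters twice as much here

i = (30, 40_000, "urban")
j = (40, 60_000, "rural")
d_ij = mixed_dissimilarity(i, j, kinds, weights, ranges)
d_ji = mixed_dissimilarity(j, i, kinds, weights, ranges)
d_ii = mixed_dissimilarity(i, i, kinds, weights, ranges)
print(d_ij, d_ji, d_ii)
```

Changing `weights` changes which records count as "similar", which is one reason the slide calls the notion of a good clustering application-dependent and subjective.
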