Knowledge Discovery and Data Mining: Course Lecture Slides (PPT), Chapter 10. Cluster Analysis: Basic Concepts and Methods

◼ Cluster Analysis: Basic Concepts ◼ Partitioning Methods ◼ Hierarchical Methods ◼ Density-Based Methods ◼ Grid-Based Methods ◼ Evaluation of Clustering ◼ Summary

COMP5331: Knowledge Discovery and Data Mining
Acknowledgement: Slides modified by Dr. Lei Chen based on the slides provided by Jiawei Han, Micheline Kamber, and Jian Pei
©2012 Han, Kamber & Pei. All rights reserved.

What is Cluster Analysis?
◼ Cluster: a collection of data objects
  ◼ similar (or related) to one another within the same group
  ◼ dissimilar (or unrelated) to the objects in other groups
◼ Cluster analysis (or clustering, data segmentation, …)
  ◼ Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
◼ Unsupervised learning: no predefined classes (i.e., learning by observations, as opposed to supervised learning by examples)
◼ Typical applications
  ◼ As a stand-alone tool to get insight into data distribution
  ◼ As a preprocessing step for other algorithms

Clustering for Data Understanding and Applications
◼ Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus, and species
◼ Information retrieval: document clustering
◼ Land use: identification of areas of similar land use in an earth observation database
◼ Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
◼ City planning: identifying groups of houses according to their house type, value, and geographical location
◼ Earthquake studies: observed earthquake epicenters should be clustered along continental faults
◼ Climate: understanding Earth's climate, finding patterns in the atmosphere and ocean
◼ Economic science: market research

Clustering as a Preprocessing Tool (Utility)
◼ Summarization
  ◼ Preprocessing for regression, PCA, classification, and association analysis
◼ Compression
  ◼ Image processing: vector quantization
◼ Finding k-nearest neighbors
  ◼ Localizing search to one or a small number of clusters
◼ Outlier detection
  ◼ Outliers are often viewed as those “far away” from any cluster
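The outlier-detection bullet can be illustrated with a minimal Python sketch. It assumes cluster centers are already available and treats a point as an outlier when its nearest center is farther away than a cutoff. The function names, the sample data, and the `threshold` value are hypothetical choices for illustration; the slides do not prescribe a specific rule for "far away".

```python
import math

def nearest_center_distance(point, centers):
    """Return the Euclidean distance from point to its closest center."""
    return min(math.dist(point, c) for c in centers)

def flag_outliers(points, centers, threshold):
    """Return the points whose nearest center is farther than threshold.

    The threshold is an assumed, user-chosen cutoff, not a value
    taken from the slides.
    """
    return [p for p in points if nearest_center_distance(p, centers) > threshold]

# Hypothetical data: two cluster centers and three observations.
centers = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.5, 0.2), (9.6, 10.3), (5.0, 25.0)]

outliers = flag_outliers(points, centers, threshold=3.0)
print(outliers)
```

In practice the cutoff would depend on the spread of each cluster; a single global threshold is only the simplest possible choice.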

Quality: What Is Good Clustering?
◼ A good clustering method will produce high quality clusters
  ◼ high intra-class similarity: cohesive within clusters
  ◼ low inter-class similarity: distinctive between clusters
◼ The quality of a clustering method depends on
  ◼ the similarity measure used by the method,
  ◼ its implementation, and
  ◼ its ability to discover some or all of the hidden patterns
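One rough way to make "cohesive within clusters, distinctive between clusters" concrete is to compare the average distance between points in the same cluster with the average distance between points in different clusters. The sketch below assumes Euclidean distance and uses hypothetical function names and data; it is an illustration of the idea, not a quality measure defined in the slides.

```python
import math
from itertools import combinations

def mean_intra_distance(clusters):
    """Average distance over all pairs of points that share a cluster.

    Assumes at least one cluster holds two or more points.
    """
    dists = [math.dist(p, q) for c in clusters for p, q in combinations(c, 2)]
    return sum(dists) / len(dists)

def mean_inter_distance(clusters):
    """Average distance over all pairs of points from different clusters.

    Assumes at least two non-empty clusters.
    """
    dists = [math.dist(p, q)
             for a, b in combinations(clusters, 2)
             for p in a
             for q in b]
    return sum(dists) / len(dists)

# Hypothetical data: two well-separated groups of three points each.
clusters = [
    [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)],
]

intra = mean_intra_distance(clusters)
inter = mean_inter_distance(clusters)
print(f"intra={intra:.3f} inter={inter:.3f}")
```

A small intra-cluster average together with a large inter-cluster average corresponds to high intra-class similarity and low inter-class similarity.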

Measure the Quality of Clustering
◼ Dissimilarity/similarity metric
  ◼ Similarity is expressed in terms of a distance function, typically a metric: d(i, j)
  ◼ The definitions of distance functions are usually rather different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables
  ◼ Weights should be associated with different variables based on applications and data semantics
◼ Quality of clustering
  ◼ There is usually a separate “quality” function that measures the “goodness” of a cluster
  ◼ It is hard to define “similar enough” or “good enough”
  ◼ The answer is typically highly subjective
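As an illustration of a distance function d(i, j) with per-variable weights, the following Python sketch implements a weighted Minkowski distance for numeric (interval-scaled) attributes. The function name, the default `p=2`, and the example weights are assumptions made for this sketch; the slides do not fix a particular formula, and other variable types would need different definitions.

```python
def weighted_minkowski(x, y, weights=None, p=2):
    """Weighted Minkowski distance between two numeric vectors.

    p=1 gives Manhattan distance, p=2 gives Euclidean distance.
    weights defaults to 1.0 for every variable; non-negative weights
    are assumed.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if weights is None:
        weights = [1.0] * len(x)
    total = sum(w * abs(a - b) ** p for w, a, b in zip(weights, x, y))
    return total ** (1.0 / p)

i = (0.0, 0.0)
j = (3.0, 4.0)

print(weighted_minkowski(i, j))                       # unweighted Euclidean
print(weighted_minkowski(i, j, p=1))                  # unweighted Manhattan
print(weighted_minkowski(i, j, weights=(1.0, 0.0)))   # second variable ignored
```

Setting a weight to zero removes that variable from the comparison, which is one simple way to encode application knowledge about which attributes matter.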

Considerations for Cluster Analysis
◼ Partitioning criteria
  ◼ Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)
◼ Separation of clusters
  ◼ Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may belong to more than one class)
◼ Similarity measure
  ◼ Distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based (e.g., density or contiguity)
◼ Clustering space
  ◼ Full space (often when low dimensional) vs. subspaces (often in high-dimensional clustering)

Requirements and Challenges
◼ Scalability
  ◼ Clustering all the data instead of only samples
◼ Ability to deal with different types of attributes
  ◼ Numerical, binary, categorical, ordinal, linked, and mixtures of these
◼ Constraint-based clustering
  ◼ User may give inputs on constraints
  ◼ Use domain knowledge to determine input parameters
◼ Interpretability and usability
◼ Others
  ◼ Discovery of clusters with arbitrary shape
  ◼ Ability to deal with noisy data
  ◼ Incremental clustering and insensitivity to input order
  ◼ High dimensionality

Major Clustering Approaches (I)
◼ Partitioning approach:
  ◼ Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
  ◼ Typical methods: k-means, k-medoids, CLARANS
◼ Hierarchical approach:
  ◼ Create a hierarchical decomposition of the set of data (or objects) using some criterion
  ◼ Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
◼ Density-based approach:
  ◼ Based on connectivity and density functions
  ◼ Typical methods: DBSCAN, OPTICS, DENCLUE
◼ Grid-based approach:
  ◼ Based on a multiple-level granularity structure
  ◼ Typical methods: STING, WaveCluster, CLIQUE
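To make the partitioning approach concrete, here is a minimal pure-Python sketch of k-means together with the sum-of-squared-errors criterion mentioned above. It assumes Euclidean distance and caller-supplied starting centers; the function names, the `max_iter` limit, and the sample data are hypothetical. It is a teaching sketch under those assumptions, not the course's reference implementation.

```python
import math

def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points]

def update(points, labels, centers):
    """Move each center to the mean of its members (unchanged if it has none)."""
    new_centers = []
    for k, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == k]
        if members:
            new_centers.append(
                tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            new_centers.append(old)
    return new_centers

def sse(points, labels, centers):
    """Sum of squared distances from each point to its assigned center."""
    return sum(math.dist(p, centers[label]) ** 2
               for p, label in zip(points, labels))

def k_means(points, centers, max_iter=100):
    """Alternate assignment and update steps until the centers stop moving."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = update(points, labels, centers)
        if new_centers == centers:
            break
        centers = new_centers
    # Recompute labels so they always match the returned centers.
    return assign(points, centers), centers

# Hypothetical data: two groups of three points, with assumed starting centers.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
labels, centers = k_means(points, centers=[(0.0, 0.0), (10.0, 10.0)])
error = sse(points, labels, centers)
print(labels, centers, error)
```

Different starting centers can lead to different partitions and different error values, which is why the criterion is used to compare candidate partitions.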
