Outline a Cluster basics Clustering algorithms a Hierarchical clustering a K-means a Expectation-Maximization(EM) a Cluster Validity n determining the number of clusters a Clustering evaluation
2 Outline ◼ Cluster Basics ◼ Clustering algorithms Hierarchical clustering k-means Expectation-Maximization (EM) ◼ Cluster Validity determining the number of clusters clustering evaluation
Clustering Analysis ■ Definition 口物以类聚,人以群居 n Grouping the data with similar features It's a method of data exploration, a way of looking for patterns or structure in the V:"... data that are of interest a Properties: unsupervised parameter needed Application field: Machine learning, pattern recognition mage analysis, data mining information retrieval and K-means animation bioinformatics etc
3 Clustering Analysis ◼ Definition: 物以类聚,人以群居 Grouping the data with similar features ◼ It’s a method of data exploration, a way of looking for patterns or structure in the data that are of interest. ◼ Properties: unsupervised, parameter needed ◼ Application field: Machine learning, pattern recognition, image analysis, data mining, information retrieval and bioinformatics etc. K-means animation
Factors of Clustering What data could be used in clustering? a Large or small, Gaussian or non-Gaussian, etc a Which clustering algorithm?(cost function) Partition-based(e.g k-means n Model-based(e.g EM algorithm) a Density-based(e.g. DBSCAN) Genetic, spectral a Choosing(dis similarity measures-a critical step in clustering 口 Euclidean distance, a Pearson linear correlation a How to evaluate the clustering result?(cluster validity)
4 Factors of Clustering ◼ What data could be used in clustering? Large or small, Gaussian or non-Gaussian, etc. ◼ Which clustering algorithm? (cost function) Partition-based (e.g. k-means) Model-based (e.g. EM algorithm) Density-based (e.g. DBSCAN) Genetic, spectral …… ◼ Choosing (dis)similarity measures – a critical step in clustering Euclidean distance,… Pearson Linear Correlation,… ◼ How to evaluate the clustering result? (cluster validity)
Quality: What Is Good Clustering? A good clustering method will produce high quality clusters with a high intra-class similarity a low inter-class similarity The quality of a clustering result depends on both the similarity measure used by the method and its implementation a The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
5 Quality: What Is Good Clustering? ◼ A good clustering method will produce high quality clusters with high intra-class similarity low inter-class similarity ◼ The quality of a clustering result depends on both the similarity measure used by the method and its implementation ◼ The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
Requirements of clustering in data mining(1) Scalability ability to deal with different types of attributes Discovery of clusters with arbitrary shape a Minimal requirements for domain knowledge to determine input parameters
◼ Scalability ◼ Ability to deal with different types of attributes ◼ Discovery of clusters with arbitrary shape ◼ Minimal requirements for domain knowledge to determine input parameters Requirements of clustering in data mining (1) 6
Requirements of clustering in data mining(2) a Ability to deal with noise and outliers 丈 Insensitivity to order of input records a High dimensionality a Incorporation of user- specified constraints Interpretability and usability
Requirements of clustering in data mining (2) ◼ Ability to deal with noise and outliers ◼ Insensitivity to order of input records ◼ High dimensionality ◼ Incorporation of userspecified constraints ◼ Interpretability and usability 7
Similarity and dissimilarit a there is no single definition of similarity or dissimilarity between data objects The definition of similarity or dissimilarity between objects depends on a the type of the data considered a what kind of similarity we are looking for
Similarity and dissimilarity 8 ◼ There is no single definition of similarity or dissimilarity between data objects ◼ The definition of similarity or dissimilarity between objects depends on the type of the data considered what kind of similarity we are looking for
Similarity and dissimilarit Similarity/dissimilarity between objects is often expressed in terms of a distance measure d(x, y) a Ideally, every distance measure should be a metric. i.e. it should satisfy the following conditions (x,y)≥0 2. d(x,y)=oiff x 3.d(x,y)=d(y,x) 4.d(x,2)≤d(x,y)+d(y,2)
Similarity and dissimilarity 9 ◼ Similarity/dissimilarity between objects is often expressed in terms of a distance measure d(x,y) ◼ Ideally, every distance measure should be a metric, i.e., it should satisfy the following conditions: 4. ( , ) ( , ) ( , ) 3. ( , ) ( , ) 2. ( , ) 0 iff 1. ( , ) 0 d x z d x y d y z d x y d y x d x y x y d x y + = = =
Euclidean distance Find distance between X and y: xEX, yEY, n="number of dimensions", A="degree of minkowski" x2Ⅹ x ="absolute value of x",W="weight vector"can be all I 单Qw0(VE(+(x-¥xD) ⊥=1 A=1 city Block(x, r)=2(**( x,-y D) 主=1 Euclidean(x, y) E(w:*(lxY, D') Y = 1=1
Euclidean distance ◼ Here D is the number of dimensions in the data vector. For instance: ◆ RGB color channel in images ◆ Logitude and latitude in GPS data = = − D i d euc x i y i 1 2 (x,y) ( ) 10
Pearson Linear Correlation (x1-x)v1- No obvious p(x,y) relationship? X -x A medium direcl relationship? x Input A strong, inverse relationship a Were shifting the expression profiles down(subtracting the means)and scaling by the standard deviations (i.e. making the data have mean =0 and std= 1) a Always between-1 and +1(perfectly anti-correlated and perfectly correlated) 11
Pearson Linear Correlation ◼ We’re shifting the expression profiles down (subtracting the means) and scaling by the standard deviations (i.e., making the data have mean = 0 and std = 1) ◼ Always between –1 and +1 (perfectly anti-correlated and perfectly correlated) = = − − − − = = = = n i i n i i n i i n i i i n i i y n y x n x x x y y x x y y 1 1 ( ) ( ) ( )( ) ( , ) 1 2 1 2 1 x y 11