Clustering
Instructor: Qiang Yang, Hong Kong University of Science and Technology (Qyang@cs.ust.hk)
Thanks: J.W. Han, I. Witten, E. Frank
Essentials
◼ Terminology:
  ◼ Objects = rows = records
  ◼ Variables = attributes = features
◼ A good clustering method scores high on intra-class similarity and low on inter-class similarity
◼ What is similarity? It is based on a computation of distance:
  ◼ between two numerical attributes
  ◼ between two nominal attributes
  ◼ between mixed attributes
The database
◼ The data is an $n \times p$ matrix: $n$ objects (rows), each described by $p$ variables (columns), so object $i$ is the row $(x_{i1}, \ldots, x_{ip})$:

$$\begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & & \vdots \\ x_{i1} & \cdots & x_{ip} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix}$$
Numerical Attributes
◼ Distances are normally used to measure the similarity or dissimilarity between two data objects
◼ Euclidean distance:
  $d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$
  where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional records
◼ Manhattan distance:
  $d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$
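A minimal sketch of both distances in Python (the use of NumPy and the two sample records are assumptions for illustration, not from the slides):

```python
import numpy as np

def euclidean(i, j):
    # d(i,j) = sqrt(sum over f of |x_if - x_jf|^2)
    i, j = np.asarray(i, float), np.asarray(j, float)
    return np.sqrt(np.sum((i - j) ** 2))

def manhattan(i, j):
    # d(i,j) = sum over f of |x_if - x_jf|
    i, j = np.asarray(i, float), np.asarray(j, float)
    return np.sum(np.abs(i - j))

# Two hypothetical 3-dimensional records
i = [1.0, 2.0, 3.0]
j = [4.0, 6.0, 3.0]
print(euclidean(i, j))  # sqrt(9 + 16 + 0) = 5.0
print(manhattan(i, j))  # 3 + 4 + 0 = 7.0
```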
Binary Variables ({0, 1}, or {true, false})
◼ A contingency table for binary data (a, b, c, d count the variables on which rows i and j take the values shown):

                   Row j
                    1      0    | sum
       Row i   1    a      b    | a+b
               0    c      d    | c+d
       sum         a+c    b+d   | p

◼ Simple matching coefficient: $d(i,j) = \frac{b+c}{a+b+c+d}$
◼ Invariant of coding of binary variable: if you assign 1 to "pass" and 0 to "fail", or the other way around, you'll get the same distance value.
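A short sketch of the simple matching coefficient (NumPy and the example vectors are my assumptions):

```python
import numpy as np

def simple_matching_distance(i, j):
    # (b + c) / (a + b + c + d): fraction of variables on which i and j disagree
    i, j = np.asarray(i), np.asarray(j)
    return np.mean(i != j)

i = np.array([1, 0, 1, 1, 0])
j = np.array([1, 1, 1, 0, 0])
print(simple_matching_distance(i, j))          # b + c = 2 mismatches, p = 5 -> 0.4
print(simple_matching_distance(1 - i, 1 - j))  # still 0.4: invariant to flipping the coding
```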
Nominal Attributes
◼ A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
◼ Method 1: simple matching
  ◼ $d(i,j) = \frac{p - m}{p}$, where m: # of matches, p: total # of variables
◼ Method 2: use a large number of binary variables
  ◼ create a new binary variable for each of the M nominal states
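Both methods sketched in Python (the function names, the color example, and the one-column-per-state encoding in `to_binary` are illustrative assumptions):

```python
import numpy as np

def nominal_matching_distance(i, j):
    # Method 1: d(i,j) = (p - m) / p, with m the number of matching variables
    i, j = np.asarray(i), np.asarray(j)
    return np.mean(i != j)

def to_binary(record, states):
    # Method 2: one new binary variable per nominal state (M states -> M columns)
    return np.array([[int(v == s) for s in states] for v in record])

i = ["red", "blue", "green"]
j = ["red", "yellow", "green"]
print(nominal_matching_distance(i, j))  # (3 - 2) / 3 = 0.333...

states = ["red", "yellow", "blue", "green"]
print(to_binary(i, states))  # each nominal attribute becomes M = 4 binary variables
```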
Other measures of cluster distance
◼ Minimum distance: $d(C_i, C_j) = \min_{p \in C_i,\, p' \in C_j} |p - p'|$
◼ Maximum distance: $d(C_i, C_j) = \max_{p \in C_i,\, p' \in C_j} |p - p'|$
◼ Mean distance: $d(C_i, C_j) = |m_i - m_j|$, where $m_i$ and $m_j$ are the means of the two clusters
◼ Average distance: $d(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{p' \in C_j} |p - p'|$
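All four measures in Python (NumPy broadcasting, Euclidean norm for |p − p'|, and the two toy clusters are assumptions for illustration):

```python
import numpy as np

def pairwise(Ci, Cj):
    # |p - p'| for every p in Ci, p' in Cj
    Ci, Cj = np.asarray(Ci, float), np.asarray(Cj, float)
    return np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=-1)

def min_dist(Ci, Cj):
    return pairwise(Ci, Cj).min()

def max_dist(Ci, Cj):
    return pairwise(Ci, Cj).max()

def mean_dist(Ci, Cj):
    # distance between the two cluster means m_i and m_j
    return np.linalg.norm(np.mean(Ci, axis=0) - np.mean(Cj, axis=0))

def avg_dist(Ci, Cj):
    # average over all n_i * n_j pairwise distances
    return pairwise(Ci, Cj).mean()

Ci = [[0.0, 0.0], [1.0, 0.0]]
Cj = [[4.0, 0.0], [5.0, 0.0]]
print(min_dist(Ci, Cj), max_dist(Ci, Cj), mean_dist(Ci, Cj), avg_dist(Ci, Cj))
# 3.0 5.0 4.0 4.0
```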
Major clustering methods
◼ Partition-based (k-means)
  ◼ Produces sphere-like clusters
  ◼ Good when the number of clusters is known and the database is small or medium sized
◼ Hierarchical methods (agglomerative or divisive)
  ◼ Produce trees of clusters
  ◼ Fast
◼ Density-based (DBSCAN)
  ◼ Produces arbitrary-shaped clusters
  ◼ Good when dealing with spatial clusters (maps)
◼ Grid-based
  ◼ Produces clusters based on grids
  ◼ Fast for large, multidimensional databases
◼ Model-based
  ◼ Based on statistical models
  ◼ Allows objects to belong to several clusters
The k-Means Clustering Method: for numerical attributes
◼ Given k, the k-means algorithm is implemented in four steps (a runnable sketch follows the list):
  1. Partition the objects into k non-empty subsets
  2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., the mean point, of the cluster)
  3. Assign each object to the cluster with the nearest seed point
  4. Go back to Step 2; stop when there are no more new assignments
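A minimal k-means sketch in Python following these four steps (NumPy, the random seed-point initialization, and the toy data are assumptions; empty clusters are not handled):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Steps 1-2: an initial partition, summarized here by k random seed points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each object to the cluster with the nearest seed point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop when no object changes its assignment
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 2: recompute each centroid as the mean point of its cluster
        centroids = np.array([X[labels == c].mean(axis=0) for c in range(k)])
    return labels, centroids

X = [[1, 2], [2, 4], [3, 3], [4, 2], [10, 10], [11, 9], [10, 8]]
labels, centroids = kmeans(X, k=2)
print(labels)     # e.g. [0 0 0 0 1 1 1]
print(centroids)  # the two cluster mean points
```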
The mean point
◼ Example: four two-dimensional points and their mean

     X     Y
     1     2
     2     4
     3     3
     4     2
    -----------
    2.5   2.75   (mean point)

◼ The mean point can be a virtual point

[Figure: plot of the four points (X, Y) and their mean point]
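A quick check of the mean point from the table (NumPy assumed):

```python
import numpy as np

points = np.array([[1, 2], [2, 4], [3, 3], [4, 2]], dtype=float)
print(points.mean(axis=0))  # [2.5  2.75] -- a virtual point, not one of the four objects
```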