
Hong Kong University of Science and Technology: Clustering (Lecture Slides)


Clustering
Instructor: Qiang Yang, Hong Kong University of Science and Technology (qyang@cs.ust.hk)
Thanks: J.W. Han, I. Witten, E. Frank

Essentials
◼ Terminology:
  ◼ Objects = rows = records
  ◼ Variables = attributes = features
◼ A good clustering method is high on intra-class similarity and low on inter-class similarity
◼ What is similarity? It is based on a computation of distance:
  ◼ between two numerical attributes
  ◼ between two nominal attributes
  ◼ between mixed attributes

The database
The data is an n × p matrix: one row per object, one column per attribute.

$$
X = \begin{pmatrix}
x_{11} & \cdots & x_{1p} \\
\vdots &        & \vdots \\
x_{i1} & \cdots & x_{ip} \\
\vdots &        & \vdots \\
x_{n1} & \cdots & x_{np}
\end{pmatrix}
\qquad \text{Object } i = (x_{i1}, \ldots, x_{ip})
$$
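As a concrete illustration (my own, using NumPy, which the slides do not mention), the matrix can be held as an array whose rows are objects:

```python
import numpy as np

# 4 objects (n rows) x 3 attributes (p columns)
X = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
    [1.5, 2.5, 3.5],
])

n, p = X.shape   # n = 4 objects, p = 3 attributes
object_i = X[2]  # one object (row) as a p-dimensional vector
```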

Numerical attributes
◼ Distances are normally used to measure the similarity or dissimilarity between two data objects
◼ Euclidean distance:
$$ d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2} $$
where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional records
◼ Manhattan distance:
$$ d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}| $$
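A minimal sketch of both distance functions in Python (the language choice and function names are my own; the slides give only the formulas):

```python
import math

def euclidean(x, y):
    # d(i,j) = sqrt(sum over k of |x_k - y_k|^2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # d(i,j) = sum over k of |x_k - y_k|
    return sum(abs(a - b) for a, b in zip(x, y))

i = [1.0, 2.0, 3.0]
j = [4.0, 6.0, 3.0]
print(euclidean(i, j))  # 5.0
print(manhattan(i, j))  # 7.0
```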

Binary variables ({0, 1}, or {true, false})
◼ A contingency table for binary data:

              Row j
              1      0      sum
  Row i  1    a      b      a+b
         0    c      d      c+d
         sum  a+c    b+d    p

◼ Simple matching coefficient:
$$ d(i,j) = \frac{b + c}{a + b + c + d} $$
◼ Invariant to the coding of the binary variable: if you assign 1 to "pass" and 0 to "fail", or the other way around, you get the same distance value
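A small Python sketch of the simple matching coefficient, including a check of the coding-invariance claim (function names are illustrative):

```python
def simple_matching_distance(row_i, row_j):
    # Build the contingency counts, then d(i,j) = (b + c) / (a + b + c + d).
    a = sum(1 for x, y in zip(row_i, row_j) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(row_i, row_j) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(row_i, row_j) if x == 0 and y == 1)
    d = sum(1 for x, y in zip(row_i, row_j) if x == 0 and y == 0)
    return (b + c) / (a + b + c + d)

# Swapping the coding (1 <-> 0) leaves the distance unchanged:
i = [1, 0, 1, 1]
j = [1, 1, 0, 1]
flip = lambda row: [1 - v for v in row]
assert simple_matching_distance(i, j) == simple_matching_distance(flip(i), flip(j))
print(simple_matching_distance(i, j))  # 0.5
```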

Nominal attributes
◼ A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
◼ Method 1: simple matching
$$ d(i,j) = \frac{p - m}{p} $$
where m is the number of matches and p is the total number of variables
◼ Method 2: use a large number of binary variables, creating a new binary variable for each of the M nominal states
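Both methods can be sketched in a few lines of Python (the names and the example states are illustrative, not from the slides):

```python
def nominal_matching_distance(row_i, row_j):
    # Method 1: d(i,j) = (p - m) / p, with m = number of matching attributes.
    p = len(row_i)
    m = sum(1 for x, y in zip(row_i, row_j) if x == y)
    return (p - m) / p

def one_hot(value, states):
    # Method 2: one new binary variable per nominal state.
    return [1 if value == s else 0 for s in states]

i = ["red", "small", "round"]
j = ["red", "large", "round"]
print(nominal_matching_distance(i, j))                     # 0.333...
print(one_hot("red", ["red", "yellow", "blue", "green"]))  # [1, 0, 0, 0]
```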

Other measures of cluster distance
◼ Minimum distance: $d(C_i, C_j) = \min_{p \in C_i,\, p' \in C_j} |p - p'|$
◼ Maximum distance: $d(C_i, C_j) = \max_{p \in C_i,\, p' \in C_j} |p - p'|$
◼ Mean distance: $d(C_i, C_j) = |m_i - m_j|$, where $m_i$ and $m_j$ are the cluster means
◼ Average distance: $d(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{p' \in C_j} |p - p'|$
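A Python sketch of the four measures, assuming clusters are lists of NumPy vectors and |p − p'| is the Euclidean norm (all names are illustrative):

```python
import numpy as np

def d_min(Ci, Cj):
    # Minimum distance: closest pair across the two clusters.
    return min(np.linalg.norm(p - q) for p in Ci for q in Cj)

def d_max(Ci, Cj):
    # Maximum distance: farthest pair across the two clusters.
    return max(np.linalg.norm(p - q) for p in Ci for q in Cj)

def d_mean(Ci, Cj):
    # Mean distance: distance between the cluster means m_i and m_j.
    return np.linalg.norm(np.mean(Ci, axis=0) - np.mean(Cj, axis=0))

def d_avg(Ci, Cj):
    # Average distance over all n_i * n_j cross-cluster pairs.
    return sum(np.linalg.norm(p - q) for p in Ci for q in Cj) / (len(Ci) * len(Cj))

Ci = [np.array([0.0, 0.0]), np.array([1.0, 0.0])]
Cj = [np.array([4.0, 0.0]), np.array([6.0, 0.0])]
print(d_min(Ci, Cj), d_max(Ci, Cj), d_mean(Ci, Cj), d_avg(Ci, Cj))  # 3.0 6.0 4.5 4.5
```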

Major clustering methods
◼ Partition-based (k-means)
  ◼ Produces sphere-like clusters
  ◼ Good when the number of clusters is known and the database is small or medium sized
◼ Hierarchical methods (agglomerative or divisive)
  ◼ Produce trees of clusters
  ◼ Fast
◼ Density-based (DBSCAN)
  ◼ Produces arbitrarily shaped clusters
  ◼ Good when dealing with spatial clusters (maps)
◼ Grid-based
  ◼ Produces clusters based on grids
  ◼ Fast for large, multidimensional databases
◼ Model-based
  ◼ Based on statistical models
  ◼ Allows objects to belong to several clusters

The K-Means Clustering Method: for numerical attributes
◼ Given k, the k-means algorithm is implemented in four steps (sketched in the code below):
  1. Partition the objects into k non-empty subsets
  2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., the mean point, of the cluster)
  3. Assign each object to the cluster with the nearest seed point
  4. Go back to Step 2; stop when there are no new assignments
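A minimal Python sketch of these four steps using NumPy; the balanced random initial partition, the seeding, and the no-empty-cluster assumption are mine, not part of the slide:

```python
import numpy as np

def k_means(X, k, rng=None):
    # Assumes no cluster becomes empty during iteration; a real
    # implementation would re-seed empty clusters.
    rng = rng or np.random.default_rng(0)
    # Step 1: partition objects into k non-empty subsets (balanced random labels).
    labels = rng.permutation(np.arange(len(X)) % k)
    while True:
        # Step 2: compute seed points as centroids of the current partition.
        centroids = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        # Step 3: assign each object to the cluster with the nearest seed point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: go back to Step 2; stop when no assignments change.
        if np.array_equal(new_labels, labels):
            return labels, centroids
        labels = new_labels

labels, centroids = k_means(np.array([[1., 2.], [2., 4.], [3., 3.], [4., 2.]]), k=2)
```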

The mean point

    X     Y
    1     2
    2     4
    3     3
    4     2
  mean: 2.5   2.75

[Figure: the four points and their mean plotted in the X-Y plane]

◼ The mean point can be a virtual point (it need not coincide with any object)
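The same mean point, reproduced with NumPy (illustrative):

```python
import numpy as np

points = np.array([[1, 2], [2, 4], [3, 3], [4, 2]])
print(points.mean(axis=0))  # [2.5  2.75] -- not one of the four objects
```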

