同济大学：《大数据分析与数据挖掘 Big Data Analysis and Mining》课程教学资源（PPT课件讲稿）Clustering Basics（主讲：赵钦佩）

◼ Cluster Basics ◼ Clustering algorithms  Hierarchical clustering  k-means  Expectation-Maximization (EM) ◼ Cluster Validity  determining the number of clusters  clustering evaluation

团购合买资源类别：文库，文档格式：PPTX，文档页数：48，文件大小：1.34MB

Outline a Cluster basics Clustering algorithms a Hierarchical clustering a K-means a Expectation-Maximization(EM) a Cluster Validity n determining the number of clusters a Clustering evaluation

2 Outline ◼ Cluster Basics ◼ Clustering algorithms  Hierarchical clustering  k-means  Expectation-Maximization (EM) ◼ Cluster Validity  determining the number of clusters  clustering evaluation

Clustering Analysis ■ Definition 口物以类聚,人以群居 n Grouping the data with similar features It's a method of data exploration, a way of looking for patterns or structure in the V:"... data that are of interest a Properties: unsupervised parameter needed Application field: Machine learning, pattern recognition mage analysis, data mining information retrieval and K-means animation bioinformatics etc

3 Clustering Analysis ◼ Definition:  物以类聚，人以群居  Grouping the data with similar features ◼ It’s a method of data exploration, a way of looking for patterns or structure in the data that are of interest. ◼ Properties: unsupervised, parameter needed ◼ Application field: Machine learning, pattern recognition, image analysis, data mining, information retrieval and bioinformatics etc. K-means animation

Factors of Clustering What data could be used in clustering? a Large or small, Gaussian or non-Gaussian, etc a Which clustering algorithm?(cost function) Partition-based(e.g k-means n Model-based(e.g EM algorithm) a Density-based(e.g. DBSCAN) Genetic, spectral a Choosing(dis similarity measures-a critical step in clustering 口 Euclidean distance, a Pearson linear correlation a How to evaluate the clustering result?(cluster validity)

4 Factors of Clustering ◼ What data could be used in clustering?  Large or small, Gaussian or non-Gaussian, etc. ◼ Which clustering algorithm? (cost function)  Partition-based (e.g. k-means)  Model-based (e.g. EM algorithm)  Density-based (e.g. DBSCAN)  Genetic, spectral …… ◼ Choosing (dis)similarity measures – a critical step in clustering  Euclidean distance,…  Pearson Linear Correlation,… ◼ How to evaluate the clustering result? (cluster validity)

Quality: What Is Good Clustering? A good clustering method will produce high quality clusters with a high intra-class similarity a low inter-class similarity The quality of a clustering result depends on both the similarity measure used by the method and its implementation a The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns

5 Quality: What Is Good Clustering? ◼ A good clustering method will produce high quality clusters with  high intra-class similarity  low inter-class similarity ◼ The quality of a clustering result depends on both the similarity measure used by the method and its implementation ◼ The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns

Requirements of clustering in data mining(1) Scalability ability to deal with different types of attributes Discovery of clusters with arbitrary shape a Minimal requirements for domain knowledge to determine input parameters

◼ Scalability ◼ Ability to deal with different types of attributes ◼ Discovery of clusters with arbitrary shape ◼ Minimal requirements for domain knowledge to determine input parameters Requirements of clustering in data mining (1) 6

Requirements of clustering in data mining(2) a Ability to deal with noise and outliers 丈 Insensitivity to order of input records a High dimensionality a Incorporation of user- specified constraints Interpretability and usability

Requirements of clustering in data mining (2) ◼ Ability to deal with noise and outliers ◼ Insensitivity to order of input records ◼ High dimensionality ◼ Incorporation of userspecified constraints ◼ Interpretability and usability 7

Similarity and dissimilarit a there is no single definition of similarity or dissimilarity between data objects The definition of similarity or dissimilarity between objects depends on a the type of the data considered a what kind of similarity we are looking for

Similarity and dissimilarity 8 ◼ There is no single definition of similarity or dissimilarity between data objects ◼ The definition of similarity or dissimilarity between objects depends on  the type of the data considered  what kind of similarity we are looking for

Similarity and dissimilarit Similarity/dissimilarity between objects is often expressed in terms of a distance measure d(x, y) a Ideally, every distance measure should be a metric. i.e. it should satisfy the following conditions (x,y)≥0 2. d(x,y)=oiff x 3.d(x,y)=d(y,x) 4.d(x,2)≤d(x,y)+d(y,2)

Similarity and dissimilarity 9 ◼ Similarity/dissimilarity between objects is often expressed in terms of a distance measure d(x,y) ◼ Ideally, every distance measure should be a metric, i.e., it should satisfy the following conditions: 4. ( , ) ( , ) ( , ) 3. ( , ) ( , ) 2. ( , ) 0 iff 1. ( , ) 0 d x z d x y d y z d x y d y x d x y x y d x y  + = = = 

Euclidean distance Find distance between X and y: xEX, yEY, n="number of dimensions", A="degree of minkowski" x2Ⅹ x ="absolute value of x",W="weight vector"can be all I 单Qw0(VE(+(x-￥xD) ⊥=1 A=1 city Block(x, r)=2(**( x,-y D) 主=1 Euclidean(x, y) E(w:*(lxY, D') Y = 1=1

Euclidean distance ◼ Here D is the number of dimensions in the data vector. For instance: ◆ RGB color channel in images ◆ Logitude and latitude in GPS data = = − D i d euc x i y i 1 2 (x,y) ( ) 10

Pearson Linear Correlation (x1-x)v1- No obvious p(x,y) relationship? X -x A medium direcl relationship? x Input A strong, inverse relationship a Were shifting the expression profiles down(subtracting the means)and scaling by the standard deviations (i.e. making the data have mean =0 and std= 1) a Always between-1 and +1(perfectly anti-correlated and perfectly correlated) 11

Pearson Linear Correlation ◼ We’re shifting the expression profiles down (subtracting the means) and scaling by the standard deviations (i.e., making the data have mean = 0 and std = 1) ◼ Always between –1 and +1 (perfectly anti-correlated and perfectly correlated)      = = − − − − = = = = n i i n i i n i i n i i i n i i y n y x n x x x y y x x y y 1 1 ( ) ( ) ( )( ) ( , ) 1 2 1 2 1  x y 11

点击进入文档下载页（PPTX格式）

共48页，可试读16页，点击继续阅读 ↓↓

点击下载（PPTX格式）

浏览记录