

A Tutorial on Spectral Clustering

Ulrike von Luxburg
Max Planck Institute for Biological Cybernetics
Spemannstr. 38, 72076 Tübingen, Germany
ulrike.luxburg@tuebingen.mpg.de

This article appears in Statistics and Computing, 17 (4), 2007. The original publication is available at www.springer.com.

Abstract

In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. At first glance spectral clustering appears slightly mysterious, and it is not obvious to see why it works at all and what it really does. The goal of this tutorial is to give some intuition on those questions. We describe different graph Laplacians and their basic properties, present the most common spectral clustering algorithms, and derive those algorithms from scratch by several different approaches. Advantages and disadvantages of the different spectral clustering algorithms are discussed.

Keywords: spectral clustering; graph Laplacian

1 Introduction

Clustering is one of the most widely used techniques for exploratory data analysis, with applications ranging from statistics, computer science, biology to social sciences or psychology. In virtually every scientific field dealing with empirical data, people attempt to get a first impression on their data by trying to identify groups of “similar behavior” in their data. In this article we would like to introduce the reader to the family of spectral clustering algorithms. Compared to the “traditional algorithms” such as k-means or single linkage, spectral clustering has many fundamental advantages. Results obtained by spectral clustering often outperform the traditional approaches, spectral clustering is very simple to implement and can be solved efficiently by standard linear algebra methods.

This tutorial is set up as a self-contained introduction to spectral clustering. We derive spectral clustering from scratch and present different points of view to why spectral clustering works. Apart from basic linear algebra, no particular mathematical background is required by the reader. However, we do not attempt to give a concise review of the whole literature on spectral clustering, which is impossible due to the overwhelming amount of literature on this subject. The first two sections are devoted to a step-by-step introduction to the mathematical objects used by spectral clustering: similarity graphs in Section 2, and graph Laplacians in Section 3. The spectral clustering algorithms themselves will be presented in Section 4. The next three sections are then devoted to explaining why those algorithms work. Each section corresponds to one explanation: Section 5 describes a graph partitioning approach, Section 6 a random walk perspective, and Section 7 a perturbation theory approach. In Section 8 we will study some practical issues related to spectral clustering, and discuss various extensions and literature related to spectral clustering in Section 9.


2 Similarity graphs

Given a set of data points x1, ..., xn and some notion of similarity sij ≥ 0 between all pairs of data points xi and xj, the intuitive goal of clustering is to divide the data points into several groups such that points in the same group are similar and points in different groups are dissimilar to each other. If we do not have more information than similarities between data points, a nice way of representing the data is in form of the similarity graph G = (V, E). Each vertex vi in this graph represents a data point xi. Two vertices are connected if the similarity sij between the corresponding data points xi and xj is positive or larger than a certain threshold, and the edge is weighted by sij. The problem of clustering can now be reformulated using the similarity graph: we want to find a partition of the graph such that the edges between different groups have very low weights (which means that points in different clusters are dissimilar from each other) and the edges within a group have high weights (which means that points within the same cluster are similar to each other). To be able to formalize this intuition we first want to introduce some basic graph notation and briefly discuss the kind of graphs we are going to study.

2.1 Graph notation

Let G = (V, E) be an undirected graph with vertex set V = {v1, ..., vn}. In the following we assume that the graph G is weighted, that is each edge between two vertices vi and vj carries a non-negative weight wij ≥ 0. The weighted adjacency matrix of the graph is the matrix W = (wij), i, j = 1, ..., n. If wij = 0 this means that the vertices vi and vj are not connected by an edge. As G is undirected we require wij = wji. The degree of a vertex vi ∈ V is defined as

\[ d_i = \sum_{j=1}^{n} w_{ij} . \]

Note that, in fact, this sum only runs over all vertices adjacent to vi, as for all other vertices vj the weight wij is 0. The degree matrix D is defined as the diagonal matrix with the degrees d1, ..., dn on the diagonal. Given a subset of vertices A ⊂ V, we denote its complement V \ A by Ā. We define the indicator vector 1_A = (f1, ..., fn)' ∈ ℝ^n as the vector with entries fi = 1 if vi ∈ A and fi = 0 otherwise. For convenience we introduce the shorthand notation i ∈ A for the set of indices {i | vi ∈ A}, in particular when dealing with a sum like \sum_{i \in A} w_{ij}. For two not necessarily disjoint sets A, B ⊂ V we define

\[ W(A, B) := \sum_{i \in A,\, j \in B} w_{ij} . \]

We consider two different ways of measuring the “size” of a subset A ⊂ V:

\[ |A| := \text{the number of vertices in } A, \qquad \operatorname{vol}(A) := \sum_{i \in A} d_i . \]

Intuitively, |A| measures the size of A by its number of vertices, while vol(A) measures the size of A by summing over the weights of all edges attached to vertices in A. A subset A ⊂ V of a graph is connected if any two vertices in A can be joined by a path such that all intermediate points also lie in A. A subset A is called a connected component if it is connected and if there are no connections between vertices in A and Ā. The nonempty sets A1, ..., Ak form a partition of the graph if Ai ∩ Aj = ∅ and A1 ∪ ... ∪ Ak = V.
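The quantities just introduced translate directly into a few lines of code. The following is a minimal sketch (assuming a dense numpy array as weighted adjacency matrix; the helper names are ours, not from the tutorial):

    import numpy as np

    def degrees(W):
        # d_i = sum_j w_ij: row sums of the weighted adjacency matrix
        return W.sum(axis=1)

    def vol(W, A):
        # vol(A) = sum of the degrees of the vertices in A
        return degrees(W)[sorted(A)].sum()

    def weight_between(W, A, B):
        # W(A, B) = sum over i in A, j in B of w_ij
        return W[np.ix_(sorted(A), sorted(B))].sum()

    # Example: two tightly connected pairs of vertices, weakly linked across
    W = np.array([[0.0, 1.0, 0.1, 0.0],
                  [1.0, 0.0, 0.0, 0.1],
                  [0.1, 0.0, 0.0, 1.0],
                  [0.0, 0.1, 1.0, 0.0]])
    D = np.diag(degrees(W))                      # degree matrix
    print(vol(W, {0, 1}))                        # 2.2
    print(weight_between(W, {0, 1}, {2, 3}))     # 0.2, the weight across the split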


2.2 Different similarity graphs

There are several popular constructions to transform a given set x1, ..., xn of data points with pairwise similarities sij or pairwise distances dij into a graph. When constructing similarity graphs the goal is to model the local neighborhood relationships between the data points.

The ε-neighborhood graph: Here we connect all points whose pairwise distances are smaller than ε. As the distances between all connected points are roughly of the same scale (at most ε), weighting the edges would not incorporate more information about the data to the graph. Hence, the ε-neighborhood graph is usually considered as an unweighted graph.

k-nearest neighbor graphs: Here the goal is to connect vertex vi with vertex vj if vj is among the k-nearest neighbors of vi. However, this definition leads to a directed graph, as the neighborhood relationship is not symmetric. There are two ways of making this graph undirected. The first way is to simply ignore the directions of the edges, that is we connect vi and vj with an undirected edge if vi is among the k-nearest neighbors of vj or if vj is among the k-nearest neighbors of vi. The resulting graph is what is usually called the k-nearest neighbor graph. The second choice is to connect vertices vi and vj if both vi is among the k-nearest neighbors of vj and vj is among the k-nearest neighbors of vi. The resulting graph is called the mutual k-nearest neighbor graph. In both cases, after connecting the appropriate vertices we weight the edges by the similarity of their endpoints.

The fully connected graph: Here we simply connect all points with positive similarity with each other, and we weight all edges by sij. As the graph should represent the local neighborhood relationships, this construction is only useful if the similarity function itself models local neighborhoods. An example for such a similarity function is the Gaussian similarity function s(xi, xj) = exp(−||xi − xj||^2 / (2σ^2)), where the parameter σ controls the width of the neighborhoods. This parameter plays a similar role as the parameter ε in case of the ε-neighborhood graph.

All graphs mentioned above are regularly used in spectral clustering. To our knowledge, theoretical results on the question how the choice of the similarity graph influences the spectral clustering result do not exist. For a discussion of the behavior of the different graphs we refer to Section 8.
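As an illustration of these constructions (a minimal sketch, not part of the original tutorial; it assumes numpy/scipy, and the function names are ours), the k-nearest neighbor, mutual k-nearest neighbor and fully connected Gaussian similarity graphs can be built as follows:

    import numpy as np
    from scipy.spatial.distance import cdist

    def gaussian_similarity(X, sigma=1.0):
        # Fully connected graph: s(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))
        W = np.exp(-cdist(X, X, metric="sqeuclidean") / (2.0 * sigma ** 2))
        np.fill_diagonal(W, 0.0)               # no self-edges
        return W

    def knn_graph(X, k=10, mutual=False, sigma=1.0):
        # Connect each point to its k nearest neighbors, then symmetrize
        D = cdist(X, X)
        np.fill_diagonal(D, np.inf)            # a point is not its own neighbor
        n = len(X)
        neighbors = np.zeros((n, n), dtype=bool)
        for i in range(n):
            neighbors[i, np.argsort(D[i])[:k]] = True
        # "or" gives the usual k-nearest neighbor graph, "and" the mutual one
        adj = (neighbors & neighbors.T) if mutual else (neighbors | neighbors.T)
        # weight the retained edges by the similarity of their endpoints
        return np.where(adj, gaussian_similarity(X, sigma), 0.0)

The ε-neighborhood graph can be sketched analogously by thresholding the distance matrix and, following the remark above, leaving the surviving edges unweighted.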
3 Graph Laplacians and their basic properties

The main tools for spectral clustering are graph Laplacian matrices. There exists a whole field dedicated to the study of those matrices, called spectral graph theory (e.g., see Chung, 1997). In this section we want to define different graph Laplacians and point out their most important properties. We will carefully distinguish between different variants of graph Laplacians. Note that in the literature there is no unique convention which matrix exactly is called “graph Laplacian”. Usually, every author just calls “his” matrix the graph Laplacian. Hence, a lot of care is needed when reading literature on graph Laplacians.

In the following we always assume that G is an undirected, weighted graph with weight matrix W, where wij = wji ≥ 0. When using eigenvectors of a matrix, we will not necessarily assume that they are normalized. For example, the constant vector 1 and a multiple a·1 for some a ≠ 0 will be considered as the same eigenvectors. Eigenvalues will always be ordered increasingly, respecting multiplicities. By “the first k eigenvectors” we refer to the eigenvectors corresponding to the k smallest eigenvalues.
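In numerical code this convention is easy to respect, since standard symmetric eigensolvers return eigenvalues in ascending order; for instance (a small sketch of our own, using numpy):

    import numpy as np

    def first_k_eigenvectors(M, k):
        # np.linalg.eigh returns the eigenvalues of a symmetric matrix in
        # ascending order, so the "first k eigenvectors" in the sense above
        # are simply the first k columns of the eigenvector matrix.
        eigenvalues, eigenvectors = np.linalg.eigh(M)
        return eigenvalues[:k], eigenvectors[:, :k]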


3.1 The unnormalized graph Laplacian

The unnormalized graph Laplacian matrix is defined as

\[ L = D - W . \]

An overview over many of its properties can be found in Mohar (1991, 1997). The following proposition summarizes the most important facts needed for spectral clustering.

Proposition 1 (Properties of L) The matrix L satisfies the following properties:

1. For every vector f ∈ ℝ^n we have
\[ f'Lf = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} (f_i - f_j)^2 . \]
2. L is symmetric and positive semi-definite.
3. The smallest eigenvalue of L is 0, the corresponding eigenvector is the constant one vector 1.
4. L has n non-negative, real-valued eigenvalues 0 = λ1 ≤ λ2 ≤ ... ≤ λn.

Proof.
Part (1): By the definition of di,
\[ f'Lf = f'Df - f'Wf = \sum_{i=1}^{n} d_i f_i^2 - \sum_{i,j=1}^{n} f_i f_j w_{ij}
= \frac{1}{2} \left( \sum_{i=1}^{n} d_i f_i^2 - 2 \sum_{i,j=1}^{n} f_i f_j w_{ij} + \sum_{j=1}^{n} d_j f_j^2 \right)
= \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} (f_i - f_j)^2 . \]
Part (2): The symmetry of L follows directly from the symmetry of W and D. The positive semi-definiteness is a direct consequence of Part (1), which shows that f'Lf ≥ 0 for all f ∈ ℝ^n.
Part (3): Obvious.
Part (4) is a direct consequence of Parts (1) - (3). □

Note that the unnormalized graph Laplacian does not depend on the diagonal elements of the adjacency matrix W. Each adjacency matrix which coincides with W on all off-diagonal positions leads to the same unnormalized graph Laplacian L. In particular, self-edges in a graph do not change the corresponding graph Laplacian.
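The following sketch (our own illustration, assuming numpy) constructs L = D − W for a small random graph and checks Proposition 1 numerically:

    import numpy as np

    def unnormalized_laplacian(W):
        # L = D - W with D the diagonal degree matrix
        return np.diag(W.sum(axis=1)) - W

    rng = np.random.default_rng(0)
    W = rng.random((6, 6))
    W = np.triu(W, 1)
    W = W + W.T                      # symmetric, non-negative weights, zero diagonal
    L = unnormalized_laplacian(W)

    f = rng.standard_normal(6)
    lhs = f @ L @ f
    rhs = 0.5 * sum(W[i, j] * (f[i] - f[j]) ** 2 for i in range(6) for j in range(6))
    assert np.isclose(lhs, rhs)                         # Part (1)
    assert np.all(np.linalg.eigvalsh(L) >= -1e-10)      # Parts (2)-(4): eigenvalues >= 0
    assert np.allclose(L @ np.ones(6), 0.0)             # Part (3): the constant vector has eigenvalue 0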


The unnormalized graph Laplacian and its eigenvalues and eigenvectors can be used to describe many properties of graphs, see Mohar (1991, 1997). One example which will be important for spectral clustering is the following:

Proposition 2 (Number of connected components and the spectrum of L) Let G be an undirected graph with non-negative weights. Then the multiplicity k of the eigenvalue 0 of L equals the number of connected components A1, ..., Ak in the graph. The eigenspace of eigenvalue 0 is spanned by the indicator vectors 1_{A1}, ..., 1_{Ak} of those components.

Proof. We start with the case k = 1, that is the graph is connected. Assume that f is an eigenvector with eigenvalue 0. Then we know that

\[ 0 = f'Lf = \sum_{i,j=1}^{n} w_{ij} (f_i - f_j)^2 . \]

As the weights wij are non-negative, this sum can only vanish if all terms wij (fi − fj)^2 vanish. Thus, if two vertices vi and vj are connected (i.e., wij > 0), then fi needs to equal fj. With this argument we can see that f needs to be constant for all vertices which can be connected by a path in the graph. Moreover, as all vertices of a connected component in an undirected graph can be connected by a path, f needs to be constant on the whole connected component. In a graph consisting of only one connected component we thus only have the constant one vector 1 as eigenvector with eigenvalue 0, which obviously is the indicator vector of the connected component.

Now consider the case of k connected components. Without loss of generality we assume that the vertices are ordered according to the connected components they belong to. In this case, the adjacency matrix W has a block diagonal form, and the same is true for the matrix L:

\[ L = \begin{pmatrix} L_1 & & & \\ & L_2 & & \\ & & \ddots & \\ & & & L_k \end{pmatrix} \]

Note that each of the blocks Li is a proper graph Laplacian on its own, namely the Laplacian corresponding to the subgraph of the i-th connected component. As it is the case for all block diagonal matrices, we know that the spectrum of L is given by the union of the spectra of Li, and the corresponding eigenvectors of L are the eigenvectors of Li, filled with 0 at the positions of the other blocks. As each Li is a graph Laplacian of a connected graph, we know that every Li has eigenvalue 0 with multiplicity 1, and the corresponding eigenvector is the constant one vector on the i-th connected component. Thus, the matrix L has as many eigenvalues 0 as there are connected components, and the corresponding eigenvectors are the indicator vectors of the connected components. □

3.2 The normalized graph Laplacians

There are two matrices which are called normalized graph Laplacians in the literature. Both matrices are closely related to each other and are defined as

\[ L_{\mathrm{sym}} := D^{-1/2} L D^{-1/2} = I - D^{-1/2} W D^{-1/2}, \qquad L_{\mathrm{rw}} := D^{-1} L = I - D^{-1} W . \]

We denote the first matrix by Lsym as it is a symmetric matrix, and the second one by Lrw as it is closely related to a random walk. In the following we summarize several properties of Lsym and Lrw. The standard reference for normalized graph Laplacians is Chung (1997).
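Both normalized Laplacians are easy to form explicitly; a minimal sketch (ours, assuming numpy and strictly positive degrees):

    import numpy as np

    def normalized_laplacians(W):
        d = W.sum(axis=1)                        # degrees, assumed strictly positive
        L = np.diag(d) - W                       # unnormalized Laplacian
        D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
        L_sym = D_inv_sqrt @ L @ D_inv_sqrt      # I - D^{-1/2} W D^{-1/2}
        L_rw = np.diag(1.0 / d) @ L              # I - D^{-1} W
        return L_sym, L_rw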


Proposition 3 (Properties of Lsym and Lrw) The normalized Laplacians satisfy the following properties:

1. For every f ∈ ℝ^n we have
\[ f' L_{\mathrm{sym}} f = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} \left( \frac{f_i}{\sqrt{d_i}} - \frac{f_j}{\sqrt{d_j}} \right)^2 . \]
2. λ is an eigenvalue of Lrw with eigenvector u if and only if λ is an eigenvalue of Lsym with eigenvector w = D^{1/2} u.
3. λ is an eigenvalue of Lrw with eigenvector u if and only if λ and u solve the generalized eigenproblem Lu = λDu.
4. 0 is an eigenvalue of Lrw with the constant one vector 1 as eigenvector. 0 is an eigenvalue of Lsym with eigenvector D^{1/2} 1.
5. Lsym and Lrw are positive semi-definite and have n non-negative real-valued eigenvalues 0 = λ1 ≤ ... ≤ λn.

Proof.
Part (1) can be proved similarly to Part (1) of Proposition 1.
Part (2) can be seen immediately by multiplying the eigenvalue equation Lsym w = λw with D^{-1/2} from the left and substituting u = D^{-1/2} w.
Part (3) follows directly by multiplying the eigenvalue equation Lrw u = λu with D from the left.
Part (4): The first statement is obvious as Lrw 1 = 0, the second statement follows from (2).
Part (5): The statement about Lsym follows from (1), and then the statement about Lrw follows from (2). □

As it is the case for the unnormalized graph Laplacian, the multiplicity of the eigenvalue 0 of the normalized graph Laplacian is related to the number of connected components:

Proposition 4 (Number of connected components and spectra of Lsym and Lrw) Let G be an undirected graph with non-negative weights. Then the multiplicity k of the eigenvalue 0 of both Lrw and Lsym equals the number of connected components A1, ..., Ak in the graph. For Lrw, the eigenspace of 0 is spanned by the indicator vectors 1_{Ai} of those components. For Lsym, the eigenspace of 0 is spanned by the vectors D^{1/2} 1_{Ai}.

Proof. The proof is analogous to the one of Proposition 2, using Proposition 3. □

4 Spectral Clustering Algorithms

Now we would like to state the most common spectral clustering algorithms. For references and the history of spectral clustering we refer to Section 9. We assume that our data consists of n “points” x1, ..., xn which can be arbitrary objects. We measure their pairwise similarities sij = s(xi, xj) by some similarity function which is symmetric and non-negative, and we denote the corresponding similarity matrix by S = (sij), i, j = 1, ..., n.

Unnormalized spectral clustering

Input: Similarity matrix S ∈ ℝ^{n×n}, number k of clusters to construct.
• Construct a similarity graph by one of the ways described in Section 2. Let W be its weighted adjacency matrix.
• Compute the unnormalized Laplacian L.
• Compute the first k eigenvectors u1, ..., uk of L.
• Let U ∈ ℝ^{n×k} be the matrix containing the vectors u1, ..., uk as columns.
• For i = 1, ..., n, let yi ∈ ℝ^k be the vector corresponding to the i-th row of U.
• Cluster the points (yi), i = 1, ..., n, in ℝ^k with the k-means algorithm into clusters C1, ..., Ck.
Output: Clusters A1, ..., Ak with Ai = {j | yj ∈ Ci}.
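A compact sketch of this algorithm (our own code, not the authors'; it assumes numpy for the eigenproblem and scikit-learn's KMeans for the last step, with W a precomputed weighted adjacency matrix):

    import numpy as np
    from sklearn.cluster import KMeans

    def unnormalized_spectral_clustering(W, k):
        L = np.diag(W.sum(axis=1)) - W              # unnormalized Laplacian L = D - W
        eigenvalues, eigenvectors = np.linalg.eigh(L)
        U = eigenvectors[:, :k]                     # first k eigenvectors as columns
        # the rows of U are the new representations y_1, ..., y_n
        return KMeans(n_clusters=k, n_init=10).fit_predict(U)

For example, labels = unnormalized_spectral_clustering(knn_graph(X, k=10), 4) would cluster a data matrix X into four groups via the k-nearest neighbor graph helper sketched earlier.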


There are two different versions of normalized spectral clustering, depending on which of the normalized graph Laplacians is used. We name both algorithms after two popular papers; for more references and history please see Section 9.

Normalized spectral clustering according to Shi and Malik (2000)

Input: Similarity matrix S ∈ ℝ^{n×n}, number k of clusters to construct.
• Construct a similarity graph by one of the ways described in Section 2. Let W be its weighted adjacency matrix.
• Compute the unnormalized Laplacian L.
• Compute the first k generalized eigenvectors u1, ..., uk of the generalized eigenproblem Lu = λDu.
• Let U ∈ ℝ^{n×k} be the matrix containing the vectors u1, ..., uk as columns.
• For i = 1, ..., n, let yi ∈ ℝ^k be the vector corresponding to the i-th row of U.
• Cluster the points (yi), i = 1, ..., n, in ℝ^k with the k-means algorithm into clusters C1, ..., Ck.
Output: Clusters A1, ..., Ak with Ai = {j | yj ∈ Ci}.

Note that this algorithm uses the generalized eigenvectors of L, which according to Proposition 3 correspond to the eigenvectors of the matrix Lrw. So in fact, the algorithm works with eigenvectors of the normalized Laplacian Lrw, and hence is called normalized spectral clustering. The next algorithm also uses a normalized Laplacian, but this time the matrix Lsym instead of Lrw. As we will see, this algorithm needs to introduce an additional row normalization step which is not needed in the other algorithms. The reasons will become clear in Section 7.

Normalized spectral clustering according to Ng, Jordan, and Weiss (2002)

Input: Similarity matrix S ∈ ℝ^{n×n}, number k of clusters to construct.
• Construct a similarity graph by one of the ways described in Section 2. Let W be its weighted adjacency matrix.
• Compute the normalized Laplacian Lsym.
• Compute the first k eigenvectors u1, ..., uk of Lsym.
• Let U ∈ ℝ^{n×k} be the matrix containing the vectors u1, ..., uk as columns.
• Form the matrix T ∈ ℝ^{n×k} from U by normalizing the rows to norm 1, that is set t_{ij} = u_{ij} / (\sum_k u_{ik}^2)^{1/2}.
• For i = 1, ..., n, let yi ∈ ℝ^k be the vector corresponding to the i-th row of T.
• Cluster the points (yi), i = 1, ..., n, with the k-means algorithm into clusters C1, ..., Ck.
Output: Clusters A1, ..., Ak with Ai = {j | yj ∈ Ci}.

All three algorithms stated above look rather similar, apart from the fact that they use three different graph Laplacians. In all three algorithms, the main trick is to change the representation of the abstract data points xi to points yi ∈ ℝ^k. It is due to the properties of the graph Laplacians that this change of representation is useful. We will see in the next sections that this change of representation enhances the cluster-properties in the data, so that clusters can be trivially detected in the new representation. In particular, the simple k-means clustering algorithm has no difficulties to detect the clusters in this new representation. Readers not familiar with k-means can read up on this algorithm in numerous text books, for example in Hastie, Tibshirani, and Friedman (2001).
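For comparison with the unnormalized sketch above, here is a corresponding sketch of the Ng, Jordan, and Weiss variant (again our own illustration, assuming numpy and scikit-learn, and strictly positive degrees):

    import numpy as np
    from sklearn.cluster import KMeans

    def njw_spectral_clustering(W, k):
        d = W.sum(axis=1)
        D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
        L_sym = np.eye(len(d)) - D_inv_sqrt @ W @ D_inv_sqrt   # normalized Laplacian L_sym
        _, eigenvectors = np.linalg.eigh(L_sym)
        U = eigenvectors[:, :k]                                # first k eigenvectors
        T = U / np.linalg.norm(U, axis=1, keepdims=True)       # row normalization step
        return KMeans(n_clusters=k, n_init=10).fit_predict(T)

The Shi and Malik variant can be sketched analogously by solving the generalized problem Lu = λDu, for instance with scipy.linalg.eigh(L, D), and clustering the rows of the resulting eigenvector matrix without the row normalization step.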


[Figure 1: Toy example for spectral clustering where the data points have been drawn from a mixture of four Gaussians on ℝ. Left upper corner: histogram of the data. First and second row: eigenvalues and eigenvectors of Lrw and L based on the k-nearest neighbor graph. Third and fourth row: eigenvalues and eigenvectors of Lrw and L based on the fully connected graph. For all plots, we used the Gaussian kernel with σ = 1 as similarity function. See text for more details.]

Before we dive into the theory of spectral clustering, we would like to illustrate its principle on a very simple toy example. This example will be used at several places in this tutorial, and we chose it because it is so simple that the relevant quantities can easily be plotted. This toy data set consists of a random sample of 200 points x1, ..., x200 ∈ ℝ drawn according to a mixture of four Gaussians. The first row of Figure 1 shows the histogram of a sample drawn from this distribution (the x-axis represents the one-dimensional data space). As similarity function on this data set we choose the Gaussian similarity function s(xi, xj) = exp(−|xi − xj|^2 / (2σ^2)) with σ = 1. As similarity graph we consider both the fully connected graph and the 10-nearest neighbor graph. In Figure 1 we show the first eigenvalues and eigenvectors of the unnormalized Laplacian L and the normalized Laplacian Lrw. That is, in the eigenvalue plot we plot i vs. λi (for the moment ignore the dashed line and the different shapes of the eigenvalues in the plots for the unnormalized case; their meaning will be discussed in Section 8.5). In the eigenvector plots of an eigenvector u = (u1, ..., u200)' we plot xi vs. ui (note that in the example chosen xi is simply a real number, hence we can depict it on the x-axis).
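This experiment is easy to reproduce in outline (a sketch of our own; the exact parameters of the Gaussian mixture are not specified in the text, so the values below are merely illustrative, and the graph helpers are the ones sketched after Section 2.2):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    # illustrative mixture of four one-dimensional Gaussians (placeholder parameters)
    means = np.array([2.0, 4.0, 6.0, 8.0])
    X = (rng.choice(means, size=200) + 0.25 * rng.standard_normal(200)).reshape(-1, 1)

    W_knn = knn_graph(X, k=10)                    # 10-nearest neighbor graph
    W_full = gaussian_similarity(X, sigma=1.0)    # fully connected graph

    for W, label in [(W_knn, "knn"), (W_full, "full")]:
        L = np.diag(W.sum(axis=1)) - W            # unnormalized Laplacian
        plt.plot(range(1, 11), np.linalg.eigvalsh(L)[:10], "o", label=label)
    plt.legend(); plt.xlabel("index i"); plt.ylabel("eigenvalue")   # cf. Figure 1
    plt.show()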


The first two rows of Figure 1 show the results based on the 10-nearest neighbor graph. We can see that the first four eigenvalues are 0, and the corresponding eigenvectors are cluster indicator vectors. The reason is that the clusters form disconnected parts in the 10-nearest neighbor graph, in which case the eigenvectors are given as in Propositions 2 and 4. The next two rows show the results for the fully connected graph. As the Gaussian similarity function is always positive, this graph only consists of one connected component. Thus, eigenvalue 0 has multiplicity 1, and the first eigenvector is the constant vector. The following eigenvectors carry the information about the clusters. For example in the unnormalized case (last row), if we threshold the second eigenvector at 0, then the part below 0 corresponds to clusters 1 and 2, and the part above 0 to clusters 3 and 4. Similarly, thresholding the third eigenvector separates clusters 1 and 4 from clusters 2 and 3, and thresholding the fourth eigenvector separates clusters 1 and 3 from clusters 2 and 4. Altogether, the first four eigenvectors carry all the information about the four clusters. In all the cases illustrated in this figure, spectral clustering using k-means on the first four eigenvectors easily detects the correct four clusters.

5 Graph cut point of view

The intuition of clustering is to separate points in different groups according to their similarities. For data given in form of a similarity graph, this problem can be restated as follows: we want to find a partition of the graph such that the edges between different groups have a very low weight (which means that points in different clusters are dissimilar from each other) and the edges within a group have high weight (which means that points within the same cluster are similar to each other). In this section we will see how spectral clustering can be derived as an approximation to such graph partitioning problems.

Given a similarity graph with adjacency matrix W, the simplest and most direct way to construct a partition of the graph is to solve the mincut problem. To define it, please recall the notation W(A, B) := \sum_{i \in A, j \in B} w_{ij} and Ā for the complement of A. For a given number k of subsets, the mincut approach simply consists in choosing a partition A1, ..., Ak which minimizes

\[ \operatorname{cut}(A_1, \dots, A_k) := \frac{1}{2} \sum_{i=1}^{k} W(A_i, \bar{A}_i) . \]

Here we introduce the factor 1/2 for notational consistency, otherwise we would count each edge twice in the cut. In particular for k = 2, mincut is a relatively easy problem and can be solved efficiently, see Stoer and Wagner (1997) and the discussion therein. However, in practice it often does not lead to satisfactory partitions. The problem is that in many cases, the solution of mincut simply separates one individual vertex from the rest of the graph. Of course this is not what we want to achieve in clustering, as clusters should be reasonably large groups of points. One way to circumvent this problem is to explicitly request that the sets A1, ..., Ak are “reasonably large”. The two most common objective functions to encode this are RatioCut (Hagen and Kahng, 1992) and the normalized cut Ncut (Shi and Malik, 2000). In RatioCut, the size of a subset A of a graph is measured by its number of vertices |A|, while in Ncut the size is measured by the weights of its edges vol(A). The definitions are:

\[ \operatorname{RatioCut}(A_1, \dots, A_k) := \frac{1}{2} \sum_{i=1}^{k} \frac{W(A_i, \bar{A}_i)}{|A_i|} = \sum_{i=1}^{k} \frac{\operatorname{cut}(A_i, \bar{A}_i)}{|A_i|} \]
\[ \operatorname{Ncut}(A_1, \dots, A_k) := \frac{1}{2} \sum_{i=1}^{k} \frac{W(A_i, \bar{A}_i)}{\operatorname{vol}(A_i)} = \sum_{i=1}^{k} \frac{\operatorname{cut}(A_i, \bar{A}_i)}{\operatorname{vol}(A_i)} . \]

Note that both objective functions take a small value if the clusters Ai are not too small. In particular, the minimum of the function \sum_{i=1}^{k} (1/|A_i|) is achieved if all |Ai| coincide, and the minimum of \sum_{i=1}^{k} (1/\operatorname{vol}(A_i)) is achieved if all vol(Ai) coincide. So what both objective functions try to achieve is that the clusters are “balanced”, as measured by the number of vertices or edge weights, respectively.
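As a small numerical illustration (our own sketch; the function names are not from the tutorial), these objectives can be evaluated directly from the weight matrix. On the tiny four-vertex graph used in the earlier sketches, the balanced split has a much lower RatioCut than splitting off a single vertex:

    import numpy as np

    def W_between(W, A, B):
        # W(A, B) = sum over i in A, j in B of w_ij
        return W[np.ix_(sorted(A), sorted(B))].sum()

    def ratio_cut(W, partition):
        # RatioCut = 1/2 * sum_i W(A_i, complement of A_i) / |A_i|
        V = set(range(len(W)))
        return 0.5 * sum(W_between(W, A, V - A) / len(A) for A in partition)

    def ncut(W, partition):
        # Ncut = 1/2 * sum_i W(A_i, complement of A_i) / vol(A_i)
        V = set(range(len(W)))
        return 0.5 * sum(W_between(W, A, V - A) / W[sorted(A), :].sum() for A in partition)

    W = np.array([[0.0, 1.0, 0.1, 0.0],
                  [1.0, 0.0, 0.0, 0.1],
                  [0.1, 0.0, 0.0, 1.0],
                  [0.0, 0.1, 1.0, 0.0]])
    print(ratio_cut(W, [{0, 1}, {2, 3}]))    # 0.1   (balanced split)
    print(ratio_cut(W, [{0}, {1, 2, 3}]))    # ~0.73 (isolating one vertex is penalized)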


Unfortunately, introducing balancing conditions makes the previously simple to solve mincut problem become NP hard, see Wagner and Wagner (1993) for a discussion. Spectral clustering is a way to solve relaxed versions of those problems. We will see that relaxing Ncut leads to normalized spectral clustering, while relaxing RatioCut leads to unnormalized spectral clustering (see also the tutorial slides by Ding (2004)).

5.1 Approximating RatioCut for k = 2

Let us start with the case of RatioCut and k = 2, because the relaxation is easiest to understand in this setting. Our goal is to solve the optimization problem

\[ \min_{A \subset V} \operatorname{RatioCut}(A, \bar{A}) . \qquad (1) \]

We first rewrite the problem in a more convenient form. Given a subset A ⊂ V we define the vector f = (f1, ..., fn)' ∈ ℝ^n with entries

\[ f_i = \begin{cases} \sqrt{|\bar{A}| / |A|} & \text{if } v_i \in A \\ -\sqrt{|A| / |\bar{A}|} & \text{if } v_i \in \bar{A} . \end{cases} \qquad (2) \]

Now the RatioCut objective function can be conveniently rewritten using the unnormalized graph Laplacian. This is due to the following calculation:

\[ f'Lf = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} (f_i - f_j)^2
= \frac{1}{2} \sum_{i \in A,\, j \in \bar{A}} w_{ij} \left( \sqrt{\frac{|\bar{A}|}{|A|}} + \sqrt{\frac{|A|}{|\bar{A}|}} \right)^2
+ \frac{1}{2} \sum_{i \in \bar{A},\, j \in A} w_{ij} \left( -\sqrt{\frac{|\bar{A}|}{|A|}} - \sqrt{\frac{|A|}{|\bar{A}|}} \right)^2 \]
\[ = \operatorname{cut}(A, \bar{A}) \left( \frac{|\bar{A}|}{|A|} + \frac{|A|}{|\bar{A}|} + 2 \right)
= \operatorname{cut}(A, \bar{A}) \left( \frac{|A| + |\bar{A}|}{|A|} + \frac{|A| + |\bar{A}|}{|\bar{A}|} \right)
= |V| \cdot \operatorname{RatioCut}(A, \bar{A}) . \]

Additionally, we have

\[ \sum_{i=1}^{n} f_i = \sum_{i \in A} \sqrt{\frac{|\bar{A}|}{|A|}} - \sum_{i \in \bar{A}} \sqrt{\frac{|A|}{|\bar{A}|}} = |A| \sqrt{\frac{|\bar{A}|}{|A|}} - |\bar{A}| \sqrt{\frac{|A|}{|\bar{A}|}} = 0 . \]

In other words, the vector f as defined in Equation (2) is orthogonal to the constant one vector 1. Finally, note that f satisfies

\[ \|f\|^2 = \sum_{i=1}^{n} f_i^2 = |A| \frac{|\bar{A}|}{|A|} + |\bar{A}| \frac{|A|}{|\bar{A}|} = |\bar{A}| + |A| = n . \]

Altogether we can see that the problem of minimizing (1) can be equivalently rewritten as

\[ \min_{A \subset V} f'Lf \quad \text{subject to } f \perp \mathbb{1}, \ f_i \text{ as defined in Eq. (2)}, \ \|f\| = \sqrt{n} . \qquad (3) \]

This is a discrete optimization problem as the entries of the solution vector f are only allowed to take two particular values, and of course it is still NP hard. The most obvious relaxation in this setting is
