different context as a quality criterion for spectral clustering, namely when choosing the number k of clusters to construct.

If the perturbation H is too large or the eigengap is too small, we might not find a set S1 such that both the first k eigenvalues of L and L̃ are contained in S1. In this case, we need to make a compromise by choosing the set S1 to contain the first k eigenvalues of L, but maybe a few more or fewer eigenvalues of L̃. The statement of the theorem then becomes weaker in the sense that either we do not compare the eigenspaces corresponding to the first k eigenvectors of L and L̃, but the eigenspaces corresponding to the first k eigenvectors of L and the first k̃ eigenvectors of L̃ (where k̃ is the number of eigenvalues of L̃ contained in S1); or it can happen that δ becomes so small that the bound on the distance d(V1, Ṽ1) blows up so much that it becomes useless.

7.2 Comments about the perturbation approach

A bit of caution is needed when using perturbation theory arguments to justify clustering algorithms based on eigenvectors of matrices. In general, any block diagonal symmetric matrix has the property that there exists a basis of eigenvectors which are zero outside the individual blocks and real-valued within the blocks. For example, based on this argument several authors use the eigenvectors of the similarity matrix S or adjacency matrix W to discover clusters. However, being block diagonal in the ideal case of completely separated clusters can be considered as a necessary condition for a successful use of eigenvectors, but not a sufficient one. At least two more properties should be satisfied:

First, we need to make sure that the order of the eigenvalues and eigenvectors is meaningful. In the case of the Laplacians this is always true, as we know that any connected component possesses exactly one eigenvector which has eigenvalue 0. Hence, if the graph has k connected components and we take the first k eigenvectors of the Laplacian, then we know that we have exactly one eigenvector per component. However, this might not be the case for other matrices such as S or W. For example, it could be the case that the two largest eigenvalues of a block diagonal similarity matrix S come from the same block. In such a situation, if we take the first k eigenvectors of S, some blocks will be represented several times, while there are other blocks which we will miss completely (unless we take certain precautions). This is the reason why using the eigenvectors of S or W for clustering should be discouraged.
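To make the ordering problem concrete, the following numpy sketch (a hypothetical toy example with numbers chosen purely for illustration, not taken from the tutorial) builds a block diagonal similarity matrix with two connected components in which the two largest eigenvalues of S both come from the first block, while the unnormalized Laplacian of the same graph has exactly one zero eigenvalue per component:

```python
import numpy as np

# Hypothetical block diagonal similarity matrix with two connected components:
# the first block consists of two tightly linked pairs that are only loosely
# connected to each other, the second block is a single weakly linked pair.
B1 = np.array([[1.0, 0.9, 0.1, 0.1],
               [0.9, 1.0, 0.1, 0.1],
               [0.1, 0.1, 1.0, 0.9],
               [0.1, 0.1, 0.9, 1.0]])
B2 = np.array([[1.0, 0.5],
               [0.5, 1.0]])
S = np.block([[B1, np.zeros((4, 2))],
              [np.zeros((2, 4)), B2]])

# Eigenvalues of S in decreasing order: 2.1, 1.7, 1.5, 0.5, 0.1, 0.1.
# The two largest ones both belong to the first block, so the first two
# eigenvectors of S are supported on block 1 only and block 2 is missed.
vals_S, vecs_S = np.linalg.eigh(S)
print(np.round(vals_S[::-1], 2))

# Unnormalized Laplacian of the same graph (taking S without its diagonal as
# the weight matrix W): exactly one eigenvalue 0 per connected component, with
# the component indicator vectors spanning the corresponding eigenspace.
W = S - np.diag(np.diag(S))
L = np.diag(W.sum(axis=1)) - W
vals_L, _ = np.linalg.eigh(L)
print(np.round(vals_L[:3], 2))   # two (numerically) zero eigenvalues, then 0.4
```

With k = 2 one would thus pick two eigenvectors of S that both describe the internal structure of the first block and none that identifies the second block, whereas the zero eigenvalues of L are in one-to-one correspondence with the connected components.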
The second property is that in the ideal case, the entries of the eigenvectors on the components should be “safely bounded away” from 0. Assume that an eigenvector on the first connected component has an entry u1,i > 0 at position i. In the ideal case, the fact that this entry is non-zero indicates that the corresponding point i belongs to the first cluster. The other way round, if a point j does not belong to cluster 1, then in the ideal case it should be the case that u1,j = 0. Now consider the same situation, but with perturbed data. The perturbed eigenvector ũ will usually not have any zero components any more; but if the noise is not too large, then perturbation theory tells us that the entries ũ1,i and ũ1,j are still “close” to their original values u1,i and u1,j. So both entries ũ1,i and ũ1,j will take some small values, say ε1 and ε2. In practice, if those values are very small it is unclear how we should interpret this situation. Either we believe that small entries in ũ indicate that the points do not belong to the first cluster (which then misclassifies the data point i), or we think that the entries already indicate class membership and classify both points to the first cluster (which misclassifies point j).
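This ambiguity can be illustrated with the same kind of toy matrix (again a hypothetical example, with an arbitrary random perturbation standing in for the noise term H): the leading eigenvector of the ideal block diagonal similarity matrix is exactly zero on the second component, whereas after a small symmetric perturbation those entries are merely small, and it is not obvious below which threshold they should stop indicating cluster membership:

```python
import numpy as np

rng = np.random.default_rng(0)

# Same hypothetical block diagonal similarity matrix as in the sketch above.
B1 = np.array([[1.0, 0.9, 0.1, 0.1],
               [0.9, 1.0, 0.1, 0.1],
               [0.1, 0.1, 1.0, 0.9],
               [0.1, 0.1, 0.9, 1.0]])
B2 = np.array([[1.0, 0.5],
               [0.5, 1.0]])
S = np.block([[B1, np.zeros((4, 2))],
              [np.zeros((2, 4)), B2]])

# Small symmetric perturbation playing the role of the noise.
H = 0.02 * rng.standard_normal((6, 6))
H = (H + H.T) / 2

def top_eigenvector(M):
    """Eigenvector belonging to the largest eigenvalue of a symmetric matrix."""
    _, vecs = np.linalg.eigh(M)
    return vecs[:, -1]

u_ideal = top_eigenvector(S)      # exactly zero on the last two points (block 2)
u_pert = top_eigenvector(S + H)   # small but non-zero on the last two points

print(np.round(np.abs(u_ideal), 3))
print(np.round(np.abs(u_pert), 3))
```

Deciding whether entries of this size mean “does not belong to the first cluster” or “already belongs to the first cluster” is exactly the interpretation problem described above.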
For both matrices L and Lrw, the eigenvectors in the ideal situation are indicator vectors, so the second problem described above cannot occur. However, this is not true for the matrix Lsym, which is used in the normalized spectral clustering algorithm of Ng et al. (2002). Even in the ideal case, the eigenvectors of this matrix are given as D^{1/2}𝟙_{A_i}. If the degrees of the vertices differ a lot, and in particular if there are vertices which have a very low degree, the corresponding entries in the eigenvectors are very small. To counteract the problem described above, the row-normalization step in the algorithm of Ng et al. (2002) comes into play. In the ideal case, the matrix U in the algorithm has exactly one