itself, for example the Gaussian similarity function, then the scale of the similarity function should be chosen such that the resulting graph has similar properties as a corresponding k-nearest neighbor or ε-neighborhood graph would have. One needs to make sure that for most data points the set of neighbors with a similarity significantly larger than 0 is "not too small and not too large". In particular, for the Gaussian similarity function several rules of thumb are frequently used. For example, one can choose σ on the order of the mean distance of a point to its k-th nearest neighbor, where k is chosen similarly as above (e.g., k ∼ log(n) + 1). Another way is to determine ε by the minimal spanning tree heuristic described above, and then choose σ = ε. But note that all those rules of thumb are very ad hoc, and depending on the given data at hand and its distribution of inter-point distances they might not work at all.

In general, experience shows that spectral clustering can be quite sensitive to changes in the similarity graph and to the choice of its parameters. Unfortunately, to our knowledge there has been no systematic study which investigates the effects of the similarity graph and its parameters on clustering and comes up with well-justified rules of thumb. None of the recommendations above is based on firm theoretical ground. Finding rules which have a theoretical justification should be considered an interesting and important topic for future research.
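As a concrete illustration of the first rule of thumb, the following is a minimal sketch (not part of the original text): it sets σ to the mean distance of a point to its k-th nearest neighbor with k ∼ log(n) + 1, and builds the corresponding fully connected Gaussian similarity matrix. The function names and the use of scipy are assumptions made for the example.

```python
# Sketch of the sigma rule of thumb: sigma ~ mean distance to the k-th nearest
# neighbor, with k ~ log(n) + 1. Names and parameter choices are illustrative.
import numpy as np
from scipy.spatial.distance import cdist


def sigma_from_knn(X, k=None):
    """Candidate sigma: mean distance of a point to its k-th nearest neighbor."""
    n = X.shape[0]
    if k is None:
        k = int(np.log(n)) + 1            # k ~ log(n) + 1, as suggested above
    k = min(k, n - 1)                     # guard for very small data sets
    D = cdist(X, X)                       # full pairwise distances (O(n^2) memory)
    D_sorted = np.sort(D, axis=1)         # column 0 is the distance to the point itself
    return D_sorted[:, k].mean()          # mean distance to the k-th nearest neighbor


def gaussian_similarity(X, sigma):
    """Fully connected similarity matrix with the Gaussian similarity function."""
    D = cdist(X, X)
    return np.exp(-D**2 / (2 * sigma**2))
```

As stressed above, such a choice is only a heuristic; on data with a very uneven distribution of inter-point distances it may still produce a graph that is far too dense or too sparse.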
8.2 Computing the eigenvectors

To implement spectral clustering in practice one has to compute the first k eigenvectors of a potentially large graph Laplace matrix. Luckily, if we use the k-nearest neighbor graph or the ε-neighborhood graph, then all those matrices are sparse. Efficient methods exist to compute the first eigenvectors of sparse matrices, the most popular ones being the power method or Krylov subspace methods such as the Lanczos method (Golub and Van Loan, 1996). The speed of convergence of those algorithms depends on the size of the eigengap (also called spectral gap) γ_k = |λ_k − λ_{k+1}|. The larger this eigengap is, the faster the algorithms computing the first k eigenvectors converge.

Note that a general problem occurs if one of the eigenvalues under consideration has multiplicity larger than one. For example, in the ideal situation of k disconnected clusters, the eigenvalue 0 has multiplicity k. As we have seen, in this case the eigenspace is spanned by the k cluster indicator vectors. But unfortunately, the vectors computed by the numerical eigensolvers do not necessarily converge to those particular vectors. Instead they just converge to some orthonormal basis of the eigenspace, and it usually depends on implementation details to which basis exactly the algorithm converges. But this is not so bad after all. Note that all vectors in the space spanned by the cluster indicator vectors 𝟙_{A_i} have the form u = ∑_{i=1}^k a_i 𝟙_{A_i} for some coefficients a_i, that is, they are piecewise constant on the clusters. So the vectors returned by the eigensolvers still encode the information about the clusters, which can then be used by the k-means algorithm to reconstruct the clusters.
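The steps discussed in this section fit in a few lines of code. The following is a minimal sketch under stated assumptions, not a tuned implementation: it uses a plain symmetrized k-nearest-neighbor graph, the unnormalized Laplacian L = D − W, and scipy's eigsh (ARPACK, an implicitly restarted Lanczos iteration, i.e. a Krylov subspace method of the kind mentioned above); kneighbors_graph and KMeans are assumed scikit-learn helpers, and all parameter values are illustrative.

```python
# Sketch of the computational pipeline: sparse kNN graph -> sparse Laplacian ->
# first k eigenvectors via a Lanczos-type solver -> k-means on the eigenvector rows.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh          # ARPACK (Lanczos-type Krylov method)
from sklearn.cluster import KMeans
from sklearn.neighbors import kneighbors_graph


def spectral_clustering_sketch(X, k, n_neighbors=10):
    # Sparse, symmetrized k-nearest-neighbor graph (unweighted for simplicity).
    A = kneighbors_graph(X, n_neighbors=n_neighbors, mode="connectivity")
    W = 0.5 * (A + A.T)

    # Unnormalized graph Laplacian L = D - W (still sparse).
    degrees = np.asarray(W.sum(axis=1)).ravel()
    L = sp.diags(degrees) - W

    # First k eigenvectors (smallest eigenvalues, which="SM"). How fast this
    # converges depends on the eigengap |lambda_k - lambda_{k+1}|.
    eigenvalues, U = eigsh(L, k=k, which="SM")

    # The solver returns *some* orthonormal basis of the relevant eigenspace; its
    # rows are (approximately) piecewise constant on the clusters, so k-means on
    # the rows recovers the clustering.
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(U)
    return labels, eigenvalues
```

This also illustrates the multiplicity issue discussed above: even when the eigensolver returns an arbitrary orthonormal basis of the eigenspace rather than the cluster indicator vectors themselves, the final k-means step on the rows of U still recovers the clusters.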
8.3 The number of clusters

Choosing the number k of clusters is a general problem for all clustering algorithms, and a variety of more or less successful methods have been devised for this problem. In model-based clustering settings there exist well-justified criteria to choose the number of clusters from the data. Those criteria are usually based on the log-likelihood of the data, which can then be treated in a frequentist or Bayesian way; for examples see Fraley and Raftery (2002). In settings where no or few assumptions on the underlying model are made, a large variety of different indices can be used to pick the number of clusters. Examples range from ad-hoc measures such as the ratio of within-cluster and between-cluster similarities, over information-theoretic criteria (Still and Bialek, 2004), the gap statistic (Tibshirani, Walther, and Hastie, 2001), to stability approaches (Ben-Hur, Elisseeff, and Guyon, 2002; Lange, Roth,