non-zero entry per row. After row-normalization, the matrix T in the algorithm of Ng et al. (2002) then consists of the cluster indicator vectors. Note, however, that this might not always work out correctly in practice. Assume that the i-th row of U contains two small entries, say $\tilde{u}_{i,1} = \varepsilon_1$ and $\tilde{u}_{i,2} = \varepsilon_2$. If we now normalize the i-th row of U, both $\varepsilon_1$ and $\varepsilon_2$ will be multiplied by the factor $1/\sqrt{\varepsilon_1^2 + \varepsilon_2^2}$ and become rather large. We now run into a similar problem as described above: both points are likely to be classified into the same cluster, even though they belong to different clusters. This argument shows that spectral clustering using the matrix $L_{\mathrm{sym}}$ can be problematic if the eigenvectors contain particularly small entries. On the other hand, note that such small entries in the eigenvectors only occur if some of the vertices have particularly low degrees (as the eigenvectors of $L_{\mathrm{sym}}$ are given by $D^{1/2}\mathbb{1}_{A_i}$). One could argue that in such a case the data point should be considered an outlier anyway, and then it does not really matter in which cluster the point will end up.

To summarize, the conclusion is that both unnormalized spectral clustering and normalized spectral clustering with $L_{\mathrm{rw}}$ are well justified by the perturbation theory approach. Normalized spectral clustering with $L_{\mathrm{sym}}$ can also be justified by perturbation theory, but it should be treated with more care if the graph contains vertices with very low degrees.

8 Practical details

In this section we briefly discuss some of the issues which come up when actually implementing spectral clustering. There are several choices to be made and parameters to be set. However, the discussion in this section is mainly meant to raise awareness of the general problems which can occur. For thorough studies on the behavior of spectral clustering for various real-world tasks we refer to the literature.

8.1 Constructing the similarity graph

Constructing the similarity graph for spectral clustering is not a trivial task, and little is known on the theoretical implications of the various constructions.

The similarity function itself
Before we can even think about constructing a similarity graph, we need to define a similarity function on the data. As we are going to construct a neighborhood graph later on, we need to make sure that the local neighborhoods induced by this similarity function are "meaningful". This means that we need to be sure that points which are considered to be "very similar" by the similarity function are also closely related in the application the data comes from. For example, when constructing a similarity function between text documents it makes sense to check whether documents with a high similarity score indeed belong to the same text category. The global "long-range" behavior of the similarity function is not so important for spectral clustering: it does not really matter whether two data points have similarity score 0.01 or 0.001, say, as we will not connect those two points in the similarity graph anyway. In the common case where the data points live in the Euclidean space $\mathbb{R}^d$, a reasonable default candidate is the Gaussian similarity function $s(x_i, x_j) = \exp(-\|x_i - x_j\|^2/(2\sigma^2))$ (but of course we need to choose the parameter $\sigma$ here, see below). Ultimately, the choice of the similarity function depends on the domain the data comes from, and no general advice can be given.
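As an illustration, here is a minimal sketch (in Python with NumPy; the function name and the choice of $\sigma$ in the example are ours, purely for illustration) of how the Gaussian similarity matrix could be computed for a set of points in $\mathbb{R}^d$:

```python
import numpy as np

def gaussian_similarity_matrix(X, sigma):
    """Pairwise Gaussian similarities s(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2)).

    X is an (n, d) array of n points in R^d; sigma is the user-chosen width parameter.
    """
    # Squared Euclidean distances between all pairs of rows of X (via broadcasting).
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

# Example: 5 random points in R^2 with sigma = 1.0 (an arbitrary choice here).
X = np.random.default_rng(0).normal(size=(5, 2))
S = gaussian_similarity_matrix(X, sigma=1.0)
```

How to pick $\sigma$ is exactly the kind of parameter question discussed below.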
Which type of similarity graph
The next choice one has to make concerns the type of the graph one wants to use, such as the k-nearest neighbor or the $\varepsilon$-neighborhood graph. Let us illustrate the behavior of the different graphs using the toy example presented in Figure 3. As underlying distribution we choose a distribution on $\mathbb{R}^2$ with
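Since the toy example itself is not reproduced here, the following minimal sketch (Python/NumPy; the function names are our own, not from the tutorial) only illustrates how the two graph types just mentioned could be built: the $\varepsilon$-neighborhood graph connects points whose pairwise distance is below $\varepsilon$ and is usually left unweighted, while the k-nearest-neighbor graph connects each point to its k most similar points, weights the edges by the similarities, and is then symmetrized.

```python
import numpy as np

def eps_neighborhood_graph(X, eps):
    """0/1 adjacency of the eps-neighborhood graph: connect points at distance < eps.

    Edges of this graph are usually considered unweighted.
    """
    dists = np.sqrt(np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
    W = (dists < eps).astype(float)
    np.fill_diagonal(W, 0.0)   # no self-loops
    return W

def knn_graph(S, k):
    """Weighted adjacency of the (symmetric) k-nearest-neighbor graph.

    S is a precomputed similarity matrix (e.g. Gaussian similarities); i and j are
    connected if j is among the k points most similar to i, or vice versa, and the
    edge is weighted by the similarity s(x_i, x_j).
    """
    n = S.shape[0]
    W = np.zeros_like(S)
    for i in range(n):
        order = np.argsort(S[i])[::-1]                # most similar first
        neighbors = [j for j in order if j != i][:k]  # exclude the point itself
        W[i, neighbors] = S[i, neighbors]
    return np.maximum(W, W.T)                         # symmetrize ("or" convention)
```

Using np.minimum(W, W.T) instead of np.maximum would give the mutual k-nearest-neighbor graph, in which two points are connected only if each is among the k nearest neighbors of the other.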