maps the vertices $v_i$ of the graph on points $z_i \in \mathbb{R}^n$ such that the Euclidean distances between the points $z_i$ coincide with the commute distances on the graph. This works as follows. As the matrix $L^\dagger$ is positive semi-definite and symmetric, it induces an inner product on $\mathbb{R}^n$ (or, to be more formal, it induces an inner product on the subspace of $\mathbb{R}^n$ which is perpendicular to the vector $\mathbb{1}$). Now choose $z_i$ as the point in $\mathbb{R}^n$ corresponding to the $i$-th row of the matrix $U(\Lambda^\dagger)^{1/2}$. Then, by Proposition 6 and by the construction of $L^\dagger$ we have that $\langle z_i, z_j \rangle = e_i' L^\dagger e_j$ and $c_{ij} = \operatorname{vol}(V)\,\|z_i - z_j\|^2$.

The embedding used in unnormalized spectral clustering is related to the commute time embedding, but not identical. In spectral clustering, we map the vertices of the graph on the rows $y_i$ of the matrix $U$, while the commute time embedding maps the vertices on the rows $z_i$ of the matrix $U(\Lambda^\dagger)^{1/2}$. That is, compared to the entries of $y_i$, the entries of $z_i$ are additionally scaled by $(\lambda_i^\dagger)^{1/2}$, the square roots of the inverse eigenvalues of $L$. Moreover, in spectral clustering we only take the first $k$ columns of the matrix, while the commute time embedding takes all columns. Several authors now try to justify why $y_i$ and $z_i$ are not so different after all, and state, somewhat hand-wavingly, that because spectral clustering constructs clusters based on the Euclidean distances between the $y_i$, it can be interpreted as building clusters of the vertices in the graph based on the commute distance. However, note that both approaches can differ considerably. For example, in the optimal case where the graph consists of $k$ disconnected components, the first $k$ eigenvalues of $L$ are 0 according to Proposition 2, and the first $k$ columns of $U$ consist of the cluster indicator vectors. However, the first $k$ columns of the matrix $U(\Lambda^\dagger)^{1/2}$ consist of zeros only, as the first $k$ diagonal elements of $\Lambda^\dagger$ are 0. In this case, the information contained in the first $k$ columns of $U$ is completely ignored in the matrix $U(\Lambda^\dagger)^{1/2}$, and all the non-zero elements of $U(\Lambda^\dagger)^{1/2}$, which can be found in columns $k+1$ to $n$, are not taken into account in spectral clustering, which discards all those columns. On the other hand, those problems do not occur if the underlying graph is connected. In this case, the only eigenvector with eigenvalue 0 is the constant one vector, which can be ignored in both cases. The eigenvectors corresponding to small eigenvalues $\lambda_i$ of $L$ are then stressed in the matrix $U(\Lambda^\dagger)^{1/2}$, as they are multiplied by $(\lambda_i^\dagger)^{1/2} = 1/\sqrt{\lambda_i}$. In such a situation, it might be true that the commute time embedding and the spectral embedding do similar things.

All in all, it seems that the commute time distance can be a helpful intuition, but without making further assumptions there is only a rather loose relation between spectral clustering and the commute distance. It might be possible that those relations can be tightened, for example if the similarity function is strictly positive definite. However, we have not yet seen a precise mathematical statement about this.
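To make the two embeddings concrete, here is a minimal numerical sketch; the weight matrix `W` and all variable names are invented for illustration and are not taken from the text. It builds the unnormalized Laplacian of a small connected graph, forms the commute time embedding from the rows of $U(\Lambda^\dagger)^{1/2}$, checks the identity $c_{ij} = \operatorname{vol}(V)\|z_i - z_j\|^2$ against the pseudoinverse formula, and keeps only the first $k$ columns of $U$ as the spectral embedding for comparison.

```python
import numpy as np

# Toy symmetric weight matrix of a small connected graph (made up for illustration).
W = np.array([[0. , 1. , 1. , 0.1, 0. ],
              [1. , 0. , 1. , 0. , 0.1],
              [1. , 1. , 0. , 0. , 0. ],
              [0.1, 0. , 0. , 0. , 1. ],
              [0. , 0.1, 0. , 1. , 0. ]])

D = np.diag(W.sum(axis=1))   # degree matrix
L = D - W                    # unnormalized graph Laplacian
vol = W.sum()                # vol(V): sum of all vertex degrees

# Eigendecomposition L = U diag(lambda) U' and pseudoinverse of the eigenvalues.
lam, U = np.linalg.eigh(L)   # eigenvalues in ascending order
lam_dagger = np.zeros_like(lam)
lam_dagger[lam > 1e-10] = 1.0 / lam[lam > 1e-10]

# Commute time embedding: z_i is the i-th row of U (Lambda^dagger)^{1/2},
# i.e. all n columns are kept, rescaled by the square roots of the inverse eigenvalues.
Z = U @ np.diag(np.sqrt(lam_dagger))

# Check c_ij = vol(V) ||z_i - z_j||^2 against c_ij = vol(V) (l_ii + l_jj - 2 l_ij),
# where l_ij denotes the entries of the pseudoinverse L^dagger.
L_dagger = np.linalg.pinv(L)
i, j = 0, 3
print(vol * np.sum((Z[i] - Z[j]) ** 2))                              # via the embedding
print(vol * (L_dagger[i, i] + L_dagger[j, j] - 2 * L_dagger[i, j]))  # via L^dagger

# Spectral embedding for comparison: only the first k columns of U, without rescaling.
k = 2
Y = U[:, :k]
```

In this toy example the two printed values agree up to numerical precision, while the rows of `Y` and `Z` generally differ, which reflects the discussion above.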
7 Perturbation theory point of view

Perturbation theory studies the question of how eigenvalues and eigenvectors of a matrix $A$ change if we add a small perturbation $H$, that is, we consider the perturbed matrix $\tilde{A} := A + H$. Most perturbation theorems state that a certain distance between eigenvalues or eigenvectors of $A$ and $\tilde{A}$ is bounded by a constant times a norm of $H$. The constant usually depends on which eigenvalue we are looking at, and how far this eigenvalue is separated from the rest of the spectrum (for a formal statement see below). The justification of spectral clustering is then the following: Let us first consider the "ideal case" where the between-cluster similarity is exactly 0. We have seen in Section 3 that then the first $k$ eigenvectors of $L$ or $L_{rw}$ are the indicator vectors of the clusters. In this case, the points $y_i \in \mathbb{R}^k$ constructed in the spectral clustering algorithms have the form $(0, \ldots, 0, 1, 0, \ldots, 0)'$, where the position of the 1 indicates the connected component this point belongs to. In particular, all $y_i$ belonging to the same connected component coincide. The $k$-means algorithm will trivially find the correct partition by placing a center point on each of the points $(0, \ldots, 0, 1, 0, \ldots, 0)' \in \mathbb{R}^k$.
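As a rough numerical illustration of the ideal and nearly ideal cases (the block-structured similarity matrices and all names below are invented for this sketch, not taken from the text), one can check that in the ideal case the rows of the first $k$ eigenvectors of $L$ coincide within each connected component, and that a small between-cluster perturbation $H$ only moves them slightly. Note that a numerical eigensolver may return an arbitrary orthonormal basis of the null space of $L$, so the rows coincide within components without being literal unit vectors $(0, \ldots, 0, 1, 0, \ldots, 0)'$.

```python
import numpy as np

rng = np.random.default_rng(0)

def first_k_eigenvectors(W, k):
    """Eigenvalues and first k eigenvectors (columns of U) of L = D - W."""
    L = np.diag(W.sum(axis=1)) - W   # unnormalized Laplacian
    lam, U = np.linalg.eigh(L)       # eigenvalues in ascending order
    return lam[:k], U[:, :k]

# Ideal case: two connected components, between-cluster similarity exactly 0.
A = rng.uniform(0.5, 1.0, size=(3, 3))
B = rng.uniform(0.5, 1.0, size=(3, 3))
W_ideal = np.zeros((6, 6))
W_ideal[:3, :3] = (A + A.T) / 2
W_ideal[3:, 3:] = (B + B.T) / 2
np.fill_diagonal(W_ideal, 0.0)

lam, Y = first_k_eigenvectors(W_ideal, k=2)
print(lam)   # both eigenvalues are 0 up to numerical precision
print(Y)     # the rows y_i coincide within each connected component

# Nearly ideal case: add a small symmetric between-cluster perturbation H.
H = np.zeros_like(W_ideal)
H[:3, 3:] = 0.01
H = H + H.T
lam_pert, Y_pert = first_k_eigenvectors(W_ideal + H, k=2)
print(Y_pert)   # rows are close to the ideal piecewise-constant structure
```

In this toy example, running $k$-means on the rows of `Y_pert` recovers the two blocks, which is the behaviour that the perturbation argument below makes precise.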
In a "nearly ideal case" where we still have distinct clusters, but the between-cluster similarity is not exactly 0, we consider the Laplacian matrices to be perturbed versions of the ones of the ideal case. Perturbation theory then tells us that the eigenvectors will be very close to the ideal indicator vectors. The points $y_i$ might not