similarity matrix is not a sparse matrix. As a general recommendation we suggest working with the k-nearest neighbor graph as the first choice. It is simple to work with, results in a sparse adjacency matrix W, and in our experience is less vulnerable to unsuitable choices of parameters than the other graphs.

The parameters of the similarity graph

Once one has decided on the type of the similarity graph, one has to choose its connectivity parameter k or ε, respectively. Unfortunately, barely any theoretical results are known to guide us in this task. In general, if the similarity graph contains more connected components than the number of clusters we ask the algorithm to detect, then spectral clustering will trivially return connected components as clusters. Unless one is perfectly sure that those connected components are the correct clusters, one should make sure that the similarity graph is connected, or only consists of "few" connected components and very few or no isolated vertices. There are many theoretical results on how connectivity of random graphs can be achieved, but all those results only hold in the limit for the sample size n → ∞. For example, it is known that for n data points drawn i.i.d. from some underlying density with connected support in ℝ^d, the k-nearest neighbor graph and the mutual k-nearest neighbor graph will be connected if we choose k on the order of log(n) (e.g., Brito, Chavez, Quiroz, and Yukich, 1997). Similar arguments show that the parameter ε in the ε-neighborhood graph has to be chosen as (log(n)/n)^d to guarantee connectivity in the limit (Penrose, 1999). While being of theoretical interest, all those results do not really help us in choosing k on a finite sample.
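The asymptotic k ≈ log(n) rule can still serve as a practical starting point: build the graph and simply count its connected components. The following is a minimal sketch, assuming Python with NumPy and SciPy (neither is mentioned in the text); the helper name `knn_graph` is hypothetical:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def knn_graph(X, k):
    """Symmetric k-nearest neighbor graph as a sparse 0/1 adjacency matrix."""
    n = len(X)
    # query k+1 neighbors, since each point is returned as its own nearest neighbor
    _, idx = cKDTree(X).query(X, k=k + 1)
    rows = np.repeat(np.arange(n), k)
    cols = idx[:, 1:].ravel()            # drop the self-neighbor column
    A = csr_matrix((np.ones(n * k), (rows, cols)), shape=(n, n))
    return A.maximum(A.T)                # connect i and j if either is a kNN of the other

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))            # toy sample; any point cloud works here
k = int(np.ceil(np.log(len(X))))         # k on the order of log(n); here k = 7
W = knn_graph(X, k)
n_comp, _ = connected_components(W, directed=False)
print(n_comp)                            # few (ideally one) connected components
```

If `n_comp` is much larger than the number of clusters one wants to detect, k should be increased before running spectral clustering.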
Now let us give some rules of thumb. When working with the k-nearest neighbor graph, the connectivity parameter should be chosen such that the resulting graph is connected, or at least has significantly fewer connected components than the number of clusters we want to detect. For small or medium-sized graphs this can be tried out "by foot". For very large graphs, a first approximation could be to choose k on the order of log(n), as suggested by the asymptotic connectivity results.

For the mutual k-nearest neighbor graph, we have to admit that we are a bit lost for rules of thumb. The advantage of the mutual k-nearest neighbor graph over the standard k-nearest neighbor graph is that it tends not to connect areas of different density. While this can be good if there are clear clusters induced by separate high-density areas, it can hurt in less obvious situations, as disconnected parts of the graph will always be chosen as clusters by spectral clustering. Very generally, one can observe that the mutual k-nearest neighbor graph has far fewer edges than the standard k-nearest neighbor graph for the same parameter k. This suggests choosing k significantly larger for the mutual k-nearest neighbor graph than one would for the standard k-nearest neighbor graph. However, to take advantage of the property that the mutual k-nearest neighbor graph does not connect regions of different density, it would be necessary to allow for several "meaningful" disconnected parts of the graph. Unfortunately, we do not know of any general heuristic to choose the parameter k such that this can be achieved.

For the ε-neighborhood graph, we suggest choosing ε such that the resulting graph is safely connected.
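The difference between the two graph constructions is easy to check empirically: the standard kNN graph takes the union of the directed kNN relations, the mutual graph their intersection. A small sketch, again assuming NumPy and SciPy (the helper name `knn_directed` is hypothetical):

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.sparse import csr_matrix

def knn_directed(X, k):
    """Directed kNN adjacency: entry (i, j) is 1 iff j is among the k nearest neighbors of i."""
    n = len(X)
    _, idx = cKDTree(X).query(X, k=k + 1)   # k+1: each point is its own nearest neighbor
    rows = np.repeat(np.arange(n), k)
    cols = idx[:, 1:].ravel()
    return csr_matrix((np.ones(n * k), (rows, cols)), shape=(n, n))

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
A = knn_directed(X, k=5)
standard = A.maximum(A.T)   # standard kNN graph: edge if EITHER point lists the other
mutual = A.minimum(A.T)     # mutual kNN graph: edge only if BOTH points list each other
print(standard.nnz, mutual.nnz)   # the mutual graph never has more edges
```

Comparing `mutual.nnz` against `standard.nnz` for the intended k gives a quick sense of how much larger k must be for the mutual graph to reach a comparable edge count.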
Determining the smallest value of ε for which the graph is connected is very simple: one has to choose ε as the length of the longest edge in a minimal spanning tree of the fully connected graph on the data points. The latter can be determined easily by any minimal spanning tree algorithm. However, note that when the data contains outliers this heuristic will choose ε so large that even the outliers are connected to the rest of the data. A similar effect happens when the data contains several tight clusters which are very far apart from each other. In both cases, ε will be chosen too large to reflect the scale of the most important part of the data.

Finally, if one uses a fully connected graph together with a similarity function which can be scaled
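The minimal-spanning-tree heuristic for ε described above can be sketched in a few lines, assuming NumPy and SciPy (the function name `smallest_connecting_eps` is hypothetical):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def smallest_connecting_eps(X):
    """Smallest eps for which the eps-neighborhood graph on X is connected:
    the length of the longest edge in a minimum spanning tree of the
    fully connected Euclidean distance graph."""
    D = squareform(pdist(X))          # dense n x n pairwise distance matrix
    mst = minimum_spanning_tree(D)    # sparse matrix holding the n-1 tree edges
    return mst.data.max()

rng = np.random.default_rng(2)
X = rng.uniform(size=(100, 2))
eps = smallest_connecting_eps(X)

# sanity check: the eps-neighborhood graph at this radius is indeed connected
D = squareform(pdist(X))
W = csr_matrix(((D > 0) & (D <= eps)).astype(float))
n_comp, _ = connected_components(W, directed=False)
print(eps, n_comp)                    # n_comp is 1
```

As the text warns, on data with outliers or widely separated tight clusters the returned ε will be dominated by the longest bridging edge, so it should be treated as an upper bound rather than a recommended value.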