for dynamically determining the number of prototypes (e.g. [18]). Techniques for creating and destroying prototypes are particularly attractive when thinking about Method 1. Here we have the simplicity of completely ignoring all data points outside a prototype's associated cluster, with the problem, however, that we may have too many or too few prototypes within the canopy. In Method 3, as in Method 1, prototypes are associated with canopies and only see points within their canopy. Here, however, we use techniques that create (and possibly destroy) prototypes dynamically during clustering.

We avoid creating multiple prototypes to cover points that fall into more than one canopy by invoking a "conservation of mass" principle for points. In practice, this means that the contribution of each point is divided among the canopies, as falls out naturally from the normalization in the E-step: as is traditional, membership to a prototype is determined by dividing the inverse distance to the prototype by the sum of inverse distances to all the prototypes to which its distance was measured, even if some of those prototypes fall in different canopies.
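To make this membership rule concrete, the following is a minimal sketch of the canopy-restricted E-step normalization. It assumes Euclidean distance and hypothetical data structures (`canopies_of_point`, `prototype_canopy`) recording canopy assignments; it is an illustration, not the paper's implementation.

```python
import math

def memberships(point, canopies_of_point, prototypes, prototype_canopy):
    """Soft membership of one point under the "conservation of mass" rule.

    point             -- feature vector of the data point
    canopies_of_point -- set of canopy ids the point fell into
    prototypes        -- dict: prototype id -> feature vector
    prototype_canopy  -- dict: prototype id -> canopy id (hypothetical layout)
    """
    inv_dist = {}
    for pid, proto in prototypes.items():
        # Measure distance only to prototypes lying in one of the
        # point's canopies; all other prototypes never see this point.
        if prototype_canopy[pid] not in canopies_of_point:
            continue
        d = math.dist(point, proto)
        inv_dist[pid] = 1.0 / max(d, 1e-12)  # guard against zero distance
    # Normalize over every prototype that was measured, even those in
    # different canopies, so each point contributes total mass of one.
    total = sum(inv_dist.values())
    return {pid: w / total for pid, w in inv_dist.items()}
```

Because the normalization runs over all measured prototypes, a point lying in two canopies splits its unit of mass between them rather than being counted twice.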
2.4 Computational Complexity

We can formally quantify the computational savings of the canopies technique. The technique has two components: a relatively fast step where the canopies are created, followed by a relatively slow clustering process. If we create canopies using an inverted index, we do not need to perform even the complete pair-wise distance comparisons. If we assume that each document has w words, and these words are evenly distributed across the vocabulary V, then we can compare a document to n other documents in O(nw²/|V|). This canopy creation cost is typically insignificant in comparison to the more detailed clustering phase.
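One hypothetical realization of this inverted-index step is sketched below: each document is indexed by its words, a canopy center is compared only against documents sharing at least one word with it, and a cheap word-overlap score stands in for the loose and tight thresholds of the canopy method. The tokenization and threshold values are assumptions for illustration.

```python
from collections import defaultdict

def overlap(a, b):
    """Cheap word-overlap (Jaccard) similarity as the approximate metric."""
    return len(a & b) / max(len(a | b), 1)

def build_canopies(docs, loose=0.1, tight=0.5):
    """Canopy creation via an inverted index (illustrative sketch).

    docs  -- list of documents, each a set of word tokens
    loose -- similarity above which a document joins the canopy
    tight -- similarity above which a document may no longer seed a canopy
    """
    index = defaultdict(set)  # word -> indices of documents containing it
    for i, words in enumerate(docs):
        for w in words:
            index[w].add(i)

    remaining = set(range(len(docs)))  # potential canopy centers
    canopies = []
    while remaining:
        center = remaining.pop()
        # Only documents sharing a word with the center are candidates;
        # with w words spread evenly over vocabulary V, this is how the
        # O(nw^2/|V|) comparison cost arises.
        candidates = set().union(*(index[w] for w in docs[center]))
        canopy = {center}
        for j in candidates:
            sim = overlap(docs[center], docs[j])
            if sim >= loose:
                canopy.add(j)         # canopies may overlap
            if sim >= tight:
                remaining.discard(j)  # too close to seed its own canopy
        canopies.append(canopy)
    return canopies
```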
Assume that we have n data points that originated from k clusters. For concreteness, first consider the Greedy Agglomerative Clustering algorithm. Clustering without canopies requires calculating the distance between all pairs of points, an O(n²) operation. This is the dominating complexity term in GAC clustering. Now consider how this is reduced by using canopies. Assume that there are c canopies and that each data point on average falls into f canopies. This factor f estimates the amount to which the canopies overlap with each other. Now, in an optimistic scenario where each canopy is of equal size, there will be roughly fn/c data points per canopy. When clustering with canopies we need only calculate the distances between pairs of points within the same canopy. This requires at most O(c(fn/c)²) = O(f²n²/c) distance measurements. (This is probably an over-estimate, as the same points tend to fall into multiple canopies together, and we only need to calculate their distances once.) This represents a reduction in complexity of f²/c. In general, c will be much larger than f. Given a typical case in which n = 1,000,000, k = 10,000, c = 1,000, and f is a small constant, the canopies technique reduces computation by a factor of 1,000.
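In code, this reduction amounts to changing which pairs are ever handed to the expensive distance function. The hypothetical helper below enumerates only within-canopy pairs, and deduplicates pairs whose points share several canopies, which removes exactly the over-count noted in the parenthetical above.

```python
from itertools import combinations

def candidate_pairs(canopies):
    """Yield each point pair a canopy-based GAC must actually compare.

    canopies -- iterable of sets of point ids; a point may appear in
                several canopies (overlap factor f > 1)

    Without canopies, GAC examines all n(n-1)/2 pairs.  Here at most
    sum(|canopy|^2)/2 ~ f^2 n^2 / (2c) pairs are produced for c
    equal-sized canopies, and each pair is yielded only once even if
    its two points co-occur in several canopies.
    """
    seen = set()
    for canopy in canopies:
        for a, b in combinations(sorted(canopy), 2):
            if (a, b) not in seen:
                seen.add((a, b))
                yield a, b
```

A GAC implementation would then compute real distances only for these pairs; cross-canopy pairs are never generated at all.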
In the case of K-means or Expectation-Maximization, clustering without canopies requires O(nk) distance comparisons per iteration of clustering (finding the distance between each data point and each cluster prototype). Consider EM Method 1, where each cluster belongs to one or more canopies. Assume that clusters have roughly the same overlap factor f as data points do. Then, each cluster needs to compare itself to the fn/c points in each of its f canopies. For all clusters, this is O(nkf²/c) per iteration, yielding the same complexity reduction as seen for GAC.

In the experimental results described in a following section, c is on the order of 1,000, so the savings are significant. In an industrial-sized merge-purge problem, far more canopies would be used, and full pair-wise distance calculations would not be feasible at all.

Fahlman, Scott & Lebiere, Christian (1989). The cascade-correlation learning architecture. In Touretzky, D., editor, Advances in Neural Information Processing Systems (volume 2), (pp. 524-532), San Mateo, CA. Morgan Kaufmann.

Fahlman, S.E. and Lebiere, C., "The Cascade Correlation Learning Architecture," NIPS, Vol. 2, pp. 524-532, Morgan Kaufmann, 1990.

Fahlmann, S. E. and Lebiere, C. (1989). The cascade-correlation learning architecture. In Advances in Neural Information Processing Systems 2 (NIPS-2), Denver, Colorado.

Figure 2: Three sample citations to the same paper. Note the different layout formats, and the mistakes in spelling that make it difficult to identify these as citations to the same paper.

3. CLUSTERING TEXTUAL BIBLIOGRAPHIC REFERENCES

In this section we provide empirical evidence that using canopies for clustering can increase computational efficiency by an order of magnitude without losing any clustering accuracy. We demonstrate these results in the domain of bibliographic references.

The Cora web site (www.cora.whizbang.com) provides a search interface to over 50,000 computer science research papers [11]. As part of the site's functionality, we provide an interface for traversing the citation graph. That is, for a given paper, we provide links to all other papers it references, and links to all other papers that reference it, in the respective bibliography section of each paper. To provide this interface to the data, it is necessary to recognize when two citations from different papers are referencing the same third paper, even though the text of the citations may differ. For example, one paper may abbreviate first author names, while the second may include them. Some typical citations are shown in Figure 2. Note that different styles of reference formatting and abbreviation, as well as outright citation mistakes, make this a difficult task to automate. We pose this as a clustering problem, where the task is to take all the citations from all papers in the collection, and cluster them so that each cluster contains all and only citations to a single paper. Since the Cora collection contains over a million bibliographic entries, it is necessary to perform this clustering efficiently. Using straightforward GAC would take more than one CPU year, assuming unlimited memory. If we estimate the total number of canopies and average canopy membership from the labeled dataset used below, the canopies approach will provide a speedup of five orders of magnitude, reducing the clustering time to a couple of hours.