terms, he requires the number of canopies to equal the number of items.) This is very expensive. Also, because each pool is centered on a single item, Hylton does not support the use of arbitrary clustering methods for the fine-grained clustering portion of the algorithm.

Giles et al. [6] also study the domain of clustering citations. They present experiments with several different word-matching techniques and one string-edit-based technique. They find that the best-performing method is a word-matching algorithm similar to the Cora technique used in Section 3, augmented to use field extraction information and bigrams. Their lesser-performing string edit distance metric is similar to ours, but the clustering algorithm they use with this metric is not greedy agglomerative clustering. Instead, they use an iterative match-and-remove algorithm to perform the clustering.
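For contrast with greedy agglomerative clustering, the following is a minimal Python sketch of what an iterative match-and-remove loop could look like. The seed selection and the "matches" predicate are our illustrative assumptions, not the actual procedure of Giles et al.

    # Hypothetical sketch of iterative match-and-remove clustering.
    # The match predicate is left abstract; any pairwise decision rule works.

    def match_and_remove(items, matches):
        """Repeatedly pick a seed, collect everything that matches it into
        a cluster, and remove that whole cluster from consideration."""
        remaining = list(items)
        clusters = []
        while remaining:
            seed = remaining.pop(0)
            matched = [x for x in remaining if matches(seed, x)]
            remaining = [x for x in remaining if x not in matched]
            clusters.append([seed] + matched)
        return clusters

Unlike greedy agglomerative clustering, a record removed with one cluster is never revisited, so two clusters seeded separately can never merge later.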
5. CONCLUSIONS
Clustering large data sets is a ubiquitous task. Astronomers work to classify stars into similar sets based on their images. Search engines on the web seek to group together similar documents based on the words they contain or based on their citations. Marketers seek clusters of similar shoppers based on their purchase history and demographics. Shopbots seek to identify similar products based on the product descriptions. Biologists seek to group DNA sequences based on the proteins they code for, or to group proteins based on their function in cells.

In all of these cases, the objects are characterized by large feature vectors (the images, text, or DNA sequences). Furthermore, in each case there are inexpensive measures of similarity that one can compute (e.g. number of words in common) and more expensive and accurate measures (e.g. based on parse trees and part-of-speech tagging for text, or string edit distances for DNA). Given large data sets with hundreds of thousands or millions of entries, computing all pairwise similarities between objects is often intractable, and more efficient methods are called for. Also, increasingly, people are trying to fit complex models such as mixture distributions or HMMs to these large data sets. Computing global models in which all observations can affect all parameters is again intractable, and methods for grouping observations (similar to the grouping of objects above) are needed. Canopies provide a principled approach to all these problems.

In this paper we have focused on reference matching, a particular class of problems that arises when one has many different descriptions for each of many different objects and wishes to know (1) which descriptions refer to the same object, and (2) what the best description of that object is. We present experimental results for the domain of bibliographic citation matching. Another important instance of this class is the merge-purge problem. Companies often purchase and merge multiple mailing lists. The resulting list then has multiple entries for each household. Even for a single person, the name and address in each version on the list may differ slightly, with middle initials present or absent, words abbreviated or expanded, and zip codes present or absent. This problem of merging large mailing lists and eliminating duplicates becomes even more complex for householding, where one wishes to collapse the records of multiple people who live in the same household.

When information is extracted from the web, the reference matching problem is even more severe. One can search the web and extract descriptions of products and their attributes (e.g. different models of Palm Pilots and their weight, memory, size, etc.) or descriptions of companies and the industries and countries in which they do business. This information extraction is error-prone but redundant: many different sources sell the same product. Again, the goal is to cluster descriptions into sets that describe the same product or company, and then to determine a canonical description.

Because of the large number of items, feature dimensions, and clusters, carefully comparing every item against every other item is intractable in all the above cases. Fortunately, there are often cheap and approximate means to group items into overlapping subsets we call “canopies”, so that accurate, expensive comparisons can be made between items in the same canopy. Canopies, ideally, have the property that all items in any true cluster fall in the same canopy; this guarantees that no accuracy is lost by restricting comparisons to items in the same canopy.

The canopy approach is widely applicable. The cheap measure can be binning, comparison of a few attributes of a complex record, or similarity computed with an inverted index. The expensive measure can use detailed similarity measurements such as string edit distance computed with dynamic programming. The clustering can be greedy agglomerative, K-nearest neighbor, K-means, or any of a wide variety of EM methods. The commonality across these methods is the creation of canopies using the cheap measure, so that use of the expensive measure is drastically reduced.
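To make this division of labor concrete, here is a minimal Python sketch. The word-overlap cheap distance, the loose and tight thresholds, and all names (cheap_distance, edit_distance, make_canopies, expensive_pairs) are illustrative assumptions rather than the exact measures and parameters used in the paper.

    # Sketch: form overlapping canopies with a cheap distance, then apply
    # the expensive distance only to pairs that share a canopy.
    # Both measures and the thresholds are illustrative assumptions.

    def cheap_distance(a, b):
        # Cheap, approximate measure: one minus the fraction of shared words.
        wa, wb = set(a.split()), set(b.split())
        return 1.0 - len(wa & wb) / max(len(wa | wb), 1)

    def edit_distance(a, b):
        # Expensive, accurate measure: string edit distance by dynamic
        # programming, keeping one row of the table at a time.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                # delete ca
                               cur[j - 1] + 1,             # insert cb
                               prev[j - 1] + (ca != cb)))  # substitute
            prev = cur
        return prev[-1]

    def make_canopies(items, t_loose=0.8, t_tight=0.4):
        # Greedy construction: pick a center, put everything within the
        # loose threshold into its canopy, and drop everything within the
        # tight threshold from the list of future centers. Because
        # t_loose > t_tight, canopies overlap.
        candidates, canopies = list(items), []
        while candidates:
            center = candidates.pop(0)
            canopies.append([x for x in items
                             if cheap_distance(center, x) <= t_loose])
            candidates = [x for x in candidates
                          if cheap_distance(center, x) > t_tight]
        return canopies

    def expensive_pairs(items):
        # Only pairs that share a canopy ever reach the expensive measure.
        seen = set()
        for canopy in make_canopies(items):
            for i, a in enumerate(canopy):
                for b in canopy[i + 1:]:
                    if (a, b) not in seen:
                        seen.add((a, b))
                        yield a, b, edit_distance(a, b)

Any of the clustering methods above can then consume these distances; the point is simply that the quadratic all-pairs expensive-comparison step is replaced by comparisons within canopies only.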
We have demonstrated the success of the canopies approach on a reference matching problem from the domain of bibliographic citations. Here we reduced computation time by more than an order of magnitude while also slightly increasing accuracy. In ongoing work we are running the canopies algorithm on the full set of over a million citations, and we expect a reduction in computation of five orders of magnitude, from more than one CPU year to a couple of hours. In future work we will quantify analytically the correspondence between the cheap and expensive distance metrics, and we will perform experiments with EM and with a wider variety of domains, including data sets with real-valued attributes.

Acknowledgments
Most of this work was done while the first two authors were at Just Research (www.justresearch.com). Although Cora is now supported by WhizBang! Labs, it was originally created at Just Research.

6. REFERENCES
[1] H. Akaike. On entropy maximization principle. Applications of Statistics, pages 27–41, 1977.
[2] M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
[3] P. S. Bradley, U. Fayyad, and C. Reina. Scaling clustering algorithms to large databases. In Proc. 4th