Peking University: "Pattern Recognition" course teaching resources (reference material), Algorithms for Clustering Data

lower similarity threshold, we perceive nine clusters. Which answer is correct? Looking at the data at multiple scales may actually help in analyzing its structure. Thus the crucial problem in identifying clusters in data is to specify what proximity is and how to measure it. As is to be expected, the notion of proximity is problem dependent.

Clustering techniques offer several advantages over a manual grouping process. First, a clustering program can apply a specified objective criterion consistently to form the groups. Human beings are excellent cluster seekers in two and often in three dimensions, but different individuals do not always identify the same clusters in data. The proximity measure defining similarity among objects depends on an individual's educational and cultural background. Thus it is quite common for different human subjects to form different groups in the same data, especially when the groups are not well separated. Second, a clustering algorithm can form the groups in a fraction of time required by a manual grouping, particularly if a long list of descriptors or features is associated with each object. The speed, reliability, and consistency of a clustering algorithm in organizing data together constitute an overwhelming reason to use it. A clustering algorithm relieves a scientist or data analyst of the treacherous job of "looking" at a pattern matrix or a similarity matrix to detect clusters. A data analyst's time is better spent in analyzing or interpreting the results provided by a clustering algorithm.
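The dependence of the perceived number of clusters on the proximity threshold can be illustrated with a minimal sketch. It is not a method taken from this chapter: it assumes Euclidean distance as the proximity measure, links any two points whose distance is at most a chosen threshold, and reports the connected groups. The function name `threshold_clusters` and the sample points are hypothetical.

```python
from itertools import combinations
from math import dist


def threshold_clusters(points, threshold):
    """Group points into the connected components of the graph that links
    every pair of points whose Euclidean distance is <= threshold."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the representative of i's group.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= threshold:
            parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical data: four tight pairs arranged as two looser groups.
points = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0), (3.0, 1.0),
          (20.0, 0.0), (20.0, 1.0), (23.0, 0.0), (23.0, 1.0)]

fine = threshold_clusters(points, threshold=1.5)    # strict proximity requirement
coarse = threshold_clusters(points, threshold=4.0)  # relaxed proximity requirement
print(len(fine), len(coarse))
```

With these assumed points, the strict threshold yields four groups and the relaxed one yields two. Neither answer is wrong; each reflects a different scale at which the same data are examined.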
Clustering is also useful in implementing the "divide and conquer" strategy to reduce the computational complexity of various decision-making algorithms in pattern recognition. For example, the nearest-neighbor decision rule is a popular technique in pattern recognition (Duda and Hart, 1973). However, finding the nearest neighbor of a test pattern can be very time consuming if the number of training patterns or prototypes is large. Fukunaga and Narendra (1975) used the well-known partitional clustering algorithm, ISODATA (Chapter 3), to decompose the patterns, and then in conjunction with the branch-and-bound method obtained an efficient algorithm to compute nearest neighbors. Similarly, Fukunaga and Short (1978) used clustering for problem localization, whereby a simple decision rule can be implemented in local regions or clusters of the pattern space. The applications of clustering continue to grow.

Consider the problem of grouping various colleges and universities in the United States to illustrate the factors in clustering problems. Schools can be clustered based on their geographical location, size of the student body, size of the campus, tuition fee, or offerings of various professional graduate programs. The factors depend on the goal of the analysis. The shapes and sizes of the clusters formed will depend on which particular attribute is used in defining the similarity between colleges. Interesting and challenging clustering problems arise when several attributes are taken together to construct clusters. One cluster could represent private, midwestern, and primarily liberal arts colleges with fewer than 1000 students and another can represent large state universities. The features or attributes that we have mentioned so far can easily be measured. What about such attributes as quality of education, quality of faculty, and the quality of campus life, which cannot be measured easily? One can poll alumni or a panel of experts to get either a numerical score (on a scale of, say, 1 to 10) for these factors or similarity measures for all pairs of universities. These scores or similarities must be averaged over all respondents because individual opinions differ. One can also measure subjective attributes indirectly. For example, faculty excellence in a graduate program can be estimated from the number of professional papers written and number of Ph.D. degrees awarded.

The example above illustrates the difference between decision making and clustering. Suppose that we want to partition computer science graduate programs in the United States into two categories based on such attributes as size of faculty, computing resources, external research support, and faculty publications. In the decision-making paradigm, an "expert" must first define these two categories by identifying some computer science programs from each of the two categories (these are the training samples in pattern recognition terminology). The attributes of these training samples will be used to construct decision boundaries (or simply thresholds on attribute values) that will separate the two types of programs. Once the decision boundary is available, the remaining computer science programs (those that were not labeled by the expert) will be assigned to one of the two categories.

In the clustering paradigm, no expert is available to define the categories. The objective is to determine whether a two-category partition of the data, based on the given attributes, is reasonable, and if so, to determine the memberships of the two clusters. This can be achieved by forming similarities between all pairs of computer science graduate programs based on the given attributes and then constructing groups such that the within-group similarities are larger than the between-group similarities.
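The clustering paradigm just described can be sketched in a few lines. This is only an illustration under stated assumptions, not an algorithm from this book: the program names and attribute values are invented, the attributes are assumed to be already scaled to [0, 1], the similarity function is an arbitrary decreasing function of Euclidean distance, and the two groups are formed by a simple seeding heuristic (the two least similar programs act as seeds).

```python
from itertools import combinations
from math import dist
from statistics import mean

# Hypothetical, already-normalized attributes per program:
# (faculty size, computing resources, research support, publications)
programs = {
    "A": (0.90, 0.80, 0.90, 0.85),
    "B": (0.80, 0.90, 0.85, 0.90),
    "C": (0.85, 0.75, 0.80, 0.80),
    "D": (0.20, 0.30, 0.10, 0.20),
    "E": (0.30, 0.20, 0.20, 0.10),
    "F": (0.25, 0.25, 0.15, 0.30),
}


def similarity(x, y):
    """Assumed similarity: 1 for identical vectors, falling toward 0 with distance."""
    return 1.0 / (1.0 + dist(x, y))


def two_group_partition(items):
    """Seed two groups with the least similar pair, then assign every other
    item to the seed it is more similar to."""
    names = list(items)
    seed_a, seed_b = min(combinations(names, 2),
                         key=lambda p: similarity(items[p[0]], items[p[1]]))
    groups = {seed_a: [seed_a], seed_b: [seed_b]}
    for name in names:
        if name in (seed_a, seed_b):
            continue
        closer = max((seed_a, seed_b),
                     key=lambda s: similarity(items[name], items[s]))
        groups[closer].append(name)
    return [sorted(g) for g in groups.values()]


def mean_similarities(items, groups):
    """Return (mean within-group similarity, mean between-group similarity)."""
    label = {name: k for k, group in enumerate(groups) for name in group}
    within, between = [], []
    for a, b in combinations(items, 2):
        target = within if label[a] == label[b] else between
        target.append(similarity(items[a], items[b]))
    return mean(within), mean(between)


groups = two_group_partition(programs)
within, between = mean_similarities(programs, groups)
print(groups, within > between)
```

Comparing the mean within-group similarity with the mean between-group similarity is a crude check of whether the two-category partition is reasonable. If the two values were close, the data would offer little support for two categories.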
Cluster analysis is one component of exploratory data analysis, which means sifting through data to make sense out of measurements by whatever means are available. The information gained about a set of data from a cluster analysis should prod one's creativity, suggest new experiments, and provide fresh insight into the subject matter. The modern digital computer makes all this possible. Cluster analysis is a child of the computer revolution and frees the analyst from time-honored statistical models and procedures conceived when the human brain was aided only by pencil and paper. The development of clustering methodology has been truly interdisciplinary. Researchers in almost every area of science that collects data have contributed, such as taxonomists, psychologists, biologists, statisticians, social scientists, and engineers. I. J. Good (1977) has suggested the new name botryology for the discipline of cluster analysis, from the Greek word for a cluster of grapes.

One objective of this book is to encourage communication among disciplines. All too often, the same procedures are developed in different disciplines but are so clothed in the language of the individual disciplines that cross fertilization is severely hindered. A casual scan of the bibliography for this book reveals citations from almost 100 different journals. Only the Journal of Classification, a publication of the Classification Society of North America which first appeared in 1984, is