正在加载图片...
270 A.Jain et al. -d is the dimensionality of the pattern clustering system.Because of the diffi- or of the pattern space. culties surrounding pattern representa- tion,it is conveniently assumed that the -A pattern set is denoted pattern representation is available prior {x1,...x.The ith pattern in is to clustering.Nonetheless,a careful in- denoted x;=(xi,,··.xi,d).In many vestigation of the available features and cases a pattern set to be clustered is any available transformations (even simple ones)can yield significantly im- viewed as an n x d pattern matrix. proved clustering results.A good pat- -A class,in the abstract,refers to a tern representation can often yield a state of nature that governs the pat-simple and easily understood clustering; tern generation process in some cases. a poor pattern representation may yield More concretely,a class can be viewed a complex clustering whose true struc- as a source of patterns whose distri-ture is difficult or impossible to discern. bution in feature space is governed by Figure 3 shows a simple example.The a probability density specific to the points in this 2D feature space are ar- class.Clustering techniques attempt ranged in a curvilinear cluster of ap- to group patterns so that the classes proximately constant distance from the thereby obtained reflect the different origin.If one chooses Cartesian coordi- pattern generation processes repre- nates to represent the patterns,many sented in the pattern set. clustering algorithms would be likely to fragment the cluster into two or more -Hard clustering techniques assign a clusters,since it is not compact.If,how- class label li to each patterns xi,iden-ever,one uses a polar coordinate repre- tifying its class.The set of all labels sentation for the clusters,the radius for a pattern set is coordinate exhibits tight clustering and l1,...I,with li(1,...,k),a one-cluster solution is likely to be where k is the number of clusters. easily obtained. A pattern can measure either a phys- -Fuzzy clustering procedures assign to ical object (e.g.,a chair)or an abstract each input pattern x:a fractional de-notion (e.g.,a style of writing).As noted gree of membership fi;in each output above,patterns are represented conven- tionally as multidimensional vectors, cluster / where each dimension is a single fea- -A distance measure (a specialization ture [Duda and Hart 1973].These fea- of a proximity measure)is a metric tures can be either quantitative or qual- (or quasi-metric)on the feature space itative.For example,if weight and color used to quantify the similarity of pat-are the two features used,then terns. (20,black)is the representation of a black object with 20 units of weight. 3.PATTERN REPRESENTATION,FEATURE The features can be subdivided into the SELECTION AND EXTRACTION following types [Gowda and Diday 1992: There are no theoretical guidelines that suggest the appropriate patterns and (1)Quantitative features:e.g. features to use in a specific situation. (a)continuous values (e.g.,weight); Indeed,the pattern generation process (b)discrete values (e.g.,the number is often not directly controllable;the of computers); user's role in the pattern representation (c)interval values (e.g.,the dura- process is to gather facts and conjec- tion of an event). tures about the data,optionally perform feature selection and extraction,and de- (2)Qualitative features: sign the subsequent elements of the (a)nominal or unordered (e.g.,color); ACM Computing Surveys,Vol.31,No.3,September 1999—d is the dimensionality of the pattern or of the pattern space. —A pattern set is denoted - 5 $x1,... xn%. The ith pattern in - is denoted xi 5 ~xi,1,... xi,d!. In many cases a pattern set to be clustered is viewed as an n 3 d pattern matrix. —A class, in the abstract, refers to a state of nature that governs the pat￾tern generation process in some cases. More concretely, a class can be viewed as a source of patterns whose distri￾bution in feature space is governed by a probability density specific to the class. Clustering techniques attempt to group patterns so that the classes thereby obtained reflect the different pattern generation processes repre￾sented in the pattern set. —Hard clustering techniques assign a class label li to each patterns xi, iden￾tifying its class. The set of all labels for a pattern set - is + 5 $l1,... ln%, with li [ $1, · · ·, k%, where k is the number of clusters. —Fuzzy clustering procedures assign to each input pattern xi a fractional de￾gree of membership fij in each output cluster j. —A distance measure (a specialization of a proximity measure) is a metric (or quasi-metric) on the feature space used to quantify the similarity of pat￾terns. 3. PATTERN REPRESENTATION, FEATURE SELECTION AND EXTRACTION There are no theoretical guidelines that suggest the appropriate patterns and features to use in a specific situation. Indeed, the pattern generation process is often not directly controllable; the user’s role in the pattern representation process is to gather facts and conjec￾tures about the data, optionally perform feature selection and extraction, and de￾sign the subsequent elements of the clustering system. Because of the diffi￾culties surrounding pattern representa￾tion, it is conveniently assumed that the pattern representation is available prior to clustering. Nonetheless, a careful in￾vestigation of the available features and any available transformations (even simple ones) can yield significantly im￾proved clustering results. A good pat￾tern representation can often yield a simple and easily understood clustering; a poor pattern representation may yield a complex clustering whose true struc￾ture is difficult or impossible to discern. Figure 3 shows a simple example. The points in this 2D feature space are ar￾ranged in a curvilinear cluster of ap￾proximately constant distance from the origin. If one chooses Cartesian coordi￾nates to represent the patterns, many clustering algorithms would be likely to fragment the cluster into two or more clusters, since it is not compact. If, how￾ever, one uses a polar coordinate repre￾sentation for the clusters, the radius coordinate exhibits tight clustering and a one-cluster solution is likely to be easily obtained. A pattern can measure either a phys￾ical object (e.g., a chair) or an abstract notion (e.g., a style of writing). As noted above, patterns are represented conven￾tionally as multidimensional vectors, where each dimension is a single fea￾ture [Duda and Hart 1973]. These fea￾tures can be either quantitative or qual￾itative. For example, if weight and color are the two features used, then ~20, black! is the representation of a black object with 20 units of weight. The features can be subdivided into the following types [Gowda and Diday 1992]: (1) Quantitative features: e.g. (a) continuous values (e.g., weight); (b) discrete values (e.g., the number of computers); (c) interval values (e.g., the dura￾tion of an event). (2) Qualitative features: (a) nominal or unordered (e.g., color); 270 • A. Jain et al. ACM Computing Surveys, Vol. 31, No. 3, September 1999
<<向上翻页向下翻页>>
©2008-现在 cucdc.com 高等教育资讯网 版权所有