270 A.Jain et al. -d is the dimensionality of the pattern clustering system.Because of the diffi- or of the pattern space. culties surrounding pattern representa- tion,it is conveniently assumed that the -A pattern set is denoted pattern representation is available prior {x1,...x.The ith pattern in is to clustering.Nonetheless,a careful in- denoted x;=(xi,,··.xi,d).In many vestigation of the available features and cases a pattern set to be clustered is any available transformations (even simple ones)can yield significantly im- viewed as an n x d pattern matrix. proved clustering results.A good pat- -A class,in the abstract,refers to a tern representation can often yield a state of nature that governs the pat-simple and easily understood clustering; tern generation process in some cases. a poor pattern representation may yield More concretely,a class can be viewed a complex clustering whose true struc- as a source of patterns whose distri-ture is difficult or impossible to discern. bution in feature space is governed by Figure 3 shows a simple example.The a probability density specific to the points in this 2D feature space are ar- class.Clustering techniques attempt ranged in a curvilinear cluster of ap- to group patterns so that the classes proximately constant distance from the thereby obtained reflect the different origin.If one chooses Cartesian coordi- pattern generation processes repre- nates to represent the patterns,many sented in the pattern set. clustering algorithms would be likely to fragment the cluster into two or more -Hard clustering techniques assign a clusters,since it is not compact.If,how- class label li to each patterns xi,iden-ever,one uses a polar coordinate repre- tifying its class.The set of all labels sentation for the clusters,the radius for a pattern set is coordinate exhibits tight clustering and l1,...I,with li(1,...,k),a one-cluster solution is likely to be where k is the number of clusters. easily obtained. A pattern can measure either a phys- -Fuzzy clustering procedures assign to ical object (e.g.,a chair)or an abstract each input pattern x:a fractional de-notion (e.g.,a style of writing).As noted gree of membership fi;in each output above,patterns are represented conven- tionally as multidimensional vectors, cluster / where each dimension is a single fea- -A distance measure (a specialization ture [Duda and Hart 1973].These fea- of a proximity measure)is a metric tures can be either quantitative or qual- (or quasi-metric)on the feature space itative.For example,if weight and color used to quantify the similarity of pat-are the two features used,then terns. (20,black)is the representation of a black object with 20 units of weight. 3.PATTERN REPRESENTATION,FEATURE The features can be subdivided into the SELECTION AND EXTRACTION following types [Gowda and Diday 1992: There are no theoretical guidelines that suggest the appropriate patterns and (1)Quantitative features:e.g. features to use in a specific situation. (a)continuous values (e.g.,weight); Indeed,the pattern generation process (b)discrete values (e.g.,the number is often not directly controllable;the of computers); user's role in the pattern representation (c)interval values (e.g.,the dura- process is to gather facts and conjec- tion of an event). tures about the data,optionally perform feature selection and extraction,and de- (2)Qualitative features: sign the subsequent elements of the (a)nominal or unordered (e.g.,color); ACM Computing Surveys,Vol.31,No.3,September 1999—d is the dimensionality of the pattern or of the pattern space. —A pattern set is denoted - 5 $x1,... xn%. The ith pattern in - is denoted xi 5 ~xi,1,... xi,d!. In many cases a pattern set to be clustered is viewed as an n 3 d pattern matrix. —A class, in the abstract, refers to a state of nature that governs the pattern generation process in some cases. More concretely, a class can be viewed as a source of patterns whose distribution in feature space is governed by a probability density specific to the class. Clustering techniques attempt to group patterns so that the classes thereby obtained reflect the different pattern generation processes represented in the pattern set. —Hard clustering techniques assign a class label li to each patterns xi, identifying its class. The set of all labels for a pattern set - is + 5 $l1,... ln%, with li [ $1, · · ·, k%, where k is the number of clusters. —Fuzzy clustering procedures assign to each input pattern xi a fractional degree of membership fij in each output cluster j. —A distance measure (a specialization of a proximity measure) is a metric (or quasi-metric) on the feature space used to quantify the similarity of patterns. 3. PATTERN REPRESENTATION, FEATURE SELECTION AND EXTRACTION There are no theoretical guidelines that suggest the appropriate patterns and features to use in a specific situation. Indeed, the pattern generation process is often not directly controllable; the user’s role in the pattern representation process is to gather facts and conjectures about the data, optionally perform feature selection and extraction, and design the subsequent elements of the clustering system. Because of the difficulties surrounding pattern representation, it is conveniently assumed that the pattern representation is available prior to clustering. Nonetheless, a careful investigation of the available features and any available transformations (even simple ones) can yield significantly improved clustering results. A good pattern representation can often yield a simple and easily understood clustering; a poor pattern representation may yield a complex clustering whose true structure is difficult or impossible to discern. Figure 3 shows a simple example. The points in this 2D feature space are arranged in a curvilinear cluster of approximately constant distance from the origin. If one chooses Cartesian coordinates to represent the patterns, many clustering algorithms would be likely to fragment the cluster into two or more clusters, since it is not compact. If, however, one uses a polar coordinate representation for the clusters, the radius coordinate exhibits tight clustering and a one-cluster solution is likely to be easily obtained. A pattern can measure either a physical object (e.g., a chair) or an abstract notion (e.g., a style of writing). As noted above, patterns are represented conventionally as multidimensional vectors, where each dimension is a single feature [Duda and Hart 1973]. These features can be either quantitative or qualitative. For example, if weight and color are the two features used, then ~20, black! is the representation of a black object with 20 units of weight. The features can be subdivided into the following types [Gowda and Diday 1992]: (1) Quantitative features: e.g. (a) continuous values (e.g., weight); (b) discrete values (e.g., the number of computers); (c) interval values (e.g., the duration of an event). (2) Qualitative features: (a) nominal or unordered (e.g., color); 270 • A. Jain et al. ACM Computing Surveys, Vol. 31, No. 3, September 1999