Statistical Pattern Recognition: A Review

Anil K. Jain, Fellow, IEEE, Robert P.W. Duin, and Jianchang Mao, Senior Member, IEEE

A.K. Jain is with the Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824. E-mail: jain@cse.msu.edu.
R.P.W. Duin is with the Department of Applied Physics, Delft University of Technology, 2600 GA Delft, the Netherlands. E-mail: duin@ph.tn.tudelft.nl.
J. Mao is with the IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120. E-mail: mao@almaden.ibm.com.
Manuscript received 23 July 1999; accepted 12 Oct. 1999. Recommended for acceptance by K. Bowyer.

Abstract: The primary goal of pattern recognition is supervised or unsupervised classification. Among the various frameworks in which pattern recognition has been traditionally formulated, the statistical approach has been most intensively studied and used in practice. More recently, neural network techniques and methods imported from statistical learning theory have been receiving increasing attention. The design of a recognition system requires careful attention to the following issues: definition of pattern classes, sensing environment, pattern representation, feature extraction and selection, cluster analysis, classifier design and learning, selection of training and test samples, and performance evaluation. In spite of almost 50 years of research and development in this field, the general problem of recognizing complex patterns with arbitrary orientation, location, and scale remains unsolved. New and emerging applications, such as data mining, web searching, retrieval of multimedia data, face recognition, and cursive handwriting recognition, require robust and efficient pattern recognition techniques. The objective of this review paper is to summarize and compare some of the well-known methods used in various stages of a pattern recognition system and identify research topics and applications which are at the forefront of this exciting and challenging field.

Index Terms: Statistical pattern recognition, classification, clustering, feature extraction, feature selection, error estimation, classifier combination, neural networks.

1 INTRODUCTION

By the time they are five years old, most children can recognize digits and letters. Small characters, large characters, handwritten, machine printed, or rotated: all are easily recognized by the young. The characters may be written on a cluttered background, on crumpled paper, or may even be partially occluded. We take this ability for granted until we face the task of teaching a machine how to do the same. Pattern recognition is the study of how machines can observe the environment, learn to distinguish patterns of interest from their background, and make sound and reasonable decisions about the categories of the patterns. In spite of almost 50 years of research, design of a general purpose machine pattern recognizer remains an elusive goal.

The best pattern recognizers in most instances are humans, yet we do not understand how humans recognize patterns. Ross [140] emphasizes the work of Nobel Laureate Herbert Simon whose central finding was that pattern recognition is critical in most human decision making tasks: "The more relevant patterns at your disposal, the better your decisions will be. This is hopeful news to proponents of artificial intelligence, since computers can surely be taught to recognize patterns. Indeed, successful computer programs that help banks score credit applicants, help doctors diagnose disease and help pilots land airplanes depend in some way on pattern recognition... We need to pay much more explicit attention to teaching pattern recognition." Our goal here is to introduce pattern recognition as the best possible way of utilizing available sensors, processors, and domain knowledge to make decisions automatically.

1.1 What is Pattern Recognition?
Automatic (machine) recognition, description, classification, and grouping of patterns are important problems in a variety of engineering and scientific disciplines such as biology, psychology, medicine, marketing, computer vision, artificial intelligence, and remote sensing. But what is a pattern? Watanabe [163] defines a pattern "as opposite of a chaos; it is an entity, vaguely defined, that could be given a name." For example, a pattern could be a fingerprint image, a handwritten cursive word, a human face, or a speech signal. Given a pattern, its recognition/classification may consist of one of the following two tasks [163]: 1) supervised classification (e.g., discriminant analysis) in which the input pattern is identified as a member of a predefined class, 2) unsupervised classification (e.g., clustering) in which the pattern is assigned to a hitherto unknown class. Note that the recognition problem here is being posed as a classification or categorization task, where the classes are either defined by the system designer (in supervised classification) or are learned based on the similarity of patterns (in unsupervised classification).

Interest in the area of pattern recognition has been renewed recently due to emerging applications which are not only challenging but also computationally more demanding (see Table 1).
These applications include data mining (identifying a "pattern," e.g., correlation, or an outlier in millions of multidimensional patterns), document classification (efficiently searching text documents), financial forecasting, organization and retrieval of multimedia databases, and biometrics (personal identification based on various physical attributes such as face and fingerprints). Picard [125] has identified a novel application of pattern recognition, called affective computing, which will give a computer the ability to recognize and express emotions, to respond intelligently to human emotion, and to employ mechanisms of emotion that contribute to rational decision making. A common characteristic of a number of these applications is that the available features (typically, in the thousands) are not usually suggested by domain experts, but must be extracted and optimized by data-driven procedures.

TABLE 1
Examples of Pattern Recognition Applications

Problem Domain                | Application                                              | Input Pattern                    | Pattern Classes
Bioinformatics                | Sequence analysis                                        | DNA/protein sequence             | Known types of genes/patterns
Data mining                   | Searching for meaningful patterns                        | Points in multidimensional space | Compact and well-separated clusters
Document classification       | Internet search                                          | Text document                    | Semantic categories (e.g., business, sports, etc.)
Document image analysis       | Reading machine for the blind                            | Document image                   | Alphanumeric characters, words
Industrial automation         | Printed circuit board inspection                         | Intensity or range image         | Defective/non-defective nature of product
Multimedia database retrieval | Internet search                                          | Video clip                       | Video genres (e.g., action, dialogue, etc.)
Biometric recognition         | Personal identification                                  | Face, iris, fingerprint          | Authorized users for access control
Remote sensing                | Forecasting crop yield                                   | Multispectral image              | Land use categories, growth pattern of crops
Speech recognition            | Telephone directory enquiry without operator assistance  | Speech waveform                  | Spoken words

The rapidly growing and available computing power, while enabling faster processing of huge data sets, has also facilitated the use of elaborate and diverse methods for data analysis and classification. At the same time, demands on automatic pattern recognition systems are rising enormously due to the availability of large databases and stringent performance requirements (speed, accuracy, and cost). In many of the emerging applications, it is clear that no single approach for classification is "optimal" and that multiple methods and approaches have to be used. Consequently, combining several sensing modalities and classifiers is now a commonly used practice in pattern recognition.

The design of a pattern recognition system essentially involves the following three aspects: 1) data acquisition and preprocessing, 2) data representation, and 3) decision making. The problem domain dictates the choice of sensor(s), preprocessing technique, representation scheme, and the decision making model. It is generally agreed that a well-defined and sufficiently constrained recognition problem (small intraclass variations and large interclass variations) will lead to a compact pattern representation and a simple decision making strategy. Learning from a set of examples (training set) is an important and desired attribute of most pattern recognition systems. The four best known approaches for pattern recognition are: 1) template matching, 2) statistical classification, 3) syntactic or structural matching, and 4) neural networks. These models are not necessarily independent and sometimes the same pattern recognition method exists with different interpretations. Attempts have been made to design hybrid systems involving multiple models [57]. A brief description and comparison of these approaches is given below and summarized in Table 2.

TABLE 2
Pattern Recognition Models

Approach                | Representation             | Recognition Function           | Typical Criterion
Template matching       | Samples, pixels, curves    | Correlation, distance measure  | Classification error
Statistical             | Features                   | Discriminant function          | Classification error
Syntactic or structural | Primitives                 | Rules, grammar                 | Acceptance error
Neural networks         | Samples, pixels, features  | Network function               | Mean square error

1.2 Template Matching

One of the simplest and earliest approaches to pattern recognition is based on template matching. Matching is a generic operation in pattern recognition which is used to determine the similarity between two entities (points, curves, or shapes) of the same type. In template matching, a template (typically, a 2D shape) or a prototype of the pattern to be recognized is available. The pattern to be recognized is matched against the stored template while taking into account all allowable pose (translation and rotation) and scale changes. The similarity measure, often a correlation, may be optimized based on the available training set. Often, the template itself is learned from the training set. Template matching is computationally demanding, but the availability of faster processors has now made this approach more feasible. The rigid template matching mentioned above, while effective in some application domains, has a number of disadvantages. For instance, it would fail if the patterns are distorted due to the imaging process, viewpoint change, or large intraclass variations among the patterns. Deformable template models [69] or rubber sheet deformations [9] can be used to match patterns when the deformation cannot be easily explained or modeled directly.
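To illustrate the matching operation in its simplest form, the sketch below slides a template over an image and scores every position with a normalized cross-correlation, as described above. The image and template are synthetic, and the search over rotation and scale is omitted; this is a minimal illustration, not the procedure used in any particular system.

```python
import numpy as np

def match_template(image, template):
    """Return (row, col) and score of the best normalized cross-correlation match."""
    ih, iw = image.shape
    th, tw = template.shape
    t = template - template.mean()
    t_norm = np.sqrt((t ** 2).sum())
    best_score, best_pos = -np.inf, (0, 0)
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            w = image[r:r + th, c:c + tw]
            w = w - w.mean()
            denom = np.sqrt((w ** 2).sum()) * t_norm
            if denom == 0:
                continue  # constant window: correlation undefined
            score = float((w * t).sum() / denom)  # correlation in [-1, 1]
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_pos, best_score

# Toy usage: hide a 5x5 "plus"-shaped template in a noisy 20x20 image.
rng = np.random.default_rng(0)
template = np.zeros((5, 5))
template[2, :] = 1.0
template[:, 2] = 1.0
image = rng.normal(0.0, 0.1, (20, 20))
image[10:15, 5:10] += template
print(match_template(image, template))  # -> ((10, 5), score close to 1)
```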
1.3 Statistical Approach

In the statistical approach, each pattern is represented in terms of d features or measurements and is viewed as a point in a d-dimensional space. The goal is to choose those features that allow pattern vectors belonging to different categories to occupy compact and disjoint regions in a d-dimensional feature space. The effectiveness of the representation space (feature set) is determined by how well patterns from different classes can be separated. Given a set of training patterns from each class, the objective is to establish decision boundaries in the feature space which separate patterns belonging to different classes. In the statistical decision theoretic approach, the decision boundaries are determined by the probability distributions of the patterns belonging to each class, which must either be specified or learned [41], [44].

One can also take a discriminant analysis-based approach to classification: First a parametric form of the decision boundary (e.g., linear or quadratic) is specified; then the "best" decision boundary of the specified form is found based on the classification of training patterns. Such boundaries can be constructed using, for example, a mean squared error criterion. The direct boundary construction approaches are supported by Vapnik's philosophy [162]: "If you possess a restricted amount of information for solving some problem, try to solve the problem directly and never solve a more general problem as an intermediate step. It is possible that the available information is sufficient for a direct solution but is insufficient for solving a more general intermediate problem."
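As a concrete instance of the discriminant-analysis route just described, the sketch below specifies a linear decision boundary and fits it to labeled training patterns by a mean squared error criterion. The two Gaussian classes are synthetic stand-ins for a training set; the choice of least squares is one of several possible criteria.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two synthetic classes in a 2-dimensional feature space (labels -1 / +1).
X1 = rng.normal([0.0, 0.0], 1.0, (100, 2))
X2 = rng.normal([2.5, 2.5], 1.0, (100, 2))
X = np.vstack([X1, X2])
y = np.concatenate([np.full(100, -1.0), np.full(100, +1.0)])

# Augment with a constant 1 so the bias is part of the weight vector,
# then solve the least-squares problem  min_w ||Xa w - y||^2 .
Xa = np.hstack([X, np.ones((X.shape[0], 1))])
w, *_ = np.linalg.lstsq(Xa, y, rcond=None)

# The decision boundary is the line w0*x1 + w1*x2 + w2 = 0;
# classification is by the sign of the linear discriminant function.
pred = np.sign(Xa @ w)
print("training error:", np.mean(pred != y))
```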
1.4 Syntactic Approach

In many recognition problems involving complex patterns, it is more appropriate to adopt a hierarchical perspective where a pattern is viewed as being composed of simple subpatterns which are themselves built from yet simpler subpatterns [56], [121]. The simplest/elementary subpatterns to be recognized are called primitives and the given complex pattern is represented in terms of the interrelationships between these primitives. In syntactic pattern recognition, a formal analogy is drawn between the structure of patterns and the syntax of a language. The patterns are viewed as sentences belonging to a language, primitives are viewed as the alphabet of the language, and the sentences are generated according to a grammar. Thus, a large collection of complex patterns can be described by a small number of primitives and grammatical rules. The grammar for each pattern class must be inferred from the available training samples.

Structural pattern recognition is intuitively appealing because, in addition to classification, this approach also provides a description of how the given pattern is constructed from the primitives. This paradigm has been used in situations where the patterns have a definite structure which can be captured in terms of a set of rules, such as EKG waveforms, textured images, and shape analysis of contours [56].

The implementation of a syntactic approach, however, leads to many difficulties which primarily have to do with the segmentation of noisy patterns (to detect the primitives) and the inference of the grammar from training data. Fu [56] introduced the notion of attributed grammars which unifies syntactic and statistical pattern recognition. The syntactic approach may yield a combinatorial explosion of possibilities to be investigated, demanding large training sets and very large computational efforts [122].
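A toy illustration of the syntactic view: primitives form an alphabet and a pattern class is the set of primitive strings generated by a grammar. The primitives and the "peak" grammar below are invented for the example, and grammar inference from training samples, the hard part noted above, is not addressed here.

```python
import re

# Hypothetical primitives for a contour: 'u' = up stroke, 'd' = down stroke,
# 'f' = flat segment.  A "triangular peak" class is described by the regular
# grammar  S -> u+ d+ f*,  i.e., one or more ups, then one or more downs,
# then optional flats.  A pattern (string of primitives) belongs to the class
# if it can be generated by the grammar.
PEAK = re.compile(r"u+d+f*$")

def is_peak(primitive_string):
    return PEAK.match(primitive_string) is not None

for s in ["uuuddf", "uudd", "udud", "ddu"]:
    print(s, "->", "peak" if is_peak(s) else "not a peak")
```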
1.5 Neural Networks

Neural networks can be viewed as massively parallel computing systems consisting of an extremely large number of simple processors with many interconnections. Neural network models attempt to use some organizational principles (such as learning, generalization, adaptivity, fault tolerance and distributed representation, and computation) in a network of weighted directed graphs in which the nodes are artificial neurons and directed edges (with weights) are connections between neuron outputs and neuron inputs. The main characteristics of neural networks are that they have the ability to learn complex nonlinear input-output relationships, use sequential training procedures, and adapt themselves to the data.

The most commonly used family of neural networks for pattern classification tasks [83] is the feed-forward network, which includes multilayer perceptron and Radial-Basis Function (RBF) networks. These networks are organized into layers and have unidirectional connections between the layers. Another popular network is the Self-Organizing Map (SOM), or Kohonen-Network [92], which is mainly used for data clustering and feature mapping. The learning process involves updating network architecture and connection weights so that a network can efficiently perform a specific classification/clustering task. The increasing popularity of neural network models to solve pattern recognition problems has been primarily due to their seemingly low dependence on domain-specific knowledge (relative to model-based and rule-based approaches) and due to the availability of efficient learning algorithms for practitioners to use.

Neural networks provide a new suite of nonlinear algorithms for feature extraction (using hidden layers) and classification (e.g., multilayer perceptrons). In addition, existing feature extraction and classification algorithms can also be mapped on neural network architectures for efficient (hardware) implementation. In spite of the seemingly different underlying principles, most of the well-known neural network models are implicitly equivalent or similar to classical statistical pattern recognition methods (see Table 3). Ripley [136] and Anderson et al. [5] also discuss this relationship between neural networks and statistical pattern recognition. Anderson et al. point out that "neural networks are statistics for amateurs... Most NNs conceal the statistics from the user." Despite these similarities, neural networks do offer several advantages such as unified approaches for feature extraction and classification and flexible procedures for finding good, moderately nonlinear solutions.

TABLE 3
Links Between Statistical and Neural Network Methods

Statistical Pattern Recognition         | Artificial Neural Networks
Linear Discriminant Function            | Perceptron
Principal Component Analysis            | Auto-Associative Network, and various PCA networks
A Posteriori Probability Estimation     | Multilayer Perceptron
Nonlinear Discriminant Analysis         | Multilayer Perceptron
Parzen Window Density-based Classifier  | Radial Basis Function Network
Edited K-NN Rule                        | Kohonen's LVQ
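To make the feed-forward network concrete, the sketch below trains a one-hidden-layer multilayer perceptron by gradient descent on a mean squared error criterion. The XOR data, layer size, learning rate, and iteration count are illustrative choices for this sketch, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# XOR: the classic problem a single-layer perceptron cannot solve,
# but a multilayer perceptron with one hidden layer can.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hidden layer with 8 units; the connection weights are the free parameters.
W1 = rng.normal(0, 1, (2, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 1, (8, 1)); b2 = np.zeros(1)

lr = 2.0
for _ in range(10000):
    h = sigmoid(X @ W1 + b1)           # forward pass
    out = sigmoid(h @ W2 + b2)
    err = out - y                      # backward pass for the MSE criterion
    d_out = err * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out / len(X); b2 -= lr * d_out.mean(axis=0)
    W1 -= lr * X.T @ d_h / len(X);   b1 -= lr * d_h.mean(axis=0)

print(np.round(out.ravel(), 2))  # typically close to [0, 1, 1, 0] after training
```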
1.6 Scope and Organization

In the remainder of this paper we will primarily review statistical methods for pattern representation and classification, emphasizing recent developments. Whenever appropriate, we will also discuss closely related algorithms from the neural networks literature. We omit the whole body of literature on fuzzy classification and fuzzy clustering which are in our opinion beyond the scope of this paper. Interested readers can refer to the well-written books on fuzzy pattern recognition by Bezdek [15] and [16]. In most of the sections, the various approaches and methods are summarized in tables as an easy and quick reference for the reader. Due to space constraints, we are not able to provide many details and we have to omit some of the approaches and the associated references. Our goal is to emphasize those approaches which have been extensively evaluated and demonstrated to be useful in practical applications, along with the new trends and ideas.

The literature on pattern recognition is vast and scattered in numerous journals in several disciplines (e.g., applied statistics, machine learning, neural networks, and signal and image processing). A quick scan of the table of contents of all the issues of the IEEE Transactions on Pattern Analysis and Machine Intelligence, since its first publication in January 1979, reveals that approximately 350 papers deal with pattern recognition. Approximately 300 of these papers covered the statistical approach and can be broadly categorized into the following subtopics: curse of dimensionality (15), dimensionality reduction (50), classifier design (175), classifier combination (10), error estimation (25), and unsupervised classification (50). In addition to the excellent textbooks by Duda and Hart [44] (whose second edition, by Duda, Hart, and Stork [45], is in press), Fukunaga [58], Devijver and Kittler [39], Devroye et al. [41], Bishop [18], Ripley [137], Schurmann [147], and McLachlan [105], we should also point out two excellent survey papers written by Nagy [111] in 1968 and by Kanal [89] in 1974. Nagy described the early roots of pattern recognition, which at that time was shared with researchers in artificial intelligence and perception. A large part of Nagy's paper introduced a number of potential applications of pattern recognition and the interplay between feature definition and the application domain knowledge. He also emphasized the linear classification methods; nonlinear techniques were based on polynomial discriminant functions as well as on potential functions (similar to what are now called the kernel functions). By the time Kanal wrote his survey paper, more than 500 papers and about half a dozen books on pattern recognition were already published. Kanal placed less emphasis on applications, but more on modeling and design of pattern recognition systems. The discussion on automatic feature extraction in [89] was based on various distance measures between class-conditional probability density functions and the resulting error bounds. Kanal's review also contained a large section on structural methods and pattern grammars.

In comparison to the state of the pattern recognition field as described by Nagy and Kanal in the 1960s and 1970s, today a number of commercial pattern recognition systems are available which even individuals can buy for personal use (e.g., machine printed character recognition and isolated spoken word recognition). This has been made possible by various technological developments resulting in the availability of inexpensive sensors and powerful desktop computers. The field of pattern recognition has become so large that in this review we had to skip detailed descriptions of various applications, as well as almost all the procedures which model domain-specific knowledge (e.g., structural pattern recognition and rule-based systems). The starting point of our review (Section 2) is the basic elements of statistical methods for pattern recognition. It should be apparent that a feature vector is a representation of real world objects; the choice of the representation strongly influences the classification results.
The topic of probabilistic distance measures is currently not as important as 20 years ago, since it is very difficult to estimate density functions in high dimensional feature spaces. Instead, the complexity of classification procedures and the resulting accuracy have gained a large interest. The curse of dimensionality (Section 3) as well as the danger of overtraining are some of the consequences of a complex classifier. It is now understood that these problems can, to some extent, be circumvented using regularization, or can even be completely resolved by a proper design of classification procedures. The study of support vector machines (SVMs), discussed in Section 5, has largely contributed to this understanding. In many real world problems, patterns are scattered in high-dimensional (often nonlinear) subspaces. As a consequence, nonlinear procedures and subspace approaches have become popular, both for dimensionality reduction (Section 4) and for building classifiers (Section 5). Neural networks offer powerful tools for these purposes. It is now widely accepted that no single procedure will completely solve a complex classification problem. There are many admissible approaches, each capable of discriminating patterns in certain portions of the feature space. The combination of classifiers has, therefore, become a heavily studied topic (Section 6). Various approaches to estimating the error rate of a classifier are presented in Section 7. The topic of unsupervised classification or clustering is covered in Section 8. Finally, Section 9 identifies the frontiers of pattern recognition.

It is our goal that most parts of the paper can be appreciated by a newcomer to the field of pattern recognition. To this purpose, we have included a number of examples to illustrate the performance of various algorithms. Nevertheless, we realize that, due to space limitations, we have not been able to introduce all the concepts completely. At these places, we have to rely on the background knowledge which may be available only to the more experienced readers.

2 STATISTICAL PATTERN RECOGNITION

Statistical pattern recognition has been used successfully to design a number of commercial recognition systems. In statistical pattern recognition, a pattern is represented by a set of d features, or attributes, viewed as a d-dimensional feature vector. Well-known concepts from statistical decision theory are utilized to establish decision boundaries between pattern classes. The recognition system is operated in two modes: training (learning) and classification (testing) (see Fig. 1). The role of the preprocessing module is to segment the pattern of interest from the background, remove noise, normalize the pattern, and perform any other operation which will contribute to defining a compact representation of the pattern. In the training mode, the feature extraction/selection module finds the appropriate features for representing the input patterns and the classifier is trained to partition the feature space. The feedback path allows a designer to optimize the preprocessing and feature extraction/selection strategies. In the classification mode, the trained classifier assigns the input pattern to one of the pattern classes under consideration based on the measured features.

Fig. 1. Model for statistical pattern recognition.
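The sketch below mirrors the two operating modes of Fig. 1 in code. The concrete module choices (clipping as preprocessing, variance ranking as feature selection, a nearest-mean classifier) are placeholder assumptions for illustration only; any preprocessing, feature extractor, and classifier could be plugged into the same structure.

```python
import numpy as np

def preprocess(x):
    return np.clip(x, -5.0, 5.0)                 # stand-in for noise removal / normalization

def select_features(X, k):
    return np.argsort(X.var(axis=0))[::-1][:k]   # keep the k most variable features

def train(X, labels, k):
    # Training mode: preprocessing -> feature extraction/selection -> learning.
    Xp = np.array([preprocess(x) for x in X])
    feats = select_features(Xp, k)
    means = {c: Xp[labels == c][:, feats].mean(axis=0) for c in np.unique(labels)}
    return feats, means                          # the "trained classifier"

def classify(x, feats, means):
    # Classification mode: preprocessing -> feature measurement -> classification.
    v = preprocess(x)[feats]
    return min(means, key=lambda c: np.linalg.norm(v - means[c]))

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (50, 10)), rng.normal(1.5, 1, (50, 10))])
labels = np.repeat([0, 1], 50)
feats, means = train(X, labels, k=4)
print(classify(rng.normal(1.5, 1, 10), feats, means))  # usually class 1
```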
The decision making process in statistical pattern recognition can be summarized as follows: A given pattern is to be assigned to one of $c$ categories $\omega_1, \omega_2, \ldots, \omega_c$ based on a vector of $d$ feature values $\mathbf{x} = (x_1, x_2, \ldots, x_d)$. The features are assumed to have a probability density or mass (depending on whether the features are continuous or discrete) function conditioned on the pattern class. Thus, a pattern vector $\mathbf{x}$ belonging to class $\omega_i$ is viewed as an observation drawn randomly from the class-conditional probability function $p(\mathbf{x} \mid \omega_i)$. A number of well-known decision rules, including the Bayes decision rule, the maximum likelihood rule (which can be viewed as a particular case of the Bayes rule), and the Neyman-Pearson rule are available to define the decision boundary. The "optimal" Bayes decision rule for minimizing the risk (expected value of the loss function) can be stated as follows: Assign input pattern $\mathbf{x}$ to class $\omega_i$ for which the conditional risk

$$R(\omega_i \mid \mathbf{x}) = \sum_{j=1}^{c} L(\omega_i, \omega_j) \, P(\omega_j \mid \mathbf{x}) \qquad (1)$$

is minimum, where $L(\omega_i, \omega_j)$ is the loss incurred in deciding $\omega_i$ when the true class is $\omega_j$ and $P(\omega_j \mid \mathbf{x})$ is the posterior probability [44]. In the case of the 0/1 loss function, as defined in (2), the conditional risk becomes the conditional probability of misclassification.

$$L(\omega_i, \omega_j) = \begin{cases} 0, & i = j \\ 1, & i \neq j \end{cases} \qquad (2)$$

For this choice of loss function, the Bayes decision rule can be simplified as follows (also called the maximum a posteriori (MAP) rule): Assign input pattern $\mathbf{x}$ to class $\omega_i$ if

$$P(\omega_i \mid \mathbf{x}) > P(\omega_j \mid \mathbf{x}) \quad \text{for all } j \neq i. \qquad (3)$$

Various strategies are utilized to design a classifier in statistical pattern recognition, depending on the kind of information available about the class-conditional densities. If all of the class-conditional densities are completely specified, then the optimal Bayes decision rule can be used to design a classifier. However, the class-conditional densities are usually not known in practice and must be learned from the available training patterns. If the form of the class-conditional densities is known (e.g., multivariate Gaussian), but some of the parameters of the densities (e.g., mean vectors and covariance matrices) are unknown, then we have a parametric decision problem. A common strategy for this kind of problem is to replace the unknown parameters in the density functions by their estimated values, resulting in the so-called Bayes plug-in classifier. The optimal Bayesian strategy in this situation requires additional information in the form of a prior distribution on the unknown parameters. If the form of the class-conditional densities is not known, then we operate in a nonparametric mode. In this case, we must either estimate the density function (e.g., Parzen window approach) or directly construct the decision boundary based on the training data (e.g., k-nearest neighbor rule). In fact, the multilayer perceptron can also be viewed as a supervised nonparametric method which constructs a decision boundary.

Another dichotomy in statistical pattern recognition is that of supervised learning (labeled training samples) versus unsupervised learning (unlabeled training samples). The label of a training pattern represents the category to which that pattern belongs. In an unsupervised learning problem, sometimes the number of classes must be learned along with the structure of each class. The various dichotomies that appear in statistical pattern recognition are shown in the tree structure of Fig. 2.
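A minimal sketch of the MAP rule (3) combined with the Bayes plug-in strategy described above: the class-conditional densities are assumed Gaussian and their unknown parameters are replaced by estimates from a training set. The training data and the equal priors are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic training data: two classes with bivariate Gaussian densities.
X0 = rng.multivariate_normal([0, 0], np.eye(2), 200)
X1 = rng.multivariate_normal([2, 1], [[1.0, 0.3], [0.3, 1.0]], 200)

def fit_gaussian(X):
    """Plug-in estimates of the mean vector and covariance matrix."""
    return X.mean(axis=0), np.cov(X, rowvar=False)

def log_density(x, mean, cov):
    d = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d @ np.linalg.solve(cov, d) + logdet + len(x) * np.log(2 * np.pi))

params = [fit_gaussian(X0), fit_gaussian(X1)]
priors = np.array([0.5, 0.5])

def map_classify(x):
    # MAP rule (3): maximize log P(w_j) + log p(x | w_j) over the classes.
    scores = [np.log(p) + log_density(x, m, c) for p, (m, c) in zip(priors, params)]
    return int(np.argmax(scores))

print(map_classify(np.array([2.0, 1.0])))   # -> 1
print(map_classify(np.array([-0.5, 0.0])))  # -> 0
```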
Another dichotomy in statistical pattern recognition is that of supervised learning (labeled training samples) versus unsupervised learning (unlabeled training samples). The label of a training pattern represents the category to which that pattern belongs. In an unsupervised learning problem, sometimes the number of classes must be learned along with the structure of each class. The various dichotomies that appear in statistical pattern recognition are shown in the tree structure of Fig. 2. As we traverse the tree from top to bottom and left to right, less information is available to the system designer and, as a result, the difficulty of the classification problems increases. In some sense, most of the approaches in statistical pattern recognition (leaf nodes in the tree of Fig. 2) are attempting to implement the Bayes decision rule. The field of cluster analysis essentially deals with decision making problems in the nonparametric and unsupervised learning mode [81]. Further, in cluster analysis the number of categories or clusters may not even be specified; the task is to discover a reasonable categorization of the data (if one exists). Cluster analysis algorithms, along with various techniques for visualizing and projecting multidimensional data, are also referred to as exploratory data analysis methods.

Yet another dichotomy in statistical pattern recognition can be based on whether the decision boundaries are obtained directly (geometric approach) or indirectly (probabilistic density-based approach), as shown in Fig. 2. The probabilistic approach first estimates the density functions and then constructs the discriminant functions which specify the decision boundaries. On the other hand, the geometric approach often constructs the decision boundaries directly by optimizing certain cost functions. We should point out that, under certain assumptions on the density functions, the two approaches are equivalent. We will see examples of each category in Section 5.

No matter which classification or decision rule is used, it must be trained using the available training samples. As a result, the performance of a classifier depends on both the number of available training samples and the specific values of those samples. At the same time, the goal of designing a recognition system is to classify future test samples which are likely to be different from the training samples. Therefore, optimizing a classifier to maximize its performance on the training set may not always result in the desired performance on a test set. The generalization ability of a classifier refers to its performance in classifying test patterns which were not used during the training stage. A poor generalization ability of a classifier can be attributed to any one of the following factors: 1) the number of features is too large relative to the number of training samples (curse of dimensionality [80]), 2) the number of unknown parameters associated with the classifier is large (e.g., polynomial classifiers or a large neural network), and 3) the classifier is too intensively optimized on the training set (overtrained); this is analogous to the phenomenon of overfitting in regression when there are too many free parameters.
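To make the gap between the apparent (training-set) performance and the test-set performance concrete, the sketch below (our own illustration, assuming NumPy and scikit-learn; the data are hypothetical and not one of the datasets used later in the paper) fits a 1-nearest-neighbor rule, which by construction memorizes the training set, and compares its apparent accuracy with its accuracy on an independent test set drawn from the same two overlapping Gaussian classes.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Two overlapping Gaussian classes; the overlap guarantees a nonzero Bayes error.
n_per_class = 50
X = np.vstack([rng.normal(0.0, 1.0, size=(n_per_class, 2)),
               rng.normal(1.0, 1.0, size=(n_per_class, 2))])
y = np.repeat([0, 1], n_per_class)

# An independent test set drawn from the same distributions.
X_test = np.vstack([rng.normal(0.0, 1.0, size=(1000, 2)),
                    rng.normal(1.0, 1.0, size=(1000, 2))])
y_test = np.repeat([0, 1], 1000)

# A 1-nearest-neighbor rule memorizes the training set, so its apparent
# (training-set) error is zero even though the classes overlap.
clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print("apparent accuracy:", clf.score(X, y))            # 1.0 by construction
print("test accuracy:    ", clf.score(X_test, y_test))  # noticeably lower
```

Because the classes overlap, no classifier can reach zero true error, so the perfect apparent accuracy of the 1-NN rule is exactly the kind of optimistic training-set estimate discussed above.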
Overtraining has been investigated theoretically for classifiers that minimize the apparent error rate (the error on the training set). The classical studies by Cover [33] and Vapnik [162] on classifier capacity and complexity provide a good understanding of the mechanisms behind overtraining. Complex classifiers (e.g., those having many independent parameters) may have a large capacity, i.e., they are able to represent many dichotomies for a given dataset. A frequently used measure for the capacity is the Vapnik-Chervonenkis (VC) dimensionality [162]. These results can also be used to prove some interesting properties, for example, the consistency of certain classifiers (see Devroye et al. [40], [41]). The practical use of the results on classifier complexity was initially limited because the proposed bounds on the required number of (training) samples were too conservative. In the recent development of support vector machines [162], however, these results have proved to be quite useful. The pitfalls of overadaptation of estimators to the given training set are observed in several stages of a pattern recognition system, such as dimensionality reduction, density estimation, and classifier design. A sound solution is to always use an independent dataset (test set) for evaluation. In order to avoid the necessity of having several independent test sets, estimators are often based on rotated subsets of the data, preserving different parts of the data for optimization and evaluation [166]. Examples are the optimization of the covariance estimates for the Parzen kernel [76] and discriminant analysis [61], and the use of bootstrapping for designing classifiers [48] and for error estimation [82].

Throughout the paper, some of the classification methods will be illustrated by simple experiments on the following three data sets:

Dataset 1: An artificial dataset consisting of two classes with bivariate Gaussian densities with the following parameters:

$$m_1 = (1, 1), \quad m_2 = (2, 0), \quad \Sigma_1 = \begin{bmatrix} 1 & 0 \\ 0 & 0.25 \end{bmatrix}, \quad \text{and} \quad \Sigma_2 = \begin{bmatrix} 0.8 & 0 \\ 0 & 1 \end{bmatrix}.$$

The intrinsic overlap between these two densities is 12.5 percent.

Dataset 2: The Iris dataset consists of 150 four-dimensional patterns in three classes (50 patterns each): Iris Setosa, Iris Versicolor, and Iris Virginica.

Dataset 3: The digit dataset consists of handwritten numerals ("0"-"9") extracted from a collection of Dutch utility maps. Two hundred patterns per class (for a total of 2,000 patterns) are available in the form of 30 × 48 binary images. These characters are represented in terms of the following six feature sets:

1. 76 Fourier coefficients of the character shapes;
2. 216 profile correlations;
3. 64 Karhunen-Loève coefficients;
4. 240 pixel averages in 2 × 3 windows;
5. 47 Zernike moments;
6. 6 morphological features.

Fig. 2. Various approaches in statistical pattern recognition.
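As a small illustration of how Dataset 1 can be paired with an independent test set, the sketch below (our own illustration, assuming NumPy and scikit-learn) samples the two Gaussian classes with the parameters given above and evaluates a Bayes plug-in classifier, here realized as quadratic discriminant analysis, which estimates each class mean and covariance and substitutes them into the Gaussian class-conditional densities.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.default_rng(1)

# Dataset 1: two bivariate Gaussian classes with the parameters given above.
m1, S1 = np.array([1.0, 1.0]), np.array([[1.0, 0.0], [0.0, 0.25]])
m2, S2 = np.array([2.0, 0.0]), np.array([[0.8, 0.0], [0.0, 1.0]])

def sample(n):
    X = np.vstack([rng.multivariate_normal(m1, S1, n),
                   rng.multivariate_normal(m2, S2, n)])
    y = np.repeat([0, 1], n)
    return X, y

X_train, y_train = sample(100)   # design (training) set
X_test, y_test = sample(5000)    # independent test set for evaluation

# QDA estimates each class mean and covariance and plugs them into the
# Gaussian class-conditional densities: a "Bayes plug-in" classifier.
qda = QuadraticDiscriminantAnalysis(store_covariance=True).fit(X_train, y_train)
print("test error: %.3f" % (1.0 - qda.score(X_test, y_test)))
# The paper quotes an intrinsic overlap of 12.5 percent for these densities,
# so the estimated test error can be compared against that figure.
```

Keeping the large test set separate from the design set is exactly the "independent dataset for evaluation" recommended above; the rotated-subset (cross-validation) estimators mentioned in the text serve the same purpose when data are scarce.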
Details of this dataset are available in [160]. In our experiments we always used the same subset of 1,000 patterns for testing and various subsets of the remaining 1,000 patterns for training.2 Throughout this paper, when we refer to "the digit dataset," just the Karhunen-Loève features (in item 3) are meant, unless stated otherwise.

3 THE CURSE OF DIMENSIONALITY AND PEAKING PHENOMENA

The performance of a classifier depends on the interrelationship between sample sizes, number of features, and classifier complexity. A naive table-lookup technique (partitioning the feature space into cells and associating a class label with each cell) requires the number of training data points to be an exponential function of the feature dimension [18]. This phenomenon is termed the "curse of dimensionality," which leads to the "peaking phenomenon" (see discussion below) in classifier design. It is well-known that the probability of misclassification of a decision rule does not increase as the number of features increases, as long as the class-conditional densities are completely known (or, equivalently, the number of training samples is arbitrarily large and representative of the underlying densities). However, it has often been observed in practice that the added features may actually degrade the performance of a classifier if the number of training samples that are used to design the classifier is small relative to the number of features. This paradoxical behavior is referred to as the peaking phenomenon3 [80], [131], [132]. A simple explanation for this phenomenon is as follows: The most commonly used parametric classifiers estimate the unknown parameters and plug them in for the true parameters in the class-conditional densities. For a fixed sample size, as the number of features is increased (with a corresponding increase in the number of unknown parameters), the reliability of the parameter estimates decreases. Consequently, the performance of the resulting plug-in classifiers, for a fixed sample size, may degrade with an increase in the number of features.

Trunk [157] provided a simple example to illustrate the curse of dimensionality which we reproduce below. Consider the two-class classification problem with equal prior probabilities, and a d-dimensional multivariate Gaussian distribution with the identity covariance matrix for each class. The mean vectors for the two classes have the following components:

$$m_1 = \left(1, \frac{1}{\sqrt{2}}, \frac{1}{\sqrt{3}}, \ldots, \frac{1}{\sqrt{d}}\right) \quad \text{and} \quad m_2 = \left(-1, -\frac{1}{\sqrt{2}}, -\frac{1}{\sqrt{3}}, \ldots, -\frac{1}{\sqrt{d}}\right).$$

Note that the features are statistically independent and the discriminating power of the successive features decreases monotonically, with the first feature providing the maximum discrimination between the two classes. The only parameter in the densities is the mean vector, $m = m_1 = -m_2$.

Trunk considered the following two cases:

1. The mean vector $m$ is known. In this situation, we can use the optimal Bayes decision rule (with a 0/1 loss function) to construct the decision boundary. The probability of error as a function of d can be expressed as:

$$P_e(d) = \int_{\sqrt{\sum_{i=1}^{d} (1/i)}}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2} z^2}\, dz. \qquad (4)$$

It is easy to verify that $\lim_{d \to \infty} P_e(d) = 0$. In other words, we can perfectly discriminate the two classes by arbitrarily increasing the number of features, d.

2. The mean vector $m$ is unknown and n labeled training samples are available. Trunk found the maximum likelihood estimate $\hat{m}$ of $m$ and used the plug-in decision rule (substitute $\hat{m}$ for $m$ in the optimal Bayes decision rule).
Now the probability of error, which is a function of both n and d, can be written as:

$$P_e(n, d) = \int_{\theta(d)}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2} z^2}\, dz, \quad \text{where} \qquad (5)$$

$$\theta(d) = \frac{\sum_{i=1}^{d} (1/i)}{\sqrt{\left(1 + \frac{1}{n}\right) \sum_{i=1}^{d} (1/i) + \frac{d}{n}}}. \qquad (6)$$

Trunk showed that $\lim_{d \to \infty} P_e(n, d) = \frac{1}{2}$, which implies that the probability of error approaches the maximum possible value of 0.5 for this two-class problem. This demonstrates that, unlike case 1), we cannot arbitrarily increase the number of features when the parameters of the class-conditional densities are estimated from a finite number of training samples. The practical implication of the curse of dimensionality is that a system designer should try to select only a small number of salient features when confronted with a limited training set.

All of the commonly used classifiers, including multilayer feed-forward networks, can suffer from the curse of dimensionality. While an exact relationship between the probability of misclassification, the number of training samples, the number of features, and the true parameters of the class-conditional densities is very difficult to establish, some guidelines have been suggested regarding the ratio of the sample size to the dimensionality. It is generally accepted that using at least ten times as many training samples per class as the number of features (n/d > 10) is a good practice to follow in classifier design [80]. The more complex the classifier, the larger this ratio of sample size to dimensionality should be in order to avoid the curse of dimensionality.

2. The dataset is available through the University of California, Irvine Machine Learning Repository (www.ics.uci.edu/~mlearn/MLRepository.html).
3. In the rest of this paper, we do not make a distinction between the curse of dimensionality and the peaking phenomenon.
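The behavior described by (4)-(6) is easy to check numerically. The sketch below (our own illustration, assuming NumPy and SciPy) evaluates the two error expressions for a fixed number of training samples; the Gaussian tail integral is computed with the standard normal survival function.

```python
import numpy as np
from scipy.stats import norm

def pe_known(d):
    """Eq. (4): error when the mean vector m is known."""
    gamma = np.sqrt(np.sum(1.0 / np.arange(1, d + 1)))
    return norm.sf(gamma)          # upper tail of the standard normal

def pe_plugin(n, d):
    """Eqs. (5)-(6): error of the plug-in rule with n training samples."""
    s = np.sum(1.0 / np.arange(1, d + 1))
    theta = s / np.sqrt((1.0 + 1.0 / n) * s + d / n)
    return norm.sf(theta)

n = 20
for d in (1, 5, 10, 50, 200, 1000):
    print(f"d={d:5d}  known: {pe_known(d):.3f}  plug-in (n={n}): {pe_plugin(n, d):.3f}")
```

For a fixed n, the plug-in error first decreases and then creeps back toward 0.5 as d grows, which is precisely the peaking behavior described above, while the error with known parameters keeps decreasing toward zero.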
4 DIMENSIONALITY REDUCTION

There are two main reasons to keep the dimensionality of the pattern representation (i.e., the number of features) as small as possible: measurement cost and classification accuracy. A limited yet salient feature set simplifies both the pattern representation and the classifiers that are built on the selected representation. Consequently, the resulting classifier will be faster and will use less memory. Moreover, as stated earlier, a small number of features can alleviate the curse of dimensionality when the number of training samples is limited. On the other hand, a reduction in the number of features may lead to a loss in the discrimination power and thereby lower the accuracy of the resulting recognition system. Watanabe's ugly duckling theorem [163] also supports the need for a careful choice of the features, since it is possible to make two arbitrary patterns similar by encoding them with a sufficiently large number of redundant features.

It is important to make a distinction between feature selection and feature extraction. The term feature selection refers to algorithms that select the (hopefully) best subset of the input feature set. Methods that create new features based on transformations or combinations of the original feature set are called feature extraction algorithms. However, the terms feature selection and feature extraction are used interchangeably in the literature. Note that often feature extraction precedes feature selection; first, features are extracted from the sensed data (e.g., using principal component or discriminant analysis) and then some of the extracted features with low discrimination ability are discarded. The choice between feature selection and feature extraction depends on the application domain and the specific training data which is available. Feature selection leads to savings in measurement cost (since some of the features are discarded) and the selected features retain their original physical interpretation. In addition, the retained features may be important for understanding the physical process that generates the patterns. On the other hand, transformed features generated by feature extraction may provide a better discriminative ability than the best subset of given features, but these new features (a linear or a nonlinear combination of given features) may not have a clear physical meaning.

In many situations, it is useful to obtain a two- or three-dimensional projection of the given multivariate data (n × d pattern matrix) to permit a visual examination of the data. Several graphical techniques also exist for visually observing multivariate data, in which the objective is to exactly depict each pattern as a picture with d degrees of freedom, where d is the given number of features. For example, Chernoff [29] represents each pattern as a cartoon face whose facial characteristics, such as nose length, mouth curvature, and eye size, are made to correspond to individual features. Fig. 3 shows three faces corresponding to the mean vectors of the Iris Setosa, Iris Versicolor, and Iris Virginica classes in the Iris data (150 four-dimensional patterns; 50 patterns per class). Note that the face associated with Iris Setosa looks quite different from the other two faces, which implies that the Setosa category can be well separated from the remaining two categories in the four-dimensional feature space (this is also evident in the two-dimensional plots of this data in Fig. 5).

Fig. 3. Chernoff Faces corresponding to the mean vectors of Iris Setosa, Iris Versicolor, and Iris Virginica.

The main issue in dimensionality reduction is the choice of a criterion function. A commonly used criterion is the classification error of a feature subset.
But the classification error itself cannot be reliably estimated when the ratio of the sample size to the number of features is small. In addition to the choice of a criterion function, we also need to determine the appropriate dimensionality of the reduced feature space. The answer to this question is embedded in the notion of the intrinsic dimensionality of the data. Intrinsic dimensionality essentially determines whether the given d-dimensional patterns can be described adequately in a subspace of dimensionality less than d. For example, d-dimensional patterns along a reasonably smooth curve have an intrinsic dimensionality of one, irrespective of the value of d. Note that the intrinsic dimensionality is not the same as the linear dimensionality, which is a global property of the data involving the number of significant eigenvalues of the covariance matrix of the data. While several algorithms are available to estimate the intrinsic dimensionality [81], they do not indicate how a subspace of the identified dimensionality can be easily identified.

We now briefly discuss some of the commonly used methods for feature extraction and feature selection.

4.1 Feature Extraction

Feature extraction methods determine an appropriate subspace of dimensionality m (either in a linear or a nonlinear way) in the original feature space of dimensionality d (m ≤ d). Linear transforms, such as principal component analysis, factor analysis, linear discriminant analysis, and projection pursuit, have been widely used in pattern recognition for feature extraction and dimensionality reduction. The best known linear feature extractor is the principal component analysis (PCA) or Karhunen-Loève expansion, which computes the m largest eigenvectors of the d × d covariance matrix of the n d-dimensional patterns. The linear transformation is defined as

$$Y = XH, \qquad (7)$$

where X is the given n × d pattern matrix, Y is the derived n × m pattern matrix, and H is the d × m matrix of linear transformation whose columns are the eigenvectors. Since PCA uses the most expressive features (eigenvectors with the largest eigenvalues), it effectively approximates the data by a linear subspace using the mean squared error criterion.

Other methods, like projection pursuit [53] and independent component analysis (ICA) [31], [11], [24], [96], are more appropriate for non-Gaussian distributions since they do not rely on the second-order property of the data. ICA has been successfully used for blind-source separation [78], i.e., extracting linear feature combinations that define independent sources. This demixing is possible if at most one of the sources has a Gaussian distribution.

Whereas PCA is an unsupervised linear feature extraction method, discriminant analysis uses the category information associated with each pattern for (linearly) extracting the most discriminatory features.
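A minimal sketch of the PCA transform in (7), written as our own illustration (assuming NumPy) rather than the paper's implementation: the covariance matrix of the patterns is eigendecomposed, the m eigenvectors with the largest eigenvalues form H, and Y = XH gives the reduced representation.

```python
import numpy as np

def pca_transform(X, m):
    """Project the n x d pattern matrix X onto its m principal components, eq. (7)."""
    Xc = X - X.mean(axis=0)                 # work with zero-mean data
    C = np.cov(Xc, rowvar=False)            # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]       # sort descending
    H = eigvecs[:, order[:m]]               # d x m matrix of top eigenvectors
    return Xc @ H, H                        # Y = XH (n x m) and the transform

# Toy example: 200 patterns in 5 dimensions reduced to m = 2 features.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # correlated features
Y, H = pca_transform(X, m=2)
print(Y.shape, H.shape)   # (200, 2) (5, 2)
```

Keeping the eigenvectors with the largest eigenvalues is what makes PCA the minimum mean-squared-error linear approximation mentioned above.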
In discriminant analysis, interclass separation is emphasized by replacing the total covariance matrix in PCA by a general separability measure like the Fisher criterion, which results in finding the eigenvectors of $S_w^{-1} S_b$ (the product of the inverse of the within-class scatter matrix, $S_w$, and the between-class scatter matrix, $S_b$) [58]. Another supervised criterion for non-Gaussian class-conditional densities is based on the Patrick-Fisher distance using Parzen density estimates [41].

There are several ways to define nonlinear feature extraction techniques. One such method which is directly related to PCA is called kernel PCA [73], [145]. The basic idea of kernel PCA is to first map the input data into some new feature space F, typically via a nonlinear function $\Phi$ (e.g., a polynomial of degree p), and then perform a linear PCA in the mapped space. However, the F space often has a very high dimension. To avoid computing the mapping $\Phi$ explicitly, kernel PCA employs only Mercer kernels which can be decomposed into a dot product,

$$K(\mathbf{x}, \mathbf{y}) = \Phi(\mathbf{x}) \cdot \Phi(\mathbf{y}).$$

As a result, the kernel space has a well-defined metric. Examples of Mercer kernels include the pth-order polynomial $(\mathbf{x} - \mathbf{y})^p$ and the Gaussian kernel $e^{-\|\mathbf{x} - \mathbf{y}\|^2 / c}$.

Let X be the normalized n × d pattern matrix with zero mean, and $\Phi(X)$ be the pattern matrix in the F space. The linear PCA in the F space solves the eigenvectors of the correlation matrix $\Phi(X)\Phi(X)^T$, which is also called the kernel matrix $K(X, X)$. In kernel PCA, the first m eigenvectors of $K(X, X)$ are obtained to define a transformation matrix E. (E has size n × m, where m represents the desired number of features, m ≤ d.) New patterns $\mathbf{x}$ are mapped by $K(\mathbf{x}, X)E$, so they are now represented relative to the training set and not by their measured feature values. Note that for a complete representation, up to m eigenvectors in E may be needed (depending on the kernel function) by kernel PCA, while in linear PCA a set of d eigenvectors represents the original feature space. How the kernel function should be chosen for a given application is still an open issue.

Multidimensional scaling (MDS) is another nonlinear feature extraction technique. It aims to represent a multidimensional dataset in two or three dimensions such that the distance matrix in the original d-dimensional feature space is preserved as faithfully as possible in the projected space. Various stress functions are used for measuring the performance of this mapping [20]; the most popular criterion is the stress function introduced by Sammon [141] and Niemann [114]. A problem with MDS is that it does not give an explicit mapping function, so it is not possible to place a new pattern in a map which has been computed for a given training set without repeating the mapping. Several techniques have been investigated to address this deficiency, ranging from linear interpolation to training a neural network [38]. It is also possible to redefine the MDS algorithm so that it directly produces a map that may be used for new test patterns [165].

A feed-forward neural network offers an integrated procedure for feature extraction and classification; the output of each hidden layer may be interpreted as a set of new, often nonlinear, features presented to the output layer for classification. In this sense, multilayer networks serve as feature extractors [100]. For example, the networks used by Fukushima et al. [62] and Le Cun et al. [95] have the so-called shared weight layers that are in fact filters for extracting features in two-dimensional images. During training, the filters are tuned to the data, so as to maximize the classification performance.
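The following sketch (our own simplified illustration, assuming NumPy; it omits the feature-space centering of the kernel matrix that a full kernel PCA would include) builds the kernel matrix K(X, X) with the Gaussian kernel above, takes its top m eigenvectors as the transformation matrix E, and maps new patterns by K(x, X)E.

```python
import numpy as np

def gaussian_kernel(A, B, c=1.0):
    """K(x, y) = exp(-||x - y||^2 / c) evaluated for all rows of A against all rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / c)

def kernel_pca_fit(X, m, c=1.0):
    """Return the n x m transformation matrix E built from the top m
    eigenvectors of the kernel matrix K(X, X).  Feature-space centering
    of K is omitted here to keep the sketch short."""
    K = gaussian_kernel(X, X, c)
    eigvals, eigvecs = np.linalg.eigh(K)     # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order[:m]]             # E, size n x m

def kernel_pca_map(x_new, X, E, c=1.0):
    """Map new patterns by K(x, X) E, i.e., relative to the training set."""
    return gaussian_kernel(np.atleast_2d(x_new), X, c) @ E

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))                # hypothetical training patterns
E = kernel_pca_fit(X, m=2)
print(kernel_pca_map(X[:3], X, E).shape)     # (3, 2)
```

As the text notes, the mapped coordinates of a new pattern depend on kernel evaluations against the training patterns rather than on its measured features directly.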
Neural networks can also be used directly for feature extraction in an unsupervised mode. Fig. 4a shows the architecture of a network which is able to find the PCA subspace [117]. Instead of sigmoids, the neurons have linear transfer functions. This network has d inputs and d outputs, where d is the given number of features. The inputs are also used as targets, forcing the output layer to reconstruct the input space using only one hidden layer. The three nodes in the hidden layer capture the first three principal components [18]. If two nonlinear layers with sigmoidal hidden units are also included (see Fig. 4b), then a nonlinear subspace is found in the middle layer (also called the bottleneck layer). The nonlinearity is limited by the size of these additional layers. These so-called autoassociative, or nonlinear PCA, networks offer a powerful tool to train and describe nonlinear subspaces [98]. Oja [118] shows how autoassociative networks can be used for ICA.

The Self-Organizing Map (SOM), or Kohonen Map [92], can also be used for nonlinear feature extraction. In SOM, neurons are arranged in an m-dimensional grid, where m is usually 1, 2, or 3. Each neuron is connected to all the d input units. The weights on the connections for each neuron form a d-dimensional weight vector. During training, patterns are