Details of this dataset are available in [160]. In our experiments we always used the same subset of 1,000 patterns for testing and various subsets of the remaining 1,000 patterns for training.2 Throughout this paper, when we refer to "the digit dataset," just the Karhunen-Loeve features (in item 3) are meant, unless stated otherwise.

3 THE CURSE OF DIMENSIONALITY AND PEAKING PHENOMENA

The performance of a classifier depends on the interrelationship between sample sizes, number of features, and classifier complexity. A naive table-lookup technique (partitioning the feature space into cells and associating a class label with each cell) requires the number of training data points to be an exponential function of the feature dimension [18]. This phenomenon is termed the "curse of dimensionality," which leads to the "peaking phenomenon" (see discussion below) in classifier design. It is well known that the probability of misclassification of a decision rule does not increase as the number of features increases, as long as the class-conditional densities are completely known (or, equivalently, the number of training samples is arbitrarily large and representative of the underlying densities). However, it has often been observed in practice that added features may actually degrade the performance of a classifier if the number of training samples used to design the classifier is small relative to the number of features. This paradoxical behavior is referred to as the peaking phenomenon3 [80], [131], [132]. A simple explanation for this phenomenon is as follows: The most commonly used parametric classifiers estimate the unknown parameters and plug them in for the true parameters in the class-conditional densities. For a fixed sample size, as the number of features is increased (with a corresponding increase in the number of unknown parameters), the reliability of the parameter estimates decreases. Consequently, the performance of the resulting plug-in classifiers, for a fixed sample size, may degrade with an increase in the number of features.

Trunk [157] provided a simple example to illustrate the curse of dimensionality, which we reproduce below. Consider the two-class classification problem with equal prior probabilities, and a d-dimensional multivariate Gaussian distribution with the identity covariance matrix for each class. The mean vectors for the two classes have the following components:

m_1 = (1, 1/\sqrt{2}, 1/\sqrt{3}, \ldots, 1/\sqrt{d}) \quad \text{and} \quad m_2 = (-1, -1/\sqrt{2}, -1/\sqrt{3}, \ldots, -1/\sqrt{d}).

Note that the features are statistically independent and the discriminating power of the successive features decreases monotonically, with the first feature providing the maximum discrimination between the two classes. The only parameter in the densities is the mean vector, m = m_1 = -m_2. Trunk considered the following two cases:

1. The mean vector m is known. In this situation, we can use the optimal Bayes decision rule (with a 0/1 loss function) to construct the decision boundary. The probability of error as a function of d can be expressed as:

P_e(d) = \int_{\sqrt{\sum_{i=1}^{d} (1/i)}}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}z^2} \, dz. \quad (4)

It is easy to verify that \lim_{d \to \infty} P_e(d) = 0. In other words, we can perfectly discriminate the two classes by arbitrarily increasing the number of features, d.

2. The mean vector m is unknown and n labeled training samples are available. Trunk found the maximum likelihood estimate \hat{m} of m and used the plug-in decision rule (substitute \hat{m} for m in the optimal Bayes decision rule). Now the probability of error, which is a function of both n and d, can be written as:

P_e(n, d) = \int_{\theta(d)}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}z^2} \, dz, \quad \text{where} \quad (5)

\theta(d) = \frac{\sum_{i=1}^{d} (1/i)}{\sqrt{\left(1 + \frac{1}{n}\right) \sum_{i=1}^{d} (1/i) + \frac{d}{n}}}. \quad (6)

Trunk showed that \lim_{d \to \infty} P_e(n, d) = \frac{1}{2}, which implies that the probability of error approaches the maximum possible value of 0.5 for this two-class problem. This demonstrates that, unlike case 1, we cannot arbitrarily increase the number of features when the parameters of the class-conditional densities are estimated from a finite number of training samples. The practical implication of the curse of dimensionality is that a system designer should try to select only a small number of salient features when confronted with a limited training set.
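Equations (4)-(6) can be evaluated directly to make the contrast between the two cases concrete. The following minimal sketch is our own illustration, not part of [157]: the function names and the choice n = 20 are assumptions made for the example. It shows P_e(d) shrinking toward 0 while the plug-in error P_e(n, d) first dips and then climbs back toward 0.5 as d grows with n fixed.

import math
from statistics import NormalDist


def pe_known_mean(d):
    """Eq. (4): Bayes error when the mean vector m is known."""
    b = math.sqrt(sum(1.0 / i for i in range(1, d + 1)))
    return 1.0 - NormalDist().cdf(b)  # upper tail of the standard Gaussian


def pe_plugin(n, d):
    """Eqs. (5)-(6): error of the plug-in rule built from n training samples."""
    s = sum(1.0 / i for i in range(1, d + 1))
    theta = s / math.sqrt((1.0 + 1.0 / n) * s + d / n)
    return 1.0 - NormalDist().cdf(theta)


if __name__ == "__main__":
    n = 20  # illustrative, finite training-set size
    for d in (1, 10, 100, 1_000, 10_000, 100_000):
        print(f"d={d:>6}  Pe(d)={pe_known_mean(d):.4f}  "
              f"Pe(n={n}, d)={pe_plugin(n, d):.4f}")

For n = 20, the plug-in error bottoms out at roughly a few tens of features and then increases, which is exactly the peaking behavior discussed above.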
All of the commonly used classifiers, including multilayer feed-forward networks, can suffer from the curse of dimensionality. While an exact relationship between the probability of misclassification, the number of training samples, the number of features, and the true parameters of the class-conditional densities is very difficult to establish, some guidelines have been suggested regarding the ratio of the sample size to dimensionality. It is generally accepted that using at least ten times as many training samples per class as the number of features (n/d > 10) is a good practice to follow in classifier design [80]. The more complex the classifier, the larger this ratio of sample size to dimensionality should be to avoid the curse of dimensionality.
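The same peaking can also be observed empirically, without the closed-form expressions. The sketch below is our own illustration, not taken from [157] or [80]: the sampling scheme with n training samples per class, the function names, and the parameter values are assumptions made for the example. It draws training data from the two Gaussians of Trunk's example, builds the plug-in rule by substituting the estimated mean for the true one, and estimates the test error by Monte Carlo as features are added. With a small training set, the error typically decreases at first and then rises again as d grows.

import numpy as np

rng = np.random.default_rng(0)


def trunk_mean(d):
    """Mean vector m = (1, 1/sqrt(2), ..., 1/sqrt(d)) of Trunk's example."""
    return 1.0 / np.sqrt(np.arange(1, d + 1))


def plugin_error(n, d, n_test=4000, n_trials=20):
    """Average Monte Carlo test error of the plug-in rule,
    trained on n samples per class, over n_trials repetitions."""
    m = trunk_mean(d)
    errors = []
    for _ in range(n_trials):
        # Training data: class +1 ~ N(m, I), class -1 ~ N(-m, I).
        x_pos = rng.standard_normal((n, d)) + m
        x_neg = rng.standard_normal((n, d)) - m
        m_hat = 0.5 * (x_pos.mean(axis=0) - x_neg.mean(axis=0))
        # Plug-in rule: assign class +1 when x . m_hat > 0
        # (the Bayes rule with the true m replaced by its estimate).
        labels = rng.integers(0, 2, n_test) * 2 - 1  # -1 or +1
        x_test = rng.standard_normal((n_test, d)) + labels[:, None] * m
        predictions = np.where(x_test @ m_hat > 0.0, 1, -1)
        errors.append(np.mean(predictions != labels))
    return float(np.mean(errors))


n = 10  # deliberately small training set per class
for d in (1, 2, 5, 10, 20, 50, 100, 200, 500, 1000):
    print(f"d={d:>4}  estimated error = {plugin_error(n, d):.3f}")

Increasing n delays the rise in error to larger d, consistent with the guideline that the sample size should grow with the dimensionality.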
4 DIMENSIONALITY REDUCTION

There are two main reasons to keep the dimensionality of the pattern representation (i.e., the number of features) as small as possible: measurement cost and classification accuracy.

2. The dataset is available through the University of California, Irvine Machine Learning Repository (www.ics.uci.edu/~mlearn/MLRepository.html).
3. In the rest of this paper, we do not make a distinction between the curse of dimensionality and the peaking phenomenon.