e_j = (0, …, 0, 1, 0, …, 0)^T if x ∈ group j. Hence the jth element of F(x) is given by

\[
F_j(\mathbf{x}) = E[y_j \mid \mathbf{x}]
= 1 \cdot P(y_j = 1 \mid \mathbf{x}) + 0 \cdot P(y_j = 0 \mid \mathbf{x})
= P(y_j = 1 \mid \mathbf{x})
= P(\omega_j \mid \mathbf{x}). \tag{9}
\]

That is, the least squares estimate of the mapping function in a classification problem is exactly the posterior probability.

Neural networks are universal approximators [37] and in theory can approximate any function arbitrarily closely. However, the mapping function represented by a network is not perfect, owing to the local minima problem, suboptimal network architecture, and the finite sample data used in neural network training. Therefore, neural networks actually provide estimates of the posterior probabilities.

The mean squared error function (7) can be derived [143], [83] as

\[
\mathrm{MSE} = \int \sum_j \bigl[F_j(\mathbf{x}) - P(\omega_j \mid \mathbf{x})\bigr]^2 f(\mathbf{x})\, d\mathbf{x}
+ \int \sum_j P(\omega_j \mid \mathbf{x})\bigl[1 - P(\omega_j \mid \mathbf{x})\bigr] f(\mathbf{x})\, d\mathbf{x} \tag{10}
\]

where f(x) denotes the probability density function of the input x. The second term on the right-hand side is called the approximation error [14] and is independent of the neural network. It reflects the inherent, irreducible error due to the randomness of the data. The first term, termed the estimation error, is affected by the effectiveness of the neural network mapping. Theoretically speaking, a large network as well as a large sample may be needed to obtain a satisfactory approximation. For example, Funahashi [53] shows that for the two-group d-dimensional Gaussian classification problem, neural networks with at least 2d hidden nodes can approximate the posterior probability with arbitrary accuracy when infinite data are available and the training proceeds ideally. Empirically, it is found that sample size is critical in learning but the number of hidden nodes may not be so important [83], [138].
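To make (9) and (10) concrete, the following minimal numpy sketch (my own illustration, not from the paper; the network size, learning rate, and data parameters are arbitrary assumptions) trains a one-hidden-layer sigmoid network with the squared error cost on a two-class, one-dimensional Gaussian problem whose Bayes posterior is known in closed form. The trained output can then be compared with P(ω_1 | x), and the two terms of (10) can be estimated by Monte Carlo for the single class-ω_1 output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two equally likely classes: w1 ~ N(-1, 1), w2 ~ N(+1, 1).
n = 2000
cls = rng.integers(0, 2, size=n)              # 0 -> w1, 1 -> w2
x = rng.normal(loc=2.0 * cls - 1.0, scale=1.0, size=n)
t = (cls == 0).astype(float)                  # 1-of-2 target for class w1

def bayes_posterior(x):
    """True P(w1 | x) for equal priors and unit-variance Gaussians."""
    p1 = np.exp(-0.5 * (x + 1.0) ** 2)
    p2 = np.exp(-0.5 * (x - 1.0) ** 2)
    return p1 / (p1 + p2)

# One-hidden-layer sigmoid network trained by batch gradient descent on
# the squared error cost (7); sizes and learning rate are arbitrary choices.
sig = lambda z: 1.0 / (1.0 + np.exp(-z))
h = 8
W1, b1 = rng.normal(0.0, 0.5, (1, h)), np.zeros(h)
W2, b2 = rng.normal(0.0, 0.5, (h, 1)), np.zeros(1)
X, T = x[:, None], t[:, None]

lr = 1.0
for _ in range(5000):
    Z = sig(X @ W1 + b1)                      # hidden layer activations
    F = sig(Z @ W2 + b2)                      # network output F(x)
    dOut = (F - T) * F * (1.0 - F) / n        # gradient at output pre-activation
    dHid = (dOut @ W2.T) * Z * (1.0 - Z)      # gradient at hidden pre-activation
    W2 -= lr * Z.T @ dOut; b2 -= lr * dOut.sum(axis=0)
    W1 -= lr * X.T @ dHid; b1 -= lr * dHid.sum(axis=0)

# The trained output should track the Bayes posterior P(w1 | x) ...
grid = np.linspace(-3.0, 3.0, 7)[:, None]
net = sig(sig(grid @ W1 + b1) @ W2 + b2)
print(np.c_[grid, net, bayes_posterior(grid)])

# ... and the two terms of (10), restricted to the class-w1 output, can be
# estimated by Monte Carlo: estimation error plus the irreducible part.
f_hat = sig(sig(X @ W1 + b1) @ W2 + b2).ravel()
p_true = bayes_posterior(x)
print("estimation error  ~", np.mean((f_hat - p_true) ** 2))
print("irreducible error ~", np.mean(p_true * (1.0 - p_true)))
```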
The result that the outputs of neural networks are least squares estimates of the Bayesian a posteriori probabilities also holds for other types of cost or error function, such as the cross entropy function [63], [138]. The cross entropy function can be a more appropriate criterion than the squared error cost function for training neural networks on classification problems because of their binary output characteristic [144]. Improved performance and reduced training time have been reported with the cross entropy function [75], [77]. Miyake and Kanaya [116] show that neural networks trained with a generalized mean-squared error objective function can yield the optimal Bayes rule.
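For reference, here is a small sketch (mine, not from the paper) of the two cost functions just discussed, written for 1-of-C coded targets t and network outputs f in (0, 1); both are minimized in expectation when each output equals the corresponding posterior probability. The sample values are arbitrary.

```python
import numpy as np

def squared_error(f, t):
    # cost (7): mean squared difference between outputs and 1-of-C targets
    return np.mean(np.sum((f - t) ** 2, axis=1))

def cross_entropy(f, t, eps=1e-12):
    # binary cross entropy summed over output nodes, as in [63], [138]
    f = np.clip(f, eps, 1.0 - eps)
    return -np.mean(np.sum(t * np.log(f) + (1 - t) * np.log(1 - f), axis=1))

t = np.array([[1.0, 0.0], [0.0, 1.0]])       # two patterns, two classes
f = np.array([[0.9, 0.1], [0.2, 0.8]])       # hypothetical network outputs
print(squared_error(f, t), cross_entropy(f, t))
```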
C. Neural Networks and Conventional Classifiers

Statistical pattern classifiers are based on Bayes decision theory, in which posterior probabilities play a central role. The fact that neural networks can provide estimates of the posterior probabilities implicitly establishes the link between neural networks and statistical classifiers. A direct comparison between them may not be possible, since neural networks are nonlinear, model-free methods while statistical methods are basically linear and model based.

By appropriately coding the desired output membership values, we may let neural networks directly model some discriminant functions. For example, in a two-group classification problem, suppose the desired output is coded as 1 if the object is from class 1 and -1 if it is from class 2. Then, from (9), the neural network estimates the following discriminant function:

\[
g(\mathbf{x}) = P(\omega_1 \mid \mathbf{x}) - P(\omega_2 \mid \mathbf{x}). \tag{11}
\]

The discriminating rule is simply: assign x to ω_1 if g(x) > 0 or to ω_2 if g(x) < 0. Any monotone increasing function of the posterior probability can be used to replace the posterior probability in (11), giving a different discriminant function but essentially the same classification rule.
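A minimal sketch (my own illustration; the basis functions and data parameters are arbitrary assumptions) of the ±1 coding behind (11): least squares regression of ±1 targets on a two-class Gaussian sample yields an output that tracks g(x) = P(ω_1|x) − P(ω_2|x), so classifying by the sign of the output is essentially the Bayes rule.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4000
cls = rng.integers(0, 2, size=n)                 # 0 -> w1, 1 -> w2
x = rng.normal(loc=2.0 * cls - 1.0, scale=1.0, size=n)
t = np.where(cls == 0, 1.0, -1.0)                # +1 for w1, -1 for w2

# Least squares fit of t on a tiny fixed nonlinear basis (a stand-in for a
# least squares trained network).  For this setup E[t | x] = -tanh(x), which
# lies in the span of the basis, so the fit can approach the true g(x).
Phi = np.c_[np.ones(n), x, np.tanh(x)]
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)

def g_true(x):                                    # P(w1|x) - P(w2|x)
    p1 = np.exp(-0.5 * (x + 1.0) ** 2)
    p2 = np.exp(-0.5 * (x - 1.0) ** 2)
    return (p1 - p2) / (p1 + p2)

grid = np.linspace(-3.0, 3.0, 7)
g_hat = np.c_[np.ones(7), grid, np.tanh(grid)] @ w
print(np.c_[grid, g_hat, g_true(grid)])          # fitted vs. true discriminant
print("assign to w1 where output > 0:", g_hat > 0)
```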
As the statistical counterpart of neural networks, discriminant analysis is a well-known supervised classifier. Gallinari et al. [54] describe a general framework for establishing the link between discriminant analysis and neural network models. They find that, under quite general conditions, the hidden layers of an MLP project the input data onto different clusters in such a way that these clusters can be further aggregated into different classes. For linear MLPs, the projection performed by the hidden layer is shown to be theoretically equivalent to linear discriminant analysis. Nonlinear MLPs, on the other hand, have been demonstrated through experiments to be capable of performing more powerful nonlinear discriminant analysis. Their work helps in understanding the underlying function and behavior of the hidden layer for classification problems and also explains why neural networks can, in principle, provide superior performance over linear discriminant analysis. Discriminant feature extraction by networks with nonlinear hidden nodes has also been demonstrated by Asoh and Otsu [6] and Webb and Lowe [181]. Lim, Alder, and Hadingham [103] show that neural networks can perform quadratic discriminant analysis.

Raudys [134], [135] presents a detailed analysis of the nonlinear single layer perceptron (SLP). He shows that, during the adaptive training process of the SLP, by purposefully controlling the SLP classifier's complexity through adjusting the target values, the learning steps, the number of iterations, and the use of regularization terms, the decision boundaries of SLP classifiers become equivalent or close to those of seven statistical classifiers. These statistical classifiers include the Euclidean distance classifier, the Fisher linear discriminant function, the Fisher linear discriminant function with pseudo-inversion of the covariance matrix, the generalized Fisher linear discriminant function, regularized linear discriminant analysis, the minimum empirical error classifier, and the maximum margin classifier [134]. Kanaya and Miyake [88] and Miyake and Kanaya [116] also illustrate, theoretically and empirically, the link between neural networks and the optimal Bayes rule in statistical decision problems.

Logistic regression is another important classification tool. In fact, it is a standard statistical approach used in medical diagnosis and epidemiologic studies [91]. Logistic regression is often preferred over discriminant analysis in practice [65],
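As a hedged aside (my own illustrative sketch, not text from the paper): the most direct link between logistic regression and neural classifiers is that a network reduced to a single sigmoid output unit with no hidden layer, trained by maximizing the binomial log-likelihood (i.e., the cross entropy cost discussed earlier), is exactly a logistic regression model. The minimal numpy sketch below, with arbitrary synthetic data and parameter values, illustrates this.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic binary data from a known logistic model (values are arbitrary).
n = 1000
X = rng.normal(size=(n, 2))
w_true, b_true = np.array([1.5, -2.0]), 0.3
p = 1.0 / (1.0 + np.exp(-(X @ w_true + b_true)))
y = (rng.random(n) < p).astype(float)

# "Network": a single sigmoid output unit, no hidden layer, trained by
# gradient ascent on the binomial log-likelihood (the cross entropy cost).
w, b = np.zeros(2), 0.0
lr = 0.5
for _ in range(5000):
    f = 1.0 / (1.0 + np.exp(-(X @ w + b)))    # sigmoid output
    g = y - f                                 # d(log-likelihood)/d(logit)
    w += lr * X.T @ g / n
    b += lr * g.mean()

print("estimated weights:", w, "bias:", b)    # approach the ML logistic fit
```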