The first term, $E[(y - F(\mathbf{x}))^2] = E[e^2]$, is independent of both the training sample and the underlying function; it reflects the irreducible estimation error due to the intrinsic noise of the data. The second term on the right-hand side of (14), therefore, is a natural measure of the effectiveness of $f(\mathbf{x};D)$ as a predictor of $y$. This term can be further decomposed as [57]

$$E_D\big[(f(\mathbf{x};D) - E(y\mid\mathbf{x}))^2\big] = \big\{E_D[f(\mathbf{x};D)] - E(y\mid\mathbf{x})\big\}^2 + E_D\big\{\big(f(\mathbf{x};D) - E_D[f(\mathbf{x};D)]\big)^2\big\}. \qquad (15)$$

The first term on the right-hand side is the square of the bias and is for simplicity called the model bias, while the second is termed the model variance. This is the famous bias-plus-variance decomposition of the prediction error.

Ideally, the optimal model that minimizes the overall MSE in (14) is given by $f(\mathbf{x};D) = E(y\mid\mathbf{x})$, which leaves the minimum MSE equal to the intrinsic error $E[e^2]$. In reality, however, because of the randomness of the limited data set $D$, the estimate $f(\mathbf{x};D)$ is itself a random variable and will hardly be the best possible function $E(y\mid\mathbf{x})$ for a given data set. The bias and variance terms in (15) hence provide useful information on how the estimate differs from the desired function. The model bias measures the extent to which the average of the estimation function, taken over all possible data sets of the same size, differs from the desired function. The model variance, on the other hand, measures the sensitivity of the estimation function to the training data set. Although it is desirable to have both low bias and low variance, we cannot reduce both at the same time for a given data set because these goals conflict. A model that is less dependent on the data tends to have low variance but high bias if the model is incorrect. Conversely, a model that fits the data well tends to have low bias but high variance when applied to different data sets. Hence a good model should balance model bias against model variance.
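The decomposition in (15) can be checked numerically by averaging over many training sets drawn in the same way. The following sketch is not from the cited works; it assumes a simple polynomial regression fit as a stand-in for the trained model $f(\mathbf{x};D)$ and a known noisy target, and estimates the model bias and model variance by Monte Carlo simulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x):                       # E(y|x): the desired (noise-free) function
    return np.sin(2.0 * np.pi * x)

def draw_dataset(n=25, noise=0.3):   # one training set D of fixed size n
    x = rng.uniform(0.0, 1.0, n)
    y = target(x) + rng.normal(0.0, noise, n)
    return x, y

def fit_predict(x, y, x_test, degree=3):
    # f(x; D): a polynomial fit plays the role of the trained model (illustrative choice)
    coeffs = np.polyfit(x, y, degree)
    return np.polyval(coeffs, x_test)

x_test = np.linspace(0.0, 1.0, 100)
preds = np.array([fit_predict(*draw_dataset(), x_test) for _ in range(500)])

mean_pred = preds.mean(axis=0)                        # E_D[f(x; D)]
bias_sq   = np.mean((mean_pred - target(x_test))**2)  # {E_D[f(x;D)] - E(y|x)}^2, averaged over x
variance  = np.mean(preds.var(axis=0))                # E_D{(f(x;D) - E_D[f(x;D)])^2}, averaged over x

print(f"model bias^2 ~ {bias_sq:.4f}, model variance ~ {variance:.4f}")
```

Increasing the polynomial degree (a more flexible model) typically drives the estimated bias down and the variance up, mirroring the tradeoff described above.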
The work by Geman et al. [57] on the bias and variance tradeoff under the quadratic objective function has stimulated a great deal of research interest and activity in the neural network, machine learning, and statistics communities. Wolpert [190] extends the bias-plus-variance dilemma to a more general bias-variance-covariance tradeoff in the Bayesian context. Jacobs [85] studies various properties of the bias and variance components for mixtures-of-experts architectures. Dietterich and Kong [41], Kong and Dietterich [94], Breiman [26], Kohavi and Wolpert [93], Tibshirani [168], James and Hastie [86], and Heskes [71] have developed different versions of the bias-variance decomposition for the zero-one loss functions of classification problems. These alternative decompositions provide insights into the nature of generalization error from different perspectives. Each decomposition formula has its own merits as well as demerits. Noticing that all formulations of the bias and variance decomposition in classification are additive, Friedman [48] points out that the bias and variance components are not necessarily additive and can instead be "interactive in a multiplicative and highly nonlinear way." He finds that this interaction may be exploited to reduce classification errors, because bias terms may be cancelled by low-variance but potentially high-bias methods to produce accurate classification. That simple classifiers often perform well in practice [76] seems to support Friedman's findings.
B. Methods for Reducing Prediction Error

As a flexible "model-free" approach to classification, neural networks often fit the training data very well and thus have low bias. The potential risk, however, is overfitting, which causes high variance in generalization. Dietterich and Kong [41] point out in the machine learning context that variance is a more important factor than learning bias in poor prediction performance. Breiman [26] finds that neural network classifiers belong to the unstable prediction methods, in that small changes in the training sample can cause large variations in the test results. Much attention has been paid to this problem of overfitting or high variance in the literature. A majority of the research effort has been devoted to developing methods that reduce the overfitting effect. Such methods include cross validation [118], [184], training with penalty terms [182], and weight decay and node pruning [137], [148]. Weigend [183] analyzes overfitting phenomena by introducing the concept of the effective number of hidden nodes. An interesting observation by Dietterich [39] is that improving the optimization algorithms used in training does not have a positive effect on testing performance, and hence the overfitting effect may be reduced by "undercomputing."
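As one concrete instance of the penalty-term and weight-decay ideas cited above, the sketch below adds an L2 penalty on the network weights so that the effective flexibility of the model is reduced. It is an illustrative assumption rather than the setup of any cited study, and it uses scikit-learn's MLPClassifier (whose alpha parameter is the strength of the L2 penalty) as a convenient stand-in network.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic two-class problem; a modest training set makes overfitting visible.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for alpha in (1e-5, 1e-2, 1.0):      # alpha is the L2 (weight-decay) penalty strength
    net = MLPClassifier(hidden_layer_sizes=(50,), alpha=alpha,
                        max_iter=2000, random_state=0)
    net.fit(X_tr, y_tr)
    print(f"alpha={alpha:g}  train acc={net.score(X_tr, y_tr):.3f}  "
          f"test acc={net.score(X_te, y_te):.3f}")
```

A very small penalty typically yields near-perfect training accuracy but a larger train-test gap (high variance); a moderate penalty trades a little bias for noticeably lower variance.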
Wang [179] points out the unpredictability of neural networks in classification applications in the context of learning and generalization. He proposes a global smoothing training strategy that imposes monotonic constraints on network training, which seems effective in solving classification problems [5].

The ensemble method, or the combination of multiple classifiers [21], [8], [64], [67], [87], [128], [129], [192], is another active research area aimed at reducing generalization error [153]. By averaging or voting over the prediction results from multiple networks, the model variance can be significantly reduced. The motivation for combining several neural networks is to improve the out-of-sample classification performance over individual classifiers, or to guard against the failure of individual component networks. It has been shown theoretically that the performance of the ensemble cannot be worse than that of any single model used separately if the predictions of the individual classifiers are unbiased and uncorrelated [129]. Tumer and Ghosh [172] provide an analytical framework to understand why linearly combined neural classifiers work and how to quantify the improvement achieved by combining. Kittler et al. [90] present a general theoretical framework for classifier ensembles. They review and compare many existing classifier combination schemes and show that many different ensemble methods can be treated as special cases of compound classification, in which all the pattern representations are used jointly to make decisions.
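The variance-reduction argument behind averaging can be seen in a small simulation. The sketch below is an illustrative assumption, not code from the cited works: each component classifier's output is modeled as an unbiased, uncorrelated noisy estimate of the true class posterior at a fixed input, and the variance of a single estimate is compared with that of the ensemble average.

```python
import numpy as np

rng = np.random.default_rng(1)

true_posterior = 0.7     # assumed P(class 1 | x) at a fixed input x
n_members = 10           # number of component networks in the ensemble
n_trials = 100_000       # independent simulated training outcomes

# Each member's estimated posterior: unbiased, uncorrelated noise around the truth.
member_outputs = true_posterior + rng.normal(0.0, 0.1, size=(n_trials, n_members))
ensemble_output = member_outputs.mean(axis=1)              # simple averaging combiner

print("single-model variance :", member_outputs[:, 0].var())
print("ensemble variance     :", ensemble_output.var())    # roughly 1/n_members as large
```

With unbiased, uncorrelated members the ensemble variance shrinks roughly as 1/M for M members, which is the mechanism behind the guarantee cited from [129]; correlation among members weakens, but does not by itself reverse, the effect.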
An ensemble can be formed from multiple network architectures, the same architecture trained with different algorithms or different initial random weights, or even different types of classifiers. The component networks can also be developed by training on different data, such as resampled versions of the training set. Mixed combinations of neural networks with traditional statistical classifiers have also been suggested [35], [112].
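One way to realize the resampling and random-initialization ideas above is a bagging-style construction: each component network is trained on a bootstrap resample of the training data with its own random seed, and the ensemble classifies by majority vote. The sketch below is a minimal illustration under those assumptions, again using scikit-learn's MLPClassifier as a stand-in component classifier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.utils import resample

X, y = make_classification(n_samples=400, n_features=20, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

members = []
for seed in range(10):
    # Different bootstrap resample and different initial weights for each component.
    X_b, y_b = resample(X_tr, y_tr, random_state=seed)
    net = MLPClassifier(hidden_layer_sizes=(30,), max_iter=1500, random_state=seed)
    members.append(net.fit(X_b, y_b))

votes = np.stack([m.predict(X_te) for m in members])        # shape: (members, samples)
majority = (votes.mean(axis=0) >= 0.5).astype(int)          # majority vote for 0/1 labels

print("mean single-net accuracy:", np.mean([m.score(X_te, y_te) for m in members]))
print("ensemble (vote) accuracy:", (majority == y_te).mean())
```

On any single run the vote is not guaranteed to beat every component, but it typically matches or exceeds the average component accuracy, consistent with the variance-reduction argument above.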