Section 20.1. Statistical Learning

…learning, because it requires solving an optimization problem instead of a large summation (or integration) problem. We will see examples of this later in the chapter.

In both Bayesian learning and MAP learning, the hypothesis prior P(hi) plays an important role. We saw in Chapter 18 that overfitting can occur when the hypothesis space is too expressive, so that it contains many hypotheses that fit the data set well. Rather than placing an arbitrary limit on the hypotheses to be considered, Bayesian and MAP learning methods use the prior to penalize complexity. Typically, more complex hypotheses have a lower prior probability—in part because there are usually many more complex hypotheses than simple hypotheses. On the other hand, more complex hypotheses have a greater capacity to fit the data. (In the extreme case, a lookup table can reproduce the data exactly with probability 1.) Hence, the hypothesis prior embodies a trade-off between the complexity of a hypothesis and its degree of fit to the data.

We can see the effect of this trade-off most clearly in the logical case, where H contains only deterministic hypotheses. In that case, P(d|hi) is 1 if hi is consistent and 0 otherwise. Looking at Equation (20.1), we see that hMAP will then be the simplest logical theory that is consistent with the data. Therefore, maximum a posteriori learning provides a natural embodiment of Ockham's razor.
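To make the posterior and MAP calculation concrete, here is a minimal Python sketch based on the chapter's running candy-bag example. The priors and lime proportions below are the values I recall from that example (five bag types h1–h5) and should be treated as assumptions to check against the text; the structure of the computation, P(hi|d) ∝ P(d|hi)P(hi), is what matters.

```python
# Sketch of Bayesian posterior updating and MAP selection for the candy-bag
# example.  Priors and lime fractions are assumed values from the chapter's
# running example, not ground truth.
priors = {"h1": 0.1, "h2": 0.2, "h3": 0.4, "h4": 0.2, "h5": 0.1}
lime_fraction = {"h1": 0.0, "h2": 0.25, "h3": 0.5, "h4": 0.75, "h5": 1.0}

def likelihood(data, h):
    """P(d | h): candies are drawn i.i.d. given the bag type h."""
    p = 1.0
    for candy in data:
        p *= lime_fraction[h] if candy == "lime" else 1.0 - lime_fraction[h]
    return p

def posterior(data):
    """P(h | d) obtained by normalizing P(d | h) P(h) over all hypotheses."""
    unnorm = {h: likelihood(data, h) * priors[h] for h in priors}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

data = ["lime"] * 5                      # five lime candies in a row
post = posterior(data)
h_map = max(post, key=post.get)          # hypothesis with highest posterior
print(post)                              # mass shifts toward h5 as limes accumulate
print("MAP hypothesis:", h_map)
```

Note how the prior does the work the text describes: a hypothesis that fits the data perfectly (here h5) can still be outweighed early on by a simpler or more probable alternative, until the data accumulate.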
Another insight into the trade-off between complexity and degree of fit is obtained by taking the logarithm of Equation (20.1). Choosing hMAP to maximize P(d|hi) P(hi) is equivalent to minimizing

    −log2 P(d|hi) − log2 P(hi) .

Using the connection between information encoding and probability that we introduced in Chapter 18, we see that the −log2 P(hi) term equals the number of bits required to specify the hypothesis hi. Furthermore, −log2 P(d|hi) is the additional number of bits required to specify the data, given the hypothesis. (To see this, consider that no bits are required if the hypothesis predicts the data exactly—as with h5 and the string of lime candies—and log2 1 = 0.) Hence, MAP learning is choosing the hypothesis that provides maximum compression of the data. The same task is addressed more directly by the minimum description length, or MDL, learning method, which attempts to minimize the size of hypothesis and data encodings rather than work with probabilities.

A final simplification is provided by assuming a uniform prior over the space of hypotheses. In that case, MAP learning reduces to choosing an hi that maximizes P(d|hi). This is called a maximum-likelihood (ML) hypothesis, hML. Maximum-likelihood learning is very common in statistics, a discipline in which many researchers distrust the subjective nature of hypothesis priors. It is a reasonable approach when there is no reason to prefer one hypothesis over another a priori—for example, when all hypotheses are equally complex. It provides a good approximation to Bayesian and MAP learning when the data set is large, because the data swamps the prior distribution over hypotheses, but it has problems (as we shall see) with small data sets.
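The two identities in this passage, MAP as minimum total code length and MAP with a uniform prior as maximum likelihood, can be checked numerically. The sketch below reuses the same assumed candy-example numbers as the previous sketch; it illustrates the identities rather than the MDL method itself, which works with explicit encodings instead of probabilities.

```python
# Numerical check: argmax P(d|h)P(h) == argmin of -log2 P(d|h) - log2 P(h),
# and with a uniform prior this collapses to maximum likelihood.
# Priors and lime proportions are assumed candy-example values.
import math

priors = {"h1": 0.1, "h2": 0.2, "h3": 0.4, "h4": 0.2, "h5": 0.1}
lime = {"h1": 0.0, "h2": 0.25, "h3": 0.5, "h4": 0.75, "h5": 1.0}

def likelihood(n_limes, h):
    """P(d | h) for a run of n_limes lime candies drawn i.i.d."""
    return lime[h] ** n_limes

def bits(p):
    """Code length in bits of an event with probability p, i.e. -log2 p."""
    return float("inf") if p == 0.0 else -math.log2(p)

def argmin_total_bits(n_limes, prior):
    """Hypothesis minimizing -log2 P(d|h) - log2 P(h): the MAP hypothesis."""
    total = {h: bits(likelihood(n_limes, h)) + bits(prior[h]) for h in prior}
    return min(total, key=total.get)

n = 5  # five lime candies observed
h_map = max(priors, key=lambda h: likelihood(n, h) * priors[h])
assert argmin_total_bits(n, priors) == h_map       # MAP == minimum total bits

# A uniform prior adds the same -log2 P(h) constant to every hypothesis,
# so minimizing total bits (MAP) reduces to maximizing P(d|h) alone (ML).
uniform = {h: 1.0 / len(priors) for h in priors}
h_ml = max(priors, key=lambda h: likelihood(n, h))
assert argmin_total_bits(n, uniform) == h_ml       # MAP under uniform prior == ML
```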