20 STATISTICAL LEARNING METHODS

In which we view learning as a form of uncertain reasoning from observations.

Part V pointed out the prevalence of uncertainty in real environments. Agents can handle uncertainty by using the methods of probability and decision theory, but first they must learn their probabilistic theories of the world from experience. This chapter explains how they can do that. We will see how to formulate the learning task itself as a process of probabilistic inference (Section 20.1). We will see that a Bayesian view of learning is extremely powerful, providing general solutions to the problems of noise, overfitting, and optimal prediction. It also takes into account the fact that a less-than-omniscient agent can never be certain about which theory of the world is correct, yet must still make decisions by using some theory of the world.

We describe methods for learning probability models—primarily Bayesian networks—in Sections 20.2 and 20.3. Section 20.4 looks at learning methods that store and recall specific instances. Section 20.5 covers neural network learning and Section 20.6 introduces kernel machines. Some of the material in this chapter is fairly mathematical (requiring a basic understanding of multivariate calculus), although the general lessons can be understood without plunging into the details. It may benefit the reader at this point to review the material in Chapters 13 and 14 and to peek at the mathematical background in Appendix A.

20.1 STATISTICAL LEARNING

The key concepts in this chapter, just as in Chapter 18, are data and hypotheses. Here, the data are evidence—that is, instantiations of some or all of the random variables describing the domain. The hypotheses are probabilistic theories of how the domain works, including logical theories as a special case.

Let us consider a very simple example. Our favorite Surprise candy comes in two flavors: cherry (yum) and lime (ugh). The candy manufacturer has a peculiar sense of humor and wraps each piece of candy in the same opaque wrapper, regardless of flavor. The candy is sold in very large bags, of which there are known to be five kinds—again, indistinguishable from the outside:
$h_1$: 100% cherry
$h_2$: 75% cherry + 25% lime
$h_3$: 50% cherry + 50% lime
$h_4$: 25% cherry + 75% lime
$h_5$: 100% lime

Given a new bag of candy, the random variable $H$ (for hypothesis) denotes the type of the bag, with possible values $h_1$ through $h_5$. $H$ is not directly observable, of course. As the pieces of candy are opened and inspected, data are revealed: $D_1, D_2, \ldots, D_N$, where each $D_i$ is a random variable with possible values cherry and lime. The basic task faced by the agent is to predict the flavor of the next piece of candy.[1]

Despite its apparent triviality, this scenario serves to introduce many of the major issues. The agent really does need to infer a theory of its world, albeit a very simple one. Bayesian learning simply calculates the probability of each hypothesis, given the data, and makes predictions on that basis. That is, the predictions are made by using all the hypotheses, weighted by their probabilities, rather than by using just a single "best" hypothesis. In this way, learning is reduced to probabilistic inference. Let $\mathbf{D}$ represent all the data, with observed value $\mathbf{d}$; then the probability of each hypothesis is obtained by Bayes' rule:

$$P(h_i \mid \mathbf{d}) = \alpha\, P(\mathbf{d} \mid h_i)\, P(h_i) \ . \qquad (20.1)$$

Now, suppose we want to make a prediction about an unknown quantity $X$. Then we have

$$P(X \mid \mathbf{d}) = \sum_i P(X \mid \mathbf{d}, h_i)\, P(h_i \mid \mathbf{d}) = \sum_i P(X \mid h_i)\, P(h_i \mid \mathbf{d}) \ , \qquad (20.2)$$

where we have assumed that each hypothesis determines a probability distribution over $X$. This equation shows that predictions are weighted averages over the predictions of the individual hypotheses. The hypotheses themselves are essentially "intermediaries" between the raw data and the predictions. The key quantities in the Bayesian approach are the hypothesis prior, $P(h_i)$, and the likelihood of the data under each hypothesis, $P(\mathbf{d} \mid h_i)$.

For our candy example, we will assume for the time being that the prior distribution over $h_1, \ldots, h_5$ is given by $\langle 0.1, 0.2, 0.4, 0.2, 0.1 \rangle$, as advertised by the manufacturer. The likelihood of the data is calculated under the assumption that the observations are i.i.d.—that is, independently and identically distributed—so that

$$P(\mathbf{d} \mid h_i) = \prod_j P(d_j \mid h_i) \ . \qquad (20.3)$$

For example, suppose the bag is really an all-lime bag ($h_5$) and the first 10 candies are all lime; then $P(\mathbf{d} \mid h_3)$ is $0.5^{10}$, because half the candies in an $h_3$ bag are lime.[2] Figure 20.1(a) shows how the posterior probabilities of the five hypotheses change as the sequence of 10 lime candies is observed. Notice that the probabilities start out at their prior values, so $h_3$ is initially the most likely choice and remains so after 1 lime candy is unwrapped.

[1] Statistically sophisticated readers will recognize this scenario as a variant of the urn-and-ball setup. We find urns and balls less compelling than candy; furthermore, candy lends itself to other tasks, such as deciding whether to trade the bag with a friend—see Exercise 20.3.
[2] We stated earlier that the bags of candy are very large; otherwise, the i.i.d. assumption fails to hold. Technically, it is more correct (but less hygienic) to rewrap each candy after inspection and return it to the bag.
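As a concrete sketch of the update cycle in Equations (20.1)–(20.3), the following Python fragment recomputes the posteriors and the Bayesian prediction as lime candies are observed, reproducing the curves in Figure 20.1. The variable names and the printing loop are our own, not part of the text.

```python
import numpy as np

prior = np.array([0.1, 0.2, 0.4, 0.2, 0.1])     # P(h1), ..., P(h5)
p_lime = np.array([0.0, 0.25, 0.5, 0.75, 1.0])  # P(lime | h_i) for each bag type

posterior = prior.copy()
for n in range(1, 11):                       # unwrap 10 lime candies in a row
    posterior = posterior * p_lime           # multiply in one lime's likelihood (Eq. 20.3)
    posterior = posterior / posterior.sum()  # normalization alpha (Eq. 20.1)
    pred_lime = (posterior * p_lime).sum()   # Bayesian prediction (Eq. 20.2)
    print(n, posterior.round(3), round(pred_lime, 3))
```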
Figure 20.1 (a) Posterior probabilities $P(h_i \mid d_1, \ldots, d_N)$ from Equation (20.1). The number of observations $N$ ranges from 1 to 10, and each observation is of a lime candy. (b) Bayesian prediction $P(d_{N+1} = lime \mid d_1, \ldots, d_N)$ from Equation (20.2). [Both plots run against the number of samples in $\mathbf{d}$; (a) shows the posterior probability of each hypothesis, (b) the probability that the next candy is lime.]

After 2 lime candies are unwrapped, $h_4$ is most likely; after 3 or more, $h_5$ (the dreaded all-lime bag) is the most likely. After 10 in a row, we are fairly certain of our fate. Figure 20.1(b) shows the predicted probability that the next candy is lime, based on Equation (20.2). As we would expect, it increases monotonically toward 1.

The example shows that the true hypothesis eventually dominates the Bayesian prediction. This is characteristic of Bayesian learning. For any fixed prior that does not rule out the true hypothesis, the posterior probability of any false hypothesis will eventually vanish, simply because the probability of generating "uncharacteristic" data indefinitely is vanishingly small. (This point is analogous to one made in the discussion of PAC learning in Chapter 18.) More importantly, the Bayesian prediction is optimal, whether the data set be small or large. Given the hypothesis prior, any other prediction will be correct less often.

The optimality of Bayesian learning comes at a price, of course. For real learning problems, the hypothesis space is usually very large or infinite, as we saw in Chapter 18. In some cases, the summation in Equation (20.2) (or integration, in the continuous case) can be carried out tractably, but in most cases we must resort to approximate or simplified methods.

A very common approximation—one that is usually adopted in science—is to make predictions based on a single most probable hypothesis—that is, an $h_i$ that maximizes $P(h_i \mid \mathbf{d})$. This is often called a maximum a posteriori or MAP (pronounced "em-ay-pee") hypothesis. Predictions made according to an MAP hypothesis $h_{\mathrm{MAP}}$ are approximately Bayesian to the extent that $P(X \mid \mathbf{d}) \approx P(X \mid h_{\mathrm{MAP}})$. In our candy example, $h_{\mathrm{MAP}} = h_5$ after three lime candies in a row, so the MAP learner then predicts that the fourth candy is lime with probability 1.0—a much more dangerous prediction than the Bayesian prediction of 0.8 shown in Figure 20.1. As more data arrive, the MAP and Bayesian predictions become closer, because the competitors to the MAP hypothesis become less and less probable.
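To see where those two numbers come from, here is the arithmetic after three lime candies (a worked check of our own, using the prior stated above). The unnormalized posteriors are the elementwise products of prior and likelihood:

$$P(h_i \mid \mathbf{d}) \propto \langle 0.1, 0.2, 0.4, 0.2, 0.1 \rangle \cdot \langle 0,\ 0.25^3,\ 0.5^3,\ 0.75^3,\ 1^3 \rangle \approx \langle 0,\ 0.013,\ 0.211,\ 0.355,\ 0.421 \rangle \ ,$$

so $h_5$ is the MAP hypothesis and predicts lime with probability 1.0, while the full Bayesian prediction from Equation (20.2) is

$$P(d_4 = lime \mid \mathbf{d}) \approx 0.013(0.25) + 0.211(0.5) + 0.355(0.75) + 0.421(1.0) \approx 0.80 \ .$$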
Although our example doesn't show it, finding MAP hypotheses is often much easier than Bayesian learning, because it requires solving an optimization problem instead of a large summation (or integration) problem. We will see examples of this later in the chapter.

In both Bayesian learning and MAP learning, the hypothesis prior $P(h_i)$ plays an important role. We saw in Chapter 18 that overfitting can occur when the hypothesis space is too expressive, so that it contains many hypotheses that fit the data set well. Rather than placing an arbitrary limit on the hypotheses to be considered, Bayesian and MAP learning methods use the prior to penalize complexity. Typically, more complex hypotheses have a lower prior probability—in part because there are usually many more complex hypotheses than simple hypotheses. On the other hand, more complex hypotheses have a greater capacity to fit the data. (In the extreme case, a lookup table can reproduce the data exactly with probability 1.) Hence, the hypothesis prior embodies a trade-off between the complexity of a hypothesis and its degree of fit to the data.

We can see the effect of this trade-off most clearly in the logical case, where $H$ contains only deterministic hypotheses. In that case, $P(\mathbf{d} \mid h_i)$ is 1 if $h_i$ is consistent and 0 otherwise. Looking at Equation (20.1), we see that $h_{\mathrm{MAP}}$ will then be the simplest logical theory that is consistent with the data. Therefore, maximum a posteriori learning provides a natural embodiment of Ockham's razor.

Another insight into the trade-off between complexity and degree of fit is obtained by taking the logarithm of Equation (20.1). Choosing $h_{\mathrm{MAP}}$ to maximize $P(\mathbf{d} \mid h_i)P(h_i)$ is equivalent to minimizing

$$-\log_2 P(\mathbf{d} \mid h_i) - \log_2 P(h_i) \ .$$

Using the connection between information encoding and probability that we introduced in Chapter 18, we see that the $-\log_2 P(h_i)$ term equals the number of bits required to specify the hypothesis $h_i$. Furthermore, $-\log_2 P(\mathbf{d} \mid h_i)$ is the additional number of bits required to specify the data, given the hypothesis. (To see this, consider that no bits are required if the hypothesis predicts the data exactly—as with $h_5$ and the string of lime candies—and $\log_2 1 = 0$.) Hence, MAP learning is choosing the hypothesis that provides maximum compression of the data. The same task is addressed more directly by the minimum description length, or MDL, learning method, which attempts to minimize the size of hypothesis and data encodings rather than work with probabilities.

A final simplification is provided by assuming a uniform prior over the space of hypotheses. In that case, MAP learning reduces to choosing an $h_i$ that maximizes $P(\mathbf{d} \mid h_i)$. This is called a maximum-likelihood (ML) hypothesis, $h_{\mathrm{ML}}$. Maximum-likelihood learning is very common in statistics, a discipline in which many researchers distrust the subjective nature of hypothesis priors. It is a reasonable approach when there is no reason to prefer one hypothesis over another a priori—for example, when all hypotheses are equally complex. It provides a good approximation to Bayesian and MAP learning when the data set is large, because the data swamps the prior distribution over hypotheses, but it has problems (as we shall see) with small data sets.
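For the candy example, the description-length comparison can be made concrete (a worked illustration of our own, using the prior and likelihoods given earlier). After 10 lime candies,

$$-\log_2 P(\mathbf{d} \mid h_5) - \log_2 P(h_5) = 0 + (-\log_2 0.1) \approx 3.32 \text{ bits},$$

whereas

$$-\log_2 P(\mathbf{d} \mid h_3) - \log_2 P(h_3) = -\log_2 0.5^{10} - \log_2 0.4 \approx 10 + 1.32 = 11.32 \text{ bits},$$

so $h_5$ gives by far the shorter total encoding, matching its status as the MAP hypothesis.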
20.2 LEARNING WITH COMPLETE DATA

Our development of statistical learning methods begins with the simplest task: parameter learning with complete data. A parameter learning task involves finding the numerical parameters for a probability model whose structure is fixed. For example, we might be interested in learning the conditional probabilities in a Bayesian network with a given structure. Data are complete when each data point contains values for every variable in the probability model being learned. Complete data greatly simplify the problem of learning the parameters of a complex model. We will also look briefly at the problem of learning structure.

Maximum-likelihood parameter learning: Discrete models

Suppose we buy a bag of lime and cherry candy from a new manufacturer whose lime–cherry proportions are completely unknown—that is, the fraction could be anywhere between 0 and 1. In that case, we have a continuum of hypotheses. The parameter in this case, which we call $\theta$, is the proportion of cherry candies, and the hypothesis is $h_\theta$. (The proportion of limes is just $1 - \theta$.) If we assume that all proportions are equally likely a priori, then a maximum-likelihood approach is reasonable. If we model the situation with a Bayesian network, we need just one random variable, Flavor (the flavor of a randomly chosen candy from the bag). It has values cherry and lime, where the probability of cherry is $\theta$ (see Figure 20.2(a)). Now suppose we unwrap $N$ candies, of which $c$ are cherries and $\ell = N - c$ are limes. According to Equation (20.3), the likelihood of this particular data set is

$$P(\mathbf{d} \mid h_\theta) = \prod_{j=1}^{N} P(d_j \mid h_\theta) = \theta^c \cdot (1-\theta)^\ell \ .$$

The maximum-likelihood hypothesis is given by the value of $\theta$ that maximizes this expression. The same value is obtained by maximizing the log likelihood,

$$L(\mathbf{d} \mid h_\theta) = \log P(\mathbf{d} \mid h_\theta) = \sum_{j=1}^{N} \log P(d_j \mid h_\theta) = c \log\theta + \ell \log(1-\theta) \ .$$

(By taking logarithms, we reduce the product to a sum over the data, which is usually easier to maximize.) To find the maximum-likelihood value of $\theta$, we differentiate $L$ with respect to $\theta$ and set the resulting expression to zero:

$$\frac{dL(\mathbf{d} \mid h_\theta)}{d\theta} = \frac{c}{\theta} - \frac{\ell}{1-\theta} = 0 \quad\Rightarrow\quad \theta = \frac{c}{c+\ell} = \frac{c}{N} \ .$$

In English, then, the maximum-likelihood hypothesis $h_{\mathrm{ML}}$ asserts that the actual proportion of cherries in the bag is equal to the observed proportion in the candies unwrapped so far!

It appears that we have done a lot of work to discover the obvious. In fact, though, we have laid out one standard method for maximum-likelihood parameter learning:

1. Write down an expression for the likelihood of the data as a function of the parameter(s).
2. Write down the derivative of the log likelihood with respect to each parameter.
3. Find the parameter values such that the derivatives are zero.
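A minimal Python sketch of this recipe for the one-parameter candy model; the closed-form answer $\theta = c/N$ is checked against a direct numerical scan of the log likelihood (the counts and names here are our own illustration):

```python
import numpy as np

def log_likelihood(theta, c, ell):
    """Log likelihood c*log(theta) + ell*log(1 - theta) from the text."""
    return c * np.log(theta) + ell * np.log(1 - theta)

c, ell = 7, 3             # say we unwrapped 7 cherry and 3 lime candies
theta_ml = c / (c + ell)  # closed-form maximum-likelihood estimate c/N

# Numerical check: the grid maximizer should agree with the closed form.
grid = np.linspace(0.001, 0.999, 999)
theta_grid = grid[np.argmax(log_likelihood(grid, c, ell))]
print(theta_ml, theta_grid)  # both approximately 0.7
```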
Figure 20.2 (a) Bayesian network model for the case of candies with an unknown proportion of cherries and limes. (b) Model for the case where the wrapper color depends (probabilistically) on the candy flavor. [Network (a) is the single node Flavor with parameter $\theta = P(F = cherry)$; network (b) adds a child Wrapper, with $\theta = P(F = cherry)$ and a table giving $P(W = red \mid F)$ as $\theta_1$ for cherry and $\theta_2$ for lime.]

The trickiest step is usually the last. In our example, it was trivial, but we will see that in many cases we need to resort to iterative solution algorithms or other numerical optimization techniques, as described in Chapter 4. The example also illustrates a significant problem with maximum-likelihood learning in general: when the data set is small enough that some events have not yet been observed—for instance, no cherry candies—the maximum-likelihood hypothesis assigns zero probability to those events. Various tricks are used to avoid this problem, such as initializing the counts for each event to 1 instead of zero.

Let us look at another example. Suppose this new candy manufacturer wants to give a little hint to the consumer and uses candy wrappers colored red and green. The Wrapper for each candy is selected probabilistically, according to some unknown conditional distribution, depending on the flavor. The corresponding probability model is shown in Figure 20.2(b). Notice that it has three parameters: $\theta$, $\theta_1$, and $\theta_2$. With these parameters, the likelihood of seeing, say, a cherry candy in a green wrapper can be obtained from the standard semantics for Bayesian networks (page 495):

$$P(Flavor = cherry, Wrapper = green \mid h_{\theta,\theta_1,\theta_2})$$
$$= P(Flavor = cherry \mid h_{\theta,\theta_1,\theta_2})\, P(Wrapper = green \mid Flavor = cherry, h_{\theta,\theta_1,\theta_2}) = \theta \cdot (1-\theta_1) \ .$$

Now, we unwrap $N$ candies, of which $c$ are cherries and $\ell$ are limes. The wrapper counts are as follows: $r_c$ of the cherries have red wrappers and $g_c$ have green, while $r_\ell$ of the limes have red and $g_\ell$ have green. The likelihood of the data is given by

$$P(\mathbf{d} \mid h_{\theta,\theta_1,\theta_2}) = \theta^c(1-\theta)^\ell \cdot \theta_1^{r_c}(1-\theta_1)^{g_c} \cdot \theta_2^{r_\ell}(1-\theta_2)^{g_\ell} \ .$$

This looks pretty horrible, but taking logarithms helps:

$$L = [c\log\theta + \ell\log(1-\theta)] + [r_c\log\theta_1 + g_c\log(1-\theta_1)] + [r_\ell\log\theta_2 + g_\ell\log(1-\theta_2)] \ .$$

The benefit of taking logs is clear: the log likelihood is the sum of three terms, each of which contains a single parameter.
When we take derivatives with respect to each parameter and set them to zero, we get three independent equations, each containing just one parameter:

$$\frac{\partial L}{\partial\theta} = \frac{c}{\theta} - \frac{\ell}{1-\theta} = 0 \quad\Rightarrow\quad \theta = \frac{c}{c+\ell}$$
$$\frac{\partial L}{\partial\theta_1} = \frac{r_c}{\theta_1} - \frac{g_c}{1-\theta_1} = 0 \quad\Rightarrow\quad \theta_1 = \frac{r_c}{r_c+g_c}$$
$$\frac{\partial L}{\partial\theta_2} = \frac{r_\ell}{\theta_2} - \frac{g_\ell}{1-\theta_2} = 0 \quad\Rightarrow\quad \theta_2 = \frac{r_\ell}{r_\ell+g_\ell} \ .$$

The solution for $\theta$ is the same as before. The solution for $\theta_1$, the probability that a cherry candy has a red wrapper, is the observed fraction of cherry candies with red wrappers, and similarly for $\theta_2$. These results are very comforting, and it is easy to see that they can be extended to any Bayesian network whose conditional probabilities are represented as tables. The most important point is that, with complete data, the maximum-likelihood parameter learning problem for a Bayesian network decomposes into separate learning problems, one for each parameter.[3] The second point is that the parameter values for a variable, given its parents, are just the observed frequencies of the variable values for each setting of the parent values. As before, we must be careful to avoid zeroes when the data set is small.

Naive Bayes models

Probably the most common Bayesian network model used in machine learning is the naive Bayes model. In this model, the "class" variable $C$ (which is to be predicted) is the root and the "attribute" variables $X_i$ are the leaves. The model is "naive" because it assumes that the attributes are conditionally independent of each other, given the class. (The model in Figure 20.2(b) is a naive Bayes model with just one attribute.) Assuming Boolean variables, the parameters are

$$\theta = P(C = true), \quad \theta_{i1} = P(X_i = true \mid C = true), \quad \theta_{i2} = P(X_i = true \mid C = false).$$

The maximum-likelihood parameter values are found in exactly the same way as for Figure 20.2(b). Once the model has been trained in this way, it can be used to classify new examples for which the class variable $C$ is unobserved. With observed attribute values $x_1, \ldots, x_n$, the probability of each class is given by

$$P(C \mid x_1, \ldots, x_n) = \alpha\, P(C) \prod_i P(x_i \mid C) \ .$$

A deterministic prediction can be obtained by choosing the most likely class. Figure 20.3 shows the learning curve for this method when it is applied to the restaurant problem from Chapter 18. The method learns fairly well but not as well as decision-tree learning; this is presumably because the true hypothesis—which is a decision tree—is not representable exactly using a naive Bayes model. Naive Bayes learning turns out to do surprisingly well in a wide range of applications; the boosted version (Exercise 20.5) is one of the most effective general-purpose learning algorithms. Naive Bayes learning scales well to very large problems: with $n$ Boolean attributes, there are just $2n + 1$ parameters, and no search is required to find $h_{\mathrm{ML}}$, the maximum-likelihood naive Bayes hypothesis. Finally, naive Bayes learning has no difficulty with noisy data and can give probabilistic predictions when appropriate.

[3] See Exercise 20.7 for the nontabulated case, where each parameter affects several conditional probabilities.
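A compact sketch of naive Bayes training and prediction for Boolean attributes, assuming the data arrive as 0/1 NumPy arrays; the function names, array layout, and the exact smoothing scheme are our own choices, with the count-to-one initialization being the trick mentioned earlier for avoiding zero probabilities:

```python
import numpy as np

def train_naive_bayes(X, y):
    """Maximum-likelihood fit by counting. X is (N, n) with 0/1 attribute
    values; y is (N,) with 0/1 class labels. Attribute counts start at 1
    so that unseen events do not get zero probability."""
    theta = y.mean()  # P(C = true): the observed class frequency
    theta_i1 = (X[y == 1].sum(axis=0) + 1) / ((y == 1).sum() + 2)  # P(Xi=true | C=true)
    theta_i2 = (X[y == 0].sum(axis=0) + 1) / ((y == 0).sum() + 2)  # P(Xi=true | C=false)
    return theta, theta_i1, theta_i2

def predict(x, theta, theta_i1, theta_i2):
    """P(C = true | x1, ..., xn) = alpha * P(C) * prod_i P(xi | C)."""
    p_true = theta * np.prod(np.where(x == 1, theta_i1, 1 - theta_i1))
    p_false = (1 - theta) * np.prod(np.where(x == 1, theta_i2, 1 - theta_i2))
    return p_true / (p_true + p_false)  # normalization plays the role of alpha
```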
Figure 20.3 The learning curve for naive Bayes learning applied to the restaurant problem from Chapter 18; the learning curve for decision-tree learning is shown for comparison. [Plot: proportion correct on the test set against training set size, with curves for the decision tree and naive Bayes.]

Maximum-likelihood parameter learning: Continuous models

Continuous probability models such as the linear-Gaussian model were introduced in Section 14.3. Because continuous variables are ubiquitous in real-world applications, it is important to know how to learn continuous models from data. The principles for maximum-likelihood learning are identical to those of the discrete case.

Let us begin with a very simple case: learning the parameters of a Gaussian density function on a single variable. That is, the data are generated as follows:

$$P(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \ .$$

The parameters of this model are the mean $\mu$ and the standard deviation $\sigma$. (Notice that the normalizing "constant" depends on $\sigma$, so we cannot ignore it.) Let the observed values be $x_1, \ldots, x_N$. Then the log likelihood is

$$L = \sum_{j=1}^{N} \log \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_j-\mu)^2}{2\sigma^2}} = N(-\log\sqrt{2\pi} - \log\sigma) - \sum_{j=1}^{N} \frac{(x_j-\mu)^2}{2\sigma^2} \ .$$

Setting the derivatives to zero as usual, we obtain

$$\frac{\partial L}{\partial\mu} = -\frac{1}{\sigma^2}\sum_{j=1}^{N}(x_j-\mu) = 0 \quad\Rightarrow\quad \mu = \frac{\sum_j x_j}{N}$$
$$\frac{\partial L}{\partial\sigma} = -\frac{N}{\sigma} + \frac{1}{\sigma^3}\sum_{j=1}^{N}(x_j-\mu)^2 = 0 \quad\Rightarrow\quad \sigma = \sqrt{\frac{\sum_j (x_j-\mu)^2}{N}} \ . \qquad (20.4)$$

That is, the maximum-likelihood value of the mean is the sample average and the maximum-likelihood value of the standard deviation is the square root of the sample variance. Again, these are comforting results that confirm "commonsense" practice.
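A quick numerical sanity check of Equation (20.4), using synthetic data of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=1.5, size=10_000)  # samples from a known Gaussian

mu_ml = x.sum() / len(x)                               # sample average
sigma_ml = np.sqrt(((x - mu_ml) ** 2).sum() / len(x))  # root of sample variance

print(mu_ml, sigma_ml)  # should be close to the generating values 3.0 and 1.5
```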
720 Chapter 20.Statistical Learning Methods 11 0.6 0.4 02 0.6 0 00.102030.40.50.60.70.80.9 (a) sian model described asy+2 plus Gaussian noise distribution P(YX),we can maximize the conditional likelihood 1 (20.5) Here,the parameters are02.and.The data are a collection of()pairs,as illustrated in Figure 20.4.Using the usual methods(Exercise 20.6),we can find the maximum-likelihood values of the parameters.Here,we want to make a different point.If we consider just the parameters and 02 that define the linear relationship betweenz andy,it becomes clear that maximizing the log likelihood with respect to these parameters is the same as minimizing the numerator in the exponent of Equation(20.5): E=∑防-01+2P =1 The qntty(y一4巧土》is theeroor(,%hats.thefr benthe and the dicted value :+02 Im ofs SUM OE SOURPED tity that is min zed by y the standard lin regr ssio edur INEAR REGRESSION ing the model.pro ided t squared he ght-line sia noise fixed variance. Bayesian parameter learning Maximum-likelihood learning gives rise to some very simple procedures.but it has some serious deficiencies with small data sets.For example,after seeing one cherry candy,the maximum-likelihood hypothesis is that the bag is 100%cherry (i.e.,=1.0).Unless one's hypothesis prior is that bags must be either all cherry or all lime,this is not a reasonable conclusion.The Bayesian approach to parameter learning places a hypothesis prior over the possible values of the parameters and updates this distribution as data arrive
Figure 20.4 (a) A linear Gaussian model described as $y = \theta_1 x + \theta_2$ plus Gaussian noise with fixed variance. (b) A set of 50 data points generated from this model. [Panel (a) plots the conditional density $P(y \mid x)$ over the unit square; panel (b) is a scatter plot of $y$ against $x$.]

Now consider a linear Gaussian model with one continuous parent $X$ and a continuous child $Y$. As explained on page 502, $Y$ has a Gaussian distribution whose mean depends linearly on the value of $X$ and whose standard deviation is fixed. To learn the conditional distribution $P(Y \mid X)$, we can maximize the conditional likelihood

$$P(y \mid x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(y - (\theta_1 x + \theta_2))^2}{2\sigma^2}} \ . \qquad (20.5)$$

Here, the parameters are $\theta_1$, $\theta_2$, and $\sigma$. The data are a collection of $(x_j, y_j)$ pairs, as illustrated in Figure 20.4. Using the usual methods (Exercise 20.6), we can find the maximum-likelihood values of the parameters. Here, we want to make a different point. If we consider just the parameters $\theta_1$ and $\theta_2$ that define the linear relationship between $x$ and $y$, it becomes clear that maximizing the log likelihood with respect to these parameters is the same as minimizing the numerator in the exponent of Equation (20.5):

$$E = \sum_{j=1}^{N} (y_j - (\theta_1 x_j + \theta_2))^2 \ .$$

The quantity $(y_j - (\theta_1 x_j + \theta_2))$ is the error for $(x_j, y_j)$—that is, the difference between the actual value $y_j$ and the predicted value $(\theta_1 x_j + \theta_2)$—so $E$ is the well-known sum of squared errors. This is the quantity that is minimized by the standard linear regression procedure. Now we can understand why: minimizing the sum of squared errors gives the maximum-likelihood straight-line model, provided that the data are generated with Gaussian noise of fixed variance.
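A brief sketch of that equivalence in practice: an ordinary least-squares fit recovers the generating line from noisy data. The parameter values and seed here are our own illustration, chosen to mimic Figure 20.4:

```python
import numpy as np

rng = np.random.default_rng(1)
theta1, theta2, sigma = 0.5, 0.25, 0.1  # generating parameters for y = theta1*x + theta2
x = rng.uniform(0.0, 1.0, size=50)
y = theta1 * x + theta2 + rng.normal(0.0, sigma, size=50)  # fixed-variance Gaussian noise

# Minimizing E = sum_j (y_j - (theta1*x_j + theta2))^2 is ordinary least
# squares; np.polyfit with degree 1 solves exactly this problem.
theta1_ml, theta2_ml = np.polyfit(x, y, deg=1)
print(theta1_ml, theta2_ml)  # close to 0.5 and 0.25
```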
Bayesian parameter learning

Maximum-likelihood learning gives rise to some very simple procedures, but it has some serious deficiencies with small data sets. For example, after seeing one cherry candy, the maximum-likelihood hypothesis is that the bag is 100% cherry (i.e., $\theta = 1.0$). Unless one's hypothesis prior is that bags must be either all cherry or all lime, this is not a reasonable conclusion. The Bayesian approach to parameter learning places a hypothesis prior over the possible values of the parameters and updates this distribution as data arrive.

Figure 20.5 Examples of the beta[$a, b$] distribution for different values of $[a, b]$. [Two panels of densities $P(\Theta = \theta)$ against the parameter $\theta$: (a) beta[1,1], beta[2,2], beta[5,5]; (b) beta[3,1], beta[6,2], beta[30,10].]

The candy example in Figure 20.2(a) has one parameter, $\theta$: the probability that a randomly selected piece of candy is cherry flavored. In the Bayesian view, $\theta$ is the (unknown) value of a random variable $\Theta$; the hypothesis prior is just the prior distribution $\mathbf{P}(\Theta)$. Thus, $P(\Theta = \theta)$ is the prior probability that the bag has a fraction $\theta$ of cherry candies. If the parameter $\theta$ can be any value between 0 and 1, then $\mathbf{P}(\Theta)$ must be a continuous distribution that is nonzero only between 0 and 1 and that integrates to 1. The uniform density $P(\theta) = U[0,1](\theta)$ is one candidate. (See Chapter 13.) It turns out that the uniform density is a member of the family of beta distributions. Each beta distribution is defined by two hyperparameters[4] $a$ and $b$ such that

$$\text{beta}[a,b](\theta) = \alpha\, \theta^{a-1}(1-\theta)^{b-1} \ , \qquad (20.6)$$

for $\theta$ in the range $[0, 1]$. The normalization constant $\alpha$ depends on $a$ and $b$. (See Exercise 20.8.) Figure 20.5 shows what the distribution looks like for various values of $a$ and $b$. The mean value of the distribution is $a/(a+b)$, so larger values of $a$ suggest a belief that $\Theta$ is closer to 1 than to 0. Larger values of $a + b$ make the distribution more peaked, suggesting greater certainty about the value of $\Theta$. Thus, the beta family provides a useful range of possibilities for the hypothesis prior.

Besides its flexibility, the beta family has another wonderful property: if $\Theta$ has a prior beta[$a, b$], then, after a data point is observed, the posterior distribution for $\Theta$ is also a beta distribution. The beta family is called the conjugate prior for the family of distributions for a Boolean variable.[5] Let's see how this works. Suppose we observe a cherry candy; then

$$P(\theta \mid D_1 = cherry) = \alpha\, P(D_1 = cherry \mid \theta)\, P(\theta)$$
$$= \alpha'\, \theta \cdot \text{beta}[a,b](\theta) = \alpha'\, \theta \cdot \theta^{a-1}(1-\theta)^{b-1}$$
$$= \alpha'\, \theta^{a}(1-\theta)^{b-1} = \text{beta}[a+1,b](\theta) \ .$$

[4] They are called hyperparameters because they parameterize a distribution over $\theta$, which is itself a parameter.
[5] Other conjugate priors include the Dirichlet family for the parameters of a discrete multivalued distribution and the Normal–Wishart family for the parameters of a Gaussian distribution. See Bernardo and Smith (1994).
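A tiny sketch of Bayesian updating with the conjugate beta prior. The text derives only the cherry case; the lime case, beta[$a, b$] to beta[$a, b+1$], follows by the same algebra with the roles of the two outcomes swapped, and is our own addition here:

```python
def update(a, b, flavor):
    """Conjugate update of a beta[a, b] prior on Theta = P(cherry):
    a cherry observation gives beta[a+1, b]; a lime gives beta[a, b+1]."""
    return (a + 1, b) if flavor == "cherry" else (a, b + 1)

a, b = 1, 1  # uniform prior beta[1, 1]
for flavor in ["cherry", "lime", "cherry", "cherry"]:
    a, b = update(a, b, flavor)

print(a, b)         # posterior is beta[4, 2]
print(a / (a + b))  # posterior mean of Theta, about 0.67
```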