Section 20.1 Statistical Learning

h_1: 100% cherry
h_2: 75% cherry + 25% lime
h_3: 50% cherry + 50% lime
h_4: 25% cherry + 75% lime
h_5: 100% lime

Given a new bag of candy, the random variable H (for hypothesis) denotes the type of the bag, with possible values h_1 through h_5. H is not directly observable, of course. As the pieces of candy are opened and inspected, data are revealed: D_1, D_2, ..., D_N, where each D_i is a random variable with possible values cherry and lime. The basic task faced by the agent is to predict the flavor of the next piece of candy.[1] Despite its apparent triviality, this scenario serves to introduce many of the major issues. The agent really does need to infer a theory of its world, albeit a very simple one.

Bayesian learning simply calculates the probability of each hypothesis, given the data, and makes predictions on that basis. That is, the predictions are made by using all the hypotheses, weighted by their probabilities, rather than by using just a single "best" hypothesis. In this way, learning is reduced to probabilistic inference. Let D represent all the data, with observed value d; then the probability of each hypothesis is obtained by Bayes' rule:

    P(h_i | d) = \alpha P(d | h_i) P(h_i) .                                          (20.1)

Now, suppose we want to make a prediction about an unknown quantity X. Then we have

    P(X | d) = \sum_i P(X | d, h_i) P(h_i | d) = \sum_i P(X | h_i) P(h_i | d) ,      (20.2)

where we have assumed that each hypothesis determines a probability distribution over X. This equation shows that predictions are weighted averages over the predictions of the individual hypotheses. The hypotheses themselves are essentially "intermediaries" between the raw data and the predictions. The key quantities in the Bayesian approach are the hypothesis prior, P(h_i), and the likelihood of the data under each hypothesis, P(d | h_i).
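The calculation prescribed by Equations (20.1) and (20.2) is short enough to write out directly. The sketch below is only an illustration under stated assumptions, not an implementation from the book: the Python setting, the dictionary representation, and the function names posterior and predict are choices made here, while the priors and the per-hypothesis probability of drawing a lime candy come from the running example.

```python
# Minimal sketch of Bayesian learning over a finite hypothesis space,
# following Equations (20.1) and (20.2). Hypotheses, priors, and likelihoods
# are plain dictionaries keyed by hypothesis name.

def posterior(prior, likelihood):
    """P(h_i | d) = alpha * P(d | h_i) * P(h_i)   (Eq. 20.1)."""
    unnorm = {h: likelihood[h] * prior[h] for h in prior}
    alpha = 1.0 / sum(unnorm.values())           # normalization constant
    return {h: alpha * p for h, p in unnorm.items()}

def predict(post, outcome_prob):
    """P(X = x | d) = sum_i P(X = x | h_i) * P(h_i | d)   (Eq. 20.2)."""
    return sum(outcome_prob[h] * post[h] for h in post)

# The candy example: priors as advertised by the manufacturer, and the
# probability P(lime | h_i) that a candy drawn from an h_i bag is lime.
prior = {"h1": 0.1, "h2": 0.2, "h3": 0.4, "h4": 0.2, "h5": 0.1}
lime_prob = {"h1": 0.0, "h2": 0.25, "h3": 0.5, "h4": 0.75, "h5": 1.0}

# After a single lime candy, the likelihood of the data under h_i is just
# P(lime | h_i), so lime_prob doubles as the likelihood here.
post = posterior(prior, lime_prob)
print(post)                      # h3 is still the most likely hypothesis
print(predict(post, lime_prob))  # probability that the next candy is also lime
```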
For our candy example, we will assume for the time being that the prior distribution over h_1, ..., h_5 is given by ⟨0.1, 0.2, 0.4, 0.2, 0.1⟩, as advertised by the manufacturer. The likelihood of the data is calculated under the assumption that the observations are i.i.d. (independently and identically distributed), so that

    P(d | h_i) = \prod_j P(d_j | h_i) .                                              (20.3)

For example, suppose the bag is really an all-lime bag (h_5) and the first 10 candies are all lime; then P(d | h_3) is 0.5^10, because half the candies in an h_3 bag are lime.[2] Figure 20.1(a) shows how the posterior probabilities of the five hypotheses change as the sequence of 10 lime candies is observed. Notice that the probabilities start out at their prior values, so h_3 is initially the most likely choice and remains so after 1 lime candy is unwrapped. After 2 lime candies have been unwrapped, h_4 is most likely; after 3 or more, h_5 (the all-lime bag) is the most likely, as the sketch following the footnotes reproduces numerically.

[1] Statistically sophisticated readers will recognize this scenario as a variant of the urn-and-ball setup. We find urns and balls less compelling than candy; furthermore, candy lends itself to other tasks, such as deciding whether to trade the bag with a friend (see Exercise 20.3).
[2] We stated earlier that the bags of candy are very large; otherwise, the i.i.d. assumption fails to hold. Technically, it is more correct (but less hygienic) to rewrap each candy after inspection and return it to the bag.
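To see the behaviour described for Figure 20.1(a) numerically, the following sketch (again an illustration rather than the book's code; the variable names and the Python setting are assumptions made here) applies Equations (20.1) and (20.3) after each of the 10 lime observations and checks the 0.5^10 value quoted in the text.

```python
# Posterior over h_1..h_5 after each of 10 lime candies drawn from an
# all-lime (h_5) bag, i.e. the numbers plotted in Figure 20.1(a).

prior = {"h1": 0.1, "h2": 0.2, "h3": 0.4, "h4": 0.2, "h5": 0.1}
lime_prob = {"h1": 0.0, "h2": 0.25, "h3": 0.5, "h4": 0.75, "h5": 1.0}  # P(lime | h_i)

def iid_likelihood(num_limes, h):
    """P(d | h_i) = prod_j P(d_j | h_i)   (Eq. 20.3), for d = num_limes lime candies."""
    return lime_prob[h] ** num_limes

for n in range(11):                                   # after 0, 1, ..., 10 limes
    unnorm = {h: iid_likelihood(n, h) * prior[h] for h in prior}
    alpha = 1.0 / sum(unnorm.values())
    post = {h: round(alpha * p, 4) for h, p in unnorm.items()}
    print(n, post)

# Sanity check of the value quoted in the text: P(d | h_3) for 10 lime candies.
assert iid_likelihood(10, "h3") == 0.5 ** 10
```

At step 0 the printout is just the prior; h_3 leads after one lime candy, h_4 overtakes it briefly at two, and from three lime candies onward h_5 dominates, with its posterior climbing toward 1 as the sequence continues.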