them to zero, we get three independent equations, each containing just one parameter:

\[
\frac{\partial L}{\partial \theta} = \frac{c}{\theta} - \frac{\ell}{1-\theta} = 0
\quad\Rightarrow\quad \theta = \frac{c}{c+\ell}
\]
\[
\frac{\partial L}{\partial \theta_1} = \frac{r_c}{\theta_1} - \frac{g_c}{1-\theta_1} = 0
\quad\Rightarrow\quad \theta_1 = \frac{r_c}{r_c+g_c}
\]
\[
\frac{\partial L}{\partial \theta_2} = \frac{r_\ell}{\theta_2} - \frac{g_\ell}{1-\theta_2} = 0
\quad\Rightarrow\quad \theta_2 = \frac{r_\ell}{r_\ell+g_\ell}
\]

The solution for θ is the same as before. The solution for θ1, the probability that a cherry candy has a red wrapper, is the observed fraction of cherry candies with red wrappers, and similarly for θ2.

These results are very comforting, and it is easy to see that they can be extended to any Bayesian network whose conditional probabilities are represented as tables. The most important point is that, with complete data, the maximum-likelihood parameter learning problem for a Bayesian network decomposes into separate learning problems, one for each parameter.[3] The second point is that the parameter values for a variable, given its parents, are just the observed frequencies of the variable values for each setting of the parent values. As before, we must be careful to avoid zeroes when the data set is small.

[3] See Exercise 20.7 for the nontabulated case, where each parameter affects several conditional probabilities.
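To make the frequency counting concrete, the following short Python sketch estimates θ, θ1, and θ2 for the wrapped-candy example directly from counts. It is an illustration, not code from the book; the function name and the data format (a list of (flavor, wrapper) pairs) are our own assumptions.

```python
from collections import Counter

def mle_candy_parameters(samples):
    """Maximum-likelihood parameters for the cherry/lime candy network.

    samples: iterable of (flavor, wrapper) pairs, e.g. ("cherry", "red").
    Returns (theta, theta1, theta2), where
      theta  = P(Flavor = cherry)
      theta1 = P(Wrapper = red | Flavor = cherry)
      theta2 = P(Wrapper = red | Flavor = lime)
    """
    counts = Counter(samples)
    r_c = counts[("cherry", "red")]    # red-wrapped cherries
    g_c = counts[("cherry", "green")]  # green-wrapped cherries
    r_l = counts[("lime", "red")]      # red-wrapped limes
    g_l = counts[("lime", "green")]    # green-wrapped limes
    c, l = r_c + g_c, r_l + g_l        # cherry and lime totals

    # Each parameter is just an observed frequency, as derived above.
    theta  = c / (c + l)
    theta1 = r_c / (r_c + g_c)
    theta2 = r_l / (r_l + g_l)
    return theta, theta1, theta2

# Toy data (made up for illustration): 8 cherries, 4 limes.
data = ([("cherry", "red")] * 6 + [("cherry", "green")] * 2 +
        [("lime", "red")] * 1 + [("lime", "green")] * 3)
print(mle_candy_parameters(data))  # (0.666..., 0.75, 0.25)
```

Note that this divides by raw counts, so it inherits the zero-count problem mentioned above; with small data sets one would add small pseudocounts to each count.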
Naive Bayes models

Probably the most common Bayesian network model used in machine learning is the naive Bayes model. In this model, the "class" variable C (which is to be predicted) is the root and the "attribute" variables Xi are the leaves. The model is "naive" because it assumes that the attributes are conditionally independent of each other, given the class. (The model in Figure 20.2(b) is a naive Bayes model with just one attribute.) Assuming Boolean variables, the parameters are

\[
\theta = P(C = true), \qquad
\theta_{i1} = P(X_i = true \mid C = true), \qquad
\theta_{i2} = P(X_i = true \mid C = false).
\]

The maximum-likelihood parameter values are found in exactly the same way as for Figure 20.2(b). Once the model has been trained in this way, it can be used to classify new examples for which the class variable C is unobserved. With observed attribute values x1, ..., xn, the probability of each class is given by

\[
P(C \mid x_1, \ldots, x_n) = \alpha\, P(C) \prod_i P(x_i \mid C).
\]

A deterministic prediction can be obtained by choosing the most likely class. Figure 20.3 shows the learning curve for this method when it is applied to the restaurant problem from Chapter 18. The method learns fairly well but not as well as decision-tree learning; this is presumably because the true hypothesis, which is a decision tree, is not representable exactly using a naive Bayes model. Naive Bayes learning turns out to do surprisingly well in a wide range of applications; the boosted version (Exercise 20.5) is one of the most effective general-purpose learning algorithms. Naive Bayes learning scales well to very large problems: with n Boolean attributes, there are just 2n + 1 parameters, and no search is required to find hML, the maximum-likelihood naive Bayes hypothesis. Finally, naive Bayes learning has no difficulty with noisy data and can give probabilistic predictions when appropriate.
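The classification rule above is straightforward to implement. The sketch below is our own illustration (the function names, the Boolean data format, and the optional pseudocount used to avoid zero frequencies are assumptions, not part of the text): it learns the parameters by counting and then normalizes P(C) ∏i P(xi|C) to obtain a class probability.

```python
def train_naive_bayes(examples, n_attrs, pseudocount=1.0):
    """Estimate P(C) and P(X_i = true | C) from complete Boolean data.

    examples: list of (attributes, label) pairs; attributes is a tuple of
    n_attrs booleans, label is True or False.  A small pseudocount keeps
    the estimates away from zero (set it to 0.0 for pure maximum likelihood).
    """
    class_counts = {True: 0.0, False: 0.0}
    true_counts = {True: [0.0] * n_attrs, False: [0.0] * n_attrs}
    for attrs, label in examples:
        class_counts[label] += 1
        for i, value in enumerate(attrs):
            if value:
                true_counts[label][i] += 1

    total = class_counts[True] + class_counts[False]
    prior = {c: (class_counts[c] + pseudocount) / (total + 2 * pseudocount)
             for c in (True, False)}
    cond = {c: [(true_counts[c][i] + pseudocount) /
                (class_counts[c] + 2 * pseudocount) for i in range(n_attrs)]
            for c in (True, False)}
    return prior, cond


def predict_class_probability(prior, cond, attrs):
    """Return P(C = true | attrs) by normalizing P(C) * prod_i P(x_i | C)."""
    scores = {}
    for c in (True, False):
        p = prior[c]
        for i, value in enumerate(attrs):
            p *= cond[c][i] if value else 1.0 - cond[c][i]
        scores[c] = p
    alpha = 1.0 / (scores[True] + scores[False])  # normalization constant
    return alpha * scores[True]


# Toy usage (data made up for illustration):
examples = [((True, True), True), ((True, False), True),
            ((False, True), False), ((False, False), False)]
prior, cond = train_naive_bayes(examples, n_attrs=2)
print(predict_class_probability(prior, cond, (True, True)))  # 0.75
```

Conceptually this stores one prior plus 2n conditional probabilities, matching the 2n + 1 parameter count noted above; with many attributes one would sum log-probabilities instead of multiplying, to avoid numerical underflow.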