724 Chapter 20. Statistical Learning Methods

on the threshold used for this test—the stricter the independence test, the more links will be added and the greater the danger of overfitting. An approach more consistent with the ideas in this chapter is to assess the degree to which the proposed model explains the data (in a probabilistic sense). We must be careful how we measure this, however. If we just try to find the maximum-likelihood hypothesis, we will end up with a fully connected network, because adding more parents to a node cannot decrease the likelihood (Exercise 20.9). We are forced to penalize model complexity in some way. The MAP (or MDL) approach simply subtracts a penalty from the likelihood of each structure (after parameter tuning) before comparing different structures. The Bayesian approach places a joint prior over structures and parameters. There are usually far too many structures to sum over (superexponential in the number of variables), so most practitioners use MCMC to sample over structures.
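To make the MAP/MDL idea concrete, the following sketch scores a candidate structure by its maximum-likelihood log-likelihood minus a BIC-style penalty of half log n per free parameter. It is a minimal illustration under stated assumptions, not the book's algorithm: the toy data, variable names, and candidate structures are made up, but the same general shape underlies common score-based structure learners.

```python
# A minimal BIC/MDL-style structure score, using only the standard library.
# The data, variable names, and candidate structures below are hypothetical;
# the point is the shape of the score: maximum-likelihood log-likelihood
# minus a penalty that grows with the number of free parameters.
from collections import Counter
from math import log

def bic_score(data, parents, arity):
    """data:    list of dicts mapping variable name -> value
    parents: dict mapping each variable to a tuple of its parents
    arity:   dict mapping each variable to its number of possible values"""
    n = len(data)
    loglik = 0.0
    num_params = 0
    for var, pa in parents.items():
        # Maximum-likelihood CPT entries are relative frequencies, so each
        # family's log-likelihood can be computed from counts alone.
        joint = Counter((tuple(d[p] for p in pa), d[var]) for d in data)
        cfg_counts = Counter(tuple(d[p] for p in pa) for d in data)
        for (cfg, _val), c in joint.items():
            loglik += c * log(c / cfg_counts[cfg])
        # A tabular CPT has (arity - 1) free parameters per parent configuration.
        cfgs = 1
        for p in pa:
            cfgs *= arity[p]
        num_params += (arity[var] - 1) * cfgs
    return loglik - 0.5 * num_params * log(n)   # the MDL/BIC penalty term

# Toy comparison: an edge A -> B is worth its extra parameters only if the
# dependence in the data is strong enough to beat the penalty.
data = [{"A": a, "B": a if i % 4 else 1 - a} for i in range(40) for a in (0, 1)]
arity = {"A": 2, "B": 2}
no_edge   = {"A": (), "B": ()}
with_edge = {"A": (), "B": ("A",)}
print(bic_score(data, no_edge, arity), bic_score(data, with_edge, arity))
```

On this toy data the structure with the edge wins, because the gain in likelihood outweighs the penalty for the extra parameters; with weaker dependence or less data, the penalty would favour the edgeless structure.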
Penalizing complexity (whether by MAP or Bayesian methods) introduces an important connection between the optimal structure and the nature of the representation for the conditional distributions in the network. With tabular distributions, the complexity penalty for a node's distribution grows exponentially with the number of parents, but with, say, noisy-OR distributions, it grows only linearly. This means that learning with noisy-OR (or other compactly parameterized) models tends to produce learned structures with more parents than does learning with tabular distributions.

20.3 LEARNING WITH HIDDEN VARIABLES: THE EM ALGORITHM

The preceding section dealt with the fully observable case. Many real-world problems have hidden variables (sometimes called latent variables), which are not observable in the data that are available for learning. For example, medical records often include the observed symptoms, the treatment applied, and perhaps the outcome of the treatment, but they seldom contain a direct observation of the disease itself!⁶ One might ask, "If the disease is not observed, why not construct a model without it?" The answer appears in Figure 20.7, which shows a small, fictitious diagnostic model for heart disease. There are three observable predisposing factors and three observable symptoms (which are too depressing to name). Assume that each variable has three possible values (e.g., none, moderate, and severe). Removing the hidden variable from the network in (a) yields the network in (b); the total number of parameters increases from 78 to 708 (a short counting sketch below reproduces these figures, along with the tabular-versus-noisy-OR comparison above). Thus, latent variables can dramatically reduce the number of parameters required to specify a Bayesian network. This, in turn, can dramatically reduce the amount of data needed to learn the parameters.

Hidden variables are important, but they do complicate the learning problem. In Figure 20.7(a), for example, it is not obvious how to learn the conditional distribution for HeartDisease, given its parents, because we do not know the value of HeartDisease in each case; the same problem arises in learning the distributions for the symptoms. This section describes the expectation-maximization (EM) algorithm, which solves this problem in a very general way.

⁶ Some records contain the diagnosis suggested by the physician, but this is a causal consequence of the symptoms, which are in turn caused by the disease.
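The parameter counts quoted above can be reproduced by counting conditional probability table (CPT) entries directly. The sketch below does this; the node names and, in particular, the exact structure assumed for network (b) (each symptom depending on all three factors and on the symptoms that precede it) are reconstructions from the text's description rather than the figure itself, chosen because they yield the 78 and 708 quoted above. The final loop illustrates the tabular-versus-noisy-OR comparison for a Boolean node.

```python
# A minimal parameter-counting sketch. The structures below are reconstructions
# from the text's description of Figure 20.7: three 3-valued predisposing
# factors, a 3-valued hidden HeartDisease node, and three 3-valued symptoms.
# Node names and the exact shape of network (b) are assumptions.

def tabular_params(arity, parent_arities):
    """Free parameters in a full CPT: (arity - 1) per parent configuration."""
    cfgs = 1
    for a in parent_arities:
        cfgs *= a
    return (arity - 1) * cfgs

def count_network(families, arity=3):
    """Total CPT parameters for a network in which every node has `arity` values.
    families: dict mapping node name -> list of parent names."""
    return sum(tabular_params(arity, [arity] * len(parents))
               for parents in families.values())

# (a) With the hidden variable: symptoms depend only on HeartDisease.
with_hidden = {
    "Factor1": [], "Factor2": [], "Factor3": [],
    "HeartDisease": ["Factor1", "Factor2", "Factor3"],
    "Symptom1": ["HeartDisease"],
    "Symptom2": ["HeartDisease"],
    "Symptom3": ["HeartDisease"],
}

# (b) Hidden variable removed: each symptom depends on all three factors and
# (in the structure assumed here) on the symptoms that precede it.
without_hidden = {
    "Factor1": [], "Factor2": [], "Factor3": [],
    "Symptom1": ["Factor1", "Factor2", "Factor3"],
    "Symptom2": ["Factor1", "Factor2", "Factor3", "Symptom1"],
    "Symptom3": ["Factor1", "Factor2", "Factor3", "Symptom1", "Symptom2"],
}

print(count_network(with_hidden))     # 78
print(count_network(without_hidden))  # 708

# Tabular versus noisy-OR for a Boolean node with k Boolean parents:
# a full CPT needs 2^k free entries, a leaky noisy-OR only k + 1.
for k in (2, 5, 10):
    print(k, "parents:", "tabular =", tabular_params(2, [2] * k),
          " noisy-OR =", k + 1)
```

The last few lines show the structural bias mentioned above: with noisy-OR, adding a parent costs only one extra parameter, so the complexity penalty discourages additional parents far less than it does for tabular distributions.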