...to capture the complete vocabulary of a subset (cluster) of documents, while factors can focus on certain aspects of the vocabulary of a subset of documents. For example, a factor can be very well used to explain some fraction of the words occurring in a document, although it might not explain other words at all (e.g., even assign zero probability), because these other words can be taken care of by other factors.

3.6 Model Fitting Revisited: Improving Generalization by Tempered EM

So far we have focused on maximum likelihood estimation to fit a model from a given document collection. Although the likelihood or, equivalently, the perplexity² is the quantity we believe to be crucial in assessing the quality of a model, one clearly has to distinguish between the performance on the training data and on unseen test data. To derive conditions under which generalization on unseen data can be guaranteed is actually the fundamental problem of statistical learning theory. Here, we propose a generalization of maximum likelihood for mixture models which is known as annealing and is based on an entropic regularization term. The resulting method is called Tempered Expectation Maximization (TEM) and is closely related to deterministic annealing [11].

²Perplexity refers to the log-averaged inverse probability on unseen data.

The starting point of TEM is a derivation of the E-step based on an optimization principle. As has been pointed out in [9], the EM procedure in latent variable models can be obtained by minimizing a common objective function, the (Helmholtz) free energy, which for the aspect model is given by

    F_\beta = -\beta \sum_{d,w} n(d,w) \sum_z \tilde{P}(z;d,w) \log P(d,w|z) P(z)
              + \sum_{d,w} n(d,w) \sum_z \tilde{P}(z;d,w) \log \tilde{P}(z;d,w) .    (8)

Here P̃(z; d, w) are variational parameters which define a conditional distribution over Z, and β is a parameter which, in analogy to physical systems, is called the inverse computational temperature. Notice that the first contribution in (8) is the negative expected log-likelihood scaled by β. Thus in the case of P̃(z; d, w) = P(z|d, w), minimizing F with respect to the parameters defining P(d, w|z)P(z) amounts to the standard M-step in EM. In fact, it is straightforward to verify that the posteriors are obtained by minimizing F with respect to P̃ at β = 1. In general, P̃ is determined by

    \tilde{P}(z;d,w) = \frac{[P(z)\,P(d|z)\,P(w|z)]^{\beta}}{\sum_{z'} [P(z')\,P(d|z')\,P(w|z')]^{\beta}} .    (9)

This shows that the effect of the entropy at β < 1 is to dampen the posterior probabilities such that they will become closer to the uniform distribution with decreasing β.
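As a concrete illustration of equation (9), the following is a minimal NumPy sketch of the tempered E-step. The array names (p_z, p_d_given_z, p_w_given_z) and the dense (N, M, K) posterior layout are assumptions made for the example, not part of the paper; a practical implementation would only evaluate the observed (d, w) pairs.

```python
import numpy as np

def tempered_e_step(p_z, p_d_given_z, p_w_given_z, beta):
    """Tempered posteriors P~(z; d, w) of equation (9).

    p_z         : shape (K,)    mixing weights P(z)
    p_d_given_z : shape (N, K)  document probabilities P(d|z)
    p_w_given_z : shape (M, K)  word probabilities P(w|z)
    beta        : inverse computational temperature; beta = 1 recovers the
                  standard EM posteriors, beta < 1 dampens them towards uniform.

    Returns an array of shape (N, M, K) with P~(z; d, w) for every
    document-word pair.
    """
    # Joint contribution P(z) P(d|z) P(w|z) for every (d, w, z) triple,
    # raised to the power beta before normalizing over z.
    joint = p_z[None, None, :] * p_d_given_z[:, None, :] * p_w_given_z[None, :, :]
    tempered = joint ** beta
    return tempered / tempered.sum(axis=-1, keepdims=True)
```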
Somewhat contrary to the spirit of annealing as a continuation method, we propose an 'inverse' annealing strategy which first performs EM iterations and then decreases β until the performance on held-out data deteriorates. Compared to annealing, this may accelerate the model fitting procedure significantly (e.g., by a factor of ≈ 10-50), and we have not found the test set performance of 'heated' models to be worse than the one achieved by carefully 'annealed' models. The TEM algorithm can thus be implemented in the following way:

1. Set β ← 1 and perform EM with early stopping.
2. Decrease β ← ηβ (with η < 1) and perform one TEM iteration.
3. As long as the performance on held-out data improves (non-negligibly), continue TEM iterations at this value of β; otherwise goto step 2.
4. Perform stopping on β, i.e., stop when decreasing β does not yield further improvements.
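A minimal Python sketch of how steps 1-4 could be organized is given below. The model interface (em_step performing one tempered E-step as in equation (9) followed by an M-step, log_likelihood, and a data object exposing num_tokens), the decay factor eta, and the tolerance tol are assumptions made for illustration; the paper does not prescribe them. Held-out perplexity is computed as defined in footnote 2.

```python
import math

def fit_tem(model, train, held_out, eta=0.9, tol=1e-3):
    """Inverse-annealing TEM fit, following steps 1-4 above (sketch only)."""

    def perplexity(data):
        # Footnote 2: log-averaged inverse probability on unseen data,
        # i.e. exp(-(total log-probability) / (number of tokens)).
        return math.exp(-model.log_likelihood(data) / data.num_tokens)

    # Step 1: beta = 1, i.e. ordinary EM, with early stopping on held-out data.
    beta = 1.0
    best = float("inf")
    while True:
        model.em_step(train, beta)
        current = perplexity(held_out)
        if current > best - tol:           # held-out performance stopped improving
            break
        best = current

    # Steps 2-4: lower beta and keep iterating as long as it still helps.
    while True:
        beta *= eta                        # step 2: beta <- eta * beta, eta < 1
        improved_at_this_beta = False
        while True:
            model.em_step(train, beta)     # one TEM iteration at this beta
            current = perplexity(held_out)
            if current > best - tol:       # step 3: no (non-negligible) improvement
                break
            best = current
            improved_at_this_beta = True
        if not improved_at_this_beta:      # step 4: decreasing beta no longer helps
            return model, beta
```

Since the tempered E-step at β = 1 coincides with the standard E-step, the first loop is just ordinary EM with early stopping, as required by step 1.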
[Figure 5: Perplexity results as a function of the latent space dimensionality for (a) the MED data (rank 1033) and (b) the LOB data (rank 1674). Plotted results are for LSA (dashed-dotted curve) and PLSA (trained by TEM = solid curve, trained by early stopping EM = dotted curve). The upper baseline is the unigram model corresponding to marginal independence. The star at the right end of the PLSA curves denotes the perplexity of the largest trained aspect models (K = 2048).]

4 Experimental Results

In the experimental evaluation, we focus on two tasks: (i) perplexity minimization for a document-specific unigram model and noun-adjective pairs, and (ii) automated indexing of documents. The evaluation of LSA