Probabilistic Latent Semantic Analysis

To appear in: Uncertainty in Artificial Intelligence, UAI'99, Stockholm

Thomas Hofmann
EECS Department, Computer Science Division, University of California, Berkeley &
International Computer Science Institute, Berkeley, CA
hofmann@cs.berkeley.edu

Abstract

Probabilistic Latent Semantic Analysis is a novel statistical technique for the analysis of two-mode and co-occurrence data, which has applications in information retrieval and filtering, natural language processing, machine learning from text, and in related areas. Compared to standard Latent Semantic Analysis, which stems from linear algebra and performs a Singular Value Decomposition of co-occurrence tables, the proposed method is based on a mixture decomposition derived from a latent class model. This results in a more principled approach which has a solid foundation in statistics. In order to avoid overfitting, we propose a widely applicable generalization of maximum likelihood model fitting by tempered EM. Our approach yields substantial and consistent improvements over Latent Semantic Analysis in a number of experiments.

1 Introduction

Learning from text and natural language is one of the great challenges of Artificial Intelligence and Machine Learning. Any substantial progress in this domain has strong impact on many applications ranging from information retrieval, information filtering, and intelligent interfaces, to speech recognition, natural language processing, and machine translation. One of the fundamental problems is to learn the meaning and usage of words in a data-driven fashion, i.e., from some given text corpus, possibly without further linguistic prior knowledge.

The main challenge a machine learning system has to address roots in the distinction between the lexical level of "what actually has been said or written" and the semantic level of "what was intended" or "what was referred to" in a text or an utterance. The resulting problems are twofold: (i) polysemy, i.e., a word may have multiple senses and multiple types of usage in different contexts, and (ii) synonymy and semantically related words, i.e., different words may have a similar meaning; they may, at least in certain contexts, denote the same concept or, in a weaker sense, refer to the same topic.

Latent semantic analysis (LSA) [3] is a well-known technique which partially addresses these questions. The key idea is to map high-dimensional count vectors, such as the ones arising in vector space representations of text documents [12], to a lower dimensional representation in a so-called latent semantic space. As the name suggests, the goal of LSA is to find a data mapping which provides information well beyond the lexical level and reveals semantic relations between the entities of interest. Due to its generality, LSA has proven to be a valuable analysis tool with a wide range of applications (e.g. [3, 5, 8, 1]). Yet its theoretical foundation remains to a large extent unsatisfactory and incomplete.

This paper presents a statistical view on LSA which leads to a new model called Probabilistic Latent Semantic Analysis (PLSA). In contrast to standard LSA, its probabilistic variant has a sound statistical foundation and defines a proper generative model of the data. A detailed discussion of the numerous advantages of PLSA can be found in subsequent sections.

2 Latent Semantic Analysis

2.1 Count Data and Co-occurrence Tables

LSA can in principle be applied to any type of count data over a discrete dyadic domain (cf. [7]).
However, since the most prominent application of LSA is in the analysis and retrieval of text documents, we focus on this setting for the sake of concreteness. Suppose therefore we are given a collection of text documents
ments=dI, .,dN) with terms from a vocab- (a) ulary W=wl,.wM). By ignoring the sequen tial order in which words occur in a document. one d may summarize the data in a nxM co-occurrence table of counts N=(n(di, wi)ii, where n(d,w)E N denotes how often the term w occurred in document d. In this particular case, N is also called the term- Figure 1: Graphical mo del representation of the as- document matrix and the rows/columns of n are re pect model in the asy mmetric(a)and symmetric(b. ferred to as document/term vectors, respectively. The parameterization vector-space representation [12] of documents will in many cases preserve most of the relevant informati efined by the mixture e g, for tasks like text retrieval based on keywor ds P(d,w)=P(d)P(wld), P(uld)=>P(wl:)(=1d ).(1) 2.2 Latent Semantic Analy sis by SVD Like virtually all statistical latent variable models the aspect model introduces a conditional independence As mentioned in the introduction the key idea of LSA assumption, namely that d and w are independent con is to map documents(and by symmetry terms)to a ditioned on the state of the asso ci ated latent variable vector space of reduced dimensionality, the latent se-(the corresponding graphical model represent ation is mantic space 3. The mapping is restricted to be lin depicted in Figure 1 Since the cardinality of a is ear and is based on a Singular Value Decomposition smaller than the number of documents/words in the (SVD) of the co-occurrence table. One thus starts ttleneck variable in predict with the standard Svd given by N= u2v, where ing words. It is worth noticing that the model can be U and V are orthogonal matrices U(-VV=I equivalently parameterized by(cf Figure 1(b)) and the diagonal matrix 2 contains the singul ar val ues of N. The LSA approximation of N is computed Pd,)=∑P()P(dl)Pml) (2) y setting all but the largest K singul ar values in 2 to zero(=3), which is rank K optimal in the sense which is perfectly symmetric in both entities, doct of the L2-matrix norm. One obtains the approxima- ments and words tion n=U∑vt≈UΣVt=N. Notice that the document-to-do cument inner products based on this 3.2 Mo del Fitting with the EM Algorithm approximation are given by nN=U2U and hence one might think of the rows of U2 as defining coor- The standard procedure for maximum likelihood es inates for do cuments in the latent space. While the timation in latent variable models is the Expectation original high-dimensional vectors are sp arse, the corre Maximiz ation(EM) algorithm 4. EM al ternate two sponding low-dimensional latent vectors will ty pically coupled steps: (i)an expect ation(E)step where poste- not be sp arse. This implies that it is possible to com- rior probabilities are computed for the latent variables, pute meaningful associ ation values between pairs of (ii) an maximization(M)step, where parameters are documents, even if the documents do not have any updated. Standard cal culations(cf [7, 13]) yield the terms in common. The hope is that terms having a E-step equation mon meaning, in particular synony ms, are rough )P(2 ped to the s ame direction in the latent sp ac ∑2szP(x)P(dl2)P(l) as well as the following M-step formulae 3 Probabilistic lsa Pl)∑m(d,)P(l,m),(4) 3.1 The Aspect Mo del P(d|)∝ The starting point for Probabilistic Latent Semantic Analys isis a statistical mo del which has been called P(x)∝ n(d,)P(zd,u).(6) aspect model [7. 
D = {d_1, ..., d_N} with terms from a vocabulary W = {w_1, ..., w_M}. By ignoring the sequential order in which words occur in a document, one may summarize the data in an N × M co-occurrence table of counts N = (n(d_i, w_j))_{ij}, where n(d, w) ∈ ℕ denotes how often the term w occurred in document d. In this particular case, N is also called the term-document matrix and the rows/columns of N are referred to as document/term vectors, respectively. The key assumption is that the simplified 'bag-of-words' or vector-space representation [12] of documents will in many cases preserve most of the relevant information, e.g., for tasks like text retrieval based on keywords.

2.2 Latent Semantic Analysis by SVD

As mentioned in the introduction, the key idea of LSA is to map documents (and by symmetry terms) to a vector space of reduced dimensionality, the latent semantic space [3]. The mapping is restricted to be linear and is based on a Singular Value Decomposition (SVD) of the co-occurrence table. One thus starts with the standard SVD given by N = U \Sigma V^t, where U and V are orthogonal matrices with U^t U = V^t V = I and the diagonal matrix \Sigma contains the singular values of N. The LSA approximation of N is computed by setting all but the largest K singular values in \Sigma to zero (yielding \tilde\Sigma), which is rank-K optimal in the sense of the L2 matrix norm. One obtains the approximation

    \tilde N = U \tilde\Sigma V^t \approx U \Sigma V^t = N.

Notice that the document-to-document inner products based on this approximation are given by \tilde N \tilde N^t = U \tilde\Sigma^2 U^t, and hence one might think of the rows of U \tilde\Sigma as defining coordinates for documents in the latent space. While the original high-dimensional vectors are sparse, the corresponding low-dimensional latent vectors will typically not be sparse. This implies that it is possible to compute meaningful association values between pairs of documents, even if the documents do not have any terms in common. The hope is that terms having a common meaning, in particular synonyms, are roughly mapped to the same direction in the latent space.

3 Probabilistic LSA

3.1 The Aspect Model

The starting point for Probabilistic Latent Semantic Analysis is a statistical model which has been called the aspect model [7]. The aspect model is a latent variable model for co-occurrence data which associates an unobserved class variable z ∈ Z = {z_1, ..., z_K} with each observation.

[Figure 1: Graphical model representation of the aspect model in the asymmetric (a) and symmetric (b) parameterization.]

A joint probability model over D × W is defined by the mixture

    P(d, w) = P(d) P(w|d),   P(w|d) = \sum_{z \in Z} P(w|z) P(z|d).    (1)

Like virtually all statistical latent variable models, the aspect model introduces a conditional independence assumption, namely that d and w are independent conditioned on the state of the associated latent variable (the corresponding graphical model representation is depicted in Figure 1 (a)). Since the cardinality of z is smaller than the number of documents/words in the collection, z acts as a bottleneck variable in predicting words. It is worth noticing that the model can be equivalently parameterized by (cf. Figure 1 (b))

    P(d, w) = \sum_{z \in Z} P(z) P(d|z) P(w|z),    (2)

which is perfectly symmetric in both entities, documents and words.
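The contrast between the two decompositions can be made concrete in a few lines of code. The following NumPy sketch builds a toy count table, computes the rank-K SVD approximation of Section 2.2, and evaluates the aspect-model mixture of equation (2); the count matrix, the choice K = 2, and all aspect-model parameter values are made-up illustrations, not quantities from the paper.

```python
# Toy contrast of LSA's truncated SVD and the aspect-model mixture (eq. 2).
import numpy as np

# Toy N x M term-document count table (N = 4 documents, M = 5 terms).
counts = np.array([
    [2, 1, 0, 0, 1],
    [1, 2, 1, 0, 0],
    [0, 0, 2, 3, 1],
    [0, 1, 1, 2, 2],
], dtype=float)

# LSA (Section 2.2): rank-K approximation via truncated SVD.
K = 2
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
N_tilde = U[:, :K] @ np.diag(s[:K]) @ Vt[:K, :]

# Aspect model (equation 2): P(d,w) = sum_z P(z) P(d|z) P(w|z).
P_z = np.array([0.6, 0.4])                          # P(z) for K = 2 aspects
P_d_z = np.array([[0.4, 0.4, 0.1, 0.1],             # P(d|z), one row per aspect
                  [0.1, 0.1, 0.4, 0.4]])
P_w_z = np.array([[0.35, 0.30, 0.20, 0.05, 0.10],   # P(w|z), one row per aspect
                  [0.05, 0.10, 0.20, 0.40, 0.25]])
P_dw = np.einsum('k,kd,kw->dw', P_z, P_d_z, P_w_z)

print(N_tilde.round(2))   # low rank, but may contain negative entries
print(P_dw.sum())         # a proper joint distribution: sums to 1.0
```

The printed contrast anticipates the discussion in Section 3.3 below: the SVD approximation is not a normalized distribution and may contain negative entries, whereas the mixture of equation (2) always defines one.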
3.2 Model Fitting with the EM Algorithm

The standard procedure for maximum likelihood estimation in latent variable models is the Expectation Maximization (EM) algorithm [4]. EM alternates two coupled steps: (i) an expectation (E) step, where posterior probabilities are computed for the latent variables, and (ii) a maximization (M) step, where parameters are updated. Standard calculations (cf. [7, 13]) yield the E-step equation

    P(z|d, w) = \frac{P(z) P(d|z) P(w|z)}{\sum_{z' \in Z} P(z') P(d|z') P(w|z')},    (3)

as well as the following M-step formulae

    P(w|z) \propto \sum_{d \in D} n(d, w) P(z|d, w),    (4)

    P(d|z) \propto \sum_{w \in W} n(d, w) P(z|d, w),    (5)

    P(z) \propto \sum_{d \in D} \sum_{w \in W} n(d, w) P(z|d, w).    (6)

Before discussing further algorithmic refinements, we will study the relationship between the proposed model and LSA in more detail.
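As a concrete illustration, the sketch below implements one possible version of this EM iteration for the aspect model, following equations (3)-(6) directly; the random toy count table, the choice K = 3, and the fixed iteration count are illustrative assumptions, not values from the paper.

```python
# Minimal EM for the aspect model, following equations (3)-(6).
import numpy as np

def em_aspect_model(counts, K, n_iter=100, seed=0, eps=1e-12):
    """Fit P(z), P(d|z), P(w|z) to an N x M count table by maximum likelihood."""
    rng = np.random.default_rng(seed)
    N, M = counts.shape
    # Random normalized initialization of the three parameter sets.
    P_z = np.full(K, 1.0 / K)
    P_d_z = rng.random((K, N)); P_d_z /= P_d_z.sum(axis=1, keepdims=True)
    P_w_z = rng.random((K, M)); P_w_z /= P_w_z.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step (3): P(z|d,w) proportional to P(z) P(d|z) P(w|z).
        joint = P_z[:, None, None] * P_d_z[:, :, None] * P_w_z[:, None, :]  # K x N x M
        post = joint / (joint.sum(axis=0, keepdims=True) + eps)

        # M-steps (4)-(6): reweight the posteriors by the observed counts.
        weighted = post * counts[None, :, :]                  # n(d,w) P(z|d,w)
        P_w_z = weighted.sum(axis=1)
        P_w_z /= P_w_z.sum(axis=1, keepdims=True) + eps       # normalize over w, eq. (4)
        P_d_z = weighted.sum(axis=2)
        P_d_z /= P_d_z.sum(axis=1, keepdims=True) + eps       # normalize over d, eq. (5)
        P_z = weighted.sum(axis=(1, 2))
        P_z /= P_z.sum() + eps                                # normalize over z, eq. (6)
    return P_z, P_d_z, P_w_z

# Toy usage on a small random count table.
rng = np.random.default_rng(1)
counts = rng.integers(0, 5, size=(8, 12)).astype(float)
P_z, P_d_z, P_w_z = em_aspect_model(counts, K=3)
print(P_z)
```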
[Figure 2: Sketch of the probability sub-simplex spanned by the aspect model.]

3.3 Probabilistic Latent Semantic Space

Consider the class-conditional multinomial distributions P(·|z) over the vocabulary, which we call factors. They can be represented as points on the (M-1)-dimensional simplex of all possible multinomials. Via its convex hull, this set of K points defines an L ≤ K-1 dimensional sub-simplex. The modeling assumption expressed by (1) is that the conditional distributions P(w|d) for all documents are approximated by a multinomial representable as a convex combination of factors P(w|z), where the mixing weights P(z|d) uniquely define a point on the spanned sub-simplex. A simple sketch of this situation is shown in Figure 2. Despite the discreteness of the introduced latent variables, a continuous latent space is obtained within the space of all multinomial distributions. Since the dimensionality of the sub-simplex is K-1, as opposed to a maximum of M-1 for the complete probability simplex, this performs a dimensionality reduction in the space of multinomial distributions, and the spanned sub-simplex can be identified with a probabilistic latent semantic space.

To stress this point and to clarify the relation to LSA, let us rewrite the aspect model as parameterized by (2) in matrix notation. Hence define the matrices \hat U = (P(d_i|z_k))_{i,k}, \hat V = (P(w_j|z_k))_{j,k}, and \hat\Sigma = diag(P(z_k))_k. The joint probability model P can then be written as a matrix product P = \hat U \hat\Sigma \hat V^t. Comparing this with SVD, one can make the following observations: (i) outer products between rows of \hat U and \hat V reflect conditional independence in PLSA, (ii) the K factors correspond to the mixture components in the aspect model, and (iii) the mixing proportions in PLSA substitute the singular values. The crucial difference between PLSA and LSA, however, is the objective function utilized to determine the optimal decomposition/approximation. In LSA, this is the L2- or Frobenius norm, which corresponds to an implicit additive Gaussian noise assumption on (possibly transformed) counts. In contrast, PLSA relies on the likelihood function of multinomial sampling and aims at an explicit maximization of the predictive power of the model. As is well known, this corresponds to a minimization of the cross entropy or Kullback-Leibler divergence between the empirical distribution and the model, which is very different from any type of squared deviation. On the modeling side this offers important advantages; for example, the mixture approximation P of the co-occurrence table is a well-defined probability distribution and the factors have a clear probabilistic meaning. In contrast, LSA does not define a properly normalized probability distribution and \tilde N may even contain negative entries. In addition, there is no obvious interpretation of the directions in the LSA latent space, while the directions in the PLSA space are interpretable as multinomial word distributions. The probabilistic approach can also take advantage of the well-established statistical theory for model selection and complexity control, e.g., to determine the optimal number of latent space dimensions. Choosing the number of dimensions in LSA, on the other hand, is typically based on ad hoc heuristics.
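The two objective functions contrasted above can be written down in a few lines. In the sketch below, `counts` is any N x M count table and `P_dw` a model joint distribution of matching shape (for instance the arrays from the sketches above); both inputs, and the function names, are assumptions made for illustration.

```python
# The LSA criterion (Frobenius norm) versus the PLSA criterion (multinomial
# log-likelihood, equivalently KL divergence to the empirical distribution).
import numpy as np

def frobenius_error(counts, approx):
    """LSA criterion: squared L2 (Frobenius) distance to the approximation."""
    return float(np.sum((counts - approx) ** 2))

def log_likelihood(counts, P_dw, eps=1e-12):
    """PLSA criterion: multinomial log-likelihood sum_{d,w} n(d,w) log P(d,w)."""
    return float(np.sum(counts * np.log(P_dw + eps)))

def kl_to_model(counts, P_dw, eps=1e-12):
    """Equivalent view: KL divergence from the empirical joint to the model,
    which differs from the (rescaled, negated) log-likelihood by a constant."""
    emp = counts / counts.sum()
    mask = emp > 0
    return float(np.sum(emp[mask] * np.log(emp[mask] / (P_dw[mask] + eps))))
```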
A comparison of the computational complexity might suggest some advantages for LSA: ignoring potential problems of numerical stability, the SVD can be computed exactly, while the EM algorithm is an iterative method which is only guaranteed to find a local maximum of the likelihood function. However, in all our experiments the computing time of EM has not been significantly worse than performing an SVD on the co-occurrence matrix. There is also a large potential for improving the run-time performance of EM by on-line update schemes, which has not been explored so far.

3.4 Topic Decomposition and Polysemy

Let us briefly discuss some elucidating examples at this point, which will also reveal a further advantage of PLSA over LSA in the context of polysemous words. We have generated a dataset (CLUSTER) with abstracts of 1568 documents on clustering and trained an aspect model with 128 latent classes. Four pairs of factors are visualized in Figure 3. These pairs have been selected as the two factors that have the highest probability to generate the words "segment", "matrix", "line", and "power", respectively. The sketchy characterization of the factors by their 10 most probable words already reveals interesting topics. In particular, notice that the term used to select a particular pair has a different meaning in either topic factor: (i) 'Segment' refers to an image region in the first and to a phonetic segment in the second factor. (ii) 'Matrix' denotes a rectangular table of numbers in one factor and a material in which something is embedded or enclosed in the other. (iii) 'Line' can refer to a line in an image, but also to a line in a spectrum.
\segment 1" \segment 2" \matrix 1" \matrix 2" \line 1" \line 2" \power 1" power 2" imag speaker robust manufactur constraint alpha POWER load SEGMENT speech MATRIX cell LINE redshift spectrum memori texture recogni eigenvalu part match LINE omega vlsi color signal uncertainti MATRIX locat galaxi mpc POWER tissue train plane cellular imag quasar hsup systolic brain hmm linear famili geometr absorp larg input slice source condition design impos high redshift complex cluster speakerind. perturb machinepart segment ssup galaxi arrai mri SEGMENT root format fundament densiti standard present volume sound suci group recogn veloc model implement Figure 3: Eight selected factors from a 128 factor decomposition. The displayed word stems are the 10 most probable words in the class-conditional distribution P (wjz), from top to bottom in descending order. Document 1, P fzkjd1; wj = `segment`g = (0:951; 0:0001;:::) P fwj = `segment`jd1g = 0:06 SEGMENT medic imag challeng problem eld imag analysi diagnost base proper SEGMENT digit imag SEGMENT medic imag need applic involv estim boundari ob ject classif tissu abnorm shape analysi contour detec textur SEGMENT despit exist techniqu SEGMENT specif medic imag remain crucial problem [...] Document 2, P fzkjd2; wj = `segment`g = (0:025; 0:867;:::) P fwj = `segment`jd2g = 0:010 consid signal origin sequenc sourc specif problem SEGMENT signal relat SEGMENT sourc address issu wide applic eld report describ resolu method ergod hidden markov model hmm hmm state correspond signal sourc signal sourc sequenc determin decod procedur viterbi algorithm forward algorithm observ sequenc baumwelch train estim hmm paramet train materi applic multipl signal sourc identif problem experi perform unknown speaker identif [...] Figure 4: Abstracts of 2 exemplary documents from the CLUSTER collection along with latent class posterior probabilities P fzjd; w = `segment'g and word probabilities P fw = `segment'jdg. (iv) 'Power' is used in the context of radiating objects in astronomy, but also in electrical engineering. Figure 4 shows the abstracts of two exemplary docu- ments which have been pre-processed by a standard stop-word list and a stemmer. The posterior probabilities for the classes given the dierent occurrences of `segment' indicate how likely it is for each of the factors in the rst pair of Figure 3 to have generated this observation. We have also displayed the estimates of the conditional word probabilities P fw = `segment'jd1;2g. One can see that the correct meaning of the word `seg- ment' is identied in both cases. This implies that although `segment' occurs frequently in both document, the overlap in the factored representation is low, since `segement' is identied as a polysemous word (relative to the chosen resolution level) which { dependent on the context { is explained by dierent factors. 3.5 Aspects versus Clusters It is worth comparing the aspect model with statistical clustering models (cf. also [7]). In clustering models for documents, one typically associates a latent class variable with each document in the collection. Most closely related to our approach is the distributional clustering model [10, 7] which can be thought of as an unsupervised version of a naive Bayes' classier. It can be shown that the conditional word probability of a probabilistic clustering model is given by P (wjd) = X z2Z P fc(d) = zgP (wjz) ; (7) where P fc(d) = zg is the posterior probability of document d having latent class z. 
It is a simple implication of Bayes' rule that these posterior probabilities will concentrate their probability mass on a certain value z with an increasing number of observations (i.e., with the length of the document). This means that although (1) and (7) are algebraically equivalent, they are conceptually very different and in fact yield different results. The aspect model assumes that document-specific distributions are a convex combination of aspects, while the clustering model assumes there is just one cluster-specific distribution which is inherited by all documents in the cluster.[1] Thus in clustering models the class-conditionals P(w|z) have
to capture the complete vocabulary of a subset (cluster) of documents, while factors can focus on certain aspects of the vocabulary of a subset of documents. For example, a factor can be very well used to explain some fraction of the words occurring in a document, although it might not explain other words at all (e.g., even assign zero probability), because these other words can be taken care of by other factors.

[1] In the distributional clustering model it is only the posterior uncertainty of the cluster assignments that induces some averaging over the class-conditional word distributions P(w|z).

3.6 Model Fitting Revisited: Improving Generalization by Tempered EM

So far we have focused on maximum likelihood estimation to fit a model from a given document collection. Although the likelihood or, equivalently, the perplexity[2] is the quantity we believe to be crucial in assessing the quality of a model, one clearly has to distinguish between the performance on the training data and on unseen test data. To derive conditions under which generalization on unseen data can be guaranteed is actually the fundamental problem of statistical learning theory. Here, we propose a generalization of maximum likelihood for mixture models which is known as annealing and is based on an entropic regularization term. The resulting method is called Tempered Expectation Maximization (TEM) and is closely related to deterministic annealing [11].

[2] Perplexity refers to the log-averaged inverse probability on unseen data.

The starting point of TEM is a derivation of the E-step based on an optimization principle. As has been pointed out in [9], the EM procedure in latent variable models can be obtained by minimizing a common objective function, the (Helmholtz) free energy, which for the aspect model is given by

    F = -\beta \sum_{d,w} n(d, w) \sum_{z} \tilde P(z; d, w) \log P(d, w|z) P(z)
        + \sum_{d,w} n(d, w) \sum_{z} \tilde P(z; d, w) \log \tilde P(z; d, w).    (8)

Here \tilde P(z; d, w) are variational parameters which define a conditional distribution over Z, and \beta is a parameter which, in analogy to physical systems, is called the inverse computational temperature. Notice that the first contribution in (8) is the negative expected log-likelihood scaled by \beta. Thus in the case of \tilde P(z; d, w) = P(z|d, w), minimizing F with respect to the parameters defining P(d, w|z) P(z) amounts to the standard M-step in EM. In fact, it is straightforward to verify that the posteriors are obtained by minimizing F with respect to \tilde P at \beta = 1. In general, \tilde P is determined by

    \tilde P(z; d, w) = \frac{[P(z) P(d|z) P(w|z)]^\beta}{\sum_{z'} [P(z') P(d|z') P(w|z')]^\beta}.    (9)

This shows that the effect of the entropy at \beta < 1 is to dampen the posterior probabilities such that they will become closer to the uniform distribution with decreasing \beta.
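A minimal numeric sketch of equation (9) makes this damping effect visible; the unnormalized values below stand in for P(z)P(d|z)P(w|z) at a single (d, w) pair and are purely illustrative.

```python
# Tempered posterior of equation (9) for one (d,w) cell.
import numpy as np

def tempered_posterior(joint_z, beta):
    """P~(z;d,w) proportional to [P(z)P(d|z)P(w|z)]^beta, normalized over z."""
    powered = joint_z ** beta
    return powered / powered.sum()

joint_z = np.array([0.020, 0.005, 0.001])   # P(z)P(d|z)P(w|z) for K = 3 aspects
for beta in (1.0, 0.8, 0.5):
    print(beta, tempered_posterior(joint_z, beta).round(3))
# As beta decreases, the posterior moves toward the uniform distribution,
# which is exactly the damping effect described in the text.
```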
Somewhat contrary to the spirit of annealing as a continuation method, we propose an 'inverse' annealing strategy which first performs EM iterations and then decreases \beta until the performance on held-out data deteriorates. Compared to annealing, this may accelerate the model fitting procedure significantly (e.g., by a factor of roughly 10-50), and we have not found the test set performance of 'heated' models to be worse than the one achieved by carefully 'annealed' models. The TEM algorithm can thus be implemented in the following way:

1. Set \beta = 1 and perform EM with early stopping.
2. Decrease \beta by a factor \eta < 1 (\beta <- \eta \beta) and perform one TEM iteration.
3. As long as the performance on held-out data improves (non-negligibly), continue TEM iterations at this value of \beta; otherwise go to step 2.
4. Perform stopping on \beta, i.e., stop when decreasing \beta does not yield further improvements.
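The schedule in steps 1-4 can be sketched as a small driver loop. Here `fit_step(beta)` (one EM/TEM sweep at inverse temperature beta, e.g. assembled from the E-/M-step sketches above) and `heldout_perplexity()` are assumed callables supplied by the caller; the stopping rule is one simple reading of step 4, and the dummy usage at the end exists only to make the sketch executable.

```python
# Control-flow sketch of the 'inverse annealing' TEM schedule (steps 1-4).
def tempered_em(fit_step, heldout_perplexity, eta=0.9, min_gain=1e-3, max_sweeps=5000):
    beta = 1.0                          # step 1: start with ordinary EM (beta = 1)
    best = float("inf")
    lowered_and_untested = False
    for _ in range(max_sweeps):
        fit_step(beta)                  # one (T)EM sweep at the current beta
        perp = heldout_perplexity()
        if best - perp > min_gain:      # step 3: held-out perplexity still improves,
            best = perp                 #         so keep iterating at this beta
            lowered_and_untested = False
        elif not lowered_and_untested:  # step 2: progress stalled, lower beta
            beta *= eta
            lowered_and_untested = True
        else:                           # step 4: lowering beta brought no improvement
            break
    return beta

# Dummy usage: a fake "model" whose held-out perplexity shrinks by 1% per sweep.
state = {"perp": 1000.0}
final_beta = tempered_em(lambda beta: state.update(perp=state["perp"] * 0.99),
                         lambda: state["perp"])
print(final_beta)
```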
4 Experimental Results

In the experimental evaluation, we focus on two tasks: (i) perplexity minimization for a document-specific unigram model and for noun-adjective pairs, and (ii) automated indexing of documents. The evaluation of LSA and PLSA on the first task will demonstrate the advantages of explicitly minimizing perplexity by TEM; the second task will show that the solid statistical foundation of PLSA pays off even in applications which are not directly related to perplexity reduction.

Table 1: Average precision results and relative improvement w.r.t. the baseline method cos+tf for the 4 standard test collections. Compared are LSI, PLSI, as well as results obtained by combining PLSI models (PLSI*). An asterisk for LSI indicates that no performance gain could be achieved over the baseline; the result at 256 dimensions with \lambda = 2/3 is reported in this case.

            MED            CRAN           CACM           CISI
         prec.  impr.   prec.  impr.   prec.  impr.   prec.  impr.
  cos+tf  44.3    -      29.9    -      17.9    -      12.7    -
  LSI     51.7  +16.7    28.7   -4.0    16.0  -11.6    12.8   +0.8
  PLSI    63.9  +44.2    35.1  +17.4    22.9  +27.9    18.8  +48.0
  PLSI*   66.3  +49.7    37.5  +25.4    26.8  +49.7    20.1  +58.3

4.1 Perplexity Evaluation

In order to compare the predictive performance of PLSA and LSA, one has to specify how to extract probabilities from an LSA decomposition. This problem is not trivial, since negative entries prohibit a simple re-normalization of the approximating matrix \tilde N. We have followed the approach of [2] to derive LSA probabilities. Two data sets have been used to evaluate the perplexity performance: (i) a standard information retrieval test collection, MED, with 1033 documents, and (ii) a dataset with noun-adjective pairs generated from a tagged version of the LOB corpus. In the first case, the goal was to predict word occurrences based on (parts of) the words in a document. In the second case, nouns have to be predicted conditioned on an associated adjective.

[Figure 5: Perplexity results as a function of the latent space dimensionality for (a) the MED data (rank 1033) and (b) the LOB data (rank 1674). Plotted results are for LSA (dash-dotted curve) and PLSA (trained by TEM = solid curve, trained by early stopping EM = dotted curve). The upper baseline is the unigram model corresponding to marginal independence. The star at the right end of the PLSA curve denotes the perplexity of the largest trained aspect models (K = 2048).]

Figure 5 reports perplexity results for LSA and PLSA on the MED (a) and LOB (b) datasets as a function of the number of dimensions of the (probabilistic) latent semantic space. PLSA outperforms the statistical model derived from standard LSA by far. On the MED collection PLSA reduces perplexity relative to the unigram baseline by more than a factor of three (3073/936 ≈ 3.3), while LSA achieves less than a factor of two in reduction (3073/1647 ≈ 1.9). On the less sparse LOB data the PLSA reduction in perplexity is 1316/547 ≈ 2.41, while the reduction achieved by LSA is only 1316/632 ≈ 2.08. In order to demonstrate the advantages of TEM, we have also trained aspect models on the MED data by standard EM with early stopping. As can be seen from the curves in Figure 5 (a), the difference between EM and TEM model fitting is significant. Although both strategies, tempering and early stopping, are successful in controlling the model complexity, EM training performs worse, since it makes a very inefficient use of the available degrees of freedom. Notice that with both methods it is possible to train high-dimensional models with a continuous improvement in performance. The number of latent space dimensions may even exceed the rank of the co-occurrence matrix N, and the choice of the number of dimensions becomes merely an issue of possible limitations of computational resources.

4.2 Information Retrieval

One of the key problems in information retrieval is automatic indexing, which has its main application in query-based retrieval. The most popular family of information retrieval techniques is based on the Vector Space Model (VSM) for documents [12]. Here, we have utilized a rather straightforward representation based on the (untransformed) term frequencies n(d, w) together with the standard cosine matching function; a more detailed experimental analysis can be found in [6].
The same representation applies to queries q, so that the matching function for the baseline term-matching method can be written as

    s(d, q) = \frac{\sum_w n(d, w) n(q, w)}{\sqrt{\sum_w n(d, w)^2} \sqrt{\sum_w n(q, w)^2}}.    (10)

In Latent Semantic Indexing (LSI), the original vector space representation of documents is replaced by a representation in the low-dimensional latent space and the similarity is computed based on that representation. Queries or documents which were not part of the original collection can be folded in by a simple matrix multiplication (cf. [3] for details). In our experiments, we have actually considered linear combinations of the original similarity score (10) (weight \lambda) and the one derived from the latent space representation (weight 1 - \lambda).
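A minimal sketch of the baseline matching function (10) and of the weighted combination with a latent-space similarity is given below; the toy vectors and the default lambda value are illustrative assumptions.

```python
# Cosine matching function (10) and the lambda-weighted score combination.
import numpy as np

def cosine_score(doc_counts, query_counts, eps=1e-12):
    """s(d,q) = sum_w n(d,w) n(q,w) / (||n(d,.)|| ||n(q,.)||)."""
    num = float(np.dot(doc_counts, query_counts))
    den = float(np.linalg.norm(doc_counts) * np.linalg.norm(query_counts)) + eps
    return num / den

def combined_score(term_score, latent_score, lam=0.5):
    """Linear combination of the raw term-matching score (weight lambda) and
    the score computed in the (probabilistic) latent space (weight 1-lambda)."""
    return lam * term_score + (1.0 - lam) * latent_score

d = np.array([2.0, 0.0, 1.0, 3.0])   # toy term-frequency vectors over 4 terms
q = np.array([1.0, 0.0, 0.0, 2.0])
print(cosine_score(d, q))
```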
The same ideas have been applied in Probabilistic Latent Semantic Indexing (PLSI) in conjunction with the PLSA model. More precisely, the low-dimensional representations in the factor space, P(z|d) and P(z|q), have been utilized to evaluate similarities. To achieve this, queries have to be folded in, which is done in PLSA by fixing the P(w|z) parameters and calculating weights P(z|q) by TEM.

One advantage of using statistical models over SVD techniques is that they allow us to systematically combine different models. While this should optimally be done according to a Bayesian model combination scheme, we have utilized a much simpler approach in our experiments which has nevertheless shown excellent performance and robustness. Namely, we have simply combined the cosine scores of all models with a uniform weight. The resulting method is referred to as PLSI*. Empirically, we have found the performance to be very robust w.r.t. different (non-uniform) weights and also w.r.t. the \lambda-weight used in combination with the original cosine score. This is due to the noise-reducing benefits of (model) averaging. Notice that LSA representations for different K form a nested sequence, which is not true for the statistical models, which are expected to capture a larger variety of reasonable decompositions.

We have utilized the following four medium-sized standard document collections with relevance assessments: (i) MED (1033 document abstracts from the National Library of Medicine), (ii) CRAN (1400 document abstracts on aeronautics from the Cranfield Institute of Technology), (iii) CACM (3204 abstracts from the CACM Journal), and (iv) CISI (1460 abstracts in library science from the Institute for Scientific Information). The condensed results in terms of average precision-recall (at the 9 recall levels 10%-90%) are summarized in Table 1, while the corresponding precision-recall curves can be found in Figure 6.

[Figure 6: Precision-recall curves for term matching (cos), LSI, and PLSI* on the 4 test collections (MED, CRAN, CACM, CISI).]
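The folding-in step described above for PLSI (keeping the factors fixed and re-estimating only the query's mixing weights) can be sketched as follows; `P_w_z` is assumed to come from a previously fitted aspect model, and the factor matrix and query counts below are made-up illustrations.

```python
# Folding in an unseen query: P(w|z) stays fixed, only P(z|q) is re-estimated.
import numpy as np

def fold_in_query(query_counts, P_w_z, beta=1.0, n_iter=50, eps=1e-12):
    """Estimate P(z|q) for an unseen query, keeping the factors P(w|z) fixed."""
    K = P_w_z.shape[0]
    P_z_q = np.full(K, 1.0 / K)                       # uniform initialization
    for _ in range(n_iter):
        # Tempered E-step restricted to the query: posterior over the K aspects.
        joint = (P_z_q[:, None] * P_w_z) ** beta      # K x M
        post = joint / (joint.sum(axis=0, keepdims=True) + eps)
        # M-step for the query weights only; the factors are not updated.
        P_z_q = (post * query_counts[None, :]).sum(axis=1)
        P_z_q /= P_z_q.sum() + eps
    return P_z_q

# Toy usage with K = 2 factors over a 5-word vocabulary.
P_w_z = np.array([[0.40, 0.30, 0.20, 0.05, 0.05],
                  [0.05, 0.05, 0.20, 0.30, 0.40]])
query = np.array([3.0, 1.0, 0.0, 0.0, 0.0])
print(fold_in_query(query, P_w_z).round(3))   # weight concentrates on the first factor
```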
Here are some additional details of the experimental setup: PLSA models at K = 32, 48, 64, 80, 128 have been trained by TEM for each data set with 10% held-out data. For PLSI we report the best result obtained by any of these models; for LSI we report the best result obtained for the optimal dimension (exploring 32-512 dimensions at a step size of 8). The combination weight \lambda with the cosine baseline score has been coarsely optimized by hand (MED, CRAN: \lambda = 1/2; CACM, CISI: \lambda = 2/3).

The experiments consistently validate the advantages of PLSI over LSI. Substantial performance gains have been achieved for all 4 data sets. Notice that the relative precision gain compared to the baseline method is typically around 100% in the most interesting intermediate regime of recall. In particular, PLSI works well even in cases where LSI fails completely (these problems of LSI are in accordance with the original results reported in [3]). The benefits of model combination are also very substantial. In all cases the (uniformly) combined model performed better than the best single model. As a side effect, model averaging also freed us from selecting the correct model dimensionality.

These experiments demonstrate that the advantages of PLSA over standard LSA are not restricted to applications with performance criteria directly depending on the perplexity. Statistical objective functions like the perplexity (log-likelihood) may thus provide a general yardstick for analysis methods in text learning and information retrieval. To stress this point we ran an experiment on the MED data, where both perplexity and average precision have been monitored simultaneously as a function of \beta. The resulting curves, which show a striking correlation, are plotted in Figure 7.

[Figure 7: Perplexity and average precision as a function of the inverse temperature \beta for an aspect model with K = 48 (left) and K = 128 (right).]

5 Conclusion

We have proposed a novel method for unsupervised learning, called Probabilistic Latent Semantic Analysis, which is based on a statistical latent class model. We have argued that this approach is more principled than standard Latent Semantic Analysis, since it possesses a sound statistical foundation. Tempered Expectation Maximization has been presented as a powerful fitting procedure. We have experimentally verified the claimed advantages, achieving substantial performance gains. Probabilistic Latent Semantic Analysis should thus be considered a promising novel unsupervised learning method with a wide range of applications in text learning and information retrieval.

Acknowledgments

The author would like to thank Jan Puzicha, Andrew McCallum, Mike Jordan, Joachim Buhmann, Tali Tishby, Nelson Morgan, Jerry Feldman, Dan Gildea, Andrew Ng, Sebastian Thrun, and Tom Mitchell for stimulating discussions and helpful hints. This work has been supported by a DAAD fellowship.

References

[1] J. R. Bellegarda. Exploiting both local and global constraints for multi-span statistical language modeling. In Proceedings of ICASSP'98, volume 2, pages 677-680, 1998.

[2] N. Coccaro and D. Jurafsky. Towards better integration of semantic predictors in statistical language modeling. In Proceedings of ICSLP-98, 1998. To appear.

[3] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41, 1990.

[4] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statist. Soc. B, 39:1-38, 1977.

[5] P. W. Foltz and S. T. Dumais. An analysis of information filtering methods. Communications of the ACM, 35(12):51-60, 1992.

[6] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of SIGIR'99, 1999.

[7] T. Hofmann, J. Puzicha, and M. I. Jordan. Unsupervised learning from dyadic data. In Advances in Neural Information Processing Systems, volume 11. MIT Press, 1999.

[8] T. K. Landauer and S. T. Dumais.
A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211-240, 1997.

[9] R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental and other variants. In M. I. Jordan, editor, Learning in Graphical Models, pages 355-368. Kluwer Academic Publishers, 1998.

[10] F. C. N. Pereira, N. Z. Tishby, and L. Lee. Distributional clustering of English words. In Proceedings of the ACL, pages 183-190, 1993.

[11] K. Rose, E. Gurewitz, and G. Fox. A deterministic annealing approach to clustering. Pattern Recognition Letters, 11(11):589-594, 1990.

[12] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.

[13] L. Saul and F. Pereira. Aggregate and mixed-order Markov models for statistical language processing. In Proceedings of the 2nd International Conference on Empirical Methods in Natural Language Processing, 1997.