Figure 2: Sketch of the probability sub-simplex spanned by the aspect model.

3.3 Probabilistic Latent Semantic Space

Consider the class-conditional multinomial distributions P(w|z) over the vocabulary, which we call factors. They can be represented as points on the M-1 dimensional simplex of all possible multinomials. Via its convex hull, this set of K points defines a sub-simplex of dimension L <= K-1. The modeling assumption expressed by (1) is that the conditional distributions P(w|d) for all documents are approximated by a multinomial representable as a convex combination of factors P(w|z), where the mixing weights P(z|d) uniquely define a point on the spanned sub-simplex. A simple sketch of this situation is shown in Figure 2. Despite the discreteness of the introduced latent variables, a continuous latent space is obtained within the space of all multinomial distributions. Since the dimensionality of the sub-simplex is K-1, as opposed to a maximum of M-1 for the complete probability simplex, this performs a dimensionality reduction in the space of multinomial distributions, and the spanned sub-simplex can be identified with a probabilistic latent semantic space.

To stress this point and to clarify the relation to LSA, let us rewrite the aspect model as parameterized by (2) in matrix notation. Hence define matrices by $\hat{U} = (P(d_i|z_k))_{i,k}$, $\hat{V} = (P(w_j|z_k))_{j,k}$, and $\hat{\Sigma} = \mathrm{diag}(P(z_k))_k$. The joint probability model $P$ can then be written as a matrix product $P = \hat{U}\hat{\Sigma}\hat{V}^t$. Comparing this with SVD, one can make the following observations: (i) outer products between rows of $\hat{U}$ and $\hat{V}$ reflect conditional independence in PLSA, (ii) the K factors correspond to the mixture components in the aspect model, and (iii) the mixing proportions in PLSA substitute for the singular values. The crucial difference between PLSA and LSA, however, is the objective function utilized to determine the optimal decomposition/approximation. In LSA, this is the $L_2$ or Frobenius norm, which corresponds to an implicit additive Gaussian noise assumption on (possibly transformed) counts. In contrast, PLSA relies on the likelihood function of multinomial sampling and aims at an explicit maximization of the predictive power of the model.
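To make this matrix form concrete, the decomposition can be assembled from small random factors and its claimed properties verified directly. The following numpy sketch (with illustrative sizes and variable names of our own choosing, not from the paper) checks that $P$ is a well-defined joint distribution and that each conditional P(w|d) is a convex combination of the factors:

```python
import numpy as np

rng = np.random.default_rng(0)
D, W, K = 6, 9, 3  # illustrative sizes for documents, words, factors

# Columns of U_hat and V_hat hold the multinomials P(d|z_k) and P(w|z_k);
# Sigma_hat carries the mixing proportions P(z_k) on its diagonal.
U_hat = rng.random((D, K)); U_hat /= U_hat.sum(axis=0)
V_hat = rng.random((W, K)); V_hat /= V_hat.sum(axis=0)
Pz = rng.dirichlet(np.ones(K))                  # P(z), sums to one
Sigma_hat = np.diag(Pz)

# Joint probability model as a matrix product, P = U_hat Sigma_hat V_hat^t.
P = U_hat @ Sigma_hat @ V_hat.T
assert np.isclose(P.sum(), 1.0)   # a well-defined joint distribution ...
assert (P >= 0).all()             # ... with no negative entries, unlike LSA

# Each conditional P(w|d) lies on the sub-simplex spanned by the factors:
# P(w|d) = sum_k P(z_k|d) P(w|z_k), with mixing weights P(z|d).
Pz_given_d = (U_hat * Pz) / P.sum(axis=1, keepdims=True)
assert np.allclose(P / P.sum(axis=1, keepdims=True), Pz_given_d @ V_hat.T)
```

Note that non-negativity of $P$ falls out of the construction, in contrast to the LSA approximation discussed next.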
As is well known, maximizing the multinomial likelihood corresponds to minimizing the cross entropy or Kullback-Leibler divergence between the empirical distribution and the model, which is very different from any type of squared deviation. On the modeling side this offers important advantages: for example, the mixture approximation $P$ of the co-occurrence table is a well-defined probability distribution, and the factors have a clear probabilistic meaning. In contrast, LSA does not define a properly normalized probability distribution, and $\tilde{N}$ may even contain negative entries. In addition, there is no obvious interpretation of the directions in the LSA latent space, while the directions in the PLSA space are interpretable as multinomial word distributions. The probabilistic approach can also take advantage of the well-established statistical theory for model selection and complexity control, e.g., to determine the optimal number of latent space dimensions. Choosing the number of dimensions in LSA, on the other hand, is typically based on ad hoc heuristics.

A comparison of the computational complexity might suggest some advantages for LSA: ignoring potential problems of numerical stability, the SVD can be computed exactly, while the EM algorithm is an iterative method which is only guaranteed to find a local maximum of the likelihood function. However, in all our experiments the computing time of EM has not been significantly worse than performing an SVD on the co-occurrence matrix. There is also a large potential for improving the run-time performance of EM by on-line update schemes, which has not been explored so far.

3.4 Topic Decomposition and Polysemy

Let us briefly discuss some elucidating examples at this point, which will also reveal a further advantage of PLSA over LSA in the context of polysemous words. We have generated a dataset (CLUSTER) with abstracts of 1568 documents on clustering and trained an aspect model with 128 latent classes. Four pairs of factors are visualized in Figure 3. These pairs have been selected as the two factors that have the highest probability of generating the words "segment", "matrix", "line", and "power", respectively. The sketchy characterization of the factors by their 10 most probable words already reveals interesting topics. In particular, notice that the term used to select a particular pair has a different meaning in either topic factor: (i) 'Segment' refers to an image region in the first factor and to a phonetic segment in the second. (ii) 'Matrix' refers to a rectangular table of numbers and to a material in which something is embedded or enclosed. (iii) 'Line' can refer to a line in an image, but also to a line in a spectrum.
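The aspect models used in these examples are fitted with the EM algorithm referred to in the complexity comparison above. As a rough reference point for that discussion, a minimal dense implementation of the standard EM updates for the parameterization $P(d,w) = \sum_z P(z)\,P(d|z)\,P(w|z)$ might look as follows; this is our own toy sketch, and a practical implementation would exploit the sparsity of the count matrix:

```python
import numpy as np

def aspect_model_em(N, K, n_iter=50, seed=0):
    """Fit P(d,w) = sum_z P(z) P(d|z) P(w|z) by EM.

    N : (D, W) co-occurrence count matrix n(d, w).
    Returns P(z), P(d|z), P(w|z) with shapes (K,), (D, K), (W, K).
    """
    rng = np.random.default_rng(seed)
    D, W = N.shape
    Pz = np.full(K, 1.0 / K)
    Pd_z = rng.random((D, K)); Pd_z /= Pd_z.sum(axis=0)  # columns sum to 1
    Pw_z = rng.random((W, K)); Pw_z /= Pw_z.sum(axis=0)

    for _ in range(n_iter):
        # E-step: posterior P(z|d,w) over latent classes, shape (D, W, K).
        joint = Pz * Pd_z[:, None, :] * Pw_z[None, :, :]
        post = joint / joint.sum(axis=-1, keepdims=True)
        # M-step: re-estimate from expected counts n(d,w) P(z|d,w).
        ec = N[:, :, None] * post
        Pd_z = ec.sum(axis=1); Pd_z /= Pd_z.sum(axis=0)
        Pw_z = ec.sum(axis=0); Pw_z /= Pw_z.sum(axis=0)
        Pz = ec.sum(axis=(0, 1)); Pz /= Pz.sum()
    return Pz, Pd_z, Pw_z
```

Each iteration of this dense form costs O(DWK); exploiting the sparsity of n(d,w) reduces the cost to scale with the number of nonzero counts, consistent with the running times reported above.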
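Given fitted factors, a selection and display procedure along the lines used for Figure 3 is easy to sketch. We read "highest probability of generating a word" as the largest P(w|z); the helper names below are our own, hypothetical ones:

```python
import numpy as np

def top_words(Pw_z, vocab, n=10):
    """Return the n most probable words under each factor P(w|z)."""
    order = np.argsort(-Pw_z, axis=0)[:n]      # (n, K) word indices
    return [[vocab[i] for i in order[:, k]] for k in range(Pw_z.shape[1])]

def factor_pair(word, Pw_z, vocab):
    """Indices of the two factors most likely to generate `word`."""
    j = vocab.index(word)
    return np.argsort(-Pw_z[j])[:2]

# e.g., display the two topic factors matching a query term:
# for z in factor_pair("segment", Pw_z, vocab):
#     print(z, top_words(Pw_z, vocab)[z])
```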