[Figure 2: Sketch of the probability sub-simplex spanned by the aspect model. The figure labels the embedding simplex, the spanned sub-simplex, the factors P(w|z1), P(w|z2), P(w|z3), and the KL-divergence projection of P(w|d) onto the sub-simplex.]

3.3 Probabilistic Latent Semantic Space

Consider the class-conditional multinomial distributions P(·|z) over the vocabulary, which we call factors. They can be represented as points on the M−1 dimensional simplex of all possible multinomials. Via its convex hull, this set of K points defines a L ≤ K−1 dimensional sub-simplex. The modeling assumption expressed by (1) is that the conditional distributions P(w|d) for all documents are approximated by a multinomial representable as a convex combination of the factors P(w|z), where the mixing weights P(z|d) uniquely define a point on the spanned sub-simplex. A simple sketch of this situation is shown in Figure 2. Despite the discreteness of the introduced latent variables, a continuous latent space is obtained within the space of all multinomial distributions. Since the dimensionality of the sub-simplex is at most K−1, as opposed to a maximum of M−1 for the complete probability simplex, this performs a dimensionality reduction in the space of multinomial distributions, and the spanned sub-simplex can be identified with a probabilistic latent semantic space.

To stress this point and to clarify the relation to LSA, let us rewrite the aspect model as parameterized by (2) in matrix notation. Hence define the matrices Û = (P(d_i|z_k))_{i,k}, V̂ = (P(w_j|z_k))_{j,k}, and Σ̂ = diag(P(z_k))_k. The joint probability model P can then be written as the matrix product P = Û Σ̂ V̂ᵗ. Comparing this with SVD, one can make the following observations: (i) outer products between rows of Û and V̂ reflect conditional independence in PLSA, (ii) the K factors correspond to the mixture components of the aspect model, and (iii) the mixing proportions in PLSA take the place of the singular values. The crucial difference between PLSA and LSA, however, is the objective function used to determine the optimal decomposition/approximation. In LSA, this is the L2- or Frobenius norm, which corresponds to an implicit additive Gaussian noise assumption on (possibly transformed) counts. In contrast, PLSA relies on the likelihood function of multinomial sampling and aims at an explicit maximization of the predictive power of the model. As is well known, this corresponds to a minimization of the cross entropy or Kullback-Leibler divergence between the empirical distribution and the model, which is very different from any type of squared deviation.
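To make the contrast between the two objective functions concrete, the following is a minimal sketch (hypothetical NumPy code, not taken from the paper; the variable names U_hat, Sigma_hat, V_hat and the toy dimensions are assumptions made for illustration). It builds the joint probability table P = Û Σ̂ V̂ᵗ from randomly drawn aspect-model parameters and evaluates both the Frobenius error minimized by LSA and the KL divergence to an empirical co-occurrence distribution, whose minimization is equivalent to maximizing the multinomial likelihood.

# Hypothetical sketch: joint probability P = U_hat @ Sigma_hat @ V_hat.T
# and the two objectives discussed above (Frobenius norm vs. KL divergence).
import numpy as np

rng = np.random.default_rng(0)
N_docs, M_words, K = 6, 8, 3                        # toy sizes, chosen arbitrarily

# Toy aspect-model parameters; column k of each matrix is a distribution given z_k.
U_hat = rng.dirichlet(np.ones(N_docs), size=K).T    # P(d_i | z_k), shape (N, K)
V_hat = rng.dirichlet(np.ones(M_words), size=K).T   # P(w_j | z_k), shape (M, K)
Sigma_hat = np.diag(rng.dirichlet(np.ones(K)))      # diag(P(z_k)),  shape (K, K)

# Joint probability model P = U_hat Sigma_hat V_hat^t; its entries sum to 1.
P = U_hat @ Sigma_hat @ V_hat.T
assert np.isclose(P.sum(), 1.0)

# Empirical distribution of a toy co-occurrence table n(d, w).
counts = rng.integers(0, 5, size=(N_docs, M_words)).astype(float)
P_emp = counts / counts.sum()

# LSA-style objective: squared (Frobenius) error between the two tables.
frobenius_error = np.linalg.norm(P_emp - P, ord="fro") ** 2

# PLSA-style objective: KL(P_emp || P); maximizing the multinomial
# likelihood is equivalent to minimizing this quantity.
mask = P_emp > 0
kl_divergence = np.sum(P_emp[mask] * np.log(P_emp[mask] / P[mask]))

print(f"Frobenius error: {frobenius_error:.4f}")
print(f"KL divergence:   {kl_divergence:.4f}")

Both quantities are computed on the same pair of tables, but they reward very different approximations: the Frobenius error penalizes large absolute deviations regardless of sign or normalization, while the KL divergence is only defined for properly normalized, non-negative models and weights errors by the empirical probability mass.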
On the modeling side this offers important advantages: for example, the mixture approximation P of the co-occurrence table is a well-defined probability distribution, and the factors have a clear probabilistic meaning. In contrast, LSA does not define a properly normalized probability distribution, and its low-rank approximation Ñ of the count matrix may even contain negative entries. In addition, there is no obvious interpretation of the directions in the LSA latent space, while the directions in the PLSA space are interpretable as multinomial word distributions. The probabilistic approach can also take advantage of the well-established statistical theory for model selection and complexity control, e.g., to determine the optimal number of latent space dimensions. Choosing the number of dimensions in LSA, on the other hand, is typically based on ad hoc heuristics.

A comparison of the computational complexity might suggest some advantages for LSA: ignoring potential problems of numerical stability, the SVD can be computed exactly, while the EM algorithm is an iterative method that is only guaranteed to find a local maximum of the likelihood function. However, in all our experiments the computing time of EM has not been significantly worse than that of performing an SVD on the co-occurrence matrix. There is also a large potential for improving the run-time performance of EM by on-line update schemes, which has not been explored so far.

3.4 Topic Decomposition and Polysemy

Let us briefly discuss some elucidating examples at this point, which will also reveal a further advantage of PLSA over LSA in the context of polysemous words. We have generated a dataset (CLUSTER) with the abstracts of 1568 documents on clustering and trained an aspect model with 128 latent classes. Four pairs of factors are visualized in Figure 3. These pairs have been selected as the two factors that have the highest probability of generating the words "segment", "matrix", "line", and "power", respectively. The sketchy characterization of the factors by their 10 most probable words already reveals interesting topics. In particular, notice that the term used to select a particular pair has a different meaning in each of the two topic factors: (i) 'Segment' refers to an image region in the first factor and to a phonetic segment in the second. (ii) 'Matrix' denotes a rectangular table of numbers in one factor and a material in which something is embedded or enclosed in the other. (iii) 'Line' can refer to a line in an image, but also to a line in a spectrum.
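The factor characterization just described is straightforward to reproduce once the word distributions P(w|z) of a trained model are available. Below is a minimal, hypothetical sketch (not code from the paper; the function name, the toy vocabulary, and the random stand-in for the trained CLUSTER model are assumptions) that, for a given term, selects the two factors with the highest probability of generating it and lists each factor's most probable words, as done for the pairs in Figure 3.

# Hypothetical sketch: characterize factors by their top words and pick the
# two factors most likely to generate a query term.
import numpy as np

def factor_pair_for_term(term, vocabulary, p_w_given_z, top_n=10):
    """Return the two factors most likely to generate `term`,
    each characterized by its `top_n` most probable words.

    vocabulary : list of M words
    p_w_given_z: array of shape (M, K); column k holds P(w | z_k)
    """
    j = vocabulary.index(term)                        # row index of the term
    # Two factors z_k with the highest probability of generating the term.
    top_factors = np.argsort(p_w_given_z[j, :])[::-1][:2]
    pairs = []
    for k in top_factors:
        top_words_idx = np.argsort(p_w_given_z[:, k])[::-1][:top_n]
        pairs.append((int(k), [vocabulary[i] for i in top_words_idx]))
    return pairs

# Usage, with a toy random model standing in for the trained 128-class model:
rng = np.random.default_rng(0)
vocab = ["segment", "image", "speech", "matrix", "line", "power",
         "region", "phone", "spectrum", "signal"]
P_w_given_z = rng.dirichlet(np.ones(len(vocab)), size=128).T   # shape (M, K)
for k, words in factor_pair_for_term("segment", vocab, P_w_given_z, top_n=5):
    print(f"factor z_{k}: {', '.join(words)}")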