... documents $D = \{d_1, \dots, d_N\}$ with terms from a vocabulary $W = \{w_1, \dots, w_M\}$. By ignoring the sequential order in which words occur in a document, one may summarize the data in an $N \times M$ co-occurrence table of counts $\mathbf{N} = (n(d_i, w_j))_{ij}$, where $n(d, w) \in \mathbb{N}$ denotes how often the term $w$ occurred in document $d$. In this particular case, $\mathbf{N}$ is also called the term-document matrix and the rows/columns of $\mathbf{N}$ are referred to as document/term vectors, respectively. The key assumption is that the simplified `bag-of-words' or vector-space representation [12] of documents will in many cases preserve most of the relevant information, e.g., for tasks like text retrieval based on keywords.
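To make the bag-of-words construction concrete, the following sketch builds a small term-document count matrix with numpy. The toy documents and the whitespace tokenizer are illustrative assumptions, not part of the original text.

```python
import numpy as np

# Toy corpus (hypothetical): each string is one document d_i.
docs = [
    "latent semantic analysis maps documents to a latent space",
    "probabilistic models describe documents and words",
    "singular value decomposition of the term document matrix",
]

# Vocabulary W = {w_1, ..., w_M} from whitespace tokens.
vocab = sorted({w for d in docs for w in d.split()})
w_index = {w: j for j, w in enumerate(vocab)}

# Co-occurrence table of counts N = (n(d_i, w_j)): N[i, j] counts how often
# term w_j occurs in document d_i; word order is ignored.
N = np.zeros((len(docs), len(vocab)), dtype=int)
for i, d in enumerate(docs):
    for w in d.split():
        N[i, w_index[w]] += 1

print(N.shape)  # (3, M): rows are document vectors, columns are term vectors
```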
2.2 Latent Semantic Analysis by SVD

As mentioned in the introduction, the key idea of LSA is to map documents (and by symmetry terms) to a vector space of reduced dimensionality, the latent semantic space [3]. The mapping is restricted to be linear and is based on a Singular Value Decomposition (SVD) of the co-occurrence table. One thus starts with the standard SVD given by $\mathbf{N} = U \Sigma V^t$, where $U$ and $V$ are orthogonal matrices with $U^t U = V^t V = I$ and the diagonal matrix $\Sigma$ contains the singular values of $\mathbf{N}$. The LSA approximation of $\mathbf{N}$ is computed by setting all but the largest $K$ singular values in $\Sigma$ to zero (yielding $\tilde{\Sigma}$), which is rank-$K$ optimal in the sense of the $L_2$-matrix norm. One obtains the approximation $\tilde{\mathbf{N}} = U \tilde{\Sigma} V^t \approx U \Sigma V^t = \mathbf{N}$. Notice that the document-to-document inner products based on this approximation are given by $\tilde{\mathbf{N}} \tilde{\mathbf{N}}^t = U \tilde{\Sigma}^2 U^t$, and hence one might think of the rows of $U \tilde{\Sigma}$ as defining coordinates for documents in the latent space. While the original high-dimensional vectors are sparse, the corresponding low-dimensional latent vectors will typically not be sparse. This implies that it is possible to compute meaningful association values between pairs of documents, even if the documents do not have any terms in common. The hope is that terms having a common meaning, in particular synonyms, are roughly mapped to the same direction in the latent space.
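The truncated SVD described above is straightforward to sketch in numpy. This is only an illustration of the construction under assumed inputs (a random count matrix standing in for $\mathbf{N}$ and $K = 2$); the helper name lsa is hypothetical.

```python
import numpy as np

def lsa(N, K):
    """Rank-K LSA approximation of a term-document count matrix N."""
    # Standard SVD: N = U diag(s) V^t, singular values s in decreasing order.
    U, s, Vt = np.linalg.svd(N.astype(float), full_matrices=False)

    # Keep only the K largest singular values (Sigma-tilde); zero out the rest.
    s_tilde = np.where(np.arange(len(s)) < K, s, 0.0)

    # Rank-K approximation N-tilde = U Sigma-tilde V^t.
    N_tilde = U @ np.diag(s_tilde) @ Vt

    # Latent coordinates of documents: rows of U Sigma-tilde
    # (only the first K columns are non-zero).
    doc_coords = U @ np.diag(s_tilde)
    return N_tilde, doc_coords

# Example usage with a random count matrix standing in for the term-document matrix.
rng = np.random.default_rng(0)
N = rng.integers(0, 3, size=(6, 12))
N_tilde, doc_coords = lsa(N, K=2)

# Document-to-document inner products agree with the latent coordinates,
# since N-tilde N-tilde^t = U Sigma-tilde^2 U^t.
print(np.allclose(N_tilde @ N_tilde.T, doc_coords @ doc_coords.T))
```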
3 Probabilistic LSA

3.1 The Aspect Model

The starting point for Probabilistic Latent Semantic Analysis is a statistical model which has been called the aspect model [7]. The aspect model is a latent variable model for co-occurrence data which associates an unobserved class variable $z \in Z = \{z_1, \dots, z_K\}$ with each observation. A joint probability model over $D \times W$ is defined by the mixture

$$ P(d, w) = P(d) P(w|d), \qquad P(w|d) = \sum_{z \in Z} P(w|z) P(z|d). \quad (1) $$

[Figure 1: Graphical model representation of the aspect model in the asymmetric (a) and symmetric (b) parameterization.]

Like virtually all statistical latent variable models, the aspect model introduces a conditional independence assumption, namely that $d$ and $w$ are independent conditioned on the state of the associated latent variable (the corresponding graphical model representation is depicted in Figure 1(a)). Since the cardinality of $z$ is smaller than the number of documents/words in the collection, $z$ acts as a bottleneck variable in predicting words. It is worth noticing that the model can be equivalently parameterized by (cf. Figure 1(b))

$$ P(d, w) = \sum_{z \in Z} P(z) P(d|z) P(w|z), \quad (2) $$

which is perfectly symmetric in both entities, documents and words.

3.2 Model Fitting with the EM Algorithm

The standard procedure for maximum likelihood estimation in latent variable models is the Expectation Maximization (EM) algorithm [4]. EM alternates two coupled steps: (i) an expectation (E) step, where posterior probabilities are computed for the latent variables, and (ii) a maximization (M) step, where parameters are updated. Standard calculations (cf. [7, 13]) yield the E-step equation

$$ P(z|d, w) = \frac{P(z) P(d|z) P(w|z)}{\sum_{z' \in Z} P(z') P(d|z') P(w|z')}, \quad (3) $$

as well as the following M-step formulae

$$ P(w|z) \propto \sum_{d \in D} n(d, w) P(z|d, w), \quad (4) $$
$$ P(d|z) \propto \sum_{w \in W} n(d, w) P(z|d, w), \quad (5) $$
$$ P(z) \propto \sum_{d \in D} \sum_{w \in W} n(d, w) P(z|d, w). \quad (6) $$

Before discussing further algorithmic refinements, we will study the relationship between the proposed model and LSA in more detail.
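The E-step (3) and M-steps (4)-(6) translate almost line by line into numpy. The sketch below is a minimal illustration under several assumptions not made in the text: the counts are stored as a dense array, the parameters are initialized at random, and a fixed number of iterations replaces a convergence check; the function name plsa_em is hypothetical. The usage example at the end also checks that the fitted parameters define a proper joint distribution via equation (2).

```python
import numpy as np

def plsa_em(N, K, n_iter=50, seed=0):
    """Fit the aspect model P(d,w) = sum_z P(z) P(d|z) P(w|z) by EM.

    N : (n_docs, n_words) array of counts n(d, w)
    K : number of latent aspects z
    Returns P(z), P(d|z), P(w|z).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = N.shape

    # Random, properly normalized initialization of the parameters.
    P_z = np.full(K, 1.0 / K)                       # P(z)
    P_d_z = rng.random((n_docs, K))
    P_d_z /= P_d_z.sum(axis=0)                      # P(d|z), columns sum to 1
    P_w_z = rng.random((n_words, K))
    P_w_z /= P_w_z.sum(axis=0)                      # P(w|z), columns sum to 1

    for _ in range(n_iter):
        # E-step, eq. (3): P(z|d,w) proportional to P(z) P(d|z) P(w|z), normalized over z.
        post = P_z[None, None, :] * P_d_z[:, None, :] * P_w_z[None, :, :]
        post /= post.sum(axis=2, keepdims=True) + 1e-12   # shape (n_docs, n_words, K)

        # Expected counts n(d,w) P(z|d,w), shared by all three M-step updates.
        weighted = N[:, :, None] * post

        # M-step, eqs. (4)-(6): the proportionality is resolved by renormalizing.
        P_w_z = weighted.sum(axis=0)                # sum over d
        P_w_z /= P_w_z.sum(axis=0, keepdims=True)
        P_d_z = weighted.sum(axis=1)                # sum over w
        P_d_z /= P_d_z.sum(axis=0, keepdims=True)
        P_z = weighted.sum(axis=(0, 1))             # sum over d and w
        P_z /= P_z.sum()

    return P_z, P_d_z, P_w_z

# Example usage on a small random count matrix standing in for n(d, w).
rng = np.random.default_rng(1)
N = rng.integers(0, 4, size=(8, 20))
P_z, P_d_z, P_w_z = plsa_em(N, K=3)

# Joint distribution of eq. (2): P(d,w) = sum_z P(z) P(d|z) P(w|z); it sums to 1.
P_dw = np.einsum('k,ik,jk->ij', P_z, P_d_z, P_w_z)
print(P_dw.sum())
```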