Probabilistic Latent Semantic Analysis

To appear in: Uncertainty in Artificial Intelligence, UAI'99, Stockholm

Thomas Hofmann
EECS Department, Computer Science Division, University of California, Berkeley
& International Computer Science Institute, Berkeley, CA
hofmann@cs.berkeley.edu

Abstract

Probabilistic Latent Semantic Analysis is a novel statistical technique for the analysis of two-mode and co-occurrence data, which has applications in information retrieval and filtering, natural language processing, machine learning from text, and in related areas. Compared to standard Latent Semantic Analysis, which stems from linear algebra and performs a Singular Value Decomposition of co-occurrence tables, the proposed method is based on a mixture decomposition derived from a latent class model. This results in a more principled approach which has a solid foundation in statistics. In order to avoid overfitting, we propose a widely applicable generalization of maximum likelihood model fitting by tempered EM. Our approach yields substantial and consistent improvements over Latent Semantic Analysis in a number of experiments.

1 Introduction

Learning from text and natural language is one of the great challenges of Artificial Intelligence and Machine Learning. Any substantial progress in this domain has a strong impact on many applications, ranging from information retrieval, information filtering, and intelligent interfaces, to speech recognition, natural language processing, and machine translation. One of the fundamental problems is to learn the meaning and usage of words in a data-driven fashion, i.e., from some given text corpus, possibly without further linguistic prior knowledge.

The main challenge a machine learning system has to address is rooted in the distinction between the lexical level of "what actually has been said or written" and the semantic level of "what was intended" or "what was referred to" in a text or an utterance. The resulting problems are twofold: (i) polysemy, i.e., a word may have multiple senses and multiple types of usage in different contexts, and (ii) synonymy and semantically related words, i.e., different words may have a similar meaning; they may, at least in certain contexts, denote the same concept or, in a weaker sense, refer to the same topic.

Latent semantic analysis (LSA) [3] is a well-known technique which partially addresses these questions. The key idea is to map high-dimensional count vectors, such as the ones arising in vector space representations of text documents [12], to a lower-dimensional representation in a so-called latent semantic space. As the name suggests, the goal of LSA is to find a data mapping which provides information well beyond the lexical level and reveals semantic relations between the entities of interest. Due to its generality, LSA has proven to be a valuable analysis tool with a wide range of applications (e.g. [3, 5, 8, 1]). Yet its theoretical foundation remains to a large extent unsatisfactory and incomplete.

This paper presents a statistical view on LSA which leads to a new model called Probabilistic Latent Semantic Analysis (PLSA). In contrast to standard LSA, its probabilistic variant has a sound statistical foundation and defines a proper generative model of the data. A detailed discussion of the numerous advantages of PLSA can be found in subsequent sections.
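As a concrete preview of the model developed in the following sections, the NumPy sketch below illustrates the latent class decomposition P(d,w) = sum_z P(z) P(d|z) P(w|z) together with a tempered EM update, in which the E-step posterior is computed from the class-conditional terms raised to a power beta <= 1 (beta = 1 recovers standard EM). The toy data, array names, and the particular value of beta are our own illustrative assumptions, not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(0)

N, M, K = 8, 12, 3                                  # documents, words, latent classes (toy sizes)
n = rng.integers(0, 5, size=(N, M)).astype(float)   # count matrix n(d, w)

# Random initialization; each factor is normalized to a proper distribution.
Pz = np.full(K, 1.0 / K)                                        # P(z)
Pd_z = rng.random((K, N)); Pd_z /= Pd_z.sum(1, keepdims=True)   # P(d|z)
Pw_z = rng.random((K, M)); Pw_z /= Pw_z.sum(1, keepdims=True)   # P(w|z)

beta = 0.8  # tempering exponent; beta = 1 gives standard EM

for _ in range(50):
    # Tempered E-step: posterior of the latent class given (d, w),
    # with the joint terms raised to the power beta before normalizing.
    joint = (Pz[:, None, None] * Pd_z[:, :, None] * Pw_z[:, None, :]) ** beta
    post = joint / joint.sum(0, keepdims=True)      # P(z|d,w), shape (K, N, M)

    # M-step: re-estimate parameters from expected counts n(d,w) * P(z|d,w).
    expected = n[None, :, :] * post
    Pd_z = expected.sum(2); Pd_z /= Pd_z.sum(1, keepdims=True)
    Pw_z = expected.sum(1); Pw_z /= Pw_z.sum(1, keepdims=True)
    Pz = expected.sum((1, 2)); Pz /= Pz.sum()

# The fitted joint distribution P(d,w) = sum_z P(z) P(d|z) P(w|z);
# it sums to 1 over all document-word pairs.
P_dw = np.einsum('k,kd,kw->dw', Pz, Pd_z, Pw_z)
```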
2 Latent Semantic Analysis

2.1 Count Data and Co-occurrence Tables

LSA can in principle be applied to any type of count data over a discrete dyadic domain (cf. [7]). However, since the most prominent application of LSA is in the analysis and retrieval of text documents, we focus on this setting for the sake of concreteness. Suppose therefore we are given a collection of text documents.
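To make the setting concrete, the following sketch builds a term-document co-occurrence table from a toy corpus and applies the truncated SVD that standard LSA, as described in the abstract, performs on such tables. The corpus, the number of retained dimensions K, and all variable names are illustrative assumptions.

```python
import numpy as np

# Toy corpus; in practice documents would be tokenized and preprocessed.
docs = ["latent semantic analysis of text",
        "probabilistic analysis of count data",
        "text retrieval and filtering"]
vocab = sorted({w for d in docs for w in d.split()})
w_index = {w: j for j, w in enumerate(vocab)}

# Term-document count matrix: n[i, j] = frequency of term i in document j.
n = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        n[w_index[w], j] += 1

# LSA: a truncated SVD keeps the K largest singular values, projecting
# terms and documents into a K-dimensional latent semantic space.
K = 2
U, s, Vt = np.linalg.svd(n, full_matrices=False)
doc_latent = (np.diag(s[:K]) @ Vt[:K]).T    # document coordinates, shape (docs, K)
term_latent = U[:, :K] * s[:K]              # term coordinates, shape (terms, K)
```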