Table 1: Average precision results and relative improvement w.r.t. the baseline method cos+tf for the four standard test collections. Compared are LSI, PLSI, as well as results obtained by combining PLSI models (PLSI*). An asterisk for LSI indicates that no performance gain could be achieved over the baseline; the result at 256 dimensions with λ = 2/3 is reported in this case.

             MED              CRAN             CACM             CISI
          prec.  impr.     prec.  impr.     prec.  impr.     prec.  impr.
 cos+tf    44.3     -       29.9     -       17.9     -       12.7     -
 LSI       51.7  +16.7      28.7   -4.0      16.0  -11.6      12.8   +0.8
 PLSI      63.9  +44.2      35.1  +17.4      22.9  +27.9      18.8  +48.0
 PLSI*     66.3  +49.7      37.5  +25.4      26.8  +49.7      20.1  +58.3

The comparison of LSA and PLSA on the first task will demonstrate the advantages of explicitly minimizing perplexity by TEM; the second task will show that the solid statistical foundation of PLSA pays off even in applications which are not directly related to perplexity reduction.

4.1 Perplexity Evaluation

In order to compare the predictive performance of PLSA and LSA one has to specify how to extract probabilities from an LSA decomposition. This problem is not trivial, since negative entries prohibit a simple re-normalization of the approximating matrix Ñ. We have followed the approach of [2] to derive LSA probabilities.

Two data sets have been used to evaluate the perplexity performance: (i) a standard information retrieval test collection MED with 1033 documents, and (ii) a dataset of noun-adjective pairs generated from a tagged version of the LOB corpus. In the first case, the goal was to predict word occurrences based on (parts of) the words in a document. In the second case, nouns have to be predicted conditioned on an associated adjective. Figure 5 reports perplexity results for LSA and PLSA on the MED (a) and LOB (b) datasets in dependence on the number of dimensions of the (probabilistic) latent semantic space. PLSA outperforms the statistical model derived from standard LSA by far. On the MED collection PLSA reduces perplexity relative to the unigram baseline by more than a factor of three (3073/936 ≈ 3.3), while LSA achieves less than a factor of two in reduction (3073/1647 ≈ 1.9). On the less sparse LOB data the PLSA reduction in perplexity is 1316/547 ≈ 2.41, while the reduction achieved by LSA is only 1316/632 ≈ 2.08.
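To make the perplexity criterion concrete: it is the exponentiated negative average log-likelihood per observed word occurrence, so a reduction from 3073 to 936 means the model assigns, on average, much higher probability to word occurrences than the unigram baseline. The following sketch is a minimal NumPy illustration, not the evaluation code used in the paper; the function and variable names are ours, and it simply scores a given count matrix rather than reproducing the paper's held-out folding-in protocol.

    import numpy as np

    def unigram_perplexity(N):
        # Perplexity of the single-multinomial (unigram) baseline on a count matrix N[d, w].
        p_w = N.sum(axis=0) / N.sum()                     # corpus-wide word distribution
        log_p = np.log(np.where(p_w > 0, p_w, 1.0))       # zero-count words contribute nothing
        return float(np.exp(-(N * log_p[None, :]).sum() / N.sum()))

    def aspect_model_perplexity(N, Pz, Pdz, Pwz):
        # Perplexity of an aspect model P(d, w) = sum_z P(z) P(d|z) P(w|z).
        #   Pz  : (K,)   aspect priors P(z)
        #   Pdz : (K, D) document probabilities P(d|z)
        #   Pwz : (K, W) word probabilities P(w|z)
        Pdw = np.einsum('k,kd,kw->dw', Pz, Pdz, Pwz)      # joint P(d, w)
        Pw_given_d = Pdw / (Pdw.sum(axis=1, keepdims=True) + 1e-12)
        mask = N > 0
        ll = (N[mask] * np.log(Pw_given_d[mask] + 1e-12)).sum()
        return float(np.exp(-ll / N.sum()))

Under this reading, the factor-of-three improvement quoted above is simply the ratio unigram_perplexity(N) / aspect_model_perplexity(N, Pz, Pdz, Pwz) evaluated on the test occurrences.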
In order to demonstrate the advantages of TEM, we have also trained aspect models on the MED data by standard EM with early stopping. As can be seen from the curves in Figure 5 (a), the difference between EM and TEM model fitting is significant. Although both strategies, tempering and early stopping, are successful in controlling the model complexity, EM training performs worse, since it makes a very inefficient use of the available degrees of freedom. Notice that with both methods it is possible to train high-dimensional models with a continuous improvement in performance. The number of latent space dimensions may even exceed the rank of the co-occurrence matrix N, and the choice of the number of dimensions becomes merely an issue of possible limitations of computational resources.
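To see where the two fitting strategies differ, the sketch below contrasts them on the aspect model. It is a minimal NumPy illustration under our own naming and a fixed tempering parameter beta, not the authors' implementation or their annealing schedule: with beta = 1.0 the update is standard EM, while beta < 1.0 gives the dampened, tempered E-step; early stopping would instead keep beta = 1.0 and simply truncate the loop.

    import numpy as np

    def tempered_em(N, K, beta=1.0, n_iter=50, seed=0):
        # (Tempered) EM for the aspect model P(d, w) = sum_z P(z) P(d|z) P(w|z).
        #   N    : (D, W) co-occurrence count matrix n(d, w)
        #   K    : number of latent aspects z
        #   beta : inverse temperature; beta = 1 recovers standard EM
        rng = np.random.default_rng(seed)
        D, W = N.shape
        Pz = np.full(K, 1.0 / K)                        # P(z)
        Pdz = rng.dirichlet(np.ones(D), size=K)         # P(d|z), each row sums to 1
        Pwz = rng.dirichlet(np.ones(W), size=K)         # P(w|z)
        for _ in range(n_iter):
            # E-step: posteriors P(z|d,w) proportional to P(z) [P(d|z) P(w|z)]**beta
            post = Pz[:, None, None] * (Pdz[:, :, None] * Pwz[:, None, :]) ** beta
            post /= post.sum(axis=0, keepdims=True) + 1e-12      # shape (K, D, W)
            # M-step: re-estimate parameters from expected counts n(d,w) P(z|d,w)
            weighted = N[None, :, :] * post
            Pwz = weighted.sum(axis=1); Pwz /= Pwz.sum(axis=1, keepdims=True)
            Pdz = weighted.sum(axis=2); Pdz /= Pdz.sum(axis=1, keepdims=True)
            Pz = weighted.sum(axis=(1, 2)); Pz /= Pz.sum()
        return Pz, Pdz, Pwz

A typical tempered run would start at beta = 1 and gradually lower it while monitoring held-out perplexity, whereas early-stopped EM keeps beta = 1 and halts the iteration once held-out perplexity stops improving.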
4.2 Information Retrieval

One of the key problems in information retrieval is automatic indexing, which has its main application in query-based retrieval. The most popular family of information retrieval techniques is based on the Vector Space Model (VSM) for documents [12]. Here, we have utilized a rather straightforward representation based on the (untransformed) term frequencies n(d, w) together with the standard cosine matching function; a more detailed experimental analysis can be found in [6]. The same representation applies to queries q, so that the matching function for the baseline term matching method can be written as

    s(d, q) = \frac{\sum_w n(d, w) \, n(q, w)}{\sqrt{\sum_w n(d, w)^2} \, \sqrt{\sum_w n(q, w)^2}}.    (10)

In Latent Semantic Indexing (LSI), the original vector space representation of documents is replaced by a representation in the low-dimensional latent space and the similarity is computed based on that representation. Queries or documents which were not part of the original collection can be folded in by a simple matrix multiplication (cf. [3] for details). In our experiments, we have actually considered linear combinations of the original similarity score (10) (weight λ) and the one derived from the latent space representation (weight 1 - λ).

The same ideas have been applied in Probabilistic Latent Semantic Indexing (PLSI) in conjunction with the PLSA model. More precisely, the low-dimensional representation in the factor space P(z|d) and P(z|q)
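The excerpt breaks off before the PLSI matching function is fully specified, so the following sketch is only an assumption-laden illustration of the combination scheme described above: the baseline score of Eq. (10) is a cosine between raw term-frequency vectors, and we assume here that the latent-space score is likewise a cosine, computed between the mixing proportions P(z|d) and P(z|q) (for LSI it would be between the folded-in latent vectors); the function names are ours.

    import numpy as np

    def cosine_match(x, y):
        # Cosine similarity between two vectors; with raw term frequencies this is Eq. (10).
        denom = np.linalg.norm(x) * np.linalg.norm(y)
        return float(x @ y / denom) if denom > 0 else 0.0

    def combined_score(n_d, n_q, p_z_given_d, p_z_given_q, lam):
        # Linear combination of the original similarity (weight lam) and an assumed
        # latent-space similarity between P(z|d) and P(z|q) (weight 1 - lam).
        baseline = cosine_match(n_d, n_q)
        latent = cosine_match(p_z_given_d, p_z_given_q)
        return lam * baseline + (1.0 - lam) * latent

Setting lam = 1 recovers the cos+tf baseline of Table 1, lam = 0 ranks purely in the latent space, and intermediate values such as the λ = 2/3 mentioned in the table caption interpolate between the two.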