(Baseline 2)
• Most popular tags with respect to the user: All tags t ∈ Tu are ranked according to their relative frequency. (Baseline 3)

Tag prediction with the TT and UTT model: In the TT model, the prediction of tags for unseen documents can be formulated as follows: Based on the word-topic and tag-topic count matrices learned from the independent training data set, the likelihood of a tag label l ∈ Tu given the test resource r is p(l|r) = Σ_t p(l|t) p(t|r). The first probability in the sum, p(l|t), is given by the learned topic-tag distribution. The mixture of topics p(t|r) for the resource has to be estimated online.
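As a minimal sketch of this prediction step (illustrative variable names, not the paper's code; the topic mixture p(t|r) is assumed to have already been estimated online), the tag likelihood p(l|r) = Σ_t p(l|t) p(t|r) reduces to a matrix-vector product over topics:

```python
import numpy as np

# phi[t, l]: learned topic-tag distribution p(l|t), each row sums to 1.
# theta_r[t]: topic mixture p(t|r) estimated online for resource r.
# Both arrays are hypothetical stand-ins for the count-matrix estimates.
def tt_tag_scores(phi: np.ndarray, theta_r: np.ndarray) -> np.ndarray:
    """Return p(l|r) = sum_t p(l|t) * p(t|r) for every tag label l."""
    return phi.T @ theta_r

# Toy example: 2 topics, 3 tags.
phi = np.array([[0.7, 0.2, 0.1],
                [0.1, 0.3, 0.6]])
theta_r = np.array([0.5, 0.5])
scores = tt_tag_scores(phi, theta_r)
ranking = np.argsort(-scores)  # tags ranked by predicted likelihood
```

Ranking the tags by these scores yields the predicted tag list that the NDCG evaluation below operates on.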
For each resource r, we independently sample topics for a small number of iterations (we used i = 5) by using the word counts in Φ from the training corpus.

The likelihood of a tag label l ∈ Tu in the UTT model is given by p(l|r, u) = Σ_t p(l|t) p(t|r, u). Again, p(t|r, u) has to be estimated online. Here the mixture of topics for the resource is restricted with respect to the user, i.e. we estimate the topic distribution for the resource based on the user-specific topic distribution. Recall that every post originates from a single user; therefore the estimated topic distribution for the resource under consideration is based on this user. This estimation gives a personalized view on r and thus influences the topic distribution of the resource.

Evaluation measure: We are interested in the ranking quality of the predicted tags. Here we use the normalized discounted cumulative gain (NDCG) [10] to evaluate a predicted ranking, which is calculated by summing the "gains" along the rank list with a log discount factor as NDCG(R̂) = Z Σ_k (2^r(k) − 1)/log(1 + k), where r(k) denotes the target label for the k-th ranked item in R̂, and Z is chosen such that a perfect ranking obtains the value 1. To focus more on the top-ranked items, we also consider NDCG@n, which only counts the top n items in the rank list. In addition to the ranking scenario, we report F-measure values averaged over the users, as proposed in previous work [1], in an extended version of this paper.

Tag Prediction Results: Table IV presents the NDCG scores. The first baseline method performs quite poorly, since this model does not take into account which tags a certain user has posted so far. All other methods, i.e. Baseline 2, Baseline 3, the TT and the UTT model, take this information into account. The two hierarchical Bayesian models clearly outperform all three baseline methods. Therefore, taking the textual resources into account clearly adds a benefit.
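The NDCG measure used above can be sketched as follows (an illustrative implementation with hypothetical helper names; the input is the list of relevance values r(k) in predicted rank order, and Z is obtained from the ideal descending ordering):

```python
import math

def dcg(relevances):
    """Sum (2^r(k) - 1) / log(1 + k) over ranks k = 1, 2, ..."""
    return sum((2 ** r - 1) / math.log(1 + k)
               for k, r in enumerate(relevances, start=1))

def ndcg(relevances, n=None):
    """NDCG of a predicted ranking; with n set, NDCG@n over the top n items.

    Normalizes by the DCG of the ideal (descending) ordering, so a
    perfect ranking obtains the value 1.
    """
    ideal = sorted(relevances, reverse=True)
    if n is not None:  # NDCG@n: truncate both lists to the top n ranks
        relevances, ideal = relevances[:n], ideal[:n]
    idcg = dcg(ideal)
    return dcg(relevances) / idcg if idcg > 0 else 0.0
```

For example, with binary relevance, ndcg([1, 1, 0]) is 1.0 (the two relevant tags are ranked first), while ndcg([0, 1, 1]) is penalized by the log discount on the lower ranks.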
The hierarchical Bayesian models are both not very sensitive to the predefined number of topics T, but a slight performance increase can be observed with an increasing number of topics.

A major advantage of the UTT model can be observed when a resource has only a title and no abstract (1223 out of 15000 posts). Since the number of observed words is drastically reduced, it becomes more difficult to estimate the resource-specific Θ reliably. Here, the NDCG for the TT model decreases significantly (NDCG all is 0.42 for T = 200). The UTT model, in contrast, can make use of the user-specific topic distribution to estimate p(t|r, u) more reliably, and the NDCG only decreases slightly (NDCG all is 0.48 for T = 200).

V. CONCLUSION AND OUTLOOK

In this paper, we presented hierarchical Bayesian models for mining and modelling large systems with user-generated content and massive annotation. To demonstrate their performance, we trained the models on a large fraction of the CiteULike database. As a quantitative result, we showed that the models proposed here provide better tag annotation quality in terms of perplexity compared to the standard LDA framework. With the UTT model, we are able to create a personalized view on a resource by sampling the resource-specific topic distribution through the user-specific topic distribution, which we see as the reason for the performance increase in the tag recommendation task. Future work will aim at investigating further ways to model users within the LDA framework.

REFERENCES

[1] R. Jäschke et al., "Tag recommendations in folksonomies," in Knowledge Discovery in Databases: PKDD 2007, 2007.
[2] P. Heymann et al., "Social tag prediction," in Proceedings of the 31st Annual International ACM SIGIR Conference, 2008.
[3] A. Hotho et al., "Information retrieval in folksonomies: Search and ranking," in The Semantic Web: Research and Applications, 2006.
[4] C. Cattuto et al., "Semantic grounding of tag relatedness in social bookmarking systems," in Proceedings of the 7th International Semantic Web Conference, 2008.
[5] C. Schmitz et al., "Mining association rules in folksonomies," in Data Science and Classification, 2006.
[6] D. M. Blei et al., "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, January 2003.
[7] M. Steyvers et al., "Probabilistic author-topic models for information discovery," in Proceedings of the 10th ACM SIGKDD International Conference, 2004.
[8] D. M. Blei et al., "Modeling annotated data," in Proceedings of the 26th Annual International ACM SIGIR Conference, 2003.
[9] R. Nallapati et al., "Joint latent topic models for text and citations," in Proceedings of the 14th ACM SIGKDD International Conference, 2008.
[10] K. Järvelin and J. Kekäläinen, "IR evaluation methods for retrieving highly relevant documents," in Proceedings of the 23rd Annual International ACM SIGIR Conference, 2000.