Figure 1. Graphical models in plate notation with observed (gray circles) and latent variables (white circles). Left: standard LDA. Middle: Topic-Tag (TT) model. Right: User-Topic-Tag (UTT) model.

[8], we estimate Φ, Θ and Γ from the posterior distributions via Gibbs Sampling [7].

D. User-Topic-Tag (UTT) Model

In this section we introduce the most elaborate model for collaborative tagging systems. The UTT model adds an additional layer accounting for the most essential entity: the user, who assigns one or more tags to web resources. This can be formalized by a two-step process in which a user first cites an article based on his interests and afterwards assigns tags based on the content of the resource. This process can be modeled by a hierarchical generative model in which each word w of a resource r is associated with two variables: a user u and a latent topic variable t. We assume that each user is interested in several topics; thus each user has a multinomial distribution over topics. First, a user u is chosen uniformly at random for each word of a certain resource. Hereby u is chosen from Ur, the users who cite the resource r. Second, a topic is sampled for each word from the topic distribution Θu of the user chosen for that word. Third, for each of the tags associated with the resource, a topic is drawn uniformly from the topic assignments of the words in the web page (Figure 1, right). This can be summarized as:

1) For each user u = 1 . . . |U| choose Θu ∼ Dirichlet(α)
   For each topic t = 1 . . . |T| choose Φt ∼ Dirichlet(β) and Γt ∼ Dirichlet(γ)
2) For each resource r = 1 . . . |R| and its given users Ur
   For each word wi, i = 1 . . . Nr, in resource r
   a) Sample a user xi ∼ Uniform(1, . . . , Ur)
   b) Given xi, sample a topic zi ∼ Mult(Θxi)
   c) Given zi, sample a word wi ∼ Mult(Φzi)
3) For each tag label lj, j = 1 . . . Mr, in resource r
   a) Sample an index i ∼ Uniform(1, . . . , Nr)
   b) Given topic z̃i, sample a tag label lj ∼ Mult(Γz̃i)

For the sake of brevity, we omit the Gibbs Sampling equations and provide them in an extended version of this paper online2.
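To make the generative process above concrete, the following minimal sketch simulates it for a single resource. It is only an illustration under assumed toy corpus sizes and hyperparameters; the variable names (num_users, n_words, users_r, etc.) and the use of NumPy are ours and do not appear in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy sizes; |U|, |T|, N_r and M_r from the paper map to
# num_users, num_topics, n_words and n_tags below.
num_users, num_topics = 5, 3
vocab_size, tag_vocab_size = 50, 20
alpha, beta, gamma = 0.1, 0.01, 0.01

# Step 1: draw Theta_u, Phi_t and Gamma_t from their Dirichlet priors.
theta = rng.dirichlet(np.full(num_topics, alpha), size=num_users)      # Theta_u
phi = rng.dirichlet(np.full(vocab_size, beta), size=num_topics)        # Phi_t
gam = rng.dirichlet(np.full(tag_vocab_size, gamma), size=num_topics)   # Gamma_t

def generate_resource(users_r, n_words, n_tags):
    """Generate one resource r, given the users U_r who cite it."""
    words, topics, tags = [], [], []
    # Step 2: for every word, pick a citing user uniformly, then a topic
    # from that user's Theta, then a word from the topic's Phi.
    for _ in range(n_words):
        x = rng.choice(users_r)                      # x_i ~ Uniform over U_r
        z = rng.choice(num_topics, p=theta[x])       # z_i ~ Mult(Theta_{x_i})
        w = rng.choice(vocab_size, p=phi[z])         # w_i ~ Mult(Phi_{z_i})
        words.append(w)
        topics.append(z)
    # Step 3: for every tag, pick a word position uniformly and emit a tag
    # label from that word's topic.
    for _ in range(n_tags):
        i = rng.integers(n_words)                    # i ~ Uniform(1, ..., N_r)
        tags.append(rng.choice(tag_vocab_size, p=gam[topics[i]]))  # l_j ~ Mult(Gamma)
    return words, tags

words, tags = generate_resource(users_r=[0, 2], n_words=30, n_tags=4)
```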
In the UTT model, the interest of the user is modeled by the assignments of users to words in the resource. Obviously, this is a simplifying modeling assumption. However, this assumption yielded promising results in the past when modeling authors and their interests [7]. Furthermore, once we have trained a UTT model, we can estimate the resource-specific topic distribution based on a single user. This provides a personalized view on a resource and results in a potentially better tag recommendation (see Section IV-B4).
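As an illustration of this personalized view, the sketch below estimates a resource's topic mixture through the eyes of a single user from trained Φ and Θ, and scores tag labels through Γ. This is our own simplified estimator, not the paper's procedure from Section IV-B4; the function names and the per-word averaging heuristic are assumptions.

```python
import numpy as np

def personalized_topic_mixture(word_ids, theta_u, phi):
    """Approximate p(topic | resource, user): weight each topic by the chosen
    user's interests Theta_u and by how well it explains each word, then
    average the per-word topic posteriors over the resource."""
    post = theta_u[:, None] * phi[:, word_ids]     # shape (|T|, N_r)
    post /= post.sum(axis=0, keepdims=True)        # normalize per word
    return post.mean(axis=1)                       # one mixture per (resource, user)

def personalized_tag_scores(word_ids, theta_u, phi, gam):
    """Score every tag label as sum_t p(t | r, u) * Gamma[t, tag]."""
    return personalized_topic_mixture(word_ids, theta_u, phi) @ gam
```

Ranking the resulting scores would then yield a user-specific tag recommendation list for the resource.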
IV. EXPERIMENTS

A. Experimental Setup

Data Set: CiteULike provides data snapshots on their webpage3. The data used in our experiments was from November 13th, 2008.

Training Data: We selected a reasonably high number of users (1,393) and included articles that were cited by at least three users. Word tokens from title and abstract were stemmed with a standard Porter stemmer, and stop words were removed. Word tokens and tags occurring less than five times were filtered out (see the sketch at the end of this subsection). Table I summarizes the corpus statistics. The user ids, resource ids and tags are provided as supplementary data4. All in all, this results in a total number of 64,159 posts.

Table I
CORPUS STATISTICS CITEULIKE DATA SET

               Unique       Total
Resources      18,628      18,628
Word tokens    14,489   1,161,794
Tags            4,311     125,808
Users           1,393      18,628

Test Set for Tag Recommendation: The only restriction for the test set was that a resource had to be posted by a user previously seen in the training set. The same applies to tags. We evaluate the models on a total of 15,000 posts. On average, each user uses 32 tags. The maximum number of tag labels for a specific user is 279. The average number of tag assignments for a user is three.

2 www.dbs.ifi.lmu.de/∼bundschu
3 http://www.citeulike.org/faq/data.adp
4 www.dbs.ifi.lmu.de/∼bundschu/UTTmodel_supplementary/info.html
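The following sketch shows one way to reproduce the Training Data preprocessing described above (Porter stemming, stop-word removal, and filtering of word tokens and tags that occur fewer than five times). The use of NLTK and all function names are our own assumptions; the paper does not specify a library.

```python
from collections import Counter

# Requires: nltk.download("punkt") and nltk.download("stopwords")
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

MIN_COUNT = 5                    # tokens/tags seen fewer than five times are dropped
stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(documents):
    """Stem the title+abstract text of each resource, drop stop words,
    then remove word tokens occurring fewer than MIN_COUNT times overall."""
    tokenized = []
    for text in documents:
        tokens = [stemmer.stem(tok.lower())
                  for tok in word_tokenize(text)
                  if tok.isalpha() and tok.lower() not in stop_words]
        tokenized.append(tokens)
    counts = Counter(tok for doc in tokenized for tok in doc)
    return [[tok for tok in doc if counts[tok] >= MIN_COUNT] for doc in tokenized]

def filter_rare_tags(tag_lists):
    """Apply the same frequency cutoff to the tag assignments of each post."""
    counts = Counter(tag for tags in tag_lists for tag in tags)
    return [[tag for tag in tags if counts[tag] >= MIN_COUNT] for tags in tag_lists]
```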