Gibbs sampling is one possible approach to this end: it iterates multiple times over each term t_i in document d_i and samples a new topic j for the term according to the probability P(z_i = j | t_i, d_i, z_{-i}) given in Equation 2, until the LDA model parameters converge.

P(z_i = j \mid t_i, d_i, z_{-i}) \propto \frac{C^{TZ}_{t_i j} + \beta}{\sum_t C^{TZ}_{t j} + T\beta} \cdot \frac{C^{DZ}_{d_i j} + \alpha}{\sum_z C^{DZ}_{d_i z} + Z\alpha}    (2)

C^{TZ} maintains a count of all topic-term assignments, C^{DZ} counts the document-topic assignments, z_{-i} represents all topic-term and document-topic assignments except the current assignment z_i for term t_i, and α and β are the (symmetric) hyperparameters for the Dirichlet priors, serving as smoothing parameters for the counts. Based on these counts, the posterior probabilities in Equation 1 can be estimated as follows:

P(t_i \mid z_i = j) = \frac{C^{TZ}_{t_i j} + \beta}{\sum_t C^{TZ}_{t j} + T\beta}    (3)

P(z_i = j \mid d_i) = \frac{C^{DZ}_{d_i j} + \alpha}{\sum_z C^{DZ}_{d_i z} + Z\alpha}    (4)
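As a concrete illustration of this sampling procedure, the sketch below shows how the update in Equation 2 can be realized with the two count matrices. It is a minimal sketch under assumed names (n_tz for C^{TZ}, n_dz for C^{DZ}, docs and z for the corpus and its current topic assignments), not the implementation used for the experiments in this paper; in the tagging setting of the next subsection, "documents" are resources and "terms" are tags.

    import numpy as np

    def gibbs_sweep(docs, z, n_tz, n_dz, n_z, alpha, beta, rng):
        """One sweep of collapsed Gibbs sampling over all (document, term) positions.

        docs : list of lists of term ids
        z    : list of lists holding the current topic assignment of every position
        n_tz : term-topic counts, shape (T, Z)      -- C^{TZ} in the text
        n_dz : document-topic counts, shape (D, Z)  -- C^{DZ} in the text
        n_z  : per-topic totals, shape (Z,)         -- column sums of n_tz
        """
        T, Z = n_tz.shape
        for d, doc in enumerate(docs):
            for i, t in enumerate(doc):
                j = z[d][i]
                # Remove the current assignment z_i from the counts (the z_{-i} in Equation 2).
                n_tz[t, j] -= 1
                n_dz[d, j] -= 1
                n_z[j] -= 1
                # Equation 2 up to a constant factor; the document-side denominator
                # (sum_z C^{DZ}_{d z} + Z*alpha) does not depend on j and cancels.
                p = (n_tz[t, :] + beta) / (n_z + T * beta) * (n_dz[d, :] + alpha)
                p /= p.sum()
                # Draw the new topic and put the assignment back into the counts.
                j = rng.choice(Z, p=p)
                z[d][i] = j
                n_tz[t, j] += 1
                n_dz[d, j] += 1
                n_z[j] += 1

    def estimate_distributions(n_tz, n_dz, alpha, beta):
        """Equations 3 and 4: P(t | z = j) and P(z = j | d) from the converged counts."""
        T, Z = n_tz.shape
        phi = (n_tz + beta) / (n_tz.sum(axis=0) + T * beta)                     # (T, Z)
        theta = (n_dz + alpha) / (n_dz.sum(axis=1, keepdims=True) + Z * alpha)  # (D, Z)
        return phi, theta

    # Typical use: assign topics randomly, build the count matrices from z, then call
    # gibbs_sweep repeatedly (e.g. with rng = np.random.default_rng(0)) until the
    # counts stabilize, and finally read off phi and theta.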
2.3.1 Application to Tagging Systems

LDA assigns to each document latent topics together with a probability value indicating how much each topic contributes to the overall document. For tagging systems the documents are resources r ∈ R, and each resource is described by tags t ∈ T assigned by users u ∈ U. Instead of documents composed of terms, we have resources composed of tags. To build an LDA model we need resources and associated tags previously assigned by users. For each resource r we need some bookmarks b(r, u_i) assigned by users u_i, i ∈ {1 . . . n}. Then we can represent each resource in the system not with its actual tags but with the tags from the topics discovered by LDA.

For a new resource r_new for which we only have a small number of bookmarks (i ∈ {1 . . . 5}), i.e., only one to five users annotated this resource, we can expand the latent topic representation of this resource with the top tags of each latent topic. To accommodate the fact that some tags are added by multiple users whereas others are added by only one or two users, we can use the probabilities that LDA assigns. As formalized in Equation 1, this is a two-level process: probabilities are assigned not only to the latent topics for a single resource but also to each tag within a latent topic, indicating the probability of this tag being part of that particular topic. We represent each resource r_i as the probabilities P(z_j | r_i) for each latent topic z_j ∈ Z. Every topic z_j is represented as the probabilities P(t_n | z_j) for each tag t_n ∈ T. By combining these two probabilities for each tag for r_new, we get a probability value for each tag that can be interpreted similarly to the relative tag frequency of a resource. Setting a threshold allows us to adjust the number of recommended tags, and the emphasis can be shifted from recall to precision.

Imagine a resource with the following tags: "photo", "photography", and "howto". Table 2 shows the top terms for two topics related to the assigned tags. It is interesting to compare these two topics with the corresponding association rules in Table 1. Whereas the association rules indicate only fairly simple term expansions, the latent topics comprise an arguably broader notion of (digital) photography and the various aspects of tutorial material. Given these topics we can easily extend the current tag set or recommend new tags to users by looking at the latent topics. In our example, we can recommend "photos", "images", "photoshop", "tutorial", "reference", and "tips" if we set the threshold for the accumulated probabilities to 0.045. LDA would assume that the resource in question belongs 66% to the "photo" topic and 33% to the "howto" topic. Multiplying these probabilities with the individual tag probabilities of the latent topics results in a ranked list of relevant tags for our resource.

Tag          Count   Prob.      Tag         Count   Prob.
photography  16452   0.235      howto       23371   0.219
photo         9002   0.129      tutorial    15519   0.145
photos        7739   0.110      reference   14084   0.132
images        6302   0.090      tips        13955   0.131
photoshop     4825   0.069      tutorials    7320   0.069
graphics      2831   0.040      guide        3430   0.032
image         2769   0.040      toread       2948   0.028
art           1910   0.027      article      2376   0.022
stock         1852   0.026      articles     1498   0.014
pictures      1676   0.024      useful       1442   0.013
design        1666   0.024      learning     1147   0.011
gallery       1386   0.020      tricks       1140   0.011
camera         831   0.012      how-to       1081   0.010
digital        802   0.011      help         1054   0.010

Table 2: Top terms composing the latent topics "photography" and "howto"
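The recommendation step described above can be written down compactly: weight each tag's in-topic probability P(t | z_j) by the resource's topic proportion P(z_j | r_new), sum over topics, drop tags below the threshold, and rank the rest. The function and variable names below are illustrative, and the rounded probabilities from the running example only approximately reproduce the list given in the text.

    def recommend_tags(theta_r, phi, known_tags, threshold=0.045):
        """Score each tag by the sum over topics of P(z_j | r) * P(t | z_j) and
        return the tags above the threshold, best first, skipping tags the
        resource already carries."""
        scores = {}
        for topic, p_topic in theta_r.items():
            for tag, p_tag in phi[topic].items():
                scores[tag] = scores.get(tag, 0.0) + p_topic * p_tag
        ranked = [(tag, s) for tag, s in scores.items()
                  if s >= threshold and tag not in known_tags]
        return sorted(ranked, key=lambda pair: pair[1], reverse=True)

    # Rounded values from the running example: the resource mixes the two topics
    # roughly 66%/33%, and the per-topic tag probabilities are taken from Table 2
    # (only the top tags are listed here).
    theta_r = {"photography": 0.66, "howto": 0.33}
    phi = {
        "photography": {"photography": 0.235, "photo": 0.129, "photos": 0.110,
                        "images": 0.090, "photoshop": 0.069},
        "howto": {"howto": 0.219, "tutorial": 0.145, "reference": 0.132,
                  "tips": 0.131},
    }
    print(recommend_tags(theta_r, phi, known_tags={"photo", "photography", "howto"}))
    # With these rounded inputs "reference" and "tips" fall marginally below 0.045,
    # so the exact recommendation list depends on the unrounded model probabilities.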
3. EVALUATION

To compare the two algorithms we evaluated both on a common dataset.

3.1 Dataset

As a dataset for our evaluations we use a crawl from Delicious provided by Hotho et al. [19]. The dataset consists of ∼75,000 users, ∼500,000 tags and ∼3,200,000 resources connected via ∼17,000,000 tag assignments of users.

The overlap between tags, resources and users is very sparse. To get a dense subset of the data we computed p-cores [4] for different levels. For p = 100 we get enough bookmarks for each resource to split the data into meaningful training and test sets (90%:10%). The test sets differ in the number of bookmarks each resource has assigned, to simulate new resources that only have one to five user annotations. For this we removed all tags not belonging to the first n bookmarks, n ∈ {1 . . . 5}.

Our final dataset consists of ∼10,000 resources, ∼10,000 users, and ∼3,600 tags occurring in ∼3,200,000 tag assignments. We have five test sets containing 10% of the data. The 100-core ensures that each tag, each resource and each user appears at least 100 times in the tag assignments.

On this set, the only preprocessing of the tag assignments performed was the decapitalization of the tags. No stemming or other simplifications were applied. More sophisticated preprocessing might improve the results but would complicate the evaluation of the algorithms.

3.2 Results

In this section we report results for our LDA-based algorithm and compare them with the numbers we get using association rules for the same task on the same dataset.