association rules and LDA. In Section 3 we present our evaluation results. In Section 4 we discuss related work, and in Section 5 we summarize and outline possible future research directions.

2. TAG RECOMMENDATION

To evaluate our LDA-based approach to tag recommendation, we compare it to association rules – a state-of-the-art method for tag recommendation proposed e.g. by Heymann et al. [18]. After a formal problem description we introduce the two approaches in this section.

2.1 Problem Definition

Given a set of resources R, tags T, and users U, the ternary relation X ⊆ R × T × U represents the user-specific assignment of tags to resources. A bookmark b(r_i, u_j) for resource r_i ∈ R and a user u_j ∈ U comprises all tags assigned by u_j to r_i: b(r_i, u_j) = π_t σ_{r_i,u_j} X.^4 The goal of collective tag recommendation is to suggest new tags for a resource r_i with only a few bookmarks, based on tag assignments to other resources collected in Y = σ_{r≠r_i} π_{r,t} X ⊆ R × T.

^4 Projection π and selection σ operate on multisets without removing duplicate tuples.
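To make the relational notation above concrete, the following sketch (not from the paper; hypothetical Python with X given as a collection of (resource, tag, user) triples and illustrative names) shows how b(r_i, u_j) and Y could be computed:

    # Sketch of the problem definition. X holds (resource, tag, user)
    # triples; all names here are illustrative, not from the paper.

    def bookmark(X, r_i, u_j):
        """b(r_i, u_j): all tags assigned by user u_j to resource r_i
        (projection pi_t after selection sigma_{r_i, u_j})."""
        return {t for (r, t, u) in X if r == r_i and u == u_j}

    def background_assignments(X, r_i):
        """Y = sigma_{r != r_i} pi_{r,t} X: (resource, tag) pairs for all
        other resources; a list keeps duplicates (multiset semantics)."""
        return [(r, t) for (r, t, u) in X if r != r_i]

    # Example: two users tagging resource r1, one user tagging r2
    X = {("r1", "python", "u1"), ("r1", "code", "u1"),
         ("r1", "python", "u2"), ("r2", "music", "u3")}
    print(bookmark(X, "r1", "u1"))          # {'python', 'code'}
    print(background_assignments(X, "r2"))  # assignments to r1 only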
2.2 Association Rules

Association rules have been investigated in [18] for tag recommendation. They have the form T1 → T2, where T1 and T2 are tagsets. The three key measures for association rules are support, confidence, and interest. Support is the (relative) number of resources that contain all tags of T1 and T2, i.e., an estimate of the joint probability P(T1, T2). Confidence is an estimate of the conditional probability P(T2 | T1), i.e., how likely T2 is given T1. Interest (also called lift) is defined as the ratio between the common support of T1 and T2 and their individual supports, P(T1, T2) / (P(T1) P(T2)), and indicates whether T1 and T2 occur together more often than expected if they were statistically independent. There exist efficient algorithms to exhaustively mine association rules with some minimum support from large datasets (e.g. [1]).

The basic idea of using association rules for tag recommendation is simple: if many resources with tags T1 (high support) are typically also annotated with tags T2 (high confidence), then a new resource with tags T1 may also be meaningfully annotated with tags T2. More formally, given the tagset from (a few) bookmarks T = ⋃ b(r, u_i) for a resource r by users u_i, and an association rule T1 → T2, all tags in T2 are recommended if T1 ⊆ T.

Conf   Supp   Int    Rule
0.978  0.037  10.13  web, js → javascript
0.921  0.012   6.75  software, macintosh → mac
0.919  0.161   1.36  tools, fun, interesting → cool
0.915  0.086   4.05  web, weblogs → blogs
0.914  0.074   7.71  humour → humor
0.912  0.037  11.57  photography, photos → photo
0.910  0.136   3.81  howto, code, tutorials → tutorial
0.904  0.086   2.42  tools, utilities, utility → tool
0.904  0.111   2.55  tech, tutorial, tutorials → howto
0.902  0.049   1.71  toread, howto, guide → reference
0.902  0.111   1.56  cool, technology, computers → tech
0.902  0.222   2.80  cool, design, blogs → blog
0.900  0.172   1.21  cool, internet, free → web
0.900  0.123   1.38  webdesign, tips → web, design
0.900  0.062   5.40  web, development, web-design → html
0.900  0.124   3.07  design, css → webdev
0.900  0.074   2.13  web, osx → software
0.900  0.099   2.18  design, tutorials, css → development
0.900  0.025   6.75  software, mac, apple → osx
0.900  0.124   1.25  tutorial, resources → reference

Table 1: Selection of tag association rules with confidence ≥ 0.9

Table 1 gives a selection of association rules with high confidence mined from our dataset. As also observed in [18], these rules cover all sorts of terminological relationships, including spelling variants and synonyms (humour → humor; tools, utilities, utility → tool), loose notions of hypernyms (tutorial, resources → reference), and closely related terms (software, mac, apple → osx). While the mined association rules are very intuitive, they typically recommend rather generic, frequent tags, such as “software” or “web”. This is a direct consequence of requiring some minimum support for T1 and T2. Such generic tags are not necessarily useful for finding specific resources. Indeed, for personalized tag recommendation Xu et al. [31] explicitly penalize tag co-occurrences when they have been annotated by different users.
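As a concrete illustration of the measures and the recommendation step above, here is a minimal sketch (hypothetical Python, not from the paper; resources are represented by their tagsets, and a real system would use an efficient miner such as Apriori [1] instead of this brute-force evaluation):

    # Minimal sketch of support, confidence, interest, and the
    # recommendation step: T1 -> T2 fires if T1 is contained in the
    # observed tags. Representation and names are illustrative.

    def support(resources, tags):
        """Relative number of resources containing all given tags."""
        return sum(1 for r in resources if tags <= r) / len(resources)

    def confidence(resources, t1, t2):
        """Estimate of P(T2 | T1); assumes support(resources, t1) > 0."""
        return support(resources, t1 | t2) / support(resources, t1)

    def interest(resources, t1, t2):
        """Lift: P(T1, T2) / (P(T1) * P(T2))."""
        return support(resources, t1 | t2) / (
            support(resources, t1) * support(resources, t2))

    def recommend(rules, observed):
        """Union of all T2 whose rule T1 -> T2 fires on the observed tags."""
        fired = set()
        for t1, t2 in rules:
            if t1 <= observed:
                fired |= t2
        return fired - observed

    # Example with rules in the spirit of Table 1
    rules = [(frozenset({"web", "js"}), frozenset({"javascript"})),
             (frozenset({"humour"}), frozenset({"humor"}))]
    print(recommend(rules, {"web", "js", "ajax"}))  # {'javascript'}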
2.3 Latent Dirichlet Allocation

The general idea of Latent Dirichlet Allocation (LDA) is based on the hypothesis that a person writing a document has certain topics in mind. To write about a topic then means to pick a word with a certain probability from the pool of words of that topic. A whole document can then be represented as a mixture of different topics. When the author of a document is one person, these topics reflect the person’s view of the document and her particular vocabulary. In the context of tagging systems, where multiple users annotate resources, the resulting topics reflect a collaborative shared view of the document, and the tags of the topics reflect a common vocabulary to describe it.

More generally, LDA helps to explain the similarity of data by grouping features of this data into unobserved sets. A mixture of these sets then constitutes the observable data. The method was first introduced by Blei et al. [10] and has been applied to various tasks including topic identification [16], entity resolution [7], and Web spam classification [8].

The modeling process of LDA can be described as finding a mixture of topics for each resource, i.e., P(z | d), with each topic described by terms following another probability distribution, i.e., P(t | z). This can be formalized as

    P(t_i | d) = Σ_{j=1}^{Z} P(t_i | z_i = j) P(z_i = j | d),    (1)

where P(t_i | d) is the probability of the i-th term for a given document d and z_i is the latent topic. P(t_i | z_i = j) is the probability of term t_i within topic j, and P(z_i = j | d) is the probability of picking a term from topic j in the document. The number of latent topics Z has to be defined in advance and makes it possible to adjust the degree of specialization of the latent topics. LDA estimates the topic–term distribution P(t | z) and the document–topic distribution P(z | d) from an unlabeled corpus of documents, using Dirichlet priors for the distributions and a fixed number of topics. Gibbs sampling [16] is one common approach to estimating these distributions.
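To illustrate Equation (1), the following sketch (hypothetical Python/NumPy with toy values, not from the paper; in practice the φ and θ matrices would come from an LDA estimator such as a Gibbs sampler) mixes the topic–term distributions according to the document–topic distribution:

    import numpy as np

    # Equation (1): P(t_i | d) as a Z-component mixture of topic-term
    # distributions, weighted by the document-topic distribution.
    # phi and theta_d are toy values standing in for estimated parameters.

    phi = np.array([[0.7, 0.1, 0.1, 0.1],   # P(t | z = 0)
                    [0.1, 0.1, 0.4, 0.4]])  # P(t | z = 1), Z = 2 topics
    theta_d = np.array([0.25, 0.75])        # P(z | d) for one document d

    # P(t_i | d) = sum_j P(t_i | z_i = j) * P(z_i = j | d)
    p_t_given_d = theta_d @ phi
    print(p_t_given_d)        # distribution over the 4 vocabulary terms
    print(p_t_given_d.sum())  # 1.0: a proper probability distribution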