Figure 1: Power Law: The number of total tags versus the number of webpages having that many total tags.

content of the document. The basic idea is that the better the tag/term coverage of a document, the more we can trust its tag/term set. Exploiting the notion of tag/term coverage, we propose a similarity metric between two documents based on the tags and terms of both documents. Using the vector space model, we represent each document with two vectors: a tag vector and a term vector.
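As a minimal sketch of this two-vector representation, the snippet below builds a sparse tag vector and a sparse term vector for two hypothetical documents and compares them with cosine similarity, the standard comparison in the vector space model. The raw-count weighting, the helper names `to_vector` and `cosine`, and the example tags and terms are assumptions made for illustration; the paper's actual weighting scheme is not given in this passage.

```python
import math
from collections import Counter


def to_vector(items):
    """Build a sparse count vector (item -> frequency) from a list of tags or terms.

    Raw counts are an assumption; the source does not state the weighting used.
    """
    return Counter(items)


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b[key] for key, weight in a.items() if key in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents: each is represented by a tag vector and a term vector.
doc_a = {
    "tags": to_vector(["python", "tutorial", "python"]),
    "terms": to_vector(["python", "code", "example", "code"]),
}
doc_b = {
    "tags": to_vector(["python", "reference"]),
    "terms": to_vector(["python", "library", "code"]),
}

tag_sim = cosine(doc_a["tags"], doc_b["tags"])
term_sim = cosine(doc_a["terms"], doc_b["terms"])
print(f"tag similarity:  {tag_sim:.3f}")
print(f"term similarity: {term_sim:.3f}")
```

Because tags and terms are both plain words here, the same `cosine` helper could also compare a tag vector with a term vector, provided the two share a vocabulary.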
When calculating the similarity score, in addition to the intuitive method of computing the cosine similarity between the two tag vectors and between the two content (term) vectors of the pair of documents, we also take into account the cosine similarity between the tag vector of one document and the term vector of the other document. The similarity metric of the two documents thus consists of a weighted sum of four components, with the weight of each component depending on the tag/term coverage of each document. Finally, using the similarity metric, we allow each document to propagate, or share, the tags it owns to other similar documents. After the propagation step, the tags that have a higher weight in a document are viewed as trustworthy tags and thus may be good candidates for tag recommendation.

We used tag data crawled from one of the largest social bookmarking sites, Delicious. Since users on Delicious can add bookmarks along with some descriptive tags to their own collections, this site contains an enormous number of bookmarks, each of which carries a different number of tags. We also crawled the webpage corresponding to each bookmark, which serves as the page content. We analyzed the tag information of the dataset and tested our method using cross validation and a user study. The results show that our proposed method is effective in populating untagged webpages with the correct tags.

2 Related Work

In the era of Web 2.0, websites allow users to contribute their own content and annotate it with a freely chosen set of keywords under the tagging system built by each website. Mika [2005] represented semantic social networks in the form of a tripartite model consisting of actors (users), concepts (tags), and instances (resources). Marlow et al. [2006] provided a simple taxonomy of tagging systems to analyze and distinguish the tagging systems of different kinds of websites by distinct facets.
Golder and Huberman [2006] analyzed the dynamics of collaborative tagging systems, including user activity, tag frequencies, and bursts of popularity in resources. From the aspect of information retrieval [Salton and McGill, 1986], tags bring new information to items beyond the original contents [Bischoff et al., 2008], and therefore tags can enhance the capabilities of existing search engines to find relevant documents [Heymann et al., 2008a]. Bao et al. [2007] proposed iterative algorithms integrating tags into web search for better ranking results. Furthermore, tag types reflect what distinctions are important to taggers. Bischoff et al. [2008] refined the scheme presented by Golder and Huberman [2006] by classifying tags into 8 categories and exhibiting tag type distributions across different tagging systems and web anchor texts (or link labels). By comparing categories of tags with query logs and a user study, they showed that most of the tags can be used for search, and that tagging behavior represents the same attributes as searching behavior in most cases.

Although tags are helpful to improve search results and divide documents, people on average annotate resources with only a small number of tags [Bischoff et al., 2008]. Tag recommendation, one of the emerging research topics in tagging, can reduce people's tagging effort and encourage them to use more tags, alleviating this problem. Xu et al. [2006] proposed criteria for better tag recommendations, including content-based methods and temporal issues, but reported preliminary results only. Jäschke et al. [2007] introduced the FolkRank algorithm, which computes a topic-specific ranking of the elements in a folksonomy, and outperformed collaborative filtering algorithms [Adomavicius and Tuzhilin, 2005].

In terms of content-based tag recommendation, Heymann et al. [2008b] formulated the problem as a supervised learning problem.
Using page text, anchor text, surrounding hosts, and available tag information as training data, Heymann et al. trained a classifier for each tag they wanted to predict. Even though this method can achieve high precision, the time required to train the classifiers becomes substantial as the number of distinct tags increases. In the same paper, Heymann et al. achieved good results in expanding the tag set of documents with little tag information using association rules.

3 Methodology

3.1 Notations

Before diving into the notations, we first define the terminology used in this paper. A URL is described by the tags annotated by the users and the terms, which are the words on the webpage corresponding to the URL. The words "webpages" and "documents" that appeared in the previous paragraphs are from now on all referred to as URLs. The three words, URL, tags, and terms, will be used throughout the text.

Let U be the set of all URLs. Let T be the set of all tags. For a URL u ∈ U, let Tag(u) be the set of tags that annotate