Hierarchical Bayesian Models for Collaborative Tagging Systems

Markus Bundschus∗, Shipeng Yu†, Volker Tresp‡, Achim Rettinger§, Mathaeus Dejori¶ and Hans-Peter Kriegel∗

∗Institute for Computer Science, Ludwig-Maximilians-Universität München, Oettingenstr. 67, 80538 München, Germany
Email: {bundschu, kriegel}@dbs.ifi.lmu.de
†CAD & Knowledge Solutions, Siemens Medical Solutions, 51 Valley Stream Parkway, Malvern, PA 19355, USA
Email: shipeng.yu@siemens.com
‡Corporate Technology, Siemens AG, Otto-Hahn-Ring 6, 81739 München, Germany
Email: volker.tresp@siemens.com
§Institute for Computer Science (i7), Technische Universität München, Boltzmannstr. 3, 85748 Garching, Germany
Email: achim.rettinger@cs.tum.edu
¶Integrated Data Systems Dep., Siemens Corporate Research, 755 College Road East, Princeton, NJ 08540, USA
Email: mathaeus.dejori@siemens.com

Abstract: Collaborative tagging systems with user generated content have become a fundamental element of websites such as Delicious, Flickr or CiteULike. By sharing common knowledge, massively linked semantic data sets are generated that provide new challenges for data mining. In this paper, we reduce the data complexity in these systems by finding meaningful topics that serve to group similar users and serve to recommend tags or resources to users. We propose a well-founded probabilistic approach that can model every aspect of a collaborative tagging system. By integrating both user information and tag information into the well-known Latent Dirichlet Allocation framework, the developed models can be used to solve a number of important information extraction and retrieval tasks.

Keywords: collaborative tagging; LDA; user modeling

I. INTRODUCTION

Collaborative knowledge platforms have recently emerged as popular frameworks for sharing information between users with common interests. Some popular examples of such systems are Delicious, CiteULike¹ or Flickr. A key feature of these systems is that large numbers of users upload certain resources of interest and label them with personalized tags. The resources are in most cases some type of high-dimensional data such as text documents or images.
Without further processing, those resources do not contain any semantic information that is usable for automated analysis. However, meaningful annotations that add semantics to the raw resources are also given in the form of user-specified tags. In contrast to taxonomies, where labels represent ordered predefined categories, no restrictions apply to tags, which are flat and chosen arbitrarily. These free-form strings actually serve to organize the resources of one single specific user. Tags might be polysemous, and different users use slightly different variations of tags to express the same semantics (e.g., consider the tags information_retrieval, information-retrieval and IR). Also, the meaning of a particular tag, such as to read, might be subjective to individuals and does not necessarily express the same shared semantics for the whole community. These aspects make the extraction of meaningful information from collaborative systems both challenging and rewarding.

¹ http://www.citeulike.org/

In this paper, we present a unified probabilistic framework for collaborative tagging systems, which has a sound theoretical foundation in Hierarchical Bayesian Statistics. By extending one well-established model for document collections, the Latent Dirichlet Allocation (LDA) model, we are able to exploit the complete spectrum of information available in collaborative tagging systems. Here, all involved entities, i.e., the users, their resources and the assigned resource tags, are modeled by a latent multinomial topic distribution. With this strategy, we map each entity into a common lower-dimensional latent topic space and thus are able to extract structure and drastically reduce the great variety of ambiguous information inherent in collaborative tagging systems. The models proposed here can be applied naturally to various tasks. We present results for the extraction of statistical relationships between users, resources and tags.
As a quantitative evaluation, we present results on assessing user similarities, a perplexity analysis on tag annotation quality, and results on personalized tag recommendation. In the latter case, we outperform several standard tag recommendation algorithms. We train our models on a fraction of the CiteULike system. CiteULike is a system that allows researchers to manage their scientific reference articles. It aims to help scientists cope with the increasingly interdependent topical complexity of today's research. Although this work focuses on collaborative tagging systems based on text, the described models are general and could also handle other types of resources, such as pictures.

The outline of the paper is as follows: In Section II we briefly summarize existing related work. Section III