Folke Eisterlehner Andreas Hotho Robert Jaschke(Eds ECML PKDD Discovery Challenge 2009(DC09) International Workshop at the european Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases Bled, Slovenia, September 7th, 2009 ECML PKDD 2009 nd Prine
Folke Eisterlehner Andreas Hotho Robert J¨aschke (Eds.) ECML PKDD Discovery Challenge 2009 (DC09) International Workshop at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases in Bled, Slovenia, September 7th, 2009
Table of contents Table of Contents Relational Classification for Personalized Tag Recommendation Leandro Balby Marinho, Christine Preisach, and Lars Schmidt Thieme Measuring Vertex Centrality in Co-occurrence Graphs for Online Social Tag Recommendation lvdn Cantador, David vallet, and Joemon M. Jose ocial Tag Prediction Base on Supervised Ranking Model Hao Cao, Maogiang Xie, Lian Xue, Chunhua Liu, Fei Teng, and Yalou huang Tag Recommendations Based on Tracking Social Bookmarking Systems A Fast effective Multi-Channeled Tag Recommender Jonathan Gemmell, Maryam Ramezani, Thomas Schimoler, Laura Christiansen. and Bamshad Mobasher A modified and fast Perceptron learning rule and its use for Tag Recommendations in Social Bookmarking Systems Anestis Ghanogiannis and Theodore Kalamboukis Discriminative Clustering for Content-Based Tag Recommendation in Social Bookmarking System Malik Tahir Hassan. Asim Karim. Suresh Manandhar, and James Time based Tag Recommendation using Direct and Extended Users Sets. 99 reza lofciu and Gianluca Demartini A Weighting Scheme for Tag Recommendation in Social Bookmarking 109 Sanghun Ju and Kyu-Baek Hwang ARKTis-A Fast Tag Recommender System Based On Heuristics 119 mas Kleinbauer and sebastian germein Tag Recommendation using Probabilistic Topic Models Ralf Krestel and Peter Fankhauser
Table of Contents Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 Relational Classification for Personalized Tag Recommendation . . . . . . . . . 7 Leandro Balby Marinho, Christine Preisach, and Lars SchmidtThieme Measuring Vertex Centrality in Co-occurrence Graphs for Online Social Tag Recommendation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 Iv´an Cantador, David Vallet, and Joemon M. Jose Social Tag Prediction Base on Supervised Ranking Model . . . . . . . . . . . . . . 35 Hao Cao, Maoqiang Xie, Lian Xue, Chunhua Liu, Fei Teng, and Yalou Huang Tag Recommendations Based on Tracking Social Bookmarking Systems . . 49 Szymon Chojnacki A Fast Effective Multi-Channeled Tag Recommender . . . . . . . . . . . . . . . . . . 59 Jonathan Gemmell, Maryam Ramezani, Thomas Schimoler, Laura Christiansen, and Bamshad Mobasher A modified and fast Perceptron learning rule and its use for Tag Recommendations in Social Bookmarking Systems . . . . . . . . . . . . . . . . . . . . 71 Anestis Gkanogiannis and Theodore Kalamboukis Discriminative Clustering for Content-Based Tag Recommendation in Social Bookmarking Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 Malik Tahir Hassan, Asim Karim, Suresh Manandhar, and James Cussens Time based Tag Recommendation using Direct and Extended Users Sets . 99 Tereza Iofciu and Gianluca Demartini A Weighting Scheme for Tag Recommendation in Social Bookmarking Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 Sanghun Ju and Kyu-Baek Hwang ARKTiS - A Fast Tag Recommender System Based On Heuristics. . . . . . . 119 Thomas Kleinbauer and Sebastian Germesin Tag Recommendation using Probabilistic Topic Models . . . . . . . . . . . . . . . . 131 Ralf Krestel and Peter Fankhauser
A Probabilistic Ranking Approach for Tag Recommendation Zhen liao, Maogiang Xie, Hao Cao, and Yalou huang Tag Sources for Recommendation in Collaborative Tagging Systems 157 Marek Lipczak, Yeming Hu, Yael Kollet, and evangelos Milios Collaborative Tag Recommendation System based on Logistic Regression 173 E. Montanes, J.R. Quevedo, I. Diaz, and. ranilla Content- and Graph-based Tag Recommendation: Two Variations Johannes Mrosek, Stefan Bussmann, Hendrik Albers, Kai Posdziech, Benedikt Hengefeld, Nils Opperman, Stefan Robert, and Gerrit S A Two-Level Learning Hierarchy of Concept Based Keyword Extraction for Tag recommendations Hendri Muri and Klaus Bern TaR: a Social Tag Recommender System Cataldo Musto, Fedelucio Narducci, marco de Gemmis, Pasquale Combining Tag Recommendations Based on User Histo lari nieminen Factor Models for Tag Recommendation in BibSonom Steffen Rendle and Lars Schmidt- Thieme Content-based and Graph-based Tag Suggestion 243 Xiance Si, Zhiyuan Liu, Peng Li, Qiria Jiang, and Maosong Sun RSDC09: Tag Recommendation Using Keywords and Association Rules. 261 Wang, Liangjie Hong and Brian D. Davis Understanding the user: Personomy translation for tag recommendation. 275 Robert Wetzker. Alan Said. and Carsten Zimmermann A Tag Recommendation System based on contents Ning Zhang, Yuan Zhang, and ie Tang A Collaborative Filtering Tag Recommendation System based on Graph. 297 Yuan Zhang, Ning Zhang, and Jie Tang
A Probabilistic Ranking Approach for Tag Recommendation . . . . . . . . . . . 143 Zhen Liao, Maoqiang Xie, Hao Cao, and Yalou Huang Tag Sources for Recommendation in Collaborative Tagging Systems . . . . . 157 Marek Lipczak, Yeming Hu, Yael Kollet, and Evangelos Milios Collaborative Tag Recommendation System based on Logistic Regression 173 E. Monta˜n´es, J. R. Quevedo, I. D´ıaz, and J. Ranilla Content- and Graph-based Tag Recommendation: Two Variations . . . . . . . 189 Johannes Mrosek, Stefan Bussmann, Hendrik Albers, Kai Posdziech, Benedikt Hengefeld, Nils Opperman, Stefan Robert, and Gerrit Spira A Two-Level Learning Hierarchy of Concept Based Keyword Extraction for Tag Recommendations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201 Hendri Murfi and Klaus Obermayer STaR: a Social Tag Recommender System . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215 Cataldo Musto, Fedelucio Narducci, Marco de Gemmis, Pasquale Lops, and Giovanni Semeraro Combining Tag Recommendations Based on User History . . . . . . . . . . . . . . 229 Ilari T. Nieminen Factor Models for Tag Recommendation in BibSonomy . . . . . . . . . . . . . . . . 235 Steffen Rendle and Lars Schmidt-Thieme Content-based and Graph-based Tag Suggestion . . . . . . . . . . . . . . . . . . . . . . 243 Xiance Si, Zhiyuan Liu, Peng Li, Qixia Jiang, and Maosong Sun RSDC’09: Tag Recommendation Using Keywords and Association Rules . 261 Jian Wang, Liangjie Hong and Brian D. Davison Understanding the user: Personomy translation for tag recommendation . 275 Robert Wetzker, Alan Said, and Carsten Zimmermann A Tag Recommendation System based on contents . . . . . . . . . . . . . . . . . . . . 285 Ning Zhang, Yuan Zhang, and Jie Tang A Collaborative Filtering Tag Recommendation System based on Graph . 297 Yuan Zhang, Ning Zhang, and Jie Tang
Preface Since 1999 the ECML PKDD embraces the tradition of organizing a Discovery Challenge, allowing researchers to develop and test algorithms for novel and eal world datasets. This year's Discovery Challenge presents a dataset from he field of social bookmarking to deal with the recommendation of tags. The results submitted by the challenge's participants are presented at an ECML PKDD workshop on September 7th, 2009, in Bled, Slovenia. The provided dataset has been created using data of the social bookmark and publication sharing system BibSonomy, 2 operated by the organizers of the allenge. The training data was released on March 25th 2009, the test data or July 6th. The participants had time until July &th to submit their results. This gave researchers 14 weeks time to tune their algorithms on a snapshot of a real world folksonomy dataset and 48 hours to compute results on the test data To support the user during the tagging process and to facilitate the tagging, BibSonomy includes a tag recommender. When a user finds an interesting web page(or publication) and posts it to BibSonomy, the system offers up to five recommended tags on the posting page. The goal of the challenge is to learn a model which effectively predicts the keywords a user has in mind when describing a web page (or publication). We divided the problem into three tasks, each of which focusing on a certain aspect. All three tasks get the same dataset fo training. It is a snapshot of Bibsonomy until December 31st 2008. The dataset leaned and consists of two parts, the core part and the complete snapshot The test dataset is different for each task Task 1: Content-Based Tag Recommendations. The test data for this task con- tains posts, whose user, resource or tags are not contained in the post-core at level 2 of the training data. Thus, methods which can't produce tag recommen- dations for new resources or are unable to suggest new tags very probably wor produce good results here Task 2: Graph-Based Recommendations. This task is especially intended fc methods relying on the graph structure of the training data only. The user. resource, and tags of each post in the test data are all contained in the training datas post-core at level 2 Task 3: Online Tag Recommendations. This is a bonus task which will take place after Tasks 1 and 2. The participants shall implement a recommendation servic which can be called via Http by Bibsonomy'S recommender infrastructure when a user posts a bookmark or publication. All participating recommenders are called on each posting process, one of them is chosen to actually deliver the results to the user. We can then measure the performance of the recommenders in an online setting, where timeouts are important and where we can measure which tags the user has clicked on http://www.kde.cs.uni-kassel.de/ws/dc09 http://www.bibsonomy.org
Preface Since 1999 the ECML PKDD embraces the tradition of organizing a Discovery Challenge, allowing researchers to develop and test algorithms for novel and real world datasets. This year’s Discovery Challenge1 presents a dataset from the field of social bookmarking to deal with the recommendation of tags. The results submitted by the challenge’s participants are presented at an ECML PKDD workshop on September 7th, 2009, in Bled, Slovenia. The provided dataset has been created using data of the social bookmark and publication sharing system BibSonomy,2 operated by the organizers of the challenge. The training data was released on March 25th 2009, the test data on July 6th. The participants had time until July 8th to submit their results. This gave researchers 14 weeks time to tune their algorithms on a snapshot of a real world folksonomy dataset and 48 hours to compute results on the test data. To support the user during the tagging process and to facilitate the tagging, BibSonomy includes a tag recommender. When a user finds an interesting web page (or publication) and posts it to BibSonomy, the system offers up to five recommended tags on the posting page. The goal of the challenge is to learn a model which effectively predicts the keywords a user has in mind when describing a web page (or publication). We divided the problem into three tasks, each of which focusing on a certain aspect. All three tasks get the same dataset for training. It is a snapshot of BibSonomy until December 31st 2008. The dataset is cleaned and consists of two parts, the core part and the complete snapshot. The test dataset is different for each task. Task 1: Content-Based Tag Recommendations. The test data for this task contains posts, whose user, resource or tags are not contained in the post-core at level 2 of the training data. Thus, methods which can’t produce tag recommendations for new resources or are unable to suggest new tags very probably won’t produce good results here. Task 2: Graph-Based Recommendations. This task is especially intended for methods relying on the graph structure of the training data only. The user, resource, and tags of each post in the test data are all contained in the training data’s post-core at level 2. Task 3: Online Tag Recommendations. This is a bonus task which will take place after Tasks 1 and 2. The participants shall implement a recommendation service which can be called via HTTP by BibSonomy’s recommender infrastructure when a user posts a bookmark or publication. All participating recommenders are called on each posting process, one of them is chosen to actually deliver the results to the user. We can then measure the performance of the recommenders in an online setting, where timeouts are important and where we can measure which tags the user has clicked on. 1 http://www.kde.cs.uni-kassel.de/ws/dc09/ 2 http://www.bibsonomy.org 5
Results. More than 150 participants registered for the mailing list which enabled them to download the dataset. At the end we received 42 submissions -21 fo each of the Tasks 1 2. Additionally, 24 participants submitted a paper that can be found in the proceedings at hand We used the F1-Measure common in Information retrieval to evaluate the submitted recommendations. Therefore, we first computed for each post in the test data precision and recall by comparing the first five recommended tags against the tags the user has originally assigned to this post. Then we averaged precision and recall over all posts in the test data and used the resulting precision and recall to compute the F1-Measure as flm= =pecision recan The winning team of Task l has an flm of 0.18740, the second and third follow with 0. 18001 and 0.17975. For Task 2. the winner achieved an flm of 0.35594 33185and0. conference and later on the website of the challenge Lipczak et al. from Dalhousie University, Halifax, Canada(cf page 157)are the winners of Task 1. with a method based on the combination of tags from the resource's title, tags assigned to the resource by other users and tags in the user's profile they reached an flm of 0. 18740 in Task I and additionally achieved the third place in Task 2 with an flm of 0. 32461. The system is composed of six recommenders and the basic idea is to augment the tags from the title by related and then rescore and merge the occurrence graphs and from the user's profile extracted from two tag The winners of Task 2, Rendle and Schmidt-Thieme from University of Hildesheim, Germany(cf page 235)achieved an flm of 0. 35594 with a statis- tical method based on factor models. Therefore, they factorize the folksonomy structure to find latent interactions between users, resources and tags. Using a variant of the stochastic gradient descent algorithm the authors optimize an adaptation of the Bayesian Personal Ranking criterion. Finally, they estimate how many tags to recommend to further improve precision The second of Task 1(Mrosek et al., page 189)harvests tags from sources ke Delicious, Google Scholar, and CiteULike. They also employ the full-text of web pages and PDFs. The third (Ju and Hwang, page 109) merges tags which have been earlier assigned to the resource or used by the user as well as resource descriptions by a weighting scheme. Finally, the second of Task 2 (Balby Marinho et al., page 7)uses relational classification methods in a semi-supervised learning scenario to recommend tags We thank all participants of the challenge for their contributions and the organizers of the ECML PKDD 2009 conference for their support. Furthermore, we want to thank our sponsors Nokia and Tagora for supporting the challenge by awarding prizes for the winners of each task. We are looking forward to a very exciting and interesting workshop Folke Eisterlehner. Andreas hotho. Robert jaschke http://www.nokia.com/ http://www.tagora-project.eu
Results. More than 150 participants registered for the mailing list which enabled them to download the dataset. At the end, we received 42 submissions – 21 for each of the Tasks 1 & 2. Additionally, 24 participants submitted a paper that can be found in the proceedings at hand. We used the F1-Measure common in Information Retrieval to evaluate the submitted recommendations. Therefore, we first computed for each post in the test data precision and recall by comparing the first five recommended tags against the tags the user has originally assigned to this post. Then we averaged precision and recall over all posts in the test data and used the resulting precision and recall to compute the F1-Measure as f1m = 2·precision · recall precision + recall . The winning team of Task 1 has an f1m of 0.18740, the second and third follow with 0.18001 and 0.17975. For Task 2, the winner achieved an f1m of 0.35594, followed by 0.33185 and 0.32461. The winner of Task 3 will be announced at the conference and later on the website of the challenge. Lipczak et al. from Dalhousie University, Halifax, Canada (cf. page 157) are the winners of Task 1. With a method based on the combination of tags from the resource’s title, tags assigned to the resource by other users and tags in the user’s profile they reached an f1m of 0.18740 in Task 1 and additionally achieved the third place in Task 2 with an f1m of 0.32461. The system is composed of six recommenders and the basic idea is to augment the tags from the title by related tags extracted from two tag-tag–co-occurrence graphs and from the user’s profile and then rescore and merge them. The winners of Task 2, Rendle and Schmidt-Thieme from University of Hildesheim, Germany (cf. page 235) achieved an f1m of 0.35594 with a statistical method based on factor models. Therefore, they factorize the folksonomy structure to find latent interactions between users, resources and tags. Using a variant of the stochastic gradient descent algorithm the authors optimize an adaptation of the Bayesian Personal Ranking criterion. Finally, they estimate how many tags to recommend to further improve precision. The second of Task 1 (Mrosek et al., page 189) harvests tags from sources like Delicious, Google Scholar, and CiteULike. They also employ the full-text of web pages and PDFs. The third (Ju and Hwang, page 109) merges tags which have been earlier assigned to the resource or used by the user as well as resource descriptions by a weighting scheme. Finally, the second of Task 2 (Balby Marinho et al., page 7) uses relational classification methods in a semi-supervised learning scenario to recommend tags. We thank all participants of the challenge for their contributions and the organizers of the ECML PKDD 2009 conference for their support. Furthermore, we want to thank our sponsors Nokia3 and Tagora4 for supporting the challenge by awarding prizes for the winners of each task. We are looking forward to a very exciting and interesting workshop. Kassel, August 2009 Folke Eisterlehner, Andreas Hotho, Robert J¨aschke 3 http://www.nokia.com/ 4 http://www.tagora-project.eu/ 6
Relational Classification for Personalized Ta Recommendation eandro Balby Marinho, Christine Preisach, and Lars Schmidt-Thieme Information Systems and Machine Learning Lab (IsmLL) Samelsonplatz 1. University of Hildesheim, D-31141 Hildesheim, Germany [marinho, preisach, schmidt-thieme Gismll uni-hildesheimde ht Abstract. Folksonomy data is relational by nature, and therefore methods that directly exploit these relations are prominent for the tag recommendation prob lem. Relational methods have been successfully applied to areas in which en- tities are linked in an explicit manner, like hypertext documents and scientific publications. For approaching the graph-based tag recommendation task of the ECML PKDD Discovery Challenge 2009, we propose to turn the folksonomy raph into a homogeneous post graph and use relational classification techniques for predicting tags. Our approach features adherence to multiple kinds of rela- 1 Introduction One might want tag recommendations for several reasons, as for example: simplifying the tagging process for the user, exposing different facets of a resource and helping the tag vocabulary to converge. Given that users are free to tag, i.e., the same resource can be tagged differently by different people, it is important to personalize the recom- mended tags for an individual user. Tagging data forms a ternary relation between users, resources and tags, differently from typical recommender systems in which the relation is usually binary between users and resources. The best methods presented so far explore this ternary relation to com- pute tag predictions, either by means of tensor factorization [8]or PageRank 3], on the hypergraph induced by the ternary relational data. We, on the other hand, propose to explore the underlying relational graph between posts by means of relational classifica In this paper we describe our approaches for addressing the graph-based tag rec- ommendation task of the ECML PKDD Discovery Challenge 2009. We present two basic algorithms: PWA*(probabilistic weighted average ), an iterative relational clas- sification algorithm enhanced with relaxation labelling, and WA *(weighted average). an ite ods feature: adherence to multiple kinds of relations, training free, fast predictions, and semi-supervised classification Semi-supervised classification is particularly appealing because it allows us to evtl benefit from the information contained in the test dataset Furthermore, we propose to combine these methods through unweighted voting
Relational Classification for Personalized Tag Recommendation Leandro Balby Marinho, Christine Preisach, and Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Samelsonplatz 1, University of Hildesheim, D-31141 Hildesheim, Germany {marinho,preisach,schmidt-thieme}@ismll.uni-hildesheim.de http://www.ismll.uni-hildesheim.com/ Abstract. Folksonomy data is relational by nature, and therefore methods that directly exploit these relations are prominent for the tag recommendation problem. Relational methods have been successfully applied to areas in which entities are linked in an explicit manner, like hypertext documents and scientific publications. For approaching the graph-based tag recommendation task of the ECML PKDD Discovery Challenge 2009, we propose to turn the folksonomy graph into a homogeneous post graph and use relational classification techniques for predicting tags. Our approach features adherence to multiple kinds of relations, semi-supervised learning and fast predictions. 1 Introduction One might want tag recommendations for several reasons, as for example: simplifying the tagging process for the user, exposing different facets of a resource and helping the tag vocabulary to converge. Given that users are free to tag, i.e., the same resource can be tagged differently by different people, it is important to personalize the recommended tags for an individual user. Tagging data forms a ternary relation between users, resources and tags, differently from typical recommender systems in which the relation is usually binary between users and resources. The best methods presented so far explore this ternary relation to compute tag predictions, either by means of tensor factorization [8] or PageRank [3], on the hypergraph induced by the ternary relational data. We, on the other hand, propose to explore the underlying relational graph between posts by means of relational classification. In this paper we describe our approaches for addressing the graph-based tag recommendation task of the ECML PKDD Discovery Challenge 2009. We present two basic algorithms: PWA* (probabilistic weighted average), an iterative relational classification algorithm enhanced with relaxation labelling, and WA* (weighted average), an iterative relational classification method without relaxation labelling. These methods feature: adherence to multiple kinds of relations, training free, fast predictions, and semi-supervised classification. Semi-supervised classification is particularly appealing because it allows us to evtl. benefit from the information contained in the test dataset. Furthermore, we propose to combine these methods through unweighted voting. 7
4 The paper Is organized as follows. Section 2 presents the notation used throughout In Section 3 we show how we turned the folksonomy into a post relation graph. Section 4 introduces the individual classifiers and the ensemble technique we used In Section 5 we elaborate on the evaluation and experiments conducted for tuning the parameters of our models, and report the results obtained on the test dataset released for the challenge. The paper closes with conclusions and directions for future work 2 Notation Foksonomy data usually comprises a set of users U, a set of resources R, a set of tags T, and a set Y of ternary relations between them, i.e,YCUXRXT Let {(u,r)t∈T:(u,r,t) be the set of all unique user/resources combinations in the data, where each pair is called a post. For convenience, let T(r=(u, r)):=teTI(u,r,t)eY be the set of all tags assigned to a given post r E x. We consider train/test splits based on posts, i.e Xtrain, Xtest C X disjoint and covering all of X XtrainUXtest=X For training, the learner has access to the set Xtrain of training posts and their true gs T xmi. The tag recommendation task is then to predict, for a given E Xtest, a set T(a)sTof tags that are most likely to be used by the resp. user for the resp. resource 3 Relation Engineering We propose to represent folksonomy data as a homogeneous, undirected relational graph over the post set, i.e G: =(X, E)in which edges are annotated with a weight w:XxX-R denoting the strength of the relation. Besides making the input data nore compact-we have only a binary relation RcxxX between objects of the same ype-this representation will allow us to trivially cast the problem of personalized tag recommendations as a relational classification problem. Relational classifiers usually consider, additionally to the typical attribute-value data of objects, relational information. A scientific paper, for example, can be connected to another paper that has been written by the same author or because they share common citations. It has been shown in many classification problems that relational classifiers perform better than purely attribute-based classifiers [1, 4, 6] In our case, we assume that posts are related to each other if they share the same user: Ruser: =I(a, rE Xx Xluser(r)=user(r), the same resource: Rres Ia,TE Xx Xres(a)=res(r), or either share the same user or resource Ruser: Ruser U Rres(see Figure 1). For convenience, let user(=)and res(a) denote the user and resource of post r respectively. Thus, each post is connected to each other either in terms of other users that tagged the same resource, or the resources tagged by the same user. Weights are discussed in Section 4
The paper is organized as follows. Section 2 presents the notation used throughout the paper. In Section 3 we show how we turned the folksonomy into a post relational graph. Section 4 introduces the individual classifiers and the ensemble technique we used. In Section 5 we elaborate on the evaluation and experiments conducted for tuning the parameters of our models, and report the results obtained on the test dataset released for the challenge. The paper closes with conclusions and directions for future work. 2 Notation Foksonomy data usually comprises a set of users U, a set of resources R, a set of tags T, and a set Y of ternary relations between them, i.e., Y ⊆ U × R × T. Let X := {(u, r)| ∃t ∈ T : (u, r, t) ∈ Y } be the set of all unique user/resources combinations in the data, where each pair is called a post. For convenience, let T(x = (u, r)) := {t ∈ T | (u, r, t) ∈ Y } be the set of all tags assigned to a given post x ∈ X. We consider train/test splits based on posts, i.e., Xtrain, Xtest ⊂ X disjoint and covering all of X: Xtrain∪˙ Xtest = X For training, the learner has access to the set Xtrain of training posts and their true tags T|Xtrain . The tag recommendation task is then to predict, for a given x ∈ Xtest, a set Tˆ(x) ⊆ T of tags that are most likely to be used by the resp. user for the resp. resource. 3 Relation Engineering We propose to represent folksonomy data as a homogeneous, undirected relational graph over the post set, i.e., G := (X, E) in which edges are annotated with a weight w : X × X → R denoting the strength of the relation. Besides making the input data more compact – we have only a binary relation R ⊆ X×X between objects of the same type – this representation will allow us to trivially cast the problem of personalized tag recommendations as a relational classification problem. Relational classifiers usually consider, additionally to the typical attribute-value data of objects, relational information. A scientific paper, for example, can be connected to another paper that has been written by the same author or because they share common citations. It has been shown in many classification problems that relational classifiers perform better than purely attribute-based classifiers [1, 4, 6]. In our case, we assume that posts are related to each other if they share the same user: Ruser := {(x, x0 ) ∈ X × X | user(x) = user(x 0 )}, the same resource: Rres := {(x, x0 ) ∈ X × X|res(x) = res(x 0 )}, or either share the same user or resource: Rres user := Ruser ∪ Rres (see Figure 1). For convenience, let user(x) and res(x) denote the user and resource of post x respectively. Thus, each post is connected to each other either in terms of other users that tagged the same resource, or the resources tagged by the same user. Weights are discussed in Section 4. 8
Fig. 1. Ruser( top left), Rres(bottom left)and Ruser (right)of a given test post(nodes in grey) Note that it may happen that some of the related posts belong themselves to the test dataset, allowing us to evtl. profit from the unlabeled information of test nodes through, e. g, collective inference(see Section 4). Thus, differently from other ap. proaches(e.g, [3, 8]) that are only restricted to Xtrain, we can also exploit the set X of test posts, but of course not their associated true tags Now, for a given r E Xtest, one can use the tagging information of related instances to estimate T(a). A simple way to do that is, e. g, through tag frequencies of related P印):s|{x'∈NHt∈r(x)H x∈X,t∈T while Nr is the neighborhood of r: N:={x'∈X|(x,x)∈R,T(x)≠ In section 4 we will present the actual relational classifiers we have used to approach 4 Relational classification for Tag recommendation We extract the relational information by adapting simple statistical relational methods, usually used for classification of hypertext documents, scientific publications or movies to the tag recommendation scenario. The aim is to recommend tags to users by using the neighborhood encoded in the homogeneous graph G(X, E). Therefore we described a ery simple method in eq(1), where the probability for a tag t e T given a node a (post)is computed by counting the frequency of neighboring posts z'E Nr that have used the same tag t In this case the strength of the relations is not taken into account, i.e., all considered neighbors of a have the same influence on the probability of tag t
Fig. 1. Ruser (top left), Rres (bottom left) and Rres user (right) of a given test post (nodes in grey) Note that it may happen that some of the related posts belong themselves to the test dataset, allowing us to evtl. profit from the unlabeled information of test nodes through, e.g., collective inference (see Section 4). Thus, differently from other approaches (e.g., [3, 8]) that are only restricted to Xtrain, we can also exploit the set Xtest of test posts, but of course not their associated true tags. Now, for a given x ∈ Xtest, one can use the tagging information of related instances to estimate Tˆ(x). A simple way to do that is, e.g., through tag frequencies of related posts: P(t|x) := |{x 0 ∈ Nx|t ∈ T(x 0 )}| |Nx| , x ∈ X, t ∈ T (1) while Nx is the neighborhood of x: Nx := {x 0 ∈ X |(x, x0 ) ∈ R, T(x) 6= ∅} (2) In section 4 we will present the actual relational classifiers we have used to approach the challenge. 4 Relational Classification for Tag Recommendation We extract the relational information by adapting simple statistical relational methods, usually used for classification of hypertext documents, scientific publications or movies, to the tag recommendation scenario. The aim is to recommend tags to users by using the neighborhood encoded in the homogeneous graph G(X, E). Therefore we described a very simple method in eq. (1), where the probability for a tag t ∈ T given a node x (post) is computed by counting the frequency of neighboring posts x 0 ∈ Nx that have used the same tag t. In this case the strength of the relations is not taken into account, i.e., all considered neighbors of x have the same influence on the probability of tag t 9
given a. But this is not an optimal solution, the more similar posts are to each other th higher the weight of this edge should be Hence, a more suitable relational method for tag recommendation is the WeightedAverage(WA) which sums up all the weights of posts z'E N that share the same tag t E T and normalizes this by the sum over all weights in the neighborhood P(+|x) ∑r∈Nt∈r(x)(x,x Thus, WA does only consider neighbors that belong to the training set. A more sophisticated relational method that takes probabilities into account is the probabilistic weighted average(PWA), it calculates the probability of t given r by build- ing the weighted average of the tag probabilities of neighbor nodes T'E Na ∑x∈N,(x,x')P(tx P(t]r) Where P(tr)=l for I'E Xtrain, i.e., we are only exploiting nodes contained in the training set(see eq. (2). We will see in the next paragraph how one can exploit these probabilities in a more clever way. Both approaches have been introduced in [5] and applied to relational datasets Since we want to recommend more than one tag we need to cast the tag recommen- dation problem as a multilabel classification problem, i. e, assign one or more classes to a test node. We accomplish the multilabel problem by sorting the calculated probabili ties P(tr)for all r E Xtest and recommend the top n tags with highest probabilities The proposed relational methods could either be applied on Rrsr, 1. e, the union of the user and resource relation or on each relation Ruser, Rres individually. If applied on each relation the results could be combined by using ensemble techniques 4.1 Semi-Supervised learning As mentioned before, we would like additionally, to exploit unlabeled information con- tained in the graph and use the test nodes that have not been tagged yet, but are related to other nodes. This can be achieved by applying collective inference methods, being iterative procedures, which classify related nodes simultaneously and exploit relational autocorrelation and unlabeled data. relational autocorrelation is the correlation among a variable of an entity to the same variable(here the class)of a related entity, i.e,con- nected entities are likely to have the same classes assigned. Collective Classification is emi-supervised by nature, since one exploits the unlabeled part of the data. One of this semi-supervised methods is relaxation labeling [1], it can be formally expressed as: M(P(tI)22 We first initialize the unlabeled nodes with the prior probability calculated using the train set, then compute the probability of tag t given r iteratively using a relational clas- sification method M based on the neighborhood Nr in the inner loop. The procedure stops when the algorithm converges (i. e, the difference of the tag probability between
given x. But this is not an optimal solution, the more similar posts are to each other the higher the weight of this edge should be. Hence, a more suitable relational method for tag recommendation is the WeightedAverage (WA) which sums up all the weights of posts x 0 ∈ Nx that share the same tag t ∈ T and normalizes this by the sum over all weights in the neighborhood. P(t|x) = P x0∈Nx|t∈T(x0) w(x, x0 ) P x0∈Nx w(x, x0) (3) Thus, WA does only consider neighbors that belong to the training set. A more sophisticated relational method that takes probabilities into account is the probabilistic weighted average (PWA), it calculates the probability of t given x by building the weighted average of the tag probabilities of neighbor nodes x 0 ∈ Nx: P(t|x) = P x0∈Nx w(x, x0 )P(t|x 0 ) P x0∈Nx w(x, x0) (4) Where P(t|x 0 ) = 1 for x 0 ∈ Xtrain, i.e., we are only exploiting nodes contained in the training set (see eq. (2)). We will see in the next paragraph how one can exploit these probabilities in a more clever way. Both approaches have been introduced in [5] and applied to relational datasets. Since we want to recommend more than one tag we need to cast the tag recommendation problem as a multilabel classification problem, i.e., assign one or more classes to a test node. We accomplish the multilabel problem by sorting the calculated probabilities P(t|x) for all x ∈ Xtest and recommend the top n tags with highest probabilities. The proposed relational methods could either be applied on Rres user, i.e., the union of the user and resource relation or on each relation Ruser, Rres individually. If applied on each relation the results could be combined by using ensemble techniques. 4.1 Semi-Supervised Learning As mentioned before, we would like additionally, to exploit unlabeled information contained in the graph and use the test nodes that have not been tagged yet, but are related to other nodes. This can be achieved by applying collective inference methods, being iterative procedures, which classify related nodes simultaneously and exploit relational autocorrelation and unlabeled data. Relational autocorrelation is the correlation among a variable of an entity to the same variable (here the class) of a related entity, i.e., connected entities are likely to have the same classes assigned. Collective Classification is semi-supervised by nature, since one exploits the unlabeled part of the data. One of this semi-supervised methods is relaxation labeling [1], it can be formally expressed as: P(t|x) (i+1) = M(P(t|x 0 ) (i) x0∈Nx ) (5) We first initialize the unlabeled nodes with the prior probability calculated using the train set, then compute the probability of tag t given x iteratively using a relational classification method M based on the neighborhood Nx in the inner loop. The procedure stops when the algorithm converges (i.e., the difference of the tag probability between 10