Recommender Systems for Social Bookmarking

DISSERTATION

submitted to obtain the degree of doctor at Tilburg University, on the authority of the rector magnificus, Prof. dr. Ph. Eijlander, to be defended in public before a committee appointed by the doctorate board, in the auditorium of the University on Tuesday 8 December 2009 at 14:15

by

Antonius Marinus Bogers,
born on 21 September 1979 in Roosendaal en Nispen
Supervisor (promotor):
Prof. dr. A.P.J. van den Bosch

Assessment committee:
Prof. dr. H.J. van den Herik
Prof. dr. M. de Rijke
Prof. dr. L. Boves
Dr. B. Larsen
Dr. J.J. Paijmans

The research reported in this thesis has been funded by SenterNovem / the Dutch Ministry of Economic Affairs as part of the IOP-MMI À Propos project.

SIKS Dissertation Series No. 2009-42
The research reported in this thesis has been carried out under the auspices of SIKS, the Dutch Research School for Information and Knowledge Systems.

TiCC Dissertation Series No. 10

ISBN 978-90-8559-582-3

Copyright © 2009, A.M. Bogers

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronically, mechanically, photocopying, recording or otherwise, without prior permission of the author.
“The Web, they say, is leaving the era of search and entering one of discovery. What’s the difference? Search is what you do when you’re looking for something. Discovery is when something wonderful that you didn’t know existed, or didn’t know how to ask for, finds you.”
Jeffrey M. O’Brien
PREFACE

First and foremost I would like to thank my supervisor and promotor Antal van den Bosch, who guided me in my first steps as a researcher, both for my Master’s thesis and my Ph.D. research. Antal always gave me free rein in investigating many different research problems, while at the same time managing to steer me in the right direction when the time called for it. Antal was always able to make time for me or any of the other Ph.D. students, and to read and comment on paper or presentation drafts.

In addition to turning me into a better researcher, Antal was also instrumental in improving my Guitar Hero skills. Our thesis meetings during your sabbatical doubled as a kind of Rock ’n’ Roll Fantasy Camp, where we could both unwind from discussing yet another batch of experiments I had run or was planning to run. Rock on! Antal also shares my passion for ice hockey. This resulted in us attending Tilburg Trappers games in Stappegoor as well as our regular discussions of the latest hockey news. Thanks for inviting me to come see the NHL All Star games in Breda. Hopefully we will meet again in spirit come May 2010 when the Canucks beat the Penguins in the Stanley Cup finals!

The research presented in this thesis was performed in the context of the À Propos project. I would like to acknowledge SenterNovem and the Dutch Ministry of Economic Affairs for funding this project as part of the IOP-MMI program. The À Propos project was started by Lou Boves, Antal, and Frank Hofstede. I would like to thank Lou and Frank in particular. Frank was always able to look at my research problems from a different and more practical angle, and as a result our discussions were always very stimulating. I would also like to thank Mari Carmen Puerta-Melguizo, Anita Deshpande, and Els den Os, as well as the other members and attendees of the project meetings, for the pleasant cooperation and helpful comments and suggestions.
I wish to thank the members of my committee for taking time out of their busy schedules to read my dissertation and attend my defense: Jaap van den Herik, Maarten de Rijke, Lou Boves, Birger Larsen, and Hans Paijmans. Special thanks go to Jaap for his willingness to go through my thesis with a fine-toothed comb. The readability of the final text has benefited greatly from his meticulous attention to detail and quality. Any errors remaining in the thesis are my own. I would also like to thank Birger for his comments, which helped to dot the i’s and cross the t’s of the final product. Finally, I would like to thank Hans Paijmans, who contributed considerably to my knowledge of IR.
My Ph.D. years would not have been as enjoyable and successful without my colleagues at Tilburg University, especially those at the ILK group. It is not everywhere that the bond between colleagues is as strong as it was in ILK, and I will not soon forget the coffee breaks with the Sulawesi Boys, the BBQs and Guitar Hero parties, lunch runs, after-work drinks, and the friendly and supportive atmosphere on the 3rd floor of Dante. I do not have enough room to thank everyone personally here, but you know who you are. In your own way, you all contributed to this thesis.

Over the course of my Ph.D. I have spent many Fridays at the Science Park in Amsterdam, working with members of the ILPS group headed by Maarten de Rijke. I would like to thank Erik Tjong Kim Sang for setting this up and Maarten for allowing me to become a guest researcher in his group. Much of what I know about doing IR research I learned from these visits, from small things like visualizing research results and LaTeX layout to IR research methodology and a focus on empirical, task-driven research. I hope that some of what I have learned shows in the thesis. I would like to thank all of the ILPS members, but especially Krisztian, Katja, and Maarten for collaborating with me on expert search, which has proven to be a very fruitful collaboration so far.

I have also had the pleasure of working at the Royal School of Library and Information Science in Copenhagen. I am most grateful to Birger Larsen and Peter Ingwersen for helping to arrange my visit and guiding me around. Thanks are also due to Mette, Haakon, Charles, Jette, and the other members of the IIIA group for welcoming me and making me feel at home. I look forward to working with you all again soon.

Thanks are due to Sunil Patel for designing part of the stylesheet of this thesis and to Jonathan Feinberg of http://www.wordle.net/ for the word cloud on the front of this thesis.
I owe Maarten Clements a debt of gratitude for helping me implement his random walk algorithm more efficiently. And of course thanks to BibSonomy, CiteULike, and Delicious for making the research described in this thesis possible.

Finally, I would like to thank the three most important groups of people in my life. My friends, for always supporting me and taking my mind off my work. Thanks for all the dinners, late-night movies, pool games, talks, vacations and trips we have had so far! Thanks to my parents for always supporting me and believing in me; without you I would not have been where I am today. Kirstine, thanks for putting up with me while I was distracted by my work, and thanks for patiently reading and commenting on my Ph.D. thesis. And a thousand thanks for bringing so much joy, laughter and love into my life.

This one is for Timmy and Oinky!
CONTENTS

Preface

1 Introduction
  1.1 Social Bookmarking
  1.2 Scope of the Thesis
  1.3 Problem Statement and Research Questions
  1.4 Research Methodology
  1.5 Organization of the Thesis
  1.6 Origins of the Material

2 Related Work
  2.1 Recommender Systems
    2.1.1 Collaborative Filtering
    2.1.2 Content-based Filtering
    2.1.3 Knowledge-based Recommendation
    2.1.4 Recommending Bookmarks & References
    2.1.5 Recommendation in Context
  2.2 Social Tagging
    2.2.1 Indexing vs. Tagging
    2.2.2 Broad vs. Narrow Folksonomies
    2.2.3 The Social Graph
  2.3 Social Bookmarking
    2.3.1 Domains
    2.3.2 Interacting with Social Bookmarking Websites
    2.3.3 Research Tasks

I Recommending Bookmarks

3 Building Blocks for the Experiments
  3.1 Recommender Tasks
  3.2 Data Sets
    3.2.1 CiteULike
    3.2.2 BibSonomy
    3.2.3 Delicious
  3.3 Data Representation
  3.4 Experimental Setup
    3.4.1 Filtering
    3.4.2 Evaluation
    3.4.3 Discussion

4 Folksonomic Recommendation
  4.1 Preliminaries
  4.2 Popularity-based Recommendation
  4.3 Collaborative Filtering
    4.3.1 Algorithm
    4.3.2 Results
    4.3.3 Discussion
  4.4 Tag-based Collaborative Filtering
    4.4.1 Tag Overlap Similarity
    4.4.2 Tagging Intensity Similarity
    4.4.3 Similarity Fusion
    4.4.4 Results
    4.4.5 Discussion
  4.5 Related Work
  4.6 Comparison to Related Work
    4.6.1 Tag-aware Fusion of Collaborative Filtering Algorithms
    4.6.2 A Random Walk on the Social Graph
    4.6.3 Results
    4.6.4 Discussion
  4.7 Chapter Conclusions and Answer to RQ 1

5 Exploiting Metadata for Recommendation
  5.1 Contextual Metadata in Social Bookmarking
  5.2 Exploiting Metadata for Item Recommendation
    5.2.1 Content-based Filtering
    5.2.2 Hybrid Filtering
    5.2.3 Similarity Matching
    5.2.4 Selecting Metadata Fields for Recommendation Runs
  5.3 Results
    5.3.1 Content-based Filtering
    5.3.2 Hybrid Filtering
    5.3.3 Comparison to Folksonomic Recommendation
  5.4 Related Work
    5.4.1 Content-based Filtering
    5.4.2 Hybrid Filtering
  5.5 Discussion
  5.6 Chapter Conclusions and Answer to RQ 2

6 Combining Recommendations
  6.1 Related Work
    6.1.1 Fusing Recommendations
    6.1.2 Data Fusion in Machine Learning and IR
    6.1.3 Why Does Fusion Work?
  6.2 Fusing Recommendations
  6.3 Selecting Runs for Fusion
  6.4 Results
    6.4.1 Fusion Analysis
    6.4.2 Comparing All Fusion Methods
  6.5 Discussion & Conclusions
  6.6 Chapter Conclusions and Answer to RQ 3

II Growing Pains: Real-world Issues in Social Bookmarking

7 Spam
  7.1 Related Work
  7.2 Methodology
    7.2.1 Data Collection
    7.2.2 Data Representation
    7.2.3 Evaluation
  7.3 Spam Detection for Social Bookmarking
    7.3.1 Language Models for Spam Detection
    7.3.2 Spam Classification
    7.3.3 Results
    7.3.4 Discussion and Conclusions
  7.4 The Influence of Spam on Recommendation
    7.4.1 Related Work
    7.4.2 Experimental Setup
    7.4.3 Results and Analysis
  7.5 Chapter Conclusions and Answer to RQ 4

8 Duplicates
  8.1 Duplicates in CiteULike
  8.2 Related Work
  8.3 Duplicate Detection
    8.3.1 Creating a Training Set
    8.3.2 Constructing a Duplicate Item Classifier
    8.3.3 Results and Analysis
  8.4 The Influence of Duplicates on Recommendation
    8.4.1 Experimental Setup
    8.4.2 Results and Analysis
  8.5 Chapter Conclusions and Answer to RQ 5

III Conclusion

9 Discussion and Conclusions
  9.1 Answers to Research Questions
  9.2 Recommendations for Recommendation
  9.3 Summary of Contributions
  9.4 Future Directions

References

Appendices

A Collecting the CiteULike Data Set
  A.1 Extending the Public Data Dump
  A.2 Spam Annotation

B Glossary of Recommendation Runs

C Optimal Fusion Weights

D Duplicate Annotation in CiteULike

List of Figures

List of Tables

List of Abbreviations

Summary

Samenvatting

Curriculum Vitae

Publications

SIKS Dissertation Series

TiCC Dissertation Series