Automated Tag Clustering: Improving search and exploration in the tag space Grigory Begelman Philipp Keller Frank Smadja Technion-Israel Institute of Citrin Informatik Gmbh Technology Computer Science Dpt phred@citrin. ch frank@rawsugar.com gbeg@cs technion. acil 1. INTRODUCTION in Figure 1. It is a picture of a piece of sushi called ni In this paper we discuss the use of clustering techniques to giri(hand formed)sushi as opposed to other types of sushi enhance the user experience and thus the success of collabo- of collabo- like maki, futomaki or temaki sushi. A person not aware of rative tagging services. We show that clustering techniques this classification of sushi would tag this picture with can improve the user experience of current tagging services combination of the following tags: food, fish, raw fish, rice ond, we give an overview of existing approaches. We then sushi, or toro. Whithout delving on the psychological as- describe the algorithms we used for tag clustering and give pects of tagging"(nor on the nuances of sushi); we cleary experimental results. Finally, we explore the use of several see that people think and tag differently. This creates a antiques to identify ser noisy tagspace and thus makes it harder to find material 2. MOTIVATION The success of tagging services like Flickr, del. icio us2 and technorati has shown that tagging is a great collabora tion tool. Tagging seems to be the natural way for people to classify objects as well as an attractive way to discove material. Tagging services provides users with a repositor of tagged resources(a k a tagspace) that can be searched Figure 1: How would you tag this? and explored in different ways. More and more people use at least one tagging service and enjoy them as discovery tools In summary, if you tag the above picture as toro, people Indeed, tagging is simple, it does not require a lot of thinking searching for sushi or food will not find it. This type of prob- nd it is very useful to find the tagged objects later. People lem is rooted in the language, words are often related tag pictures, videos, and other resources with a couple of do not stand in isolation. Such relations among words are ywords to easily retrieve them in a later stage. However, called lexical relations, We refer the reader to WordNet. for oking for information in the tag space has a number of a thorough treatment of semantic and syntagmatic relations hard limitations among words he difficulty comes from the fact that several people usu- Our point is that, without accounting for lexical relations. ally use different tags for the same document. In fact, even a searching in a tag space in which many people of various ingle user's tagging practice may vary over time. Usually background collaborate is bound to be very limited his variability is compensated by looking at many users tags: which is only possible when the page has been tagged 2.2 Limited Subscription for less popular pages the probler The availability of RSs and ATOM feeds has recently cre- remains. Currently tagging services still provide a relatively ated a new information discovery paradigm which we call marginal value for information discovery and we claim that here the subscription paradigm. An increasing number of with the use of clustering techniques this can be greatly im- Internet users discover information with the use of these proved. We first discuss the main limitations of the current tagging service The motivation of the subscribe on a certain topic. It is important to receive all documents 2.1 Limited Search related to the topic but it is less important if some received Let us imagine that you would like to tag the picture documents are less relevant. In other words, the subscrib- ttp//www.flickr.comnowpartofYahoo! ing user is expecting a high recall and will accept a lower http://del.icio.usnowpartofYahoo! http://www.technorati.com Seehttp://www.rashmisinha.com/archives/05_09/tagging- is held by the author/owner(s) itive. html by Rashmi Sinha for a good discussion on the www2006, May 22-26, 2006, Edinburgh, UK ttp: //wordnet. princeton. edu
Automated Tag Clustering: Improving search and exploration in the tag space Grigory Begelman Technion - Israel Institute of Technology Computer Science Dpt gbeg@cs.technion.ac.il Philipp Keller Citrin Informatik GmbH phred@citrin.ch Frank Smadja RawSugar frank@rawsugar.com 1. INTRODUCTION In this paper we discuss the use of clustering techniques to enhance the user experience and thus the success of collaborative tagging services. We show that clustering techniques can improve the user experience of current tagging services. We first describe current limitations of tagging services, second, we give an overview of existing approaches. We then describe the algorithms we used for tag clustering and give experimental results. Finally, we explore the use of several techniques to identify semantically related tags. 2. MOTIVATION The success of tagging services like Flickr1 , del.icio.us2 and technorati3 has shown that tagging is a great collaboration tool. Tagging seems to be the natural way for people to classify objects as well as an attractive way to discover new material. Tagging services provides users with a repository of tagged resources (a.k.a tagspace) that can be searched and explored in different ways. More and more people use at least one tagging service and enjoy them as discovery tools. Indeed, tagging is simple, it does not require a lot of thinking and it is very useful to find the tagged objects later. People tag pictures, videos, and other resources with a couple of keywords to easily retrieve them in a later stage. However, looking for information in the tag space has a number of hard limitations. The difficulty comes from the fact that several people usually use different tags for the same document. In fact, even a single user’s tagging practice may vary over time. Usually, this variability is compensated by looking at many users’ tags; which is only possible when the page has been tagged many times. However, for less popular pages the problem remains. Currently tagging services still provide a relatively marginal value for information discovery and we claim that with the use of clustering techniques this can be greatly improved. We first discuss the main limitations of the current tagging services. 2.1 Limited Search Let us imagine that you would like to tag the picture 1http://www.flickr.com, now part of Yahoo! 2http://del.icio.us, now part of Yahoo! 3http://www.technorati.com Copyright is held by the author/owner(s). WWW2006, May 22–26, 2006, Edinburgh, UK. . in Figure 1. It is a picture of a piece of sushi called nigiri (hand formed) sushi as opposed to other types of sushi like maki, futomaki or temaki sushi. A person not aware of this classification of sushi would tag this picture with any combination of the following tags: food, fish, raw fish, rice, Japanese. However a more expert person would use: nigiri sushi, or toro. Whithout delving on the psychological aspects of tagging4 (nor on the nuances of sushi); we cleary see that people think and tag differently. This creates a noisy tagspace and thus makes it harder to find material tagged by other people. Figure 1: How would you tag this? In summary, if you tag the above picture as toro, people searching for sushi or food will not find it. This type of problem is rooted in the language, words are often related and do not stand in isolation. Such relations among words are called lexical relations. We refer the reader to WordNet5 for a thorough treatment of semantic and syntagmatic relations among words. Our point is that, without accounting for lexical relations, searching in a tag space in which many people of various background collaborate is bound to be very limited. 2.2 Limited Subscription The availability of RSS and ATOM feeds has recently created a new information discovery paradigm which we call here the subscription paradigm. An increasing number of Internet users discover information with the use of these tools. The motivation of the subscribing user is to stay informed on a certain topic. It is important to receive all documents related to the topic but it is less important if some received documents are less relevant. In other words, the subscribing user is expecting a high recall and will accept a lower precision. 4See http://www.rashmisinha.com/archives/05 09/taggingcognitive.html by Rashmi Sinha for a good discussion on the subject 5http://wordnet.princeton.edu/
With tagging services you can subscribe to a number of with shopping sites. This is only possible if tags ar are grouped tags, and all items tagged with these tags will show up in our subscription. However, in practice if you subscribe on The other way to explore the tag space is to look at popu- to the tags java and article, you will miss articles that lar pages or tags, for example, a tag cloud in which the size are tagged with the words blog or essay instead of article. of the tags is proportional to their popularity. Although a Recognizing lexical relations is crucial to be able to provide great visualization paradigm, we believe that with todays an effective subscription service. tagclouds it is hard to find more than one or two tags to click on. Tags are not grouped, there is too much informa 2.3 Limited Exploration tion, so that you find lot of related tags scattered on the tag The whole promise of collaborative tagging is that by ex- cloud. One or two popular s and all their related ploring the tag space you can discover a lot of useful infor. tend to dominate the whole cloud. For example, looking at ation you would not find with traditional search engines the delicio us tagcloud, one would mostly see tags related When your information need is not well defined, the idea to web design and technologies. This is because these topics that you can explore and see what other people tagged with are overwhelmingly more frequent than anything else. There certain tags is very attractive. We believe that tagging will are some 4, 500 links tagged with chocolate and some 61,000 be able to reach a very wide audience only when exploration links tagged with food. However, these hardly show up on techniques will be effective the tag clouds or the popular pages. We claim that account Currently, there are two main ways the tag space can be ing for tag clusters by, for example, showing five semantically explored: using search/refine, and using some kind of tag more cohesive tag clouds is much more informative space visualization such as a tag cloud. Say you are looking for a restaurant in your area, searchin be able to group and show related items and to explain how on delicio us for restaurants, your results look like the image he items are related. In hierarchical classification systems like dmoz it is easy to present related items, namely the parent, siblings and children items. However, in tagging del icio, us search spaces, such relations don't exist. Some tagging systems you bookmarks I inbox I for I post ed in as frank smadja l settings I lagout present lists like "this tag often occurs together with the llowing tags"(related tag list in del icio. us)or"this item is tagged r, here are other items tagged r". This information ur items is too raw to build an exploration space upon common tags No resuts found We claim that if we could automatically and dynamically cluster tags whithout putting more burden on the user, we could provide a much stronger service. Searching, subscrib- Restaurant Reviews, Doctors, Bars. Salons, Denb sts and 3. RELATED WORK There is a lot of relevant work to discuss and we will briefly CRestaurarts, NYC menus, ratings, reviews. New York City mention some here. First. we should mention the taxon estaurants Guid omy projects such as Open Directory(dmoz. org) and Ya egDining com- Your Online Guide to Vegetarian Restaurants seattl hoo! directories who in fact recognized the issue of tagging even before it existed. Their solution was flawed. however because they put the burden on the tagger which in their ase was either some Yahoo! Employee or a volunteering Open Table Restaurants and Restaurant Reservations librarian. Along these lines are the shopping sites(sl how semi-automated techniques for tagging and are based Figure 2: Searching for a restaurant with delicio.us on controlled vocabulary With RawSugar taggers can specify tag hierarchies in You definitely get a few useful links at the top, Yelp is a their own accounts(saying that sushi is a subtag of food for very useful site to find restaurants. However, if you want example). The system uses these hierarchies to provide a to explore and see what types of restaurants are available strong exploration and search experience. Figure 3 is the in what locations or in what kind of price range, the tag Raw Sugar tag box extracted from a search page for restau- list on the right is not useful. At this point, it is better to rants on a user's account. The tags are grouped according continue the exploration at yelp for example. This is ver imilar to what you get by searching for restaurants with 7As http://del.icio.us/tag/ googleThefirstlinksareveryrelevantandareprobablyhttp://www.flickr.com/photos/tags/forexample directories or hubs that contain information on restaurants http://dmoz.org/ Thetrueexplorationwillbepursuedatone(orseveral)ofseehttp://wiki.osafoundation.org/bin/view/journal/ these hubs. We believe that searching is only the first step ets Tags for a disccusion on the in exploring, and the user wants to continue exploring in 1ohttp://shopping.com awaytheywoulddoinadirectorylikeOpenDirectoryor1http://shopping.yahoo.com/ http://del.icio.us/rss/tag/java+article, http://froogle.google.co http://rawsugar.com/rss/search/java/articles http://rawsugar.com
With tagging services you can subscribe to a number of tags, and all items tagged with these tags will show up in your subscription. However, in practice if you subscribe on to the tags java and article6 , you will miss articles that are tagged with the words blog or essay instead of article. Recognizing lexical relations is crucial to be able to provide an effective subscription service. 2.3 Limited Exploration The whole promise of collaborative tagging is that by exploring the tag space you can discover a lot of useful information you would not find with traditional search engines. When your information need is not well defined, the idea that you can explore and see what other people tagged with certain tags is very attractive. We believe that tagging will be able to reach a very wide audience only when exploration techniques will be effective. Currently, there are two main ways the tag space can be explored: using search/refine, and using some kind of tag space visualization such as a tag cloud. Say you are looking for a restaurant in your area, searching on del.icio.us for restaurants, your results look like the image in Figure 2. Figure 2: Searching for a restaurant with del.icio.us You definitely get a few useful links at the top, Yelp is a very useful site to find restaurants. However, if you want to explore and see what types of restaurants are available in what locations or in what kind of price range, the tag list on the right is not useful. At this point, it is better to continue the exploration at yelp for example. This is very similar to what you get by searching for restaurants with Google. The first links are very relevant and are probably directories or hubs that contain information on restaurants. The true exploration will be pursued at one (or several) of these hubs. We believe that searching is only the first step in exploring, and the user wants to continue exploring in a way they would do in a directory like Open Directory or 6http://del.icio.us/rss/tag/java+article, or http://rawsugar.com/rss/search/java/articles with shopping sites. This is only possible if tags are grouped in clusters. The other way to explore the tag space is to look at popular pages or tags, for example, a tag cloud in which the size of the tags is proportional to their popularity7 . Although a great visualization paradigm, we believe that with today’s tagclouds it is hard to find more than one or two tags to click on. Tags are not grouped, there is too much information, so that you find lot of related tags scattered on the tag cloud. One or two popular topics and all their related tags tend to dominate the whole cloud. For example, looking at the del.icio.us tagcloud, one would mostly see tags related to web design and technologies. This is because these topics are overwhelmingly more frequent than anything else. There are some 4,500 links tagged with chocolate and some 61,000 links tagged with food. However, these hardly show up on the tag clouds or the popular pages. We claim that accounting for tag clusters by, for example, showing five semantically more cohesive tag clouds is much more informative. The key in building an effective exploration space seems to be able to group and show related items and to explain how the items are related. In hierarchical classification systems like dmoz8 it is easy to present related items, namely the parent, siblings and children items. However, in tagging spaces, such relations don’t exist. Some tagging systems present lists like “this tag often occurs together with the following tags” (related tag list in del.icio.us) or “this item is tagged x, here are other items tagged x”. This information is too raw to build an exploration space upon. We claim that if we could automatically and dynamically cluster tags whithout putting more burden on the user, we could provide a much stronger service. Searching, subscribing and exploring would be much more effective. 3. RELATED WORK There is a lot of relevant work to discuss and we will briefly mention some here. First, we should mention the taxonomy projects such as Open Directory (dmoz.org) and Yahoo! Directories who in fact recognized the issue of tagging even before it existed. Their solution was flawed, however, because they put the burden on the tagger which in their case was either some Yahoo! Employee or a volunteering librarian.9 Along these lines are the shopping sites (shopping.com10, Yahoo! Shopping11, Froogle12) who use somehow semi-automated techniques for tagging and are based on controlled vocabulary. With RawSugar13 taggers can specify tag hierarchies in their own accounts (saying that sushi is a subtag of food for example). The system uses these hierarchies to provide a strong exploration and search experience. Figure 3 is the RawSugar tag box extracted from a search page for restaurants on a user’s account. The tags are grouped according 7As in http://del.icio.us/tag/ or http://www.flickr.com/photos/tags/ for example. 8http://dmoz.org/ 9See http://wiki.osafoundation.org/bin/view/Journal/... HierarchyVersusFacetsVersusTags for a disccusion on the topic 10http://shopping.com 11http://shopping.yahoo.com/ 12http://froogle.google.com/ 13http://rawsugar.com
to the user defined hierarchy and thus present a more pow- of co-occurrences(tags that are used for the same erful exploration space. This solves the problem in specific any pair of tags and a cut-off point is determined to ser's directories but not on the global tagspace because the when the co-occurrence count is significant enough to be clusters or tag groups are still too sparse. used. This results in a sparse matrix that represents tags so that the value of each element is the similarity of the two restanrant type (158) locations(87) tags. asian, senegalese, ethiopian israel, san francisco bay., new york Say a user tags an article about African trees that is writ. ten by an XHTML expert with the following tags: zhtml, food(124 guides(a) standard, trees, biology, africa, toread, resource. Then(html, sushi, recipes, cooking reviews, restaurants without a. about com andard)and (chtml, trees)would each get one count as tags. After processing the whole tagspace, we use the fre- quency counts of all the co-tag pairs and attempt to identify Figure 3: Searching for restaurants on Raw Sugar the significant co-tags. In order to do that, we determine the pairs of tags that co-occur significantly more frequently then Flickr has Flickr clusters, which, provided a popular tag xpected. We look for a cutoff point above which the co-tags give related tags grouped into clusters. For example, lookin are considered strongly related. Frequency graphs we exam at the clusters for the word Jaguar, we see that the clus- ined usually exhibit a general"relatedness distribution ters neatly fall into several semantic categories of Jaguars: shown in Figure 5. Let's look for example at the tags related animal, car and plane. The hereby presented guidepost is to tag rss below what makes the difference. Clustering makes it possible to present a guidepost, to provide the means that allow the user tag countng feed to explore the information space. In addition, Flickr also has 310‖web2.0 an interestingness exploration technique which they define blog298‖ho 65 as a factor of several parameters including the pageviews feeds 246 wikipedia59 the comments left by users, the specific users, etc. arch219‖blog 53 Rashmi Sinha has published a number of entries on tag. ging and clustering. Bielenberg and Zacher [1] mention tag clustering 6 102‖ learn One should also mention a growing number of tag visual- ization techniques in various stages of development that are currently available on the Web Table1:Co- tags of”Rss” and their counts 4. CLUSTERING ALGORITHMS 4.1 Introduction analysis. Clustering provides partitioning of a dataset into subsets of similar objects or data clusters. Before actually using a clustering technique the first task one has to do is to transform the problem at hand into a numeric representation that can be used by clustering alge 05050505055 rithms. In our case, the goal is first to provide a similiarity neasure among tags and then to run clustering techniques ⊥人, on the tag space represented like this. Below we first dis- cuss our proposed technique to find similar tags and then we discuss the use of clustering technique 4.2 Finding Strongly Related Tags Figure4: Tags related to”Rss” In this section, we present an algorithm to find strongly related tags. The algorithm is based on counting the number In the table l and the graph 4, we see that"feed"occurred http://www.flickr.com/tags/jaguar/clusters 310 times together with rss. In the graph, the y-axis is the count,(how many times a certain tag is used together with http://www.rashmisinha.com/archives/05-_02/tag- rss), as well as its 1 and 2" derivative, the x-axis is i. sorting . html In the mentioned example about African trees, the tag Seealsohttp://group.usandhttp://laurie.informatik.uni-combinationchtmlafricaisjustaccidentalandrelatedto bremen.de/clusty his example and thus would not be selected. We see a http://www.newzingo clear change in the shape of the plot for the counti log hubmed. org/archives /001049 html to determine this cutoff point, we consider the 1st and ttp//www.corant derivative of the count. we start from the tail on the http://www.quasimondo.com/tagnautica.php end and seek the point where the 1 derivative has its first http://www.ivy.fr/revealicious/ high peak (that is when the second derivative goes from http://www.tagcloud.com positive to negative)and check if the peak was high enough
to the user defined hierarchy and thus present a more powerful exploration space. This solves the problem in specific user’s directories but not on the global tagspace because the clusters or tag groups are still too sparse. Figure 3: Searching for restaurants on RawSugar Flickr has Flickr clusters, which, provided a popular tag, give related tags grouped into clusters. For example, looking at the clusters for the word Jaguar 14, we see that the clusters neatly fall into several semantic categories of Jaguars: animal, car and plane. The hereby presented guidepost is what makes the difference. Clustering makes it possible to present a guidepost, to provide the means that allow the user to explore the information space. In addition, Flickr also has an interestingness exploration technique which they define as a factor of several parameters including the pageviews, the comments left by users, the specific users, etc. Rashmi Sinha15 has published a number of entries on tagging and clustering. Bielenberg and Zacher [1] mention tag clustering16 . One should also mention a growing number of tag visualization techniques in various stages of development that are currently available on the Web17 . 4. CLUSTERING ALGORITHMS 4.1 Introduction Data clustering is a common technique for statistical data analysis. Clustering provides partitioning of a dataset into subsets of similar objects or data clusters. Before actually using a clustering technique the first task one has to do is to transform the problem at hand into a numeric representation that can be used by clustering algorithms. In our case, the goal is first to provide a similiarity measure among tags and then to run clustering techniques on the tag space represented like this. Below we first discuss our proposed technique to find similar tags and then we discuss the use of clustering techniques. 4.2 Finding Strongly Related Tags In this section, we present an algorithm to find strongly related tags. The algorithm is based on counting the number 14http://www.flickr.com/tags/jaguar/clusters 15http://www.rashmisinha.com/, http://www.rashmisinha.com/archives/05 02/tagsorting.html 16See also http://group.us and http://laurie.informatik.unibremen.de/clusty/ 17http://www.newzingo.com, http://hublog.hubmed.org/archives/001049.html, http://www.corante.com/many/archives/2005/01/26/ visualizing the collective brain.php, http://www.quasimondo.com/tagnautica.php, http://www.ivy.fr/revealicious/, http://www.tagcloud.com/ of co-occurrences (tags that are used for the same page) of any pair of tags and a cut-off point is determined to decide when the co-occurrence count is significant enough to be used. This results in a sparse matrix that represents tags, so that the value of each element is the similarity of the two tags. Say a user tags an article about African trees that is written by an XHTML expert with the following tags: xhtml, standard, trees, biology, africa, toread, resource. Then (xhtml, standard) and (xhtml, trees) would each get one count as cotags. After processing the whole tagspace, we use the frequency counts of all the co-tag pairs and attempt to identify the significant co-tags. In order to do that, we determine the pairs of tags that co-occur significantly more frequently then expected. We look for a cutoff point above which the co-tags are considered strongly related. Frequency graphs we examined usually exhibit a general “relatedness distribution” as shown in Figure 5. Let’s look for example at the tags related to tag rss below: tag count tag count feed 310 web2.0 77 blog 298 home 65 feeds 246 wikipedia 59 search 219 blogs 57 news 173 biography 53 google 103 preview 48 xml 102 learn 33 web 81 sitemap 30 Table 1: Co-tags of ”RSS” and their counts Figure 4: Tags related to ”RSS” In the table 1 and the graph 4, we see that ”feed” occurred 310 times together with rss. In the graph, the y-axis is the counti (how many times a certain tag is used together with rss), as well as its 1st and 2nd derivative, the x-axis is i. In the mentioned example about African trees, the tag combination xhtml, africa is just accidental and related to this example and thus would not be selected. We see a clear change in the shape of the plot for the counti, and to determine this cutoff point, we consider the 1st and 2nd derivative of the count, we start from the tail on the right end and seek the point where the 1st derivative has its first high peak (that is when the second derivative goes from positive to negative) and check if the peak was high enough
The input for the tags clustering algorithm consists of Tags ti, i=l...l Tagged resources(web resources were used in our e ments)rk,k=1….K A 3D tensor A E RXxk of boolean values.The tensor a contains tagging information: if a user u tagged a resource Tk with a tag t, then Aijk 1 weak otherwise Aijk=0. Normally the tensor A is sparse Our goal is to partition the set of tags into non-intersecting groups of semantically-related tags. We show here how a Figure 5: Typical distribution of related tag graph is built from the input of the tag clustering algorithm Let G(V, e, w) be an undirected weighted graph consist- If these two conditions are fulfilled then this is the cutoff matrix WERixI, where I is the number of vertices. Each the tags on the left hand side of the cutoff relate strongly to vertex Ui of the graph G corresponds to a tag ti. First, we compute matrix BE RA, collecting the tagging informa minimal peak hight".However, sometimes the distribution tion from all users: B=V, Aijk, where V,( denotes the doesn't have this disruption point or we simply don't have enough data to compute this point, therefore the tag, with sor A. The rows of the matrix B correspond to the tags, the most co-occurences of tagi is always considered strongly while the columns of the matrix B correspond to the tagge related to tagi. resources. Thus, if a resource rk is tagged by a tag ti by If we do this for every tag in the tagspace we obtained an some user, then Bik=1 The weight wiin of the edge between the vertex Ui1, cor- edges e and a weight matrix w. Each vertex v: of the graph responding to the tag ti and the vertex via, corresponding orresponds to a tag tagi. There is an edge between vi and to the tag tig is the number of resources, tagged by both i if the tag tagi relates strongly to tag tag, or vice versa ags ti, and tia. Thus we take the rows il and ig of B according to the described algorithm. The weight wiig cor- and calculate the number of resources shared by the tags til esponds to the number of times tagi occurred togethe and tia: winia=(B) A(B)all, where A denotes the with tagig within the same item logical AND"and‖·‖ stands for LI norm of a boolean We tested this algorithm on data from delicio us gathered vector, or the number on“ones” in the vector from their RSS-feed. To simplify computations we pruned There are many algorithms for graph clustering [2 all relations with a count smaller than 30; at the date of the cently, 3 introduced the "modularity function"Q, which experiments the table contained 1100 tag connections and measures the quality of a particular clustering of nodes in we found 23 independent clusters, whereas a cluster is a set a graph. Consider a particular division of a graph into k ags that are connected groups. The modularity function is defined as: Figure 6 is a part of one of the big clusters for design. i V) (V,V)=∑eu belonging to the partition c(see 5 for the discussion of the )(n(n( (a)x- nodularity function properties). Our algorithm for graph quality of partitio The graph clustering algorithm is based on the spectral bisection [4. First, we build the Laplacian matrix LG of Figure 6: part of cluster "design he graph G. The Laplacian matrix is an I x I symmetrical The clusters were computed regularly and the results seem to be fairly stable, that is there was no tendency towards LG(i, i)equals to the degree of vertex vi(the number one big cluster. Some clusters seem too big, i.e. the cluster of graph edges touching the vertex vi) above should be split into a“ design”anda“web”clus ter. The spectral clustering algorithm described by Scott LG(i, j)=-I if there is an edge between the vertices White 5 helped splitting up into handier clusters Ui and Uj ·Lc(i,j)=0 otherwise 4.3 Clustering Algorithm Second, we compute the eigenvector u2 of LG correspond Shttp://del.icio.us/rss/ ing to the second largest positive eigenvalue, A2(LG).The
Figure 5: Typical distribution of related tags If these two conditions are fulfilled then this is the cutoff, the tags on the left hand side of the cutoff relate strongly to tag rss. The only parameter that has to be optimized is the ”minimal peak hight“. However, sometimes the distribution doesn’t have this disruption point or we simply don’t have enough data to compute this point, therefore the tagj with the most co-occurences of tagi is always considered strongly related to tagi. If we do this for every tag in the tagspace we obtained an undirected graph G(V, E, W) consisting of nodes V , a set of edges E and a weight matrix W. Each vertex vi of the graph corresponds to a tag tagi. There is an edge between vi and vj if the tag tagi relates strongly to tag tagj or vice versa according to the described algorithm. The weight wi1i2 corresponds to the number of times tagi1 occurred together with tagi2 within the same item. We tested this algorithm on data from del.icio.us gathered from their RSS-feed18. To simplify computations we pruned all relations with a count smaller than 30; at the date of the experiments the table contained 1100 tag connections and we found 23 independent clusters, whereas a cluster is a set of tags that are connected. Figure 6 is a part of one of the big clusters for design. Figure 6: part of cluster “design” The clusters were computed regularly and the results seem to be fairly stable, that is there was no tendency towards one big cluster. Some clusters seem too big, i.e. the cluster above should be split into a “design” and a “web” cluster. The spectral clustering algorithm described by Scott White [5] helped splitting up into handier clusters. 4.3 Clustering Algorithm 18http://del.icio.us/rss/ The input for the tags clustering algorithm consists of: • Tags ti, i = 1 . . . I • Users uj , j = 1 . . . J • Tagged resources (web resources were used in our experiments) rk, k = 1 . . . K • A 3D tensor A ∈ RI×J×K of boolean values. The tensor A contains tagging information: if a user ui tagged a resource rk with a tag tj then Aijk = 1, otherwise Aijk = 0 . Normally the tensor A is sparse. Our goal is to partition the set of tags into non-intersecting groups of semantically-related tags. We show here how a graph is built from the input of the tag clustering algorithm. Let G(V, E, W) be an undirected weighted graph consisting of nodes V , the set of edges E, and a symmetric weight matrix W ∈ RI×I , where I is the number of vertices. Each vertex vi of the graph G corresponds to a tag ti. First, we compute matrix B ∈ RI×K, collecting the tagging information from all users: B = W j Aijk, where W j (·) denotes the “logical OR” performed on the second dimension of the tensor A. The rows of the matrix B correspond to the tags, while the columns of the matrix B correspond to the tagged resources. Thus, if a resource rk is tagged by a tag ti by some user, then Bik = 1. The weight wi1i2 of the edge between the vertex vi1 , corresponding to the tag ti1 and the vertex vi2 , corresponding to the tag ti2 is the number of resources, tagged by both tags ti1 and ti2 . Thus we take the rows i1 and i2 of B, and calculate the number of resources shared by the tags ti1 and ti2 : wi1i2 = (B) i1 V (B) i2 1 , where V denotes the “logical AND” and k · k1 stands for L1 norm of a boolean vector, or the number on “ones” in the vector. There are many algorithms for graph clustering [2]. Recently, [3] introduced the “modularity function” Q, which measures the quality of a particular clustering of nodes in a graph. Consider a particular division of a graph into k groups. The modularity function is defined as: Q(Pk) = Xk c=1 " A(Vc, Vc) A(V, V ) − A(Vc, V ) A(V, V ) 2 # (1) where Pk defines a partitioning of the vertices into k groups, A(V 0 , V 00) = P i∈V 0 ,j∈V 00 w(i, j), and Vc is the set of vertices belonging to the partition c (see [5] for the discussion of the modularity function properties). Our algorithm for graph clustering uses the modularity function as a measure of the quality of partitioning. The graph clustering algorithm is based on the spectral bisection [4]. First, we build the Laplacian matrix LG of the graph G. The Laplacian matrix is an I × I symmetrical matrix, defined by: • LG(i, i) equals to the degree of vertex vi (the number of graph edges touching the vertex vi) • LG(i, j) = −1 if there is an edge between the vertices vi and vj • LG(i, j) = 0 otherwise Second, we compute the eigenvector v2 of LG corresponding to the second largest positive eigenvalue, λ2(LG). The
vertices of the graph are bisected based on the sign of the Increase the similarity count for each pair of tags corresponding component of u2 t, t belonging in the same cluster We combine the spectral bisection algorithm and the mod- ularity function to the recursive greedy algorithm. Our 2. Sort all the pairs of tags t, tk thus produced by their greedy algorithm takes as input an simple connected undi- decreasing count. rected graph and performs the following steps: 3. Select the top N similar tags. 1. Use spectral bisection to split the graph into two clus- related tags 2. Compare the value of the modularity function Qo of mac,osx, macosx, tiger the original unpartitioned graph to the value of the cool, design, fun, graphics, images modularity function Q1 of the partitioned graph. If aascTzpi ajax, dhtml, programming languages Q1>Qo accept the partitioning, otherwise reject the audio, media, mp3, ipod, itunes photography galleries, photo, hi-res, sexy, flickr images 3. Proceed recursively on each accepted partition. computers, hardware, acorn, 4.4 Experimental Results internet, linux, open source software mambo, programming, technology, web The experiments were performed on the RawSugar data- wto, tips, reference, tutorials, tools ase as of January 2006. The data at this point 200,000 pages and 30,000 tags. The results of the can be accessed in the Raw Sugar lab page9.TH Table 2: Some related tags of clusters was chosen manually. Below are some example clusters for a few query tags. Query tag: health 5. CONCLUSION AND FUTURE WORK We have presented in this short paper what we believe nutrition, food, diet IS convincing evidence that clustering techniques can and fitness. workout running should be used in combination with tagging. Clustering can prove the tagging experience and the use of the tagspace e. science in general. We have presented several clustering techniques life, lifehack, product, howto, gtd, reference, tip and provided some results we obtained on the del icio.us esport, sport or Raw Sugar tagspace. We are currently investigating sev- uding similarity measurements us- · Query tag: sports ing mutual information and other statistical measures such as Chi-square or the Dice coefficient. We are also looking hockey. nhl at the problems of tag spamming and inherently ambiguous basebal, mlb, triple basketbal, nba, nbdl, wnba footbal, nfl 6. REFERENCES bar [1 K. Bielenberg and M. Zachera. Groups in social software: Utilizing tagging to integrate individual computer game, action game, free game contexts for social navigation. 2005 4.5 Using Clusters to Find Semantically Re- 2 U Brandes, M. Gaertler, and D. Wagner. Experiments lated tags on graph clustering. In Proceedings of the 11th Annual European Symposium on Algorithms(ESA'03), volume Related tags can also help the user by suggesting interest- 2832 of Lecture Notes in Computer Science, pages ing tags while tagging, searching, exploring or subscribing 568-579. Springer-Verlag, 2003. For example a user subscribing to the tag music would be 3] M. E. Newman and M. Girvan. Finding and evaluating suggested to try the also add the tag mp3 to his subscrip- ommunity structure in networks. Physical Review E, tion. We present here a technique to automatically discover elated tags based on the clusters we obtained previously. Table 2 shows a few examples we obtained with this tech- 4 A. Pothen, H. D. Simon, and K.P. Liou. Partitioning nique on the Raw Sugar tag space. The algorithm works as sparse matrices with eigenvectors of graphs. SIAM J Matri anal.Apl.,11(3):430-452,1990. 1. For each tag ti that is frequent enough in the tagspace International Conference on Data Mining, 2005 Build a graph of its ctags Partition the graph to different number of clusters sing the clustering algorithm described before http://www.rawsugar.com/lab
vertices of the graph are bisected based on the sign of the corresponding component of v2. We combine the spectral bisection algorithm and the modularity function to the recursive greedy algorithm. Our greedy algorithm takes as input an simple connected undirected graph and performs the following steps: 1. Use spectral bisection to split the graph into two clusters. 2. Compare the value of the modularity function Q0 of the original unpartitioned graph to the value of the modularity function Q1 of the partitioned graph. If Q1 > Q0 accept the partitioning, otherwise reject the partitioning. 3. Proceed recursively on each accepted partition. 4.4 Experimental Results The experiments were performed on the RawSugar database as of January 2006. The data at this point was about 200,000 pages and 30,000 tags. The results of the clustering can be accessed in the RawSugar lab page19. The number of clusters was chosen manually. Below are some example clusters for a few query tags. • Query tag: health: – shopping, research – nutrition, food, diet – fitness, workout, running – article, science – life, lifehack, product,howto, gtd, reference, tip – esport, sport • Query tag: sports – hockey, nhl – basebal, mlb, triple – basketbal, nba, nbdl, wnba – footbal, nfl – alcohol, beer, tv, food, bar – computer game, action game, free game 4.5 Using Clusters to Find Semantically Related Tags Related tags can also help the user by suggesting interesting tags while tagging, searching, exploring or subscribing. For example a user subscribing to the tag music would be suggested to try the also add the tag mp3 to his subscription. We present here a technique to automatically discover related tags based on the clusters we obtained previously. Table 2 shows a few examples we obtained with this technique on the RawSugar tag space. The algorithm works as follows: 1. For each tag ti that is frequent enough in the tagspace: • Build a graph of its cotags. • Partition the graph to different number of clusters using the clustering algorithm described before. 19http://www.rawsugar.com/lab • Increase the similarity count for each pair of tags tj tk belonging in the same cluster. 2. Sort all the pairs of tags tj tk thus produced by their decreasing count. 3. Select the top N similar tags. Tag related tags Apple mac, osx, macosx, tiger Art cool, design, fun, graphics, images javascript ajax, dhtml, programming languages music audio, media, mp3, ipod, itunes photography galleries, photo, hi-res, sexy, flickr, images software computers, hardware, acorn, internet, linux, open source software, mambo, programming, technology, web howto, tips, reference, tutorials, tools free download, freeware, opensource Table 2: Some related tags 5. CONCLUSION AND FUTURE WORK We have presented in this short paper what we believe is convincing evidence that clustering techniques can and should be used in combination with tagging. Clustering can improve the tagging experience and the use of the tagspace in general. We have presented several clustering techniques and provided some results we obtained on the del.icio.us or RawSugar tagspace. We are currently investigating several other techniques including similarity measurements using mutual information and other statistical measures such as Chi-square or the Dice coefficient. We are also looking at the problems of tag spamming and inherently ambiguous tags. 6. REFERENCES [1] K. Bielenberg and M. Zachera. Groups in social software: Utilizing tagging to integrate individual contexts for social navigation. 2005. [2] U. Brandes, M. Gaertler, and D. Wagner. Experiments on graph clustering. In Proceedings of the 11th Annual European Symposium on Algorithms (ESA’03), volume 2832 of Lecture Notes in Computer Science, pages 568–579. Springer-Verlag, 2003. [3] M. E. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical Review E, 69(2), 2004. [4] A. Pothen, H. D. Simon, and K.-P. Liou. Partitioning sparse matrices with eigenvectors of graphs. SIAM J. Matrix Anal. Appl., 11(3):430–452, 1990. [5] S. White and P. Smyth. A spectral clustering approach to finding communities in graphs. In SIAM International Conference on Data Mining, 2005