On Deriving Tagsonomies: Keyword Relations Coming from crowd Michal barla and maria bielikova Institute of Informatics and Software Engineering, Faculty of Informatics and Information Technologies, Slovak University of Technology Ilkovicova 3. 842 16 Bratislava, Slovakia I Name Surname fiit. stuba sk Abstract. Many keyword-based approaches to text classification, in formation retrieval or even user modeling for adaptive web-based system could benefit from knowledge on relations between various key words which gives further possibilities to compare them, evaluate their distance etc. This paper proposes an approach how to determine keyword rela- tions(mainly a parent-child relationship) by leveraging collective wisdom of the masses, present in data of collaborative(social)tagging systems on the Web. The feasibility of our approach is demonstrated on the data coming from the social bookmarking systems delicious and CiteULik 1 Introduction Phenomenon of the Social Web, with its roots in Web 2.0, is gaining a lot of attention all over the world in both research and practice. We are studying the power and wisdom of masses, when millions of people switched from the pas- sive reading of the content to the active participation in its creation. People are blogging, sharing wikis, connecting themselves in various social applications nd above all-tagging almost everything they get into touch: bookmarks, pho- tographs, videos, publications, blogposts, articles etc. People got used to classify items by ng few simple tags to it and to use those tags for a future retrieval of their favorite items. More, they are often expecting to find a new, yet unseen content by using their own tags. a part of the Web 2.0 success lies in an implicit agreement of masses on a shared(but never explicitly defined) vocabulary used to tag items -folksonomies. At the same time, vast amount of content requires efficient navigation sup- port, content reorganization or filtering -personalization and adaptation of the web. Its efficiency is dependent on adaptive systems ability to capture and maintain user model. A lot of research was devoted to finding the most suitable fexible or most generic and all-encompassing user model representation [1, 2 however, so far we are not aware of any explicit agreement on an ideal model The obvious question, when analyzing the success of Web 2.0, is whether an assignments of keywords(tags)to user instead of to pages (i.e, creation of
On Deriving Tagsonomies: Keyword Relations Coming from Crowd Michal Barla and M´aria Bielikov´a Institute of Informatics and Software Engineering, Faculty of Informatics and Information Technologies, Slovak University of Technology Ilkoviˇcova 3, 842 16 Bratislava, Slovakia {Name.Surname}@fiit.stuba.sk Abstract. Many keyword-based approaches to text classification, information retrieval or even user modeling for adaptive web-based system could benefit from knowledge on relations between various keywords, which gives further possibilities to compare them, evaluate their distance etc. This paper proposes an approach how to determine keyword relations (mainly a parent-child relationship) by leveraging collective wisdom of the masses, present in data of collaborative (social) tagging systems on the Web. The feasibility of our approach is demonstrated on the data coming from the social bookmarking systems delicious and CiteULike. 1 Introduction Phenomenon of the Social Web, with its roots in Web 2.0, is gaining a lot of attention all over the world in both research and practice. We are studying the power and wisdom of masses, when millions of people switched from the passive reading of the content to the active participation in its creation. People are blogging, sharing wikis, connecting themselves in various social applications and above all – tagging almost everything they get into touch: bookmarks, photographs, videos, publications, blogposts, articles etc. People got used to classify items by assigning few simple tags to it and to use those tags for a future retrieval of their favorite items. More, they are often expecting to find a new, yet unseen content by using their own tags. A part of the Web 2.0 success lies in an implicit agreement of masses on a shared (but never explicitly defined) vocabulary used to tag items – folksonomies. At the same time, vast amount of content requires efficient navigation support, content reorganization or filtering – personalization and adaptation of the web. Its efficiency is dependent on adaptive system’s ability to capture and maintain user model. A lot of research was devoted to finding the most suitable, flexible or most generic and all-encompassing user model representation [1, 2], however, so far we are not aware of any explicit agreement on an ideal model representation. The obvious question, when analyzing the success of Web 2.0, is whether an assignments of keywords (tags) to user instead of to pages (i.e., creation of
tag-based user model) could lead to simple, viable and efficient approach to user modeling for adaptive web-based system. The challenge is then to combine the user model with the models of communities coming from the emerging social web and create a solid platform for personalization based on both traditional(e. g as presented in 3), and social approaches, such as the one presented in [4 When considering tag-based user models in a(tag-based)Web 2.0 environ ment, we are facing the need to be able to compare various tags. We might need to compare user characteristics or even whole user models, represented by tags o find similarities between users, which could serve for model maintenance as well as for more complex tasks such as online community creation or recom- mending in a recommender system. We might also need to compare the domai items represented by tags (e. g, a web page)in order to evaluate the explicit implicit feedback 5 and update the user model appropriately In this paper, we propose a method for inferring various relationships be- tween tags, which allows for full-blown usage of the tag-based user model for personalization and adaptation on the Web The paper is structured as follows: In section 2 we explain our approach to finding relationships between tags. Section 3 presents data we acquired and results of experiments we performed. In section 4 we summarize the related works, which served as an inspiration for our algorithm for building hierarchies from folksonomies. Finally, we give conclusions 2 Finding Relationships between Tags Our approach to finding relationships between tags combines three distinct ap- 1. Deriving of parent-child relationships between tags from a given folksonomy 2. Determining similarity between tags by applying spreading activation on the top of the folksonomy graph 3. Interconnecting tags by additional semantic relationships as well as enriching the whole tag corpus by adding external keywords: both coming from the Wordnet lexical database 2.1 Building Hierarchies from Folksonomies lksonomy is defined as a hypergraph 6h: where the set of vertices V=AUTUI and AnT=0, AnI=0, TOI=O and the set of ternary edges E={(a,t,)|a∈A,t∈T,t∈}. A social tagging system can be represented by such a hypergraph with following definitions of the sets A, T and i (we will use them for the rest of the paper as well) Actors(users)A=a1,.,ak] Tags(keywords, concepts)T=ti, t) Items(objects, instances)I=i1,.,imI
tag-based user model) could lead to simple, viable and efficient approach to user modeling for adaptive web-based system. The challenge is then to combine the user model with the models of communities coming from the emerging social web and create a solid platform for personalization based on both traditional (e.g., as presented in [3]), and social approaches, such as the one presented in [4]. When considering tag-based user models in a (tag-based) Web 2.0 environment, we are facing the need to be able to compare various tags. We might need to compare user characteristics or even whole user models, represented by tags, to find similarities between users, which could serve for model maintenance as well as for more complex tasks such as online community creation or recommending in a recommender system. We might also need to compare the domain items represented by tags (e.g., a web page) in order to evaluate the explicit or implicit feedback [5] and update the user model appropriately. In this paper, we propose a method for inferring various relationships between tags, which allows for full-blown usage of the tag-based user model for personalization and adaptation on the Web. The paper is structured as follows: In section 2 we explain our approach to finding relationships between tags. Section 3 presents data we acquired and results of experiments we performed. In section 4 we summarize the related works, which served as an inspiration for our algorithm for building hierarchies from folksonomies. Finally, we give conclusions. 2 Finding Relationships between Tags Our approach to finding relationships between tags combines three distinct approaches: 1. Deriving of parent-child relationships between tags from a given folksonomy; 2. Determining similarity between tags by applying spreading activation on the top of the folksonomy graph; 3. Interconnecting tags by additional semantic relationships as well as enriching the whole tag corpus by adding external keywords; both coming from the Wordnet lexical database. 2.1 Building Hierarchies from Folksonomies Folksonomy is defined as a hypergraph [6] H := hV, Ei, where the set of vertices V = A ∪T ∪I and A ∩T = ∅, A ∩I = ∅, T ∩I = ∅ and the set of ternary edges E = {(a, t, i) | a ∈ A, t ∈ T, i ∈ I}. A social tagging system can be represented by such a hypergraph with following definitions of the sets A, T and I (we will use them for the rest of the paper as well): – Actors (users) A = {a1, ..., ak} – Tags (keywords, concepts) T = {t1, ..., tl} – Items (objects, instances) I = {i1, ..., im}
foremen ntioned hypergraph can be reduced into three bipartite graphs a graph holding the associations between actors and tags(AT), actors and items (AD) and tags and items(TD). For example, the AT valued bipartite graph defined as follows:AT:=(4A×T,Eat),Eat={(a,t)|i∈I:(a,t,i)∈E} Each such bipartite graph XY: =(XY, Exy) can be furthermore folded into two one-mode graphs Gx Ex), Gy =(Vy, Ew), where Vr =X V=Y, and Er={(a,b)|a,b∈X,y∈Y:(a,y)∈Exy,(b,y)∈Ery} Ey={(a,b)|a,b∈Y,丑x∈X:(x,a)∈Exy,(,b)∈Exy} We can furthermore define a weight of an edge eab E Ei in such a one-mode graph as u(eab):=|k|k∈K:(a,k)∈E1k,(b,k)∈Eik}. In other words, the weight w(ea. b) shows the number of times the a and b were linked together in an original bipartite graph By folding aforementioned bipartite graphs, we get six different one-mode graphs, each representing different semantic network encoded in the folksonomy. For example, by folding AT graph through tags, we get a social network of actors(users) based on overlapping sets of tags, where the links are bet people who have used the same tags with weights showing the number of tags they have used in common. Similarly, we can get a social network based on overlapping sets of items(two people are linked if they have tagged the same item, with weight showing the number of items they have tagged in common) In our work, we were primarily interested in folding TI graph through items giving us a semantic network of tags The proposed approach to creation of hierarchy from the folded folksonomy hypergraph as defined above is based on a rather simple assumptions of set theory In an ideal situation, the tag ta is a parent of tag tb if the set of entities (persons or items)classified under tb is a subset of the entities under ta In other words, all items classified under narrow tag also appear under the broader tag Moreover, since our goal was to produce a reusable hierarchy of hich could be mapped to users' interests and use this hierarchy as a basis for reasoning on those interests, we were not interested in tags, which were used only by a small amount of users, even if they were using it quite extensively. We wanted only what "crowd agrees upon"and were filtering-out tags not achieving a certain degree of popularity (i. e, it is not used by at least k% of all users), even if this decision reduced drastically the amount of tags in the resulting hierarchy The algorithm 1 shows the basic idea of our approach using a simple code. First, we create an artificial root of the hierarchy and put it in the set of already processed tags(ordered tags in the algorithm). Then, we process all tags from the folksonomy in the following manner 1. If the tag t does not reach the popularity threshold, we omit it immediately and n to the next one 2. Otherwise, we compute its overlap(intersect)with every tag from the ordered tags set, resulting in identifying the tag to with a maximum overlap
The aforementioned hypergraph can be reduced into three bipartite graphs: a graph holding the associations between actors and tags (AT), actors and items (AI) and tags and items (T I). For example, the AT valued bipartite graph is defined as follows: AT := hA × T, Eati, Eat = {(a, t) | ∃i ∈ I : (a, t, i) ∈ E}. Each such bipartite graph XY := hX × Y, Exyi can be furthermore folded into two one-mode graphs GX := hVx, Exi, GY := hVy, Eyi, where Vx = X, Vy = Y , and Ex = {(a, b) | a, b ∈ X, ∃y ∈ Y : (a, y) ∈ Exy, (b, y) ∈ Exy}, Ey = {(a, b) | a, b ∈ Y, ∃x ∈ X : (x, a) ∈ Exy, (x, b) ∈ Exy}. We can furthermore define a weight of an edge eab ∈ Ei in such a one-mode graph as w(eab) := |{k | k ∈ K : (a, k) ∈ Eik, (b, k) ∈ Eik}|. In other words, the weight w(ea,b) shows the number of times the a and b were linked together in an original bipartite graph. By folding aforementioned bipartite graphs, we get six different one-mode graphs, each representing different semantic network encoded in the folksonomy. For example, by folding AT graph through tags, we get a social network of actors (users) based on overlapping sets of tags, where the links are between people who have used the same tags with weights showing the number of tags they have used in common. Similarly, we can get a social network based on overlapping sets of items (two people are linked if they have tagged the same item, with weight showing the number of items they have tagged in common). In our work, we were primarily interested in folding T I graph through items, giving us a semantic network of tags. The proposed approach to creation of hierarchy from the folded folksonomy hypergraph as defined above is based on a rather simple assumptions of set theory: In an ideal situation, the tag ta is a parent of tag tb if the set of entities (persons or items) classified under tb is a subset of the entities under ta. In other words, all items classified under narrow tag also appear under the broader tag. Moreover, since our goal was to produce a reusable hierarchy of tags, which could be mapped to users’ interests and use this hierarchy as a basis for reasoning on those interests, we were not interested in tags, which were used only by a small amount of users, even if they were using it quite extensively. We wanted only what “crowd agrees upon” and were filtering-out tags not achieving a certain degree of popularity (i.e., it is not used by at least k% of all users), even if this decision reduced drastically the amount of tags in the resulting hierarchy. The algorithm 1 shows the basic idea of our approach using a simple pseudocode. First, we create an artificial root of the hierarchy and put it in the set of already processed tags (ordered tags in the algorithm). Then, we process all tags from the folksonomy in the following manner: 1. If the tag t does not reach the popularity threshold, we omit it immediately and pass on to the next one. 2. Otherwise, we compute its overlap (intersect) with every tag from the ordered tags set, resulting in identifying the tag to with a maximum overlap
3. The parent-child or sibling relationship is established if the ratio of maximum overlap to the overall use of the tag t reaches the pre-defined threshold. The roles of the tags in relationship is determined as follows: if the tag to is used significantly more often than t, we declare t to be a child of to and vice versa If the usage of both tags is more or less equal, we declare them as sibling Before the assignment of relationships, we check whether it will not "break the context in the hierarchy, i. e, we compute an average overlap of all ances- tors or children in the branch respectively. If the average overlap falls below the threshold, we create a duplicate of tag to, assign tag t to it appropriately and make a new branch of it. The creation of the duplicate aims at solving the homonymy problem, where the to tag has multiple meanings, depending on a context Spreading Activation Spreading activation is a method for associative retrieval [7 in associative and semantic networks. hence a network data structure consisting of nodes and links modeling relationships between nodes. Searching is done by activating the se- lected node with an activation energy and spreading this energy through the edges to its neighbors so that Energylneighbor l- Energylorigin] The process runs recursively until convergence. At the end, the nodes activation levels represent the re of similarity to the initially chosen node or nodes. Spreading activation search in the folksonomy finds new relationships, which are not "visible when considering only set theory assumptions and the algo- rithm 1. If these relations are added to the already existing parent-child rela- tionships, it allows us to make contextual "jumps"between the tags, which we believe could have an interesting impact for the tag- based user modeling and Since the folksonomy does not provide direct links between tags, the spread- ng activation is performed on either the bipartite graph TI (tags-items)or TA (tags-actors)or on a combination of the two, all depending on the definition what neighbors means for the algorithm (i. e, the spreading activation on the Ta graph is performed if the neighbors of a tag are all actors which used that tag). However, due to the vast amount of connections present in every social tagging system, which leads to enormous time and space complexity of the re- cursive process, we needed to select a sub-graph of the whole folksonomy. For instance, for the TI bipartite graph, we select the sub-graph by modifying the neighbors function of a tag/item so that it will return only popular items/tags where popularity of a tag or item is defined(following the same principles as in algorithm 1)as a ratio of actors which used that particular tag or tagge that particular item. When spreading through TA graph, we define an actor as popular if he or she used at least k% of popular tags and tagged at least l% of popular items
3. The parent-child or sibling relationship is established if the ratio of maximum overlap to the overall use of the tag t reaches the pre-defined threshold. The roles of the tags in relationship is determined as follows: if the tag to is used significantly more often than t, we declare t to be a child of to and vice versa. If the usage of both tags is more or less equal, we declare them as siblings. 4. Before the assignment of relationships, we check whether it will not “break” the context in the hierarchy, i.e., we compute an average overlap of all ancestors or children in the branch respectively. If the average overlap falls below the threshold, we create a duplicate of tag to, assign tag t to it appropriately and make a new branch of it. The creation of the duplicate aims at solving the homonymy problem, where the to tag has multiple meanings, depending on a context. 2.2 Finding Related Tags by Spreading Activation Spreading activation is a method for associative retrieval [7] in associative and semantic networks, hence a network data structure consisting of nodes and links modeling relationships between nodes. Searching is done by activating the selected node with an activation energy and spreading this energy through the edges to its neighbors so that Energy[neighbori ] = Energy[origin] |neighbors| The process runs recursively until convergence. At the end, the nodes’ activation levels represent the measure of similarity to the initially chosen node or nodes. Spreading activation search in the folksonomy finds new relationships, which are not “visible” when considering only set theory assumptions and the algorithm 1. If these relations are added to the already existing parent-child relationships, it allows us to make contextual “jumps” between the tags, which we believe could have an interesting impact for the tag-based user modeling and personalization. Since the folksonomy does not provide direct links between tags, the spreading activation is performed on either the bipartite graph T I (tags-items) or T A (tags-actors) or on a combination of the two, all depending on the definition what neighbors means for the algorithm (i.e., the spreading activation on the T A graph is performed if the neighbors of a tag are all actors which used that tag). However, due to the vast amount of connections present in every social tagging system, which leads to enormous time and space complexity of the recursive process, we needed to select a sub-graph of the whole folksonomy. For instance, for the T I bipartite graph, we select the sub-graph by modifying the neighbors function of a tag/item so that it will return only popular items/tags, where popularity of a tag or item is defined (following the same principles as in algorithm 1) as a ratio of actors which used that particular tag or tagged that particular item. When spreading through T A graph, we define an actor as popular if he or she used at least k% of popular tags and tagged at least l% of popular items
Algorithm 1: Creation of tag hierarchy Result: Popular tags structured in a hierarchy create root ordered tags←root for each tag t do t if t popularity popularity threshold then t。←mnar( tatar,vtag∈ ordered tags) if ol overlap threshold th if rightcontert? (to ancestors, t) then I to children←t root, children←toc end else if Itol<t then f rightcontert? (to children, t) then t parent to parent children←to create a copy toc of to root, children end else end end else root. children←t end end 2.3 Applying Knowledge on Keywords from Wordnet ordnet 8(wordnet. princeton. edu) is a large lexical database for the Er glish language developed and maintained by Cognitive Science Laboratory at Princeton University. It groups nouns, verbs, adjectives and adverbs into sets of cognitive synonyms(synsets). Moreover, it provides conceptual-semantic and
Algorithm 1: Creation of tag hierarchy Data: a folksonomy Result: Popular tags structured in a hierarchy create root ordered tags ← root for each tag t do t.popularity = |t.users| |overallusers| if t.popularity overlap threshold then if |to| |t| then if rightcontext?(to.ancestors, t) then to.children ← t end else create a copy toc of to root.children ← toc to.children ← t end end else if |to| |t| then if rightcontext?(to.children, t) then t.parent = to.parent t.children ← to end else create a copy toc of to t.children ← toc root.children ← t end end else to.sibling ← t end end else root.children ← t end end 2.3 Applying Knowledge on Keywords from Wordnet Wordnet [8] (wordnet.princeton.edu) is a large lexical database for the English language developed and maintained by Cognitive Science Laboratory at Princeton University. It groups nouns, verbs, adjectives and adverbs into sets of cognitive synonyms (synsets). Moreover, it provides conceptual-semantic and
lexical relations between synsets, 1. e, one can browse for hyponyms and hyper nyms of a given word, other similar words or even antonyms The reason why we did not use Wordnet as our source of keywords and relationships at the first place comes from our plans to leverage our hierarchy for user modeling purposes. Tags acquired from social tagging systems are closer to the user than Wordnet, they are more"webbish" Even more, some relationships hich emerge from the social tagging systems could never come out of Wordnet for example, a relationship between words ie, png and bugs, pointing to th well-known problem of Internet Explorer's broken PNG support ). Relationshi cquired from social tagging systems are like ordinary people used them, not like linguistics have decided them to be However, we believe, that Wordnet can still significantly contribute to the quality of the tag-based models built upon the generated tag hierarchy. Informa- tion on semantic relationships between words can be used not only to identify particular subtrees, which should be merged (in case of synonymy) or divided (because of ambiguity), but also to add new words(along with their relation- ships)to the hierarchy, which would raise the probability that we will be able to map user's interest to our keywords 3 Experiment In order to determine the feasibility of the proposed approach to deriving rela- tionships between tags from folksonomies for the purposes of tag-based user mod eling and to determine the optimal setting of the algorithm, we performed several experiments with two different folksonomies. Our main concern was whether the algorithm creates cohesive groups of tags(subtrees)without significant flaws of 3.1 Data Wecollectedapartofdeliciousbookmarkingsite(http://delicious.com dataset by periodically polling their Rss feeds. First, we used a recent activity RSS feed to obtain a list of 128 448 unique user login names, next we used this information in user-scoped RSS feeds to obtain all tags and tagged pages for a given login name the second dataset, we took the anonymized folksonomy of users-tags- publications from the Citeulike(htTp: //citeulike. org), which is a system for tagging and searching for scholarly papers. Summarization of data we w able to acquire so far is listed in the Table 1 first thing we were interested in was whether these two folksonomies are used in a similar manner. The graph on Fig. 1 depicts a distribution of tags pages in delicious in a logarithmic scale. We can see that the distribution of tags hypernym-the generic term used to designate a whole class of specific instances. Y is a hypernym of X if X is a(kind of)y
lexical relations between synsets, i.e., one can browse for hyponyms and hypernyms1 of a given word, other similar words or even antonyms. The reason why we did not use Wordnet as our source of keywords and relationships at the first place comes from our plans to leverage our hierarchy for user modeling purposes. Tags acquired from social tagging systems are closer to the user than Wordnet, they are more “webbish”. Even more, some relationships which emerge from the social tagging systems could never come out of Wordnet (for example, a relationship between words ie, png and bugs, pointing to the well-known problem of Internet Explorer’s broken PNG support). Relationships acquired from social tagging systems are like ordinary people used them, not like linguistics have decided them to be. However, we believe, that Wordnet can still significantly contribute to the quality of the tag-based models built upon the generated tag hierarchy. Information on semantic relationships between words can be used not only to identify particular subtrees, which should be merged (in case of synonymy) or divided (because of ambiguity), but also to add new words (along with their relationships) to the hierarchy, which would raise the probability that we will be able to map user’s interest to our keywords. 3 Experiment In order to determine the feasibility of the proposed approach to deriving relationships between tags from folksonomies for the purposes of tag-based user modeling and to determine the optimal setting of the algorithm, we performed several experiments with two different folksonomies. Our main concern was whether the algorithm creates cohesive groups of tags (subtrees) without significant flaws of the context. 3.1 Data We collected a part of delicious bookmarking site (http://delicious.com) dataset by periodically polling their RSS feeds. First, we used a ’recent activity’ RSS feed to obtain a list of 128 448 unique user login names, next we used this information in user-scoped RSS feeds to obtain all tags and tagged pages for a given login name. As the second dataset, we took the anonymized folksonomy of users-tagspublications from the CiteULike (http://citeulike.org), which is a system for tagging and searching for scholarly papers. Summarization of data we were able to acquire so far is listed in the Table 1. First thing we were interested in was whether these two folksonomies are used in a similar manner. The graph on Fig. 1 depicts a distribution of tags on pages in delicious in a logarithmic scale. We can see that the distribution of tags 1 hypernym – the generic term used to designate a whole class of specific instances. Y is a hypernym of X if X is a (kind of) Y
Table 1. Overview of the acquired dataset #usernames: 128 448 44 215 #unique tags: 220 647 294 80 ems(pages, publications ) 962 3671 437245 fits more or less the power-law, with 421 840 pages tagged by one tag only and one page having 171 distinct tags2 Similar graph for the publications of the CiteULike dataset is shown on Fig. 2 Again, we see the power law distribution, with 625 658 publications tagged by one tag only and one publication tagged by 1 708 distinct tags We know already that power law distributions tend to arise in social systems where many people express their preferences among many options. Therefore, by observing the power law in both datasets, we were assured that datasets are valid difference hough users(which could be an issue especially for our delicious obtain e dataset)to perform a crowd-based analysis. However, we found an interesting popularity of tags. In our delicious dataset, we can find many tags which are shared among 5+% of all users whereas in the CiteULike dataset, a tag marked as"most active"on CiteULike website reaches popularity of 1 to 2 percents only. We continue in delicious feeds harvesting in order to determine, hether the overall popularity of tags marked as popular in our current dataset will decrease 3.2 Results We executed the algorithm 1(see section 2)several times on both folksonomies with different setups and were observing the resulting trees. The setups of the variables for different runs of the algorithm are summarized in the Table 2. The floating average overlap threshold from the table means that the actual threshold is computed"on-the-fly"as a fraction of current parent-child overlap The manual inspection of resulting hierarchies proved the viabilit proach, where meaningful relations between keywords were created uld definitely provide a good basis for a tag-based user modeling. The quality of the result depends highly on the configuration of the algorithm. The rest of this section is devoted to analysis of the impact of algorithms variables to the y We provide summarization of basic attributes of the produced hierarchies the Table 3, with examples on Fig 3 and 4(complete results can be seen ishttp:// the most tagged publication according to the CiteULike linkout database is surpris-
Table 1. Overview of the acquired dataset. delicious citeulike #usernames: 128 448 44 215 #records: 2 957 144 5 228 356 #processed users: 2 234 44 215 #unique tags: 220 647 294 806 #unique items(pages,publications): 962 367 1 437 245 fits more or less the power-law, with 421 840 pages tagged by one tag only and one page having 171 distinct tags2 . Similar graph for the publications of the CiteULike dataset is shown on Fig. 2. Again, we see the power law distribution, with 625 658 publications tagged by one tag only and one publication tagged by 1 708 distinct tags3 . We know already that power law distributions tend to arise in social systems where many people express their preferences among many options. Therefore, by observing the power law in both datasets, we were assured that datasets are valid and contain enough users (which could be an issue especially for our delicious dataset) to perform a crowd-based analysis. However, we found an interesting difference in popularity of tags. In our delicious dataset, we can find many tags which are shared among 5+% of all users whereas in the CiteULike dataset, a tag marked as “most active” on CiteULike website reaches popularity of 1 to 2 percents only. We continue in delicious feeds harvesting in order to determine, whether the overall popularity of tags marked as popular in our current dataset will decrease. 3.2 Results We executed the algorithm 1 (see section 2) several times on both folksonomies with different setups and were observing the resulting trees. The setups of the variables for different runs of the algorithm are summarized in the Table 2. The floating average overlap threshold from the table means that the actual threshold is computed “on-the-fly” as a fraction of current parent-child overlap. The manual inspection of resulting hierarchies proved the viability of our approach, where meaningful relations between keywords were created, which would definitely provide a good basis for a tag-based user modeling. The quality of the result depends highly on the configuration of the algorithm. The rest of this section is devoted to analysis of the impact of algorithm’s variables to the created hierarchy. We provide summarization of basic attributes of the produced hierarchies in the Table 3, with examples on Fig. 3 and 4 (complete results can be seen 2 that page is http://www.scribd.com/ 3 the most tagged publication according to the CiteULike linkout database is surprisingly “about:blank
000 NUMBER OF TAC Fig 1. Distribution of tags on pages in delicious dataset Table 2. Overview of algorithm setup popularity threshold 5%5%15%0.5%05% overlap threshold■10%5% average overlap(context) threshold: 6, 3, 3%Floating 6, 3, 3%floating atwww.fiit.stuba.sk/-barla/iccci09).Ourtaxonomiesconformroughlyto he criteria on estimating quality of a taxonomy defined in 9, where a high- quality taxonomy should have an average depth of 3 with a maximum depth of 5. On the delicious(1 and delicious(2 datasets, we can observe the impact of the overlap threshold. When set to 10%(delicious[1), the tagsonomy tends to be more fat with highly consistent subtrees. The 5% setup of delicious(2 led to organizing more tags into subtrees (i.e, less tags on the first level), but showed that the contextual threshold 3, 3% was setup weakly(only four tags were duplicated in order to keep up with context). The best results were achieved when contextual threshold was set as floating, according to current parent-child overlap. For instance, tag currency on the Fig 3 which was wrongly assigned in a branch audio/ conversion was moved into a separate subtree in order to keep the average overlap of tags in the audio branch on the higher level We were surprised by results coming from the CiteULike folksonomy. The first setup with 10% overlap threshold produced very small hierarchy with only 67 popular tags. It seems that in CiteULike, the crowd did not make an agreement on most appropriate tags for particular publication (i.e, everybody uses his or her own specific tags). One reason for such a difference could be that delicious is pro-actively supporting such an agreement to emerge by recommending tags
10000 100000 1000000 GES 1 10 100 1000 1 5 25 125 625 NUMBER OF PA NUMBER OF TAGS Fig. 1. Distribution of tags on pages in delicious dataset. Table 2. Overview of algorithm setup. deli[1] deli[2] deli[3] cite[1] cite[2] cite[3] popularity threshold: 5% 5% 5% 1,5% 0,5% 0,5% overlap threshold: 10% 5% 5% 10% 5% 5% average overlap (context) threshold: 6,6% 3,3% floating 6,6% 3,3% floating at www.fiit.stuba.sk/~barla/iccci09). Our taxonomies conform roughly to the criteria on estimating quality of a taxonomy defined in [9], where a highquality taxonomy should have an average depth of 3 with a maximum depth of 5. On the delicious[1] and delicious[2] datasets, we can observe the impact of the overlap threshold. When set to 10% (delicious[1]), the tagsonomy tends to be more flat with highly consistent subtrees. The 5% setup of delicious[2] led to organizing more tags into subtrees (i.e., less tags on the first level), but showed that the contextual threshold 3,3% was setup weakly (only four tags were duplicated in order to keep up with context). The best results were achieved when contextual threshold was set as floating, according to current parent-child overlap. For instance, tag currency on the Fig. 3 which was wrongly assigned in a branch audio/conversion was moved into a separate subtree in order to keep the average overlap of tags in the audio branch on the higher level. We were surprised by results coming from the CiteULike folksonomy. The first setup with 10% overlap threshold produced very small hierarchy with only 67 popular tags. It seems that in CiteULike, the crowd did not make an agreement on most appropriate tags for particular publication (i.e., everybody uses his or her own specific tags). One reason for such a difference could be that delicious is pro-actively supporting such an agreement to emerge by recommending tags
10000 Fig. 2. Distribution of tags on publications in CiteULike dataset Table 3. Attributes of the resulting hierarchies I deli[l] deli[2] deli[Cite[cite(2 cite[31 licated tags: st-level tags:3571722775483125 #child-less first-lev. tags verage depth: 1. 819422074206281.194217431:9539 maximum depth: 55 5255 when users are adding a particular page already present in the system, while CiteULike does not provide such a feature yet. Therefore, we decreased the required popularity to 0, 5% for the CiteULike dataset, which resulted in 505 orga Resulting taxonomies pointed out yet another interesting difference between delicious and CiteULike folksonomies: CiteULike folksonomy contains words con- sidered as English stop-words(and, on, the, for etc. ) and(moreover)these stop- words are popular. This was something we did not expect at all, that somebody would use stop-words to organize some content. A possible explanation is that CiteULike users tend to post a short sentence as one tag(e.g,"example of a graph analysis"), but the CiteULike system considers each word of a sentence as a t Another phenomenom of the CiteULike dataset is a popular no-tag tag, which is assigned automatically by the system if user enters a publication without any tags. Obviously, many people do use CiteULike without taking advantage of tagging system built in it, which was rather surprising finding
10000 100000 1000000 R OF IONS 1 10 100 1000 1 10 100 1000 10000 NUMBER PUBLICATI NUMBER OF TAGS Fig. 2. Distribution of tags on publications in CiteULike dataset. Table 3. Attributes of the resulting hierarchies. deli[1] deli[2] deli[3] cite[1] cite[2] cite[3] #popular tags: 1085 1085 1085 67 505 505 #duplicated tags: 1 4 94 0 5 25 #first-level tags: 357 172 277 54 83 125 #child-less first-lev. tags: 245 84 92 47 38 51 average depth: 1.8194 2.2074 2.0628 1.194 2.1743 1.9539 maximum depth: 5 5 5 2 5 5 when users are adding a particular page already present in the system, while CiteULike does not provide such a feature yet. Therefore, we decreased the required popularity to 0,5% for the CiteULike dataset, which resulted in 505 organized tags. Resulting taxonomies pointed out yet another interesting difference between delicious and CiteULike folksonomies: CiteULike folksonomy contains words considered as English stop-words (and, on, the, for etc.) and (moreover) these stopwords are popular. This was something we did not expect at all, that somebody would use stop-words to organize some content. A possible explanation is that CiteULike users tend to post a short sentence as one tag (e.g., “example of a graph analysis”), but the CiteULike system considers each word of a sentence as a tag. Another phenomenom of the CiteULike dataset is a popular no-tag tag, which is assigned automatically by the system if user enters a publication without any tags. Obviously, many people do use CiteULike without taking advantage of tagging system built in it, which was rather surprising finding
modeling sound blogger blender blogging audiobooks comments ding audi model conversIon modeli currency del icio, us ag secondlife converter delicious blender internet uml telephone openg speech domain security pr1 right Fig 3. Example parts of tagsonomy created from the delicious dataset. The first row contains examples from the deli[l] setup, the second row from the deli[2 setup of the algorithm social analysis learning trust pattern ommunit representation communlties erformance management virtual recognitio tegy folksonomy tagging mass sleep distribution density children monitoring Fig 4. Example parts of tagsonomy created from the CiteULike dataset using the 2 setup of the algorit 4 Related works The raise of the tagging systems naturally provoked increase of interest in anal- ysis of folksonomy data among researchers. We are aware of various approache in deriving additional knowledge from folksonomies for different purposes
3d audio blog modeling sound blogger blender voice blogging uml audiobooks comments opengl recording tumblr sounds 3d audio blog model conversion bookmarks modeling currency del.icio.us virtual convert tag secondlife converter delicious blender skype internet uml voip telephone opengl speech domain ... security privacy rights human ... Fig. 3. Example parts of tagsonomy created from the delicious dataset. The first row contains examples from the deli[1] setup, the second row from the deli[2] setup of the algorithm. social analysis learning trust pattern machine community matching knowledge identity reference representation communities performance management virtual recognition strategy folksonomy object memory tagging mass sleep distribution online density children monitoring problem regression plasticity estimation Fig. 4. Example parts of tagsonomy created from the CiteULike dataset using the cite[2] setup of the algorithm. 4 Related Works The raise of the tagging systems naturally provoked increase of interest in analysis of folksonomy data among researchers. We are aware of various approaches in deriving additional knowledge from folksonomies for different purposes