FolksOntology: An Integrated Approach for Turning Folksonomies into Ontologies

Céline Van Damme¹, Martin Hepp², and Katharina Siorpaes²

¹ Vakgroep MOSI, Vrije Universiteit Brussel, Brussels, Belgium
² Digital Enterprise Research Institute (DERI), University of Innsbruck, Innsbruck, Austria
celine.van.damme@vub.ac.be, mhepp@computer.org, katharina.siorpaes@deri.org

Abstract. We can observe that the number of non-toy domain ontologies is still very limited for many areas of interest. In contrast, folksonomies are widely in use for (1) tagging Web pages (e.g. del.icio.us), (2) annotating pictures (e.g. Flickr), or (3) classifying scholarly publications (e.g. Bibsonomy). However, such folksonomies cannot offer the expressivity of ontologies, and the respective tags often lack a context-independent and intersubjective definition of meaning. Also, folksonomies and other unsupervised vocabularies frequently suffer from inconsistencies and redundancies. In this paper, we argue that the social interaction manifested in folksonomies and in their usage should be exploited for building and maintaining ontologies. Then, we sketch a comprehensive approach for deriving ontologies from folksonomies by integrating multiple resources and techniques. In detail, we suggest combining (1) the statistical analysis of folksonomies, associated usage data, and their implicit social networks, (2) online lexical resources like dictionaries, Wordnet, Google and Wikipedia, (3) ontologies and Semantic Web resources, (4) ontology mapping and matching approaches, and (5) functionality that helps human actors in achieving and maintaining consensus over ontology element suggestions resulting from the preceding steps.

1. Introduction

It has been argued, e.g. in [1], that the insufficient involvement of users in the construction of ontologies is a significant cause of the current shortage of, and the unsatisfying coverage found in, domain ontologies.
One of the reasons for this deficiency is that there are high barriers for lay users to suggest new conceptual elements. For example, a new concept, instance or property is added to the ontology only by a privileged group. This requires that ontology users with domain expertise take on the burden, and have the skills, to make respective suggestions, which is different from the evolution of a natural language, where a new word can be invented on the spot when needed and immediately added to the vocabulary [1, 2]. Also, since ontology specifications are expressed in a formal language, potential users face difficulties in understanding them [1, 2]. This is important, since the inferences authorized by using a given ontology are represented only in its formal semantics, i.e. what one commits to when adopting a particular ontology is not obvious from the human-readable labels of ontology elements but only from the associated axioms. In addition to that, we can observe that the detachment of ontology usage (e.g. creating annotations) from ontology
construction and maintenance in current practice cuts off valuable feedback and actually makes the social agreement over ontology elements brittle and vague.

Tagging, i.e., users describing objects with freely chosen keywords (tags) in order to retrieve content more easily, avoids these limitations, since new tags can be introduced on the spot when needed and the construction and maintenance of the tags is closely linked to their actual usage. While the resulting tag sets and their assignment to objects at first reflect only subjective conceptualizations, many of those subjective representations can be used to derive intersubjective representations. Such aggregation of raw tag data leads to a flat bottom-up categorization or folksonomy [3]. Popular examples of the tagging/folksonomy mechanism are found in the social bookmark manager del.icio.us (http://del.icio.us), the image sharing system Flickr (http://www.flickr.com), and the blog search engine Technorati (http://technorati.com).

Tagging features create a wealth of data that reflects (1) subjective assignments between words and categories of objects, (2) intersubjective patterns in these associations, and (3) implicit information on social networks. However, tags are flat, and no relationships or conceptual meanings are formally attached to them. This causes problems such as (1) lexical ambiguity: for instance, the tag “bank” can mean a financial institution or it can be used in the context of a river edge; (2) different tags (e.g. “NY” and “big_apple”) may refer to the same concept (e.g. the city New York); and (3) specialized (e.g. “seagull”) and more general tags (e.g. “bird”) may be attributed to the same object (e.g. a picture of a seagull on Flickr) [4]. Also, the same tag may be used for very different objects in clearly distinct contexts.
For example, the tag “Italy” can be used to categorize pictures taken in Italy (in a picture database) or customers living in Italy (in a tagged address database). Ontologies, on the contrary, require a clear and context-independent notion of what it means to be an instance of a respective class.

In this paper, we suggest taking an integrated approach of combining five types of resources and techniques for improving the construction of domain ontologies. We propose to exploit (1) the statistical analysis of folksonomies and the wealth of data resulting from their construction, usage, and the underlying social relationships between actors by providing a set of tools and techniques that identify structural patterns in folksonomies, (2) online lexical resources like dictionaries, Wordnet, Google, and Wikipedia, (3) ontologies and Semantic Web resources, (4) ontology mapping and matching approaches, and (5) functionality that helps the community in achieving and maintaining consensus.

The structure of the paper is as follows. In section 2, we give an overview of potential resources and techniques that are available for lifting folksonomies to the level of ontologies. In section 3, we explain the FolksOntology approach that is based on the integration of these elements and the involvement of the community. In section 4, we give a preliminary assessment of the possible contribution of each resource and technique. In section 5, we discuss our proposal in the light of related work, identify future research challenges, and summarize the main findings.
2. Resources for Lifting Folksonomies to the Level of Ontologies

In this section, we give an overview of promising resources that can be exploited for deriving ontologies from folksonomies. There exist at least three groups of such resources: first, folksonomies and their associated data (subsection 2.1); second, online lexical resources (subsection 2.2); and third, ontologies and other Semantic Web resources (subsection 2.3). In subsection 2.4, we discuss how mapping and matching techniques can support the process.

2.1. Folksonomies and Associated Data

Quite clearly, tagging generates more data than merely tags. When we look at Web sites that have an inherent tagging feature, we can see that there are four groups of entities involved in the tagging process: (1) tags, (2) objects, like images or bibliographic references, (3) actors, and (4) the folksonomy-driven Web sites or systems¹ themselves [5]. There is interaction between those entities, which generates a large amount of potentially valuable data, as described in the subsections below.

¹ In the rest of the paper, we will use the term systems.

2.1.1. Folksonomies and Social Networks in One System

During the tagging process, actors assign tags to objects (figure 1). The actors describe an object using their own, freely chosen keywords, usually in order to facilitate a later retrieval process. As a consequence, the tags express and reflect the actors’ subjective level of knowledge on and their interest in the respective object.

Fig. 1. The Tagging Process

In the past few years, there have been successful attempts at enriching tags with hierarchical relations [6] and at creating faceted ontologies [7] through studying the use of objects and tags in a system. However, more information is available than merely tags, as explained e.g. in [8], in which the social dimension of actors was introduced. Out of a tripartite model of tags, objects, and actors, three bipartite graphs were generated based on the co-occurrence of its elements: the AC (actor-tag) graph, the AI (actor-object) graph, and the CI (tag-object) graph. The folding of these graphs into one-mode networks generates implicit social networks, a network of instances, and lightweight ontologies. [8] examines two such lightweight ontologies (one based on sub-communities of interest and another on object overlaps) on a data set of the del.icio.us system and reveals broader/narrower relations. The authors concluded that analyzing a lightweight ontology of a sub-community is a good means for discovering
the emergent semantics of a community. Therefore, consolidating and analyzing the user-created data of sub-communities seems a valuable start data set for the creation of ontologies of this sub-group.

We argue that the implicit social networks in a system, which are not studied in [8], may return additional significant information. In particular, one can safely assume that actors are indirectly linked with others by sharing the same tags and/or objects. For example, as shown in figure 2, actors A and B are linked by tag3, and actors B and C are related because they both have tagged object5. In the first case, the social binding is the common language; in the second case, it is the interest in the same objects. Analyzing such data might reveal relevant relations that can help us in reconstructing an ontology for the respective domain of interest. For instance, there might be a significant relation between object1 (annotated by actor A) and object5 (annotated by actor B), and maybe the tags should be consolidated. Furthermore, a relation might exist between tag3 and the tag set (tag4, tag5, tag6), since they are all used to annotate object5.

Fig. 2. The Collective Tagging Process

Sometimes, actors have already made explicit their area of interest or expertise, e.g. by joining one or more user groups on the system, which is a feature in some systems (e.g. Bibsonomy, Flickr, YouTube). By that, actors with similar interests can share their objects and tags. However, since everyone may create a new group, redundant groups and a topic overlap between groups are likely. On Flickr, many groups are discussing and generating tags on similar kinds of subjects: there exist, e.g., more than 1290 public groups on wine². Therefore, aggregating the data from those groups may reveal valuable data for the creation of wine ontologies.
Actors can also make their relations and interests public by inviting other actors to their network, as is supported e.g. by del.icio.us. Adding an actor to your network implies that you share the interests of this actor, or that there exist some other social bonds. When all the actors are making their interests public, more information can be extracted.

² http://www.flickr.com/search/groups/?q=wines, retrieved on April 1, 2007
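The folding of the tripartite tagging data into one-mode networks, as discussed in this subsection, can be illustrated with a minimal sketch. It assumes that tag assignments are available as (actor, tag, object) triples; the concrete triples below are hypothetical and only loosely modelled on the description of figure 2, and the function names `bipartite` and `fold` are illustrative, not part of any system mentioned in the paper.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical tag assignments as (actor, tag, object) triples, loosely
# modelled on the figure 2 description; the exact triples are an assumption.
ASSIGNMENTS = [
    ("A", "tag1", "object1"),
    ("A", "tag3", "object1"),
    ("B", "tag3", "object5"),
    ("C", "tag4", "object5"),
    ("C", "tag5", "object5"),
    ("C", "tag6", "object5"),
]


def bipartite(assignments, left, right):
    """Build a bipartite graph {left_node: set(right_nodes)}.

    `left` and `right` are positions in the (actor, tag, object) triple.
    """
    graph = defaultdict(set)
    for triple in assignments:
        graph[triple[left]].add(triple[right])
    return dict(graph)


def fold(graph):
    """Fold a bipartite graph into a one-mode network.

    Two left-hand nodes are linked if they share at least one right-hand
    node; the link is labelled with the set of shared nodes.
    """
    network = {}
    for a, b in combinations(sorted(graph), 2):
        shared = graph[a] & graph[b]
        if shared:
            network[(a, b)] = shared
    return network


actor_tag = bipartite(ASSIGNMENTS, 0, 1)     # AC graph
actor_object = bipartite(ASSIGNMENTS, 0, 2)  # AI graph
tag_object = bipartite(ASSIGNMENTS, 1, 2)    # CI graph

by_language = fold(actor_tag)     # {("A", "B"): {"tag3"}}
by_interest = fold(actor_object)  # {("B", "C"): {"object5"}}
tag_network = fold(tag_object)    # links tag3 with tag4, tag5 and tag6

print(by_language, by_interest, sorted(tag_network))
```

In this toy data set, actors A and B end up linked through the shared tag3 (common language), and actors B and C through the shared object5 (common interest). Labelling each link with the set of shared nodes, rather than with a plain count, is a design choice of this sketch that keeps the evidence for each implicit relation inspectable.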
2.1.2. Folksonomies and Social Networks in Several Systems

As already mentioned, there is a fourth type of entities involved in the tagging process, i.e., systems. Since more and more systems are emerging, we believe that tagging data on similar topics and objects is created in parallel on different systems. Systems are implicitly connected through shared sub-communities of interest or common objects. Sub-communities are not exclusively related to just one system. For instance, a sub-community on wines may exist on Flickr as well as on del.icio.us. However, we have to be careful when comparing data from different kinds of systems, since a folksonomy can be broad or narrow [3]. In case the actor and creator are the same, as is the case on Flickr, the consolidated tags constitute a narrow folksonomy. On del.icio.us, every object is tagged by several actors, depending on the popularity of the object, and the aggregation of the tags leads to a broad folksonomy.

On the other hand, there may exist implicit links between systems because the actors are annotating the same sets (or kinds) of objects. For instance, the same scholarly publications are tagged on different systems (e.g. Bibsonomy and CiteULike). Consolidating the entire user-created data of similar kinds of objects, which is dispersed over several systems, may generate a more complete overview of the metadata of overlapping objects.

In addition, some systems are also explicitly connected through explicit social networks of their actors. Information on a person can be given e.g. using FOAF. FOAF allows everyone to describe him/herself (e.g. name, family name, friends), online accounts, groups and documents in a lightweight formal way³. Extracting the information that is stored in FOAF profiles can unveil the explicit social networks. The explicit social networks can be used for determining people with shared objects and tags.
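As a rough illustration of how an explicit social network could be used to determine people with shared tags, consider the following sketch. It assumes that the "knows" relations have already been extracted from FOAF profiles into plain pairs and that each actor's tag set is known; all actor names, tags, and the function name `shared_tags` are hypothetical.

```python
# Hypothetical input: "knows" pairs as they might be extracted from FOAF
# profiles, plus the tag set each actor uses on some system.
KNOWS = [("alice", "bob"), ("bob", "carol")]
TAGS = {
    "alice": {"wine", "bordeaux", "travel"},
    "bob": {"wine", "riesling"},
    "carol": {"python", "ontology"},
}


def shared_tags(knows, tags):
    """For each explicitly connected pair, return the tags both actors use."""
    result = {}
    for a, b in knows:
        common = tags.get(a, set()) & tags.get(b, set())
        if common:
            result[(a, b)] = common
    return result


pairs = shared_tags(KNOWS, TAGS)
print(pairs)  # {('alice', 'bob'): {'wine'}}
```

Here only the pair (alice, bob) is reported, because bob and carol are explicitly connected but have no tag in common. The same intersection could be computed over shared objects instead of tags.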
In [18], a system is proposed in which actors can, in addition to tagging their bookmarks, explicitly describe their relations with other people using FOAF. Then, they can import the tags of their friends and establish mappings between their tags and those of their peers. Doing this implies a certain level of trust and can enhance the feedback functionality in the bookmark system. In that way, the authors of [18] try to create a community-based ontology that is based on explicitly described relations and trust.

We can conclude that this tagging process produces several kinds of data sets that can be analyzed to exploit the information hidden in these systems. It is obvious that the design of proper tools for exploiting structural patterns in folksonomies is a core challenge for tapping this potential.

³ http://xmlns.com/foaf/0.1/#sec-foafvocab, retrieved on April 1, 2007

2.2. Online Lexical Resources

The data sets obtained from the previous resource can be complemented with information from lexical or terminological resources such as Leo Dictionary, Wordnet, Google, and Wikipedia.

Dictionaries are generally considered a valuable and reliable resource containing definitions of common words. Nowadays, several dictionaries are accessible online, such as Leo Dictionary and the lexical database Wordnet. However, it is not sufficient to rely solely on these resources. For example, rather new or very specific words such as folksonomy cannot be retrieved, although the latter is an established term on the Web. Thus, we should exploit other lexical resources the Web
is offering, e.g. Google and Wikipedia. Google provides a kind of dictionary function. Each time the user enters a search keyword, Google tries to find similar keywords [15]. The search results for both queries are compared (the original one entered by the user and the similar ones). If the alternative spelling has more hits, a suggestion is made to the user. For instance, when typing in the query “occurence”, Google will suggest “occurrence”, since the number of results for the user's keyword “occurence” is significantly lower. This suggestion feature is based on the principle of collective wisdom: if the majority of the Web community is using a keyword, it is accepted as an existing and correctly spelled word. The principle of collective wisdom can also be used for checking the proper usage of language, e.g. for finding proper prepositions. It can be further improved by considering the region of origin and the authority of the returned Web pages (the page http://www.bbc.co.uk will have a higher credibility than http://yahoo.com/users/pmiller.htm).

The Google dictionary function can be complemented with Wikipedia, the online collaborative encyclopedia, for the identification of words. Everyone can edit and create a new Web page in this user-created encyclopedia. For instance, a Wikipedia article for “folksonomy” was already created in November 2004, whereas the word still does not exist in regular dictionaries (footnote 4). With more than 5,300,000 articles [9] in various languages, Wikipedia constitutes a huge corpus of knowledge. In the English language, 1,710,088 articles (footnote 5) can be identified by a URI; moreover, it has been shown in [2] that the conceptual meaning of the articles does not change in most cases, and thus Wikipedia URIs can be regarded as authoritative identifiers for many concepts.

Footnote 4: Merriam Webster Online, Leo Dictionaries
Footnote 5: http://en.wikipedia.org, retrieved on March 27, 2007

2.3. Ontologies and Semantic Web Resources

After consulting all the lexical resources, ontologies and Semantic Web resources can be employed as the second level of resources. Freely available ontologies can be retrieved e.g. through the Semantic Web search engine Swoogle. This search engine searches and indexes Semantic Web documents written in RDF and OWL. It indexes the metadata of the documents and computes relationships between them [10]. Wordnet, which we mentioned in the previous section, can also be exploited as a freely available thesaurus, for which an OWL transcript is available (footnote 6). Wordnet provides an overview of terms and their relationships (e.g. synonyms, meronyms, and homonyms). It is often suggested and applied in research papers for extracting semantic information (e.g. in [11], Wordnet is employed for finding synonyms and related terms in order to reduce the communication obstruction between intelligent agents with different ontologies, and [12] use Wordnet to add a conceptual meaning to the tags when annotating a bookmark).

Footnote 6: http://www.w3.org/TR/wordnet-rdf/, retrieved May 9, 2007

2.4. Ontology Mapping and Matching Approaches

Next to resources, we can build on established techniques for ontology matching and mapping. In principle, matching of conceptual elements in two ontologies can be
based either on the labels or on the ontology structure, or both. For deriving ontologies from folksonomies, those techniques may be used in particular for identifying relationships between tags, between tags and lexical resources, and between tags and elements in existing ontologies.

[13] describe the theory of formal classification, where labels are translated to a propositional concept language. Each node is associated with a normal form formula that describes the content of the node. This approach is able to capture knowledge that exists implicitly within simple classification hierarchies. [14] describe semantic matching, an approach to matching classification hierarchies. This approach focuses on the graph representation of ontologies, which means it cannot be directly applied to tag data. [15] present the FCA-Merge method, where the input to the method is a set of documents from which concepts and the ontologies to be merged are extracted using natural language techniques. These documents should be representative of the domain in question and should be related to the ontologies. They also have to cover all concepts from both ontologies and separate them well enough.

3. The FolksOntology Approach

In this section, we describe (1) how the resources from the previous section can be fully exploited for making ontologies out of folksonomies and (2) how the community can be involved as a mechanism to validate all the information extracted from the resources.

3.1. Fully Exploiting the Resources

A first principle of our approach is that we try to integrate every reasonable data resource and invokable functionality from the Web that can help us construct ontologies from the social interaction taking place on the Web. In other words, we want to take the vast amount of evidence created by users contributing to the Web and extract consensual conceptualizations from that.

3.1.1. Cleansing and Preparation of Tags

Before analyzing the data sets of folksonomies, we must clean the tag sets. Since actors can choose any keyword for categorizing their content, they apply their own spelling and tagging rules (e.g. singular or plural nouns, conjugated verbs). As a consequence, tags are polluted and need to be cleansed. This can be performed through stemming algorithms, which reduce tags to their stem or root. It is important not to lose the context of the tags; therefore, the stemming of tags should be limited to plural nouns and conjugated verbs. After stemming, it has to be checked whether all the tags are spelled correctly. We can use the four lexical resources Leo Dictionary, Wordnet, Google, and Wikipedia to check whether or not the tags are misspelled. If a tag cannot be found in any of these resources, the frequency of this tag should be counted. A low frequency may indicate that the tag is misspelled, whereas a high frequency can indicate the emergence of a new word created in the tagging community. Such a word should be added to the list of new words that has to be examined by the community (subsection 3.2).
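As an illustration only, the cleansing step described above could be sketched in Python as follows. The sketch rests on several assumptions that are not part of the approach itself: a deliberately naive stemmer that handles plural nouns only (conjugated verbs are left out), a hypothetical in-memory word list KNOWN_WORDS standing in for lookups in Leo, Wordnet, Google, and Wikipedia, and an arbitrary frequency threshold min_count.

```python
from collections import Counter

# Hypothetical stand-in for lookups in Leo, Wordnet, Google, and Wikipedia.
KNOWN_WORDS = {"ontology", "photo", "city", "tag", "box"}


def stem_tag(tag: str) -> str:
    """Naive plural-only stemmer (an assumption; not a full stemming algorithm)."""
    t = tag.lower().strip()
    if t.endswith("ies") and len(t) > 4:
        return t[:-3] + "y"          # cities -> city
    if t.endswith("es") and t[:-2].endswith(("x", "ch", "sh", "ss")):
        return t[:-2]                # boxes -> box
    if t.endswith("s") and not t.endswith("ss") and len(t) > 3:
        return t[:-1]                # photos -> photo
    return t


def cleanse(tags, known_words=KNOWN_WORDS, min_count=3):
    """Split stemmed tags into accepted words, suspected typos, and new-word candidates."""
    counts = Counter(stem_tag(t) for t in tags)
    accepted, suspected_typos, candidate_new_words = set(), set(), set()
    for stem, n in counts.items():
        if stem in known_words:
            accepted.add(stem)                 # found in a lexical resource
        elif n >= min_count:
            candidate_new_words.add(stem)      # frequent: forward to the community
        else:
            suspected_typos.add(stem)          # rare and unknown: likely misspelled
    return accepted, suspected_typos, candidate_new_words
```

For a tag list such as ["Ontologies", "ontology", "photos", "folksonomy", "folksonomies", "folksonomy", "ontolgy"], this sketch accepts "ontology" and "photo", flags "ontolgy" as a suspected misspelling, and proposes "folksonomy" as a candidate new word for community review. The threshold would have to be tuned to the size of the folksonomy at hand.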
3.1.2. Statistical Analysis of Folksonomies, Usage Data, and Social Networks

In this subsection we give an overview of the data sets described in section 2.1 and explain, for each, the objective, input, output, and techniques that can be employed.

Table 1. Statistical analysis of tagging data on a single system

Step 1
  Objective: Determining pairs of tags
  Input: Tags, tag/object data
  Output: Pairs of tags
  Techniques: Co-occurrence technique: each time two tags are used to tag the same object, the tie strength between the two tags is increased [19].

Step 2
  Objective: Enriching tags
  Input: Objects and tags
  Output: a) Hierarchical relations between tags; b) faceted ontology
  Techniques: a) [7] presents an algorithm based on the cosine similarities between tags. Tags are aggregated in tag vectors, and the cosine similarity calculates the angle between two tag vectors. The smaller the angle, the more similar the tags are. Each tag is then placed as a node in a similarity graph. If the similarity of two tags exceeds a threshold value, the two nodes are connected with an edge. A hierarchical taxonomy can be derived from the similarity graph. b) A combination of co-occurrence between tags and a subsumption-based model is presented in [6].

Step 3
  Objective: Analyzing and creating subcommunities
  Input: Actors and tags
  Output: Lightweight ontologies based on community overlap
  Techniques: 1) [8] folds the AC graph (actor-tag graph) into a network based on tags. The weights of tags are calculated by the number of times the actors have used the tags in combination. [8] uses social network analysis measures (such as degree, closeness, and betweenness centrality) to determine the general and specialized tags. General tags are used to bridge two clusters, and specialized tags are parts of a specific cluster. Clustering techniques are used to determine the synonyms of the specialized tags. [8] uses set theory to determine the broader/narrower relations in the subcommunity.

Step 4
  Objective: Analyzing social networks based on shared objects
  Input: Actors and objects
  Output: Clusters of actors with shared objects
  Techniques: 1) Analyzing a social network. The tie strength between actors is measured by the number of times the actors have tagged the same object. Social network measures and/or clustering techniques can be used for determining the clusters of actors with similar tagged objects. 2) Analyzing the objects of the actors in each cluster: text mining techniques, digital photo similarity analysis.

Step 5
  Objective: Analyzing social networks based on shared tags
  Input: Actors, tags, and objects
  Output: Clusters of actors with shared tags
  Techniques: 1) Analyzing a social network. The tie strength between actors is measured by the number of times the actors have used the same tag. Social network measures and/or clustering techniques can be used for determining the clusters of actors using the same tags. 2) All the tags used by the actors of a cluster can be further analyzed by using the technique described in step 1.

Step 6
  Objective: Merging similar groups
  Input: Groups (+ tags, objects, actors)
  Output: Clusters of similar groups
  Techniques: 1) The groups can be clustered by setting up a network analysis with groups instead of actors. However, the analysis has to be performed on data sets of equal size. This means that if the sizes of the different groups (= number of tags) differ, the frequency of tags has to be adjusted in proportion. The tie strength between two groups is calculated on the basis of shared tags. Social network measures and/or clustering techniques can be used for determining the clusters. 2) These clusters can be further analyzed by using the technique described for data set 1.

Step 7
  Objective: Analyzing the explicit social network
  Input: Actors and their relations
  Output: Clusters of actors
  Techniques: 1) Analyzing the social network. The tie strength between actors can be 0, 1, or 2, depending on whether the two persons have linked to each other. 2) These clusters can be further analyzed by using the technique described in step 1.

Table 2. Statistical analysis of tagging data across multiple systems

Step 1
  Objective: Analyzing and creating subcommunities
  Input: Actors and tags of different systems
  Output: Clusters of communities with similar interests
  Method: 1) The same techniques as described above can be employed. However, the analysis has to be performed on data sets of equal size. This means that if the “size” (number of tags) of the different systems differs, the frequency of tags has to be adjusted in proportion. 2) These clusters can be further analyzed by using one of the techniques described in step 1 in Table 1.

Step 2
  Objective: Analyzing communities of shared objects
  Input: Actors and objects of systems with the same annotated objects
  Output: Clusters of communities on overlapping objects
  Method: 1) The same techniques as described above can be employed, except that the weights of the objects are calculated by the number of times the actors have used the objects in combination. However, the analysis has to be performed on data sets of equal size. This means that if the sizes of the different systems differ, the proportions have to be adjusted. 2) These clusters can be further analyzed by using the technique described in step 1 in Table 1.

Step 3
  Objective: Analyzing the explicit social network
  Input: Actors (FOAF)
  Output: Clusters of actors
  Method: We can take the RDF data directly for determining social proximity.

3.1.3. Exploiting Online Lexical Resources

The tag data set obtained from the previous steps can be enriched by using the online lexical resources described in section 2.2. These lexical resources (except for Google) can also be used for purposes other than mere spelling checks. Tags can be replaced by concepts and homonyms, or translated from a foreign language into English, as elaborated in the following paragraphs.

Wikipedia: Wikipedia articles are identified by URIs which can be regarded as reliable identifiers for conceptual entities [2]. The meaning of those entities is
described in natural language, augmented by multimedia elements, and agreed upon by a large community. Hence, Wikipedia is the biggest available collection of conceptual entities that are described in natural language and identified by URIs. Already having unique identifiers (e.g. URIs) assigned to concepts defined only in natural language is very beneficial, for it helps improve recall and precision in information retrieval by avoiding synonyms and homonyms. Additionally, Wikipedia contains disambiguation pages in order to deal with homonyms. When one word has several meanings, the meanings are collected on a disambiguation page that lists the articles associated with the same title. This feature can be used to identify and deal with homonyms. Wikipedia also contains an implicit and evolving multilingual dictionary, since a Wikipedia page can have links that refer to the same topic in another language. These links can easily be retrieved in an XML format with the Wikipedia export function (see footnote 7).

Leo dictionaries: Leo (Link everything online) provides a translation service for German, English, French, and Spanish. This functionality can be used for dealing with different languages. Additionally, Leo contains definitions of terms in German.

Wordnet can be used to deal with synonyms and homonyms: words with similar or identical meaning must be mapped to each other (e.g. baby and infant). Furthermore, words that have different conceptual meanings (e.g. Jaguar as the car and the animal) can be identified with Wordnet as well.

3.1.4. Ontologies and Semantic Web Resources

The tag sets obtained in subsection 3.1.2 can also be enriched by trying to establish mappings to elements in existing ontologies. Also, the explicit relationships in existing ontologies may be reused, e.g. for determining whether a hierarchical relation holds between two terms.
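To make the reuse of explicit relationships more concrete, the following Python sketch checks whether a hierarchical relation holds between two tags. It is a minimal illustration under stated assumptions: the dictionary SUBCLASS_OF is hypothetical toy data standing in for subclass statements taken from an existing ontology, and the function names are our own.

```python
# Hypothetical subclass statements, as they might be taken from an existing ontology.
SUBCLASS_OF = {
    "jaguar": {"big_cat"},
    "big_cat": {"mammal"},
    "mammal": {"animal"},
    "car": {"vehicle"},
}


def is_narrower(term, candidate_broader, subclass_of=SUBCLASS_OF):
    """Return True if candidate_broader is reachable from term via subclass links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in subclass_of.get(current, ()):
            if parent == candidate_broader:
                return True
            if parent not in seen:      # guard against cycles in the toy data
                seen.add(parent)
                stack.append(parent)
    return False


def hierarchical_relation(tag_a, tag_b, subclass_of=SUBCLASS_OF):
    """Return (narrower, 'narrower_than', broader), or None if no relation is found."""
    if is_narrower(tag_a, tag_b, subclass_of):
        return (tag_a, "narrower_than", tag_b)
    if is_narrower(tag_b, tag_a, subclass_of):
        return (tag_b, "narrower_than", tag_a)
    return None
```

With the toy data, hierarchical_relation("animal", "jaguar") yields ("jaguar", "narrower_than", "animal"), whereas no relation is found between "jaguar" and "car". In practice the subclass statements would come from ontologies retrieved from the Web rather than from a hand-written dictionary.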
In particular, the Swoogle engine can be used to query for ontologies and ontology usage data.

3.1.5. Mapping and Matching Approaches

The formal classification theory of [13] can be employed for mapping the labels of existing classifications to the tags obtained from the folksonomies. In addition, we can use the lexical resource Wordnet to create a mapping to an existing ontology.

Footnote 7: http://de.wikipedia.org/wiki/Spezial:Exportieren, retrieved on April 1, 2007

3.2. Mechanisms for Involving the Community

Instead of aiming at the fully automated creation of ontologies from folksonomies, we suggest a semi-automated approach, in which the aforementioned techniques are combined with collective human intelligence. In other words, we propose that (1) the results from the previous stages have to be confirmed by the community and (2) information that could not be retrieved from the resources (e.g. relations between tags) may be contributed by the community on demand. For this, we can combine visualization techniques with implicit and explicit voting mechanisms on conceptual choices. For example, a concept hierarchy reconstructed from data could be presented