Finding Experts By Semantic Matching of User Profiles

Rajesh Thiagarajan1, Geetha Manjunath2, and Markus Stumptner1

1 Advanced Computing Research Centre, University of South Australia
{cisrkt,mst}@cs.unisa.edu.au
2 Hewlett-Packard Laboratories, India
geetha.manjunath@hp.com

Abstract. Extracting interest profiles of users based on their personal documents is one of the key topics of IR research. However, when these extracted profiles are used in expert finding applications, only naive text-matching techniques are used to rank experts for a given requirement. In this paper, we address this gap and describe multiple techniques to match user profiles for better ranking of experts. We propose new metrics for computing semantic similarity of user profiles using spreading activation networks derived from ontologies. Our pilot evaluation shows that matching algorithms based on bipartite graphs over semantic user profiles provide the best results. We show that using these techniques, we can find an expert more accurately than other approaches, in particular within the top ranked results. In applications where a group of candidate users needs to be short-listed (say, for a job interview), we obtain very good precision and recall as well.

1 Introduction

The problem of finding experts on a given set of topics is important for many lines of business, e.g., consulting, recruitment, and e-business. In these applications, one common way to model a user is with a user profile: a set of topics with weights determining the user's level of interest. When used for personalization, these user profiles are matched against a retrieved document (say, a search result) to check its relevance to the user. A similar matching technique can be used for expert finding as well, wherein we first formulate the requirement (query) as an expected profile of the expert who is sought after. Expert finding is then carried out by matching the query profile with the available/extracted expert user profiles.

In the above context, automatic extraction of the topics of expertise (interest) of a person, based on the documents authored (accessed) by that person, through information extraction techniques is well known. However, when these extracted profiles are used for expert finding, the profile matching is often carried out by applying traditional content matching techniques, which miss most potential candidates if the query is only an approximate description of the expert (as is usually the case). In this paper, we propose and evaluate multiple approaches for semantic matching of user profiles to enable better expert finding in such cases.

Let us briefly look at the challenges in comparing user profiles. User profiles are generally represented in the bag-of-words (BOW) format: a set of weighted terms that describe the interest or the expertise of a user. The most commonly used content matching technique is cosine similarity, the cosine between the BOW vector representing the user profile and that of the document to match. Although this simple matching technique suffices in a number of content matching applications, it is well known that considering just the words leads to problems due to the lack of semantics in the representation. Problems due to polysemy (terms such as apple or jaguar having two different meanings) and synonymy (two words meaning almost the same thing, such as glad and happy) can be solved if profiles are described using semantic concepts instead of words. Once again, simple
matching techniques can be used on these bags-of-concepts (BOC). However, these approaches fail to determine the right matches when there is no direct overlap/intersection in the concepts. For example, do two users with Yahoo and Google in their respective profiles have nothing in common? There does seem to be an intersection in these users' interests in Web-based IT companies or web search tools! Such overlaps are missed because current approaches work under the assumption that the profile representations (BOW) contain all the information about the user. As a result, relationships that are not explicit in the representations are usually ignored. Furthermore, these mechanisms cannot handle user profiles that are at different levels of granularity or abstraction (e.g., jazz and music), as the implicit relationship between the concepts is ignored.

In this paper, we solve the above issues in user profile matching through effective use of ontologies. We define the notion of semantic similarity between two user profiles to consider inherent relationships between the concepts/words appearing in their respective BOW representations. We use the process of spreading to include additional related terms in a user profile by referring to an ontology (Wordnet or Wikipedia) and experiment with multiple techniques to enable better profile matching. We propose simple metrics for computing similarity between two user profiles with ontology-based Spreading Activation Networks (SAN). We evaluate multiple mechanisms for extending user profiles (set- and graph-based spreading) and for semantic matching (set intersection and bipartite graphs) of profiles. We show the effectiveness of our user profile matching techniques for accuracy in expert ranking as well as candidate selection. From a given set of user profiles, our bipartite-graph based algorithms can accurately spot an expert within the top three ranks. In applications where a group of candidate users needs to be found (e.g., for a job interview), we obtain very good precision and recall as well.

The organization of the rest of this document is as follows. We describe related research efforts on profile building and ontology-based semantic matching techniques in Section 2, followed by a brief section giving some background and definitions needed to understand our solution. An overview of our spreading process is presented in Section 4. We present our new similarity measures in Section 5. We describe our evaluation procedure for expert finding in Section 6 and share our improved results. We summarize our contributions and state possible future work in Section 7.

2 Related Work

Determining interest profiles of users based on their personal documents is an important research topic in information extraction, and a number of techniques to achieve this have been proposed. Expert finding techniques that combine multiple sources of expertise evidence, such as academic papers and social citation networks, have also been proposed [1]. User profiles have been extracted using multiple types of corpora: utilizing knowledge about the expert in Wikipedia [2], analysing the expert's documents [3-5], and analysing openly accessible research contributions of the expert [6]. The use of the Wikipedia corpus to generate semantic user profiles has also been reported [7]. Pre-processing the profile terms by mapping them to such ontology concepts prior to computing cosine similarity has been shown to yield better matching [3].
A number of traditional similarity measurement techniques, such as the cosine similarity measure or term vector similarity [8, 9], Dice's coefficient [10], and Jaccard's index [11], are used in profile matching. For example, Jaccard's index is used in [2] to match expert profiles constructed using Wikipedia knowledge. Such an approach cannot determine an inexact semantic match when there is no direct overlap between the concepts in the two user profiles. Use of knowledge obtained from
an ontology, in our solution, enables similarity checks when there are no direct overlaps between user profiles and, therefore, results in more accurate similarity measurements.

The problem of automatically routing conference papers to their reviewers is somewhat related to that of expert finding. Most of the current approaches to that problem use a group of papers authored by reviewers to determine their user profiles and perform routine content matching (similar to personalization) to determine whether a paper is fit to be reviewed by a given user [12]. The expert finding task introduced by TREC 2005 [13] requires one to provide a ranked list of the candidate experts based on the web data provided. Our attempt is to handle the problem of choosing the best expert given a description of a hypothetical expert (a set of topics with weights) and a set of user profiles of candidate experts.

Use of ontologies to derive new concepts that are likely to be of interest to the user through semantic spreading activation networks has been studied as well [14-17, 5]. Previous studies have shown that the spreading process improves accuracy and overcomes the challenges caused by inherent relationships and polysemy in the word sense disambiguation process [15, 16] and in ontology mapping [17]. We use this spreading process to facilitate the semantic similarity computation. We build on the spreading process used in [5] to learn user preferences in order to drive a personalized multimedia search. That learning process utilizes ontologies as a means to comprehend user interests (in BOW format) and establishes the need to consider related concepts to improve search quality. While the results in [5] suggest that personalized search is of better quality than normal search, they do not show whether the consideration of related terms contributes to these improvements. In contrast, we show that our spreading process indeed improves the accuracy of our new similarity measures, in the particular context of user profile matching.

A number of approaches have already been proposed to determine the similarity between two ontology concepts (or words). These determine similarity by measuring the path distance between them [18], evaluating shared information between them [19], recursively matching sub-graphs [20], combining information from various sources [21], analysing the structure of the ontology [22], and combining content analysis and web search [23]. A few other measures are evaluated in [24]. While all these approaches are only able to determine closeness between two concepts (or words), we compute similarity between two weighted sets of concepts (or words). One of our algorithms uses the simple path measure described in [18] over a bipartite graph to determine such a set intersection.

We now compare our work with others that use SAN-based IR techniques. One of our similarity measures is similar to the one discussed in [25] but differs in the treatment of the results of the activation process. While the previous work utilizes the results of the activation to rank documents with respect to a query, our work maps an aggregate of the activation results to a similarity value. The use of knowledge from an ontology to extend the BOW with terms that share important relationships with the original terms, in order to improve document retrieval, is presented in [4].
Our work on set spreading is somewhat similar to this, but we further explore the notion of computing similarity by optimal concept matching in bipartite graphs and by using a SAN.

3 Background

In this section, we formally define and explain some terms used in the rest of the document.
Definition 1 (User Profile). A user profile $u$ is a set of binary tuples $\{\langle t_1, w_1\rangle, \ldots, \langle t_n, w_n\rangle\}$, where the $t_i$ are the terms that describe the user and $w_i$ denotes the importance of $t_i$ in describing the user. We use $terms(u)$ to denote the set of terms $t_i$ in the profile $u$.

Cosine Similarity: The BOW representation is typically used for computing cosine similarity between user profiles. If the vector representation of a user profile $u_j$ is $\vec{V}(u_j)$ and the Euclidean length $|\vec{V}(u_j)|$ of an entity $u_j$ is $\sqrt{\sum_{i=1}^{n} w_i^2}$, the similarity of the entities $u_j$ and $u_k$ is

$$(1)\qquad sim_{cos}(u_j, u_k) = \cos(\vec{V}(u_j), \vec{V}(u_k)) = \frac{\vec{V}(u_j) \cdot \vec{V}(u_k)}{|\vec{V}(u_j)|\,|\vec{V}(u_k)|}$$

Spreading: Spreading is the process of including the terms that are related to the original terms in a user profile by referring to an ontology. Let us study the earlier mentioned simple example of two users having google and yahoo in their profiles in more detail, to understand the spreading process better.

Example 1. Consider computing the similarity of the following users:

– $u_1 = \{\langle google, 1.0\rangle\}$, and
– $u_2 = \{\langle yahoo, 2.0\rangle\}$.

A simple intersection check between the profiles results in an empty set (i.e. $u_1 \cap u_2 = \emptyset$), indicating their unrelatedness (the cosine similarity is 0). However, if we were to manually judge the similarity of these two users, we would give it a value greater than 0. This is because we judge the similarity not just by considering the two terms from the profiles but also by considering the relationships that might exist between them due to our prior knowledge. We are able to establish the fact that both google and yahoo are search engine providers.

Now let us see the effectiveness of spreading in the similarity computation process in the same example. Spreading the profiles $u_1$ and $u_2$, by referring to the Wikipedia parent category relationship, extends the profiles to

– $u'_1 = \{\langle google, 1.0\rangle, \langle internet\ search\ engines, 0.5\rangle\}$, and
– $u'_2 = \{\langle yahoo, 2.0\rangle, \langle internet\ search\ engines, 1.0\rangle\}$.

The simple intersection check now results in a non-empty set (i.e. $u'_1 \cap u'_2 \neq \emptyset$), indicating their relatedness (the cosine similarity is 0.2). The spreading process (i.e. the inclusion of the related term internet search engines) makes sure that any relationship that exists between the profiles is taken into consideration.
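For concreteness, the following is a minimal Python sketch of Equation 1 (an illustration, not the authors' implementation), representing each profile as a dict from term to weight. Run on the spread profiles of Example 1, it reproduces the 0.2 similarity.

```python
import math

def cosine_similarity(u: dict, v: dict) -> float:
    """Cosine similarity between two weighted term profiles (Equation 1)."""
    shared = set(u) & set(v)
    dot = sum(u[t] * v[t] for t in shared)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Example 1 after spreading: the shared related term makes the similarity non-zero.
u1 = {"google": 1.0, "internet search engines": 0.5}
u2 = {"yahoo": 2.0, "internet search engines": 1.0}
print(cosine_similarity(u1, u2))  # 0.2
```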
4 Spreading to Create Extended User Profiles

In this section, we describe two techniques to compute and represent the extended user profiles (see the example in Section 3) using an ontology. An ontology $O$ represents human knowledge about a certain domain as concepts, attributes, and relationships between concepts in a well-defined hierarchy. It is usually represented as a graph in which the nodes are the concepts and the edges are the relationships, labelled with the type of relationship. For the purpose of profile spreading, we assume that all the terms $t_i$ describing an entity are mappable to concepts in a reference ontology. For example, all the terms $t_i$ in a BOW representation of a user profile map to concepts in the Wordnet ontology. Given a term $t_i$, the spreading process utilizes $O$ to determine the terms that are related to $t_i$ (denoted as $related_O(t_i)$). Although spreading the profiles with related terms allows for a comprehensive computation, uncontrolled addition of all the related terms leads to dilution of the profiles with noise or unrelated terms. This dilution may have negative implications for the computation process, where similarity in the noise may contribute to the similarity values between entities. It is therefore desirable to have control over the types of relationships to be considered during this spreading process.

[Fig. 1: Two Schemes for Profile Spreading. (a) Set Spreading; (b) Graph Spreading. Both spreaders take as parameters the types of relationships to spread upon (such as parentOf), the weight functions for the spread terms, and a termination condition (such as a number of iterations).]

Since the weight $w_i$ of a term $t_i$ indicates the importance of the term within a user profile, the weights of the new related terms are proportional to the weights of the original terms. However, during spreading, the weights of the related terms should differ according to the semantics of the relationships on the edge. For example, spreading based on Wikipedia may be limited to spreading only along the parent categories. We therefore use a set of linear influence functions, one per relationship type (role/property of an ontology), to control the spreading process. For example, a spreading process based on Wordnet limited to the types synonym and antonym can have the functions $t_{ij} = w_i \times 0.9$ and $t_{ij} = w_i \times -0.9$, respectively. We propose two schemes for representing the related terms post-spreading: extended set and semantic network.

4.1 Set Spreading

Depicted in Figure 1a, set spreading is a process of extending a user profile such that the related terms, which are determined with respect to an ontology, are simply appended to the original set of terms. Set spreading is an iterative process. After each iteration, the related terms from the previous iteration are appended to the profile. The spreading process is terminated if there are no related terms to spread the profile with, or after a fixed number of iterations.
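The set spreader can be sketched as follows; `related(term, rel)` is an assumed helper standing in for the ontology lookup $related_O$, and `influence` maps each relationship type to its linear influence function, as described above. This is an illustrative sketch under those assumptions, not the authors' implementation.

```python
def set_spread(profile: dict, related, influence: dict, iterations: int = 1) -> dict:
    """Set spreading: append ontology-related terms to the profile, weighting
    each by a per-relationship-type influence function."""
    spread = dict(profile)
    frontier = dict(profile)          # terms added in the previous iteration
    for _ in range(iterations):
        new_terms = {}
        for term, weight in frontier.items():
            for rel, f in influence.items():  # e.g. {"parentOf": lambda w: w * 0.5}
                for rt in related(term, rel):
                    if rt not in spread:
                        w = f(weight)
                        new_terms[rt] = max(new_terms.get(rt, w), w)
        if not new_terms:             # terminate early: nothing left to add
            break
        spread.update(new_terms)
        frontier = new_terms
    return spread
```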
4.2 Graph Spreading

Shown in Figure 1b, graph spreading is the process whereby the terms from two profiles and their related terms are built into a semantic network (SAN). Unlike set spreading, graph spreading preserves the relationship between a term in a profile and its related term in the form of a graph edge. This allows relationships to be considered based on their semantics on the same network. Graph spreading terminates as set spreading does, or if there exists a path between every pair of term nodes from the two profiles. This condition best suits ontologies that have a top root element which subsumes the rest of the elements in the ontology. For example, Wordnet-based spreading can be tuned to employ this termination condition, where a path from the individual terms to the root suffices to terminate the spreading. Less rigorous ontologies, such as the Wikipedia category graph, may not be able to support this condition as there may not be a single root. In such a case, the spreading process is terminated if there exists at least one path from every node belonging to the smaller of the two profiles to the nodes in the other profile. We describe the complete details of the two spreading algorithms in our technical report [26].

5 Similarity Computation

In this section, we describe the complete details of our variant metrics to compute semantic similarity using ontologies.

5.1 Set-based Measure

The set spreading process enriches the profiles by appending the related terms in order to capture all the relationships between the terms. For set spreading, the same cosine similarity technique defined in Equation 1 is applicable to compute similarity between the extended BOWs or BOCs. Set spreading-based similarity computation begins by measuring the similarity of the original profiles and proceeds by incrementally extending the profiles until termination, computing the similarity between the profiles at every iteration.

5.2 SAN-based Measure

This similarity computation metric is inspired by the abundant work that exists in the area of semantic search, especially techniques that process a SAN (e.g., [25, 15]). We focus on similarity computation techniques that use a SAN resulting from the graph spreading process (see Figure 2a for an overview of the SAN structure). Following the construction of the semantic network, the similarity values are computed either by reducing the graph to a bipartite graph or by activating the graph with an activation strategy. We have implemented both of these techniques for evaluation. A brief introduction to the activation process is presented below; for a more detailed discussion the reader is pointed to [15].

[Fig. 2: SAN-based Similarity Computations. (a) SAN building process: the term nodes of the two profiles are linked through related terms via relations r1 and r2; (b) bipartite graph matching (Hungarian Algorithm), selecting an optimal matching M = {1,4,5,7} from the candidate edges E = {1,2,3,4,5,6,7}.]
The SAN activation process is iterative. Let $A_j(p)$ denote the activation value of node $j$ at iteration $p$. All the original term nodes corresponding to the tuples in a user profile take their term weights $w_j$ as their initial activation value, $A_j(0) = w_j$. The activation values of all the other nodes are initialized to 0. In each iteration,

– every node propagates its activation to its neighbours;
– the propagated value is a function of the node's current activation value and the weight of the edge that connects them (denoted as $O_j(p)$; see [15]).

After a certain number of iterations, when the termination condition is reached, the highest activation value among the nodes associated with each of the original term nodes is retrieved into a set $ACT = \{act_1, act_2, \ldots, act_{n+m}\}$, where $n$ and $m$ are the numbers of terms in the first and second profile, respectively. The aggregate of these activation values can be mapped to the similarity between the profiles, under the intuition that the nodes with higher activation values are typically the ones that have value contributions from both profiles and hence should contribute more to similarity, and vice versa. Therefore, the similarity value is the sum of the set $ACT$ normalized to a value between 0 and 1. The SAN-based similarity between two profiles $u_1$ and $u_2$, where $max(ACT)$ is the highest activation value, is

$$(2)\qquad sim_{san}(u_1, u_2) = \frac{\sum_{act_i \in ACT} act_i}{|ACT| \times max(ACT)}$$
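As an illustration of the activation loop just described, here is a minimal Python sketch over an adjacency-list SAN. The propagation rule (a decayed fraction of the source's activation, scaled by the edge weight) is a simplifying assumption standing in for the exact output function $O_j(p)$ of [15], and `sim_san` reads each term node's own final activation rather than the maximum over its associated nodes.

```python
def activate(graph, seeds, iterations=3, decay=0.8):
    """graph: {node: [(neighbour, edge_weight), ...]}; seeds: {term_node: w}.
    Returns the final activation value of every node."""
    act = {n: 0.0 for n in graph}
    act.update(seeds)                 # A_j(0) = w_j for the original term nodes
    for _ in range(iterations):
        nxt = dict(act)
        for node, neighbours in graph.items():
            for nb, ew in neighbours:
                # assumed propagation rule, a stand-in for O_j(p) in [15]
                nxt[nb] = nxt.get(nb, 0.0) + act[node] * ew * decay
        act = nxt
    return act

def sim_san(act, term_nodes):
    """Equation 2: aggregate the activations of the original term nodes
    and normalize to [0, 1]."""
    ACT = [act[t] for t in term_nodes]
    m = max(ACT)
    return sum(ACT) / (len(ACT) * m) if m > 0 else 0.0
```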
5.3 Similarity Computation by Matching Bipartite Graphs

A key insight here is that, by omitting the intermediate related nodes and considering only the path lengths between the nodes representing the original profile terms, the semantic network can be converted into a bipartite graph (shown on the left side of Figure 2b). The nodes of the first profile and of the second profile are the two vertex sets of the bipartite graph, where each edge carries the distance between the original term nodes as obtained from the semantic network. Once the bipartite graph is derived, we are able to apply standard algorithms for optimal matching of the bipartite graph. Our similarity measures based on optimal bipartite matching operate under the simple notion that nodes with higher weights that are closely located contribute more to the similarity of the entities, and vice versa.

Each node $v^u_i$ in the semantic network is a pair $\langle t_i, w_i\rangle$, where $u = 1$ or $2$ denotes which user's profile term the node represents. $path(v^1_i, v^2_j)$ denotes the set of edges between two nodes $v^1_i$ and $v^2_j$ in the semantic network. All the edges between any two nodes with different terms in the semantic network have uniform weights: $\forall e \in path(v^1_i, v^2_j)$ we set $wt(e) = 1$, where $wt(e)$ denotes the weight of the edge $e$. For any two vertices $v^1_i$ and $v^2_j$, the distance between them is

$$len(v^1_i, v^2_j) = \begin{cases} 0, & \text{if } t_i = t_j \\ \sum_{e_k \in path(v^1_i, v^2_j)} wt(e_k), & \text{otherwise} \end{cases}$$

Definition 2 (Bipartite Representation). The bipartite graph representation $G$ of the profiles $u_1$ and $u_2$ is a pair $G = \langle V, E\rangle$ where

– $V = V^1 \cup V^2$, where $V^1$ denotes the vertices from the first profile $u_1$ and $V^2$ denotes the vertices from the second profile $u_2$;
– $V^1 = \{v^1_1, v^1_2, \ldots, v^1_n\}$ and $V^2 = \{v^2_1, v^2_2, \ldots, v^2_m\}$, where $n \leq m$ and $v^k_i = \langle t^k_i, w^k_i\rangle$ is a term;
– $E = \{e_{11}, e_{12}, \ldots, e_{ij}\}$, where $i = \{1, 2, \ldots, n\}$, $j = \{1, 2, \ldots, m\}$, and $len(v^1_i, v^2_j)$ denotes the path length between the vertices $v^1_i$ and $v^2_j$.

Given the bipartite representation $G$, the optimal matching $E' \subseteq E$ between the two vertex sets is computed using the Hungarian Algorithm [27]. The optimal bipartite graph (shown on the right side of Figure 2b) is $G' = \langle V, E'\rangle$, where $E' \subseteq E$ such that $\sum_{e_{ij} \in E'} len(v^1_i, v^2_j)$ is optimal. Given the weights of the vertices in the representation, $W^{12} = \{w^1_1, \ldots, w^1_i, w^2_1, \ldots, w^2_j\}$, these are normalized to values in $[0, 1]$, $W^{12'} = \{w^{1'}_1, \ldots, w^{1'}_i, w^{2'}_1, \ldots, w^{2'}_j\}$, where each $w^{k'}_l = \frac{w^k_l}{\sum_l w^k_l}$.

Aggregate Path Distances: Abiding by our notion that closer nodes with higher weights contribute more to the similarity value, we present three (slightly different) path length aggregation measures for empirical evaluation. The path distance of an edge $e_{ij}$ in the optimal bipartite graph is defined as

$$path(e_{ij}) = \begin{cases} 1, & \text{if } len(v^1_i, v^2_j) = 0 \\ 0, & \text{if } len(v^1_i, v^2_j) = \infty \\ \frac{w^{1'}_i \times w^{2'}_j}{len(v^1_i, v^2_j)}, & \text{otherwise} \end{cases}$$

The Euler path distance of an edge $e_{ij}$ in the optimal bipartite graph is defined as

$$eupath(e_{ij}) = \begin{cases} 1, & \text{if } len(v^1_i, v^2_j) = 0 \\ 0, & \text{if } len(v^1_i, v^2_j) = \infty \\ \frac{w^{1'}_i \times w^{2'}_j}{e^{len(v^1_i, v^2_j)}}, & \text{otherwise} \end{cases}$$

The Euler half path distance of an edge $e_{ij}$ in the optimal bipartite graph is defined as

$$euhalf(e_{ij}) = \begin{cases} 1, & \text{if } len(v^1_i, v^2_j) = 0 \\ 0, & \text{if } len(v^1_i, v^2_j) = \infty \\ \frac{w^{1'}_i \times w^{2'}_j}{e^{len(v^1_i, v^2_j)/2}}, & \text{otherwise} \end{cases}$$

The aggregate distance of all the matching edges of the bipartite graph is given by the sum of their path distances.

Similarity Measures: Given two user profiles $u_1$ and $u_2$, the similarity between them using aggregate path distances in the optimal bipartite graph is defined as follows.

$$(3)\qquad sim_{path}(u_1, u_2) = \frac{\sum_{e_{ij} \in E'} path(e_{ij})}{min(size(terms(u_1)), size(terms(u_2))) \times max(path(e_{ij}))}$$

$$(4)\qquad sim_{eupath}(u_1, u_2) = \frac{\sum_{e_{ij} \in E'} eupath(e_{ij})}{min(size(terms(u_1)), size(terms(u_2))) \times max(eupath(e_{ij}))}$$

$$(5)\qquad sim_{euhalf}(u_1, u_2) = \frac{\sum_{e_{ij} \in E'} euhalf(e_{ij})}{min(size(terms(u_1)), size(terms(u_2))) \times max(euhalf(e_{ij}))}$$
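A minimal sketch of Equation 3 using SciPy's implementation of the Hungarian algorithm (`linear_sum_assignment`). The helper `path_len`, which returns the SAN path length between two terms (0 for identical terms, infinity if unreachable), is an assumed input; the paper computes it on the spread semantic network. Equations 4 and 5 follow by swapping the per-edge distance function.

```python
import math
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian algorithm

def sim_path(u1: dict, u2: dict, path_len) -> float:
    """Equation 3 over the optimal bipartite matching of two weighted profiles."""
    if not u1 or not u2:
        return 0.0
    t1, t2 = list(u1), list(u2)
    # normalize vertex weights per profile
    s1, s2 = sum(u1.values()), sum(u2.values())
    w1 = {t: u1[t] / s1 for t in t1}
    w2 = {t: u2[t] / s2 for t in t2}
    def path(a, b):                  # per-edge path distance (higher = closer)
        L = path_len(a, b)
        if L == 0:
            return 1.0
        if math.isinf(L):
            return 0.0
        return (w1[a] * w2[b]) / L
    scores = np.array([[path(a, b) for b in t2] for a in t1])
    rows, cols = linear_sum_assignment(scores, maximize=True)  # optimal E'
    matched = scores[rows, cols]
    denom = min(len(t1), len(t2)) * matched.max()
    return matched.sum() / denom if denom > 0 else 0.0
```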
5.4 Compound Similarity Measures

While the term vector similarity technique considers only intersecting terms when computing similarity, this measure is quite accurate when two profiles actually intersect. Therefore, we propose compound similarity measures where the similarity between the intersecting profile terms is computed using cosine similarity (Equation 1), and the similarity between the remaining profile terms is computed using our bipartite graph approaches (Equations 3, 4, and 5). More details follow.

Given two user profiles $u_1$ and $u_2$, the intersecting profile parts are denoted as $u'_1$ and $u'_2$ such that $terms(u'_1) = terms(u'_2) = terms(u_1) \cap terms(u_2)$. The remaining non-overlapping profile parts are denoted as $\hat{u}_1$ and $\hat{u}_2$ such that $terms(\hat{u}_1) = terms(u_1) \setminus terms(u_2)$ and $terms(\hat{u}_2) = terms(u_2) \setminus terms(u_1)$. The combined size of the two profiles is denoted as $N = |terms(u_1)| + |terms(u_2)|$. The size of the intersecting profile parts is $N' = |terms(u'_1)| + |terms(u'_2)|$. The size of the non-overlapping profile parts is $\hat{N} = |terms(\hat{u}_1)| + |terms(\hat{u}_2)|$.

The compound similarity measure based on $sim_{path}$ (Equation 3) is

$$(6)\qquad sim^C_{path} = \frac{sim_{cos}(u'_1, u'_2) \times N' + sim_{path}(\hat{u}_1, \hat{u}_2) \times \hat{N}}{N}$$

The compound similarity measure based on $sim_{eupath}$ (Equation 4) is

$$(7)\qquad sim^C_{eupath} = \frac{sim_{cos}(u'_1, u'_2) \times N' + sim_{eupath}(\hat{u}_1, \hat{u}_2) \times \hat{N}}{N}$$

The compound similarity measure based on $sim_{euhalf}$ (Equation 5) is

$$(8)\qquad sim^C_{euhalf} = \frac{sim_{cos}(u'_1, u'_2) \times N' + sim_{euhalf}(\hat{u}_1, \hat{u}_2) \times \hat{N}}{N}$$
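A sketch of Equation 6 under the same assumptions as the previous listings (the `cosine_similarity` and `sim_path` functions sketched earlier are assumed to be in scope); Equations 7 and 8 differ only in the bipartite measure that is plugged in.

```python
def sim_compound(u1: dict, u2: dict, path_len, bipartite_sim=sim_path) -> float:
    """Compound similarity (Eq. 6): cosine similarity on the intersecting
    terms, a bipartite measure on the non-overlapping remainder."""
    common = set(u1) & set(u2)
    i1 = {t: u1[t] for t in common}                  # u'_1
    i2 = {t: u2[t] for t in common}                  # u'_2
    r1 = {t: u1[t] for t in u1 if t not in common}   # u-hat_1
    r2 = {t: u2[t] for t in u2 if t not in common}   # u-hat_2
    N = len(u1) + len(u2)
    N_int, N_hat = len(i1) + len(i2), len(r1) + len(r2)
    part_cos = cosine_similarity(i1, i2) if common else 0.0
    part_bi = bipartite_sim(r1, r2, path_len) if r1 and r2 else 0.0
    return (part_cos * N_int + part_bi * N_hat) / N if N else 0.0
```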
6 Evaluation and Results

We evaluate the different algorithms described in the previous section in the context of expert finding. We use an in-house tool called Profile Builder to generate expert profiles, using the techniques described in [7] to create profiles by analysing documents (such as web pages visited by the expert). Both the BOW (word profiles) and BOC (terms are Wikipedia concepts; Wiki profiles) representations of the experts are generated by the Profile Builder software. An expert finding query is correspondingly in the form of either a BOW or a BOC. For a given query profile, matching expert profiles are determined by computing the similarity between the expert profile and the query profile.

Measure    Description
COS-Word   Cosine similarity measure between expert and query BOW profiles (Equation 1)
COS-Con    Cosine similarity measure between expert and query BOC profiles (Equation 1)
COS-5n     Mean cosine similarity between BOC profiles after 5 iterations of set spreading
COS-10n    Mean cosine similarity between BOC profiles after 10 iterations of set spreading
Bi-PATH    Compound similarity measure after graph spreading as defined in Equation 6
Bi-EU      Compound similarity measure after graph spreading as defined in Equation 7
Bi-EUby2   Compound similarity measure after graph spreading as defined in Equation 8
SAN        Similarity measure after graph spreading as defined in Equation 2

Table 1: Glossary of the Similarity Measures

A pilot study conducted as part of the evaluation process interviewed 10 participants with expertise in different fields of computer science research. From each participant, 5 to 10 documents that, in the participant's opinion, best describe their research were collected. Along with the documents, the participants were asked to give 5 keywords for each of their documents that, in their opinion, best described the document. Since these keywords somewhat described the expertise of the participants, they were used by the participants to provide two similarity judgments. We believe this approach reduces the subjectivity in judging similarity and gives us more realistic values for comparison. Every participant was asked to judge the similarity between their own profile and the other profiles. Additionally, each participant judged the similarity between every pair of profiles (third-person view). The means of the subjective judgments provided by the participants were used as the base/reality values to evaluate our similarity measures. The comparison of the computed similarity values with the reality values was actually made across all user pairs. However, for evaluating the algorithms in the context of expert finding, we consider a user q to represent the query profile and evaluate the similarity results of user pairs (q, x), where x is every other user (the experts).

[Fig. 3: Effectiveness of Similarity Measures for Expert Search (Threshold-based Match). (a) Precision and Recall observations; (b) F-measure observations.]

We first evaluate the effectiveness of our similarity measures in the context of short-listing a group of experts (e.g., for a recruitment interview). Here, the selected expert profiles are those that exceed a pre-determined similarity threshold. We repeat the search for 10 query profiles over the derived expert profiles using all the approaches listed in Table 1. Figure 3 shows the results from our candidate search process, where we measured the precision, recall, and F-measure of all the approaches (we use the standard definitions of Precision, Recall, and F-measure as defined in [8]). In short, precision represents the fraction of correctly determined experts among those selected by our algorithms (how many of the matched results are the experts we wanted to get). Recall represents the effectiveness of the algorithm in finding all experts (how many experts we missed). F-measure is a compound measure, basically the harmonic mean of precision and recall; a sketch of these metrics follows the observations below. Higher values are better. From Figure 3, we are able to make the following observations.

– All the metrics based on bipartite graph mapping (Bi-*) work very well compared with the standard cosine similarity measurement techniques (Cos-Word and Cos-Con).
– The accuracy of the set-based measures increases with the number of spreading iterations (Cos-10n performs much better than Cos-5n).
– The precision of all our approaches is almost equal, while the recall varies.
– Our algorithms show significant improvements in recall when compared with the standard approaches. Our approaches Bi-* and Cos-10n exhibit up to 20% improvement.
– The recall of our Bi-PATH approach is 100%, while the Bi-EUby2 and Cos-10n approaches exhibit around 90% recall.
– Spreading with 5 iterations (Cos-5n) performs almost equally to the path-based/reachability termination conditions in a general semantic search approach (SAN). This may be suggestive of the maximum diameter of the relevant subgraph consisting of the user's concepts.
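For reference, the metrics reported above can be computed as follows, assuming the set of relevant experts and the set of experts selected by a measure are available (standard definitions, per [8]).

```python
def precision_recall_f1(selected: set, relevant: set):
    """Standard precision, recall, and F-measure for a candidate short-list."""
    tp = len(selected & relevant)    # correctly selected experts
    precision = tp / len(selected) if selected else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1
```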