Exploiting Hierarchical Domain Structure to Compute Similarity

PRASANNA GANESAN, HECTOR GARCIA-MOLINA, and JENNIFER WIDOM
Stanford University

The notion of similarity between objects finds use in many contexts, for example, in search engines, collaborative filtering, and clustering. Objects being compared often are modeled as sets, with their similarity traditionally determined based on set intersection. Intersection-based measures do not accurately capture similarity in certain domains, such as when the data is sparse or when there are known relationships between items within sets. We propose new measures that exploit a hierarchical domain structure in order to produce more intuitive similarity scores. We extend our similarity measures to provide appropriate results in the presence of multisets (also handled unsatisfactorily by traditional measures), for example, to correctly compute the similarity between customers who buy several instances of the same product (say milk), or who buy several products in the same category (say dairy products). We also provide an experimental comparison of our measures against traditional similarity measures, and report on a user study that evaluated how well our measures match human intuition.

Categories and Subject Descriptors: H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing

General Terms: Algorithms

Additional Key Words and Phrases: Similarity measures, hierarchy, collaborative filtering, data mining

This material is based upon work supported by the National Science Foundation under Grants IIS-0085896, IIS-9817799, and IIS-9811947. Prasanna Ganesan is supported by a Stanford Graduate Fellowship.
Authors' address: 353 Serra Mall, #432, Stanford University, Stanford, CA 94305-9025; email: prasannag@cs.stanford.edu.
Permission to make digital/hard copy of all or part of this material is granted without fee for personal or classroom use provided that the copies are not made or distributed for profit or commercial advantage, the ACM copyright/server notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.
© 2003 ACM 1046-8188/03/0100-0064 $5.00
ACM Transactions on Information Systems, Vol. 21, No. 1, January 2003, Pages 64–93.

1. INTRODUCTION

The notion of similarity is used in many contexts to identify objects having common “characteristics.” For instance, a search engine finds documents that are similar to a query or to other documents. A clustering algorithm groups together gene sequences that have similar features. A collaborative filtering system looks for people sharing common interests [Goldberg et al. 1992].

In many cases, the objects being compared are treated as sets or bags of elements drawn from a flat domain. Thus a document is a bag of words, a customer is a bag of purchases, and so on.
The similarity between two objects is often determined by their bag intersection: the more elements two customers purchase in common, the more similar they are considered. In other cases, the objects are treated as vectors in an n-dimensional space, where n is the cardinality of the element domain. The cosine of the angle between two objects is then used as a measure of their similarity [McGill 1983].

We propose enhancing these object models by adding a hierarchy describing the relationships among domain elements. The “semantic knowledge” in the hierarchy helps us identify objects sharing common characteristics, leading to improved measures of similarity.

To illustrate, let us look at a small three-level hierarchy on the music CD domain, as shown in Figure 1. Say customer A buys Beatles CDs b1 and b2, B buys Beatles CDs b3 and b4, and C buys Stones CDs s1 and s2. If we were to use a similarity measure based on set intersections, we would find that the similarity between any two of A, B, and C is zero. The Vector-Space Model would represent A, B, and C as three mutually perpendicular vectors and, therefore, the cosine similarity between any two of them is again zero.

Fig. 1. Music CD hierarchy.

However, looking at the hierarchy of Figure 1, we see that A and B are rather similar since both of them like the Beatles, whereas A and C are somewhat less similar since, although both listen to rock music, they prefer different bands. The similarity between two CDs is reflected in how far apart they are in the hierarchy. In this article, we develop measures that take this hierarchy into account, leading to similarity scores that are closer to human intuition than previous measures.

There are several interesting challenges that arise in using a hierarchy for similarity computations. In our CD example, for instance, customers may purchase CDs from different portions of the hierarchy: for example, customer D in Figure 1 purchases both Beatles as well as Mozart CDs. In such a case it is not as obvious how similar D is to A or B, or to other customers with mixed purchases. As we show, there are multiple ways in which the hierarchy can be used for similarity computations, and in this article we contrast different approaches.

Another challenge is handling multiple occurrences (multisets) at different levels of the hierarchy. For example, say we had another user E who buys a lot of Beatles CDs as well as a Mozart CD m1 (see Figure 1). The question is: Which of D or E is more similar to A? Customer D bought Beatles CD b1, just like A. On the other hand, customer E did not buy that CD, but did buy a lot of other Beatles CDs. The traditional cosine-similarity measure favors multiple occurrences of an element.
That is, if a query word occurs a hundred times in a document, the document is more similar to the query than one in which the query word appears only once. If we use this approach in our example, we would say that E is more similar to A than D is, because E buys 14 Beatles CDs, whereas D buys just one.

Unfortunately, it is not clear that this conclusion is the correct one. E is probably a serious Beatles fan, whereas A and D appear more balanced and similar to each other, so it would also be reasonable to conclude that D is more similar to A than E is. Thus measures like cosine-similarity, although suitable for query-document similarity, do not provide the right semantics for interobject similarity in many other situations. This problem has, in fact, been observed earlier even in the context of inter-document similarity [Shivakumar and Garcia-Molina 1995]. In this article, we study various semantics for multiple occurrences, and provide measures that map to these semantics.

There has been a lot of prior work related to similarity in various domains and, naturally, we rely on some of it for our own work. In Sections 2 and 6 we discuss prior work in detail, but here we make some brief observations.

In our example we have seen that, with traditional measures, customers A, B, and C have zero similarity to each other because their purchases do not intersect. When objects or collections are sparse (i.e., have few elements relative to the domain), intersections tend to be empty and traditional measures have difficulty identifying similar objects. There have been many attempts to overcome this sparsity problem through techniques such as dimension reduction [Sarwar et al. 2000], filtering agents [Sarwar et al. 1998], item-based filtering [Sarwar et al. 2001], and the use of personal agents [Good et al. 1999]. We believe that using a richer data model (i.e., our hierarchy) addresses this problem in a simple and effective way.

Hierarchies are often used to encode knowledge, and have been used in a variety of ways for text classification, for mining association rules, for interactive information retrieval, and various other tasks where similarity plays a role [Feldman and Dagan 1995; Han and Fu 1995; Scott and Matwin 1998; Srikant and Agrawal 1995]. In this article, we focus on the case where attributes are confined to being leaves of the hierarchy. We believe that our work may be extended to deal with polyhierarchies, where the hierarchy is not required to be a strict tree, to permit attributes at multiple levels in the hierarchy, and to deal with generalizations of hierarchies, where similarity relationships between attributes are used to induce similarity relationships between objects consisting of sets of these attributes. We do not cover these extensions in this article. Our goal here is to rigorously study how a domain hierarchy can be used to compute similarity between sets of leaves, and to explore, compare, and evaluate the various options available. Different applications require different notions of similarity, and we expect that an analysis of the specific application would determine which of our proposed similarity measures suits it the best.

There are many domains in which hierarchies exist and can be exploited as we suggest here.
For example, there is an inherent hierarchical structure to the URLs of pages within a single site. This structure can be exploited in clustering user sessions identified from a Web log [Joshi and Krishnapuram 2000; Nasraoui et al. 1999]. Although Nasraoui et al. [1999] do attempt to exploit this hierarchical structure in computing similarity, the proposed similarity measure turns out to be highly unintuitive and can provide arbitrarily low values of similarity even between identical sessions. We would expect the use of our similarity measures to provide a dramatic improvement in the quality of the results.

To name a few other examples, the Open Directory [OPD] is a hierarchy on a subset of pages on the Web. Thus, we can compute the similarity of Web users, for instance, based on a trace of the Web pages they visit. In the music domain, songs can be organized into a hierarchy by genre, band, album, and so on. This hierarchy can then be used, say, to find users with similar tastes, and recommend new songs to them. In the document domain, we can use existing hierarchies such as WordNet [Miller et al. 1990] to compute document similarity. In all of these cases, our general-purpose extended similarity measures can be used to improve functionality.

In summary, the main contributions of this article are the following.

—We introduce similarity measures that can exploit hierarchical domain structure, leading to similarity scores that are more intuitive than the ones generated by traditional similarity measures.
—We extend these measures to deal with multiple occurrences of elements (and of ancestors in the hierarchy), such as those exhibited in A and E in Figure 1, in a semantically meaningful fashion.
—We analyze the differences between our various measures, compare them empirically, and show that all of them are very different from measures that don’t exploit the domain hierarchy.
—We report the findings of a user study to evaluate the quality of the various measures.

Fig. 2. Evolution of similarity measures.

Figure 2 shows the evolution of the measures that we discuss, and serves as a roadmap for the rest of the article. Section 2 describes traditional measures of similarity. Section 3 introduces our First-Generation measures, which exploit a hierarchical domain structure and are obtained as natural generalizations of the traditional measures. Section 4 introduces the multiple-occurrence problem, and evolves the measures into our Second-Generation measures. Section 5 is devoted to a comparison of these measures and their evaluation. Section 6 describes related work.
2. TRADITIONAL SIMILARITY MEASURES

Given two objects, or collections of elements, C1 and C2, our goal is to compute their similarity sim(C1, C2), a real number in [0, 1]. The similarity should tend to 1 as C1 and C2 have more and more common “characteristics.” There is no universal notion of which “characteristics” count, and hence the notion of similarity is necessarily subjective. Here we define several notions of similarity, and discuss how intuitive they are.

2.1 The Set/Bag Model

In many applications, the simplest approach to modeling an object is to treat it as a set, or a bag, of elements, which we term a collection. The similarity between two collections is then computed on the basis of their set or bag intersection. There are many different measures in use, which differ primarily in the way they normalize this intersection value [van Rijsbergen 1979]. We describe two of them here.

Let X and Y be two collections. Jaccard's Coefficient, sim_Jacc(X, Y), is defined to be

$$\mathit{sim}_{Jacc}(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}.$$

Thus, in Figure 1, sim_Jacc(A, D) = 1/(2 + 2 − 1) = 1/3. Dice's Coefficient, which we denote sim_Dice(X, Y), is defined to be

$$\mathit{sim}_{Dice}(X, Y) = \frac{2\,|X \cap Y|}{|X| + |Y|}.$$

Once again referring to Figure 1, sim_Dice(A, D) = (2 · 1)/(2 + 2) = 1/2. Other such measures include the Inclusion Measure, the Overlap Coefficient, and the Extended Jaccard Coefficient [Strehl et al. 2000; van Rijsbergen 1979].
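As a concrete illustration (not from the paper), here is a minimal Python sketch of the two coefficients over bags of purchases such as those in Figure 1. Treating bag intersection and union as elementwise minimum and maximum of counts, and returning 1.0 for two empty collections, are our conventions, not the paper's.

```python
from collections import Counter

def jaccard(x, y):
    """Jaccard's coefficient |X ∩ Y| / |X ∪ Y| on bags (multisets)."""
    cx, cy = Counter(x), Counter(y)
    inter = sum((cx & cy).values())  # elementwise min of counts
    union = sum((cx | cy).values())  # elementwise max of counts
    return inter / union if union else 1.0  # convention for two empty bags

def dice(x, y):
    """Dice's coefficient 2|X ∩ Y| / (|X| + |Y|)."""
    cx, cy = Counter(x), Counter(y)
    inter = sum((cx & cy).values())
    total = len(x) + len(y)
    return 2 * inter / total if total else 1.0

# Customers A and D from Figure 1:
A, D = ["b1", "b2"], ["b1", "m1"]
print(jaccard(A, D), dice(A, D))  # 0.333..., 0.5 -- the worked values above
```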
2.2 The Vector-Space Model

The Vector-Space Model is a popular model in the information retrieval domain [McGill 1983]. In this model, each element in the domain is taken to be a dimension in a vector space. A collection is represented by a vector, with components along exactly those dimensions corresponding to the elements in the collection. One advantage of this model is that we can now weight the components of the vectors, by using schemes such as TF-IDF [Salton and Buckley 1988]. The Cosine-Similarity Measure (CSM) defines the similarity between two vectors to be the cosine of the angle between them. This measure has proven to be very popular for query-document and document-document similarity in text retrieval [Salton and Buckley 1988]. Again referring to Figure 1, and using uniform weights of 1:

$$\mathit{sim}_{Cos}(A, D) = \frac{\vec{A} \cdot \vec{D}}{|\vec{A}|\,|\vec{D}|} = \frac{\vec{b_1} \cdot \vec{b_1} + \vec{b_1} \cdot \vec{m_1} + \vec{b_2} \cdot \vec{b_1} + \vec{b_2} \cdot \vec{m_1}}{\sqrt{\vec{b_1} \cdot \vec{b_1} + \vec{b_2} \cdot \vec{b_2}}\,\sqrt{\vec{b_1} \cdot \vec{b_1} + \vec{m_1} \cdot \vec{m_1}}} = \frac{1 + 0 + 0 + 0}{\sqrt{1 + 1}\,\sqrt{1 + 1}} = \frac{1}{2}.$$

Collaborative-filtering systems such as GroupLens [Resnick et al. 1994] use a similar vector model, with each dimension being a “vote” of the user for a particular item. However, they use the Pearson Correlation Coefficient as a similarity measure, which first subtracts the average of the elements from each of the vectors before computing their cosine similarity. Formally, this similarity is given by the formula

$$c(X, Y) = \frac{\sum_j (x_j - \bar{x})(y_j - \bar{y})}{\sqrt{\sum_j (x_j - \bar{x})^2 \sum_j (y_j - \bar{y})^2}},$$

where x_j is the value of vector X in dimension j, x̄ is the average value of X along a dimension, and the summation is over all dimensions in which both X and Y are nonzero [Resnick et al. 1994]. Inverse User Frequency may be used to weight the different components of the vectors. There have also been other enhancements, such as default voting and case amplification [Breese et al. 1998], which modify the values of the vectors along the various dimensions.
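The sketch below, again only illustrative, computes both measures over sparse vectors represented as {element: value} dictionaries. The paper does not pin down the population over which x̄ is averaged; averaging each user's votes over just the co-rated dimensions is one common reading, and is an assumption here.

```python
import math

def cosine(x, y):
    """Cosine similarity of sparse vectors given as {element: weight} dicts."""
    dot = sum(w * y.get(e, 0.0) for e, w in x.items())
    nx = math.sqrt(sum(w * w for w in x.values()))
    ny = math.sqrt(sum(w * w for w in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0

def pearson(x, y):
    """Pearson correlation over the dimensions where both users vote,
    in the spirit of GroupLens [Resnick et al. 1994]."""
    common = [e for e in x if e in y]
    if not common:
        return 0.0
    # Assumed convention: average each user's votes over the co-rated dims.
    mx = sum(x[e] for e in common) / len(common)
    my = sum(y[e] for e in common) / len(common)
    num = sum((x[e] - mx) * (y[e] - my) for e in common)
    den = math.sqrt(sum((x[e] - mx) ** 2 for e in common) *
                    sum((y[e] - my) ** 2 for e in common))
    return num / den if den else 0.0

A = {"b1": 1.0, "b2": 1.0}
D = {"b1": 1.0, "m1": 1.0}
print(cosine(A, D))  # 0.5, matching the worked example above
```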
2.3 Measures Exploiting a Hierarchy

There are quite a few distance measures proposed in the literature that may be adapted to compute similarity while making use of a hierarchy. One such measure, popular in a variety of domains, is Earth-mover's distance [Rubner et al. 1998; Chakrabarti et al. 2000]. This distance computes the dissimilarity between two collections of points in space by calculating the work to be done in moving “mounds of earth,” located at points in the first collection, to fill “holes,” located at the points in the second collection. It is possible to map bags in our domain into collections of points in space by using the hierarchy to compute the distance in space between any two elements of the domain. We could thus apply Earth-mover's distance to our problem, but this happens to be unsatisfactory for many reasons, some of which are demonstrated by our user study, described in Section 5.2. We also provide a detailed analysis of the reasons underlying the inapplicability of such measures in the extended version of our article [Ganesan et al. 2002].

As noted in the introduction, there have been attempts to exploit hierarchical URL structure in clustering user sessions from a Web log [Joshi and Krishnapuram 2000; Nasraoui et al. 1999]. Unfortunately, the proposed measure turns out to be unintuitive, and can provide arbitrarily low similarity values between identical sessions. There are also other measures, such as tree edit distance and MAC [Ioannidis and Poosala 1999], which could conceivably be applied to our problem. We provide more details on these measures in Section 6, but here we simply note that neither of these measures proves to be a good fit for the problem at hand.

3. THE FIRST GENERATION

We now describe two new measures we developed, based fairly directly on the traditional measures, that exploit a hierarchical domain structure in computing similarity. We first describe our model formally, define some associated concepts, and then proceed to develop the measures.

3.1 The Model

Let U be a rooted tree, with all nodes carrying a distinct label. We do not impose any restrictions on the shape of U. Each node can have arbitrary fanout, and the leaves of U can be at different levels. Let L_U be the set of all labels in U, and let LL_U be the set of all labels on the leaves of U. LL_U is the element domain, on which there is a superimposed hierarchy described by U. In our music example, LL_U = {b1, b2, ..., s1, s2, ..., m1, m2, ..., c1, c2, ...}. A collection C is a bag whose elements are drawn from LL_U.

Let W be a function from LL_U to the set of real numbers. W is an a priori weight function on the leaves of U, which captures the relative importance of different elements. There are many ways of deriving this weight function. It could be an Inverse User Frequency such as the one defined in Breese et al. [1998]. It could also be corpus-independent, and be determined by attributes of the elements, such as their cost (in monetary terms). Of course, the weight function also can be uniform.

Since there is a hierarchical structure imposed on LL_U, a collection C induces a tree, a subgraph of U that consists of the ancestral paths of each leaf in C. We refer to trees that are induced in this manner as induced trees. Notice that, since C is a bag, the induced tree might have more than one leaf with the same label. Figure 3 shows the induced trees for the collections A and B from Figure 1.

Fig. 3. Induced trees for collections A and B.

As is conventional, the depth of a node in the hierarchy is the number of edges on the path from the root of U to that node. Given any two leaves l1 and l2 in U, define the Lowest Common Ancestor LCA(l1, l2) to be the node of greatest depth that is an ancestor of both l1 and l2. This LCA is always well defined, since the two leaves have at least one common ancestor—the root node—and no two common ancestors can have the same depth. In Figure 1, LCA(b1, b2) = b, and LCA(b1, s1) = r.
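These definitions translate directly into code with parent pointers. The following sketch is a hypothetical encoding of Figure 1: the text names only the nodes r (rock), b (Beatles), s (Stones), m (Mozart), and the leaves, so the root label "music", the classical-genre label "cl", and the second classical artist "c" are our illustrative guesses.

```python
# Hypothetical child -> parent encoding of Figure 1 ("music" is the root).
PARENT = {
    "r": "music", "cl": "music",                 # rock and classical genres
    "b": "r", "s": "r", "m": "cl", "c": "cl",    # Beatles, Stones, Mozart, ...
    "b1": "b", "b2": "b", "b3": "b", "b4": "b",
    "s1": "s", "s2": "s", "m1": "m", "m2": "m", "c1": "c", "c2": "c",
}

def depth(node):
    """Number of edges on the path from the root to node."""
    d = 0
    while node in PARENT:
        node, d = PARENT[node], d + 1
    return d

def ancestors(node):
    """Yield node itself, then its ancestors in order of decreasing depth."""
    yield node
    while node in PARENT:
        node = PARENT[node]
        yield node

def lca(l1, l2):
    """Lowest common ancestor of two nodes."""
    anc1 = set(ancestors(l1))
    return next(a for a in ancestors(l2) if a in anc1)

print(lca("b1", "b2"), lca("b1", "s1"))  # b r  (as stated in the text)
```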
3.2 The Generalized Vector-Space Model

To illustrate how the Vector-Space Model can be generalized to take the hierarchy into account, consider Figure 1 again. Let us say that the unit vector corresponding to a leaf l is represented by $\vec{l}$. Now, according to the traditional cosine-similarity measure, all leaf unit vectors are perpendicular to each other, which means that the dot product of any two of them is zero. The dot product of a unit vector with itself is equal to 1.

We have already observed that b1 is, intuitively, somewhat similar to b3, since they are both Beatles CDs. Thus, if A buys b1 and B buys b3, we need to make this fact contribute something to the similarity of A and B; that is, we want $\vec{b_1} \cdot \vec{b_3}$ to be nonzero. In the vector space, we want to assert that $\vec{b_1}$ and $\vec{b_3}$ are not really perpendicular to each other, since they are somewhat similar.

We use the hierarchy to decide exactly what value to assign to this dot product. For example, let us decide that $\vec{b_1} \cdot \vec{b_3} = \frac{2}{3}$, since they have a common ancestor that is two-thirds of the way down from the root. By a similar reasoning process, we let $\vec{b_1} \cdot \vec{s_1}$ be $\frac{1}{3}$. We let $\vec{b_1} \cdot \vec{m_1}$ continue to be 0, since they are in different sections of the hierarchy and don’t seem to have anything to do with each other, except for the fact that they are both music CDs.

Formally, let the set of leaf labels LL_U be {l1, l2, l3, ..., ln}. Let Count_A(l_i) be the number of times l_i occurs in collection A. Then, collection A is represented by the vector $\vec{A} = \sum_{i=1}^{n} a_i \vec{l_i}$, where $a_i = W(l_i) \cdot Count_A(l_i)$ for i = 1..n. This usage of weights is identical to the standard Vector-Space Model's.

For any two elements l1 and l2, we define

$$\vec{l_1} \cdot \vec{l_2} = \frac{2 \cdot depth(LCA_U(l_1, l_2))}{depth(l_1) + depth(l_2)}.$$

This definition is consistent, since the right side of this equation always lies between 0 and 1. Note that the dot product is equal to 1 if and only if l1 = l2.

We continue to measure similarity by the cosine-similarity measure, except that we have now dropped the assumption that the different “components” of the vector are perpendicular to each other. If collection A is represented by the vector $\vec{A} = \sum_i a_i \vec{l_i}$ and B by the vector $\vec{B} = \sum_i b_i \vec{l_i}$, then

$$\vec{A} \cdot \vec{B} = \sum_{i=1}^{n} \sum_{j=1}^{n} a_i b_j \, \vec{l_i} \cdot \vec{l_j}.$$

Again, this equation is identical to the standard Vector-Space Model, except that $\vec{l_i} \cdot \vec{l_j}$ is not equal to 0 whenever i ≠ j. Finally, the cosine similarity between A and B is given by the traditional formula

$$\mathit{sim}(A, B) = \frac{\vec{A} \cdot \vec{B}}{\sqrt{\vec{A} \cdot \vec{A}}\,\sqrt{\vec{B} \cdot \vec{B}}}.$$

We call this measure the Generalized Cosine-Similarity Measure (GCSM).
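As a sanity check on these formulas, here is a sketch of GCSM built on the depth() and lca() helpers from the previous sketch (so it inherits their assumed node labels). With uniform weights it reproduces the GCSM column of Table I in Section 3.4.

```python
import math
from collections import Counter

def leaf_dot(l1, l2):
    """Generalized dot product of two leaf unit vectors (Section 3.2)."""
    return 2 * depth(lca(l1, l2)) / (depth(l1) + depth(l2))

def gcsm(c1, c2, W=lambda l: 1.0):
    """Generalized Cosine-Similarity Measure over bags of leaves."""
    a, b = Counter(c1), Counter(c2)

    def dot(x, y):
        # a_i = W(l_i) * Count(l_i); expand the double sum over leaf pairs.
        return sum(W(li) * ni * W(lj) * nj * leaf_dot(li, lj)
                   for li, ni in x.items() for lj, nj in y.items())

    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

A, B, C = ["b1", "b2"], ["b3", "b4"], ["s1", "s2"]
print(gcsm(A, B), gcsm(A, C))  # 0.8 and 0.4, matching Table I
```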
3.3 The Optimistic Genealogy Measure

The Generalized Cosine-Similarity Measure from Section 3.2 is not the only, or even the most intuitive, way to exploit a hierarchy for similarity. Next we present a second, more natural and intuitive measure, and contrast it with GCSM. Intuitively, the Optimistic Genealogy Measure computes a “similarity contribution” for each element in one collection, and then takes the weighted average of these contributions to be the similarity between the two collections. The contribution of an element is determined by how good a “match” it has in the other collection. (The reason for the measure's name becomes clear in the next section.)

Let C1 and C2 be the collections to be compared, and let T1 and T2 be their induced trees as defined in Section 3.1. For any leaf l1 in T1, define LCA_{T1,T2}(l1) to be the ancestor of l1 of greatest depth that is present in T2, that is, the lowest of the LCAs that l1 shares with the leaves of T2. This LCA provides an indication of how good the “best match” for l1 can be. For example, for the trees in Figure 3, LCA_{A,B}(b1) is Beatles, since it is present in tree B, and is the lowest ancestor of b1 that is present in B. (We abuse notation and let A and B refer both to the two collections and to their corresponding induced trees.) Now define

$$\mathit{match}_{T_1,T_2}(l_1) = \{l_2 \in C_2 \mid LCA(l_1, l_2) = LCA_{T_1,T_2}(l_1)\}.$$

That is, match_{T1,T2}(l1) is the set of all leaves in T2 that can be the “best match” for l1. In Figure 3, match_{A,B}(b1) is the set {b3, b4}, since both elements match b1 at its parent Beatles. Next, we define

$$\mathit{leafsim}_{T_1,T_2}(l_1) = \frac{depth(LCA_{T_1,T_2}(l_1))}{depth(l_1)}.$$

The value leafsim_{T1,T2}(l1) measures how similar l1 is to its best match in T2. If l1 itself is present in T2, then LCA_{T1,T2}(l1) = l1, and therefore leafsim_{T1,T2}(l1) = 1. On the other hand, if no ancestor of l1 except for the root is present in T2, we have depth(LCA_{T1,T2}(l1)) = 0 and, therefore, leafsim_{T1,T2}(l1) = 0. In Figure 3, leafsim_{A,B}(b1) is 2/3, and leafsim_{A,B}(b2) is also 2/3.

Finally, for any two collections C1 and C2 with associated induced trees T1 and T2, respectively, we define the Optimistic Genealogy Measure (OGM) as

$$\mathit{sim}(C_1, C_2) = \frac{\sum_{l_1 \in C_1} \mathit{leafsim}_{T_1,T_2}(l_1) \cdot W(l_1)}{\sum_{l_1 \in C_1} W(l_1)}. \qquad (1)$$

This is just the weighted average of the individual leafsim values of the leaves in T1. Note that, since C1 is a bag, the summation is over all members of the bag, and is not the set average. In our example, sim(A, B) is also 2/3, since the contributions from b1 and b2 are identical.

Note that OGM is, in general, asymmetric; that is, sim(A, B) ≠ sim(B, A). The symmetric similarity score between A and B is defined to be the average of the two asymmetric scores.
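Continuing the running sketch (again leaning on the ancestors() and depth() helpers from the Section 3.1 example), OGM reduces to a few lines. The set of node labels of an induced tree suffices, since leafsim only asks membership questions.

```python
def leafsim(l1, t2_nodes):
    """Similarity contribution of leaf l1 against induced tree T2 (a node set)."""
    # ancestors() yields l1 first, then successively shallower ancestors,
    # so the first hit is the deepest ancestor of l1 present in T2.
    deepest = next(a for a in ancestors(l1) if a in t2_nodes)
    return depth(deepest) / depth(l1)

def ogm(c1, c2, W=lambda l: 1.0):
    """Optimistic Genealogy Measure, Eq. (1); asymmetric in (c1, c2)."""
    t2_nodes = {a for leaf in c2 for a in ancestors(leaf)}  # nodes of T2
    return (sum(leafsim(l, t2_nodes) * W(l) for l in c1) /
            sum(W(l) for l in c1))

def ogm_symmetric(c1, c2):
    """Average of the two asymmetric scores, as defined above."""
    return (ogm(c1, c2) + ogm(c2, c1)) / 2

A, B, D = ["b1", "b2"], ["b3", "b4"], ["b1", "m1"]
print(ogm(A, B))            # 0.666..., as in the running example
print(ogm_symmetric(A, D))  # 0.666..., matching the OGM column of Table I
```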
3.4 Discussion

Table I shows the similarity values computed by the various traditional measures discussed in Section 2, as well as by GCSM and OGM, for the collections in Figure 1. JC stands for Jaccard's Coefficient and DC for Dice's Coefficient. We compute similarity between every pair of customers, except for customer E, whom we use in a subsequent example. The values shown are symmetric similarity values, with the average of the two asymmetric values being used for OGM.

Table I. Comparison of the Various Measures

  sim     JC     DC     CSM    GCSM   OGM
  A, B    0      0      0      0.8    0.67
  A, C    0      0      0      0.4    0.33
  A, D    0.33   0.5    0.5    0.65   0.67
  B, C    0      0      0      0.4    0.33
  B, D    0      0      0      0.52   0.5
  C, D    0      0      0      0.26   0.25

As motivated in Section 1, we would expect to find that customers A and B are more similar to each other than A and C; C and D should be even less similar. From Table I, we see that both of our First-Generation measures produce this result, whereas the traditional measures do not.

Intuitively, it is not clear whether sim(A, D) should be higher than sim(A, B). There is a case for saying that sim(A, B) is higher, since both A and B are “pure” Beatles persons. One could also contend that A and D have a CD in common, whereas A and B have none, and, therefore, that sim(A, D) ought to be higher. OGM gives them the same similarity values, and GCSM makes sim(A, B) higher. The traditional measures claim that sim(A, D) is higher, since they do not detect any similarity between A and B. GCSM and OGM can be tuned to adjust the conclusion in cases such as these. We discuss how to achieve this tuning in the extended version of this article [Ganesan et al. 2002].

3.4.1 Contrasting GCSM with OGM. Having seen how the First-Generation measures fare on our simple example when compared with the traditional measures, we now examine the basic differences between GCSM and OGM.

—GCSM uses many-to-many matches, whereas OGM uses many-to-one matches. In GCSM, the similarity contribution of an element in one collection is gathered from all elements in the other collection that have a nonzero similarity to that element. On the other hand, OGM simply uses the best similarity score it can find for each element.
—GCSM is a symmetric measure, which means that we will not get high similarity scores if one collection is a subset of the other [Shivakumar and Garcia-Molina 1995]. OGM can be used as an asymmetric measure, and conveys more information that may help us identify different semantic notions of similarity. For example, if we wanted to find an “expert” for a particular user A (i.e., someone who is knowledgeable about the things that A buys), we would look for a user B such that her purchases are close to a superset of A's purchases. Thus sim(A, B) would be very high, but sim(B, A) might be fairly low.
—GCSM has worst-case complexity quadratic in the number of elements in the two collections. OGM has complexity linear in the number of nodes in the induced trees of the two collections.