Ontology-Based News Recommendation

Wouter IJntema (wouterijntema@gmail.com), Flavius Frasincar (frasincar@ese.eur.nl), Frank Goossen (frankgoossen@gmail.com), Frederik Hogenboom (fhogenboom@ese.eur.nl)
Erasmus University Rotterdam, PO Box 1738, NL-3000 Rotterdam, the Netherlands

ABSTRACT
Recommending news items is traditionally done by term-based algorithms like TF-IDF. This paper concentrates on the benefits of recommending news items using a domain ontology instead of using a term-based approach. For this purpose, we propose Athena, which is an extension to the existing Hermes framework. Athena employs a user profile to store terms or concepts found in news items browsed by the user. Based on this information, the framework uses a traditional method based on TF-IDF, and several ontology-based methods to recommend new articles to the user. The paper concludes with the evaluation of the different methods, which shows that the new ontology-based method that we propose in this paper performs better (w.r.t. accuracy, precision, and recall) than the traditional method and, with the exception of one measure (recall), also better than the other considered ontology-based approaches.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Information filtering, Relevance feedback; I.2.4 [Artificial Intelligence]: Knowledge Representation Formalisms and Methods - Representation languages

General Terms
Design, Experimentation

Keywords
Recommender systems, User profiling, Ontology

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. EDBT 2010, March 22–26, 2010, Lausanne, Switzerland. Copyright 2010 ACM 978-1-60558-945-9/10/0003 ...$10.00

1. INTRODUCTION
In the last decade, the Web has become increasingly important in delivering news to individuals. Many people read news articles for different purposes and the Web is the best platform to find them. However, these news items are not personalized to one's interests. In this paper we propose an approach based on rich semantics for delivering the most interesting news items to the user.

Recommending news items can be done by calculating the similarity between the current news item and the previously browsed news items. Traditionally, this similarity is calculated by an algorithm that is content-based, which practically means that every word in a news item is taken into account. However, a news item often contains key concepts that capture the semantic context of the article. Recommenders that focus on the key concepts might produce faster and more accurate recommendations than the content-based recommenders, since they don't need to consider all words, and unlike words, concepts are not ambiguous. Such an approach is called a semantic-based recommendation system. Other recommendation systems are either collaborative or hybrid, and are outside the scope of this paper.

In [7], we introduced the Hermes framework, which provides a semantic method for personalizing news items. It uses an ontology to store concepts and their relations to the news items. Our paper focuses on a new way of recommending, based on concepts found in the news items, by employing some of the functionalities offered by Hermes.
In order to recommend news items, first we model the user’s browsing behavior. By recording a history of read news items, a profile of the user can be made. Based on this profile, it is possible to propose new news items that the user might find interesting. The goal of our research is to investigate the benefit of recommending news items by using domain ontology-based recommenders with respect to traditional term-based recommenders, and to determine which of the ontology-based recommenders performs best. In this paper we propose Athena, which is an extension of the Hermes framework. Athena is able to observe user behavior and generate recommendations based on this behavior. The program uses a traditional term-based recommender and several semantic-based recommendation algorithms to compare unread news items with the user profile. The news items having the highest similarity with the user profile are recommended to the user. The structure of this paper is as follows. First, we discuss the related work in Sect. 2. Section 3 presents the Athena framework, the Hermes framework, and the Hermes News Portal (HNP), which is the implementation of the Hermes framework. After that, Sect. 4 describes the implementation
of Athena as a plug-in for the HNP. Then, Sect. 5 gives the evaluation of the implemented methods. Section 6 concludes the paper and proposes future work.

2. RELATED WORK
Recommending news items or other documents based on the user's interest has attracted the attention of many researchers. Several adaptive Web-based news services have been developed which focus on personal recommendation of news items. These systems vary in application domain, platform, development methodology, levels of adaptivity, etc. We identify four categories of recommendation systems: content-based, semantic-based, collaborative, and hybrid systems. In this paper, we limit the discussion to content-based and semantic-based recommendation methods.

YourNews [1] is a personalized news system that employs a content-based approach and intends to increase the transparency of adapted news delivery by allowing the user to adapt the user profile. Another content-based approach is News Dude [2], a personal news recommending agent that utilizes TF-IDF in combination with the Nearest Neighbor algorithm in order to recommend news items to the user. [3] states, supported by Singhal's findings [12], that the performance of TF-IDF, which is employed in YourNews and News Dude, decreases as the length of the article, and thus the number of words, increases. In addition, because the semantics of a text are ignored, news items that are semantically related to the news items in the user profile fail to be recommended by the system.

[8] provides a practical approach to measure the relatedness or similarity between RSS news items. Their method is based on the semantic relatedness between RSS items. As in our approach, they determine the relationships between words using WordNet [6]. Their focus is on the linguistic neighborhood of a word, in which general relationships such as synonymy, hyponymy, and meronymy between words are considered. The difference with our approach is that we make use of an ontology. Besides the general relationships between words, the ontology covers specific relationships like is-competitor-of, has-product, etc. Despite this difference, their method is applicable in our context, and therefore we will compare both approaches.

In [10] ontological user profiling is employed for recommending academic research papers. While the is-a relationships it relies on are rich in semantics, we find this approach limited, as it fails to consider other types of concept relationships. The authors propose a classification algorithm, based on the k-Nearest Neighbor classifier, that assigns topics to papers. In our approach, GATE [4] is employed to classify the content of an article by using several language processing techniques. This enables the system to not only recommend full articles, but also possibly recommend a snippet of an article. Another difference lies in the construction of the user profile: in [10], the user can adjust the profile. However, as [1] explains, adjusting the user profile might harm the quality of the recommendations, so in our approach the user is not allowed to change the profile. Finally, in [10] recommendations are made by combining collaborative filtering techniques with limited semantic-based recommendations that only employ is-a relations, while our system solely employs semantic-based recommendation techniques that utilize more types of relationships between concepts.

3. ATHENA
In this paper we propose Athena, which is an extension to the Hermes framework. Subsection 3.1 explains the Hermes framework and how it contributes to the recommendation of news items. Subsection 3.2 explains how the user profile is constructed. In subsections 3.3 and 3.4 we discuss some existing content-based and semantic-based recommendation methods, respectively. In subsection 3.5 we introduce the ranked recommendation method, our semantic-based recommendation method.
3.1 Hermes
Athena is an extension to the Hermes framework [7], a framework used to build a news personalization service. The system can be described by input, internal processing, and output. The input is composed of predefined RSS feeds of news items and concepts selected by the user. The internal processing is the classification of these news items using concepts from a knowledge base. The output is defined as the personalized news items based on the selected concepts.

3.1.1 The Ontology
The Hermes framework offers a semantic-based approach for retrieving news items related, directly or indirectly, to the concepts of interest from a domain ontology, which is called the knowledge base [7]. The ontology consists of classes, e.g., Company and CEO, and the relationships between these classes, like isCEOOf and its inverse hasCEO. A concept is defined as either a class or an instance of a class, e.g., Company and Microsoft. The knowledge base is constructed and maintained by a domain expert, with financial information obtained from Yahoo! Finance.

3.1.2 The Hermes News Portal
The Hermes News Portal (HNP) is a Java implementation of the Hermes framework [7]. It allows the user to query the news and view the knowledge base. It uses Jena for manipulating and reasoning with the OWL ontologies. For querying, it employs SPARQL and tSPARQL [7], which adds time functionalities to the queries. The classification of the news articles is done using GATE [4] and the WordNet [6] semantic lexicon.

3.2 User Profile Construction
Recommending news items starts with building a user profile. A user profile can be defined by keeping track of which articles the user has read so far. Those articles will provide us with information about the user's interests. The user profile is constructed in different ways. For concept equivalence, binary cosine, and Jaccard, the profile is a set of concepts from the articles the user has read.
The semantic relatedness approach creates a vector with the distinct concepts from the user profile and assigns a weight to each concept. The ranked recommendation method also uses a vector of distinct concepts from the read articles and assigns a rank to each concept. The difference in user profile construction between the latter two approaches is the method used to compute the corresponding weights.

3.3 Content-Based Recommendation
A well-known term weighting method is TF-IDF (term frequency-inverse document frequency) [11]. A classic approach in comparing documents is the use of TF-IDF together with the cosine similarity measure. TF-IDF is a statistical method used to determine the relative importance of a word within a document in a collection (or corpus) of documents.

As discussed in [1], before calculating the TF-IDF values, the stop words are filtered from the document. After stop word removal, the remaining words are stemmed using a stemmer. This process reduces words like 'process', 'processor', 'processing', and 'processed' back to their root word 'process'.

The TF-IDF measure can be determined by first calculating the term frequency (TF), which indicates the importance of a term t_i within a document d_j. By computing the inverse document frequency (IDF), the general importance of the term in a set of documents can be captured.

The objective is to compare any new document against the user profile. Therefore a vector is calculated for the user profile. This vector contains the TF-IDF values of the 100 words with the highest TF-IDF value from the documents that have been read by the user. Subsequently, a vector based on the total set of documents is created in the same manner for the new document that is being compared to the user profile. By calculating the cosine measure of the news item and the user profile, the similarity can be determined. The articles with the highest similarity value are considered to be the most similar to the user profile and are recommended to the user.

3.4 Semantic-Based Recommendation
In traditional forms of text comparison, all words in the text are considered. In addition to this, there is no relation between different words. For instance, it is not possible to determine the relation between Google and Microsoft. But a user who is interested in news regarding his stocks in Google might also be interested in news about Microsoft, because it is a competitor of Google. Using an ontology that covers those relations might therefore be useful in recommending new articles.
To illustrate how we accomplished this, we will first discuss a few simple methods and then conclude with a complex method.

3.4.1 Concept Equivalence
We start with a very simple technique which only considers the equivalent concepts. The ontology contains a set of n concepts:

\[ C = \{c_1, c_2, c_3, \cdots, c_n\} . \tag{1} \]

The user profile consists of p concepts identified by Hermes in the news previously read by the user. A concept is present in a news item if one of the lexical representations of the concept is found in the news item and the meaning of this lexical representation in the context of the news item corresponds to the meaning of the concept as defined in the domain ontology. The user profile can be represented as the following set:

\[ U = \{c^u_1, c^u_2, c^u_3, \cdots, c^u_p\}, \text{ where } c^u_i \in C . \tag{2} \]

A news article can also be formulated as a set of q concepts that appear in the article:

\[ A = \{c^a_1, c^a_2, c^a_3, \cdots, c^a_q\}, \text{ where } c^a_j \in C . \tag{3} \]

Using sets of concepts makes it impossible to compute the similarity using the regular cosine measure, since this measure requires a vector of values, like TF-IDF values. The interestingness of a new news item is determined by computing the intersection between the previous two sets:

\[ \mathrm{Similarity}(U, A) = \begin{cases} 1 & \text{if } |U \cap A| > 0 \\ 0 & \text{otherwise} \end{cases} . \tag{4} \]

If this results in 1, the article is considered interesting, otherwise it is considered not interesting.

3.4.2 Binary Cosine
To compute the similarity between two texts, we can also use the binary cosine similarity coefficient:

\[ B(U, A) = \frac{|U \cap A|}{|U| \times |A|} , \tag{5} \]

where $|U \cap A|$ represents the number of concepts in the intersection of U and A, and $|U|$ and $|A|$ are respectively the number of concepts in U and A.

3.4.3 Jaccard
The Jaccard similarity coefficient can be computed in a similar manner:

\[ J(U, A) = \frac{|U \cap A|}{|U \cup A|} , \tag{6} \]

where $|U \cap A|$ is the number of concepts in the intersection of U and A, and $|U \cup A|$ is the number of concepts in the union of U and A.
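The three set-based measures of Eqs. (4), (5), and (6) can be sketched in a few lines of Python. This is a minimal illustration, not the Athena implementation: the function names and the example concept labels are assumptions made here, concepts are modeled as plain strings, and returning 0 for empty sets is a convention chosen for the sketch. The binary cosine follows Eq. (5) as stated, dividing by |U| × |A|.

```python
from typing import AbstractSet


def concept_equivalence(profile: AbstractSet[str], article: AbstractSet[str]) -> int:
    """Eq. (4): 1 if the profile and the article share at least one concept, else 0."""
    return 1 if profile & article else 0


def binary_cosine(profile: AbstractSet[str], article: AbstractSet[str]) -> float:
    """Eq. (5) as stated in the text: |U ∩ A| / (|U| × |A|)."""
    if not profile or not article:
        return 0.0  # assumed convention for empty sets (not defined in the text)
    return len(profile & article) / (len(profile) * len(article))


def jaccard(profile: AbstractSet[str], article: AbstractSet[str]) -> float:
    """Eq. (6): |U ∩ A| / |U ∪ A|."""
    union = profile | article
    if not union:
        return 0.0  # assumed convention for two empty sets
    return len(profile & article) / len(union)


if __name__ == "__main__":
    # Hypothetical concept sets for a user profile U and a news article A.
    U = {"Google", "Microsoft", "Yahoo!"}
    A = {"Google", "Apple"}
    print(concept_equivalence(U, A))
    print(binary_cosine(U, A))
    print(jaccard(U, A))
```

Because all three measures only need the intersection and the set sizes, they are cheap to evaluate per article; what they cannot capture is any relation between distinct concepts.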
Jaccard computes the number of elements in the intersection of the concepts found in the user profile and the news item, relative to the number of concepts in the union of these two sets.

3.4.4 Semantic Relatedness
In [8] the focus is on the semantic relationship between words. The semantic neighborhood of a concept $c_i \in C$ is defined as the set of concepts related to it via the synonymy ($\equiv$), hyponymy ($\prec$), and meronymy ($\ll$) relations. Our ontology covers more relations than only the linguistic relations. Therefore the semantic neighborhood of concept $c_i$ includes each concept that is directly related to the concept $c_i$ (including $c_i$):

\[ N(c_i) = \{c^i_1, c^i_2, \cdots, c^i_n\} . \tag{7} \]

A text $t_k$ can be described by a set of concepts:

\[ CS_k = \{c^k_1, c^k_2, \cdots, c^k_m\} . \tag{8} \]

When comparing two texts, $t_i$ and $t_j$, a vector in n-dimensional space can be created, according to the vector space model:

\[ V_l = \left( \langle c^l_1, w^l_1 \rangle, \cdots, \langle c^l_p, w^l_p \rangle \right) , \tag{9} \]

where $l \in \{i, j\}$, $w_i$ represents the weight associated to the concept $c_i$, and $p = |CS_i \cup CS_j|$ is the number of distinct concepts in $CS_i$ and $CS_j$. If the concept $c_i$ is referenced in $CS_j$ then $w_i = 1$, otherwise it is computed based on the
maximum enclosure similarity it has with another concept $c_j$ in its corresponding vector $V_j$. This takes into account the global semantic neighborhood of each concept as follows:

$$w_i = \begin{cases} 1 & \text{if } freq(c_i \text{ in } CS_j) > 0 \\ \max_j(ES(c_i, c_j)) & \text{otherwise} \end{cases} \tag{10}$$

where

$$ES(c_i, c_j) = \frac{|N(c_i) \cap N(c_j)|}{|N(c_i)|}. \tag{11}$$

Finally the similarity between $t_i$ and $t_j$ is computed using the cosine measure:

$$SemRel(t_i, t_j) = \cos(V_i, V_j) = \frac{V_i \cdot V_j}{\|V_i\| \cdot \|V_j\|} \in [0, 1], \tag{12}$$

where the numerator is the dot product of both vectors and the denominator is the product of the magnitudes of the two vectors.

The advantage of this approach over concept equivalence, binary cosine, and Jaccard is that it also takes into account the related concepts of a concept that occurs in a text.

3.5 Ranked Semantic Recommendation

[5] describes an intuitive approach to working with adaptive hypermedia. For instance, when you read something about concept $c_1$, which is related to concept $c_2$ and concept $c_3$, you increase not only your knowledge of concept $c_1$ but also of the other two concepts.

Even though it is used in a different research field (adaptive hypermedia), the main idea can also be applied here. Each concept is assigned a value, which we call the rank. For example, when a user reads about Google, he might also be interested in its competitors, like Yahoo!, but also in news about its CEO, Eric Schmidt. Both are considered to be in direct relation to the concept Google. Therefore we increase the rank for Google, Yahoo!, and Eric Schmidt. Unrelated concepts, i.e., concepts that are not directly connected to the current concept, also need to be addressed. This means that if a user profile consists of concepts $c_1$ and $c_2$, and the next article the user reads contains concept $c_3$, which is directly related to $c_1$ but not related to $c_2$, we increase the rank of $c_1$ and decrease the rank of $c_2$. By decreasing the rank for such a concept we make the user profile adaptive to the user's main interest.
The set of concepts related to concept $c_i$ is defined as:

$$r(c_i) = \{c^i_1, c^i_2, \cdots, c^i_k\}. \tag{13}$$

R is described as the union of all concepts related to the concepts in the user profile U:

$$R = \bigcup_{u_i \in U} r(u_i). \tag{14}$$

Finally, $U_R$ is defined as the set of all profile concepts and their related concepts; this is called the extended user profile:

$$U_R = U \cup R. \tag{15}$$

The extended user profile is used in order to be able to increase the interest of the user in certain concepts that are not in the user profile, but are related to the concepts in the user profile.

To calculate the final ranks for each concept, we organize the concepts in a matrix. This is done because we have to assign a rank to each concept in the extended user profile for each concept the user has read about. Reading about concept $c_1$ increases its value by 1.0. If concept $c_2$ is directly related to concept $c_1$, then its value is increased by 0.5. If there is a concept $c_3$ in the extended profile which is neither equal to concept $c_1$ nor related to concept $c_1$, its value is decreased by 0.1. These constants were determined by experimenting with values ranging from 0 to 1 with a step of 0.1. Applying this procedure results in a matrix with rank values. The columns contain the items from the extended user profile ($U_R$) and the rows contain the items from the user profile ($U$). Table 1 shows a rank matrix, where $e_i \in U_R$ and $u_i \in U$. Summing the values of the cells in a column of the matrix, and repeating this process for each column, results in a vector with the final ranks for each concept in the extended user profile.

Table 1: Rank matrix

$$\begin{array}{c|cccc} & e_1 & e_2 & \cdots & e_q \\ \hline u_1 & r_{11} & r_{12} & \cdots & r_{1q} \\ u_2 & r_{21} & r_{22} & \cdots & r_{2q} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ u_m & r_{m1} & r_{m2} & \cdots & r_{mq} \end{array}$$

The user might have read one or more articles about a concept. Logically, the user is presumed to be more interested in concepts that are found in several articles.
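The construction of the extended user profile and the rank matrix can be sketched in a few lines of Python. This is only an illustrative sketch, not the Athena implementation (which is written in Java): the function names, the dictionary-based representation of the ontology, and the toy relations for Google and Apple are assumptions made for the example. The optional `weights` argument scales each row by the number of articles read about the row's concept; when it is omitted every row has weight 1.

```python
from typing import Dict, List, Optional, Set, Tuple


def extended_profile(profile: List[str], related: Dict[str, Set[str]]) -> List[str]:
    """Return U_R = U union R, keeping profile concepts first (Eqs. 13-15)."""
    extended = list(profile)
    for u in profile:
        # Sort related concepts so that the column order is deterministic.
        for concept in sorted(related.get(u, set())):
            if concept not in extended:
                extended.append(concept)
    return extended


def rank_matrix(
    profile: List[str],
    related: Dict[str, Set[str]],
    weights: Optional[Dict[str, int]] = None,
) -> Tuple[List[str], List[List[float]]]:
    """Build the rank matrix: one row per profile concept, one column per
    concept of the extended profile."""
    extended = extended_profile(profile, related)
    matrix = []
    for u in profile:
        w = 1 if weights is None else weights.get(u, 1)
        row = []
        for e in extended:
            if e == u:
                value = 1.0    # the concept itself
            elif e in related.get(u, set()):
                value = 0.5    # directly related concept
            else:
                value = -0.1   # unrelated concept
            row.append(w * value)
        matrix.append(row)
    return extended, matrix


def column_ranks(extended: List[str], matrix: List[List[float]]) -> Dict[str, float]:
    """Sum each column to obtain the final rank of each extended-profile concept."""
    return {e: sum(row[j] for row in matrix) for j, e in enumerate(extended)}


# Hypothetical toy ontology: only the relations of Google are modeled.
toy_related = {"Google": {"Yahoo!", "Eric Schmidt"}}
toy_extended, toy_matrix = rank_matrix(["Google", "Apple"], toy_related)
print(column_ranks(toy_extended, toy_matrix))
```

In this toy profile, Apple lowers the ranks of Google and of Google's related concepts by 0.1 each, which is the adaptive effect described in the text.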
The number of articles the user has read about concept $u_i$ is called the weight $w_i$:

$$W = \{w_1, w_2, \cdots, w_m\}. \tag{16}$$

Now we can calculate the value for each cell in the above matrix. This is done as follows:

$$r_{ij} = w_i \times \begin{cases} +1.0 & \text{if } e_j = u_i \\ +0.5 & \text{if } e_j \neq u_i,\ e_j \in r(u_i) \\ -0.1 & \text{otherwise} \end{cases} \tag{17}$$

The final rank for each concept from the extended user profile can be computed by taking the sum of the values of the corresponding column in the matrix:

$$Rank(e_j) = \sum_{i=1}^{m} r_{ij}. \tag{18}$$

Those sums are stored in a vector $V_U$. Each concept in the extended user profile now has a rank. Before we can compare the user profile with an unread news article, we need to ensure that the range of the ranks is [0, 1]. The normalization is done as follows:

$$V_U[v_i] = \frac{v_i - \min(v_u)}{\max(v_u) - \min(v_u)}, \tag{19}$$

where $v_i \in V_U$ and $v_u \in V_U$. With this normalization we can compare the extended user profile to a new article that
needs to be classified. The new article consists of a set of concepts, specified as A:

$$A = \{a_1, a_2, \cdots, a_t\}. \tag{20}$$

For this article we define a vector containing the ranks. This vector is defined as $V_A$:

$$V_A = (s_1, s_2, \cdots, s_t), \tag{21}$$

$$s_i = \begin{cases} Rank(e_i) & \text{if } e_i \in A \\ 0 & \text{if } e_i \notin A \end{cases} \tag{22}$$

Each concept from the extended user profile that appears in the article is assigned the same rank as the one in $V_U$. The remaining concepts are assigned zero. Concepts appearing in the article but not in the profile are ignored. In the current work we assume that all concepts found in a news item are equally important.

To compare the article with the user profile we propose to compute the extent to which the article fits the profile by dividing the sum of the ranks of concepts in the article by the sum of the ranks of the concepts in the user profile:

$$Similarity(V_A, V_U) = \frac{\sum_{v_a \in V_A} v_a}{\sum_{v_u \in V_U} v_u}. \tag{23}$$

The article with the highest similarity measure fits best with the user profile. The cut-off value for news item interestingness was fixed to 0.5, after experimenting with values ranging from 0 to 1 with a step of 0.1.

4. ATHENA IMPLEMENTATION

As Athena is an extension to the Hermes framework, it has been implemented as a plug-in to the existing implementation of the Hermes framework, the Hermes News Portal (HNP). The implementation of Athena is done in the same language as the HNP, Java. As a stemmer, for the content-based method, we have used the Krovetz Stemmer [9].

The user interface of Athena consists of 3 tabs: a browser for all news items, a tab for the recommendations, and a tab for evaluation purposes. The browser contains the news items sorted by date. Here, the user can browse through the news items instead of browsing through query results as in the HNP. Each item is presented with a title, summary, an image which is related to the news item, and the date published.

The user profile is created from the articles the user has read.
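The scoring step of the ranked recommender from Section 3.5 (normalization of the ranks, the article vector, the similarity, and the cut-off) can be illustrated with the following Python sketch. It is not the Java code of Athena: the function names, the dictionary representation of the rank vector, and the example ranks are assumptions made for illustration. The sketch also assumes that a profile whose ranks are all equal, for which the min-max normalization is undefined, is mapped to all zeros.

```python
from typing import Dict, Set


def normalize_ranks(ranks: Dict[str, float]) -> Dict[str, float]:
    """Min-max normalize the ranks of the extended user profile to [0, 1]."""
    lo, hi = min(ranks.values()), max(ranks.values())
    if hi == lo:
        # Assumption: the normalization is undefined here, so use zeros.
        return {concept: 0.0 for concept in ranks}
    return {concept: (r - lo) / (hi - lo) for concept, r in ranks.items()}


def similarity(article_concepts: Set[str], ranks: Dict[str, float]) -> float:
    """Sum of the ranks of profile concepts found in the article, divided by
    the sum of all ranks in the profile. Article concepts that are not in
    the extended profile are ignored."""
    total = sum(ranks.values())
    if total == 0:
        return 0.0
    matched = sum(r for concept, r in ranks.items() if concept in article_concepts)
    return matched / total


def is_interesting(article_concepts: Set[str], ranks: Dict[str, float],
                   cutoff: float = 0.5) -> bool:
    """Classify an article as interesting if its similarity exceeds the cut-off."""
    return similarity(article_concepts, ranks) > cutoff


# Hypothetical raw ranks of an extended user profile.
raw_ranks = {"Google": 2.0, "Yahoo!": 1.0, "Eric Schmidt": 1.0, "Apple": 0.0}
profile_ranks = normalize_ranks(raw_ranks)
print(similarity({"Google", "Yahoo!"}, profile_ranks))
print(is_interesting({"Google", "Yahoo!"}, profile_ranks))
```

Because the concept with the lowest rank is normalized to 0, it no longer contributes to either sum, so only concepts ranked above the minimum can make an article fit the profile.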
We define reading an article as opening it in the Web browser. After reading several articles, the user can select the recommendations tab in Athena. Here the user can choose a type of recommender, and get the recommended articles based on the user profile. Only one recommender can be chosen at a time. By clicking the refresh button, the recommender starts analyzing the user profile. After a short period of time, the recommender presents a list of news items that the user may find interesting. This list consists of the news items that the recommender ranked highest. Each news item is presented with its corresponding rank. The user can browse through the results, and by double-clicking on a news item, it is registered in the user profile, after which the user's Web browser shows the corresponding news article.

We also have included a concept list (similar to the well-known tag cloud), which displays all the concepts that have been stored in the user profile. When a concept is read in multiple articles, the font gets larger. Also, we have included a feature which highlights the concepts and related concepts found in the article in different colors.

Additionally, Athena provides a testing environment for evaluation purposes, which will be discussed in Section 5.

5. EVALUATION

Our research goal was to find whether ontology-based recommenders perform better than a classic recommender like TF-IDF. To evaluate our approach, we have developed a test method and built a test environment.

The testing method we have chosen is based on supervised learning. First the user is shown a set of 300 news articles, assembled by the designer of the test. For each article the user has to read the title and the summary. Based on this, he should decide whether the article is interesting or not. For the experiments we have used 5 users, each user having different news interests than the other ones.
Subsequently, this set of articles, with the corresponding ratings by the user, is split randomly into two different sets, the training set (60%) and the validation set (40%). The two sets are filled with a relatively equal number of interesting items. The training set is used to create a user profile. Each item that is marked as interesting will be added to this profile. The validation set is used by each recommender to determine for each news item the similarity with the user profile. An article is considered to be interesting if the similarity to the user profile is higher than the predefined cut-off value, otherwise it is classified as not interesting.

To determine the performance of a recommender, measures like accuracy, precision, recall (sensitivity), and specificity are used. These measures are calculated by using a confusion matrix, which stores the number of true positives, false positives, false negatives, and true negatives, for each of the analyzed recommender systems. Based on these measures, in the rest of this section, we compare the performance of the ranked recommender with respect to the performance of the other considered recommender systems.

The results in Table 2 and Table 3 show that the ranked recommender scores better than TF-IDF for accuracy (94% vs. 90%), precision (93% vs. 90%), and recall (62% vs. 45%), and has the same high score for specificity (99%). For accuracy and precision, from all implemented methods, the ranked recommender scores best, closely followed (difference of 1%) by the Jaccard recommender. The recall of the ranked recommender (62%) is nevertheless lower than the recall of concept equivalence (98%), binary cosine (95%), and semantic relatedness (92%). The best specificity (99%) is for the ranked recommender, Jaccard, and TF-IDF.

The ranked recommender is able to propose interesting stories for the user, eliminating most uninteresting stories.
Nevertheless, during the news filtering, news items deemed interesting by the user are also wrongly eliminated. However, the ranked recommender provides the user with more interesting news items relative to the total number of recommended news items than a traditional recommender system. The ranked recommender also suggests more interesting stories relative to the total number of recommended news items than the other considered semantic-based recommenders.
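For reference, the four measures used in this evaluation can be derived from a confusion matrix as in the following Python sketch. The formulas are the standard definitions of accuracy, precision, recall, and specificity; the function names and the small label lists are hypothetical and are not taken from the experiments.

```python
from typing import Dict, List


def confusion_counts(actual: List[bool], predicted: List[bool]) -> Dict[str, int]:
    """Count true/false positives and negatives; True means 'interesting'."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if a and p:
            counts["tp"] += 1
        elif not a and p:
            counts["fp"] += 1
        elif a and not p:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def safe_div(numerator: int, denominator: int) -> float:
    """Return 0.0 instead of failing when a measure is undefined."""
    return numerator / denominator if denominator else 0.0


def measures(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Standard measures computed from the four confusion-matrix cells."""
    return {
        "accuracy": safe_div(tp + tn, tp + fp + fn + tn),
        "precision": safe_div(tp, tp + fp),    # interesting among recommended
        "recall": safe_div(tp, tp + fn),       # recommended among interesting
        "specificity": safe_div(tn, tn + fp),  # rejected among uninteresting
    }


# Hypothetical user ratings and recommender output for five articles.
user_ratings = [True, True, False, False, False]
recommended = [True, False, False, False, True]
print(measures(**confusion_counts(user_ratings, recommended)))
```

Read this way, a recommender with high precision and specificity but lower recall, such as the ranked recommender, rarely recommends uninteresting items but misses part of the interesting ones.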
Table 2: Accuracy and Precision

Method                 Accuracy  Precision
TF-IDF                 90%       90%
Concept Equivalence    44%       22%
Binary Cosine          47%       23%
Jaccard                93%       92%
Semantic Relatedness   57%       26%
Ranked                 94%       93%

Table 3: Sensitivity and Specificity

Method                 Recall    Specificity
TF-IDF                 45%       99%
Concept Equivalence    98%       32%
Binary Cosine          95%       36%
Jaccard                58%       99%
Semantic Relatedness   92%       47%
Ranked                 62%       99%

6. CONCLUSION

This paper describes Athena, an extension to the Hermes framework that provides several methods for news item recommendation based on the user's interests. The system uses a user profile, news items, and several similarity measures.

At the heart of Athena is the ontology provided by the Hermes framework. This ontology contains the domain concepts and the relationships between the concepts. With these relationships, more information about each concept is available than only the concept itself. As a result, Athena can find different articles interesting than existing content-based technologies such as TF-IDF do, because it considers not only the concepts that appear in the article, but also the ones that are related to them.

We have described different methods to employ ontologies in comparing the user profile with a new article. We started with a content-based method that employs TF-IDF and the cosine similarity measure, followed by three basic semantic-based methods. Concept equivalence is a simple, intuitive method that looks for articles that contain at least one of the concepts from the profile. This method does not take into account the number of concepts found in the news article. To take the number of concepts into account, we have used binary cosine and Jaccard. Those methods compute the similarity between the article and the profile.

A more advanced method also takes into account the semantic relatedness between different concepts, which is provided by the underlying ontology.
A weight is assigned to each concept based on its neighborhood and the enclosure similarity. This method, referred to as semantic relatedness, is based on linguistic relationships. Finally, we presented a new method, called ranked recommender, which also uses the ontology relationships between the concepts. It takes the concepts from the user profile and combines these with the related concepts to create the extended user profile.

In this paper, we have shown that the ranked recommender, our ontology-based recommender, performs better than a traditional recommender system based on TF-IDF for accuracy, precision, and recall, and equally well for specificity. It also performs better, or equally well, with respect to accuracy, precision, and specificity than the other considered ontology-based recommenders. Nevertheless, its recall is lower than that of some of the implemented ontology-based recommenders.

The knowledge base that is used is partly created by a domain expert, which takes a lot of effort. Future research should focus on automatically creating and maintaining such a knowledge base to support ontology-based recommendation methods. Besides the improvement of the knowledge base, the algorithm can be improved as well. In our approach we have focused on a limited number of relations between concepts, namely only the direct relations. However, concepts might be related to each other on different levels, i.e., concepts might not be directly related to each other but there might exist a relation with one or more concepts between them. Additionally, we would like, in the future, to take into account the importance of a concept in a news item.

7. REFERENCES

[1] J. Ahn, P. Brusilovsky, J. Grady, D. He, and S. Y. Syn. Open User Profiles for Adaptive News Systems: Help or Harm? In 16th International Conference on World Wide Web, pages 11–20. ACM, 2007.
[2] D. Billsus and M. J. Pazzani. A Personal News Agent that Talks, Learns and Explains.
In The Third Annual Conference on Autonomous Agents, pages 268–275. ACM, May 1999.
[3] T. Bogers and A. van den Bosch. Comparing and Evaluating Information Retrieval Algorithms for News Recommendation. In ACM Conference On Recommender Systems, pages 141–144. ACM, 2007.
[4] H. Cunningham. GATE, a General Architecture for Text Engineering. Computers and the Humanities, 36:223–254, 2002.
[5] P. De Bra, A. T. M. Aerts, G. J. Houben, and H. Wu. Making General-Purpose Adaptive Hypermedia Work. In WebNet 2000 Conference, pages 117–123. AACE, 2000.
[6] C. Fellbaum, editor. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA, 1998.
[7] F. Frasincar, J. Borsje, and L. Levering. A Semantic Web-Based Approach for Building Personalized News Services. International Journal of E-Business Research, 5(3):35–53, 2009.
[8] F. Getahun, J. Tekli, C. Richard, M. Viviani, and K. Yetongnon. Relating RSS News/Items. In 9th International Conference on Web Engineering, pages 442–452. Springer, 2009.
[9] S. Guzman-Lara. KStem Java Implementation. University of Massachusetts Amherst, 2007. http://ciir.cs.umass.edu/cgi-bin/downloads/downloads.cgi.
[10] S. E. Middleton, N. R. Shadbolt, and D. C. D. Roure. Ontological User Profiling in Recommender Systems. ACM Transactions on Information Systems, 22(1):54–88, 2004.
[11] G. Salton and C. Buckley. Term Weighting Approaches in Automatic Text Retrieval. Information Processing and Management, 24(5):513–523, 1988.
[12] A. Singhal, G. Salton, M. Mitra, and C. Buckley. Document Length Normalization. Information Processing and Management, 32(5):619–633, 1996.