This recommendation approach uses TF-IDF together with the cosine similarity measure. TF-IDF is a statistical method used to determine the relative importance of a word within a document in a collection (or corpus) of documents.

As discussed in [1], before calculating the TF-IDF values, the stop words are filtered from the document. After stop word removal, the remaining words are stemmed using a stemmer. This process reduces words like 'process', 'processor', 'processing', and 'processed' back to their root word 'process'.

The TF-IDF measure can be determined by first calculating the term frequency (TF), which indicates the importance of a term $t_i$ within a document $d_j$. By computing the inverse document frequency (IDF), the general importance of the term in a set of documents can be captured.

The objective is to compare any new document against the user profile. Therefore, a vector is calculated for the user profile. This vector contains the TF-IDF values for the 100 words with the highest TF-IDF value in the documents that have been read by the user. Subsequently, a vector based on the total set of documents is created in the same manner for the new document that is being compared to the user profile. By calculating the cosine measure of the news item and the user profile, the similarity can be determined. The articles with the highest similarity values are considered to be the most similar to the user profile and are recommended to the user.
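As an illustration, this pipeline can be sketched in a few lines of Python. The snippet below is a minimal reconstruction, not the authors' implementation: it relies on scikit-learn's TfidfVectorizer for stop-word removal and TF-IDF weighting (the stemming step is omitted for brevity), restricts the vocabulary to the 100 highest-weighted terms via max_features, and ranks candidate articles by cosine similarity against a profile vector built from the user's previously read documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def recommend(read_docs, new_docs, top_k=3):
    """Rank new_docs by cosine similarity to a TF-IDF user profile."""
    # Fit TF-IDF over all documents; keep the 100 terms with the
    # highest weights, after English stop-word removal.
    vectorizer = TfidfVectorizer(stop_words="english", max_features=100)
    vectorizer.fit(read_docs + new_docs)

    # User profile: a single vector summarizing the read documents
    # (here approximated by concatenating them into one pseudo-document).
    profile = vectorizer.transform([" ".join(read_docs)])

    # Vectorize the candidate news items and score each against the profile.
    candidates = vectorizer.transform(new_docs)
    scores = cosine_similarity(profile, candidates).ravel()

    # The highest-similarity articles are recommended first.
    return sorted(zip(scores, new_docs), reverse=True)[:top_k]
```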
3.4 Semantic-Based Recommendation

In traditional forms of text comparison, all words in the text are considered. In addition, no relations between different words are taken into account. For instance, it is not possible to determine the relation between Google and Microsoft. But a user who is interested in news regarding his stocks in Google might also be interested in news about Microsoft, because it is a competitor of Google. Using an ontology that covers those relations might therefore be useful in recommending new articles. To illustrate how we accomplished this, we will first discuss a few simple methods and then conclude with a more complex method.

3.4.1 Concept Equivalence

We start with a very simple technique which only considers equivalent concepts. The ontology contains a set of $n$ concepts:

$$C = \{c_1, c_2, c_3, \ldots, c_n\} . \quad (1)$$

The user profile consists of $p$ concepts identified by Hermes in the news previously read by the user. A concept is present in a news item if one of the concept's lexical representations is found in the news item and the meaning of this lexical representation in the context of the news item corresponds to the meaning of the concept as defined in the domain ontology. The user profile can be represented as the following set:

$$U = \{c^u_1, c^u_2, c^u_3, \ldots, c^u_p\}, \quad \text{where } c^u_i \in C . \quad (2)$$

A news article can also be formulated as a set of $q$ concepts that appear in the article:

$$A = \{c^a_1, c^a_2, c^a_3, \ldots, c^a_q\}, \quad \text{where } c^a_j \in C . \quad (3)$$

Using sets of concepts makes it impossible to compute the similarity with the regular cosine measure, as that measure requires a vector of values, such as TF-IDF values. The interestingness of a new news item is therefore determined by computing the intersection of the previous two sets:

$$\mathit{Similarity}(U, A) = \begin{cases} 1 & \text{if } |U \cap A| > 0 \\ 0 & \text{otherwise} \end{cases} . \quad (4)$$

If this results in 1, the article is considered interesting; otherwise it is considered not interesting.

3.4.2 Binary Cosine

To compute the similarity between two texts, we can also use the binary cosine similarity coefficient:

$$B(U, A) = \frac{|U \cap A|}{\sqrt{|U| \times |A|}} , \quad (5)$$

where $|U \cap A|$ represents the number of concepts in the intersection of $U$ and $A$, and $|U|$ and $|A|$ are, respectively, the numbers of concepts in $U$ and $A$.

3.4.3 Jaccard

The Jaccard similarity coefficient can be computed in a similar manner:

$$J(U, A) = \frac{|U \cap A|}{|U \cup A|} , \quad (6)$$

where $|U \cap A|$ is the number of concepts in the intersection of $U$ and $A$, and $|U \cup A|$ is the number of concepts in their union. Jaccard thus relates the number of concepts shared by the user profile and the news item to the number of concepts in the union of these two sets.
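Because these three measures operate on plain sets of concept identifiers, they are straightforward to express in code. The following sketch is our illustration, not code from the paper; it implements Equations (4)-(6) over Python sets, with example concept names that are purely hypothetical:

```python
import math

def concept_equivalence(U: set, A: set) -> int:
    """Eq. (4): 1 if the profile and the article share any concept."""
    return 1 if U & A else 0

def binary_cosine(U: set, A: set) -> float:
    """Eq. (5): |U ∩ A| / sqrt(|U| * |A|)."""
    if not U or not A:
        return 0.0
    return len(U & A) / math.sqrt(len(U) * len(A))

def jaccard(U: set, A: set) -> float:
    """Eq. (6): |U ∩ A| / |U ∪ A|."""
    if not U and not A:
        return 0.0
    return len(U & A) / len(U | A)

# Example: a profile and an article sharing one concept.
U = {"Google", "StockMarket", "Smartphone"}
A = {"Microsoft", "StockMarket"}
print(concept_equivalence(U, A))  # 1 -> considered interesting
print(binary_cosine(U, A))        # 1 / sqrt(6) ≈ 0.41
print(jaccard(U, A))              # 1 / 4 = 0.25
```

Note that, unlike Equation (4), the binary cosine and Jaccard coefficients yield graded scores, so articles can be ranked rather than merely accepted or rejected.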
3.4.4 Semantic Relatedness

In [8] the focus is on the semantic relationship between words. There, the semantic neighborhood of a concept $c_i \in C$ is defined as the set of concepts related to it via the synonymy ($\equiv$), hyponymy ($\prec$), and meronymy ($\ll$) relations. Our ontology covers more relations than only these linguistic ones. Therefore, the semantic neighborhood of a concept $c_i$ includes each concept that is directly related to $c_i$ (including $c_i$ itself):

$$N(c_i) = \{c^i_1, c^i_2, \ldots, c^i_n\} . \quad (7)$$

A text $t_k$ can be described by a set of concepts:

$$CS_k = \{c^k_1, c^k_2, \ldots, c^k_m\} . \quad (8)$$

When comparing two texts, $t_i$ and $t_j$, a vector in $n$-dimensional space can be created according to the vector space model:

$$V_l = (\langle c^l_1, w^l_1 \rangle, \ldots, \langle c^l_p, w^l_p \rangle) , \quad (9)$$

where $l \in \{i, j\}$, $w_i$ represents the weight associated with the concept $c_i$, and $p = |CS_i \cup CS_j|$ is the number of distinct concepts in $CS_i$ and $CS_j$. If the concept $c_i$ is referenced in $CS_j$ then $w_i = 1$; otherwise it is computed based on the semantic neighborhood of the concept.
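The source text breaks off before the weight of a non-referenced concept is fully specified, so the concrete weighting in the sketch below is our assumption, not the paper's formula. Under that assumption, the sketch builds the semantic neighborhood of Equation (7) from a toy set of ontology relation triples, assembles the vectors of Equation (9) by assigning weight 1 to referenced concepts and a fixed placeholder discount (NEIGHBOR_WEIGHT) when only a neighbor is referenced, and then compares the vectors with the standard cosine measure. All relation triples and concept names are illustrative.

```python
import math

# Toy ontology: directed relation triples (subject, relation, object).
# These concepts and relations are illustrative assumptions.
RELATIONS = [
    ("Google", "competitorOf", "Microsoft"),
    ("Microsoft", "competitorOf", "Google"),
    ("Android", "productOf", "Google"),
]

def neighborhood(c):
    """Eq. (7): c itself plus every concept directly related to it."""
    related = {o for s, _, o in RELATIONS if s == c}
    related |= {s for s, _, o in RELATIONS if o == c}
    return related | {c}

# Placeholder weight for neighborhood-only matches; the paper's actual
# formula is truncated in the source text, so this value is an assumption.
NEIGHBOR_WEIGHT = 0.5

def weight(c, cs):
    """Weight of concept c w.r.t. concept set cs: 1 if referenced,
    a discounted placeholder if only its neighborhood overlaps, else 0."""
    if c in cs:
        return 1.0
    if neighborhood(c) & cs:
        return NEIGHBOR_WEIGHT
    return 0.0

def vectors(cs_i, cs_j):
    """Build the two vectors of Eq. (9) over the p distinct concepts."""
    dims = sorted(cs_i | cs_j)  # p = |CS_i ∪ CS_j| dimensions
    v_i = [weight(c, cs_i) for c in dims]
    v_j = [weight(c, cs_j) for c in dims]
    return dims, v_i, v_j

def cosine(u, v):
    """Standard cosine similarity between two weight vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

dims, v_i, v_j = vectors({"Google", "Android"}, {"Microsoft"})
print(dims, v_i, v_j)              # weights over the shared dimensions
print(round(cosine(v_i, v_j), 2))  # ≈ 0.6 under the placeholder weighting
```

With purely binary weights the two texts above would score 0, since they share no concept; the neighborhood-based weights capture that Google and Microsoft are related, which is exactly the motivation given in Section 3.4.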