Knowledge Management with Documents
Qiang Yang, HKUST
Thanks: Professor Dik Lee, HKUST
Keyword Extraction
◼ Goal:
◼ given N documents, each consisting of words,
◼ extract the most significant subset of words → keywords
◼ Example:
◼ [All the students are taking exams] → [student, take, exam]
◼ Keyword Extraction Process:
◼ remove stop words
◼ stem remaining terms
◼ collapse terms using thesaurus
◼ build inverted index
◼ extract key words - build key word index
◼ extract key phrases - build key phrase index
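A minimal end-to-end sketch of this pipeline in Python; the stop-word list and the suffix-stripping stemmer below are illustrative toys, not the actual resources one would use in practice:

```python
from collections import defaultdict

STOP_WORDS = {"a", "all", "are", "the", "is", "of", "to"}   # toy list
SUFFIXES = ["ation", "ing", "er", "s"]                      # toy suffix list

def stem(word):
    """Strip the first matching suffix (illustrative, no context rules)."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

def extract_keywords(doc_id, text, inverted_index):
    tokens = [w.lower().strip(".,!?") for w in text.split()]
    keywords = [stem(w) for w in tokens if w not in STOP_WORDS]
    for kw in keywords:
        inverted_index[kw].add(doc_id)   # build the inverted index as we go
    return keywords

index = defaultdict(set)
print(extract_keywords(1, "All the students are taking exams", index))
# -> ['student', 'tak', 'exam']; a real stemmer's context rules
#    (e.g., Porter's) would map 'tak' back to 'take'
```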
Stop Words and Stemming
◼ From a given stop word list:
◼ [a, about, again, are, the, to, of, …]
◼ Remove them from the documents
◼ Or, determine stop words empirically:
◼ Given a large enough corpus of common English
◼ Sort the words in decreasing order of their occurrence frequency in the corpus
◼ Zipf’s law: Frequency × Rank ≈ constant
◼ the most frequent words tend to be short
◼ the most frequent 20% of words account for 60% of usage
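A short sketch of deriving stop-word candidates from corpus frequencies; the cutoff fraction is an arbitrary illustration, not a recommended value:

```python
from collections import Counter

def rank_by_frequency(corpus_tokens):
    """Words in decreasing order of occurrence frequency."""
    return Counter(corpus_tokens).most_common()

def stop_word_candidates(corpus_tokens, top_fraction=0.001):
    """The few most frequent words (articles, prepositions, ...)
    are natural stop-word candidates."""
    ranked = rank_by_frequency(corpus_tokens)
    cutoff = max(1, int(len(ranked) * top_fraction))
    return [word for word, _ in ranked[:cutoff]]

def zipf_products(corpus_tokens):
    """Zipf's law predicts rank * frequency is roughly constant."""
    ranked = rank_by_frequency(corpus_tokens)
    return [(r, w, f, r * f) for r, (w, f) in enumerate(ranked, start=1)]
```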
Zipf’s Law -- An Illustration

Rank (R)   Term   Frequency (F)   R*F (×10^6)
   1       the        69,971         0.070
   2       of         36,411         0.073
   3       and        28,852         0.086
   4       to         26,149         0.104
   5       a          23,237         0.116
   6       in         21,341         0.128
   7       that       10,595         0.074
   8       is         10,009         0.081
   9       was         9,816         0.088
  10       he          9,543         0.095
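The R*F column can be recomputed directly from the rank and frequency columns, confirming the near-constant product:

```python
freqs = [("the", 69971), ("of", 36411), ("and", 28852), ("to", 26149),
         ("a", 23237), ("in", 21341), ("that", 10595), ("is", 10009),
         ("was", 9816), ("he", 9543)]
for rank, (term, f) in enumerate(freqs, start=1):
    print(f"{rank:>2}  {term:<4} {f:>7,}  {rank * f / 1e6:.3f}")
# the product rank*F stays within roughly 0.07-0.13 million across ranks 1-10
```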
Resolving Power of Word
[Figure: word frequency curve, words in decreasing frequency order. Non-significant high-frequency terms sit at one end and non-significant low-frequency terms at the other; the presumed resolving power of significant words peaks over the mid-frequency terms.]
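This suggests keeping only mid-frequency terms as index terms; a sketch with illustrative cutoffs (the actual thresholds are collection-dependent):

```python
from collections import Counter

def significant_terms(corpus_tokens, upper=0.01, lower=0.00001):
    """Drop non-significant high-frequency terms (above `upper`)
    and non-significant low-frequency terms (below `lower`),
    keeping the mid-frequency band with the best resolving power."""
    n = len(corpus_tokens)
    counts = Counter(corpus_tokens)
    return {w for w, c in counts.items() if lower <= c / n <= upper}
```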
Stemming
◼ The next task is stemming: transforming words to their root form
◼ Computing, Computer, Computation → comput
◼ Suffix-based methods:
◼ Remove “ability” from “computability”
◼ “…”+ness, “…”+ive → remove
◼ Suffix list + context rules
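A hedged sketch of the suffix-list-plus-context-rules idea; the rule table is illustrative (a minimum stem length stands in for the context rules), not a full stemming algorithm:

```python
# (suffix, minimum stem length) - the length check is a simple context rule
RULES = [("ability", 3), ("ation", 3), ("ing", 3), ("ness", 3),
         ("ive", 3), ("er", 3)]

def stem(word):
    for suffix, min_len in RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            return word[: -len(suffix)]
    return word

assert stem("computability") == stem("computation") == stem("computer") == "comput"
assert stem("computing") == "comput"
```

In practice one would use a full algorithm such as Porter’s stemmer (available, for example, as nltk.stem.PorterStemmer).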
Thesaurus Rules
◼ A thesaurus aims at:
◼ classification of words in a language
◼ for a word, it gives related terms which are broader than, narrower than, same as (synonyms) and opposed to (antonyms) of the given word (other kinds of relationships may exist, e.g., composed of)
◼ Static thesaurus tables:
◼ [anneal, strain], [antenna, receiver], …
◼ Roget’s thesaurus
◼ WordNet at Princeton
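WordNet exposes exactly these relation types programmatically; a small sketch using NLTK’s WordNet interface (assumes nltk is installed and the wordnet corpus has been downloaded):

```python
from nltk.corpus import wordnet as wn   # needs nltk.download('wordnet') once

syn = wn.synsets("computer")[0]
print(syn.lemma_names())      # same-as: synonyms in the synset
print(syn.hypernyms())        # broader-than terms
print(syn.hyponyms())         # narrower-than terms
print(syn.part_meronyms())    # composed-of relationships
print(wn.synsets("good")[0].lemmas()[0].antonyms())  # opposed-to terms
```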
Thesaurus Rules can also be Learned
◼ From a search engine query log:
◼ After typing queries, users browse…
◼ If query1 and query2 lead to the same document,
◼ then Similar(query1, query2)
◼ If query1 leads to a document with title keyword K,
◼ then Similar(query1, K)
◼ Then, transitivity…
◼ Microsoft Research China’s work in WWW10 (Wen et al.) on Encarta online
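A minimal sketch of the first rule, assuming a click log given as (query, clicked document) pairs; this is a simplification of the clickthrough-based approach described above:

```python
from collections import defaultdict
from itertools import combinations

def learn_similar_queries(click_log):
    """Queries whose clicks land on the same document are deemed similar."""
    doc_to_queries = defaultdict(set)
    for query, doc in click_log:
        doc_to_queries[doc].add(query)
    similar = set()
    for queries in doc_to_queries.values():
        similar.update(combinations(sorted(queries), 2))
    return similar

log = [("atomic bomb", "D1"), ("manhattan project", "D1"),
       ("jaguar speed", "D2"), ("big cat speed", "D2")]
print(learn_similar_queries(log))
# -> {('atomic bomb', 'manhattan project'), ('big cat speed', 'jaguar speed')}
```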
The Vector-Space Model
◼ T distinct terms are available; call them index terms or the vocabulary
◼ The index terms represent important terms for an application → a vector to represent the document
◼ Example: for a computer science collection, the index terms (the vocabulary of the collection) might be T1=architecture, T2=bus, T3=computer, T4=database, T5=xml
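With that five-term vocabulary, a document becomes a fixed-length vector; a tiny sketch using raw term counts as the entries:

```python
vocabulary = ["architecture", "bus", "computer", "database", "xml"]  # T1..T5

def to_vector(doc_terms):
    """Represent a document as term counts over the fixed vocabulary."""
    return [doc_terms.count(t) for t in vocabulary]

print(to_vector(["computer", "bus", "computer", "xml"]))  # [0, 1, 2, 0, 1]
```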
The Vector-Space Model
◼ Assumptions: words are uncorrelated
◼ Given:
◼ 1. N documents and a query
◼ 2. The query is considered a document too
◼ 3. Each is represented by t terms
◼ 4. Each term j in document i has weight d_ij
◼ 5. We will deal with how to compute the weights later

         T1    T2   ...   Tt
  D1    d11   d12   ...   d1t
  D2    d21   d22   ...   d2t
   :      :     :           :
  Dn    dn1   dn2   ...   dnt

  Q      q1    q2   ...   qt
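A small sketch of this document-term weight matrix in Python, using raw term frequency for d_ij (the weighting schemes themselves are covered later):

```python
vocabulary = ["architecture", "bus", "computer", "database", "xml"]

docs = {
    "D1": ["computer", "architecture", "bus"],
    "D2": ["database", "xml", "database"],
}

# d_ij = weight of term j in document i; raw term frequency here
matrix = {name: [terms.count(t) for t in vocabulary]
          for name, terms in docs.items()}

# the query is treated as just another document
query = ["xml", "database"]
q = [query.count(t) for t in vocabulary]

print(matrix)  # {'D1': [1, 1, 1, 0, 0], 'D2': [0, 0, 0, 2, 1]}
print(q)       # [0, 0, 0, 1, 1]
```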