Word Sense Disambiguation
Zhang Yu
zhangyu@ir.hit.edu.cn
Overview of the Problem
◼ Problem: many words have different meanings or senses, i.e., there is ambiguity about how they are to be specifically interpreted (e.g., differentiate).
◼ Task: to determine which of the senses of an ambiguous word is invoked in a particular use of the word by looking at the context of its use.
◼ Note: more often than not, the different senses of a word are closely related.
Ambiguity Resolution
◼ Bank
  ◼ The rising ground bordering a lake, river, or sea
  ◼ An establishment for the custody, loan, exchange, or issue of money, for the extension of credit, and for facilitating the transmission of funds
◼ Title
  ◼ Name/heading of a book, statue, work of art or music, etc.
  ◼ Material at the start of a film
  ◼ The right of legal ownership (of land)
  ◼ The document that is evidence of the right
  ◼ An appellation of respect attached to a person's name
  ◼ A written work (synecdoche: part stands for the whole)
Overview of our Discussion
◼ Methodology
  ◼ Supervised Disambiguation: based on a labeled training set.
  ◼ Dictionary-Based Disambiguation: based on lexical resources such as dictionaries and thesauri.
  ◼ Unsupervised Disambiguation: based on unlabeled corpora.
Methodological Preliminaries
◼ Supervised versus Unsupervised Learning: in supervised learning (classification), the sense label of each word occurrence is provided in the training set; whereas in unsupervised learning (clustering), it is not provided.
◼ Pseudowords: used to generate artificial evaluation data for comparison and improvement of text-processing algorithms, e.g., replace each of two words (e.g., bell and book) with a pseudoword (e.g., bell-book); a short sketch follows below.
◼ Upper and Lower Bounds on Performance: used to find out how well an algorithm performs relative to the difficulty of the task.
  ◼ Upper: human performance
  ◼ Lower: baseline using the highest-frequency alternative (best of 2 versus 10)
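To make the pseudoword idea concrete, here is a minimal sketch in Python (the data and function names are illustrative, not from the original): every occurrence of two chosen words is replaced by their concatenation, and the replaced word is kept as the gold label that a disambiguator must recover from context.

```python
# Build pseudoword evaluation data: occurrences of "bell" and "book" are
# replaced by the pseudoword "bell-book"; the original word serves as the
# gold sense label a disambiguator must recover from context.
def make_pseudoword_data(sentences, w1="bell", w2="book"):
    pseudo = f"{w1}-{w2}"
    examples = []
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok in (w1, w2):
                context = tokens[:i] + [pseudo] + tokens[i + 1:]
                examples.append((context, i, tok))  # (context, position, gold sense)
    return examples

sents = [["the", "bell", "rang", "loudly"],
         ["she", "read", "the", "book", "twice"]]
for context, pos, gold in make_pseudoword_data(sents):
    print(" ".join(context), "->", gold)
```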
Supervised Disambiguation
◼ Training set: exemplars where each occurrence of the ambiguous word w is annotated with a semantic label. This becomes a statistical classification problem: assign w some sense sk in context c.
◼ Approaches:
  ◼ Bayesian Classification: the context of occurrence is treated as a bag of words without structure, but it integrates information from many words in a context window.
  ◼ Information Theory: only looks at the most informative feature in the context, which may be sensitive to text structure.
◼ There are many more approaches (see Chapter 16 or a text on Machine Learning (ML)) that could be applied.
Supervised Disambiguation: Bayesian Classification
◼ (Gale et al., 1992): look at the words around an ambiguous word in a large context window. Each content word contributes potentially useful information about which sense of the ambiguous word is likely to be used with it. The classifier does no feature selection; it simply combines the evidence from all features, assuming they are independent.
◼ Bayes decision rule: decide s′ if P(s′|c) > P(sk|c) for all sk ≠ s′.
◼ This rule is optimal because it minimizes the probability of error: for each individual case it selects the class with the highest conditional probability (and hence the lowest error rate).
  ◼ The error rate for a sequence will also be minimized.
Supervised Disambiguation: Bayesian Classification
◼ We do not usually know P(sk|c), but we can use Bayes' Rule to compute it:
  P(sk|c) = (P(c|sk) / P(c)) × P(sk)
◼ P(sk) is the prior probability of sense sk, i.e., the probability of sk without any contextual information.
◼ When updating the prior with evidence from the context (i.e., P(c|sk)/P(c)), we obtain the posterior probability P(sk|c).
◼ If all we want to do is select the correct class, we can ignore P(c), since it is the same for every sense; we also use logs to simplify computation.
◼ Assign word w the sense
  s′ = argmax_sk P(sk|c) = argmax_sk P(c|sk) × P(sk) = argmax_sk [log P(c|sk) + log P(sk)]
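As a minimal sketch of this decision rule in Python, assuming the log prior and log likelihood for each sense have already been estimated (the sense names and numbers below are made up for illustration):

```python
import math

# Hypothetical probabilities for two senses of "bank" in a given context c;
# in practice these are estimated from a labeled corpus (see below).
log_prior = {"bank/money": math.log(0.7), "bank/river": math.log(0.3)}
log_likelihood = {"bank/money": math.log(0.002), "bank/river": math.log(0.010)}

# Bayes decision rule in log space: s' = argmax_sk [log P(c|sk) + log P(sk)];
# P(c) is dropped because it is the same for every sense.
best = max(log_prior, key=lambda s: log_likelihood[s] + log_prior[s])
print(best)  # bank/river: here the context evidence outweighs the prior
```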
Bayesian Classification: Naïve Bayes
◼ Naïve Bayes:
  ◼ is widely used in ML due to its ability to efficiently combine evidence from a wide variety of features;
  ◼ can be applied if the state of the world we base our classification on can be described as a series of attributes;
  ◼ in this case, we describe the context of w in terms of the words vj that occur in the context.
◼ Naïve Bayes assumption: the attributes used for classification are conditionally independent:
  P(c|sk) = P({vj | vj in c}|sk) = Π_{vj in c} P(vj|sk)
◼ Two consequences (illustrated in the sketch below):
  ◼ The structure and linear ordering of words is ignored: bag-of-words model.
  ◼ The presence of one word is independent of another, which is clearly untrue in text.
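Below is a minimal sketch of the independence assumption, with hypothetical per-sense word probabilities. Reversing the context leaves the score unchanged, which is exactly the bag-of-words consequence.

```python
import math

# Hypothetical P(vj|sk) for one sense; unseen words would need smoothing.
p_word_given_sense = {"water": 0.05, "flows": 0.02, "near": 0.01, "the": 0.20}

def log_likelihood(context, p_word):
    # Naive Bayes factorization: log P(c|sk) = sum over vj in c of log P(vj|sk).
    return sum(math.log(p_word[v]) for v in context)

c = ["water", "flows", "near", "the"]
print(log_likelihood(c, p_word_given_sense))
print(log_likelihood(list(reversed(c)), p_word_given_sense))  # identical: order is ignored
```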
Bayesian Classification: Naïve Bayes
◼ Although the Naïve Bayes assumption is incorrect in the context of text processing, it often does quite well, partly because the decisions made can be optimal even in the face of the inaccurate assumption.
◼ Decision rule for Naïve Bayes: decide s′ where
  s′ = argmax_sk [log P(sk) + Σ_{vj in c} log P(vj|sk)]
◼ P(vj|sk) and P(sk) are computed via Maximum-Likelihood Estimation, perhaps with appropriate smoothing, from a labeled training corpus:
  P(vj|sk) = C(vj, sk) / C(sk)
  P(sk) = C(sk) / C(w)
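Putting the pieces together, here is a minimal end-to-end sketch: counts give the MLE estimates, and add-one smoothing stands in for the "appropriate smoothing" above (the toy corpus and the choice of smoothing are illustrative additions, not part of the original formulation).

```python
import math
from collections import Counter, defaultdict

# Toy labeled corpus: (context words, sense label) for the ambiguous word "bank".
train = [
    (["river", "water", "muddy"], "bank/river"),
    (["money", "loan", "credit"], "bank/money"),
    (["deposit", "money", "account"], "bank/money"),
]

sense_count = Counter()                  # C(sk)
word_sense_count = defaultdict(Counter)  # C(vj, sk)
vocab = set()
for context, sense in train:
    sense_count[sense] += 1
    for v in context:
        word_sense_count[sense][v] += 1
        vocab.add(v)

def classify(context):
    # s' = argmax_sk [log P(sk) + sum_j log P(vj|sk)], with add-one smoothing
    # so that unseen context words do not zero out a sense.
    total = sum(sense_count.values())    # C(w): all occurrences of the ambiguous word
    best, best_score = None, -math.inf
    for sk in sense_count:
        score = math.log(sense_count[sk] / total)            # log P(sk)
        n_sk = sum(word_sense_count[sk].values())
        for v in context:
            score += math.log((word_sense_count[sk][v] + 1) / (n_sk + len(vocab)))
        if score > best_score:
            best, best_score = sk, score
    return best

print(classify(["loan", "money"]))  # expected: bank/money
```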