Web Search and Mining
Lecture 12: Language Models
Recap
▪ Probabilistic models:
  ▪ Binary independence model
  ▪ Bayesian networks for IR
  ▪ Language models
This lecture
▪ The Language Model Approach to IR
  ▪ Basic query generation model
  ▪ Alternative models
Standard Probabilistic IR
[Diagram: an information need is expressed as a query; documents d1, d2, …, dn from the collection are matched against it by estimating P(R | Q, d).]
IR based on Language Model (LM)
[Diagram: each document di in the collection induces a language model Mdi; retrieval ranks documents by P(Q | Mdi), the probability that the document's model generates the query.]
▪ A common search heuristic is to use words that you expect to find in matching documents as your query.
▪ The LM approach directly exploits that idea!
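To make the query-generation view concrete, here is a minimal sketch (not the lecture's code; the documents and query are invented for illustration) that estimates a maximum-likelihood unigram model Md for each document and ranks documents by P(Q | Md):

```python
from collections import Counter

def unigram_lm(text):
    """Maximum-likelihood unigram model Md: P(w) = count(w) / doc length."""
    tokens = text.lower().split()
    total = len(tokens)
    return {w: c / total for w, c in Counter(tokens).items()}

def query_likelihood(query, model):
    """P(Q | Md) under the unigram model; zero if any query term is unseen."""
    p = 1.0
    for w in query.lower().split():
        p *= model.get(w, 0.0)
    return p

docs = {
    "d1": "the man likes the woman",
    "d2": "the woman said the man likes tea",
}
models = {d: unigram_lm(t) for d, t in docs.items()}
ranking = sorted(models, key=lambda d: query_likelihood("man likes", models[d]),
                 reverse=True)
```

Note the zero-probability problem this sketch exhibits: any query term absent from a document drives P(Q | Md) to zero, which is why practical LM retrieval smooths the document model.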
Formal Language (Model)
▪ Traditional generative model: generates strings
  ▪ Finite state machines or regular grammars, etc.
▪ Example: the language of one or more repetitions of "I wish"
  ▪ Generates: I wish, I wish I wish, I wish I wish I wish, …
  ▪ Does not generate: *wish I wish
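The example language above is regular, so it can be sketched as a regular expression, one or more repetitions of "I wish" (a quick illustration, not part of the lecture):

```python
import re

# The example language: one or more repetitions of "I wish"
lang = re.compile(r"^I wish( I wish)*$")

print(bool(lang.match("I wish I wish")))  # in the language
print(bool(lang.match("wish I wish")))    # *wish I wish: not in the language
```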
Stochastic Language Models
▪ Models probability of generating strings in the language (commonly all strings over alphabet ∑)

  Model M:
    the    0.2
    a      0.1
    man    0.01
    woman  0.01
    said   0.03
    likes  0.02
    …

  s = "the man likes the woman"
  P(s | M) = 0.2 × 0.01 × 0.02 × 0.2 × 0.01  (multiply)
           = 0.00000008
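This computation is just a product over words; a minimal sketch using the model probabilities from the slide (the function name is my own):

```python
# Unigram model M from the slide (partial vocabulary)
M = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01, "said": 0.03, "likes": 0.02}

def string_prob(sentence, model):
    """P(s | M): multiply the model probability of each word in s."""
    p = 1.0
    for w in sentence.split():
        p *= model.get(w, 0.0)  # unseen words get probability 0
    return p

p = string_prob("the man likes the woman", M)
# 0.2 * 0.01 * 0.02 * 0.2 * 0.01 = 8e-08
```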
Stochastic Language Models
▪ Model probability of generating any string

  word       Model M1   Model M2
  the        0.2        0.2
  class      0.01       0.0001
  sayst      0.0001     0.03
  pleaseth   0.0001     0.02
  yon        0.0001     0.1
  maiden     0.0005     0.01
  woman      0.01       0.0001

  s = "the class pleaseth yon maiden"
  P(s | M1) = 0.2 × 0.01 × 0.0001 × 0.0001 × 0.0005
  P(s | M2) = 0.2 × 0.0001 × 0.02 × 0.1 × 0.01

  P(s | M2) > P(s | M1)
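The comparison can be checked directly; a sketch with the two models' word probabilities from the slide:

```python
M1 = {"the": 0.2, "class": 0.01, "sayst": 0.0001, "pleaseth": 0.0001,
      "yon": 0.0001, "maiden": 0.0005, "woman": 0.01}
M2 = {"the": 0.2, "class": 0.0001, "sayst": 0.03, "pleaseth": 0.02,
      "yon": 0.1, "maiden": 0.01, "woman": 0.0001}

def string_prob(sentence, model):
    """P(s | model): product of per-word probabilities."""
    p = 1.0
    for w in sentence.split():
        p *= model.get(w, 0.0)
    return p

s = "the class pleaseth yon maiden"
# M2 assigns s a higher probability, so M2 better "explains" this text --
# exactly the comparison LM-based IR makes between document models.
```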
Stochastic Language Models
▪ A statistical model for generating text
▪ Probability distribution over strings in a given language

  By the chain rule, for s = w1 w2 w3 w4:
  P(s | M) = P(w1 | M) P(w2 | M, w1) P(w3 | M, w1 w2) P(w4 | M, w1 w2 w3)
Unigram and higher-order models

  Chain rule: P(w1 w2 w3 w4) = P(w1) P(w2 | w1) P(w3 | w1 w2) P(w4 | w1 w2 w3)

▪ Unigram Language Models
  P(w1 w2 w3 w4) = P(w1) P(w2) P(w3) P(w4)
  Easy. Effective!
▪ Bigram (generally, n-gram) Language Models
  P(w1 w2 w3 w4) = P(w1) P(w2 | w1) P(w3 | w2) P(w4 | w3)
▪ Other Language Models
  ▪ Grammar-based models (PCFGs), etc.
  ▪ Probably not the first thing to try in IR
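A sketch contrasting unigram and bigram scoring of the same sentence (the probability tables here are hypothetical, chosen only to make the two factorizations concrete):

```python
# Hypothetical unigram and bigram probability tables
uni = {"the": 0.2, "man": 0.01, "likes": 0.02, "woman": 0.01}
bi = {("the", "man"): 0.05, ("man", "likes"): 0.1,
      ("likes", "the"): 0.2, ("the", "woman"): 0.03}

def unigram_prob(tokens, uni):
    """Unigram: words independent, P(s) = prod_i P(wi)."""
    p = 1.0
    for w in tokens:
        p *= uni.get(w, 0.0)
    return p

def bigram_prob(tokens, uni, bi):
    """Bigram: P(s) = P(w1) * prod_i P(wi | w_{i-1})."""
    p = uni.get(tokens[0], 0.0)
    for prev, w in zip(tokens, tokens[1:]):
        p *= bi.get((prev, w), 0.0)
    return p

s = "the man likes the woman".split()
```

The bigram model conditions each word only on its predecessor, a Markov approximation of the full chain rule; longer histories (n-grams) trade sharper context for many more parameters to estimate.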