
1 I256 Applied Natural Language Processing, Fall 2009
Lecture 7: Practical examples of Graphical Models; Language models; Sparse data & smoothing
Barbara Rosario

2 Today
Exercises
–Design a graphical model
–Learn parameters for Naïve Bayes
Language models (n-grams)
Sparse data & smoothing methods

3 Exercise
Let's design a GM.
Problem: topic and subtopic classification
–Each document has one broad semantic topic (e.g. politics, sports, etc.)
–There are several subtopics in each document
–Example: a sports document can contain a part describing a match, a part describing the location of the match, and a part describing the persons involved

4 Exercise
The goal is to classify the overall topic (T) of the documents and all the subtopics (ST_i)
Assumptions:
–The subtopics ST_i depend on the topic T of the document
–The subtopics ST_i are conditionally independent of each other (given T)
–The words of the document w_j depend on the subtopic ST_i and are conditionally independent of each other (given ST_i)
–For simplicity, assume as many subtopic nodes as there are words
How would a GM encoding these assumptions look?
–Variables? Edges? Joint probability distribution?
(One possible factorization is sketched below.)
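One possible factorization encoding these assumptions (a sketch, assuming one subtopic node ST_i per word w_i, as stated above):

```latex
P(T, ST_1, \dots, ST_n, w_1, \dots, w_n)
  = P(T)\, \prod_{i=1}^{n} P(ST_i \mid T)\, P(w_i \mid ST_i)
```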

5 Exercise
What about now if the words of the document also depend directly on the topic T?
–The subtopic "persons" may be quite different if the overall topic is sports or politics
What about now if there is an ordering in the subtopics, i.e. ST_i depends on T and also on ST_i-1?

6 Naïve Bayes for topic classification
[GM figure: topic node T with arrows to word nodes w_1, w_2, …, w_n]
Recall the general joint probability distribution: P(X_1, …, X_N) = ∏_i P(X_i | Par(X_i))
P(T, w_1, …, w_n) = P(T) P(w_1 | T) P(w_2 | T) … P(w_n | T) = P(T) ∏_i P(w_i | T)
Inference (Testing): compute conditional probabilities P(T | w_1, w_2, …, w_n)
Estimation (Training): given data, estimate P(T) and P(w_i | T)

7 Exercise
Topic = sport (num words = 15)
D1: 2009 open season
D2: against Maryland Sept
D3: play six games
D3: schedule games weekends
D4: games games games
Topic = politics (num words = 19)
D1: Obama hoping rally support
D2: billion stimulus package
D3: House Republicans tax
D4: cuts spending GOP games
D4: Republicans obama open
D5: political season
Estimate P(w_i | T_j) for each w_i, T_j:
P(obama | T=politics) = P(w=obama, T=politics) / P(T=politics) = (c(w=obama, T=politics)/34) / (19/34) = 2/19
P(obama | T=sport) = P(w=obama, T=sport) / P(T=sport) = (c(w=obama, T=sport)/34) / (15/34) = 0
P(season | T=politics) = P(w=season, T=politics) / P(T=politics) = (c(w=season, T=politics)/34) / (19/34) = 1/19
P(season | T=sport) = P(w=season, T=sport) / P(T=sport) = (c(w=season, T=sport)/34) / (15/34) = 1/15
P(republicans | T=politics) = P(w=republicans, T=politics) / P(T=politics) = c(w=republicans, T=politics)/19 = 2/19
P(republicans | T=sport) = P(w=republicans, T=sport) / P(T=sport) = c(w=republicans, T=sport)/15 = 0/15 = 0
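A minimal Python sketch of this maximum-likelihood estimation on the toy corpus above (an illustration, not code from the slides; lower-casing the words is an assumption):

```python
from collections import Counter

# Toy corpus from the slide: topic -> list of (lower-cased) word tokens
corpus = {
    "sport": ("2009 open season against maryland sept play six games "
              "schedule games weekends games games games").split(),
    "politics": ("obama hoping rally support billion stimulus package "
                 "house republicans tax cuts spending gop games "
                 "republicans obama open political season").split(),
}

total_words = sum(len(ws) for ws in corpus.values())            # 34
prior = {t: len(ws) / total_words for t, ws in corpus.items()}  # P(T)
counts = {t: Counter(ws) for t, ws in corpus.items()}           # c(w, T)

def p_word_given_topic(word, topic):
    """MLE estimate: P(w | T) = c(w, T) / (number of word tokens with topic T)."""
    return counts[topic][word] / len(corpus[topic])

print(prior["politics"])                         # 19/34 ~ 0.559
print(p_word_given_topic("obama", "politics"))   # 2/19 ~ 0.105
print(p_word_given_topic("obama", "sport"))      # 0.0
print(p_word_given_topic("season", "sport"))     # 1/15 ~ 0.067
```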

8 Exercise: inference
What is the topic of each new document?
–Republicans obama season
–games season open
–democrats kennedy house

9 Exercise: inference
Recall the Bayes decision rule: decide T_j if P(T_j | c) > P(T_k | c) for T_j ≠ T_k
c is the context, here the words of the document
We want to assign the topic T for which T' = argmax_{T_j} P(T_j | c)

10 Exercise: Bayes classification
We compute P(T_j | c) with Bayes rule: P(T_j | w_1, …, w_n) = P(w_1, …, w_n | T_j) P(T_j) / P(w_1, …, w_n)
Because of the dependencies encoded in this GM, P(w_1, …, w_n | T_j) = ∏_i P(w_i | T_j)
So P(T_j | w_1, …, w_n) ∝ P(T_j) ∏_i P(w_i | T_j)

11 Exercise: Bayes classification
That is, for each T_j we calculate P(T_j) ∏_i P(w_i | T_j) and see which one is higher.
New sentence: republicans obama season
T = politics? P(politics | c) = P(politics) · P(republicans | politics) · P(obama | politics) · P(season | politics) = 19/34 · 2/19 · 2/19 · 1/19 > 0
T = sport? P(sport | c) = P(sport) · P(republicans | sport) · P(obama | sport) · P(season | sport) = 15/34 · 0 · 0 · 1/15 = 0
Choose T = politics
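Continuing the estimation sketch added after slide 7 (again illustrative; it reuses the hypothetical `corpus`, `prior`, and `p_word_given_topic` defined there):

```python
def classify(words):
    """Pick argmax over T of P(T) * prod_i P(w_i | T).
    An unseen word contributes a zero factor, which zeroes the whole product."""
    scores = {}
    for topic in corpus:
        score = prior[topic]
        for w in words:
            score *= p_word_given_topic(w, topic)
        scores[topic] = score
    return max(scores, key=scores.get), scores

print(classify("republicans obama season".split()))
# ('politics', {'sport': 0.0, 'politics': ~0.000326})  -> choose politics
print(classify("democrats kennedy house".split()))
# both scores are 0.0: "democrats" and "kennedy" are unseen (data sparsity)
```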

12 Exercise: Bayes classification
That is, for each T_j we calculate P(T_j) ∏_i P(w_i | T_j) and see which one is higher.
New sentence: democrats kennedy house
T = politics? P(politics | c) = P(politics) · P(democrats | politics) · P(kennedy | politics) · P(house | politics) = 19/34 · 0 · 0 · 1/19 = 0
democrats, kennedy: unseen words → data sparsity
How can we address this?

13 Today
Exercises
–Design of a GM
–Learn parameters
Language models (n-grams)
Sparse data & smoothing methods

14 Language Models
Model to assign scores to sentences
Probabilities should broadly indicate likelihood of sentences
–P(I saw a van) >> P(eyes awe of an)
Not grammaticality
–P(artichokes intimidate zippers) ≈ 0
In principle, "likely" depends on the domain, context, speaker…
Adapted from Dan Klein's CS 288 slides

15 Language models
Related: the task of predicting the next word
Can be useful for
–Spelling correction ("I need to notified the bank")
–Machine translation
–Speech recognition
–OCR (optical character recognition)
–Handwriting recognition
–Augmentative communication: computer systems to help the disabled communicate, for example systems that let users choose words with hand movements

16 Language Models
Model to assign scores to sentences
–Sentence: w_1, w_2, …, w_n
–Break the sentence probability down with the chain rule (no loss of generality; shown below)
–Too many histories!
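The chain-rule decomposition the slide refers to (the formula itself did not survive extraction; this is the standard form):

```latex
P(w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})
```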

17 Markov assumption: n-gram solution
[GM figure: word w_i depends only on the few preceding words, e.g. w_i-2 and w_i-1]
Markov assumption: only the prior local context (the last "few" n words) affects the next word
N-gram models: assume each word depends only on a short linear history
–Use N-1 words to predict the next one
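Under the Markov assumption the histories are truncated; for example (standard forms, reconstructed rather than taken from the slide):

```latex
\text{bigram } (n=2):\qquad  P(w_1, \dots, w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})
\qquad
\text{trigram } (n=3):\qquad P(w_1, \dots, w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-2}, w_{i-1})
```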

18 n-gram: Unigrams (n = 1) From Dan Klein’s CS 288 slides

19 n-gram: Bigrams (n = 2) From Dan Klein’s CS 288 slides

20 n-gram: Trigrams (n = 3)
[GM figure: chain of word nodes W_1, W_2, W_3, …, W_N]
From Dan Klein's CS 288 slides

21 Choice of n
In principle we would like the n of the n-gram to be large
–green
–large green
–the large green
–swallowed the large green
–"swallowed" should influence the choice of the next word (mountain is unlikely, pea more likely)
–The crocodile swallowed the large green…
–Mary swallowed the large green…
–And so on…

22 Discrimination vs. reliability
Looking at longer histories (large n) should allow us to make better predictions (better discrimination)
But it's much harder to get reliable statistics, since the number of parameters to estimate becomes too large
–The larger n, the larger the number of parameters to estimate, and the more data is needed for statistically reliable estimation

23 Language Models
N = size of the vocabulary
Unigrams: for each w_i calculate P(w_i) → N such numbers: N parameters
Bigrams: for each w_i, w_j calculate P(w_i | w_j) → N×N parameters
Trigrams: for each w_i, w_j, w_k calculate P(w_i | w_j, w_k) → N×N×N parameters

24 N-grams and parameters
Assume we have a vocabulary of 20,000 words. Growth in the number of parameters for n-gram models:
Model           | Parameters
Bigram model    | 20,000^2 = 400 million
Trigram model   | 20,000^3 = 8 trillion
Four-gram model | 20,000^4 = 1.6 × 10^17
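A quick check of those figures (illustrative only; the 20,000-word vocabulary is the slide's assumption):

```python
V = 20_000                      # vocabulary size assumed on the slide
for n in (2, 3, 4):
    print(f"{n}-gram model: {V ** n:.1e} parameters")
# 2-gram model: 4.0e+08   (400 million)
# 3-gram model: 8.0e+12   (8 trillion)
# 4-gram model: 1.6e+17
```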

25 Sparsity
Zipf's law: most words are rare
–This makes frequency-based approaches to language hard
New words appear all the time; new bigrams appear even more often, and new trigrams (and longer n-grams) more often still
These relative frequency estimates are the MLE (maximum likelihood estimates): the choice of parameters that gives the highest probability to the training corpus
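The relative-frequency (MLE) estimates referred to here, written out for the unigram and bigram cases (standard definitions, not taken verbatim from the slide; N is the total number of tokens and c(·) a count in the training corpus):

```latex
P_{\mathrm{MLE}}(w_i) = \frac{c(w_i)}{N}
\qquad
P_{\mathrm{MLE}}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}
```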

26 Sparsity
The larger the number of parameters, the more likely it is to get 0 probabilities
Note also the product: if we have a 0 for an unseen event, the 0 propagates and gives a 0 probability for the whole sentence

27 Tackling data sparsity
Discounting or smoothing methods
–Change the probabilities to avoid zeros
–Remember: probability distributions have to sum to 1
–Decrease the nonzero probabilities (of seen events) and give the freed-up probability mass to the zero-probability (unseen) events

28 Smoothing From Dan Klein’s CS 288 slides

29 Smoothing
Put probability mass on "unseen events"
–Add-one / add-delta (uniform prior)
–Add-one / add-delta (unigram prior)
–Linear interpolation
–…
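A minimal add-one / add-delta sketch applied to the earlier topic example (illustrative; it reuses the hypothetical `corpus` and `counts` from the sketch after slide 7, and the choice of vocabulary is an assumption):

```python
def p_word_given_topic_smoothed(word, topic, vocab_size, delta=1.0):
    """Add-delta smoothing: P(w | T) = (c(w, T) + delta) / (N_T + delta * |V|).
    delta = 1 gives add-one (Laplace) smoothing; every word gets nonzero mass."""
    return (counts[topic][word] + delta) / (len(corpus[topic]) + delta * vocab_size)

# Vocabulary: all seen words plus the previously unseen test words
vocab = {w for ws in corpus.values() for w in ws} | {"democrats", "kennedy", "house"}
print(p_word_given_topic_smoothed("democrats", "politics", len(vocab)))  # small, but > 0
print(p_word_given_topic_smoothed("obama", "politics", len(vocab)))      # (2 + 1) / (19 + |V|)
```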

30 Smoothing: Combining estimators
Make a linear combination of multiple probability estimates
–(provided that we weight the contribution of each of them so that the result is another probability function)
Linear interpolation or mixture models
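The trigram form of linear interpolation usually meant here (the slide's own formula was lost in extraction; this is the standard version):

```latex
P_{\mathrm{li}}(w_i \mid w_{i-2}, w_{i-1})
  = \lambda_1 P(w_i) + \lambda_2 P(w_i \mid w_{i-1}) + \lambda_3 P(w_i \mid w_{i-2}, w_{i-1}),
\qquad \lambda_1 + \lambda_2 + \lambda_3 = 1,\ \lambda_k \ge 0
```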

31 Smoothing: Combining estimators
Back-off models
–Special case of linear interpolation

32 Smoothing: Combining estimators Back-off models: trigram version
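A standard Katz-style back-off scheme for trigrams, reconstructed here since the slide's formula did not survive extraction (P* is a discounted estimate and α(·) the back-off weight that keeps the distribution normalized):

```latex
P_{\mathrm{bo}}(w_i \mid w_{i-2}, w_{i-1}) =
\begin{cases}
  P^{*}(w_i \mid w_{i-2}, w_{i-1}) & \text{if } c(w_{i-2}, w_{i-1}, w_i) > 0 \\
  \alpha(w_{i-2}, w_{i-1})\, P_{\mathrm{bo}}(w_i \mid w_{i-1}) & \text{otherwise}
\end{cases}
```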

33 Beyond N-Gram LMs
Discriminative models (n-grams are generative models)
Grammar-based
–Syntactic models: use tree models to capture long-distance syntactic effects
–Structural zeros: some n-grams are syntactically forbidden, keep their estimates at zero
Lexical
–Word forms
–Unknown words
Semantic-based
–Semantic classes: compute statistics at the semantic-class level (e.g., WordNet)
More data (the Web)

34 Summary
Given a problem (topic and subtopic classification, language models): design a GM
Learn parameters from data
But: data sparsity
Need to smooth the parameters

