CPSC 503 Computational Linguistics

CPSC 503 Computational Linguistics, Lecture 5, Giuseppe Carenini (CPSC503 Winter 2007)

Today (25/9)
- Finish spelling
- N-grams
- Model evaluation
- Start Hidden Markov Models

Final Method (multiple errors)
(1) Given the observed string O, for each candidate word wi compute me_i = min-edit-distance(wi, O); if me_i < k, save the corresponding edit operations in EdOp_i.
(2) For each surviving wi, compute the word prior P(wi) and multiply it by the probability of the particular errors that turn wi into O (an estimate of P(O|wi)).
(3) Sort the candidates by this score and display the top n to the user.
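A minimal sketch of steps (1)-(3), assuming a hypothetical lexicon of unigram priors and a crude error model that depends only on the edit distance (the slide's per-edit-operation estimate of P(O|wi) is richer than this):

```python
from functools import lru_cache

def min_edit_distance(w, o):
    """Standard Levenshtein distance between candidate w and observation o."""
    @lru_cache(maxsize=None)
    def d(i, j):
        if i == 0:
            return j
        if j == 0:
            return i
        cost = 0 if w[i - 1] == o[j - 1] else 1
        return min(d(i - 1, j) + 1, d(i, j - 1) + 1, d(i - 1, j - 1) + cost)
    return d(len(w), len(o))

def correct(observed, priors, k=3, top_n=5):
    """Steps (1)-(3): filter by edit distance < k, score by prior * error prob, sort."""
    scored = []
    for w, p_w in priors.items():
        me = min_edit_distance(w, observed)
        if me < k:
            p_error = 0.01 ** me          # crude stand-in for P(O|w)
            scored.append((p_w * p_error, w))
    return [w for _, w in sorted(scored, reverse=True)[:top_n]]

# Example with a toy lexicon of made-up priors:
print(correct("funn", {"funny": 0.002, "funnel": 0.0005, "fun": 0.004, "dog": 0.01}))
```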

Spelling: the problem(s)
Detection vs. correction, for four cases:
- Non-word, isolated. Detection: check whether the word is in the lexicon (funn). Correction: find candidates and select the most likely (funn -> funny, funnel, ...).
- Non-word, in context. Detection: as above. Correction: find the most likely correct word in this context (trust funn, a lot of funn).
- Real-word, isolated. Detection and correction: ?!
- Real-word, in context. Detection: is it an impossible (or very unlikely) word in this context? (.. a wild big.). Correction: find the most likely substitute word in this context (it was funne, trust fun).

Real-Word Spelling Errors
Collect a set of common confusion sets C = {C1 .. Cn}, e.g., {(their/they're/there), (to/too/two), (weather/whether), (lave/have), ...}. Whenever c' ∈ Ci is encountered:
- Compute the probability of the sentence in which it appears
- Substitute each c ∈ Ci (c ≠ c') and compute the probability of the resulting sentence
- Choose the alternative with the higher probability
Confusion sets cover mental confusions (their/they're/there, to/too/two, weather/whether) and typos that result in real words (lave for have). A similar process works for non-word errors.

Key Transition
Up to this point we have mostly been discussing words in isolation. Now we are switching to sequences of words, and we are going to worry about assigning probabilities to sequences of words.

Knowledge-Formalisms Map (including probabilistic formalisms)
- Morphology, syntax: state machines (and probabilistic versions): finite-state automata, finite-state transducers, Markov models
- Syntax, semantics: rule systems (and probabilistic versions), e.g., (probabilistic) context-free grammars
- Semantics, pragmatics, discourse and dialogue: logical formalisms (first-order logics), AI planners
Spelling is a very simple NLP task (it requires morphological recognition), but it shows the need for probabilistic approaches. Next we move beyond single words to the probability of a sentence.

Only Spelling? AB Assign a probability to a sentence Part-of-speech tagging Word-sense disambiguation Probabilistic Parsing Predict the next word Speech recognition Hand-writing recognition Augmentative communication for the disabled AB Why would you want to assign a probability to a sentence or… Why would you want to predict the next word… Impossible to estimate  4/6/2019 CPSC503 Winter 2007

Decompose: Apply the Chain Rule
Most sentences/sequences will not appear in a corpus, or will appear only once. Standard solution: decompose the sentence probability into a set of probabilities that are easier to estimate. Applying the chain rule to a word sequence from position 1 to n, the probability of the sequence becomes a product of conditional word probabilities, as shown below.
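For reference, the chain rule in its standard form (w_1^{k-1} abbreviates the prefix w_1 ... w_{k-1}):

P(w_1^n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1^2) ... P(w_n | w_1^{n-1}) = ∏_{k=1}^{n} P(w_k | w_1^{k-1})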

Example Sequence: "The big red dog barks"
P(The big red dog barks) = P(The) * P(big | The) * P(red | The big) * P(dog | The big red) * P(barks | The big red dog)
Note: P(The) is better expressed as P(The | <beginning of sentence>), written P(The | <S>).

Not a Satisfying Solution
Even for small n (e.g., 6) we would need far too large a corpus to estimate the conditional probabilities; the chain rule by itself does not help, since it is unlikely we will ever gather the right statistics for long prefixes.
Markov Assumption: the entire prefix history is not necessary. In other words, an event does not depend on all of its history, just on a fixed-length recent history. So each component in the product is replaced with an approximation that conditions on a prefix of only N-1 words (unigram, bigram, trigram, ...), as shown below.
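In symbols, the standard form of the approximation (with N the n-gram order):

P(w_k | w_1^{k-1}) ≈ P(w_k | w_{k-N+1}^{k-1})
e.g., bigram (N=2): P(w_k | w_{k-1}); trigram (N=3): P(w_k | w_{k-2} w_{k-1})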

Probability of a Sentence: N-Grams
Unigram, bigram and trigram approximations (standard forms below).
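The corresponding sentence-probability approximations, written in their standard form:

unigram: P(w_1^n) ≈ ∏_{k=1}^{n} P(w_k)
bigram:  P(w_1^n) ≈ ∏_{k=1}^{n} P(w_k | w_{k-1})
trigram: P(w_1^n) ≈ ∏_{k=1}^{n} P(w_k | w_{k-2} w_{k-1})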

Bigram: <S> The big red dog barks
P(The big red dog barks) = P(The|<S>) * P(big|The) * P(red|big) * P(dog|red) * P(barks|dog)
Trigram?
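One common way to write the trigram version the slide asks about, padding the start of the sentence with two <S> symbols (this padding convention is an assumption; other conventions exist):

P(The big red dog barks) = P(The|<S> <S>) * P(big|<S> The) * P(red|The big) * P(dog|big red) * P(barks|red dog)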

Estimates for N-Grams
The bigram case, and the general case (formulas below). Note that the number of bigram tokens in a corpus is essentially equal to the number of word tokens, which is why the denominator can be written as a unigram count.
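The standard relative-frequency (maximum likelihood) estimates these slides rely on:

bigram: P(w_n | w_{n-1}) = C(w_{n-1} w_n) / Σ_w C(w_{n-1} w) = C(w_{n-1} w_n) / C(w_{n-1})
in general: P(w_n | w_{n-N+1}^{n-1}) = C(w_{n-N+1}^{n-1} w_n) / C(w_{n-N+1}^{n-1})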

Estimates for Bigrams
Silly corpus: "<S> The big red dog barks against the big pink dog". Again, the number of bigram tokens in the corpus is essentially equal to the number of word tokens; also note the distinction between word types and word tokens. A small sketch of the computation follows.
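A minimal sketch of the maximum-likelihood bigram estimates on the silly corpus (plain Python, no smoothing):

```python
from collections import Counter

# The slide's silly corpus, lowercased for counting.
corpus = "<S> the big red dog barks against the big pink dog".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_mle(w_prev, w):
    """P(w | w_prev) = C(w_prev, w) / C(w_prev)."""
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

print(bigram_mle("the", "big"))   # 1.0  ("the" is always followed by "big" here)
print(bigram_mle("big", "red"))   # 0.5  ("big" is followed by "red" and by "pink")
```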

Berkeley Restaurant Project (1994): Bigram Counts
Corpus: ~10,000 sentences, 1,616 word types. Notice that the majority of the values in the table are 0, and the full matrix would be even sparser.

BERP Table: bigram probabilities.

BERP Table Comparison: Counts vs. Probabilities
Do the probabilities in each row sum to 1?

Some Observations
What is being captured here?
- P(want | I) = .32: application (the application is a question-answering system, so "I want" is frequent)
- P(to | want) = .65: syntax ("want" often subcategorizes for an infinitive)
- P(eat | to) = .26: application (the system is about eating out)
- P(food | Chinese) = .56: application / cultural
- P(lunch | eat) = .055: semantics

Some Observations
The corpus comes from a speech-based restaurant consultant, so sequences such as "I I I want", "I want I want to", and "The food I want is" actually occur, which is why bigrams like P(I | I), P(want | I), and P(I | food) have non-zero probability.

Generation
Choose n-grams according to their probabilities and string them together. You can generate the most likely sentences, e.g.:
I want -> want to -> to eat -> eat lunch
I want -> want to -> to eat -> eat Chinese -> Chinese food
A small sampling sketch follows.
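A minimal sketch of this generation procedure, assuming a hypothetical dictionary bigram_probs[w_prev] that maps each next word to P(w | w_prev) (e.g., estimated as in the earlier counting sketch):

```python
import random

def generate(bigram_probs, start="<S>", max_len=20):
    """Sample one word at a time from P(w | previous word) and string them together."""
    sentence, prev = [], start
    for _ in range(max_len):
        candidates = bigram_probs.get(prev)
        if not candidates:
            break
        words = list(candidates)
        probs = [candidates[w] for w in words]
        nxt = random.choices(words, weights=probs, k=1)[0]
        if nxt == "</S>":            # assumed end-of-sentence symbol
            break
        sentence.append(nxt)
        prev = nxt
    return " ".join(sentence)
```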

Two Problems with Applying the Model to a Sentence
(1) We multiply many small probabilities, and (2) unseen n-grams get probability 0. Assigning 0 to unseen n-grams is qualitatively wrong as a prior model, because it would never allow those n-grams, while clearly some of the unseen n-grams will appear in other texts.

Problem (1)
We may need to multiply many very small numbers (underflows!). Easy solution: convert probabilities to logs and then do additions. To get the real probability (if you need it), go back to the antilog.
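A small illustration of the log-space trick, with made-up probability values:

```python
import math

# Sum log-probabilities instead of multiplying probabilities to avoid underflow.
word_probs = [0.08, 0.32, 0.65, 0.26, 0.001]   # illustrative P(w_k | w_{k-1}) values

log_prob = sum(math.log(p) for p in word_probs)
prob = math.exp(log_prob)   # back via the antilog (may still underflow for very long texts)
print(log_prob, prob)
```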

Problem (2)
The probability matrix for n-grams is sparse: how can we assign a probability to a sequence where one of the component n-grams has a count of zero?
Solutions: add-one smoothing; Good-Turing; back-off and deleted interpolation.
Let's assume we are using N-grams and that all the words are known and have been seen. We can go to a lower-order n-gram (back off from bigrams to unigrams) or replace the zero with something else. Some useful observations: a small number of events occur with high frequency, and you can collect reliable statistics on them from relatively small samples; a large number of events occur with low frequency, and you might have to wait a long time to gather statistics on them. Some zeroes are really zeroes, meaning they represent events that cannot or should not occur; other zeroes are not really zeroes: they represent low-frequency events that simply did not occur in the corpus.

Add-One Smoothing
Make the zero counts 1 (add one to every count). Rationale: if you had seen these "rare" events, chances are you would only have seen them once; they are just events you have not seen yet. For the unigram case (formulas below), we can also ask: what count c* would have generated P*? That adjusted count reflects the discount applied to the seen events.
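The standard add-one (Laplace) formulas, where N is the number of tokens, V the vocabulary size, and c the original count:

unigram: P*(w_i) = (c_i + 1) / (N + V)
bigram:  P*(w_n | w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V)
adjusted ("discounted") count: c_i* = (c_i + 1) N / (N + V)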

Add-One: Bigram Counts
(Tables of add-one smoothed bigram counts and probabilities.) Summing the smoothed probabilities P*(w_n | w_{n-1}) over all w_n must add up to 1.

BERP: Original vs. Add-One Smoothed Counts
The smoothed (reconstituted) counts can differ sharply from the originals, e.g., roughly a factor of 8 for "Chinese food".

Add-One Smoothing: Problems
- An excessive amount of probability mass is moved to the zero counts.
- When compared empirically with MLE or other smoothing techniques, it performs poorly, so it is not used in practice. Detailed discussion in [Gale and Church 1994].
- The problem can be reduced by adding smaller values than 1 (e.g., 0.5 or 0.2).

Better Smoothing Techniques
- Good-Turing (clear example in the textbook).
- More advanced: backoff and interpolation. To estimate an n-gram, use "lower-order" n-grams.
  - Backoff: rely on the lower-order n-grams when we have zero counts for a higher-order one; e.g., use unigrams to estimate zero-count bigrams.
  - Interpolation: mix the probability estimates from all the n-gram estimators; e.g., a weighted interpolation of trigram, bigram and unigram counts (see the formulas below).
The probability of an unseen event can be measured in a corpus by looking at how often such events happen. Take the single-word case first: assume a corpus of N tokens and T types; how many times was an as-yet-unseen type encountered?
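In symbols, the standard forms of these two ideas (the λ weights sum to 1 and are typically set on held-out data):

Backoff (simplified): P(w_n | w_{n-1}) = P_ML(w_n | w_{n-1}) if C(w_{n-1} w_n) > 0, otherwise α(w_{n-1}) P(w_n)
Interpolation: P_hat(w_n | w_{n-2} w_{n-1}) = λ1 P(w_n | w_{n-2} w_{n-1}) + λ2 P(w_n | w_{n-1}) + λ3 P(w_n)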

N-Grams Summary
Sentence probabilities are impossible to estimate directly: assuming 10^4 word types and an average sentence of 10 words, the sample space is (10^4)^10 = 10^40 possible sentences, and the chain rule does not help; most sentences/sequences will not appear, or will appear only once. With the Markov assumption the sample space shrinks: unigram 10^4, bigram 10^8, trigram 10^12. At the very least your corpus should be much larger than your sample space. The resulting matrix is still sparse, hence smoothing techniques.

You Still Need a Big Corpus
The biggest one is the Web! It is impractical to download every page from the Web to count the n-grams, so we rely on page counts, which are only approximations: a page can contain an n-gram multiple times, and search engines round off their counts. Such "noise" is tolerable in practice.

Model Evaluation
How different is our approximation q from the actual distribution p? Relative entropy (KL divergence):
D(p || q) = Σ_{x ∈ X} p(x) log2( p(x) / q(x) )
Definition: the relative entropy is a measure of how different two probability distributions (over the same event space) are; it is the average number of bits wasted by encoding events from distribution p with distribution q. Related: mutual information I(X; Y) = D( p(x,y) || p(x) p(y) ). Conventions: 0 log(0/q) = 0 and p log(p/0) = ∞.
Example: P = (1/4, 3/4), Q = (1/2, 1/2):
D(P || Q) = (1/4) log2((1/4)/(1/2)) + (3/4) log2((3/4)/(1/2)) = -1/4 + (3/4) log2(3/2) ≈ 0.19 bits (log2(3/2) ≈ 0.6)

Entropy
Def. 1: a measure of uncertainty. Def. 2: a measure of the information we need to resolve an uncertain situation. Def. 3: a measure of the information we obtain from an experiment that resolves an uncertain situation.
X is not limited to numbers: it ranges over a set of basic elements, which can be words, parts of speech, etc. Let p(x) = P(X = x), where x ∈ X. Then
H(p) = H(X) = -Σ_{x ∈ X} p(x) log2 p(x)
It is normally measured in bits.
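A small numerical sketch of the last two slides' quantities, using the P = (1/4, 3/4) vs. Q = (1/2, 1/2) example:

```python
import math

def entropy(p):
    """H(p) = -sum p(x) log2 p(x), in bits."""
    return -sum(px * math.log2(px) for px in p if px > 0)

def kl_divergence(p, q):
    """D(p || q) = sum p(x) log2(p(x)/q(x)), in bits (0 log 0 = 0)."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

P, Q = [0.25, 0.75], [0.5, 0.5]
print(entropy(P))            # ~0.811 bits
print(kl_divergence(P, Q))   # ~0.189 bits
```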

Entropy of a sequence, entropy rate, and the entropy of a language
Let's try to find an alternative measure that does not require summing over all possible sequences. The Shannon-McMillan-Breiman theorem, assuming the language is generated by a stationary and ergodic process, tells us that entropy can be computed by taking the average log probability of a very long sample (see the statement below). Ergodic process: from each state there is a non-zero probability of ending up in any other state. Stationary: the probability distributions on words / sequences of words are time invariant (for a bigram model, if the probability depends only on the previous word at time t, it depends only on the previous word at any other future time). Is natural language like this? NL is not stationary: the probability of upcoming words can depend on events that were arbitrarily distant and time dependent; new expressions regularly enter the language while others die out. But if we take a snapshot of text from a certain period, it may be a good approximation.
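For reference, the resulting estimate in its standard form: for a stationary, ergodic source, the entropy rate of the language can be obtained from a single long sample:

H(L) = lim_{n→∞} -(1/n) log2 p(w_1 w_2 ... w_n)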

Cross-Entropy
Between a probability distribution P and another distribution Q (a model for P), applied to language. We cannot use the KL divergence (relative entropy) directly, since it requires the true distribution P; but we can use a related measure of distance called cross-entropy. The bigger the probability Q(w_1^n) that the model assigns to the sample, the smaller the cross-entropy. Between two models Q1 and Q2, the more accurate one is the one with the lower cross-entropy.
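In symbols (standard forms):

H(P, Q) = lim_{n→∞} -(1/n) Σ_{w_1^n} P(w_1^n) log2 Q(w_1^n)
and, under the stationary/ergodic assumption, it can be approximated on a single long sample as
H(P, Q) ≈ -(1/n) log2 Q(w_1^n)
Perplexity is 2^{H(P,Q)}. Since D(P || Q) = H(P, Q) - H(P) ≥ 0, the cross-entropy is an upper bound on the true entropy H(P).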

Model Evaluation: In Practice
(A) Split the corpus into a training set and a test set. (B) Train the model on the training set (counting frequencies, smoothing). (C) Apply the model to the test set and compute its cross-entropy / perplexity.
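A minimal sketch of step (C), assuming a hypothetical smoothed estimator bigram_prob(w_prev, w) trained in step (B):

```python
import math

def perplexity(test_tokens, bigram_prob):
    """2 ** cross-entropy, where cross-entropy = -(1/N) sum log2 P(w_i | w_{i-1})."""
    pairs = list(zip(test_tokens, test_tokens[1:]))
    log_prob = sum(math.log2(bigram_prob(w_prev, w)) for w_prev, w in pairs)
    cross_entropy = -log_prob / len(pairs)
    return 2 ** cross_entropy
```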

Knowledge-Formalisms Map (including probabilistic formalisms)
- Morphology, syntax: state machines (and probabilistic versions): finite-state automata, finite-state transducers, Markov models
- Syntax, semantics: rule systems (and probabilistic versions), e.g., (probabilistic) context-free grammars
- Semantics, pragmatics, discourse and dialogue: logical formalisms (first-order logics), AI planners
This is my conceptual map, the master plan. Markov models are used for part-of-speech tagging and dialogue. Syntax is the study of the formal relationships between words: how words are clustered into classes (which determine how they group and behave), and how they group with their neighbors into phrases.

Markov Models: State Machines
Markov models generalize finite-state automata with probabilistic transitions and (later) probabilistic symbol emission. (The slide shows a small automaton over states q0..q3 with symbols k, z, r and transition probabilities such as 0.2, 0.8, 0.5, 1.) Their first use was in modeling the letter sequences in works of Russian literature; they were later developed as a general statistical tool. Markov models can be used whenever we want to model the probability of a linear sequence of events.

Example of a Markov Chain
(Diagram: a chain over the letters a, p, h, e, t, i plus a Start state, with transition probabilities such as .3, .4, .6, and 1.)

Markov Chain: Formal Description
(1) A stochastic transition matrix A, with a_ij = P(X_{t+1} = s_j | X_t = s_i) and each row summing to 1 (the slide shows the matrix for the a/p/h/e/t/i example, with entries such as .3, .4, .6, and 1).
(2) A probability distribution over the initial states, e.g., t: .6, i: .4.
Manning/Schütze, 2000: 318

Markov Assumptions
Let X = (X_1, ..., X_T) be a sequence of random variables taking values in some finite set S = {s_1, ..., s_n}, the state space. The Markov properties are:
(a) Limited Horizon: for all t, P(X_{t+1} | X_1, ..., X_t) = P(X_{t+1} | X_t)
(b) Time Invariant: for all t, P(X_{t+1} | X_t) = P(X_2 | X_1), i.e., the dependency does not change over time (the transition matrix does not change).
If X possesses these properties, then X is said to be a Markov chain.

Markov Chain: Probability of a Sequence of States X_1 ... X_T
Example: similar to .......? (Manning/Schütze, 2000: 320)
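For reference, the standard formula for the probability of a state sequence, written with the initial-state probabilities π and the transition matrix A from the formal description above:

P(X_1, ..., X_T) = P(X_1) P(X_2 | X_1) P(X_3 | X_2) ... P(X_T | X_{T-1}) = π_{X_1} ∏_{t=1}^{T-1} a_{X_t X_{t+1}}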

Hidden Markov Model (State Emission)
(Diagram: hidden states s1..s4 plus a Start state, with transition probabilities such as .3, .4, .6, 1 and emission probabilities over the symbols a, b, i such as .1, .5, .9.)

Next Time
Finish Hidden Markov Models (HMMs); Part-of-Speech (POS) tagging.