Resolving Word Ambiguities

Resolving Word Ambiguities
Description: after determining word boundaries, the speech recognition process matches an array of possible word sequences from the spoken audio.
Issues to consider:
– How do we determine the intended word sequence?
– How do we resolve grammatical and pronunciation errors?
Applications: spell-checking, allophonic variations of pronunciation, automatic speech recognition.
Implementation: establish word sequence probabilities
– Use existing corpora
– Train the program with run-time data

Entropy
Questions we can answer if we can compute entropy:
– How much information is there in a particular grammar, in words, parts of speech, phonemes, etc.?
– How predictive is a language: how well can we predict the next word from the previous words?
– How difficult is a speech recognition task?
Entropy is measured in bits:
– the least number of bits required to encode a piece of information
– it measures the quantity of information in a signal stream
Definition: H(X) = -∑_{i=1}^{r} p_i log₂ p_i, where X is a random variable that can assume r values with probabilities p_1, …, p_r.
Knowing the entropy of spoken languages could help focus implementation possibilities.

Entropy Example
Eight horses are in an upcoming race and we want to take bets on the winner.
– The naïve approach is to use 3 bits per bet
– A better approach is to use fewer bits for the horses bet on more frequently
Entropy: what is the minimum number of bits needed?
H = -∑_{i=1}^{8} p(i) log₂ p(i)
  = -½ log ½ - ¼ log ¼ - ⅛ log ⅛ - 1/16 log 1/16 - 4 · (1/64 log 1/64)
  = 1/2 + 2/4 + 3/8 + 4/16 + 4 · 6/64
  = 2 bits
The table below shows an optimal coding scheme.
Horse   Odds    Code
1       1/2     0
2       1/4     10
3       1/8     110
4       1/16    1110
5       1/64    111100
6       1/64    111101
7       1/64    111110
8       1/64    111111
Question: what if the odds were all equal (1/8)? (Then 3 bits would be needed.)
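As a quick check of the arithmetic above, here is a minimal Python sketch (not part of the original slides) that computes the entropy of the horse-race distribution and of the uniform case:

```python
import math

def entropy(probs):
    """Entropy in bits: H = -sum p_i * log2(p_i), skipping zero probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Odds from the slide: 1/2, 1/4, 1/8, 1/16, and four horses at 1/64 each.
horse_odds = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
print(entropy(horse_odds))   # 2.0 bits
print(entropy([1/8] * 8))    # 3.0 bits when all odds are equal
```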

Entropy of words and languages
What is the entropy of a sequence of words?
– H(w_1, w_2, …, w_n) = -∑_{W_1^n ∈ L} p(W_1^n) log p(W_1^n)
– W_1^n ranges over all sequences of n words in the language L
– p(W_1^n) = probability of the n-word sequence W_1^n
What is the entropy per word of an n-word sequence?
– (1/n) H(W_1^n) = -(1/n) ∑_{W_1^n ∈ L} p(W_1^n) log p(W_1^n)
What is the entropy of a language?
– H(L) = lim_{n→∞} -(1/n) ∑_{W_1^n ∈ L} p(W_1^n) log p(W_1^n)

Cross Entropy
We want to know the entropy of a language L, but we don't know its true distribution p.
We model L by an approximation m to its probability distribution.
We take sequences of words, phonemes, etc. from the real language, but use the formula
H(p, m) = lim_{n→∞} -(1/n) log m(w_1, w_2, …, w_n)
Cross entropy is always an upper bound on the actual entropy of the language.
Example:
– Trigram model trained on 583 million words of English
– Test corpus of 1,014,312 tokens
– Character entropy computed from the tri-gram grammar
– Result: 1.75 bits per character
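A minimal sketch of how cross entropy is measured in practice, assuming a simple add-one-smoothed unigram model as the approximation m; the training and test texts are made up for illustration:

```python
import math
from collections import Counter

def unigram_model(tokens, vocab_size, alpha=1.0):
    """Add-alpha-smoothed unigram model m(w); smoothing keeps m(w) > 0 for unseen words."""
    counts = Counter(tokens)
    total = len(tokens)
    return lambda w: (counts[w] + alpha) / (total + alpha * vocab_size)

def cross_entropy(m, test_tokens):
    """H(p, m) approximated as -(1/n) * sum of log2 m(w_i) over held-out tokens."""
    n = len(test_tokens)
    return -sum(math.log2(m(w)) for w in test_tokens) / n

train = "the dog bites the man the man bites back".split()
test = "the dog bites back".split()
m = unigram_model(train, vocab_size=len(set(train)) + 1)  # +1 slot for unknown words
print(f"{cross_entropy(m, test):.2f} bits per word")
```

A better model m assigns higher probability to the held-out text and therefore yields a lower (tighter) cross-entropy bound.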

Probability Chain Rule
Conditional probability: P(A_1, A_2) = P(A_1) · P(A_2 | A_1)
The chain rule generalizes to multiple events:
– P(A_1, …, A_n) = P(A_1) P(A_2 | A_1) P(A_3 | A_1, A_2) … P(A_n | A_1 … A_{n-1})
Examples:
– P(the dog) = P(the) P(dog | the)
– P(the dog bites) = P(the) P(dog | the) P(bites | the dog)
Conditional probabilities are more informative than individual relative word frequencies because they take context into account:
– dog may be a relatively rare word in a corpus
– but if we have just seen barking, P(dog | barking) is much higher
In general, the probability of a complete string of words w_1 … w_n is:
P(w_1^n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1 w_2) … P(w_n | w_1 … w_{n-1})
Goal: detecting likely word sequences using probabilities.
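A minimal sketch of the chain-rule factorization, using the bigram approximation introduced on the later slides; the probability values here are hypothetical placeholders, not figures from the slides:

```python
# Hypothetical probabilities, purely for illustration.
P_unigram = {"the": 0.06}
P_bigram = {("the", "dog"): 0.004, ("dog", "bites"): 0.01}

def chain_rule_bigram(words):
    """P(w1..wn) ~ P(w1) * product of P(wi | wi-1): the chain rule with a bigram approximation."""
    p = P_unigram[words[0]]
    for prev, cur in zip(words, words[1:]):
        p *= P_bigram[(prev, cur)]
    return p

print(chain_rule_bigram(["the", "dog", "bites"]))  # 0.06 * 0.004 * 0.01 = 2.4e-06
```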

Counts
What's the probability of "canine"? What's the probability of "canine tooth", i.e. P(tooth | canine)? What's the probability of "canine companion"?
P(tooth | canine) = P(canine tooth) / P(canine)
Sometimes we can use counts to deduce probabilities. Example, according to Google:
– "canine" occurs 1,750,000 times
– "canine tooth" occurs 6,280 times
– P(tooth | canine) = 6,280 / 1,750,000 ≈ .0036
– P(companion | canine) ≈ .01
– So companion is the more likely next word after canine
Goal: detecting likely word sequences using counts / table look-up.
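A tiny sketch of the count-based deduction above, using the counts quoted on the slide:

```python
count_canine = 1_750_000        # count of "canine" (from the slide)
count_canine_tooth = 6_280      # count of "canine tooth" (from the slide)

p_tooth_given_canine = count_canine_tooth / count_canine
print(round(p_tooth_given_canine, 4))   # ~0.0036
print(p_tooth_given_canine < 0.01)      # True: "companion" (P ~ .01) is the more likely next word
```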

Single Word Probabilities
Single-word approach: for each candidate word w, compute the likelihood P(O|w) of the observed pronunciation O = [ni], then multiply by P(w):
Word   P(O|w) · P(w)
new    P([ni]|new) · P(new)
neat   P([ni]|neat) · P(neat)
need   P([ni]|need) · P(need)
knee   P([ni]|knee) · P(knee)
Limitation: this ignores context. We might need to factor in the surrounding words:
– Use P(need | I) instead of just P(need)
– Note: P(new | I) < P(need | I)
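A minimal sketch of choosing the word that maximizes P(O|w) · P(w). The numeric probabilities below are placeholders: the values in the original slide's table did not survive in this transcript, so treat them as illustrative only.

```python
# Pick the most likely intended word for the observed pronunciation [ni].
# Placeholder probabilities, NOT the values from the original slide.
candidates = {
    # word:  (P(O=[ni] | w),  P(w))
    "new":  (0.36, 0.0010),
    "neat": (0.52, 0.0003),
    "need": (0.11, 0.0006),
    "knee": (1.00, 0.00002),
}

best = max(candidates, key=lambda w: candidates[w][0] * candidates[w][1])
print(best)  # the word maximizing P(O|w) * P(w)
```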

Word Prediction Approaches: Simple vs. Smart
Simple: every word follows every other word with equal probability (0-gram)
– Assume |V| is the size of the vocabulary
– Likelihood of a sentence S of length n is 1/|V| × 1/|V| × … × 1/|V| (n times)
– If English has 100,000 words, the probability of each next word is 1/100,000 = 0.00001
Smarter: the probability of each next word is related to its word frequency (uni-gram)
– Likelihood of sentence S = P(w_1) × P(w_2) × … × P(w_n)
– Assumes the probability of each word is independent of the other words
Even smarter: look at the probability given the previous word (bi-gram)
– Likelihood of sentence S = P(w_1) × P(w_2|w_1) × … × P(w_n|w_{n-1})
– Assumes the probability of each word depends on the previous word
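A minimal sketch comparing the three approaches on a toy corpus (the corpus and sentence are made up for illustration):

```python
from collections import Counter

corpus = "the dog bites the man and the man bites the dog".split()
V = len(set(corpus))
N = len(corpus)
unigram = Counter(corpus)
bigram = Counter(zip(corpus, corpus[1:]))

def zero_gram(sentence):            # every word equally likely
    return (1 / V) ** len(sentence)

def uni_gram(sentence):             # words independent, frequency-based
    p = 1.0
    for w in sentence:
        p *= unigram[w] / N
    return p

def bi_gram(sentence):              # each word conditioned on its predecessor
    p = unigram[sentence[0]] / N
    for prev, cur in zip(sentence, sentence[1:]):
        p *= bigram[(prev, cur)] / unigram[prev]
    return p

s = "the dog bites".split()
print(zero_gram(s), uni_gram(s), bi_gram(s))  # the bigram model gives the highest likelihood here
```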

Common Spelling Errors
– They are leaving in about fifteen minuets.
– The study was conducted manly be John Black.
– The design an construction of the system will take more than a year.
– Hopefully, all with continue smoothly in my absence.
– Can they lave him my messages?
– I need to notified the bank of….
– He is trying to fine out.
A spell check that does not consider context will fail on these: each misspelling is itself a valid word.
Difficulty: detecting grammatical errors or nonsensical expressions.

N-grams
0-gram: every word's likelihood is equal
– Each word of a 300,000-word corpus has probability 1/300,000
Uni-gram: a word's likelihood depends on its frequency count
– the occurs 69,971 times in the Brown corpus of 1,000,000 words
Bi-gram: a word's likelihood is determined by the previous word, P(w_i | w_{i-1})
– the appears with frequency .07; rabbit appears with a much lower frequency
– yet rabbit is a more likely word to follow white than the is
Tri-gram: a word's likelihood is determined by the previous two words, P(w_i | w_{i-1}, w_{i-2})
Question: how many previous words should we consider?
– Test: generate random sentences from Shakespeare
– Result: trigram sentences start looking like those of Shakespeare
– Tradeoffs: computational overhead and memory requirements
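A minimal sketch of the "generate random sentences" test mentioned above, using a bigram model; the training text is a toy stand-in for Shakespeare:

```python
import random
from collections import Counter, defaultdict

def train_bigrams(tokens):
    """Collect successor counts for a bigram model."""
    successors = defaultdict(Counter)
    for prev, cur in zip(tokens, tokens[1:]):
        successors[prev][cur] += 1
    return successors

def generate(successors, start, length=10):
    """Sample each next word in proportion to its bigram count after the previous word."""
    words = [start]
    for _ in range(length - 1):
        choices = successors.get(words[-1])
        if not choices:
            break
        nxt = random.choices(list(choices), weights=list(choices.values()))[0]
        words.append(nxt)
    return " ".join(words)

corpus = "to be or not to be that is the question".split()
model = train_bigrams(corpus)
print(generate(model, "to"))
```

Higher-order models (trigrams and beyond) produce more fluent output, at the cost of larger count tables and worse data sparseness.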

The Sparse Data Problem
Definitions:
– Maximum likelihood: finding the most probable sequence of tokens given the context of the input
– N-gram: a sequence of n words whose context speech algorithms consider
– Training data: the probabilities computed from a corpus of text data
– Sparse data problem: how should algorithms handle n-grams that have very low probabilities?
Data sparseness is a frequently occurring problem; algorithms will make incorrect decisions if it is not handled.
Problem 1: low-frequency n-grams
– Assume n-gram x occurs twice and n-gram y occurs once
– Is x really twice as likely to occur as y?
Problem 2: zero counts
– Probabilities compute to zero for n-grams not seen in the corpus
– If n-gram y does not occur, should its probability be zero?
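A minimal sketch of the zero-count problem on a toy corpus, using unsmoothed (maximum-likelihood) bigram estimates:

```python
from collections import Counter

corpus = "the dog bites the man".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def mle_bigram(prev, cur):
    """Unsmoothed maximum-likelihood estimate: count(prev cur) / count(prev)."""
    return bigrams[(prev, cur)] / unigrams[prev]

# "the cat" never occurs in this tiny corpus, so any sentence containing it gets probability 0.
print(mle_bigram("the", "dog"))  # 0.5
print(mle_bigram("the", "cat"))  # 0.0 -> the sparse data problem
```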

Smoothing
Smoothing: an algorithm that redistributes the probability mass.
Discounting: reduces the probabilities of n-grams with non-zero counts to accommodate the n-grams with zero counts (those unseen in the corpus).
Definition: a corpus is a collection of written or spoken material in machine-readable form.

Add-One Smoothing
The naïve smoothing technique:
– Add one to the count of every seen and unseen n-gram
– Add the total increased count to the probability mass
Example: uni-grams
– Un-smoothed probability for word w: P(w) = c(w) / N
– Add-one revised probability for word w: P₊₁(w) = (c(w) + 1) / (N + V)
– N = number of words encountered, V = vocabulary size, c(w) = number of times word w was encountered
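A minimal sketch of add-one (Laplace) smoothing for uni-grams, following the formula above; the toy corpus and vocabulary are made up for illustration:

```python
from collections import Counter

def add_one_unigram(tokens, vocab):
    """Add-one smoothed unigram probabilities: (c(w) + 1) / (N + V)."""
    counts = Counter(tokens)
    N, V = len(tokens), len(vocab)
    return {w: (counts[w] + 1) / (N + V) for w in vocab}

tokens = "the dog bites the man".split()
vocab = set(tokens) | {"cat"}            # include an unseen word
probs = add_one_unigram(tokens, vocab)
print(probs["the"], probs["cat"])        # seen and unseen words both get non-zero mass
print(round(sum(probs.values()), 6))     # probabilities sum to 1 over the vocabulary
```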

Add-One Smoothing Example
P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})
P₊₁(w_n | w_{n-1}) = [C(w_{n-1} w_n) + 1] / [C(w_{n-1}) + V]
Note: this example assumes bi-gram counts and a vocabulary of V = 1616 words.
Note: a row gives the number of times the word in each column precedes the word on the left, or starts a sentence.
Note: C(I)=3437, C(want)=1215, C(to)=3256, C(eat)=938, C(Chinese)=213, C(food)=1506, C(lunch)=459
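A minimal sketch of the add-one bi-gram formula above. The unigram counts and V = 1616 come from the slide; the bigram count C(I want) = 1087 is an assumption (the figure from the widely used Berkeley Restaurant Project example), since the original count table is not preserved in this transcript:

```python
V = 1616                                  # vocabulary size from the slide
unigram = {"I": 3437, "want": 1215, "to": 3256, "eat": 938,
           "Chinese": 213, "food": 1506, "lunch": 459}
bigram = {("I", "want"): 1087}            # illustrative bigram count (assumed)

def p_mle(prev, cur):
    return bigram.get((prev, cur), 0) / unigram[prev]

def p_add_one(prev, cur):
    return (bigram.get((prev, cur), 0) + 1) / (unigram[prev] + V)

print(p_mle("I", "want"), p_add_one("I", "want"))    # smoothed estimate is noticeably lower
print(p_mle("I", "lunch"), p_add_one("I", "lunch"))  # unseen bigram gets a small non-zero probability
```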

Add-One Discounting
Adjusted (discounted) counts: c′(w_{i-1} w_i) = (c(w_{i-1} w_i) + 1) · N / (N + V)
Tables: original bi-gram counts and revised bi-gram counts.
Note: high counts are reduced by approximately a third in this example.
Note: low counts get larger.
Note: N = c(w_{i-1}), V = vocabulary size = 1616
Unigram counts C(w):
I        3437
want     1215
to       3256
eat       938
Chinese   213
food     1506
lunch     459

Evaluation of Add-One Smoothing
Advantage:
– Simple technique to implement and understand
Disadvantages:
– Too much probability mass moves to the unseen n-grams
– Underestimates the probabilities of the common n-grams
– Overestimates the probabilities of rare (or unseen) n-grams
– Relative smoothing of all unseen n-grams is the same
– Relative smoothing of rare n-grams is still incorrect
Alternative:
– Use a smaller add value
– Disadvantage: does not fully solve the problem

Unigram Witten-Bell Discounting
Compute the probability of a first-time encounter of a new word
– Note: every one of the O observed words had a first encounter
– Number of unseen words: U = V – O
– What is the probability of encountering a new word? Answer: P(any newly encountered word) = O/(V+O)
Distribute this probability equally across all unobserved words
– P(any specific newly encountered word) = (1/U) · O/(V+O)
– Adjusted count = V · (1/U) · O/(V+O)
Discount each encountered word i to preserve probability space
– Probability from count_i/V to count_i/(V+O)
– Discounted count from count_i to count_i · V/(V+O)
Summary: add probability mass to un-encountered words; discount the rest.
O = observed words, U = words never seen, V = corpus vocabulary words

Bi-gram Witten-Bell Discounting
Consider bi-grams w_{n-1} w_n:
– O(w_{n-1}) = number of uniquely observed bi-grams starting with w_{n-1}
– V(w_{n-1}) = count of bi-grams starting with w_{n-1}
– U(w_{n-1}) = number of un-observed bi-grams starting with w_{n-1}
Compute the probability of a new bi-gram starting with w_{n-1}
– Answer: P(any newly encountered bi-gram) = O(w_{n-1}) / (V(w_{n-1}) + O(w_{n-1}))
– Note: we observed O(w_{n-1}) first encounters in V(w_{n-1}) + O(w_{n-1}) events
– Note: an event is either a bi-gram occurrence or a first-time encounter
Divide this probability among all unseen bi-grams new(w_{n-1})
– Adjusted P(new(w_{n-1})) = (1/U(w_{n-1})) · O(w_{n-1}) / (V(w_{n-1}) + O(w_{n-1}))
– Adjusted count = V(w_{n-1}) · (1/U(w_{n-1})) · O(w_{n-1}) / (V(w_{n-1}) + O(w_{n-1}))
Discount observed bi-grams starting with w_{n-1} to preserve probability space
– Probability from c(w_{n-1} w_n)/V(w_{n-1}) to c(w_{n-1} w_n)/(V(w_{n-1}) + O(w_{n-1}))
– Counts from c(w_{n-1} w_n) to c(w_{n-1} w_n) · V(w_{n-1})/(V(w_{n-1}) + O(w_{n-1}))
Summary: add probability mass to un-encountered bi-grams; discount the rest.
O = observed bi-grams, U = bi-grams never seen, V = corpus vocabulary bi-grams
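A minimal sketch of the bi-gram formulation above (the unigram case is analogous), following the slide's O/V/U notation; the toy corpus and vocabulary are made up for illustration:

```python
from collections import Counter, defaultdict

corpus = "I want to eat Chinese food I want lunch I want to eat".split()
vocab = set(corpus)

starts = defaultdict(Counter)
for prev, cur in zip(corpus, corpus[1:]):
    starts[prev][cur] += 1

def witten_bell(prev, cur):
    """Witten-Bell bigram estimate using the slide's O, V, U quantities for prev."""
    observed = starts[prev]
    O = len(observed)             # distinct bigram types seen after prev
    V = sum(observed.values())    # bigram tokens starting with prev
    U = len(vocab) - O            # bigram types never seen after prev
    if cur in observed:
        return observed[cur] / (V + O)   # discounted seen bigram
    return (O / (V + O)) / U             # equal share of the reserved mass

print(witten_bell("I", "want"))   # seen bigram, slightly discounted
print(witten_bell("I", "food"))   # unseen bigram, small non-zero probability
print(round(sum(witten_bell("I", w) for w in vocab), 6))  # sums to 1 over the vocabulary
```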

Witten-Bell Smoothing
Adjusted add-one counts: c′(w_{n-1} w_n) = (c(w_{n-1} w_n) + 1) · N/(N + V)
Adjusted Witten-Bell counts:
– c′(w_{n-1} w_n) = (O/U) · V/(V + O)            if c(w_{n-1} w_n) = 0
– c′(w_{n-1} w_n) = c(w_{n-1} w_n) · V/(V + O)   otherwise
Tables: original counts, adjusted add-one counts, adjusted Witten-Bell counts.
Note: V, O, and U refer to counts for w_{n-1}; their values are on the next slide.

Bi-gram Counts for Example
w_{n-1}    O(w_{n-1})   U(w_{n-1})   V(w_{n-1})
I              95         1,521        3,437
want           76         1,540        1,215
to            130         1,486        3,256
eat           124         1,492          938
Chinese        20         1,596          213
food           82         1,534        1,506
lunch          45         1,571          459
O(w_{n-1}) = number of observed bi-grams starting with w_{n-1}
V(w_{n-1}) = count of bi-grams starting with w_{n-1}
U(w_{n-1}) = number of un-observed bi-grams starting with w_{n-1}

Evaluation of Witten-Bell
– Uses the probabilities of already-encountered grams to estimate probabilities for unseen grams
– Has a smaller impact on the probabilities of already-encountered grams
– Generally computes reasonable probabilities

Back-off Discounting
The general concept:
– Consider the trigram (w_{n-2}, w_{n-1}, w_n)
– If c(w_{n-2}, w_{n-1}) = 0, consider the 'back-off' bi-gram (w_{n-1}, w_n)
– If c(w_{n-1}) = 0, consider the 'back-off' unigram w_n
The goal is to use a hierarchy of approximations:
– trigram > bigram > unigram
– Degrade gracefully when higher-level grams don't exist
Given a word sequence fragment w_{n-2} w_{n-1} w_n …, use the following preference rule:
1. p(w_n | w_{n-2} w_{n-1})   if c(w_{n-2} w_{n-1} w_n) ≠ 0
2. α₁ p(w_n | w_{n-1})        if c(w_{n-1} w_n) ≠ 0
3. α₂ p(w_n)                  otherwise
Note: α₁ and α₂ are values carefully computed to preserve probability mass.
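A minimal sketch of the preference rule above. The back-off weights α₁ and α₂ are fixed placeholder constants here rather than the carefully computed mass-preserving values the slide refers to, and the corpus is a toy example:

```python
from collections import Counter

corpus = "I want to eat Chinese food I want to eat lunch".split()
uni = Counter(corpus)
bi = Counter(zip(corpus, corpus[1:]))
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))
N = len(corpus)

ALPHA1, ALPHA2 = 0.4, 0.4   # placeholder back-off weights (assumed, not mass-preserving)

def backoff_p(w2, w1, w):
    """Preference rule: trigram if seen, else weighted bigram, else weighted unigram."""
    if tri[(w2, w1, w)] > 0:
        return tri[(w2, w1, w)] / bi[(w2, w1)]
    if bi[(w1, w)] > 0:
        return ALPHA1 * bi[(w1, w)] / uni[w1]
    return ALPHA2 * uni[w] / N

print(backoff_p("want", "to", "eat"))   # trigram seen
print(backoff_p("lunch", "I", "want"))  # unseen trigram -> backs off to bigram "I want"
print(backoff_p("to", "eat", "food"))   # unseen trigram and bigram -> backs off to unigram "food"
```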

N-grams for Spell Checks
Non-word error detection (easiest)
– Example: graffe => giraffe
Isolated-word (context-free) error correction
– By definition, cannot correct when the error word is itself a valid word
Context-dependent error correction (hardest)
– Example: your an idiot => you're an idiot
– Needed when the mis-typed word happens to be a real word
– Estimated share of such real-word errors: 15% (Peterson, 1986), 25%–40% (Kukich, 1992)
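A minimal sketch of context-dependent correction with bigram counts, assuming a toy corpus and a hand-supplied candidate set; this illustrates the idea only and is not the full noisy-channel spelling model:

```python
from collections import Counter

# Toy corpus; counts are purely illustrative.
corpus = "you're an idiot he said and you're an angel she said your idea is good".split()
bi = Counter(zip(corpus, corpus[1:]))

def pick_candidate(candidates, next_word):
    """Choose the candidate word that most often precedes `next_word` in the corpus."""
    return max(candidates, key=lambda w: bi[(w, next_word)])

# "your an idiot": both "your" and "you're" are real words, so a context-free
# checker cannot flag the error; bigram context prefers "you're" before "an".
print(pick_candidate(["your", "you're"], "an"))   # you're
```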