Speech Recognition
Observe: Acoustic signal (A = a_1, …, a_n)
Challenge: Find the likely word sequence
But we also have to consider the context
Starting at this point, we need to be able to model the target language
LML Speech Recognition Language Modeling
Description: After determining word boundaries, the speech recognition process matches an array of possible word sequences from spoken audio
Issues to consider
- determine the intended word sequence
- resolve grammatical and pronunciation errors
Implementation:
- Establish word sequence probabilities
- Use existing corpora
- Train program with run-time data
Problem: Our recognizer translates the audio to a possible string of text. How do we know the translation is correct?
Problem: How do we handle a string of text containing words that are not in the dictionary?
Problem: How do we handle strings of valid words that do not form sentences with semantics that make sense?
Problem: Resolving words not in the dictionary
Question: How different is a recognized word from those that are in the dictionary?
Solution: Count the single step transformations necessary to convert one word into another.
Example: caat => cat with removal of one letter
Example: fplc => fireplace requires adding the letters ire after f, a before c, and e at the end
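The counting idea above can be sketched as a standard edit-distance computation. This is only an illustration, assuming the single step transformations are insertion, deletion, and substitution of one letter; the function name `edit_distance` is a hypothetical helper, not something from the slides.

```python
def edit_distance(source: str, target: str) -> int:
    """Smallest number of single-letter insertions, deletions, and
    substitutions that turn `source` into `target`."""
    # previous[j] is the distance between the letters of source processed
    # so far (before the current one) and the first j letters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning i letters into nothing takes i deletions
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


print(edit_distance("caat", "cat"))        # one letter removed
print(edit_distance("graffe", "giraffe"))  # one letter added
```

A recognizer could rank dictionary words by this distance and propose the closest ones as candidates.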
Simple vs. Smart
Simple: Every word follows every other word with equal probability (0-gram)
- Assume |V| is the size of the vocabulary
- Likelihood of sentence S of length n is 1/|V| × 1/|V| × … × 1/|V| (n times)
- If English has 100,000 words, probability of each next word is 1/100,000 = .00001
Smarter: Probability of each next word is related to word frequency
- Likelihood of sentence S = P(w_1) × P(w_2) × … × P(w_n)
- Assumes probability of each word is independent of probabilities of other words
Even smarter: Look at probability given previous words
- Likelihood of sentence S = P(w_1) × P(w_2|w_1) × … × P(w_n|w_{n-1})
- Assumes probability of each word depends on the word that precedes it
Detecting likely word sequences using counts/table look up
What’s the probability of “canine”? What’s the probability of “canine tooth” or tooth | canine? What’s the probability of “canine companion”?
P(tooth|canine) = P(canine & tooth)/P(canine)
Sometimes we can use counts to deduce probabilities.
Example, according to Google:
- “canine” occurs 1,750,000 times
- “canine tooth” occurs 6,280 times
- P(tooth | canine) = 6,280/1,750,000 ≈ .0036
- P(companion | canine) = .01
So companion is the more likely next word after canine
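The count-based estimate above amounts to one division. A minimal sketch, using the two counts quoted on the slide; the function name `conditional_probability` is a hypothetical helper.

```python
def conditional_probability(pair_count: int, first_word_count: int) -> float:
    """Estimate P(second | first) as count(first second) / count(first)."""
    if first_word_count <= 0:
        raise ValueError("the first word must have been observed at least once")
    return pair_count / first_word_count


# Counts quoted on the slide.
canine_count = 1_750_000
canine_tooth_count = 6_280

p_tooth_given_canine = conditional_probability(canine_tooth_count, canine_count)
print(f"P(tooth | canine) = {p_tooth_given_canine:.4f}")
```

Comparing this value with P(companion | canine) = .01 from the slide gives the same ranking: companion is the likelier continuation.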
Single word probability
Compute likelihood P([ni]|w), then multiply by P(w)
Table columns: Word, P(O|w), P(w), P(O|w)P(w)
- new: P([ni]|new)P(new)
- neat: P([ni]|neat)P(neat)
- need: P([ni]|need)P(need)
- knee: P([ni]|knee)P(knee)
Limitation: ignores context
We might need to factor in the surrounding words
- Use P(need|I) instead of just P(need)
- Note: P(new|I) < P(need|I)
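What is the most likely word sequence?
Heard: 'bot  ik-'spen-siv  'pre-z&ns
Candidates for 'bot: boat, bald, bold, bought
Candidates for ik-'spen-siv: excessive, expensive, expressive, inactive
Candidates for 'pre-z&ns: presidents, presence, presents, press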
Detecting likely word sequences using probabilities
Conditional Probability: P(A_1, A_2) = P(A_1) · P(A_2|A_1)
The Chain Rule generalizes to multiple events: P(A_1, …, A_n) = P(A_1) P(A_2|A_1) P(A_3|A_1, A_2) … P(A_n|A_1 … A_{n-1})
Examples:
- P(the dog) = P(the) P(dog | the)
- P(the dog bites) = P(the) P(dog | the) P(bites | the dog)
Conditional probabilities tell us more than individual relative word frequencies because they consider the context
- Dog may be a relatively rare word in a corpus
- But if we see barking, P(dog|barking) is much higher
In general, the probability of a complete string of words w_1 … w_n is:
P(w_1^n) = P(w_1) P(w_2|w_1) P(w_3|w_1 w_2) … P(w_n|w_1 … w_{n-1}) = ∏_{k=1}^{n} P(w_k | w_1 … w_{k-1})
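A small sketch of how the chain rule is used in practice, assuming each word is conditioned only on the word before it (a bi-gram approximation of the full product) and that probabilities are plain relative frequencies. The toy corpus and the function name `sentence_probability` are invented for illustration.

```python
from collections import Counter

# Toy corpus, invented for illustration only.
corpus = "the dog bites the man the dog barks".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))


def sentence_probability(words):
    """P(w1) * P(w2|w1) * ... * P(wn|wn-1), estimated from raw counts."""
    probability = unigram_counts[words[0]] / len(corpus)
    for previous, current in zip(words, words[1:]):
        if unigram_counts[previous] == 0:
            return 0.0  # the history was never seen, so no estimate exists
        probability *= bigram_counts[(previous, current)] / unigram_counts[previous]
    return probability


print(sentence_probability(["the", "dog", "bites"]))
print(sentence_probability(["dog", "the"]))
```

Any word pair missing from the corpus drives the whole product to zero, which is the sparse data problem in miniature.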
How many previous words should we consider?
0-gram: Every word is equally likely
- Each word of a 300,000 word corpus has probability 1/300,000
Uni-gram: A word’s likelihood depends on frequency counts
- The word ‘the’ occurs 69,971 times in the Brown corpus of 1,000,000 words
Bi-gram: Word likelihood determined by the previous word
- P(w|a) = P(w) * P(w|w_{i-1})
- ‘The’ appears with frequency .07; ‘rabbit’ appears far less often
- Yet rabbit is a more likely word to follow the word white than the is
Tri-gram: Word likelihood determined by the previous two words
- P(w|a) = P(w) * P(w|w_{i-1} & w_{i-2})
N-gram: A model of word or phoneme prediction that uses the previous N-1 words or phonemes to predict the next
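Whatever N is chosen, the model starts from a table of n-gram counts. A minimal sketch of collecting them; the example sentence and the helper name `ngrams` are assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Invented example sentence.
tokens = "the white rabbit saw the white house".split()

bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

print(bigram_counts[("the", "white")])
print(trigram_counts[("the", "white", "rabbit")])
```

The number of distinct tuples that could occur grows with every extra word of history, which is why larger N needs far more training text.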
Generating sentences: random unigrams...
Every enter now severally so, let Hill he late speaks; or! a more to leg less first you enter
With bigrams...
What means, sir. I confess she? then all sorts, he is trim, captain. Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry.
Trigrams
Sweet prince, Falstaff shall die. This shall forbid it should be branded, if renown made it empty.
Quadrigrams (output now is Shakespeare)
What! I will go seek the traitor Gloucester. Will you not tell me who I am?
Comments
- The accuracy of an n-gram model increases with increasing n because word combinations are more and more constrained
- Higher-order n-gram models are more and more sparse. Shakespeare produced 0.04% of 844 million possible bigrams.
- There is a tradeoff between accuracy and computational overhead and memory requirements
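Sentences like the ones above can be produced by repeatedly sampling the next word given the previous one. The sketch below does this for bi-grams only; the training text, the function names, and the seed are all assumptions for illustration, and the output depends on the random draw.

```python
import random
from collections import defaultdict


def train_bigram_successors(tokens):
    """Map each word to the list of words that followed it in training.
    Repeats are kept, so frequent successors are sampled more often."""
    successors = defaultdict(list)
    for previous, current in zip(tokens, tokens[1:]):
        successors[previous].append(current)
    return successors


def generate(successors, start, max_words, rng):
    """Random walk over the bi-gram table, starting from `start`."""
    words = [start]
    while len(words) < max_words and successors.get(words[-1]):
        words.append(rng.choice(successors[words[-1]]))
    return words


# Invented training text.
tokens = "I want to eat lunch I want Chinese food I need to eat".split()
model = train_bigram_successors(tokens)
print(" ".join(generate(model, "I", 8, random.Random(0))))
```

Every adjacent pair in the output was seen in training, yet the sentence as a whole may never have occurred; longer histories remove that freedom, which is why the quadrigram output reproduces the original text.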
Unigrams (SWB): Most Common: “I”, “and”, “the”, “you”, “a” Rank-100: “she”, “an”, “going” Least Common: “Abraham”, “Alastair”, “Acura” Bigrams (SWB): Most Common: “you know”, “yeah SENT!”, “!SENT um-hum”, “I think” Rank-100: “do it”, “that we”, “don’t think” Least Common:“raw fish”, “moisture content”, “Reagan Bush” Trigrams (SWB): Most Common: “!SENT um-hum SENT!”, “a lot of”, “I don’t know” Rank-100: “it was a”, “you know that” Least Common:“you have parents”, “you seen Brooklyn”
Non-word detection (easiest)
- Example: graffe => giraffe
Isolated-word (context-free) error correction
- A correction is not possible when the error word is in the dictionary
Context-dependent (hardest)
- Example: your an idiot => you’re an idiot (the mistyped word happens to be a real word)
Misspelled word: acress
Candidates, with probabilities of use and use within context
Table columns: Context, Context * P(c)
Misspelled word: acress
Word frequency percentage is not enough
- We need p(typo|candidate) * p(candidate)
How likely is the particular error?
- Deletion of a t after a c and before an r
- Insertion of an a at the beginning
- Transpose a c and an a
- Substitute a c for an r
- Substitute an o for an e
- Insert an s before the last s, or after the last s
Context of the word within a sentence or paragraph
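Ranking candidates by p(typo|candidate) * p(candidate) can be sketched as below. The probabilities are made-up placeholders chosen only to show the mechanics; they are not the values from the original table, and `best_correction` is a hypothetical helper.

```python
def best_correction(candidates):
    """Return the candidate with the largest p(typo | candidate) * p(candidate).

    `candidates` maps each word to a pair (p_typo_given_word, p_word).
    """
    return max(candidates, key=lambda word: candidates[word][0] * candidates[word][1])


# PLACEHOLDER numbers, invented for illustration only.
candidates_for_acress = {
    "actress": (1e-4, 1e-5),  # likeliest error, but a rarer word
    "across": (1e-5, 3e-4),   # less likely error, much more common word
    "acres": (3e-5, 3e-5),
}

print(best_correction(candidates_for_acress))
```

With these placeholder values the most frequent word wins even though its error is the least likely one, which shows why neither factor is enough on its own.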
Spell check without considering context will fail
- They are leaving in about fifteen minuets
- The study was conducted manly be John Black.
- The design an construction of the system will take more than a year.
- Hopefully, all with continue smoothly in my absence.
- Can they lave him my messages?
- I need to notified the bank of….
- He is trying to fine out.
Difficulty: Detecting grammatical errors, or nonsensical expressions
Definitions
- Maximum likelihood: Finding the most probable sequence of tokens based on the context of the input
- N-gram sequence: A sequence of n words whose context speech algorithms consider
- Training data: A group of probabilities computed from a corpus of text data
- Sparse data problem: How should algorithms handle n-grams that have very low probabilities?
Data sparseness is a frequently occurring problem
Algorithms will make incorrect decisions if it is not handled
Problem 1: Low frequency n-grams
- Assume n-gram x occurs twice and n-gram y occurs once
- Is x really twice as likely to occur as y?
Problem 2: Zero counts
- Probabilities compute to zero for n-grams not seen in the corpus
- If n-gram y does not occur, should its probability be zero?
Smoothing: An algorithm that redistributes the probability mass
Discounting: Reduces probabilities of n-grams with non-zero counts to accommodate the n-grams with zero counts (those unseen in the corpus).
Definition: A corpus is a collection of written or spoken material in machine-readable form
The naïve smoothing technique (add-one)
- Add one to the count of all seen and unseen n-grams
- Add the total increased count to the probability mass
Example: uni-grams
- Un-smoothed probability for word w: P(w) = c(w)/N
- Add-one revised probability for word w: P+1(w) = (c(w) + 1)/(N + V)
N = number of words encountered, V = vocabulary size, c(w) = number of times word w was encountered
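A minimal sketch of add-one smoothing for uni-grams, assuming the usual form of the estimate: c(w)/N without smoothing and (c(w) + 1)/(N + V) with it, using N, V, and c(w) as defined on the slide. The toy text, the vocabulary, and the function name are invented for illustration.

```python
from collections import Counter


def add_one_unigram(counts, vocabulary):
    """Return the add-one probability (c(w) + 1) / (N + V) for each word."""
    total = sum(counts.values())  # N: number of words encountered
    size = len(vocabulary)        # V: vocabulary size
    return {word: (counts[word] + 1) / (total + size) for word in vocabulary}


# Invented data: "lunch" is in the vocabulary but never observed.
counts = Counter("I want to eat I want food".split())
vocabulary = ["I", "want", "to", "eat", "food", "lunch"]

smoothed = add_one_unigram(counts, vocabulary)
print(smoothed["I"], smoothed["lunch"])
```

The unseen word now has a non-zero probability, and the probabilities over the vocabulary still sum to one because V was added to the denominator.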
P(w_n|w_{n-1}) = C(w_{n-1} w_n)/C(w_{n-1})
P+1(w_n|w_{n-1}) = [C(w_{n-1} w_n) + 1]/[C(w_{n-1}) + V]
Note: This example assumes bi-gram counts and a vocabulary V = 1616 words
Note: row = times that word in column precedes word on left, or starts a sentence
Note: C(I)=3437, C(want)=1215, C(to)=3256, C(eat)=938, C(Chinese)=213, C(food)=1506, C(lunch)=459
Original Counts and Revised Counts
c′(w_i, w_{i-1}) = (c(w_i, w_{i-1}) + 1) × N/(N + V)
C(w_i): I 3437, want 1215, to 3256, eat 938, Chinese 213, food 1506, lunch 459
Note: High counts reduce by approximately a third for this example
Note: Low counts get larger
Note: N = c(w_{i-1}), V = vocabulary size = 1616
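The revised counts can be checked numerically. This sketch assumes the revised count is the add-one probability scaled back by N, that is c′ = (c + 1) × N/(N + V) with N = c(w_{i-1}) and V = 1616; the word counts are the ones listed on the slide, while the bi-gram count 1000 passed in below is a hypothetical input and `adjusted_count` is a hypothetical helper.

```python
VOCABULARY_SIZE = 1616  # V, as stated on the slide

# C(w) values listed on the slide.
WORD_COUNTS = {
    "I": 3437, "want": 1215, "to": 3256, "eat": 938,
    "Chinese": 213, "food": 1506, "lunch": 459,
}


def adjusted_count(bigram_count, previous_word):
    """Add-one revised count: (c + 1) * N / (N + V), with N = C(previous_word)."""
    n = WORD_COUNTS[previous_word]
    return (bigram_count + 1) * n / (n + VOCABULARY_SIZE)


# Scaling factor N / (N + V) applied to bi-grams that start with each word.
for word, n in WORD_COUNTS.items():
    print(f"{word:8s} {n / (n + VOCABULARY_SIZE):.2f}")

print(adjusted_count(0, "I"))     # an unseen bi-gram gets a small positive count
print(adjusted_count(1000, "I"))  # a hypothetical high count shrinks
```

For a frequent first word such as I the factor is about two thirds, matching the note that high counts drop by roughly a third; for a rare first word such as Chinese the factor is far smaller, so its bi-grams lose most of their mass.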
Advantage: Simple technique to implement and understand Disadvantages: Too much probability mass moves to the unseen n-grams Underestimates the probabilities of the common n-grams Overestimates probabilities of rare (or unseen) n-grams Relative smoothing of all unseen n-grams is the same Relative smoothing of rare n-grams still incorrect Alternative: Use a smaller add value Disadvantage: Does not fully solve this problem
Add probability mass to un-encountered words; discount the rest
O = observed words, U = words never seen, V = corpus vocabulary words
Compute the probability of a first time encounter of a new word
- Note: Every one of the O observed words had a first encounter
- How many unseen words? U = V - O
- What is the probability of encountering a new word? Answer: P(any newly encountered word) = O/(V + O)
Spread this probability equally across all unobserved words
- P(any specific newly encountered word) = 1/U × O/(V + O)
- Adjusted count = V × 1/U × O/(V + O)
Discount each encountered word i to preserve probability space
- Probability: from count_i/V to count_i/(V + O)
- Discounted count: from count_i to count_i × V/(V + O)
Add probability mass to un-encountered bi-grams; discount the rest
O = observed bi-gram, U = bi-gram never seen, V = corpus vocabulary bi-grams
Consider the bi-gram w_{n-1} w_n
- O(w_{n-1}) = number of uniquely observed bi-grams starting with w_{n-1}
- V(w_{n-1}) = count of bi-grams starting with w_{n-1}
- U(w_{n-1}) = number of un-observed bi-grams starting with w_{n-1}
Compute probability of a new bi-gram starting with w_{n-1}
- Answer: P(any newly encountered bi-gram) = O(w_{n-1})/(V(w_{n-1}) + O(w_{n-1}))
- Note: We observed O(w_{n-1}) bi-grams in V(w_{n-1}) + O(w_{n-1}) events
- Note: An event is either a bi-gram or a first time encounter
Divide this probability among all unseen bi-grams (new(w_{n-1}))
- Adjusted P(new(w_{n-1})) = 1/U(w_{n-1}) × O(w_{n-1})/(V(w_{n-1}) + O(w_{n-1}))
- Adjusted count = V(w_{n-1}) × 1/U(w_{n-1}) × O(w_{n-1})/(V(w_{n-1}) + O(w_{n-1}))
Discount observed bi-grams gram(w_{n-1}) to preserve probability space
- Probability: from c(w_{n-1} w_n)/V(w_{n-1}) to c(w_{n-1} w_n)/(V(w_{n-1}) + O(w_{n-1}))
- Counts: from c(w_{n-1} w_n) to c(w_{n-1} w_n) × V(w_{n-1})/(V(w_{n-1}) + O(w_{n-1}))
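The bi-gram recipe above can be sketched as follows, reading O as the number of distinct words seen after w_{n-1}, V as the total number of bi-gram occurrences starting with w_{n-1}, and U as the vocabulary size minus O. The tiny counts, the vocabulary, and the function name `witten_bell` are assumptions for illustration, and the sketch only covers first words that were seen at least once and that have at least one unseen follower.

```python
from collections import Counter, defaultdict


def witten_bell(bigram_counts, vocabulary):
    """Return P(word | previous) for every observed `previous` and every
    vocabulary word, using the O, U, V quantities from the slide."""
    followers = defaultdict(Counter)
    for (previous, word), count in bigram_counts.items():
        followers[previous][word] += count

    probabilities = {}
    for previous, seen in followers.items():
        observed = len(seen)                     # O: distinct bi-grams seen
        total = sum(seen.values())               # V: bi-gram occurrences
        unobserved = len(vocabulary) - observed  # U: bi-grams never seen
        for word in vocabulary:
            if seen[word] > 0:
                # Discounted estimate for an observed bi-gram.
                p = seen[word] / (total + observed)
            else:
                # Equal share of the mass reserved for new bi-grams.
                p = observed / (unobserved * (total + observed))
            probabilities[(previous, word)] = p
    return probabilities


# Invented counts: after "I" we saw "want" three times and "eat" once.
vocabulary = ["I", "want", "to", "eat"]
bigram_counts = {("I", "want"): 3, ("I", "eat"): 1}

probabilities = witten_bell(bigram_counts, vocabulary)
print(probabilities[("I", "want")], probabilities[("I", "to")])
```

Here O = 2, V = 4, and U = 2, so the two seen bi-grams keep 4/6 of the mass between them and the two unseen ones share the remaining 2/6 equally.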
Original Counts, Adjusted Add-One Counts, Adjusted Witten-Bell Counts
Add-one: c′(w_n, w_{n-1}) = (c(w_n, w_{n-1}) + 1) × N/(N + V), with N = c(w_{n-1}) and V = vocabulary size
Witten-Bell: c′(w_n, w_{n-1}) = (O/U) × V/(V + O) if c(w_n, w_{n-1}) = 0; c(w_n, w_{n-1}) × V/(V + O) otherwise
Note: In the Witten-Bell formula, V, O, U refer to w_{n-1} counts
V, O and U values are on the next slide
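Table columns: O(w_{n-1}), U(w_{n-1}), V(w_{n-1})
- I: O = 95, U = 1,…
- Want: O = 76, U = 1,…
- To: O = 130, U = 1,…
- Eat: O = 124, U = 1,…
- Chinese: O = 20, U = 1,…
- Food: O = 82, U = 1,…
- Lunch: O = 45, U = 1,…
O(w_{n-1}) = number of observed bi-grams starting with w_{n-1}
V(w_{n-1}) = count of bi-grams starting with w_{n-1}
U(w_{n-1}) = number of un-observed bi-grams starting with w_{n-1}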
Estimates probability of already encountered grams to compute probabilities for unseen grams Smaller impact on probabilities of already encountered grams Generally computes reasonable probabilities
The general concept
- Consider the trigram (w_n, w_{n-1}, w_{n-2})
- If c(w_{n-1}, w_{n-2}) = 0, consider the ‘back-off’ bi-gram (w_n, w_{n-1})
- If c(w_{n-1}) = 0, consider the ‘back-off’ unigram w_n
Goal is to use a hierarchy of approximations
- trigram > bigram > unigram
- Degrade gracefully when higher level grams don’t exist
Given a word sequence fragment … w_{n-2} w_{n-1} w_n …, utilize the following preference rule
1. p(w_n|w_{n-2} w_{n-1}) if c(w_{n-2} w_{n-1} w_n) > 0
2. α_1 p(w_n|w_{n-1}) if c(w_{n-1} w_n) > 0
3. α_2 p(w_n) otherwise
Note: α_1 and α_2 are values carefully computed to preserve probability mass
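The preference rule can be sketched as a chain of fallbacks. This is a deliberate simplification: the two weights below are fixed constants (in the style of the technique known as stupid backoff), not the carefully computed values the slide calls for, so the results are scores rather than true probabilities and do not sum to one. The training text, the constants, and the function names are assumptions for illustration.

```python
from collections import Counter

# ASSUMPTION: fixed weights for illustration; a real back-off model derives
# them from the discounted counts so that probability mass is preserved.
ALPHA_1 = 0.4
ALPHA_2 = 0.4 * 0.4


def train(tokens):
    """Count uni-grams, bi-grams, and tri-grams in a token list."""
    return (
        Counter(tokens),
        Counter(zip(tokens, tokens[1:])),
        Counter(zip(tokens, tokens[1:], tokens[2:])),
    )


def backoff_score(w2, w1, w, unigrams, bigrams, trigrams):
    """Score w after the history (w2, w1) using the highest order n-gram
    that was actually seen."""
    if trigrams[(w2, w1, w)] > 0:
        return trigrams[(w2, w1, w)] / bigrams[(w2, w1)]
    if bigrams[(w1, w)] > 0:
        return ALPHA_1 * bigrams[(w1, w)] / unigrams[w1]
    return ALPHA_2 * unigrams[w] / sum(unigrams.values())


# Invented training text.
tokens = "I want to eat Chinese food I want to eat lunch".split()
unigrams, bigrams, trigrams = train(tokens)

print(backoff_score("to", "eat", "lunch", unigrams, bigrams, trigrams))       # tri-gram seen
print(backoff_score("lunch", "eat", "Chinese", unigrams, bigrams, trigrams))  # backs off to bi-gram
print(backoff_score("Chinese", "food", "lunch", unigrams, bigrams, trigrams)) # backs off to uni-gram
```

Each fallback level multiplies in one more weight, so evidence from shorter histories always counts for less than evidence from longer ones.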
Definition (senone): A cluster of similar Markov states
Goal: Reduce the trainable units that the recognizer needs to process
Approach:
- HMMs represent sub-phonetic units
- A tree structure combines sub-phonetic units
- Phoneme recognizer searches tree to find HMMs
- Nodes partition with questions about neighbors
Performance:
- Triphones reduce error rate by 15%
- Senones reduce error rate by 24%
Example tree questions: Is left phone sonorant or nasal? Is right a back-R? Is left s, z, sh, zh? Is left a back-L? Is right voiced?