Probabilistic Pronunciation + N-gram Models CMSC 35100 Natural Language Processing April 15, 2003.



The ASR Pronunciation Problem
Given a series of phones, what is the most probable word?
–Simplification: assume the phone sequence and the word boundaries are known
Approach: noisy channel model
–The surface form is an instance of a lexical form that has passed through a noisy communication path
–Model the channel to remove the noise and recover the original

Bayesian Model
Pr(w|O) = Pr(O|w) Pr(w) / Pr(O)
Goal: the most probable word
–The observations O are held constant
–So find the w that maximizes Pr(O|w) * Pr(w)
Where do we get the likelihood Pr(O|w)?
–Probabilistic rules (Labov): add probabilities to pronunciation-variation rules
–Or count over a large corpus of surface forms with respect to the lexicon
Where do we get Pr(w)?
–Similarly: count over words in a large corpus
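
To make the decision rule concrete, here is a minimal Python sketch, assuming we already have the likelihoods Pr(O|w) and priors Pr(w) for a handful of candidate words; the candidates and numbers are made up for illustration, not values from the lecture.

    def most_probable_word(likelihood, prior):
        """Pick the w maximizing Pr(O|w) * Pr(w).
        likelihood: dict word -> Pr(O|w); prior: dict word -> Pr(w)."""
        return max(likelihood, key=lambda w: likelihood[w] * prior[w])

    # Hypothetical values for one observed phone sequence (not from the lecture).
    likelihood = {"about": 0.30, "a bout": 0.25, "abut": 0.05}        # Pr(O|w)
    prior = {"about": 0.0010, "a bout": 0.0001, "abut": 0.00001}      # Pr(w)

    print(most_probable_word(likelihood, prior))                      # -> about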

Weighted Automata
Associate a weight (probability) with each arc
–Determine weights by decision-tree compilation or by counting over a large corpus
[Figure: weighted pronunciation automaton with phone states (ax, ix, ae, b, aw, dx, t) between a start state and an end state; weights computed from the Switchboard corpus]
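
As a rough illustration of the data structure, a weighted automaton can be stored as a map from each state to its outgoing arcs; the sketch below loosely mirrors the pronunciation network on this slide, with hypothetical weights rather than the Switchboard-derived values.

    # Each state maps to a list of (next_state, arc_probability, phone) arcs.
    # Topology loosely follows the slide's figure; weights are hypothetical.
    automaton = {
        "start": [("s1", 0.68, "ax"), ("s1", 0.20, "ix"), ("s1", 0.12, "ae")],
        "s1": [("s2", 1.00, "b")],
        "s2": [("s3", 1.00, "aw")],
        "s3": [("end", 0.88, "t"), ("end", 0.12, "dx")],
    }

    def path_probability(path):
        """Multiply the arc weights along a path of (state, next_state, phone) steps."""
        prob = 1.0
        for state, nxt, phone in path:
            prob *= next(p for (s, p, ph) in automaton[state] if s == nxt and ph == phone)
        return prob

    print(path_probability([("start", "s1", "ax"), ("s1", "s2", "b"),
                            ("s2", "s3", "aw"), ("s3", "end", "t")]))   # 0.68 * 1.0 * 1.0 * 0.88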

Forward Computation
For a weighted automaton and a phoneme sequence, what is its likelihood?
–Automaton: a tuple of
  A set of states Q: q0, …, qn
  Transition probabilities a_ij, where a_ij is the probability of transitioning from state i to state j
  Emission probabilities b_j(o), the probability of observing o in state j
  Special start and end states
–Input: an observation sequence O = o1, o2, …, ok
–Computed as: forward[t, j] = P(o1, o2, …, ot, qt = j | λ) = Σ_i forward[t-1, i] * a_ij * b_j(ot)
–Sums over all paths reaching state j at time t
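
Below is a minimal Python sketch of the forward computation over a small HMM-style automaton, with transition probabilities a[i][j] and emission probabilities b[j][obs]; the states, observations, and numbers are illustrative assumptions.

    def forward_likelihood(obs, states, a, b, start, end):
        """Return P(obs | model): fwd[t][j] = P(o1..ot, qt = j), summed over paths."""
        fwd = [{j: 0.0 for j in states} for _ in obs]
        for j in states:                                  # initialization from the start state
            fwd[0][j] = a[start].get(j, 0.0) * b[j].get(obs[0], 0.0)
        for t in range(1, len(obs)):                      # recursion: sum over predecessors
            for j in states:
                fwd[t][j] = sum(fwd[t - 1][i] * a[i].get(j, 0.0) for i in states) * b[j].get(obs[t], 0.0)
        return sum(fwd[-1][i] * a[i].get(end, 0.0) for i in states)   # transition into end

    # Assumed toy model: two emitting states plus start/end.
    states = ["s1", "s2"]
    a = {"start": {"s1": 0.7, "s2": 0.3},
         "s1": {"s1": 0.5, "s2": 0.4, "end": 0.1},
         "s2": {"s1": 0.2, "s2": 0.6, "end": 0.2}}
    b = {"s1": {"ax": 0.6, "b": 0.4}, "s2": {"ax": 0.1, "b": 0.9}}

    print(forward_likelihood(["ax", "b"], states, a, b, "start", "end"))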

Viterbi Decoding
Given an observation sequence o and a weighted automaton, what is the most likely state sequence?
–Used to identify words by merging multiple word-pronunciation automata in parallel
–Comparable to the forward computation, but with the sum replaced by a max
–Dynamic programming approach: store the max through a given state/time pair

Viterbi Algorithm
function Viterbi(observations of length T, state-graph) returns best-path
  num-states <- num-of-states(state-graph)
  create a path probability matrix viterbi[num-states+2, T+2]
  viterbi[0,0] <- 1.0
  for each time step t from 0 to T do
    for each state s from 0 to num-states do
      for each transition s' from s in state-graph do
        new-score <- viterbi[s,t] * a[s,s'] * b_s'(o_t)
        if ((viterbi[s',t+1] = 0) || (viterbi[s',t+1] < new-score)) then
          viterbi[s',t+1] <- new-score
          back-pointer[s',t+1] <- s
  backtrace from the highest-probability state in the final column of viterbi[] and return the path
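
The pseudocode above translates fairly directly into Python; the sketch below uses the same dict-based model format as the forward sketch, with an assumed toy model rather than lecture data.

    def viterbi(obs, states, a, b, start):
        """Return (best probability, best state sequence) for the observation list."""
        vit = [{} for _ in obs]      # vit[t][s]: best probability of reaching s at time t
        back = [{} for _ in obs]     # back[t][s]: predecessor on that best path
        for s in states:
            vit[0][s] = a[start].get(s, 0.0) * b[s].get(obs[0], 0.0)
            back[0][s] = start
        for t in range(1, len(obs)):
            for s in states:
                best_prev, best_score = None, 0.0
                for p in states:     # the forward computation's sum becomes a max
                    score = vit[t - 1][p] * a[p].get(s, 0.0) * b[s].get(obs[t], 0.0)
                    if score > best_score:
                        best_prev, best_score = p, score
                vit[t][s], back[t][s] = best_score, best_prev
        last = max(states, key=lambda s: vit[-1][s])      # backtrace from the best final state
        path = [last]
        for t in range(len(obs) - 1, 0, -1):
            path.append(back[t][path[-1]])
        return vit[-1][last], list(reversed(path))

    # Assumed toy model, same format as the forward sketch.
    states = ["s1", "s2"]
    a = {"start": {"s1": 0.7, "s2": 0.3}, "s1": {"s1": 0.5, "s2": 0.5}, "s2": {"s1": 0.2, "s2": 0.8}}
    b = {"s1": {"ax": 0.6, "b": 0.4}, "s2": {"ax": 0.1, "b": 0.9}}

    print(viterbi(["ax", "b", "b"], states, a, b, "start"))

Replacing the max with a sum (and dropping the back-pointers) recovers the forward computation from the previous slide.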

Segmentation
Breaking a sequence into chunks
–Sentence segmentation: break long sequences into sentences
–Word segmentation: break character/phonetic sequences into words
  Chinese is typically written without whitespace, and pronunciation is affected by the units
  Language acquisition: how does a child learn language from a stream of phones?

Models of Segmentation
Many approaches:
–Rule-based, heuristic longest match
–Probabilistic:
  Each word is associated with its probability
  Find the sequence with the highest probability
  Typically computed as log probabilities and summed
–Implementation: weighted FST cascade
  Each word = its characters + a probability
  Self-loop over the dictionary
  Compose the input with dict* and compute the most likely path
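
The slide's implementation uses a weighted FST cascade; as a simpler stand-in that performs the same maximization, the sketch below segments a character string with dynamic programming over word log probabilities. The toy dictionary and probabilities are assumptions for illustration.

    import math

    # Assumed toy dictionary: each word carries a log probability.
    word_logprob = {"the": math.log(0.05), "ta": math.log(0.001), "bled": math.log(0.0005),
                    "table": math.log(0.002), "d": math.log(0.0001), "tabled": math.log(0.0002)}

    def segment(text):
        """Return (best log probability, best word sequence) for a character string."""
        best = [(-math.inf, None)] * (len(text) + 1)   # best[i]: best analysis of text[:i]
        best[0] = (0.0, [])
        for i in range(1, len(text) + 1):
            for j in range(max(0, i - 8), i):          # try every word ending at position i
                word = text[j:i]
                if word in word_logprob and best[j][1] is not None:
                    score = best[j][0] + word_logprob[word]
                    if score > best[i][0]:
                        best[i] = (score, best[j][1] + [word])
        return best[len(text)]

    print(segment("thetabled"))   # compares the|tabled, the|table|d, the|ta|bled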

N-grams
Perspective:
–Some sequences (of words/characters) are more likely than others
–Given a sequence, we can guess the most likely next item
Used in:
–Speech recognition
–Spelling correction
–Augmentative communication
–Other NL applications

Corpus Counts
Estimate probabilities from counts over large collections of text/speech
Issues:
–Wordforms (surface) vs. lemmas (roots)
–Case? Punctuation? Disfluencies?
–Types (distinct words) vs. tokens (total occurrences)
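
A small sketch of the type/token distinction and of one normalization choice (lowercasing); the sentence used as a "corpus" is an arbitrary example.

    from collections import Counter

    text = "The cat sat on the mat because the mat was warm"
    tokens = text.lower().split()     # one normalization choice: lowercase, split on whitespace
    counts = Counter(tokens)

    print(len(tokens))                # tokens: total occurrences (11)
    print(len(counts))                # types: distinct wordforms (8)
    print(counts.most_common(3))      # the most frequent wordforms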

Basic N-grams
Most trivial estimate: 1/#tokens – too simple!
Standard unigram: relative frequency
–# occurrences of the word / total corpus size
–E.g., the ≈ 0.07; rabbit is orders of magnitude rarer
–Still too simple: no context!
Better: conditional probabilities of word sequences
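
A unigram model is just relative frequency, as the quick sketch below shows; the toy corpus is an assumption, and the ≈ 0.07 figure for "the" comes from large English corpora, not this example.

    from collections import Counter

    corpus = "the cat sat on the mat and the dog sat by the door".split()   # assumed toy corpus
    counts = Counter(corpus)
    total = len(corpus)

    def p_unigram(w):
        """Relative frequency: count of w divided by the total number of tokens."""
        return counts[w] / total

    print(p_unigram("the"))   # 4/13 here; about 0.07 in large English corpora
    print(p_unigram("cat"))   # 1/13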

Markov Assumptions
Exact computation requires too much data
–Approximate the probability of a word given all prior words by assuming a finite history
–Bigram: probability of a word given the 1 previous word (first-order Markov)
–Trigram: probability of a word given the 2 previous words
N-gram approximation: P(wn | w1 … wn-1) ≈ P(wn | wn-N+1 … wn-1)
Bigram sequence: P(w1 … wn) ≈ Π_k P(wk | wk-1)
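
A minimal sketch of the bigram approximation to a sequence probability; the bigram probabilities below are made-up illustration values, not corpus estimates.

    # Assumed bigram probabilities, including a start symbol <s>.
    bigram_prob = {("<s>", "i"): 0.25, ("i", "want"): 0.32, ("want", "a"): 0.38, ("a", "coke"): 0.03}

    def bigram_sequence_prob(words):
        """P(w1..wn) ~= product over k of P(wk | wk-1), starting from <s>."""
        prob, prev = 1.0, "<s>"
        for w in words:
            prob *= bigram_prob.get((prev, w), 0.0)   # unseen bigrams get probability 0 (no smoothing)
            prev = w
        return prob

    print(bigram_sequence_prob(["i", "want", "a", "coke"]))   # 0.25 * 0.32 * 0.38 * 0.03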

Issues
Relative frequency
–Typically compute the count of the sequence and divide by the count of its prefix
Corpus sensitivity
–Shakespeare vs. the Wall Street Journal yield very different, and often very unnatural, n-grams
–Unigrams capture little; bigrams capture collocations; trigrams approach phrases
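
Relative-frequency (maximum likelihood) estimation of a bigram probability counts the sequence and divides by the count of its prefix; the tiny corpus below is an assumed example, and no smoothing is applied.

    from collections import Counter

    corpus = "i want a coke . i want a sandwich . i like a coke .".split()   # assumed toy corpus

    unigram_counts = Counter(corpus)
    bigram_counts = Counter(zip(corpus, corpus[1:]))

    def p_bigram(w_prev, w):
        """P(w | w_prev) = count(w_prev, w) / count(w_prev)."""
        return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

    print(p_bigram("i", "want"))   # 2/3: "i want" occurs twice, "i" three times
    print(p_bigram("a", "coke"))   # 2/3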