Hidden Markov Models John Goldsmith

Markov model A Markov model is a probabilistic model of symbol sequences in which the probability of the current event is conditioned only on the previous event.

Symbol sequences Consider a sequence of random variables X_1, X_2, …, X_N. Think of the subscripts as indicating word position in a sentence. Remember that a random variable is a function, and in this case its range is the vocabulary of the language. The size (probability mass) of the "pre-image" that maps to a given word w is the probability assigned to w.

What is the probability of a sequence of words w_1 … w_t? This is P(X_1 = w_1 and X_2 = w_2 and … X_t = w_t). The fact that the subscript "1" appears on both the X and the w in "X_1 = w_1" is a bit abusive of notation, since the subscript on X marks sentence position while the subscript on w picks out a word of the vocabulary. It might be better to index the words independently of position, e.g. P(X_1 = w_{i_1}, X_2 = w_{i_2}, …, X_t = w_{i_t}).

By definition, P(X_1 = w_1, …, X_t = w_t) = P(X_1 = w_1, …, X_{t-1} = w_{t-1}) · P(X_t = w_t | X_1 = w_1, …, X_{t-1} = w_{t-1}). This says less than it appears to; it's just a way of talking about the word "and" and the definition of conditional probability.

We can carry this out all the way down: P(w_1 … w_t) = P(X_1 = w_1) · P(X_2 = w_2 | X_1 = w_1) · … · P(X_t = w_t | X_1 = w_1, …, X_{t-1} = w_{t-1}). This says that every word is conditioned on all the words preceding it.

The Markov assumption P(X_{t+1} = w | X_1, …, X_t) = P(X_{t+1} = w | X_t): each word is conditioned only on the word immediately before it. What a sorry assumption about language! Manning and Schütze call this the "limited horizon" property of the model.

Stationary model There's also an additional assumption that the parameters don't change "over time": for all (appropriate) t and k, P(X_{t+1} = w | X_t) = P(X_{t+k+1} = w | X_{t+k}).

[Diagram: a word-transition network over the words "the", "old", "big", "dog", "cat", "just", "died", "appeared", with a probability on each arc.] P( "the big dog just died" ) = 0.4 * 0.6 * 0.2 * 0.5
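Reading the four factors off the diagram as the bigram transition probabilities along that path (an assumption here, since the picture itself did not survive), the product works out to:

\[
P(\text{the big dog just died}) \;=\; P(\text{big}\mid\text{the})\, P(\text{dog}\mid\text{big})\, P(\text{just}\mid\text{dog})\, P(\text{died}\mid\text{just}) \;=\; 0.4 \times 0.6 \times 0.2 \times 0.5 \;=\; 0.024 .
\]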

Prob ( Sequence ) For a Markov model, then, P(w_1 … w_t) = P(X_1 = w_1) · P(X_2 = w_2 | X_1 = w_1) · … · P(X_t = w_t | X_{t-1} = w_{t-1}).

Hidden Markov model An HMM is a non-deterministic Markov model, that is, one where knowledge of the emitted symbol does not determine the state transition. This means that in order to determine the probability of a given string, we must take more than one path through the states into account.

Relating emitted symbols to HMM architecture There are two ways: 1. State-emission HMM (Moore machine): a set of probabilities assigned to the vocabulary in each state. 2. Arc-emission HMM (Mealy machine): a set of probabilities assigned to the vocabulary for each state-to-state transition. (More parameters.)

State emission [Diagram: a state-emission network; the emission probabilities are attached to the states, some with p(a) = 0.2, p(b) = 0.7, … and others with p(a) = 0.7, p(b) = 0.2, ….]

Arc-emission (Mealy) [Diagram: an arc-emission network; the emission probabilities are attached to the transitions, e.g. p(a) = .03, p(b) = .105 on one arc, p(a) = .17, p(b) = .595 on another, p(a) = .525, p(b) = .15 on another, and p(a) = .175, p(b) = .05 on another. The emission probabilities on the arcs leaving each state sum to 1.0.]

Definition A set of states S = {s_1, …, s_N}; an output alphabet K = {k_1, …, k_M}; initial state probabilities; state transition probabilities; symbol emission probabilities; a state sequence; an output sequence.
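The symbols for the last five items did not come through in the transcript; in the Manning and Schütze notation that this lecture otherwise follows, they are usually written as:

\[
\Pi = \{\pi_i\}, \qquad A = \{a_{ij}\}, \qquad B = \{b_i(k)\}\ \text{(state emission) or}\ \{b_{ij}(k)\}\ \text{(arc emission)},
\]
\[
X = (X_1, \ldots, X_{T+1}), \qquad O = (o_1, \ldots, o_T), \quad o_t \in K .
\]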

Follow “ab” through the HMM Using the state emission model:

State to state transition probability [Table: transition probabilities from {State 1, State 2} to {State 1, State 2}; the numeric entries appear as the transition factors 0.15, 0.25, 0.75 and 0.85 in the calculation below.]

State-emission symbol probabilities State 1: pr(a) = 0.2, pr(b) = 0.7, pr(c) = 0.1. State 2: pr(a) = 0.7, pr(b) = 0.2, pr(c) = 0.1.
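For concreteness, here is a minimal sketch of this two-state example in Python. The emission probabilities are the ones in the table above; the transition matrix and the uniform start distribution are illustrative assumptions, since the transition table did not survive legibly.

```python
# Minimal encoding of the two-state, state-emission HMM used in the example.
# Emission probabilities come from the table above; the transition values and
# the uniform start distribution are assumptions made for illustration.

states = [1, 2]

start = {1: 0.5, 2: 0.5}                      # initial state distribution (pi)

trans = {                                     # trans[i][j] = P(next = j | current = i)
    1: {1: 0.15, 2: 0.85},                    # assumed values
    2: {1: 0.25, 2: 0.75},
}

emit = {                                      # emit[i][k] = P(emit symbol k | state = i)
    1: {'a': 0.2, 'b': 0.7, 'c': 0.1},
    2: {'a': 0.7, 'b': 0.2, 'c': 0.1},
}
```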

[Diagram slides: the two-state network above, traced for the string "ab", with Start assigning probability 0.5 to each state. Step 1: from each starting state, multiply the start probability by the probability of emitting "a" there (0.2 in State 1, 0.7 in State 2) and by the probability of the transition taken, giving terms such as ½ * 0.2 * 0.15, ½ * 0.7 * 0.25, ½ * 0.7 * 0.75, and ½ * 0.2 * 0.85; summing the terms arriving at a given state gives the probability of having produced "a" and now being in that state (the diagram shows approximately 0.278 at one of the states). Step 2: multiply each of those totals by the probability of emitting "b" in that state (0.7 or 0.2) to obtain pr(produce "ab" & this state) for each state.]

What's the probability of "ab"? Answer: the sum, over the states, of the probabilities of all the ways of generating "ab". This is the "forward" probability calculation.
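A minimal sketch of that forward calculation in Python, using the state-emission convention of the worked example (the symbol at time t is emitted by the state occupied at time t) and the illustrative model dictionaries defined earlier:

```python
def forward_probability(observations, states, start, trans, emit):
    """P(observations | model): sum over all state paths."""
    # alpha[i] = P(symbols generated so far, and currently in state i)
    alpha = dict(start)                        # before anything is emitted
    for symbol in observations:
        new_alpha = {j: 0.0 for j in states}
        for i in states:
            for j in states:
                # emit `symbol` from state i, then take the transition i -> j
                new_alpha[j] += alpha[i] * emit[i][symbol] * trans[i][j]
        alpha = new_alpha
    return sum(alpha.values())

print(forward_probability("ab", states, start, trans, emit))   # about 0.141 with the assumed numbers
```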

That’s the basic idea of an HMM Three questions: 1. Given a model, how do we compute the probability of an observation sequence? 2. Given a model, how do we find the best state sequence? 3. Given a corpus and a parameterized model, how do we find the parameters that maximize the probability of the corpus?

Probability of a sequence Using the notation we’ve used: Initialization: we have a distribution of probabilities of being in the states initially, before any symbol has been emitted. Assign a distribution to the set of initial states; these are π(i), where i varies from 1 to N, the number of states.

We’re going to focus on a variable called the forward probability, denoted α. α_i(t) is the probability of being at state s_i at time t, given that o_1, …, o_{t-1} were generated.

Induction step: Probability at state i in previous “loop” Transition from state i to this state, state j Probability of emitting the right word during that particular transition. (Having 2 arguments here is what makes it state-emission.)
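The equation itself is missing from the transcript; taking the emitting state to be the source state i, as in the worked example above (this is the assumption behind the sketch), the induction step reads:

\[
\alpha_j(t+1) \;=\; \sum_{i=1}^{N} \alpha_i(t)\; a_{ij}\; b_i(o_t) .
\]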

Side note on arc-emission: induction stage Probability at state i in previous “loop” Transition from state i to this state, state j Probability of emitting the right word during that particular transition
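For the arc-emission variant, where the emission probability carries both endpoints of the transition, the corresponding step is usually written:

\[
\alpha_j(t+1) \;=\; \sum_{i=1}^{N} \alpha_i(t)\; a_{ij}\; b_{ij}(o_t) .
\]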

Forward probability So by calculating α, the forward probability, we calculate the probability of being in a particular state at time t after having "correctly" generated the symbols up to that point.

The final probability of the observation is P(O) = Σ_{i=1..N} α_i(T+1), the sum of the forward probabilities over all states once the whole string has been emitted.

We want to do the same thing from the end: the backward probability β. β_i(t) is the probability of generating the symbols from o_t to o_T, starting out from state i at time t.

Initialization (this is different than Forward…) Induction Total
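The formulas for these three pieces did not survive; with the definitions above, and the same state-emission convention as in the sketches, they would look like:

\[
\text{Initialization: } \beta_i(T+1) = 1 \ \text{ for all } i \quad\text{(unlike the forward case, which starts from } \pi\text{)},
\]
\[
\text{Induction: } \beta_i(t) \;=\; b_i(o_t) \sum_{j=1}^{N} a_{ij}\, \beta_j(t+1),
\qquad
\text{Total: } P(O) \;=\; \sum_{i=1}^{N} \pi_i\, \beta_i(1) .
\]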

Probability of the corpus: P(O) = Σ_{i=1..N} α_i(t) β_i(t), for any time t; the forward and backward probabilities at the same point multiply to give the probability of the whole observation.

Again: finding the best path to generate the data: Viterbi Dr. Andrew Viterbi received his B.S. and M.S. from MIT in 1957 and his Ph.D. from the University of Southern California in 1962. He began his career at the California Institute of Technology's Jet Propulsion Laboratory. In 1968, he co-founded LINKABIT Corporation and, in 1985, QUALCOMM, Inc., now a leader in digital wireless communications and products based on CDMA technologies. He served as a professor at UCLA and UC San Diego, where he is now a professor emeritus. Dr. Viterbi is currently president of the Viterbi Group, LLC, which advises and invests in startup companies in communication, network, and imaging technologies. He also recently accepted a position teaching at USC's newly named Andrew and Erna Viterbi School of Engineering.

Viterbi Goal: find the single most likely state sequence for the observed output, arg max over state sequences X of P(X | O). We calculate a variable to keep track of the "best" path that generates the first t-1 symbols and ends in state j.

Viterbi Initialization Induction Backtrace/memo: Termination
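The four formulas are missing; writing the path variable as δ_j(t), the probability of the best way of generating the first t-1 symbols and ending in state j, and the backtrace memo as ψ, a standard statement under the same state-emission convention is:

\[
\text{Initialization: } \delta_j(1) = \pi_j, \qquad
\text{Induction: } \delta_j(t+1) = \max_{1 \le i \le N} \delta_i(t)\, a_{ij}\, b_i(o_t),
\]
\[
\text{Backtrace/memo: } \psi_j(t+1) = \arg\max_{1 \le i \le N} \delta_i(t)\, a_{ij}\, b_i(o_t),
\]
\[
\text{Termination: } \hat{X}_{T+1} = \arg\max_{1 \le j \le N} \delta_j(T+1), \qquad \hat{X}_t = \psi_{\hat{X}_{t+1}}(t+1) .
\]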

Next step is the difficult one We want to start understanding how you can set (“estimate”) parameters automatically from a corpus. The problem is that you need to learn the probability parameters, and probability parameters are best learned by counting frequencies. So in theory we’d like to see how often you make each of the transitions in the graph.

Central idea We’ll take every word in the corpus, and when it’s the i-th word, we’ll divide its count of 1.0 over all the transitions that it could have made in the network, weighting the pieces by the probability that it took that transition. AND: the probability that a particular transition occurred is calculated by weighting the transition by the probability of the entire path (it’s unique, right?), from beginning to end, that includes it.

Thus: if we can do this, probabilities give us (= have just given us) counts of transitions. We sum these transition counts over our whole large corpus, and use those counts to generate new probabilities for each parameter (maximum likelihood parameters).

Here’s the trick: [Diagram: the word "w" at some position in the utterance S[0…n]; on one side, the probabilities of reaching each state (from Forward); on the other, the probabilities of finishing the utterance from each state (from Backward); each line between them represents a transition emitting the word w.] Probability of a transition line = prob(starting state) * prob(emitting w) * prob(ending state).

Probability of a transition, given the data: we don’t need to keep expanding the denominator; we are doing that just to make clear how the numerator relates to the denominator conceptually.
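The formula this refers to is missing; in terms of the forward and backward probabilities defined above (same state-emission convention), the probability of the transition from i to j at time t, given the observed data, is:

\[
P(X_t = i,\, X_{t+1} = j \mid O)
\;=\; \frac{\alpha_i(t)\, b_i(o_t)\, a_{ij}\, \beta_j(t+1)}{P(O)}
\;=\; \frac{\alpha_i(t)\, b_i(o_t)\, a_{ij}\, \beta_j(t+1)}{\sum_{k=1}^{N} \alpha_k(t)\, \beta_k(t)} .
\]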

Now we just sum over all of our observations, summing over the to-states and over the whole corpus:
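Concretely (the notation here is added, not the slide’s): the soft count of the transition from i to j emitting the word w, the soft count of that transition overall, and the soft count of leaving state i at all are

\[
c(i \to j, w) \;=\; \sum_{t\,:\,o_t = w} P(X_t = i, X_{t+1} = j \mid O),
\qquad
c(i \to j) \;=\; \sum_{t} P(X_t = i, X_{t+1} = j \mid O),
\]
\[
c(i) \;=\; \sum_{t} \sum_{j=1}^{N} P(X_t = i, X_{t+1} = j \mid O) .
\]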

That’s the basics of the first (hard) half of the algorithm This training is a special case of the Expectation-Maximization (EM) algorithm; we’ve just done the “expectation” half, which creates a set of “virtual” or soft counts – these are turned into model parameters (or probabilities) in the second part, the “maximization” half.

Maximization Let’s assume that there were N-1 transitions in the path through the network, and that we have no knowledge of where sentences start (etc.). Then the probability of each state s_i is the number of transitions that went from s_i to any state, divided by N-1. The probability of a state transition a_ij is the number of transitions from state i to state j, divided by the number of transitions out of state i.

and the probability of making the transition from i to j and emitting word w is: the number of transitions from i to j that emitted word w, divided by the total number of transitions from i to j.

More exactly…
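The exact formulas are missing from the transcript; in terms of the soft counts above, the maximization step just described would set

\[
\hat{a}_{ij} \;=\; \frac{c(i \to j)}{c(i)},
\qquad
\hat{b}_{ij}(w) \;=\; \frac{c(i \to j, w)}{c(i \to j)} .
\]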

So that’s the big idea. Application: to speech recognition. Create an HMM for each word in the lexicon, and use it to calculate, for a given input sound P and word w_i, the probability of P. The word whose model gives the highest score wins. Part of speech tagging: in two weeks.

Speech HMMs in the classical (discrete) speech context "emit" or "accept" symbols chosen from a "codebook" consisting of 256 spectra, in effect time-slices of a spectrogram. Every 5 or 10 msec, we take a spectral slice, decide which page of the codebook it most resembles, and encode the continuous sound event as a sequence of 100 or 200 symbols per second. (There are alternatives to this.)

Speech The HMMs are then asked to generate the symbolic sequences produced in that way. Each word model can assign a probability to a given sequence of these symbols.

Speech Speech models of words are generally (and roughly) along these lines: The HMM for “dog” /D AW1 G/ is three successive phoneme models. Each phoneme model is actually a phoneme-in-context model: a D after # followed by AW1, an AW1 model after D and before G, etc.

Each phoneme model is made up of 3, 4, or 5 states; associated with each state is a distribution over all the time-slice symbols.
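As a rough illustration only (not the actual recognizer code), a word model of this kind can be thought of as a left-to-right chain built by concatenating per-phone models; the state count and self-loop probability below are assumptions made for the sketch.

```python
# Hypothetical sketch: a left-to-right HMM topology for a word, built by
# concatenating 3-state phone-in-context models. All numbers are illustrative.

def word_topology(phones, states_per_phone=3, self_loop=0.6):
    """Return (state_names, transitions) for a left-to-right word HMM."""
    names = [f"{p}.{k}" for p in phones for k in range(states_per_phone)]
    transitions = {}
    for idx, name in enumerate(names):
        transitions[name] = {name: self_loop}                 # stay in this state
        if idx + 1 < len(names):
            transitions[name][names[idx + 1]] = 1.0 - self_loop   # move to the next state
    return names, transitions

# "dog" = /D AW1 G/: three successive phone models, each with three states;
# each state would also carry a distribution over the codebook symbols.
states_dog, trans_dog = word_topology(["D", "AW1", "G"])
```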

From …speech/software/tutorials/monthly/2002_05/