Lecture 15 Hidden Markov Models Dr. Jianjun Hu mleg.cse.sc.edu/edu/csce833 CSCE833 Machine Learning University of South Carolina Department of Computer Science and Engineering

Outline: Evaluation problem of HMM; Decoding problem of HMM; Learning problem of HMM (a multi-million dollar algorithm: the Viterbi algorithm).

Main issues using HMMs:
Evaluation problem. Given the HMM M = (A, B, π) and the observation sequence O = o_1 o_2 ... o_K, calculate the probability that model M has generated sequence O.
Decoding problem. Given the HMM M = (A, B, π) and the observation sequence O = o_1 o_2 ... o_K, calculate the most likely sequence of hidden states s_i that produced this observation sequence O.
Learning problem. Given some training observation sequences O = o_1 o_2 ... o_K and the general structure of the HMM (numbers of hidden and visible states), determine the HMM parameters M = (A, B, π) that best fit the training data.
Here O = o_1 ... o_K denotes a sequence of observations with o_k ∈ {v_1, ..., v_M}.
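
As a concrete reference for this notation, here is a minimal sketch (Python/NumPy, not part of the original slides) of how a model M = (A, B, π) and an observation sequence O might be represented; the two-state, three-symbol parameters and all variable names are illustrative assumptions that the later sketches reuse.

```python
import numpy as np

# Hypothetical 2-state, 3-symbol HMM used throughout these sketches.
# A[i, j] = P(q_{k+1} = s_j | q_k = s_i)   (transition probabilities)
# B[i, m] = P(o_k = v_m | q_k = s_i)       (observation probabilities)
# pi[i]   = P(q_1 = s_i)                   (initial state probabilities)
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])

# An observation sequence O = o_1 ... o_K, encoded as indices into {v_1, ..., v_M}.
O = [0, 1, 2, 2, 1]
```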

Word recognition example (1). Typed word recognition; assume all characters are separated. A character recognizer outputs the probability of an image being a particular character, P(image | character). [Figure: the hidden states are the characters a, b, c, ..., z; the observations are the character images.]

Word recognition example (2). If a lexicon is given, we can construct a separate HMM model for each lexicon word, e.g. "Amherst" (states a-m-h-e-r-s-t) and "Buffalo" (states b-u-f-f-a-l-o). Recognition of a word image is then equivalent to the problem of evaluating a few HMM models. This is an application of the Evaluation problem.

Word recognition example (3). We can also construct a single HMM for all words. The hidden states are all characters in the alphabet; transition probabilities and initial probabilities are calculated from a language model; observations and observation probabilities are as before. [Figure: a fragment of the shared character-level state graph.] Here we have to determine the best sequence of hidden states, the one that most likely produced the word image. This is an application of the Decoding problem.

Evaluation problem. Given the HMM M = (A, B, π) and the observation sequence O = o_1 o_2 ... o_K, calculate the probability that model M has generated sequence O. Trying to find the probability of observations O = o_1 o_2 ... o_K by means of considering all hidden state sequences (as was done in the example) is impractical: N^K hidden state sequences, i.e. exponential complexity. Use the forward-backward HMM algorithms for efficient calculation (dynamic programming!). Define the forward variable α_k(i) as the joint probability of the partial observation sequence o_1 o_2 ... o_k and the hidden state at time k being s_i: α_k(i) = P(o_1 o_2 ... o_k, q_k = s_i).

Trellis representation of an HMM. [Figure: the states s_1, s_2, ..., s_i, ..., s_N are replicated at times 1, ..., k, k+1, ..., K; transitions a_1j, a_2j, ..., a_ij, ..., a_Nj lead from the states at time k into state s_j at time k+1; the observations o_1 ... o_k o_{k+1} ... o_K run along the time axis.]

Forward recursion for HMM.
Initialization: α_1(i) = P(o_1, q_1 = s_i) = π_i b_i(o_1), 1 ≤ i ≤ N.
Forward recursion: α_{k+1}(j) = P(o_1 o_2 ... o_{k+1}, q_{k+1} = s_j) = Σ_i P(o_1 o_2 ... o_{k+1}, q_k = s_i, q_{k+1} = s_j) = Σ_i P(o_1 o_2 ... o_k, q_k = s_i) a_ij b_j(o_{k+1}) = [Σ_i α_k(i) a_ij] b_j(o_{k+1}), 1 ≤ j ≤ N, 1 ≤ k ≤ K-1.
Termination: P(o_1 o_2 ... o_K) = Σ_i P(o_1 o_2 ... o_K, q_K = s_i) = Σ_i α_K(i).
Complexity: N^2 K operations.
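
A minimal NumPy sketch of this forward recursion; the function name and the toy parameters are assumptions, not part of the lecture.

```python
import numpy as np

def forward(A, B, pi, O):
    """Return alpha, where alpha[k, i] corresponds to α_{k+1}(i) (0-based time index k)."""
    K, N = len(O), len(pi)
    alpha = np.zeros((K, N))
    alpha[0] = pi * B[:, O[0]]                       # initialization: α_1(i) = π_i b_i(o_1)
    for k in range(K - 1):
        # α_{k+1}(j) = [Σ_i α_k(i) a_ij] b_j(o_{k+1})
        alpha[k + 1] = (alpha[k] @ A) * B[:, O[k + 1]]
    return alpha

# Illustrative 2-state, 3-symbol model (assumed, not from the lecture).
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
O = [0, 1, 2, 2, 1]

alpha = forward(A, B, pi, O)
print("P(O) =", alpha[-1].sum())                     # termination: Σ_i α_K(i)
```

The product alpha[k] @ A computes Σ_i α_k(i) a_ij for every j at once, which is the N^2 work done at each of the K time steps.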

Trellis representation of an HMM. [Figure: the same trellis as before, but now the transitions a_j1, a_j2, ..., a_ji, ..., a_jN lead from state s_j at time k to the states at time k+1, as used in the backward recursion; the observations o_1 ... o_k o_{k+1} ... o_K run along the time axis.]

Backward recursion for HMM.
Define the backward variable β_k(i) as the conditional probability of the partial observation sequence o_{k+1} o_{k+2} ... o_K given that the hidden state at time k is s_i: β_k(i) = P(o_{k+1} o_{k+2} ... o_K | q_k = s_i).
Initialization: β_K(i) = 1, 1 ≤ i ≤ N.
Backward recursion: β_k(j) = P(o_{k+1} o_{k+2} ... o_K | q_k = s_j) = Σ_i P(o_{k+1} o_{k+2} ... o_K, q_{k+1} = s_i | q_k = s_j) = Σ_i P(o_{k+2} o_{k+3} ... o_K | q_{k+1} = s_i) a_ji b_i(o_{k+1}) = Σ_i β_{k+1}(i) a_ji b_i(o_{k+1}), 1 ≤ j ≤ N, 1 ≤ k ≤ K-1.
Termination: P(o_1 o_2 ... o_K) = Σ_i P(o_1 o_2 ... o_K, q_1 = s_i) = Σ_i P(o_1 o_2 ... o_K | q_1 = s_i) P(q_1 = s_i) = Σ_i β_1(i) b_i(o_1) π_i.
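
A matching sketch of the backward recursion under the same assumed toy model; its termination value should agree with the forward result.

```python
import numpy as np

def backward(A, B, pi, O):
    """Return beta, where beta[k, i] corresponds to β_{k+1}(i) (0-based time index k)."""
    K, N = len(O), len(pi)
    beta = np.zeros((K, N))
    beta[-1] = 1.0                                    # initialization: β_K(i) = 1
    for k in range(K - 2, -1, -1):
        # β_k(j) = Σ_i a_ji b_i(o_{k+1}) β_{k+1}(i)
        beta[k] = A @ (B[:, O[k + 1]] * beta[k + 1])
    return beta

# Illustrative model (assumed); same toy parameters as the forward sketch.
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
O = [0, 1, 2, 2, 1]

beta = backward(A, B, pi, O)
print("P(O) =", (pi * B[:, O[0]] * beta[0]).sum())    # termination: Σ_i π_i b_i(o_1) β_1(i)
```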

Decoding problem. Given the HMM M = (A, B, π) and the observation sequence O = o_1 o_2 ... o_K, calculate the most likely sequence of hidden states s_i that produced this observation sequence. We want to find the state sequence Q = q_1 ... q_K which maximizes P(Q | o_1 o_2 ... o_K), or equivalently P(Q, o_1 o_2 ... o_K). Brute-force consideration of all paths takes exponential time; use the efficient Viterbi algorithm instead. Define the variable δ_k(i) as the maximum probability of producing the observation sequence o_1 o_2 ... o_k when moving along any hidden state sequence q_1 ... q_{k-1} and getting into q_k = s_i: δ_k(i) = max P(q_1 ... q_{k-1}, q_k = s_i, o_1 o_2 ... o_k), where the max is taken over all possible paths q_1 ... q_{k-1}.

Andrew Viterbi. In 1967 he invented the Viterbi algorithm, which he used for decoding convolutionally encoded data. It is still widely used in cellular phones for error-correcting codes, as well as for speech recognition, DNA analysis, and many other applications of Hidden Markov Models. In September 2008, he was awarded the National Medal of Science for developing "the 'Viterbi algorithm,' and for his contributions to Code Division Multiple Access (CDMA) wireless technology that transformed the theory and practice of digital communications."

Viterbi algorithm (1). General idea: if the best path ending in q_k = s_j goes through q_{k-1} = s_i, then it must coincide with the best path ending in q_{k-1} = s_i. [Figure: states s_1, ..., s_i, ..., s_N at time k-1 connect to state s_j at time k via transitions a_1j, ..., a_ij, ..., a_Nj.]
δ_k(j) = max P(q_1 ... q_{k-1}, q_k = s_j, o_1 o_2 ... o_k) = max_i [a_ij b_j(o_k) max P(q_1 ... q_{k-1} = s_i, o_1 o_2 ... o_{k-1})]
To backtrack the best path, keep the information that the predecessor of s_j was s_i.

Viterbi algorithm (2).
Initialization: δ_1(i) = max P(q_1 = s_i, o_1) = π_i b_i(o_1), 1 ≤ i ≤ N.
Forward recursion: δ_k(j) = max P(q_1 ... q_{k-1}, q_k = s_j, o_1 o_2 ... o_k) = max_i [a_ij b_j(o_k) max P(q_1 ... q_{k-1} = s_i, o_1 o_2 ... o_{k-1})] = max_i [a_ij b_j(o_k) δ_{k-1}(i)], 1 ≤ j ≤ N, 2 ≤ k ≤ K.
Termination: choose the best path ending at time K, max_i [δ_K(i)], and backtrack the best path.
This algorithm is similar to the forward recursion of the evaluation problem, with Σ replaced by max and additional backtracking.
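
A sketch of this recursion with backtracking on the assumed toy model; delta holds the δ_k(j) values and psi the stored predecessors.

```python
import numpy as np

def viterbi(A, B, pi, O):
    """Return the most likely state sequence for O and its joint probability with O."""
    K, N = len(O), len(pi)
    delta = np.zeros((K, N))                        # delta[k, j]: best path probability ending in s_j at time k
    psi = np.zeros((K, N), dtype=int)               # psi[k, j]: predecessor of s_j on that best path
    delta[0] = pi * B[:, O[0]]                      # initialization: δ_1(i) = π_i b_i(o_1)
    for k in range(1, K):
        scores = delta[k - 1][:, None] * A          # scores[i, j] = δ_{k-1}(i) a_ij
        psi[k] = scores.argmax(axis=0)
        delta[k] = scores.max(axis=0) * B[:, O[k]]  # δ_k(j) = max_i [δ_{k-1}(i) a_ij] b_j(o_k)
    path = np.zeros(K, dtype=int)                   # termination and backtracking
    path[-1] = delta[-1].argmax()
    for k in range(K - 1, 0, -1):
        path[k - 1] = psi[k, path[k]]
    return path, delta[-1].max()

# Illustrative model (assumed); same toy parameters as before.
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
path, p = viterbi(A, B, pi, [0, 1, 2, 2, 1])
print("best path:", path, "with probability", p)
```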

Learning problem. Given some training observation sequences O = o_1 o_2 ... o_K and the general structure of the HMM (numbers of hidden and visible states), determine the HMM parameters M = (A, B, π) that best fit the training data, that is, that maximize P(O | M). There is no known algorithm guaranteed to produce the globally optimal parameter values.

Learning problem (2). If the training data contains the sequence of hidden states (as in the word recognition example), then use maximum likelihood estimation of the parameters:
a_ij = P(s_j | s_i) = (number of transitions from state s_i to state s_j) / (number of transitions out of state s_i)
b_i(v_m) = P(v_m | s_i) = (number of times observation v_m occurs in state s_i) / (number of times in state s_i)
When the state labels are not given, the state sequence can be generated by the Viterbi algorithm (see Viterbi training below).
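
A sketch of this counting-based estimation for the case where the state paths are given; the function name and the tiny training set below are invented purely for illustration.

```python
import numpy as np

def supervised_mle(states, observations, N, M):
    """Estimate A, B, pi by counting, given aligned state paths and observation sequences."""
    A_counts = np.zeros((N, N))
    B_counts = np.zeros((N, M))
    pi_counts = np.zeros(N)
    for q, o in zip(states, observations):          # one (state path, observation sequence) pair at a time
        pi_counts[q[0]] += 1
        for k in range(len(q) - 1):
            A_counts[q[k], q[k + 1]] += 1           # transition q_k -> q_{k+1}
        for k in range(len(q)):
            B_counts[q[k], o[k]] += 1               # observation o_k emitted in state q_k
    A = A_counts / A_counts.sum(axis=1, keepdims=True)   # normalize rows: transitions out of s_i
    B = B_counts / B_counts.sum(axis=1, keepdims=True)   # normalize rows: times in state s_i
    pi = pi_counts / pi_counts.sum()
    return A, B, pi

# Invented toy training data: 2 states, 3 symbols.
states = [[0, 0, 1, 1, 1], [0, 1, 1, 0, 0]]
observations = [[0, 1, 2, 2, 1], [0, 2, 1, 0, 0]]
A, B, pi = supervised_mle(states, observations, N=2, M=3)
print(A, B, pi, sep="\n")
```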

Example: Training Unambiguous Models.
Training sequence: CGATATTCGATTCTACGCGCGTATACTAGCTTATCTGATC
Transition counts (from state, to state):
from 0:  to 0: 0 (0%)    to 1: 1 (100%)   to 2: 0 (0%)
from 1:  to 0: 1 (4%)    to 1: 21 (84%)   to 2: 3 (12%)
from 2:  to 0: 0 (0%)    to 1: 3 (20%)    to 2: 12 (80%)
Emission counts (symbol, in state):
state 1:  A: 6 (24%)   C: 7 (28%)   G: 5 (20%)   T: 7 (28%)
state 2:  A: 3 (20%)   C: 2 (13%)   G: 7 (47%)   T: …

HMM Training: Ambiguous Models. The problem: we have training sequences, but not the associated paths (state labels). Two solutions:
1. Viterbi training: start with random HMM parameters; use Viterbi to find the most probable path for each training sequence, and then label the sequence with that path; use labeled-sequence training on the resulting set of sequences and paths.
2. Baum-Welch training: sum over all possible paths (rather than the single most probable one) to estimate the expected counts A_{i,j} and E_{i,k}; then use the same formulas as for labeled-sequence training on these expected counts.

Viterbi Training. [Flowchart: starting from an initial submodel M, iterate over the training sequences; the Viterbi decoder finds the most probable path for each sequence, the sequence labeler labels the sequence with that path, and the labeled-sequence trainer produces a new submodel M from the labeled features; repeat n times to obtain the final submodel M*.]
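
A sketch of that Viterbi training loop, assuming the viterbi() and supervised_mle() functions from the earlier sketches are in scope; the random initialization and the fixed iteration count are arbitrary illustrative choices.

```python
import numpy as np

def viterbi_training(observations, N, M, n_iter=10, seed=0):
    """Viterbi training sketch: decode each sequence with the current model, relabel, re-count.
    Assumes viterbi() and supervised_mle() from the earlier sketches are defined."""
    rng = np.random.default_rng(seed)
    # Start with random, row-normalized parameters.
    A = rng.random((N, N)); A /= A.sum(axis=1, keepdims=True)
    B = rng.random((N, M)); B /= B.sum(axis=1, keepdims=True)
    pi = rng.random(N); pi /= pi.sum()
    for _ in range(n_iter):
        # Find the most probable path for each training sequence under the current model.
        paths = [viterbi(A, B, pi, O)[0] for O in observations]
        # Labeled-sequence training on the (path, sequence) pairs.
        # (In practice pseudocounts would be added so states missing from the paths do not yield zero rows.)
        A, B, pi = supervised_mle(paths, observations, N, M)
    return A, B, pi
```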

Baum-Welch algorithm. General idea:
a_ij = P(s_j | s_i) = (expected number of transitions from state s_i to state s_j) / (expected number of transitions out of state s_i)
b_i(v_m) = P(v_m | s_i) = (expected number of times observation v_m occurs in state s_i) / (expected number of times in state s_i)
π_i = P(s_i) = expected frequency in state s_i at time k = 1.

Trellis representation of an HMM. [Figure: the forward variable α_k(i) at time k and the backward variable β_{k+1}(j) at time k+1, linked by the transition a_ij; the observations o_1 ... o_k o_{k+1} ... o_K run along the time axis.]

Baum-Welch algorithm: expectation step (1).
Define the variable ξ_k(i,j) as the probability of being in state s_i at time k and in state s_j at time k+1, given the observation sequence o_1 o_2 ... o_K:
ξ_k(i,j) = P(q_k = s_i, q_{k+1} = s_j | o_1 o_2 ... o_K)
ξ_k(i,j) = P(q_k = s_i, q_{k+1} = s_j, o_1 o_2 ... o_K) / P(o_1 o_2 ... o_K)
= P(q_k = s_i, o_1 o_2 ... o_k) a_ij b_j(o_{k+1}) P(o_{k+2} ... o_K | q_{k+1} = s_j) / P(o_1 o_2 ... o_K)
= α_k(i) a_ij b_j(o_{k+1}) β_{k+1}(j) / [Σ_i Σ_j α_k(i) a_ij b_j(o_{k+1}) β_{k+1}(j)]

Baum-Welch algorithm: expectation step (2).
Define the variable γ_k(i) as the probability of being in state s_i at time k, given the observation sequence o_1 o_2 ... o_K:
γ_k(i) = P(q_k = s_i | o_1 o_2 ... o_K)
γ_k(i) = P(q_k = s_i, o_1 o_2 ... o_K) / P(o_1 o_2 ... o_K) = α_k(i) β_k(i) / [Σ_i α_k(i) β_k(i)]
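
A sketch of this expectation step, computing γ and ξ from the forward and backward variables; it assumes the forward() and backward() sketches defined earlier and uses 0-based time indices.

```python
import numpy as np

def e_step(A, B, pi, O):
    """Compute gamma[k, i] = P(q_k = s_i | O) and xi[k, i, j] = P(q_k = s_i, q_{k+1} = s_j | O).
    Assumes forward() and backward() from the earlier sketches are defined."""
    alpha = forward(A, B, pi, O)
    beta = backward(A, B, pi, O)
    prob_O = alpha[-1].sum()                        # P(O) from the forward termination step
    K, N = len(O), len(pi)
    # ξ_k(i, j) = α_k(i) a_ij b_j(o_{k+1}) β_{k+1}(j) / P(O)
    xi = np.zeros((K - 1, N, N))
    for k in range(K - 1):
        xi[k] = alpha[k][:, None] * A * B[:, O[k + 1]][None, :] * beta[k + 1][None, :]
    xi /= prob_O
    # γ_k(i) = α_k(i) β_k(i) / P(O)
    gamma = alpha * beta / prob_O
    return gamma, xi
```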

Baum-Welch algorithm: expectation step (3).
We calculated ξ_k(i,j) = P(q_k = s_i, q_{k+1} = s_j | o_1 o_2 ... o_K) and γ_k(i) = P(q_k = s_i | o_1 o_2 ... o_K).
Expected number of transitions from state s_i to state s_j = Σ_k ξ_k(i,j).
Expected number of transitions out of state s_i = Σ_k γ_k(i).
Expected number of times observation v_m occurs in state s_i = Σ_k γ_k(i), where the sum is over k such that o_k = v_m.
Expected frequency in state s_i at time k = 1: γ_1(i).

Baum-Welch algorithm: maximization step.
a_ij = (expected number of transitions from state s_i to state s_j) / (expected number of transitions out of state s_i) = Σ_k ξ_k(i,j) / Σ_k γ_k(i)
b_i(v_m) = (expected number of times observation v_m occurs in state s_i) / (expected number of times in state s_i) = Σ_{k: o_k = v_m} γ_k(i) / Σ_k γ_k(i)
π_i = (expected frequency in state s_i at time k = 1) = γ_1(i)
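
Combining the expectation and maximization steps, here is a sketch of one Baum-Welch update over a set of training sequences; it assumes the e_step() sketch above (and hence forward()/backward()), and all names are illustrative.

```python
import numpy as np

def baum_welch_iteration(A, B, pi, observations):
    """One Baum-Welch re-estimation of (A, B, pi) from a list of observation sequences.
    Assumes e_step() from the earlier sketch is defined."""
    N, M = B.shape
    A_num = np.zeros((N, N)); A_den = np.zeros(N)
    B_num = np.zeros((N, M)); B_den = np.zeros(N)
    pi_new = np.zeros(N)
    for O in observations:
        gamma, xi = e_step(A, B, pi, O)
        A_num += xi.sum(axis=0)                    # expected transitions from s_i to s_j
        A_den += gamma[:-1].sum(axis=0)            # expected transitions out of s_i (k = 1 .. K-1)
        for k, o_k in enumerate(O):
            B_num[:, o_k] += gamma[k]              # expected times v_m is observed in s_i
        B_den += gamma.sum(axis=0)                 # expected times in s_i
        pi_new += gamma[0]                         # expected frequency in s_i at time k = 1
    A_new = A_num / A_den[:, None]
    B_new = B_num / B_den[:, None]
    pi_new /= len(observations)
    return A_new, B_new, pi_new
```

Iterating this update until the parameters (or the likelihood) stop changing gives the full Baum-Welch training procedure; like any EM-style method it converges to a local, not necessarily global, maximum of P(O | M).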

Baum-Welch Training (summary): compute the forward and backward DP matrices, then accumulate the expected counts for the emissions E and transitions A.