Fast Temporal State-Splitting for HMM Model Selection and Learning. Sajid Siddiqi, Geoffrey Gordon, Andrew Moore.

Presentation transcript:

Fast Temporal State-Splitting for HMM Model Selection and Learning. Sajid Siddiqi, Geoffrey Gordon, Andrew Moore

[Plot: an observation sequence, value x versus time t]

[Plot: x vs. t] How many kinds of observations (x)?

[Plot: x vs. t] How many kinds of observations (x)? 3

[Plot: x vs. t] How many kinds of transitions (x_{t+1} | x_t)?

[Plot: x vs. t] How many kinds of observations (x)? 3 How many kinds of transitions (x_{t+1} | x_t)? 4

[Plot: x vs. t] How many kinds of observations (x)? 3 How many kinds of transitions (x_t → x_{t+1})? 4 We say that this sequence 'exhibits four states under the first-order Markov assumption'. Our goal is to discover the number of such states (and their parameter settings) in sequential data, and to do so efficiently.
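As a concrete (hypothetical) illustration of the counting above, here is a minimal Python sketch that tallies distinct observation values and distinct first-order transitions in a toy discrete sequence. The toy data and the literal reading of "kinds of transitions" as distinct ordered pairs (x_t, x_{t+1}) are my own assumptions, not taken from the slides.

    # Minimal sketch (hypothetical toy data): count distinct observation values
    # and distinct first-order transitions (x_t -> x_{t+1}) in a discrete sequence.
    x = [0, 0, 1, 2, 0, 0, 1, 2, 0, 0, 1, 2]           # toy observation sequence
    n_observation_kinds = len(set(x))                   # distinct values of x
    n_transition_kinds = len(set(zip(x[:-1], x[1:])))   # distinct (x_t, x_{t+1}) pairs
    print(n_observation_kinds, n_transition_kinds)      # 3 observation kinds, 4 transition kinds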

Definitions
An HMM is a 3-tuple λ = {A, B, π}, where
  A: N x N transition matrix
  B: N x M observation probability matrix
  π: N x 1 prior probability vector
|λ|: number of states in the HMM, i.e. N
T: number of observations in the sequence
q_t: the state the HMM is in at time t
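A minimal sketch of how the 3-tuple above might be held in code (Python with NumPy). The class and field names are my own, chosen only to mirror the slide's notation; this is not code from the paper.

    import numpy as np

    class HMM:
        """Container for lambda = {A, B, pi}, following the slide's notation."""
        def __init__(self, A, B, pi):
            self.A = np.asarray(A)    # N x N transitions, A[i, j] = P(q_{t+1} = s_j | q_t = s_i)
            self.B = np.asarray(B)    # N x M observation probs, B[i, k] = P(O_t = k | q_t = s_i)
            self.pi = np.asarray(pi)  # length-N prior, pi[i] = P(q_0 = s_i)
            self.N = self.A.shape[0]  # |lambda|: number of states
            self.M = self.B.shape[1]  # number of distinct observation symbols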

HMMs as DBNs [Diagram: dynamic Bayesian network with hidden-state chain q_0 → q_1 → q_2 → q_3 → q_4 and observations O_0 … O_4; the prior shown is 1/3 per state]

Transition Model
Each of these probability tables is identical.

i    P(q_{t+1}=s_1 | q_t=s_i)   P(q_{t+1}=s_2 | q_t=s_i)   …   P(q_{t+1}=s_j | q_t=s_i)   …   P(q_{t+1}=s_N | q_t=s_i)
1    a_11                       a_12                       …   a_1j                       …   a_1N
2    a_21                       a_22                       …   a_2j                       …   a_2N
3    a_31                       a_32                       …   a_3j                       …   a_3N
:    :                          :                              :                              :
i    a_i1                       a_i2                       …   a_ij                       …   a_iN
:    :                          :                              :                              :
N    a_N1                       a_N2                       …   a_Nj                       …   a_NN

[Diagram: state chain q_0 → q_1 → q_2 → q_3 → q_4, prior 1/3]
Notation: a_ij = P(q_{t+1} = s_j | q_t = s_i)

Observation Model
[Diagram: hidden states q_0 … q_4 emitting observations O_0 … O_4]

i    P(O_t=1 | q_t=s_i)   P(O_t=2 | q_t=s_i)   …   P(O_t=k | q_t=s_i)   …   P(O_t=M | q_t=s_i)
1    b_1(1)               b_1(2)               …   b_1(k)               …   b_1(M)
2    b_2(1)               b_2(2)               …   b_2(k)               …   b_2(M)
3    b_3(1)               b_3(2)               …   b_3(k)               …   b_3(M)
:    :                    :                        :                        :
i    b_i(1)               b_i(2)               …   b_i(k)               …   b_i(M)
:    :                    :                        :                        :
N    b_N(1)               b_N(2)               …   b_N(k)               …   b_N(M)

Notation: b_i(k) = P(O_t = k | q_t = s_i)
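To make the transition and observation tables concrete, here is a small generative sampling sketch: q_0 is drawn from π, each q_t from row q_{t-1} of A, and each O_t from row q_t of B. The function and variable names are my own illustration, not code from the paper.

    import numpy as np

    def sample_sequence(A, B, pi, T, rng=np.random.default_rng(0)):
        """Draw a hidden-state sequence q and an observation sequence O of length T."""
        A, B, pi = np.asarray(A), np.asarray(B), np.asarray(pi)
        N, M = B.shape
        q = np.zeros(T, dtype=int)
        O = np.zeros(T, dtype=int)
        q[0] = rng.choice(N, p=pi)                 # q_0 ~ pi
        O[0] = rng.choice(M, p=B[q[0]])
        for t in range(1, T):
            q[t] = rng.choice(N, p=A[q[t - 1]])    # P(q_t = s_j | q_{t-1} = s_i) = a_ij
            O[t] = rng.choice(M, p=B[q[t]])        # P(O_t = k | q_t = s_i) = b_i(k)
        return q, O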

HMMs as FSAs / HMMs as DBNs [Diagrams: the same model drawn as a DBN (q_0 … q_4 with observations O_0 … O_4) and as a finite-state automaton over states S_1, S_2, S_3, S_4]

Operations on HMMs
Problem 1: Evaluation. Given an HMM and an observation sequence, what is the likelihood of this sequence?
Problem 2: Most Probable Path. Given an HMM and an observation sequence, what is the most probable path through state space?
Problem 3: Learning HMM parameters. Given an observation sequence and a fixed number of states, what is an HMM that is likely to have produced this string of observations?
Problem 4: Learning the number of states. Given an observation sequence, what is an HMM (of any size) that is likely to have produced this string of observations?

Operations on HMMs

Problem                                              Algorithm               Complexity
Evaluation: calculating P(O | λ)                     Forward-Backward        O(TN^2)
Path inference: computing Q* = argmax_Q P(O, Q | λ)  Viterbi                 O(TN^2)
Parameter learning:                                  Viterbi Training;       O(TN^2)
  1. computing λ* = argmax_{λ,Q} P(O, Q | λ)         Baum-Welch (EM)
  2. computing λ* = argmax_λ P(O | λ)
Learning the number of states                        ?                       ?
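For the first row of the table, here is a sketch of the O(TN^2) forward recursion that computes P(O | λ), returned as a log-likelihood with per-step scaling to avoid underflow. This is a standard textbook formulation rather than the authors' code; the names are my own.

    import numpy as np

    def log_likelihood(A, B, pi, O):
        """Scaled forward pass: returns log P(O | lambda) for a discrete-output HMM."""
        A, B, pi = np.asarray(A), np.asarray(B), np.asarray(pi)
        N, T = A.shape[0], len(O)
        alpha = pi * B[:, O[0]]                 # alpha_0(i) = pi_i * b_i(O_0)
        c = alpha.sum(); alpha /= c             # scale to keep values in range
        loglik = np.log(c)
        for t in range(1, T):
            alpha = (alpha @ A) * B[:, O[t]]    # alpha_t(j) = sum_i alpha_{t-1}(i) a_ij b_j(O_t)
            c = alpha.sum(); alpha /= c
            loglik += np.log(c)
        return loglik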

Path Inference. The Viterbi algorithm for calculating argmax_Q P(O, Q | λ).

[Trellis: for each timestep t, entries δ_t(1), δ_t(2), δ_t(3), …, δ_t(N)]

[Trellis: rows t = 1, 2, 3, …; columns δ_t(1), δ_t(2), δ_t(3), …, δ_t(N)]

Path Inference. The Viterbi algorithm for calculating argmax_Q P(O, Q | λ). Running time: O(TN^2). Yields a globally optimal path through hidden state space, associating each timestep with exactly one HMM state.
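A sketch of the Viterbi recursion just described, done in log space; delta[t, j] plays the role of the δ_t(j) trellis entries pictured above. This is a generic textbook version, not the paper's implementation, and the names are my own.

    import numpy as np

    def viterbi(A, B, pi, O):
        """Return the most probable state path Q* = argmax_Q P(O, Q | lambda)."""
        A, B, pi = np.asarray(A), np.asarray(B), np.asarray(pi)
        N, T = A.shape[0], len(O)
        logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
        delta = np.zeros((T, N))
        psi = np.zeros((T, N), dtype=int)
        delta[0] = logpi + logB[:, O[0]]
        for t in range(1, T):
            scores = delta[t - 1][:, None] + logA     # scores[i, j] = delta_{t-1}(i) + log a_ij
            psi[t] = scores.argmax(axis=0)            # best predecessor of each state j
            delta[t] = scores.max(axis=0) + logB[:, O[t]]
        q = np.zeros(T, dtype=int)
        q[-1] = delta[-1].argmax()
        for t in range(T - 2, -1, -1):                # backtrack
            q[t] = psi[t + 1, q[t + 1]]
        return q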

Parameter Learning I: Viterbi Training (≈ K-means for sequences)

Parameter Learning I: Viterbi Training (≈ K-means for sequences)
Q*_{s+1} = argmax_Q P(O, Q | λ_s)   (Viterbi algorithm)
λ_{s+1} = argmax_λ P(O, Q*_{s+1} | λ)
Running time: O(TN^2) per iteration.
Models the posterior belief as a δ-function per timestep in the sequence. Performs well on data with easily distinguishable states.
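One Viterbi-training iteration might look like the sketch below: hard-assign every timestep using the Viterbi path, then re-estimate A, B, π from counts. The add-one smoothing is my own choice to keep probabilities nonzero, and the sketch reuses the viterbi() function from the earlier sketch.

    import numpy as np

    def viterbi_training_step(A, B, pi, O):
        """One iteration: Q*_{s+1} from Viterbi, then count-based re-estimation of lambda."""
        A, B = np.asarray(A), np.asarray(B)
        N, M, T = A.shape[0], B.shape[1], len(O)
        q = viterbi(A, B, pi, O)                    # Q*_{s+1} = argmax_Q P(O, Q | lambda_s)
        A_new = np.ones((N, N))                     # add-one smoothed counts (my choice)
        B_new = np.ones((N, M))
        pi_new = np.ones(N)
        pi_new[q[0]] += 1
        for t in range(T - 1):
            A_new[q[t], q[t + 1]] += 1
            B_new[q[t], O[t]] += 1
        B_new[q[-1], O[-1]] += 1
        # lambda_{s+1} = argmax_lambda P(O, Q*_{s+1} | lambda): normalize the counts
        return (A_new / A_new.sum(1, keepdims=True),
                B_new / B_new.sum(1, keepdims=True),
                pi_new / pi_new.sum())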

Parameter Learning II: Baum-Welch (≈ GMM for sequences)
Iterate the following two steps until convergence:
1. Calculate the expected complete log-likelihood given λ_s
2. Obtain updated model parameters λ_{s+1} by maximizing this log-likelihood

Parameter Learning II: Baum-Welch (≈ GMM for sequences)
Iterate the following two steps until convergence:
1. Calculate the expected complete log-likelihood given λ_s
2. Obtain updated model parameters λ_{s+1} by maximizing this log-likelihood
Obj(λ, λ_s) = E_Q[ log P(O, Q | λ) | O, λ_s ]
λ_{s+1} = argmax_λ Obj(λ, λ_s)
Running time: O(TN^2) per iteration, but with a larger constant.
Models the full posterior belief over hidden states per timestep. Effectively models sequences with overlapping states, at the cost of extra computation.
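For comparison, a sketch of one Baum-Welch iteration for a discrete-output HMM: the E-step runs scaled forward-backward recursions to get per-timestep state posteriors (gamma) and expected transition counts, and the M-step re-normalizes those counts. This is the standard formulation; the scaling scheme and names are my own, not the paper's code.

    import numpy as np

    def baum_welch_step(A, B, pi, O):
        """One EM iteration: returns updated (A, B, pi)."""
        A, B, pi = np.asarray(A), np.asarray(B), np.asarray(pi)
        N, M, T = A.shape[0], B.shape[1], len(O)
        alpha = np.zeros((T, N)); beta = np.zeros((T, N)); c = np.zeros(T)
        alpha[0] = pi * B[:, O[0]]; c[0] = alpha[0].sum(); alpha[0] /= c[0]
        for t in range(1, T):                        # scaled forward pass
            alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]
            c[t] = alpha[t].sum(); alpha[t] /= c[t]
        beta[-1] = 1.0
        for t in range(T - 2, -1, -1):               # scaled backward pass
            beta[t] = (A @ (B[:, O[t + 1]] * beta[t + 1])) / c[t + 1]
        gamma = alpha * beta                         # gamma[t, i] = P(q_t = s_i | O, lambda_s)
        gamma /= gamma.sum(1, keepdims=True)
        A_new = np.zeros((N, N)); B_new = np.zeros((N, M))
        for t in range(T - 1):                       # expected transition counts (xi summed over t)
            A_new += alpha[t][:, None] * A * B[:, O[t + 1]] * beta[t + 1] / c[t + 1]
        for t in range(T):                           # expected emission counts
            B_new[:, O[t]] += gamma[t]
        return (A_new / A_new.sum(1, keepdims=True),
                B_new / B_new.sum(1, keepdims=True),
                gamma[0])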

HMM Model Selection Distinction between model search and actual selection step –We can search the spaces of HMMs with different N using parameter learning, and perform selection using a criterion like BIC.

HMM Model Selection Distinction between model search and actual selection step –We can search the spaces of HMMs with different N using parameter learning, and perform selection using a criterion like BIC. Running time: O(Tn^2) to compute the likelihood for BIC.

HMM Model Selection I
for n = 1 … Nmax
    Initialize n-state HMM randomly
    Learn model parameters
    Calculate BIC score
    If best so far, store model
    If larger model not chosen, stop

HMM Model Selection I
for n = 1 … Nmax
    Initialize n-state HMM randomly
    Learn model parameters
    Calculate BIC score
    If best so far, store model
    If larger model not chosen, stop
Running time: O(Tn^2) per iteration.
Drawback: local minima in parameter optimization.
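A sketch of the Model Selection I loop above. The BIC form used here (log-likelihood minus half the number of free parameters times log T), the free-parameter count, and the Dirichlet random initialization are my own choices; train and log_likelihood stand in for any parameter-learning routine (Baum-Welch or Viterbi Training) and the forward-pass evaluator sketched earlier.

    import numpy as np

    def n_free_params(n, m):
        # n rows of A (n-1 free each) + n rows of B (m-1 free each) + prior (n-1 free)
        return n * (n - 1) + n * (m - 1) + (n - 1)

    def select_model(O, M, Nmax, train, log_likelihood, rng=np.random.default_rng(0)):
        """Grow n until the larger model is not chosen by BIC; return the best (A, B, pi)."""
        best = None
        for n in range(1, Nmax + 1):
            A = rng.dirichlet(np.ones(n), size=n)      # random n-state initialization
            B = rng.dirichlet(np.ones(M), size=n)
            pi = rng.dirichlet(np.ones(n))
            A, B, pi = train(A, B, pi, O)              # learn model parameters
            bic = log_likelihood(A, B, pi, O) - 0.5 * n_free_params(n, M) * np.log(len(O))
            if best is None or bic > best[0]:
                best = (bic, (A, B, pi))               # best so far: store model
            else:
                break                                  # larger model not chosen: stop
        return best[1]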

HMM Model Selection II
for n = 1 … Nmax
    for i = 1 … NumTries
        Initialize n-state HMM randomly
        Learn model parameters
        Calculate BIC score
        If best so far, store model
    If larger model not chosen, stop

HMM Model Selection II
for n = 1 … Nmax
    for i = 1 … NumTries
        Initialize n-state HMM randomly
        Learn model parameters
        Calculate BIC score
        If best so far, store model
    If larger model not chosen, stop
Running time: O(NumTries x Tn^2) per iteration.
Evaluates NumTries candidate models for each n to overcome local minima. However: expensive, and still prone to local minima, especially for large N.

Idea: binary state splits* to generate candidate models
To split state s into s_1 and s_2:
– Create λ' such that λ'_{\s} = λ_{\s}
– Initialize λ'_{s_1} and λ'_{s_2} based on λ_s and on parameter constraints
Notation:
– λ_s: HMM parameters related to state s
– λ_{\s}: HMM parameters not related to state s
* first proposed in Ostendorf and Singer, 1997

Idea: binary state splits* to generate candidate models
To split state s into s_1 and s_2:
– Create λ' such that λ'_{\s} = λ_{\s}
– Initialize λ'_{s_1} and λ'_{s_2} based on λ_s and on parameter constraints
This is an effective heuristic for avoiding local minima.
Notation:
– λ_s: HMM parameters related to state s
– λ_{\s}: HMM parameters not related to state s
* first proposed in Ostendorf and Singer, 1997
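A hedged sketch of what such a binary split might look like for a discrete-output HMM: duplicate state s's row of A, row of B, and prior entry, share the incoming probability mass between s_1 and s_2, and perturb the emission copies slightly to break symmetry. The specific mass-splitting and perturbation choices are my own; the paper imposes its own parameter constraints on the split.

    import numpy as np

    def split_state(A, B, pi, s, eps=0.01, rng=np.random.default_rng(0)):
        """Return (A', B', pi') with state s split into s1 (old index s) and s2 (new last index)."""
        A, B, pi = np.asarray(A), np.asarray(B), np.asarray(pi)
        A2 = np.vstack([A, A[s]])                       # new state copies s's outgoing probabilities
        A2 = np.hstack([A2, (A2[:, s] / 2)[:, None]])   # new column gets half of s's incoming mass
        A2[:, s] /= 2                                   # old column keeps the other half
        B2 = np.vstack([B, B[s]])
        B2[[s, -1]] *= 1 + eps * rng.standard_normal((2, B.shape[1]))  # small symmetry-breaking noise
        B2 = np.abs(B2); B2 /= B2.sum(1, keepdims=True)
        pi2 = np.append(pi, pi[s] / 2); pi2[s] /= 2
        return A2, B2, pi2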

Overall algorithm

[Flowchart] Start with a small number of states → binary state splits* followed by EM (B.W. or V.T.) → BIC on training set → stop when a bigger HMM is not selected.

Overall algorithm [Flowchart] Start with a small number of states → binary state splits followed by EM (B.W. or V.T.) → BIC on training set → stop when a bigger HMM is not selected. What is 'efficient'? We want this loop to be at most O(TN^2).

HMM Model Selection III
Initialize n_0-state HMM randomly
for n = n_0 … Nmax
    Learn model parameters
    for i = 1 … n
        Split state i, learn model parameters
        Calculate BIC score
        If best so far, store model
    If larger model not chosen, stop

HMM Model Selection III
Initialize n_0-state HMM randomly
for n = n_0 … Nmax
    Learn model parameters
    for i = 1 … n
        Split state i, learn model parameters    ← O(Tn^2)
        Calculate BIC score
        If best so far, store model
    If larger model not chosen, stop

HMM Model Selection III
Initialize n_0-state HMM randomly
for n = n_0 … Nmax
    Learn model parameters
    for i = 1 … n
        Split state i, learn model parameters    ← O(Tn^2)
        Calculate BIC score
        If best so far, store model
    If larger model not chosen, stop
Running time: O(Tn^3) per iteration of the outer loop.
More effective at avoiding local minima than the previous approaches; however, it scales poorly because of the n^3 term.

Fast Candidate Generation

– Only consider timesteps owned by s in the Viterbi path
– Only allow the parameters of the split states to vary
– Merge parameters and store as a candidate

OptimizeSplitParams I: Split-State Viterbi Training (SSVT) Iterate until convergence:

Constrained Viterbi. Splitting state s into s_1, s_2: we calculate the constrained Viterbi path using a fast 'constrained' Viterbi algorithm over only those timesteps owned by s in Q*, constraining them to belong to s_1 or s_2.

[Trellis: columns δ_t(1), δ_t(2), δ_t(3), …, δ_t(N) per timestep t] The Viterbi path is denoted by Q*. Suppose we split state N into s_1, s_2.

[Trellis: columns δ_t(1), δ_t(2), δ_t(3), …, δ_t(s_1), δ_t(s_2); the entries for s_1 and s_2 are as yet unknown (??)] The Viterbi path is denoted by Q*. Suppose we split state N into s_1, s_2.

[Trellis: columns δ_t(1), δ_t(2), δ_t(3), …, δ_t(s_1), δ_t(s_2), now filled in] The Viterbi path is denoted by Q*. Suppose we split state N into s_1, s_2.

OptimizeSplitParams I: Split-State Viterbi Training (SSVT)
Iterate until convergence: [update equations shown on slide]
Running time: O(|T_s| n) per iteration.
When splitting state s, this assumes the rest of the HMM parameters (λ_{\s}) and the rest of the Viterbi path (Q*_{\T_s}) are both fixed.
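A sketch of one possible constrained-Viterbi step, following my reading of the slides: for each maximal run of timesteps owned by the split state s in Q*, choose between s_1 and s_2 only, holding the rest of the path and all other parameters fixed (so the two-way choice itself touches only the owned timesteps). Boundary handling, indexing, and names are my own assumptions, not the authors' code.

    import numpy as np

    def constrained_viterbi(A2, B2, pi2, O, Qstar, s, s1, s2):
        """Reassign only the timesteps owned by s in Qstar to s1 or s2; keep everything else fixed."""
        A2, B2, pi2 = np.asarray(A2), np.asarray(B2), np.asarray(pi2)
        Q = np.asarray(Qstar).copy()
        T = len(O)
        states = [s1, s2]
        t = 0
        while t < T:
            if Qstar[t] != s:
                t += 1
                continue
            a = t                                    # start of a run owned by s
            while t < T and Qstar[t] == s:
                t += 1
            b = t                                    # run covers timesteps [a, b)
            L = b - a
            delta = np.zeros((L, 2))
            psi = np.zeros((L, 2), dtype=int)
            for k, u in enumerate(states):           # entering the run from the fixed path (or pi)
                enter = np.log(pi2[u]) if a == 0 else np.log(A2[Q[a - 1], u])
                delta[0, k] = enter + np.log(B2[u, O[a]])
            for i in range(1, L):
                for k, u in enumerate(states):
                    scores = [delta[i - 1, j] + np.log(A2[states[j], u]) for j in range(2)]
                    psi[i, k] = int(np.argmax(scores))
                    delta[i, k] = max(scores) + np.log(B2[u, O[a + i]])
            last = delta[-1].copy()
            if b < T:                                # leaving the run into the fixed path
                last += np.log(A2[[s1, s2], Q[b]])
            k = int(np.argmax(last))
            for i in range(L - 1, -1, -1):           # backtrack inside the run only
                Q[a + i] = states[k]
                if i > 0:
                    k = psi[i, k]
        return Q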

Fast approximate BIC
Compute once for the base model: O(Tn^2)
Update optimistically* for each candidate model: O(|T_s|)
* first proposed in Stolcke and Omohundro, 1994
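One plausible reading of the 'optimistic update', written out as math (my own formulation, not the paper's derivation): score a split candidate by reusing the base model's BIC and adjusting only the terms the split can change, namely the log-likelihood over the owned timesteps T_s and the parameter-count penalty.

    \mathrm{BIC}(\lambda) = \log P(O \mid \lambda) - \tfrac{\#\text{free params}(\lambda)}{2} \log T
    \mathrm{BIC}(\lambda') \approx \mathrm{BIC}(\lambda)
        + \big[ \log P(O_{T_s} \mid \lambda') - \log P(O_{T_s} \mid \lambda) \big]
        - \tfrac{\Delta \#\text{params}}{2} \log T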

HMM Model Selection IV
Initialize n_0-state HMM randomly
for n = n_0 … Nmax
    Learn model parameters
    for i = 1 … n
        Split state i, optimize by constrained EM
        Calculate approximate BIC score
        If best so far, store model
    If larger model not chosen, stop

HMM Model Selection IV
Initialize n_0-state HMM randomly
for n = n_0 … Nmax
    Learn model parameters
    for i = 1 … n    ← O(Tn) total
        Split state i, optimize by constrained EM
        Calculate approximate BIC score
        If best so far, store model
    If larger model not chosen, stop
Running time: O(Tn^2) per iteration of the outer loop!

Algorithms
SOFT: Baum-Welch / Constrained Baum-Welch (slower, more accurate)
HARD: Viterbi Training / Constrained Viterbi Training (faster, coarser)

Results
1. Learning fixed-size models
2. Learning variable-sized models
Baseline: fixed-size HMM, Baum-Welch with five restarts

Learning fixed-size models

Fixed-size experiments table, continued

Learning fixed-size models

Learning variable-size models

Conclusion
Pros:
– Simple and efficient method for HMM model selection
– Also learns better fixed-size models (often faster than a single run of Baum-Welch)
– Different variants suitable for different problems

Conclusion
Cons:
– Greedy heuristic: no performance guarantees
– Binary splits are also prone to local minima
– Why binary splits?
– Works less well on discrete-valued data (greater error from the Viterbi path assumptions)
Pros:
– Simple and efficient method for HMM model selection
– Also learns better fixed-size models (often faster than a single run of Baum-Welch)
– Different variants suitable for different problems

Thank you

Appendix

Viterbi Algorithm

Constrained Viterbi

EM for HMMs

More definitions

OptimizeSplitParams II: Constrained Baum-Welch Iterate until convergence:

Penalized BIC