Hidden Markov Models M. Vijay Venkatesh

Outline Introduction Graphical Model Parameterization Inference Summary

Introduction A Hidden Markov Model (HMM) is a graphical model for sequential data. The states are no longer independent: the state at any given step depends on the state chosen at the previous step. HMMs are a generalization of mixture models, with a transition matrix linking the states at neighboring steps.

Introduction Inference in an HMM takes the observed data as input and yields a probability distribution over the underlying states. Since the states are dependent, this is somewhat more involved than inference for mixture models.

Generation of data for the IID and HMM cases (figure)

Graphical Model (Figure: a chain of state nodes Q_0, Q_1, Q_2, …, Q_T linked by the transition matrix A, with the initial distribution π on Q_0 and an output node Y_0, Y_1, Y_2, …, Y_T attached to each state node.) The top node in each slice represents the multinomial state variable Q_t and the bottom node represents the observable output variable Y_t.

Graphical Model Conditioning on the state Q_t renders Q_{t-1} and Q_{t+1} independent. More generally, Q_s is independent of Q_u given Q_t, for s < t < u. The same holds for output nodes Y_s and Y_u when conditioned on the state node Q_t. Conditioning on an output node does not yield any conditional independence; indeed, conditioning on all output nodes fails to induce any independencies among the state nodes.
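In notation (a restatement of the independencies above; ⟂ denotes conditional independence):

Q_s \perp Q_u \mid Q_t \quad\text{and}\quad Y_s \perp Y_u \mid Q_t \quad\text{for all } s < t < u,

which follows from the Markov property P(q_{t+1} \mid q_0, \ldots, q_t) = P(q_{t+1} \mid q_t).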

Parameterization The state transition matrix A, where the (i, j) entry of A is defined as the transition probability a_{ij} = P(q_{t+1} = j | q_t = i). Each output node has a single state node as a parent, therefore we also require the emission probability P(y_t | q_t). For a particular configuration (q, y), the joint probability is expressed as the product of the initial state probability, the transition probabilities, and the emission probabilities.

Parameterization To introduce the A and π parameters into the joint probability, we write the transition probabilities as a_{q_t, q_{t+1}} and the unconditional initial node distribution as π_{q_0} = P(q_0). We then get the joint probability shown in the display below.
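A hedged reconstruction of the factorization this slide refers to (the equation itself did not survive transcription), using the indexing t = 0, …, T from the graphical-model slide:

P(q, y) \;=\; \pi_{q_0} \prod_{t=0}^{T-1} a_{q_t, q_{t+1}} \prod_{t=0}^{T} P(y_t \mid q_t)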

Inference The general inference problem is to compute the probability of the hidden state sequence q given an observed output sequence y. Special cases include the marginal probability of a particular hidden state q_t given the output sequence, and probabilities conditioned on a partial output sequence: filtering P(q_t | y_0, …, y_t), prediction P(q_t | y_0, …, y_s) for s < t, and smoothing, where we calculate a posterior probability based on data up to and including a later time.

Inference Let us calculate P(q | y), where y = (y_0, …, y_T) is the entire observed output sequence. We can calculate the joint probability P(q, y) directly from the parameterization, but to obtain P(y) we need to sum P(q, y) over all possible values of the hidden states. Each state can take M possible values and we have a state node at every time step, which implies that the naive sum has on the order of M^T terms.

Inference Each factor in the joint involves only one or two state variables, so it is possible to move the sums inside the product and carry them out systematically. Moving the sums inside and forming a recursion reduces the computation dramatically (see the display below).
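A sketch of the rearrangement, assuming the factorization reconstructed above; each nested sum ranges over a single state variable:

P(y) \;=\; \sum_{q_0} \cdots \sum_{q_T} \pi_{q_0} \prod_{t=0}^{T-1} a_{q_t, q_{t+1}} \prod_{t=0}^{T} P(y_t \mid q_t)
\;=\; \sum_{q_0} \pi_{q_0} P(y_0 \mid q_0) \sum_{q_1} a_{q_0, q_1} P(y_1 \mid q_1) \cdots \sum_{q_T} a_{q_{T-1}, q_T} P(y_T \mid q_T)

Each of the T nested sums costs O(M^2), giving O(M^2 T) overall instead of the O(M^T) naive enumeration.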

Inference Rather than computing P(q | y) for the whole state sequence, we focus on a particular state node q_t and calculate P(q_t | y). We take advantage of the conditional independencies and Bayes' rule. (Figure: nodes Q_t, Q_{t+1} with transition matrix A and outputs Y_t, Y_{t+1}.)

Inference We write P(q_t | y) ∝ α(q_t) β(q_t), where α(q_t) = P(y_0, …, y_t, q_t) is the probability of emitting the partial sequence of outputs y_0, …, y_t and ending in state q_t, and β(q_t) = P(y_{t+1}, …, y_T | q_t) is the probability of emitting the partial sequence of outputs y_{t+1}, …, y_T starting from state q_t.

Inference The problem is thus reduced to finding α and β. We obtain a recursive relation between α(q_t) and α(q_{t+1}); the required time is O(M^2 T) and the algorithm proceeds forward in time. Similarly, we obtain a recursive backward relation between β(q_t) and β(q_{t+1}). To compute the posterior probabilities for all states q_t, we are required to compute the alphas and betas for each step.
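A minimal runnable sketch of the α–β (forward–backward) recursions just described, in Python/NumPy. The names pi, A, and lik are illustrative assumptions rather than notation from the slides: pi is the initial state distribution, A the M×M transition matrix, and lik[t, j] = P(y_t | q_t = j) the likelihood of the observed output at step t; here T counts the observations, so time indices run 0, …, T-1. For brevity, the per-step rescaling that a practical implementation would use to avoid numerical underflow is omitted.

import numpy as np

def forward_backward(pi, A, lik):
    # pi:  (M,)   initial state distribution, pi[i] = P(q_0 = i)
    # A:   (M, M) transition matrix, A[i, j] = P(q_{t+1} = j | q_t = i)
    # lik: (T, M) emission likelihoods of the observed outputs, lik[t, j] = P(y_t | q_t = j)
    T, M = lik.shape
    alpha = np.zeros((T, M))
    beta = np.ones((T, M))
    # Forward recursion: alpha[t, j] = P(y_0, ..., y_t, q_t = j)
    alpha[0] = pi * lik[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * lik[t]
    # Backward recursion: beta[t, i] = P(y_{t+1}, ..., y_{T-1} | q_t = i)
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (lik[t + 1] * beta[t + 1])
    likelihood = alpha[-1].sum()        # P(y_0, ..., y_{T-1})
    gamma = alpha * beta / likelihood   # gamma[t, j] = P(q_t = j | y)
    return alpha, beta, gamma, likelihood

Each forward or backward step is a single matrix-vector product, which is the source of the O(M^2 T) cost noted above.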

Alternate inference algorithm An alternative approach in which the backward phase is a recursion defined on a γ(q_t) variable. The backward phase then does not use the data y_t; only the forward phase does, so we can discard each observation as soon as it has been filtered.

Alternate inference algorithm This recursion makes use of the α variables, and hence the α recursion must be run before the γ recursion. The data y_t are not used in the γ recursion; the α recursion has already absorbed all the necessary likelihoods.
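A hedged statement of the recursion being described, writing γ(q_t) ≜ P(q_t | y); note that it uses only the α values and the transition probabilities, consistent with the claim that the data are not needed in the backward phase:

\gamma(q_t) \;=\; \sum_{q_{t+1}} \frac{\alpha(q_t)\, a_{q_t, q_{t+1}}}{\sum_{q_t'} \alpha(q_t')\, a_{q_t', q_{t+1}}}\; \gamma(q_{t+1}),
\qquad
\gamma(q_T) \;=\; \frac{\alpha(q_T)}{\sum_{q_T} \alpha(q_T)}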

Transition matrix The α-β or α-γ algorithm provides us with the posterior probabilities of the states. To estimate the state transition matrix A, we also need the matrix of co-occurrence probabilities P(q_t, q_{t+1} | y). We calculate ξ(q_t, q_{t+1}) ≜ P(q_t, q_{t+1} | y) from the alphas and betas.
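A hedged reconstruction of the standard expression for ξ in terms of the quantities already defined:

\xi(q_t, q_{t+1}) \;=\; P(q_t, q_{t+1} \mid y) \;=\; \frac{\alpha(q_t)\, a_{q_t, q_{t+1}}\, P(y_{t+1} \mid q_{t+1})\, \beta(q_{t+1})}{P(y)},
\qquad
P(y) \;=\; \sum_{q_T} \alpha(q_T)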

Junction tree connection We can calculate all the posterior probabilities for the HMM recursively. Given an observed sequence y, we run the α recursion forward in time. If we require the likelihood, we simply sum the alphas at the final time step. If we require the posterior probabilities of the states, we use either the β or the γ recursion.

Junction tree connection The HMM is represented by the multinomial state variable Q_t and the observable output variable Y_t. It is parameterized by the initial probability π and, for each subsequent state node, by the transition matrix A with entries a_{ij} = P(q_{t+1} = j | q_t = i). The output nodes are assigned the local conditional probability P(y_t | q_t). We assume that y_t is a multinomial node, so that P(y_t | q_t) can be viewed as a matrix B. To convert the HMM to a junction tree, we moralize, triangulate, and form the clique graph; we then choose a maximal spanning tree, which forms our junction tree.

Junction tree connection (Figures: the moralized and triangulated graph, and the junction tree for the HMM with its potentials labeled.)

Junction tree connection The initial probability as well as the conditional probability of the first transition are assigned to the potential ψ(q_0, q_1), which implies that this potential is initially set to π_{q_0} a_{q_0, q_1}. The state-to-state potentials are given the assignment ψ(q_t, q_{t+1}) = a_{q_t, q_{t+1}}, the output probabilities are assigned to the potentials ψ(q_t, y_t) = P(y_t | q_t), and the separator potentials are initialized to one.

Unconditional inference Let us do inference before any evidence is observed. We designate the clique (q_{T-1}, q_T) as the root and collect to the root. Consider first the operation of passing a message upward from a clique (q_t, y_t) for t > 1. The marginalization (shown below) yields one, thus the separator potential remains set at one. This implies that the update factor is one and the clique potential it feeds into remains unchanged. In general, the messages passed upward from the leaves have no effect when no evidence is observed.
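The marginalization referred to above, written out (it uses only the fact that each conditional distribution sums to one):

\phi^*(q_t) \;=\; \sum_{y_t} \psi(q_t, y_t) \;=\; \sum_{y_t} P(y_t \mid q_t) \;=\; 1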

Unconditional inference Now consider the message from (q_0, y_0) to (q_0, q_1) and the subsequent messages passed forward along the chain. This transformation propagates forward along the chain, changing the separator potentials on q_t into marginals P(q_t) and the clique potentials into marginals P(q_t, q_{t+1}). A subsequent DistributeEvidence pass will have no effect on the potentials along the backbone of the chain, but will convert the separators attached to the output cliques into marginals P(q_t) and the potentials ψ(q_t, y_t) into marginals P(q_t, y_t).
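A sketch of the first steps of this forward propagation, under the potential assignments given two slides earlier:

\phi^*(q_1) = \sum_{q_0} \psi(q_0, q_1) = \sum_{q_0} \pi_{q_0} a_{q_0, q_1} = P(q_1),
\qquad
\psi^*(q_1, q_2) = \psi(q_1, q_2)\, \phi^*(q_1) = a_{q_1, q_2} P(q_1) = P(q_1, q_2),

and so on along the backbone: \phi^*(q_t) = P(q_t) and \psi^*(q_t, q_{t+1}) = P(q_t, q_{t+1}).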

Unconditional inference Thus all potentials throughout the junction tree become marginal probabilities. This result helps to clarify the representation of the joint probability as the product of the clique potentials divided by the product of the separator potentials.

Junction Tree Algorithm 1. Moralize if needed. 2. Triangulate using any triangulation algorithm. 3. Form the clique graph (clique nodes and separator nodes). 4. Compute the junction tree. 5. Initialize all separator potentials to one. 6. Phase 1: Collect from children. 7. Phase 2: Distribute to children. Collect: message from child C through separator S: φ*(X_S) = Σ_{X_C \ X_S} ψ(X_C); update at parent P: ψ*(X_P) = ψ(X_P) ∏_S φ*(X_S) / φ(X_S). Distribute: message from parent P through separator S: φ**(X_S) = Σ_{X_P \ X_S} ψ**(X_P); update at child C: ψ**(X_C) = ψ(X_C) ∏_S φ**(X_S) / φ*(X_S).

Introducing evidence We now suppose that the outputs y are observed, and we wish to compute P(y) as well as marginal posterior probabilities such as P(q_t | y) and P(q_t, q_{t+1} | y). We initialize the separator potentials to one, and recall that ψ(q_t, y_t) can be viewed as a matrix B with columns labeled by the possible values of y_t; in practice we introduce the evidence by restricting ψ(q_t, y_t) to the column of B corresponding to the observed value of y_t. We designate (Q_{T-1}, Q_T) as the root of the junction tree and collect to the root.

Collecting to the root Consider the update of the clique (Q_t, Q_{t+1}) as shown in the figure. We assume that φ*(q_t) has already been updated, and consider the computation of ψ*(q_t, q_{t+1}) and φ*(q_{t+1}): ψ*(q_t, q_{t+1}) = ψ(q_t, q_{t+1}) φ*(q_t) P(y_{t+1} | q_{t+1}) = a_{q_t, q_{t+1}} φ*(q_t) P(y_{t+1} | q_{t+1}), where the factor P(y_{t+1} | q_{t+1}) is the message arriving from the evidence clique (q_{t+1}, y_{t+1}), and then φ*(q_{t+1}) = Σ_{q_t} ψ*(q_t, q_{t+1}).

Collecting to the root Proceeding forward along the chain and defining α(q_t) ≜ φ*(q_t), we have recovered the alpha algorithm. The collect phase of the algorithm terminates with the update of ψ(q_{T-1}, q_T). The updated potential equals P(y_0, …, y_T, q_{T-1}, q_T), and thus by marginalization we obtain the likelihood P(y).
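Written out, the recursion recovered here is the forward (α) recursion; a hedged restatement, since the slide's own equations were lost in transcription:

\alpha(q_0) = \pi_{q_0} P(y_0 \mid q_0),
\qquad
\alpha(q_{t+1}) = P(y_{t+1} \mid q_{t+1}) \sum_{q_t} \alpha(q_t)\, a_{q_t, q_{t+1}},
\qquad
P(y) = \sum_{q_T} \alpha(q_T)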

Collecting to the root If, instead of designating (q_{T-1}, q_T) as the root, we use (q_0, q_1) as the root, we obtain the beta algorithm. It is not actually necessary to change the root of the junction tree to derive the beta algorithm: it arises during the DistributeEvidence pass when (q_{T-1}, q_T) is the root.

Distributing from the root In the second phase we distribute evidence from the root (q_{T-1}, q_T). This phase proceeds backwards along the chain, through the state-state cliques as well as the state-output cliques.

Distributing from the root We suppose that the separator potential φ**(q_{t+1}) has already been updated, and consider the update of ψ**(q_t, q_{t+1}) and φ**(q_t), shown below. Simplifying, we obtain the γ recursion.
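A sketch of the update equations being referred to, reconstructed from the generic junction-tree update rule applied to this clique:

\psi^{**}(q_t, q_{t+1}) \;=\; \psi^{*}(q_t, q_{t+1})\, \frac{\phi^{**}(q_{t+1})}{\phi^{*}(q_{t+1})},
\qquad
\phi^{**}(q_t) \;=\; \sum_{q_{t+1}} \psi^{**}(q_t, q_{t+1})

Substituting \psi^{*}(q_t, q_{t+1}) = \alpha(q_t)\, a_{q_t, q_{t+1}}\, P(y_{t+1} \mid q_{t+1}) and \phi^{*}(q_{t+1}) = \alpha(q_{t+1}) from the collect phase, and normalizing by P(y), recovers the γ recursion given earlier.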

Distributing from the root By rearranging and simplifying, we can also derive the relationship between the α-β and α-γ recursions.
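The relationship in question, stated explicitly (a standard identity consistent with the definitions above):

\gamma(q_t) \;=\; P(q_t \mid y) \;=\; \frac{\alpha(q_t)\, \beta(q_t)}{P(y)},
\qquad\text{equivalently}\qquad
\beta(q_t) \;=\; \frac{\gamma(q_t)\, P(y)}{\alpha(q_t)}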