 Hidden Markov Models M. Vijay Venkatesh. Outline Introduction Graphical Model Parameterization Inference Summary.

Hidden Markov Models M. Vijay Venkatesh

Outline Introduction Graphical Model Parameterization Inference Summary

Introduction A Hidden Markov Model (HMM) is a graphical model for sequential data. The states are no longer independent: the state at any given step depends on the state chosen at the previous step. An HMM generalizes a mixture model by adding a transition matrix that links the states at neighboring steps.

Introduction Inference in an HMM takes the observed data as input and yields a probability distribution over the underlying states. Because the states are dependent, inference is somewhat more involved than for mixture models.

Generation of data for IID and HMM case
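The generative process for the HMM case can be sketched in a few lines. This is a minimal illustration, not the presenter's code: the function name `sample_hmm`, the two-state parameters `pi`, `A`, `B`, and the seed are all assumptions chosen for the example, with `B[i][k]` standing for $P(y_t = k \mid q_t = i)$.

```python
import random

def sample_hmm(pi, A, B, T, rng):
    """Draw states q_0..q_T and outputs y_0..y_T from a discrete HMM."""
    def draw(probs):
        # Sample one index from a discrete distribution.
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]

    q = [draw(pi)]                   # initial state from pi
    for _ in range(T):
        q.append(draw(A[q[-1]]))     # next state depends on the previous state
    y = [draw(B[s]) for s in q]      # each output depends only on its own state
    return q, y

# Illustrative (assumed) parameters for a 2-state, 2-symbol HMM.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.2, 0.8]]
B = [[0.9, 0.1], [0.3, 0.7]]
states, outputs = sample_hmm(pi, A, B, T=9, rng=random.Random(0))
```

Replacing the transition step with an independent draw from `pi` at every step would give the IID (mixture model) case.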

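Graphical Model [Figure: the HMM drawn as a chain of slices. State nodes $Q_0, Q_1, Q_2, \ldots, Q_T$ are linked by the transition matrix $A$, $Q_0$ has initial distribution $\pi$, and each $Q_t$ has an output node $Y_t$ as its child.] The top node in each slice represents the multinomial state variable $Q_t$ and the bottom node represents the observable output variable $Y_t$.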

Graphical Model Conditioning on state $Q_t$ renders $Q_{t-1}$ and $Q_{t+1}$ independent. More generally, $Q_s$ is independent of $Q_u$ given $Q_t$, for $s < t < u$. The same holds for output nodes $Y_s$ and $Y_u$ when conditioned on the state node $Q_t$. Conditioning on an output node does not yield any conditional independence. Indeed, conditioning on all output nodes fails to induce any independencies among the state nodes.

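Parameterization The state transition matrix is $A$, where the $(i, j)$ entry of $A$ is defined as the transition probability $a_{ij} = P(q_{t+1}^j = 1 \mid q_t^i = 1)$; here $q_t^i = 1$ indicates that the state at time $t$ takes its $i$-th value. Each output node has a single state node as a parent, therefore we require the probability $P(y_t \mid q_t)$. For a particular configuration $(q, y) = (q_0, \ldots, q_T, y_0, \ldots, y_T)$, the joint probability is expressed as $P(q, y) = P(q_0) \prod_{t=0}^{T-1} P(q_{t+1} \mid q_t) \prod_{t=0}^{T} P(y_t \mid q_t)$.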

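Parameterization To introduce the $A$ and $\pi$ parameters into the joint probability equation, we re-write the transition matrix indices and the unconditional initial node distribution as $a_{q_t, q_{t+1}} = \prod_{i,j} [a_{ij}]^{q_t^i q_{t+1}^j}$ and $\pi_{q_0} = \prod_i [\pi_i]^{q_0^i}$. We get the joint probability as $P(q, y) = \pi_{q_0} \prod_{t=0}^{T-1} a_{q_t, q_{t+1}} \prod_{t=0}^{T} P(y_t \mid q_t)$.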
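As a small worked instance of the joint probability, the sketch below multiplies the initial, transition, and output factors for one configuration. The helper name `joint_probability` and the numeric parameters are assumptions made for illustration; states and outputs are encoded as integer indices rather than indicator vectors.

```python
def joint_probability(pi, A, B, q, y):
    """P(q, y) = pi[q_0] * prod_t A[q_t][q_{t+1}] * prod_t B[q_t][y_t]."""
    p = pi[q[0]]
    for t in range(len(q) - 1):
        p *= A[q[t]][q[t + 1]]       # transition factor a_{q_t, q_{t+1}}
    for state, out in zip(q, y):
        p *= B[state][out]           # output factor P(y_t | q_t)
    return p

# Illustrative (assumed) parameters; B[i][k] = P(y_t = k | q_t = i).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.2, 0.8]]
B = [[0.9, 0.1], [0.3, 0.7]]
p = joint_probability(pi, A, B, q=[0, 0, 1], y=[0, 0, 1])
# p = 0.6 * 0.7 * 0.3 * 0.9 * 0.9 * 0.7
```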

Inferencing The general inference problem is to compute the probability of the hidden states $q$ given an observed output sequence $y$. We may also want the marginal probability of a particular hidden state $q_t$ given the output sequence, or probabilities conditioned on partial output: filtering, $P(q_t \mid y_0, \ldots, y_t)$; prediction, $P(q_t \mid y_0, \ldots, y_s)$ with $s < t$; and smoothing, $P(q_t \mid y_0, \ldots, y_s)$ with $s > t$, where we calculate a posterior probability based on data up to and including a future time.

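Inference Let’s calculate $P(q \mid y)$, where $y = (y_0, \ldots, y_T)$ is the entire observable output. We can calculate $P(q \mid y) = P(q, y) / P(y)$. But to calculate $P(y)$, we need to sum the joint probability across all possible values of the hidden states: $P(y) = \sum_{q_0} \cdots \sum_{q_T} P(q, y)$. Each state can take $M$ possible values and we have $T$ state nodes, which implies that we must perform $M^T$ sums.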
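To make the cost of the naive sum concrete, here is a brute-force sketch that enumerates every hidden state sequence. The function name and parameter values are illustrative assumptions; the loop visits $M^n$ sequences for $n$ observations, so it is only usable for very short sequences.

```python
from itertools import product

def brute_force_likelihood(pi, A, B, y):
    """Sum P(q, y) over every hidden state sequence q (exponential cost)."""
    M = len(pi)
    total = 0.0
    for q in product(range(M), repeat=len(y)):
        p = pi[q[0]] * B[q[0]][y[0]]
        for t in range(1, len(y)):
            p *= A[q[t - 1]][q[t]] * B[q[t]][y[t]]
        total += p
    return total

# Illustrative (assumed) parameters; B[i][k] = P(y_t = k | q_t = i).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.2, 0.8]]
B = [[0.9, 0.1], [0.3, 0.7]]
likelihood = brute_force_likelihood(pi, A, B, y=[0, 0, 1])
```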

Inference Each factor involves only one or two state variables, so it is possible to move the sums inside the product in a systematic way. Moving the sums inside and forming a recursion reduces the computation significantly.
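Written out, pushing each sum as far to the right as possible gives the nested form below. This is a reconstruction under the parameterization used in this deck ($\pi_{q_0}$ for the initial distribution, $a_{q_t, q_{t+1}}$ for transitions), not a formula copied from the slide.

```latex
\[
P(y) = \sum_{q_0} \pi_{q_0} P(y_0 \mid q_0)
       \sum_{q_1} a_{q_0, q_1} P(y_1 \mid q_1)
       \cdots
       \sum_{q_T} a_{q_{T-1}, q_T} P(y_T \mid q_T)
\]
```

Each inner sum depends only on the neighboring state, so it can be tabulated once per state value and reused.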

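Inferencing Rather than computing $P(q \mid y)$, we focus on a particular state node $q_t$ and calculate $P(q_t \mid y)$. We take advantage of conditional independencies and Bayes rule: $P(q_t \mid y) = P(y \mid q_t) P(q_t) / P(y) = P(y_0, \ldots, y_t \mid q_t) P(y_{t+1}, \ldots, y_T \mid q_t) P(q_t) / P(y)$. [Figure: fragment of the HMM showing $Q_t$ and $Q_{t+1}$ linked by $A$, with output nodes $Y_t$ and $Y_{t+1}$.]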

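Inferencing $P(q_t \mid y) = \alpha(q_t) \beta(q_t) / P(y)$, where $\alpha(q_t) = P(y_0, \ldots, y_t, q_t)$ is the probability of emitting the partial sequence of outputs $y_0, \ldots, y_t$ and ending at state $q_t$, and $\beta(q_t) = P(y_{t+1}, \ldots, y_T \mid q_t)$ is the probability of emitting the partial sequence of outputs $y_{t+1}, \ldots, y_T$ starting at state $q_t$.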

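Inferencing The problem is reduced to finding $\alpha$ and $\beta$. We obtain a recursive relation between $\alpha(q_t)$ and $\alpha(q_{t+1})$: $\alpha(q_{t+1}) = \sum_{q_t} \alpha(q_t) a_{q_t, q_{t+1}} P(y_{t+1} \mid q_{t+1})$. The required time is $O(M^2 T)$ and the algorithm proceeds forward in time. Similarly, we obtain a recursive backward relation between $\beta(q_t)$ and $\beta(q_{t+1})$: $\beta(q_t) = \sum_{q_{t+1}} \beta(q_{t+1}) a_{q_t, q_{t+1}} P(y_{t+1} \mid q_{t+1})$. To compute the posterior probabilities for all states $q_t$, we must compute the alphas and betas for every step.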
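The two recursions can be sketched as follows. This is an unscaled illustration under assumed parameters, with hypothetical function names; on long sequences the products underflow, and a practical version would rescale each step or work in log space.

```python
def forward(pi, A, B, y):
    """alpha[t][i] = P(y_0..y_t, q_t = i), computed forward in time."""
    M = len(pi)
    alpha = [[pi[i] * B[i][y[0]] for i in range(M)]]
    for t in range(1, len(y)):
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(M)) * B[j][y[t]]
                      for j in range(M)])
    return alpha

def backward(A, B, y):
    """beta[t][i] = P(y_{t+1}..y_T | q_t = i), computed backward in time."""
    M = len(A)
    beta = [[1.0] * M]               # beta at the final step is 1
    for t in range(len(y) - 2, -1, -1):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][y[t + 1]] * nxt[j] for j in range(M))
                        for i in range(M)])
    return beta

def posteriors(pi, A, B, y):
    """Return gamma[t][i] = P(q_t = i | y) and the likelihood P(y)."""
    alpha, beta = forward(pi, A, B, y), backward(A, B, y)
    likelihood = sum(alpha[-1])
    gamma = [[a * b / likelihood for a, b in zip(a_t, b_t)]
             for a_t, b_t in zip(alpha, beta)]
    return gamma, likelihood

# Illustrative (assumed) parameters; B[i][k] = P(y_t = k | q_t = i).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.2, 0.8]]
B = [[0.9, 0.1], [0.3, 0.7]]
gamma, likelihood = posteriors(pi, A, B, y=[0, 0, 1])
```

Each step costs $M^2$ multiplications, which is where the $O(M^2 T)$ running time comes from.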

Alternate inference algorithm An alternative approach defines the backward phase as a recursion on the variable $\gamma(q_t) = P(q_t \mid y)$. The backward phase does not use $y_t$; only the forward phase does, so we can discard the data as we filter.

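Alternate Inference algorithm $\gamma(q_t) = \sum_{q_{t+1}} \frac{\alpha(q_t) a_{q_t, q_{t+1}}}{\sum_{q_t} \alpha(q_t) a_{q_t, q_{t+1}}} \gamma(q_{t+1})$. This recursion makes use of the $\alpha$ variables, and hence the $\alpha$ recursion must be completed before the $\gamma$ recursion. The data $y_t$ are not used in the $\gamma$ recursion; the $\alpha$ recursion has absorbed all the necessary likelihoods.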
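A sketch of the alpha-gamma variant is given below. The function names and parameters are assumptions for illustration, and the code assumes every state is reachable so that the normalizing denominators are nonzero. Note that `gamma_recursion` takes only the alphas and `A`; the observations are not passed in.

```python
def forward(pi, A, B, y):
    """alpha[t][i] = P(y_0..y_t, q_t = i)."""
    M = len(pi)
    alpha = [[pi[i] * B[i][y[0]] for i in range(M)]]
    for t in range(1, len(y)):
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(M)) * B[j][y[t]]
                      for j in range(M)])
    return alpha

def gamma_recursion(alpha, A):
    """gamma[t][i] = P(q_t = i | y), from the alphas and A only."""
    M = len(A)
    total = sum(alpha[-1])
    gamma = [[a / total for a in alpha[-1]]]    # final step: normalized alpha
    for t in range(len(alpha) - 2, -1, -1):
        # denom[j] = sum_k alpha_t(k) * a_{kj}
        denom = [sum(alpha[t][k] * A[k][j] for k in range(M)) for j in range(M)]
        nxt = gamma[0]
        gamma.insert(0, [sum(alpha[t][i] * A[i][j] / denom[j] * nxt[j]
                             for j in range(M))
                         for i in range(M)])
    return gamma

# Illustrative (assumed) parameters; B[i][k] = P(y_t = k | q_t = i).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.2, 0.8]]
B = [[0.9, 0.1], [0.3, 0.7]]
alpha = forward(pi, A, B, y=[0, 0, 1])
gamma = gamma_recursion(alpha, A)
```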

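Transition matrix The $\alpha$-$\beta$ or $\alpha$-$\gamma$ algorithm provides us with the posterior probability of each state. To estimate the state transition matrix $A$, we need the matrix of co-occurrence probabilities $P(q_t, q_{t+1} \mid y)$. We calculate $\xi(q_t, q_{t+1}) = P(q_t, q_{t+1} \mid y)$ from the alphas and betas: $\xi(q_t, q_{t+1}) = \alpha(q_t) a_{q_t, q_{t+1}} P(y_{t+1} \mid q_{t+1}) \beta(q_{t+1}) / P(y)$.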
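The co-occurrence probabilities can be computed with a sketch like the one below, which combines the unscaled alphas and betas. The function names and numeric parameters are illustrative assumptions.

```python
def forward(pi, A, B, y):
    """alpha[t][i] = P(y_0..y_t, q_t = i)."""
    M = len(pi)
    alpha = [[pi[i] * B[i][y[0]] for i in range(M)]]
    for t in range(1, len(y)):
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(M)) * B[j][y[t]]
                      for j in range(M)])
    return alpha

def backward(A, B, y):
    """beta[t][i] = P(y_{t+1}..y_T | q_t = i)."""
    M = len(A)
    beta = [[1.0] * M]
    for t in range(len(y) - 2, -1, -1):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][y[t + 1]] * nxt[j] for j in range(M))
                        for i in range(M)])
    return beta

def pairwise_posteriors(pi, A, B, y):
    """xi[t][i][j] = P(q_t = i, q_{t+1} = j | y) for t = 0..T-1."""
    M = len(pi)
    alpha, beta = forward(pi, A, B, y), backward(A, B, y)
    likelihood = sum(alpha[-1])
    return [[[alpha[t][i] * A[i][j] * B[j][y[t + 1]] * beta[t + 1][j] / likelihood
              for j in range(M)]
             for i in range(M)]
            for t in range(len(y) - 1)]

# Illustrative (assumed) parameters; B[i][k] = P(y_t = k | q_t = i).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.2, 0.8]]
B = [[0.9, 0.1], [0.3, 0.7]]
xi = pairwise_posteriors(pi, A, B, y=[0, 0, 1])
```

Each `xi[t]` is an $M \times M$ table that sums to one; summing it over its second index gives the single-state posterior $P(q_t \mid y)$.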

Junction Tree Connection We can calculate all the posterior probabilities for an HMM recursively. Given an observed sequence $y$, we run the $\alpha$ recursion forward in time. If we require the likelihood, we simply sum the alphas at the final time step. If we require posterior probabilities of the states, we use either the $\beta$ or the $\gamma$ recursion.

Junction tree connection The HMM is represented by the multinomial state variable $Q_t$ and the observable output variable $y_t$. The first state node is parameterized by the initial probability $\pi$ and each subsequent state node by the transition matrix $A$, where $a_{ij} = P(q_{t+1}^j = 1 \mid q_t^i = 1)$. The output nodes are assigned the local conditional probability $P(y_t \mid q_t)$. We assume that $y_t$ is a multinomial node, so that $P(y_t \mid q_t)$ can be viewed as a matrix $B$. To convert the HMM to a junction tree, we moralize, triangulate, and form the clique graph. Then we choose a maximal spanning tree, which forms our junction tree.

Junction tree connection [Figures: the moralized and triangulated graph; the junction tree for the HMM with potentials labeled.]

Junction tree Connection The initial probability $P(q_0)$ as well as the conditional probability $P(y_0 \mid q_0)$ are assigned to the potential $\Psi(q_0, y_0)$, which implies that this potential is initially set to $P(q_0, y_0)$. The state-to-state potentials are given the assignment $\Psi(q_t, q_{t+1}) = a_{q_t, q_{t+1}}$, the output probabilities $P(y_t \mid q_t)$ are assigned to the potentials $\Psi(q_t, y_t)$, and the separator potentials are initialized to one.

Unconditional Inference Let's do inference before any evidence is observed. We designate the clique $(q_{T-1}, q_T)$ as the root and collect to the root. Consider the first operation of passing a message upward from a clique $(q_t, y_t)$ for $t > 1$. The marginalization yields $\varsigma^*(q_t) = \sum_{y_t} \Psi(q_t, y_t) = \sum_{y_t} P(y_t \mid q_t) = 1$. Thus the separator potential remains set at one. This implies that the update factor is one, and thus the clique potential $\Psi(q_{t-1}, q_t)$ remains unchanged. In general, the messages passed upward from the leaves have no effect when no evidence is observed.

Unconditional inference Now consider the message from $(q_0, y_0)$ to $(q_0, q_1)$: $\Phi^*(q_0) = \sum_{y_0} P(q_0, y_0) = P(q_0)$ and $\Psi^*(q_0, q_1) = \Psi(q_0, q_1) \Phi^*(q_0) = a_{q_0, q_1} P(q_0) = P(q_0, q_1)$. This transformation propagates forward along the chain, changing the separator potentials on $q_t$ into marginals $P(q_t)$ and the clique potentials into marginals $P(q_t, q_{t+1})$. A subsequent DistributeEvidence pass will have no effect on the potentials along the backbone of the chain, but will convert the separator potentials $\varsigma(q_t)$ into marginals $P(q_t)$ and the potentials $\Psi(q_t, y_t)$ into marginals $P(q_t, y_t)$.

Unconditional inference Thus all potentials throughout the junction tree become marginal probabilities. This result helps to clarify the representation of the joint probability as the product of the clique potentials divided by the product of the separator potentials.
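For the HMM junction tree, this representation can be written out explicitly. The display below is a reconstruction that assumes the tree described in this deck: backbone cliques $(q_t, q_{t+1})$ separated by $q_t$, and one output clique $(q_t, y_t)$ per step attached through a separator on $q_t$.

```latex
\[
P(q, y) =
\frac{\prod_{t=0}^{T-1} P(q_t, q_{t+1}) \; \prod_{t=0}^{T} P(q_t, y_t)}
     {\prod_{t=1}^{T-1} P(q_t) \; \prod_{t=0}^{T} P(q_t)}
\]
```

The first ratio of products is the Markov chain $P(q_0, \ldots, q_T)$, and each remaining ratio $P(q_t, y_t) / P(q_t)$ is the output probability $P(y_t \mid q_t)$, which recovers the original factorization.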

Junction Tree Algorithm
1. Moralize if needed.
2. Triangulate using any triangulation algorithm.
3. Formulate the clique graph (clique nodes and separator nodes).
4. Compute the junction tree.
5. Initialize all separator potentials to 1.
6. Phase 1: Collect from children.
7. Phase 2: Distribute to children.
Message from child $C$: $\Phi^*(X_S) = \sum_{C \setminus S} \Psi(X_C)$
Update at parent $P$: $\Psi^*(X_P) = \Psi(X_P) \prod_S \Phi^*(X_S) / \Phi(X_S)$
Message from parent $P$: $\Phi^{**}(X_S) = \sum_{P \setminus S} \Psi^{**}(X_P)$
Update at child $C$: $\Psi^{**}(X_C) = \Psi^*(X_C) \, \Phi^{**}(X_S) / \Phi^*(X_S)$

Introducing evidence We now suppose that the outputs $y$ are observed and we wish to compute $P(y)$ as well as marginal posterior probabilities such as $P(q_t \mid y)$ and $P(q_t, q_{t+1} \mid y)$. Initialize the separator potentials to unity and recall that $\Psi(q_t, y_t)$ can be viewed as a matrix $B$, with columns labeled by the possible values of $y_t$. In practice we would set the separator potential $\varsigma^*(q_t) = P(y_t \mid q_t)$, the column of $B$ selected by the observed $y_t$. We designate $(Q_{T-1}, Q_T)$ as the root of the junction tree and collect to the root.

Collecting to the root Consider the update of clique $(Q_t, Q_{t+1})$ as shown. We assume that $\Phi^*(q_t)$ has already been updated and consider the computation of $\Psi^*(q_t, q_{t+1})$ and $\Phi^*(q_{t+1})$: $\Psi^*(q_t, q_{t+1}) = \Psi(q_t, q_{t+1}) \, \Phi^*(q_t) \, \varsigma^*(q_{t+1}) = a_{q_t, q_{t+1}} \, \Phi^*(q_t) \, P(y_{t+1} \mid q_{t+1})$.

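Collecting to the root Proceeding forward along the chain: $\Phi^*(q_{t+1}) = \sum_{q_t} \Psi^*(q_t, q_{t+1}) = \sum_{q_t} a_{q_t, q_{t+1}} \, \Phi^*(q_t) \, P(y_{t+1} \mid q_{t+1})$. Defining $\alpha(q_t) = \Phi^*(q_t)$, we have recovered the alpha algorithm. The collect phase of the algorithm terminates with the update of $\Psi(q_{T-1}, q_T)$. The updated potential will equal $P(y_0, \ldots, y_T, q_{T-1}, q_T)$, and thus by marginalization we get the likelihood $P(y)$.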
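The collect phase can be traced numerically with a sketch like the following, which keeps only the current backbone clique and separator potentials. The function name, variable names, and parameters are assumptions for illustration, and the sketch expects at least two observations.

```python
def collect_to_root(pi, A, B, y):
    """Collect evidence toward the root clique (q_{T-1}, q_T).

    Returns the updated root potential Psi*(q_{T-1}, q_T) and the last
    backbone separator Phi*(q_T). Requires len(y) >= 2.
    """
    M = len(pi)
    # Phi*(q_0): message from clique (q_0, y_0) with y_0 observed.
    phi = [pi[i] * B[i][y[0]] for i in range(M)]
    psi = None
    for t in range(len(y) - 1):
        # Message from the output clique with y_{t+1} observed: P(y_{t+1} | q_{t+1}).
        out_msg = [B[j][y[t + 1]] for j in range(M)]
        # Psi*(q_t, q_{t+1}) = a_{q_t, q_{t+1}} * Phi*(q_t) * P(y_{t+1} | q_{t+1})
        psi = [[A[i][j] * phi[i] * out_msg[j] for j in range(M)] for i in range(M)]
        # Phi*(q_{t+1}) = sum over q_t of Psi*(q_t, q_{t+1})
        phi = [sum(psi[i][j] for i in range(M)) for j in range(M)]
    return psi, phi

# Illustrative (assumed) parameters; B[i][k] = P(y_t = k | q_t = i).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.2, 0.8]]
B = [[0.9, 0.1], [0.3, 0.7]]
root_potential, last_separator = collect_to_root(pi, A, B, y=[0, 0, 1])
likelihood = sum(sum(row) for row in root_potential)
```

Here `last_separator` plays the role of $\alpha(q_T)$, so summing it gives the same likelihood as summing the root potential.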

Collecting to the root If, instead of designating $(q_{T-1}, q_T)$ as the root, we use $(q_0, q_1)$ as the root, we obtain the beta algorithm. It is not necessary to change the root of the junction tree to derive the beta algorithm, however. It also arises during the DistributeEvidence pass when $(q_{T-1}, q_T)$ is the root.

Distributing from root Now, in the second phase, we want to distribute evidence from the root $(q_{T-1}, q_T)$. This phase proceeds backwards along the state-state as well as the state-output cliques.

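Distribute from the root We suppose that the separator potential $\Phi^{**}(q_{t+1})$ has already been updated and consider the update of $\Psi^{**}(q_t, q_{t+1})$ and $\Phi^{**}(q_t)$: $\Psi^{**}(q_t, q_{t+1}) = \Psi^*(q_t, q_{t+1}) \, \Phi^{**}(q_{t+1}) / \Phi^*(q_{t+1})$ and $\Phi^{**}(q_t) = \sum_{q_{t+1}} \Psi^{**}(q_t, q_{t+1})$. Simplifying, we obtain the gamma recursion.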

Distribution from the root By rearranging and simplifying, we can also derive the relationship to the alpha-beta recursion.
