 # Ch-9: Markov Models Prepared by Qaiser Abbas ( )

Ch-9: Markov Models Prepared by Qaiser Abbas (07-0906)

Outline
1. Markov Models
2. Hidden Markov Models (HMM)
3. Three problems in HMM and their solutions

Credits and References
Materials used in this presentation are taken from the following textbooks and web resources:
1. "Foundations of Statistical Natural Language Processing" by Manning & Schütze. Chapter 9: "Markov Models"
2. "Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition" by D. Jurafsky and J.H. Martin (updated chapters are available on the author's website). Chapter 9: "Automatic Speech Recognition"
3. "Spoken Language Processing: A Guide to Theory, Algorithm, and System Development" by X. Huang, A. Acero, and H.W. Hon. Chapter 8: "Hidden Markov Models"; Chapter 12: "Basic Search Algorithms"
4. Dr. Andrew W. Moore, Carnegie Mellon University
5. Larry Rabiner's tutorial on HMMs

A Markov System
Has N states, called $s_1, s_2, \dots, s_N$.
There are discrete timesteps, t=0, t=1, …
On the t'th timestep the system is in exactly one of the available states. Call it $q_t$. Note: $q_t \in \{s_1, s_2, \dots, s_N\}$.
Between each timestep, the next state is chosen randomly.
The current state determines the probability distribution for the next state.

In the example diagram N = 3. The current state is $q_0 = s_3$ at t=0 and $q_1 = s_2$ at t=1. The transition probabilities shown on the arcs are:

From $s_1$: $P(q_{t+1}=s_1 \mid q_t=s_1) = 0$, $P(q_{t+1}=s_2 \mid q_t=s_1) = 0$, $P(q_{t+1}=s_3 \mid q_t=s_1) = 1$
From $s_2$: $P(q_{t+1}=s_1 \mid q_t=s_2) = 1/2$, $P(q_{t+1}=s_2 \mid q_t=s_2) = 1/2$, $P(q_{t+1}=s_3 \mid q_t=s_2) = 0$
From $s_3$: $P(q_{t+1}=s_1 \mid q_t=s_3) = 1/3$, $P(q_{t+1}=s_2 \mid q_t=s_3) = 2/3$, $P(q_{t+1}=s_3 \mid q_t=s_3) = 0$

Markov Property
$q_{t+1}$ is conditionally independent of $\{q_{t-1}, q_{t-2}, \dots, q_1, q_0\}$ given $q_t$. In other words:
$P(q_{t+1} = s_j \mid q_t = s_i) = P(q_{t+1} = s_j \mid q_t = s_i, \text{any earlier history})$
The sequence of states q is said to be a Markov chain, or to have the Markov property, if the next state depends only upon the current state and not on any past states.

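Transition Matrix
With $a_{ij} = P(q_{t+1} = s_j \mid q_t = s_i)$, the transition probabilities of the three-state system above form the matrix

$$A = \begin{pmatrix} 0 & 0 & 1 \\ 1/2 & 1/2 & 0 \\ 1/3 & 2/3 & 0 \end{pmatrix}$$

Each row sums to 1, because the system must move to some state at every timestep.
Question: What is the probability of a state sequence of …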
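The probability of a state path in a Markov chain is the probability of its first state times the product of the transition probabilities along the path. The sketch below illustrates this with the three-state system described above. The function name `sequence_probability` and the example path are hypothetical, chosen only for illustration, and the path is assumed to start in a known state (probability 1).

```python
from fractions import Fraction

# Transition probabilities of the three-state system above:
# A[i][j] = P(q_{t+1} = s_{j+1} | q_t = s_{i+1}), using 0-based indices.
A = [
    [Fraction(0), Fraction(0), Fraction(1)],        # from s1
    [Fraction(1, 2), Fraction(1, 2), Fraction(0)],  # from s2
    [Fraction(1, 3), Fraction(2, 3), Fraction(0)],  # from s3
]


def sequence_probability(states, transition, start_prob=Fraction(1)):
    """Probability of a state path, given the probability of its first state."""
    prob = start_prob
    for prev, nxt in zip(states, states[1:]):
        prob *= transition[prev][nxt]
    return prob


# Hypothetical path s3 -> s2 -> s2 -> s1 -> s3, assumed to start in s3.
path = [2, 1, 1, 0, 2]
p = sequence_probability(path, A)
print(p)  # 2/3 * 1/2 * 1/2 * 1 = 1/6
```

Exact fractions are used so that the result can be compared directly with a hand calculation.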

Example: A Simple Markov Model For Weather Prediction
On any given day, the weather can be described as being in one of three states:
State 1: snowy
State 2: cloudy
State 3: sunny
Transition matrix:

Question
Given that the weather on day 1 (t=1) is sunny (state 3), what is the probability that the weather for eight consecutive days is "sun-sun-sun-rain-rain-sun-cloudy-sun"?
Solution: O = (sun, sun, sun, rain, rain, sun, cloudy, sun)
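The solution follows from the Markov property: the probability of the sequence factors into the probability of the first state times one transition probability per step. The expansion below is a sketch that assumes "rain" in the question refers to state 1 (listed above as snowy) and writes $a_{ij}$ for the entry of the transition matrix from state i to state j. The numeric entries are not reproduced in this transcript, so the result is left in symbolic form.

```latex
\begin{aligned}
P(O \mid \text{Model})
  &= P(s_3, s_3, s_3, s_1, s_1, s_3, s_2, s_3 \mid \text{Model}) \\
  &= P(q_1 = s_3)\; a_{33}\, a_{33}\, a_{31}\, a_{11}\, a_{13}\, a_{32}\, a_{23} \\
  &= 1 \cdot a_{33}^{2}\, a_{31}\, a_{11}\, a_{13}\, a_{32}\, a_{23}
\end{aligned}
```

The leading factor is 1 because the question states that day 1 is sunny.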

From Markov To Hidden Markov
The previous model assumes that each state can be uniquely associated with an observable event. Once an observation is made, the state of the system is then trivially retrieved. This model, however, is too restrictive to be of practical use for most realistic problems.
To make the model more flexible, we assume that the outcomes or observations of the model are a probabilistic function of each state. Each state can produce a number of outputs according to a probability distribution, and each distinct output can potentially be generated at any state.
These are known as Hidden Markov Models (HMM), because the state sequence is not directly observable; it can only be approximated from the sequence of observations produced by the system.

Example: A Crazy Soft Drink Machine
Suppose you have a crazy soft drink machine: it can be in two states, cola preferring (CP) and iced tea preferring (IP), but it switches between them randomly after each purchase.
[Figure: state transition diagram of the machine; the states are not observable.]
[Table: output probability matrix.]
If, when you put in your coin, the machine always put out a cola when it was in the cola preferring state and an iced tea when it was in the iced tea preferring state, we would have a visible Markov model. But instead, it only has a tendency to do this.
Three possible outputs (observations): cola, iced tea, lemonade.

Question
What is the probability of seeing the output sequence {lem, ice_t} if the machine always starts off in the cola preferring state?
Solution: We need to consider all paths that might be taken through the HMM, and then sum over them. We know that the machine starts in state CP. There are then four possible paths that can produce the observations:
CP->CP->CP
CP->CP->IP
CP->IP->CP
CP->IP->IP
The total probability is the sum of the probabilities of these four paths, where each path probability is the product of the output and transition probabilities along it.
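The path enumeration can be sketched in a few lines of Python. The transition and output probabilities are not reproduced in this transcript, so the numbers below are an assumption: they are the values used for this example in Manning & Schütze, Chapter 9, where each symbol is emitted by the state the machine is leaving. The helper name `path_probability` is hypothetical.

```python
from itertools import product

STATES = ("CP", "IP")

# Assumed values (missing from this transcript), following the version of
# the example in Manning & Schütze, Chapter 9.
TRANSITION = {
    "CP": {"CP": 0.7, "IP": 0.3},
    "IP": {"CP": 0.5, "IP": 0.5},
}
OUTPUT = {
    "CP": {"cola": 0.6, "ice_t": 0.1, "lem": 0.3},
    "IP": {"cola": 0.1, "ice_t": 0.7, "lem": 0.2},
}


def path_probability(path, observations):
    """Joint probability of a state path and the observations.

    The symbol at step t is emitted by path[t], after which the machine
    moves to path[t + 1].
    """
    prob = 1.0
    for t, symbol in enumerate(observations):
        prob *= OUTPUT[path[t]][symbol] * TRANSITION[path[t]][path[t + 1]]
    return prob


observations = ("lem", "ice_t")
# All paths that start in CP: one extra state per observation.
paths = [("CP",) + rest for rest in product(STATES, repeat=len(observations))]
total = sum(path_probability(p, observations) for p in paths)

for p in paths:
    print("->".join(p), round(path_probability(p, observations), 4))
print("total:", round(total, 4))
```

Under these assumed values the four paths contribute 0.0147, 0.0063, 0.0315 and 0.0315, for a total of 0.084.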

A Crazy Soft Drink Machine (Continued)
[Figure: the hidden states and the observations (cola, iced tea, lemonade) they emit.]

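General Form of an HMM
An HMM is specified by a five-tuple $(S, O, \pi, A, B)$:
1) Set of hidden states $S = \{1, 2, \dots, N\}$. N: the number of states; $s_t$: the state at time t.
2) Set of observation symbols $O = \{o_1, o_2, \dots, o_M\}$. M: the number of observation symbols.
3) The initial state distribution $\pi = \{\pi_i\}$, with $\pi_i = P(s_1 = i)$, $1 \le i \le N$.
4) State transition probability distribution $A = \{a_{ij}\}$, with $a_{ij} = P(s_t = j \mid s_{t-1} = i)$, $1 \le i, j \le N$.
5) Observation symbol probability distribution in state i: $B = \{b_i(k)\}$, with $b_i(k) = P(X_t = o_k \mid s_t = i)$, $1 \le i \le N$, $1 \le k \le M$.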

General Form of an HMM (Continued)
To sum up, a complete specification of an HMM includes:
two constant-size parameters, N and M (the total number of states and the size of the observation alphabet);
three sets of probability distributions, $\Phi = (A, B, \pi)$.
Two assumptions:
1. Markov assumption: $P(s_t \mid s_1^{t-1}) = P(s_t \mid s_{t-1})$, where $s_1^{t-1}$ represents the state sequence $s_1, s_2, \dots, s_{t-1}$.
2. Output independence assumption: $P(X_t \mid X_1^{t-1}, s_1^{t}) = P(X_t \mid s_t)$, where $X_1^{t-1}$ represents the output sequence $X_1, X_2, \dots, X_{t-1}$.
The output-independence assumption states that the probability that a particular symbol is emitted at time t depends only on the state $s_t$ and is conditionally independent of the past observations.
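The specification above maps naturally onto a small data structure. The sketch below is one possible layout, not taken from any toolkit: the class name `HMM`, its fields and the `validate` method are hypothetical, B is stored as an N x M table with one row per state, and the numbers in the usage example are arbitrary.

```python
from dataclasses import dataclass
from math import isclose


@dataclass(frozen=True)
class HMM:
    """Hypothetical container for the parameters Phi = (A, B, pi)."""

    A: list   # A[i][j] = P(s_t = j | s_{t-1} = i), N x N
    B: list   # B[i][k] = P(X_t = o_k | s_t = i), N x M
    pi: list  # pi[i] = P(s_1 = i), length N

    @property
    def N(self):
        """Number of hidden states."""
        return len(self.pi)

    @property
    def M(self):
        """Number of observation symbols."""
        return len(self.B[0])

    def validate(self):
        """Check the shapes and that every distribution sums to 1."""
        if len(self.A) != self.N or len(self.B) != self.N:
            raise ValueError("A and B need one row per state")
        if any(len(row) != self.N for row in self.A):
            raise ValueError("A must be N x N")
        if any(len(row) != self.M for row in self.B):
            raise ValueError("B must be N x M")
        for dist in [self.pi, *self.A, *self.B]:
            if min(dist) < 0 or not isclose(sum(dist), 1.0):
                raise ValueError(f"not a probability distribution: {dist}")
        return self


# Arbitrary illustrative numbers: 2 states, 3 observation symbols.
model = HMM(
    A=[[0.9, 0.1], [0.2, 0.8]],
    B=[[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]],
    pi=[0.6, 0.4],
).validate()
print(model.N, model.M)
```

Validating once at construction time keeps later algorithms free of repeated sanity checks.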

Three Basic Problems in HMM
1. The Evaluation Problem: Given a model $\Phi$ and a sequence of observations $X = (X_1, X_2, \dots, X_T)$, what is the probability $P(X \mid \Phi)$, i.e., the probability that the model generates the observations? How to evaluate an HMM? The forward algorithm.
2. The Decoding Problem: Given a model $\Phi$ and a sequence of observations $X = (X_1, X_2, \dots, X_T)$, what is the most likely state sequence $S = (s_1, s_2, \dots, s_T)$ in the model that produces the observations? How to decode an HMM? The Viterbi algorithm.
3. The Learning Problem: Given a model $\Phi$ and a set of observations, how can we adjust the model parameters to maximize the joint probability $\prod_X P(X \mid \Phi)$? How to train an HMM? The Baum-Welch algorithm.

If we could solve the evaluation problem, we would have a way of evaluating how well a given HMM matches a given observation sequence. When we are trying to choose among several competing models, the solution to Problem 1 allows us to choose the model that best matches the observations.
Problem 2 is the one in which we attempt to uncover the hidden part of the model, i.e., to find the "correct" state sequence. Typical uses might be to learn about the structure of the model, to find optimal state sequences for continuous speech recognition, or to get average statistics of individual states.
Problem 3 is the one in which we attempt to optimize the model parameters so as to best describe how a given observation sequence comes about. The observation sequence used to adjust the model parameters is called a training sequence, since it is used to "train" the HMM.

How to Evaluate an HMM- A Straightforward Method
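To calculate the probability (likelihood) $P(X \mid \Phi)$ of the observation sequence $X = (X_1, X_2, \dots, X_T)$, given the HMM $\Phi$, the most intuitive way is to sum up the probabilities of all possible state sequences:

$$P(X \mid \Phi) = \sum_{\text{all } S} P(S \mid \Phi)\, P(X \mid S, \Phi)$$

Applying the Markov assumption:

$$P(S \mid \Phi) = \pi_{s_1}\, a_{s_1 s_2}\, a_{s_2 s_3} \cdots a_{s_{T-1} s_T}$$

Applying the output independence assumption:

$$P(X \mid S, \Phi) = b_{s_1}(X_1)\, b_{s_2}(X_2) \cdots b_{s_T}(X_T)$$

In other words, to compute $P(X \mid \Phi)$, we first enumerate all possible state sequences S of length T that generate the observation sequence X, and then sum all the probabilities.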

How to Evaluate an HMM- A Straightforward Method (complexity)
For any given state sequence, we start from the initial state $s_1$ with probability $\pi_{s_1}$ (or $a_{s_0 s_1}$). We take a transition from $s_{t-1}$ to $s_t$ with probability $a_{s_{t-1} s_t}$ and generate the observation $X_t$ with probability $b_{s_t}(X_t)$, until we reach the last transition.
There are $N^T$ possible state sequences, and each one requires $2T-1$ multiplications, so the method needs $(2T-1)N^T$ multiplications and $N^T - 1$ additions. Total calculations: on the order of $2T \cdot N^T$. For N=5, T=100, it needs about $2 \cdot 100 \cdot 5^{100} \approx 10^{72}$ computations.

How to Evaluate an HMM- The Forward Algorithm
Define the forward probability: $\alpha_t(i) = P(X_1^t, s_t = i \mid \Phi)$ is the probability that the HMM is in state i at time t, having generated the partial observation $X_1^t = (X_1, X_2, \dots, X_t)$.
At t=0, the cells contain exactly the initial probabilities.
The computation is done in a time-synchronous fashion from left to right, where each cell for time t is completely computed before proceeding to time t+1.
When the states in the last column have been computed, the sum of all probabilities in the final column is the probability of generating the observation sequence.
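The left-to-right computation can be sketched as follows, with a brute-force sum over all state sequences included as a cross-check. The sketch assumes the common convention $\alpha_1(i) = \pi_i b_i(X_1)$ and $\alpha_t(j) = \big[\sum_i \alpha_{t-1}(i)\, a_{ij}\big] b_j(X_t)$, with observations given as symbol indices. The function names and the toy parameters are hypothetical.

```python
from itertools import product


def brute_force_likelihood(A, B, pi, obs):
    """Sum P(S) * P(X | S) over all N**T state sequences S."""
    N, T = len(pi), len(obs)
    total = 0.0
    for S in product(range(N), repeat=T):
        p = pi[S[0]] * B[S[0]][obs[0]]
        for t in range(1, T):
            p *= A[S[t - 1]][S[t]] * B[S[t]][obs[t]]
        total += p
    return total


def forward_likelihood(A, B, pi, obs):
    """Forward algorithm: `alpha` holds one trellis column at a time."""
    N = len(pi)
    # First column: initial probability times output probability.
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    # Each later column is computed completely from the previous one.
    for x in obs[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(N)) * B[j][x]
            for j in range(N)
        ]
    # The sum of the final column is P(X | model).
    return sum(alpha)


# Toy parameters (assumed for illustration): 2 states, 3 symbols.
A = [[0.7, 0.3], [0.5, 0.5]]
B = [[0.6, 0.1, 0.3], [0.1, 0.7, 0.2]]
pi = [1.0, 0.0]
obs = [2, 1, 0, 0, 1]

print(forward_likelihood(A, B, pi, obs))
print(brute_force_likelihood(A, B, pi, obs))
```

Both functions return the same likelihood up to floating-point rounding; only the forward version stays practical as T grows.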

How to Evaluate an HMM- The Forward Algorithm
It needs exactly $N(N+1)(T-1)+N$ multiplications and $N(N-1)(T-1)$ additions, so the complexity of this algorithm is $O(N^2 T)$. For N=5, T=100, we need about 3000 computations for the forward algorithm, versus about $10^{72}$ computations for the straightforward method.
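The two operation counts can be checked with a few lines of arithmetic. The forward counts are the formulas given above; the brute-force counts, $(2T-1)N^T$ multiplications and $N^T-1$ additions, are the figures given in Rabiner's tutorial. The function names are hypothetical.

```python
def forward_cost(N, T):
    """(multiplications, additions) for the forward algorithm."""
    mults = N * (N + 1) * (T - 1) + N
    adds = N * (N - 1) * (T - 1)
    return mults, adds


def brute_force_cost(N, T):
    """(multiplications, additions) for summing over all N**T sequences."""
    mults = (2 * T - 1) * N ** T
    adds = N ** T - 1
    return mults, adds


N, T = 5, 100
fwd_mults, fwd_adds = forward_cost(N, T)
brute_total = sum(brute_force_cost(N, T))

print(fwd_mults, fwd_adds)        # 2975 multiplications, 1980 additions
print(f"{float(brute_total):.2e}")  # on the order of 1e72
```

Python integers have arbitrary precision, so the brute-force count is computed exactly before being formatted.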

How to Decode an HMM- The Viterbi Algorithm
Instead of summing up probabilities from different paths coming to the same destination state, the Viterbi algorithm picks and remembers the best path.
Define the best-path probability: $V_t(i) = P(X_1^t, S_1^{t-1}, s_t = i \mid \Phi)$ is the probability of the most likely state sequence at time t that has generated the observations $X_1^t$ (until time t) and ends in state i.

How to Decode an HMM- The Viterbi Algorithm
The computation is done in a time-synchronous fashion from left to right. The complexity is also $O(N^2 T)$.
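A minimal sketch of the Viterbi recursion, assuming $V_1(i) = \pi_i b_i(X_1)$ and $V_t(j) = \max_i \big[V_{t-1}(i)\, a_{ij}\big] b_j(X_t)$, with a backpointer stored for every cell so the best path can be read off at the end. The function name and the toy parameters are hypothetical, and observations are given as symbol indices.

```python
def viterbi(A, B, pi, obs):
    """Return (best_state_path, its_probability) for a list of symbol indices."""
    N = len(pi)
    V = [pi[i] * B[i][obs[0]] for i in range(N)]  # first trellis column
    backpointers = []
    for x in obs[1:]:
        prev = V
        # For each destination state j, remember the best predecessor.
        best_prev = [
            max(range(N), key=lambda i: prev[i] * A[i][j]) for j in range(N)
        ]
        V = [prev[best_prev[j]] * A[best_prev[j]][j] * B[j][x] for j in range(N)]
        backpointers.append(best_prev)
    # Pick the best final state, then follow the backpointers right to left.
    last = max(range(N), key=lambda i: V[i])
    path = [last]
    for best_prev in reversed(backpointers):
        path.append(best_prev[path[-1]])
    path.reverse()
    return path, V[last]


# Toy parameters (assumed for illustration): 2 states, 3 symbols.
A = [[0.7, 0.3], [0.5, 0.5]]
B = [[0.6, 0.1, 0.3], [0.1, 0.7, 0.2]]
pi = [1.0, 0.0]

path, prob = viterbi(A, B, pi, [2, 1])
print(path, prob)  # best path is state 0 then state 1
```

The only change from the forward recursion is that the sum over predecessor states is replaced by a maximum, which is why the complexity is the same.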

HMM Training Using Baum-Welch Algorithm
A Hidden Markov Model is a probabilistic model of the joint probability of a collection of random variables $\{O_1, \dots, O_T, Q_1, \dots, Q_T\}$. The $O_t$ variables are discrete observations and the $Q_t$ variables are "hidden" discrete states. Under an HMM, there are two conditional independence assumptions:
1. The t-th hidden variable, given the (t-1)-st hidden variable, is independent of previous variables: $P(Q_t \mid Q_{t-1}, O_{t-1}, \dots, Q_1, O_1) = P(Q_t \mid Q_{t-1})$.
2. The t-th observation depends only on the t-th state: $P(O_t \mid Q_T, O_T, \dots, Q_1, O_1) = P(O_t \mid Q_t)$, where the conditioning variables exclude $O_t$ itself.

The Baum-Welch algorithm is an EM algorithm for finding the maximum likelihood estimate (MLE) of the parameters of an HMM given a set of observed feature vectors.

$Q_t$ is a discrete random variable with N possible values $\{1, \dots, N\}$. We further assume that the underlying "hidden" Markov chain defined by $P(Q_t \mid Q_{t-1})$ is time-homogeneous (i.e., independent of the time t). Therefore, we can represent $P(Q_t \mid Q_{t-1})$ as a time-independent stochastic transition matrix $A = \{a_{ij}\}$ with $a_{ij} = P(Q_t = j \mid Q_{t-1} = i)$. The special case of time t=1 is described by the initial state distribution $\pi_i = P(Q_1 = i)$. We say that we are in state j at time t if $Q_t = j$. A particular sequence of states is described by $q = (q_1, \dots, q_T)$, where $q_t \in \{1, \dots, N\}$ is the state at time t.

The observation is one of L possible observation symbols, $O_t \in \{o_1, \dots, o_L\}$. The probability of a particular observation at time t for state j is described by $b_j(o_t) = P(O_t = o_t \mid Q_t = j)$ ($B = \{b_{ij}\}$ is an L by N matrix). A particular observation sequence O is described as $O = (O_1 = o_1, \dots, O_T = o_T)$.

Training is by far the most difficult of the three problems, because there is no known analytical method that maximizes the joint probability of the training data in closed form.

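Therefore, we can describe an HMM by $\lambda = (A, B, \pi)$. Given an observation sequence O, the Baum-Welch algorithm finds

$$\lambda^* = \arg\max_{\lambda} P(O \mid \lambda)$$

that is, the HMM $\lambda$ that maximizes the probability of the observation O.

The Baum-Welch algorithm
Initialization: set $\lambda = (A, B, \pi)$ with random initial conditions. The algorithm updates the parameters of $\lambda$ iteratively until convergence, following the procedure below.

The forward procedure: We define $\alpha_i(t) = P(O_1 = o_1, \dots, O_t = o_t, Q_t = i \mid \lambda)$, which is the probability of seeing the partial sequence $o_1, \dots, o_t$ and ending up in state i at time t. We can efficiently calculate $\alpha_i(t)$ recursively as:

$$\alpha_i(1) = \pi_i\, b_i(o_1), \qquad \alpha_j(t+1) = b_j(o_{t+1}) \sum_{i=1}^{N} \alpha_i(t)\, a_{ij}$$

The backward procedure: We define $\beta_i(t) = P(O_{t+1} = o_{t+1}, \dots, O_T = o_T \mid Q_t = i, \lambda)$. This is the probability of the ending partial sequence $o_{t+1}, \dots, o_T$ given that we started at state i, at time t. We can efficiently calculate $\beta_i(t)$ as:

$$\beta_i(T) = 1, \qquad \beta_i(t) = \sum_{j=1}^{N} \beta_j(t+1)\, a_{ij}\, b_j(o_{t+1})$$

Using $\alpha$ and $\beta$, we can calculate the following variables:

$$\gamma_i(t) = P(Q_t = i \mid O, \lambda) = \frac{\alpha_i(t)\, \beta_i(t)}{\sum_{j=1}^{N} \alpha_j(t)\, \beta_j(t)}$$

$$\xi_{ij}(t) = P(Q_t = i, Q_{t+1} = j \mid O, \lambda) = \frac{\alpha_i(t)\, a_{ij}\, \beta_j(t+1)\, b_j(o_{t+1})}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i(t)\, a_{ij}\, \beta_j(t+1)\, b_j(o_{t+1})}$$

Having $\gamma$ and $\xi$, one can define update rules as follows:

$$\bar{\pi}_i = \gamma_i(1), \qquad \bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_{ij}(t)}{\sum_{t=1}^{T-1} \gamma_i(t)}, \qquad \bar{b}_i(k) = \frac{\sum_{t=1}^{T} \mathbb{1}[O_t = o_k]\, \gamma_i(t)}{\sum_{t=1}^{T} \gamma_i(t)}$$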
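One Baum-Welch iteration can be sketched in plain Python using the standard forward-backward quantities $\alpha$, $\beta$, $\gamma$ and $\xi$. This is a minimal illustration under several assumptions: a single observation sequence of symbol indices, B stored with one row per state (N x L rather than L x N), no scaling against numerical underflow, and strictly positive starting parameters so that no denominator is zero. All function names and numbers are hypothetical.

```python
def forward(A, B, pi, obs):
    """alpha[t][i]: probability of obs[0..t] and being in state i at step t."""
    N, T = len(pi), len(obs)
    alpha = [[0.0] * N for _ in range(T)]
    for i in range(N):
        alpha[0][i] = pi[i] * B[i][obs[0]]
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = B[j][obs[t]] * sum(
                alpha[t - 1][i] * A[i][j] for i in range(N))
    return alpha


def backward(A, B, obs):
    """beta[t][i]: probability of obs[t+1..] given state i at step t."""
    N, T = len(A), len(obs)
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(
                A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(N))
    return beta


def baum_welch_step(A, B, pi, obs):
    """One re-estimation step; returns (new_A, new_B, new_pi, likelihood)."""
    N, M, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward(A, B, pi, obs), backward(A, B, obs)
    likelihood = sum(alpha[T - 1])  # P(O | current parameters)
    gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(N)]
             for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / likelihood
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1))
              / sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    new_B = [[sum(gamma[t][i] for t in range(T) if obs[t] == k)
              / sum(gamma[t][i] for t in range(T))
              for k in range(M)] for i in range(N)]
    return new_A, new_B, new_pi, likelihood


# Arbitrary starting point: 2 states, 3 symbols, one short training sequence.
A = [[0.6, 0.4], [0.3, 0.7]]
B = [[0.5, 0.3, 0.2], [0.1, 0.4, 0.5]]
pi = [0.5, 0.5]
obs = [0, 1, 1, 2, 0, 1]

params = (A, B, pi)
for iteration in range(5):
    *params, likelihood = baum_welch_step(*params, obs)
    print(iteration, likelihood)
```

Because Baum-Welch is an EM procedure, the printed likelihood should not decrease from one iteration to the next. A practical implementation would add scaling or log-space arithmetic and support for multiple training sequences.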

Toolkits for HMM
Hidden Markov Model Toolkit (HTK)
Hidden Markov Model (HMM) Toolbox for Matlab