# Hidden Markov Models in NLP

## Presentation on theme: "Hidden Markov Models in NLP"— Presentation transcript:

Hidden Markov Models in NLP
Leah Spontaneo

Overview Introduction Discrete Markov Processes Hidden Markov Model
Deterministic Model Statistical Model Speech Recognition Discrete Markov Processes Hidden Markov Model Elements of HMMs Output Sequence Basic HMM Problems HMM Operations Forward-Backward Viterbi Algorithm Baum-Welch Different Types of HMMs Speech Recognition using HMMs Isolated Word Recognition Limitations of HMMs

Introduction Real-world processes generally produce observable outputs characterized in signals Signals can be discrete or continuous Signal source can be stationary or nonstationary They can be pure or corrupted

Introduction Signals are characterized in signal models
Process the signal to provide desired output Learn about the signal source without the source available Work well in practice Signal models can be deterministic or statistical

Deterministic Model Exploit some known specific properties of the signal Sine wave Sum of exponentials Chaos theory Specification of the signal generally straight-forward Determine/estimate the values of the parameters of the signal model

Statistical Model Statistical models try to characterize the statistical properties of the signal Gaussian processes Markov processes Hidden Markov processes Signal characterized as a parametric random process Parameters of the stochastic process can be determine/estimated in a precise, well-defined manner

Speech Recognition Basic theory of hidden Markov models in speech recognition originally pulished in 1960s by Baum and collegues Implemented in speech processing applications in 1970s be Baker at CMU and by Jelinek at IBM

Discrete Markov Processes
Contains a set of N distinct states: S1, S2,..., SN At discrete time intervals the state changes and the state at time t is qt For the first-order Markov chain, the probabilistic description is P[qt = Sj| qt -1= Si, qt -2= Sk,...] = P[qt = Sj| qt -1= Si]

Discrete Markov Processes
Processes considered are those independent of time leading to transition probabilities aij = P[qt = Sj| qt -1= Si] 1 ≤ i, j ≤ N aij ≥ 0 Σ aij = 1 The revious stochastic process is considered an observable Markov model The output process is the set of states at each time interval and each state corresponds to an observable event

Hidden Markov Model Markov process decides future probabilities based on recent values A hidden Markov model is a Markov process with an unobservable state HMMs must have 3 sets of probabilities: Initial probabilities Transition probabilities Emission probabilities -Genome includes unobserved states (CGI or baseline), HMMs are a natural model to consider -States are inferred from the base-to-base transitions observed along the genome

Hidden Markov Model Includes the case where the observation is a probabilistic function of the state A doubly embedded stochastic process with an underlying unobservable stochastic process Unobservable process only observed through a set of stochastic processes producing the observations

Elements of HMMs N, number of states in the model
Although hidden, there is physical significance attached to the states of the model Individual states are denoted as S = {S1, S2,..., SN} State at time t is denoted qt M, number of distinct observation symbols for each state

Elements of HMMs Observation symbols correspond to the output of the system modeled Individual symbols are denoted V = {v1, v2,..., vM} State transition probability distribution A = {aij} aij = P[qt = Sj| qt -1= Si] 1 ≤ i, j ≤ N The special case where any state can reach any other, aij > 0 for all i, j

Elements of HMMs The observation probability distribution in state j, B = {bj(k)} bj(k) = P[vk at t|qt = Sj] 1 ≤ j ≤ N, 1 ≤ k ≤ M The initial state distribution π = {πi} πi = P[q1 = Si] 1 ≤ i ≤ N With the right values for N, M, A, B, and π the HMM can generate and output sequence O = O1O2...OT where each Ot is an observation in V and T is the total number of observations in the sequence

Output Sequence 1) Choose an initial state q1 = Si according to π
3) Get Ot = vk based on the emission probability for Si, bi(k) 4) Transition to new state qt+1 = Sj based on transition probability for Si, aij 5) t = t + 1 and go back to #3 if t < T; otherwise end sequence Successfully used for acoustic modeling in speech recognition Applied to language modeling and POS tagging

Output Sequence The procedure can be used to generate a sequence of observations and to model how an observation sequence was produced by and HMM The cost of determining the probability that the system is in state Si at time t is O(tN2)

Basic HMM Problems HMMs can find the state sequence that most likely produces a given output The sequence of states is most efficiently computed using the Viterbi algorithm Maximum likelihood estimates of the probability sets are determined using the Baum-Welch algorithm -HMMs usually associated with three main problems: compute probability that a particular sequence is associated with a given model; find the state sequence that would most likely produce a given output sequence; determine the maximum likelihood estimate of the probabilities based on a given output sequence

HMM Operations Calculating P(qt = Si|O1O2...Ot) uses the forward- backward algorithm Computing Q* = argmaxQ P(Q|O) requires the Viterbi algorithm Learning λ* = argmaxλ P(O|λ) using the Baum-Welch algorithm The complexity for the three algorithms is O(TN2) where T is the time taken and N is the number of states

Evaluation Scores how well a given model matches an observation sequence Extremely useful in considering which model, among many, best represents the set of observations

Forward-Backward Given observations O1O2...OT
αt(i) = P(O1O2...Ot ^ qt = Si| λ) is the probability that given the first t observations, we end up in state Si on visit t α1(i) = b(O1) πi α t+1(j) = Σ aijbi(Ot+1) αt(i) We can now cheaply compute αt(i) = P(O1O2...Ot ^ qt = Si) P(O1O2...Ot) = Σ αt(i) P(qt = Si| O1O2...Ot) = αt(i)/ Σ αt(j)

Forward-Backward The key is since there are only N states, all possible state sequences will merge to the N nodes At t = 1, only the values of α1(i), 1 ≤ i ≤ N require calculation When t = 2, 3,..., T we calculate αt(j), 1 ≤ j ≤ N where each calculation involves N previous values of αt-1(i) only

State Sequence There is no ‘correct’ state sequence except for degenerate models Optimality criterion is used instead to determine the best possible outcome There are several reasonable optimality criteria and thus, the chosen criterion depends on the use of the uncovered sequence Used for continuous speech recognition

Viterbi Algorithm Finds the most likely sequence of hidden states based on known observations using dynamic programming Makes three assumptions about the model: The model must be a state machine Transitions between states are marked by a metric Events must be cumulative over the path Path history must be kept in memory to find the best probable path in the end -Similar to the forward algorithm which computes the probability that a set of observed events was generated by the model -Designed in 1967 by Andrew Viterbi to decode convolutional codes within noise of digital comm. links -Must contain finite number of states and at any given point in time be in one state -Time computed over each event that passes -All possible previous paths added to new state metric based on observation and best path is then chosen

Viterbi Algorithm Used for speech recognition where the hidden state is part of word formation Given a specific signal, it would deduce the most probable word based on the model To find the best state sequence, Q = {q1, q2,..., qT} for the observation sequence O = {O1O2...OT} we define

Viterbi Algorithm δt(i) is the best score along a single path at time t accounting for the first t observations ending in state Si The inductive step is

Viterbi Algorithm Similar to forward calculation of the forward- backward algorithm The major difference is the maximization over the previous states instead of summing the probabilities

Optimizing Parameters
A training sequence is used to train the HMM and adjust the model parameters Training problem is crucial for most HMM applications Allows us to optimally adapt model parameters to observed training data creating good models for real data

Baum-Welch A special case of expectation maximization
EM has two main steps: Devise the expectation of the log-likelihood using current estimates of latent variables Compute the maximized log-likelihood using values from the first step Baum-Welch is a form of generalized EM which allows the algorithm to converge to a local optimum -Parameters need to be trained on a sequence of symbols to determine the final probabilities the model will use to find the Viterbi path

Baum-Welch Also known as the forward-backwards algorithm
Baum-Welch uses two main steps: The forward and backward probability is calculated for each state in the HMM The transition and emission probabilities are determined and divided by the probability of the whole model based on the previous step -It can produce maximum likelihood and posterior estimates for all parameters given only initial emission probabilities to work with -Starts by assigning initial probabilities to all parameters, then continues until convergence happens by adjusting probabilities of each parameter in accordance with the training set being scanned

Baum-Welch Cons: lots of local minima
Pros: local minima are often adequate models of the data EM requires the number of states to be given Sometimes HMMs require some links to be zero. For this aij = 0 in the initial estimate λ(0)

Different Types of HMM Left-right model
As time increases the state index increases or stays the same No transitions are allowed between states whose indecies are lower than the current state Cross-coupled two parallel left-right Obeys the left-right constraints on transition probabilities but provide more flexibility

Speech Recognition using HMMs
Feature Analysis: A spectral/temporal analysis of speech signals can be performed to provide observation vectors used to train HMMs Unit Matching: Each unit is characterized by an HMM with parameters estimated from speech data Provides the likelihoods of matches of all sequences of speech recognition units to the unknown input

Isolated Word Recognition
Vocabulary of V words to be recognized and each modeled by a distinct HMM For each word, a training set of K occurrences of each spoken word where each occurrence of the word is an observation sequence For each word v, build an HMM (estimating the model parameters which optimize the likelihood of the training set observations for each word)

Isolated Word Recognition
For each unknown word recognized, measure the observation sequence O = {O1O2...OT} via feature analysis of the speech corresponding to the word Calculate model likelihoods for all possible models followed by the selection of the word with the highest model likelihood

Isolated Word Recognition

Limitations of HMMs Assumption that successive observations are independent and thus, the probability of an observation sequence can be written as the product of the probabilities of individual observations Assumes the distributions of individual observation parameters are well represented as a mixture of autoregressive or Gaussian densities Assumption of being in a given state at each time interval inappropriate for speech sound which can extend through several states

Questions?

References Vogel, S., et al. “HMM-based word alignment in statistical translation.” In Proceedings of the 16th conference on Computational linguistics (1996), pp   Rabiner, L. R. “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition.” Proceedings of the IEEE (1989), pp Moore, A. W. “Hidden Markov Models.” Carnegie Mellon University. 11/F/6390/_media/hmm14.pdf