Hidden Markov Models & POS Tagging. Corpora and Statistical Methods, Lecture 9.


1 Hidden Markov Models & POS Tagging. Corpora and Statistical Methods, Lecture 9

2 Acknowledgement. Some of the diagrams are from slides by David Bley (available on the companion website to Manning and Schütze 1999).

3 Formalisation of a Hidden Markov model Part 1

4 Crucial ingredients (familiar)
- Underlying states: S = {s_1, ..., s_N}
- Output alphabet (observations): K = {k_1, ..., k_M}
- State transition probabilities: A = {a_ij}, i, j ∈ S
- State sequence: X = (X_1, ..., X_T+1), plus a function mapping each X_t to a state s
- Output sequence: O = (o_1, ..., o_T), where each o_t ∈ K

5 Crucial ingredients (additional)
- Initial state probabilities: Π = {π_i}, i ∈ S (the initial probability of each state)
- Symbol emission probabilities: B = {b_ijk}, i, j ∈ S, k ∈ K (the probability b of seeing observation o_t = k at time t, given that X_t = s_i and X_t+1 = s_j)
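The slides do not give a concrete representation, so here is a minimal sketch in Python of the model μ = (A, B, Π) under the arc-emission convention of slides 4-5; every number and array layout below is an illustrative assumption, not something from the lecture.

import numpy as np

# Toy arc-emission HMM mu = (A, B, Pi) with N = 2 states and M = 3 output symbols.
# All values are invented for illustration only.
N, M = 2, 3
Pi = np.array([0.7, 0.3])                  # Pi[i]      = pi_i  = P(X_1 = s_i)
A  = np.array([[0.6, 0.4],
               [0.5, 0.5]])                # A[i, j]    = a_ij  = P(X_{t+1} = s_j | X_t = s_i)
B  = np.random.dirichlet(np.ones(M), size=(N, N))
                                           # B[i, j, k] = b_ijk = P(O_t = k | X_t = s_i, X_{t+1} = s_j)
assert np.allclose(A.sum(axis=1), 1.0) and np.allclose(B.sum(axis=2), 1.0)

The same Pi, A, B layout is reused in the sketches further down.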

6 Trellis diagram of an HMM [Figure: states s_1, s_2, s_3 connected by transitions labelled a_1,1, a_1,2, a_1,3]

7 Trellis diagram of an HMM [Figure: the same states, now with an observation sequence o_1, o_2, o_3 aligned with times t_1, t_2, t_3]

8 Trellis diagram of an HMM [Figure: as above, with emission probabilities added, e.g. b_1,1,k=o2, b_1,1,k=o3, b_1,2,k=o2, b_1,3,k=o2]

9 The fundamental questions for HMMs
1. Given a model μ = (A, B, Π), how do we compute the likelihood of an observation, P(O|μ)?
2. Given an observation sequence O and a model μ, which is the state sequence (X_1, ..., X_T+1) that best explains the observations? This is the decoding problem.
3. Given an observation sequence O and a space of possible models μ = (A, B, Π), which model best explains the observed data?

10 Application of question 1 (ASR)
Given a model μ = (A, B, Π), how do we compute the likelihood of an observation, P(O|μ)?
- The input of an ASR system is a continuous stream of sound waves, which is ambiguous
- We need to decode it into a sequence of phones:
  - is the input the sequence [n iy d] or [n iy]?
  - which sequence is the most probable?

11 Application of question 2 (POS tagging)
Given an observation sequence O and a model μ, which is the state sequence (X_1, ..., X_T+1) that best explains the observations? This is the decoding problem.
- Consider a POS tagger
- Input observation sequence: I can read
- We need to find the most likely sequence of underlying POS tags:
  - e.g. is can a modal verb, or a noun?
  - how likely is it that can is a noun, given that the previous word is a pronoun?

12 Finding the probability of an observation sequence

13 Example problem: ASR
- Assume that the input contains the word need
- The input stream is ambiguous (there is noise, individual variation in speech, etc.)
- Possible sequences of observations:
  - [n iy] (knee)
  - [n iy d] (need)
  - [n iy t] (neat)
  - ...
- States: underlying sequences of phones giving rise to the input observations, with transition probabilities
  - assume we have state sequences for need, knee, new, neat, ...

14 Formulating the problem
- The probability of an observation sequence is logically an OR problem: the model gives us state transitions underlying several possible words (knee, need, neat, ...)
- How likely is the word need? We have all possible state sequences X
  - each sequence can give rise to the signal received with a certain probability (possibly zero)
  - the probability of the word need is the sum of the probabilities with which each such sequence could have given rise to the observations, as formalised below
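The slide's own equation is not in the transcript, but the "sum over state sequences" reading above corresponds to the standard decomposition (in the notation of slides 4-5):

\[
P(O \mid \mu) \;=\; \sum_{X} P(O, X \mid \mu) \;=\; \sum_{X} P(X \mid \mu)\, P(O \mid X, \mu)
\]

where the sum ranges over all state sequences X that could have given rise to the observations.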

15 Simplified trellis diagram representation [Figure: trellis with a hidden layer of phone states (start, n, dh, iy, end) above the observations o_1, ..., o_t-1, o_t, o_t+1, ..., o_T]
- Hidden layer: transitions between sounds forming the words need, knee, ...
- This is our model

16 Simplified trellis diagram representation [Figure: the same trellis, with the observation layer highlighted]
- The visible layer is what the ASR system is given as input

17-22 Computing the probability of an observation [Figures: slide 17 repeats the phone trellis (start, n, dh, iy, end); slides 18-22 step through the computation on a generic trellis of states x_1, ..., x_t, ..., x_T+1 over observations o_1, ..., o_t, ..., o_T]
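The formulas shown on these slides are not recoverable, but the quantity they build up can be sketched as a brute-force sum over every state sequence, reusing the assumed Pi, A, B layout from above (illustrative code, not the lecture's own):

import itertools
import numpy as np

def likelihood_brute_force(Pi, A, B, obs):
    """P(O | mu): sum over every state sequence X_1 .. X_{T+1}.
    Exponential in T -- exactly the cost the forward procedure avoids."""
    N, T = len(Pi), len(obs)
    total = 0.0
    for seq in itertools.product(range(N), repeat=T + 1):
        p = Pi[seq[0]]                     # probability of the initial state
        for t, k in enumerate(obs):
            i, j = seq[t], seq[t + 1]
            p *= A[i, j] * B[i, j, k]      # transition times arc-emission probability
        total += p
    return total

For N states and T observations this sums N^(T+1) terms, which is why the forward procedure in the next section matters.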

23 A final word on observation probabilities
- Since we're computing the probability of an observation given a model, we can use these methods to compare different models
- If we take the observations in our corpus as given, then the best model is the one which maximises the probability of these observations (useful for training/parameter setting)
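Stated as a formula (not on the slide itself): given training observations, model comparison amounts to choosing

\[
\hat{\mu} \;=\; \arg\max_{\mu} P(O_{\text{training}} \mid \mu)
\]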

24 The forward procedure

25 Forward Procedure
- Given our phone input, how do we decide whether the actual word is need, knee, ...?
- We could compute P(O|μ) for every single word
- This is highly expensive in terms of computation

26 Forward procedure
- An efficient solution to this problem, based on dynamic programming (memoisation)
- Rather than performing separate computations for all possible sequences X, we keep partial solutions in memory

27 Forward procedure
- Network representation of all sequences (X) of states that could generate the observations
  - sum of probabilities for those sequences
  - e.g. O = [n iy] could be generated by X1 = [n iy d] (need) or X2 = [n iy t] (neat)
  - shared histories can help us save on memory
- Fundamental assumption: given several state sequences of length t+1 with a shared history up to t, the probability of the first t observations is the same in all of them

28 Forward Procedure [Figure: trellis of states x_1, ..., x_T+1 over observations o_1, ..., o_T]
- The probability of the first t observations is the same for all possible state sequences of length t+1
- Define a forward variable: the probability of ending up in state s_i at time t after observations 1 to t-1
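The defining equation is lost in the transcript; in Manning and Schütze's notation, the forward variable described here is standardly written as

\[
\alpha_i(t) \;=\; P(o_1 o_2 \cdots o_{t-1},\, X_t = s_i \mid \mu)
\]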

29 Forward Procedure: initialisation [Figure: the same trellis]
- The probability of being in state s_i first is just equal to the initialisation probability
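The corresponding initialisation, again supplied here because the slide's formula did not survive, is simply

\[
\alpha_i(1) \;=\; \pi_i, \qquad 1 \le i \le N
\]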

30 Forward Procedure (inductive step) [Figure: the same trellis, showing how the forward variable at time t+1 is built from the forward variables at time t]
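The slide's equations are likewise missing; under the same arc-emission notation the inductive step and the resulting total are standardly

\[
\alpha_j(t+1) \;=\; \sum_{i=1}^{N} \alpha_i(t)\, a_{ij}\, b_{ij o_t},
\qquad
P(O \mid \mu) \;=\; \sum_{i=1}^{N} \alpha_i(T+1)
\]

A compact sketch of the whole procedure, assuming the Pi, A, B layout introduced earlier (not the lecture's own code):

import numpy as np

def forward(Pi, A, B, obs):
    """Arc-emission forward pass; alpha[t - 1] holds the values alpha_i(t)."""
    N, T = len(Pi), len(obs)
    alpha = np.zeros((T + 1, N))
    alpha[0] = Pi                                # alpha_i(1) = pi_i
    for t, k in enumerate(obs):
        # alpha_j(t+1) = sum_i alpha_i(t) * a_ij * b_{i,j,o_t}
        alpha[t + 1] = alpha[t] @ (A * B[:, :, k])
    return alpha[T].sum(), alpha                 # P(O | mu) and the full table

This runs in O(N^2 T) time rather than the exponential cost of the brute-force sum.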

31 Looking backward
- The forward procedure caches the probability of sequences of states leading up to an observation (left to right)
- The backward procedure works the other way: the probability of seeing the rest of the observation sequence, given that we were in some state at some time

32 Backward procedure: basic structure
- Define: the probability of the remaining observations, given that the current observation is emitted by state i
- Initialise: the probability at the final state
- Inductive step:
- Total:
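The four formulas the slide points to are not in the transcript; in the same notation they are standardly

\[
\beta_i(t) \;=\; P(o_t \cdots o_T \mid X_t = s_i, \mu), \qquad
\beta_i(T+1) \;=\; 1
\]
\[
\beta_i(t) \;=\; \sum_{j=1}^{N} a_{ij}\, b_{ij o_t}\, \beta_j(t+1), \qquad
P(O \mid \mu) \;=\; \sum_{i=1}^{N} \pi_i\, \beta_i(1)
\]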

33 Combining forward & backward variables
- Our two variables can be combined: the likelihood of being in state i at time t with our sequence of observations is a function of
  - the probability of ending up in i at t given what came previously
  - the probability of being in i at t given the rest
- Therefore:
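The equation the "Therefore" points to follows from the two definitions above: for any t,

\[
P(O,\, X_t = s_i \mid \mu) \;=\; \alpha_i(t)\, \beta_i(t),
\qquad
P(O \mid \mu) \;=\; \sum_{i=1}^{N} \alpha_i(t)\, \beta_i(t)
\]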

34 Decoding: Finding the best state sequence

35 Best state sequence: example
- Consider the ASR problem again
- Input observation sequence: [aa n iy dh ax] (corresponds to I need the...)
- Possible solutions:
  - I need a...
  - I need the...
  - I kneed a...
  - ...
- NB: each possible solution corresponds to a state sequence. The problem is to find the best word segmentation and the most likely underlying phonetic input.

36 Some difficulties...
- If we focus on the likelihood of each individual state, we run into problems
- Context effects mean that what is individually likely may together yield an unlikely sequence
- The ASR program needs to look at the probability of entire sequences

37 Viterbi algorithm
- Given an observation sequence O and a model μ, find: argmax_X P(X, O|μ), i.e. the sequence of states X such that P(X, O|μ) is highest
- Basic idea:
  - run a type of forward procedure (computes the probability of all possible paths)
  - store partial solutions
  - at the end, look back to find the best path

38 Illustration: path through the trellis [Figure: trellis of states S1, ..., S4 over times t = 1, ..., 7]
At every node (state) and time, we store:
- the likelihood of reaching that state at that time by the most probable path leading to that state (denoted δ)
- the preceding state leading to the current state (denoted ψ)

39 Viterbi Algorithm: definitions [Figure: trellis]
- δ_j(t): the probability of the most probable path from observation 1 to t-1, landing us in state j at t

40 Viterbi Algorithm: initialisation [Figure: trellis]
- The probability of being in state j at the beginning is just the initialisation probability of state j

41 Viterbi Algorithm: inductive step [Figure: trellis]
- The probability of being in j at t+1 depends on
  - the state i for which a_ij is highest
  - the probability that j emits the symbol O_t+1

42 Viterbi Algorithm: inductive step [Figure: trellis]
- Backtrace store: the most probable state from which state j can be reached
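The recurrences on slides 39-42 did not survive the transcript; in Manning and Schütze's arc-emission notation they are standardly written as

\[
\delta_j(1) \;=\; \pi_j, \qquad
\delta_j(t+1) \;=\; \max_{1 \le i \le N} \delta_i(t)\, a_{ij}\, b_{ij o_t}, \qquad
\psi_j(t+1) \;=\; \arg\max_{1 \le i \le N} \delta_i(t)\, a_{ij}\, b_{ij o_t}
\]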

43 Illustration [Figure: the trellis of states S1, ..., S4 over times t = 1, ..., 7, with the most probable path marked]
- δ_2(t=6) = the probability of reaching state 2 at time t=6 by the most probable path (marked) through state 2 at t=6
- ψ_2(t=6) = 3 is the state preceding state 2 at t=6 on the most probable path through state 2 at t=6

44 Viterbi Algorithm: backtrace [Figure: trellis]
- The best state at T is that state i for which the probability δ_i(T) is highest

45 Viterbi Algorithm: backtrace [Figure: trellis]
- Work backwards to the most likely preceding state

46 Viterbi Algorithm: backtrace [Figure: trellis]
- The probability of the best state sequence is the maximum value stored for the final state T
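Putting the definition, induction, and backtrace together, here is a minimal sketch of the algorithm, assuming the same Pi, A, B layout as before (not the lecture's own code):

import numpy as np

def viterbi(Pi, A, B, obs):
    """Most probable state sequence X_1 .. X_{T+1} and its joint probability with O."""
    N, T = len(Pi), len(obs)
    delta = np.zeros((T + 1, N))                 # delta[t - 1, j] = delta_j(t)
    psi = np.zeros((T + 1, N), dtype=int)        # psi[t - 1, j]   = psi_j(t)
    delta[0] = Pi                                # delta_j(1) = pi_j
    for t, k in enumerate(obs):
        # scores[i, j] = delta_i(t) * a_ij * b_{i,j,o_t}
        scores = delta[t][:, None] * A * B[:, :, k]
        psi[t + 1] = scores.argmax(axis=0)       # best predecessor of each state j
        delta[t + 1] = scores.max(axis=0)
    path = [int(delta[T].argmax())]              # best final state
    for t in range(T, 0, -1):                    # follow the backtrace pointers
        path.append(int(psi[t][path[-1]]))
    path.reverse()
    return path, float(delta[T].max())

Called on the toy parameters above with, say, obs = [0, 2, 1], it returns the best path through the trellis together with its probability.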

47 Summary
- We've looked at two algorithms for solving two of the fundamental problems of HMMs:
  - the likelihood of an observation sequence given a model (forward/backward procedure)
  - the most likely underlying state sequence, given an observation sequence (Viterbi algorithm)
- Next up: POS tagging

