CSC 594 Topics in AI – Natural Language Processing


1 CSC 594 Topics in AI – Natural Language Processing
Spring 2018 11. Hidden Markov Model (Some slides adapted from Jurafsky & Martin, and Raymond Mooney at UT Austin)

2 Speech and Language Processing - Jurafsky and Martin
Hidden Markov Models The Hidden Markov Model (HMM) is a sequence model. A sequence model or sequence classifier is a model that assigns a label or class to each unit in a sequence, thus mapping a sequence of observations to a sequence of labels. An HMM is a probabilistic sequence model: given a sequence of units (e.g. words, letters, morphemes, sentences), it computes a probability distribution over possible sequences of labels and chooses the best label sequence. This is a kind of generative model. Speech and Language Processing - Jurafsky and Martin

3 Sequence Labeling in NLP
There are many NLP tasks which use sequence labeling: speech recognition, part-of-speech tagging, Named Entity Recognition, Information Extraction. But not all NLP tasks are sequence labeling – only those where information about earlier items in the sequence helps label the later ones. Speech and Language Processing - Jurafsky and Martin

4 Speech and Language Processing - Jurafsky and Martin
Definitions A weighted finite-state automaton adds probabilities to the arcs. The probabilities on the arcs leaving any state must sum to one. A Markov chain is a special case of a weighted automaton in which the input sequence uniquely determines which states the automaton will go through. Markov chains can't represent inherently ambiguous problems; they are useful for assigning probabilities to unambiguous sequences. Speech and Language Processing - Jurafsky and Martin

5 Markov Chain for Weather
Speech and Language Processing - Jurafsky and Martin

6 Markov Chain: “First-order observable Markov Model”
A set of states Q = q_1, q_2 … q_N; the state at time t is q_t Transition probabilities: a set of probabilities A = a_01, a_02, …, a_n1, …, a_nn. Each a_ij represents the probability of transitioning from state i to state j The set of these is the transition probability matrix A Special initial probability vector π Current state only depends on previous state Speech and Language Processing - Jurafsky and Martin
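Written out formally (a standard restatement of this slide, not an addition to it), the first-order Markov assumption and the stochastic constraint on each row of A are:

```latex
P(q_t \mid q_1 \dots q_{t-1}) = P(q_t \mid q_{t-1}),
\qquad
\sum_{j=1}^{N} a_{ij} = 1 \;\; \text{for all } i
```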

7 Markov Chain for Weather
What is the probability of 4 consecutive warm days? The observation sequence is warm-warm-warm-warm, i.e. the state sequence is 3-3-3-3: P(3,3,3,3) = π_3 a_33 a_33 a_33 = 0.2 × (0.6)^3 = 0.0432 Speech and Language Processing - Jurafsky and Martin
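More generally, the warm-weather example is an instance of the probability a first-order Markov chain assigns to any state sequence (standard formula, stated here for reference):

```latex
P(q_1 q_2 \dots q_T) = \pi_{q_1} \prod_{t=2}^{T} a_{q_{t-1}\,q_t}
```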

8 Hidden Markov Model (HMM)
However, often we want to know what produced the observed sequence – the hidden sequence behind it. For example: inferring the words (hidden) from the acoustic signal (observed) in speech recognition; assigning part-of-speech tags (hidden) to a sentence (a sequence of words) – POS tagging; assigning named entity categories (hidden) to a sentence (a sequence of words) – Named Entity Recognition. Speech and Language Processing - Jurafsky and Martin

9 Two Kinds of Probabilities
State transition probabilities -- p(t_i | t_{i-1}): state-to-state transition probabilities Observation/Emission likelihood -- p(w_i | t_i): probabilities of observing various values at a given state Speech and Language Processing - Jurafsky and Martin

10 What’s the state sequence for the observed sequence “1 2 3”?
Speech and Language Processing - Jurafsky and Martin

11 Speech and Language Processing - Jurafsky and Martin
Three Problems Given this framework there are 3 problems that we can pose to an HMM Given an observation sequence, what is the probability of that sequence given a model? Given an observation sequence and a model, what is the most likely state sequence? Given an observation sequence, find the best model parameters for a partially specified model Speech and Language Processing - Jurafsky and Martin

12 Problem 1: Observation Likelihood
The probability of a sequence given a model... Used in model development... How do I know if some change I made to the model is making things better? And in classification tasks Word spotting in ASR, language identification, speaker identification, author identification, etc. Train one HMM model per class Given an observation, pass it to each model and compute P(seq|model). Speech and Language Processing - Jurafsky and Martin
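For the classification use sketched above (one HMM per class), the decision rule is to score the observation sequence O under each class model and pick the highest-likelihood one; writing λ_c for the model trained on class c (notation assumed here, not from the slide):

```latex
\hat{c} = \operatorname*{argmax}_{c} P(O \mid \lambda_c)
```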

13 Speech and Language Processing - Jurafsky and Martin
Problem 2: Decoding Most probable state sequence given a model and an observation sequence Typically used in tagging problems, where the tags correspond to hidden states As we’ll see almost any problem can be cast as a sequence labeling problem Speech and Language Processing - Jurafsky and Martin

14 Speech and Language Processing - Jurafsky and Martin
Problem 3: Learning Infer the best model parameters, given a partial model and an observation sequence... That is, fill in the A and B tables with the right numbers -- the numbers that make the observation sequence most likely. In other words, this is how the probabilities are learned! Speech and Language Processing - Jurafsky and Martin

15 Speech and Language Processing - Jurafsky and Martin
Solutions Problem 2: Viterbi Problem 1: Forward Problem 3: Forward-Backward An instance of EM (Expectation Maximization) Speech and Language Processing - Jurafsky and Martin

16 Speech and Language Processing - Jurafsky and Martin
Problem 2: Decoding We want, out of all sequences of n tags t_1…t_n, the single tag sequence such that P(t_1…t_n | w_1…w_n) is highest. Hat ^ means "our estimate of the best one" Argmax_x f(x) means "the x such that f(x) is maximized" Speech and Language Processing - Jurafsky and Martin
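Written out, the decoding objective this slide describes (the equation itself is an image in the original deck) is:

```latex
\hat{t}_{1:n} = \operatorname*{argmax}_{t_1 \dots t_n} P(t_1 \dots t_n \mid w_1 \dots w_n)
```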

17 Speech and Language Processing - Jurafsky and Martin
Getting to HMMs This equation should give us the best tag sequence But how to make it operational? How to compute this value? Intuition of Bayesian inference: Use Bayes rule to transform this equation into a set of probabilities that are easier to compute (and give the right answer) Speech and Language Processing - Jurafsky and Martin

18 Speech and Language Processing - Jurafsky and Martin
Using Bayes Rule Know this. Speech and Language Processing - Jurafsky and Martin
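The slide's equation is an image in the original deck; reconstructed from the standard Jurafsky & Martin derivation, Bayes' rule rewrites the objective, and the denominator can be dropped because it does not depend on the tag sequence:

```latex
P(t_{1:n} \mid w_{1:n}) = \frac{P(w_{1:n} \mid t_{1:n})\, P(t_{1:n})}{P(w_{1:n})}
\quad\Longrightarrow\quad
\hat{t}_{1:n} = \operatorname*{argmax}_{t_{1:n}} P(w_{1:n} \mid t_{1:n})\, P(t_{1:n})
```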

19 Speech and Language Processing - Jurafsky and Martin
Likelihood and Prior Speech and Language Processing - Jurafsky and Martin
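The figure on this slide is missing from the transcript; it shows the two standard Jurafsky & Martin simplifying assumptions: the likelihood factors word-by-word, and the prior over tag sequences is approximated by a tag bigram model, giving the familiar HMM tagging objective:

```latex
P(w_{1:n} \mid t_{1:n}) \approx \prod_{i=1}^{n} P(w_i \mid t_i),
\qquad
P(t_{1:n}) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1}),
\qquad
\hat{t}_{1:n} \approx \operatorname*{argmax}_{t_{1:n}} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
```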

20 Speech and Language Processing - Jurafsky and Martin
Definition of HMM States Q = q_1, q_2 … q_N Observations O = o_1, o_2 … o_T, where each observation is a symbol drawn from a vocabulary V = {v_1, v_2, … v_V} Transition probabilities: transition probability matrix A = {a_ij} Observation likelihoods: output probability matrix B = {b_i(k)} Special initial probability vector π Speech and Language Processing - Jurafsky and Martin
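Two assumptions underlie a first-order HMM (standard in Jurafsky & Martin, stated here for completeness): the Markov assumption on the hidden states and output independence of the observations:

```latex
P(q_t \mid q_1 \dots q_{t-1}) = P(q_t \mid q_{t-1}),
\qquad
P(o_t \mid q_1 \dots q_T,\, o_1 \dots o_T) = P(o_t \mid q_t)
```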

21 Speech and Language Processing - Jurafsky and Martin
HMMs for Ice Cream You are a climatologist in the year 2799 studying global warming You can’t find any records of the weather in Baltimore for summer of 2007 But you find Jason Eisner’s diary which lists how many ice-creams Jason ate every day that summer Your job: figure out how hot it was each day Speech and Language Processing - Jurafsky and Martin

22 Speech and Language Processing - Jurafsky and Martin
Eisner Task Given Ice Cream Observation Sequence: 1,2,3,2,2,2,3… Produce: Hidden Weather Sequence: H,C,H,H,H,C, C… Speech and Language Processing - Jurafsky and Martin

23 Speech and Language Processing - Jurafsky and Martin
HMM for Ice Cream Speech and Language Processing - Jurafsky and Martin

24 Speech and Language Processing - Jurafsky and Martin
Ice Cream HMM Let's just do 1 3 1 as the observation sequence How many underlying state (hot/cold) sequences are there? How do you pick the right one? HHH HHC HCH HCC CCC CCH CHC CHH Argmax P(sequence | 1 3 1) Speech and Language Processing - Jurafsky and Martin

25 Speech and Language Processing - Jurafsky and Martin
Ice Cream HMM Let's just do 1 state sequence, CHC, for the observations 1 3 1: Cold as the initial state, P(Cold|Start) = .2 Observing a 1 on a cold day, P(1|Cold) = .5 Hot as the next state, P(Hot|Cold) = .4 Observing a 3 on a hot day, P(3|Hot) = .4 Cold as the next state, P(Cold|Hot) = .3 Observing a 1 on a cold day, P(1|Cold) = .5 Total: .2 × .5 × .4 × .4 × .3 × .5 = .0024 Speech and Language Processing - Jurafsky and Martin

26 Speech and Language Processing - Jurafsky and Martin
But there are other state sequences which could generate 1 3 1. For example, HHC. Speech and Language Processing - Jurafsky and Martin

27 Speech and Language Processing - Jurafsky and Martin
Viterbi Algorithm Create an array Columns corresponding to observations Rows corresponding to possible hidden states Recursively compute the probability of the most likely subsequence of states that accounts for the first t observations and ends in state s_j. Also record "backpointers" that subsequently allow backtracing the most probable state sequence. Speech and Language Processing - Jurafsky and Martin
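As a concrete illustration, here is a minimal Viterbi sketch in Python for the ice-cream HMM. The transitions P(Hot|Cold)=.4 and P(Cold|Hot)=.3 are taken from the slide-25 computation; the initial probability P(Hot|Start)=.8 and the emission rows are the usual Eisner / Jurafsky & Martin example values and should be treated as assumptions, as should the function name `viterbi` and the dict layout, which are not from the deck.

```python
# Minimal Viterbi decoder for the ice-cream HMM (illustrative sketch).
# Recursion: v_t(j) = max_i v_{t-1}(i) * a_ij * b_j(o_t), with backpointers.

states = ["Hot", "Cold"]

# Assumed parameters (see lead-in): initial, transition, and emission probabilities
pi = {"Hot": 0.8, "Cold": 0.2}
A = {"Hot":  {"Hot": 0.7, "Cold": 0.3},
     "Cold": {"Hot": 0.4, "Cold": 0.6}}
B = {"Hot":  {1: 0.2, 2: 0.4, 3: 0.4},
     "Cold": {1: 0.5, 2: 0.4, 3: 0.1}}

def viterbi(observations):
    # v[t][s]: probability of the best state sequence ending in s after t+1 observations
    v = [{s: pi[s] * B[s][observations[0]] for s in states}]
    backpointers = [{}]
    for t in range(1, len(observations)):
        v.append({})
        backpointers.append({})
        for s in states:
            best_prev = max(states, key=lambda p: v[t - 1][p] * A[p][s])
            v[t][s] = v[t - 1][best_prev] * A[best_prev][s] * B[s][observations[t]]
            backpointers[t][s] = best_prev
    # Backtrace from the best final state
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(backpointers[t][path[-1]])
    return list(reversed(path)), v[-1][last]

print(viterbi([1, 3, 1]))   # -> (['Hot', 'Hot', 'Cold'], 0.00672) with the assumed numbers above
```

With these assumed parameters the best path for 1 3 1 is HHC, which is consistent with slide 26's remark that HHC is another plausible generator of the observations.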

28 Speech and Language Processing - Jurafsky and Martin
The Viterbi Algorithm Speech and Language Processing - Jurafsky and Martin

29 Viterbi Example: Ice Cream
Speech and Language Processing - Jurafsky and Martin

30 Viterbi Example

31 Raymond Mooney at UT Austin
Viterbi Backpointers [Trellis figure: rows are states s_0, s_1, s_2, …, s_N, s_F; columns are time steps t_1, t_2, t_3, …, t_T-1, t_T; each cell records a backpointer to the best previous state] Raymond Mooney at UT Austin

32 Raymond Mooney at UT Austin
Viterbi Backtrace [Trellis figure: following the backpointers from the best final state s_F back through t_T, …, t_1 recovers the most likely state sequence] Raymond Mooney at UT Austin

33 Speech and Language Processing - Jurafsky and Martin
Problem 1: Forward Given an observation sequence, return the probability of the sequence given the model... Well, in a normal Markov model the state sequence and the observation sequence are identical... so the probability of a sequence is just the probability of its state path. But not in an HMM... remember that any number of state sequences might be responsible for any given observation sequence. Speech and Language Processing - Jurafsky and Martin

34 Speech and Language Processing - Jurafsky and Martin
Forward Efficiently computes the probability of an observed sequence given a model P(sequence|model) Nearly identical to Viterbi; replace the MAX with a SUM Speech and Language Processing - Jurafsky and Martin
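The standard forward recursion, written out here because the algorithm figure on the next slide is an image in the original deck; it is the Viterbi recursion with the max replaced by a sum:

```latex
\alpha_1(j) = \pi_j\, b_j(o_1),
\qquad
\alpha_t(j) = \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, b_j(o_t),
\qquad
P(O \mid \lambda) = \sum_{j=1}^{N} \alpha_T(j)
```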

35 Speech and Language Processing - Jurafsky and Martin
The Forward Algorithm Speech and Language Processing - Jurafsky and Martin

36 Speech and Language Processing - Jurafsky and Martin
Visualizing Forward Speech and Language Processing - Jurafsky and Martin

37 Speech and Language Processing - Jurafsky and Martin
Problem 3 Infer the best model parameters, given a skeletal model and an observation sequence... That is, fill in the A and B tables with the right numbers... The numbers that make the observation sequence most likely Useful for getting an HMM without having to hire annotators... Speech and Language Processing - Jurafsky and Martin

38 Speech and Language Processing - Jurafsky and Martin
Solutions Problem 1: Forward Problem 2: Viterbi Problem 3: EM, i.e. the Forward-Backward algorithm (Baum-Welch) Speech and Language Processing - Jurafsky and Martin

39 Speech and Language Processing - Jurafsky and Martin
Forward-Backward Baum-Welch = Forward-Backward Algorithm (Baum 1972) It is a special case of the EM (Expectation-Maximization) algorithm The algorithm lets us train the transition probabilities A = {a_ij} and the emission probabilities B = {b_i(o_t)} of the HMM Speech and Language Processing - Jurafsky and Martin

40 Intuition for re-estimation of aij
We will estimate â_ij via this intuition: Numerator intuition: assume we had some estimate of the probability that a given transition i→j was taken at time t in an observation sequence. If we knew this probability for each time t, we could sum over all t to get the expected count of i→j transitions. Speech and Language Processing - Jurafsky and Martin

41 Intuition for re-estimation of aij
Speech and Language Processing - Jurafsky and Martin

42 Speech and Language Processing - Jurafsky and Martin
Re-estimating a_ij The expected number of transitions from state i to state j is the sum over all t of ξ_t(i,j) The total expected number of transitions out of state i is the sum over all transitions out of state i Final formula for re-estimated a_ij (see below) Speech and Language Processing - Jurafsky and Martin
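The quantities behind this slide, written out (the final formula appears as an image in the original deck; this is the standard Baum-Welch re-estimation from Jurafsky & Martin):

```latex
\xi_t(i,j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)},
\qquad
\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \sum_{k=1}^{N} \xi_t(i,k)}
```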

43 The Forward-Backward Alg
Speech and Language Processing - Jurafsky and Martin

44 Sketch of Baum-Welch (EM) Algorithm for Training HMMs
Assume an HMM with N states. Randomly set its parameters λ=(A,B), making sure they represent legal distributions. Until convergence (i.e. λ no longer changes) do: E Step: Use the forward/backward procedure to determine the probability of various possible state sequences for generating the training data M Step: Use these probability estimates to re-estimate values for all of the parameters λ Raymond Mooney at UT Austin

45 Raymond Mooney at UT Austin
EM Properties Each iteration changes the parameters in a way that is guaranteed to increase the likelihood of the data: P(O|λ). Anytime algorithm: can stop at any time prior to convergence to get an approximate solution. Converges to a local maximum. Raymond Mooney at UT Austin

46 Summary: Forward-Backward Algorithm
1. Initialize λ=(A,B) 2. Compute α, β, ξ using the observations 3. Estimate new λ'=(A,B) 4. Replace λ with λ' 5. If not converged, go to step 2 Speech and Language Processing - Jurafsky and Martin
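A compact numpy sketch of this loop, for a single training observation sequence with integer-coded symbols. This is illustrative only, not the deck's own code: the function names and array layout are assumptions, and it omits the scaling or log-space tricks needed for long sequences.

```python
import numpy as np

def forward(pi, A, B, obs):
    # alpha[t, i] = P(o_1..o_t, q_t = i | lambda)
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

def backward(A, B, obs):
    # beta[t, i] = P(o_{t+1}..o_T | q_t = i, lambda)
    T, N = len(obs), A.shape[0]
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

def baum_welch(pi, A, B, obs, n_iter=20):
    obs = np.asarray(obs)
    for _ in range(n_iter):
        alpha, beta = forward(pi, A, B, obs), backward(A, B, obs)
        likelihood = alpha[-1].sum()            # P(O | lambda)
        # E step: expected state occupancies (gamma) and transitions (xi)
        gamma = alpha * beta / likelihood       # shape (T, N)
        xi = (alpha[:-1, :, None] * A[None, :, :] *
              B[:, obs[1:]].T[:, None, :] * beta[1:, None, :]) / likelihood
        # M step: re-estimate pi, A, B from the expected counts
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        B = np.stack([gamma[obs == k].sum(axis=0) for k in range(B.shape[1])],
                     axis=1) / gamma.sum(axis=0)[:, None]
    return pi, A, B
```

In practice pi, A, and B would be randomly initialized and row-normalized, as slide 44 says; a perfectly uniform initialization leaves the hidden states symmetric, so the updates never break the tie.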

