CSC 594 Topics in AI – Natural Language Processing
Spring 2018 11. Hidden Markov Model (Some slides adapted from Jurafsky & Martin, and Raymond Mooney at UT Austin)
Hidden Markov Models
A Hidden Markov Model (HMM) is a sequence model. A sequence model (or sequence classifier) assigns a label or class to each unit in a sequence, mapping a sequence of observations to a sequence of labels.
An HMM is a probabilistic sequence model: given a sequence of units (e.g. words, letters, morphemes, sentences), it computes a probability distribution over possible sequences of labels and chooses the best label sequence.
An HMM is a kind of generative model.
Sequence Labeling in NLP
Many NLP tasks use sequence labeling:
- Speech recognition
- Part-of-speech tagging
- Named Entity Recognition
- Information Extraction
Not all NLP tasks are sequence labeling tasks – only those where information about earlier items in the sequence helps predict the label of the next one.
Definitions
- A weighted finite-state automaton (WFSA) adds probabilities to the arcs; the probabilities on the arcs leaving any state must sum to one.
- A Markov chain is a special case of a WFSA in which the input sequence uniquely determines which states the automaton will go through.
- Markov chains can't represent inherently ambiguous problems; they are useful for assigning probabilities to unambiguous sequences.
Markov Chain for Weather
Markov Chain: “First-order observable Markov Model”
- A set of states Q = q_1, q_2, …, q_N; the state at time t is q_t.
- Transition probabilities: a set of probabilities A = a_{01}, a_{02}, …, a_{n1}, …, a_{nn}. Each a_{ij} represents the probability of transitioning from state i to state j; together these form the transition probability matrix A.
- A special initial probability vector π.
- The current state depends only on the previous state.
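Written out with this notation, the Markov assumption, the probability of a state sequence, and the row constraint from the definition above are:

\[
P(q_i \mid q_1 \cdots q_{i-1}) = P(q_i \mid q_{i-1}),
\qquad
P(q_1, q_2, \ldots, q_T) = \pi_{q_1} \prod_{t=2}^{T} a_{q_{t-1} q_t},
\qquad
\sum_{j} a_{ij} = 1 \ \text{for every state } i.
\]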
Markov Chain for Weather
What is the probability of 4 consecutive warm days?
The sequence is warm-warm-warm-warm, i.e. the state sequence 3, 3, 3, 3:
P(3, 3, 3, 3) = π_3 · a_33 · a_33 · a_33 = 0.2 × (0.6)^3 = 0.0432
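The same computation in a few lines of Python (a sketch; only the values given on the slide, π_warm = 0.2 and a_warm,warm = 0.6, are filled in, so the other weather states are omitted):

```python
def sequence_probability(states, pi, A):
    """P(q_1 .. q_T) = pi[q_1] * product of A[q_{t-1}][q_t] for a first-order Markov chain."""
    prob = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        prob *= A[prev][cur]
    return prob

# Only the warm-day entries are given on the slide.
pi = {"warm": 0.2}
A = {"warm": {"warm": 0.6}}
print(sequence_probability(["warm"] * 4, pi, A))  # 0.2 * 0.6**3 = 0.0432
```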
Hidden Markov Model (HMM)
Often, however, we want to know what produced the observed sequence – that is, the hidden sequence behind it. For example:
- Inferring the words (hidden) from the acoustic signal (observed) in speech recognition
- Assigning part-of-speech tags (hidden) to a sentence (a sequence of words) – POS tagging
- Assigning named entity categories (hidden) to a sentence (a sequence of words) – Named Entity Recognition
Two Kinds of Probabilities
- State transition probabilities, P(t_i | t_{i-1}): state-to-state transition probabilities
- Observation/emission likelihoods, P(w_i | t_i): probabilities of observing particular values in a given state
What’s the state sequence for the observed sequence “1 2 3”?
Three Problems
Given this framework, there are three problems we can pose to an HMM:
1. Given an observation sequence and a model, what is the probability of that sequence?
2. Given an observation sequence and a model, what is the most likely state sequence?
3. Given an observation sequence, find the best model parameters for a partially specified model.
Problem 1: Observation Likelihood
The probability of a sequence given a model.
- Used in model development: how do I know whether some change I made to the model is making things better?
- Also used in classification tasks: word spotting in ASR, language identification, speaker identification, author identification, etc.
  - Train one HMM per class.
  - Given an observation sequence, pass it to each model and compute P(seq | model); see the sketch below.
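A sketch of that classification recipe; the names here (models, forward_prob) are illustrative assumptions, with forward_prob standing in for the Forward algorithm introduced later:

```python
# Hypothetical setup: one trained HMM per class, e.g. one per language.
# forward_prob(model, seq) is assumed to return P(seq | model).
def classify(seq, models, forward_prob):
    """Return the class whose HMM assigns the highest likelihood to seq."""
    return max(models, key=lambda label: forward_prob(models[label], seq))
```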
Problem 2: Decoding
Find the most probable state sequence, given a model and an observation sequence.
- Typically used in tagging problems, where the tags correspond to hidden states.
- As we'll see, almost any problem can be cast as a sequence labeling problem.
Problem 3: Learning
Infer the best model parameters, given a partial model and an observation sequence – that is, fill in the A and B tables with the numbers that make the observation sequence most likely. This is how the probabilities are learned.
Solutions
- Problem 2: the Viterbi algorithm
- Problem 1: the Forward algorithm
- Problem 3: the Forward-Backward algorithm, an instance of EM (Expectation-Maximization)
Problem 2: Decoding
We want, out of all sequences of n tags t_1 … t_n, the single tag sequence such that P(t_1 … t_n | w_1 … w_n) is highest.
- The hat (^) means “our estimate of the best one”.
- argmax_x f(x) means “the x such that f(x) is maximized”.
Getting to HMMs
This equation should give us the best tag sequence. But how do we make it operational – how do we compute this value?
Intuition of Bayesian inference: use Bayes' rule to transform this equation into a set of probabilities that are easier to compute (and give the right answer).
Using Bayes Rule
Know this.
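The equation on this slide is an image and is not reproduced in the transcript; the Bayes-rule step it refers to is the standard one:

\[
\hat{t}_{1:n}
= \operatorname*{argmax}_{t_{1:n}} P(t_{1:n} \mid w_{1:n})
= \operatorname*{argmax}_{t_{1:n}} \frac{P(w_{1:n} \mid t_{1:n})\, P(t_{1:n})}{P(w_{1:n})}
= \operatorname*{argmax}_{t_{1:n}} P(w_{1:n} \mid t_{1:n})\, P(t_{1:n})
\]

The denominator can be dropped because it is the same for every candidate tag sequence.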
Likelihood and Prior
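The equations on this slide are also images. The standard decomposition it refers to approximates the likelihood by assuming each word depends only on its own tag, and the prior by assuming each tag depends only on the previous tag:

\[
P(w_{1:n} \mid t_{1:n}) \approx \prod_{i=1}^{n} P(w_i \mid t_i),
\qquad
P(t_{1:n}) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1}),
\]
\[
\hat{t}_{1:n} = \operatorname*{argmax}_{t_{1:n}} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1}).
\]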
Definition of HMM
- States: Q = q_1, q_2, …, q_N
- Observations: O = o_1, o_2, …, o_T; each observation is a symbol drawn from a vocabulary V = {v_1, v_2, …, v_V}
- Transition probabilities: transition probability matrix A = {a_ij}
- Observation likelihoods: output probability matrix B = {b_i(k)}
- A special initial probability vector π
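A minimal sketch of how these five components could be stored in code; the container below is an illustrative assumption, not something defined on the slides:

```python
from dataclasses import dataclass

@dataclass
class HMM:
    states: list   # Q = q_1 ... q_N (hidden states)
    vocab: list    # V = v_1 ... v_V (observable symbols)
    pi: dict       # pi[q]     = P(start in state q)
    A: dict        # A[qi][qj] = P(next state qj | current state qi)
    B: dict        # B[q][v]   = P(observe symbol v | state q)
```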
HMMs for Ice Cream
- You are a climatologist in the year 2799 studying global warming.
- You can't find any records of the weather in Baltimore for the summer of 2007.
- But you find Jason Eisner's diary, which lists how many ice creams Jason ate every day that summer.
- Your job: figure out how hot it was each day.
Eisner Task
Given: an ice cream observation sequence, e.g. 1, 2, 3, 2, 2, 2, 3, …
Produce: the hidden weather sequence, e.g. H, C, H, H, H, C, C, …
HMM for Ice Cream
Ice Cream HMM
Let's just take 1 3 1 as the observation sequence.
How many underlying state (hot/cold) sequences are there? 2^3 = 8:
HHH HHC HCH HCC CCC CCH CHC CHH
How do we pick the right one?
argmax_sequence P(sequence | 1 3 1)
Ice Cream HMM
Let's just score one state sequence, CHC, for the observations 1 3 1:
- Cold as the initial state: P(Cold | Start) = 0.2
- Observing a 1 on a cold day: P(1 | Cold) = 0.5
- Hot as the next state: P(Hot | Cold) = 0.4
- Observing a 3 on a hot day: P(3 | Hot) = 0.4
- Cold as the next state: P(Cold | Hot) = 0.3
- Observing a 1 on a cold day: P(1 | Cold) = 0.5
Multiplying these together: 0.2 × 0.5 × 0.4 × 0.4 × 0.3 × 0.5 = 0.0024
But there are other state sequences that could generate 1 3 1 – for example, HHC.
Viterbi Algorithm
- Create an array with columns corresponding to observations and rows corresponding to possible hidden states.
- Recursively compute the probability of the most likely subsequence of states that accounts for the first t observations and ends in state s_j.
- Also record “backpointers” that subsequently allow backtracing the most probable state sequence.
The Viterbi Algorithm
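The Viterbi pseudocode figure on this slide is not reproduced in the transcript. Below is a small Python sketch of the same dynamic program on the ice cream HMM. Only a few parameter values appear on these slides (0.2, 0.5, 0.4, 0.4, 0.3); the remaining numbers are taken from the standard Jurafsky & Martin ice cream figure and should be treated as assumptions.

```python
# Ice cream HMM parameters: values visible on these slides, completed with the
# standard textbook figure (assumed). Rows of A sum to 0.9 because the textbook
# figure reserves probability 0.1 per state for an end state (also an assumption).
pi = {"H": 0.8, "C": 0.2}                  # P(first state)
A  = {"H": {"H": 0.6, "C": 0.3},           # P(next state | current state)
      "C": {"H": 0.4, "C": 0.5}}
B  = {"H": {1: 0.2, 2: 0.4, 3: 0.4},       # P(ice creams eaten | weather)
      "C": {1: 0.5, 2: 0.4, 3: 0.1}}

def viterbi(obs, states, pi, A, B):
    # v[t][s]: probability of the best state sequence ending in s after t+1 observations
    # back[t][s]: the previous state on that best path (the "backpointer")
    v = [{s: pi[s] * B[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        v.append({})
        back.append({})
        for s in states:
            best_prev = max(states, key=lambda r: v[t - 1][r] * A[r][s])
            v[t][s] = v[t - 1][best_prev] * A[best_prev][s] * B[s][obs[t]]
            back[t][s] = best_prev
    # Backtrace: start from the best final state and follow backpointers.
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, v[-1][last]

print(viterbi([1, 3, 1], ["H", "C"], pi, A, B))
# -> (['H', 'H', 'C'], ~0.00576) under these assumed parameters
```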
Viterbi Example: Ice Cream
Viterbi Example
Viterbi Backpointers (figure: trellis of states s_0, s_1, …, s_N, s_F over time steps t_1 … t_T, with a backpointer recorded at each cell)
Viterbi Backtrace (figure: the same trellis, with the most likely state sequence read off by following the backpointers back from s_F)
Problem 1: Forward
Given an observation sequence, return the probability of that sequence given the model.
- In an ordinary Markov model the states and the observations are identical, so the probability of a sequence is just the probability of its path through the states.
- Not so in an HMM: remember that any number of state sequences might be responsible for any given observation sequence.
Forward
Efficiently computes the probability of an observed sequence given a model, P(sequence | model).
Nearly identical to Viterbi: replace the MAX with a SUM (and drop the backpointers).
The Forward Algorithm
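As with the Viterbi slide, the pseudocode figure is not in the transcript. A minimal sketch, assuming the pi, A, B dicts from the Viterbi sketch above are in scope; the only change from Viterbi is summing over predecessors instead of maximizing:

```python
def forward(obs, states, pi, A, B):
    """P(obs | model): sum over all state sequences (Viterbi with max replaced by sum)."""
    alpha = {s: pi[s] * B[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[r] * A[r][s] for r in states) * B[s][o] for s in states}
    return sum(alpha.values())

# Uses the (partly assumed) ice cream parameters defined in the Viterbi sketch.
print(forward([1, 3, 1], ["H", "C"], pi, A, B))  # total likelihood of observing 1, 3, 1
```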
Visualizing Forward
Problem 3
Infer the best model parameters, given a skeletal model and an observation sequence – that is, fill in the A and B tables with the numbers that make the observation sequence most likely.
Useful for getting an HMM without having to hire annotators.
Solutions
- Problem 1: the Forward algorithm
- Problem 2: the Viterbi algorithm
- Problem 3: EM, i.e. the Forward-Backward (Baum-Welch) algorithm
Forward-Backward
- Baum-Welch = Forward-Backward algorithm (Baum 1972)
- A special case of the EM (Expectation-Maximization) algorithm
- The algorithm lets us train the transition probabilities A = {a_ij} and the emission probabilities B = {b_i(o_t)} of the HMM
Intuition for re-estimation of a_ij
We will estimate â_ij via this intuition:
- Numerator intuition: assume we had some estimate of the probability that a given transition i→j was taken at time t in the observation sequence.
- If we knew this probability for each time t, we could sum over all t to get the expected count for the transition i→j.
Intuition for re-estimation of a_ij (continued)
Re-estimating a_ij
- The expected number of transitions from state i to state j is the sum over all t of the probability of taking that transition at time t.
- The total expected number of transitions out of state i is the sum over all transitions out of state i.
- The final formula for the re-estimated a_ij is the ratio of these two expected counts (reconstructed below).
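A reconstruction of the formulas this slide refers to, in standard Baum-Welch notation (the slide images are not in the transcript). Here α and β are the forward and backward probabilities, and ξ_t(i, j) is the probability of being in state i at time t and state j at time t+1, given the observations O and the model λ:

\[
\xi_t(i, j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)},
\qquad
\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \sum_{k=1}^{N} \xi_t(i, k)}.
\]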
The Forward-Backward Alg
Sketch of Baum-Welch (EM) Algorithm for Training HMMs
Assume an HMM with N states. Randomly set its parameters λ = (A, B), making sure they represent legal distributions.
Until converged (i.e. λ no longer changes), do:
- E-step: use the forward/backward procedure to determine the probability of various possible state sequences for generating the training data.
- M-step: use these probability estimates to re-estimate values for all of the parameters λ.
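A schematic of that loop in Python. The e_step and m_step helpers are hypothetical placeholders (the real E-step computes the forward/backward quantities, and the real M-step applies re-estimation formulas like the one for â_ij above); this shows the structure only, not a full implementation:

```python
import random

def train_baum_welch(observations, n_states, e_step, m_step, max_iters=100, tol=1e-6):
    """Skeleton of Baum-Welch training; e_step and m_step are assumed helpers."""
    # Random (legal) initialization of lambda = (A, B): each row is a distribution.
    def random_dist(n):
        weights = [random.random() for _ in range(n)]
        total = sum(weights)
        return [w / total for w in weights]

    symbols = sorted({o for seq in observations for o in seq})  # observation alphabet
    A = [random_dist(n_states) for _ in range(n_states)]
    B = [random_dist(len(symbols)) for _ in range(n_states)]

    prev_log_likelihood = float("-inf")
    for _ in range(max_iters):
        # E-step: expected counts (and data log-likelihood) under the current A, B.
        expected_counts, log_likelihood = e_step(observations, A, B)
        # M-step: re-estimate A and B from the expected counts.
        A, B = m_step(expected_counts)
        if log_likelihood - prev_log_likelihood < tol:  # converged
            break
        prev_log_likelihood = log_likelihood
    return A, B
```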
EM Properties
- Each iteration changes the parameters in a way that is guaranteed to increase the likelihood of the data, P(O | λ).
- Anytime algorithm: can stop at any time prior to convergence to get an approximate solution.
- Converges to a local maximum.
Summary: Forward-Backward Algorithm
1. Initialize λ = (A, B)
2. Compute α, β, and γ using the observations
3. Estimate new λ' = (A, B)
4. Replace λ with λ'
5. If not converged, go to 2