
1 Hidden Markov Models Teaching Demo The University of Arizona
Tatjana Scheffler

2 Warm-Up: Parts of Speech
Part of Speech Tagging = Grouping words into morphosyntactic types like noun, verb, etc.: She went for a walk

3 Warm-Up: Parts of Speech
Grouping words into morphosyntactic types like noun, verb, etc.: PRON VERB ADP DET NOUN She went for a walk

4 POS tags (Universal Dependencies)
Open class words: ADJ, ADV, INTJ, NOUN, PROPN, VERB
Closed class words: ADP, AUX, CCONJ, DET, NUM, PART, PRON, SCONJ
Other: PUNCT, SYM, X

5 Let’s try! Look at your word(s) and find an appropriate (or likely) part-of-speech tag. Write it on the paper. Now find the other words from your sentence (sentences are numbered and color-coded in the upper left corner). Re-tag the sentence. Did anyone in your group have to change a POS tag? How did you know?

6 Hidden Markov Models
A generative language model. Compare, e.g., an n-gram model: $P(w_n \mid w_1, \dots, w_{n-1})$. Now, a two-step process:
1. Generate a sequence of hidden states (POS tags) $t_1, \dots, t_T$ from a bigram model $P(t_i \mid t_{i-1})$
2. Independently, generate an observable word $w_i$ from each state $t_i$, from the emission model $P(w_i \mid t_i)$
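A minimal sketch of this two-step generative story in Python. The tiny tag set, the probability values, and the function names are illustrative assumptions of mine, not values from the slides:

```python
import random

# Illustrative two-step generative process: sample tags from a bigram model,
# then emit one word per tag.  All numbers below are made up for illustration.
transition = {                     # P(t_i | t_{i-1}); "<s>" marks the sentence start
    "<s>":  {"PRON": 0.6, "VERB": 0.1, "NOUN": 0.3},
    "PRON": {"PRON": 0.1, "VERB": 0.7, "NOUN": 0.2},
    "VERB": {"PRON": 0.2, "VERB": 0.1, "NOUN": 0.7},
    "NOUN": {"PRON": 0.3, "VERB": 0.5, "NOUN": 0.2},
}
emission = {                       # P(w_i | t_i)
    "PRON": {"she": 0.6, "it": 0.4},
    "VERB": {"went": 0.5, "walks": 0.5},
    "NOUN": {"walk": 0.5, "park": 0.5},
}

def sample(dist):
    """Draw one key from a {outcome: probability} dict."""
    r, acc = random.random(), 0.0
    for outcome, p in dist.items():
        acc += p
        if r < acc:
            return outcome
    return outcome                 # guard against floating-point rounding

def generate(length=3):
    """Step 1: sample a tag sequence from the bigram model.
    Step 2: independently emit one word from each sampled tag."""
    tags, words, prev = [], [], "<s>"
    for _ in range(length):
        t = sample(transition[prev])
        w = sample(emission[t])
        tags.append(t)
        words.append(w)
        prev = t
    return words, tags

print(generate())
```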

7 Question 1: Language Modelling
Given an HMM and a string $w_1, \dots, w_T$, what is its likelihood $P(w_1, \dots, w_T)$? Compute it efficiently with the Forward algorithm

8 Question 2: POS Tagging
Given an HMM and a string $w_1, \dots, w_T$, what is the most likely sequence of hidden tags $t_1, \dots, t_T$ that generated it?
$\arg\max_{t_1 \dots t_T} P(w_1, t_1, \dots, w_T, t_T)$
Compute it efficiently with the Viterbi algorithm

9 Question 3: Training (not today)
Train HMM parameters from a given set of POS tags and either:
Annotated training data: maximum likelihood training with smoothing
Unannotated training data: the Forward-Backward algorithm (an instance of EM)
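Since training is only listed here, not covered, a minimal sketch of what maximum likelihood training with add-one smoothing could look like for the annotated case; the toy corpus and all names below are my own assumptions, not the deck's:

```python
from collections import defaultdict

# Toy annotated corpus of (word, tag) pairs -- illustrative only.
corpus = [
    [("she", "PRON"), ("went", "VERB"), ("for", "ADP"), ("a", "DET"), ("walk", "NOUN")],
    [("she", "PRON"), ("walks", "VERB")],
]

tagset = sorted({tag for sent in corpus for _, tag in sent})
vocab = sorted({word for sent in corpus for word, _ in sent})

# Count tag bigrams and (tag, word) emissions.
trans_counts = defaultdict(lambda: defaultdict(int))
emit_counts = defaultdict(lambda: defaultdict(int))
for sent in corpus:
    prev = "<s>"                                  # start-of-sentence pseudo-tag
    for word, tag in sent:
        trans_counts[prev][tag] += 1
        emit_counts[tag][word] += 1
        prev = tag

def p_transition(prev, tag):
    """Add-one smoothed relative-frequency estimate of P(tag | prev)."""
    counts = trans_counts[prev]
    return (counts[tag] + 1) / (sum(counts.values()) + len(tagset))

def p_emission(tag, word):
    """Add-one smoothed estimate of P(word | tag); one extra slot for unknown words."""
    counts = emit_counts[tag]
    return (counts[word] + 1) / (sum(counts.values()) + len(vocab) + 1)

print(p_transition("<s>", "PRON"), p_emission("VERB", "went"))
```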

10 Hidden Markov Model (formally)
A Hidden Markov Model is a 5-tuple consisting of:
a finite set of states $Q = \{q_1, \dots, q_N\}$ (= POS tags)
a finite set of possible observations $O$ (= words)
initial probabilities $a_{0i} = P(X_1 = q_i)$
transition probabilities $a_{ij} = P(X_{t+1} = q_j \mid X_t = q_i)$
emission probabilities $b_i(o) = P(Y_t = o \mid X_t = q_i)$
The HMM describes two coupled random processes:
$X_t = q_i$: at time t, the HMM is in state $q_i$
$Y_t = o$: at time t, the HMM emits observation $o$
Initial, transition, and emission probabilities are true probability distributions, i.e., each must sum to one.
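One way to lay out this 5-tuple as data, as a sketch; the class and field names are my own, not the deck's:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class HMM:
    """Container for the five components of the formal definition."""
    states: List[str]                         # Q = {q_1, ..., q_N}  (e.g. POS tags)
    observations: List[str]                   # O, the possible outputs (e.g. words)
    initial: Dict[str, float]                 # a_0i = P(X_1 = q_i)
    transition: Dict[str, Dict[str, float]]   # a_ij = P(X_{t+1} = q_j | X_t = q_i)
    emission: Dict[str, Dict[str, float]]     # b_i(o) = P(Y_t = o | X_t = q_i)

    def check(self, tol: float = 1e-9) -> None:
        """Each probability table must sum to one, as required above."""
        assert abs(sum(self.initial.values()) - 1.0) < tol
        for q in self.states:
            assert abs(sum(self.transition[q].values()) - 1.0) < tol
            assert abs(sum(self.emission[q].values()) - 1.0) < tol
```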

11 Example: Eisner’s ice cream diary
States: the weather in Baltimore on a given day. Observations: how many ice creams Eisner ate that day. (Figure labels: initial probability $a_{0H}$, transition probability $a_{CH}$, emission probability $b_C(3)$.)

12 HMM models x, y jointly
The coupled random processes of the HMM give us the joint probability $P(x,y)$, where $x = x_1, \dots, x_T$ is the sequence of hidden states and $y = y_1, \dots, y_T$ is the sequence of observations:
$P(x,y) = P(x) \cdot P(y \mid x) = \prod_{t=1}^{T} P(X_t = x_t \mid X_{t-1} = x_{t-1}) \cdot \prod_{t=1}^{T} P(Y_t = y_t \mid X_t = x_t) = \prod_{t=1}^{T} a_{x_{t-1} x_t} \cdot \prod_{t=1}^{T} b_{x_t}(y_t)$
The idea that the probability of the entire state sequence can be captured by just a bigram model is called the "Markov assumption"; words depend only on the current state.
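The product formula translates directly into code. A sketch, reusing the dictionary layout assumed in the earlier snippets (initial $a_{0i}$, transition $a_{ij}$, emission $b_i(o)$):

```python
# Joint probability P(x, y) of one state sequence x and one observation
# sequence y, computed directly from the product formula above.
def joint_probability(x, y, initial, transition, emission):
    assert len(x) == len(y) and len(x) > 0
    p = initial[x[0]] * emission[x[0]][y[0]]       # t = 1: a_{0,x_1} * b_{x_1}(y_1)
    for t in range(1, len(x)):
        p *= transition[x[t - 1]][x[t]]            # a_{x_{t-1} x_t}
        p *= emission[x[t]][y[t]]                  # b_{x_t}(y_t)
    return p
```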

13 Question 1: Likelihood P(y)
How likely is it that Eisner ate 3 ice creams on day 1, 1 ice cream on day 2, and 3 ice creams on day 3? We want to compute P(3,1,3). The definitions let us compute joint probabilities like P(H,3,C,1,H,3) easily, but different state sequences could lead to (3,1,3), so we must sum over all of them.

14 Naïve approach
Sum over all possible state sequences to compute P(3,1,3) (technical term: marginalization):
$P(3,1,3) = \sum_{x_1, x_2, x_3 \in Q} P(x_1, 3, x_2, 1, x_3, 3)$
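For a concrete picture of what this marginalization does, here is a sketch of the naïve enumeration (parameter dicts as in the earlier sketches; exponential in the sequence length, which is exactly the problem the next slide points out):

```python
from itertools import product

def naive_likelihood(y, states, initial, transition, emission):
    """Sum the joint probability over every possible state sequence of length len(y)."""
    total = 0.0
    for x in product(states, repeat=len(y)):        # N^T sequences -- exponential!
        p = initial[x[0]] * emission[x[0]][y[0]]
        for t in range(1, len(y)):
            p *= transition[x[t - 1]][x[t]] * emission[x[t]][y[t]]
        total += p
    return total
```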

15 Naïve approach is too expensive
The naïve approach sums over an exponential number of terms; this is too slow for practical use. Visualize the paths through the hidden states in a trellis (an unfolding of the HMM):
one column for each time point, representing $X_t$
each column contains a copy of all the states in the HMM
edges run from states at time t to states at time t+1 (the transitions of the HMM)
Each path through the trellis is one state sequence, so P(y) is the sum over all paths that emit y.

16–17 Ice cream trellis (figure slides)

18–22 Sentence likelihood (figure slides)

23 The Forward algorithm
In the naïve solution, we compute many intermediate results several times. Central idea: define the forward probability $\alpha_t(j)$ that the HMM outputs $y_1 \dots y_t$ and ends in $X_t = q_j$:
$\alpha_t(j) = P(y_1, \dots, y_t, X_t = q_j) = \sum_{x_1 \dots x_{t-1}} P(y_1, \dots, y_t, X_1 = x_1, \dots, X_{t-1} = x_{t-1}, X_t = q_j)$
So: $P(y_1, \dots, y_T) = \sum_{q \in Q} \alpha_T(q)$

24 The Forward algorithm
$\alpha_t(j) = P(y_1, \dots, y_t, X_t = q_j)$
Base case, t = 1: $\alpha_1(j) = P(y_1, X_1 = q_j) = b_j(y_1) \cdot a_{0j}$
Inductive step for t = 2, ..., T:
$\alpha_t(j) = \sum_{i=1}^{N} P(y_1, \dots, y_{t-1}, X_{t-1} = q_i) \cdot P(X_t = q_j \mid X_{t-1} = q_i) \cdot P(y_t \mid X_t = q_j) = \sum_{i=1}^{N} \alpha_{t-1}(i) \cdot a_{ij} \cdot b_j(y_t)$
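A sketch of this Forward recursion in code, using the same assumed dictionary layout as the earlier snippets (not the deck's own implementation):

```python
def forward(y, states, initial, transition, emission):
    """Return P(y) by filling the trellis column by column.
    alpha[t][j] corresponds to alpha_{t+1}(q_j) in the slides (0-based t here)."""
    # Base case: alpha_1(j) = b_j(y_1) * a_0j
    alpha = [{j: initial[j] * emission[j][y[0]] for j in states}]
    # Inductive step: alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(y_t)
    for t in range(1, len(y)):
        alpha.append({
            j: sum(alpha[t - 1][i] * transition[i][j] for i in states) * emission[j][y[t]]
            for j in states
        })
    # P(y) = sum_q alpha_T(q)
    return sum(alpha[-1][q] for q in states)
```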

25 P(3,1,3) with Forward
Recursion (from the previous slide): $\alpha_1(j) = b_j(y_1) \cdot a_{0j}$, $\alpha_t(j) = \sum_{i=1}^{N} \alpha_{t-1}(i) \cdot a_{ij} \cdot b_j(y_t)$

26–31 P(3,1,3) with Forward (the trellis is filled in step by step)
$\alpha_1(H) = .32$, $\alpha_1(C) = .02$
$\alpha_2(H) = .04$, $\alpha_2(C) = .05$
$\alpha_3(H) = .02$, $\alpha_3(C) = .005$

32 P(3,1,3) with Forward
$P(3,1,3) = \alpha_3(H) + \alpha_3(C) = .026$

33 Question 1: Likelihood P(y)
How likely is it that Eisner ate 3 ice creams on day 1, 1 ice cream on day 2, and 3 ice creams on day 3? $P(3,1,3) = \alpha_3(H) + \alpha_3(C) = .026$. Use the Forward algorithm to sum over the different paths (= weather patterns) efficiently.

34 Question 2: Tagging
Given observations $y_1, \dots, y_T$, what is the most likely sequence of hidden states $x_1, \dots, x_T$?
$\max_{x_1 \dots x_T} P(x_1, \dots, x_T \mid y_1, \dots, y_T)$
We are only interested in the most likely sequence of states (not really its probability):
$\arg\max_{x_1 \dots x_T} P(x_1, \dots, x_T \mid y_1, \dots, y_T) = \arg\max_{x_1 \dots x_T} \frac{P(x_1, \dots, x_T, y_1, \dots, y_T)}{P(y_1, \dots, y_T)} = \arg\max_{x_1 \dots x_T} P(x_1, \dots, x_T, y_1, \dots, y_T)$
What does this question mean in Eisner's model?

35 Naïve solution

36 Parallelism
Likelihood (Question 1) | Tagging (Question 2)
$P(y)$ | $\arg\max_x P(x,y)$
Forward algorithm | Viterbi algorithm
$\sum_{x_1 \dots x_T} P(x_1, \dots, x_T, y_1, \dots, y_T)$ | $\arg\max_{x_1 \dots x_T} P(x_1, \dots, x_T, y_1, \dots, y_T)$
$\alpha_t(j) = \sum_{x_1 \dots x_{t-1}} P(y_1, \dots, y_t, x_1, \dots, x_{t-1}, X_t = q_j)$ | $V_t(j) = \max_{x_1 \dots x_{t-1}} P(y_1, \dots, y_t, x_1, \dots, x_{t-1}, X_t = q_j)$
$P(y) = \sum_{q \in Q} \alpha_T(q)$ | $\max_x P(x,y) = \max_{q \in Q} V_T(q)$

37 The Viterbi algorithm
$V_t(j) = \max_{x_1 \dots x_{t-1}} P(y_1, \dots, y_t, x_1, \dots, x_{t-1}, X_t = q_j)$
Base case, t = 1: $V_1(j) = b_j(y_1) \cdot a_{0j}$
Inductive step, t = 2, ..., T: $V_t(j) = \max_{1 \le i \le N} V_{t-1}(i) \cdot b_j(y_t) \cdot a_{ij}$
For each state and time step (j, t), remember the i for which the maximum was achieved as a backpointer $bp_t(j)$.
Retrieve the optimal tag sequence by following the backpointers from T back to 1.
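A sketch of the Viterbi recursion with backpointers, in the same assumed dictionary layout as the earlier snippets (not the deck's own code):

```python
def viterbi(y, states, initial, transition, emission):
    """Return (best state sequence, its joint probability max_x P(x, y))."""
    # Base case: V_1(j) = b_j(y_1) * a_0j
    V = [{j: initial[j] * emission[j][y[0]] for j in states}]
    backptr = [{}]                                    # no predecessor at t = 1
    # Inductive step: V_t(j) = max_i V_{t-1}(i) * b_j(y_t) * a_ij
    for t in range(1, len(y)):
        V.append({})
        backptr.append({})
        for j in states:
            best_i = max(states, key=lambda i: V[t - 1][i] * transition[i][j])
            V[t][j] = V[t - 1][best_i] * transition[best_i][j] * emission[j][y[t]]
            backptr[t][j] = best_i                    # remember where the max came from
    # Follow the backpointers from the best final state back to t = 1.
    best_last = max(states, key=lambda q: V[-1][q])
    path = [best_last]
    for t in range(len(y) - 1, 0, -1):
        path.append(backptr[t][path[-1]])
    path.reverse()
    return path, V[-1][best_last]
```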

38 P(x,3,1,3) with Viterbi
$V_1(j) = b_j(y_1) \cdot a_{0j}$, $V_t(j) = \max_{1 \le i \le N} V_{t-1}(i) \cdot b_j(y_t) \cdot a_{ij}$

39 P(x,3,1,3) with Viterbi
$V_1(H) = .32$, $V_2(H) = .045$, $V_3(H) = .012$
$V_1(C) = .02$, $V_2(C) = .048$, $V_3(C) = .003$
$V_1(j) = b_j(y_1) \cdot a_{0j}$, $V_t(j) = \max_{1 \le i \le N} V_{t-1}(i) \cdot b_j(y_t) \cdot a_{ij}$

40 Runtime
Forward and Viterbi have the same runtime, dominated by the inductive step: $V_t(j) = \max_{1 \le i \le N} V_{t-1}(i) \cdot b_j(y_t) \cdot a_{ij}$
We compute $V_t(j)$ N·T times, and each computation iterates over N predecessor states i.
Total runtime is O(N²T): linear in the sentence length, quadratic in the number of states (tags).

41 Summary
Hidden Markov Models are a popular model for POS tagging and other tasks (e.g., dialog act tagging; see the research talk tomorrow!)
HMM = two coupled random processes: a bigram model over hidden states, and a model generating observable outputs from those states
Efficient algorithms for two common problems: the Forward algorithm for likelihood computation, and the Viterbi algorithm for tagging (= best state sequence)
Can anyone give a one-minute summary?

42 Eisner’s ice cream HMM

