
1 Part II. Statistical NLP. Advanced Artificial Intelligence: Hidden Markov Models. Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme. Most slides taken (or adapted) from David Meir Blei; figures from Manning and Schuetze and from Rabiner.

2 Contents
• Markov Models
• Hidden Markov Models: three problems, three algorithms (decoding, Viterbi, Baum-Welch)
• Next chapter: application to part-of-speech tagging (POS tagging)
Largely Chapter 9 of Statistical NLP (Manning and Schuetze), or Rabiner, "A tutorial on HMMs and selected applications in speech recognition", Proc. IEEE.

3 Motivations and Applications
• Part-of-speech tagging / sequence tagging. Example: "The representative put chairs on the table" can be tagged AT NN VBD NNS IN AT NN or AT JJ NN VBZ IN AT NN.
• Some tags: AT: article, NN: singular or mass noun, VBD: verb, past tense, NNS: plural noun, IN: preposition, JJ: adjective, VBZ: verb, 3rd person singular present.

4 Bioinformatics
• Durbin et al., Biological Sequence Analysis, Cambridge University Press.
• Several applications, e.g. proteins: from the primary structure ATCPLELLLD, infer the secondary structure HHHBBBBBC…

5 Other Applications
• Speech recognition: from acoustic signals, infer the sentence.
• Robotics: from sensory readings, infer the trajectory / location …

6 What is a (Visible) Markov Model?
• Graphical model (can be interpreted as a Bayesian net)
• Circles indicate states
• Arrows indicate probabilistic dependencies between states
• The state depends only on the previous state
• "The past is independent of the future given the present."
• Recall from the introduction to N-grams!

7 Markov Model Formalization (Figure: chain of states S → S → S → S → S.)
A Markov model is a tuple (S, Π) with transition matrix A:
• S: {s_1 … s_N} are the values for the states
• Limited horizon (Markov assumption): P(X_{t+1} = s_k | X_1, …, X_t) = P(X_{t+1} = s_k | X_t)
• Time invariant (stationary): this probability does not depend on t
• Transition matrix A

8 Markov Model Formalization (S, Π, A):
• S: {s_1 … s_N} are the values for the states
• Π = {π_i} are the initial state probabilities
• A = {a_ij} are the state transition probabilities

9 What is the probability of a sequence of states? By the Markov assumption,
P(X_1, …, X_T) = P(X_1) · P(X_2 | X_1) · … · P(X_T | X_{T-1}) = π_{X_1} · a_{X_1 X_2} · … · a_{X_{T-1} X_T}
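To make this chain-rule computation concrete, here is a minimal Python sketch; Python and the toy two-state weather numbers are my own illustration, not from the slides.

```python
def sequence_probability(states_seq, pi, A):
    """Probability of a state sequence in a visible Markov model: pi[X_1] * prod of A[X_t][X_{t+1}]."""
    p = pi[states_seq[0]]
    for prev, cur in zip(states_seq, states_seq[1:]):
        p *= A[prev][cur]
    return p

# Toy two-state example (made-up numbers, purely illustrative):
pi_toy = {"rain": 0.4, "sun": 0.6}
A_toy = {"rain": {"rain": 0.7, "sun": 0.3},
         "sun":  {"rain": 0.2, "sun": 0.8}}
print(sequence_probability(["sun", "sun", "rain"], pi_toy, A_toy))  # 0.6 * 0.8 * 0.2 = 0.096
```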

10 What is an HMM? HMM = Hidden Markov Model.
• Graphical model
• Circles indicate states
• Arrows indicate probabilistic dependencies between states

11 What is an HMM?  Green circles are hidden states  Dependent only on the previous state

12 What is an HMM?  Purple nodes are observed states  Dependent only on their corresponding hidden state  The past is independent of the future given the present

13 HMM Formalism (S, K, Π, A, B):
• S: {s_1 … s_N} are the values for the hidden states
• K: {k_1 … k_M} are the values for the observations
(Figure: chain of hidden states S, each emitting an observation K.)

14 HMM Formalism (S, K, Π, A, B):
• Π = {π_i} are the initial state probabilities
• A = {a_ij} are the state transition probabilities
• B = {b_ik} are the observation (emission) probabilities
Note: sometimes one uses B = {b_ijk}; the output then depends on the previous state / the transition as well.

15 The crazy soft drink machine (Fig. 9.2 in Manning and Schuetze).

16 Probability of {lem, ice}?
• Sum over all paths taken through the HMM.
• Starting in CP: 1 × 0.3 × 0.7 × 0.1 + 1 × 0.3 × 0.3 × 0.7 = 0.021 + 0.063 = 0.084
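A brute-force sketch of this sum over paths. The transition and emission numbers below are the ones implied by the slide's arithmetic and, as far as I can tell, by Fig. 9.2 of Manning and Schuetze; treat the full parameter table (and the Python itself) as an assumption.

```python
from itertools import product

# Crazy soft drink machine (parameters assumed from the slide / Fig. 9.2).
states = ["CP", "IP"]                       # cola-preferring, iced-tea-preferring
pi = {"CP": 1.0, "IP": 0.0}                 # the machine starts in CP
A = {"CP": {"CP": 0.7, "IP": 0.3},
     "IP": {"CP": 0.5, "IP": 0.5}}
B = {"CP": {"cola": 0.6, "ice": 0.1, "lem": 0.3},
     "IP": {"cola": 0.1, "ice": 0.7, "lem": 0.2}}

def brute_force_likelihood(obs):
    """P(obs) = sum over all hidden state paths of P(path) * P(obs | path)."""
    total = 0.0
    for path in product(states, repeat=len(obs)):
        p = pi[path[0]] * B[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
        total += p
    return total

print(brute_force_likelihood(["lem", "ice"]))
# 1*0.3*0.7*0.1 + 1*0.3*0.3*0.7 = 0.084, matching the slide
```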

17 HMMs and Bayesian Nets (1) (Figure: Bayesian-network view of an HMM: hidden chain x_1 → … → x_T, each x_t emitting an observation o_t.)

18 HMM and Bayesian Nets (2) Because of d-separation, the future (x_{t+1}, …, x_T and o_{t+1}, …, o_T) is conditionally independent of the past (x_1, …, x_{t-1} and o_1, …, o_t) given the present state x_t. "The past is independent of the future given the present."

19 Inference in an HMM  Compute the probability of a given observation sequence  Given an observation sequence, compute the most likely hidden state sequence  Given an observation sequence and set of possible models, which model most closely fits the data?

20 Decoding  Given an observation sequence and a model, compute the probability of the observation sequence. (Figure: trellis of hidden states and observations o_1 … o_T.)

21–25 Decoding (derivation)
• P(O | X, μ) = b_{x_1 o_1} · b_{x_2 o_2} · … · b_{x_T o_T}
• P(X | μ) = π_{x_1} · a_{x_1 x_2} · … · a_{x_{T-1} x_T}
• P(O, X | μ) = P(O | X, μ) · P(X | μ)
• P(O | μ) = Σ_X P(O | X, μ) · P(X | μ)
Summing naively over all N^T state sequences is exponential; dynamic programming (the forward procedure) brings this down to O(N^2 T).

26 Dynamic Programming

27 Forward Procedure  The special structure gives us an efficient solution using dynamic programming.  Intuition: the probability of the first t observations is the same for all possible t+1 length state sequences.  Define: α_i(t) = P(o_1 … o_t, X_t = i | μ). (Figure: trellis of hidden states x_1 … x_T and observations o_1 … o_T.)

28–35 Forward Procedure (continued)
• Initialization: α_i(1) = π_i · b_{i o_1}
• Induction: α_j(t+1) = ( Σ_i α_i(t) · a_ij ) · b_{j o_{t+1}}
• Termination: P(O | μ) = Σ_i α_i(T)
Each induction step reuses the α values from the previous time slice, so the whole computation is O(N^2 T).
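A sketch of the forward recursion in Python, reusing the (assumed) soft-drink-machine dictionaries states, pi, A, B from the earlier sketch; the function name and structure are mine, not the lecture's.

```python
def forward(obs, states, pi, A, B):
    """alpha[t][i] = P(o_1 .. o_{t+1}, X_{t+1} = i); returns (alpha, P(obs))."""
    alpha = [{i: pi[i] * B[i][obs[0]] for i in states}]                  # initialization
    for t in range(1, len(obs)):                                         # induction
        alpha.append({j: sum(alpha[t - 1][i] * A[i][j] for i in states) * B[j][obs[t]]
                      for j in states})
    return alpha, sum(alpha[-1][i] for i in states)                      # termination

alpha, likelihood = forward(["lem", "ice"], states, pi, A, B)
print(likelihood)   # 0.084 again, now in O(N^2 * T) instead of O(N^T)
```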

36 Dynamic Programming

37 Backward Procedure  β_i(t) = P(o_{t+1} … o_T | X_t = i): the probability of the rest of the observation sequence given the state at time t.
• Initialization: β_i(T) = 1
• Induction: β_i(t) = Σ_j a_ij · b_{j o_{t+1}} · β_j(t+1)

38

39 Decoding Solution
• Forward procedure: P(O | μ) = Σ_i α_i(T)
• Backward procedure: P(O | μ) = Σ_i π_i · b_{i o_1} · β_i(1)
• Combination: P(O | μ) = Σ_i α_i(t) · β_i(t), for any t
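A matching backward-procedure sketch, again assuming the toy parameters and the forward() sketch defined earlier; the loop at the end illustrates the combination, i.e. that Σ_i α_i(t) β_i(t) gives the same likelihood at every t.

```python
def backward(obs, states, A, B):
    """beta[t][i] = P(o_{t+2} .. o_T | X_{t+1} = i)."""
    T = len(obs)
    beta = [dict() for _ in range(T)]
    beta[T - 1] = {i: 1.0 for i in states}                               # initialization
    for t in range(T - 2, -1, -1):                                       # induction
        beta[t] = {i: sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in states)
                   for i in states}
    return beta

obs = ["lem", "ice"]
alpha, _ = forward(obs, states, pi, A, B)
beta = backward(obs, states, A, B)
for t in range(len(obs)):
    print(sum(alpha[t][i] * beta[t][i] for i in states))                 # 0.084 at every t
```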

40

41 Best State Sequence
• Find the state sequence that best explains the observations.
• Two approaches: individually most likely states; most likely sequence (Viterbi).

42 Best State Sequence (1)  Individually most likely states: γ_i(t) = P(X_t = i | O, μ) = α_i(t) · β_i(t) / Σ_j α_j(t) · β_j(t); for each t, choose the state X*_t = argmax_i γ_i(t).
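A short sketch of this first approach, built on the forward() and backward() sketches above (toy parameters assumed as before). Note that picking the argmax at each time step independently can produce a sequence the transition model would never generate, which motivates Viterbi on the next slides.

```python
def posterior_states(obs, states, pi, A, B):
    """gamma[t][i] = P(X_{t+1} = i | obs); pick the argmax at each time step independently."""
    alpha, likelihood = forward(obs, states, pi, A, B)
    beta = backward(obs, states, A, B)
    gamma = [{i: alpha[t][i] * beta[t][i] / likelihood for i in states}
             for t in range(len(obs))]
    return [max(g, key=g.get) for g in gamma], gamma

print(posterior_states(["lem", "ice"], states, pi, A, B)[0])   # ['CP', 'IP'] for the toy model
```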

43 Best State Sequence (2)
• Find the state sequence that best explains the observations.
• Viterbi algorithm.

44 Viterbi Algorithm  Define δ_j(t) = max over x_1 … x_{t-1} of P(x_1 … x_{t-1}, o_1 … o_{t-1}, X_t = j, o_t): the probability of the state sequence which maximizes the probability of seeing the observations to time t-1, landing in state j, and seeing the observation at time t.

45 Viterbi Algorithm: recursive computation
• δ_j(t+1) = max_i δ_i(t) · a_ij · b_{j o_{t+1}}
• ψ_j(t+1) = argmax_i δ_i(t) · a_ij (backpointer)

46 Viterbi Algorithm  Compute the most likely state sequence by working backwards: X*_T = argmax_i δ_i(T), then X*_t = ψ_{X*_{t+1}}(t+1) for t = T-1, …, 1.
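A Viterbi sketch over the same assumed toy model; the helper names and the dictionary-based representation are mine.

```python
def viterbi(obs, states, pi, A, B):
    """Most likely hidden state sequence (delta/psi recursion with backtracking)."""
    delta = [{i: pi[i] * B[i][obs[0]] for i in states}]
    psi = [{}]
    for t in range(1, len(obs)):
        delta.append({})
        psi.append({})
        for j in states:
            best_i = max(states, key=lambda i: delta[t - 1][i] * A[i][j])
            psi[t][j] = best_i                                     # backpointer
            delta[t][j] = delta[t - 1][best_i] * A[best_i][j] * B[j][obs[t]]
    last = max(states, key=lambda i: delta[-1][i])                 # best final state
    path = [last]
    for t in range(len(obs) - 1, 0, -1):                           # backtrack
        path.insert(0, psi[t][path[0]])
    return path, delta[-1][last]

print(viterbi(["lem", "ice"], states, pi, A, B))                   # (['CP', 'IP'], 0.063)
```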

47 HMMs and Bayesian Nets (1) (Figure: Bayesian-network view of an HMM: hidden chain x_1 → … → x_T, each x_t emitting an observation o_t.)

48 HMM and Bayesian Nets (2) Because of d-separation, the future (x_{t+1}, …, x_T and o_{t+1}, …, o_T) is conditionally independent of the past (x_1, …, x_{t-1} and o_1, …, o_t) given the present state x_t. "The past is independent of the future given the present."

49 Inference in an HMM  Compute the probability of a given observation sequence  Given an observation sequence, compute the most likely hidden state sequence  Given an observation sequence and set of possible models, which model most closely fits the data?

50 Dynamic Programming

51 Parameter Estimation  Given an observation sequence, find the model that is most likely to produce that sequence. There is no analytic method. Instead, given a model and an observation sequence, iteratively update the model parameters to better fit the observations (Baum-Welch).

52

53 Parameter Estimation
• Probability of traversing an arc at time t: p_t(i, j) = P(X_t = i, X_{t+1} = j | O, μ) = α_i(t) · a_ij · b_{j o_{t+1}} · β_j(t+1) / P(O | μ)
• Probability of being in state i at time t: γ_i(t) = Σ_j p_t(i, j) = α_i(t) · β_i(t) / P(O | μ)

54 Parameter Estimation  Now we can compute the new estimates of the model parameters from the expected counts:
• π'_i = γ_i(1)
• a'_ij = Σ_{t=1}^{T-1} p_t(i, j) / Σ_{t=1}^{T-1} γ_i(t)
• b'_ik = Σ_{t: o_t = k} γ_i(t) / Σ_{t=1}^{T} γ_i(t)
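A one-iteration Baum-Welch sketch built on the forward() and backward() sketches above (toy parameters assumed as before). In practice one would iterate to convergence, train on several longer sequences, and guard the denominators against zero, but this shows the E-step / M-step structure.

```python
def baum_welch_step(obs, states, pi, A, B):
    """One EM (Baum-Welch) update on a single sequence; returns re-estimated (pi, A, B)."""
    T = len(obs)
    alpha, likelihood = forward(obs, states, pi, A, B)
    beta = backward(obs, states, A, B)
    # E-step: gamma[t][i] = P(X=i at t | obs), xi[t][i][j] = P(X=i at t, X=j at t+1 | obs)
    gamma = [{i: alpha[t][i] * beta[t][i] / likelihood for i in states} for t in range(T)]
    xi = [{i: {j: alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / likelihood
               for j in states} for i in states} for t in range(T - 1)]
    # M-step: expected counts -> new parameters
    new_pi = {i: gamma[0][i] for i in states}
    new_A = {i: {j: sum(xi[t][i][j] for t in range(T - 1)) /
                    sum(gamma[t][i] for t in range(T - 1))
                 for j in states} for i in states}
    symbols = set(B[states[0]])
    new_B = {i: {k: sum(gamma[t][i] for t in range(T) if obs[t] == k) /
                    sum(gamma[t][i] for t in range(T))
                 for k in symbols} for i in states}
    return new_pi, new_A, new_B

new_pi, new_A, new_B = baum_welch_step(["cola", "lem", "ice", "ice"], states, pi, A, B)
```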

55 Instance of Expectation Maximization  We have that P(O | μ') ≥ P(O | μ): each re-estimation step does not decrease the likelihood of the observations.  We may get stuck in a local maximum (or even a saddle point).  Nevertheless, Baum-Welch is usually effective.

56 Some Variants
• So far, ergodic models: all states are connected; not always wanted.
• Epsilon or null transitions: not all states/transitions emit output symbols.
• Parameter tying: assuming that certain parameters are shared reduces the number of parameters that have to be estimated.
• Logical HMMs (Kersting, De Raedt, Raiko): working with structured states and observation symbols.
• Working with log probabilities and addition instead of multiplication of probabilities (typically done).

57 The Most Important Thing  We can use the special structure of this model to do a lot of neat math and solve problems that are otherwise not solvable.

58 HMMs from an Agent Perspective  AI: A Modern Approach: AI is the study of rational agents.  Third part of the course, by Wolfram Burgard, on reinforcement learning.  HMMs can also be used here; typically one is interested in P(state).

59 Example  Possible states: {snow, no snow}.  Observations: {skis, no skis}.  Questions: Was there snow the day before yesterday (given a sequence of observations)? Is there snow now (given a sequence of observations)? Will there be snow tomorrow, given a sequence of observations? Next week?

60 HMM and Agents  Question: compute P(X_t | o_1 … o_T).  Case 1: t < T, often called smoothing: see last time; only the part of the trellis between t and T is needed.

61 HMM and Agents  Case 2: t = T, often called filtering: the belief about the last time step.  Can we make it recursive, i.e. go from T-1 to T?

62 HMM and Agents  Case 2: t = T, often called filtering (continued): P(X_T = i | o_1 … o_T) ∝ b_{i o_T} · Σ_j a_ji · P(X_{T-1} = j | o_1 … o_{T-1}), so the belief can be updated recursively as each new observation arrives.

63 HMM and Agents  Case 3: t = T+1 (or T+k), not yet seen, often called prediction: P(X_{T+1} = i | o_1 … o_T) = Σ_j a_ji · P(X_T = j | o_1 … o_T).  Interesting: recursive, and easily extended towards k > 1 by applying the transition matrix repeatedly.
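A filtering/prediction sketch over the same assumed toy model; the recursive update and the k-step prediction mirror the formulas above, and the function names are mine.

```python
def filter_step(belief, obs_t, states, A, B):
    """One recursive filtering update: P(X_t | o_1..o_t) from P(X_{t-1} | o_1..o_{t-1}) and o_t."""
    unnorm = {i: B[i][obs_t] * sum(A[j][i] * belief[j] for j in states) for i in states}
    z = sum(unnorm.values())
    return {i: p / z for i, p in unnorm.items()}

def predict(belief, k, states, A):
    """k-step prediction: push the current belief through the transition matrix k times."""
    for _ in range(k):
        belief = {i: sum(A[j][i] * belief[j] for j in states) for i in states}
    return belief

obs = ["lem", "ice"]
belief = {i: pi[i] * B[i][obs[0]] for i in states}       # t = 1: condition pi on o_1
z = sum(belief.values())
belief = {i: p / z for i, p in belief.items()}
for o in obs[1:]:                                        # filtering: t = 2 .. T
    belief = filter_step(belief, o, states, A, B)
print(belief)                                            # P(X_T | o_1 .. o_T)
print(predict(belief, 1, states, A))                     # P(X_{T+1} | o_1 .. o_T)
```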

64 Extensions
• Use dynamic Bayesian networks instead of HMMs: one state corresponds to a Bayesian net; observations can become more complex.
• Involve actions of the agent as well (cf. Wolfram Burgard's part).

