Published by Theresa Wilkerson. Modified over 4 years ago.

1 Hidden Markov Models (HMMs)
Probabilistic automata
Ubiquitous in speech and speaker recognition/verification
Suitable for modelling phenomena which are dynamic in nature
Can be used for handwriting and keystroke biometrics

2 Classification with Static Features
Simpler than the dynamic problem
Can use, for example, MLPs
E.g. in two-dimensional space: two classes of points, marked x and o

3 Hidden Markov Models (HMMs)
First: Visible Markov Models (VMMs)
– Formal definition
– Recognition
– Training
HMMs
– Formal definition
– Recognition
– Training
Trellis algorithms
– Forward-Backward
– Viterbi

4 Visible Markov Models
Probabilistic automaton
N distinct states $S = \{s_1, \dots, s_N\}$
M-element output alphabet $K = \{k_1, \dots, k_M\}$
Initial state probabilities $\Pi = \{\pi_i\}$, $i \in S$
State transition at $t = 1, 2, \dots$
State transition probabilities $A = \{a_{ij}\}$, $i, j \in S$
State sequence $X = \{X_1, \dots, X_T\}$, $X_t \in S$
Output sequence $O = \{o_1, \dots, o_T\}$, $o_t \in K$

5 VMM: Weather Example

6 Generative VMM
We choose the state sequence probabilistically…
We could try this using:
– the numbers 1-10
– drawing from a hat
– an ad-hoc assignment scheme
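Drawing numbers from a hat can be imitated with a pseudo-random generator. The following is a minimal sketch, assuming a two-state weather model; the state names and all probabilities are illustrative assumptions, since the figure for the weather example is not available here.

```python
import random

# Hypothetical two-state weather model. These values are assumptions made
# for illustration; they are not taken from the slides.
STATES = ["sunny", "rainy"]
PI = [0.6, 0.4]        # PI[i] = P(X_1 = i)
A = [[0.8, 0.2],       # A[i][j] = P(X_{t+1} = j | X_t = i)
     [0.4, 0.6]]

def sample_vmm(pi, a, length, rng):
    """Draw a state sequence of the given length from a visible Markov model."""
    state = rng.choices(range(len(pi)), weights=pi)[0]
    sequence = [state]
    for _ in range(length - 1):
        # The next state depends only on the current state.
        state = rng.choices(range(len(pi)), weights=a[state])[0]
        sequence.append(state)
    return sequence

rng = random.Random(0)  # fixed seed so the draw is repeatable
seq = sample_vmm(PI, A, 10, rng)
print([STATES[s] for s in seq])
```

In a visible model the sampled state sequence is itself the observation sequence.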

7 Two Questions
Training problem
– Given an observation sequence O and a “space” of possible models which spans possible values for the model parameters $w = \{A, \Pi\}$, how do we find the model that best explains the observed data?
Recognition (decoding) problem
– Given a model $w_i = \{A, \Pi\}$, how do we compute how likely a certain observation is, i.e. $P(O \mid w_i)$?

8 Training VMMs
Given observation sequences $O_s$, we want to find model parameters $w = \{A, \Pi\}$ which best explain the observations
I.e. we want to find values for $w = \{A, \Pi\}$ that maximise $P(O \mid w)$:
$\{A, \Pi\}_{\text{chosen}} = \arg\max_{\{A, \Pi\}} P(O \mid \{A, \Pi\})$

9 Training VMMs
Straightforward for VMMs:
$\pi_i$ = frequency in state i at time t = 1
$a_{ij} = \frac{\text{number of transitions from state } i \text{ to state } j}{\text{number of transitions from state } i} = \frac{\text{number of transitions from state } i \text{ to state } j}{\text{number of times in state } i}$
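The counting estimates can be sketched in a few lines. This is a minimal illustration, assuming states are encoded as integers 0..N-1 and that training data is a list of state sequences; the function name `train_vmm` and the toy data are hypothetical.

```python
from collections import Counter

def train_vmm(sequences, n_states):
    """Estimate (pi, A) from observed state sequences by relative frequency."""
    start_counts = Counter(seq[0] for seq in sequences)
    trans_counts = Counter()
    for seq in sequences:
        # Count every adjacent pair (state at t, state at t+1).
        trans_counts.update(zip(seq, seq[1:]))

    pi = [start_counts[i] / len(sequences) for i in range(n_states)]
    a = []
    for i in range(n_states):
        row_total = sum(trans_counts[(i, j)] for j in range(n_states))
        # A state that is never left gets an all-zero row in this sketch;
        # smoothing would be one way to handle that (an assumption, not
        # something the slides discuss).
        a.append([trans_counts[(i, j)] / row_total if row_total else 0.0
                  for j in range(n_states)])
    return pi, a

data = [[0, 0, 1], [0, 1, 1], [1, 1, 0]]  # toy state sequences
pi, a = train_vmm(data, 2)
print(pi, a)
```

The denominator here counts transitions out of state i, so the final state of each sequence is not included in it.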

10 Recognition
We need to calculate $P(O \mid w_i)$
$P(O \mid w_i)$ is handy for calculating $P(w_i \mid O)$
If we have a set of models $L = \{w_1, w_2, \dots, w_V\}$ and we can calculate $P(w_i \mid O)$, we can choose the model which returns the highest probability, i.e.
$w_{\text{chosen}} = \arg\max_{w_i \in L} P(w_i \mid O)$

11 Recognition
Why is $P(O \mid w_i)$ of use?
Let’s revisit speech for a moment. In speech we are given a sequence of observations, e.g. a series of MFCC vectors
– E.g. MFCCs taken from frames of length 20-40 ms, every 10-20 ms
If we have a set of models $L = \{w_1, w_2, \dots, w_V\}$ and we can calculate $P(w_i \mid O)$, we can choose the model which returns the highest probability, i.e.
$w_{\text{chosen}} = \arg\max_{w_i \in L} P(w_i \mid O)$

12 $w_{\text{chosen}} = \arg\max_{w_i \in L} P(w_i \mid O)$
$P(w_i \mid O)$ is difficult to calculate directly, as we would need a model for every possible observation sequence O
Use Bayes’ rule: $P(x \mid y) = P(y \mid x) P(x) / P(y)$
So now we have $w_{\text{chosen}} = \arg\max_{w_i \in L} P(O \mid w_i) P(w_i) / P(O)$
$P(w_i)$ can be easily calculated
$P(O)$ is the same for each candidate model and so can be ignored
So $P(O \mid w_i)$ is the key!
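The decision rule can be sketched for visible Markov models, where the observation sequence is the state sequence. This is a minimal sketch: the two models, their names, and the uniform priors are assumptions chosen for illustration. Log-probabilities are used to avoid underflow on long sequences, and $P(O)$ is dropped because it does not affect the argmax.

```python
import math

def vmm_log_likelihood(seq, pi, a):
    """log P(O | w) for a visible Markov model (O is the state sequence)."""
    logp = math.log(pi[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        logp += math.log(a[prev][cur])
    return logp

def classify(seq, models, priors):
    """Return the model name maximising P(O | w_i) P(w_i), plus all log scores."""
    scores = {name: vmm_log_likelihood(seq, pi, a) + math.log(priors[name])
              for name, (pi, a) in models.items()}
    return max(scores, key=scores.get), scores

# Hypothetical models, each given as (pi, A). All probabilities are non-zero,
# so math.log is always defined in this example.
models = {
    "mostly_dry": ([0.8, 0.2], [[0.9, 0.1], [0.5, 0.5]]),
    "mostly_wet": ([0.3, 0.7], [[0.4, 0.6], [0.2, 0.8]]),
}
priors = {"mostly_dry": 0.5, "mostly_wet": 0.5}  # assumed uniform

best, scores = classify([1, 1, 1, 0], models, priors)
print(best)
```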

13 Hidden Markov Models
Probabilistic automaton
N distinct states $S = \{s_1, \dots, s_N\}$
M-element output alphabet $K = \{k_1, \dots, k_M\}$
Initial state probabilities $\Pi = \{\pi_i\}$, $i \in S$
State transition at $t = 1, 2, \dots$
State transition probabilities $A = \{a_{ij}\}$, $i, j \in S$
Symbol emission probabilities $B = \{b_{ik}\}$, $i \in S$, $k \in K$
State sequence $X = \{X_1, \dots, X_T\}$, $X_t \in S$
Output sequence $O = \{o_1, \dots, o_T\}$, $o_t \in K$

14 HMM: Weather Example

15 State Emission Distributions
Discrete probability distribution

16 State Emission Distributions
Continuous probability distribution

17 Generative HMM
Now we not only choose the state sequence probabilistically…
…but also the state emissions
Try this yourself using the numbers 1-10 and drawing from a hat...
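The same exercise can be run in code. The sketch below assumes a discrete-emission HMM with two weather states and three output symbols; the names and every probability are hypothetical, chosen only to make the example concrete.

```python
import random

# Hypothetical HMM; none of these values come from the slides.
STATES = ["sunny", "rainy"]
SYMBOLS = ["walk", "shop", "clean"]
PI = [0.6, 0.4]                 # initial state probabilities
A = [[0.7, 0.3], [0.4, 0.6]]    # A[i][j] = P(next state j | state i)
B = [[0.6, 0.3, 0.1],           # B[i][k] = P(symbol k | state i)
     [0.1, 0.4, 0.5]]

def sample_hmm(pi, a, b, length, rng):
    """Draw hidden states and emitted symbols from a discrete HMM."""
    states, outputs = [], []
    state = rng.choices(range(len(pi)), weights=pi)[0]
    for _ in range(length):
        states.append(state)
        # The emission depends only on the current hidden state.
        outputs.append(rng.choices(range(len(b[state])), weights=b[state])[0])
        # Move to the next hidden state (the draw after the last step is unused).
        state = rng.choices(range(len(pi)), weights=a[state])[0]
    return states, outputs

rng = random.Random(0)
hidden, observed = sample_hmm(PI, A, B, 8, rng)
print([SYMBOLS[k] for k in observed])  # an observer sees only these symbols
```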

18 Three Questions
Recognition (decoding) problem
– Given a model $w_i = \{A, B, \Pi\}$, how do we compute how likely a certain observation is, i.e. $P(O \mid w_i)$?
State sequence?
– Given the observation sequence and a model, how do we choose a state sequence $X = \{X_1, \dots, X_T\}$ that best explains the observations?
Training problem
– Given an observation sequence O and a “space” of possible models which spans possible values for the model parameters $w = \{A, B, \Pi\}$, how do we find the model that best explains the observed data?

19 Computing $P(O \mid w)$
For any particular state sequence $X = \{X_1, \dots, X_T\}$ we have [equation missing] and [equation missing]
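The expressions on this slide are not available. Assuming the state-emission convention of the HMM definition ($b_{ik}$ is the probability that state $i$ emits symbol $k$), the standard forms would be as follows; this is a reconstruction from the usual textbook treatment and may differ from the slide in notation.

```latex
\[
P(O \mid X, w) = \prod_{t=1}^{T} b_{X_t o_t}
\qquad\text{and}\qquad
P(X \mid w) = \pi_{X_1} \prod_{t=1}^{T-1} a_{X_t X_{t+1}}
\]
\[
P(O \mid w) = \sum_{X} P(O \mid X, w)\, P(X \mid w)
            = \sum_{X_1, \dots, X_T} \pi_{X_1} b_{X_1 o_1}
              \prod_{t=1}^{T-1} a_{X_t X_{t+1}}\, b_{X_{t+1} o_{t+1}}
\]
```

Each term of the sum is a product of 2T factors, and the sum runs over all $N^T$ state sequences.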

20 Computing $P(O \mid w)$
This requires $(2T) N^T$ multiplications
Very inefficient!
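To see where the cost comes from, the sum over every state sequence can be written out directly. This is a minimal sketch assuming a discrete state-emission HMM with hypothetical parameters; it is only practical for very small N and T.

```python
from itertools import product

# Hypothetical two-state, three-symbol HMM (illustrative values only).
PI = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.6, 0.3, 0.1], [0.1, 0.4, 0.5]]

def brute_force_likelihood(obs, pi, a, b):
    """P(O | w) by enumerating all N**T state sequences."""
    n = len(pi)
    total = 0.0
    for path in product(range(n), repeat=len(obs)):
        # Probability of this path and of emitting obs along it.
        p = pi[path[0]] * b[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= a[path[t - 1]][path[t]] * b[path[t]][obs[t]]
        total += p
    return total

likelihood = brute_force_likelihood([0, 1], PI, A, B)
print(likelihood)
```

The outer loop visits $N^T$ paths and the inner loop multiplies about 2T factors per path, which matches the count on the slide.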

21 Trellis Algorithms
Array of states vs. time

22 Trellis Algorithms
Overlap in paths implies repetition of the same calculations
Harness the overlap to make calculations efficient
A node at $(s_i, t)$ stores info about state sequences that contain $X_t = s_i$

23 Trellis Algorithms
Consider 2 states and 3 time points:

24 Forward Algorithm
A node at $(s_i, t)$ stores info about state sequences up to time t that arrive at $s_i$
[Figure: trellis nodes $s_1$, $s_2$, $s_j$]

25
25 Forward Algorithm
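The recursion on this slide is not available. A minimal sketch of the standard forward recursion follows, assuming state emissions and the convention $\alpha_i(t) = P(o_1, \dots, o_t, X_t = s_i \mid w)$; the slide may index its variables differently, and the model parameters are hypothetical.

```python
# Hypothetical two-state, three-symbol HMM (illustrative values only).
PI = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.6, 0.3, 0.1], [0.1, 0.4, 0.5]]

def forward(obs, pi, a, b):
    """Return the trellis of forward probabilities and P(O | w)."""
    n = len(pi)
    # Initialisation: start in state i and emit the first symbol.
    alpha = [[pi[i] * b[i][obs[0]] for i in range(n)]]
    for t in range(1, len(obs)):
        prev = alpha[-1]
        # Induction: sum over all incoming paths, then emit obs[t] from state j.
        alpha.append([sum(prev[i] * a[i][j] for i in range(n)) * b[j][obs[t]]
                      for j in range(n)])
    # Termination: sum over the states the sequence could end in.
    return alpha, sum(alpha[-1])

alpha, likelihood = forward([0, 1], PI, A, B)
print(likelihood)
```

Each time step costs on the order of $N^2$ multiplications, so the whole trellis costs on the order of $N^2 T$ instead of growing exponentially in T.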

26 Backward Algorithm
A node at $(s_i, t)$ stores info about state sequences from time t that evolve from $s_i$
[Figure: trellis nodes $s_1$, $s_2$, $s_i$]

27 Backward Algorithm
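The backward recursion is likewise missing from this slide. The sketch below assumes the standard convention $\beta_i(t) = P(o_{t+1}, \dots, o_T \mid X_t = s_i, w)$ with $\beta_i(T) = 1$, and uses hypothetical model parameters.

```python
# Hypothetical two-state, three-symbol HMM (illustrative values only).
PI = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.6, 0.3, 0.1], [0.1, 0.4, 0.5]]

def backward(obs, pi, a, b):
    """Return the trellis of backward probabilities and P(O | w)."""
    n = len(pi)
    # Initialisation at the final time step.
    beta = [[1.0] * n]
    for t in range(len(obs) - 2, -1, -1):
        nxt = beta[0]
        # Induction: move to state j, emit obs[t + 1] there, then continue.
        beta.insert(0, [sum(a[i][j] * b[j][obs[t + 1]] * nxt[j] for j in range(n))
                        for i in range(n)])
    # Termination: combine with the initial state and the first emission.
    likelihood = sum(pi[i] * b[i][obs[0]] * beta[0][i] for i in range(n))
    return beta, likelihood

beta, likelihood = backward([0, 1], PI, A, B)
print(likelihood)
```

For this toy model and the observation sequence `[0, 1]`, summing by hand over the four possible state paths also gives 0.1332, which is the value the function returns.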

28 Forward & Backward Algorithms
$P(O \mid w)$ as calculated from the forward and backward algorithms should be the same
The FB algorithm is usually used in training
The FB algorithm is not suited to recognition, as it considers all possible state sequences
In reality, we would like to consider only the “best” state sequence (HMM problem 2)

29 “Best” State Sequence
How is “best” defined?
We could choose the most likely individual state at each time t: [equation missing]

30 Define
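The definition on this slide is missing. In the standard treatment, the quantity defined at this point is the posterior probability of being in state $s_i$ at time $t$, built from the forward and backward probabilities. The symbols $\gamma$, $\alpha$ and $\beta$ below are assumed names, with $\alpha_i(t) = P(o_1, \dots, o_t, X_t = s_i \mid w)$ and $\beta_i(t) = P(o_{t+1}, \dots, o_T \mid X_t = s_i, w)$.

```latex
\[
\gamma_i(t) = P(X_t = s_i \mid O, w)
            = \frac{\alpha_i(t)\,\beta_i(t)}{\sum_{j=1}^{N} \alpha_j(t)\,\beta_j(t)},
\qquad
\hat{X}_t = \arg\max_{1 \le i \le N} \gamma_i(t)
\]
```

The denominator equals $P(O \mid w)$ at every time step, so it only normalises the scores.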

31 Viterbi Algorithm
…but choosing each state individually may produce an unlikely or even invalid state sequence
One solution is to choose the most likely state sequence: [equation missing]

32 Viterbi Algorithm
Define [equation missing], which is the best score along a single path, at time t, which accounts for the first t observations and ends in state $s_i$
By induction we have: [equation missing]
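The two expressions are missing from this slide. The wording matches the usual definition of the Viterbi score, so a plausible reconstruction, assuming the symbol $\delta$ and the state-emission probabilities $b_{ik}$ defined earlier, is:

```latex
\[
\delta_i(t) = \max_{X_1, \dots, X_{t-1}}
              P(X_1, \dots, X_{t-1}, X_t = s_i,\; o_1, \dots, o_t \mid w)
\]
\[
\delta_j(t+1) = \Bigl[\max_{1 \le i \le N} \delta_i(t)\, a_{ij}\Bigr]\, b_{j o_{t+1}}
\]
```

Replacing the max with a sum over $i$ gives the corresponding forward recursion.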

33 Viterbi Algorithm

34 Viterbi Algorithm
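The algorithm steps on these slides are missing. A minimal sketch of the standard Viterbi procedure follows, assuming a discrete state-emission HMM with hypothetical parameters: it keeps the best score per state, records back-pointers, and traces them back from the best final state.

```python
# Hypothetical two-state, three-symbol HMM (illustrative values only).
PI = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.6, 0.3, 0.1], [0.1, 0.4, 0.5]]

def viterbi(obs, pi, a, b):
    """Return the most likely state sequence and its joint probability with obs."""
    n = len(pi)
    delta = [pi[i] * b[i][obs[0]] for i in range(n)]  # best score ending in state i
    backptr = []                                      # best predecessor per time step
    for t in range(1, len(obs)):
        step, new_delta = [], []
        for j in range(n):
            # Keep only the best incoming path instead of summing over all of them.
            best_i = max(range(n), key=lambda i: delta[i] * a[i][j])
            step.append(best_i)
            new_delta.append(delta[best_i] * a[best_i][j] * b[j][obs[t]])
        backptr.append(step)
        delta = new_delta
    # Backtrack from the best final state.
    last = max(range(n), key=lambda i: delta[i])
    path = [last]
    for step in reversed(backptr):
        path.append(step[path[-1]])
    path.reverse()
    return path, delta[last]

path, score = viterbi([0, 2, 2], PI, A, B)
print(path, score)
```

For long sequences, products of probabilities underflow; summing log-probabilities is a common remedy and leaves the chosen path unchanged.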

35 Viterbi vs. Forward Algorithm
Similar in implementation
– Forward sums over all incoming paths
– Viterbi maximises
Viterbi probability: [equation missing]
Forward probability: [equation missing]
Both are efficiently implemented using a trellis structure
