 # Apaydin slides with a several modifications and additions by Christoph Eick.

## Presentation on theme: "Apaydin slides with a several modifications and additions by Christoph Eick."— Presentation transcript:

Apaydin slides with a several modifications and additions by Christoph Eick.

Introduction Modeling dependencies in input; no longer iid; e.g the order of observations in a dataset matters: Temporal Sequences: In speech; phonemes in a word (dictionary), words in a sentence (syntax, semantics of the language). Stock market (stock values over time) Spatial Sequences Base pairs in DNA Sequences 2Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)

Discrete Markov Process N states: S 1, S 2,..., S N State at “time” t, q t = S i First-order Markov P(q t+1 =S j | q t =S i, q t-1 =S k,...) = P(q t+1 =S j | q t =S i ) Transition probabilities a ij ≡ P(q t+1 =S j | q t =S i ) a ij ≥ 0 and Σ j=1 N a ij =1 Initial probabilities π i ≡ P(q 1 =S i ) Σ j=1 N π i =1 3

Stochastic Automaton/Markov Chain 4Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)

Three urns each full of balls of one color S 1 : blue, S 2 : red, S 3 : green Example: Balls and Urns 5 1 23

Given K example sequences of length T Balls and Urns: Learning 6 Remark: Extract the probabilities from the observed sequences: s1-s2-s1-s3 s2-s1-s1-s2   1 =1/3,  2 =2/3, a 11 =1/3, a 12 =1/3, a 13 =1/3, a 21 =3/4,… s2-s3-s2-s1

Hidden Markov Models States are not observable Discrete observations {v 1,v 2,...,v M } are recorded; a probabilistic function of the state Emission probabilities b j (m) ≡ P(O t =v m | q t =S j ) Example: In each urn, there are balls of different colors, but with different probabilities. For each observation sequence, there are multiple state sequences 7Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) http://en.wikipedia.org/wiki/Hidden_Markov_model

HMM Unfolded in Time 8Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) http://a-little-book-of-r-for-bioinformatics.readthedocs.org/en/latest/src/chapter10.htm l

Now a more complicated problem 9 We observe: 1 23 What urn sequence create it? 1.1-1-2-2 (somewhat trivial, as states are observable!) 2.(1 or 2)-(1 or 2)-(2 or 3)-(2 or 3) and the potential sequences have different probabilities—e.g drawing a blue ball from urn1 is more likely than from urn2! Markov Chains Hidden Markov Models

Another Motivating Example 10

Elements of an HMM N: Number of states M: Number of observation symbols A = [a ij ]: N by N state transition probability matrix B = b j (m): N by M observation probability matrix Π = [π i ]: N by 1 initial state probability vector λ = (A, B, Π), parameter set of HMM 11Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)

Three Basic Problems of HMMs 1. Evaluation: Given λ, and sequence O, calculate P (O | λ) 2. Most Likely State Sequence: Given λ and sequence O, find state sequence Q * such that P (Q * | O, λ ) = max Q P (Q | O, λ ) 3. Learning: Given a set of sequence O={O 1,…O k }, find λ * such that λ * is the most like explanation for the sequences in O. P ( O | λ * )=max λ  k P ( O k | λ ) 12 (Rabiner, 1989)

Forward variable: Evaluation 13 Probability of observing O 1 -…-O t and additionally being in state i Complexity: O(N 2 *T) Using  i the probability of the observed sequence can be computed as follows:

Backward variable: 14Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) Probability of observing O t+1 -…-O T and additionally being in state i

Finding the Most Likely State Sequence 15 Choose the state that has the highest probability, for each time step: q t * = arg max i γ t (i) Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) Observe: O 1 …O t O t+1 …O T  t (i):=Probability of being in state i at step t.

Viterbi’s Algorithm δ t (i) ≡ max q1q2∙∙∙ qt-1 p(q 1 q 2 ∙∙∙q t-1,q t =S i,O 1 ∙∙∙O t | λ) Initialization: δ 1 (i) = π i b i (O 1 ), ψ 1 (i) = 0 Recursion: δ t (j) = max i δ t-1 (i)a ij b j (O t ), ψ t (j) = argmax i δ t-1 (i)a ij Termination: p * = max i δ T (i), q T * = argmax i δ T (i) Path backtracking: q t * = ψ t+1 (q t+1 * ), t=T-1, T-2,..., 1 16 Idea: Combines path probability computations with backtracking over competing paths. Only briefly discussed in 2014!

Baum-Welch Algorithm Baum- Welch Algorithm O={O 1,…,O K } Model =(A,B,  ) Hidden State Sequence Observed Symbol Sequence O

Learning a Model from Sequences O 18 This is a hidden(latent) variable, measuring the probability of going from state i to state j at step t+1 observing O t+1, given a model and an observed sequence O  O k. An EM-style algorithm is used! This is a hidden(latent) variable, measuring the probability of being in state i step t observing given a model and an observed sequence O  O k.

Baum-Welch Algorithm: M-Step 19 Remark: k iterates over the observed sequences O 1,…,O K ; for each individual sequence O r  O  r and  r are computed in the E-step; then, the actual model is computed in the M-step by averaging over the estimates of  i,a ij,b j (based on  k and  k ) for each of the K observed sequences. Probability going from i to j Probability being in i

Baum-Welch Algorithm: Summary Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) Baum- Welch Algorithm O={O 1,…,O K } Model =(A,B,  ) For more discussion see: http://www.robots.ox.ac.uk/~vgg/rg/slides/hmm.pdf http://www.robots.ox.ac.uk/~vgg/rg/slides/hmm.pdf See also: http://www.digplanet.com/wiki/Baum%E2%80%93Welch_algorithmhttp://www.digplanet.com/wiki/Baum%E2%80%93Welch_algorithm

Generalization of HMM: Continuous Observations 21 The observations generated at each time step are vectors consisting of k numbers; a multivariate Gaussian with k dimensions is associated with each state j, defining the probabilities of k-dimensional vector v generated when being in state j: Hidden State Sequence Observed Vector Sequence O =(A, (  j,  j ) j=1,…n,B)

Input-dependent observations: Input-dependent transitions (Meila and Jordan, 1996; Bengio and Frasconi, 1996): Time-delay input: Generalization: HMM with Inputs 22Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)