 # ETHEM ALPAYDIN © The MIT Press, 2010 Lecture Slides for.

## Presentation on theme: "ETHEM ALPAYDIN © The MIT Press, 2010 Lecture Slides for."— Presentation transcript:

ETHEM ALPAYDIN © The MIT Press, 2010 alpaydin@boun.edu.tr http://www.cmpe.boun.edu.tr/~ethem/i2ml2e Lecture Slides for

Introduction Assumption Modeling dependencies in input; no longer iid (independent and identically distributed) Sequences Temporal: In speech: phonemes in a word (dictionary), words in a sentence (syntax, semantics of the language). In handwriting: pen movements Spatial: In a DNA sequence: base pairs ( 鹼基對 ) Base pairs in a DNA sequence can not be modeled as simple probability distribution. 3 Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)

Discrete Markov Process N states: S 1, S 2,..., S N State at “time” t, q t = S i First-order Markov P(q t+1 =S j | q t =S i, q t-1 =S k,...) = P(q t+1 =S j | q t =S i ) Transition probabilities a ij ≡ P(q t+1 =S j | q t =S i ) a ij ≥ 0 and Σ j=1 N a ij =1 Initial probabilities π i ≡ P(q 1 =S i ) Σ j=1 N π i =1 4 Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)

Stochastic Automaton 5 Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)

Example: Balls and Urns Three urns each full of balls of one color S 1 : red, S 2 : blue, S 3 : green S 1 S 2 S 3 = {red, red, green, green} 6 Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)

Balls and Urns: Learning Given K example sequences of length T 7 Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)

Hidden Markov Models States are not observable Discrete observations {v 1,v 2,...,v M } are recorded; a probabilistic function of the state Emission probabilities b j (m) ≡ P(O t =v m | q t =S j ) Example In each urn, there are balls of different colors, but with different probabilities. For each observation sequence, there are multiple state sequences 8 Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)

HMM Unfolded in Time 9 Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)

Elements of an HMM N: Number of states S = {S 1, S 2,..., S N } M: Number of observation symbols V = {v 1,v 2,...,v M } A = [a ij ]: N by N state transition probability matrix a ij ≡ P (q t+1 =S j | q t =S i ) B = b j (m): N by M observation probability matrix b j (m) ≡ P (O t =v m | q t =S j ) Π = [π i ]: N by 1 initial state probability vector π i ≡ P (q 1 =S i ) λ = (A, B, Π), parameter set of HMM 10 Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)

Three Basic Problems of HMMs 1. Evaluation: Given λ, and O, calculate P (O | λ) 2. State sequence: Given λ, and O, find Q * such that P (Q * | O, λ ) = max Q P (Q | O, λ ) 3. Learning: Given X ={O k } k, find λ * such that P ( X | λ * )=max λ P ( X | λ ) (Rabiner, 1989) 11 Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)

Evaluation Forward variable: The probability of observing the partial sequence until time t and being in S i at time t, given the model Evaluation result 12 Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)

Evaluation Backward variable: The probability of being in S i at time t and observing the partial sequence 13 Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)

Finding the State Sequence No! Because this may choose S i and S j as the most probable state at time t and t+1 even when a ij =0. Choose the state that has the highest probability, for each time step: q t * = arg max i γ t (i) The probability of being in state S i at time t, given O and 14 Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)

Viterbi’s Algorithm (Dynamic programming) δ t (i) ≡ max q1q2∙∙∙ qt-1 p(q 1 q 2 ∙∙∙q t-1,q t =S i,O 1 ∙∙∙O t | λ) The probability of the highest probability path at time t that accounts for the first t observations and ends in S i. Initialization: δ 1 (i) = π i b i (O 1 ), ψ 1 (i) = 0 Recursion: δ t (j) = max i δ t-1 (i)a ij b j (O t ), ψ t (j) = arg max i δ t-1 (i)a ij Termination: p * = max i δ T (i), q T * = argmax i δ T (i) Path (state sequence) backtracking: q t * = ψ t+1 (q t+1 * ), t=T-1, T-2,..., 1 15 Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)

Learning Model Parameters The probability of being in state S i at time t and in state S j at time t+1 16 Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)

Baum-Welch Algorithm (EM) Expectation step Maximization step 17 Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)

Continuous Observations We assumed discrete observations modeled as a multinomial: where If the inputs are continuous, one possibility is to discretize them and then use these discrete values as observation. A vector quantizer can be used for this purpose of converting continuous values to the discrete index of the closest reference vector. For example: k-means 18 Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)

Continuous Observations K-means used for vector quantization is the hard version of a Gaussian mixture model. Gaussian mixture (Discretize using k-means): and the observations are kept continuous. In this case of Gaussian mixtures, EM equations can be derived for the component parameters and the mixture proportions. 19 Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)

HMM with Input Input-dependent observations: Input-dependent transitions (Meila and Jordan, 1996; Bengio and Frasconi, 1996): Time-delay input: An HMM with input can be used by generating an “input” through some pre-specified function of previous observations. 20 Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)

Model Selection in HMM The complexity of an HMM should be tuned so as to balance its complexity with the size and properties of the data at hand. One possibility is to tune the topology of the HMM. For example, left-to-right HMMs Left-to-right HMMs have their states ordered in time so that as time increases, the state index increases or stays the same. There is the property that we never move to a state with a smaller index. 21 Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)

Model Selection in HMM Another factor that determines the complexity of an HMM is the number of states N. Because the states are hidden, their number is not known and should be chosen before training. This is determined using prior information and can be fine-tuned by cross-validation. When used for classification, we have a set of HMMs, each one modeling the sequences belonging to one class. 22 Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)