CS 59000 Statistical Machine Learning, Lecture 24. Yuan (Alan) Qi, Purdue CS, Nov. 20, 2008

Outline Review of K-medoids, mixture of Gaussians, Expectation Maximization (EM), and the alternative view of EM; Hidden Markov Models, the forward-backward algorithm, EM for learning HMM parameters, the Viterbi algorithm, linear state space models, and Kalman filtering and smoothing

K-medoids Algorithm
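A minimal sketch of the K-medoids iteration being reviewed here, assuming the usual two alternating steps (assign each point to its nearest medoid, then re-select each medoid as the cluster member with the smallest total dissimilarity); the NumPy interface and initialization are illustrative choices, not from the slides.

```python
import numpy as np

def k_medoids(D, K, n_iter=100, seed=0):
    """K-medoids on a precomputed dissimilarity matrix D (N x N).

    Alternates between assigning points to the nearest medoid and
    re-selecting each medoid to minimize within-cluster dissimilarity.
    """
    rng = np.random.default_rng(seed)
    N = D.shape[0]
    medoids = rng.choice(N, K, replace=False)
    for _ in range(n_iter):
        # Assignment step: nearest medoid for each point.
        assign = np.argmin(D[:, medoids], axis=1)
        # Update step: within each cluster, pick the member with the
        # smallest summed dissimilarity to the rest of the cluster.
        new_medoids = medoids.copy()
        for k in range(K):
            members = np.where(assign == k)[0]
            if members.size:
                costs = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[k] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, assign
```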

Mixture of Gaussians Mixture of Gaussians: $p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$. Introduce latent variables $\mathbf{z}$ with 1-of-K coding, where $p(z_k = 1) = \pi_k$ and $p(\mathbf{x} \mid z_k = 1) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$. Marginal distribution: $p(\mathbf{x}) = \sum_{\mathbf{z}} p(\mathbf{z})\, p(\mathbf{x} \mid \mathbf{z})$, which recovers the mixture above.

Conditional Probability Responsibility that component k takes for explaining the observation.
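In PRML's notation, this responsibility is the posterior probability of the component assignment given the observation:
$$
\gamma(z_{nk}) \equiv p(z_{nk} = 1 \mid \mathbf{x}_n)
  = \frac{\pi_k\, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}
         {\sum_{j=1}^{K} \pi_j\, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}.
$$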

Maximum Likelihood Maximize the log likelihood function $\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \ln \big\{ \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \big\}$.

Severe Overfitting by Maximum Likelihood When a cluster has only one data point, its variance goes to 0 and the likelihood diverges, so maximum likelihood can overfit severely.

Maximum Likelihood Conditions (1) Setting the derivatives of the log likelihood with respect to the means $\boldsymbol{\mu}_k$ to zero: $\boldsymbol{\mu}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, \mathbf{x}_n$, where $N_k = \sum_{n=1}^{N} \gamma(z_{nk})$.

Maximum Likelihood Conditions (2) Setting the derivative of the log likelihood with respect to the covariances $\boldsymbol{\Sigma}_k$ to zero: $\boldsymbol{\Sigma}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^{\mathrm T}$.

Maximum Likelihood Conditions (3) Lagrange function: $\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) + \lambda \big( \sum_{k=1}^{K} \pi_k - 1 \big)$. Setting its derivative to zero and using the normalization constraint, we obtain $\pi_k = N_k / N$.

Expectation Maximization for Mixture of Gaussians Although the previous conditions do not give a closed-form solution, we can use them to construct iterative updates. E step: compute the responsibilities $\gamma(z_{nk})$. M step: compute new means $\boldsymbol{\mu}_k$, covariances $\boldsymbol{\Sigma}_k$, and mixing coefficients $\pi_k$. Loop over the E and M steps until the log likelihood stops increasing.
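A minimal NumPy sketch of these E and M updates for a Gaussian mixture (the log-space E step, covariance regularization, and initialization scheme are illustrative choices, not from the slides):

```python
import numpy as np

def em_gmm(X, K, n_iter=100, seed=0):
    """EM for a Gaussian mixture: a sketch of the updates above.

    X: (N, D) data matrix; K: number of components.
    Returns mixing weights pi, means mu, covariances Sigma, responsibilities gamma.
    """
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Initialization (illustrative): random data points as means, shared covariance.
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, K, replace=False)]
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])

    for _ in range(n_iter):
        # E step: responsibilities gamma(z_nk) = pi_k N(x_n | mu_k, Sigma_k) / normalizer.
        log_p = np.empty((N, K))
        for k in range(K):
            diff = X - mu[k]
            _, logdet = np.linalg.slogdet(Sigma[k])
            maha = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sigma[k]), diff)
            log_p[:, k] = np.log(pi[k]) - 0.5 * (D * np.log(2 * np.pi) + logdet + maha)
        log_norm = np.logaddexp.reduce(log_p, axis=1, keepdims=True)
        gamma = np.exp(log_p - log_norm)

        # M step: re-estimate pi_k, mu_k, Sigma_k from the responsibilities.
        Nk = gamma.sum(axis=0)                      # effective counts N_k
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
    return pi, mu, Sigma, gamma
```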

General EM Algorithm
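For reference, the general EM algorithm for a latent-variable model $p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})$ can be stated in PRML's notation as follows. Initialize $\boldsymbol{\theta}^{\text{old}}$. E step: evaluate $p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}})$. M step:
$$
\boldsymbol{\theta}^{\text{new}} = \arg\max_{\boldsymbol{\theta}} Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}), \qquad
Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}) = \sum_{\mathbf{Z}} p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}})\, \ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}).
$$
Then set $\boldsymbol{\theta}^{\text{old}} \leftarrow \boldsymbol{\theta}^{\text{new}}$ and iterate until the log likelihood or the parameters converge.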

EM and Jensen's Inequality Goal: maximize $\ln p(\mathbf{X} \mid \boldsymbol{\theta})$. Define $\mathcal{L}(q, \boldsymbol{\theta}) = \sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \frac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})}{q(\mathbf{Z})}$. We have $\ln p(\mathbf{X} \mid \boldsymbol{\theta}) = \ln \sum_{\mathbf{Z}} q(\mathbf{Z}) \frac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})}{q(\mathbf{Z})}$. From Jensen's inequality, we see $\mathcal{L}(q, \boldsymbol{\theta})$ is a lower bound of $\ln p(\mathbf{X} \mid \boldsymbol{\theta})$.

Lower Bound $\mathcal{L}(q, \boldsymbol{\theta})$ is a functional of the distribution $q(\mathbf{Z})$. Since $\ln p(\mathbf{X} \mid \boldsymbol{\theta}) = \mathcal{L}(q, \boldsymbol{\theta}) + \mathrm{KL}\big(q(\mathbf{Z}) \,\|\, p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta})\big)$ and $\mathrm{KL}(q \,\|\, p) \geq 0$, $\mathcal{L}(q, \boldsymbol{\theta})$ is a lower bound of the log likelihood function $\ln p(\mathbf{X} \mid \boldsymbol{\theta})$. (Another way to see the lower bound without using Jensen's inequality.)

Lower Bound Perspective of EM Expectation step: maximize the functional lower bound over the distribution $q(\mathbf{Z})$, which sets $q(\mathbf{Z}) = p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}})$. Maximization step: maximize the lower bound over the parameters $\boldsymbol{\theta}$.

Illustration of EM Updates

Sequential Data There is temporal dependence between data points.

Markov Models By the chain rule, a joint distribution can be rewritten as $p(\mathbf{x}_1, \dots, \mathbf{x}_N) = \prod_{n=1}^{N} p(\mathbf{x}_n \mid \mathbf{x}_1, \dots, \mathbf{x}_{n-1})$. Assuming conditional independence, we have $p(\mathbf{x}_1, \dots, \mathbf{x}_N) = p(\mathbf{x}_1) \prod_{n=2}^{N} p(\mathbf{x}_n \mid \mathbf{x}_{n-1})$, which is known as a first-order Markov chain.

High Order Markov Chains Second-order Markov assumption: $p(\mathbf{x}_1, \dots, \mathbf{x}_N) = p(\mathbf{x}_1)\, p(\mathbf{x}_2 \mid \mathbf{x}_1) \prod_{n=3}^{N} p(\mathbf{x}_n \mid \mathbf{x}_{n-1}, \mathbf{x}_{n-2})$. This can be generalized to higher-order Markov chains, but the number of parameters grows exponentially with the order.

State Space Models Important graphical models for many dynamic processes, including Hidden Markov Models (HMMs) and linear dynamical systems. Question: what order should we choose for the Markov assumption?

Hidden Markov Models Many applications, e.g., speech recognition, natural language processing, handwriting recognition, bio-sequence analysis

From Mixture Models to HMMs By turning a mixture model into a dynamic model, we obtain the HMM. We model the dependence between two consecutive latent variables by a transition probability $p(\mathbf{z}_n \mid \mathbf{z}_{n-1}, \mathbf{A})$, where $A_{jk} = p(z_{nk} = 1 \mid z_{n-1,j} = 1)$ and $\sum_k A_{jk} = 1$.

HMMs Prior on the initial latent variable: $p(\mathbf{z}_1 \mid \boldsymbol{\pi})$ with $p(z_{1k} = 1) = \pi_k$. Emission probabilities: $p(\mathbf{x}_n \mid \mathbf{z}_n, \boldsymbol{\phi})$. Joint distribution: $p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) = p(\mathbf{z}_1 \mid \boldsymbol{\pi}) \big[ \prod_{n=2}^{N} p(\mathbf{z}_n \mid \mathbf{z}_{n-1}, \mathbf{A}) \big] \prod_{n=1}^{N} p(\mathbf{x}_n \mid \mathbf{z}_n, \boldsymbol{\phi})$, where $\boldsymbol{\theta} = \{\boldsymbol{\pi}, \mathbf{A}, \boldsymbol{\phi}\}$.

Samples from HMM (a) Contours of constant probability density for the emission distributions corresponding to each of the three states of the latent variable. (b) A sample of 50 points drawn from the hidden Markov model, with lines connecting the successive observations.

Inference: Forward-backward Algorithm Goal: compute the marginals for the latent variables. Forward-backward algorithm: exact inference as a special case of the sum-product algorithm on the HMM. Factor graph representation (grouping the emission density and the transition probability into one factor per time step):

Forward-backward Algorithm as Message Passing Method (1) Forward messages:

Forward-backward Algorithm as Message Passing Method (2) Backward messages (Q: how do we compute them?): the messages actually involve the observations $\mathbf{X}$. Similarly, we can compute the following quantities (Q: why?):
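In PRML's $\alpha$–$\beta$ notation, the messages and the quantities they yield are:
$$
\alpha(\mathbf{z}_n) = p(\mathbf{x}_n \mid \mathbf{z}_n) \sum_{\mathbf{z}_{n-1}} \alpha(\mathbf{z}_{n-1})\, p(\mathbf{z}_n \mid \mathbf{z}_{n-1}), \qquad
\beta(\mathbf{z}_n) = \sum_{\mathbf{z}_{n+1}} \beta(\mathbf{z}_{n+1})\, p(\mathbf{x}_{n+1} \mid \mathbf{z}_{n+1})\, p(\mathbf{z}_{n+1} \mid \mathbf{z}_n),
$$
with $\alpha(\mathbf{z}_1) = p(\mathbf{z}_1)\, p(\mathbf{x}_1 \mid \mathbf{z}_1)$ and $\beta(\mathbf{z}_N) = 1$. From these,
$$
\gamma(\mathbf{z}_n) = \frac{\alpha(\mathbf{z}_n)\, \beta(\mathbf{z}_n)}{p(\mathbf{X})}, \qquad
\xi(\mathbf{z}_{n-1}, \mathbf{z}_n) = \frac{\alpha(\mathbf{z}_{n-1})\, p(\mathbf{x}_n \mid \mathbf{z}_n)\, p(\mathbf{z}_n \mid \mathbf{z}_{n-1})\, \beta(\mathbf{z}_n)}{p(\mathbf{X})}, \qquad
p(\mathbf{X}) = \sum_{\mathbf{z}_n} \alpha(\mathbf{z}_n)\, \beta(\mathbf{z}_n).
$$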

Rescaling to Avoid Underflow When a sequence is long, the forward messages become too small to be represented within the dynamic range of the computer. We redefine the forward message as $\hat\alpha(\mathbf{z}_n) = \alpha(\mathbf{z}_n) / p(\mathbf{x}_1, \dots, \mathbf{x}_n)$, introducing scaling factors $c_n = p(\mathbf{x}_n \mid \mathbf{x}_1, \dots, \mathbf{x}_{n-1})$. Similarly, we redefine the backward message as $\hat\beta(\mathbf{z}_n) = \beta(\mathbf{z}_n) / \prod_{m=n+1}^{N} c_m$. Then we can compute $\gamma(\mathbf{z}_n) = \hat\alpha(\mathbf{z}_n)\, \hat\beta(\mathbf{z}_n)$ and $p(\mathbf{X}) = \prod_{n=1}^{N} c_n$. See the detailed derivation in the textbook.
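A minimal NumPy sketch of the rescaled recursions, using the normalized messages $\hat\alpha$, $\hat\beta$ and per-step scaling factors $c_n$ (the interface and variable names are illustrative):

```python
import numpy as np

def forward_backward(pi, A, emission_probs):
    """Scaled forward-backward for a discrete HMM.

    pi: (K,) initial state distribution
    A:  (K, K) transition matrix, A[j, k] = p(z_n = k | z_{n-1} = j)
    emission_probs: (N, K) matrix with p(x_n | z_n = k)
    Returns posterior marginals gamma (N, K) and the log likelihood.
    """
    N, K = emission_probs.shape
    alpha_hat = np.empty((N, K))
    c = np.empty(N)

    # Scaled forward pass: c_n normalizes alpha at every step.
    a = pi * emission_probs[0]
    c[0] = a.sum()
    alpha_hat[0] = a / c[0]
    for n in range(1, N):
        a = emission_probs[n] * (alpha_hat[n - 1] @ A)
        c[n] = a.sum()
        alpha_hat[n] = a / c[n]

    # Scaled backward pass, re-using the same scaling factors.
    beta_hat = np.empty((N, K))
    beta_hat[-1] = 1.0
    for n in range(N - 2, -1, -1):
        beta_hat[n] = (A @ (emission_probs[n + 1] * beta_hat[n + 1])) / c[n + 1]

    gamma = alpha_hat * beta_hat            # p(z_n | X)
    log_likelihood = np.log(c).sum()        # ln p(X) = sum_n ln c_n
    return gamma, log_likelihood
```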

Viterbi Algorithm Viterbi algorithm: finding the most probable sequence of states; a special case of the max-sum algorithm on the HMM. What if we want to find the most probable individual states? (Those come from the marginals $\gamma(\mathbf{z}_n)$, and the resulting sequence can differ from the jointly most probable one.)
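A log-domain Viterbi sketch consistent with the transition and emission notation above (the interface mirrors the forward-backward sketch and is an assumption, not the lecture's code):

```python
import numpy as np

def viterbi(pi, A, emission_probs):
    """Most probable state sequence for a discrete HMM (max-sum in log space).

    pi: (K,) initial distribution; A: (K, K) transition matrix;
    emission_probs: (N, K) with p(x_n | z_n = k).
    """
    N, K = emission_probs.shape
    log_pi, log_A = np.log(pi), np.log(A)
    log_e = np.log(emission_probs)

    omega = np.empty((N, K))   # omega[n, k]: best log prob of a path ending in state k at step n
    back = np.zeros((N, K), dtype=int)
    omega[0] = log_pi + log_e[0]
    for n in range(1, N):
        scores = omega[n - 1][:, None] + log_A   # (K, K): previous state -> current state
        back[n] = np.argmax(scores, axis=0)
        omega[n] = np.max(scores, axis=0) + log_e[n]

    # Backtrack the most probable path.
    path = np.empty(N, dtype=int)
    path[-1] = np.argmax(omega[-1])
    for n in range(N - 2, -1, -1):
        path[n] = back[n + 1, path[n + 1]]
    return path, omega[-1].max()
```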

Maximum Likelihood Estimation for HMM Goal: maximize $p(\mathbf{X} \mid \boldsymbol{\theta}) = \sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})$. Looks familiar? Remember EM for the mixture of Gaussians; indeed, the updates are similar.

EM for HMM E step: compute $\gamma(z_{nk})$ and $\xi(z_{n-1,j}, z_{nk})$ from the forward-backward/sum-product algorithm. M step: re-estimate $\boldsymbol{\pi}$, $\mathbf{A}$, and the emission parameters.
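For a hidden Markov model with Gaussian emissions (as in PRML), the standard M-step re-estimates from the E-step quantities $\gamma$ and $\xi$ are:
$$
\pi_k = \frac{\gamma(z_{1k})}{\sum_{j} \gamma(z_{1j})}, \qquad
A_{jk} = \frac{\sum_{n=2}^{N} \xi(z_{n-1,j}, z_{nk})}{\sum_{l} \sum_{n=2}^{N} \xi(z_{n-1,j}, z_{nl})},
$$
$$
\boldsymbol{\mu}_k = \frac{\sum_{n} \gamma(z_{nk})\, \mathbf{x}_n}{\sum_{n} \gamma(z_{nk})}, \qquad
\boldsymbol{\Sigma}_k = \frac{\sum_{n} \gamma(z_{nk})\, (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^{\mathrm T}}{\sum_{n} \gamma(z_{nk})}.
$$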

Linear Dynamical Systems Transition and emission distributions are linear-Gaussian: $p(\mathbf{z}_n \mid \mathbf{z}_{n-1}) = \mathcal{N}(\mathbf{z}_n \mid \mathbf{A}\mathbf{z}_{n-1}, \boldsymbol{\Gamma})$, $p(\mathbf{x}_n \mid \mathbf{z}_n) = \mathcal{N}(\mathbf{x}_n \mid \mathbf{C}\mathbf{z}_n, \boldsymbol{\Sigma})$, $p(\mathbf{z}_1) = \mathcal{N}(\mathbf{z}_1 \mid \boldsymbol{\mu}_0, \mathbf{V}_0)$. Equivalently, we have $\mathbf{z}_n = \mathbf{A}\mathbf{z}_{n-1} + \mathbf{w}_n$ and $\mathbf{x}_n = \mathbf{C}\mathbf{z}_n + \mathbf{v}_n$, where $\mathbf{w}_n \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Gamma})$ and $\mathbf{v}_n \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma})$.

Kalman Filtering and Smoothing Inference in linear Gaussian systems. Kalman filtering: sequentially update the scaled forward message $\hat\alpha(\mathbf{z}_n) = \mathcal{N}(\mathbf{z}_n \mid \boldsymbol{\mu}_n, \mathbf{V}_n)$, i.e., the filtered posterior $p(\mathbf{z}_n \mid \mathbf{x}_1, \dots, \mathbf{x}_n)$. Kalman smoothing: sequentially update the state beliefs $p(\mathbf{z}_n \mid \mathbf{x}_1, \dots, \mathbf{x}_N)$ based on the scaled forward and backward messages.
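A minimal Kalman filtering sketch for the model above, with $\mathbf{z}_n = \mathbf{A}\mathbf{z}_{n-1} + \mathbf{w}_n$, $\mathbf{x}_n = \mathbf{C}\mathbf{z}_n + \mathbf{v}_n$, $\mathbf{w}_n \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Gamma})$, $\mathbf{v}_n \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma})$; the function signature is illustrative, and smoothing would add a backward (RTS) pass:

```python
import numpy as np

def kalman_filter(X, A, C, Gamma, Sigma, mu0, V0):
    """Sequential filtering p(z_n | x_1, ..., x_n) for a linear Gaussian state space model.

    X: (N, D) observations; A: state transition matrix; C: emission matrix;
    Gamma/Sigma: state/observation noise covariances; mu0, V0: prior on z_1.
    """
    N = X.shape[0]
    K = mu0.shape[0]
    mus = np.empty((N, K))
    Vs = np.empty((N, K, K))
    mu_pred, V_pred = mu0, V0
    for n in range(N):
        # Update: incorporate observation x_n via the Kalman gain.
        S = C @ V_pred @ C.T + Sigma
        Kgain = V_pred @ C.T @ np.linalg.inv(S)
        mu = mu_pred + Kgain @ (X[n] - C @ mu_pred)
        V = (np.eye(K) - Kgain @ C) @ V_pred
        mus[n], Vs[n] = mu, V
        # Predict: propagate the belief through the dynamics for step n + 1.
        mu_pred = A @ mu
        V_pred = A @ V @ A.T + Gamma
    return mus, Vs
```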

Learning in LDS EM again…

Extensions of HMM and LDS Discrete latent variables: factorial (factorized) HMMs. Continuous latent variables: switching Kalman filter (switching state space) models.