Machine Learning: Expectation Maximization and Gaussian Mixtures (CSE 473, Chapter 20.3)


Feedback in Learning
- Supervised learning: correct answers for each example
- Unsupervised learning: correct answers not given
- Reinforcement learning: occasional rewards

The problem of finding labels for unlabeled data
So far we have solved "supervised" classification problems, where a teacher told us the label of each example. In nature, items often do not come with labels. How can we learn labels without a teacher?
[Figure: labeled data vs. unlabeled data. From Shadmehr & Diedrichsen]

Example: image segmentation
Identify pixels that are white matter, gray matter, or outside of the brain.
[Figure: histogram of normalized pixel values with three groups: outside the brain, gray matter, and white matter. From Shadmehr & Diedrichsen]

Raw Sensor Data
Measured distances for an expected distance of 300 cm.
[Figure: histograms of measured distances for sonar and laser sensors]

Gaussians --   Univariate  Multivariate

Maximum Likelihood Estimation of Gaussian Parameters
Given: (weighted) samples $x_j$, $j = 1, \dots, n$, with weights $w_j$, the ML estimates of the Gaussian parameters are
$$\mu = \frac{\sum_j w_j\, x_j}{\sum_j w_j}, \qquad \Sigma = \frac{\sum_j w_j\, (x_j - \mu)(x_j - \mu)^\top}{\sum_j w_j}$$
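A minimal sketch of these weighted estimates in NumPy (the data, weights, and function name are hypothetical):

```python
import numpy as np

def weighted_gaussian_mle(X, w):
    """ML estimates of a Gaussian from weighted samples.
    X: (n, d) array of samples x_j; w: (n,) array of nonnegative weights."""
    w = w / w.sum()                       # normalize the weights
    mu = w @ X                            # weighted sample mean
    diff = X - mu
    Sigma = (w[:, None] * diff).T @ diff  # weighted sample covariance
    return mu, Sigma

# Toy usage with made-up data; equal weights reduce to the ordinary MLE
X = np.random.randn(100, 2) * [1.0, 0.5] + [3.0, -1.0]
mu_hat, Sigma_hat = weighted_gaussian_mle(X, np.ones(100))
```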

Fitting a Gaussian PDF to Data
[Figure: a poor fit vs. a good fit of a Gaussian PDF to a data histogram. From Russell]

Fitting a Gaussian PDF to Data
Suppose $y = y_1, \dots, y_n, \dots, y_N$ is a set of $N$ data values. Given a Gaussian PDF $p$ with mean $\mu$ and variance $\sigma^2$, define the likelihood
$$p(y \mid \mu, \sigma) = \prod_{n=1}^{N} p(y_n \mid \mu, \sigma)$$
How do we choose $\mu$ and $\sigma$ to maximise this probability?
(From Russell)

Maximum Likelihood Estimation
Define the best-fitting Gaussian to be the one such that $p(y \mid \mu, \sigma)$ is maximised.
Terminology:
- $p(y \mid \mu, \sigma)$, thought of as a function of $y$, is the probability (density) of $y$.
- $p(y \mid \mu, \sigma)$, thought of as a function of $\mu, \sigma$, is the likelihood of $\mu, \sigma$.
Maximizing $p(y \mid \mu, \sigma)$ with respect to $\mu, \sigma$ is called Maximum Likelihood (ML) estimation of $\mu, \sigma$.
(From Russell)

ML estimation of $\mu$, $\sigma$
Intuitively:
- The maximum likelihood estimate of $\mu$ should be the average value of $y_1, \dots, y_N$ (the sample mean).
- The maximum likelihood estimate of $\sigma^2$ should be the variance of $y_1, \dots, y_N$ (the sample variance).
This turns out to be true: $p(y \mid \mu, \sigma)$ is maximised by setting
$$\mu = \frac{1}{N} \sum_{n=1}^{N} y_n, \qquad \sigma^2 = \frac{1}{N} \sum_{n=1}^{N} (y_n - \mu)^2$$
(From Russell)
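A quick numerical check of this claim (the data values below are made up): the Gaussian log-likelihood evaluated at the sample mean and sample standard deviation is higher than at nearby parameter settings.

```python
import numpy as np
from scipy.stats import norm

y = np.array([2.1, 1.9, 3.2, 2.7, 2.4, 1.8])   # hypothetical data values
mu_ml = y.mean()
sigma_ml = y.std()        # ddof=0, i.e. the square root of the ML variance

def loglik(mu, sigma):
    return norm.logpdf(y, loc=mu, scale=sigma).sum()

print(loglik(mu_ml, sigma_ml))         # the maximum
print(loglik(mu_ml + 0.3, sigma_ml))   # lower
print(loglik(mu_ml, sigma_ml * 1.3))   # lower
```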


Mixtures
If our data is not labeled, we can hypothesize that:
1. There are exactly $m$ classes in the data: $y \in \{1, \dots, m\}$.
2. Each class $y$ occurs with a specific frequency: $P(y)$.
3. Examples of class $y$ are governed by a specific distribution: $P(x \mid y)$.
According to our hypothesis, each example $x^{(i)}$ must have been generated from a specific "mixture" distribution:
$$P(x) = \sum_{y} P(y)\, P(x \mid y)$$
We might hypothesize that the distributions are Gaussian:
$$P(x) = \sum_{y} P(y)\, \mathcal{N}(x;\, \mu_y, \Sigma_y)$$
where the $P(y)$ are the mixing proportions and $\mu_y, \Sigma_y$ are the parameters of the normal distributions.
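A small sketch of this generative story (the two-component weights, means, and covariances below are invented for illustration): first draw a class $y$ with probability $P(y)$, then draw $x$ from that class's Gaussian.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

weights = np.array([0.6, 0.4])                             # P(y)
means = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]       # mu_y
covs = [np.eye(2), np.array([[1.0, 0.5], [0.5, 2.0]])]     # Sigma_y

def sample_mixture(n):
    ys = rng.choice(len(weights), size=n, p=weights)       # pick a class per example
    X = np.array([rng.multivariate_normal(means[y], covs[y]) for y in ys])
    return X, ys

def mixture_density(x):
    # P(x) = sum_y P(y) N(x; mu_y, Sigma_y)
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

X, ys = sample_mixture(500)
print(mixture_density(np.array([1.0, 1.0])))
```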

Graphical Representation of Gaussian Mixtures
[Figure: graphical model in which a hidden class variable, occurring with probability $P(y)$, generates the measured variable]

Learning of mixture models

Learning Mixtures from Data
Consider a fixed $K = 2$, e.g. with unknown parameters $\theta = \{\mu_1, \sigma_1, \mu_2, \sigma_2, p_1\}$.
Given data $D = \{x_1, \dots, x_N\}$, we want to find the parameters $\theta$ that "best fit" the data.
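"Best fit" is usually taken to mean maximizing the (log-)likelihood of the data under $\theta$. A sketch for this univariate K = 2 case, with arbitrary illustrative parameter values:

```python
import numpy as np
from scipy.stats import norm

def mixture_loglik(D, mu1, s1, mu2, s2, p1):
    # log L(theta) = sum_i log[ p1 N(x_i; mu1, s1^2) + (1 - p1) N(x_i; mu2, s2^2) ]
    px = p1 * norm.pdf(D, mu1, s1) + (1 - p1) * norm.pdf(D, mu2, s2)
    return np.log(px).sum()

D = np.concatenate([np.random.normal(0, 1, 300), np.random.normal(5, 2, 200)])
print(mixture_loglik(D, 0.0, 1.0, 5.0, 2.0, 0.6))   # near the generating parameters
print(mixture_loglik(D, 1.0, 1.0, 1.0, 1.0, 0.5))   # a poor fit: typically much lower
```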

Early Attempts
Weldon's data: n = 1000 crabs from the Bay of Naples
- Ratio of forehead to body length
- Suspected existence of 2 separate species

Early Attempts
Karl Pearson, 1894:
- JRSS paper
- proposed a mixture of 2 Gaussians
- 5 parameters $\theta = \{\mu_1, \sigma_1, \mu_2, \sigma_2, p_1\}$
- parameter estimation -> method of moments
- involved solution of 9th-order equations!
(see Chapter 10, Stigler (1986), The History of Statistics)

“The solution of an equation of the ninth degree, where almost all powers, to the ninth, of the unknown quantity are existing, is, however, a very laborious task. Mr. Pearson has indeed possessed the energy to perform his heroic task…. But I fear he will have few successors…..” Charlier (1906)

Maximum Likelihood Principle
Fisher, 1922:
- assume a probabilistic model
- likelihood = p(data | parameters, model)
- find the parameters that make the data most likely

1977: The EM Algorithm
Dempster, Laird, and Rubin: a general framework for likelihood-based parameter estimation with missing data.
- Start with initial guesses of the parameters
- E-step: estimate memberships given the parameters
- M-step: estimate parameters given the memberships
- Repeat until convergence
Converges to a (local) maximum of the likelihood. The E-step and M-step are often computationally simple. Generalizes to maximum a posteriori estimation (with priors).
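As a sketch only, the general structure might look like the loop below; e_step and m_step are placeholders for whatever a particular model defines.

```python
def expectation_maximization(data, init_params, e_step, m_step,
                             max_iters=100, tol=1e-6):
    """Generic EM loop: alternate membership estimation and parameter estimation."""
    params = init_params
    prev_ll = -float("inf")
    for _ in range(max_iters):
        memberships, ll = e_step(data, params)   # E-step: soft memberships + log-likelihood
        params = m_step(data, memberships)       # M-step: re-estimate the parameters
        if ll - prev_ll < tol:                   # stop once the likelihood stops improving
            break
        prev_ll = ll
    return params
```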

EM for Mixture of Gaussians
E-step: Compute the probability that point $x_j$ was generated by component $i$:
$$p_{ij} = \frac{w_i\, \mathcal{N}(x_j;\, \mu_i, \Sigma_i)}{\sum_{k=1}^{K} w_k\, \mathcal{N}(x_j;\, \mu_k, \Sigma_k)}$$
M-step: Compute the new mean, covariance, and component weight for each component $i$:
$$\mu_i = \frac{\sum_j p_{ij}\, x_j}{\sum_j p_{ij}}, \qquad \Sigma_i = \frac{\sum_j p_{ij}\, (x_j - \mu_i)(x_j - \mu_i)^\top}{\sum_j p_{ij}}, \qquad w_i = \frac{1}{N} \sum_j p_{ij}$$
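A minimal NumPy/SciPy sketch of these two updates (the initialization scheme and the small regularization term added to the covariances are choices made here, not something specified on the slide):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=50, seed=0):
    """EM for a Gaussian mixture: alternate the E-step and M-step above."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.full(K, 1.0 / K)                             # component weights
    mu = X[rng.choice(n, size=K, replace=False)]        # means initialized at random points
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])

    for _ in range(n_iters):
        # E-step: p_ij = probability that component i generated point x_j
        dens = np.stack([w[i] * multivariate_normal.pdf(X, mean=mu[i], cov=Sigma[i])
                         for i in range(K)])            # shape (K, n)
        resp = dens / dens.sum(axis=0, keepdims=True)

        # M-step: weighted mean, covariance, and mixing weight per component
        Nk = resp.sum(axis=1)                           # effective counts
        mu = (resp @ X) / Nk[:, None]
        for i in range(K):
            diff = X - mu[i]
            Sigma[i] = (resp[i][:, None] * diff).T @ diff / Nk[i] + 1e-6 * np.eye(d)
        w = Nk / n
    return w, mu, Sigma
```

Run on data drawn from the two-component mixture sketched earlier, em_gmm(X, K=2) should recover means close to (0, 0) and (4, 4), up to the usual permutation of component labels.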

[Figure: data forming two clusters, an anemia group and a control group]

Mixture Density
$$p(x) = \sum_{i=1}^{m} w_i\, p_i(x), \qquad w_i \ge 0, \quad \sum_i w_i = 1$$
How can we determine the model parameters?

Raw Sensor Data
Measured distances for an expected distance of 300 cm.
[Figure: histograms of measured distances for sonar and laser sensors]

Approximation Results
[Figure: approximation results for sonar and laser data at expected distances of 300 cm and 400 cm]

Predict Goal and Path
[Figure: map showing the predicted goal and the predicted path]

Learned Transition Parameters
[Figure: learned transition parameters for "going to the workplace" and "going home", with transportation modes bus, car, and foot]