Machine Learning
Saarland University, SS 2007
Holger Bast
Max-Planck-Institut für Informatik, Saarbrücken, Germany
Lecture 9, Friday June 15th, 2007 (EM algorithm + convergence)

Overview of this Lecture

Quick recap of last lecture
– maximum likelihood principle / our 3 examples

The EM algorithm
– writing down the formula (very easy)
– understanding the formula (very hard)
– example: mixture of two normal distributions

Convergence
– to local maximum (under mild assumptions)

Exercise Sheet
– explain / discuss / make a start

Maximum Likelihood: Example 1

Sequence of coin flips: HHTTTTTTHTTTTTHTTHHT
– say 5 times H and 15 times T
– which Prob(H) and Prob(T) are most likely? (looks like Prob(H) = ¼, Prob(T) = ¾)

Formalization
– Data X = (x_1, …, x_n), x_i in {H, T}
– Parameters Θ = (p_H, p_T), p_H + p_T = 1
– Likelihood L(X, Θ) = p_H^h · p_T^t, where h = #{i : x_i = H}, t = #{i : x_i = T}
– Log-likelihood Q(X, Θ) = log L(X, Θ) = h · log p_H + t · log p_T
– find Θ* = argmax_Θ L(X, Θ) = argmax_Θ Q(X, Θ)

Solution (simple calculus [blackboard])
– here p_H = h / (h + t) and p_T = t / (h + t)
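A minimal numerical check of this example, using the counts stated on the slide ("say 5 times H and 15 times T"):

```python
import math

# counts from the slide
h, t = 5, 15

# maximum likelihood estimates are the relative frequencies
p_H = h / (h + t)   # 0.25
p_T = t / (h + t)   # 0.75

# log-likelihood Q(X, Theta) = h * log p_H + t * log p_T
Q = h * math.log(p_H) + t * math.log(p_T)
print(p_H, p_T, Q)
```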

Maximum Likelihood: Example 2

Sequence of reals drawn from N(μ, σ), the normal distribution with mean μ and standard deviation σ
– which μ and σ are most likely?

Formalization
– Data X = (x_1, …, x_n), x_i real number
– Parameters Θ = (μ, σ)
– Likelihood L(X, Θ) = Π_i 1/(sqrt(2π) σ) · exp(−(x_i − μ)² / 2σ²)
– Log-likelihood Q(X, Θ) = −n/2 · log(2π) − n · log σ − Σ_i (x_i − μ)² / 2σ²
– find Θ* = argmax_Θ L(X, Θ) = argmax_Θ Q(X, Θ)

Solution (simple calculus [blackboard])
– here μ = 1/n · Σ_i x_i and σ² = 1/n · Σ_i (x_i − μ)²
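A short sketch of these closed-form estimates; the sample values are made up for illustration, they are not from the lecture:

```python
import numpy as np

x = np.array([1.2, 0.7, 2.3, 1.9, 1.1, 0.4])   # illustrative data

mu_hat = x.mean()                        # mu = (1/n) * sum_i x_i
sigma2_hat = ((x - mu_hat) ** 2).mean()  # sigma^2 = (1/n) * sum_i (x_i - mu)^2
# note: the ML estimate divides by n, not the unbiased n-1 version
print(mu_hat, np.sqrt(sigma2_hat))
```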

Maximum Likelihood: Example 3

Sequence of real numbers
– each drawn from either N_1(μ_1, σ_1) or N_2(μ_2, σ_2)
– from N_1 with prob p_1, and from N_2 with prob p_2
– which μ_1, σ_1, μ_2, σ_2, p_1, p_2 are most likely?

Formalization
– Data X = (x_1, …, x_n), x_i real number
– Hidden data Z = (z_1, …, z_n), z_i = j iff x_i drawn from N_j
– Parameters Θ = (μ_1, σ_1, μ_2, σ_2, p_1, p_2), p_1 + p_2 = 1
– Likelihood L(X, Θ) = [blackboard]
– Log-likelihood Q(X, Θ) = [blackboard]
– find Θ* = argmax_Θ L(X, Θ) = argmax_Θ Q(X, Θ)
– standard calculus fails (derivative of sum of logs of sums)
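The slide leaves the likelihood to the blackboard; as a reference point, here is a sketch of the standard two-component Gaussian mixture log-likelihood (my reconstruction, not the slide's own derivation). Note it is a sum of logs of sums, which is exactly why direct maximization is hard:

```python
import numpy as np
from scipy.stats import norm

def mixture_log_likelihood(x, mu1, sigma1, mu2, sigma2, p1):
    """Q(X, Theta) = sum_i log( p1*N(x_i; mu1, sigma1) + p2*N(x_i; mu2, sigma2) )."""
    p2 = 1.0 - p1
    dens = p1 * norm.pdf(x, mu1, sigma1) + p2 * norm.pdf(x, mu2, sigma2)
    return np.log(dens).sum()
```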

The EM algorithm — Formula

Given
– Data X = (x_1, …, x_n)
– Hidden data Z = (z_1, …, z_n)
– Parameters Θ, plus an initial guess θ_1

Expectation step
– Pr(Z | X; θ_t) = Pr(X | Z; θ_t) · Pr(Z | θ_t) / Σ_{Z'} Pr(X | Z'; θ_t) · Pr(Z' | θ_t)

Maximization step
– θ_{t+1} = argmax_Θ E_Z[ log Pr(X, Z | Θ) | X; θ_t ]

What the hell does this mean?
– crucial to understand each of these probabilities / expected values
– What is fixed? What is random and how? What do the conditionals mean?
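A generic skeleton of this alternation, with placeholder e_step / m_step callables (names are mine, not from the lecture); concrete two-Gaussian versions of both steps appear after the corresponding slides below:

```python
def em(x, theta, e_step, m_step, n_iters=100):
    """Generic EM loop (a sketch): alternate between the E-step, which
    computes Pr(Z | X; theta_t), and the M-step, which re-estimates theta
    as argmax_Theta E_Z[log Pr(X, Z | Theta) | X; theta_t]."""
    for t in range(n_iters):
        posterior = e_step(x, theta)    # E-step
        theta = m_step(x, posterior)    # M-step
    return theta
```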

Three attempts to maximize the likelihood (consider the mixture of two Gaussians as an example)

1. The direct way …
– given x_1, …, x_n
– find parameters μ_1, σ_1, μ_2, σ_2, p_1, p_2
– such that log L(x_1, …, x_n) is maximized
– optimization too hard (sum of logs of sums)

2. If only we knew …
– given data x_1, …, x_n and hidden data z_1, …, z_n
– find parameters μ_1, σ_1, μ_2, σ_2, p_1, p_2
– such that log L(x_1, …, x_n, z_1, …, z_n) is maximized
– would be feasible [show on blackboard], but we don't know the z_1, …, z_n

3. The EM way …
– given x_1, …, x_n and random variables Z_1, …, Z_n
– find parameters μ_1, σ_1, μ_2, σ_2, p_1, p_2
– such that E log L(x_1, …, x_n, Z_1, …, Z_n) is maximized
– this is the M-step of the EM algorithm; the E-step provides the Z_1, …, Z_n

E-Step — Formula (consider the mixture of two Gaussians as an example)

We have (at the beginning of each iteration)
– the data x_1, …, x_n
– the fully specified distributions N_1(μ_1, σ_1) and N_2(μ_2, σ_2)
– the probability of choosing between N_1 and N_2 = random variable Z with p_1 = Pr(Z=1) and p_2 = Pr(Z=2)

We want
– for each data point x_i a probability of choosing N_1 or N_2 = random variables Z_1, …, Z_n

Solution (the actual E-Step)
– take Z_i as the conditional Z | x_i
– by Bayes' law: Pr(Z_i = 1) = Pr(Z = 1 | x_i) = Pr(x_i | Z = 1) · Pr(Z = 1) / Pr(x_i), with Pr(x_i) = Σ_z Pr(x_i | Z = z) · Pr(Z = z)
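A sketch of this E-step for the two-Gaussian mixture (function and variable names are mine):

```python
import numpy as np
from scipy.stats import norm

def e_step(x, mu1, sigma1, mu2, sigma2, p1):
    """Responsibilities r_i = Pr(Z_i = 1 | x_i) via Bayes' law."""
    p2 = 1.0 - p1
    lik1 = p1 * norm.pdf(x, mu1, sigma1)   # Pr(x_i | Z=1) * Pr(Z=1)
    lik2 = p2 * norm.pdf(x, mu2, sigma2)   # Pr(x_i | Z=2) * Pr(Z=2)
    return lik1 / (lik1 + lik2)            # divide by Pr(x_i)
```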

E-Step — analogy to a simple example

Draw a ball from one of two urns
– Urn 1 is picked with prob 1/3, Urn 2 with prob 2/3
– Pr(Blue | Urn 1) = 1/2, Pr(Blue | Urn 2) = 1/4

Pr(Blue) = Pr(Blue | Urn 1) · Pr(Urn 1) + Pr(Blue | Urn 2) · Pr(Urn 2) = 1/2 · 1/3 + 1/4 · 2/3 = 1/3

Pr(Urn 1 | Blue) = Pr(Blue | Urn 1) · Pr(Urn 1) / Pr(Blue) = (1/2 · 1/3) / (1/3) = 1/2
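The same computation as a quick check, using exact fractions:

```python
from fractions import Fraction as F

p_urn1, p_urn2 = F(1, 3), F(2, 3)
p_blue_given_1, p_blue_given_2 = F(1, 2), F(1, 4)

p_blue = p_blue_given_1 * p_urn1 + p_blue_given_2 * p_urn2   # 1/3
p_urn1_given_blue = p_blue_given_1 * p_urn1 / p_blue          # 1/2
print(p_blue, p_urn1_given_blue)
```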

M-Step — Formula [Blackboard]
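The slide defers the M-step to the blackboard; for the two-Gaussian mixture, the standard update (my reconstruction, not the slide's own derivation) re-estimates each parameter as a responsibility-weighted average:

```python
import numpy as np

def m_step(x, r):
    """Standard M-step for a two-component Gaussian mixture.
    r[i] = Pr(Z_i = 1 | x_i), as computed by the E-step above."""
    n = len(x)
    p1 = r.sum() / n                                        # new mixing weight
    mu1 = (r * x).sum() / r.sum()                           # weighted mean of component 1
    sigma1 = np.sqrt((r * (x - mu1) ** 2).sum() / r.sum())  # weighted std of component 1
    mu2 = ((1 - r) * x).sum() / (1 - r).sum()
    sigma2 = np.sqrt(((1 - r) * (x - mu2) ** 2).sum() / (1 - r).sum())
    return mu1, sigma1, mu2, sigma2, p1
```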

Convergence of EM Algorithm

Two (log) likelihoods
– true: log L(x_1, …, x_n)
– EM: E log L(x_1, …, x_n, Z_1, …, Z_n)

Lemma 1 (lower bound)
– E log L(x_1, …, x_n, Z_1, …, Z_n) ≤ log L(x_1, …, x_n)

Lemma 2 (touch)
– E log L(x_1, …, x_n, Z_1, …, Z_n)(θ_t) = log L(x_1, …, x_n)(θ_t)

Convergence
– if the expected likelihood function is well-behaved, e.g., the first derivative exists at local maxima and the second derivative there is < 0
– then Lemmas 1 and 2 imply convergence [blackboard]
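Putting the pieces together, a sketch of the full loop that tracks the true log-likelihood; the printed values should be non-decreasing, which is the practical face of Lemmas 1 and 2. All function names refer to the sketches above, and the data-generating parameters are made up:

```python
import numpy as np

# illustrative data: a sample from a two-component Gaussian mixture
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])

# crude initial guess theta_1
mu1, sigma1, mu2, sigma2, p1 = -1.0, 1.0, 1.0, 1.0, 0.5

for t in range(50):
    r = e_step(x, mu1, sigma1, mu2, sigma2, p1)      # E-step
    mu1, sigma1, mu2, sigma2, p1 = m_step(x, r)      # M-step
    print(t, mixture_log_likelihood(x, mu1, sigma1, mu2, sigma2, p1))
```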

Attempt Two: Calculations

If only we knew …
– given data x_1, …, x_n and hidden data z_1, …, z_n
– find parameters μ_1, σ_1, μ_2, σ_2, p_1, p_2
– such that log L(x_1, …, x_n, z_1, …, z_n) is maximized
– let I_1 = {i : z_i = 1} and I_2 = {i : z_i = 2}

L(x_1, …, x_n, z_1, …, z_n) = Π_{i ∈ I_1} 1/(sqrt(2π) σ_1) · exp(−(x_i − μ_1)² / 2σ_1²) · Π_{i ∈ I_2} 1/(sqrt(2π) σ_2) · exp(−(x_i − μ_2)² / 2σ_2²)

The two products can be maximized separately
– here μ_1 = Σ_{i ∈ I_1} x_i / |I_1| and σ_1² = Σ_{i ∈ I_1} (x_i − μ_1)² / |I_1|
– here μ_2 = Σ_{i ∈ I_2} x_i / |I_2| and σ_2² = Σ_{i ∈ I_2} (x_i − μ_2)² / |I_2|
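A sketch of this "if only we knew" case: split the data by the (here assumed known) hidden labels and fit each Gaussian separately with the single-Gaussian ML estimates from Example 2:

```python
import numpy as np

def fit_with_known_z(x, z):
    """ML estimates when the hidden data z (numpy array with entries in {1, 2})
    is known: each component is a single-Gaussian ML fit on its own points."""
    est = {}
    for j in (1, 2):
        xj = x[z == j]                      # the set I_j = {i : z_i = j}
        mu = xj.mean()                      # mu_j = sum_{i in I_j} x_i / |I_j|
        sigma2 = ((xj - mu) ** 2).mean()    # sigma_j^2 = sum_{i in I_j} (x_i - mu_j)^2 / |I_j|
        est[j] = (mu, np.sqrt(sigma2))
    return est
```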