
1 Machine Learning, Saarland University, SS 2007. Holger Bast, Max-Planck-Institut für Informatik, Saarbrücken, Germany. Lecture 9, Friday June 15th, 2007 (EM algorithm + convergence)

2 Overview of this Lecture

Quick recap of last lecture
– maximum likelihood principle / our 3 examples

The EM algorithm
– writing down the formula (very easy)
– understanding the formula (very hard)
– example: mixture of two normal distributions

Convergence
– to local maximum (under mild assumptions)

Exercise Sheet
– explain / discuss / make a start

3 Maximum Likelihood: Example 1

Sequence of coin flips HHTTTTTTHTTTTTHTTHHT
– say 5 times H and 15 times T
– which Prob(H) and Prob(T) are most likely? (looks like Prob(H) = ¼, Prob(T) = ¾)

Formalization
– Data X = (x_1, …, x_n), x_i in {H, T}
– Parameters Θ = (p_H, p_T), p_H + p_T = 1
– Likelihood L(X, Θ) = p_H^h · p_T^t, where h = #{i : x_i = H}, t = #{i : x_i = T}
– Log likelihood Q(X, Θ) = log L(X, Θ) = h · log p_H + t · log p_T
– find Θ* = argmax_Θ L(X, Θ) = argmax_Θ Q(X, Θ)

Solution (simple calculus [blackboard])
– p_H = h / (h + t) and p_T = t / (h + t)
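
A minimal sketch of this closed-form solution, using the counts stated on the slide (the variable names are mine, not from the lecture):

```python
# Counts used on the slide ("say 5 times H and 15 times T").
h, t = 5, 15

p_H = h / (h + t)   # maximizes h * log(p_H) + t * log(1 - p_H)
p_T = t / (h + t)
print(p_H, p_T)     # 0.25 0.75, i.e. Prob(H) = 1/4, Prob(T) = 3/4
```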

4 Maximum Likelihood: Example 2

Sequence of reals drawn from N(μ, σ), the normal distribution with mean μ and standard deviation σ
– which μ and σ are most likely?

Formalization
– Data X = (x_1, …, x_n), x_i a real number
– Parameters Θ = (μ, σ)
– Likelihood L(X, Θ) = Π_i 1/(sqrt(2π) σ) · exp(−(x_i − μ)² / 2σ²)
– Log likelihood Q(X, Θ) = −n/2 · log(2π) − n · log σ − Σ_i (x_i − μ)² / 2σ²
– find Θ* = argmax_Θ L(X, Θ) = argmax_Θ Q(X, Θ)

Solution (simple calculus [blackboard])
– μ = 1/n · Σ_i x_i and σ² = 1/n · Σ_i (x_i − μ)²
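
A minimal sketch of these closed-form estimates on synthetic data (the sample, seed, and true parameters are illustrative and not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)   # synthetic data, true mu = 2.0, sigma = 1.5

mu_hat = x.mean()                        # mu = (1/n) * sum_i x_i
sigma2_hat = ((x - mu_hat) ** 2).mean()  # sigma^2 = (1/n) * sum_i (x_i - mu)^2  (note: 1/n, not 1/(n-1))
print(mu_hat, np.sqrt(sigma2_hat))
```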

5 Maximum Likelihood: Example 3

Sequence of real numbers
– each drawn from either N_1(μ_1, σ_1) or N_2(μ_2, σ_2)
– from N_1 with prob p_1, and from N_2 with prob p_2
– which μ_1, σ_1, μ_2, σ_2, p_1, p_2 are most likely?

Formalization
– Data X = (x_1, …, x_n), x_i a real number
– Hidden data Z = (z_1, …, z_n), z_i = j iff x_i drawn from N_j
– Parameters Θ = (μ_1, σ_1, μ_2, σ_2, p_1, p_2), p_1 + p_2 = 1
– Likelihood L(X, Θ) = [blackboard]
– Log likelihood Q(X, Θ) = [blackboard]
– find Θ* = argmax_Θ L(X, Θ) = argmax_Θ Q(X, Θ)
– standard calculus fails (derivative of a sum of logs of sums)
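
To see why standard calculus fails here, the mixture log-likelihood can be written out directly: it is a sum over data points of the log of a sum over the two components, so the log no longer splits the product inside. A sketch (function and parameter names are mine, not from the lecture):

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    # density of N(mu, sigma) at x
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def mixture_log_likelihood(x, p1, mu1, s1, p2, mu2, s2):
    # Q(X, Theta) = sum_i log( p1 * N1(x_i) + p2 * N2(x_i) )
    # Setting the derivatives of this to zero has no closed-form solution.
    return np.sum(np.log(p1 * normal_pdf(x, mu1, s1) + p2 * normal_pdf(x, mu2, s2)))
```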

6 The EM Algorithm — Formula

Given
– Data X = (x_1, …, x_n)
– Hidden data Z = (z_1, …, z_n)
– Parameters Θ + an initial guess θ_1

Expectation step
– Pr(Z | X; θ_t) = Pr(X | Z; θ_t) · Pr(Z | θ_t) / Σ_Z' Pr(X | Z'; θ_t) · Pr(Z' | θ_t)

Maximization step
– θ_{t+1} = argmax_Θ E_Z[ log Pr(X, Z | Θ) | X; θ_t ]

What the hell does this mean? It is crucial to understand each of these probabilities / expected values:
– What is fixed? What is random, and how? What do the conditionals mean?

7 Three Attempts to Maximize the Likelihood

Consider the mixture of two Gaussians as an example.

1. The direct way …
– given x_1, …, x_n
– find parameters μ_1, σ_1, μ_2, σ_2, p_1, p_2
– such that log L(x_1, …, x_n) is maximized
– but this optimization is too hard (sum of logs of sums)

2. If only we knew …
– given data x_1, …, x_n and hidden data z_1, …, z_n
– find parameters μ_1, σ_1, μ_2, σ_2, p_1, p_2
– such that log L(x_1, …, x_n, z_1, …, z_n) is maximized
– this would be feasible [show on blackboard], but we don't know the z_1, …, z_n

3. The EM way …
– given x_1, …, x_n and random variables Z_1, …, Z_n
– find parameters μ_1, σ_1, μ_2, σ_2, p_1, p_2
– such that E log L(x_1, …, x_n, Z_1, …, Z_n) is maximized
– this is the M-step of the EM algorithm; the E-step provides the Z_1, …, Z_n

8 E-Step — Formula

Consider the mixture of two Gaussians as an example.

We have (at the beginning of each iteration)
– the data x_1, …, x_n
– the fully specified distributions N_1(μ_1, σ_1) and N_2(μ_2, σ_2)
– the probability of choosing between N_1 and N_2, i.e., a random variable Z with p_1 = Pr(Z = 1) and p_2 = Pr(Z = 2)

We want
– for each data point x_i, a probability of choosing N_1 or N_2, i.e., random variables Z_1, …, Z_n

Solution (the actual E-step)
– take Z_i as the conditional Z | x_i
– by Bayes' law, Pr(Z_i = 1) = Pr(Z = 1 | x_i) = Pr(x_i | Z = 1) · Pr(Z = 1) / Pr(x_i), with Pr(x_i) = Σ_z Pr(x_i | Z = z) · Pr(Z = z)
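
A sketch of this E-step for the two-Gaussian mixture: for every x_i, apply Bayes' law with the current parameters to get the posterior Pr(Z_i = 1 | x_i). The helper normal_pdf and all names are mine:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    # density of N(mu, sigma) at x
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def e_step(x, p1, mu1, s1, p2, mu2, s2):
    # numerators:  Pr(x_i | Z = j) * Pr(Z = j)
    a1 = p1 * normal_pdf(x, mu1, s1)
    a2 = p2 * normal_pdf(x, mu2, s2)
    # denominator: Pr(x_i) = sum_z Pr(x_i | Z = z) * Pr(Z = z)
    r1 = a1 / (a1 + a2)          # Pr(Z_i = 1 | x_i)
    return r1, 1.0 - r1          # responsibilities of the two components for each x_i
```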

9 E-Step — Analogy to a Simple Example

Draw a ball from one of two urns
– Urn 1 is picked with prob 1/3, Urn 2 with prob 2/3
– Pr(Blue | Urn 1) = 1/2, Pr(Blue | Urn 2) = 1/4

Pr(Blue) = Pr(Blue | Urn 1) · Pr(Urn 1) + Pr(Blue | Urn 2) · Pr(Urn 2) = 1/2 · 1/3 + 1/4 · 2/3 = 1/3

Pr(Urn 1 | Blue) = Pr(Blue | Urn 1) · Pr(Urn 1) / Pr(Blue) = (1/2 · 1/3) / (1/3) = 1/2
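
The same Bayes computation written out as a quick check (plain Python, names are mine):

```python
# Prior over urns and likelihood of drawing a blue ball from each urn, as on the slide.
p_urn = {1: 1/3, 2: 2/3}
p_blue_given_urn = {1: 1/2, 2: 1/4}

p_blue = sum(p_blue_given_urn[u] * p_urn[u] for u in (1, 2))    # total probability: 1/3
p_urn1_given_blue = p_blue_given_urn[1] * p_urn[1] / p_blue     # Bayes' law: 1/2
print(p_blue, p_urn1_given_blue)
```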

10 M-Step — Formula [Blackboard]
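
The M-step formula itself was derived on the blackboard and is not part of this transcript. As an assumption-labelled sketch, these are the standard closed-form updates for the two-Gaussian mixture, given the responsibilities r1, r2 from the E-step (all names are mine):

```python
import numpy as np

def m_step(x, r1, r2):
    # Standard mixture-of-Gaussians updates (sketch, not the blackboard derivation):
    # each parameter is re-estimated from the data weighted by the responsibilities.
    n = len(x)
    p1, p2 = r1.sum() / n, r2.sum() / n                       # new mixing weights
    mu1 = (r1 * x).sum() / r1.sum()                           # weighted means
    mu2 = (r2 * x).sum() / r2.sum()
    s1 = np.sqrt((r1 * (x - mu1) ** 2).sum() / r1.sum())      # weighted std deviations
    s2 = np.sqrt((r2 * (x - mu2) ** 2).sum() / r2.sum())
    return p1, mu1, s1, p2, mu2, s2
```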

11 Convergence of the EM Algorithm

Two (log) likelihoods
– true: log L(x_1, …, x_n)
– EM: E log L(x_1, …, x_n, Z_1, …, Z_n)

Lemma 1 (lower bound)
– E log L(x_1, …, x_n, Z_1, …, Z_n) ≤ log L(x_1, …, x_n)

Lemma 2 (touch)
– E log L(x_1, …, x_n, Z_1, …, Z_n)(θ_t) = log L(x_1, …, x_n)(θ_t)

Convergence
– if the expected likelihood function is well-behaved, e.g., if the first derivative at local maxima exists and the second derivative is < 0,
– then Lemmas 1 and 2 imply convergence [blackboard]
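
A self-contained sketch that puts the E- and M-steps together on synthetic data and checks numerically what the two lemmas promise: the true log-likelihood log L(x_1, …, x_n) never decreases from one EM iteration to the next. All data, seeds, and initial values are illustrative:

```python
import numpy as np

def normal_pdf(x, mu, s):
    return np.exp(-(x - mu) ** 2 / (2 * s ** 2)) / (np.sqrt(2 * np.pi) * s)

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 700)])  # synthetic mixture data

p1, mu1, s1, p2, mu2, s2 = 0.5, -1.0, 1.0, 0.5, 1.0, 1.0   # initial guess theta_1
prev_ll = -np.inf
for t in range(50):
    # E-step: responsibilities Pr(Z_i = j | x_i; theta_t)
    a1, a2 = p1 * normal_pdf(x, mu1, s1), p2 * normal_pdf(x, mu2, s2)
    r1 = a1 / (a1 + a2)
    r2 = 1.0 - r1
    # M-step: re-estimate theta_{t+1}
    p1, p2 = r1.mean(), r2.mean()
    mu1, mu2 = (r1 * x).sum() / r1.sum(), (r2 * x).sum() / r2.sum()
    s1 = np.sqrt((r1 * (x - mu1) ** 2).sum() / r1.sum())
    s2 = np.sqrt((r2 * (x - mu2) ** 2).sum() / r2.sum())
    # true log-likelihood at theta_{t+1}; it must be monotonically non-decreasing
    ll = np.log(p1 * normal_pdf(x, mu1, s1) + p2 * normal_pdf(x, mu2, s2)).sum()
    assert ll >= prev_ll - 1e-9
    prev_ll = ll

print(p1, mu1, s1, p2, mu2, s2)
```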

12

13 Attempt Two: Calculations

If only we knew …
– given data x_1, …, x_n and hidden data z_1, …, z_n
– find parameters μ_1, σ_1, μ_2, σ_2, p_1, p_2
– such that log L(x_1, …, x_n, z_1, …, z_n) is maximized
– let I_1 = {i : z_i = 1} and I_2 = {i : z_i = 2}

L(x_1, …, x_n, z_1, …, z_n) = Π_{i in I_1} 1/(sqrt(2π) σ_1) · exp(−(x_i − μ_1)² / 2σ_1²) · Π_{i in I_2} 1/(sqrt(2π) σ_2) · exp(−(x_i − μ_2)² / 2σ_2²)

The two products can be maximized separately
– μ_1 = Σ_{i in I_1} x_i / |I_1| and σ_1² = Σ_{i in I_1} (x_i − μ_1)² / |I_1|
– μ_2 = Σ_{i in I_2} x_i / |I_2| and σ_2² = Σ_{i in I_2} (x_i − μ_2)² / |I_2|
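
A sketch of this "if only we knew" case on synthetic data: when the labels z_i are given, each component is fitted by the ordinary Gaussian MLE on its own index set I_j (data, seed, and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.choice([1, 2], size=1000, p=[0.3, 0.7])                       # hidden data, here known
x = np.where(z == 1, rng.normal(-2.0, 1.0, 1000), rng.normal(3.0, 0.5, 1000))

for j in (1, 2):
    xj = x[z == j]                                  # I_j = {i : z_i = j}
    mu_j = xj.mean()                                # mu_j = sum_{i in I_j} x_i / |I_j|
    sigma_j = np.sqrt(((xj - mu_j) ** 2).mean())    # sigma_j^2 = sum_{i in I_j} (x_i - mu_j)^2 / |I_j|
    print(j, mu_j, sigma_j)
```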

