1
**Expectation Maximization Algorithm**

Rong Jin

2
**A Mixture Model Problem**

The dataset apparently consists of two modes. How can we identify the two modes automatically?

3
**Gaussian Mixture Model (GMM)**

Assume that the dataset is generated by a mixture of two Gaussian distributions, Gaussian model 1 and Gaussian model 2. If we know the membership of each bin, estimating the two Gaussian models is easy. How can we estimate the two Gaussian models without knowing the memberships of the bins?
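To make the "easy" case concrete, here is a minimal sketch that assumes the memberships are known. The toy data, the `membership` labels, and the helper `fit_gaussian` are hypothetical and not taken from the slides; each model is summarized as a (prior, mean, variance) triple.

```python
def fit_gaussian(points):
    """Maximum-likelihood mean and variance of a list of numbers."""
    n = len(points)
    mean = sum(points) / n
    var = sum((x - mean) ** 2 for x in points) / n
    return mean, var

# Hypothetical toy data with a known membership (0 or 1) for each point.
data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
membership = [0, 0, 0, 1, 1, 1]

models = []
for k in (0, 1):
    group = [x for x, z in zip(data, membership) if z == k]
    prior = len(group) / len(data)      # fraction of points in this model
    mean, var = fit_gaussian(group)     # fit using only this model's points
    models.append((prior, mean, var))
```

With known memberships, each Gaussian is fit independently from its own points, which is why the difficulty lies entirely in the unknown memberships.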

4
**EM Algorithm for GMM**

Let the memberships be hidden variables. The EM algorithm for the Gaussian mixture model works with two sets of unknowns: the unknown memberships and the unknown Gaussian models. It learns these two sets of parameters iteratively.

5
**Start with A Random Guess**

Randomly assign the memberships to each bin.

6
**Start with A Random Guess**

Randomly assign the memberships to each bin. Estimate the mean and variance of each Gaussian model.

7
**E-step**

Fix the two Gaussian models. Estimate the posterior for each data point.
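The E-step can be sketched as follows. This is an illustration under assumptions: the function names `gaussian_pdf` and `e_step`, the (prior, mean, variance) representation, and the toy numbers are all hypothetical. The posterior of each model for a point is its prior times its density, normalized over the models.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def e_step(data, models):
    """models: list of (prior, mean, var). Returns posteriors[i][k]."""
    posteriors = []
    for x in data:
        # Joint probability of the point and each model, with models held fixed.
        joint = [p * gaussian_pdf(x, m, v) for p, m, v in models]
        total = sum(joint)
        posteriors.append([j / total for j in joint])
    return posteriors

# Hypothetical data and fixed models.
data = [0.0, 1.0, 4.0, 5.0]
models = [(0.5, 0.5, 1.0), (0.5, 4.5, 1.0)]
post = e_step(data, models)
```

Each row of `post` sums to one, so it acts as a soft membership rather than a hard assignment.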

8
**EM Algorithm for GMM**

Re-estimate the memberships for each bin.

9
**M-step**

Fix the memberships. Re-estimate the two Gaussian models, with each data point weighted by its posterior.
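A minimal sketch of the posterior-weighted re-estimation is given below. The name `m_step`, the (prior, mean, variance) output, and the toy posteriors are assumptions made for illustration.

```python
def m_step(data, posteriors, n_components=2):
    """Re-estimate (prior, mean, var) for each model from soft memberships."""
    n = len(data)
    models = []
    for k in range(n_components):
        weights = [row[k] for row in posteriors]   # posterior of model k per point
        total = sum(weights)
        prior = total / n
        mean = sum(w * x for w, x in zip(weights, data)) / total
        var = sum(w * (x - mean) ** 2 for w, x in zip(weights, data)) / total
        models.append((prior, mean, var))
    return models

# Hypothetical posteriors that happen to be hard assignments.
data = [0.0, 1.0, 4.0, 5.0]
posteriors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
models = m_step(data, posteriors)
```

When the posteriors are 0 or 1, as in this toy input, the update reduces to the ordinary per-group mean and variance.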

10
**EM Algorithm for GMM**

Re-estimate the memberships for each bin. Re-estimate the models.

11
**At the 5th Iteration**

The red Gaussian component slowly shifts toward the left end of the x-axis.

12
**At the 10th Iteration**

The red Gaussian component is still slowly shifting toward the left end of the x-axis.

13
**At the 20th Iteration**

The red Gaussian component makes a more noticeable shift toward the left end of the x-axis.

14
**At the 50th Iteration**

The red Gaussian component is close to the desired location.

15
**At the 100th Iteration**

The results are almost identical to those at the 50th iteration.

16
**EM as A Bound Optimization**

The EM algorithm in fact maximizes the log-likelihood function of the training data. That function is built from the likelihood of a single data point x, summed in log form over the whole training set.
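For the two-component Gaussian mixture, the two quantities named here take the standard form below. The symbols are assumptions, since the slide's own notation is not preserved: $p_j$ is the mixing weight, $\mu_j$ and $\sigma_j^2$ are the mean and variance of component $j$, and $\theta$ collects all of them.

```latex
p(x \mid \theta) = \sum_{j=1}^{2} p_j \, \mathcal{N}(x \mid \mu_j, \sigma_j^2),
\qquad
\ell(\theta) = \sum_{i=1}^{n} \log \sum_{j=1}^{2} p_j \, \mathcal{N}(x_i \mid \mu_j, \sigma_j^2)
```

The sum inside the logarithm is what prevents a closed-form maximizer and motivates the bound-based view.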

19
**Logarithm Bound Algorithm**

Start with an initial guess.

20
**Logarithm Bound Algorithm**

Start with an initial guess. Come up with a lower bound that touches the objective at the current guess (the touch point).

21
**Logarithm Bound Algorithm**

Start with an initial guess. Come up with a lower bound. Search for the solution that maximizes the lower bound.

22
**Logarithm Bound Algorithm**

Start with an initial guess. Come up with a lower bound. Search for the solution that maximizes the lower bound. Repeat the procedure.

23
**Logarithm Bound Algorithm**

Start with an initial guess. Come up with a lower bound. Search for the solution that maximizes the lower bound. Repeat the procedure. The iterates converge to a local optimum (the optimal point).
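One consequence of this procedure is that the objective never decreases from one iteration to the next. The sketch below checks that numerically for EM on a two-component Gaussian mixture. It is an illustration under assumptions: the synthetic data, the starting models, and the function names are all hypothetical.

```python
import math
import random

def pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(data, models):
    """Log-likelihood of the data under a mixture of (prior, mean, var) models."""
    return sum(math.log(sum(p * pdf(x, m, v) for p, m, v in models)) for x in data)

def em_iteration(data, models):
    # E-step: the posteriors define the lower bound at the current models.
    post = []
    for x in data:
        joint = [p * pdf(x, m, v) for p, m, v in models]
        total = sum(joint)
        post.append([j / total for j in joint])
    # M-step: maximize the lower bound with posterior-weighted estimates.
    new_models = []
    for k in range(len(models)):
        w = [row[k] for row in post]
        total = sum(w)
        mean = sum(wi * x for wi, x in zip(w, data)) / total
        var = sum(wi * (x - mean) ** 2 for wi, x in zip(w, data)) / total
        new_models.append((total / len(data), mean, var))
    return new_models

# Hypothetical synthetic data: two well-separated clusters.
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(100)] + [rng.gauss(6.0, 1.0) for _ in range(100)]

models = [(0.5, 1.0, 1.0), (0.5, 2.0, 1.0)]   # deliberately poor initial guess
history = [log_likelihood(data, models)]
for _ in range(50):
    models = em_iteration(data, models)
    history.append(log_likelihood(data, models))
```

Inspecting `history` shows a non-decreasing sequence, which is the behavior the touch-point construction is meant to guarantee.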

24
**EM as A Bound Optimization**

Distinguish the parameters from the previous iteration from the parameters for the current iteration, and compute the change in log-likelihood between them.

27
**Concave property of logarithm function**

28
**Definition of posterior**
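The two ingredients named on these slides, the concavity of the logarithm and the definition of the posterior, combine into the standard EM lower bound. The derivation below is a sketch in assumed notation: $\theta'$ is the previous iterate, $\theta$ the current one, and $p(x_i, j \mid \theta)$ the joint probability of data point $x_i$ and hidden membership $j$.

```latex
\begin{aligned}
\ell(\theta) - \ell(\theta')
  &= \sum_{i} \log \frac{\sum_{j} p(x_i, j \mid \theta)}{p(x_i \mid \theta')} \\
  &= \sum_{i} \log \sum_{j} p(j \mid x_i, \theta') \,
       \frac{p(x_i, j \mid \theta)}{p(x_i, j \mid \theta')}
     && \text{since } p(j \mid x_i, \theta') = \frac{p(x_i, j \mid \theta')}{p(x_i \mid \theta')} \\
  &\ge \sum_{i} \sum_{j} p(j \mid x_i, \theta') \,
       \log \frac{p(x_i, j \mid \theta)}{p(x_i, j \mid \theta')}
     && \text{by concavity of } \log
\end{aligned}
```

Both sides equal zero at $\theta = \theta'$, so the bound touches the objective at the previous iterate.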

29
**Log-Likelihood of EM Alg.**

Saddle points

30
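**Maximize GMM Model**

What is the globally optimal solution to GMM? Maximizing the objective function of GMM is an ill-posed problem: the likelihood grows without bound when one component centers on a single data point and its variance shrinks to zero.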
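One common way to see the ill-posedness is to place one component exactly on a data point and shrink its variance. The sketch below does that on hypothetical toy data; the names `spike` and `broad` and all numbers are assumptions for illustration.

```python
import math

def pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(data, models):
    """Log-likelihood of the data under a mixture of (prior, mean, var) models."""
    return sum(math.log(sum(p * pdf(x, m, v) for p, m, v in models)) for x in data)

data = [0.0, 1.0, 2.0, 3.0, 4.0]   # hypothetical data
broad = (0.5, 2.0, 2.0)            # one component covers all the data

values = []
for var in (1.0, 1e-2, 1e-4, 1e-6):
    spike = (0.5, data[0], var)    # the other collapses onto a single point
    values.append(log_likelihood(data, [spike, broad]))
```

The entries of `values` keep increasing as the variance shrinks, so no finite global maximizer exists without a constraint or prior on the variances.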

32
**Identify Hidden Variables**

For certain learning problems, identifying the hidden variables is not an easy task. Consider a simple translation model for a pair of English and Chinese sentences, together with the log-likelihood of the training corpus under that model.

33
**Identify Hidden Variables**

Consider a simple case. Introduce an alignment variable a(i) and rewrite the likelihood in terms of it.

37
**EM Algorithm for A Translation Model**

Introduce an alignment variable for each translation pair. The EM algorithm for the translation model alternates two steps. E-step: compute the posterior for each alignment variable. M-step: estimate the translation probability Pr(e|c).

38
**EM Algorithm for A Translation Model**

Introduce an alignment variable for each translation pair. The EM algorithm for the translation model alternates two steps. E-step: compute the posterior for each alignment variable. M-step: estimate the translation probability Pr(e|c). We are lucky here. In general, this step can be extremely difficult and usually requires approximate approaches.
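The two steps can be sketched for a word-level model in which each English word aligns to exactly one word of the paired Chinese sentence. That modeling choice is an assumption (it resembles IBM Model 1) and may differ from the exact model on the slides. The function name and the toy corpus, written in pinyin, are hypothetical.

```python
from collections import defaultdict

def train_translation_model(pairs, iterations=10):
    """pairs: list of (english_words, chinese_words). Returns t[(e, c)] ~ Pr(e|c)."""
    e_vocab = {e for es, _ in pairs for e in es}
    t = defaultdict(lambda: 1.0 / len(e_vocab))   # uniform starting point
    for _ in range(iterations):
        count = defaultdict(float)   # expected number of times e aligns to c
        total = defaultdict(float)   # expected number of alignments to c
        for es, cs in pairs:
            for e in es:
                # E-step: posterior of the alignment variable for word e.
                norm = sum(t[(e, c)] for c in cs)
                for c in cs:
                    post = t[(e, c)] / norm
                    count[(e, c)] += post
                    total[c] += post
        # M-step: re-estimate Pr(e|c) by normalizing the expected counts.
        t = defaultdict(float, {(e, c): count[(e, c)] / total[c] for (e, c) in count})
    return t

# Hypothetical three-sentence corpus.
pairs = [
    (["the", "house"], ["zhe", "fangzi"]),
    (["the", "book"], ["zhe", "shu"]),
    (["a", "book"], ["yiben", "shu"]),
]
t = train_translation_model(pairs)
```

In this model the posterior factorizes over words, so the E-step is a simple normalization; that tractability is the "luck" the slide refers to.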

39
**Compute Pr(e|c)**

First compute

41
**Bound Optimization for A Translation Model**

43
**Iterative Scaling**

Maximum entropy model. Iterative scaling assumes that all features are non-negative and that the sum of the features is constant.

44
**Iterative Scaling**

1. Compute the empirical mean of each feature for every class, i.e., for every j and every class y.
2. Start with w1, w2, …, wc = 0.
3. Repeat:
   - Compute p(y|x) for each training data point (xi, yi) using w from the previous iteration.
   - Compute the mean of each feature for every class using the estimated probabilities, i.e., for every j and every y.
   - Compute the change in weight for every j and every y.
   - Update w accordingly.
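The loop above can be sketched in Python. Because the slide's update formula is not preserved, this sketch assumes the generalized iterative scaling update, which adds (1/C) times the log-ratio of the empirical mean to the model mean, where C is the constant feature sum. The function names and toy data are hypothetical.

```python
import math

def predict(w, x):
    """p(y|x) for a conditional exponential model with per-class weights w[y][j]."""
    scores = [sum(wj * xj for wj, xj in zip(wy, x)) for wy in w]
    top = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def log_likelihood(w, xs, ys):
    return sum(math.log(predict(w, x)[y]) for x, y in zip(xs, ys))

def iterative_scaling(xs, ys, n_classes, iterations=100):
    n, n_features = len(xs), len(xs[0])
    c = sum(xs[0])   # assumed: every x is non-negative and sums to the same constant
    # Empirical mean of each feature for every class.
    empirical = [[sum(x[j] for x, y in zip(xs, ys) if y == k) / n
                  for j in range(n_features)] for k in range(n_classes)]
    w = [[0.0] * n_features for _ in range(n_classes)]
    history = [log_likelihood(w, xs, ys)]
    for _ in range(iterations):
        probs = [predict(w, x) for x in xs]   # uses w from the previous iteration
        for k in range(n_classes):
            for j in range(n_features):
                expected = sum(p[k] * x[j] for p, x in zip(probs, xs)) / n
                w[k][j] += math.log(empirical[k][j] / expected) / c
        history.append(log_likelihood(w, xs, ys))
    return w, history

# Hypothetical data: two non-negative features that sum to 1 for every point.
xs = [[0.9, 0.1], [0.8, 0.2], [0.4, 0.6], [0.2, 0.8], [0.1, 0.9], [0.6, 0.4]]
ys = [0, 0, 0, 1, 1, 1]
w, history = iterative_scaling(xs, ys, n_classes=2)
```

All weights are updated from the same set of probabilities, which reflects the decoupling that the bound is designed to achieve.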

45
Iterative Scaling

46
**Iterative Scaling**

Can we use the concavity of the logarithm function? No, we can’t, because we need a lower bound: the log of the normalization term enters the log-likelihood with a negative sign, so concavity yields an upper bound instead.

47
**Iterative Scaling**

The weights are still coupled with each other, so further decomposition is needed.

48
Iterative Scaling

49
**Iterative Scaling**

Wait a minute, this cannot be right! What happened?

50
**Logarithm Bound Algorithm**

Start with an initial guess. Come up with a lower bound. Search for the solution that maximizes the lower bound.

51
**Iterative Scaling**

Where does it go wrong?

52
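**Iterative Scaling**

The bound is not zero when the current weights equal the previous weights, so it does not touch the objective at the previous iterate.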

53
**Iterative Scaling**

Definition of conditional exponential model
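For reference, a conditional exponential model with one weight vector per class is commonly written as below. The notation is an assumption, since the slide's formula is not preserved: $w_{y,j}$ is the weight of feature $j$ for class $y$, and $x_j$ is the value of feature $j$.

```latex
p(y \mid x; w) = \frac{\exp\left(\sum_j w_{y,j}\, x_j\right)}
                      {\sum_{y'} \exp\left(\sum_j w_{y',j}\, x_j\right)},
\qquad
\ell(w) = \sum_{i=1}^{n} \log p(y_i \mid x_i; w)
```

The denominator is the normalization term whose logarithm appears with a negative sign in $\ell(w)$.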

54
Iterative Scaling

55
Iterative Scaling

56
**Iterative Scaling**

How about ? Is this solution unique?

57
**Iterative Scaling**

How about negative features?

58
**Faster Iterative Scaling**

The lower bound may not be tight, given that all the coupling between weights has been removed. A tighter bound can be derived by not fully decoupling the weights. The resulting terms are still univariate functions!

59
**Faster Iterative Scaling**

Log-likelihood

60
**Bad News**

You may feel great after the struggle of the derivation. However, is iterative scaling truly a great idea? Given how extensively optimization has been studied, we should try out existing methods first.

61
**Comparing Improved Iterative Scaling to Newton’s Method**

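| Dataset | Instances | Features |
|---|---|---|
| Rule | 29,602 | 246 |
| Lex | 42,509 | 135,182 |
| Summary | 24,044 | 198,467 |
| Shallow | 8,625,782 | 264,142 |

| Dataset | Improved iterative scaling: Iterations | Improved iterative scaling: Time (s) | Limited-memory quasi-Newton method: Iterations | Limited-memory quasi-Newton method: Time (s) |
|---|---|---|---|---|
| Rule | 823 | 42.48 | 81 | 1.13 |
| Lex | 241 | 102.18 | 176 | 20.02 |
| Summary | 626 | 208.22 | 69 | 8.52 |
| Shallow | 3216 | | 421 | |

Try out the standard numerical methods before you get excited about your algorithm.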
