Mixture Language Models and EM Algorithm

1 Mixture Language Models and EM Algorithm
(Lecture for CS397-CXZ Intro Text Info Systems)
Sept. 17, 2003
ChengXiang Zhai, Department of Computer Science, University of Illinois, Urbana-Champaign

2 Rest of this Lecture
- Unigram mixture models
  - Slightly more sophisticated unigram LMs
  - Related to smoothing
- EM algorithm
  - VERY useful for estimating parameters of a mixture model, or whenever latent/hidden variables are involved
  - Will occur again and again in the course

3 Modeling a Multi-topic Document
- A document with 2 types of vocabulary: a text mining passage and a food nutrition passage
- How do we model such a document?
- How do we “generate” such a document?
- How do we estimate our model?
- Solution: a mixture model + EM

4 Simple Unigram Mixture Model
Model/topic 1, p(w|θ1): text 0.2, mining 0.1, association 0.01, clustering 0.02, food ...   (weight λ = 0.7)
Model/topic 2, p(w|θ2): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02   (weight 1 - λ = 0.3)
Mixture: p(w|θ1, θ2) = λ p(w|θ1) + (1 - λ) p(w|θ2)
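A minimal Python sketch of this two-component mixture, using the partial word probabilities shown above (the tiny default mass for unlisted words is an assumption purely so the example runs):

```python
# Two-component unigram mixture: p(w) = lam * p(w|theta1) + (1 - lam) * p(w|theta2)
theta1 = {"text": 0.2, "mining": 0.1, "association": 0.01, "clustering": 0.02}  # text-mining topic
theta2 = {"food": 0.25, "nutrition": 0.1, "healthy": 0.05, "diet": 0.02}        # food-nutrition topic
lam = 0.7  # mixing weight of topic 1

def p_mixture(word, lam=lam, eps=1e-6):
    """Mixture probability of one word; eps stands in for probabilities not listed on the slide."""
    return lam * theta1.get(word, eps) + (1 - lam) * theta2.get(word, eps)

print(p_mixture("text"))  # ~0.14  (0.7 * 0.2 dominates)
print(p_mixture("food"))  # ~0.075 (0.3 * 0.25 dominates)
```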

5 Parameter Estimation
Likelihood: the document’s log-likelihood under the mixture (the slide’s formula is reconstructed below).
Estimation scenarios:
- p(w|θ1) & p(w|θ2) are known; estimate λ.
  (“The doc is about text mining and food nutrition; how much of it is about text mining?”)
- p(w|θ1) & λ are known; estimate p(w|θ2).
  (“30% of the doc is about text mining; what’s the rest about?”)
- p(w|θ1) is known; estimate λ & p(w|θ2).
  (“The doc is about text mining; is it also about some other topic, and if so to what extent?”)
- λ is known; estimate p(w|θ1) & p(w|θ2).
  (“30% of the doc is about one topic and 70% is about another; what are these two topics?”)
- Estimate λ, p(w|θ1), and p(w|θ2). (= clustering)
  (“The doc is about two subtopics; find out what these two subtopics are and to what extent the doc covers each.”)

6 Parameter Estimation
Example: given p(w|θ1) and p(w|θ2), estimate λ.
Maximum likelihood: there is no closed-form solution, but many numerical algorithms apply; the Expectation-Maximization (EM) algorithm is a commonly used method.
Basic idea: start from some random guess of the parameter values, then iteratively improve the estimates (“hill climbing”):
- E-step: compute a lower bound on the log-likelihood log L(λ) at the current guess λ^(n)
- M-step: find a new λ^(n+1) that maximizes the lower bound
(The slide’s figure plots log L(λ), the lower bound, the current guess λ^(n), and the next guess λ^(n+1).)
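A minimal Python sketch of this EM loop for the “estimate λ only” scenario, assuming both topic word distributions are known and fixed (the function and variable names are illustrative, not from the slides):

```python
def em_estimate_lambda(doc_counts, theta1, theta2, lam=0.5, n_iter=20, eps=1e-6):
    """Estimate the mixing weight lam by EM, keeping p(w|theta1) and p(w|theta2) fixed.

    doc_counts maps word -> count c(w, d); theta1/theta2 map word -> probability,
    with eps used for words missing from a topic's table.
    """
    for _ in range(n_iter):
        # E-step: posterior probability that each word was generated by topic 1
        post = {}
        for w in doc_counts:
            p1 = lam * theta1.get(w, eps)
            p2 = (1 - lam) * theta2.get(w, eps)
            post[w] = p1 / (p1 + p2)
        # M-step: new lam = expected fraction of word tokens generated by topic 1
        n = sum(doc_counts.values())
        lam = sum(c * post[w] for w, c in doc_counts.items()) / n
    return lam
```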

7 EM Algorithm: Intuition
Observed doc d is drawn from the mixture p(w|θ1, θ2) = λ p(w|θ1) + (1 - λ) p(w|θ2), but now λ = ? and 1 - λ = ? are unknown, and each word may have come from θ1 or from θ2.
Model/topic 1, p(w|θ1): text 0.2, mining 0.1, association 0.01, clustering 0.02, food ...
Model/topic 2, p(w|θ2): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02
Suppose we knew the identity of each word (which topic generated it)…

8 Can We “Guess” the Identity?
Identity (“hidden”) variable: z_w ∈ {1 (w is from θ1), 0 (w is from θ2)}
(The slide’s table pairs each word of “the paper presents a text mining algorithm ...” with its z_w value.)
What’s a reasonable guess for z_w?
- It depends on λ (why?)
- It depends on p(w|θ1) and p(w|θ2) (how?)
E-step: guess each z_w; M-step: re-estimate λ from those guesses. Initially, set λ to some random value, then iterate… (the update formulas are reconstructed below)
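The E-step and M-step formulas on the original slide are images; the standard updates for this two-component case, with c(w, d) the count of w in document d and superscript (n) indexing iterations, would be:

```latex
\text{E-step:}\quad
p^{(n)}(z_w = 1)
  = \frac{\lambda^{(n)}\, p(w \mid \theta_1)}
         {\lambda^{(n)}\, p(w \mid \theta_1) + \bigl(1 - \lambda^{(n)}\bigr)\, p(w \mid \theta_2)}

\text{M-step:}\quad
\lambda^{(n+1)}
  = \frac{\sum_{w \in V} c(w, d)\, p^{(n)}(z_w = 1)}
         {\sum_{w \in V} c(w, d)}
```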

9 An Example of EM Computation
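The worked table on this slide did not survive extraction. As a stand-in, here is a short run of the em_estimate_lambda sketch from slide 6, reusing theta1 and theta2 from the slide-4 sketch; the word counts and starting value are invented for illustration, not the slide’s original numbers:

```python
doc_counts = {"text": 4, "mining": 3, "algorithm": 2, "food": 2, "nutrition": 1}  # hypothetical counts
lam_hat = em_estimate_lambda(doc_counts, theta1, theta2, lam=0.5, n_iter=10)
print(round(lam_hat, 3))  # ~0.7: the estimated share of the doc explained by the text-mining topic
```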

10 Any Theoretical Guarantee?
- EM is guaranteed to reach a LOCAL maximum.
- When the local maximum is also the global maximum, EM can find the global maximum.
- But when there are multiple local maxima, “special techniques” are needed (e.g., trying different initial values).
- In our case, there is one unique local maximum (why?)

11 A General Introduction to EM
Data: X (observed) + H (hidden). Parameter: θ.
“Incomplete” likelihood: L(θ) = log p(X|θ)
“Complete” likelihood: L_c(θ) = log p(X, H|θ)
EM tries to iteratively maximize the expected complete likelihood. Starting with an initial guess θ^(0):
1. E-step: compute the expectation of the complete likelihood (the Q-function; see below)
2. M-step: compute θ^(n) by maximizing the Q-function
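The Q-function formula on the original slide is an image; the standard definition consistent with the surrounding text is:

```latex
Q(\theta;\, \theta^{(n-1)})
  = \mathbb{E}_{p(H \mid X,\, \theta^{(n-1)})}\bigl[ L_c(\theta) \bigr]
  = \sum_{H} p(H \mid X, \theta^{(n-1)})\, \log p(X, H \mid \theta)
```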

12 Convergence Guarantee
Goal: maximize the “incomplete” likelihood L(θ) = log p(X|θ), i.e., choose θ^(n) so that L(θ^(n)) - L(θ^(n-1)) ≥ 0.
Note that, since p(X, H|θ) = p(H|X, θ) p(X|θ), we have L(θ) = L_c(θ) - log p(H|X, θ), and therefore
L(θ^(n)) - L(θ^(n-1)) = L_c(θ^(n)) - L_c(θ^(n-1)) + log [ p(H|X, θ^(n-1)) / p(H|X, θ^(n)) ]
Taking the expectation w.r.t. p(H|X, θ^(n-1)):
L(θ^(n)) - L(θ^(n-1)) = Q(θ^(n); θ^(n-1)) - Q(θ^(n-1); θ^(n-1)) + D( p(H|X, θ^(n-1)) || p(H|X, θ^(n)) )
EM chooses θ^(n) to maximize Q, and the KL-divergence term is always non-negative. Therefore, L(θ^(n)) ≥ L(θ^(n-1))!
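In LaTeX form, the same decomposition (a restatement of the identity above, with both right-hand terms non-negative) reads:

```latex
L(\theta^{(n)}) - L(\theta^{(n-1)})
  = \underbrace{Q(\theta^{(n)};\, \theta^{(n-1)}) - Q(\theta^{(n-1)};\, \theta^{(n-1)})}_{\ge\, 0
        \;\text{(EM maximizes } Q\text{)}}
  + \underbrace{D\bigl(p(H \mid X, \theta^{(n-1)}) \,\big\|\, p(H \mid X, \theta^{(n)})\bigr)}_{\ge\, 0
        \;\text{(KL-divergence)}}
```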

13 Another way of looking at EM
(The slide’s figure plots the likelihood p(X|θ) against the lower bound L(θ^(n-1)) + Q(θ; θ^(n-1)) - Q(θ^(n-1); θ^(n-1)), obtained by dropping the non-negative KL term D( p(H|X, θ^(n-1)) || p(H|X, θ) ) from the identity on the previous slide; the current guess θ^(n-1) and the next guess θ^(n) are marked on the curves.)
E-step = computing the lower bound (the Q function)
M-step = maximizing the lower bound

14 What You Should Know
- Why is the unigram language model so important?
- What is a unigram mixture language model?
- How to estimate the parameters of simple unigram mixture models using EM
- Know the general idea of EM (EM will be covered again later in the course)

