
1 Integrating Topics and Syntax
Thomas L. Griffiths, Mark Steyvers, David M. Blei, Joshua B. Tenenbaum
Presented by Han Liu
Department of Computer Science, University of Illinois at Urbana-Champaign
April 12, 2005

2 Outline
Motivations – syntactic vs. semantic modeling
Formalization – notations and terminology
Generative models – pLSI; Latent Dirichlet Allocation
Composite models – HMMs + LDA
Inference – MCMC (Metropolis-Hastings; Gibbs sampling)
Experiments – performance and evaluations
Summary – Bayesian hierarchical models
Discussions

3 Motivations
Statistical language modeling
- Syntactic dependencies → short-range dependencies
- Semantic dependencies → long-range dependencies
Current models only consider one aspect:
- Hidden Markov Models (HMMs): syntactic modeling
- Latent Dirichlet Allocation (LDA): semantic modeling
- Probabilistic Latent Semantic Indexing (pLSI): semantic modeling
A model that captures both kinds of dependencies may be more useful!

4 Problem Formalization
Word - A word is an item from a vocabulary indexed by {1, …, V}, represented as a unit-basis vector: the vth word is a V-vector w in which the vth element is 1 and all other elements are 0.
Document - A document is a sequence of N words, denoted w = (w1, w2, …, wN), where wi is the ith word in the sequence.
Corpus - A corpus is a collection of M documents, denoted D = {w1, w2, …, wM}.
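As an illustration (not from the slides), a minimal Python sketch of this representation with a toy vocabulary of my own choosing; only the one-hot convention above is taken from the slide:

import numpy as np

vocabulary = ["the", "topic", "model", "word", "syntax"]   # toy vocabulary, V = 5
V = len(vocabulary)

def one_hot(v):
    """Return the unit-basis vector for the v-th word (0-indexed here)."""
    w = np.zeros(V)
    w[v] = 1.0
    return w

# A document is a sequence of N words; a corpus is a collection of M documents.
document = [one_hot(vocabulary.index(t)) for t in ["the", "topic", "model"]]
corpus = [document]                       # M = 1 in this toy example
print(len(corpus), len(document), document[0])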

5 Latent Semantic Structure
Diagram: a latent structure generates a distribution over words, which produces the observed words; inferring the latent structure from the words supports prediction.

6 Probabilistic Generative Models
Probabilistic Latent Semantic Indexing (pLSI)
- Hofmann (1999), ACM SIGIR
- Probabilistic semantic model
Latent Dirichlet Allocation (LDA)
- Blei, Ng, & Jordan (2003), J. of Machine Learning Res.
- Probabilistic semantic model
Hidden Markov Models (HMMs)
- Baum & Petrie (1966), Ann. Math. Stat.
- Probabilistic syntactic model

7 Dirichlet vs. Multinomial Distributions
Dirichlet distribution (conjugate prior)
Multinomial distribution
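The formulas on this slide did not survive the transcript; the standard densities, written in LaTeX for reference, are:

p(\theta \mid \alpha) = \frac{\Gamma\!\left(\sum_{i=1}^{T} \alpha_i\right)}{\prod_{i=1}^{T} \Gamma(\alpha_i)} \prod_{i=1}^{T} \theta_i^{\alpha_i - 1}, \qquad \theta_i \ge 0,\ \sum_{i=1}^{T} \theta_i = 1

p(n_1, \ldots, n_T \mid \theta, N) = \frac{N!}{\prod_{i=1}^{T} n_i!} \prod_{i=1}^{T} \theta_i^{n_i}, \qquad \sum_{i=1}^{T} n_i = N

Conjugacy means that observing counts n under the multinomial turns a Dir(\alpha) prior into a Dir(\alpha + n) posterior.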

8 Probabilistic LSI: Graphical Model
Plate diagram: d → z → w, with the word node repeated N_d times inside the document plate (D documents).
Annotations: the document d indexes a distribution over topics; z is the latent topic variable; a word w is generated from that topic.

9 Probabilistic LSI: Parameter Estimation
The log-likelihood of probabilistic LSI
EM algorithm
- E-step
- M-step
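The equations themselves are not in the transcript; one common parameterization of pLSI, shown here for completeness, is:

\mathcal{L} = \sum_{d} \sum_{w} n(d, w) \log \sum_{z} P(w \mid z)\, P(z \mid d)

E-step:
P(z \mid d, w) = \frac{P(w \mid z)\, P(z \mid d)}{\sum_{z'} P(w \mid z')\, P(z' \mid d)}

M-step:
P(w \mid z) \propto \sum_{d} n(d, w)\, P(z \mid d, w), \qquad P(z \mid d) \propto \sum_{w} n(d, w)\, P(z \mid d, w)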

10 LDA: Graphical Model
Plate diagram: α → θ → z → w ← φ ← β, with z and w repeated N_d times per document, θ repeated over the D documents, and φ repeated over the T topics.
Annotations: sample a distribution over topics θ, sample a topic z, sample a word w from that topic.

11 Latent Dirichlet Allocation
A variant of LDA developed by Griffiths and Steyvers (2003)
- choose N | ξ ~ Poisson(ξ)
- sample θ | α ~ Dir(α)
- sample φ | β ~ Dir(β)
- sample z | θ ~ Multinomial(θ)
- sample w | z, φ(z) ~ Multinomial(φ(z))
Model inference
- all the Dirichlet priors are assumed to be symmetric
- instead of variational inference and empirical Bayes parameter estimation, Gibbs sampling is adopted
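To make the generative process concrete, here is a minimal numpy sketch of sampling one document under these assumptions; the toy sizes and hyperparameter values are my own choices, not from the slides:

import numpy as np

rng = np.random.default_rng(0)
T, V = 4, 10                        # number of topics and vocabulary size (toy values)
alpha, beta, xi = 0.5, 0.1, 20.0    # symmetric Dirichlet hyperparameters, Poisson rate

phi = rng.dirichlet(np.full(V, beta), size=T)   # phi[t]: word distribution of topic t
N = rng.poisson(xi)                             # document length
theta = rng.dirichlet(np.full(T, alpha))        # document's distribution over topics

words = []
for _ in range(N):
    z = rng.choice(T, p=theta)                  # sample a topic
    w = rng.choice(V, p=phi[z])                 # sample a word from that topic
    words.append(w)
print(words)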

12 The Composite Model
An intuitive representation (diagram): a Markov chain of class states s1 → s2 → s3 → s4, one per word w1, w2, w3, w4; the topic assignments z1, z2, z3, z4 are drawn from the document's topic distribution θ and also feed the words.
Semantic state: generate words from LDA.
Syntactic states: generate words from HMMs.

13 Composite Model: Graphical Model
Plate diagram: the semantic part α → θ(d) → z → w ← φ(z) ← β (the LDA component) and the syntactic part δ → π(c) → c → w ← φ(c) ← γ (the HMM component); the class variables c form a Markov chain that decides whether each word is emitted from its topic or from its class. Plates: N_d words per document, M documents, T topics, C classes.

14 Composite Model
All the Dirichlet priors are assumed to be symmetric
- choose N | ξ ~ Poisson(ξ)
- sample θ(d) | α ~ Dir(α)
- sample φ(zi) | β ~ Dir(β)
- sample φ(ci) | γ ~ Dir(γ)
- sample π(ci-1) | δ ~ Dir(δ)
- sample zi | θ(d) ~ Multinomial(θ(d))
- sample ci | π(ci-1) ~ Multinomial(π(ci-1))
- sample wi | zi, φ(zi) ~ Multinomial(φ(zi)) if ci = 1
- sample wi | ci, φ(ci) ~ Multinomial(φ(ci)) otherwise
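A minimal numpy sketch of this generative process, following the steps listed above; the toy dimensions, hyperparameter values, starting class, and the choice of class 1 as the semantic state are my own illustrative assumptions:

import numpy as np

rng = np.random.default_rng(1)
T, C, V = 4, 3, 10                    # topics, syntactic classes, vocabulary (toy sizes)
alpha, beta, gamma, delta, xi = 0.5, 0.1, 0.1, 0.5, 20.0

theta = rng.dirichlet(np.full(T, alpha))             # document's topic distribution
phi_z = rng.dirichlet(np.full(V, beta), size=T)      # word distributions of topics
phi_c = rng.dirichlet(np.full(V, gamma), size=C)     # word distributions of classes
pi = rng.dirichlet(np.full(C, delta), size=C)        # class transition matrix pi[c_prev]

N = rng.poisson(xi)
words, c_prev = [], 0                                # assume the chain starts in class 0
for _ in range(N):
    z = rng.choice(T, p=theta)                       # topic assignment
    c = rng.choice(C, p=pi[c_prev])                  # class assignment from the HMM
    if c == 1:                                       # semantic class: emit from the topic
        w = rng.choice(V, p=phi_z[z])
    else:                                            # syntactic class: emit from the class
        w = rng.choice(V, p=phi_c[c])
    words.append(w)
    c_prev = c
print(words)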

15 The Composite Model: Generative process

16 Bayesian Inference
The EM algorithm can be applied to the composite model
- treating θ, φ(z), φ(c), π(c) as parameters
- with log P(w | θ, φ(z), φ(c), π(c)) as the likelihood
- but there are too many parameters and convergence is too slow
- the Dirichlet priors are necessary assumptions!
Markov Chain Monte Carlo (MCMC)
- Instead of explicitly representing θ, φ(z), φ(c), π(c), we consider the posterior distribution over the assignments of words to topics and classes, P(z | w) and P(c | w).

17 Markov Chain Monte Carlo
Sampling from the posterior distribution according to a Markov chain
- an ergodic (irreducible & aperiodic) Markov chain converges to a unique equilibrium distribution π(x)
- the idea is to sample the parameters according to a Markov chain whose equilibrium distribution π(x) is exactly the posterior distribution
The key task is to construct a suitable transition kernel T(x, x').
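For reference (standard MCMC facts, not taken from the transcript), the equilibrium condition for a transition kernel T, and the stronger detailed-balance condition usually used to construct it, are:

\pi(x') = \sum_{x} \pi(x)\, T(x, x') \quad \text{(stationarity)}

\pi(x)\, T(x, x') = \pi(x')\, T(x', x) \quad \text{(detailed balance / reversibility)}

Detailed balance implies stationarity by summing both sides over x.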

18 Metropolis-Hastings Algorithm
Sampling by constructing a reversible Markov chain
- a reversible Markov chain (one satisfying detailed balance) guarantees that π(x) is its equilibrium distribution
- the simultaneous-updating Metropolis-Hastings algorithm is similar in spirit to rejection sampling

19 Metropolis-Hastings Algorithm (cont.)
loop sample x’ from Q( x, x’); a =min{1, (p (x’)/ p (x))*(Q( x(t), x’) / Q (x’, x(t)))}; r = U(0,1); if a < r reject, x(t+1) = x(t); else accept, x(t+1) =x’; end; - Metropolis Hastings Intuition xt r=1.0 x* r=p(x*)/p(xt) x* Han Liu

20 Metropolis-Hastings Algorithm
Why it works
Single-site updating algorithm
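The "why it works" derivation on the original slide is not in the transcript; the standard one-line justification is that the Metropolis-Hastings kernel satisfies detailed balance:

\pi(x)\, Q(x, x')\, a(x, x') = \min\{\pi(x)\, Q(x, x'),\ \pi(x')\, Q(x', x)\} = \pi(x')\, Q(x', x)\, a(x', x)

so π is the equilibrium distribution of the chain. In single-site updating, one component of x is updated at a time with the others held fixed, which is the form used for the hyperparameters later in this talk.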

21 Gibbs Sampling
A special case of the single-site updating Metropolis-Hastings algorithm
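In particular (a standard result, stated here for completeness), if the proposal for component x_i is its full conditional, the Metropolis-Hastings acceptance probability is always 1:

Q(x, x') = \pi(x_i' \mid x_{-i}) \;\Rightarrow\;
a(x, x') = \min\left\{1,\ \frac{\pi(x')\, \pi(x_i \mid x_{-i})}{\pi(x)\, \pi(x_i' \mid x_{-i})}\right\} = 1,

since \pi(x) = \pi(x_i \mid x_{-i})\, \pi(x_{-i}) and x_{-i}' = x_{-i}.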

22 Gibbs Sampling for the Composite Model
θ, φ, and π are all integrated out of the corresponding terms; the hyperparameters are sampled with a single-site Metropolis-Hastings algorithm.
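The update equations themselves are not in the transcript. For intuition, the collapsed Gibbs update for plain LDA (the semantic component alone, with symmetric priors; the composite model's update additionally conditions on the class assignments) has the well-known form:

P(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\;
\frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta} \cdot
\frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha}

where n^{(w_i)}_{-i,j} is the number of times word w_i is assigned to topic j (excluding position i), n^{(d_i)}_{-i,j} is the number of words in document d_i assigned to topic j, W is the vocabulary size, and T the number of topics.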

23 Experiments
Corpora
- Brown corpus: 500 documents, 1,137,466 word tokens
- TASA corpus: 37,651 documents, 12,190,931 word tokens
- NIPS corpus: 1,713 documents, 4,312,614 word tokens
- W = 37,202 (Brown + TASA); W = 17,268 (NIPS)
Experimental design
- one class for sentence start/end markers {., ?, !}
- T = 200 & C = 20 (composite); C = 2 (LDA); T = 1 (HMM)
- 4,000 iterations, with 2,000 burn-in and a lag of 100
- 1st-, 2nd-, and 3rd-order Markov chains are considered

24 Identifying function and content words

25 Comparative study on NIPS corpus (T=100 & C = 50)

26 Identifying function and content words (NIPS)

27 Marginal probabilities
Bayesian model comparison
- P(w | M) is estimated using the harmonic mean of the likelihoods over the 2,000 post-burn-in iterations
- used to evaluate the Bayes factors
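The harmonic-mean estimator referred to here (standard form, not shown in the transcript) is:

P(\mathbf{w} \mid M) \;\approx\; \left( \frac{1}{S} \sum_{s=1}^{S} \frac{1}{P(\mathbf{w} \mid \theta^{(s)}, M)} \right)^{-1},

where \theta^{(s)} are the parameter (or assignment) samples from the S retained MCMC iterations; the Bayes factor between two models is the ratio of their estimated marginal likelihoods.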

28 Part of Speech Tagging
Assessed performance on the Brown corpus
- one tag set consisted of all Brown tags (297)
- the other collapsed the Brown tags into 10 designations
- the 20th sample was used, evaluated by the Adjusted Rand Index
- compared with DC on the 1,000 most frequent words with 19 clusters
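The Adjusted Rand Index used for this comparison (the standard Hubert-Arabie form, included here for reference) is:

\mathrm{ARI} = \frac{\sum_{ij} \binom{n_{ij}}{2} - \left[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\right] / \binom{n}{2}}
{\tfrac{1}{2}\left[\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}\right] - \left[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\right] / \binom{n}{2}}

where n_{ij} is the number of words assigned to cluster i and gold tag j, a_i and b_j are the row and column sums of that contingency table, and n is the total number of words.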

29 Document Classification
Evaluated with a naïve Bayes classifier
- 500 documents in Brown are classified into 15 groups
- the topic vectors produced by LDA and the composite model are used to train the naïve Bayes classifier
- 10-fold cross-validation is used to evaluate the 20th sample
Results (baseline accuracy: 0.09)
- trained on Brown: LDA (0.51); 1st-order composite model (0.45)
- Brown + TASA: LDA (0.54); 1st-order composite model (0.45)
- Explanation: only about 20% of the words are allocated to the semantic component, too few to find correlations!
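As an illustration of this evaluation protocol only (not the authors' code): a Gaussian naïve Bayes variant with 10-fold cross-validation, where the feature matrix and labels are random placeholders standing in for a sample's per-document topic vectors and the 15 Brown groups:

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

# Placeholder data: topic_vectors would be the per-document topic proportions
# taken from one Gibbs sample; labels would be the 15 Brown document groups.
rng = np.random.default_rng(3)
topic_vectors = rng.dirichlet(np.ones(200), size=500)   # 500 documents x 200 topics
labels = rng.integers(0, 15, size=500)                  # 15 groups (random stand-ins)

clf = GaussianNB()
scores = cross_val_score(clf, topic_vectors, labels, cv=10)   # 10-fold cross-validation
print(scores.mean())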

30 Summary
Bayesian hierarchical models are natural for text modeling
Simultaneously learning syntactic classes and semantic topics is possible through the combination of basic modules
Discovering syntactic and semantic building blocks forms the basis of more sophisticated representations
Similar ideas could be generalized to other areas

31 Discussions
Gibbs sampling vs. the EM algorithm?
Hierarchical models reduce the number of parameters, but what about model complexity?
Are equal priors appropriate for Bayesian model comparison?
Is there really any effect of the 4 hyperparameters?
Probabilistic LSI makes no normal-distribution assumption, while probabilistic PCA assumes normality!
EM is sensitive to local maxima; why does the Bayesian approach get through?
Is the document-classification experiment a good evaluation?
Majority vote for tagging?

