Integrating Topics and Syntax - Thomas L. Griffiths, Mark Steyvers, David M. Blei, Joshua B. Tenenbaum. Presented by Han Liu, Department of Computer Science, University of Illinois at Urbana-Champaign, hanliu@ncsa.uiuc.edu. April 12, 2005
Outline Motivations - syntactic vs. semantic modeling; Formalization - notation and terminology; Generative Models - pLSI, Latent Dirichlet Allocation; Composite Models - HMMs + LDA; Inference - MCMC (Metropolis-Hastings, Gibbs sampling); Experiments - performance and evaluation; Summary - Bayesian hierarchical models; Discussions
Motivations Statistical language modeling - syntactic dependencies: short-range - semantic dependencies: long-range. Current models capture only one aspect - Hidden Markov Models (HMMs): syntactic modeling - Latent Dirichlet Allocation (LDA): semantic modeling - Probabilistic Latent Semantic Indexing (pLSI): semantic modeling. A model that captures both kinds of dependencies may be more useful!
Problem Formalization Word - a word is an item from a vocabulary indexed by {1,…,V}, represented as a unit-basis vector: the vth word is a V-vector w in which the vth element is 1 and all other elements are 0. Document - a document is a sequence of N words denoted by w = {w1, w2, …, wN}, where wi is the ith word in the sequence. Corpus - a corpus is a collection of M documents, denoted by D = {w1, w2, …, wM}
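The unit-basis (one-hot) word representation above can be sketched in a few lines; the vocabulary size and word index below are illustrative values, not from the slides.

```python
import numpy as np

V = 5  # vocabulary size (illustrative)
v = 2  # index of the word (0-based here; the slides index words from 1)

# The v-th word is a V-vector with a 1 in position v and 0 elsewhere.
w = np.zeros(V, dtype=int)
w[v] = 1

# A document is then a sequence of such vectors.
doc = [w]
```

Exactly one component is 1, so the vector both identifies the word and has unit sum.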
Latent Semantic Structure [Diagram: a latent structure generates a distribution over words, from which the observed words are drawn; inferring the latent structure from the words enables prediction.]
Probabilistic Generative Models Probabilistic Latent Semantic Indexing (pLSI) - Hofmann (1999), ACM SIGIR - probabilistic semantic model. Latent Dirichlet Allocation (LDA) - Blei, Ng, & Jordan (2003), J. of Machine Learning Res. - probabilistic semantic model. Hidden Markov Models (HMMs) - Baum & Petrie (1966), Ann. Math. Stat. - probabilistic syntactic model
Dirichlet vs. Multinomial Distributions Dirichlet distribution (conjugate prior): p(θ|α) = (Γ(Σ_i α_i) / ∏_i Γ(α_i)) ∏_i θ_i^{α_i − 1}. Multinomial distribution: p(n|θ) = (N! / ∏_i n_i!) ∏_i θ_i^{n_i}. Conjugacy means the posterior is again Dirichlet: p(θ | n, α) = Dir(α + n)
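The Dirichlet-multinomial conjugacy can be illustrated numerically; the dimension, prior value, and count total below are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Symmetric Dirichlet prior over a 4-dimensional probability simplex.
alpha = np.full(4, 1.0)
theta = rng.dirichlet(alpha)        # a draw from Dir(alpha): a point on the simplex

# Draw multinomial counts of 100 trials from theta.
counts = rng.multinomial(100, theta)

# Conjugacy: the posterior over theta given the counts is again
# Dirichlet, with parameters alpha + counts.
posterior_alpha = alpha + counts
```

No integration is needed: updating the prior just adds the observed counts to the Dirichlet parameters.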
Probabilistic LSI: Graphical Model [Plate diagram: for each document d, a topic z is drawn as a latent variable from the document's distribution over topics, and a word w is generated from that topic; the inner plate repeats Nd times per document, the outer plate D times over documents.]
Probabilistic LSI: Parameter Estimation The log-likelihood of probabilistic LSI: L = Σ_d Σ_w n(d,w) log Σ_z P(w|z) P(z|d). EM algorithm - E-step: P(z|d,w) ∝ P(z|d) P(w|z) - M-step: P(w|z) ∝ Σ_d n(d,w) P(z|d,w) and P(z|d) ∝ Σ_w n(d,w) P(z|d,w)
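A minimal sketch of the pLSI E- and M-steps on a toy document-word count matrix; the corpus, sizes, and iteration count are invented for illustration. The log-likelihood should never decrease across EM iterations.

```python
import numpy as np

rng = np.random.default_rng(1)
n_dw = np.array([[5, 2, 0], [0, 3, 4]], dtype=float)  # toy doc-word counts
D, W = n_dw.shape
Z = 2                                                 # number of topics

p_z_d = rng.dirichlet(np.ones(Z), size=D)   # P(z|d), shape (D, Z)
p_w_z = rng.dirichlet(np.ones(W), size=Z)   # P(w|z), shape (Z, W)

def loglik():
    p_w_d = p_z_d @ p_w_z                   # P(w|d) = sum_z P(z|d) P(w|z)
    return float((n_dw * np.log(p_w_d)).sum())

lls = []
for _ in range(50):
    # E-step: responsibilities P(z|d,w) ∝ P(z|d) P(w|z)
    resp = p_z_d[:, :, None] * p_w_z[None, :, :]      # (D, Z, W)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: re-estimate from expected counts n(d,w) P(z|d,w)
    nz = n_dw[:, None, :] * resp                      # (D, Z, W)
    p_w_z = nz.sum(axis=0)
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = nz.sum(axis=2)
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    lls.append(loglik())
```

This exhibits the drawback noted later in the slides: the number of P(z|d) parameters grows linearly with the number of documents.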
LDA: Graphical Model [Plate diagram: hyperparameter α generates a per-document topic distribution θ; for each word, a topic z is sampled from θ and a word w is sampled from the topic's word distribution φ, itself drawn from a Dirichlet with hyperparameter β; plates repeat Nd times per document, D times over documents, and T times over topics.]
Latent Dirichlet Allocation A variant of LDA developed by Griffiths (2003) - choose N | ξ ~ Poisson(ξ) - sample θ | α ~ Dir(α) - sample φ | β ~ Dir(β) - sample z | θ ~ Multinomial(θ) - sample w | z, φ(z) ~ Multinomial(φ(z)). Model inference - all Dirichlet priors are assumed symmetric - instead of variational inference and empirical Bayes parameter estimation, Gibbs sampling is adopted
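The generative process listed above can be run forward directly; the numbers of topics and words and the hyperparameter values below are illustrative assumptions, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(2)
T, V = 3, 6                 # topics and vocabulary size (illustrative)
alpha, beta, xi = 1.0, 0.1, 10.0

phi = rng.dirichlet(np.full(V, beta), size=T)   # phi(z) | beta ~ Dir(beta), one row per topic
theta = rng.dirichlet(np.full(T, alpha))        # theta | alpha ~ Dir(alpha), for one document
N = rng.poisson(xi)                             # N | xi ~ Poisson(xi)

words = []
for _ in range(N):
    z = rng.choice(T, p=theta)                  # z | theta ~ Multinomial(theta)
    w = rng.choice(V, p=phi[z])                 # w | z, phi(z) ~ Multinomial(phi(z))
    words.append(int(w))
```

Inference then reverses this process: given only `words`, recover plausible `z`, `theta`, and `phi`.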
The Composite Model An intuitive representation: a hidden class sequence s1 → s2 → s3 → s4 evolves as an HMM; each word wi is generated either by the semantic state (words drawn from LDA topics z1…z4, themselves drawn from the document's topic distribution θ) or by one of the syntactic states (words drawn from HMM class distributions). Semantic state: generate words from LDA. Syntactic states: generate words from HMMs
Composite Model: Graphical Model [Plate diagram: hyperparameters α, β, γ, δ generate the per-document topic distribution θ, the topic word distributions Φ(z), the class word distributions Φ(c), and the class transition distributions π; per word, a topic z is drawn from θ and a class c from π(c_{i−1}), and the word w comes from Φ(z) or Φ(c) depending on the class; plates repeat Nd times per document, over M documents, T topics, and C classes.]
Composite Model All Dirichlet priors are assumed symmetric - choose N | ξ ~ Poisson(ξ) - sample θ(d) | α ~ Dir(α) - sample φ(zi) | β ~ Dir(β) - sample φ(ci) | γ ~ Dir(γ) - sample π(ci−1) | δ ~ Dir(δ) - sample zi | θ(d) ~ Multinomial(θ(d)) - sample ci | π(ci−1) ~ Multinomial(π(ci−1)) - sample wi | zi, φ(zi) ~ Multinomial(φ(zi)) if ci = 1 - sample wi | ci, φ(ci) ~ Multinomial(φ(ci)) otherwise
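The composite generative process above can be sketched as forward sampling; the sizes, hyperparameter values, and the choice of starting class are illustrative assumptions. Class 1 plays the semantic role, as in the slide.

```python
import numpy as np

rng = np.random.default_rng(3)
T, C, V = 3, 4, 8                 # topics, HMM classes, vocabulary (illustrative)
alpha, beta, gamma, delta = 1.0, 0.1, 0.1, 0.5

theta = rng.dirichlet(np.full(T, alpha))            # theta(d): per-document topic weights
phi_z = rng.dirichlet(np.full(V, beta), size=T)     # phi(z): topic word distributions
phi_c = rng.dirichlet(np.full(V, gamma), size=C)    # phi(c): class word distributions
pi = rng.dirichlet(np.full(C, delta), size=C)       # pi(c): one transition row per class

N = rng.poisson(12.0)
words, c = [], 0                                    # assumed start class (not in the slides)
for _ in range(N):
    z = rng.choice(T, p=theta)                      # z_i | theta(d)
    c = rng.choice(C, p=pi[c])                      # c_i | c_{i-1} ~ Mult(pi(c_{i-1}))
    if c == 1:                                      # semantic class: word from topic z_i
        w = rng.choice(V, p=phi_z[z])
    else:                                           # syntactic class: word from phi(c_i)
        w = rng.choice(V, p=phi_c[c])
    words.append(int(w))
```

The HMM thus decides, word by word, whether the LDA component or a syntactic class emits the next word.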
The Composite Model: Generative Process
Bayesian Inference The EM algorithm can be applied to the composite model - treating θ, φ(z), φ(c), π(c) as parameters and log P(w | θ, φ(z), φ(c), π(c)) as the likelihood - but there are too many parameters and convergence is too slow - the Dirichlet priors are necessary assumptions! Markov chain Monte Carlo (MCMC) - instead of explicitly representing θ, φ(z), φ(c), π(c), we consider the posterior distribution over the assignments of words to topics or classes, P(z|w) and P(c|w)
Markov Chain Monte Carlo Sampling from the posterior distribution according to a Markov chain - an ergodic (irreducible & aperiodic) Markov chain converges to a unique equilibrium distribution π(x) - sample the parameters according to a Markov chain whose equilibrium distribution π(x) is exactly the posterior distribution. The key task is to construct a suitable transition kernel T(x, x')
Metropolis-Hastings Algorithm Sampling by constructing a reversible Markov chain - reversibility (detailed balance, π(x)T(x,x') = π(x')T(x',x)) guarantees that π(x) is the equilibrium distribution - the Metropolis-Hastings algorithm is similar in spirit to rejection sampling: propose a move, then accept or reject it
Metropolis-Hastings Algorithm (cont.) loop: sample x' from Q(x(t), x'); a = min{1, (π(x') / π(x(t))) * (Q(x', x(t)) / Q(x(t), x'))}; r = U(0,1); if r < a accept, x(t+1) = x'; else reject, x(t+1) = x(t); end. Intuition: a move to a higher-probability state is always accepted (a = 1); a move to a lower-probability state is accepted with probability π(x')/π(x(t)) when Q is symmetric
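The loop above becomes runnable with a concrete target and proposal; here a standard normal stands in for the posterior (an assumption for the sketch), and a symmetric Gaussian random-walk proposal makes the Q-ratio cancel.

```python
import numpy as np

rng = np.random.default_rng(4)

def log_p(x):
    # Unnormalized log target: standard normal, an illustrative
    # stand-in for the posterior of interest.
    return -0.5 * x * x

def metropolis_hastings(n_steps, step=1.0):
    x = 0.0
    samples = []
    for _ in range(n_steps):
        x_new = x + step * rng.standard_normal()  # symmetric proposal: Q-ratio cancels
        a = min(1.0, np.exp(log_p(x_new) - log_p(x)))
        if rng.uniform() < a:
            x = x_new                             # accept
        # else: reject, stay at x
        samples.append(x)
    return np.array(samples)

samples = metropolis_hastings(20000)
```

Only the ratio π(x')/π(x) is needed, so the normalizing constant of the posterior, which is intractable in the composite model, never has to be computed.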
Metropolis-Hastings Algorithm Why it works: the acceptance rule enforces detailed balance by construction, π(x) Q(x,x') a(x,x') = π(x') Q(x',x) a(x',x), so π is the equilibrium distribution. Single-site updating algorithm: for multivariate x, update one component at a time while holding the others fixed
Gibbs Sampling A special case of single-site updating Metropolis-Hastings in which each component's proposal is its full conditional distribution, Q(x_i' | x_{−i}) = p(x_i' | x_{−i}); the acceptance probability is then always 1
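A minimal Gibbs sampler on a toy target makes the always-accept behavior concrete; the bivariate normal with correlation 0.8 is an illustrative choice, exploiting that its full conditionals are univariate normals.

```python
import numpy as np

rng = np.random.default_rng(5)
rho = 0.8   # correlation of the illustrative bivariate standard normal target

x = y = 0.0
xs, ys = [], []
for _ in range(20000):
    # Full conditionals of the bivariate standard normal:
    # x | y ~ N(rho*y, 1-rho^2) and y | x ~ N(rho*x, 1-rho^2).
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))
    xs.append(x)
    ys.append(y)

xs, ys = np.array(xs), np.array(ys)
```

Every draw is kept; there is no accept/reject step, which is exactly what makes Gibbs sampling attractive when full conditionals are available in closed form.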
Gibbs Sampling for the Composite Model θ, φ, and π are all integrated out of the corresponding terms (collapsed Gibbs sampling), so only the topic and class assignments z and c are sampled; the hyperparameters are sampled with a single-site Metropolis-Hastings algorithm
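To show what "integrating out θ and φ" looks like in code, here is a collapsed Gibbs sampler for plain LDA (the composite model adds the class chain on top); the toy corpus, sizes, and sweep count are invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy corpus: documents as lists of word ids (illustrative).
docs = [[0, 0, 1, 2], [2, 3, 3, 4], [0, 1, 4, 4]]
V, T = 5, 2
alpha, beta = 1.0, 0.1

# Random initial topic assignments and the count tables they imply.
z = [[int(rng.integers(T)) for _ in d] for d in docs]
ndt = np.zeros((len(docs), T))   # doc-topic counts
ntw = np.zeros((T, V))           # topic-word counts
nt = np.zeros(T)                 # words per topic
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        t = z[d][i]
        ndt[d, t] += 1; ntw[t, w] += 1; nt[t] += 1

for _ in range(200):             # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]          # remove this word's current assignment
            ndt[d, t] -= 1; ntw[t, w] -= 1; nt[t] -= 1
            # Collapsed conditional with theta and phi integrated out:
            # P(z_i = t | z_-i, w) ∝ (n_dt + alpha) (n_tw + beta) / (n_t + V*beta)
            p = (ndt[d] + alpha) * (ntw[:, w] + beta) / (nt + V * beta)
            t = int(rng.choice(T, p=p / p.sum()))
            z[d][i] = t          # restore counts with the new assignment
            ndt[d, t] += 1; ntw[t, w] += 1; nt[t] += 1
```

Only integer count tables are maintained; θ and φ can be recovered afterwards from the counts plus the Dirichlet priors.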
Experiments Corpora - Brown corpus: 500 documents, 1,137,466 word tokens - TASA corpus: 37,651 documents, 12,190,931 word tokens - NIPS corpus: 1,713 documents, 4,312,614 word tokens - W = 37,202 (Brown + TASA); W = 17,268 (NIPS). Experimental design - one class reserved for sentence start/end markers {., ?, !} - T = 200 & C = 20 (composite); C = 2 (LDA); T = 1 (HMMs) - 4,000 iterations, with 2,000 burn-in and a lag of 100 - 1st-, 2nd-, and 3rd-order Markov chains are considered
Identifying function and content words
Comparative study on the NIPS corpus (T = 100 & C = 50)
Identifying function and content words (NIPS)
Marginal Probabilities Bayesian model comparison - P(w|M) is approximated by the harmonic mean of the likelihoods over the 2,000 post-burn-in iterations - used to evaluate the Bayes factors between models
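The harmonic-mean estimator works on per-iteration likelihoods, which are best handled in log space; the helper name and the toy log-likelihood values below are invented for illustration.

```python
import numpy as np

def log_harmonic_mean(log_liks):
    # log of the harmonic mean of likelihoods, computed stably:
    # log HM = log S - logsumexp(-log_liks), S = number of samples.
    log_liks = np.asarray(log_liks, dtype=float)
    m = (-log_liks).max()
    return float(np.log(len(log_liks)) - (m + np.log(np.exp(-log_liks - m).sum())))

# With equal log-likelihoods the harmonic mean is that value.
approx_equal = log_harmonic_mean([-10.0, -10.0, -10.0])
# With unequal values the estimate is pulled toward the smallest likelihood.
approx_mixed = log_harmonic_mean([-1.0, -2.0])
```

The estimator is known to have high variance (it is dominated by the smallest likelihoods), which is worth keeping in mind when reading the Bayes-factor comparisons.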
Part-of-Speech Tagging Assessed performance on the Brown corpus - one tag set consisted of all Brown tags (297) - the other collapsed Brown's tags into 10 designations - the 20th sample was used, evaluated by the Adjusted Rand Index - compared with distributional clustering (DC) on the 1,000 most frequent words with 19 clusters
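The Adjusted Rand Index used above compares two labelings of the same items, correcting for chance agreement; a self-contained implementation of the standard contingency-table formula (function name and example labelings are illustrative):

```python
import numpy as np
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    # Build the contingency table between the two clusterings.
    a_ids = {l: i for i, l in enumerate(sorted(set(labels_a)))}
    b_ids = {l: i for i, l in enumerate(sorted(set(labels_b)))}
    n = np.zeros((len(a_ids), len(b_ids)), dtype=int)
    for a, b in zip(labels_a, labels_b):
        n[a_ids[a], b_ids[b]] += 1
    # ARI = (Index - ExpectedIndex) / (MaxIndex - ExpectedIndex),
    # with all indices counted over pairs of items.
    sum_ij = sum(comb(int(nij), 2) for nij in n.flat)
    sum_a = sum(comb(int(ni), 2) for ni in n.sum(axis=1))
    sum_b = sum(comb(int(nj), 2) for nj in n.sum(axis=0))
    total = comb(len(labels_a), 2)
    expected = sum_a * sum_b / total
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])   # relabeled but identical partition
indep = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])  # unrelated partition
```

Identical partitions score 1 regardless of label names, while chance-level agreement scores near or below 0, which is why ARI suits unsupervised tag evaluation.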
Document Classification Evaluated with a naïve Bayes classifier - the 500 documents in Brown are classified into 15 groups - the topic vectors produced by LDA and the composite model are used to train the naïve Bayes classifier - 10-fold cross-validation is used to evaluate the 20th sample. Results (baseline accuracy: 0.09) - trained on Brown: LDA (0.51); 1st-order composite model (0.45) - Brown + TASA: LDA (0.54); 1st-order composite model (0.45) - explanation: only about 20% of words are allocated to the semantic component, too few to find correlations!
Summary Bayesian hierarchical models are natural for text modeling. Simultaneously learning syntactic classes and semantic topics is possible through the combination of basic modules. Discovering the syntactic and semantic building blocks forms the basis of more sophisticated representations. Similar ideas can be generalized to other areas
Discussions Gibbs sampling vs. the EM algorithm? Hierarchical models reduce the number of parameters, but what about model complexity? Are equal priors appropriate for Bayesian model comparison? Do the 4 hyperparameters really have any effect? Probabilistic LSI makes no normal-distribution assumption, while probabilistic PCA assumes normality! EM is sensitive to local maxima; why does the Bayesian approach fare better? Is the document classification experiment a good evaluation? Majority vote for tagging?