Integrating Topics and Syntax - Thomas L. Griffiths, Mark Steyvers, David M. Blei, Joshua B. Tenenbaum


Integrating Topics and Syntax - Thomas L. Griffiths, Mark Steyvers, David M. Blei, Joshua B. Tenenbaum
Presented by Han Liu, Department of Computer Science, University of Illinois at Urbana-Champaign (hanliu@ncsa.uiuc.edu), April 12th, 2005

Outline
- Motivations: syntactic vs. semantic modeling
- Formalization: notation and terminology
- Generative models: pLSI; Latent Dirichlet Allocation
- Composite model: HMM + LDA
- Inference: MCMC (Metropolis-Hastings; Gibbs sampling)
- Experiments: performance and evaluation
- Summary: Bayesian hierarchical models
- Discussion

Motivations
Statistical language modeling:
- syntactic dependencies are short-range
- semantic dependencies are long-range
Current models capture only one aspect:
- Hidden Markov Models (HMMs): syntactic modeling
- Latent Dirichlet Allocation (LDA): semantic modeling
- Probabilistic Latent Semantic Indexing (pLSI): semantic modeling
A model that captures both kinds of dependencies should be more useful!

Problem Formalization
- Word: a word is an item from a vocabulary indexed by {1, ..., V}, represented as a unit-basis vector. The vth word is a V-vector w such that the vth element is 1 and all other elements are 0.
- Document: a document is a sequence of N words denoted by w = (w1, w2, ..., wN), where wi is the ith word in the sequence.
- Corpus: a corpus is a collection of M documents, denoted by D = {w1, w2, ..., wM}.
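For concreteness (an illustrative example, not from the slides): with a vocabulary of size V = 5, the third word of the vocabulary is encoded as

\[
w = (0, 0, 1, 0, 0)^{\top}, \qquad \text{i.e. } w^{v} = 1 \text{ exactly when } v = 3 .
\]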

Latent Semantic Structure
[Slide diagram: a latent structure defines a distribution over words; inference recovers the latent structure from the observed words, which is then used for prediction.]

Probabilistic Generative Models
- Probabilistic Latent Semantic Indexing (pLSI): Hofmann (1999), ACM SIGIR - probabilistic semantic model
- Latent Dirichlet Allocation (LDA): Blei, Ng, & Jordan (2003), Journal of Machine Learning Research - probabilistic semantic model
- Hidden Markov Models (HMMs): Baum & Petrie (1966), Annals of Mathematical Statistics - probabilistic syntactic model

Dirichlet vs. Multinomial Distributions
- Dirichlet distribution (conjugate prior)
- Multinomial distribution
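The densities on this slide did not survive the transcript; in standard notation they are

\[
\mathrm{Dir}(\theta \mid \alpha) = \frac{\Gamma\!\left(\sum_{k=1}^{K}\alpha_k\right)}{\prod_{k=1}^{K}\Gamma(\alpha_k)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1},
\qquad
\mathrm{Mult}(n \mid \theta, N) = \frac{N!}{\prod_{k} n_k!} \prod_{k=1}^{K} \theta_k^{n_k},
\]

and conjugacy means a Dirichlet prior updated with multinomial counts n yields the Dirichlet posterior Dir(theta | alpha + n), which is what makes the collapsed sampling later in the talk tractable.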

Probabilistic LSI: Graphical Model
[Plate diagram: each of the D documents d has a distribution over topics; the topic z is a latent variable; each of the Nd words w is generated from its topic.]

Probabilistic LSI: Parameter Estimation
- The log-likelihood of probabilistic LSI
- EM algorithm: E-step; M-step
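The slide's equations were lost in transcription; in the usual aspect-model notation, the log-likelihood and EM updates take the form

\[
\mathcal{L} = \sum_{d}\sum_{w} n(d, w) \log \sum_{z} P(z \mid d)\, P(w \mid z),
\]

with E-step

\[
P(z \mid d, w) = \frac{P(z \mid d)\, P(w \mid z)}{\sum_{z'} P(z' \mid d)\, P(w \mid z')},
\]

and M-step

\[
P(w \mid z) \propto \sum_{d} n(d, w)\, P(z \mid d, w), \qquad
P(z \mid d) \propto \sum_{w} n(d, w)\, P(z \mid d, w).
\]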

LDA: Graphical Model
[Plate diagram: for each of the D documents, sample a distribution over topics theta from the Dirichlet prior alpha; for each of its Nd words, sample a topic z from theta and then sample the word w from that topic's word distribution phi, itself drawn from the Dirichlet prior beta (T topics in total).]

Latent Dirichlet Allocation
A variant of LDA developed by Griffiths (2003), with the following generative process (see the sketch below):
- choose N | xi ~ Poisson(xi)
- sample theta | alpha ~ Dir(alpha)
- sample phi | beta ~ Dir(beta)
- sample z | theta ~ Multinomial(theta)
- sample w | z, phi(z) ~ Multinomial(phi(z))
Model inference:
- all Dirichlet priors are assumed to be symmetric
- instead of variational inference with empirical Bayes parameter estimation, Gibbs sampling is adopted
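A minimal Python sketch of this generative process; T, V, alpha, beta, and xi below are illustrative placeholders, not the settings used in the paper.

```python
import numpy as np

# Minimal sketch of the LDA generative process described above.
rng = np.random.default_rng(0)
T, V, alpha, beta, xi = 4, 1000, 0.1, 0.01, 50   # illustrative values only

phi = rng.dirichlet([beta] * V, size=T)   # phi(z): one word distribution per topic

def generate_document():
    N = rng.poisson(xi)                   # N | xi ~ Poisson(xi)
    theta = rng.dirichlet([alpha] * T)    # theta | alpha ~ Dir(alpha)
    words = []
    for _ in range(N):
        z = rng.choice(T, p=theta)        # z | theta ~ Multinomial(theta)
        w = rng.choice(V, p=phi[z])       # w | z, phi(z) ~ Multinomial(phi(z))
        words.append(w)
    return words

doc = generate_document()
```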

The Composite Model
[Slide diagram, an intuitive representation: a document-level topic distribution theta feeds topic assignments z1...z4 for words w1...w4, while a Markov chain of class states s1...s4 runs alongside. The semantic state generates words from LDA; the syntactic states generate words from the HMM.]

Composite Model: Graphical Model
[Plate diagram: per-document topic distribution theta (Dirichlet prior alpha), topic assignments z, class assignments c with transition distributions pi (Dirichlet prior delta), topic word distributions phi(z) (prior beta, T of them), and class word distributions phi(c) (prior gamma, C of them), over M documents of Nd words each.]

Composite Model
All Dirichlet priors are assumed to be symmetric; the generative process (sketched in code below) is:
- choose N | xi ~ Poisson(xi)
- sample theta(d) | alpha ~ Dir(alpha)
- sample phi(zi) | beta ~ Dir(beta)
- sample phi(ci) | gamma ~ Dir(gamma)
- sample pi(ci-1) | delta ~ Dir(delta)
- sample zi | theta(d) ~ Multinomial(theta(d))
- sample ci | pi(ci-1) ~ Multinomial(pi(ci-1))
- sample wi | zi, phi(zi) ~ Multinomial(phi(zi)) if ci = 1
- sample wi | ci, phi(ci) ~ Multinomial(phi(ci)) otherwise
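A hedged Python sketch of the composite model's per-word choice between the semantic (LDA) component and the syntactic (HMM) component. The sizes, hyperparameters, and starting class are illustrative, not the paper's settings, and the real model additionally reserves a class for sentence start/end markers (see the experiments slide).

```python
import numpy as np

# Hypothetical sketch of the composite model's generative process for one
# document. The class sequence c_1..c_N follows a first-order HMM over C
# syntactic classes; class 1 plays the role of the semantic class whose words
# come from the LDA topic component.
rng = np.random.default_rng(0)
T, C, V = 4, 3, 1000                            # illustrative sizes
alpha, beta, gamma, delta, xi = 0.1, 0.01, 0.01, 0.1, 50

phi_topic = rng.dirichlet([beta] * V, size=T)   # phi(z): word dists of the T topics
phi_class = rng.dirichlet([gamma] * V, size=C)  # phi(c): word dists of the C classes
pi = rng.dirichlet([delta] * C, size=C)         # pi(c_{i-1}): HMM transition rows

def generate_document():
    N = rng.poisson(xi)                         # N | xi ~ Poisson(xi)
    theta = rng.dirichlet([alpha] * T)          # theta(d) | alpha ~ Dir(alpha)
    words, c = [], 0                            # start in an arbitrary class (illustrative)
    for _ in range(N):
        c = rng.choice(C, p=pi[c])              # c_i | c_{i-1} ~ Multinomial(pi(c_{i-1}))
        if c == 1:                              # semantic class: word from a topic
            z = rng.choice(T, p=theta)          # z_i | theta(d) ~ Multinomial(theta(d))
            w = rng.choice(V, p=phi_topic[z])   # w_i | z_i ~ Multinomial(phi(z_i))
        else:                                   # syntactic class: word from class dist
            w = rng.choice(V, p=phi_class[c])   # w_i | c_i ~ Multinomial(phi(c_i))
        words.append(w)
    return words

doc = generate_document()
```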

The Composite Model: Generative Process

Bayesian Inference
The EM algorithm could be applied to the composite model:
- treating theta, phi(z), phi(c), pi(c) as parameters
- with log P(w | theta, phi(z), phi(c), pi(c)) as the likelihood
- but there are too many parameters and convergence is too slow
- and the Dirichlet priors are necessary assumptions!
Markov chain Monte Carlo (MCMC):
- instead of explicitly representing theta, phi(z), phi(c), pi(c), we consider the posterior distribution over the assignments of words to topics or classes, P(z | w) and P(c | w)

Markov Chain Monte Carlo
Sampling from the posterior distribution via a Markov chain:
- an ergodic (irreducible and aperiodic) Markov chain converges to a unique equilibrium distribution p(x)
- we sample the parameters according to a Markov chain whose equilibrium distribution p(x) is exactly the posterior distribution
The key task is to construct a suitable transition kernel T(x, x').
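In symbols (a standard condition, reconstructed here since the slide's formula did not survive): the target posterior must be invariant under the transition kernel,

\[
p(x') = \sum_{x} p(x)\, T(x, x') \quad \text{for all } x'.
\]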

Metropolis-Hastings Algorithm
Sampling by constructing a reversible Markov chain:
- a reversible Markov chain guarantees that p(x) is the equilibrium distribution
- the (simultaneous) Metropolis-Hastings algorithm is similar in spirit to rejection sampling
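Reversibility is the detailed-balance condition; summing it over x recovers the invariance condition above:

\[
p(x)\, T(x, x') = p(x')\, T(x', x)
\;\Rightarrow\;
\sum_x p(x)\, T(x, x') = p(x') \sum_x T(x', x) = p(x').
\]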

Metropolis-Hastings Algorithm (cont.)
loop
    sample x' from Q(x(t), .);
    a = min{1, (p(x') / p(x(t))) * (Q(x', x(t)) / Q(x(t), x'))};
    r = U(0, 1);
    if r < a then accept: x(t+1) = x';
    else reject: x(t+1) = x(t);
end
[Slide sketch, Metropolis-Hastings intuition: a proposal x* that increases the density is accepted with r = 1.0; otherwise it is accepted with r = p(x*)/p(x(t)).]
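A runnable Python version of the loop above, using a symmetric Gaussian random-walk proposal so the Q ratio cancels; the unnormalized bimodal target p_unnorm is purely illustrative and unrelated to the composite model.

```python
import numpy as np

rng = np.random.default_rng(0)

def p_unnorm(x):
    # Illustrative unnormalized target density (mixture of two Gaussian bumps).
    return np.exp(-0.5 * x ** 2) + 0.5 * np.exp(-0.5 * (x - 3.0) ** 2)

def metropolis_hastings(n_samples=5000, step=1.0, x0=0.0):
    x, samples = x0, []
    for _ in range(n_samples):
        x_new = x + step * rng.normal()               # propose x' from Q(x(t), .)
        a = min(1.0, p_unnorm(x_new) / p_unnorm(x))   # acceptance probability
        if rng.uniform() < a:                         # accept with probability a
            x = x_new                                 # x(t+1) = x'
        samples.append(x)                             # otherwise x(t+1) = x(t)
    return np.array(samples)

samples = metropolis_hastings()
```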

Metropolis-Hastings Algorithm
- Why it works
- Single-site updating algorithm
[Slide equations not preserved in the transcript.]

Gibbs Sampling
A special case of the single-site updating Metropolis-Hastings algorithm, in which each variable is resampled from its full conditional distribution given all the others.
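To see why no accept/reject step is needed (a standard one-line check, not on the original slide): when the proposal for coordinate i is its full conditional, Q(x, x') = p(x'_i | x_{-i}) with x'_{-i} = x_{-i}, the Hastings ratio is

\[
\frac{p(x')\, Q(x', x)}{p(x)\, Q(x, x')}
= \frac{p(x_i' \mid x_{-i})\, p(x_{-i}) \cdot p(x_i \mid x_{-i})}
       {p(x_i \mid x_{-i})\, p(x_{-i}) \cdot p(x_i' \mid x_{-i})} = 1,
\]

so every proposal is accepted.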

Gibbs Sampling for the Composite Model
theta, phi, and pi are all integrated out of the corresponding terms; the hyperparameters are sampled with a single-site Metropolis-Hastings algorithm.
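The slide's update equations were not preserved; for reference, in the LDA-only case (Griffiths & Steyvers) the collapsed full conditional for a topic assignment has the familiar form

\[
P(z_i = t \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\;
\frac{n^{(w_i)}_{-i,t} + \beta}{n^{(\cdot)}_{-i,t} + W\beta} \cdot
\frac{n^{(d_i)}_{-i,t} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha},
\]

where the counts n exclude position i. The composite model's conditionals additionally condition on the class assignments c: a word contributes to the topic counts only when its class is the semantic one.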

Experiments
Corpora:
- Brown corpus: 500 documents, 1,137,466 word tokens
- TASA corpus: 37,651 documents, 12,190,931 word tokens
- NIPS corpus: 1,713 documents, 4,312,614 word tokens
- W = 37,202 (Brown + TASA); W = 17,268 (NIPS)
Experimental design:
- one class reserved for sentence start/end markers {., ?, !}
- T = 200 and C = 20 (composite); C = 2 (LDA); T = 1 (HMM)
- 4,000 iterations, with 2,000 burn-in and a lag of 100
- 1st-, 2nd-, and 3rd-order Markov chains are considered

Identifying function and content words

Comparative study on NIPS corpus (T = 100 and C = 50)

Identifying function and content words (NIPS)

Marginal Probabilities
Bayesian model comparison:
- P(w | M) is calculated using the harmonic mean of the likelihoods over the 2,000 iterations
- used to evaluate the Bayes factors
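In symbols (the slide's formula was not preserved; this is the standard harmonic-mean estimator the text describes), with S retained samples z^(1), ..., z^(S):

\[
P(\mathbf{w} \mid M) \;\approx\; \left( \frac{1}{S} \sum_{s=1}^{S} \frac{1}{P(\mathbf{w} \mid \mathbf{z}^{(s)}, M)} \right)^{-1}.
\]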

Part-of-Speech Tagging
Performance assessed on the Brown corpus:
- one tag set consisted of all Brown tags (297)
- the other collapsed the Brown tags into 10 designations
- the 20th sample was used, evaluated by the Adjusted Rand Index
- compared with DC on the 1,000 most frequent words, using 19 clusters

Document Classification
Evaluated with a Naïve Bayes classifier:
- 500 documents in Brown are classified into 15 groups
- the topic vectors produced by LDA and by the composite model are used to train the Naïve Bayes classifier
- 10-fold cross-validation is used to evaluate the 20th sample
Results (baseline accuracy: 0.09):
- trained on Brown: LDA (0.51); 1st-order composite model (0.45)
- trained on Brown + TASA: LDA (0.54); 1st-order composite model (0.45)
- explanation: only about 20% of words are allocated to the semantic component, too few to find correlations!

Summary
- Bayesian hierarchical models are natural for text modeling
- Simultaneously learning syntactic classes and semantic topics is possible by combining basic modules
- The discovered syntactic and semantic building blocks form the basis of more sophisticated representations
- Similar ideas could be generalized to other areas

Discussions
- Gibbs sampling vs. the EM algorithm?
- Hierarchical models reduce the number of parameters, but what about model complexity?
- Equal priors for Bayesian model comparison?
- Is there really any effect of the 4 hyperparameters?
- Probabilistic LSI makes no normality assumption, while probabilistic PCA assumes normality!
- EM is sensitive to local maxima; why does the Bayesian approach get through?
- Is the document classification experiment a good evaluation?
- Majority vote for tagging?