An Introduction to Latent Dirichlet Allocation (LDA)

LDA LDA is a generative probabilistic model for collections of discrete data such as text corpora. It is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. It has natural advantages over the unigram model and the probabilistic LSI model. In the context of text modeling, the topic probabilities provide an explicit representation of a document.

History (1) – text processing IR represents text as a vector of real numbers (Baeza-Yates and Ribeiro-Neto, 1999), e.g., tf-idf (Salton and McGill, 1983). Shortcomings of tf-idf: (1) the representation is lengthy, and (2) it reveals little inter- and intra-document statistical structure. LSI addresses this by dimensionality reduction (Deerwester et al., 1990). Advantages: it achieves significant compression in large collections and captures synonymy and polysemy. Generative probabilistic models were later proposed to study the ability of LSI (Papadimitriou et al., 1998). But why LSI at all? We can model the data directly using maximum likelihood or Bayesian methods. Furthermore, Deerwester et al. argue that the derived features of LSI, which are linear combinations of the original tf-idf features, can capture some aspects of basic linguistic notions such as synonymy and polysemy.

History (2) – text processing Probabilistic LSI, also known as the aspect model, was a milestone (Hofmann, 1999). With topic-word distributions P(wi|θj), a document d={w1, …, wN}, and topics θ={θ1, …, θk}, each word is generated from a single topic θj, and document d is represented by a set of mixing proportions over the mixture components θ, that is, a list of numbers (the mixing proportions for topics). Disadvantages: there is no probabilistic model at the document level; the number of parameters grows linearly with the size of the corpus; and it is not clear how to assign probability to a document outside the collection (pLSI makes no assumptions about how the mixture weights θ are generated, making it difficult to test the generalizability of the model to new documents).

Notation D={d1, …, dM}, d={w1, …, wN}, and θ={θ1, …, θk}; equivalently D={w1, …, wM}. (Bold variables denote vectors.) Suppose we have V distinct words in the whole data set. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words.

LDA The basic idea: documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. Given k topics, LDA generates each document d as follows: (1) choose the document length N ~ Poisson(ξ); (2) choose topic proportions θ ~ Dir(α); (3) for each of the N words wn, choose a topic zn ~ Multinomial(θ) and then choose the word wn from p(wn|zn, β), a multinomial probability conditioned on the chosen topic. A small simulation of this process is sketched below.
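
A minimal numpy sketch of this generative process, assuming illustrative values for the vocabulary size, topic count, and hyperparameters that are not taken from the slides:

import numpy as np

rng = np.random.default_rng(0)
V, k = 1000, 10                                   # vocabulary size and number of topics (illustrative)
alpha = np.full(k, 0.1)                           # Dirichlet prior over per-document topic proportions
beta = rng.dirichlet(np.full(V, 0.01), size=k)    # each row is one topic's distribution over words

def generate_document(xi=80):
    N = rng.poisson(xi)                           # document length N ~ Poisson(xi)
    theta = rng.dirichlet(alpha)                  # topic proportions theta ~ Dir(alpha)
    z = rng.choice(k, size=N, p=theta)            # a topic assignment z_n for each word
    w = np.array([rng.choice(V, p=beta[zi]) for zi in z])  # each word drawn from its topic's distribution
    return w, z, theta

words, topics, theta = generate_document()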

Dirichlet Random Variables θ A k-dimensional Dirichlet random variable θ takes values in the (k-1)-simplex (a k-vector θ lies in the (k-1)-simplex if θi ≥ 0 and Σi θi = 1), and its probability density is p(θ|α) = [Γ(Σ_{i=1}^k αi) / Π_{i=1}^k Γ(αi)] θ1^{α1−1} ··· θk^{αk−1}, where α is a k-vector parameter with αi > 0 and Γ(x) is the Gamma function. The Dirichlet is a convenient distribution on the simplex: it is in the exponential family, has finite-dimensional sufficient statistics, and is conjugate to the multinomial distribution. These properties facilitate the development of inference and parameter estimation algorithms for LDA.
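
A short illustration, assuming scipy is available, of evaluating this density and drawing samples; the parameter values are arbitrary:

import numpy as np
from scipy.stats import dirichlet

alpha = np.array([6.0, 2.0, 2.0])                      # k = 3, so theta lives on the 2-simplex
theta = np.array([0.5, 0.3, 0.2])
print(dirichlet.pdf(theta, alpha))                     # density p(theta | alpha)
samples = dirichlet.rvs(alpha, size=5, random_state=0)
print(samples.sum(axis=1))                             # every sample sums to 1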

Multinomial Distribution Each trial ends in exactly one of k categories; there are n independent trials; the probability that a trial results in category i is pi; Yi is the number of trials resulting in category i; p1 + … + pk = 1 and Y1 + … + Yk = n.

Multinomial Distribution P(Y1 = y1, …, Yk = yk) = [n! / (y1! ··· yk!)] p1^{y1} ··· pk^{yk}, for nonnegative integer counts with y1 + … + yk = n.
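
A small example, again assuming numpy/scipy, of drawing multinomial counts and evaluating this pmf; the numbers are arbitrary:

import numpy as np
from scipy.stats import multinomial

n, p = 10, [0.5, 0.3, 0.2]
counts = np.random.default_rng(0).multinomial(n, p)    # counts over the k = 3 categories, summing to n
print(counts, counts.sum())
print(multinomial.pmf(counts, n=n, p=p))               # probability of observing exactly these counts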

Joint Distribution Given the parameters α and β, the joint distribution of a topic mixture θ, a set of N topics z, and a set of N words w is given by p(θ, z, w | α, β) = p(θ|α) Π_{n=1}^N p(zn|θ) p(wn|zn, β), where p(zn|θ) is simply θi for the unique i such that zn^i = 1. Integrating over θ and summing over z, we obtain the marginal distribution of a document: p(w|α, β) = ∫ p(θ|α) ( Π_{n=1}^N Σ_{zn} p(zn|θ) p(wn|zn, β) ) dθ. (a)

Joint Distribution (cont.) Finally, taking the product of the marginal probabilities of single documents, we obtain the probability of a corpus: p(D|α, β) = Π_{d=1}^M ∫ p(θd|α) ( Π_{n=1}^{Nd} Σ_{zdn} p(zdn|θd) p(wdn|zdn, β) ) dθd. Similar models are referred to as hierarchical models (Gelman et al., 1995), or more precisely as conditionally independent hierarchical models (Kass and Steffey, 1989).

Relationships with Other Latent Models Problems with pLSI: (1) p(z|d) is defined only for documents in the training set, so the model cannot assign probability to an unseen document; and (2) the number of parameters, kV + kM, grows linearly with the number of documents M, which leads to overfitting.

Graphical Interpretation The probability density of the Dirichlet distribution for k=3, shown for various parameter vectors α. Clockwise from top left: α=(6, 2, 2), (3, 7, 5), (6, 2, 6), (2, 3, 4).

Graphical Interpretation (cont.) The Dirichlet prior on the topic-word distributions can be interpreted as forces on the topic locations, with a larger Dirichlet hyperparameter pulling the topic locations away from the corners of the word simplex.

Matrix Interpretation In the topic model, the word-document co-occurrence matrix is split into two parts: a topic matrix Φ and a document matrix Θ. Note that the diagonal matrix D in LSA can be absorbed in the matrix U or V, making the similarity between the two representations even clearer.
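
A small numpy sketch of this factorization; the topic matrix Φ and the document matrix Θ below are random stand-ins used only to show the shapes and the column constraints involved:

import numpy as np

rng = np.random.default_rng(1)
V, k, M = 50, 5, 20
Phi = rng.dirichlet(np.ones(V), size=k).T       # V x k: each column is a topic's distribution over words
Theta = rng.dirichlet(np.ones(k), size=M).T     # k x M: each column is a document's mixture over topics
P = Phi @ Theta                                 # V x M: reconstructed p(word | document)
print(P.shape, np.allclose(P.sum(axis=0), 1.0)) # each column is again a proper distribution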

Inference and Parameter Estimation The key inferential problem is that of computing the posterior distribution of the hidden variables given a document: p(θ, z | w, α, β) = p(θ, z, w | α, β) / p(w | α, β). Unfortunately, this distribution is intractable to compute in general: the normalization constant requires marginalizing over the hidden variables, i.e., writing Equation (a) in terms of the model parameters: p(w|α, β) = [Γ(Σi αi) / Πi Γ(αi)] ∫ ( Π_{i=1}^k θi^{αi−1} ) ( Π_{n=1}^N Σ_{i=1}^k Π_{j=1}^V (θi βij)^{wn^j} ) dθ.

Inference and Parameter Estimation (cont.) This function is intractable due to the coupling between θ and β in the summation over latent topics (Dickey, 1983). Instead of exact inference, we can use approximate inference algorithms such as the Laplace approximation, variational approximation, and Markov chain Monte Carlo (Jordan, 1999).

Variational Inference We introduce a simple convexity-based variational algorithm for inference in LDA. The basic idea is to use Jensen’s inequality to obtain an adjustable lower bound on the log likelihood (Jordan, 1999). A simple way to obtain a tractable family of lower bounds is to consider simple modifications of the original graphical model in which some of the edges and nodes are removed.

Variational Inference (cont.) Hence, by dropping the edges between θ, z, and w, as well as the w nodes, and endowing the resulting simplified graphical model with free variational parameters, we obtain a family of distributions on the latent variables: q(θ, z | γ, Φ) = q(θ|γ) Π_{n=1}^N q(zn|Φn), where the Dirichlet parameter γ and the multinomial parameters (Φ1, …, ΦN) are the free variational parameters.

How to determine the parameters We can set up an optimization problem to determine the values of the variational parameters γ and Φ: minimize the Kullback-Leibler (KL) divergence between the variational distribution and the true posterior p(θ,z|w,α,β), (γ*, Φ*) = arg min_{(γ,Φ)} D( q(θ, z | γ, Φ) || p(θ, z | w, α, β) ). (5) This minimization can be achieved by an iterative fixed-point method.

Variational Inference We now discuss how to set the parameters γ and Φ via an optimization procedure. Following Jordan et al. (1999), we obtain a lower bound on the log likelihood of a document using Jensen’s inequality: log p(w|α, β) = log ∫ Σ_z p(θ, z, w | α, β) dθ ≥ Eq[log p(θ, z, w | α, β)] − Eq[log q(θ, z)].

Variational Inference (cont.) Jensen’s inequality thus provides a lower bound on the log likelihood for an arbitrary variational distribution q(θ,z|γ,Φ). It can easily be verified that the difference between the left-hand side and the right-hand side of the above equation is the KL divergence between the variational posterior probability and the true posterior probability.

Variational Inference (cont.) That is, letting L(γ, Φ; α, β) denote the right-hand side of the above equation, we have: log p(w|α, β) = L(γ, Φ; α, β) + D( q(θ, z | γ, Φ) || p(θ, z | w, α, β) ). This means that maximizing the lower bound L w.r.t. γ and Φ is equivalent to minimizing the KL divergence between the variational posterior probability and the true posterior probability, i.e., the optimization problem in Equation (5).

Variational Inference (cont.) We can expand the lower bound by using the factorizations of p and q: L(γ, Φ; α, β) = Eq[log p(θ|α)] + Eq[log p(z|θ)] + Eq[log p(w|z, β)] − Eq[log q(θ)] − Eq[log q(z)]. (15) Each of the five terms can then be written out in terms of the model parameters (α, β) and the variational parameters (γ, Φ).

Entropy The last two terms of Eq. (15), −Eq[log q(θ)] and −Eq[log q(z)], are the entropies of the variational Dirichlet and of the variational multinomials; in particular, −Eq[log q(z)] = −Σ_{n=1}^N Σ_{i=1}^k Φni log Φni.

Variational Multinomial We first maximize Eq. (15) w.r.t. Φni, the probability that the n-th word is generated by latent topic i. We form the Lagrangian by isolating the terms that contain Φni and adding the appropriate Lagrange multipliers. Let βiv be p(wn^v = 1 | z^i = 1) for the appropriate v (recall that each wn is a vector of size V with exactly one component equal to one, so we can select the unique v such that wn^v = 1): L_[Φni] = Φni (ψ(γi) − ψ(Σ_{j=1}^k γj)) + Φni log βiv − Φni log Φni + λn (Σ_{j=1}^k Φnj − 1).

Variational Multinomial (cont.) Taking the derivative w.r.t. Φni, we obtain: ∂L/∂Φni = ψ(γi) − ψ(Σ_{j=1}^k γj) + log βiv − log Φni − 1 + λn. Setting this to zero yields the maximizing value of the variational parameter Φni: Φni ∝ βiv exp( ψ(γi) − ψ(Σ_{j=1}^k γj) ).

Variational Dirichlet Next we maximize Eq. (15) w.r.t. γi, the i-th component of the posterior Dirichlet parameter. The terms containing γ are: L_[γ] = Σ_{i=1}^k (αi − 1)(ψ(γi) − ψ(Σj γj)) + Σ_{n=1}^N Σ_{i=1}^k Φni (ψ(γi) − ψ(Σj γj)) − log Γ(Σ_{j=1}^k γj) + Σ_{i=1}^k log Γ(γi) − Σ_{i=1}^k (γi − 1)(ψ(γi) − ψ(Σj γj)). Simplifying, L_[γ] = Σ_{i=1}^k (ψ(γi) − ψ(Σ_{j=1}^k γj)) (αi + Σ_{n=1}^N Φni − γi) − log Γ(Σ_{j=1}^k γj) + Σ_{i=1}^k log Γ(γi).

Variational Dirichlet (cont.) Taking the derivative w.r.t. γi: ∂L/∂γi = ψ'(γi) (αi + Σ_{n=1}^N Φni − γi) − ψ'(Σ_{j=1}^k γj) Σ_{j=1}^k (αj + Σ_{n=1}^N Φnj − γj). Setting this equation to zero yields a maximum at: γi = αi + Σ_{n=1}^N Φni.

Solve the Optimization Problem Differentiating the KL divergence and setting the derivatives equal to zero, we obtain the following update equations: Φni ∝ βiwn exp( Eq[log θi | γ] ) (6) and γi = αi + Σ_{n=1}^N Φni (7), where the expectation in the multinomial update can be computed as follows: Eq[log θi | γ] = ψ(γi) − ψ(Σ_{j=1}^k γj) (8), where ψ is the first derivative of the log Γ function, which is computable via Taylor approximations (Abramowitz and Stegun, 1970). It is important to note that the variational distribution is actually a conditional distribution, varying as a function of w. This occurs because the optimization problem in Eq. (5) is conducted for fixed w, and thus yields optimizing parameters (γ(w), Φ(w)) that are functions of w. We can write the resulting variational distribution as q(θ, z | γ(w), Φ(w)), making the dependence on w explicit. Thus the variational distribution can be viewed as an approximation to the posterior distribution p(θ, z | w, α, β). A code sketch of this fixed-point iteration follows.
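
A minimal sketch, assuming numpy/scipy and a known topic-word matrix beta (rows summing to one), of the fixed-point iteration in equations (6)-(8) for a single document; the initialization follows the paper, while the variable names and tolerance are my own choices:

import numpy as np
from scipy.special import digamma

def variational_inference(words, alpha, beta, n_iter=100, tol=1e-6):
    # words: word indices of one document; alpha: shape (k,); beta: shape (k, V)
    k, N = len(alpha), len(words)
    phi = np.full((N, k), 1.0 / k)                    # initialize phi_ni = 1/k
    gamma = alpha + N / k                             # initialize gamma_i = alpha_i + N/k
    for _ in range(n_iter):
        e_log_theta = digamma(gamma) - digamma(gamma.sum())  # Eq. (8)
        phi = beta[:, words].T * np.exp(e_log_theta)         # Eq. (6), before normalization
        phi /= phi.sum(axis=1, keepdims=True)                # normalize each phi_n over topics
        new_gamma = alpha + phi.sum(axis=0)                  # Eq. (7)
        if np.abs(new_gamma - gamma).max() < tol:
            return new_gamma, phi
        gamma = new_gamma
    return gamma, phi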

Computing E[log(θi|α)] Recall that a distribution is in the exponential family if it can be written in the form p(x|η) = h(x) exp{ ηᵀ T(x) − A(η) }, where η is the natural parameter, T(x) is the sufficient statistic, and A(η) is the log of the normalization factor. We can write the Dirichlet in this form by exponentiating the log of its density: p(θ|α) = exp{ Σ_{i=1}^k (αi − 1) log θi + log Γ(Σ_{j=1}^k αj) − Σ_{j=1}^k log Γ(αj) }.

Computing E[log(θi|α)] (cont.) From this form we see that the natural parameter of the Dirichlet is ηi = αi − 1 and the sufficient statistic is T(θi) = log θi. Moreover, using the general fact that the derivative of the log normalization factor w.r.t. the natural parameter equals the expectation of the sufficient statistic, we obtain: E[log θi | α] = ψ(αi) − ψ(Σ_{j=1}^k αj), where ψ is the digamma function, the first derivative of the log Gamma function.
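
A quick numerical check of this identity, assuming numpy/scipy; the parameter values and sample size are arbitrary:

import numpy as np
from scipy.special import digamma

alpha = np.array([2.0, 3.0, 4.0])
samples = np.random.default_rng(0).dirichlet(alpha, size=200_000)
print(np.log(samples).mean(axis=0))              # Monte Carlo estimate of E[log theta_i]
print(digamma(alpha) - digamma(alpha.sum()))     # closed form: psi(alpha_i) - psi(sum_j alpha_j)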

Variational Inference Algorithm Each iteration of the variational inference procedure requires O((N+1)k) operations. Empirically, the number of iterations required for a single document is on the order of the number of words in the document. Thus, the total number of operations is roughly on the order of N²k.

Parameter Estimation We can use an empirical Bayes method for parameter estimation. In particular, we wish to find parameters α and β that maximize the marginal log likelihood: ℓ(α, β) = Σ_{d=1}^M log p(wd | α, β). The quantity p(w|α, β) cannot be computed tractably, but the variational inference described above provides a tractable lower bound on it. Thus, we can find approximate empirical Bayes estimates for the LDA model via an alternating variational EM procedure that maximizes the lower bound w.r.t. the variational parameters γ and Φ, and then, for fixed values of the variational parameters, maximizes the lower bound w.r.t. the model parameters α and β.

Variational EM (E-step) For each document d, find the optimizing values of the variational parameters {γd*, Φd*}, as described in the previous section. (M-step) Maximize the resulting lower bound on the log likelihood w.r.t. the model parameters α and β. This corresponds to finding maximum likelihood estimates with expected sufficient statistics for each document under the approximate posterior computed in the E-step. The update for the conditional multinomial parameter β can be written out analytically: βij ∝ Σ_{d=1}^M Σ_{n=1}^{Nd} Φ*dni wdn^j. (9) The update for α can be implemented using an efficient Newton-Raphson method. These two steps are repeated until the lower bound converges. A compact code sketch of this loop follows.
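
A compact sketch of this alternating procedure, reusing the variational_inference function from the earlier sketch; for brevity the stopping rule is a fixed iteration count and α is held fixed (a full implementation would update it by Newton-Raphson, as discussed below):

import numpy as np

def m_step_beta(docs, phis, k, V, eps=1e-12):
    # Eq. (9): beta_ij proportional to the sum over documents and positions of phi_dni * w_dn^j
    beta = np.full((k, V), eps)
    for words, phi in zip(docs, phis):              # words: (N_d,) word indices; phi: (N_d, k)
        np.add.at(beta.T, words, phi)               # accumulate phi_dni into column j = w_dn
    return beta / beta.sum(axis=1, keepdims=True)

def variational_em(docs, k, V, alpha0=0.1, n_em_iter=20):
    rng = np.random.default_rng(0)
    alpha = np.full(k, alpha0)
    beta = rng.dirichlet(np.ones(V), size=k)        # random initial topics
    for _ in range(n_em_iter):
        results = [variational_inference(words, alpha, beta) for words in docs]  # E-step
        phis = [phi for _, phi in results]
        beta = m_step_beta(docs, phis, k, V)        # M-step for beta; alpha kept fixed in this sketch
    return alpha, beta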

Parameter Estimation We now consider how to obtain empirical Bayes estimates of the model parameters α and β. We solve this problem by using the variational lower bound as a surrogate for the marginal log likelihood, with the variational parameters Φ and γ fixed to the values found by variational inference. We then obtain empirical Bayes estimates by maximizing this lower bound w.r.t. the model parameters.

Parameter Estimation (cont.) Recall that our approach for finding the empirical Bayes estimates is based on a variational EM procedure. In the variational E-step, we maximize the bound w.r.t. the variational parameters γ and Φ. In the M-step, described in this section, we maximize the bound w.r.t. the model parameters α and β. The overall procedure can thus be viewed as coordinate ascent in L.

Conditional Multinomials To maximize w.r.t. β, we isolate the terms that depend on β and add Lagrange multipliers: L_[β] = Σ_{d=1}^M Σ_{n=1}^{Nd} Σ_{i=1}^k Σ_{j=1}^V Φdni wdn^j log βij + Σ_{i=1}^k λi (Σ_{j=1}^V βij − 1). Taking the derivative w.r.t. βij and setting it to zero, we obtain: βij ∝ Σ_{d=1}^M Σ_{n=1}^{Nd} Φdni wdn^j.

Dirichlet First, the terms of the lower bound that contain α are: L_[α] = Σ_{d=1}^M ( log Γ(Σ_{j=1}^k αj) − Σ_{i=1}^k log Γ(αi) + Σ_{i=1}^k (αi − 1)(ψ(γdi) − ψ(Σ_{j=1}^k γdj)) ). Taking the derivative w.r.t. αi, we obtain: ∂L/∂αi = M ( ψ(Σ_{j=1}^k αj) − ψ(αi) ) + Σ_{d=1}^M ( ψ(γdi) − ψ(Σ_{j=1}^k γdj) ). This derivative depends on αj for j ≠ i, so we must use an iterative method to find the maximizing α. In particular, the Hessian has the special form H = diag(h) + 1z1ᵀ (the form given in equation (10)), which allows the Newton-Raphson step to be computed in linear time.
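
A rough sketch, assuming numpy/scipy, of a Newton-Raphson update for α built from the gradient above; for simplicity it solves the full k-by-k system instead of exploiting the diag(h) + 1z1ᵀ structure, and uses step-halving to keep α positive:

import numpy as np
from scipy.special import digamma, polygamma

def update_alpha(alpha, gammas, n_iter=50, tol=1e-6):
    # alpha: shape (k,); gammas: shape (M, k), the variational Dirichlet parameters from the E-step
    M = gammas.shape[0]
    ss = (digamma(gammas) - digamma(gammas.sum(axis=1, keepdims=True))).sum(axis=0)
    for _ in range(n_iter):
        grad = M * (digamma(alpha.sum()) - digamma(alpha)) + ss
        hess = M * (polygamma(1, alpha.sum()) - np.diag(polygamma(1, alpha)))
        step = np.linalg.solve(hess, grad)          # Newton direction H^{-1} grad
        new_alpha = alpha - step
        while np.any(new_alpha <= 0):               # step-halving to stay in the feasible region
            step /= 2.0
            new_alpha = alpha - step
        if np.abs(new_alpha - alpha).max() < tol:
            return new_alpha
        alpha = new_alpha
    return alpha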

Smoothing Simple Laplace smoothing is no longer justified as a maximum a posteriori method in the LDA setting. Instead, we can assume that each row of the k × V matrix β is drawn independently from an exchangeable Dirichlet distribution; that is, we treat the βi as random variables that are endowed with a posterior distribution, conditioned on the data.

Smoothing Model Thus we obtain a variational approach to Bayesian inference, with the variational distribution q(β1:k, z1:M, θ1:M | λ, Φ, γ) = Π_{i=1}^k Dir(βi | λi) Π_{d=1}^M qd(θd, zd | Φd, γd), where qd is the variational distribution defined for LDA as above, and the update for the new variational parameter λ is: λij = η + Σ_{d=1}^M Σ_{n=1}^{Nd} Φ*dni wdn^j.

Applications

Document Modeling Perplexity is used to measure the generalization performance of a model: we fit a model on a training corpus and evaluate how well it describes a held-out test set, where lower perplexity indicates better generalization. Formally, perplexity(D_test) = exp( − Σ_{d=1}^M log p(wd) / Σ_{d=1}^M Nd ). In the reported experiments, LDA outperforms the other models, including pLSI, the smoothed unigram model, and the smoothed mixture of unigrams.
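
A tiny helper that computes this quantity, assuming the per-document held-out log likelihoods (e.g., the variational lower bounds) are already available; the names and sample values are placeholders:

import numpy as np

def perplexity(doc_log_likelihoods, doc_lengths):
    # exp( - sum_d log p(w_d) / sum_d N_d )
    return float(np.exp(-np.sum(doc_log_likelihoods) / np.sum(doc_lengths)))

print(perplexity([-1200.3, -950.7], [250, 210]))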

Document Classification We can use the topic proportions inferred by LDA as features for a classifier. With, say, 50 topics, this reduces the feature space by about 99.6% relative to the raw word-count representation. The experimental results show that this feature reduction decreases accuracy only slightly.
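
A hedged sketch of this pipeline using scikit-learn, whose LatentDirichletAllocation class implements (online) variational inference for LDA; the two-document corpus and labels are placeholders, and a real experiment would use a much larger labeled collection:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import LinearSVC

docs = ["grain exports rose sharply", "the central bank cut interest rates"]   # placeholder corpus
labels = [0, 1]                                                                # placeholder class labels

counts = CountVectorizer().fit_transform(docs)                  # word-count document-term matrix
lda = LatentDirichletAllocation(n_components=50, random_state=0)
topic_features = lda.fit_transform(counts)                      # documents x 50 topic proportions
clf = LinearSVC().fit(topic_features, labels)                   # SVM trained on the reduced features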

Collaborative Filtering We can train a model on a fully observed set of users. Then, for each unobserved user, we are shown all but one of the movies preferred by that user and are asked to predict the held-out movie. Precisely, the predictive perplexity on M test users is defined as: predictive-perplexity(D_test) = exp( − Σ_{d=1}^M log p(wd,Nd | wd,1:Nd−1) / M ).

Other Applications

The Naïve Bayes model (Csurka et al. 2004): a plate diagram with a class variable c generating N visual words w per image; the figure indicates the prior probability of the object classes, the image likelihood given the class, and the object class decision.

Hierarchical Bayesian text models: the probabilistic Latent Semantic Analysis (pLSA) plate diagram (document d, topic z, word w; N words per document, D documents), applied to images with an example visual topic such as “face” (Sivic et al. ICCV 2005).

Hierarchical Bayesian text models: the Latent Dirichlet Allocation (LDA) plate diagram (category c, topic z, word w; N words per image, D images), applied to images with an example visual topic such as “beach” (Fei-Fei et al. ICCV 2005).

Summary LDA is based on the exchangeability assumption. It can be viewed as a dimensionality reduction technique. Exact inference is intractable, but we can use approximate inference instead. It has applications to other kinds of collections, for example images and their captions.

End of the Talk!

Conjugation The Dirichlet is conjugate to the multinomial: if θ ~ Dir(α) and z1, …, zN are drawn from Multinomial(θ), then the posterior is again Dirichlet, p(θ | z1:N) = Dir(α1 + n1, …, αk + nk), where ni is the number of draws assigned to component i.

Entropy