Presentation on theme: "Sinead Williamson, Chong Wang, Katherine A. Heller, David M. Blei"— Presentation transcript:
1. The IBP Compound Dirichlet Process and its Application to Focused Topic Modeling. Sinead Williamson, Chong Wang, Katherine A. Heller, David M. Blei. Presented by Eric Wang, 9/16/2011.
2. Introduction: Latent Dirichlet Allocation (LDA) is a powerful and ubiquitous topic modeling framework. Incorporating the hierarchical Dirichlet process (HDP) into LDA allows for more flexible topic modeling by estimating the global topic proportions. A drawback of HDP-LDA is that a topic that is rare globally will also have a low expected proportion within each document. The authors propose a model that allows a globally rare topic to still have large mass within individual documents.
3. Hierarchical Dirichlet Process: The hierarchical Dirichlet process (HDP) is a prior for Bayesian nonparametric mixed-membership modeling of grouped data. Hierarchically, it can be defined as [equation not preserved in the transcript], where m indexes the data group. In the HDP, the expected mixing weights of each group-level distribution equal the global mixing weights; in practice, the global weights act as the average of the group-level mixture memberships.
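The defining equation did not survive in this transcript. A standard way to write the two-level HDP is sketched below. The symbols H (base measure), \zeta and \tau (concentration parameters), G_0 (global measure) and G_m (group-level measure) are assumed notation, not recovered from the slide.

```latex
G_0 \sim \mathrm{DP}(\zeta, H), \qquad
G_m \mid G_0 \sim \mathrm{DP}(\tau, G_0), \quad m = 1, \dots, M, \qquad
\mathbb{E}[\, G_m \mid G_0 \,] = G_0 .
```

The last identity is the property the slide points to: every group-level measure is centered on the global one, so a topic with small global weight has small expected weight in each group.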
4. Indian Buffet Process: The Indian Buffet Process (IBP) defines a distribution over binary matrices with an infinite number of columns and a finite number of non-zero entries. Hierarchically, it is defined as [equation not preserved in the transcript], where m and k index the rows and columns of the binary matrix b. It can also be represented via a stick-breaking construction [equation not preserved in the transcript].
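The stick-breaking equation is missing from this transcript. As a minimal sketch, the commonly used IBP stick-breaking form draws mu_k ~ Beta(alpha, 1), sets pi_k to the running product of the mu values, and draws each entry b[m][k] ~ Bernoulli(pi_k). The function name, the parameter names, and the finite truncation level are assumptions made for illustration; the actual IBP has infinitely many columns.

```python
import random


def ibp_stick_breaking(alpha, num_rows, truncation, rng):
    """Sample a truncated binary matrix via IBP stick-breaking (illustrative sketch)."""
    pis = []
    stick = 1.0
    for _ in range(truncation):
        stick *= rng.betavariate(alpha, 1.0)  # mu_k ~ Beta(alpha, 1)
        pis.append(stick)                     # pi_k = mu_1 * ... * mu_k
    # Each row m turns column k on with probability pi_k.
    b = [[1 if rng.random() < p else 0 for p in pis] for _ in range(num_rows)]
    return pis, b


rng = random.Random(0)
pis, b = ibp_stick_breaking(alpha=3.0, num_rows=5, truncation=20, rng=rng)
```

Because the column probabilities shrink with k, later columns are switched on less and less often, which is why each row has only finitely many non-zero entries.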
5. IBP Compound Dirichlet Process: Combining the HDP and the IBP into a single prior yields an infinite “spike-and-slab” prior, the IBP compound Dirichlet process (ICD). A spike distribution (the IBP) determines which variables are drawn from the slab (the DP). The model assumes the following generative process: [generative process not preserved in the transcript].
6. IBP Compound Dirichlet Process: The atom masses of data group m are Dirichlet distributed as follows: [equation not preserved in the transcript], where [definition not preserved in the transcript]. In this construction, the resulting weights are the topic proportions for document m, and B is a binary vector indicating usage of the dictionary elements.
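The equations on this slide are missing from the transcript. The sketch below assumes a truncated form of the construction: per-topic inclusion probabilities `pis`, positive relative masses `phis`, a binary vector b with b[k] ~ Bernoulli(pis[k]), and proportions theta ~ Dirichlet(b * phis) restricted to the active entries. All names are hypothetical, and the all-zero return for a group with no active topic is a convenience of this sketch only.

```python
import random


def sample_icd_proportions(pis, phis, rng):
    """Draw a binary usage vector and Dirichlet proportions over the active topics."""
    # Spike: decide which topics this data group may use.
    b = [1 if rng.random() < p else 0 for p in pis]
    # Slab: a Dirichlet draw is a set of normalized gamma draws;
    # inactive topics contribute exactly zero mass.
    draws = [rng.gammavariate(phi, 1.0) if bk else 0.0 for bk, phi in zip(b, phis)]
    total = sum(draws)
    if total == 0.0:
        return b, [0.0] * len(b)  # no active topic in this group
    return b, [d / total for d in draws]


rng = random.Random(0)
# Topic 1 is rarely switched on (small pi) but has a large relative mass (large phi).
b, theta = sample_icd_proportions(pis=[0.9, 0.05], phis=[1.0, 10.0], rng=rng)
```

The point of the decoupling is visible in the inputs: how often a topic is active is governed by `pis`, while how much mass it takes once active is governed by `phis`.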
7. Focused Topic Models: The authors use the ICD to develop the focused topic model (FTM). In this framework, a global distribution over topics is drawn and shared across all documents, as in HDP-LDA. Each document infers a subset of topics from the global menu; the subset is determined by the document's binary vector. Since the binary vector is independent of the global topic proportions, topics that are rare globally can still make up a large proportion of individual documents.
8. Focused Topic Models: The generative process for the FTM is as follows: [generative process not preserved in the transcript].
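The generative process itself is not preserved in this transcript. A minimal sketch of what one document draw could look like, assuming a truncated model, is given here: a binary vector selects the active topics, proportions are drawn from a Dirichlet over the active relative masses, and each word picks a topic and then a vocabulary item. The names `pis`, `phis`, `betas` and `generate_ftm_document` are hypothetical, and the document length is passed in as an argument rather than sampled.

```python
import random


def dirichlet(alphas, rng):
    """Draw from a Dirichlet by normalizing gamma draws; zero parameters stay zero."""
    draws = [rng.gammavariate(a, 1.0) if a > 0 else 0.0 for a in alphas]
    total = sum(draws)
    if total <= 0.0:
        raise ValueError("at least one parameter must be positive")
    return [d / total for d in draws]


def generate_ftm_document(pis, phis, betas, num_words, rng):
    """Generate topic assignments and word ids for one document (sketch)."""
    # Choose which topics this document may use.
    b = [1 if rng.random() < p else 0 for p in pis]
    if not any(b):
        return b, [], []  # no active topic, so no words in this sketch
    # Topic proportions over the active topics only.
    theta = dirichlet([bk * phi for bk, phi in zip(b, phis)], rng)
    topics, words = [], []
    for _ in range(num_words):
        z = rng.choices(range(len(theta)), weights=theta)[0]        # topic indicator
        w = rng.choices(range(len(betas[z])), weights=betas[z])[0]  # word id
        topics.append(z)
        words.append(w)
    return b, topics, words


rng = random.Random(1)
vocab_size = 6
pis = [0.9, 0.5, 0.05]
phis = [1.0, 1.0, 8.0]
betas = [dirichlet([0.5] * vocab_size, rng) for _ in pis]  # one word distribution per topic
b, topics, words = generate_ftm_document(pis, phis, betas, num_words=20, rng=rng)
```

Words are only ever assigned to topics whose entry in `b` is 1, so a document concentrates on its own subset of the global menu.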
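9. Posterior Inference: The topic indicator for word i in document m is sampled from [equation not preserved in the transcript], where the integral [expression not preserved in the transcript] has an analytical form and [expression not preserved in the transcript]. This is an important point because it suggests a general framework that can be adapted to other applications.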
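10. Posterior Inference: The joint probability of [symbol not preserved in the transcript] and the total number of words assigned to topic k is [equation not preserved in the transcript], and it is log-differentiable with respect to [symbols not preserved in the transcript]. A hybrid Monte Carlo algorithm is used to sample from their posteriors.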
11. Posterior Inference: The topic weights are sampled as [equation not preserved in the transcript], and the binary topic indicators are sampled as [equation not preserved in the transcript]. Note that if a topic is used in a document, it is automatically considered “active”, and additional (unused) topics can also be activated.
12. Empirical Results: The authors considered three different text datasets: [dataset list not preserved in the transcript]. All models were run for 1000 iterations, with the first 500 iterations discarded as burn-in.
14. Empirical Results: In (a), the authors compare the number of topics in which each word appears; the FTM has more concentrated topics. In (b), the authors show the number of documents in which each topic appears. The plot illustrates that the HDP has many topics that appear in only a few documents, while a significant portion of the FTM topics appear in many documents.
15. Discussion: The authors propose a novel model, the IBP compound Dirichlet process (ICD), that decouples across-data topic prevalence from within-data topic proportions. The focused topic model (FTM) was developed from the ICD and addresses several key shortcomings of HDP-LDA. In HDP-LDA, the global topic prevalence constrains the proportion a topic can take within a document, but in the FTM, globally rare topics can still have large mass within a document. The FTM shows improved perplexity relative to the HDP.
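The transcript does not say how perplexity was computed. A common definition, assumed here, is the exponential of the negative average per-word log-likelihood on held-out text, where lower is better. The names `perplexity` and `log_probs` are illustrative.

```python
import math


def perplexity(log_probs):
    """Perplexity from per-word natural-log probabilities of held-out words."""
    if not log_probs:
        raise ValueError("need at least one held-out word")
    return math.exp(-sum(log_probs) / len(log_probs))


# A model that spreads probability uniformly over 4 words has perplexity close to 4.
uniform_score = perplexity([math.log(0.25)] * 8)
```

Under this definition a model that assigns higher probability to the held-out words gets a lower score, which is the sense in which one model "improves" on another.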