Stochastic Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation. James Foulds, Levi Boyles, Christopher DuBois, Padhraic Smyth, Max Welling.

Presentation transcript:

Stochastic Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation
James Foulds [1], Levi Boyles [1], Christopher DuBois [2], Padhraic Smyth [1], Max Welling [3]
[1] University of California, Irvine, Computer Science; [2] University of California, Irvine, Statistics; [3] University of Amsterdam, Computer Science

Let’s say we want to build an LDA topic model on Wikipedia

LDA on Wikipedia [timeline figure with marks at 10 mins, 1 hour, 6 hours, 12 hours]

LDA on Wikipedia: 1 full iteration = 3.5 days! [timeline: 10 mins, 1 hour, 6 hours, 12 hours]

LDA on Wikipedia: stochastic variational inference [timeline: 10 mins, 1 hour, 6 hours, 12 hours]

LDA on Wikipedia: stochastic collapsed variational inference [timeline: 10 mins, 1 hour, 6 hours, 12 hours]

Available tools

              VB                            Collapsed Gibbs sampling                Collapsed VB
  Batch       Blei et al. (2003)            Griffiths and Steyvers (2004)           Teh et al. (2007), Asuncion et al. (2009)
  Stochastic  Hoffman et al. (2010, 2013)   Mimno et al. (2012) (VB/Gibbs hybrid)   ???

Outline
- Stochastic optimization
- Collapsed inference for LDA
- New algorithm: SCVB0
- Experimental results
- Discussion

Stochastic Optimization for ML
Batch algorithms:
  while (not converged):
    process the entire dataset
    update the parameters
Stochastic algorithms:
  while (not converged):
    process a subset of the dataset
    update the parameters
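Not from the slides: a toy NumPy illustration of the two loop structures, fitting the mean of a dataset by minimizing an average squared error, once with full-data gradient steps and once with noisy minibatch gradient steps.

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=100_000)   # toy dataset; true mean is 3.0

# Batch: every update touches the entire dataset.
theta = 0.0
for _ in range(100):
    grad = np.mean(theta - data)        # exact gradient of the average loss 0.5 * (theta - x)^2
    theta -= 0.1 * grad

# Stochastic: every update touches only a small random subset.
theta_sgd = 0.0
for t in range(1, 2001):
    batch = rng.choice(data, size=32)   # minibatch
    grad = np.mean(theta_sgd - batch)   # noisy but unbiased gradient estimate
    theta_sgd -= grad / (10 + t)        # decreasing step size

print(theta, theta_sgd)                 # both approach the data mean (about 3.0)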

Stochastic Optimization for ML
- Stochastic gradient descent: estimate the gradient
- Stochastic variational inference (Hoffman et al., 2010, 2013): estimate the natural gradient of the variational parameters
- Online EM (Cappe and Moulines, 2009): estimate the E-step sufficient statistics
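All three follow the same stochastic approximation template (notation added here, not in the slides): compute a noisy but unbiased estimate \hat{g}_t of the full-data quantity g(\theta_t) from the current subset, and apply it with a decreasing step size \rho_t:

  \theta_{t+1} = \theta_t + \rho_t \, \hat{g}_t, \qquad \mathbb{E}[\hat{g}_t] = g(\theta_t).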

Collapsed Inference for LDA
Marginalize out the parameters, and perform inference on the latent variables only:
- Simpler, faster, and fewer update equations
- Better mixing for Gibbs sampling
- Better variational bound for VB (Teh et al., 2007)
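For reference (this equation is not in the transcript): integrating out the topic proportions \theta and the topic-word distributions \phi under their Dirichlet priors \alpha and \eta leaves a distribution over only the topic assignments z and the words w,

  p(\mathbf{z}, \mathbf{w} \mid \alpha, \eta) = \prod_{j} \frac{B(N^{\Theta}_{j} + \alpha)}{B(\alpha)} \prod_{k} \frac{B(N^{\Phi}_{k} + \eta)}{B(\eta)},

where B(\cdot) is the multivariate Beta function, N^{\Theta}_{j} is the vector of topic counts in document j, and N^{\Phi}_{k} is the vector of word counts assigned to topic k.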

A Key Insight
VB → Stochastic VB: document parameters, updated after every document.

A Key Insight
VB → Stochastic VB: document parameters, updated after every document.
Collapsed VB → Stochastic Collapsed VB: word parameters. Update after every word?

Collapsed Inference for LDA
Collapsed Variational Bayes (Teh et al., 2007):
- K-dimensional discrete variational distributions for each token
- Mean field assumption
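Written out (notation added here, not in the transcript), the mean field assumption factorizes the variational posterior over the individual topic assignments, with one K-dimensional discrete distribution \gamma_{ij} per token:

  q(\mathbf{z}) = \prod_{j} \prod_{i} q(z_{ij} \mid \gamma_{ij}), \qquad \gamma_{ijk} = q(z_{ij} = k).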

Collapsed Inference for LDA
Collapsed Gibbs sampler
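The update equation on this slide did not survive the transcript; the standard collapsed Gibbs conditional (Griffiths and Steyvers, 2004) resamples each token's topic given all the others,

  p(z_{ij} = k \mid \mathbf{z}^{\neg ij}, \mathbf{w}) \propto \left(N^{\Theta, \neg ij}_{jk} + \alpha\right) \frac{N^{\Phi, \neg ij}_{w_{ij} k} + \eta}{N^{Z, \neg ij}_{k} + W\eta},

where the superscript \neg ij means the counts exclude the current token and W is the vocabulary size.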

Collapsed Inference for LDA
Collapsed Gibbs sampler and CVB0 (Asuncion et al., 2009)
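The slide's CVB0 equation is likewise missing from the transcript; it has the same form as the Gibbs conditional, except that the hard counts are replaced by expected counts under the variational distribution and the result is stored as the token's new \gamma_{ij} rather than sampled:

  \gamma_{ijk} \propto \left(N^{\Theta, \neg ij}_{jk} + \alpha\right) \frac{N^{\Phi, \neg ij}_{w_{ij} k} + \eta}{N^{Z, \neg ij}_{k} + W\eta}.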

CVB0 Statistics
Simple sums over the variational parameters.
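Concretely (notation consistent with the equations above, not from the transcript), the statistics are the expected document-topic, word-topic, and topic totals:

  N^{\Theta}_{jk} = \sum_{i} \gamma_{ijk}, \qquad N^{\Phi}_{wk} = \sum_{i,j : w_{ij} = w} \gamma_{ijk}, \qquad N^{Z}_{k} = \sum_{i,j} \gamma_{ijk}.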

Stochastic Optimization for ML
- Stochastic gradient descent: estimate the gradient
- Stochastic variational inference (Hoffman et al., 2010, 2013): estimate the natural gradient of the variational parameters
- Online EM (Cappe and Moulines, 2009): estimate the E-step sufficient statistics
- Stochastic CVB0: estimate the CVB0 statistics

Estimating CVB0 Statistics
Pick a random word i from a random document j.
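The estimator equations on these slides are missing from the transcript. In the notation above, a single uniformly random token (i, j) yields unbiased estimates of the statistics by scaling its \gamma up to the size of the corpus (C tokens) or of the document (C_j tokens):

  \hat{N}^{\Phi}_{wk} = C \, \gamma_{ijk} \, \mathbb{1}[w_{ij} = w], \qquad \hat{N}^{Z}_{k} = C \, \gamma_{ijk}, \qquad \hat{N}^{\Theta}_{jk} = C_j \, \gamma_{ijk}.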

Stochastic CVB0
In an online algorithm we cannot store the variational parameters, but we can still update them!

Stochastic CVB0
Keep an online average of the CVB0 statistics.
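In symbols (step sizes and notation as above; the slide's equations are not in the transcript), each new single-token estimate is folded into a running average with a decreasing step size,

  N^{\Phi} \leftarrow (1 - \rho_t) N^{\Phi} + \rho_t \hat{N}^{\Phi}, \qquad N^{Z} \leftarrow (1 - \rho_t) N^{Z} + \rho_t \hat{N}^{Z}, \qquad N^{\Theta}_{j} \leftarrow (1 - \rho^{\Theta}_t) N^{\Theta}_{j} + \rho^{\Theta}_t \hat{N}^{\Theta}_{j},

where the document statistics may use their own step-size schedule \rho^{\Theta}_t.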

Extra Refinements
- Optional burn-in passes per document
- Minibatches
- Operating on sparse counts
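For example (my notation), with a minibatch M of tokens the per-token estimates are averaged before being folded in,

  \hat{N}^{\Phi} = \frac{1}{|M|} \sum_{(i,j) \in M} \hat{N}^{\Phi,(ij)}, \qquad N^{\Phi} \leftarrow (1 - \rho_t) N^{\Phi} + \rho_t \hat{N}^{\Phi},

which lowers the variance of each update and amortizes the cost of touching the word-topic statistics.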

Stochastic CVB0: Putting It All Together
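The algorithm box on this slide did not survive the transcript. As a rough, unoptimized sketch only (array layout, initialization, and the step-size schedule are my own choices, not the authors'), the per-token loop combining the pieces above might look like this:

import numpy as np

def scvb0(docs, W, K, alpha=0.1, eta=0.01, n_passes=20, seed=0):
    """Toy sketch of stochastic CVB0 for LDA, without minibatching, burn-in, or sparse updates.
    docs: list of documents, each a list of word ids in [0, W). Returns word-topic statistics."""
    rng = np.random.default_rng(seed)
    C = sum(len(d) for d in docs)                      # total number of tokens in the corpus
    N_phi = rng.random((W, K))
    N_phi *= C / N_phi.sum()                           # word-topic statistics, scaled to corpus size
    N_z = N_phi.sum(axis=0)                            # per-topic totals
    N_theta = [np.full(K, len(d) / K) for d in docs]   # per-document topic statistics

    t = 0
    for _ in range(n_passes):
        for j in rng.permutation(len(docs)):
            for w in docs[j]:
                t += 1
                rho = (t + 10.0) ** -0.7               # decreasing Robbins-Monro step size
                # CVB0-style update for this token (the "exclude current token" correction is omitted)
                gamma = (N_theta[j] + alpha) * (N_phi[w] + eta) / (N_z + W * eta)
                gamma /= gamma.sum()
                # fold the single-token statistic estimates into the running averages
                N_theta[j] = (1 - rho) * N_theta[j] + rho * len(docs[j]) * gamma
                hat_phi = np.zeros((W, K))
                hat_phi[w] = C * gamma
                N_phi = (1 - rho) * N_phi + rho * hat_phi   # dense update; minibatching amortizes this in practice
                N_z = (1 - rho) * N_z + rho * C * gamma
    return N_phi

# Toy usage: three tiny "documents" over a 5-word vocabulary, 2 topics.
docs = [[0, 0, 1, 2], [2, 3, 3, 4], [0, 1, 4, 4]]
print(scvb0(docs, W=5, K=2).round(2))

On real corpora one would add the minibatch, burn-in, and sparsity refinements from the previous slide.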

Theory
Stochastic CVB0 is a Robbins-Monro stochastic approximation algorithm for finding the fixed points of (a variant of) CVB0.
Theorem: with an appropriate sequence of step sizes, stochastic CVB0 converges to a stationary point of the MAP objective, with adjusted hyperparameters.
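The "appropriate sequence of step sizes" is the usual Robbins-Monro condition (standard, and not spelled out in the transcript),

  \sum_{t} \rho_t = \infty, \qquad \sum_{t} \rho_t^2 < \infty,

satisfied, for example, by \rho_t = (t + \tau)^{-\kappa} with \kappa \in (0.5, 1].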

Experimental Results – Large Scale

Experimental Results – Small Scale
Real-time or near real-time results are important for exploratory data analysis (EDA) applications.
Human participants were shown the top ten words from each topic.

Experimental Results – Small Scale
[figure: mean number of errors, with standard deviations, for topics trained on NIPS (5 seconds) and New York Times (60 seconds)]

Discussion
We introduced stochastic CVB0 for LDA:
- Combines stochastic and collapsed inference approaches
- Fast
- Easy to implement
- Accurate
Experimental results show SCVB0 is useful for both large-scale and small-scale analysis.
Future work: exploit sparsity, parallelization, non-parametric extensions.

Thanks! Questions?