Chinese Restaurants and Stick-Breaking: An Introduction to the Dirichlet Process
Teg Grenager, NLP Group Lunch, February 24, 2005

Agenda
- Motivation
- Mixture Models
- Dirichlet Process
- Gibbs Sampling
- Applications

Clustering
Goal: learn a partition of the data, such that:
- Data within classes are "similar"
- Classes are "different" from each other
Two very different approaches:
- Agglomerative: build up clusters by iteratively sticking similar things together
- Mixture model: learn a generative model over the data, treating the classes as hidden variables

Agglomerative Clustering
[Figure/table: the maximum within-cluster distance as the number of clusters is reduced from 20 to 1 by successive merges; the distance grows from about 5 at 19 clusters to 16 at 1 cluster.]
Pros: doesn't need a generative model (number of clusters, parametric distribution)
Cons: ad hoc, no probabilistic foundation, intractable for large data sets

Mixture Model Clustering
Examples: K-means, mixture of Gaussians, Naïve Bayes
Pros: sound probabilistic foundation, efficient even for large data sets
Cons: requires a generative model, including the number of clusters (mixture components)
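To make the "cons" concrete, here is a minimal fitting sketch (it uses scikit-learn, which the talk does not mention): the number of mixture components must be fixed before fitting.

```python
# Minimal sketch: fitting a Gaussian mixture requires choosing the number of components.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic data: three well-separated 2-D Gaussian blobs.
X = np.vstack([rng.normal(loc=m, scale=0.5, size=(100, 2))
               for m in ([0, 0], [5, 5], [0, 5])])

# The generative model requires fixing the number of clusters in advance.
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
labels = gmm.predict(X)     # hard cluster assignments
print(gmm.means_)           # learned component means
```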

Problem
How many clusters are there in the data?
- We can use agglomerative clustering to get a whole cluster tree
- We can run K-means with different numbers of clusters to see how the likelihood changes
But what's the right number? It would be nice to let the data decide somehow…

Big Idea
We want to use a generative model, but we don't want to decide the number of clusters in advance.
Suggestion: put each datum in its own cluster.
Problem: under any density function, the probability of two clusters colliding is zero; there is no "stickiness".
Solution: instead of a density function, use a stochastic process under which the probability of two clusters falling together is strictly positive.
Best of both worlds: stickiness with a variable number of clusters.

Finite Mixture Model
[Plate diagrams: a Gaussian mixture (mixing weights π → class c_i → observation x_i, with component parameters θ_c, plate over N data), and a Naïve Bayes mixture (π → c_i → x_{i1}, …, x_{iM}, with parameters θ_c, plates over N data and M features).]
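The plates above encode a simple generative story. As a hedged NumPy sketch (not from the slides), the finite Gaussian mixture case is:

```python
# Sketch of the finite Gaussian mixture generative process (1-D for simplicity).
import numpy as np

rng = np.random.default_rng(1)
K, N, alpha = 3, 500, 1.0

pi = rng.dirichlet(alpha * np.ones(K))    # mixing weights  pi ~ Dirichlet(alpha)
mu = rng.normal(0.0, 5.0, size=K)         # component means theta_c ~ G0 = N(0, 5^2)
c = rng.choice(K, size=N, p=pi)           # class labels    c_i ~ Multinomial(pi)
x = rng.normal(mu[c], 1.0)                # observations    x_i ~ N(mu_{c_i}, 1)
```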

Dirichlet Priors (Review)
- A distribution over possible parameter vectors of the multinomial distribution; values must lie in the (k−1)-dimensional probability simplex.
- The Beta distribution is the two-parameter (k = 2) special case.
- It is the conjugate prior to the multinomial.
- The expectation is simple, but the explicit formulation is ugly (see the formulas below).
[Plate diagram: α → θ → x_i, plate over N observations.]
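For reference, the explicit density and the expectation alluded to above are the standard formulas (written out here; they are not reproduced on the slide):

```latex
% Dirichlet density over theta in the (k-1)-simplex, with parameters alpha_1..alpha_k > 0
p(\theta \mid \alpha) \;=\;
  \frac{\Gamma\!\left(\sum_{j=1}^{k} \alpha_j\right)}{\prod_{j=1}^{k} \Gamma(\alpha_j)}
  \prod_{j=1}^{k} \theta_j^{\alpha_j - 1},
\qquad
\mathbb{E}[\theta_j] \;=\; \frac{\alpha_j}{\sum_{l=1}^{k} \alpha_l}.
```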

Infinite Mixture Model
[Plate diagram: the mixture model with the number of components taken to infinity; component parameters are drawn from a base distribution G0 with concentration parameter α.]

Chinese Restaurant Process
The CRP is a distribution over partitions that captures the clustering effect of the DP: customers enter a restaurant one at a time, and each new customer joins a table already occupied by n_c customers with probability proportional to n_c, or starts a new table with probability proportional to α (simulated below).
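A minimal simulation of the CRP seating rule (a NumPy sketch; the function name and parameters are illustrative, not from the talk):

```python
# Simulate table assignments for n customers under CRP(alpha).
import numpy as np

def crp(n, alpha, rng=None):
    """Return a list of table indices, one per customer."""
    rng = rng or np.random.default_rng()
    tables = []          # tables[k] = number of customers at table k
    seating = []
    for i in range(n):   # customer i arrives; i customers are already seated
        probs = np.array(tables + [alpha], dtype=float) / (i + alpha)
        k = rng.choice(len(probs), p=probs)
        if k == len(tables):
            tables.append(1)     # start a new table
        else:
            tables[k] += 1       # join an existing table
        seating.append(k)
    return seating

print(crp(20, alpha=1.0))        # e.g. [0, 0, 1, 0, 2, ...]
```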

DP Mixture Model
[Plate diagrams: the finite mixture model alongside the DP mixture model, in which θ_i → x_i for i = 1 … N, with the θ_i drawn from a random measure G and G drawn from DP(α, G0).]
So far we have presented the DP mixture model; let's now define the more general DP.
Remember, we're going to put each datum in its own cluster, but draw the cluster parameters from a DP.
In a DP the probability of sampling the same point twice is positive.
G0 is called the base measure, α is called the concentration parameter, and G is a random probability measure distributed according to the DP.
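In symbols, the DP mixture model just described is (standard notation, e.g. Neal 2000):

```latex
G \sim \mathrm{DP}(\alpha, G_0), \qquad
\theta_i \mid G \sim G, \qquad
x_i \mid \theta_i \sim F(\theta_i), \qquad i = 1, \dots, N.
```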

Stick-breaking Process
[Figure: a unit-length stick broken repeatedly; for example, a first break of 0.4 leaves 0.6, breaking half of that adds a weight of 0.3, and so on, giving weights 0.4, 0.3, 0.24, ….]
What is G? A sample from the DP; the θ parameters for each datum are drawn from it.
Because the probability of drawing the same θ twice is positive, G must be discrete.
G depends on the concentration α and the base measure G0, and can be constructed explicitly by the stick-breaking process (see the sketch below).
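A hedged NumPy sketch of the stick-breaking construction, truncated at a finite number of sticks for practicality; the function name and the choice of G0 are illustrative:

```python
# Truncated stick-breaking draw of G ~ DP(alpha, G0), here with G0 = N(0, 1) over atoms.
import numpy as np

def stick_breaking(alpha, num_sticks, rng=None):
    """Return (weights, atoms) approximating a draw G from the DP."""
    rng = rng or np.random.default_rng()
    betas = rng.beta(1.0, alpha, size=num_sticks)      # beta_k ~ Beta(1, alpha)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    weights = betas * remaining                        # pi_k = beta_k * prod_{j<k}(1 - beta_j)
    atoms = rng.normal(0.0, 1.0, size=num_sticks)      # theta*_k ~ G0
    return weights, atoms

w, th = stick_breaking(alpha=1.0, num_sticks=100)
print(w[:5], w.sum())    # weights decay; their sum approaches 1 as num_sticks grows
```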

Properties of the DP
Let (Ω, B) be a measurable space, G0 be a probability measure on that space, and α be a positive real number.
A Dirichlet process is the distribution of a random probability measure G over (Ω, B) such that, for all finite measurable partitions (A_1, …, A_r) of Ω,
(G(A_1), …, G(A_r)) ~ Dirichlet(α G0(A_1), …, α G0(A_r)).
Draws θ_1, θ_2, … from G are generally not distinct; the number of distinct values grows as O(log n).
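To make the O(log n) growth concrete, the expected number of distinct values among n draws from G is given by a standard identity (not written on the slide):

```latex
\mathbb{E}[K_n] \;=\; \sum_{i=1}^{n} \frac{\alpha}{\alpha + i - 1}
\;\approx\; \alpha \log\!\left(1 + \frac{n}{\alpha}\right).
```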

Infinite Exchangeability
In general, an infinite sequence of random variables is infinitely exchangeable if, for every finite subset {x_1, …, x_n} and any permutation π, we have p(x_1, …, x_n) = p(x_π(1), …, x_π(n)).
Note that infinite exchangeability is not the same as being independent and identically distributed (i.i.d.)!
Using De Finetti's theorem, it is possible to show that our draws θ_i are infinitely exchangeable.
Thus the mixture components may be sampled in any order.
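One way to write the representation the slide invokes (De Finetti's theorem, stated informally for our setting; this statement is not on the slide):

```latex
% De Finetti: exchangeability of theta_1, theta_2, ... implies they are conditionally
% i.i.d. given some random measure G; here the mixing distribution over G is the DP.
p(\theta_1, \dots, \theta_n) \;=\; \int \left[ \prod_{i=1}^{n} G(\theta_i) \right] dP(G),
\qquad P = \mathrm{DP}(\alpha, G_0).
```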

Mixture Model Inference
We want to find a clustering of the data: an assignment of values to the hidden class variable.
Sometimes we also want the component parameters.
In most finite mixture models, this can be found with EM.
The Dirichlet process is a nonparametric prior and does not directly admit EM, so we use Gibbs sampling instead.

Gibbs Sampling 1
[Plate diagrams: the DP mixture model with G explicit (α, G0 → G → θ_i → x_i), and the equivalent model with G integrated out (α, G0 → θ_i → x_i).]
Algorithm 1: integrate out G, and sample each θ_i directly, conditioned on everything else.
This is inefficient, because we update cluster information for one datum at a time.
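Written out (in notation close to Neal 2000; the formula is not reproduced on the slide), integrating out G gives the conditional prior used by Algorithm 1:

```latex
% Conditional prior with G integrated out (the Blackwell--MacQueen urn):
\theta_i \mid \theta_{-i} \;\sim\;
  \frac{1}{N - 1 + \alpha} \sum_{j \neq i} \delta_{\theta_j}
  \;+\; \frac{\alpha}{N - 1 + \alpha}\, G_0.
% Multiplying by the likelihood F(x_i | theta_i) gives the Gibbs conditional for theta_i.
```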

Gibbs Sampling 2
[Plate diagrams: the DP mixture model, and the equivalent model with explicit cluster indicators c_i and shared cluster parameters θ_c.]
Algorithm 2: reintroduce a cluster variable c_i which takes on values that are the names c of the clusters.
Store the parameters that are shared by all data in class c in a new variable θ_c.

Gibbs Sampling 2 (cont.)
Algorithm 2: repeat until convergence; for i = 1, …, N, sample c_i from
P(c_i = c | c_-i, x_i) ∝ n_-i,c ∫ F(x_i | θ) dH_-i,c(θ) for an existing cluster c, and
P(c_i = new cluster | c_-i, x_i) ∝ α ∫ F(x_i | θ) dG0(θ),
where H_-i,c is the posterior distribution of θ_c based on the prior G0 and all observations x_j for which j ≠ i and c_j = c.
This works well in practice (a sketch for a conjugate Gaussian model follows below).
Note: variational methods (other than EM) can also be used.
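Below is a hedged sketch of this collapsed sweep for one concrete conjugate case: 1-D Gaussian clusters with known variance σ² and a N(μ0, τ0²) base measure, so the integrals above have closed-form predictive densities. All function names and hyperparameter values are illustrative, not from the talk.

```python
# Collapsed Gibbs sampling for a DP mixture of 1-D Gaussians (known variance),
# following the Algorithm-2-style sweep above with G and theta_c integrated out.
import numpy as np

def dp_gibbs_sweeps(x, alpha=1.0, sigma2=1.0, mu0=0.0, tau02=10.0,
                    num_sweeps=50, rng=None):
    rng = rng or np.random.default_rng(0)
    n = len(x)
    c = np.zeros(n, dtype=int)                 # start with everything in one cluster
    counts = {0: n}
    sums = {0: float(np.sum(x))}

    def predictive(xi, cnt, s):
        """p(x_i | cluster with cnt points summing to s), by Gaussian conjugacy."""
        tau_n2 = 1.0 / (1.0 / tau02 + cnt / sigma2)    # posterior variance of the mean
        mu_n = tau_n2 * (mu0 / tau02 + s / sigma2)     # posterior mean of the mean
        var = tau_n2 + sigma2                          # predictive variance
        return np.exp(-0.5 * (xi - mu_n) ** 2 / var) / np.sqrt(2 * np.pi * var)

    for _ in range(num_sweeps):
        for i in range(n):
            k = c[i]                                   # remove x_i from its cluster
            counts[k] -= 1
            sums[k] -= x[i]
            if counts[k] == 0:
                del counts[k], sums[k]

            labels = list(counts.keys())
            probs = [counts[k2] * predictive(x[i], counts[k2], sums[k2])
                     for k2 in labels]
            probs.append(alpha * predictive(x[i], 0, 0.0))   # new cluster: prior predictive
            probs = np.array(probs) / np.sum(probs)

            choice = rng.choice(len(probs), p=probs)
            if choice == len(labels):                  # open a new cluster
                k_new = max(counts.keys(), default=-1) + 1
                counts[k_new], sums[k_new] = 0, 0.0
                c[i] = k_new
            else:
                c[i] = labels[choice]
            counts[c[i]] += 1
            sums[c[i]] += x[i]
    return c

# Toy data: two well-separated 1-D clusters; the sampler should settle on about 2 clusters.
rng = np.random.default_rng(42)
data = np.concatenate([rng.normal(-5, 1, 50), rng.normal(5, 1, 50)])
assignments = dp_gibbs_sweeps(data, alpha=1.0)
print(len(set(assignments)), "clusters found")
```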

NLP Applications
Clustering:
- Document clustering for topic, genre, sentiment, …
- Word clustering for POS, WSD, synonymy, …
- Topic clustering across documents (see Blei et al., 2004 and Teh et al., 2004)
- Noun coreference: we don't know how many entities there are
- Other identity-uncertainty problems: deduplication, etc.
- Grammar induction
Sequence modeling: the "infinite HMM":
- Topic segmentation (see Grenager et al., 2005)
- Sequence models for POS tagging
Others?
Useful any time you want to cluster or do unsupervised learning without specifying the number of clusters.

Nested CRP
Used for modeling topic hierarchies by Blei et al., 2004.
[Figure: the nested CRP illustrated as a sequence of restaurant choices on Day 1, Day 2, and Day 3, tracing a path down the tree.]

Nested CRP (cont.)
To generate a document given a tree with L levels:
- Choose a path from the root of the tree to a leaf
- Draw a vector θ of topic mixing proportions from an L-dimensional Dirichlet
- Generate the words in the document from a mixture of the topics along the path, with mixing proportions θ
(A sketch of this generative process appears below.)
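A hedged sketch of this three-step generative process; the tree is grown lazily as documents descend it, and the vocabulary size, hyperparameters, and all names are illustrative only:

```python
# Sketch: generate documents from a nested-CRP topic hierarchy with L levels.
import numpy as np

rng = np.random.default_rng(3)
L, V, gamma, alpha = 3, 20, 1.0, 1.0     # depth, vocab size, nCRP and Dirichlet parameters
child_counts = {(): {}}                  # node (path tuple) -> {child index: visit count}
topics = {}                              # node -> topic distribution over the vocabulary

def get_topic(node):
    if node not in topics:
        topics[node] = rng.dirichlet(0.5 * np.ones(V))   # draw a topic for a new node
    return topics[node]

def ncrp_path():
    """Choose a root-to-leaf path: at each level, a CRP over the current node's children."""
    path, node = [], ()
    for _ in range(L):
        children = child_counts.setdefault(node, {})
        labels = list(children.keys())
        weights = np.array([children[k] for k in labels] + [gamma], dtype=float)
        choice = rng.choice(len(weights), p=weights / weights.sum())
        child = labels[choice] if choice < len(labels) else len(labels)
        children[child] = children.get(child, 0) + 1
        node = node + (child,)
        path.append(node)
    return path

def generate_document(num_words=30):
    path = ncrp_path()
    theta = rng.dirichlet(alpha * np.ones(L))      # mixing proportions over the L levels
    levels = rng.choice(L, size=num_words, p=theta)
    return [rng.choice(V, p=get_topic(path[l])) for l in levels]   # word indices

docs = [generate_document() for _ in range(5)]
print(docs[0])
```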

Nested CRP (cont.)
[Figure: a topic hierarchy estimated from 1717 abstracts from NIPS volumes 1 through 12.]

References
Seminal:
- T.S. Ferguson. A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1:209-230, 1973.
- C.E. Antoniak. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Annals of Statistics, 2:1152-1174, 1974.
Foundational:
- M.D. Escobar and M. West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90:577-588, 1995.
- S.N. MacEachern and P. Muller. Estimating mixture of Dirichlet process models. Journal of Computational and Graphical Statistics, 7:223-238, 1998.
- R.M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9:249-265, 2000.
- C.E. Rasmussen. The infinite Gaussian mixture model. NIPS, 2000.
- H. Ishwaran and L.F. James. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96:161-173, 2001.
NLP:
- D.M. Blei, T.L. Griffiths, M.I. Jordan, and J.B. Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process. NIPS, 2004.
- Y.W. Teh, M.I. Jordan, M.J. Beal, and D.M. Blei. Hierarchical Dirichlet processes. NIPS, 2004.