Chinese Restaurants and Stick-Breaking: An Introduction to the Dirichlet Process
Teg Grenager
NLP Group Lunch, February 24, 2005

Agenda
- Motivation
- Mixture Models
- Dirichlet Process
- Gibbs Sampling
- Applications

Clustering
Goal: learn a partition of the data, such that:
- Data within classes are similar
- Classes are different from each other
Two very different approaches:
- Agglomerative: build up clusters by iteratively sticking similar things together
- Mixture model: learn a generative model over the data, treating the classes as hidden variables

Agglomerative Clustering
- Pros: doesn't need a generative model (number of clusters, parametric distribution)
- Cons: ad hoc, no probabilistic foundation, intractable for large data sets
[Figure: dendrogram relating number of clusters to maximum merge distance]

Mixture Model Clustering
- Examples: K-means, mixture of Gaussians, Naïve Bayes
- Pros: sound probabilistic foundation; efficient even for large data sets
- Cons: requires a generative model, including the number of clusters (mixture components)

Problem

Big Idea
- Want to use a generative model, but don't want to decide the number of clusters in advance
- Suggestion: put each datum in its own cluster
- Problem: under any density function, the probability of two cluster parameters coinciding is zero — no "stickiness"
- Solution: instead of a density function, use a stochastic process under which the probability of two clusters falling together is strictly positive
- Best of both worlds: stickiness with a variable number of clusters

Finite Mixture Model
[Plate diagrams for a Gaussian mixture and Naïve Bayes: mixing weights π generate class c_i, which generates observation x_i (or features x_i1, …, x_iM), for i = 1, …, N]

Dirichlet Priors (Review)
- A distribution over possible parameter vectors of the multinomial distribution
- Thus values must lie in the k-dimensional simplex
- The Beta distribution is the 2-parameter special case
- Expectation: E[θ_i] = α_i / Σ_j α_j
- A conjugate prior to the multinomial
- The explicit formulation is ugly: p(θ | α) = (Γ(Σ_i α_i) / Π_i Γ(α_i)) Π_i θ_i^{α_i − 1}
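To make the review concrete, here is a small sketch (illustrative Python, not from the original slides) that draws Dirichlet samples by normalizing independent Gamma draws and checks that the empirical mean approaches α_i / Σ_j α_j:

```python
import random

def sample_dirichlet(alpha):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma(a_i, 1) draws."""
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

random.seed(0)
alpha = [2.0, 3.0, 5.0]
samples = [sample_dirichlet(alpha) for _ in range(20000)]
mean = [sum(s[i] for s in samples) / len(samples) for i in range(3)]
# Empirical mean should approach alpha_i / sum(alpha) = [0.2, 0.3, 0.5]
print(mean)
```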

Infinite Mixture Model
[Plate diagram: base distribution G_0 and concentration α generate class c_i and observation x_i, for i = 1, …, N]

Chinese Restaurant Process
[Figure: customers seated at tables; each new customer joins an existing table with probability proportional to its occupancy, or starts a new table with probability proportional to α]
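The table-seating process can be simulated directly. The following sketch (illustrative Python, not part of the original talk) seats n customers, each joining an existing table with probability proportional to its current size, or a new table with probability proportional to α:

```python
import random

def crp(n, alpha, seed=0):
    """Simulate a Chinese Restaurant Process for n customers.

    Customer i (0-indexed) joins table t with probability size_t / (i + alpha)
    and a new table with probability alpha / (i + alpha).
    """
    rng = random.Random(seed)
    tables = []        # tables[t] = number of customers at table t
    assignments = []   # assignments[i] = table index of customer i
    for i in range(n):
        r = rng.uniform(0.0, i + alpha)
        cum = 0.0
        for t, size in enumerate(tables):
            cum += size
            if r < cum:
                tables[t] += 1
                assignments.append(t)
                break
        else:
            # r fell in the final alpha-sized interval: open a new table
            tables.append(1)
            assignments.append(len(tables) - 1)
    return assignments, tables

assignments, tables = crp(1000, alpha=1.0)
print("tables:", len(tables), "customers:", sum(tables))
```

Note the "rich get richer" dynamic: large tables attract new customers, which is exactly the stickiness the Big Idea slide asked for.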

DP Mixture Model
[Plate diagrams: G ~ DP(α, G_0); parameters θ_i ~ G generate observations x_i, for i = 1, …, N; shown alongside the equivalent class-indicator formulation with c_i]

Stick-breaking Process
[Figure: a unit-length stick broken repeatedly; β_k ~ Beta(1, α), weight π_k = β_k Π_{j&lt;k} (1 − β_j), giving G = Σ_k π_k δ_{θ_k} with atoms θ_k ~ G_0]

Properties of the DP
- Let (Θ, Σ) be a measurable space, G_0 a probability measure on the space, and α a positive real number
- A Dirichlet process is any distribution of a random probability measure G over (Θ, Σ) such that, for all finite partitions (A_1, …, A_r) of Θ:
  (G(A_1), …, G(A_r)) ~ Dirichlet(α G_0(A_1), …, α G_0(A_r))
- Draws from G are generally not distinct
- The number of distinct values grows as O(log n)
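The O(log n) growth can be checked empirically: under the CRP representation, draw i starts a new cluster with probability α / (α + i), so the expected number of distinct values after n draws is Σ_{i=0}^{n−1} α / (α + i) ≈ α log n. A quick simulation (illustrative sketch):

```python
import random

def crp_table_count(n, alpha, seed):
    """Count distinct clusters after n CRP draws (only the new-table events matter)."""
    rng = random.Random(seed)
    tables = 0
    for i in range(n):
        # new table with probability alpha / (i + alpha)
        if rng.uniform(0.0, i + alpha) >= i:
            tables += 1
    return tables

alpha, n = 1.0, 1000
expected = sum(alpha / (alpha + i) for i in range(n))  # ~ alpha * log(n)
avg = sum(crp_table_count(n, alpha, s) for s in range(200)) / 200
print("expected:", expected, "simulated average:", avg)
```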

Infinite Exchangeability
- An infinite sequence of random variables is said to be infinitely exchangeable if for every finite subset {x_1, …, x_n} and any permutation π we have p(x_1, …, x_n) = p(x_{π(1)}, …, x_{π(n)})
- Note that infinite exchangeability is not the same as being independent and identically distributed (i.i.d.)!
- Using De Finetti's theorem, it is possible to show that our draws are infinitely exchangeable
- Thus the mixture components may be sampled in any order

Mixture Model Inference
- We want to find a clustering of the data: an assignment of values to the hidden class variables
- Sometimes we also want the component parameters
- In most finite mixture models, this can be found with EM
- The Dirichlet process is a non-parametric prior and doesn't permit EM
- We use Gibbs sampling instead

Gibbs Sampling 1
- Algorithm 1: integrate out G and sample the θ_i directly, conditioned on everything else
- This is inefficient, because we update cluster information for one datum at a time
[Plate diagrams: the model before and after integrating out G]

Gibbs Sampling 2
- Algorithm 2: reintroduce a cluster variable c_i, which takes on values c that name the clusters
- Store the parameters shared by all data in class c in a new variable φ_c
[Plate diagrams: θ_i replaced by the cluster indicator c_i and per-cluster parameters φ_c]

Gibbs Sampling 2 (cont.)
- Algorithm 2: for i = 1, …, N, sample c_i from
  P(c_i = c | c_−i, x_i) ∝ n_{−i,c} ∫ F(x_i | θ) dH_{−i,c}(θ)  (existing cluster c)
  P(c_i = new | c_−i, x_i) ∝ α ∫ F(x_i | θ) dG_0(θ)  (new cluster)
  where H_{−i,c} is the posterior distribution of θ_c based on the prior G_0 and all observations x_j for which j ≠ i and c_j = c
- Repeat
- Works well
- Note: variational methods (other than EM) can also be used
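As a concrete illustration of Algorithm 2 (a minimal sketch, not the talk's code): a collapsed Gibbs sampler for a DP mixture of unit-variance Gaussians with conjugate base measure G_0 = N(0, τ²), so the integrals above have closed forms. The variances, α, and the synthetic data are all assumptions for the demo.

```python
import math
import random

SIGMA2 = 1.0   # known component variance (assumption for this sketch)
TAU2 = 9.0     # variance of base measure G0 = N(0, TAU2) (assumption)
ALPHA = 1.0    # DP concentration parameter (assumption)

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def predictive(x, members):
    """Posterior predictive density of x for a cluster with the given members,
    integrating out the cluster mean under the conjugate Normal prior.
    With members == [], this is the prior predictive under G0."""
    n = len(members)
    v = 1.0 / (1.0 / TAU2 + n / SIGMA2)   # posterior variance of the cluster mean
    m = v * sum(members) / SIGMA2          # posterior mean of the cluster mean
    return normal_pdf(x, m, v + SIGMA2)

def gibbs_dpmm(data, iters=100, seed=0):
    rng = random.Random(seed)
    assign = [0] * len(data)               # start with everything in one cluster
    for _ in range(iters):
        for i, x in enumerate(data):
            assign[i] = -1                 # remove datum i from its cluster
            labels = sorted(set(assign) - {-1})
            weights, options = [], []
            for c in labels:
                members = [data[j] for j, a in enumerate(assign) if a == c]
                # existing cluster: n_{-i,c} times its posterior predictive
                weights.append(len(members) * predictive(x, members))
                options.append(c)
            # new cluster: alpha times the prior predictive under G0
            weights.append(ALPHA * predictive(x, []))
            options.append(max(labels, default=-1) + 1)
            r = rng.uniform(0.0, sum(weights))
            cum = 0.0
            for w, c in zip(weights, options):
                cum += w
                if r < cum:
                    assign[i] = c
                    break
            else:
                assign[i] = options[-1]
    return assign

random.seed(1)
data = ([random.gauss(-4.0, 1.0) for _ in range(30)] +
        [random.gauss(4.0, 1.0) for _ in range(30)])
assign = gibbs_dpmm(data)
print("clusters found:", len(set(assign)))
```

Note the sampler never fixes the number of clusters: every sweep can open a new one or empty an old one, which is exactly what the DP prior buys over a finite mixture.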

NLP Applications
- Clustering:
  - Document clustering for topic, genre, sentiment, …
  - Word clustering for POS, WSD, synonymy, …
  - Topic clustering across documents (see Blei et al., 2004 and Teh et al., 2004)
- Noun coreference: don't know how many entities there are
- Other identity-uncertainty problems: deduping, etc.
- Grammar induction
- Sequence modeling: the infinite HMM
- Topic segmentation (see Grenager et al., 2005)
- Sequence models for POS tagging
- Others?

Nested CRP
[Figure: nested restaurant choices across Day 1, Day 2, and Day 3]

Nested CRP (cont.)
To generate a document given a tree with L levels:
- Choose a path from the root of the tree to a leaf
- Draw a vector θ of topic mixing proportions from an L-dimensional Dirichlet
- Generate the words in the document from a mixture of the topics along the path, with mixing proportions θ


References
Seminal:
- T. S. Ferguson. A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1.
- C. E. Antoniak. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Annals of Statistics, 2.
Foundational:
- M. D. Escobar and M. West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90.
- S. N. MacEachern and P. Muller. Estimating mixture of Dirichlet process models. Journal of Computational and Graphical Statistics, 7.
- R. M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9.
- C. E. Rasmussen. The infinite Gaussian mixture model. NIPS.
- H. Ishwaran and L. James. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96.
NLP:
- D. M. Blei, T. L. Griffiths, M. I. Jordan, and J. B. Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process. NIPS.
- Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. NIPS, 2004.
