Dirichlet process tutorial

Dirichlet process tutorial
Bryan Russell

Goals
- Intuitive understanding of Dirichlet processes and applications
- Minimize math and maximize pictures
- Motivate you to go through the math to understand implementation

Disclaimers
- What I’m about to tell you applies more generally
- We’ll gloss over lots of math (especially measure theory); look at the original papers for details

What’s this good for?
- A principled, Bayesian method for fitting a mixture model with an unknown number of clusters
- Because it’s Bayesian, we can build hierarchies (e.g. HDPs) and integrate with other random variables in a principled way

Aren’t there other ways to count the number of clusters?

Gaussian mixture model, revisited

Let us generate data points…

Multinomial weights: prior probabilities of the mixture components

For each data point, choose cluster center h

Generate each point x from the chosen Gaussian component h
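
A minimal sketch of this generative story (the number of components K and the priors on the parameters are illustrative choices, not values from the slides):

```python
# Illustrative sketch of the GMM generative process above:
# draw mixture weights, pick a component h for each point, then sample x from it.
import numpy as np

rng = np.random.default_rng(0)

K = 3                                        # number of Gaussian components (assumed known here)
weights = rng.dirichlet(np.ones(K))          # multinomial weights: prior probabilities of the components
means = rng.normal(0.0, 5.0, size=(K, 2))    # component means (hand-picked spread)
covs = [np.eye(2) for _ in range(K)]         # component covariances (identity for simplicity)

N = 100
h = rng.choice(K, size=N, p=weights)         # for each data point, choose a component h
x = np.array([rng.multivariate_normal(means[k], covs[k]) for k in h])  # generate x from component h
```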

Let us be more Bayesian…
- Put a prior over the mixture parameters
- For Gaussian mixtures, the prior over each component’s mean and covariance is a normal inverse-Wishart density
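
A minimal sketch of drawing one component’s parameters from such a prior, using SciPy’s inverse-Wishart sampler; the hyperparameters are illustrative choices, not values from the slides:

```python
# Illustrative sketch: sample a component's (mean, covariance) from a
# normal inverse-Wishart prior.
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(0)

d = 2
mu0 = np.zeros(d)      # prior mean of the component mean
kappa0 = 1.0           # strength of belief in mu0
nu0 = d + 2            # inverse-Wishart degrees of freedom (> d - 1)
Psi = np.eye(d)        # inverse-Wishart scale matrix

def sample_component_params():
    # Sigma ~ Inv-Wishart(nu0, Psi);  mu | Sigma ~ N(mu0, Sigma / kappa0)
    Sigma = invwishart.rvs(df=nu0, scale=Psi, random_state=rng)
    mu = rng.multivariate_normal(mu0, Sigma / kappa0)
    return mu, Sigma

mu, Sigma = sample_component_params()
```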

Suppose we do not know the number of clusters
- We could sample Gaussian parameters for each data point
- However, the parameters may all be unique, i.e. there is one Gaussian component per data point, which overfits

Dirichlet processes to the rescue
Draws from a Dirichlet process have a nice clustering property. We write G ~ DP(α, G₀), where G₀ is the base measure (here a normal inverse-Wishart density) and α is the concentration parameter. A draw G is a distribution over the parameters and is discrete with probability one.
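
For reference, the stick-breaking (Sethuraman) construction makes the discreteness explicit: a draw G is an infinite sum of point masses.

```latex
% Stick-breaking representation of G ~ DP(\alpha, G_0)
G = \sum_{k=1}^{\infty} \pi_k \, \delta_{\theta_k},
\qquad \theta_k \overset{\text{iid}}{\sim} G_0,
\qquad \pi_k = \beta_k \prod_{j=1}^{k-1} (1 - \beta_j),
\qquad \beta_k \overset{\text{iid}}{\sim} \mathrm{Beta}(1, \alpha)
```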

Visualizing Dirichlet process draws
Think of these as prior weights over the parameters
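
A sketch of what such draws look like, using a truncated stick-breaking construction with a 1-D Gaussian standing in for the base measure (both simplifications are for illustration only):

```python
# Illustrative sketch: draws from a DP via truncated stick-breaking.
# The base measure G0 is a 1-D standard normal here purely for plotting;
# in the mixture-model setting it would be the normal inverse-Wishart.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

def truncated_dp_draw(alpha, K=200):
    betas = rng.beta(1.0, alpha, size=K)                           # stick-breaking proportions
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    weights = betas * remaining                                    # atom weights (sum to ~1 for large K)
    atoms = rng.normal(0.0, 1.0, size=K)                           # atom locations drawn from G0
    return atoms, weights

fig, axes = plt.subplots(1, 2, figsize=(10, 3))
for ax, alpha in zip(axes, (1.0, 10.0)):
    atoms, weights = truncated_dp_draw(alpha)
    ax.stem(atoms, weights)
    ax.set_title(f"DP draw, alpha = {alpha}")
plt.show()
```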

DP mixture model
This model has a bias to “bunch” parameters together. The concentration parameter controls this “bunching” property: lower values yield fewer clusters, higher values yield more.
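
A minimal sketch of this behavior via the Chinese restaurant process view of the DP mixture’s cluster assignments (the specific concentration values are arbitrary):

```python
# Illustrative sketch: the Chinese restaurant process view of cluster assignments
# under a DP mixture. Each new point joins an existing cluster with probability
# proportional to that cluster's size, or starts a new cluster with probability
# proportional to the concentration parameter alpha.
import numpy as np

rng = np.random.default_rng(0)

def crp_cluster_sizes(n_points, alpha):
    sizes = []                              # sizes[k] = number of points in cluster k
    for _ in range(n_points):
        probs = np.array(sizes + [alpha], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(sizes):
            sizes.append(1)                 # start a new cluster
        else:
            sizes[k] += 1
    return sizes

for alpha in (0.5, 5.0, 50.0):
    sizes = crp_cluster_sizes(1000, alpha)
    print(f"alpha = {alpha:5.1f} -> {len(sizes)} clusters")
```

Running this shows the trend stated above: small alpha produces a handful of large clusters, while large alpha spreads the same points over many small clusters.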