Completely Random Measures for Bayesian Nonparametrics Michael I. Jordan University of California, Berkeley Acknowledgments: Emily Fox, Erik Sudderth, Yee Whye Teh, and Romain Thibaux June 14, 2010

Information Theory and Statistics Many classical and neo-classical connections –hypothesis testing, large deviations, empirical processes, consistency of density estimation, compressed sensing, MDL priors, reference priors, divergences and asymptotics Some of these connections are frequentist and some are Bayesian Some of the connections involve nonparametrics and some involve parametric modeling –it appears that most of the nonparametric work is being conducted within a frequentist framework I want to introduce you to Bayesian nonparametrics

Bayesian Nonparametrics At the core of Bayesian inference is Bayes theorem: posterior ∝ likelihood × prior For parametric models, we let θ denote a Euclidean parameter and write: p(θ | x) ∝ p(x | θ) p(θ) For Bayesian nonparametric models, we let G be a general stochastic process (an “infinite-dimensional random variable”) and write (non-rigorously): p(G | x) ∝ p(x | G) p(G) This frees us to work with flexible data structures

Bayesian Nonparametrics (cont) Examples of stochastic processes we'll mention today include distributions on: –directed trees of unbounded depth and unbounded fan-out –partitions –grammars –sparse binary infinite-dimensional matrices –copulae –distributions General mathematical tool: completely random processes

Bayesian Nonparametrics Replace distributions on finite-dimensional objects with distributions on infinite-dimensional objects such as function spaces, partitions and measure spaces –mathematically this simply means that we work with stochastic processes A key construction: random measures These can be used to provide flexibility at the higher levels of Bayesian hierarchies:

Stick-Breaking A general way to obtain distributions on countably infinite spaces The classical example: Define an infinite sequence of beta random variables: β_k ~ Beta(1, α_0), k = 1, 2, … And then define an infinite random sequence as follows: π_1 = β_1, π_k = β_k ∏_{l<k} (1 − β_l), k = 2, 3, … This can be viewed as breaking off portions of a unit-length stick

Constructing Random Measures It's not hard to see that ∑_k π_k = 1 (wp1) Now define the following object: G = ∑_k π_k δ_{φ_k} –where the φ_k are independent draws from a distribution G_0 on some space Because ∑_k π_k = 1, G is a probability measure---it is a random probability measure The distribution of G is known as a Dirichlet process: G ~ DP(α_0, G_0) There are other random measures that can be obtained this way, but there are other approaches as well…
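As a concrete illustration of the two slides above, here is a minimal sketch (Python/NumPy; the function and variable names are my own, not from any package) of a truncated stick-breaking draw of G ~ DP(α_0, G_0):

```python
import numpy as np

def sample_dp_stick_breaking(alpha0, base_sampler, truncation=1000, rng=None):
    """Approximate draw G = sum_k pi_k * delta_{phi_k} from DP(alpha0, G0)
    via truncated stick-breaking.

    base_sampler: callable returning one draw phi_k from the base measure G0.
    truncation:   number of sticks kept; the discarded tail mass is negligible
                  when truncation >> alpha0.
    """
    rng = np.random.default_rng() if rng is None else rng
    betas = rng.beta(1.0, alpha0, size=truncation)              # beta_k ~ Beta(1, alpha0)
    sticks = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    weights = betas * sticks                                    # pi_k = beta_k * prod_{l<k} (1 - beta_l)
    atoms = np.array([base_sampler(rng) for _ in range(truncation)])
    return weights, atoms

# Example: base measure G0 = N(0, 1)
weights, atoms = sample_dp_stick_breaking(
    alpha0=2.0, base_sampler=lambda rng: rng.normal(0.0, 1.0))
print(weights[:5], weights.sum())   # the weights sum to (nearly) 1
```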

Random Measures from Stick-Breaking (figure: a draw G = ∑_k π_k δ_{φ_k} shown as weighted atoms)

Completely Random Measures Completely random measures are measures on a set Ω that assign independent mass to nonintersecting subsets of Ω –e.g., Brownian motion, gamma processes, beta processes, compound Poisson processes and limits thereof (The Dirichlet process is not a completely random process –but it's a normalized gamma process) Completely random measures are discrete wp1 (up to a possible deterministic continuous component) Completely random measures are random measures, not necessarily random probability measures (Kingman, 1968)

Completely Random Measures Consider a non-homogeneous Poisson process on with rate function obtained from some product measure Sample from this Poisson process and connect the samples vertically to their coordinates in (Kingman, 1968) xx x x x xx x x

Beta Processes The product measure ν is called a Lévy measure For the beta process, this measure lives on Ω × [0, 1] and is given as follows: ν(dω, dp) = c p⁻¹ (1 − p)^{c−1} dp B₀(dω) –the weight factor is a degenerate Beta(0, c) distribution and B₀ is the base measure And the resulting random measure can be written simply as: B = ∑_k p_k δ_{ω_k}
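For intuition, here is a hedged sketch (Python/NumPy; names my own) of drawing the atom weights of a beta process in the special case c = 1 with mass parameter α, using the stick-breaking representation μ_(k) = ∏_{i≤k} ν_i with ν_i ~ Beta(α, 1) (the Teh, Görür and Ghahramani construction for the Indian buffet process); atoms are taken as i.i.d. draws from the normalized base measure:

```python
import numpy as np

def sample_beta_process_c1(alpha, base_sampler, truncation=500, rng=None):
    """Approximate draw B = sum_k mu_k * delta_{omega_k} from a beta process
    with concentration c = 1 and mass parameter alpha, via the stick-breaking
    construction mu_(k) = prod_{i<=k} nu_i, nu_i ~ Beta(alpha, 1).
    Atoms omega_k are i.i.d. draws from the (normalized) base measure."""
    rng = np.random.default_rng() if rng is None else rng
    nus = rng.beta(alpha, 1.0, size=truncation)
    weights = np.cumprod(nus)                   # strictly decreasing weights in (0, 1)
    atoms = np.array([base_sampler(rng) for _ in range(truncation)])
    return weights, atoms
```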

Beta Processes

Gamma Processes, Dirichlet Processes and Bernoulli Processes The gamma process uses an improper gamma density in the Lévy measure Once again, the resulting random measure is an infinite weighted sum of atoms To obtain the Dirichlet process, simply normalize the gamma process The Bernoulli process is obtained by treating the beta process as an infinite collection of coins and tossing those coins: X = ∑_k z_k δ_{ω_k}, where z_k ~ Bernoulli(p_k)
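A minimal sketch (Python/NumPy; names my own) of the Bernoulli process given the weights of a beta process draw; each draw is a sparse binary feature vector over the atoms:

```python
import numpy as np

def sample_bernoulli_process(weights, rng=None):
    """Given the atom weights of a beta process draw, toss one coin per atom:
    z_k ~ Bernoulli(mu_k). The (finite, wp1) set of atoms with z_k = 1 is one
    draw from the Bernoulli process, i.e., one binary feature vector."""
    rng = np.random.default_rng() if rng is None else rng
    return (rng.random(len(weights)) < weights).astype(int)

# Usage (hypothetical): rows of Z are feature vectors for different data points
# weights, atoms = sample_beta_process_c1(alpha=2.0, base_sampler=...)
# Z = np.stack([sample_bernoulli_process(weights) for _ in range(10)])
```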

Beta Process and Bernoulli Process

BP and BeP Sample Paths

Beta Processes vs. Dirichlet Processes The Dirichlet process yields a classification of each data point into a single class (a “cluster”) The beta process allows each data point to belong to multiple classes (a “feature vector”) Essentially, the Dirichlet process yields a single coin with an infinite number of sides Essentially, the beta process yields an infinite collection of coins with mostly small probabilities, and the Bernoulli process tosses those coins to yield a binary feature vector

The De Finetti Theorem An infinite sequence of random variables (x_1, x_2, …) is called infinitely exchangeable if the distribution of any finite subsequence is invariant to permutation Theorem: infinite exchangeability holds if and only if p(x_1, …, x_N) = ∫ ∏_{i=1}^N G(x_i) dP(G) for all N, for some random measure G

Random Measures and Their Marginals Taking G to be a completely random or stick-breaking measure, we obtain various interesting combinatorial stochastic processes: –Dirichlet process => Chinese restaurant process (Polya urn) –Beta process => Indian buffet process –Hierarchical Dirichlet process => Chinese restaurant franchise –HDP-HMM => infinite HMM –Nested Dirichlet process => nested Chinese restaurant process

Dirichlet Process Mixture Models

Chinese Restaurant Process (CRP) A random process in which customers sit down in a Chinese restaurant with an infinite number of tables –first customer sits at the first table –the nth subsequent customer sits at a table drawn from the following distribution: P(previously occupied table k | F_{n−1}) ∝ m_k, P(next unoccupied table | F_{n−1}) ∝ α_0 –where m_k is the number of customers currently at table k and where F_{n−1} denotes the state of the restaurant after n − 1 customers have been seated
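A minimal simulation of this seating scheme (Python/NumPy; names my own):

```python
import numpy as np

def sample_crp(n_customers, alpha0, rng=None):
    """Seat n_customers according to the Chinese restaurant process.
    Customer n joins occupied table k with probability m_k / (alpha0 + n - 1)
    and a new table with probability alpha0 / (alpha0 + n - 1)."""
    rng = np.random.default_rng() if rng is None else rng
    assignments = [0]                  # first customer sits at the first table
    counts = [1]                       # m_k: customers currently at table k
    for n in range(2, n_customers + 1):
        probs = np.array(counts + [alpha0]) / (alpha0 + n - 1)
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):       # a new table is opened
            counts.append(1)
        else:
            counts[table] += 1
        assignments.append(table)
    return assignments

print(sample_crp(20, alpha0=1.0))      # e.g., [0, 0, 1, 0, 2, ...]
```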

The CRP and Clustering Data points are customers; tables are mixture components –the CRP defines a prior distribution on the partitioning of the data and on the number of tables This prior can be completed with: –a likelihood---e.g., associate a parameterized probability distribution with each table –a prior for the parameters---the first customer to sit at table k chooses the parameter vector, θ_k, for that table from a prior G_0 So we now have defined a full Bayesian posterior for a mixture model of unbounded cardinality

CRP Prior, Gaussian Likelihood, Conjugate Prior
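The figure slide above shows draws from such a model; here is a hedged generative sketch (Python/NumPy; the N(0, 10²) prior on table means and the unit-variance Gaussian likelihood are my own illustrative choices, not specified on the slide):

```python
import numpy as np

def sample_crp_mixture(n_points, alpha0, rng=None):
    """Generate data from a CRP mixture: tables are drawn by the CRP, the first
    customer at each table draws that table's mean theta_k from a N(0, 10^2)
    prior (an assumed choice), and every customer at the table emits
    x ~ N(theta_k, 1)."""
    rng = np.random.default_rng() if rng is None else rng
    counts, means, data, labels = [], [], [], []
    for n in range(1, n_points + 1):
        probs = np.array(counts + [alpha0]) / (alpha0 + n - 1)
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):                       # new table: draw its parameter
            counts.append(0)
            means.append(rng.normal(0.0, 10.0))
        counts[k] += 1
        labels.append(k)
        data.append(rng.normal(means[k], 1.0))     # likelihood given the table
    return np.array(data), labels
```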

Application: Protein Modeling A protein is a folded chain of amino acids The backbone of the chain has two degrees of freedom per amino acid (phi and psi angles) Empirical plots of phi and psi angles are called Ramachandran diagrams

Application: Protein Modeling We want to model the density in the Ramachandran diagram to provide an energy term for protein folding algorithms We actually have a linked set of Ramachandran diagrams, one for each amino acid neighborhood We thus have a linked set of density estimation problems

Multiple Estimation Problems We often face multiple, related estimation problems E.g., multiple Gaussian means: x_{ij} ~ N(θ_i, σ²), j = 1, …, n_i, i = 1, …, m Maximum likelihood: θ̂_i = x̄_i = (1/n_i) ∑_j x_{ij} Maximum likelihood often doesn't work very well –want to “share statistical strength”

Hierarchical Bayesian Approach The Bayesian or empirical Bayesian solution is to view the parameters θ_i as random variables, related via an underlying variable θ_0 (e.g., θ_i ~ N(θ_0, τ²) in the Gaussian means example) Given this overall model, posterior inference yields shrinkage---the posterior mean for each group combines data from all of the groups
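A minimal sketch of this shrinkage for the Gaussian means example (Python/NumPy; treating σ² and τ² as known and plugging in an empirical-Bayes estimate of θ_0 are simplifying assumptions of mine):

```python
import numpy as np

def shrinkage_posterior_means(group_data, sigma2=1.0, tau2=1.0):
    """Posterior means for multiple Gaussian means under the hierarchical model
    theta_i ~ N(theta0, tau2), x_ij | theta_i ~ N(theta_i, sigma2), with theta0
    replaced by its empirical-Bayes estimate (the grand mean). Each estimate is
    shrunk from the group mean x_bar_i toward theta0."""
    xbars = np.array([np.mean(x) for x in group_data])
    ns = np.array([len(x) for x in group_data])
    theta0 = np.mean(np.concatenate(group_data))          # empirical-Bayes plug-in
    precision_data, precision_prior = ns / sigma2, 1.0 / tau2
    return (precision_data * xbars + precision_prior * theta0) / (precision_data + precision_prior)

# Maximum likelihood would simply return xbars; the hierarchical posterior
# pools information across groups ("shares statistical strength").
```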

Hierarchical Modeling The plate notation: Equivalent to:

Hierarchical Dirichlet Process Mixtures

Chinese Restaurant Franchise (CRF) Global menu: Restaurant 1: Restaurant 2: Restaurant 3:

Protein Folding (cont.) We have a linked set of Ramachandran diagrams, one for each amino acid neighborhood

Protein Folding (cont.)

Speaker Diarization (figure: meeting audio segmented by speaker: John, Jane, Bob, Jill)

Motion Capture Analysis  Goal: Find coherent “behaviors” in the time series that transfer to other time series (e.g., jumping, reaching)

Hidden Markov Models (figure: a sequence of hidden states over time with corresponding observations)

Issues with HMMs How many states should we use? –we don’t know the number of speakers a priori –we don’t know the number of behaviors a priori How can we structure the state space? –how to encode the notion that a particular time series makes use of a particular subset of the states? –how to share states among time series? We’ll develop a Bayesian nonparametric approach to HMMs that solves these problems in a simple and general way

HDP-HMM Dirichlet process: –state space of unbounded cardinality Hierarchical DP: –ties state transition distributions

HDP-HMM Average transition distribution: β | γ ~ GEM(γ) State-specific transition distributions: π_j | α, β ~ DP(α, β) –the sparsity of β is shared across all of the transition distributions
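A truncated sketch of this prior (Python/NumPy; the finite Dirichlet approximation to DP(α, β) and all names are my own simplifications):

```python
import numpy as np

def sample_hdp_hmm_transitions(gamma, alpha, n_states=20, rng=None):
    """Truncated sketch of the HDP-HMM prior on transition matrices.
    beta ~ GEM(gamma) (stick-breaking, truncated to n_states) is the shared
    average transition distribution; each row pi_j ~ DP(alpha, beta) is
    approximated by a finite Dirichlet(alpha * beta)."""
    rng = np.random.default_rng() if rng is None else rng
    sticks = rng.beta(1.0, gamma, size=n_states)
    beta = sticks * np.concatenate(([1.0], np.cumprod(1.0 - sticks[:-1])))
    beta /= beta.sum()                                   # renormalize the truncation
    pi = np.stack([rng.dirichlet(alpha * beta) for _ in range(n_states)])
    return beta, pi                                      # rows of pi share beta's sparsity
```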

State Splitting The HDP-HMM inadequately models temporal persistence of states The DP bias is insufficient to prevent unrealistically rapid dynamics, which reduces predictive performance

“Sticky” HDP-HMM Add a state-specific base measure: π_j | α, κ, β ~ DP(α + κ, (α β + κ δ_j) / (α + κ)) Increased probability of self-transition (figure: original vs. sticky transition distributions)

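A truncated sketch of the sticky rows (Python/NumPy; the finite Dirichlet approximation and names are mine):

```python
import numpy as np

def sample_sticky_transitions(beta, alpha, kappa, rng=None):
    """Truncated sketch of the sticky HDP-HMM rows:
    pi_j ~ DP(alpha + kappa, (alpha * beta + kappa * delta_j) / (alpha + kappa)),
    approximated here by pi_j ~ Dirichlet(alpha * beta + kappa * e_j).
    The extra mass kappa on the j-th coordinate boosts self-transitions."""
    rng = np.random.default_rng() if rng is None else rng
    L = len(beta)
    return np.stack([rng.dirichlet(alpha * beta + kappa * np.eye(L)[j]) for j in range(L)])

# Usage (with the beta from the non-sticky sketch above):
# beta, _ = sample_hdp_hmm_transitions(gamma=3.0, alpha=5.0)
# pi_sticky = sample_sticky_transitions(beta, alpha=5.0, kappa=10.0)
```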

Speaker Diarization (figure: meeting audio segmented by speaker: John, Jane, Bob, Jill)

NIST Evaluations NIST Rich Transcription meeting recognition evaluations: 21 meetings The ICSI system has been the current state of the art (figure: meeting-by-meeting comparison)

Results: 21 meetings

                       Overall DER   Best DER   Worst DER
  Sticky HDP-HMM           17.84%      1.26%      34.29%
  Non-Sticky HDP-HMM       23.91%      6.26%      46.95%
  ICSI                     18.37%      4.39%      32.23%

HDP-PCFG The most successful parsers of natural language are based on lexicalized probabilistic context-free grammars (PCFGs) We want to learn PCFGs from data without hand-tuning of the number or kind of lexical categories

HDP-PCFG

The Beta Process The Dirichlet process naturally yields a multinomial random variable (which table is the customer sitting at?) Problem: in many problem domains we have a very large (combinatorial) number of possible tables –using the Dirichlet process means having a large number of parameters, which may overfit –perhaps instead want to characterize objects as collections of attributes (“sparse features”)? –i.e., binary matrices with more than one 1 in each row

Beta Processes

Beta Process and Bernoulli Process

BP and BeP Sample Paths

Multiple Time Series Goals: –transfer knowledge among related time series in the form of a library of “behaviors” –allow each time series model to make use of an arbitrary subset of the behaviors Method: –represent behaviors as states in a nonparametric HMM –use the beta/Bernoulli process to pick out subsets of states

BP-AR-HMM Bernoulli process determines which states are used Beta process prior: –encourages sharing –allows variability

Motion Capture Results

Document Corpora “Bag-of-words” models of document corpora have a long history in the field of information retrieval The bag-of-words assumption is an exchangeability assumption for the words in a document –it is generally a very rough reflection of reality –but it is useful, for both computational and statistical reasons Examples –text (bags of words) –images (bags of patches) –genotypes (bags of polymorphisms) Motivated by De Finetti, we wish to find useful latent representations for these “documents”

Short History of Document Modeling Finite mixture models Finite admixture models (latent Dirichlet allocation) Bayesian nonparametric admixture models (the hierarchical Dirichlet process) Abstraction hierarchy mixture models (nested Chinese restaurant process) Abstraction hierarchy admixture models (nested beta process)

Finite Mixture Models The mixture components are distributions on individual words in some vocabulary (e.g., for text documents, a multinomial over lexical items) –often referred to as “topics” The generative model of a document: –select a mixture component –repeatedly draw words from this mixture component The mixing proportions are corpus-specific, not document-specific Major drawback: each document can express only a single topic

Finite Admixture Models The mixture components are distributions on individual words in some vocabulary (e.g., for text documents, a multinomial over lexical items) –often referred to as “topics” The generative model of a document: –repeatedly select a mixture component –draw a word from this mixture component The mixing proportions are document-specific Now each document can express multiple topics

Latent Dirichlet Allocation A word is represented as a multinomial random variable w A topic allocation is represented as a multinomial random variable z A document is modeled as a Dirichlet random variable θ The variables α and β are hyperparameters (Blei, Ng, and Jordan, 2003)
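A generative sketch of LDA (Python/NumPy; names are my own, and the topic matrix β is passed in rather than drawn from its own prior):

```python
import numpy as np

def generate_lda_corpus(n_docs, doc_len, topics, alpha, rng=None):
    """Generative sketch of LDA. `topics` is a (K x V) matrix of topic-word
    distributions (the beta parameters); alpha is the Dirichlet hyperparameter
    on the per-document topic proportions theta."""
    rng = np.random.default_rng() if rng is None else rng
    K, V = topics.shape
    corpus = []
    for _ in range(n_docs):
        theta = rng.dirichlet(alpha * np.ones(K))       # document-specific proportions
        z = rng.choice(K, size=doc_len, p=theta)        # topic allocation for each word
        words = [rng.choice(V, p=topics[zi]) for zi in z]
        corpus.append(words)
    return corpus

# Example with K = 3 random topics over a vocabulary of V = 50 words:
rng = np.random.default_rng(0)
topics = rng.dirichlet(np.ones(50), size=3)
docs = generate_lda_corpus(n_docs=5, doc_len=20, topics=topics, alpha=0.5, rng=rng)
```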

Abstraction Hierarchies Words in documents are organized not only by topics but also by level of abstraction Models such as LDA don’t capture this notion; common words often appear repeatedly across many topics Idea: Let documents be represented as paths down a tree of topics, placing topics focused on common words near the top of the tree

Nesting In the hierarchical models, all of the atoms are available to all of the “groups” In nested models, the atoms are subdivided into smaller collections, and the groups select among the collections E.g., the nested CRP of Blei, Griffiths and Jordan -each table in the Chinese restaurant indexes another Chinese restaurant E.g., the nested DP of Rodriguez, Dunson and Gelfand -each atom in a high-level DP is itself a DP

Nested Chinese Restaurant Process (Blei, Griffiths and Jordan, JACM, 2010)

Nested Chinese Restaurant Process

Hierarchical Latent Dirichlet Allocation The generative model for a document is as follows: -use the nested CRP to select a path down the infinite tree -draw θ to obtain a distribution on levels along the path -repeatedly draw a level from θ and draw a word from the topic distribution at that level
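A hedged generative sketch of this process (Python/NumPy, truncated to a fixed depth; the single-parameter stick-breaking prior for the level distribution and all names are my own simplifications, not the paper's code):

```python
import numpy as np

def generate_hlda_doc(tree, gamma, m_pi, eta, depth, doc_len, V, rng):
    """Generative sketch of hierarchical LDA with a nested CRP prior.
    `tree` maps a node (a path tuple) to its per-child customer counts and a
    lazily drawn topic over the vocabulary; gamma is the nested-CRP
    concentration, m_pi the level stick-breaking parameter, eta the topic
    smoothing. All names are illustrative."""
    path = [()]                                          # start at the root node
    for _ in range(depth - 1):                           # nCRP: choose one branch per level
        node = path[-1]
        entry = tree.setdefault(node, {"children": [], "topic": rng.dirichlet(eta * np.ones(V))})
        probs = np.array(entry["children"] + [gamma], dtype=float)
        child = rng.choice(len(probs), p=probs / probs.sum())
        if child == len(entry["children"]):
            entry["children"].append(0)
        entry["children"][child] += 1
        path.append(node + (child,))
    for node in path:                                    # make sure every node on the path has a topic
        tree.setdefault(node, {"children": [], "topic": rng.dirichlet(eta * np.ones(V))})
    sticks = rng.beta(1.0, m_pi, size=depth)             # stick-breaking distribution over levels
    levels_dist = sticks * np.concatenate(([1.0], np.cumprod(1.0 - sticks[:-1])))
    levels_dist /= levels_dist.sum()
    words = []
    for _ in range(doc_len):
        level = rng.choice(depth, p=levels_dist)
        words.append(rng.choice(V, p=tree[path[level]]["topic"]))
    return words

tree, rng = {}, np.random.default_rng(0)
doc = generate_hlda_doc(tree, gamma=1.0, m_pi=1.0, eta=0.1, depth=3, doc_len=15, V=30, rng=rng)
```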

Psychology Today Abstracts

Pros and Cons of Hierarchical LDA The hierarchical LDA model yields more interpretable topics than flat LDA The maximum a posteriori tree can be interesting But the model is out of the spirit of classical topic models because it only allows one “theme” per document We would like a model that can both mix over levels in a tree and mix over paths within a single document

Nested Beta Processes In the nested CRP, at each level of the tree the procedure selects only a single outgoing branch Instead of using the CRP, use the beta process (or the gamma process) together with the Bernoulli process (or the Poisson process) to allow multiple branches to be selected:

Nested Beta Processes

Conclusions Hierarchical modeling and nesting have at least as important a role to play in Bayesian nonparametrics as they do in classical Bayesian parametric modeling In particular, infinite-dimensional parameters can be controlled recursively via Bayesian nonparametric strategies Completely random measures are an important tool in constructing nonparametric priors For papers, software, tutorials and more details: