
1 Completely Random Measures for Bayesian Nonparametrics Michael I. Jordan University of California, Berkeley Acknowledgments: Emily Fox, Erik Sudderth, Yee Whye Teh, and Romain Thibaux June 14, 2010

2 Information Theory and Statistics Many classical and neo-classical connections –hypothesis testing, large deviations, empirical processes, consistency of density estimation, compressed sensing, MDL priors, reference priors, divergences and asymptotics

3 Information Theory and Statistics Many classical and neo-classical connections –hypothesis testing, large deviations, empirical processes, consistency of density estimation, compressed sensing, MDL priors, reference priors, divergences and asymptotics. Some of these connections are frequentist and some are Bayesian. Some of the connections involve nonparametrics and some involve parametric modeling –it appears that most of the nonparametric work is being conducted within a frequentist framework. I want to introduce you to Bayesian nonparametrics

4 Bayesian Nonparametrics At the core of Bayesian inference is Bayes' theorem. For parametric models, we let θ denote a Euclidean parameter and write: p(θ | x) ∝ p(x | θ) p(θ). For Bayesian nonparametric models, we let G be a general stochastic process (an “infinite-dimensional random variable”) and write (non-rigorously): p(G | x) ∝ p(x | G) p(G). This frees us to work with flexible data structures

5 Bayesian Nonparametrics (cont) Examples of stochastic processes we'll mention today include distributions on: –directed trees of unbounded depth and unbounded fan-out –partitions –grammars –sparse binary infinite-dimensional matrices –copulae –distributions General mathematical tool: completely random processes

6 Bayesian Nonparametrics Replace distributions on finite-dimensional objects with distributions on infinite-dimensional objects such as function spaces, partitions and measure spaces –mathematically this simply means that we work with stochastic processes. A key construction: random measures. These can be used to provide flexibility at the higher levels of Bayesian hierarchies

7 Stick-Breaking A general way to obtain distributions on countably infinite spaces The classical example: Define an infinite sequence of beta random variables: β_k ~ Beta(1, α₀), k = 1, 2, … And then define an infinite random sequence as follows: π₁ = β₁, π_k = β_k ∏_{j<k} (1 − β_j) This can be viewed as breaking off portions of a stick: each β_k is the fraction broken off what remains
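A minimal sketch of the stick-breaking construction in Python (numpy), assuming the classical Beta(1, α) version shown on the slide; the truncation level K is my own finite approximation of the infinite sequence:

```python
import numpy as np

def stick_breaking_weights(alpha, K, rng=None):
    """Draw the first K weights pi_k = beta_k * prod_{j<k} (1 - beta_j),
    with beta_k ~ Beta(1, alpha). As K grows, the weights sum to 1 almost surely."""
    rng = np.random.default_rng(rng)
    betas = rng.beta(1.0, alpha, size=K)          # portions broken off the remaining stick
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))  # stick left before step k
    return betas * remaining

weights = stick_breaking_weights(alpha=5.0, K=100)
print(weights[:5], weights.sum())                  # weights decay stochastically; partial sums approach 1
```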

8 Constructing Random Measures It's not hard to see that ∑_k π_k = 1 (wp1) Now define the following object: G = ∑_k π_k δ_{φ_k} –where the atoms φ_k are independent draws from a distribution G₀ on some space Because ∑_k π_k = 1, G is a probability measure---it is a random probability measure The distribution of G is known as a Dirichlet process: G ~ DP(α₀, G₀) There are other random measures that can be obtained this way, but there are other approaches as well…
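Pairing the stick-breaking weights with atoms drawn from a base distribution G₀ gives the random measure itself. A hedged sketch (the Gaussian base measure and the truncation level are illustrative choices, not from the slide):

```python
import numpy as np

def dp_sample(alpha, base_sampler, K=200, rng=None):
    """Truncated stick-breaking draw from a Dirichlet process DP(alpha, G0):
    returns atom locations phi_k ~ G0 and their stick-breaking weights pi_k."""
    rng = np.random.default_rng(rng)
    betas = rng.beta(1.0, alpha, size=K)
    weights = betas * np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    atoms = base_sampler(K, rng)                   # independent draws from G0
    return atoms, weights

# Example: G0 = N(0, 1); then draw data points from the random probability measure G
atoms, weights = dp_sample(alpha=2.0, base_sampler=lambda k, r: r.normal(0.0, 1.0, size=k))
rng = np.random.default_rng(0)
data = rng.choice(atoms, size=10, p=weights / weights.sum())  # renormalize the truncated weights
print(data)                                        # repeated values reflect the discreteness of G
```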

9 Random Measures from Stick-Breaking (figure: atoms drawn from the base distribution, each carrying a stick-breaking weight)

10 Completely Random Measures Completely random measures are measures on a set Ω that assign independent mass to nonintersecting subsets of Ω –e.g., Brownian motion, gamma processes, beta processes, compound Poisson processes and limits thereof (The Dirichlet process is not a completely random process –but it's a normalized gamma process) Completely random measures are discrete wp1 (up to a possible deterministic continuous component) Completely random measures are random measures, not necessarily random probability measures (Kingman, 1968)

11 Completely Random Measures Consider a non-homogeneous Poisson process on Ω × ℝ⁺ with rate function obtained from some product measure ν Sample from this Poisson process and connect the samples vertically to their coordinates in Ω (Kingman, 1968) (figure: sampled atoms (ω_k, p_k) shown as vertical sticks over Ω)

12 Beta Processes The product measure ν is called a Lévy measure For the beta process, this measure lives on Ω × [0,1] and is given as follows: ν(dω, dp) = c p⁻¹ (1 − p)^{c−1} dp B₀(dω) –where c p⁻¹ (1 − p)^{c−1} dp is a degenerate (improper) Beta(0, c) distribution and B₀ is the base measure And the resulting random measure can be written simply as: B = ∑_k p_k δ_{ω_k}
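One way to simulate this (rather than sampling the Poisson process under the improper Beta(0, c) density directly) is the standard finite-K approximation for concentration c = 1, where p_k ~ Beta(γ/K, 1); as K → ∞ this converges to a beta process draw with mass parameter γ. A hedged sketch with an illustrative uniform base measure:

```python
import numpy as np

def approx_beta_process(gamma, K=1000, rng=None):
    """Finite-K approximation to a beta process draw B = sum_k p_k * delta_{omega_k}
    with concentration c = 1 and mass parameter gamma. Most of the K weights are tiny."""
    rng = np.random.default_rng(rng)
    weights = rng.beta(gamma / K, 1.0, size=K)     # p_k in (0, 1), mostly near 0
    atoms = rng.uniform(0.0, 1.0, size=K)          # omega_k ~ B0 (uniform base measure here)
    return atoms, weights

atoms, weights = approx_beta_process(gamma=5.0)
print("expected total mass is roughly gamma:", weights.sum())
```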

13 Beta Processes

14 Gamma Processes, Dirichlet Processes and Bernoulli Processes The gamma process uses an improper gamma density in the Lévy measure Once again, the resulting random measure is an infinite weighted sum of atoms To obtain the Dirichlet process, simply normalize the gamma process The Bernoulli process is obtained by treating the beta process as an infinite collection of coins and tossing those coins: X = ∑_k z_k δ_{ω_k}, where z_k ~ Bernoulli(p_k)
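Tossing the beta-process coins yields a Bernoulli process draw: a sparse binary feature vector over the atoms. A small sketch reusing the approximate beta-process weights from above (the function names are mine, not the slides'):

```python
import numpy as np

def bernoulli_process(weights, n_draws, rng=None):
    """Given beta-process weights p_k, draw n binary vectors with z_k ~ Bernoulli(p_k).
    Each row is one draw's sparse set of active features (atoms with z_k = 1)."""
    rng = np.random.default_rng(rng)
    return (rng.random((n_draws, weights.shape[0])) < weights).astype(int)

rng = np.random.default_rng(1)
weights = rng.beta(5.0 / 1000, 1.0, size=1000)     # approximate beta-process weights (c = 1)
Z = bernoulli_process(weights, n_draws=5)
print("active features per draw:", Z.sum(axis=1))  # a handful of 1s per row
```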

15 Beta Process and Bernoulli Process

16 BP and BeP Sample Paths

17 Beta Processes vs. Dirichlet Processes The Dirichlet process yields a classification of each data point into a single class (a “cluster”). The beta process allows each data point to belong to multiple classes (a “feature vector”). Essentially, the Dirichlet process yields a single coin with an infinite number of sides. Essentially, the beta process yields an infinite collection of coins with mostly small probabilities, and the Bernoulli process tosses those coins to yield a binary feature vector.

18 The De Finetti Theorem An infinite sequence of random variables (X₁, X₂, …) is called infinitely exchangeable if the distribution of any finite subsequence is invariant to permutation Theorem: infinite exchangeability holds if and only if p(x₁, …, x_N) = ∫ ∏_{i=1}^{N} G(x_i) dP(G) for all N, for some random measure G

19 Random Measures and Their Marginals Taking G to be a completely random or stick-breaking measure, we obtain various interesting combinatorial stochastic processes: –Dirichlet process => Chinese restaurant process (Polya urn) –Beta process => Indian buffet process –Hierarchical Dirichlet process => Chinese restaurant franchise –HDP-HMM => infinite HMM –Nested Dirichlet process => nested Chinese restaurant process

20 Dirichlet Process Mixture Models

21 Chinese Restaurant Process (CRP) A random process in which customers sit down in a Chinese restaurant with an infinite number of tables –first customer sits at the first table –the nth subsequent customer sits at a table drawn from the following distribution: P(sit at occupied table k) ∝ m_k, P(sit at a new table) ∝ α₀ –where m_k is the number of customers currently at table k, conditioning on the state of the restaurant after the previous customers have been seated
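A minimal CRP sampler, assuming the concentration parameter α₀ named on the slide; table labels are arbitrary integers:

```python
import numpy as np

def crp(n_customers, alpha, rng=None):
    """Seat customers one at a time: table k with probability proportional to its
    count m_k, a new table with probability proportional to alpha."""
    rng = np.random.default_rng(rng)
    assignments, counts = [], []
    for _ in range(n_customers):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):                   # a new table is opened
            counts.append(1)
        else:
            counts[table] += 1
        assignments.append(table)
    return assignments

print(crp(20, alpha=1.0))                          # e.g. [0, 0, 1, 0, 2, ...]
```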

22 The CRP and Clustering Data points are customers; tables are mixture components –the CRP defines a prior distribution on the partitioning of the data and on the number of tables This prior can be completed with: –a likelihood---e.g., associate a parameterized probability distribution with each table –a prior for the parameters---the first customer to sit at table k chooses the parameter vector θ_k for that table from a prior G₀ So we now have defined a full Bayesian posterior for a mixture model of unbounded cardinality

23 CRP Prior, Gaussian Likelihood, Conjugate Prior

24 Application: Protein Modeling A protein is a folded chain of amino acids The backbone of the chain has two degrees of freedom per amino acid (phi and psi angles) Empirical plots of phi and psi angles are called Ramachandran diagrams

25 Application: Protein Modeling We want to model the density in the Ramachandran diagram to provide an energy term for protein folding algorithms We actually have a linked set of Ramachandran diagrams, one for each amino acid neighborhood We thus have a linked set of density estimation problems

26 Multiple Estimation Problems We often face multiple, related estimation problems E.g., multiple Gaussian means: Y_{ij} ~ N(θ_i, σ²), j = 1, …, n_i, for groups i = 1, …, m Maximum likelihood: θ̂_i = ȳ_i (the per-group sample mean) Maximum likelihood often doesn't work very well –want to “share statistical strength”

27 Hierarchical Bayesian Approach The Bayesian or empirical Bayesian solution is to view the parameters θ_i as random variables, related via an underlying variable θ₀ Given this overall model, posterior inference yields shrinkage---the posterior mean for each θ_i combines data from all of the groups
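A hedged numerical illustration of the shrinkage effect, assuming a conjugate normal-normal hierarchy with known within-group variance σ² and across-group variance τ²; the variance values and the data below are made up for illustration:

```python
import numpy as np

def shrunken_means(group_means, n_per_group, sigma2, tau2, mu0):
    """Posterior mean of each theta_i under theta_i ~ N(mu0, tau2), y_ij ~ N(theta_i, sigma2):
    a precision-weighted compromise between the group mean and the shared mean mu0."""
    group_means = np.asarray(group_means, dtype=float)
    w = (n_per_group / sigma2) / (n_per_group / sigma2 + 1.0 / tau2)
    return w * group_means + (1.0 - w) * mu0

ml = np.array([1.8, -0.4, 0.9, 2.5])               # maximum-likelihood (per-group) means
post = shrunken_means(ml, n_per_group=5, sigma2=1.0, tau2=0.5, mu0=ml.mean())
print(post)                                         # each estimate is pulled toward the overall mean
```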

28 Hierarchical Modeling The plate notation: Equivalent to:

29 Hierarchical Dirichlet Process Mixtures

30 Chinese Restaurant Franchise (CRF) Global menu: Restaurant 1: Restaurant 2: Restaurant 3:

31 Protein Folding (cont.) We have a linked set of Ramachandran diagrams, one for each amino acid neighborhood

32 Protein Folding (cont.)

33 Speaker Diarization (figure: an audio stream segmented by speaker: John, Jane, Bob, Jill)

34 Motion Capture Analysis Goal: Find coherent “behaviors” in the time series that transfer to other time series (e.g., jumping, reaching)

35 Hidden Markov Models (figure: a hidden state sequence evolving over time, with an observation emitted at each time step)

36 Hidden Markov Models (figure continued: the hidden variables viewed as modes)

37 Hidden Markov Models (figure continued)

38 Hidden Markov Models (figure continued)

39 Issues with HMMs How many states should we use? –we don’t know the number of speakers a priori –we don’t know the number of behaviors a priori How can we structure the state space? –how to encode the notion that a particular time series makes use of a particular subset of the states? –how to share states among time series? We’ll develop a Bayesian nonparametric approach to HMMs that solves these problems in a simple and general way

40 HDP-HMM Dirichlet process: –state space of unbounded cardinality Hierarchical DP: –ties state transition distributions (figure: HDP-HMM graphical model over time and states)

41 HDP-HMM Average transition distribution: β ~ GEM(γ) (stick-breaking) State-specific transition distributions: π_j ~ DP(α, β) –the sparsity of β is shared across all of the state-specific transition distributions
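A sketch of the weak-limit (truncated) version of this construction: β from a truncated stick-breaking prior, then each row π_j ~ Dirichlet(αβ). The truncation level L is an approximation I am introducing, not something on the slide:

```python
import numpy as np

def hdp_hmm_transitions(gamma, alpha, L=20, rng=None):
    """Weak-limit approximation to the HDP-HMM prior: beta ~ stick-breaking(gamma)
    truncated at L states, then each row pi_j ~ Dirichlet(alpha * beta)."""
    rng = np.random.default_rng(rng)
    sticks = rng.beta(1.0, gamma, size=L)
    beta = sticks * np.concatenate(([1.0], np.cumprod(1.0 - sticks)[:-1]))
    beta /= beta.sum()                              # renormalize the truncated weights
    pi = rng.dirichlet(alpha * beta, size=L)        # one transition distribution per state
    return beta, pi

beta, pi = hdp_hmm_transitions(gamma=3.0, alpha=5.0)
print(pi.shape)                                     # (L, L); all rows share beta's sparsity pattern
```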

42 State Splitting HDP-HMM inadequately models temporal persistence of states DP bias insufficient to prevent unrealistically rapid dynamics Reduces predictive performance

43 “Sticky” HDP-HMM Add a state-specific term κδ_j to the base measure: π_j ~ DP(α + κ, (αβ + κδ_j)/(α + κ)) Increased probability of self-transition (figure: original vs. sticky transition distributions)
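In the weak-limit approximation, the sticky modification amounts to adding κ to the self-transition component of each row's Dirichlet parameter. A small hedged sketch (the truncation level and parameter values are illustrative):

```python
import numpy as np

def sticky_rows(beta, alpha, kappa, rng=None):
    """Sticky HDP-HMM rows: pi_j ~ Dirichlet(alpha * beta + kappa * e_j),
    which boosts the expected self-transition probability of state j."""
    rng = np.random.default_rng(rng)
    L = beta.shape[0]
    return np.stack([rng.dirichlet(alpha * beta + kappa * np.eye(L)[j]) for j in range(L)])

beta = np.ones(10) / 10                             # illustrative shared average transition weights
pi = sticky_rows(beta, alpha=5.0, kappa=20.0)
print(np.diag(pi).mean())                           # self-transitions dominate when kappa is large
```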

44 “Sticky” HDP-HMM (build of the previous slide) State-specific base measure (αβ + κδ_j)/(α + κ) increases the probability of self-transition (figure: original vs. sticky transition distributions)

45 Speaker Diarization (figure: an audio stream segmented by speaker: John, Jane, Bob, Jill)

46 NIST Evaluations NIST Rich Transcription 2004–2007 meeting recognition evaluations, 21 meetings. ICSI results have been the current state-of-the-art. (figure: meeting-by-meeting comparison)

47 Results: 21 meetings
                        Overall DER   Best DER   Worst DER
Sticky HDP-HMM          17.84%        1.26%      34.29%
Non-Sticky HDP-HMM      23.91%        6.26%      46.95%
ICSI                    18.37%        4.39%      32.23%

48 HDP-PCFG The most successful parsers of natural language are based on lexicalized probabilistic context-free grammars (PCFGs) We want to learn PCFGs from data without hand-tuning of the number or kind of lexical categories

49 HDP-PCFG

50 The Beta Process The Dirichlet process naturally yields a multinomial random variable (which table is the customer sitting at?) Problem: in many problem domains we have a very large (combinatorial) number of possible tables –using the Dirichlet process means having a large number of parameters, which may overfit –perhaps instead want to characterize objects as collections of attributes (“sparse features”)? –i.e., binary matrices with more than one 1 in each row

51 Beta Processes

52 Beta Process and Bernoulli Process

53 BP and BeP Sample Paths

54 Multiple Time Series Goals: –transfer knowledge among related time series in the form of a library of “behaviors” –allow each time series model to make use of an arbitrary subset of the behaviors Method: –represent behaviors as states in a nonparametric HMM –use the beta/Bernoulli process to pick out subsets of states

55 BP-AR-HMM Bernoulli process determines which states are used Beta process prior: –encourages sharing –allows variability

56 Motion Capture Results

57 Document Corpora “Bag-of-words” models of document corpora have a long history in the field of information retrieval The bag-of-words assumption is an exchangeability assumption for the words in a document –it is generally a very rough reflection of reality –but it is useful, for both computational and statistical reasons Examples –text (bags of words) –images (bags of patches) –genotypes (bags of polymorphisms) Motivated by De Finetti, we wish to find useful latent representations for these “documents”

58 Short History of Document Modeling Finite mixture models Finite admixture models (latent Dirichlet allocation) Bayesian nonparametric admixture models (the hierarchical Dirichlet process) Abstraction hierarchy mixture models (nested Chinese restaurant process) Abstraction hierarchy admixture models (nested beta process)

59 Finite Mixture Models The mixture components are distributions on individual words in some vocabulary (e.g., for text documents, a multinomial over lexical items) –often referred to as “topics” The generative model of a document: –select a mixture component –repeatedly draw words from this mixture component The mixing proportions are shared across the corpus, not document-specific Major drawback: each document can express only a single topic

60 Finite Admixture Models The mixture components are distributions on individual words in some vocabulary (e.g., for text documents, a multinomial over lexical items) –often referred to as “topics” The generative model of a document: –repeatedly select a mixture component –draw a word from this mixture component The mixing proportions are document-specific Now each document can express multiple topics

61 Latent Dirichlet Allocation A word is represented as a multinomial random variable w A topic allocation is represented as a multinomial random variable z A document is modeled as a Dirichlet random variable θ The variables α and β are hyperparameters (Blei, Ng, and Jordan, 2003)
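A hedged sketch of the LDA generative process for a single document, with made-up hyperparameters and a toy vocabulary; θ, z, and w follow the slide, and topic_word plays the role of the topic-word distributions parameterized by β:

```python
import numpy as np

def generate_document(alpha, topic_word, n_words, rng=None):
    """LDA generative model for one document:
    theta ~ Dirichlet(alpha); for each word, z ~ Multinomial(theta), w ~ Multinomial(topic_word[z])."""
    rng = np.random.default_rng(rng)
    theta = rng.dirichlet(alpha)                    # document-specific topic proportions
    z = rng.choice(len(alpha), size=n_words, p=theta)
    words = [rng.choice(topic_word.shape[1], p=topic_word[k]) for k in z]
    return theta, z, words

K, V = 3, 8                                         # toy sizes: 3 topics, 8 vocabulary items
rng = np.random.default_rng(0)
topic_word = rng.dirichlet(np.ones(V) * 0.5, size=K)
theta, z, words = generate_document(alpha=np.ones(K) * 0.1, topic_word=topic_word, n_words=12)
print(theta, words)                                 # a document typically mixes a few topics
```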

62 Abstraction Hierarchies Words in documents are organized not only by topics but also by level of abstraction Models such as LDA don’t capture this notion; common words often appear repeatedly across many topics Idea: Let documents be represented as paths down a tree of topics, placing topics focused on common words near the top of the tree

63 Nesting In the hierarchical models, all of the atoms are available to all of the “groups” In nested models, the atoms are subdivided into smaller collections, and the groups select among the collections E.g., the nested CRP of Blei, Griffiths and Jordan -each table in the Chinese restaurant indexes another Chinese restaurant E.g., the nested DP of Rodriguez, Dunson and Gelfand -each atom in a high-level DP is itself a DP

64 Nested Chinese Restaurant Process (Blei, Griffiths and Jordan, JACM, 2010)

65 Nested Chinese Restaurant Process

66

67

68

69

70 Hierarchical Latent Dirichlet Allocation The generative model for a document is as follows: -use the nested CRP to select a path down the infinite tree -draw a distribution θ over the levels along the path (e.g., from a stick-breaking prior) -repeatedly draw a level from θ and draw a word from the topic distribution at that level
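A hedged sketch of the nested-CRP path selection step: each node of the tree runs its own CRP over its children, and a document follows one branch per level. The fixed depth, the parameter γ, and the dictionary-of-counts representation are my own simplifications:

```python
import numpy as np

def ncrp_path(tree, depth, gamma, rng=None):
    """Walk `depth` levels down the tree; at each node choose an existing child with
    probability proportional to its count, or a new child with probability proportional
    to gamma. `tree` maps a path (tuple of child indices) to its children's counts."""
    rng = np.random.default_rng(rng)
    path = ()
    for _ in range(depth):
        counts = tree.setdefault(path, [])
        probs = np.array(counts + [gamma], dtype=float)
        probs /= probs.sum()
        child = rng.choice(len(probs), p=probs)
        if child == len(counts):
            counts.append(0)                        # open a new branch
        counts[child] += 1
        path = path + (child,)
    return path

tree = {}
paths = [ncrp_path(tree, depth=3, gamma=1.0, rng=np.random.default_rng(i)) for i in range(5)]
print(paths)                                        # documents sharing prefixes share high-level topics
```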

71 Psychology Today Abstracts

72 Pros and Cons of Hierarchical LDA The hierarchical LDA model yields more interpretable topics than flat LDA The maximum a posteriori tree can be interesting But the model is out of the spirit of classical topic models because it only allows one “theme” per document We would like a model that can both mix over levels in a tree and mix over paths within a single document

73 Nested Beta Processes In the nested CRP, at each level of the tree the procedure selects only a single outgoing branch Instead of using the CRP, use the beta process (or the gamma process) together with the Bernoulli process (or the Poisson process) to allow multiple branches to be selected:

74 Nested Beta Processes

75

76

77 Conclusions Hierarchical modeling and nesting have at least as important a role to play in Bayesian nonparametrics as they play in classical Bayesian parametric modeling In particular, infinite-dimensional parameters can be controlled recursively via Bayesian nonparametric strategies Completely random measures are an important tool in constructing nonparametric priors For papers, software, tutorials and more details: www.cs.berkeley.edu/~jordan/publications.html

