Expectation Propagation in Practice

Slides:

Advertisements

Similar presentations

Pattern Recognition and Machine Learning

Advertisements

Part 2: Unsupervised Learning

Bayesian Belief Propagation

Lecture 18: Temporal-Difference Learning

Linear Time Methods for Propagating Beliefs Min Convolution, Distance Transforms and Box Sums Daniel Huttenlocher Computer Science Department December,

Dynamic Bayesian Networks (DBNs)

Introduction to Belief Propagation and its Generalizations. Max Welling Donald Bren School of Information and Computer and Science University of California.

Belief Propagation by Jakob Metzler. Outline Motivation Pearl’s BP Algorithm Turbo Codes Generalized Belief Propagation Free Energies.

Overview of Inference Algorithms for Bayesian Networks Wei Sun, PhD Assistant Research Professor SEOR Dept. & C4I Center George Mason University, 2009.

1 Graphical Models in Data Assimilation Problems Alexander Ihler UC Irvine Collaborators: Sergey Kirshner Andrew Robertson Padhraic Smyth.

Extending Expectation Propagation for Graphical Models Yuan (Alan) Qi Joint work with Tom Minka.

Predictive Automatic Relevance Determination by Expectation Propagation Yuan (Alan) Qi Thomas P. Minka Rosalind W. Picard Zoubin Ghahramani.

Nonlinear and Non-Gaussian Estimation with A Focus on Particle Filters Prasanth Jeevan Mary Knox May 12, 2006.

Computer vision: models, learning and inference Chapter 10 Graphical Models.

Bayesian Filtering for Location Estimation D. Fox, J. Hightower, L. Liao, D. Schulz, and G. Borriello Presented by: Honggang Zhang.

Bayesian Learning for Conditional Models Alan Qi MIT CSAIL September, 2005 Joint work with T. Minka, Z. Ghahramani, M. Szummer, and R. W. Picard.

24 November, 2011National Tsin Hua University, Taiwan1 Mathematical Structures of Belief Propagation Algorithms in Probabilistic Information Processing.

Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution.

Computational Stochastic Optimization: Bridging communities October 25, 2012 Warren Powell CASTLE Laboratory Princeton University

Topic Models in Text Processing IR Group Meeting Presented by Qiaozhu Mei.

Computer vision: models, learning and inference Chapter 19 Temporal models.

From Bayesian Filtering to Particle Filters Dieter Fox University of Washington Joint work with W. Burgard, F. Dellaert, C. Kwok, S. Thrun.

Computer vision: models, learning and inference Chapter 19 Temporal models.

Lecture note for Stat 231: Pattern Recognition and Machine Learning 4. Maximum Likelihood Prof. A.L. Yuille Stat 231. Fall 2004.

Kalman Filter (Thu) Joon Shik Kim Computational Models of Intelligence.

Extending Expectation Propagation on Graphical Models Yuan (Alan) Qi

Sparse Gaussian Process Classification With Multiple Classes Matthias W. Seeger Michael I. Jordan University of California, Berkeley

1 Structured Region Graphs: Morphing EP into GBP Max Welling Tom Minka Yee Whye Teh.

Probabilistic Robotics Bayes Filter Implementations.

Learning Lateral Connections between Hidden Units Geoffrey Hinton University of Toronto in collaboration with Kejie Bao University of Toronto.

High-resolution computational models of genome binding events Yuan (Alan) Qi Joint work with Gifford and Young labs Dana-Farber Cancer Institute Jan 2007.

Overview Particle filtering is a sequential Monte Carlo methodology in which the relevant probability distributions are iteratively estimated using the.

Direct Message Passing for Hybrid Bayesian Networks Wei Sun, PhD Assistant Research Professor SFL, C4I Center, SEOR Dept. George Mason University, 2009.

Mean Field Variational Bayesian Data Assimilation EGU 2012, Vienna Michail Vrettas 1, Dan Cornford 1, Manfred Opper 2 1 NCRG, Computer Science, Aston University,

Virtual Vector Machine for Bayesian Online Classification Yuan (Alan) Qi CS & Statistics Purdue June, 2009 Joint work with T.P. Minka and R. Xiang.

Continuous Variables Write message update equation as an expectation: Proposal distribution W t (x t ) for each node Samples define a random discretization.

Motor Control. Beyond babbling Three problems with motor babbling: –Random exploration is slow –Error-based learning algorithms are faster but error signals.

Mobile Robot Localization (ch. 7)

CS Statistical Machine learning Lecture 24

BCS547 Neural Decoding.

Extending Expectation Propagation on Graphical Models Yuan (Alan) Qi MIT Media Lab.

Topic Models Presented by Iulian Pruteanu Friday, July 28 th, 2006.

Guest lecture: Feature Selection Alan Qi Dec 2, 2004.

Lecture 2: Statistical learning primer for biologists

Latent Dirichlet Allocation

Wei Sun and KC Chang George Mason University March 2008 Convergence Study of Message Passing In Arbitrary Continuous Bayesian.

Short Introduction to Particle Filtering by Arthur Pece [ follows my Introduction to Kalman filtering ]

1 Param. Learning (MLE) Structure Learning The Good Graphical Models – Carlos Guestrin Carnegie Mellon University October 1 st, 2008 Readings: K&F:

Nonlinear State Estimation

6. Population Codes Presented by Rhee, Je-Keun © 2008, SNU Biointelligence Lab,

Lecture 3: MLE, Bayes Learning, and Maximum Entropy

1 Chapter 8: Model Inference and Averaging Presented by Hui Fang.

The Unscented Particle Filter 2000/09/29 이 시은. Introduction Filtering –estimate the states(parameters or hidden variable) as a set of observations becomes.

CS Statistical Machine learning Lecture 25 Yuan (Alan) Qi Purdue CS Nov

Bayesian Brain Probabilistic Approaches to Neural Coding 1.1 A Probability Primer Bayesian Brain Probabilistic Approaches to Neural Coding 1.1 A Probability.

A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation Yee W. Teh, David Newman and Max Welling Published on NIPS 2006 Discussion.

Expectation Propagation for Graphical Models Yuan (Alan) Qi Joint work with Tom Minka.

Learning Deep Generative Models by Ruslan Salakhutdinov

Deep Feedforward Networks

Extending Expectation Propagation for Graphical Models

Alan Qi Thomas P. Minka Rosalind W. Picard Zoubin Ghahramani

Probabilistic Robotics

PSG College of Technology

Bayesian Models in Machine Learning

Filtering and State Estimation: Basic Concepts

≠ Particle-based Variational Inference for Continuous Systems

Expectation-Maximization & Belief Propagation

Latent Dirichlet Allocation

Extending Expectation Propagation for Graphical Models

Topic Models in Text Processing

Presentation transcript:

Expectation Propagation in Practice Tom Minka CMU Statistics Joint work with Yuan Qi and John Lafferty

Outline EP algorithm Examples: Tracking a dynamic system Signal detection in fading channels Document modeling Boltzmann machines

Extensions to EP Alternatives to moment-matching Factors raised to powers Skipping factors

EP in a nutshell Approximate a function by a simpler one: Where each lives in a parametric, exponential family (e.g. Gaussian) Factors can be conditional distributions in a Bayesian network

EP algorithm Iterate the fixed-point equations: specifies where the approximation needs to be good Coordinated local approximations where

(Loopy) Belief propagation Specialize to factorized approximations: Minimize KL-divergence = match marginals of (partially factorized) and (fully factorized) “send messages” “messages”

EP versus BP EP approximation can be in a restricted family, e.g. Gaussian EP approximation does not have to be factorized EP applies to many more problems e.g. mixture of discrete/continuous variables

EP versus Monte Carlo Monte Carlo is general but expensive A sledgehammer EP exploits underlying simplicity of the problem (if it exists) Monte Carlo is still needed for complex problems (e.g. large isolated peaks) Trick is to know what problem you have

Example: Tracking Guess the position of an object given noisy measurements Object

Bayesian network e.g. (random walk) want distribution of x’s given y’s

Terminology Filtering: posterior for last state only Smoothing: posterior for middle states On-line: old data is discarded (fixed memory) Off-line: old data is re-used (unbounded memory)

Kalman filtering / Belief propagation Prediction: Measurement: Smoothing:

Approximation Factorized and Gaussian in x

Approximation = (forward msg)(observation)(backward msg) EP equations are exactly the prediction, measurement, and smoothing equations for the Kalman filter - but only preserve first and second moments Consider case of linear dynamics…

EP in dynamic systems Loop t = 1, …, T (filtering) Prediction step Approximate measurement step Loop t = T, …, 1 (smoothing) Smoothing step Divide out the approximate measurement Re-approximate the measurement Loop t = 1, …, T (re-filtering) Prediction and measurement using previous approx

Generalization Instead of matching moments, can use any method for approximate filtering E.g. Extended Kalman filter, statistical linearization, unscented filter, etc. All can be interpreted as finding linear/Gaussian approx to original terms

Interpreting EP After more information is available, re-approximate individual terms for better results Optimal filtering is no longer on-line

Example: Poisson tracking is an integer valued Poisson variate with mean

Poisson tracking model

Approximate measurement step is not Gaussian Moments of x not analytic Two approaches: Gauss-Hermite quadrature for moments Statistical linearization instead of moment-matching Both work well

Posterior for the last state

EP for signal detection Wireless communication problem Transmitted signal = vary to encode each symbol In complex numbers: Im Re

Binary symbols, Gaussian noise Symbols are 1 and –1 (in complex plane) Received signal = Recovered Optimal detection is easy

Fading channel Channel systematically changes amplitude and phase: changes over time

Differential detection Use last measurement to estimate state Binary symbols only No smoothing of state = noisy

Bayesian network Symbols can also be correlated (e.g. error-correcting code) Dynamics are learned from training data (all 1’s)

On-line implementation Iterate over the last measurements Previous measurements act as prior Results comparable to particle filtering, but much faster

Document modeling Want to classify documents by semantic content Word order generally found to be irrelevant Word choice is what matters Model each document as a bag of words Reduces to modeling correlations between word probabilities

Generative aspect model (Hofmann 1999; Blei, Ng, & Jordan 2001) Each document mixes aspects in different proportions Aspect 1 Aspect 2

Generative aspect model Multinomial sampling Document

Two tasks Inference: Given aspects and document i, what is (posterior for) ? Learning: Given some documents, what are (maximum likelihood) aspects?

Approximation Likelihood is composed of terms of form Want Dirichlet approximation:

EP with powers These terms seem too complicated for EP Can match moments if , but not for large Solution: match moments of one occurrence at a time Redefine what are the “terms”

EP with powers Moment match: Context function: all but one occurrence Fixed point equations for

EP with skipping Context fcn might not be a proper density Solution: “skip” this term (keep old approximation) In later iterations, context becomes proper

Another problem Minimizing KL-divergence of Dirichlet is expensive Requires iteration Match (mean,variance) instead Closed-form

One term VB = Variational Bayes (Blei et al)

Ten word document

General behavior For long documents, VB recovers correct mean, but not correct variance of Disastrous for learning No Occam factor Gets worse with more documents No asymptotic salvation EP gets correct variance, learns properly

Learning in probability simplex 100 docs, Length 10

Learning in probability simplex 10 docs, Length 10

Learning in probability simplex 10 docs, Length 10

Learning in probability simplex 10 docs, Length 10

Boltzmann machines Joint distribution is product of pair potentials: Want to approximate by a simpler distribution

Approximations BP EP

Approximating an edge by a tree Each potential in p is projected onto the tree-structure of q Correlations are not lost, but projected onto the tree

Fixed-point equations Match single and pairwise marginals of Reduces to exact inference on single loops Use cutset conditioning and

5-node complete graphs, 10 trials Method FLOPS Error Exact 500 TreeEP 3,000 0.032 BP/double-loop 200,000 0.186 GBP 360,000 0.211

8x8 grids, 10 trials Method FLOPS Error Exact 30,000 TreeEP 300,000 TreeEP 300,000 0.149 BP/double-loop 15,500,000 0.358 GBP 17,500,000 0.003

TreeEP versus BP TreeEP always more accurate than BP, often faster GBP slower than BP, not always more accurate TreeEP converges more often than BP and GBP

Conclusions EP algorithms exceed state-of-art in several domains Many more opportunities out there EP is sensitive to choice of approximation does not give guidance in choosing it (e.g. tree structure) – error bound? Exponential family constraint can be limiting – mixtures?

End

Limitation of BP If the dynamics or measurements are not linear and Gaussian, the complexity of the posterior increases with the number of measurements I.e. BP equations are not “closed” Beliefs need not stay within a given family * * or any other exponential family

Approximate filtering Compute a Gaussian belief which approximates the true posterior: E.g. Extended Kalman filter, statistical linearization, unscented filter, assumed-density filter

EP perspective Approximate filtering is equivalent to replacing true measurement/dynamics equations with linear/Gaussian equations Gaussian implies Gaussian

EP perspective EKF, UKF, ADF are all algorithms for: Linear, Gaussian Nonlinear, Non-Gaussian