Extending Expectation Propagation on Graphical Models Yuan (Alan) Qi MIT Media Lab

Motivation
Graphical models are widely used in real-world applications, such as human behavior recognition and wireless digital communications.
Inference on graphical models: infer hidden variables.
–Previous approaches often sacrifice efficiency for accuracy, or accuracy for efficiency.
–Need: methods that better balance the trade-off between accuracy and efficiency.
Learning graphical models: learn model parameters.
–Maximum likelihood approaches are prone to overfitting.
–Need: efficient Bayesian training methods.

Outline
Background: graphical models and expectation propagation (EP)
Inference on graphical models
–Extending EP on Bayesian dynamic networks (fixed-lag smoothing: wireless signal detection; different approximation techniques: Poisson tracking)
–Combining EP with local propagation on loopy graphs
Learning conditional graphical models
–Extending EP classification to perform feature selection (gene expression classification)
–Training Bayesian conditional random fields (handwritten ink analysis)
Conclusions

Outline Background on expectation propagation (EP) –4 kinds of graphical models –EP in a nutshell Inference on graphical models Learning conditional graphical models Conclusions

Graphical Models
[Figure: four model types. Inference: Bayesian networks and Markov networks over variables x1, x2 with observations y1, y2. Learning: conditional classification and conditional random fields over inputs x1, x2, labels t1, t2, and shared parameters w.]

Expectation Propagation in a Nutshell
Approximate a probability distribution by simpler parametric terms (Minka 2001); the factorization takes a different form for Bayesian networks, Markov networks, conditional classification, and conditional random fields. Each approximation term lives in an exponential family (such as the Gaussian or multinomial).
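A rough sketch in generic notation (the symbols here are assumptions, not taken from the slides): EP replaces each exact factor by an exponential-family term,

    p(\mathbf{x}) \;\propto\; \prod_a f_a(\mathbf{x}) \;\approx\; q(\mathbf{x}) \;=\; \prod_a \tilde{f}_a(\mathbf{x}),

where the factors f_a are conditional probability tables in Bayesian networks, edge potentials in Markov networks, and per-data-point likelihood terms in the conditional models.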

EP in a Nutshell (2)
Each approximate term is chosen to minimize a KL divergence by moment matching, where the "leave-one-out" approximation is obtained by removing that term from the current approximate posterior.

EP in a Nutshell (3)
Three key steps:
–Deletion: approximate the "leave-one-out" predictive posterior for the i-th point by removing its term from q(x).
–ADF (assumed-density filtering): minimize a KL divergence by moment matching to incorporate the exact term.
–Inclusion: update the approximate term from the refined posterior.
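A minimal sketch of the three steps in standard EP notation (the symbols are assumed, not copied from the slides):

    \text{Deletion:}\quad q^{\setminus i}(\mathbf{x}) \;\propto\; q(\mathbf{x}) \,/\, \tilde{f}_i(\mathbf{x})
    \text{ADF:}\quad q^{\text{new}}(\mathbf{x}) \;=\; \arg\min_{q'} \; \mathrm{KL}\!\left( \tfrac{1}{Z_i}\, f_i(\mathbf{x})\, q^{\setminus i}(\mathbf{x}) \,\Big\|\, q'(\mathbf{x}) \right)
    \text{Inclusion:}\quad \tilde{f}_i(\mathbf{x}) \;\leftarrow\; Z_i \, q^{\text{new}}(\mathbf{x}) \,/\, q^{\setminus i}(\mathbf{x})

where Z_i normalizes f_i(x) q^{\setminus i}(x), and the KL minimization over an exponential family reduces to moment matching.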

Limitations of Plain EP
–Batch processing of terms: not online.
–The ADF step can be difficult or expensive to compute analytically.
–It can be expensive to compute and maintain a valid approximate distribution q(x) that is coherent under marginalization (e.g., a tree-structured q(x)).
–EP classification degenerates in the presence of noisy features.
–Cannot incorporate denominators.

Four Extensions on Four Types of Graphical Models
1. Fixed-lag smoothing and embedding of different approximation techniques for dynamic Bayesian networks.
2. Allowing a structured approximation to be globally non-coherent, while maintaining only local consistency, during inference on loopy graphs.
3. Combining EP with ARD for classification with noisy features.
4. Extending EP to train conditional random fields with a denominator (partition function).

Inference on Dynamic Bayesian Networks
[Figure: the four model types (Bayesian networks, Markov networks, conditional classification, conditional random fields), with Bayesian networks highlighted.]

Outline Background Inference on graphical models –Extending EP on Bayesian dynamic networks Fixed lag smoothing: wireless signal detection Different approximation techniques: Poisson tracking –Combining EP with junction tree algorithm on loopy graphs Learning conditional graphical models Conclusions

Object Tracking
Guess the position of an object given noisy observations. [Figure: object trajectory with noisy observations.]

Bayesian Network (Random Walk)
E.g., we want the distribution of the x's given the y's. [Figure: chain x1 → x2 → … → xT with observations y1, y2, …, yT.]
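One concrete instance consistent with this slide (the Gaussian observation model and the noise variances are assumptions for illustration):

    p(x_t \mid x_{t-1}) = \mathcal{N}(x_t;\, x_{t-1},\, \sigma_x^2), \qquad p(y_t \mid x_t) = \mathcal{N}(y_t;\, x_t,\, \sigma_y^2),

with the goal of computing (an approximation to) p(x_{1:T} | y_{1:T}).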

Approximation: factorized and Gaussian in x.

Message Interpretation
The approximate posterior for x_t decomposes as (forward message) × (observation message) × (backward message). [Figure: node x_t with its observation y_t, forward and backward messages along the chain, and the observation message from y_t.]
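In symbols (notation assumed), the approximate marginal at time t factors as

    q(x_t) \;\propto\; \alpha_t(x_t)\; o_t(x_t)\; \beta_t(x_t),

where \alpha_t is the forward message, o_t the observation message, and \beta_t the backward message.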

Extensions of EP
–Instead of batch iterations, use fixed-lag smoothing for online processing.
–Instead of assumed-density filtering, use any method for approximate filtering (for example, the unscented Kalman filter, UKF). This turns a deterministic filtering method into a smoothing method!
–All these methods can be interpreted as finding linear/Gaussian approximations to the original terms; use quadrature or Monte Carlo for the term approximations.
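As a hedged illustration of the fixed-lag idea (not the thesis implementation: the linear-Gaussian model, noise levels, and lag below are assumptions, and a Kalman/RTS recursion stands in for the EP message updates):

    import numpy as np

    # Illustrative fixed-lag smoothing on a linear-Gaussian random walk.
    # Model, noise levels, and lag L are assumptions for this sketch.
    np.random.seed(0)
    T, L = 100, 5            # sequence length, fixed smoothing lag
    q_var, r_var = 0.1, 1.0  # process and observation noise variances

    # simulate x_t = x_{t-1} + N(0, q_var), y_t = x_t + N(0, r_var)
    x = np.cumsum(np.sqrt(q_var) * np.random.randn(T))
    y = x + np.sqrt(r_var) * np.random.randn(T)

    def kalman_filter(y, q_var, r_var):
        """Forward pass: return filtered means and variances."""
        m, v = 0.0, 10.0                 # diffuse prior on the first state
        means, variances = [], []
        for obs in y:
            v_pred = v + q_var           # predict
            k = v_pred / (v_pred + r_var)
            m = m + k * (obs - m)        # update with the observation
            v = (1 - k) * v_pred
            means.append(m)
            variances.append(v)
        return np.array(means), np.array(variances)

    def fixed_lag_estimate(y, t, L, q_var, r_var):
        """Estimate the state at time t-L using observations up to time t only
        (an RTS backward recursion restricted to the lag window)."""
        fm, fv = kalman_filter(y[:t + 1], q_var, r_var)
        m, v = fm[-1], fv[-1]
        for s in range(t - 1, t - L - 1, -1):   # smooth back L steps
            v_pred = fv[s] + q_var
            g = fv[s] / v_pred
            m = fm[s] + g * (m - fm[s])
            v = fv[s] + g**2 * (v - v_pred)
        return m, v

    t_now, idx = 50, 50 - L
    m, v = fixed_lag_estimate(y, t=t_now, L=L, q_var=q_var, r_var=r_var)
    print(f"fixed-lag estimate of x[{idx}]: {m:.3f} +/- {np.sqrt(v):.3f} (true {x[idx]:.3f})")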

Bayesian Network for Wireless Signal Detection
[Figure: dynamic Bayesian network with hidden chains s1, …, sT and x1, …, xT and observations y1, …, yT.]
s_i: transmitted signals; x_i: channel coefficients for digital wireless communication; y_i: received noisy observations.

Experimental Results
EP outperforms particle smoothers (Chen, Wang & Liu, 2000) in efficiency, with comparable accuracy. [Plot: performance vs. signal-to-noise ratio.]

Computational Complexity
Algorithm                               Complexity
Extended EP                             O(nLd^2)
Stochastic mixture of Kalman filters    O(MLd^2)
Rao-Blackwellised particle smoothers    O(MNLd^2)
L: length of the fixed-lag smoothing window; d: dimension of the parameter vector; n: number of EP iterations (typically 4 or 5); M: number of samples in filtering (often larger than 500 or 100); N: number of samples in smoothing (larger than 50).
Extended EP is far cheaper than Rao-Blackwellised particle smoothers (roughly by a factor of MN/n).

Example: Poisson Tracking
The observation at each time step is an integer-valued Poisson variate whose mean depends on the hidden state.
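A model of this general shape (the exponential link and the random-walk dynamics are assumptions for illustration) is

    y_t \mid x_t \;\sim\; \text{Poisson}\!\left(e^{x_t}\right), \qquad x_t \mid x_{t-1} \;\sim\; \mathcal{N}\!\left(x_{t-1},\, \sigma^2\right),

whose observation term is non-Gaussian, so its moments must be computed numerically (e.g., by quadrature or Monte Carlo).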

Accuracy/Efficiency Trade-off
[Plots of estimation accuracy against computation time.]

Inference on Markov Networks
[Figure: the four model types, with Markov networks highlighted.]

Outline Background on expectation propagation (EP) Inference on graphical models –Extending EP on Bayesian dynamic networks Fixed-lag smoothing: wireless signal detection Different approximation techniques: Poisson tracking –Combining EP with junction tree algorithm on loopy graphs Learning conditional graphical models Conclusions

Inference on Loopy Graphs
Problem: estimate the marginal distributions of the variables indexed by the nodes of a loopy graph, e.g., p(x_i), i = 1, …, 16. [Figure: 4×4 grid of nodes X1 through X16.]

4-node Loopy Graph
The joint distribution is a product of pairwise potentials, one for each edge. We want to approximate it by a simpler distribution.
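In symbols (notation assumed),

    p(\mathbf{x}) \;\propto\; \prod_{(i,j) \in E} \psi_{ij}(x_i, x_j),

and the aim is to approximate p(x) by a simpler, e.g. tree-structured, distribution q(x).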

BP vs. TreeEP
[Figure: the loopy graph is projected onto a fully factorized approximation (BP) or a tree-structured approximation (TreeEP).]

Junction Tree Representation
[Figure: the target distribution p(x), its tree approximation q(x), and the corresponding junction tree.]

Two Kinds of Edges
–On-tree edges, e.g., (x1, x4): exactly incorporated into the junction tree.
–Off-tree edges, e.g., (x1, x2): approximated by projecting them onto the tree structure.

KL Minimization
KL minimization reduces to moment matching: match the single-node and pairwise marginals of the distribution being approximated and of the tree approximation q(x).
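A sketch of the projection for one off-tree edge (notation assumed): with \hat{p} denoting the current approximation augmented by the exact off-tree potential,

    q^{\text{new}} \;=\; \arg\min_{q \,\in\, \text{tree family}} \mathrm{KL}\!\left(\hat{p} \,\|\, q\right)
    \;\;\Longleftrightarrow\;\;
    q^{\text{new}}(x_i) = \hat{p}(x_i), \quad q^{\text{new}}(x_i, x_j) = \hat{p}(x_i, x_j) \;\;\text{for all on-tree edges } (i,j).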

Matching Marginals on Graph
(1) Incorporate edge (x3, x4). (2) Incorporate edge (x6, x7). [Figure: the junction tree of cliques (x1x2, x1x3, x1x4, x3x5, x3x6, x5x7) is updated after each off-tree edge is incorporated.]

Drawbacks of Global Propagation by Regular EP
–Update all the cliques even when incorporating only one off-tree edge: computationally expensive.
–Store each off-tree data message as a whole tree: requires a large amount of memory.

Solution: Local Propagation
–Allow q(x) to be non-coherent during the iterations; it only needs to be coherent at the end.
–Exploit the junction tree representation: propagate information only locally, within the minimal loop (subtree) that is directly connected to the off-tree edge.
–Reduces computational complexity and saves memory.

(1) Incorporate edge (x3, x4). (2) Propagate evidence. (3) Incorporate edge (x6, x7). [Figure: only the cliques on the subtree connected to each off-tree edge are updated.]
On this simple graph, local propagation runs roughly 2 times faster and uses roughly half the memory for storing messages compared with plain (global) EP.

Tree-EP
Combines EP with the junction tree algorithm. Can perform efficiently over hypertrees and hypernodes.

Fully-Connected Graphs
Results are averaged over 10 graphs with randomly generated potentials. TreeEP performs the same as or better than all other methods in both accuracy and efficiency!

Learning Conditional Classification Models
[Figure: the four model types, with conditional classification highlighted.]

Outline Background on expectation propagation (EP) Inference on graphical models Learning conditional graphical models –Extending EP classification to perform feature selection Gene expression classification –Training Bayesian conditional random fields Handwritten ink analysis Conclusions

Conditional Bayesian Classification Model
Labels t, inputs X, classifier parameters w. Place a prior on the classifier w and define the likelihood for the data set through the cumulative distribution function of a standard Gaussian (a probit link). [Figure: inputs x1, x2, labels t1, t2, and shared parameters w.]
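A standard form consistent with this slide (the ±1 label coding and the linear activation \mathbf{w}^{\top}\mathbf{x} are assumptions):

    p(\mathbf{t} \mid X, \mathbf{w}) \;=\; \prod_i \Phi\!\left(t_i\, \mathbf{w}^{\top} \mathbf{x}_i\right), \qquad t_i \in \{-1, +1\},

where \Phi is the standard Gaussian CDF and \mathbf{w} carries a Gaussian prior.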

Evidence and Predictive Distribution
The evidence is the marginal likelihood of the hyperparameters, obtained by integrating out the classifier w. The predictive posterior distribution of the label for a new input is likewise obtained by averaging over the posterior of w.
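Written out, with \boldsymbol{\alpha} denoting the hyperparameters (the symbol is an assumption):

    p(\mathbf{t} \mid X, \boldsymbol{\alpha}) = \int p(\mathbf{t} \mid X, \mathbf{w})\, p(\mathbf{w} \mid \boldsymbol{\alpha})\, d\mathbf{w},
    \qquad
    p(t^{\ast} \mid \mathbf{x}^{\ast}, \mathbf{t}, X) = \int p(t^{\ast} \mid \mathbf{x}^{\ast}, \mathbf{w})\, p(\mathbf{w} \mid \mathbf{t}, X, \boldsymbol{\alpha})\, d\mathbf{w}.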

Limitation of EP Classification
In the presence of noisy features, the performance of classical conditional Bayesian classifiers, e.g., Bayes point machines trained by EP, degenerates.

Automatic Relevance Determination (ARD)
–Give the classifier weights independent Gaussian priors whose variances (one hyperparameter per weight) control how far from zero each weight is allowed to go.
–Maximize the marginal likelihood of the model with respect to these hyperparameters.
–Outcome: many of the hyperparameters go to infinity, driving the corresponding weights to zero and naturally pruning irrelevant features in the data.
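In the standard precision parameterization (an assumption here),

    p(\mathbf{w} \mid \boldsymbol{\alpha}) = \prod_k \mathcal{N}\!\left(w_k;\, 0,\, \alpha_k^{-1}\right),
    \qquad
    \hat{\boldsymbol{\alpha}} = \arg\max_{\boldsymbol{\alpha}} \; p(\mathbf{t} \mid X, \boldsymbol{\alpha}),

so \alpha_k \to \infty drives the variance of w_k to zero and prunes feature k.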

Two Types of Overfitting
–Classical maximum likelihood: optimizing the classifier weights w can directly fit noise in the data, resulting in a complicated model.
–Type-II maximum likelihood (ARD): optimizing the hyperparameters corresponds to choosing which variables are irrelevant. Choosing one out of exponentially many models can also overfit if we maximize the model marginal likelihood.

Risk of Optimizing the Hyperparameters
[Figure: 2-D toy data (X: class 1, O: class 2) with decision boundaries labelled Evd-ARD-1, Evd-ARD-2, and Bayes Point.]

Predictive-ARD
Choose the model with the best estimated predictive performance instead of the most probable model. Expectation propagation (EP) estimates the leave-one-out predictive performance without performing any expensive cross-validation.

Estimate Predictive Performance
–Predictive posterior given a test data point.
–EP can estimate the predictive leave-one-out error probability, where q(w | t_{\i}) is the approximate posterior obtained by leaving out the i-th label.
–EP can also estimate the predictive leave-one-out error count.
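With a probit likelihood, the leave-one-out estimates take roughly this form (a sketch; the expressions in the thesis may differ in detail):

    \epsilon_i \;=\; \int \Phi\!\left(-t_i\, \mathbf{w}^{\top} \mathbf{x}_i\right) q(\mathbf{w} \mid \mathbf{t}_{\setminus i})\, d\mathbf{w},
    \qquad
    \text{LOO error count} \;=\; \sum_i \mathbb{1}\!\left[\epsilon_i > \tfrac{1}{2}\right].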

Comparison of Different Model Selection Criteria for ARD Training
[Figure rows: 1st, test error; 2nd, estimated leave-one-out error probability; 3rd, estimated leave-one-out error counts; 4th, evidence (model marginal likelihood); 5th, fraction of selected features.]
The estimated leave-one-out error probabilities and counts are better correlated with the test error than the evidence and the sparsity level.

Gene Expression Classification
Task: classify gene expression datasets into different categories, e.g., normal vs. cancer.
Challenge: thousands of genes are measured in the microarray data, but only a small subset of genes is probably correlated with the classification task.

Classifying Leukemia Data The task: distinguish acute myeloid leukemia (AML) from acute lymphoblastic leukemia (ALL). The dataset: 47 and 25 samples of type ALL and AML respectively with 7129 features per sample. The dataset was randomly split 100 times into 36 training and 36 testing samples.

Classifying Colon Cancer Data
The task: distinguish normal and cancer samples. The dataset: 22 normal and 40 cancer samples with 2000 features per sample. The dataset was randomly split 100 times into 50 training and 12 testing samples. SVM results are from Li et al. 2002.

Learning Conditional Random Fields
[Figure: the four model types, with conditional random fields highlighted.]

Outline Background on expectation propagation (EP) Inference on graphical models Learning conditional graphical models –Extending EP classification to perform feature selection Gene expression classification –Training Bayesian conditional random fields Handwritten ink analysis Conclusions

Conditional Random Fields (CRFs)
[Figure: inputs x1, x2, labels t1, t2 connected by an edge, and shared parameters w.]
–Generalize the traditional classification model by introducing correlations between labels, via potential functions.
–Model data with dependence and structure, e.g., web pages, natural language, and multiple visual objects in a picture.
–Information at one location propagates to other locations.
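A generic form (the exponential-family potential is an assumption; the edge indexing follows the later partition-function slide):

    p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}) \;=\; \frac{1}{Z(\mathbf{w}, \mathbf{x})} \prod_{k=\{i,j\}} g_k(t_i, t_j, \mathbf{x}; \mathbf{w}),
    \qquad
    g_k = \exp\!\left(\mathbf{w}^{\top} \boldsymbol{\phi}_k(t_i, t_j, \mathbf{x})\right).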

Learning the Parameters w by ML/MAP
–Maximum likelihood (ML): maximize the data likelihood.
–Maximum a posteriori (MAP): add a Gaussian prior on w.
–Problem with ML/MAP: overfitting to the noise in the data.
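In symbols (the isotropic Gaussian prior is an assumption),

    \hat{\mathbf{w}}_{\text{ML}} = \arg\max_{\mathbf{w}} \sum_n \log p\!\left(\mathbf{t}^{(n)} \mid \mathbf{x}^{(n)}, \mathbf{w}\right),
    \qquad
    \hat{\mathbf{w}}_{\text{MAP}} = \arg\max_{\mathbf{w}} \left[ \sum_n \log p\!\left(\mathbf{t}^{(n)} \mid \mathbf{x}^{(n)}, \mathbf{w}\right) + \log \mathcal{N}\!\left(\mathbf{w};\, \mathbf{0},\, \sigma^2 I\right) \right].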

Bayesian Conditional Networks
Bayesian training avoids overfitting, but needs to be efficient: approximate the exact posterior of w by a Gaussian approximate posterior of w.

Two Difficulties for Bayesian Training
–The partition function appears in the denominator: regular EP does not apply.
–The partition function is a complex function of w.

Turn the Denominator into a Numerator (1)
Invert the approximation term for the partition function, then apply the usual EP steps (deletion, ADF, inclusion) to it, with each update a KL projection of q. In effect: one step forward, two steps backward.

Turn the Denominator into a Numerator (2)
Minka's approach uses the same three steps (deletion, ADF, inclusion). In effect: two steps backward, one step forward.

Approximating the Partition Function
The parameters w and the labels t are intertwined in Z(w), which sums the product of edge potentials over all label configurations, where k = {i, j} indexes the edges. We therefore work with the joint distribution of w and t and use a factorized approximation.
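Spelled out in the same notation (a sketch):

    Z(\mathbf{w}, \mathbf{x}) \;=\; \sum_{\mathbf{t}} \prod_{k=\{i,j\}} g_k(t_i, t_j, \mathbf{x}; \mathbf{w}),

and the factorized approximation treats the parameters and the labels as approximately independent,

    q(\mathbf{w}, \mathbf{t}) \;=\; q(\mathbf{w}) \prod_i q(t_i).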

Flatten the Approximation Structure
Remove the intermediate level of the approximation. Increased efficiency, stability, and accuracy! [Plot over iterations.]

Results on Synthetic Data
Data generation: randomly sample the inputs x, fix the true parameters w, and then sample the labels t. Graphical structure: four nodes in a simple loop. Comparison of maximum-likelihood-trained CRFs with Bayesian conditional networks: 10 trials, 100 training examples and 1000 test examples. BCNs significantly outperformed ML CRFs.

Ink Application: analyzing handwritten organization charts Parsing a graph into different components: containers vs. connectors

Ink Application: Compare BCNs with i.i.d. Conditional Bayesian Classifiers
[Figure: error rates for i.i.d. conditional Bayesian classifiers vs. BCNs (early version).]

Ink Application: Compare ML CRFs with BCNs
Comparing maximum-likelihood-trained CRFs with Bayesian conditional networks (BCNs): 15 trials, with 14 graphs for training and 9 graphs for testing in each trial. BCNs significantly outperformed ML CRFs.

Four Types of Graphical Models
[Figure: Bayesian networks and Markov networks (inference); conditional classification and conditional random fields (learning).]

Outline
Background on expectation propagation (EP); inference on graphical models; learning conditional graphical models; conclusions.
–4 extensions to EP on 4 types of graphical models and 3 real-world applications.
–Inference: a better trade-off between accuracy and efficiency.
–Learning: better generalization than the state of the art.

Conclusion: 4 Extensions & 3 Applications
1. Extending EP on dynamic models by fixed-lag smoothing and by embedding different approximation techniques.
–Wireless signal detection: much less computation, with comparable or superior accuracy to sequential Monte Carlo.
2. Combining EP with local propagation on loopy graphs.
–Outperformed belief propagation, naïve mean field, and structured variational methods.
3. Extending EP classification to perform feature selection.
–Gene expression classification: outperformed traditional ARD and SVM with feature selection.
4. Training Bayesian conditional random fields by handling the denominator and flattening the approximation structure.
–Ink analysis: beats ML CRFs.

Extended EP Algorithms for Inference and Learning
–Inference: outperform state-of-the-art inference techniques in the trade-off between accuracy and efficiency. [Schematic plot: inference error vs. computational time for extended EP and state-of-the-art inference techniques.]
–Learning: outperform state-of-the-art techniques on sparse learning and CRFs in generalization performance. [Schematic plot: test error with error bars for extended EP and state-of-the-art learning techniques.]

Acknowledgement My advisor Roz Picard Tom Minka Tommi and Zoubin Rgrads: Ashish, Carson, Karen, Phil, Win, Raul, etc Researchers at MSR: Martin Szummer, Chris Bishop, Ralf Herbrich, Thore Graepel, Andrew Blake Folks at UCL: Chu Wei, Jaz Kandola, Fernando, Ed, Iain, Katherine, and Mark Peter Gorniak and Brian Whitman

End Questions? Now or Thesis will be online at

Conclusions
Extending EP on graphical models:
–Instead of minimizing KL divergence, use other sensible criteria to generate messages; this effectively turns any deterministic filtering method into a smoothing method.
–Use quadrature to approximate messages.
–Use local propagation to save computation and memory in tree-structured EP.

Conclusions
Extended EP algorithms outperform state-of-the-art inference methods on graphical models in the trade-off between accuracy and efficiency. [Schematic plot: error vs. computational time for extended EP and state-of-the-art techniques.]

Future Work
More extensions of EP:
–How to choose a sensible approximation family (e.g., which tree structure).
–More flexible approximations: mixtures of EP?
–Error bounds?
–Bayesian conditional random fields.
–EP for optimization (generalizing max-product).
More real-world applications, e.g., classification of gene expression data.

Motivation
Task 1: classify high-dimensional datasets with many irrelevant features, e.g., normal vs. cancer microarray data.
Task 2: sparse Bayesian kernel classifiers for fast test performance.

Outline Background on expectation propagation (EP) Extending EP on Bayesian dynamic networks –Fixed-lag smoothing: wireless signal detection –Different approximation techniques: Poisson tracking Combining EP with junction tree algorithm on loopy graphs Extending EP classification to perform feature selection –Gene expression classification Training Bayesian conditional random fields –Handwritten ink analysis Conclusions and future work

Outline Background –Bayesian classification model –Automatic relevance determination (ARD) Risk of Overfitting by optimizing hyperparameters Predictive ARD by expectation propagation (EP): –Approximate prediction error –EP approximation Experiments Conclusions

Outline Background –Bayesian classification model –Automatic relevance determination (ARD) Risk of Overfitting by optimizing hyperparameters Predictive ARD by expectation propagation (EP): –Approximate prediction error –EP approximation –Sequential update Experiments Conclusion

Maximizing the marginal likelihood can lead to overfitting in model space if there are many features. We propose Predictive-ARD based on EP for feature selection and sparse kernel learning. In practice, Predictive-ARD works better than traditional ARD.

Three Extensions
1. Instead of choosing the approximate term to minimize a KL divergence, use other criteria.
2. Use numerical approximation to compute moments: quadrature or Monte Carlo.
3. Allow the tree-structured q(x) to be non-coherent during the iterations; it only needs to be coherent at the end.

Motivation
[Schematic plot: error vs. computational time, contrasting current techniques with what we want.]

Efficiency vs. Accuracy
[Schematic plot: error vs. computational time for Monte Carlo, loopy BP (factorized EP), and the target region for extended EP.]

Inference on Graphical Models
Bayesian inference techniques:
–Belief propagation (BP): Kalman filtering/smoothing, the forward-backward algorithm.
–Monte Carlo: particle filters/smoothers, MCMC.
Loopy BP is typically efficient, but not accurate on general loopy graphs. Monte Carlo is accurate, but often not efficient.

Extended EP vs. Monte Carlo: Accuracy
[Plots: estimated mean and variance.]

Poisson Tracking Model

Extended-EP Joint Signal Detection and Channel Estimation
–Turn a mixture of Kalman filters into a smoothing method.
–Smooth over the most recent observations in a fixed-lag window; earlier observations act as a prior for the current estimate.

Bayesian Networks for Adaptive Decoding
[Figure: dynamic Bayesian network with information bits e1, …, eT, hidden states x1, …, xT, and observations y1, …, yT.]
The information bits e_t are coded by a convolutional error-correcting encoder.

EP Outperforms Viterbi Decoding
[Plot: performance vs. signal-to-noise ratio.]

Combine Tree-Structured Approximation with the Junction Tree Algorithm
Combines EP with the junction tree algorithm; can perform efficiently over hypertrees and hypernodes.

8x8 grids, 10 trials
Method            FLOPS         Error
Exact             30,000        0
TreeEP            300,...
BP/double-loop    15,500,...
GBP               17,500,...

4-node Graph
TreeEP: the proposed method. GBP: generalized belief propagation on triangles. TreeVB: variational tree. BP: loopy belief propagation (equivalent to factorized EP). MF: mean field.

Outline: Extending EP classification to perform feature selection Background –Bayesian classification model –Automatic relevance determination (ARD) Risk of Overfitting by optimizing hyperparameters Predictive ARD by expectation propagation (EP): –Approximate prediction error –EP approximation Experiments

Approximate Leave-One-Out Error
Three key steps (as in standard EP): deletion (approximate the "leave-one-out" predictive posterior for the i-th point), moment matching (minimize a KL divergence), and inclusion. The key observation: we can use the approximate predictive posterior obtained in the deletion step for model selection. No extra computation!

Bayesian Sparse Kernel Classifiers
Using feature/kernel expansions defined on the training data points, Predictive-ARD-EP trains a classifier that depends on only a small subset of the training set, giving fast test performance.
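The expansion referred to is presumably of the standard kernel form (the bias term is an assumption):

    f(\mathbf{x}) \;=\; \sum_i w_i\, k(\mathbf{x}, \mathbf{x}_i) + b,

so ARD priors on the weights w_i prune most training points and leave a sparse, relevance-vector-style classifier.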

Test error rates and numbers of relevance or support vectors on the breast cancer dataset. 50 partitionings of the data were used. All methods use the same Gaussian kernel with kernel width = 5. The trade-off parameter C in the SVM is chosen via 10-fold cross-validation for each partition.

Test error rates on the diabetes data. 100 partitionings of the data were used. Evidence and Predictive ARD-EPs use the Gaussian kernel with kernel width = 5.

Ink Application: Using Graphical Models
Three steps:
–Subdivision of pen strokes into fragments.
–Construction of a conditional random field that contains only pairwise features based on the fragments.
–Training and inference on the network.

Low-Rank Matrix Computation
Exploit the structure of the problem. Observation: each potential function constrains the posterior only within a subspace, so low-rank matrix computation gives more efficiency.

Comparison to Belief Propagation in ML Training
Similarity: both propagate probabilistic information between the nodes of a graph.
Difference: Bayesian training averages the belief q(t) over the potential parameters w, while belief propagation does not.

TreeEP versus BP and GBP
–TreeEP is always more accurate than BP and is often faster.
–TreeEP is much more efficient than GBP and more accurate on some problems.
–TreeEP converges more often than BP and GBP.