Dropout as a Bayesian Approximation

Presentation transcript:

Dropout as a Bayesian Approximation Yarin Gal and Zoubin Ghahramani Presented by Qing Sun

Why Care About Uncertainty? Cat or Dog?

Bayesian Inference Bayesian techniques place a distribution over the model parameters ω. Posterior: p(ω | X, Y) = p(Y | X, ω) p(ω) / p(Y | X). Prediction: p(y* | x*, X, Y) = ∫ p(y* | x*, ω) p(ω | X, Y) dω. Challenge: computational cost and more parameters to optimize.

Softmax? [Figure: softmax input as a function of the data x; softmax output as a function of the data x]

Softmax? P(c|O): the density of points of category c at location O. Consider the neighbors: a point estimate vs. placing a distribution over O. Softmax on a point estimate corresponds to a delta distribution centered at a local minimum, so softmax alone is not enough to reason about uncertainty! John S. Denker and Yann LeCun. Transforming Neural-Net Output Levels to Probability Distributions, 1991.
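A small NumPy illustration (not from the slides; the logit mean and spread are assumptions) of the point made here: the softmax of a single point estimate of the logits can be very confident, while averaging the softmax over a distribution on the logits is noticeably less so.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Point estimate of the softmax input (logits) for 3 classes (toy values).
mean_logits = np.array([2.0, 0.5, -1.0])
point_probs = softmax(mean_logits)

# Place a distribution (here Gaussian, purely illustrative) over the logits
# and average the softmax over samples instead of using a single point estimate.
samples = rng.normal(loc=mean_logits, scale=2.0, size=(10_000, 3))
mc_probs = softmax(samples).mean(axis=0)

print("softmax of point estimate:      ", point_probs)  # confident
print("expected softmax under samples: ", mc_probs)     # less confident
```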

Why Does Dropout Work? Common explanations: ensemble averaging, L2 regularization, … This work: a variational approximation to a Gaussian Process (GP).

Gaussian Process A Gaussian Process is a generalization of a multivariate Gaussian distribution to infinitely many variables (i.e., a function). Definition: a Gaussian Process is a collection of random variables, any finite number of which have a (consistent) joint Gaussian distribution. A Gaussian Process is fully specified by a mean function m(x) and a covariance function k(x, x'): f(x) ~ GP(m(x), k(x, x')).

Prior and Posterior Squared Exponential (SE) covariance function: k(x, x') = σ² exp(−(x − x')² / (2ℓ²)), with signal variance σ² and lengthscale ℓ.
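A minimal NumPy sketch (toy data assumed, not from the slides) of GP regression with the SE covariance function: draw sample functions from the prior, then compute the posterior mean and variance after conditioning on a few noisy observations.

```python
import numpy as np

rng = np.random.default_rng(0)

def se_kernel(x1, x2, sigma=1.0, length=1.0):
    """Squared Exponential covariance: k(x, x') = sigma^2 * exp(-(x - x')^2 / (2 l^2))."""
    d = x1[:, None] - x2[None, :]
    return sigma**2 * np.exp(-0.5 * (d / length) ** 2)

# Test inputs at which we evaluate the prior and posterior.
xs = np.linspace(-5, 5, 100)
K_ss = se_kernel(xs, xs)

# Prior: zero mean, SE covariance -> draw a few sample functions.
prior_samples = rng.multivariate_normal(
    np.zeros(len(xs)), K_ss + 1e-8 * np.eye(len(xs)), size=3)
print("prior draws:", prior_samples.shape)

# Posterior: condition on a handful of noisy observations (toy sin data, an assumption).
x_train = np.array([-4.0, -2.0, 0.0, 1.5, 3.0])
y_train = np.sin(x_train)
noise = 1e-2
K = se_kernel(x_train, x_train) + noise * np.eye(len(x_train))
K_s = se_kernel(x_train, xs)

alpha = np.linalg.solve(K, y_train)
post_mean = K_s.T @ alpha
post_cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
post_std = np.sqrt(np.clip(np.diag(post_cov), 0.0, None))

print("posterior mean near x=0:", post_mean[np.argmin(np.abs(xs))])
print("posterior std at x=5 (far from data):", post_std[-1])  # uncertainty grows away from data
```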

How Does Dropout Work? Demo.

How Does Dropout Work? [Figure: Gaussian process with SE covariance function vs. dropout using uncertainty information (5 hidden layers, ReLU non-linearity)]
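A minimal PyTorch sketch of the MC dropout procedure used for these plots: keep dropout active at test time and average T stochastic forward passes to get a predictive mean and variance. The architecture, dropout rate, and T below are illustrative assumptions, not the slides' exact model, and training is elided.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy regression network with dropout applied before weight layers;
# the sizes and dropout probability here are assumptions for illustration.
model = nn.Sequential(
    nn.Linear(1, 128), nn.ReLU(), nn.Dropout(p=0.1),
    nn.Linear(128, 128), nn.ReLU(), nn.Dropout(p=0.1),
    nn.Linear(128, 1),
)

# ... train the model with standard dropout here ...

def mc_dropout_predict(model, x, T=100):
    """T stochastic forward passes with dropout kept ON at test time."""
    model.train()  # keeps nn.Dropout active (freeze BatchNorm separately in a real model)
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(T)])
    return preds.mean(dim=0), preds.var(dim=0)  # predictive mean and model variance

x_star = torch.linspace(-3, 3, 50).unsqueeze(1)
mean, var = mc_dropout_predict(model, x_star)
print(mean.shape, var.shape)  # torch.Size([50, 1]) torch.Size([50, 1])
```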

How Does Dropout Work? [Figure: extrapolation on the CO2 concentration dataset. (a) Standard dropout, (b) Gaussian process with SE covariance function, (c) MC dropout with ReLU non-linearity, (d) MC dropout with TanH non-linearity]

Why Does It Make Sense? Infinitely wide (single-hidden-layer) NNs with distributions placed over their weights converge to a Gaussian Process [Neal's thesis, 1995]. Write the output as f(x) = b + Σ_j v_j h_j(x). By the Central Limit Theorem, f(x) becomes Gaussian as N → ∞, as long as each term has finite variance; since h_j(x) is bounded, this must be the case. The distribution reaches a (non-degenerate) limit if we make the output-weight standard deviation scale as N^{-1/2}. The joint distribution of the function at any number of input points converges to a multivariate Gaussian, i.e., we have a Gaussian process. Note that the individual hidden-to-output weights go to zero as the number of hidden units goes to infinity. [Please check Neal's thesis for how this issue is dealt with.] R. M. Neal. Bayesian Learning for Neural Networks. PhD thesis, University of Toronto, 1995.
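A small NumPy experiment (not from the slides; the tanh activation, weight scales, and test input are assumptions, and the bias term is omitted for clarity) illustrating Neal's argument: with the output-weight standard deviation scaled as 1/√N, the distribution of f(x) over random weight draws keeps roughly constant variance while its excess kurtosis goes to zero, i.e., it looks increasingly Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_net_outputs(x, n_hidden, n_draws=20_000, sigma_w=1.0, sigma_v=1.0):
    """f(x) = sum_j v_j * tanh(w_j x + a_j) for many random weight draws,
    with the output-weight std scaled as sigma_v / sqrt(n_hidden)."""
    w = rng.normal(0.0, sigma_w, size=(n_draws, n_hidden))
    a = rng.normal(0.0, sigma_w, size=(n_draws, n_hidden))
    v = rng.normal(0.0, sigma_v / np.sqrt(n_hidden), size=(n_draws, n_hidden))
    return np.sum(v * np.tanh(w * x + a), axis=1)

for n_hidden in (1, 10, 100, 1000):
    f = random_net_outputs(x=0.5, n_hidden=n_hidden)
    excess_kurtosis = ((f - f.mean()) ** 4).mean() / f.var() ** 2 - 3
    print(f"N={n_hidden:5d}  var={f.var():.3f}  excess kurtosis={excess_kurtosis:+.3f}")
# Under the 1/sqrt(N) scaling the variance stays finite, and the excess
# kurtosis approaches 0 (Gaussian) as N grows.
```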

Why Does It Make Sense? The posterior distribution p(ω | X, Y) might have a complex form, so define an "easier" variational distribution q(ω). Minimizing KL(q(ω) ‖ p(ω | X, Y)) is equivalent to maximizing the log evidence lower bound: the first term of the bound fits the training data, while the KL term to the prior keeps q similar to the prior and helps avoid over-fitting. Key problem: what kind of q(ω) does dropout provide?
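For reference, the standard variational-inference identity behind this slide (not specific to dropout):

```latex
\begin{align*}
\log p(Y \mid X)
  &= \int q(\omega) \log \frac{p(Y \mid X, \omega)\, p(\omega)}{p(\omega \mid X, Y)} \, d\omega \\
  &= \underbrace{\int q(\omega) \log p(Y \mid X, \omega)\, d\omega
      \;-\; \mathrm{KL}\!\big(q(\omega)\,\|\,p(\omega)\big)}_{\mathcal{L}_{\mathrm{VI}}
      \text{ (evidence lower bound)}}
      \;+\; \mathrm{KL}\!\big(q(\omega)\,\|\,p(\omega \mid X, Y)\big).
\end{align*}
```

Since log p(Y | X) does not depend on q, maximizing the lower bound L_VI is the same as minimizing KL(q(ω) ‖ p(ω | X, Y)).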

Why Does It Make Sense? The variational parameters are just the weight matrices W1, W2 and the bias b; there is no variance variable to optimize. Each weight row is modelled as a mixture of two Gaussians with small standard deviation s, one component centered at zero (a dropped unit) and one at the learned value, with mixing proportions set by the dropout probabilities; minimizing the KL divergence to the full posterior involves the second-order moments of q, which yields L2-style regularization on the parameters. With p1 = p2 = 0 this reduces to a normal NN without dropout => no regularization on the parameters; as s → 0 the mixture of Gaussians approximates the Bernoulli dropout distribution.
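A minimal NumPy sketch (my own illustration, with toy values) of the approximating distribution q(W) that dropout implies: exact Bernoulli dropout of rows of a learned matrix M, and its Gaussian-mixture relaxation with small standard deviation s. Which mixture component carries which probability depends on the dropout convention, so treat that detail as an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

M = rng.normal(size=(4, 3))   # learned variational parameters (toy values)
p, s = 0.5, 1e-3              # dropout probability and small mixture std (assumptions)

def sample_W_bernoulli(M, p):
    """Bernoulli dropout: zero out whole rows of M, keeping each row with prob. 1 - p."""
    z = rng.binomial(1, 1 - p, size=(M.shape[0], 1))
    return z * M

def sample_W_gaussian_mixture(M, p, s):
    """Relaxation: each row is drawn from a two-component Gaussian mixture with tiny
    std s, centered at 0 (dropped) or at the corresponding row of M (kept).
    As s -> 0 this approaches the Bernoulli dropout distribution above."""
    keep = rng.binomial(1, 1 - p, size=(M.shape[0], 1))
    return keep * M + rng.normal(scale=s, size=M.shape)

print(sample_W_bernoulli(M, p))
print(sample_W_gaussian_mixture(M, p, s))
```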

Experiments [Figure: MNIST digit classification. (a) Softmax input scatter, (b) Softmax output scatter]

Experiments Averaged test performance in RMSE and predictive log likelihood for variational inference (VI), probabilistic back-propagation (PBP), and dropout uncertainty (Dropout).

Experiments [Figure: (a) Agent in a 2D world. Red circle: positive reward, green circle: negative reward. (b) Log plot of average reward]

The End!