Weight Uncertainty in Neural Networks
Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, Daan Wierstra
Presented by Michael Cogswell
Point Estimates of Network Weights: MLE
Point Estimates of Network Weights: MAP
A Distribution over Neural Networks: the ideal test distribution averages predictions over the posterior
Approximate: the exact posterior is intractable, so we approximate it
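For reference, the progression above in formulas (standard definitions, matching the paper):

```latex
w^{\mathrm{MLE}} = \arg\max_{w} \log P(\mathcal{D} \mid w)
                       % maximum likelihood: fit the data only
w^{\mathrm{MAP}} = \arg\max_{w} \log P(\mathcal{D} \mid w) + \log P(w)
                       % MAP: the prior acts as a regularizer
P(\hat{y} \mid \hat{x}) = \mathbb{E}_{P(w \mid \mathcal{D})}\!\left[ P(\hat{y} \mid \hat{x}, w) \right]
                       % ideal test distribution: average over the posterior
```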
Why?
- Regularization
- Understanding network uncertainty
- Cheap model averaging
- Exploration in reinforcement learning (contextual bandits)
Outline: Variational Approximation (Setting) · Gradients for All (Contribution) · The Posterior and the Prior (Details) · An Algorithm (Details) · Experiments (Results)
Computing the Distribution: the posterior over weights is defined by Bayes' rule… but it is intractable.
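Concretely, Bayes' rule gives the posterior, but the normalizing integral over all weight settings of a neural network cannot be computed:

```latex
P(w \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid w)\, P(w)}{\int P(\mathcal{D} \mid w')\, P(w')\, \mathrm{d}w'}
```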
Variational Approximation: approximate the intractable posterior with a tractable family q(w | θ), one independent Gaussian per weight; θ are the parameters of the Gaussians.
Objective
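The variational parameters are found by minimizing the KL divergence to the true posterior, which is equivalent to minimizing the variational free energy:

```latex
\theta^{*} = \arg\min_{\theta} \mathrm{KL}\!\left[ q(w \mid \theta) \,\middle\|\, P(w \mid \mathcal{D}) \right]
           = \arg\min_{\theta} \; \mathrm{KL}\!\left[ q(w \mid \theta) \,\middle\|\, P(w) \right]
             - \mathbb{E}_{q(w \mid \theta)}\!\left[ \log P(\mathcal{D} \mid w) \right]
```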
Why? Minimum Description Length
Another Expression: the objective splits into a Complexity Cost and a Likelihood Cost.
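Writing both terms as a single expectation makes the split explicit and sets up the Monte Carlo estimator used later:

```latex
\mathcal{F}(\mathcal{D}, \theta)
  = \underbrace{\mathrm{KL}\!\left[ q(w \mid \theta) \,\middle\|\, P(w) \right]}_{\text{complexity cost}}
    \;\underbrace{-\,\mathbb{E}_{q(w \mid \theta)}\!\left[ \log P(\mathcal{D} \mid w) \right]}_{\text{likelihood cost}}
  = \mathbb{E}_{q(w \mid \theta)}\!\left[ \log q(w \mid \theta) - \log P(w) - \log P(\mathcal{D} \mid w) \right]
```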
Minimum Description Length: bits to describe w given the prior, plus bits to transmit the targets given the inputs by encoding them with the network. (Honkela and Valpola, 2004; Hinton and Van Camp, 1993; Graves, 2011)
Outline: Variational Approximation (Setting) · Gradients for All (Contribution) · The Posterior and the Prior (Details) · An Algorithm (Details) · Experiments (Results)
Goal: unbiased gradients of the variational objective with respect to the variational parameters θ.
Previous Approach (Graves, NIPS 2011): directly approximate the expectations in closed form for each prior/posterior pair (e.g., Gaussians). This requires a new derivation for every choice of prior and posterior, and the resulting gradients are potentially biased!
Re-parameterization
Unbiased Gradients
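The key result (Proposition 1 in the paper): if w = t(θ, ε) is a deterministic transform of parameter-free noise ε ~ q(ε) with q(ε) dε = q(w | θ) dw, then for any differentiable f,

```latex
\frac{\partial}{\partial \theta}\, \mathbb{E}_{q(w \mid \theta)}\!\left[ f(w, \theta) \right]
  = \mathbb{E}_{q(\epsilon)}\!\left[
      \frac{\partial f(w, \theta)}{\partial w}\,\frac{\partial w}{\partial \theta}
      + \frac{\partial f(w, \theta)}{\partial \theta}
    \right],
  \qquad w = t(\theta, \epsilon)
```

so even a single Monte Carlo sample of ε gives an unbiased gradient estimate.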
Outline: Variational Approximation (Setting) · Gradients for All (Contribution) · The Posterior and the Prior (Details) · An Algorithm (Details) · Experiments (Results)
The Prior – Scale Mixture of Gaussians: with unbiased re-parameterized gradients we don't have to derive a specific approximation for each prior; we just need to evaluate log P(w) (and its gradient) pointwise.
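The prior used in the paper is a fixed two-component scale mixture, one broad and one narrow component:

```latex
P(w) = \prod_{j} \left[ \pi\, \mathcal{N}(w_j \mid 0, \sigma_1^2) + (1 - \pi)\, \mathcal{N}(w_j \mid 0, \sigma_2^2) \right],
\qquad \sigma_1 > \sigma_2, \quad \sigma_2 \ll 1
```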
The Posterior – Independent Gaussians
The Posterior – Re-Parameterization: learn θ = (μ, ρ), with σ = log(1 + exp(ρ)) so the standard deviation is always positive.
The Posterior – Sampling with Noise
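Putting the pieces together, a weight sample is produced by shifting and scaling unit Gaussian noise (∘ is pointwise multiplication):

```latex
\epsilon \sim \mathcal{N}(0, I), \qquad
w = \mu + \log\!\left(1 + e^{\rho}\right) \circ \epsilon, \qquad
\theta = (\mu, \rho)
```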
Outline: Variational Approximation (Setting) · Gradients for All (Contribution) · The Posterior and the Prior (Details) · An Algorithm (Details) · Experiments (Results)
Learning: (Sample) draw w ~ q(w | θ); (Update) compute the gradient step using the sampled w. Repeat for every minibatch.
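As a concrete illustration, here is a minimal PyTorch sketch of this sample-then-update loop for a single 2-to-1 linear map; the specifics (shapes, learning rate, toy data, and a single-Gaussian prior instead of the paper's scale mixture) are assumptions for the sketch, not details from the paper. Autograd backpropagates through the sampled w, which is exactly the re-parameterized gradient above.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal

torch.manual_seed(0)

# Variational parameters theta = (mu, rho) for a 2 -> 1 linear map (bias omitted for brevity).
mu = torch.zeros(1, 2, requires_grad=True)
rho = torch.full((1, 2), -3.0, requires_grad=True)  # sigma = softplus(rho) starts small

prior = Normal(0.0, 1.0)  # assumption: single Gaussian prior (the paper uses a scale mixture)
opt = torch.optim.SGD([mu, rho], lr=1e-2)

x = torch.randn(8, 2)            # toy inputs
y = x.sum(dim=1, keepdim=True)   # toy regression targets

for step in range(100):
    opt.zero_grad()
    # (Sample): w = mu + softplus(rho) * eps,  eps ~ N(0, I)
    eps = torch.randn_like(mu)
    sigma = F.softplus(rho)
    w = mu + sigma * eps
    # f(w, theta) = log q(w | theta) - log P(w) - log P(D | w)
    log_q = Normal(mu, sigma).log_prob(w).sum()
    log_prior = prior.log_prob(w).sum()
    log_lik = Normal(x @ w.t(), 1.0).log_prob(y).sum()
    loss = log_q - log_prior - log_lik
    # (Update): backprop through the sampled w gives an unbiased gradient of F(D, theta)
    loss.backward()
    opt.step()
```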
Outline: Variational Approximation (Setting) · Gradients for All (Contribution) · The Posterior and the Prior (Details) · An Algorithm (Details) · Experiments (Results)
MNIST Classification
MNIST Test Error
Convergence Rate
Weight Histogram: note that the weight distribution learned by vanilla SGD looks like a Gaussian, so a Gaussian prior isn't a bad idea.
Signal-to-Noise Ratio (|μ| / σ): each weight is one data point.
Weight Pruning: the signal-to-noise histogram is bimodal (Peak 1 and Peak 2); low-SNR weights are candidates for pruning.
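The pruning rule ranks weights by this signal-to-noise ratio and removes the lowest ones. A minimal sketch, assuming mu and rho tensors as in the training sketch above (the function name and interface are illustrative):

```python
import torch
import torch.nn.functional as F

def prune_by_snr(mu: torch.Tensor, rho: torch.Tensor, drop_fraction: float) -> torch.Tensor:
    """Zero out the weights with the lowest signal-to-noise ratio |mu| / sigma."""
    sigma = F.softplus(rho)                       # sigma = log(1 + exp(rho))
    snr = mu.abs() / sigma
    k = int(drop_fraction * snr.numel())          # number of weights to drop
    if k == 0:
        return mu.clone()
    threshold = snr.flatten().kthvalue(k).values  # k-th smallest SNR
    return torch.where(snr > threshold, mu, torch.zeros_like(mu))
```

After pruning, the remaining means μ can be used as an ordinary deterministic network; the paper reports that a large fraction of weights can be removed this way with little loss in accuracy.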
Regression
Does uncertainty in weights lead to uncertainty in outputs?
Bayes by Backprop vs. Standard NN (regression figure): blue and purple shading indicates quartiles, red is the median, black crosses are training data.
Exploration in Bandit Problems
UCI Mushroom Dataset: 22 attributes, 8124 examples.
Actions: "edible" (e), E[reward] = 5; "unknown" (u), E[r] = 0; "poisonous" (p), E[r] = -15.
Image: https://archive.ics.uci.edu/ml/datasets/Mushroom
Classification vs. Contextual Bandit: a classifier maps x to P(y=e), P(y=u), P(y=p), one output per class; the bandit network maps (x, action) to E[r], one input per action with a single reward output. Cross entropy naturally judges all predictions at once, while the bandit only observes the reward of the action it took.
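One way to realize the "one input per action" design is to feed the network the context concatenated with a one-hot action encoding; this helper is purely illustrative (the name and the exact encoding are assumptions, not from the paper):

```python
import torch

def bandit_input(context: torch.Tensor, action: int, n_actions: int) -> torch.Tensor:
    """Concatenate the mushroom's features with a one-hot encoding of the action.

    The network then has a single output, E[reward | context, action],
    instead of one output per class as in classification.
    """
    one_hot = torch.zeros(n_actions)
    one_hot[action] = 1.0
    return torch.cat([context, one_hot])
```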
Thompson Sampling: sample w ~ q(w | θ), act greedily under the sampled weights, observe the reward, then update θ.
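In code, one round of Thompson sampling with the variational posterior looks roughly like this; every argument is a placeholder standing in for a component of the agent, not an interface from the paper:

```python
def thompson_step(context, actions, sample_weights, expected_reward, take_action, update_posterior):
    """One round of Thompson sampling with the variational posterior q(w | theta).

    Placeholder arguments:
      sample_weights():           draw w ~ q(w | theta)
      expected_reward(x, a, w):   network's predicted E[r] for context x, action a, weights w
      take_action(a):             act in the environment, return the observed reward
      update_posterior(x, a, r):  one gradient step on the variational objective
    """
    w = sample_weights()                                               # sample from the posterior
    best = max(actions, key=lambda a: expected_reward(context, a, w))  # act greedily under w
    reward = take_action(best)                                         # observe reward
    update_posterior(context, best, reward)                            # learn from (x, a, r)
    return best, reward
```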
Contextual Bandit Results: greedy does not explore for 1000 steps; Bayes by Backprop explores.
Conclusion
- A somewhat general procedure for approximating the NN posterior
- Unbiased gradients
- Could help with RL
Next: Dropout as a GP