Weight Uncertainty in Neural Networks
Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, Daan Wierstra
Presented by Michael Cogswell
Point Estimates of Network Weights: MLE
Point Estimates of Network Weights: MAP
A Distribution over Neural Networks: the ideal test distribution averages predictions over the posterior
Approximate: the exact posterior is intractable, so we approximate it
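For reference, the progression above in formulas (standard definitions, matching the paper):

```latex
w^{\mathrm{MLE}} = \arg\max_{w} \log P(\mathcal{D} \mid w)
                       % maximum likelihood: fit the data only
w^{\mathrm{MAP}} = \arg\max_{w} \log P(\mathcal{D} \mid w) + \log P(w)
                       % MAP: the prior acts as a regularizer
P(\hat{y} \mid \hat{x}) = \mathbb{E}_{P(w \mid \mathcal{D})}\!\left[ P(\hat{y} \mid \hat{x}, w) \right]
                       % ideal test distribution: average over the posterior
```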
Why?
- Regularization
- Understanding network uncertainty
- Cheap model averaging
- Exploration in reinforcement learning (contextual bandits)
Outline: Variational Approximation (Setting) · Gradients for All (Contribution) · The Posterior and the Prior (Details) · An Algorithm (Details) · Experiments (Results)
Computing the Distribution: the posterior over weights is defined by Bayes' rule… but it is intractable.
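Concretely, Bayes' rule gives the posterior, but the normalizing integral over all weight settings of a neural network cannot be computed:

```latex
P(w \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid w)\, P(w)}{\int P(\mathcal{D} \mid w')\, P(w')\, \mathrm{d}w'}
```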
Variational Approximation: approximate the intractable posterior with a tractable family q(w | θ), one independent Gaussian per weight; θ are the parameters of the Gaussians.
Objective
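The variational parameters are found by minimizing the KL divergence to the true posterior, which is equivalent to minimizing the variational free energy:

```latex
\theta^{*} = \arg\min_{\theta} \mathrm{KL}\!\left[ q(w \mid \theta) \,\middle\|\, P(w \mid \mathcal{D}) \right]
           = \arg\min_{\theta} \; \mathrm{KL}\!\left[ q(w \mid \theta) \,\middle\|\, P(w) \right]
             - \mathbb{E}_{q(w \mid \theta)}\!\left[ \log P(\mathcal{D} \mid w) \right]
```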
Why? Minimum Description Length
Another Expression: the objective splits into a Complexity Cost and a Likelihood Cost.
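Writing both terms as a single expectation makes the split explicit and sets up the Monte Carlo estimator used later:

```latex
\mathcal{F}(\mathcal{D}, \theta)
  = \underbrace{\mathrm{KL}\!\left[ q(w \mid \theta) \,\middle\|\, P(w) \right]}_{\text{complexity cost}}
    \;\underbrace{-\,\mathbb{E}_{q(w \mid \theta)}\!\left[ \log P(\mathcal{D} \mid w) \right]}_{\text{likelihood cost}}
  = \mathbb{E}_{q(w \mid \theta)}\!\left[ \log q(w \mid \theta) - \log P(w) - \log P(\mathcal{D} \mid w) \right]
```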
Minimum Description Length: bits to describe w given the prior, plus bits to transmit the targets given the inputs by encoding them with the network. (Honkela and Valpola, 2004; Hinton and Van Camp, 1993; Graves, 2011)
Outline: Variational Approximation (Setting) · Gradients for All (Contribution) · The Posterior and the Prior (Details) · An Algorithm (Details) · Experiments (Results)
Goal: unbiased gradients of the variational objective with respect to the variational parameters θ.
Previous Approach (Graves, NIPS 2011): directly approximate the expectations in closed form for each prior/posterior pair (e.g., Gaussians). This requires a new derivation for every choice of prior and posterior, and the resulting gradients are potentially biased!
Re-parameterization
Unbiased Gradients
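The key result (Proposition 1 in the paper): if w = t(θ, ε) is a deterministic transform of parameter-free noise ε ~ q(ε) with q(ε) dε = q(w | θ) dw, then for any differentiable f,

```latex
\frac{\partial}{\partial \theta}\, \mathbb{E}_{q(w \mid \theta)}\!\left[ f(w, \theta) \right]
  = \mathbb{E}_{q(\epsilon)}\!\left[
      \frac{\partial f(w, \theta)}{\partial w}\,\frac{\partial w}{\partial \theta}
      + \frac{\partial f(w, \theta)}{\partial \theta}
    \right],
  \qquad w = t(\theta, \epsilon)
```

so even a single Monte Carlo sample of ε gives an unbiased gradient estimate.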
Outline: Variational Approximation (Setting) · Gradients for All (Contribution) · The Posterior and the Prior (Details) · An Algorithm (Details) · Experiments (Results)
The Prior – Scale Mixture of Gaussians: with unbiased re-parameterized gradients we don't have to derive a specific approximation for each prior; we just need to evaluate log P(w) (and its gradient) pointwise.
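The prior used in the paper is a fixed two-component scale mixture, one broad and one narrow component:

```latex
P(w) = \prod_{j} \left[ \pi\, \mathcal{N}(w_j \mid 0, \sigma_1^2) + (1 - \pi)\, \mathcal{N}(w_j \mid 0, \sigma_2^2) \right],
\qquad \sigma_1 > \sigma_2, \quad \sigma_2 \ll 1
```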
The Posterior – Independent Gaussians
The Posterior – Re-Parameterization: learn θ = (μ, ρ), with σ = log(1 + exp(ρ)) so the standard deviation is always positive.
The Posterior – Sampling with Noise
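Putting the pieces together, a weight sample is produced by shifting and scaling unit Gaussian noise (∘ is pointwise multiplication):

```latex
\epsilon \sim \mathcal{N}(0, I), \qquad
w = \mu + \log\!\left(1 + e^{\rho}\right) \circ \epsilon, \qquad
\theta = (\mu, \rho)
```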
Outline: Variational Approximation (Setting) · Gradients for All (Contribution) · The Posterior and the Prior (Details) · An Algorithm (Details) · Experiments (Results)
Learning: (Sample) draw w ~ q(w | θ); (Update) compute the gradient step using the sampled w. Repeat for every minibatch.
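As a concrete illustration, here is a minimal PyTorch sketch of this sample-then-update loop for a single 2-to-1 linear map; the specifics (shapes, learning rate, toy data, and a single-Gaussian prior instead of the paper's scale mixture) are assumptions for the sketch, not details from the paper. Autograd backpropagates through the sampled w, which is exactly the re-parameterized gradient above.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal

torch.manual_seed(0)

# Variational parameters theta = (mu, rho) for a 2 -> 1 linear map (bias omitted for brevity).
mu = torch.zeros(1, 2, requires_grad=True)
rho = torch.full((1, 2), -3.0, requires_grad=True)  # sigma = softplus(rho) starts small

prior = Normal(0.0, 1.0)  # assumption: single Gaussian prior (the paper uses a scale mixture)
opt = torch.optim.SGD([mu, rho], lr=1e-2)

x = torch.randn(8, 2)            # toy inputs
y = x.sum(dim=1, keepdim=True)   # toy regression targets

for step in range(100):
    opt.zero_grad()
    # (Sample): w = mu + softplus(rho) * eps,  eps ~ N(0, I)
    eps = torch.randn_like(mu)
    sigma = F.softplus(rho)
    w = mu + sigma * eps
    # f(w, theta) = log q(w | theta) - log P(w) - log P(D | w)
    log_q = Normal(mu, sigma).log_prob(w).sum()
    log_prior = prior.log_prob(w).sum()
    log_lik = Normal(x @ w.t(), 1.0).log_prob(y).sum()
    loss = log_q - log_prior - log_lik
    # (Update): backprop through the sampled w gives an unbiased gradient of F(D, theta)
    loss.backward()
    opt.step()
```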
Outline: Variational Approximation (Setting) · Gradients for All (Contribution) · The Posterior and the Prior (Details) · An Algorithm (Details) · Experiments (Results)
MNIST Classification
MNIST Test Error
Convergence Rate
Weight Histogram: note that the weight distribution learned by vanilla SGD looks like a Gaussian, so a Gaussian prior isn't a bad idea.
Signal-to-Noise Ratio (|μ| / σ): each weight is one data point.
Weight Pruning: the signal-to-noise histogram is bimodal (Peak 1 and Peak 2); low-SNR weights are candidates for pruning.
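The pruning rule ranks weights by this signal-to-noise ratio and removes the lowest ones. A minimal sketch, assuming mu and rho tensors as in the training sketch above (the function name and interface are illustrative):

```python
import torch
import torch.nn.functional as F

def prune_by_snr(mu: torch.Tensor, rho: torch.Tensor, drop_fraction: float) -> torch.Tensor:
    """Zero out the weights with the lowest signal-to-noise ratio |mu| / sigma."""
    sigma = F.softplus(rho)                       # sigma = log(1 + exp(rho))
    snr = mu.abs() / sigma
    k = int(drop_fraction * snr.numel())          # number of weights to drop
    if k == 0:
        return mu.clone()
    threshold = snr.flatten().kthvalue(k).values  # k-th smallest SNR
    return torch.where(snr > threshold, mu, torch.zeros_like(mu))
```

After pruning, the remaining means μ can be used as an ordinary deterministic network; the paper reports that a large fraction of weights can be removed this way with little loss in accuracy.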
Regression
Does uncertainty in weights lead to uncertainty in outputs?
Bayes by Backprop vs. Standard NN (regression figure): blue and purple shading indicates quartiles, red is the median, black crosses are training data.
Exploration in Bandit Problems
UCI Mushroom Dataset: 22 attributes, 8124 examples.
Actions: "edible" (e), E[reward] = 5; "unknown" (u), E[r] = 0; "poisonous" (p), E[r] = -15.
Image: https://archive.ics.uci.edu/ml/datasets/Mushroom
Classification vs. Contextual Bandit: a classifier maps x to P(y=e), P(y=u), P(y=p), one output per class; the bandit network maps (x, action) to E[r], one input per action with a single reward output. Cross entropy naturally judges all predictions at once, while the bandit only observes the reward of the action it took.
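One way to realize the "one input per action" design is to feed the network the context concatenated with a one-hot action encoding; this helper is purely illustrative (the name and the exact encoding are assumptions, not from the paper):

```python
import torch

def bandit_input(context: torch.Tensor, action: int, n_actions: int) -> torch.Tensor:
    """Concatenate the mushroom's features with a one-hot encoding of the action.

    The network then has a single output, E[reward | context, action],
    instead of one output per class as in classification.
    """
    one_hot = torch.zeros(n_actions)
    one_hot[action] = 1.0
    return torch.cat([context, one_hot])
```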
Thompson Sampling: sample w ~ q(w | θ), act greedily under the sampled weights, observe the reward, then update θ.
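In code, one round of Thompson sampling with the variational posterior looks roughly like this; every argument is a placeholder standing in for a component of the agent, not an interface from the paper:

```python
def thompson_step(context, actions, sample_weights, expected_reward, take_action, update_posterior):
    """One round of Thompson sampling with the variational posterior q(w | theta).

    Placeholder arguments:
      sample_weights():           draw w ~ q(w | theta)
      expected_reward(x, a, w):   network's predicted E[r] for context x, action a, weights w
      take_action(a):             act in the environment, return the observed reward
      update_posterior(x, a, r):  one gradient step on the variational objective
    """
    w = sample_weights()                                               # sample from the posterior
    best = max(actions, key=lambda a: expected_reward(context, a, w))  # act greedily under w
    reward = take_action(best)                                         # observe reward
    update_posterior(context, best, reward)                            # learn from (x, a, r)
    return best, reward
```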
Contextual Bandit Results: greedy does not explore for 1000 steps; Bayes by Backprop explores.
Conclusion
- A somewhat general procedure for approximating the NN posterior
- Unbiased gradients
- Could help with RL
Next: Dropout as a GP