CSC321: Lecture 8: The Bayesian way to fit models Geoffrey Hinton

The Bayesian framework
The Bayesian framework assumes that we always have a prior distribution for everything.
– The prior may be very vague.
– When we see some data, we combine our prior distribution with a likelihood term to get a posterior distribution.
– The likelihood term takes into account how probable the observed data is given the parameters of the model.
– It favors parameter settings that make the data likely, so it fights the prior. With enough data, the likelihood term always wins.

A coin tossing example
Suppose you know nothing about coins except that each tossing event produces a head with some unknown probability p and a tail with probability 1-p.
Suppose we observe 100 tosses and there are 53 heads. What is p?
The frequentist answer: pick the value of p that makes the observation of 53 heads and 47 tails most probable (see the sketch below).
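As a quick numerical check (my own sketch, not part of the original slides), the frequentist / maximum-likelihood answer can be found by evaluating the binomial log-likelihood on a grid of candidate values of p; the peak is at p = 53/100 = 0.53.

```python
import numpy as np

# Log-likelihood of p for 53 heads and 47 tails in 100 independent tosses
# (the binomial coefficient does not depend on p, so it is dropped).
heads, tails = 53, 47
p_grid = np.linspace(0.001, 0.999, 999)
log_lik = heads * np.log(p_grid) + tails * np.log(1.0 - p_grid)

print(p_grid[np.argmax(log_lik)])   # ~0.53, i.e. heads / (heads + tails)
```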

Some problems with picking the parameters that are most likely to generate the data
What if we only tossed the coin once and we got 1 head?
– Is p=1 a sensible answer? Surely p=0.5 is a much better answer.
Is it reasonable to give a single answer?
– If we don't have much data, we are unsure about p.
– Our computations will work much better if we take this uncertainty into account.

Using a distribution over parameters
Start with a prior distribution over p.
Multiply the prior probability of each parameter value by the probability of observing a head given that value.
Then renormalize to get the posterior distribution.
[Figure: probability density over p on the interval [0, 1]; the prior and the renormalized posterior each have area 1.]

Let's do it again
Start with a prior distribution over p — this time, the posterior from the previous step.
Multiply the prior probability of each parameter value by the probability of observing a tail given that value.
Then renormalize to get the posterior distribution.
[Figure: the updated probability density over p, again with area 1.]

Let's do it another 98 times
After 53 heads and 47 tails we get a very sensible posterior distribution that has its peak at 0.53 (assuming a uniform prior), as in the sketch below.
[Figure: the posterior density over p after 100 tosses, peaked at 0.53, with area 1.]
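A minimal sketch (mine, not from the slides) of the multiply-and-renormalize procedure on a grid of candidate values of p, starting from a uniform prior; after 53 heads and 47 tails the posterior peaks at about 0.53, as the slide says.

```python
import numpy as np

p = np.linspace(0.001, 0.999, 999)       # grid of candidate values for p
posterior = np.ones_like(p)
posterior /= posterior.sum()             # uniform prior over the grid

# One multiply-and-renormalize step per toss; the order of the tosses does not matter.
for outcome in [1] * 53 + [0] * 47:      # 53 heads and 47 tails
    posterior *= p if outcome == 1 else (1.0 - p)
    posterior /= posterior.sum()         # renormalize after each observation

print(p[np.argmax(posterior)])           # peak of the posterior, ~0.53
```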

Bayes Theorem
For a weight vector W and observed data D:

p(W | D) = p(W) p(D | W) / p(D)

– p(W): prior probability of weight vector W
– p(D | W): probability of the observed data given W
– p(W | D): posterior probability of weight vector W
– The numerator is the joint probability p(W, D); dividing it by p(D) = Σ_W p(W) p(D | W) turns the joint probability into a conditional probability.

Why we maximize sums of log probs
We want to maximize products of probabilities of a set of independent events:
– Assume the output errors on different training cases are independent.
– Assume the priors on the weights are independent.
Because the log function is monotonic, we can instead maximize sums of log probabilities (see the illustration below).
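A small illustration (my own, not on the slide) of why the sum of log probabilities is used in practice: the product of many probabilities underflows in floating point, while the sum of their logs stays well behaved, and because log is monotonic both have the same maximizer.

```python
import numpy as np

rng = np.random.default_rng(0)
probs = rng.uniform(0.01, 0.99, size=2000)   # probabilities of 2000 independent events

print(np.prod(probs))          # underflows to 0.0 in double precision
print(np.log(probs).sum())     # a large negative but perfectly usable number
```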

The Bayesian interpretation of weight decay
Maximizing the posterior p(W | D) is the same as minimizing −log p(D | W) − log p(W) (up to a term that does not depend on W).
– Assuming that the model makes a Gaussian prediction: −log p(D | W) = (1 / 2σ_D²) Σ_c (y_c − d_c)² + const.
– Assuming a Gaussian prior for the weights: −log p(W) = (1 / 2σ_W²) Σ_i w_i² + const.
Multiplying through by 2σ_D², the cost being minimized is Σ_c (y_c − d_c)² + (σ_D²/σ_W²) Σ_i w_i²: squared error plus weight decay, with weight-decay coefficient σ_D²/σ_W². Here y_c is the model's prediction and d_c the correct answer on training case c, and σ_D, σ_W are the standard deviations of the Gaussian prediction and the Gaussian prior.

Maximum Likelihood Learning
Minimizing the squared residuals is equivalent to maximizing the log probability of the correct answers under a Gaussian centered at the model's guess. This is Maximum Likelihood (written out in symbols below).
[Figure: a Gaussian centered at the model's prediction y, with the correct answer d marked on the same axis.]
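In symbols (my notation; the slide makes this point with the picture of a Gaussian centered at y): for a Gaussian of fixed variance σ² centered at the model's prediction,

$$-\log p(d \mid y) \;=\; \frac{(d - y)^2}{2\sigma^2} \;+\; \tfrac{1}{2}\log\!\left(2\pi\sigma^2\right),$$

so, summed over independent training cases, maximizing the log probability of the correct answers is exactly minimizing the sum of squared residuals (the remaining term does not depend on the model's prediction).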

Maximum A Posteriori Learning
This trades off the prior probabilities of the parameters against the probability of the data given the parameters. It looks for the parameters that have the greatest product of the prior term and the likelihood term.
Minimizing the squared weights is equivalent to maximizing the log probability of the weights under a zero-mean Gaussian prior (see the sketch below).
[Figure: a zero-mean Gaussian prior density p(w) over a weight w.]
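A minimal sketch of the correspondence on a linear model, with made-up data and assumed noise and prior standard deviations σ_D and σ_W (none of this is from the slides): gradient descent on the negative log posterior lands on the same weights as the closed-form weight-decay (ridge) solution with coefficient σ_D²/σ_W².

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))                  # 50 training cases, 3 inputs (made-up data)
w_true = np.array([1.5, -2.0, 0.5])
sigma_d, sigma_w = 0.3, 1.0                   # assumed output-noise std and prior std
d = X @ w_true + rng.normal(scale=sigma_d, size=50)

# Weight decay (ridge) with the coefficient implied by the two Gaussians.
lam = sigma_d**2 / sigma_w**2
w_decay = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ d)

# MAP: gradient descent on  -log p(d|w) - log p(w)
#   = ||X w - d||^2 / (2 sigma_d^2) + ||w||^2 / (2 sigma_w^2) + const
w = np.zeros(3)
for _ in range(5000):
    grad = X.T @ (X @ w - d) / sigma_d**2 + w / sigma_w**2
    w -= 1e-3 * grad

print(w_decay)   # the two answers agree to several decimal places
print(w)
```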

Full Bayesian Learning
Instead of trying to find the best single setting of the parameters (as in ML or MAP), compute the full posterior distribution over parameter settings.
– This is extremely computationally intensive for all but the simplest models (it's feasible for a biased coin — see the sketch below).
To make predictions, let each different setting of the parameters make its own prediction and then combine all these predictions by weighting each of them by the posterior probability of that setting of the parameters.
– This is also computationally intensive.
The full Bayesian approach allows us to use complicated models even when we do not have much data.
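For the biased coin, where full Bayesian learning is actually feasible, a minimal sketch (mine) of the prediction rule: each grid value of p makes its own prediction for the next toss (namely p itself), and the predictions are combined with weights given by the posterior probabilities.

```python
import numpy as np

p = np.linspace(0.001, 0.999, 999)            # candidate settings of the parameter p
heads, tails = 53, 47

# Posterior over the grid, assuming a uniform prior (work in logs, then normalize).
log_post = heads * np.log(p) + tails * np.log(1.0 - p)
post = np.exp(log_post - log_post.max())
post /= post.sum()

# Full Bayesian prediction that the next toss is a head:
# average each setting's prediction, weighted by its posterior probability.
print(np.sum(post * p))    # ~0.53 here (about 54/102), computed by averaging over all settings of p
```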

Overfitting: A frequentist illusion?
If you do not have much data, you should use a simple model, because a complex one will overfit.
– This is true, but only if you assume that fitting a model means choosing a single best setting of the parameters.
– If you use the full posterior over parameter settings, overfitting disappears!
– With little data, you get very vague predictions because many different parameter settings have significant posterior probability.

A classic example of overfitting
Which model do you believe?
– The complicated model fits the data better.
– But it is not economical and it makes silly predictions.
[Figure: the same few data points fit by a simple model and by a more complicated (fifth-order polynomial) model.]
But what if we start with a reasonable prior over all fifth-order polynomials and use the full posterior distribution?
– Now we get vague and sensible predictions (see the sketch below).
There is no reason why the amount of data should influence our prior beliefs about the complexity of the model.
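A sketch of the fifth-order polynomial point on toy data (my own data and my own assumed prior and noise precisions, not the figure from the slide), using standard Bayesian linear regression over the six polynomial coefficients (see e.g. Bishop, PRML, ch. 3): the predictive mean stays sensible and the error bars grow away from the few data points, i.e. vague but sensible predictions rather than a single silly curve.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=6)                 # only six training points (toy data)
t = np.sin(np.pi * x) + rng.normal(scale=0.1, size=6)

def phi(x):
    # Fifth-order polynomial features: 1, x, x^2, ..., x^5
    return np.vander(x, N=6, increasing=True)

alpha, beta = 1.0, 100.0                           # assumed prior precision and noise precision
Phi = phi(x)

# Gaussian posterior over the six coefficients: covariance S and mean m.
S = np.linalg.inv(alpha * np.eye(6) + beta * Phi.T @ Phi)
m = beta * S @ Phi.T @ t

# Predictive mean and variance at new inputs, including points outside the data range.
x_new = np.linspace(-1.5, 1.5, 7)
Phi_new = phi(x_new)
mean = Phi_new @ m
var = 1.0 / beta + np.sum((Phi_new @ S) * Phi_new, axis=1)

for xn, mu, sd in zip(x_new, mean, np.sqrt(var)):
    print(f"x = {xn:+.2f}   prediction = {mu:+.2f} +/- {sd:.2f}")
```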