Gaussian Processes For Regression, Classification, and Prediction.

How Do We Deal With Many Parameters, Little Data?
1. Regularization
   - e.g., smoothing, L1 penalty, dropout in neural nets, large K for K-nearest neighbor
2. Standard Bayesian approach
   - specify the probability of data given weights, P(D|W)
   - specify weight priors given hyperparameter α, P(W|α)
   - find the posterior over weights given data, P(W|D, α)
   - with little data, a strong weight prior constrains inference
3. Gaussian processes
   - place a prior over functions directly, rather than over model parameters W

Problems That GPs Solve
- Overfitting
  - In problems with limited data, the data themselves aren't sufficient to constrain the model predictions
- Uncertainty in prediction
  - We want not only to predict test values, but also to estimate the uncertainty in those predictions
- A convenient language for expressing domain knowledge
  - The prior is specified over function space
  - Intuitions are likely to be stronger in function space than in parameter space
  - e.g., smoothness constraints on the function, trends over time, etc.

Intuition
Figures from Rasmussen & Williams (2006), Gaussian Processes for Machine Learning

Important Ideas
- Nonparametric model
  - Complexity of the model posterior can grow as more data are added
  - Compare to polynomial regression
- Prior determined by one or more smoothness parameters
  - We will learn these parameters from the data

Gaussian Distributions: A Reminder
Figures stolen from Rasmussen's NIPS 2006 tutorial

Gaussian Distributions: A Reminder
Marginals of multivariate Gaussians are Gaussian
Figure stolen from Rasmussen's NIPS 2006 tutorial

Gaussian Distributions: A Reminder
Conditionals of multivariate Gaussians are Gaussian
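This conditioning property is what GP prediction rests on. As a concreteness check, here is a minimal NumPy sketch; the 2x2 covariance and observed value are arbitrary choices, not from the slides:

    import numpy as np

    # Zero-mean bivariate Gaussian over (f1, f2) with joint covariance K.
    # Conditioning: f2 | f1 ~ Normal(K21 K11^-1 f1, K22 - K21 K11^-1 K12).
    K = np.array([[1.0, 0.8],
                  [0.8, 1.0]])
    f1_observed = 0.5

    cond_mean = K[1, 0] / K[0, 0] * f1_observed      # 0.4
    cond_var = K[1, 1] - K[1, 0] ** 2 / K[0, 0]      # 0.36
    print(cond_mean, cond_var)

Replacing the scalars with blocks of a larger covariance matrix gives exactly the GP predictive equations used later.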

Stochastic Process
- Generalization of a multivariate distribution to an infinite number of dimensions
- Collection of random variables G(x) indexed by x
  - e.g., Dirichlet process
    - x: parameters of a mixture component
    - x: possible word or topic in topic models
- Joint probability over any finite subset of values of x is Dirichlet:
  P(G(x1), G(x2), …, G(xk)) ~ Dirichlet(.)

Gaussian Process
- Infinite dimensional Gaussian distribution, F(x)
- Draws from a Gaussian Process are functions, f ~ GP(.)
  (notation switch: the function F(x) is written f from here on)

Gaussian Process
- Joint probability over any finite subset of x is Gaussian:
  P(f(x1), f(x2), …, f(xk)) ~ Gaussian(.)
- Consider the relation between 2 points on the function
  - The value of one point constrains the value of the second point

Gaussian Process
How do points covary as a function of their distance?

Gaussian Process
- A multivariate Gaussian distribution is defined by a mean vector and a covariance matrix
- A GP is fully defined by a mean function and a covariance function
- Example: squared exponential covariance
  k(x, x') = exp( -(x - x')^2 / (2 l^2) ),  where l is the length scale
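A minimal sketch of this covariance function and of sampling functions from the GP prior, using plain NumPy; the input grid, jitter term, and number of samples are illustrative choices:

    import numpy as np

    def squared_exponential(x1, x2, length_scale=1.0):
        # k(x, x') = exp(-(x - x')^2 / (2 l^2)); unit signal variance assumed
        diff = x1[:, None] - x2[None, :]
        return np.exp(-0.5 * (diff / length_scale) ** 2)

    # Any finite set of inputs is jointly Gaussian under the GP, so "sampling a
    # function" means sampling that finite-dimensional Gaussian on a dense grid.
    x = np.linspace(-5, 5, 100)
    K = squared_exponential(x, x) + 1e-8 * np.eye(len(x))   # jitter for numerical stability
    prior_samples = np.random.multivariate_normal(np.zeros(len(x)), K, size=3)

Shorter length scales give wigglier sample functions; longer ones give smoother functions.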

Inference
- GP specifies a prior over functions, F(x)
- Suppose we have a set of observations:
  D = {(x1, y1), (x2, y2), (x3, y3), …, (xn, yn)}
- Standard Bayesian approach: p(F|D) ∝ p(D|F) p(F)
- If data are noise free…
  - P(D|F) = 0 if any f(xi) ≠ yi
  - P(D|F) = 1 otherwise

Graphical Model Depiction of Gaussian Process Inference

GP Prior
f ~ GP(m, k), where m is the mean function and k is the covariance function
GP Posterior Predictive Distribution
Given noise-free training data (X, y) and test inputs X*, the predictive distribution is Gaussian with
  mean:        K(X*, X) K(X, X)^-1 y
  covariance:  K(X*, X*) - K(X*, X) K(X, X)^-1 K(X, X*)
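A sketch of these noise-free predictive equations in NumPy; the training points, kernel settings, and jitter are illustrative, not taken from the slides:

    import numpy as np

    def sq_exp(a, b, length_scale=1.0):
        return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length_scale) ** 2)

    # Illustrative noise-free observations
    x_train = np.array([-4.0, -1.0, 2.0, 3.5])
    y_train = np.sin(x_train)
    x_test = np.linspace(-5, 5, 50)

    K = sq_exp(x_train, x_train) + 1e-10 * np.eye(len(x_train))  # jitter only, no noise
    K_s = sq_exp(x_test, x_train)                                 # K(X*, X)
    K_ss = sq_exp(x_test, x_test)                                 # K(X*, X*)

    K_inv = np.linalg.inv(K)
    pred_mean = K_s @ K_inv @ y_train          # K(X*, X) K(X, X)^-1 y
    pred_cov = K_ss - K_s @ K_inv @ K_s.T      # K(X*, X*) - K(X*, X) K(X, X)^-1 K(X, X*)
    pred_std = np.sqrt(np.clip(np.diag(pred_cov), 0.0, None))  # error bars (clip numerical negatives)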

Observation Noise
What if the observations are noisy, y = f(x) + ε with ε ~ Normal(0, σ²)?
Covariance function becomes
  cov(y_p, y_q) = k(x_p, x_q) + σ² δ_pq
Posterior predictive distribution becomes the same as above, with K(X, X) replaced by K(X, X) + σ² I
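The corresponding change in code is small: σ² is added to the diagonal of the training covariance and the same predictive formulas are reused. A self-contained sketch with made-up data and an assumed noise level:

    import numpy as np

    def sq_exp(a, b, length_scale=1.0):
        return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length_scale) ** 2)

    sigma = 0.1                                          # assumed noise standard deviation
    x_train = np.array([-4.0, -1.0, 2.0, 3.5])           # illustrative data
    y_train = np.sin(x_train) + sigma * np.random.randn(len(x_train))
    x_test = np.linspace(-5, 5, 50)

    # Only change from the noise-free case: sigma^2 I added to the training covariance.
    K = sq_exp(x_train, x_train) + sigma ** 2 * np.eye(len(x_train))
    K_s = sq_exp(x_test, x_train)
    pred_mean = K_s @ np.linalg.solve(K, y_train)
    pred_cov = sq_exp(x_test, x_test) - K_s @ np.linalg.solve(K, K_s.T)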

Demo
Andreas Geiger's demo (Mike: run in Firefox)

A Two-Dimensional Input Space

What About The Length Scale?

Full-Blown Squared-Exponential Covariance Function Has Three Parameters
(typically the signal variance, the length scale, and the observation-noise variance)
How do we determine the parameters?
- Prior knowledge
- Maximum likelihood
- Bayesian inference
Potentially many other free parameters
- e.g., mean function
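One way to realize the maximum-likelihood option is to maximize the log marginal likelihood over the kernel parameters. Here is a short sketch using scikit-learn (a stand-in for the GPML toolbox mentioned below), with an illustrative kernel and made-up data:

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel, ConstantKernel

    # Made-up 1-D regression data
    rng = np.random.default_rng(0)
    X = np.linspace(-4, 4, 25).reshape(-1, 1)
    y = np.sin(X).ravel() + 0.1 * rng.standard_normal(25)

    # signal variance * squared exponential(length scale) + noise variance:
    # three parameters, fit by maximizing the log marginal likelihood.
    kernel = ConstantKernel(1.0) * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
    gp = GaussianProcessRegressor(kernel=kernel).fit(X, y)

    print(gp.kernel_)                              # learned hyperparameters
    mean, std = gp.predict(X, return_std=True)     # predictions with error bars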

Many Forms For Covariance Function
(table from the GPML toolbox)

Many Forms For Likelihood Function
What is the likelihood function?
- the noise-free case we discussed
- the noisy-observation case we discussed
With these likelihoods, inference is exact
- the Gaussian likelihood is conjugate to the Gaussian process prior
- equivalent to the covariance function incorporating the noise

Customizing Likelihood Function To Problem
Suppose that instead of real-valued observations y, the observations are binary {0, 1}, e.g., class labels
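With a Bernoulli likelihood and, e.g., a logistic or probit link, the posterior is no longer Gaussian and exact inference is lost, hence the approximate techniques on the next slide. A minimal illustration using scikit-learn's GP classifier, which applies a Laplace approximation; the data are made up:

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessClassifier
    from sklearn.gaussian_process.kernels import RBF

    # Illustrative binary labels: class 1 whenever the input is positive
    X = np.linspace(-3, 3, 30).reshape(-1, 1)
    y = (X.ravel() > 0).astype(int)

    # Latent GP squashed through a link function; the non-Gaussian posterior
    # is handled with a Laplace approximation inside scikit-learn.
    gpc = GaussianProcessClassifier(kernel=RBF(length_scale=1.0)).fit(X, y)
    class_probs = gpc.predict_proba(np.array([[-1.0], [0.5]]))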

Many Approximate Inference Techniques For GPs
Choice depends on the likelihood function
(table from the GPML toolbox)

What Do GPs Buy Us?
- Choice of covariance and mean functions seems a good way to incorporate prior knowledge
  - e.g., periodic covariance function
- As with all Bayesian methods, the greatest benefit comes when the problem is data limited
- Predictions with error bars
- An elegant way of doing global optimization…
  - Next class