PAC-Bayesian Theorems for Gaussian Process Classifications Matthias Seeger University of Edinburgh

Overview
PAC-Bayesian theorem for Gibbs classifiers
Application to Gaussian process classification
Experiments
Conclusions

What Is a PAC Bound?
Algorithm: sample S → predictor t* for a test input x*
Generalisation error: gen(S)
Sample S = {(x_i, t_i) | i = 1, …, n}, drawn i.i.d. from an unknown distribution P*
PAC / distribution-free bound: must hold whatever P* is
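The bound itself was a formula on the slide and is not reproduced here; as a reconstruction, the generic shape of such a statement is:

```latex
% Reconstruction of the generic statement (not the slide's exact expression):
% with confidence 1-\delta over the i.i.d. sample S, and for every data
% distribution P_*,
P_S\bigl\{\, \mathrm{gen}(S) \le \mathrm{bound}(S,\delta) \,\bigr\} \;\ge\; 1 - \delta .
```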

Nonuniform PAC Bounds
A PAC bound has to hold independently of whether the prior knowledge is correct
It does not have to be independent of the prior knowledge itself
Unfortunately, most standard VC bounds depend only vaguely on the prior/model they are applied to, and therefore lack tightness

Gibbs Classifiers
Bayes classifier: predict with the posterior-averaged discriminant (average over w first, then threshold once)
Gibbs classifier: draw a new, independent w for each prediction
[Figure: graphical model linking w ∈ R^3 to latent outputs y_1, y_2, y_3 and binary labels t_1, t_2, t_3 ∈ {-1, +1}]
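As an aside, not from the slides: a minimal Python sketch of the Bayes-vs-Gibbs distinction, assuming we hold Monte Carlo samples from the (approximate) posterior Q and an illustrative linear discriminant y(x) = w·x.

```python
# Illustrative sketch (not from the talk): Bayes vs. Gibbs prediction given
# samples w_1, ..., w_m from the (approximate) posterior Q over weights.
import numpy as np

def bayes_predict(W, x):
    """Bayes classifier: average the discriminant over Q, then threshold once."""
    return np.sign(np.mean(W @ x))

def gibbs_predict(W, x, rng):
    """Gibbs classifier: draw one fresh w ~ Q for this prediction, then threshold."""
    w = W[rng.integers(len(W))]
    return np.sign(w @ x)

rng = np.random.default_rng(0)
W = rng.normal(size=(1000, 3))          # stand-in for samples from Q(w), w in R^3
x_star = np.array([0.5, -1.0, 2.0])     # one test input
print(bayes_predict(W, x_star), gibbs_predict(W, x_star, rng))
```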

PAC-Bayesian Theorem
Result for Gibbs classifiers
Prior P(w), independent of S
Posterior Q(w), may depend on S
Expected generalisation error: gen(S, Q), the Q-average of the generalisation error over w
Expected empirical error: emp(S, Q), the Q-average of the training error over w
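The two defining formulas were images on the slide; the standard definitions they refer to are, as a reconstruction (y_w denotes the latent discriminant associated with w):

```latex
\mathrm{gen}(S,Q) \;=\; \mathbb{E}_{w\sim Q}\Bigl[\, P_{(x_*,t_*)\sim P_*}\{\, t_* \ne \mathrm{sgn}\, y_w(x_*) \,\}\,\Bigr],
\qquad
\mathrm{emp}(S,Q) \;=\; \mathbb{E}_{w\sim Q}\Bigl[\, \tfrac{1}{n}\textstyle\sum_{i=1}^{n} \mathbf{1}\{\, t_i \ne \mathrm{sgn}\, y_w(x_i) \,\}\,\Bigr].
```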

PAC-Bayesian Theorem (II)
McAllester (1999): gen(S, Q) is bounded in terms of emp(S, Q) plus a complexity term involving D[Q || P]
D[Q || P]: relative entropy between posterior and prior
If Q(w) is a feasible approximation to the Bayesian posterior, we can compute D[Q || P]
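The statement itself was an image on the slide. One commonly cited form of McAllester's theorem reads as follows; the exact constants differ slightly between published versions, so treat this as indicative rather than as the slide's formula:

```latex
% With probability at least 1-\delta over S, simultaneously for all posteriors Q:
\mathrm{gen}(S,Q) \;\le\; \mathrm{emp}(S,Q)
  \;+\; \sqrt{\frac{D[Q\,\|\,P] + \ln\frac{n}{\delta} + 2}{\,2n-1\,}} .
```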

The Proof Idea
Step 1: an inequality for a dumb classifier
Let Δ(w) measure the deviation of the empirical error of a fixed classifier w from its generalisation error
A large deviation bound holds for fixed w (use the asymptotic equipartition property)
Since P(w) is independent of S, the bound also holds "on average" over w drawn from P

The Proof Idea (II)
Could use Jensen's inequality here. But so what? P is fixed a priori, giving a pretty dumb classifier!
Can we exchange P for Q? Yes!
What do we have to pay? n^{-1} D[Q || P]

Convex Duality
Could finish the proof using tricks and Jensen. Let's see what's behind it instead!
Convex (Legendre) duality: a very simple but powerful concept: parameterise the linear lower bounds to a convex function
Behind the scenes (almost) everywhere: EM, variational bounds, primal-dual optimisation, …, the PAC-Bayesian theorem
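For concreteness, the scalar version of the duality appealed to here (a standard identity, not taken from the slides):

```latex
% A convex f is the upper envelope of its linear lower bounds, parameterised
% by the slope \lambda via the conjugate f^*:
f^*(\lambda) \;=\; \sup_{x}\,\bigl(\lambda x - f(x)\bigr),
\qquad
f(x) \;=\; \sup_{\lambda}\,\bigl(\lambda x - f^*(\lambda)\bigr).
```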

Convex Duality (II)

Convex Duality (III)

The Proof Idea (III)
This works just as well for spaces of functions and distributions
For our purpose: the relevant functional is convex and has the relative entropy as its dual (see the sketch below)
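The functional and its dual were formulas on the slide; a plausible reading, based on the standard variational identity behind PAC-Bayesian proofs, is:

```latex
% The log moment-generating functional of \phi under P is convex in \phi, and
% its convex dual is the relative entropy, so that
\log \mathbb{E}_{w\sim P}\bigl[e^{\phi(w)}\bigr]
  \;=\; \sup_{Q}\,\Bigl\{\, \mathbb{E}_{w\sim Q}[\phi(w)] - D[Q\,\|\,P] \,\Bigr\},
\qquad
\mathbb{E}_{w\sim Q}[\phi(w)] \;\le\; D[Q\,\|\,P] + \log \mathbb{E}_{w\sim P}\bigl[e^{\phi(w)}\bigr]
\;\; \text{for all } Q .
```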

The Proof Idea (IV)
This gives the bound for all Q simultaneously
Set φ(w) = n Δ(w). Then: the second term on the right has already been bounded, and on the left (Jensen again) the Q-expectation can be pulled inside Δ
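Putting the pieces together, a hedged reconstruction of the missing algebra, with C(n) standing in for the Q-independent constant delivered by the large deviation bound of Step 1:

```latex
% Duality with \phi(w) = n\,\Delta(w), for all Q and with probability at least
% 1-\delta over S:
n\, \mathbb{E}_{w\sim Q}[\Delta(w)]
  \;\le\; D[Q\,\|\,P] + \log \mathbb{E}_{w\sim P}\bigl[e^{n\,\Delta(w)}\bigr]
  \;\le\; D[Q\,\|\,P] + \log\frac{C(n)}{\delta}.
% Jensen's inequality (convexity of \Delta) then lower-bounds the left-hand side
% by n times the deviation between emp(S,Q) and gen(S,Q), giving the theorem.
```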

Comments
The PAC-Bayesian technique is generic: use specific large deviation bounds for the Q-independent term
Choice of Q: trade-off between emp(S, Q) and the divergence D[Q || P]. The Bayesian posterior is a good candidate

Gaussian Process Classification
Recall yesterday: we approximate the true posterior process by a Gaussian one
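Written out as a reconstruction (the formula was an image on the slide), the approximation in question is roughly:

```latex
% The exact, non-Gaussian posterior over the latent values is replaced by a Gaussian:
P(y \mid S) \;\propto\; P(y)\,\prod_{i=1}^{n} P(t_i \mid y_i)
\;\;\approx\;\; Q(y) \;=\; \mathcal{N}(y \mid \mu, \Sigma).
```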

The Relative Entropy
But then the relative entropy is just the KL divergence between two Gaussians
Straightforward to compute for all GPC approximations in this class
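A minimal Python sketch of that computation, assuming the approximate posterior over the n training latents is N(mu_q, Sigma_q) and the GP prior is N(0, K); the function and variable names are illustrative, not from the talk.

```python
# Sketch: relative entropy D[Q || P] between the Gaussian approximation
# Q = N(mu_q, Sigma_q) and the GP prior P = N(0, K) on the training latents.
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def kl_gauss_vs_prior(mu_q, Sigma_q, K):
    """0.5 * ( tr(K^-1 Sigma_q) + mu_q' K^-1 mu_q - n + log|K| - log|Sigma_q| )."""
    n = len(mu_q)
    cK = cho_factor(K, lower=True)            # Cholesky of the prior covariance
    cQ = np.linalg.cholesky(Sigma_q)
    trace_term = np.trace(cho_solve(cK, Sigma_q))
    quad_term = mu_q @ cho_solve(cK, mu_q)
    logdet_K = 2.0 * np.sum(np.log(np.diag(cK[0])))
    logdet_Q = 2.0 * np.sum(np.log(np.diag(cQ)))
    return 0.5 * (trace_term + quad_term - n + logdet_K - logdet_Q)

# toy usage with a random symmetric positive definite "kernel matrix"
rng = np.random.default_rng(1)
A = rng.normal(size=(5, 5))
K = A @ A.T + 5.0 * np.eye(5)
print(kl_gauss_vs_prior(np.zeros(5), 0.5 * K, K))
```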

Concrete GPC Methods
We considered so far:
Laplace GPC [Barber/Williams]
Sparse greedy GPC (IVM) [Csato/Opper, Lawrence/Seeger/Herbrich]
Setup: downsampled MNIST (2s vs. 3s), RBF kernels, model selection using independent holdout sets (no ML-II allowed here!)

Results for Laplace GPC

Results Sparse Greedy GPC
Extremely tight for a kernel classifier bound
Note: these results are for Gibbs classifiers. Bayes classifiers do better, but the (original) PAC-Bayesian theorem does not hold for them

Comparison: Compression Bound
Compression bound for sparse greedy GPC (Bayes version, not Gibbs)
Problem: the bound is not configurable by prior knowledge and not specific to the algorithm

Comparison With SVM
Compression bound (the best we could find!)
Note: the bound values are lower than for sparse GPC only because of the sparser solution: the bound does not depend on the algorithm!

Model Selection

The Bayes Classifier
Very recently, Meir and Zhang obtained a PAC-Bayesian bound for Bayes-type classifiers
It uses recent Rademacher complexity bounds together with a convex duality argument
It can be applied to GP classification as well (not yet done)

Conclusions
The PAC-Bayesian technique (convex duality) leads to tighter bounds than previously available for Bayes-type classifiers (to our knowledge)
Easy extension to multi-class scenarios
Application to GP classification: tighter bounds than previously available for kernel machines (to our knowledge)

Conclusions (II)
Value in practice: the bound holds for any posterior approximation, not just the true posterior itself
Some open problems:
- Unbounded loss functions
- Characterise the slack in the bound
- Incorporating ML-II model selection over a continuous hyperparameter space