# PAC-Bayesian Theorems for Gaussian Process Classifications Matthias Seeger University of Edinburgh.


Overview

- PAC-Bayesian theorem for Gibbs classifiers
- Application to Gaussian process classification
- Experiments
- Conclusions

What Is a PAC Bound?

- Sample $S = \{(x_i, t_i) \mid i = 1, \dots, n\}$, drawn i.i.d. from an unknown distribution $P^*$
- Algorithm: $S \to$ predictor of $t_*$ from $x_*$
- Generalisation error: gen(S)
- PAC/distribution-free bound:
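The bound on this slide did not survive in the transcript. As a rough guide only, a generic PAC statement has the shape below; the confidence parameter $\delta \in (0, 1)$ and the placeholder $\mathrm{bound}(S, \delta)$ are assumptions introduced here for illustration and do not appear in the surviving text.

```latex
\Pr_{S \sim (P^*)^n}\bigl\{ \mathrm{gen}(S) \le \mathrm{bound}(S, \delta) \bigr\} \ge 1 - \delta
\quad \text{for every data distribution } P^* .
```

"Distribution free" refers to the final quantifier: the statement makes no assumption about $P^*$ beyond i.i.d. sampling.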

Nonuniform PAC Bounds

- A PAC bound has to hold independent of the correctness of prior knowledge
- It does not have to be independent of prior knowledge
- Unfortunately, most standard VC bounds depend only vaguely on the prior/model they are applied to, and therefore lack tightness

Gibbs Classifiers

- Bayes classifier:
- Gibbs classifier: new independent w for each prediction

[Figure: graphical model linking w, latent outputs $y_1, y_2, y_3 \in \mathbb{R}$, and labels $t_1, t_2, t_3 \in \{-1, +1\}$]
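The difference between the two predictors can be illustrated with a small sketch. It assumes, purely for illustration, a linear classifier sign(w·x) and a Gaussian distribution Q(w) = N(mean, cov); neither the model nor the function names come from the slides, and the Bayes classifier is approximated here by a Monte Carlo majority vote.

```python
import numpy as np

def gibbs_predict(X, mean, cov, rng):
    """Gibbs classifier: draw a fresh w ~ N(mean, cov) for every test point."""
    W = rng.multivariate_normal(mean, cov, size=X.shape[0])  # one w per row of X
    return np.where(np.sum(W * X, axis=1) >= 0, 1, -1)

def bayes_predict(X, mean, cov, rng, n_samples=2000):
    """Bayes classifier: majority vote over many w ~ N(mean, cov)."""
    W = rng.multivariate_normal(mean, cov, size=n_samples)   # (n_samples, d)
    votes = np.where(X @ W.T >= 0, 1, -1)                     # (n_points, n_samples)
    return np.where(votes.mean(axis=1) >= 0, 1, -1)

rng = np.random.default_rng(0)
mean = np.array([1.0, 0.0])          # hypothetical posterior mean
cov = 0.5 * np.eye(2)                # hypothetical posterior covariance
X = np.array([[2.0, 0.0], [-2.0, 0.0], [0.1, 1.0]])

bayes_labels = bayes_predict(X, mean, cov, rng)
gibbs_labels = gibbs_predict(X, mean, cov, rng)
```

The Gibbs predictions are random: repeating the call can flip labels for points near the decision boundary, whereas the majority vote is stable for points far from it.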

PAC-Bayesian Theorem

Result for Gibbs classifiers

- Prior P(w), independent of S
- Posterior Q(w), may depend on S
- Expected generalisation error: $\mathrm{gen}(Q) = E_{w \sim Q}[\mathrm{gen}(w)]$
- Expected empirical error: $\mathrm{emp}(S, Q) = E_{w \sim Q}[\mathrm{emp}(S, w)]$

PAC-Bayesian Theorem (II)

McAllester (1999):

- D[Q || P]: relative entropy
- If Q(w) is a feasible approximation to the Bayesian posterior, we can compute D[Q || P]
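The formula shown on this slide is not preserved. As a hedged illustration of how such a bound is evaluated, the sketch below uses the widely cited "KL" form of the theorem, D_Ber[emp(S,Q) || gen(Q)] ≤ (D[Q || P] + ln((n+1)/δ)) / n, which may differ from the exact expression on the slide. The function names and the example numbers are hypothetical.

```python
import math

def kl_bernoulli(q, p):
    """Relative entropy D_Ber[q || p] between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # keep the logarithms finite
    total = 0.0
    if q > 0.0:
        total += q * math.log(q / p)
    if q < 1.0:
        total += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return total

def kl_inverse_upper(q, budget, tol=1e-10):
    """Largest p >= q with D_Ber[q || p] <= budget, found by bisection."""
    lo, hi = q, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(q, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

def pac_bayes_bound(emp_error, kl_qp, n, delta):
    """Upper bound on gen(Q) under the assumed KL form of the theorem."""
    budget = (kl_qp + math.log((n + 1) / delta)) / n
    return kl_inverse_upper(emp_error, budget)

# Hypothetical numbers: 2% Gibbs training error, D[Q || P] = 50 nats, n = 5000.
bound = pac_bayes_bound(emp_error=0.02, kl_qp=50.0, n=5000, delta=0.01)
```

Inverting the Bernoulli relative entropy numerically, rather than relaxing it to a square-root term, keeps the bound tight when the empirical error is small.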

The Proof Idea

Step 1: Inequality for a dumb classifier

- Let […]. A large deviation bound holds for fixed w (use the Asymptotic Equipartition Property).
- Since P(w) is independent of S, the bound also holds "on average"
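The definition after "Let" is lost in the transcript. A minimal numerical sketch of a Step 1 style inequality follows, under the assumption (taken from the published KL form of the theorem, not from the slide) that the quantity is Δ(w) = D_Ber[emp(S,w) || gen(w)] and that the large deviation bound reads E_S[exp(n Δ(w))] ≤ n + 1 for any fixed w. For fixed w the number of training errors is Binomial(n, p) with p = gen(w), so the expectation can be computed exactly.

```python
import math

def kl_bernoulli(q, p):
    """Relative entropy D_Ber[q || p] for 0 < p < 1."""
    total = 0.0
    if q > 0.0:
        total += q * math.log(q / p)
    if q < 1.0:
        total += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return total

def expected_exp_deviation(n, p):
    """E_S[exp(n * D_Ber[k/n || p])] for k ~ Binomial(n, p), computed exactly."""
    total = 0.0
    for k in range(n + 1):
        prob = math.comb(n, k) * p**k * (1.0 - p)**(n - k)
        total += prob * math.exp(n * kl_bernoulli(k / n, p))
    return total

# Check the assumed inequality for a few fixed classifiers (values of p).
checks = {(n, p): expected_exp_deviation(n, p)
          for n in (10, 50) for p in (0.1, 0.3, 0.5)}
```

Each summand equals the Binomial(n, k/n) probability of observing k errors, which is at most 1, and this is why the sum over the n + 1 possible values of k cannot exceed n + 1.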

The Proof Idea (II)

Could use Jensen's inequality:

But so what? P is fixed a priori, giving a pretty dumb classifier!

- Can we exchange P for Q? Yes!
- What do we have to pay? $n^{-1}$ D[Q || P]

Convex Duality

- Could finish the proof using tricks and Jensen. Let's see what's behind it instead!
- Convex (Legendre) duality: a very simple but powerful concept. Parameterise linear lower bounds to a convex function
- Behind the scenes (almost) everywhere: EM, variational bounds, primal-dual optimisation, …, PAC-Bayesian theorem

Convex Duality (II)

Convex Duality (III)

The Proof Idea (III)

- Works just as well for spaces of functions and distributions.
- For our purpose: […] is convex and has the dual […]
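The two expressions on this slide are missing. A standard duality pair that fits the surrounding argument, stated here as an assumption, is the log-partition function log E_P[exp λ(w)] with dual D[Q || P], which gives E_Q[λ(w)] ≤ D[Q || P] + log E_P[exp λ(w)] for every Q. The sketch below checks this numerically on a finite space; all names and the random test values are illustrative.

```python
import numpy as np

def relative_entropy(q, p):
    """D[Q || P] for discrete distributions given as probability vectors."""
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

def log_partition(lam, p):
    """log E_P[exp(lam(w))] on a finite space."""
    return float(np.log(np.sum(p * np.exp(lam))))

rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(6))       # hypothetical prior P over 6 classifiers
lam = rng.normal(size=6)            # arbitrary function lam(w)

# Slack of the inequality for an arbitrary Q: should be non-negative.
q_random = rng.dirichlet(np.ones(6))
gap_random = (relative_entropy(q_random, p) + log_partition(lam, p)
              - float(q_random @ lam))

# The bound is tight at the tilted distribution Q*(w) proportional to P(w) exp(lam(w)).
q_star = p * np.exp(lam)
q_star /= q_star.sum()
gap_star = (relative_entropy(q_star, p) + log_partition(lam, p)
            - float(q_star @ lam))
```

The slack equals D[Q || Q*], so it vanishes only at the tilted distribution; this is the sense in which the dual "parameterises linear lower bounds" to the convex function.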

The Proof Idea (IV)

- This gives the bound for all Q:
- Set $\lambda(w) = n\,\Delta(w)$. Then:
- We have already bounded the second term on the right. And on the left (Jensen again):
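The displayed formulas on this slide are lost. Assuming the dual variable is $\lambda(w)$ and the per-classifier deviation is $\Delta(w) = D_{\mathrm{Ber}}[\mathrm{emp}(S, w) \,\|\, \mathrm{gen}(w)]$ (both taken from the published KL form of the theorem rather than from the transcript), the chain of steps described here can be sketched as:

```latex
\begin{align*}
n\, D_{\mathrm{Ber}}\bigl[\mathrm{emp}(S, Q) \,\|\, \mathrm{gen}(Q)\bigr]
  &\le E_{w \sim Q}\bigl[n\, \Delta(w)\bigr]
  && \text{(Jensen, joint convexity of } D_{\mathrm{Ber}}\text{)} \\
  &\le D[Q \,\|\, P] + \log E_{w \sim P}\bigl[e^{n \Delta(w)}\bigr]
  && \text{(convex duality with } \lambda(w) = n\, \Delta(w)\text{)} \\
  &\le D[Q \,\|\, P] + \log \frac{n + 1}{\delta}
  && \text{(Step 1 plus Markov, with probability} \ge 1 - \delta \text{ over } S\text{)}
\end{align*}
```

Dividing by $n$ shows the price of exchanging P for Q: the term $n^{-1} D[Q \,\|\, P]$.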

Comments

- The PAC-Bayesian technique is generic: use specific large deviation bounds for the Q-independent term
- Choice of Q: trade-off between emp(S,Q) and the divergence D[Q || P]. The Bayesian posterior is a good candidate

Gaussian Process Classification

Recall yesterday: we approximate the true posterior process by a Gaussian one:

The Relative Entropy

- But then the relative entropy is just:
- Straightforward to compute for all GPC approximations in this class
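The expression for the relative entropy is missing from the transcript. A minimal sketch follows, assuming the divergence between the approximate posterior process and the prior process reduces to the relative entropy between two finite-dimensional Gaussians over the latent values at the training inputs, Q = N(mu, Sigma) and P = N(0, K). The RBF kernel helper, the jitter, and the example posterior are hypothetical choices, not the author's setup.

```python
import numpy as np

def gaussian_relative_entropy(mu_q, cov_q, mu_p, cov_p):
    """D[N(mu_q, cov_q) || N(mu_p, cov_p)] computed via Cholesky factors."""
    d = mu_q.shape[0]
    chol_p = np.linalg.cholesky(cov_p)
    chol_q = np.linalg.cholesky(cov_q)
    a = np.linalg.solve(chol_p, chol_q)             # L_p^{-1} L_q
    trace_term = np.sum(a ** 2)                     # tr(cov_p^{-1} cov_q)
    diff = np.linalg.solve(chol_p, mu_q - mu_p)
    maha = float(diff @ diff)                       # Mahalanobis distance of the means
    logdet_p = 2.0 * np.sum(np.log(np.diag(chol_p)))
    logdet_q = 2.0 * np.sum(np.log(np.diag(chol_q)))
    return 0.5 * (trace_term + maha - d + logdet_p - logdet_q)

def rbf_kernel(X, lengthscale=1.0):
    """Squared-exponential kernel matrix for the rows of X."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq / lengthscale ** 2)

X = np.linspace(-2.0, 2.0, 5).reshape(-1, 1)        # toy training inputs
K = rbf_kernel(X) + 1e-6 * np.eye(5)                # prior covariance with jitter
mu = 0.3 * np.ones(5)                               # hypothetical posterior mean
Sigma = 0.5 * K                                     # hypothetical posterior covariance
kl_value = gaussian_relative_entropy(mu, Sigma, np.zeros(5), K)
```

The resulting value is what would be plugged into the bound as D[Q || P].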

Concrete GPC Methods

We considered so far:

- Laplace GPC [Barber/Williams]
- Sparse greedy GPC (IVM) [Csato/Opper, Lawrence/Seeger/Herbrich]

Setup: Downsampled MNIST (2s vs. 3s). RBF kernels. Model selection using independent holdout sets (no ML-II allowed here!)

Results for Laplace GPC

Results Sparse Greedy GPC

- Extremely tight for a kernel classifier bound
- Note: These results are for Gibbs classifiers. Bayes classifiers do better, but the (original) PAC-Bayesian theorem does not hold

Comparison Compression Bound

- Compression bound for sparse greedy GPC (Bayes version, not Gibbs)
- Problem: Bound not configurable by prior knowledge, not specific to the algorithm

Comparison With SVM

- Compression bound (best we could find!)
- Note: Bound values are lower than for sparse GPC only because of the sparser solution: the bound does not depend on the algorithm!

Model Selection

The Bayes Classifier

- Very recently, Meir and Zhang obtained a PAC-Bayesian bound for Bayes-type classifiers
- Uses recent Rademacher complexity bounds together with a convex duality argument
- Can be applied to GP classification as well (not yet done)

Conclusions

- PAC-Bayesian technique (convex duality) leads to tighter bounds than previously available for Bayes-type classifiers (to our knowledge)
- Easy extension to multi-class scenarios
- Application to GP classification: tighter bounds than previously available for kernel machines (to our knowledge)

Conclusions (II)

- Value in practice: Bound holds for any posterior approximation, not just the true posterior itself
- Some open problems:
  - Unbounded loss functions
  - Characterize the slack in the bound
  - Incorporating ML-II model selection over continuous hyperparameter space
