
1
**Generalization Error of Linear Neural Networks in an Empirical Bayes Approach**

Shinichi Nakajima Sumio Watanabe Tokyo Institute of Technology Nikon Corporation

2
**Contents**

- Backgrounds: regular models; unidentifiable models; superiority of Bayes to ML; what's the purpose?
- Setting: model; subspace Bayes (SB) approach
- Analysis: analysis (James-Stein estimator); solution; generalization error
- Discussion & Conclusions

3
**Regular Models (Conventional Learning Theory)**

In regular models, det(Fisher information) > 0 everywhere. Examples: mean estimation, linear regression. Notation: K = dimensionality of the parameter space, n = number of samples, x = input, y = output.

1. Asymptotic normality: the distribution of the ML estimator and the Bayes posterior are asymptotically normal, and the likelihood is (asymptotically) normal for ANY true parameter. The generalization error (GE) and free energy (FE) behave as G(n) ≈ K/(2n) and F(n) ≈ (K/2) log n, which underlies the model selection methods (AIC, BIC, MDL).
2. Asymptotic generalization error: the generalization coefficients coincide, λ(ML) = λ(Bayes) = K/2.
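The criteria named above all take a penalized log-likelihood form. As a minimal sketch (a generic illustration, not from the talk; the function names and toy values are ours):

```python
import math

def aic(log_likelihood, k):
    """Akaike Information Criterion: -2 log L + 2 K."""
    return -2.0 * log_likelihood + 2.0 * k

def bic(log_likelihood, k, n):
    """Bayesian Information Criterion (MDL takes the same form): -2 log L + K log n."""
    return -2.0 * log_likelihood + k * math.log(n)

# Lower values indicate a better trade-off between fit and model complexity.
```

Both reward fit and penalize the parameter count K; BIC/MDL penalize more heavily as the sample size n grows. All of them presuppose the asymptotic normality above, which is exactly what fails for the unidentifiable models on the next slides.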

4
**Unidentifiable models**

In unidentifiable models there exist singularities, where det(Fisher information) = 0. Examples (H = number of components): neural networks, Bayesian networks, mixture models, hidden Markov models. When the true parameter lies on the singularities (the unidentifiable set), the likelihood is NOT normal; hence:

1. Asymptotic normality does NOT hold, and no (penalized-likelihood-type) information criterion is available.

5
**Superiority of Bayes to ML**

How do singularities work in learning? When the true parameter lies on the singularities of an unidentifiable model (neural networks, Bayesian networks, mixture models, hidden Markov models), two effects compete: in ML, the increased neighborhood of the true parameter accelerates overfitting; in Bayes, the increased population of parameters denoting the true map suppresses overfitting. Since asymptotic normality does not hold, no (penalized-likelihood-type) information criterion applies, and:

2. Bayes has an advantage: G(Bayes) < G(ML).

6
**What’s the Purpose?**

Bayes provides good generalization, but is expensive: it needs Markov chain Monte Carlo. Is there any approximation with both good generalization and tractability? One candidate is variational Bayes (VB) [Hinton&vanCamp93; MacKay95; Attias99; Ghahramani&Beal00], analyzed in another paper [Nakajima&Watanabe05]. This talk proposes subspace Bayes (SB).

7
**Contents**

- Backgrounds: regular models; unidentifiable models; superiority of Bayes to ML; what's the purpose?
- Setting: model; subspace Bayes (SB) approach
- Analysis: analysis (James-Stein estimator); solution; generalization error
- Discussion & Conclusions

8
**Linear Neural Networks (LNNs)**

An LNN with M input, N output, and H hidden units computes y = BAx, where A is the input parameter matrix (H x M) and B is the output parameter matrix (N x H). Because of the trivial redundancy BA = (BT^-1)(TA) for any invertible T, the essential parameter dimensionality is K = H(M + N - H). The true map is B*A* with rank H* (H* ≤ H). The generalization coefficients of the learner are:

| | H* < H | H* = H |
| --- | --- | --- |
| ML [Fukumizu99] | λ > K/2 | λ = K/2 |
| Bayes [Aoyagi&Watanabe03] | λ < K/2 | λ = K/2 |

9
**Maximum Likelihood estimator [Baldi&Hornik95]**

The ML estimator is given by the rank-H truncated singular value decomposition: writing Q for the sample input correlation matrix and R for the sample input-output cross-correlation matrix,

BA(ML) = Σ_{h=1..H} γ_h ω_{b,h} ω_{a,h}^T Q^{-1/2},

where γ_h is the h-th largest singular value of RQ^{-1/2}, ω_{a,h} the corresponding right singular vector, and ω_{b,h} the corresponding left singular vector.
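Numerically, this truncated-SVD solution coincides with reduced-rank regression. A minimal sketch under assumed conditions (full-column-rank inputs, Gaussian noise); the function name `reduced_rank_ml` is ours, not the paper's:

```python
import numpy as np

def reduced_rank_ml(X, Y, H):
    """Rank-H least-squares (reduced-rank regression) fit of Y ~ X.

    Sketch of the ML solution for a linear neural network y = B A x,
    assuming X (n x M) has full column rank and noise is Gaussian.
    Returns the N x M map BA of rank at most H.
    """
    # Ordinary least-squares map (M x N), then project the fitted
    # values onto their top-H principal subspace.
    B_ols = np.linalg.lstsq(X, Y, rcond=None)[0]       # M x N
    Y_hat = X @ B_ols                                  # n x N fitted values
    _, _, Vt = np.linalg.svd(Y_hat, full_matrices=False)
    P = Vt[:H].T @ Vt[:H]                              # N x N rank-H projector
    return (B_ols @ P).T                               # N x M, rank <= H
```

In this idealized noiseless case, with H at least the true rank, the returned map equals the true BA exactly; with smaller H it keeps only the leading singular components.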

10
**Bayes Estimation**

Notation: x is the input, y the output, w the parameter; the true distribution generates n training samples. Given the learner p(y|x, w) and the prior φ(w), the marginal likelihood is the likelihood integrated over the prior; the posterior is the likelihood times the prior, normalized by the marginal likelihood; and the predictive distribution averages the learner over the posterior. In ML (or MAP) we predict with one model; in Bayes we predict with an ensemble of models.

11
**Empirical Bayes (EB) Approach [Efron&Morris73]**

Here the prior contains a hyperparameter, which is estimated by maximizing the marginal likelihood. The posterior and the predictive distribution are then computed with the optimized hyperparameter.

12
**Subspace Bayes (SB) approach**

SB is an EB approach in which part of the parameters are regarded as hyperparameters. a) MIP (Marginalizing in Input Parameter space) version: A is a parameter and B a hyperparameter. b) MOP (Marginalizing in Output Parameter space) version: A is a hyperparameter and B a parameter. In LNNs, the marginalization can be done analytically.

13
**Intuitive explanation**

(Figure: comparison of the Bayes posterior and the SB posterior; for redundant components, SB optimizes the hyperparameter rather than integrating over it.)

14
**Contents**

- Backgrounds: regular models; unidentifiable models; superiority of Bayes to ML; what's the purpose?
- Setting: model; subspace Bayes (SB) approach
- Analysis: analysis (James-Stein estimator); solution; generalization error
- Discussion & Conclusions

15
**Free energy (a.k.a. evidence, stochastic complexity)**

The free energy, F = -log(marginal likelihood), is an important quantity used for model selection [Akaike80; MacKay92]. We minimize the free energy when optimizing the hyperparameter.

16
**Generalization Error**

The generalization error G(n) is the Kullback-Leibler divergence between the true distribution q and the predictive distribution p, averaged over training sets: G(n) = ⟨ KL(q ‖ p) ⟩, where ⟨·⟩ denotes expectation over q. Its asymptotic expansion is G(n) = λ/n + o(1/n), where λ is the generalization coefficient: λ = K/2 in regular models, while in unidentifiable models λ depends on the estimation method.
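G(n) averages a Kullback-Leibler divergence; for intuition, KL has a closed form between two 1-D Gaussians (a toy illustration, not the talk's general q and p):

```python
import math

def kl_gaussians(mu_q, var_q, mu_p, var_p):
    """KL(q || p) between 1-D Gaussians q = N(mu_q, var_q), p = N(mu_p, var_p)."""
    return 0.5 * (math.log(var_p / var_q)
                  + (var_q + (mu_q - mu_p) ** 2) / var_p
                  - 1.0)
```

KL is zero exactly when q = p and grows as the predictive drifts from the truth, which is what G(n) measures on average over training sets.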

17
**James-Stein (JS) estimator**

Consider K-dimensional mean estimation (a regular model) from samples x_1, ..., x_n, with the ML estimator given by the arithmetic mean. An estimator a dominates an estimator b if a is no worse than b for ANY true parameter and strictly better for a certain true parameter. The ML estimator is efficient (never dominated by any unbiased estimator) but inadmissible (dominated by a biased estimator) when K ≥ 3 [Stein56]. The James-Stein (JS) estimator [James&Stein61] is such a dominating, biased estimator: it shrinks the ML estimate toward the origin (the slide illustrates the true mean, ML, and JS for K = 3). A certain relation between EB and JS was discussed in [Efron&Morris73].
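A minimal numeric sketch of the JS shrinkage, assuming unit-variance Gaussian samples as in the standard setting (the function name is ours):

```python
import numpy as np

def james_stein(samples):
    """James-Stein estimate of a K-dimensional mean from n i.i.d. samples.

    Assumes x_i ~ N(theta, I_K): the sample mean is shrunk toward the
    origin by the factor 1 - (K - 2) / (n * ||mean||^2).
    """
    n, K = samples.shape
    mean = samples.mean(axis=0)                    # ML estimator
    shrink = 1.0 - (K - 2) / (n * (mean @ mean))   # may be negative (see PJS)
    return shrink * mean
```

The shrinkage is strong when the sample mean is close to the origin and vanishes as n‖mean‖² grows, so JS barely differs from ML for clearly nonzero means.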

18
**Positive-part JS estimator**

The positive-part JS type (PJS) estimator clips the JS shrinkage factor at zero. This thresholding acts as model selection: components whose ML estimate falls below the threshold are discarded entirely. PJS is thus simultaneously a model-selecting and a shrinkage estimator.
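The clipping described above can be sketched as follows (again assuming unit-variance Gaussian samples; illustrative code, not the paper's):

```python
import numpy as np

def positive_part_js(samples):
    """Positive-part James-Stein estimate of a K-dimensional mean.

    Assumes x_i ~ N(theta, I_K). When ||mean||^2 < (K - 2) / n the
    shrinkage factor is clipped to zero, discarding the component
    entirely -- the model-selection effect of the thresholding.
    """
    n, K = samples.shape
    mean = samples.mean(axis=0)
    shrink = max(0.0, 1.0 - (K - 2) / (n * (mean @ mean)))
    return shrink * mean
```

Small means are mapped exactly to zero (selection), while large means are mildly shrunk (regularization), which is the dual behavior the slide attributes to PJS.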

19
**Hyperparameter optimization**

Assume orthonormality, i.e., the relevant d x d matrix is the identity matrix. Under this assumption the optimization can be solved analytically in LNNs, yielding the optimum hyperparameter value in closed form.

20
**SB Solution (Theorem 1, Lemma 1)**

Let L be the dimensionality of the marginalized subspace (per component), i.e., L = M in MIP or L = N in MOP. Theorem 1: the SB estimator has a positive-part JS type form. Lemma 1: the posterior is localized, so that the model at the SB estimator can be substituted for the predictive distribution. Consequently, SB is asymptotically equivalent to PJS estimation.

21
**Generalization error (Theorem 2)**

Theorem 2: the SB generalization coefficient is given by an expectation, over the Wishart distribution, of a function of the eigenvalues of a random matrix: λ_h denotes the h-th largest eigenvalue of a matrix distributed as W_{N-H*}(M-H*, I_{N-H*}).
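Such Wishart expectations are easy to approximate by Monte Carlo. A sketch (the dimension, degrees of freedom, and eigenvalue function below are placeholders, not the theorem's exact quantities):

```python
import numpy as np

def wishart_eig_expectation(dim, dof, f, n_samples=2000, seed=0):
    """Monte Carlo estimate of E[f(eigenvalues of W)] for W ~ W_dim(dof, I).

    A Wishart sample with identity scale is G @ G.T, where G is a
    dim x dof matrix of standard normals (assumes dof >= dim).
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_samples):
        G = rng.normal(size=(dim, dof))
        eigvals = np.linalg.eigvalsh(G @ G.T)
        total += f(eigvals)
    return total / n_samples
```

As a sanity check, the expected trace of a W_dim(dof, I) matrix is dim * dof.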

22
**Large scale approximation (Theorem 3)**

Theorem 3: in the large-scale limit, the generalization coefficient converges to a closed-form expression.

23
**Results 1 (true rank dependence)**

(Figure: generalization coefficient vs. true rank for ML, Bayes, SB(MIP), and SB(MOP); learner size N = 30, M = 50.) SB provides good generalization. Note: this does NOT mean that SB dominates Bayes; discussing domination requires consideration of a delicate situation (see the paper).

24
**Results 2 (redundant rank dependence)**

(Figure: generalization coefficient vs. redundant rank H, for a fixed true map; curves for ML, Bayes, SB(MOP), and SB(MIP); learner size N = 30, M = 50.) The SB coefficient depends on H similarly to ML, and shares other ML-like properties as well.

25
**Contents**

- Backgrounds: regular models; unidentifiable models; superiority of Bayes to ML; what's the purpose?
- Setting: model; subspace Bayes (SB) approach
- Analysis: analysis (James-Stein estimator); solution; generalization error
- Discussion & Conclusions

26
**Features of SB**

- SB provides good generalization: in LNNs it is asymptotically equivalent to PJS.
- SB requires smaller computational costs, because the marginalized space is reduced; in some models the marginalization can be done analytically.
- SB is related to the variational Bayes (VB) approach.

27
**Variational Bayes (VB) Solution [Nakajima&Watanabe05]**

VB results in the same solution as MIP: VB automatically selects the larger dimension to marginalize. (Figure: the Bayes posterior vs. the VB posterior; the VB posterior is similar to the SB posterior.)

28
**Conclusions**

We have introduced a subspace Bayes (SB) approach. We have proved that, in LNNs, SB is asymptotically equivalent to a shrinkage (PJS) estimation: even asymptotically, SB for redundant components converges not to ML but to a smaller value, which means suppression of overfitting. Interestingly, the MIP version of SB is asymptotically equivalent to VB. We have also clarified the SB generalization error: SB has both Bayes-like and ML-like properties, i.e., shrinkage and acceleration of overfitting by basis selection.

29
Future work: analysis of other models (neural networks, Bayesian networks, mixture models, etc.), and analysis of variational Bayes (VB) in other models.

30
Thank you!
