Published by Juniper Haynes. Modified about 1 year ago.

1 Generalization Error of Linear Neural Networks in an Empirical Bayes Approach
Shinichi Nakajima, Sumio Watanabe
Tokyo Institute of Technology; Nikon Corporation

2 Contents
Backgrounds
- Regular models
- Unidentifiable models
- Superiority of Bayes to ML
- What's the purpose?
Setting
- Model
- Subspace Bayes (SB) Approach
Analysis
- (James-Stein estimator)
- Solution
- Generalization error
Discussion & Conclusions

3 Conventional Learning Theory: Regular Models
Regular models: det(Fisher information) > 0 everywhere.
- Mean estimation
- Linear regression
(Asymptotically) normal likelihood for ANY true parameter.
1. Asymptotic normality of the distributions of the ML estimator and the Bayes posterior. This underlies model selection methods (AIC, BIC, MDL).
2. Asymptotic generalization error: (ML) = (Bayes).
GE : [equation not preserved]
FE : [equation not preserved]
K : dimensionality of parameter space
n : number of samples
x : input
y : output

4 Unidentifiable models
Singularities exist, where det(Fisher information) = 0.
- Neural networks
- Bayesian networks
- Mixture models
- Hidden Markov models
H : number of components
Unidentifiable set : [equation not preserved]
NON-normal likelihood when the true parameter is on the singularities.
1. Asymptotic normality does NOT hold. No (penalized-likelihood-type) information criterion is available.

5 Superiority of Bayes to ML
Unidentifiable models: singularities exist, where det(Fisher information) = 0.
- Neural networks
- Bayesian networks
- Mixture models
- Hidden Markov models
1. Asymptotic normality does NOT hold. No (penalized-likelihood-type) information criterion is available.
2. Bayes has an advantage: G(Bayes) < G(ML).
How do singularities work in learning? When the true parameter is on the singularities:
- In ML, the increase of the neighborhood of the true parameter accelerates overfitting.
- In Bayes, the increase of the population denoting the true distribution suppresses overfitting (only in Bayes).

6 What's the purpose?
Bayes provides good generalization, but it is expensive (it needs Markov chain Monte Carlo).
Is there any approximation with good generalization and tractability?
- Variational Bayes (VB) [Hinton&vanCamp93; MacKay95; Attias99; Ghahramani&Beal00]. Analyzed in another paper [Nakajima&Watanabe05].
- Subspace Bayes (SB)

8 Linear Neural Networks (LNNs)
A : input parameter (H x M) matrix
Essential parameter dimensionality : [equation not preserved]
Trivial redundancy : [equation not preserved]
[Figure labels: learner, true, H*]
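The trivial redundancy can be checked numerically. The sketch below assumes the usual LNN form, in which the map is the product BA of an N x H output parameter B (the name and shape of B are assumptions, since only A is preserved on the slide) and the H x M input parameter A. It also uses the standard count H(M + N) - H^2 for the dimension of rank-H N x M matrices; the slide's own formula is not preserved. All sizes are illustrative.

```python
import numpy as np

def essential_dim(M, N, H):
    """Standard dimension count for rank-H linear maps from R^M to R^N."""
    return H * (M + N) - H * H

rng = np.random.default_rng(0)
M, N, H = 5, 3, 2                       # input dim, output dim, hidden units (illustrative)
A = rng.standard_normal((H, M))         # input parameter, H x M
B = rng.standard_normal((N, H))         # output parameter, N x H (assumed shape)
T = np.array([[2.0, 1.0], [1.0, 1.0]])  # any invertible H x H matrix

# Trivial redundancy: (B T)(T^{-1} A) defines the same map as B A.
same_map = np.allclose(B @ A, (B @ T) @ (np.linalg.inv(T) @ A))

raw_params = H * (M + N)                # number of entries in A and B together
print(same_map, raw_params, essential_dim(M, N, H))
```

The gap between the raw count and the essential count is H^2, the number of entries of T.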

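9 Maximum Likelihood estimator [Baldi&Hornik95]
The ML estimator is given by [equation not preserved], where
- [symbol not preserved] : h-th largest singular value of RQ^{-1/2}
- [symbol not preserved] : right singular vector
- [symbol not preserved] : left singular vector
Here [definitions not preserved].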
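A minimal sketch of the reduced-rank ML solution follows. It assumes that Q and R are the empirical moment matrices Q = (1/n) sum x x^T and R = (1/n) sum y x^T, which the slide does not preserve, and it uses the standard reduced-rank regression result: project the least-squares solution R Q^{-1} onto the top-H left singular vectors of RQ^{-1/2}. The function name and the toy data are hypothetical.

```python
import numpy as np

def lnn_ml_estimator(X, Y, H):
    """Rank-H ML estimate of the N x M map BA from X (n x M) and Y (n x N)."""
    n = X.shape[0]
    Q = X.T @ X / n                       # (M, M) input second moment (assumed definition)
    R = Y.T @ X / n                       # (N, M) output-input cross moment (assumed definition)
    # Symmetric inverse square root of Q from its eigendecomposition.
    evals, evecs = np.linalg.eigh(Q)
    Q_inv_half = (evecs * evals ** -0.5) @ evecs.T
    # Singular values (descending) and left singular vectors of R Q^{-1/2}.
    U, gamma, _ = np.linalg.svd(R @ Q_inv_half, full_matrices=False)
    U_H = U[:, :H]
    # Keep only the top-H left singular directions of the least-squares solution.
    BA = U_H @ U_H.T @ R @ np.linalg.inv(Q)
    return BA, gamma

rng = np.random.default_rng(0)
n, M, N, H_true = 200, 5, 3, 1
W_true = rng.standard_normal((N, H_true)) @ rng.standard_normal((H_true, M))
X = rng.standard_normal((n, M))
Y = X @ W_true.T                          # noise-free outputs, for illustration only
BA_hat, gamma = lnn_ml_estimator(X, Y, H=1)
```

With noise-free rank-1 data, the rank-1 estimate recovers the true map up to rounding, which is a convenient sanity check.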

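10 Bayes estimation
x : input
y : output
[symbol not preserved] : parameter
n training samples
Learner, prior, and true distribution : [equations not preserved]
Marginal likelihood : [equation not preserved]
Posterior : [equation not preserved]
Predictive : [equation not preserved]
In ML (or MAP): predict with one model.
In Bayes: predict with an ensemble of models.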

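11 Empirical Bayes (EB) approach [Efron&Morris73]
n training samples
Learner, prior (with hyperparameter), and true distribution : [equations not preserved]
Marginal likelihood : [equation not preserved]
Posterior : [equation not preserved]
Predictive : [equation not preserved]
The hyperparameter is estimated by maximizing the marginal likelihood.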
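As a toy illustration of the EB recipe (not the LNN setting of the talk), consider a one-dimensional Gaussian mean with unit noise variance and prior N(0, c2), where the prior variance c2 plays the role of the hyperparameter. Under these assumptions the sample mean has marginal distribution N(0, c2 + 1/n), so the maximizer of the marginal likelihood is available in closed form. All names below are hypothetical.

```python
import numpy as np

def neg_log_marginal(ybar, n, c2):
    """Negative log marginal likelihood of the sample mean, up to terms free of c2."""
    v = c2 + 1.0 / n
    return 0.5 * np.log(2.0 * np.pi * v) + ybar ** 2 / (2.0 * v)

def eb_gaussian_mean(y):
    """Empirical Bayes estimate of a 1-D mean with prior N(0, c2), unit noise."""
    n = len(y)
    ybar = float(np.mean(y))
    c2_hat = max(0.0, ybar ** 2 - 1.0 / n)      # maximizes the marginal likelihood
    shrink = c2_hat / (c2_hat + 1.0 / n)        # posterior-mean shrinkage factor
    return c2_hat, shrink * ybar

c2_big, mean_big = eb_gaussian_mean(np.full(4, 2.0))
c2_small, mean_small = eb_gaussian_mean(np.array([0.1, -0.1, 0.05, 0.0]))
```

When the sample mean is small relative to 1/n, the estimated prior variance is zero and the estimate collapses to zero; otherwise the estimate is shrunk by the factor 1 - 1/(n ybar^2). This thresholding-plus-shrinkage behaviour is the kind of link between EB and James-Stein-type estimators that the talk relies on.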

12 Subspace Bayes (SB) approach
SB is an EB approach in which part of the parameters are regarded as hyperparameters.
a) MIP (Marginalizing in Input Parameter space) version
- A : parameter
- B : hyperparameter
b) MOP (Marginalizing in Output Parameter space) version
- A : hyperparameter
- B : parameter
Learner : [equation not preserved]
Prior : [equation not preserved]
Marginalization can be done analytically in LNNs.

13 Intuitive explanation
[Figure: Bayes posterior and SB posterior for a redundant component; panel label: Optimize]

15 Free energy (a.k.a. evidence, stochastic complexity)
Free energy : [equation not preserved] (the negative logarithm of the marginal likelihood)
An important variable used for model selection [Akaike80; MacKay92].
We minimize the free energy to optimize the hyperparameter.

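16 Generalization error
Generalization error : [equation not preserved], where
- [symbol not preserved] : Kullback-Leibler divergence between q & p
- [symbol not preserved] : expectation of V over q
Asymptotic expansion : [equation not preserved]
- [symbol not preserved] : generalization coefficient
In regular models, [equation not preserved].
In unidentifiable models, [equation not preserved].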
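For a regular model, the generalization coefficient can be estimated by simulation. The sketch below assumes K-dimensional Gaussian mean estimation with identity covariance, where the Kullback-Leibler divergence from the true density to the plug-in ML density is half the squared distance between the means. It multiplies the averaged divergence by n, which should come out close to the standard regular-model value K/2. Function names and sizes are hypothetical.

```python
import numpy as np

def kl_gauss_identity(mu_true, mu_est):
    """KL( N(mu_true, I) || N(mu_est, I) ) = 0.5 * ||mu_est - mu_true||^2."""
    return 0.5 * np.sum((mu_est - mu_true) ** 2, axis=-1)

def generalization_coefficient_ml(K, n, trials, rng):
    """Monte Carlo estimate of n * E[KL] for the ML (sample-mean) estimator."""
    mu_true = np.zeros(K)
    # The sample mean of n unit-variance samples is distributed as N(mu_true, I/n).
    mu_ml = mu_true + rng.standard_normal((trials, K)) / np.sqrt(n)
    return n * kl_gauss_identity(mu_true, mu_ml).mean()

rng = np.random.default_rng(0)
lam = generalization_coefficient_ml(K=6, n=100, trials=20000, rng=rng)
print(lam)   # close to K / 2 = 3, up to Monte Carlo error
```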

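17 James-Stein (JS) estimator
K-dimensional mean estimation (regular model)
- [symbol not preserved] : n samples
- [symbol not preserved] : true mean
- [symbol not preserved] : ML estimator (arithmetic mean)
Domination of one estimator over another : [inequality not preserved] for any true mean, and [strict inequality not preserved] for a certain true mean.
ML is efficient (never dominated by any unbiased estimator), but it is inadmissible (dominated by a biased estimator) when K >= 3 [Stein56].
James-Stein estimator [James&Stein61] : [equation not preserved]
[Figure: JS (K=3) compared with ML]
A certain relation between EB and JS was discussed in [Efron&Morris73].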
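The domination of JS over ML can be seen in a small simulation. The sketch below assumes unit noise variance and uses the textbook JS form, which shrinks the sample mean by the factor 1 - (K - 2) / (n ||mean||^2); the slide's own equation is not preserved. K = 10 is used instead of the figure's K = 3 only to keep the Monte Carlo average stable, and the true mean is placed at the origin, where the gain is largest.

```python
import numpy as np

def james_stein(mu_ml, n):
    """Textbook James-Stein shrinkage of sample means (unit noise variance assumed)."""
    K = mu_ml.shape[-1]
    norm2 = np.sum(mu_ml ** 2, axis=-1, keepdims=True)
    return (1.0 - (K - 2) / (n * norm2)) * mu_ml

def squared_error_risks(mu_true, n, trials, rng):
    """Monte Carlo risks (mean squared error) of the ML and JS estimators."""
    K = mu_true.size
    mu_ml = mu_true + rng.standard_normal((trials, K)) / np.sqrt(n)
    mu_js = james_stein(mu_ml, n)
    r_ml = np.mean(np.sum((mu_ml - mu_true) ** 2, axis=1))
    r_js = np.mean(np.sum((mu_js - mu_true) ** 2, axis=1))
    return r_ml, r_js

rng = np.random.default_rng(0)
r_ml, r_js = squared_error_risks(np.zeros(10), n=10, trials=20000, rng=rng)
print(r_ml, r_js)   # ML risk is about K / n; JS risk is smaller
```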

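18 Positive-part JS estimator
Positive-part JS estimator : [equation not preserved], where [equation not preserved]
Positive-part JS type (PJS) estimator : [equation not preserved], where [equation not preserved]
Thresholding -> Model selection
PJS is a model-selecting shrinkage estimator.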
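A minimal sketch of a positive-part JS type estimator follows. It assumes unit noise variance and a free "degree of shrinkage" parameter, here called chi (a hypothetical name); setting chi = K - 2 gives the usual positive-part JS estimator. The slide's own equations are not preserved, so this is the textbook form rather than a transcription.

```python
import numpy as np

def pjs_estimator(mu_ml, n, chi):
    """Positive-part JS type estimate of a mean vector (unit noise variance assumed).

    chi is the assumed degree of shrinkage.
    """
    stat = n * float(np.sum(mu_ml ** 2))
    if stat <= chi:
        # Thresholding: the component is judged to be zero (model selection).
        return np.zeros_like(mu_ml)
    # Shrinkage: scale the ML estimate toward the origin.
    return (1.0 - chi / stat) * mu_ml

kept = pjs_estimator(np.array([3.0, 4.0]), n=1, chi=5.0)      # stat = 25, factor 0.8
dropped = pjs_estimator(np.array([0.1, 0.1]), n=1, chi=5.0)   # stat = 0.02, below threshold
```

The two branches correspond to the two roles named on the slide: the threshold discards small estimates, and the factor shrinks the ones that survive.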

19 Hyperparameter optimization
Assume orthonormality : [equation not preserved]
- [symbol not preserved] : d x d identity matrix
Optimum hyperparameter value : [equation not preserved]
Analytically solved in LNNs!

20 SB solution (Theorem 1, Lemma 1)
Theorem 1: The SB estimator is given by [equation not preserved], where [equation not preserved].
- L : dimensionality of the marginalized subspace (per component), i.e., L = M in MIP, or L = N in MOP.
Lemma 1: The posterior is localized, so that we can substitute the model at the SB estimator for the predictive distribution.
SB is asymptotically equivalent to PJS estimation.
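The claimed equivalence to PJS estimation can be illustrated by applying a PJS-type factor to each singular component of the reduced-rank ML solution. This is only a sketch under several assumptions, since Theorem 1's formula is not preserved here: the factor is taken to be max(0, 1 - L / (n gamma_h^2)) for the h-th singular value gamma_h of RQ^{-1/2}, the noise variance is taken to be one, and Q and R are taken to be the empirical moment matrices. Lower-order terms are ignored. The function name and data are hypothetical.

```python
import numpy as np

def shrunk_lnn_estimator(X, Y, H, L):
    """Component-wise PJS-type shrinkage of the rank-H ML estimate (assumed form)."""
    n = X.shape[0]
    Q = X.T @ X / n                       # assumed definition of Q
    R = Y.T @ X / n                       # assumed definition of R
    evals, evecs = np.linalg.eigh(Q)
    Q_inv_half = (evecs * evals ** -0.5) @ evecs.T
    U, gamma, _ = np.linalg.svd(R @ Q_inv_half, full_matrices=False)
    W_ls = R @ np.linalg.inv(Q)           # full-rank least-squares solution
    BA = np.zeros_like(W_ls)
    for h in range(H):
        stat = n * gamma[h] ** 2
        # Threshold at L, then shrink: the assumed PJS-type factor.
        factor = max(0.0, 1.0 - L / stat) if stat > 0 else 0.0
        BA += factor * np.outer(U[:, h], U[:, h]) @ W_ls
    return BA

rng = np.random.default_rng(0)
n, M, N = 200, 5, 3
W_true = rng.standard_normal((N, 1)) @ rng.standard_normal((1, M))   # true rank 1
X = rng.standard_normal((n, M))
Y = X @ W_true.T + rng.standard_normal((n, N))                       # unit-variance noise
BA_ml = shrunk_lnn_estimator(X, Y, H=2, L=0.0)       # no shrinkage: rank-2 ML estimate
BA_mip = shrunk_lnn_estimator(X, Y, H=2, L=float(M)) # L = M, as in the MIP version
```

With L = 0 the function reduces to the plain rank-H ML estimate, so the effect of L on the redundant second component can be read off by comparing the two results.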

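21 Generalization error (Theorem 2)
Theorem 2: The SB generalization coefficient is given by [equation not preserved], where
- [symbol not preserved] : h-th largest eigenvalue of a matrix subject to the Wishart distribution W_{N-H*}(M-H*, I_{N-H*})
- [symbol not preserved] : expectation over the Wishart distribution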
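Expectations over the Wishart eigenvalues can be approximated by sampling. The sketch below draws matrices from W_{N-H*}(M-H*, I) as G^T G with G a standard normal (M-H*) x (N-H*) matrix, and averages a user-supplied function of the sorted eigenvalues. The function passed in here (the sum of the largest eigenvalues) is only a placeholder, since the integrand of Theorem 2 is not preserved; N = 30 and M = 50 follow the results slides, while H* = 10 and H = 15 are hypothetical.

```python
import numpy as np

def wishart_eigenvalues(dim, dof, rng):
    """Eigenvalues, largest first, of one draw from Wishart_dim(dof, I)."""
    G = rng.standard_normal((dof, dim))
    return np.sort(np.linalg.eigvalsh(G.T @ G))[::-1]

def expect_over_wishart(f, dim, dof, trials, rng):
    """Monte Carlo average of f(eigenvalues) over Wishart_dim(dof, I)."""
    return float(np.mean([f(wishart_eigenvalues(dim, dof, rng)) for _ in range(trials)]))

rng = np.random.default_rng(0)
N, M, H_true, H = 30, 50, 10, 15          # H_true and H are illustrative choices
dim, dof = N - H_true, M - H_true
mean_trace = expect_over_wishart(np.sum, dim, dof, trials=2000, rng=rng)
top_sum = expect_over_wishart(lambda ev: ev[: H - H_true].sum(), dim, dof, trials=2000, rng=rng)
print(mean_trace, top_sum)
```

The mean trace should be close to dim * dof, which gives a quick check that the sampler matches the intended distribution.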

22 Large scale approximation (Theorem 3)
Theorem 3: In the large scale limit when [condition not preserved], the generalization coefficient converges to [equation not preserved], where [equation not preserved].

23 Results 1 (true rank dependence)
[Figure: curves for Bayes, SB(MIP), SB(MOP), and ML; learner: N = 30, M = 50; true: N = 30, M = 50]
SB provides good generalization.
Note: This does NOT mean domination of SB over Bayes. Discussion of domination needs consideration of a delicate situation. (See paper)

24 Results 2 (redundant rank dependence)
[Figure: curves for Bayes, SB(MIP), SB(MOP), and ML; learner: N = 30, M = 50; true: N = 30, M = 50]
SB depends on H similarly to ML.
SB also has a property similar to ML.

26 Features of SB
- Provides good generalization. In LNNs, it is asymptotically equivalent to PJS.
- Requires smaller computational costs, through reduction of the marginalized space. In some models, marginalization can be done analytically.
- Is related to the variational Bayes (VB) approach.

27 Variational Bayes (VB) Solution [Nakajima&Watanabe05]
VB results in the same solution as MIP.
VB automatically selects the larger dimension to marginalize.
[Figure: Bayes posterior and VB posterior; the VB posterior is similar to the SB posterior; conditions "For ... and ..." not preserved]

28 Conclusions
- We have introduced a subspace Bayes (SB) approach.
- We have proved that, in LNNs, SB is asymptotically equivalent to a shrinkage (PJS) estimation.
- Even asymptotically, the SB estimator for redundant components converges not to the ML estimator but to a smaller value, which means suppression of overfitting.
- Interestingly, the MIP version of SB is asymptotically equivalent to VB.
- We have clarified the SB generalization error.
- SB has both Bayes-like and ML-like properties, i.e., shrinkage and acceleration of overfitting by basis selection.

29 Future work
- Analysis of other models (neural networks, Bayesian networks, mixture models, etc.).
- Analysis of variational Bayes (VB) in other models.

30 Thank you!
