Generalization Error of Linear Neural Networks in an Empirical Bayes Approach
Shinichi Nakajima, Sumio Watanabe
Tokyo Institute of Technology / Nikon Corporation

Contents
- Backgrounds: Regular models / Unidentifiable models / Superiority of Bayes to ML / What's the purpose?
- Setting: Model / Subspace Bayes (SB) Approach
- Analysis (James-Stein estimator): Solution / Generalization error
- Discussion & Conclusions

Regular Models (conventional learning theory)
Regular models are those in which det(Fisher information) > 0 everywhere; examples include mean estimation and linear regression. Notation: K = dimensionality of the parameter space, n = number of samples, x = input, y = output.
1. Asymptotic normality holds for the distribution of the ML estimator and for the Bayes posterior: the likelihood is (asymptotically) normal for ANY true parameter. This gives the asymptotic generalization error (GE) and free energy (FE) sketched below, and hence the penalized-likelihood model selection methods (AIC, BIC, MDL).
2. Asymptotic generalization error: λ(ML) = λ(Bayes).
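The GE and FE expressions on this slide appeared as formula images; as a sketch, the standard regular-model asymptotics (G(n) is the generalization error and F(n) the free energy, both defined on later slides) are:

    G(n) = \frac{K}{2n} + o\!\left(\frac{1}{n}\right), \qquad
    F(n) = \frac{K}{2}\,\log n + O_p(1),

so that the generalization coefficient satisfies λ(ML) = λ(Bayes) = K/2 for regular models.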

Unidentifiable Models
Notation: H = number of components. Unidentifiable models have singularities, where det(Fisher information) = 0; examples include neural networks, Bayesian networks, mixture models, and hidden Markov models. When the true distribution lies on the singularities (the unidentifiable set), the likelihood is NOT asymptotically normal. Consequently:
1. The asymptotic normality results do NOT hold, and there is no (penalized-likelihood type) information criterion.

Superiority of Bayes to ML: how do singularities work in learning?
(As above: unidentifiable models have singularities where det(Fisher information) = 0, e.g., neural networks, Bayesian networks, mixture models, hidden Markov models, and the asymptotic normality results do not hold.)
When the true distribution lies on the singularities:
- In ML, the enlarged neighborhood of the true distribution accelerates overfitting.
- In Bayes, the large population of parameters denoting the true distribution suppresses overfitting (this happens only in Bayes).
2. Hence Bayes has an advantage: G(Bayes) < G(ML).

What's the purpose?
Bayes estimation provides good generalization but is expensive (it needs Markov chain Monte Carlo). Is there an approximation with both good generalization and tractability?
- Variational Bayes (VB) [Hinton&vanCamp93; MacKay95; Attias99; Ghahramani&Beal00]: analyzed in another paper [Nakajima&Watanabe05].
- Subspace Bayes (SB): the approach analyzed here.

Linear Neural Networks (LNNs)
An LNN with M inputs, N outputs, and H hidden units computes f(x; A, B) = BAx, where A is the input parameter (H x M) matrix and B is the output parameter (N x H) matrix. Because of the trivial redundancy (BA is unchanged under A → TA, B → BT^{-1} for any invertible T), the essential parameter dimensionality is K = H(M + N) - H^2. The true map is B*A* with rank H* (H* <= H).
Known generalization coefficients of the learner:
- ML [Fukumizu99]: λ > K/2 if H* < H; λ = K/2 if H* = H.
- Bayes [Aoyagi&Watanabe03]: λ < K/2 if H* < H (λ = K/2 if H* = H).
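The model density appeared as a formula image; a minimal sketch of the standard LNN setting, assuming (as in the authors' related papers) isotropic Gaussian noise with unit variance, is:

    p(y \mid x, A, B) = \frac{1}{(2\pi)^{N/2}} \exp\!\left( -\tfrac{1}{2}\,\| y - BAx \|^{2} \right),
    \qquad q(y \mid x) = p(y \mid x, A^{*}, B^{*}),

where q denotes the true distribution.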

Maximum Likelihood Estimator [Baldi&Hornik95]
The ML estimator of the map BA is given by the rank-H truncated singular value decomposition of RQ^{-1/2}, where Q and R are the sample input moment matrix and the sample output-input cross-moment matrix, γ_h is the h-th largest singular value of RQ^{-1/2}, and the corresponding right and left singular vectors give the input-side and output-side directions (a numerical sketch follows).
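A minimal numpy sketch (not the authors' code) of the reduced-rank ML map described on this slide; the names ml_estimator, X, Y, and H are illustrative, and isotropic Gaussian noise is assumed:

    import numpy as np

    def ml_estimator(X, Y, H):
        """Rank-H ML estimate of the map BA from inputs X (n x M) and outputs Y (n x N)."""
        n = X.shape[0]
        Q = X.T @ X / n                    # sample input moment matrix (M x M)
        R = Y.T @ X / n                    # sample output-input cross-moment matrix (N x M)
        evals, evecs = np.linalg.eigh(Q)   # Q is symmetric positive definite
        Q_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
        U, gamma, Vt = np.linalg.svd(R @ Q_inv_sqrt)   # gamma in descending order
        # Keep the H largest singular components, then undo the input whitening.
        return U[:, :H] @ np.diag(gamma[:H]) @ Vt[:H, :] @ Q_inv_sqrt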

Bayes Estimation
Notation: x = input, y = output, w = parameter. The true distribution generates n training samples; the learner is a parametric model p(y | x, w) with prior φ(w). The marginal likelihood, posterior, and predictive distribution are defined as sketched below.
In ML (or MAP) we predict with a single model; in Bayes we predict with an ensemble of models.
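The three definitions appeared as formula images; a standard sketch, writing w for the parameters and φ(w) for the prior, is:

    Z(Y^n \mid X^n) = \int \varphi(w) \prod_{i=1}^{n} p(y_i \mid x_i, w)\, dw                   (marginal likelihood)
    p(w \mid X^n, Y^n) = \frac{\varphi(w)}{Z(Y^n \mid X^n)} \prod_{i=1}^{n} p(y_i \mid x_i, w)   (posterior)
    p(y \mid x, X^n, Y^n) = \int p(y \mid x, w)\, p(w \mid X^n, Y^n)\, dw                        (predictive)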

Empirical Bayes (EB) Approach [Efron&Morris73]
The prior now depends on a hyperparameter. Given the true distribution, n training samples, the learner, and the hyperparameterized prior, the hyperparameter is estimated by maximizing the marginal likelihood; the posterior and the predictive distribution are then formed with the optimized hyperparameter (sketched below).
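As a sketch, writing τ for the hyperparameter (a symbol introduced here for illustration) and φ(w; τ) for the hyperparameterized prior:

    \hat{\tau} = \arg\max_{\tau} Z_{\tau}(Y^n \mid X^n), \qquad
    Z_{\tau}(Y^n \mid X^n) = \int \varphi(w; \tau) \prod_{i=1}^{n} p(y_i \mid x_i, w)\, dw,

after which the posterior and the predictive distribution are formed as above with the prior φ(w; τ̂).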

Subspace Bayes (SB) Approach
SB is an EB approach in which part of the parameters are regarded as hyperparameters.
a) MIP (Marginalizing in Input Parameter space) version: A is a parameter (marginalized with a prior over the learner), B is a hyperparameter.
b) MOP (Marginalizing in Output Parameter space) version: A is a hyperparameter, B is a parameter.
In LNNs, the marginalization can be done analytically (a sketch of the MIP objective follows).
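A sketch of the MIP version in the notation above (the concrete prior φ(A) was shown as a figure; B plays the role of the hyperparameter):

    \hat{B} = \arg\max_{B} \int \varphi(A) \prod_{i=1}^{n} p(y_i \mid x_i, A, B)\, dA, \qquad
    p(y \mid x, X^n, Y^n) = \int p(y \mid x, A, \hat{B})\, p(A \mid X^n, Y^n, \hat{B})\, dA .

The MOP version exchanges the roles of A and B.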

Intuitive Explanation
[Figure: Bayes posterior vs. SB posterior for a redundant component; in SB, part of the parameters is optimized rather than marginalized.]

Free Energy (a.k.a. evidence, stochastic complexity)
The free energy is an important quantity used for model selection [Akaike80; Mackay92]. We optimize the hyperparameter by minimizing the free energy (sketched below).
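A sketch of the definition, consistent with the marginal likelihood above:

    F(Y^n \mid X^n) = -\log Z(Y^n \mid X^n)
                    = -\log \int \varphi(w) \prod_{i=1}^{n} p(y_i \mid x_i, w)\, dw .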

Generalization Error
The generalization error G(n) is the Kullback-Leibler divergence between the true distribution q and the predictive distribution p, averaged over q and over the training samples (sketched below). Its asymptotic expansion defines the generalization coefficient λ. In regular models, λ = K/2; in unidentifiable models, λ takes model-dependent values different from K/2.
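A sketch of the definitions behind this slide (q is the true distribution, p the predictive distribution, and E_n the expectation over training sets):

    G(n) = \mathbb{E}_n\!\left[ \iint q(x)\, q(y \mid x) \log \frac{q(y \mid x)}{p(y \mid x, X^n, Y^n)}\, dx\, dy \right]
         = \frac{\lambda}{n} + o\!\left(\frac{1}{n}\right).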

James-Stein (JS) Estimator
Domination of estimator a over estimator b: a is no worse than b for any true parameter and strictly better for at least one true parameter.
Consider K-dimensional mean estimation (a regular model) from n samples; the ML estimator is the arithmetic mean. A relation between EB and JS was discussed in [Efron&Morris73]. The ML estimator is efficient (never dominated by any unbiased estimator), but it is inadmissible (dominated by a biased estimator, the James-Stein estimator [James&Stein61]) when K >= 3 [Stein56].
[Figure: true mean, ML estimate, and JS estimate for K = 3.]
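A sketch of the classical JS estimator for the K-dimensional mean with n samples and unit noise variance (the constants may be stated differently in the deck):

    \hat{\mu}_{\mathrm{JS}} = \left( 1 - \frac{K - 2}{n\,\|\bar{x}\|^{2}} \right) \bar{x},
    \qquad \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i,

which dominates the ML estimator \bar{x} when K >= 3.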

Positive-Part JS Estimator
The positive-part JS type (PJS) estimator sets the shrinkage factor to zero whenever it would be negative (sketched below). This thresholding acts as model selection: PJS is a model-selecting, shrinkage estimator.
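A sketch of the positive-part form, continuing the notation above:

    \hat{\mu}_{\mathrm{PJS}} = \left[\, 1 - \frac{K - 2}{n\,\|\bar{x}\|^{2}} \,\right]_{+} \bar{x},
    \qquad [z]_{+} = \max(0, z),

so that components with small \|\bar{x}\| are thresholded exactly to zero (model selection) while large ones are shrunk.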

Hyperparameter Optimization
Assume orthonormality (the corresponding d x d Gram matrix is the identity). Under this assumption, the hyperparameter optimization can be solved analytically in LNNs, yielding the optimum hyperparameter value in closed form.

SB Solution (Theorem 1, Lemma 1)
Let L be the dimensionality of the marginalized subspace (per component), i.e., L = M in MIP or L = N in MOP.
Theorem 1: The SB estimator is given in closed form as a shrinkage of the singular components of the ML solution, with the amount of shrinkage depending on L and n.
Lemma 1: The posterior is localized, so that the model at the SB estimator can be substituted for the predictive distribution.
Consequently, SB is asymptotically equivalent to PJS estimation.

Generalization Error (Theorem 2)
Theorem 2: The SB generalization coefficient is given as an expectation, over the Wishart distribution W_{N-H*}(M-H*, I_{N-H*}), of a function of the largest eigenvalues of the Wishart-distributed matrix.

Large-Scale Approximation (Theorem 3)
Theorem 3: In the large-scale limit, the generalization coefficient converges to a closed-form expression.

Results 1 (dependence on the true rank)
[Plot: generalization coefficients of ML, Bayes, SB(MIP), and SB(MOP) versus the true rank; learner with N = 30, M = 50.]
SB provides good generalization. Note: this does NOT mean that SB dominates Bayes; a discussion of domination requires consideration of more delicate situations (see the paper).

Results 2 (dependence on the redundant rank)
[Plot: generalization coefficients of ML, Bayes, SB(MIP), and SB(MOP) versus the redundant rank H; true distribution with N = 30, M = 50; learner with N = 30, M = 50.]
The SB generalization coefficient depends on H similarly to that of ML; in this respect SB also has an ML-like property.

Features of SB
- SB provides good generalization: in LNNs, it is asymptotically equivalent to PJS.
- SB requires smaller computational cost: the marginalized space is reduced, and in some models the marginalization can be done analytically.
- SB is related to the variational Bayes (VB) approach.

Variational Bayes (VB) Solution [Nakajima&Watanabe05]
VB results in the same solution as SB(MIP); VB automatically selects the larger dimension to marginalize.
[Figure: Bayes posterior vs. VB posterior; the VB posterior is similar to the SB posterior.]

Conclusions
- We have introduced a subspace Bayes (SB) approach.
- We have proved that, in LNNs, SB is asymptotically equivalent to a shrinkage (PJS) estimation. Even asymptotically, the SB estimator of a redundant component converges not to the ML estimator but to a smaller value, which means suppression of overfitting. Interestingly, the MIP version of SB is asymptotically equivalent to VB.
- We have clarified the SB generalization error. SB has both Bayes-like and ML-like properties, i.e., shrinkage and acceleration of overfitting by basis selection, respectively.

Future Work
- Analysis of other models (neural networks, Bayesian networks, mixture models, etc.).
- Analysis of variational Bayes (VB) in other models.

Thank you!