3. Linear Models for Regression (Part 2). The University of Tokyo, Graduate School of Interdisciplinary Information Studies, Nakagawa Laboratory, Ayako Hoshino

Contents
3.3 Bayesian Linear Regression
 – Parameter distribution
 – Predictive distribution
 – Equivalent kernel
3.4 Bayesian Model Comparison
3.5 The Evidence Approximation
 – Evaluation of the evidence function
 – Maximizing the evidence function
 – Effective number of parameters
3.6 Limitations of Fixed Basis Functions

3.3 Bayesian Linear Regression. How do we decide model complexity (the number of basis functions)? A regularization coefficient controls the effective model complexity, but we still need to decide the number and form of the basis functions. Selecting them with a held-out data set is computationally expensive and wasteful of valuable data. Bayesian linear regression can determine model complexity automatically from the training data alone.

3.3.1 Parameter distribution. Prior over w: p(w) = N(w | m_0, S_0) (eq. 3.48). Posterior: p(w | t) = N(w | m_N, S_N) with m_N = S_N(S_0^{-1} m_0 + β Φ^T t) and S_N^{-1} = S_0^{-1} + β Φ^T Φ (eq. 3.49, 3.50, 3.51).

Consider a zero-mean isotropic Gaussian prior p(w | α) = N(w | 0, α^{-1} I) (eq. 3.52). The posterior is then N(w | m_N, S_N) with m_N = β S_N Φ^T t and S_N^{-1} = α I + β Φ^T Φ (eq. 3.49, 3.53, 3.54). Taking the log of the posterior shows that maximizing the posterior w.r.t. w is equivalent to minimizing the sum-of-squares error with a quadratic regularization term.
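A minimal numpy sketch of this posterior update for the zero-mean isotropic prior (eq. 3.53, 3.54); the design matrix Phi, targets t, and the function name are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def posterior_params(Phi, t, alpha, beta):
    """Posterior mean m_N and covariance S_N for Bayesian linear regression
    with a zero-mean isotropic Gaussian prior (PRML eq. 3.53, 3.54)."""
    M = Phi.shape[1]
    S_N_inv = alpha * np.eye(M) + beta * Phi.T @ Phi   # eq. 3.54
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ Phi.T @ t                       # eq. 3.53
    return m_N, S_N
```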

Straight-line fitting example. Original function: f(x, a) = a_0 + a_1 x with a_0 = -0.3, a_1 = 0.5; targets t_n = f(x_n, a) + noise. The figure shows the prior, the product prior × likelihood = posterior, and samples drawn from the posterior.

Straight-line fitting (continued): the same prior × likelihood = posterior panels and posterior samples after observing successively more data points; the posterior becomes progressively more concentrated around the true parameter values.

3.3.2 Predictive distribution. Our real interest is the probability of t for new input values: p(t | t, α, β) = ∫ p(t | w, β) p(w | t, α, β) dw (eq. 3.57). Using the Gaussian marginalization result from Chapter 2, this is a Gaussian with mean m_N^T φ(x) and variance σ_N²(x) = 1/β + φ(x)^T S_N φ(x) (eq. 3.58, 3.59): the noise plus the uncertainty associated with w.
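A short sketch of the predictive mean and variance (eq. 3.58, 3.59), reusing m_N and S_N from the posterior_params helper above; phi_x denotes the basis-function vector for a new input and is an assumed name.

```python
def predictive(phi_x, m_N, S_N, beta):
    """Predictive mean and variance at a new input with feature vector phi_x
    (PRML eq. 3.58, 3.59)."""
    mean = m_N @ phi_x
    var = 1.0 / beta + phi_x @ S_N @ phi_x   # noise term + uncertainty in w
    return mean, var
```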

Predictive distribution for the sinusoidal data set: the mean of the Gaussian predictive distribution and the predictive uncertainty (shaded region), shown for data set sizes N = 1, 2, 4, and 25.

Samples from the posterior

3.3.3 Equivalent kernel. The predictive mean (eq. 3.60) is the model function evaluated with the posterior mean, y(x, m_N) = m_N^T φ(x), and it can be written as a linear combination of the training targets, y(x, m_N) = Σ_n k(x, x_n) t_n (eq. 3.61, 3.62). The weighting function k(x, x') is known as the “smoother matrix” or “equivalent kernel”.
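A sketch of the equivalent kernel k(x, x') = β φ(x)^T S_N φ(x') (eq. 3.62), assuming the rows of Phi are the training feature vectors; the helper name and usage line are illustrative.

```python
def equivalent_kernel(phi_x, Phi, S_N, beta):
    """Weights k(x, x_n) such that the predictive mean is sum_n k(x, x_n) * t_n
    (PRML eq. 3.61, 3.62)."""
    return beta * (Phi @ S_N @ phi_x)   # one weight per training point

# Usage sketch: predictive mean as a linear smoother of the targets t
# mean = equivalent_kernel(phi_x, Phi, S_N, beta) @ t
```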

Examples of equivalent kernels k(x, x') plotted as a function of x' for x = 0: polynomial basis functions and sigmoidal basis functions.

Covariance between y(x) and y(x'): cov[y(x), y(x')] = β^{-1} k(x, x') (eq. 3.63). The equivalent kernel sums to one over the training points, Σ_n k(x, x_n) = 1 (eq. 3.64). It also shares an important property of general kernel functions: it can be expressed as an inner product of nonlinear feature vectors (eq. 3.65).

3.4 Bayesian Model Comparison: the problem of model selection from a Bayesian perspective. The over-fitting associated with maximum likelihood can be avoided by marginalizing over the model parameters instead of making point estimates of their values. This also allows multiple complexity parameters to be determined simultaneously as part of the training process. The Bayesian view of model comparison simply involves the use of probabilities to represent uncertainty in the choice of model.

Compare a set of L models {M_i} (i = 1, …, L). Posterior probability (eq. 3.66): p(M_i | D) ∝ p(M_i) p(D | M_i), where the prior p(M_i) lets us express a preference for different models and p(D | M_i) is the model evidence (marginal likelihood). To compare two models, take the ratio of their evidences, the Bayes factor p(D | M_i) / p(D | M_j).

Predictive distribution (from the sum and product rules, eq. 3.67). Model evidence (again from the sum and product rules, eq. 3.68); for simplicity the conditioning on M_i is dropped. Assume that the posterior distribution is sharply peaked around its most probable value w_MAP with width Δw_posterior, and that the prior is flat with width Δw_prior (eq. 3.70).

Further insight: take the log (eq. 3.71). For a model with M parameters, assuming all parameters have the same ratio Δw_posterior / Δw_prior, we obtain eq. 3.72 (see the math note below). The second term is always negative and acts as a penalty that grows when the parameters are finely tuned in the posterior. A schematic of the data-set distributions for simple, intermediate, and complex models illustrates why the evidence favours models of intermediate complexity.
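As a math note, the approximation referenced above (eq. 3.72) can be written as:

```latex
\ln p(\mathcal{D}) \simeq \ln p(\mathcal{D}\mid w_{\mathrm{MAP}})
  + M \ln\!\left(\frac{\Delta w_{\mathrm{posterior}}}{\Delta w_{\mathrm{prior}}}\right)
```

The second term is negative because Δw_posterior < Δw_prior, and its magnitude grows as the parameters become more finely tuned to the data.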

Implicit assumption: the true distribution from which the data are generated is contained within the set of models under consideration. Provided this holds, Bayesian model comparison will on average favour the correct model: the expected log Bayes factor (eq. 3.73) has the form of a Kullback-Leibler divergence, which is non-negative and zero only if the two distributions are equal.

3.5 The Evidence Approximation. A fully Bayesian treatment of the linear basis function model would
 – place a prior distribution over the hyperparameters α, β, and
 – form the predictive distribution by marginalizing w.r.t. the hyperparameters as well as w (eq. 3.74).
 – If the posterior distribution p(α, β | t) is sharply peaked around values α̂ and β̂, the predictive distribution is obtained simply by marginalizing over w with α and β fixed to α̂ and β̂ (eq. 3.75).

From Bayes’ theorem (eq. 3.76), the posterior over α and β is proportional to the marginal likelihood times the prior; if the prior is relatively flat, α̂ and β̂ are obtained by maximizing the marginal likelihood p(t | α, β). Hyperparameters can thus be determined from the training data alone, without recourse to cross-validation. Recall that the ratio α/β is analogous to a regularization parameter. Two alternative means of maximizing the evidence:
 – set the derivatives of the evidence function to zero and iterate the re-estimation equations for α and β, or
 – use the expectation-maximization (EM) algorithm.

3.5.1 Evaluation of the evidence function. The marginal likelihood (eq. 3.77, 3.78, 3.79) is obtained by integrating out w: complete the square over w in the exponent, where A = αI + βΦ^TΦ plays the role of the Hessian matrix of the regularized error.
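A numpy sketch of the resulting log marginal likelihood (PRML eq. 3.86), under the same assumed Phi and t as the earlier helpers; E(m_N) combines the data-fit and weight-penalty terms.

```python
import numpy as np

def log_evidence(Phi, t, alpha, beta):
    """Log marginal likelihood ln p(t | alpha, beta) (PRML eq. 3.86)."""
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi          # Hessian of the regularized error
    m_N = beta * np.linalg.solve(A, Phi.T @ t)          # posterior mean
    E_mN = 0.5 * beta * np.sum((t - Phi @ m_N) ** 2) + 0.5 * alpha * m_N @ m_N
    return (0.5 * M * np.log(alpha) + 0.5 * N * np.log(beta)
            - E_mN - 0.5 * np.linalg.slogdet(A)[1] - 0.5 * N * np.log(2 * np.pi))
```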

Example: fitting polynomials to the sinusoidal data set. The plot of the model evidence against the polynomial order shows a peak at M = 3.

3.5.2 Maximizing the evidence function. Maximize p(t | α, β) by setting its derivatives w.r.t. α and β to zero: for α this gives eq. 3.87, 3.91, 3.92, and for β it gives eq. 3.95, where λ_i and u_i are the eigenvalues and eigenvectors of βΦ^TΦ. These are implicit solutions, so α and β are found by iterative re-estimation.
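A sketch of this iterative re-estimation (eq. 3.91, 3.92, 3.95), assuming the same Phi and t as before; the starting values, iteration count, and tolerance are arbitrary illustrative choices.

```python
import numpy as np

def estimate_hyperparams(Phi, t, alpha=1.0, beta=1.0, n_iter=100, tol=1e-6):
    """Iterate the implicit solutions for alpha and beta
    (PRML eq. 3.91, 3.92, 3.95)."""
    N, M = Phi.shape
    eig = np.linalg.eigvalsh(Phi.T @ Phi)                 # eigenvalues of Phi^T Phi
    for _ in range(n_iter):
        lam = beta * eig                                  # eigenvalues of beta * Phi^T Phi
        gamma = np.sum(lam / (alpha + lam))               # eq. 3.91
        m_N = beta * np.linalg.solve(alpha * np.eye(M) + beta * Phi.T @ Phi,
                                     Phi.T @ t)
        alpha_new = gamma / (m_N @ m_N)                   # eq. 3.92
        beta_new = (N - gamma) / np.sum((t - Phi @ m_N) ** 2)   # eq. 3.95
        converged = abs(alpha_new - alpha) < tol and abs(beta_new - beta) < tol
        alpha, beta = alpha_new, beta_new
        if converged:
            break
    return alpha, beta
```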

3.5.3 Effective number of parameters. γ = Σ_i λ_i / (α + λ_i) (eq. 3.91) measures the effective total number of well-determined parameters. Directions with α << λ_i are well determined by the data, while directions with α >> λ_i are pulled toward zero by the prior. The figure shows contours of the likelihood function together with the prior; each eigenvalue λ_i measures the curvature of the likelihood in the corresponding direction, with λ_1 large and λ_2 small.

Compare the variance (inverse noise precision) estimates in the two approaches. Maximum likelihood (eq. 3.21): σ²_ML = (1/N) Σ_n {t_n − w_ML^T φ(x_n)}². Evidence approximation (eq. 3.95): 1/β = (1/(N − γ)) Σ_n {t_n − m_N^T φ(x_n)}². The factor N − γ corrects the bias of the maximum likelihood result.

Results with the sinusoidal data set (β set to its true value). One figure shows how α is determined in the evidence framework: the log evidence and the test error are plotted against α, and the optimal α from the evidence is close to the one minimizing test error. Figure 3.17 shows how α controls the magnitudes of the parameters {w_i} as α is varied over the range 0 ≤ α ≤ ∞.

If N >> M, all of the parameters will be well determined by the data, so γ ≈ M. This yields easy-to-compute approximations for α (eq. 3.98) and β (eq. 3.99).
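For reference, these approximations (eq. 3.98, 3.99) take the following form, with E_W and E_D denoting the weight-penalty and data-fit terms:

```latex
\alpha = \frac{M}{2\,E_W(\mathbf{m}_N)}, \qquad
\beta  = \frac{N}{2\,E_D(\mathbf{m}_N)},
\quad\text{where}\quad
E_W(\mathbf{m}_N) = \tfrac{1}{2}\,\mathbf{m}_N^{\mathsf T}\mathbf{m}_N,\quad
E_D(\mathbf{m}_N) = \tfrac{1}{2}\sum_{n=1}^{N}\bigl\{t_n - \mathbf{m}_N^{\mathsf T}\boldsymbol{\phi}(x_n)\bigr\}^2 .
```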

3.6 Limitations of Fixed Basis Functions. Models comprising a linear combination of fixed, nonlinear basis functions
 – have closed-form solutions to the least-squares problem, and
 – have a tractable Bayesian treatment.
The difficulty: the basis functions are fixed before the training data set is observed, which is a manifestation of the curse of dimensionality. Two properties of real data sets alleviate this problem:
 – the data vectors {x_n} typically lie close to a nonlinear manifold whose intrinsic dimensionality is smaller than that of the input space, and
 – the target variables may have significant dependence on only a small number of possible directions within the data manifold.

The End