Ch 3. Linear Models for Regression (2/2) Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Previously summarized by Yung-Kyun Noh. Updated and presented by Rhee, Je-Keun. Biointelligence Laboratory, Seoul National University, http://bi.snu.ac.kr/

Contents
3.4 Bayesian Model Comparison
3.5 The Evidence Approximation
3.5.1 Evaluation of the evidence function
3.5.2 Maximizing the evidence function
3.5.3 Effective number of parameters
3.6 Limitations of Fixed Basis Functions

Bayesian Model Comparison (1/4) Model selection from a Bayesian perspective. Over-fitting associated with maximum likelihood can be avoided by marginalizing over the model parameters instead of making point estimates of their values. Marginalization also allows multiple complexity parameters to be determined simultaneously as part of the training process (as in the relevance vector machine). The Bayesian view of model comparison simply involves the use of probabilities to represent uncertainty in the choice of model. Posterior over models: p(M_i | D) ∝ p(M_i) p(D | M_i), where p(M_i) is the prior, expressing a preference for different models, and p(D | M_i) is the model evidence (marginal likelihood), expressing the preference shown by the data for different models; the parameters w have been marginalized out.

Bayesian Model Comparison (2/4) Bayes factor: the ratio of model evidences for two models, p(D | M_i) / p(D | M_j). Predictive distribution: a mixture distribution, obtained by averaging the predictive distributions of the individual models weighted by their posterior probabilities, p(t | x, D) = Σ_i p(t | x, M_i, D) p(M_i | D). Model evidence: p(D | M_i) = ∫ p(D | w, M_i) p(w | M_i) dw. Sampling perspective: the marginal likelihood can be viewed as the probability of generating the data set D from a model whose parameters are sampled at random from the prior. Posterior distribution over parameters: the evidence is the normalizing term that appears in the denominator, p(w | D, M_i) = p(D | w, M_i) p(w | M_i) / p(D | M_i).
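As a small illustration of how model posteriors and Bayes factors follow from the evidences, here is a minimal Python sketch; the log-evidence values and the equal prior are hypothetical numbers chosen for illustration, not results from the text.

import numpy as np

# Hypothetical log model evidences ln p(D | M_i) for three candidate models,
# e.g. polynomial bases of increasing order (illustrative numbers only).
log_evidence = np.array([-105.2, -98.7, -101.4])
log_prior = np.log(np.ones(3) / 3.0)          # equal prior preference p(M_i)

# Posterior over models: p(M_i | D) is proportional to p(M_i) p(D | M_i).
log_post = log_prior + log_evidence
log_post -= log_post.max()                    # subtract max for numerical stability
posterior = np.exp(log_post) / np.exp(log_post).sum()
print(posterior)                              # the middle model receives most mass

# Bayes factor comparing model 2 against model 1.
print(np.exp(log_evidence[1] - log_evidence[0]))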

Bayesian Model Comparison (3/4) Consider the case of a model having a single parameter w. Assume that the posterior distribution is sharply peaked around the most probable value w_MAP, with width Δw_posterior. Assume that the prior is flat with width Δw_prior, so that p(w) = 1/Δw_prior. Then p(D) = ∫ p(D | w) p(w) dw ≃ p(D | w_MAP) (Δw_posterior / Δw_prior), and hence ln p(D) ≃ ln p(D | w_MAP) + ln(Δw_posterior / Δw_prior). The first term represents the fit to the data given by the most probable parameter value, and for a flat prior this corresponds to the log likelihood. The second term penalizes the model according to its complexity; because Δw_posterior < Δw_prior, this term is negative.

Bayesian Model Comparison (4/4) For a model having a set of M parameters, making the same approximation for each parameter gives ln p(D) ≃ ln p(D | w_MAP) + M ln(Δw_posterior / Δw_prior). As we increase the complexity of the model, the first term will typically increase, because a more complex model is better able to fit the data, whereas the second term will decrease (the complexity penalty grows in magnitude) owing to its linear dependence on M. The optimal model complexity, as determined by the maximum evidence, is therefore given by a trade-off between these two competing terms. A simple model has little variability and so will generate data sets that are fairly similar to each other. A complex model spreads its predictive probability over too broad a range of data sets and so assigns relatively small probability to any one of them.
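The claim that the evidence on average favours the correct model can be made precise (PRML §3.4): if the data are generated from model M_1, the log Bayes factor averaged over data sets drawn from M_1 is a Kullback-Leibler divergence and hence non-negative,

∫ p(D | M_1) ln [ p(D | M_1) / p(D | M_2) ] dD ≥ 0,

so on average the Bayes factor favours the model from which the data were generated, even though a particular data set may favour the wrong model.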

The Evidence Approximation (1/2) Fully Bayesian treatment of the linear basis function model, with hyperparameters α and β. Prediction: marginalize with respect to the hyperparameters as well as w. Predictive distribution for a new target t* given the training targets t: p(t* | t) = ∫∫∫ p(t* | w, β) p(w | t, α, β) p(α, β | t) dw dα dβ. If the posterior distribution p(α, β | t) is sharply peaked around values α̂ and β̂, the predictive distribution is obtained simply by marginalizing over w with the hyperparameters fixed to those values: p(t* | t) ≃ p(t* | t, α̂, β̂) = ∫ p(t* | w, β̂) p(w | t, α̂, β̂) dw.

The Evidence Approximation (2/2) If the prior p(α, β) is relatively flat, then in the evidence framework the values α̂ and β̂ are obtained by maximizing the marginal likelihood function p(t | α, β). The hyperparameters can thus be determined from the training data alone, without recourse to cross-validation. Recall that the ratio α/β is analogous to a regularization parameter. Two routes to maximizing the evidence: set the derivatives of the evidence function to zero and iterate the resulting re-estimation equations for α and β, or use a technique called the expectation maximization (EM) algorithm.

Evaluation of the Evidence Function Marginal likelihood: the evidence p(t | α, β) is obtained by integrating the likelihood over the weight parameters w, as written out below.
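Following PRML §3.5.1, the integration can be carried out analytically for the Gaussian prior and likelihood. With Φ the N×M design matrix (elements Φ_nj = φ_j(x_n)):

p(t | α, β) = ∫ p(t | w, β) p(w | α) dw = (β/2π)^(N/2) (α/2π)^(M/2) ∫ exp{−E(w)} dw,
E(w) = (β/2) ||t − Φw||² + (α/2) w^T w = E(m_N) + (1/2)(w − m_N)^T A (w − m_N),
A = αI + βΦ^T Φ,  m_N = β A^(-1) Φ^T t,  E(m_N) = (β/2) ||t − Φ m_N||² + (α/2) m_N^T m_N.

Completing the Gaussian integral over w gives the log evidence

ln p(t | α, β) = (M/2) ln α + (N/2) ln β − E(m_N) − (1/2) ln|A| − (N/2) ln(2π).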

Evaluation of the Evidence Function: example plot of the model evidence (figure not reproduced).

Maximizing the Evidence Function Maximization of ln p(t | α, β): set its derivatives with respect to α and β to zero. Let u_i and λ_i be the eigenvectors and eigenvalues defined by (β Φ^T Φ) u_i = λ_i u_i. Maximizing with respect to α gives the implicit re-estimation equation α = γ / (m_N^T m_N), with γ = Σ_i λ_i / (α + λ_i). Maximizing with respect to β gives 1/β = (1/(N − γ)) Σ_n {t_n − m_N^T φ(x_n)}². Because m_N, γ, and the λ_i themselves depend on α and β, these equations are iterated to convergence, as sketched below.
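A minimal NumPy sketch of this iterative re-estimation, assuming a design matrix Phi (rows φ(x_n)^T) and target vector t; the function name, the toy polynomial basis, and the tolerance are illustrative choices, not from the slides.

import numpy as np

def maximize_evidence(Phi, t, alpha=1.0, beta=1.0, n_iter=100, tol=1e-6):
    """Iterate the evidence re-estimation equations for alpha and beta."""
    N, M = Phi.shape
    PhiTPhi = Phi.T @ Phi
    eig = np.linalg.eigvalsh(PhiTPhi)              # eigenvalues of Phi^T Phi
    for _ in range(n_iter):
        lam = beta * eig                           # eigenvalues of beta * Phi^T Phi
        A = alpha * np.eye(M) + beta * PhiTPhi     # A = alpha I + beta Phi^T Phi
        m_N = beta * np.linalg.solve(A, Phi.T @ t) # posterior mean of the weights
        gamma = np.sum(lam / (alpha + lam))        # effective number of parameters
        alpha_new = gamma / (m_N @ m_N)
        beta_new = (N - gamma) / np.sum((t - Phi @ m_N) ** 2)
        converged = abs(alpha_new - alpha) < tol and abs(beta_new - beta) < tol
        alpha, beta = alpha_new, beta_new
        if converged:
            break
    return alpha, beta, m_N, gamma

# Toy usage: noisy samples of a sinusoid, cubic polynomial basis.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.size)
Phi = np.vander(x, 4, increasing=True)             # columns 1, x, x^2, x^3
alpha_hat, beta_hat, m_N, gamma = maximize_evidence(Phi, t)
print(alpha_hat, beta_hat, gamma)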

Effective Number of Parameters (1/2) γ = Σ_i λ_i / (α + λ_i) is the effective total number of well-determined parameters. Directions in parameter space with λ_i ≫ α are tightly constrained by the data and each contributes a value close to 1 to γ, whereas directions with λ_i ≪ α are governed mainly by the prior and contribute a value close to 0; hence 0 ≤ γ ≤ M.
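In the limit N ≫ M all of the parameters are well determined, so γ ≃ M and the re-estimation formulas simplify (PRML §3.5.3) to α = M / (2 E_W(m_N)) and β = N / (2 E_D(m_N)), where E_W(m_N) = (1/2) m_N^T m_N and E_D(m_N) = (1/2) Σ_n {t_n − m_N^T φ(x_n)}². These approximations are cheaper to use because they do not require the eigenvalue decomposition.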

Effective Number of Parameters (2/2) [Figure: the log evidence and the test-set error plotted against ln α; both curves indicate a similar optimal value of α.]

Limitations of Fixed Basis Functions Models comprising a linear combination of fixed, nonlinear basis functions have closed-form solutions to the least-squares problem and admit a tractable Bayesian treatment. The difficulty: the basis functions are fixed before the training data set is observed, so their number typically needs to grow rapidly with the dimensionality of the input space, a manifestation of the curse of dimensionality. Two properties of real data sets alleviate this problem: the data vectors {x_n} typically lie close to a nonlinear manifold whose intrinsic dimensionality is smaller than that of the input space, and the target variables may have significant dependence on only a small number of possible directions within the data manifold.