Ch 3. Linear Models for Regression (1/2)
Pattern Recognition and Machine Learning, C. M. Bishop, 2006.
Summarized by Yung-Kyun Noh
Biointelligence Laboratory, Seoul National University, http://bi.snu.ac.kr/

Contents
3.1 Linear Basis Function Models
  3.1.1 Maximum likelihood and least squares
  3.1.2 Geometry of least squares
  3.1.3 Sequential learning
  3.1.4 Regularized least squares
  3.1.5 Multiple outputs
3.2 The Bias-Variance Decomposition
3.3 Bayesian Linear Regression
  3.3.1 Parameter distribution
  3.3.2 Predictive distribution
  3.3.3 Equivalent kernel

Linear Basis Function Models
- Linear regression: y(x, w) = w_0 + w_1 x_1 + ... + w_D x_D.
- Linear model with basis functions: y(x, w) = \sum_{j=0}^{M-1} w_j \phi_j(x) = w^T \phi(x).
- Linearity in the parameters w simplifies the analysis of this class of models, but also imposes some significant limitations.
- Using basis functions \phi_j(x) allows y(x, w) to be a nonlinear function of the input vector x.
- M: total number of parameters; \phi_j: basis functions (\phi_0(x) = 1 is a dummy basis function, so that w_0 acts as a bias).

Basis Functions
- Polynomial functions: \phi_j(x) = x^j. These are global functions of the input variable (a change in one region of input space affects the fit everywhere); spline functions address this by fitting polynomials piecewise.
- Gaussian basis functions: \phi_j(x) = \exp(-(x - \mu_j)^2 / (2 s^2)).
- Sigmoidal basis functions: \phi_j(x) = \sigma((x - \mu_j) / s), where \sigma(a) = 1 / (1 + e^{-a}) is the logistic sigmoid function.
- Other choices: Fourier basis, wavelets.
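As a concrete illustration, here is a minimal NumPy sketch of these basis functions and the resulting design matrix; the function names, the centers, and the width s are illustrative choices, not values from the slides.

import numpy as np

def polynomial_basis(x, M):
    # phi_j(x) = x**j for j = 0..M-1; the j = 0 column is the dummy basis function
    return np.vander(x, M, increasing=True)

def gaussian_basis(x, centers, s):
    # phi_j(x) = exp(-(x - mu_j)**2 / (2 s**2)), preceded by a constant dummy column
    phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))
    return np.hstack([np.ones((x.size, 1)), phi])

def sigmoid_basis(x, centers, s):
    # phi_j(x) = logistic sigmoid of (x - mu_j) / s, preceded by a constant dummy column
    a = (x[:, None] - centers[None, :]) / s
    return np.hstack([np.ones((x.size, 1)), 1.0 / (1.0 + np.exp(-a))])

x_demo = np.linspace(0, 1, 50)
Phi_demo = gaussian_basis(x_demo, centers=np.linspace(0, 1, 9), s=0.1)  # 50 x 10 design matrix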

Maximum Likelihood and Least Squares (1/2)
- Assumption: Gaussian noise model, t = y(x, w) + \epsilon, where \epsilon is a zero-mean Gaussian random variable with precision (inverse variance) \beta.
- Result: p(t | x, w, \beta) = N(t | y(x, w), \beta^{-1}), so the conditional mean is E[t | x] = y(x, w) (a unimodal conditional distribution).
- For a dataset of inputs X = {x_1, ..., x_N} with targets t = (t_1, ..., t_N)^T, the likelihood is p(t | w, \beta) = \prod_{n=1}^{N} N(t_n | w^T \phi(x_n), \beta^{-1}) (dropping the explicit conditioning on x).

Maximum Likelihood and Least Squares (2/2)
- Maximizing the likelihood function under a conditional Gaussian noise distribution for a linear model is equivalent to minimizing the sum-of-squares error function E_D(w) = (1/2) \sum_n (t_n - w^T \phi(x_n))^2.
- Setting the gradient of the log likelihood to zero gives the normal equations, w_{ML} = (\Phi^T \Phi)^{-1} \Phi^T t, where \Phi is the N x M design matrix with elements \Phi_{nj} = \phi_j(x_n).
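A hedged sketch of this fit on synthetic data, reusing gaussian_basis from above; the sin(2*pi*x) toy target and the noise level are illustrative, and lstsq is used instead of forming the inverse explicitly for numerical stability.

rng = np.random.default_rng(0)
N = 25
x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, N)     # noisy targets

centers = np.linspace(0, 1, 9)
Phi = gaussian_basis(x, centers, s=0.1)                 # N x M design matrix

# w_ML = (Phi^T Phi)^{-1} Phi^T t, computed via a least-squares solver
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

x_test = np.linspace(0, 1, 200)
Phi_test = gaussian_basis(x_test, centers, s=0.1)
y_pred = Phi_test @ w_ml                                # ML fit evaluated on test inputs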

Bias and Precision Parameter by ML
- Other solutions are obtained by setting derivatives of the log likelihood to zero:
- Bias: w_0 = \bar{t} - \sum_{j=1}^{M-1} w_j \bar{\phi_j}, where \bar{t} is the average target value and \bar{\phi_j} the average of \phi_j(x_n) over the training set. The bias compensates for the difference between the average (over the training set) of the target values and the weighted sum of the averages of the basis function values.
- Noise precision: 1/\beta_{ML} = (1/N) \sum_{n=1}^{N} (t_n - w_{ML}^T \phi(x_n))^2, i.e. the residual variance of the targets around the regression function.
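Continuing the sketch above, these two quantities can be read off the fitted model directly; the check on w_0 is an illustrative sanity test, not part of the slides.

residuals = t - Phi @ w_ml
beta_ml = 1.0 / np.mean(residuals ** 2)                 # 1/beta_ML = mean squared residual
w0_check = t.mean() - w_ml[1:] @ Phi[:, 1:].mean(axis=0)
print(beta_ml, w0_check, w_ml[0])                       # w0_check should match w_ml[0]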

Geometry of Least Squares
- If the number M of basis functions is smaller than the number N of data points, then the M vectors \varphi_j (the jth column of \Phi, with elements \phi_j(x_n)) span a linear subspace S of dimensionality M inside the N-dimensional space of target vectors.
- y = \Phi w is a linear combination of the columns \varphi_j.
- The least-squares solution for w corresponds to the choice of y that lies in the subspace S and is closest to t, i.e. the orthogonal projection of t onto S.
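A one-line numerical check of this geometric picture, under the fit from the earlier sketch: at the least-squares solution the residual vector is orthogonal to the column space of \Phi.

# Residual t - Phi w_ML should be (numerically) orthogonal to every column of Phi
print(np.allclose(Phi.T @ (t - Phi @ w_ml), 0.0, atol=1e-6))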

Sequential Learning
- On-line learning: process data points one at a time and update the parameters after each presentation.
- Technique of stochastic gradient descent (also called sequential gradient descent): w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E_n.
- For the sum-of-squares error function this gives w^{(\tau+1)} = w^{(\tau)} + \eta (t_n - w^{(\tau)T} \phi(x_n)) \phi(x_n), known as least-mean-squares (the LMS algorithm).
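A short LMS sketch on the same data; the learning rate and the number of passes are illustrative choices, not values from the slides.

eta = 0.05
w_lms = np.zeros(Phi.shape[1])
for _ in range(200):                                    # passes over the data
    for phi_n, t_n in zip(Phi, t):
        w_lms += eta * (t_n - w_lms @ phi_n) * phi_n    # w <- w + eta*(t_n - w.phi)*phi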

Regularized Least Squares
- Regularization controls over-fitting.
- Total error function: E(w) = (1/2) \sum_n (t_n - w^T \phi(x_n))^2 + (\lambda/2) w^T w (weight decay, i.e. ridge regression).
- Closed-form solution: w = (\lambda I + \Phi^T \Phi)^{-1} \Phi^T t.
- A more general regularizer: (\lambda/2) \sum_j |w_j|^q.
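The closed-form ridge solution in one line, continuing the sketch; the value of lambda is illustrative.

lam = 1e-3                                              # illustrative regularization coefficient
w_ridge = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)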

General Regularizer
- The case q = 1 in the general regularizer is known as the 'lasso' in the statistical literature.
- If \lambda is sufficiently large, some of the coefficients w_j are driven to zero, giving a sparse model in which the corresponding basis functions play no role.
- Regularization can equivalently be viewed as minimizing the unregularized sum-of-squares error subject to the constraint \sum_j |w_j|^q \le \eta.
- [Figure: contours of the regularization term for different q, and of the unregularized error; the lasso (q = 1) gives the sparse solution.]
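The lasso has no closed-form solution; one standard way to solve it is iterative soft-thresholding (ISTA). This is a hedged sketch of that approach (not a method described in the slides), and the value of lam is illustrative.

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_ista(Phi, t, lam, n_iter=2000):
    # minimize 0.5*||Phi w - t||^2 + lam*||w||_1 by iterative soft-thresholding
    L = np.linalg.eigvalsh(Phi.T @ Phi).max()           # Lipschitz constant of the gradient
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        w = soft_threshold(w - Phi.T @ (Phi @ w - t) / L, lam / L)
    return w

w_lasso = lasso_ista(Phi, t, lam=1.0)
print(np.sum(w_lasso == 0.0), "coefficients driven exactly to zero")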

Multiple Outputs
For K > 1 target variables there are two approaches:
1. Introduce a different set of basis functions for each component of t.
2. Use the same set of basis functions to model all of the components of the target vector: y(x, W) = W^T \phi(x), where W is an M x K matrix of parameters.
- The maximum-likelihood solution is W_{ML} = (\Phi^T \Phi)^{-1} \Phi^T T = \Phi^{\dagger} T, where \Phi^{\dagger} is the pseudo-inverse of \Phi; for each target variable t_k this decouples into w_k = \Phi^{\dagger} t_k.
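Because the basis is shared, all outputs use the same pseudo-inverse. In the sketch below, T is an illustrative N x K target matrix (K = 2), built on the earlier toy data.

T = np.column_stack([t, np.cos(2 * np.pi * x) + rng.normal(0.0, 0.2, N)])
W_ml = np.linalg.pinv(Phi) @ T                          # M x K parameter matrix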

The Bias-Variance Decomposition (1/4)
- Frequentist viewpoint of the model complexity issue: the bias-variance trade-off.
- Bayesian: the uncertainty in our model is expressed through a posterior distribution over w. Frequentist: make a point estimate of w based on the data set D.
- Expected squared loss: E[L] = \int (y(x) - h(x))^2 p(x) dx + \int\int (h(x) - t)^2 p(x, t) dx dt, where h(x) = E[t | x] is the optimal prediction. The second term arises from the intrinsic noise on the data; the first term depends on the choice of y(x), which in turn is dependent on the particular dataset D.

The Bias-Variance Decomposition (2/4)
- (Bias)^2: the extent to which the average prediction over all data sets differs from the desired regression function.
- Variance: the extent to which the solutions for individual data sets vary around their average, i.e. the extent to which the function y(x; D) is sensitive to the particular choice of data set.
- Expected loss = (bias)^2 + variance + noise.

The Bias-Variance Decomposition (3/4)
- [Figure: fits to multiple data sets for different regularization strengths, illustrating the bias-variance trade-off.]
- Averaging many solutions for the complex model (M = 25) is a beneficial procedure.
- A weighted averaging of multiple solutions (although with respect to the posterior distribution of parameters, not with respect to multiple data sets) lies at the heart of the Bayesian approach.

The Bias-Variance Decomposition (4/4)
- The average prediction over L data sets: \bar{y}(x) = (1/L) \sum_{l=1}^{L} y^{(l)}(x).
- Bias and variance can then be estimated as averages over the inputs: (bias)^2 = (1/N) \sum_n (\bar{y}(x_n) - h(x_n))^2 and variance = (1/N) \sum_n (1/L) \sum_l (y^{(l)}(x_n) - \bar{y}(x_n))^2.
- The bias-variance decomposition is based on averages with respect to ensembles of data sets (a frequentist perspective). In practice we observe only a single data set; if we had many independent training sets we would be better off combining them into a single large training set.
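A hedged simulation of this decomposition: many data sets are drawn from the same sin(2*pi*x) generator, a regularized fit is made to each, and (bias)^2 and variance are estimated from the ensemble of fitted curves. All settings (number of data sets, basis, lambda) are illustrative.

L_sets, N_pts, lam_bv = 100, 25, 1e-2
bv_centers = np.linspace(0, 1, 24)
x_grid = np.linspace(0, 1, 100)
h_grid = np.sin(2 * np.pi * x_grid)                     # true regression function h(x)
Phi_grid = gaussian_basis(x_grid, bv_centers, s=0.1)

preds = []
for _ in range(L_sets):
    xl = rng.uniform(0, 1, N_pts)
    tl = np.sin(2 * np.pi * xl) + rng.normal(0.0, 0.3, N_pts)
    Phil = gaussian_basis(xl, bv_centers, s=0.1)
    wl = np.linalg.solve(lam_bv * np.eye(Phil.shape[1]) + Phil.T @ Phil, Phil.T @ tl)
    preds.append(Phi_grid @ wl)

preds = np.asarray(preds)                               # L x 100 fitted curves
y_bar = preds.mean(axis=0)                              # average prediction
bias2 = np.mean((y_bar - h_grid) ** 2)
variance = np.mean(preds.var(axis=0))
print(bias2, variance)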

Bayesian Linear Regression (1/2)
- The conjugate prior of the Gaussian likelihood is a Gaussian over w: p(w) = N(w | m_0, S_0).
- The posterior is then also Gaussian: p(w | t) = N(w | m_N, S_N), with m_N = S_N (S_0^{-1} m_0 + \beta \Phi^T t) and S_N^{-1} = S_0^{-1} + \beta \Phi^T \Phi.

Bayesian Linear Regression (2/2)
- Consider the zero-mean isotropic Gaussian prior p(w | \alpha) = N(w | 0, \alpha^{-1} I).
- The corresponding posterior has m_N = \beta S_N \Phi^T t and S_N^{-1} = \alpha I + \beta \Phi^T \Phi.
- The log of the posterior is the sum of the log likelihood and the log of the prior: ln p(w | t) = -(\beta/2) \sum_n (t_n - w^T \phi(x_n))^2 - (\alpha/2) w^T w + const, so maximizing the posterior is equivalent to regularized least squares with \lambda = \alpha / \beta.
- Other forms of prior over the parameters lead to other regularizers.
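A minimal sketch of the posterior computation for this prior, continuing the earlier example; alpha and beta are treated as known hyperparameters, and their values here are illustrative.

alpha, beta = 2.0, 25.0
S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t                            # posterior mean and covariance of w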

Predictive Distribution (1/2)
- In practice our real interest is not w itself but predictions of t for new inputs x: p(t | x, t, \alpha, \beta) = N(t | m_N^T \phi(x), \sigma_N^2(x)).
- The predictive variance \sigma_N^2(x) = 1/\beta + \phi(x)^T S_N \phi(x) has two terms: the noise on the data (1/\beta) and the uncertainty associated with the parameters w; the second term goes to 0 as N → ∞.
- [Figure: mean of the Gaussian predictive distribution (red line) and predictive uncertainty (shaded region) as the number of data points increases.]
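Predictive mean and variance on the test grid from the earlier least-squares sketch (Phi_test, x_test, m_N, S_N and beta as defined above):

pred_mean = Phi_test @ m_N
pred_var = 1.0 / beta + np.einsum('ij,jk,ik->i', Phi_test, S_N, Phi_test)
pred_std = np.sqrt(pred_var)                            # width of the uncertainty band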

Predictive Distribution (2/2)
- An alternative way to visualize the predictive distribution is to draw samples from the posterior distribution over w and plot the corresponding functions y(x, w).
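A brief sketch of this sampling view, continuing from the posterior computed above; the number of samples is an illustrative choice.

w_samples = rng.multivariate_normal(m_N, S_N, size=5)   # w ~ N(m_N, S_N)
y_samples = w_samples @ Phi_test.T                      # each row is one sampled curve y(x, w)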

Equivalent Kernel
- The mean of the predictive distribution at a point x can be written as y(x, m_N) = \beta \phi(x)^T S_N \Phi^T t = \sum_{n=1}^{N} k(x, x_n) t_n, i.e. a linear combination of the training target values.
- k(x, x') = \beta \phi(x)^T S_N \phi(x') is an inner product of nonlinear (basis) functions, known as the smoother matrix or equivalent kernel.
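The equivalent-kernel form can be verified numerically against the predictive mean from the earlier sketch:

K = beta * Phi_test @ S_N @ Phi.T                       # k(x_test, x_n), shape (len(x_test), N)
print(np.allclose(K @ t, pred_mean))                    # kernel-weighted targets = m_N^T phi(x)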