Advanced Machine Learning (Winter '12), Lecture 3: Linear Regression II
Bastian Leibe, RWTH Aachen, 24.10.2012

This Lecture: Advanced Machine Learning
Regression Approaches
- Linear Regression
- Regularization (Ridge, Lasso)
- Support Vector Regression
- Gaussian Processes
Learning with Latent Variables
- EM and Generalizations
- Dirichlet Processes
Structured Output Learning
- Large-margin Learning

Topics of This Lecture
Recap: Probabilistic View on Regression
Properties of Linear Regression
- Loss functions for regression
- Basis functions
- Multiple outputs
- Sequential estimation
Regularization Revisited
- Regularized least-squares
- The Lasso
- Discussion
Bias-Variance Decomposition

Recap: Probabilistic Regression
First assumption: our target function values t are generated by adding noise to the ideal function estimate:
    t = y(x, w) + ε
where y(x, w) is the regression function, x the input value, w the weights (parameters), and ε the noise.
Second assumption: the noise is Gaussian distributed:
    p(t | x, w, β) = N(t | y(x, w), β⁻¹)
with mean y(x, w) and variance β⁻¹ (β is the precision).
(Slide adapted from Bernt Schiele.)

Recap: Probabilistic Regression
Given
- training data points X = {x₁, …, x_N}
- associated function values t = (t₁, …, t_N)ᵀ
Conditional likelihood (assuming i.i.d. data):
    p(t | X, w, β) = ∏_{n=1}^{N} N(t_n | y(x_n, w), β⁻¹) = ∏_{n=1}^{N} N(t_n | wᵀφ(x_n), β⁻¹)
using the generalized linear regression function y(x, w) = wᵀφ(x).
Maximize w.r.t. w and β.
(Slide adapted from Bernt Schiele.)

Recap: Maximum Likelihood Regression
Setting the gradient of the log-likelihood w.r.t. w to zero yields
    w_ML = (ΦᵀΦ)⁻¹Φᵀt
where Φ is the design matrix with rows φ(x_n)ᵀ. This is the same solution as in least-squares regression!
⇒ Least-squares regression is equivalent to Maximum Likelihood under the assumption of Gaussian noise.
(Slide adapted from Bernt Schiele.)

Recap: Role of the Precision Parameter
We can also use ML to determine the precision parameter β. Setting the gradient of the log-likelihood w.r.t. β to zero gives
    1/β_ML = (1/N) Σ_{n=1}^{N} { t_n − w_MLᵀφ(x_n) }²
⇒ The inverse of the noise precision is given by the residual variance of the target values around the regression function.
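
To make the two recap results above concrete, here is a minimal NumPy sketch (my own addition, not part of the original lecture): it fits w_ML through the normal equations and then estimates the noise precision β from the residual variance. The sinusoidal toy data and the polynomial basis are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: noisy samples of a sinusoid (an assumption for this sketch).
N = 25
x = rng.uniform(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, N)

# Design matrix Phi with polynomial basis functions phi_j(x) = x^j.
M = 4  # number of basis functions (including the bias phi_0 = 1)
Phi = np.vander(x, M, increasing=True)        # shape (N, M)

# Maximum-likelihood weights: w_ML = (Phi^T Phi)^{-1} Phi^T t.
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# ML noise precision: 1/beta_ML is the residual variance around the fit.
residuals = t - Phi @ w_ml
beta_ml = 1.0 / np.mean(residuals ** 2)

print("w_ML =", w_ml)
print("estimated noise std =", np.sqrt(1.0 / beta_ml))
```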

Recap: Predictive Distribution
Having determined the parameters w and β, we can now make predictions for new values of x:
    p(t | x, w_ML, β_ML) = N(t | y(x, w_ML), β_ML⁻¹)
⇒ Rather than giving a point estimate, we can now also give an estimate of the estimation uncertainty.
(Image source: C.M. Bishop, 2006.)

Recap: Maximum-A-Posteriori Estimation
Introduce a prior distribution over the coefficients w. For simplicity, assume a zero-mean Gaussian distribution:
    p(w | α) = N(w | 0, α⁻¹I)
The new hyperparameter α controls the distribution of the model parameters.
Express the posterior distribution over w using Bayes' theorem:
    p(w | X, t) ∝ p(t | X, w, β) p(w | α)
We can now determine w by maximizing the posterior. This technique is called maximum-a-posteriori (MAP) estimation.

Recap: MAP Solution
Minimize the negative logarithm of the posterior:
    −ln p(w | X, t) = (β/2) Σ_{n=1}^{N} { t_n − wᵀφ(x_n) }² + (α/2) wᵀw + const
The MAP solution is therefore the minimizer of the regularized error
    E(w) = (1/2) Σ_{n=1}^{N} { t_n − wᵀφ(x_n) }² + (λ/2) wᵀw
⇒ Maximizing the posterior distribution is equivalent to minimizing the regularized sum-of-squares error (with λ = α/β).

MAP Solution (2)
Setting the gradient to zero:
    w_MAP = (ΦᵀΦ + λI)⁻¹Φᵀt
Effect of regularization: the added λI keeps the matrix inverse well-conditioned.

Recap: Bayesian Curve Fitting
Given
- training data points X = {x₁, …, x_N}
- associated function values t = (t₁, …, t_N)ᵀ
our goal is to predict the value of t for a new point x.
Evaluate the predictive distribution
    p(t | x, X, t) = ∫ p(t | x, w) p(w | X, t) dw
where the noise distribution p(t | x, w) is again assumed Gaussian, and p(w | X, t) is the posterior we just computed for MAP. Assume for now that the parameters α and β are fixed and known.

Recap: Bayesian Curve Fitting
Under those assumptions, the predictive distribution is a Gaussian and can be evaluated analytically:
    p(t | x, X, t) = N(t | m(x), s²(x))
where the mean and variance are given by
    m(x) = β φ(x)ᵀ S Σ_{n=1}^{N} φ(x_n) t_n
    s²(x) = β⁻¹ + φ(x)ᵀ S φ(x)
and S is the regularized covariance matrix
    S⁻¹ = αI + β Σ_{n=1}^{N} φ(x_n) φ(x_n)ᵀ
(Image source: C.M. Bishop, 2006.)
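
A small sketch of these formulas (again my addition, not lecture material): it forms S⁻¹ = αI + β Σ φ(xₙ)φ(xₙ)ᵀ explicitly and evaluates m(x) and s²(x) on a few test points. The values of α and β and the polynomial basis are assumptions; the slide only states that they are fixed and known.

```python
import numpy as np

rng = np.random.default_rng(1)

def phi(x, M=6):
    """Polynomial feature vector(s) phi_j(x) = x^j, j = 0..M-1."""
    return np.vander(np.atleast_1d(x), M, increasing=True)

# Toy training data (an assumption for this sketch).
N = 10
x_train = rng.uniform(0.0, 1.0, N)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.2, N)

alpha, beta = 5e-3, 25.0          # fixed hyperparameters, as assumed on the slide
Phi = phi(x_train)                # (N, M)
M = Phi.shape[1]

# S^{-1} = alpha*I + beta * sum_n phi(x_n) phi(x_n)^T
S_inv = alpha * np.eye(M) + beta * Phi.T @ Phi
S = np.linalg.inv(S_inv)

x_test = np.linspace(0.0, 1.0, 5)
Phi_test = phi(x_test)            # (5, M)

# m(x) = beta * phi(x)^T S sum_n phi(x_n) t_n
mean = beta * Phi_test @ S @ (Phi.T @ t_train)
# s^2(x) = 1/beta + phi(x)^T S phi(x)
var = 1.0 / beta + np.einsum("ij,jk,ik->i", Phi_test, S, Phi_test)

for xt, m, v in zip(x_test, mean, var):
    print(f"x = {xt:.2f}: predictive mean {m:+.3f}, std {np.sqrt(v):.3f}")
```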


Loss Functions for Regression
Given the probabilistic model p(t | x, w, β), how do we actually estimate a function value y_t for a new point x_t? We need a loss function, just as in the classification case.
Optimal prediction: minimize the expected loss
    E[L] = ∫∫ L(t, y(x)) p(x, t) dx dt
(Slide adapted from Stefan Roth.)

Loss Functions for Regression
Simplest case: the squared loss
    L(t, y(x)) = { y(x) − t }²
Expected loss:
    E[L] = ∫∫ { y(x) − t }² p(x, t) dx dt
(Slide adapted from Stefan Roth.)

Loss Functions for Regression
Important result: under squared loss, the optimal regression function is the mean E[t | x] of the posterior p(t | x), also called the mean prediction.
For our generalized linear regression function and squared loss, we obtain as result
    y(x) = ∫ t N(t | wᵀφ(x), β⁻¹) dt = wᵀφ(x)
(Slide adapted from Stefan Roth.)

Visualization of Mean Prediction
(Figure: the optimal prediction under squared loss is given by the mean of the conditional distribution p(t | x); slide adapted from Stefan Roth; image source: C.M. Bishop, 2006.)

Loss Functions for Regression
Different derivation: expand the square term as follows:
    { y(x) − t }² = { y(x) − E[t|x] + E[t|x] − t }²
                  = { y(x) − E[t|x] }² + 2 { y(x) − E[t|x] }{ E[t|x] − t } + { E[t|x] − t }²
Substituting into the loss function, the cross-term vanishes (its expectation over t is zero), and we end up with
    E[L] = ∫ { y(x) − E[t|x] }² p(x) dx + ∫ var[t | x] p(x) dx
The first term is minimized by the conditional mean, i.e., the optimal least-squares predictor. The second term is the intrinsic variability of the target data, the irreducible minimum value of the loss function.

Other Loss Functions
The squared loss is not the only possible choice. It is a poor choice, for example, when the conditional distribution p(t | x) is multimodal.
Simple generalization: the Minkowski loss
    L_q(t, y(x)) = | y(x) − t |^q
Expectation:
    E[L_q] = ∫∫ | y(x) − t |^q p(x, t) dx dt
The minimum of E[L_q] is given by
- the conditional mean for q = 2,
- the conditional median for q = 1,
- the conditional mode for q → 0.
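
The q = 2 vs. q = 1 behaviour can be checked numerically. The following toy experiment (my addition; the lognormal sample is an arbitrary skewed choice) minimizes the empirical Minkowski loss over a grid of candidate predictions and compares the minimizers with the sample mean and median.

```python
import numpy as np

rng = np.random.default_rng(2)
t = rng.lognormal(mean=0.0, sigma=1.0, size=5000)      # a skewed target sample

candidates = np.linspace(0.0, 5.0, 501)                # candidate point predictions y

def minkowski_minimizer(q):
    # Empirical expected Minkowski loss E[|y - t|^q] for every candidate y.
    losses = np.abs(candidates[:, None] - t[None, :]) ** q
    return candidates[losses.mean(axis=1).argmin()]

print("q = 2 minimizer:", minkowski_minimizer(2.0), " sample mean:  ", round(t.mean(), 3))
print("q = 1 minimizer:", minkowski_minimizer(1.0), " sample median:", round(float(np.median(t)), 3))
```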

Minkowski Loss Functions
(Figure: plots of the Minkowski loss | y − t |^q for several values of q; image source: C.M. Bishop, 2006.)


Linear Basis Function Models
Generally, we consider models of the following form:
    y(x, w) = Σ_{j=0}^{M−1} w_j φ_j(x) = wᵀφ(x)
where the φ_j(x) are known as basis functions.
- Typically, φ₀(x) = 1, so that w₀ acts as a bias.
- In the simplest case, we use linear basis functions: φ_d(x) = x_d.
Let's take a look at some other possible basis functions.
(Slide adapted from C.M. Bishop, 2006.)

Linear Basis Function Models (2)
Polynomial basis functions:
    φ_j(x) = x^j
Properties:
- Global: a small change in x affects all basis functions.
(Slide adapted from C.M. Bishop, 2006; image source: C.M. Bishop, 2006.)

Linear Basis Function Models (3)
Gaussian basis functions:
    φ_j(x) = exp( −(x − μ_j)² / (2s²) )
Properties:
- Local: a small change in x affects only nearby basis functions.
- μ_j and s control location and scale (width).
(Slide adapted from C.M. Bishop, 2006; image source: C.M. Bishop, 2006.)

Linear Basis Function Models (4)
Sigmoid basis functions:
    φ_j(x) = σ( (x − μ_j) / s ),   where σ(a) = 1 / (1 + exp(−a))
Properties:
- Local: a small change in x affects only nearby basis functions.
- μ_j and s control location and scale (slope).
(Slide adapted from C.M. Bishop, 2006; image source: C.M. Bishop, 2006.)
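
The three basis-function families above can be wrapped in a small helper that builds the design matrix Φ. This is an illustrative sketch of mine, not lecture code; the centres μ_j spaced over [0, 1] and the scale s are assumptions.

```python
import numpy as np

def design_matrix(x, kind="gaussian", M=9, s=0.1):
    """Return the N x M design matrix Phi for 1-D inputs x.

    kind: 'polynomial' -> phi_j(x) = x^j
          'gaussian'   -> phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))
          'sigmoid'    -> phi_j(x) = 1 / (1 + exp(-(x - mu_j) / s))
    A bias column phi_0(x) = 1 is always included.
    """
    x = np.asarray(x, dtype=float).reshape(-1, 1)          # (N, 1)
    if kind == "polynomial":
        return x ** np.arange(M)                            # columns x^0 .. x^{M-1}
    mu = np.linspace(0.0, 1.0, M - 1).reshape(1, -1)        # centres mu_j (assumed in [0, 1])
    if kind == "gaussian":
        feats = np.exp(-(x - mu) ** 2 / (2.0 * s ** 2))
    elif kind == "sigmoid":
        feats = 1.0 / (1.0 + np.exp(-(x - mu) / s))
    else:
        raise ValueError(f"unknown basis kind: {kind}")
    return np.hstack([np.ones_like(x), feats])              # prepend bias column

x = np.linspace(0.0, 1.0, 5)
for kind in ("polynomial", "gaussian", "sigmoid"):
    print(kind, design_matrix(x, kind=kind, M=4).shape)
```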


Multiple Outputs
So far we have only considered the case of a single target variable t. We may wish to predict K > 1 target variables, collected in a target vector t. We can write this in matrix form:
    y(x, W) = Wᵀφ(x)
where W is an M × K matrix of parameters and φ(x) is the M-dimensional vector of basis functions.

Multiple Outputs (2)
Analogously to the single-output case we have:
    p(t | x, W, β) = N(t | Wᵀφ(x), β⁻¹I)
Given observed inputs X = {x₁, …, x_N} and targets T = [t₁, …, t_N]ᵀ, we obtain the log-likelihood function
    ln p(T | X, W, β) = (NK/2) ln(β / 2π) − (β/2) Σ_{n=1}^{N} ‖ t_n − Wᵀφ(x_n) ‖²
(Slide adapted from C.M. Bishop, 2006.)

Multiple Outputs (3)
Maximizing with respect to W, we obtain
    W_ML = (ΦᵀΦ)⁻¹ΦᵀT
If we consider a single target variable t_k, we see that
    w_k = (ΦᵀΦ)⁻¹Φᵀt_k
where t_k = (t_{1k}, …, t_{Nk})ᵀ, which is identical to the single-output case.
(Slide adapted from C.M. Bishop, 2006.)
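
A quick numerical check (my own sketch, with random data standing in for a real basis expansion) that the multi-output solution W_ML really is column-wise identical to solving K separate single-output problems:

```python
import numpy as np

rng = np.random.default_rng(3)
N, M, K = 50, 6, 3             # samples, basis functions, target dimensions
Phi = rng.normal(size=(N, M))  # design matrix (any basis expansion)
T = rng.normal(size=(N, K))    # N target vectors, stacked as rows

# W_ML = (Phi^T Phi)^{-1} Phi^T T, solved jointly by least squares.
W_ml, *_ = np.linalg.lstsq(Phi, T, rcond=None)               # shape (M, K)

# Solving each output independently gives exactly the same columns.
w_cols = np.stack(
    [np.linalg.lstsq(Phi, T[:, k], rcond=None)[0] for k in range(K)], axis=1
)
print("columns identical:", np.allclose(W_ml, w_cols))
```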


Sequential Learning
Up to now, we have mainly considered batch methods, where all data is used at the same time. Instead, we can also consider data items one at a time (a.k.a. online learning).
Stochastic (sequential) gradient descent:
    w^(τ+1) = w^(τ) + η ( t_n − w^(τ)ᵀφ(x_n) ) φ(x_n)
This is known as the least-mean-squares (LMS) algorithm.
Issue: how to choose the learning rate η?
(Slide adapted from C.M. Bishop, 2006.)
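
A minimal LMS sketch (my addition): it streams one data point at a time and applies the update above. The fixed learning rate η is an assumption and illustrates exactly the open issue mentioned on the slide; in practice η is tuned or decayed over time.

```python
import numpy as np

rng = np.random.default_rng(4)

# Streaming toy data: t = w_true^T phi(x) + noise, with phi(x) = (1, x).
w_true = np.array([0.5, -2.0])
eta = 0.05                      # learning rate (assumed)
w = np.zeros(2)

for step in range(5000):
    x = rng.uniform(-1.0, 1.0)
    phi = np.array([1.0, x])
    t = w_true @ phi + rng.normal(0.0, 0.1)
    # LMS update: w <- w + eta * (t_n - w^T phi(x_n)) * phi(x_n)
    w = w + eta * (t - w @ phi) * phi

print("estimated w:", w, " true w:", w_true)
```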


Regularization Revisited
Consider the error function
    E(w) = E_D(w) + λ E_W(w)       (data term + regularization term; λ is called the regularization coefficient)
With the sum-of-squares error function and a quadratic regularizer, we get
    E(w) = (1/2) Σ_{n=1}^{N} { t_n − wᵀφ(x_n) }² + (λ/2) wᵀw
which is minimized by
    w = (ΦᵀΦ + λI)⁻¹Φᵀt
(Slide adapted from C.M. Bishop, 2006.)
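
The closed-form solution on this slide takes only a few lines of NumPy. This is a sketch of mine; the data, basis, and value of λ are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
N, M = 30, 10
x = rng.uniform(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, N)
Phi = np.vander(x, M, increasing=True)         # polynomial design matrix, columns x^0 .. x^{M-1}

lam = 1e-3                                     # regularization coefficient lambda (assumed)
# w = (Phi^T Phi + lambda I)^{-1} Phi^T t; np.linalg.solve avoids forming the inverse explicitly,
# and the added lambda*I keeps the linear system well-conditioned even for a high-degree basis.
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ t)
print("ridge weights:", np.round(w_ridge, 3))
```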

Regularized Least-Squares
Let's look at more general regularizers ("L_q norms"):
    E(w) = (1/2) Σ_{n=1}^{N} { t_n − wᵀφ(x_n) }² + (λ/2) Σ_{j=1}^{M} | w_j |^q
q = 1 gives the Lasso; q = 2 gives ridge regression.
(Figure: contours of the regularization term for different values of q; slide adapted from C.M. Bishop, 2006; image source: C.M. Bishop, 2006.)

Recall: Lagrange Multipliers

Regularized Least-Squares
We want to minimize
    E(w) = (1/2) Σ_{n=1}^{N} { t_n − wᵀφ(x_n) }² + (λ/2) Σ_{j=1}^{M} | w_j |^q
This is equivalent to minimizing the unregularized sum-of-squares error
    (1/2) Σ_{n=1}^{N} { t_n − wᵀφ(x_n) }²
subject to the constraint
    Σ_{j=1}^{M} | w_j |^q ≤ η
(for some suitably chosen η).

Regularized Least-Squares
Effect: sparsity for q ≤ 1. The minimization tends to set many coefficients exactly to zero.
Why is this good? Why don't we always do it, then? Any problems?
(Figure: constraint regions and error contours for q = 1 and q = 2; image source: C.M. Bishop, 2006.)

The Lasso
Consider the following regressor:
    ŵ_lasso = argmin_w (1/2) Σ_{n=1}^{N} ( t_n − wᵀφ(x_n) )²   subject to   Σ_{j=1}^{M} | w_j | ≤ η
This formulation is known as the Lasso.
Properties:
- L1 regularization.
- The solution will be sparse (only a few coefficients will be non-zero).
- The L1 penalty makes the problem non-linear (the solution is no longer a linear function of the targets), and there is no closed-form solution.
- We need to solve a quadratic programming problem.
- However, efficient algorithms are available with the same computational cost as for ridge regression; see the sketch below.
(Image source: C.M. Bishop, 2006.)
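
The slide only states that efficient algorithms exist; one common choice is cyclic coordinate descent with soft-thresholding, applied here to the penalized form (1/2) Σₙ (tₙ − wᵀφ(xₙ))² + λ Σⱼ |wⱼ|. The sketch below is my own illustrative implementation under that assumption, not the lecture's algorithm; in practice one would typically rely on a library solver.

```python
import numpy as np

def soft_threshold(rho, lam):
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_cd(Phi, t, lam, n_iters=200):
    """Cyclic coordinate descent for (1/2)||t - Phi w||^2 + lam * ||w||_1."""
    N, M = Phi.shape
    w = np.zeros(M)
    col_sq = (Phi ** 2).sum(axis=0)                  # precompute ||phi_j||^2
    for _ in range(n_iters):
        for j in range(M):
            # Partial residual with coordinate j removed.
            r_j = t - Phi @ w + Phi[:, j] * w[j]
            rho = Phi[:, j] @ r_j
            w[j] = soft_threshold(rho, lam) / col_sq[j]
    return w

rng = np.random.default_rng(6)
N, M = 100, 10
Phi = rng.normal(size=(N, M))
w_true = np.zeros(M)
w_true[[0, 3]] = [2.0, -1.5]                          # sparse ground truth
t = Phi @ w_true + rng.normal(0.0, 0.1, N)

w_hat = lasso_cd(Phi, t, lam=5.0)
print("non-zero coefficients:", np.flatnonzero(np.round(w_hat, 3)))
```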

Lasso as Bayes Estimation
Interpretation as Bayes estimation: we can think of | w_j |^q as (proportional to) the negative log-prior density for w_j.
Prior for the Lasso (q = 1): the Laplacian distribution
    p(w_j) = (1 / 2τ) exp( −| w_j | / τ ),   with τ = 1/λ
(Image source: Friedman, Hastie, Tibshirani, 2009.)

Analysis
(Figure: equicontours of the prior distribution for different values of q; image source: Friedman, Hastie, Tibshirani, 2009.)
- For q ≤ 1, the prior is not uniform in direction, but concentrates more mass on the coordinate directions.
- The case q = 1 (lasso) is the smallest q for which the constraint region is convex; non-convexity makes the optimization problem more difficult.
- In the limit q → 0, the regularization term simply counts the number of non-zero coefficients, Σ_{j=1}^{M} 1[w_j ≠ 0]. This is known as Best Subset Selection.

Discussion
Bayesian analysis:
- Lasso, ridge regression, and Best Subset Selection are Bayes estimates with different priors.
- However, they are derived as maximizers of the posterior (MAP estimates); ideally one should use the posterior mean as the Bayes estimate.
- The ridge regression solution happens to coincide with the posterior mean, but Lasso and Best Subset Selection do not.
We might also try using other values of q besides 0, 1, 2.
- However, experience shows that this is not worth the effort.
- Values of q ∈ (1, 2) are a compromise between lasso and ridge.
- However, | w_j |^q with q > 1 is differentiable at 0 and therefore loses the ability of the lasso to set coefficients exactly to zero.


Bias-Variance Decomposition
Recall the expected squared loss
    E[L] = ∫ { y(x) − h(x) }² p(x) dx + ∫∫ { h(x) − t }² p(x, t) dx dt
where
    h(x) = E[t | x] = ∫ t p(t | x) dt
The second term of E[L] corresponds to the noise inherent in the random variable t. What about the first term?
(Slide adapted from C.M. Bishop, 2006.)

Bias-Variance Decomposition
Suppose we were given multiple data sets, each of size N. Any particular data set D will give a particular function y(x; D). We then have
    { y(x; D) − h(x) }² = { y(x; D) − E_D[y(x; D)] + E_D[y(x; D)] − h(x) }²
Taking the expectation over D yields
    E_D[ { y(x; D) − h(x) }² ] = { E_D[y(x; D)] − h(x) }² + E_D[ { y(x; D) − E_D[y(x; D)] }² ]
(Slide adapted from C.M. Bishop, 2006.)

Bias-Variance Decomposition
Thus we can write
    expected loss = (bias)² + variance + noise
where
    (bias)²  = ∫ { E_D[y(x; D)] − h(x) }² p(x) dx
    variance = ∫ E_D[ { y(x; D) − E_D[y(x; D)] }² ] p(x) dx
    noise    = ∫∫ { h(x) − t }² p(x, t) dx dt
(Slide adapted from C.M. Bishop, 2006.)

Bias-Variance Decomposition
Example: 25 data sets drawn from the sinusoidal data set, varying the degree of regularization λ.
(Figure: left, the individual fits y(x; D); right, their average compared with the true sinusoidal function; slide adapted from C.M. Bishop, 2006; image source: C.M. Bishop, 2006.)

Bias-Variance Decomposition
Example (continued): the same experiment for a further value of λ. (Image source: C.M. Bishop, 2006.)

Bias-Variance Decomposition
Example (continued): the same experiment for a further value of λ. (Image source: C.M. Bishop, 2006.)
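
The experiment behind these figures can be reproduced in a few lines. The sketch below is my own reconstruction under stated assumptions (Gaussian basis with 24 centres, noise level 0.3, and three example values of λ, none of which are taken from the slides); it estimates (bias)² and variance by averaging ridge fits over many independently drawn data sets.

```python
import numpy as np

rng = np.random.default_rng(7)

def gaussian_design(x, centers, s=0.1):
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    feats = np.exp(-(x - centers.reshape(1, -1)) ** 2 / (2.0 * s ** 2))
    return np.hstack([np.ones_like(x), feats])

def h(x):                        # true regression function h(x) = E[t | x]
    return np.sin(2 * np.pi * x)

centers = np.linspace(0.0, 1.0, 24)
x_eval = np.linspace(0.0, 1.0, 100)
Phi_eval = gaussian_design(x_eval, centers)

L, N, noise_std = 100, 25, 0.3   # number of data sets, points per set, noise level

for lam in (10.0, 0.1, 1e-3):    # over-, moderately, and under-regularized
    preds = np.empty((L, x_eval.size))
    for l in range(L):
        x = rng.uniform(0.0, 1.0, N)
        t = h(x) + rng.normal(0.0, noise_std, N)
        Phi = gaussian_design(x, centers)
        w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ t)
        preds[l] = Phi_eval @ w
    avg = preds.mean(axis=0)
    bias2 = np.mean((avg - h(x_eval)) ** 2)
    variance = np.mean(preds.var(axis=0))
    print(f"lambda = {lam:6.3f}:  bias^2 = {bias2:.4f}  variance = {variance:.4f}")
```

Decreasing λ should show (bias)² falling while the variance grows, which is exactly the trade-off summarized on the next slide.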

The Bias-Variance Trade-Off
Result from these plots:
- An over-regularized model (large λ) will have a high bias.
- An under-regularized model (small λ) will have a high variance.
We can compute an estimate of the generalization capability this way (the magenta curve in the figure)! Can you see where the problem is with this? The computation is based on averages w.r.t. ensembles of data sets, which makes it unfortunately of little practical value, since in practice we only ever observe a single data set.
(Slide adapted from C.M. Bishop, 2006; image source: C.M. Bishop, 2006.)

References and Further Reading
More information on linear regression, including a discussion of regularization, can be found in the corresponding chapters of the Bishop book. Additional information on the Lasso, including efficient algorithms to solve it, can be found in Chapter 3.4 of the Hastie book.
- Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
- T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, 2nd edition, Springer, 2009.