Bias and Variance of the Estimator (PRML 3.2, Ethem Chp. 4)


In previous lectures we showed how to build classifiers when the underlying densities are known; Bayesian Decision Theory introduced the general formulation. In most situations, however, the true distributions are unknown and must be estimated from data:

- Parameter estimation (we saw the Maximum Likelihood method): assume a particular form for the density (e.g., Gaussian), so only the parameters (e.g., mean and variance) need to be estimated.
  - Maximum Likelihood
  - Bayesian Estimation
- Non-parametric density estimation (not covered): assume NO knowledge about the density.
  - Kernel Density Estimation
  - Nearest Neighbor Rule

[Figure: the blue curve shows the distribution of the estimates (e.g., the sample mean) across different datasets.]

How Good is an Estimator?

Assume our dataset $\mathcal{X}$ is sampled from a population specified up to a parameter $\theta$. How good is an estimator $d(\mathcal{X})$ as an estimate of $\theta$? Notice that the estimate depends on the sample set $\mathcal{X}$. If we take the expectation of the squared error over different datasets $\mathcal{X}$ and expand around $\mathbb{E}[d] \equiv \mathbb{E}_{\mathcal{X}}[d(\mathcal{X})]$, we get

$$\mathbb{E}\big[(d(\mathcal{X})-\theta)^2\big] = \underbrace{\mathbb{E}\big[(d(\mathcal{X})-\mathbb{E}[d])^2\big]}_{\text{variance of the estimator}} + \underbrace{\big(\mathbb{E}[d]-\theta\big)^2}_{\text{bias squared}}.$$

Using a simpler notation (dropping the dependence on $\mathcal{X}$, but knowing it exists):

$$\mathbb{E}\big[(d-\theta)^2\big] = \mathbb{E}\big[(d-\mathbb{E}[d])^2\big] + \big(\mathbb{E}[d]-\theta\big)^2.$$

Properties of $\mu_{ML}$ and $\sigma^2_{ML}$

The ML estimate of the mean, $\mu_{ML} = \frac{1}{N}\sum_{n=1}^{N} x_n$, is an unbiased estimator: $\mathbb{E}[\mu_{ML}] = \mu$.

The ML estimate of the variance, $\sigma^2_{ML} = \frac{1}{N}\sum_{n=1}^{N}(x_n - \mu_{ML})^2$, is biased: $\mathbb{E}[\sigma^2_{ML}] = \frac{N-1}{N}\sigma^2$. Use instead the unbiased estimator

$$\tilde{\sigma}^2 = \frac{N}{N-1}\,\sigma^2_{ML} = \frac{1}{N-1}\sum_{n=1}^{N}(x_n - \mu_{ML})^2.$$
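To make the two claims above concrete, here is a minimal simulation, a sketch rather than anything from the slides (the Gaussian parameters, sample size N = 10, and trial count are arbitrary choices). It checks numerically that the sample mean is unbiased, that the ML variance estimate is biased by the factor (N-1)/N, and that the mean squared error of an estimator splits into variance plus squared bias:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, N, trials = 2.0, 3.0, 10, 200_000

# Many datasets X, each of size N, drawn from N(mu, sigma^2).
X = rng.normal(mu, sigma, size=(trials, N))

m = X.mean(axis=1)                         # ML estimate of the mean, per dataset
s2 = ((X - m[:, None]) ** 2).mean(axis=1)  # ML estimate of the variance

print("E[m] - mu         :", m.mean() - mu)              # ~ 0 (unbiased)
print("E[s2]             :", s2.mean())                  # ~ (N-1)/N * sigma^2 = 8.1
print("(N-1)/N * sigma^2 :", (N - 1) / N * sigma**2)
print("E[N/(N-1) * s2]   :", (N / (N - 1)) * s2.mean())  # ~ sigma^2 = 9 (corrected)

# Decomposition check for the estimator d = s2 and theta = sigma^2:
mse = ((s2 - sigma**2) ** 2).mean()        # E[(d - theta)^2]
var = s2.var()                             # E[(d - E[d])^2]
bias2 = (s2.mean() - sigma**2) ** 2        # (E[d] - theta)^2
print("MSE               :", mse)
print("variance + bias^2 :", var + bias2)  # matches the MSE
```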

The Bias-Variance Decomposition (1)

Recall the expected squared loss,

$$\mathbb{E}[L] = \int \{y(\mathbf{x}) - h(\mathbf{x})\}^2\, p(\mathbf{x})\, d\mathbf{x} + \iint \{h(\mathbf{x}) - t\}^2\, p(\mathbf{x}, t)\, d\mathbf{x}\, dt,$$

where

$$h(\mathbf{x}) = \mathbb{E}[t \mid \mathbf{x}] = \int t\, p(t \mid \mathbf{x})\, dt.$$

We said that the second term corresponds to the noise inherent in the random variable $t$. What about the first term?

The Bias-Variance Decomposition (2)

Suppose we were given multiple data sets, each of size $N$. Any particular data set $\mathcal{D}$ will give a particular function $y(\mathbf{x}; \mathcal{D})$. Adding and subtracting $\mathbb{E}_{\mathcal{D}}[y(\mathbf{x};\mathcal{D})]$ inside the first term, we then have

$$\{y(\mathbf{x};\mathcal{D}) - h(\mathbf{x})\}^2 = \{y(\mathbf{x};\mathcal{D}) - \mathbb{E}_{\mathcal{D}}[y(\mathbf{x};\mathcal{D})]\}^2 + \{\mathbb{E}_{\mathcal{D}}[y(\mathbf{x};\mathcal{D})] - h(\mathbf{x})\}^2 + 2\,\{y(\mathbf{x};\mathcal{D}) - \mathbb{E}_{\mathcal{D}}[y(\mathbf{x};\mathcal{D})]\}\{\mathbb{E}_{\mathcal{D}}[y(\mathbf{x};\mathcal{D})] - h(\mathbf{x})\}.$$

The Bias-Variance Decomposition (3)

Taking the expectation over $\mathcal{D}$ makes the cross term vanish, yielding

$$\mathbb{E}_{\mathcal{D}}\big[\{y(\mathbf{x};\mathcal{D}) - h(\mathbf{x})\}^2\big] = \underbrace{\{\mathbb{E}_{\mathcal{D}}[y(\mathbf{x};\mathcal{D})] - h(\mathbf{x})\}^2}_{\text{bias}^2} + \underbrace{\mathbb{E}_{\mathcal{D}}\big[\{y(\mathbf{x};\mathcal{D}) - \mathbb{E}_{\mathcal{D}}[y(\mathbf{x};\mathcal{D})]\}^2\big]}_{\text{variance}}.$$

The Bias-Variance Decomposition (4)

Thus we can write

$$\text{expected loss} = (\text{bias})^2 + \text{variance} + \text{noise},$$

where

$$(\text{bias})^2 = \int \{\mathbb{E}_{\mathcal{D}}[y(\mathbf{x};\mathcal{D})] - h(\mathbf{x})\}^2\, p(\mathbf{x})\, d\mathbf{x},$$
$$\text{variance} = \int \mathbb{E}_{\mathcal{D}}\big[\{y(\mathbf{x};\mathcal{D}) - \mathbb{E}_{\mathcal{D}}[y(\mathbf{x};\mathcal{D})]\}^2\big]\, p(\mathbf{x})\, d\mathbf{x},$$
$$\text{noise} = \iint \{h(\mathbf{x}) - t\}^2\, p(\mathbf{x}, t)\, d\mathbf{x}\, dt.$$
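The decomposition can be checked numerically. The sketch below is an illustration, not part of the slides: the unregularized degree-3 polynomial model, the noise level 0.3, and the dataset count are arbitrary choices. It fits the same model to many datasets drawn around $h(x) = \sin(2\pi x)$ and estimates all three terms:

```python
import numpy as np

rng = np.random.default_rng(1)
N, n_datasets, noise_sd = 25, 1000, 0.3
x_grid = np.linspace(0, 1, 200)            # points where the model is evaluated
h = np.sin(2 * np.pi * x_grid)             # true regression function h(x)

preds = np.empty((n_datasets, x_grid.size))
for d in range(n_datasets):
    x = rng.uniform(0, 1, N)
    t = np.sin(2 * np.pi * x) + rng.normal(0, noise_sd, N)
    coef = np.polyfit(x, t, deg=3)         # y(x; D) for this dataset D
    preds[d] = np.polyval(coef, x_grid)

y_bar = preds.mean(axis=0)                 # E_D[y(x; D)]
bias2 = np.mean((y_bar - h) ** 2)          # (bias)^2, averaged over x
variance = np.mean(preds.var(axis=0))      # variance, averaged over x
noise = noise_sd ** 2                      # E[(h(x) - t)^2]

print("bias^2        :", bias2)
print("variance      :", variance)
print("noise         :", noise)
print("expected loss ~", bias2 + variance + noise)
```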

Bias measures how much the prediction (averaged over all data sets) differs from the desired regression function. Variance measures how much the predictions for individual data sets vary around their average. There is a trade-off between bias and variance: as we increase model complexity, bias decreases (a better fit to the data) and variance increases (the fit varies more with the data).

[Figure: bias and variance illustrated with the true function $f$, individual fits $g_i$, and their average $\bar{g}$.]

Reminder: Overfitting
Concepts: polynomial curve fitting, overfitting, regularization.

Polynomial Curve Fitting

Fit $y(x, \mathbf{w}) = \sum_{j=0}^{M} w_j x^j$ to the training points $(x_n, t_n)$.

Sum-of-Squares Error Function

$$E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N} \{y(x_n, \mathbf{w}) - t_n\}^2$$

0th Order Polynomial (M = 0)

1st Order Polynomial (M = 1)

3rd Order Polynomial (M = 3)

9th Order Polynomial (M = 9)

Over-fitting

Root-Mean-Square (RMS) error: $E_{RMS} = \sqrt{2E(\mathbf{w}^*)/N}$.
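The over-fitting behaviour is easy to reproduce numerically. Below is a sketch of the experiment (assumed setup: N = 10 training points from $\sin(2\pi x)$ plus Gaussian noise of standard deviation 0.3; the seed and test-set size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 10
x_train = np.linspace(0, 1, N)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, N)
x_test = rng.uniform(0, 1, 100)
t_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.3, 100)

def rms(w, x, t):
    # Root-mean-square residual, equivalent to E_RMS = sqrt(2 E(w*) / N).
    return np.sqrt(np.mean((np.polyval(w, x) - t) ** 2))

for M in (0, 1, 3, 9):   # numpy may warn about conditioning at M = 9
    w = np.polyfit(x_train, t_train, deg=M)
    print(f"M = {M}: train RMS = {rms(w, x_train, t_train):.3f}, "
          f"test RMS = {rms(w, x_test, t_test):.3f}")
# Training error falls monotonically with M; test error rises again at M = 9,
# where the 10 points are interpolated exactly.
```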

Polynomial Coefficients

Large weights result in large changes in $y(x)$ for small changes in $x$.

Regularization

One way to control complexity is to penalize complex models: regularization.

Regularization: penalize large coefficient values by minimizing

$$\widetilde{E}(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\{y(x_n, \mathbf{w}) - t_n\}^2 + \frac{\lambda}{2}\,\|\mathbf{w}\|^2.$$
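Minimizing this penalized error has the closed-form solution $\mathbf{w} = (\Phi^\top\Phi + \lambda I)^{-1}\Phi^\top\mathbf{t}$, where $\Phi$ is the design matrix. A minimal sketch for a 9th-order polynomial basis (the data and $\lambda$ values are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 10, 9
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, N)

Phi = np.vander(x, M + 1, increasing=True)  # design matrix [1, x, ..., x^M]

for lam in (0.0, np.exp(-18), 1.0):         # lambda = 0 is the unregularized fit
    w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ t)
    print(f"lambda = {lam:.3g}: max |w_j| = {np.abs(w).max():.3g}")
# With lambda = 0 the weights blow up; even a tiny lambda tames them.
```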

[Figures: the 9th-order polynomial fit for two settings of the regularization constant $\ln\lambda$, and training/test $E_{RMS}$ plotted against $\ln\lambda$.]

Polynomial Coefficients

Small $\lambda$ = no regularization; large $\lambda$ = too much regularization.

The Bias-Variance Decomposition (5)

Example: 100 data sets, each with 25 data points sampled from the sinusoid $h(x) = \sin(2\pi x)$, varying the degree of regularization $\lambda$.
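A sketch of this experiment follows, with two assumptions worth flagging: PRML runs it with 24 Gaussian basis functions, while this sketch substitutes a 9th-order polynomial basis for brevity, so the numbers differ from the book's figure; and of the $\ln\lambda$ values used, $-0.31$ and $-2.4$ appear in the following slides while $2.6$ is the over-regularized case from PRML Figure 3.5.

```python
import numpy as np

rng = np.random.default_rng(4)
L, N, M = 100, 25, 9                      # 100 datasets of 25 points; degree-9 basis
x_grid = np.linspace(0, 1, 100)
h = np.sin(2 * np.pi * x_grid)
Phi_grid = np.vander(x_grid, M + 1, increasing=True)

for ln_lam in (2.6, -0.31, -2.4):
    lam = np.exp(ln_lam)
    preds = np.empty((L, x_grid.size))
    for d in range(L):
        x = rng.uniform(0, 1, N)
        t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, N)
        Phi = np.vander(x, M + 1, increasing=True)
        w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ t)
        preds[d] = Phi_grid @ w            # y(x; D) evaluated on the grid
    bias2 = np.mean((preds.mean(axis=0) - h) ** 2)
    var = np.mean(preds.var(axis=0))
    print(f"ln lambda = {ln_lam:5.2f}: bias^2 = {bias2:.4f}, "
          f"variance = {var:.4f}, sum = {bias2 + var:.4f}")
# Large lambda: high bias, low variance. Small lambda: low bias, high variance.
```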

The Bias-Variance Decomposition (6)

Regularization constant $\lambda = \exp\{-0.31\}$.

The Bias-Variance Decomposition (7)

Regularization constant $\lambda = \exp\{-2.4\}$.

The Bias-Variance Trade-off

From these plots, we note that an over-regularized model (large $\lambda$) will have a high bias, while an under-regularized model (small $\lambda$) will have a high variance. The minimum of $(\text{bias})^2 + \text{variance}$ occurs around $\ln\lambda = -0.31$, which is close to the value that gives the minimum error on the test data.

[Figure: bias and variance of the fits $g$ relative to the true function $f$ as the regularization constant varies.]

Model Selection Procedures

1. Cross-validation: measure the total error, rather than bias/variance, on a validation set.
   - Train/validation split
   - K-fold cross-validation
   - Leave-one-out
   - Makes no prior assumption about the models

Model Selection Procedures (continued)

2. Regularization (Breiman 1998): penalize the augmented error: error on data + $\lambda \cdot$ model complexity. If $\lambda$ is too large we risk introducing bias, so use cross-validation to optimize $\lambda$ (see the sketch after this list).
3. Structural Risk Minimization (Vapnik 1995): use a set of models ordered in terms of their complexities (e.g., number of free parameters, VC dimension), and find the best model w.r.t. empirical error and model complexity.
4. Minimum Description Length principle.
5. Bayesian model selection: if we have prior knowledge about the approximating function, it can be incorporated into the Bayesian approach in the form of a prior p(model).
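Combining procedures 1 and 2, here is a minimal sketch of selecting $\lambda$ by K-fold cross-validation. All settings, the degree-9 polynomial basis, K = 5, and the candidate $\ln\lambda$ grid, are illustrative assumptions, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(5)
N, M, K = 30, 9, 5
x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, N)

def fit(x, t, lam):
    # Regularized least squares: w = (Phi^T Phi + lambda I)^(-1) Phi^T t.
    Phi = np.vander(x, M + 1, increasing=True)
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ t)

def mse(w, x, t):
    return np.mean((np.vander(x, M + 1, increasing=True) @ w - t) ** 2)

folds = np.array_split(rng.permutation(N), K)   # K disjoint validation folds
for ln_lam in (-18, -8, -2.4, -0.31, 2.6):
    lam = np.exp(ln_lam)
    errs = []
    for k in range(K):
        val = folds[k]
        trn = np.hstack([folds[j] for j in range(K) if j != k])
        errs.append(mse(fit(x[trn], t[trn], lam), x[val], t[val]))
    print(f"ln lambda = {ln_lam:6.2f}: CV error = {np.mean(errs):.4f}")
# Choose the lambda with the lowest cross-validation error.
```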