Bias and Variance of the Estimator PRML 3.2 Ethem Chp. 4.

In previous lectures we showed how to build classifiers when the underlying densities are known; Bayesian Decision Theory introduced the general formulation. In most situations, however, the true distributions are unknown and must be estimated from data.

- Parameter estimation (we saw the Maximum Likelihood method): assume a particular form for the density (e.g. Gaussian), so only its parameters (e.g. mean and variance) need to be estimated.
  - Maximum Likelihood
  - Bayesian Estimation
- Non-parametric density estimation (not covered): assume NO knowledge about the density.
  - Kernel Density Estimation
  - Nearest Neighbor Rule
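As a concrete illustration of the parametric route, here is a minimal sketch (using NumPy; the sample size and the true parameter values are arbitrary choices) of fitting a Gaussian by maximum likelihood — the ML estimates are simply the sample mean and the 1/N sample variance:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=1000)  # true mu = 2.0, sigma = 1.5

# ML estimates for a Gaussian: sample mean and the (1/N) sample variance
mu_ml = data.mean()
var_ml = ((data - mu_ml) ** 2).mean()
print(mu_ml, var_ml)  # close to 2.0 and 1.5**2 = 2.25
```

With 1000 samples the estimates land close to the true values; the bias of the 1/N variance estimator only becomes noticeable for small N, as discussed below.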


How Good is an Estimator?
Assume our dataset X is sampled from a population specified up to a parameter θ. How good is an estimator d(X) as an estimate for θ? Notice that the estimate depends on the sample set X.

If we take the expectation of the squared error over different datasets X, E[(d(X) − θ)²], and expand using E[d] ≡ E[d(X)], we get:

E[(d(X) − θ)²] = E[(d(X) − E[d])²] + (E[d] − θ)²
                 (variance)           (bias squared of the estimator)

Using a simpler notation (dropping the dependence on X from the notation, but knowing it exists):

E[(d − θ)²] = E[(d − E[d])²] + (E[d] − θ)²
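This decomposition can be checked numerically. Below is a sketch (the true parameter, the sample size, and the deliberately shrunk estimator d(X) = 0.9·mean are arbitrary illustrative choices): computed with sample moments, the Monte Carlo mean squared error splits exactly into variance plus squared bias:

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 5.0              # true parameter
N, trials = 20, 20000    # sample size, number of simulated datasets X

# Estimator d(X): a shrunk sample mean, chosen so that it has nonzero bias
samples = rng.normal(theta, 2.0, size=(trials, N))
d = 0.9 * samples.mean(axis=1)

mse = ((d - theta) ** 2).mean()        # E[(d - theta)^2]
variance = d.var()                     # E[(d - E[d])^2]
bias_sq = (d.mean() - theta) ** 2      # (E[d] - theta)^2
print(mse, variance + bias_sq)         # the two agree to rounding error
```

The agreement is exact (up to floating point) because the identity holds for sample moments just as it does for true expectations.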

Properties of the ML estimates of μ and σ²
The ML estimate of the mean, m = (1/N) Σ_t x^t, is an unbiased estimator: E[m] = μ.
The ML estimate of the variance, s² = (1/N) Σ_t (x^t − m)², is biased: E[s²] = ((N−1)/N) σ².
Use the unbiased estimator instead: s̃² = (1/(N−1)) Σ_t (x^t − m)².
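A quick simulation makes the bias visible (the choices N = 5 and σ² = 4 are arbitrary; small N exaggerates the effect): the 1/N estimate undershoots σ² by the factor (N−1)/N, while the 1/(N−1) version is correct on average:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2, N, trials = 4.0, 5, 50000

x = rng.normal(0.0, np.sqrt(sigma2), size=(trials, N))
m = x.mean(axis=1, keepdims=True)

s2_ml = ((x - m) ** 2).mean(axis=1)             # 1/N normalizer (ML, biased)
s2_unb = ((x - m) ** 2).sum(axis=1) / (N - 1)   # 1/(N-1) normalizer (unbiased)

print(s2_ml.mean())   # approx (N-1)/N * sigma2 = 3.2
print(s2_unb.mean())  # approx sigma2 = 4.0
```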

Bias-Variance Decomposition

The Bias-Variance Decomposition (1)
Recall the expected squared loss,

E[L] = ∫ {y(x) − h(x)}² p(x) dx + ∫∫ {h(x) − t}² p(x, t) dx dt,

where

h(x) = E[t | x] = ∫ t p(t | x) dt.

We said that the second term corresponds to the noise inherent in the random variable t. What about the first term?

The Bias-Variance Decomposition (2)
Suppose we were given multiple data sets, each of size N. Any particular data set D will give a particular function y(x; D). Consider the error in the estimation, adding and subtracting E_D[y(x; D)]:

{y(x; D) − h(x)}²
  = {y(x; D) − E_D[y(x; D)] + E_D[y(x; D)] − h(x)}²
  = {y(x; D) − E_D[y(x; D)]}² + {E_D[y(x; D)] − h(x)}²
    + 2 {y(x; D) − E_D[y(x; D)]} {E_D[y(x; D)] − h(x)}

The Bias-Variance Decomposition (3)
Taking the expectation over D yields (the cross-term vanishes):

E_D[{y(x; D) − h(x)}²] = {E_D[y(x; D)] − h(x)}² + E_D[{y(x; D) − E_D[y(x; D)]}²]
                          (bias)²                  (variance)

The Bias-Variance Decomposition (4)
Thus we can write

expected loss = (bias)² + variance + noise,

where

(bias)² = ∫ {E_D[y(x; D)] − h(x)}² p(x) dx
variance = ∫ E_D[{y(x; D) − E_D[y(x; D)]}²] p(x) dx
noise = ∫∫ {h(x) − t}² p(x, t) dx dt

Bias measures how much the prediction (averaged over all data sets) differs from the desired regression function. Variance measures how much the predictions for individual data sets vary around their average. There is a trade-off between bias and variance: as we increase model complexity, bias decreases (a better fit to the data) and variance increases (the fit varies more with the data).

[Figure: bias and variance illustrated — individual fits g_i scatter around their average ḡ, which in turn differs from the true function f.]

Model Selection Procedures
1. Regularization (Breiman 1998): penalize the augmented error: error on data + λ · model complexity.
   - If λ is too large, we risk introducing bias.
   - Use cross-validation to optimize for λ.
2. Structural Risk Minimization (Vapnik 1995): use a set of models ordered in terms of their complexities (number of free parameters, VC dimension, …) and find the best model w.r.t. empirical error and model complexity.
3. Minimum Description Length principle.
4. Bayesian model selection: if we have some prior knowledge about the approximating function, it can be incorporated into the Bayesian approach in the form of p(model).

Reminder: Introduction to Overfitting (PRML 1.1)
Concepts: polynomial curve fitting, overfitting, regularization, training set size vs. model complexity.

Polynomial Curve Fitting

Sum-of-Squares Error Function

E(w) = (1/2) Σ_{n=1}^{N} {y(x_n, w) − t_n}²

0th Order Polynomial

1st Order Polynomial

3rd Order Polynomial

9th Order Polynomial

Over-fitting
Root-Mean-Square (RMS) error:

E_RMS = √(2 E(w*) / N)
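The E_RMS picture can be reproduced in a few lines (the training-set size, noise level, and chosen degrees are illustrative): training error keeps falling as the order M grows, while test error eventually rises once the model starts fitting noise:

```python
import numpy as np

rng = np.random.default_rng(4)

# 10 noisy training points from sin(2*pi*x), plus a larger held-out test set
x_train = np.linspace(0, 1, 10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 10)
x_test = np.linspace(0, 1, 100)
t_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, 100)

def rms(w, x, t):
    return np.sqrt(((np.polyval(w, x) - t) ** 2).mean())

results = {}
for M in (0, 1, 3, 9):
    w = np.polyfit(x_train, t_train, M)
    results[M] = (rms(w, x_train, t_train), rms(w, x_test, t_test))
    print(M, results[M])   # (train RMS, test RMS)
```

With 10 points, the degree-9 polynomial interpolates the training data (train RMS near zero) yet its test RMS blows up — exactly the over-fitting shown on the slide.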

Polynomial Coefficients

Regularization
One way to control complexity is to penalize complex models: regularization.

Regularization
Penalize large coefficient values:

Ẽ(w) = (1/2) Σ_{n=1}^{N} {y(x_n, w) − t_n}² + (λ/2) ‖w‖²
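A minimal sketch of the penalized fit (the degree, the λ value, and the data are illustrative; the unpenalized fit uses plain least squares for comparison): the ridge solution has a closed form and shrinks the coefficients that would otherwise blow up:

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 10)

M = 9
Phi = np.vander(x, M + 1)   # degree-9 polynomial design matrix

# Unregularized least squares: interpolates the noise, huge coefficients
w_free = np.linalg.lstsq(Phi, t, rcond=None)[0]

# Penalized fit: minimize ||Phi w - t||^2 + lam * ||w||^2, closed form
lam = 1e-3
w_reg = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ t)

print(np.linalg.norm(w_free), np.linalg.norm(w_reg))  # penalty shrinks ||w||
```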

Regularization: small λ – regularization has little effect

Regularization: large λ – regularization dominates

Regularization: E_RMS vs. ln λ

Polynomial Coefficients

The Bias-Variance Decomposition (5)
Example: 100 data sets, each with 25 data points from the sinusoidal h(x) = sin(2πx), varying the regularization constant λ.

The Bias-Variance Decomposition (6)
Regularization constant λ = exp{−0.31}.

The Bias-Variance Decomposition (7)
Regularization constant λ = exp{−2.4}.

The Bias-Variance Trade-off
From these plots we note that an over-regularized model (large λ) will have a high bias, while an under-regularized model (small λ) will have a high variance. The minimum value of (bias)² + variance is around ln λ = −0.31. This is close to the value that gives the minimum error on the test data.
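The experiment behind these plots can be sketched as follows (a degree-9 polynomial basis stands in for the basis functions used in the figures, and the two λ values are illustrative endpoints: a very small λ for the under-regularized regime and a large one for the over-regularized regime):

```python
import numpy as np

rng = np.random.default_rng(6)
n_sets, n_pts = 100, 25
x_grid = np.linspace(0, 1, 50)
h = np.sin(2 * np.pi * x_grid)   # true function, 100 data sets of 25 points
M = 9

def bias_var(lam):
    """Ridge-fit n_sets data sets; return (bias^2, variance) over x_grid."""
    preds = []
    for _ in range(n_sets):
        x = rng.uniform(0, 1, n_pts)
        t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n_pts)
        Phi = np.vander(x, M + 1)
        w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ t)
        preds.append(np.polyval(w, x_grid))
    preds = np.asarray(preds)
    return ((preds.mean(0) - h) ** 2).mean(), preds.var(0).mean()

b_lo, v_lo = bias_var(np.exp(-8.0))   # under-regularized: high variance
b_hi, v_hi = bias_var(np.exp(2.6))    # over-regularized: high bias
print(b_lo, v_lo)
print(b_hi, v_hi)
```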

[Figure: (bias)² and variance as functions of the regularization constant λ — as λ increases, variance falls and (bias)² rises; their sum is minimized at an intermediate λ.]

Model Selection Procedures
Cross-validation: measure the total error, rather than bias/variance, on a validation set.
- Train/validation sets
- K-fold cross-validation
- Leave-one-out
- No prior assumption about the models
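A k-fold cross-validation sketch (the data set, the fold count k = 5, and the candidate polynomial degrees are illustrative choices): each model is scored by its average error on held-out folds, and the lowest-scoring model is selected:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0, 1, 40)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 40)

def kfold_mse(degree, k=5):
    """Average validation MSE of a degree-`degree` polynomial over k folds."""
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        val = folds[i]
        trn = np.concatenate([folds[j] for j in range(k) if j != i])
        w = np.polyfit(x[trn], t[trn], degree)
        errs.append(((np.polyval(w, x[val]) - t[val]) ** 2).mean())
    return float(np.mean(errs))

scores = {M: kfold_mse(M) for M in (1, 3, 9)}
best = min(scores, key=scores.get)   # model with lowest validation error
print(scores, best)
```

Note that no assumption about the models is needed: cross-validation compares them purely through their validation error.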