Model assessment and cross-validation - overview

Presentation transcript:

Model assessment and cross-validation - overview
Bias-variance decomposition and tradeoff
Analytical validation methods (AIC, BIC)
Resampling methods (cross-validation, bootstrap methods)
Data Mining and Statistical Learning, 2008

Bias, variance and model complexity

Bias-variance decomposition and tradeoff
Expected prediction error at an input point $x_0$ under squared-error loss:
$\mathrm{Err}(x_0) = E\bigl[(Y - \hat f(x_0))^2 \mid X = x_0\bigr] = \sigma_\varepsilon^2 + \bigl[E\hat f(x_0) - f(x_0)\bigr]^2 + E\bigl[\hat f(x_0) - E\hat f(x_0)\bigr]^2 = \sigma_\varepsilon^2 + \mathrm{Bias}^2(\hat f(x_0)) + \mathrm{Var}(\hat f(x_0))$
We would like to minimize Err.
Increasing the complexity of a model typically increases its variance and decreases its bias.
Example: smoothing based on nearest neighbours (see the worked case below).
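As a concrete illustration (the standard k-nearest-neighbour case, not spelled out on the slide), assume $Y = f(X) + \varepsilon$ with $\mathrm{Var}(\varepsilon) = \sigma_\varepsilon^2$ and let $x_{(1)}, \ldots, x_{(k)}$ denote the k training inputs nearest to $x_0$; then
$\mathrm{Err}(x_0) = \sigma_\varepsilon^2 + \Bigl[f(x_0) - \tfrac{1}{k}\sum_{\ell=1}^{k} f(x_{(\ell)})\Bigr]^2 + \frac{\sigma_\varepsilon^2}{k}$
Small k gives low bias but high variance ($\sigma_\varepsilon^2/k$ is large); large k averages over more distant points, lowering the variance but increasing the squared bias.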

Loss function, training error, and test error - quantitative response (regression)
Loss functions (examples): squared error $L(Y, \hat f(X)) = (Y - \hat f(X))^2$ and absolute error $L(Y, \hat f(X)) = |Y - \hat f(X)|$
Training error: $\overline{\mathrm{err}} = \frac{1}{N}\sum_{i=1}^{N} L(y_i, \hat f(x_i))$, the average loss over the training sample
Test error: $\mathrm{Err} = E\bigl[L(Y, \hat f(X))\bigr]$, where the expectation is taken over the joint distribution of (X, Y)

Loss function, training error, and test error - qualitative response (classification)
Model: $G \in \{1, 2, \ldots, K\}$, $p_k(X) = \Pr(G = k \mid X)$
Decision rule: $\hat G(X) = \arg\max_k \hat p_k(X)$
Examples of loss functions:
0-1 loss: $L(G, \hat G(X)) = I(G \neq \hat G(X))$
Cross-entropy loss (log-likelihood): $L(G, \hat p(X)) = -2\sum_{k=1}^{K} I(G = k)\log \hat p_k(X) = -2\log \hat p_G(X)$

Training error and test error - qualitative response (classification)
Training error with 0-1 loss (misclassification rate): $\overline{\mathrm{err}} = \frac{1}{N}\sum_{i=1}^{N} I(g_i \neq \hat G(x_i))$
Training error with cross-entropy loss: $\overline{\mathrm{err}} = -\frac{2}{N}\sum_{i=1}^{N} \log \hat p_{g_i}(x_i)$
Test error: $\mathrm{Err} = E\bigl[L(G, \hat G(X))\bigr]$, the expected loss over an independent test sample; a small numerical sketch follows below.
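A minimal numerical sketch of the two classification training errors, assuming a small array of true labels and a matrix of predicted class probabilities (the data below are illustrative, not from the slides):

```python
import numpy as np

# Illustrative data: N = 4 observations, K = 3 classes.
g = np.array([0, 2, 1, 2])                      # true class labels g_i
p_hat = np.array([[0.7, 0.2, 0.1],              # predicted probabilities p_hat_k(x_i)
                  [0.1, 0.3, 0.6],
                  [0.2, 0.5, 0.3],
                  [0.3, 0.3, 0.4]])

G_hat = p_hat.argmax(axis=1)                    # decision rule: most probable class

# Training error under 0-1 loss (misclassification rate).
err_01 = np.mean(G_hat != g)

# Training error under cross-entropy loss: -2/N * sum_i log p_hat_{g_i}(x_i).
err_ce = -2.0 * np.mean(np.log(p_hat[np.arange(len(g)), g]))

print(f"misclassification rate: {err_01:.3f}")
print(f"cross-entropy loss:     {err_ce:.3f}")
```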

Model selection
Assume that the model $f_\alpha(x)$ depends on a tuning parameter (model complexity parameter) $\alpha$.
Examples of $\alpha$: the number of predictors (multiple regression), the degree of a polynomial (polynomial regression), the penalty factor (smoothing splines, ridge regression), the window width (kernels).
The aim of model selection: to find the model with minimum test error.

Model selection and assessment
Model selection: estimate the performance of different models in order to choose the best one.
Model assessment: having chosen the final model, estimate its test error on new data.

Model selection and assessment in data-rich applications
Split the data into three parts:
Training set (to produce a fit, approx. 50%)
Validation set (for model selection, approx. 25%)
Test set (for model assessment, approx. 25%)
Example (splines):
1. Fit splines to the training set using smoothing factors λ1, …, λn.
2. Using the fitted splines f1*(x), …, fn*(x), estimate the prediction error on the validation set and choose the model #i with the smallest error.
3. Estimate the generalization error of model #i using the test set (see the sketch below).
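A minimal sketch of this three-way split, using a hand-rolled ridge penalty on a polynomial basis as a stand-in for the spline smoothing factors (the data, split fractions, and candidate penalties are illustrative assumptions, not taken from the course):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: y = sin(2x) + noise.
N = 200
x = rng.uniform(0, 3, N)
y = np.sin(2 * x) + rng.normal(scale=0.3, size=N)

# 50% / 25% / 25% split into training, validation and test sets.
idx = rng.permutation(N)
train, valid, test = idx[:100], idx[100:150], idx[150:]

def fit_ridge(x, y, lam, degree=8):
    """Fit a fixed-degree polynomial with a ridge penalty lam on the coefficients."""
    X = np.vander(x, degree + 1)
    beta = np.linalg.solve(X.T @ X + lam * np.eye(degree + 1), X.T @ y)
    return lambda x_new: np.vander(x_new, degree + 1) @ beta

# Model selection: choose the penalty with the smallest validation error.
lambdas = [1e-4, 1e-2, 1, 10]
fits = [fit_ridge(x[train], y[train], lam) for lam in lambdas]
val_err = [np.mean((y[valid] - f(x[valid])) ** 2) for f in fits]
best = int(np.argmin(val_err))

# Model assessment: estimate the generalization error of the chosen model on the test set.
test_err = np.mean((y[test] - fits[best](x[test])) ** 2)
print(f"chosen lambda = {lambdas[best]}, estimated test error = {test_err:.3f}")
```

The test set is touched only once, after the model has been chosen, so the final error estimate is not biased by the selection step.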

Model selection and assessment in applications with insufficient data
Analytical expressions to select and assess models:
Cp (correction for the number of inputs or basis functions)
AIC (Akaike's information criterion)
BIC (Bayesian information criterion)
Resampling methods:
Cross-validation
The bootstrap

Analytical validation methods
Background: a model typically overfits the training data, so the prediction error will on average be higher than the training error.
Terminology: the difference between the in-sample prediction error and the average training error is called the optimism.
Basic idea: find an analytical expression for the optimism and use it to correct the training error.

Optimism
Training error: $\overline{\mathrm{err}} = \frac{1}{N}\sum_{i=1}^{N} L(y_i, \hat f(x_i))$
In-sample error (new responses observed at the original training inputs $x_i$): $\mathrm{Err}_{\mathrm{in}} = \frac{1}{N}\sum_{i=1}^{N} E_{Y^0}\bigl[L(Y_i^0, \hat f(x_i)) \mid \mathcal{T}\bigr]$
Optimism: $\mathrm{op} = \mathrm{Err}_{\mathrm{in}} - \overline{\mathrm{err}}$, with average optimism $\omega = E_y(\mathrm{op})$

Analytical expressions for the optimism
1. For squared error, 0-1 loss, and some other loss functions: $\omega = \frac{2}{N}\sum_{i=1}^{N} \mathrm{Cov}(\hat y_i, y_i)$
2. For linear models with d inputs or basis functions fitted under squared error: $\sum_{i=1}^{N} \mathrm{Cov}(\hat y_i, y_i) = d\,\sigma_\varepsilon^2$, so $\omega = 2\,\frac{d}{N}\,\sigma_\varepsilon^2$

Cp scores
Basic idea: estimate the in-sample error as the training error plus the optimism $2\,\frac{d}{N}\,\hat\sigma_\varepsilon^2$ when d parameters are fitted under squared loss (here $\hat\sigma_\varepsilon^2$ is an estimate of the noise variance, usually obtained from a low-bias model).
Compute $C_p = \overline{\mathrm{err}} + 2\,\frac{d}{N}\,\hat\sigma_\varepsilon^2$ and choose the model with the smallest Cp score (see the sketch below).
Properties: the penalty increases with increasing model complexity (d) and decreases as the training sample size N increases.
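A minimal sketch of Cp-based selection of the number of predictors in a linear model; the simulated data and the choice to estimate $\hat\sigma_\varepsilon^2$ from the largest (low-bias) model are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: only the first 3 of 10 candidate predictors matter.
N, p = 100, 10
X = rng.normal(size=(N, p))
y = X[:, 0] + 0.5 * X[:, 1] - 0.5 * X[:, 2] + rng.normal(scale=1.0, size=N)

def training_error(X_sub, y):
    """Average squared residual of an OLS fit (intercept included)."""
    Xd = np.column_stack([np.ones(len(y)), X_sub])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return np.mean((y - Xd @ beta) ** 2)

# Noise-variance estimate from the full model: RSS / (N - number of parameters).
d_full = p + 1
sigma2_hat = training_error(X, y) * N / (N - d_full)

# Cp for nested models using the first d predictors, d = 0, ..., p.
for d in range(p + 1):
    err_bar = training_error(X[:, :d], y)
    cp = err_bar + 2 * (d + 1) / N * sigma2_hat   # d + 1 fitted parameters (incl. intercept)
    print(f"predictors: {d:2d}   training error: {err_bar:.3f}   Cp: {cp:.3f}")
```

The raw training error keeps decreasing as predictors are added, while Cp typically bottoms out near the true model size.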

Akaike's information criterion (AIC)
When the log-likelihood is maximized it holds asymptotically (as $N \to \infty$) that $-2\,E\bigl[\log \Pr_{\hat\theta}(Y)\bigr] \approx -\frac{2}{N}\,E[\mathrm{loglik}] + 2\,\frac{d}{N}$, where $\mathrm{loglik} = \sum_{i=1}^{N} \log \Pr_{\hat\theta}(y_i)$.
Given a tuning parameter $\alpha$, we set $\mathrm{AIC}(\alpha) = \overline{\mathrm{err}}(\alpha) + 2\,\frac{d(\alpha)}{N}\,\hat\sigma_\varepsilon^2$, where $d(\alpha)$ is the effective number of parameters.
For Gaussian models (with the variance assumed known), AIC is equivalent to Cp.

Effective number of parameters (d()) For linear smoothers: Examples: Simple linear regression (#exact param), ridge regression Smoothing splines Kernel smoothers Define the effective number of parameters as Data Mining and statistical learning 2008

Bayesian information criterion (BIC)
Based on a Bayesian argument we set $\mathrm{BIC} = -2\,\mathrm{loglik} + (\log N)\,d$
For Gaussian models (with known noise variance $\sigma_\varepsilon^2$): $\mathrm{BIC} = \frac{N}{\sigma_\varepsilon^2}\Bigl[\overline{\mathrm{err}} + (\log N)\,\frac{d}{N}\,\sigma_\varepsilon^2\Bigr]$
Properties: BIC is proportional to AIC with "2" replaced by log(N). Since log(N) > 2 for N > e² ≈ 7.4, BIC penalizes complex models more heavily than AIC.

Features of AIC and BIC
For large samples (asymptotic property):
BIC chooses the right model (if it is present among the alternatives)
AIC tends to choose models that are too complex
For small samples:
BIC tends to choose models that are too simple
AIC is OK

Resampling methods: cross-validation
K-fold cross-validation (rough scheme):
1. Divide the data set into K roughly equally-sized subsets.
2. Remove subset #i and fit the model using the remaining data.
3. Predict the function values for subset #i using the fitted model.
4. Repeat steps 2-3 for each i = 1, …, K.
5. CV = the average squared difference between observed and predicted values (another loss function is possible); see the sketch below.
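A minimal from-scratch sketch of K-fold cross-validation for a simple polynomial fit; the data, K = 5, and the least-squares fitter are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative data: y = sin(2x) + noise.
N = 120
x = rng.uniform(0, 3, N)
y = np.sin(2 * x) + rng.normal(scale=0.3, size=N)

def kfold_cv(x, y, degree, K=5, seed=0):
    """K-fold CV estimate of the squared prediction error of a degree-d polynomial fit."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, K)                       # K roughly equal subsets
    sq_errors = []
    for i in range(K):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(K) if j != i])
        coefs = np.polyfit(x[train], y[train], degree)   # fit on the remaining data
        pred = np.polyval(coefs, x[test])                # predict the held-out subset
        sq_errors.append((y[test] - pred) ** 2)
    return np.mean(np.concatenate(sq_errors))

# Choose the polynomial degree with the smallest CV score.
for degree in range(1, 9):
    print(f"degree {degree}: CV = {kfold_cv(x, y, degree):.4f}")
```

The degree with the smallest CV score is the one whose estimated test error is lowest, which is exactly the model-selection rule on the following slides.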

Resampling methods: cross-validation
K-fold cross-validation: $\mathrm{CV}(\hat f) = \frac{1}{N}\sum_{i=1}^{N} L\bigl(y_i, \hat f^{-\kappa(i)}(x_i)\bigr)$, where $\kappa(i)$ is the fold containing observation i and $\hat f^{-k}$ is the model fitted with fold k removed.
Note: if K = N, the method is leave-one-out cross-validation.

Model selection using cross-validation
Given a model $f(x; \alpha)$ depending on a tuning (complexity) parameter $\alpha$, compute $\mathrm{CV}(\hat f, \alpha) = \frac{1}{N}\sum_{i=1}^{N} L\bigl(y_i, \hat f^{-\kappa(i)}(x_i; \alpha)\bigr)$ and choose the value $\hat\alpha$ with the smallest CV score.

Prediction error and cross-validation curve estimated from a single training set (figure)

Generalized cross-validation (GCV)
Basic idea: approximate leave-one-out CV to make it cheaper to compute.
Used for linear smoothers $\hat{\mathbf{y}} = \mathbf{S}\,\mathbf{y}$: $\mathrm{GCV}(\hat f) = \frac{1}{N}\sum_{i=1}^{N}\Bigl[\frac{y_i - \hat f(x_i)}{1 - \mathrm{trace}(\mathbf{S})/N}\Bigr]^2$, where exact leave-one-out CV would use the individual diagonal elements $S_{ii}$ instead of $\mathrm{trace}(\mathbf{S})/N$; a short sketch follows below.
Note: in smoothing problems, GCV behaves similarly to AIC.
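A minimal sketch of GCV for ridge regression, reusing the smoother-matrix form from the effective-parameters slide; the simulated data and penalty grid are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
N, p = 80, 15
X = rng.normal(size=(N, p))
beta_true = np.concatenate([np.ones(3), np.zeros(p - 3)])
y = X @ beta_true + rng.normal(scale=1.0, size=N)

def gcv_ridge(X, y, lam):
    """GCV score of a ridge fit: mean of ((y - y_hat) / (1 - trace(S)/N))^2."""
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T)
    y_hat = S @ y
    denom = 1.0 - np.trace(S) / len(y)
    return np.mean(((y - y_hat) / denom) ** 2)

for lam in [0.01, 0.1, 1.0, 10.0, 100.0]:
    print(f"lambda = {lam:6.2f}   GCV = {gcv_ridge(X, y, lam):.4f}")
```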

The bootstrap
Why do we need the bootstrap? To estimate the uncertainty of parameter estimates.
Example: given a sample X1, …, Xn from an unknown distribution, we estimate its mean (expectation) by the sample mean. How do we quantify the uncertainty of the sample mean?

Implementation of the crude nonparametric bootstrap
1. Sample with replacement from the observed data to obtain a bootstrap sample.
2. Repeat step 1 B times.
3. Compute the parameter estimate on each bootstrap sample.
4. Compute the variance of the estimates from step 3 (a sketch follows below).
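A minimal sketch of this scheme for the sample-mean example from the previous slide; the data, B = 1000, and the random seed are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative sample X1, ..., Xn from an "unknown" distribution.
x = rng.exponential(scale=2.0, size=40)
n = len(x)

B = 1000
boot_means = np.empty(B)
for b in range(B):
    # Step 1: sample n observations with replacement from the observed data.
    resample = rng.choice(x, size=n, replace=True)
    # Step 3: compute the parameter estimate (here, the sample mean) on the resample.
    boot_means[b] = resample.mean()

# Step 4: bootstrap estimate of the standard error of the sample mean.
print(f"sample mean:              {x.mean():.3f}")
print(f"bootstrap standard error: {boot_means.std(ddof=1):.3f}")
print(f"classical approx s/sqrt(n): {x.std(ddof=1) / np.sqrt(n):.3f}")
```

For the sample mean the classical formula s/√n is available as a check; the point of the bootstrap is that the same recipe works for estimators with no simple variance formula.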

Bootstrap replicates (figure)

Benefits and drawbacks of the nonparametric bootstrap
The uncertainty of an estimate can be obtained without any information about the underlying distribution.
It is not applicable to small data sets.
If the distributional form is known (up to some unknown parameters), the bootstrap may be slightly less efficient than conventional parametric methods.

The bootstrap and estimation of prediction errors
Fit the model to each bootstrap sample (role = training set) and examine how well it predicts the original training set (role = prediction set):
$\widehat{\mathrm{Err}}_{\mathrm{boot}} = \frac{1}{B}\,\frac{1}{N}\sum_{b=1}^{B}\sum_{i=1}^{N} L\bigl(y_i, \hat f^{*b}(x_i)\bigr)$
Because the bootstrap samples overlap the observations they are asked to predict, this estimate tends to be optimistic; the leave-one-out bootstrap instead predicts each observation only from bootstrap samples that do not contain it.

An improved bootstrap estimator of the prediction error
The leave-one-out bootstrap behaves roughly like two-fold CV (each model is trained on about 63.2% of the distinct observations), so it produces upward-biased estimates of the expected prediction error.
Solution: the .632 estimator, $\widehat{\mathrm{Err}}^{(.632)} = 0.368\,\overline{\mathrm{err}} + 0.632\,\widehat{\mathrm{Err}}^{(1)}$, which pulls the bias down by mixing in the (optimistic) training error; a sketch follows below.
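A minimal sketch of the leave-one-out bootstrap and the .632 estimator for a polynomial fit; the data, B = 200, and the fixed degree are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)

# Illustrative data and a fixed model: a degree-3 polynomial fit.
N, degree, B = 80, 3, 200
x = rng.uniform(0, 3, N)
y = np.sin(2 * x) + rng.normal(scale=0.3, size=N)

# Ordinary training error of the model fitted to all data.
coefs_all = np.polyfit(x, y, degree)
err_train = np.mean((y - np.polyval(coefs_all, x)) ** 2)

# Leave-one-out bootstrap: each observation is predicted only by models
# fitted to bootstrap samples that do not contain it.
loo_errors = [[] for _ in range(N)]
for b in range(B):
    idx = rng.integers(0, N, size=N)            # bootstrap sample (with replacement)
    coefs = np.polyfit(x[idx], y[idx], degree)
    out = np.setdiff1d(np.arange(N), idx)       # observations left out of this sample
    for i in out:
        loo_errors[i].append((y[i] - np.polyval(coefs, x[i])) ** 2)

err_loo = np.mean([np.mean(e) for e in loo_errors if e])

# .632 estimator: a weighted mixture of training error and leave-one-out bootstrap error.
err_632 = 0.368 * err_train + 0.632 * err_loo
print(f"training error:          {err_train:.4f}")
print(f"leave-one-out bootstrap: {err_loo:.4f}")
print(f".632 estimator:          {err_632:.4f}")
```

The training error is optimistic and the leave-one-out bootstrap error is pessimistic; the .632 weighting sits between the two.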