LECTURE 11: LINEAR MODEL SELECTION PT. 1 March 2, 2016 SDS 293 Machine Learning

Announcements 1/2 Labs: please turn them in before spring break!

Announcements 2/2 April 1-3 at UMass. Several teams still looking for members! Interested? Contact Jordan

Outline
Model selection: alternatives to least-squares
Subset selection
- Best subset
- Stepwise selection (forward and backward)
- Estimating error
Shrinkage methods
- Ridge regression and the Lasso (yee-haw!) – B. Miller
- Dimension reduction
Labs for each part

Back to the safety of linear models…

Answer: minimizing RSS
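(Added for reference, not on the original slide.) Given a fitted linear model $\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_{i1} + \dots + \hat\beta_p x_{ip}$, least squares chooses the coefficients $\hat\beta_j$ that minimize the residual sum of squares:

$$\mathrm{RSS} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$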

Discussion Why do we minimize RSS? (…have you ever questioned it?)

What do we know about least-squares? Assumption 1: we’re fitting a linear model Assumption 2: the true relationship between the predictors and the response is linear What can we say about the bias of our least-squares estimates?

What do we know about least-squares? Assumption 1: we’re fitting a linear model Assumption 2: the true relationship between the predictors and the response is linear Case 1: the number of observations is much larger than the number of predictors (n>>p) What can we say about the variance of our least-squares estimates?

What do we know about least-squares? Assumption 1: we’re fitting a linear model Assumption 2: the true relationship between the predictors and the response is linear Case 2: the number of observations is not much larger than the number of predictors (n≈p) What can we say about the variance of our least-squares estimates?

What do we know about least-squares? Assumption 1: we’re fitting a linear model Assumption 2: the true relationship between the predictors and the response is linear Case 3: the number of observations is smaller than the number of predictors (n<p) What can we say about the variance of our least-squares estimates?

Bias vs. variance

Discussion So how do we reduce the variance?

Subset selection Big idea: if having too many predictors is the problem, maybe we can get rid of some. Problem: how do we choose?

Flashback: subset selection Image credit: Ming Malaykham

Best subset selection: try them all!

Finding the “best” subset
Start with the null model M_0 (containing no predictors)
1. For k = 1, 2, …, p:
   a. Fit all (p choose k) models that contain exactly k predictors.
   b. Keep only the one that has the smallest RSS (or equivalently the largest R^2). Call it M_k.
2. Select a single “best” model from among M_0, …, M_p using cross-validated prediction error or something similar.
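A minimal sketch of this procedure in Python (not from the slides): it assumes the predictors live in a pandas DataFrame X with a response y, uses itertools (as in the lab) plus scikit-learn's LinearRegression, and the helper names fit_rss and best_subset are invented for illustration.

```python
# Sketch of best subset selection: try every subset of each size k.
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression

def fit_rss(X_subset, y):
    """Fit OLS on the given columns and return the residual sum of squares."""
    model = LinearRegression().fit(X_subset, y)
    residuals = y - model.predict(X_subset)
    return float(np.sum(residuals ** 2))

def best_subset(X, y):
    """Return the best model of each size k = 0, 1, ..., p (M_0 ... M_p)."""
    best = {0: (tuple(), float(np.sum((y - y.mean()) ** 2)))}  # null model: RSS = TSS
    for k in range(1, X.shape[1] + 1):
        # Fit all (p choose k) subsets of size k, keep the one with smallest RSS
        candidates = ((cols, fit_rss(X[list(cols)], y))
                      for cols in combinations(X.columns, k))
        best[k] = min(candidates, key=lambda pair: pair[1])
    return best  # step 2: choose among M_0 ... M_p with CV or an adjusted criterion
```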

Discussion Question 1: why not just use the one with the lowest RSS? Answer: because you’ll always wind up choosing the model with the highest number of predictors (why?)

Discussion Question 2: why not just calculate the cross-validated prediction error on all of them? Answer: so… many... models...

Thought exercise We do a lot of work in groups in this class. How many different possible groupings are there? Let’s break it down: 22 individual people, 231 different groups of two, 1540 different groups of three…
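(These counts are just binomial coefficients; the arithmetic below is added here, not on the slide.)

$$\binom{22}{2} = \frac{22 \cdot 21}{2} = 231, \qquad \binom{22}{3} = \frac{22 \cdot 21 \cdot 20}{6} = 1540, \qquad \text{and in total } 2^{22} = 4{,}194{,}304 \text{ possible subsets.}$$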

Model overload Number of possible models on a set of p predictors: 2^p. On 10 predictors: 2^10 = 1,024 models. On 20 predictors: 2^20 = 1,048,576 models (over 1 million).

Another problem Question: what happens to our estimated coefficients as we fit more and more models? Answer: the larger the search space, the larger the variance. We’re overfitting!

What if we could eliminate some?

The search space for p = 5 predictors {A, B, C, D, E}:
{ }
A, B, C, D, E
AB, AC, AD, AE, BC, BD, BE, CD, CE, DE
ABC, ABD, ABE, ACD, ACE, ADE, BCD, BCE, BDE, CDE
ABCD, ABCE, ABDE, ACDE, BCDE
ABCDE

Best subset selection
Start with the null model M_0 (containing no predictors)
1. For k = 1, 2, …, p:
   a. Fit all (p choose k) models that contain exactly k predictors.
   b. Keep only the one that has the smallest RSS (or equivalently the largest R^2). Call it M_k.
2. Select a single “best” model from among M_0, …, M_p using cross-validated prediction error or something similar.

Forward selection
Start with the null model M_0 (containing no predictors)
1. For k = 1, 2, …, p:
   a. Fit all (p − k + 1) models that augment M_{k−1} with exactly 1 predictor.
   b. Keep only the one that has the smallest RSS (or equivalently the largest R^2). Call it M_k.
2. Select a single “best” model from among M_0, …, M_p using cross-validated prediction error or something similar.
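A rough sketch of the forward version in Python (again not from the slides; same assumptions as before — a DataFrame X, response y, scikit-learn available — and the names rss and forward_stepwise are made up for illustration).

```python
# Sketch of forward stepwise selection (greedy: add one predictor at a time).
import numpy as np
from sklearn.linear_model import LinearRegression

def rss(X_subset, y):
    """Residual sum of squares of an OLS fit on the given columns."""
    model = LinearRegression().fit(X_subset, y)
    return float(np.sum((y - model.predict(X_subset)) ** 2))

def forward_stepwise(X, y):
    """Return the chosen model of each size: M_0 (empty), M_1, ..., M_p."""
    selected = []                      # predictors in the current model M_{k-1}
    remaining = list(X.columns)
    path = {0: tuple()}                # M_0 is the null model
    for k in range(1, X.shape[1] + 1):
        # Fit all (p - k + 1) models that add exactly one remaining predictor
        scores = {col: rss(X[selected + [col]], y) for col in remaining}
        best_col = min(scores, key=scores.get)
        selected.append(best_col)
        remaining.remove(best_col)
        path[k] = tuple(selected)
    return path  # still need to pick among M_0 ... M_p via CV or adjusted criteria
```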

Stepwise selection: way fewer models
Number of models we have to consider: 1 + p(p+1)/2 instead of 2^p
On 10 predictors: 1,024 models → 56 models
On 20 predictors: over 1 million models → 211 models

Forward selection Question: what potential problems do you see? Answer: the search is greedy, so a predictor that looks weak early on may never be added, even if it matters in combination with others. While this method usually does well in practice, it is not guaranteed to give the optimal solution.

Forward selection
Start with the null model M_0 (containing no predictors)
1. For k = 1, 2, …, p:
   a. Fit all (p − k + 1) models that augment M_{k−1} with exactly 1 predictor.
   b. Keep only the one that has the smallest RSS (or equivalently the largest R^2). Call it M_k.
2. Select a single “best” model from among M_0, …, M_p using cross-validated prediction error or something similar.

Backward selection
Start with the full model M_p (containing all p predictors)
1. For k = p − 1, p − 2, …, 0:
   a. Fit all (k + 1) models that remove exactly 1 predictor from M_{k+1}.
   b. Keep only the one that has the smallest RSS (or equivalently the largest R^2). Call it M_k.
2. Select a single “best” model from among M_0, …, M_p using cross-validated prediction error or something similar.
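And a matching backward sketch (illustrative only, same assumptions as the forward sketch above; backward_stepwise is an invented name).

```python
# Sketch of backward stepwise selection (greedy: drop one predictor at a time).
import numpy as np
from sklearn.linear_model import LinearRegression

def rss(X_subset, y):
    """Residual sum of squares of an OLS fit on the given columns."""
    model = LinearRegression().fit(X_subset, y)
    return float(np.sum((y - model.predict(X_subset)) ** 2))

def backward_stepwise(X, y):
    """Return the chosen model of each size, starting from the full model M_p."""
    selected = list(X.columns)                   # M_p: all predictors
    path = {len(selected): tuple(selected)}
    for k in range(len(selected) - 1, -1, -1):
        if k == 0:
            path[0] = tuple()                    # M_0: the null model
            break
        # Fit all (k + 1) models that drop exactly one predictor from M_{k+1}
        scores = {col: rss(X[[c for c in selected if c != col]], y)
                  for col in selected}
        drop_col = min(scores, key=scores.get)   # dropping it hurts RSS the least
        selected.remove(drop_col)
        path[k] = tuple(selected)
    return path  # again: pick among M_0 ... M_p with CV or an adjusted criterion
```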

Backward selection Question: what potential problems do you see? Answer: if we have more predictors than we have observations (p > n), this method won’t work (why?)

Choosing the optimal model
We know measures of training error (RSS and R^2) aren’t good predictors of test error (what we actually care about)
Two options:
1. We can directly estimate the test error, using either a validation set approach or a cross-validation approach
2. We can indirectly estimate test error by making an adjustment to the training error to account for the bias
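For the “direct estimate” option, a small illustration of cross-validated error for one candidate subset (added here, not from the slides; it assumes scikit-learn and the same DataFrame X / response y as before, and cv_mse is an invented helper name).

```python
# Estimate test MSE for one candidate subset of predictors with k-fold CV.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def cv_mse(X, y, columns, folds=5):
    """Cross-validated mean squared error for the model using only `columns`."""
    scores = cross_val_score(LinearRegression(), X[list(columns)], y,
                             cv=folds, scoring="neg_mean_squared_error")
    return -scores.mean()

# e.g. compare the models returned by best_subset / forward_stepwise above:
# errors = {k: cv_mse(X, y, cols) for k, cols in path.items() if k > 0}
# best_k = min(errors, key=errors.get)
```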

Adjusted R^2
Intuition: once all of the useful variables have been included in the model, adding additional junk variables will lead to only a small decrease in RSS
Adjusted R^2 pays a penalty for unnecessary variables in the model by dividing RSS by (n − d − 1) in the numerator
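(The formula itself, added here for reference in the standard ISLR form, where d is the number of predictors:)

$$\text{Adjusted } R^2 = 1 - \frac{\mathrm{RSS}/(n - d - 1)}{\mathrm{TSS}/(n - 1)}$$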

AIC, BIC, and C_p
Some other ways of penalizing RSS, each built on an estimate of the variance of the error terms
C_p and AIC are proportional to each other for least-squares models; BIC places a more severe penalty on large models
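(For reference, the forms these criteria take in ISLR, where d is the number of predictors and $\hat\sigma^2$ is an estimate of the variance of the error terms; added here, not on the slide:)

$$C_p = \frac{1}{n}\left(\mathrm{RSS} + 2 d \hat\sigma^2\right), \qquad \mathrm{AIC} = \frac{1}{n \hat\sigma^2}\left(\mathrm{RSS} + 2 d \hat\sigma^2\right), \qquad \mathrm{BIC} = \frac{1}{n}\left(\mathrm{RSS} + \log(n)\, d\, \hat\sigma^2\right)$$

Because $\log n > 2$ for $n > 7$, the BIC penalty grows faster with model size, which is why it tends to favor smaller models.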

Adjust or validate? Question: what are the benefits and drawbacks of each?
Adjusted measures — Pros: relatively inexpensive to compute. Cons: makes more assumptions about the model – more opportunity to be wrong.
Validation — Pros: more direct estimate (makes fewer assumptions). Cons: more expensive: requires either a test set or round-robin validation.

Lab: subset selection
To do today’s lab in R: leaps
To do today’s lab in Python: itertools, time
Instructions and code:
Full version can be found beginning on p. 244 of ISLR