
Lecture 11 Multivariate Regression A Case Study

Other topics: Multicollinearity
• Assuming that all the regression assumptions hold, how good are our estimators?
• The answer: pretty good, provided multicollinearity is not present.
• So what is multicollinearity?
• It comes into play when two or more of the explanatory variables are (nearly) linearly related. Think of trying to support a plane on a single line.
• Mathematically it can be shown that, with 2 explanatory variables in the model, the variance of each coefficient estimator is proportional to (1 - r^2)^(-1), where r is the correlation between the two explanatory variables.
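The effect is easy to see in a small simulation (an illustrative sketch only, not part of the case study; all names and numbers below are made up): as the correlation r between two predictors grows, the standard error of each coefficient estimate blows up roughly like (1 - r^2)^(-1/2).

set.seed(1)                                    # illustrative simulation only
n <- 200
for (r in c(0, 0.5, 0.95)) {
  x1 <- rnorm(n)
  x2 <- r * x1 + sqrt(1 - r^2) * rnorm(n)      # cor(x1, x2) is approximately r
  y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)
  se <- summary(lm(y ~ x1 + x2))$coefficients["x1", "Std. Error"]
  cat("r =", r, "  SE of x1 coefficient =", round(se, 3), "\n")
}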

Finding the variables that cause multicollinearity
• This analysis involves only the explanatory variables.
• Nonessential multicollinearity is removed by standardizing the data: instead of each observation x_i we use z_i = (x_i - mean(x)) / sd(x).
• In R this is achieved with (x - mean(x)) / sd(x), where x is the vector to be transformed.
• Now look at the correlation matrix of the explanatory variables. If there is a linear relationship between just two of them it will show up in this matrix: the offending pair will have a correlation close to 1 or -1.
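A short sketch of this step in R (X is a hypothetical data frame containing only the explanatory variables):

Xs <- as.data.frame(scale(X))   # scale() standardizes each column: (x - mean(x)) / sd(x)
round(cor(Xs), 2)               # look for off-diagonal entries close to 1 or -1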

• This is fine, but in general the relationship is more complex.
• The first step is to calculate the Variance Inflation Factors (VIF) by looking at the diagonal elements of (X^T X)^(-1) for the standardized X, or in R: diag(solve(cor(X))) (see the accompanying code). X here is the matrix whose columns are the explanatory variables.
• If some VIFs are large (say greater than 10) it means that some of the variables are redundant and can be dropped.
• To obtain the approximate relationship between the columns of X we look at the eigenvalues and eigenvectors of the cor(X) matrix.
• A small eigenvalue indicates a near-linear relationship among the columns of X; the relationship itself can be recovered from the corresponding eigenvector.
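Both diagnostics in R, again with a hypothetical matrix (or data frame) X of explanatory variables:

Rmat <- cor(X)                              # correlation matrix of the predictors
round(diag(solve(Rmat)), 2)                 # VIFs; values above ~10 flag trouble

e <- eigen(Rmat)                            # eigenvalues and eigenvectors
e$values                                    # a very small eigenvalue signals a near-linear relation
round(e$vectors[, which.min(e$values)], 2)  # its eigenvector shows which variables are involved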

Variable selection
• Suppose we have a pool of k potential explanatory variables and suppose that the “correct” model involves only p of these. How do we select the best subset of p variables?
• Under-fitting: we include too few variables in the model. Consequences include:
  - The coefficient estimates are biased
  - The estimate of the error variance is biased upwards
  - Prediction intervals do not have the correct width
• Over-fitting: we include the relevant variables but unessential ones as well. The coefficient estimates are not biased in this case, but the prediction errors tend to be much larger (i.e. the intervals are too wide). It also tends to introduce multicollinearity.

• Parsimony principle (sometimes stated as Ockham's razor, after the 14th-century English philosopher William of Ockham): “entia non sunt multiplicanda praeter necessitatem,” or “entities should not be multiplied beyond necessity.” All things being equal, the simplest explanation tends to be the right one.
• In statistics this translates into: when choosing among models with approximately the same explanatory power, always choose the one with the fewest parameters.

Selecting the correct set of variables
• There are two main categories of variable selection methods:
1) All possible regressions. As the name says, for each possible submodel we compute some measure of “goodness,” which is then used to select the best submodel. The problem is that with k candidate variables the number of possible submodels is 2^k, which quickly becomes very large.
2) Stepwise regression. Here we move from one model to another, adding or deleting variables so as to improve the “goodness” measure, gradually arriving at a good model.
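As an illustration of approach 1), the regsubsets() function from the leaps package runs best-subsets regression; a sketch, assuming leaps is installed and dat is a hypothetical data frame with response y and the candidate predictors:

library(leaps)
fits <- regsubsets(y ~ ., data = dat, nvmax = 10)  # best model of each size, up to 10 variables
s <- summary(fits)
s$adjr2   # adjusted R-squared of the best model of each size
s$cp      # Mallow's Cp
s$bic     # BIC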

Goodness criteria
• As you can imagine, there are many goodness-of-fit criteria.
1. Adjusted R^2 (in R output: Adjusted R-squared).
• Adding variables to a model never decreases R^2, so R^2 by itself favors larger models. Adjusted R^2 compensates by penalizing the number of variables in the model.
• It is defined as: adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of explanatory variables.
2. Mallow's Cp
• This is a measure of how well a model predicts.
• Values of Cp close to p + 1, where p is the number of variables currently included in the model, indicate a good model.
3. AIC (Akaike Information Criterion)
• This is an estimate of the relative information lost when the model is used to approximate the data-generating process; it is based on the entropy (Kullback-Leibler) concept.
• Small values of this measure indicate a good model.
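The adjusted R^2, AIC, and BIC are directly available from a fitted lm object (sketch; dat and y are placeholder names):

fit <- lm(y ~ ., data = dat)    # a candidate model
summary(fit)$adj.r.squared      # Adjusted R-squared
AIC(fit)                        # Akaike Information Criterion
BIC(fit)                        # Bayesian Information Criterion (next slide)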

4. BIC (Bayesian Information Criterion)
• This is similar to AIC but penalizes models containing many variables more heavily than AIC does.
• Again, small values indicate good models.
5. Size of the estimated error
• This one is simple: one looks at the Mean Squared Error and minimizes it.
6. Cross-Validation
• This is more involved. In general it means dividing the data into parts (say 10), fitting the model on 9 of them and checking how well it predicts the remaining part. This is repeated so that each part serves once as the hold-out set, and the prediction errors are averaged for each candidate model. The best model is the one with the smallest average prediction error.
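A minimal 10-fold cross-validation sketch for one candidate model, in base R (dat and y are placeholder names):

set.seed(1)
k <- 10
fold <- sample(rep(1:k, length.out = nrow(dat)))  # assign each row to a fold
cv_err <- numeric(k)
for (i in 1:k) {
  train <- dat[fold != i, ]
  test  <- dat[fold == i, ]
  fit   <- lm(y ~ ., data = train)                # the candidate model
  cv_err[i] <- mean((test$y - predict(fit, newdata = test))^2)
}
mean(cv_err)   # compare this average prediction error across candidate models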

Stepwise Regression
• In this method we start with an initial model and improve it by adding or deleting variables sequentially. There are 3 basic techniques:
1. Forward selection. We start with a model containing no explanatory variables, then add the single variable that most improves the “goodness” criterion. We then add another variable, again the one that improves the criterion the most, and so on. If at some step no addition improves the criterion, the selection stops and the current model is returned.
2. Backward selection. This is very similar to forward selection, except that we start with the full model (containing all predictors) and keep deleting variables until the criterion can no longer be improved.
3. Stepwise regression. Here we start with a null model. At every step we perform one forward addition (if it helps) and one backward deletion (if it helps). When neither action improves the model, the construction stops.
• In R all three are carried out with the function step() (please see the accompanying code and the sketch below).
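A sketch of how step() might be called for each of the three techniques (dat and y are placeholder names; by default step() uses AIC as the goodness criterion):

null_fit <- lm(y ~ 1, data = dat)   # intercept-only model
full_fit <- lm(y ~ ., data = dat)   # model with all candidate variables

step(null_fit, scope = formula(full_fit), direction = "forward")   # forward selection
step(full_fit, direction = "backward")                             # backward selection
step(null_fit, scope = formula(full_fit), direction = "both")      # stepwise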

Summary
• If multicollinearity is present in the model it needs to be eliminated.
• Use the VIFs and the eigenvalues/eigenvectors of the correlation matrix to identify, and then eliminate, the responsible variables.
• If many explanatory variables are available we need to select the best subset. This is done using one of the various criteria above; the most popular in today's high-volume, information-rich world is the AIC.