Applied Linear Regression CSTAT Workshop March 16, 2007 Vince Melfi.


References “Applied Linear Regression,” Third Edition by Sanford Weisberg. “Linear Models with R,” by Julian Faraway. Countless other books on Linear Regression, statistical software, etc.

Statistical Packages: Minitab (we’ll use this today), SPSS, SAS, R, S-Plus, JMP, etc.

Outline
I. Simple Linear Regression Review
II. Multiple Regression: Adding Predictors
III. Inference in Regression
IV. Regression Diagnostics
V. Model Selection

I. Simple Linear Regression Review 5 Savings Rate Data Data on Savings Rate and other variables for 50 countries. Want to explore the effect of variables on savings rate. SaveRate: Aggregate Personal Savings divided by disposable personal income. (Response variable.) Pop>75: Percent of the population over 75 years old. (One of the predictors.)

I. Simple Linear Regression Review 6 — (figure slide; image not included in the transcript)

7 Regression Output
The regression equation is SaveRate = … + … pop>75
S = …   R-Sq = 10.0%   R-Sq(adj) = 8.1%
Analysis of Variance table: Source, DF, SS, MS, F, P for Regression, Error, and Total
Slide callouts: fitted model; R² (coefficient of determination); testing the model

Importance of Plots
Four data sets, all with:
– the same fitted regression line Y = … + … x
– R² = 66.7%
– S = 1.24
– the same t statistics, etc.
Without looking at plots, the four data sets would seem similar.

I. Simple Linear Regression Review 9 Importance of Plots (1)

I. Simple Linear Regression Review 10 Importance of Plots (2)

I. Simple Linear Regression Review 11 Importance of Plots (3)

I. Simple Linear Regression Review 12 Importance of Plots (4)

I. Simple Linear Regression Review 13 The model
Y_i = β_0 + β_1 x_i + e_i, for i = 1, 2, …, n.
The “errors” e_1, e_2, …, e_n are assumed to be independent. Usually e_1, e_2, …, e_n are assumed to have the same standard deviation, σ. Often e_1, e_2, …, e_n are assumed to be normally distributed.

I. Simple Linear Regression Review 14 Least Squares
The regression line (line of best fit) is based on “least squares”: it is the line that minimizes the sum of squared vertical deviations of the data from the line. The least squares line has certain optimality properties. The least squares line is denoted Ŷ = β̂_0 + β̂_1 x, where β̂_0 and β̂_1 are the least squares estimates of the intercept and slope.

I. Simple Linear Regression Review 15 Residuals
The residuals represent the difference between the data and the least squares line: ê_i = Y_i − Ŷ_i = Y_i − (β̂_0 + β̂_1 x_i).

I. Simple Linear Regression Review 16 Checking assumptions Residuals are the main tool for checking model assumptions, including linearity and constant variance. Plotting the residuals versus the fitted values is always a good idea, to check linearity and constant variance. Histograms and Q-Q plots (normal probability plots) of residuals can help to check the normality assumption.
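A minimal sketch in R of the fit and the residual checks described above (the workshop itself uses Minitab, but R is among the packages listed earlier). The data values below are simulated stand-ins, and the column names SaveRate, Pop75, DPI are assumptions (the slides write the predictor as pop>75, which is not a legal R name); later sketches reuse this `savings` frame.

```r
# Simulated stand-in for the savings-rate data (all values made up)
set.seed(1)
n <- 50
savings <- data.frame(Pop75 = runif(n, 0.5, 4.5))
savings$DPI      <- 400 + 700 * savings$Pop75 + rnorm(n, sd = 400)  # deliberately correlated with Pop75
savings$SaveRate <- 7 + 1.3 * savings$Pop75 + rnorm(n, sd = 4)

fit <- lm(SaveRate ~ Pop75, data = savings)

par(mfrow = c(1, 3))
plot(fitted(fit), resid(fit), xlab = "Fitted value", ylab = "Residual")  # linearity / constant variance
abline(h = 0, lty = 2)
hist(resid(fit), main = "Histogram of residuals")                        # rough normality check
qqnorm(resid(fit)); qqline(resid(fit))                                   # normal probability (Q-Q) plot
```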

I. Simple Linear Regression Review 17–20 — (figure slides; images not included in the transcript)

I. Simple Linear Regression Review 21 “Four in one” plot from Minitab

I. Simple Linear Regression Review 22 Coefficient of determination (R²)
Residual sum of squares (also called the sum of squares for error): RSS = Σ (Y_i − Ŷ_i)²
Total sum of squares: TSS = Σ (Y_i − Ȳ)²
Coefficient of determination: R² = 1 − RSS/TSS

I. Simple linear regression review 23 R²
The coefficient of determination, R², measures the proportion of the variability in Y that is explained by the linear relationship with X. It is also the square of the Pearson correlation coefficient between X and Y.
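A short sketch computing these quantities by hand for the fit from the earlier simulated-`savings` sketch, and checking them against what lm() reports.

```r
# RSS, TSS, and R^2 by hand, reusing 'fit' and 'savings' from the earlier sketch
rss <- sum(resid(fit)^2)
tss <- sum((savings$SaveRate - mean(savings$SaveRate))^2)
c(byHand     = 1 - rss / tss,
  fromLm     = summary(fit)$r.squared,
  corSquared = cor(savings$SaveRate, savings$Pop75)^2)   # all three agree
```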

II. Multiple regression: Adding predictors 24 Adding a predictor
Recall: the fitted model was SaveRate = … + … pop>75 (the p-value for testing whether pop>75 is significant was …).
Another predictor: DPI (per-capita income).
Fitted model: SaveRate = … + … DPI (p-value for DPI: 0.124).

II. Multiple regression: Adding predictors 25 Adding a predictor (2)
The model with both pop>75 and DPI is SaveRate = … + … pop>75 + … DPI; the p-values are … and … for pop>75 and DPI.
The sign of the coefficient of DPI has changed! pop>75 was significant alone, but neither it nor DPI is significant when both are in the model!

II. Multiple regression: Adding predictors 26 Adding a predictor (3) What happened?? The predictors pop>75 and DPI are highly correlated

II. Multiple regression: Adding predictors 27 Added variable plots and partial correlation
1. Residuals from a fit of SaveRate versus pop>75 give the variability in SaveRate that’s not explained by pop>75.
2. Residuals from a fit of DPI versus pop>75 give the variability in DPI that’s not explained by pop>75.
3. A fit of the residuals from (1) versus the residuals from (2) gives the relationship between SaveRate and DPI after adjusting for pop>75. This is called an “added variable plot.”
4. The correlation between the residuals from (1) and the residuals from (2) is the “partial correlation” between SaveRate and DPI adjusted for pop>75.
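A sketch of these four steps in R, reusing the simulated `savings` frame from the first sketch (column names Pop75 and DPI are assumed stand-ins for the slides' pop>75 and DPI).

```r
# Added-variable plot and partial correlation for DPI, adjusting for Pop75
r_y <- resid(lm(SaveRate ~ Pop75, data = savings))   # step 1: SaveRate not explained by Pop75
r_x <- resid(lm(DPI ~ Pop75, data = savings))        # step 2: DPI not explained by Pop75
plot(r_x, r_y, xlab = "DPI adjusted for Pop75",
     ylab = "SaveRate adjusted for Pop75")           # step 3: the added-variable plot
coef(lm(r_y ~ r_x))[2]   # this slope equals the DPI slope in lm(SaveRate ~ Pop75 + DPI)
cor(r_x, r_y)            # step 4: the partial correlation
```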

II. Multiple regression: Adding predictors 28 Added variable plot
(Figure.) Note that the slope term, …, is the same as the slope term for DPI in the two-predictor model.

II. Multiple regression: Adding predictors 29 Scatterplot matrices (Matrix Plots)
With one predictor X, a scatterplot of Y vs. X is very informative. With more than one predictor, scatterplots of Y vs. each of the predictors, and of each of the predictors vs. each other, are needed. A scatterplot matrix (or matrix plot) is just an organized display of these plots.

II. Multiple regression: Adding predictors 30 — (figure slide: scatterplot matrix; image not included in the transcript)

II. Multiple regression: Adding predictors 31 Changes in R²
Consider adding a predictor X_2 to a model that already contains the predictor X_1. Let R²_1 be the R² value for the fit of Y vs. X_1, and let R²_2 be the R² value for the fit of Y vs. X_2.

II. Multiple regression: Adding predictors 32 Changes in R² (2)
The R² value for the multiple regression fit is always at least as large as R²_1 and R²_2. The R² value for the multiple regression fit of Y versus X_1 and X_2 may be
– less than R²_1 + R²_2 (if the two predictors are explaining the same variation)
– equal to R²_1 + R²_2 (if the two predictors measure different things)
– more than R²_1 + R²_2 (e.g. the response is the area of a rectangle, and the two predictors are its length and width)
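A small sketch illustrating only the first case (overlapping predictors), again with the simulated `savings` frame: Pop75 and DPI were simulated to be correlated, so they explain much of the same variation and the joint R² falls short of the sum of the individual R² values.

```r
# Joint R^2 vs the sum of single-predictor R^2's when predictors overlap
r2_pop  <- summary(lm(SaveRate ~ Pop75, data = savings))$r.squared
r2_dpi  <- summary(lm(SaveRate ~ DPI,   data = savings))$r.squared
r2_both <- summary(lm(SaveRate ~ Pop75 + DPI, data = savings))$r.squared
c(Pop75 = r2_pop, DPI = r2_dpi, sum = r2_pop + r2_dpi, both = r2_both)  # 'both' < 'sum' here
```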

II. Multiple regression: Adding predictors 33 Multiple regression model
Response variable Y; predictors X_1, X_2, …, X_p. The model is Y_i = β_0 + β_1 X_i1 + β_2 X_i2 + … + β_p X_ip + e_i, with the same assumptions on the errors e_i (independent, constant variance, normality).

III. Inference in regression 34 Inference in regression Most inference procedures assume independence, constant variance, and normality of the errors. Most are “robust” to departures from normality, meaning that the p-values, confidence levels, etc. are approximately correct even if normality does not hold. In general, techniques like the bootstrap can be used when normality is suspect.

III. Inference in regression 35 New data set Response variable: –Fuel = per-capita fuel consumption (times 1000) Predictors: –Dlic = proportion of the population who are licensed drivers (times 1000) –Tax = gasoline tax rate –Income = per person income in thousands of dollars –logMiles = base 2 log of federal-aid highway miles in the state

III. Inference in regression 36 t tests
Regression Analysis: Fuel versus Tax, Dlic, Income, logMiles
The regression equation is Fuel = … + … Tax + … Dlic + … Income + … logMiles
Coefficient table: Predictor, Coef, SE Coef, T, P for Constant, Tax, Dlic, Income, and logMiles
Slide callouts: t statistics; p values

III. Inference in regression 37 t tests (2)
The t statistic tests the hypothesis that a particular slope parameter is zero. The formula is t = (coefficient estimate)/(standard error); the degrees of freedom are n − (p+1); the p-values given are for the two-sided alternative. This is just like simple linear regression.
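A hedged sketch of the same fit in R. The column names (Fuel, Tax, Dlic, Income, logMiles) come from the slides, but the data values below are simulated stand-ins, not the workshop's fuel file; later sketches reuse this `fuel` frame and `fit_full`.

```r
# Simulated stand-in for the fuel-consumption data (51 rows, values made up)
set.seed(2)
n    <- 51
fuel <- data.frame(Tax      = runif(n, 8, 29),
                   Dlic     = runif(n, 700, 1075),
                   Income   = runif(n, 21, 37),
                   logMiles = runif(n, 10, 18))
fuel$Fuel <- 150 - 4 * fuel$Tax + 0.5 * fuel$Dlic - 6 * fuel$Income +
  20 * fuel$logMiles + rnorm(n, sd = 65)

fit_full <- lm(Fuel ~ Tax + Dlic + Income + logMiles, data = fuel)
summary(fit_full)        # Coef, SE Coef, t = Coef/SE, and two-sided p-values
coef(summary(fit_full))  # the coefficient table as a matrix
```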

III. Inference in regression 38 F tests
General structure:
– H_a: large model
– H_0: smaller model, obtained by setting some parameters in the large model to zero, or equal to each other, or equal to a constant
– RSS_AH = residual sum of squares after fitting the large (alternative hypothesis) model
– RSS_NH = residual sum of squares after fitting the smaller (null hypothesis) model
– df_NH and df_AH are the corresponding degrees of freedom

III. Inference in regression 39 F tests (2)
Test statistic: F = [(RSS_NH − RSS_AH)/(df_NH − df_AH)] / (RSS_AH/df_AH).
Null distribution: the F distribution with df_NH − df_AH numerator and df_AH denominator degrees of freedom.

III. Inference in regression 40 F test example Can the “economic” variables tax and income be dropped from the model with all four predictors? AH model includes all predictors NH model includes only Dlic and logMiles Fit both models and get RSS and df values

III. Inference in regression 41 F test example (2)
RSS_AH = …, df_AH = 46; RSS_NH = …, df_NH = 48. The p-value is the area to the right of 5.85 under an F(2, 46) distribution, approximately …. There is pretty strong evidence that removing both Tax and Income is unwise.
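A sketch of this extra-sum-of-squares F test in R, reusing `fuel` and `fit_full` from the earlier simulated sketch; anova() does the test in one call, and the manual lines follow the formula on the previous slide.

```r
# F test for dropping Tax and Income from the four-predictor model
fit_nh <- lm(Fuel ~ Dlic + logMiles, data = fuel)   # null-hypothesis (smaller) model
anova(fit_nh, fit_full)                             # the F test in one call

rss_nh <- sum(resid(fit_nh)^2);   df_nh <- df.residual(fit_nh)
rss_ah <- sum(resid(fit_full)^2); df_ah <- df.residual(fit_full)
Fstat <- ((rss_nh - rss_ah) / (df_nh - df_ah)) / (rss_ah / df_ah)
pf(Fstat, df_nh - df_ah, df_ah, lower.tail = FALSE) # area to the right under F(2, 46)
```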

III. Inference in regression 42 Another F test example
Question: Does it make sense that the two “economic” predictors should have the same coefficient?
H_a: Y = β_0 + β_1 Tax + β_2 Dlic + β_3 Income + β_4 logMiles + error
H_0: Y = β_0 + β_1 Tax + β_2 Dlic + β_1 Income + β_4 logMiles + error
Note: H_0 can be rewritten as Y = β_0 + β_1 (Tax + Income) + β_2 Dlic + β_4 logMiles + error

III. Inference in regression 43 Another F test example (2)
Fit the full model (AH). Create a new predictor “TI” by adding Tax and Income, and fit a model with TI, Dlic, and logMiles (NH). The p-value is the area to the right of the F statistic under an F(1, 46) distribution, approximately …. This suggests that the simpler model with the same coefficient for Tax and Income fits well.
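A sketch of the same comparison in R: the combined predictor is built with I() inside the formula, and anova() gives the F(1, 46) test (again using the simulated `fuel` and `fit_full`).

```r
# Null model with a single shared coefficient for Tax and Income
fit_ti <- lm(Fuel ~ I(Tax + Income) + Dlic + logMiles, data = fuel)
anova(fit_ti, fit_full)   # F test of the common-coefficient hypothesis
```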

III. Inference in regression 44 Removing one predictor
We have two ways to test whether one predictor can be removed from the model: the t test and the F test. The tests are equivalent, in the sense that t² = F and the p-values are identical.

III. Inference in regression 45 Confidence regions
Confidence intervals for one parameter use the familiar t-interval. For example, a 95% confidence interval for the coefficient of Income in the context of the full (four-predictor) model is
(estimate) ± (2.013)(2.194) = (estimate) ± 4.42,
where 2.194 is the standard error from the Minitab output and 2.013 comes from the t distribution with 46 df.
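A sketch of the same intervals in R with confint(), reusing `fit_full` from the simulated fuel sketch.

```r
# t-based confidence intervals for individual coefficients
confint(fit_full, level = 0.95)              # all coefficients at once
confint(fit_full, "Income", level = 0.95)    # just the Income coefficient
qt(0.975, df = df.residual(fit_full))        # the multiplier: t with 46 df, about 2.013
```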

III. Inference in regression 46 Joint confidence regions Joint confidence regions for two or more parameters are more complex, and use the F distribution in place of the t distribution. Minitab (and SPSS, and …) can’t draw these easily On the next page is a joint confidence region for the parameters of Dlic and Tax, drawn in R.

III. Inference in regression 47 — Figure: joint confidence region for the coefficients of Dlic and Tax, with dotted lines indicating the individual confidence intervals for the two; callouts mark the point (0, 0) and the boundary of the confidence region.
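A hedged sketch of a figure like this, assuming the car package is installed (the slides only say the plot was drawn in R, not how); it reuses `fit_full` from the simulated fuel sketch.

```r
# Joint 95% confidence region for the Dlic and Tax coefficients
library(car)
confidenceEllipse(fit_full, which.coef = c(3, 2), levels = 0.95)  # 3 = Dlic, 2 = Tax in coef(fit_full)
abline(v = confint(fit_full)["Dlic", ], lty = 3)  # individual interval for Dlic (x-axis)
abline(h = confint(fit_full)["Tax", ],  lty = 3)  # individual interval for Tax (y-axis)
points(0, 0, pch = 3)                             # mark the point (0, 0)
```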

III. Inference in regression 48 Prediction
Given a new set of predictor values x_1, x_2, …, x_p, what’s the predicted response? It’s easy to answer this: just plug the new predictors into the fitted regression model, Ŷ = β̂_0 + β̂_1 x_1 + … + β̂_p x_p. But how do we assess the uncertainty in the prediction? How do we form a confidence interval?

III. Inference in regression 49
Predicted Values for New Observations: Fit = …, SE Fit = …, 95% CI = (588.34, …), 95% PI = (480.39, …)
Values of Predictors for New Observations: Dlic = 900, Income = 28, logMiles = 15, Tax = 17
The prediction interval is for the fuel consumption of a single state with Dlic = 900, Income = 28, logMiles = 15, and Tax = 17; the confidence interval is for the average fuel consumption of states with those predictor values.
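A sketch of both intervals in R with predict(), reusing `fit_full` from the simulated fuel sketch; the predictor values mirror the slide.

```r
# CI for the mean response and PI for a single new observation
new_state <- data.frame(Dlic = 900, Income = 28, logMiles = 15, Tax = 17)
predict(fit_full, newdata = new_state, interval = "confidence", level = 0.95)  # average of such states
predict(fit_full, newdata = new_state, interval = "prediction", level = 0.95)  # one new state
```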

IV. Regression Diagnostics 50 Diagnostics Want to look for points that have a large influence on the fitted model Want to look for evidence that one or more model assumptions are untrue. Tools: –Residuals –Leverage –Influence and Cook’s Distance

IV. Regression Diagnostics 51 Leverage
A point whose predictor values are far from the “typical” predictor values has high leverage. For a high-leverage point, the fitted value will be close to the data value Y_i. A rule of thumb: any point with leverage larger than 2(p+1)/n is interesting. Most statistical packages can compute leverages.
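A sketch of the leverage computation and the rule of thumb in R, reusing `fuel` and `fit_full` from the simulated fuel sketch.

```r
# Leverages (hat values) and the 2(p+1)/n cutoff
h <- hatvalues(fit_full)
p <- length(coef(fit_full)) - 1     # number of predictors
cutoff <- 2 * (p + 1) / nrow(fuel)
which(h > cutoff)                   # high-leverage points worth a closer look
```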

IV. Regression Diagnostics 52–53 — (figure slides; images not included in the transcript)

IV. Regression Diagnostics 54 Influential Observations A data point is influential if it has a large effect on the fitted model. Put another way, an observation is influential if the fitted model will change a lot if the observation is deleted. Cook’s Distance is a measure of the influence of an observation. It may make sense to refit the model after removing a few of the most influential observations.
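A sketch of Cook's distances in R, with a refit after dropping the few most influential rows (reusing `fit_full` from the simulated fuel sketch; "three" is an arbitrary choice for illustration).

```r
# Cook's distances, then refit without the most influential points
d <- cooks.distance(fit_full)
worst <- order(d, decreasing = TRUE)[1:3]        # the three most influential rows
fit_trim <- update(fit_full, subset = -worst)    # refit without them
cbind(full = coef(fit_full), trimmed = coef(fit_trim))  # how much do coefficients move?
```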

IV. Regression Diagnostics 55 — Figure: examples of a point with high leverage but low influence and a point with high influence.

IV. Regression Diagnostics 56 — (figure slide; image not included in the transcript)

V. Model Selection 57 Model Selection Question: With a large number of potential predictors, how do we choose the predictors to include in the model? Want good prediction, but parsimony: Occam’s Razor. Also can be thought of as a bias-variance tradeoff.

V. Model Selection 58 Model Selection Example Data on all 50 states, from the 1970s –Life.Exp = Life expectancy (response) –Population (in thousands) –Income = per-capita income –Illiteracy (in percent of population) –Murder = murder rate per 100,000 –HS.Grad (in percent of population) –Frost = mean # days with min. temp < 32F –Area = land area in square miles

V. Model Selection 59 Forward Selection Choose a cutoff α. Start with no predictors. At each step, add the predictor with the lowest p-value less than α. Continue until there are no unused predictors with p-values less than α.
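A hedged sketch of p-value-based forward selection in R using add1() with F tests. The slide's variable list matches R's built-in state.x77 data, so that is used here as a stand-in; treating it as the workshop's data set is an assumption.

```r
# Forward selection by p-value (Alpha-to-Enter = 0.25 on the next slide)
states <- data.frame(state.x77)   # check.names turns "Life Exp" into Life.Exp, "HS Grad" into HS.Grad
fit0  <- lm(Life.Exp ~ 1, data = states)
scope <- ~ Population + Income + Illiteracy + Murder + HS.Grad + Frost + Area
add1(fit0, scope = scope, test = "F")   # Murder has the smallest p-value, so it enters first
fit1 <- update(fit0, . ~ . + Murder)
add1(fit1, scope = scope, test = "F")   # repeat until no remaining p-value is below alpha
```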

V. Model Selection 60
Stepwise Regression: Life.Exp versus Population, Income, …
Forward selection. Alpha-to-Enter: 0.25. Response is Life.Exp on 7 predictors, with N = 50.
Predictors enter in the order Murder, HS.Grad, Frost, Population; at each step the output lists the Constant, each entered predictor's coefficient, T-Value, and P-Value (Population enters with T-Value 2.00), along with S, R-Sq, R-Sq(adj), and Mallows Cp.

V. Model Selection 61 Variations on FS
Backward elimination: choose a cutoff α, start with all predictors in the model, eliminate the predictor with the highest p-value that is greater than α, and so on.
Stepwise: allow addition or elimination at each step (a hybrid of FS and BE).

V. Model Selection 62 All subsets
Fit all possible models. Based on a “goodness” criterion, choose the model that fits best. Goodness criteria include AIC, BIC, adjusted R², and Mallows' C_p. Some of the criteria are described next.
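A hedged sketch of an all-subsets search in R, assuming the leaps package is installed (the workshop's output above comes from Minitab's Best Subsets command); it reuses the `states` frame from the forward-selection sketch.

```r
# All-subsets search over the seven predictors
library(leaps)
all_fits <- regsubsets(Life.Exp ~ Population + Income + Illiteracy + Murder +
                         HS.Grad + Frost + Area, data = states, nvmax = 7)
s <- summary(all_fits)
s$which                                                  # predictors in the best model of each size
data.frame(size = 1:7, adjR2 = s$adjr2, Cp = s$cp, BIC = s$bic)
```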

V. Model Selection 63 Notation
RSS* = residual sum of squares for the current model
p* = number of terms (including the intercept) in the current model
n = number of observations
s² = RSS/(n − (p+1)) = estimate of σ² from the model with all predictors and an intercept term

V. Model Selection 64 Goodness criteria
Smaller is better for AIC, BIC, and C_p*; larger is better for adjR².
AIC = n log(RSS*/n) + 2p*
BIC = n log(RSS*/n) + p* log(n)
C_p* = RSS*/s² + 2p* − n
adjR² = 1 − [RSS*/(n − p*)] / [TSS/(n − 1)], where TSS is the total sum of squares
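A sketch computing these criteria by hand for one candidate model, with s² taken from the full model, again using the `states` frame from the earlier sketch. The AIC/BIC lines use the same formula extractAIC() applies to lm fits.

```r
# Goodness criteria for a three-predictor candidate model
full <- lm(Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad +
             Frost + Area, data = states)
cand <- lm(Life.Exp ~ Murder + HS.Grad + Frost, data = states)

n     <- nrow(states)
rss   <- sum(resid(cand)^2)
pstar <- length(coef(cand))                      # terms including the intercept
s2    <- sum(resid(full)^2) / df.residual(full)  # estimate of sigma^2 from the full model

c(AIC   = n * log(rss / n) + 2 * pstar,
  BIC   = n * log(rss / n) + pstar * log(n),
  Cp    = rss / s2 + 2 * pstar - n,
  adjR2 = summary(cand)$adj.r.squared)
```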

V. Model Selection 65
Best Subsets Regression: Life.Exp versus Population, Income, …
Response is Life.Exp. For each number of variables (Vars), the output lists R-Sq, R-Sq(adj), Mallows Cp, and S, with X's marking which of Population, Income, Illiteracy, Murder, HS.Grad, Frost, and Area appear in the best model of each size.

V. Model Selection 66 Model selection can overstate significance
Generate Y and X_1, X_2, …, X_50, all independent and standard normal, so none of the predictors is related to the response. Fit the full model and look at the overall F test. Then use model selection to choose a “good” smaller model, and look at its overall F test.
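A simulation sketch of this point in R; the sample size and the way forward selection is mimicked (step() with k chosen to approximate α = 0.05) are assumptions, since the slide does not give those details.

```r
# Pure-noise response and 50 pure-noise predictors
set.seed(3)
n <- 100
noise <- as.data.frame(matrix(rnorm(n * 50), n, 50))
names(noise) <- paste0("x", 1:50)
noise$y <- rnorm(n)                              # unrelated to every predictor

full <- lm(y ~ ., data = noise)
summary(full)$fstatistic                         # overall F for all 50 predictors: not significant

upper  <- reformulate(paste0("x", 1:50), response = "y")
chosen <- step(lm(y ~ 1, data = noise), scope = upper, direction = "forward",
               k = qchisq(0.95, 1), trace = 0)   # roughly mimics alpha = 0.05 forward selection
summary(chosen)$fstatistic                       # typically "highly significant" despite pure noise
```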

V. Model Selection 67 The full model
Results from fitting the model with all 50 predictors. Note that the overall F test is not significant.
S = …   R-Sq = 57.6%   R-Sq(adj) = 14.3%
Analysis of Variance table: Source, DF, SS, MS, F, P for Regression, Residual Error, and Total

V. Model Selection 68 The “good” small model
Run forward selection with α = 0.05. Predictors x38, x41, and x24 are chosen; fit that three-predictor model. Now the overall F test is highly significant.
Analysis of Variance table: Source, DF, SS, MS, F, P for Regression, Residual Error, and Total

What’s left? Weighted least squares Tests for lack of fit Transformations of response and predictors Analysis of Covariance Etc.