Applied Linear Regression CSTAT Workshop March 16, 2007 Vince Melfi.


1 Applied Linear Regression CSTAT Workshop March 16, 2007 Vince Melfi

2 References “Applied Linear Regression,” Third Edition by Sanford Weisberg. “Linear Models with R,” by Julian Faraway. Countless other books on Linear Regression, statistical software, etc.

3 Statistical Packages Minitab (we’ll use this today) SPSS SAS R Splus JMP ETC!!

4 Outline I. Simple linear regression review II. Multiple Regression: Adding predictors III. Inference in Regression IV. Regression Diagnostics V. Model Selection

5 I. Simple Linear Regression Review 5 Savings Rate Data Data on Savings Rate and other variables for 50 countries. Want to explore the effect of variables on savings rate. SaveRate: Aggregate Personal Savings divided by disposable personal income. (Response variable.) Pop>75: Percent of the population over 75 years old. (One of the predictors.)

6 I. Simple Linear Regression Review 6

7 Regression Output

    The regression equation is
    SaveRate = 7.152 + 1.099 pop>75

    S = 4.29409   R-Sq = 10.0%   R-Sq(adj) = 8.1%

    Analysis of Variance
    Source      DF       SS       MS     F      P
    Regression   1   98.545  98.5454  5.34  0.025
    Error       48  885.083  18.4392
    Total       49  983.628

The slide highlights the fitted model, R² (the coefficient of determination), and the F test of the overall model.
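
The workshop uses Minitab, but the same fit is easy to reproduce in R (also listed on the packages slide). A minimal sketch, assuming the 50-country savings data that ship with Faraway's faraway package as the data frame savings, where sr is the savings rate and pop75 is the percent of the population over 75:

    # Simple linear regression of savings rate on percent of population over 75.
    # Assumes the 'faraway' package is installed; its 'savings' data frame has
    # columns sr (savings rate), pop75, and dpi, matching the workshop data.
    library(faraway)
    data(savings)

    fit1 <- lm(sr ~ pop75, data = savings)
    summary(fit1)   # coefficients, s, R-squared, and t tests
    anova(fit1)     # analysis-of-variance table like the one above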

8 Importance of Plots Four data sets, all with the same regression line Y = 3 + 0.5x, the same R² = 66.7%, the same S = 1.24, the same t statistics, etc. Without looking at plots, the four data sets would seem similar.
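
The fitted line Y = 3 + 0.5x and R² of 66.7% quoted here match Anscombe's well-known quartet, which is built into R as the anscombe data set; assuming that is the example intended, a short sketch that fits and plots all four:

    # Four regressions with (essentially) identical fitted lines and R-squared
    # but very different scatterplots.
    data(anscombe)
    fits <- lapply(1:4, function(i) {
      lm(anscombe[[paste0("y", i)]] ~ anscombe[[paste0("x", i)]])
    })
    lapply(fits, coef)   # each is roughly intercept 3, slope 0.5

    op <- par(mfrow = c(2, 2))
    for (i in 1:4) {
      plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
           xlab = paste0("x", i), ylab = paste0("y", i))
      abline(fits[[i]])
    }
    par(op)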

9 I. Simple Linear Regression Review 9 Importance of Plots (1)

10 I. Simple Linear Regression Review 10 Importance of Plots (2)

11 I. Simple Linear Regression Review 11 Importance of Plots (3)

12 I. Simple Linear Regression Review 12 Importance of Plots (4)

13 I. Simple Linear Regression Review 13 The model: Y_i = β_0 + β_1 x_i + e_i, for i = 1, 2, …, n. The "errors" e_1, e_2, …, e_n are assumed to be independent. Usually e_1, e_2, …, e_n are assumed to have the same standard deviation, σ. Often e_1, e_2, …, e_n are assumed to be normally distributed.

14 I. Simple Linear Regression Review 14 Least Squares The regression line (line of best fit) is based on "least squares." The regression line is the line that minimizes the sum of the squared deviations of the data from the line. The least squares line has certain optimality properties. The least squares line is denoted Ŷ = β̂_0 + β̂_1 x, where β̂_0 and β̂_1 are the least squares estimates of the intercept β_0 and slope β_1.

15 I. Simple Linear Regression Review 15 Residuals The residuals represent the difference between the data and the least squares line: ê_i = Y_i − Ŷ_i = Y_i − (β̂_0 + β̂_1 x_i), for i = 1, 2, …, n.

16 I. Simple Linear Regression Review 16 Checking assumptions Residuals are the main tool for checking model assumptions, including linearity and constant variance. Plotting the residuals versus the fitted values is always a good idea, to check linearity and constant variance. Histograms and Q-Q plots (normal probability plots) of residuals can help to check the normality assumption.
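
A sketch of these checks in R, continuing the hypothetical fit1 from the earlier savings-data sketch:

    # Residuals vs. fitted values (checks linearity and constant variance),
    # plus a histogram and normal Q-Q plot of the residuals (checks normality).
    plot(fitted(fit1), resid(fit1),
         xlab = "Fitted values", ylab = "Residuals")
    abline(h = 0, lty = 2)

    hist(resid(fit1), main = "Histogram of residuals", xlab = "Residual")
    qqnorm(resid(fit1)); qqline(resid(fit1))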

17 I. Simple Linear Regression Review 17

18 I. Simple Linear Regression Review 18

19 I. Simple Linear Regression Review 19

20 I. Simple Linear Regression Review 20

21 I. Simple Linear Regression Review 21 “Four in one” plot from Minitab

22 I. Simple Linear Regression Review 22 Coefficient of determination (R²) Residual sum of squares, aka sum of squares for error: RSS = Σ (Y_i − Ŷ_i)² = Σ ê_i². Total sum of squares: TSS = Σ (Y_i − Ȳ)². Coefficient of determination: R² = 1 − RSS/TSS.

23 I. Simple Linear Regression Review 23 R² The coefficient of determination, R², measures the proportion of the variability in Y that is explained by the linear relationship with X. It's also the square of the Pearson correlation coefficient between X and Y.

24 II. Multiple regression: Adding predictors 24 Adding a predictor Recall: Fitted model was SaveRate = 7.152 + 1.099 pop>75 (p-value for test of whether pop>75 is significant was 0.025.) Another predictor: DPI (per-capita income) Fitted model: SaveRate = 8.57 + 0.000996 DPI (p-value for DPI: 0.124)

25 II. Multiple regression: Adding predictors 25 Adding a predictor (2) The model with both pop>75 and DPI is SaveRate = 7.06 + 1.30 pop>75 − 0.00034 DPI. The p-values are 0.100 and 0.738 for pop>75 and DPI, respectively. The sign of the coefficient of DPI has changed! And pop>75 was significant alone, but neither it nor DPI is significant when both are in the model!

26 II. Multiple regression: Adding predictors 26 Adding a predictor (3) What happened?? The predictors pop>75 and DPI are highly correlated

27 II. Multiple regression: Adding predictors 27 Added variable plots and partial correlation
1. Residuals from a fit of SaveRate versus pop>75 give the variability in SaveRate that's not explained by pop>75.
2. Residuals from a fit of DPI versus pop>75 give the variability in DPI that's not explained by pop>75.
3. A fit of the residuals from (1) versus the residuals from (2) gives the relationship between SaveRate and DPI after adjusting for pop>75. This is called an "added variable plot."
4. The correlation between the residuals from (1) and the residuals from (2) is the "partial correlation" between SaveRate and DPI adjusted for pop>75.
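
A sketch of steps 1-4 in R, again using the (assumed) faraway savings data, with sr for SaveRate and dpi for DPI:

    # Added variable plot and partial correlation for DPI, adjusting for pop75.
    r.y <- resid(lm(sr ~ pop75, data = savings))    # (1) SaveRate not explained by pop75
    r.x <- resid(lm(dpi ~ pop75, data = savings))   # (2) DPI not explained by pop75

    plot(r.x, r.y, xlab = "DPI adjusted for pop75",
         ylab = "SaveRate adjusted for pop75")      # (3) added variable plot
    av.fit <- lm(r.y ~ r.x)
    abline(av.fit)
    coef(av.fit)   # slope matches the DPI coefficient in the two-predictor model

    cor(r.x, r.y)  # (4) partial correlation between SaveRate and DPI given pop75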

28 II. Multiple regression: Adding predictors 28 Added variable plot Note that the slope term, -0.000341, is the same as the slope term for DPI in the two-predictor model

29 II. Multiple regression: Adding predictors 29 Scatterplot matrices (Matrix Plots) With one predictor X, a scatterplot of Y vs. X is very informative. With more than one predictor, scatterplots of Y vs. each of the predictors, and of each of the predictors vs. each other, are needed. A scatterplot matrix (or matrix plot) is just an organized display of these plots.
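
In R, pairs() draws a scatterplot matrix; a minimal sketch with the (assumed) savings data:

    # Scatterplot matrix of the response and the two predictors considered so far.
    pairs(savings[, c("sr", "pop75", "dpi")])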

30 II. Multiple regression: Adding predictors 30

31 II. Multiple regression: Adding predictors 31 Changes in R² Consider adding a predictor X_2 to a model that already contains the predictor X_1. Let R²_1 be the R² value for the fit of Y vs. X_1, and let R²_2 be the R² value for the fit of Y vs. X_2.

32 II. Multiple regression: Adding predictors 32 Changes in R² (2) The R² value for the multiple regression fit of Y versus X_1 and X_2 is always at least as large as each of R²_1 and R²_2. It may be –less than R²_1 + R²_2 (if the two predictors explain some of the same variation) –equal to R²_1 + R²_2 (if the two predictors measure different things) –more than R²_1 + R²_2 (e.g. the response is the area of a rectangle, and the two predictors are its length and width)

33 II. Multiple regression: Adding predictors 33 Multiple regression model Response variable Y; predictors X_1, X_2, …, X_p. The model is Y_i = β_0 + β_1 x_i1 + β_2 x_i2 + … + β_p x_ip + e_i, with the same assumptions on the errors e_i (independence, constant variance, normality).

34 III. Inference in regression 34 Inference in regression Most inference procedures assume independence, constant variance, and normality of the errors. Most are “robust” to departures from normality, meaning that the p-values, confidence levels, etc. are approximately correct even if normality does not hold. In general, techniques like the bootstrap can be used when normality is suspect.

35 III. Inference in regression 35 New data set Response variable: –Fuel = per-capita fuel consumption (times 1000) Predictors: –Dlic = proportion of the population who are licensed drivers (times 1000) –Tax = gasoline tax rate –Income = per person income in thousands of dollars –logMiles = base 2 log of federal-aid highway miles in the state

36 III. Inference in regression 36 t tests

    Regression Analysis: Fuel versus Tax, Dlic, Income, logMiles

    The regression equation is
    Fuel = 154 - 4.23 Tax + 0.472 Dlic - 6.14 Income + 18.5 logMiles

    Predictor     Coef  SE Coef      T      P
    Constant     154.2    194.9   0.79  0.433
    Tax         -4.228    2.030  -2.08  0.043
    Dlic        0.4719   0.1285   3.67  0.001
    Income      -6.135    2.194  -2.80  0.008
    logMiles    18.545    6.472   2.87  0.006

The T column gives the t statistics and the P column the p-values.

37 III. Inference in regression 37 t tests (2) The t statistic tests the hypothesis that a particular slope parameter is zero. The formula is t = (coefficient estimate)/(standard error), with n − (p+1) degrees of freedom. The p-values given are for the two-sided alternative. This is just like simple linear regression.
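
The same table comes out of R's summary() on an lm fit. A sketch, assuming the fuel data have been assembled into a data frame named fuel with the variables defined on the data slide (the name and column construction are assumptions here):

    # Multiple regression of Fuel on Tax, Dlic, Income, and logMiles, assuming a
    # data frame 'fuel' with those columns.
    # summary() reports, for each coefficient, t = estimate / standard error
    # and a two-sided p-value on n - (p + 1) degrees of freedom.
    fit.full <- lm(Fuel ~ Tax + Dlic + Income + logMiles, data = fuel)
    summary(fit.full)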

38 III. Inference in regression 38 F tests General structure: –H_a: Large model –H_0: Smaller model, obtained by setting some parameters in the large model to zero, or equal to each other, or equal to a constant –RSS_AH = residual sum of squares after fitting the large (alternative hypothesis) model –RSS_NH = residual sum of squares after fitting the smaller (null hypothesis) model –df_NH and df_AH are the corresponding degrees of freedom

39 III. Inference in regression 39 F tests (2) Test statistic: F = [(RSS_NH − RSS_AH)/(df_NH − df_AH)] / (RSS_AH/df_AH). Null distribution: F distribution with df_NH − df_AH numerator and df_AH denominator degrees of freedom.

40 III. Inference in regression 40 F test example Can the "economic" variables Tax and Income be dropped from the model with all four predictors? The AH model includes all predictors; the NH model includes only Dlic and logMiles. Fit both models and get the RSS and df values.

41 III. Inference in regression 41 F test example (2) RSS_AH = 193700, df_AH = 46; RSS_NH = 243006, df_NH = 48. F = [(243006 − 193700)/2] / (193700/46) ≈ 5.85. The p-value is the area to the right of 5.85 under an F(2, 46) distribution, approximately 0.0054. There's pretty strong evidence that removing both Tax and Income is unwise.
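
In R the same comparison is a call to anova() on the two nested fits (continuing the hypothetical fuel data frame); the F statistic can also be checked by hand from the numbers above:

    # Null-hypothesis model: drop the two "economic" predictors.
    fit.small <- lm(Fuel ~ Dlic + logMiles, data = fuel)
    anova(fit.small, fit.full)   # F test with 2 and 46 degrees of freedom

    # The same F statistic and p-value from the quantities on the slide.
    f <- ((243006 - 193700) / (48 - 46)) / (193700 / 46)
    pf(f, df1 = 2, df2 = 46, lower.tail = FALSE)   # about 0.0054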

42 III. Inference in regression 42 Another F test example Question: Does it make sense that the two "economic" predictors should have the same coefficient? H_a: Y = β_0 + β_1 Tax + β_2 Dlic + β_3 Income + β_4 logMiles + error. H_0: Y = β_0 + β_1 Tax + β_2 Dlic + β_1 Income + β_4 logMiles + error. Note that H_0 can be rewritten as Y = β_0 + β_1 (Tax + Income) + β_2 Dlic + β_4 logMiles + error.

43 III. Inference in regression 43 Another F test example (2) Fit the full model (AH). Create a new predictor "TI" by adding Tax and Income, and fit a model with TI, Dlic, and logMiles (NH). The p-value is the area to the right of the F statistic under an F(1, 46) distribution, approximately 0.518. This suggests that the simpler model with the same coefficient for Tax and Income fits well.
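
A sketch of the same test in R; rather than building a TI column, the shared coefficient can be imposed directly in the formula with I():

    # Null model: Tax and Income forced to share one coefficient.
    fit.eq <- lm(Fuel ~ I(Tax + Income) + Dlic + logMiles, data = fuel)
    anova(fit.eq, fit.full)   # F test with 1 and 46 degrees of freedom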

44 III. Inference in regression 44 Removing one predictor We have two ways to test whether one predictor can be removed from the model: the t test and the F test. The tests are equivalent, in the sense that t² = F and the two p-values are identical.

45 III. Inference in regression 45 Confidence regions Confidence intervals for one parameter use the familiar t-interval. For example, to form a 95% confidence interval for the parameter of Income in the context of the full (four-predictor) model: −6.135 ± (2.013)(2.194) = −6.135 ± 4.417. The estimate −6.135 and standard error 2.194 come from the Minitab output; 2.013 is the 0.975 quantile of the t distribution with 46 df.
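
In R, confint() produces these t-intervals directly; a sketch using the hypothetical fuel fit:

    confint(fit.full)             # 95% intervals for all coefficients
    confint(fit.full, "Income")   # just the Income interval

    # The t multiplier quoted above: the 0.975 quantile of t with 46 df.
    qt(0.975, df = 46)            # about 2.013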

46 III. Inference in regression 46 Joint confidence regions Joint confidence regions for two or more parameters are more complex, and use the F distribution in place of the t distribution. Minitab (and SPSS, and …) can't draw these easily. On the next page is a joint confidence region for the parameters of Dlic and Tax, drawn in R.
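
One way to draw such a region in R is the confidenceEllipse() function in the car package (an assumption here; the slide does not say which R function was used):

    # Joint 95% confidence region for the Dlic and Tax coefficients.
    # In fit.full the coefficients are ordered (Intercept), Tax, Dlic, Income,
    # logMiles, so which.coef = c(3, 2) plots Dlic against Tax.
    library(car)
    confidenceEllipse(fit.full, which.coef = c(3, 2))

    ci <- confint(fit.full)
    abline(v = ci["Dlic", ], lty = 2)   # individual interval for Dlic
    abline(h = ci["Tax", ],  lty = 2)   # individual interval for Tax
    points(0, 0, pch = 19)              # is (0, 0) inside the joint region?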

47 III. Inference in regression 47 Joint confidence region for Dlic and Tax, with dotted lines indicating the individual confidence intervals for the two. (The plot also marks the point (0, 0) and labels the boundary of the confidence region.)

48 III. Inference in regression 48 Prediction Given a new set of predictor values x_1, x_2, …, x_p, what's the predicted response? It's easy to answer this: just plug the new predictors into the fitted regression model, Ŷ = β̂_0 + β̂_1 x_1 + … + β̂_p x_p. But how do we assess the uncertainty in the prediction? How do we form a confidence interval?

49 III. Inference in regression 49

    Predicted Values for New Observations
    New Obs     Fit  SE Fit            95% CI            95% PI
          1  613.39   12.44  (588.34, 638.44)  (480.39, 746.39)

    Values of Predictors for New Observations
    New Obs  Dlic  Income  logMiles   Tax
          1   900    28.0      15.0  17.0

The 95% PI is the prediction interval for the fuel consumption of a single state with Dlic = 900, Income = 28, logMiles = 15, and Tax = 17; the 95% CI is the confidence interval for the average fuel consumption of states with those predictor values.
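
In R, predict() returns both intervals for a new observation; a sketch with the hypothetical fuel fit:

    new.state <- data.frame(Dlic = 900, Income = 28.0, logMiles = 15.0, Tax = 17.0)

    predict(fit.full, newdata = new.state, interval = "confidence")   # mean response
    predict(fit.full, newdata = new.state, interval = "prediction")   # one new state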

50 IV. Regression Diagnostics 50 Diagnostics We want to look for points that have a large influence on the fitted model, and for evidence that one or more model assumptions are untrue. Tools: –Residuals –Leverage –Influence and Cook's Distance

51 IV. Regression Diagnostics 51 Leverage A point whose predictor values are far from the "typical" predictor values has high leverage. For a high leverage point, the fitted value will be close to the data value Y_i. A rule of thumb: any point with leverage larger than 2(p+1)/n is interesting. Most statistical packages can compute leverages.
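
In R, the leverages are the diagonal of the hat matrix, available from hatvalues(); a sketch that flags points by the rule of thumb, continuing the hypothetical fuel fit:

    # Leverages for the fuel regression; flag points above 2(p+1)/n.
    h <- hatvalues(fit.full)
    p <- length(coef(fit.full)) - 1     # number of predictors
    n <- nrow(model.frame(fit.full))
    which(h > 2 * (p + 1) / n)          # high-leverage observations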

52 IV. Regression Diagnostics 52

53 IV. Regression Diagnostics 53

54 IV. Regression Diagnostics 54 Influential Observations A data point is influential if it has a large effect on the fitted model. Put another way, an observation is influential if the fitted model will change a lot if the observation is deleted. Cook’s Distance is a measure of the influence of an observation. It may make sense to refit the model after removing a few of the most influential observations.
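
A sketch of the corresponding computations in R, including a refit without the single most influential observation:

    # Cook's distances; large values flag influential observations.
    d <- cooks.distance(fit.full)
    sort(d, decreasing = TRUE)[1:5]     # the five most influential points

    # Refit without the most influential observation and compare coefficients.
    fit.drop <- update(fit.full, subset = -which.max(d))
    cbind(full = coef(fit.full), dropped = coef(fit.drop))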

55 IV. Regression Diagnostics 55 (Figure: example fits contrasting a point with high leverage but low influence and a point with high influence.)

56 IV. Regression Diagnostics 56

57 V. Model Selection 57 Model Selection Question: With a large number of potential predictors, how do we choose the predictors to include in the model? We want good prediction but also parsimony (Occam's Razor); the choice can be thought of as a bias-variance tradeoff.

58 V. Model Selection 58 Model Selection Example Data on all 50 states, from the 1970s: –Life.Exp = life expectancy (response) –Population (in thousands) –Income = per-capita income –Illiteracy (in percent of population) –Murder = murder rate per 100,000 –HS.Grad (in percent of population) –Frost = mean # days with min. temp < 32F –Area = land area in square miles

59 V. Model Selection 59 Forward Selection Choose a cutoff α. Start with no predictors. At each step, add the predictor with the lowest p-value less than α. Continue until there are no unused predictors with p-values less than α.
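
A sketch of forward selection in R with the state data. Note that R's built-in step() adds variables by AIC rather than by a p-value cutoff, so it is only an approximate analogue of Minitab's alpha-to-enter rule; add1() with an F test shows the p-values behind a single forward step.

    # The built-in state.x77 matrix has the variables above; data.frame() turns
    # "Life Exp" and "HS Grad" into Life.Exp and HS.Grad, matching the output below.
    statedata <- data.frame(state.x77)

    null <- lm(Life.Exp ~ 1, data = statedata)
    full <- lm(Life.Exp ~ Population + Income + Illiteracy + Murder +
                 HS.Grad + Frost + Area, data = statedata)

    # One forward step: F test and p-value for each candidate predictor.
    add1(null, scope = formula(full), test = "F")

    # Automatic forward selection (by AIC, not a p-value cutoff).
    step(null, scope = formula(full), direction = "forward")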

60 V. Model Selection 60

    Stepwise Regression: Life.Exp versus Population, Income, ...

    Forward selection.  Alpha-to-Enter: 0.25
    Response is Life.Exp on 7 predictors, with N = 50

    Step             1        2        3        4
    Constant     72.97    70.30    71.04    71.03

    Murder      -0.284   -0.237   -0.283   -0.300
    T-Value      -8.66    -6.72    -7.71    -8.20
    P-Value      0.000    0.000    0.000    0.000

    HS.Grad               0.044    0.050    0.047
    T-Value                2.72     3.29     3.14
    P-Value               0.009    0.002    0.003

    Frost                        -0.0069  -0.0059
    T-Value                        -2.82    -2.46
    P-Value                        0.007    0.018

    Population                            0.00005
    T-Value                                  2.00
    P-Value                                 0.052

    S            0.847    0.796    0.743    0.720
    R-Sq         60.97    66.28    71.27    73.60
    R-Sq(adj)    60.16    64.85    69.39    71.26
    Mallows Cp    16.1      9.7      3.7      2.0

61 V. Model Selection 61 Variations on FS Backward elimination: –Choose a cutoff α –Start with all predictors in the model –At each step, eliminate the predictor with the highest p-value that is greater than α –and so on, until no remaining predictor has a p-value greater than α. Stepwise: allow addition or elimination at each step (a hybrid of FS and BE).

62 V. Model Selection 62 All subsets Fit all possible models. Based on a "goodness" criterion, choose the model that fits best. Goodness criteria include AIC, BIC, adjusted R², and Mallows' C_p. Some of the criteria are described next.
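
In R the exhaustive search is commonly done with regsubsets() from the leaps package (an assumption; any all-subsets routine would do). A sketch, continuing the statedata frame from the forward-selection sketch:

    # All-subsets search over the seven predictors.
    library(leaps)
    all.sub <- regsubsets(Life.Exp ~ ., data = statedata, nvmax = 7)
    s <- summary(all.sub)
    s$which                                          # predictors in the best model of each size
    cbind(rsq = s$rsq, adjr2 = s$adjr2, cp = s$cp, bic = s$bic)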

63 V. Model Selection 63 Notation RSS* = residual sum of squares for the current model. p* = number of terms (including the intercept) in the current model. n = number of observations. s² = RSS/(n − (p+1)) = estimate of σ² from the model with all predictors and an intercept term.

64 V. Model Selection 64 Goodness criteria Smaller is better for AIC, BIC, and C_p*; larger is better for adjusted R². AIC = n log(RSS*/n) + 2p*. BIC = n log(RSS*/n) + p* log(n). C_p* = RSS*/s² + 2p* − n. Adjusted R² = 1 − (RSS*/(n − p*)) / (TSS/(n − 1)).

65 V. Model Selection 65

    Best Subsets Regression: Life.Exp versus Population, Income, ...
    Response is Life.Exp

    Vars   R-Sq  R-Sq(adj)  Mallows Cp        S
       1   61.0       60.2        16.1  0.84732
       2   66.3       64.8         9.7  0.79587
       3   71.3       69.4         3.7  0.74267
       4   73.6       71.3         2.0  0.71969
       5   73.6       70.6         4.0  0.72773
       6   73.6       69.9         6.0  0.73608
       7   73.6       69.2         8.0  0.74478

(In the original output, X's under the vertically printed predictor names — Population, Income, Illiteracy, Murder, HS.Grad, Frost, Area — mark which predictors enter the best model of each size. The best 1- to 4-variable models add Murder, HS.Grad, Frost, and Population in turn, matching the forward-selection path on the previous slide; the 7-variable model uses all predictors.)

66 V. Model Selection 66 Model selection can overstate significance Generate Y and X_1, X_2, …, X_50, all independent and standard normal, so none of the predictors is related to the response. Fit the full model and look at the overall F test. Then use model selection to choose a "good" smaller model, and look at its overall F test.
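
A sketch of this simulation in R. The forward selection here is step()'s AIC-based search rather than the α = 0.05 rule used on the next slides, and the particular predictors chosen depend on the random seed, so the selected variables will differ from x38, x41, x24; the point — an apparently significant F test after selecting from pure noise — is the same.

    # Response and 50 predictors, all independent standard normal, n = 100:
    # no predictor is truly related to the response.
    set.seed(1)                        # arbitrary seed; results vary with it
    n <- 100
    x <- matrix(rnorm(n * 50), n, 50)
    colnames(x) <- paste0("x", 1:50)
    dat <- data.frame(y = rnorm(n), x)

    # Overall F test for the full 50-predictor model (typically not significant).
    full.fit <- lm(y ~ ., data = dat)
    summary(full.fit)$fstatistic

    # Forward selection, then the overall F test of the chosen small model:
    # it is often highly "significant" even though the data are pure noise.
    upper <- reformulate(colnames(x), response = "y")
    chosen <- step(lm(y ~ 1, data = dat), scope = upper,
                   direction = "forward", trace = 0)
    summary(chosen)$fstatistic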

67 V. Model Selection 67 The full model Results from fitting the model with all 50 predictors. Note that the F test is not significant.

    S = 0.915237   R-Sq = 57.6%   R-Sq(adj) = 14.3%

    Analysis of Variance
    Source          DF       SS      MS     F      P
    Regression      50  55.7093  1.1142  1.33  0.160
    Residual Error  49  41.0453  0.8377
    Total           99  96.7546

68 V. Model Selection 68 The "good" small model Run FS with α = 0.05; predictors x38, x41, and x24 are chosen. Fit that three-predictor model. Now the F test is highly significant.

    Analysis of Variance
    Source          DF       SS      MS     F      P
    Regression       3  20.9038  6.9679  8.82  0.000
    Residual Error  96  75.8508  0.7901
    Total           99  96.7546

69 What’s left? Weighted least squares Tests for lack of fit Transformations of response and predictors Analysis of Covariance Etc.

