Multiple regression - Inference for multiple regression - A case study IPS chapters 11.1 and 11.2 © 2006 W.H. Freeman and Company.

Objectives (IPS chapters 11.1 and 11.2)

Multiple regression
■ Multiple linear regression model
■ Confidence interval for β_j
■ Significance test for β_j
■ ANOVA F-test for multiple regression
■ Squared multiple correlation R²

Multiple linear regression model

For p explanatory variables, the population mean of the response y is a linear function of the x's:

μ_y = β_0 + β_1 x_1 + … + β_p x_p

Each observed y differs from its mean by an error ε. The statistical model for the sample data (i = 1, 2, …, n) is then:

Data = fit + residual
y_i = β_0 + β_1 x_1i + … + β_p x_pi + ε_i

The parameters of the model are β_0, β_1, …, β_p and σ, where σ is the standard deviation of the errors ε.
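As a minimal sketch of this model in Python (all coefficients and sample sizes below are made up for illustration), the snippet simulates data from a multiple regression with p = 2 explanatory variables:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100                                  # sample size
beta = np.array([2.0, 0.5, -1.3])        # hypothetical beta_0, beta_1, beta_2
sigma = 0.8                              # standard deviation of the errors

x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 10, n)
eps = rng.normal(0, sigma, n)            # errors eps_i ~ N(0, sigma)

# y_i = beta_0 + beta_1 x_1i + beta_2 x_2i + eps_i
y = beta[0] + beta[1] * x1 + beta[2] * x2 + eps
```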

We selected a random sample of n individuals for which p + 1 variables were measured (x_1, …, x_p, y). The least-squares regression method minimizes the sum of squared deviations, Σ e_i² = Σ (y_i − ŷ_i)², to express y as a linear function of the explanatory variables:

ŷ_i = b_0 + b_1 x_1i + … + b_p x_pi

■ As with simple linear regression, the constant b_0 is the y intercept.
■ The regression coefficients (b_1, …, b_p) reflect the change in y for a 1-unit change in each explanatory variable, holding the others fixed. They are analogous to the slope in simple regression.
■ ŷ is an unbiased estimate of μ_y, and b_0, …, b_p are unbiased estimates of the population parameters β_0, …, β_p.
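A minimal illustration of the least-squares fit, using numpy's lstsq on simulated data (all numbers here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1, x2 = rng.uniform(0, 10, n), rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x1 - 1.3 * x2 + rng.normal(0, 0.8, n)

# Design matrix with a leading column of 1s for the intercept b_0
X = np.column_stack([np.ones(n), x1, x2])

# Least squares chooses b to minimize sum of (y_i - yhat_i)^2
b, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = X @ b
residuals = y - yhat
print("estimates b_0, b_1, b_2:", b)
```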

Confidence interval for β_j

Estimating the regression parameters β_0, …, β_j, …, β_p is a case of one-sample inference with unknown population variance, so we use the t distribution with n − p − 1 degrees of freedom.

A level C confidence interval for β_j is given by:

b_j ± t* SE_bj

- SE_bj is the standard error of b_j.
- t* is the critical value for the t(n − p − 1) distribution with area C between −t* and +t*.
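A sketch of how these intervals could be computed by hand, assuming the usual formula SE_bj² = s² [(XᵀX)⁻¹]_jj (statistical software reports these standard errors directly):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.uniform(0, 10, (n, p))])
y = X @ np.array([2.0, 0.5, -1.3]) + rng.normal(0, 0.8, n)

b = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ b
s2 = resid @ resid / (n - p - 1)                     # s^2 = SSE / (n - p - 1)
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))   # SE_bj for each coefficient

C = 0.95
tstar = stats.t.ppf((1 + C) / 2, df=n - p - 1)       # critical t*
for j in range(p + 1):
    print(f"b_{j}: {b[j]:.3f} +/- {tstar * se[j]:.3f}")
```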

Significance test for β_j

To test the hypothesis H_0: β_j = 0 versus a one- or two-sided alternative, we calculate the t statistic

t = b_j / SE_bj,

which has the t(n − p − 1) null distribution, to find the P-value of the test.
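For instance, given hypothetical values of b_j and SE_bj read off a regression output, the two-sided P-value could be computed as:

```python
from scipy import stats

# Hypothetical values read off a regression output
b_j, se_bj = 0.52, 0.11          # estimate and its standard error
n, p = 100, 2

t = b_j / se_bj                               # t statistic for H0: beta_j = 0
pval = 2 * stats.t.sf(abs(t), df=n - p - 1)   # two-sided P-value
print(f"t = {t:.2f}, P = {pval:.4f}")
```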

ANOVA F-test for multiple regression

For a multiple linear relationship, ANOVA tests the hypotheses

H_0: β_1 = β_2 = … = β_p = 0 versus H_a: H_0 not true

by computing the F statistic: F = MSM / MSE. When H_0 is true, F follows the F(p, n − p − 1) distribution. The P-value is the area above the observed F.

A significant P-value doesn't mean that all the explanatory variables have a significant influence on y, only that at least one of them does.
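A quick worked example with made-up sums of squares:

```python
from scipy import stats

# Hypothetical ANOVA quantities for a model with p = 3 predictors, n = 224
n, p = 224, 3
SSM, SSE = 28.0, 108.0               # model and error sums of squares (made up)

MSM = SSM / p
MSE = SSE / (n - p - 1)
F = MSM / MSE
pval = stats.f.sf(F, p, n - p - 1)   # area above F under F(p, n - p - 1)
print(f"F = {F:.2f}, P = {pval:.4g}")
```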

ANOVA table for multiple regression

Source   Sum of squares   df           Mean square       F         P-value
Model    SSM              p            MSM = SSM/DFM     MSM/MSE   tail area above F
Error    SSE              n − p − 1    MSE = SSE/DFE
Total    SST              n − 1

SST = SSM + SSE    DFT = DFM + DFE

The regression standard deviation s is calculated from the residuals e_i = y_i − ŷ_i:

s² = Σ e_i² / (n − p − 1) = SSE / DFE = MSE

s estimates the regression standard deviation σ (s² is an unbiased estimate of σ²).
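A sketch of computing s from hypothetical residuals:

```python
import numpy as np

# Hypothetical residuals e_i = y_i - yhat_i from a fit with p = 2 predictors
e = np.array([0.3, -0.5, 0.1, 0.7, -0.2, -0.4, 0.6, -0.6])
n, p = len(e), 2

SSE = np.sum(e**2)
s = np.sqrt(SSE / (n - p - 1))    # s estimates the regression sd sigma
print(f"s = {s:.3f}")
```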

Squared multiple correlation R²

Just as with simple linear regression, R², the squared multiple correlation, is the proportion of the variation in the response variable y that is explained by the model:

R² = SSM / SST

In the particular case of multiple linear regression, the model is a linear function of all p explanatory variables taken together.
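For example, computed from hypothetical observed and fitted values:

```python
import numpy as np

# Hypothetical observed values and least-squares fitted values
y    = np.array([2.1, 3.0, 2.5, 3.6, 3.2, 2.8])
yhat = np.array([2.3, 2.9, 2.4, 3.4, 3.3, 2.7])

SST = np.sum((y - y.mean())**2)
SSE = np.sum((y - yhat)**2)
R2 = 1 - SSE / SST    # for a least-squares fit with intercept, this equals SSM/SST
print(f"R^2 = {R2:.3f}")
```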

We have data on 224 first-year computer science majors at a large university in a given year. The data for each student include:

* Cumulative GPA after 2 semesters at the university (y, response variable)
* SAT math score (SATM, x_1, explanatory variable)
* SAT verbal score (SATV, x_2, explanatory variable)
* Average high school grade in math (HSM, x_3, explanatory variable)
* Average high school grade in science (HSS, x_4, explanatory variable)
* Average high school grade in English (HSE, x_5, explanatory variable)

Here are the summary statistics for these data, as given by the SAS software:

The first step in multiple linear regression is to study all pairwise relationships among the p + 1 variables. Here is the SAS output for all pairwise correlation analyses (the value of r and the two-sided P-value for H_0: ρ = 0). Scatterplots of all 15 pairwise relationships are also necessary to understand the data.
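In place of the SAS output (not reproduced here), a sketch of the same analysis in Python, assuming a hypothetical file gpa.csv containing the six variables under the column names used above:

```python
import pandas as pd
from scipy import stats

# Hypothetical data file with columns: GPA, SATM, SATV, HSM, HSS, HSE
df = pd.read_csv("gpa.csv")

print(df.corr())                  # all pairwise correlations r

# Two-sided test of H0: rho = 0 for one pair
r, pval = stats.pearsonr(df["HSM"], df["GPA"])
print(f"r = {r:.3f}, P = {pval:.4f}")
```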

For simplicity, let's first run a multiple linear regression using only the three high school grade averages. (SAS output annotations: overall P-value very significant; R² fairly small, about 20%; HSM significant; HSS and HSE not.)

The ANOVA for the multiple linear regression using only HSM, HSS, and HSE is very significant → at least one of the regression coefficients is significantly different from zero. But R² is fairly small (0.205) → only about 20% of the variation in cumulative GPA can be explained by these high school scores. (Remember, a small P-value does not imply a large effect.)
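A sketch of this fit using statsmodels, again assuming the hypothetical gpa.csv; swapping the formula string (e.g. "GPA ~ HSM + HSE" or "GPA ~ SATM + SATV") reproduces the reduced models discussed below:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("gpa.csv")        # hypothetical file with GPA, HSM, HSS, HSE

fit = smf.ols("GPA ~ HSM + HSS + HSE", data=df).fit()
print(fit.summary())               # coefficient t tests, ANOVA F-test, R^2
print(fit.f_pvalue, fit.rsquared)  # overall F-test P-value and R^2
```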

The tests of hypotheses for each b_j within the multiple linear regression reach significance for HSM only. We found a significant correlation between HSS and GPA when analyzed by themselves, so why isn't b_HSS significant in the multiple regression equation? HSS and HSM are themselves significantly correlated, suggesting redundancy: if all three high school averages are used, neither HSE nor HSS significantly helps us predict GPA over and above what HSM does all by itself.

We now drop the least significant variable from the previous model: HSS. The conclusions are about the same, but notice that the actual regression coefficients have changed. (SAS output annotations: P-value very significant; R² small, about 20%; HSM significant, HSE not.)

Let's now run a multiple linear regression with the two SAT scores only. The ANOVA test for β_SATM and β_SATV is very significant → at least one is not zero. R² is very small (0.06) → only 6% of GPA variation is explained by these tests. When taken together, only SATM is a significant predictor of GPA (see the P-value in the output).

We finally run a multiple regression model with all five variables together. The overall test is significant, but only the average high school math grade (HSM) makes a significant contribution in this model to predicting the cumulative GPA (overall P-value very significant; R² fairly small, 21%). This conclusion applies to computer science majors at this large university.
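The full model in the same hypothetical setup:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("gpa.csv")   # hypothetical file with all six variables

fit = smf.ols("GPA ~ SATM + SATV + HSM + HSS + HSE", data=df).fit()
print(fit.summary())          # per the slide's SAS results, only HSM reaches significance
```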

Warnings

■ Multiple regression is a dangerous tool. The significance or non-significance of any one predictor in a multiple-predictor model is largely a function of what other predictors may or may not be included in the model.
■ P-values arrived at after a selection process (where we drop seemingly unimportant x's) don't mean the same thing as P-values do in earlier parts of the textbook.
■ When the x's are highly correlated (with each other), two equations with very different parameters can yield very similar predictions, so interpreting slopes as "if I change x by 1 unit, I will change y by b units" may not be a good idea.