1 Multiple Regression and Model Building, Chapter 14
Copyright © 2010 by The McGraw-Hill Companies, Inc. All rights reserved. McGraw-Hill/Irwin

2 Multiple Regression and Model Building
14.1 The Multiple Regression Model and the Least Squares Point Estimate
14.2 Model Assumptions and the Standard Error
14.3 R² and Adjusted R²
14.4 The Overall F Test
14.5 Testing the Significance of an Independent Variable

3 Multiple Regression and Model Building, Continued
14.6 Confidence and Prediction Intervals
14.7 Using Dummy Variables to Model Qualitative Independent Variables
14.8 Model Building and the Effects of Multicollinearity
14.9 Residual Analysis in Multiple Regression

4 The Multiple Regression Model and the Least Squares Point Estimate
Simple linear regression used one independent variable to explain the dependent variable
Multiple regression uses two or more independent variables to describe the dependent variable
This allows multiple regression models to handle more complex situations
In principle, there is no limit to the number of independent variables a model can use
A multiple regression model still has only one dependent variable

5 The Multiple Regression Model
The linear regression model relating y to x1, x2, …, xk is
y = β0 + β1x1 + β2x2 + … + βkxk + ε
µy = β0 + β1x1 + β2x2 + … + βkxk is the mean value of the dependent variable y
β0, β1, β2, …, βk are the unknown regression parameters relating the mean value of y to x1, x2, …, xk
ε is an error term that describes the effects on y of all factors other than the independent variables x1, x2, …, xk

6 The Least Squares Estimates and Point Estimation and Prediction
Estimation/prediction equation: ŷ = b0 + b1x01 + b2x02 + … + bkx0k
ŷ is the point estimate of the mean value of the dependent variable when the independent variables x1, x2, …, xk take the specified values x01, x02, …, x0k
It is also the point prediction of an individual value of the dependent variable at those same values
b0, b1, b2, …, bk are the least squares point estimates of the parameters β0, β1, β2, …, βk
x01, x02, …, x0k are specified values of the independent predictor variables x1, x2, …, xk
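
To make the estimation and prediction steps concrete, here is a minimal NumPy sketch. The data values are made up for illustration and are not from the textbook's fuel consumption case.

```python
import numpy as np

# Hypothetical data: n = 6 observations on k = 2 independent variables
x = np.array([[28.0, 18.0],
              [28.1, 14.0],
              [32.5, 24.0],
              [39.0, 22.0],
              [45.9,  8.0],
              [57.8, 16.0]])
y = np.array([12.4, 11.7, 12.4, 10.8, 9.4, 9.5])

# Add a column of ones so that b0 is estimated along with b1, ..., bk
X = np.column_stack([np.ones(len(y)), x])

# Least squares point estimates b0, b1, ..., bk
b, *_ = np.linalg.lstsq(X, y, rcond=None)

# Point estimate / point prediction at specified values x01 = 40.0, x02 = 10.0
x0 = np.array([1.0, 40.0, 10.0])
y_hat = x0 @ b
print("estimates:", b, "prediction:", y_hat)
```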

7 Fuel Consumption Case: MINITAB Output (Figure 14.4(a))

8 Model Assumptions and the Standard Error
The model is y = β0 + β1x1 + β2x2 + … + βkxk + ε
The assumptions for multiple regression are stated about the model error terms, the ε's

9 The Regression Model Assumptions
1. Mean of Zero Assumption
2. Constant Variance Assumption
3. Normality Assumption
4. Independence Assumption

10 Sum of Squares
The unexplained variation, SSE = Σ(yi − ŷi)², is the sum of squared residuals; the mean square error is s² = SSE / (n − (k + 1)), and the standard error is s = √s²

11 R² and Adjusted R²
1. Total variation is given by the formula Σ(yi − ȳ)²
2. Explained variation is given by the formula Σ(ŷi − ȳ)²
3. Unexplained variation is given by the formula Σ(yi − ŷi)²
4. Total variation is the sum of explained and unexplained variation

12 R² and Adjusted R², Continued
5. The multiple coefficient of determination is R² = (explained variation) / (total variation)
6. R² is the proportion of the total variation that is explained by the overall regression model
7. The multiple correlation coefficient R is the square root of R²
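
As a sketch of how these quantities fit together, assuming y holds the observed values and y_hat the fitted values from a least squares fit with an intercept (the decomposition in point 4 holds only in that case):

```python
import numpy as np

def r_squared(y, y_hat):
    """Multiple coefficient of determination R^2."""
    total = np.sum((y - np.mean(y)) ** 2)          # total variation
    explained = np.sum((y_hat - np.mean(y)) ** 2)  # explained variation
    unexplained = np.sum((y - y_hat) ** 2)         # unexplained variation
    # For a least squares fit with an intercept:
    # total == explained + unexplained (up to rounding)
    return explained / total
```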

13 Multiple Correlation Coefficient R
The multiple correlation coefficient R is just the square root of R²
With simple linear regression, r would take on the sign of b1
With multiple regression there are multiple bi's, so R is always positive
To interpret the direction of the relationship between the x's and y, you must look at the sign of the appropriate bi coefficient

14 The Adjusted R²
Adding an independent variable to a multiple regression will raise R²
R² will rise slightly even if the new variable has no real relationship to y
The adjusted R² corrects for this tendency in R²
As a result, it gives a better estimate of the importance of the independent variables
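
A small sketch of the usual adjustment for n observations and k independent variables (this is the standard adjusted R² formula; the slide itself does not show it):

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R^2: penalizes R^2 for the number of independent variables."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - (k + 1))
```

Unlike R², this quantity can fall when a variable that adds little explanatory power is included.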

15 The Overall F Test
H0: β1 = β2 = … = βk = 0 versus
Ha: At least one of β1, β2, …, βk ≠ 0
The test statistic is F(model) = (explained variation / k) / (unexplained variation / [n − (k + 1)])
Reject H0 in favor of Ha if F(model) > Fα* or the p-value < α
*Fα is based on k numerator and n − (k + 1) denominator degrees of freedom
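
A sketch of this computation using SciPy, assuming the explained and unexplained variation have already been obtained from the decomposition in Section 14.3:

```python
from scipy import stats

def overall_f_test(explained, unexplained, n, k):
    """F(model) and its p-value for H0: beta_1 = ... = beta_k = 0."""
    f_model = (explained / k) / (unexplained / (n - (k + 1)))
    p_value = stats.f.sf(f_model, k, n - (k + 1))  # area to the right of F(model)
    return f_model, p_value
```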

16 Testing the Significance of an Independent Variable
A variable in a multiple regression model is not likely to be useful unless there is a significant relationship between it and y
To test significance, we use the null hypothesis H0: βj = 0
versus the alternative hypothesis Ha: βj ≠ 0

17 Testing Significance of an Independent Variable #2
Ha: βj > 0: reject H0 if t > tα; the p-value is the area under the t distribution to the right of t
Ha: βj < 0: reject H0 if t < −tα; the p-value is the area under the t distribution to the left of t
Ha: βj ≠ 0: reject H0 if |t| > tα/2, that is, if t > tα/2 or t < −tα/2; the p-value is twice the area under the t distribution to the right of |t|

18 Testing Significance of an Independent Variable #3
Test statistic: t = bj / sbj, where sbj is the standard error of the estimate bj
100(1 − α)% confidence interval for βj: [bj ± tα/2 sbj]
tα, tα/2, and the p-values are based on n − (k + 1) degrees of freedom
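
A sketch of the t test and confidence interval, assuming the point estimate b_j and its standard error s_bj have already been computed:

```python
from scipy import stats

def t_test_for_beta_j(b_j, s_bj, n, k, alpha=0.05):
    """t statistic, two-sided p-value, and 100(1 - alpha)% CI for beta_j."""
    df = n - (k + 1)
    t = b_j / s_bj
    p_value = 2 * stats.t.sf(abs(t), df)     # twice the area right of |t|
    t_half = stats.t.ppf(1 - alpha / 2, df)  # the t_(alpha/2) point
    return t, p_value, (b_j - t_half * s_bj, b_j + t_half * s_bj)
```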

19 Testing Significance of an Independent Variable #4
It is customary to test the significance of every independent variable
If we can reject H0: βj = 0 at the 0.05 level of significance, we have strong evidence that the independent variable xj is significantly related to y
At the 0.01 level of significance, we have very strong evidence
The smaller the significance level α at which H0 can be rejected, the stronger the evidence that xj is significantly related to y

20 A Confidence Interval for the Regression Parameter βj
If the regression assumptions hold, a 100(1 − α)% confidence interval for βj is [bj ± tα/2 sbj]
tα/2 is based on n − (k + 1) degrees of freedom

21 Confidence and Prediction Intervals
The point estimate corresponding to particular values x01, x02, …, x0k of the independent variables is ŷ = b0 + b1x01 + b2x02 + … + bkx0k
It is unlikely that this value will exactly equal the mean value of y for these x values
We need bounds on how far the predicted value might be from the actual value
We can obtain these by calculating a confidence interval for the mean value of y and a prediction interval for an individual value of y

22 A Confidence Interval and a Prediction Interval
Confidence interval for the mean value of y: [ŷ ± tα/2 s √(distance value)]
Prediction interval for an individual value of y: [ŷ ± tα/2 s √(1 + distance value)]
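
The "distance value" is computed from (X′X)⁻¹ and is handled internally by statistical software. A minimal sketch using statsmodels, on synthetic data that exists purely for illustration:

```python
import numpy as np
import statsmodels.api as sm

# Synthetic data: 30 observations on k = 2 independent variables
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(30, 2)))
y = X @ np.array([10.0, -0.1, 0.5]) + rng.normal(scale=0.5, size=30)
results = sm.OLS(y, X).fit()

# Intervals at the specified values x01 = 0.2, x02 = -0.4
x0 = np.array([[1.0, 0.2, -0.4]])
frame = results.get_prediction(x0).summary_frame(alpha=0.05)
print(frame[["mean_ci_lower", "mean_ci_upper"]])  # CI for the mean of y
print(frame[["obs_ci_lower", "obs_ci_upper"]])    # PI for an individual y
```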

23 Using Dummy Variables to Model Qualitative Independent Variables
So far, we have only looked at including quantitative data in a regression model
However, we may wish to include descriptive qualitative data as well
For example, we might want to include the gender of respondents
We can model the effects of different levels of a qualitative variable by using what are called dummy variables
These are also known as indicator variables

24 How to Construct Dummy Variables
A dummy variable always has a value of either 0 or 1
For example, to model sales at two locations, we would code the first location as a 0 and the second as a 1
Operationally, it does not matter which location is coded 0 and which is coded 1

25 What If We Have More Than Two Categories?
Consider having three categories, say A, B, and C
We cannot code this with one dummy variable
A = 0, B = 1, and C = 2 would be invalid, because it assumes the difference between A and B is the same as the difference between B and C
We must use multiple dummy variables; specifically, k categories require k − 1 dummy variables

26 What If We Have More Than Two Categories? Continued
For A, B, and C, we would need two dummy variables
x1 is 1 for A, 0 otherwise
x2 is 1 for B, 0 otherwise
If x1 and x2 are both 0, the category must be C
This is why a third dummy variable is not needed
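
A minimal pandas sketch of this coding scheme (the column and category names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"category": ["A", "B", "C", "A", "C", "B"]})

# Two dummy variables for three categories; C is the baseline
df["x1"] = (df["category"] == "A").astype(int)  # 1 for A, 0 otherwise
df["x2"] = (df["category"] == "B").astype(int)  # 1 for B, 0 otherwise
print(df)
```

pandas can also generate these columns automatically with pd.get_dummies(..., drop_first=True), which drops one category to serve as the baseline.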

27 Interaction Models
So far, we have only considered dummy variables as stand-alone variables
The model so far is y = β0 + β1x + β2D + ε, where D is a dummy variable
However, we can also look at the interaction between a dummy variable and other variables
That model takes the form y = β0 + β1x + β2D + β3xD + ε
With an interaction term, both the intercept and the slope are shifted
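
A sketch of fitting the interaction model with the statsmodels formula interface (the data frame and its values are hypothetical):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: x is quantitative, D is a 0/1 dummy variable
df = pd.DataFrame({
    "y": [3.1, 4.0, 5.2, 6.1, 5.9, 7.8, 9.7, 11.0],
    "x": [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0],
    "D": [0, 0, 0, 0, 1, 1, 1, 1],
})

# y = b0 + b1*x + b2*D + b3*x*D: b2 shifts the intercept, b3 shifts the slope
model = smf.ols("y ~ x + D + x:D", data=df).fit()
print(model.params)
```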

28 Model Building and the Effects of Multicollinearity
Multicollinearity causes problems in evaluating the p-values of the model
Therefore, we need to evaluate more than the additional importance of each independent variable
We also need to evaluate how the variables work together
One way to do this is to determine whether the overall model gives a high R² and adjusted R², a small s, and short prediction intervals
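
One standard diagnostic for multicollinearity, not shown on this slide, is the variance inflation factor. A sketch with statsmodels on deliberately collinear synthetic data:

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=50)  # strongly related to x1
X = np.column_stack([np.ones(50), x1, x2])       # intercept column included

# Large VIFs (often > 10 as a rule of thumb) signal multicollinearity
for j, name in [(1, "x1"), (2, "x2")]:
    print(name, variance_inflation_factor(X, j))
```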

29 Effect of Adding an Independent Variable
Adding any independent variable will increase R²
This is true even when the added variable is unimportant
Thus, R² alone cannot tell us whether adding an independent variable is undesirable

30 A Better Criterion
A better criterion is the size of the standard error s
If s increases when an independent variable is added, we should not add that variable
However, a decrease in s alone is not enough
An independent variable should be included only if it reduces s enough to offset the higher t point and thereby reduces the length of the desired prediction interval for y

31 C Statistic
Another quantity for comparing regression models is the C (also known as Cp) statistic
First, calculate the mean square error for the model containing all p potential independent variables (s²p)
Next, calculate the SSE for a reduced model with k independent variables
Then C = SSE / s²p − [n − 2(k + 1)]

32 C Statistic Continued
We want the value of C to be small
Adding unimportant independent variables will raise the value of C
While we want C to be small, we also wish to find a model for which C roughly equals k + 1
A model with C substantially greater than k + 1 has substantial bias and is undesirable
If a model has a small value of C and C is less than k + 1, then the model is not biased and should be considered desirable
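
A small sketch of the comparison, assuming the SSE of the reduced model and the mean square error s²p of the full model are already known:

```python
def c_statistic(sse_k, s2_p, n, k):
    """Mallows' C statistic for a reduced model with k independent variables.

    sse_k: SSE of the reduced model with k independent variables
    s2_p:  mean square error of the model with all p potential variables
    """
    return sse_k / s2_p - (n - 2 * (k + 1))
```

In practice one screens candidate models for a small C that is close to, or below, k + 1.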

33 Residual Analysis in Multiple Regression
For an observed value yi, the residual is ei = yi − ŷi = yi − (b0 + b1xi1 + … + bkxik)
If the regression assumptions hold, the residuals should look like a random sample from a normal distribution with mean 0 and variance σ²

34 Residual Plots
Residuals versus each independent variable
Residuals versus the predicted values of y
Residuals in time order (if the response is a time series)
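
A matplotlib sketch that produces the first two kinds of plots; the design columns, fitted values, residuals, and variable names are assumed to come from an earlier fit:

```python
import matplotlib.pyplot as plt

def residual_plots(x_columns, names, y_hat, residuals):
    """Plot residuals against each independent variable and against y_hat."""
    n_plots = len(names) + 1
    fig, axes = plt.subplots(1, n_plots, figsize=(4 * n_plots, 3))
    for ax, column, name in zip(axes, x_columns.T, names):
        ax.scatter(column, residuals)      # residuals versus one x_j
        ax.axhline(0.0, linestyle="--")
        ax.set(xlabel=name, ylabel="residual")
    axes[-1].scatter(y_hat, residuals)     # residuals versus predicted y
    axes[-1].axhline(0.0, linestyle="--")
    axes[-1].set(xlabel="predicted y", ylabel="residual")
    fig.tight_layout()
    plt.show()
```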

