1 5. Multiple Regression II ECON 251 Research Methods.


1 1 5. Multiple Regression II ECON 251 Research Methods

2 2 The Regression Modeling Process
- Multiple regression also introduces complexity into the modeling process itself. How does one decide which variables should be considered? How many variables should be considered initially? How many variables should remain in the final model? What does one do if data cannot be found for the exact variable of interest?
- While the answers to many of these questions involve as much art as science, there are some steps that can guide you along the way.
- In this section, we will also introduce a few more tools to help you with some of these decisions.

3 3 The Modeling Process: Step by Step
1. Develop a model that has a sound basis.
   - Theoretical and practical inputs into model formation
     - Working group of experts for a brainstorming session
     - Literature review on factors influencing the variable of interest
2. Gather data for the variables in the model.
   - Gather data for the dependent and independent variables.
   - If data cannot be found for the exact variable, use a “proxy.”
     - Example: you believe sales of your product follow GDP growth, but you want a model of monthly data, and GDP figures are quarterly. What do you do?
3. Draw scatter diagrams to determine whether a linear model (or another form) appears to be appropriate.
4. Estimate the model coefficients and statistics using statistical computer software.

4 4 The Modeling Process: Step by Step
5. Assess the model fit and usefulness using the model statistics.
   - Use the three-step process we developed with simple linear regression.
   - Do the variables make sense? (significance, signs)
6. Diagnose violations of required conditions. Try to remedy problems when they are identified.
7. Assess the model fit and usefulness using the model statistics.
   - Notice the iterative nature of the process.
8. If the model passes the assessment tests, use it to:
   - Predict the value of the dependent variable
   - Provide interval estimates for these predictions
   - Provide insight into the impact of each independent variable on the dependent variable.
Remember: Statistics informs judgment; it does not replace it. Use your common sense when developing, finalizing and employing a model!

5 5 Example – Motel Profitability
- La Quinta Motor Inns is planning an expansion.
- Management wishes to predict which sites are likely to be profitable.
- Step #1: Develop a model with a sound basis
  Several predictors of profitability can be identified, including:
  - Competition
  - Market awareness
  - Demand generators
  - Demographics
  - Physical quality

6 6 Example – Motel Profitability
Profitability is modeled with variables drawn from each category of predictors:
- Competition: Rooms, the number of hotel/motel rooms within 3 miles.
- Market awareness: Nearest, the distance in miles to the nearest competitor of the La Quinta inn.
- Demand generators: Office space, in ‘000s of sq ft within 3 miles; College enrollment, in ‘000s of students within 3 miles.
- Demographics: Income, the median household income in ‘000s.
- Physical: Disttwn, the distance to downtown in miles.
At this stage, you should also assign your “a priori” expectations of the sign of each coefficient for each independent variable. We’ll use this information when we “assess” the model.

7 7 Example – Motel Profitability
Step #2: Gather Data
Data were collected from 100 randomly selected inns that belong to La Quinta and used to estimate the following suggested model:
$\text{Margin} = \beta_0 + \beta_1\,\text{Rooms} + \beta_2\,\text{Nearest} + \beta_3\,\text{Office} + \beta_4\,\text{College} + \beta_5\,\text{Income} + \beta_6\,\text{Disttwn} + \varepsilon$

8 8 Example – Motel Profitability
Step #3: Draw Scatter Diagrams

9 9 Example – Motel Profitability
Step #4: Estimate Model
This is the sample regression equation (sometimes called the prediction equation):
$\text{MARGIN} = 72.455 - 0.008\,\text{ROOMS} - 1.646\,\text{NEAREST} + 0.02\,\text{OFFICE} + 0.212\,\text{COLLEGE} - 0.413\,\text{INCOME} + 0.225\,\text{DISTTWN}$
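
As a minimal sketch of how the prediction equation is applied, the snippet below plugs one site's characteristics into the estimated coefficients printed on the slide. The function name `predict_margin` and the example site are hypothetical and only illustrate the arithmetic; the coefficients are the rounded values shown above.

```python
# Coefficients of the estimated full model, copied from the slide (rounded).
INTERCEPT = 72.455
COEFFICIENTS = {
    "ROOMS": -0.008,
    "NEAREST": -1.646,
    "OFFICE": 0.02,
    "COLLEGE": 0.212,
    "INCOME": -0.413,
    "DISTTWN": 0.225,
}


def predict_margin(site):
    """Return the predicted operating margin for one site.

    `site` maps each variable name to its value, in the same units
    the slides use (e.g. OFFICE in '000s of sq ft).
    """
    return INTERCEPT + sum(COEFFICIENTS[name] * site[name] for name in COEFFICIENTS)


# Hypothetical site: these values are invented purely for illustration.
example_site = {
    "ROOMS": 3000,
    "NEAREST": 2.0,
    "OFFICE": 500,
    "COLLEGE": 10,
    "INCOME": 35,
    "DISTTWN": 5.0,
}

print(round(predict_margin(example_site), 3))
```

Keeping the coefficients in a dictionary keyed by variable name makes it harder to pair a value with the wrong slope than a positional list would.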

10 10 Example – Motel Profitability
Step #5: Assess the Model
We will add a number of steps and sub-steps to our model assessment process when using multiple regression. The assessment process becomes:
1a. $R^2$ (coefficient of determination)
1b. Adjusted $R^2$
2. F-test for overall validity of the model
3. t-test for each slope
   - using $b_i$ (the estimate of the slope)
   - Partial F-test to verify elimination of some independent variables

11 11 Assessing the Model (step #5)
1a. Coefficient of determination
The definition is
$R^2 = 1 - \frac{SSE}{SST}$
From the printout, $R^2 = 0.5251$: 52.51% of the variation in the measure of profitability is explained by the linear regression model formulated above.
Notice that we are not using SSR/SST. That version of the formula would still work for now, but it will not work once we introduce “Adjusted $R^2$.”
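
The following sketch checks the reported coefficient of determination, assuming the definition $R^2 = 1 - SSE/SST$. It borrows the full-model ANOVA figures quoted with the partial F-test later in this transcript (SSR = 3123.83, MSE = 30.38307) and recovers SSE and SST from them; the function name `r_squared` is hypothetical.

```python
def r_squared(sse, sst):
    """Coefficient of determination: the share of variation explained."""
    return 1.0 - sse / sst


n, k = 100, 6                  # sample size and number of independent variables
ssr_full = 3123.83             # regression sum of squares, full model
mse_full = 30.38307            # mean squared error, full model

sse_full = mse_full * (n - k - 1)   # SSE = MSE * (n - k - 1)
sst = ssr_full + sse_full           # SST = SSR + SSE

r2 = r_squared(sse_full, sst)
print(round(r2, 4))
```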

12 12 1b. The “Adjusted” Coefficient of Determination is defined as:
$\bar{R}^2 = 1 - \frac{SSE/(n-k-1)}{SST/(n-1)}$
As you add additional independent variables to your model, what happens to SST, SSR, and SSE? What happens to $R^2$? To $\bar{R}^2$?
If all you cared about was a model with a high $R^2$, you might be tempted to increase the number of independent variables almost irrespective of the amount of explanatory power each added.
Adj $R^2$ penalizes you a small amount for each additional independent variable you add. The new variable must add enough explanatory power to offset the lost degree of freedom before Adj $R^2$ will go up.
From the printout, Adj $R^2$ ($\bar{R}^2$) = 0.4944, or 49.44%: 49.44% of the variation in the measure of profitability is explained by the linear regression model formulated above after “adjusting for the degrees of freedom,” or the “number of independent variables.”
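
A short sketch of the adjustment, assuming the standard definition of adjusted $R^2$ rewritten in terms of $R^2$, the sample size n and the number of independent variables k. The function name is hypothetical; the inputs are the values reported for the motel example.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# Motel example: R^2 = 0.5251 with n = 100 inns and k = 6 variables.
adj = adjusted_r_squared(0.5251, n=100, k=6)
print(round(adj, 4))

# The penalty: the same R^2 with one more variable gives a lower adjusted R^2.
adj_extra = adjusted_r_squared(0.5251, n=100, k=7)
print(adj_extra < adj)
```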

13 13 2. The F-Test for Overall Validity of the Model
Recall that in conducting this test, we are posing the question: Is there at least one independent variable linearly related to the dependent variable?
To answer the question, we test the hypotheses:
$H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$
$H_1$: At least one $\beta_i$ is not equal to zero.
If at least one $\beta_i$ is not equal to zero, the model is valid.
The F test
- Construct the F statistic: $F = \frac{MSR}{MSE}$
- Reject $H_0$ if $F > F_{\alpha,\,k,\,n-k-1}$

14 14 Excel provides the following ANOVA results:
$F_{\alpha,\,k,\,n-k-1} = F_{0.05,\,6,\,100-6-1} = 2.197$
$F = 17.14 > 2.197$
Also, the p-value (Significance F) is $3.03382 \times 10^{-13}$. Clearly, $\alpha = 0.05 > 3.03382 \times 10^{-13}$, and the null hypothesis is rejected.
Conclusion: There is sufficient evidence to reject the null hypothesis in favor of the alternative hypothesis. At least one of the $\beta_i$ is not equal to zero. Thus, at least one independent variable is linearly related to y. This linear regression model is valid.
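
The overall F statistic can be reproduced from the reported $R^2$, since $F = MSR/MSE$ is algebraically the same as $(R^2/k) / ((1-R^2)/(n-k-1))$. The sketch below uses that identity with the motel figures and compares the result with the critical value quoted on the slide; the function name is hypothetical.

```python
def overall_f(r2, n, k):
    """F = MSR / MSE, expressed through R^2, n and k."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))


F_CRITICAL = 2.197   # F(0.05; 6, 93), as quoted on the slide

f_stat = overall_f(0.5251, n=100, k=6)
reject_h0 = f_stat > F_CRITICAL   # upper-tail test

print(round(f_stat, 2), reject_h0)
```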

15 15 Example – Motel Profitability
3a. Testing the coefficients
The hypotheses for each $\beta_i$:
$H_0: \beta_i = 0$
$H_1: \beta_i \neq 0$
Test statistic: $t = \frac{b_i - \beta_i}{s_{b_i}}$, with d.f. $= n - k - 1$
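
A minimal sketch of the coefficient test, assuming the usual statistic $t = (b_i - \beta_i)/s_{b_i}$ with $\beta_i = 0$ under the null hypothesis. The slope and standard error passed in are hypothetical (the transcript does not reproduce the printout's standard errors), and the critical value 1.986 is an assumed approximation of the two-tailed 5% point for 93 degrees of freedom, not a figure from the slides.

```python
def t_statistic(b_i, s_b_i, beta_null=0.0):
    """t = (b_i - beta_i) / s_{b_i}; beta_i is 0 under H0."""
    return (b_i - beta_null) / s_b_i


def degrees_of_freedom(n, k):
    """Degrees of freedom for the coefficient t-test."""
    return n - k - 1


def reject_null(t_stat, t_critical):
    """Two-tailed decision: reject H0 when |t| exceeds the critical value."""
    return abs(t_stat) > t_critical


df = degrees_of_freedom(n=100, k=6)
T_CRITICAL = 1.986   # assumption: approximate t(0.025; 93)

# Hypothetical slope estimate and standard error, for illustration only.
t_stat = t_statistic(b_i=0.5, s_b_i=0.2)
print(df, round(t_stat, 2), reject_null(t_stat, T_CRITICAL))
```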

16 16 Example – Motel Profitability
3b. Do the Variables Make Sense?
When you establish which variables you want to use, you should also establish your “a priori” assumptions regarding the expected sign of the slope coefficients. You do this prior to obtaining your actual model results so the actual numbers do not influence your expectations.
By establishing these expectations, you are better able to identify surprises in your results. These surprises may lead you to additional insight into your model, or may lead you to question your results. Either is useful.
We have already done this back on slide 6, so go back and find your original assumptions for this example.
$\text{Margin} = \beta_0 + \beta_1\,\text{Rooms} + \beta_2\,\text{Nearest} + \beta_3\,\text{Office} + \beta_4\,\text{College} + \beta_5\,\text{Income} + \beta_6\,\text{Disttwn}$

17 17 Example – Motel Profitability
- Intercept: This is the value of y when all the variables take the value zero. Since the data ranges of the independent variables do not cover the value zero, do not interpret the intercept.
- Rooms: In this model, for each additional 1000 rooms within 3 miles of the La Quinta inn, the operating margin decreases on average by 7.6% (assuming the other variables are held constant).
- Nearest: In this model, for each additional mile between the La Quinta inn and its nearest competitor, the average operating margin decreases by 1.65%. Sensible???

18 18 Example – Motel Profitability
- Office: For each additional 1000 sq ft of office space, the average increase in operating margin will be 0.02%.
- College: For each additional thousand students, MARGIN increases by 0.21%.
- Income: For each additional $1000 of median household income, MARGIN decreases by 0.41%???
- Disttwn: For each additional mile to the downtown center, MARGIN increases by 0.23% on average???

19 19 Example – Motel Profitability
- Based on the t-tests, one should consider getting rid of both “College” and “Disttwn.”
  - The sign on “Disttwn” is a bit unexpected as well, though if you try hard you could justify it. These two indications reinforce one another. Let’s get rid of it.
  - The sign on “College” is what you would expect, and its p-value, while not below 5%, is not that high. Let’s keep it for now, and see what happens when we eliminate “Disttwn.”
- While checking for assumption violations is officially a separate step, it is usually best to check your assumptions at this stage as well.
  - Recall how dramatically the model changed when we had autocorrelation.
  - Recall that serious multicollinearity could also lead us to get rid of some variables that we might really want to keep.

20 20 Notice that when we get rid of “Disttwn,” both $R^2$ and Adj $R^2$ went down, but the F stat went up. This is where the “art” comes in. Despite the decline in Adj $R^2$, we will eliminate “Disttwn” on the basis of the size of the p-value of the t-test, the sign being wrong, and the direction of the change in the F stat. You could successfully argue to keep it as well, based on Adj $R^2$. Notice the p-value on “College.”

21 21 When we got rid of “Disttwn,” the p-value for “College” actually increased, and it now isn’t all that close to 5%. Consequently, we’ll get rid of it. Once we do, we have a similar circumstance as last time regarding $R^2$, Adj $R^2$ and the F stat. This could go either way as well. In our case, we’ll get rid of “College” for now, do a Partial F-test, and see what that suggests we do about it.

22 22 Checking Assumptions
There is no particular order in which you should check the assumptions. We’ll check for multicollinearity first, because it is easy to do, and you are also able to look at the correlations between each independent variable and the dependent variable at the same time.
Checking Assumption #5 is most easily done using a correlation table. Notice that I have included all the variables in my original list. Among the independent variables, Office and Income show the largest correlation in absolute value, r = -0.15. This is quite low.
Notice that we ended up getting rid of the two variables that had the lowest absolute correlation with the dependent variable. This makes sense.
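
The correlation-table check can be sketched as below. The data columns are hypothetical placeholders (the transcript does not include the motel data set), and the 0.7 cutoff is an assumed rule of thumb rather than a threshold given in the slides.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def collinear_pairs(columns, threshold=0.7):
    """Return pairs of variables whose |r| exceeds the threshold."""
    names = list(columns)
    flagged = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            r = pearson_r(columns[first], columns[second])
            if abs(r) > threshold:
                flagged.append((first, second, r))
    return flagged


# Hypothetical columns, invented for illustration only.
data = {
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8.5],   # nearly a multiple of "a"
    "c": [1, -1, 1, -1],
}

for first, second, r in collinear_pairs(data):
    print(first, second, round(r, 3))
```

Passing only the independent variables to `collinear_pairs` checks for multicollinearity; including the dependent variable as one more column shows each predictor's correlation with it in the same pass.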

23 23 Checking Assumptions
Checking Assumptions 1 and 2, there are no obvious violations.
We won’t worry about Assumption #3, as this is cross-sectional data, not time-series.
We should also have taken care of Assumption #4 when we drew our graphs of each independent variable vs. the dependent variable. We only showed a few of these graphs, but at least in those cases, there did not appear to be a problem with outliers.

24 24 Assessing the Model (step #5)
3c. The Partial F-test (Wald test)
How does one decide how many variables to keep in the final model? Do you keep all the variables, or only some of them? While there is some “art” to this process, we will use the following procedure.
1. First, consider your individual t-test results. Which variables should you keep on this basis? Are there any variables that officially should be eliminated, but are close to having a small enough p-value to be retained? Are there any variables you believe strongly “must” be in the model irrespective of the results of the t-test?
2. Once you have made your decisions, conduct the “Partial F-test” to verify your results.

25 25 $H_0: \beta_1 = \beta_2 = \dots = \beta_i = 0$
$H_1$: At least one $\beta_i$ is not equal to zero.
Test statistic: $F = \frac{(SSR_f - SSR_r)/k_d}{MSE_f}$
Where: the $\beta_i$ refer only to those variables which were eliminated from the original regression; $SSR_f$ is from the full equation; $SSR_r$ is from the reduced equation; $MSE_f$ is from the full equation; $k_d$ is the number of variables eliminated.
The test statistic is determined by the difference in SSR (full model) vs. SSR (reduced model). If there is a large difference, some of the variables you eliminated have significant explanatory power. If this is the case, you will reject $H_0$, conclude that some coefficients of the variables you eliminated are non-zero, and use the “full model.”
This test will always be a one-sided upper-tail test by its nature.

26 26 Example – Motel Profitability
The ANOVA results for the reduced and the full models give the test statistic for the Partial F-test:
$F = \frac{(3123.83 - 3009.184)/2}{30.38307} = \frac{57.323}{30.38307} = 1.8867$

27 27 Example – Motel Profitability
Critical value: $F_{\alpha,\,k_d,\,n-k-1} = F_{0.05,\,2,\,100-6-1} = 3.095$
Test statistic: partial F = 1.8867
F = 1.8867 < 3.095; therefore, do not reject $H_0$.
Conclusion: There is insufficient evidence to reject the null hypothesis in favor of the alternative hypothesis. The coefficients of the independent variables eliminated from the regression do not appear to be different from 0, and hence those variables add no significant explanatory power. The reduced model appears to be the most appropriate model in this case.
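
The partial F computation for the motel example can be sketched as follows, assuming the statistic is the drop in SSR per eliminated variable divided by the full-model MSE. The function name is hypothetical; the sums of squares, the MSE and the critical value are the figures quoted on the slides.

```python
def partial_f(ssr_full, ssr_reduced, mse_full, k_dropped):
    """Partial F = [(SSR_f - SSR_r) / k_d] / MSE_f."""
    return ((ssr_full - ssr_reduced) / k_dropped) / mse_full


F_CRITICAL = 3.095   # F(0.05; 2, 93), as quoted on the slide

f_partial = partial_f(
    ssr_full=3123.83,
    ssr_reduced=3009.184,
    mse_full=30.38307,
    k_dropped=2,          # "College" and "Disttwn" were eliminated
)

# Upper-tail test: keep the reduced model unless F exceeds the critical value.
use_reduced_model = f_partial <= F_CRITICAL
print(round(f_partial, 4), use_reduced_model)
```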

28 28 Example – Partial F test
Assume you have conducted two regressions using the same data. The first regression, on the “full model,” had 9 independent variables and a sample size of 200. You then run a “reduced model” after eliminating 4 of the independent variables that appeared insignificant on the basis of t-tests.
Data for full model: SSR = 95,532; MSE = 654.
Data for reduced model: SSR = 7,978; MSE = 13,431.
Conduct a partial F-test. $F_{0.05,\,4,\,190} = 2.41918485$.
Conduct the same test, this time assuming the SSR from your reduced model was 92,300.

29 29 Example – Motel Profitability
Step #6: Diagnose Violations of Required Conditions
We already did this in concert with Step #5, and that is the way you really should do it. You cannot effectively assess the model without having considered whether the assumptions have been violated. We separate them into steps only because both are so critical to constructing a useful regression model. Having to combine these critical steps is another way in which the “art” of regression analysis becomes obvious.

30 30 Example – Motel Profitability
Step #7: Assess the Model
We now have our final model. You should be able to do the assessment on your own at this stage.

31 31 Example – Motel Profitability
Step #8: Use the Model
Use the model to predict the profit margin of three possible locations. What are your expectations for profit margins in each location? Where should we recommend that La Quinta locate the next motel? What seem to be the deciding factors in this case?

Characteristics          Athens (OH)   Bloomington (IN)   Miami (OH)
Rooms                    2672          2,500              2,300
Competitor Distance      1.3           1.2                0.5
Office Space (‘000s)     952           1430               655
Students (‘000s)         17            21                 15
Income (‘000s)           35            37                 33.5
Dist to Downtown         3.4           4.5                1.4

