MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS


2 MULTIPLE REGRESSION
With multiple regression, we can analyze the association between more than one independent variable and our dependent variable.

3 MULTIPLE REGRESSION
Focus On: Calculations
Coefficient estimates output: the coefficient estimates are both positive, indicating that increases in the DTI and in open lines are associated with increases in the loan rate. But only the DTI is significant at the 95% or better level, as indicated by its t-stat of 4.1068. A 1% increase in the debt-to-income ratio leads to a 75.76 bp increase in the loan rate, holding the number of open lines constant.

             Coefficient   Standard Error   t-Stat   p-Value
Intercept       0.0066         0.0352       0.1879    0.8563
DTI             0.7576         0.1845       4.1068    0.0045
Open lines      0.0059         0.0052       1.1429    0.2906
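The t-stats in the table are simply each coefficient divided by its standard error. A minimal Python sketch using the (rounded) table values; small differences from the printed t-stats come from that rounding:

```python
# t-statistic for each estimate: t = coefficient / standard error
# (coefficient, standard error) pairs taken from the table above
estimates = {
    "Intercept":  (0.0066, 0.0352),
    "DTI":        (0.7576, 0.1845),
    "Open lines": (0.0059, 0.0052),
}

t_stats = {name: coef / se for name, (coef, se) in estimates.items()}
print(t_stats["DTI"])         # ~4.11: significant
print(t_stats["Open lines"])  # ~1.13: not significant
```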

4 MULTIPLE REGRESSION
Focus On: Hypothesis Testing
We can test the hypothesis that the true population slope coefficient for the association between open lines and loan rate is zero.
1. Formulate hypothesis → H0: b2 = 0 versus Ha: b2 ≠ 0 (a two-tailed test)
2. Identify appropriate test statistic → t = (b̂2 − 0)/s(b̂2), with n − k − 1 degrees of freedom
3. Specify the significance level → 0.05, leading to a critical value of 2.4469
4. Collect data and calculate test statistic → t = 1.1429 (from the regression output)
5. Make the statistical decision → Fail to reject the null because 1.1429 < 2.4469
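The five steps above reduce to a single comparison once the numbers are in hand. A short sketch of the decision rule, using the slide's values:

```python
# Two-tailed t-test of H0: b2 = 0 at the 5% level.
b2_hat, se_b2 = 0.0059, 0.0052   # estimate and standard error from the output
t_stat = b2_hat / se_b2          # ~1.13 (the slide reports 1.1429 from unrounded data)
t_crit = 2.4469                  # two-tailed 5% critical value given on the slide

reject_null = abs(t_stat) > t_crit
print(reject_null)  # False -> fail to reject H0: b2 = 0
```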

5 MULTIPLE REGRESSION
Focus On: The p-Value Approach
p-Values appear in reference to the coefficient estimates on the regression output. For the coefficient estimates, we would fail to reject a null hypothesis of a zero parameter value for b0 at any α level above α = 0.8563, for b1 at any level above α = 0.0045, and for b2 at any level above α = 0.2906. Conventionally accepted α levels are 0.1, 0.05, and 0.01, which leads us to reject the null hypothesis of a zero parameter value only for b1 and conclude that only b1 is statistically significantly different from zero at generally accepted levels.

             Coefficient   Standard Error   t-Stat   p-Value
Intercept       0.0066         0.0352       0.1879    0.8563
DTI             0.7576         0.1845       4.1068    0.0045
Open lines      0.0059         0.0052       1.1429    0.2906

6 MULTIPLE REGRESSION ASSUMPTIONS
Multiple linear regression has the same underlying assumptions as single independent variable linear regression, plus some additional ones.
1. The relationship between the dependent variable, Y, and the independent variables (X1, X2, ..., Xk) is linear.
2. The independent variables (X1, X2, ..., Xk) are not random. Also, no exact linear relation exists between two or more of the independent variables.
3. The expected value of the error term, conditioned on the independent variables, is zero.
4. The variance of the error term is the same for all observations.
5. The error term is uncorrelated across observations: E(εi εj) = 0, j ≠ i.
6. The error term is normally distributed.

7 MULTIPLE REGRESSION PREDICTED VALUES
Focus On: Calculations
Returning to our multiple linear regression, what loan rate would we expect for a borrower with an 18% DTI and 3 open lines of credit?
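The prediction plugs the borrower's values into the fitted equation. A quick sketch using the coefficient estimates from the earlier output:

```python
# Predicted loan rate: y_hat = b0 + b1*DTI + b2*(open lines)
b0, b1, b2 = 0.0066, 0.7576, 0.0059   # estimates from the regression output
dti, open_lines = 0.18, 3             # the borrower in question

loan_rate = b0 + b1 * dti + b2 * open_lines
print(round(loan_rate, 4))  # 0.1607 -> an expected loan rate of about 16.07%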

8 UNCERTAINTY IN LINEAR REGRESSION
There are two sources of uncertainty in linear regression models:
1. Uncertainty associated with the random error term.
- The random error term itself contains uncertainty, which can be estimated from the standard error of the estimate for the regression equation.
2. Uncertainty associated with the parameter estimates.
- The estimated parameters also contain uncertainty because they are only estimates of the true underlying population parameters.
- For a single independent variable, as covered in the prior chapter, estimates of this uncertainty can be obtained.
- For multiple independent variables, the matrix algebra necessary to obtain such estimates is beyond the scope of this text.

9 MULTIPLE REGRESSION: ANOVA
Focus On: Regression Output
The analysis of variance section of the output provides the F-test for the hypothesis that all the coefficient estimates are jointly zero. The high value of this F-test leads us to reject the null that all the coefficients are jointly zero, concluding that at least one coefficient estimate is nonzero. Combined with the coefficient estimates, this model suggests that the loan rate is fairly well described by the level of the client's debt-to-income ratio, but that the number of outstanding open lines does not make a strong contribution to that understanding.

             df      SS      MSS       F      Significance F
Regression    2   0.0120   0.0060   9.6104        0.0098
Residual      7   0.0044   0.0006
Total         9   0.0164

10 F-TEST
Focus On: Calculations
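The F-statistic on the ANOVA slide can be reproduced from the sums of squares. A sketch using the rounded SS values from the table (which give F ≈ 9.55; the output's 9.6104 reflects unrounded data):

```python
# F = (RSS / k) / (SSE / (n - k - 1)) for the joint null that all slopes are zero
rss, sse = 0.0120, 0.0044   # regression and residual sums of squares from the ANOVA table
k, n = 2, 10                # number of slope coefficients and observations

f_stat = (rss / k) / (sse / (n - k - 1))
f_crit = 4.74               # approximate F(0.05; 2, 7) from a standard F table
print(f_stat > f_crit)      # True -> reject the joint null
```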

11 R² AND ADJUSTED R²
Focus On: Regression Output
Regression specification output from our example regression provides:
- Multiple R, the correlation coefficient for the degree of association between the independent variables and the dependent variable.
- R², our familiar coefficient of determination → the independent variables explain 73.30% of the variation in the dependent variable.
- Adjusted R², a more appropriate measure when multiple independent variables are present; here it is 65.68%.

Regression Statistics
Multiple R       0.8562
R²               0.7330
Adjusted R²      0.6568
Standard Error   0.0250
Observations     10
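The adjusted R² on the output follows directly from R², n, and k. A minimal sketch with the values from the table:

```python
# Adjusted R^2 penalizes added regressors:
# adj R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)
r2, n, k = 0.7330, 10, 2

adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(adj_r2, 4))  # 0.6567, matching the output's 65.68% up to rounding
```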

12 INDICATOR VARIABLES
Often called “dummy variables,” indicator variables are used to capture qualitative aspects of the hypothesized relationship.
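As a sketch of how such a variable is constructed: a qualitative trait is coded 1/0 so it can enter the regression like any other regressor. The borrower records and the "short-term" trait below are hypothetical, purely for illustration:

```python
# Build an indicator (dummy) variable from a qualitative attribute.
# The data and the "rating" attribute are hypothetical examples.
borrowers = [
    {"dti": 0.18, "rating": "short-term"},
    {"dti": 0.25, "rating": "long-term"},
    {"dti": 0.31, "rating": "short-term"},
]

for b in borrowers:
    # 1 if the qualitative condition holds, 0 otherwise
    b["str_dummy"] = 1 if b["rating"] == "short-term" else 0

print([b["str_dummy"] for b in borrowers])  # [1, 0, 1]
```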

13 INDICATOR VARIABLES
Focus On: Regression Output

             Coefficient   Standard Error   t-Stat    p-Value
Intercept      –0.0138         0.0324      –0.4252     0.6855
DTI             0.6117         0.1781       3.4340     0.0139
Open lines      0.0265         0.0121       2.1958     0.0705
STR            –0.0681         0.0371      –1.8367     0.1159

             df      SS      MSS       F      Significance F
Regression    3   0.0136   0.0045   9.7037        0.0102
Residual      6   0.0028   0.0005
Total         9   0.0164

Regression Statistics
Multiple R       0.8562
R²               0.7330
Adjusted R²      0.6568

14 VIOLATIONS: HETEROSKEDASTICITY
The variance of the errors differs across observations (Assumption 4). There are two types of heteroskedasticity:
- Unconditional heteroskedasticity, which presents no problems for statistical inference, and
- Conditional heteroskedasticity, wherein the error variance is correlated with the independent variable values.
  - Parameter estimates are still consistent.
  - The F-test and t-tests are unreliable.

15 VIOLATIONS: SERIAL CORRELATION
There is correlation between the error terms (Assumption 5). The focus in this chapter is the case in which there is serial correlation but no lagged values of the dependent variable as independent variable(s).
- Parameter estimates are consistent, but the standard errors are incorrect.
- The F-test and t-tests are likely inflated with positive serial correlation, the most common case with financial variables.
- Parameter estimates remain consistent as long as there are no lagged values of the dependent variable among the independent variables.
- If there are lagged values as independent variables:
  - Coefficient estimates are inconsistent.
  - This is the statistical arena of time series (Chapter 10).

16 TESTING AND CORRECTING FOR VIOLATIONS
There are well-established tests for serial correlation and heteroskedasticity, as well as ways to correct for their impact.
Testing for:
- Heteroskedasticity → use the Breusch–Pagan test.
- Serial correlation → use the Durbin–Watson test.
Correcting for:
- Heteroskedasticity → use robust standard errors or generalized least squares.
  - Use White standard errors.
- Serial correlation → use the Hansen correction.
  - This also corrects for heteroskedasticity.
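The Breusch–Pagan idea can be sketched directly: regress the squared residuals on the regressor(s); under homoskedasticity, n times the R² of that auxiliary regression is approximately chi-square with k degrees of freedom. A one-regressor sketch on hypothetical residuals whose spread grows with x:

```python
# Breusch-Pagan sketch with one regressor (hypothetical toy data).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
resid = [0.1, -0.2, 0.15, -0.1, 0.3, -0.4, 0.5, -0.6]  # magnitudes grow with x

e2 = [e * e for e in resid]                 # squared residuals
n = len(x)
mx, me2 = sum(x) / n, sum(e2) / n
cov = sum((xi - mx) * (ei - me2) for xi, ei in zip(x, e2)) / n
vx = sum((xi - mx) ** 2 for xi in x) / n
ve2 = sum((ei - me2) ** 2 for ei in e2) / n
r2 = cov ** 2 / (vx * ve2)                  # R^2 of the one-regressor auxiliary regression
bp_stat = n * r2                            # the Breusch-Pagan LM statistic

chi2_crit = 3.84                            # chi-square(1) critical value at 5%
print(bp_stat > chi2_crit)                  # True -> evidence of conditional heteroskedasticity
```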

17 TESTING FOR SERIAL CORRELATION
Focus On: Calculating the Durbin–Watson Statistic
[Figure: decision zones for the Durbin–Watson statistic, with d_l = 1.63 and d_u = 1.72. Values below d_l fall in the rejection zone for positive serial correlation, values between d_l and d_u are inconclusive, and sufficiently large values fall in the rejection zone for negative serial correlation.]
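The statistic itself is DW = Σ(e_t − e_{t−1})² / Σe_t², with values near 2 suggesting no first-order serial correlation. A sketch on hypothetical residuals, using the slide's critical bounds:

```python
# Durbin-Watson statistic from a series of regression residuals (hypothetical data).
resid = [0.12, -0.08, 0.05, -0.11, 0.09, -0.04, 0.07, -0.10]

dw = (sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
      / sum(e ** 2 for e in resid))

d_l, d_u = 1.63, 1.72   # critical bounds from the slide
if dw < d_l:
    verdict = "reject: positive serial correlation"
elif dw > 4 - d_l:
    verdict = "reject: negative serial correlation"
elif dw <= d_u or dw >= 4 - d_u:
    verdict = "inconclusive"
else:
    verdict = "fail to reject"
print(dw, verdict)  # ~3.01: the alternating residuals suggest negative serial correlation
```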

18 VIOLATIONS: MULTICOLLINEARITY
Multicollinearity occurs when two or more independent variables, or combinations of independent variables, are highly (but not perfectly) correlated with each other (Assumption 2).
Common with financial data:
- Estimates are still consistent, but imprecise and unreliable.
- One indicator that you may have a collinearity problem is the presence of a significant F-test but no (or few) significant t-tests.
There is no easy correction for this violation; you may have to drop variables.
- The “story” here is critical.
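One common diagnostic (not mentioned on the slide, but standard) is the variance inflation factor, VIF_j = 1/(1 − R_j²), where R_j² comes from regressing regressor j on the others. With a single other regressor, R_j² is just the squared correlation between the two. A sketch on hypothetical, deliberately collinear data:

```python
# VIF sketch for two regressors (hypothetical, highly correlated data).
x1 = [0.18, 0.25, 0.31, 0.22, 0.28, 0.35]   # e.g. DTI-like values
x2 = [2, 3, 4, 3, 4, 5]                      # e.g. open-lines-like values

n = len(x1)
m1, m2 = sum(x1) / n, sum(x2) / n
cov = sum((a - m1) * (b - m2) for a, b in zip(x1, x2)) / n
v1 = sum((a - m1) ** 2 for a in x1) / n
v2 = sum((b - m2) ** 2 for b in x2) / n
r2 = cov ** 2 / (v1 * v2)   # squared correlation = R^2 of the auxiliary regression
vif = 1 / (1 - r2)

print(vif > 10)  # True here; VIF above ~10 is a common rule of thumb for trouble
```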

19 SUMMARIZING VIOLATIONS AND SOLUTIONS

Problem              Effect                        Solution
Heteroskedasticity   Incorrect standard errors     Use robust standard errors
Serial correlation   Incorrect standard errors*    Use robust standard errors
Multicollinearity    High R² and low t-stats       No theory-based solution

20 MODEL SPECIFICATION
Models should:
- Be grounded in financial or economic reasoning.
- Have variables that are an appropriate functional form for their nature.
- Have specifications that are parsimonious.
- Be in compliance with the regression assumptions.
- Be tested out-of-sample before applying them to decisions.

21 MODEL MISSPECIFICATIONS
A model is misspecified when it violates the assumptions underlying linear regression, its functional form is incorrect, or it contains time-series specification problems. Generally, model misspecification can result in invalid statistical inference when we are using linear regression. Misspecification has a number of possible sources:
1. Misspecified functional form can arise from several possible problems:
- Omitted variable bias.
- Incorrectly represented variables.
- Data that are pooled but should not be.
2. Error term correlation with independent variables can arise from:
- Lagged values of the dependent variable as independent variables.
- Measurement error in the independent variables.
- Independent variables that are functions of the dependent variable.

22 AVOIDING MISSPECIFICATION
If independent or dependent variables are nonlinear, use an appropriate transformation to make them linear.
- For example, use common-size statements or log-based transformations.
Avoid independent variables that are mathematical transformations of dependent variables.
Don’t include spurious independent variables (no data mining).
Perform diagnostic tests for violations of the linear regression assumptions.
- If violations are found, use appropriate corrections.
Validate model estimations out-of-sample when possible.
Ensure that data come from a single underlying population.
- The data collection process should be grounded in good sampling practice.

23 QUALITATIVE DEPENDENT VARIABLES
The dependent variable of interest may be a categorical variable representing the state of the subject we are analyzing. Dependent variables that take on ordinal or nominal values are better estimated using models developed for qualitative analysis.
- This approach is the dependent-variable analog to indicator (dummy) variables as independent variables.
Three broad categories:
1. Probit: Based on the normal distribution, it estimates the probability of the dependent variable outcome.
2. Logit: Based on the logistic distribution, it also estimates the probability of the dependent variable outcome.
3. Discriminant analysis: It estimates a linear function, which can then be used to assign the observation to the underlying categories.
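For the logit case, the fitted model maps a linear index into a probability via the logistic function, p = 1/(1 + exp(−(b0 + b1·x))). A sketch with hypothetical coefficients, e.g. the probability of loan default as a function of DTI:

```python
import math

# Logit sketch: turn a linear index into a probability.
b0, b1 = -4.0, 12.0   # hypothetical fitted logit coefficients
dti = 0.30

p_default = 1 / (1 + math.exp(-(b0 + b1 * dti)))
print(round(p_default, 3))  # 0.401; the logistic map always lands in (0, 1)
```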

24 ECONOMIC MEANING AND MULTIPLE REGRESSION

             Coefficient   Standard Error   t-Stat    p-Value
Intercept      –0.0138         0.0324      –0.4252     0.6855
DTI             0.6117         0.1781       3.4340     0.0139
Open lines      0.0265         0.0121       2.1958     0.0705
STR            –0.0681         0.0371      –1.8367     0.1159

             df      SS      MSS       F      Significance F
Regression    3   0.0136   0.0045   9.7037        0.0102
Residual      6   0.0028   0.0005
Total         9   0.0164

Regression Statistics
Multiple R       0.8562
R²               0.7330
Adjusted R²      0.6568

25 SUMMARY
We are often interested in the relationship between more than two financial variables, and multiple linear regression allows us to model such relationships and subject our beliefs about them to rigorous testing.
Financial data often exhibit characteristics that violate the underlying assumptions necessary for linear regression and its associated hypothesis tests to be meaningful. The main violations are:
- Serial correlation.
- Conditional heteroskedasticity.
- Multicollinearity.
We can test for each of these conditions and correct our estimations and hypothesis tests to account for their effects.

