# Multiple regression and issues in regression analysis

## Presentation on theme: "Multiple regression and issues in regression analysis". Presentation transcript:

Multiple regression and issues in regression analysis
At the outset of this lecture, point out that the results contained in the tables were determined using a statistical package within Excel, while the presentation shows various variables/estimates rounded off to four decimal places for the sake of clarity. In several places, this will lead to the full 32-bit precision result differing slightly from the value you will get if you replicate the indicated formula calculations.

Multiple regression
With multiple regression, we can analyze the association between more than one independent variable and our dependent variable. Returning to our analysis of the determinants of loan rates, we also believe that the number of lines of credit the client currently employs is related to the loan rate charged. Accordingly, we model a multiple linear regression of the relationship:

General form → $Y_i = b_0 + b_1 X_{1,i} + b_2 X_{2,i} + \dots + b_k X_{k,i} + \varepsilon_i$

Specific form → $\text{Loan rate}_i = b_0 + b_1\,\text{DTI}_i + b_2\,\text{Open lines}_i + \varepsilon_i$

where DTI is the debt-to-income ratio and Open lines is the number of existing lines of credit the client already possesses.

LOS: Formulate a multiple regression equation to describe the relationship between a dependent variable and several independent variables, determine the statistical significance of each independent variable, and interpret the estimated coefficients.
Pages 325–326

It is rare to run a regression with only a single independent variable, although less so in investments than in corporate finance. Accordingly, the issues surrounding multiple independent variables are particularly important, as are the specialized topics in this chapter.

Multiple regression
Focus On: Calculations

Coefficient estimates output

| | Coefficients | Standard Error | t-Stat | p-Value |
|---|---|---|---|---|
| Intercept | 0.0066 | 0.0352 | 0.1879 | 0.8563 |
| DTI | 0.7576 | 0.1845 | 4.1068 | 0.0045 |
| Open lines | 0.0059 | 0.0052 | 1.1429 | 0.2906 |

The coefficient estimates are both positive, indicating that increases in the DTI and open lines are associated with increases in the loan rate. But only the DTI is significant at the 5% level or better (95% confidence), as indicated by its t-stat of 4.1068. A 1 percentage point increase in the debt-to-income ratio leads to a 75.76 bp increase in loan rate, *holding the number of open lines constant*.

LOS: Formulate a multiple regression equation to describe the relationship between a dependent variable and several independent variables, determine the statistical significance of each independent variable, and interpret the estimated coefficients.
Pages 325–331

Please note the qualifier in italics. There is something of a philosophical gulf between statisticians and many practitioners on how to deal with coefficient estimates that are insignificant at the specified level of confidence. Strictly speaking, they are statistically zero and, as such, have no statistically discernible effect on the dependent variable. However, they do affect the estimated coefficient for those independent variables whose coefficient estimates are significant (note the change in both size and significance on the DTI relative to the estimation in the prior chapter), so practitioners will almost always include them in any resulting presentation of the relationship and determination of forecasts.
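Each t-stat in the coefficient table is the estimate divided by its standard error. The following is a minimal Python sketch using the rounded table values, so the results differ slightly from the reported full-precision t-stats. The names `estimates`, `t_stat`, and `dti_effect_bp` are illustrative, and the basis-point conversion assumes that both DTI and the loan rate are expressed as decimals.

```python
# Rounded (coefficient, standard error) pairs copied from the regression output.
estimates = {
    "Intercept": (0.0066, 0.0352),
    "DTI": (0.7576, 0.1845),
    "Open lines": (0.0059, 0.0052),
}


def t_stat(coefficient, standard_error, hypothesized=0.0):
    """t = (estimate - hypothesized value) / standard error."""
    return (coefficient - hypothesized) / standard_error


t_stats = {name: t_stat(b, se) for name, (b, se) in estimates.items()}
for name, t in t_stats.items():
    print(f"{name:<10s} t = {t:.4f}")

# Effect of a one-percentage-point (0.01) rise in DTI, in basis points,
# assuming DTI and the loan rate are both measured as decimals.
dti_effect_bp = estimates["DTI"][0] * 0.01 * 10_000
print(f"DTI effect: {dti_effect_bp:.2f} bp")
```

With the rounded inputs, the open lines t-stat comes out near 1.13 rather than the reported 1.1429, which is the rounding mismatch the lecture notes warn about.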

Multiple regression
Focus On: Hypothesis Testing

We can test the hypothesis that the true population slope coefficient for the association between open lines and loan rate is zero.

1. Formulate hypothesis → $H_0: b_2 = 0$ versus $H_a: b_2 \ne 0$ (a two-tailed test)
2. Identify appropriate test statistic → $t = \dfrac{\hat{b}_2 - b_2}{s_{\hat{b}_2}}$
3. Specify the significance level → 0.05, leading to a critical value of 2.365 (two-tailed, 7 degrees of freedom)
4. Collect data and calculate test statistic → $t = \dfrac{0.0059 - 0}{0.0052} = 1.1429$*
5. Make the statistical decision → Fail to reject the null because 1.1429 < 2.365

LOS: Formulate a null and an alternative hypothesis about the population value of a regression coefficient, calculate the value of the test statistic, determine whether the null hypothesis is rejected at a given level of significance using a one-tailed or two-tailed test, and interpret the result of the test.
Pages 325–331

The degrees of freedom for this test are n − (k + 1), or 10 − (2 + 1) = 7, which matches the residual degrees of freedom in the regression output. It is worth noting that this critical value is different from the CV in the Chapter 8 test of a similar hypothesis. As we increase the number of independent variables (k), the cutoff of the t-distribution for a fixed sample size and significance level will increase. Because the sample size here is quite small, the increase is quite large.

*Slight differences from rounding lead to a mismatch between the full-precision number and the number one will get from "plugging in" to this exact formula.

Multiple regression
Focus On: The p-Value Approach

| | Coefficients | Standard Error | t-Stat | p-Value |
|---|---|---|---|---|
| Intercept | 0.0066 | 0.0352 | 0.1879 | 0.8563 |
| DTI | 0.7576 | 0.1845 | 4.1068 | 0.0045 |
| Open lines | 0.0059 | 0.0052 | 1.1429 | 0.2906 |

p-Values appear in reference to the coefficient estimates on the regression output. The p-value is the smallest α level at which the null hypothesis can be rejected. For the coefficient estimates, we would reject a null hypothesis of a zero parameter value for b0 only at an α level above α = 0.8563, for b1 at any level above α = 0.0045, and for b2 only at a level above α = 0.2906. Conventionally accepted α levels are 0.1, 0.05, and 0.01, which leads us to reject the null hypothesis of a zero parameter value only for b1 and conclude that only b1 is statistically significantly different from zero at generally accepted levels.

LOS: Interpret the p-values of a multiple regression output.
Pages 325–331

The p-value approach has, for the most part, come to dominate statistical reporting because it doesn't apply the "razor's edge" of a preselected critical value, and allows the consumer of the research to decide which relationships are statistically important.
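The p-value decision rule can be written out explicitly: reject the null of a zero parameter whenever the p-value is below the chosen α. This is a small sketch using the p-values from the output above; the names `p_values`, `reject_null`, and `decisions` are illustrative.

```python
# p-values copied from the regression output.
p_values = {"b0": 0.8563, "b1": 0.0045, "b2": 0.2906}

# Conventionally accepted significance levels.
alphas = (0.10, 0.05, 0.01)


def reject_null(p_value, alpha):
    """Reject H0 (parameter equals zero) when the p-value is below alpha."""
    return p_value < alpha


decisions = {
    name: {alpha: reject_null(p, alpha) for alpha in alphas}
    for name, p in p_values.items()
}

for name, by_alpha in decisions.items():
    summary = ", ".join(
        f"alpha={alpha:.2f}: {'reject' if rejected else 'fail to reject'}"
        for alpha, rejected in by_alpha.items()
    )
    print(f"{name}: {summary}")
```

Only b1 is rejected at all three conventional levels, which matches the conclusion on the slide.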

Multiple regression assumptions
Multiple linear regression has the same underlying assumptions as single independent variable linear regression and some additional ones.

1. The relationship between the dependent variable, Y, and the independent variables (X1, X2, …, Xk) is linear.
2. The independent variables (X1, X2, …, Xk) are not random. **Also, no exact linear relation exists between two or more of the independent variables.**
3. The expected value of the error term, conditioned on the independent variables, is zero.
4. The variance of the error term is the same for all observations.
5. The error term is uncorrelated across observations: $E(\varepsilon_i \varepsilon_j) = 0,\ j \ne i$.
6. The error term is normally distributed.

LOS: Explain the assumptions of a multiple regression model.
Pages 331–332

The boldface type indicates one significant departure from the assumptions of the single independent variable model. It is from this requirement that problems with "multicollinearity" come, which are frequently encountered in corporate research.

Multiple regression predicted values
Focus On: Calculations

Returning to our multiple linear regression, what loan rate would we expect for a borrower with an 18% DTI and 3 open lines of credit?

$\widehat{\text{Loan rate}}_i = 0.0066 + 0.7576(0.18) + 0.0059(3) = 0.1607$

LOS: Calculate a predicted value for the dependent variable given an estimated regression model and assumed values for the independent variables.
Pages 336–337
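The predicted value is the estimated equation evaluated at the borrower's characteristics. This sketch plugs in the rounded coefficient estimates from the regression output; `predict_loan_rate` is an illustrative name, and DTI is assumed to be entered as a decimal (0.18 for 18%).

```python
def predict_loan_rate(dti, open_lines, b0=0.0066, b1=0.7576, b2=0.0059):
    """Predicted loan rate from the estimated two-variable regression.

    Default coefficients are the rounded estimates from the output;
    dti is a decimal (0.18 means 18%).
    """
    return b0 + b1 * dti + b2 * open_lines


predicted = predict_loan_rate(dti=0.18, open_lines=3)
print(round(predicted, 4))  # 0.1607
```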

Uncertainty in linear regression
There are two sources of uncertainty in linear regression models:

1. Uncertainty associated with the random error term. The random error term itself contains uncertainty, which can be estimated from the standard error of the estimate for the regression equation.
2. Uncertainty associated with the parameter estimates. The estimated parameters also contain uncertainty because they are only estimates of the true underlying population parameters.
   - For a single independent variable, as covered in the prior chapter, estimates of this uncertainty can be obtained.
   - For multiple independent variables, the matrix algebra necessary to obtain such estimates is beyond the scope of this text.

LOS: Discuss the types of uncertainty involved in regression model predictions.
Pages 336–337

The second uncertainty, referred to as "parameter estimation error," is particularly problematic in the forecasting arena and determines the statistical properties of a forecast. Frequently, practitioners will attempt to use the model to forecast and then use the standard deviation of the errors as their standard deviation for the forecast. This approach ignores the impact of parameter uncertainty on the statistical properties of the forecasts, thereby overstating their significance and making confidence intervals too narrow.

Multiple regression: ANOVA
Focus On: Regression Output

| | df | SS | MSS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 2 | 0.0120 | 0.0060 | 9.6104 | 0.0098 |
| Residual | 7 | 0.0044 | 0.0006 | | |
| Total | 9 | 0.0164 | | | |

The analysis of variance section of the output provides the F-test for the hypothesis that all the slope coefficients are jointly zero. The high value of this F-test leads us to reject the null that all the coefficients are jointly zero, concluding that at least one coefficient is nonzero. Combined with the coefficient estimates, this model suggests that the loan rate is fairly well described by the level of the debt-to-income ratio for the client, but that the number of outstanding open lines does not make a strong contribution to that understanding.

LOS: Infer how well a regression model explains the dependent variable by analyzing the output of the regression equation and an ANOVA table.
Pages 338–340

F-test
Focus On: Calculations

The F-test for a multiple regression determines whether the slope coefficients, taken together simultaneously as a group, are all zero. The test statistic is

$F = \dfrac{\text{RSS}/k}{\text{SSE}/[n-(k+1)]} = \dfrac{\text{MSR}}{\text{MSE}}$

where RSS is the regression sum of squares, SSE is the sum of squared errors (residuals), k is the number of slope coefficients, and n is the number of observations.

From our regression output, this is $0.0060/0.0006 = 9.6104$*, which is greater than the critical value for an $F(0.05, 2, 7) = 4.74$, leading us to reject the null hypothesis of all slope coefficients being equal to zero.

LOS: Define, calculate, and interpret the F-statistic and discuss how it is used in regression analysis.
Pages 338–340

*Slight differences from rounding lead to a mismatch between the full-precision number and the number one will get from "plugging in" to this exact formula.
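The F-statistic can be reproduced from the ANOVA sums of squares. This sketch uses the rounded values shown in the ANOVA output (regression SS 0.0120, residual SS 0.0044, n = 10, k = 2), so it gives roughly 9.55 rather than the reported 9.6104. The function name `f_statistic` is illustrative.

```python
def f_statistic(rss, sse, n, k):
    """F = (RSS / k) / (SSE / (n - (k + 1))) = MSR / MSE.

    rss: regression (explained) sum of squares
    sse: sum of squared errors (residual sum of squares)
    n:   number of observations
    k:   number of slope coefficients
    """
    msr = rss / k
    mse = sse / (n - (k + 1))
    return msr / mse


# Rounded sums of squares from the ANOVA output.
f_rounded = f_statistic(rss=0.0120, sse=0.0044, n=10, k=2)
print(f"F from rounded inputs: {f_rounded:.4f}")
```

The gap between this value and the reported statistic comes purely from the four-decimal rounding of the sums of squares.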

R² and adjusted R²
Focus On: Regression Output

Regression specification output from our example regression provides

| Regression Statistics | |
|---|---|
| Multiple R | 0.8562 |
| R² | 0.7330 |
| Adjusted R² | 0.6568 |
| Standard Error | 0.0250 |
| Observations | 10 |

- Multiple R is the correlation coefficient for the degree of association between the independent variables and the dependent variable.
- R² is our familiar coefficient of determination → the independent variables explain 73.3% of the variation in the dependent variable.
- Adjusted R² is a more appropriate measure of fit when there are multiple independent variables because it accounts for their number; here it is 65.68%.

LOS: Define, distinguish between, and interpret R² and adjusted R² in multiple regression.
Pages 340–341

Note that by "penalizing" the estimation for additional independent variables, adjusted R² reduces the incentive to load the "kitchen sink" into the estimation in the hopes of getting significant results, which would actually be spurious rather than meaningful.
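The penalty for extra independent variables can be seen in the usual adjusted R² formula, 1 − [(n − 1)/(n − k − 1)](1 − R²). This sketch applies it to the reported R² of 0.7330 with n = 10 and k = 2, and also recovers R² from the rounded ANOVA sums of squares. Function names are illustrative, and small differences from the reported figures are due to rounding.

```python
def r_squared(regression_ss, total_ss):
    """Share of total variation explained by the regression."""
    return regression_ss / total_ss


def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 = 1 - (n - 1) / (n - k - 1) * (1 - R^2)."""
    return 1 - (n - 1) / (n - k - 1) * (1 - r2)


# R^2 from the rounded ANOVA sums of squares (regression 0.0120, total 0.0164).
r2_from_anova = r_squared(0.0120, 0.0164)

# Adjusted R^2 from the reported R^2, with 10 observations and 2 regressors.
adj_r2 = adjusted_r_squared(0.7330, n=10, k=2)

print(f"R^2 from ANOVA: {r2_from_anova:.4f}")
print(f"Adjusted R^2:   {adj_r2:.4f}")
```

Holding R² fixed, raising k lowers the adjusted figure, which is the "penalty" described in the notes.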

Indicator variables
Often called "dummy variables," indicator variables are used to capture qualitative aspects of the hypothesized relationship. Consider that a reliance on short-term sources of financing is also generally believed to be associated with more risky borrowers. The indicator variable, STR, for short-term reliance is coded as a "1" when borrowers have predominantly used lines of credit as existing borrowing and "0" otherwise. The hypothesized relationship is now

$\text{Loan rate}_i = b_0 + b_1\,\text{DTI}_i + b_2\,\text{Open lines}_i + b_3\,\text{STR}_i + \varepsilon_i$

LOS: Formulate a multiple regression equation using dummy variables to represent qualitative factors, and interpret the coefficients and regression results.
Page 341

The estimated slope coefficient on an indicator variable can be interpreted as the difference in the intercept for those observations possessing the attribute captured when the indicator variable is coded as a "1." Note that the use of a series of indicator variables must always have an "omitted class"; in this case, it is no reliance on short-term borrowing, captured as a "0." If there is no omitted class, we will violate the noncollinearity assumption of multiple regression estimation.

Indicator variables
Focus On: Regression Output

| Regression Statistics | |
|---|---|
| Multiple R | 0.8562 |
| R² | 0.7330 |
| Adjusted R² | 0.6568 |

| | df | SS | MSS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 3 | 0.0136 | 0.0045 | 9.7037 | 0.0102 |
| Residual | 6 | 0.0028 | 0.0005 | | |
| Total | 9 | 0.0164 | | | |

| | Coefficients | Standard Error | t-Stat | p-Value |
|---|---|---|---|---|
| Intercept | −0.0138 | 0.0324 | −0.4252 | 0.6855 |
| DTI | 0.6117 | 0.1781 | 3.4340 | 0.0139 |
| Open lines | 0.0265 | 0.0121 | 2.1958 | 0.0705 |
| STR | −0.0681 | 0.0371 | −1.8367 | 0.1159 |

LOS: Formulate a multiple regression equation using dummy variables to represent qualitative factors, and interpret the coefficients and regression results.
Page 341

At this stage, there should be enough knowledge to walk you through this output.

- The variation in the independent variables explains about 65.68% of the variation in the dependent variable.
- The F-test is statistically significant, indicating the beta coefficients are not all statistically zero.
- The coefficient on the DTI is positive and statistically significant at generally accepted levels.
- The coefficient estimate on open lines is positive and statistically significant only at the 10% level.
- The coefficient estimate on STR is negative and not statistically significant at generally accepted levels. It is almost significant at the 10% level, suggesting some explanatory validity. It should be noted that it is of the opposite sign to that hypothesized.
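The "difference in the intercept" interpretation of the indicator coefficient can be checked directly: two otherwise identical borrowers who differ only in STR have predicted rates that differ by exactly the STR coefficient. This sketch uses the rounded estimates from the output above; the borrower profile (18% DTI, 3 open lines) and the names `COEFFICIENTS` and `predict_loan_rate` are assumptions for illustration.

```python
# Rounded coefficient estimates from the three-variable regression output.
COEFFICIENTS = {
    "intercept": -0.0138,
    "dti": 0.6117,
    "open_lines": 0.0265,
    "str": -0.0681,
}


def predict_loan_rate(dti, open_lines, short_term_reliance):
    """Predicted loan rate; short_term_reliance is the 0/1 STR indicator."""
    return (
        COEFFICIENTS["intercept"]
        + COEFFICIENTS["dti"] * dti
        + COEFFICIENTS["open_lines"] * open_lines
        + COEFFICIENTS["str"] * short_term_reliance
    )


# Hypothetical borrower profile: 18% DTI and 3 open lines of credit.
rate_without_str = predict_loan_rate(0.18, 3, short_term_reliance=0)
rate_with_str = predict_loan_rate(0.18, 3, short_term_reliance=1)
difference = rate_with_str - rate_without_str

print(f"STR = 0: {rate_without_str:.4f}")
print(f"STR = 1: {rate_with_str:.4f}")
print(f"Difference: {difference:.4f}")
```

The difference equals the STR coefficient regardless of the DTI and open lines values chosen, because the indicator only shifts the intercept.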

Violations: Heteroskedasticity
The variance of the errors differs across observations (Assumption 4).

There are two types of heteroskedasticity:

- Unconditional heteroskedasticity, which presents no problems for statistical inference, and
- Conditional heteroskedasticity, wherein the error variance is correlated with the independent variable values.
  - Parameter estimates are still consistent.
  - F-test and t-tests are unreliable.

LOS: Describe conditional and unconditional heteroskedasticity and discuss their effects on statistical inference.
Page 345

We can sometimes see the effect of conditional heteroskedasticity by examining a scatter plot of the dependent variable versus the independent variable we believe has a conditionally heteroskedastic relationship, with the regression line overlaid on the plot. If the error terms (the distances between the points and the regression line) differ systematically in magnitude, we may have conditional heteroskedasticity. It should be pointed out, however, that visual inspection will not always reveal the condition, nor is it likely to do so when we have more than one independent variable.

Violations: serial correlation
There is correlation between the error terms (Assumption 5).

- The focus in this chapter is the case in which there is serial correlation but no lagged values of the dependent variable as independent variable(s).
  - Parameter estimates are consistent, but the standard errors are incorrect.
  - The F-test and t-tests are likely inflated with positive serial correlation, the most common case with financial variables.
- Parameter estimates are still consistent as long as there are no lagged values of the dependent variable as independent variables.
- If there are lagged values as independent variables:
  - Coefficient estimates are inconsistent.
  - This is the statistical arena of time series (Chapter 10).

LOS: Describe serial correlation and discuss its effects on statistical inference.
Pages 351–352

It should be fairly intuitive that using lagged values of the dependent variable as independent variables is unlikely to allow them to be "independent." The case we usually deal with here is first-order serial correlation, which can be thought of as the case when the sign of the error term is persistent (a negative error is likely to be followed by another negative error).

Testing and correcting for violations
There are well-established tests for serial correlation and heteroskedasticity, as well as ways to correct for their impact.

Testing for

- Heteroskedasticity → Use the Breusch–Pagan test
- Serial correlation → Use the Durbin–Watson test

Correcting for

- Heteroskedasticity → Use robust standard errors or generalized least squares
  - Use White standard errors
- Serial correlation → Use the Hansen correction
  - This also corrects for heteroskedasticity.

LOS: Explain how to test and correct for heteroskedasticity and serial correlation.
Pages 348–350; 353–355

Most of the tests and corrections for these violations are available with standard statistical packages, but have to be invoked as options (they are not generally default tests and corrections).
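To make the Breusch–Pagan idea concrete, here is a minimal pure-Python sketch for the single-regressor case: regress the squared residuals on the independent variable and use n × R² from that auxiliary regression as the test statistic, which is compared with a chi-square critical value with one degree of freedom (3.841 at the 5% level). The data are simulated and purely hypothetical, the helper names are illustrative, and this n × R² form is one common version of the test; a statistical package would normally be used in practice.

```python
import random


def simple_ols(x, y):
    """Fit y = a + b*x by least squares; return (a, b, residuals, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sst = sum((yi - mean_y) ** 2 for yi in y)
    sse = sum(e ** 2 for e in residuals)
    r_squared = 1 - sse / sst if sst > 0 else 0.0
    return a, b, residuals, r_squared


def breusch_pagan(x, y):
    """n * R^2 from regressing squared residuals on x (one regressor)."""
    _, _, residuals, _ = simple_ols(x, y)
    squared_residuals = [e ** 2 for e in residuals]
    _, _, _, auxiliary_r2 = simple_ols(x, squared_residuals)
    return len(x) * auxiliary_r2


# Hypothetical data whose error spread grows with x (conditional
# heteroskedasticity by construction).
random.seed(0)
x = [i / 10 for i in range(1, 101)]
y = [1 + 2 * xi + random.gauss(0, 0.5 * xi) for xi in x]

CHI2_CRITICAL_5PCT_1DF = 3.841  # chi-square critical value, 1 df, 5% level
bp_stat = breusch_pagan(x, y)
print(f"Breusch-Pagan statistic: {bp_stat:.2f}")
print("Reject homoskedasticity" if bp_stat > CHI2_CRITICAL_5PCT_1DF
      else "Fail to reject homoskedasticity")
```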

Testing for serial correlation
Focus On: Calculating the Durbin–Watson Statistic

You have recently estimated a regression model with 100 observations and two independent variables. Using the estimated errors, you have determined that the correlation between the error term and a first lagged value of the error term is 0.16. Do the observations exhibit positive serial correlation?

The test statistic is $DW \approx 2(1 - r) = 2(1 - 0.16) = 1.68$.

The critical values from Appendix E are dl = 1.63 and du = 1.72. Because 1.63 < 1.68 < 1.72, the statistic falls in the inconclusive region: we cannot reject the null hypothesis of no positive serial correlation, but we cannot rule positive serial correlation out either.

[Figure: Durbin–Watson decision regions. Rejection zone for positive serial correlation below dl = 1.63; inconclusive between dl = 1.63 and du = 1.72.]

LOS: Calculate and interpret a Durbin–Watson statistic.
Pages 353–355

The book emphasizes the positive serial correlation case because that is what we most often deal with in finance and because its effects are potentially more problematic. The important thing to emphasize is that between the d's, we have an inconclusive test. Below the lower d, we reject the null in favor of positive serial correlation; above the upper d, we fail to reject the null of no positive serial correlation. Negative serial correlation is tested symmetrically, with rejection when the statistic exceeds 4 − dl.
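The Durbin–Watson calculation and decision rule can be sketched as follows. The approximation DW ≈ 2(1 − r) and the critical values dl = 1.63 and du = 1.72 come from the example; the function names, the decision labels, and the toy residual series are illustrative assumptions.

```python
def durbin_watson(residuals):
    """DW = sum of squared successive differences / sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


def dw_from_correlation(r):
    """Large-sample approximation: DW is roughly 2 * (1 - r)."""
    return 2 * (1 - r)


def positive_serial_correlation_decision(dw, d_lower, d_upper):
    """Classify a DW statistic in a test for positive serial correlation."""
    if dw < d_lower:
        return "reject null: positive serial correlation"
    if dw <= d_upper:
        return "inconclusive"
    return "fail to reject null"


# Values from the example: r = 0.16, dl = 1.63, du = 1.72.
dw = dw_from_correlation(0.16)
decision = positive_serial_correlation_decision(dw, d_lower=1.63, d_upper=1.72)
print(f"DW = {dw:.2f} -> {decision}")

# Toy residuals that alternate in sign push DW well above 2.
print(durbin_watson([1.0, -1.0, 1.0, -1.0]))  # 3.0
```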

Violations: multicollinearity
Multicollinearity occurs when two or more independent variables or combinations of independent variables are highly (but not perfectly) correlated with each other (Assumption 2).

- Common with financial data.
- Estimates are still consistent, but imprecise and unreliable.
- One indicator that you may have a collinearity problem is the presence of a significant F-test but no (or few) significant t-tests.
- There is no easy solution to correct the violation; you may have to drop variables.
- The "story" here is critical.

LOS: Describe multicollinearity and discuss its effects on statistical inference.
Pages 356–358

Multicollinearity can be between two independent variables in isolation, or between several independent variables in combination. The "dummy variables problem," in which we must create an "omitted class," occurs because of the necessity that independent variables be noncollinear.
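For the simplest case of two independent variables in isolation, a quick screen is their pairwise correlation. The sketch below computes it for hypothetical DTI and open lines observations; the data and the helper name `pearson_correlation` are made up for illustration. As noted above, collinearity among combinations of variables will not necessarily show up in pairwise correlations.

```python
import math


def pearson_correlation(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variation_x = sum((a - mean_x) ** 2 for a in x)
    variation_y = sum((b - mean_y) ** 2 for b in y)
    return covariation / math.sqrt(variation_x * variation_y)


# Hypothetical observations on two independent variables.
dti = [0.10, 0.15, 0.18, 0.22, 0.25, 0.30]
open_lines = [1, 2, 2, 3, 4, 5]

r = pearson_correlation(dti, open_lines)
print(f"Correlation between DTI and open lines: {r:.4f}")
```

A correlation this close to 1 would make it hard to separate the two variables' effects, which is the pattern of a significant F-test with few significant t-tests described on the slide.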

Summarizing Violations and Solutions

| Problem | Effect | Solution |
|---|---|---|
| Heteroskedasticity | Incorrect standard errors | Use robust standard errors |
| Serial correlation | Incorrect standard errors* | Use robust standard errors |
| Multicollinearity | High R² and low t-stats | No theory-based solution |

Pages 345–359

*Unless we have lagged values of the dependent variable as independent variables → Use a time-series model (Chapter 10).

Model Specification
Models should

- Be grounded in financial or economic reasoning.
- Have variables that are an appropriate functional form for their nature.
- Have specifications that are parsimonious.
- Be in compliance with the regression assumptions.
- Be tested out-of-sample before applying them to decisions.

LOS: Explain the principles of model specification.
Pages 359–360

The most overlooked feature of any set of empirical tests is probably the appropriateness of the model and variable specifications.

Model Misspecifications
A model is misspecified when it violates the assumptions underlying linear regression, its functional form is incorrect, or it contains time-series specification problems. Generally, model misspecification can result in invalid statistical inference when we are using linear regression.

Misspecification has a number of possible sources:

- Misspecified functional form can arise from several possible problems:
  - Omitted variable bias.
  - Incorrectly represented variables.
  - Data that are pooled when they should not be.
- Error term correlation with independent variables can arise from:
  - Lagged values of the dependent variable as independent variables.
  - Measurement error in the independent variables.
  - Independent variables that are functions of the dependent variable.

LOS: Define misspecification and discuss its effects on the results of a regression analysis.
Pages 360–367

There is a large amount of material here and this is a potentially deep topic. It should be emphasized that this is a brief coverage and that the book contains a deeper coverage well worth reading.

Avoiding misspecification
- If independent or dependent variables are nonlinear, use an appropriate transformation to make them linear. For example, use common size statements or log-based transformations.
- Avoid independent variables that are mathematical transformations of dependent variables.
- Don't include spurious independent variables (no data mining).
- Perform diagnostic tests for violations of the linear regression assumptions. If violations are found, use appropriate corrections.
- Validate model estimations out-of-sample when possible.
- Ensure that data come from a single underlying population. The data collection process should be grounded in good sampling practice.

LOS: Explain how to avoid the common forms of misspecification in a regression analysis.
Pages 360–367

Again, this is a very deep topic and these slides just touch the surface.

Qualitative dependent variables
The dependent variable of interest may be a categorical variable representing the state of the subject we are analyzing.

- Dependent variables that take on ordinal or nominal values are better estimated using models developed for qualitative analysis.
- This approach is the dependent variable analog to indicator (dummy) variables as independent variables.

Three broad categories

- Probit: Based on the normal distribution, it estimates the probability of the dependent variable outcome.
- Logit: Based on the logistic distribution, it also estimates the probability of the dependent variable outcome.
- Discriminant analysis: It estimates a linear function, which can then be used to assign the observation to the underlying categories.

LOS: Discuss models for qualitative dependent variables.
Page 372

There are entire texts written on the use of qualitative dependent variables, and one should seek out more information before trying to estimate such models. There are a host of new and different assumptions, diagnostic tests, and so on. This discussion should be presented as informing of the existence of such models and providing basic information to guide further search if there is a need to use such models.
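To illustrate how probit and logit turn a linear index into a probability, the sketch below maps a hypothetical index b0 + b1 × DTI through the standard normal CDF (probit) and the logistic function (logit). The coefficients, the "default" interpretation, and all names are invented for illustration; estimating such models requires maximum likelihood methods that are beyond this sketch.

```python
import math
from statistics import NormalDist


def logit_probability(index):
    """Logistic function: maps a linear index to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-index))


def probit_probability(index):
    """Standard normal CDF: maps a linear index to a probability in (0, 1)."""
    return NormalDist().cdf(index)


# Hypothetical linear index for the probability of default: b0 + b1 * DTI.
B0, B1 = -3.0, 8.0  # made-up coefficients, not estimates from the lecture

for dti in (0.10, 0.25, 0.40):
    index = B0 + B1 * dti
    print(
        f"DTI={dti:.2f}  index={index:+.2f}  "
        f"logit={logit_probability(index):.3f}  "
        f"probit={probit_probability(index):.3f}"
    )
```

Both functions rise monotonically with the index and keep predictions between 0 and 1, which a linear regression on a 0/1 dependent variable does not guarantee.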

Economic meaning and multiple regression

| Regression Statistics | |
|---|---|
| Multiple R | 0.8562 |
| R² | 0.7330 |
| Adjusted R² | 0.6568 |

| | df | SS | MSS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 3 | 0.0136 | 0.0045 | 9.7037 | 0.0102 |
| Residual | 6 | 0.0028 | 0.0005 | | |
| Total | 9 | 0.0164 | | | |

| | Coefficients | Standard Error | t-Stat | p-Value |
|---|---|---|---|---|
| Intercept | −0.0138 | 0.0324 | −0.4252 | 0.6855 |
| DTI | 0.6117 | 0.1781 | 3.4340 | 0.0139 |
| Open lines | 0.0265 | 0.0121 | 2.1958 | 0.0705 |
| STR | −0.0681 | 0.0371 | −1.8367 | 0.1159 |

LOS: Interpret the economic meaning of a significant multiple regression.
Pages 325–331

As noted earlier, interpretation of coefficient estimates in a multiple linear regression should be performed with care. The rate of change in the dependent variable resulting from a change in one independent variable holds the other independent variables constant, a condition that may not be reasonable relative to the actual occurrence of events leading to the data used. With that caveat, attendees should be at the stage where they can interpret these coefficients with guidance from the instructor and should be led to do so.

- Overall, the regression appears well specified (high adjusted R², highly significant F-test).
- Economically, it shows that borrowers with more lines of credit and higher levels of debt to income are charged higher loan rates.
- But there is some evidence from this sample that borrowers with a greater reliance on short-term debt have lower average loan rates than those with less reliance on short-term debt, although the statistical value of this is in question.
- A 1 percentage point increase in the borrower debt-to-income ratio leads to a 61.17 bp higher loan rate, holding short-term reliance on debt and the number of open lines constant.

Summary
- We are often interested in the relationship between more than two financial variables, and multiple linear regression allows us to model such relationships and subject our beliefs about them to rigorous testing.
- Financial data often exhibit characteristics that violate the underlying assumptions necessary for linear regression and its associated hypothesis tests to be meaningful.
- The main violations are
  - Serial correlation.
  - Conditional heteroskedasticity.
  - Multicollinearity.
- We can test for each of these conditions and correct our estimations and hypothesis tests to account for their effects.