Presentation on theme: "Multiple regression and issues in regression analysis"— Presentation transcript:
1. Multiple regression and issues in regression analysis
At the outset of this lecture, point out that the results in the tables were computed with a statistical package within Excel, while the presentation shows variables and estimates rounded to four decimal places for clarity. In several places, the full-precision result will therefore differ slightly from the value you get if you replicate the indicated formula calculations with the rounded inputs.
2. Multiple regression
With multiple regression, we can analyze the association between more than one independent variable and our dependent variable.
Returning to our analysis of the determinants of loan rates, we also believe that the number of lines of credit the client currently employs is related to the loan rate charged. Accordingly, we model the relationship as a multiple linear regression:
General form: $Y_i = b_0 + b_1 X_{1,i} + b_2 X_{2,i} + \dots + b_k X_{k,i} + \varepsilon_i$
Specific form: $\text{Loan rate}_i = b_0 + b_1\,\text{DTI}_i + b_2\,\text{Open lines}_i + \varepsilon_i$
where DTI is the debt-to-income ratio and Open lines is the number of existing lines of credit the client already possesses.
LOS: Formulate a multiple regression equation to describe the relationship between a dependent variable and several independent variables, determine the statistical significance of each independent variable, and interpret the estimated coefficients.
Pages 325–326
It is rare to run a regression with only a single independent variable, although less so in investments than in corporate finance. Accordingly, the issues surrounding multiple independent variables are particularly important, as are the specialized topics in this chapter.
3. Multiple regression
Focus On: Calculations
Coefficient estimates output

             Coefficients   Standard Error   t-Stat   p-Value
Intercept    0.0066         0.0352           0.1879   0.8563
DTI          0.7576         0.1845           4.1068   0.0045
Open lines   0.0059         0.0052           1.1429   0.2906

The coefficient estimates are both positive, indicating that increases in the DTI and open lines are associated with increases in the loan rate. But only the DTI is significant at the 95% or better level, as indicated by its t-stat of 4.1068.
A 1% increase in the debt-to-income ratio leads to a 75.76 bp increase in loan rate (0.7576 × 0.01 = 0.007576), holding the number of open lines constant.
LOS: Formulate a multiple regression equation to describe the relationship between a dependent variable and several independent variables, determine the statistical significance of each independent variable, and interpret the estimated coefficients.
Pages 325–331
Please note the qualifier shown in italics on the slide: “holding the number of open lines constant.”
There is something of a philosophical gulf between statisticians and many practitioners on how to deal with coefficient estimates that are insignificant at the specified level of confidence. Strictly speaking, they are statistically zero and, as such, have no statistically discernible effect on the dependent variable. However, they do affect the estimated coefficients of those independent variables whose estimates are significant (note the change in both size and significance on the DTI relative to the estimation in the prior chapter), so practitioners will almost always include them in any resulting presentation of the relationship and determination of forecasts.
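The interpretation of the DTI slope can be checked with a few lines of arithmetic. The sketch below assumes the rounded coefficient from the output table and that DTI and the loan rate are both expressed as decimals (so a 1% change in DTI is 0.01); the function name is illustrative and not part of the original material.

```python
# Rounded DTI slope coefficient as shown in the regression output table.
B_DTI = 0.7576

def marginal_effect_bp(coefficient, change_in_x):
    """Change in the predicted loan rate, in basis points, for a given change
    in one independent variable, holding the other variables constant."""
    return coefficient * change_in_x * 10_000  # 1 bp = 0.0001

# Assumption: a "1% increase" in DTI means +0.01 in decimal terms.
dti_effect_bp = marginal_effect_bp(B_DTI, 0.01)
print(f"{dti_effect_bp:.2f} bp")
```

Because the slope is a partial effect, the same function applied to the open lines coefficient would describe one additional line of credit with DTI held fixed.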
4. Multiple regression
Focus On: Hypothesis Testing
We can test the hypothesis that the true population slope coefficient for the association between open lines and loan rate is zero.
- Formulate hypothesis: H0: b2 = 0 versus Ha: b2 ≠ 0 (a two-tailed test).
- Identify appropriate test statistic: $t = \dfrac{\hat{b}_2 - b_2}{s_{\hat{b}_2}}$, with n − (k + 1) degrees of freedom.
- Specify the significance level: 0.05, leading to a critical value of ±2.365.
- Collect data and calculate test statistic: t = (0.0059 − 0)/0.0052 = 1.1429.*
- Make the statistical decision: Fail to reject the null because 1.1429 < 2.365.
LOS: Formulate a null and an alternative hypothesis about the population value of a regression coefficient, calculate the value of the test statistic, determine whether the null hypothesis is rejected at a given level of significance using a one-tailed or two-tailed test, and interpret the result of the test.
Pages 325–331
The degrees of freedom for this test are n − (k + 1), or 10 − (2 + 1) = 7, which matches the residual degrees of freedom in the ANOVA output. It is worth noting that the resulting critical value differs from the one in the Chapter 8 test of a similar hypothesis. As we increase the number of independent variables (k), the cutoff of the t-distribution for a fixed sample size and significance level will increase. Because the sample size here is quite small, the increase is quite large.
*Slight differences from rounding lead to a mismatch between the full-precision number and the number one will get from “plugging in” to this exact formula.
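The test on the open lines coefficient can be reproduced as a short sketch. It uses the rounded coefficient and standard error from the output table, so the statistic comes out slightly below the reported 1.1429. The critical value 2.365 is an assumption taken from a standard t-table (5% two-tailed, 7 degrees of freedom, where 7 = n − (k + 1) with n = 10 and k = 2, matching the residual df in the ANOVA output); the function names are illustrative.

```python
def t_statistic(estimate, hypothesized, std_error):
    """t-statistic for testing a single regression coefficient."""
    return (estimate - hypothesized) / std_error

def reject_two_tailed(t_stat, critical_value):
    """Reject H0 when |t| exceeds the two-tailed critical value."""
    return abs(t_stat) > critical_value

N_OBS, K = 10, 2                # observations and independent variables
DF = N_OBS - (K + 1)            # degrees of freedom for the t-test
T_CRIT = 2.365                  # assumption: t-table value, 5% two-tailed, 7 df

# Rounded inputs from the table: open lines coefficient and standard error.
t_open_lines = t_statistic(0.0059, 0.0, 0.0052)
reject_null = reject_two_tailed(t_open_lines, T_CRIT)

print(f"df = {DF}, t = {t_open_lines:.4f}, reject H0: {reject_null}")
```

With these inputs the null of a zero slope on open lines is not rejected, which agrees with the p-value of 0.2906 reported in the output.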
5. Multiple regression
Focus On: The p-Value Approach

             Coefficients   Standard Error   t-Stat   p-Value
Intercept    0.0066         0.0352           0.1879   0.8563
DTI          0.7576         0.1845           4.1068   0.0045
Open lines   0.0059         0.0052           1.1429   0.2906

p-Values appear in reference to the coefficient estimates on the regression output. For the coefficient estimates, we would fail to reject a null hypothesis of a zero parameter value for b0 at any α level below α = 0.8563, for b1 at any level below α = 0.0045, and for b2 at any level below α = 0.2906.
Conventionally accepted α levels are 0.1, 0.05, and 0.01, which leads us to reject the null hypothesis of a zero parameter value only for b1 and conclude that only b1 is statistically significantly different from zero at generally accepted levels.
LOS: Interpret the p-values of a multiple regression output.
Pages 325–331
The p-value approach has, for the most part, come to dominate statistical reporting because it doesn’t apply the “razor’s edge” of a preselected critical value, and it allows the consumer of the research to decide which relationships are statistically important.
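The decision rule behind the p-value approach is simply "reject the null when the p-value is smaller than the chosen significance level." A minimal sketch, using the p-values from the output table and the three conventional levels (the dictionary layout and function name are illustrative assumptions):

```python
# p-values as reported in the regression output.
P_VALUES = {"Intercept": 0.8563, "DTI": 0.0045, "Open lines": 0.2906}
ALPHAS = (0.10, 0.05, 0.01)  # conventional significance levels

def significant_at(p_value, alpha):
    """True when the null of a zero coefficient is rejected at level alpha."""
    return p_value < alpha

decisions = {
    name: {alpha: significant_at(p, alpha) for alpha in ALPHAS}
    for name, p in P_VALUES.items()
}

for name, by_alpha in decisions.items():
    rejected = [f"{a:.2f}" for a, is_sig in by_alpha.items() if is_sig]
    print(f"{name}: reject H0 at alpha in {rejected}")
```

Only the DTI coefficient is flagged at any of the three levels, which is the conclusion drawn on the slide.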
6. Multiple regression assumptions
Multiple linear regression has the same underlying assumptions as single independent variable linear regression and some additional ones.
1. The relationship between the dependent variable, Y, and the independent variables (X1, X2, …, Xk) is linear.
2. The independent variables (X1, X2, …, Xk) are not random. Also, no exact linear relation exists between two or more of the independent variables.
3. The expected value of the error term, conditioned on the independent variables, is zero.
4. The variance of the error term is the same for all observations.
5. The error term is uncorrelated across observations: E(εi εj) = 0, j ≠ i.
6. The error term is normally distributed.
LOS: Explain the assumptions of a multiple regression model.
Pages 331–332
The second sentence of assumption 2 (boldface on the slide) is the one significant departure from the assumptions of the single independent variable model. It is from this requirement that problems with “multicollinearity” come, which are frequently encountered in corporate research.
7. Multiple regression predicted values
Focus On: Calculations
Returning to our multiple linear regression, what loan rate would we expect for a borrower with an 18% DTI and 3 open lines of credit?
$\widehat{\text{Loan rate}}_i = 0.0066 + 0.7576(0.18) + 0.0059(3) = 0.1607$
LOS: Calculate a predicted value for the dependent variable given an estimated regression model and assumed values for the independent variables.
Pages 336–337
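The predicted value can be reproduced by plugging the assumed inputs into the estimated equation. This sketch uses the rounded coefficients from the earlier output table; the function and dictionary names are illustrative.

```python
# Rounded coefficient estimates from the two-variable regression output.
COEFFICIENTS = {"Intercept": 0.0066, "DTI": 0.7576, "Open lines": 0.0059}

def predict_loan_rate(dti, open_lines, coef=COEFFICIENTS):
    """Predicted loan rate (as a decimal) for a given DTI and number of open lines."""
    return coef["Intercept"] + coef["DTI"] * dti + coef["Open lines"] * open_lines

# Borrower with an 18% debt-to-income ratio and 3 open lines of credit.
predicted = predict_loan_rate(dti=0.18, open_lines=3)
print(f"Predicted loan rate: {predicted:.4f}")
```

The result is a point forecast only; it carries both error-term uncertainty and parameter-estimation uncertainty.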
8. Uncertainty in linear regression
There are two sources of uncertainty in linear regression models:
- Uncertainty associated with the random error term. The random error term itself contains uncertainty, which can be estimated from the standard error of the estimate for the regression equation.
- Uncertainty associated with the parameter estimates. The estimated parameters also contain uncertainty because they are only estimates of the true underlying population parameters.
  - For a single independent variable, as covered in the prior chapter, estimates of this uncertainty can be obtained.
  - For multiple independent variables, the matrix algebra necessary to obtain such estimates is beyond the scope of this text.
LOS: Discuss the types of uncertainty involved in regression model predictions.
Pages 336–337
The second uncertainty, referred to as “parameter estimation error,” is particularly problematic in the forecasting arena and determines the statistical properties of a forecast. Frequently, practitioners will attempt to use the model to forecast and then use the standard deviation of the errors as their standard deviation for the forecast. This approach ignores the impact of parameter uncertainty on the statistical properties of the forecasts, thereby overstating their significance and making confidence intervals too narrow.
9. Multiple regression: ANOVA
Focus On: Regression Output

             df   SS       MSS      F        Significance F
Regression   2    0.0120   0.0060   9.6104   0.0098
Residual     7    0.0044   0.0006
Total        9    0.0164

The analysis of variance section of the output provides the F-test for the hypothesis that all the slope coefficients are jointly zero. The high value of this F-test leads us to reject the null that all the slope coefficients are jointly zero, concluding that at least one of them is nonzero.
Combined with the coefficient estimates, this model suggests that the loan rate is fairly well described by the level of the debt-to-income ratio for the client, but that the number of outstanding open lines does not make a strong contribution to that understanding.
LOS: Infer how well a regression model explains the dependent variable by analyzing the output of the regression equation and an ANOVA table.
Pages 338–340
10. F-test
Focus On: Calculations
The F-test for a multiple regression determines whether the slope coefficients, taken together simultaneously as a group, are all zero. The test statistic is
$F = \dfrac{\text{RSS}/k}{\text{SSE}/[n - (k + 1)]} = \dfrac{\text{MSR}}{\text{MSE}}$
From our regression output, this is F = 0.0060/0.0006 = 9.6104,* which is greater than the critical value F(0.05, 2, 7) = 4.74, leading us to reject the null hypothesis of all slope coefficients being equal to zero. The denominator degrees of freedom are n − (k + 1) = 10 − 3 = 7, as in the ANOVA table.
LOS: Define, calculate, and interpret the F-statistic and discuss how it is used in regression analysis.
Pages 338–340
*Slight differences from rounding lead to a mismatch between the full-precision number and the number one will get from “plugging in” to this exact formula.
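The F-statistic can be recomputed from the sums of squares in the ANOVA table. The sketch below uses the rounded table values (regression SS 0.0120, residual SS 0.0044, n = 10, k = 2), so the result differs slightly from the reported 9.6104. The critical value 4.74 is an assumption taken from a standard F-table at the 5% level with 2 and 7 degrees of freedom; the function name is illustrative.

```python
def f_statistic(rss, sse, n, k):
    """F = (RSS / k) / (SSE / (n - (k + 1))) = MSR / MSE."""
    msr = rss / k                # mean regression sum of squares
    mse = sse / (n - (k + 1))    # mean squared error
    return msr / mse

F_CRIT = 4.74  # assumption: F-table value, 5% level, (2, 7) degrees of freedom

# Rounded sums of squares from the ANOVA output.
f_stat = f_statistic(rss=0.0120, sse=0.0044, n=10, k=2)
reject_joint_null = f_stat > F_CRIT

print(f"F = {f_stat:.4f}, reject joint null: {reject_joint_null}")
```

The gap between the recomputed and reported statistic comes entirely from rounding the sums of squares to four decimal places.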
11. R2 and Adjusted R2
Focus On: Regression Output
Regression specification output from our example regression provides

Regression Statistics
Multiple R       0.8562
R2               0.7330
Adjusted R2      0.6568
Standard Error   0.0250
Observations     10

- Multiple R is the correlation coefficient for the degree of association between the independent variables and the dependent variable.
- R2 is our familiar coefficient of determination: the independent variables explain 73.3% of the variation in the dependent variable.
- Adjusted R2 is a more appropriate measure of fit when there are multiple independent variables because it accounts for their number; here it is 65.68%.
LOS: Define, distinguish between, and interpret R2 and adjusted R2 in multiple regression.
Pages 340–341
Note that by “penalizing” the estimation for additional independent variables, adjusted R2 reduces the incentive to load the “kitchen sink” into the estimation in the hopes of getting significant results, which would actually be spurious rather than meaningful.
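The three fit statistics are linked by simple formulas, which the following sketch checks against the reported output. It assumes the standard definitions R2 = regression SS / total SS and adjusted R2 = 1 − [(n − 1)/(n − k − 1)](1 − R2), with n = 10 and k = 2; because the inputs are rounded, the results match the table only to about three decimal places.

```python
import math

def r_squared(regression_ss, total_ss):
    """Share of total variation explained by the regression."""
    return regression_ss / total_ss

def adjusted_r_squared(r2, n, k):
    """R-squared penalized for the number of independent variables k."""
    return 1 - (n - 1) / (n - k - 1) * (1 - r2)

# Rounded inputs from the ANOVA table and the regression statistics.
r2_from_anova = r_squared(0.0120, 0.0164)
adj_r2 = adjusted_r_squared(0.7330, n=10, k=2)
multiple_r = math.sqrt(0.7330)

print(f"R2 from ANOVA: {r2_from_anova:.4f}")
print(f"Adjusted R2:   {adj_r2:.4f}")
print(f"Multiple R:    {multiple_r:.4f}")
```

Adding a variable always raises R2 or leaves it unchanged, but it raises adjusted R2 only if the improvement in fit outweighs the lost degree of freedom.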
12. Indicator variables
Often called “dummy variables,” indicator variables are used to capture qualitative aspects of the hypothesized relationship.
Consider that a reliance on short-term sources of financing is also generally believed to be associated with more risky borrowers. The indicator variable, STR, for short-term reliance is coded as a “1” when borrowers have predominantly used lines of credit as existing borrowing and “0” otherwise. The hypothesized relationship is now
$\text{Loan rate}_i = b_0 + b_1\,\text{DTI}_i + b_2\,\text{Open lines}_i + b_3\,\text{STR}_i + \varepsilon_i$
LOS: Formulate a multiple regression equation using dummy variables to represent qualitative factors, and interpret the coefficients and regression results.
Page 341
The estimated slope coefficient on an indicator variable can be interpreted as the difference in the intercept for those observations possessing the attribute captured when the indicator variable is coded as a “1.” Note that the use of a series of indicator variables must always have an “omitted class”: in this case, no reliance on short-term borrowing, captured as a “0.” If there is no omitted class, we will violate the noncollinearity assumption of multiple regression estimation.
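The coding rule and the need for an omitted class can be illustrated with a small sketch. The borrower labels below are hypothetical and are not the data behind the slides; the helper function is likewise an illustrative assumption. The second half shows why keeping every class is a problem: the indicator columns then sum to 1 in every row, which duplicates the intercept column exactly.

```python
def encode_indicators(values, categories, omitted=None):
    """Return one 0/1 column per category, skipping the omitted class."""
    kept = [c for c in categories if c != omitted]
    return {c: [1 if v == c else 0 for v in values] for c in kept}

# Hypothetical financing profiles for four borrowers (illustrative only).
financing = ["short-term", "other", "short-term", "other"]
categories = ["short-term", "other"]

# Correct coding: "other" is the omitted class, leaving a single STR column.
indicators = encode_indicators(financing, categories, omitted="other")
STR = indicators["short-term"]

# Coding with no omitted class: the columns add up to the intercept column.
full = encode_indicators(financing, categories)
row_sums = [sum(col[i] for col in full.values()) for i in range(len(financing))]

print("STR column:", STR)
print("Row sums with no omitted class:", row_sums)
```

With m categories, m − 1 indicator variables are enough; the coefficient on each one measures the intercept shift relative to the omitted class.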
13. Indicator variables
Focus On: Regression Output

Regression Statistics
Multiple R    0.8562
R2            0.7330
Adjusted R2   0.6568

             df   SS       MSS      F        Significance F
Regression   3    0.0136   0.0045   9.7037   0.0102
Residual     6    0.0028   0.0005
Total        9    0.0164

             Coefficients   Standard Error   t-Stat    p-Value
Intercept    –0.0138        0.0324           –0.4252   0.6855
DTI          0.6117         0.1781           3.4340    0.0139
Open lines   0.0265         0.0121           2.1958    0.0705
STR          –0.0681        0.0371           –1.8367   0.1159

LOS: Formulate a multiple regression equation using dummy variables to represent qualitative factors, and interpret the coefficients and regression results.
Page 341
At this stage, attendees should have enough knowledge to walk through this output.
- The variation in the independent variables explains about 65.68% of the variation in the dependent variable.
- The F-test is statistically significant, indicating the beta coefficients are not all statistically zero.
- The coefficient on the DTI is positive and statistically significant at generally accepted levels.
- The coefficient estimate on open lines is positive and statistically significant only at the 10% level.
- The coefficient estimate on STR is negative and not statistically significant at generally accepted levels.
  - It is almost significant at the 10% level, suggesting some explanatory validity.
  - It should be noted that it is of the opposite sign to that hypothesized.
14. Violations: Heteroskedasticity
The variance of the errors differs across observations (Assumption 4).
There are two types of heteroskedasticity:
- Unconditional heteroskedasticity, which presents no problems for statistical inference, and
- Conditional heteroskedasticity, wherein the error variance is correlated with the independent variable values.
  - Parameter estimates are still consistent.
  - F-test and t-tests are unreliable.
LOS: Describe conditional and unconditional heteroskedasticity and discuss their effects on statistical inference.
Page 345
We can sometimes see the effect of conditional heteroskedasticity by examining a scatter plot of the dependent variable versus the independent variable we believe has a conditionally heteroskedastic relationship, with the regression line overlaid on the plot. If the error terms (the distances between the points and the regression line) differ systematically in magnitude, we may have conditional heteroskedasticity. It should be pointed out, however, that visual inspection will not always reveal the condition, nor is it likely to do so when we have more than one independent variable.
15. Violations: serial correlation
There is correlation between the error terms (Assumption 5).
The focus in this chapter is the case in which there is serial correlation but no lagged values of the dependent variable as independent variable(s).
- Parameter estimates are consistent, but the standard errors are incorrect.
- The F-test and t-tests are likely inflated with positive serial correlation, the most common case with financial variables.
- Parameter estimates are still consistent as long as there are no lagged values of the dependent variable as independent variables.
- If there are lagged values of the dependent variable as independent variables,
  - Coefficient estimates are inconsistent.
  - This is the statistical arena of time series (Chapter 10).
LOS: Describe serial correlation and discuss its effects on statistical inference.
Pages 351–352
It should be fairly intuitive that using lagged values of the dependent variable as independent variables is unlikely to allow them to be “independent.”
The case we usually deal with here is first-order serial correlation, which can be thought of as the case when the sign of the error term is persistent (a negative error is likely to be followed by another negative error).
16. Testing and correcting for violations
There are well-established tests for serial correlation and heteroskedasticity, as well as ways to correct for their impact.
Testing for
- Heteroskedasticity: Use the Breusch–Pagan test.
- Serial correlation: Use the Durbin–Watson test.
Correcting for
- Heteroskedasticity: Use robust standard errors (White standard errors) or generalized least squares.
- Serial correlation: Use the Hansen correction. This also corrects for heteroskedasticity.
LOS: Explain how to test and correct for heteroskedasticity and serial correlation.
Pages 348–350; 353–355
Most of the tests and corrections for these violations are available in standard statistical packages, but they have to be invoked as options (they are not generally default tests and corrections).
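To make the Breusch–Pagan idea concrete, here is a minimal sketch of one common form of the test: regress the squared residuals from the original regression on the independent variable(s) and compare n × R2 from that auxiliary regression with a chi-square critical value with k degrees of freedom. Everything below is an assumption for illustration: the data are hypothetical and deliberately constructed so that the error size grows with x, only one regressor is used to keep the algebra simple, and 3.841 is the 5% chi-square critical value for 1 degree of freedom. A statistical package would normally be used in practice.

```python
def ols_fit(x, y):
    """Least-squares intercept and slope for a single regressor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def residuals(x, y):
    """Residuals from the fitted single-regressor line."""
    intercept, slope = ols_fit(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

def r_squared(x, y):
    """R-squared of the regression of y on x."""
    mean_y = sum(y) / len(y)
    sse = sum(e ** 2 for e in residuals(x, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - sse / sst

def breusch_pagan(x, y):
    """n * R-squared from regressing squared residuals on x."""
    squared = [e ** 2 for e in residuals(x, y)]
    return len(x) * r_squared(x, squared)

# Hypothetical data: errors alternate in sign and grow in size with x.
x = [float(i) for i in range(1, 11)]
y = [2.0 * xi + (0.2 * xi if i % 2 == 0 else -0.2 * xi) for i, xi in enumerate(x)]

CHI2_CRIT_1DF = 3.841  # 5% chi-square critical value, 1 degree of freedom
bp_stat = breusch_pagan(x, y)
reject_homoskedasticity = bp_stat > CHI2_CRIT_1DF

print(f"BP statistic = {bp_stat:.2f}, reject: {reject_homoskedasticity}")
```

A large statistic means the squared residuals are well explained by the independent variable, which is the signature of conditional heteroskedasticity.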
17. Testing for serial correlation
Focus On: Calculating the Durbin–Watson Statistic
You have recently estimated a regression model with 100 observations and two independent variables. Using the estimated errors, you have determined that the correlation between the error term and a first lagged value of the error term is 0.16. Do the observations exhibit positive serial correlation?
The test statistic is $DW \approx 2(1 - r) = 2(1 - 0.16) = 1.68$.
The critical values from Appendix E are dl = 1.63 and du = 1.72.
Because 1.68 > 1.63, we fail to reject the null of no positive serial correlation; because 1.68 also lies below du = 1.72, the test is inconclusive.
LOS: Calculate and interpret a Durbin–Watson statistic.
Pages 353–355
The book emphasizes the positive serial correlation case because that is what we most often deal with in finance and because its effects are potentially more problematic. The important thing to emphasize is that between the d’s, we have an inconclusive test. Below the lower d, we reject the null in favor of positive serial correlation; above the upper d, we fail to reject it. The test for negative serial correlation uses the mirrored bounds 4 − du and 4 − dl.
[Figure: Durbin–Watson decision regions showing dl = 1.63, du = 1.72, the inconclusive zone between them, and the rejection zones for positive and negative serial correlation.]
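The Durbin–Watson approximation and the three-way decision for positive serial correlation can be sketched as follows. The inputs are the values from the slide (r = 0.16, dl = 1.63, du = 1.72); the function names and the wording of the returned labels are illustrative assumptions.

```python
def durbin_watson_approx(r):
    """Large-sample approximation DW = 2 * (1 - r), where r is the
    first-order correlation of the regression residuals."""
    return 2 * (1 - r)

def classify_positive_serial_correlation(dw, d_lower, d_upper):
    """Decision for H0: no positive serial correlation."""
    if dw < d_lower:
        return "reject H0: positive serial correlation"
    if dw <= d_upper:
        return "inconclusive"
    return "fail to reject H0"

dw = durbin_watson_approx(0.16)
decision = classify_positive_serial_correlation(dw, d_lower=1.63, d_upper=1.72)

print(f"DW = {dw:.2f}: {decision}")
```

Because the statistic falls between the two bounds, the test neither confirms nor rules out positive serial correlation for this regression.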
18. Violations: multicollinearity
Multicollinearity occurs when two or more independent variables or combinations of independent variables are highly (but not perfectly) correlated with each other (related to Assumption 2).
- It is common with financial data.
- Estimates are still consistent, but imprecise and unreliable.
- One indicator that you may have a collinearity problem is the presence of a significant F-test but no (or few) significant t-tests.
- There is no easy way to correct the violation; you may have to drop variables.
- The “story” here is critical.
LOS: Describe multicollinearity and discuss its effects on statistical inference.
Pages 356–358
Multicollinearity can be between two independent variables in isolation, or between several independent variables in combination. The “dummy variables problem,” in which we must create an “omitted class,” occurs because of the necessity that independent variables be noncollinear.
19. Summarizing Violations and Solutions

Problem              Effect                       Solution
Heteroskedasticity   Incorrect standard errors    Use robust standard errors
Serial correlation   Incorrect standard errors*   Use robust standard errors (Hansen correction)
Multicollinearity    High R2 and low t-stats      No theory-based solution

*Unless we have lagged values of the dependent variable as independent variables; in that case, use a time-series model (Chapter 10).
Pages 345–359
20. Model Specification
Models should
- Be grounded in financial or economic reasoning.
- Have variables that are an appropriate functional form for their nature.
- Have specifications that are parsimonious.
- Be in compliance with the regression assumptions.
- Be tested out-of-sample before applying them to decisions.
LOS: Explain the principles of model specification.
Pages 359–360
The most overlooked feature of any set of empirical tests is probably the appropriateness of the model and variable specifications.
21. Model Misspecifications
A model is misspecified when it violates the assumptions underlying linear regression, its functional form is incorrect, or it contains time-series specification problems.
Generally, model misspecification can result in invalid statistical inference when we are using linear regression.
Misspecification has a number of possible sources:
- Misspecified functional form can arise from several possible problems:
  - Omitted variable bias.
  - Incorrectly represented variables.
  - Data that are pooled when they should not be.
- Error term correlation with independent variables can arise from:
  - Lagged values of the dependent variable as independent variables.
  - Measurement error in the independent variables.
  - Independent variables that are functions of the dependent variable.
LOS: Define misspecification and discuss its effects on the results of a regression analysis.
Pages 360–367
There is a large amount of material here and this is a potentially deep topic. It should be emphasized that this is a brief coverage and that the book contains a deeper coverage well worth reading.
22. Avoiding misspecification
- If independent or dependent variables are nonlinear, use an appropriate transformation to make them linear. For example, use common size statements or log-based transformations.
- Avoid independent variables that are mathematical transformations of dependent variables.
- Don’t include spurious independent variables (no data mining).
- Perform diagnostic tests for violations of the linear regression assumptions. If violations are found, use appropriate corrections.
- Validate model estimations out-of-sample when possible.
- Ensure that data come from a single underlying population. The data collection process should be grounded in good sampling practice.
LOS: Explain how to avoid the common forms of misspecification in a regression analysis.
Pages 360–367
Again, this is a very deep topic and these slides just touch the surface.
23. Qualitative dependent variables
The dependent variable of interest may be a categorical variable representing the state of the subject we are analyzing.
Dependent variables that take on ordinal or nominal values are better estimated using models developed for qualitative analysis.
This approach is the dependent variable analog to indicator (dummy) variables as independent variables.
Three broad categories
- Probit: Based on the normal distribution, it estimates the probability of the dependent variable outcome.
- Logit: Based on the logistic distribution, it also estimates the probability of the dependent variable outcome.
- Discriminant analysis: It estimates a linear function, which can then be used to assign the observation to the underlying categories.
LOS: Discuss models for qualitative dependent variables.
Page 372
There are entire texts written on the use of qualitative dependent variables, and one should seek out more information before trying to estimate such models. There are a host of new and different assumptions, diagnostic tests, and so on. This discussion should be presented as informing of the existence of such models and providing basic information to guide further search if there is a need to use such models.
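The difference between probit and logit is only the distribution used to turn a linear index into a probability. The sketch below shows that mapping; the coefficients and the borrower-default framing are hypothetical values chosen for illustration, not estimates from the source, and estimating such coefficients requires maximum likelihood methods that are outside this sketch.

```python
import math

def logit_probability(index):
    """Logistic CDF: probability implied by a linear index under a logit model."""
    return 1.0 / (1.0 + math.exp(-index))

def probit_probability(index):
    """Standard normal CDF: probability implied by the index under a probit model."""
    return 0.5 * (1.0 + math.erf(index / math.sqrt(2.0)))

# Hypothetical coefficients for a binary outcome (for example, default = 1).
B0, B1 = -2.0, 5.0   # illustrative only, not estimated from any data
dti = 0.30           # hypothetical borrower debt-to-income ratio

index = B0 + B1 * dti
p_logit = logit_probability(index)
p_probit = probit_probability(index)

print(f"index = {index:.2f}, logit p = {p_logit:.4f}, probit p = {p_probit:.4f}")
```

Both functions keep the predicted probability between 0 and 1, which an ordinary linear regression on a 0/1 dependent variable cannot guarantee.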
24. Economic meaning and multiple regression

Regression Statistics
Multiple R    0.8562
R2            0.7330
Adjusted R2   0.6568

             df   SS       MSS      F        Significance F
Regression   3    0.0136   0.0045   9.7037   0.0102
Residual     6    0.0028   0.0005
Total        9    0.0164

             Coefficients   Standard Error   t-Stat    p-Value
Intercept    –0.0138        0.0324           –0.4252   0.6855
DTI          0.6117         0.1781           3.4340    0.0139
Open lines   0.0265         0.0121           2.1958    0.0705
STR          –0.0681        0.0371           –1.8367   0.1159

LOS: Interpret the economic meaning of a significant multiple regression.
Pages 325–331
As noted earlier, interpretation of coefficient estimates in a multiple linear regression should be performed with care. The rate of change in the dependent variable resulting from a change in one independent variable holds the other independent variables constant, a condition that may not be reasonable relative to the actual occurrence of events leading to the data used.
With that caveat, attendees should be at the stage where they can interpret these coefficients with guidance from the instructor and should be led to do so.
- Overall, the regression appears well specified (high adjusted R2, highly significant F-test).
- Economically, it shows that borrowers with more lines of credit and higher levels of debt to income are charged higher loan rates. But there is some evidence from this sample that borrowers with a greater reliance on short-term debt have lower average loan rates than those with less reliance on short-term debt, although the statistical value of this is in question.
- A 1% increase in the borrower debt-to-income ratio leads to a 61.17 bp higher loan rate (0.6117 × 0.01 = 0.006117), holding short-term reliance on debt and the number of open lines constant.
25. Summary
- We are often interested in the relationship between more than two financial variables, and multiple linear regression allows us to model such relationships and subject our beliefs about them to rigorous testing.
- Financial data often exhibit characteristics that violate the underlying assumptions necessary for linear regression and its associated hypothesis tests to be meaningful.
- The main violations are
  - Serial correlation.
  - Conditional heteroskedasticity.
  - Multicollinearity.
- We can test for each of these conditions and correct our estimations and hypothesis tests to account for their effects.