 # Correlation and regression

Scatter plots

A scatter plot is a graph that shows the relationship between the observations for two data series in two dimensions. Scatter plots are formed by using the data from two different series to plot coordinates along the x- and y-axes, where one element of the data series forms the x-coordinate and the other the y-coordinate. (Slide graphics: one linear and one nonlinear example.)

LOS: Define and interpret a scatter plot. Page 281

Visual inspection of a scatter plot, although not sufficient to demonstrate a statistical relationship, is often a starting point for examining data in order to assess whether there appears to be an underlying relationship. The graphs on the slide depict a basic scatter plot, one with a likely linear relationship and one with a curvilinear relationship.

Sample covariance

Recall that covariance is the weighted average of the cross-products of each variable's departures from its mean. Sample covariance is calculated by using the same process as sample variance; however, rather than squaring the deviation of each observation from its mean, we take the product of the two variables' deviations from their respective means:

$$\operatorname{Cov}(X,Y) = s_{X,Y} = \frac{\sum_{i=1}^{n}\left(X_i-\bar{X}\right)\left(Y_i-\bar{Y}\right)}{n-1}$$

LOS: Calculate and interpret a sample covariance. Page 284

Point out that covariance is covered in its probabilistic form in Chapter 4, and that this is just the sample analog wherein we have data from the past instead of a stated probability distribution.

Sample covariance

Focus On: Calculations

Lending rates and current borrower burden are generally believed to be related. The following data cover the debt-to-income ratio for 10 borrowers and the interest rate they are being charged on five-year loans. What is the sample covariance between the loan rate (Y) and the debt-to-income ratio (X)?

| Client | Y | X | Y − Ȳ | X − X̄ | Product |
|---|---|---|---|---|---|
| 1 | 0.1595 | 0.1952 | 0.0070 | 0.0323 | 0.0002 |
| 2 | 0.1171 | 0.1239 | –0.0354 | –0.0390 | 0.0014 |
| 3 | — | 0.1229 | — | –0.0400 | — |
| 4 | 0.1269 | 0.1625 | –0.0256 | –0.0004 | 0.0000 |
| 5 | 0.1343 | 0.1078 | –0.0182 | –0.0551 | 0.0010 |
| 6 | 0.1523 | 0.1470 | –0.0002 | –0.0159 | — |
| 7 | — | 0.1823 | — | 0.0194 | — |
| 8 | 0.2295 | 0.2599 | 0.0770 | 0.0970 | 0.0075 |
| 9 | 0.1112 | 0.1384 | –0.0413 | –0.0245 | — |
| 10 | 0.2247 | 0.1890 | 0.0722 | 0.0261 | 0.0019 |
| Mean | 0.1525 | 0.1629 | | Sum = | 0.0144 |
| StDev | 0.0427 | 0.0454 | | Cov = | 0.0016 |

(Cells missing from the transcript are marked —.)

LOS: Calculate and interpret a sample covariance. Pages 282–287
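The covariance computation can be sketched in a few lines of Python. As an illustration, the data below are the eight complete (X, Y) pairs from the table; two clients' loan rates are missing in the transcript, so the result only approximates the slide's 0.0016.

```python
# Sample covariance: sum of cross-products of deviations, divided by n - 1.
# Uses only the eight complete (X, Y) pairs, so the answer is approximate.
x = [0.1952, 0.1239, 0.1625, 0.1078, 0.1470, 0.2599, 0.1384, 0.1890]  # debt-to-income ratios
y = [0.1595, 0.1171, 0.1269, 0.1343, 0.1523, 0.2295, 0.1112, 0.2247]  # loan rates

def sample_covariance(xs, ys):
    """Cross-products of deviations from the means, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys)) / (n - 1)

cov_xy = sample_covariance(x, y)
print(round(cov_xy, 4))  # 0.0018 -- near the slide's 0.0016 despite the dropped rows
```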

Correlation Coefficient
The correlation coefficient measures the extent and direction of a linear association between two variables. If the sample covariance is denoted $s_{X,Y}$, then the sample correlation coefficient is the sample covariance divided by the product of the sample standard deviations:

$$r = \frac{s_{X,Y}}{s_X \, s_Y}$$

Continuing with our example, the sample correlation coefficient is then

$$r = \frac{0.0016}{(0.0427)(0.0454)} \approx 0.8253$$

From this result, we can conclude that there is a strong linear relationship between the debt-to-income ratio of the borrowers and the loan rate they are charged. Furthermore, the relationship has a positive sign, indicating that an increase in the debt-to-income ratio is associated with a higher loan rate.

LOS: Calculate and interpret a sample correlation coefficient. Pages 282–287

Remember that sample correlation ranges from –1 (perfect negative correlation) to +1 (perfect positive correlation). Because we use the positive roots of the variances for the standard deviations, a negative correlation perforce arises from a negative covariance. Values close to –1 or +1 indicate a strong linear relationship.
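The division above is easy to reproduce from the slide's rounded summary statistics; note that rounding in the inputs propagates into the result.

```python
# Sample correlation from the rounded summary statistics: r = s_xy / (s_x * s_y).
s_xy = 0.0016   # sample covariance
s_x = 0.0454    # std. dev. of the debt-to-income ratio (X)
s_y = 0.0427    # std. dev. of the loan rate (Y)

r = s_xy / (s_x * s_y)
print(round(r, 2))     # 0.83 -> a strong positive linear association
print(round(r**2, 2))  # 0.68 -> consistent with the 68.11% figure quoted later
```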

Limitations of correlation analysis
Focus On: Outliers

Outliers are small numbers of observations with extreme values relative to the rest of the sample. Noise or news? Should we include them or discard them? Outliers can create the appearance of a linear relationship when there isn't one OR create the appearance of no linear relationship when there is one.

LOS: Explain how outliers can affect correlations. Pages 287–289

There is a great quote by a famous econometrician that goes something like, "When I find an outlier, I don't know whether to discard it or patent it." We can learn more from the outliers than from the rest of the analysis, but they can also mask other underlying correlations. The left graph shows a linear relationship, and there likely is one, but perhaps not the one created by the outlier to the right (here there actually is a true linear relationship). The right graph shows how two clusters, neither of which shows a linear relationship on its own, appear to create one.

Spurious correlation

Spurious correlation is estimated correlation that arises because of the estimating process, not because of a fundamental underlying linear association. Potential sources of spurious correlation:

- Correlation between two variables that reflects chance relationships in a particular dataset.
- Correlation induced by a calculation that mixes each of two variables with a third.
- Correlation between two variables arising not from a direct relationship between them but from their relationship to a third variable.

LOS: Define and explain the concept of spurious correlation. Page 289

The first source is easy to understand. The second arises because a mathematical transformation of two variables using a third common variable produces a linear association that is an artifact of the transformation. The third arises when A is correlated with C and B is correlated with C, but A is not really associated with B. Probably the most famous example of the third source is evidence showing that when ice cream sales in urban areas are high, so is the murder rate. Clearly, the two are not associated with each other but rather with the hidden variable of temperature (high temperatures mean higher ice cream sales and also higher tempers, resulting in more violent altercations).
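The third source is easy to simulate. In this illustrative sketch (not from the reading), A and B are each driven by a hidden variable C plus independent noise, so they correlate with each other even though neither enters the other's equation; the variable names are hypothetical.

```python
import random

random.seed(42)  # make the illustration repeatable

def corr(xs, ys):
    """Sample correlation coefficient computed from deviations about the means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / (sxx * syy) ** 0.5

c = [random.gauss(0, 1) for _ in range(10_000)]    # hidden driver (think: temperature)
a = [ci + random.gauss(0, 1) for ci in c]          # e.g., ice cream sales
b = [ci + random.gauss(0, 1) for ci in c]          # e.g., violent altercations

print(round(corr(a, b), 2))  # clearly positive, although A never appears in B
```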

Correlation coefficients
Focus On: Hypothesis Tests

Recall from Chapter 7 that we can test whether the population correlation coefficient equals zero using the test statistic

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}},$$

which follows a t-distribution with n – 2 degrees of freedom. Returning to our earlier example, we can test whether the correlation between the debt-to-income ratio and the loan rate is zero at a 95% confidence level.

1. Formulate hypothesis → H0: ρ = 0 versus Ha: ρ ≠ 0 (a two-tailed test)
2. Identify appropriate test statistic (see above)
3. Specify the significance level → 0.05, leading to a critical value of 2.306
4. Collect data and calculate the test statistic → t ≈ 4.13
5. Make the statistical decision → Reject the null because 4.13 > 2.306
6. Statistically → The correlation between the debt-to-income ratio and the loan rate is nonzero. Economically → Higher debt-to-income ratios are associated with higher loan rates.

LOS: Formulate a test of the hypothesis that the population correlation coefficient equals zero and determine whether the hypothesis is rejected at a given level of significance. Pages 297–300
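The test statistic can be checked directly. Here r is the correlation implied by the slide's rounded summary statistics, so the result is approximate.

```python
import math

# Test H0: rho = 0 using t = r * sqrt(n - 2) / sqrt(1 - r^2), df = n - 2.
r, n = 0.8253, 10
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)
t_crit = 2.306  # two-tailed 5% critical value for df = 8, from the slide

print(round(t_stat, 2))      # 4.13 -- essentially the slope coefficient's t-statistic
print(abs(t_stat) > t_crit)  # True: reject the null of zero correlation
```

With a single regressor, this t-statistic and the slope's t-statistic (4.1534 in the regression output) coincide up to rounding.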

The Basics of Linear regression
Linear regression allows us to describe one variable as a linear function of another variable:

$$Y_i = b_0 + b_1 X_i + \varepsilon_i$$

The independent variable (Xi) is the variable you are using to explain changes in the dependent variable (Yi), the variable you are attempting to explain. The linear regression estimation process chooses parameter estimates that minimize the sum of the squared departures of the predicted values from the observed values. b0 is known as the intercept and b1 as the slope coefficient: if the value of the independent variable increases by one unit, the value of the dependent variable changes by b1 units. For our example, the fitted line on the slide has b1 = 0.78 and b0 = 0.026.

LOS: Differentiate between dependent and independent variables in a linear regression. Pages 300–303
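For one regressor, the least-squares estimates have a closed form: b1 = Cov(X, Y)/Var(X) and b0 = mean(Y) − b1 · mean(X). A sketch using the example's rounded summary statistics (so the estimates land near, not exactly on, the reported values):

```python
# Closed-form OLS estimates for a single-regressor model.
cov_xy = 0.0016        # sample covariance of X and Y
var_x = 0.0454 ** 2    # sample variance of X (std. dev. squared)
mean_x, mean_y = 0.1629, 0.1525

b1 = cov_xy / var_x          # slope: reported value is 0.7774
b0 = mean_y - b1 * mean_x    # intercept: reported value is 0.0258
print(round(b1, 2), round(b0, 3))  # 0.78 0.026 -- matching the slide's annotations
```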

Assumptions underlying linear regression
$$Y_i = b_0 + b_1 X_i + \varepsilon_i$$

- The relationship between the dependent variable, Y, and the independent variable, X, is linear in the parameters b0 and b1.
- The independent variable, X, is not random.
- The expected value of the error term is zero: E(ε) = 0.
- The variance of the error term is the same for all observations.
- The error term, ε, is uncorrelated across observations; consequently, E(εiεj) = 0 for all i ≠ j.
- The error term, ε, is normally distributed.

LOS: Explain the assumptions underlying linear regression. Pages 303–305

These assumptions will become important in the next chapter as we examine advanced topics in linear regression. The majority of the time, data used in investment analysis will likely violate at least one of these assumptions.

The Basics of Linear regression
Focus On: Regression Output

$$Y_i = b_0 + b_1 X_i + \varepsilon_i$$

| Coefficient | Estimate | Standard Error | t-Statistic |
|---|---|---|---|
| b0 | 0.0258 | 0.0315 | 0.8197 |
| b1 | 0.7774 | 0.1872 | 4.1534 |

LOS: Differentiate between dependent and independent variables in a linear regression. Pages 300–303

The associated output is from MS Excel on the accompanying spreadsheet.

Standard error of the estimate
The standard error of the estimate (SEE) gives us a measure of the goodness of fit of the relationship:

$$\text{SEE} = \sqrt{\frac{\sum_{i=1}^{n}\left(Y_i - b_0 - b_1 X_i\right)^2}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n}\varepsilon_i^{\,2}}{n-2}}$$

| Client | Y | Predicted Y | Residual² |
|---|---|---|---|
| 1 | 0.1595 | 0.1776 | 0.0003 |
| 2 | 0.1171 | 0.1222 | 0.0000 |
| 3 | — | 0.1214 | — |
| 4 | 0.1269 | 0.1522 | 0.0006 |
| 5 | 0.1343 | 0.1096 | — |
| 6 | 0.1523 | 0.1401 | 0.0001 |
| 7 | — | 0.1676 | 0.0002 |
| 8 | 0.2295 | 0.2279 | — |
| 9 | 0.1112 | 0.1334 | 0.0005 |
| 10 | 0.2247 | 0.1728 | 0.0027 |

(Cells missing from the transcript are marked —.)

LOS: Define and calculate the standard error of estimate. Pages 306–308

Point out that a small SEE represents a better fit (less average squared error).
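Because the residual column in the transcript is incomplete, a sketch can instead back SSE out of the identity SSE = (1 − R²) · SST, using the example's R² = 0.6811 and s_y = 0.0427 (so SST = (n − 1) · s_y²); the SEE value itself is not shown on the slide, so the figure below is an approximation.

```python
import math

# SEE = sqrt(SSE / (n - 2)), with SSE recovered from R^2 and total variation.
n = 10
r_squared = 0.6811
s_y = 0.0427

sst = (n - 1) * s_y ** 2          # total variation in Y
sse = (1 - r_squared) * sst       # unexplained variation (sum of squared residuals)
see = math.sqrt(sse / (n - 2))
print(round(see, 4))  # 0.0256 -- a typical prediction error of about 2.6 loan-rate points
```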

Coefficient of determination
The coefficient of determination is the portion of the variation in the dependent variable explained by variation in the independent variable(s). Because total variation = unexplained variation + explained variation, we can calculate it two ways:

- Square the correlation coefficient when we have one dependent and one independent variable.
- Use the relationship above to determine the unexplained portion of the total variation as the sum of the squared prediction errors divided by the total variation in the dependent variable when we have more than one independent variable.

Because we have one independent and one dependent variable in our regression, the coefficient of determination is r² = (0.8253)² ≈ 0.6811. The debt-to-income ratio explains 68.11% of the variation in the loan rate.

LOS: Define, calculate, and interpret the coefficient of determination. Page 309

Recall that we calculated the correlation coefficient for our specification earlier in the lecture.
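Both routes can be checked side by side. The SSE and SST values below are rounded figures consistent with this example, so the two routes agree only approximately.

```python
# Route 1: square the sample correlation (valid with a single regressor).
r = 0.8253
r2_from_corr = r ** 2

# Route 2: one minus the unexplained share of total variation.
sst = 0.0164   # total variation, (n - 1) * s_y^2 (rounded)
sse = 0.0052   # unexplained variation, sum of squared residuals (rounded)
r2_from_anova = 1 - sse / sst

print(round(r2_from_corr, 4))   # 0.6811 -> 68.11% of loan-rate variation explained
print(round(r2_from_anova, 4))  # about 0.6829 given the rounded inputs
```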

Regression coefficients
Focus On: Calculations

When we calculate a confidence interval for a regression coefficient, we use the estimated coefficient, the standard error of that coefficient, and the distribution of the coefficient estimate (in this case, a t-distribution) to estimate the confidence interval as

$$\hat{b}_1 \pm t_c \, s_{\hat{b}_1}$$

For a 95% confidence interval around our estimated slope coefficient of 0.7774, the confidence interval is 0.7774 ± 2.306 × 0.1872, or roughly 0.3457 to 1.2091.

LOS: Calculate a confidence interval for a regression coefficient. Pages 310–313

We would expect the true underlying b1 to fall within this confidence interval 95% of the time. It is worth noting for the students that this interval does not include zero, implying that a hypothesis test of b1 = 0 will reject the null. A hypothesis test of b1 = 1, however, would fail to reject the null.
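The interval arithmetic can be sketched directly from the regression output:

```python
# 95% confidence interval for the slope: b1 +/- t_crit * s_b1.
b1, s_b1 = 0.7774, 0.1872  # estimate and standard error from the regression output
t_crit = 2.306             # two-tailed 5% critical value for df = n - 2 = 8

lower = b1 - t_crit * s_b1
upper = b1 + t_crit * s_b1
print(round(lower, 4), round(upper, 4))  # 0.3457 1.2091
print(lower > 0)          # True: zero is excluded, so a test of b1 = 0 rejects
print(lower < 1 < upper)  # True: a test of b1 = 1 fails to reject
```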

Regression coefficients
Focus On: Hypothesis Testing

Alternatively, we could test the hypothesis that the true population slope coefficient is zero.

1. Formulate hypothesis → H0: b1 = 0 versus Ha: b1 ≠ 0 (a two-tailed test)
2. Identify appropriate test statistic → $t = (\hat{b}_1 - b_1)/s_{\hat{b}_1}$
3. Specify the significance level → 0.05, leading to a critical value of 2.306
4. Collect data and calculate the test statistic → t = 4.1534 (from the regression output)
5. Make the statistical decision → Reject the null because 4.1534 > 2.306

LOS: Formulate a null and an alternative hypothesis about a population value of a regression coefficient, select the appropriate test statistic, and determine whether the null hypothesis is rejected at a given level of significance. Pages 310–313

As expected from the confidence interval information, we reject the null that b1 = 0.

Regression coefficients
Focus On: Interpretation

6. Interpret the results of the test. Statistically → the coefficient estimate for the slope of the relationship is nonzero. Economically → a one-unit increase in the debt-to-income ratio is associated with a 0.7774-unit increase in the loan rate. In other words, an increase of 1 percentage point in the debt-to-income ratio is associated with an increase of roughly 78 basis points in the loan rate charged.

LOS: Interpret a regression coefficient. Pages 310–313

Prediction and Linear regression
Focus On: Calculating Predicted Values

Continuing with our example, we can calculate predicted values for our dependent variable given our estimated regression model and values for our independent variable. If we want to predict the loan rate for a borrower with a debt-to-income ratio of 18%, we substitute our estimated coefficients and a value of X = 0.18 to get

$$\hat{Y} = 0.0258 + 0.7774 \times 0.18 \approx 0.1658$$

For our estimated relationship, a borrower with an 18% debt-to-income ratio would be expected to have a 16.58% loan rate.

LOS: Calculate a predicted value for the dependent variable given an estimated regression model and a value for the independent variable. Pages 321–322
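The substitution above is a one-liner; with the rounded coefficients it lands within a rounding step of the slide's 16.58%:

```python
# Predicted loan rate for a borrower with an 18% debt-to-income ratio.
b0, b1 = 0.0258, 0.7774   # estimated intercept and slope
x_new = 0.18

y_hat = b0 + b1 * x_new
print(round(y_hat, 4))  # 0.1657 -- about a 16.58% loan rate
```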

Prediction and Linear regression
Focus On: Calculations

Just as we can estimate a confidence interval for our coefficients, we can also estimate a confidence interval for our predicted (forecast) values. But we must also account for the estimation error in our coefficient estimates, so the variance of the forecast error is

$$s_f^2 = \text{SEE}^2\left[1 + \frac{1}{n} + \frac{(X-\bar{X})^2}{(n-1)s_x^2}\right]$$

and the confidence interval is $\hat{Y} \pm t_c \, s_f$. Using the coefficient estimates and our predicted value from the prior slide, we determine a 95% confidence interval for our prediction.

LOS: Calculate and interpret a confidence interval for the predicted value of a dependent variable. Pages 321–322
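A sketch of the prediction interval, assuming SEE ≈ 0.0256 backed out from the example's R² (the SEE value itself is not shown in the transcript); the other inputs are the example's summary statistics:

```python
import math

# 95% prediction interval: y_hat +/- t_crit * s_f, where the forecast variance
# adds coefficient-estimation error to the residual variance.
see, n = 0.0256, 10
x_bar, s_x = 0.1629, 0.0454   # mean and std. dev. of the debt-to-income ratio
t_crit = 2.306                # two-tailed 5% critical value for df = 8
x_new, y_hat = 0.18, 0.1657   # forecast point and predicted loan rate

s_f = see * math.sqrt(1 + 1 / n + (x_new - x_bar) ** 2 / ((n - 1) * s_x ** 2))
lower, upper = y_hat - t_crit * s_f, y_hat + t_crit * s_f
print(round(lower, 3), round(upper, 3))  # roughly 0.103 to 0.228
```

The interval is much wider than the point estimate suggests, which is the practical cost of forecasting from a small sample.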

Analysis of variance

Known as ANOVA, this process enables us to divide the total variability in the dependent variable into components attributable to different sources. ANOVA allows us to assess the usefulness of an independent variable or variables in explaining the variation in the dependent variable. We do so using a test that determines whether the estimated coefficients are jointly zero. The ratio of the mean regression sum of squares to the mean squared error follows an F-distribution with 1 and n – 2 degrees of freedom. For a single independent variable, this is expressed as

$$F = \frac{\text{RSS}/1}{\text{SSE}/(n-2)}$$

where SSE is the sum of the squared errors (residuals) and RSS is the sum of the squared deviations of the predicted values from the mean value of the dependent variable:

$$\text{RSS} = \sum_{i=1}^{n}\left(\hat{Y}_i - \bar{Y}\right)^2$$

LOS: Describe the use of analysis of variance (ANOVA) in regression analysis and interpret ANOVA results. Pages 318–320

Analysis of variance

Focus On: Calculations

For our example, with a single independent variable, we can test the overall significance of the estimated relationship.

| Pred. Y | (Pred. Y − Avg. Y)² |
|---|---|
| 0.1776 | 0.0006 |
| 0.1222 | 0.0009 |
| 0.1214 | 0.0010 |
| 0.1522 | 0.0000 |
| 0.1096 | 0.0018 |
| 0.1401 | 0.0002 |
| 0.1676 | — |
| 0.2279 | 0.0057 |
| 0.1334 | 0.0004 |
| 0.1728 | — |

Avg. Y = 0.1525; RSS = the sum of the right-hand column. (Entries missing from the transcript are marked —.)

1. Formulate hypothesis → H0: all b = 0 versus Ha: all b ≠ 0
2. Identify appropriate test statistic → F = (RSS/1)/(SSE/(n − 2))
3. Specify the significance level → 0.05, leading to CV = 5.32 for F(1, 8)
4. Collect data (see above) and calculate the test statistic
5. Make the statistical decision → Reject the null
6. Statistically → at least one b is nonzero. Economically → the specified relationship has valid explanatory power.

LOS: Describe the use of analysis of variance (ANOVA) in regression analysis and interpret ANOVA results. Pages 318–320
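Because several table cells are missing from the transcript, a sketch can back RSS and SSE out of the example's R² and s_y instead; with one regressor, F should equal the slope's t-statistic squared (4.1534² ≈ 17.25), so the figures line up.

```python
# ANOVA F-test for overall significance: F = (RSS / 1) / (SSE / (n - 2)).
n = 10
r_squared, s_y = 0.6811, 0.0427

sst = (n - 1) * s_y ** 2         # total variation
rss = r_squared * sst            # explained variation
sse = (1 - r_squared) * sst      # unexplained variation
f_stat = (rss / 1) / (sse / (n - 2))

print(round(f_stat, 1))  # 17.1 -- well above the 5% critical value of 5.32 for F(1, 8)
```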

Limitations of regression analysis
- Parameter instability occurs when regression relationships change over time. This instability generally occurs when the underlying population from which the sample is drawn has changed fundamentally in some way (example: regime shifts in regulatory or monetary policy).
- Public knowledge of the relationships may decrease or eliminate their usefulness.
- Violation of the underlying assumptions makes hypothesis tests and prediction intervals invalid, and we may not be certain whether the assumptions have been violated.

LOS: Discuss the limitations of regression analysis. Page 324

Point out that the third limitation is covered in the next chapter, a nice transition to the next material.

Summary

We are often interested in knowing the extent of the relationship between two or more financial variables. We can assess this relationship in several ways, including correlation, which measures the degree to which two variables move together, and linear regression, which describes at a more fundamental level the nature of any linear relationship between two variables. We can combine hypothesis testing from the prior chapter with linear regression and correlation to test beliefs about the nature and extent of relationships between two or more variables.