Presentation on theme: "Regression Analysis Notes". Presentation transcript:
What is a simple linear relation?
When one variable is associated with another variable in such a way that two numbers completely describe the relation:
– Intercept on the y-axis, b
– Slope, or rate of change between the two variables, m
Y = mX + b
Because we work with samples of data, the simple linear relation will be subject to error:
Y = mX + b + error
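As a minimal sketch of how the two numbers m and b can be estimated from a sample, the block below fits the least-squares line by hand. The SIZE and RENT values are made up for illustration (they are not the data behind these slides), and the function name `fit_line` is hypothetical.

```python
# Hypothetical data (not from the slides): apartment SIZE in square feet
# and monthly RENT in dollars.
size = [850, 1450, 1085, 1232, 718, 1485, 1136, 726, 700, 956]
rent = [950, 1600, 1200, 1500, 950, 1700, 1650, 935, 875, 1150]


def fit_line(x, y):
    """Return (m, b) for the least-squares line y = m*x + b."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Co-variation of X and Y, and variation of X, around their means.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    m = sxy / sxx          # slope: change in Y per unit change in X
    b = y_bar - m * x_bar  # intercept: the line passes through (x_bar, y_bar)
    return m, b


m, b = fit_line(size, rent)
# The "error" term for each observation: actual Y minus the Y on the line.
residuals = [yi - (m * xi + b) for xi, yi in zip(size, rent)]
print(f"slope m = {m:.4f}, intercept b = {b:.2f}")
```

The residuals computed at the end are the sample counterpart of the error term in Y = mX + b + error.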
A Regression Analysis
A regression analysis consists of checks that help you quantify and qualify how good a simple linear regression model is.
Regression Analysis
Check #1: How well does X explain Y?
– Coefficient of Determination, R² (often quoted as a percentage): always a number between 0 (no linear relation) and 1 (perfect linear relation).
– Example: R² = 72% means that 72% of the variation in RENT is explained by variation in SIZE.
Check #2: How much of the variation in Y remains unexplained by X?
– Standard Error of the Regression (in the units of Y).
– Example: a standard error of $194.60 means that the actual RENT typically differs from the RENT predicted by SIZE by about $194.60.
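A rough sketch of how Checks #1 and #2 can be computed by hand, assuming a least-squares fit. The data are hypothetical (so the output will not reproduce the 72% and $194.60 quoted on the slide), and `r_squared_and_std_error` is an illustrative name, not a PhStat function.

```python
import math

# Hypothetical data (not from the slides): SIZE in square feet, RENT in dollars.
size = [850, 1450, 1085, 1232, 718, 1485, 1136, 726, 700, 956]
rent = [950, 1600, 1200, 1500, 950, 1700, 1650, 935, 875, 1150]


def r_squared_and_std_error(x, y):
    """Fit y = m*x + b by least squares; return (R^2, standard error of the regression)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    m = sxy / sxx
    b = y_bar - m * x_bar
    sse = sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(x, y))  # unexplained variation
    sst = sum((yi - y_bar) ** 2 for yi in y)                     # total variation in Y
    r2 = 1 - sse / sst                    # share of the variation that X explains
    std_error = math.sqrt(sse / (n - 2))  # n - 2 because two numbers (m, b) were estimated
    return r2, std_error


r2, std_error = r_squared_and_std_error(size, rent)
print(f"R-square = {r2:.1%}, standard error = ${std_error:.2f}")
```

The standard error comes out in the units of Y (dollars here), which is why it can be read as a typical prediction miss.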
Regression Analysis
Check #3 (hypothesis test): Is the linear model “statistically significant”?
– H0: m = 0 versus H1: m ≠ 0
– Under the alternative hypothesis: Y = mX + b + e
– Under the null hypothesis: Y = b + e (X explains none of the variation in Y)
– You WANT TO REJECT THE NULL: p-value << .05 (significance level)
– In this case the p-value = “Significance-F” (PhStat)
Regression Analysis
Check #4: two separate hypothesis tests, one for the slope term, m, by itself; the other for the intercept term, b, by itself.
– Both appear in the “Coefficients Table” of the PhStat output, where they are labeled as p-values.
– Intercept test: H0: b = 0 versus H1: b ≠ 0
– Slope test: H0: m = 0 versus H1: m ≠ 0
– You WANT TO REJECT H0 IN BOTH CASES, or at least in the case of the slope term.
– PREFER p-values << .05 (significance level)
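To show where the numbers behind Checks #3 and #4 come from, here is a sketch that computes the F statistic and the two t statistics by hand, using the standard least-squares formulas. It stops at the test statistics: turning them into p-values needs the F and t distributions, which are not in Python's standard library, so in practice you would read the p-values from the PhStat output. The data and the function name `significance_statistics` are hypothetical.

```python
import math

# Hypothetical data (not from the slides): SIZE in square feet, RENT in dollars.
size = [850, 1450, 1085, 1232, 718, 1485, 1136, 726, 700, 956]
rent = [950, 1600, 1200, 1500, 950, 1700, 1650, 935, 875, 1150]


def significance_statistics(x, y):
    """Return (F, t_slope, t_intercept) for the least-squares line y = m*x + b."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    m = sxy / sxx
    b = y_bar - m * x_bar
    sse = sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(x, y))  # unexplained
    ssr = sum((m * xi + b - y_bar) ** 2 for xi in x)             # explained
    mse = sse / (n - 2)                    # error variance estimate
    f_stat = ssr / mse                     # explained (1 df) over unexplained (n - 2 df)
    se_m = math.sqrt(mse / sxx)            # standard error of the slope
    se_b = math.sqrt(mse * (1 / n + x_bar ** 2 / sxx))  # standard error of the intercept
    return f_stat, m / se_m, b / se_b


f_stat, t_slope, t_intercept = significance_statistics(size, rent)
print(f"F = {f_stat:.2f}, t(slope) = {t_slope:.2f}, t(intercept) = {t_intercept:.2f}")
```

With a single explanatory variable, F equals the square of the slope's t statistic, so the overall test and the slope test always agree.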
Residuals Analysis
A residuals analysis is a set of checks to ensure that the UNEXPLAINED part of Y behaves in accordance with the assumptions that make the simple linear regression valid. There are three (3) checks, one for each ideal property of the residuals in a regression:
– Errors are normally distributed around zero (normality).
– Errors are NOT related to X: their spread is the same at every value of X (homoskedasticity).
– Errors are NOT related to each other (no autocorrelation).
Residual Analysis Check #1
Are the residuals normally distributed around a zero mean?
– Procedure: draw a box-and-whisker plot, or a normal probability plot (as in chapters 3 and 6), using the residuals as data.
– Why should errors be normal around zero? A zero mean says the model does not systematically over- or under-predict Y (it is unbiased), and normality is what the p-values in the hypothesis tests rely on. The normal probability plot checks the normality assumption.
– You WANT your normal plot to look like panel B on page 196.
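A sketch of how the points of a normal probability plot can be built from the residuals. It pairs each sorted residual with the standard normal quantile at the plotting position (i - 0.5)/n, which is one common convention and may differ from the one your textbook or PhStat uses. The residuals are hypothetical, and the correlation at the end is only an informal measure of how straight the plot is.

```python
import math
from statistics import NormalDist

# Hypothetical residuals (not from the slides), in dollars.
residuals = [-210.5, 35.2, -88.0, 140.7, 12.3, -15.9, 260.1, -60.4, -120.8, 47.3]


def normal_plot_points(resid):
    """Pair each sorted residual with the standard normal quantile it should match."""
    n = len(resid)
    ordered = sorted(resid)
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    theoretical = [z.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


def straightness(points):
    """Pearson correlation of the plot points; values near 1 mean a nearly straight plot."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((a - x_bar) * (c - y_bar) for a, c in points)
    sxx = sum((a - x_bar) ** 2 for a in xs)
    syy = sum((c - y_bar) ** 2 for c in ys)
    return sxy / math.sqrt(sxx * syy)


points = normal_plot_points(residuals)
for q, e in points:
    print(f"normal quantile {q:+.2f}   residual {e:+8.1f}")
print(f"straightness (correlation) = {straightness(points):.3f}")
```

If the residuals are roughly normal, the printed pairs fall close to a straight line when plotted.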
Residual Analysis Check #2
Are the residuals unrelated to the explanatory variable? Are the X’s and the errors independent of one another?
– Procedure: a scatter plot of the residuals versus X, called the “Residuals Plot”.
– Ideally, you want to observe no evident pattern between X (the explanatory variable) and e (the residuals).
– If you see a pattern between the two variables, then the residuals are indicating a problem: heteroskedasticity (hetero: different; skedasticity: spread, or variation).
– This problem means that the variable X is related to what you could not explain about Y. In other words, it implies that the relation between X and Y is not LINEAR, OR THAT THERE MAY BE OTHER EXPLANATORY FACTORS: A MULTIPLE REGRESSION MODEL MAY BE BETTER.
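A small illustration of what a pattern in the residuals plot looks like. The data are deliberately artificial: Y is an exact curve (Y = X squared), yet a straight line is fitted to it anyway. The function name `residuals_from_line` is hypothetical.

```python
def residuals_from_line(x, y):
    """Fit y = m*x + b by least squares and return the residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    m = sxy / sxx
    b = y_bar - m * x_bar
    return [yi - (m * xi + b) for xi, yi in zip(x, y)]


# Artificial data (not from the slides): the true relation is a curve, not a line.
x = [0, 1, 2, 3, 4, 5, 6]
y = [xi ** 2 for xi in x]

resid = residuals_from_line(x, y)
for xi, ei in zip(x, resid):
    print(f"X = {xi}: residual = {ei:+.1f}")
```

The residuals are positive at both ends of the X range and negative in the middle, a U shape. That is the kind of evident pattern that signals the straight-line model is missing something.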
Residual Analysis Check #3
Are the residuals independent of one another (serially uncorrelated)?
– The “Durbin Watson Statistic” (DW) is used to perform this last check on the errors.
– DW measures the extent to which consecutive errors are related to each other: 0 ≤ DW ≤ 4.
– Ideally the DW should fall in the range of 1.5 to 2.5. The closer to 2.00 the better.
Durbin Watson Statistic
Page 435: formula
The DW is the ratio of the variation between consecutive errors (numerator: the sum of squared differences between each error and the one before it) to the variation of the errors themselves (denominator: the sum of squared errors). It’s a fancy correlation coefficient to detect correlation among the errors in a regression.
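The formula on page 435 is not reproduced in this transcript. The standard definition of the Durbin-Watson statistic, which should match it assuming the residuals e_1, ..., e_n are listed in time order, is:

```latex
DW = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}
\approx 2\,(1 - r)
```

Here r denotes the correlation between each residual and the one before it (a symbol introduced for this note, not taken from the slides). The approximation shows why 2 is the ideal value: r = 0 gives DW ≈ 2, r = 1 gives DW ≈ 0, and r = -1 gives DW ≈ 4.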
Durbin Watson Statistic
Page 435: formula
– What happens if a model makes the same mistake each and every time, e_t = e_{t-1}? Then DW = 0. This is evidence of perfect, positive autocorrelation. Ideally DW > 1.5.
– What happens if a model makes the exact opposite error from the last time, each time, e_t = -e_{t-1}? Then DW approaches 4. This is evidence of perfect, negative autocorrelation. Ideally DW < 2.5.
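The two extreme cases can be checked numerically. This sketch assumes the standard Durbin-Watson definition (sum of squared differences between consecutive errors, divided by the sum of squared errors); the error series are made up purely to illustrate the extremes.

```python
def durbin_watson(errors):
    """Sum of squared changes between consecutive errors over the sum of squared errors."""
    numerator = sum((errors[t] - errors[t - 1]) ** 2 for t in range(1, len(errors)))
    denominator = sum(e ** 2 for e in errors)
    return numerator / denominator


# Made-up error series (not from the slides).
same_mistake = [3.0] * 10                                            # e_t = e_{t-1}
opposite_mistake = [3.0 if t % 2 == 0 else -3.0 for t in range(10)]  # e_t = -e_{t-1}

print(durbin_watson(same_mistake))      # 0.0
print(durbin_watson(opposite_mistake))  # 3.6
```

With only 10 errors the alternating series gives 4 × 9/10 = 3.6 rather than exactly 4, because the numerator has one fewer term than the denominator. The value gets closer to 4 as the number of observations grows.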