Presentation on theme: "Four Mini-Lectures QMM 510 Fall 2014"— Presentation transcript:
1 Four Mini-Lectures, QMM 510, Fall 2014. Week 11, November 10-14.
2 Chapter 12: Correlation and Regression. Too much?
Chapter Learning Objectives
LO12-1: Calculate and test a correlation coefficient for significance.
LO12-2: Interpret the slope and intercept of a regression equation.
LO12-3: Make a prediction for a given x value using a regression equation.
LO12-4: Fit a simple regression on an Excel scatter plot.
LO12-5: Calculate and interpret confidence intervals for regression coefficients.
LO12-6: Test hypotheses about the slope and intercept by using t tests.
LO12-7: Perform regression analysis with Excel or other software.
LO12-8: Interpret the standard error, $R^2$, ANOVA table, and F test.
LO12-9: Distinguish between confidence and prediction intervals for Y.
LO12-10: Test residuals for violations of regression assumptions.
LO12-11: Identify unusual residuals and high-leverage observations.
Learning Objectives: Listed are the learning objectives for this chapter. These objectives will be discussed as the chapter progresses.
3 Correlation Analysis (ML 11.1, Chapter 12)
Visual Displays
Begin the analysis of bivariate data (i.e., two variables) with a scatter plot.
A scatter plot:
 - displays each observed data pair $(x_i, y_i)$ as a dot on an X / Y grid.
 - indicates visually the strength of the relationship between X and Y.
Figure: Sample Scatter Plot
Visual Displays and Correlation Analysis: One of the first things to do with bivariate data is to display it on a scatter plot. A scatter plot gives an idea of how the two variables are related. Figure 12.1 shows an example of a scatter plot.
4 Correlation Analysis (Chapter 12)
Figure panels: Strong Positive Correlation, Weak Positive Correlation, Weak Negative Correlation, Strong Negative Correlation, No Correlation, Nonlinear Relation.
Note: r is an estimate of the population correlation coefficient $\rho$ (rho).
Visual Displays and Correlation Analysis: The slide shows several possible scatter plots and the correlation coefficient associated with each.
5 Correlation Analysis (Chapter 12)
Steps in Testing if $\rho = 0$ (population correlation = 0)
Step 1: State the Hypotheses. $H_0: \rho = 0$, $H_1: \rho \ne 0$.
Step 2: Specify the Decision Rule. For degrees of freedom d.f. = $n - 2$, look up the critical value $t_{\alpha/2}$ in Appendix D or use Excel =T.INV.2T(α,df) for a 2-tailed test.
Step 3: Calculate the Test Statistic. $t_{calc} = r\sqrt{\dfrac{n-2}{1-r^2}}$
Step 4: Make the Decision. If using the t statistic method, reject $H_0$ if $|t_{calc}| > t_{\alpha/2}$ or if the p-value $\le \alpha$.
Note: $-1 \le r \le +1$, and $r = 0$ indicates no linear relationship.
Visual Displays and Correlation Analysis: One can test whether the linear correlation between two variables is significant. This slide shows the steps of the test.
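The test statistic in Step 3 can be sketched in a few lines of Python. This is only an illustration: the function name `correlation_t_stat` and the five data pairs are made up, and the critical value 3.182 is the two-tailed t value for α = 0.05 with 3 degrees of freedom.

```python
import math

def correlation_t_stat(x, y):
    """Return (r, t) where t = r * sqrt((n - 2) / (1 - r^2))."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    # The statistic is undefined when |r| = 1 (division by zero).
    t = r * math.sqrt((n - 2) / (1 - r ** 2))
    return r, t

# Hypothetical data, not from the slides
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
r, t_calc = correlation_t_stat(x, y)

t_crit = 3.182  # two-tailed, alpha = 0.05, d.f. = n - 2 = 3
reject_h0 = abs(t_calc) > t_crit
print(f"r = {r:.3f}, t = {t_calc:.3f}, reject H0: {reject_h0}")
```

With so few observations, even r = 0.8 is not significant at α = 0.05, which shows why the sample size matters in this test.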
6 Correlation Analysis (Chapter 12)
Alternative Method to Test for $\rho = 0$
Equivalently, you can calculate the critical value for the correlation coefficient using
$r_{crit} = \dfrac{t_{\alpha/2}}{\sqrt{t_{\alpha/2}^2 + n - 2}}$
This method gives a benchmark for the correlation coefficient. However, it provides no p-value and is inflexible if you change your mind about $\alpha$. MegaStat uses this method, giving two-tail critical values for $\alpha = 0.05$ and $\alpha = 0.01$.
Table: Critical values of r for various sample sizes.
Visual Displays and Correlation Analysis: One can test whether the linear correlation between two variables is significant. This slide continues the description of the test.
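As a rough illustration of the benchmark approach, the critical value of r can be computed from the critical t value. The function name `critical_r` is hypothetical; 2.101 is the two-tailed t value for α = 0.05 with 18 degrees of freedom (n = 20).

```python
import math

def critical_r(t_crit, n):
    """Critical correlation: t / sqrt(t^2 + n - 2)."""
    return t_crit / math.sqrt(t_crit ** 2 + n - 2)

# n = 20 observations, alpha = 0.05 two-tailed, d.f. = 18
r_crit = critical_r(2.101, 20)
print(f"Reject H0 when |r| > {r_crit:.3f}")
```

A sample correlation whose absolute value exceeds this benchmark would be judged significant at the chosen α.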
7 Simple Regression (ML 11.2, Chapter 12)
What is Simple Regression?
Simple regression analyzes the relationship between two variables. It specifies one dependent (response) variable and one independent (predictor) variable. This hypothesized relationship (in this chapter) will be linear.
Simple Regression: Regression analysis helps us model the relationship between the two variables. Simple regression analysis deals with one dependent (response) variable and one independent (predictor) variable when the relationship is linear.
8 Simple Regression (Chapter 12)
Interpreting an Estimated Regression Equation
Simple Regression: The slide shows examples of simple regression models or equations.
9 Simple Regression (Chapter 12)
Prediction Using Regression: Examples
Simple Regression: The slide shows examples of simple regression models or equations and the use of these models to make predictions.
10 Simple Regression (Chapter 12)
Cause-and-Effect? Can We Make Predictions?
Simple Regression: Keep in mind that simple regression does not prove cause and effect. Once the regression model is determined, take care when using it to make predictions. The model captures the behavior of the bivariate data only over the observed range of the predictor variable. Predictor values outside this range may give unreliable predictions, because we do not know how the relationship behaves there.
11 Regression Terminology (Chapter 12)
Model and Parameters
The assumed model for a linear relationship is $y = \beta_0 + \beta_1 x + \varepsilon$.
The relationship holds for all pairs $(x_i, y_i)$.
The error term is not observable; it is assumed to be independently normally distributed with mean 0 and standard deviation $\sigma$.
The unknown parameters are: $\beta_0$ (intercept) and $\beta_1$ (slope).
Regression Terminology: This slide presents the population linear regression model with the intercept and slope parameters. Note the model includes an error term.
12 Regression Terminology (Chapter 12)
Model and Parameters
The fitted model or regression model used to predict the expected value of Y for a given value of X is $\hat{y} = b_0 + b_1 x$.
The fitted coefficients are: $b_0$ (estimated intercept) and $b_1$ (estimated slope).
Regression Terminology: This slide presents the fitted linear regression model with the estimates for the intercept and slope. Note the fitted model does not include an error term. We would want the predicted error to be zero. The error, or residual, for an observation is the difference between the observed response and the predicted response at that predictor value: $e_i = y_i - \hat{y}_i$. Each response value used in the model has a residual associated with it.
13 Regression Terminology (Chapter 12)
A more precise method is to let Excel calculate the estimates. Enter observations on the independent variable $x_1, x_2, \ldots, x_n$ and the dependent variable $y_1, y_2, \ldots, y_n$ into separate columns, and let Excel fit the regression equation. Excel will choose the regression coefficients so as to produce a good fit.
Regression Terminology: Figure 12.6 displays the Excel scatter plot with the fitted line for the given data in the example. Included in the display is the equation or model for the linear regression.
14 Regression Terminology (Chapter 12)
Scatter plot shows a sample of miles per gallon and horsepower for 15 vehicles.
Slope Interpretation: The estimated slope is negative: for each additional unit of engine horsepower, miles per gallon decreases by the magnitude of the slope. This estimated slope is a statistic because a different sample might yield a different estimate of the slope.
Intercept Interpretation: The intercept value suggests that when the engine has no horsepower, the fuel efficiency would be quite high. However, the intercept has little meaning in this case, not only because zero horsepower makes no logical sense, but also because extrapolating to x = 0 is beyond the range of the observed data.
Regression Terminology: The discussion of the example from the previous slide is continued on this slide.
15 Ordinary Least Squares (OLS) Formulas (Chapter 12)
OLS Method
The ordinary least squares (OLS) method estimates the slope and intercept of the regression line so that the sum of squared residuals is minimized.
The sum of the residuals is 0.
The sum of the squared residuals is SSE.
Ordinary Least Squares (OLS) Formulas: The ordinary least squares method allows us to determine estimates for the intercept and the slope such that the error sum of squares is minimized.
16 Ordinary Least Squares (OLS) Formulas (Chapter 12)
Slope and Intercept
The OLS estimator for the slope is: $b_1 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$
Excel function: =SLOPE(YData, XData)
The OLS estimator for the intercept is: $b_0 = \bar{y} - b_1 \bar{x}$
Excel function: =INTERCEPT(YData, XData)
Ordinary Least Squares (OLS) Formulas: The ordinary least squares formulas for the intercept and slope are given on this slide. These formulas are built into Excel.
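For readers working outside Excel, the same two estimators can be sketched in plain Python. The function name `ols_fit` and the data are illustrative assumptions, not part of the lecture.

```python
def ols_fit(x, y):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    b1 = sxy / sxx               # slope
    b0 = mean_y - b1 * mean_x    # intercept
    return b0, b1

# Hypothetical data, not from the slides
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
b0, b1 = ols_fit(x, y)

# Residuals e_i = y_i - y-hat_i; under OLS they sum to zero
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, sum of residuals = {sum(residuals):.6f}")
```

Because $b_0 = \bar{y} - b_1 \bar{x}$, the fitted line necessarily passes through the point $(\bar{x}, \bar{y})$.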
17 Ordinary Least Squares (OLS) Formulas (Chapter 12)
Example: Achievement Test Scores
20 high school students’ achievement exam scores.
Note that verbal scores average higher than quant scores (slope exceeds 1 and intercept shifts the line up almost 20 points).
Analysis of Variance: Overall Fit: We can test the overall significance of the regression by using an F test to compare the explained and unexplained sums of squares. The ANOVA table in the slide shows all the components used in the F test.
18 Ordinary Least Squares (OLS) Formulas (Chapter 12)
Slope and Intercept
Ordinary Least Squares (OLS) Formulas: Figure 12.8 shows that the ordinary least squares regression line always passes through the point $(\bar{x}, \bar{y})$, that is, the mean of the predictor variable and the mean of the response variable.
19 Assessing Fit (Chapter 12)
We want to explain the total variation in Y around its mean (SST, the total sum of squares).
The regression sum of squares (SSR) is the explained variation in Y.
Ordinary Least Squares (OLS) Formulas: How well does the regression model fit the data? One can assess the fit.
20 Assessing Fit (Chapter 12)
The error sum of squares (SSE) is the unexplained variation in Y.
If the fit is good, SSE will be relatively small compared to SST.
A perfect fit is indicated by SSE = 0.
The magnitude of SSE depends on n and on the units of measurement.
Ordinary Least Squares (OLS) Formulas: How well does the regression model fit the data? One can assess the fit (continued).
21 Assessing Fit (Chapter 12)
Coefficient of Determination
$R^2$ is a measure of relative fit based on a comparison of SSR (explained variation) and SST (total variation): $R^2 = SSR / SST$, with $0 \le R^2 \le 1$.
Often expressed as a percent, $R^2 = 1$ (i.e., 100%) indicates perfect fit.
In simple regression, $R^2 = r^2$, where $r^2$ is the squared correlation coefficient.
Ordinary Least Squares (OLS) Formulas: How well does the regression model fit the data? One can assess the fit (continued). The goodness of fit is measured by the coefficient of determination. A coefficient of determination close to 1 (100%) indicates that the model is likely to be useful for predictions. Note that in simple regression the coefficient of determination can be obtained by squaring the correlation coefficient.
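The decomposition SST = SSR + SSE and the resulting $R^2$ can be checked numerically. The sketch below uses made-up data and a hypothetical helper name, `sums_of_squares`.

```python
def sums_of_squares(x, y):
    """Return (sst, ssr, sse) for a simple OLS regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    fitted = [b0 + b1 * a for a in x]
    sst = sum((b - mean_y) ** 2 for b in y)               # total variation
    ssr = sum((f - mean_y) ** 2 for f in fitted)          # explained variation
    sse = sum((b - f) ** 2 for b, f in zip(y, fitted))    # unexplained variation
    return sst, ssr, sse

# Hypothetical data, not from the slides
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
sst, ssr, sse = sums_of_squares(x, y)
r_squared = ssr / sst
print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, SSE = {sse:.2f}, R^2 = {r_squared:.2f}")
```

For this data the correlation is r = 0.8, so $R^2 = r^2 = 0.64$, consistent with the simple regression identity stated on the slide.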
22 Assessing Fit (Chapter 12)
Example: Achievement Test Scores
Strong relationship between quant score and verbal score (68 percent of variation explained): $R^2 = SSR / SST = .6838$.
Excel shows the sums (SSR and SST) needed to calculate $R^2$.
Analysis of Variance: Overall Fit: We can test the overall significance of the regression by using an F test to compare the explained and unexplained sums of squares. The ANOVA table in the slide shows all the components used in the F test.
23 Tests for Significance (Chapter 12)
Standard Error of Regression
The standard error ($s_e$) is an overall measure of model fit: $s_e = \sqrt{\dfrac{SSE}{n-2}}$.
Excel’s Data Analysis > Regression calculates $s_e$.
If the fitted model’s predictions are perfect (SSE = 0), then $s_e = 0$. Thus, a small $s_e$ indicates a better fit.
Used to construct confidence intervals.
Magnitude of $s_e$ depends on the units of measurement of Y and on data magnitude.
Tests for Significance: The standard error is an overall measure of model fit. A small standard error implies a good fit. The standard error is used when constructing confidence intervals for the intercept and the slope.
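A minimal sketch of the standard error calculation, assuming the usual simple regression definition $s_e = \sqrt{SSE/(n-2)}$. The function name `standard_error` and the data are hypothetical.

```python
import math

def standard_error(x, y):
    """Standard error of a simple regression: sqrt(SSE / (n - 2))."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
    return math.sqrt(sse / (n - 2))

# Hypothetical data, not from the slides
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
s_e = standard_error(x, y)
print(f"s_e = {s_e:.4f}")
```

The result is in the units of Y, which is why its magnitude cannot be compared across models with differently scaled responses.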
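24 Tests for Significance (Chapter 12)
Confidence Intervals for Slope and Intercept
Standard error of the slope and intercept:
$s_{b_1} = \dfrac{s_e}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$, $s_{b_0} = s_e \sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$
Confidence interval for the true slope and intercept (d.f. = $n - 2$):
$b_1 \pm t_{\alpha/2}\, s_{b_1}$ and $b_0 \pm t_{\alpha/2}\, s_{b_0}$
Excel’s Data Analysis > Regression constructs confidence intervals for the slope and intercept.
Tests for Significance: This slide gives the standard errors for the intercept and slope.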
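25 Tests for Significance (Chapter 12)
Hypothesis Tests
If $\beta_1 = 0$, then the regression model collapses to a constant $\beta_0$ plus random error.
The hypotheses to be tested are:
 - Slope: $H_0: \beta_1 = 0$ versus $H_1: \beta_1 \ne 0$, with $t_{calc} = b_1 / s_{b_1}$.
 - Intercept: $H_0: \beta_0 = 0$ versus $H_1: \beta_0 \ne 0$, with $t_{calc} = b_0 / s_{b_0}$.
Reject $H_0$ if $|t_{calc}| > t_{\alpha/2}$ or if the p-value $\le \alpha$, with d.f. = $n - 2$.
Excel’s Data Analysis > Regression performs these tests.
Tests for Significance: This slide presents the procedures to test for zero slope and intercept. Observe that the test statistic has a t distribution.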
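The slope test and its confidence interval can be sketched as follows, assuming the standard simple regression formula $s_{b_1} = s_e / \sqrt{\sum (x_i - \bar{x})^2}$. The function name `slope_inference` and the data are hypothetical, and the critical value is supplied by the caller (here 3.182 for α = 0.05, d.f. = 3).

```python
import math

def slope_inference(x, y, t_crit):
    """Return (b1, s_b1, t_calc, ci) for the slope of a simple regression."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))
    s_b1 = s_e / math.sqrt(sxx)           # standard error of the slope
    t_calc = b1 / s_b1                    # test statistic for H0: beta1 = 0
    ci = (b1 - t_crit * s_b1, b1 + t_crit * s_b1)
    return b1, s_b1, t_calc, ci

# Hypothetical data, not from the slides
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
b1, s_b1, t_calc, ci = slope_inference(x, y, t_crit=3.182)
print(f"b1 = {b1:.3f}, t = {t_calc:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

Here the interval contains zero and $|t_{calc}|$ is below the critical value, so both methods agree that the slope is not significantly different from zero for this small sample.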
26 Analysis of Variance: Overall Fit (Chapter 12)
Example: Achievement Test Scores
20 high school students’ achievement exam scores.
Excel shows 95% confidence intervals and t test statistics.
MegaStat is similar but rounds off and highlights p-values to show significance (light yellow .05, bright yellow .01).
Analysis of Variance: Overall Fit: We can test the overall significance of the regression by using an F test to compare the explained and unexplained sums of squares. The ANOVA table in the slide shows all the components used in the F test.
27 Analysis of Variance: Overall Fit (Chapter 12)
F Test for Overall Fit
To test a regression for overall significance, we use an F test to compare the explained (SSR) and unexplained (SSE) sums of squares.
Analysis of Variance: Overall Fit: We can test the overall significance of the regression by using an F test to compare the explained and unexplained sums of squares. The ANOVA table in the slide shows all the components used in the F test.
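A small sketch of the F statistic for simple regression, assuming the usual ANOVA definitions MSR = SSR / 1 and MSE = SSE / (n - 2). The helper name `f_statistic` and the data are made up.

```python
def f_statistic(x, y):
    """F = MSR / MSE for a simple regression (1 and n - 2 degrees of freedom)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    fitted = [b0 + b1 * a for a in x]
    ssr = sum((f - mean_y) ** 2 for f in fitted)
    sse = sum((b - f) ** 2 for b, f in zip(y, fitted))
    msr = ssr / 1          # one predictor
    mse = sse / (n - 2)
    return msr / mse

# Hypothetical data, not from the slides
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
f_calc = f_statistic(x, y)
print(f"F = {f_calc:.4f}")
```

In simple regression the F statistic equals the square of the slope's t statistic, so the F test and the two-tailed t test for the slope always reach the same conclusion.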
28 Analysis of Variance: Overall Fit (Chapter 12)
Example: Achievement Test Scores
20 high school students’ achievement exam scores.
Excel shows the ANOVA sums, the F test statistic, and its p-value.
MegaStat is similar, but also highlights p-values to indicate significance (light yellow .05, bright yellow .01).
Analysis of Variance: Overall Fit: We can test the overall significance of the regression by using an F test to compare the explained and unexplained sums of squares. The ANOVA table in the slide shows all the components used in the F test.
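29 Confidence and Prediction Intervals for Y (Chapter 12)
How to Construct an Interval Estimate for Y
Confidence interval for the conditional mean of Y:
$\hat{y}_i \pm t_{\alpha/2}\, s_e \sqrt{\dfrac{1}{n} + \dfrac{(x_i - \bar{x})^2}{\sum_{j=1}^{n} (x_j - \bar{x})^2}}$
Prediction interval for an individual value of Y:
$\hat{y}_i \pm t_{\alpha/2}\, s_e \sqrt{1 + \dfrac{1}{n} + \dfrac{(x_i - \bar{x})^2}{\sum_{j=1}^{n} (x_j - \bar{x})^2}}$
Prediction intervals are wider than confidence intervals for the mean because individual Y values vary more than the mean of Y.
Excel does not do these CIs!
Confidence and Prediction Intervals for Y: The slide presents the relationships used to determine the confidence interval for the mean of the response variable (Y) and the prediction interval for an individual Y value.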
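Since Excel does not produce these intervals, a short Python sketch may help. It assumes the standard textbook formulas for simple regression; the function name `interval_half_widths`, the data, and the chosen point x0 = 4 are hypothetical, and the critical t value is passed in by the caller.

```python
import math

def interval_half_widths(x, y, x0, t_crit):
    """Return (y_hat, ci_half, pi_half) at predictor value x0."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))
    core = 1 / n + (x0 - mean_x) ** 2 / sxx
    ci_half = t_crit * s_e * math.sqrt(core)        # conditional mean of Y
    pi_half = t_crit * s_e * math.sqrt(1 + core)    # individual Y value
    return b0 + b1 * x0, ci_half, pi_half

# Hypothetical data; 3.182 is the two-tailed t value for alpha = 0.05, d.f. = 3
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
y_hat, ci_half, pi_half = interval_half_widths(x, y, x0=4, t_crit=3.182)
print(f"y-hat = {y_hat:.2f}, CI: +/- {ci_half:.3f}, PI: +/- {pi_half:.3f}")
```

Both half-widths grow as x0 moves away from the mean of X, which is another way of seeing why extrapolation is risky.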
30 Tests of Assumptions (Chapter 12)
Three Important Assumptions
 - The errors are normally distributed.
 - The errors have constant variance (i.e., they are homoscedastic).
 - The errors are independent (i.e., they are nonautocorrelated).
Non-normal Errors
Non-normality of errors is a mild violation since the regression parameter estimates $b_0$ and $b_1$ and their variances remain unbiased and consistent.
Confidence intervals for the parameters may be untrustworthy because the normality assumption is used to justify using Student’s t distribution.
Residual Tests: For the true model, the errors are assumed to be independent and normally distributed with constant variance.
31 Residual Tests (Chapter 12)
Non-normal Errors
A large sample size would compensate.
Outliers could pose serious problems.
Normal Probability Plot
The normal probability plot tests the assumption
$H_0$: Errors are normally distributed
$H_1$: Errors are not normally distributed
If $H_0$ is true, the residual probability plot should be linear, as shown in the example.
Residual Tests: We can test whether the normality assumption is violated. This slide explains one such test, the normal probability plot. If the plot is approximately a straight line, the normality assumption has not been violated.
32 Residual Tests (Chapter 12)
What to Do about Non-normality?
 - Trim outliers only if they clearly are mistakes.
 - Increase the sample size if possible.
 - If data are totals, try a logarithmic transformation of both X and Y.
Residual Tests: If the normality assumption is violated, what steps can fix the problem? Three solutions are presented in the slide.
33 Residual Tests (Chapter 12)
Heteroscedastic Errors (Nonconstant Variance)
The ideal condition is that the error magnitude is constant (i.e., errors are homoscedastic).
Heteroscedastic errors increase or decrease with X.
In the most common form of heteroscedasticity, the variances of the estimators are likely to be understated. This results in overstated t statistics and artificially narrow confidence intervals.
Tests for Heteroscedasticity
Plot the residuals against X. Ideally, there is no pattern in the residuals moving from left to right.
Residual Tests: The variance of the errors is assumed to be constant for the true model. One can use graphical techniques to help determine whether the constant variance assumption is violated. If the plot of residuals versus the predictor variable (X) shows no pattern, as in the diagram, the constant variance assumption has probably not been violated.
34 Residual Tests (Chapter 12)
Tests for Heteroscedasticity
The “fan-out” pattern of increasing residual variance is the most common pattern indicating heteroscedasticity.
Residual Tests: Patterns as shown in this slide would indicate nonconstant variance and would imply a violation of the assumption.
35 Residual Tests (Chapter 12)
What to Do about Heteroscedasticity?
Transform both X and Y, for example, by taking logs.
Although it can widen the confidence intervals for the coefficients, heteroscedasticity does not bias the estimates.
Autocorrelated Errors
Autocorrelation is a pattern of non-independent errors.
In a first-order autocorrelation, $e_t$ is correlated with $e_{t-1}$.
The estimated variances of the OLS estimators are biased, resulting in confidence intervals that are too narrow, overstating the model’s fit.
Residual Tests: One can transform both X and Y to help with nonconstant variance of the residuals. We can test for first-order autocorrelation to check independence. Most statistical software can perform such a test.
36 Residual Tests (Chapter 12)
Runs Test for Autocorrelation
In the runs test, count the number of the residuals’ sign reversals (i.e., how often does the residual cross the zero centerline?).
If the pattern is random, the number of sign changes should be about n/2.
Fewer than n/2 would suggest positive autocorrelation.
More than n/2 would suggest negative autocorrelation.
Durbin-Watson (DW) Test
Tests for autocorrelation under the hypotheses
$H_0$: Errors are nonautocorrelated
$H_1$: Errors are autocorrelated
The DW statistic will range from 0 to 4.
 - DW < 2 suggests positive autocorrelation
 - DW = 2 suggests no autocorrelation (ideal)
 - DW > 2 suggests negative autocorrelation
Residual Tests: This slide presents two tests that can help determine whether the residuals are independent of each other.
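Both diagnostics are easy to compute from a list of residuals in observation order. The sketch below assumes the usual Durbin-Watson definition, $DW = \sum_{t=2}^{n} (e_t - e_{t-1})^2 / \sum_{t=1}^{n} e_t^2$; the function names and the residual values are hypothetical.

```python
def count_sign_changes(residuals):
    """Count how often consecutive residuals have opposite signs."""
    return sum(1 for prev, cur in zip(residuals, residuals[1:]) if prev * cur < 0)

def durbin_watson(residuals):
    """Sum of squared successive differences divided by sum of squared residuals."""
    num = sum((cur - prev) ** 2 for prev, cur in zip(residuals, residuals[1:]))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Hypothetical residuals in time order
residuals = [0.6, -1.2, 1.0, -0.8, 0.4]
sign_changes = count_sign_changes(residuals)
dw = durbin_watson(residuals)
print(f"sign changes = {sign_changes} (n/2 = {len(residuals) / 2}), DW = {dw:.3f}")
```

These residuals alternate in sign at every step, so the count exceeds n/2 and DW is well above 2, both pointing toward negative autocorrelation.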
37 Residual Tests (Chapter 12)
What to Do about Autocorrelation?
Transform both variables using the method of first differences, in which both variables are redefined as changes: $\Delta x_t = x_t - x_{t-1}$ and $\Delta y_t = y_t - y_{t-1}$. Then we regress $\Delta Y$ against $\Delta X$.
Although it can widen the confidence interval for the coefficients, autocorrelation does not bias the estimates.
Don’t worry about it at this stage of your training. Just learn to detect whether it exists.
Residual Tests: This slide presents a fix to help with violation of the independence assumption for the residuals.
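The first-difference transformation itself is a one-liner. This sketch uses a hypothetical helper, `first_differences`, on made-up series; the differenced series would then be used in place of the original X and Y in the regression.

```python
def first_differences(series):
    """Redefine a time-ordered series as period-to-period changes."""
    return [cur - prev for prev, cur in zip(series, series[1:])]

# Hypothetical time-ordered observations
x = [10, 12, 15, 19]
y = [3, 5, 4, 8]
dx = first_differences(x)
dy = first_differences(y)
print(dx, dy)
```

Note that differencing loses one observation, so the regression on changes has n - 1 data pairs.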
38 Residual Tests (Chapter 12)
Example: Excel’s Tests of Assumptions
Excel’s Data Analysis > Regression does residual plots and gives the DW test statistic. Its standardized residuals are done in a strange way, but usually they are not misleading.
Warning: Excel offers normal probability plots for residuals, but they are done incorrectly.
39 Residual Tests (Chapter 12)
Example: MegaStat’s Tests of Assumptions
MegaStat will do all three tests (if you check the boxes). Its runs plot (residuals by observation) is a visual test for autocorrelation, which Excel does not offer.
40 Residual Tests (Chapter 12)
Example: MegaStat’s Tests of Assumptions
 - near-linear plot - indicates normal errors
 - no pattern - suggests homoscedastic errors
 - DW near 2 - suggests no autocorrelation
41 Unusual Observations (Chapter 12)
Standardized Residuals
Use Excel, MINITAB, MegaStat or other software to compute standardized residuals.
If the absolute value of any standardized residual is at least 2, then it is classified as unusual.
Leverage and Influence
A high leverage statistic indicates the observation is far from the mean of X.
These observations are influential because they are at the “end of the lever.”
The leverage for observation i is denoted $h_i$.
Unusual Observations: One can use statistical software to help determine whether the data contain unusual observations. A rule of thumb is that a standardized residual whose absolute value is 2 or more (i.e., at least 2 standard deviations from zero) may be classified as unusual.
42 Unusual Observations (Chapter 12)
Leverage
A leverage that exceeds 4/n is unusual.
Unusual Observations: A high leverage statistic would indicate that the observation is far from the mean of the predictor variable (X).
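The slide does not show the leverage formula. A minimal sketch, assuming the standard simple regression expression $h_i = \dfrac{1}{n} + \dfrac{(x_i - \bar{x})^2}{\sum_{j} (x_j - \bar{x})^2}$, is given below; the function name `leverages` and the data are hypothetical.

```python
def leverages(x):
    """Leverage h_i = 1/n + (x_i - mean)^2 / sum of squared deviations."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    return [1 / n + (a - mean_x) ** 2 / sxx for a in x]

# Hypothetical predictor values; the last one is far from the others
x = [1, 2, 3, 4, 20]
h = leverages(x)
threshold = 4 / len(x)  # rule of thumb from the slide
high_leverage = [i for i, hi in enumerate(h) if hi > threshold]
print([round(hi, 3) for hi in h], "flagged indices:", high_leverage)
```

Only the observation far from the mean of X is flagged. In simple regression the leverages always sum to 2, which is a handy sanity check.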
43 Unusual Observations (Chapter 12)
Example: Achievement Test Scores
If the absolute value of any standardized residual is at least 2, then it is classified as unusual.
Leverage that exceeds 4/n indicates an influential X value (far from mean of X).
Unusual Observations: A high leverage statistic would indicate that the observation is far from the mean of the predictor variable (X).
44 Other Regression Problems (Chapter 12)
Outliers
Outliers may be caused by
 - an error in recording data.
 - impossible data (can be omitted).
 - an observation that has been influenced by an unspecified “lurking” variable that should have been controlled but wasn’t.
To fix the problem
 - delete the observation(s) if you are sure they are actually wrong.
 - formulate a multiple regression model that includes the lurking variable.
Other Regression Problems: This slide presents what may cause outliers to occur in the data and some possible fixes.
45 Other Regression Problems (Chapter 12)
Model Misspecification
If a relevant predictor has been omitted, then the model is misspecified.
For example, Height depends on Gender as well as Age.
Use multiple regression instead of bivariate regression.
Ill-Conditioned Data
Well-conditioned data values are of the same general order of magnitude.
Ill-conditioned data have unusually large or small data values and can cause loss of regression accuracy or awkward estimates.
Other Regression Problems: This slide presents two problems one may encounter in regression analysis.
46 Other Regression Problems (Chapter 12)
Ill-Conditioned Data
Avoid mixing magnitudes by adjusting the magnitude of your data before running the regression.
For example, Revenue = 139,405,377 mixed with ROI = .037.
Spurious Correlation
In a spurious correlation, two variables appear related because of the way they are defined.
This problem is called the size effect or problem of totals.
Expressing variables as per capita or percent may be helpful.
Other Regression Problems: This slide presents two other problems one may encounter in regression analysis.
47 Other Regression Problems (Chapter 12)
Model Form and Variable Transforms
Sometimes a nonlinear model is a better fit than a linear model. Excel offers other model forms for simple regression (one X and one Y).
Variables may be transformed (e.g., logarithmic or exponential functions) in order to provide a better fit.
Log transformations reduce heteroscedasticity.
Nonlinear models may be difficult to interpret.
Other Regression Problems: One should try other models and compare them in order to get the best fit. Appropriate software may be used to help with this.
48 Assignments (ML 11.4)
Connect C-8 (covers chapter 12)
 - You get three attempts
 - Feedback is given if requested
 - Printable if you wish
 - Deadline is midnight each Monday
Project P-3 (data, tasks, questions)
 - Review instructions
 - Look at the data
 - Your task is to write a nice, readable report (not a spreadsheet)
 - Length is up to you
Here we discuss the difference between the subject of statistics and a single measure used to summarize a sample data set. The subject of statistics is the science of collecting, organizing, analyzing, interpreting, and presenting data, while a statistic is simply a numerical value computed from the data set. Later we will see that it is associated with a sample.
49 Projects: General Instructions
For each team project, submit a short (5-10 page) report (using Microsoft Word or equivalent) that answers the questions posed. Strive for effective writing (see textbook Appendix I). Creativity and initiative will be rewarded. Avoid careless spelling and grammar. Paste graphs and computer tables or output into your written report (it may be easier to format tables in Excel and then use Paste Special > Picture to avoid weird formatting and permit sizing within Word). Allocate tasks among team members as you see fit, but all should review and proofread the report (submit only one report).
50 Project P-3
You will be assigned team members and a dependent variable (see Moodle) from the 2010 state database Big Dataset 09 - US States. The team may change the assigned dependent variable (instructor assigned one just to give you a quick start). Delegate tasks and collaborate as seems appropriate, based on your various skills. Submit one report.
Data: Choose an interesting dependent variable (non-binary) from the 2010 state database posted on Moodle.
Analysis:
(a) Propose a reasonable model of the form $Y = f(X_1, X_2, \ldots, X_k)$ using not more than 12 predictors.
(b) Use regression to investigate the hypothesized relationship.
(c) Try deleting poor predictors until you feel that you have a parsimonious model, based on the t-values, p-values, standard error, and $R^2_{adj}$.
(d) For the preferred model only, obtain a list of residuals and request residual tests and VIFs.
(e) List the states with high leverage and/or unusual residuals.
(f) Make a histogram and/or probability plot of the residuals. Are the residuals normal?
(g) For the predictors that were retained, analyze the correlation matrix and/or VIFs. Is multicollinearity a problem? If so, what could be done?
(h) If you had more time, what might you do?
Watch the instructor’s video walkthrough using Assault as an example (posted on Moodle).
51 Project P-3 (preview, data, tasks)
Example using the 2005 state database: 170 variables on n = 50 states.
 - Choose one variable as Y (the response).
 - Goal: to explain why Y varies from state to state.
 - Start choosing $X_1, X_2, \ldots, X_k$ (the predictors).
 - Copy Y and $X_1, X_2, \ldots, X_k$ to a new spreadsheet.
 - Study the definitions for each variable (e.g., Burglary is the burglary rate per 100,000 population).
52 Project P-3 (preview, data, tasks)
Why multiple predictors?
 - One predictor usually is an incorrect specification.
 - Fit can usually be improved.
How many predictors: Evans’ Rule ($k \le n/10$)
 - Up to one predictor per 10 observations.
 - For example, n = 50 suggests k = 5 predictors.
 - Evans’ Rule is conservative. It’s OK to start with more (you will end up with fewer after deleting weak predictors).
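As a trivial illustration of Evans’ Rule, the maximum suggested number of predictors is the whole-number part of n/10. The function name `evans_max_predictors` is made up for this sketch.

```python
def evans_max_predictors(n):
    """Evans' Rule: at most one predictor per 10 observations (k <= n / 10)."""
    return n // 10

k_max = evans_max_predictors(50)  # 50 states
print(f"With n = 50, Evans' Rule suggests at most {k_max} predictors")
```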
53 Project P-3 (preview, data, tasks)
 - Work with partners? Absolutely, it will be more fun.
 - Post questions for peers or instructor on Moodle.
 - Get started. But don’t run a bunch of regressions until you have studied Chapter 13.
 - It’s a good idea to have the instructor look over your list of intended Y and $X_1, X_2, \ldots, X_k$ in order to avoid unnecessary rework if there are obvious problems.
 - Look at all the categories of variables. Don’t just grab the first one you see (there are 170 variables). Or just use the one your instructor assigned.