Stat 112: Lecture 14 Notes (Finish Chapter 6)


Stat 112: Lecture 14 Notes
Finish Chapter 6:
– Checking and Remedying the Constant Variance Assumption (Section 6.5)
– Checking and Remedying the Normality Assumption (Section 6.6)
– Outliers and Influential Points (Section 6.7)
Homework 4 is due next Thursday. Please let me know of any ideas you want to discuss for the final project.

Assumptions of the Multiple Linear Regression Model
– Linearity: the mean of Y is a linear function of the explanatory variables, E(Y | X_1, ..., X_K) = β_0 + β_1 X_1 + ... + β_K X_K.
– Constant variance: the standard deviation of Y for the subpopulation of units with X_1 = x_1, ..., X_K = x_K is the same for all subpopulations.
– Normality: the distribution of Y for the subpopulation of units with X_1 = x_1, ..., X_K = x_K is normal for all subpopulations.
– Independence: the observations are independent.

Checking the Constant Variance Assumption
– The plot of residuals versus each explanatory variable should exhibit constant variance.
– The plot of residuals versus predicted values should exhibit constant variance (this plot is often the most useful for detecting nonconstant variance).
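
The course produces these plots in JMP, but a minimal Python sketch (statsmodels and matplotlib) with simulated data shows the idea; the simulated error spread grows with x, so the nonconstant variance is visible in both plots:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Simulated data in which the error spread grows with x, violating
# the constant variance assumption.
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, 100)
y = 2 + 3 * x + rng.normal(0, 0.5 * x)    # error SD proportional to x

fit = sm.OLS(y, sm.add_constant(x)).fit()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Residuals versus the explanatory variable.
ax1.scatter(x, fit.resid)
ax1.axhline(0, color="gray")
ax1.set(xlabel="x", ylabel="residual", title="Residuals vs. x")

# Residuals versus predicted values -- often the most useful plot
# for detecting nonconstant variance.
ax2.scatter(fit.fittedvalues, fit.resid)
ax2.axhline(0, color="gray")
ax2.set(xlabel="predicted y", ylabel="residual", title="Residuals vs. predicted")

plt.tight_layout()
plt.show()
```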

Heteroscedasticity
When the requirement of constant variance is violated, we have a condition of heteroscedasticity. Diagnose heteroscedasticity by plotting the residuals against the predicted values ŷ.
[Figure: residual plot against ŷ in which the spread of the residuals increases with ŷ]

How much traffic would a building generate?
The goal is to predict how much traffic will be generated by a proposed new building with 150,000 occupied square feet. (The data are from the MidAtlantic States City Planning Manual.) The data record how many automobile trips per day were made in the AM to office buildings of different sizes. The variables are x = occupied floor space in the building (in 1000s of sq ft) and Y = number of automobile trips arriving at the building per day in the morning.

The heteroscedasticity shows here. [Figure not reproduced in the transcript]

Reducing Nonconstant Variance/Nonnormality by Transformations
A brief list of transformations:
– y' = y^(1/2) (for y > 0): use when the spread of the residuals increases with ŷ.
– y' = log y (for y > 0): use when the distribution of the residuals is skewed to the right.
– y' = y^2: use when the spread of the residuals decreases with ŷ, or when the error distribution is skewed to the left.

To try to fix the heteroscedasticity, we transform Y to log(Y). This fixes the heteroscedasticity, BUT it creates a nonlinear pattern.

To fix the nonlinearity, we now transform x to log(x), leaving Y on the log scale. The resulting pattern is both satisfactorily homoscedastic AND linear. A sketch of this log-log fit appears below.
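
A minimal Python sketch of the log-log fit (statsmodels), using simulated stand-in values since the transcript does not include the traffic data; it also shows the back-transformed prediction for the proposed 150,000 sq ft building:

```python
import numpy as np
import statsmodels.api as sm

# Simulated stand-ins for the traffic data: occupied floor space
# (1000s of sq ft) and AM automobile trips per day.
rng = np.random.default_rng(1)
sqft = rng.uniform(20, 400, 60)
trips = np.exp(0.5 + 0.9 * np.log(sqft) + rng.normal(0, 0.25, 60))

# Regress log(trips) on log(sqft).
X = sm.add_constant(np.log(sqft))
fit = sm.OLS(np.log(trips), X).fit()

# Predict for the proposed 150,000 sq ft building (x = 150).
# Exponentiating the predicted log gives a median-type back-transformed
# prediction rather than the mean.
log_pred = fit.predict(np.array([[1.0, np.log(150)]]))[0]
print(f"Predicted AM trips for 150,000 sq ft: {np.exp(log_pred):.0f}")
```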

Often we will plot residuals versus predicted values. For simple regression the two residual plots (versus x and versus the predicted values) are equivalent, because the predicted value is a linear function of x.

Checking Normality
If the disturbances are normally distributed, about 68% of the standardized residuals should be between -1 and +1, about 95% should be between -2 and +2, and about 99.7% should be between -3 and +3. The standardized residual for observation i is e_i / RMSE, the residual divided by the root mean squared error of the regression. Graphical methods for checking normality: a histogram of the (standardized) residuals and a normal quantile plot of the (standardized) residuals.

Normal Quantile Plots
A normal quantile (probability) plot is a scatterplot of the ordered residuals (values), with the x-axis giving the expected value of the kth ordered residual on the standard normal scale and the y-axis giving the actual residual. JMP implementation: save the residuals, then click Analyze, Distribution, then the red triangle next to Residuals and Normal Quantile Plot. If the residuals follow approximately a normal distribution, they should fall approximately on a straight line.
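
A minimal Python sketch of a normal quantile plot using scipy's probplot; the residuals here are simulated stand-ins for the saved regression residuals:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Simulated residuals standing in for the saved regression residuals.
rng = np.random.default_rng(2)
resid = rng.normal(0, 1.5, 80)

# probplot orders the residuals and plots them against the expected
# normal order statistics; an approximately straight line indicates
# approximate normality.
stats.probplot(resid, dist="norm", plot=plt)
plt.title("Normal quantile plot of residuals")
plt.show()
```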

Traffic and office space example, continued.

Residuals from the final log-log fit.

Importance of Normality and Corrections for Normality
– For point estimation, confidence intervals, and tests of coefficients, and for confidence intervals for the mean response, normality of the residuals is only important in small samples, because of the Central Limit Theorem. Guideline: you do not need to worry about normality if there are at least 30 observations, plus 10 additional observations for each explanatory variable in multiple regression beyond the first one (e.g., 40 observations for two explanatory variables).
– For prediction intervals, normality is critical at all sample sizes.
– Corrections for nonnormality: transformations of the y variable (see the earlier slide).

Order of Correcting Violations of Assumptions in Multiple Regression
First, focus on correcting a violation of the linearity assumption. Then, once linearity is satisfied, focus on correcting violations of constant variance; if constant variance is achieved, make sure that linearity still holds approximately. Then, focus on correcting violations of normality; if normality is achieved, make sure that linearity and constant variance still approximately hold.

Outliers and Influential Observations in Simple Regression
– Outlier: any really unusual observation.
– Outlier in the X direction (called a high leverage point): has the potential to influence the regression line.
– Outlier in the direction of the scatterplot: an observation that deviates from the overall pattern of the relationship between Y and X; it typically has a residual that is large in absolute value.
– Influential observation: a point that, if removed, would markedly change the statistical analysis. For simple linear regression, points that are outliers in the X direction are often influential.

Outliers in the Direction of the Scatterplot
The standardized residual for observation i is e_i / RMSE. Under the multiple regression model, about 5% of the points should have standardized residuals greater than 2 in absolute value, and fewer than 1% (about 0.3%) should have standardized residuals greater than 3 in absolute value. Any point with a standardized residual greater than 3 in absolute value should be examined. To compute standardized residuals in JMP, right-click in a new column, click Formula, and create a formula with the residual divided by the RMSE; a Python version of the same computation is sketched below.
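
A minimal Python sketch of the same computation (statsmodels), with simulated data and one planted outlier:

```python
import numpy as np
import statsmodels.api as sm

# Simulated data with one planted outlier in the direction of the scatterplot.
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 100)
y = 1 + 2 * x + rng.normal(0, 2, 100)
y[5] += 15                                 # the planted outlier

fit = sm.OLS(y, sm.add_constant(x)).fit()

# Standardized residual = residual / RMSE, mirroring the JMP formula above.
rmse = np.sqrt(fit.mse_resid)
std_resid = fit.resid / rmse

# Points with |standardized residual| > 3 deserve a closer look.
print("Flagged observations:", np.where(np.abs(std_resid) > 3)[0])
```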

Housing Prices and Crime Rates
A community in the Philadelphia area is interested in how crime rates are associated with property values. If low crime rates increase property values, the community might be able to cover the costs of increased police protection with the gains in tax revenue from higher property values. The town council looked at a recent issue of Philadelphia Magazine (April 1996) and found data for itself and 109 other communities in Pennsylvania near Philadelphia. The data are in philacrimerate.JMP. House Price = average house price for sales during the most recent year; Crime Rate = rate of crimes per 1000 population.

Which points are influential? Center City Philadelphia is influential; Gladwyne is not. In general, points that have high leverage are more likely to be influential.

Excluding Observations from Analysis in JMP To exclude an observation from the regression analysis in JMP, go to the row of the observation, click Rows and then click Exclude/Unexclude. A red circle with a diagonal line through it should appear next to the observation. To put the observation back into the analysis, go to the row of the observation, click Rows and then click Exclude/Unexclude. The red circle should no longer appear next to the observation.

Formal Measures of Leverage and Influence
– Leverage: "hat values" (JMP calls them Hats).
– Influence: Cook's distance (JMP calls it Cook's D Influence).
To obtain them in JMP, click Analyze, Fit Model, and put the Y variable in the Y box and the X variable in the Model Effects box. Click Run Model. After the model is fit, click the red triangle next to Response, click Save Columns, and then click Hats for the leverages and Cook's D Influence for the Cook's distances. To sort the observations by Cook's distance or leverage, click Tables, Sort, and put the variable you want to sort by in the By box. The same quantities can be computed in Python, as sketched below.
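
A minimal Python sketch of the same quantities using statsmodels' influence diagnostics, with simulated data standing in for the JMP file:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated data standing in for philacrimerate.JMP.
rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 99)
y = 3 + 1.5 * x + rng.normal(0, 1, 99)

fit = sm.OLS(y, sm.add_constant(x)).fit()
infl = fit.get_influence()

diag = pd.DataFrame({
    "leverage": infl.hat_matrix_diag,        # JMP's "Hats"
    "cooks_d": infl.cooks_distance[0],       # JMP's "Cook's D Influence"
})

# Sort so the most influential observations appear first.
print(diag.sort_values("cooks_d", ascending=False).head())
```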

Center City Philadelphia has both high influence (Cook's distance much greater than 1) and high leverage (hat value > 3*2/99 = 0.06). No other observation has high influence or high leverage.

Rules of Thumb for High Leverage and High Influence
– High leverage: any observation with a leverage (hat value) greater than (3 × number of coefficients in the regression model)/n has high leverage, where the number of coefficients is 2 for simple linear regression and n is the number of observations.
– High influence: any observation with a Cook's distance greater than 1 has high influence.

What to Do About Suspected Influential Observations?
See the flowchart attached to the end of the slides. Does removing the observation change the substantive conclusions? If not, you can say something like "Observation x has high influence relative to all other observations, but we tried refitting the regression without Observation x and our main conclusions didn't change." A sketch of such a refit comparison appears after this slide.
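
A minimal Python sketch of such a sensitivity check (statsmodels), with a simulated high-leverage point standing in for the suspected observation:

```python
import numpy as np
import statsmodels.api as sm

# Simulated data with one high-leverage point standing in for the
# suspected influential observation (row index 50).
rng = np.random.default_rng(5)
x = np.append(rng.uniform(0, 10, 50), 40.0)
y = np.append(2 + 1.0 * x[:50] + rng.normal(0, 1, 50), 10.0)
drop_idx = 50

full = sm.OLS(y, sm.add_constant(x)).fit()
keep = np.arange(len(x)) != drop_idx
reduced = sm.OLS(y[keep], sm.add_constant(x[keep])).fit()

# If the slope barely moves, the substantive conclusions are robust
# to the influential observation and can be reported as such.
print(f"slope with the point:    {full.params[1]:.3f}")
print(f"slope without the point: {reduced.params[1]:.3f}")
```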

If removing the observation does change the substantive conclusions, is there any reason to believe the observation belongs to a population other than the one under investigation?
– If yes, omit the observation and proceed.
– If no, does the observation have high leverage (is it an outlier in the explanatory variable)?
– If it has high leverage, omit the observation and proceed, reporting that the conclusions apply only to a limited range of the explanatory variable.
– If it does not have high leverage, not much can be said; more data (or clarification of the influential observation) are needed to resolve the questions.

General Principles for Dealing with Influential Observations General principle: Delete observations from the analysis sparingly – only when there is good cause (observation does not belong to population being investigated or is a point with high leverage). If you do delete observations from the analysis, you should state clearly which observations were deleted and why.