Presentation on theme: "LECTURE 3 Introduction to Linear Regression and Correlation Analysis"— Presentation transcript:
1 LECTURE 3 Introduction to Linear Regression and Correlation Analysis
1. Simple Linear Regression
2. Regression Analysis
3. Regression Model Validity
2 Goals
After this, you should be able to:
- Interpret the simple linear regression equation for a set of data
- Use descriptive statistics to describe the relationship between X and Y
- Determine whether a regression model is significant
3 Goals (continued)
After this, you should be able to:
- Interpret confidence intervals for the regression coefficients
- Interpret confidence intervals for a predicted value of Y
- Check whether regression assumptions are satisfied
- Check to see if the data contain unusual values
4 Introduction to Regression Analysis
Regression analysis is used to:
- Predict the value of a dependent variable based on the value of at least one independent variable
- Explain the impact of changes in an independent variable on the dependent variable
Dependent variable: the variable we wish to explain
Independent variable: the variable used to explain the dependent variable
5 Simple Linear Regression Model
- Only one independent variable, x
- The relationship between x and y is described by a linear function
- Changes in y are assumed to be caused by changes in x
6 Types of Regression Models
- Positive linear relationship
- Negative linear relationship
- Relationship NOT linear
- No relationship
7 Population Linear Regression
The population regression model:
    y = β0 + β1x + ε
where
- y = dependent variable
- x = independent variable
- β0 = population y-intercept
- β1 = population slope coefficient
- ε = random error term, or residual
β0 + β1x is the linear component; ε is the random error component.
8 Linear Regression Assumptions
- The underlying relationship between the x variable and the y variable is linear
- The distribution of the errors has constant variability
- Error values are normally distributed
- Error values are independent (over time)
9 Population Linear Regression
[Figure: scatter plot with the population regression line. For the observed value of y at xi, the random error εi is the vertical distance from the observed point to the predicted value on the line; intercept = β0, slope = β1.]
10 Estimated Regression Model
The sample regression line provides an estimate of the population regression line:
    ŷ = b0 + b1x
where
- ŷ = estimated (or predicted) y value
- b0 = estimate of the regression intercept
- b1 = estimate of the regression slope
- x = independent variable
11 Interpretation of the Slope and the Intercept
- b0 is the estimated average value of y when the value of x is zero
- b1 is the estimated change in the average value of y as a result of a one-unit change in x
12 Finding the Least Squares Equation
- The coefficients b0 and b1 are found using computer software, such as Excel's Data Analysis add-in or MegaStat
- Other regression measures are also computed as part of computer-based regression analysis
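Although software reports the coefficients, in simple regression they can also be computed directly from the least-squares formulas, which makes clear what Excel or MegaStat is doing. A minimal sketch in Python, using small hypothetical data (not the house-price sample used later in these slides):

```python
# Least-squares coefficients computed from their formulas:
#   b1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
#   b0 = y_bar - b1 * x_bar
# The data below are hypothetical, purely for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)

b1 = sxy / sxx               # estimated slope
b0 = y_bar - b1 * x_bar      # estimated intercept

print(round(b0, 2), round(b1, 2))
```

Note that b0 = ȳ − b1·x̄ forces the fitted line through the point (x̄, ȳ), the property listed on slide 20.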
13 Simple Linear Regression Example
- A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet)
- A random sample of 10 houses is selected
- Dependent variable (y) = house price, in $1000s
- Independent variable (x) = square feet
14 Sample Data for House Price Model

    House Price in $1000s (y)    Square Feet (x)
    245                          1400
    312                          1600
    279                          1700
    308                          1875
    199                          1100
    219                          1550
    405                          2350
    324                          2450
    319                          1425
    255
15 Regression output from Excel (Data > Data Analysis) or from MegaStat (Correlation / Regression)
17 Graphical Presentation
House price model: scatter plot and regression line
Slope =
Intercept =
18 Interpretation of the Intercept, b0
- b0 is the estimated average value of Y when the value of X is zero (if x = 0 is in the range of observed x values)
- Here, houses with 0 square feet do not occur, so b0 simply indicates the height of the line at x = 0
19 Interpretation of the Slope Coefficient, b1
- b1 measures the estimated change in Y as a result of a one-unit increase in X
- Here, b1 = 0.10977 tells us that the average value of a house increases by 0.10977 ($1000s) = $109.77, on average, for each additional square foot of size
20 Least Squares Regression Properties
- The simple regression line always passes through the mean of the y variable and the mean of the x variable
- The least squares coefficients are unbiased estimates of β0 and β1
21 Coefficient of Determination, R²
- The percentage of variability in Y that can be explained by variability in X
- Note: in the single-independent-variable case, the coefficient of determination is R² = r², where
  R² = coefficient of determination
  r = simple correlation coefficient
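The identity R² = r² for simple regression can be checked numerically: computing R² as 1 − SSE/SST gives the same number as squaring the correlation coefficient. A sketch with hypothetical data:

```python
# R^2 computed two ways for simple regression: as 1 - SSE/SST, and as
# the square of the simple correlation coefficient r.  Data are hypothetical.
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
syy = sum((y - y_bar) ** 2 for y in ys)       # SST: total variation in y

b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
preds = [b0 + b1 * x for x in xs]

sse = sum((y - p) ** 2 for y, p in zip(ys, preds))   # unexplained variation
r_squared = 1 - sse / syy                            # explained share of variation
r = sxy / math.sqrt(sxx * syy)                       # simple correlation

assert abs(r_squared - r * r) < 1e-12                # identical in simple regression
print(round(r_squared, 3))
```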
22 Examples of R² Values
R² = 1: perfect linear relationship between x and y; 100% of the variation in y is explained by variation in x.
This occurs whether the correlation is +1 (upward-sloping line) or -1 (downward-sloping line).
23 Examples of Approximate R² Values
0 < R² < 1: weaker linear relationship between x and y; some but not all of the variation in y is explained by variation in x.
The correlation may be positive or negative.
24 Examples of Approximate R² Values
R² = 0: no linear relationship between x and y. The value of Y does not depend on x (none of the variation in y is explained by variation in x).
25 Excel Output
Regression Analysis: r² = 0.581, r = 0.762, Std. Error = 41.330
- 58.08% of the variation in house prices is explained by variation in square feet
- The correlation of 0.762 shows a fairly strong direct relationship
- The typical error in predicting price is 41.33 ($1000s) = $41,330
26 Inference about the Slope: t Test
- t test for a population slope: is there a linear relationship between x and y?
- Null and alternative hypotheses:
  H0: β1 = 0 (no linear relationship)
  Ha: β1 ≠ 0 (a linear relationship does exist)
- Obtain the p-value from the ANOVA table or across from the slope coefficient (they are the same in simple regression)
27 Inference about the Slope: t Test (continued)
- Estimated regression equation for the house price data (sample data shown on slide 14)
- The slope of this model is b1 = 0.10977
- Does the square footage of the house affect its sales price?
28 Inferences about the Slope: t Test Example
H0: β1 = 0    Ha: β1 ≠ 0
From the Excel output: the coefficients, standard errors, t statistics, and p-values for the intercept and for Square Feet
Decision: Reject H0
Conclusion: We can be 98.96% confident that square feet is related to house price.
29 Regression Analysis for Description
Confidence interval estimate of the slope, from the Excel printout for house prices (coefficients, standard errors, t stats, p-values, lower and upper 95% limits)
We can be 95% confident that house prices increase by between $33.74 and $____ for a 1-square-foot increase.
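The t statistic and the confidence interval for the slope both come from the same quantity, the standard error of the slope, s_b1 = s_e / sqrt(Sxx). A sketch with hypothetical data; the critical value 3.182 is the standard t-table value for a two-tailed test at α = .05 with df = n − 2 = 3 (an artifact of this toy sample size, not of the house-price example):

```python
# t test and 95% CI for the slope:
#   t = b1 / s_b1,  s_b1 = s_e / sqrt(Sxx),  s_e = sqrt(SSE / (n - 2))
# Hypothetical data; t_crit = 3.182 is the t-table value for df = 3.
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
b0 = y_bar - b1 * x_bar

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
s_e = math.sqrt(sse / (n - 2))       # standard error of the estimate
s_b1 = s_e / math.sqrt(sxx)          # standard error of the slope
t_stat = b1 / s_b1                   # tests H0: beta1 = 0

t_crit = 3.182                       # two-tailed, alpha = .05, df = 3
ci = (b1 - t_crit * s_b1, b1 + t_crit * s_b1)   # 95% CI for beta1

print(round(t_stat, 3), abs(t_stat) > t_crit)
```

With this toy data |t| < 3.182 and the interval contains zero, so H0 would not be rejected; the two checks always agree, since both are built from s_b1.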
30 Estimates of Expected y for Different Values of x
The regression line ŷ = b0 + b1x describes how x determines your estimate of y: for a given value xp, the estimated value is yp = b0 + b1xp.
[Figure: regression line with xp marked on the x-axis and the corresponding yp on the y-axis]
31 Interval Estimates for Different Values of x
Prediction interval for an individual y, given xp.
The farther xp is from x̄, the less accurate the prediction.
[Figure: regression line ŷ = b0 + b1x with the interval widening as xp moves away from x̄]
32 Example: House Prices
Estimated regression equation for the house price data (sample data shown on slide 14).
Predict the price for a house with 2000 square feet.
33 Example: House Prices (continued)
Predict the price for a house with 2000 square feet:
The predicted price for a house with 2000 square feet is 317.85 ($1000s) = $317,850.
34 Estimation of Individual Values: Example
- Prediction interval estimate for y|xp: find the 95% prediction interval for an individual house with 2,000 square feet
- Predicted price ŷ = 317.85 ($1000s) = $317,850
- MegaStat gives the predicted value as well as the lower and upper limits (predicted values for Price ($000): 95% confidence interval and 95% prediction interval at 2,000 square feet)
- The prediction interval endpoints are $215,503 to $420,065. We can be 95% confident that the price of a 2000 ft² home will fall within those limits.
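The prediction interval software reports follows the usual simple-regression formula ŷ ± t·s_e·sqrt(1 + 1/n + (xp − x̄)²/Sxx); the last term under the square root is what makes the interval widen as xp moves away from x̄, as noted on slide 31. A sketch with hypothetical data (again using the df = 3 table value 3.182):

```python
# Point prediction and 95% prediction interval for an individual y at x = xp:
#   yhat +/- t * s_e * sqrt(1 + 1/n + (xp - x_bar)^2 / Sxx)
# Hypothetical data; t = 3.182 is the t-table value for df = n - 2 = 3.
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
b0 = y_bar - b1 * x_bar
s_e = math.sqrt(sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2))

xp = 4.0
y_hat = b0 + b1 * xp
# The (xp - x_bar)^2 / Sxx term widens the interval as xp moves away from x_bar.
margin = 3.182 * s_e * math.sqrt(1 + 1 / n + (xp - x_bar) ** 2 / sxx)
print(round(y_hat, 2), round(y_hat - margin, 2), round(y_hat + margin, 2))
```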
35 Residual Analysis
Purposes:
- Check the linearity assumption
- Check the constant-variability assumption for all levels of predicted Y
- Check the normal-residuals assumption
- Check for independence over time
Graphical analysis of residuals:
- Plot residuals vs. x and vs. predicted Y
- Create a normal probability plot (NPP) of residuals to check for normality (or use skewness/kurtosis)
- Check the Durbin-Watson (D-W) statistic to confirm independence
36 Residual Analysis for Linearity
[Figure: residual plots vs. x. A random scatter of residuals indicates a linear relationship; a curved pattern indicates the relationship is not linear.]
37 Residual Analysis for Constant Variance
[Figure: residual plots vs. x and vs. ŷ. A band of roughly constant width indicates constant variance; a funnel shape indicates non-constant variance.]
38 Residual Analysis for Normality
- Create an NPP of the residuals; if you see an approximate straight line, the residuals are acceptably normal
- You can also use skewness/kurtosis: if both are within ±1, the residuals are acceptably normal
Residual Analysis for Independence
- Check the D-W statistic to confirm independence: if the D-W statistic is greater than 1.3, the residuals are acceptably independent
- Needed only if the data are collected over time
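The Durbin-Watson statistic itself is simple to compute from the residuals taken in time order; values near 2 suggest independence, and the slide's rule of thumb flags values at or below 1.3. A sketch with hypothetical residuals:

```python
# Durbin-Watson statistic on residuals collected over time:
#   DW = sum((e_t - e_{t-1})^2) / sum(e_t^2)
# Values near 2 indicate independent errors; positive autocorrelation
# pushes DW toward 0.  Residuals below are hypothetical, in time order.
residuals = [-0.8, 0.6, 1.0, -0.6, -0.2]

num = sum((residuals[t] - residuals[t - 1]) ** 2
          for t in range(1, len(residuals)))
den = sum(e ** 2 for e in residuals)
dw = num / den

print(round(dw, 2), dw > 1.3)   # > 1.3 is the slide's acceptability threshold
```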
39 Checking Unusual Data Points
- Check for outliers from the predicted values (studentized and studentized deleted residuals do this; MegaStat highlights them in blue)
- Check for outliers on the X-axis; they are indicated by large leverage values, more than twice as large as the average leverage (MegaStat highlights them in blue)
- Check Cook's distance, which measures the harmful influence of a data point on the equation by looking at residuals and leverage together. Cook's D > 1 suggests a potentially harmful data point, and such points should be checked for data-entry error (MegaStat highlights them in blue based on F-distribution values)
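For simple regression both diagnostics can be computed directly: leverage is h_i = 1/n + (x_i − x̄)²/Sxx, and Cook's distance combines the residual with the leverage. The thresholds below (twice the average leverage of (k+1)/n = 2/n, and Cook's D > 1) are the slide's rules of thumb; the data are hypothetical, with one deliberately extreme x value:

```python
# Leverage and Cook's distance per point, simple regression.
#   h_i = 1/n + (x_i - x_bar)^2 / Sxx        (average leverage = 2/n)
#   D_i = e_i^2 / (2 * MSE) * h_i / (1 - h_i)^2   (2 = number of coefficients)
# Hypothetical data; the last x is far out on the X-axis on purpose.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 12.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0, 9.0]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
b0 = y_bar - b1 * x_bar
resid = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
mse = sum(e ** 2 for e in resid) / (n - 2)

for x, e in zip(xs, resid):
    h = 1 / n + (x - x_bar) ** 2 / sxx
    cooks_d = e ** 2 / (2 * mse) * h / (1 - h) ** 2
    flag = "HIGH LEVERAGE" if h > 2 * (2 / n) else ""
    print(f"x={x:5.1f}  h={h:.3f}  D={cooks_d:.3f}  {flag}")
```

Only the extreme x = 12.0 point exceeds twice the average leverage here, mirroring how MegaStat flags X-axis outliers.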
40 Patterns of Outliers
a) Outlier is extreme in both X and Y but not in pattern. The point is unlikely to alter the regression line.
b) Outlier is extreme in both X and Y as well as in the overall pattern. This point will strongly influence the regression line.
c) Outlier is extreme for X, nearly average for Y. The farther it is from the pattern, the more it will change the regression.
d) Outlier is extreme in Y, not in X. The farther it is from the pattern, the more it will change the regression.
e) Outlier is extreme in pattern, but not in X or Y. The slope may not change much, but the intercept will be higher with this point included.
41 Summary
- Introduced simple linear regression analysis
- Calculated the coefficients for the simple linear regression equation
- Described measures of strength (r, R², and se)
42 Summary (continued)
- Described inference about the slope
- Addressed prediction of individual values
- Discussed residual analysis to address the assumptions of regression and correlation
- Discussed checks for unusual data points