Section 2.2 Correlation A numerical measure to supplement the graph. Will give us an indication of “how closely” the data points fit a particular line – the least squares regression line. Will give us an indication of the type of association – positive or negative.
Notes (page 124):
1. The data in the summation are standardized like z-scores (and thus r is not affected by a change in units).
2. Divide the scatterplot into quadrants based on the centroid:
- Data in the 1st and 3rd quadrants contribute positive values to r
- Data in the 2nd and 4th quadrants contribute negative values to r
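The summation these notes refer to did not survive in the transcript. Assuming it is the standard definition of r as an averaged product of standardized values (with sample means \bar{x}, \bar{y} and sample standard deviations s_x, s_y, symbols chosen here for illustration), it would read:

```latex
r = \frac{1}{n-1} \sum_{i=1}^{n}
    \left( \frac{x_i - \bar{x}}{s_x} \right)
    \left( \frac{y_i - \bar{y}}{s_y} \right)
```

Under that assumption each term is a product of two z-scores. Both factors have the same sign in the 1st and 3rd quadrants (positive product) and opposite signs in the 2nd and 4th quadrants (negative product), which matches note 2.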
How Data Relative to Centroid Affects r (figure: scatterplot divided at the centroid into Quadrant 1, Quadrant 2, Quadrant 3, and Quadrant 4)
Let’s use our TIs to find the correlation for our data set!
Student: A, B, C, D, E, F, G
Number of Absences (L1): 6, 2, 15, 9, 12, 5, 8
Final Grade (L2): 82, 86, 43, 74, 58, 90, 78
1. Turn Diagnostics On: 2nd Catalog, scroll down to DiagnosticOn and press Enter (you do not have to repeat this step every time!)
2. Compute r (and a few other things!): Stat|Calc|LinReg(a+bx), press Enter, and then give your lists: L1,L2
3. Your output should be: a=102.5, b=-3.62, r^2=0.8915, r=-0.9442
What are the meanings of these numbers?? Let’s start with r….
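As a cross-check on the calculator output, here is a minimal Python sketch that computes r for the absences and final-grade data as an averaged product of standardized values. The helper names (mean, sample_sd, correlation) are made up for illustration; they are not TI commands.

```python
import math

# Data from the slide: absences (L1) and final grades (L2) for students A-G.
absences = [6, 2, 15, 9, 12, 5, 8]
grades = [82, 86, 43, 74, 58, 90, 78]

def mean(values):
    return sum(values) / len(values)

def sample_sd(values):
    # Sample standard deviation (divides by n - 1).
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation(xs, ys):
    # r = (1 / (n - 1)) * sum of products of standardized x and y values.
    mx, my = mean(xs), mean(ys)
    sx, sy = sample_sd(xs), sample_sd(ys)
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

r = correlation(absences, grades)
print(round(r, 4), round(r ** 2, 4))  # -0.9442 0.8915
```

The printed values agree with the r and r^2 reported by LinReg(a+bx).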
Properties of the Correlation Coefficient (r)
- Symmetric in X and Y (makes no difference which variable is the explanatory and which is the response)
- Both variables must be quantitative!
- -1 <= r <= 1 ALWAYS
- The closer in magnitude r is to 1, the stronger the linear relationship between X and Y
- The sign of r indicates whether there is a positive or negative relationship between X and Y
- Just like the mean and standard deviation, r is strongly affected by outliers
See page 125 for more!
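A few of these properties can be checked numerically. The sketch below reuses the absences and final-grade data. The rescaled grades and the extra "outlier" student (20 absences, grade 95) are hypothetical values invented only for this illustration.

```python
import math

def correlation(xs, ys):
    # Pearson correlation from sums of squares and cross-products.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

absences = [6, 2, 15, 9, 12, 5, 8]
grades = [82, 86, 43, 74, 58, 90, 78]

r_xy = correlation(absences, grades)
r_yx = correlation(grades, absences)            # swap explanatory and response

# Hypothetical change of units: grades as a fraction of 100.
r_rescaled = correlation(absences, [g / 100 for g in grades])

# Hypothetical outlier: one extra student with 20 absences and a grade of 95.
r_outlier = correlation(absences + [20], grades + [95])

print(math.isclose(r_xy, r_yx))        # symmetry
print(math.isclose(r_xy, r_rescaled))  # unaffected by a change in units
print(abs(r_outlier) < abs(r_xy))      # a single outlier weakens r here
```

All three comparisons print True for this data set; the size of the outlier effect depends entirely on the invented point.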
Getting a Feel for r Let’s Play the Guessing Correlations Game! http://www.stat.uiuc.edu/courses/stat100/java/GCApplet/GuessCGI.html I will put this link on your assignments page!
Section 2.3 Least-Squares Regression We will first learn how to find the least-squares regression line and then understand how to interpret it. Please enter the data in Example 2.9 on page 152 into L3 and L4 on your TI calculator. L3 is NEA Increase (cal) and L4 is Fat Gain (kg)
Using LinReg(a+bx) L3,L4 we get the coefficients for the least-squares regression line: a=3.505, b=-0.00344, r^2=0.6061, and r=-0.7786. So we have the line: L4 variable = a + b*(L3 variable), i.e. fat gain = 3.505 – 0.00344*(NEA increase). To use the line to predict fat gain for an NEA increase of 400 calories (Example 2.10 page 134), plug the value 400 into NEA increase. How does this look graphically???
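The prediction step can be sketched in Python using the calculator coefficients a=3.505 and b=-0.00344. The function name predict_fat_gain is made up for illustration.

```python
def predict_fat_gain(nea_increase_cal, a=3.505, b=-0.00344):
    # Predicted fat gain (kg) from the fitted line: a + b * (NEA increase in calories).
    return a + b * nea_increase_cal

fat_gain_400 = predict_fat_gain(400)
print(round(fat_gain_400, 3))  # 2.129
```

So the line predicts a fat gain of roughly 2.13 kg for an NEA increase of 400 calories.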
Least-squares regression line: fat gain = 3.505 – 0.00344*(NEA increase) Slope = b = -0.00344 Y-intercept = a = 3.505
Another Example
Let’s get the equation of the least-squares regression line for our Absence and Final Grade data (it should still be in L1 and L2): LinReg(a+bx) L1,L2 gives y=a+bx where a=102.49, b=-3.622, r^2=0.8915 and r=-0.9442. So the equation of the least-squares regression line is: Final Grade = 102.49 – 3.622*(Number of Absences). Use this model to predict the Final Grade for a student who has 10 absences. Let’s look at this graphically….
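For comparison with the calculator, here is a minimal Python sketch that fits the least-squares line to the absences and final-grade data and then predicts the grade for 10 absences. The function name least_squares is made up for illustration, and the slope formula used (sum of cross-products over sum of squares in x) is the standard least-squares result rather than anything taken from the slides.

```python
absences = [6, 2, 15, 9, 12, 5, 8]
grades = [82, 86, 43, 74, 58, 90, 78]

def least_squares(xs, ys):
    # Returns (a, b) for the line y = a + b*x that minimizes squared vertical errors.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx  # forces the line through the centroid (mean x, mean y)
    return a, b

a, b = least_squares(absences, grades)
predicted_grade = a + b * 10

print(round(a, 2), round(b, 3))   # 102.49 -3.622
print(round(predicted_grade, 2))  # 66.27
```

The model predicts a final grade of about 66 for a student with 10 absences.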
Notice: The least-squares regression line goes through the centroid. We can graphically represent the prediction of the Final Class Grade for a given Number of Absences. What is the meaning of b and r^2????
Caution! Using the Regression Line to Make Predictions For Certain Values of x
Interpretation of the Least-Squares Regression Line (page 136) Error = Residual = Observed - Predicted
Makes sense because the line always passes through the centroid!
Interpretation of b, the Slope of the Regression Line (page 138) A change of one unit in x corresponds to a change of b units in y. A change of one standard deviation in x corresponds to a change of r standard deviations in y. Let’s find b via the formula on page 137 for our example data: How do we interpret b? What are the units for b?
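The formula on page 137 is not reproduced in this transcript. Assuming it is the standard relation b = r * (s_y / s_x), which is consistent with the statement that one standard deviation in x corresponds to r standard deviations in y, a sketch for the absences and final-grade data looks like this:

```python
import math
from statistics import mean, stdev

absences = [6, 2, 15, 9, 12, 5, 8]
grades = [82, 86, 43, 74, 58, 90, 78]

# Sample standard deviations of x and y.
s_x = stdev(absences)
s_y = stdev(grades)

# Correlation from sums of squares and cross-products.
mx, my = mean(absences), mean(grades)
sxy = sum((x - mx) * (y - my) for x, y in zip(absences, grades))
sxx = sum((x - mx) ** 2 for x in absences)
syy = sum((y - my) ** 2 for y in grades)
r = sxy / math.sqrt(sxx * syy)

# Assumed slope formula: b = r * s_y / s_x.
b = r * s_y / s_x
print(round(b, 3))  # -3.622
```

The units of b are units of y per unit of x, here grade points per absence: each additional absence corresponds to a predicted drop of about 3.6 points in the final grade.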
Interpretation of r^2 (pages 141, 142) Two Sources of Variability in y: - Relationship between x and y via the regression line (r^2 tells %) - Variability for a fixed value of x
Interpretation of r^2 (pages 141, 142): r^2 = (Variance of Predicted Values)/(Variance of Observed Values)
Let’s use some list operations to verify r^2 for our example data set of Absences and Final Grade:
- Regression Line: Final Grade = 102.49 – 3.622*(Number of Absences)
- Observed values of y are in L2
- To get predicted values of y for each value of x (in L1): 102.49-3.622*L1 → L5 (→ is the STO key)
- To get the residuals (i.e. Observed – Predicted): L2-L5 → L6 (What is the meaning of the data in L6???)
Interpretation of r^2 (page 142) r^2 = (Variance of Predicted Values)/(Variance of Observed Values) = (standard dev. L5)^2/(standard dev. L2)^2 = (15.8472)^2/(16.7829)^2 = (15.8472/16.7829)^2 = 0.8916 (note we have some round-off error in the 4th decimal place). So the regression line explains about 89% of the variability in the values of y (a very strong result!)
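The same list operations can be mirrored in Python as a rough check. This sketch uses the rounded coefficients from the slides (102.49 and -3.622), so it reproduces the small round-off in the 4th decimal place; the list names are illustrative stand-ins for L1, L2, L5 and L6.

```python
from statistics import variance

absences = [6, 2, 15, 9, 12, 5, 8]       # L1
grades = [82, 86, 43, 74, 58, 90, 78]    # L2 (observed y)

# Predicted values from the rounded regression line (the L5 list).
predicted = [102.49 - 3.622 * x for x in absences]

# Residuals = observed - predicted (the L6 list).
residuals = [y - p for y, p in zip(grades, predicted)]

# r^2 as the ratio of sample variances of predicted to observed values.
r_squared = variance(predicted) / variance(grades)
print(round(r_squared, 4))  # 0.8916
```

With unrounded coefficients the ratio comes out to 0.8915, matching the calculator's r^2.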
Here r^2=0.606 so the regression model explains about 61% of the variability in y, i.e. about 61% of the vertical scatter in y. Two Sources of Variability in y: - Relationship between x and y via the regression line (r^2 tells %) - Variability for a fixed value of x (pages 141, 142)
Section 2.4 - Cautions about Correlation and Regression Error = Residual = Observed - Predicted Example 2.15 (scatterplot with regression line page 152) An Interesting Fact: The sum of the residuals about the least-squares regression line is always zero.
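The zero-sum fact can be checked numerically. The sketch below refits the least-squares line to the absences and final-grade data from Section 2.2 (the data for Example 2.15 is not in this transcript) and sums the residuals; the function name least_squares is made up for illustration.

```python
absences = [6, 2, 15, 9, 12, 5, 8]
grades = [82, 86, 43, 74, 58, 90, 78]

def least_squares(xs, ys):
    # Returns (a, b) for the least-squares line y = a + b*x.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

a, b = least_squares(absences, grades)

# Residual = observed - predicted, one per student.
residuals = [y - (a + b * x) for x, y in zip(absences, grades)]
residual_sum = sum(residuals)

print(abs(residual_sum) < 1e-9)  # True (zero up to floating-point error)
```

If the rounded coefficients 102.49 and -3.622 are used instead, the sum is only approximately zero.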
A residual plot (page 153) gives us a visual representation of the variation left over in the response variable after taking the regression into account. It helps us to assess how well the line describes the data. IF the regression line catches the overall pattern of the data, there should be no pattern in the residuals.
(b) Negative Residual (a) Positive Residual The residual plot will
A Residual Plot Note: No discernible pattern to residuals
Example 2.4 (page 108) Revisited Both the scatterplot and the residual plot show more variability in field measurements as true (laboratory-measured) defect size increases, despite a strong correlation (r=0.9445) and a large percent of variability in y explained by the regression (r^2=0.8921).
Example 2.16 (pages 154 – 157)
- Subject 15: r with data point = 0.4819, r without data point = 0.5684 (Weakens Regression)
- Subject 18: r with data point = 0.4819, r without data point = 0.3837 (Strengthens Regression)
Beware the Lurking Variable (page 158) and Remember: Correlation does not imply Causation! (page 160) Lurking variables can create “nonsense correlations” or possibly hide true relationships between x and y.
A “nonsense” correlation Lurking variable? Both variables increased during the time period plotted. Thus the common year is a lurking variable. Example 2.2 page 159