Slide 1 Copyright © 2004 Pearson Education, Inc. Chapter 10 Correlation and Regression 10-1 Overview Overview 10-2 Correlation 10-3 Regression-3 Regression.

Copyright © 2004 Pearson Education, Inc. Chapter 10 Correlation and Regression 10-1 Overview Overview 10-2 Correlation 10-3 Regression-3 Regression 10-4 Variation and Prediction Intervals-4 Variation and Prediction Intervals 10-5 Rank Correlation-5 Rank Correlation

Copyright © 2004 Pearson Education, Inc.  A Scatterplot (or scatter diagram) is a graph in which the paired ( x, y ) sample data are plotted with a horizontal x- axis and a vertical y- axis. Each individual ( x, y ) pair is plotted as a single point. Definition

Copyright © 2004 Pearson Education, Inc. Notation for the Linear Correlation Coefficient n = number of pairs of data presented  denotes the addition of the items indicated.  x denotes the sum of all x- values.  x 2 indicates that each x - value should be squared and then those squares added. (  x ) 2 indicates that the x- values should be added and the total then squared.  xy indicates that each x -value should be first multiplied by its corresponding y- value. After obtaining all such products, find their sum. r represents linear correlation coefficient for a sample  represents linear correlation coefficient for a population

Copyright © 2004 Pearson Education, Inc. n  xy – (  x)(  y) n(  x 2 ) – (  x) 2 n(  y 2 ) – (  y) 2 r =r = Formula 10-1 The linear correlation coefficient r measures the strength of a linear relationship between the paired values in a sample. Calculators can compute r  (rho) is the linear correlation coefficient for all paired data in the population. Definition

Copyright © 2004 Pearson Education, Inc.  Round to three decimal places so that it can be compared to critical values in Table A-6.  Use calculator or computer if possible. Rounding the Linear Correlation Coefficient r

Copyright © 2004 Pearson Education, Inc. 1212 1818 3636 5454 Data x y Calculating r n  xy – (  x)(  y) n(  x 2 ) – (  x) 2 n(  y 2 ) – (  y) 2 r =r =  48  – (10)(20) 4(36) – (10) 2 4(120) – (20) 2 r =r = –8 59.329 r =r = = – 0.135

Copyright © 2004 Pearson Education, Inc. Interpreting the Linear Correlation Coefficient  If the absolute value of r exceeds the value in Table A - 6, conclude that there is a significant linear correlation.  Otherwise, there is not sufficient evidence to support the conclusion of significant linear correlation.

Copyright © 2004 Pearson Education, Inc. Example: Boats and Manatees Given the sample data in Table 9-1, find the value of the linear correlation coefficient r, then refer to Table A-6 to determine whether there is a significant linear correlation between the number of registered boats and the number of manatees killed by boats. Using the same procedure previously illustrated, we find that r = 0.922. Referring to Table A-6, we locate the row for which n =10. Using the critical value for  =5, we have 0.632. Because r = 0.922, its absolute value exceeds 0.632, so we conclude that there is a significant linear correlation between number of registered boats and number of manatee deaths from boats.

Copyright © 2004 Pearson Education, Inc. Properties of the Linear Correlation Coefficient r 1. –1  r  1 2. Value of r does not change if all values of either variable are converted to a different scale. 3. The r is not affected by the choice of x and y. interchange x and y and the value of r will not change. 4. r measures strength of a linear relationship.

Copyright © 2004 Pearson Education, Inc. Using the boat/manatee data in Table 9-1, we have found that the value of the linear correlation coefficient r = 0.922. What proportion of the variation of the manatee deaths can be explained by the variation in the number of boat registrations? With r = 0.922, we get r 2 = 0.850. We conclude that 0.850 (or about 85%) of the variation in manatee deaths can be explained by the linear relationship between the number of boat registrations and the number of manatee deaths from boats. This implies that 15% of the variation of manatee deaths cannot be explained by the number of boat registrations. Example: Boats and Manatees

Copyright © 2004 Pearson Education, Inc. Common Errors Involving Correlation 1. Causation: It is wrong to conclude that correlation implies causality. 2. Averages: Averages suppress individual variation and may inflate the correlation coefficient. 3. Linearity: There may be some relationship between x and y even when there is no significant linear correlation.

Copyright © 2004 Pearson Education, Inc. Formal Hypothesis Test  We wish to determine whether there is a significant linear correlation between two variables.  We present two methods.  Both methods let H 0 :  =  (no significant linear correlation) H 1 :    (significant linear correlation)

Copyright © 2004 Pearson Education, Inc. Test statistic: Critical values: Use Table A-3 with degrees of freedom = n – 2 1 – r 2 n – 2 r t = Method 1: Test Statistic is t (follows format of earlier chapters)

Copyright © 2004 Pearson Education, Inc. Using the boat/manatee data in Table 9-1, test the claim that there is a linear correlation between the number of registered boats and the number of manatee deaths from boats. Use Method 1. 1 – r 2 n – 2 r t = 1 – 0.922 2 10 – 2 0.922 t = = 6.735 Example: Boats and Manatees

Copyright © 2004 Pearson Education, Inc. Using the boat/manatee data in Table 9-1, test the claim that there is a linear correlation between the number of registered boats and the number of manatee deaths from boats. Use Method 2. The test statistic is r = 0.922. The critical values of r =  0.632 are found in Table A-6 with n = 10 and  = 0.05. Example: Boats and Manatees

Copyright © 2004 Pearson Education, Inc. Using the boat/manatee data in Table 9-1, test the claim that there is a linear correlation between the number of registered boats and the number of manatee deaths from boats. Use both (a) Method 1 and (b) Method 2. Using either of the two methods, we find that the absolute value of the test statistic does exceed the critical value (Method 1: 6.735 > 2.306. Method 2: 0.922 > 0.632); that is, the test statistic falls in the critical region. We therefore reject the null hypothesis. There is sufficient evidence to support the claim of a linear correlation between the number of registered boats and the number of manatee deaths from boats. Example: Boats and Manatees

Copyright © 2004 Pearson Education, Inc. Regression Definition  Regression Equation The regression equation expresses a relationship between x (called the independent variable, predictor variable or explanatory variable, and y (called the dependent variable or response variable. The typical equation of a straight line y = mx + b is expressed in the form y = b 0 + b 1 x, where b 0 is the y - intercept and b 1 is the slope. ^

Copyright © 2004 Pearson Education, Inc. Assumptions 1. We are investigating only linear relationships. 2. For each x- value, y is a random variable having a normal (bell-shaped) distribution. All of these y distributions have the same variance. Also, for a given value of x, the distribution of y- values has a mean that lies on the regression line. (Results are not seriously affected if departures from normal distributions and equal variances are not too extreme.)

Copyright © 2004 Pearson Education, Inc. Regression Definition  Regression Equation Given a collection of paired data, the regression equation  Regression Line The graph of the regression equation is called the regression line (or line of best fit, or least squares line). y = b 0 + b 1 x ^ algebraically describes the relationship between the two variables

Copyright © 2004 Pearson Education, Inc. Notation for Regression Equation y -intercept of regression equation  0 b 0 Slope of regression equation  1 b 1 Equation of the regression line y =  0 +  1 x y = b 0 + b 1 x Population Parameter Sample Statistic ^

Copyright © 2004 Pearson Education, Inc. Formula for b 0 and b 1 Formula 10-2 n(  xy) – (  x) (  y) b 1 = (slope) n(  x 2 ) – (  x) 2 b 0 = y – b 1 x ( y -intercept) Formula 10-3 calculators or computers can compute these values

Copyright © 2004 Pearson Education, Inc. Rounding the y -intercept b 0 and the slope b 1  Round to three significant digits.  If you use the formulas 10-2 and 10-3, try not to round intermediate values.

Copyright © 2004 Pearson Education, Inc. 1212 1818 3636 5454 Data x y Calculating the Regression Equation In Section 9-2, we used these values to find that the linear correlation coefficient of r = –0.135. Use this sample to find the regression equation.

Copyright © 2004 Pearson Education, Inc. 1212 1818 3636 5454 Data x y Calculating the Regression Equation n = 4  x = 10  y = 20  x 2 = 36  y 2 = 120  xy = 48 n(  xy) – (  x) (  y) n(  x 2 ) –(  x) 2 b 1 = 4(48) – (10) (20) 4(36) – (10) 2 b 1 = –8 44 b 1 = = –0.181818

Copyright © 2004 Pearson Education, Inc. 1212 1818 3636 5454 Data x y Calculating the Regression Equation n = 4  x = 10  y = 20  x 2 = 36  y 2 = 120  xy = 48 b 0 = y – b 1 x 5 – (–0.181818)(2.5) = 5.45

Copyright © 2004 Pearson Education, Inc. 1212 1818 3636 5454 Data x y Calculating the Regression Equation n = 4  x = 10  y = 20  x 2 = 36  y 2 = 120  xy = 48 The estimated equation of the regression line is: y = 5.45 – 0.182 x ^

Copyright © 2004 Pearson Education, Inc. Given the sample data in Table 9-1, find the regression equation. Using the same procedure as in the previous example, we find that b 1 = 2.27 and b 0 = –113. Hence, the estimated regression equation is: y = –113 + 2.27 x ^ Example: Boats and Manatees

Copyright © 2004 Pearson Education, Inc. In predicting a value of y based on some given value of x... 1. If there is not a significant linear correlation, the best predicted y -value is y. Predictions 2.If there is a significant linear correlation, the best predicted y -value is found by substituting the x -value into the regression equation.

Copyright © 2004 Pearson Education, Inc. Given the sample data in Table 9-1, we found that the regression equation is y = –113 + 2.27 x. Assume that in 2001 there were 850,000 registered boats. Because Table 9-1 lists the numbers of registered boats in tens of thousands, this means that for 2001 we have x = 85. Given that x = 85, find the best predicted value of y, the number of manatee deaths from boats. ^ Example: Boats and Manatees

Copyright © 2004 Pearson Education, Inc. We must consider whether there is a linear correlation that justifies the use of that equation. We do have a significant linear correlation (with r = 0.922). Given the sample data in Table 9-1, we found that the regression equation is y = –113 + 2.27 x. Given that x = 85, find the best predicted value of y, the number of manatee deaths from boats. ^ Example: Boats and Manatees

Copyright © 2004 Pearson Education, Inc. Given the sample data in Table 9-1, we found that the regression equation is y = –113 + 2.27 x. Given that x = 85, find the best predicted value of y, the number of manatee deaths from boats. ^ Example: Boats and Manatees y = –113 + 2.27x –113 + 2.27(85) = 80.0 ^ The predicted number of manatee deaths is 80.0. The actual number of manatee deaths in 2001 was 82, so the predicted value of 80.0 is quite close.

Copyright © 2004 Pearson Education, Inc. 1. If there is no significant linear correlation, don’t use the regression equation to make predictions. 2. When using the regression equation for predictions, stay within the scope of the available sample data. 3. A regression equation based on old data is not necessarily valid now. 4. Don’t make predictions about a population that is different from the population from which the sample data was drawn. Guidelines for Using The Regression Equation

Copyright © 2004 Pearson Education, Inc. Definitions  Marginal Change: The marginal change is the amount that a variable changes when the other variable changes by exactly one unit.  Outlier: An outlier is a point lying far away from the other data points.  Influential Points: An influential point strongly affects the graph of the regression line.

Copyright © 2004 Pearson Education, Inc. Definitions  Residual for a sample of paired ( x, y ) data, the difference ( y - y ) between an observed sample y -value and the value of y, which is the value of y that is predicted by using the regression equation. ^ ^ Residuals and the Least-Squares Property

Copyright © 2004 Pearson Education, Inc. Definitions  Residual for a sample of paired ( x, y ) data, the difference ( y - y ) between an observed sample y -value and the value of y, which is the value of y that is predicted by using the regression equation.  Least-Squares Property A straight line satisfies this property if the sum of the squares of the residuals is the smallest sum possible. ^ Residuals and the Least-Squares Property ^

Copyright © 2004 Pearson Education, Inc. Definitions We consider different types of variation that can be used for two major applications:  To determine the proportion of the variation in y that can be explained  by the linear relationship between x and y. 2. To construct interval estimates of predicted y -values. Such intervals are called prediction intervals.

Copyright © 2004 Pearson Education, Inc. Definitions Total Deviation The total deviation from the mean of the particular point ( x, y ) is the vertical distance y – y, which is the distance between the point ( x, y ) and the horizontal line passing through the sample mean y.

Copyright © 2004 Pearson Education, Inc. Definitions Total Deviation The total deviation from the mean of the particular point ( x, y ) is the vertical distance y – y, which is the distance between the point ( x, y ) and the horizontal line passing through the sample mean y. Explained Deviation is the vertical distance y - y, which is the distance between the predicted y- value and the horizontal line passing through the sample mean y. ^

Copyright © 2004 Pearson Education, Inc. Definitions Unexplained Deviation is the vertical distance y - y, which is the vertical distance between the point ( x, y ) and the regression line. (The distance y - y is also called a residual, as defined in Section 9-3.). ^ ^

Copyright © 2004 Pearson Education, Inc. (total deviation) = (explained deviation) + (unexplained deviation) ( y - y ) = ( y - y ) + (y - y ) ^ ^ (total variation) = (explained variation) + (unexplained variation)  ( y - y ) 2 =  ( y - y ) 2 +  (y - y) 2 ^^ Formula 10-4

Copyright © 2004 Pearson Education, Inc. Definition r 2 = explained variation. total variation or simply square r (determined by Formula 10-1, section 10-2) Coefficient of determination the amount of the variation in y that is explained by the regression line

Copyright © 2004 Pearson Education, Inc. The standard error of estimate is a measure of the differences (or distances) between the observed sample y values and the predicted values y that are obtained using the regression equation. Prediction Intervals ^ Definition

Copyright © 2004 Pearson Education, Inc. Given the sample data in Table 9-1, we found that the regression equation is y = –113 + 2.27 x. Find the standard error of estimate s e for the boat/manatee data. ^ n = 10  y 2 = 33456  y = 558  xy = 42214 b 0 = –112.70989 b 1 = 2.27408 s e = n - 2  y 2 - b 0  y - b 1  xy s e = 10 – 2 33456 –(–112.70989)(558) – (2.27408)(42414) Example: Boats and Manatees

Copyright © 2004 Pearson Education, Inc. Given the sample data in Table 9-1, we found that the regression equation is y = –113 + 2.27 x. Find the standard error of estimate s e for the boat/manatee data. ^ n = 10  y 2 = 33456  y = 558  xy = 42214 b 0 = –112.70989 b 1 = 2.27408 Example: Boats and Manatees s e = 6.61234 = 6.61

Copyright © 2004 Pearson Education, Inc. y - E < y < y + E ^ ^ Prediction Interval for an Individual y where E = t   2 s e n(x2)n(x2) – (  x) 2 n(x 0 – x) 2 1 + + 1 n x 0 represents the given value of x t   2 has n – 2 degrees of freedom

Copyright © 2004 Pearson Education, Inc. Given the sample data in Table 9-1, we found that the regression equation is y = –113 + 2.27 x. We have also found that when x = 85, the predicted number of manatee deaths is 80.0. Construct a 95% prediction interval given that x = 85. ^ E = t  2 s e + n(  x 2 ) – (  x) 2 n(x 0 – x) 2 1 + 1 n E = (2.306)(6.6123) 10(55289) – (741) 2 10(85 –74 ) 2 +1 + 1 10 E = 18.1 Example: Boats and Manatees

Copyright © 2004 Pearson Education, Inc. Given the sample data in Table 9-1, we found that the regression equation is y = –113 + 2.27 x. We have also found that when x = 85, the predicted number of manatee deaths is 80.0. Construct a 95% prediction interval given that x = 85. ^ Example: Boats and Manatees y – E < y < y + E 80.6 – 18.1 < y < 80.6 + 18.1 62.5 < y < 98.7 ^ ^

Copyright © 2004 Pearson Education, Inc. Rank Correlation Definition  Rank Correlation uses the ranks of sample data consisting of matched pairs.  The rank correlation test is used to test for an association between two variables  H o :  s = 0 (There is no correlation between the two variables.)  H 1 :  s  0 (There is a correlation between the two variables.)

Copyright © 2004 Pearson Education, Inc. Advantages 1. The nonparametric method of rank correlation can be used in a wider variety of circumstances than the parametric method of linear correlation. With rank correlation, we can analyze paired data that are ranks or can be converted to ranks. 2. Rank correlation can be used to detect some (not all) relationships that are not linear. 3. The computations for rank correlation are much simpler than the computations for linear correlation, as can be readily seen by comparing the formulas used to compute these statistics.

Copyright © 2004 Pearson Education, Inc. Disadvantages A disadvantage of rank correlation is its efficiency rating of 0.91, as described in Section 12-1. This efficiency rating shows that with all other circumstances being equal, the nonparametric approach of rank correlation requires 100 pairs of sample data to achieve the same results as only 91 pairs of sample observations analyzed through parametric methods.

Copyright © 2004 Pearson Education, Inc. Assumptions 1. The sample data have been randomly selected. 2. Unlike the parametric methods of Section 10-2, there is no requirement that the sample pairs of data have a bivariate normal distribution. There is no requirement of a normal distribution for any population.

Copyright © 2004 Pearson Education, Inc. Notation r s = rank correlation coefficient for sample paired data ( r s is a sample statistic)  s = rank correlation coefficient for all the population data (  s is a population parameter) n = number of pairs of data d = difference between ranks for the two values within a pair r s is often called Spearman’s rank correlation coefficient

Copyright © 2004 Pearson Education, Inc. Test Statistic for the Rank Correlation Coefficient where each value of d is a difference between the ranks for a pair of sample data Critical values:  If n  30, refer to Table A-6  If n > 30, use Formula 9-1 r s = 1 – 6  d 2 n(n 2 – 1 )

Copyright © 2004 Pearson Education, Inc. Complete the computation of to get the sample statistic. Figure 12-4 Rank Correlation for Testing H 0 :  s = 0 Start Calculate the difference d for each pair of ranks by subtracting the lower rank from the higher rank. Let n equal the total number of signs. Are the n pairs of data in the form of ranks ? Convert the data of the first sample to ranks from 1 to n and then do the same for the second sample. No Square each difference d and then find the sum of those squares to get r s = 1 – 6d26d2 n(n 2 –1 ) (d2)(d2) Yes

Copyright © 2004 Pearson Education, Inc. Complete the computation of to get the sample statistic. r s = 1 – 6d26d2 n(n 2 – 1 ) Is n  30 ? If the sample statistic r s is positive and exceeds the positive critical value, there is a correlation. If the sample statistic r s is negative and is less than the negative critical value, there is a correlation. If the sample statistic r s is between the positive and negative critical values, there is no correlation. Find the critical values of r s in Table A-9 Calculate the critical values where z corresponds to the significance level r s = n – 1  z Yes No Figure 13-4 Rank Correlation for Testing H 0 :  s = 0

Copyright © 2004 Pearson Education, Inc. Example: Perceptions of Beauty Use the data in Table 12-6 to determine if there is a correlation between the rankings of men and women in terms of what they find attractive. Use a significance level of  = 0.05.

Copyright © 2004 Pearson Education, Inc. Example: Perceptions of Beauty Use the data in Table 12-6 to determine if there is a correlation between the rankings of men and women in terms of what they find attractive. Use a significance level of  = 0.05. H 0 :  s = 0 H 1 :  s  0 n = 10 r s = 1 – 6  d 2 n(n 2 – 1 ) r s = 0.552 r s = 1 – 6(74) 10(10 2 – 1 )

Copyright © 2004 Pearson Education, Inc. Example: Perceptions of Beauty Use the data in Table 12-6 to determine if there is a correlation between the rankings of men and women in terms of what they find attractive. Use a significance level of  = 0.05. We refer to Table A-9 to determine that the critical values are  0.648. Because the test statistic of r s = 0.552 does not exceed the critical value of 0.648, we fail to reject the null hypothesis. There is no sufficient evidence to support a claim of a correlation between the rankings of men and women.

Copyright © 2004 Pearson Education, Inc. Assume that the preceding example is expanded by including a total of 40 women and that the test statistic r s is found to be 0.291. If the significance level of  = 0.05, what do you conclude about the correlation? Example: Perceptions of Beauty with Large Samples

Copyright © 2004 Pearson Education, Inc. The test statistic of r s = 0.291 does not exceed the critical value of 0.314, so we fail to reject the null hypothesis. There is not sufficient evidence to support the claim of a correlation between men and women. Example: Perceptions of Beauty with Large Samples

Copyright © 2004 Pearson Education, Inc. The data in Table 12-7 are the numbers of games played and the last scores (in millions) of a Raiders of the Lost Ark pinball game. We expect that there should be an association between the number of games played and the pinball score. Is there sufficient evidence to support the claim that there is such an association? Example: Detecting a Nonlinear Pattern

Slide 1 Copyright © 2004 Pearson Education, Inc. Chapter 10 Correlation and Regression 10-1 Overview Overview 10-2 Correlation 10-3 Regression-3 Regression.

Similar presentations

Presentation on theme: "Slide 1 Copyright © 2004 Pearson Education, Inc. Chapter 10 Correlation and Regression 10-1 Overview Overview 10-2 Correlation 10-3 Regression-3 Regression."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Slide 1 Copyright © 2004 Pearson Education, Inc. Chapter 10 Correlation and Regression 10-1 Overview Overview 10-2 Correlation 10-3 Regression-3 Regression.

Similar presentations

Presentation on theme: "Slide 1 Copyright © 2004 Pearson Education, Inc. Chapter 10 Correlation and Regression 10-1 Overview Overview 10-2 Correlation 10-3 Regression-3 Regression."— Presentation transcript:

Similar presentations

About project

Feedback