Lecture Slides Elementary Statistics Eleventh Edition


Lecture Slides Elementary Statistics Eleventh Edition and the Triola Statistics Series by Mario F. Triola

Chapter 10 Correlation and Regression 10-1 Review and Preview 10-2 Correlation 10-3 Regression 10-4 Variation and Prediction Intervals 10-5 Rank Correlation

Section 10-1 Review and Preview

Review In Chapter 9 we presented methods for making inferences from two samples. In Section 9-4 we considered two dependent samples, with each value of one sample somehow paired with a value from the other sample. In Section 9-4 we considered the differences between the paired values, and we illustrated the use of hypothesis tests for claims about the population of differences. We also illustrated the construction of confidence interval estimates of the mean of all such differences. In this chapter we again consider paired sample data, but the objective is fundamentally different from that of Section 9-4. page 517 of Elementary Statistics, 10th Edition

Preview In this chapter we introduce methods for determining whether a correlation, or association, between two variables exists and whether the correlation is linear. For linear correlations, we can identify an equation that best fits the data and we can use that equation to predict the value of one variable given the value of the other variable. In this chapter, we also present methods for analyzing differences between predicted values and actual values. page 517 of Elementary Statistics, 10th Edition

Preview In addition, we consider methods for identifying linear equations for correlations among three or more variables. We conclude the chapter with some basic methods for developing a mathematical model that can be used to describe nonlinear correlations between two variables. page 517 of Elementary Statistics, 10th Edition

Section 10-2 Correlation

Key Concept Part 1 of this section introduces the linear correlation coefficient r, which is a numerical measure of the strength of the relationship between two variables representing quantitative data. Using paired sample data (sometimes called bivariate data), we find the value of r (usually using technology), then we use that value to conclude that there is (or is not) a linear correlation between the two variables.

Key Concept In this section we consider only linear relationships, which means that when graphed, the points approximate a straight-line pattern. In Part 2, we discuss methods of hypothesis testing for correlation.

Part 1: Basic Concepts of Correlation

Definition A correlation exists between two variables when the values of one variable are associated with the values of the other in some way.

Definition The linear correlation coefficient r measures the strength of the linear relationship between the paired quantitative x- and y-values in a sample. page 518 of Elementary Statistics, 10th Edition

Exploring the Data We can often see a relationship between two variables by constructing a scatterplot. Figure 10-2, which follows, shows scatterplots with different characteristics. Relate a scatterplot to the algebraic plotting of number pairs (x, y).

Scatterplots of Paired Data Page 519 of Elementary Statistics, 10th Edition Figure 10-2

Requirements 1. The sample of paired (x, y) data is a simple random sample of quantitative data. 2. Visual examination of the scatterplot must confirm that the points approximate a straight-line pattern. 3. The outliers must be removed if they are known to be errors. The effects of any other outliers should be considered by calculating r with and without the outliers included. Explain to students the difference between the ‘paired’ data of this chapter and the investigation of two groups of data in Chapter 9.

Notation for the Linear Correlation Coefficient n = number of pairs of sample data. Σ denotes the addition of the items indicated. Σx denotes the sum of all x-values. Σx² indicates that each x-value should be squared and then those squares added. (Σx)² indicates that the x-values should be added and the total then squared.

Notation for the Linear Correlation Coefficient Σxy indicates that each x-value should first be multiplied by its corresponding y-value; after obtaining all such products, find their sum. r = linear correlation coefficient for sample data. ρ = linear correlation coefficient for population data.

Formula r = nxy – (x)(y) n(x2) – (x)2 n(y2) – (y)2 The linear correlation coefficient r measures the strength of a linear relationship between the paired values in a sample. nxy – (x)(y) n(x2) – (x)2 n(y2) – (y)2 r = Formula 10-1 Computer software or calculators can compute r

Interpreting r Using Table A-6: If the absolute value of the computed value of r, denoted |r|, exceeds the value in Table A-6, conclude that there is a linear correlation. Otherwise, there is not sufficient evidence to support the conclusion of a linear correlation. Using Software: If the computed P-value is less than or equal to the significance level, conclude that there is a linear correlation. Otherwise, there is not sufficient evidence to support the conclusion of a linear correlation.

Caution Know that the methods of this section apply to a linear correlation. If you conclude that there does not appear to be linear correlation, know that it is possible that there might be some other association that is not linear.

Rounding the Linear Correlation Coefficient r Round r to three decimal places so that it can be compared to critical values in Table A-6. Use a calculator or computer if possible. Page 521 of Elementary Statistics, 10th Edition

Properties of the Linear Correlation Coefficient r 1. –1 ≤ r ≤ 1. 2. If all values of either variable are converted to a different scale, the value of r does not change. 3. The value of r is not affected by the choice of x and y: interchange all x- and y-values and the value of r will not change. 4. r measures the strength of a linear relationship. 5. r is very sensitive to outliers, which can dramatically affect its value. page 524 of Elementary Statistics, 10th Edition If using a graphics calculator for demonstration, it will be an easy exercise to switch the x and y values to show that the value of r will not change.

Example: The paired pizza/subway fare costs from Table 10-1 are shown here in Table 10-2. Use computer software with these paired sample values to find the value of the linear correlation coefficient r for the paired sample data. Requirements are satisfied: simple random sample of quantitative data; Minitab scatterplot approximates a straight line; scatterplot shows no outliers - see next slide page 521 of Elementary Statistics, 10th Edition.

Example: Using software or a calculator, r is calculated automatically: r = 0.988 (rounded to three decimal places). page 521 of Elementary Statistics, 10th Edition.

Interpreting the Linear Correlation Coefficient r We can base our interpretation and conclusion about correlation on a P-value obtained from computer software or a critical value from Table A-6. page 524 of Elementary Statistics, 10th Edition

Interpreting the Linear Correlation Coefficient r Using Computer Software to Interpret r: If the computed P-value is less than or equal to the significance level, conclude that there is a linear correlation. Otherwise, there is not sufficient evidence to support the conclusion of a linear correlation. page 524 of Elementary Statistics, 10th Edition

Interpreting the Linear Correlation Coefficient r Using Table A-6 to Interpret r: If |r| exceeds the value in Table A-6, conclude that there is a linear correlation. Otherwise, there is not sufficient evidence to support the conclusion of a linear correlation. page 524 of Elementary Statistics, 10th Edition

Interpreting the Linear Correlation Coefficient r Critical Values from Table A-6 and the Computed Value of r page 524 of Elementary Statistics, 10th Edition

Example: Using a 0.05 significance level, interpret the value of r = 0.117 found using the 62 pairs of weights of discarded paper and glass listed in Data Set 22 in Appendix B. When the paired data are used with computer software, the P-value is found to be 0.364. Is there sufficient evidence to support a claim of a linear correlation between the weights of discarded paper and glass? page 524 of Elementary Statistics, 10th Edition

Example: Requirements are satisfied: simple random sample of quantitative data; scatterplot approximates a straight line; no outliers Using Software to Interpret r: The P-value obtained from software is 0.364. Because the P-value is not less than or equal to 0.05, we conclude that there is not sufficient evidence to support a claim of a linear correlation between weights of discarded paper and glass. page 524 of Elementary Statistics, 10th Edition

Example: Using Table A-6 to Interpret r: If we refer to Table A-6 with n = 62 pairs of sample data, we obtain the critical value of 0.254 (approximately) for α = 0.05. Because |0.117| does not exceed the value of 0.254 from Table A-6, we conclude that there is not sufficient evidence to support a claim of a linear correlation between weights of discarded paper and glass. page 524 of Elementary Statistics, 10th Edition

Interpreting r: Explained Variation The value of r² is the proportion of the variation in y that is explained by the linear relationship between x and y.

Example: Using the pizza/subway fare costs in Table 10-2, we have found that the linear correlation coefficient is r = 0.988. What proportion of the variation in the subway fare can be explained by the variation in the costs of a slice of pizza? With r = 0.988, we get r² = 0.976. We conclude that 0.976 (or about 98%) of the variation in the cost of subway fares can be explained by the linear relationship between the costs of pizza and subway fares. This implies that about 2% of the variation in costs of subway fares cannot be explained by the costs of pizza. page 524 of Elementary Statistics, 10th Edition

Common Errors Involving Correlation 1. Causation: It is wrong to conclude that correlation implies causality. 2. Averages: Averages suppress individual variation and may inflate the correlation coefficient. 3. Linearity: There may be some relationship between x and y even when there is no linear correlation. page 525 of Elementary Statistics, 10th Edition

Caution Know that correlation does not imply causality. page 525 of Elementary Statistics, 10th Edition

Part 2: Formal Hypothesis Test

Formal Hypothesis Test We wish to determine whether there is a significant linear correlation between two variables. page 527 of Elementary Statistics, 10th Edition

Hypothesis Test for Correlation Notation n = number of pairs of sample data r = linear correlation coefficient for a sample of paired data ρ = linear correlation coefficient for a population of paired data page 527 of Elementary Statistics, 10th Edition

Hypothesis Test for Correlation Requirements 1. The sample of paired (x, y) data is a simple random sample of quantitative data. 2. Visual examination of the scatterplot must confirm that the points approximate a straight-line pattern. 3. The outliers must be removed if they are known to be errors. The effects of any other outliers should be considered by calculating r with and without the outliers included. page 527 of Elementary Statistics, 10th Edition

Hypothesis Test for Correlation Hypotheses H0: ρ = 0 (There is no linear correlation.) H1: ρ ≠ 0 (There is a linear correlation.) page 527 of Elementary Statistics, 10th Edition Test Statistic: r Critical Values: Refer to Table A-6

Hypothesis Test for Correlation Conclusion If |r| > critical value from Table A-6, reject H0 and conclude that there is sufficient evidence to support the claim of a linear correlation. If |r| ≤ critical value from Table A-6, fail to reject H0 and conclude that there is not sufficient evidence to support the claim of a linear correlation. page 527 of Elementary Statistics, 10th Edition

H0: = (There is no linear correlation.) Example: Use the paired pizza subway fare data in Table 10-2 to test the claim that there is a linear correlation between the costs of a slice of pizza and the subway fares. Use a 0.05 significance level. Requirements are satisfied as in the earlier example. page 527 of Elementary Statistics, 10th Edition H0: = (There is no linear correlation.) H1:  (There is a linear correlation.)

Example: The test statistic is r = 0.988 (from an earlier Example). The critical value of r = 0.811 is found in Table A-6 with n = 6 and α = 0.05. Because |0.988| > 0.811, we reject H0: ρ = 0. (Rejecting “no linear correlation” indicates that there is a linear correlation.) We conclude that there is sufficient evidence to support the claim of a linear correlation between costs of a slice of pizza and subway fares. page 527 of Elementary Statistics, 10th Edition

Hypothesis Test for Correlation P-Value from a t Test H0: ρ = 0 (There is no linear correlation.) H1: ρ ≠ 0 (There is a linear correlation.) Test Statistic: t = r / √[(1 – r²) / (n – 2)] page 527 of Elementary Statistics, 10th Edition

Hypothesis Test for Correlation Conclusion P-value: Use computer software or use Table A-3 with n – 2 degrees of freedom to find the P-value corresponding to the test statistic t. If the P-value is less than or equal to the significance level, reject H0 and conclude that there is sufficient evidence to support the claim of a linear correlation. page 527 of Elementary Statistics, 10th Edition If the P-value is greater than the significance level, fail to reject H0 and conclude that there is not sufficient evidence to support the claim of a linear correlation.

H0: = (There is no linear correlation.) Example: Use the paired pizza subway fare data in Table 10-2 and use the P-value method to test the claim that there is a linear correlation between the costs of a slice of pizza and the subway fares. Use a 0.05 significance level. Requirements are satisfied as in the earlier example. page 527 of Elementary Statistics, 10th Edition H0: = (There is no linear correlation.) H1:  (There is a linear correlation.)

Example: The linear correlation coefficient is r = 0.988 (from an earlier Example) and n = 6 (six pairs of data), so the test statistic is t = r / √[(1 – r²) / (n – 2)]. With df = 4, Table A-3 yields a P-value that is less than 0.01. Computer software generates a test statistic of t = 12.692 and a P-value of 0.00022. page 527 of Elementary Statistics, 10th Edition
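The t computation can be checked with a short script. Note that the full-precision r is rebuilt here from the Formula 10-1 sums rather than the rounded 0.988, which is why the result matches the software value reported on the slide.

```python
import math

def t_statistic(r, n):
    """Test statistic t = r / sqrt((1 - r^2) / (n - 2)) for H0: rho = 0."""
    return r / math.sqrt((1 - r * r) / (n - 2))

# Full-precision r from the Formula 10-1 sums for the pizza/subway data:
# numerator 15.47 = 6(9.4575) - (6.50)(6.35);
# denominator terms 16.37 = 6(9.77) - 6.50^2, 14.9825 = 6(9.2175) - 6.35^2.
r = 15.47 / math.sqrt(16.37 * 14.9825)
t = t_statistic(r, 6)
print(round(t, 3))  # 12.692, matching the software output
```

With df = 4, this t lands far beyond the usual table values, consistent with the reported P-value of 0.00022.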

Example: Using either method, the P-value is less than the significance level of 0.05 so we reject H0:  = 0. We conclude that there is sufficient evidence to support the claim of a linear correlation between costs of a slice of pizza and subway fares.

One-Tailed Tests One-tailed tests can occur with a claim of a positive linear correlation or a claim of a negative linear correlation. In such cases, the hypotheses pair H0: ρ = 0 with H1: ρ > 0 (claim of positive linear correlation) or with H1: ρ < 0 (claim of negative linear correlation). For these one-tailed tests, the P-value method can be used as in earlier chapters.

Recap In this section, we have discussed: Correlation. The linear correlation coefficient r. Requirements, notation and formula for r. Interpreting r. Formal hypothesis testing.

Section 10-3 Regression

Key Concept In part 1 of this section we find the equation of the straight line that best fits the paired sample data. That equation algebraically describes the relationship between two variables. The best-fitting straight line is called a regression line and its equation is called the regression equation. In part 2, we discuss marginal change, influential points, and residual plots as tools for analyzing correlation and regression results.

Part 1: Basic Concepts of Regression

Regression The regression equation expresses a relationship between x (called the explanatory variable, predictor variable, or independent variable) and y (called the response variable or dependent variable). The typical equation of a straight line, y = mx + b, is expressed in the form ŷ = b0 + b1x, where b0 is the y-intercept and b1 is the slope.

Definitions Regression Equation: Given a collection of paired data, the regression equation ŷ = b0 + b1x algebraically describes the relationship between the two variables. Regression Line: The graph of the regression equation is called the regression line (or line of best fit, or least-squares line).

Notation for Regression Equation: the y-intercept of the regression equation is the population parameter β0, estimated by the sample statistic b0; the slope is the population parameter β1, estimated by the sample statistic b1; the equation of the regression line is y = β0 + β1x for the population and ŷ = b0 + b1x for the sample.

Requirements 1. The sample of paired (x, y) data is a random sample of quantitative data. 2. Visual examination of the scatterplot shows that the points approximate a straight-line pattern. 3. Any outliers must be removed if they are known to be errors. Consider the effects of any outliers that are not known errors.

Formulas for b0 and b1 Formula 10-3 (slope): b1 = [n(Σxy) – (Σx)(Σy)] / [n(Σx²) – (Σx)²] Formula 10-4 (y-intercept): b0 = ȳ – b1x̄ Calculators or computers can compute these values.
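A minimal sketch of Formulas 10-3 and 10-4, using the same pizza/subway data as earlier; the printed values agree with the coefficients b0 = 0.0346 and b1 = 0.945 reported in the next example.

```python
# Least-squares slope and intercept (Formulas 10-3 and 10-4)
# for the pizza/subway data used throughout these slides.
x = [0.15, 0.35, 1.00, 1.25, 1.75, 2.00]
y = [0.15, 0.35, 1.00, 1.35, 1.50, 2.00]

def regression_coefficients(x, y):
    """Return (b0, b1) for the sample regression line y_hat = b0 + b1*x."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(v * v for v in x)
    b1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)  # Formula 10-3 (slope)
    b0 = sy / n - b1 * sx / n                       # Formula 10-4 (y-intercept)
    return b0, b1

b0, b1 = regression_coefficients(x, y)
print(round(b0, 4), round(b1, 3))  # 0.0346 0.945
```

As the slide says, technology normally supplies these values; the point of the sketch is to show that both formulas reuse the same sums as Formula 10-1.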

Special Property The regression line fits the sample points best.

Rounding the y-intercept b0 and the Slope b1 Round to three significant digits. If you use the formulas 10-3 and 10-4, do not round intermediate values. page 543 of Elementary Statistics, 10th Edition

Example: Refer to the sample data given in Table 10-1 in the Chapter Problem. Use technology to find the equation of the regression line in which the explanatory variable (or x variable) is the cost of a slice of pizza and the response variable (or y variable) is the corresponding cost of a subway fare. page 543 of Elementary Statistics, 10th Edition.

Example: Requirements are satisfied: simple random sample; scatterplot approximates a straight line; no outliers. Here are results from four different technologies. page 543 of Elementary Statistics, 10th Edition.

Example: All of these technologies show that the regression equation can be expressed as ŷ = 0.0346 + 0.945x, where ŷ is the predicted cost of a subway fare and x is the cost of a slice of pizza. We should know that the regression equation is an estimate of the true regression equation. This estimate is based on one particular set of sample data, but another sample drawn from the same population would probably lead to a slightly different equation. page 543 of Elementary Statistics, 10th Edition.

Example: Graph the regression equation (from the preceding Example) on the scatterplot of the pizza/subway fare data and examine the graph to subjectively determine how well the regression line fits the data. On the next slide is the Minitab display of the scatterplot with the graph of the regression line included. We can see that the regression line fits the data quite well. page 543 of Elementary Statistics, 10th Edition.

Example: page 543 of Elementary Statistics, 10th Edition.

Using the Regression Equation for Predictions 1. Use the regression equation for predictions only if the graph of the regression line on the scatterplot confirms that the regression line fits the points reasonably well. 2. Use the regression equation for predictions only if the linear correlation coefficient r indicates that there is a linear correlation between the two variables (as described in Section 10-2).

Using the Regression Equation for Predictions 3. Use the regression line for predictions only if the data do not go much beyond the scope of the available sample data. (Predicting too far beyond the scope of the available sample data is called extrapolation, and it could result in bad predictions.) 4. If the regression equation does not appear to be useful for making predictions, the best predicted value of a variable is its point estimate, which is its sample mean.
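As an illustration of this strategy, a prediction helper might look like the sketch below. The `model_is_good` flag stands in for the checks listed above (line fits, r significant, no extrapolation); it is an assumption of this sketch, set to True because that is the situation for the pizza/subway data.

```python
# Predicting a subway fare from a pizza price, with the fallback to the
# sample mean when the regression equation is not a good model.
y = [0.15, 0.35, 1.00, 1.35, 1.50, 2.00]  # observed subway fares
b0, b1 = 0.0346, 0.945                    # fitted equation from the example
model_is_good = True  # assumption: r = 0.988 is significant and the line fits

def predict(x0):
    if model_is_good:
        return b0 + b1 * x0       # use the regression equation
    return sum(y) / len(y)        # otherwise the best prediction is y-bar

print(round(predict(2.25), 2))  # 2.16, the best predicted fare for a $2.25 slice
```

The $2.16 result matches the predicted value used in the prediction-interval example of Section 10-4.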

Strategy for Predicting Values of Y page 546 of Elementary Statistics, 10th Edition

Using the Regression Equation for Predictions If the regression equation is not a good model, the best predicted value of y is simply ȳ, the mean of the y-values. Remember, this strategy applies to linear patterns of points in a scatterplot. If the scatterplot shows a pattern that is not a straight-line pattern, other methods apply, as described in Section 10-6.

Part 2: Beyond the Basics of Regression

Definitions In working with two variables related by a regression equation, the marginal change in a variable is the amount that it changes when the other variable changes by exactly one unit. The slope b1 in the regression equation represents the marginal change in y that occurs when x changes by one unit.

Definitions In a scatterplot, an outlier is a point lying far away from the other data points. Paired sample data may include one or more influential points, which are points that strongly affect the graph of the regression line.

Example: Consider the pizza/subway fare data from the Chapter Problem. The scatterplot located to the left on the next slide shows the regression line. If we include this additional pair of data: x = 2.00, y = –20.00 (pizza is still $2.00 per slice, but the subway fare is –$20.00, which means that people are paid $20 to ride the subway), this additional point would be an influential point because the graph of the regression line would change considerably, as shown by the regression line located to the right. page 547 – 548 of Elementary Statistics, 10th Edition

Example: page 547 – 548 of Elementary Statistics, 10th Edition

Example: Compare the two graphs and you will see clearly that the addition of that one pair of values has a very dramatic effect on the regression line, so that additional point is an influential point. The additional point is also an outlier because it is far from the other points.

Definition For a pair of sample x and y values, the residual is the difference between the observed sample value of y and the y-value that is predicted by using the regression equation. That is, residual = observed y – predicted y = y – ŷ
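The residual definition can be demonstrated numerically. Using the rounded fitted equation ŷ = 0.0346 + 0.945x from the earlier example, the six residuals for the pizza/subway data sum to approximately zero, as expected for a least-squares fit.

```python
# Residuals y - y_hat for the pizza/subway data, using the fitted
# equation y_hat = 0.0346 + 0.945x from the earlier example.
x = [0.15, 0.35, 1.00, 1.25, 1.75, 2.00]
y = [0.15, 0.35, 1.00, 1.35, 1.50, 2.00]
b0, b1 = 0.0346, 0.945

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print([round(e, 3) for e in residuals])
print(round(sum(residuals), 3))  # near zero for a least-squares fit
```

Plotting these residuals against x gives exactly the residual plot defined a few slides later.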

Residuals

Definitions A straight line satisfies the least-squares property if the sum of the squares of the residuals is the smallest sum possible. page 549 - 550 of Elementary Statistics, 10th Edition

Definitions A residual plot is a scatterplot of the (x, y) values after each of the y-coordinate values has been replaced by the residual value y – ŷ (where ŷ denotes the predicted value of y). That is, a residual plot is a graph of the points (x, y – ŷ). page 549 - 550 of Elementary Statistics, 10th Edition

Residual Plot Analysis When analyzing a residual plot, look for a pattern in the way the points are configured, and use these criteria: The residual plot should not have an obvious pattern that is not a straight-line pattern. The residual plot should not become thicker (or thinner) when viewed from left to right.

Residuals Plot - Pizza/Subway

Residual Plots

Complete Regression Analysis 1. Construct a scatterplot and verify that the pattern of the points is approximately a straight-line pattern without outliers. (If there are outliers, consider their effects by comparing results that include the outliers to results that exclude the outliers.) 2. Construct a residual plot and verify that there is no pattern (other than a straight-line pattern) and also verify that the residual plot does not become thicker (or thinner).

Complete Regression Analysis 3. Use a histogram and/or normal quantile plot to confirm that the values of the residuals have a distribution that is approximately normal. 4. Consider any effects of a pattern over time.

Recap In this section we have discussed: The basic concepts of regression. Rounding rules. Using the regression equation for predictions. Interpreting the regression equation. Outliers. Residuals and least-squares. Residual plots.

Section 10-4 Variation and Prediction Intervals

Key Concept In this section we present a method for constructing a prediction interval, which is an interval estimate of a predicted value of y.

Definition Assume that we have a collection of paired data containing the sample point (x, y), that ŷ is the predicted value of y (obtained by using the regression equation), and that the mean of the sample y-values is ȳ.

Definition The total deviation of (x, y) is the vertical distance y – ȳ, which is the distance between the point (x, y) and the horizontal line passing through the sample mean ȳ.

Definition The explained deviation is the vertical distance ŷ – ȳ, which is the distance between the predicted y-value and the horizontal line passing through the sample mean ȳ.

Definition The unexplained deviation is the vertical distance y – ŷ, which is the vertical distance between the point (x, y) and the regression line. (The distance y – ŷ is also called a residual, as defined in Section 10-3.)

Unexplained, Explained, and Total Deviation Figure 10-7

Relationships (total deviation) = (explained deviation) + (unexplained deviation): (y – ȳ) = (ŷ – ȳ) + (y – ŷ). Squaring and summing over all sample points gives Formula 10-5: (total variation) = (explained variation) + (unexplained variation), that is, Σ(y – ȳ)² = Σ(ŷ – ȳ)² + Σ(y – ŷ)²

Definition The coefficient of determination is the amount of the variation in y that is explained by the regression line: r² = (explained variation) / (total variation). Equivalently, the value of r² is the proportion of the variation in y that is explained by the linear relationship between x and y.
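The variation decomposition and the definition of r² can be checked numerically. A minimal sketch in pure Python, using hypothetical data (not the pizza/subway values):

```python
from math import isclose

# Hypothetical paired data, used only to verify Formula 10-5 numerically.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# Least-squares fit and predicted values y-hat.
b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
      / sum((x - mx) ** 2 for x in xs))
b0 = my - b1 * mx
y_hat = [b0 + b1 * x for x in xs]

total = sum((y - my) ** 2 for y in ys)                        # Σ(y - ȳ)²
explained = sum((yh - my) ** 2 for yh in y_hat)               # Σ(ŷ - ȳ)²
unexplained = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))  # Σ(y - ŷ)²

# Formula 10-5 holds exactly for the least-squares line.
print(isclose(total, explained + unexplained))  # → True

r_squared = explained / total
print(round(r_squared, 4))  # → 0.9973
```

Note that the decomposition holds only for the least-squares line; for any other line through the data the cross term does not vanish.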

Definition A prediction interval is an interval estimate of a predicted value of y.

Definition The standard error of estimate, denoted by se, is a measure of the differences (or distances) between the observed sample y-values and the predicted values ŷ that are obtained using the regression equation.

Standard Error of Estimate (Formula 10-6): se = √[Σ(y – ŷ)² / (n – 2)] or, equivalently, se = √[(Σy² – b₀Σy – b₁Σxy) / (n – 2)]

Example: Use Formula 10-6 to find the standard error of estimate se for the paired pizza/subway fare data listed in Table 10-1 in the Chapter Problem. Here n = 6, Σy² = 9.2175, Σy = 6.35, Σxy = 9.4575, b₀ = 0.034560171, and b₁ = 0.94502138. se = √[(Σy² – b₀Σy – b₁Σxy) / (n – 2)] = √[(9.2175 – (0.034560171)(6.35) – (0.94502138)(9.4575)) / (6 – 2)] = 0.12298700 ≈ 0.123
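The arithmetic in this example is easy to verify with the second form of Formula 10-6 and the summary statistics quoted on the slide:

```python
from math import sqrt

# Summary statistics from the slide (pizza/subway data, Table 10-1).
n = 6
sum_y2 = 9.2175
sum_y = 6.35
sum_xy = 9.4575
b0 = 0.034560171
b1 = 0.94502138

# Formula 10-6, second (computational) form.
se = sqrt((sum_y2 - b0 * sum_y - b1 * sum_xy) / (n - 2))
print(round(se, 3))  # → 0.123
```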

Prediction Interval for an Individual y: ŷ – E < y < ŷ + E, where E = t_(α/2) · se · √[1 + 1/n + n(x₀ – x̄)² / (n(Σx²) – (Σx)²)]. Here x₀ represents the given value of x, and t_(α/2) has n – 2 degrees of freedom.

Example: For the paired pizza/subway fare costs from the Chapter Problem, we have found that for a pizza cost of $2.25, the best predicted cost of a subway fare is $2.16. Construct a 95% prediction interval for the cost of a subway fare, given that a slice of pizza costs $2.25 (so that x₀ = 2.25). E = t_(α/2) · se · √[1 + 1/n + n(x₀ – x̄)² / (n(Σx²) – (Σx)²)] = (2.776)(0.12298700)√[1 + 1/6 + 6(2.25 – 1.0833333)² / (6(9.77) – (6.50)²)] = (2.776)(0.12298700)(1.2905606) = 0.441

Example: Construct the prediction interval. ŷ – E < y < ŷ + E: 2.16 – 0.441 < y < 2.16 + 0.441, which gives 1.72 < y < 2.60.
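The margin of error and the interval endpoints can be reproduced directly from the values quoted in the example:

```python
from math import sqrt

# Values quoted in the example (pizza/subway data).
n = 6
t = 2.776              # t_(α/2) with n - 2 = 4 degrees of freedom, 95% level
se = 0.12298700        # standard error of estimate from Formula 10-6
sum_x2 = 9.77
sum_x = 6.50
x0 = 2.25              # given pizza cost
x_bar = sum_x / n      # 1.0833...
y_hat = 2.16           # best predicted subway fare at x0

# Margin of error E for the prediction interval.
E = t * se * sqrt(1 + 1 / n + n * (x0 - x_bar) ** 2 / (n * sum_x2 - sum_x ** 2))
lower, upper = y_hat - E, y_hat + E
print(round(E, 3), round(lower, 2), round(upper, 2))  # → 0.441 1.72 2.6
```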

Recap In this section we have discussed: Explained and unexplained variation. Coefficient of determination. Standard error of estimate. Prediction intervals.

Section 10-5 Rank Correlation

Key Concept This section describes the nonparametric method of the rank correlation test, which uses paired data to test for an association between two variables. In Chapter 10 we used paired sample data to compute values for the linear correlation coefficient r, but in this section we use ranks as the basis for computing the rank correlation coefficient rs. We use the notation rs for the rank correlation coefficient so that we don't confuse it with the linear correlation coefficient r. The subscript s is used to honor Charles Spearman, who originated the approach.

Definition The rank correlation test (or Spearman's rank correlation test) is a nonparametric test that uses ranks of sample data consisting of matched pairs. It is used to test for an association between two variables.

Advantages Rank correlation has these advantages over the parametric methods discussed in Chapter 10: The nonparametric method of rank correlation can be used in a wider variety of circumstances than the parametric method of linear correlation.  With rank correlation, we can analyze paired data that are ranks or can be converted to ranks. Rank correlation can be used to detect some (not all) relationships that are not linear.

Objective Compute the rank correlation coefficient rs and use it to test for an association between two variables. H0: ρs = 0 (There is no correlation between the two variables.) H1: ρs ≠ 0 (There is a correlation between the two variables.)

Notation rs = rank correlation coefficient for sample paired data (rs is a sample statistic). ρs = rank correlation coefficient for all the population data (ρs is a population parameter). n = number of pairs of sample data. d = difference between ranks for the two values within a pair.

Requirements The sample paired data have been randomly selected. Note: Unlike the parametric methods of Section 10-2, there is no requirement that the sample pairs of data have a bivariate normal distribution. There is no requirement of a normal distribution for any population.

Rank Correlation Test Statistic No ties: After converting the data in each sample to ranks, if there are no ties among ranks for either variable, the exact value of the test statistic can be calculated using this formula: rs = 1 – (6Σd²) / (n(n² – 1))
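With no ties, the test statistic is simple to compute. A minimal sketch (the ranks shown are hypothetical):

```python
def spearman_no_ties(ranks_x, ranks_y):
    """rs = 1 - 6Σd² / (n(n² - 1)); valid only when neither variable has tied ranks."""
    n = len(ranks_x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks_x, ranks_y))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Perfect agreement between the two rankings gives rs = 1;
# perfect disagreement (one ranking reversed) gives rs = -1.
print(spearman_no_ties([1, 2, 3, 4], [1, 2, 3, 4]))  # → 1.0
print(spearman_no_ties([1, 2, 3, 4], [4, 3, 2, 1]))  # → -1.0
```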

Rank Correlation Test Statistic Ties: After converting the data in each sample to ranks, if either variable has ties among its ranks, the exact value of the test statistic rs can be found by using Formula 10-1 with the ranks:

Rank Correlation Critical values: If n ≤ 30, critical values are found in Table A-9. If n > 30, use Formula 13-1: rs = ±z / √(n – 1), where the value of z corresponds to the significance level. (For example, if α = 0.05, z = 1.96.) Table A-9 in this edition includes some values that have been changed from those in Table A-9 of earlier editions.
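For the n > 30 case, Formula 13-1 can be sketched as follows, assuming the two-sided z-value for the chosen significance level (e.g. z = 1.96 for α = 0.05):

```python
from math import sqrt

def rs_critical(n, z=1.96):
    """Approximate critical value for rs when n > 30 (Formula 13-1): z / sqrt(n - 1)."""
    return z / sqrt(n - 1)

# Example: n = 40 pairs at the 0.05 significance level.
print(round(rs_critical(40), 3))  # → 0.314
```

A sample rs outside ±rs_critical(n) leads to rejecting H0: ρs = 0.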

Disadvantages A disadvantage of rank correlation is its efficiency rating of 0.91, as described in Section 13-1. This efficiency rating shows that with all other circumstances being equal, the nonparametric approach of rank correlation requires 100 pairs of sample data to achieve the same results as only 91 pairs of sample observations analyzed through parametric methods, assuming that the stricter requirements of the parametric approach are met.

Figure 13-4 Rank Correlation for Testing H0: ρs = 0

Example: Table 13-1 lists overall quality scores and selectivity rankings of a sample of national universities (based on data from U.S. News and World Report). Find the value of the rank correlation coefficient and use it to determine whether there is a correlation between the overall quality scores and the selectivity rankings. Use a 0.05 significance level. Based on the result, does it appear that national universities with higher overall quality scores are more difficult to get into?

Example: Requirement is satisfied: the paired data are a simple random sample. The selectivity data consist of ranks that are not normally distributed, so we use the rank correlation coefficient to test for a relationship between overall quality score and selectivity rank. The null and alternative hypotheses are as follows: H0: ρs = 0, H1: ρs ≠ 0

Example: Since neither variable has ties in the ranks, the exact value of the test statistic is rs = 1 – (6Σd²) / (n(n² – 1)) = –0.857

Example: From Table A-9, using α = 0.05 and n = 8, the critical values are ±0.738. Because the test statistic of rs = –0.857 is not between the critical values of –0.738 and 0.738, we reject the null hypothesis. There is sufficient evidence to support a claim of a correlation between overall quality score and selectivity ranking. It appears that universities with higher quality scores are more selective and more difficult to get into.

Example: Detecting a Nonlinear Pattern An experiment involves a growing population of bacteria. Table 13-8 lists randomly selected times (in hr) after the experiment is begun, and the number of bacteria present. Use a 0.05 significance level to test the claim that there is a correlation between time and population size.

Example: Detecting a Nonlinear Pattern Requirement is satisfied: the data are from a simple random sample. The hypotheses are: H0: ρs = 0, H1: ρs ≠ 0. We follow the rank correlation procedure summarized in Figure 13-5. The original values are not ranks, so we convert them to ranks and enter the results in Table 13-9.

Example: Detecting a Nonlinear Pattern There are no ties among the ranks of either list, so rs = 1 – (6Σd²) / (n(n² – 1)). Because the population grows throughout the experiment, the ranks of the times and the ranks of the population sizes agree exactly, so each difference d is 0 and rs = 1.

Example: Detecting a Nonlinear Pattern Since n = 10 is less than 30, use Table A-9. The critical values are ±0.648. The test statistic rs = 1 is not between –0.648 and 0.648, so we reject the null hypothesis of ρs = 0 (no correlation). There is sufficient evidence to conclude that there is a correlation between time and population size.

Example: Detecting a Nonlinear Pattern If this example is done using the methods of Section 10-2, the linear correlation coefficient is r = 0.0621 with critical values of –0.632 and 0.632. This leads to the conclusion that there is not sufficient evidence to support a claim of a significant linear correlation, whereas the rank correlation test did find sufficient evidence of a correlation. The Minitab scatter diagram shows a nonlinear relationship that the parametric method would not have detected.
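The contrast between the two methods can be reproduced with any strongly nonlinear but monotonic data set. A sketch using hypothetical doubling data (not the Table 13-8 values):

```python
from math import sqrt

def pearson(xs, ys):
    """Linear correlation coefficient r (Section 10-2)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def spearman_no_ties(xs, ys):
    """Rank correlation coefficient rs; assumes no tied values in either list."""
    def ranks(vals):
        ordered = sorted(vals)
        return [ordered.index(v) + 1 for v in vals]
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

xs = list(range(1, 11))        # hours (hypothetical)
ys = [2 ** x for x in xs]      # population doubles each hour (no ties)

r = pearson(xs, ys)
rs = spearman_no_ties(xs, ys)
print(rs)          # → 1.0 (perfect monotonic association detected)
print(0 < r < 1)   # → True (linear r understates the association)
```

Because the exponential growth is monotonic, the ranks agree exactly and rs = 1, while the linear coefficient r falls short of 1.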

Recap In this section we have discussed: Rank correlation, which is the nonparametric equivalent of the test for correlation described in Chapter 10. It uses the ranks of matched pairs to test for association. Sometimes rank correlation can detect a nonlinear correlation that the parametric test will not recognize.