Chapter 10 Correlation and Regression


Objective Consider two variables of a population denoted x and y (e.g. weight and height). Goal: Determine if there is a relation between x and y (correlation). If there is a relation, find a method of predicting values (regression).

Examples
1. x : Height of the mother, y : Height of the daughter
2. x : Number of cigarettes per day, y : Lifespan
3. x : Daily calorie intake, y : Weight
4. x : Shoe size, y : Number of friends on Facebook

Example This table includes a random sample of heights of mothers, fathers, and their daughters. Question Are the heights of daughters independent of the height of their mothers? Or is there a correlation between the heights of mothers and those of daughters? If yes, how strong is it?

Example This table includes a random sample of heights of mothers, fathers, and their daughters. The heights of mothers and their daughters in this sample seem to be strongly correlated… But heights of fathers and their daughters in this sample seem to be weakly correlated (if at all). (we will soon see how we came to this conclusion)

Section 10.2 Correlation between two variables (x and y) Objective Investigate how two variables (x and y) are related (i.e. correlated). That is, how much they depend on each other.

Definitions A correlation exists between two variables when the values of one are somehow associated with the values of the other. In this class, we are only interested in linear correlation. The text uses the wording ‘matched pairs’ (see the example at the bottom of pages 469-470 of Elementary Statistics, 10th Edition).

Definitions Linear correlation coefficient : r A numerical measure of the strength of the linear relationship between two variables, x and y, representing quantitative data. r always lies in the closed interval [–1, 1] (i.e. –1 ≤ r ≤ 1). We use this value to conclude whether there is (or is not) a linear correlation between the two variables.

Exploring the Data We can often see a relationship between two variables by constructing a scatterplot. Relate a scatter plot to the algebraic plotting of number pairs (x,y).

Positive Correlation We say the data has positive correlation if the data follows a line with a positive slope. The correlation coefficient (r) will be close to +1.

Negative Correlation We say the data has negative correlation if the data follows a line with a negative slope. The correlation coefficient (r) will be close to –1.

No Correlation We say the data has no correlation if the data does not seem to follow any line. The correlation coefficient (r) will be close to 0.

Interpreting r
r ≈ 1 : Strong positive linear correlation
r ≈ 0 : Weak linear correlation
r ≈ -1 : Strong negative linear correlation

Nonlinear Correlation The data may follow a curve rather than a line. In that case the linear correlation coefficient (r) can be close to zero even though x and y are clearly related. Page 519 of Elementary Statistics, 10th Edition

Requirements 1. The sample of paired (x, y) data is a random sample of quantitative data. 2. Visual examination of the scatterplot must confirm that the points approximate a straight-line pattern. 3. The outliers must be removed if they are known to be errors. (Note: We will not do this in this course)

Notation
n : Number of pairs of sample data
$\sum$ : Denotes the addition of the items indicated
$\sum x$ : The sum of all x-values, $\sum x = x_1 + x_2 + \dots + x_n$
$\sum y$ : The sum of all y-values, $\sum y = y_1 + y_2 + \dots + y_n$

Notation
$\sum x^2$ : The sum of the squares of all x-values, $\sum x^2 = x_1^2 + x_2^2 + \dots + x_n^2$
$(\sum x)^2$ : The sum of the x-values, then squared, $(\sum x)^2 = (x_1 + x_2 + \dots + x_n)^2$
$\sum xy$ : The sum of the products of paired x and y values, $\sum xy = x_1 y_1 + x_2 y_2 + \dots + x_n y_n$

Correlation Coefficient
r : Sample linear correlation coefficient
ρ : Population linear correlation coefficient (i.e. the linear correlation between the two populations)
r measures the strength of a linear relationship between the paired values in a sample.
$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}}$$
We use StatCrunch to compute r (Don’t panic!)
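
For anyone who wants to see what StatCrunch is doing behind the scenes, here is a minimal Python sketch of the sums formula for r. The function name `pearson_r` is made up for illustration. The mother/daughter table is not reproduced in this transcript, so the sketch borrows the pizza/subway values listed in the regression Example 1 of Section 10.3.

```python
import math

def pearson_r(xs, ys):
    """Sample linear correlation coefficient r, computed with the sums formula."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Pizza cost (x) and subway fare (y) as listed in the regression Example 1.
pizza = [0.15, 0.35, 1.00, 1.25, 1.75, 2.00]
subway = [0.15, 0.35, 1.00, 1.35, 1.50, 2.00]

r = pearson_r(pizza, subway)
print(round(r, 3))  # prints 0.988
```

Note that the denominator is zero when all x-values (or all y-values) are identical; the sketch does not guard against that case.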

Example 1a (step 1) Make a scatterplot for the heights of mothers and daughters. Enter the data in StatCrunch (mother in 1st column, daughter in 2nd column).

Example 1a (step 2) Make a scatterplot for the heights of mothers and daughters. Graphics – Scatter Plot

Example 1a (step 3) Make a scatterplot for the heights of mothers and daughters. Select var1 for the X variable (height of mother). Select var2 for the Y variable (height of daughter). Click Create Graph!

Example 1a (step 4) Make a scatterplot for the heights of mothers and daughters. Voila! (Does there appear to be correlation?)

Example 1b (step 1) Find the linear correlation coefficient of the heights. Stat – Summary Stats – Correlation

Example 1b (step 2) Find the linear correlation coefficient of the heights. Select var1 and var2 so they appear in the right box. Click Calculate.

Example 1b (step 3) Find the linear correlation coefficient of the heights. The correlation coefficient is r = 0.802 (rounded).

Determining if Correlation Exists We determine whether a population is correlated via a two-tailed test on a sample using a significance level (α).
H0 : ρ = 0 (i.e. not correlated)
H1 : ρ ≠ 0 (i.e. is correlated)
Again, two methods are available:
● Critical Regions (Use Table A-6)
● P-value (Use StatCrunch)
Note: In most cases we use significance level α = 0.05

Using Critical Regions Use Table A-6 to find the critical values, which depend on the sample size n. Use both the positive and negative values (two-tailed). The critical region consists of the values of r between -1 and the negative critical value, and between the positive critical value and 1. ● If r is in the critical region, we conclude that there is a linear correlation. (Since H0 is rejected) ● If r is not in the critical region, we say there is insufficient evidence of correlation. (Since we fail to reject H0)

Example 1c Using Critical Regions Use a 0.05 significance level to determine if the heights are linearly correlated. ● From Example 1b, we found r = 0.802 ● Since n = 20 and α = 0.05, using Table A-6, we find the critical values to be 0.444 and -0.444. Since r is in the critical region (reject H0), we conclude the data is linearly correlated (at the 0.05 significance level).
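
The critical-region rule can be sketched in a few lines of Python. This is only an illustration: the dictionary holds the three Table A-6 values (α = 0.05) that are quoted in this transcript's examples, not the full table, and the names `CRITICAL_R_ALPHA_05` and `has_linear_correlation` are made up here.

```python
# Critical values of r for alpha = 0.05 (two-tailed). Only the Table A-6
# entries quoted in this transcript's examples are included.
CRITICAL_R_ALPHA_05 = {15: 0.514, 18: 0.468, 20: 0.444}

def has_linear_correlation(r, n, table=CRITICAL_R_ALPHA_05):
    """Return True if r is in the critical region (H0: rho = 0 is rejected)."""
    if n not in table:
        raise KeyError(f"no critical value stored for n = {n}")
    # Two-tailed: compare the size of r with the positive critical value.
    return abs(r) > table[n]

print(has_linear_correlation(0.802, 20))  # Example 1c: prints True
```

Taking the absolute value of r handles both tails at once, so a strongly negative r is also flagged as being in the critical region.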

Using P-value Use StatCrunch to calculate the two-tailed P-value from a sample set (see Example 1c) ● If the P-value is less than α, we conclude that there is a linear correlation. (Since H0 is rejected) ● If the P-value is greater than α, we say there is insufficient evidence of correlation. (Since we fail to reject H0)

Example 1c Using P-value Use a 0.05 significance level to determine if the heights are linearly correlated. ● On StatCrunch: Stat – Summary Stats – Correlation ● Select var1, var2 so they appear in the right box, then click Next ● Check “Display two-sides P-value from sig. test”, then click Calculate ● Result: P-value < 0.0001 Since the P-value is less than α = 0.05 (reject H0), we conclude the data is linearly correlated.
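
The slides leave the P-value to StatCrunch. As a rough sketch of where such a number can come from, the Python below assumes the standard t test for H0: ρ = 0, with test statistic t = r·sqrt((n – 2)/(1 – r²)) and n – 2 degrees of freedom. Whether StatCrunch uses exactly this computation is an assumption. The tail area is found with Simpson's rule so that only the standard library is needed; the function names are made up for illustration.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def two_tailed_p_value(r, n, steps=20000):
    """Two-tailed P-value for H0: rho = 0 (requires -1 < r < 1, even steps)."""
    df = n - 2
    t = abs(r) * math.sqrt(df / (1 - r * r))
    # Simpson's rule for the area under the density between 0 and |t|.
    h = t / steps
    area = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area *= h / 3
    # The density is symmetric, so both tails together are 1 - 2 * area.
    return max(0.0, 1 - 2 * area)

p = two_tailed_p_value(0.802, 20)
print(p < 0.0001)  # Example 1c reports P-value < 0.0001; prints True
```

Because the slides report r rounded to three decimals, a P-value computed this way can differ slightly in the last digits from the StatCrunch output.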

Caution! Know that the methods of this section apply only to a linear correlation. If you conclude that there is no linear correlation, it is possible that there is some other association that is not linear.

Correlation Coefficient Rounding the Linear Correlation Coefficient Round r to three decimal places so that it can be compared to critical values in Table A-6 Page 521 of Elementary Statistics, 10th Edition

Properties of the Linear Correlation Coefficient r
1. The value of r is always between –1 and 1 inclusive (–1 ≤ r ≤ 1).
2. If all values of either variable are converted to a different scale, the value of r does not change.
3. The value of r is not affected by the choice of x and y. Interchange all x-values and y-values and the value of r will not change.
4. r measures the strength of a linear relationship.
5. r is very sensitive to outliers; they can dramatically affect its value.
page 524 of Elementary Statistics, 10th Edition
If using a graphics calculator for demonstration, it will be an easy exercise to switch the x and y values to show that the value of r will not change.
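
Properties 2, 3 and 5 are easy to check numerically. The sketch below uses the pizza/subway values from the regression Example 1 later in this transcript. The extra point used for property 5 is a made-up outlier, not data from the slides, and `pearson_r` is an illustrative helper written in the deviation-from-mean form (algebraically the same as the sums formula).

```python
import math

def pearson_r(xs, ys):
    """Sample linear correlation coefficient (deviation-from-mean form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

pizza = [0.15, 0.35, 1.00, 1.25, 1.75, 2.00]
subway = [0.15, 0.35, 1.00, 1.35, 1.50, 2.00]

r_original = pearson_r(pizza, subway)
# Property 3: interchanging x and y leaves r unchanged.
r_swapped = pearson_r(subway, pizza)
# Property 2: changing the scale (dollars to cents) leaves r unchanged.
r_in_cents = pearson_r([100 * x for x in pizza], subway)
# Property 5: one hypothetical outlier (3.00, 0.20) changes r dramatically.
r_with_outlier = pearson_r(pizza + [3.00], subway + [0.20])

print(math.isclose(r_original, r_swapped))   # True
print(math.isclose(r_original, r_in_cents))  # True
print(r_with_outlier < r_original)           # True
```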

Example 2 A new medication for high blood pressure was tested on a batch of 18 patients of different ages. The results were as follows: (a) Plot the points (b) Find the correlation coefficient (c) Use a 0.05 significance level to test linear correlation
Patient: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
Age: 56 34 76 33 67 69 22 65 43 23 66 19 84 27 39
Blood Pressure: 194 133 250 71 201 227 230 124 68 219 157 123 222 182 298 113 146

Example 2a Plot the points (data as in Example 2) ● Enter data in StatCrunch ● Go to: Graphics – Scatter Plot ● Select var1 and var2, hit Create Graph!

Example 2b Find Correlation Coefficient (data as in Example 2) ● Go to: Stat – Summary Stats – Correlation ● Select var1 and var2, hit Calculate ● Result: r = 0.964

Example 2c Using Critical Values Test for Correlation (α = 0.05) Given n = 18 and α = 0.05, using Table A-6, the critical values are 0.468 and -0.468. Since r = 0.964 is in the critical region (reject H0), we conclude there is a linear correlation.

Example 2c Using P-value Test for Correlation (α = 0.05) ● Go to: Stat – Summary Stats – Correlation ● Select var1 and var2, hit Next ● Check box, hit Calculate ● Result: r = 0.964, P-value < 0.0001 Since the P-value is less than α = 0.05 (reject H0), we conclude there is a linear correlation.

Example 3 A survey of 15 people was conducted to see how many friends people had on Facebook vs. their shoe size. The results were as follows: (a) Plot the points (b) Find the correlation coefficient (c) Use a 0.05 significance level to test linear correlation
Person: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Friends on FB: 170 680 510 425 85 850 17
Shoe size: 8.4 9.0 9.1 7.8 8.8 8.7 8.5 9.6 8.2 9.3 8.0 9.4

Example 3a Plot the points (data as in Example 3) ● Enter data in StatCrunch ● Go to: Graphics – Scatter Plot ● Select var1 and var2, hit Create Graph!

Example 3b Find Correlation Coefficient (data as in Example 3) ● Go to: Stat – Summary Stats – Correlation ● Select var1 and var2, hit Calculate ● Result: r = 0.409

Example 3c Using Critical Values Test for Correlation (α = 0.05) Given n = 15 and α = 0.05, using Table A-6, the critical values are 0.514 and -0.514. Since r = 0.409 is not in the critical region (fail to reject H0), we conclude there is insufficient evidence of a linear correlation.

Example 3c Using P-value Test for Correlation (α = 0.05) ● Go to: Stat – Summary Stats – Correlation ● Select var1 and var2, hit Next ● Check box, hit Calculate ● Result: r = 0.409, P-value = 0.1297 Since the P-value is greater than α = 0.05 (fail to reject H0), we conclude there is insufficient evidence of a linear correlation.

Interpreting r: Explained Variation The value of r² is the proportion of the variation in y that is explained by the linear relationship between x and y. [Figure: two scatterplots. Low variance: r = 0.998, r² = 0.996. High variance: r = 0.623, r² = 0.388.]

Example 4 Using the data in Example 2 (blood pressure vs. age), we found that the linear correlation coefficient is r = 0.964. What proportion of the variation in the patients’ blood pressure can be explained by the variation in the patients’ age? With r = 0.964, we get r² = 0.929. We conclude that 0.929 (or about 93%) of the variation in blood pressure can be explained by the linear relationship between age and blood pressure. This implies about 7% of the variation in blood pressure cannot be explained by age. page 524 of Elementary Statistics, 10th Edition

Common Errors Involving Correlation 1. Causation: It is wrong to conclude that correlation implies causality. 2. Linearity: There may be some relationship between x and y even when there is no linear correlation. page 525 of Elementary Statistics, 10th Edition

Caution!!! Know that correlation does not imply causality. There may be correlation without causality. page 525 of Elementary Statistics, 10th Edition

Section 10.3 Regression Objective Given two linearly correlated variables (x and y), find the linear function (equation) that best describes the trend.

Equation of a line Recall that the equation of a line is given by its slope m and y-intercept b: y = m x + b

Regression For a set of data (with variables x and y) that is linearly correlated, we want to find the equation of the line that best describes the trend. This process is called Regression

Definitions x : The predictor variable (also called the explanatory variable or independent variable) y : The response variable (also called the dependent variable) Regression Equation: The equation that algebraically describes the relationship between the two variables. Regression Line: The graph of the regression equation (also called the line of best fit or least squares line).

Definitions Regression Equation: $\hat{y} = b_0 + b_1 x$, where b0 is the y-intercept and b1 is the slope. The Regression Line is the graph of this equation.

Notation for Regression Equation
Population: y-intercept $\beta_0$, slope $\beta_1$, equation $y = \beta_0 + \beta_1 x$
Sample: y-intercept $b_0$, slope $b_1$, equation $\hat{y} = b_0 + b_1 x$

Requirements 1. The sample of paired (x, y) data is a random sample of quantitative data. 2. Visual examination of the scatterplot shows that the points approximate a straight-line pattern. 3. Any outliers must be removed if they are known to be errors. Consider the effects of any outliers that are not known errors.

Rounding b0 and b1 Round to three significant digits If you use the formulas from the book, do not round intermediate values. page 543 of Elementary Statistics, 10th Edition

Example 1 Refer to the sample data given in Table 10-1 in the Chapter Problem. Find the equation of the regression line in which the explanatory variable (x-variable) is the cost of a slice of pizza and the response variable (y-variable) is the corresponding cost of a subway fare. (CPI=Consumer Price Index, not used)

Example 1 1. Enter data in StatCrunch (columns) x : 0.15 0.35 1.00 1.25 1.75 2.00 y : 0.15 0.35 1.00 1.35 1.50 2.00

Example 1 2. Stat – Regression – Simple Linear

Example 1 3. Select var1 and var2 (i.e. x and y values). Click Calculate.

Example 1 x : 0.15 0.35 1.00 1.25 1.75 2.00 y : 0.15 0.35 1.00 1.35 1.50 2.00 b0 = 0.0345 b1 = 0.945 Regression Equation y = (0.0345) + (0.945)x

Example 1 y = (0.0345) + (0.945)x Regression Equation page 543 of Elementary Statistics, 10th Edition.
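
For readers without StatCrunch, the coefficients can be reproduced with the usual least-squares formulas, b1 = [n Σxy – (Σx)(Σy)] / [n Σx² – (Σx)²] and b0 = ȳ – b1·x̄. These formulas are not spelled out on the slides, so treat the sketch below as an assumption about what the software computes; `least_squares_line` is an illustrative name.

```python
def least_squares_line(xs, ys):
    """Return (b0, b1) for the least-squares line y = b0 + b1 * x."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # The least-squares line passes through the point (x-bar, y-bar).
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1

# Pizza cost (x) and subway fare (y) from Example 1.
pizza = [0.15, 0.35, 1.00, 1.25, 1.75, 2.00]
subway = [0.15, 0.35, 1.00, 1.35, 1.50, 2.00]

b0, b1 = least_squares_line(pizza, subway)
print(f"y = {b0:.4f} + {b1:.3f} x")
```

The result should agree with the slide's b0 = 0.0345 and b1 = 0.945 up to rounding in the last digit.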

Using the Regression Equation for Predictions 1. The predicted value of y is $\hat{y} = b_0 + b_1 x$. 2. Use the regression equation for predictions only if the graph of the regression line on the scatterplot confirms that the regression line fits the points reasonably well. 3. Use the regression equation for predictions only if the linear correlation coefficient r indicates that there is a linear correlation between the two variables.

Using the Regression Equation for Predictions 4. Use the regression line for predictions only if the value of x does not go much beyond the scope of the available sample data. Predicting too far beyond the scope of the available sample data is called extrapolation, and it could result in bad predictions. 5. If the regression equation does not appear to be useful for making predictions, the best predicted value of a variable is its point estimate, which is its sample mean ($\bar{y}$).

Using the Regression Equation for Predictions Source: www.xkcd.com

Strategy for Predicting Values of Y page 546 of Elementary Statistics, 10th Edition

Using the Regression Equation for Predictions If the regression equation is not a good model, the best predicted value of y is simply $\bar{y}$ (the mean of the y values). Remember, this strategy applies to linear patterns of points in a scatterplot.
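
The prediction strategy can be summarized as a small Python sketch. It uses b0 and b1 from Example 1 and assumes a linear correlation has already been established for that data. The function name, the input x = 1.50 and the choice to warn on any x outside the sampled range are illustrative simplifications (the slides only say x should not go "much beyond" the sample data).

```python
import warnings

def best_predicted_y(x, b0, b1, y_mean, linear_correlation, x_min, x_max):
    """Best predicted y for a given x, following the strategy on the slides."""
    if not linear_correlation:
        # Regression equation is not a good model: use the sample mean of y.
        return y_mean
    if x < x_min or x > x_max:
        # Simplification: warn whenever x is outside the sampled range.
        warnings.warn("x lies outside the sample data; this is extrapolation")
    return b0 + b1 * x

# Values from Example 1 (pizza cost x, subway fare y).
B0, B1 = 0.0345, 0.945
X_MIN, X_MAX = 0.15, 2.00
Y_MEAN = sum([0.15, 0.35, 1.00, 1.35, 1.50, 2.00]) / 6

# Hypothetical input: a slice of pizza costing 1.50.
prediction = best_predicted_y(1.50, B0, B1, Y_MEAN, True, X_MIN, X_MAX)
print(round(prediction, 2))  # prints 1.45

# Without a linear correlation, the best prediction is the mean of y.
fallback = best_predicted_y(1.50, B0, B1, Y_MEAN, False, X_MIN, X_MAX)
```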

Definition For a pair of sample x and y values, the residual is the difference between the observed sample value of y and the y-value that is predicted by using the regression equation. That is, Residual = (observed y) – (predicted y) = $y - \hat{y}$

Residuals

Definition A straight line satisfies the least-squares property if the sum of the squares of the residuals is the smallest sum possible. The best possible regression line satisfies this property (which is why it is also called the least squares line). pages 549-550 of Elementary Statistics, 10th Edition

Least Squares Property sum = (-5)² + 11² + (-13)² + 7² = 364 (any other line would yield a sum larger than 364)
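
A short Python sketch can confirm the arithmetic above and illustrate the least-squares property. The data points behind the slide's four residuals are not given in this transcript, so the second half of the sketch uses the pizza/subway data from Example 1 instead. The nudged lines are hypothetical alternatives chosen only for comparison, and the helper names are made up.

```python
def fit_line(xs, ys):
    """Least-squares intercept b0 and slope b1 (deviation-from-mean form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - b1 * mean_x, b1

def sum_squared_residuals(xs, ys, b0, b1):
    """Sum of (observed y - predicted y)^2 for the line y = b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# The four residuals listed on the slide.
slide_residuals = [-5, 11, -13, 7]
slide_sum = sum(e ** 2 for e in slide_residuals)
print(slide_sum)  # prints 364

# Pizza cost (x) and subway fare (y) from Example 1.
pizza = [0.15, 0.35, 1.00, 1.25, 1.75, 2.00]
subway = [0.15, 0.35, 1.00, 1.35, 1.50, 2.00]
b0, b1 = fit_line(pizza, subway)
best_sse = sum_squared_residuals(pizza, subway, b0, b1)

# Hypothetical nudges to the fitted line; each should give a larger sum.
nudges = [(0.05, 0.0), (0.0, 0.05), (-0.05, -0.05)]
other_sses = [sum_squared_residuals(pizza, subway, b0 + d0, b1 + d1)
              for d0, d1 in nudges]
print(all(best_sse < sse for sse in other_sses))  # prints True
```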