Regression and Correlation

Scatter Diagram a plot of paired data to determine or show a relationship between two variables

When using x values to predict y values: Call x the explanatory variable Call y the response variable

Paired Data

Scatter Diagram

Questions Arising Can we find a relationship between x and y? How strong is the relationship?

When there appears to be a linear relationship between x and y: attempt to “fit” a line to the scatter diagram.

Linear Correlation The general trend of the points seems to follow a straight line segment.

Linear Correlation

Non-Linear Correlation

No Linear Correlation

High Linear Correlation Points lie close to a straight line.

High Linear Correlation

Moderate Linear Correlation

Low Linear Correlation

Perfect Linear Correlation

The Sample Correlation Coefficient, r A measurement of the strength of the linear association between two variables Also called the Pearson product-moment correlation coefficient

Positive Linear Correlation High values of x are paired with high values of y and low values of x are paired with low values of y.

Negative Linear Correlation High values of x are paired with low values of y and low values of x are paired with high values of y.

Little or No Linear Correlation Both high and low values of x are sometimes paired with high values of y and sometimes with low values of y.

Positive Correlation

Negative Correlation

Little or No Linear Correlation

What type of correlation is expected? Height and weight Mileage on tires and remaining tread IQ and height Years of driving experience and insurance rates

Calculating the Correlation Coefficient, r

Linear correlation coefficient: r = SSxy / √(SSx · SSy)

If r = 0, the scatter diagram shows no linear pattern.

If r = +1, all points lie on the least squares line.

If r = –1, all points lie on the least squares line.

–1 < r < 0

0 < r < 1

To Compute r: Complete a table with columns listing x, y, x², y², and xy. Compute SSxy, SSx, and SSy. Use the formula: r = SSxy / √(SSx · SSy)
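The computation above can be sketched in Python. This is a minimal illustration; the data here are hypothetical, not the presentation's dataset:

```python
import math

def correlation(xs, ys):
    """Pearson correlation via the SSxy, SSx, SSy formulas from the slides."""
    n = len(xs)
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    ss_x = sum(x * x for x in xs) - sum(xs) ** 2 / n
    ss_y = sum(y * y for y in ys) - sum(ys) ** 2 / n
    return ss_xy / math.sqrt(ss_x * ss_y)

# Hypothetical paired data:
r = correlation([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(r, 4))  # → 0.7746
```

Note that r² from this function also gives the coefficient of determination discussed later.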

Find the Correlation Coefficient

Calculations:

The Correlation Coefficient: for the example data, r ≈ 0.9754

Warning The correlation coefficient (r) measures the strength of the linear relationship between two variables. Just because two variables are related does not imply that there is a cause-and-effect relationship between them.

The Least Squares Line (Least Squares Criterion): the line for which the sum of the squares of the vertical distances from the points to the line is as small as possible.

Equation of the Least Squares Line: yp = a + bx, where a = the y-intercept and b = the slope

Finding the slope: b = SSxy / SSx

Finding the y-intercept: a = ȳ – b·x̄

A relationship between the correlation coefficient, r, and the slope, b, of the least squares line: b = r · √(SSy / SSx)
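The slope and intercept formulas can be sketched as follows (hypothetical data, not the slides' mileage example):

```python
def fit_line(xs, ys):
    """Least squares line: b = SSxy / SSx, a = ȳ − b·x̄."""
    n = len(xs)
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    ss_x = sum(x * x for x in xs) - sum(xs) ** 2 / n
    b = ss_xy / ss_x
    a = sum(ys) / n - b * sum(xs) / n
    return a, b

# Hypothetical data lying exactly on y = 1 + 2x:
a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)  # → 1.0 2.0
```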

Find the Least Squares Line

Finding the slope: b = 1.7

Finding the y-intercept: a = 2.8

The equation of the least squares line is: yp = a + bx = 2.8 + 1.7x

Graphing the least squares line Using two values in the range of x, compute two corresponding y values. Plot these points. Join the points with a straight line.

The following point will always be on the least squares line: (x̄, ȳ), the point whose coordinates are the means of the x and y values.

Graphing y = 2.8 + 1.7x Use (8.3, 16.9), the point (average of the x's, average of the y's). Try x = 5. Compute y: y = 2.8 + 1.7(5) = 11.3

Sketching the Line Using the Points (8.3, 16.9) and (5, 11.3)

Using the Equation of the Least Squares Line to Make Predictions Choose a value for x (within the range of x values). Substitute the selected x in the least squares equation. Determine corresponding value of y.

Predict the time to make a trip of 14 miles Equation of least squares line: y = 2.8 + 1.7x Substitute x = 14: y = 2.8 + 1.7 (14) y = 26.6 According to the least squares equation, a trip of 14 miles would take 26.6 minutes.

Interpolation Using the least squares line to predict values for x values that fall between the points in the scatter diagram

Extrapolation Prediction beyond the range of observations

A statistic related to r: the coefficient of determination, r²

Coefficient of Determination a measure of the proportion of the variation in y that is explained by the regression line using x as the predicting variable

Formula for Coefficient of Determination: r² = (explained variation in y) / (total variation in y)

Interpretation of r² If r = 0.9753643, what percent of the variation in minutes (y) is explained by the linear relationship with x, miles traveled? What percent is explained by other causes?

Interpretation of r² If r = 0.9753643, then r² = 0.9513355. Approximately 95 percent of the variation in minutes (y) is explained by the linear relationship with x, miles traveled. Less than five percent is explained by other causes.

Testing the Correlation Coefficient Determining whether a value of the sample correlation coefficient, r, is far enough from zero to indicate correlation in the population.

The Population Correlation Coefficient ρ (the Greek letter "rho")

Hypotheses to Test Rho Assume that both variables x and y are normally distributed. To test whether the (x, y) values are correlated in the population, set up the null hypothesis that they are not correlated: H0: ρ = 0.

If you believe ρ is positive, use a right-tailed test: H1: ρ > 0

If you believe ρ is negative, use a left-tailed test: H0: ρ = 0, H1: ρ < 0

If you believe ρ is not equal to zero, use a two-tailed test: H0: ρ = 0, H1: ρ ≠ 0

Convert r to a Student's t Distribution: t = r·√(n – 2) / √(1 – r²), with d.f. = n – 2

A researcher wishes to determine (at 5% level of significance) if there is a positive correlation between x, the number of hours per week a child watches television and y, the cholesterol measurement for the child. Assume that both x and y are normally distributed.

Correlation Between Hours of Television and Cholesterol Suppose that a sample of x and y values for 25 children showed the correlation coefficient, r, to be 0.42. Use a right-tailed test. The null hypothesis: H0: ρ = 0. The alternate hypothesis: H1: ρ > 0. α = 0.05

Convert the sample statistic r = 0.42 to t using n = 25: t = 0.42·√23 / √(1 – 0.42²) ≈ 2.22

Find the critical t value for a right-tailed test with α = 0.05: from Table 6 with d.f. = 25 – 2 = 23, t = 1.714. Since the sample t = 2.22 > 1.714, reject the null hypothesis and conclude that there is a positive correlation between the variables.

P-Value Approach Use Table 6 in Appendix II with d.f. = 23. Our t value of 2.22 falls between 2.069 and 2.500, which places P between 0.010 and 0.025. Since we would reject H0 for any α ≥ P, we reject H0 for α = 0.05 and conclude that there is a positive correlation between the variables.
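The test statistic above, using the slides' own numbers (r = 0.42, n = 25), can be sketched as:

```python
import math

# Test statistic for H0: rho = 0, from the slides' example.
r, n = 0.42, 25
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
print(round(t, 2))     # → 2.22, matching the slides

t_critical = 1.714     # Table 6, right tail, alpha = 0.05, d.f. = 23
print(t > t_critical)  # → True: reject H0
```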

Conclusion We conclude that there is a positive correlation between the number of hours spent watching television and the cholesterol measurement.

Note Even though a significance test indicates the existence of a correlation between x and y in the population, it does not signify a cause-and-effect relationship.

Standard Error of Estimate A method for measuring the spread of a set of points about the least squares line

The Residual: y – yp, the difference between the y value of a data point on the scatter diagram and the y value of the point on the least-squares line with the same x value.

Standard Error of Estimate: Se = √( Σ(y – yp)² / (n – 2) )

Standard Error of Estimate The number of points must be greater than or equal to three. If n = 2, the line is a perfect fit and there is no need to compute Se. The nearer the points are to the least squares line, the smaller Se will be. The larger Se is, the more scattered the points are.

Calculating Formula for Se: Se = √( (Σy² – a·Σy – b·Σxy) / (n – 2) )

Calculating Formula for Se Use caution in rounding. Uses quantities also used to determine the least squares line.
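The definition of Se can be sketched in Python using the defining residual formula (hypothetical data, not the slides' dataset):

```python
import math

def standard_error_of_estimate(xs, ys):
    """Fit y = a + b*x by least squares, then Se = sqrt(SSE / (n - 2))."""
    n = len(xs)
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    ss_x = sum(x * x for x in xs) - sum(xs) ** 2 / n
    b = ss_xy / ss_x
    a = sum(ys) / n - b * sum(xs) / n
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # sum of squared residuals
    return math.sqrt(sse / (n - 2))

# Hypothetical data:
se = standard_error_of_estimate([1, 2, 3, 4], [3, 5, 8, 9])
print(round(se, 4))  # → 0.5916
```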

Finding the Standard Error of Estimate

For the mileage example, the calculations give Se ≈ 1.85.

Confidence Interval for y The least squares line gives a predicted y value, yp, for a given x. The least squares line estimates the true y value, which is given by: y = α + βx + ε, where α = the y-intercept, β = the slope, and ε = random error

For a Specific x, a c Confidence Interval for y: yp – E < y < yp + E, where E = tc · Se · √(1 + 1/n + (x – x̄)² / SSx) and tc is the critical value for confidence level c with d.f. = n – 2

Find a 95% confidence interval for the number of minutes for a trip of eight miles

The least squares line and prediction, yp: yp = a + bx = 2.8 + 1.7x. For x = 8, yp = 2.8 + 1.7(8) = 16.4

Finding E for x = 8: E = 2.571 · 1.85 · √(1 + 1/7 + (8 – 8.3)² / 115.4) ≈ 5.1

For x = 8, the 95% confidence interval for y: 16.4 – 5.1 < y < 16.4 + 5.1

For x = 8 miles, we are 95% sure that the trip will take between 11.3 and 21.5 minutes.
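The interval above can be reproduced from the values given in the slides (yp = 16.4, Se ≈ 1.85, n = 7, x̄ = 8.3, SSx ≈ 115.4, tc = 2.571 for d.f. = 5):

```python
import math

# Prediction interval for the slides' example: a trip of x = 8 miles.
yp = 2.8 + 1.7 * 8                  # predicted minutes, 16.4
t_c, se, n, x_bar, ss_x = 2.571, 1.85, 7, 8.3, 115.4
E = t_c * se * math.sqrt(1 + 1 / n + (8 - x_bar) ** 2 / ss_x)
print(round(yp - E, 1), round(yp + E, 1))  # → 11.3 21.5, matching the slides
```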

Confidence Interval for y at a Specific x The value of E increases as x is chosen further from the mean of the x values, so the confidence interval for y becomes wider for values of x further from the mean.

Try not to use the least squares line to predict y values for x values beyond the data extremes of the sample x distribution.

Testing the Slope β = the slope of the population-based least squares line. b = the slope of the sample-based least squares line.

To test the slope: Use H0: the population slope is zero, β = 0. H1 may be β > 0, β < 0, or β ≠ 0. Convert b to a Student's t distribution: t = b / (Se / √SSx), with d.f. = n – 2

Standard Error for b: Sb = Se / √SSx

Test the Slope

We have: The least squares line: y = 2.8 + 1.7x. Slope = b = 1.7. Se ≈ 1.85. SSx ≈ 115.4. We suspect the slope β is positive.

Hypothesis Test H0: β = 0, H1: β > 0. Use a 1% level of significance. Convert the sample test statistic b = 1.7 to a t value: t = 1.7 / (1.85 / √115.4) ≈ 9.87

t value For d.f. = 7 – 2 = 5 and α = 0.01, the critical value of t is 3.365. From Table 6, we note that P < 0.005. Since we would reject H0 for any α ≥ P, we reject H0 for α = 0.01. We conclude that β is positive.
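The t value for the slope test can be checked with the slides' values (b = 1.7, Se ≈ 1.85, SSx ≈ 115.4):

```python
import math

# Test H0: beta = 0 vs H1: beta > 0 for the mileage example.
b, se, ss_x = 1.7, 1.85, 115.4
t = b / (se / math.sqrt(ss_x))
print(round(t, 2))  # → 9.87, well above the critical value 3.365 at alpha = 0.01
```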

Confidence Intervals for the Slope β We wish to estimate the slope of the population-based least squares line.

Confidence Intervals for the Slope β = the slope of the population-based least squares line. b = the slope of the sample-based least squares line.

To determine a confidence interval for β, convert b to a Student's t distribution: t = (b – β) / (Se / √SSx), with d.f. = n – 2

A c Confidence Interval for β: b – E < β < b + E, where E = tc · Se / √SSx

Find a 95% Confidence Interval for β

We have: The least squares line: y = 2.8 + 1.7x. Slope = b = 1.7. Se ≈ 1.85. SSx ≈ 115.4. c = 95% = 0.95. d.f. = n – 2 = 7 – 2 = 5. t0.95 = 2.571

b – E < β < b + E: E = 2.571(1.85) / √115.4 ≈ 0.44, so 1.7 – 0.44 < β < 1.7 + 0.44

Conclusion: We are 95% confident that the true slope of the regression line is between 1.26 and 2.14.
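The interval can be checked with the slides' values:

```python
import math

# 95% confidence interval for beta: b = 1.7, Se ≈ 1.85, SSx ≈ 115.4,
# t_c = 2.571 for c = 0.95 and d.f. = 5.
b, se, ss_x, t_c = 1.7, 1.85, 115.4, 2.571
E = t_c * se / math.sqrt(ss_x)
print(round(b - E, 2), round(b + E, 2))  # → 1.26 2.14, matching the slides
```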

Multiple Regression More than a single explanatory variable is used in the computation of predictions.

Common formula for linear relationships among more than two variables: y = b0 + b1x1 + b2x2 + … + bkxk y = response variable x1 , x2 , … , xk = explanatory variables, variables on which predictions will be based b0 , b1, b2, … , bk = coefficients obtained from least squares criterion

Multiple regression models are analyzed by computer programs such as: Minitab Excel SPSS
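A minimal sketch of fitting a multiple regression by least squares, here using NumPy's lstsq on hypothetical noise-free data (the variable names are illustrative, not from the slides):

```python
import numpy as np

# Hypothetical data: y depends on two explanatory variables, x1 and x2.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y = 1.0 + 2.0 * x1 + 3.0 * x2            # exact plane y = b0 + b1*x1 + b2*x2

# Design matrix with a leading column of ones for the intercept b0.
X = np.column_stack([np.ones_like(x1), x1, x2])
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(coeffs, 6))               # recovers [1. 2. 3.]
```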

Regression Model A collection of random variables with a number of properties

Properties of a Regression Model One variable is identified as response variable. All other variables are explanatory variables. For any application there will be a collection of numerical values for each variable.

Properties of a Regression Model Using the numerical data values and the least squares criterion, the least-squares equation (regression equation) can be constructed. The model usually includes a measure of "goodness of fit" of the regression equation to the data values.

Properties of a Regression Model Allows us to supply given values of the explanatory variables in order to predict the corresponding value of the response variable. A c confidence interval can be constructed for the predicted value.

“Goodness of Fit” of Least-Squares Regression Equation May be measured by the coefficient of multiple determination, r²