# 7.1 Seeking Correlation

LEARNING GOAL Be able to define correlation, recognize positive and negative correlations on scatter diagrams, and understand the correlation coefficient as a measure of the strength of a correlation.

Statistical Thinking Thought Question 1 For all cars manufactured in the U.S., there is a positive correlation between the size of the engine and horsepower. There is a negative correlation between the size of the engine and gas mileage. What does it mean for two variables to have a positive correlation or a negative correlation? Chapter 14

Types of Correlation
- Positive linear correlation: both variables tend to increase (or decrease) together.
- Negative linear correlation: one variable tends to decrease as the other increases.
- No correlation: there is no apparent linear relationship between the variables.
- Nonlinear relationship: the variables are related, but not in a straight-line pattern.

Thought Question 2
What type of correlation would each of the following pairs of variables have: positive, negative, or none?
- Temperature during the summer and electricity bills
- Temperature during the winter and heating costs
- Number of years of education and height
- Frequency of brushing and number of cavities
- Number of churches and number of bars in a city
- Height of husband and height of wife

Scatter Diagram or Scatterplot
A graph in which each point represents the values of two variables. Always plot the explanatory variable (independent variable) on the horizontal axis. Always plot the response variable (dependent variable) on the vertical axis. If there is no explanatory/response distinction, either variable can go on the horizontal axis.

Is there a correlation between car weight and fuel consumption?
Draw a scatterplot.

| Car weight (lb) | Fuel consumption (mpg) |
|---|---|
| 3175 | 27 |
| 3450 | 29 |
| 3225 | |
| 3985 | 24 |
| 2440 | 37 |
| 2500 | 34 |
| 2290 | |

Measuring the Strength of a Correlation
The strength of a correlation is measured with a number called the correlation coefficient, represented by the letter r.

Properties of the Correlation Coefficient, r
- The value of r always lies between -1 and 1.
- If there is no correlation, the value of r is close to 0.
- If there is a positive correlation, r is positive. The closer r is to 1, the stronger the correlation.
- If there is a negative correlation, r is negative. The closer r is to -1, the stronger the correlation.
- If r = 1, there is a perfect positive correlation. If r = -1, there is a perfect negative correlation.
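
As a rough illustration of how r can be computed, the sketch below applies the standard Pearson formula to the five cars from the weight and fuel consumption table that have both values listed. The two rows with a missing mpg value are left out, so this is not the r for the full data set. The function name `pearson_r` is an assumption made for this example and does not come from the slides.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Only the rows with both values present in the transcript's table.
weights = [3175, 3450, 3985, 2440, 2500]  # lb
mpg = [27, 29, 24, 37, 34]                # miles per gallon

r = pearson_r(weights, mpg)
print(f"r = {r:.3f}")  # negative: heavier cars tend to get fewer miles per gallon
```

A value of r close to -1 here matches the properties listed above: the relationship is negative and, for these five cars, strong.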

7.2 Interpreting Correlations
LEARNING GOAL Be aware of important cautions concerning the interpretation of correlations, especially the effects of outliers, the effects of grouping data, and the crucial fact that correlation does not necessarily imply causality.

Beware of Outliers
Consider the two scatterplots below. How does the outlier impact the correlation for each plot? Does the outlier increase the correlation, decrease the correlation, or have no impact?

If the outlier is included, r = 0.880
If the outlier is removed, r = 0
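
The data behind the slide's values (r = 0.880 and r = 0) are not reproduced in this transcript, so the sketch below uses small hypothetical data to show the same effect: a cluster with no linear trend, plus one far-away point. The helper `pearson_r` and all data values are assumptions made for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: four points on the corners of a square (no linear trend).
xs = [1, 1, 2, 2]
ys = [1, 2, 1, 2]
r_without = pearson_r(xs, ys)

# Add a single outlier far from the cluster and recompute.
r_with = pearson_r(xs + [10], ys + [10])

print(f"without outlier: r = {r_without:.3f}")
print(f"with outlier:    r = {r_with:.3f}")
```

In this made-up example a single point is enough to turn no correlation into an apparently strong one, which is why outliers deserve a careful look.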

What should we do with outliers?
If the outliers are mistakes in the data set, they may produce apparent correlations that are not real, or they may mask real correlations. If the outliers represent correct data points, they may help us to see relationships. Examine outliers carefully, but do not remove them unless there is strong reason to believe they do not belong.

Beware of Inappropriate Grouping
Scatterplot of heights versus weights of males and females: r = 0.545

Separate the previous data into males and females
Male height versus weight data: r = 0.522
Female height versus weight data: r = 0.366
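
The height and weight data behind these values are not included in the transcript. The sketch below uses hypothetical numbers, chosen to exaggerate the effect, to show how pooling two groups with different averages can produce a correlation that neither group shows on its own. The helper `pearson_r` and every data value are assumptions.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical (height in inches, weight in pounds) data for two groups.
group_a_height = [60, 62, 64, 66]
group_a_weight = [120, 110, 125, 115]
group_b_height = [68, 70, 72, 74]
group_b_weight = [170, 160, 175, 165]

r_a = pearson_r(group_a_height, group_a_weight)
r_b = pearson_r(group_b_height, group_b_weight)
r_pooled = pearson_r(group_a_height + group_b_height,
                     group_a_weight + group_b_weight)

print(f"group A: r = {r_a:.3f}")
print(f"group B: r = {r_b:.3f}")
print(f"pooled:  r = {r_pooled:.3f}")
```

Here the pooled correlation mostly reflects the difference between the two groups rather than a relationship within either one.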

Correlation does not imply causality.
Possible explanations for a correlation:
1. The correlation may be a coincidence.
2. Both correlated variables might be directly influenced by some common underlying cause.
3. One of the correlated variables may actually be a cause of the other. Even then, it may be just one of several causes.

For each pair of variables below:
a. State the correlation clearly.
b. Is the correlation due to coincidence, a common underlying cause, or a direct cause? Explain.

- The time spent in recreation on weekends and scores on a Monday exam.
- The outside temperature and the amount of ice cream sold.
- The number of wins of a basketball team and the number of spectators.
- The weight of a person and the time spent reading.

7.3 Best-Fit Lines and Prediction
LEARNING GOAL Become familiar with the concept of a best-fit line for a correlation, recognize when such lines have predictive value and when they may not, understand how the square of the correlation coefficient is related to the quality of the fit, and qualitatively understand the use of multiple regression.

Best-Fit Line
The best-fit line on a scatter diagram is a line that lies closer to the data points than any other possible line. It is also called a regression line or least-squares line. It is called a least-squares line because it is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.

Best-Fit Lines
Equation of best-fit line: y = a + bx
- x is the value of the explanatory variable.
- y is the average value of the response variable.
- a and b are just the intercept and slope of a straight line.
- r and b are not the same thing, but their signs will agree.

We will use the regression equation to predict the response variable, y, given the explanatory variable, x. Use software to calculate the regression equation. The Plot link on this slide is to the Correlation & Regression applet found on the VCU Stat 208 website. Chapter 15
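
To make the roles of a and b concrete, here is a minimal sketch of the usual least-squares formulas (slope = sum of cross-products of deviations divided by sum of squared x deviations, with the intercept chosen so the line passes through the point of means). It uses only the five cars whose weight and mpg both appear in the transcript, so its coefficients are not the ones from the slide's software output. The function name `best_fit_line` is an assumption.

```python
def best_fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx             # slope
    a = mean_y - b * mean_x   # intercept: the line passes through (mean_x, mean_y)
    return a, b

# Only the cars with both values listed in the transcript.
weights = [3175, 3450, 3985, 2440, 2500]  # lb
mpg = [27, 29, 24, 37, 34]                # miles per gallon

a, b = best_fit_line(weights, mpg)
print(f"mpg = {a:.2f} + ({b:.5f}) * weight")
```

The slope comes out negative, agreeing in sign with the negative correlation between weight and fuel consumption.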

Car Weight and Fuel Consumption

| Car | Car Weight (lb) | Fuel Consumption (mpg) |
|---|---|---|
| Chrysler Sebring | 3175 | 27 |
| Ford Mustang | 3450 | 29 |
| BMW 3-series | 3225 | |
| Ford Crown Victoria | 3985 | 24 |
| Honda Civic | 2440 | 37 |
| Mazda Protégé | 2500 | 34 |
| Hyundai Accent | 2290 | |

Draw scatterplot and best-fit line.

Use Excel to write equation of best-fit line
Look at the Coefficient column. The equation of the best-fit line is: Y= x or mpg = (car weight in lb.) How many miles per gallon would a 2000 lb. car get? Is it reasonable to make a prediction for a 2000 lb. car? A lb. car? A 4000 lb. car?

Cautions in Making Predictions from Best-Fit Lines
- Best-fit lines only give a good prediction when the correlation is strong and there are many data points.
- Only use the best-fit line to make predictions within the bounds of the data points used.
- A best-fit line based on past data is not necessarily valid now or in the future.
- Don't make predictions about a population different from the one from which the data were drawn.
- A best-fit line is meaningless when there is no significant correlation or the relationship is nonlinear.
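
One of these cautions, staying within the bounds of the data, is easy to enforce in code. The sketch below is one possible way to do it, assuming the five complete car rows from the transcript; the function names `best_fit_line` and `predict_within_range` are hypothetical, and the other cautions (strength of correlation, sample size, population) still need human judgment.

```python
def best_fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b

def predict_within_range(x, xs, ys):
    """Predict y at x, or return None if x lies outside the observed x range."""
    if not (min(xs) <= x <= max(xs)):
        return None  # refuse to extrapolate beyond the data
    a, b = best_fit_line(xs, ys)
    return a + b * x

# Only the cars with both values listed in the transcript.
weights = [3175, 3450, 3985, 2440, 2500]  # lb
mpg = [27, 29, 24, 37, 34]                # miles per gallon

for w in (2000, 3000, 4000):
    print(w, predict_within_range(w, weights, mpg))
```

With these rows, 2000 lb and 4000 lb fall outside the observed weights, so no prediction is returned for them.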

The Correlation Coefficient and Best-Fit Lines
Best-Fit Lines and r²
The square of the correlation coefficient, or r², is the proportion of the variation in a variable that is accounted for by the best-fit line.
Page 311 Copyright © 2009 Pearson Education, Inc.
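
The following sketch checks this interpretation numerically: for a least-squares line, the share of variation explained, 1 - SSE/SST, equals the square of r. It again uses only the five cars with complete rows in the transcript; the variable names are assumptions made for the example.

```python
import math

# Only the cars with both values listed in the transcript.
weights = [3175, 3450, 3985, 2440, 2500]  # lb
mpg = [27, 29, 24, 37, 34]                # miles per gallon

n = len(weights)
mean_x = sum(weights) / n
mean_y = sum(mpg) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(weights, mpg))
sxx = sum((x - mean_x) ** 2 for x in weights)
syy = sum((y - mean_y) ** 2 for y in mpg)  # total variation in y (SST)

# Least-squares line y = a + b*x and correlation coefficient r.
b = sxy / sxx
a = mean_y - b * mean_x
r = sxy / math.sqrt(sxx * syy)

# Variation left over after fitting the line (SSE).
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(weights, mpg))
explained = 1 - sse / syy

print(f"r^2         = {r ** 2:.3f}")
print(f"1 - SSE/SST = {explained:.3f}")
```

The two printed values agree, which is the sense in which r² measures the quality of the fit.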

Correlation between your bill and the tip you leave
Bill: \$10.15 \$25.36 \$7.38 \$43.78 \$55.89 \$11.17 \$33.89 \$26.17 \$21.18

Tip: \$2.00 \$3.50 \$1.50 \$6.00 \$9.00 \$5.50 \$4.00

r squared = 0.933

93.3% of the variation in the tip can be explained by the cost of the bill. What explains the other 6.7%? What should the slope of the best-fit line be? The equation of the best-fit line is y= x What does the line tell you about how people tip?