7.1 Seeking Correlation LEARNING GOAL Be able to define correlation, recognize positive and negative correlations on scatter diagrams, and understand the correlation coefficient as a measure of the strength of a correlation.
Thought Question 1 For all cars manufactured in the U.S., there is a positive correlation between the size of the engine and horsepower. There is a negative correlation between the size of the engine and gas mileage. What does it mean for two variables to have a positive correlation or a negative correlation?
Types of Correlation
Positive linear correlation: Both variables tend to increase (or decrease) together.
Negative linear correlation: One variable tends to decrease as the other increases.
No correlation: There is no apparent linear relationship between the variables.
Nonlinear relationship: The variables are related, but not in a straight-line pattern.
Thought Question 2 What type of correlation would the following pairs of variables have: positive, negative, or none?
1. Temperature during the summer and electricity bills
2. Temperature during the winter and heating costs
3. Number of years of education and height
4. Frequency of brushing and number of cavities
5. Number of churches and number of bars in a city
6. Height of husband and height of wife
Scatter Diagram or Scatterplot A graph in which each point represents the values of two variables. Always plot the explanatory variable (independent variable) on the horizontal axis. Always plot the response variable (dependent variable) on the vertical axis. If there is no explanatory/response distinction either variable can go on the horizontal axis.
Is there a correlation between car weight and fuel consumption? Draw a scatterplot. Car weight (lb) Fuel consumption (mpg)
Measuring the Strength of a Correlation The strength of a correlation is measured with a number called the correlation coefficient, represented by the letter r.
Properties of the Correlation Coefficient, r
The value of r always lies between -1 and 1.
If there is no correlation, the value of r is close to 0.
If there is positive correlation, r is positive. The closer r is to 1, the stronger the correlation.
If there is negative correlation, r is negative. The closer r is to -1, the stronger the correlation.
If r = 1, there is perfect positive correlation.
If r = -1, there is perfect negative correlation.
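The properties above can be checked numerically. The following is a minimal sketch that computes the standard Pearson correlation coefficient by hand; the function name `pearson_r` and the three small data sets are made up for illustration and do not come from the slides.

```python
import math

def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient r for paired data."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Made-up illustrative data (not from the slides)
x = [1, 2, 3, 4, 5]
r_pos = pearson_r(x, [2, 4, 6, 8, 10])   # points on a rising line
r_neg = pearson_r(x, [10, 8, 6, 4, 2])   # points on a falling line
r_none = pearson_r(x, [3, 1, 4, 1, 3])   # no linear trend

print(r_pos, r_neg, r_none)
```

The first two data sets lie exactly on a straight line, so r comes out as 1 and -1; the third has no linear trend, so r is 0 (up to rounding).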
7.2 Interpreting Correlations LEARNING GOAL Be aware of important cautions concerning the interpretation of correlations, especially the effects of outliers, the effects of grouping data, and the crucial fact that correlation does not necessarily imply causality.
Beware of Outliers Consider the two scatterplots below. How does the outlier impact the correlation for each plot? – Does the outlier increase the correlation, decrease the correlation, or have no impact?
If the outlier is included, r = If the outlier is removed, r = 0
What should we do with outliers? If the outliers are mistakes in the data set, they produce apparent correlations that are not real or may mask the presence of real correlations. If the outliers represent correct data points, they may help us to see relationships. Examine outliers carefully, but do not remove them unless we have strong reason to believe they do not belong.
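To see how much a single point can matter, here is a small sketch using made-up data (not the data behind the slides' scatterplots): four points with no linear trend, plus one distant outlier. The helper `pearson_r` is a hand-written implementation of the usual formula for r.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical cluster with no linear relationship
cluster_x = [1, 1, 2, 2]
cluster_y = [1, 2, 1, 2]

# The same cluster plus one outlier far from the rest
outlier = (10, 10)
with_x = cluster_x + [outlier[0]]
with_y = cluster_y + [outlier[1]]

r_without = pearson_r(cluster_x, cluster_y)
r_with = pearson_r(with_x, with_y)

print("r without outlier:", round(r_without, 3))
print("r with outlier:   ", round(r_with, 3))
```

In this made-up case the outlier alone produces a strong apparent correlation; an outlier placed elsewhere could just as easily weaken a real one.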
Beware of Inappropriate Grouping Scatterplot of heights versus weights of males and females r = 0.545
Separate the previous data into males and females Male height versus weight data r = Female height versus weight data r = 0.366
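The grouping effect can be reproduced with a small sketch. The two groups below are invented numbers in arbitrary units, not the height and weight data from the slides: within each group there is no linear relationship, yet pooling them yields a strong correlation because one group sits higher on both variables.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical groups (arbitrary units); each has r = 0 on its own
group_a_x, group_a_y = [1, 1, 2, 2], [1, 2, 1, 2]
group_b_x, group_b_y = [5, 5, 6, 6], [5, 6, 5, 6]

r_a = pearson_r(group_a_x, group_a_y)
r_b = pearson_r(group_b_x, group_b_y)
# Pooling the groups mixes the between-group difference into r
r_pooled = pearson_r(group_a_x + group_b_x, group_a_y + group_b_y)

print(round(r_a, 3), round(r_b, 3), round(r_pooled, 3))
```

The pooled r mostly reflects the difference between the groups, not a relationship that holds within either group.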
Correlation does not imply causality. Possible explanations for a correlation:
1. The correlation may be a coincidence.
2. Both correlated variables might be directly influenced by some common underlying cause.
3. One of the correlated variables may actually be a cause of the other. Even then, it may be just one of several causes.
For each pair of variables below: a. State the correlation clearly. b. Is the correlation due to coincidence, a common underlying cause, or a direct cause? Explain.
1. The time spent in recreation on weekends and scores on a Monday exam.
2. The outside temperature and the amount of ice cream sold.
3. The number of wins of a basketball team and the number of spectators.
4. The weight of a person and the time spent reading.
7.3 Best-Fit Lines and Prediction LEARNING GOAL Become familiar with the concept of a best-fit line for a correlation, recognize when such lines have predictive value and when they may not, understand how the square of the correlation coefficient is related to the quality of the fit, and qualitatively understand the use of multiple regression.
Best-fit line The best-fit line on a scatter diagram is a line that lies closer to the data points than any other possible line. Also called a regression line or least-squares line. It is called a least squares line because it is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.
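The least-squares idea can be written out explicitly. As a sketch in standard notation (the symbols n, x_i, y_i, and the bars for means are assumptions, not notation from the slides), for a line with intercept a and slope b the quantity being minimized, and the values that minimize it, are:

```latex
S(a, b) = \sum_{i=1}^{n} \bigl( y_i - (a + b x_i) \bigr)^2

b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
a = \bar{y} - b \bar{x}
```

The numerator of b is the same sum of products that appears in the numerator of the correlation coefficient r, and both denominators are positive, which is why the slope and r always share the same sign.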
Best-Fit Lines Equation of best-fit line: y = a + bx
– x is the value of the explanatory variable
– y is the predicted (average) value of the response variable
– note that a and b are just the intercept and slope of a straight line
– note that r and b are not the same thing, but their signs will agree.
– We will use the regression equation to predict the response variable, y, given the explanatory variable, x.
– Use software to calculate the regression equation.
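As one way to "use software" here, the following minimal sketch computes a and b with the usual least-squares formulas. The function name `best_fit_line` and the weight and mpg values are invented for illustration; they are not the values from the slides' car table.

```python
def best_fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("slope is undefined when all x values are equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # intercept
    return a, b

# Made-up data: car weight in thousands of lb, fuel consumption in mpg
weights = [2.0, 2.5, 3.0, 3.5, 4.0]
mpg = [38, 33, 31, 25, 23]

a, b = best_fit_line(weights, mpg)
predicted = a + b * 3.2   # prediction for a 3200 lb car, inside the data range

print("intercept a =", round(a, 2))
print("slope b =", round(b, 2))
print("predicted mpg at 3200 lb =", round(predicted, 2))
```

The slope is negative for this data, matching the negative correlation between weight and mileage.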
Car Weight and Fuel Consumption Car Weight (lb) Fuel Consumption (mpg) Chrysler Sebring Ford Mustang BMW 3-series Ford Crown Victoria Honda Civic Mazda Protégé Hyundai Accent 2290 37
Draw scatterplot and best-fit line.
Use Excel to write equation of best-fit line Look at the Coefficient column. The equation of the best-fit line is: Y= x or mpg = (car weight in lb.) How many miles per gallon would a 2000 lb. car get? Is it reasonable to make a prediction for a 2000 lb. car? A 3000 lb. car? A 4000 lb. car?
Cautions in Making Predictions from Best-Fit Lines
Best-fit lines only give a good prediction when the correlation is strong and there are many data points.
Only use the best-fit line to make predictions within the bounds of the data points used.
A best-fit line based on past data is not necessarily valid now or in the future.
Don't make predictions about a population different from the one from which the data were drawn.
A best-fit line is meaningless when there is no significant correlation or the relationship is nonlinear.
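The second caution, staying within the bounds of the data, can be enforced in code. This is only a sketch: the helper `predict_within_data`, the coefficients, and the weights are hypothetical values chosen for illustration.

```python
def predict_within_data(a, b, x, xs):
    """Predict y = a + b*x, refusing to extrapolate beyond the observed xs."""
    lo, hi = min(xs), max(xs)
    if not lo <= x <= hi:
        raise ValueError(
            f"x = {x} is outside the observed range [{lo}, {hi}]; "
            "the best-fit line should not be used here"
        )
    return a + b * x

# Hypothetical fitted line and the x values it was fitted on
a, b = 52.8, -7.6                      # assumed intercept and slope
weights = [2.0, 2.5, 3.0, 3.5, 4.0]    # thousands of lb (made up)

inside = predict_within_data(a, b, 3.0, weights)   # within the data: allowed

try:
    predict_within_data(a, b, 5.0, weights)        # beyond the data: rejected
    extrapolation_blocked = False
except ValueError:
    extrapolation_blocked = True

print(inside, extrapolation_blocked)
```

A range check like this addresses only one of the cautions; it says nothing about whether the correlation is strong or the relationship is linear.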
Correlation between your bill and the tip you leave
Bill $10.15 $25.36 $7.38 $43.78 $ $33.89 $26.17 $21.18
Tip $2.00 $3.50 $1.50 $6.00 $9.00 $2.00 $5.50 $4.00 $2.00
r squared = % of the variation in the tip can be explained by the cost of the bill. What explains the other 6.7%? What should the slope of the best-fit line be? The equation of the best-fit line is y= x What does the line tell you about how people tip?
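The reading of r squared as "the share of variation explained" can be verified directly. The sketch below uses made-up bills and tips (the slide's own table is incomplete in this transcript) and computes r squared two ways: as the square of r, and as 1 minus the ratio of leftover variation around the best-fit line to the total variation in the tips. The function name is an assumption for illustration.

```python
def r_squared_two_ways(xs, ys):
    """Return (r squared, explained share of variation) for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)   # total variation in y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    r_squared = sxy ** 2 / (sxx * syy)

    # Least-squares line y = a + b*x and the variation it leaves unexplained
    b = sxy / sxx
    a = mean_y - b * mean_x
    unexplained = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    explained_share = 1 - unexplained / syy

    return r_squared, explained_share

# Made-up bills and tips in dollars (not the slide's data)
bills = [10, 20, 30, 40, 50]
tips = [2.0, 3.0, 5.0, 6.0, 9.0]

r2, explained = r_squared_two_ways(bills, tips)
print(round(r2, 4), round(explained, 4))
```

Both numbers agree, which is why r squared can be quoted as a percentage of the variation in the response variable explained by the best-fit line.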