# Chapter 4 Describing Relationships: Scatterplots and Correlation


Objectives (BPS chapter 4): Relationships: Scatterplots and correlation

- Explanatory and response variables
- Displaying relationships: scatterplots
- Interpreting scatterplots
- Adding categorical variables to scatterplots
- Measuring linear association (correlation)
- Facts about correlation

Scatterplot: A scatterplot is a graph in which paired (x, y) data (usually collected on the same individuals) are plotted with one variable on the horizontal (x) axis and the other on the vertical (y) axis. Each pair (x, y) is plotted as a single point. Example:

| Student | Number of Beers | Blood Alcohol Level |
|---|---|---|
| 1 | 5 | 0.1 |
| 2 | 2 | 0.03 |
| 3 | 9 | 0.19 |
| 6 | 7 | 0.095 |
| 7 | 3 | 0.07 |
| 9 | 3 | 0.02 |
| 11 | 4 | 0.07 |
| 13 | 5 | 0.085 |
| 4 | 8 | 0.12 |
| 5 | 3 | 0.04 |
| 8 | 5 | 0.06 |
| 10 | 5 | 0.05 |
| 12 | 6 | 0.1 |
| 14 | 7 | 0.09 |
| 15 | 1 | 0.01 |
| 16 | 4 | 0.05 |

Here we have two quantitative variables for each of 16 students:

1. How many beers they drank
2. Their blood alcohol level (BAC)

We are interested in the relationship between the two variables: how is one affected by changes in the other?

Scatterplots: In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph.
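As a rough illustration of how each (beers, BAC) pair becomes one point, here is a minimal sketch that draws a coarse text scatterplot of the 16 students' data using only the Python standard library. The function name `text_scatter` and the grid size are assumptions made for this example; a real analysis would normally use a plotting library instead.

```python
# Beers and BAC for the 16 students, in the order they appear in the data table.
beers = [5, 2, 9, 7, 3, 3, 4, 5, 8, 3, 5, 5, 6, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.095, 0.07, 0.02, 0.07, 0.085,
       0.12, 0.04, 0.06, 0.05, 0.10, 0.09, 0.01, 0.05]

def text_scatter(xs, ys, width=40, height=12):
    """Return a coarse text scatterplot: x runs left to right, y bottom to top."""
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each value to a column/row index inside the grid.
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # row 0 of the grid is the top of the plot
    body = "\n".join("|" + "".join(line) for line in grid)
    return body + "\n+" + "-" * width

plot = text_scatter(beers, bac)
print(plot)  # explanatory variable (beers) on x, response (BAC) on y
```

The points drift from the lower left to the upper right, which is the visual signature of a positive association.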

Explanatory and response variables: A response variable measures or records an outcome of a study. An explanatory variable explains changes in the response variable. Typically, the explanatory (independent) variable is plotted on the x-axis and the response (dependent) variable is plotted on the y-axis. In the example, the explanatory variable is the number of beers (x) and the response variable is blood alcohol content (y).

Some plots don’t have clear explanatory and response variables. Do calories explain sodium amounts? Does percent return on Treasury bills explain percent return on common stocks?

Examining a Scatterplot: You can describe the overall pattern of a scatterplot by its

- Form: linear or non-linear (quadratic, exponential, no correlation, etc.)
- Direction: positive or negative
- Strength: very strong, strong, moderately strong, weak, etc.

Also look for outliers and how they affect the correlation.

Scatterplot example: Draw a scatterplot for the data below. What is the nature of the relationship between X and Y?

| x | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| y | -4 | -2 | 1 | 0 | 2 |

Answer: strong, positive and linear.

Examining a Scatterplot

- Two variables are positively correlated when high values of the variables tend to occur together and low values of the variables tend to occur together. The scatterplot slopes upwards from left to right.
- Two variables are negatively correlated when high values of one of the variables tend to occur with low values of the other and vice versa. The scatterplot slopes downwards from left to right.

Types of Correlation (four scatterplot panels)

- Negative linear correlation: as x increases, y tends to decrease.
- Positive linear correlation: as x increases, y tends to increase.
- No correlation
- Non-linear correlation

Examples of Relationships

Caution:

- Relationships require that both variables be quantitative (thus the order of the data points is defined entirely by their value).
- Correspondingly, relationships between categorical data are meaningless.

Example: Beetles trapped on boards of different colors (blue, green, white, yellow). The two bar charts list the board colors in different orders. What association? What relationship? Describe one category at a time.

Thought Question 1: What type of association would the following pairs of variables have: positive, negative, or none?

1. Temperature during the summer and electricity bills
2. Temperature during the winter and heating costs
3. Number of years of education and height (elementary school)
4. Frequency of brushing and number of cavities
5. Number of churches and number of bars in cities
6. Height of husband and height of wife

Thought Question 2: Consider the two scatterplots below. How does the outlier impact the correlation for each plot? Does the outlier increase the correlation, decrease the correlation, or have no impact?
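The two scatterplots are not reproduced here, so the following is only a sketch with made-up numbers (not the data behind the slide). It compares the correlation with and without one outlier in two hypothetical situations: a tight linear pattern, and a patternless cloud.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation computed from deviations about the means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical case 1: points close to a line, plus an outlier far below the line.
line_x = [1, 2, 3, 4, 5, 6]
line_y = [1.2, 1.9, 3.1, 3.8, 5.2, 5.9]
r_line = pearson_r(line_x, line_y)
r_line_outlier = pearson_r(line_x + [7], line_y + [1.0])

# Hypothetical case 2: a cloud with no linear pattern, plus an outlier far up and to the right.
cloud_x = [1, 1, 3, 3, 2]
cloud_y = [1, 3, 1, 3, 2]
r_cloud = pearson_r(cloud_x, cloud_y)
r_cloud_outlier = pearson_r(cloud_x + [10], cloud_y + [10])

print(round(r_line, 2), round(r_line_outlier, 2))    # the outlier weakens a strong correlation
print(round(r_cloud, 2), round(r_cloud_outlier, 2))  # the outlier creates a strong correlation
```

In this sketch, an outlier that falls off an otherwise clear line pulls r toward 0, while an outlier that sits far from a patternless cloud pushes r toward 1.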

Strength of the association: The strength of the relationship between the two variables can be seen by how much variation, or scatter, there is around the main form. With a strong relationship, you can get a good estimate of y if you know x. With a weak relationship, for any x you might get a wide range of y values.

How to scale a scatterplot: Using an inappropriate scale for a scatterplot can give an incorrect impression. Both variables should be given a similar amount of space:

- The plot should be roughly square.
- Points should occupy all the plot space (no blank space).

(Figure: the same data shown in all four plots.)

Adding categorical variables to scatterplots: Often, things are not simple and one-dimensional. We need to group the data into categories to reveal trends. What may look like a positive linear relationship is in fact a series of negative linear associations. Plotting different habitats in different colors allowed us to make that important distinction.
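A small numerical sketch can make this concrete. The three "habitats" and their values below are invented for illustration (they are not the data from the slide): within each habitat y falls as x rises, yet the pooled data show a positive correlation.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation computed from deviations about the means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical (x, y) measurements grouped by habitat.
habitats = {
    "A": ([1, 2, 3], [3, 2, 1]),
    "B": ([4, 5, 6], [6, 5, 4]),
    "C": ([7, 8, 9], [9, 8, 7]),
}

# Correlation when the categorical variable is ignored.
all_x = [x for xs, _ in habitats.values() for x in xs]
all_y = [y for _, ys in habitats.values() for y in ys]
r_pooled = pearson_r(all_x, all_y)

# Correlation within each category.
r_within = {name: pearson_r(xs, ys) for name, (xs, ys) in habitats.items()}

print(round(r_pooled, 2))                              # 0.8
print({k: round(v, 2) for k, v in r_within.items()})   # -1.0 in every habitat
```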

Comparison of men’s and women’s racing records over time. Each group shows a very strong negative linear relationship that would not be apparent without the gender categorization. Relationship between lean body mass and metabolic rate in men and women. While both men and women follow the same positive linear trend, women show a stronger association. As a group, males typically have larger values for both variables.

Measuring Strength & Direction of a Linear Relationship

- How closely does a non-horizontal straight line fit the points of a scatterplot?
- The correlation coefficient (often referred to as just correlation), r, is
  - a measure of the strength of the relationship: the stronger the relationship, the larger the magnitude of r.
  - a measure of the direction of the relationship: positive r indicates a positive relationship, negative r indicates a negative relationship.

Correlation Coefficient: The Greek capital letter sigma (Σ) denotes summation (addition).
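The formula itself is not reproduced in this transcript. Assuming the slide used the standard definition of the sample correlation (which agrees with the later remark that r is built from z-scores), it can be written in two equivalent forms, where $\bar{x}$, $\bar{y}$ are the sample means and $s_x$, $s_y$ the sample standard deviations (with divisor $n-1$):

```latex
r \;=\; \frac{1}{n-1}\sum_{i=1}^{n}
        \left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)
  \;=\; \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}
             {\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^{2}\;\sum_{i=1}^{n}(y_i-\bar{y})^{2}}}
```

The second form follows from the first because $s_x s_y = \sqrt{\sum (x_i-\bar{x})^2 \sum (y_i-\bar{y})^2}\,/\,(n-1)$, so the factors of $n-1$ cancel.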

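Example: Find the correlation between X and Y. Here x̄ = 3 and ȳ = -0.6.

| x | y | x - x̄ | y - ȳ | (x - x̄)(y - ȳ) |
|---|---|---|---|---|
| 1 | -4 | -2 | -3.4 | 6.8 |
| 2 | -2 | -1 | -1.4 | 1.4 |
| 3 | 1 | 0 | 1.6 | 0 |
| 4 | 0 | 1 | 0.6 | 0.6 |
| 5 | 2 | 2 | 2.6 | 5.2 |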
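The calculation for this example can be checked with a few lines of Python. This is a sketch that assumes the standard deviation-based formula for r; the helper name `pearson_r` is chosen for this example only.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation: sum of products of deviations, divided by the
    square root of the product of the two sums of squared deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
y = [-4, -2, 1, 0, 2]

r = pearson_r(x, y)
print(round(r, 3))  # 0.919
```

A value of about 0.92 agrees with the earlier description of these data as strong, positive and linear.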

Correlation Coefficient

- The range of the correlation coefficient is -1 to 1.
- If r = -1 there is a perfect negative correlation.
- If r = 1 there is a perfect positive correlation.
- If r is close to 0 there is little or no linear correlation.

Linear Correlation (four scatterplot panels)

- Strong negative correlation: r = -0.91
- Strong positive correlation: r = 0.88
- Weak positive correlation: r = 0.42
- Non-linear correlation: r = 0.07

Correlation Coefficient

- Special values for r:
  - a perfect positive linear relationship would have r = +1
  - a perfect negative linear relationship would have r = -1
  - if there is no linear relationship, or if the scatterplot points are best fit by a horizontal line, then r = 0
  - Note: r must be between -1 and +1, inclusive
- r > 0: as one variable changes, the other variable tends to change in the same direction
- r < 0: as one variable changes, the other variable tends to change in the opposite direction

Correlation Coefficient

- Because r uses the z-scores for the observations, it does not change when we change the units of measurement of x, y or both.
- Correlation ignores the distinction between explanatory and response variables.
- r measures the strength of only linear association between variables.
- A large value of r does not necessarily mean that there is a strong linear relationship between the variables; the relationship might not be linear. Always look at the scatterplot.
- When r is close to 0, it does not mean that there is no relationship between the variables; it means there is no linear relationship.
- Outliers can inflate or deflate correlations.
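The first two facts can be checked numerically with the beers and BAC data from the earlier example. This is a minimal sketch: the unit conversions (12-ounce beers, BAC multiplied by 1000) are assumptions chosen only to show a change of scale.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation computed from deviations about the means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

beers = [5, 2, 9, 7, 3, 3, 4, 5, 8, 3, 5, 5, 6, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.095, 0.07, 0.02, 0.07, 0.085,
       0.12, 0.04, 0.06, 0.05, 0.10, 0.09, 0.01, 0.05]

r = pearson_r(beers, bac)

# Change of units: ounces instead of beers (assuming 12 oz each), BAC rescaled by 1000.
ounces = [12 * b for b in beers]
bac_scaled = [1000 * v for v in bac]
r_units = pearson_r(ounces, bac_scaled)

# Swap the roles of explanatory and response variables.
r_swapped = pearson_r(bac, beers)

print(round(r, 3))          # 0.894
print(round(r_units, 3))    # same value after changing units
print(round(r_swapped, 3))  # same value after swapping x and y
```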

Not all Relationships are Linear: Miles per Gallon versus Speed

- Curved relationship (r is misleading)
- Speed chosen for each subject varies from 20 mph to 60 mph
- MPG varies from trial to trial, even at the same speed
- Statistical relationship
- r = -0.06
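To see how a clear curved relationship can produce a correlation near zero, here is a sketch with invented speed and mileage values (not the data behind the slide's r = -0.06). Mileage rises up to 40 mph and then falls symmetrically, so the positive and negative contributions to r cancel.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation computed from deviations about the means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: mileage peaks at a moderate speed.
speed_mph = [20, 30, 40, 50, 60]
mpg = [24, 28, 30, 28, 24]

r_curve = pearson_r(speed_mph, mpg)
print(abs(r_curve) < 1e-9)  # True: essentially zero despite an obvious relationship
```

Speed clearly matters for mileage in this toy data set, but r does not detect it because the relationship is not linear.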

Common Errors Involving Correlation

1. Causation: It is wrong to conclude that correlation implies causality.
2. Averages: Averages suppress individual variation and may inflate the correlation coefficient.
3. Linearity: There may be some relationship between x and y even when there is no linear correlation.
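The second error can be illustrated with a small made-up data set (the numbers are hypothetical). Each x value has three individuals whose y values scatter widely around x; replacing them by their group average removes that scatter and raises r.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation computed from deviations about the means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical individuals: for each x in 1..5, three y values at x - 3, x, x + 3.
groups = {x: [x - 3, x, x + 3] for x in range(1, 6)}

ind_x = [x for x, ys in groups.items() for _ in ys]
ind_y = [y for ys in groups.values() for y in ys]
r_individuals = pearson_r(ind_x, ind_y)

# Same data after averaging y within each x group.
avg_x = list(groups.keys())
avg_y = [sum(ys) / len(ys) for ys in groups.values()]
r_averages = pearson_r(avg_x, avg_y)

print(round(r_individuals, 2))  # 0.5
print(round(r_averages, 2))     # 1.0
```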

Example: A survey of the world’s nations in 2004 shows a strong positive correlation between the percentage of each country’s population using cell phones and life expectancy in years at birth.

a) Does this mean that cell phones are good for your health? No. It simply means that in countries where cell phone use is high, life expectancy tends to be high as well.

b) What might explain the strong correlation? The economy could be a lurking variable. Richer countries generally have more cell phone use and better health care.

Example: The correlation between Age and Income as measured on 100 people is r = 0.75. Explain whether or not each of these conclusions is justified.

a) When Age increases, Income increases as well.

b) The form of the relationship between Age and Income is linear.

c) There are no outliers in the scatterplot of Income vs. Age.

d) Whether we measure Age in years or months, the correlation will still be 0.75.

Example: Explain the mistakes in the statements below:

a) “My correlation of -0.772 between GDP and Infant Mortality Rate shows that there is almost no association between GDP and Infant Mortality Rate.”

b) “There was a correlation of 0.44 between GDP and Continent.”

c) “There was a very strong correlation of 1.22 between Life Expectancy and GDP.”

Key Concepts

- Strength of Linear Relationship
- Direction of Linear Relationship
- Correlation Coefficient
- Common Problems with Correlations
- r can only be calculated for quantitative data.