Image from Minitab Website

Scatterplots

Learning Objectives
By the end of this lecture, you should be able to:
Describe what a scatterplot is
Be comfortable with the terms explanatory variable and response variable
Describe a scatterplot in terms of form, direction, and strength
Define what is meant by an “outlier” and an “influential point” (in terms of a scatterplot), and how you might identify them
Recognize why poorly chosen scales on a scatterplot can give misleading impressions of the data

Examining Relationships
Up to this point, we have focused on single-variable (“univariate”) data, e.g. women’s heights, the percentage of Hispanics in each state, SAT scores, etc. Much of statistical analysis involves looking at the relationship between two or more variables. For example, we may be interested in the relationship between the number of beers people consumed at a party and their resulting blood alcohol content (BAC).
With the proper statistical tools we can try to determine things like:
IS there a relationship? That is, does the number of beers truly affect blood alcohol content?
If there is a relationship, can we predict how the quantity of beer consumed affects BAC?
A human flaw: It is tempting to just intuitively assume that there is a relationship between two variables. However, this can lead to some highly erroneous conclusions. As humans, we LOVE to assume stuff, find patterns that don’t truly exist, and then jump to conclusions. This is a very well-known flaw in human reasoning and we should be aware of it. We will discuss this topic in more detail as we progress through the course.

Student   Beers   Blood Alcohol
S1        5       0.1
S2        2       0.03
S3        9       0.19
S4        7       0.095
S5        3       0.07
S6        -       0.02
S7        4       -
S8        -       0.085
S9        8       0.12
S10       -       0.04
S11       -       0.06
S12       -       0.05
S13       6       -
S14       -       0.09
S15       1       0.01
S16       -       -
(A dash marks a cell whose value is missing from this transcript.)

Here, we have two quantitative variables for each of 16 students (n=16):
1) How many beers each student drank
2) The blood alcohol content (BAC) of each student after consuming those beers
We are interested in the relationship between the two variables: How is one variable affected by changes in the other variable?

Looking for relationships between variables
Always start with a graph (if possible) Hopefully this detail is becoming increasingly obvious to you! Look for An overall pattern Deviations from the pattern (deviations such as outliers are sometimes the most interesting part!) If appropriate, try to provide both descriptive and numerical descriptions about the data and pattern.

Scatterplots
In a scatterplot, each axis represents one of the variables, and each observation is plotted as a point on the graph.
(Table on slide: Student, Beers, BAC for the same 16 students as in the previous table.)
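To make the idea concrete, here is a minimal sketch that draws a rough text scatterplot using only the Python standard library. It uses only the seven students whose beers and BAC values are both legible in this transcript (S1, S2, S3, S4, S5, S9, S15), so it is a partial picture of the data. The function name `text_scatterplot` and the grid size are illustrative choices, not part of the lecture.

```python
# Complete (beers, BAC) pairs that are legible in this transcript:
# students S1, S2, S3, S4, S5, S9 and S15. The other rows are incomplete.
beers = [5, 2, 9, 7, 3, 8, 1]
bac = [0.10, 0.03, 0.19, 0.095, 0.07, 0.12, 0.01]


def text_scatterplot(x, y, width=40, height=12):
    """Return a list of text rows with '*' marking each (x, y) point.

    Assumes x and y each contain at least two distinct values.
    """
    x_min, x_max = min(x), max(x)
    y_min, y_max = min(y), max(y)
    grid = [[" "] * width for _ in range(height)]
    for xi, yi in zip(x, y):
        # Scale each value to a column / row index inside the grid.
        col = round((xi - x_min) / (x_max - x_min) * (width - 1))
        row = round((yi - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # grid row 0 is the top of the plot
    return ["|" + "".join(r) for r in grid] + ["+" + "-" * width]


rows = text_scatterplot(beers, bac)
print("BAC (response variable, y axis)")
print("\n".join(rows))
print("Number of beers (explanatory variable, x axis)")
```

In practice you would let statistical software such as SPSS or Minitab draw the plot; the sketch only shows that each student contributes exactly one point, placed by its pair of values.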

Explanatory and response variables
A response variable measures or records an outcome of a study. An explanatory variable explains (“causes”) the changes in the response variable.
Which variable should go on which axis? Typically, the explanatory variable is plotted on the x axis, and the response variable is plotted on the y axis.
y axis: Blood Alcohol Content (response variable)
x axis: Number of Beers (explanatory variable)

Terminology: Dependent / Independent
Instead of explanatory / response, you may often encounter the terms independent and dependent:
Independent for Explanatory
Dependent for Response
The two pairs of terms are pretty much interchangeable, but there is a subtle difference. It is more accurate to use the terms explanatory and response, so I would like you to focus on those terms. You will occasionally see SPSS use dependent/independent.

Which variable should be the explanatory, and which the response?
The variable from which you are trying to predict the change in the other variable should be the explanatory variable. (This is why it is frequently called the ‘independent’ variable. But as was just mentioned, there is a subtle distinction between the terms, which we may discuss at a later point.) The variable that gets changed in response to changes in the explanatory variable (i.e. “responds” to the explanatory variable) is the response variable.
Example: Exercise vs. calories burned?
Answer: If in your analysis you are trying to predict or analyze the number of calories burned as a result of exercise, then exercise would be the explanatory variable, and calories burned would be the response variable.
Example: Exam score vs. hours studying?
Answer: If we are trying to predict or analyze the scores on an exam as a result of studying, then “hours studying” would be the explanatory variable, and exam score would be the response variable.

Describing Scatterplots
Much in the same way we describe a single variable’s distribution in terms of its shape, center, spread, etc., we should also be able to describe a scatterplot. When describing a scatterplot, we describe the relationship by examining the form, direction, and strength of the association. We look for an overall pattern:
Form: linear (a straight line), curved, clusters, no pattern
Direction: positive, negative, no direction
Strength: how closely the points fit the “form”

Form of an association Linear / Nonlinear / No Relationship

Direction of a linear association Positive or Negative
If a relationship is linear, it is given a directional description of Positive or Negative.
Positive association: High values of one variable tend to occur together with high values of the other variable.
Negative association: High values of one variable tend to occur together with low values of the other variable.
Again, note that we only describe the direction of the relationship when the relationship is linear.
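The definitions above can be turned into a rough numerical check. This is only a sketch under an assumption the lecture does not make: it labels direction by whether above-average x values tend to pair with above-average y values, using the sign of the summed products of deviations from the means. The function name `direction` and the data are hypothetical, and in this lecture direction is still judged by looking at the plot.

```python
def direction(x, y):
    """Label an association by the sign of the summed cross-deviations.

    Only meaningful when the scatterplot already looks roughly linear.
    """
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # A product is positive when a point is high on both variables or low on both.
    total = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "no direction"


# Hypothetical values, made up for illustration.
x = [1, 2, 3, 4, 5]
rising = [2, 4, 5, 8, 10]    # high x goes with high y
falling = [10, 8, 5, 4, 2]   # high x goes with low y

print(direction(x, rising))   # positive
print(direction(x, falling))  # negative
```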

Scatterplot Direction: No Relationship
Sometimes there isn’t any relationship: X and Y may vary, but are independent of each other. Knowing a value for X tells you nothing about the value for Y. We describe this as “No relationship”.

Scatterplot: Strength of the association
The strength of the relationship between the two variables can be seen by how much variation, or scatter, there is around the main form.
With a strong relationship, you can get a pretty good estimate of y if you know x.
With a weak relationship, for any x you might get a wide range of y values. (You could probably make a reasonable argument that the relationship of this plot isn’t even linear.)
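One informal way to see the difference is to look at how widely y varies among points that share the same x. The sketch below does this for two made-up datasets; the names `y_range_by_x` and `widest_spread` are hypothetical helpers, and this is not the technique for quantifying strength that the later lecture introduces.

```python
from collections import defaultdict


def y_range_by_x(points):
    """Map each x value to (lowest y, highest y) among points with that x."""
    groups = defaultdict(list)
    for x, y in points:
        groups[x].append(y)
    return {x: (min(ys), max(ys)) for x, ys in sorted(groups.items())}


def widest_spread(points):
    """Largest gap between the lowest and highest y at any single x."""
    return max(hi - lo for lo, hi in y_range_by_x(points).values())


# Hypothetical data, made up for illustration.
strong = [(1, 10), (1, 11), (2, 20), (2, 21), (3, 30), (3, 31)]
weak = [(1, 5), (1, 25), (2, 8), (2, 33), (3, 12), (3, 38)]

print(widest_spread(strong))  # 1: knowing x pins y down closely
print(widest_spread(weak))    # 26: knowing x leaves a wide range of y
```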

Strong or Weak Relationship?
This is a relatively weak relationship. For a particular state median household income, you can’t predict the state per capita income very well.
This looks like a reasonably strong relationship. The daily amount of gas consumed can be predicted pretty accurately for a given temperature value.

Describing the strength
For now we are using the admittedly vague terms ‘strong, moderate, weak’. In a subsequent lecture on scatterplots, we will learn a technique for quantifying the strength of a linear relationship between two variables.

Describing/Interpreting scatterplots
As mentioned earlier, when you are asked to interpret a scatterplot, you should be familiar with these 3 terms in particular:
Form: linear, curved, clusters, no pattern
Direction: positive, negative, no direction
Strength: strong, moderate, weak
Note: Recall that if the relationship is not linear, we will not bother to describe direction or strength.

Examples – Describe each plot
Form: Linear, Direction: positive, Strength: strong
Form: Linear, Direction: negative, Strength: moderate
Form: No relationship. Examining any particular x tells us nothing about y. As a result, the terms ‘positive/negative’ don’t apply. Neither does the strength.

Examples
Form: Non-linear. Therefore, we don’t bother trying to describe direction or strength.
Form: Linear, Direction: positive, Strength: moderate
In our next lecture on scatterplots, we will discuss a tool for quantifying the strength of the relationship.

Lying with Statistics: How (not) to scale a scatterplot
Same data in all four plots! Using an inappropriate scale for a scatterplot can give an incorrect impression. Ideally, both variables should be given a similar amount of space:
The plot should be roughly square
The points should occupy most of the plot space
In other words, if faced with this group of plots, you should be suspicious of most of them!
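The “points should occupy most of the plot space” guideline can be checked with simple arithmetic: compare the span of the data with the span of each axis. The sketch below is illustrative only; the function `axis_fill`, the data values, and the axis limits are all hypothetical and are not taken from the four plots on the slide.

```python
def axis_fill(values, axis_min, axis_max):
    """Fraction of the axis length that the data values span."""
    return (max(values) - min(values)) / (axis_max - axis_min)


# Hypothetical values, made up for illustration.
x = [1, 2, 3, 5, 7, 8, 9]
y = [0.01, 0.03, 0.07, 0.10, 0.095, 0.12, 0.19]

# Axes chosen to fit the data: the points fill most of the plot both ways.
fitted = (axis_fill(x, 0, 10), axis_fill(y, 0.0, 0.2))

# Same data, but the y axis runs up to 2.0: the points are squashed
# into a thin band along the bottom and the pattern is hard to see.
stretched = (axis_fill(x, 0, 10), axis_fill(y, 0.0, 2.0))

print(fitted)     # roughly (0.8, 0.9)
print(stretched)  # roughly (0.8, 0.09)
```

A fill fraction far below 1 on one axis is a hint that the scale, rather than the data, is shaping the impression the plot gives.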

Outliers on Scatterplots
An outlier is a data value that has a very low probability of occurrence (i.e., that particular observation is unusual or unexpected). In a scatterplot, outliers are points that fall outside of the pattern of the relationship.
This scatterplot appears to show a linear relationship between the two variables. The observation at the upper right is consistent with the linear relationship and therefore would not be considered an outlier.
This plot also appears to show a linear relationship. However, the observation at approximately (7,2) is not consistent with the linear relationship. That is, it appears to be an outlier.

Outliers?
The upper right-hand point here is not an outlier of the relationship: it is what you would expect for this many beers given the linear relationship between beers/weight and blood alcohol.
This point is not consistent with the relationship, so we would label it as an outlier.

IQ score and Grade point average
Describe in words the purpose of this plot: It is there to help us determine if there is a relationship between IQ score and GPA.
Describe the shape, direction, and strength:
Shape: linear
Direction: positive
Strength: appears somewhat weak
Outliers present? There appear to be outliers, but it is hard to say definitively.

IQ score and Grade point average
Are there outliers present? The circled datapoints (and perhaps some of the others too) appear to be outliers. Still, it is hard to say. How do we decide?
Recall that on a scatterplot that is showing a linear relationship, we consider a datapoint to be an outlier if it is way off the regression line (the line through the data points). If the regression line looks like the one here, then we would probably label these two observations as outliers.

IQ score and Grade point average
Are there outliers present? If the regression line looks like the one drawn here, then certainly the lower circled datapoint (and probably some of the others nearby as well) would be considered outliers.

IQ score and Grade point average
Suppose that we have a different regression line as shown here. Are there outliers present? If the regression line looks like the one drawn here, then the upper circled datapoint (and probably some of the others nearby as well) could be considered outliers. But the lower one would not be.
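The last three slides make the point that whether a datapoint counts as an outlier depends on which line you compare it to. Here is a minimal sketch of that idea. Everything in it is hypothetical: the IQ/GPA values are made up, the three candidate lines are drawn “by eye” rather than computed by any regression formula, and the cutoff of 1.0 GPA points is an arbitrary choice for illustration.

```python
def flag_outliers(points, slope, intercept, threshold):
    """Return the points whose vertical distance from the line exceeds threshold."""
    flagged = []
    for x, y in points:
        predicted = slope * x + intercept
        if abs(y - predicted) > threshold:
            flagged.append((x, y))
    return flagged


# Hypothetical (IQ, GPA) values, made up for illustration.
points = [
    (90, 2.1), (100, 2.4), (110, 3.1), (120, 3.4), (130, 3.9),
    (105, 1.6),   # plays the role of the lower circled point
    (100, 3.9),   # plays the role of the upper circled point
]

# Three candidate lines with the same slope but different heights.
middle = flag_outliers(points, slope=0.05, intercept=-2.5, threshold=1.0)
lower = flag_outliers(points, slope=0.05, intercept=-3.0, threshold=1.0)
higher = flag_outliers(points, slope=0.05, intercept=-2.0, threshold=1.0)

print(middle)  # [(105, 1.6), (100, 3.9)]: both circled points are flagged
print(lower)   # [(100, 3.9)]: only the upper point is flagged
print(higher)  # [(105, 1.6)]: only the lower point is flagged
```

The same seven points give three different answers, which is why the choice of line cannot be left to eyeballing.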

WHICH line, then, is the “correct” regression line?
Answer: As with other models we have discussed (e.g. density curves to summarize a histogram) we use a mathematical formula to draw regression lines. (We don’t just “eyeball it”!) We will discuss this topic in our next lecture on scatterplots.
