Published by Gwendolyn Woods. Modified over 9 years ago.
1
Scatterplots
2
Learning Objectives
By the end of this lecture, you should be able to:
– Describe what a scatterplot is
– Be comfortable with the terms explanatory variable and response variable
– Describe a scatterplot in terms of form, direction, and strength
– Define what is meant by an outlier, and be able to identify outliers on a scatterplot
– Recognize why poorly chosen scales on a scatterplot can give misleading impressions of the data
3
Examining Relationships
Up to this point, we have focused on single-variable (“univariate”) data, e.g., women’s heights, the percentage of Hispanics in each state, SAT scores, etc. Most statistical studies involve more than one variable. For example, a great deal of analysis goes into examining the relationship between two variables.
Example: We may be interested in the relationship between:
– The number of beers a student consumed at a party
– Their blood alcohol level (BAC)
With the proper statistical tools we can try to determine things like:
– Is there a relationship? I.e., does the number of beers affect blood alcohol level?
– If there is a relationship, can we predict how much each beer contributes to BAC?
A great human flaw: It is tempting to just intuitively assume that there is a relationship between two variables. However, this can lead to some highly erroneous conclusions. As humans, we LOVE to assume things, find patterns that don’t truly exist, and then jump to conclusions. This is a very well-known flaw in the human character, and we should be aware of it. We will discuss this topic in more detail as we progress through the course.
4
Student  Beers  Blood Alcohol
S1       5      0.1
S2       2      0.03
S3       9      0.19
S4       7      0.095
S5       3      0.07
S6       3      0.02
S7       4      0.07
S8       5      0.085
S9       8      0.12
S10      3      0.04
S11      5      0.06
S12      5      0.05
S13      6      0.1
S14      7      0.09
S15      1      0.01
S16      4      0.05
Here, we have two quantitative variables for each of 16 students (n=16):
1) How many beers they drank, and
2) Their blood alcohol level (BAC)
We are interested in the relationship between the two variables: How is one affected by changes in the other one?
5
Looking for relationships between variables
– Start with a graph (always, whenever possible)
– Look for an overall pattern and for deviations from the pattern (deviations such as outliers are sometimes the most interesting part!)
– If appropriate, try to provide numerical descriptions of the data and overall pattern.
6
Scatterplots
Student  Beers  BAC
1        5      0.1
2        2      0.03
3        9      0.19
6        7      0.095
7        3      0.07
9        3      0.02
11       4      0.07
13       5      0.085
4        8      0.12
5        3      0.04
8        5      0.06
10       5      0.05
12       6      0.1
14       7      0.09
15       1      0.01
16       4      0.05
In a scatterplot, each of the two variables is represented by its own axis, and each observation is plotted as a point on the graph.
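To make the idea of "one axis per variable, one point per student" concrete, here is a minimal sketch that draws a rough text scatterplot of the beers/BAC data from the table above. It uses only the Python standard library. The function name `text_scatter` and the grid size are my own choices for illustration, and the sketch assumes that neither variable is constant (otherwise the scaling would divide by zero).

```python
# Beers and BAC for the 16 students, in the order listed in the table.
beers = [5, 2, 9, 7, 3, 3, 4, 5, 8, 3, 5, 5, 6, 7, 1, 4]
bac = [0.1, 0.03, 0.19, 0.095, 0.07, 0.02, 0.07, 0.085,
       0.12, 0.04, 0.06, 0.05, 0.1, 0.09, 0.01, 0.05]


def text_scatter(xs, ys, width=40, height=12):
    """Return the plot as a list of text rows; '*' marks one or more points.

    Hypothetical helper for illustration. Assumes xs and ys each contain
    at least two distinct values.
    """
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each value to a column/row index inside the grid.
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # row 0 is the top of the plot
    return ["".join(cells) for cells in grid]


rows = text_scatter(beers, bac)
for line in rows:
    print("|" + line)
print("+" + "-" * len(rows[0]))
print(" beers run left to right; BAC runs bottom to top")
```

Each student contributes exactly one point: the beer count fixes how far right it sits and the BAC fixes how high. Two students whose values land in the same grid cell share a single `*`.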
7
Explanatory and response variables
[Figure: scatterplot with Number of Beers (explanatory variable) on the x axis and Blood Alcohol Content (response variable) on the y axis]
A response variable measures or records an outcome of a study. An explanatory variable explains (“causes”) the changes in the response variable. Typically, the explanatory variable is plotted on the x axis, and the response variable is plotted on the y axis.
8
Terminology: Dependent / Independent
Instead of explanatory / response, you will often encounter the terms independent and dependent.
– Independent for explanatory
– Dependent for response
The two pairs of terms are pretty much interchangeable, although there is a subtle difference. It is more accurate to use the terms explanatory and response, so I would like you to focus on those terms.
– You will occasionally see SPSS use dependent/independent.
9
Which should be the explanatory, and which the response?
The variable that you think “causes” the change in the other variable should be the explanatory variable.
– (This is why it is frequently called the ‘independent’ variable. But as was just mentioned, there is a subtle distinction between the two terms, which we may get to down the road.)
The variable that “responds” to a change in the explanatory variable is, then, the response variable.
Examples:
– Exercise vs. calories burned? Answer: The amount of exercise will (hopefully!) result in a change in calories burned, whereas burning calories does not ‘cause’ a change in exercise. So exercise should be our explanatory variable, and calories burned the response variable.
– Exam score vs. hours studying? Answer: We would expect that the number of hours studying would cause a change in exam score rather than the other way around. So ‘hours studying’ would be our explanatory variable, and exam score the response variable.
10
Describing/Interpreting scatterplots When describing a scatterplot, we describe the relationship by examining the form, direction, and strength of the association. We look for an overall pattern … – Form: linear (a straight line), curved, clusters, no pattern – Direction: positive, negative, no direction – Strength: how closely the points fit the “form”
11
Form of an association: Linear / Nonlinear / No Relationship
[Figure: three scatterplots labeled Linear, Nonlinear, and No relationship]
12
Direction of a linear association: Positive or Negative
A linear relationship is given a directional description of positive or negative.
– Positive association: High values of one variable tend to occur together with high values of the other variable.
– Negative association: High values of one variable tend to occur together with low values of the other variable.
Note that we only describe the direction of the relationship when the relationship is linear.
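The phrase "high values tend to occur together with high values" can be checked roughly in code. The sketch below, using the beers/BAC data from earlier in the lecture, splits the students at the median number of beers and compares the average BAC of the two groups. This is only an informal check that I am adding for illustration, not a formal statistical method; the function name `direction` is hypothetical, and the check assumes the form is roughly linear and that some x values fall on each side of the median.

```python
from statistics import mean, median

beers = [5, 2, 9, 7, 3, 3, 4, 5, 8, 3, 5, 5, 6, 7, 1, 4]
bac = [0.1, 0.03, 0.19, 0.095, 0.07, 0.02, 0.07, 0.085,
       0.12, 0.04, 0.06, 0.05, 0.1, 0.09, 0.01, 0.05]


def direction(xs, ys):
    """Rough direction check (hypothetical helper, for illustration only).

    Compares the mean y of points with below-median x against the mean y
    of points with above-median x. Points exactly at the median are skipped.
    """
    mid = median(xs)
    low_ys = [y for x, y in zip(xs, ys) if x < mid]
    high_ys = [y for x, y in zip(xs, ys) if x > mid]
    if mean(high_ys) > mean(low_ys):
        return "positive"
    if mean(high_ys) < mean(low_ys):
        return "negative"
    return "no clear direction"


print(direction(beers, bac))
```

For this data the median is 5 beers. Students who drank fewer than 5 average a BAC of about 0.041, while those who drank more than 5 average 0.119, so the check reports a positive association.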
13
Scatterplot Direction: No Relationship
Sometimes there isn’t any relationship: X and Y may vary, but are independent of each other. Knowing a value for X tells you nothing about the value for Y. We describe this as ‘no relationship’.
14
Scatterplot: Strength of the association
The strength of the relationship between the two variables can be seen by how much variation, or scatter, there is around the main form. With a strong relationship, you can get a pretty good estimate of y if you know x. With a weak relationship, for any x you might get a wide range of y values. (You could probably make a reasonable argument that the relationship in this plot isn’t even linear.)
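One informal way to see "for any x you might get a wide range of y values" is to list the range of y observed at each x. The sketch below does this for the beers/BAC data from earlier in the lecture. Grouping by exact x value only works here because the number of beers is a whole number, and the helper name `y_range_by_x` is my own. This is not the quantitative measure of strength promised for a later lecture, just a way to look at the scatter.

```python
from collections import defaultdict

beers = [5, 2, 9, 7, 3, 3, 4, 5, 8, 3, 5, 5, 6, 7, 1, 4]
bac = [0.1, 0.03, 0.19, 0.095, 0.07, 0.02, 0.07, 0.085,
       0.12, 0.04, 0.06, 0.05, 0.1, 0.09, 0.01, 0.05]


def y_range_by_x(xs, ys):
    """Map each distinct x to the (smallest, largest) y seen at that x.

    Hypothetical helper for illustration; assumes x takes repeated,
    discrete values so that grouping by exact x makes sense.
    """
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    return {x: (min(vals), max(vals)) for x, vals in sorted(groups.items())}


ranges = y_range_by_x(beers, bac)
for x, (low, high) in ranges.items():
    print(f"{x} beers: BAC from {low} to {high}")
```

For example, the four students who drank 5 beers have BAC values between 0.05 and 0.1. The narrower these ranges are compared with the overall spread of BAC, the better you can estimate y from x.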
15
This is a strong relationship. The daily amount of gas consumed can be predicted quite accurately for a given temperature value.
This is a relatively weak relationship. For a given state’s median household income, you can’t predict the state per capita income very well.
16
Describing the strength For now we are using the admittedly vague terms ‘strong, moderate, weak’. In a subsequent lecture on scatterplots, we will learn a technique for quantifying the strength.
17
Describing/Interpreting scatterplots As mentioned earlier, when you are asked to interpret a scatterplot, you should be familiar with these 3 terms in particular. – Form: linear, curved, clusters, no pattern – Direction: positive, negative, no direction – Strength: how closely the points fit the “form” – Note: Recall that if the relationship is not linear, we will not bother to describe direction or strength.
18
Examples – Describe each plot
– Form: Linear, Direction: positive, Strength: strong
– Form: Linear, Direction: negative, Strength: moderate
– Form: No relationship. Note that knowing x does not tell us anything new about y. As a result, the terms ‘positive/negative’ don’t apply. Neither does the strength.
19
Examples Form: Non-linear. Therefore, we don’t bother trying to describe direction or strength. Form: Linear, Direction: positive, Strength: moderate In our next lecture on scatterplots, we will discuss a tool for quantifying the strength of the relationship.
20
Lying with statistics: How (not) to scale a scatterplot
Using an inappropriate scale for a scatterplot can give an incorrect impression. Ideally, both variables should be given a similar amount of space:
– The plot should be roughly square
– The points should occupy most of the plot space
[Figure: four scatterplots showing the same data at different scales]
21
How to scale a scatterplot
[Figure: four scatterplots showing the same data at different scales]
In other words, if faced with this group of plots, you should be suspicious of most of them!
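The guideline that points should occupy most of the plot space can be turned into a small calculation. The sketch below picks axis limits that hug the data with a little padding, and measures what fraction of an axis the data actually spans. The function names, the 5% padding, and the deliberately poor 0-to-1 BAC axis are assumptions of mine for illustration. The data are the beers/BAC values from earlier in the lecture.

```python
beers = [5, 2, 9, 7, 3, 3, 4, 5, 8, 3, 5, 5, 6, 7, 1, 4]
bac = [0.1, 0.03, 0.19, 0.095, 0.07, 0.02, 0.07, 0.085,
       0.12, 0.04, 0.06, 0.05, 0.1, 0.09, 0.01, 0.05]


def axis_limits(values, pad_fraction=0.05):
    """Axis limits that extend slightly past the data on both sides.

    pad_fraction is a hypothetical default: 5% of the data range per side.
    """
    low, high = min(values), max(values)
    pad = (high - low) * pad_fraction
    return low - pad, high + pad


def fill_fraction(values, limits):
    """Fraction of the axis length that the data actually spans."""
    axis_low, axis_high = limits
    return (max(values) - min(values)) / (axis_high - axis_low)


good = fill_fraction(bac, axis_limits(bac))
poor = fill_fraction(bac, (0.0, 1.0))  # an axis running from 0 to 1
print(f"padded limits: data fill {good:.0%} of the BAC axis")
print(f"0-to-1 limits: data fill {poor:.0%} of the BAC axis")
```

With the padded limits the data span roughly nine tenths of the axis. On a 0-to-1 axis the same BAC values are squeezed into less than a fifth of it, which flattens the pattern and makes the relationship harder to judge.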
22
Outliers An outlier is a data value that has a very low probability of occurrence (i.e., it is unusual or unexpected). In a scatterplot, outliers are points that fall outside of the overall pattern of the relationship.
23
Outliers
Not an outlier: The upper right-hand point here is not an outlier of the relationship. It is what you would expect for this many beers given the linear relationship between beers/weight and blood alcohol.
Outlier: This point is not in line with the others, so it is an outlier of the relationship.
24
IQ score and Grade point average
Describe in words what this plot shows. We are looking to see if there is a relationship between IQ score and GPA. Describe the direction, shape, and strength. Are there outliers?
– Shape: linear
– Direction: positive
– Strength: appears somewhat weak
– Outliers present? There appear to be outliers, but it is hard to say.
25
IQ score and Grade point average Are there outliers present? The circled datapoints (and perhaps some of the others too) appear to be outliers. Still, it is hard to say. How do we decide? Recall that on a scatterplot, we consider a datapoint to be an outlier if it is way off the “line”. If the “regression” line (the line through the points) looks like the one here, then both IQ scores (circled) would almost certainly be considered outliers.
26
IQ score and Grade point average Are there outliers present? If the regression line looks like the one drawn here, then certainly the lower circled datapoint (and probably some of the others nearby as well) would be considered outliers.
27
IQ score and Grade point average Are there outliers present? Conversely, if the regression line looks like the one drawn here, then certainly the upper circled datapoint (and probably several of the others nearby as well) would be considered outliers. But the lower one would not be.
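The point of the last few slides, that which datapoints look like outliers depends on which line you draw, can be sketched in code. Since the IQ/GPA values are not listed in these slides, the sketch below reuses the beers/BAC data from earlier in the lecture. The two lines and the cutoff are hand-picked by me purely for illustration. They are not regression lines, and the helper name `far_from_line` is hypothetical.

```python
beers = [5, 2, 9, 7, 3, 3, 4, 5, 8, 3, 5, 5, 6, 7, 1, 4]
bac = [0.1, 0.03, 0.19, 0.095, 0.07, 0.02, 0.07, 0.085,
       0.12, 0.04, 0.06, 0.05, 0.1, 0.09, 0.01, 0.05]

CUTOFF = 0.0375  # arbitrary vertical distance, chosen only for this demo


def far_from_line(xs, ys, slope, intercept, cutoff):
    """Return the (x, y) points whose vertical distance from the line
    y = slope * x + intercept is larger than cutoff."""
    return [(x, y) for x, y in zip(xs, ys)
            if abs(y - (slope * x + intercept)) > cutoff]


# Two candidate lines drawn "by eye" (hypothetical, not fitted to the data).
steeper = far_from_line(beers, bac, slope=0.02, intercept=-0.01, cutoff=CUTOFF)
flatter = far_from_line(beers, bac, slope=0.01, intercept=0.02, cutoff=CUTOFF)

print("far from the steeper line:", steeper)
print("far from the flatter line:", flatter)
```

With the steeper line, the student with 9 beers and a BAC of 0.19 sits close to the line, and two other students (5 beers at 0.05, 7 beers at 0.09) are the ones flagged. With the flatter line, the 9-beer student is the only point flagged. Same data, different line, different "outliers", which is why we need a principled way to choose the line.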
28
WHICH line, then, is the “correct” regression line? Answer: Once again, we use a mathematical model to draw a regression line. We will discuss how to do so in our next lecture on scatterplots.