Presentation on theme: "Examining Relationships Prob. And Stat. CH.2.1 Scatterplots."— Presentation transcript:
Examining Relationships Prob. And Stat. CH.2.1 Scatterplots
The relationships between two variables can be strongly influenced by other variables that are lurking in the background. Because variation is everywhere, statistical relationships are overall tendencies, not ironclad rules. They allow individual exceptions. To study a relationship between two variables, we measure both variables on the same individual. We often think that one of the variables explains or influences the other.
Response Variable, Explanatory Variable A response variable measures an outcome of a study. An explanatory variable explains or influences changes in a response variable.
A response variable measures an outcome of a study. (Independent Variables) An explanatory variable explains or influences changes in a response variable. (Dependent Variables) Because of the multiple meanings that the words “independent”, and “dependent”, have in statistics, we should try to avoid using them. It is easier to identify explanatory and response variables when we actually set values of one variable in order to see how it affects another variable.
When we don’t set the values of either variable but just observe both variables, there may or may not be explanatory and response variables. Whether there are depends on how we plan to use the data.
EXAMPLE 2.1 The effects of alcohol Alcohol has many effects on the body. One effect is a drop in body temperature. To study this effect, researchers give several different amounts of alcohol to mice, then measure the change in each mouse’s body temperature in the 15 minutes after taking the alcohol. Amount of alcohol is the explanatory variable, and change in body temperature is the response variable.
EXAMPLE 2.2SAT scores Jim wants to know the average SAT math and verbal scores in the 51 states (including the District of Columbia) are related to each other. He doesn’t think that either score explains or causes the other. Jim has two related variables, and neither is an explanatory variable. Julie looks at the same data. She asks, “Can I predict a state’s SAT math score if I know its SAT verbal score?” Julie is treating the verbal score as the explanatory variable and the math score as the response variable.
Alcohol actually causes a change in body temperature. There is no cause-and –effect relationship between SAT math and verbal scores. Because the scores are closely related, we can use a state’s SAT verbal score to predict its math score. In section 2.3 we will look at predictions, that require us to identify an explanatory variable and a response variable Most statistical studies examine data on more than one variable. If we call one variable explanatory and the other response doesn’t mean that changes in one cause changes in the other.
Statistical Principles First plot the data, then add numerical summaries. Look for overall patterns and deviations from those patterns. When the overall pattern is quite regular, use a compact mathematical model to describe it.
The most common way to display the relation between two quantitative variables is a scatterplot. Always plot the explanatory variable, if there is one, on the horizontal axis (the x axis) of a scatterplot. Usually we call the explanatory variable x and the response variable y.
SCATTERPLOT A scatterplot shows the relationship between two quantitative variables measured on the same individuals. The values of one variable appear on the horizontal axis. Each individual in the data appears as the point in the plot fixed by the values of both variables for that individual.
EXAMINING A SCATTERPLOT In any graph of data, look for the overall pattern and for striking deviations from that pattern. You can describe the overall pattern of a scatterplot by the form, direction, and strength of the relationship. An important kind of deviation is an outlier, an individual value that falls outside the overall pattern of the relationship.
Analysis of SAT/ACT Scatterplots Figure 2.1 shows a clear form: there are two distinct clusters of states with a gap between them. In the cluster at the right of the plot, 45% or more of high school graduates take the SAT, and the average scores are low. The states in the cluster at the left have higher SAT scores and lower percents of graduates taking the test. There are no clear outliers. That is, no points fall clearly outside the clusters.
What explains the clusters? There are two widely used college entrance exams, the SAT and the America College Testing (ACT) exam. Each state favors one or the other. The left cluster in figure 2.1 contains the ACT states, and the SAT states make up the right cluster. In ACT states, most students who take the SAT are applying to a selective college that requires SAT scores. This select group of students has a higher average score than the much larger group of students who take the SAT in SAT states.
The relationship in Figure 2.1 also has a clear direction: states in which a higher percent of students take the SAT tend to have lower average scores. This is a negative association between the two variables.
POSITIVE ASSOCIATION, NEGATIVE ASSOCIATION Two variables are positively associated when above-average values of one tend to accompany above-average values of the other and below- average values also tend to occur together. Two variables are negatively associated when above-average values of one tend to accompany below-average values of the other, and vice versa.
The strength of a relationship in a scatterplot is determined by how closely the points follow a clear form. The overall relationship in Figure 2.1 is not strong-states with similar percents taking the SAT show quite a bit of scatter in their average scores. Here is an example of a stronger relationship with a clearer form.
Not all relationships are linear in form. They can be nonlinear. Not all relationships have a clear direction that we can describe as a positive association or a negative association.
Adding categorical variables to scatterplots Looking at SAT scores, we can add categorical variables for Southern and Non-southern states, adds a third variable to the scatterplot. Figure 2.3 uses two different colors to distinguish the difference. Use different colors or symbols to plot points when you want to add a categorical variable to a scatterplot. We can use different symbols if two or more values occupy the same point or represent more than one individual.
Example 2.5Is the South different? Figure 2.3 enhances the scatterplot in Figure 2.1 by plotting the southern states in purple. (We took the South to be the states in the East South Central and South Atlantic regions.) Most of the southern states blend in with the rest of the country. Several southern states do lie at the lower edges of their clusters, along with the District of Columbia, which is a city rather than a state. Georgia, South Carolina, and West Virginia have lower SAT scores than we would expect from the percent of their high school graduates who take the examination.
A scatterplot displays the form, direction, and strength of the relationship between two quantitative variables.