Chapter 10 Exploring Relationships Between Numerical Variables


Objectives SWBAT:
- Create and examine scatterplots of data
- Describe the direction, form, and strength of an association
- Use correlation to measure the strength of a linear association
- Test for correlation

In recent years, golfers have been driving the golf ball farther than before, partly due to equipment, and partly due to the golfers themselves. However, trying to hit the ball as far as possible off the tee can result in less accuracy, and a more challenging second shot playing out of the rough. Each time a golfer tees off, they need to decide whether to try to hit the ball very far, which would leave a shorter and easier second shot but risks missing the fairway, or to ease up a bit and make sure to hit the ball down the middle even though the ball won’t go as far.

In order to examine situations like the one on the previous slide, we need to expand beyond what we have learned so far, which has only focused on one variable at a time, such as home runs, rushing yards, or points per game. We will now begin to investigate the relationships between two numerical variables.

In order to begin analyzing this situation we’ll investigate the association between average driving distance and driving accuracy for the top 10 women golfers in the LPGA in 2009. Average driving distance is our explanatory variable and driving accuracy is our response variable. We will see if average driving distance can help explain accuracy. In other words, we want to see if the explanatory variable can help predict the value of the response variable.

In order to display the relationship between two numerical variables, we will create a scatterplot (this is our only option). Some things to keep in mind when making scatterplots:
1) It is essential to include labels and clear, consistent scales on each axis.
2) Neither axis needs to start at 0.
3) The explanatory variable will be plotted on the horizontal axis and the response variable will be plotted on the vertical axis.
Note: the response variable is usually the one we are more interested in, either because we want to predict the response for a particular value of the explanatory variable or because we want to use the explanatory variable to explain changes in the response variable.

Steps to Make a Scatterplot
1) Label and scale the horizontal (explanatory) axis and vertical (response) axis.

2) Draw a dot for each player to represent the ordered pair.

3) Finish the scatterplot by drawing dots for the remaining players.
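The three steps above can also be sketched in code. What follows is a minimal illustration in Python, not part of the original slides: the function name `text_scatterplot` is made up for this example, and the six (distance, accuracy) pairs are placeholder values, not the real top-10 LPGA data. It scales each axis to the range of the data (so neither axis starts at 0), puts the explanatory variable on the horizontal axis, and draws one dot per ordered pair.

```python
# Hypothetical (average driving distance in yards, driving accuracy in %) pairs.
# These are placeholder values, NOT the real 2009 LPGA data.
players = [
    (246.0, 78.1), (251.5, 74.0), (255.2, 72.4),
    (259.8, 69.9), (263.1, 70.5), (268.4, 65.2),
]


def text_scatterplot(pairs, x_label, y_label, width=40, height=12):
    """Return the lines of a text scatterplot.

    Assumes at least two distinct x values and two distinct y values.
    """
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)

    # Step 1: scale each axis to the data range (neither axis starts at 0).
    def column(x):
        return round((x - x_min) / (x_max - x_min) * (width - 1))

    def row(y):
        return round((y - y_min) / (y_max - y_min) * (height - 1))

    # Steps 2 and 3: draw one dot per ordered pair.
    grid = [[" "] * width for _ in range(height)]
    for x, y in pairs:
        # Row 0 of the grid is the top of the plot, so flip the y index.
        grid[height - 1 - row(y)][column(x)] = "*"

    lines = [f"{y_label}: {y_min:g} (bottom) to {y_max:g} (top)"]
    lines += ["|" + "".join(cells) for cells in grid]
    lines.append("+" + "-" * width)
    lines.append(f"{x_label}: {x_min:g} (left) to {x_max:g} (right)")
    return lines


plot_lines = text_scatterplot(
    players, "Average driving distance (yards)", "Driving accuracy (%)"
)
print("\n".join(plot_lines))
```

With these placeholder values the dots drift from the upper left to the lower right, which is what a negative association looks like.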

Making a Scatterplot Using the TI-84
1) Enter the average driving distance values in L1 and the driving accuracy values in L2.
2) Open STAT PLOT by pressing 2nd Y=. Turn on Plot1 (make sure all other plots are off). Choose the first graph type, and enter L1 for Xlist and L2 for Ylist.

3) Press ZOOM, select 9: ZoomStat, and press ENTER to see the scatterplot in a suitable window.
4) To see the values used by the graphing calculator for the scales, press WINDOW. On this screen, you can change the starting, ending, or increment values to something more convenient and then regraph.

Describing the Association Between Variables
After constructing a scatterplot, there are several important characteristics of the association between the variables that should be considered:
1. the direction of the association
2. the form of the association
3. the strength of the association

Describing the Direction of an Association
Here is a scatterplot showing the average driving distance and driving accuracy for the top 146 money winners on the LPGA tour in 2009. Based on the scatterplot, we can see there is a negative association between average driving distance and driving accuracy, meaning that players who drive the ball farther typically hit a smaller percentage of fairways; those who don’t hit the ball as far typically hit a higher percentage of fairways.

Here is the relationship between a golfer’s average driving distance and her percentage of greens in regulation (GIR). A green in regulation means you hit the ball onto the putting surface (the green) in at least two shots under par. There is a positive association between average driving distance and GIR. Typically, players who drive the ball farther have a better chance to land on the putting surface in at least two shots under par.

It is important to remember that a positive or negative association is describing the overall tendency of the data, NOT an absolute relationship. Example: Even though players who drive the ball farther typically hit a higher percentage of greens, this is not always true. Lee drives the ball farther than Creamer, but Lee hits a lower percentage of GIR. This is the opposite of what we’d expect.

Sometimes, two variables have no association. The scatterplot to the right shows the relationship between driving accuracy and putting average. Also included are the mean putting average and mean driving accuracy. In this case, knowing a golfer’s driving accuracy tells us nothing about how many putts she will average per round. Therefore, we can say there is no association between driving accuracy and putting average. To recap: if two variables have an association, then knowing the value of one variable will help you predict the value of the other variable. However, if there is no association, then knowing the value of one will not help you predict the value of the other.

Describing the Form of an Association
The form of an association can be either linear or nonlinear. If the association is linear, then a line would be a reasonable way to model the overall relationship. But if an association is nonlinear, then a line won’t match the pattern of the scatterplot very well. Examples:

Describing the Strength of an Association
The strength of an association describes the amount of scatter there is from the overall form of the data. In other words, it describes how closely the points on the scatterplot conform to the linear (or nonlinear) form. In a strong association, there isn’t much scatter and predictions of the response variable will be fairly precise.

Examples of positive, linear associations with different amounts of strength.

Negative linear associations can show the same range of strengths. Here is an example of a strong negative linear association:

Describe the strength of these associations: moderate association, strong association, very strong association.

Let’s return to the initial question: is it better to drive the ball long or drive it straight? Let’s look at two different scatterplots: one comparing average driving distance to scoring average, and one comparing driving accuracy to scoring average.
Average driving distance vs. scoring average: negative relationship. Golfers who drive the ball farther tend to have lower scores (this is a good thing in golf!).
Driving accuracy vs. scoring average: negative relationship. Golfers who hit the fairway more often tend to have lower scores (again, lower scores are good!).

Both scatterplots show negative associations, but which one shows the stronger relationship? In other words, which explanatory variable, average driving distance or driving accuracy, is a more reliable predictor of scoring average? Neither association is strong, but average driving distance as an explanatory variable conforms more to a linear pattern than does driving accuracy. Therefore, low scores in golf tend to be more strongly associated with average driving distance than with driving accuracy.

Measuring Strength: Correlation
Judging the strength of a relationship is difficult to do simply by looking at a scatterplot. What might seem strong to one person might seem moderate to another. There is a numerical way to measure the strength of a linear association in a scatterplot, and that is called correlation.

The correlation (r) is a measure of the strength and direction of a linear association between two numerical variables. The TI-84 will calculate correlation for us, so we’ll start by talking about some of the properties of correlation. Unfortunately, Park Place and Boardwalk are not part of correlation’s properties.

Properties of the Correlation (r)

5) The value of r has no units and is not dependent on the units used to measure the variables. This makes it an ideal way to compare the strengths of the associations between different sets of variables. This also means that changing the units of the explanatory or response variable won’t affect the correlation.
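Property 5 can be checked numerically. The sketch below is an illustration only: the helper `correlation` and the six data values are assumptions made for this example (placeholder numbers, not the LPGA data). It computes r once with distance in yards and accuracy in percent, then again after converting to meters and proportions.

```python
from math import sqrt


def correlation(xs, ys):
    """Correlation r between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical placeholder values, NOT the real LPGA data.
distance_yards = [246.0, 251.5, 255.2, 259.8, 263.1, 268.4]
accuracy_pct = [78.1, 74.0, 72.4, 69.9, 70.5, 65.2]

# Change the units of both variables: yards -> meters, percent -> proportion.
distance_meters = [d * 0.9144 for d in distance_yards]
accuracy_prop = [a / 100 for a in accuracy_pct]

r_original = correlation(distance_yards, accuracy_pct)
r_converted = correlation(distance_meters, accuracy_prop)
print(r_original, r_converted)
```

The two printed values agree up to floating-point rounding, because rescaling a variable rescales its deviations and its spread by the same factor.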

To give you a visual, here are some examples of different correlations, using data from the 32 NFL teams in the 2008 regular season.

Now it’s time for everyone’s favorite game: Guess the Correlation! http://www.rossmanchance.com/applets/

Something to be aware of: Correlation and association are NOT synonyms. Association is a more general word to describe the relationship between any two variables, whether numerical or categorical. Correlation is a specific measure of the strength and direction of a linear association between two numerical variables.

Calculating Correlation on the TI-84
1) Turn on the diagnostic feature. Press 2nd 0 to enter the CATALOG. Scroll down to DiagnosticOn. Press ENTER twice so it says Done on the home screen.

2) Enter the top 10 LPGA data in L1 and L2.
3) Press the STAT button, move to the CALC menu, and scroll down to select number 8: LinReg(a+bx). After choosing it, enter L1, L2 and press ENTER. The last line of the output gives the value of r. In this case the correlation is r = -0.905.

An applet on the book website also calculates correlation. It is appropriately titled “Correlation and Regression.”

Let’s come back to this comparison, in which we initially said average driving distance seems to be a better predictor of scoring average than driving accuracy. We made this statement based only on looking at the scatterplots. It turns out that the correlation between average driving distance and scoring average is r = -0.47, compared to r = -0.23 for driving accuracy and scoring average. This result confirms our initial suspicion.

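FYI: Statistics 101
In a more traditional statistics course, you may be asked to calculate the correlation by hand. Here is the formula:
$$r = \frac{\sum \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)}{n - 1}$$
The terms being multiplied in the numerator are standardized scores for the x variable and the y variable (the products are then summed). Here $\bar{x}$ and $\bar{y}$ are the means, $s_x$ and $s_y$ are the standard deviations, and $n$ is the number of observations.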
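The by-hand calculation can be sketched in Python as follows. This is a minimal illustration, assuming the usual textbook form of the calculation: standardize each variable with its mean and sample standard deviation, multiply the paired standardized scores, add the products, and divide by n - 1. The function names and the tiny data set are made up for this example.

```python
from statistics import mean, stdev


def standardized_scores(values):
    """Standardized score (z-score) of each value: (value - mean) / SD."""
    center = mean(values)
    spread = stdev(values)  # sample standard deviation (divides by n - 1)
    return [(v - center) / spread for v in values]


def correlation_by_hand(xs, ys):
    """Sum of the products of paired standardized scores, divided by n - 1."""
    zx = standardized_scores(xs)
    zy = standardized_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)


# Tiny made-up data set so every step can be followed by hand.
x = [1, 2, 3]
y = [1, 3, 2]
print(correlation_by_hand(x, y))
```

For this data set both standard deviations are 1, so the standardized scores are -1, 0, 1 for x and -1, 1, 0 for y; the products add up to 1, and dividing by n - 1 = 2 gives r = 0.5.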

Be cautioned: Correlation does NOT imply cause-and-effect. Just think of Happy Gilmore. He was able to crush the ball off the tee, but he still finished last early in the movie because other facets of his game were not polished. So even though there is a correlation between average driving distance and scoring average, there is no guarantee that increasing driving distance will result in lower scores.

Influential Points
What effect do you think unusual observations have on correlation? Much like the mean and standard deviation, correlation is not a resistant measure, and as such it can be strongly affected by outliers. On a scatterplot, an outlier might be a point that seems out of place.

Here is a scatterplot showing the number of stolen bases and home runs for the nine primary offensive players for the 2009 Boston Red Sox. The points on the left side of the graph seem to have a positive association, but Jacoby Ellsbury’s unusual value makes the overall relationship between stolen bases and home runs look negative. In fact, the correlation here is r = -0.35. What would happen to the correlation if Ellsbury were omitted?

Here is the scatterplot with Ellsbury omitted. The correlation is now r = 0.21. Be careful: It’s not a good idea to remove observations from a data set without a good reason. If a value was recorded incorrectly, correct it and keep it in the data set. However, if you don’t know the correct value, then remove it. Any observations that do not belong, such as totals or averages of all the observations, should also be excluded.
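The pull that a single unusual point can have on r is easy to reproduce. The sketch below uses made-up numbers (not the Red Sox data): five points with a positive association, plus one point far to the right with a low y value. The helper `correlation` and the variable names are assumptions for this illustration.

```python
from math import sqrt


def correlation(xs, ys):
    """Correlation r between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Made-up values, NOT the 2009 Red Sox data.
x_typical = [1, 2, 3, 4, 5]
y_typical = [10, 12, 11, 14, 15]

# One unusual observation: very large x, small y.
x_all = x_typical + [40]
y_all = y_typical + [5]

r_without = correlation(x_typical, y_typical)
r_with = correlation(x_all, y_all)
print(r_without, r_with)
```

Here the correlation is positive without the unusual point and negative with it, the same kind of sign change seen when Ellsbury is included or omitted.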

Testing the Correlation
In sports, teams play exhibition games before the regular season begins. Some people think a team’s PERFORMANCE in exhibition games is a good predictor of how well that team will do in the regular season. Other people think PERFORMANCE in exhibition games tells us nothing about the future.

Here is a scatterplot showing the 2009 winning percentages (WP) for the 30 MLB teams during spring training and during the regular season. The correlation is r = 0.45. There seems to be a moderate, positive, linear association between spring training winning percentage and regular season winning percentage. Let’s keep in mind this is based only on PERFORMANCES for one season, so we need to account for RANDOM CHANCE before making any firm conclusions. It is possible that there really is no association.

Some terminology: The true correlation between two variables (much like ABILITY) exists only in theory. To find the true correlation between spring training winning percentage and regular season winning percentage, we would need to repeat the 2009 spring training and regular season for each team millions and millions of times and then find the correlation. The observed correlation between two variables (much like PERFORMANCE) is based on a limited amount of data, such as one season. Because it is based on a limited amount of data, the observed correlation will vary from the true correlation due to RANDOM CHANCE.

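To see if the observed data provide convincing evidence that there really is a positive association between winning percentage in spring training and winning percentage in the regular season, we will test the following hypotheses using the correlation as our test statistic.
Null hypothesis: the true correlation between spring training WP and regular season WP is 0 (no association).
Alternative hypothesis: the true correlation between spring training WP and regular season WP is greater than 0 (a positive association).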

To simulate:
- Get 30 notecards (for the 30 teams)
- Write each of the 30 regular season winning percentages on the notecards
- Shuffle
- Randomly pair one of the regular season winning percentage cards to each of the spring training winning percentages
- Calculate the correlation
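The notecard procedure above can be sketched in Python. This is an illustration under stated assumptions: the function names are made up, the eight winning percentages are placeholder values rather than the 2009 MLB data, and the p-value is estimated as the fraction of shuffles whose correlation is at least as large as the observed one (matching a test for a positive association).

```python
import random
from math import sqrt


def correlation(xs, ys):
    """Correlation r between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def simulate_p_value(spring, regular, trials=100, seed=None):
    """Shuffle the regular season values, re-pair them, and count how often
    the simulated correlation is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = correlation(spring, regular)
    cards = list(regular)  # the "notecards"; the original list is left alone
    at_least_as_large = 0
    for _ in range(trials):
        rng.shuffle(cards)  # random pairing with the spring values
        if correlation(spring, cards) >= observed:
            at_least_as_large += 1
    return observed, at_least_as_large / trials


# Placeholder winning percentages for 8 teams, NOT the real 2009 MLB data.
spring_wp = [0.45, 0.52, 0.61, 0.38, 0.55, 0.49, 0.58, 0.42]
regular_wp = [0.48, 0.50, 0.57, 0.41, 0.52, 0.47, 0.60, 0.44]

observed_r, p_value = simulate_p_value(spring_wp, regular_wp, trials=100, seed=2009)
print(observed_r, p_value)
```

Shuffling breaks any real link between the two lists, so the simulated correlations show what RANDOM CHANCE alone would produce. More trials give a steadier estimate of the p-value than 100 notecard shuffles done by hand.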

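Here are the results for 100 trials of the simulation: What is our p-value? 0%. None of the 100 simulated correlations was at least as large as the observed correlation of r = 0.45.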