1
**The Practice of Statistics**

Daniel S. Yates, The Practice of Statistics, Third Edition. Chapter 3: Examining Relationships. Copyright © 2008 by W. H. Freeman & Company

2
**3.1 – Scatterplots and Correlation**

What are some situations in which we might want to examine a relationship between two variables?

- Height and heart attacks
- Weight and blood pressure
- Hours studying and test scores
- What else?

In this chapter we deal with relationships between quantitative variables; the next chapter deals more with categorical variables.

3
**Explanatory vs. Response**

The response variable is our dependent variable (traditionally y). The explanatory variable is our independent variable (traditionally x).

4
**Explanatory or Response?**

Which is the explanatory variable and which is the response variable?

- Jim wants to know how the mean 2005 SAT Math and Verbal scores in the 50 states are related to each other. He doesn't think that either score explains or causes the other.
- Julie looks at some data. She asks, “Can I predict a state's mean 2005 SAT Math score if I know its mean 2005 SAT Verbal score?”

5
**Explanatory and Response Variables**

When we deal with cause and effect, there is always a definite response variable and explanatory variable. But calling one variable response and one variable explanatory doesn't necessarily mean that one causes change in the other.

6
**Data Analysis Toolbox**

When analyzing several-variable data, the same principles apply. To answer a statistical question of interest involving one or more data sets, proceed as follows.

- DATA: Organize and examine the data. Answer the key questions (W5HW).
- GRAPHS: Construct appropriate graphical displays.
- NUMERICAL SUMMARIES: Calculate relevant summary statistics.
- INTERPRETATION: Look for overall patterns and deviations. When the overall pattern is regular, use a mathematical model to describe it.

7
**Scatterplots**

Let's say we want to examine the relationship between the percent of a state's high school seniors who took the SAT exam in 2005 and the mean SAT Math score in that state that year. A scatterplot is an effective way to represent our data graphically. But first, what is the explanatory variable and what is the response variable in this situation?

8
**Scatterplots**

Once we decide on the response and explanatory variables, we can create a scatterplot.

(Figure: scatterplot with its axes labeled "response variable" and "explanatory variable".)

9
Scatterplot

10
**Scatterplot Tips**

- Plot the explanatory variable on the horizontal axis. If there is no explanatory-response distinction, either variable can go on the horizontal axis.
- Label both axes!
- Scale the horizontal and vertical axes. The intervals must be uniform on each axis (but the two axes do not have to use the same scale).
- If you are given a grid, try to adopt a scale so that your plot uses the whole grid.
- Make your plot large enough so that the details can be easily seen.

11
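**Note: there is no outlier rule for bivariate data (like the 1.5 × IQR rule)**

We must use the definition: an outlier is an individual value that falls outside the overall pattern of the relationship.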

12
**Overall Pattern**

- Direction: positive, negative (or none)
- Form: linear, nonlinear
- Strength: how closely the points follow the form (r-value)

13
Positive vs. Negative

15
**Interpreting Scatterplots**

Direction? Form? Strength? Outliers?

16
**Adding Categorical Data**

The mean SAT Math scores and percent of high school seniors who take the test, by state, with the southern states highlighted. Is the South different?

17
**Measuring Linear Association: Correlation**

Linear relations are important because, when we discuss the relationship between two quantitative variables, a straight line is a simple pattern that is quite common. A strong linear relationship has points that lie close to a straight line. A weak linear relationship has points that are widely scattered about a line.

18
**Correlation**

(Figure: scatterplots contrasting a strong association with a weak association.)

19
**Our eyes are not good measures of how strong a linear relationship is...**

A numerical measure along with a graph gives the linear association an exact value.

20

21
**Facts about Correlation**

- Correlation makes no distinction between explanatory and response variables.
- r doesn't change when we change the units of measurement of x, y, or both.
- r is positive when the association is positive and negative when the association is negative.
- The correlation r is always a number between -1 and 1. Values of r near 0 indicate a very weak linear relationship. The strength of the linear relationship increases as r moves away from 0 toward either -1 or 1.
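The facts above can be checked numerically. The following is a minimal sketch in Python, assuming the usual textbook definition of r as the sum of products of standardized values divided by n - 1; the `hours` and `scores` data are hypothetical and are not part of the presentation.

```python
import math

def correlation(xs, ys):
    """Correlation r: sum of products of standardized values, divided by n - 1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return sum(((x - mean_x) / s_x) * ((y - mean_y) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data (illustration only): hours studied and test scores.
hours = [1, 2, 3, 4, 5, 6]
scores = [55, 61, 70, 68, 80, 85]

r = correlation(hours, scores)
r_swapped = correlation(scores, hours)                     # swap the roles of x and y
r_minutes = correlation([h * 60 for h in hours], scores)   # change units: hours to minutes
r_negated = correlation(hours, [-s for s in scores])       # flip the direction of association

print(r, r_swapped, r_minutes, r_negated)
```

Swapping the variables or changing units leaves r unchanged, while reversing the direction of the association only flips its sign.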

22
**Patterns closer to a straight line have correlations closer to 1 or -1**

23
**Cautionary Notes about Correlation**

- Correlation requires that both variables be quantitative.
- Correlation does not describe curved relationships, no matter how strong they are.
- Like the mean and standard deviation, the correlation is not resistant; r is strongly affected by a few outlying observations.
- Correlation is not a complete summary of two-variable data. You should give the means and standard deviations of both x and y along with the correlation.
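Two of these cautions are easy to see with small made-up data sets. The sketch below (Python, hypothetical data, with r computed from sums of squares) shows a perfect curved relationship with r = 0, and a perfectly linear data set whose r collapses once a single outlier is added.

```python
import math

def correlation(xs, ys):
    """Correlation r computed as Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# A perfect but curved relationship: y = x^2 on symmetric x values.
x_curve = [-3, -2, -1, 0, 1, 2, 3]
y_curve = [x ** 2 for x in x_curve]
r_curved = correlation(x_curve, y_curve)   # 0: r misses the curved pattern entirely

# A perfect line, then the same data with one outlier added at (7, 0).
x_line = [1, 2, 3, 4, 5, 6]
y_line = [2, 4, 6, 8, 10, 12]
r_line = correlation(x_line, y_line)                  # 1 for a perfect line
r_outlier = correlation(x_line + [7], y_line + [0])   # 0.25 after one outlier

print(r_curved, r_line, r_outlier)
```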

24
**Cautionary Notes about Correlation**

Many data sets can have the same r value but completely different relationships. ALWAYS PLOT YOUR DATA!

Correlation applet

25
**3.2 – Least-Squares Regression**

When a scatterplot shows a linear relationship, we would like to summarize the overall pattern by drawing a line on the scatterplot.

- Regression line: describes how a response variable y changes as an explanatory variable x changes.
- Regression requires an explanatory variable and a response variable.

26
**y-intercept does not always make sense**

Slope: represents the predicted or average change in the response; we must be very specific when interpreting it.

27
**Regression Lines**

Once we have our regression line, we can use it to predict responses.

Extrapolation: using the line for predictions outside the range of values of the explanatory variable. Such predictions are often not accurate.

28
**That’s one big rat!!!**

Some data were collected on the weight of a male white laboratory rat following its birth. A scatterplot of the weight (in grams) and time since birth (in weeks) shows a fairly strong positive linear relationship. The linear regression equation models the data fairly well.

weight = (time)

- a) Interpret the slope in the context of this setting.
- b) Interpret the y-intercept in this setting.
- c) Would you be willing to use this line to predict the rat’s weight at age 2 years? (There are 454 grams in a pound.)

29
**That’s one big rat!!!**

- Slope: for every one-week increase in age, the rat will increase its weight by an average of grams.
- y-intercept: an estimate for the birth weight ( grams) of this male rat.
- No, this would be extrapolation. The rat would weigh approximately 4,260 grams, or 9.4 lbs. This is what a medium-sized cat weighs!

30
**The Least-Squares Regression Line (LSRL)**

- In most cases, no line will pass exactly through all of the points in a scatterplot.
- Our eyes are not a good judge of the best line.
- Because we use the line to predict y from x, the prediction errors are errors in y, the vertical direction.
- A good regression line makes the vertical distances of the points from the line as small as possible.

31
**The Least-Squares Regression Line (LSRL)**

32
**Does fidgeting keep you slim?**

Refer to today’s handout for scatterplot

33
**Use your calculator and program CORR to find the equation of the LSRL for the NEA and Fat gain data.**

34
Using Program CORR

35
**NEA and Fat Gain**

Fat gain = 3.505 – 0.00344(NEA change)
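Using a fitted line for prediction can be sketched in a few lines of Python. The coefficients come from the equation on this slide; the function name `predict_fat_gain` is hypothetical, and the extrapolation check assumes the observed NEA changes run from -94 to 690, the smallest and largest x values listed in the residual table later in the presentation.

```python
# Coefficients from the slide: Fat gain = 3.505 - 0.00344(NEA change).
INTERCEPT = 3.505
SLOPE = -0.00344

# Assumed range of observed NEA changes (smallest and largest x in the residual table).
X_MIN, X_MAX = -94, 690

def predict_fat_gain(nea_change):
    """Return (predicted fat gain, True if the prediction is an extrapolation)."""
    predicted = INTERCEPT + SLOPE * nea_change
    is_extrapolation = not (X_MIN <= nea_change <= X_MAX)
    return predicted, is_extrapolation

inside, inside_flag = predict_fat_gain(400)      # within the observed x range
outside, outside_flag = predict_fat_gain(1500)   # far outside the observed x range

print(inside, inside_flag)
print(outside, outside_flag)
```

The second prediction is negative and flagged as an extrapolation, which is the kind of result that should not be trusted.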

36
**NEA and Fat Gain**

Make the errors in predicting y as small as possible by minimizing the sum of the squares of the vertical distances of the data points from the line.
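To make the least-squares idea concrete, here is a rough Python sketch that tries many candidate lines and keeps the one with the smallest sum of squared vertical distances. It uses the small five-point data set from the Residuals slide (x: 1 2 3 4 5, y: 2 4 6 8 15); the grid of candidate intercepts and slopes is an arbitrary choice for illustration, not how a calculator actually finds the line.

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 15]

def sum_squared_residuals(intercept, slope):
    """Sum of squared vertical distances from the points to the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Candidate lines on a coarse grid: intercepts -5.0 to 5.0, slopes 0.0 to 5.0, step 0.5.
candidates = [(i / 2, j / 2) for i in range(-10, 11) for j in range(0, 11)]

best_intercept, best_slope = min(
    candidates, key=lambda line: sum_squared_residuals(*line)
)
best_ssr = sum_squared_residuals(best_intercept, best_slope)

# For comparison: y = 2x passes exactly through the first four points.
ssr_through_four_points = sum_squared_residuals(0, 2)

print(best_intercept, best_slope, best_ssr)   # -2.0 3.0 10.0
print(ssr_through_four_points)                # 25
```

The line through four of the five points looks tempting by eye, but its single large miss costs more than the smaller errors spread out by the least-squares line.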

37
**Understanding the LSRL**

38
**How well does the line fit the data?**

Two ways:

- Residual plot
- Coefficient of determination, r²

Residual: the difference between the observed value of the response and the predicted value (residual = observed y – predicted y). It shows how much error there is in the LSRL.

39
**Residuals**

x: 1 2 3 4 5

y: 2 4 6 8 15

Find the LSRL: ŷ = -2 + 3x, which gives predicted values 1 4 7 10 13

Residuals (y – ŷ): 1 0 -1 -2 2
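The calculation on this slide can be reproduced by hand or in code. A minimal Python sketch, assuming the standard least-squares formulas (slope = Sxy / Sxx, intercept = ȳ minus slope times x̄) and using the five points above:

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 15]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares slope and intercept.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

# Predicted values and residuals (observed minus predicted).
predicted = [intercept + slope * x for x in xs]
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]

print(intercept, slope)   # -2.0 3.0
print(predicted)          # [1.0, 4.0, 7.0, 10.0, 13.0]
print(residuals)          # [1.0, 0.0, -1.0, -2.0, 2.0]
```

The residuals sum to zero, which is always the case for a least-squares line.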

40
**Residual Plot**

Plot (x, residual).

- The residual plot should show no obvious pattern.
- Curved: a linear model is not a good fit.
- Fanning: predictions will be less accurate for larger/smaller x.
- Residuals should be small.

41
**Residuals need to be small…but what’s small enough?**

- Use the standard deviation of the residuals.
- It measures the typical prediction error.
- In the five-point example, predictions are typically off by about 1.83.
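The value 1.83 is consistent with the five-point example (x: 1 2 3 4 5, y: 2 4 6 8 15, predicted values 1 4 7 10 13). A short Python sketch, assuming the usual regression formula s = sqrt(sum of squared residuals / (n - 2)):

```python
import math

ys = [2, 4, 6, 8, 15]
predicted = [1, 4, 7, 10, 13]   # predicted values listed on the Residuals slide

residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]
n = len(residuals)

# Standard deviation of the residuals, with n - 2 in the denominator.
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

print(round(s, 2))   # 1.83
```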

42
**Residuals for NEA & Fat Gain**

NEA change (x) and residuals:

x: -94 -57 -29 135 143 151 245 355

residual: .37 -.701 .095 -.34 .187 .61 -.26 -.98

x: 392 473 486 535 571 580 620 690

residual: 1.64 -.18 -.23 .54 -.54 -1.11 .93 -.03

In calculator:

- L3 = Y1(L1) gives all predicted values
- L4 = L2 – L3 (actual – predicted)

43
**Residuals for NEA & Fat Gain**

- Make a scatterplot of the residuals: L1, L4
- 1-Var Stats on L4: Sres = 0.71
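The 0.71 can be reproduced from the sixteen residuals listed on the previous slide. The Python sketch below uses `statistics.stdev`, which divides by n - 1 just as a calculator's 1-Var Stats does. For comparison it also computes the version with n - 2 in the denominator, which is the usual definition of the standard deviation of regression residuals; that comparison is an addition here, not something the slides state.

```python
import math
import statistics

# Residuals for the NEA and fat gain data, as listed on the slide.
residuals = [0.37, -0.701, 0.095, -0.34, 0.187, 0.61, -0.26, -0.98,
             1.64, -0.18, -0.23, 0.54, -0.54, -1.11, 0.93, -0.03]
n = len(residuals)

# What 1-Var Stats reports: the sample standard deviation (n - 1 divisor).
s_one_var = statistics.stdev(residuals)

# Regression convention: divide the sum of squared residuals by n - 2.
s_regression = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

print(round(s_one_var, 2))      # 0.71
print(round(s_regression, 2))   # 0.74
```

With 16 observations the two versions differ only slightly, but the gap grows for very small data sets.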

44
**Residual Plots**

Scattered…no real pattern. A line is a good model.

Curved pattern. A line may not be the best model.

45
**Residual Plots**

Fanning…more spread for larger values of x. Prediction will be less accurate when x is large.

HW: pg #39, 40

46
**Using r² to determine how well the line fits the data**

- r²: coefficient of determination
- The proportion of variation in y that is explained by the LSRL
- How well the LSRL does at predicting values of the response
- How much better the LSRL is at predicting responses than if we just used ȳ, the mean of y, as our prediction

47
**We know that the LSRL minimizes the sum of the squared residuals…**

Compare the sum of squared residuals of the LSRL to the sum of squared residuals of the mean model, ŷ = ȳ. Use the NEA and Fat Gain data. Create a new list and use 1-Var Stats to find each sum of squares.

48
This gives us the proportion of how much error there is in the LSRL model with respect to the error in the mean model. How can we use this to determine how much better the LSRL is (r²)?

r² = 1 – (sum of squared residuals of the LSRL) / (sum of squared residuals of the mean model) = .6066
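The same comparison can be sketched in Python. The fat gain values themselves are not listed in these slides, so this example uses the five-point data set from the Residuals slide instead (y: 2 4 6 8 15, predicted values 1 4 7 10 13); the variable names are illustrative only.

```python
ys = [2, 4, 6, 8, 15]
predicted = [1, 4, 7, 10, 13]   # LSRL predictions from the Residuals slide

mean_y = sum(ys) / len(ys)

# Error of the LSRL: squared residuals about the line.
sse_line = sum((y - y_hat) ** 2 for y, y_hat in zip(ys, predicted))

# Error of the mean model: squared deviations about the mean of y.
sse_mean = sum((y - mean_y) ** 2 for y in ys)

r_squared = 1 - sse_line / sse_mean

print(sse_line, sse_mean, r_squared)
```

Here the line leaves 10 units of squared error where the mean alone leaves 100, so the line explains 90% of the variation in y.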

49
**So what does this mean?**

60.6% of the variation in fat gain is explained by the LSRL relating fat gain and non-exercise activity. The other 39.4% is individual variation that is not explained by this linear relationship.

50
**If all the points lie on the LSRL, then every residual is 0 and r² = 1**

- All of the variation in y is explained by the linear relationship with x.
- Worst-case scenario: r² = 0, so 0% of the variation is explained by the line.
- When reporting a regression, always give r² to show how successful the line was in explaining the response.

51
**Facts about LSRL**

- The distinction between explanatory and response variables is essential. (We will get a different line if they are reversed.)
- There is a close connection between correlation and slope.
- The LSRL always passes through the point (x̄, ȳ).
- r describes the strength of the straight-line relationship.
- r² is the proportion of variation in y that is explained by the least-squares regression of y on x.
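These facts can be checked on a small example. The Python sketch below reuses the five-point data set from the Residuals slide and assumes the standard relationships: slope = r times s_y / s_x, and intercept = ȳ minus slope times x̄. The variable names are illustrative only.

```python
import math

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 15]
n = len(xs)

mean_x, mean_y = sum(xs) / n, sum(ys) / n
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

r = sxy / math.sqrt(sxx * syy)
s_x = math.sqrt(sxx / (n - 1))
s_y = math.sqrt(syy / (n - 1))

# Connection between correlation and slope: b = r * s_y / s_x.
slope = r * s_y / s_x
intercept = mean_y - slope * mean_x

# The line passes through (mean_x, mean_y).
y_at_mean_x = intercept + slope * mean_x

# Reversing the roles (regressing x on y) gives a different line.
slope_x_on_y = sxy / syy
implied_slope_if_reversed = 1 / slope_x_on_y   # slope when that line is drawn on the same axes

print(slope, intercept, y_at_mean_x)
print(implied_slope_if_reversed)
print(r ** 2)
```

The regression of y on x has slope 3, while the regression of x on y, drawn on the same axes, has slope of about 3.33, so the two lines are not the same.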
