 # The Practice of Statistics


Daniel S. Yates, The Practice of Statistics, Third Edition
Chapter 3: Examining Relationships
Copyright © 2008 by W. H. Freeman & Company

3.1 – Scatterplots and Correlation
What are some situations in which we might want to examine a relationship between two variables?
- Height & heart attacks
- Weight & blood pressure
- Hours studying & test scores
- What else?

In this chapter we will deal with relationships between quantitative variables; the next chapter will deal more with categorical variables.

Explanatory vs. Response
- The response variable is our dependent variable (traditionally y).
- The explanatory variable is our independent variable (traditionally x).

Explanatory or Response?
Which is the explanatory and which is the response variable?
- Jim wants to know how the mean 2005 SAT Math and Verbal scores in the 50 states are related to each other. He doesn't think that either score explains or causes the other.
- Julie looks at some data. She asks, “Can I predict a state's mean 2005 SAT Math score if I know its mean 2005 SAT Verbal score?”

Explanatory and Response Variables
When we deal with cause and effect, there is always a definite response variable and explanatory variable. But calling one variable response and one variable explanatory doesn't necessarily mean that one causes change in the other.

When analyzing several-variable data, the same principles apply…
Data Analysis Toolbox
To answer a statistical question of interest involving one or more data sets, proceed as follows.
- DATA: Organize and examine the data. Answer the key questions (W5HW).
- GRAPHS: Construct appropriate graphical displays.
- NUMERICAL SUMMARIES: Calculate relevant summary statistics.
- INTERPRETATION: Look for overall patterns and deviations. When the overall pattern is regular, use a mathematical model to describe it.

Scatterplots
Let's say we wanted to examine the relationship between the percent of a state's high school seniors who took the SAT exam in 2005 and the mean SAT Math score in that state that year. A scatterplot is an effective way to graphically represent our data. But first, what is the explanatory variable and what is the response variable in this situation?

Scatterplots
Once we decide on the response and explanatory variables, we can create a scatterplot.
[Figure: scatterplot with axes labeled "response variable" and "explanatory variable"]

Scatterplot

Scatterplot Tips
- Plot the explanatory variable on the horizontal axis. If there is no explanatory-response distinction, either variable can go on the horizontal axis.
- Label both axes!
- Scale the horizontal and vertical axes. The intervals must be uniform, but the two axes do not have to use the same scale.
- If you are given a grid, try to adopt a scale so that your plot uses the whole grid.
- Make your plot large enough so that the details can be easily seen.

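Note: there is no outlier rule for bivariate data (like the 1.5 × IQR rule for a single variable).
Must use the definition: an outlier is a point that falls outside the overall pattern of the relationship.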

Overall Pattern
- Direction: positive, negative (or none)
- Form: linear or nonlinear
- Strength: how closely the points follow the form (r-value)

Positive vs. Negative

Interpreting Scatterplots
Direction? Form? Strength? Outliers?

The mean SAT Math scores and percent of high school seniors who take the test, by state, with the southern states highlighted. Is the South different?

Measuring Linear Association: Correlation
Linear relations are important because, when we discuss the relationship between two quantitative variables, a straight line is a simple pattern that is quite common. A strong linear relationship has points that lie close to a straight line. A weak linear relationship has points that are widely scattered about a line.

Correlation
[Figure: two scatterplots, one labeled "strong association" and one labeled "weak association"]

Our eyes are not good measures of how strong a linear relationship is...
A numerical measure along with a graph gives the linear association an exact value.


- Correlation makes no distinction between explanatory and response variables.
- r doesn't change when we change the units of measurement of x, y, or both.
- r is positive when the association is positive and is negative when the association is negative.
- The correlation r is always a number between -1 and 1. Values of r near 0 indicate a very weak linear relationship. The strength of the linear relationship increases as r moves away from 0 toward either -1 or 1.
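The facts above can be checked numerically. The following is a minimal sketch, assuming the usual textbook definition of r (the sum of the products of standardized x and y values, divided by n - 1), which this transcript does not spell out. The `correlation` helper and the study-hours data are hypothetical, made up purely for illustration.

```python
import math

def correlation(x, y):
    """Correlation r: sum of products of z-scores, divided by n - 1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample standard deviations (divisor n - 1).
    s_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))
    z_products = sum(((xi - mean_x) / s_x) * ((yi - mean_y) / s_y)
                     for xi, yi in zip(x, y))
    return z_products / (n - 1)

# Hypothetical data: hours studied and test score for five students.
hours = [1, 2, 3, 4, 5]
score = [62, 70, 71, 80, 88]

r = correlation(hours, score)
r_swapped = correlation(score, hours)    # swap the roles of x and y
minutes = [60 * h for h in hours]        # change the units of x
r_rescaled = correlation(minutes, score)

print(round(r, 4), round(r_swapped, 4), round(r_rescaled, 4))
```

Under these assumptions the three printed values should agree: swapping the variables or converting hours to minutes leaves r unchanged, and r stays between -1 and 1.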

Patterns closer to a straight line have correlations closer to 1 or -1

- Correlation requires that both variables be quantitative.
- Correlation does not describe curved relationships, no matter how strong they are.
- Like the mean and standard deviation, the correlation is not resistant; r is strongly affected by a few outlying observations.
- Correlation is not a complete summary of two-variable data. You should give the means and standard deviations of both x and y along with the correlation.
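Two of these cautions are easy to demonstrate. The sketch below uses made-up points (not data from the presentation): one set lies exactly on the curve y = x², and another lies exactly on a line until a single outlying point is added. The `correlation` helper is a hypothetical name, written with the standard formula for r.

```python
import math

def correlation(x, y):
    """Correlation r computed from sums of squares and cross-products."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_yy = sum((yi - mean_y) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# A perfect curved relationship: r is 0 even though y is determined by x.
x_curve = [-3, -2, -1, 0, 1, 2, 3]
y_curve = [xi ** 2 for xi in x_curve]
r_curve = correlation(x_curve, y_curve)

# A perfect linear relationship, then the same points plus one outlier.
x_line = [1, 2, 3, 4, 5, 6]
y_line = [2 * xi + 1 for xi in x_line]
r_line = correlation(x_line, y_line)
r_with_outlier = correlation(x_line + [7], y_line + [0])

print(round(r_curve, 3), round(r_line, 3), round(r_with_outlier, 3))
```

In this made-up example a single added point pulls r from 1 down to roughly 0.19, which is why plotting the data matters as much as computing r.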

Many data sets can have the same r value but show completely different relationships.
ALWAYS PLOT YOUR DATA!!!

3.2 – Least-Squares Regression
When a scatterplot shows a linear relationship, we would like to summarize the overall pattern by drawing a line on the scatterplot.
- Regression line: describes how a response variable y changes as an explanatory variable x changes.
- Regression requires that we have an explanatory variable and a response variable.

- y-intercept: does not always make sense in context.
- Slope: represents the predicted (or average) change in y for each one-unit increase in x. Must be very specific when interpreting.

Regression Lines
Once we have our regression line, we can use it to predict responses.
Extrapolation: using the line for predictions outside the range of values of the explanatory variable. Such predictions are often not accurate.

That’s one big rat!!!
Some data were collected on the weight of a male white laboratory rat following its birth. A scatterplot of the weight (in grams) and time since birth (in weeks) shows a fairly strong positive linear relationship. The linear regression equation models the data fairly well.
weight = (time)
a) Interpret the slope in the context of this setting.
b) Interpret the y-intercept in this setting.
c) Would you be willing to use this line to predict the rat’s weight at age 2 years? (There are 454 grams in a pound.)

That’s one big rat!!!
- Slope: For every one-week increase in age, the rat will increase its weight by an average of grams.
- y-intercept: An estimate for the birth weight ( grams) of this male rat.
- No, this would be extrapolation. The rat would weigh approximately 4,260 grams, or 9.4 lbs. This is what a medium-sized cat weighs!
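The extrapolation arithmetic can be sketched in a few lines. The coefficients of the regression equation are missing from this transcript, so the intercept (100 g) and slope (40 g per week) below are assumptions, chosen only because they reproduce the 4,260 gram prediction quoted above at 104 weeks. They should not be read as the presentation's actual equation.

```python
# ASSUMED coefficients (not given in the transcript): picked so that the
# prediction at 2 years matches the 4,260 g figure quoted in the answer.
ASSUMED_INTERCEPT_G = 100.0
ASSUMED_SLOPE_G_PER_WEEK = 40.0
GRAMS_PER_POUND = 454.0  # conversion stated in the exercise

def predicted_weight_g(weeks):
    """Predicted weight in grams from the assumed regression line."""
    return ASSUMED_INTERCEPT_G + ASSUMED_SLOPE_G_PER_WEEK * weeks

weeks_in_two_years = 2 * 52
weight_g = predicted_weight_g(weeks_in_two_years)
weight_lb = weight_g / GRAMS_PER_POUND

print(f"{weight_g:.0f} g is about {weight_lb:.1f} lb")
```

A time of 104 weeks lies far outside the period over which the rat was weighed, so the line has no support there; the implausible result is the point of the exercise.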

The Least-Squares Regression Line (LSRL)
- In most cases, no line will pass exactly through all of the points in a scatterplot.
- Our eyes are not a good judge of the best line.
- Because we use the line to predict y from x, the prediction errors are errors in y, the vertical direction.
- A good regression line makes the vertical distances of the points from the line as small as possible.

The Least-Squares Regression Line (LSRL)

Does fidgeting keep you slim?
Refer to today’s handout for scatterplot

Use your calculator and program CORR to find the equation of the LSRL for the NEA and Fat gain data.

Using Program CORR

Fat gain = 3.505 – 0.00344(NEA change)

NEA and Fat Gain Make the errors in predicting y as small as possible by minimizing the sum of the squares of the vertical distances of the data points from the line
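To see what minimizing the sum of squared vertical distances looks like in practice, here is a minimal sketch. The NEA change values are the x values from the residual table later in this transcript. The fat gain values are not listed in the transcript; as an assumption, they are reconstructed here by adding each listed residual to the value predicted by Fat gain = 3.505 – 0.00344(NEA change) and rounding to one decimal place. The function names are hypothetical.

```python
# NEA change for the 16 subjects (x values from the residual table).
nea = [-94, -57, -29, 135, 143, 151, 245, 355,
       392, 473, 486, 535, 571, 580, 620, 690]
# ASSUMPTION: fat gain reconstructed as (predicted value + listed residual),
# rounded to one decimal place; not printed in the transcript itself.
fat_gain = [4.2, 3.0, 3.7, 2.7, 3.2, 3.6, 2.4, 1.3,
            3.8, 1.7, 1.6, 2.2, 1.0, 0.4, 2.3, 1.1]

def least_squares(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def sum_squared_errors(x, y, intercept, slope):
    """Sum of squared vertical distances from the points to the line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

a, b = least_squares(nea, fat_gain)
best_sse = sum_squared_errors(nea, fat_gain, a, b)
# Any other line (shifted intercept, tilted slope) has a larger SSE.
shifted_sse = sum_squared_errors(nea, fat_gain, a + 0.2, b)
tilted_sse = sum_squared_errors(nea, fat_gain, a, b * 1.1)

print(f"fat gain = {a:.3f} + ({b:.5f})(NEA change)")
print(best_sse < shifted_sse, best_sse < tilted_sse)
```

With the reconstructed data, the fitted coefficients should agree with the calculator result of 3.505 and -0.00344 to the precision shown.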

Understanding the LSRL

How well does the line fit the data?
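Two ways:
- Residual plot
- Coefficient of determination, r²

Residual: the difference between an observed value of the response variable and the value predicted by the regression line (residual = observed y – predicted y). It tells us how much error there is in the LSRL's prediction.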

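Residuals
Find the LSRL for the data below (here ŷ = -2 + 3x), then compute each residual as observed minus predicted.

| x | y (observed) | ŷ (predicted) | Residual (y – ŷ) |
|---|---|---|---|
| 1 | 2 | 1 | 1 |
| 2 | 4 | 4 | 0 |
| 3 | 6 | 7 | -1 |
| 4 | 8 | 10 | -2 |
| 5 | 15 | 13 | 2 |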

Residual Plot
Plot (x, residual).
- Residuals should be small.
- The residual plot should show no obvious pattern.
  - Curved: a linear model is not a good fit.
  - Fanning: predictions will be less accurate for larger/smaller x.

Residuals
Need to be small… but what’s small enough?
Standard deviation of the residuals: used to measure the typical prediction error. For the five-point example (x: 1 to 5), predictions are typically off by about 1.83.
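The 1.83 figure can be reproduced from the five-point example (x: 1 2 3 4 5, y: 2 4 6 8 15). This is a short sketch, assuming the standard deviation of the residuals is computed with divisor n - 2, the usual convention for a regression line; the helper name is hypothetical.

```python
import math

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 15]

def least_squares(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    slope = s_xy / s_xx
    return mean_y - slope * mean_x, slope

intercept, slope = least_squares(x, y)
predicted = [intercept + slope * xi for xi in x]
# Residual = observed - predicted.
residuals = [yi - pi for yi, pi in zip(y, predicted)]
# Standard deviation of the residuals, dividing by n - 2.
s = math.sqrt(sum(e ** 2 for e in residuals) / (len(x) - 2))

print(predicted)     # the predicted values 1, 4, 7, 10, 13 from the slide
print(residuals)
print(round(s, 2))   # 1.83
```

The residuals sum to zero, as they always do for a least-squares line with an intercept, so their size has to be judged through a measure like s rather than their average.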

Residuals for NEA & Fat Gain
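| NEA change (x) | Residual |
|---|---|
| -94 | 0.37 |
| -57 | -0.701 |
| -29 | 0.095 |
| 135 | -0.34 |
| 143 | 0.187 |
| 151 | 0.61 |
| 245 | -0.26 |
| 355 | -0.98 |
| 392 | 1.64 |
| 473 | -0.18 |
| 486 | -0.23 |
| 535 | 0.54 |
| 571 | -0.54 |
| 580 | -1.11 |
| 620 | 0.93 |
| 690 | -0.03 |

In calculator:
- L3 = Y1(L1) gives all predicted values
- L4 = L2 – L3 (actual – predicted)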

Residuals for NEA & Fat Gain
- Make a scatterplot of the residuals: L1, L4
- 1-Var Stats on L4: Sres = 0.71
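The Sres value can be checked directly from the residuals listed for the NEA and fat gain data. A brief sketch, assuming that the calculator's 1-Var Stats reports the sample standard deviation (divisor n - 1); the version that divides by n - 2, often reported as the regression standard error, is shown for comparison and comes out slightly larger.

```python
import math
import statistics

# Residuals as listed in the transcript for the 16 subjects.
residuals = [0.37, -0.701, 0.095, -0.34, 0.187, 0.61, -0.26, -0.98,
             1.64, -0.18, -0.23, 0.54, -0.54, -1.11, 0.93, -0.03]

n = len(residuals)
# What 1-Var Stats would report as Sx: divisor n - 1.
s_one_var = statistics.stdev(residuals)
# Regression convention: divisor n - 2 (two coefficients were estimated).
s_regression = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

print(round(s_one_var, 2), round(s_regression, 2))
```

Either number answers the same question: roughly how far off, in units of fat gain, a typical prediction from the line is.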

Residual Plots Scattered…no real pattern. A line is a good model.
Curved pattern. A line may not be the best model.

Residual Plots Fanning…more spread for larger values of x. Prediction will be less accurate when x is large. HW: pg #39, 40

Using r² to determine how well the line fits the data
r²: coefficient of determination
- The proportion of variation in y that is explained by the LSRL
- How well the LSRL does at predicting values of the response
- How much better the LSRL is at predicting responses than if we just used the mean ȳ as our prediction

We know that the LSRL minimizes the sum of the squared residuals…
Compare the sum of squared residuals of the LSRL to the sum of squared residuals about the mean ȳ. Use the NEA and Fat Gain data. Create a new list and use 1-Var Stats to find each sum.

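The ratio of these two sums gives us the proportion of how much error there is in the LSRL model with respect to the error in the mean model. How can we use this to determine how much better the LSRL is (r²)?
$r^2 = 1 - \dfrac{\text{SSE}}{\text{SST}} = 0.6066$, where SSE is the sum of squared residuals about the LSRL and SST is the sum of squared residuals about ȳ.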

So what does this mean? 60.6% of the variation in fat gain is explained by the LSRL relating fat gain and non-exercise activity. The other 39.4% is individual variation that is not explained by this linear relationship
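The comparison between the LSRL and the mean model can be written out step by step. In this sketch the NEA values come from the residual table earlier in the transcript, while the fat gain values are an assumption: they are reconstructed by adding each listed residual to the value predicted by Fat gain = 3.505 – 0.00344(NEA change) and rounding to one decimal place. Variable names are hypothetical.

```python
nea = [-94, -57, -29, 135, 143, 151, 245, 355,
       392, 473, 486, 535, 571, 580, 620, 690]
# ASSUMPTION: reconstructed fat gain values (not printed in the transcript).
fat_gain = [4.2, 3.0, 3.7, 2.7, 3.2, 3.6, 2.4, 1.3,
            3.8, 1.7, 1.6, 2.2, 1.0, 0.4, 2.3, 1.1]

n = len(nea)
mean_x = sum(nea) / n
mean_y = sum(fat_gain) / n

# Least-squares slope and intercept.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(nea, fat_gain))
s_xx = sum((x - mean_x) ** 2 for x in nea)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# Error when predicting with the LSRL versus always predicting the mean.
sse_line = sum((y - (intercept + slope * x)) ** 2
               for x, y in zip(nea, fat_gain))
sse_mean = sum((y - mean_y) ** 2 for y in fat_gain)

r_squared = 1 - sse_line / sse_mean
print(round(sse_line, 2), round(sse_mean, 2), round(r_squared, 3))
```

With the reconstructed data the result should land close to the 60.6% quoted above; small differences in the last digit can arise from rounding in the listed residuals.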

If all the points lie on the LSRL, then the sum of squared residuals is 0 and r² = 1
All of the variation in y is explained by the linear relationship with x.
Worst-case scenario: r² = 0, so 0% of the variation in y is explained by the line.
When reporting a regression, always give r² to show how successful the line was in explaining the response.

Facts about LSRL
- The distinction between explanatory and response variables is essential. (We will get a different line if they are reversed.)
- There is a close connection between correlation and slope.
- The LSRL always passes through the point (x̄, ȳ).
- r describes the strength of the straight-line relationship.
- r² is the proportion of variation in y that is explained by the least-squares regression of y on x.
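Several of these facts can be verified on the small five-point data set used in the residuals example (x: 1 2 3 4 5, y: 2 4 6 8 15). The sketch below assumes the standard relationship slope = r · s_y / s_x as the "close connection" the slide refers to; the helper name `fit` is hypothetical.

```python
import math

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 15]

def fit(explanatory, response):
    """Least-squares (intercept, slope) for response on explanatory."""
    n = len(explanatory)
    m_e = sum(explanatory) / n
    m_r = sum(response) / n
    s_er = sum((e - m_e) * (r - m_r) for e, r in zip(explanatory, response))
    s_ee = sum((e - m_e) ** 2 for e in explanatory)
    slope = s_er / s_ee
    return m_r - slope * m_e, slope

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
s_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))
s_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))
r = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
     / ((n - 1) * s_x * s_y))

a, b = fit(x, y)            # regress y on x
a_rev, b_rev = fit(y, x)    # reverse the roles: regress x on y

print("slope:", b, " r * s_y / s_x:", r * s_y / s_x)
print("line at x-bar:", a + b * mean_x, " y-bar:", mean_y)
# The reversed line, rewritten as y in terms of x, has slope 1 / b_rev.
print("y-on-x slope:", b, " reversed line's slope:", 1 / b_rev)
print("r squared:", r ** 2, " product of the two slopes:", b * b_rev)
```

In this example the two lines coincide only if r is exactly 1 or -1; otherwise reversing the roles gives a different line, which is why the choice of explanatory variable matters for regression but not for correlation.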