2 3.1 – Scatterplots and Correlation
When might we want to examine a relationship between two variables?
Height and heart attacks
Weight and blood pressure
Hours studying and test scores
What else?
In this chapter we will deal with relationships between quantitative variables; the next chapter will deal more with categorical variables.
3 Explanatory vs. Response
The response variable is our dependent variable (traditionally y).
The explanatory variable is our independent variable (traditionally x).
4 Explanatory or Response?
Which is the explanatory variable and which is the response variable?
Jim wants to know how the mean 2005 SAT Math and Verbal scores in the 50 states are related to each other. He doesn't think that either score explains or causes the other.
Julie looks at some data. She asks, “Can I predict a state's mean 2005 SAT Math score if I know its mean 2005 SAT Verbal score?”
5 Explanatory and Response Variables
When we deal with cause and effect, there is always a definite response variable and explanatory variable.
But calling one variable response and one variable explanatory doesn't necessarily mean that one causes change in the other.
6 When analyzing several-variable data, the same principles apply…
Data Analysis Toolbox
To answer a statistical question of interest involving one or more data sets, proceed as follows.
DATA: Organize and examine the data. Answer the key questions (W5HW).
GRAPHS: Construct appropriate graphical displays.
NUMERICAL SUMMARIES: Calculate relevant summary statistics.
INTERPRETATION: Look for overall patterns and deviations. When the overall pattern is regular, use a mathematical model to describe it.
7 Scatterplots
Let's say we wanted to examine the relationship between the percent of a state's high school seniors who took the SAT exam in 2005 and the mean SAT Math score in the state that year. A scatterplot is an effective way to represent our data graphically.
But first, what is the explanatory variable and what is the response variable in this situation?
8 Scatterplots
Once we decide on the response and explanatory variables, we can create a scatterplot.
[Scatterplot with axes labeled: response variable, explanatory variable]
10 Scatterplot Tips
Plot the explanatory variable on the horizontal axis. If there is no explanatory-response distinction, either variable can go on the horizontal axis.
Label both axes!
Scale the horizontal and vertical axes. The intervals on each axis must be uniform (but the two axes do not need to share the same scale).
If you are given a grid, try to adopt a scale so that your plot uses the whole grid. Make your plot large enough that the details can be easily seen.
11 Note: there is no outlier rule for bivariate data (like the 1.5 × IQR rule). You must use the definition.
12 Overall Pattern
Direction: positive, negative, or none
Form: linear or nonlinear
Strength: how closely the points follow the form (r-value)
16 Adding Categorical Data
The mean SAT Math scores and percent of high school seniors who take the test, by state, with the southern states highlighted.
Is the South different?
17 Measuring Linear Association: Correlation
Linear relations are important because, when we discuss the relationship between two quantitative variables, a straight line is a simple pattern that is quite common.
A strong linear relationship has points that lie close to a straight line.
A weak linear relationship has points that are widely scattered about a line.
21 Facts about Correlation
Correlation makes no distinction between explanatory and response variables.
r doesn't change when we change the units of measurement of x, y, or both.
r is positive when the association is positive and negative when the association is negative.
The correlation r is always a number between -1 and 1. Values of r near 0 indicate a very weak linear relationship. The strength of the linear relationship increases as r moves away from 0 toward either -1 or 1.
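The first two facts can be checked numerically. The sketch below uses made-up hours-studied and test-score values (not data from these notes) and a small hand-written `correlation` helper; it compares r with the roles of x and y swapped, and with x converted from hours to minutes.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation r, computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data, invented for illustration only.
hours = [1, 2, 3, 4, 5, 6]
scores = [62, 70, 68, 80, 85, 91]

r = correlation(hours, scores)
r_swapped = correlation(scores, hours)   # swap explanatory and response
minutes = [60 * h for h in hours]        # change the units of x
r_minutes = correlation(minutes, scores)

print(round(r, 4), round(r_swapped, 4), round(r_minutes, 4))
```

All three printed values should agree, and the sign should be positive because the scores tend to rise with hours studied.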
22 Patterns closer to a straight line have correlations closer to 1 or -1.
23 Cautionary Notes about Correlation
Correlation requires that both variables be quantitative.
Correlation does not describe curved relationships, no matter how strong they are.
Like the mean and standard deviation, the correlation is not resistant; r is strongly affected by a few outlying observations.
Correlation is not a complete summary of two-variable data. You should give the means and standard deviations of both x and y along with the correlation.
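Two of these cautions are easy to see with numbers. The sketch below is built on invented data: a perfectly curved pattern (y = x² on symmetric x values) and a perfectly linear pattern with one outlying point added. The `correlation` helper is written out by hand so nothing beyond the standard library is assumed.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation r, computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical curved relationship: y is completely determined by x, yet r is 0.
curve_x = [-3, -2, -1, 0, 1, 2, 3]
curve_y = [x ** 2 for x in curve_x]
r_curve = correlation(curve_x, curve_y)

# Hypothetical points lying exactly on the line y = 2x + 1.
line_x = [1, 2, 3, 4, 5, 6, 7, 8]
line_y = [2 * x + 1 for x in line_x]
r_clean = correlation(line_x, line_y)

# The same points plus a single outlying observation at (9, 0).
r_outlier = correlation(line_x + [9], line_y + [0])

print(round(r_curve, 3), round(r_clean, 3), round(r_outlier, 3))
```

One added point is enough to pull r from a perfect 1 down to a weak positive value, which is what "not resistant" means in practice.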
24 Cautionary Notes about Correlation
Many data sets can have the same r value but have completely different linear relationships.
ALWAYS PLOT YOUR DATA!!!
Correlation applet
25 3.2 – Least-Squares Regression
When a scatterplot shows a linear relationship, we would like to summarize the overall pattern by drawing a line on the scatterplot.
Regression line – describes how a response variable y changes as an explanatory variable x changes.
Regression requires explanatory and response variables.
26 The y-intercept does not always make sense.
The slope represents a predicted or average change; be very specific when interpreting it.
27 Regression Lines
Once we have our regression line, we can use it to predict responses.
Extrapolation – using the line for predictions outside the range of values of the explanatory variable.
Such predictions are often not accurate.
28 That’s one big rat!!!
Some data were collected on the weight of a male white laboratory rat following its birth. A scatterplot of the weight (in grams) and time since birth (in weeks) shows a fairly strong positive linear relationship. The linear regression equation models the data fairly well.
weight = ___ + ___(time)
a) Interpret the slope in the context of this setting.
b) Interpret the y-intercept in this setting.
c) Would you be willing to use this line to predict the rat’s weight at age 2 years? (There are 454 grams in a pound.)
29 That’s one big rat!!!
a) Slope: for every one-week increase in age, the rat’s weight increases by an average of ___ grams.
b) y-intercept: an estimate of the birth weight (___ grams) of this male rat.
c) No, this would be extrapolation. The line predicts that the rat would weigh approximately 4,260 grams, or 9.4 lb. This is what a medium-sized cat weighs!
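The extrapolation check in part (c) can be sketched in a few lines. The coefficients of the rat equation are not legible in these notes, so the code ASSUMES an intercept of 100 g and a slope of 40 g per week, a pair chosen only because it reproduces the 4,260 g prediction quoted in the answer at 2 years (104 weeks). The observed age range is also an assumption.

```python
GRAMS_PER_POUND = 454  # conversion given in the exercise

# ASSUMED coefficients (not from the notes): picked only to match the
# 4,260 g prediction at 104 weeks that the answer quotes.
INTERCEPT_G = 100.0
SLOPE_G_PER_WEEK = 40.0

# ASSUMED range of ages (in weeks) covered by the original data.
OBSERVED_WEEKS = (0, 10)

def predict_weight(weeks):
    """Predicted weight in grams from the (assumed) regression line."""
    return INTERCEPT_G + SLOPE_G_PER_WEEK * weeks

def is_extrapolation(weeks, observed=OBSERVED_WEEKS):
    """True when the x-value lies outside the range used to fit the line."""
    low, high = observed
    return weeks < low or weeks > high

two_years = 2 * 52  # weeks
grams = predict_weight(two_years)
pounds = grams / GRAMS_PER_POUND

print(grams, round(pounds, 1), is_extrapolation(two_years))
```

The arithmetic works, but the flag is the point: the line was fit to a young rat, and nothing in the data says growth stays linear for two years.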
30 The Least-Squares Regression Line (LSRL)
In most cases, no line will pass exactly through all of the points in a scatterplot.
Our eyes are not a good judge of the best line.
Because we use the line to predict y from x, the prediction errors are errors in y, the vertical direction.
A good regression line makes the vertical distances of the points from the line as small as possible.
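"As small as possible" is made precise by minimizing the sum of the squared vertical distances. A minimal sketch, using invented data and the standard closed-form slope and intercept, compares that sum for the least-squares line with the sum for an eyeballed line (the eyeballed coefficients are arbitrary guesses).

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared vertical errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def sum_squared_residuals(xs, ys, intercept, slope):
    """Sum of squared vertical distances from the points to the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares_line(xs, ys)
sse_best = sum_squared_residuals(xs, ys, a, b)

# An eyeballed line (arbitrary guess) for comparison.
sse_eyeballed = sum_squared_residuals(xs, ys, 0.5, 1.8)

print(round(a, 3), round(b, 3), round(sse_best, 3), round(sse_eyeballed, 3))
```

Any other choice of intercept and slope gives a sum of squares at least as large as `sse_best`.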
38 How well does the line fit the data?
Two ways:
Residual plot
Coefficient of determination, r²
Residual – the difference between the observed value of the response and the predicted value (observed – predicted). It shows how much error there is in the LSRL.
40 Residual Plot
Plot (x, residual).
Residuals should be small.
The residual plot should show no obvious pattern.
Curved: a line is not a good fit.
Fanning: predictions will be less accurate for larger/smaller x.
41 Residuals
Need to be small… but what’s small enough?
Standard deviation of the residuals: used to measure the typical prediction error.
Consistently off by 1.83
42 Residuals for NEA & Fat Gain
x:        -94     -57     -29     135     143     151     245     355
residual: 0.37    -0.701  0.095   -0.34   0.187   0.61    -0.26   -0.98
x:        392     473     486     535     571     580     620     690
residual: 1.64    -0.18   -0.23   0.54    -0.54   -1.11   0.93    -0.03
In calculator: L3 = Y1(L1) gives all predicted values.
L4 = L2 – L3 (actual – predicted)
43 Residuals for NEA & Fat Gain
Make a scatterplot of the residuals: L1, L4.
1-Var Stats on L4: s_res = 0.71
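The calculator step can be mirrored off the calculator. The sketch below takes the sixteen residuals as read from the NEA and fat gain slide and computes their sample standard deviation with Python's `statistics.stdev`, which uses the same n − 1 divisor as the Sx value reported by 1-Var Stats.

```python
from statistics import stdev

# Residuals (actual - predicted) as read from the NEA and fat gain slide.
residuals = [
    0.37, -0.701, 0.095, -0.34, 0.187, 0.61, -0.26, -0.98,
    1.64, -0.18, -0.23, 0.54, -0.54, -1.11, 0.93, -0.03,
]

# Least-squares residuals should sum to (about) zero, up to rounding.
total = sum(residuals)

# Sample standard deviation (divisor n - 1), like Sx from 1-Var Stats.
s_res = stdev(residuals)

print(round(total, 3), round(s_res, 2))
```

Regression output usually reports the standard deviation of the residuals with divisor n − 2 instead, which gives a slightly larger value than this one.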
44 Residual Plots
Scattered… no real pattern. A line is a good model.
Curved pattern. A line may not be the best model.
45 Residual Plots
Fanning… more spread for larger values of x. Prediction will be less accurate when x is large.
HW: pg #39, 40
46 Using r² to determine how well the data fit the line
r²: coefficient of determination, the proportion of variation in y explained by the LSRL.
How well the LSRL does at predicting values of the response.
How much better is the LSRL at predicting responses than if we just used ȳ as our prediction?
47 We know that the LSRL minimizes the sum of the squared residuals…
Compare the sum of squared residuals of the LSRL to the sum of squared residuals of ȳ (using the mean of y as the prediction for every point).
Use the NEA and fat gain data.
Create a new list and use 1-Var Stats to find these sums.
48 The ratio of these two sums gives us the proportion of how much error there is in the LSRL model with respect to the error in the mean model.
How can we use this to determine how much better the LSRL is (r²)?
r² = 1 – (sum of squared residuals of the LSRL) / (sum of squared residuals of ȳ) = .6066
49 So what does this mean?
60.6% of the variation in fat gain is explained by the LSRL relating fat gain and non-exercise activity.
The other 39.4% is individual variation that is not explained by this linear relationship.
50 If all the points lie on the LSRL, then every residual is 0 and r² = 1.
All of the variation in y is explained by the linear relationship with x.
Worst-case scenario: r² = 0
0% is explained by the line.
When reporting a regression, always give r² to show how successful the line was in explaining the response.
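The comparison between the LSRL and the mean model can be worked through on a small example. The fat gain values are not listed in these notes, so this sketch uses invented data; `fit_line` is a hand-written helper, and the two sums are the squared residuals about the line and about ȳ.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4, 5, 6]
ys = [5.0, 4.1, 4.4, 3.0, 3.2, 2.0]

intercept, slope = fit_line(xs, ys)
mean_y = sum(ys) / len(ys)

# Error of the LSRL: squared residuals about the fitted line.
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Error of the mean model: squared residuals when every prediction is y-bar.
sst = sum((y - mean_y) ** 2 for y in ys)

r_squared = 1 - sse / sst

print(round(sse, 3), round(sst, 3), round(r_squared, 3))
```

If every point sat exactly on the line, `sse` would be 0 and `r_squared` would be 1; if the line did no better than ȳ, the two sums would match and `r_squared` would be 0.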
51 Facts about LSRL
The distinction between explanatory and response is essential. (You will get a different line if they are reversed.)
There is a close connection between correlation and slope.
The LSRL always passes through (x̄, ȳ).
r describes the strength of the straight-line relationship.
r² is the proportion of variation in y that is explained by the least-squares regression of y on x.
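Three of these facts can be verified directly. The sketch below uses invented data and hand-written helpers; it checks that the fitted line passes through the point of means, that the slope equals r times the ratio of standard deviations (the usual form of the correlation-slope connection), and that regressing x on y gives a different line.

```python
from math import sqrt
from statistics import mean, stdev

def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line of y on x."""
    mean_x, mean_y = mean(xs), mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def correlation(xs, ys):
    """Pearson correlation r."""
    mean_x, mean_y = mean(xs), mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 5, 4, 6, 8]

a, b = fit_line(xs, ys)
r = correlation(xs, ys)

# Fact: the line passes through the point of means.
predicted_at_mean_x = a + b * mean(xs)

# Fact: slope = r * (s_y / s_x).
b_from_r = r * stdev(ys) / stdev(xs)

# Fact: reversing the roles gives a different line. Regress x on y, then
# rewrite that line with y on the vertical axis to compare slopes.
a_rev, b_rev = fit_line(ys, xs)
reversed_slope_on_same_axes = 1 / b_rev

print(round(predicted_at_mean_x, 3), round(mean(ys), 3))
print(round(b, 3), round(b_from_r, 3), round(reversed_slope_on_same_axes, 3))
```

The two regression slopes multiply to r², so the two lines coincide only when the points fall exactly on a line.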