Regression and Residual Plots


Regression and Residual Plots Section 3.2

Tapping on Cans: amount of soda remaining (mL)

0 sec   4 sec   8 sec   12 sec
245     260     267     275
255     250     271     280
268     270     276     285
265     290     248     284
278     251     261     279
249     259

[Values recovered from the slide's table, reading across rows; the original table appears to be truncated.]

The number of seconds spent tapping is the explanatory variable and the amount left in the can is the response. Store the data in L1 and L2.

It’s easier to copy the scatterplot to paper if you adjust the window to values that make sense for the data. Be sure to label the axes.

Create the least-squares regression line. ŷ = 248.64 + 2.635x predicts the number of milliliters of soda left after x seconds of tapping.

Interpreting the regression equation: ŷ is the predicted value of the response variable y for a given value of the explanatory variable x. The slope b is the amount by which y is predicted to change when x increases by one unit (2.635 more mL of soda for each additional second of tapping). The y-intercept a is the predicted value of y when x = 0 (248.64 mL of soda is the predicted amount if the can is not tapped at all). We won’t always be able to see the y-intercept on our scatterplots because of how we label the y-axis.
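These interpretations can be checked numerically. A minimal sketch in plain Python, using the fitted equation from the slide:

```python
# Fitted least-squares line from the slide: y-hat = 248.64 + 2.635x
def predicted_soda(seconds_tapped):
    """Predicted mL of soda remaining after tapping for the given seconds."""
    return 248.64 + 2.635 * seconds_tapped

print(predicted_soda(0))    # y-intercept: prediction with no tapping -> 248.64
print(predicted_soda(12))   # prediction after 12 seconds of tapping
```

Each extra second of tapping raises the prediction by the slope, 2.635 mL.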

Return to the Graph screen to see the regression line superimposed on the data. Use “trace” to examine y-values and ŷ-values. Note that the vertical distance between an actual y-value and the regression line, y − ŷ, is called a residual.

Residuals are stored in a list called RESID on the TI-84. Store these values in L3. The sum of the residuals is zero. The regression line is the line that makes the sum of the squared residuals as small as possible. Notice that residuals have both size (magnitude) and direction (sign).
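Away from the calculator, the same residual bookkeeping can be sketched in plain Python. The data below are a small hypothetical set, not the slide’s can data:

```python
# Hand-rolled least-squares fit; residuals from an LSRL always sum to ~0.
def lsrl(xs, ys):
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
        sum((x - xbar) ** 2 for x in xs)
    a = ybar - b * xbar
    return a, b

xs = [0, 4, 8, 12, 0, 4, 8, 12]                 # hypothetical tapping times (sec)
ys = [246, 258, 270, 281, 250, 262, 268, 278]   # hypothetical mL remaining
a, b = lsrl(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(sum(residuals))  # essentially zero (up to floating-point noise)
```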

We can examine the sum of the squared residuals using L4 and the sum command on the TI-84. 951.32 is the smallest possible sum of squared residuals for any line that can be drawn through the data points. This means that the errors in predicting y are minimized.

Properties of the least-squares regression line. The line ŷ = a + bx always passes through the point (x̄, ȳ); that is, (x̄, ȳ) is always a solution of ŷ = a + bx. An outlier will always “pull” the regression line toward itself. The closer the outlier is to (x̄, ȳ), the less it will change the slope of the regression line. Outliers in the extreme x-direction are more likely to be influential than those in the y-direction. http://digitalfirst.bfwpub.com/stats_applet/stats_applet_5_correg.html

HOW TO CALCULATE THE LEAST-SQUARES REGRESSION LINE using only the means, standard deviations, and correlation. We have data on an explanatory variable x and a response variable y for n individuals. From the data, calculate the means x̄ and ȳ, the standard deviations sx and sy of the two variables, and their correlation r. The least-squares regression line is the line ŷ = a + bx with slope b = r(sy/sx) and y-intercept a = ȳ − b·x̄. This means that a change of 1 standard deviation in x is matched by a change of r standard deviations in ŷ. So, in standard deviation units, ŷ is closer to ȳ than x is to x̄. We say that “the values of y regress to their mean.”

For the can-tapping example, r = 0.924. From the 1-Var Stats screens: x̄ = 6, sx = 4.529, ȳ = 264.45, sy = 12.916. Slope b = 0.924 · (12.916/4.529) = 2.635 and y-intercept a = 264.45 − 2.635 · 6 = 248.64.
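These two computations can be reproduced directly from the summary statistics. A sketch in plain Python, using the values read off the slide:

```python
# Summary statistics from the slide's 1-Var Stats screens
r, x_bar, s_x, y_bar, s_y = 0.924, 6, 4.529, 264.45, 12.916

b = r * s_y / s_x       # slope: r times (sy / sx)
a = y_bar - b * x_bar   # y-intercept: y-bar minus b times x-bar

print(round(b, 3))  # 2.635
print(round(a, 2))  # 248.64
```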

Determining the appropriateness of a linear model using a “residual plot.” Because the sum of the residuals (measured from the least-squares regression line) is always zero, the mean of the residuals is also zero. A residual plot is a scatterplot of the residuals against the explanatory variable, together with the horizontal line y = 0. The horizontal line represents the regression line (the regression line is a distance of 0 from itself). Plot the explanatory variable values (in L1) as x and the residuals (in L3) as y.

Use Plot 2 and L1 and L3 to draw the residual plot.

Linearity is indicated by random scatter. If you see any sort of leftover pattern or curve in the residual plot, then a linear model is not appropriate. Think of it this way: form of the residual plot = form of association − form of model. Remember this by recalling that residual = observed y − predicted y.

How well does the regression line “fit” the data? Two measures: the standard deviation of the residuals and the coefficient of determination. The standard deviation of the residuals, s = √(Σ(yᵢ − ŷᵢ)² / (n − 2)), tells us the approximate size of a “typical” prediction error (residual). When we use the least-squares regression line to predict the amount of soda left in a can given the number of seconds it was tapped, our predictions will typically be off by about 4.9 mL.

Predicting the amount of soda if you don’t know the tapping time x: use the mean ȳ. The prediction error using ȳ is Σ(y − ȳ)² = 6506; the prediction error using the regression line is Σ(y − ŷ)² = 951.3. So 951.3/6506 = 14.6% of the variation in the soda remaining is unaccounted for by the regression line, and (100 − 14.6)% = 85.4% is accounted for by the regression line.

Coefficient of determination (r²). The 85.4% we found on the last slide is called the coefficient of determination. It is the square of the correlation r. DEFINITION: The coefficient of determination r² is the fraction of the variation in the values of y that is accounted for by the least-squares regression line of y on x. We can calculate r² using the formula r² = 1 − Σ(y − ŷ)² / Σ(y − ȳ)², where the numerator is the prediction error using the regression line ŷ and the denominator is the prediction error using the mean ȳ.
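The 14.6% / 85.4% split on the previous slide is exactly this formula. A sketch, using the slide’s sums of squares:

```python
sse = 951.3   # prediction error using the regression line (sum of squared residuals)
sst = 6506    # prediction error using the mean y-bar

r_squared = 1 - sse / sst
print(round(100 * r_squared, 1))  # 85.4 (percent of variation accounted for)
```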

Interpreting computer output: how does seat location in a classroom affect test scores?

Predictor   Coef      SE Coef   T       P
Constant    85.706    4.239     20.22   0.000
Row         -1.1171   0.9472    -1.18   0.248

S = 10.0673   R-Sq = 4.7%   R-Sq(adj) = 1.3%

(a) What is the equation of the least-squares regression line that describes the relationship between row number and test score? Define any variables that you use. (b) Interpret the slope of the regression line in context. (c) Find the correlation. (When you take the square root of the coefficient of determination, be sure to choose the sign correctly.) (d) Is a line an appropriate model to use for these data? Explain how you know.
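A sketch of how parts (a) and (c) fall out of the output (values copied from the printout; for (c), the negative square root is chosen because the slope of Row is negative):

```python
import math

# From the computer output
intercept, slope = 85.706, -1.1171   # Coef column
r_sq = 0.047                         # R-Sq = 4.7%

# (a) predicted score = 85.706 - 1.1171 * (row number)
def predict(row):
    return intercept + slope * row

# (c) correlation: square root of R-Sq, with the sign of the slope
r = -math.sqrt(r_sq)
print(round(r, 3))  # -0.217
```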

More practice interpreting computer output.

Should we use a linear model and, if so, how accurate will our predictions be? Use the four-step process: State, Plan, Do, Conclude. STATE: Is a linear model appropriate for these data? If so, how well does the least-squares regression line fit the data? PLAN: To determine whether a linear model is appropriate, we will look at the scatterplot and residual plot to see if the association is linear or nonlinear. Then, if a linear model is appropriate, we will use the standard deviation of the residuals and r² to measure how well the least-squares line fits the data.

DO: The scatterplot shows a moderately strong, negative linear association between age at first word and Gesell score. There are a couple of outliers in the scatterplot. Child 19 has a very high Gesell score for his or her age at first word. Also, child 18 didn’t speak his or her first word until much later than the other children in the study and has a much lower Gesell score. The residual plot does not have any obvious patterns, confirming what we saw in the scatterplot: a linear model is appropriate for these data.

From the computer output, the equation of the least-squares regression line is ŷ = 109.874 − 1.1270x, where x is age at first word. The standard deviation of the residuals is s = 11.0229. This means that our predictions will typically be off by about 11.0229 points when we use the linear model to predict Gesell scores from age at first word. Finally, r² = 41%: 41% of the variation in Gesell score is accounted for by the linear model relating Gesell score to age at first word.
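A quick numerical check of these statements (a sketch; the child’s age of 12 months is a made-up illustrative value, not taken from the study):

```python
# Least-squares line from the computer output
def gesell_score(age_months):
    """Predicted Gesell score from age (months) at first word."""
    return 109.874 - 1.1270 * age_months

pred = gesell_score(12)   # hypothetical child whose first word came at 12 months
print(round(pred, 2))     # 96.35
# With s = 11.0229, a typical prediction error puts the true score
# roughly anywhere from the mid-80s to the mid-100s for this child.
```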

CONCLUDE: Although a linear model is appropriate for these data, our predictions might not be very accurate. Our typical prediction error is about 11 points, and more than half of the variation in Gesell score is still unaccounted for. Furthermore, we should be hesitant to use this model to make predictions until we understand the effect of the two outliers on the regression results.

DEFINITION: Outliers and influential observations in regression. An outlier is an observation that lies outside the overall pattern of the other observations. Points that are outliers in the y direction but not the x direction of a scatterplot have large residuals. Other outliers may not have large residuals. An observation is influential for a statistical calculation if removing it would markedly change the result of the calculation. Points that are outliers in the x direction of a scatterplot are often influential for the least-squares regression line. (The best way to determine whether an outlier is influential is to calculate the regression line with and without the point.) http://digitalfirst.bfwpub.com/stats_applet/stats_applet_5_correg.html
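The with/without comparison in the parenthetical can be sketched directly. The data here are hypothetical; the point (20, 10) plays the role of an outlier in the x direction:

```python
# Refit the LSRL with and without an extreme-x point to see its influence.
def lsrl_slope(xs, ys):
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    return sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
           sum((x - xbar) ** 2 for x in xs)

xs, ys = [1, 2, 3, 4, 5], [2, 4, 6, 8, 10]     # perfectly linear, slope 2
slope_without = lsrl_slope(xs, ys)              # 2.0
slope_with = lsrl_slope(xs + [20], ys + [10])   # x-outlier drags the slope down
print(slope_without, round(slope_with, 2))      # 2.0 0.31
```

Removing the single extreme-x point changes the slope from about 0.31 back to 2.0, which is exactly what “influential” means.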