Chapter 8: Linear Regression


1 Chapter 8: Linear Regression
By Dara Lee and Michelle Smith Period 1

2 Linear Model
The linear model is just an equation of a straight line through data. The points in a scatterplot don’t all line up, but a straight line can summarize the general pattern. The model can help us understand how the variables are associated.

3 Residuals
An estimate from a model is called the predicted value (ŷ). The difference between the observed value (y) and the predicted value (ŷ) is called the residual (e): Residual = Observed - Predicted (e = y - ŷ).
A negative residual means the predicted value is too big. A positive residual means the predicted value is too small.
Residuals help us see whether the model makes sense. To see if a linear model is appropriate, the residuals plot should be scattered with no interesting features: no direction, no shape, no bends, and no outliers.
If the correlation r = 0, there is no linear relationship. If r = -1 or r = 1, the data fall exactly on one straight line.
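A minimal Python sketch of the residual definition above. It borrows the Corolla line from Problem #33 later in this transcript (predicted price = 12319.6 - 924 × years); the function names are illustrative and do not come from the slides.

```python
def predicted_price(age_years, b0=12319.6, b1=-924.0):
    """Predicted value (y-hat) from the Problem #33 line; coefficients as stated on the slides."""
    return b0 + b1 * age_years


def residual(observed, predicted):
    """Residual e = observed - predicted."""
    return observed - predicted


# A 10-year-old Corolla advertised at $1500 (the case in Problem #33, part f).
e = residual(1500, predicted_price(10))
print(f"residual = {e:.1f}")

# A negative residual means the predicted value is too big.
print("prediction too big" if e < 0 else "prediction too small (or exact)")
```

The sign convention matters: swapping the order to predicted minus observed would flip every interpretation on this slide.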

4 Line of Best Fit
The line of best fit is the line for which the sum of the squared residuals is the smallest. It is also known as the “least squares line.” Squaring makes every residual positive so that they can be summed, and it also emphasizes the largest residuals. The smaller the sum, the better the fit.
Equation of the line: ŷ = b0 + b1x
Slope: b1 = r(Sy/Sx)
Intercept: b0 = ȳ - b1x̄ (ȳ and x̄ are the means of y and x)
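The formulas on this slide can be sketched in Python as follows. The five data points are hypothetical, chosen only so the arithmetic is easy to check, and the helper names are assumptions made for this illustration.

```python
from statistics import mean, stdev  # stdev is the sample SD (divides by n - 1)


def correlation(xs, ys):
    """Correlation r: sum of products of deviations over (n - 1) * Sx * Sy."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sum_products = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sum_products / ((n - 1) * stdev(xs) * stdev(ys))


def least_squares_line(xs, ys):
    """Return (b0, b1) using b1 = r * Sy / Sx and b0 = y-bar - b1 * x-bar."""
    b1 = correlation(xs, ys) * stdev(ys) / stdev(xs)
    b0 = mean(ys) - b1 * mean(xs)
    return b0, b1


def sum_squared_residuals(xs, ys, b0, b1):
    """Sum of (observed - predicted)^2 for the line y-hat = b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, not from the slides.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = least_squares_line(xs, ys)
best = sum_squared_residuals(xs, ys, b0, b1)
nudged = sum_squared_residuals(xs, ys, b0, b1 + 0.1)  # any other line fits worse

print(f"y-hat = {b0:.2f} + {b1:.2f}x, sum of squared residuals = {best:.2f}")
print(f"after nudging the slope: {nudged:.2f}")
```

Comparing the fitted line against a slightly nudged one is a quick way to see what "smallest sum of squared residuals" means in practice.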

5 Correlation and the Line
When both variables are standardized, the regression line passes through the origin, so it can be written with just a slope and no intercept: y = mx. The coordinates of these standardized points aren’t written as (x, y); their coordinates are z-scores: (zx, zy), and the slope of the line through them is r.
In the original units, for every horizontal change of Sx there is a vertical change of r(Sy). Moving one standard deviation away from the mean in x moves our estimate r standard deviations away from the mean in y. In general, moving any number of standard deviations in x moves the estimate r times that number of standard deviations in y.

6 How Big Can Predicted Values Get?
Because r is never larger than 1 in absolute value, each predicted y tends to be closer to its mean (in standard deviations) than its corresponding x was. This property of the linear model is called regression to the mean; the line is called the regression line.
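A short Python sketch of the z-score form of the line and of regression to the mean. The correlation value 0.5 is hypothetical, picked only for illustration.

```python
def predicted_z(z_x, r):
    """Predicted z-score of y for a point z_x standard deviations from the mean of x."""
    return r * z_x


r = 0.5  # hypothetical correlation, not from the slides

rows = [(z_x, predicted_z(z_x, r)) for z_x in (-2, -1, 0, 1, 2)]
for z_x, z_y_hat in rows:
    # The prediction is never farther from the mean (in SDs) than x was.
    print(f"z_x = {z_x:+d}  ->  predicted z_y = {z_y_hat:+.1f}")
```

With any r between -1 and 1, the predicted z-score is at most as large in absolute value as the z-score of x, which is the regression-to-the-mean property described above.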

7 R²: The Variation Accounted For
r is the correlation between two variables. The greater the absolute value of the correlation, the stronger the association. The squared correlation, R², gives the fraction of the data’s variation accounted for by the model, and 1 - R² is the fraction of the original variation left in the residuals. An R² of 0 means that none of the variance in the data is in the model; all of it is still in the residuals.
Squaring the residuals ensures that all are positive so that they can be added when finding the line of best fit. The smaller the sum of squared residuals, the better the fit.
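As a rough check of the statement above, the following Python sketch computes R² two ways on hypothetical data: as the squared correlation, and as 1 minus the fraction of variation left in the residuals. The data and variable names are assumptions made for this example.

```python
from statistics import mean

# Hypothetical data, not from the slides.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

mx, my = mean(xs), mean(ys)
sxx = sum((x - mx) ** 2 for x in xs)                     # variation in x
syy = sum((y - my) ** 2 for y in ys)                     # total variation in y
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))   # sum of products

r = sxy / (sxx * syy) ** 0.5   # correlation
b1 = sxy / sxx                 # algebraically the same as r * Sy / Sx
b0 = my - b1 * mx

# Variation left over in the residuals after fitting the line.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
r_squared = 1 - sse / syy

print(f"r^2 = {r ** 2:.3f}, 1 - SSE/SST = {r_squared:.3f}")
```

The two numbers agree for a least squares line, which is why R² can be read as the fraction of variation accounted for by the model.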

8 Assumptions and Conditions
Quantitative Variables Condition: Both variables must be quantitative; they cannot be categorical.
Straight Enough Condition: The scatterplot must look reasonably straight. The linearity can be checked again after the regression, when the residuals can be examined.
Outlier Condition: No point should stand apart from the rest. Outliers often have large residuals, so checking the residuals can help spot them. Outliers can dramatically change a regression model.

9 Chapter 8 Problem #33
Classified ads in the Ithaca Journal offered several used Toyota Corollas for sale. Listed below are the ages of the cars and the advertised prices.

Age (yr)   Price Advertised ($)
1          12995, 10950
2          10495
3          10995, 10995
4          6995, 7990
5          8700, 6995
6          5990, 4995
9          3200, 2250, 3995
11         2900, 2995
13         1750

a) Find the equation of the regression line.
b) Explain the meaning of the slope of the line.
c) Explain the meaning of the intercept of the line.
d) If you want to sell a 7-year-old Corolla, what price seems appropriate?
e) You have a chance to buy one of two cars. They are about the same age and appear to be in equally good condition. Would you rather buy the one with a positive residual or a negative residual? Explain.
f) You see a “For Sale” sign on a 10-year-old Corolla stating the asking price as $1500. What is the residual?
g) Would this regression model be useful in establishing a fair price for a 20-year-old car? Explain.

10 Chapter 8 Problem #33
ANSWERS
a) Predicted price = 12319.6 - 924 × years
b) Every extra year of age decreases the average value by $924.
c) The model predicts that the average new Corolla costs $12,319.60.
d) $5,851.60
e) Negative residual. Its price is below the predicted value for its age.
f) -$1,579.60
g) No. After age 13, the model predicts negative prices. The relationship is no longer linear.

ANSWER EXPLANATIONS
a) First, calculate the mean age of the used Corollas. Each advertised car counts once, so add the ages of all 17 cars (1 + … + 13) and divide by 17. You should get 6.
Second, calculate the mean of the advertised prices: 12995 + … + 1750, all divided by 17. You should get 6,775.59.
Third, calculate the standard deviations of the ages and the advertised prices. Subtract the mean from each of the data values and list the differences. Square each of the differences and make a list of the squares. Add all these squares together. Then subtract one from the number of values; because ages and prices come in pairs, you should get 16 (17 - 1) for both. Divide the sum of the squares by this number (16), then take the square root. The standard deviation of the age should be …. The standard deviation of price should be …
Fourth, for each car, multiply its age difference by its price difference (the value-minus-mean differences from the standard deviation step). Add up all those products and set that number aside. Then multiply together the total-minus-one number (16), the standard deviation of age, and the standard deviation of price. Divide the sum of the products by this number to get the correlation coefficient. You should get r = …
Fifth, plug these numbers into the formula for the regression line. Remember:
Equation of the line: ŷ = b0 + b1x
Slope: b1 = r(Sy/Sx)
Intercept: b0 = ȳ - b1x̄
So b1 = r(Sy/Sx) = -924. Therefore b0 = 6775.59 - (-924)(6) = 12319.6, giving us the equation ŷ = 12319.6 - 924x.
b) The slope is -924: every extra year of age decreases the average value by $924.
c) The model predicts that the average new Corolla costs $12,319.60, because ŷ = 12319.6 - 924(0) = 12319.6.
d) If you want to sell a 7-year-old Corolla, a price of $5,851.60 seems appropriate, because 12319.6 - 924(7) = 5851.6.
e) You would want to buy the car with a negative residual. Remember that the residual is observed minus predicted. If the residual is negative, the predicted value is larger than the observed value; in other words, the model overestimated the price, so the car is priced below what the model predicts for its age. A positive residual means the model underestimated the price: the car costs more than predicted, and if you only brought enough money based on the model, you wouldn’t be able to buy the car.
f) 10-year-old Corolla: the asking price is $1500, and the predicted price is 12319.6 - 924(10) = 3079.6. Observed - predicted = 1500 - 3079.6 = -1579.6 as the residual.
g) This regression model would not be useful in establishing a fair price for a 20-year-old car, because after about 13 years the model predicts negative prices. The relationship is no longer linear. Looking so far ahead to age 20, away from all the other data values, is known as extrapolation.
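The five steps above can be sketched in Python using the advertised prices from the problem. This is a sketch rather than the presenters' own method: it treats every advertised car as one (age, price) pair, 17 in all, and the variable names are made up for this example.

```python
from statistics import mean, stdev  # stdev divides by n - 1

# Age (yr) -> advertised prices ($), copied from the problem statement.
ads = {
    1: [12995, 10950], 2: [10495], 3: [10995, 10995], 4: [6995, 7990],
    5: [8700, 6995], 6: [5990, 4995], 9: [3200, 2250, 3995],
    11: [2900, 2995], 13: [1750],
}

# One (age, price) pair per advertised car.
ages = [age for age, listed in ads.items() for _ in listed]
prices = [price for listed in ads.values() for price in listed]
n = len(ages)

# Steps 1-3: means and standard deviations.
mean_age, mean_price = mean(ages), mean(prices)
s_age, s_price = stdev(ages), stdev(prices)

# Step 4: correlation from the sum of products of deviations.
sum_products = sum((a - mean_age) * (p - mean_price) for a, p in zip(ages, prices))
r = sum_products / ((n - 1) * s_age * s_price)

# Step 5: slope and intercept.
b1 = r * s_price / s_age
b0 = mean_price - b1 * mean_age


def predict(age):
    """Predicted advertised price for a car of the given age."""
    return b0 + b1 * age


print(f"n = {n}, r = {r:.3f}")
print(f"predicted price = {b0:.1f} + ({b1:.1f}) * years")
print(f"7-year-old car: {predict(7):.1f}")
print(f"residual for $1500 at age 10: {1500 - predict(10):.1f}")
print(f"prediction reaches $0 at about {-b0 / b1:.1f} years")
```

Running it also gives the standard deviations and the correlation, and shows where the predicted price crosses zero, which is why the line cannot be trusted for a 20-year-old car.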

11 Chapter 8 Problem #37
Here are the data used when the association between the amounts of fat and calories in hamburgers was examined.

Fat (g)    19   31   34   35   39   39   43
Calories   410  580  590  570  640  680  660

When a scatterplot was made, the equation of the line of regression was calculated to be: Predicted calories = … + …x, where x is the fat in grams and the slope is in calories per fat gram.
a) Explain why you cannot use that model to estimate the fat content of a burger with 600 calories.
b) Using an appropriate model, estimate the fat content of a burger with 600 calories.

12 Chapter 8 Problem #37
ANSWERS
a) The regression was for predicting calories from fat, not the other way around.
b) Predicted fat = -15 + 0.083 × calories (slope in grams per calorie). Predict 34.8 grams of fat.

ANSWER EXPLANATIONS
a) Remember that the first regression analysis modeled calories vs. fat content, NOT fat content vs. calories. In other words, the regression was for predicting calories from fat, not for predicting fat from calories.
b) First, calculate the mean of the calories: 410 + … + 660, all divided by 7. You should get 590.
Second, calculate the mean of the fat content: 19 + … + 43, all divided by 7. You should get 34.29.
Third, calculate the standard deviations of the calories and fat content. Subtract the mean from each of the data values and list the differences. Square each of the differences and make a list of the squares. Add all these squares together. Then subtract one from the number of values. For calories and fat content, you should get 6 (7 - 1). Divide the sum of the squares by this number (6), then take the square root. The standard deviation of the calories should be 89.81 (the square root of the sum of squares alone, before dividing by 6, is 220). The standard deviation of the fat content should be 7.804 (19.116 before dividing by 6).
Fourth, for each burger, multiply its calorie difference by its fat difference (the value-minus-mean differences from the standard deviation step). Add up all those products and set that number aside. Then multiply together the total-minus-one number (6), the standard deviation of the calories, and the standard deviation of the fat content. Divide the sum of the products by this number to get the correlation coefficient. You should get r = .9606.
Fifth, plug these numbers into the formula for the regression line, with calories as x and fat as y. Remember:
Equation of the line: ŷ = b0 + b1x
Slope: b1 = r(Sy/Sx)
Intercept: b0 = ȳ - b1x̄
So b1 = .9606(7.804/89.81) = .083 (the ratio 19.116/220 gives the same result). Therefore b0 = 34.29 - (.083)(590), which is about -15, giving us the equation ŷ = -15 + .083x.
Sixth, plug in 600 for x: -15 + .083(600) = 34.8.
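A Python sketch of part b), fitting fat from calories. One assumption is flagged loudly: the transcript lists only six fat values for seven burgers, and the code assumes the missing value is a second 39, because that choice reproduces the transcript's figures (19.116 and r = .9606). The helper name fit_line is also made up for this example.

```python
from statistics import mean

# ASSUMPTION: the second 39 is restored; the transcript shows only six fat values.
fat = [19, 31, 34, 35, 39, 39, 43]
calories = [410, 580, 590, 570, 640, 680, 660]


def fit_line(xs, ys):
    """Least squares line predicting y from x; returns (b0, b1).

    b1 = (sum of products) / (sum of squared x deviations), which is
    algebraically the same as r * Sy / Sx.
    """
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sxy / sxx
    return my - b1 * mx, b1


# Appropriate model for part b): predict fat FROM calories.
b0_fat, b1_fat = fit_line(calories, fat)
fat_at_600 = b0_fat + b1_fat * 600

# The slides round the coefficients to -15 and 0.083 before plugging in.
fat_rounded = -15 + 0.083 * 600

# Part a): solving the calories-from-fat line backwards gives a different answer.
b0_cal, b1_cal = fit_line(fat, calories)
fat_inverted = (600 - b0_cal) / b1_cal

print(f"fat-from-calories line: {b0_fat:.2f} + {b1_fat:.4f} * calories")
print(f"unrounded estimate at 600 calories: {fat_at_600:.1f} g")
print(f"estimate with the slides' rounded coefficients: {fat_rounded:.1f} g")
print(f"inverting the calories-from-fat line instead: {fat_inverted:.1f} g")
```

The unrounded fit lands slightly above the 34.8 g obtained with rounded coefficients, and the inverted line gives yet another value, which illustrates why the original model should not be solved backwards.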

