
Chapter 8 Linear Regression HOW CAN A MODEL BE CREATED WHICH REPRESENTS THE LINEAR RELATIONSHIP BETWEEN TWO QUANTITATIVE VARIABLES?


1 Chapter 8 Linear Regression HOW CAN A MODEL BE CREATED WHICH REPRESENTS THE LINEAR RELATIONSHIP BETWEEN TWO QUANTITATIVE VARIABLES?

2 Fat Versus Protein: An Example
• The slide shows a scatterplot of total fat versus protein for 30 items on the Burger King menu.
• How many grams of fat would an item with 25 grams of protein have?

3 What is Linear Regression?
• Remember that correlation suggests there is a "linear" relationship between two variables.
• We can say more about the linear relationship between two quantitative variables with a model.
• The linear relationship is modeled by a straight line through the data.
• The data points do not all fall on the line, but a straight line summarizes the overall direction of the data.

4 Regression and Residuals
• Some points will be above the line and some will be below it.
• The estimate made from a model is the predicted value (denoted ŷ).
• The difference between the actual (observed) value and the predicted value is known as the residual: residual = observed − predicted.

5 Residuals (cont.)
• A negative residual means the predicted value is too big (an overestimate).
• A positive residual means the predicted value is too small (an underestimate).

6 Line of Best Fit
• Some residuals are positive (points above the line) and some are negative (points below the line).
• To measure how well the line fits, we might add up the residuals, but the negatives and the positives cancel each other out. Therefore we add the squared residuals instead.
• The line of best fit is the line for which the sum of the squared residuals is the smallest.
• The regression line is also known as the Least Squares Regression Line (LSRL).
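The least-squares criterion described above can be written out as a formula. This is the standard statement of the criterion, using the b₀ and b₁ notation from the next slide; the index i and the count n are notation assumed here for illustration.

```latex
\text{Choose } b_0 \text{ and } b_1 \text{ to minimize }
\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
= \sum_{i=1}^{n} \bigl( y_i - (b_0 + b_1 x_i) \bigr)^2
```

Squaring makes every term non-negative, so points above and below the line can no longer cancel each other.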

7 Line of Best Fit
• It is written as ŷ = a + bx or, equivalently, ŷ = b₀ + b₁x, where b₀ (or a) is the y-intercept and b₁ (or b) is the slope.

8 Slope of the Regression Line
• Our slope is always in units of y per unit of x.

9 Y-Intercept
• Our intercept is always in units of y.
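For reference, the slope and intercept of the least-squares line can be computed from summary statistics. The formulas below are the standard textbook ones, assumed here rather than taken from the slide text; r is the correlation, s_x and s_y are the standard deviations, and x̄ and ȳ are the means.

```latex
b_1 = r \, \frac{s_y}{s_x}
\qquad\qquad
b_0 = \bar{y} - b_1 \, \bar{x}
```

The units follow directly: s_y / s_x carries units of y per unit of x (r has no units), and b₀ is a y-value minus another y-value, so it is in units of y.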

10 Residuals Revisited
• The model treats the relationship as a perfect straight line.
• The part of the data that does not fall on the line is the part that has not been modeled.
• Data = Model + Residual
• Residual = Data − Model
• In symbols: e = y − ŷ

11 Example
• Given the regression line for the previous scatterplot:
  ŷ = 6.413 + 0.9769x
  Predicted Fat = 6.413 + 0.9769 × Protein
• What does the slope represent?
• What does the y-intercept mean?

12 Example Continued
• Given the regression line for the previous scatterplot:
  ŷ = 6.413 + 0.9769x
  Predicted Fat = 6.413 + 0.9769 × Protein
• How much fat would we expect an item with 12 grams of protein to have?
• How much protein would an item with 15 grams of fat have?
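The first question above is a direct substitution into the model. A minimal sketch, using the coefficients given on the slide; the function name `predicted_fat` is made up for illustration.

```python
# Coefficients copied from the slide: Predicted Fat = 6.413 + 0.9769 * Protein
INTERCEPT = 6.413  # predicted grams of fat at 0 g of protein
SLOPE = 0.9769     # additional grams of fat per additional gram of protein


def predicted_fat(protein_g: float) -> float:
    """Return the model's predicted fat (g) for a given protein content (g)."""
    return INTERCEPT + SLOPE * protein_g


fat_at_12 = predicted_fat(12)
print(f"Predicted fat at 12 g of protein: {fat_at_12:.2f} g")
```

The second question runs the other way. Solving this equation for protein gives a rough answer, but the line was fitted to predict fat from protein, so it is not the least-squares prediction of protein; that would need a separate regression of protein on fat.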

13 Example Continued
• Given the regression line for the previous scatterplot:
  ŷ = 6.413 + 0.9769x
  Predicted Fat = 6.413 + 0.9769 × Protein
• A Double Whopper sandwich has 48 grams of protein and 58 grams of fat. What is the residual in fat for this sandwich?
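The residual question can be worked as observed minus predicted. A small sketch, again using the slide's coefficients; the helper name `fat_residual` is hypothetical.

```python
# Coefficients copied from the slide: Predicted Fat = 6.413 + 0.9769 * Protein
INTERCEPT = 6.413
SLOPE = 0.9769


def fat_residual(observed_fat_g: float, protein_g: float) -> float:
    """Residual = observed value - predicted value, in grams of fat."""
    predicted = INTERCEPT + SLOPE * protein_g
    return observed_fat_g - predicted


double_whopper = fat_residual(observed_fat_g=58, protein_g=48)
print(f"Double Whopper residual: {double_whopper:.2f} g of fat")
```

The result is positive, so by the rule on the earlier residuals slide the model underestimates the fat in this sandwich.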

14 Example: Burger King
• The following are select items from the Burger King menu with total calories and grams of fat:

Item | Calories | Grams of fat
Whopper | 650 | 37
Whopper with cheese | 730 | 44
Big King | 530 | 31
Hamburger | 230 | 9
Cheeseburger | 270 | 12
Tendergrill chicken sandwich | 460 | 21
Original chicken sandwich | 660 | 40
Big fish sandwich | 520 | 28
BK Veggie Burger | 390 | 16

15 Example Continued
• What is the regression line for the data?
• What is the slope in the context of the problem?
• What is the y-intercept in the context of the problem?
• A sandwich with 15 grams of fat would be expected to have how many calories?
• A sandwich with 450 calories would be expected to have how many grams of fat?
• A Bacon Cheeseburger has 13 grams of fat and 290 total calories. What is the residual in calories for this sandwich?
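One way to answer these questions is to fit the line directly from the nine menu items on the previous slide. The sketch below assumes fat is the explanatory variable (x) and calories the response (y), which matches how the questions are phrased; the variable and function names are illustrative.

```python
# (grams of fat, calories) for the nine items listed on the previous slide
items = {
    "Whopper": (37, 650),
    "Whopper with cheese": (44, 730),
    "Big King": (31, 530),
    "Hamburger": (9, 230),
    "Cheeseburger": (12, 270),
    "Tendergrill chicken sandwich": (21, 460),
    "Original chicken sandwich": (40, 660),
    "Big fish sandwich": (28, 520),
    "BK Veggie Burger": (16, 390),
}

fat = [f for f, _ in items.values()]
calories = [c for _, c in items.values()]
n = len(fat)

mean_fat = sum(fat) / n
mean_cal = sum(calories) / n

# Least-squares slope and intercept from sums of squares and cross-products
sxy = sum((x - mean_fat) * (y - mean_cal) for x, y in zip(fat, calories))
sxx = sum((x - mean_fat) ** 2 for x in fat)
slope = sxy / sxx                       # calories per gram of fat
intercept = mean_cal - slope * mean_fat  # predicted calories at 0 g of fat


def predicted_calories(fat_g: float) -> float:
    return intercept + slope * fat_g


calories_at_15 = predicted_calories(15)
bacon_residual = 290 - predicted_calories(13)  # observed - predicted

print(f"Predicted Calories = {intercept:.2f} + {slope:.2f} * Fat")
print(f"Calories at 15 g of fat: {calories_at_15:.1f}")
print(f"Bacon Cheeseburger residual: {bacon_residual:.1f} calories")
```

With these nine items the fitted line comes out at roughly Predicted Calories = 134.25 + 13.58 × Fat, so each additional gram of fat goes with about 13.6 more calories, and the Bacon Cheeseburger has a negative residual (the model overestimates its calories).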

16 Conditions Required
1. Quantitative Variables condition
2. Straight Enough condition
3. Outlier condition

17 R-Squared
• R² gives the fraction of the data's variation accounted for by the model, and 1 − R² is the fraction of the original variation left in the residuals.
• Example (Burger King sandwiches): r = 0.9881, so R² = 0.9763. That is, 97.63% of the variation in calorie content among these sandwiches is accounted for by fat content; the remaining 2.37% comes from other factors.
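The r and R² values quoted above can be checked against the nine menu items from the earlier table. A short sketch in plain Python; for a single-predictor regression, R² is simply the square of the correlation r.

```python
from math import sqrt

# Fat (g) and calories for the nine Burger King items, in table order
fat = [37, 44, 31, 9, 12, 21, 40, 28, 16]
calories = [650, 730, 530, 230, 270, 460, 660, 520, 390]

n = len(fat)
mean_x = sum(fat) / n
mean_y = sum(calories) / n

sxx = sum((x - mean_x) ** 2 for x in fat)
syy = sum((y - mean_y) ** 2 for y in calories)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(fat, calories))

r = sxy / sqrt(sxx * syy)  # correlation coefficient
r_squared = r ** 2         # fraction of variation accounted for by the line

print(f"r = {r:.4f}, R^2 = {r_squared:.4f}")
print(f"Variation left in residuals: {1 - r_squared:.2%}")
```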

18 Residual Plot
• A plot of the residuals of the regression line.
• A noticeable pattern in the residual plot may indicate that the regression line is not a good model.
• The residual plot of a well-fitting model shows random scatter around zero with no pattern.

19 What Not to Do
• Don't fit a straight line to a nonlinear relationship.
• Beware of extraordinary points.
• Don't extrapolate beyond the data.
• Don't infer that x causes y just because there is a good linear model for their relationship.
• Don't choose a model based on R² alone.

20 Breakfast Cereals, Sugar and Calories
The following data come from 77 different breakfast cereals, relating the sugar in each cereal to its calories.
• r = 0.564
• Calories: mean 107.0, SD 19.5
• Sugar: mean 7.0 grams, SD 4.4 grams
• What is the slope of the regression line?
• What is the y-intercept?
• Write the regression equation.
• Interpret.
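When only summary statistics are available, the line can be built from r, the means, and the standard deviations. The sketch below uses the standard formulas slope = r · s_y / s_x and intercept = ȳ − slope · x̄, with sugar as x and calories as y; the function name `line_from_summary` is made up for illustration.

```python
def line_from_summary(r, mean_x, sd_x, mean_y, sd_y):
    """Return (slope, intercept) of the least-squares line from summary statistics."""
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Summary statistics from the slide: x = sugar (g), y = calories
slope, intercept = line_from_summary(
    r=0.564, mean_x=7.0, sd_x=4.4, mean_y=107.0, sd_y=19.5
)

print(f"Predicted Calories = {intercept:.1f} + {slope:.2f} * Sugar")
```

This gives a slope of about 2.50 calories per gram of sugar and an intercept of about 89.5 calories, the predicted calorie count for a cereal with no sugar.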

21 Urban Planning
• We want to estimate the costs per person associated with traffic delays.
• Data: 2002 Urban Mobility Report (70 cities in 2000)
• Annual cost per person: mean $298.96, SD $180.83
• Average speed per person: mean 54.34 mph, SD 4.494 mph
• r = −0.90
• Write an equation to model this situation.
• What does the slope mean?
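A worked version of this example, assuming average speed is the explanatory variable (x) and annual cost per person is the response (y), and using the standard summary-statistic formulas for slope and intercept:

```latex
b_1 = r \, \frac{s_y}{s_x} = -0.90 \times \frac{180.83}{4.494} \approx -36.21
\qquad
b_0 = \bar{y} - b_1 \bar{x} \approx 298.96 + 36.214 \times 54.34 \approx 2266.8

\widehat{\text{Cost}} \approx 2266.8 - 36.21 \times \text{Speed}
```

Under these assumptions, each additional mile per hour of average speed is associated with roughly $36 less in annual delay cost per person.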

22 What to Watch Out for in Regression
• Interpreting beyond the data (extrapolating)
• Influential points
• Lurking variables
• Linear regression that is not "linear", and what to do about it

23 Extrapolation
• We cannot assume that a linear relationship in the data exists beyond the range of the data.
• Once we venture into new x territory, such a prediction is called an extrapolation.

24 Extrapolation (cont.)
• A regression of mean age at first marriage for men vs. year, fit to the first 4 decades of the 20th century, does not hold for later years.

25 Influential Outliers
• We say that a point is influential if omitting it from the data gives a very different model.
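One rough way to check for influence, following the definition above, is to refit the line with each point left out and see how much the slope moves. The sketch below does this for the nine Burger King items from the earlier table (fat as x, calories as y); `fit_line` and `slope_shift` are illustrative names, and how large a shift counts as "very different" is a judgment call the slide does not specify.

```python
# (item, grams of fat, calories) from the earlier Burger King table
items = [
    ("Whopper", 37, 650),
    ("Whopper with cheese", 44, 730),
    ("Big King", 31, 530),
    ("Hamburger", 9, 230),
    ("Cheeseburger", 12, 270),
    ("Tendergrill chicken sandwich", 21, 460),
    ("Original chicken sandwich", 40, 660),
    ("Big fish sandwich", 28, 520),
    ("BK Veggie Burger", 16, 390),
]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


fat = [f for _, f, _ in items]
calories = [c for _, _, c in items]
full_slope, full_intercept = fit_line(fat, calories)

# Refit with each item omitted and record the change in slope
slope_shift = {}
for i, (name, _, _) in enumerate(items):
    xs = fat[:i] + fat[i + 1:]
    ys = calories[:i] + calories[i + 1:]
    slope_without, _ = fit_line(xs, ys)
    slope_shift[name] = slope_without - full_slope

most_influential = max(slope_shift, key=lambda k: abs(slope_shift[k]))
for name, shift in slope_shift.items():
    print(f"{name:30s} slope change when omitted: {shift:+.3f}")
print("Largest slope change:", most_influential)
```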

26 Outliers, Leverage, and Influence (cont.)
• The following scatterplot shows that something was awry in Palm Beach County, Florida, during the 2000 presidential election…

27 Lurking Variable
• No matter how straight the line, how strong the association, or how high the R-squared value, there is no way to conclude from regression alone that one variable causes the other.
• There is always the possibility that some third variable is driving both of the variables being observed.

28 What to Do When the Relationship Is Not Straight
• Re-express the data with logs, square roots, or reciprocals. We will look primarily at square roots and logarithms.
  Example: take the square root of the response variable, re-plot the data in a scatterplot, and examine the residual plot.
  Example: re-express the data using a combination of logarithms: log(x), log(y).
• Fit a curve to the data instead (more difficult).

29 The Ladder of Powers

Power | Name | Comment
2 | Square of data values | Try with unimodal distributions that are skewed to the left.
1 | Raw data | Data with positive and negative values and no bounds are less likely to benefit from re-expression.
½ | Square root of data values | Counts often benefit from a square root re-expression.
"0" | We'll use logarithms here | Measurements that cannot be negative often benefit from a log re-expression.
−1/2 | Reciprocal square root | An uncommon re-expression, but sometimes useful.
−1 | The reciprocal of the data | Ratios of two quantities (e.g., mph) often benefit from a reciprocal.

30 Plan B: Attack of the Logarithms (cont.)

31 Why Not Just a Curve?
• If there's a curve in the scatterplot, why not just fit a curve to the data?

32 Why Not Just a Curve? (cont.)
• The mathematics and calculations for "curves of best fit" are considerably more difficult than "lines of best fit."
• Besides, straight lines are easy to understand.
• We know how to think about the slope and the y-intercept.

33 Example: Data collected in the study of water pollution from commercial and domestic waste

Day | Oxygen Demand
1 | 109
2 | 149
3 | (not given)
5 | 191
7 | 213
10 | 224
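As a sketch of how a re-expression might be checked on these data, the code below compares R² for the raw days against R² after taking log10 of the day. Which re-expression the example intends is not stated, so re-expressing x with a logarithm is an assumption here; the day 3 reading is left out because its value is not given, and `r_squared` is an illustrative helper.

```python
from math import log10

# Day 3 is omitted because its oxygen-demand value is missing from the table
days = [1, 2, 5, 7, 10]
oxygen_demand = [109, 149, 191, 213, 224]


def r_squared(xs, ys):
    """Squared correlation between xs and ys (R^2 of the least-squares line)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)


r2_raw = r_squared(days, oxygen_demand)
r2_log = r_squared([log10(d) for d in days], oxygen_demand)

print(f"R^2 using day:        {r2_raw:.3f}")
print(f"R^2 using log10(day): {r2_log:.3f}")
```

On these five points, R² rises from roughly 0.89 on the raw scale to above 0.99 after the log re-expression. A higher R² alone is not a reason to choose a model, so the residual plot for the re-expressed data should still be checked for leftover curvature.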

