Chapters 8 & 9 Linear Regression & Regression Wisdom.


1 Chapters 8 & 9 Linear Regression & Regression Wisdom

2 r = 0.8718945 Price of Homes Based on Size (in Square Feet) Sold in Ames between Sep. 2004 and Oct. 2005

3 Statistical Modeling Statistical Model: An equation that fits the pattern between a response variable and possible explanatory variables, accounting for deviations from the model. (Simplest case: one quantitative response variable and one quantitative explanatory variable.) Response Variable (Y): The quantitative outcome of a study. Explanatory Variable (X): A quantitative variable that may explain or predict the response variable. What is the best model for: Predicting weight (Y) from height (X)? What is the best model for: Predicting blood pressure (Y) from age (X)?

4 Correlation and the Line Price of Homes Based on Square Feet Price = -90.2458 + 0.1598SQFT r = 0.8718945

5 Regression line Explains how the response variable (y) changes in relation to the explanatory variable (x) Use the line to predict value of y for a given value of x

6 Regression line Need a mathematical formula We want to predict y from x The predicted values are called \(\hat{y}\) ("y-hat"). The observed values are called y.

7 Which Line is Best? What are some ways we can determine which model out of all the possible models is the “best” one? What are some ways that we can numerically rank the different models (i.e. the different lines)?

8 Which Model is Best? Price = -90.2458 + 0.1598SQFT (red) Price = -300 + 0.3SQFT (blue) Price = 0 + 0.1SQFT (green)

9 Regression line “Putting a hat on it” means we have predicted something from the model Look at vertical distance Amount of error in the regression line The goal is to find the line so that these errors are minimized.

10 Least squares regression Most commonly used regression line Makes the sum of the squared errors as small as possible Based on the statistics \(\bar{x}\), \(\bar{y}\), \(s_x\), \(s_y\), and \(r\)

11 Regression line equation \(\hat{y} = b_0 + b_1 x\) where \(b_1 = r \dfrac{s_y}{s_x}\) and \(b_0 = \bar{y} - b_1 \bar{x}\)
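
As a rough sketch of how the regression line can be computed from data, the function below finds the two means, the two standard deviations, and the correlation, then applies the usual least squares formulas: slope = r times s_y over s_x, and intercept = mean of y minus slope times mean of x. The function name and the five sample points are hypothetical, chosen only so the arithmetic is easy to check by hand.

```python
from math import sqrt


def least_squares_line(xs, ys):
    """Return (intercept, slope, r) for the least squares line of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (divide by n - 1).
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    # Correlation: average product of standardized deviations.
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    slope = r * s_y / s_x
    intercept = y_bar - slope * x_bar
    return intercept, slope, r


# HYPOTHETICAL data for illustration only.
b0, b1, r = least_squares_line([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(f"y-hat = {b0:.2f} + {b1:.2f} x, r = {r:.4f}")
```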

12 Regression line equation \(b_1\) = slope of line. For every unit increase in x, y changes by the amount of the slope. Interpreting \(b_1\) (slope): For every one unit increase in the explanatory variable, there will be, on average, a \(b_1\) unit(s) increase/decrease in the response variable. For example: For every one square foot increase in size, on average, there will be a $159.80 increase in home price. MEMORIZE THIS!!!!!

13 Regression line equation \(b_0\) = y-intercept of line. The value of y when x = 0. Interpreting \(b_0\) (y-intercept): When the explanatory variable = 0, on average, the value of the response variable = \(b_0\). For example: When the sq. ft. of a home is 0, the price of the home will be -$90,245.80 on average. MEMORIZE THIS!!!!! BE CAREFUL. The interpretation of the intercept does not always make sense. When interpreting, be sure to mention if the interpretation does not make sense.

14 Example – Kobe’s Shooting I visited cnnsi’s website and checked out some of Kobe Bryant’s personal scoring numbers. I looked at the number of times he shot the ball and his point total for each game so far this year. Let’s come up with the regression equation for this data.

15 Kobe’s Shooting r = 0.7293762 Form: Linear Strength: Moderate to Strong Direction: Positive

16 Calculating the regression line Remember that: Our explanatory variable (x) is the number of shots. Our response variable (y) is the number of points. So the five numbers needed are:

17 Calculating the Regression Line Find the Slope: \(b_1 = r \dfrac{s_y}{s_x} = 1.19\) Find the Intercept: \(b_0 = \bar{y} - b_1 \bar{x} = 3.436\)

18 Calculating the regression line. Don’t forget to write the equation: predicted points = 3.436 + 1.19(shots). DON’T FORGET TO WRITE THE EQUATION IN THE CONTEXT OF THE PROBLEM.

19 Interpretation How would we interpret \(b_1\)? For a one shot increase from Kobe Bryant, on average we would expect him to score 1.19 more points. How would we interpret \(b_0\)? If Kobe Bryant took no shots, then on average we would expect him to score 3.436 points.

20 Prediction Use the regression equation to predict y from x. Ex. What is the predicted number of points when Kobe shoots 30 times in a game? Ex. What is the predicted number of points when Kobe shoots 55 times in a game?

21 Plotting the regression line Find two points on the line: Ex. x = 30, y = 39 and x = 55, y = 69 If you are plotting by hand it is ok to round values Plot these two points on the graph Connect the points This is the regression line
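
The two plotting points can be checked with a few lines of code. This sketch uses the slope (1.19) and intercept (3.436) quoted on the interpretation slide; the function name is hypothetical.

```python
def predicted_points(shots, b0=3.436, b1=1.19):
    """Predicted points for a game, from the fitted line y-hat = b0 + b1 * shots."""
    return b0 + b1 * shots


# The two x-values used for plotting the line.
for shots in (30, 55):
    exact = predicted_points(shots)
    print(f"{shots} shots: predicted {exact:.3f} points, about {round(exact)} when rounded")
```

Rounding to whole points gives the two points used on the slide for plotting by hand.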

22 Plotting the Regression Line

23 Properties of regression line r is related to the value of \(b_1\) r has the same sign as \(b_1\) One standard deviation change in x corresponds to r times one standard deviation change in the predicted y The regression line always goes through the point \((\bar{x}, \bar{y})\)
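
These properties can be checked numerically. The sketch below assumes the usual least squares formulas (slope = r times s_y over s_x, intercept = mean of y minus slope times mean of x) and uses made-up summary statistics, so the specific numbers carry no meaning beyond the illustration.

```python
# HYPOTHETICAL summary statistics for illustration only.
x_bar, y_bar = 20.0, 30.0
s_x, s_y = 5.0, 8.0
r = 0.75

# Least squares slope and intercept from the five summary statistics.
b1 = r * s_y / s_x
b0 = y_bar - b1 * x_bar


def predict(x):
    """Predicted y from the fitted line."""
    return b0 + b1 * x


# Property: the line passes through the point of means.
at_mean = predict(x_bar)

# Property: moving one standard deviation in x moves the prediction r standard deviations in y.
change_for_one_sd = predict(x_bar + s_x) - predict(x_bar)

print(f"prediction at x-bar: {at_mean:.2f} (y-bar is {y_bar:.2f})")
print(f"change for one SD in x: {change_for_one_sd:.2f} (r * s_y is {r * s_y:.2f})")
```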

24 Properties of regression line \(r^2\) Percent of variation in y that is explained by the least squares regression of y on x The higher the value of \(r^2\), the more the regression line explains the changes that occur in the y variable The higher the value of \(r^2\), the better the regression line fits the data

25 Properties of regression line \(r^2\) \(0 \le r^2 \le 1\) since \(-1 \le r \le 1\) Interpretation of \(r^2\): \(r^2\) is the percent of variation in the response variable that can be explained by the least squares regression of the response variable on the explanatory variable. For Kobe’s example: About 53.2% of the variability in the number of points Kobe Bryant scores in a game can be explained by the LS regression of points per game on number of shots per game. MEMORIZE THIS!!!!
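
To see why the squared correlation is read as "percent of variation explained", it helps to compute it as one minus the ratio of leftover variation (squared residuals) to total variation in y. The sketch below does that on a small made-up data set and, separately, squares the correlation reported for the Kobe data (r = 0.7293762). The function name and the five data points are hypothetical.

```python
def r_squared(xs, ys):
    """Fraction of the variation in ys explained by the least squares line on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    # Variation left over after fitting the line.
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    # Total variation in y around its mean.
    sst = sum((y - y_bar) ** 2 for y in ys)
    return 1 - sse / sst


# HYPOTHETICAL data for illustration only.
r2 = r_squared([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(f"made-up data: r^2 = {r2:.3f}")

# Squaring the correlation reported for the Kobe data.
kobe_r2 = 0.7293762 ** 2
print(f"Kobe data: r^2 = {kobe_r2:.4f}")
```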

26 Residuals Amount of variation in y not taken into account by regression line Formula: residual = observed - predicted = \(y - \hat{y}\) There is a residual for each data point Mean of the residuals is zero

27 Calculating Residuals – Kobe Find the residual for the point (46,81) First find the predicted number of points for a game with 46 shots: \(\hat{y} = 3.436 + 1.19(46) \approx 58.18\) Now find the residual: \(y - \hat{y} \approx 81 - 58.18 = 22.82\)

28 Calculating Residuals – Kobe Find the residual for the point (26,35)
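
Both residuals can be worked out the same way: predict the points from the number of shots, then subtract the prediction from the observed points. This sketch uses the rounded slope and intercept from the interpretation slide (1.19 and 3.436), so results may differ slightly from values computed with unrounded coefficients. The function name is hypothetical.

```python
# Rounded coefficients from the interpretation slide.
B0, B1 = 3.436, 1.19


def residual(shots, points):
    """Observed points minus the points predicted from the number of shots."""
    predicted = B0 + B1 * shots
    return points - predicted


# The two games from the slides, as (shots, points).
games = [(46, 81), (26, 35)]
residuals = {game: residual(*game) for game in games}

for (shots, points), res in residuals.items():
    print(f"({shots}, {points}): residual = {res:.3f}")
```

A positive residual means he scored more than the line predicts for that number of shots.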

29 Residual Plots Scatterplot of Residuals Explanatory variable on horizontal axis Residuals on vertical axis Horizontal line at residual = 0

30 Residual Plots

31 Interpreting Residual Plots Is there a curved pattern? This could mean a non-linear relationship Is there increasing spread about the line as x increases? This could mean non-constant variance Is there decreasing spread about the line as x increases? This could mean non-constant variance

32 Interpreting Residual Plots Points with large residuals These are probably outliers in the y direction These will pull the regression line in the direction of the outlier (up or down) Extreme points in the x direction These are called influential points They do not always show up in residuals because the residual could be small Removing the outlier could markedly change the regression line

33 Reading JMP Data Bivariate Fit of BAC by # of Beers

34 Reading JMP Data Linear Fit BAC = -0.011654 + 0.0180112 # of Beers This is the regression line for the data. Slope is 0.0180112. y-Intercept is -0.011654. The response variable is the BAC. The explanatory variable is the # of Beers.

35 Reading JMP Data Summary of Fit
RSquare                      0.803536
RSquare Adj                  0.788424
Root Mean Square Error       0.020920
Mean of Response             0.076000
Observations (or Sum Wgts)   15
This gives some summary of the data. RSquare = \(r^2\) = (correlation)\(^2\) Root Mean Square Error = s Mean of Response = \(\bar{y}\) Observations = n

36 Reading JMP Data Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio
Model       1   0.02327041       0.023270      53.1700
Error      13   0.00568959       0.000438      Prob > F
C. Total   14   0.02896000                     <.0001
This is called the ANOVA Table. This is another way to analyze the data. We aren’t going to discuss this in this class.
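
Although the ANOVA table is not covered in this class, its sums of squares tie directly to the Summary of Fit numbers. The sketch below recomputes RSquare, RSquare Adj, Root Mean Square Error, and the F Ratio from the sums of squares, using the standard textbook relationships for a one-predictor regression. The variable names are hypothetical; only the input numbers come from the JMP output.

```python
from math import sqrt

# Values read from the JMP Analysis of Variance table.
ss_model = 0.02327041
ss_error = 0.00568959
n = 15

df_model = 1      # one explanatory variable
df_error = n - 2  # observations minus the two estimated coefficients

ss_total = ss_model + ss_error

# RSquare: share of the total variation accounted for by the model.
r_square = ss_model / ss_total

# RSquare Adj: RSquare adjusted for the degrees of freedom used.
r_square_adj = 1 - (1 - r_square) * (n - 1) / df_error

# Root Mean Square Error: square root of the error mean square.
ms_error = ss_error / df_error
rmse = sqrt(ms_error)

# F Ratio: model mean square divided by error mean square.
f_ratio = (ss_model / df_model) / ms_error

print(f"RSquare     = {r_square:.6f}")
print(f"RSquare Adj = {r_square_adj:.6f}")
print(f"RMSE        = {rmse:.6f}")
print(f"F Ratio     = {f_ratio:.4f}")
```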

37 Reading JMP Data Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   -0.011654   0.013179    -0.88     0.3926
#beers      0.0180112   0.002470    7.29      <.0001
This tells you what the y-intercept and slope are. It also gives the standard error for each of the estimates. If you were to form confidence intervals for the parameter estimates, you would need these values. We won’t discuss that in this class.
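
For readers curious where the t Ratio column comes from, each value is the estimate divided by its standard error. A minimal sketch using the numbers from the Parameter Estimates table (the dictionary layout is just one convenient way to hold them):

```python
# (estimate, standard error) for each term, read from the JMP output.
estimates = {
    "Intercept": (-0.011654, 0.013179),
    "#beers": (0.0180112, 0.002470),
}

# t ratio = estimate / standard error.
t_ratios = {term: est / se for term, (est, se) in estimates.items()}

for term, t in t_ratios.items():
    print(f"{term:10s} t ratio = {t:.2f}")
```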

38 Reading JMP Data Here is your residual plot. Check it to see if there are any problems with linearity of the data and constant variance.

39 Example

40 Age at first word vs. Gesell score. Scatterplot: Weak negative linear relationship between two variables. Possible outliers at (42,57) and (17,121). Regression: r = -0.64, \(r^2\) = 40.96%.

41 Example

42 Age at first word vs. Gesell score. Prediction: When x=17 When x=42 Residuals: point (17,121) point (42,57)

43 Example

44 Residual Plot Outliers at x=17 and x=42 Small residual for x=42 Could be influential Remove (42,57) from data. Regression line changes markedly. r = -0.33, \(r^2\) = 10.89%.
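
The same behavior, a point that is extreme in x having a small residual yet controlling the line, can be reproduced with a tiny made-up data set. The points below are hypothetical and are not the Gesell data; the sketch fits the line with and without the extreme point and compares the slopes.

```python
def fit_line(points):
    """Return (intercept, slope) of the least squares line through the points."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    s_xx = sum((x - x_bar) ** 2 for x, _ in points)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


# HYPOTHETICAL data: a small cluster plus one point far out in the x direction.
cluster = [(1, 3), (2, 5), (3, 3), (4, 5), (5, 4)]
extreme = (20, -6)

b0_all, b1_all = fit_line(cluster + [extreme])
b0_cut, b1_cut = fit_line(cluster)

# Residuals from the line fitted to ALL points, including the extreme one.
residuals_all = {
    (x, y): y - (b0_all + b1_all * x) for x, y in cluster + [extreme]
}

print(f"slope with the extreme point:    {b1_all:.3f}")
print(f"slope without the extreme point: {b1_cut:.3f}")
print(f"residual of the extreme point:   {residuals_all[extreme]:.3f}")
```

In this made-up example the extreme point has the smallest residual of all six points, yet removing it flips the sign of the slope.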

45 Example

46 Outliers--What should you do? Make sure data points have been recorded correctly Collect more data Remove the outlier Examine collection techniques Examine outside influences

47 Cautions about regression Linear relationship only Not resistant Using averaged data Makes relationship appear stronger Taking average removes variation Extrapolation Predicting y when x value is outside the original data
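
The caution about averaged data can be illustrated with a small made-up example: individual measurements scatter widely at each x, but once the y-values are averaged within each x, the scatter disappears and the correlation jumps. All data and function names below are hypothetical.

```python
from math import sqrt


def correlation(points):
    """Pearson correlation of a list of (x, y) pairs."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    s_xx = sum((x - x_bar) ** 2 for x, _ in points)
    s_yy = sum((y - y_bar) ** 2 for _, y in points)
    return s_xy / sqrt(s_xx * s_yy)


# HYPOTHETICAL individual measurements: two y-values at each x.
individual = [(1, 0), (1, 4), (2, 1), (2, 5), (3, 2), (3, 6)]

# Average the y-values within each x.
groups = {}
for x, y in individual:
    groups.setdefault(x, []).append(y)
averaged = [(x, sum(ys) / len(ys)) for x, ys in sorted(groups.items())]

r_individual = correlation(individual)
r_averaged = correlation(averaged)

print(f"r for individual points: {r_individual:.3f}")
print(f"r for averaged points:   {r_averaged:.3f}")
```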

48 Cautions about Regression Extrapolation Remember the data about home prices vs. the amount of sq. footage in the home. The regression line we found based on data collected from homes with 900 to 3,000 sq. ft. is Price = -90.2458 + 0.1598SQFT. This would mean that if my home has no square footage, then I pay -$90,245.80. If you must extrapolate, at least don’t expect that your prediction will come true.
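
One simple safeguard is to have a prediction routine report when the x-value falls outside the range of the original data. The sketch below uses the home price line from slide 4 (Price = -90.2458 + 0.1598SQFT, with price assumed to be in thousands of dollars) and the 900 to 3,000 sq. ft. range mentioned above. The function name and the flag it returns are hypothetical.

```python
# Range of square footage in the original data, as stated on the slide.
SQFT_MIN, SQFT_MAX = 900, 3000


def predict_price(sqft, b0=-90.2458, b1=0.1598):
    """Return (predicted price in $1000s, True if sqft is outside the data range)."""
    is_extrapolation = not (SQFT_MIN <= sqft <= SQFT_MAX)
    return b0 + b1 * sqft, is_extrapolation


for sqft in (2000, 0):
    price, flagged = predict_price(sqft)
    note = " (extrapolation: do not trust)" if flagged else ""
    print(f"{sqft} sq. ft.: predicted price {price:.4f} thousand dollars{note}")
```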

49 Cautions about regression ASSOCIATION IS NOT CAUSATION! Strong association between explanatory and response variables does not mean that the explanatory variable causes the response variable.

50 Proving Causation Experiment Change the values of x and control for lurking variables. Not all problems can be solved by experiment. Examples: Does smoking cause lung cancer? Does living near power lines cause leukemia?

51 Proving Causation Lurking variable Important effect on variables, but not included in study. Example: Do taller people make more money? What do you think a lurking variable might be?

52 Proving Causation Proving smoking causes lung cancer Association is strong Association is consistent High doses are associated with stronger response Cause precedes the effect in time Cause is plausible

53 Review Let’s calculate the formula for this regression line: Number of Calories By Sugar Content (g) for 13 Cereals

54 Review Let’s review all the formulas we need:

55 Review Here are all the numbers you need:

56 Review First, calculate \(s_x\) and \(s_y\):

57 Review Second, calculate r: Third, calculate \(b_1\):

58 Review Fourth, calculate \(\bar{x}\) and \(\bar{y}\): Fifth, calculate \(b_0\) (we’re almost done!!):

59 Review Last, but definitely the most important, WRITE DOWN THE EQUATION IN THE CONTEXT OF THE PROBLEM:

60 Review Interpret \(b_1\): For every one gram increase in sugar, the number of calories will increase by 3.36 on average. Interpret \(r^2\): About 55% of the variability in the number of calories in cereal can be explained by the LS regression of calories on sugar content.

