1 Chapter 11 Linear Regression Straight Lines, Least-Squares and More Chapter 11A Can you pick out the straight lines and find the least-square?

2 The Chapter in Outline It’s all about explaining variability

3 11-1 Empirical Models Empirical – derived from observation rather than theory. Regression analysis – a mathematical optimization technique that, given a series of observed data, finds a function that closely approximates the data: a "best fit." I let the data do the talking.

4 The General Idea - Example

5 A Straight Line The linear equation: y = β₀ + β₁x + ε, where y is the dependent variable, β₀ the constant (y-intercept), β₁ the slope, x the independent variable, and ε the random error.

6 The Statistical Model Y = β₀ + β₁x + ε. This model is linear in β₀ and β₁, not necessarily in the predictor variable x. Assume ε is a random variable with E[ε] = 0 and V[ε] = σ².

7 The Problem Given n (paired) data points (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ), fit a straight line to the data. That is, find values of β₀ and β₁ such that yᵢ (actual) and ŷᵢ (predicted) are somehow "close."
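
Before the algebra, a minimal sketch of what "fit a straight line" means computationally. The data here are made up for illustration, and numpy's polyfit is just one of many ways to get the least-squares line:

```python
import numpy as np

# Hypothetical paired data (x_i, y_i), i = 1..n -- made up for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Degree-1 polyfit returns the least-squares slope and intercept.
b1, b0 = np.polyfit(x, y, deg=1)

y_hat = b0 + b1 * x   # predicted values
e = y - y_hat         # residuals: actual minus predicted

print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")
```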

8 The Error Terms

9 The Method of Least Squares As in maximum likelihood methods, we treat the x's as constants and the parameters as the variables.

10 Let’s do some more math… normal equations
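
The derivation on the original slide is an image. For reference, minimizing L = Σ(yᵢ − b₀ − b₁xᵢ)² by setting ∂L/∂b₀ = ∂L/∂b₁ = 0 gives the standard pair of normal equations:

```latex
n\,b_0 + b_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i ,
\qquad
b_0 \sum_{i=1}^{n} x_i + b_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i .
```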

11 and even more math …

12 More Method of Least Squares A useful way to think of the solution for β₁.

13 Some Notation
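
The notation on this slide is an image; the quantities used in the rest of the chapter (and in the Problem 11-5 arithmetic below) are the standard ones:

```latex
S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n},
\qquad
S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - \frac{\left(\sum x_i\right)\left(\sum y_i\right)}{n}.
```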

14 The Least-Squares Estimates
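
The estimates on the slide are likewise an image; in the notation above, and as used in the Problem 11-5 calculations below, they are:

```latex
\hat{\beta}_1 = b_1 = \frac{S_{xy}}{S_{xx}},
\qquad
\hat{\beta}_0 = b_0 = \bar{y} - b_1 \bar{x}.
```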

15 Estimating σ² The unbiased estimator of the variance is σ̂² = SSE/(n − 2), the mean square error (MSE). The text says that it would be tedious to compute SSE directly from the residuals, so we are offered the computational form SSE = SST − b₁·Sxy. My comment – the computational form is conceptually important as well.

16 Partitioning and Explaining Variability The objective in predictive modeling is to explain the variability in the data. The computational form for SSE is our first example of this. SST is the total variability of the data about the mean. It gets broken into two parts – variability explained by the regression and that left unexplained as error. A good predictor explains a lot of variability.

17 Partitioning Variability The identity SST = SSR + SSE comes up in many contexts. SST – total variability about the mean in the data; SSR – variability explained by the model; SSE – variability left unexplained, attributed to error. Note that under our distributional assumptions SSR and SSE are sums of squares of normal variates.

18 Bonus Slide For the discriminating and overachieving student: derive the above by first starting with the identity yᵢ − ȳ = (ŷᵢ − ȳ) + (yᵢ − ŷᵢ); then rearrange terms, square both sides, and simplify.

19 Problem 11-5 (x = temperature; y = pounds of steam used per month, in 1,000s; n = 12)
Sxy = 265864.4 − 558 × 5062.24 / 12 = 30470.5
Sxx = 29256.0 − 558 × 558 / 12 = 3309
b₁ = 30470.5 / 3309 = 9.2084 (≈ 9.21)
b₀ = (5062.24 / 12) − 9.2084 × (558 / 12) = −6.34
SST = 2416143.7 − 5062.24 × 5062.24 / 12 = 280620.9
SSE = 280620.9 − b₁ × 30470.5 ≈ 37.75 (use the unrounded b₁; with 9.21 the subtraction is too coarse) ⇒ MSE = 37.75/10 ≈ 3.8
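
A sketch reproducing this arithmetic from the slide's summary sums; the last digits differ slightly from the slide's figures because the transcribed sums are rounded:

```python
# Problem 11-5 from the slide's summary statistics:
# n = 12 months, x = temperature, y = steam usage (1,000 lb).
n = 12
sum_x, sum_y = 558.0, 5062.24
sum_xy, sum_x2, sum_y2 = 265864.4, 29256.0, 2416143.7

Sxy = sum_xy - sum_x * sum_y / n      # ~30470 (slide: 30470.5)
Sxx = sum_x2 - sum_x**2 / n           # 3309.0
SST = sum_y2 - sum_y**2 / n           # 280620.9

b1 = Sxy / Sxx                        # ~9.21
b0 = sum_y / n - b1 * sum_x / n       # ~ -6.34
SSE = SST - b1 * Sxy                  # ~38 (slide: 37.75)
MSE = SSE / (n - 2)                   # ~3.8

print(f"b1={b1:.4f}  b0={b0:.2f}  SSE={SSE:.2f}  MSE={MSE:.2f}")
```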

20 Problem 11-5, Minitab

Usage = - 6.34 + 9.21 Temp

Predictor     Coef      SE Coef       T      P
Constant    -6.336       1.668    -3.80  0.003
Temp        9.20836    0.03377   272.64  0.000

S = 1.943   R-Sq = 100.0%   R-Sq(adj) = 100.0%

Analysis of Variance
Source           DF      SS      MS         F      P
Regression        1  280583  280583  74334.36  0.000   ← SSR
Residual Error   10      38       4                    ← SSE
Total            11  280621                            ← SST

Note: (272.64)² = 74334.36

21 Bias and Variance of the Estimators Since the betas are functions of the observations, they are random variables. Properties of the regression parameters are implied by the properties of the observations and the algebra of expected values and variances. Our assumption is that the error term has a mean of zero and a variance of σ². Remember we treat the x values as fixed – constants.

22 Expected Values The kᵢ terms (the weights kᵢ = (xᵢ − x̄)/Sxx in b₁ = Σ kᵢYᵢ) come in handy now.

23 Variance Terms – β₁ The kᵢ terms still come in handy: V(b₁) = σ²/Sxx.

24 Variance Terms – β₀ But the Yᵢ are all independent, so that V(b₀) = σ²·(1/n + x̄²/Sxx).

25 Variance Terms

Usage = - 6.34 + 9.21 Temp

Predictor     Coef      SE Coef       T      P
Constant    -6.336       1.668    -3.80  0.003
Temp        9.20836    0.03377   272.64  0.000

S = 1.943   R-Sq = 100.0%   R-Sq(adj) = 100.0%

Analysis of Variance
Source           DF      SS      MS         F      P
Regression        1  280583  280583  74334.36  0.000
Residual Error   10      38       4
Total            11  280621

σ² estimate = 3.8;  Sxx from the Excel version = 3309;  x̄ = 558/12 = 46.5
SE(b₁) = sqrt(3.8/3309) = 0.03377
SE(b₀) = sqrt(3.8 × [1/12 + 46.5²/3309]) = 1.67
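
A quick check of the two standard-error formulas with the quantities on this slide (MSE ≈ 3.8, Sxx = 3309, x̄ = 46.5, n = 12); with the rounded MSE the results agree with Minitab's SE Coef column to about two digits:

```python
import math

MSE, Sxx, xbar, n = 3.8, 3309.0, 46.5, 12

se_b1 = math.sqrt(MSE / Sxx)                      # ~0.0339 (Minitab: 0.03377)
se_b0 = math.sqrt(MSE * (1/n + xbar**2 / Sxx))    # ~1.67   (Minitab: 1.668)

print(f"se(b1)={se_b1:.5f}  se(b0)={se_b0:.3f}")
```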

26 Introducing Distributional Assumptions NID(0, σ²) is the basic assumption. This implies that the error terms in our equation are independent, normally distributed with mean 0, and constant variance σ². Our work at the end of this chapter will focus on how we can verify or test some of the assumptions by examining the estimates of the error. When these assumptions are valid, we can build confidence and prediction intervals on the regression line and new predictions. We can also create hypothesis tests on the coefficients. - We know the estimates are unbiased. - We know their standard error and can estimate it.

27 Hypothesis Tests on the Coefficients Two-sided test Recall the definition of a t variate as a standard normal divided by the square root of a chi-square divided by its degrees of freedom.

28 Hypothesis Tests on the Coefficients Similarly, we can write out a test statistic for β₀. Does the regression have any predictive value? Test H₀: β₁ = 0 versus H₁: β₁ ≠ 0. Failing to reject can mean x is not a good predictor, or the relation is not linear.
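
A sketch of the significance test on the slope, using the Minitab numbers from slide 20 (b₁ = 9.20836, se(b₁) = 0.03377, n = 12); scipy is assumed for the t distribution:

```python
from scipy import stats

b1, se_b1, n = 9.20836, 0.03377, 12

t0 = b1 / se_b1                         # ~272.6, matching Minitab's T column
p = 2 * stats.t.sf(abs(t0), df=n - 2)   # two-sided p-value, effectively 0

print(f"t0={t0:.2f}  p={p:.3g}")
```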

29 11-4.1 Use of t-Tests An important special case of the hypotheses of Equation 11-18 is H₀: β₁ = 0 versus H₁: β₁ ≠ 0. These hypotheses relate to the significance of regression. Failure to reject H₀ is equivalent to concluding that there is no linear relationship between x and Y.

30 Figure 11-5 The hypothesis H₀: β₁ = 0 is not rejected.

31 Figure 11-6 The hypothesis H₀: β₁ = 0 is rejected.

32 11-4.2 Analysis of Variance Approach to Test Significance of Regression If the null hypothesis H₀: β₁ = 0 is true, the statistic F₀ = MSR/MSE follows the F distribution with 1 and n − 2 degrees of freedom, and we would reject H₀ if f₀ > fα,1,n−2.

33 F-Test of Regression Significance Note that the numerator and denominator are each divided by their degrees of freedom. As pointed out in the text, the F-test on the regression and the t-test on the b₁ coefficient are completely equivalent – not just close or likely to produce the same answer.
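
The equivalence is easy to check numerically with the slide-20 ANOVA values: F₀ = MSR/MSE, and the squared t statistic on b₁ reproduces it up to the rounding of the printed inputs (the slide's note: (272.64)² = 74334.36):

```python
MSR = 280583.0   # regression mean square from the Minitab table
MSE = 3.775      # error mean square (37.75 / 10)
t0 = 272.64      # t statistic on the Temp coefficient

F0 = MSR / MSE
print(F0, t0**2)   # ~74327 and ~74333: equal up to input rounding
```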

34 11-4.2 Analysis of Variance Approach to Test Significance of Regression The quantities MSR = SSR/1 and MSE = SSE/(n − 2) are called mean squares. Analysis of variance table:

Source of Variation    DF      Sum of Squares   Mean Square   F₀
Regression              1      SSR              MSR           MSR/MSE
Error                 n − 2    SSE              MSE
Total                 n − 1    SST

35 Problem 11-24/27

The regression equation is y = - 16.5 + 0.0694 x

Predictor     Coef      SE Coef      T      P
Constant   -16.509        9.843  -1.68  0.122
x           0.06936     0.01045   6.64  0.000

S = 2.706   R-Sq = 80.0%   R-Sq(adj) = 78.2%

Analysis of Variance
Source           DF      SS      MS      F      P
Regression        1  322.50  322.50  44.03  0.000
Residual Error   11   80.57    7.32
Total            12  403.08

Unusual Observations
Obs    x       y      Fit  SE Fit  Residual  St Resid
  8  960  43.000   50.072   0.782    -7.072    -2.73R

R denotes an observation with a large standardized residual

36 A Complete Example – Problem 11-3 The following are NFL quarterback ratings for the 2004 season. It is suspected that the rating (y) is related to the average number of yards gained per pass attempt (x). A prob-stat student generating sample data.

37 Problem 11-3 – some calculations

38 More of a Complete Example

39 Problem 11-3 – a graph

40 Problem 11-3 – some questions (b) Find an estimate for the mean rating if a quarterback averages 7.5 yards per attempt. (c) What change in the mean rating is associated with a decrease of one yard per attempt? (d) To increase the mean rating by 10 points, how much increase in the average yards per attempt must be generated? (e) Given that x = 7.21 yards (M. Vick), find the fitted value of y and the corresponding residual.

41 Excel Regression Output

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.8872
R Square             0.787123
Adjusted R Square    0.77952
Standard Error       5.712519
Observations        30

ANOVA
              df    SS          MS          F          Significance F
Regression     1    3378.526    3378.526    103.5314   6.55E-11
Residual      28     913.7205     32.63287
Total         29    4292.247

              Coefficients   Standard Error   t Stat     P-value
Intercept       -5.55762       9.159389       -0.60677   0.548894
Yds per Att     12.65192       1.243427       10.17504   6.55E-11
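
A sketch answering questions (b)–(e) from slide 40 with the fitted line from this output, ŷ = −5.5576 + 12.65192x. For (e), the residual needs M. Vick's observed 2004 rating, which these slides don't reproduce, so only the fitted value is computed:

```python
b0, b1 = -5.55762, 12.65192   # intercept and slope from the Excel output

# (b) estimated mean rating at x = 7.5 yards per attempt
print(b0 + b1 * 7.5)          # ~89.3

# (c) change in mean rating for a decrease of one yard per attempt
print(-b1)                    # ~ -12.65 points

# (d) increase in yards per attempt needed for a 10-point rating increase
print(10 / b1)                # ~0.79 yards

# (e) fitted value at x = 7.21 (M. Vick); residual = observed - fitted
print(b0 + b1 * 7.21)         # ~85.7
```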

42 Problem 11-23 (a) Test for significance of the regression at the 1% level. From prob-calculator: f0.01,1,28 = 7.6356
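
The same critical value can be pulled from scipy instead of a probability calculator; comparing it with the F statistic from the Excel ANOVA table settles part (a):

```python
from scipy import stats

f_crit = stats.f.ppf(0.99, dfn=1, dfd=28)   # ~7.6356, as on the slide
F0 = 103.5314                               # from the Excel ANOVA table

print(F0 > f_crit)   # True: reject H0, the regression is significant at 1%
```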

43 Problem 11-23 (b) Estimate the standard error of the slope and intercept

44 Problem 11-23 (c) Test H₀: β₁ = 10 (two-tailed)
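
A sketch of part (c) with the Excel-output estimates (b₁ = 12.65192, se(b₁) = 1.243427, n − 2 = 28 df); under H₀: β₁ = 10 the statistic t₀ = (b₁ − 10)/se(b₁) is t-distributed:

```python
from scipy import stats

b1, se_b1, df = 12.65192, 1.243427, 28

t0 = (b1 - 10) / se_b1            # ~2.13
p = 2 * stats.t.sf(abs(t0), df)   # two-sided p-value, ~0.04

print(f"t0={t0:.3f}  p={p:.3f}")  # reject at 5%, not at 1%
```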

45 Next Time Restoring our Confidence (intervals) Join us next time when we return to an old favorite – the confidence interval.

