Presentation on theme: "Objectives 10.1 Simple linear regression"— Presentation transcript:
1 Objectives 10.1 Simple linear regression: Statistical model for linear regression; Estimating the regression parameters; Confidence interval for regression parameters; Significance test for the slope; Confidence interval for µy; Prediction intervals.
2 Statistical model for linear regression. In the population, the linear regression equation is y = β0 + β1x + e, where e is the random deviation (or error) of the response variable from the prediction formula. Usually, we assume that e has a Normal(0, σ) distribution. β0 (the y-intercept) and β1 (the slope) are the parameters. Statistical inference is conducted to draw conclusions about the parameters: a confidence interval and hypothesis test for β1 (we especially want to test whether the slope equals zero); a confidence interval for β0 + β1x, given a value for x; and a prediction interval for a random y, given a value for x.
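To make the model concrete, here is a minimal sketch that draws responses from y = β0 + β1x + e with e ~ Normal(0, σ). The function name `simulate_responses`, the parameter values, and the x values are all hypothetical and chosen only for illustration; they are not from the slides.

```python
import random

def simulate_responses(xs, beta0, beta1, sigma, seed=0):
    """Draw one y per x from y = beta0 + beta1*x + e, with e ~ Normal(0, sigma)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]

# Hypothetical parameter values and x values (assumptions, not slide data).
xs = [1, 2, 3, 4, 5]
ys = simulate_responses(xs, beta0=0.2, beta1=2.0, sigma=0.1)
for x, y in zip(xs, ys):
    print(f"x = {x}, y = {y:.3f}")
```

Each simulated y scatters around the population line; with sigma set to 0 the points would fall exactly on it.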
3 Estimating the parameters. The population linear regression equation is y = β0 + β1x + e. The sample fitted regression line is ŷ = b0 + b1x. b0 is the estimate for the intercept β0 and b1 is the estimate for the slope β1. We also estimate σ (the standard deviation of e), using se = sqrt( Σ(y − ŷ)² / (n − 2) ). se is a measure of the typical size of a residual y − ŷ. We will use se to compute the standard errors we need.
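The estimates can be sketched in a few lines. This assumes the usual least-squares formulas for b0 and b1 (the slide does not show them) and the n − 2 divisor for se; the function name `fit_line` and the five data points are hypothetical.

```python
from math import sqrt

def fit_line(xs, ys):
    """Return (b0, b1, s_e) for the least-squares line y-hat = b0 + b1*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx                # slope estimate
    b0 = y_bar - b1 * x_bar       # intercept estimate
    # Sum of squared residuals, then the estimate of sigma with n - 2 df.
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s_e = sqrt(sse / (n - 2))
    return b0, b1, s_e

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.2, 4.1, 6.2, 7.9, 10.1]
b0, b1, s_e = fit_line(xs, ys)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, s_e = {s_e:.4f}")
```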
4 Confidence interval for the slope parameter. Before we do inference for the slope parameter β1, we need the standard error for the estimate b1: SE(b1) = se / sqrt( Σ(x − x̄)² ). We use the t distribution, now with n − 2 degrees of freedom. A level C confidence interval for the slope, β1, is b1 ± t* SE(b1), where t* is the table value for the t(n − 2) distribution with area C between −t* and t*. “Confidence” has the same interpretation as always.
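As a rough sketch of the slope interval, the code below computes b1 ± t* SE(b1), taking SE(b1) to be se / sqrt(Σ(x − x̄)²). Following the slide, t* is passed in as a table value rather than computed. The function name and the data are hypothetical; 3.182 is the standard table value for t(3) at 95% confidence.

```python
from math import sqrt

def slope_confidence_interval(xs, ys, t_star):
    """Return (low, high) for the interval b1 +/- t_star * SE(b1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    s_e = sqrt(sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2))
    se_b1 = s_e / sqrt(sxx)       # standard error of the slope estimate
    return b1 - t_star * se_b1, b1 + t_star * se_b1

# Hypothetical data: n = 5, so df = 3.
xs = [1, 2, 3, 4, 5]
ys = [2.2, 4.1, 6.2, 7.9, 10.1]
T_STAR_95_DF3 = 3.182             # t-table value for t(3), C = 95%
low, high = slope_confidence_interval(xs, ys, T_STAR_95_DF3)
print(f"95% CI for the slope: ({low:.3f}, {high:.3f})")
```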
5 Significance test for the slope parameter. We can test the hypothesis H0: β1 = m versus either a 1-sided or a 2-sided alternative, using a t-statistic. (The primary case is with m = 0.) We calculate t = (b1 − m) / SE(b1) and use the t(n − 2) distribution to find the P-value of the test. Note: Software typically provides two-sided P-values.
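The test statistic can be sketched as follows. To stay within the standard library, this example does not compute a P-value; instead it compares |t| with a table critical value, which for a two-sided test is equivalent to checking whether the P-value is below α. The function name and the data are hypothetical; 3.182 is the t(3) table value for two-sided α = 0.05.

```python
from math import sqrt

def slope_t_statistic(xs, ys, m=0.0):
    """Return t = (b1 - m) / SE(b1) for testing H0: beta1 = m."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    s_e = sqrt(sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2))
    return (b1 - m) / (s_e / sqrt(sxx))

# Hypothetical data: n = 5, so the reference distribution is t(3).
xs = [1, 2, 3, 4, 5]
ys = [2.2, 4.1, 6.2, 7.9, 10.1]
t_stat = slope_t_statistic(xs, ys, m=0.0)
T_CRIT_DF3 = 3.182                # two-sided alpha = 0.05 table value, t(3)
reject = abs(t_stat) > T_CRIT_DF3
print(f"t = {t_stat:.2f}, reject H0 at alpha = 0.05: {reject}")
```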
6 Relationship between ozone and carbon pollutants. In StatCrunch: Stat-Regression-Simple Linear; choose Hypothesis Tests. The output gives se, with df = n − 2. To test H0: β1 = 0 with α = 0.05, we compute t = b1 / SE(b1). From the t-table, using df = 28 − 2 = 26, we can see that the P-value is very small (well below α), so we reject H0 and conclude the slope is not zero.
7 Relationship between ozone and carbon pollutants. In StatCrunch: Stat-Regression-Simple Linear; choose Confidence Interval. Having decided that the slope is not zero, we next estimate it with a 95% confidence interval, b1 ± t* SE(b1), with t* taken from the t(26) distribution.
8 Confidence interval for β0 + β1x. We can also calculate a confidence interval for the regression line itself, at any chosen value x*. Generally this is sensible as long as x* is within the range of data observed (interpolation). Extrapolation should only be done with a great deal of caution. The interval is centered on ŷ = b0 + b1x*, but we need a standard error for this particular estimate: SE(µ̂) = se sqrt( 1/n + (x* − x̄)² / Σ(x − x̄)² ). The confidence interval is then calculated in the usual fashion: ŷ ± t* SE(µ̂), with t* from the t(n − 2) distribution. This is an estimate of the point on the line (the expected value of y) for the given value of x.
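A minimal sketch of the interval for the mean response at a chosen x* follows. It assumes the standard error se · sqrt(1/n + (x* − x̄)² / Σ(x − x̄)²) and takes t* as a table value. The function name, the data, and the choice x* = 3 are hypothetical.

```python
from math import sqrt

def mean_response_ci(xs, ys, x_star, t_star):
    """Return (low, high) for the mean of y at x = x_star."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    s_e = sqrt(sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2))
    y_hat = b0 + b1 * x_star      # center of the interval
    # Standard error of the estimated mean response at x_star.
    se_mean = s_e * sqrt(1 / n + (x_star - x_bar) ** 2 / sxx)
    return y_hat - t_star * se_mean, y_hat + t_star * se_mean

# Hypothetical data: n = 5, df = 3, t* = 3.182 for 95% confidence.
xs = [1, 2, 3, 4, 5]
ys = [2.2, 4.1, 6.2, 7.9, 10.1]
low, high = mean_response_ci(xs, ys, x_star=3, t_star=3.182)
print(f"95% CI for the mean response at x = 3: ({low:.3f}, {high:.3f})")
```

The (x* − x̄)² term makes the interval widen as x* moves away from the mean of the observed x values, which is one reason extrapolation calls for caution.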
9 Prediction interval for a new obs. y. It often is of greater interest to predict what the actual y value might be (not just what it is expected to be). Such a prediction interval for an actual (new) observation y must necessarily account for both the estimation of the line and the random deviation e away from that line. The interval is again centered on ŷ = b0 + b1x*, but now we also account for the random deviation. The prediction interval for the actual y, with given value x*, is ŷ ± t* SE(ŷ), where SE(ŷ) = se sqrt( 1 + 1/n + (x* − x̄)² / Σ(x − x̄)² ) and t* comes from the t(n − 2) distribution. The distinction between a confidence interval and a prediction interval is whether you want to capture the expected value of y or the actual value of y.
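The prediction interval can be sketched the same way as the interval for the mean response; the only change is the extra 1 under the square root, which accounts for the random deviation e of a new observation. The function name, the data, and x* = 3 are hypothetical, and t* is again a table value.

```python
from math import sqrt

def prediction_interval(xs, ys, x_star, t_star):
    """Return (low, high) for a new observation y at x = x_star."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    s_e = sqrt(sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2))
    y_hat = b0 + b1 * x_star
    # The leading 1 adds the variability of an individual y around the line.
    se_pred = s_e * sqrt(1 + 1 / n + (x_star - x_bar) ** 2 / sxx)
    return y_hat - t_star * se_pred, y_hat + t_star * se_pred

# Hypothetical data: n = 5, df = 3, t* = 3.182 for 95% confidence.
xs = [1, 2, 3, 4, 5]
ys = [2.2, 4.1, 6.2, 7.9, 10.1]
low, high = prediction_interval(xs, ys, x_star=3, t_star=3.182)
print(f"95% prediction interval at x = 3: ({low:.3f}, {high:.3f})")
```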
10 Prediction intervals. Unlike a confidence interval, a prediction interval does not keep getting narrower as you increase the sample size. This is because: The confidence interval is estimating a parameter, such as the mean, the slope, or a point on the regression line. For example, if I am interested in the mean grade of all people taking midterm 3 who scored 10 on midterm 2, the CI will get narrower as the sample size grows (because the estimators tend to get better for large sample sizes). The prediction interval is completely different. Here we are trying to predict the grade of a randomly selected person who scored 10 on midterm 2. There will be a lot of variability, and it does not improve as we increase the sample size: every individual is different. It is like predicting the weight of someone who is 6 feet tall: even if we know the average weight of a 6-footer, there is huge variation in this group, so the prediction interval must be wide for us to be able to capture the weight. This is a fundamental difference between predicting the measurement of an individual and estimating the mean. The mean estimator will get better with sample size; the prediction for an individual won’t.
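A small numerical sketch can illustrate this. It compares the two standard errors as n grows, under simplifying assumptions that are not from the slides: se is held fixed at 1, the x values are equally spaced on [0, 1], and the intervals are evaluated at x* = 0.5. The function name `standard_errors` is hypothetical.

```python
from math import sqrt

def standard_errors(n, s_e=1.0, x_star=0.5):
    """Return (SE for the mean response, SE for a new observation) at x_star."""
    xs = [i / (n - 1) for i in range(n)]   # n equally spaced points on [0, 1]
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    shared = 1 / n + (x_star - x_bar) ** 2 / sxx
    se_mean = s_e * sqrt(shared)           # shrinks toward 0 as n grows
    se_pred = s_e * sqrt(1 + shared)       # levels off near s_e
    return se_mean, se_pred

for n in (10, 100, 1000):
    se_mean, se_pred = standard_errors(n)
    print(f"n = {n:4d}: SE(mean) = {se_mean:.4f}, SE(prediction) = {se_pred:.4f}")
```

Under these assumptions the standard error for the mean keeps shrinking, while the standard error for a prediction drops only slightly and then settles near se, so the prediction interval cannot be made arbitrarily narrow by collecting more data.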
11 Efficiency of a biofilter, by temperature. In StatCrunch: Stat-Regression-Simple Linear; choose Predict Y for X. For a 95% confidence interval of the expected ozone level, with temperature = 16, we compute the interval for the mean response at x = 16. For a 95% prediction interval of the actual ozone level, with temperature = 16, we compute the interval for a new observation at x = 16.