Prediction concerning Y variable


Three different research questions
– What is the mean response, $E(Y_h)$, for a given level, $X_h$, of the predictor variable?
– What would one predict a new observation, $Y_{h(\text{new})}$, to be for a given level, $X_h$, of the predictor variable?
– What would one predict the mean of m new observations, $\bar{Y}_{h(\text{new})}$, to be for a given level, $X_h$, of the predictor variable?

Example: Mortality and Latitude
– What is the expected (mean) mortality rate for locations at 40° N latitude?
– What is the predicted mortality rate for a new randomly selected location at 40° N?
– What is the predicted mean mortality rate for 10 new randomly selected locations at 40° N?

Point estimators
The fitted value $\hat{Y}_h = b_0 + b_1 X_h$ is the best point estimator in each case. That is, it is:
– the best guess of the mean response at $X_h$
– the best guess of a new observation at $X_h$
– the best guess of a mean of m new observations at $X_h$
But, as always, to be confident in the answer to our research question, we should put an interval around our best guess.

Interval estimation of mean response $E(Y_h)$

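Sampling distribution of $\hat{Y}_h$
Provided the error terms $\varepsilon_i$ are normally distributed, $\hat{Y}_h$ is normally distributed with mean $E(Y_h)$ and variance
$$\sigma^2(\hat{Y}_h) = \sigma^2 \left[ \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_i (X_i - \bar{X})^2} \right]$$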

Implications on precision
– The greater the spread in the $X_i$ values, the smaller the variance of $\hat{Y}_h$, and the more precise the estimate of $E(Y_h)$.
– Given the same set of $X_i$ values, the further $X_h$ is from the (sample) mean of the $X_i$, the greater the variance of $\hat{Y}_h$, and the less precise the estimate of $E(Y_h)$.

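Estimate of the variance
Estimate the variance by replacing $\sigma^2$ with MSE:
$$s^2(\hat{Y}_h) = MSE \left[ \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_i (X_i - \bar{X})^2} \right]$$
Then, the estimated standard deviation is
$$s(\hat{Y}_h) = \sqrt{s^2(\hat{Y}_h)}$$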
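As a rough numerical check, the sketch below evaluates the estimated standard deviation of the fitted value, assuming the usual simple-linear-regression expression MSE·[1/n + (X_h − X̄)² / Σ(X_i − X̄)²]. The summary quantities are the rounded values listed for the mortality and latitude data elsewhere in this presentation; the function name `se_fit` is only illustrative.

```python
import math

# Rounded summary quantities for the mortality/latitude data, as listed
# in this presentation. Because they are rounded, results may differ
# from Minitab's in the last digit.
N = 49              # number of observations
X_BAR = 39.533      # sample mean of latitude
SUM_SQ_X = 1020.54  # sum of squared deviations of X about its mean
MSE = 365.383       # mean squared error of the fitted line


def se_fit(x_h):
    """Estimated standard deviation of the fitted value at level x_h."""
    variance = MSE * (1 / N + (x_h - X_BAR) ** 2 / SUM_SQ_X)
    return math.sqrt(variance)


print(f"s(Y-hat) at X_h = 40: {se_fit(40):.3f}")  # near the sample mean
print(f"s(Y-hat) at X_h = 28: {se_fit(28):.3f}")  # far from the sample mean
```

The two values should agree, up to rounding of the inputs, with the "SE Fit" entries of 2.75 and 7.42 in the Minitab output for latitudes 40 and 28.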

Confidence interval for $E(Y_h)$
Sample estimate ± margin of error:
$$\hat{Y}_h \pm t(1 - \alpha/2;\, n - 2)\, s(\hat{Y}_h)$$

The estimation in Minitab
– Stat >> Regression >> Regression …
– Specify response and predictor(s).
– Select Options… In the “Prediction intervals for new observations” box, specify either the X value or a column name containing multiple X values.
– Specify confidence level.
– Click on OK.
– Results appear in session window.

Predicted Values for New Observations
New Obs     Fit   SE Fit         95.0% CI          95.0% PI
1        150.08     2.75   (144.6, 155.6)   (111.2, 188.93)
2        221.82     7.42   (206.9, 236.8)   (180.6, 263.07) X
X denotes a row with X values away from the center

Values of Predictors for New Observations
New Obs   Latitude
1             40.0
2             28.0

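Minitab output
– “Fit” is the fitted value $\hat{Y}_h$.
– “SE Fit” is the estimated standard deviation $s(\hat{Y}_h)$.
Therefore, the “95% CI” for $E(Y_h)$ is $\hat{Y}_h \pm t(0.975;\, 47)\, s(\hat{Y}_h)$, where 47 = n − 2 for the n = 49 locations in the data set.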
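The confidence interval in the Minitab output can be approximately reproduced by hand. The sketch below is one way to do it under stated assumptions: the summary quantities are the rounded values listed for this data set elsewhere in the presentation, the fitted value 150.08 is read from the Minitab output, and the critical value t(0.975; 47) ≈ 2.0117 is hard-coded from a t table rather than computed. The function name is hypothetical.

```python
import math

# Rounded summary quantities for the mortality/latitude data.
N, X_BAR, SUM_SQ_X, MSE = 49, 39.533, 1020.54, 365.383
FIT_AT_40 = 150.08  # "Fit" at latitude 40 from the Minitab output
T_CRIT = 2.0117     # t(0.975; 47), taken from a t table (assumption)


def confidence_interval(fit, x_h, t_crit=T_CRIT):
    """Interval for the mean response E(Y_h): fit +/- t * s(Y-hat)."""
    se = math.sqrt(MSE * (1 / N + (x_h - X_BAR) ** 2 / SUM_SQ_X))
    margin = t_crit * se
    return fit - margin, fit + margin


low, high = confidence_interval(FIT_AT_40, 40)
print(f"95% CI for E(Y_h) at X_h = 40: ({low:.1f}, {high:.1f})")
```

The result should land close to the (144.6, 155.6) reported by Minitab; small differences can arise because the inputs here are rounded.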

Difference in precision of estimates
– The mean of the 49 latitudes in the data set is 39.5° N.
– SE Fit for $X_h$ = 40 is 2.75.
– SE Fit for $X_h$ = 28 is 7.42 (larger, as expected).
– The closer $X_h$ is to the sample mean, the narrower the confidence interval, and the more precise the estimate of $E(Y_h)$.

Comments on assumptions
– $X_h$ must be a value within the scope of the model, that is, within the range of the X values in the data set, but it need not be one of the observed X values.
– It is OK to use the formula for the confidence interval for $E(Y_h)$ even if the error terms are only approximately normally distributed.
– If you have a large sample, the error terms can even deviate substantially from normality without greatly affecting the appropriateness of the confidence interval.

Prediction of a New Observation

Restatement of problem
– We previously estimated the mean response $E(Y_h)$. That is, we estimated the mean of the distribution of Y at a given $X_h$.
– Now, we want to predict a new response $Y_{h(\text{new})}$. That is, we predict an individual outcome Y at a given $X_h$.
– Most outcomes Y deviate from the mean response $E(Y_h)$. We must take this into account when we predict $Y_{h(\text{new})}$.

How to obtain a prediction interval if the distribution of Y is known
If you know the distribution of Y, you know
– its shape (say, it’s normal)
– its mean (say, it’s μ (“mu”))
– its standard deviation (say, it’s σ (“sigma”))
BASIC IDEA: Using the distribution, determine a range in which most of the Y observations will fall. Claim that the next observation will fall there, too.

Example: High school GPA (X) and College GPA (Y)
The distribution of college GPA (Y) depends on high school GPA (X) through intercept and slope parameters. Suppose:
– Y is normally distributed
– Mean is E(Y) = 0.10 + 0.95 X
– Standard deviation σ (“sigma”) = 0.12
For students with X = 3.5 high school GPA:
– E(Y) = 0.10 + 0.95(3.5) = 3.425

Example: 99.7% prediction interval for $Y_{h(\text{new})}$
The probability that a randomly selected student with a high school GPA of 3.5 will have a college GPA between
– 3.425 - 3(0.12) = 3.065 and
– 3.425 + 3(0.12) = 3.785
is 0.997.
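The arithmetic in this example can be checked with Python's standard library. This is a minimal sketch that takes the intercept, slope, and σ exactly as supposed in the example; the variable names are illustrative only.

```python
from statistics import NormalDist

# Parameters are treated as known, as in the example.
BETA_0, BETA_1, SIGMA = 0.10, 0.95, 0.12
x_h = 3.5  # high school GPA

mean_at_x = BETA_0 + BETA_1 * x_h          # E(Y) at X = 3.5
lower = mean_at_x - 3 * SIGMA              # three sigmas below the mean
upper = mean_at_x + 3 * SIGMA              # three sigmas above the mean

# Probability that a normal Y with this mean and sigma falls in the range.
dist = NormalDist(mu=mean_at_x, sigma=SIGMA)
coverage = dist.cdf(upper) - dist.cdf(lower)

print(f"E(Y) = {mean_at_x:.3f}")
print(f"interval = ({lower:.3f}, {upper:.3f})")
print(f"coverage = {coverage:.4f}")
```

The coverage comes out at roughly 0.997, the familiar three-sigma figure for a normal distribution.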

But we have a problem …
– The last calculation was possible because we knew $\beta_0$, $\beta_1$, and σ. Hence, we knew the mean and variance, E(Y) and $\sigma^2$, respectively, of the distribution of Y.
– We could consider estimating E(Y) and $\sigma^2$ with $\hat{Y}_h$ and MSE, respectively, and applying the same method as before.
– But, it’s not quite right. Here’s why.

So …
We cannot be certain of the location (mean) of the distribution of Y. Prediction limits for $Y_{h(\text{new})}$ must take into account:
– variation in the possible location (mean) of the distribution of Y
– variation in Y within the probability distribution

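Variation of the prediction
The variation in the prediction of a new response depends on two components: the variation due to estimating $E(Y_h)$ with $\hat{Y}_h$, and the variation in Y within the probability distribution:
$$\sigma^2(\text{pred}) = \sigma^2 + \sigma^2(\hat{Y}_h) = \sigma^2 \left[ 1 + \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_i (X_i - \bar{X})^2} \right]$$
which is estimated by:
$$s^2(\text{pred}) = MSE + s^2(\hat{Y}_h) = MSE \left[ 1 + \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_i (X_i - \bar{X})^2} \right]$$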

Prediction interval for $Y_{h(\text{new})}$
Provided the error terms $\varepsilon_i$ are normally distributed:
$$\hat{Y}_h \pm t(1 - \alpha/2;\, n - 2)\, s(\text{pred})$$

The prediction in Minitab
– Stat >> Regression >> Regression …
– Specify response and predictor(s).
– Select Options… In the “Prediction intervals for new observations” box, specify either the X value or a column name containing multiple X values.
– Specify confidence level.
– Click on OK.
– Results appear in session window.

Predicted Values for New Observations
New Obs     Fit   SE Fit         95.0% CI          95.0% PI
1        150.08     2.75   (144.6, 155.6)   (111.2, 188.93)
2        221.82     7.42   (206.9, 236.8)   (180.6, 263.07) X
X denotes a row with X values away from the center

Values of Predictors for New Observations
New Obs   Latitude
1             40.0
2             28.0

S = 19.12   R-Sq = 68.0%   R-Sq(adj) = 67.3%

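Minitab output
– “Fit” is the fitted value $\hat{Y}_h$.
– The standard deviation of the prediction combines MSE with “SE Fit”: $s(\text{pred}) = \sqrt{MSE + s^2(\hat{Y}_h)}$.
Therefore, the “95% PI” for $Y_{h(\text{new})}$ is $\hat{Y}_h \pm t(0.975;\, 47)\, s(\text{pred})$.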
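The prediction interval can be approximately reproduced in the same way as the confidence interval, with MSE added to the variance of the fitted value. The sketch below assumes the rounded summary quantities listed for this data set elsewhere in the presentation, the fitted value 150.08 from the Minitab output, and a hard-coded critical value t(0.975; 47) ≈ 2.0117 from a t table. The function name is hypothetical.

```python
import math

# Rounded summary quantities for the mortality/latitude data.
N, X_BAR, SUM_SQ_X, MSE = 49, 39.533, 1020.54, 365.383
FIT_AT_40 = 150.08  # "Fit" at latitude 40 from the Minitab output
T_CRIT = 2.0117     # t(0.975; 47), taken from a t table (assumption)


def prediction_interval(fit, x_h, t_crit=T_CRIT):
    """Interval for one new observation: fit +/- t * s(pred)."""
    var_fit = MSE * (1 / N + (x_h - X_BAR) ** 2 / SUM_SQ_X)
    se_pred = math.sqrt(MSE + var_fit)  # adds variation of Y about its mean
    margin = t_crit * se_pred
    return fit - margin, fit + margin


low, high = prediction_interval(FIT_AT_40, 40)
print(f"95% PI for a new observation at X_h = 40: ({low:.1f}, {high:.1f})")
```

The result should be close to the (111.2, 188.93) reported by Minitab. Note how much wider it is than the confidence interval for the mean response at the same latitude.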

As always, some comments…
– In general, prediction intervals are wider than confidence intervals.
– Prediction intervals are (somewhat) wider the further $X_h$ is from the mean of the X values.
– The formula for the prediction interval depends strongly on the assumption that the error terms are normally distributed.

Remember the distinction …
– A confidence interval concerns the estimation of an unknown parameter. It is an interval that is intended to cover the value of the unknown parameter.
– A prediction interval, on the other hand, is a statement about the value to be taken by a random variable, here, the new observation $Y_{h(\text{new})}$.

Getting a plot of the CI and PI in Minitab
– Stat >> Regression >> Fitted line plot …
– Specify predictor and response.
– Under Options…, select Display confidence bands.
– Select Display prediction bands.
– Specify desired confidence level.
– Select OK.

Row  Xh    Xbar   sumsqX   n      MSE    SD_EY  SD_Pred
  1  28  39.533  1020.54  49  365.383  7.42147  20.5052
  2  29  39.533  1020.54  49  365.383  6.86862  20.3116
  3  30  39.533  1020.54  49  365.383  6.32406  20.1340
  4  31  39.533  1020.54  49  365.383  5.79013  19.9727
  5  32  39.533  1020.54  49  365.383  5.27006  19.8282
  6  33  39.533  1020.54  49  365.383  4.76838  19.7008
  7  34  39.533  1020.54  49  365.383  4.29156  19.5908
  8  35  39.533  1020.54  49  365.383  3.84884  19.4986
  9  36  39.533  1020.54  49  365.383  3.45337  19.4244
 10  37  39.533  1020.54  49  365.383  3.12313  19.3685
 11  38  39.533  1020.54  49  365.383  2.88066  19.3308
 12  39  39.533  1020.54  49  365.383  2.74927  19.3117
 13  40  39.533  1020.54  49  365.383  2.74497  19.3111
 14  41  39.533  1020.54  49  365.383  2.86833  19.3290
 15  42  39.533  1020.54  49  365.383  3.10416  19.3654
 16  43  39.533  1020.54  49  365.383  3.42933  19.4202
 17  44  39.533  1020.54  49  365.383  3.82112  19.4932
 18  45  39.533  1020.54  49  365.383  4.26117  19.5842
 19  46  39.533  1020.54  49  365.383  4.73606  19.6930
 20  47  39.533  1020.54  49  365.383  5.23632  19.8192
 21  48  39.533  1020.54  49  365.383  5.75533  19.9626
 22  49  39.533  1020.54  49  365.383  6.28846  20.1228
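The SD_EY and SD_Pred columns of this table can be regenerated from the other columns. The sketch below assumes that SD_EY is the estimated standard deviation of the fitted value and SD_Pred is the estimated standard deviation of the prediction for one new observation; that reading is an inference from the column names and values, not something the table states. Function and variable names are illustrative.

```python
import math

# Constant columns of the table (rounded as shown there).
N, X_BAR, SUM_SQ_X, MSE = 49, 39.533, 1020.54, 365.383


def sd_columns(x_h):
    """Return (SD_EY, SD_Pred) for predictor level x_h."""
    design_term = 1 / N + (x_h - X_BAR) ** 2 / SUM_SQ_X
    sd_ey = math.sqrt(MSE * design_term)          # for the mean response
    sd_pred = math.sqrt(MSE * (1 + design_term))  # for one new observation
    return sd_ey, sd_pred


rows = [(x_h, *sd_columns(x_h)) for x_h in range(28, 50)]

print(f"{'Xh':>3} {'SD_EY':>8} {'SD_Pred':>8}")
for x_h, sd_ey, sd_pred in rows:
    print(f"{x_h:>3} {sd_ey:>8.4f} {sd_pred:>8.4f}")

# How much each column varies across the range of latitudes.
ey_ratio = max(r[1] for r in rows) / min(r[1] for r in rows)
pred_ratio = max(r[2] for r in rows) / min(r[2] for r in rows)
print(f"largest/smallest SD_EY:   {ey_ratio:.2f}")
print(f"largest/smallest SD_Pred: {pred_ratio:.2f}")
```

The two ratios make the earlier comment concrete: the standard deviation for the mean response changes several-fold across the range of latitudes, while the standard deviation for a new observation changes only slightly, because the MSE term dominates it.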

Prediction of the mean of m new observations for given $X_h$

Same thinking as before … just a slight adjustment
We cannot be certain of the location (mean) of the distribution of Y. The best estimate is $\hat{Y}_h$. Prediction limits for the mean of m new observations, $\bar{Y}_{h(\text{new})}$, must take into account:
– variation in the possible location (mean) of the distribution of Y
– variation in the mean of m observations of Y within the probability distribution

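Variation of the prediction
The variation in the prediction of the mean of m new responses depends on two components: the variation due to estimating $E(Y_h)$ with $\hat{Y}_h$, and the variation in the sample means within the probability distribution:
$$\sigma^2(\text{predmean}) = \frac{\sigma^2}{m} + \sigma^2(\hat{Y}_h) = \sigma^2 \left[ \frac{1}{m} + \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_i (X_i - \bar{X})^2} \right]$$
which is estimated by:
$$s^2(\text{predmean}) = \frac{MSE}{m} + s^2(\hat{Y}_h) = MSE \left[ \frac{1}{m} + \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_i (X_i - \bar{X})^2} \right]$$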

Prediction interval for $\bar{Y}_{h(\text{new})}$
Provided the error terms $\varepsilon_i$ are normally distributed:
$$\hat{Y}_h \pm t(1 - \alpha/2;\, n - 2)\, s(\text{predmean})$$

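Predict mean of m = 10 new responses
“Fit” is the fitted value $\hat{Y}_h$. Therefore, the “95% PI” for $\bar{Y}_{h(\text{new})}$ is $\hat{Y}_h \pm t(0.975;\, 47)\, s(\text{predmean})$, with $s^2(\text{predmean}) = MSE/10 + s^2(\hat{Y}_h)$.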
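The numerical result for this slide did not survive in the transcript, so the sketch below works it out under stated assumptions rather than quoting it: rounded summary quantities as listed for this data set elsewhere in the presentation, the fitted value 150.08 at latitude 40 from the Minitab output, and a hard-coded t(0.975; 47) ≈ 2.0117 from a t table. The function name is hypothetical, and the printed interval is this sketch's own calculation.

```python
import math

# Rounded summary quantities for the mortality/latitude data.
N, X_BAR, SUM_SQ_X, MSE = 49, 39.533, 1020.54, 365.383
FIT_AT_40 = 150.08  # "Fit" at latitude 40 from the Minitab output
T_CRIT = 2.0117     # t(0.975; 47), taken from a t table (assumption)
M = 10              # number of new observations being averaged


def mean_prediction_interval(fit, x_h, m, t_crit=T_CRIT):
    """Interval for the mean of m new observations at level x_h."""
    var_fit = MSE * (1 / N + (x_h - X_BAR) ** 2 / SUM_SQ_X)
    se = math.sqrt(MSE / m + var_fit)  # MSE/m: variance of a mean of m
    margin = t_crit * se
    return fit - margin, fit + margin


low, high = mean_prediction_interval(FIT_AT_40, 40, M)
print(f"95% PI for the mean of {M} new observations: ({low:.1f}, {high:.1f})")
```

Setting `m = 1` recovers the prediction interval for a single new observation, and letting `m` grow shrinks the interval toward the confidence interval for the mean response, which is why this case sits between the other two.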
