Presentation on theme: "5/11/2015330 lecture 71 STATS 330: Lecture 7. 5/11/2015330 lecture 72 Prediction Aims of today’s lecture Describe how to use the regression model to."— Presentation transcript:
5/11/ lecture 71 STATS 330: Lecture 7
5/11/ lecture 72 Prediction Aims of today’s lecture Describe how to use the regression model to predict future values of the response, given the values of the covariates Describe how to estimate the mean response, for particular values of the covariates
5/11/ lecture 73 Typical questions Given the height and diameter, can we predict the volume of a cherry tree? E.g.if a particular tree has diameter 11 inches and height 85 feet, can we predict its volume? What is the average volume of all cherry trees having a given height and diameter? E.g. if we consider all cherry trees with diameter 11 inches and height 85 feet, can we estimate their mean volume?
5/11/ lecture 74 The regression model (revisited) The diagram shows the true plane, which describes the mean response for any x1, x2 The short “sticks” represent the “errors” The y-observation is the sum of the mean and the error, Y=
5/11/ lecture 75 Prediction: how to do it Suppose we want to predict the response of an individual whose covariates are x 1, … x k. for the cherry tree example, k=2 and x1 (height) is 85, x2 (diameter) is 11 Since the response we want to predict is Y= we predict Y by making separate predictions of and and adding the results together.
5/11/ lecture 76 Prediction: how to do it (2) Since is the value of the true regression plane at x 1,…x k, we predict by the value of the fitted plane at x 1,…x k. This is We have no information about other than the fact that it is sampled from a normal distribution with zero mean. Thus, we predict by zero.
5/11/ lecture 77 Prediction: how to do it (3) Thus, the value of the predictor is Thus, to predict the response y when the covariates are x 1,…x k, we use the fitted plane at x 1,…x k.as the predictor.
5/11/ lecture 78 Inner product form Inner product form of predictor: Vector of regression coefficients Vector of predictor variables Inner product
5/11/ lecture 79 Prediction error Prediction error is measured by the expected value of (actual-predictor) 2 Equal to Var (predictor) + 2 First part comes from predicting the mean, second part from predicting the error. Var(predictor) depends on the x’s and on see formula on slide 12 Standard error of prediction is square root of Var (predictor) + 2 Prediction intervals approximately predictor +/- 2 standard errors, or more precisely….
5/11/ lecture 710 Prediction interval A prediction interval is an interval which contains the actual value of the response with a given probability e.g 95% Interval is for x’ is
5/11/ lecture 711 Var(predictor) Arrange covariate data (used for fitting model) into a matrix X
5/11/ lecture 712 qt qt(0.975,28) Area = Area = 0.025
5/11/ lecture 713 Doing it in R Suppose we want to predict the volume of 2 cherry trees. The first has diameter 11 inches and height 85 feet, the second has diameter 12 inches and height 90 feet. First step is to make a data frame containing the new data. Names must be the same as in the original data. Then we need to combine the results of the fit (regression coefs, estimate of error variance) with the new data (the predictor variables) to calculate the predictor
5/11/ lecture 714 Doing it in R (2) This only gives the predictions. What about prediction errors, prediction intervals etc? # calculate the fit from the original data cherry.lm =lm(volume~diameter+height,data=cherry.df) # make the new data frame new.df = data.frame(diameter=c(11,12), height=c(85,90)) # do the prediction predict(cherry.lm,new.df) New data (predictor variables) Results of fit
5/11/ lecture 715 predict(cherry.lm,new.df, se.fit=T,interval="prediction") $fit fit lwr upr $se.fit $df  28 $residual.scale  Doing it in R (3) 95% Prediction intervals Sqrt of var(predictor) not standard error of prediction Residual mean square
5/11/ lecture 716 Hand calculation Prediction is SE(predictor) = sqrt(se.fit^2 + residual.scale^2) Prediction interval is: Prediction +/- SE(predictor) *qt(0.975,n-k-1) Hence SE(predictor) = sqrt( ^ ^2) = qt(0.975,n-k-1) = qt(0.975,28) = PI is \ * or ( , )
5/11/ lecture 717 Estimating the mean response Suppose we want to estimate the mean response of all individuals whose covariates are x 1, … x k. Since the mean we want to predict is the height of the true regression plane at x 1, … x k, we use as our estimate the height of the fitted plane at x 1, …, x k. This is
5/11/ lecture 718 Standard error of the estimate The standard error of the estimate is the square root of Var ( Predictor ) Note that this is less than the standard error of the prediction: Confidence intervals: predictor +/- 2 standard error of estimate * qt(0.975,n-k-1)
5/11/ lecture 719 Doing it in R Suppose we want to estimate the mean volume of all cherry trees having diameter 11 inches and height 85 feet. As before, the first step is to make a data frame containing the new data. Names must be the same as in the original data.
5/11/ lecture 720 > new.df<-data.frame(diameter=11,height=85) > predict(cherry.lm,new.df,se.fit=T, interval="confidence") $fit fit lwr upr [1,] $se.fit  $df  28 $residual.scale  Doing it in R (2) Confidence interval Note: shorter than prediction interval
5/11/ lecture 721 Example: Hydrocarbon data When petrol is pumped into a tank, hydrocarbon vapours are forced into the atmosphere. To reduce this significant source of air pollution, devices are installed to capture the vapour. A laboratory experiment was conducted in which the amount of vapour given off was measured under carefully controlled conditions. In addition to the response, there were four variables which were thought relevant for prediction: t.temp: initial tank temperature ( degrees F) p.temp: temperature of the dispensed petrol ( degrees F) t.vp: initial vapour pressure in tank (psi) p.vp:vapour pressure of the dispensed petrol (psi) hc:emitted dispensed hydrocarbons (g) (response)
5/11/ lecture 722 The data > petrol.df t.temp p.temp t.vp p.vp hc … 125 observations in all
5/11/ lecture 723 Pairs plot
5/11/ lecture 724 Preliminary conclusion All variables seem related to the response p.vp and t.vp seem highly correlated Quite strong correlations between some of the other variables No obvious outliers
5/11/ lecture 725 Fit the “full” model >petrol.reg<-lm(hc~ t.temp + p.temp + t.vp + p.vp, data=petrol.df) > summary(petrol.reg) Call: lm(formula = hc ~ t.temp + p.temp + t.vp + p.vp, data = petrol.df) Residuals: Min 1Q Median 3Q Max Continued on next slide...
5/11/ lecture 726 Fit the “full” model (2) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) t.temp p.temp e-05 *** t.vp ** p.vp e-09 *** --- Signif. codes: 0 `***' `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 Residual standard error: on 120 degrees of freedom Multiple R-Squared: , Adjusted R-squared: F-statistic: on 4 and 120 DF, p-value: < 2.2e-16
5/11/ lecture 727 Conclusions Large R 2 Significant coefficients, except for temp Model seems satisfactory Can move on to prediction Lets predict hydrocarbon emissions when t.temp=28, p.temp=30, t.vp=3 and p.vp=3.5
5/11/ lecture 728 Prediction > pred.df<-data.frame(t.temp=28, p.temp=30, t.vp=3, p.vp=3.5) > predict(petrol.reg,pred.df,interval="p" ) fit lwr upr [1,] Hydrocarbon emission is predicted to be between 20.2 and 31.9 grams.