Presentation is loading. Please wait.

Presentation is loading. Please wait.

7/2/2015330 Lecture 51 STATS 330: Lecture 5. 7/2/2015330 Lecture 52 Tutorials  These will cover computing details  Held in basement floor tutorial lab,

Similar presentations


Presentation on theme: "7/2/2015330 Lecture 51 STATS 330: Lecture 5. 7/2/2015330 Lecture 52 Tutorials  These will cover computing details  Held in basement floor tutorial lab,"— Presentation transcript:

1 7/2/2015330 Lecture 51 STATS 330: Lecture 5

2 7/2/2015330 Lecture 52 Tutorials  These will cover computing details  Held in basement floor tutorial lab, Building 303 South (303S B75) Tutorial 1: Wednesday 11-12 Tutorial 2: Friday 10-11 Tutorial 2: Friday 2-3  Start this Thursday

3 7/2/2015330 Lecture 53 My office hours  Tuesday 10:30 – 12:00  Thursday 10:30 – 12:00  Room 265, Building 303S Come up to the second floor in the Computer Science lifts, through the doors and turn right

4 Tutor office hours  Tuesday 1-4 pm  Thursday 12-2, 3-4 pm  Friday 1-3 pm  All in Room 303.133  (First floor, Building 303) 7/2/2015330 Lecture 54

5 Next Week  Monday, Thursday lectures by Arden Miller  No lecture Tuesday  Don’t forget assignment due Aug 4 7/2/2015330 Lecture 55

6 7/2/2015330 Lecture 56 Today’s lecture: The multiple regression model Aim of the lecture:  To describe the type of data suitable for multiple regression  To review the multiple regression model  To describe some plots that check the suitability of the model

7 7/2/2015330 Lecture 57 Suitable data Suppose our data frame contains A numeric “response” variable Y One or more numeric “explanatory” variables X 1,…X k (also called covariates, independent variables) and we want to “explain” (model, predict) Y in terms of the explanatory variables, e.g. explain volume of cherry trees in terms of height and diameter explain NOx in terms of C and E

8 7/2/2015330 Lecture 58 The regression model The model assumes  The responses are normally distributed with means  (each response has a different mean) and constant variance  2  The mean response  of a typical observation depends on the covariates through a linear relationship      x   k x k  The responses are independent

9 7/2/2015330 Lecture 59 The regression plane The relationship      x   k x k can be visualized as a plane. The plane is determined by the coefficients     ,  k e.g. for k=2 (k=number of covariates)

10 7/2/2015330 Lecture 510 Slope =   slope=  2 00

11 7/2/2015330 Lecture 511 The regression model (cont)  The data is scattered above and below the plane:  Size of “sticks” is random, controlled by  2, doesn’t depend on x 1, x 2

12 7/2/2015330 Lecture 512 Alternative form of model Y        x   k x k  where  is normally distributed with mean zero and variance  2

13 7/2/2015330 Lecture 513 Checking the data  How can we tell when this model is suitable?  Suitable = data randomly scattered about plane, “planar” for short  k=1, draw scatterplot of y vs x  k=2, use spinner, look for “edge”  k  2 use coplots: if data are planar the plots should be parallel: coplot(y~x1|x2*x3)

14 7/2/2015330 Lecture 514

15 7/2/2015330 Lecture 515 Interpretation of coefficients  The regression coefficient  0 gives the average response when all covariates are zero  The regression coefficient  1 gives the slope of the plane in the x 1 direction measures the average increase in the response for a unit increase in x 1 when all other covariates are held constant Is the slope of plots of y versus x1 in a coplot Is not necessarily the slope of y versus x in a scatter plot

16 7/2/2015330 Lecture 516 Example Consider data from model y = 5 - 1*x1 + 4*x2 + N(0,1) x1 and x2 highly correlated, r = 0.95 Regression of y on x 1 alone has slope 2.767:

17 7/2/2015330 Lecture 517 Scatter plot: y vs x1 Slope 2.767

18 7/2/2015330 Lecture 518 Coplot Slopes are all about - 1

19 Points to note  The regression coefficients  1 and  2 refer to the conditional mean of the response, given the covariates (in this case conditional on x1 and x2)   1 is the slope in the coplot, conditioning on x2  The coefficient in the plot of y versus x1 refers to a different conditional distribution (conditional on x1 only)  The same if and only if x1 and x2 uncorrelated 7/2/2015330 Lecture 519

20 7/2/2015330 Lecture 520 Estimation of the coefficients  We estimate the (unknown) regression plane by the “least squares plane” (best fitting plane)  Best fitting plane = plane that minimizes the sum of squared vertical deviations from the plane  That is, minimize the least squares criterion

21 7/2/2015330 Lecture 521 Estimation of the coefficients (2)  The R command lm calculates the coefficients of the best fitting plane  This function solves the normal equations, a set of linear equations derived by differentiating the least squares criterion with respect to the coefficients

22 7/2/2015330 Lecture 522 Math Stuff (only if you like it) Arrange data on response and covariates into a vector y and a matrix X

23 7/2/2015330 Lecture 523 Math Stuff: Normal equations Estimates are solutions of

24 7/2/2015330 Lecture 524 Calculation of best fitting plane > lm(y~x1+x2) Call: lm(formula = y ~ x1 + x2) Coefficients: (Intercept) x1 x2 4.988 -1.055 4.060

25 7/2/2015330 Lecture 525 How well does the plane fit?  Judge this by examining the residuals and fitted values: Each observation has a fitted value: Y i : response for observation i, x i1, x i2 : values of explanatory variables for observation i Fitted plane is Fitted value for observation i is (the height of the fitted plane at (x i1,x i2 )

26 7/2/2015330 Lecture 526 How well does the plane fit? (cont)  The residual is the difference between the response and the fitted value

27 7/2/2015330 Lecture 527 How well does the plane fit (cont)  For a good fit, residuals must be Small relative to the y’s Have no pattern (don’t depend on the fitted values, x’s etc)  For the model to be useful, we should have a strong relationship between the response and the explanatory variables

28 7/2/2015330 Lecture 528 Measuring goodness of fit  How can we measure the relative size of residuals and the strength of the relationship between y and the x’s?  The “anova identity” is a way of explaining this

29 7/2/2015330 Lecture 529 Measuring goodness of fit (cont)  Consider the “residual sum of squares”  Provides an overall measure of the size of the residuals

30 7/2/2015330 Lecture 530 Measuring goodness of fit (cont)  Consider the “total sum of squares”  Provides an overall measure of the variability of the data

31 7/2/2015330 Lecture 531 Math stuff: ANOVA identity  Math result: Anova Identity TSS = RegSS + RSS Since RegSS is non-negative, RSS must be less than TSS so RSS/TSS is between 0 and 1. RegSS: Regression sum of squares

32 7/2/2015330 Lecture 532 Interpretation  The smaller RSS compared to TSS, the better the fit.  If RSS is 0, and TSS is positive, we have a perfect fit  If RegSS=0, RSS=TSS and the plane is flat (all coefficients except the constant are zero), so the x’s don’t help predict y

33 7/2/2015330 Lecture 533 R2R2  Another way to express this is the coefficient of determination R 2  This is defined as 1-RSS/TSS = RegSS/TSS  R 2 is always between 0 and 1 1 means a perfect fit 0 means a flat plane  R 2 is the square of the correlation between the observations and the fitted values

34 7/2/2015330 Lecture 534 Estimate of    Recall that   controls the scatter of the observations about the regression plane: the bigger  , the more scatter the bigger  , the smaller R 2    estimated by s 2 =RSS/(n-k-1)

35 7/2/2015330 Lecture 535 Calculations cherry.lm <- lm(volume~diameter+height, data=cherry.df)) summary(cherry.lm)  The lm function produces an “lm object” that contains all the information from fitting the regression  lm = “linear model” “ell” not I !!!!

36 7/2/2015330 Lecture 536 Call: lm(formula = volume ~ diameter + height, data = cherry.df) Residuals: Min 1Q Median 3Q Max -6.4065 -2.6493 -0.2876 2.2003 8.4847 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -57.9877 8.6382 -6.713 2.75e-07 *** diameter 4.7082 0.2643 17.816 < 2e-16 *** height 0.3393 0.1302 2.607 0.0145 * --- Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 Residual standard error: 3.882 on 28 degrees of freedom Multiple R-Squared: 0.948, Adjusted R-squared: 0.9442 F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16 Info on residuals coefficients R2R2 Estimate of  Cherries


Download ppt "7/2/2015330 Lecture 51 STATS 330: Lecture 5. 7/2/2015330 Lecture 52 Tutorials  These will cover computing details  Held in basement floor tutorial lab,"

Similar presentations


Ads by Google