
Slide 1: STATS 330: Lecture 5 (7/2/2015)

Slide 2: Tutorials
- These will cover computing details.
- Held in the basement-floor tutorial lab, Building 303 South (303S B75).
- Tutorial 1: Wednesday 11-12
- Tutorial 2: Friday 10-11
- Tutorial 3: Friday 2-3
- Start this Thursday.

Slide 3: My office hours
- Tuesday 10:30-12:00 and Thursday 10:30-12:00
- Room 265, Building 303S
- Come up to the second floor in the Computer Science lifts, through the doors, and turn right.

Slide 4: Tutor office hours
- Tuesday 1-4 pm; Thursday 12-2 and 3-4 pm; Friday 1-3 pm
- All in Room 303.133 (first floor, Building 303)

Slide 5: Next week
- Monday and Thursday lectures by Arden Miller
- No lecture Tuesday
- Don't forget: assignment due Aug 4

Slide 6: Today's lecture: the multiple regression model
Aims of the lecture:
- To describe the type of data suitable for multiple regression
- To review the multiple regression model
- To describe some plots that check the suitability of the model

Slide 7: Suitable data
Suppose our data frame contains
- a numeric "response" variable Y, and
- one or more numeric "explanatory" variables X1, ..., Xk (also called covariates or independent variables),
and we want to "explain" (model, predict) Y in terms of the explanatory variables, e.g.
- explain the volume of cherry trees in terms of height and diameter;
- explain NOx in terms of C and E.

Slide 8: The regression model
The model assumes:
- The responses are normally distributed with means μ_i (each response has a different mean) and constant variance σ².
- The mean response of a typical observation depends on the covariates through a linear relationship:
    μ = β0 + β1 x1 + ... + βk xk
- The responses are independent.

Slide 9: The regression plane
The relationship μ = β0 + β1 x1 + ... + βk xk can be visualized as a plane. The plane is determined by the coefficients β0, ..., βk, e.g. for k = 2 (k = number of covariates):
    μ = β0 + β1 x1 + β2 x2

Slide 10: [Figure: the regression plane for k = 2, with intercept β0 and slopes β1 and β2 in the x1 and x2 directions]

Slide 11: The regression model (cont.)
The data are scattered above and below the plane. The size of the "sticks" (the vertical deviations) is random, controlled by σ², and doesn't depend on x1, x2.

Slide 12: Alternative form of the model
    Y = β0 + β1 x1 + ... + βk xk + ε,
where ε is normally distributed with mean zero and variance σ².

Slide 13: Checking the data
How can we tell when this model is suitable? Suitable = data randomly scattered about a plane ("planar" for short).
- k = 1: draw a scatterplot of y vs x.
- k = 2: use a spinner, look for an "edge".
- k > 2: use coplots; if the data are planar the plots should be parallel: coplot(y~x1|x2*x3)

Slide 14: [Figure: example coplot]

Slide 15: Interpretation of coefficients
- The regression coefficient β0 gives the average response when all covariates are zero.
- The regression coefficient β1 gives the slope of the plane in the x1 direction: it measures the average increase in the response for a unit increase in x1 when all other covariates are held constant.
- β1 is the slope of the plots of y versus x1 in a coplot, but is not necessarily the slope of y versus x1 in a scatterplot.

Slide 16: Example
Consider data from the model y = 5 - 1*x1 + 4*x2 + N(0,1), where x1 and x2 are highly correlated (r = 0.95). Regression of y on x1 alone has slope 2.767.
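The effect on this slide can be reproduced with a quick simulation. This is a sketch, not the lecture's actual data: the sample size, seed, and the construction of the correlated x2 are assumptions, so the single-covariate slope will be near, not exactly, 2.767.

```r
# Simulated version of the slide's example: x1 and x2 highly correlated,
# y = 5 - 1*x1 + 4*x2 + N(0,1) noise. (Seed and n are assumptions.)
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- 0.95 * x1 + sqrt(1 - 0.95^2) * rnorm(n)  # cor(x1, x2) is about 0.95
y  <- 5 - 1 * x1 + 4 * x2 + rnorm(n)

coef(lm(y ~ x1))       # slope of y on x1 alone: large and positive
coef(lm(y ~ x1 + x2))  # slopes close to the true values -1 and 4
```

Because x2 is omitted and strongly correlated with x1, the first fit's slope is near the marginal value -1 + 4(0.95) = 2.8, even though the true coefficient of x1 is -1.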

Slide 17: [Figure: scatterplot of y vs x1; fitted slope 2.767]

Slide 18: [Figure: coplot of y vs x1 given x2; the slopes are all about -1]

Slide 19: Points to note
- The regression coefficients β1 and β2 refer to the conditional mean of the response, given the covariates (in this case, conditional on x1 and x2).
- β1 is the slope in the coplot, conditioning on x2.
- The coefficient in the plot of y versus x1 refers to a different conditional distribution (conditional on x1 only).
- The two are the same if and only if x1 and x2 are uncorrelated.

Slide 20: Estimation of the coefficients
We estimate the (unknown) regression plane by the "least squares plane" (the best-fitting plane): the plane that minimizes the sum of squared vertical deviations from the plane. That is, we minimize the least squares criterion
    Σ_{i=1}^{n} (y_i - β0 - β1 x_i1 - ... - βk x_ik)²

Slide 21: Estimation of the coefficients (2)
The R function lm calculates the coefficients of the best-fitting plane. It solves the normal equations, a set of linear equations derived by differentiating the least squares criterion with respect to the coefficients.

Slide 22: Math stuff (only if you like it)
Arrange the data on the response and covariates into a vector y and a matrix X:
    y = (y_1, ..., y_n)ᵀ,   X = the n × (k+1) matrix whose i-th row is (1, x_i1, ..., x_ik)

Slide 23: Math stuff: normal equations
The estimates are the solutions of
    XᵀX β̂ = Xᵀy
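A minimal sketch, using simulated data (the data and coefficients here are assumptions, not the lecture's): solving the normal equations directly reproduces the coefficients that lm computes.

```r
# Solve the normal equations t(X) %*% X %*% b = t(X) %*% y directly and
# compare with lm(). X has a leading column of ones for the intercept.
set.seed(2)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 3 * x1 - x2 + rnorm(n)

X <- cbind(1, x1, x2)               # n x (k+1) design matrix
b <- solve(t(X) %*% X, t(X) %*% y)  # normal-equations solution
cbind(normal.eqns = b, lm = coef(lm(y ~ x1 + x2)))  # the two columns agree
```

(In practice lm uses a QR decomposition rather than forming XᵀX, which is numerically more stable, but the solution is the same.)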

Slide 24: Calculation of the best-fitting plane

> lm(y ~ x1 + x2)

Call:
lm(formula = y ~ x1 + x2)

Coefficients:
(Intercept)           x1           x2
      4.988       -1.055        4.060

Slide 25: How well does the plane fit?
Judge this by examining the residuals and fitted values. Let y_i be the response for observation i, and x_i1, x_i2 the values of the explanatory variables for observation i. The fitted plane is
    β̂0 + β̂1 x1 + β̂2 x2
and each observation has a fitted value
    ŷ_i = β̂0 + β̂1 x_i1 + β̂2 x_i2
(the height of the fitted plane at (x_i1, x_i2)).

Slide 26: How well does the plane fit? (cont.)
The residual is the difference between the response and the fitted value:
    e_i = y_i - ŷ_i
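In R the fitted values and residuals are extracted from the lm object with fitted and resid. A sketch with simulated data (the dataset here is an assumption, standing in for any fitted regression):

```r
# Fitted values are the heights of the fitted plane at each observation's
# covariates; residuals are observed responses minus fitted values.
set.seed(3)
x1  <- rnorm(40)
x2  <- rnorm(40)
y   <- 1 + 2 * x1 + 3 * x2 + rnorm(40)
fit <- lm(y ~ x1 + x2)

head(fitted(fit))  # fitted values: height of the plane at (x_i1, x_i2)
head(resid(fit))   # residuals e_i = y_i - fitted_i
all.equal(as.vector(resid(fit)), y - as.vector(fitted(fit)))  # TRUE
```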

Slide 27: How well does the plane fit? (cont.)
For a good fit, the residuals must
- be small relative to the y's, and
- have no pattern (they shouldn't depend on the fitted values, the x's, etc.).
For the model to be useful, we should have a strong relationship between the response and the explanatory variables.

Slide 28: Measuring goodness of fit
How can we measure the relative size of the residuals and the strength of the relationship between y and the x's? The "anova identity" is a way of explaining this.

Slide 29: Measuring goodness of fit (cont.)
Consider the "residual sum of squares"
    RSS = Σ_{i=1}^{n} (y_i - ŷ_i)²
It provides an overall measure of the size of the residuals.

Slide 30: Measuring goodness of fit (cont.)
Consider the "total sum of squares"
    TSS = Σ_{i=1}^{n} (y_i - ȳ)²
It provides an overall measure of the variability of the data.

Slide 31: Math stuff: the anova identity
Math result (the anova identity):
    TSS = RegSS + RSS
where RegSS = Σ_{i=1}^{n} (ŷ_i - ȳ)² is the regression sum of squares. Since RegSS is non-negative, RSS must be less than or equal to TSS, so RSS/TSS is between 0 and 1.
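The identity is easy to check numerically. A sketch on simulated data (the dataset is an assumption; the identity holds for any regression fitted with an intercept):

```r
# Numerical check of the anova identity TSS = RegSS + RSS.
set.seed(4)
x1  <- rnorm(60)
x2  <- rnorm(60)
y   <- 1 + x1 + 2 * x2 + rnorm(60)
fit <- lm(y ~ x1 + x2)

RSS   <- sum(resid(fit)^2)                  # residual sum of squares
TSS   <- sum((y - mean(y))^2)               # total sum of squares
RegSS <- sum((fitted(fit) - mean(y))^2)     # regression sum of squares
all.equal(TSS, RegSS + RSS)                 # TRUE: the identity holds
```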

Slide 32: Interpretation
- The smaller RSS is compared to TSS, the better the fit.
- If RSS is 0 and TSS is positive, we have a perfect fit.
- If RegSS = 0, then RSS = TSS and the fitted plane is flat (all coefficients except the constant are zero), so the x's don't help predict y.

Slide 33: R²
Another way to express this is the coefficient of determination R², defined as
    R² = 1 - RSS/TSS = RegSS/TSS
- R² is always between 0 and 1: 1 means a perfect fit, 0 means a flat plane.
- R² is the square of the correlation between the observations and the fitted values.
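Both characterizations of R² on this slide can be verified directly. A sketch on simulated data (the dataset is an assumption):

```r
# R^2 from its definition, compared with summary()'s Multiple R-squared
# and with the squared correlation between observed and fitted values.
set.seed(5)
x1  <- rnorm(50)
x2  <- rnorm(50)
y   <- 2 - x1 + x2 + rnorm(50)
fit <- lm(y ~ x1 + x2)

R2 <- 1 - sum(resid(fit)^2) / sum((y - mean(y))^2)
all.equal(R2, summary(fit)$r.squared)  # TRUE
all.equal(R2, cor(y, fitted(fit))^2)   # TRUE
```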

Slide 34: Estimate of σ²
Recall that σ² controls the scatter of the observations about the regression plane:
- the bigger σ², the more scatter;
- the bigger σ², the smaller R².
σ² is estimated by
    s² = RSS/(n - k - 1)
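The square root of this estimate is what summary.lm reports as the "residual standard error". A sketch on simulated data (n = 50 and k = 2 here are assumptions):

```r
# Check that s^2 = RSS/(n - k - 1) matches summary()'s residual standard error.
set.seed(6)
n   <- 50
x1  <- rnorm(n)
x2  <- rnorm(n)
y   <- 1 + x1 + x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

s2 <- sum(resid(fit)^2) / (n - 2 - 1)    # RSS / (n - k - 1), k = 2
all.equal(sqrt(s2), summary(fit)$sigma)  # TRUE
```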

Slide 35: Calculations

    cherry.lm <- lm(volume ~ diameter + height, data = cherry.df)
    summary(cherry.lm)

The lm function produces an "lm object" that contains all the information from fitting the regression. lm = "linear model" (that's an "ell", not an I!).

Slide 36: Cherries

Call:
lm(formula = volume ~ diameter + height, data = cherry.df)

Residuals:
    Min      1Q  Median      3Q     Max
-6.4065 -2.6493 -0.2876  2.2003  8.4847

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
diameter      4.7082     0.2643  17.816  < 2e-16 ***
height        0.3393     0.1302   2.607   0.0145 *
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-Squared: 0.948,  Adjusted R-squared: 0.9442
F-statistic: 255 on 2 and 28 DF,  p-value: < 2.2e-16

(Annotations on the slide point out the info on residuals, the coefficient table, R², and the estimate s of σ.)
