1 Multiple Regression

2 Model There are several explanatory (independent) variables x1, x2, …, xp that are linearly related to the response (dependent) variable Y. Y ~ N(μ, σ), where μ = β0 + β1x1 + β2x2 + … + βpxp. Interpretation of βi: when xi changes by one unit, the mean of Y changes by βi units, given that all other x-variables are held constant.
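
The "one unit change" interpretation can be sketched numerically; the coefficients below (b0 = 1.0, b1 = 2.0, b2 = 0.5) are made up purely for illustration:

```python
# Sketch of the mean function mu = b0 + b1*x1 + ... + bp*xp.
# The coefficient values here are hypothetical, chosen only to illustrate.
def mean_response(b0, betas, xs):
    """Expected value of Y for one observation with predictor values xs."""
    return b0 + sum(b * x for b, x in zip(betas, xs))

mu = mean_response(1.0, [2.0, 0.5], [3.0, 4.0])          # 1 + 6 + 2 = 9
mu_shifted = mean_response(1.0, [2.0, 0.5], [4.0, 4.0])  # x1 raised by one unit
print(mu, mu_shifted - mu)  # the mean changes by exactly b1 = 2.0
```

Increasing x1 by one unit while holding x2 fixed moves the mean by exactly b1, which is the interpretation stated above.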

3 Parameters The model: Y = β0 + β1x1 + β2x2 + … + βpxp + ε. We have p independent variables that are related to the dependent variable Y; ε is the random error term.

4 Multiple regression demonstration Linear regression model with one independent variable x: Y = β0 + β1x + ε, with E(Y) = β0 + β1x. In the multiple linear regression model there are several independent variables: Y = β0 + β1x1 + β2x2 + ε, with E(Y) = β0 + β1x1 + β2x2. The line becomes a plane.

5 Assumptions in regression 1. There is a multiple linear relation between the x-variables and the Y-variable. 2. The observations are independent of each other. 3. The variance around the hyperplane is the same for all combinations of x-values. 4. The variation around the hyperplane can be modeled with a normal distribution.

6 Estimation of parameters and evaluation of the model. Work order: –Estimate the parameters with some computer program (SPSS, SAS, R, Minitab, …). –Check how well the model fits the data. –Check whether the model assumptions are fulfilled by studying the residuals. –If the model assumptions are fulfilled and the model is acceptable, then the parameter estimates can be interpreted and the model can be used for predictions.

7 Example –“Toulon Theatres” advertises in newspapers and on television. –Question: we want to understand which kind of advertisement is the better investment. –During some randomly chosen weeks we observe how much is spent on advertising (TVAdv, NewsAdv) and the income (Revenue). Model: Revenue = β0 + β1 TVAdv + β2 NewsAdv + ε

8 Data material (the numbers are in 1000 Euro):
Revenue TVAdv NewsAdv
96      5,0   1,5
90      2,0   2,0
95      4,0   1,5
92      2,5   2,5
95      3,0   3,3
94      3,5   2,3
94      2,5   4,2
94      3,0   2,5

9 Minitab print-out
Regression Analysis: Revenue versus TVAdv; NewsAdv
The regression equation is Revenue = 83,2 + 2,29 TVAdv + 1,30 NewsAdv
Predictor Coef SE Coef T P
Constant 83,230 1,574 52,88 0,000
TVAdv 2,2902 0,3041 7,53 0,001
NewsAdv 1,3010 0,3207 4,06 0,010
S = 0,642587 R-Sq = 91,9% R-Sq(adj) = 88,7%
Analysis of Variance
Source DF SS MS F P
Regression 2 23,435 11,718 28,38 0,002
Residual Error 5 2,065 0,413
Total 7 25,500
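
As a sketch, the Minitab coefficients can be reproduced by solving the normal equations (XᵀX)b = Xᵀy in pure Python, assuming the eight observations from the data slide:

```python
# Toulon Theatres data (1000 Euro), as listed on the data slide.
revenue = [96, 90, 95, 92, 95, 94, 94, 94]
tv      = [5.0, 2.0, 4.0, 2.5, 3.0, 3.5, 2.5, 3.0]
news    = [1.5, 2.0, 1.5, 2.5, 3.3, 2.3, 4.2, 2.5]

X = [[1.0, t, n] for t, n in zip(tv, news)]  # design matrix with intercept
XtX = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
Xty = [sum(r[i] * y for r, y in zip(X, revenue)) for i in range(3)]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

b0, b1, b2 = solve(XtX, Xty)
print(round(b0, 3), round(b1, 4), round(b2, 4))  # approximately 83.23, 2.29, 1.30
```

The solution agrees with the print-out: Revenue = 83,2 + 2,29 TVAdv + 1,30 NewsAdv.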

10 Minitab print-out

11 Assumptions ok? Check the residuals: –Normal distribution. –Constant variance. –Independence (if the observations are in time order).

12 12 Coefficient of determination, R 2 Regression Analysis: Revenue versus TVAdv; NewsAdv The regression equation is Revenue = 83,2 + 2,29 TVAdv + 1,30 NewsAdv Predictor Coef SE Coef T P Constant 83,230 1,574 52,88 0,000 TVAdv 2,2902 0,3041 7,53 0,001 NewsAdv 1,3010 0,3207 4,06 0,010 S = 0,642587 R-Sq = 91,9% R-Sq(adj) = 88,7% Analysis of Variance Source DF SS MS F P Regression 2 23,435 11,718 28,38 0,002 Residual Error 5 2,065 0,413 Total 7 25,500

13 Adjusted coefficient of determination, adj R2 Adjusted R2 is used to compare the coefficient of determination between models with different numbers of x-variables. It can be shown that R2 always increases if we add more x-variables, but adj R2 decreases if the new x-variable is only weakly related to Y.
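
Adjusted R² can be computed from R², the number of observations n, and the number of x-variables p; a small sketch using the figures from the Minitab print-out:

```python
def adj_r2(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Toulon example: R-Sq = 91.9%, n = 8 observations, p = 2 x-variables.
print(round(adj_r2(0.919, 8, 2), 3))  # -> 0.887, i.e. R-Sq(adj) = 88.7%
```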

14 14 Regression Analysis: Revenue versus TVAdv; NewsAdv The regression equation is Revenue = 83,2 + 2,29 TVAdv + 1,30 NewsAdv Predictor Coef SE Coef T P Constant 83,230 1,574 52,88 0,000 TVAdv 2,2902 0,3041 7,53 0,001 NewsAdv 1,3010 0,3207 4,06 0,010 S = 0,642587 R-Sq = 91,9% R-Sq(adj) = 88,7% Analysis of Variance Source DF SS MS F P Regression 2 23,435 11,718 28,38 0,002 Residual Error 5 2,065 0,413 Total 7 25,500

15 Hypothesis We start with the question: is there at least one x-variable that is linearly related to Y? H0: β1 = β2 = … = βp = 0 (no x-variable is linearly related to Y) vs H1: at least one βi is not zero (at least one x-variable is related to Y).

16 The test statistic is called F and is F-distributed with p and n−p−1 degrees of freedom: F_obs = MSR/MSE, where MSR = SSR/p and MSE = SSE/(n−p−1).
Analysis of Variance
Source DF SS MS F P
Regression 2 23,435 11,718 28,38 0,002
Residual Error 5 2,065 0,413
Total 7 25,500
Here DF for Regression is p = 2, DF for Residual Error is n−p−1 = 5, and the P-value is 0,002. Conclusion?
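
Using SSR and SSE from the ANOVA table above, the F statistic can be reproduced (the last digit differs slightly from Minitab's 28,38 because the printed sums of squares are rounded):

```python
SSR, SSE = 23.435, 2.065  # sums of squares from the ANOVA table
p = 2                     # number of x-variables
n = 8                     # number of observations
MSR = SSR / p             # mean square for regression (11,718 in the table)
MSE = SSE / (n - p - 1)   # mean square error (0,413 in the table)
F = MSR / MSE
print(round(F, 2))        # -> about 28.37 (Minitab prints 28,38)
```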

17 We can reject H0: we have enough evidence to state that at least one x-variable is linearly related to Y. But which x-variable (or variables)? We need to look at the P-value for each regression coefficient. H0: βi = 0 (x-variable no. i is not related to Y) vs H1: βi ≠ 0 (x-variable no. i is related to Y).

18 18 Minitab print-out Regression Analysis: Revenue versus TVAdv; NewsAdv The regression equation is Revenue = 83,2 + 2,29 TVAdv + 1,30 NewsAdv Predictor Coef SE Coef T P Constant 83,230 1,574 52,88 0,000 TVAdv 2,2902 0,3041 7,53 0,001 NewsAdv 1,3010 0,3207 4,06 0,010 S = 0,642587 R-Sq = 91,9% R-Sq(adj) = 88,7% Analysis of Variance Source DF SS MS F P Regression 2 23,435 11,718 28,38 0,002 Residual Error 5 2,065 0,413 Total 7 25,500

19 Interpretation of the parameter estimates b0 = 83.23 is the intercept: the expected income if no money is spent on advertising. We have no observations where all x-variables are zero, so this interpretation is an extrapolation, and we need to be careful with extrapolations. b1 = 2.290: for each 1000 EUR we spend on television advertising, the income increases by 2290 EUR, given that the other x-variables are held constant.

20 b2 = 1.301: for each 1000 EUR we spend on newspaper advertising, the income increases by 1301 EUR, given that the other x-variables are held constant. We can use the model to make predictions about the income if we spend money on television and newspaper advertising.
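
A sketch of such a prediction with the fitted coefficients; the budget of 3.5 / 1.8 (thousand EUR) is an arbitrary example, not from the source:

```python
def predict_revenue(tv_adv, news_adv):
    """Predicted weekly revenue (1000 EUR) from the fitted Toulon model."""
    return 83.230 + 2.2902 * tv_adv + 1.3010 * news_adv

# E.g. 3500 EUR on TV and 1800 EUR on newspaper advertising:
print(round(predict_revenue(3.5, 1.8), 1))  # -> 93.6, i.e. about 93 600 EUR
```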

21 Multiple Regression with nominal variables

22 Multiple Regression with nominal x-variables Example: “Johansson Filtration”. The company wants a model for predicting repair time (for quotations). Y = repair time, x1 = time since last repair, x2 = type of repair (mechanical or electrical; nominal).

23 Nominal x-variables A regression with one nominal (x2) and one interval (x1) variable gives two parallel lines in the (x1, y) plane: the line for x2 = 0 has intercept b0 and the line for x2 = 1 has intercept b0 + b2. A regression with one nominal (x3) and two interval (x1, x2) variables gives two parallel planes, shifted by b3.

24 Nominal x-variables (cont.) A regression with two nominal (x2 and x3) and one interval (x1) variable gives three parallel lines: x2 = 0 and x3 = 0 (intercept b0), x2 = 1 and x3 = 0 (intercept b0 + b2), x2 = 0 and x3 = 1 (intercept b0 + b3). A nominal variable with k categories is represented with k−1 dummy variables:
Category x2 x3
El       0  0
Mech     1  0
Both     0  1
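
The k−1 dummy coding in the table can be sketched as a small helper; the category names follow the slide, with El as the reference category:

```python
def dummy_code(value, categories):
    """Code a nominal value with k categories as k-1 dummy variables.
    The first category is the reference and gets all zeros."""
    if value not in categories:
        raise ValueError(f"unknown category: {value}")
    return [1 if value == c else 0 for c in categories[1:]]

cats = ["El", "Mech", "Both"]    # k = 3 categories -> 2 dummies (x2, x3)
print(dummy_code("El", cats))    # -> [0, 0]
print(dummy_code("Mech", cats))  # -> [1, 0]
print(dummy_code("Both", cats))  # -> [0, 1]
```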

25 Example ”Johansson” Regression Analysis: Time versus Months; Type The regression equation is Time = 0,930 + 0,388 Months + 1,26 Type Predictor Coef SE Coef T P Constant 0,9305 0,4670 1,99 0,087 Months 0,38762 0,06257 6,20 0,000 Type 1,2627 0,3141 4,02 0,005 S = 0,459048 R-Sq = 85,9% R-Sq(adj) = 81,9% Analysis of Variance Source DF SS MS F P Regression 2 9,0009 4,5005 21,36 0,001 Residual Error 7 1,4751 0,2107 Total 9 10,4760

26 Example ”Johansson” Possibly doubtful whether the assumptions are fulfilled…

27 Example ”Johansson”

28 Both β1 and β2 are significantly different from zero: both x-variables help explain the Y-variable. β0 is not significantly different from zero. High coefficient of determination.
Regression Analysis: Time versus Months; Type
The regression equation is Time = 0,930 + 0,388 Months + 1,26 Type
Predictor Coef SE Coef T P
Constant 0,9305 0,4670 1,99 0,087
Months 0,38762 0,06257 6,20 0,000
Type 1,2627 0,3141 4,02 0,005
S = 0,459048 R-Sq = 85,9% R-Sq(adj) = 81,9%
Analysis of Variance
Source DF SS MS F P
Regression 2 9,0009 4,5005 21,36 0,001
Residual Error 7 1,4751 0,2107
Total 9 10,4760

29 Interpretation of the parameter estimates b0 = 0.93: expected repair time in hours (56 min) for a mechanical repair of a newly repaired facility. This interpretation is an extrapolation, and the parameter β0 is not significantly different from zero. b1 = 0.39: for each month without service, the mean repair time increases by 0.39 hours (23 min), for both kinds of repairs. b2 = 1.26: if the repair is electrical, the mean repair time increases by 1.26 hours (1 hour 16 min), irrespective of the time since the last repair.
Regression Analysis: Time versus Months; Type
The regression equation is Time = 0,930 + 0,388 Months + 1,26 Type
Predictor Coef SE Coef T P
Constant 0,9305 0,4670 1,99 0,087
Months 0,38762 0,06257 6,20 0,000
Type 1,2627 0,3141 4,02 0,005

30 Example ”Johansson”

31 Problems in Multiple Regression

32 Multicollinearity problem Ideal: each x-variable in the multiple regression model contributes unique information about the Y-variable; all x-variables are uncorrelated with all the other x-variables. “Worst case” (maximum multicollinearity): the regression model cannot be estimated, because the x-variables are perfectly correlated with each other. Often in practice: the situation is somewhere between the ideal and the worst case; the x-variables are correlated, but not perfectly.

33 Multicollinearity, illustration “Worst case”: all observations lie on a line in the (x1, x2) plane (perfect correlation), so we cannot estimate a plane; we do not have enough information about the slope. Compare with simple regression (one x-variable): if all points are at the same x-coordinate, we cannot estimate the regression line, because we get no information about the slope.

34 Causes of multicollinearity Model problem: two x-variables measure almost the same thing. Example 1: x1 = length in cm and x2 = length in inches. Example 2: x1 = household income and x2 = income of the household member with the highest salary. Data-gathering problem: bad luck when collecting data (or bad experimental design) results in a large correlation between the x-variables.

35 Consequences of multicollinearity The estimators of the regression parameters get large variances (no significance in the T-tests, no low P-values). The sizes and signs of the estimates do not fit the theory. No robustness: the estimates of the regression parameters change a lot if there are minor changes in the observations, or if an observation is removed or added. In some cases the F-test result is significant, but none of the T-test results are significant.

36 How to discover multicollinearity Correlation matrix of the x-variables. Example: Y = house living space, x1 = disposable income, x2 = family size; high correlation between the x-variables:
Correlations: Disp.Inkomst; Storlek
Pearson correlation of Disp.Inkomst and Storlek = 0,978 P-Value = 0,000
Variance Inflation Factor (VIF): rule of thumb, VIF > 10 means problems with multicollinearity.
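
For a model with exactly two x-variables, the VIF reduces to 1/(1 − r²), where r is their pairwise correlation. With the printed (rounded) correlation 0,978 this gives roughly 23, close to Minitab's 22,750, which is computed from the unrounded correlation:

```python
def vif_two_predictors(r):
    """VIF for either x-variable in a two-predictor model with correlation r.
    In general, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    x_j on the other x-variables."""
    return 1 / (1 - r ** 2)

print(round(vif_two_predictors(0.978), 1))  # roughly 23, well above the limit 10
```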

37 Example Regression Analysis: Boyta versus Disp.Inkomst; Storlek The regression equation is Boyta = - 11,5 + 0,568 Disp.Inkomst + 3,4 Storlek Predictor Coef SE Coef T P VIF Constant -11,52 70,45 -0,16 0,878 Disp.Inkomst 0,5681 0,5504 1,03 0,360 22,750 Storlek 3,43 17,41 0,20 0,853 22,750 S = 16,0924 R-Sq = 89,5% R-Sq(adj) = 84,3% Analysis of Variance Source DF SS MS F P Regression 2 8849,9 4424,9 17,09 0,011 Residual Error 4 1035,9 259,0 Total 6 9885,7 F-test significant but none of the T-tests! VIF = 22.75 > 10!

38 Countermeasures against multicollinearity Model problem: remove one x-variable at a time until the problem disappears. Data-collection problem: try to make more observations in a new region of the x-variable space.

39 If time: example with a dummy variable

40 State finances in war and peace. We want to examine whether public purchases of premium bonds (x) are related to the national income (Y). Data: yearly observations of the variables in Canada from 1933 to 1949.

41 Observations
Year  y     x     D
1933  2,6   2,4   0
1934  3,0   2,8   0
1935  3,6   3,1   0
1936  3,7   3,4   0
1937  3,8   3,9   0
1938  4,1   4,0   0
1939  4,4   4,2   0
1940  7,1   5,1   1
1941  8,0   6,3   1
1942  8,9   8,1   1
1943  9,7   8,8   1
1944  10,2  9,6   1
1945  10,1  9,7   1
1946  7,9   9,6   0
1947  8,7   10,4  0
1948  9,1   12,0  0
1949  10,1  12,9  0

42 Dummy variable D is a dummy variable: D = 1 if Canada is at war, D = 0 if Canada is at peace.
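
A sketch of constructing D from the year, using the coding visible in the data table (D = 1 for the years 1940 to 1945):

```python
def war_dummy(year):
    """D = 1 if Canada is at war, 0 otherwise, as coded in the data table."""
    return 1 if 1940 <= year <= 1945 else 0

print([war_dummy(y) for y in (1939, 1940, 1945, 1946)])  # -> [0, 1, 1, 0]
```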

43

44 Regression Analysis: y versus x (without the dummy)
The regression equation is y = 1,57 + 0,759 x
Predictor Coef SE Coef T P
Constant 1,5698 0,6337 2,48 0,026
x 0,75936 0,08307 9,14 0,000
S = 1,15623 R-Sq = 84,8% R-Sq(adj) = 83,8%
Analysis of Variance
Source DF SS MS F P
Regression 1 111,71 111,71 83,56 0,000
Residual Error 15 20,05 1,34
Total 16 131,76

45

46

47 Residuals vs. Estimated values

48 Regression Analysis: y versus x; D (with the dummy)
The regression equation is y = 1,29 + 0,681 x + 2,30 D
Predictor Coef SE Coef T P
Constant 1,2897 0,1155 11,16 0,000
x 0,68141 0,01549 43,99 0,000
D 2,3044 0,1094 21,06 0,000
S = 0,209367 R-Sq = 99,5% R-Sq(adj) = 99,5%
Analysis of Variance
Source DF SS MS F P
Regression 2 131,145 65,573 1495,92 0,000
Residual Error 14 0,614 0,044
Total 16 131,759

49

50

51 Residuals vs. Estimated values

