# Types of regression models


Types of regression models
[Diagram: classification of regression models. Simple: 1st order, 2nd order, higher order. Multiple: 1st order, interaction, 2nd order, higher order.]

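Unit 2

A second-order (quadratic) model with one independent variable: $E(Y) = \beta_0 + \beta_1 x + \beta_2 x^2$

Interpretation of model parameters:
- $\beta_0$: y-intercept, the value of $E(Y)$ when $x = 0$
- $\beta_1$: the shift parameter
- $\beta_2$: the rate of curvature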

The true model, supposedly unknown, is $Y_i = 2 + x_i^2 + \varepsilon_i$, with $\varepsilon_i \sim N(0, 2)$. Data: (x, y); see SQM.sav.

Model 1: $E(Y) = \beta_0 + \beta_1 x$

Model 2: $E(Y) = \beta_0 + \beta_1 x^2$. Compared with Model 1, this fit gives a smaller residual variance and smaller standard errors.
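As a rough illustration of why the model in $x^2$ fits better here, the sketch below simulates data from the stated true model and fits both simple regressions with the closed-form least-squares formulas. The x grid, the seed, the reading of $N(0, 2)$ as variance 2, and the helper name `fit_simple` are assumptions made for illustration; the slides use the data in SQM.sav instead.

```python
import math
import random


def fit_simple(u, y):
    """Least-squares fit of y = b0 + b1*u; returns (b0, b1, sse)."""
    n = len(u)
    u_bar = sum(u) / n
    y_bar = sum(y) / n
    s_uu = sum((ui - u_bar) ** 2 for ui in u)
    s_uy = sum((ui - u_bar) * (yi - y_bar) for ui, yi in zip(u, y))
    b1 = s_uy / s_uu
    b0 = y_bar - b1 * u_bar
    sse = sum((yi - b0 - b1 * ui) ** 2 for ui, yi in zip(u, y))
    return b0, b1, sse


random.seed(0)  # assumed seed, only for reproducibility
x = [-3 + 0.25 * i for i in range(25)]  # assumed grid from -3 to 3
# Assumption: N(0, 2) means variance 2, so the standard deviation is sqrt(2).
y = [2 + xi ** 2 + random.gauss(0, math.sqrt(2)) for xi in x]

b0_m1, b1_m1, sse_m1 = fit_simple(x, y)                      # regress Y on x
b0_m2, b1_m2, sse_m2 = fit_simple([xi ** 2 for xi in x], y)  # regress Y on x^2

n = len(x)
mse_m1 = sse_m1 / (n - 2)  # estimate of the error variance under Model 1
mse_m2 = sse_m2 / (n - 2)  # estimate of the error variance under Model 2
print(f"Model 1: b0={b0_m1:.2f}, b1={b1_m1:.2f}, MSE={mse_m1:.2f}")
print(f"Model 2: b0={b0_m2:.2f}, b1={b1_m2:.2f}, MSE={mse_m2:.2f}")
```

On a grid symmetric around 0, x and $x^2$ are uncorrelated, so the straight-line fit leaves almost all of the curvature in the residuals, while the fit on $x^2$ leaves roughly the noise only.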

Model 3: $E(Y) = \beta_0 + \beta_1 x + \beta_2 x^2$


A third order model with 1 IV
$E(Y) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3$

Use with caution, given the numerical problems that could arise.

[Figure: shape of the cubic curve for $\beta_3 > 0$ and for $\beta_3 < 0$.]


First-Order model in k Quantitative variables
$E(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k$

Interpretation of model parameters:
- $\beta_0$: y-intercept, the value of $E(Y)$ when $x_1 = x_2 = \dots = x_k = 0$
- $\beta_1$: change in $E(Y)$ for a 1-unit increase in $x_1$ when $x_2, \dots, x_k$ are held fixed
- $\beta_2$: change in $E(Y)$ for a 1-unit increase in $x_2$ when $x_1, x_3, \dots, x_k$ are held fixed
- ...

A bivariate model

$E(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$

Changing $x_2$ changes only the y-intercept. In the first-order model, a 1-unit change in one independent variable has the same effect on the mean value of y regardless of the values of the other independent variables.

Example: executive salaries
- Y = Annual salary (in dollars)
- $x_1$ = Years of experience
- $x_2$ = Years of education
- $x_3$ = Gender: 1 if male; 0 if female
- $x_4$ = Number of employees supervised
- $x_5$ = Corporate assets (in millions of dollars)

Model: $E(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_4 x_4 + \beta_5 x_5$

Do not consider $x_3$ (Gender) for the moment.

Data: ExecSal.sav

Executive salaries: Computer Output

Multiple regression, model summary:

| Model | R | R square | Adjusted R square | Std. error of the estimate |
|---|---|---|---|---|
| 1 | .870 | .757 | .747 | 12685.309 |

Predictors: (Constant), Corporate assets (in million \$), Years of Experience, Years of Education, Number of Employees supervised

Simple regression, model summary:

| Model | R | R square | Adjusted R square | Std. error of the estimate |
|---|---|---|---|---|
| 1 | .783 | .613 | .609 | 15760.006 |

Predictors: (Constant), Years of Experience

Coefficient of determination
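The coefficient $R^2$ is computed exactly as in the simple regression case: $R^2 = \mathrm{SSR}/\mathrm{SST} = 1 - \mathrm{SSE}/\mathrm{SST}$, with $\mathrm{SST} = \mathrm{SSR} + \mathrm{SSE}$. Here SST is the total sum of squares, SSR is the regression sum of squares (not residual; that is SSE), and SSE is the error sum of squares.

A drawback of $R^2$: it never decreases when a variable is added, even if the variable is NOT relevant to the problem.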

Adjusted R2 and estimate of the variance σ2
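A solution: adjusted $R^2$, $R_a^2 = 1 - \frac{n-1}{n-(k+1)}\,(1 - R^2)$. Each additional variable reduces adjusted $R^2$ unless SSE decreases enough to compensate.

An unbiased estimator of the variance $\sigma^2$ is computed as $s^2 = \mathrm{MSE} = \frac{\mathrm{SSE}}{n-(k+1)}$, where $k$ is the number of independent variables and $n$ the number of observations.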

Executive salaries: Computer Output (2)

Coefficients (dependent variable: Annual salary in \$):

| Variable | B (unstandardized) | Std. error | Beta (standardized) | t | Sig. |
|---|---|---|---|---|---|
| (Constant) | -37082.148 | 17052.089 | | -2.175 | .032 |
| Years of Experience | 2696.360 | 173.647 | .785 | 15.528 | .000 |
| Years of Education | 2656.017 | 563.476 | .243 | 4.714 | |
| Number of Employees supervised | 41.092 | 7.807 | .272 | 5.264 | |
| Corporate assets (in million \$) | 244.569 | 83.420 | .149 | 2.932 | .004 |

Each row corresponds to one variable; the t and Sig. columns give the t-test for that variable's coefficient.

Testing overall significance: the F-test
1. Shows whether there is a linear relationship between all the X variables together and Y.
2. Uses the F test statistic.
3. Hypotheses:
   - $H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$ (no linear relationship)
   - $H_a$: at least one coefficient is not 0 (at least one X variable affects Y)

The F-test carries less chance of error than separate t-tests on each coefficient: doing a series of t-tests leads to an overall Type I error higher than $\alpha$. The F-test for a single coefficient is equivalent to the t-test.
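To see how quickly the overall Type I error grows with a series of separate tests, here is a minimal sketch. It assumes the k tests are independent, which is a simplification (t-tests on coefficients of the same regression are generally correlated), and the function name `familywise_error` is illustrative.

```python
def familywise_error(alpha, k):
    """Chance of at least one false rejection in k independent level-alpha tests."""
    return 1 - (1 - alpha) ** k


# Overall Type I error for 1, 4 and 10 separate tests at alpha = 0.05.
for k in (1, 4, 10):
    print(k, round(familywise_error(0.05, k), 3))
```

With four slopes, as in the executive salaries model, the chance of at least one false rejection under this assumption is already about 0.19 rather than 0.05.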

ANOVA table

| Model 1 | Sum of squares | df | Mean square | F | Sig. |
|---|---|---|---|---|---|
| Regression | 4.766E10 | 4 | 1.192E10 | 74.045 | .000 |
| Residual | 1.529E10 | 95 | 1.609E8 | | |
| Total | 6.295E10 | 99 | | | |

Predictors: (Constant), Corporate assets (in million \$), Years of Experience, Years of Education, Number of Employees supervised. Dependent variable: Annual salary in \$.

- Regression df = k, the number of regression slopes.
- Total df = n - 1, where n is the number of observations.
- F is the F-statistic and Sig. is the p-value of the F-test.
- The residual mean square is the MSE (mean square error), the estimate of the variance.
- Decision: reject $H_0$, i.e. the model as a whole is useful for explaining Y.
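The summary quantities reported by the software can be recomputed from the sums of squares in the ANOVA table. The sketch below does this with the rounded values printed in the output, so small discrepancies in the last digits are expected; the variable names are illustrative.

```python
import math

# Values read from the ANOVA table (rounded as printed in the output).
ssr = 4.766e10  # regression sum of squares
sse = 1.529e10  # residual (error) sum of squares
k = 4           # number of slopes = regression df
n = 100         # observations, since total df = n - 1 = 99

sst = ssr + sse
r2 = ssr / sst
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - (k + 1))
mse = sse / (n - (k + 1))   # unbiased estimate of the error variance
s = math.sqrt(mse)          # standard error of the estimate
f_stat = (ssr / k) / mse    # overall F-statistic

print(f"R2 = {r2:.3f}, adjusted R2 = {adj_r2:.3f}")
print(f"MSE = {mse:.4g}, s = {s:.1f}, F = {f_stat:.2f}")
```

The results agree with the model summary (R square .757, adjusted R square .747) and with the reported F of about 74 up to rounding of the printed sums of squares.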

Interaction (second order) model
$E(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$

Interpretation of model parameters:
- $\beta_0$: y-intercept, the value of $E(Y)$ when $x_1 = x_2 = 0$
- $\beta_1 + \beta_3 x_2$: change in $E(Y)$ for a 1-unit increase in $x_1$ when $x_2$ is held fixed
- $\beta_2 + \beta_3 x_1$: change in $E(Y)$ for a 1-unit increase in $x_2$ when $x_1$ is held fixed
- $\beta_3$: controls the rate of change of the surface

In the interaction model $E(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$ the contour lines are not parallel: the effect of one variable depends on the level of the other.

Example: Antique grandfather clocks auction
Clocks are sold at an auction on competitive offers. Data are:
- Y: auction price in dollars
- $x_1$: age of clocks
- $x_2$: number of bidders

Model 1: $E(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$

Model 2: $E(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$

Data: GFCLOCKS.sav

Data summaries

If data are Normal, skewness is 0 and (excess) kurtosis is 0.

Note: skewness and kurtosis alone are not enough to establish Normality.

P-P plot for Normality

If data are Normal, the points should lie along the straight line. In this example the situation is fairly good.

Bivariate scatter-plots

Model 1: $E(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$

Model 2: $E(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$

Interpreting interaction models
The coefficient for the interaction term is significant. If an interaction term is present, the corresponding first-order terms must also be included in order to interpret the model correctly.

In the example, a careless analyst could estimate the effect of Bidders as negative, since $b_2 = -93.26$. Since an interaction term is present, the slope estimate for Bidders ($x_2$) is $b_2 + b_3 x_1$ (note: $b = \hat{\beta}$). For $x_1 = 150$ (age), the estimated slope for Bidders is $b_2 + b_3 (150)$.
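The dependence of the Bidders slope on age can be sketched in a few lines. Only $b_2 = -93.26$ comes from the slide; the value of `b3` below is a hypothetical placeholder chosen for illustration, not the estimate from GFCLOCKS.sav, and the function name is illustrative too.

```python
def bidders_slope(b2, b3, age):
    """Estimated change in price per extra bidder at a given clock age: b2 + b3*age."""
    return b2 + b3 * age


b2 = -93.26  # from the slide
b3 = 1.0     # HYPOTHETICAL value for illustration only, not from the source

for age in (50, 100, 150, 200):
    print(age, round(bidders_slope(b2, b3, age), 2))
```

With a positive interaction coefficient the slope for Bidders changes sign as age grows, which is why reading $b_2$ alone as "the effect of Bidders" is misleading.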

Models with qualitative X’s
Regression models can also include qualitative (or categorical) independent variables (QIV). The categories of a QIV are called levels.

Since the levels of a QIV are not measured on a natural numerical scale, we need a specific type of coding in order to avoid introducing fictitious linear relations in the model. Coding is done by using independent variables which assume only two values, 0 or 1. These coded variables are called dummy variables.

Models with QIV

Suppose we want to model Income (Y) as a function of Sex (x), using a coded, or dummy, variable: x = 1 if Male, x = 0 if Female.

$E(Y) = \beta_0 + \beta_1 x$
- $E(Y) = \beta_0 + \beta_1$ if x = 1, i.e. Male
- $E(Y) = \beta_0$ if x = 0, i.e. Female

$\beta_0$ is the mean at the base level, i.e. Female is the reference category; $\beta_1$ is the additional effect if Male. In this simple model, only the means for the two groups are modeled.

QIV with q levels

As a general rule, if a QIV has q levels we need q - 1 dummies for coding. The uncoded level is the reference one.

Example: a QIV has three levels, A, B and C. Define
- $x_1 = 1$ if level A, $x_1 = 0$ if not
- $x_2 = 1$ if level B, $x_2 = 0$ if not

Model: $E(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$, where C is the reference level.

Interpreting the $\beta$'s:
- $\beta_0 = \mu_C$ (mean for base level C)
- $\beta_1 = \mu_A - \mu_C$ (additional effect with respect to C if level A)
- $\beta_2 = \mu_B - \mu_C$ (additional effect with respect to C if level B)
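The q - 1 coding rule can be sketched as a small helper. The function name `dummy_code` and the sample data are illustrative assumptions; the slides build the dummies in SPSS instead.

```python
def dummy_code(values, reference):
    """Return (dummy_levels, rows): one 0/1 column per non-reference level."""
    levels = sorted(set(values))
    if reference not in levels:
        raise ValueError("reference level does not occur in the data")
    dummy_levels = [lv for lv in levels if lv != reference]
    rows = [[1 if v == lv else 0 for lv in dummy_levels] for v in values]
    return dummy_levels, rows


# Three levels A, B, C with C as the reference: two dummies (x1 for A, x2 for B).
observed = ["A", "B", "C", "A", "C"]
dummy_levels, rows = dummy_code(observed, reference="C")
print(dummy_levels)
for value, row in zip(observed, rows):
    print(value, row)
```

Observations at the reference level get 0 on every dummy, so their mean is carried by the intercept alone.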

Models with dummies

Even if models which consider only dummy variables do in practice estimate the means of various groups, the testing machinery of the regression setup can be useful for group comparisons. Dummies can be used in combination with any other dummies and quantitative X's to construct models with first-order effects (or main effects) and interactions to test hypotheses of interest.

In order to define dummies in SPSS see “Computing dummy vars in SPSS.ppt”.

Example: executive salaries
A management consulting firm has developed a regression model in order to analyze executives' salary structure.
- Y = Annual salary (in dollars)
- $x_1$ = Years of experience
- $x_2$ = Years of education
- $x_3$ = Gender: 1 if male; 0 if female
- $x_4$ = Number of employees supervised
- $x_5$ = Corporate assets (in millions of dollars)

Data: ExecSal.sav

A simple model: $E(Y) = \beta_0 + \beta_3 x_3$

This model estimates the means of the two groups, male and female (M, F). We want to test whether the difference in means is significant, i.e. not due to chance.

Regression output: the salary difference between groups is significant.

[Output annotations: mean increment for Male; C.I. for the mean increment.]

Model 2: $E(Y) = \beta_0 + \beta_1 x_1 + \beta_3 x_3$

It seems that the two groups are separated. Model 2 considers the same slope but different intercepts:
- If $x_3 = 0$ (female) then $E(Y) = \beta_0 + \beta_1 x_1$
- If $x_3 = 1$ (male) then $E(Y) = (\beta_0 + \beta_3) + \beta_1 x_1$

Computer output for model 2
R square improved greatly. In this model the effect of experience is assumed equal for the two groups. The new intercept for Male is significant.

Model 3: $E(Y) = \beta_0 + \beta_1 x_1 + \beta_3 x_3 + \beta_4 x_1 x_3$

With this model we want to test whether gender and experience interact, i.e. whether male salaries tend to grow at a faster (or slower) rate with experience.
- If $x_3 = 0$ (female) then $E(Y) = \beta_0 + \beta_1 x_1$
- If $x_3 = 1$ (male) then $E(Y) = (\beta_0 + \beta_3) + (\beta_1 + \beta_4) x_1$, where $\beta_0 + \beta_3$ is the new intercept for male and $\beta_1 + \beta_4$ is the new slope for male

Remark: running the regression for the two groups together gives more degrees of freedom (all n observations are used) for estimating the parameters and the model variance.

Model 3 considers different slopes and different intercepts for the two groups.

Computer output for model 3
There is evidence that salaries for the two groups grow at different rates with experience.

Estimated lines:
- $\hat{Y}$ = … + … × (Years of Experience) for female
- $\hat{Y}$ = … + … × (Years of Experience) for male

A complete second order model
$E(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_1^2 + \beta_5 x_2^2$

Interpretation of model parameters:
- $\beta_0$: y-intercept, the value of $E(Y)$ when $x_1 = x_2 = 0$
- $\beta_1$ and $\beta_2$: shifts along the $x_1$ and $x_2$ axes
- $\beta_3$: rotation of the surface
- $\beta_4$ and $\beta_5$: control the rate of curvature

Back to Executive salaries
What if we suspect that the rate of growth changes with experience and has opposite signs for M and F?

$x_1$ = Years of experience, $x_3$ = Gender (1 if Male). Note: $x_3^2 = x_3$ since it is a dummy.

Model 4: $E(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_3 + \beta_3 x_1 x_3 + \beta_4 x_1^2$

Model 5: $E(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_3 + \beta_3 x_1 x_3 + \beta_4 x_1^2 + \beta_5 x_3 x_1^2$

Comparing Model 4 and 5

Model 4:
- If $x_3 = 0$ (female) then $E(Y) = \beta_0 + \beta_1 x_1 + \beta_4 x_1^2$
- If $x_3 = 1$ (male) then $E(Y) = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) x_1 + \beta_4 x_1^2$

Different intercept and slope for M and F, but the same curvature.

Model 5:
- If $x_3 = 0$ (female) then $E(Y) = \beta_0 + \beta_1 x_1 + \beta_4 x_1^2$
- If $x_3 = 1$ (male) then $E(Y) = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) x_1 + (\beta_4 + \beta_5) x_1^2$

Different intercept, slope and curvature for M and F.

Model 5: computer output
Model summary:

| Model | R | R square | Adjusted R square | Std. error of the estimate |
|---|---|---|---|---|
| 1 | .875 | .766 | .754 | 12507.735 |

ANOVA:

| Model 1 | Sum of squares | df | Mean square | F | Sig. |
|---|---|---|---|---|---|
| Regression | 4.824E10 | 5 | 9.648E9 | 61.673 | .000 |
| Residual | 1.471E10 | 94 | 1.564E8 | | |
| Total | 6.295E10 | 99 | | | |

Predictors: (Constant), Exp2Gen, Gender, Years of Experience, ExpSqu, ExpGen. Dependent variable: Annual salary in \$.

Model 5: computer output
Coefficients (dependent variable: Annual salary in \$):

| Variable | B (unstandardized) | Std. error | Beta (standardized) | t | Sig. |
|---|---|---|---|---|---|
| (Constant) | 52391.973 | 6497.971 | | 8.063 | .000 |
| Years of Experience | 3373.970 | 1165.248 | .982 | 2.895 | .005 |
| Gender | 21122.152 | 8285.802 | .399 | 2.549 | .012 |
| ExpGen | -2081.897 | 1459.842 | -.724 | -1.426 | .157 |
| ExpSqu | -53.181 | 45.001 | -.422 | -1.182 | .240 |
| Exp2Gen | 112.836 | 54.950 | .904 | 2.053 | .043 |

Which model is preferable: Model 3 or Model 5?
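To read the Model 5 output in terms of the two groups, the fitted coefficients can be combined into one quadratic per gender. The sketch below assumes that the SPSS variable names map to model terms as ExpGen = $x_1 x_3$, ExpSqu = $x_1^2$ and Exp2Gen = $x_3 x_1^2$ (inferred from the names, not stated in the output); the dictionary keys and function names are illustrative.

```python
# Unstandardized coefficients (column B) from the Model 5 output.
b = {
    "const": 52391.973,
    "exp": 3373.970,       # x1
    "gender": 21122.152,   # x3
    "exp_gen": -2081.897,  # x1 * x3 (assumed mapping)
    "exp_sq": -53.181,     # x1^2 (assumed mapping)
    "exp2_gen": 112.836,   # x3 * x1^2 (assumed mapping)
}


def group_curve(b, x3):
    """Intercept, slope and curvature of the fitted quadratic for x3 = 0 or 1."""
    return (
        b["const"] + b["gender"] * x3,
        b["exp"] + b["exp_gen"] * x3,
        b["exp_sq"] + b["exp2_gen"] * x3,
    )


def predict(b, x1, x3):
    """Fitted mean salary at x1 years of experience for the given gender code."""
    c0, c1, c2 = group_curve(b, x3)
    return c0 + c1 * x1 + c2 * x1 ** 2


female = group_curve(b, 0)
male = group_curve(b, 1)
print("female (intercept, slope, curvature):", [round(c, 3) for c in female])
print("male   (intercept, slope, curvature):", [round(c, 3) for c in male])
```

The estimated curvature is negative for the female group and positive for the male group, which is the "opposite signs" pattern the model was built to allow; whether the extra terms are worth keeping is a separate question, addressed by the nested-model F-test.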

A test for comparing nested models
Two models are nested if one model contains all the terms of the other model and at least one additional term. The more complex of the two models is called the complete (or full) model. The other is called the reduced (or restricted) model.

Example: Model 1 is nested in Model 2.

Model 1: $E(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$

Model 2: $E(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_1^2 + \beta_5 x_2^2$

To compare the two models we are interested in testing $H_0: \beta_4 = \beta_5 = 0$ vs. $H_1$: at least one of $\beta_4$, $\beta_5$ differs from 0.

F-test for comparing nested models
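Reduced model: $E(Y) = \beta_0 + \beta_1 x_1 + \dots + \beta_g x_g$

Complete model: $E(Y) = \beta_0 + \beta_1 x_1 + \dots + \beta_g x_g + \beta_{g+1} x_{g+1} + \dots + \beta_k x_k$

To test $H_0: \beta_{g+1} = \dots = \beta_k = 0$ against $H_1$: at least one of the parameters being tested is not 0, compute

$$F = \frac{(SSE_R - SSE_C)/(k - g)}{SSE_C/[n - (k+1)]} = \frac{(SSE_R - SSE_C)/(k - g)}{MSE_C}$$

Reject $H_0$ when $F > F_\alpha$, where $F_\alpha$ is the level $\alpha$ critical point of an F distribution with $(k - g,\ n - (k+1))$ d.f.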

F-test for nested models
Where:
- $SSE_R$ = sum of squared errors for the reduced model
- $SSE_C$ = sum of squared errors for the complete model
- $MSE_C$ = mean square error for the complete model

Remark:
- $k - g$ = number of parameters tested
- $k + 1$ = number of parameters in the complete model
- $n$ = total sample size

Compute partial F-tests with SPSS
1. Enter your complete model in the Regression dialog box and choose the Method “Enter”.
2. Click on “Next”.
3. In the new box for Independent variables, enter those you want to remove (i.e. those you would like to test) and choose the Method “Remove”.
4. In the “Statistics” option select “R squared change”.
5. Click OK.

Applying the F-test

Let us use the F-test to compare Model 3 and Model 5 in the executive salaries example. Note that Model 3 is nested in Model 5.

Model 3: $E(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_3 + \beta_3 x_1 x_3$

Model 5: $E(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_3 + \beta_3 x_1 x_3 + \beta_4 x_1^2 + \beta_5 x_3 x_1^2$

Apply the F-test for $H_0: \beta_4 = \beta_5 = 0$.

Computer output

Variables entered/removed (dependent variable: Annual salary in \$):

| Model | Variables entered | Variables removed | Method |
|---|---|---|---|
| 1 | Exp2Gen, Gender, Years of Experience, ExpSqu, ExpGen (all requested variables entered) | . | Enter |
| 2 | . | Exp2Gen, ExpSqu (all requested variables removed) | Remove |

Model summary with change statistics:

| Model | R | R square | Adjusted R square | Std. error of the estimate | R square change | F change | df1 | df2 | Sig. F change |
|---|---|---|---|---|---|---|---|---|---|
| 1 | .875 | .766 | .754 | 12507.735 | | 61.673 | 5 | 94 | .000 |
| 2 | .868 | | .746 | 12700.080 | -.012 | 2.488 | | | .089 |

Model 1 predictors: (Constant), Exp2Gen, Gender, Years of Experience, ExpSqu, ExpGen. Model 2 predictors: (Constant), Gender, Years of Experience, ExpGen.

For Model 2, F change (2.488) is the F-statistic of the nested-model test and Sig. F change (.089) is its p-value. Decision: do NOT reject $H_0: \beta_4 = \beta_5 = 0$, i.e. Model 3 is preferred.
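The F change reported by SPSS can be reproduced by hand from the two standard errors of the estimate, since $s^2 = SSE/(\text{residual d.f.})$. The sketch below is a check under stated assumptions: n = 100 is inferred from the total d.f. of 99 in the ANOVA table, and the p-value uses a closed form for the F distribution that holds only when the numerator has exactly 2 d.f., $P(F > f) = (1 + 2f/\mathrm{df}_2)^{-\mathrm{df}_2/2}$. Variable names are illustrative.

```python
n = 100      # inferred from total df = n - 1 = 99
k, g = 5, 3  # number of slopes in the complete (Model 5) and reduced (Model 3) models

s_complete = 12507.735  # std. error of the estimate, Model 5
s_reduced = 12700.080   # std. error of the estimate, Model 3

# Recover each SSE from s^2 = SSE / (n - (number of slopes + 1)).
sse_c = s_complete ** 2 * (n - (k + 1))
sse_r = s_reduced ** 2 * (n - (g + 1))

df1 = k - g        # number of parameters tested
df2 = n - (k + 1)  # residual df of the complete model
f_stat = ((sse_r - sse_c) / df1) / (sse_c / df2)

# Closed-form upper tail of F(2, df2); valid only because df1 == 2 here.
assert df1 == 2
p_value = (df2 / (df2 + 2 * f_stat)) ** (df2 / 2)

print(f"F = {f_stat:.3f} on ({df1}, {df2}) d.f., p-value = {p_value:.3f}")
```

Both numbers agree with the F change and Sig. F change columns, so at the 5% level the quadratic terms are not needed.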

A quadratic model example: Shipping costs
Although a regional delivery service bases the charge for shipping a package on the package weight and distance shipped, its profit per package depends on the package size (volume of space it occupies) and the size and nature of the delivery truck. The company conducted a study to investigate the relationship between the cost of shipment and the variables that control the shipping charge: weight and distance.
- Y: cost of shipment in dollars
- $x_1$: package weight in pounds
- $x_2$: distance shipped in miles

It is suspected that nonlinear effects may be present.

Model: $E(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_1^2 + \beta_5 x_2^2$

Data: Express.sav

Scatter plots

Scatter plots in multiple regression often do not show much information.

Model: $E(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_1^2 + \beta_5 x_2^2$

The Distance squared term is not significant; try eliminating it.

Model: $E(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_1^2$

Applying the F-test: Shipping costs
A company conducted a study to investigate the relationship between the cost of shipment and the variables that control the shipping charge: weight and distance.
- Y: cost of shipment in dollars
- $x_1$: package weight in pounds
- $x_2$: distance shipped in miles

It is suspected that nonlinear effects may be present. Use the F-test for nested models to decide between:

Model 1: $E(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_1^2 + \beta_5 x_2^2$

Model 2: $E(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$

Data: Express.sav

[Output: ANOVA tables for the full model and for the reduced model.]

F-statistic

To test $H_0: \beta_4 = \beta_5 = 0$, from the ANOVA tables we have $F = 9.92$. The critical value $F_\alpha$ (at the 5% level) for an F-distribution with 2 and 14 d.f. is 3.74. Since $F = 9.92 > F_\alpha = 3.74$, the null hypothesis is rejected at the 5% significance level, i.e. the model with quadratic terms is preferred over the reduced one.
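The tabulated critical value can be checked without statistical tables in this particular case. The sketch relies on the fact that, when the numerator has exactly 2 d.f., the upper tail of the F distribution is $P(F > f) = (1 + 2f/\mathrm{df}_2)^{-\mathrm{df}_2/2}$, which can be inverted for the critical value; it does not apply to other numerator d.f. The function name is illustrative.

```python
def f_crit_two_numerator_df(alpha, df2):
    """Level-alpha critical value of F(2, df2), from P(F > f) = (1 + 2f/df2)^(-df2/2)."""
    return (df2 / 2) * (alpha ** (-2 / df2) - 1)


f_obs = 9.92  # F-statistic from the slide
f_crit = f_crit_two_numerator_df(0.05, 14)

print(f"critical value = {f_crit:.2f}")
print("reject H0" if f_obs > f_crit else "do not reject H0")
```

The computed critical value rounds to the 3.74 quoted on the slide, and the observed F lies well above it.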

Computer output: the F-statistic and its p-value lead to rejecting $H_0: \beta_4 = \beta_5 = 0$.

Executive salaries: a final model (?)
- Y = Annual salary (in dollars)
- $x_1$ = Years of experience
- $x_2$ = Years of education
- $x_3$ = Gender: 1 if male; 0 if female
- $x_4$ = Number of employees supervised
- $x_5$ = Corporate assets (in millions of dollars)

Try adding other variables to Model 3.

Model 6: $E(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_3 + \beta_5 x_4 + \beta_6 x_5$

Computer Output: Model 6
Model summary:

| Model | R | R square | Adjusted R square | Std. error of the estimate |
|---|---|---|---|---|
| 1 | .963 | .927 | .922 | 7020.089 |

ANOVA:

| Model 1 | Sum of squares | df | Mean square | F | Sig. |
|---|---|---|---|---|---|
| Regression | 5.836E10 | 6 | 9.727E9 | 197.384 | .000 |
| Residual | 4.583E9 | 93 | 4.928E7 | | |
| Total | 6.295E10 | 99 | | | |

Predictors: (Constant), Corporate assets (in million \$), Years of Experience, Years of Education, Gender, Number of Employees supervised, ExpGender

Computer Output: Model 6
Coefficients (dependent variable: Annual salary in \$):

| Variable | B (unstandardized) | Std. error | Beta (standardized) | t | Sig. |
|---|---|---|---|---|---|
| (Constant) | -38331.331 | 9533.238 | | -4.021 | .000 |
| Years of Experience | 2178.964 | 171.979 | .634 | 12.670 | |
| Gender | 13203.101 | 3137.775 | .249 | 4.208 | |
| ExpGender | 669.546 | 209.042 | .233 | 3.203 | .002 |
| Years of Education | 2689.594 | 311.914 | .246 | 8.623 | |
| Number of Employees supervised | 53.239 | 4.470 | .353 | 11.910 | |
| Corporate assets (in million \$) | 180.310 | 46.600 | .110 | 3.869 | |

Executive salaries: comparison of models
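| Model | Predictors | Adj. $R^2$ | Standard error | F-stat |
|---|---|---|---|---|
| 1 | $x_1, x_2, x_4, x_5$ | 0.747 | 12685.309 | 74.05 |
| 2 | $x_1, x_3$ | 0.735 | | 138.26 |
| 3 | $x_1, x_3, x_1 x_3$ | 0.746 | 12700.080 | 98.09 |
| 6 | $x_1, x_2, x_3, x_1 x_3, x_4, x_5$ | 0.922 | 7020.089 | 197.38 |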