# Class 18 – Thursday, Nov. 11 Omitted Variables Bias

Class 18 – Thursday, Nov. 11
- Omitted Variables Bias
- Specially Constructed Explanatory Variables
  - Interactions
  - Squared Terms for Curvature
  - Dummy variables for categorical variables (next class)

I will send you Homework 7 after class. It will be due next Thursday.

California Test Score Data
The California Standardized Testing and Reporting (STAR) data set, californiastar.JMP, contains data on test performance, school characteristics and student demographic backgrounds. Average Test Score is the average of the reading and math scores on a standardized test administered to 5th grade students. One interesting question: what would be the causal effect of decreasing the student-teacher ratio by one student per teacher?

Multiple Regression and Causal Inference
Goal: Figure out what the causal effect on average test score would be of decreasing student-teacher ratio and keeping everything else in the world fixed. Lurking variable: A variable that is associated with both average test score and student-teacher ratio. In order to figure out whether a drop in student-teacher ratio causes higher test scores, we want to compare mean test scores among schools with different student-teacher ratios but the same values of the lurking variables. If we include all of the lurking variables in the multiple regression model, the coefficient on student-teacher ratio represents the change in the mean of test scores that is caused by a one unit increase in student-teacher ratio.

Omitted Variables Bias
Schools with many English learners tend to have worse resources. The multiple regression that shows how mean test score changes when student-teacher ratio changes but percent of English learners is held fixed gives a better idea of the causal effect of the student-teacher ratio than the simple linear regression that does not hold percent of English learners fixed. Omitted variables bias from omitting percentage of English learners (simple regression slope minus multiple regression slope) = -2.28 - (-1.10) = -1.18.

Omitted Variables Bias: General Formula
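What happens if we omit a lurking variable from the regression? Suppose we are interested in the causal effect of $x_1$ on $y$ and believe that there are lurking variables $x_2, \ldots, x_p$ and that $\beta_1$ in

$$E(y \mid x_1, \ldots, x_p) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$$

is the causal effect of $x_1$ on $y$. If we omit the lurking variable $x_p$, then the multiple regression will be estimating the coefficient $\beta_1^*$ in

$$E(y \mid x_1, \ldots, x_{p-1}) = \beta_0^* + \beta_1^* x_1 + \cdots + \beta_{p-1}^* x_{p-1}$$

as the coefficient on $x_1$. How different are $\beta_1^*$ and $\beta_1$?

Omitted Variables Bias Formula
Suppose that

$$E(x_p \mid x_1, \ldots, x_{p-1}) = \delta_0 + \delta_1 x_1 + \cdots + \delta_{p-1} x_{p-1}.$$

Then

$$\beta_1^* = \beta_1 + \beta_p \delta_1.$$

The formula tells us about the direction and magnitude of the bias from omitting a variable in estimating a causal effect. The formula also applies to least squares estimates, i.e.,

$$\hat{\beta}_1^* = \hat{\beta}_1 + \hat{\beta}_p \hat{\delta}_1.$$

Key point: In order for there to be omitted variable bias, the omitted variable must be associated with both the explanatory variable of interest and the response.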
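The least squares version of the formula can be checked numerically. The sketch below uses simulated data with made-up coefficients (not the California data) and one lurking variable, `x2`. It confirms that the slope on `x1` from the regression omitting `x2` equals the slope from the full regression plus the coefficient on `x2` times the slope from regressing `x2` on `x1`. The helper `ols` and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Hypothetical data-generating process (all numbers are made up):
# x2 is a lurking variable associated with both x1 and y.
x2 = rng.normal(size=n)
x1 = 0.8 * x2 + rng.normal(size=n)
y = 1.0 - 1.0 * x1 - 2.0 * x2 + rng.normal(size=n)


def ols(columns, response):
    """Least squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(response)), *columns])
    coef, *_ = np.linalg.lstsq(X, response, rcond=None)
    return coef


b_long = ols([x1, x2], y)   # regression that includes the lurking variable
b_short = ols([x1], y)      # regression that omits it
d = ols([x1], x2)           # regression of the omitted variable on x1

bias = b_short[1] - b_long[1]        # omitted variables bias in the slope on x1
formula_bias = b_long[2] * d[1]      # (coefficient on x2) * (slope of x2 on x1)

print(f"slope on x1, x2 included: {b_long[1]:.3f}")
print(f"slope on x1, x2 omitted:  {b_short[1]:.3f}")
print(f"bias: {bias:.3f}, formula: {formula_bias:.3f}")
```

In this simulation the omitted variable lowers `y` and is positively associated with `x1`, so the bias is negative: the slope from the regression that omits `x2` is too low.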

Omitted Variables Bias Examples
Would you expect the slope coefficient on X to be too high, too low or have no bias for the regression that omits the given variable?
- Y = Test Score, X = Number of Music Classes Taken, Omitted Variable = Student Ability
- Y = Salary, X = Gender (1=Female, 0=Male), Omitted Variable = Education

Even if we have included many lurking variables in the multiple regression, we may have failed to include one or not have enough data to include one. There will then be omitted variables bias. The best way to study causal effects is to do a randomized experiment (coming up next week).

Specially Constructed Explanatory Variables
- Interaction variables
- Squared and higher polynomial terms for curvature
- Dummy variables for categorical variables

Interaction
Interaction is a three-variable concept. One of these is the response variable (Y) and the other two are explanatory variables (X1 and X2). There is an interaction between X1 and X2 if the impact of an increase in X2 on Y depends on the level of X1. To incorporate interaction in the multiple regression model, we add the explanatory variable $X_1 \times X_2$. There is evidence of an interaction if the coefficient on $X_1 \times X_2$ is significant (t-test has p-value < .05).
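As a rough illustration outside JMP, an interaction can be fitted by adding the product of the two explanatory variables as an extra column in the regression. The sketch below uses simulated data with made-up coefficients (not accidents.JMP). The variable names are borrowed from the accident example only for readability, and the standard errors come from the usual least squares formula.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400

# Hypothetical data: the effect of speed on accidents grows with the number of cars.
cars = rng.uniform(5, 15, size=n)
speed = rng.uniform(50, 70, size=n)
accidents = (
    2.0 + 0.5 * cars + 0.1 * speed + 0.05 * cars * speed + rng.normal(size=n)
)

# Design matrix: intercept, cars, speed, and the interaction column cars * speed.
X = np.column_stack([np.ones(n), cars, speed, cars * speed])
coef, *_ = np.linalg.lstsq(X, accidents, rcond=None)

# Usual least squares standard errors and the t-statistic for the interaction.
resid = accidents - X @ coef
sigma2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t_interaction = coef[3] / se[3]


def speed_slope(cars_level):
    """Estimated change in mean accidents per unit of speed at a given cars level."""
    return coef[2] + coef[3] * cars_level


print(f"t-statistic for interaction: {t_interaction:.1f}")
print(f"speed slope at cars=7:    {speed_slope(7):.3f}")
print(f"speed slope at cars=12.6: {speed_slope(12.6):.3f}")
```

With an interaction in the model, the slope for one variable is no longer a single number: it is the main-effect coefficient plus the interaction coefficient times the level of the other variable.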

Interaction variables in JMP
To add an interaction variable in Fit Model in JMP, add the usual explanatory variables first, then highlight X1 in the Select Columns box and X2 in the Construct Model Effects box. Then click Cross in the Construct Model Effects box. JMP creates the crossed explanatory variable for X1 and X2.

Interaction Example
The number of car accidents on a stretch of highway seems to be related to the number of vehicles that travel over it and the speed at which they are traveling. A city alderman has decided to ask the county sheriff to provide her with statistics covering the last few years, with the intention of examining these data statistically so that she can introduce new speed laws that will reduce traffic accidents. The file accidents.JMP contains data for different time periods on the number of cars passing along the stretch of road, the average speed of the cars and the number of accidents during the time period.

Interactions in Accident Data
Increases in speed have a worse impact on number of accidents when there are a large number of cars on the road than when there are a small number of cars on the road.

Notes on Interactions
The need for interactions is not easily spotted with residual plots. It is best to try including an interaction term and see if it is significant. To understand the multiple regression relationship better when there is an interaction, it is useful to make an Interaction Plot. After Fit Model, click the red triangle next to Response, click Factor Profiling and then click Interaction Plots.

The plot on the left displays E(Accidents | Cars, Speed=56.6) and E(Accidents | Cars, Speed=62.5) as functions of Cars. The plot on the right displays E(Accidents | Cars=12.6, Speed) and E(Accidents | Cars=7, Speed) as functions of Speed. We can see that the impact of speed on Accidents depends critically on the number of cars on the road.

Fast Food Locations
An analyst working for a fast food chain is asked to construct a multiple regression model to identify new locations that are likely to be profitable. For a sample of 25 locations, the analyst has the annual gross revenue of the restaurant (y), the mean annual household income and the mean age of children in the area. The data are in fastfoodchain.jmp.

[Figure: JMP Multivariate output for Revenue, Income and Age, showing the correlation table (surviving entries: 1.0000, 0.4355, 0.3769, 0.0201) and the scatterplot matrix.]

Squared Terms for Curvature
To capture a quadratic relationship between X1 and Y, we add $X_1^2$ as an explanatory variable. To do this in JMP, add X1 to the model, then highlight X1 in the Select Columns box, highlight X1 in the Construct Model Effects box, and click Cross.
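The same idea can be sketched outside JMP by adding a squared column to the design matrix. The example below uses simulated data with made-up coefficients. It also fits a version where the variable is centered at its mean before squaring (a choice some software makes by default, treated here as an assumption) and shows that centering changes the linear coefficient but not the squared-term coefficient or the fitted values.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

# Hypothetical curved relationship between x1 and y.
x1 = rng.uniform(0, 10, size=n)
y = 3.0 + 4.0 * x1 - 0.3 * x1**2 + rng.normal(size=n)


def fit(X, response):
    """Return least squares coefficients and the residual sum of squares."""
    coef, *_ = np.linalg.lstsq(X, response, rcond=None)
    resid = response - X @ coef
    return coef, resid @ resid


ones = np.ones(n)
coef_lin, rss_lin = fit(np.column_stack([ones, x1]), y)

# Raw squared term.
X_raw = np.column_stack([ones, x1, x1**2])
coef_raw, rss_raw = fit(X_raw, y)

# Squared term centered at the mean of x1.
X_cen = np.column_stack([ones, x1, (x1 - x1.mean()) ** 2])
coef_cen, rss_cen = fit(X_cen, y)

fitted_raw = X_raw @ coef_raw
fitted_cen = X_cen @ coef_cen

print(f"RSS, linear only:      {rss_lin:.1f}")
print(f"RSS, with squared term: {rss_raw:.1f}")
print(f"squared-term coefficient (raw, centered): {coef_raw[2]:.3f}, {coef_cen[2]:.3f}")
```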

Notes on Squared Terms for Curvature
If the t-test for the squared term has p-value < .05, indicating that there is curvature, then we keep the linear term in the model regardless of its p-value.

Coefficients in a model with squared terms for curvature are tricky to interpret. If we have explanatory variables $X_1$ and $X_1^2$ in the model, then we can't keep $X_1$ fixed and change $X_1^2$.

As with interactions, to better understand the multiple regression relationship when there is a squared term for curvature, a plot is useful. After Fit Model, click the red triangle next to Response, click Factor Profiling and click Profiler. JMP shows a plot for each explanatory variable of how the mean of Y changes as the explanatory variable is increased and the other explanatory variables are held fixed at their mean value.

The left-hand plot shows Mean Revenue for different levels of Income when Age is held fixed at its mean value. The interval shown with the predicted value is a confidence interval for the mean response at Income=24.2, Age=8.392.

Regression Model for Fast Food Chain Data
Interactions and polynomial terms can be combined in a multiple regression model. There is strong evidence of a quadratic relationship between revenue and age and between revenue and income. There is moderate evidence of an interaction between age and income.
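A model that combines squared terms and an interaction, together with a profiler-style trace, can be sketched as follows. The data are simulated with made-up coefficients and are not the values in fastfoodchain.jmp. The helper `design` and the grid of income values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 250

# Synthetic stand-in for the fast food data (NOT the values in fastfoodchain.jmp).
income = rng.uniform(15, 35, size=n)
age = rng.uniform(2.5, 15, size=n)
revenue = (
    1000
    - 1.5 * (income - 25) ** 2
    - 3.0 * (age - 9) ** 2
    + 0.5 * (income - 25) * (age - 9)
    + rng.normal(scale=20, size=n)
)


def design(income, age):
    """Columns: intercept, income, age, income^2, age^2, income*age."""
    income = np.asarray(income, dtype=float)
    age = np.asarray(age, dtype=float)
    return np.column_stack(
        [np.ones_like(income), income, age, income**2, age**2, income * age]
    )


coef, *_ = np.linalg.lstsq(design(income, age), revenue, rcond=None)

# Profiler-style trace: vary income while age is held fixed at its sample mean.
income_grid = np.linspace(15, 35, 201)
age_fixed = np.full_like(income_grid, age.mean())
profile = design(income_grid, age_fixed) @ coef
peak_income = income_grid[np.argmax(profile)]

print(f"coefficients: {np.round(coef, 3)}")
print(f"estimated mean revenue peaks near income = {peak_income:.1f}")
```

Reading the fitted curve off a trace like this is usually easier than interpreting the individual coefficients, since the linear, squared and interaction terms for a variable all change together.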