2 Lecture Objectives
You should be able to:
- Convert categorical variables into dummies.
- Identify and eliminate multicollinearity.
- Use interaction terms and interpret their coefficients.
- Identify heteroscedasticity.
3 I. Using Categorical Data: Dummy Variables
Consider insurance company data on accidents and their relationship to the age of the driver and the color of the car driven. See spreadsheet for complete data.

Obs  Accidents per 10000 Drivers  Age (X1)  Car Color (X2)
1    89                           17        Red
2    70                                     Black
3    75                                     Blue
4    85                           18
5    74
6    76
7    90                           19
8    78
9
10   80                           20
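4 Coding a Categorical Variable
Columns: Accidents per 10000 Drivers (Y), Age (X1), Car Color (X2) under an Original Coding and an Alternate Coding. Car Color is entered as a single numeric code; in the first row (89 accidents, age 17) the color is coded 1 under the original coding and 3 under the alternate coding.
This is the incorrect way. Output from the two ways of coding gives inconsistent results.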
5 Original Coding: Partial Output and Forecasts
SUMMARY OUTPUT
Regression Statistics: Multiple R, R Square, Adjusted R Square, Standard Error, Observations = 27
Coefficients: Intercept, Age, Color
RESIDUAL OUTPUT (Observation, Predicted Accidents, Residuals), observations 1 to 10. Values shown: observation 1: 0.6444; 4: 0.0389; 6: 1.0389; 7: 8.4333; 8: 1.4333; 10: 1.8278.
6 Modified Coding: Partial Output and Forecasts
SUMMARY OUTPUT
Regression Statistics: Multiple R, R Square, Adjusted R Square, Standard Error, Observations = 27
Coefficients: Intercept, Age, Color 6.6667
RESIDUAL OUTPUT (Observation, Predicted Accidents, Residuals), observations 1 to 10. Values shown: observation 5: 0.7056; 7: 6.7667; 8: 8.1000; 10: 0.1611.
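The inconsistency between the two partial outputs can be reproduced with a small sketch. The numbers below are hypothetical (they are not the lecture's spreadsheet), and the two integer mappings are assumptions chosen only to show that relabeling the colors changes the forecasts when color is entered as one numeric column.

```python
import numpy as np

# Hypothetical data, NOT the lecture's spreadsheet.
accidents = np.array([89, 70, 75, 85, 74, 76, 90, 78, 72], dtype=float)
age = np.array([17, 17, 17, 18, 18, 18, 19, 19, 19], dtype=float)
color = ["Red", "Black", "Blue"] * 3

def fit_predict(codes):
    """Fit accidents ~ intercept + age + integer color code; return fitted values."""
    X = np.column_stack([np.ones_like(age), age, codes])
    beta, *_ = np.linalg.lstsq(X, accidents, rcond=None)
    return X @ beta

# Two arbitrary integer codings of the same three colors (assumed mappings).
original = {"Red": 1, "Black": 2, "Blue": 3}
alternate = {"Red": 3, "Black": 1, "Blue": 2}

pred_original = fit_predict(np.array([original[c] for c in color], dtype=float))
pred_alternate = fit_predict(np.array([alternate[c] for c in color], dtype=float))

# The forecasts depend on which arbitrary numbers were assigned to the colors.
print(np.round(pred_original, 2))
print(np.round(pred_alternate, 2))
```

A single numeric code forces the colors onto an evenly spaced scale in whatever order the codes imply, which is why the fitted values move when the labels are shuffled.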
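7 Coding with Dummies
Columns: Accidents per 10000 Drivers (Y), Age (X1), and two dummy variables (X2, X3) in place of Car Color.
Original dummies: D1 = Red, D2 = Black.
Alternately coded dummies: D1 = Black, D2 = Blue.
The color without a dummy (Blue in the original coding, Red in the alternate coding) serves as the baseline.
This is the correct way. Output from either way of coding gives the same forecasts.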
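As a minimal sketch of the two dummy codings named on this slide, the helper below builds one 0/1 column per listed color. The function name `make_dummies` and the six-row color list are hypothetical, introduced only for illustration.

```python
# Hypothetical color column (two drivers' ages worth of Red/Black/Blue cars).
colors = ["Red", "Black", "Blue", "Red", "Black", "Blue"]

def make_dummies(values, levels):
    """Return one 0/1 list per level in `levels`; any level left out is the baseline."""
    return {f"D_{level}": [1 if v == level else 0 for v in values] for level in levels}

# Original coding on the slide: D1 = Red, D2 = Black (Blue is the baseline).
original = make_dummies(colors, ["Red", "Black"])
# Alternate coding on the slide: D1 = Black, D2 = Blue (Red is the baseline).
alternate = make_dummies(colors, ["Black", "Blue"])

for name, column in original.items():
    print("original ", name, column)
for name, column in alternate.items():
    print("alternate", name, column)
```

With three colors only two dummies are needed; a row with zeros in both columns belongs to the baseline color.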
8 Original Dummy Coding
SUMMARY OUTPUT
Regression Statistics: Multiple R, R Square, Adjusted R Square, Standard Error, Observations = 27
Coefficients: Intercept, Age, D1 Red, D2 Black
RESIDUAL OUTPUT (Observation, Predicted Accidents, Residuals), observations 1 to 10. Values shown: observation 7: 5.6556; 8: 6.9889.
9 Modified Dummy Coding
SUMMARY OUTPUT
Regression Statistics: Multiple R, R Square, Adjusted R Square, Standard Error, Observations = 27
Coefficients: Intercept, Age, D1 Black, D2 Blue
RESIDUAL OUTPUT (Observation, Predicted Accidents, Residuals), observations 1 to 10. Values shown: observation 7: 5.6556; 8: 6.9889.
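The claim that either dummy coding gives the same forecasts can be checked with a short sketch. The data are hypothetical (not the lecture's spreadsheet), and the helper `fitted` is an illustrative name; only the choice of baseline color differs between the two fits.

```python
import numpy as np

# Hypothetical data, NOT the lecture's spreadsheet.
accidents = np.array([89, 70, 75, 85, 74, 76, 90, 78, 72], dtype=float)
age = np.array([17, 17, 17, 18, 18, 18, 19, 19, 19], dtype=float)
color = np.array(["Red", "Black", "Blue"] * 3)

def fitted(dummy_levels):
    """Regress accidents on intercept, age, and one 0/1 dummy per listed color."""
    dummies = [(color == level).astype(float) for level in dummy_levels]
    X = np.column_stack([np.ones_like(age), age] + dummies)
    beta, *_ = np.linalg.lstsq(X, accidents, rcond=None)
    return beta, X @ beta

beta_a, pred_a = fitted(["Red", "Black"])   # Blue is the baseline
beta_b, pred_b = fitted(["Black", "Blue"])  # Red is the baseline

print(np.round(pred_a, 3))
print(np.round(pred_b, 3))
print("age coefficient:", round(beta_a[1], 3), round(beta_b[1], 3))
```

The intercept and dummy coefficients change with the baseline, because each dummy coefficient measures the difference from the baseline color, but the forecasts and the age coefficient do not.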
10 II. Multicollinearity
We wish to forecast the height of a person based on the length of his/her feet. Consider data as shown:

Height  Right  Left
77.31   11.59  11.54
67.58   9.57   9.63
70.40   8.97   8.98
64.84   9.39   9.46
77.03   12.05  12.03
79.66   11.39  11.41
72.37   10.55  10.61
73.18   10.31  10.33
77.60   11.81
71.40   9.92   9.88
11 Regression with Right Foot
SUMMARY OUTPUT
Regression Statistics: Multiple R, R Square, Adjusted R Square, Standard Error, Observations = 105
Coefficients: Intercept, Right
As right foot length increases by an inch, height increases on average by 3.99 inches.
12 Regression with Left Foot
SUMMARY OUTPUT
Regression Statistics: Multiple R, R Square, Adjusted R Square, Standard Error, Observations = 105
Coefficients: Intercept, Left
As left foot length increases by an inch, height increases on average by 3.99 inches.
13 Regression with Both
SUMMARY OUTPUT
Regression Statistics: Multiple R, R Square, Adjusted R Square, Standard Error, Observations = 105
Coefficients: Intercept, Right, Left
As right foot length increases by an inch, height increases on average by 8.52 inches (assuming left foot length is held constant!), while each additional inch of left foot length makes a person shorter by 4.55 inches!!
14 The Reason? Multicollinearity.

            Height (y)  Right (X1)  Left (X2)
Height (y)  1.0000
Right (X1)  0.9031      1.0000
Left (X2)   0.9003      0.9990      1.0000

While both feet (Xs) are correlated with height (y), they are also highly correlated with each other (0.999). In other words, the second foot adds no extra information to the prediction of y. One of the two Xs is sufficient.
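A rough sketch of the same diagnosis in code: compute the correlation matrix, then compare the coefficients from the one-foot and two-foot regressions. It uses only the nine complete rows shown on the data slide (the row with no left-foot value is left out), so its numbers will not match the lecture's 105-observation output. The helper name `ols` is hypothetical.

```python
import numpy as np

# The nine complete rows from the data slide (subset of the 105 observations).
height = np.array([77.31, 67.58, 70.40, 64.84, 77.03, 79.66, 72.37, 73.18, 71.40])
right = np.array([11.59, 9.57, 8.97, 9.39, 12.05, 11.39, 10.55, 10.31, 9.92])
left = np.array([11.54, 9.63, 8.98, 9.46, 12.03, 11.41, 10.61, 10.33, 9.88])

# Rows/columns are ordered height, right, left.
corr = np.corrcoef([height, right, left])
print(np.round(corr, 4))

def ols(*predictors):
    """Least-squares coefficients for height ~ intercept + predictors."""
    X = np.column_stack([np.ones_like(height), *predictors])
    beta, *_ = np.linalg.lstsq(X, height, rcond=None)
    return beta

print("right only:", np.round(ols(right), 2))
print("left only: ", np.round(ols(left), 2))
print("both feet: ", np.round(ols(right, left), 2))
```

When two predictors are this highly correlated, the individual slopes in the joint regression can swing widely or flip sign even though each predictor looks sensible on its own; dropping one of them is the remedy the slide suggests.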
15 III. Interaction Effects
Scores on test of reflexes. Do reflexes slow down with age? Are there gender differences? A portion of the data is shown here.

Obs  Score (Y)  Age (X1)  Gender (X2)
1    80         25
2    82         28
3    75         32
4    70         33
5    65         35
6    60         43
7    67         46
16 Scatterplots with Age, Gender
Does age seem related? How about gender?
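17 Correlation, Regression
Correlations table (Score, Age, Gender); the Gender entry shown is 0.1406.
SUMMARY OUTPUT
Regression Statistics: Multiple R, R Square, Adjusted R Square, Standard Error, Observations = 20
Coefficients table (Coefficients, Standard Error, t Stat, P-value) for Intercept, Age, Gender; the P-values shown for Intercept and Age are on the order of E-14 and E-07.
Age is related, gender is not.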
18 Interaction Term
Columns: Obs, Score (Y), Age (X1), Gender (X2), Age*Gender (X1*X2).
A 2-way interaction term is the product of the two variables.
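A minimal sketch of building the product column and fitting the regression with it. The scores are hypothetical and noise-free, generated from made-up coefficients so the fit can be checked by eye, and gender is assumed to be coded 0/1 (the slide does not state the coding).

```python
import numpy as np

# Hypothetical design: the same eight ages for each gender group.
ages = np.array([25, 28, 32, 33, 35, 43, 46, 55], dtype=float)
age = np.concatenate([ages, ages])
gender = np.concatenate([np.zeros(8), np.ones(8)])  # assumed 0/1 coding

# Noise-free hypothetical scores built from made-up coefficients.
score = 100 - 0.5 * age - 5 * gender - 0.3 * age * gender

# The 2-way interaction term is simply the product of the two columns.
interaction = age * gender

X = np.column_stack([np.ones_like(age), age, gender, interaction])
beta, *_ = np.linalg.lstsq(X, score, rcond=None)
b0, b1, b2, b3 = beta

print("coefficients:", np.round(beta, 3))
print("age slope when gender = 0:", round(b1, 3))
print("age slope when gender = 1:", round(b1 + b3, 3))
```

Because the data were generated without noise, the fit recovers the made-up coefficients; with real data the same columns would go into the regression tool in the same way.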
19 Regression with Interaction
SUMMARY OUTPUT
Regression Statistics: Multiple R, R Square, Adjusted R Square, Standard Error, Observations = 20
Coefficients table (Coefficients, Standard Error, t Stat, P-value) for Intercept, Age, Gender, Age*Gender; the Intercept row ends with 1.49E-11.
How do we interpret the coefficient for the interaction term?
20 Meaning of Interaction
X1 and X2 are said to interact with each other if the impact of X1 on y changes as the value of X2 changes.
In this example, the impact of age (X1) on reflexes (y) is different for males and females (changing values of X2). Hence age and gender are said to interact.
Explain how this is different from multicollinearity.
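One way to see this in the regression equation is to group the terms that multiply X1. The symbols b0 through b3 are notation assumed here for the fitted coefficients, and the last step assumes gender is coded 0/1, which the slides do not state.

```latex
\begin{align*}
\hat{y} &= b_0 + b_1 X_1 + b_2 X_2 + b_3 X_1 X_2 \\
        &= (b_0 + b_2 X_2) + (b_1 + b_3 X_2)\, X_1 \\[4pt]
\text{slope of } X_1 \text{ when } X_2 = 0 &: \quad b_1 \\
\text{slope of } X_1 \text{ when } X_2 = 1 &: \quad b_1 + b_3
\end{align*}
```

Under that coding, b3 is the difference between the two groups' age slopes, so an interaction coefficient near zero would mean age affects both groups about equally.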
21 IV. Heteroscedasticity
Consider the water levels in Lake Lanier. There is a trend that can be used to forecast. However, the variability around the trendline is not constant. The increase in variation makes the prediction margin of error unreliable.
22 Example 2: Income and Spending
As income grows, the ability to spend on luxury goods grows with it, and so does the variation in how much is actually spent. Once again, forecasts become less reliable due to changing variation (heteroscedasticity).
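Beyond eyeballing a scatterplot, one rough numerical check is to fit the trendline and compare the spread of the residuals at low and high incomes. This split-sample comparison is not from the lecture; it is a simplified sketch in the spirit of the Goldfeld-Quandt test, run on hypothetical data whose scatter is built to grow with income.

```python
import numpy as np

# Hypothetical data: spending scatter is proportional to income by construction.
income = np.arange(10, 210, 10, dtype=float)  # 20 values from 10 to 200, ascending
signs = np.where(np.arange(income.size) % 2 == 0, 1.0, -1.0)
spending = 0.1 * income + 0.02 * income * signs

# Fit the trendline and compute residuals around it.
slope, intercept = np.polyfit(income, spending, 1)
residuals = spending - (slope * income + intercept)

# Compare residual variance in the lower and upper halves of the income range.
half = income.size // 2
var_low = np.var(residuals[:half])
var_high = np.var(residuals[half:])
ratio = var_high / var_low

print("residual variance, low incomes: ", round(var_low, 3))
print("residual variance, high incomes:", round(var_high, 3))
print("ratio high/low:", round(ratio, 2))
```

A ratio well above 1 suggests the spread widens as income rises; a ratio near 1 is what constant variance would look like. A formal test would attach a significance level to that ratio.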
23 Solution
When heteroscedasticity is identified, data may need to be transformed (change to a log scale, for instance) to reduce its impact. The type of transformation needed depends on the data.
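A sketch of the log-scale idea on hypothetical data: the scatter is made proportional to the level of spending, which is the situation a log transform handles well. The helper `spread_ratio` is an illustrative name, and taking logs of both variables is one assumed choice among several possible transformations.

```python
import numpy as np

def spread_ratio(x, y):
    """Fit y on x by least squares (x sorted ascending) and return
    the residual variance in the upper half of x divided by the lower half."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    half = x.size // 2
    return np.var(residuals[half:]) / np.var(residuals[:half])

# Hypothetical data with multiplicative scatter (about +/- 20% of the level).
income = np.arange(10, 210, 10, dtype=float)
signs = np.where(np.arange(income.size) % 2 == 0, 1.0, -1.0)
spending = 0.1 * income * np.exp(0.2 * signs)

ratio_raw = spread_ratio(income, spending)
ratio_log = spread_ratio(np.log(income), np.log(spending))

print("variance ratio on the raw scale:", round(ratio_raw, 2))
print("variance ratio on the log scale:", round(ratio_log, 2))
```

On the raw scale the residual spread grows with income, while on the log scale it is roughly even. Real data will not behave this cleanly, and a different transformation may suit other patterns of changing variance.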