Functional Form, Scaling and Use of Dummy Variables
Copyright © 2006 Pearson Addison-Wesley. All rights reserved.

Scaling the Data
Ŷ = 40.77 + 0.1283·X
Y = consumption in $
X = income in $
Interpret the equation.

Suppose we change the units of measurement of income (X) from dollars to hundreds of dollars. We have scaled the data. The choice of scale does not affect the measurement of the underlying relationship, but it does affect the interpretation of the coefficients.

Now the equation becomes
Ŷ = 40.77 + 12.83·X*
where X* = X/100. We divided income by 100, so the coefficient on income becomes 100 times larger:
Ŷ = 40.77 + (100 × 0.1283)·(X/100)

Scaling X alone:
– changes the slope coefficient
– changes the standard error of the coefficient by the same factor
– the t-ratio is unaffected
– all other regression statistics are unchanged

Suppose we change the measurement of Y but not X. All coefficients must change in order for the equation to remain valid. E.g., if consumption is measured in cents instead of dollars:
100·Ŷ = (100 × 40.77) + (100 × 0.1283)·X
Ŷ* = 4077 + 12.83·X

Changing the scale of Y alone:
– all coefficients must change
– the standard errors of the coefficients scale accordingly
– t-ratios and R² are unchanged

If X and Y are changed by the same factor: no change in the slope coefficient, but the estimated intercept will change. t-ratios and R² are unaffected.
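
A quick numerical check of these scaling rules (a minimal sketch with simulated consumption/income data and statsmodels; the simulated coefficients mirror the example above, but everything here is illustrative):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
income = rng.uniform(10_000, 80_000, size=200)                  # X in dollars
consumption = 40.77 + 0.1283 * income + rng.normal(0, 500, 200)

# Original scale: X in dollars
fit1 = sm.OLS(consumption, sm.add_constant(income)).fit()
# Rescaled: X in hundreds of dollars
fit2 = sm.OLS(consumption, sm.add_constant(income / 100)).fit()

print(fit1.params[1], fit2.params[1])    # slope: b vs. 100*b
print(fit1.tvalues[1], fit2.tvalues[1])  # t-ratios: identical
print(fit1.rsquared, fit2.rsquared)      # R-squared: identical
```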

Consider the following regressions:
yi = β0 + β1·xi + εi
Yi = β0* + β1*·Xi + εi
yi is measured in inches; Yi is measured in feet (1 ft = 12 inches).
xi is measured in cm; Xi is measured in inches (1 inch = 2.54 cm).

If the estimated β0 = 10, what is the estimated β0*? If the estimated β1* = 22, what is the estimated β1?
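
One way to work these out (my arithmetic, not shown on the slide): converting y from inches to feet divides every coefficient by 12, so β0* = 10/12 ≈ 0.83. For the slope, y = β0 + β1·x implies Y = y/12 = β0/12 + (2.54·β1/12)·X, since x = 2.54·X; so β1 = 12·β1*/2.54 = 12 × 22/2.54 ≈ 103.9.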

Dummy Variables
Used to capture qualitative explanatory variables, or any characteristic that has only two possible outcomes, e.g. race, gender, geographic region of residence, etc.

Use of Intercept Dummy
The most common use of dummy variables: it modifies the intercept parameter of the regression model. E.g., let's test the "location, location, location" model of real estate. Suppose we take into account location near, say, a university or a golf course.

Pt = β0 + β1·St + β2·Dt + εt
St = square footage
Dt = dummy variable representing whether the characteristic is present:
Dt = 1 if the property is in a desirable neighborhood
Dt = 0 if not in a desirable neighborhood

The effect of the dummy variable is best seen by examining E(Pt). If the model is specified correctly, E(εt) = 0, so
E(Pt) = (β0 + β2) + β1·St when Dt = 1
E(Pt) = β0 + β1·St when Dt = 0

β2 is the location premium in this case: the difference between the price of a house in a desirable area and one in a not-so-desirable area, all else held constant. The dummy variable captures the shift in the intercept caused by the qualitative variable, so Dt is an intercept dummy variable.

Dt is treated like any other explanatory variable: you can construct a confidence interval for β2, and you can test whether β2 is significantly different from zero. In such a test, if you fail to reject H0, there is no significant difference between the two categories.
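
A minimal sketch of estimating the intercept-dummy price model above (the data and the 25,000 premium are simulated for illustration, not from the slides):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 300
sqft = rng.uniform(800, 3500, size=n)        # St: square footage
desirable = rng.integers(0, 2, size=n)       # Dt: 1 = desirable neighborhood
price = (20_000 + 90 * sqft + 25_000 * desirable
         + rng.normal(0, 10_000, size=n))    # true premium: 25,000

X = sm.add_constant(np.column_stack([sqft, desirable]))
fit = sm.OLS(price, X).fit()
b0, b1, b2 = fit.params                      # intercept, slope, location premium
print(f"estimated location premium: {b2:,.0f}")
print(fit.conf_int()[2])                     # confidence interval for the premium
```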

Application of Intercept Dummy Variables
WAGES = β0 + β1·EXP + β2·RACE + β3·SEX + ε
RACE = 1 if white, 0 if non-white
SEX = 1 if male, 0 if female

Suppose the estimated equation is
WAGES = 40,000 + β̂1·EXP + 1102·RACE + 1082·SEX
Mean salary for a black female: 40,000 + β̂1·EXP
Mean salary for a white female: 40,000 + 1102 + β̂1·EXP = 41,102 + β̂1·EXP

Mean salary for an Asian male (RACE = 0, SEX = 1): 40,000 + 1082 + β̂1·EXP = 41,082 + β̂1·EXP
Mean salary for a white male: 40,000 + 1102 + 1082 + β̂1·EXP = 42,184 + β̂1·EXP
What sucks more, being female or non-white?

Determining the number of dummies to use: if there are h categories, use h − 1 dummies. The category left out defines the reference group. If you use all h dummies (plus a constant), you fall into the dummy trap.

Slope Dummy Variables
Allow for a different slope in the relationship. Use an interaction variable between the actual variable and a dummy variable, e.g.
Pt = β0 + β1·St + β2·(St·Dt) + εt
St = square footage; Dt = 1 if desirable area, 0 otherwise

This captures the effect of location and size on the price of a house:
E(Pt) = β0 + (β1 + β2)·St if Dt = 1
E(Pt) = β0 + β1·St if Dt = 0
In the desirable area the price per square foot is β1 + β2; in other areas it is β1. If we believe that a house's location affects both the intercept and the slope, then the model is:

Pt = β0 + β1·St + β2·(St·Dt) + β3·Dt + εt
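
A sketch of the combined intercept-and-slope-dummy model (simulated, illustrative numbers again):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 300
sqft = rng.uniform(800, 3500, size=n)
desirable = rng.integers(0, 2, size=n)
price = (20_000 + 80 * sqft + 30 * sqft * desirable   # slope shift: 30 per sq ft
         + 15_000 * desirable                         # intercept shift
         + rng.normal(0, 10_000, size=n))

# Columns ordered as in the model: St, St*Dt, Dt
X = sm.add_constant(np.column_stack([sqft, sqft * desirable, desirable]))
b0, b1, b2, b3 = sm.OLS(price, X).fit().params
print(f"price per sq ft, other areas:    {b1:.1f}")       # estimates β1
print(f"price per sq ft, desirable area: {b1 + b2:.1f}")  # estimates β1 + β2
```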

Dummies for Multiple Categories
We can use dummy variables to control for something with multiple categories. Suppose everyone in your data is either a HS dropout, a HS grad only, or a college grad. To compare HS grads and college grads to HS dropouts, include 2 dummy variables: hsgrad = 1 if HS grad only, 0 otherwise; and colgrad = 1 if college grad, 0 otherwise.

Multiple Categories (cont.)
Any categorical variable can be turned into a set of dummy variables. Because the base group is represented by the intercept, if there are n categories there should be n − 1 dummy variables. If there are a lot of categories, it may make sense to group some together, e.g. top 10 ranking, 11–25, etc.
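
One way to build such dummies in practice (a sketch with pandas; the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"educ": ["dropout", "hsgrad", "colgrad", "hsgrad", "dropout"]})
# Fix the category order so "dropout" is the first (and thus omitted) group
df["educ"] = pd.Categorical(df["educ"],
                            categories=["dropout", "hsgrad", "colgrad"])

# drop_first=True leaves out one category to avoid the dummy trap
dummies = pd.get_dummies(df["educ"], prefix="educ", drop_first=True)
print(dummies)   # columns educ_hsgrad, educ_colgrad; base group = dropouts
```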

Interactions Among Dummies
Interacting dummy variables is like subdividing the group. Example: have dummies for male, as well as hsgrad and colgrad. Add male·hsgrad and male·colgrad, for a total of 5 dummy variables → 6 categories. The base group is female HS dropouts; hsgrad is for female HS grads, colgrad is for female college grads, and the interactions pick up male HS grads and male college grads.

More on Dummy Interactions
Formally, the model is
y = β0 + δ1·male + δ2·hsgrad + δ3·colgrad + δ4·male·hsgrad + δ5·male·colgrad + β1·x + u
Then, for example:
If male = 0, hsgrad = 0, colgrad = 0: y = β0 + β1·x + u
If male = 0, hsgrad = 1, colgrad = 0: y = β0 + δ2 + β1·x + u
If male = 1, hsgrad = 0, colgrad = 1: y = β0 + δ1 + δ3 + δ5 + β1·x + u
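
A sketch of constructing the five dummies in pandas (the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "male":    [0, 1, 1, 0],
    "hsgrad":  [0, 1, 0, 1],
    "colgrad": [1, 0, 1, 0],
})

# The interactions subdivide the groups: 5 dummies -> 6 categories,
# with female HS dropouts (all five dummies zero) as the base group
df["male_hsgrad"] = df["male"] * df["hsgrad"]
df["male_colgrad"] = df["male"] * df["colgrad"]
print(df)
```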

Other Interactions with Dummies
We can also interact a dummy variable, d, with a continuous variable, x:
y = β0 + δ0·d + β1·x + δ1·(d·x) + u
If d = 0: y = β0 + β1·x + u
If d = 1: y = (β0 + δ0) + (β1 + δ1)·x + u
This is interpreted as a change in the slope.

[Figure: y plotted against x, with two lines: y = β0 + β1·x for the d = 0 group and y = (β0 + δ0) + (β1 + δ1)·x for the d = 1 group. Example of δ0 > 0 and δ1 < 0: a higher intercept but a flatter slope when d = 1.]
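
In statsmodels' formula interface this dummy-continuous interaction can be written compactly: "y ~ x * d" expands to x, d, and x:d. A sketch with simulated data (the true coefficients are made up):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n = 200
df = pd.DataFrame({"x": rng.normal(size=n),
                   "d": rng.integers(0, 2, size=n)})
# True model: intercept shift +1.0 (δ0 > 0), slope shift -0.8 (δ1 < 0)
df["y"] = (2 + 1.0 * df["d"] + 1.5 * df["x"]
           - 0.8 * df["d"] * df["x"] + rng.normal(0, 0.5, size=n))

fit = smf.ols("y ~ x * d", data=df).fit()
print(fit.params)   # Intercept, x, d, x:d -- x:d estimates the slope change δ1
```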

Multicollinearity
Omitted variable bias is a problem when the omitted variable is an explanator of Y and is correlated with X1. Including the omitted variable in a multiple regression solves the problem: the multiple regression finds the coefficient on X1, holding X2 fixed.

Multicollinearity (cont.)
Multiple regression finds the coefficient on X1, holding X2 fixed. For the estimate of β1 (written as a weighted sum, β̂1 = Σ wi·Yi) to be unbiased, OLS requires weights wi such that
Σ wi·X1i = 1 and Σ wi·X2i = 0
Are these conditions always possible?

Multicollinearity (cont.)
To strip out the bias caused by the correlation between X1 and X2, OLS has to impose the restriction Σ wi·X2i = 0. This restriction in essence removes those parts of X1 that are correlated with X2. If X1 is very correlated with X2, OLS doesn't have much left-over variation to work with. If X1 is perfectly correlated with X2, OLS has nothing left.

Multicollinearity (cont.)
Suppose X2 is simply a function of X1. For some silly reason, we want to estimate the returns to an extra year of education AND the returns to an extra month of education. So we stick in two variables, one recording the number of years of education and one recording the number of months of education: X2 = 12·X1.

Let's look at this problem in terms of our unbiasedness conditions. With X2i = 12·X1i, any weights that satisfy Σ wi·X1i = 1 force Σ wi·X2i = 12·Σ wi·X1i = 12 ≠ 0. No weights can do both these jobs!

Multicollinearity (cont.)
Bottom line: you CANNOT add variables that are perfectly correlated with each other (and nearly perfect correlation isn't good either). You CANNOT include a group of variables that are a linear combination of each other. You CANNOT include a group of variables that sum to 1 and also include a constant.
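
A quick way to see the breakdown (a sketch: numpy reports that the design matrix loses a rank when months = 12 × years):

```python
import numpy as np

rng = np.random.default_rng(3)
years = rng.integers(8, 21, size=100).astype(float)  # years of education
months = 12 * years                                  # perfectly collinear
wage = 5_000 + 2_000 * years + rng.normal(0, 1_000, size=100)

X = np.column_stack([np.ones(100), years, months])
print(np.linalg.matrix_rank(X))   # 2, not 3: X'X is singular
coef, resid, rank, sv = np.linalg.lstsq(X, wage, rcond=None)
print(rank)                       # lstsq also reports rank 2: no unique solution
```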

Multicollinearity (cont.) Multicollinearity is easy to fix. Simply omit one of the troublesome variables. Maybe you can find more data for which your variables are not multicollinear. This isn’t possible if your variables are weighted sums of each other by definition.

Checking Understanding
You have a cross-section of workers, all observed in the same year. Which of the following variable sets would lead to multicollinearity?
1. A constant, year of birth, age
2. A constant, year of birth, years since they finished high school
3. A constant, year of birth, years since they started working for their current employer

Checking Understanding (cont.)
1. A constant, year of birth, and age will be a problem: age ≈ survey year − year of birth, so these variables will be multicollinear (or nearly multicollinear, which is almost as bad).

Checking Understanding (cont.)
2. A constant, year of birth, and years since high school PROBABLY suffers from ALMOST perfect multicollinearity. Most Americans graduate from high school around age 18. If this is true in your data, then years since high school ≈ survey year − year of birth − 18, a linear function of the constant and year of birth.

Checking Understanding (cont.)
3. A constant, year of birth, and years with current employer is very unlikely to be a problem. There is usually ample variation in the ages at which different workers begin their employment with a particular firm.

Multicollinearity
When two or more of the explanatory variables are highly related (correlated). Some collinearity always exists; the question is how much there can be before it becomes a problem. Two cases: perfect multicollinearity and imperfect multicollinearity.

Using the Ballantine
[Figure: Ballantine Venn diagram, showing the variation in Y, X1, and X2 as overlapping circles.]

Detecting Multicollinearity
1. Check simple correlation coefficients (r): if |r| > 0.8, multicollinearity may be a problem.
2. Perform a t-test on the correlation coefficient.

3. Check the Variance Inflation Factors (VIF) or the Tolerance (TOL). Run a regression of each X on the other Xs, then calculate the VIF for each β̂i:
VIFi = 1 / (1 − Ri²)
where Ri² is the R² from regressing Xi on the other explanatory variables.

The higher the VIF, the more severe the multicollinearity problem. If a VIF is greater than 5, there might be a problem (the cutoff of 5 is arbitrarily chosen).

Tolerance: TOL = 1 − Ri², with 0 < TOL < 1. If TOL is close to zero, multicollinearity is severe. You can use either VIF or TOL.
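
A sketch of computing VIFs with statsmodels (variance_inflation_factor is the library's own helper; the data are simulated):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2, x3]))
for i in range(1, X.shape[1]):             # skip the constant
    print(f"VIF for column {i}: {variance_inflation_factor(X, i):.1f}")
```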

Effects of Multicollinearity
1. OLS estimates are still unbiased.
2. Standard errors of the estimated coefficients will be inflated.
3. t-statistics will be small.
4. Estimates will be sensitive to small changes, whether from dropping a variable or adding a few more observations.

With multicollinearity, you may fail to reject H0 in every one of your t-tests and yet reject H0 in your F-test.

Dealing with Multicollinearity
1. Ignore it. Do this when multicollinearity is not causing any problems: if the t-statistics are insignificant and unreliable, do something; if not, do nothing.

2. Drop a variable. If two variables are significantly related, drop one of them (it is redundant).
3. Increase the sample size. The larger the sample size, the more precise the estimates.

Review
Perfect multicollinearity occurs when two or more of your explanators are jointly perfectly correlated; that is, you can write one of your explanators as a linear function of the other explanators, e.g. X1 = a0 + a2·X2 + … + ak·Xk.

Review (cont.)
OLS breaks down with perfect (or even near-perfect) multicollinearity. Multicollinearity most frequently occurs when you want to include:
– Time, age, and birth-year effects
– A dummy variable for each category, plus a constant

Review (cont.)
Dummy variables (also called binary variables) take on only the values 0 or 1. Dummy variables let you estimate separate intercepts and slopes for different groups. To avoid multicollinearity while including a constant, you need to omit the dummy variable for one group (e.g. males or non-Hispanic whites). You want to pick one of the larger groups to omit.
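
A closing sketch of the dummy trap from the review (simulated; a full set of category dummies plus a constant is rank-deficient, while omitting one group restores full rank):

```python
import numpy as np

rng = np.random.default_rng(5)
groups = rng.integers(0, 3, size=120)              # three categories
D = np.eye(3)[groups]                              # one dummy per category

X_trap = np.column_stack([np.ones(120), D])        # constant + ALL dummies
print(np.linalg.matrix_rank(X_trap))               # 3, not 4: dummies sum to 1

X_ok = np.column_stack([np.ones(120), D[:, 1:]])   # omit one group's dummy
print(np.linalg.matrix_rank(X_ok))                 # 3: full column rank
```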