Chapter 5: Dummy Variables

DUMMY VARIABLE CLASSIFICATION WITH TWO CATEGORIES

We’ll now examine how you can include qualitative explanatory variables in your regression model. Suppose that you have data on the annual recurrent expenditure, COST, and the number of students enrolled, N, for a sample of secondary schools, of which there are two types: regular and occupational. The occupational schools aim to provide skills for specific occupations, and they tend to be relatively expensive to run because they need to maintain specialized workshops.

© Christopher Dougherty 1999–2006

Suppose we want to estimate the cost of running an occupational school and a regular school. One way of dealing with the difference in costs would be to run separate regressions for the two types of school. However, this would have the drawback that you would potentially be running regressions with two small samples instead of one large one, with an adverse effect on the precision of the estimates of the coefficients.

Another way of handling the difference would be to hypothesize that the cost function for occupational schools has an intercept $\beta_1'$ that is greater than that for regular schools. Effectively, we are hypothesizing that the annual overhead cost is different for the two types of school, but the marginal cost is the same. The marginal cost assumption is not very plausible and we will relax it in due course.

Regular school (OCC = 0): $\mathit{COST} = \beta_1 + \beta_2 N + u$
Occupational school (OCC = 1): $\mathit{COST} = \beta_1' + \beta_2 N + u$

Let us define $\delta$ to be the difference in the intercepts: $\delta = \beta_1' - \beta_1$. Then $\beta_1' = \beta_1 + \delta$ and we can rewrite the cost function for occupational schools as shown.

Regular school (OCC = 0): $\mathit{COST} = \beta_1 + \beta_2 N + u$
Occupational school (OCC = 1): $\mathit{COST} = \beta_1 + \delta + \beta_2 N + u$

We can now combine the two cost functions by defining a dummy variable OCC that has value 0 for regular schools and 1 for occupational schools. (Dummy variables always have two values, 0 or 1.)

Combined equation: $\mathit{COST} = \beta_1 + \delta\,\mathit{OCC} + \beta_2 N + u$

We will now fit a function of this type using actual data for a sample of 74 secondary schools in Shanghai.

The table shows the data for the first 10 schools in the sample. The annual cost is measured in yuan, one yuan being worth about 20 cents U.S. at the time. N is the number of students in the school. OCC is the dummy variable for the type of school.

School  Type          COST (yuan)  OCC
1       Occupational  345,…        1
2       Occupational  537,…        1
3       Regular       170,…        0
4       Occupational  …            1
5       Regular       100,…        0
6       Regular       28,…         0
7       Regular       160,…        0
8       Occupational  45,…         1
9       Occupational  120,…        1
10      Occupational  61,…         1

[The N column and the remaining digits of COST are missing from the transcript.]
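As a minimal sketch of how the dummy is coded, the Python snippet below maps the school types of the first ten schools to OCC. The function name `occ_dummy` and the string labels are illustrative assumptions, not part of the original slides.

```python
# Hypothetical helper: code the two-category dummy OCC from a school-type label.
def occ_dummy(school_type: str) -> int:
    """Return 1 for an occupational school, 0 for a regular school."""
    if school_type == "Occupational":
        return 1
    if school_type == "Regular":
        return 0
    raise ValueError(f"unknown school type: {school_type!r}")


# School types of the first ten schools, in the order given in the table.
types = [
    "Occupational", "Occupational", "Regular", "Occupational", "Regular",
    "Regular", "Regular", "Occupational", "Occupational", "Occupational",
]

occ = [occ_dummy(t) for t in types]
print(occ)
```

Raising an error for an unrecognized label, rather than silently returning 0, keeps a miscoded school from being treated as a regular school.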

. reg COST N OCC
[Stata output table: coefficients, standard errors, t statistics, and 95% confidence intervals for N, OCC, and _cons; F(2, 71)]

We now run the regression of COST on N and OCC, treating OCC just like any other explanatory variable, despite its artificial nature. The Stata output is shown above. We will begin by interpreting the regression coefficients.
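To illustrate that the dummy really is treated like any other regressor, here is a small sketch in plain Python that fits COST on a constant, OCC, and N by solving the normal equations. The data are synthetic and noise-free, generated from assumed parameter values; they are not the Shanghai sample, and the helper names `solve` and `ols` are hypothetical.

```python
# Sketch: OLS of COST on a constant, OCC, and N via the normal equations (X'X) b = X'y.
# The data are synthetic and noise-free (an assumption for illustration only).

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ols(rows, y):
    """Return OLS coefficients; each regressor row must include the constant 1."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical schools as (N, OCC) pairs; COST is generated from assumed parameters.
schools = [(100, 0), (250, 0), (400, 1), (550, 1), (700, 0), (850, 1)]
beta1, delta, beta2 = 20000.0, 100000.0, 300.0
cost = [beta1 + delta * occ + beta2 * n for n, occ in schools]

rows = [[1.0, float(occ), float(n)] for n, occ in schools]
b1, d, b2 = ols(rows, cost)
print(round(b1), round(d), round(b2))
```

Because the synthetic data contain no disturbance term, the fit recovers the assumed parameters up to floating-point error. With real data the estimates would differ from the true values, and the standard errors reported by Stata would be needed to judge by how much.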

$\widehat{\mathit{COST}} = -34{,}000 + 133{,}000\,\mathit{OCC} + 331\,N$

The regression results have been rewritten in equation form. From this equation we can derive cost functions for the two types of school by setting OCC equal to 0 or 1.

Regular school (OCC = 0): $\widehat{\mathit{COST}} = -34{,}000 + 331\,N$

If OCC is equal to 0, we get the equation for regular schools, as shown. It implies that the marginal cost per student per year is 331 yuan and that the annual overhead cost is –34,000 yuan. Obviously a negative intercept does not make any sense, and it suggests that the model is misspecified in some way. We will come back to this later. It is worth noting that its t statistic indicates that it is not significantly different from 0.

Occupational school (OCC = 1): $\widehat{\mathit{COST}} = -34{,}000 + 133{,}000 + 331\,N = 99{,}000 + 331\,N$

The coefficient of the dummy variable is an estimate of $\delta$, the extra annual overhead cost of an occupational school. Putting OCC equal to 1, we estimate the annual overhead cost of an occupational school to be 99,000 yuan. The marginal cost is the same as for regular schools. It must be, given the model specification.
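The arithmetic behind the two cost functions can be sketched in a few lines of Python. The coefficients are the rounded estimates quoted in the text; the dummy coefficient of 133,000 is the value implied by the two stated intercepts (99,000 minus –34,000). The function names and the enrolment of 500 students are hypothetical.

```python
# Rounded estimates quoted in the slides (yuan).
INTERCEPT = -34_000  # estimate of beta_1: overhead of a regular school
DELTA = 133_000      # estimate of delta: extra overhead of an occupational school
MARGINAL = 331       # estimate of beta_2: marginal cost per student


def fitted_cost(n: int, occ: int) -> int:
    """Fitted annual cost in yuan for a school with n students and dummy value occ."""
    return INTERCEPT + DELTA * occ + MARGINAL * n


def overhead(occ: int) -> int:
    """Implied intercept of the cost function for the given school type."""
    return fitted_cost(0, occ)


print(overhead(0), overhead(1))
# Hypothetical enrolment of 500 students: the gap between the two types equals DELTA.
print(fitted_cost(500, 1) - fitted_cost(500, 0))
```

The gap between the two fitted costs is the same at every enrolment level, which is the common-marginal-cost assumption built into the specification.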

The scatter diagram shows the data and the two cost functions derived from the regression results.

In addition to the estimates of the coefficients, the regression results will include standard errors and the usual diagnostic statistics. We will perform a t test on the coefficient of the dummy variable. Our null hypothesis is $H_0: \delta = 0$ and our alternative hypothesis is $H_1: \delta \neq 0$. In words, our null hypothesis is that there is no difference in the overhead costs of the two types of school. The t statistic is 6.40, so the null hypothesis is rejected at the 0.1% significance level.

We can perform t tests on the other coefficients in the usual way. The t statistic for the coefficient of N is 8.34, so we conclude that the marginal cost is (very) significantly different from 0.

In the case of the intercept, the t statistic is –1.43, so we do not reject the null hypothesis $H_0: \beta_1 = 0$. Thus one explanation of the nonsensical negative overhead cost of regular schools might be that they do not actually have any overheads and our estimate is a random number.

A more realistic version of this hypothesis is that $\beta_1$ is positive but small (as you can see, the 95 percent confidence interval includes positive values) and the error term is responsible for the negative estimate. As already noted, a further possibility is that the model is misspecified in some way. We will continue to develop the model in the next sequence.

DUMMY CLASSIFICATION WITH MORE THAN TWO CATEGORIES

Now we’ll study how to extend the dummy variable technique to handle a qualitative explanatory variable which has more than two categories. Previously, we used a dummy variable to differentiate between regular and occupational schools when fitting a cost function. In actual fact there are two types of regular secondary school in Shanghai. There are general schools, which provide the usual academic education, and vocational schools.

As their name implies, the vocational schools are meant to impart occupational skills as well as give an academic education. However, the vocational component of the curriculum is typically quite small and the schools are similar to the general schools. Often they are just general schools with a couple of workshops added. Likewise there are two types of occupational school. There are technical schools training technicians and skilled workers’ schools training craftsmen.

So now the qualitative variable has four categories. The standard procedure is to choose one category as the reference category and to define dummy variables for each of the others. In general it is good practice to select the most normal or basic category as the reference category, if one category is in some sense more normal or basic than the others. In the Shanghai sample it is sensible to choose the general schools as the reference category. They are the most numerous and the other schools are variations of them.

Accordingly we will define dummy variables for the other three types. TECH will be the dummy for the technical schools: TECH is equal to 1 if the observation relates to a technical school, 0 otherwise. Similarly we will define dummy variables WORKER and VOC for the skilled workers’ schools and the vocational schools. Each of the dummy variables will have a coefficient which represents the extra overhead costs of the schools, relative to the reference category. Note that you do not include a dummy variable for the reference category, and that is the reason that the reference category is usually described as the omitted category.

$\mathit{COST} = \beta_1 + \delta_T\,\mathit{TECH} + \delta_W\,\mathit{WORKER} + \delta_V\,\mathit{VOC} + \beta_2 N + u$

If an observation relates to a general school, the dummy variables are all 0 and the regression model is reduced to its basic components.

General school (TECH = WORKER = VOC = 0): $\mathit{COST} = \beta_1 + \beta_2 N + u$

If an observation relates to a technical school, TECH will be equal to 1 and the other dummy variables will be 0. The regression model simplifies as shown.

Technical school (TECH = 1; WORKER = VOC = 0): $\mathit{COST} = (\beta_1 + \delta_T) + \beta_2 N + u$

The regression model simplifies in a similar manner in the case of observations relating to skilled workers’ schools and vocational schools.

Skilled workers’ school (WORKER = 1; TECH = VOC = 0): $\mathit{COST} = (\beta_1 + \delta_W) + \beta_2 N + u$
Vocational school (VOC = 1; TECH = WORKER = 0): $\mathit{COST} = (\beta_1 + \delta_V) + \beta_2 N + u$

The diagram illustrates the model graphically. The $\delta$ coefficients are the extra overhead costs of running technical, skilled workers’, and vocational schools, relative to the overhead cost of general schools. Note that we do not make any prior assumption about the size, or even the sign, of the $\delta$ coefficients. They will be estimated from the sample data.

[Diagram: COST against N, with parallel cost functions for general, vocational, skilled workers’, and technical schools; intercepts $\beta_1$, $\beta_1 + \delta_V$, $\beta_1 + \delta_W$, and $\beta_1 + \delta_T$ respectively.]

Here are the data for the first 10 of the 74 schools. Note how the values of the dummy variables TECH, WORKER, and VOC are determined by the type of school in each observation.

School  Type        COST (yuan)  TECH  WORKER  VOC
1       Technical   345,…        1     0       0
2       Technical   537,…        1     0       0
3       General     170,…        0     0       0
4       Workers’    …            0     1       0
5       General     100,…        0     0       0
6       Vocational  28,…         0     0       1
7       Vocational  160,…        0     0       1
8       Technical   45,…         1     0       0
9       Technical   120,…        1     0       0
10      Workers’    61,…         0     1       0

[The N column and the remaining digits of COST are missing from the transcript.]
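The coding rule (one dummy per category except the reference category) can be sketched as follows. The helper `dummies`, the label strings, and the name GEN for a general-school dummy are illustrative assumptions; the school types are those of the first ten schools.

```python
# Hypothetical helper: dummies for a four-category variable with one omitted category.
CATEGORIES = ["General", "Technical", "Workers'", "Vocational"]
DUMMY_NAMES = {
    "General": "GEN",
    "Technical": "TECH",
    "Workers'": "WORKER",
    "Vocational": "VOC",
}


def dummies(school_type: str, reference: str = "General") -> dict:
    """Return a 0/1 dummy for every category except the reference category."""
    if school_type not in CATEGORIES:
        raise ValueError(f"unknown school type: {school_type!r}")
    return {
        DUMMY_NAMES[c]: int(school_type == c)
        for c in CATEGORIES
        if c != reference
    }


# School types of the first ten schools, in order.
types = [
    "Technical", "Technical", "General", "Workers'", "General",
    "Vocational", "Vocational", "Technical", "Technical", "Workers'",
]

for number, school_type in enumerate(types, start=1):
    print(number, school_type, dummies(school_type))
```

If a dummy were defined for all four categories, the four dummies would sum to 1 in every observation and so duplicate the constant term, which is why one category has to be omitted when the model includes an intercept.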

The scatter diagram shows the data for the entire sample, differentiating by type of school.

. reg COST N TECH WORKER VOC
[Stata output table: coefficients, standard errors, t statistics, and 95% confidence intervals for N, TECH, WORKER, VOC, and _cons; F(4, 69)]

Here is the Stata output for this regression. The coefficient of N indicates that the marginal cost per student per year is 343 yuan.

The coefficients of TECH, WORKER, and VOC are 154,000, 143,000, and 53,000, respectively, and should be interpreted as the additional annual overhead costs, relative to those of general schools.

The constant term is –55,000, indicating that the annual overhead cost of a general academic school is –55,000 yuan. Obviously this is nonsense and indicates that something is wrong with the model specification.

$\widehat{\mathit{COST}} = -55{,}000 + 154{,}000\,\mathit{TECH} + 143{,}000\,\mathit{WORKER} + 53{,}000\,\mathit{VOC} + 343\,N$

The top line shows the regression result in equation form. We will derive the implicit cost functions for each type of school.

General school (TECH = WORKER = VOC = 0): $\widehat{\mathit{COST}} = -55{,}000 + 343\,N$

In the case of a general school, the dummy variables are all 0 and the equation reduces to the intercept and the term involving N. The annual marginal cost per student is estimated at 343 yuan. The annual overhead cost per school is estimated at –55,000 yuan. Obviously a negative amount is inconceivable.

Technical school (TECH = 1; WORKER = VOC = 0): $\widehat{\mathit{COST}} = -55{,}000 + 154{,}000 + 343\,N = 99{,}000 + 343\,N$

The extra annual overhead cost for a technical school, relative to a general school, is 154,000 yuan. Hence we derive the implicit cost function for technical schools.

Skilled workers’ school (WORKER = 1; TECH = VOC = 0): $\widehat{\mathit{COST}} = -55{,}000 + 143{,}000 + 343\,N = 88{,}000 + 343\,N$
Vocational school (VOC = 1; TECH = WORKER = 0): $\widehat{\mathit{COST}} = -55{,}000 + 53{,}000 + 343\,N = -2{,}000 + 343\,N$

And similarly the extra overhead costs of skilled workers’ and vocational schools, relative to those of general schools, are 143,000 and 53,000 yuan, respectively.

Note that in each case the annual marginal cost per student is estimated at 343 yuan. The model specification assumes that this figure does not differ according to type of school.
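The four implicit cost functions follow mechanically from the fitted equation. The sketch below reproduces that arithmetic using the rounded coefficients quoted in the text; the dictionary layout and the function name `implied_intercept` are hypothetical.

```python
from typing import Optional

# Rounded estimates quoted in the slides (yuan); general schools are the reference category.
COEF = {"_cons": -55_000, "TECH": 154_000, "WORKER": 143_000, "VOC": 53_000, "N": 343}

# Dummy that equals 1 for each school type (None for the reference category).
SCHOOL_DUMMY = {
    "General": None,
    "Technical": "TECH",
    "Skilled workers'": "WORKER",
    "Vocational": "VOC",
}


def implied_intercept(dummy: Optional[str]) -> int:
    """Intercept of the implicit cost function: constant plus the active dummy's coefficient."""
    shift = COEF[dummy] if dummy is not None else 0
    return COEF["_cons"] + shift


for school, dummy in SCHOOL_DUMMY.items():
    print(f"{school}: COST-hat = {implied_intercept(dummy):,} + {COEF['N']} N")
```

Only the intercept changes from one school type to another; the coefficient of N is shared, as the specification requires.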

The four cost functions are illustrated graphically.

We can perform t tests on the coefficients in the usual way. The t statistic for N is 8.52, so the marginal cost is (very) significantly different from 0, as we would expect.

The t statistic for the technical school dummy is 5.76, indicating that the annual overhead cost of a technical school is (very) significantly greater than that of a general school, again as expected.

Similarly for skilled workers’ schools, the t statistic is 5.15, indicating that the annual overhead cost of a skilled workers’ school is (very) significantly greater than that of a general school, again as expected.

In the case of vocational schools, however, the t statistic is only 1.71, indicating that the overhead cost of such a school is not significantly greater than that of a general school. This is not surprising, given that the vocational schools are not much different from the general schools.

Note that the null hypotheses for the tests on the coefficients of the dummy variables are that the overhead costs of the other schools are not different from those of the general schools.

Finally we will perform an F test of the joint explanatory power of the dummy variables as a group. The null hypothesis is $H_0: \delta_T = \delta_W = \delta_V = 0$. The alternative hypothesis is that at least one $\delta$ is different from 0.

The residual sum of squares in the specification including the dummy variables is $5.41 \times 10^{11}$.

. reg COST N
[Stata output table: coefficients, standard errors, t statistics, and 95% confidence intervals for N and _cons; F(1, 72); Root MSE = 1.1e+05]

The residual sum of squares in the specification excluding the dummy variables is $8.92 \times 10^{11}$. The reduction in RSS when we include the dummies is therefore $(8.92 - 5.41) \times 10^{11}$. We will check whether this reduction is significant with the usual F test.

The numerator in the F ratio is the reduction in RSS divided by the cost, which is the 3 degrees of freedom given up when we estimate three additional coefficients (the coefficients of the dummies).

The denominator is RSS for the specification including the dummy variables, divided by the number of degrees of freedom remaining after they have been added, which is 69. The F ratio is therefore

$F(3, 69) = \dfrac{(8.92 \times 10^{11} - 5.41 \times 10^{11})/3}{5.41 \times 10^{11}/69} = 14.92$

F tables do not give the critical value for 3 and 69 degrees of freedom, but it must be lower than the critical value with 3 and 60 degrees of freedom. This is 6.17 at the 0.1% significance level. Thus we reject $H_0$ at a high significance level. This is not exactly surprising, since t tests show that TECH and WORKER have highly significant coefficients.
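As a check on the arithmetic, the F statistic for the group of dummies can be computed directly from the two residual sums of squares. This is a sketch using the rounded RSS figures quoted in the text, expressed in common units because the shared power of ten cancels in the ratio; the function name `f_statistic` is hypothetical.

```python
def f_statistic(rss_restricted: float, rss_unrestricted: float,
                extra_params: int, df_remaining: int) -> float:
    """F ratio: (reduction in RSS / df given up) / (unrestricted RSS / df remaining)."""
    numerator = (rss_restricted - rss_unrestricted) / extra_params
    denominator = rss_unrestricted / df_remaining
    return numerator / denominator


# Rounded RSS values from the slides, in common units (the scale factor cancels).
RSS_WITHOUT_DUMMIES = 8.92  # regression of COST on N only
RSS_WITH_DUMMIES = 5.41     # regression of COST on N, TECH, WORKER, VOC

f_stat = f_statistic(RSS_WITHOUT_DUMMIES, RSS_WITH_DUMMIES,
                     extra_params=3, df_remaining=69)
CRITICAL_F_3_60 = 6.17  # 0.1% critical value for F(3, 60), as quoted in the text

print(round(f_stat, 2), f_stat > CRITICAL_F_3_60)
```

Comparing with the F(3, 60) critical value is conservative, because the critical value for F(3, 69) is lower.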

THE EFFECTS OF CHANGING THE REFERENCE CATEGORY

So far, we chose general academic schools as the reference (omitted) category and defined dummy variables for the other categories.

This enabled us to compare the overhead costs of the other schools with those of general schools and to test whether the differences were significant. However, suppose that we were interested in testing whether the overhead costs of skilled workers’ schools were different from those of the other types of school. How could we do this? The simplest solution is to re-run the regression making skilled workers’ schools the reference category. Now we need to define a dummy variable GEN for the general schools instead.

The model is shown in equation form. Note that there is no longer a dummy variable for skilled workers’ schools since they form the reference category.

$$COST = \beta_1 + \delta_T TECH + \delta_V VOC + \delta_G GEN + \beta_2 N + u$$

In the case of observations relating to skilled workers’ schools, all the dummy variables are 0 and the model simplifies to the intercept and the term involving N.

Skilled workers’ school (TECH = VOC = GEN = 0): $COST = \beta_1 + \beta_2 N + u$

In the case of observations relating to technical schools, TECH is equal to 1 and the intercept increases by an amount $\delta_T$. Note that $\delta_T$ should now be interpreted as the extra overhead cost of a technical school relative to that of a skilled workers’ school.

Technical school (TECH = 1; VOC = GEN = 0): $COST = (\beta_1 + \delta_T) + \beta_2 N + u$

Similarly one can derive the implicit cost functions for vocational and general schools, their $\delta$ coefficients also being interpreted as their extra overhead costs relative to those of skilled workers’ schools.

Vocational school (VOC = 1; TECH = GEN = 0): $COST = (\beta_1 + \delta_V) + \beta_2 N + u$
General school (GEN = 1; TECH = VOC = 0): $COST = (\beta_1 + \delta_G) + \beta_2 N + u$
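
The way the dummy variables and the implicit intercepts follow from the choice of reference category can be sketched as below. The helper names `make_dummies` and `intercept`, the lower-case category labels, and the numbers in the usage lines are all assumptions made for illustration; they are not part of the original material.

```python
CATEGORIES = ["general", "technical", "workers", "vocational"]


def make_dummies(school_type, reference):
    """Return 0/1 dummies for every category except the reference one."""
    if school_type not in CATEGORIES or reference not in CATEGORIES:
        raise ValueError("unknown school type")
    return {c: int(c == school_type) for c in CATEGORIES if c != reference}


def intercept(school_type, reference, beta1, deltas):
    """Intercept of the implicit cost function: beta1 plus the type's delta."""
    dummies = make_dummies(school_type, reference)
    # For the reference category every dummy is 0, so only beta1 remains.
    return beta1 + sum(deltas[c] * d for c, d in dummies.items())


# With skilled workers' schools as the reference there is no 'workers' dummy.
print(make_dummies("technical", "workers"))
# Hypothetical parameter values, for illustration only.
example_deltas = {"general": 1.0, "technical": 2.0, "vocational": 3.0}
print(intercept("vocational", "workers", 10.0, example_deltas))
```

Passing a different `reference` changes which dummy is dropped, which is all that re-running the regression with a new reference category requires.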

This diagram illustrates the model graphically. Note that the $\delta$ shifts are measured from the line for skilled workers’ schools.

[Figure: COST against N, with parallel lines for technical, workers’, vocational and general schools; intercepts $\beta_1 + \delta_T$, $\beta_1$, $\beta_1 + \delta_V$ and $\beta_1 + \delta_G$.]

Here are the data for the first 10 of the 74 schools with skilled workers’ schools as the reference category.

School  Type        COST    N   TECH  VOC  GEN
1       Technical   345,…   …   1     0    0
2       Technical   537,…   …   1     0    0
3       General     170,…   …   0     0    1
4       Workers’    …       …   0     0    0
5       General     100,…   …   0     0    1
6       Vocational  28,…    …   0     1    0
7       Vocational  160,…   …   0     1    0
8       Technical   45,…    …   1     0    0
9       Technical   120,…   …   1     0    0
10      Workers’    61,…    …   0     0    0

Here is the Stata output for the regression. We will focus first on the regression coefficients.

[Stata output: . reg COST N TECH VOC GEN; F(4, 69); coefficient rows N, TECH, VOC, GEN, _cons]

The regression result is shown written as an equation.

$$\widehat{COST} = 88{,}000 + 11{,}000\,TECH - 90{,}000\,VOC - 143{,}000\,GEN + 343N$$

Putting all the dummy variables equal to 0, we obtain the equation for the reference category, the skilled workers’ schools.

Skilled workers’ school (TECH = VOC = GEN = 0): $\widehat{COST} = 88{,}000 + 343N$

Putting TECH equal to 1 and VOC and GEN equal to 0, we obtain the equation for the technical schools.

Technical school (TECH = 1; VOC = GEN = 0): $\widehat{COST} = 88{,}000 + 11{,}000 + 343N = 99{,}000 + 343N$

And similarly we obtain the equations for the vocational and general schools, putting VOC and GEN equal to 1 in turn. Note that the cost functions turn out to be exactly the same as when we used general schools as the reference category.

Vocational school (VOC = 1; TECH = GEN = 0): $\widehat{COST} = 88{,}000 - 90{,}000 + 343N = -2{,}000 + 343N$
General school (GEN = 1; TECH = VOC = 0): $\widehat{COST} = 88{,}000 - 143{,}000 + 343N = -55{,}000 + 343N$
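
The arithmetic behind these cost functions is easy to check. The sketch below uses the rounded estimates quoted in the text (intercept 88,000; shifts of 11,000, –90,000 and –143,000; slope 343 on N for every school type) and then re-expresses the same four lines as shifts from the general-school line. The variable names are assumptions chosen for illustration.

```python
# Rounded estimates with skilled workers' schools as the reference category.
intercept_workers = 88_000
shifts_vs_workers = {"technical": 11_000, "vocational": -90_000, "general": -143_000}

# Intercept of each school type's fitted cost function.
intercepts = {"workers": intercept_workers}
for school_type, shift in shifts_vs_workers.items():
    intercepts[school_type] = intercept_workers + shift

# The same lines written as shifts from the general-school intercept.
shifts_vs_general = {
    school_type: value - intercepts["general"]
    for school_type, value in intercepts.items()
    if school_type != "general"
}

print(intercepts)
print(shifts_vs_general)
```

The intercepts come out as 88,000, 99,000, –2,000 and –55,000. Measured from the general-school line, the same fitted lines imply shifts of 143,000 for workers’ schools, 154,000 for technical schools and 53,000 for vocational schools. Only the parameterization changes, not the fitted lines.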

Consequently the scatter diagram with regression lines is exactly the same as before.

The goodness of fit, whether measured by $R^2$, RSS, or the standard error of the regression (the estimate of the standard deviation of u, here denoted Root MSE), is likewise not affected by the change.

But the t tests are affected. In particular, the meaning of a null hypothesis for a dummy variable coefficient being equal to 0 is different.

For example, the t statistic for the technical school coefficient is for the null hypothesis that the overhead costs of technical schools are the same as those of skilled workers’ schools. The t ratio in question is only 0.35, so the null hypothesis is not rejected.

The t ratio for the coefficient of VOC is –2.65, so one concludes that the overheads of vocational schools are significantly lower than those of skilled workers’ schools, at the 1% significance level.

General schools clearly have lower overhead costs than the skilled workers’ schools, according to the regression.

Note that there are some differences in the standard errors as well. However, the standard error (and t statistic) of the coefficient of N is unaffected.

[Stata output, coefficient tables compared: . reg COST N TECH WORKER VOC and . reg COST N TECH VOC GEN]

The one test involving the dummy variables that can be performed with either specification is the test of whether the overhead costs of general schools and skilled workers’ schools are different. The choice of specification can make no difference to the outcome of this test. The only difference is caused by the fact that the regression coefficient has become negative in the second specification. The standard error is the same, so the t statistic has the same absolute magnitude and the outcome of the test must be the same.
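
These invariance claims can be checked numerically. The sketch below is an illustration on synthetic data, not a reproduction of the regressions in the slides: the school sizes, the noise level, and the split of the 29 technical and vocational schools into 15 and 14 are assumptions, and the small OLS routine is hand-written so that the example needs only the standard library.

```python
import math
import random


def transpose(m):
    return [list(col) for col in zip(*m)]


def invert(m):
    """Gauss-Jordan inverse with partial pivoting (adequate for small matrices)."""
    n = len(m)
    aug = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def ols(x, y):
    """Return coefficients, standard errors and RSS for y regressed on x."""
    xt = transpose(x)
    xtx = [[sum(a * b for a, b in zip(r, c)) for c in xt] for r in xt]
    xtx_inv = invert(xtx)
    xty = [sum(a * b for a, b in zip(col, y)) for col in xt]
    beta = [sum(a * b for a, b in zip(row, xty)) for row in xtx_inv]
    resid = [yi - sum(b * v for b, v in zip(beta, row)) for row, yi in zip(x, y)]
    rss = sum(e * e for e in resid)
    s2 = rss / (len(y) - len(beta))  # estimated variance of u
    se = [math.sqrt(s2 * xtx_inv[i][i]) for i in range(len(beta))]
    return beta, se, rss


# Synthetic sample of 74 schools (28 general and 17 workers' as in the text;
# the 15/14 split of the remaining schools is an assumption).
random.seed(0)
types = ["general"] * 28 + ["workers"] * 17 + ["technical"] * 15 + ["vocational"] * 14
base = {"general": -55_000, "workers": 88_000, "technical": 99_000, "vocational": -2_000}
n_pupils = [random.randint(100, 1000) for _ in types]
cost = [base[t] + 343 * n + random.gauss(0, 90_000) for t, n in zip(types, n_pupils)]


def design(reference):
    """Columns: constant, N, then one dummy per non-reference category."""
    others = [c for c in ["general", "technical", "workers", "vocational"] if c != reference]
    return [[1.0, float(n)] + [float(t == c) for c in others] for t, n in zip(types, n_pupils)]


beta_a, se_a, rss_a = ols(design("general"), cost)  # dummies: technical, workers, vocational
beta_b, se_b, rss_b = ols(design("workers"), cost)  # dummies: general, technical, vocational


def close(a, b):
    return math.isclose(a, b, rel_tol=1e-6, abs_tol=1e-3)


print("same RSS:", close(rss_a, rss_b))
print("same N coefficient and s.e.:", close(beta_a[1], beta_b[1]), close(se_a[1], se_b[1]))
print("WORKER = -GEN, same s.e.:", close(beta_a[3], -beta_b[2]), close(se_a[3], se_b[2]))
```

In the first design the WORKER coefficient sits at index 3, and in the second the GEN coefficient sits at index 2. These two estimates measure the same difference with opposite signs, which is why the t statistics agree in absolute value.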

However, the standard errors of the coefficients of the other dummy variables are slightly larger in the second specification. This is because the skilled workers’ schools are less ‘normal’ or ‘basic’ than the general schools and there are fewer of them in the sample (only 17, as opposed to 28). As a consequence there is less precision in measuring the difference between their costs and those of the other schools than there was when general schools were the reference category.
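
The role of the reference group's size can be seen in a simplified case. The expression below is a sketch that assumes a regression containing only the dummy variables, so that each dummy coefficient is the difference between a group mean and the reference group mean. With N included, as in the model here, the exact expression is more involved, but the same logic applies. The symbols $n_k$ and $n_{ref}$ (the numbers of schools in category $k$ and in the reference category) are introduced for illustration.

```latex
\[
\operatorname{Var}(\hat{\delta}_k)
  = \sigma_u^2 \left( \frac{1}{n_k} + \frac{1}{n_{ref}} \right),
\qquad
\frac{1}{n_{ref}} =
\begin{cases}
  1/28 \approx 0.036 & \text{general schools as reference} \\
  1/17 \approx 0.059 & \text{skilled workers' schools as reference}
\end{cases}
\]
```

For any given $n_k$, the second term is larger when the reference category has only 17 schools, so the variance, and hence the standard error, of the estimated difference is larger.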