# Multiple Regression Analysis

Multiple Regression Model Sections

The Model and Assumptions
If we can predict the value of a variable on the basis of one explanatory variable, we might make a better prediction with two or more explanatory variables. By adding explanatory variables we:

- expect to reduce the chance component of our model
- hope to reduce the standard error of the estimate
- expect to eliminate the bias that may result if we ignore a variable that substantially affects the dependent variable

The Model and Assumptions
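The multiple regression model is

$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki} + \varepsilon_i$$

where

- $y_i$ is the dependent variable for the $i$th observation
- $\beta_0$ is the Y intercept
- $\beta_1, \dots, \beta_k$ are the population partial regression coefficients
- $x_{1i}, x_{2i}, \dots, x_{ki}$ are the observed values of the independent variables $X_1, X_2, \dots, X_k$
- $k$ is the number of explanatory variables
- $\varepsilon_i$ is the random error term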

The Model and Assumptions
The assumptions of the model are the same as those discussed for simple regression:

- The expected value of Y for the given Xs is a linear function of the Xs.
- The standard deviation of the Y terms for given X values is a constant, designated as $\sigma_{y|x}$.
- The observations, $y_i$, are statistically independent.
- The distribution of the Y values (error terms) is normal.

Interpreting the Partial Regression Coefficients
For each explanatory variable $X_j$ there is a partial regression coefficient, $\beta_j$. This coefficient measures the change in $E(Y)$ given a one-unit change in the explanatory variable $X_j$:

- holding the remaining explanatory variables constant
- controlling for the remaining explanatory variables
- ceteris paribus

It is equivalent to a partial derivative in calculus.
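As a brief illustration of the partial-derivative reading, assuming the linear model for the expected value of Y stated in the assumptions (the index $j$ is used here only to pick out one explanatory variable):

```latex
E(Y \mid X_1, \dots, X_k) = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k
\quad\Longrightarrow\quad
\frac{\partial E(Y \mid X_1, \dots, X_k)}{\partial X_j} = \beta_j
```

Because the model is linear, this derivative does not depend on the values at which the other explanatory variables are held constant.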

Method of Least Squares - OLS
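To estimate the population regression equation, we use the method of least squares. The model written in terms of the sample notation is

$$y_i = b_0 + b_1 x_{1i} + b_2 x_{2i} + \dots + b_k x_{ki} + e_i$$

The sample regression equation is

$$\hat{y}_i = b_0 + b_1 x_{1i} + b_2 x_{2i} + \dots + b_k x_{ki}$$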

Method of Least Squares - OLS
The goal is to minimize the distance between the predicted values of Y, the $\hat{y}_i$, and the observed values, $y_i$, that is, to minimize the residuals, $e_i$. Minimize

$$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Method of Least Squares - OLS
Take partial derivatives of SSE with respect to each of the partial regression coefficients and the intercept, and set each equation equal to zero. This gives us $k+1$ equations in $k+1$ unknowns. The equations must be independent and non-homogeneous. Using matrix algebra or a computer, this system of equations can be solved.

- With a single explanatory variable, the fitted model is a straight line.
- With two explanatory variables, the model represents a plane in a three-dimensional space.
- With three or more variables it becomes a hyperplane in a higher-dimensional space.

The sample regression equation is correctly called a regression surface, but we will call it a regression line.
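The least-squares steps above can be sketched in a few lines of Python. This is a minimal illustration, not the software used for the slides: the helper names `solve_linear_system` and `ols` and the five data points are made up for the example, and the data are generated exactly from y = 1 + 2·x1 + 3·x2 so the fitted coefficients are easy to check.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("the normal equations are not independent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ols(xs, y):
    """Return [b0, b1, ..., bk] by solving the k+1 normal equations."""
    rows = [[1.0] + list(r) for r in xs]  # leading 1 for the intercept
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical data: two explanatory variables, no noise.
xs = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in xs]

b = ols(xs, y)
sse = sum((yi - (b[0] + b[1] * x1 + b[2] * x2)) ** 2
          for (x1, x2), yi in zip(xs, y))
print(b, sse)
```

With real data the residuals would not vanish; the same code would then return the coefficients that make the SSE as small as possible.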

An Example: The Human Capital Model
Let's consider education as an investment in human capital. As such, there should be a return on this investment in terms of higher future earnings. Most people accept that earnings tend to rise with schooling levels, but this knowledge by itself does not imply that individuals should go on for more schooling. More schooling is usually costly: there are both direct payments (tuition) and indirect payments (foregone earnings). Thus the actual magnitude of the increased earnings with additional years of schooling is important.

Estimating the change in earnings from an additional year of schooling is not easy. We cannot simply calculate the average earnings for a sample of workers with different education levels. We have to consider the effects on earnings of other factors, for example, experience in the labor market, age, ability, race and sex.

An Example: The Human Capital Model
We will consider a first simple model:

$$\text{(1)}\quad \text{Earnings} = \beta_0 + \beta_1\,\text{education} + \varepsilon$$

We expect that the coefficient on education will be positive, $\beta_1 > 0$. We are interested in the effect of education on earnings, but we realize that most people have higher earnings as they age, regardless of their education. If age and education are positively correlated, the estimated regression coefficient on education will overstate the marginal impact of education. A better specification would account for the effect of age:

$$\text{(2)}\quad \text{Earnings} = \beta_0 + \beta_1\,\text{education} + \beta_2\,\text{age} + \varepsilon$$

A Conceptual Experiment
The use of multiple regression involves a conceptual experiment that we might not be able to carry out in practice. What we would like to be able to do is to compare individuals with different education levels who are the same age. We would then be able to see the effects of education on average earnings, while controlling for age.

Current Population Survey, White Males, March 1991
All workers are 40 years old.

| | n | Average Annual Earnings |
|---|---|---|
| Educ = 12 | 227 | \$27,970.59 |
| Educ = 13 | 132 | \$31,523.24 |

What is the effect of an additional year of education? \$31,523.24 − \$27,970.59 = \$3,552.65

A Conceptual Experiment
Frequently we do not have large enough data sets to be able to ask this type of question. Multiple regression analysis allows us to perform the conceptual exercise of comparing individuals with the same age and different education levels, even if the sample contains no such pairs of individuals.

Sample Data

Data have been obtained from the March 1992 CPS (Current Population Survey). The CPS is the source of the official Government statistics on employment and unemployment. A very important secondary purpose is to collect information such as age, sex, race, education, income and previous work experience. The survey has been conducted monthly for over 50 years. About 57,000 households are interviewed monthly, containing approximately 114,500 persons 15 years and older. The sample is based on the civilian noninstitutional population.

For the multiple regression question, I have selected white male respondents (ages 18 to 65 in the resulting sample) who spent at least one week in the labor force in the preceding year and who provided information on wage earnings during the preceding year. There are 30,040 such men in the remaining sample.

Students need the Multiple Regression Human Capital hand-out (multiplereg.doc).

Sample Statistics

In 1991, the average white male in our sample was 37.5 years old, had 13.0 years of education and earned in the \$27,000s.

| | age | earn | educ |
|---|---|---|---|
| Mean | 37.50 | | 13.02 |
| Standard Error | 0.070 | | 0.017 |
| Median | 36 | 24000 | 13 |
| Mode | 35 | 30000 | 12 |
| Standard Deviation | 12.19 | | 2.92 |
| Sample Variance | 148.54 | | 8.54 |
| Minimum | 18 | | 2 |
| Maximum | 65 | 199998 | 20 |

Count: 30040

Correlation Matrix

Second, consider the correlation matrix, which shows the simple correlation coefficients for all pairs of variables. There is a small, but positive, correlation between education and age. A simple regression of earnings on education will overstate the effect of education because education is positively correlated with age and age has a strong positive effect on earnings.

(Correlation matrix of age, earn and educ.)

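$$\text{Earnings} = \beta_0 + \beta_1\,\text{education} + \varepsilon$$

The regression output reports the estimates $b_0$ and $b_1$ and their standard errors $s_{b_0}$ and $s_{b_1}$.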

Is Education a Significant Explanatory Variable?
Does the model have any worth, that is, is education a significant explanatory variable? Use a t-test:

- $H_0: \beta_1 \le 0$ (no relationship)
- $H_1: \beta_1 > 0$ (positive relationship)

The t-test statistic is $t = b_1 / s_{b_1}$, and its p-value is reported as 0.000. Reject $H_0: \beta_1 \le 0$. There is a significant positive relationship between education and earnings.

How do we interpret the coefficient on education? For each additional year of schooling, average earnings increase by \$2,933.78.

- $R^2 = 0.1710$: we find that 17.1% of the variation in earnings across workers is explained by variation in education levels.
- The standard error of the estimate, $s_e$, equals \$18,876.

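$$\text{Earnings} = \beta_0 + \beta_1\,\text{education} + \beta_2\,\text{age} + \varepsilon$$

The regression output reports the estimates $b_0$, $b_1$ and $b_2$ and their standard errors $s_{b_0}$, $s_{b_1}$ and $s_{b_2}$.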

Interpret the Coefficients
In terms of this problem:

- For each additional year of schooling, average earnings increase by \$2,759.73, controlling for age.
- For each additional year of age, average earnings increase by \$572.74, controlling for schooling.
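The drop in the education coefficient from \$2,933.78 to \$2,759.73 can be tied back to the correlation between education and age. The sketch below is not part of the original slides. It uses a standard least-squares identity: the simple-regression slope equals the multiple-regression slope plus the age coefficient times `d`, the slope from regressing age on education. The variable names are illustrative, and the inputs are the rounded values reported in the transcript, so the results are approximate.

```python
# Reported coefficients (rounded values from the slides).
b1_simple = 2933.78  # education, earnings regressed on education only
b1_multi = 2759.73   # education, controlling for age
b2_multi = 572.74    # age, controlling for education

# Identity: b1_simple = b1_multi + b2_multi * d,
# where d is the slope from a regression of age on education.
d = (b1_simple - b1_multi) / b2_multi

# A simple-regression slope equals r * s_age / s_educ, so the implied
# correlation between education and age is r = d * s_educ / s_age.
s_educ, s_age = 2.92, 12.19  # sample standard deviations from the slides
r = d * s_educ / s_age

print(f"implied slope of age on education: {d:.3f}")
print(f"implied correlation of age and education: {r:.3f}")
```

The implied slope is about 0.30 years of age per year of schooling, and the implied correlation is about 0.07, which agrees with the description of a small but positive correlation between education and age.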

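Prediction

Predict the mean earnings for white male workers who are 30 years old and have a college degree by substituting these values for age and education into the sample regression equation. The standard error of the estimate is

$$s_e = \sqrt{\frac{SSE}{n-k-1}}$$

where $k$ is the number of explanatory variables. For this model $s_e$ is in the \$17,000s.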

Assessing the Regression as a Whole
While we are interested in the significance of individual regression coefficients, we want to assess the performance of the model as a whole.

- $H_0: \beta_1 = \beta_2 = \beta_3 = \dots = \beta_k = 0$ (the model has no worth)
- $H_1$: at least one regression coefficient is not equal to zero (the model has worth)

If all the $b$'s are close to zero, then the SSR will approach zero.

Assessing the Regression as a Whole
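The test statistic is

$$F = \frac{SSR/k}{SSE/(n-k-1)}$$

where $k$ is the number of explanatory variables. If the null hypothesis is true, the SSR is small relative to the SSE and the calculated test statistic will be small (typically around 1 or less); if the null hypothesis is false, the F test statistic will be "large".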

Assessing the Regression as a Whole
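The calculated F test statistic is compared with the critical F to determine whether the null hypothesis should be rejected. If $F_{k,\,n-k-1} > F_{\alpha,\,k,\,n-k-1}$ (the critical value, cv), reject $H_0$. The rejection region lies to the right of the critical value on the F distribution.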

ANOVA Table in Regression
The ANOVA table in the regression output shows the SSR, the SSE and the p-value. An entry such as 3.6632e+12 is read as $3.6632 \times 10^{12}$, or 3.66 trillion. The "Residual" row refers to the SS for the error, the SSE. The F critical value is $F_{.01,\,2,\,n-k-1} = 4.61$. Finally, note the p-value, written as Significance F, which is reported as zero. This tells us that there is essentially no chance of observing a test statistic as large as 5,949.8 if the null hypothesis is true. The model has worth.
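As a rough cross-check on the reported F statistic, the F ratio can be rewritten in terms of $R^2$, because $R^2 = SSR/SST$ and $SST = SSR + SSE$. The sketch below is an illustration rather than part of the original output. The function name `f_from_r2` is made up, and the inputs are the sample size of 30,040, the two explanatory variables, and the rounded $R^2$ of 28.38% reported for this model.

```python
def f_from_r2(r2, n, k):
    """Overall F statistic implied by R^2, n observations, k explanatory variables."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))


f_reported = 5949.8  # F statistic from the ANOVA table
f_critical = 4.61    # critical value at alpha = 0.01

f_stat = f_from_r2(r2=0.2838, n=30040, k=2)
print(f"F implied by R^2: {f_stat:.1f}")
print("reject H0:", f_stat > f_critical)
```

The implied value lands within a fraction of a percent of the reported 5,949.8; the small gap comes from rounding $R^2$ to four decimal places.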

Inferences Concerning the Population Regression Coefficients
Once we test whether the model has any worth using the F test statistic, we will want to know which explanatory variables have coefficients significantly different from zero. We will perform a hypothesis test for each explanatory variable. This is essentially the same t-test we used for the simple regression. The hypotheses are:

- $H_0: \beta_j = 0$
- $H_1: \beta_j \ne 0$

Inferences Concerning the Population Regression Coefficients
The test statistic is

$$t_{n-k-1} = \frac{b_j - 0}{s_{b_j}}$$

where $k$ is the number of independent variables. The denominator, $s_{b_j}$, is the standard error of the regression coefficient, $b_j$. Take the standard errors of the regression coefficients from the computer output.

Inferences Concerning the Population Regression Coefficients
In our model, there are two explanatory variables, so there will be two tests about population regression coefficients.

- Test whether Education is a significant variable: $H_0: \beta_{educ} \le 0$, $H_1: \beta_{educ} > 0$
- Test whether Age is a significant variable: $H_0: \beta_{age} \le 0$, $H_1: \beta_{age} > 0$

Let $\alpha = 0.01$. The critical value $t_{.01,\,n-k-1}$ is taken from the t tables.
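For a sample this large, the one-tail critical value can be approximated without a t table. The sketch below assumes $n - k - 1 = 30{,}037$ degrees of freedom (from the sample size of 30,040 and two explanatory variables), at which the t distribution is practically indistinguishable from the standard normal. It uses Python's standard-library `statistics.NormalDist`; the helper `reject_upper_tail` is a made-up name for illustration.

```python
from statistics import NormalDist

alpha = 0.01

# Large-sample approximation: use the standard normal quantile in place
# of the t quantile with n - k - 1 degrees of freedom.
critical_value = NormalDist().inv_cdf(1 - alpha)


def reject_upper_tail(t_stat, crit):
    """One-tail test of H0: beta <= 0 against H1: beta > 0."""
    return t_stat > crit


print(f"approximate critical value: {critical_value:.3f}")
```

Any test statistic above roughly 2.33 therefore leads to rejecting the null hypothesis at the 1% level in this one-tail test.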

T-test

A test statistic is computed for educ and for age. Both p-values are < 0.01.

For education, we reject the null hypothesis (one-tail test, $\alpha = 0.01$) and find that education is significantly and positively related to earnings. Again, for age, we reject the null hypothesis and conclude that age is significantly and positively related to earnings.

The Coefficient of Determination and the Adjusted R2
The $R^2$ value is still defined as the ratio of the SSR to the SST. We see that 28.38% of the variation in earnings is explained by variation in education and in age. The simple regression has an $R^2 = 0.1710$. It would appear that adding the new explanatory variable improved the "goodness of fit". However, this conclusion can be misleading. As we add new explanatory variables to our model, the $R^2$ never decreases, and in practice it almost always increases, even when the new explanatory variables are not significant. Equivalently, the SSE never increases as more explanatory variables are added. This is a mathematical property and doesn't depend on the relevance of the additional variables.
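One common remedy is the adjusted $R^2$, which penalizes each added explanatory variable through the degrees of freedom. The sketch below uses the usual textbook definition, $1 - (1 - R^2)(n - 1)/(n - k - 1)$, which is an assumption here because the transcript does not show the formula. The function name `adjusted_r2` is illustrative, and the inputs are the two $R^2$ values and the sample size reported above.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n observations and k explanatory variables."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


n = 30040
adj_simple = adjusted_r2(0.1710, n, k=1)    # earnings on education
adj_multiple = adjusted_r2(0.2838, n, k=2)  # earnings on education and age

print(f"simple regression:   {adj_simple:.4f}")
print(f"multiple regression: {adj_multiple:.4f}")
```

With more than 30,000 observations the penalty is tiny, so the adjusted values are almost identical to the unadjusted ones, and the improvement from adding age is not an artifact of the extra variable.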

The Coefficient of Determination and the Adjusted R2