QM222 Class 11 Section A1 Multiple Regression


QM222 Class 11 Section A1 Multiple Regression QM222 Fall 2017 Section A1

To-dos
Have you signed up for your first URO? Do it today.
Today: office hours all day except 3:15-4:30. Sign up for 15-minute slots if I said we should speak: https://docs.google.com/a/bu.edu/spreadsheets/d/188IrHsjGhE758eIQ1Jcru-1WGKFmJJYrmD1ppcdcMhY/edit?usp=sharing

Today we…
Review how to interpret the statistics reported for each regression coefficient (listed on the same line as the coefficient itself): standard errors, 95% confidence intervals, t-tests, p-values.
Multiple regression: how to interpret it and why use it.
Break into groups to talk about possible confounding factors.

Standard errors of coefficients

price = 12934 + 407.45 size

      Source |       SS       df       MS              Number of obs =    1085
-------------+------------------------------           F(  1,  1083) = 3232.35
       Model |  5.6104e+13     1  5.6104e+13           Prob > F      =  0.0000
    Residual |  1.8798e+13  1083  1.7357e+10           R-squared     =  0.7490
-------------+------------------------------           Adj R-squared =  0.7488
       Total |  7.4902e+13  1084  6.9098e+10           Root MSE      =  1.3e+05

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        size |   407.4513   7.166659    56.85   0.000     393.3892    421.5134
       _cons |   12934.12   9705.712     1.33   0.183    -6110.006    31978.25
------------------------------------------------------------------------------

Next to each coefficient is a standard error. We use it to make confidence intervals:
We are approximately 68% certain that the true coefficient (the one we would get with an infinitely large sample) is within one standard error of this coefficient: 407.45 +/- 7.167
We are approximately 95% certain that the true coefficient is within two standard errors of this coefficient: 407.45 +/- 2 * 7.167
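The two rules of thumb above can be checked with a few lines of Python (a quick sketch using the size coefficient and standard error from the Stata output; the "+/- 2" rule is an approximation of the exact t critical value, so the result differs slightly from Stata's printed 95% interval):

```python
# Rough confidence intervals from a coefficient and its standard error
# (values taken from the regression output above).
coef, se = 407.4513, 7.166659

ci68 = (coef - se, coef + se)           # ~68% CI: +/- 1 standard error
ci95 = (coef - 2 * se, coef + 2 * se)   # ~95% CI: +/- 2 standard errors

print(ci68)
print(ci95)
```

The 95% interval comes out near (393.1, 421.8), very close to Stata's (393.39, 421.51).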

The regression results even give you the 95% confidence interval for each coefficient

      Source |       SS       df       MS              Number of obs =    1085
-------------+------------------------------           F(  1,  1083) = 3232.35
       Model |  5.6104e+13     1  5.6104e+13           Prob > F      =  0.0000
    Residual |  1.8798e+13  1083  1.7357e+10           R-squared     =  0.7490
-------------+------------------------------           Adj R-squared =  0.7488
       Total |  7.4902e+13  1084  6.9098e+10           Root MSE      =  1.3e+05

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        size |   407.4513   7.166659    56.85   0.000     393.3892    421.5134
       _cons |   12934.12   9705.712     1.33   0.183    -6110.006    31978.25
------------------------------------------------------------------------------

The 95% confidence interval for each coefficient appears at the right of that coefficient's line.

What's the most important hypothesis about the coefficient to test?
The most important thing to know is whether the variable actually has a relationship with the dependent variable, i.e. whether its coefficient is not zero. The regression output gives several equivalent ways to test this:
The 95% confidence interval does not include zero
|t| > 2
p-value < .05

The t-statistic of the coefficient in the regression output tests the hypothesis that the coefficient = 0

      Source |       SS       df       MS              Number of obs =    1085
-------------+------------------------------           F(  1,  1083) = 3232.35
       Model |  5.6104e+13     1  5.6104e+13           Prob > F      =  0.0000
    Residual |  1.8798e+13  1083  1.7357e+10           R-squared     =  0.7490
-------------+------------------------------           Adj R-squared =  0.7488
       Total |  7.4902e+13  1084  6.9098e+10           Root MSE      =  1.3e+05

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        size |   407.4513   7.166659    56.85   0.000     393.3892    421.5134
       _cons |   12934.12   9705.712     1.33   0.183    -6110.006    31978.25
------------------------------------------------------------------------------

The t-stat next to each coefficient in the regression output tests the hypothesis H0: β = 0, i.e. that the true coefficient is actually zero:

   t-statistic = (b - 0) / s.e.

Since the hypothesized value is zero, this simplifies to:

   t-statistic = b / s.e.

If |t| > 2, we are more than 95% certain that the true coefficient is NOT zero.
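In code, the test statistic is just the coefficient divided by its standard error (a Python sketch using the size row of the output above):

```python
# t-statistic for H0: beta = 0, using the size coefficient from the output above
coef, se = 407.4513, 7.166659

t = (coef - 0) / se   # the hypothesized value 0 drops out: t = b / s.e.
print(round(t, 2))    # matches Stata's t column: 56.85

# |t| > 2 means we are more than 95% certain the true coefficient is not zero
print(abs(t) > 2)
```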

p-values
The p-value tells us how probable it is that the true coefficient is 0 or of the opposite sign.

      Source |       SS       df       MS              Number of obs =    1085
-------------+------------------------------           F(  1,  1083) = 3232.35
       Model |  5.6104e+13     1  5.6104e+13           Prob > F      =  0.0000
    Residual |  1.8798e+13  1083  1.7357e+10           R-squared     =  0.7490
-------------+------------------------------           Adj R-squared =  0.7488
       Total |  7.4902e+13  1084  6.9098e+10           Root MSE      =  1.3e+05

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        size |   407.4513   7.166659    56.85   0.000     393.3892    421.5134
       _cons |   12934.12   9705.712     1.33   0.183    -6110.006    31978.25
------------------------------------------------------------------------------

The p-value of 0.000 on size says it is less than .0005 (i.e. .05%) likely that the true coefficient on size is 0 or negative. (Any higher and it would round to 0.001.) I am therefore more than 100% - .05% = 99.95% certain that the coefficient is not zero.
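With 1083 degrees of freedom the t distribution is essentially normal, so a p-value can be sketched in pure Python (an approximation; Stata uses the exact t distribution and computes the p-value from the unrounded t-statistic):

```python
import math

def p_two_sided(t_stat):
    """Two-sided p-value under a normal approximation to the t distribution."""
    return math.erfc(abs(t_stat) / math.sqrt(2))

# _cons row of the output above: t = 12934.12 / 9705.712
t_cons = 12934.12 / 9705.712
print(round(p_two_sided(t_cons), 3))   # matches Stata's 0.183

# size row: t = 56.85 gives a p-value far below .0005, which prints as 0.000
print(p_two_sided(56.85) < 0.0005)
```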

More
If I know with at least 95% certainty that a coefficient is NOT zero, I say it is statistically significant. This occurs when any of these equivalent conditions holds:
0 is not in the 95% confidence interval
|t| > 2 (we're >=95% certain the coefficient is not zero)
p-value <= .05 (we're >=95% certain the coefficient is not zero)

Multiple Regression

Multiple Regression
The multiple linear regression model is an extension of the simple linear regression model, where the dependent variable Y depends (linearly) on more than one explanatory variable:

   Ŷ = b0 + b1X1 + b2X2 + b3X3 …

We now interpret b1 as the change in Y when X1 changes by 1 and all other variables in the equation REMAIN CONSTANT. We say we are "controlling for" the other variables (X2, X3, …).
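A tiny Python sketch (with made-up coefficients, purely for illustration) shows what "all other variables remain constant" means: the fitted value is b0 plus each coefficient times its variable, and changing X1 by one unit while holding X2 and X3 fixed moves Ŷ by exactly b1:

```python
def predict(b0, bs, xs):
    """Fitted value Y-hat = b0 + b1*X1 + b2*X2 + ... (hypothetical coefficients)."""
    return b0 + sum(b * x for b, x in zip(bs, xs))

# b1 = 2.0; raise X1 from 3 to 4 with X2 and X3 held constant
y1 = predict(10.0, [2.0, -1.5, 0.5], [3.0, 4.0, 5.0])
y2 = predict(10.0, [2.0, -1.5, 0.5], [4.0, 4.0, 5.0])
print(y2 - y1)  # 2.0, exactly b1
```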

Example of multiple regression
We predicted the sale price of a condo in Brookline from "beaconstreet" (t-statistics in parentheses):

   Price = 520,729 - 46,969 beaconstreet
           (61.73)   (-1.82)

Q: Are we 95% certain that beaconstreet has a relationship with price?
Q: Are we 68% certain (|t| > 1)?

We expected condos on Beacon to cost more and are surprised by the result, but there are confounding factors that might be correlated with Beacon Street, such as size (in square feet). So we run a regression of Price (Y) on TWO explanatory variables, beaconstreet AND size.

Multiple regression in Stata

. regress price Beacon_Street size

      Source |       SS       df       MS              Number of obs =    1085
-------------+------------------------------           F(  2,  1082) = 1627.49
       Model |  5.6215e+13     2  2.8108e+13           Prob > F      =  0.0000
    Residual |  1.8687e+13  1082  1.7271e+10           R-squared     =  0.7505
-------------+------------------------------           Adj R-squared =  0.7501
       Total |  7.4902e+13  1084  6.9098e+10           Root MSE      =  1.3e+05

-------------------------------------------------------------------------------
        price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
 beaconstreet |   32935.89   12987.55     2.54   0.011     7452.263    58419.52
         size |   409.4219   7.190862    56.94   0.000     395.3122    423.5315
        _cons |   6981.353   9961.969     0.70   0.484    -12565.61    26528.32
-------------------------------------------------------------------------------

Write the regression equation.
Is the coefficient of beaconstreet statistically significant? How do you know? What about size?
How do we interpret these coefficients?

More on interpreting multiple regression

   Price = 6981 + 32936 beaconstreet + 409.4 size

If we compare 2 condos of the same size, the one on Beacon Street will cost 32936 more. Or: Holding size constant, condos on Beacon Street cost 32936 more. Or: Controlling for size, condos on Beacon Street cost 32936 more.

IN OTHER WORDS: Adding a possibly confounding variable to the regression removes the bias (due to that omitted variable) from the coefficient on the variable we are interested in (Beacon Street). We isolate the true "effect" of Beacon, which would otherwise be confounded with the facts that Beacon and size are related and that size affects price.

More on interpreting multiple regression

   Price = 520,729 - 46,969 Beacon_Street
   Price = 6,981 + 32,936 Beacon_Street + 409.4 size

If I want to know how much a similarly sized apartment costs on and off Beacon Street, I use the second regression: Beacon IS more expensive. I also learn something from the difference in the coefficients on Beacon Street. I learn that apartments on Beacon are ______ than others. (fill in: bigger / smaller)
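The standard omitted-variable-bias algebra says the simple-regression coefficient equals the multiple-regression coefficient plus the size coefficient times the average size gap between Beacon and non-Beacon condos. Rearranging gives a back-of-the-envelope answer to the fill-in question (a Python sketch using the rounded coefficients from the two equations above):

```python
# simple_b = multi_b + size_b * size_gap  =>  size_gap = (simple_b - multi_b) / size_b
simple_b = -46969    # coefficient on Beacon_Street with size omitted
multi_b = 32936      # coefficient on Beacon_Street controlling for size
size_b = 409.4       # coefficient on size

size_gap = (simple_b - multi_b) / size_b
print(round(size_gap, 1))   # negative: Beacon St. condos average roughly 195 sq ft smaller
```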

Exercise on interpreting regression (scratch cards)
Let's say I run a regression of drownings per capita on ice cream sales per capita per day and get

   drownings = .00010 + .00015 icecream

with both |t-stats| > 2. (The numbers are small because there aren't many drownings per person!) If I were to add in average daily temperature, I'd get the regression:

   drownings = b0 + b1 icecream + b2 temperature

Q3: What is the likely sign of b2? a) negative b) positive c) can't tell
Q4: What is the most likely value of b1?
a) .00015
b) a different significantly positive number
c) an insignificant number
d) a significantly negative number
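A simulation makes the exercise concrete (a Python sketch with invented numbers: temperature drives both ice cream sales and drownings, while ice cream has no direct effect on drownings; OLS is implemented by hand since the slides' regressions are run in Stata):

```python
import random

# Invented data-generating process: temperature is the common cause.
random.seed(0)
n = 5000
temp = [random.uniform(10, 35) for _ in range(n)]                        # daily temperature
ice = [0.5 + 0.08 * t + random.gauss(0, 0.3) for t in temp]              # ice cream sales
drown = [0.00010 + 0.00002 * t + random.gauss(0, 0.0001) for t in temp]  # drownings

def ols(y, *xs):
    """OLS coefficients [b0, b1, ...] via the normal equations (X'X)b = X'y."""
    rows = [[1.0] + [x[i] for x in xs] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    for col in range(k):                      # Gaussian elimination with partial pivoting
        piv = max(range(col, k), key=lambda r: abs(xtx[r][col]))
        xtx[col], xtx[piv] = xtx[piv], xtx[col]
        xty[col], xty[piv] = xty[piv], xty[col]
        for r in range(col + 1, k):
            f = xtx[r][col] / xtx[col][col]
            for c in range(col, k):
                xtx[r][c] -= f * xtx[col][c]
            xty[r] -= f * xty[col]
    b = [0.0] * k                             # back substitution
    for i in reversed(range(k)):
        b[i] = (xty[i] - sum(xtx[i][j] * b[j] for j in range(i + 1, k))) / xtx[i][i]
    return b

b_simple = ols(drown, ice)        # drownings on ice cream alone: slope comes out positive
b_multi = ols(drown, ice, temp)   # adding temperature: the ice cream slope collapses toward 0
print(b_simple[1], b_multi[1], b_multi[2])
```

In this invented setup the temperature coefficient b2 is positive and the ice cream coefficient b1 shrinks to roughly zero once temperature is controlled for.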

Multiple regression: Why use it?
There are 2 reasons why we use multiple regression:
To get closer to the "correct/causal" (unbiased) coefficient by controlling for confounding factors. (This is important for those of you trying to measure the effect of X on Y.)
To increase the predictive power of a regression; we'll soon learn how to measure this power. (This is important for those of you trying to predict, e.g., stock prices.)

Today we…
Review how to interpret the statistics reported for each regression coefficient (listed on the same line as the coefficient itself): standard errors, 95% confidence intervals, t-tests, p-values.
Multiple regression: how to interpret it and why use it.
Break into groups to talk about possible confounding factors.

Break into groups to discuss possible confounding factors
Those using a survey of people (ADD Health, GSS, ACS)
Those using country-level or state/municipality data (often cross-section/time series)
Those using sports data
Those using financial data (often cross-section/time series)