CENTRE FOR INNOVATION, RESEARCH AND COMPETENCE IN THE LEARNING ECONOMY

Session 3: Basic techniques for innovation data analysis. Part II: Introducing regression analysis

Taehyun Jung, taehyun.jung@circle.lu.se
CIRCLE, Lund University
15.15-17.00, December 10, 2012
For Survey of Quantitative Research, NORSI

Objectives of this session

Contents

Bivariate Linear Regression Model

Least squares: residuals

Highest R²

Unbiasedness

Efficiency

• Would you prefer to obtain your estimate by making a single random draw out of an unbiased sampling distribution with a small variance or out of an unbiased sampling distribution with a large variance?
• The best unbiased estimator is the efficient one, that is, the one with the smallest variance.
• BLUE: Best linear unbiased estimator
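The question on the efficiency slide can be illustrated with a small simulation. This is only a sketch under assumed settings that are not in the slides: a normal population with mean 10 and standard deviation 2, samples of 25 observations, and two unbiased estimators of the mean (the mean of all 25 observations and the mean of the first 5 only). Both sampling distributions are centred on the true value, but the second is much more spread out.

```python
import random
import statistics


def simulate_estimators(mu=10.0, sigma=2.0, n=25, reps=2000, seed=0):
    """Draw `reps` samples and apply two unbiased estimators of the mean.

    All parameter values are illustrative assumptions, not taken from the slides.
    """
    rng = random.Random(seed)
    full_sample_means = []   # estimator using all n observations
    small_sample_means = []  # estimator using only the first 5 observations
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        full_sample_means.append(statistics.fmean(sample))
        small_sample_means.append(statistics.fmean(sample[:5]))
    return full_sample_means, small_sample_means


full, small = simulate_estimators()
print("mean of estimates:    ", statistics.fmean(full), statistics.fmean(small))
print("variance of estimates:", statistics.variance(full), statistics.variance(small))
```

A single draw from the first sampling distribution is more likely to land close to the true mean, which is why the estimator with the smaller variance is preferred.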

Ordinary Least Squares

• The discrepancies between the actual and fitted values of Y are known as the residuals.
  – Note that the values of the residuals are not the same as the values of the disturbance term: the residuals are measured from the fitted line, whereas the disturbance term is measured from the true (unobserved) relationship.

Deriving linear regression coefficients

Conditions for Minimizing RSS

Deriving linear regression coefficients (cont’d)
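The equations on the two derivation slides did not survive in this transcript. As a reminder, the standard textbook derivation runs as follows, assuming the notation used elsewhere in the session (b1 for the intercept, b2 for the slope, e_i for the residuals); it may differ in presentation from what was shown on the slides.

```latex
\begin{align}
RSS &= \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - b_1 - b_2 X_i)^2 \\
\frac{\partial RSS}{\partial b_1} &= -2 \sum_{i=1}^{n} (Y_i - b_1 - b_2 X_i) = 0 \\
\frac{\partial RSS}{\partial b_2} &= -2 \sum_{i=1}^{n} X_i (Y_i - b_1 - b_2 X_i) = 0 \\
b_2 &= \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2},
\qquad b_1 = \bar{Y} - b_2 \bar{X}
\end{align}
```

The two first-order conditions are the conditions for minimizing RSS; solving them jointly gives the expressions for b1 and b2 in the last line.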

We chose the parameters of the fitted line so as to minimize the sum of the squares of the residuals. As a result, we derived the expressions for b1 and b2.

[Figure: Y plotted against X (observations X1 to Xn), showing the true model and the fitted line.]
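A minimal sketch of the least squares calculation, using the standard closed-form expressions for b1 and b2. The data are hypothetical: six X values, a "true" line with intercept 2 and slope 0.5, and a hand-picked disturbance term. The function name `ols_simple` is illustrative only. The example also shows that the residuals differ from the disturbances, because the fitted line is not the true line.

```python
def ols_simple(x, y):
    """Return (b1, b2): intercept and slope that minimize the residual sum of squares."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b2 = sxy / sxx
    b1 = y_bar - b2 * x_bar
    return b1, b2


# Hypothetical data: true model Y = 2 + 0.5 X + u with a made-up disturbance term u
beta1, beta2 = 2.0, 0.5
x = [1, 2, 3, 4, 5, 6]
u = [0.5, -1.0, 0.8, -0.3, 1.2, -0.6]
y = [beta1 + beta2 * xi + ui for xi, ui in zip(x, u)]

b1, b2 = ols_simple(x, y)
residuals = [yi - (b1 + b2 * xi) for xi, yi in zip(x, y)]

print("fitted intercept and slope:", b1, b2)
print("disturbances:", u)
print("residuals:   ", residuals)
print("sum of residuals:", sum(residuals))
```

The residuals sum to zero by construction of the fitted line, while the disturbances in this made-up sample do not, so the two sets of values cannot coincide.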

[Figure: Hourly earnings in 2002 plotted against years of schooling, defined as highest grade completed, for a sample of 540 respondents from the National Longitudinal Survey of Youth.]

Interpretation of a regression equation

• In this case there is only one variable, S, and its coefficient is 2.46. _cons, in Stata, refers to the constant. The estimate of the intercept is -13.93.

. reg EARNINGS S

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  1,   538) =  112.15
       Model |  19321.5589     1  19321.5589           Prob > F      =  0.0000
    Residual |  92688.6722   538  172.283777           R-squared     =  0.1725
-------------+------------------------------           Adj R-squared =  0.1710
       Total |  112010.231   539  207.811189           Root MSE      =  13.126

------------------------------------------------------------------------------
    EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   2.455321   .2318512    10.59   0.000     1.999876    2.910765
       _cons |  -13.93347   3.219851    -4.33   0.000    -20.25849   -7.608444
------------------------------------------------------------------------------

Interpretation of a regression equation

• Hourly earnings increase by $2.46 for each extra year of schooling.
• Literally, the constant indicates that an individual with no years of education would have to pay $13.93 per hour to be allowed to work.
  – Nonsense!
  – The only function of the constant term is to enable you to draw the regression line at the correct height on the scatter diagram.
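The interpretation can be checked by plugging values of S into the fitted equation. The coefficients below are copied from the `reg EARNINGS S` output; the function name `fitted_earnings` and the chosen values of S are illustrative only.

```python
# Coefficients reported by `reg EARNINGS S`
B1 = -13.93347  # intercept (_cons)
B2 = 2.455321   # coefficient of S


def fitted_earnings(schooling_years):
    """Fitted hourly earnings (in dollars) for a given number of years of schooling."""
    return B1 + B2 * schooling_years


for s in (0, 12, 13):
    print(f"S = {s:2d}: fitted hourly earnings = {fitted_earnings(s):7.2f}")

# One extra year of schooling changes the fitted value by exactly the slope
print("difference between S = 13 and S = 12:", fitted_earnings(13) - fitted_earnings(12))
```

The fitted value at S = 0 is simply the intercept, which is where the "pay to work" reading comes from; the difference between any two adjacent years of schooling is the slope.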

Testing a hypothesis relating to a regression coefficient

• You can see that the t statistic for the coefficient of S is enormous. We would reject the null hypothesis that schooling does not affect earnings at the 1% significance level (critical value about 2.59).

. reg EARNINGS S

------------------------------------------------------------------------------
    EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   2.455321   .2318512    10.59   0.000     1.999876    2.910765
       _cons |  -13.93347   3.219851    -4.33   0.000    -20.25849   -7.608444
------------------------------------------------------------------------------
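The t statistics in the output are the coefficient estimates divided by their standard errors (for the null hypothesis that the true coefficient is zero). A short sketch reproducing them from the reported figures; the critical value 2.59 is the approximate 1% value quoted on the slide, and the variable names are illustrative.

```python
# Estimates and standard errors copied from the `reg EARNINGS S` output
coefficients = {
    "S": (2.455321, 0.2318512),
    "_cons": (-13.93347, 3.219851),
}
CRITICAL_T_1PCT = 2.59  # approximate two-sided 1% critical value quoted on the slide

t_stats = {}
for name, (estimate, std_err) in coefficients.items():
    # t statistic for H0: the true coefficient equals zero
    t_stats[name] = estimate / std_err
    reject = abs(t_stats[name]) > CRITICAL_T_1PCT
    print(f"{name:>6}: t = {t_stats[name]:.2f}, reject H0 at 1%: {reject}")
```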

Testing a hypothesis relating to a regression coefficient: Confidence intervals

• 2.455 – 0.232 × 1.965 ≤ β2 ≤ 2.455 + 0.232 × 1.965
  – The critical value of t at the 5% significance level with 538 degrees of freedom is 1.965.
• 1.999 ≤ β2 ≤ 2.911

. reg EARNINGS S

------------------------------------------------------------------------------
    EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   2.455321   .2318512    10.59   0.000     1.999876    2.910765
       _cons |  -13.93347   3.219851    -4.33   0.000    -20.25849   -7.608444
------------------------------------------------------------------------------
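The interval can be recomputed from the unrounded figures in the output. This sketch uses the critical value 1.965 quoted on the slide; Stata uses a slightly more precise critical value, so its reported bounds differ from the ones computed here in the fourth decimal place.

```python
ESTIMATE = 2.455321   # coefficient of S from the regression output
STD_ERR = 0.2318512   # its standard error
T_CRIT_5PCT = 1.965   # critical value quoted on the slide (538 degrees of freedom)

half_width = T_CRIT_5PCT * STD_ERR
lower = ESTIMATE - half_width
upper = ESTIMATE + half_width

print(f"95% confidence interval for the coefficient of S: [{lower:.4f}, {upper:.4f}]")
print("Stata reports:                                      [1.999876, 2.910765]")
```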

Hypotheses concerning goodness of fit are tested via the F statistic

• The null hypothesis that we are going to test is that the model has no explanatory power.
• k is the number of parameters in the regression equation, which at present is just 2.
• n – k is, as with the t statistic, the number of degrees of freedom.
• F is a monotonically increasing function of R².
  – Why do we perform the test indirectly, through F, instead of directly through R²? After all, it would be easy to compute the critical values of R² from those for F.

For simple regression analysis, the F statistic is the square of the t statistic.

Calculation of F statistic

. reg EARNINGS S

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  1,   538) =  112.15
       Model |  19321.5589     1  19321.5589           Prob > F      =  0.0000
    Residual |  92688.6722   538  172.283777           R-squared     =  0.1725
-------------+------------------------------           Adj R-squared =  0.1710
       Total |  112010.231   539  207.811189           Root MSE      =  13.126
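The calculation itself is not legible in this transcript, so the following is a sketch using the standard formulas rather than a reproduction of the slide: F = [ESS/(k - 1)] / [RSS/(n - k)], equivalently F = [R²/(k - 1)] / [(1 - R²)/(n - k)]. All numbers are copied from the Stata output; the variable names are illustrative. The last lines check that, in simple regression, F equals the square of the t statistic for the slope.

```python
# Sums of squares and sample size from the `reg EARNINGS S` output
ESS = 19321.5589   # explained ("Model") sum of squares
RSS = 92688.6722   # residual sum of squares
TSS = 112010.231   # total sum of squares
n, k = 540, 2      # observations and parameters (intercept and slope)

# F computed from the sums of squares
f_from_ss = (ESS / (k - 1)) / (RSS / (n - k))

# F computed from R-squared
r_squared = ESS / TSS
f_from_r2 = (r_squared / (k - 1)) / ((1 - r_squared) / (n - k))

# In simple regression, F is the square of the slope's t statistic
t_slope = 2.455321 / 0.2318512
t_squared = t_slope ** 2

print(f"R-squared      = {r_squared:.4f}")
print(f"F from SS      = {f_from_ss:.2f}")
print(f"F from R2      = {f_from_r2:.2f}")
print(f"t squared      = {t_squared:.2f}")
```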

OLS Assumptions

Assumptions for OLS 1:

A.2 There is some variation in the regressor in the sample.

A.3 The disturbance term has zero expectation.

A.4 The disturbance term is homoscedastic

• We assume that the disturbance term is homoscedastic, meaning that its value in each observation is drawn from a distribution with constant population variance.
• Once we have generated the sample, the disturbance term will turn out to be greater in some observations, and smaller in others, but there should not be any reason for it to be more erratic in some observations than in others.

Consequences of Using OLS in the Presence of Heteroskedasticity

• OLS estimation still gives unbiased coefficient estimates, but they are no longer BLUE.
• This implies that if we still use OLS in the presence of heteroskedasticity, our standard errors could be inappropriate and hence any inferences we make could be misleading.
• Whether the standard errors calculated using the usual formulae are too big or too small will depend upon the form of the heteroskedasticity.
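The last point can be made concrete without simulation. The sketch below compares, for a simple regression, the true sampling variance of the OLS slope under a given pattern of disturbance standard deviations with the expected value of the usual variance estimate s²/Σ(X - X̄)². It relies on two standard results: Var(b2) = Σ(X_i - X̄)²σ_i² / [Σ(X_i - X̄)²]², and E[RSS] = Σσ_i²(1 - h_i), where h_i is the leverage of observation i. The X values and the two heteroskedasticity patterns are hypothetical, and `slope_variances` is an illustrative name.

```python
def slope_variances(x, sigma):
    """Return (true variance of the OLS slope, expected value of the usual variance estimate).

    sigma[i] is the standard deviation of the disturbance term in observation i.
    """
    n = len(x)
    x_bar = sum(x) / n
    d = [xi - x_bar for xi in x]
    sxx = sum(di ** 2 for di in d)
    # True sampling variance of b2 when the disturbance variance differs by observation
    true_var = sum(di ** 2 * si ** 2 for di, si in zip(d, sigma)) / sxx ** 2
    # Leverage of each observation in a simple regression with an intercept
    leverage = [1 / n + di ** 2 / sxx for di in d]
    # Expected value of s^2 = RSS / (n - 2)
    expected_s2 = sum(si ** 2 * (1 - hi) for si, hi in zip(sigma, leverage)) / (n - 2)
    usual_var = expected_s2 / sxx
    return true_var, usual_var


x = list(range(1, 21))                        # hypothetical regressor values
x_bar = sum(x) / len(x)
constant = [3.0] * len(x)                     # homoscedastic benchmark
growing = [float(xi) for xi in x]             # standard deviation rises with X
central = [10 - abs(xi - x_bar) for xi in x]  # standard deviation largest near the mean of X

for label, sigma in (("constant", constant), ("growing", growing), ("central", central)):
    true_var, usual_var = slope_variances(x, sigma)
    print(f"{label:>8}: true variance = {true_var:.4f}, usual formula (expected) = {usual_var:.4f}")
```

With constant variance the two quantities coincide. When the disturbance is more erratic at large values of X, the usual formula understates the variance of the slope; when it is more erratic near the mean of X, the usual formula overstates it.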

Multiple Regression

• An earnings function model where hourly earnings, EARNINGS, depend on years of schooling (highest grade completed), S, and years of work experience, EXP.
• Note that the interpretation of the model does not depend on whether S and EXP are correlated or not.
• However, we do assume that the effects of S and EXP on EARNINGS are additive. The impact of a difference in S on EARNINGS is not affected by the value of EXP, or vice versa.
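The model equation is not legible in this transcript. Written out, and assuming the same numbering of parameters as in the simple regression (β1 for the intercept, u for the disturbance term), the earnings function described here would take the following form; the partial derivatives show what additivity means.

```latex
\begin{align}
EARNINGS &= \beta_1 + \beta_2 S + \beta_3 EXP + u \\
\frac{\partial\, EARNINGS}{\partial S} &= \beta_2, \qquad
\frac{\partial\, EARNINGS}{\partial EXP} = \beta_3
\end{align}
```

Neither derivative depends on the other explanatory variable, which is the sense in which the effects of S and EXP are additive.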

Calculating regression coefficients

• The expression for b1 is a straightforward extension of the expression for it in simple regression analysis.
• However, the expressions for the slope coefficients are considerably more complex than that for the slope coefficient in simple regression analysis.
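The expressions themselves are missing from this transcript. The sketch below uses the standard closed-form results for a regression with two explanatory variables, written in terms of sample variances and covariances; they may be laid out differently on the original slide. The data are hypothetical and constructed without a disturbance term, so the formulas should recover the coefficients 1, 2 and 3 exactly (up to rounding error). The function names are illustrative.

```python
def mean(values):
    return sum(values) / len(values)


def cov(a, b):
    """Sample covariance (dividing by n); cov(a, a) is the sample variance."""
    a_bar, b_bar = mean(a), mean(b)
    return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b)) / len(a)


def ols_two_regressors(x2, x3, y):
    """Return (b1, b2, b3) for the fitted equation Y = b1 + b2*X2 + b3*X3."""
    denom = cov(x2, x2) * cov(x3, x3) - cov(x2, x3) ** 2
    b2 = (cov(x2, y) * cov(x3, x3) - cov(x3, y) * cov(x2, x3)) / denom
    b3 = (cov(x3, y) * cov(x2, x2) - cov(x2, y) * cov(x2, x3)) / denom
    # The intercept extends the simple-regression expression by one extra term
    b1 = mean(y) - b2 * mean(x2) - b3 * mean(x3)
    return b1, b2, b3


# Hypothetical data generated exactly from Y = 1 + 2*X2 + 3*X3
x2 = [1, 2, 3, 4, 5]
x3 = [2, 1, 4, 3, 6]
y = [1 + 2 * a + 3 * b for a, b in zip(x2, x3)]

b1, b2, b3 = ols_two_regressors(x2, x3, y)
print("b1, b2, b3 =", b1, b2, b3)
```

The intercept follows the same pattern as in simple regression, while each slope involves the covariance between the two explanatory variables, which is what makes the expressions more complex.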

Interpretation of a regression equation

• The regression indicates that hourly earnings increase by $2.68 for every extra year of schooling and by $0.56 for every extra year of work experience.
• Intercept: literally, an individual with no schooling and no work experience would have hourly earnings of -$26.49. Obviously, this is impossible. The lowest value of S in the sample was 6. We have obtained a nonsense estimate because we have extrapolated too far from the data range.

. reg EARNINGS S EXP

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  2,   537) =   67.54
       Model |  22513.6473     2  11256.8237           Prob > F      =  0.0000
    Residual |  89496.5838   537  166.660305           R-squared     =  0.2010
-------------+------------------------------           Adj R-squared =  0.1980
       Total |  112010.231   539  207.811189           Root MSE      =   12.91

------------------------------------------------------------------------------
    EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   2.678125   .2336497    11.46   0.000     2.219146    3.137105
         EXP |   .5624326   .1285136     4.38   0.000     .3099816    .8148837
       _cons |  -26.48501    4.27251    -6.20   0.000    -34.87789   -18.09213
------------------------------------------------------------------------------

Properties of the multiple regression coefficients

Only A.2 is different.

Multicollinearity

• The inclusion of the new term, EXPSQ, has had a dramatic effect on the coefficient of EXP.
• The high correlation causes the standard error of EXP to be larger than it would have been if EXP and EXPSQ had been less highly correlated, warning us that the point estimate is unreliable.

. reg EARNINGS S EXP EXPSQ

------------------------------------------------------------------------------
    EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   2.754372   .2417286    11.39   0.000     2.279521    3.229224
         EXP |  -.2353907    .665197    -0.35   0.724    -1.542103    1.071322
       EXPSQ |   .0267843   .0219115     1.22   0.222    -.0162586    .0698272
       _cons |  -22.21964   5.514827    -4.03   0.000    -33.05297   -11.38632
------------------------------------------------------------------------------

. reg EARNINGS S EXP

------------------------------------------------------------------------------
    EARNINGS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   2.678125   .2336497    11.46   0.000     2.219146    3.137105
         EXP |   .5624326   .1285136     4.38   0.000     .3099816    .8148837
       _cons |  -26.48501    4.27251    -6.20   0.000    -34.87789   -18.09213
------------------------------------------------------------------------------

. cor EXP EXPSQ
(obs=540)

             |      EXP    EXPSQ
-------------+------------------
         EXP |   1.0000
       EXPSQ |   0.9812   1.0000
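Two quick calculations help explain the output. First, a variable and its square are almost always highly correlated over a positive range; the values 1 to 30 below are hypothetical, not the EXP data. Second, the correlation 0.9812 reported by `cor` implies a variance inflation of 1/(1 - r²), whose square root can be compared with the observed increase in the standard error of EXP. That comparison is rough: the pairwise formula is exact only in a model with two explanatory variables, whereas this regression also contains S and the residual variance changes between the two fits.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    sab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    saa = sum((ai - a_bar) ** 2 for ai in a)
    sbb = sum((bi - b_bar) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)


# Hypothetical experience values: a variable and its square move closely together
exp_values = list(range(1, 31))
exp_squared = [v ** 2 for v in exp_values]
r_demo = pearson(exp_values, exp_squared)
print(f"correlation between x and x^2 for x = 1..30: {r_demo:.4f}")

# Figures taken from the Stata output
r_reported = 0.9812                    # cor EXP EXPSQ
se_without_expsq = 0.1285136           # s.e. of EXP in `reg EARNINGS S EXP`
se_with_expsq = 0.665197               # s.e. of EXP in `reg EARNINGS S EXP EXPSQ`

inflation = 1 / (1 - r_reported ** 2)  # variance inflation implied by the correlation
print(f"implied s.e. inflation:  {math.sqrt(inflation):.2f}")
print(f"observed s.e. inflation: {se_with_expsq / se_without_expsq:.2f}")
```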

• When high correlations among the explanatory variables lead to erratic point estimates of the coefficients, large standard errors and unsatisfactorily low t statistics, the regression is said to be suffering from multicollinearity.
  – The standard errors and t tests remain valid.
• Multicollinearity may also be caused by an approximate linear relationship among the explanatory variables. When there are only two, an approximate linear relationship means there will be a high correlation, but this is not always the case when there are more than two.
• Note that multicollinearity does not cause the regression coefficients to be biased.

• Reduce the variance of the disturbance term by including further relevant variables in the model.
• Increase the number of observations.
• Increase MSD(X2) (the variation in the explanatory variables).
  – For example, if you were planning a household survey with the aim of investigating how expenditure patterns vary with income, you should make sure that the sample included relatively rich and relatively poor households as well as middle-income households.
• Reduce [symbol missing in the transcript]
• Combine the correlated variables.
• Drop some of the correlated variables.
  – However, this approach to multicollinearity is dangerous. It is possible that some of the variables with insignificant coefficients really do belong in the model and that the only reason their coefficients are insignificant is that there is a problem of multicollinearity.
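These measures can be read against the standard expression for the variance of a slope coefficient in a regression with two explanatory variables. The formula is not legible in this transcript, so the version below is the usual textbook result, with notation assumed to match the MSD(X2) used above.

```latex
\begin{equation}
\sigma^2_{b_2} = \frac{\sigma_u^2}{n \, \mathrm{MSD}(X_2)} \times \frac{1}{1 - r^2_{X_2, X_3}},
\qquad
\mathrm{MSD}(X_2) = \frac{1}{n} \sum_{i=1}^{n} (X_{2i} - \bar{X}_2)^2
\end{equation}
```

Reducing the disturbance variance, increasing n and increasing MSD(X2) each shrink the first factor, while combining or dropping correlated variables acts on the correlation in the second factor.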

Kennedy’s 10 commandments of applied econometrics

• Use common sense and economic theory.
• Avoid Type III errors.
  – Producing the right answer to the wrong question is called a Type III error.
  – Place relevance before mathematical elegance.
• Know the context.
  – Do not perform ignorant statistical analyses.
• Inspect the data.
  – Place data cleanliness ahead of econometric godliness.
• Keep it sensibly simple.
  – Do not talk Greek without knowing the English translation.
• Look long and hard at your results.
  – Apply the laugh test.
• Beware the costs of data mining.
  – E.g. tailoring one’s specification to the data, resulting in a specification that is misleading.
• Be prepared to compromise.
  – Should a proxy be used? Can sample attrition be ignored?
• Do not confuse significance with substance.
• Report a sensitivity analysis.
