
1 QM222 Class 16 & 17 Today’s New topic: Estimating nonlinear relationships
QM222 Fall 2017 Section A1

2 To-dos
I am available today 11-3:30. If you don’t have your data completed, come then or make a special appointment.

3 Today
Estimating nonlinear relationships - part 1
Future special topics:
- Omitted variable bias (leaving things out of the regression)
- One variable with different slopes
We already covered:
- A dummy variable as the dependent (Y) variable
- Multiple-category dummies
Time series analysis - see me if you have time-series cross-section data

4 Estimating nonlinear relationships
Could the relationship be non-linear, and if so, how can we estimate this using linear regression?

5 Non-linear relationships between Y and X
Sometimes, the relationship between a numerical Y variable and a numerical X variable is unlikely to be linear. If you fit a straight line anyway, you may measure a very small, insignificant slope. e.g. If you ran a regression on the data in this graph, the slope coefficient would be zero.
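One way to see how a strong relationship can still give a slope of zero is the short Python sketch below (Python rather than the Stata used in class). It assumes made-up data that trace a perfectly symmetric hump; the data and the helper name `ols_slope` are illustrative and not from the course.

```python
# Simple OLS slope of y on x: cov(x, y) / var(x).
def ols_slope(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    return cov_xy / var_x

# Made-up data (an assumption): a perfect upside-down U that peaks at x = 5.
xs = list(range(11))                    # 0, 1, ..., 10
ys = [25 - (x - 5) ** 2 for x in xs]    # rises to 25, then falls back to 0

slope = ols_slope(xs, ys)
print(slope)   # the rise and the fall cancel out in the straight-line fit
```

Y clearly depends on X here, yet the straight-line slope does not show it, which is the motivation for adding a non-linear term.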

6 Many of you believe that you might have nonlinear relationships
e.g. You want to see how age affects job satisfaction (measured on a scale of 1 to 10), so you think about running the regression:
regress jobsatis age
Is this likely to be linear? Maybe job satisfaction goes up with age and then down again. For instance, maybe you do not believe that an extra year changes job satisfaction by the same constant amount whether it is the difference between 24 and 25 years old or the difference between 60 and 61 years old.

7 Many of you believe that you might have nonlinear relationships
Note that this section is only applicable to numerical variables. You cannot do these nonlinear things with dummy variables. Why not? $1^2 = 1$ and $0^2 = 0$. Squaring a dummy variable just gives you the same dummy variable.
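A quick check of that point, as a Python sketch with a made-up dummy variable (the values are illustrative only):

```python
# Hypothetical dummy variable (1 = yes, 0 = no) for six observations.
dummy = [1, 0, 0, 1, 1, 0]

# Square every value, as you would when building a quadratic term.
dummy_sq = [d ** 2 for d in dummy]

# The squared column is identical to the original column, so it adds
# no new information to a regression.
print(dummy_sq == dummy)
```

Because the two columns are identical, a regression could not estimate separate coefficients for them.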

8 To solve the problem of Y possibly increasing with X and then decreasing:
You simply add to the regression a new X variable that is a non-linear version of the old variable. My suggestion: estimate a quadratic by making a new variable $X^2$ and running the regression with both the linear and the non-linear (quadratic) term in the equation. If you don’t know whether a relationship is nonlinear, you can estimate the regression assuming it is nonlinear (e.g. quadratic) and then examine the results to see if this assumption is correct.

9 Quadratic: $Y = b_0 + b_1 X + b_2 X^2$
In high school you learned that quadratic equations look like this (a U or an upside-down U). So by adding a squared term, you can estimate these shapes.
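To make "adding a squared term" concrete, here is a minimal Python sketch of an ordinary least squares fit of Y on a constant, X and X squared. It is not the Stata routine used in class: the helper names `solve_linear_system` and `fit_quadratic` and the data are assumptions made for illustration, and the data are generated exactly from a known quadratic so the recovered coefficients can be checked.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                   # back-substitution
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x

def fit_quadratic(xs, ys):
    """OLS estimates (b0, b1, b2) for Y = b0 + b1*X + b2*X^2."""
    design = [[1.0, x, x ** 2] for x in xs]          # constant, X, X squared
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(3)]
           for i in range(3)]
    xty = [sum(row[i] * y for row, y in zip(design, ys)) for i in range(3)]
    return solve_linear_system(xtx, xty)             # normal equations

# Made-up data generated exactly from Y = 2 + 3X - 0.5X^2.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8]
ys = [2 + 3 * x - 0.5 * x ** 2 for x in xs]

b0, b1, b2 = fit_quadratic(xs, ys)
print(round(b0, 6), round(b1, 6), round(b2, 6))
```

The regression is still linear in its coefficients; the only change is that the squared variable enters as one more X column.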

10 However, a regression with a quadratic can estimate ANY part of these shapes
So using a quadratic does not mean that the curve ever actually needs to change from a positive to a negative slope, or vice versa, within your data.

11 How do you know whether the relationship really is nonlinear?
Put a nonlinear term (e.g. a squared term) in the regression and let its |t-stat| tell you if it belongs there.
If the |t-stat| > 2, you are more than 95% confident that the relationship is nonlinear.
However, it’s a good idea to keep the quadratic term as long as the |t-stat| > 1, which means that you are at least 68% confident the relationship is nonlinear.
Recall:
- The adjusted $R^2$ goes up if you add a variable with |t| >= 1 (so that you are >= 68% certain that the coefficient is not 0).
- The adjusted $R^2$ goes down if you add a variable with |t| < 1 (so that you are < 68% certain that the coefficient is not 0).
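The 68% and 95% figures can be checked with a few lines of Python. This sketch assumes the large-sample normal approximation to the t distribution (with few observations the true confidence is somewhat lower); the function name `two_sided_confidence` is illustrative.

```python
import math

def two_sided_confidence(t_stat):
    """P(|Z| < |t|) for a standard normal Z (large-sample approximation)."""
    return math.erf(abs(t_stat) / math.sqrt(2))

# Confidence that the coefficient is not zero at the two rule-of-thumb cutoffs.
for t in (1.0, 2.0):
    print(t, round(two_sided_confidence(t), 3))
```

Under that approximation, |t| = 1 corresponds to roughly 68% confidence and |t| = 2 to just over 95%.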

12 Example of testing nonlinear relationships by adding in a squared term
Example: I know annual visitors to a national park (Cape Canaveral). I want to know if they are growing (or falling) at a constant rate over time, or not. Assume the data is in chronological order. First I make the variables:
gen time = _n
gen timesq = time^2
Then I run a regression with BOTH variables, time and timesq.

13 Here are regressions of visitors (to Cape Canaveral), first on time, then on time AND timesq
Is the relationship nonlinear? Are visitors growing/shrinking, and at a constant rate?
. regress annualvisitors time
[Stata regression output: F(1, 21); Root MSE = 2.5e+05; rows for time and _cons; remaining numeric values not preserved]
. regress annualvisitors time timesq
[Stata regression output: F(2, 20); Root MSE = 1.3e+05; rows for time, timesq and _cons; remaining numeric values not preserved]

14 Sketching that Quadratic
$\text{Visitors} = 1102401 + 118498\,\text{time} - 5374\,\text{time}^2$
The linear term is positive, so at a small X (e.g. time = 0.1) the slope is positive. The squared term is negative, so the slope eventually becomes negative. So the general shape is an upside-down U. But which part of the curve is it? For those who don’t think in derivatives, plug high, medium and low values for X into the original equation. In this data, time goes from 1 to 23, so:
At time = 1, Visitors = $1102401 + 118498(1) - 5374(1^2) = 1{,}215{,}525$
At time = 10, Visitors = $1102401 + 118498(10) - 5374(10^2) = 1{,}749{,}981$
At time = 23, Visitors = $1102401 + 118498(23) - 5374(23^2) = 985{,}009$
So over these 23 years, predicted visitors go up, then back down again.
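The plug-in calculation can be reproduced with a small Python sketch. The coefficients are the ones in the fitted equation on this slide; the function name `predicted_visitors` is just an illustrative choice.

```python
def predicted_visitors(time):
    """Fitted quadratic from the slide: 1102401 + 118498*time - 5374*time^2."""
    return 1102401 + 118498 * time - 5374 * time ** 2

# Low, medium and high values of time within the data's range (1 to 23).
for t in (1, 10, 23):
    print(t, predicted_visitors(t))
```

Comparing the three predictions shows the rise and then the fall without needing any calculus.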

15 Sketching the Quadratic using calculus (easier!)
$\text{Visitors} = 1102401 + 118498\,\text{time} - 5374\,\text{time}^2$
Calculus tells us the slope: $d\,\text{Visitors}/d\,\text{time} = 118498 - 2 \cdot 5374\,\text{time}$
The slope gets smaller as time increases. At the top of this curve, the slope is exactly zero. So solve slope = $118498 - 2 \cdot 5374\,\text{time} = 0$:
$\text{time} = 118498 / (2 \cdot 5374) = 11.03$
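The same turning-point calculation works for any fitted quadratic. The Python sketch below uses the coefficients from the visitors equation (118498 on time, -5374 on time squared); the function names `turning_point` and `slope_at` are illustrative, not part of the course material.

```python
def turning_point(b1, b2):
    """X at which the slope b1 + 2*b2*X of a quadratic equals zero."""
    return -b1 / (2 * b2)

def slope_at(b1, b2, x):
    """Slope of Y = b0 + b1*X + b2*X^2 at a given X."""
    return b1 + 2 * b2 * x

b1, b2 = 118498, -5374          # coefficients on time and time squared
peak_time = turning_point(b1, b2)

print(round(peak_time, 2))      # where predicted visitors stop rising
print(slope_at(b1, b2, 1))      # positive early in the sample
print(slope_at(b1, b2, 23))     # negative at the end of the sample
```

Since the turning point lies inside the observed range of time (1 to 23), the data cover both the rising and the falling part of the curve.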

16 Another kind of non-linear term [NOT ON TEST]
You believe that a 1% increase in X will have the same percentage (%) effect on Y no matter what X you start at. e.g. You believe a 1 percent increase in price has a constant percentage effect on sales.
Mathematical rule: If $\ln Y = b_0 + b_1 \ln X$, then $b_1$ represents $\%\Delta Y / \%\Delta X$ (for small changes), or, the percentage change in Y when X changes by 1%.
(ln is the natural log, i.e. the log to the base e. You could also use logs to the base 10. Either works.)
So just make two new variables from yvariable and xvariable:
gen lnY = ln(yvariable)
gen lnX = ln(xvariable)
regress lnY lnX
Then the coefficient on lnX will tell you the percentage change in Y when X changes by 1%.
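A small Python sketch of the log-log rule, under the assumption of a made-up demand curve with a constant elasticity of -1.5 (the prices, the 5000 scale factor and the helper names are all illustrative). Regressing ln(sales) on ln(price) should return that elasticity, and a 1% price increase should change sales by about the same percentage at any starting price.

```python
import math

def ols_slope(xs, ys):
    """Simple OLS slope of y on x: cov(x, y) / var(x)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    return cov_xy / var_x

def sales_at(price):
    """Made-up demand curve (an assumption): sales = 5000 * price^(-1.5)."""
    return 5000 * price ** -1.5

prices = [1, 2, 4, 5, 8, 10, 16, 20]
ln_price = [math.log(p) for p in prices]
ln_sales = [math.log(sales_at(p)) for p in prices]

b1 = ols_slope(ln_price, ln_sales)   # coefficient on ln(price)

def pct_change_in_sales(price):
    """Percentage change in sales when price rises by 1% from `price`."""
    before, after = sales_at(price), sales_at(price * 1.01)
    return 100 * (after - before) / before

print(round(b1, 4))
print(round(pct_change_in_sales(2), 3), round(pct_change_in_sales(20), 3))
```

The percentage effect is close to, but not exactly, the coefficient because the rule holds for small changes.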

17 A case when logs might be useful?
If you have skewed data (like lifetime gross in movies), you could just regress ln(Lifetime gross) = b0 + b1 ln(metascore)

18 Of course, we said last time that with a very skewed Y-variable, you could…
- You could predict the median, not the average, income:
qreg incwage age
- You could top-code the variable (i.e. treat all incomes above a certain amount, your choice, as that top amount), e.g. make all incomes above $400,000 into $400,000:
replace incwage = 400000 if incwage > 400000
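To see why these two approaches help with a skewed variable, here is a Python sketch on made-up incomes with one extreme value. It only illustrates the top-coding step and the mean-versus-median contrast; it is not a median regression like qreg, and the income values are assumptions.

```python
import statistics

# Made-up incomes with one extreme value that skews the distribution.
incomes = [30000, 42000, 55000, 61000, 75000, 90000, 2500000]
TOP = 400000   # top-code threshold used on the slide

# Top-coding: every income above the threshold is set to the threshold.
topcoded = [min(income, TOP) for income in incomes]

print(statistics.mean(incomes), statistics.median(incomes))
print(statistics.mean(topcoded), statistics.median(topcoded))
```

The single extreme income pulls the mean far above the typical value, while the median barely notices it; top-coding shrinks the mean's distortion and leaves the median unchanged.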

19 Which non-linear way of dealing with things is best with a skewed Y?
There is no one best way. It is trial and error. It is an art rather than a science. BUT ONLY use methods that you understand.

20 Comparing regressions with skewed Y
Regular variable:
. regress income age
[Stata regression output: Number of obs = 150,000; rows for age and _cons; remaining numeric values not preserved]
With a quadratic:
. regress income age agesq
[Stata regression output: rows for age, agesq and _cons; numeric values not preserved]

21 Comparing regressions
Regular variable:
. regress income age
[Stata regression output: Number of obs = 150,000; rows for age and _cons; remaining numeric values not preserved]
With topcoding (if income>400000):
. regress incometop age
[Stata regression output: rows for age and _cons; numeric values not preserved]

22 Comparing regressions
With a quadratic:
. regress income age agesq
[Stata regression output: Number of obs = 150,000; rows for age, agesq and _cons; remaining numeric values not preserved]
With topcoding (if income>400000) AND a quadratic:
. regress incometop age agesq
[Stata regression output: rows for age, agesq and _cons; numeric values not preserved]
Best adj Rsq.

23 Comparing regressions
With topcoding (if income>400000) AND a quadratic:
. regress incometop age agesq
[Stata regression output: Number of obs = 150,000; rows for age, agesq and _cons; remaining numeric values not preserved]
With logs:
. regress lnincome lnage
[Stata regression output: rows for lnage and _cons; numeric values not preserved]
This is the best adj Rsq, but you are fitting a whole different Y-variable. It does look very good from the t-stat on lnage.

24 Comparing regressions
With logs:
. regress lnincome lnage
[Stata regression output: Number of obs = 150,000; rows for lnage and _cons; remaining numeric values not preserved]
With logged income and an age quadratic (rather than using log age):
. regress lnincome age agesq
[Stata regression output: rows for age, agesq and _cons; numeric values not preserved]
This is the best fit so far, and it IS the same lnY variable as above.

25 Comparing regressions
With logs for income and an age quadratic (rather than using log age):
. regress lnincome age agesq
[Stata regression output: Number of obs = 150,000; rows for age, agesq and _cons; remaining numeric values not preserved]
Finally, with qreg to get a prediction for median income, with age and agesq:
. qreg income age agesq
[Stata median regression output: Raw sum of deviations 3.35e+09 (about 69000); rows for age, agesq and _cons; remaining values, including the Pseudo R2, not preserved]
Think of the pseudo R-sq as an adjusted R-sq. This does NOT fit as well as lnincome.

26 To sum today up
There are many ways to deal with a skewed Y variable:
- Top-code it
- Take its log
- Predict the median rather than the average
There are many ways to deal with an X variable when you think the Y-X relationship is nonlinear:
- First look at the scatter diagram. It helps.
- Then add a quadratic X term and see if |t| > 1 [EASIEST]
- Or add a different kind of non-linear X term (e.g. log of X)
Remember: If you use logs of both sides, the coefficient can be interpreted as the percentage change in Y when X changes by 1%.

27 Which non-linear way of dealing with things is best with a skewed Y?
There is no one best way. It is trial and error. It is an art rather than a science. BUT ONLY use methods that you understand.

