Biostat 200 Lecture 10

Simple linear regression
Population regression equation: μ y|x = α + βx
α and β are constants and are called the coefficients of the equation
α is the y-intercept, the mean value of Y when X = 0, which is μ y|0
The slope β is the change in the mean value of y that corresponds to a one-unit increase in x
E.g., X = 3 vs. X = 2: μ y|3 − μ y|2 = (α + β·3) − (α + β·2) = β

Simple linear regression
The linear regression equation is y = α + βx + ε
The error, ε, is the distance a sample value y has from the population regression line
Since μ y|x = α + βx, we have y − μ y|x = ε

Simple linear regression
Assumptions of linear regression:
– X's are measured without error
– For each value of x, the y's are normally distributed with mean μ y|x and standard deviation σ y|x
– μ y|x = α + βx
– Homoscedasticity (σ y|x is the same for every value of x)
– All the y_i's are independent

Simple linear regression
The regression line equation is ŷ = α̂ + β̂x
The "best" line is the one that finds the α̂ and β̂ that minimize the sum of the squared residuals Σeᵢ² (hence the name "least squares")
We are minimizing the sum of the squares of the residuals eᵢ = yᵢ − ŷᵢ
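For reference, this minimization has the familiar closed-form solution:
β̂ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²
α̂ = ȳ − β̂x̄
A consequence of the second formula is that the fitted line always passes through the point (x̄, ȳ).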

Simple linear regression example: regression of FEV on age
FEV = α̂ + β̂·age
Stata syntax: regress yvar xvar
. regress fev age
[Stata output: ANOVA table (Model, Residual, Total), Number of obs, F(1, 652), R-squared, Adj R-squared, Root MSE, and the coefficient table (Coef., Std. Err., t, P>|t|, 95% Conf. Interval) for age and _cons; numeric values omitted]
β̂ = the Coef. listed for age; α̂ = the Coef. listed for _cons (short for constant)


Inference for regression coefficients
We can use these estimates and their standard errors to test the null hypothesis H₀: β = β₀
The test statistic for this is t = (β̂ − β₀) / se(β̂)
It follows the t distribution with n − 2 degrees of freedom under the null hypothesis
95% confidence interval for β: ( β̂ − t n−2,.025 · se(β̂), β̂ + t n−2,.025 · se(β̂) )
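As a minimal sketch, this t statistic and confidence interval can be reproduced by hand in Stata after running regress fev age, using the stored results _b[age], _se[age], and e(df_r); here the hypothesized value is β₀ = 0:
* t statistic for H0: beta = 0
display _b[age] / _se[age]
* lower and upper 95% confidence limits for beta
display _b[age] - invttail(e(df_r), .025)*_se[age]
display _b[age] + invttail(e(df_r), .025)*_se[age]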

Inference for predicted values
We might want to estimate the mean value of y at a particular value of x
E.g., what is the mean FEV for children who are 10 years old?
ŷ = α̂ + β̂·x evaluated at x = 10, plugging in the estimated intercept and age coefficient from the regression output above, gives the estimated mean FEV in liters

Inference for predicted values
We can construct a 95% confidence interval for the estimated mean:
( ŷ − t n−2,.025 · se(ŷ), ŷ + t n−2,.025 · se(ŷ) )
where se(ŷ) = s y|x √( 1/n + (x − x̄)² / Σ(xᵢ − x̄)² )
Note what happens to the terms in the square root when n is large

Stata will calculate the fitted regression values and the standard errors:
– regress fev age
– predict fev_pred, xb -> predicted mean values (ŷ)
– predict fev_predse, stdp -> se of the ŷ values
(fev_pred and fev_predse are new variable names that I made up)
You don't have to calculate these to get a plot with the 95% CI: twoway (lfitci fev age)
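If you do want the confidence limits for the mean as variables, a minimal sketch (fev_lo and fev_hi are additional made-up variable names):
regress fev age
predict fev_pred, xb
predict fev_predse, stdp
* 95% CI for the mean FEV at each observed age
gen fev_lo = fev_pred - invttail(e(df_r), .025)*fev_predse
gen fev_hi = fev_pred + invttail(e(df_r), .025)*fev_predse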

. list fev age fev_pred fev_predse
[listing of fev, age, the predicted mean fev_pred, and its standard error fev_predse for selected observations; values omitted]

twoway (scatter fev age) (lfitci fev age, ciplot(rline) blcolor(black)), legend(off) title(95% CI for the predicted means for each age)
Note that the CIs get wider as you get farther from x̄; but here n is large, so the CI is still very narrow

The 95% confidence intervals get much wider with a small sample size

Prediction intervals
The intervals we just made were for means of y at particular values of x
What if we want to predict the FEV value for an individual child at age 10?
Same thing – plug into the regression equation: ỹ = α̂ + β̂·10, the same point estimate (in liters) as ŷ above
But the standard error of ỹ is not the same as the standard error of ŷ

Prediction intervals
The standard error for predicting an individual value is se(ỹ) = s y|x √( 1 + 1/n + (x − x̄)² / Σ(xᵢ − x̄)² )
This differs from se(ŷ) only by the extra variance of y (the leading 1) in the formula
But it makes a big difference: there is much more uncertainty in predicting a future value than in predicting a mean
Stata will calculate these using predict fev_predse_ind, stdf (f is for forecast)
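A parallel sketch for the 95% prediction interval limits, reusing fev_pred from above (fev_pi_lo and fev_pi_hi are made-up names):
predict fev_predse_ind, stdf
gen fev_pi_lo = fev_pred - invttail(e(df_r), .025)*fev_predse_ind
gen fev_pi_hi = fev_pred + invttail(e(df_r), .025)*fev_predse_ind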

. list fev age fev_pred fev_predse fev_pred_ind
[listing that adds the forecast standard error for each observation; values omitted]

twoway (scatter fev age) (lfitci fev age, ciplot(rline) blcolor(black)) (lfitci fev age, stdf ciplot(rline) blcolor(red)), legend(off) title(95% prediction interval and CI)
Note the width of the confidence intervals for the means at each x versus the width of the prediction intervals

The intervals are wider farther from x̄, but that is only apparent for small n, because most of the width is due to the added s y|x

Model fit
A summary of the model fit is the coefficient of determination, R²
R² represents the proportion of the variability in y that is removed (explained) by performing the regression on X
R² is calculated from the regression output as MSS/TSS (model sum of squares over total sum of squares)

. regress fev age
[Stata output as shown before; the R-squared value equals the Model SS divided by the Total SS]

Model fit
The F statistic compares the fitted model to a model with just ȳ (no predictor)
The statistic is F = Model MS / Residual MS = (Model SS / 1) / (Residual SS / (n − 2)), with 1 and n − 2 degrees of freedom

Model fit
When there is only one independent variable in the model, these are equivalent tests:
– The F test that compares the model fit to the null model
– The test that β = 0
– The test that r = 0 (Pearson correlation)

Model fit -- Residuals
Residuals are the difference between the observed y values and the regression line at each value of x: eᵢ = yᵢ − ŷᵢ
If all the points lie along a straight line, the residuals are all 0
If there is a lot of variability at each level of x, the residuals are large
The sum of the squared residuals is what was minimized in the least squares method of fitting the line


Residuals
We examine the residuals using scatter plots
We plot the fitted values ŷᵢ on the x-axis and the residuals yᵢ − ŷᵢ on the y-axis
We use the fitted values because they have the effect of the independent variable removed
To get this plot of the residuals versus the fitted values in Stata:
regress fev age
rvfplot
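If you want the fitted values and residuals as variables rather than just the built-in plot, a minimal sketch (fev_fit and fev_res are made-up names):
regress fev age
* fitted values yhat
predict fev_fit, xb
* residuals y - yhat
predict fev_res, residuals
* same picture as rvfplot
scatter fev_res fev_fit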

rvfplot, title(Fitted values versus residuals for regression of FEV on age)

This plot shows that as the fitted value of FEV increases, the spread of the residuals increases – this suggests heteroscedasticity
We would get a similar plot if we plotted age on the x-axis: rvpplot age, name(res_v_age)
We had a hint of this when looking at the box plots of FEV by age groups in the previous lecture

graph box fev, over(age) title(FEV by age)

Note that heteroscedasticity does not bias the estimates of the parameters, but it does reduce the precision of the estimates

Transformations
One way to deal with this is to transform either x or y or both
A common transformation is the log transformation
Log transformations bring large values closer to the rest of the data
There are also methods to correct the standard errors for heteroscedasticity other than transformations

Log function refresher
Log₁₀:
– log₁₀(x) = y means that x = 10^y
– So if x = 1000, log₁₀(x) = 3 because 1000 = 10³
– log₁₀(103) = 2.01 because 103 = 10^2.01
– log₁₀(1) = 0 because 10⁰ = 1
– log₁₀(0) = −∞ because 10^−∞ = 0
Log_e, or ln:
– e is a constant approximately equal to 2.718
– ln(1) = 0 because e⁰ = 1
– ln(e) = 1 because e¹ = e
– ln(103) = 4.63 because 103 = e^4.63
– ln(0) = −∞ because e^−∞ = 0
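These facts can be checked directly in Stata; log() is the natural log and log10() is base 10:
display log10(1000)
display log10(103)
display log(103)
display exp(1)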

Log transformations
[Table of example values with their ln and log₁₀; the surviving row shows ln(0) = log₁₀(0) = −∞]
Be careful of log(0) or ln(0)
Be sure you know which log base your computer program is using
In Stata, log() will give you ln()


Let's try transforming FEV to ln(FEV):
. gen fev_ln = log(fev)
. summ fev fev_ln
[summary statistics (Obs, Mean, Std. Dev., Min, Max) for fev and fev_ln; values omitted]
Run the regression of ln(FEV) on age and examine the residuals:
regress fev_ln age
rvfplot, title(Fitted values versus residuals for regression of lnFEV on age)


Interpretation of regression coefficients for a transformed y value
The fitted regression equation is: ln(FEV) = α̂ + β̂·age, with β̂ = 0.087
So a one-year change in age corresponds to a 0.087 change in ln(FEV)
The change is on a multiplicative scale, so if you exponentiate, you get a percent change in y
e^0.087 = 1.09 – so a one-year change in age corresponds to a 9% increase in FEV
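As a sketch, this multiplicative factor can be computed directly from the stored coefficient after running regress fev_ln age:
display exp(_b[age])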

ln(FEV) = α̂ + 0.087·age
ln(FEV age21) = α̂ + 0.087·21
ln(FEV age20) = α̂ + 0.087·20
ln(FEV age21) − ln(FEV age20) = 0.087
Remember ln(a) − ln(b) = ln(a/b)
ln(FEV age21 / FEV age20) = 0.087
FEV age21 / FEV age20 = e^0.087 = 1.09

Now using height as the independent variable:
1. Make a scatter plot of FEV by height
2. Run a regression of FEV on height and examine the output
3. Construct a plot of the residuals vs. the fitted values
4. Consider a transformation that might be a better fit
   1. Run the regression and examine the output
   2. Examine the residuals
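A possible set of Stata commands for this exercise, as a sketch: it assumes the height variable is named ht (as in the regress fev age ht example later in this lecture) and reuses fev_ln from above; other transformations could be tried as well:
scatter fev ht
regress fev ht
rvfplot
* one candidate transformation
regress fev_ln ht
rvfplot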


Categorical independent variables
We previously noted that the independent variable (the X variable) does not need to be normally distributed
In fact, this variable can be categorical
Dichotomous variables in regression models are coded as 1 to represent the level of interest and 0 to represent the comparison or reference group. These 0-1 variables are called indicator or dummy variables.
The regression model is the same
The interpretation of β̂ is the change in y that corresponds to being in the group of interest vs. not

Categorical independent variables
Example, sex: x_sex = 0 for female, x_sex = 1 for male
Regression of FEV on sex: fêv = α̂ + β̂·x_sex
For males: fêv male = α̂ + β̂·1 = α̂ + β̂
For females: fêv female = α̂ + β̂·0 = α̂
So fêv male − fêv female = α̂ + β̂ − α̂ = β̂
Remember, α̂ is the mean value of y when x = 0
So here it is the mean FEV for sex = female

1. Using the FEV data, run the regression with FEV as the dependent variable and sex as the independent variable
2. What is the estimate for beta? How is it interpreted?
3. What is the estimate for alpha? How is it interpreted?
4. What hypothesis is tested where it says P>|t|?
5. What is the result of this test?
6. How much of the variance in FEV is explained by sex?
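A sketch of the command for step 1, assuming the sex indicator in the FEV data set is named sex and is coded 0/1 as on the previous slide:
regress fev sex
* or, telling Stata explicitly that it is categorical
regress fev i.sex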


Categorical independent variable
Remember that the regression equation is μ y|x = α + βx
The only values x can take are 0 and 1:
μ y|0 = α
μ y|1 = α + β
So the estimated mean FEV for females is α̂ and the estimated mean FEV for males is α̂ + β̂
When we conduct the hypothesis test of the null hypothesis β = 0, what are we testing?
What other test have we learned that tests the same thing? Run that test.
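The other test is the two-sample t test assuming equal variances, which gives the same p-value as the regression t test on β̂; a sketch, again assuming the variable is named sex:
ttest fev, by(sex)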


Categorical independent variables
In general, you need k − 1 dummy or indicator variables (0-1) for a categorical variable with k levels
One level is chosen as the reference value
For each of the other categories, exactly one of the dummy variables is set to 1; all the rest are set to 0

Categorical independent variables
E.g., race group = White, Asian/PI, Other
If Race = White is set as the reference category, the dummy variables look like:
            x_Asian/PI   x_Other
White           0           0
Asian/PI        1           0
Other           0           1

Categorical independent variables
Then the regression equation is: y = α + β₁·x_Asian/PI + β₂·x_Other + ε
For race group = White: ŷ = α̂ + β̂₁·0 + β̂₂·0 = α̂
For race group = Asian/PI: ŷ = α̂ + β̂₁·1 + β̂₂·0 = α̂ + β̂₁
For race group = Other: ŷ = α̂ + β̂₁·0 + β̂₂·1 = α̂ + β̂₂

You actually don't have to make the dummy variables yourself (when I was a girl we did have to)
All you have to do is tell Stata that a variable is categorical using i. before the variable name
Run the regression of BMI on race group (using the class data set):
regress bmi i.racegrp


1. What is the estimated mean BMI for race group = White?
2. What is the estimated mean BMI for race group = Asian/PI?
3. What is the estimated mean BMI for race group = Other?
4. What do the estimated betas signify?
5. What other test looks at the same thing? Run that test.
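For item 5, the overall F test from this regression corresponds to one-way ANOVA; a sketch of both commands, using the variable names from the previous slide:
regress bmi i.racegrp
oneway bmi racegrp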


A new Stata trick allows you to specify the reference group with the prefix b#, where # is the numeric value of the group that you want to be the reference group.
1. Try out regress bmi b1.racegrp. Now the reference category is racegrp = 1, which is the Asian/PI group
2. Interpret the parameter estimates
3. Note whether any other output is changed


Multiple regression
Additional explanatory variables might add to our understanding of a dependent variable
We can posit the population equation μ y|x1,x2,...,xq = α + β₁x₁ + β₂x₂ + ... + β_q x_q
α is the mean of y when all the explanatory variables are 0
β_i is the change in the mean value of y that corresponds to a 1-unit change in x_i when all the other explanatory variables are held constant

Because there is natural variation in the response variable, the model we fit is y = α + β₁x₁ + β₂x₂ + ... + β_q x_q + ε
Assumptions:
– x₁, x₂, ..., x_q are measured without error
– The distribution of y is normal with mean μ y|x1,x2,...,xq and standard deviation σ y|x1,x2,...,xq
– The population regression model holds
– For any set of values of the explanatory variables x₁, x₂, ..., x_q, σ y|x1,x2,...,xq is constant (homoscedasticity)
– The y outcomes are independent

Multiple regression – Least Squares
We estimate the regression line ŷ = α̂ + β̂₁x₁ + β̂₂x₂ + ... + β̂_q x_q using the method of least squares, choosing the coefficients that minimize the sum of squared residuals Σ(yᵢ − ŷᵢ)²

Multiple regression
For one explanatory variable, the regression model represents a straight line through a cloud of points, in 2 dimensions
With 2 explanatory variables, the model is a plane in 3-dimensional space (one dimension for each explanatory variable plus one for y), etc.
In Stata we just add explanatory variables to the regress statement
Try regress fev age ht


We can test hypotheses about individual slopes
The null hypothesis is H₀: βᵢ = βᵢ₀, assuming that the values of the other explanatory variables are held constant
The test statistic t = (β̂ᵢ − βᵢ₀) / se(β̂ᵢ) follows a t distribution with n − q − 1 degrees of freedom
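In Stata, the coefficient table from regress fev age ht already reports these t tests against 0; as a sketch, the test command gives the equivalent Wald (F) tests, including a joint test of several slopes:
regress fev age ht
* test of the height slope (the F statistic is the square of the t statistic)
test ht = 0
* joint test that both slopes are 0 (the overall model F test)
test age ht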

. regress fev age ht
[Stata output: ANOVA table, Number of obs, F(2, 651), R-squared, Adj R-squared, Root MSE, and the coefficient table for age, ht, and _cons]
Now the F test has 2 degrees of freedom in the numerator because there are 2 explanatory variables
R² will always increase as you add more variables into the model
The Adj R-squared accounts for the addition of variables and is comparable across models with different numbers of parameters
Note that the beta for age decreased

Examine the residuals…
rvfplot, title(Residuals versus fitted for regression of age and height on FEV)


For next time
Read Pagano and Gauvreau:
– Pagano and Gauvreau chapters (review)
– Pagano and Gauvreau Chapter 20