Chapter 11: Regression and correlation methods
Abdus Wahed, BIOST 2041

Goals
To relate (associate) a continuous random variable, preferably normally distributed, to other variables.

Terminology
Dependent variable (Y):
– The variable that is assumed to depend on the others, e.g., birthweight.
Independent variable, explanatory variable, or predictor (x):
– A variable used to predict the dependent variable, or to explain the variation in it, e.g., estriol level.

Assumptions
Dependent variable:
– Continuous, preferably normally distributed.
– Has a linear association with the predictors.
Independent variable:
– Fixed (not random).

Simple Linear Regression Model
Let Y be the dependent variable and x the lone covariate. A simple linear regression assumes that the true relationship between Y and x is
$$E(Y \mid x) = \alpha + \beta x \qquad (1)$$
Equation (1) can be written as
$$Y = \alpha + \beta x + e, \qquad (2)$$
where e is an error term with mean 0 and variance σ².

Implication
If the linear relationship were perfect, every subject with the same value of x would have the same value of Y (a deterministic relationship). The error term accounts for the inter-patient variability: σ² = Var(Y | x) = Var(e).

Parameters
α is the intercept of the line.
β is the slope of the line, referred to as the regression coefficient:
o β < 0 indicates a negative linear association (the higher the x, the smaller the Y).
o β = 0 indicates no linear relationship.
o β > 0 indicates a positive linear association (the higher the x, the larger the Y).
o β is the expected change in Y for a unit change in x.

Goal
How do we estimate α, β, and σ²?
– Fitting regression lines
How do we draw inference? Is the relationship we see just due to chance?
– Inference about regression parameters

Fitting Regression Line: Least squares method
Idea:
– Estimate α and β so that the observations are "closest" to the line. Making every observation close to the line at once is impossible.
Implementation:
– Estimate α and β so that the sum of squared deviations of the observations from the line is minimized.
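
Least squares method
Minimize, with respect to α and β,
$$\sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2$$
Least squares estimate of β:
$$b = \frac{\sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)/n}{\sum x_i^2 - \left(\sum x_i\right)^2/n}$$
Least squares estimate of α:
$$a = \frac{\sum y_i - b \sum x_i}{n}$$
Estimated regression line: y = a + bx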
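
The slides state the least squares estimates without deriving them. A brief sketch of the standard argument, setting the partial derivatives of the sum of squares to zero (the name S for the objective is introduced here only for illustration):

```latex
\begin{align*}
S(\alpha,\beta) &= \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2 \\
\frac{\partial S}{\partial \alpha} = 0
  &\;\Rightarrow\; \sum y_i = n a + b \sum x_i
  \;\Rightarrow\; a = \frac{\sum y_i - b \sum x_i}{n} \\
\frac{\partial S}{\partial \beta} = 0
  &\;\Rightarrow\; \sum x_i y_i = a \sum x_i + b \sum x_i^2 \\
\text{substituting } a:\quad
  \sum x_i y_i - \frac{\sum x_i \sum y_i}{n}
  &= b \left( \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n} \right)
\end{align*}
```

Dividing the last line by the bracketed term gives the expression for b.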

Example 11.3
Estimate the regression line for the birthweight data in Table 11.1, i.e., estimate the intercept a and the slope b. We do the following calculations (see the corresponding Excel file).

Regression analysis for the data in Table 11.1 (n = 31)
(1) Sum of products: 17500
(2) Sum of X: 534
(3) Sum of Y: 992
(4) Sum of squared X: 9876
(5) Corrected sum of products, (1) − (2)*(3)/n: Lxy = 412
(6) Corrected sum of squares of X, (4) − (2)*(2)/n: Lxx = 677.42
(7) Regression coefficient, (5)/(6): b = Lxy/Lxx = 0.608
Intercept, [(3) − (7)*(2)]/n: a = 21.52
Estimated regression line: Birthweight (g/100) = 21.52 + 0.608 * Estriol (mg/24hr)
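
The arithmetic in this table can be reproduced in a few lines. The following is a minimal sketch that starts from the four reported sums; the sample size n = 31 is an assumption here, inferred from the 29 residual degrees of freedom used in the F-test later in the deck, and the variable names are illustrative.

```python
# Summary sums reported for Table 11.1.
# n = 31 is inferred from the F(1, 29) test (assumption of this sketch).
n = 31
sum_xy = 17500.0   # sum of products x*y
sum_x = 534.0      # sum of estriol levels (mg/24hr)
sum_y = 992.0      # sum of birthweights (g/100)
sum_x2 = 9876.0    # sum of squared estriol levels

# Corrected sum of products and corrected sum of squares of x
l_xy = sum_xy - sum_x * sum_y / n
l_xx = sum_x2 - sum_x ** 2 / n

# Least squares slope and intercept
b = l_xy / l_xx
a = (sum_y - b * sum_x) / n

print(f"Lxy = {l_xy:.2f}, Lxx = {l_xx:.2f}")
print(f"b = {b:.3f}, a = {a:.2f}")
```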

Regression Analysis: Interpretation
There is a positive association between birthweight and estriol level (whether it is statistically significant will be tested later). For each 1 mg/24hr increase in estriol level, the birthweight of the newborn is expected to increase by about 61 g.

Prediction
The predicted value of Y for a given value of x is
$$\hat{y} = a + bx$$
What is the estimated (predicted) birthweight if a pregnant woman has an estriol level of 15 mg/24hr?
$$\hat{y} = a + b \times 15 \approx 30.65 \text{ (g/100)} = 3065 \text{ g}$$
with a ≈ 21.52 and b ≈ 0.608 (the result uses the unrounded estimates).

Calibration
If low birthweight is defined as ≤ 2500 g, for what estriol levels would the newborn be predicted to have low birthweight? That is, to what estriol level does a predicted birthweight of 2500 g correspond?
Solving ŷ = a + bx for x at ŷ = 25 (g/100):
$$x = \frac{\hat{y} - a}{b} = \frac{25 - 21.52}{0.608} \approx 5.72 \text{ mg/24hr}$$
Women with an estriol level of 5.72 mg/24hr or lower are predicted to have low-birthweight newborns.
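
Prediction and calibration are the same fitted line read in two directions. A small sketch, assuming the summary sums from the birthweight example and n = 31; the function names `predict_birthweight` and `calibrate_estriol` are hypothetical helpers, not part of the original slides.

```python
# Refit the line from the reported sums so that unrounded
# coefficients are used (n = 31 is assumed).
n = 31
l_xy = 17500 - 534 * 992 / n
l_xx = 9876 - 534 ** 2 / n
b = l_xy / l_xx                # slope, g/100 per mg/24hr
a = (992 - b * 534) / n        # intercept, g/100

def predict_birthweight(estriol):
    """Predicted birthweight (g/100) at a given estriol level (mg/24hr)."""
    return a + b * estriol

def calibrate_estriol(birthweight):
    """Estriol level (mg/24hr) whose predicted birthweight equals the input (g/100)."""
    return (birthweight - a) / b

print(f"Predicted at 15 mg/24hr: {predict_birthweight(15) * 100:.0f} g")
print(f"Estriol for 2500 g: {calibrate_estriol(25):.2f} mg/24hr")
```

Calibration simply inverts the fitted line, so it is only meaningful when b is not zero.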

Goodness of fit of a regression line
How good is x in predicting Y? (Residual = observed − predicted, with predicted = 21.52 + 0.608x.)

Estriol (mg/24hr)   Birthweight (g/100)   Predicted birthweight (g/100)   Residual
x1 = 7              y1 = 25               25.78                           r1 = −0.78
x2 = 9              y2 = 25               26.99                           r2 = −1.99
x3 = 9              y3 = 25               26.99                           r3 = −1.99
x4 = 12             y4 =                  28.82                           r4 =

Goodness of fit of a regression line
Residual sum of squares (Res. SS): a summary measure of the distance between the observed and predicted values,
$$\text{Res. SS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
The smaller the Res. SS, the better the regression line predicts Y.

Total variation in observed Y
Total sum of squares (Total SS): a summary measure of the variation in Y,
$$\text{Total SS} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

Total variation in predicted Y
Regression sum of squares (Reg. SS): a summary measure of the variation in the predicted Y,
$$\text{Reg. SS} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$$

Goodness of fit of a regression line
It can be shown that
$$\text{Total SS} = \text{Reg. SS} + \text{Res. SS}$$
The smaller the residual SS, the closer the regression SS is to the total SS, and the better the regression fits.
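
The decomposition can be checked numerically on any data set. The sketch below fits a least squares line to a small set of hypothetical (x, y) pairs, made up purely for illustration, and confirms that the regression and residual sums of squares add up to the total.

```python
# Hypothetical data, invented for this illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least squares slope and intercept from corrected sums
l_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
l_xx = sum((x - x_bar) ** 2 for x in xs)
b = l_xy / l_xx
a = y_bar - b * x_bar

fitted = [a + b * x for x in xs]

total_ss = sum((y - y_bar) ** 2 for y in ys)                # variation in observed Y
reg_ss = sum((f - y_bar) ** 2 for f in fitted)              # variation in predicted Y
res_ss = sum((y - f) ** 2 for y, f in zip(ys, fitted))      # observed vs predicted

print(f"Total SS        = {total_ss:.6f}")
print(f"Reg SS + Res SS = {reg_ss + res_ss:.6f}")
```

The identity holds only for the least squares line; for an arbitrary line the cross-product term does not vanish.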

Coefficient of determination R²
$$R^2 = \frac{\text{Reg. SS}}{\text{Total SS}}$$
R² is the proportion of the total variation in Y explained by the regression on x. R² lies between 0 and 1. R² = 1 implies a perfect fit (all the points are on the line).

F-test
Another formal way of assessing how good the regression of Y on x is, is the F-test. The F-test compares Reg. SS to Res. SS; a larger F indicates a better regression fit.
Test: H0: β = 0 vs. Ha: β ≠ 0
Test statistic:
$$F = \frac{\text{Reg. SS}/1}{\text{Res. SS}/(n-2)}$$
Reject H0 if F > F_{1, n−2, 1−α}, where α here denotes the significance level (not the intercept).

Summary of goodness of regression fit
We need to compute three quantities:
– Total SS = Lyy
– Reg. SS = b * Lxy
– Res. SS = Total SS − Reg. SS

Example
Total SS: 674
Reg. SS: 250.57
R²: 0.37 => 37% of the variation in birthweight is explained by the regression on estriol level
F: 17.16
p-value: P(F_{1,29} > 17.16) < 0.001
H0 is rejected => The slope of the regression line is significantly different from zero, implying a statistically significant linear relationship between estriol level and birthweight.
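
These goodness-of-fit quantities follow directly from Lxy, Lxx, and Lyy. A minimal sketch, assuming n = 31 and the sums reported for the birthweight data (variable names are illustrative); it stops at the F statistic and does not compute the p-value.

```python
n = 31
l_xy = 412.0                   # corrected sum of products
l_xx = 9876 - 534 ** 2 / n     # corrected sum of squares of x
total_ss = 674.0               # Lyy

b = l_xy / l_xx
reg_ss = b * l_xy              # regression sum of squares
res_ss = total_ss - reg_ss     # residual sum of squares

r_squared = reg_ss / total_ss
f_stat = (reg_ss / 1) / (res_ss / (n - 2))

print(f"Reg SS = {reg_ss:.2f}, Res SS = {res_ss:.2f}")
print(f"R^2 = {r_squared:.2f}, F = {f_stat:.2f}")
```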
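
T-test
The same hypothesis, H0: β = 0 vs. Ha: β ≠ 0, can be tested using a t-test.
Test statistic:
$$t = \frac{b}{SE(b)}, \qquad SE(b) = \frac{s}{\sqrt{L_{xx}}}, \qquad s^2 = \frac{\text{Res. SS}}{n-2}$$
Under H0, t follows a t distribution with n − 2 degrees of freedom.
P-value = 2 Pr(t_{n−2} > |t|)
100(1−α)% CI for β:
$$b \pm t_{n-2,\,1-\alpha/2}\, SE(b)$$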

Example
Is the regression coefficient (slope) for the estriol level significantly different from zero?
s² = 14.6, s = 3.82
SE(b) = 0.15, t = 4.14
p < 0.001
95% CI for the regression coefficient: (0.31, 0.91)
H0: β = 0 is rejected => The slope of the regression line is significantly different from zero, implying a statistically significant linear relationship between estriol level and birthweight.
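
The numbers in this example can be reproduced with the standard library alone. The sketch below assumes n = 31 and the reported sums; the two-sided p-value is obtained by numerically integrating the t density with Simpson's rule (a choice made here to avoid external packages, not the method used in the slides), and the critical value 2.045 for 29 degrees of freedom is taken from a standard t table.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def two_sided_p(t_stat, df, steps=2000):
    """Two-sided p-value 2*Pr(T > |t|); Simpson's rule on [0, |t|], steps must be even."""
    upper = abs(t_stat)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3           # Pr(0 < T < |t|)
    return 2 * (0.5 - area)

n = 31
l_xy = 412.0
l_xx = 9876 - 534 ** 2 / n
l_yy = 674.0

b = l_xy / l_xx
s2 = (l_yy - b * l_xy) / (n - 2)   # residual mean square
se_b = math.sqrt(s2 / l_xx)        # standard error of the slope
t_stat = b / se_b
p_value = two_sided_p(t_stat, n - 2)

T_CRIT = 2.045                     # t_{29, 0.975}, from a standard t table
ci = (b - T_CRIT * se_b, b + T_CRIT * se_b)

print(f"s^2 = {s2:.1f}, SE(b) = {se_b:.2f}, t = {t_stat:.2f}")
print(f"p < 0.001: {p_value < 0.001}")
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
```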

Correlation
Correlation is a quantitative measure of the strength of the linear relationship between two variables.
Regression, on the other hand, is used for prediction.
No distinction between dependent and independent variables is made when assessing correlation.

Correlation: Example 11.14

Correlation coefficient
Population correlation coefficient (see section in my notes):
$$\rho = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y}$$
If X and Y could be measured on everyone in the population, we could calculate ρ.

Interpretation of ρ
ρ lies between −1 and 1:
– ρ = 0 implies no linear relationship,
– ρ = −1 implies a perfect negative linear relationship,
– ρ = +1 implies a perfect positive linear relationship.

Sample correlation coefficient
Unfortunately, we cannot measure X and Y on everyone in the population. We estimate ρ from the sample data as follows:
$$r = \frac{L_{xy}}{\sqrt{L_{xx} L_{yy}}}$$

Interpretation of r
r lies between −1 and 1:
– r = 0 implies no linear relationship,
– r = −1 implies a perfect negative linear relationship,
– r = +1 implies a perfect positive linear relationship.
The closer |r| is to 1, the stronger the linear relationship.

Sample correlation coefficient: illustrations of r = 1 and r = −1

Correlation: Example (n = 12)
(1) Sum of products: 5156.2
(2) Sum of X: 1872
(3) Sum of Y: 32.3
(4) Sum of squared X: 294320
(5) Sum of squared Y: 93.11
(6) Corrected sum of products, (1) − (2)*(3)/n: Lxy = 117.4
(7) Corrected sum of squares of X, (4) − (2)*(2)/n: Lxx = 2288
(8) Corrected sum of squares of Y, (5) − (3)*(3)/n: Lyy = 6.17
Sample correlation coefficient, (6)/sqrt[(7)*(8)]: r = 0.988
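
A short sketch of the last steps of this calculation, starting from the reported Lxy and Lxx and recomputing Lyy from the sums of Y. The sample size n = 12 is an assumption, inferred from the fact that it reproduces the reported Lyy = 6.17; the function name is illustrative.

```python
import math

def correlation_from_corrected_sums(l_xy, l_xx, l_yy):
    """Sample correlation coefficient r = Lxy / sqrt(Lxx * Lyy)."""
    return l_xy / math.sqrt(l_xx * l_yy)

n = 12                 # inferred, see lead-in
sum_y = 32.3           # sum of Y
sum_y2 = 93.11         # sum of squared Y
l_xy = 117.4           # corrected sum of products (as reported)
l_xx = 2288.0          # corrected sum of squares of X (as reported)

l_yy = sum_y2 - sum_y ** 2 / n
r = correlation_from_corrected_sums(l_xy, l_xx, l_yy)

print(f"Lyy = {l_yy:.2f}")
print(f"r = {r:.3f}, r^2 = {r ** 2:.3f}")
```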

Correlation: Example
Since r = 0.988, there is a nearly perfect positive correlation between mean FEV and height: the taller a person is, the higher the FEV. Had we regressed one of the variables (FEV or height) on the other, we would have obtained R² = r² = 0.976 ≈ 98%. This implies that 98% of the variation in one variable is explained by the other.

Correlation: Example
The sample correlation coefficient between estriol level and birthweight is r = 0.61, implying a moderately strong positive linear relationship: the higher the estriol level, the higher the birthweight. Remember that R² = 0.37 (slide # 33), which is equal to r² = (0.61)².

Statistical Significance of Correlation
If |r| is close to 1, such as 0.988, one would believe that there is a strong linear relationship between the two variables; there is little reason to believe that such a strong association arose just by chance (sampling/observation).

But if |r| = 0.23, what conclusion would you draw about the relationship? Is it possible that in truth there is no correlation (ρ = 0), and the sample shows some correlation between the two variables only by chance?

Significance test for correlation coefficient
Test the hypothesis H0: ρ = 0 vs. Ha: ρ ≠ 0.
Under the assumption that both variables are normally distributed, the test statistic is
$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$
Calculate the two-sided p-value from a t distribution with (n − 2) d.f.

Correlation: Example
The sample correlation coefficient between estriol level and birthweight is r = 0.61. Is the correlation significant? (Is the correlation coefficient significantly different from zero?)
With n = 31 and r before rounding (≈ 0.6097):
$$t = \frac{0.6097\sqrt{29}}{\sqrt{1 - 0.6097^2}} \approx 4.14$$
Since the p-value is very small (p < 0.001), we reject the null hypothesis. The correlation is statistically significant => We have enough evidence to conclude that the correlation coefficient is significantly different from zero.
Did you notice that the t-statistic (t = 4.14) and the p-value for testing H0: ρ = 0 are exactly the same as those for testing H0: β = 0 in slide 37?
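
The agreement between the two t-statistics is an algebraic identity, not a coincidence of this data set. A small sketch that computes both from the same corrected sums (n = 31 and the birthweight sums are assumed as before):

```python
import math

n = 31
l_xy = 412.0
l_xx = 9876 - 534 ** 2 / n
l_yy = 674.0

# t-statistic for H0: rho = 0, from the sample correlation
r = l_xy / math.sqrt(l_xx * l_yy)
t_from_r = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# t-statistic for H0: beta = 0, from the fitted slope
b = l_xy / l_xx
s2 = (l_yy - b * l_xy) / (n - 2)
t_from_b = b / math.sqrt(s2 / l_xx)

print(f"r = {r:.4f}")
print(f"t from r = {t_from_r:.4f}, t from b = {t_from_b:.4f}")
```

Both expressions reduce to Lxy·sqrt(n − 2) / sqrt(Lxx·Lyy − Lxy²), which is why they coincide for any data.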

Significance test for correlation coefficient
Test the hypothesis H0: ρ = ρ0 vs. Ha: ρ ≠ ρ0.
Let (Fisher's Z transformation)
$$z = \frac{1}{2}\ln\frac{1+r}{1-r}, \qquad z_0 = \frac{1}{2}\ln\frac{1+\rho_0}{1-\rho_0}$$
Then, under H0, approximately
$$z \sim N\!\left(z_0,\ \frac{1}{n-3}\right), \qquad \text{so that} \qquad \lambda = (z - z_0)\sqrt{n-3} \sim N(0, 1)$$
The p-value for the test can then be calculated from a standard normal distribution.
We will mainly use this result to find confidence intervals for ρ.
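
As a rough illustration of how the transformation yields a confidence interval, the sketch below builds an approximate 95% interval on the z scale and maps it back with the inverse transformation (tanh). The inputs r = 0.61 and n = 31 are taken from the birthweight example; the function names and the 1.96 normal quantile for a 95% level are choices made for this illustration.

```python
import math

def fisher_z(r):
    """Fisher's Z transformation, 0.5 * ln((1 + r) / (1 - r))."""
    return math.atanh(r)

def correlation_ci(r, n, z_crit=1.96):
    """Approximate CI for rho: interval on the z scale, back-transformed with tanh."""
    z = fisher_z(r)
    half_width = z_crit / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)

def fisher_p_value(r, n, rho0=0.0):
    """Two-sided p-value for H0: rho = rho0 from the standard normal distribution."""
    lam = (fisher_z(r) - fisher_z(rho0)) * math.sqrt(n - 3)
    return math.erfc(abs(lam) / math.sqrt(2))

lower, upper = correlation_ci(0.61, 31)
print(f"Approximate 95% CI for rho: ({lower:.2f}, {upper:.2f})")
print(f"p-value for H0: rho = 0: {fisher_p_value(0.61, 31):.5f}")
```

The interval is not symmetric around r, because the back-transformation is nonlinear.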