1 Linear regression Brian Healy, PhD BIO203

2 Previous classes Hypothesis testing: parametric and nonparametric. Correlation.

3 What are we doing today? Linear regression
Continuous outcome with a continuous, dichotomous, or categorical predictor. The regression equation and interpretation of its coefficients. The connection between regression and correlation, the t-test, and ANOVA.

4 Big picture Linear regression is the most commonly used statistical technique. It allows the comparison of dichotomous, categorical, and continuous predictors with a continuous outcome. Extensions of linear regression allow dichotomous outcomes (logistic regression), survival analysis (Cox proportional hazards regression), and repeated measures. Amazingly, many of the analyses we have learned can be completed using linear regression.

5 Example Yesterday, we investigated the association between age and brain parenchymal fraction (BPF) using a correlation coefficient. Can we fit a line to this data?

6 Quick math review As you remember from high school math, the basic equation of a line is given by y = mx + b, where m is the slope and b is the y-intercept. One interpretation of m is that for every one-unit increase in x, there is an m-unit increase in y. One interpretation of b is that it is the value of y when x equals zero.

7 Picture Look at the data in this picture
Does there seem to be a correlation (linear relationship) in the data? Is the data perfectly linear? Could we fit a line to this data?

8 What is linear regression?
Linear regression tries to find the best line (curve) to fit the data. The method of finding the best line (curve) is least squares, which minimizes the sum of the squared vertical distances from the line to each of the points.

9 How do we find the best line?
Let’s look at three candidate lines. Which do you think is the best? What is a way to determine the best line to use?

10 Residuals The actual observations, yi, may be slightly off the population line because of variability in the population. The equation is yi = b0 + b1*xi + ei, where ei is the deviation from the population line (see picture); this is called the residual. In the picture, e1 is the distance from the line for patient 1.

11 Least squares The method employed to find the best line is called least squares. This method finds the values of the b's that minimize the sum of the squared vertical distances from the points to the line, which is the same as minimizing the sum of the ei^2.
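
The slides do the fitting in STATA; as a rough illustration of what least squares computes, here is a minimal Python sketch using made-up age/BPF values (not the course dataset) and the closed-form formulas for a single predictor:

```python
# A minimal least squares sketch for one predictor (illustrative values,
# not the course's BPF dataset).
import numpy as np

age = np.array([30.0, 35.0, 40.0, 45.0, 50.0, 55.0])   # x
bpf = np.array([0.88, 0.87, 0.85, 0.84, 0.82, 0.80])   # y

# Closed-form least squares solution:
#   b1hat = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)
#   b0hat = ybar - b1hat * xbar
xbar, ybar = age.mean(), bpf.mean()
b1hat = np.sum((age - xbar) * (bpf - ybar)) / np.sum((age - xbar) ** 2)
b0hat = ybar - b1hat * xbar

resid = bpf - (b0hat + b1hat * age)      # the ei: vertical distances from the line
print(b0hat, b1hat, np.sum(resid ** 2))  # least squares minimizes this last sum
```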

12 Estimates of regression coefficients
Once we have solved the least squares equation, we obtain estimates for the b's, which we refer to as b0hat and b1hat. The final least squares equation is yhat = b0hat + b1hat*x, where yhat is the estimated mean value of y for a given value of x.

13 Assumptions of linear regression
Linearity: a linear relationship between the outcome and the predictors. E(Y|X=x) = b0 + b1*x1 + b2*x2^2 is still a linear regression equation because each of the b's is to the first power. Normality of the residuals: the residuals, ei, are normally distributed, N(0, s^2). Homoscedasticity of the residuals: the residuals, ei, have the same variance. Independence: all of the data points are independent. Correlated data points can be taken into account using multivariate and longitudinal data methods.

14 Linearity assumption One of the assumptions of linear regression is that the relationship between the predictors and the outcome is linear. We call this the population regression line: E(Y|X=x) = mu_y|x = b0 + b1*x. This equation says that the mean of y for a specific value of x is defined by the b coefficients. The coefficients act exactly like the slope and y-intercept from the simple equation of a line from before.

15 Normality and homoscedasticity assumption
Two other assumptions of linear regression are related to the ei's. Normality: the distribution of the residuals is normal. Homoscedasticity: the variance of y given x is the same for all values of x. The distribution of y-values at each value of x is normal with the same variance.

16 Example Here is a regression equation for the comparison of age and BPF

17 Results The estimated regression equation is shown in the STATA output.

18 [STATA output with the estimated slope and estimated intercept labeled]

19 Interpretation of regression coefficients
The final regression equation is shown in the output. The coefficients mean: the estimate of the mean BPF for a patient with an age of 0 is b0hat, and an increase of one year in age leads to an estimated change of b1hat in mean BPF (here a decrease, since b1hat is negative).

20 Unanswered questions Is the estimate of b1 (b1hat) significantly different from zero? In other words, is there a significant relationship between the predictor and the outcome? Have the assumptions of regression been met?

21 Estimate of variance for the bhat's
In order to determine if there is a significant association, we need an estimate of the variance of b0hat and b1hat. Both depend on s_y|x, the residual standard deviation of y after accounting for x (also called the standard deviation from regression, or root mean square error).

22 Test statistic For both regression coefficients, we use a t-statistic to test any specific hypothesis. Each has n-2 degrees of freedom (the sample size minus the number of parameters estimated). What is the usual null hypothesis for b1?
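
To make the t-statistic concrete, here is an illustrative Python sketch (made-up data, not the slides' STATA analysis) that builds the standard error of b1hat from s_y|x and tests H0: b1 = 0 with n-2 degrees of freedom:

```python
# Illustrative t-test for the slope (made-up data): build se(b1hat) from the
# residual standard deviation s_y|x and refer t to a t-distribution with
# n - 2 degrees of freedom.
import numpy as np
from scipy import stats

age = np.array([30.0, 35.0, 40.0, 45.0, 50.0, 55.0])
bpf = np.array([0.88, 0.87, 0.85, 0.84, 0.82, 0.80])
n = len(age)

xbar = age.mean()
b1hat = np.sum((age - xbar) * (bpf - bpf.mean())) / np.sum((age - xbar) ** 2)
b0hat = bpf.mean() - b1hat * xbar

resid = bpf - (b0hat + b1hat * age)
s_yx = np.sqrt(np.sum(resid ** 2) / (n - 2))       # root mean square error
se_b1 = s_yx / np.sqrt(np.sum((age - xbar) ** 2))  # standard error of b1hat

t = b1hat / se_b1                                  # H0: b1 = 0
p = 2 * stats.t.sf(abs(t), df=n - 2)               # n - 2 degrees of freedom
print(t, p)
```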

23 Hypothesis test H0: b1 =0 Continuous outcome, continuous predictor
Linear regression Test statistic: t=-3.67 (27 dof) p-value=0.0011 Since the p-value is less than 0.05, we reject the null hypothesis We conclude that there is a significant association between age and BPF

24 [STATA output with the estimated slope, the p-value for the slope, and the estimated intercept labeled]

25 Comparison to correlation
In this example, we found a relationship between age and BPF. We also investigated this relationship using correlation. We get the same p-value!! Our conclusion is exactly the same!! There are other relationships we will see later.
Method            p-value
Correlation       0.0010
Linear regression 0.0011

26 Confidence interval for b1
As we have done previously, we can construct a confidence interval for the regression coefficients. Since we are using a t-distribution, we do not automatically use 1.96; rather, we use the cut-off from the t-distribution. The interpretation of the confidence interval is the same as we have seen previously.
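
As a sketch of the t-based interval (again with illustrative data rather than the course dataset; scipy's linregress returns the slope and its standard error directly):

```python
# Illustrative t-based 95% confidence interval for b1 (made-up data).
import numpy as np
from scipy import stats

age = np.array([30.0, 35.0, 40.0, 45.0, 50.0, 55.0])
bpf = np.array([0.88, 0.87, 0.85, 0.84, 0.82, 0.80])

fit = stats.linregress(age, bpf)
tcut = stats.t.ppf(0.975, df=len(age) - 2)   # t cut-off, not the normal's 1.96
print(fit.slope - tcut * fit.stderr, fit.slope + tcut * fit.stderr)
```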

27 Intercept STATA also provides a test statistic and p-value for the estimate of the intercept. This is for H0: b0 = 0, which is often not a hypothesis of interest because it corresponds to testing whether the BPF is equal to zero at an age of 0. Since BPF can't be 0 at age 0, this test is not really of interest. We can center covariates to make this test meaningful.

28 Prediction

29 Prediction Beyond determining if there is a significant association, linear regression can also be used to make predictions. Using the regression equation, we can predict the BPF for patients with specific age values. For example, for a patient with age = 40, the expected BPF based on our experiment is 0.841.

30 Extrapolation Can we predict the BPF for a patient with age 80? What assumption would we be making?

31 Confidence interval for prediction
We can place a confidence interval around our predicted mean value This corresponds to the plausible values for the mean BPF at a specific age To calculate a confidence interval for the predicted mean value, we need an estimate of variability in the predicted mean

32 Confidence interval Note that the standard error equation has a different magnitude depending on the x value; in particular, the magnitude is smallest when x equals the mean of x. Since the test statistic is based on the t-distribution, our confidence interval is yhat plus or minus the t cut-off times the standard error. This confidence interval is rarely used for hypothesis testing because there is usually no specific null value of interest for the mean of y at a given x.
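
A sketch of this interval on illustrative data; mean_ci is a hypothetical helper (not from the slides), and its width is smallest at x0 = mean(x):

```python
# Sketch of a confidence interval for the predicted mean at x0 (made-up data).
import numpy as np
from scipy import stats

age = np.array([30.0, 35.0, 40.0, 45.0, 50.0, 55.0])
bpf = np.array([0.88, 0.87, 0.85, 0.84, 0.82, 0.80])
n = len(age)

fit = stats.linregress(age, bpf)
resid = bpf - (fit.intercept + fit.slope * age)
s_yx = np.sqrt(np.sum(resid ** 2) / (n - 2))
Sxx = np.sum((age - age.mean()) ** 2)

def mean_ci(x0, level=0.95):
    """CI for the mean of y at x0; se grows as x0 moves away from mean(x)."""
    yhat = fit.intercept + fit.slope * x0
    se = s_yx * np.sqrt(1.0 / n + (x0 - age.mean()) ** 2 / Sxx)
    tcut = stats.t.ppf(0.5 + level / 2.0, df=n - 2)
    return yhat - tcut * se, yhat + tcut * se

print(mean_ci(40.0), mean_ci(age.mean()))  # narrowest at the mean of x
```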

33 [figure]

34 Prediction interval A confidence interval for a mean provides information regarding the accuracy of an estimated mean value for a given sample size. Often, we are interested in how accurate our prediction would be for a single observation, not the mean of a group of observations; this is called a prediction interval. What would you estimate as the value for a single new observation? Do you think a prediction interval is narrower or wider?

35 Prediction interval A confidence interval is always tighter than a prediction interval. The variability in the prediction of a single observation contains two types of variability: variability of the estimate of the mean (confidence interval) and variability around the estimate of the mean (residual variability).
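
An illustrative sketch contrasting the two intervals at the same x0 (made-up data); the prediction interval adds the residual variability through the leading "1 +" term in its standard error:

```python
# Confidence interval for the mean vs. prediction interval for a single new
# observation at the same x0 (illustrative data).
import numpy as np
from scipy import stats

age = np.array([30.0, 35.0, 40.0, 45.0, 50.0, 55.0])
bpf = np.array([0.88, 0.87, 0.85, 0.84, 0.82, 0.80])
n = len(age)

fit = stats.linregress(age, bpf)
resid = bpf - (fit.intercept + fit.slope * age)
s_yx = np.sqrt(np.sum(resid ** 2) / (n - 2))
Sxx = np.sum((age - age.mean()) ** 2)
tcut = stats.t.ppf(0.975, df=n - 2)

x0 = 40.0
yhat = fit.intercept + fit.slope * x0
se_mean = s_yx * np.sqrt(1.0 / n + (x0 - age.mean()) ** 2 / Sxx)
se_pred = s_yx * np.sqrt(1.0 + 1.0 / n + (x0 - age.mean()) ** 2 / Sxx)
print(yhat - tcut * se_mean, yhat + tcut * se_mean)  # confidence interval
print(yhat - tcut * se_pred, yhat + tcut * se_pred)  # wider prediction interval
```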

36 [figure]

37 Conclusions A prediction interval is always wider than a confidence interval. It is common to find significant differences between groups but not to be able to predict very accurately for individuals. To predict accurately for a single patient, we need limited overlap of the distributions; an increased sample size decreases the standard error of the mean, but this does not help with the residual variability.

38 Model checking

39 How good is our model? Although we have found a relationship between age and BPF, linear regression also allows us to assess how well our model fits the data R2=coefficient of determination=proportion of variance in the outcome explained by the model When we have only one predictor, it is the proportion of the variance in y explained by x

40 R2 What if all of the variability in y was explained by x?
What would R2 equal? What does this tell you about the correlation between x and y? What if the correlation between x and y is negative? What if none of the variability in y is explained by x? What is the correlation between x and y in this case?

41 r vs. R2 R2 = (Pearson's correlation coefficient)^2 = r^2
Since |r| is between 0 and 1, R2 is always less than or equal to |r|: r = 0.1 gives R2 = 0.01, and r = 0.5 gives R2 = 0.25.
Method  Estimate
r       -0.577
R2      0.333
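
A quick illustrative check of this identity on made-up data:

```python
# R^2 equals the square of Pearson's r in simple linear regression.
import numpy as np
from scipy import stats

age = np.array([30.0, 35.0, 40.0, 45.0, 50.0, 55.0])
bpf = np.array([0.88, 0.87, 0.85, 0.84, 0.82, 0.80])

r, _ = stats.pearsonr(age, bpf)
fit = stats.linregress(age, bpf)
print(r ** 2, fit.rvalue ** 2)   # identical: R^2 = r^2
```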

42 Evaluation of model Linear regression requires several assumptions:
Linearity. Homoscedasticity. Normality. Independence (usually from the study design). We must determine if the model assumptions were reasonable; otherwise, a different model may be needed. Statistical research has investigated relaxing each of these assumptions.

43 Scatter plot A good first step in any regression is to look at the x vs. y scatter plot. This allows us to see Are there any outliers? Is the relationship between x and y approximately linear? Is the variance in the data approximately constant for all values of x?

44 Tests for the assumptions
There are several different ways to test the assumptions of linear regression: graphical and statistical. Many of the tests use the residuals, which are the vertical distances between the outcomes and the fitted line.

45 Residual plot If the assumptions of linear regression are met, we will observe a random scatter of points
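
The slides show the plot itself; a minimal matplotlib sketch of a residual plot on simulated data that satisfies the assumptions (so the points scatter randomly around zero) might look like:

```python
# Residual plot: residuals vs. fitted values (simulated, well-behaved data).
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(30.0, 60.0, 50)
y = 0.95 - 0.002 * x + rng.normal(0.0, 0.01, 50)   # linear truth + noise

fit = stats.linregress(x, y)
fitted = fit.intercept + fit.slope * x
resid = y - fitted

plt.scatter(fitted, resid)
plt.axhline(0.0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```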

46 Investigating linearity
Scatter plot of predictor vs. outcome. What do you notice here? One way to handle this is to transform the predictor to include a quadratic or other term.

47 Aging Research has shown that the decrease in BPF in normal people is quite slow up until age 65, and then there is a steeper drop.

48 Fitted line Note how the majority of the values are above the fitted line in the middle and below the fitted line on the two ends

49 What if we fit a line for this?
Residual plot shows a non-random scatter because the relationship is not really linear

50 What can we do? If the relationship between x and y is not linear, we can try a transformation of the values. Possible transformations: add a quadratic term, or fit a spline, which allows one slope for a certain part of the curve and a different slope for the rest of the curve.
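
As a sketch of the first option, here is an illustrative quadratic fit via a design matrix (simulated curved data, not the BPF dataset); note the model is still linear in the b's:

```python
# Quadratic term: include an x^2 column in the design matrix and solve
# by least squares (illustrative simulated data).
import numpy as np

rng = np.random.default_rng(1)
age = np.linspace(30.0, 80.0, 26)
bpf = 0.90 - 0.0005 * (age - 30) - 0.00015 * (age - 30) ** 2  # curved truth
bpf = bpf + rng.normal(0.0, 0.005, age.size)

# Columns: intercept, age, age^2.
X = np.column_stack([np.ones_like(age), age, age ** 2])
coef, *_ = np.linalg.lstsq(X, bpf, rcond=None)
print(coef)   # b0hat, b1hat (age), b2hat (age^2)
```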

51 Adding a quadratic term

52 Residual plot

53 Checking linearity A plot of residuals vs. the predictor is also used to detect departures from linearity. These plots allow you to investigate each predictor separately, which becomes important in multiple regression. If linearity holds, we anticipate a random scatter of the residuals on both types of residual plot.

54 Homoscedasticity The second assumption is equal variance across the values of the predictor The top plot shows the assumption is met, while the bottom plot shows that there is a greater amount of variance for larger fitted values

55 Example

56 Example In this example, we can fit a linear regression model assuming that there is a linear increase in expression with lipid number, but here is the residual plot from this analysis. What is wrong?

57 Transform the y-value Clearly, the residuals showed that we did not have equal variance. What if we log-transform our y-value?

58 New regression equation
By transforming the outcome variable we have changed our regression equation. Original: Expressioni = b0 + b1*lipidi + ei. New: ln(Expressioni) = b0 + b1*lipidi + ei. What is the interpretation of b1 from the new regression model? For every one-unit increase in lipid number, there is a b1-unit increase in ln(Expression) on average. The interpretation has changed due to the transformation.
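
An illustrative sketch of this transformation (hypothetical lipid/expression values, simulated so that the spread grows with the mean as in the slides' plot, not the actual dataset):

```python
# Log-transform the outcome to stabilize variance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
lipid = np.repeat(np.arange(1, 11), 5).astype(float)
expression = np.exp(0.5 + 0.3 * lipid + rng.normal(0.0, 0.4, lipid.size))

fit = stats.linregress(lipid, np.log(expression))
# b1hat is the average change in ln(expression) per unit lipid; equivalently,
# each extra lipid unit multiplies expression by about exp(b1hat).
print(fit.slope, np.exp(fit.slope))
```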

59 Residual plot On the log-scale, the assumption of equal variance appears much more reasonable

60 Checking homoscedasticity
If we do not appear to have equal variance, a transformation of the outcome variable can be used Most common are log-transformation or square root transformation Other approaches involving weighted least squares can also be used if a transformation does not work

61 Normality Regression requires that the residuals are normally distributed To test if the residuals are normal: Histogram of residuals Normal probability plot Several statistical tests for normality of residuals are also available
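
A minimal sketch of the two graphical checks on simulated residuals (illustrative only):

```python
# Histogram and normal probability (Q-Q) plot of regression residuals.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(3)
x = rng.uniform(30.0, 60.0, 50)
y = 0.95 - 0.002 * x + rng.normal(0.0, 0.01, 50)

fit = stats.linregress(x, y)
resid = y - (fit.intercept + fit.slope * x)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(resid, bins=10)                      # roughly bell-shaped if normal
ax1.set_title("Histogram of residuals")
stats.probplot(resid, dist="norm", plot=ax2)  # points near the line if normal
plt.show()
```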

62 What if normality does not hold?
Transformations of the outcome can often help Changing to another type of regression that does not require normality of the residuals Logistic regression Poisson regression

63 Outliers Investigating the residuals also provides information regarding outliers If a value is extreme in the vertical direction, the residual will be extreme as well You will see this in lab If a value is extreme in the horizontal direction, this value can have too much importance (leverage) This is beyond the scope of this class

64 [figure]

65 Example Another measure of disease burden in MS is the T2 lesion volume in the brain. Over the course of the disease, patients accumulate brain lesions that they do not recover from, so this is a measure of the disease burden in the brain. Is there a significant linear relationship between T2 lesion volume and age?

66 [figure]

67 Linear model Our initial linear model:
LVi = b0 + b1*agei + ei What is the interpretation of b1? What is the interpretation of b0? Using STATA, we get the following regression equation: Is there a significant relationship between age and lesion volume?

68 Hypothesis test H0: b1 =0 Continuous outcome, continuous predictor
Linear regression Test statistic: t=0.99 (102 dof) p-value=0.32 Since the p-value is more than 0.05, we fail to reject the null hypothesis We conclude that there is no significant association between age and lesion volume

69 [STATA output with the estimated coefficient and p-value labeled]

70 [figure]

71 Linear model Our new linear model with the log-transformed outcome:
ln(LVi) = b0 + b1*agei + ei What is the interpretation of b1? What is the interpretation of b0? Using STATA, we get the following regression equation: Is there a significant relationship between age and lesion volume?

72 Hypothesis test H0: b1 =0 Continuous outcome, continuous predictor
Linear regression Test statistic: t=0.38 (102 dof) p-value=0.71 Since the p-value is more than 0.05, we fail to reject the null hypothesis We conclude that there is no significant association between age and lesion volume

73 [STATA output with the estimated coefficient and p-value labeled]

74 [figure]

75 Histograms of residuals
Untransformed values Transformed values

76 Conclusions for model checking
Checking model assumptions for linear regression is needed to ensure inferences are correct; if you have the wrong model, your inference will be wrong as well. The majority of model checking is based on the residuals. If model fit is bad, a different model should be used.

77 Dichotomous predictors

78 Linear regression with dichotomous predictor
Linear regression can also be used for dichotomous predictors, like sex. To do this, we use an indicator variable, which equals 1 for male and 0 for female. The resulting regression equation for BPF is BPFi = b0 + b1*sexi + ei.

79 Graph

80 The regression equation can be rewritten as a pair of group means.
The meaning of the coefficients in this case: b0 is the mean BPF when sex = 0, the female group; b0 + b1 is the mean BPF when sex = 1, the male group. What is the interpretation of b1? For a one-unit increase in sex, there is a b1 increase in the mean BPF; that is, b1 is the difference in mean BPF between the males and females.
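
An illustrative sketch with a 0/1 indicator (hypothetical BPF values, not the course data), confirming that the intercept and slope recover the group means and their difference:

```python
# Indicator-variable regression: intercept = mean of the sex = 0 group,
# slope = difference between the group means.
import numpy as np
from scipy import stats

sex = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)  # 1 = male
bpf = np.array([0.80, 0.82, 0.83, 0.81, 0.84, 0.85, 0.87, 0.86, 0.88, 0.84])

fit = stats.linregress(sex, bpf)
print(fit.intercept)               # b0hat: mean BPF in the female group
print(fit.intercept + fit.slope)   # b0hat + b1hat: mean BPF in the male group
print(fit.slope, fit.pvalue)       # b1hat: difference between the group means
```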

81 Interpretation of results
The final regression equation is BPFhat = 0.823 + 0.037*sex. The meaning of the coefficients in this case: 0.823 is the estimate of the mean BPF in the female group, and 0.037 is the estimated difference in mean BPF between the males and females. What is the estimated mean BPF in the males? How could we test if the difference between the groups is statistically significant?

82 Hypothesis test H0: There is no difference based on gender (b1 =0)
Continuous outcome, dichotomous predictor Linear regression Test statistic: t=1.82 (27 dof) p-value=0.079 Since the p-value is more than 0.05, we fail to reject the null hypothesis We conclude that there is no significant difference in the mean BPF in males compared to females

83 [STATA output with the p-value for the difference and the estimated difference between groups labeled]

84 [figure]

85 T-test As hopefully you remember, you could have tested this same null hypothesis using a two sample t-test Linear regression makes an equal variance assumption, so let’s use the same assumption for our t-test

86 Hypothesis test H0: There is no difference based on gender
Continuous outcome, dichotomous predictor t-test Test statistic: t=-1.82 (27 dof) p-value=0.079 Since the p-value is more than 0.05, we fail to reject the null hypothesis We conclude that there is no significant difference in the mean BPF in males compared to females
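
An illustrative check of this equivalence on made-up data: the equal-variance two-sample t-test and the regression slope test return the same p-value (and the same t statistic, up to sign):

```python
# Equal-variance two-sample t-test vs. regression with an indicator variable.
import numpy as np
from scipy import stats

sex = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)
bpf = np.array([0.80, 0.82, 0.83, 0.81, 0.84, 0.85, 0.87, 0.86, 0.88, 0.84])

reg = stats.linregress(sex, bpf)
tt = stats.ttest_ind(bpf[sex == 1], bpf[sex == 0], equal_var=True)
print(reg.pvalue, tt.pvalue)   # identical
```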

87 [figure]

88 Amazing!!! We get the same result using both approaches!!
Linear regression has the advantages of: Allowing multiple predictors (tomorrow) Accommodating continuous predictors (relationship to correlation) Accommodating categorical predictors (tomorrow) Very flexible approach

89 Conclusion Indicator variables can be used to represent dichotomous variables in a regression equation Interpretation of the coefficient for an indicator variable is the same as for a continuous variable Provides a group comparison Tomorrow we will see how to use regression to match ANOVA results

