Introduction to Biostatistical Analysis Using R
Statistics course for first-year PhD students
Lecturer: Lorenzo Marini, DAFNAE, University of Padova, Viale dell'Università 16, Legnaro, Padova. Skype: lorenzo.marini
Session 4. Lecture: Regression Analysis. Practical: multiple regression.
Statistical modelling: more than one parameter
Nature of the response variable:
- Normal → General Linear Models
- Poisson, Binomial, … → Generalized Linear Models (GLM)
Nature of the explanatory variables:
- Categorical → ANOVA (Session 3)
- Continuous → Regression (Session 4)
- Categorical + continuous → ANCOVA (Session 3)
REGRESSION
Linear methods:
- Simple linear: one X, linear relation
- Multiple linear: 2 or more X_i, linear relation
- Polynomial: one X but more slopes, non-linear relation
Non-linear methods:
- Non-linear: one X, complex relation
LINEAR REGRESSION: lm()
Regression analysis is a technique for modelling numerical data consisting of values of a dependent variable (response variable) and of one or more independent continuous variables (explanatory variables).
Assumptions:
- Independence: the Y-values and the error terms are independent of each other.
- Linearity between Y and X.
- Normality: the Y-values and the error terms are normally distributed for each level of the predictor variable x.
- Homogeneity of variance: the Y-values and the error terms have the same variance at each level of the predictor variable x.
(Don't test for normality or heteroscedasticity on the raw response; check the residuals instead!)
LINEAR REGRESSION: lm()
AIMS
1. To describe the linear relationship between Y and the X_i (EXPLANATORY APPROACH) and to quantify how much of the total variation in Y can be explained by the linear relationship with the X_i.
2. To predict new values of Y from new values of X_i (PREDICTIVE APPROACH).
Model: Y_i = α + β x_i + ε_i, where Y is the response and the X_i are the predictors. We estimate one INTERCEPT and one or more SLOPES.
SIMPLE LINEAR REGRESSION, step by step:
I step: check linearity [visualization with plot()]
II step: estimate the parameters (one slope and one intercept)
III step: check the residuals (verify the assumptions on the residuals: normality and homogeneity of variance)
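As a sketch, the three steps might look like this in R; the data and object names (x, y, fit) are invented for illustration:

```r
# Hypothetical data for illustration
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 15.8)

# I step: check linearity visually
plot(x, y)

# II step: estimate the intercept and the slope
fit <- lm(y ~ x)
summary(fit)

# III step: check the residuals
qqnorm(residuals(fit)); qqline(residuals(fit))   # normality
plot(fitted(fit), residuals(fit))                # homogeneity of variance
shapiro.test(residuals(fit))
```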
SIMPLE LINEAR REGRESSION: normality
Do not test normality over the whole y; the assumption concerns the residuals at each level of x.
SIMPLE LINEAR REGRESSION: MODEL
y_i = α + β x_i
SLOPE: β = Σ (x_i - x_mean)(y_i - y_mean) / Σ (x_i - x_mean)²
INTERCEPT: α = y_mean - β x_mean
The model gives the fitted values.
RESIDUALS: residual = observed y_i - fitted y_i. The model does not explain everything.
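A minimal sketch verifying these two formulas against lm(), with made-up numbers:

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Slope: sum of cross-products over sum of squares of x
beta  <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
# Intercept: alpha = y_mean - beta * x_mean
alpha <- mean(y) - beta * mean(x)

# Fitted values and residuals
fitted_y <- alpha + beta * x
res      <- y - fitted_y

# The same estimates from lm()
coef(lm(y ~ x))   # (Intercept) = 2.2, x = 0.6
```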
SIMPLE LINEAR REGRESSION: least-squares demonstration
library(animation)

## Slope changing: save the animation in HTML pages
ani.options(ani.height = 450, ani.width = 600, outdir = getwd(),
            title = "Demonstration of Least Squares",
            description = "We want to find an estimate for the slope among 50 candidate slopes, so we compute the RSS one by one.")
ani.start()
par(mar = c(4, 4, 0.5, 0.1), mgp = c(2, 0.5, 0), tcl = -0.3)
least.squares()
ani.stop()

## Intercept changing: save the animation in HTML pages
ani.options(ani.height = 450, ani.width = 600, outdir = getwd(),
            title = "Demonstration of Least Squares",
            description = "We want to find an estimate for the intercept among the candidate intercepts, so we compute the RSS one by one.")
ani.start()
par(mar = c(4, 4, 0.5, 0.1), mgp = c(2, 0.5, 0), tcl = -0.3)
least.squares(ani.type = "i")
ani.stop()
SIMPLE LINEAR REGRESSION: hypothesis testing
H0: β = 0 (there is no relation between X and Y)
H1: β ≠ 0
We must measure the unreliability associated with each of the estimated parameters, i.e. we need the standard errors:
SE(β) = √[ (residual SS / (n - 2)) / Σ (x_i - x_mean)² ]
t = (β - 0) / SE(β)
Parameter t testing (test each single parameter!)
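The SE and t computations can be checked against R's own summary table; again the data are invented for illustration:

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
fit <- lm(y ~ x)
n   <- length(y)

rss  <- sum(residuals(fit)^2)                         # residual SS
se_b <- sqrt((rss / (n - 2)) / sum((x - mean(x))^2))  # SE of the slope
t_b  <- coef(fit)["x"] / se_b                         # t = (beta - 0) / SE(beta)

# Compare with R's own coefficient table
summary(fit)$coefficients["x", c("Std. Error", "t value")]
```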
SIMPLE LINEAR REGRESSION: measure of goodness-of-fit
Total SS = Σ (observed y_i - y_mean)²
Model SS = Σ (fitted y_i - y_mean)²
Residual SS = Total SS - Model SS
R² = Model SS / Total SS (explained variation)
R² does not provide information about significance. If the model is significant (β ≠ 0), how much variation is explained?
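A sketch of the same quantities computed by hand on toy data and checked against summary():

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
fit <- lm(y ~ x)

total_ss <- sum((y - mean(y))^2)            # total variation in y
model_ss <- sum((fitted(fit) - mean(y))^2)  # variation explained by the model
resid_ss <- total_ss - model_ss             # unexplained variation
r2 <- model_ss / total_ss

r2                        # 0.6
summary(fit)$r.squared    # identical
```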
SIMPLE LINEAR REGRESSION: example 1
If the model is significant, then model checking:
1. Linearity between X and Y? OK: no patterns in the residuals vs. predictor plot.
SIMPLE LINEAR REGRESSION: example 1
2. Normality of the residuals: Q-Q plot + Shapiro-Wilk test on the residuals
> shapiro.test(residuals)
Shapiro-Wilk normality test
data: residuals
W = , p-value =
OK: normality of the residuals is not rejected.
SIMPLE LINEAR REGRESSION: example 1
3. Homoscedasticity: regress the absolute residuals on the fitted values
Call: lm(formula = abs(residuals) ~ yfitted)
Coefficients: Estimate, SE, t, P for (Intercept) and yfitted
OK: no significant trend of the absolute residuals with the fitted values.
SIMPLE LINEAR REGRESSION: example 2
1. Linearity between X and Y? NO: there is NO LINEARITY between X and Y.
SIMPLE LINEAR REGRESSION: example 2
2. Normality of the residuals: Q-Q plot + Shapiro-Wilk test on the residuals
> shapiro.test(residuals)
Shapiro-Wilk normality test
data: residuals
W = , p-value =
NO: normality of the residuals is rejected.
SIMPLE LINEAR REGRESSION: example 2
3. Homoscedasticity? NO (compare with a homoscedastic pattern: YES).
SIMPLE LINEAR REGRESSION: example 2
How to deal with non-linearity and non-normality situations?
Transformation of the data:
- Box-Cox transformation (power transformation of the response)
- Square-root transformation
- Log transformation
- Arcsine transformation
Polynomial regression: regression with multiple terms (linear, quadratic, and cubic):
Y = a + b1 X + b2 X² + b3 X³ + error
X is one variable!
POLYNOMIAL REGRESSION: one X, n parameters
Hierarchy in the testing (always test the highest-order term first)!
X + X² + X³ → if X³ is n.s., drop it and fit X + X²; if P < 0.01, stop.
X + X² → if X² is n.s., drop it and fit X; if P < 0.01, stop.
X → if n.s., there is no relation.
NB: do not delete lower-order terms even if they are non-significant.
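The hierarchical testing can be sketched with nested models and anova(); the data and model names (m1, m2, m3) are invented, with a true quadratic curve:

```r
set.seed(1)
x <- seq(0, 10, length.out = 40)
y <- 1 + 2 * x - 0.2 * x^2 + rnorm(40, sd = 0.5)   # true curve is quadratic

m3 <- lm(y ~ x + I(x^2) + I(x^3))   # start from the highest-order model
m2 <- lm(y ~ x + I(x^2))
m1 <- lm(y ~ x)

# Test the highest term first: does dropping x^3 make the fit worse?
anova(m2, m3)
# Then test the quadratic term against the purely linear model
anova(m1, m2)
```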
MULTIPLE LINEAR REGRESSION: more than one X
Multiple regression: regression with two or more variables
Y = a + b1 X1 + b2 X2 + … + bi Xi
There are important issues involved in carrying out a multiple regression:
- which explanatory variables to include (VARIABLE SELECTION);
- NON-LINEARITY in the response to the explanatory variables;
- INTERACTIONS between explanatory variables;
- correlation between explanatory variables (COLLINEARITY);
- RELATIVE IMPORTANCE of variables.
Assumptions: the same as in simple linear regression!
MULTIPLE LINEAR REGRESSION: more than one X
MODEL: Y = a + b1 X1 + b2 X2 + … + bi Xi
Each slope (b_i) is a partial regression coefficient: the b_i are the most important parameters of the multiple regression model. They measure the expected change in the dependent variable associated with a one-unit change in an independent variable, holding the other independent variables constant. This interpretation of partial regression coefficients is very important because independent variables are often correlated with one another.
MULTIPLE LINEAR REGRESSION: more than one X
EXPANDED MODEL: we can add polynomial terms and interactions:
Y = a + linear terms + quadratic & cubic terms + interactions
QUADRATIC AND CUBIC TERMS account for NON-LINEARITY.
INTERACTIONS account for non-independent effects of the factors.
MULTIPLE LINEAR REGRESSION, step by step:
I step: check collinearity (visualization with pairs() and correlations) and check linearity
II step: variable selection and model building (different procedures to select the significant variables)
III step: check the residuals (verify the assumptions on the residuals: normality and homogeneity of variance)
MULTIPLE LINEAR REGRESSION: I STEP
Let's begin with an example from air pollution studies. How is ozone concentration related to wind speed, air temperature, and the intensity of solar radiation?
I STEP: check collinearity, check linearity.
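A sketch of the first step, assuming R's built-in airquality data as a stand-in for the course data set (the slides' own ozone data are not shown here):

```r
data(airquality)
vars <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]

# Check collinearity: scatterplot matrix and pairwise correlations
pairs(vars)
cor(vars, use = "complete.obs")

# Check linearity of a predictor against the response
plot(vars$Temp, vars$Ozone)
```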
MULTIPLE LINEAR REGRESSION: II STEP, MODEL BUILDING
How to carry out a model simplification in multiple regression: start with a complex model with interactions and quadratic and cubic terms, then:
1. Remove non-significant interaction terms.
2. Remove non-significant quadratic or other non-linear terms.
3. Remove non-significant explanatory variables.
4. Amalgamate explanatory variables that have similar parameter values.
→ Minimum Adequate Model
MULTIPLE LINEAR REGRESSION: II STEP
Start with the most complicated model (it is one approach). Coefficients (Estimate, significance):
(Intercept)    5.7E+02  **
temp          -1.1E+01  *
wind          -3.2E+01  **
rad           -3.1E-01
I(rad^2)      -3.6E-04
I(temp^2)      5.8E-02  *
I(wind^2)      6.1E-01  ***
temp:wind      2.4E-01
temp:rad       8.4E-03
wind:rad       2.1E-02
temp:wind:rad -4.3E-04
Delete only the highest-order interaction, temp:wind:rad! We cannot delete the lower-order terms yet!
MULTIPLE LINEAR REGRESSION: II STEP
Manual model simplification (one of many philosophies): delete the non-significant terms one by one, from COMPLEX to SIMPLE. At each deletion, test: is the fit of the simpler model significantly worse?
Hierarchy in the deletion:
1. Highest-order interactions
2. Cubic terms
3. Quadratic terms
4. Linear terms
IMPORTANT! If a quadratic or cubic term is significant, you cannot delete the lower-order linear or quadratic terms even if they are not significant. If an interaction is significant, you cannot delete the corresponding linear terms even if they are not significant.
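The one-term-at-a-time deletion can be sketched with update(), again assuming the built-in airquality data as a stand-in for the slides' ozone data:

```r
data(airquality)
aq <- na.omit(airquality)

# Complex starting model: main effects, squares, and all interactions
model1 <- lm(Ozone ~ Temp * Wind * Solar.R +
               I(Temp^2) + I(Wind^2) + I(Solar.R^2), data = aq)

# Delete only the highest-order interaction first
model2 <- update(model1, ~ . - Temp:Wind:Solar.R)

# Is the fit of the simpler model significantly worse?
anova(model1, model2)
```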
MULTIPLE LINEAR REGRESSION: III STEP
We must check the assumptions. If they are violated, we can transform the data (e.g. log-transformation of y) and refit the model.
MULTIPLE LINEAR REGRESSION: more than one X
The log-transformation has improved our model, but maybe there is an outlier.
PARTIAL REGRESSION
With partial regression we can remove the effect of one or more variables (covariates) and test a further variable that becomes independent of the covariates.
WHEN? When we would like to hold third variables constant but cannot manipulate them, we can use statistical control.
HOW? Statistical control is based on residuals. If we regress Y on X1 and take the residuals of Y, this part of Y is uncorrelated with X1, so anything the Y residuals correlate with cannot be explained by X1.
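A minimal sketch of this residual-based logic, with invented data (x1, x2, y):

```r
set.seed(42)
x1 <- rnorm(50)
x2 <- 0.5 * x1 + rnorm(50)          # x2 is correlated with x1
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(50)

# Remove the effect of x1 from y: keep the residuals
y_resid <- residuals(lm(y ~ x1))

# By construction these residuals are uncorrelated with x1 ...
cor(y_resid, x1)     # essentially zero

# ... so any remaining association with x2 is independent of x1
x2_resid <- residuals(lm(x2 ~ x1))
summary(lm(y_resid ~ x2_resid))
```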
PARTIAL REGRESSION: VARIATION PARTITIONING
Relative importance of groups of explanatory variables: e.g. partition the variation of the full model between Environment variables and Space (site longitude and latitude, in km).
NON-LINEAR REGRESSION: nls()
Sometimes we have a mechanistic model for the relationship between y and x, and we want to estimate the parameters and their standard errors for a specific non-linear equation from data.
We must SPECIFY the exact nature of the function as part of the model formula when we use non-linear modelling.
In place of lm() we write nls() (this stands for non-linear least squares). Then, instead of y ~ x + I(x^2) + I(x^3) (polynomial), we write y ~ function(...) to spell out the precise non-linear model we want R to fit to the data.
NON-LINEAR REGRESSION, step by step:
1. Plot y against x.
2. Get an idea of the family of functions that you can fit.
3. Start fitting the different models.
4. Specify initial guesses for the values of the parameters.
5. [Get the MAM for each model by model simplification.]
6. Check the residuals.
7. Compare PAIRS of models and choose the best (minimum deviance + minimum number of parameters).
Alternative approach, multimodel inference: compare GROUPS of models at a time with AIC = scaled deviance + 2k, where k = number of parameters + 1; model weights and model averaging [see Burnham & Anderson, 2002].
nls(): examples of function families
- Asymptotic functions
- S-shaped functions
- Humped functions
- Exponential functions
nls(): look at the data
Using the data plot, work out sensible starting values. It always helps in cases like this to work out the behaviour of the equation at the limits, i.e. find the values of y when x = 0 and when x → ∞.
Example: an asymptotic exponential. Understand the role of the parameters a, b, and c.
nls(): fit the model
1. Estimate a, b, and c (iterative).
2. Extract the fitted values (y_i).
3. Check the curve fit graphically.
Can we try another function from the same family? Model choice is always an important issue in curve fitting (particularly for prediction): different functions behave differently at the limits! Think about your biological system, not just the residual deviance!
nls(): fit a second model (Michaelis–Menten)
1. Extract the fitted values (y_i).
2. Check the curve fit graphically.
You can see that the asymptotic exponential (solid line) tends to reach its asymptote first, and that the Michaelis–Menten (dotted line) continues to increase. Model choice would therefore be enormously important if you intended to use the model for prediction at ages much greater than 50 months.
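A sketch of fitting and comparing the two functions with nls(), on invented saturating data; the starting values are guesses from a plot, as the slides recommend:

```r
# Invented saturating data, generated close to a Michaelis-Menten curve
x <- 1:10
y <- 10 * x / (2 + x) + 0.05 * rep(c(1, -1), 5)

# Michaelis-Menten: y = a*x / (b + x)
mm <- nls(y ~ a * x / (b + x), start = list(a = 8, b = 1))

# Asymptotic exponential: y = a * (1 - exp(-c * x))
ae <- nls(y ~ a * (1 - exp(-c * x)), start = list(a = 8, c = 0.5))

coef(mm)
AIC(mm, ae)   # lower AIC = better trade-off of deviance and parameters
```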
Application of regression: prediction
A model can be used to predict values of y in space or in time from new x_i values, but only within the spatial extent and the data range used to fit it (extrapolation: NO; interpolation: YES).
Before using a model for prediction it has to be VALIDATED! 2 APPROACHES:
VALIDATION
Residual = prediction error between the predicted and the real y, given the model fit.
1. In a data-rich situation, set aside a validation set: use one part of the data set to fit the model and the second part to assess the prediction error of the final selected model.
2. If data are scarce, resort to artificially produced validation sets: cross-validation or bootstrap.
K-FOLD CROSS-VALIDATION
Split the data randomly into K groups of roughly the same size. Take turns using one group as the test set and the other K-1 groups as the training set for fitting the model:
1. Prediction error 1
2. Prediction error 2
3. Prediction error 3
4. Prediction error 4
5. Prediction error 5
The cross-validation estimate of the prediction error is the average of these.
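A base-R sketch of K-fold cross-validation for a simple regression, with invented data:

```r
set.seed(7)
x <- runif(60, 0, 10)
y <- 2 + 3 * x + rnorm(60)

k      <- 5
folds  <- sample(rep(1:k, length.out = length(y)))   # random fold labels
errors <- numeric(k)

for (i in 1:k) {
  train <- data.frame(x = x[folds != i], y = y[folds != i])
  test  <- data.frame(x = x[folds == i], y = y[folds == i])
  fit   <- lm(y ~ x, data = train)
  pred  <- predict(fit, newdata = test)
  errors[i] <- mean((test$y - pred)^2)   # prediction error on the held-out fold
}

cv_error <- mean(errors)   # cross-validation estimate of prediction error
```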
BOOTSTRAP
1. Generate a large number n of bootstrap samples.
2. For each bootstrap sample, compute the prediction error (Error 1, Error 2, Error 3, …, Error n).
3. The mean of these estimates is the bootstrap estimate of the prediction error.
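A sketch of a bootstrap estimate of prediction error, with invented data; here each error is computed on the out-of-bag observations, an assumption since the slides do not specify the scheme:

```r
set.seed(7)
x <- runif(60, 0, 10)
y <- 2 + 3 * x + rnorm(60)
dat <- data.frame(x, y)

n_boot <- 200
errors <- numeric(n_boot)

for (b in 1:n_boot) {
  idx <- sample(nrow(dat), replace = TRUE)   # bootstrap sample of row indices
  fit <- lm(y ~ x, data = dat[idx, ])
  oob <- dat[-unique(idx), ]                 # observations not in the sample
  errors[b] <- if (nrow(oob) > 0)
    mean((oob$y - predict(fit, newdata = oob))^2) else NA
}

boot_error <- mean(errors, na.rm = TRUE)   # bootstrap estimate of prediction error
```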
Application of regression: prediction
1. Do not use your model for prediction without carrying out a validation: if you can, use an independent data set to validate the model; if you cannot, use at least bootstrap or cross-validation.
2. Never extrapolate.