1 Introduction to Biostatistical Analysis Using R
Statistics course for first-year PhD students
Session 4
Lecture: Regression Analysis
Practical: multiple regression
Lecturer: Lorenzo Marini
DAFNAE, University of Padova, Viale dell’Università 16, Legnaro, Padova.
Tel.:
Skype: lorenzo.marini

2 Statistical modelling: more than one parameter
Nature of the response variable
-NORMAL: General Linear Models (Sessions 3 and 4)
-POISSON, BINOMIAL, …: Generalized Linear Models (GLM)
Nature of the explanatory variables (General Linear Models)
-Categorical: ANOVA
-Continuous: Regression
-Categorical + continuous: ANCOVA

3 REGRESSION
Linear methods
-Simple linear: one X, linear relation
-Multiple linear: two or more Xi, linear relation
-Polynomial: one X but more slopes, non-linear relation
Non-linear methods
-Non-linear: one X, complex relation

4 LINEAR REGRESSION: lm()
Regression analysis is a technique for modelling and analysing numerical data consisting of values of a dependent variable (response variable) and of one or more independent continuous variables (explanatory variables).
Assumptions
-Independence: the Y-values and the error terms must be independent of each other.
-Linearity between Y and X.
-Normality: the populations of Y-values and the error terms are normally distributed for each level of the predictor variable x.
-Homogeneity of variance: the populations of Y-values and the error terms have the same variance at each level of the predictor variable x.
(Do not test the raw data for normality or heteroscedasticity; check the residuals instead!)

5 LINEAR REGRESSION: lm()
AIMS
1. To describe the linear relationship between Y and Xi (EXPLANATORY APPROACH) and to quantify how much of the total variation in Y can be explained by the linear relationship with Xi.
2. To predict new values of Y from new values of Xi (PREDICTIVE APPROACH).
Xi: predictors; Y: response
$y_i = \alpha + \beta x_i + \varepsilon_i$
We estimate one INTERCEPT and one or more SLOPES.

6 SIMPLE LINEAR REGRESSION
Simple LINEAR regression step by step:
I step:
-Check linearity [visualization with plot()]
II step:
-Estimate the parameters (one slope and one intercept)
III step:
-Check residuals (check the assumptions by looking at the residuals: normality and homogeneity of variance)

7 SIMPLE LINEAR REGRESSION: Normality
Do not test normality on the whole set of y values; the assumption concerns the error terms, so check the residuals.

8 SIMPLE LINEAR REGRESSION MODEL
The model gives the fitted values $\hat{y}_i$:
$\hat{y}_i = \alpha + \beta x_i$
SLOPE: $\beta = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$
INTERCEPT: $\alpha = \bar{y} - \beta \bar{x}$
The model does not explain everything: RESIDUALS
Residual = observed $y_i$ - fitted $\hat{y}_i$
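As a minimal sketch of the formulas above, the slope, intercept and residuals can be computed by hand and compared with lm(). The built-in `cars` data set (speed, dist) is only a stand-in here; it is an assumption for illustration and not the data used in the lecture.

```r
# Stand-in data (assumption): built-in 'cars' data set
x <- cars$speed
y <- cars$dist

# Slope and intercept from the least-squares formulas
beta  <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
alpha <- mean(y) - beta * mean(x)

# Fitted values and residuals (observed minus fitted)
fitted_y <- alpha + beta * x
res <- y - fitted_y

# The same quantities estimated by lm()
fit <- lm(dist ~ speed, data = cars)
coef(fit)
c(intercept = alpha, slope = beta)
```

The two sets of estimates should agree up to rounding, since lm() solves the same least-squares problem.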

9 SIMPLE LINEAR REGRESSION: least squares regression explanation
```r
library(animation)

############################################
# Slope changing
# Save the animation in HTML pages
ani.options(ani.height = 450, ani.width = 600, outdir = getwd(),
            title = "Demonstration of Least Squares",
            description = "We want to find an estimate for the slope in 50 candidate slopes, so we just compute the RSS one by one.")
ani.start()
par(mar = c(4, 4, 0.5, 0.1), mgp = c(2, 0.5, 0), tcl = -0.3)
least.squares()
ani.stop()

############################################
# Intercept changing
least.squares(ani.type = "i")
```

10 SIMPLE LINEAR REGRESSION: parameter t testing
Hypothesis testing
H0: β = 0 (there is no relation between X and Y)
H1: β ≠ 0
Parameter t testing (test the single parameter!)
We must measure the unreliability associated with each of the estimated parameters (i.e. we need the standard errors):
$SE(\beta) = \sqrt{\frac{\text{residual SS}/(n-2)}{\sum (x_i - \bar{x})^2}}$
$t = (\beta - 0) / SE(\beta)$, with n - 2 degrees of freedom
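A short sketch of the t test for the slope, computed by hand and checked against summary(). It assumes the usual standard error of the slope, i.e. the square root of (residual SS/(n-2)) divided by Σ(xi - xmean)², and again uses the built-in `cars` data as a stand-in.

```r
# Stand-in data (assumption): built-in 'cars' data set
fit <- lm(dist ~ speed, data = cars)
x <- cars$speed
n <- nrow(cars)

# Residual sum of squares and standard error of the slope
rss <- sum(residuals(fit)^2)
se_beta <- sqrt((rss / (n - 2)) / sum((x - mean(x))^2))

# t statistic for H0: beta = 0 and its two-sided p-value (n - 2 d.f.)
t_beta <- unname(coef(fit)["speed"]) / se_beta
p_beta <- 2 * pt(abs(t_beta), df = n - 2, lower.tail = FALSE)

# Compare with the coefficient table reported by summary()
summary(fit)$coefficients["speed", ]
c(se = se_beta, t = t_beta, p = p_beta)
```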
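
11 SIMPLE LINEAR REGRESSION: goodness-of-fit
If the model is significant (β ≠ 0): how much variation is explained?
Measure of goodness-of-fit (with $y_i$ the observed and $\hat{y}_i$ the fitted values):
$\text{Total SS} = \sum (y_i - \bar{y})^2$
$\text{Model SS} = \sum (\hat{y}_i - \bar{y})^2$
$\text{Residual SS} = \text{Total SS} - \text{Model SS}$
$R^2 = \text{Model SS} / \text{Total SS}$ (explained variation)
R² does not provide information about the significance.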
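The sums of squares above can be sketched in a few lines and compared with the R² that summary() reports. The `cars` data set is a stand-in chosen for illustration only.

```r
# Stand-in data (assumption): built-in 'cars' data set
fit <- lm(dist ~ speed, data = cars)
y <- cars$dist

total_ss <- sum((y - mean(y))^2)            # total variation in y
model_ss <- sum((fitted(fit) - mean(y))^2)  # variation explained by the model
resid_ss <- total_ss - model_ss             # what the model leaves unexplained

r2 <- model_ss / total_ss
r2
summary(fit)$r.squared
```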

12 SIMPLE LINEAR REGRESSION: example 1
If the model is significant, then model checking.
1. Linearity between X and Y? OK: no patterns in the residuals vs. predictor plot.

13 SIMPLE LINEAR REGRESSION: example 1
2. Normality of the residuals: Q-Q plot + Shapiro-Wilk test on the residuals. OK.
> shapiro.test(residuals)
Shapiro-Wilk normality test
data: residuals
W = , p-value =

14 SIMPLE LINEAR REGRESSION: example 1
3. Homoscedasticity: OK.
Call:
lm(formula = abs(residuals) ~ yfitted)
Coefficients:
             Estimate  SE  t  P
(Intercept)
yfitted
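The three checks of example 1 can be sketched as follows. The lecture data are not available here, so the built-in `cars` data set is used as a stand-in (an assumption); the object names `res`, `yfitted` and `het` are illustrative.

```r
# Stand-in data (assumption): built-in 'cars' data set
fit <- lm(dist ~ speed, data = cars)
res <- residuals(fit)
yfitted <- fitted(fit)

# 1. Linearity: residuals vs. predictor should show no pattern
plot(cars$speed, res, xlab = "x", ylab = "Residuals")
abline(h = 0, lty = 2)

# 2. Normality of the residuals: Q-Q plot + Shapiro-Wilk test
qqnorm(res)
qqline(res)
sw <- shapiro.test(res)
sw

# 3. Homoscedasticity: absolute residuals should not trend with fitted values
het <- lm(abs(res) ~ yfitted)
summary(het)$coefficients
```

A clearly non-zero slope in the last model would suggest that the spread of the residuals changes with the fitted values.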

15 SIMPLE LINEAR REGRESSION: example 2
1. Linearity between X and Y? NO LINEARITY between X and Y.
[Figure: three diagnostic panels labelled no, no, yes]

16 SIMPLE LINEAR REGRESSION: example 2
2. Normality of the residuals: Q-Q plot + Shapiro-Wilk test on the residuals. NO.
> shapiro.test(residuals)
Shapiro-Wilk normality test
data: residuals
W = , p-value =

17 SIMPLE LINEAR REGRESSION: example 2
3. Homoscedasticity?
[Figure: two panels labelled NO and YES]

18 SIMPLE LINEAR REGRESSION: example 2
How to deal with non-linearity and non-normality?
Transformation of the data
-Box-Cox transformation (power transformation of the response)
-Square-root transformation
-Log transformation
-Arcsine transformation
Polynomial regression
Regression with multiple terms (linear, quadratic, and cubic):
$Y = a + b_1 X + b_2 X^2 + b_3 X^3 + \text{error}$
X is one variable!!!
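
19 POLYNOMIAL REGRESSION: one X, n parameters
Hierarchy in the testing (always test the highest term first)!!!
1. Fit X + X² + X³: if the cubic term has P<0.01, stop; if n.s., go on.
2. Fit X + X²: if the quadratic term has P<0.01, stop; if n.s., go on.
3. Fit X: if the linear term has P<0.01, stop; if n.s., no relation.
NB: Do not delete lower-order terms even if non-significant.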
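One possible way to follow this hierarchy in R is to fit the nested polynomial models and inspect the highest-order term first. The sketch below uses the built-in `airquality` data (ozone against temperature) as a stand-in; the choice of data and variable is an assumption for illustration.

```r
# Stand-in data (assumption): ozone vs. temperature from 'airquality'
d <- na.omit(airquality[, c("Ozone", "Temp")])

m1 <- lm(Ozone ~ Temp, data = d)                          # X
m2 <- lm(Ozone ~ Temp + I(Temp^2), data = d)              # X + X^2
m3 <- lm(Ozone ~ Temp + I(Temp^2) + I(Temp^3), data = d)  # X + X^2 + X^3

# Start from the top: is the cubic term needed?
summary(m3)$coefficients

# Sequential F tests between the nested models (m1 vs. m2, m2 vs. m3)
tab <- anova(m1, m2, m3)
tab
```

If the cubic term is not significant, look at the quadratic term in `m2`, and so on, always keeping the lower-order terms of whichever model is retained.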

20 MULTIPLE LINEAR REGRESSION: more than one x
Multiple regression: regression with two or more variables
$Y = a + b_1 X_1 + b_2 X_2 + \dots + b_i X_i$
Assumptions
Same assumptions as in simple linear regression!!!
The Multiple Regression Model
There are important issues involved in carrying out a multiple regression:
• which explanatory variables to include (VARIABLE SELECTION);
• NON-LINEARITY in the response to the explanatory variables;
• INTERACTIONS between explanatory variables;
• correlation between explanatory variables (COLLINEARITY);
• RELATIVE IMPORTANCE of variables.

21 MULTIPLE LINEAR REGRESSION: more than one x
Multiple regression MODEL: regression with two or more variables
$Y = a + b_1 X_1 + b_2 X_2 + \dots + b_i X_i$
Each slope (bi) is a partial regression coefficient.
The bi are the most important parameters of the multiple regression model. They measure the expected change in the dependent variable associated with a one-unit change in an independent variable, holding the other independent variables constant. This interpretation of partial regression coefficients is very important because independent variables are often correlated with one another.

22 MULTIPLE LINEAR REGRESSION: more than one x
Multiple regression MODEL EXPANDED
We can add polynomial terms and interactions:
Y = a + linear terms + quadratic & cubic terms + interactions
QUADRATIC AND CUBIC TERMS account for NON-LINEARITY.
INTERACTIONS account for non-independent effects of the factors.

23 MULTIPLE LINEAR REGRESSION
Multiple regression step by step:
I step:
-Check collinearity (visualization with pairs() and correlation)
-Check linearity
II step:
-Variable selection and model building (different procedures to select the significant variables)
III step:
-Check residuals (check the assumptions by looking at the residuals: normality and homogeneity of variance)

24 MULTIPLE LINEAR REGRESSION: I STEP
-Check collinearity
-Check linearity
Let’s begin with an example from air pollution studies. How is ozone concentration related to wind speed, air temperature and the intensity of solar radiation?
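A minimal sketch of this first step, assuming the built-in `airquality` data as a stand-in for the ozone data of the lecture (the lecture file itself is not available here). The columns are renamed to the short names used on the slides (ozone, rad, wind, temp).

```r
# Stand-in data (assumption): built-in 'airquality', renamed to the slide names
d <- na.omit(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])
names(d) <- c("ozone", "rad", "wind", "temp")

# Scatterplot matrix with a smoother: look for non-linear relations with ozone
pairs(d, panel = panel.smooth)

# Correlations between the explanatory variables (collinearity check)
cc <- cor(d[, c("rad", "wind", "temp")])
round(cc, 2)
```

Strong correlations between explanatory variables would be a warning sign for collinearity; curved smoothers in the ozone row suggest that quadratic terms may be worth testing.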

25 MULTIPLE LINEAR REGRESSION: II STEP
II STEP: MODEL BUILDING
Start with a complex model with interactions and quadratic and cubic terms, then simplify it down to the Minimum Adequate Model.
How to carry out a model simplification in multiple regression:
1. Remove non-significant interaction terms.
2. Remove non-significant quadratic or other non-linear terms.
3. Remove non-significant explanatory variables.
4. Amalgamate explanatory variables that have similar parameter values.

26 MULTIPLE LINEAR REGRESSION: II STEP
Start with the most complicated model (it is one approach):
model1 <- lm(ozone ~ temp*wind*rad + I(rad^2) + I(temp^2) + I(wind^2))

               Estimate   Std.Error  t      Pr(>|t|)
(Intercept)    5.7E+02    2.1E+02    2.74   0.01  **
temp          -1.1E+01    4.3E+00   -2.50         *
wind          -3.2E+01    1.2E+01   -2.76
rad           -3.1E-01    5.6E-01   -0.56   0.58
I(rad^2)      -3.6E-04    2.6E-04   -1.41   0.16
I(temp^2)      5.8E-02    2.4E-02    2.44   0.02
I(wind^2)      6.1E-01    1.5E-01    4.16   0.00  ***
temp:wind      2.4E-01    1.4E-01    1.74   0.09
temp:rad       8.4E-03    7.5E-03    1.12   0.27
wind:rad       2.1E-02    4.9E-02    0.42   0.68
temp:wind:rad -4.3E-04    6.6E-04   -0.66   0.51

We cannot delete the lower-order terms while higher-order terms that contain them are still in the model!!!
Delete only the highest interaction, temp:wind:rad.

27 MULTIPLE LINEAR REGRESSION: II STEP
Manual model simplification (it is one of the many philosophies)
Delete the non-significant terms one by one, from COMPLEX to SIMPLE.
Hierarchy in the deletion:
1. Highest interactions
2. Cubic terms
3. Quadratic terms
4. Linear terms
At each deletion, test: is the fit of the simpler model worse?
IMPORTANT!!!
If a quadratic or cubic term is significant, you cannot delete the corresponding linear or quadratic term even if it is not significant.
If an interaction is significant, you cannot delete the linear terms involved even if they are not significant.
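A single deletion step could look like the sketch below: remove the three-way interaction with update() and ask with anova() whether the simpler model fits significantly worse. The `airquality` data are a stand-in (an assumption), so the numbers will not necessarily match the table on the slide.

```r
# Stand-in data (assumption): built-in 'airquality', renamed to the slide names
d <- na.omit(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])
names(d) <- c("ozone", "rad", "wind", "temp")

# Most complex model: all interactions plus quadratic terms
model1 <- lm(ozone ~ temp * wind * rad + I(rad^2) + I(temp^2) + I(wind^2),
             data = d)

# Delete only the highest-order interaction
model2 <- update(model1, . ~ . - temp:wind:rad)

# F test: is the fit of the simpler model significantly worse?
cmp <- anova(model2, model1)
cmp
```

If the test is not significant, `model2` is kept and the next candidate term in the hierarchy is removed in the same way.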

28 MULTIPLE LINEAR REGRESSION: III STEP
III STEP: we must check the assumptions
-Homogeneity of variance: NO (variance tends to increase with y)
-Normality: NO (non-normal errors)
We can transform the data (e.g. log-transformation of y):
model <- lm(log(ozone) ~ temp + wind + rad + I(wind^2))
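A sketch of refitting the simplified model on the log scale and looking at the standard diagnostic plots, again assuming `airquality` as a stand-in for the lecture data.

```r
# Stand-in data (assumption): built-in 'airquality', renamed to the slide names
d <- na.omit(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])
names(d) <- c("ozone", "rad", "wind", "temp")

# Simplified model with a log-transformed response
model <- lm(log(ozone) ~ temp + wind + rad + I(wind^2), data = d)

# Standard diagnostic plots: residuals vs. fitted, Q-Q plot,
# scale-location, and residuals vs. leverage
op <- par(mfrow = c(2, 2))
plot(model)
par(op)
```

The residuals vs. leverage panel is a convenient place to look for influential observations such as a possible outlier.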

29 MULTIPLE LINEAR REGRESSION: more than one x
The log-transformation has improved our model, but maybe there is an outlier.

30 PARTIAL REGRESSION
With partial regression we can remove the effect of one or more variables (covariates) and test a further factor, which becomes independent of the covariates.
WHEN?
We would like to hold third variables constant but cannot manipulate them, so we use statistical control.
HOW?
Statistical control is based on residuals. If we regress Y on X1 and take the residuals of Y, this part of Y will be uncorrelated with X1, so anything the Y residuals correlate with cannot be explained by X1.
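The residual-based control described above can be sketched as follows, with temperature playing the role of the covariate X1 and wind the further factor. The data (`airquality`) and the choice of variables are assumptions for illustration.

```r
# Stand-in data (assumption): built-in 'airquality', renamed to the slide names
d <- na.omit(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])
names(d) <- c("ozone", "rad", "wind", "temp")

# Regress Y on the covariate X1 (temp) and keep the residuals
res_y <- residuals(lm(ozone ~ temp, data = d))

# The residuals are uncorrelated with X1 (zero up to rounding error)
r_cov <- cor(res_y, d$temp)
r_cov

# Test a further factor (wind) on the part of Y not explained by temp
partial <- lm(res_y ~ d$wind)
summary(partial)$coefficients
```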

31 PARTIAL REGRESSION: VARIATION PARTITIONING
Relative importance of groups of explanatory variables
What is space and what is environment?
Response variable: orthopteran species richness
Explanatory variables: SPACE (latitude + longitude) + ENVIRONMENT (temperature + land-cover heterogeneity)
Full.model <- lm(species ~ environment_i + space_i)
R² = 76% (TOTAL EXPLAINED VARIATION)
[Figure: map of sites, latitude (km) vs. longitude (km); diagram splitting the total variation into Space, Space∩Environment, Environment and Unexplained]

32 VARIATION PARTITIONING: varpart (vegan)
Full.model <- lm(SPECIES ~ temp + het + lat + long)      TVE = 76%
Env.model <- lm(SPECIES ~ temp + het)      gives env.residuals
Pure.Space.model <- lm(ENV.RESIDUALS ~ lat + long)      VE = 15%
Space.model <- lm(SPECIES ~ lat + long)      gives space.residuals
Pure.env.model <- lm(SPACE.RESIDUALS ~ temp + het)      VE = 40%
[Figure: for each model, the Space / Environment / Unexplained diagram with the corresponding fraction highlighted]
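The orthopteran data are not available here, so the sketch below uses simulated data (clearly an assumption, with arbitrary effect sizes) and a common variant of the partitioning based on differences in unadjusted R² between nested models, rather than the residual regressions listed above. The helper `r2()` is a hypothetical convenience function.

```r
# Simulated stand-in data (assumption): spatially structured temperature
set.seed(1)
n <- 100
lat  <- runif(n)
long <- runif(n)
temp <- 10 - 5 * lat + rnorm(n)
het  <- runif(n)
species <- 5 + 0.8 * temp + 3 * het + 2 * long + rnorm(n)

full  <- lm(species ~ temp + het + lat + long)
env   <- lm(species ~ temp + het)
space <- lm(species ~ lat + long)

r2 <- function(m) summary(m)$r.squared  # hypothetical helper

pure_space  <- r2(full) - r2(env)               # space, given environment
pure_env    <- r2(full) - r2(space)             # environment, given space
shared      <- r2(env) + r2(space) - r2(full)   # Space ∩ Environment
unexplained <- 1 - r2(full)

round(c(pure_space = pure_space, pure_env = pure_env,
        shared = shared, unexplained = unexplained), 3)
```

The four fractions add up to 1 by construction. Partitioning functions such as varpart() in vegan work with adjusted R², so their output will generally differ from this unadjusted sketch.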

33 NON-LINEAR REGRESSION: nls()
Sometimes we have a mechanistic model for the relationship between y and x, and we want to estimate the parameters and standard errors of the parameters of a specific non-linear equation from data.
We must SPECIFY the exact nature of the function as part of the model formula when we use non-linear modelling.
In place of lm() we write nls() (this stands for ‘non-linear least squares’). Then, instead of y ~ x + I(x^2) + I(x^3) (polynomial), we write y ~ function to spell out the precise non-linear model we want R to fit to the data.

34 NON-LINEAR REGRESSION: step by step
1. Plot y against x
2. Get an idea of the family of functions that you can fit
3. Start fitting the different models
4. Specify initial guesses for the values of the parameters
5. [Get the MAM for each by model simplification]
6. Check the residuals
7. Compare PAIRS of models and choose the best
Alternative approach: multimodel inference (minimum deviance + minimum number of parameters)
AIC = scaled deviance + 2k
k = parameter number + 1
Compare GROUPS of models at a time
Model weights and model average [see Burnham & Anderson, 2002]

35 nls(): examples of function families
-Asymptotic functions
-S-shaped functions
-Humped functions
-Exponential functions

36 nls(): Look at the data
Which family? Asymptotic functions? S-shaped functions? Humped functions? Exponential functions?
Here: asymptotic exponential
Understand the role of the parameters a, b, and c.
Using the data plot, work out sensible starting values. It always helps in cases like this to work out the equation’s behaviour at the limits, i.e. find the values of y when x = 0 and when x = ∞.

37 nls(): Fit the model
1. Estimate a, b, and c (iterative)
2. Extract the fitted values (yi)
3. Check the curve fitting graphically
Can we try another function from the same family?
Model choice is always an important issue in curve fitting (particularly for prediction).
Different behaviour at the limits!
Think about your biological system, not just the residual deviance!
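The three steps can be sketched with nls() as below. The data are simulated (an assumption, since the lecture data are not available), and the three-parameter asymptotic exponential y = a - b*exp(-c*x) is assumed as the functional form; the variable names `age` and `bone` are illustrative.

```r
# Simulated stand-in data (assumption): response levelling off with age
set.seed(42)
age  <- seq(1, 50, length.out = 60)
bone <- 120 - 110 * exp(-0.06 * age) + rnorm(60, sd = 5)

# Starting values read off the plot:
#   x = infinity gives y = a (asymptote), x = 0 gives y = a - b
plot(age, bone)

# 1. Iterative estimation of a, b and c
m_exp <- nls(bone ~ a - b * exp(-c * age),
             start = list(a = 120, b = 110, c = 0.06))
summary(m_exp)$coefficients

# 2. Extract the fitted values
yfit <- fitted(m_exp)

# 3. Check the fitted curve graphically
lines(age, yfit)
```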

38 nls(): Fit a second model (Michaelis–Menten)
1. Extract the fitted values (yi)
2. Check the curve fitting graphically
You can see that the asymptotic exponential (solid line) tends to get to its asymptote first, and that the Michaelis–Menten (dotted line) continues to increase. Model choice would therefore be enormously important if you intended to use the model for prediction to ages much greater than 50 months.
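A sketch of fitting both candidate functions and comparing them, inside and beyond the range of the data. The data are simulated (an assumption), the Michaelis–Menten curve is assumed in the form y = a*x/(1 + b*x), and the starting values are rough guesses for these simulated data only.

```r
# Simulated stand-in data (assumption)
set.seed(42)
age  <- seq(1, 50, length.out = 60)
bone <- 120 - 110 * exp(-0.06 * age) + rnorm(60, sd = 5)

# Asymptotic exponential and Michaelis-Menten fits
m_exp <- nls(bone ~ a - b * exp(-c * age),
             start = list(a = 120, b = 110, c = 0.06))
m_mm  <- nls(bone ~ a * age / (1 + b * age),
             start = list(a = 18, b = 0.14))

# Graphical check: solid = asymptotic exponential, dotted = Michaelis-Menten
plot(age, bone)
lines(age, fitted(m_exp), lty = 1)
lines(age, fitted(m_mm), lty = 3)

# Compare the two models by AIC
aic_tab <- AIC(m_exp, m_mm)
aic_tab

# Predictions at and beyond the observed range of ages
new <- data.frame(age = c(50, 100, 200))
pred <- data.frame(age = new$age,
                   exp = predict(m_exp, newdata = new),
                   mm  = predict(m_mm, newdata = new))
pred
```

The two curves may look similar over the observed ages while their predictions drift apart at larger ages, which is why the behaviour at the limits matters when the model is used for prediction.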

39 Application of regression: prediction
Regression models for prediction
A model can be used to predict values of y in space or in time, knowing new xi values.
[Figure: spatial extent + data range, panels labelled YES and NO]
Before using a model for prediction it has to be VALIDATED!!!
2 APPROACHES

40 VALIDATION
1. In a data-rich situation, set aside a validation set (use one part of the data set to fit the model, and the second part to assess the prediction error of the final selected model).
Residual = prediction error = real y - predicted y (from the model fit)
2. If data are scarce, we must resort to “artificially produced” validation sets:
-Cross-validation
-Bootstrap
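Approach 1 can be sketched as a simple random split. The 70/30 proportion, the `airquality` stand-in data and the use of root mean squared error on the log scale are all assumptions made for this illustration.

```r
# Stand-in data (assumption): built-in 'airquality', renamed to the slide names
d <- na.omit(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])
names(d) <- c("ozone", "rad", "wind", "temp")

# Set aside a validation set: 70% to fit the model, 30% to assess it
set.seed(1)
idx   <- sample(nrow(d), size = round(0.7 * nrow(d)))
train <- d[idx, ]
test  <- d[-idx, ]

# Fit on the training part only
fit <- lm(log(ozone) ~ temp + wind + rad + I(wind^2), data = train)

# Prediction error = real y - predicted y, on data the model has not seen
pred <- predict(fit, newdata = test)
pred_error <- log(test$ozone) - pred
rmse <- sqrt(mean(pred_error^2))
rmse
```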

41 K-FOLD CROSS-VALIDATION
Split the data randomly into K groups of roughly the same size.
Take turns using one group as the test set and the other K-1 groups as the training set for fitting the model.

Group:                  1      2      3      4      5
1. Prediction error 1   Test   Train  Train  Train  Train
2. Prediction error 2   Train  Train  Test   Train  Train
3. Prediction error 3   Train  Train  Train  Test   Train
4. Prediction error 4   Train  Test   Train  Train  Train
5. Prediction error 5   Train  Train  Train  Train  Test

The cross-validation estimate of prediction error is the average of these.
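A minimal sketch of 5-fold cross-validation written with base R only. The `airquality` stand-in data and mean squared error on the log scale as the measure of prediction error are assumptions for illustration.

```r
# Stand-in data (assumption): built-in 'airquality', renamed to the slide names
d <- na.omit(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])
names(d) <- c("ozone", "rad", "wind", "temp")

set.seed(1)
k <- 5
# Randomly assign each observation to one of K groups of similar size
fold <- sample(rep(1:k, length.out = nrow(d)))

err <- numeric(k)
for (i in 1:k) {
  train <- d[fold != i, ]  # the other K-1 groups
  test  <- d[fold == i, ]  # the group held out in this turn
  fit <- lm(log(ozone) ~ temp + wind + rad + I(wind^2), data = train)
  err[i] <- mean((log(test$ozone) - predict(fit, newdata = test))^2)
}

err                    # prediction error 1 ... K
cv_error <- mean(err)  # cross-validation estimate of prediction error
cv_error
```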

42 BOOTSTRAP
1. Generate a large number (n = 10000) of bootstrap samples.
2. For each bootstrap sample, compute the prediction error: Error1, Error2, Error3, …, Errorn.
3. The mean of these estimates is the bootstrap estimate of prediction error.
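The slide does not say how the prediction error of each bootstrap sample is computed. The sketch below assumes one common variant: fit the model on each bootstrap sample and predict the observations left out of that sample (out-of-bag). It uses only 200 samples to keep the run short, and `airquality` as stand-in data.

```r
# Stand-in data (assumption): built-in 'airquality', renamed to the slide names
d <- na.omit(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])
names(d) <- c("ozone", "rad", "wind", "temp")

set.seed(1)
B <- 200  # the slide suggests a much larger number, e.g. 10000
n <- nrow(d)
err <- numeric(B)

for (b in 1:B) {
  idx <- sample(n, replace = TRUE)   # bootstrap sample (with replacement)
  oob <- setdiff(seq_len(n), idx)    # observations not drawn (out-of-bag)
  fit <- lm(log(ozone) ~ temp + wind + rad + I(wind^2), data = d[idx, ])
  err[b] <- mean((log(d$ozone[oob]) - predict(fit, newdata = d[oob, ]))^2)
}

boot_error <- mean(err)  # bootstrap estimate of prediction error
boot_error
```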

43 Application of regression: prediction
1. Do not use your model for prediction without carrying out a validation.
-If you can, use an independent data set for validating the model.
-If you cannot, use at least bootstrap or cross-validation.
2. Never extrapolate.