Linear Regression
Nazmus Saquib, PhD
Head of Research
Sulaiman AlRajhi Colleges
Choice of Statistical Test
Are the observations independent or correlated?

Continuous outcome (e.g., body mass index, blood pressure)
  Independent: t-test; ANOVA; linear correlation; linear regression
  Correlated: paired t-test; repeated-measures ANOVA; mixed models/GEE modeling
  Assumptions: outcome is normally distributed (important for small samples); outcome and predictor have a linear relationship

Binary or categorical outcome (e.g., fracture yes/no)
  Independent: difference in proportions; relative risks; chi-square test; logistic regression
  Correlated: McNemar's test; conditional logistic regression; GEE modeling
  Assumptions: chi-square test assumes sufficient numbers in each cell (≥5)

Time-to-event outcome (e.g., time to fracture)
  Independent: Kaplan-Meier statistics; Cox regression
  Correlated: n/a
  Assumptions: Cox regression assumes proportional hazards between groups
Tests for Continuous Outcomes
Are the observations independent or correlated?

Continuous outcome (e.g., body mass index, blood pressure)
  Independent:
    T-test: compares means between two independent groups
    ANOVA: compares means among more than two independent groups
    Pearson's correlation coefficient (linear correlation): shows the linear correlation between two continuous variables
    Linear regression: multivariate regression technique used when the outcome is continuous; gives slopes
  Correlated:
    Paired t-test: compares means between two related groups (e.g., the same subjects before and after)
    Repeated-measures ANOVA: compares changes over time in the means of two or more groups (repeated measurements)
    Mixed models/GEE modeling: multivariate regression techniques to compare changes over time between two or more groups; give the rate of change over time

Alternatives if the normality assumption is violated (and the sample size is small): non-parametric statistics
  Wilcoxon signed-rank test: non-parametric alternative to the paired t-test
  Wilcoxon rank-sum test (= Mann-Whitney U test): non-parametric alternative to the t-test
  Kruskal-Wallis test: non-parametric alternative to ANOVA
  Spearman rank correlation coefficient: non-parametric alternative to Pearson's correlation coefficient
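Below is a minimal sketch, using Python's scipy.stats on made-up data, of how two of these parametric tests pair with their non-parametric alternatives; the sample values are illustrative, not from the slides.

```python
# Illustrative samples only; not real study data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(25, 4, size=30)   # e.g., BMI in group A
group_b = rng.normal(27, 4, size=30)   # e.g., BMI in group B

# Parametric: independent-samples t-test (assumes normality)
t, p_t = stats.ttest_ind(group_a, group_b)

# Non-parametric alternative: Wilcoxon rank-sum / Mann-Whitney U test
u, p_u = stats.mannwhitneyu(group_a, group_b)

# Pearson's r (linear correlation) vs. its non-parametric alternative,
# the Spearman rank correlation coefficient
x = rng.normal(size=30)
y = 0.5 * x + rng.normal(scale=0.5, size=30)
r, p_r = stats.pearsonr(x, y)
rho, p_rho = stats.spearmanr(x, y)

print(f"t-test p={p_t:.3f}, Mann-Whitney p={p_u:.3f}")
print(f"Pearson r={r:.2f}, Spearman rho={rho:.2f}")
```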
Correlation vs. Regression
Correlation: assesses the relationship only
Regression: assesses the relationship; finds the best line of fit; allows prediction/estimation
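As a small illustration (my own sketch on made-up data, not from the slides), correlation quantifies the relationship only, while regression also yields a fitted line that can be used for prediction:

```python
# Made-up data; variable meanings are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.uniform(20, 60, size=50)               # e.g., age
y = 2.0 + 0.05 * x + rng.normal(0, 0.3, 50)    # e.g., some continuous outcome

r, _ = stats.pearsonr(x, y)      # correlation: strength of relationship only
fit = stats.linregress(x, y)     # regression: intercept + slope (best-fit line)

y_at_40 = fit.intercept + fit.slope * 40   # regression enables prediction
print(f"r = {r:.2f}; predicted y at x=40: {y_at_40:.2f}")
```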
General Idea
Simple regression considers the relation between a single explanatory variable and a response variable.
Y: dependent variable, outcome, response, explained variable, regressand
X: independent variable, predictor, covariate, explanatory variable, regressor
General Idea
Multiple regression simultaneously considers the influence of multiple explanatory variables on a response variable Y. The intent is to look at the independent effect of each variable while "adjusting out" the influence of potential confounders.
Regression Modeling
A simple regression model (one independent variable) fits a regression line in 2-dimensional space. A multiple regression model with two explanatory variables fits a regression plane in 3-dimensional space.
Simple Regression Model
Regression coefficients are estimated by minimizing $\sum \text{residuals}^2$ (i.e., the sum of the squared residuals) to derive this model:
$$\hat{y} = \alpha + \beta x$$
The standard error of the regression ($s_{y|x}$) is based on the squared residuals:
$$s_{y|x} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - 2}}$$
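A minimal sketch of these formulas on made-up data follows; the closed-form least-squares estimates minimize the sum of squared residuals, and the standard error divides by n − 2:

```python
# Illustration only: closed-form least-squares estimates for simple regression.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=40)
y = 1.5 + 0.8 * x + rng.normal(0, 1.0, size=40)

# beta = Sxy / Sxx; alpha = mean(y) - beta * mean(x)
beta = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha = y.mean() - beta * x.mean()

y_hat = alpha + beta * x
residuals = y - y_hat
s_yx = np.sqrt(np.sum(residuals ** 2) / (len(y) - 2))  # standard error of regression
print(f"alpha={alpha:.2f}, beta={beta:.2f}, s_yx={s_yx:.2f}")
```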
Multiple Regression Model
Estimates for the multiple slope coefficients are derived by minimizing $\sum \text{residuals}^2$ to derive this multiple regression model:
$$\hat{y} = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$$
The standard error of the regression is based on the $\sum \text{residuals}^2$:
$$s_{y|x} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - k - 1}}$$
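Below is a hedged sketch, again on made-up data, of the multiple-regression estimates obtained by minimizing the sum of squared residuals (here via numpy's least-squares solver), with the standard error dividing by n − k − 1:

```python
# Illustration only: multiple regression with k = 2 predictors.
import numpy as np

rng = np.random.default_rng(3)
n, k = 50, 2
X = rng.normal(size=(n, k))                   # two explanatory variables
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 0.8, n)

X_design = np.column_stack([np.ones(n), X])   # add intercept column
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)  # minimizes sum of squared residuals

residuals = y - X_design @ coef
s_yx = np.sqrt(np.sum(residuals ** 2) / (n - k - 1))
print(f"intercept={coef[0]:.2f}, beta1={coef[1]:.2f}, "
      f"beta2={coef[2]:.2f}, s_yx={s_yx:.2f}")
```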
Multiple Regression Model
The intercept α predicts where the regression plane crosses the Y axis. The slope for variable X1 (β1) predicts the change in Y per unit change in X1, holding X2 constant. The slope for variable X2 (β2) predicts the change in Y per unit change in X2, holding X1 constant.
Multiple Regression Model
A multiple regression model with k independent variables fits a regression "surface" in (k + 1)-dimensional space (which cannot be visualized for k > 2).
Understanding LR Regression is the attempt to explain the variation in a dependent variable using the variation in independent variables. Regression is thus an explanation of causation. If the independent variable(s) sufficiently explain the variation in the dependent variable, the model can be used for prediction.
Understanding LR The output of a regression is a function that predicts the dependent variable based upon values of the independent variables. Simple regression fits a straight line to the data.
Understanding LR
The function will make a prediction for each observed data point. The observation is denoted by y and the prediction is denoted by ŷ ("y-hat").
Understanding LR
A least squares regression selects the line with the lowest total sum of squared prediction errors, $\mathrm{SSE} = \sum (y_i - \hat{y}_i)^2$. This value is called the Sum of Squares of Error, or SSE.
Understanding LR
The Sum of Squares Regression (SSR) is the sum of the squared differences between the prediction for each observation and the mean of the observed values: $\mathrm{SSR} = \sum (\hat{y}_i - \bar{y})^2$.
Understanding LR
The Total Sum of Squares (SST) is the sum of the squared differences between each observation and the mean of the observed values, $\mathrm{SST} = \sum (y_i - \bar{y})^2$. The total variation decomposes as SST = SSR + SSE.
Understanding LR
The proportion of total variation (SST) that is explained by the regression (SSR) is known as the Coefficient of Determination, $R^2 = \mathrm{SSR}/\mathrm{SST}$. The value of $R^2$ can range between 0 and 1; the higher its value, the more accurate the regression model is. It is often expressed as a percentage.
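A minimal sketch on made-up data showing how SSE, SSR, SST, and R² fit together for a least-squares line:

```python
# Illustration only; for an OLS fit with an intercept, SST = SSR + SSE holds exactly.
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=60)
y = 3.0 + 0.6 * x + rng.normal(0, 1.0, 60)

beta = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha = y.mean() - beta * x.mean()
y_hat = alpha + beta * x

sse = np.sum((y - y_hat) ** 2)         # error around the fitted line
ssr = np.sum((y_hat - y.mean()) ** 2)  # variation explained by the regression
sst = np.sum((y - y.mean()) ** 2)      # total variation

r_squared = ssr / sst
print(f"SSE={sse:.1f}, SSR={ssr:.1f}, SST={sst:.1f}, R^2={r_squared:.2f}")
```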
LR Interpretation
The slope coefficient for SMOKE is −0.206, suggesting that smokers have 0.206 less FEV on average compared to non-smokers (after adjusting for age). The slope coefficient for AGE is 0.231, suggesting that each year of age is associated with an increase of 0.231 FEV units on average (after adjusting for SMOKE).
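A model of this form could be fit as sketched below. The file name fev.csv and the column names FEV, AGE, and SMOKE are hypothetical placeholders, not from the slides, and the statsmodels formula API is just one of several ways to fit such a model.

```python
# Hedged sketch: assumes a hypothetical data file "fev.csv" with columns
# FEV (liters), AGE (years), and SMOKE (1 = smoker, 0 = non-smoker).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("fev.csv")
model = smf.ols("FEV ~ AGE + SMOKE", data=df).fit()

# The SMOKE coefficient is the adjusted difference in FEV between smokers
# and non-smokers; the AGE coefficient is the adjusted change in FEV per
# year of age.
print(model.params)
print(model.summary())
```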
LR Assumptions: Linearity
The relationship between the predictor(s) and the outcome is linear.
LR Assumptions: Continuous Outcome Variable
The outcome (dependent) variable is continuous (e.g., weight).
LR Assumptions: Zero Mean Error
The errors (residuals) have a mean of zero.
LR Assumptions: Equal Variance
The errors have constant variance (homoscedasticity) across all values of the predictors.
LR Assumptions: Uncorrelated Errors
The errors are uncorrelated with one another (independent observations).
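The assumptions above can be checked informally with residual diagnostics. The sketch below (made-up data, illustrative only) prints the residual mean and the Durbin-Watson statistic and plots residuals against fitted values:

```python
# Illustration only: basic residual diagnostics for an OLS fit.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, size=80)
y = 2.0 + 0.7 * x + rng.normal(0, 1.0, size=80)

model = sm.OLS(y, sm.add_constant(x)).fit()
resid, fitted = model.resid, model.fittedvalues

# Zero mean error: residual mean should be near 0
print("mean residual:", resid.mean())

# Uncorrelated errors: Durbin-Watson near 2 suggests little serial correlation
print("Durbin-Watson:", durbin_watson(resid))

# Linearity and equal variance: residuals vs. fitted values should show
# no systematic curve and no funnel shape
plt.scatter(fitted, resid)
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```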
LR Presentation in Journals
[Figure: example of how linear regression results are presented in a published article.]