Published by Emory Parker. Modified over 9 years ago.
1
Biostatistics Case Studies 2015
Youngju Pak, PhD, Biostatistician
ypak@labiomed.org
Session 4: Regression Models and Multivariate Analyses
2
What and Why?
Multivariate analysis (MVA) techniques allow more than two variables to be analysed at once, in contrast to univariate or bivariate analysis.
Advances in computing technology have made rich data sets available.
Data reduction or classification, e.g., Factor Analysis, Principal Component Analysis (PCA).
Several variables may be correlated to some degree, and the resulting confounding can bias the result, e.g., Analysis of Covariance (ANCOVA), Multiple Linear or Generalized Linear Regression Models.
3
What and Why?
Many variables are interrelated, with multiple dependent and independent variables, e.g., Multivariate Analysis of Variance (MANOVA), Path Models, Structural Equation Models (SEM), Partial Least Squares (PLS) Models.
This session will focus on multiple regression models.
4
Why regression models?
To reduce "random noise" in the data => better variance estimates, obtained by adding sources of variability of the dependent variable to the model, e.g., ANCOVA.
To determine an optimal set of predictors => predictive models, e.g., variable selection procedures for multiple regression models.
To adjust for potential confounding effects, e.g., regression models with covariates.
5
Actual Mathematical Models
ANOVA
$Y_{ij} = \mu + \tau_i + \varepsilon_{ij}$, where $Y_{ij}$ represents the $j$th observation ($j = 1, 2, \dots, n$) on the $i$th treatment ($i = 1, 2, \dots, l$ levels). The errors $\varepsilon_{ij}$ are assumed to be normally and independently distributed (NID), with mean zero and variance $\sigma^2$.
ANCOVA with $k$ covariates
$Y_{ij} = \mu + \tau_i + \beta_1 X_{1ij} + \beta_2 X_{2ij} + \dots + \beta_k X_{kij} + \varepsilon_{ij}$
MANOVA (with $p$ outcome variables)
$Y_{(n \times p)} = X_{(n \times [q+1])} \, B_{([q+1] \times p)} + E_{(n \times p)}$
6
Actual Mathematical Models
Simple Linear Regression Models (SLR)
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, where $\mu_Y = \beta_0 + \beta_1 X_i$ is the true mean value of Y.
$\varepsilon$ = "error" (random noise due to random sampling error), assumed to follow a normal distribution with mean = 0 and variance = $\sigma^2$.
$\beta_0$ and $\beta_1$ = intercept and slope, often called regression (or beta) coefficients.
Y = Dependent Variable (DV)
X = Independent Variable (IV)
e.g., Y = insulin sensitivity, X = fatty acid in percentage
Multiple Linear Regression Models (MLR)
Simple Logistic Models (SL)
Multiple Logistic Models (ML)
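The intercept and slope of the SLR model are usually estimated by least squares. The following is a minimal sketch of that calculation in plain Python, not the SPSS procedure used in the session; the function name `fit_slr` and the four data points are made up for illustration.

```python
from statistics import mean

def fit_slr(x, y):
    """Return least-squares estimates (b0, b1) for y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of squares of x and sum of cross-products of x and y
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx            # estimated slope
    b0 = y_bar - b1 * x_bar   # estimated intercept
    return b0, b1

# Hypothetical data (not from the slides): the points lie exactly on y = 1 + 2x.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
b0, b1 = fit_slr(x, y)
print(b0, b1)
```

With real data the points do not fall exactly on a line, and the differences between observed and fitted values are the residuals used to check the model assumptions.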
7
SLR: Example
SPSS output:
Two-sided p-value = 0.002. Thus, there is statistically significant evidence (alpha = 0.05) to conclude that the true slope is not zero: Fatty Acid (%) is significantly related to insulin sensitivity.
Mean insulin sensitivity increases by 37.208 units for each one-percentage-point increase in Fatty Acid (%).
8
SLR w/CI
9
Checking the assumptions using a residual plot
The plot should look random: no systematic pattern should appear if the assumptions are met.
10
Actual Mathematical Models
Multiple Linear Regression Models (MLR)
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon$, where $\mu_Y = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k$ is the true mean value of Y.
Assumptions are the same as for SLR, with one addition: the Xs are not highly correlated with one another. If they are, this is called "multicollinearity", which makes the model very unstable.
Diagnosis for multicollinearity: Variance Inflation Factor (VIF)
  VIF = 1: OK
  VIF < 5: Tolerable
  VIF > 5: Problematic
Remedy: remove the variable that has a high VIF, or do PCA.
Multiple Linear Regression Models (MLR)
Simple Logistic Models (SL)
Multiple Logistic Models (ML)
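The slide does not state how the VIF is computed. The standard definition is VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. With only two predictors, R_j^2 is the squared Pearson correlation between them, which keeps the sketch below short. The function names and data are hypothetical.

```python
from statistics import mean

def pearson_r(a, b):
    """Pearson correlation between two equal-length sequences."""
    a_bar, b_bar = mean(a), mean(b)
    sab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    saa = sum((ai - a_bar) ** 2 for ai in a)
    sbb = sum((bi - b_bar) ** 2 for bi in b)
    return sab / (saa * sbb) ** 0.5

def vif_two_predictors(x1, x2):
    """VIF shared by both predictors in a two-predictor model: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical predictor values (not from the slides)
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]
vif = vif_two_predictors(x1, x2)
print(round(vif, 3))  # between 1 and 5: "tolerable" by the thresholds on the slide
```

With more than two predictors, each R_j^2 requires its own auxiliary regression, which statistical packages report directly.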
11
MLR: Example
$\mu_Y = -56.935 + 1.634 X_1 + 0.249 X_2$, where $X_1$ is flexibility and $X_2$ is leg strength.
1.634 × Flexibility: for every 1 degree increase in flexibility, MEAN punt distance increases by 1.634 feet, adjusting for leg strength.
0.249 × Strength: for every 1 lb increase in strength, MEAN punt distance increases by 0.249 feet, adjusting for flexibility.
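To make the "adjusting for" reading concrete, the sketch below evaluates the fitted equation from this slide at two inputs that differ only in flexibility. The coefficients come from the slide; the function name and the input values (100 degrees, 150 lb) are made up for illustration.

```python
def predict_punt_distance(flexibility_deg, strength_lb):
    """Mean punt distance (feet) from the fitted equation on the slide."""
    return -56.935 + 1.634 * flexibility_deg + 0.249 * strength_lb

# Hypothetical punter: 100 degrees of flexibility, 150 lb leg strength
base = predict_punt_distance(100.0, 150.0)

# Same strength, one more degree of flexibility
plus_one_degree = predict_punt_distance(101.0, 150.0)

# The difference equals the flexibility coefficient, whatever strength is held at
print(round(plus_one_degree - base, 3))
```

Holding strength fixed while changing flexibility by one unit is exactly what "adjusting for leg strength" means in the interpretation above.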
12
What do we mean by "adjusted for"?
What if the covariates are categorical? For example:

Mean muscle gain % (N)
            Exercise & Diet   Exercise only
Male        20% (10)          15% (40)
Female      10% (40)          5% (10)

Mean % gain without adjustment for gender
  Exercise & Diet: (20% x 10 + 10% x 40) / 50 = 12%
  Exercise only: (15% x 40 + 5% x 10) / 50 = 13%
Mean % gain with adjustment for gender
  Exercise & Diet: Male avg. x 0.5 + Female avg. x 0.5 = 20% x 0.5 + 10% x 0.5 = 15%
  Exercise only: Male avg. x 0.5 + Female avg. x 0.5 = 15% x 0.5 + 5% x 0.5 = 10%
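The arithmetic on this slide can be reproduced in a few lines. The sketch below uses the cell means and sample sizes from the table; the variable and function names are illustrative, and the equal 0.5/0.5 gender weights are the ones the slide uses.

```python
# (treatment, gender) -> (mean % gain, N), taken from the table on the slide
cells = {
    ("Exercise & Diet", "Male"): (20.0, 10),
    ("Exercise & Diet", "Female"): (10.0, 40),
    ("Exercise only", "Male"): (15.0, 40),
    ("Exercise only", "Female"): (5.0, 10),
}

def unadjusted_mean(treatment):
    """Crude mean: each gender is weighted by its sample size in that group."""
    rows = [v for (t, _), v in cells.items() if t == treatment]
    total_n = sum(n for _, n in rows)
    return sum(gain * n for gain, n in rows) / total_n

def adjusted_mean(treatment, weights=None):
    """Adjusted mean: both groups use the same gender weights (0.5 each here)."""
    weights = weights or {"Male": 0.5, "Female": 0.5}
    return sum(cells[(treatment, g)][0] * w for g, w in weights.items())

for t in ("Exercise & Diet", "Exercise only"):
    print(t, unadjusted_mean(t), adjusted_mean(t))
```

The crude comparison favors exercise only (13% vs. 12%), while the adjusted comparison favors exercise & diet (15% vs. 10%), matching the figures on the slide.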
13
Why different?
The % gain for males is 10 percentage points higher than for females in both treatment groups => potential confounding.
However, the two groups are unbalanced in terms of gender, i.e., 80% male in the exercise-only group versus 20% male in the exercise & diet group => this dilutes the "treatment effect".
For continuous covariates such as baseline age, a similar adjustment is performed based on the correlation between % gain and baseline age.
14
Graphical illustration: Adjusting for a continuous covariate
* Changes in adiponectin (a glucose-regulating protein) between two groups
15
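Multiple Logistic Regression Models
The model: $\text{logit}(\pi) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k$,
where $\pi = \Pr(\text{event} = 1)$ and $\text{logit}(\pi) = \ln[\pi / (1 - \pi)]$.
Equivalently, $\pi = e^{LP} / (1 + e^{LP})$, where $LP = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k$ is the linear predictor.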
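The logit and its inverse are simple enough to write out directly. This is a minimal sketch of the two transformations defined above; the function names `logit` and `inv_logit` are illustrative.

```python
import math

def logit(p):
    """Log-odds of a probability p, for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inv_logit(lp):
    """Probability from a linear predictor LP: e^LP / (1 + e^LP)."""
    return math.exp(lp) / (1.0 + math.exp(lp))

# A linear predictor of 0 corresponds to a probability of 0.5
print(inv_logit(0.0))

# The two functions undo each other
print(inv_logit(logit(0.3)))
```

Because the inverse logit always returns a value strictly between 0 and 1, any value of the linear predictor maps to a valid probability.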
16
Interpretation of the coefficients in logistic regression models
For a continuous predictor, the exponentiated coefficient ($e^{\beta}$) is the multiplicative change in the odds of Y = 1 for a one-unit increase in X, i.e., the odds ratio for X+1 versus X.
Similarly, for a nominal predictor, the exponentiated coefficient is the odds ratio for one group (X = 1) versus another (X = 0).
Remember, a multiple logistic regression model has other covariates. Hence, the interpretation of one coefficient applies with the other covariates adjusted for.
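A quick numerical check of the odds-ratio interpretation: compute the odds at X and at X+1 from the model and confirm that their ratio equals e^β. The coefficient values below are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical coefficients for a single-predictor logistic model
b0, b1 = -2.0, 0.7

def prob(x):
    """P(Y = 1 | X = x) under the logistic model."""
    lp = b0 + b1 * x
    return math.exp(lp) / (1.0 + math.exp(lp))

def odds(x):
    """Odds of Y = 1 at X = x."""
    p = prob(x)
    return p / (1.0 - p)

# Odds ratio for X+1 versus X; the starting value of X does not matter
odds_ratio = odds(3.0) / odds(2.0)
print(odds_ratio, math.exp(b1))
```

The ratio is the same at any starting value of X, which is why a single odds ratio summarizes the effect of the predictor.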
17
Estimated Probability vs. Age
18
Other Models
Ordinal logistic regression, for ordinal responses such as cancer stage I, II, III, IV: assumes that the odds ratio is constant across the cut-points between categories (the proportional odds assumption).
Poisson regression, when responses are count data such as number of pregnancies: overdispersion is common, and sometimes a negative binomial distribution is used instead.
Mixed models: commonly used for repeated measures ANOVA or ANCOVA. Time is used as a within-subject factor and a random factor. Mixed models are also used for nested designs.
Cox proportional hazards models: multivariate models for survival data.
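Overdispersion, mentioned for Poisson regression above, means the variance of the counts exceeds their mean, whereas a Poisson distribution has variance equal to its mean. The sketch below is only a rough screen on made-up counts that ignores covariates; a formal check would use the residuals of a fitted model.

```python
from statistics import mean, variance

# Hypothetical count responses (not from the slides)
counts = [0, 1, 0, 2, 7, 0, 1, 9]

m = mean(counts)
v = variance(counts)        # sample variance (n - 1 denominator)
dispersion_ratio = v / m    # close to 1 under a Poisson distribution

print(round(dispersion_ratio, 2))
if dispersion_ratio > 1:
    print("Variance exceeds the mean: possible overdispersion")
```

A ratio well above 1, as here, is the situation in which a negative binomial model is often considered instead.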
19
General Linear Model vs. Generalized Linear Model (GLM)
A linear model: General Linear Model
  e.g., ANOVA, ANCOVA, MANOVA, MANCOVA, linear regression, mixed model
A non-linear model: Generalized Linear Model
  e.g., logistic, ordinal logistic, Poisson
  All of these use a link function for the response variable (Y), such as a logit link (logistic) or a log link (Poisson).
GEE (Generalized Estimating Equation) models are an extension of the GLM.
20
Variable Selection Procedures
Forward: add the new predictor that has the lowest p-value, and keep repeating this step until no more predictors can be added at the 0.05 alpha level.
Backward: start with a full model containing all predictors, eliminate the predictor with the highest p-value, and keep repeating this step until no predictors are left to eliminate at the 0.05 alpha level.
Stepwise: a combination of Forward and Backward. Level of stay: 0.01 and level of entry: 0.05 are usually used.
Many simulation studies suggest that Backward is the most recommendable.
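The backward procedure can be sketched as a short loop. The sketch assumes a caller-supplied function `fit_pvalues` (hypothetical) that refits the model on the current predictors and returns their p-values. The toy stand-in used here returns fixed, made-up p-values, which is unrealistic because real p-values change each time the model is refitted.

```python
def backward_eliminate(predictors, fit_pvalues, alpha=0.05):
    """Drop the predictor with the highest p-value until all are <= alpha."""
    selected = list(predictors)
    while selected:
        pvalues = fit_pvalues(selected)   # refit the model on the current set
        worst = max(selected, key=lambda name: pvalues[name])
        if pvalues[worst] <= alpha:
            break                         # every remaining predictor stays
        selected.remove(worst)
    return selected

# Toy stand-in for a real model fit: fixed, made-up p-values (assumption).
TOY_PVALUES = {"age": 0.01, "bmi": 0.20, "sex": 0.03, "height": 0.60}

def toy_fit_pvalues(selected):
    return {name: TOY_PVALUES[name] for name in selected}

kept = backward_eliminate(["age", "bmi", "sex", "height"], toy_fit_pvalues)
print(kept)
```

In practice `fit_pvalues` would wrap a call to a regression routine in a statistical package; forward and stepwise selection follow the same pattern with an added entry step.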