Introduction to Logistic Regression Rachid Salmi, Jean-Claude Desenclos, Thomas Grein, Alain Moren, Viviane Bremer
Objectives When do we need to use logistic regression Principles of logistic regression Uses of logistic regression What to keep in mind
Chlamorea Sexually transmitted infection –Virus recently identified –Leads to general rash, blush, pimples and feeling of shame –Increasing prevalence with age –Risk factors unknown so far
Case control study Population of Berlin 150 cases, 150 controls Hypothesis: Consistent use of condoms protects against chlamorea Questionnaire with questions on demographic characteristics, sexual behaviour OR, t-test
Results bivariate analysis
                            Cases (n=150)   Controls (n=150)   Odds ratio
Used condoms at last sex
Did not use condoms         110             60                 Ref
Results bivariate analysis
                            Cases (n=150)   Controls (n=150)   Odds ratio
Single
Currently in a relationship 25              100                Ref
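The crude odds ratios behind these two tables can be sketched as follows. This is a minimal illustration, not the authors' calculation: it reads the reference rows as 110/60 and 25/100, and it assumes the other rows are simply the remainder of the 150 cases and 150 controls (no missing answers). The function name `odds_ratio` is hypothetical.

```python
def odds_ratio(exposed_cases, exposed_controls, unexposed_cases, unexposed_controls):
    """Cross-product odds ratio (a*d)/(b*c) for a 2x2 case-control table."""
    return (exposed_cases * unexposed_controls) / (exposed_controls * unexposed_cases)

N_CASES = N_CONTROLS = 150

# Reference rows, reading the slide counts as 110/60 and 25/100 (assumption).
no_condom_cases, no_condom_controls = 110, 60
relationship_cases, relationship_controls = 25, 100

# The other rows are ASSUMED to be the complement of 150 (no missing answers).
or_condom_use = odds_ratio(N_CASES - no_condom_cases, N_CONTROLS - no_condom_controls,
                           no_condom_cases, no_condom_controls)
or_single = odds_ratio(N_CASES - relationship_cases, N_CONTROLS - relationship_controls,
                       relationship_cases, relationship_controls)

print(f"OR condom use vs no condom use: {or_condom_use:.2f}")
print(f"OR single vs in a relationship: {or_single:.2f}")
```

Under these assumptions, an OR below 1 for condom use would point towards protection, and an OR above 1 for being single towards increased risk, before any adjustment for confounding.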
Results bivariate analysis
                               Cases (n=150)   Controls (n=150)   T-test
Nr partners during last year   4               2                  p=0.001
Mean age in years              39              26                 p=0.001
Confounding?
Stratification Chlamorea and condom use: raw OR (cells a, b, c, d) compared with stratum-specific ORs (OR 1, OR 2, … OR i, from cells a i, b i, c i, d i), stratifying in turn by single status, age group and number of partners
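One common way to summarise stratum-specific odds ratios is the Mantel-Haenszel pooled OR. The sketch below is an illustration under stated assumptions, not a method named on the slides: the counts are hypothetical and were chosen so that both strata show OR = 1 while the crude OR does not, which is the pattern expected under confounding. Cells follow the convention a = exposed cases, b = exposed controls, c = unexposed cases, d = unexposed controls.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for one 2x2 table: (a*d)/(b*c)."""
    return (a * d) / (b * c)

def mantel_haenszel_or(strata):
    """Mantel-Haenszel pooled OR: sum(a*d/n) / sum(b*c/n) over strata."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator

# HYPOTHETICAL strata (a, b, c, d), e.g. two levels of a third variable.
strata = [(40, 10, 20, 5), (5, 20, 10, 40)]

# Crude table: add up each cell over the strata.
crude_or = odds_ratio(*(sum(cells) for cells in zip(*strata)))
stratum_ors = [odds_ratio(*s) for s in strata]
adjusted_or = mantel_haenszel_or(strata)

print("crude OR:", crude_or)
print("stratum ORs:", stratum_ors)
print("Mantel-Haenszel OR:", adjusted_or)
```

When the crude and adjusted ORs differ while the stratum ORs agree with each other, the stratifying variable is behaving as a confounder. With several candidate confounders the number of strata grows quickly, which is where regression models become more practical.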
Let's go one step back
Simple linear regression Table 1 Age and systolic blood pressure (SBP) among 33 adult women
SBP (mm Hg) versus age (years). Adapted from Colton T. Statistics in Medicine. Boston: Little, Brown, 1974
Simple linear regression Relation between 2 continuous variables (SBP and age): $y = \alpha + \beta_1 x$ Regression coefficient $\beta_1$ –Measures association between y and x –Amount by which y changes on average when x changes by one unit –Estimated by the least squares method; $\beta_1$ is the slope of the fitted line
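The least squares estimates for a single predictor have a closed form, sketched below. The five age/SBP pairs are hypothetical and are not the 33 observations of Table 1; the function name `least_squares_line` is also an assumption made for illustration.

```python
def least_squares_line(x, y):
    """Return (intercept, slope) minimising the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx                    # change in y per one-unit change in x
    intercept = mean_y - slope * mean_x  # predicted y at x = 0
    return intercept, slope

# HYPOTHETICAL data: age in years and SBP in mm Hg.
age = [30, 40, 50, 60, 70]
sbp = [115, 125, 130, 145, 150]

alpha, beta_1 = least_squares_line(age, sbp)
print(f"SBP = {alpha:.1f} + {beta_1:.2f} * age")
```

With these made-up numbers the slope reads as the average increase in SBP, in mm Hg, for each additional year of age.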
What if we have more than one independent variable?
Multiple risk factors Objective: to attribute to each risk factor its own effect (RR) on the occurrence of disease.
Types of multivariable analysis Multiple models –Linear regression –Logistic regression –Cox model –Poisson regression –Loglinear model –Discriminant analysis… Choice of the tool according to objectives, study design and variables
Multiple linear regression Relation between a continuous variable and a set of i variables: $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i$ Partial regression coefficients $\beta_i$ –Amount by which y changes when $x_i$ changes by one unit and all the other x variables remain constant –Measures association between $x_i$ and y adjusted for all the other x variables Example –Number of partners in relation to age & income
Multiple linear regression
y: Predicted, Response variable, Outcome variable, Dependent variable
x: Predictor variables, Explanatory variables, Covariables, Independent variables
$y\ (\text{number of partners}) = \alpha + \beta_1\,\text{age} + \beta_2\,\text{income} + \beta_3\,\text{gender}$
What if our outcome variable is dichotomous?
Logistic regression (1) Table 2 Age and chlamorea
How can we analyse these data? Compare mean age of diseased and non-diseased –Non-diseased: 26 years –Diseased: 39 years (p=0.0001) Linear regression?
Dot-plot: Data from Table 2 Presence of Chlamorea
Logistic regression (2) Table 3 Prevalence (%) of chlamorea according to age group
Dot-plot: Data from Table 3 Diseased % Age group
Logistic function (1) Probability of disease x
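Logistic function Logistic regression models the logit of the outcome = natural logarithm of the odds of the outcome: $\operatorname{logit}(p) = \ln\left(\frac{p}{1-p}\right)$, where p is the probability of the outcome and 1-p the probability of not having the outcome
Logistic function $\ln\left(\frac{p}{1-p}\right) = \alpha + \beta x$ –$\alpha$ = log odds of disease in unexposed –$\beta$ = log odds ratio associated with being exposed –$e^{\beta}$ = odds ratio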
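The link between probabilities, log odds and the odds ratio can be checked numerically. In this minimal sketch the two probabilities are hypothetical, and the helper names `logit` and `logistic` are chosen for illustration only.

```python
import math

def logit(p):
    """Natural logarithm of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))

def logistic(z):
    """Inverse of the logit: maps a log odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# HYPOTHETICAL probabilities of disease.
p_unexposed = 0.2
p_exposed = 0.5

alpha = logit(p_unexposed)        # log odds of disease in the unexposed (x = 0)
beta = logit(p_exposed) - alpha   # log odds ratio associated with exposure (x = 1)
odds_ratio = math.exp(beta)       # e^beta

print(f"alpha = {alpha:.3f}, beta = {beta:.3f}, OR = {odds_ratio:.2f}")
print(f"p(exposed) recovered from the model: {logistic(alpha + beta):.2f}")
```

Here the odds are 0.25 in the unexposed and 1 in the exposed, so the ratio of the two odds and $e^{\beta}$ coincide.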
Multiple logistic regression More than one independent variable –Dichotomous, ordinal, nominal, continuous … Interpretation of $\beta_i$ –Increase in log-odds for a one unit increase in $x_i$ with all the other x variables held constant –Measures association between $x_i$ and log-odds adjusted for all the other x variables
Uses of multivariable analysis Etiologic models –Identify risk factors adjusted for confounders –Adjust for differences in baseline characteristics Predictive models –Determine diagnosis –Determine prognosis
Fitting equation to the data Linear regression: –Least squares Logistic regression: –Maximum likelihood
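Maximum likelihood estimates for logistic regression have no closed form in general and are found iteratively. The sketch below uses Newton-Raphson for a model with one predictor; it is an illustration of the idea, not the algorithm of any particular package. The grouped counts assume the condom table reads 110 cases and 60 controls without condom use, and 40 and 90 with condom use (complement of 150), which is an assumption. The function name `fit_logistic` is hypothetical.

```python
import math

def fit_logistic(groups, iterations=25):
    """Fit logit(p) = alpha + beta * x by Newton-Raphson.

    groups: list of (x, number_of_cases, number_of_subjects).
    """
    alpha, beta = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0           # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0   # information matrix (negative Hessian)
        for x, cases, total in groups:
            p = 1.0 / (1.0 + math.exp(-(alpha + beta * x)))
            residual = cases - total * p
            weight = total * p * (1.0 - p)
            g0 += residual
            g1 += residual * x
            h00 += weight
            h01 += weight * x
            h11 += weight * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: add (information matrix)^-1 * gradient
        alpha += (h11 * g0 - h01 * g1) / det
        beta += (h00 * g1 - h01 * g0) / det
    return alpha, beta

# x = 1: did not use condoms, x = 0: used condoms (ASSUMED counts, see above).
groups = [(0, 40, 130), (1, 110, 170)]
alpha, beta = fit_logistic(groups)
print(f"alpha = {alpha:.4f}, beta = {beta:.4f}, OR = {math.exp(beta):.3f}")
```

With a single dichotomous predictor the fitted $e^{\beta}$ reproduces the cross-product odds ratio of the 2x2 table, which gives a simple check on the procedure.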
Elaborating $e^{\beta}$ $e^{\beta}$ = OR What if the independent variable is continuous? What is the effect of a change in x by more than one unit?
The Q fever example Distance to farm as independent continuous variable, measured in meters –β in logistic regression was … and statistically significant OR for each 1 meter distance is … –Too small to use What's the OR for every 1000 meters? –$e^{1000 \beta} = e^{-1000 \times \dots} = \dots$
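Rescaling an odds ratio to a larger change in x only requires multiplying β before exponentiating. The coefficient used below is hypothetical, since the value from the Q fever study is not given here; the function name `odds_ratio_for_change` is also an assumption.

```python
import math

def odds_ratio_for_change(beta, delta_x):
    """OR associated with a change of delta_x units in x: exp(beta * delta_x)."""
    return math.exp(beta * delta_x)

# HYPOTHETICAL coefficient per meter of distance (not the study's value).
beta_per_meter = -0.0002

or_1m = odds_ratio_for_change(beta_per_meter, 1)
or_1000m = odds_ratio_for_change(beta_per_meter, 1000)

print(f"OR per 1 meter:     {or_1m:.4f}")
print(f"OR per 1000 meters: {or_1000m:.4f}")
```

An OR per meter that is indistinguishable from 1 can still correspond to a clearly protective OR per kilometer, so the unit of x should match the scale on which the result will be interpreted.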
Continuous variables Increase in OR for a one unit change in exposure variable Logistic model is multiplicative: OR increases exponentially with x –If OR = 2 for a one unit change in exposure and x increases from 2 to 5: OR = 2 × 2 × 2 = $2^3$ = 8 Verify if OR increases exponentially with x –When in doubt, treat as qualitative variable
Coding of variables (2) Nominal variables or ordinal with unequal classes: –Preferred hair colour of partners: »No hair=0, grey=1, brown=2, blond=3 –Model assumes that OR for blond partners = $(\text{OR for grey-haired partners})^3$ –Use indicator variables (dummy variables)
Indicator variables: Hair colour Neutralises artificial hierarchy between classes in the variable "hair colour of partners" No assumptions made 3 variables in model using same reference OR for each type of hair adjusted for the others in reference to no hair
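Indicator coding replaces the single 0 to 3 code with three 0/1 variables, one per non-reference class. The sketch below uses the coding from the previous slide; the variable names such as `hair_grey` and the function name `indicator_variables` are hypothetical.

```python
HAIR_CODES = {0: "no hair", 1: "grey", 2: "brown", 3: "blond"}

def indicator_variables(code, reference=0):
    """Return one 0/1 indicator per hair colour class except the reference."""
    return {
        f"hair_{name}": int(code == class_code)
        for class_code, name in HAIR_CODES.items()
        if class_code != reference
    }

for code, name in HAIR_CODES.items():
    print(name, indicator_variables(code))
```

The reference class (no hair) is the one for which all three indicators are 0, so each fitted coefficient compares one hair colour with no hair and no ordering between colours is imposed.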
Classes Relationship between number of partners during last year and chlamorea –Code number of partners: 0-1 = 1, 2-3 = 2, 4-5 = 3 Compatible with assumption of multiplicative model –If not compatible, use indicator variables Code nr partners   Cases   Controls   OR
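A rough way to judge whether coded classes are compatible with the multiplicative model is to compare the OR of each class against the lowest class with the OR implied by a constant factor per step. The counts below are hypothetical, because the table's values are not available here, and using the class 2 OR as the per-step factor is a crude check rather than a fitted estimate.

```python
# HYPOTHETICAL counts per coded class: code -> (cases, controls)
counts = {1: (20, 80), 2: (40, 80), 3: (80, 80)}

ref_cases, ref_controls = counts[1]

# Observed OR of each class compared with class 1 (reference).
observed = {
    code: (cases * ref_controls) / (controls * ref_cases)
    for code, (cases, controls) in counts.items()
}

# Multiplicative model: the OR is multiplied by the same factor at every step.
or_per_step = observed[2]
expected = {code: or_per_step ** (code - 1) for code in counts}

for code in counts:
    print(f"class {code}: observed OR = {observed[code]:.2f}, "
          f"expected under multiplicative model = {expected[code]:.2f}")
```

If observed and expected ORs diverge clearly, entering the variable as indicator variables avoids imposing the multiplicative pattern.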
Risk factors for Chlamorea Candidate variables in relation to chlamorea: no condom use, sex, hair colour, age group, single, visiting bars, number of partners
Unconditional Logistic Regression
Term                Odds Ratio   95% C.I. lower   95% C.I. upper   Coef.     S.E.     Z-Statistic   P-Value
# partners          1,2664       0,2634           10,7082          0,2362    0,9452   0,5486        0,5833
Single (Yes/No)     1,0345       0,3277           3,2660           0,0339    0,5866   0,0578        0,9539
Hair colour (1/0)   1,6126       0,2675           9,7220           0,4778    0,9166   0,5213        0,6022
Hair colour (2/0)   0,7291       0,0991           5,3668           -0,3159   1,0185   -0,3102       0,7564
Hair colour (3/0)   1,1137       0,1573           7,8870           0,1076    0,9988   0,1078        0,9142
Visiting bars       1,5942       0,4953           5,1317           0,4664    0,5965   0,7819        0,4343
Used no Condoms     9,0918       3,0219           27,3533          2,2074    0,5620   3,9278        0,0001
Sex (f/m)           1,3024       0,2278           7,4468           0,2642    0,8896   0,2970        0,7665
CONSTANT            *            *                *                -3,0080   2,0559   -1,4631       0,1434
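The odds ratio, its 95% confidence interval and the Z statistic in such output all follow from the coefficient and its standard error. The sketch below recomputes the "Used no Condoms" row, with decimal commas written as points. The Wald interval with 1.96 is an assumption about how the software derived its limits, and the function name `or_with_ci` is hypothetical.

```python
import math

def or_with_ci(coef, se, z_crit=1.96):
    """Return (OR, lower, upper) as exp(coef) and exp(coef -/+ z_crit * se)."""
    return (
        math.exp(coef),
        math.exp(coef - z_crit * se),
        math.exp(coef + z_crit * se),
    )

# "Used no Condoms" row of the output table.
coef, se = 2.2074, 0.5620

odds_ratio, lower, upper = or_with_ci(coef, se)
z_statistic = coef / se

print(f"OR = {odds_ratio:.2f} (95% CI {lower:.2f} to {upper:.2f}), Z = {z_statistic:.2f}")
```

Small differences in the last decimals against the table are expected, because the coefficient and standard error are themselves rounded.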
Last but not least
Why do we need multivariable analysis? Our real world is multivariable Multivariable analysis is a tool to determine the relative contribution of all factors
Sequence of analysis Descriptive analysis –Know your dataset Bivariate analysis –Identify associations Stratified analysis –Confounding and effect modifiers Multivariable analysis –Control for confounding
What can go wrong Small sample size and too few cases Wrong coding Skewed distribution of independent variables –Empty subgroups Collinearity –Independent variables express the same information
Do not forget Rubbish in - rubbish out Check for confounders first Number of subjects >> variables in the model Keep the model simple –Statisticians can help with the model but you need to understand the interpretation You will need several attempts to find the best model
If in doubt… Really call a statistician !!!!
References Norman GR, Streiner DL. Biostatistics: The Bare Essentials. BC Decker, London, 2000. Hosmer DW, Lemeshow S. Applied Logistic Regression. Wiley & Sons, New York, 1989. Katz MH. Multivariable Analysis. Cambridge University Press, 2006.