
1 Introduction to Logistic Regression Analysis. Dr Tuan V. Nguyen, Garvan Institute of Medical Research, Sydney, Australia

2 Introductory example 1: Gender difference in preference for white wine. A group of 57 men and 167 women were asked to state their preference for a new white wine. The results are as follows:

Gender   Like   Dislike   All
Men        23        34    57
Women      35       132   167
All        58       166   224

Question: Is there a gender effect on the preference?

3 Introductory example 2: Fat concentration and preference. 435 samples of a sauce of varying fat concentration were tasted by consumers. There were two possible outcomes: like or dislike. The results are as follows:

Concentration   Like   Dislike   All
1.35              13         0    13
1.60              19         0    19
1.75              67         2    69
1.85              45         5    50
1.95              71         8    79
2.05              50        20    70
2.15              35        31    66
2.25               7        49    56
2.35               1        12    13

Question: Is there an effect of fat concentration on the preference?

4 Consideration … The question in example 1 can be addressed by "traditional" analyses such as the z-statistic or the Chi-square test. The question in example 2 is a bit more difficult to handle, as the factor (fat concentration) is a continuous variable while the outcome is a categorical variable (like or dislike). However, there is a much better and more systematic method to analyze both data sets: logistic regression.
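As a minimal sketch (assumed, not part of the original slides), the "traditional" analysis of example 1 can be carried out in R with a chi-square test on the 2-by-2 table:

# Chi-square test of the gender-by-preference table from example 1
tab <- matrix(c(23, 34, 35, 132), nrow = 2, byrow = TRUE,
              dimnames = list(Gender = c("Men", "Women"),
                              Preference = c("Like", "Dislike")))
chisq.test(tab)   # p < 0.01, consistent with a gender effect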

5 Odds and odds ratio. Let P be the probability of preference; then the odds of preference is O = P / (1-P).

Gender   Like   Dislike   All   P(like)
Men        23        34    57     0.403
Women      35       132   167     0.209
All        58       166   224     0.259

O_men = 0.403 / 0.597 = 0.676
O_women = 0.209 / 0.791 = 0.265
Odds ratio: OR = O_men / O_women = 0.676 / 0.265 = 2.55 (meaning: the odds of preference is 2.55 times higher in men than in women)
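As a minimal sketch (assumed, not part of the original slides), these quantities can be reproduced in R directly from the table:

# Odds and odds ratio from the 2-by-2 table
like    <- c(men = 23, women = 35)
dislike <- c(men = 34, women = 132)
p    <- like / (like + dislike)   # P(like): 0.403 (men), 0.209 (women)
odds <- p / (1 - p)               # 0.676 (men), 0.265 (women)
odds["men"] / odds["women"]       # odds ratio: 2.55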

6 Meanings of odds ratio.
OR > 1: the odds of preference is higher in men than in women.
OR < 1: the odds of preference is lower in men than in women.
OR = 1: the odds of preference in men is the same as in women.
How do we assess the "significance" of OR?

7 Computing variance of odds ratio. The significance of OR can be tested by calculating its variance. The variance of OR can be calculated indirectly by working on the logarithmic scale:
1. Convert OR to log(OR)
2. Calculate the variance of log(OR)
3. Calculate the 95% confidence interval of log(OR)
4. Convert back to the 95% confidence interval of OR

8 Computing variance of odds ratio

Gender   Like   Dislike
Men        23        34
Women      35       132
All        58       166

OR = (23/34) / (35/132) = 2.55
log(OR) = log(2.55) = 0.937
Variance of log(OR): V = 1/23 + 1/34 + 1/35 + 1/132 = 0.109
Standard error of log(OR): SE = sqrt(0.109) = 0.330
95% confidence interval of log(OR): 0.937 ± 1.96 × 0.330 = 0.289 to 1.584
Convert back to the 95% confidence interval of OR: exp(0.289) = 1.33 to exp(1.584) = 4.87
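A minimal R sketch (assumed, not part of the original slides) reproducing this computation:

# 95% CI of the odds ratio via the log scale
or <- (23/34) / (35/132)                 # 2.55
se <- sqrt(1/23 + 1/34 + 1/35 + 1/132)   # standard error of log(OR): 0.330
ci <- log(or) + c(-1.96, 1.96) * se      # 95% CI of log(OR): 0.289 to 1.584
exp(ci)                                  # 95% CI of OR: 1.33 to 4.87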

9 Logistic analysis by R

Gender   Like   Dislike
Men        23        34
Women      35       132
All        58       166

sex <- c(1, 2)
like <- c(23, 35)
dislike <- c(34, 132)
total <- like + dislike
prob <- like/total
logistic <- glm(prob ~ sex, family="binomial", weights=total)

> summary(logistic)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   0.5457     0.5725   0.953  0.34044
sex          -0.9366     0.3302  -2.836  0.00456 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 7.8676e+00 on 1 degrees of freedom
Residual deviance: 2.2204e-15 on 0 degrees of freedom
AIC: 13.629
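The odds ratio and its 95% confidence interval from slide 8 can be recovered from this fitted model. A minimal sketch (assumed, not part of the original slides): since men are coded 1 and women 2, the men-versus-women odds ratio is exp(-β):

exp(-coef(logistic)["sex"])                    # odds ratio for men vs women: 2.55
exp(-rev(confint.default(logistic)["sex", ]))  # Wald 95% CI: 1.33 to 4.87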

10 Logistic regression model for continuous factor

Concentration   Like   Dislike   % like
1.35              13         0    1.00
1.60              19         0    1.00
1.75              67         2    0.971
1.85              45         5    0.900
1.95              71         8    0.899
2.05              50        20    0.714
2.15              35        31    0.530
2.25               7        49    0.125
2.35               1        12    0.077

11 Analysis by using R

conc <- c(1.35, 1.60, 1.75, 1.85, 1.95, 2.05, 2.15, 2.25, 2.35)
like <- c(13, 19, 67, 45, 71, 50, 35, 7, 1)
dislike <- c(0, 0, 2, 5, 8, 20, 31, 49, 12)
total <- like + dislike
prob <- like/total
plot(prob ~ conc, pch=16, xlab="Concentration")

12 Logistic regression model for continuous factor – model. Let p = probability of preference. The logit of p is: logit(p) = log[p / (1-p)]. Model: logit(p) = α + β(FAT), where α is the intercept and β is the slope, both of which have to be estimated from the data.
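As an aside (an assumed sketch, not part of the original slides), R provides the logit and its inverse as qlogis() and plogis(), which makes the link function concrete:

p <- 0.403           # P(like) among men, from slide 5
qlogis(p)            # logit(p) = log(p/(1-p)) = -0.393
plogis(qlogis(p))    # the inverse logit recovers p = 0.403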

13 Analysis by using R

logistic <- glm(prob ~ conc, family="binomial", weights=total)
summary(logistic)

Deviance Residuals:
     Min       1Q   Median       3Q      Max
-1.78226 -0.69052  0.07981  0.36556  1.36871

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   22.708      2.266  10.021   <2e-16 ***
conc         -10.662      1.083  -9.849   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 198.7115 on 8 degrees of freedom
Residual deviance:   8.5568 on 7 degrees of freedom
AIC: 37.096
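As a follow-up sketch (assumed, not shown in the original slides), the fitted curve can be overlaid on the plot from slide 11 using the estimated coefficients:

# Add the fitted logistic curve to the plot produced on slide 11
curve(plogis(22.708 - 10.662 * x), add = TRUE)
# equivalently, without hard-coding the estimates:
# curve(plogis(coef(logistic)[1] + coef(logistic)[2] * x), add = TRUE)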

14 Logistic regression model for continuous factor – interpretation. The odds ratio associated with each 0.1 increase in fat concentration was 2.90 (95% CI: 2.34, 3.59). Interpretation: each 0.1 increase in fat concentration was associated with a 2.9-fold increase in the odds of disliking the product. Since the 95% confidence interval excludes 1, this association is statistically significant at the p < 0.05 level.
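How these numbers follow from the model can be verified with a short sketch (assumed, not part of the original slides). The coefficient for conc is on the scale of one unit of concentration, so a 0.1 increase multiplies the odds of liking by exp(0.1 × β), and the odds of disliking by the reciprocal:

b  <- -10.662                          # slope for conc, from summary(logistic) on slide 13
se <- 1.083                            # its standard error
exp(-0.1 * b)                          # odds ratio of disliking per 0.1 increase: 2.90
exp(-0.1 * (b + c(1.96, -1.96) * se))  # 95% CI: 2.34 to 3.59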

15 Multiple logistic regression

 id fx age     bmi   bmd  ictp   pinp
  1  1 79 24.7252 0.818 9.170 37.383
  2  1 89 25.9909 0.871 7.561 24.685
  3  1 70 25.3934 1.358 5.347 40.620
  4  1 88 23.2254 0.714 7.354 56.782
  5  1 85 24.6097 0.748 6.760 58.358
  6  0 68 25.0762 0.935 4.939 67.123
  7  0 70 19.8839 1.040 4.321 26.399
  8  0 69 25.0593 1.002 4.212 47.515
  9  0 74 25.6544 0.987 5.605 26.132
 10  0 79 19.9594 0.863 5.204 60.267
...
137  0 64 38.0762 1.086 5.043 32.835
138  1 80 23.3887 0.875 4.086 23.837
139  0 67 25.9455 0.983 4.328 71.334

Outcome: fracture (fx; 0 = no, 1 = yes). Predictor variables: age, bmi, bmd, ictp, pinp.
Question: Which variables are important for fracture?

16 Multiple logistic regression: R analysis

setwd("c:/works/stats")
fracture <- read.table("fracture.txt", header=TRUE, na.strings=".")
names(fracture)
fulldata <- na.omit(fracture)
attach(fulldata)
temp <- glm(fx ~ . - id, family="binomial", data=fulldata)  # exclude the id column from the predictors
search <- step(temp)   # stepwise model selection by AIC
summary(search)

17 Bayesian Model Average (BMA) analysis

library(BMA)
xvars <- fulldata[, 3:7]   # age, bmi, bmd, ictp, pinp
y <- fx
bma.search <- bic.glm(xvars, y, strict=FALSE, OR=20, glm.family="binomial")
summary(bma.search)
imageplot.bma(bma.search)

18 Bayesian Model Average (BMA) analysis

> summary(bma.search)

Call:
Best 5 models (cumulative posterior probability = 0.8836):

           p!=0    EV        SD      model 1   model 2   model 3   model 4   model 5
Intercept  100   -2.85012   2.8651   -3.920    -1.065    -1.201    -8.257    -0.072
age         15.3  0.00845   0.0261      .         .         .       0.063       .
bmi         21.7 -0.02302   0.0541      .         .      -0.116       .      -0.070
bmd         39.7 -1.34136   1.9762      .      -3.499       .         .      -2.696
ictp       100.0  0.64575   0.1699    0.606     0.687     0.680     0.554     0.714
pinp         5.7 -0.00037   0.0041      .         .         .         .         .

nVar                                   1         2         2         2         3
BIC                                 -525.044  -524.939  -523.625  -522.672  -521.032
post prob                              0.307     0.291     0.151     0.094     0.041

19 Bayesian Model Average (BMA) analysis

> imageplot.bma(bma.search)
[Image plot showing, for each of the selected models, which variables are included]

20 Summary of main points. The logistic regression model is used to analyze the association between a binary outcome and one or many determinants. The determinants can be binary, categorical, or continuous measurements. The model is logit(p) = log[p / (1-p)] = α + βX, where X is a factor, and α and β must be estimated from observed data.

21 Summary of main points. exp(β) is the odds ratio associated with a one-unit increment in the determinant X. The logistic regression model can be extended to include many determinants: logit(p) = log[p / (1-p)] = α + β₁X₁ + β₂X₂ + β₃X₃ + …

