
Logistic Regression.


1 Logistic Regression

2 Example: Birth defects
We want to know if the probability of a certain birth defect is higher among women of a certain age.
Outcome (y) = presence/absence of birth defect
Explanatory (x) = maternal age at birth
> bdlog<-glm(bd$casegrp~bd$MAGE,family=binomial("logit"))
Since "logit" is the default link, you can actually use:
> bdlog<-glm(bd$casegrp~bd$MAGE,binomial)
> summary(bdlog)
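A minimal follow-up sketch (assuming bd is a data frame with a 0/1 casegrp and a numeric MAGE column; the object name bdlog.df is introduced here only to avoid overwriting the fit above): refitting with data=bd and exponentiating the slope gives the odds ratio per one-year increase in maternal age.
> bdlog.df <- glm(casegrp ~ MAGE, family = binomial, data = bd)
> exp(coef(bdlog.df)["MAGE"])               # odds ratio per additional year of age
> exp(confint.default(bdlog.df)["MAGE", ])  # Wald-based 95% CI on the OR scale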

3 Example in R
Call:
glm(formula = bd$casegrp ~ bd$MAGE, family = binomial("logit"))

Deviance Residuals:
    Min      1Q  Median      3Q     Max

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)                                       *
bd$MAGE                                   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Null deviance:     on    degrees of freedom
Residual deviance: on    degrees of freedom
AIC:
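A quick sketch of turning the fitted log-odds into a probability (using the data-frame fit bdlog.df from the sketch above; the age of 30 is chosen only for illustration):
> predict(bdlog.df, newdata = data.frame(MAGE = 30), type = "response")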

4 Plot
> plot(fitted(glm(casegrp ~ MAGE, binomial)) ~ MAGE,
+      xlab="Maternal Age", ylab="P(Birth Defect)", pch=15)
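A hedged alternative sketch: sorting by age first draws the fitted probabilities as a smooth curve rather than scattered points (bdlog.df is the data-frame fit of casegrp ~ MAGE from the earlier sketch).
> ord <- order(bd$MAGE)
> plot(bd$MAGE[ord], fitted(bdlog.df)[ord], type = "l",
+      xlab = "Maternal Age", ylab = "P(Birth Defect)")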

5 Example: Categorical x
Sometimes it's easier to interpret logistic regression output if the x variables are categorical.
Suppose we categorize maternal age into 3 categories:
> bd$magecat3 <- ifelse(bd$MAGE>25, c(1),c(0))
> bd$magecat2 <- ifelse(bd$MAGE>=20 & bd$MAGE<=25, c(1),c(0))
> bd$magecat1 <- ifelse(bd$MAGE<20, c(1),c(0))

                   Maternal Age
Birth Defect   < 20 years   20-24 years   > 24 years
Yes                   101           105           36
No                   1385          3755         6511
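A minimal alternative sketch: letting R build the dummies from a factor gives the same kind of model without hand-coded indicators (the cut points and labels mirror the ifelse() calls above and assume integer ages; the object names magecat and bdlog2f are illustrative).
> bd$magecat <- cut(bd$MAGE, breaks = c(-Inf, 19, 25, Inf),
+                   labels = c("<20", "20-25", ">25"))
> bd$magecat <- relevel(bd$magecat, ref = ">25")   # oldest group as the reference
> bdlog2f <- glm(casegrp ~ magecat, family = binomial, data = bd)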

6 Example in R
> bdlog2<-glm(casegrp~magecat1+magecat2,binomial)
> summary(bdlog2)
Remember: with a set of dummy variables, you always include one fewer variable than the number of categories.

7 Example in R
Call:
glm(formula = casegrp ~ magecat1 + magecat2, family = binomial)

Deviance Residuals:
    Min      1Q  Median      3Q     Max

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)                               <2e-16 ***
magecat1                                  <2e-16 ***
magecat2                                  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance:     on    degrees of freedom
Residual deviance: on    degrees of freedom
AIC:

8 Interpretation: odds ratios
> exp(cbind(OR=coef(bdlog2),confint(bdlog2)))
Waiting for profiling to be done...
                  OR    2.5 %   97.5 %
(Intercept)
magecat1
magecat2
This tells us that:
women <20 years have 13 times greater odds of a birth defect than women >24 years
women 20-24 years have 5 times greater odds of a birth defect than women >24 years
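Those odds ratios match a quick arithmetic check against the 2x3 table on slide 5 (odds = Yes/No within each age group):
> (101/1385) / (36/6511)   # <20 vs >24: about 13
> (105/3755) / (36/6511)   # 20-24 vs >24: about 5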

9 What variables can we consider dropping?
> anova(bd.log,test="Chisq")
Analysis of Deviance Table

Model: binomial, link: logit
Response: casegrp
Terms added sequentially (first to last)

           Df Deviance Resid. Df Resid. Dev  P(>|Chi|)
NULL
magecat1                                     < 2.2e-16 ***
magecat2                                     < 2.2e-16 ***
bthparity                                              ***
smoke                                             e-05 ***
Small p-values indicate that all variables are needed to explain the variation in y.
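The fuller model bd.log used here is not shown on an earlier slide; a plausible call, sketched with the term names as they appear in the deviance table (the exact coding of bthparity and smoke is an assumption), would be:
> bd.log <- glm(casegrp ~ magecat1 + magecat2 + bthparity + smoke,
+               family = binomial, data = bd)
> anova(bd.log, test = "Chisq")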

10 Goodness of fit statistics
Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
(Intercept)                                   < 2e-16 ***
magecat1                                      < 2e-16 ***
magecat2                                         e-14 ***
bthparity2parous                                 e-05 ***
smokesmoker                                      e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Null deviance:     on    degrees of freedom
Residual deviance: on    degrees of freedom
AIC:
The AIC combines -2 log L with a penalty for model size: AIC = -2 log L + 2k, where k is the number of estimated parameters; smaller is better.
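A minimal sketch of pulling those pieces straight from the fitted object:
> logLik(bd.log)    # maximised log-likelihood (the "log L" part)
> AIC(bd.log)       # -2*logLik + 2*(number of estimated parameters)
> deviance(bd.log)  # residual deviance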

11 Binned residual plot
> library(arm)   # binnedplot() comes from the arm package
> x<-predict(bd.log)
> y<-resid(bd.log)
> binnedplot(x,y)
Plots the average residual and the average fitted (predicted) value for each bin, or category.
Bins are based on the fitted values.
95% of all values should fall within the dotted lines.

12 Poisson Regression Using count data

13 What is a Poisson distribution?
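For reference, a Poisson count with mean λ has P(Y = k) = e^(-λ) λ^k / k!, and its variance equals its mean. A quick sketch in R (λ = 2 is chosen only for illustration):
> dpois(0:5, lambda = 2)   # P(Y = 0), ..., P(Y = 5)
> y <- rpois(1e5, 2)
> c(mean(y), var(y))       # both close to 2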

14 Example: children ever born
The dataset has 70 rows representing group-level data on the number of children ever born to women in Fiji:
Number of children ever born
Number of women in the group
Duration of marriage: 1=0-4, 2=5-9, 3=10-14, 4=15-19, 5=20-24, 6=25-29
Residence: 1=Suva (capital city), 2=Urban, 3=Rural
Education: 1=none, 2=lower primary, 3=upper primary, 4=secondary+
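A small exploratory sketch (the column names y = children ever born, n = number of women, and educ are assumptions taken from the model calls on the following slides): crude mean number of children per woman in each education group.
> with(ceb, tapply(y / n, educ, mean))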

15

16 Poisson regression in R
> ceb1<-glm(y ~ educ + res, offset=log(n), family = "poisson", data = ceb)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)                               <2e-16 ***
educnone                                  <2e-16 ***
educsec                                   <2e-16 ***
educupper                                 <2e-16 ***
resSuva                                          *
resurban                                         *
---
Null deviance:     on 69 degrees of freedom
Residual deviance: on 64 degrees of freedom
AIC: Inf

The offset=log(n) term accounts for the different population sizes in each area/group; it is needed unless the data come from same-size populations.
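A follow-up sketch: because of the log link and the log(n) offset, exponentiated coefficients are rate ratios (children per woman) relative to the reference education and residence categories.
> exp(coef(ceb1))
> exp(confint.default(ceb1))   # Wald 95% CIs on the rate-ratio scale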

17 Assessing model fit
1. Examine the AIC score – smaller is better
2. Examine the deviance as an approximate goodness-of-fit test
   Expect residual deviance / degrees of freedom to be approximately 1
   > ceb2$deviance/ceb2$df.residual
   [1]
3. Compare the residual deviance to a χ² distribution
   > pchisq(2646.5, 64, lower=F)
   [1] 0
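The same two checks can be written against the fitted object rather than hard-coded numbers, a minimal sketch:
> ceb2$deviance / ceb2$df.residual                             # want roughly 1
> pchisq(ceb2$deviance, ceb2$df.residual, lower.tail = FALSE)  # approximate GOF p-value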

18 Model fitting: analysis of deviance
Similar to logistic regression, we want to compare the difference in residual deviance between nested models:
> ceb1<-glm(y~educ, family="poisson", offset=log(n), data= ceb)
> ceb2<-glm(y~educ+res, family="poisson", offset=log(n), data= ceb)
> 1-pchisq(deviance(ceb1)-deviance(ceb2), df.residual(ceb1)-df.residual(ceb2))
[1]
Since the p-value is small, there is evidence that adding res explains a significant amount more of the deviance.
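An equivalent way to make this comparison is anova() on the two nested fits:
> anova(ceb1, ceb2, test = "Chisq")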

19 Overdispersion in Poisson models
A characteristic of the Poisson distribution is that its mean is equal to its variance.
Sometimes the observed variance is greater than the mean; this is known as overdispersion.
Another common problem with Poisson regression is excess zeros: more zeros than a Poisson model would predict.
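A common quick check for overdispersion, sketched here on the ceb2 fit above, is the Pearson-based dispersion estimate; values well above 1 suggest the Poisson assumption is too tight.
> sum(residuals(ceb2, type = "pearson")^2) / df.residual(ceb2)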

20 Overdispersion
Use family="quasipoisson" instead of "poisson" to estimate the dispersion parameter.
This doesn't change the estimates of the coefficients, but it may change their standard errors.
> ceb2<-glm(y~educ+res, family="quasipoisson", offset=log(n), data=ceb)

21 Poisson vs. quasipoisson
Family = "poisson":
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)                               <2e-16 ***
educnone                                  <2e-16 ***
educsec                                   <2e-16 ***
educupper                                 <2e-16 ***
resSuva                                          *
resurban                                         *
---
(Dispersion parameter for poisson family taken to be 1)
Null deviance:     on 69 degrees of freedom
Residual deviance: on 64 degrees of freedom

Family = "quasipoisson":
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                              < 2e-16 ***
educnone
educsec                                           **
educupper                                         *
resSuva
resurban
---
(Dispersion parameter for quasipoisson taken to be     )
Null deviance:     on 69 degrees of freedom
Residual deviance: on 64 degrees of freedom

22 Models for overdispersion
When overdispersion is a problem, use a negative binomial model.
This will adjust the β estimates and their standard errors.
> install.packages("MASS")
> library(MASS)
> ceb.nb <- glm.nb(y~educ+res+offset(log(n)), data= ceb)
OR
> ceb.nb<-glm.nb(ceb2)
> summary(ceb.nb)
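A hedged sketch for comparing the two count models on AIC (the quasipoisson fit has no likelihood, so a plain Poisson refit of the same formula, named ceb.pois here for illustration, is used instead):
> ceb.pois <- glm(y~educ+res, family="poisson", offset=log(n), data=ceb)
> AIC(ceb.pois, ceb.nb)   # the negative binomial should win if overdispersion is real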

23 NB model in R
glm.nb(formula = ceb2, x = T, init.theta =        , link = log)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)                              < 2e-16 ***
educnone
educsec                                           **
educupper
resSuva
resurban

(Dispersion parameter for Negative Binomial(3.3872) family taken to be 1)
Null deviance:     on 69 degrees of freedom
Residual deviance: on 64 degrees of freedom
AIC:
Theta:          Std. Err.:
2 x log-likelihood:

> ceb.nb$deviance/ceb.nb$df.residual
[1] 1.124297

24 What if your data looked like…
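A rough sketch of checking for excess zeros (ceb.pois is the plain Poisson refit from the earlier sketch): compare the observed number of zero counts with the number the Poisson fit expects.
> sum(ceb$y == 0)                    # observed zero counts
> sum(dpois(0, fitted(ceb.pois)))    # expected zero counts under the Poisson fit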

25 Zero-inflated Poisson model (ZIP)
If you have a large number of 0 counts…
> install.packages("pscl")
> library(pscl)
> ceb.zip <- zeroinfl(y~educ+res, offset=log(n), data= ceb)
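A short follow-up sketch: zeroinfl() fits two parts, a count model and a logistic model for the extra zeros; with the formula written as above, both parts use the same regressors by default.
> summary(ceb.zip)   # prints "Count model" and "Zero-inflation model" coefficient blocks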

