Logistic Regression

Example: Birth defects

We want to know if the probability of a certain birth defect is higher among women of a certain age.
Outcome (y) = presence/absence of birth defect
Explanatory (x) = maternal age at birth

> bdlog <- glm(bd$casegrp ~ bd$MAGE, family = binomial("logit"))

Since "logit" is the default link for the binomial family, you can simply use:

> bdlog <- glm(bd$casegrp ~ bd$MAGE, binomial)
> summary(bdlog)

Example in R

Call:
glm(formula = bd$casegrp ~ bd$MAGE, family = binomial("logit"))

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-0.56672  -0.24047  -0.14728  -0.08994   3.60539

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)   0.82555    0.32491   2.541   0.0111 *
bd$MAGE      -0.19793    0.01489 -13.290   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Null deviance: 2364.1  on 11892  degrees of freedom
Residual deviance: 2130.8  on 11891  degrees of freedom
AIC: 2134.8
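An added note (not on the original slide): exponentiating the age coefficient turns it into an odds ratio per one-year increase in maternal age.

> exp(coef(bdlog)["bd$MAGE"])   # exp(-0.19793) ≈ 0.82

Each additional year of maternal age multiplies the odds of a birth defect by about 0.82, i.e. the odds fall as age increases.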

Plot

> plot(fitted(glm(casegrp ~ MAGE, binomial)) ~ MAGE,
+      xlab = "Maternal Age", ylab = "P(Birth Defect)", pch = 15)

The fitted probabilities from the model go on the y-axis, plotted against maternal age.

Example: Categorical x

Sometimes it's easier to interpret logistic regression output if the x variables are categorical. Suppose we categorize maternal age into 3 categories:

> bd$magecat3 <- ifelse(bd$MAGE > 24, 1, 0)
> bd$magecat2 <- ifelse(bd$MAGE >= 20 & bd$MAGE <= 24, 1, 0)
> bd$magecat1 <- ifelse(bd$MAGE < 20, 1, 0)

                        Maternal Age
Birth Defect   <20 years   20-24 years   >24 years
Yes                  101           105          36
No                  1385          3755        6511

Example in R

> bdlog2 <- glm(casegrp ~ magecat1 + magecat2, binomial)
> summary(bdlog2)

Remember: with a set of dummy variables, you always put in one less variable than there are categories. Here magecat3 (>24 years) is omitted, so it becomes the reference group.
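An alternative sketch (the variable magecat and the cut() breakpoints below are illustrative, not from the original slides): let R build the dummies automatically from a factor, so the reference category is set explicitly rather than by hand.

> bd$magecat <- cut(bd$MAGE, breaks = c(-Inf, 19, 24, Inf),
+                   labels = c("under20", "20to24", "over24"))
> bd$magecat <- relevel(bd$magecat, ref = "over24")  # make >24 years the baseline
> summary(glm(casegrp ~ magecat, binomial, data = bd))

This fits the same model as bdlog2 but with one factor instead of separate 0/1 columns.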

Example in R

Call:
glm(formula = casegrp ~ magecat1 + magecat2, family = binomial)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.3752  -0.2349  -0.1050  -0.1050   3.2259

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -5.1977     0.1671 -31.101   <2e-16 ***
magecat1      2.5794     0.1964  13.137   <2e-16 ***
magecat2      1.6208     0.1942   8.345   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2364.1  on 11892  degrees of freedom
Residual deviance: 2148.6  on 11890  degrees of freedom
AIC: 2154.6

Interpretation: odds ratios

> exp(cbind(OR = coef(bdlog2), confint(bdlog2)))
Waiting for profiling to be done...
                      OR       2.5 %       97.5 %
(Intercept)  0.005529105 0.003910843  0.007544544
magecat1    13.189149619 9.066531887 19.622868917
magecat2     5.057367954 3.492376718  7.495720840

This tells us that:
- women <20 years have 13 times the odds of a birth defect compared with women >24 years
- women 20-24 years have 5 times the odds of a birth defect compared with women >24 years
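You can verify the magecat1 odds ratio by hand from the table of counts two slides back: the odds of a defect are 101/1385 among women <20 and 36/6511 among women >24.

> (101/1385) / (36/6511)
[1] 13.18915

This matches the model-based OR of 13.19.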

What variables can we consider dropping?

(bd.log here is an expanded model that also includes bthparity2 and smoke as predictors.)

> anova(bd.log, test = "Chisq")
Analysis of Deviance Table

Model: binomial, link: logit
Response: casegrp
Terms added sequentially (first to last)

           Df Deviance Resid. Df Resid. Dev  P(>|Chi|)
NULL                       11880    2355.87
magecat1    1   130.58     11879    2225.28  < 2.2e-16 ***
magecat2    1    82.73     11878    2142.56  < 2.2e-16 ***
bthparity2  1    13.37     11877    2129.18  0.0002555 ***
smoke       1    16.10     11876    2113.09  6.022e-05 ***

Small p-values indicate that all variables are needed to explain the variation in y.
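Because anova() adds terms sequentially, each p-value depends on the order in which the terms were entered. As a sketch, base R's drop1() gives an order-free check by testing each term dropped from the full model:

> drop1(bd.log, test = "Chisq")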

Goodness of fit statistics

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
(Intercept)        -4.9100     0.1870 -26.252  < 2e-16 ***
magecat1            2.2534     0.2073  10.872  < 2e-16 ***
magecat2            1.4732     0.1965   7.497 6.52e-14 ***
bthparity2parous   -0.5932     0.1497  -3.962 7.45e-05 ***
smokesmoker         0.6515     0.1546   4.213 2.52e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Null deviance: 2355.9  on 11880  degrees of freedom
Residual deviance: 2113.1  on 11876  degrees of freedom
AIC: 2123.1

For ungrouped binary data like these, the residual deviance is -2 log L for the fitted model, and AIC = -2 log L + 2 x (number of parameters) = 2113.1 + 2(5) = 2123.1.

Binned residual plot

> library(arm)   # binnedplot() comes from the arm package
> x <- predict(bd.log)
> y <- resid(bd.log)
> binnedplot(x, y)

binnedplot() plots the average residual against the average fitted (predicted) value for each bin, or category. Bins are based on the fitted values. About 95% of the averages should fall within the dotted lines.

Poisson Regression Using count data

What is a Poisson distribution?

A distribution for counts (0, 1, 2, ...) of events occurring in a fixed interval or group. It has a single parameter λ, which is both its mean and its variance.
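A minimal simulated illustration (assumed example, not from the slides):

> x <- rpois(10000, lambda = 3)   # 10,000 Poisson counts with mean 3
> mean(x)   # close to 3
> var(x)    # also close to 3: for a Poisson, variance = mean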

Example: children ever born

The dataset has 70 rows of group-level data on the number of children ever born to women in Fiji:
- Number of children ever born (y)
- Number of women in the group (n)
- Duration of marriage: 1 = 0-4, 2 = 5-9, 3 = 10-14, 4 = 15-19, 5 = 20-24, 6 = 25-29 years
- Residence (res): 1 = Suva (capital city), 2 = Urban, 3 = Rural
- Education (educ): 1 = none, 2 = lower primary, 3 = upper primary, 4 = secondary+

Poisson regression in R

We need to account for the different numbers of women in each area/group (unless all groups are the same size), so log(n) enters as an offset:

> ceb2 <- glm(y ~ educ + res, offset = log(n), family = "poisson", data = ceb)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  1.43029    0.01795  79.691   <2e-16 ***
educnone     0.21462    0.02183   9.831   <2e-16 ***
educsec+    -1.00900    0.05217 -19.342   <2e-16 ***
educupper   -0.40485    0.02956 -13.696   <2e-16 ***
resSuva     -0.05997    0.02819  -2.127   0.0334 *
resurban     0.06204    0.02442   2.540   0.0111 *
---
    Null deviance: 3731.5  on 69  degrees of freedom
Residual deviance: 2646.5  on 64  degrees of freedom
AIC: Inf
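Why the offset works (an added note): with offset = log(n), the model is

log(E[y]) = log(n) + b0 + b1*educ + b2*res

which is the same as modelling the log rate, log(E[y]/n) = b0 + b1*educ + b2*res, i.e. children ever born per woman rather than raw group totals.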

Assessing model fit

1. Examine the AIC score - smaller is better.
2. Examine the deviance as an approximate goodness-of-fit test. Expect the residual deviance / degrees of freedom to be approximately 1:
> ceb2$deviance / ceb2$df.residual
[1] 41.35172
3. Compare the residual deviance to a χ² distribution:
> pchisq(2646.5, 64, lower.tail = FALSE)
[1] 0

A p-value this small says the model does not fit well, consistent with the deviance/df ratio being far above 1.

Model fitting: analysis of deviance

As in logistic regression, we compare the difference in residual deviance between nested models:

> ceb1 <- glm(y ~ educ, family = "poisson", offset = log(n), data = ceb)
> ceb2 <- glm(y ~ educ + res, family = "poisson", offset = log(n), data = ceb)
> 1 - pchisq(deviance(ceb1) - deviance(ceb2),
+            df.residual(ceb1) - df.residual(ceb2))
[1] 0.0007124383

Since the p-value is small, there is evidence that adding res explains a significant amount of additional deviance.
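The same likelihood-ratio test can be run directly with base R's anova() on the two nested fits:

> anova(ceb1, ceb2, test = "Chisq")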

Overdispersion in Poisson models

A characteristic of the Poisson distribution is that its mean is equal to its variance. Sometimes the observed variance is greater than the mean; this is known as overdispersion. Another common problem with Poisson regression is excess zeros: more zeros in the data than a Poisson model would predict. (A quick dispersion check follows below.)
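A sketch of that check, using the ceb2 fit from the earlier slide: the Pearson-residual estimate of the dispersion parameter should be near 1 for a well-behaved Poisson model.

> sum(residuals(ceb2, type = "pearson")^2) / ceb2$df.residual
[1] 37.55359

A value this far above 1 indicates severe overdispersion; it is the same dispersion the quasipoisson fit reports two slides below.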

Overdispersion

Use family = "quasipoisson" instead of "poisson" to estimate a dispersion parameter. This doesn't change the coefficient estimates, but it can change their standard errors:

> ceb2 <- glm(y ~ educ + res, family = "quasipoisson", offset = log(n), data = ceb)

Poisson vs. quasipoisson

Family = "poisson":

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  1.43029    0.01795  79.691   <2e-16 ***
educnone     0.21462    0.02183   9.831   <2e-16 ***
educsec+    -1.00900    0.05217 -19.342   <2e-16 ***
educupper   -0.40485    0.02956 -13.696   <2e-16 ***
resSuva     -0.05997    0.02819  -2.127   0.0334 *
resurban     0.06204    0.02442   2.540   0.0111 *
---
(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 3731.5  on 69  degrees of freedom
Residual deviance: 2646.5  on 64  degrees of freedom

Family = "quasipoisson":

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.43029    0.10999  13.004  < 2e-16 ***
educnone     0.21462    0.13378   1.604  0.11358
educsec+    -1.00900    0.31968  -3.156  0.00244 **
educupper   -0.40485    0.18115  -2.235  0.02892 *
resSuva     -0.05997    0.17277  -0.347  0.72965
resurban     0.06204    0.14966   0.415  0.67988
---
(Dispersion parameter for quasipoisson family taken to be 37.55359)

    Null deviance: 3731.5  on 69  degrees of freedom
Residual deviance: 2646.5  on 64  degrees of freedom
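The quasipoisson standard errors are just the Poisson ones inflated by the square root of the estimated dispersion. For example, for the intercept:

> 0.01795 * sqrt(37.55359)   # ≈ 0.110, matching the quasipoisson SE of 0.10999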

Models for overdispersion

When overdispersion is a problem, use a negative binomial model, which will adjust the β estimates and standard errors:

> install.packages("MASS")
> library(MASS)
> ceb.nb <- glm.nb(y ~ educ + res + offset(log(n)), data = ceb)

or, reusing the earlier fit:

> ceb.nb <- glm.nb(ceb2)
> summary(ceb.nb)

NB model in R

Call:
glm.nb(formula = ceb2, x = T, init.theta = 3.38722121141125, link = log)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  1.490043   0.160589   9.279  < 2e-16 ***
educnone     0.002317   0.183754   0.013  0.98994
educsec+    -0.630343   0.200220  -3.148  0.00164 **
educupper   -0.173138   0.184210  -0.940  0.34727
resSuva     -0.149784   0.165622  -0.904  0.36580
resurban     0.055610   0.165391   0.336  0.73670
---
(Dispersion parameter for Negative Binomial(3.3872) family taken to be 1)

    Null deviance: 85.001  on 69  degrees of freedom
Residual deviance: 71.955  on 64  degrees of freedom
AIC: 740.55

Theta: 3.387
Std. Err.: 0.583
2 x log-likelihood: -726.555

> ceb.nb$deviance / ceb.nb$df.residual
[1] 1.124297

The deviance/df ratio is now close to 1, so the negative binomial model fits much better.

What if your data looked like… [figure: a histogram of counts with a large spike at zero]

Zero-inflated Poisson model (ZIP)

If you have a large number of 0 counts:

> install.packages("pscl")
> library(pscl)
> ceb.zip <- zeroinfl(y ~ educ + res, offset = log(n), data = ceb)
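summary() on a zeroinfl fit reports two sets of coefficients: a count model (Poisson with log link) and a zero-inflation model (binomial with logit link) for the probability of a structural zero.

> summary(ceb.zip)

The formula y ~ educ + res is shorthand for using educ + res in both parts; a different zero-inflation part can be given after a vertical bar, e.g. y ~ educ + res | res.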