Generalized Linear Models Logistic Regression Log-Linear Models © G. Quinn & M. Keough, 2004.



Generalized linear model
Fit model using maximum likelihood
Three components
–Random component: response variable & its probability distribution, from the exponential family (normal, gamma, binomial, etc.)
–Systematic component: predictor variable(s)
  –Continuous or categorical
  –Combinations, polynomial functions
–Link function

Link function
Links expected value of Y (μ) to predictors: g(μ) = β0 + β1x1 + β2x2 + …

Three common link functions
Identity link
–g(μ) = μ
–Models mean or expected value of Y
–Standard linear models
Log link
–g(μ) = log(μ)
–Used for count data, which cannot be negative
Logit link
–g(μ) = log(μ/(1-μ))
–Used for binary data and logistic regression
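The three links can be written out directly. The sketch below is only an illustration: the function names are invented here and none of it comes from the slides.

```python
import math

def identity_link(mu):
    """g(mu) = mu: standard linear models."""
    return mu

def log_link(mu):
    """g(mu) = log(mu): count data, so mu must be positive."""
    return math.log(mu)

def logit_link(mu):
    """g(mu) = log(mu / (1 - mu)): binary data, 0 < mu < 1."""
    return math.log(mu / (1.0 - mu))

def inverse_logit(eta):
    """Map a value of the linear predictor back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

even_odds = logit_link(0.5)                  # odds of 1 give log(odds) of 0
round_trip = inverse_logit(logit_link(0.8))  # recovers 0.8
```

The inverse of the logit is what turns a fitted linear predictor back into a predicted probability.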

Logistic regression
Modelling response variables that are discrete
–Often binary: Present/Absent, Alive/Dead, Response/No Response
Predictors categorical or continuous
–Logistic analogues of simple linear regression, ANOVA, multiple regression, ANCOVA

Simple logistic regression
Single, binary response variable
Single, continuous predictor
Model π(x)
–P(Y = 1) for a given X
Fit logistic regression model
–Sigmoidal
–Greatest change in values in mid-range of X
–Response variable has binomial distribution
OLS not appropriate; ML required

Example: Lizards on islands Polis et al. (1998), examination of factors controlling spider populations on islands in the Gulf of California. Hypothesis: Presence of predator lizards (Uta) a key influence. What aspects of an island influence the presence of Uta? –P/A ratio

[Figure: Uta presence/absence plotted against P/A ratio]

The logistic model
π(x) = e^(β0 + β1x) / (1 + e^(β0 + β1x))
β0 and β1 are parameters to be estimated
–Intercept and slope
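A short sketch of the logistic curve shows the sigmoidal shape and the mid-range steepness mentioned earlier. The values of β0 and β1 are hypothetical, chosen only for illustration, and the names `logistic`, `midpoint` and `steps` are invented here.

```python
import math

def logistic(x, b0, b1):
    """pi(x) = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))."""
    eta = b0 + b1 * x
    return math.exp(eta) / (1.0 + math.exp(eta))

b0, b1 = 3.0, -1.0      # hypothetical parameter values, not from the slides
midpoint = -b0 / b1     # the x at which pi(x) = 0.5

# Predicted probabilities over a range of x
curve = [(x, round(logistic(x, b0, b1), 3)) for x in range(0, 7)]

# Change in pi(x) per unit step in x; largest in magnitude near the midpoint
steps = [logistic(x + 1, b0, b1) - logistic(x, b0, b1) for x in range(0, 6)]
```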

A simpler model
Calculate odds
–P(event)/(1 – P(event))
–P(y_i = 1)/P(y_i = 0)

Log odds
Link function (logit), g(x)
g(x) = β0 + β1x1 (linear model)
g(x) = β0 + β1(P/A ratio)

Interpretation
β1 is the rate of change in the log(odds) for a unit change in X
More often expressed as the odds ratio
–Change in odds for unit change in X
–e^β1
β0 is the intercept
–Not often of biological interest
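The link between the slope and the odds ratio can be checked numerically: whatever the starting value of X, the odds change by the same factor for each unit increase. This is a sketch with hypothetical parameter values and invented helper names.

```python
import math

def prob(x, b0, b1):
    """Predicted probability from the logistic model."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)

b0, b1 = 3.0, -1.0          # hypothetical values, not from the slides
odds_ratio = math.exp(b1)   # e^b1, about 0.37 here

# Ratio of odds for a one-unit increase in x, at two different starting points
ratio_at_2 = odds(prob(3, b0, b1)) / odds(prob(2, b0, b1))
ratio_at_5 = odds(prob(6, b0, b1)) / odds(prob(5, b0, b1))
```

Both ratios equal e^β1, which is why the odds ratio summarises the effect of the predictor in a single number.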

Null hypotheses
Most often β1 = 0
Wald test
–ML equivalent of t test
–Parameter estimate/standard error
–b/s_b
–Normal for large sample sizes
–Test using z distribution
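A minimal sketch of the Wald test, assuming the large-sample normal approximation described above. The estimate and standard error are hypothetical, and `wald_test` is a name made up for this example.

```python
import math

def wald_test(estimate, std_error):
    """Return (z, two-sided P) for H0: parameter = 0, using the z distribution."""
    z = estimate / std_error
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided standard normal tail area
    return z, p

z, p = wald_test(estimate=-0.5, std_error=0.2)  # hypothetical b and s_b
```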

Tests (continued)
Compare fit of full and reduced model
g(x) = β0 + β1x1 (full model)
g(x) = β0 (reduced model)
Difference in fit reflects effect of β1
Assess fit using likelihood ratio statistic (Λ)

[Figure: Log(L) plotted against possible β0 values; the ML estimator is the β0 value with the highest Log(L)]

[Figure: Log(L) surface over β0 and β1 values, with Log(L) for the best parameter estimate marked]

Tests (continued)
Λ is the ratio of the likelihood of the reduced model to that of the full model
If Λ is near 1, β1 contributes little
If Λ is well below 1, β1 has an effect
G² = -2 ln(Λ)
–Log-likelihood χ² or G statistic
G² = -2 (log-likelihood reduced – log-likelihood full)
G² follows χ² with 1 df
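The calculation can be sketched as follows. The full-model log-likelihood of -7.110 is the value that appears in the SYSTAT output for the lizard example, and treating it as the full-model value is an assumption. The reduced-model value of -13.0 is invented purely for illustration.

```python
import math

def likelihood_ratio_test(ll_reduced, ll_full):
    """Return (Lambda, G2, P) for two models differing by one parameter."""
    lam = math.exp(ll_reduced - ll_full)   # ratio of the two likelihoods
    g2 = -2.0 * (ll_reduced - ll_full)     # equals -2 ln(Lambda)
    p = math.erfc(math.sqrt(g2 / 2.0))     # chi-square upper tail, valid for 1 df only
    return lam, g2, p

# ll_reduced is a made-up value; ll_full is assumed to be the -7.110 in the output
lam, g2, p = likelihood_ratio_test(ll_reduced=-13.0, ll_full=-7.110)
```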

Test statistics
Test with either Wald or G²
Unlike least squares regression, Wald and G² tests are not equivalent
Use G² for small sample sizes
G² also called deviance when a specific model is compared to a saturated model (which fits data perfectly)
–Change in deviance used to compare different models
–Analogous to SS Residual

Worked example
Maximum likelihood estimation, so procedure is iterative: estimate parameters, calculate log-likelihood. Refine parameter estimates, recalculate log-likelihood. Continue until convergence.
SYSTAT output:
Category choices
0 (REFERENCE) 9
1 (RESPONSE) 10
Total: 19
L-L at iteration 1 is
L-L at iteration 2 is
L-L at iteration 3 is
L-L at iteration 4 is
L-L at iteration 5 is
L-L at iteration 6 is
L-L at iteration 7 is
Log Likelihood:
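The iterative procedure can be illustrated with a Newton-Raphson fit of a single-predictor logistic model. This is a generic sketch on made-up data, not SYSTAT's algorithm and not the lizard data. The function name and the toy values are assumptions.

```python
import math

def fit_logistic(x, y, max_iter=25, tol=1e-8):
    """Newton-Raphson ML fit of logit(pi) = b0 + b1*x.

    Returns (b0, b1, trace), where trace holds the log-likelihood
    evaluated at the start of each iteration.
    """
    b0, b1 = 0.0, 0.0
    trace = []
    for _ in range(max_iter):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        trace.append(sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                         for yi, pi in zip(y, p)))
        # Gradient of the log-likelihood
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Information matrix (2 x 2), built from weights pi * (1 - pi)
        w = [pi * (1 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: inverse information matrix times gradient
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:   # stop once the estimates no longer change
            break
    return b0, b1, trace

# Toy data (hypothetical): presence becomes less likely as x increases
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1, 1, 1, 0, 1, 0, 0, 0]
b0, b1, trace = fit_logistic(x, y)
```

Printing `trace` gives one log-likelihood per iteration, which is the same kind of listing as the "L-L at iteration" lines in the output above.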

Output (cont)
Parameter Estimate S.E. t-ratio p-value
1 CONSTANT
2 PARATIO
% bounds
Parameter Odds Ratio Upper Lower
2 PARATIO
Log Likelihood of constants only model = LL(0) =
2*[LL(N)-LL(0)] = with 1 df Chi-sq p-value =
Λ = e^(LL(0) – (-7.110))

[Figures: predicted probability of occurrence plotted against P/A ratio; Uta presence/absence plotted against P/A ratio]

A special case: toxicity testing Logistic regression used to estimate relationship between concentration of substance and response variable Equation used to solve for concentration that produces a given level of response –LC50 –EC50

Worked example: toxicity testing
Effect of copper on larvae of a marine invertebrate, Bugula dentata
Methods
–Larvae exposed to copper at range of concentrations
–Range of [Cu] 0–400 µg/L
–Recorded as swimming or not after 6 hours
–Recorded as live or dead after 24 h

Parameter estimates:
–Intercept 3.07 ± 0.64
–Slope ± 0.31 (t = -4.56, P < 0.001)
LC50
–50% swimming
–odds = 1, log(odds) = 0
–Solve for y = 0
Log[Cu] = 2.16
[Cu] = 145 µg/L

[Figure: % swimming plotted against Log [Cu]]
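Solving the fitted equation for the concentration giving a 50% response can be sketched as below. The intercept is the value on the slide. The slope of -1.42 is an assumption back-calculated from the reported figures (3.07/2.16 ≈ 1.42, and t × SE = -4.56 × 0.31 ≈ -1.41), and a base-10 log scale is assumed because 10^2.16 ≈ 145.

```python
import math

def log_conc_for_response(intercept, slope, p=0.5):
    """Solve intercept + slope * log10[Cu] = logit(p) for log10[Cu]."""
    return (math.log(p / (1.0 - p)) - intercept) / slope

intercept = 3.07   # from the slide
slope = -1.42      # ASSUMPTION: back-calculated, not stated in the transcript

log_lc50 = log_conc_for_response(intercept, slope)   # logit(0.5) = 0
lc50 = 10 ** log_lc50                                # back to µg/L
```

Changing `p` gives other effect levels, for example `p=0.1` for the concentration at which only 10% are predicted to swim.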

Extension to multiple regression
Analogous to least squares multiple regression
Generates partial regression coefficients
Test overall regression by comparing fit of
–Full model
–Reduced model (constant or β0 only)
Wald tests as equivalent to t tests
Use likelihood ratio statistics (deviance)
Assumptions

[Figure: panels for Age, Distance, % shrub and Rodents presence/absence]

Logistic ANCOVA Analogous to ANCOVA Test for heterogeneity of slopes –Fit models with and without interaction present –Compare fit of two models Run reduced model with covariate and categorical variable Test effects of each

Worked example Marshall et al. (2003) Ecology Effects of larval size on juvenile survivorship in a bryozoan, Bugula neritina –Larval size measured, then juveniles transplanted to field and survival (and growth) monitored –Experiment repeated several times Response variable: colony survival Predictor variables: –Larval Size –Experimental Run

Logistic ANCOVA (cont)
Fit model g(x) = β0 + β1x1 + α_i + β_i1 x1
–β1 = overall effect of size
–α_i = effect of Run i
–β_i1 = effect of size in Run i
–LL = , df = 7
Fit model g(x) = β0 + β1x1 + α_i
–LL = , df = 4
Effect of interaction term
–G² = -2 ( – ( )) = 5.65, df = 3, P =
–Conclude slopes not significantly heterogeneous
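The P value for the interaction test can be recomputed from G² = 5.65 with 3 df. The sketch below uses the closed-form series for the χ² upper tail with integer df, which avoids any external library. The name `chi2_sf` is chosen here for illustration.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    half = x / 2.0
    if df % 2 == 0:
        # Even df: e^(-x/2) * sum_{k < df/2} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: erfc term plus e^(-x/2) * sum_{k=1..m} (x/2)^(k - 1/2) / Gamma(k + 1/2)
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(half) / math.gamma(1.5) * math.exp(-half)
    for k in range(1, (df - 1) // 2 + 1):
        total += term
        term *= half / (k + 0.5)
    return total

p_interaction = chi2_sf(5.65, 3)   # roughly 0.13
```

A value of about 0.13 is consistent with the conclusion that the slopes are not significantly heterogeneous.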

Effect of Larval Size: P =
Effect of Run: P = 0.229

Important assumptions
Correct probability distribution for response variable
Collinearity
–Inflates standard errors of parameter estimates
–Interpretations unreliable
–Few diagnostics available
  Correlation matrices for predictor variables
  Examine tolerance by running as OLS linear regression
Residuals
–Not useful for individual observations
–Aggregate approaches
  Deciles of risk
Influence

Contingency tables and log-linear models

Introduction Each observation classified into 2 groups –Phenotypes for trait controlled by single-locus with dominance –Behavioural choice between two alternatives Is the distribution between these groups consistent with a particular hypothesis? –Crosses between known genotypes –No behavioural preference Data expected to follow binomial distribution

Binomial test
Behavioural experiment
–n1, n2 are numbers making choices 1 & 2
Null hypothesis: no preference
–p = q = 0.5
Test
–Calculate probability of observing ≥ n1 by chance
–Binomial expansion (p + q)^n

Example
Binomial expansion
5 animals choose A, 1 chooses B.
One-tailed test: P(5 or more) = P(5) + P(6) = 0.094 + 0.016 = 0.110
Two-tailed test: P(≥ 5 or ≤ 1) = P(5) + P(6) + P(1) + P(0) = 0.094 + 0.016 + 0.094 + 0.016 = 0.220
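The exact probabilities can be reproduced with a few lines, assuming six animals in total with five choosing A, as the P(5) + P(6) terms imply. The helper name `binom_pmf` is invented here.

```python
from math import comb

def binom_pmf(k, n, p=0.5):
    """Probability of exactly k 'successes' in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n = 6  # assumed total number of animals (5 chose A, 1 chose B)

one_tailed = sum(binom_pmf(k, n) for k in (5, 6))               # 7/64
two_tailed = one_tailed + sum(binom_pmf(k, n) for k in (0, 1))  # 14/64
```

Unrounded, these are 7/64 ≈ 0.109 and 14/64 ≈ 0.219. The 0.220 quoted on the slide is what results from rounding each term to three decimals before summing.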

Two groups
Binomial test appropriate for small sample sizes
–Provides exact probabilities
Alternative procedures for larger samples
–Goodness-of-fit tests: χ², log-likelihood ratio tests

χ² goodness-of-fit test
Data in k groups
o_i is observed number in group i
e_i is expected number in group i
χ² = Σ (o_i – e_i)² / e_i
Assess using χ² with k-1 df
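A sketch of the goodness-of-fit calculation for two groups. The counts and the 3:1 expected ratio are hypothetical, picked to resemble a cross with dominance, and are not from the slides.

```python
import math

def chi2_gof(observed, expected):
    """Pearson chi-square statistic: sum of (o - e)^2 / e over groups."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [70, 30]                              # hypothetical counts
total = sum(observed)
expected = [total * 3 / 4, total * 1 / 4]        # 3:1 ratio under H0

chi2 = chi2_gof(observed, expected)
df = len(observed) - 1                           # k - 1
p = math.erfc(math.sqrt(chi2 / 2.0))             # chi-square upper tail, 1 df only
```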

More groups
Outcome no longer binomial, but multinomial: (p1 + p2 + … + p_i + … + p_k)^n
Computationally difficult

Observations
                   Factor B
                   1       2       Total
Factor A   1       n11     n12     n1+
           2       n21     n22     n2+
           Total   n+1     n+2     n

                   Factor B
                   1       2       Total
Factor A   1       π11     π12     π1+
           2       π21     π22     π2+
           Total   π+1     π+2
π_i+ = n_i+/n
If A & B independent: π_ij = π_i+ × π_+j
Remember: P(A ∩ B) = P(A) × P(B)

Calculate expected frequencies & test goodness-of-fit
Expected cell frequencies: f_ij = n_i+ × n_+j / n
Test: χ² = Σ (n_ij – f_ij)² / f_ij, summed over all cells
Assess against χ² with df = (I-1)(J-1)

Worked example: two-way tables
Regeneration and seed dispersal mechanisms of plants
French & Westoby (1996) cross-classified plant species following fire by two variables:
–whether they regenerated by seed only or vegetatively
–whether they were ant or vertebrate dispersed
H0: dispersal mechanism is independent of mode of regeneration.
              Seed   Vegetative
Ant             25           36
Vertebrate       6           21

Observations
                   Factor B
                   Seed   Veg   Total
Factor A   Ant       25    36      61
           Vert       6    21      27
           Total     31    57      88

Expected values
(25 – 21.5)² / 21.5 = 0.57
χ² = 2.89, df = 1, P = 0.089
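The expected values and the χ² statistic can be recomputed from the observed counts in the worked example (ant: 25 seed, 36 vegetative; vertebrate: 6 seed, 21 vegetative). This is a sketch, and the variable names are chosen here.

```python
import math

observed = [[25, 36],   # ant-dispersed: seed only, vegetative
            [6, 21]]    # vertebrate-dispersed: seed only, vegetative

row_tot = [sum(row) for row in observed]
col_tot = [sum(col) for col in zip(*observed)]
n = sum(row_tot)

# Expected frequency under independence: row total * column total / n
expected = [[r * c / n for c in col_tot] for r in row_tot]

chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(observed, expected)
           for o, e in zip(obs_row, exp_row))
df = (len(row_tot) - 1) * (len(col_tot) - 1)
p = math.erfc(math.sqrt(chi2 / 2.0))   # chi-square upper tail, valid for df = 1 only
```

The expected count for ant-dispersed, seed-only species comes out at about 21.5, matching the term shown on the slide.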

Odds Ratio approach
Used for 2 x 2 tables
Calculate odds for each level of one factor
–e.g., for seed only plants, odds of being ant dispersed; repeat for vegetative plants
–π_i/(1 – π_i)
Calculate odds ratio (θ)
–Log(θ)
Calculate se: se of log(θ) = √(1/n11 + 1/n12 + 1/n21 + 1/n22)
Assess using log(θ)/se

Example: odds ratio test
Odds ratio θ = 4.17 / 1.71 = 2.43
Log(θ) = 0.89
se of log(θ) = 0.53
95% CI = 0.86 to 6.89
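These figures can be reproduced from the four cell counts of the dispersal table. The sketch assumes the usual large-sample standard error for the log odds ratio and a normal-based 95% interval. Whether the slides used exactly this interval is an assumption.

```python
import math

a, b = 25, 36   # ant-dispersed: seed only, vegetative
c, d = 6, 21    # vertebrate-dispersed: seed only, vegetative

odds_seed = a / c     # odds of ant dispersal among seed-only species
odds_veg = b / d      # odds of ant dispersal among vegetative species
theta = odds_seed / odds_veg
log_theta = math.log(theta)

se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # se of log(theta)
z = log_theta / se                               # compare with z distribution

lower = math.exp(log_theta - 1.96 * se)
upper = math.exp(log_theta + 1.96 * se)
```

The interval spans 1, in line with the non-significant χ² result for the same table.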

Small Sample Sizes
Aim for expected values <5 in no more than 20% of cells
–Pool categories to raise expected numbers
Yates' correction
–Adjustment for continuity
–Not widely recommended now
Fisher's exact test
–For 2 x 2 tables
Other exact methods
–Randomisation tests

Interpreting patterns: Residuals
Raw residual: n_ij – f_ij
–Calculate for each cell
–Sample size dependent
Standardized residual: (n_ij – f_ij) / √f_ij
Freeman-Tukey deviate: √n_ij + √(n_ij + 1) – √(4f_ij + 1)
–Compare to critical value

Example: Dead trees on floodplains
Surveys of dead coolibah trees
Transects with 3 positions: top (dunes), middle, and bottom (lakeshore)
Dead coolibah trees
          With   Without
Bottom      15        13
Middle       4         8
Top          0        17
χ² = 13.66, df = 2, P <
Reject H0
–Incidence of dead trees depends on floodplain position

Example: Dead trees on floodplains
[Table: standardized residuals for dead coolibah trees (with, without) at bottom, middle and top positions]
More dead than expected near bottom, fewer than expected on dunes
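Residuals for the coolibah table can be recomputed from the observed counts. The sketch assumes the simple (Pearson) standardized residual, (observed - expected)/√expected, and the usual form of the Freeman-Tukey deviate. Whether the slide used these or adjusted residuals is not known.

```python
import math

# (with dead trees, without dead trees) at each floodplain position
observed = {"Bottom": (15, 13), "Middle": (4, 8), "Top": (0, 17)}

n = sum(sum(cells) for cells in observed.values())
col_tot = [sum(cells[j] for cells in observed.values()) for j in range(2)]

std_resid, ft_dev = {}, {}
for position, cells in observed.items():
    row_tot = sum(cells)
    expected = [row_tot * c / n for c in col_tot]
    # Standardized residual: (o - e) / sqrt(e)
    std_resid[position] = [(o - e) / math.sqrt(e) for o, e in zip(cells, expected)]
    # Freeman-Tukey deviate: sqrt(o) + sqrt(o + 1) - sqrt(4e + 1)
    ft_dev[position] = [math.sqrt(o) + math.sqrt(o + 1) - math.sqrt(4 * e + 1)
                        for o, e in zip(cells, expected)]
```

The signs agree with the interpretation above: a positive residual for dead trees at the bottom and a negative one at the top.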

An alternative: log-linear models GLM Expected cell frequencies modelled using –Log link function –Poisson error term Maximum likelihood estimation Fit assessed using log-likelihood

log f_ij = constant + λ_i^X + λ_j^Y + λ_ij^XY
f_ij is the expected frequency in cell ij
constant is the mean of the logs of all the expected frequencies
λ_i^X is the effect of category i of variable X
λ_j^Y is the effect of category j of variable Y
λ_ij^XY is the effect of any interaction between X and Y. The interaction measures deviations from independence of the two variables.

Saturated model: log f_ij = constant + λ_i^X + λ_j^Y + λ_ij^XY
–Fits data perfectly
Reduced model: log f_ij = constant + λ_i^X + λ_j^Y
–Independent action of factors X and Y
Difference in fit of the two models indicates the importance of the interaction between X and Y (H0: λ^XY = 0)

For coolibah example
Log-likelihood , df = , df = 2
G² = -2(LL model – LL saturated) = -2( – ( )) = 18.61, df = 2, P <
Reject H0
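G² for the independence model can also be obtained directly from the observed and expected frequencies as 2 Σ o ln(o/e), which equals the difference in -2 log-likelihood between the reduced and saturated models. The sketch below uses the coolibah counts and treats a cell with zero observations as contributing zero.

```python
import math

observed = [[15, 13],   # bottom: with, without dead trees
            [4, 8],     # middle
            [0, 17]]    # top

row_tot = [sum(row) for row in observed]
col_tot = [sum(col) for col in zip(*observed)]
n = sum(row_tot)
expected = [[r * c / n for c in col_tot] for r in row_tot]

pairs = [(o, e) for obs_row, exp_row in zip(observed, expected)
         for o, e in zip(obs_row, exp_row)]

# 0 * ln(0) is taken as 0, so empty cells drop out of the sum
g2 = 2.0 * sum(o * math.log(o / e) for o, e in pairs if o > 0)
chi2 = sum((o - e) ** 2 / e for o, e in pairs)

df = (len(row_tot) - 1) * (len(col_tot) - 1)
p_g2 = math.exp(-g2 / 2.0)   # chi-square upper tail, valid for df = 2 only
```

Both statistics have (3-1)(2-1) = 2 df for this 3 x 2 table.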

More complex designs 3-way tables 3 main effects 3 two-factor interactions 1 three-factor interaction Estimation of parameters not simple –Iterative procedures

Representative models
Full model: log f_ijk = constant + λ_i^X + λ_j^Y + λ_k^Z + λ_ij^XY + λ_ik^XZ + λ_jk^YZ + λ_ijk^XYZ
Loglinear model                          df
X + Y + Z                                IJK-I-J-K+2
X + Y + Z + XY                           (K-1)(IJ-1)
X + Y + Z + XZ                           (J-1)(IK-1)
X + Y + Z + YZ                           (I-1)(JK-1)
X + Y + Z + XZ + YZ                      K(I-1)(J-1)
X + Y + Z + XY + YZ                      J(I-1)(K-1)
X + Y + Z + XY + XZ                      I(J-1)(K-1)
X + Y + Z + XY + XZ + YZ                 (I-1)(J-1)(K-1)
Saturated model:
X + Y + Z + XY + XZ + YZ + XYZ           0
Models are hierarchical:
–Higher order term "forces" all simpler terms in
–Omission of two-way term forces omission of 3-way
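The df formulas are easy to evaluate for a particular table. The sketch below wraps them in a function (the name `loglinear_df` is invented) and applies them to a 2 x 2 x 3 table, the dimensions of a table classified by two two-level factors and one three-level factor.

```python
def loglinear_df(I, J, K):
    """Residual df for hierarchical log-linear models of an I x J x K table."""
    return {
        "X + Y + Z": I * J * K - I - J - K + 2,
        "X + Y + Z + XY": (K - 1) * (I * J - 1),
        "X + Y + Z + XZ": (J - 1) * (I * K - 1),
        "X + Y + Z + YZ": (I - 1) * (J * K - 1),
        "X + Y + Z + XZ + YZ": K * (I - 1) * (J - 1),
        "X + Y + Z + XY + YZ": J * (I - 1) * (K - 1),
        "X + Y + Z + XY + XZ": I * (J - 1) * (K - 1),
        "X + Y + Z + XY + XZ + YZ": (I - 1) * (J - 1) * (K - 1),
        "X + Y + Z + XY + XZ + YZ + XYZ": 0,   # saturated model
    }

df_table = loglinear_df(2, 2, 3)
```

Each added term lowers the residual df by the number of parameters it introduces, down to 0 for the saturated model.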

Comparison of models
Choosing best model
–Lowest value of G²
–Akaike Information Criterion (AIC)
  Adjusts for number of parameters in model
  AIC = G² – 2 df
Test of model
Tests of hypothesis
–Contrast fit of two models differing in the presence of the term in question

Worked example Wildebeeste carcasses (Sinclair & Arcese 1995) Carcasses classified according to –Sex –Cause of death (predation or not) –Health (state of bone marrow)

[Table: numbers of carcasses by cause of death (predation, non-predation), sex (female, male) and marrow type (SWF, OG, TG), with row and column totals]

Fit of models
Model                                                G²      df   P    AIC
1 death + sex + marrow                               42.76   7    <
2 death x sex                                        42.68   6    <
3 death x marrow
4 sex x marrow                                       37.98   5    <
5 death x sex + death x marrow
6 death x sex + sex x marrow                         37.89   4    <
7 death x marrow + sex x marrow
8 death x sex + death x marrow + sex x marrow
9 Saturated (full) model                             0       0

Tests of hypotheses
1 death + sex + marrow
2 death x sex
3 death x marrow
4 sex x marrow
5 death x sex + death x marrow
6 death x sex + sex x marrow
7 death x marrow + sex x marrow
8 death x sex + death x marrow + sex x marrow
9 Saturated (full) model: G² = 0, df = 0
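The AIC comparison and the term-by-term tests can be sketched from the fit values that survive in the table. This reads entries such as "42.767" as G² = 42.76 with 7 df, which is an assumption based on the df formulas for a 2 x 2 x 3 table. Models whose values are missing are left out rather than guessed.

```python
import math

# (G2, df) for the models whose values are legible; readings are assumptions
fits = {
    "death + sex + marrow": (42.76, 7),
    "+ death x sex": (42.68, 6),
    "+ sex x marrow": (37.98, 5),
    "+ death x sex + sex x marrow": (37.89, 4),
}

# AIC as defined in these slides: G2 - 2 df
aic = {model: g2 - 2 * df for model, (g2, df) in fits.items()}
best = min(aic, key=aic.get)

def term_test(reduced, full):
    """Change in G2 and df between two nested models, with its P value."""
    d_g2 = fits[reduced][0] - fits[full][0]
    d_df = fits[reduced][1] - fits[full][1]
    if d_df == 1:
        p = math.erfc(math.sqrt(d_g2 / 2.0))   # chi-square tail, 1 df
    elif d_df == 2:
        p = math.exp(-d_g2 / 2.0)              # chi-square tail, 2 df
    else:
        raise ValueError("only 1 or 2 df handled in this sketch")
    return d_g2, d_df, p

death_sex = term_test("death + sex + marrow", "+ death x sex")
sex_marrow = term_test("death + sex + marrow", "+ sex x marrow")
```

On these readings the death x sex term changes G² by only 0.08 on 1 df, while sex x marrow changes it by 4.78 on 2 df.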