Generalized Linear Models: Logistic Regression and Log-Linear Models © G. Quinn & M. Keough, 2004
Generalized linear model Fit model using maximum likelihood Three components –Random component: response variable & its probability distribution, from the exponential family (normal, gamma, binomial, etc.) –Systematic component: predictor variable(s) –Continuous or categorical –Combinations, polynomial functions –Link function
Link function Links expected value of Y ($\mu$) to predictors: $g(\mu) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots$
Three common link functions Identity link –$g(\mu) = \mu$ –Models mean or expected value of Y –Standard linear models Log link –$g(\mu) = \log(\mu)$ –Used for count data, which cannot be negative Logit link –$g(\mu) = \log(\mu/(1-\mu))$ –Used for binary data and logistic regression
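The three links can be written as plain functions. This is a minimal sketch using only the standard library; the function names and the example values (a mean count of 12, a probability of 0.8) are illustrative assumptions, not taken from the slides.

```python
import math

def identity_link(mu):
    """Identity link: the linear predictor models the mean directly."""
    return mu

def log_link(mu):
    """Log link: suits means that must stay positive, such as counts."""
    return math.log(mu)

def logit_link(mu):
    """Logit link: maps a probability in (0, 1) onto the whole real line."""
    return math.log(mu / (1.0 - mu))

def inverse_logit(eta):
    """Back-transform a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

mean_count = 12.0    # hypothetical expected count
probability = 0.8    # hypothetical P(Y = 1)

print(identity_link(mean_count))
print(log_link(mean_count))
print(logit_link(probability))                 # log(0.8 / 0.2) = log(4)
print(inverse_logit(logit_link(probability)))  # recovers the probability
```

The inverse of the link is what turns a fitted linear predictor back into a mean on the original scale.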
Logistic regression Modelling response variables that are discrete –Often binary Present/Absent Alive/Dead Response/No Response Predictors categorical or continuous –Equivalent to simple linear regression, ANOVA, multiple regression, ANCOVA
Simple logistic regression Single, binary, response variable Single, continuous predictor Model $\pi(x)$ –$P(Y = 1)$ for a given X Fit logistic regression model –Sigmoidal –Greatest change in values in mid-range of X –Response variable has binomial distribution OLS not appropriate; ML required
Example: Lizards on islands Polis et al. (1998), examination of factors controlling spider populations on islands in the Gulf of California. Hypothesis: Presence of predator lizards (Uta) a key influence. What aspects of an island influence the presence of Uta? –P/A ratio (perimeter to area ratio of the island)
[Figure: Uta presence/absence plotted against P/A ratio]
The logistic model $\pi(x) = \dfrac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$ $\beta_0$ and $\beta_1$ are parameters to be estimated Intercept and slope
A simpler model Calculate odds –$P(\text{event})/(1 - P(\text{event}))$ –$P(y_i = 1)/P(y_i = 0)$
Log odds Link function (logit), g(x) $g(x) = \beta_0 + \beta_1 x_1$ Linear model $g(x) = \beta_0 + \beta_1 (\text{P/A ratio})$
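The step from the logistic model to the linear log-odds model can be written out as a short derivation. This is a sketch in standard notation, with $\pi(x)$ for $P(Y = 1)$ at a given x and $\beta_0$, $\beta_1$ for the intercept and slope.

```latex
\begin{align*}
\pi(x) &= \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}},
\qquad
1 - \pi(x) = \frac{1}{1 + e^{\beta_0 + \beta_1 x}} \\
\text{odds} &= \frac{\pi(x)}{1 - \pi(x)} = e^{\beta_0 + \beta_1 x} \\
g(x) &= \log\!\left(\frac{\pi(x)}{1 - \pi(x)}\right) = \beta_0 + \beta_1 x
\end{align*}
```

Taking logs of the odds removes the exponential, which is why the logit link gives a model that is linear in the predictor.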
Interpretation $\beta_1$ is the rate of change in the log(odds) for a unit change in X More often expressed as the Odds Ratio –Change in odds for unit change in X –$e^{\beta_1}$ $\beta_0$ is the intercept –not often of biological interest
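A quick numerical check of the odds-ratio interpretation: increasing X by one unit multiplies the odds by the exponential of the slope. The coefficient values below are hypothetical, chosen only for illustration, and are not the estimates from the lizard example.

```python
import math

def logistic_probability(x, beta0, beta1):
    """P(Y = 1) at predictor value x under the logistic model."""
    eta = beta0 + beta1 * x
    return math.exp(eta) / (1.0 + math.exp(eta))

def odds(probability):
    """Odds of the event: P(event) / (1 - P(event))."""
    return probability / (1.0 - probability)

beta0, beta1 = 3.0, -0.2   # hypothetical intercept and slope
x = 10.0                   # hypothetical predictor value

odds_at_x = odds(logistic_probability(x, beta0, beta1))
odds_at_x_plus_1 = odds(logistic_probability(x + 1.0, beta0, beta1))

# The ratio of the two odds equals exp(beta1), whatever x is.
odds_ratio = odds_at_x_plus_1 / odds_at_x
print(round(odds_ratio, 4), round(math.exp(beta1), 4))
```

With a negative slope the odds ratio is below 1, so each unit increase in X lowers the odds of the event.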
Null hypotheses Most often $\beta_1 = 0$ Wald test –ML equivalent of t test –Parameter estimate/standard error –$b/s_b$ –Normal for large sample sizes –Test using z distribution
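The Wald statistic is simple to compute by hand. The sketch below uses a hypothetical estimate and standard error (not values from the slides) and gets the two-tailed P value from the standard normal distribution via the complementary error function.

```python
import math

def wald_test(estimate, standard_error):
    """Return the Wald z statistic and its two-tailed normal P value."""
    z = estimate / standard_error
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_two_tailed = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_two_tailed

# Hypothetical slope estimate and standard error.
z, p_value = wald_test(estimate=-0.2, standard_error=0.1)
print(z, round(p_value, 4))
```

Because the normal approximation relies on large samples, the same estimate can give a different conclusion from the likelihood ratio test when n is small.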
Tests (continued) Compare fit of full and reduced model Full model: $g(x) = \beta_0 + \beta_1 x_1$ Reduced model: $g(x) = \beta_0$ Difference in fit reflects effect of $\beta_1$ Assess fit using Likelihood Ratio statistic ($\Lambda$)
[Figure: Log(L) plotted against possible $\beta_0$ values; the ML estimator is the $\beta_0$ value at which Log(L) is greatest]
[Figure: Log(L) plotted against $\beta_0$ and $\beta_1$ values; Log(L) is greatest at the best parameter estimates]
Tests (continued) $\Lambda$ is the ratio of the likelihood of the reduced model to that of the full model If $\Lambda$ is near 1, $\beta_1$ contributes little If $\Lambda$ is well below 1, $\beta_1$ has an effect $G^2 = -2\ln(\Lambda)$ Log-likelihood $\chi^2$ or G statistic $G^2 = -2(\text{log-likelihood}_{\text{reduced}} - \text{log-likelihood}_{\text{full}})$ $G^2$ follows $\chi^2$ with 1 df
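The likelihood ratio test for a single parameter can be sketched in a few lines. Both log-likelihoods below are hypothetical. The P value uses the fact that a chi-square variable with 1 df is the square of a standard normal variable.

```python
import math

def likelihood_ratio_test_1df(ll_reduced, ll_full):
    """Return (Lambda, G2, P) for two nested models differing by one parameter."""
    likelihood_ratio = math.exp(ll_reduced - ll_full)   # Lambda, between 0 and 1
    g2 = -2.0 * (ll_reduced - ll_full)                  # equals -2 ln(Lambda)
    # Upper tail of chi-square with 1 df: P(Z^2 > g2) = erfc(sqrt(g2 / 2)).
    p_value = math.erfc(math.sqrt(g2 / 2.0))
    return likelihood_ratio, g2, p_value

# Hypothetical log-likelihoods for the reduced and full models.
lam, g2, p_value = likelihood_ratio_test_1df(ll_reduced=-13.0, ll_full=-7.0)
print(lam, g2, p_value)
```

A reduced model that fits almost as well as the full one gives a ratio near 1 and a G² near 0.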
Test statistics Test with either Wald or $G^2$ Unlike regression, Wald ≠ $G^2$ Use $G^2$ for small sample sizes $G^2$ also called deviance when a specific model is compared to a saturated model (which fits data perfectly) –Change in deviance used to compare different models –Analogous to SS Residual
Worked example Maximum Likelihood estimation, so procedure is iterative: estimate parameters, calculate log-likelihood; refine parameter estimates, recalculate log-likelihood; continue until convergence. SYSTAT output: Category choices 0 (REFERENCE) 9 1 (RESPONSE) 10 Total: 19 L-L at iteration 1 is L-L at iteration 2 is L-L at iteration 3 is L-L at iteration 4 is L-L at iteration 5 is L-L at iteration 6 is L-L at iteration 7 is Log Likelihood:
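The iterative refinement can be illustrated with a small Newton-Raphson fit. This is a sketch under stated assumptions: the ten observations are hypothetical (the island data are not reproduced in these slides), and Newton-Raphson is one common way to maximise the likelihood, not necessarily the algorithm SYSTAT uses.

```python
import math

def log_likelihood(xs, ys, b0, b1):
    """Binomial log-likelihood of a simple logistic regression."""
    total = 0.0
    for x, y in zip(xs, ys):
        eta = b0 + b1 * x
        total += y * eta - math.log1p(math.exp(eta))
    return total

def fit_logistic(xs, ys, max_iter=25, tol=1e-8):
    """Newton-Raphson estimates of intercept and slope, plus the log-likelihood path."""
    b0, b1 = 0.0, 0.0
    history = [log_likelihood(xs, ys, b0, b1)]
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p              # score for the intercept
            g1 += (y - p) * x        # score for the slope
            h00 += w                 # information matrix entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve (information matrix) * delta = score.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
        history.append(log_likelihood(xs, ys, b0, b1))
        if abs(history[-1] - history[-2]) < tol:
            break
    return b0, b1, history

# Hypothetical data: presence (1) becomes less likely as x increases.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
ys = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]

b0_hat, b1_hat, history = fit_logistic(xs, ys)
for iteration, ll in enumerate(history):
    print("L-L at iteration", iteration, "is", round(ll, 4))
print("estimates:", round(b0_hat, 4), round(b1_hat, 4))
```

The printed log-likelihoods play the same role as the "L-L at iteration" lines in the output: the fit stops once successive values differ by less than the tolerance.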
Output (cont) Parameter Estimate S.E. t-ratio p-value 1 CONSTANT 2 PARATIO 95% bounds Parameter Odds Ratio Upper Lower 2 PARATIO Log Likelihood of constants only model = LL(0) = 2*[LL(N)-LL(0)] = with 1 df Chi-sq p-value = $\Lambda$ = e^( – (-7.110))
[Figure: predicted probability of occurrence, and Uta presence/absence, each plotted against P/A ratio]
A special case: toxicity testing Logistic regression used to estimate relationship between concentration of substance and response variable Equation used to solve for concentration that produces a given level of response –LC50 –EC50
Worked example: toxicity testing Effect of copper on larvae of a marine invertebrate, Bugula dentata Methods –Larvae exposed to copper at range of concentrations Range of [Cu] 0 – 400 µg/L –Recorded as swimming or not after 6 hours –Recorded as live or dead after 24 h
Parameter estimates: –Intercept 3.07 ± 0.64 –Slope ± 0.31 (t = -4.56, P<0.001) LC50 –50% swimming –odds = 1, log(odds) = 0 –Solve for y = 0 Log[Cu] = 2.16 [Cu] = 145 µg/L
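Solving the fitted equation for a target response can be sketched as follows. The intercept (3.07) and log LC50 (2.16) are the values on the slide. The slope estimate is not legible in this copy, so the code back-calculates it from those two numbers, which is an assumption. It also assumes the logs are base 10, which is consistent with 2.16 corresponding to about 145.

```python
import math

intercept = 3.07      # from the slide
log10_lc50 = 2.16     # from the slide: log[Cu] at which log(odds) = 0

# Assumption: slope implied by 0 = intercept + slope * log10_lc50.
implied_slope = -intercept / log10_lc50

def log_concentration_at(response_probability, b0, b1):
    """Solve logit(p) = b0 + b1 * log[Cu] for log[Cu]."""
    logit = math.log(response_probability / (1.0 - response_probability))
    return (logit - b0) / b1

log_c50 = log_concentration_at(0.5, intercept, implied_slope)
c50 = 10.0 ** log_c50   # back-transform from log10 units

print(round(implied_slope, 2), round(log_c50, 2), round(c50, 1))
```

The same function gives other effect levels by changing the response probability passed to it.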
[Figure: % swimming plotted against log [Cu]]
Extension to multiple regression Analogous to least squares multiple regression Generates partial regression coefficients Test overall regression by comparing fit of –Full model –Reduced model (constant, $\beta_0$, only) Wald tests as equivalent of t tests Use likelihood ratio statistics (deviance) Assumptions
[Figure: rodent presence/absence plotted against age, distance, and % shrub]
Logistic ANCOVA Analogous to ANCOVA Test for heterogeneity of slopes –Fit models with and without interaction present –Compare fit of two models Run reduced model with covariate and categorical variable Test effects of each
Worked example Marshall et al. (2003) Ecology Effects of larval size on juvenile survivorship in a bryozoan, Bugula neritina –Larval size measured, then juveniles transplanted to field and survival (and growth) monitored –Experiment repeated several times Response variable: colony survival Predictor variables: –Larval Size –Experimental Run
Logistic ANCOVA (cont) Fit model $g(x) = \beta_0 + \beta_1 x_1 + \alpha_i + \beta_{i1} x_1$ $\beta_1$ = overall effect of size $\alpha_i$ = effect of Run i $\beta_{i1}$ = effect of size in Run i –LL = , df = 7 Fit model $g(x) = \beta_0 + \beta_1 x_1 + \alpha_i$ –LL = , df = 4 Effect of interaction term –$G^2$ = -2 ( – ( )) = 5.65, df = 3, P = –Conclude slopes not significantly heterogeneous
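The interaction test compares G² = 5.65 with a chi-square distribution on 7 − 4 = 3 df. The sketch below computes that upper-tail probability from the closed form for 3 df, using only the standard library. It is a calculation from the numbers on the slide, not a value quoted from it.

```python
import math

def chi_square_sf_3df(x):
    """Upper-tail probability of a chi-square variable with 3 df."""
    return (math.erfc(math.sqrt(x / 2.0))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))

g2_interaction = 5.65     # from the slide
df_interaction = 7 - 4    # difference in df between the two models

p_value = chi_square_sf_3df(g2_interaction)
print(df_interaction, round(p_value, 3))
```

A P value above 0.05 agrees with the slide's conclusion that the slopes are not significantly heterogeneous.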
Effect of Larval Size P = Effect of Run P = 0.229
Important assumptions Correct probability distribution for response variable Collinearity –Inflates standard errors of parameter estimates –Interpretations unreliable –Few diagnostics available Correlation matrices for predictor variables Examine tolerance by running as OLS linear regression Residuals –Not useful for individual observations –Aggregate approaches Deciles of risk Influence
Contingency tables and log-linear models
Introduction Each observation classified into 2 groups –Phenotypes for trait controlled by single-locus with dominance –Behavioural choice between two alternatives Is the distribution between these groups consistent with a particular hypothesis? –Crosses between known genotypes –No behavioural preference Data expected to follow binomial distribution
Binomial test Behavioural experiment –$n_1$, $n_2$ are numbers making choices 1 & 2 Null hypothesis: no preference –p = q = 0.5 Test –Calculate probability of observing ≥ $n_1$ by chance –Binomial expansion $(p + q)^n$
Example Binomial Expansion 5 animals choose A, 1 chooses B. One-tailed test: P(5 or more) = P(5) + P(6) = 0.094 + 0.016 = 0.110 Two-tailed test: P(≥ 5 or ≤ 1) = P(5) + P(6) + P(1) + P(0) = 0.094 + 0.016 + 0.094 + 0.016 = 0.220
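The exact probabilities for this example (6 animals, 5 choosing A and 1 choosing B, with p = q = 0.5) can be checked directly from the binomial expansion. This is a minimal sketch using only the standard library.

```python
from math import comb

def binomial_pmf(k, n, p=0.5):
    """Probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1.0 - p)**(n - k)

n = 6  # animals tested

# One-tailed: 5 or more animals choose A.
one_tailed = sum(binomial_pmf(k, n) for k in (5, 6))

# Two-tailed: add the equally extreme outcomes in the other direction.
two_tailed = one_tailed + sum(binomial_pmf(k, n) for k in (0, 1))

print(one_tailed, two_tailed)   # 7/64 and 14/64
```

The exact two-tailed value is 14/64 = 0.21875. The 0.220 on the slide is what results from rounding each term to three decimals before summing.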
Two groups Binomial test appropriate for small sample sizes –Provides exact probabilities Alternative procedures for larger samples –Goodness-of-fit tests: $\chi^2$, log-likelihood ratio tests
$\chi^2$ Goodness of fit test Data in k groups $o_i$ is observed number in group i $e_i$ is expected number in group i $\chi^2 = \sum_{i=1}^{k} (o_i - e_i)^2 / e_i$ Assess using $\chi^2$ with k-1 df
More groups Outcome no longer binomial, but multinomial: $(p_1 + p_2 + \dots + p_i + \dots + p_k)^n$ Computationally difficult
Observations
                  Factor B
                  1       2
Factor A    1     n11     n12     n1+
            2     n21     n22     n2+
                  n+1     n+2     n
                  Factor B
                  1       2
Factor A    1     π11     π12     π1+
            2     π21     π22     π2+
                  π+1     π+2
$\pi_{i+} = n_{i+}/n$ If A & B independent: $\pi_{ij} = \pi_{i+}\,\pi_{+j}$ Remember: $P(A \cap B) = P(A)\,P(B)$
Calculate expected frequencies & test goodness-of-fit Expected cell frequencies: $f_{ij} = n_{i+}\,n_{+j}/n$ Test: $\chi^2 = \sum_i \sum_j (n_{ij} - f_{ij})^2 / f_{ij}$ Assess against $\chi^2$ with df = (I-1)(J-1)
Worked example: two-way tables Regeneration and seed dispersal mechanisms of plants French & Westoby (1996) cross-classified plant species following fire by two variables: –whether they regenerated by seed only or vegetatively –whether they were ant or vertebrate dispersed. $H_0$: dispersal mechanism is independent of mode of regeneration.
              Seed    Vegetative
Ant           25      36
Vertebrate     6      21
Observations
                     Factor B
                     Seed    Veg
Factor A    Ant      25      36      61
            Vert      6      21      27
                     31      57      88
Expected values $(25 - 21.5)^2 / 21.5 = 0.57$ $\chi^2$ = 2.89, df = 1, P = 0.089
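The expected frequencies and the test statistic can be reproduced from the French & Westoby counts (ant: 25 seed, 36 vegetative; vertebrate: 6 seed, 21 vegetative). This sketch uses only the standard library. The P value line uses a shortcut that is valid only for 1 df.

```python
import math

observed = [[25, 36],   # ant dispersed: seed only, vegetative
            [6, 21]]    # vertebrate dispersed: seed only, vegetative

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequency under independence: row total * column total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi_square = sum((o - e) ** 2 / e
                 for obs_row, exp_row in zip(observed, expected)
                 for o, e in zip(obs_row, exp_row))
df = (len(row_totals) - 1) * (len(col_totals) - 1)

# Upper tail of chi-square with 1 df (this shortcut holds for df = 1 only).
p_value = math.erfc(math.sqrt(chi_square / 2.0))

print(round(expected[0][0], 1), round(chi_square, 2), df, round(p_value, 3))
```

The first expected value (21.5) and the first cell's contribution, (25 − 21.5)²/21.5 = 0.57, match the figures on the slide.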
Odds Ratio approach Used for 2 x 2 tables Calculate odds for each level of one factor –e.g., for seed only plants, odds of being ant dispersed, repeat for vegetative plants –$\pi_i/(1-\pi_i)$ Calculate Odds Ratio ($\theta$) –Log ($\theta$) Calculate se: $\sqrt{1/n_{11} + 1/n_{12} + 1/n_{21} + 1/n_{22}}$ Assess using Log ($\theta$)/se
Example: odds ratio test Odds ratio $\theta$ = 4.17 / 1.71 = 2.43 Log ($\theta$) = 0.89 se = 0.53 95% CI = 0.86 to 6.89
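The odds ratio calculation can be reproduced from the same 2 x 2 counts. This sketch assumes the usual large-sample standard error of the log odds ratio (the square root of the summed reciprocal cell counts) and a normal-based 95% interval. Variable names are illustrative.

```python
import math

# Counts: ant/seed, ant/vegetative, vertebrate/seed, vertebrate/vegetative.
n_ant_seed, n_ant_veg = 25, 36
n_vert_seed, n_vert_veg = 6, 21

# Odds of being ant dispersed, within each regeneration mode.
odds_seed = n_ant_seed / n_vert_seed   # 25 / 6
odds_veg = n_ant_veg / n_vert_veg      # 36 / 21

odds_ratio = odds_seed / odds_veg
log_odds_ratio = math.log(odds_ratio)

# Assumed large-sample standard error of the log odds ratio.
se = math.sqrt(1 / n_ant_seed + 1 / n_ant_veg + 1 / n_vert_seed + 1 / n_vert_veg)

z = log_odds_ratio / se
ci_lower = math.exp(log_odds_ratio - 1.96 * se)
ci_upper = math.exp(log_odds_ratio + 1.96 * se)

print(round(odds_ratio, 2), round(log_odds_ratio, 2), round(se, 2))
print(round(z, 2), round(ci_lower, 2), round(ci_upper, 2))
```

The interval includes 1, which agrees with the non-significant chi-square test on the same table.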
Small Sample Sizes Aim for expected values <5 in no more than 20% of cells –Pool categories to raise expected numbers Yates' correction –Adjustment for continuity –Not widely recommended now Fisher's exact test –For 2 x 2 tables Other exact methods –Randomisation tests
Interpreting patterns: Residuals Raw Residual $n_{ij} - f_{ij}$ Calculate for each cell Sample size dependent Standardized residual $(n_{ij} - f_{ij})/\sqrt{f_{ij}}$ Freeman-Tukey deviate $\sqrt{n_{ij}} + \sqrt{n_{ij} + 1} - \sqrt{4 f_{ij} + 1}$ Compare to
Example: Dead trees on floodplains Surveys of dead coolibah trees Transects with 3 positions: Top (dunes), middle, and bottom (lakeshore) $\chi^2$ = 13.66, df = 2, P < Reject $H_0$ –Incidence of dead trees depends on floodplain position
Dead coolibah trees
            With    Without
Bottom      15      13
Middle       4       8
Top          0      17
Example: Dead trees on floodplains Standardized residuals [Table: standardized residuals for dead coolibah trees (with, without) at each position (bottom, middle, top)] More dead than expected near bottom, fewer than expected on dunes
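The residuals can be recomputed from the counts in the coolibah table (bottom 15/13, middle 4/8, top 0/17). This sketch assumes the common definition of a standardized residual, (observed − expected)/√expected. The values are calculated here rather than copied from the slide.

```python
import math

positions = ["Bottom", "Middle", "Top"]
observed = [[15, 13],   # with dead trees, without dead trees
            [4, 8],
            [0, 17]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequencies under independence of position and tree death.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Assumed definition: (observed - expected) / sqrt(expected).
standardized = [[(o - e) / math.sqrt(e) for o, e in zip(obs_row, exp_row)]
                for obs_row, exp_row in zip(observed, expected)]

for name, row in zip(positions, standardized):
    print(name, [round(value, 2) for value in row])
```

The positive residual for dead trees at the bottom and the negative one at the top show the pattern the slide describes.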
An alternative: log-linear models GLM Expected cell frequencies modelled using –Log link function –Poisson error term Maximum likelihood estimation Fit assessed using log-likelihood
$\log f_{ij} = \text{constant} + \lambda_i^X + \lambda_j^Y + \lambda_{ij}^{XY}$ $f_{ij}$ is the expected frequency in cell ij, constant is the mean of the logs of all the expected frequencies $\lambda_i^X$ is the effect of category i of variable X $\lambda_j^Y$ is the effect of category j of variable Y $\lambda_{ij}^{XY}$ is the effect of any interaction between X and Y. The interaction measures deviations from independence of the two variables.
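Saturated model $\log f_{ij} = \text{constant} + \lambda_i^X + \lambda_j^Y + \lambda_{ij}^{XY}$ Fits data perfectly Reduced model $\log f_{ij} = \text{constant} + \lambda_i^X + \lambda_j^Y$ Independent action of factors X and Y Difference in fit of the two models indicates the importance of the interaction between X and Y ($H_0$: $\lambda^{XY} = 0$)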
For coolibah example Log-likelihood , df = , df = 2 $G^2$ = -2(LL reduced – LL saturated) = -2( – ( )) = 18.61, df = 2, P < Reject $H_0$
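The G² value can be checked from the cell counts, because the deviance of the independence model against the saturated model equals 2 Σ observed × ln(observed/expected). The sketch below uses the coolibah counts. It treats the empty cell as contributing zero and takes (3 − 1)(2 − 1) = 2 df for this 3 x 2 table.

```python
import math

observed = [[15, 13],   # bottom: with, without dead trees
            [4, 8],     # middle
            [0, 17]]    # top

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
expected = [[r * c / n for c in col_totals] for r in row_totals]

# G2 = 2 * sum(o * ln(o / e)); a cell with o = 0 contributes nothing.
g2 = 2.0 * sum(o * math.log(o / e)
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row)
               if o > 0)

df = (len(row_totals) - 1) * (len(col_totals) - 1)

# Upper tail of chi-square with 2 df is exp(-x / 2).
p_value = math.exp(-g2 / 2.0)

print(round(g2, 2), df, p_value)
```

The result agrees with the 18.61 on the slide, and the P value is well below 0.001.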
More complex designs 3-way tables 3 main effects 3 two-factor interactions 1 three-factor interaction Estimation of parameters not simple –Iterative procedures
Representative models
Full model: $\log f_{ijk} = \text{constant} + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ij}^{XY} + \lambda_{ik}^{XZ} + \lambda_{jk}^{YZ} + \lambda_{ijk}^{XYZ}$
Loglinear model                                       df
X + Y + Z                                             IJK-I-J-K+2
X + Y + Z + XY                                        (K-1)(IJ-1)
X + Y + Z + XZ                                        (J-1)(IK-1)
X + Y + Z + YZ                                        (I-1)(JK-1)
X + Y + Z + XZ + YZ                                   K(I-1)(J-1)
X + Y + Z + XY + YZ                                   J(I-1)(K-1)
X + Y + Z + XY + XZ                                   I(J-1)(K-1)
X + Y + Z + XY + XZ + YZ                              (I-1)(J-1)(K-1)
Saturated model: X + Y + Z + XY + XZ + YZ + XYZ       0
Models are hierarchical: Higher order term “forces” all simpler terms in Omission of two-way term forces omission of 3-way
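The df column follows one rule: the number of cells minus the number of independent parameters, where each term contributes the product of (levels − 1) over its factors. The sketch below applies that rule to a 2 x 2 x 3 table. The factor names and level counts are taken from the wildebeeste example and the function name is an illustrative assumption.

```python
from math import prod

def loglinear_df(levels, terms):
    """Residual df of a hierarchical log-linear model.

    levels maps factor name -> number of categories.
    terms lists every term in the model as a tuple of factor names.
    """
    n_cells = prod(levels.values())
    n_parameters = 1  # the constant
    for term in terms:
        n_parameters += prod(levels[factor] - 1 for factor in term)
    return n_cells - n_parameters

levels = {"death": 2, "sex": 2, "marrow": 3}
main_effects = [("death",), ("sex",), ("marrow",)]

print(loglinear_df(levels, main_effects))                         # IJK-I-J-K+2
print(loglinear_df(levels, main_effects + [("death", "sex")]))    # (K-1)(IJ-1)
print(loglinear_df(levels, main_effects + [("sex", "marrow")]))
saturated = main_effects + [("death", "sex"), ("death", "marrow"),
                            ("sex", "marrow"), ("death", "sex", "marrow")]
print(loglinear_df(levels, saturated))
```

Because the models are hierarchical, every interaction passed to the function should be accompanied by its lower-order terms, as in the table.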
Comparison of models Choosing best model –Lowest value of $G^2$ –Akaike Information Criterion (AIC) Adjusts for number of parameters in model AIC = $G^2$ – 2 df, where df is that for the test of the model Tests of hypothesis –Contrast fit of two models differing in the presence of the term in question
Worked example Wildebeeste carcasses (Sinclair & Arcese 1995) Carcasses classified according to –Sex –Cause of death (predation or not) –Health (state of bone marrow)
Worked example Wildebeeste carcasses (Sinclair & Arcese 1995) [Table: numbers of carcasses by cause of death (predation, non-predation) and sex (female, male) for each marrow type (SWF, OG, TG), with row and column totals]
Fit of models
Model                                               G2       df    P    AIC
1 death + sex + marrow                              42.76    7     <
2 death x sex                                       42.68    6     <
3 death x marrow
4 sex x marrow                                      37.98    5     <
5 death x sex + death x marrow
6 death x sex + sex x marrow                        37.89    4     <
7 death x marrow + sex x marrow
8 death x sex + death x marrow + sex x marrow
Saturated (full) model                              0        0
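The AIC calculation and a test of one term can be illustrated with the four models whose G² and df are legible above. This is a sketch, assuming the definition AIC = G² − 2 df. Since the other models' values are missing from this copy, the comparison covers only these four and says nothing about the best model overall.

```python
import math

# (G2, df) for the models with legible values in the table.
fits = {
    "death + sex + marrow": (42.76, 7),
    "death x sex": (42.68, 6),
    "sex x marrow": (37.98, 5),
    "death x sex + sex x marrow": (37.89, 4),
}

# Assumed definition: AIC = G2 - 2 * df.
aic = {model: g2 - 2 * df for model, (g2, df) in fits.items()}
for model, value in aic.items():
    print(model, round(value, 2))

# Test of the sex x marrow term: contrast models with and without it.
g2_without, df_without = fits["death + sex + marrow"]
g2_with, df_with = fits["sex x marrow"]
g2_difference = g2_without - g2_with
df_difference = df_without - df_with

# Upper tail of chi-square with 2 df is exp(-x / 2).
p_value = math.exp(-g2_difference / 2.0)
print(round(g2_difference, 2), df_difference, round(p_value, 3))
```

The same contrast applies to any pair of nested models in the table, with the chi-square df set to the difference in their df.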
Tests of hypotheses
1 death + sex + marrow
2 death x sex
3 death x marrow
4 sex x marrow
5 death x sex + death x marrow
6 death x sex + sex x marrow
7 death x marrow + sex x marrow
8 death x sex + death x marrow + sex x marrow
Saturated (full) model                              0        0