Modeling Correlated/Clustered Multinomial Data Justin Newcomer Department of Mathematics and Statistics University of Maryland, Baltimore County Probability.

Modeling Correlated/Clustered Multinomial Data Justin Newcomer Department of Mathematics and Statistics University of Maryland, Baltimore County Probability and Statistics Day, April 28, 2007 Joint Research with Professor Nagaraj K. Neerchal, UMBC and Jorge G. Morel, PhD, P&G Pharmaceuticals, Inc.

2 Motivation Example: Forest Pollen Count, Mosimann (1962). In the analysis of forest pollen, counts of the frequency of occurrence of different kinds of pollen grains are made at various levels of a sediment core. An attempt is then made to reconstruct the past vegetation changes in the area from which the core was taken.

3 Motivation Example: Forest Pollen Count, Mosimann (1962). Four arboreal types of fossil forest pollen (pine, fir, oak and alder) were counted in the Bellas Artes core from the Valley of Mexico. At various levels of the core, pollen was classified in clusters of 100 pollen grains. The data: [table not reproduced in this transcript]

4 Motivation The Multinomial Model. The probability function: $P(T = t) = \frac{m!}{t_1! \cdots t_k!} \prod_{i=1}^{k} \pi_i^{t_i}$, where $t_1 + \cdots + t_k = m$. Key assumptions: (1) each observation can be classified by exactly one of $k$ possible outcomes, with probabilities $\pi_1, \ldots, \pi_k$; (2) all observations are independent of each other. In our example, since each pollen count comes from a cluster of 100 pollen grains, the individual observations within a cluster can be expected to be correlated. The possible correlations are a violation of the multinomial model assumptions!
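
As a small illustration of the multinomial probability function, the sketch below evaluates $P(T = t)$ for one vector of counts. The function name `multinomial_pmf` and the example counts and probabilities are made up for illustration; they are not the pollen data.

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """P(T = counts) for a multinomial with cell probabilities probs.

    The cluster size m is taken to be sum(counts).
    """
    m = sum(counts)
    # Multinomial coefficient m! / (t_1! ... t_k!), kept as an exact integer.
    coef = factorial(m)
    for t in counts:
        coef //= factorial(t)
    return coef * prod(p ** t for p, t in zip(probs, counts))

# Hypothetical example: 4 grains classified into k = 4 pollen types.
example_prob = multinomial_pmf([2, 1, 1, 0], [0.4, 0.3, 0.2, 0.1])
print(example_prob)
```

This probability is only valid when the grains are classified independently, which is exactly the assumption that clustering puts in doubt.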

5 Motivation Problem Statement. How can we properly model these data and estimate the proportions of pollen grains? What are the effects of using the wrong model?

6 Overdispersion (Extra Variation) Overview. Data exhibit variances larger than that permitted by the multinomial model. This is usually caused by a lack of independence or by clustering of experimental units. “Overdispersion is not uncommon in practice. In fact, some would maintain that over-dispersion is the norm in practice and nominal dispersion the exception.” McCullagh and Nelder (1989)

7 Overdispersion (Extra Variation) Multinomial Overdispersion. Usually characterized by the first two moments: $E(T) = m\pi$ and $\mathrm{Var}(T) = m\{1 + \rho^2(m-1)\}\{\mathrm{Diag}(\pi) - \pi\pi^\top\}$. The quantity $\{1 + \rho^2(m-1)\}$ is known as the design effect (Kish, 1965). The parameter $\rho^2$ is known as the “intra class” or “intra cluster” correlation. We use $\rho^2$ to denote a positive intra cluster correlation, which corresponds to overdispersion.
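
To make the design effect concrete, the sketch below computes $1 + \rho^2(m-1)$ and the corresponding inflated covariance matrix $m\{1 + \rho^2(m-1)\}\{\mathrm{Diag}(\pi) - \pi\pi^\top\}$. The function names and the numerical values of $\pi$ and $\rho^2$ are assumptions chosen for illustration only.

```python
def design_effect(rho_sq, m):
    """Variance inflation factor 1 + rho^2 (m - 1) for clusters of size m."""
    return 1.0 + rho_sq * (m - 1)

def overdispersed_cov(pi, rho_sq, m):
    """Covariance m * deff * (Diag(pi) - pi pi') as a nested list."""
    deff = design_effect(rho_sq, m)
    k = len(pi)
    return [
        [m * deff * ((pi[i] if i == j else 0.0) - pi[i] * pi[j]) for j in range(k)]
        for i in range(k)
    ]

# Hypothetical values: clusters of 100 grains, intra cluster correlation 0.05.
deff = design_effect(0.05, 100)
cov = overdispersed_cov([0.7, 0.15, 0.1, 0.05], 0.05, 100)
print(deff)       # roughly 5.95: variances are almost six times the multinomial ones
print(cov[0][0])  # variance of the first count under overdispersion
```

Even a small intra cluster correlation produces a large design effect when clusters are as large as 100.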

8 Parameter Estimation How can we properly model these data and estimate the proportions of pollen grains? Moment based: Quasi-Likelihood and Generalized Estimating Equations (easily implemented in SAS, Proc Genmod). Likelihood based: Finite Mixture Distribution and Dirichlet Multinomial Distribution (not currently in SAS; must write your own code).

9 Quasi-Likelihood Estimation Wedderburn (1974), Cox and Snell (1989). Here we assume that overdispersion occurs by inflation of variances by a constant factor. Estimate the systematic structure of the model via maximum likelihood procedures, then inflate the variance by a suitable constant.
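
One common way to carry out this inflation, sketched here under assumptions, is to estimate the proportions by pooling all clusters and to estimate the constant factor by the Pearson chi-square statistic divided by its degrees of freedom, $(n-1)(k-1)$ for $n$ clusters and $k$ categories. The slides do not say which estimator of the constant was used, so this choice, the function names, and the cluster counts below are all hypothetical.

```python
from math import sqrt

def pooled_proportions(clusters):
    """Naive multinomial estimates: pooled category totals over the grand total."""
    total = sum(sum(c) for c in clusters)
    k = len(clusters[0])
    return [sum(c[i] for c in clusters) / total for i in range(k)]

def pearson_dispersion(clusters):
    """Pearson chi-square divided by (n - 1)(k - 1); needs at least 2 clusters."""
    n, k = len(clusters), len(clusters[0])
    pi_hat = pooled_proportions(clusters)
    chi_sq = 0.0
    for c in clusters:
        m = sum(c)
        chi_sq += sum((c[i] - m * pi_hat[i]) ** 2 / (m * pi_hat[i]) for i in range(k))
    return chi_sq / ((n - 1) * (k - 1))

def naive_se(clusters):
    """Standard errors of the pooled proportions under the multinomial model."""
    total = sum(sum(c) for c in clusters)
    return [sqrt(p * (1 - p) / total) for p in pooled_proportions(clusters)]

def inflated_se(clusters):
    """Naive standard errors multiplied by the square root of the dispersion."""
    phi = pearson_dispersion(clusters)
    return [sqrt(phi) * se for se in naive_se(clusters)]

# Made-up counts (pine, fir, oak, alder) for four clusters of 100 grains.
clusters = [[90, 5, 3, 2], [70, 20, 6, 4], [85, 5, 8, 2], [60, 25, 10, 5]]
print(pearson_dispersion(clusters))
print(naive_se(clusters))
print(inflated_se(clusters))
```

A dispersion estimate well above 1 signals that the naive standard errors are too small.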

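10 Generalized Estimating Equations (GEE) Liang and Zeger (1986), Zeger and Liang (1986). An extension of quasi-likelihood to clustered and longitudinal data. The Generalized Estimating Equations are, in the standard Liang and Zeger (1986) form: $\sum_{i=1}^{n} D_i^\top V_i^{-1} \{y_i - \mu_i(\beta)\} = 0$, where $D_i = \partial \mu_i / \partial \beta^\top$ and $V_i$ is the working covariance matrix of cluster $i$.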

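11 Likelihood Models for Correlated Multinomial Dirichlet Multinomial Distribution, Mosimann (1962). A multinomial distribution with a Dirichlet prior on the cell probabilities: $T \mid P \sim \text{Multinomial}(P, m)$ with $P \sim \text{Dirichlet}(\alpha_1, \ldots, \alpha_k)$.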

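12 Likelihood Models for Correlated Multinomial Dirichlet Multinomial Distribution, Mosimann (1962). It can be shown that, if we let $\pi_i = \alpha_i / \alpha_+$ and $\rho^2 = 1/(1 + \alpha_+)$, where $\alpha_1, \ldots, \alpha_k$ are the Dirichlet parameters and $\alpha_+ = \sum_{i=1}^{k} \alpha_i$, then the moments of the Dirichlet Multinomial distribution are given by $E(T) = m\pi$ and $\mathrm{Var}(T) = m\{1 + \rho^2(m-1)\}\{\mathrm{Diag}(\pi) - \pi\pi^\top\}$.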
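
The moments of the Dirichlet Multinomial can be checked by simulation. The sketch below draws cell probabilities from a Dirichlet (via normalized gamma variates) and then a multinomial count vector, and compares the empirical mean and variance of the first count with $m\pi_1$ and $m\pi_1(1-\pi_1)\{1 + \rho^2(m-1)\}$, using the standard relationship $\rho^2 = 1/(1+\alpha_+)$. The Dirichlet parameters, cluster size, seed, and function name are arbitrary illustrative choices.

```python
import random
import statistics

def draw_dirichlet_multinomial(alpha, m, rng):
    """One cluster of size m: P ~ Dirichlet(alpha), then T | P ~ Multinomial(P, m)."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    probs = [g / total for g in gammas]  # normalized gammas are Dirichlet distributed
    counts = [0] * len(alpha)
    for category in rng.choices(range(len(alpha)), weights=probs, k=m):
        counts[category] += 1
    return counts

alpha = [2.0, 1.0, 1.0]    # hypothetical Dirichlet parameters
m = 20                     # hypothetical cluster size
rng = random.Random(2007)  # fixed seed so the run is repeatable

first_counts = [draw_dirichlet_multinomial(alpha, m, rng)[0] for _ in range(20000)]
emp_mean = statistics.mean(first_counts)
emp_var = statistics.variance(first_counts)

alpha_plus = sum(alpha)
pi_1 = alpha[0] / alpha_plus
rho_sq = 1.0 / (1.0 + alpha_plus)
theory_mean = m * pi_1
theory_var = m * pi_1 * (1 - pi_1) * (1 + rho_sq * (m - 1))

print(emp_mean, theory_mean)
print(emp_var, theory_var)
```

With these values the multinomial variance would be 5, while the Dirichlet Multinomial variance is 24, so the simulated variance should sit far above the naive one.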

13 Likelihood Models for Correlated Multinomial Finite Mixture of Multinomials, Morel & Neerchal (1993). Can be represented as: $T = YN + (X \mid N)$, where $N \sim \text{Binomial}(\rho, m)$, $Y \sim \text{Multinomial}(\pi, 1)$, $N \perp Y$, and $(X \mid N) \sim \text{Multinomial}(\pi, m - N)$ if $N < m$.
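
The representation $T = YN + (X \mid N)$ translates directly into a simulation: draw $N$ from a binomial with success probability $\rho$, put all $N$ units in the single category chosen by $Y$, and classify the remaining $m - N$ units independently. The sketch below does this and compares the empirical variance of the first count with $m\pi_1(1-\pi_1)\{1 + \rho^2(m-1)\}$. The parameter values, seed, and function name are assumptions for illustration.

```python
import random
import statistics

def draw_finite_mixture(pi, rho, m, rng):
    """One cluster from the finite mixture via T = Y*N + (X | N)."""
    k = len(pi)
    # N ~ Binomial(rho, m): number of units that copy the common category Y.
    n = sum(1 for _ in range(m) if rng.random() < rho)
    # Y ~ Multinomial(pi, 1): a single category shared by those N units.
    y = rng.choices(range(k), weights=pi, k=1)[0]
    counts = [0] * k
    counts[y] += n
    # X | N ~ Multinomial(pi, m - N): the remaining units are independent.
    for category in rng.choices(range(k), weights=pi, k=m - n):
        counts[category] += 1
    return counts

pi = [0.5, 0.3, 0.2]       # hypothetical category probabilities
rho = 0.5                  # hypothetical clustering parameter
m = 20                     # hypothetical cluster size
rng = random.Random(1993)  # fixed seed so the run is repeatable

fm_first = [draw_finite_mixture(pi, rho, m, rng)[0] for _ in range(20000)]
fm_emp_var = statistics.variance(fm_first)
fm_theory_var = m * pi[0] * (1 - pi[0]) * (1 + rho ** 2 * (m - 1))

print(statistics.mean(fm_first), m * pi[0])
print(fm_emp_var, fm_theory_var)
```

The shared category $Y$ is what induces the within-cluster correlation: the larger $\rho$ is, the more units in a cluster move together.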

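14 Likelihood Models for Correlated Multinomial Finite Mixture of Multinomials, Morel & Neerchal (1993). It can be shown that $T$ is distributed as a finite mixture of multinomials, $T \sim \sum_{i=1}^{k} \pi_i \, \text{Multinomial}\{(1-\rho)\pi + \rho e_i, m\}$, where $e_i$ is the $i$-th unit vector. The moments of the Finite Mixture distribution are then given by $E(T) = m\pi$ and $\mathrm{Var}(T) = m\{1 + \rho^2(m-1)\}\{\mathrm{Diag}(\pi) - \pi\pi^\top\}$.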

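15 Maximum Likelihood Estimation Overview. Computed using the Fisher Scoring Algorithm: $\hat\theta^{(t+1)} = \hat\theta^{(t)} + I(\hat\theta^{(t)})^{-1} S(\hat\theta^{(t)})$, where $S$ is the score vector and $I$ is the Fisher Information Matrix (FIM). The FIM plays an important role. It can be computationally challenging, but approximations are available. The Dirichlet Multinomial FIM can be computed using marginal Beta-Binomial moments.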

16 Maximum Likelihood Estimation Example: Forest Pollen Count, Mosimann (1962). Maximum likelihood estimation results under the Finite Mixture and Dirichlet Multinomial distributions [table not reproduced in this transcript]. The parameters are $\pi_1$ (pine), $\pi_2$ (fir), $\pi_3$ (oak) and $\pi_4$ (alder), with $\pi_4 = 1 - (\pi_1 + \pi_2 + \pi_3)$. The naïve model underestimates the standard errors. The FM model gives smaller standard errors for the estimates of $\pi$.

17 Maximum Likelihood Estimation Simulation Study. What are the effects of using the wrong model? After each simulation, we calculate the average of the determinants from each model. A comparison of these averages gives us insight as to which model may be more efficient.

18 Maximum Likelihood Estimation Simulation Study. The Joint Asymptotic Relative Efficiency (JARE) can be used to summarize the simulation results, as it indicates which estimate would have a smaller asymptotic variance. For a vector parameter, JARE is the ratio of the determinants of the asymptotic variance-covariance matrices.
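
A ratio of determinants of two variance-covariance matrices is easy to compute for the small matrices involved here. The sketch below is only an illustration: the slide does not state which model goes in the numerator, so the orientation is left to the caller, and the two matrices are made-up numbers rather than results from the study.

```python
def determinant(mat):
    """Determinant by cofactor expansion along the first row (fine for small matrices)."""
    n = len(mat)
    if n == 1:
        return mat[0][0]
    total = 0.0
    for j in range(n):
        minor = [row[:j] + row[j + 1:] for row in mat[1:]]
        total += (-1) ** j * mat[0][j] * determinant(minor)
    return total

def determinant_ratio(cov_numerator, cov_denominator):
    """det(cov_numerator) / det(cov_denominator); which model goes where is the caller's choice."""
    return determinant(cov_numerator) / determinant(cov_denominator)

# Hypothetical 2x2 asymptotic variance-covariance matrices from two models.
cov_model_a = [[4.0, 1.0], [1.0, 3.0]]
cov_model_b = [[2.0, 0.5], [0.5, 1.0]]
ratio = determinant_ratio(cov_model_a, cov_model_b)
print(ratio)  # a value above 1 means model B has the smaller generalized variance
```

In a simulation study, the same function could be applied to the averaged determinants or averaged matrices from each fitted model.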

19 Conclusions If we observe correlated/clustered multinomial data, use of the naïve multinomial model causes the standard errors to be underestimated, which leads to erroneous inferences and inflated Type-I error rates. If the data truly come from a Finite Mixture distribution, then estimation using this model clearly outperforms the Dirichlet Multinomial in terms of efficiency. If we are unsure of the distribution, the FM model may underestimate the standard errors, and the Dirichlet Multinomial model provides a safe alternative.

20 Future Work Extension to Include Covariates: Covariates can be included and linked to the model parameters through “link” functions, as in the Generalized Linear Model (GLM) framework. Obtain the expressions for the efficiency of likelihood models relative to GEE. Simulation Study: Use simulations to see if gains in efficiency of the likelihood models can be achieved over GEE. Does the inclusion of covariates change our conclusions? Does the choice of link function have an influence?

21 References
Cox, D.R. and Snell, E.J. (1989) Analysis of Binary Data. 2nd Ed. New York: Chapman and Hall.
Kish, L. (1965) Survey Sampling. New York: John Wiley & Sons.
Liang, K.Y. and Zeger, S.L. (1986) “Longitudinal data analysis using generalized linear models.” Biometrika 73: 13-22.
McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models. 2nd Ed. London: Chapman and Hall.
Morel, J.G. and Nagaraj, N.K. (1993) “A finite mixture distribution for modelling multinomial extra variation.” Biometrika 80: 363-371.
Mosimann, J.E. (1962) “On the compound multinomial distribution, the multivariate β-distribution, and correlations among proportions.” Biometrika 49: 65-82.
Neerchal, N.K. and Morel, J.G. (1998) “Large cluster results for two parametric multinomial extra variation models.” Journal of the American Statistical Association 93: 1078-1087.
Wedderburn, R.W.M. (1974) “Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method.” Biometrika 61: 439-447.
Zeger, S.L. and Liang, K.Y. (1986) “Longitudinal data analysis for discrete and continuous outcomes.” Biometrics 42: 121-130.
