Models for the Analysis of Discrete Compositional Data: An Application of Random Effects Graphical Models. Devin S. Johnson, STARMAP, Department of Statistics.


1 Models for the Analysis of Discrete Compositional Data: An Application of Random Effects Graphical Models
Devin S. Johnson, STARMAP, Department of Statistics, Colorado State University
Developed under the EPA STAR Research Assistance Agreement CR-829095

2 Dissertation Contributions
Chapter 2: Discrete regression graphical chain model.
– New graphical chain model.
– Markov properties of the DR model are illustrated.
Chapter 3: Random effects graphical models.
– Single discrete response.
– Derived Markov properties and integrability conditions.
Chapter 4: Multi-way composition models.
– Allows analysis of multi-way compositions.
– Derived Markov properties.
– Integrability shown for preservative models.
Chapter 5: Autoregressive models for capture-recapture data (Biometrics, 2003).

4 Motivating Problem
Various stream sites in the Mid-Atlantic region of the United States were visited in summer 1994.
– For each site, each observed fish species was cross-categorized according to several traits.
– Environmental variables were also measured at each site (e.g. precipitation, chloride concentration, …).
Relative proportions are more informative.
How can we determine whether the collected environmental variables affect species richness compositions, and if so, which ones?

5 Outline
Introduction
– Compositional data
– Probability models
Brief introduction to chain graphs
A graphical model for compositional data
– Modeling individual probabilities
– Markov properties of random effects graphical models
Analysis of fish species richness compositional data
Conclusions and future research

6 Discrete Compositions and Probability Models
Compositional data are multivariate observations Z = (Z_1, …, Z_D) subject to the constraints that Σ_i Z_i = 1 and Z_i ≥ 0.
Compositional data are usually modeled with the logistic-normal (LN) distribution (Aitchison 1986).
– Scale and location parameters provide a large amount of flexibility compared to the Dirichlet model.
– The LN model is defined for positive compositions only.
Problem: with discrete counts, there is a non-trivial probability of observing 0 individuals in a particular category.
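
The zero-count problem can be illustrated with a minimal Python sketch. It assumes the additive log-ratio transform that underlies the logistic-normal model; the function names `closure` and `alr` and the example counts are hypothetical and chosen only for illustration.

```python
import math

def closure(counts):
    """Convert raw category counts to a composition (proportions summing to 1)."""
    total = sum(counts)
    return [c / total for c in counts]

def alr(composition):
    """Additive log-ratio transform, using the last part as the reference."""
    ref = composition[-1]
    return [math.log(z / ref) for z in composition[:-1]]

# Hypothetical counts: one strictly positive sample, one with an empty category.
positive = closure([12, 5, 3])
with_zero = closure([12, 0, 8])

print(alr(positive))  # finite log-ratios

try:
    alr(with_zero)
except ValueError as err:
    # math.log(0.0) is undefined, so the transform fails for a zero part.
    print("log-ratio undefined for a zero part:", err)
```

A composition with any zero part has no finite log-ratio representation, which is why a model on the counts themselves is attractive for discrete data.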

7 Existing Compositional Data Models
Billheimer and Guttorp (2001) proposed using a multinomial state-space model for a single composition, where Y_ij is the number of individuals belonging to category j = 1, …, D at site i = 1, …, S.
Limitations:
– Models proportions of a single categorical variable.
– Abstract interpretation of included covariate effects.

8 Existing Graphical Models
Graphical model theory (see Lauritzen 1996) has been used for many years to
– model cell probabilities for high-dimensional contingency tables
– determine dependence relationships among categorical and continuous variables
Limitation:
– Graphical models are designed for a single sample (or site, in the case of the Oregon stream data). Compositional data may arise at many sites.

9 New Improvements for Compositional Data Models
The Billheimer and Guttorp model can be generalized by the application of graphical model theory.
– Generalized models can be applied to cross-classified compositions.
– Simple interpretation of covariate effects as a variable in a Markov random field.
Conversely, graphical model theory can be expanded to include models for multiple-site sampling schemes.

10 Chain Graphs
Mathematical graphs are used to illustrate complex dependence relationships in a multivariate distribution.
A random vector is represented as a set of vertices, V.
Pairs of vertices are connected by directed edges if a causal relationship is assumed, and by undirected edges if the relationship is mutual.

11 Probability Model for Individuals (Unobserved Composition)
Response variables
– A set of discrete categorical variables
– Notation: y is a specific cell
Explanatory variables
– A set of categorical and/or continuous variables
– Notation: x refers to a specific explanatory observation
Random effects
– Allow flexibility when sampling many "sites"
– Unobserved covariates
– Notation: each random effect is indexed by a subset f of the response variables

12 Probability Model and Extended Chain Graph
Joint distribution: f(y, x, random effects) = f(y | x, random effects) · f(x) · f(random effects)
[Figure: graph illustrating possible dependence relationships for the full model, with vertices X1, X2, Y1, Y2 and random-effect vertices indexed by {1,2}, 2, and 1.]

13 Random Effects Discrete Regression Model (REDR)
Sampling of individuals occurs at many different random sites, i = 1, …, S, where covariates are measured only once per site.
Hierarchical model for individual probabilities:

14 Random Effects Discrete Regression Model (REDR) Response parameters constraints: The function   ( x,  ) is a normalizing constant w.r.t. y|(x,  ), and therefore, is not a function of y. The parameters  fcd (y, x  ),  f  dm (y, x  ), and  f (y) are interaction effects that depend on y and x  through the levels of the variables in f and d only. Interaction parameters (and random effects) are set to zero for identifiability of the model if the cells y or x  are indexed by the first level of any variable in f or d 

15 Random Effects Discrete Regression Model (REDR) Model for explanatory variables (CG distribution): Again, interactions depend on x  through the levels of the variables in the set d only, and identifiability constraints are imposed.

16 Graphical Models for Discrete Compositions
For a set of categorical responses:
– Let D be the number of cross-classified cells.
– Let C_ij = number of observations in cell j = 1, …, D at site i = 1, …, S.
Likelihood: (C_i1, …, C_iD) | X = x ~ multinomial(N_i; p_i1, …, p_iD), where p_ij is given by the REDR model.
Covariate distribution: X follows a CG (conditional Gaussian) distribution.
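
As a rough illustration of the likelihood on this slide, the following Python sketch evaluates a multinomial log-probability for one site. The REDR equation is not reproduced here, so the cell probabilities are simply assumed to come from a softmax of hypothetical linear predictors (`scores`); the function names and all numbers are illustrative only.

```python
import math

def softmax(scores):
    """Map real-valued cell scores to probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def multinomial_log_pmf(counts, probs):
    """Log-probability of the cell counts under multinomial(N; probs)."""
    n = sum(counts)
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    # Cells with a zero count contribute nothing to the sum.
    return log_coef + sum(c * math.log(p) for c, p in zip(counts, probs) if c > 0)

# Hypothetical site with D = 3 cells; the scores stand in for the REDR predictor.
scores = [0.0, 0.8, -0.5]
probs = softmax(scores)
counts = [4, 0, 6]  # a zero count is handled without difficulty

print(multinomial_log_pmf(counts, probs))
```

Because the probabilities are strictly positive while the counts may be zero, the count-level likelihood remains well defined in the situation where a log-ratio transform of observed proportions would fail.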

17 Markov Properties of Chain Graph Models
Let P denote a probability measure on the product space X = ∏_{α ∈ V} X_α.
Markov (global) property: the probability measure P is Markovian with respect to a graph G if, for any triple (A, B, S) of disjoint sets in V such that S separates A from B in (G_{an(A ∪ B ∪ S)})^m, we have A ⊥ B | S.
There are two weaker Markov properties, the pairwise and local Markov properties.
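
The separation condition in the global Markov property can be sketched in Python as a graph search. This sketch covers only the final step, checking separation in an undirected graph that is assumed to be already moralized and restricted to the relevant ancestral set; the function name `separates` and the example edges are hypothetical.

```python
from collections import deque

def separates(adjacency, a, b, s):
    """Return True if every path from a vertex in a to a vertex in b passes through s."""
    blocked = set(s)
    targets = set(b)
    start = set(a) - blocked
    seen = set(start)
    queue = deque(start)
    while queue:
        v = queue.popleft()
        if v in targets:
            return False  # reached b without crossing s
        for w in adjacency.get(v, ()):
            if w not in seen and w not in blocked:
                seen.add(w)
                queue.append(w)
    return True

# Hypothetical undirected graph: the path X1 - Y1 - Y2 - X2.
graph = {
    "X1": {"Y1"},
    "Y1": {"X1", "Y2"},
    "Y2": {"Y1", "X2"},
    "X2": {"Y2"},
}

print(separates(graph, {"X1"}, {"X2"}, {"Y1"}))  # Y1 blocks the only path
print(separates(graph, {"X1"}, {"X2"}, set()))   # nothing blocks the path
```

When the separating set blocks every path, the global Markov property would assert the corresponding conditional independence.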

18 Markov Properties of the REDR Model Proposition 1. A REDR model is G  Markovian if and only if the following six constraints are satisfied for a given extended graph G   Response model  fcd ( y, x  ) = 0 unless f  c  d is complete  for c  d ≠ Ø   f  dm (y, x  ) = 0 for m = 1,…,M, unless f  {  }  d is complete, where {  }   and d  .  f ( y ) = -  f ØØ ( y ) with probability 1 if f is not complete.

19 Markov Properties of the REDR Model Proposition 1. A REDR model is G  Markovian if and only if the following six constraints are satisfied for a given extended graph G   Covariate model  d ( x d ) = 0 unless d is complete    d  ( x d ) = 0 unless {  }  c is complete, where {  }   and d   6.  .= 0 unless { ,  } is complete, where ,    and   is the ( ,  ) element of  Ø.

20 Markov Properties of the REDR Model
Sketch of proof
Lauritzen and Wermuth (1989) prove the conditions concerning the parameters of the CG distribution.
If the response-model interaction parameters are 0 for the specified sets, then the density factorizes according to Frydenberg's theorem.
A modified version of the proof of the Hammersley-Clifford theorem shows that if the conditional density of y given x and the random effects separates into complete factors, then the corresponding interaction parameters for non-complete sets must be 0.

21 Preservative REDR Models Preservative REDR models are defined by the following conditions: 1.All connected components a q, q = 1,…,Q, of  in G  are complete, where Q is the total number of connected components. 1.Any    that is a parent of   a q is also a parent of every other   a q, q = 1,…,Q.

22 Markov Properties of the REDR Model Proposition 2. If P is a preservative REDR model, and P is G  Markovian, then the marginal distribution, P    of the covariates and response variables is G = ( G  )   Markovian. Sketch of Proof. The integrated REDR density follows Frydenberg’s (1990) factorization criterion. The factorizing functions, however, do not exist in closed form.

23 Parameter Estimation
A Gibbs sampling approach is used for parameter estimation.
Hierarchical centering
– Produces Gibbs samplers that converge to the posterior distributions faster.
– Most parameters have standard full conditionals if given conditionally conjugate prior distributions.
Independent priors imply that the covariate and response models can be analyzed with separate MCMC procedures.

24 Fish Species Richness in the Mid-Atlantic Highlands
91 stream sites in the Mid-Atlantic region of the United States were visited in an EPA EMAP study.
Response composition: observed fish species were cross-categorized according to 2 discrete variables:
1. Habit
– Column species
– Benthic species
2. Pollution tolerance
– Intolerant
– Intermediate
– Tolerant

25 Stream Covariates
Environmental covariates: values were measured at each site for the following covariates:
1. Mean watershed precipitation (m)
2. Minimum watershed elevation (m)
3. Turbidity (ln NTU)
4. Chloride concentration (ln μeq/L)
5. Sulfate concentration (ln μeq/L)
6. Watershed area (ln km²)

26 Fish Species Richness Model
Composition graphical model:
Prior distributions:

27 Model Selection
Three different models are considered:
1. Independent response (i.e. the interaction term and the random effect for f = {H, T} are both 0)
2. Dependent response with independent errors
3. Dependent response with correlated errors (equivalent to the Billheimer-Guttorp model)

Model                        DIC      ΔDIC    pD
Independent                  1107.7   --      68.7
Dependent (indep. errors)    1117.8   10.1    106.1
Dependent (corr. errors)     1166.8   59.1    162.5
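
The DIC comparison can be reproduced with a short Python sketch. The DIC and pD values are taken from the slide; the mean-deviance column is an addition that assumes the standard definition DIC = mean posterior deviance + pD, which the slide does not state explicitly. The variable names are illustrative.

```python
# DIC and pD values copied from the model-selection slide.
models = {
    "Independent": {"dic": 1107.7, "p_d": 68.7},
    "Dependent (indep. errors)": {"dic": 1117.8, "p_d": 106.1},
    "Dependent (corr. errors)": {"dic": 1166.8, "p_d": 162.5},
}

# The preferred model is the one with the lowest DIC.
best = min(models, key=lambda name: models[name]["dic"])

summary = {}
for name, row in models.items():
    summary[name] = {
        # Difference from the lowest-DIC model.
        "delta_dic": round(row["dic"] - models[best]["dic"], 1),
        # Assumes DIC = mean deviance + pD (standard definition).
        "mean_deviance": round(row["dic"] - row["p_d"], 1),
    }

for name, row in summary.items():
    print(f"{name:28s} dDIC={row['delta_dic']:5.1f}  mean deviance={row['mean_deviance']:7.1f}")
```

Under that assumption, the dependent models fit slightly better on average (lower mean deviance) but pay a larger complexity penalty through pD, which is why the independent model has the lowest DIC.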

28 Fish Species Functional Groups
Edge exclusion was determined from 95% HPD intervals for the interaction parameters and for the off-diagonal elements of the matrix of interactions among the continuous covariates.
[Figure: posterior suggested chain graph for the independence model (lowest DIC model), with vertices Tolerance, Habit, Precipitation, Chloride, Elevation, Turbidity, Area, and Sulfate.]

29 Comments and Conclusions
Using the discrete regression (DR) model with random effects, the Billheimer-Guttorp model can be generalized.
– Relationships are evaluated through a graphical model.
– Multi-way compositions can be analyzed with a specified dependence structure between cells.
– MVN random effects imply that the cell probabilities have a constrained LN distribution.
DR models also extend the capabilities of graphical models.
– Data from many sites can be analyzed.
– Overdispersion in cell counts can be added.

30 Future Work
Model determination under a Bayesian framework
– Models involve regression coefficients as well as many random effects.
– Initial investigation suggests that selection based on parameter inclusion, rather than edge inclusion, produces models with higher posterior mass.
Accounting for spatial correlation

31 The work reported here was developed under the STAR Research Assistance Agreement CR-829095 awarded by the U.S. Environmental Protection Agency (EPA) to Colorado State University. This presentation has not been formally reviewed by EPA. The views expressed here are solely those of the presenter and STARMAP, the program he represents. EPA does not endorse any products or commercial services mentioned in this presentation.
Any questions?

