Models for the Analysis of Discrete Compositional Data An Application of Random Effects Graphical Models Devin S. Johnson STARMAP Department of Statistics.

Slides:



Advertisements
Similar presentations
MCMC estimation in MlwiN
Advertisements

General Linear Model With correlated error terms  =  2 V ≠  2 I.
Gibbs Sampling Methods for Stick-Breaking priors Hemant Ishwaran and Lancelot F. James 2001 Presented by Yuting Qi ECE Dept., Duke Univ. 03/03/06.
Hierarchical Dirichlet Processes
VARYING RESIDUAL VARIABILITY SEQUENCE OF GRAPHS TO ILLUSTRATE r 2 VARYING RESIDUAL VARIABILITY N. Scott Urquhart Director, STARMAP Department of Statistics.
1 Colorado State University’s EPA-FUNDED PROGRAM ON SPACE-TIME AQUATIC RESOURCE MODELING and ANALYSIS PROGRAM (STARMAP) Jennifer A. Hoeting and N. Scott.
6-1 Introduction To Empirical Models 6-1 Introduction To Empirical Models.
What is Statistical Modeling
Nonparametric, Model-Assisted Estimation for a Two-Stage Sampling Design Mark Delorey, F. Jay Breidt, Colorado State University Abstract In aquatic resources,
1 STARMAP: Project 2 Causal Modeling for Aquatic Resources Alix I Gitelman Stephen Jensen Statistics Department Oregon State University August 2003 Corvallis,
State-Space Models for Within-Stream Network Dependence William Coar Department of Statistics Colorado State University Joint work with F. Jay Breidt This.
Chapter 3 Simple Regression. What is in this Chapter? This chapter starts with a linear regression model with one explanatory variable, and states the.
Bayesian modeling for ordinal substrate size using EPA stream data Megan Dailey Higgs Jennifer Hoeting Brian Bledsoe* Department of Statistics, Colorado.
Models for the Analysis of Discrete Compositional Data An Application of Random Effects Graphical Models Devin S. Johnson STARMAP Department of Statistics.
Chapter 4 Multiple Regression.
1 Accounting for Spatial Dependence in Bayesian Belief Networks Alix I Gitelman Statistics Department Oregon State University August 2003 JSM, San Francisco.
Tch-prob1 Chapter 4. Multiple Random Variables Ex Select a student’s name from an urn. S In some random experiments, a number of different quantities.
Distribution Function Estimation in Small Areas for Aquatic Resources Spatial Ensemble Estimates of Temporal Trends in Acid Neutralizing Capacity Mark.
Habitat association models  Independent Multinomial Selections (IMS): (McCracken, Manly, & Vander Heyden, 1998) Product multinomial likelihood with multinomial.
Copyright (c) 2004 Brooks/Cole, a division of Thomson Learning, Inc. Chapter 14 Goodness-of-Fit Tests and Categorical Data Analysis.
Distribution Function Estimation in Small Areas for Aquatic Resources Spatial Ensemble Estimates of Temporal Trends in Acid Neutralizing Capacity Mark.
Continuous Random Variables and Probability Distributions
Today Logistic Regression Decision Trees Redux Graphical Models
State-Space Models for Biological Monitoring Data Devin S. Johnson University of Alaska Fairbanks and Jennifer A. Hoeting Colorado State University.
Computer vision: models, learning and inference Chapter 3 Common probability distributions.
Statistical Models for Stream Ecology Data: Random Effects Graphical Models Devin S. Johnson Jennifer A. Hoeting STARMAP Department of Statistics Colorado.
Random Effects Graphical Models and the Analysis of Compositional Data Devin S. Johnson and Jennifer A. Hoeting STARMAP Department of Statistics Colorado.
Distribution Function Estimation in Small Areas for Aquatic Resources Spatial Ensemble Estimates of Temporal Trends in Acid Neutralizing Capacity Mark.
Distribution Function Estimation in Small Areas for Aquatic Resources Spatial Ensemble Estimates of Temporal Trends in Acid Neutralizing Capacity Mark.
1 Adjustment Procedures to Account for Nonignorable Missing Data in Environmental Surveys Breda Munoz Virginia Lesser R
Lecture II-2: Probability Review
Separate multivariate observations
Chapter Two Probability Distributions: Discrete Variables
1 Spatial and Spatio-temporal modeling of the abundance of spawning coho salmon on the Oregon coast R Ruben Smith Don L. Stevens Jr. September.
Machine Learning Queens College Lecture 3: Probability and Statistics.
Principles of Pattern Recognition
Statistical Decision Theory
1 7. Two Random Variables In many experiments, the observations are expressible not as a single quantity, but as a family of quantities. For example to.
Multinomial Distribution
Simulation of the matrix Bingham-von Mises- Fisher distribution, with applications to multivariate and relational data Discussion led by Chunping Wang.
Danila Filipponi Simonetta Cozzi ISTAT, Italy Outlier Identification Procedures for Contingency Tables in Longitudinal Data Roma,8-11 July 2008.
Multiple Random Variables Two Discrete Random Variables –Joint pmf –Marginal pmf Two Continuous Random Variables –Joint Distribution (PDF) –Joint Density.
The Dirichlet Labeling Process for Functional Data Analysis XuanLong Nguyen & Alan E. Gelfand Duke University Machine Learning Group Presented by Lu Ren.
Probability and Measure September 2, Nonparametric Bayesian Fundamental Problem: Estimating Distribution from a collection of Data E. ( X a distribution-valued.
Ch 8. Graphical Models Pattern Recognition and Machine Learning, C. M. Bishop, Revised by M.-O. Heo Summarized by J.W. Nam Biointelligence Laboratory,
CHAPTER 5 SIGNAL SPACE ANALYSIS
Statistical Decision Theory Bayes’ theorem: For discrete events For probability density functions.
Lecture 3: Statistics Review I Date: 9/3/02  Distributions  Likelihood  Hypothesis tests.
Data Modeling Patrice Koehl Department of Biological Sciences National University of Singapore
Additional Topics in Prediction Methodology. Introduction Predictive distribution for random variable Y 0 is meant to capture all the information about.
Lecture 2: Statistical learning primer for biologists
The generalization of Bayes for continuous densities is that we have some density f(y|  ) where y and  are vectors of data and parameters with  being.
Machine Learning CUNY Graduate Center Lecture 2: Math Primer.
1 Probability and Statistical Inference (9th Edition) Chapter 4 Bivariate Distributions November 4, 2015.
Lecture 1: Basic Statistical Tools. A random variable (RV) = outcome (realization) not a set value, but rather drawn from some probability distribution.
6.4 Random Fields on Graphs 6.5 Random Fields Models In “Adaptive Cooperative Systems” Summarized by Ho-Sik Seok.
Discriminative Training and Machine Learning Approaches Machine Learning Lab, Dept. of CSIE, NCKU Chih-Pin Liao.
- 1 - Preliminaries Multivariate normal model (section 3.6, Gelman) –For a multi-parameter vector y, multivariate normal distribution is where  is covariance.
VARYING DEVIATION BETWEEN H 0 AND TRUE  SEQUENCE OF GRAPHS TO ILLUSTRATE POWER VARYING DEVIATION BETWEEN H 0 AND TRUE  N. Scott Urquhart Director, STARMAP.
Ch 6. Markov Random Fields 6.1 ~ 6.3 Adaptive Cooperative Systems, Martin Beckerman, Summarized by H.-W. Lim Biointelligence Laboratory, Seoul National.
Biointelligence Laboratory, Seoul National University
7. Two Random Variables In many experiments, the observations are expressible not as a single quantity, but as a family of quantities. For example to record.
Shashi Shekhar Weili Wu Sanjay Chawla Ranga Raju Vatsavai
Markov Random Fields Presented by: Vladan Radosavljevic.
Parametric Methods Berlin Chen, 2005 References:
Multivariate Methods Berlin Chen
Multivariate Methods Berlin Chen, 2005 References:
Pattern Recognition and Machine Learning Chapter 2: Probability Distributions July chonbuk national university.
7. Two Random Variables In many experiments, the observations are expressible not as a single quantity, but as a family of quantities. For example to record.
7. Two Random Variables In many experiments, the observations are expressible not as a single quantity, but as a family of quantities. For example to record.
Presentation transcript:

Models for the Analysis of Discrete Compositional Data An Application of Random Effects Graphical Models Devin S. Johnson STARMAP Department of Statistics Colorado State University Developed under the EPA STAR Research Assistance Agreement CR

Dissertation Contributions Chapter 2: Discrete regression graphical chain model. –New graphical chain model. –Markov properties of the DR model are illustrated. Chapter 3: Random effects graphical models. –Single discrete response. –Derived Markov properties and integrability cond. Chapter 4: Multi-way composition models. –Allows analysis of multi-way compositions. –Derived Markov properties. –Integrability shown for preservative models. Chapter 5: Autoregressive models for capture-recapture data (Biometrics, 2003).

Dissertation Contributions Chapter 2: Discrete regression graphical chain model. –New graphical chain model. –Markov properties of the DR model are illustrated. Chapter 3: Random effects graphical models. –Single discrete response. –Derived Markov properties and integrability cond. Chapter 4: Multi-way composition models. –Allows analysis of multi-way compositions. –Derived Markov properties. –Integrability shown for preservative models. Chapter 5: Autoregressive models for capture-recapture data (Biometrics, 2003).

Motivating Problem Various stream sites in the Mid-Atlantic region of the United States were visited in Summer –For each site, each observed fish species was cross categorized according to several traits –Environmental variables are also measured at each site (e.g. precipitation, chloride concentration,…) Relative proportions are more informative. How can we determine if collected environmental variables affect species richness compositions (which ones)?

Outline Introduction –Compositional data –Probability models Brief introduction to chain graphs A graphical model for compositional data –Modeling individual probabilities –Markov properties of random effects graphical models Analysis of fish species richness compositional data Conclusions and Future Research

Discrete Compositions and Probability Models Compositional data are multivariate observations Z = ( Z 1,…, Z D ) subject to the constraints that  i Z i = 1 and Z i  0. Compositional data are usually modeled with the Logistic-Normal distribution (Aitchison 1986). –Scale and location parameters provide a large amount of flexibility compared to the Dirichlet model –LN model defined for positive compositions only Problem: With discrete counts one has a non-trivial probability of observing 0 individuals in a particular category

Existing Compositional Data Models Billhiemer and Guttorp (2001) proposed using a multinomial state-space model for a single composition, where Y ij is the number of individuals belonging to category j = 1,…,D at site i = 1,…, S. Limitations: –Models proportions of a single categorical variable. –Abstract interpretation of included covariate effects

Existing Graphical Models Graph model theory (see Lauritzen 1996) has been used for many years to –model cell probabilities for high dimensional contingency tables –determine dependence relationships among categorical and continuous variables Limitation: –Graphical models are designed for a single sample (or site in the case of the Oregon stream data). Compositional data may arise at many sites

New Improvements for Compositional Data Models The Billhiemer and Guttorp model can be generalized by the application of graphical model theory. –Generalized models can be applied to cross-classified compositions –Simple interpretation of covariate effects as a variable in a Markov random field Conversely, graphical model theory can be expanded to include models for multiple site sampling schemes

Chain Graphs     Mathematical graphs are used to illustrate complex dependence relationships in a multivariate distribution. A random vector is represented as a set of vertices, V. Pairs of vertices are connected by directed edges if a causal relationship is assumed, undirected if the relationship is mutual

Probability Model for Individuals (Unobserved Composition) Response variables –Set  of discrete categorical variables –Notation: y is a specific cell Explanatory variables –Set  of categorical (  ) and/or continuous (  ) variables –Notation: x refers to a specific explanatory observation Random effects –Allows flexibility when sampling many “sites” –Unobserved covariates –Notation:  f, f   refers to a random effect.

Probability Model and Extended Chain Graph, G  Joint distribution f (y, x,  ) = f (y|x,  )  f (x)  f (  ) Graph illustrating possible dependence relationships for the full model, G . X1X1 X2X2 Y2Y2 Y1Y1  {1,2} 22 11

Random Effects Discrete Regression Model (REDR) Sampling of individuals occurs at many different random sites, i = 1,…,S, where covariates are measured only once per site Hierarchical model for individual probabilities:

Random Effects Discrete Regression Model (REDR) Response parameters constraints: The function   ( x,  ) is a normalizing constant w.r.t. y|(x,  ), and therefore, is not a function of y. The parameters  fcd (y, x  ),  f  dm (y, x  ), and  f (y) are interaction effects that depend on y and x  through the levels of the variables in f and d only. Interaction parameters (and random effects) are set to zero for identifiability of the model if the cells y or x  are indexed by the first level of any variable in f or d 

Random Effects Discrete Regression Model (REDR) Model for explanatory variables (CG distribution): Again, interactions depend on x  through the levels of the variables in the set d only, and identifiability constraints are imposed.

Graphical Models for Discrete Compositions For a set  of categorical responses –Let D be the number of cross-classified cells –Let C ij = Number of observations in cell j=1,…,D at site i=1,…,S Likelihood (C i1,…,C iD ) | X  = x  ~ multinomial (N i ; p i1,…,p iD ), where p ij is given by the REDR model Covariate distribution X  ~ CG (, ,  )

Markov Properties of Chain Graph Models Let P denote a probability measure on the product space X = ∏  V X  Markov (Global) property The probability measure P is Markovian with respect to a graph G if for any triple ( A, B, S ) of disjoint sets in V, such that S separates A from B in {G an(A  B  S) } m, we have A  B | S. There are two weaker Markov properties, pairwise and local Markov properties.

Markov Properties of the REDR Model Proposition 1. A REDR model is G  Markovian if and only if the following six constraints are satisfied for a given extended graph G   Response model  fcd ( y, x  ) = 0 unless f  c  d is complete  for c  d ≠ Ø   f  dm (y, x  ) = 0 for m = 1,…,M, unless f  {  }  d is complete, where {  }   and d  .  f ( y ) = -  f ØØ ( y ) with probability 1 if f is not complete.

Markov Properties of the REDR Model Proposition 1. A REDR model is G  Markovian if and only if the following six constraints are satisfied for a given extended graph G   Covariate model  d ( x d ) = 0 unless d is complete    d  ( x d ) = 0 unless {  }  c is complete, where {  }   and d   6.  .= 0 unless { ,  } is complete, where ,    and   is the ( ,  ) element of  Ø.

Markov Properties of the REDR Model Sketch of proof  Lauritzen and Wermuth (1989) prove conditions concerning the, , and  Ø parameters for the CG distribution. If the  and  parameters are 0 for the specified sets then the density factorizes according to Frydenburg’s theorem. A modified version of the proof of the Hammersley- Clifford Theorem shows that if f ( y | x,  ) separates into complete factors, then, the corresponding  and  vectors for non-complete sets must be 0.

Preservative REDR Models Preservative REDR models are defined by the following conditions: 1.All connected components a q, q = 1,…,Q, of  in G  are complete, where Q is the total number of connected components. 1.Any    that is a parent of   a q is also a parent of every other   a q, q = 1,…,Q.

Markov Properties of the REDR Model Proposition 2. If P is a preservative REDR model, and P is G  Markovian, then the marginal distribution, P    of the covariates and response variables is G = ( G  )   Markovian. Sketch of Proof. The integrated REDR density follows Frydenberg’s (1990) factorization criterion. The factorizing functions, however, do not exist in closed form.

Parameter Estimation A Gibbs sampling approach is used for parameter estimation Hierarchical centering –Produces Gibbs samplers which converge to the posterior distributions faster –Most parameters have standard full conditionals if given conditional conjugate distributions. Independent priors imply that covariate and response models can be analyzed with separate MCMC procedures.

Fish Species Richness in the Mid-Atlantic Highlands 91 stream sites in the Mid Atlantic region of the United States were visited in an EPA EMAP study Response composition: Observed fish species were cross-categorized according to 2 discrete variables: 1.Habit Column species Benthic species 2.Pollution tolerance Intolerant Intermediate Tolerant

Stream Covariates Environmental covariates: values were measured at each site for the following covariates 1.Mean watershed precipitation (m) 2.Minimum watershed elevation (m) 3.Turbidity (ln NTU) 4.Chloride concentration (ln  eq/L) 5.Sulfate concentration (ln  eq/L) 6.Watershed area (ln km 2 )

Fish Species Richness Model Composition Graphical Model: and Prior distributions

Model Selection Three different models are considered 1.Independent response (i.e.  f  ( y i ) =  f ( y i ) = 0 for f = {H, T } ) 2.Depended response w/ independent errors 3.Dependent response w/ correlated errors (equivalent to Billheimer Guttorp model) ModelDIC  DIC pDpD Independent Dependent (indep. errors) Dependent (corr. errors)

Fish Species Functional Groups Edge exclusion determined from 95% HPD intervals for  parameters and off-diagonal elements of  Ø  Posterior suggested chain graph for independence model (lowest DIC model) Tolerance Precipitation Chloride Elevation Turbidity Area Sulfate Habit

Comments and Conclusions Using Discrete Response model with random effects, the Billheimer-Guttorp model can be generalized –Relationships evaluated though a graphical model –Multi-way compositions can be analyzed with specified dependence structure between cells –MVN random effects imply that the cell probabilities have a constrained LN distribution DR models also extend the capabilities of graphical models –Data can be analyzed from many multiple sites –Over dispersion in cell counts can be added

Future Work Model determination under a Bayesian framework –Models involve regression coefficients as well as many random effects –Initial investigation suggests selection based on parameters, not edge, inclusion produces models with higher posterior mass Accounting for spatial correlation

The work reported here was developed under the STAR Research Assistance Agreement CR awarded by the U.S. Environmental Protection Agency (EPA) to Colorado State University. This presentation has not been formally reviewed by EPA. The views expressed here are solely those of presenter and the STARMAP, the Program he represents. EPA does not endorse any products or commercial services mentioned in this presentation. Any Questions? # CR