Gene-Environment Interactions, Pathways, and Genome-Wide Association Studies in Asthma: What are the Analysis Challenges? Examples from the Children’s Health Study
Duncan Thomas
University of Southern California
Los Angeles, USA

Conceptual Model for Oxidative Stress Pathway for Effects of Air Pollution
[Diagram nodes: Oxidant Exposure; Dose; Physical Activity; Oxidative Stress; Inflammation; Health Effects; Molecular & enzymatic antioxidants; ROS metabolism; Xenobiotic metabolism; Oxidative production & detoxification]
Gilliland et al. EHP 1999;107:403-7

Statistical Challenges
Exposure assessment and modeling
GxE and GxG interactions
Pathways
  – Hierarchical modeling strategy
  – Mechanistic models
GWAS
Collaborations

Multilevel Mixed Model
Between times within subject
Between subjects within community
Between communities
Berhane et al, Statist Sci 2004; 19: 414-440

Multi-stage Model
Y = LF (lung function), t = age, Z = pollution
1: $Y_{cij} = a_{ci} + b_{ci} t_{cij} + \beta_1 Z_{ck}\, s(t_{cij}) + e_{cij}$
  – $b_{ci}$ = subject-specific 8-yr LF growth
2: $b_{ci} = B_c + \beta_2 Z_{ci} + e_{ci}$
  – Regression on subject-specific variables
3: $B_c = \beta_0 + \beta_3 Z_c + e_c$
  – Regression on ambient pollution level
Fit as single mixed model
Can include confounders at each level
Berhane et al, Statist Sci 2004; 19: 414-440

Community FEV1 growth vs. NO2
Gauderman et al, AJRCCM 2000: 162:1383-90

Spatial Variability of Measured Pollution and Traffic Density
[Figure panels: Regionally; Within Communities; Modeled Exposure]

Atmospheric Dispersion Models
[Diagram: road, wind direction, residence at location x, vehicle at location y]
Benson, CALINE4, CA Dept of Transport 1989: #205

Effects of Local Variation in Air Pollution
Prevalent Asthma, Long-Term Residents
[Figure, two panels. Distance from freeway: >300 m, 150-300 m, 75-150 m, <75 m. Modeled traffic pollutants (percentile): 0-25%, 25-50%, 50-75%, 75-90%, >90%]
McConnell et al, EHP 2006:114:766-72

Measurements of Local Variability
Selected 234 homes and 34 schools from 10 communities
Homes chosen based on stratified sample, above/below median distance from freeways
Two-week NO2 measurements using Palmes tubes in two seasons each (winter & summer)
NO, NO2, O3 measurements now available on about 1000 homes
PM measurements currently being made on ~300 homes
Gauderman et al., Epidemiology 2005;16: 737-43

Sampling Strategies
Case-control: choose S to be set of asthma cases and their town-matched controls
Surrogate diversity: choose S that maximizes the variance of traffic density
Spatial diversity: choose S that maximizes the geographic spread of measurements
  – Maximize total distance from all other points
  – Maximize minimum distance from nearest point
  – Maximize the informativeness of sample for predicting non-sample points
Hybrid: First measure cases and controls; then add additional subjects that would be most informative for refining E(X | Z, P, W)
Thomas, Lifetime Data Analysis 2007; 13: 565-81
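
The "maximize minimum distance from nearest point" criterion can be illustrated with a greedy heuristic. This is only a sketch under stated assumptions: the function name `greedy_maximin_sample`, the toy coordinates, and the greedy search itself are illustrative choices, not the procedure used in the cited paper.

```python
import numpy as np

def greedy_maximin_sample(coords, k, start=0):
    """Greedily pick k points; each step adds the candidate whose distance
    to its nearest already-chosen point is largest (a heuristic, not an
    exact optimizer of the maximin criterion)."""
    coords = np.asarray(coords, dtype=float)
    chosen = [start]
    # Distance from every point to its nearest chosen point so far.
    nearest = np.linalg.norm(coords - coords[start], axis=1)
    while len(chosen) < k:
        nxt = int(np.argmax(nearest))
        chosen.append(nxt)
        nearest = np.minimum(nearest, np.linalg.norm(coords - coords[nxt], axis=1))
    return chosen

# Hypothetical home locations (arbitrary units).
homes = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
sample = greedy_maximin_sample(homes, k=3)
```

With these toy coordinates the home at (0.1, 0.0) is never selected, because it sits next to the starting point and adds little geographic spread.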

Main Effects of Air Pollution: Intra-Community Variation in Measured NO2
[Figure, two panels: Asthmatic and Nonasthmatic; one estimate per community: AL, AT, LB, LE, LN, ML, SM, RV, SD, UP]
Gauderman et al., Epidemiology 2005;16: 737-43

Bayesian Spatial Measurement Error Model
[Diagram nodes: W = traffic, land use; Z = local exposure measurements; X = true exposure; Y = health outcome; L = locations; P = regional background; subsample S | Y, L, W]
Molitor et al, AJE 2006;164:69-76 (nonspatial)
Molitor et al, EHP 2007:1147-53 (spatial)

Spatial Regression Model
Exposure model:
  – $E(X_i) = W_i \alpha$, W = land use covariates, dispersion model predictions
  – $\operatorname{cov}(X_i, X_j) = \sigma^2 I_{ij} + \tau^2 \exp(-\rho D_{ij})$
  – MESA Air model: $x(s,t) = X_0(s) + \sum_k X_k(s) T_k(t)$
Measurement model: $E(Z_i) = X_i$
Disease model: $g[E(Y_i)] = \beta X_i$
Multivariate exposure model (“co-kriging”)
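
To make the exposure covariance concrete, the sketch below builds a matrix with an independent component on the diagonal plus a spatial component that decays exponentially with the distance D_ij between locations. The parameter names `sigma2`, `tau2`, and `rho`, and the toy coordinates, are assumptions made for illustration.

```python
import numpy as np

def exposure_covariance(locations, sigma2, tau2, rho):
    """Independent variance sigma2 on the diagonal plus a spatial term
    tau2 * exp(-rho * D_ij) based on Euclidean distances D_ij."""
    locations = np.asarray(locations, dtype=float)
    diff = locations[:, None, :] - locations[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))  # pairwise distance matrix
    return sigma2 * np.eye(len(locations)) + tau2 * np.exp(-rho * D)

# Hypothetical residence coordinates (km) and parameter values.
locs = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]])
Sigma = exposure_covariance(locs, sigma2=0.5, tau2=2.0, rho=0.1)
```

Homes 1 km apart end up more strongly correlated than homes 5 km apart, which is what lets measurements at sampled homes inform predictions at unsampled ones.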

Spatial Measurement Error Model Molitor et al, EHP 2007:1147-53

Multigenic Models
Focused Interaction Testing Framework (FITF) uses likelihood ratios to test for main effects and interactions conditional on lower-order ones
Dimension reduction by screening for G–G associations among pooled case-control sample before testing for interactions
False Discovery Rate used to assess significance
Better power than exploratory methods like MDR, except for interactions with no marginal effects
Millstein et al, AJHG 2006; 78:15-27

Multigenic Models: NQO1, MPO & CAT
Effects                   White & Hispanic      Nonwhites
NQO1                      0.49 (0.32 – 0.72)    0.42 (0.21 – 0.77)
MPO                       0.75 (0.49 – 1.13)    1.60 (0.93 – 2.75)
CAT                       0.88 (0.56 – 1.40)    0.71 (0.21 – 1.86)
NQO1 x MPO                1.48 (0.88 – 2.49)    1.29 (0.62 – 2.57)
NQO1 x CAT                1.39 (0.77 – 2.50)    0.76 (0.01 – 3.90)
MPO x CAT                 0.51 (0.25 – 0.99)    0.28 (0.04 – 1.45)
NQO1 x MPO x CAT          1.14 (0.51 – 2.51)    2.12 (0.26 – 14.1)
Unadjusted p              .00026                .00008
Significance threshold    .00052                .05
Millstein et al, AJHG 2006; 78: 15-27

Integrating Toxicology and Epidemiology
Suppose we conduct a semi-ecologic epidemiology study to observe $(Y_{ci}, X_c, G_{ci})$ for individuals i in community c
AND we characterize the biological activity $B_{cs}$ of samples s of the mixture $X_c$ in toxicologic assays on cells with genotypes $G_s$
Aim is to link the parameters of the two models, so toxicology can inform the epidemiologic analysis
[Diagram nodes: $Y_{ci}$ = health outcome; $B_{cs}$ = biological activity; $X_c$ = ambient pollution; $G_s$ = cell line genotype; $G_{ci}$ = individual genotype]

Putting It All Together
Use modeled local concentrations as input to microenvironmental model for personal exposure
Integrate over time for lifetime exposure
Estimate uncertainties and incorporate into exposure-response analysis
Integrate exposures, genes & biomarkers through a pathway-based biological model
Chamber studies using particle concentrator
Incorporate toxicological assessment of biological activity of town-specific particle composition

[Diagram: integrated exposure and disease model. Nodes: x(s,t) = spatio-temporal exposure field; Z_t = central site continuous time monitors; z_s = home & school measurements; W_st = exposure predictors (e.g., traffic, weather); Z_i = personal exposure measurements; X_i = long-term average personal exposure; L_i = latent disease process (e.g., inflammation); Y_i = clinical outcome (e.g., asthma); G_i = genes (& other risk factors); B_i = biomarkers (e.g., eNO); s_il = usual locations. Time-activity nodes (P_il, V_il, p_il, v_il): GIS location histories; accelerometer activity histories; usual physical activity (questionnaire); usual times (questionnaire); true long-term time-activity]

Modeling Entire Pathways
Hierarchical modeling approach (Conti et al, Hum Hered 2003;56:83-93)
  – Conventional logistic regression modeling of main effects and interactions
  – Second level model with priors for interactions
  – Bayes model averaging to allow for uncertainty about which terms to include
PBPK modeling approach (Cortessis & Thomas, IARC Sci Publ 2004;57:127-150)
  – Explicit modeling of postulated pathway(s)
  – Involving latent variables for intermediate metabolites and individual rate parameters

General Concept for a “Systems Biology” Perspective in Molecular Epidemiology
[Diagram, built up over three slides. First: exposures E and genes G enter as main effect and interaction covariates X for disease Y. Second: E and G act on Y through unobserved intermediate events. Third: the unobserved intermediate events X1, X2, X3, …, Xn-1, Xn form a network with an unknown “topology”, informed by “-omics” biomarker measurements B2, B3, … and by external biological knowledge Z (“ontologies”)]

Hierarchical Models
Incorporate external knowledge about pathways as “prior covariates” for coefficients of a data model
Level I: Epidemiologic data model:
  – $\operatorname{logit} \Pr(Y_i = 1 \mid X_i) = \beta_0 + \sum_p \beta_p X_{ip}$
  – X = (G, E, GxE, GxG, GxGxE, …)
Level II: Pathway model:
  – $\beta_p \sim N(\sum_v \pi_v Z_{pv}, \sigma^2)$
  – $Z_{pv}$ = prior covariates
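
One simple way to see how a second-level model acts on first-level coefficients is a two-stage "semi-Bayes" approximation: regress the first-stage estimates on the prior covariates Z, then shrink each estimate toward its fitted prior mean. This is a sketch, not the fully Bayesian fit described on the slide; the function name, the toy estimates, and the fixed second-level variance `sigma2` are all assumptions.

```python
import numpy as np

def semi_bayes_shrinkage(b_hat, var_hat, Z, sigma2):
    """Shrink first-stage log-OR estimates b_hat (with variances var_hat)
    toward prior means Z @ pi_hat, treating the second-level variance
    sigma2 as known (assumed, not estimated)."""
    b_hat = np.asarray(b_hat, dtype=float)
    var_hat = np.asarray(var_hat, dtype=float)
    Z = np.asarray(Z, dtype=float)
    w = 1.0 / (var_hat + sigma2)            # second-stage regression weights
    ZtW = Z.T * w
    pi_hat = np.linalg.solve(ZtW @ Z, ZtW @ b_hat)
    prior_mean = Z @ pi_hat
    keep = sigma2 / (sigma2 + var_hat)      # weight kept on the data estimate
    return keep * b_hat + (1.0 - keep) * prior_mean, prior_mean

# Four hypothetical coefficients; Z = intercept + indicator for one pathway.
b_hat = np.array([0.8, 0.6, 0.1, -0.1])
var_hat = np.array([0.04, 0.09, 0.04, 0.25])
Z = np.array([[1, 1], [1, 1], [1, 0], [1, 0]])
posterior, prior_mean = semi_bayes_shrinkage(b_hat, var_hat, Z, sigma2=0.05)
```

Imprecise estimates (large `var_hat`) are pulled most strongly toward the mean of their pathway, which is the sense in which the prior covariates define groups of coefficients that borrow strength from each other.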

Prior Covariates
Define potential “exchangeability classes”, not absolute values of differences
Examples:
  – Pathway indicators (Hung et al., CEBP 2004;13:1013-21)
  – In vitro functional assays (WECARE study, Concannon)
  – In silico predictions (SIFT, PolyPhen, etc.) (Zhu et al. Cancer Res 2004;64:2251-7)
  – Outputs from mechanistic models (e.g., PBPK) (Parl et al., Fund Molec Epi 2008, in press)
  – Formal ontologies (Conti, NCI Monogr 2007)

Hierarchical Models for GxG
Multivariate prior for $\beta_{GxG}$:
  – $\beta_p \sim N(\sum_v \pi_v Z_{pv}, \sigma^2)$
  – $\boldsymbol{\beta} \sim \mathrm{MVN}[\sum_v \pi_v \mathbf{Z}_v, \sigma^2 (I - \rho A)^{-1}]$
where A is an “adjacency” matrix describing the a priori similarity of pairs of genes derived from an ontology database or other sources
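
A small numerical sketch shows how an adjacency matrix A induces prior correlation between coefficients of genes judged similar a priori. The covariance is taken here as a scale factor times the inverse of (I - rho * A); the names `sigma2` and `rho`, the toy adjacency matrix, and the positive-definiteness check are assumptions for illustration.

```python
import numpy as np

def adjacency_prior_covariance(A, sigma2, rho):
    """Prior covariance sigma2 * (I - rho * A)^{-1} for a symmetric
    adjacency matrix A; rho must keep I - rho * A positive definite."""
    A = np.asarray(A, dtype=float)
    precision = (np.eye(len(A)) - rho * A) / sigma2
    if np.min(np.linalg.eigvalsh(precision)) <= 0:
        raise ValueError("rho is too large: I - rho * A is not positive definite")
    return np.linalg.inv(precision)

# Hypothetical ontology: genes 0 and 1 are "adjacent", gene 2 is unrelated.
A = np.array([[0, 1, 0],
              [1, 0, 0],
              [0, 0, 0]])
cov = adjacency_prior_covariance(A, sigma2=1.0, rho=0.5)
```

In this toy case the coefficients for genes 0 and 1 are positively correlated a priori, while gene 2 stays independent of both.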

Colorectal Polyps Model
[Diagram: two pathways leading to polyps (Y), with latent intermediates X1, X2 and Z1–Z7, genes G1–G6 and G8, and exposures E3, E5, E7]
Heterocyclic amines (HCA) pathway; polycyclic aromatic hydrocarbons (PAH) pathway
Exposures: well-done red meat, smoking
Metabolites: MeIQx, N-OH-MeIQx, N-Acetyl-OH-MeIQx; BaP, BaP 7,8-Epx, BaP 7,8-Diol 9,10-Epx
Genes: Cyp1A2, NAT1, NAT2, Cyp1A1, EPHX1 (mEH), GSTM3, UDP-GST

Complex Pathways Example: Folate
Linked differential equations models for biochemical reactions
Genotype-specific enzyme activity rates
Methionine intake and intracellular folate
Boxes are metabolite concentrations, enzymes
Ulrich et al., CEBP 2008:17:1822-31
Reed et al., J Nutr 2006;136:2653-61
Ulrich et al., Nat Rev Cancer 2003;3:912-20

Mechanistic Models
Combine differential equations models for pathway with stochastic distributions of individual metabolic rates, population parameters, and disease risks
Fitted using MCMC methods
Allow inference on:
  – contribution of each exposure to each pathway
  – contribution of each pathway to disease
  – contribution of each gene to relevant pathway
  – measures of individual heterogeneity

Stochastic Boolean Networks

Uncertainty in Pathway Structure
Techniques like logic regression (Kooperberg & Ruczinski, Gen Epi 2005;28:157-70) and Bayesian network analysis (Friedman, Science 2004; 303: 799-805) can be used to infer network structure
MCMC proceeds by adding, deleting nodes, changing node types, etc., to sample distribution of possible topologies
Summarize strength of evidence for each connection and marginal risk of disease, averaging over topologies

Network of Metabolic Pathways for Colorectal Cancer
Top: Folate metabolism (with DNA methylation and DNA damage / repair subpathways)
Middle: Bile acid metabolism
Bottom: PAH & HCA metabolism
Simulation of model uncertainty
“Ridiculome?”

Fitted Model (thickness of arrows indicates posterior probabilities)

A Cautionary Comment So, the modeling of the interplay of many genes — which is the aim of complex systems biology — is not without danger. Any model can be wrong (almost by definition), but particularly complex…models have much flexibility to hide their lack of biological relevance. Jansen RG. Studying complex biological systems through multifactorial perturbation. Nat Rev Genet 2003; 4: 145-151

Some GWAS Issues
Two-stage designs
Incorporating priors
Approaches to scanning for GxE
Unifying pathway-based and agnostic approaches
Post-GWAS

Some Methodological Issues in GWAS: The ENDGAME Consortium
Multistage study designs
Choice of platform for first stage
Multiple comparisons
Prioritizing SNPs for second stage
Haplotype analyses using tag SNPs: unifying association and sharing
GxE and GxG interactions
Control of population stratification
Thomas et al, AJHG 2005:77:337-45

Multistage Design
Stage I: full scan of 500,000 SNPs on sample of size $N_1$
Stage II: genotype only SNPs “significant” at level $\alpha_1$ from stage I on a new sample of size $N_2$
Final analysis combines both samples at significance level $\alpha_2$, chosen to ensure an overall Type I error rate $\alpha$
  – Significance assessed conditionally on hit in stage I
Optimize choice of $N_1$ and $\alpha_1$ to minimize cost subject to constraint on $\alpha$ and power
Satagopan et al., Genet Epidemiol 2003;25:149-57

Optimal Designs
Per-Genotype Cost Ratio = 17.5 for Stages II / I, Genomewide $\alpha$ = .05, $1 - \beta$ = 0.9, 500,000 SNPs in stage I
No additional SNPs at stage II:
  – Genotype 30% of sample in stage I
  – $\alpha_1$ = .0038 (i.e., 1900 SNPs in stage II)
  – $\alpha_2 = 1.7 \times 10^{-7}$
  – 87% of cost goes to stage I
Test 5 flanking markers per hit in stage II:
  – Genotype 49% of sample in stage I
  – $\alpha_1$ = .0005 (250 loci & 1500 SNPs in stage II)
  – $\alpha_2 = 0.5 \times 10^{-7}$
  – 95% of cost goes to stage I
Wang et al., Genet Epidemiol 2006:30:356-68
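
The stage II SNP counts quoted for the two designs follow from simple arithmetic: the expected number of stage I hits is the number of SNPs scanned times the stage I significance level, and each hit brings along its flanking markers. The sketch below assumes essentially all scanned SNPs are null; the function name is hypothetical.

```python
def stage2_workload(n_snps, alpha1, flanking_per_hit=0):
    """Expected stage II load if (nearly) every stage I SNP is null:
    returns (expected loci passing stage I, markers genotyped in stage II)."""
    expected_hits = n_snps * alpha1
    markers = expected_hits * (1 + flanking_per_hit)
    return expected_hits, markers

# Design without additional SNPs: stage I level .0038.
design_a = stage2_workload(500_000, 0.0038)
# Design with 5 flanking markers per hit: stage I level .0005.
design_b = stage2_workload(500_000, 0.0005, flanking_per_hit=5)
```

Up to floating-point rounding, `design_a` gives about 1900 SNPs, and `design_b` gives about 250 loci and 1500 markers, matching the figures on the slide.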

Hierarchical Approach to Prioritizing SNPs
Standard multistage designs assume the SNPs most significant in the first stage (at level $\alpha_1$) will be tested in later stage(s)
Can we do better?
  – False discovery rate weighted by prior knowledge (Roeder et al, AJHG 2006:78:243-52)
  – Bayesian FDR (Whittemore, J Appl Statist, 2007:34:1-9)
  – Empirical Bayes ranking, using an exchangeable mixture prior with a large mass at RR = 1
  – Adding prior knowledge to hierarchical Bayes (Lewinger et al, GE 2007;31:871-82)

Hierarchical Approach to Prioritizing SNPs
Three level model:
  – I: model for distribution of observed chi statistics in relation to true noncentrality parameter $\lambda$
  – II: mixture model for $\lambda$ as either null with probability $1 - p$ or non-null with probability $p$, mean $\mu$ and variance $\sigma^2$
  – III: logistic model for $p$ and linear model for $\mu$ as regressions on prior covariates Z
Ranking of SNPs by:
  – posterior probability of being non-null
  – posterior mean of $\lambda$ given non-null
Lewinger et al, GE 2007;31:871-82
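
The ranking quantities can be illustrated with a deliberately simplified normal version of the three-level model, in which a z statistic stands in for the chi statistic. All names here (`a0`, `a1`, `m0`, `m1`, `sigma2`, `prior_cov`) and the normal approximation itself are assumptions for the sketch, not the specification in the cited paper.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def rank_quantities(z, prior_cov, a0, a1, m0, m1, sigma2):
    """Normal-mixture sketch: z | lam ~ N(lam, 1); lam = 0 with prob 1 - p,
    lam ~ N(mu, sigma2) with prob p; p and mu depend on one prior covariate."""
    p = 1.0 / (1.0 + math.exp(-(a0 + a1 * prior_cov)))   # level III, logistic
    mu = m0 + m1 * prior_cov                             # level III, linear
    f_null = normal_pdf(z, 0.0, 1.0)
    f_alt = normal_pdf(z, mu, 1.0 + sigma2)              # lam integrated out
    post_prob = p * f_alt / (p * f_alt + (1.0 - p) * f_null)
    post_mean = mu + sigma2 / (1.0 + sigma2) * (z - mu)  # mean of lam if non-null
    return post_prob, post_mean

# Same statistic, with and without a favorable prior covariate (toy values).
with_prior = rank_quantities(3.0, 1, a0=-4.0, a1=2.0, m0=2.0, m1=0.0, sigma2=1.0)
without_prior = rank_quantities(3.0, 0, a0=-4.0, a1=2.0, m0=2.0, m1=0.0, sigma2=1.0)
```

Two SNPs with identical test statistics can therefore be ranked differently when their prior covariates differ, which is the point of adding the third level.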

 1 = 0.693  1 = 1.1 Lewinger et al, Gen Epi 2007; 31:871-82

Sample Sizes Needed for GxE
Required # case-control pairs, $\alpha = 0.05$ / $\alpha = 1 \times 10^{-7}$ (assuming we are testing the causal locus)

Minimum Detectable Effect Sizes
$\alpha = 1 \times 10^{-7}$, $1 - \beta = 0.80$, N = 1000 cases, 2000 controls
p(G)    OR G main effect    OR GxE interaction
                            p(E) = 0.1    p(E) = 0.4
0.05    2.05                8.6           4.3
0.10    1.77                5.4           3.2
0.20    1.68                4.3           2.7

Case-Only Design for GxE
              Cases                     Controls
Exposure      Non-carrier   Carrier     Non-carrier   Carrier
Unexposed     a             b           A             B
Exposed       c             d           C             D
OR GxE estimators:
  – Case-control: (ad/bc) / (AD/BC)
  – Case-only: ad/bc, assuming no G-E association in controls (Umbach et al, Stat Med 1994;13:153-62)
Case-only test has smaller variance (more power) than case-control test
Cannot first test the no-association assumption in controls and then decide whether to do case-only or case-control (Albert et al, AJE 2001:154:687-93)
But can combine case-only and case-control estimators (Mukherjee et al, GE 2008;32:615-26; Li & Conti, AJE in press)
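
The two estimators can be computed directly from the cell counts a, b, c, d (cases) and A, B, C, D (controls). The sketch below also reports the usual large-sample (Woolf-type) variances of the log estimates, which shows why the case-only estimator is more precise; the function name and the counts are hypothetical.

```python
import math

def gxe_estimators(cases, controls):
    """cases = (a, b, c, d), controls = (A, B, C, D), ordered as
    (unexposed non-carrier, unexposed carrier, exposed non-carrier, exposed carrier)."""
    a, b, c, d = cases
    A, B, C, D = controls
    case_only = (a * d) / (b * c)
    case_control = case_only / ((A * D) / (B * C))
    # Large-sample variances of the log odds ratio estimates.
    var_case_only = 1 / a + 1 / b + 1 / c + 1 / d
    var_case_control = var_case_only + 1 / A + 1 / B + 1 / C + 1 / D
    return case_only, case_control, var_case_only, var_case_control

# Toy counts in which G and E are exactly independent among controls.
co, cc, v_co, v_cc = gxe_estimators((400, 100, 300, 200), (400, 100, 400, 100))
```

With independence holding exactly in controls the two point estimates coincide, but the case-only variance omits the four control-cell terms and is therefore smaller.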

Case-control vs. Case-only Design
N for 80% power ($\alpha$ = .05): case-control / case-only

Two-Stage Approach to GxE
Step 1: Screen genome-wide to find SNPs most likely to be involved in a GxE interaction by testing for G-E association in combined case and control sample
Step 2: Only test these ‘likely’ SNPs using the standard 1-df case-control interaction test
Murcray et al. Am J Epidemiol 2009;169:219-26
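
The two steps can be sketched with 2x2 tables and Wald tests on log odds ratios. This is an illustrative simplification, assuming binary genotype and exposure, a Bonferroni correction over only the SNPs that pass step 1, and hypothetical SNP names and counts; it is not the exact procedure of the cited paper.

```python
import math
from statistics import NormalDist

def two_sided_p(z):
    return 2 * (1 - NormalDist().cdf(abs(z)))

def log_or_z(n00, n01, n10, n11):
    """Wald z statistic for the log odds ratio of a 2x2 table."""
    log_or = math.log((n00 * n11) / (n01 * n10))
    return log_or / math.sqrt(1 / n00 + 1 / n01 + 1 / n10 + 1 / n11)

def two_step_gxe(snps, alpha1, alpha):
    """snps: name -> (case_table, control_table); each table is a tuple
    (unexposed non-carrier, unexposed carrier, exposed non-carrier, exposed carrier)."""
    # Step 1: G-E association in the combined case + control sample.
    passed = []
    for name, (cases, controls) in snps.items():
        combined = [x + y for x, y in zip(cases, controls)]
        if two_sided_p(log_or_z(*combined)) < alpha1:
            passed.append(name)
    # Step 2: 1-df interaction test, corrected only for SNPs passing step 1.
    hits = []
    for name in passed:
        cases, controls = snps[name]
        diff = (math.log(cases[0] * cases[3] / (cases[1] * cases[2]))
                - math.log(controls[0] * controls[3] / (controls[1] * controls[2])))
        se = math.sqrt(sum(1 / n for n in cases + controls))
        if two_sided_p(diff / se) < alpha / len(passed):
            hits.append(name)
    return passed, hits

snps = {
    "snp_interacting": ((400, 100, 300, 200), (400, 100, 400, 100)),
    "snp_null": ((400, 100, 400, 100), (400, 100, 400, 100)),
}
passed, hits = two_step_gxe(snps, alpha1=0.05, alpha=0.05)
```

Because step 2 corrects for far fewer tests than a genome-wide scan would, an interaction that survives the screen faces a much less stringent threshold.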

GWAS Test for GxE Interaction: Power for 2-step vs. 1-step method Murcray et al., AJE 2009:169:219-26

Using Hierarchical Models to Incorporate Pathways into GWAS
Two approaches to unification
  – Use GWAS to “discover” pathways
  – Use pathways to inform GWAS
Approach 1: Bayesian network analysis, gene set enrichment analysis, or other exploratory methods
  – Subramanian et al, PNAS 2005; 102: 15545-50
Approach 2: Treat pathway indicators as prior covariates
  – Wang et al, Am J Hum Genet 2007;81:1278-83

Post-GWAS: Resequencing Designs
Joint probabilities of causal allele D by marker M and disease status Y, with Pr(D=1|M,Y)
Positive marker association
  Positive LD and positive causal association ($\delta = 0.036$, RR_YD = 2, RR_YM = 1.22)
    M    Y           D=0      D=1      Pr(D=1|M,Y)
    0    Controls    0.796    0.004    0.005
    0    Cases       0.758    0.008    0.010
    1    Controls    0.154    0.046    0.230
    1    Cases       0.147    0.088    0.374
  Negative LD and negative causal association ($\delta = -0.010$, RR_YD = 0, RR_YM = 1.067)
    0    Controls    0.750    0.050    0.063
    0    Cases       0.789    0.000
    1    Controls    0.200    0.000
    1    Cases       0.211    0.000
Negative marker association
  Negative LD and positive causal association ($\delta = -0.010$, RR_YD = 3, RR_YM = 0.889)
    0    Controls    0.750    0.050    0.063
    0    Cases       0.682    0.136
    1    Controls    0.200    0.000
    1    Cases       0.186    0.000
  Positive LD and negative causal association ($\delta = 0.036$, RR_YD = 0.5, RR_YM = 0.887)
    0    Controls    0.796    0.004    0.005
    0    Cases       0.816    0.002    0.003
    1    Controls    0.200    0.046    0.230
    1    Cases       0.211    0.024    0.130
Analysis strategy
Combines full sequencing information on a stratified subset with SNP data on main study
Thomas et al, GE 2007;27:401-4
Thomas et al, Statist Sci 2009, in press
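
The control rows of the first scenario can be checked by hand: the LD coefficient is the joint frequency of carrying both M and D minus the product of their marginal frequencies, and Pr(D=1|M) is the D=1 cell divided by its row total. The sketch below assumes the first block of rows refers to marker non-carriers and the second to carriers; the function name is hypothetical.

```python
import math

def ld_and_carrier_probs(p_m0_d0, p_m0_d1, p_m1_d0, p_m1_d1):
    """From the four joint (M, D) frequencies in one disease group, return
    the LD coefficient and Pr(D=1 | M=0), Pr(D=1 | M=1)."""
    p_m = p_m1_d0 + p_m1_d1              # marker carrier frequency
    p_d = p_m0_d1 + p_m1_d1              # causal allele carrier frequency
    delta = p_m1_d1 - p_m * p_d          # LD coefficient
    pr_d_given_m0 = p_m0_d1 / (p_m0_d0 + p_m0_d1)
    pr_d_given_m1 = p_m1_d1 / (p_m1_d0 + p_m1_d1)
    return delta, pr_d_given_m0, pr_d_given_m1

# Controls, "positive LD and positive causal association" scenario.
delta, pr0, pr1 = ld_and_carrier_probs(0.796, 0.004, 0.154, 0.046)
```

These inputs give an LD coefficient of about 0.036 and conditional probabilities of about 0.005 and 0.230, in line with the tabulated values, which is why sequencing effort is best concentrated on marker carriers in that scenario.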

Statistical Issues in Collaborations
Combining population-based, family-based, and pedigree studies
Meta-analysis or mega-analysis?
Data harmonization
  – Phenotypes
  – Genotypes
  – Exposures and other risk factors
Allowing for and understanding heterogeneity
  – Fixed vs random effects models
  – Meta-regression
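
The fixed versus random effects choice can be made concrete with inverse-variance pooling and the DerSimonian-Laird moment estimator of between-study variance. The slide does not name a specific estimator, so this choice, the function name, and the toy study estimates are assumptions.

```python
import numpy as np

def pool_estimates(estimates, variances):
    """Return (fixed-effect estimate, random-effects estimate, tau2) using
    inverse-variance weights and the DerSimonian-Laird estimate of tau2."""
    y = np.asarray(estimates, dtype=float)
    v = np.asarray(variances, dtype=float)
    w = 1.0 / v
    fixed = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - fixed) ** 2)                 # Cochran's Q
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / c)          # between-study variance
    w_star = 1.0 / (v + tau2)
    random = np.sum(w_star * y) / np.sum(w_star)
    return fixed, random, tau2

# Two hypothetical studies reporting discordant log odds ratios.
fixed, random, tau2 = pool_estimates([0.0, 1.0], [0.01, 0.04])
```

When studies disagree by more than their sampling error, the random-effects weights become more equal, so the pooled estimate moves away from the largest study and toward the simple average.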

Conclusions
Costs have now become feasible: many such studies now being undertaken
Results of first publications very promising
Efficient design and analysis strategies are essential
Rich area for statistical research
“Agnostic” genomewide scans and pathway-driven multigenic modeling are complementary

Acknowledgments
Epidemiology: John Peters, Frank Gilliland, Rob McConnell, Nino Kuenzli, Stephanie London
Biostatistics: Jim Gauderman, Kiros Berhane, Mike Jerrett, Bryan Langholz, David Conti, Dan Stram, Bill Navidi
Field Work & Exposure Assessment: Ed Avol, Fred Lurman, Field team (many!)
Funding: California Air Resources Board, Helene Margolis, National Institute of Environmental Health Sciences, National Heart, Lung & Blood Institute, Health Effects Institute
Data Management & Analysis: Ed Rappaport, Hita Vora, Josh Millstein, Yu-Fen Li, Talat Islam, John & Jassy Molitor
Genetics: Louis Dubeau
Respiratory Medicine: Bill Linn
