Sampling Design in Regional Fine Mapping of a Quantitative Trait Shelley B. Bull, Lunenfeld-Tanenbaum Research Institute, & Dalla Lana School of Public.

Sampling Design in Regional Fine Mapping of a Quantitative Trait
Shelley B. Bull, Lunenfeld-Tanenbaum Research Institute, & Dalla Lana School of Public Health, University of Toronto
Banff International Research Station, Emerging Statistical Challenges and Methods
Session 7: GWAS and Beyond II, 25 June 2014
Co-authors: Zhijian Chen and Radu Craiu, Lunenfeld-Tanenbaum Research Institute & University of Toronto

Overview
Setting
  Studies designed to follow up associations detected in a GWAS
  Fine-mapping of a candidate region by sequencing
  Aim to identify a functional sequence variant
Approach
  Phase I: Quantitative trait with GWAS data (e.g. N = 5000)
  Phase II: Two-stage design
    Stage 1 sample (n1): expensive sequencing to identify a smaller set of promising variants
    Stage 2 sample (n2): cost-effective genotyping of selected variants in an independent group
  Stratification in Stage 1 according to a promising GWAS tag SNP
  Bayesian analysis in Stage 1, incorporating genetic model selection

Two-phase Two-stage Design

Background
Two-phase designs, with or without stratification on the tag SNP
  Chen et al (2012), Schaid et al (2013), Thomas et al (2013)
  Earlier: case-cohort designs
Two-stage designs
  Skol et al (2007), Thomas et al (2009), Stanhope & Skol (2012)
Bayesian approaches to genetic association
  Stephens & Balding (2009), Wakefield (2009), WTCCC/Maller et al (2012)
Genetic model (mis)specification
  Joo et al (2010), Spencer et al (2011), Vukcevic et al (2011), Faye et al (2013)

Sampling Designs & Sample Allocation
Based on the tag SNP genotype (AA, Aa, aa) from the GWAS:
(1) Simple random sampling (SRS): ignores tag SNP information
(2) Equal (ES) number from each stratum
(3) Oversampled homozygous (HO): number larger than under SRS
Example: N = 5000, MAF = 0.2
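The allocation example (N = 5000, MAF = 0.2) can be made concrete with a short sketch. It assumes Hardy-Weinberg proportions for the tag SNP and takes "a" as the minor allele; neither is stated on the slide. The function names are hypothetical, and the HO design is left out because the slide does not give its allocation rule.

```python
def hwe_strata(n_total, maf):
    """Expected tag-SNP stratum sizes, assuming Hardy-Weinberg equilibrium."""
    p = 1.0 - maf  # frequency of the major allele A (assumption: a is minor)
    return {"AA": n_total * p * p,
            "Aa": n_total * 2 * p * maf,
            "aa": n_total * maf * maf}

def srs_expected(strata, n1):
    """Expected stratum counts in a simple random sample of size n1."""
    total = sum(strata.values())
    return {g: n1 * size / total for g, size in strata.items()}

def equal_allocation(strata, n1):
    """Equal number per stratum; a stratum that is too small is taken
    whole and the shortfall is shared equally among the others."""
    alloc = {g: 0.0 for g in strata}
    open_strata = set(strata)
    remaining = float(n1)
    while open_strata and remaining > 1e-9:
        share = remaining / len(open_strata)
        capped = {g for g in open_strata if strata[g] - alloc[g] <= share}
        if not capped:
            for g in open_strata:
                alloc[g] += share
            remaining = 0.0
        else:
            for g in capped:
                remaining -= strata[g] - alloc[g]
                alloc[g] = strata[g]
            open_strata -= capped
    return alloc

strata = hwe_strata(5000, 0.2)       # about 3200 AA, 1600 Aa, 200 aa
srs = srs_expected(strata, 600)      # about 384 AA, 192 Aa, 24 aa
es = equal_allocation(strata, 600)   # 200 from each stratum
```

Under these assumptions, a Stage 1 sample of 600 drawn by SRS is expected to contain only about 24 aa individuals, whereas ES takes essentially the whole aa stratum.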

Quantitative Trait Model
QT model parameters: θ = (β0, β1, σ²)
Genetic models: M1 = additive, M2 = dominant, M3 = recessive
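The model equation on this slide did not survive in the transcript. As a hedged illustration only, the sketch below uses a common convention for the three genetic models: the genotype is the count of the effect allele (0, 1, 2) and each model recodes it as a single covariate. The function name and the coding are assumptions, not taken from the slides.

```python
def code_genotype(allele_count, model):
    """Recode an effect-allele count (0, 1, 2) under a genetic model.

    Assumed conventional codings:
      additive  -> 0, 1, 2
      dominant  -> 0, 1, 1  (one copy is enough)
      recessive -> 0, 0, 1  (two copies are required)
    """
    if allele_count not in (0, 1, 2):
        raise ValueError("allele_count must be 0, 1 or 2")
    if model == "additive":
        return allele_count
    if model == "dominant":
        return 1 if allele_count >= 1 else 0
    if model == "recessive":
        return 1 if allele_count == 2 else 0
    raise ValueError("unknown genetic model: " + model)

codings = {m: [code_genotype(x, m) for x in (0, 1, 2)]
           for m in ("additive", "dominant", "recessive")}
```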

Bayesian Inference: Stage 1 sample
(1) Specify priors for the genetic models and the regression parameters:
    p(Mj) = ⅓
    p(θ | Mj) = p(θ)
    p(θ) = p(β0, β1 | σ²) p(σ²), normal-inverse-gamma (NIG)
(2) Derive the model-specific posterior for the regression parameters for a functional sequence variant; this is analytic when the prior is NIG
(3) Select a genetic model for each sequence variant according to the posterior probability wj = p(Mj | data)
(4) Given the selected genetic model, compare all sequence variants in the region by computing the posterior probability (from the Bayes factor) that variant k is functional given all the data, and rank them: p(1) ≥ p(2) ≥ … ≥ p(m)
(5) Construct a 95% credible set that includes the top-ranked variants such that p(1) + p(2) + … + p(k) ≥ 0.95 for minimum k
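Steps (3) and (5) can be sketched in a few lines. This is an illustration under stated assumptions: the log marginal likelihoods log p(data | Mj) and the per-variant posterior probabilities are taken as given inputs (the NIG calculations that produce them are not reproduced here), and the function names are hypothetical.

```python
import math

def model_posteriors(log_marginals, priors=None):
    """Posterior model weights w_j from log p(data | M_j) and priors p(M_j)."""
    k = len(log_marginals)
    if priors is None:
        priors = [1.0 / k] * k  # equal priors, e.g. 1/3 for three models
    log_terms = [lm + math.log(p) for lm, p in zip(log_marginals, priors)]
    top = max(log_terms)  # subtract the maximum for numerical stability
    weights = [math.exp(t - top) for t in log_terms]
    total = sum(weights)
    return [w / total for w in weights]

def credible_set(post_probs, level=0.95):
    """Indices of the smallest set of top-ranked variants whose
    posterior probabilities sum to at least `level`."""
    order = sorted(range(len(post_probs)),
                   key=lambda k: post_probs[k], reverse=True)
    chosen, total = [], 0.0
    for k in order:
        chosen.append(k)
        total += post_probs[k]
        if total >= level - 1e-12:  # small tolerance for rounding error
            break
    return chosen

w = model_posteriors([-10.0, -12.0, -11.0])       # made-up log marginals
selected = credible_set([0.5, 0.1, 0.3, 0.07, 0.03])  # -> [0, 2, 1, 3]
```

The input numbers are made up for illustration. The variant at index 4 is left out because the four top-ranked variants already reach 0.97.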

Criteria for a Good Design
Higher probability that the correct genetic model is identified for the sequence variant
Fewer sequence variants selected into the credible set (number and %) [cost]
Higher probability that the functional sequence variant is selected into the credible set [power]
Higher probability that the functional sequence variant is top ranked in the credible set

Simulation Design (APOE gene region, 1KG)
Quantitative trait model: Y = β0 + β1 X + γ 1(X=1) + ε, with error variance σ²
Parameters: β0 = 5, β1 = 0.25, σ² = 0.1, 0.5, 1.5, corresponding to σ/β1 = 1.3, 2.8, 4.9
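A minimal sketch of simulating from this trait model follows. Several choices are assumptions, not taken from the slides: genotypes are drawn under Hardy-Weinberg equilibrium, the error is normal, and γ is set to 0, β1 or −β1 so that the heterozygote mean matches the additive, dominant or recessive pattern. The real simulations used sequence data from the APOE region, which this sketch does not reproduce.

```python
import math
import random

# Assumed link between gamma and the genetic model:
# additive: gamma = 0; dominant: gamma = beta1 (Aa mean equals aa mean);
# recessive: gamma = -beta1 (Aa mean equals AA mean).
GAMMA_FACTOR = {"additive": 0.0, "dominant": 1.0, "recessive": -1.0}

def genotype_mean(x, beta0=5.0, beta1=0.25, model="additive"):
    """Mean of Y for effect-allele count x: beta0 + beta1*x + gamma*1(x=1)."""
    gamma = GAMMA_FACTOR[model] * beta1
    return beta0 + beta1 * x + gamma * (1 if x == 1 else 0)

def simulate_trait(n, maf, sigma2, model="additive", seed=0):
    """Return n (genotype, trait) pairs with normal errors of variance sigma2."""
    rng = random.Random(seed)
    sd = math.sqrt(sigma2)
    data = []
    for _ in range(n):
        # two independent allele draws, i.e. Hardy-Weinberg proportions
        x = int(rng.random() < maf) + int(rng.random() < maf)
        data.append((x, rng.gauss(genotype_mean(x, model=model), sd)))
    return data

# The three noise settings on the slide, as sigma / beta1 ratios
ratios = [round(math.sqrt(s2) / 0.25, 1) for s2 in (0.1, 0.5, 1.5)]
sample = simulate_trait(1000, maf=0.2, sigma2=1.5, model="dominant")
```

The `ratios` line reproduces the slide's σ/β1 values of 1.3, 2.8 and 4.9 from the three σ² settings.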

Simulation Results: Genetic model selection
Data simulated under additive, dominant and recessive genetic models.
The rate of selecting the true genetic model for the functional variant using the strong criterion of wj > [threshold not shown]
Common seq variant (MAF=0.2)
1000 simulations
Designs: SRS, ES, HO

Simulation Results: Size of the 95% credible set
Data simulated under additive, dominant and recessive genetic models.
Upper panels: common variant (MAF=0.2) with σ/β1=4.9 (m=201)
Lower panels: low frequency variant (MAF=0.02) with σ/β1=2.8 (m=332)
1000 simulations
Designs: SRS, ES, HO

Simulation Results: Selection of functional variant
Data simulated under additive, dominant and recessive genetic models.
Upper panels: common variant (MAF=0.2) with σ/β1=4.9 (m=201)
Lower panels: low frequency variant (MAF=0.02) with σ/β1=2.8 (m=332)
1000 simulations
Designs: SRS, ES, HO

Simulation Results: Functional variant top ranked
Data simulated under additive, dominant and recessive genetic models.
Upper panels: common variant (MAF=0.2) with σ/β1=4.9 (m=201)
Lower panels: low frequency variant (MAF=0.02) with σ/β1=2.8 (m=332)
1000 simulations
Designs: SRS, ES, HO

Simulation Results: Model selection
Data simulated under additive, dominant and recessive genetic models.
Cases without model selection (no MS) are analysed under an additive model.
Common seq variant (MAF=0.2), σ/β1=4.9, n1=600, 1000 simulations

Simulation Results: Cost Efficiency (CE)
A total of m sequence variants are identified in n1 individuals in stage 1, and a proportion q = m2 / m of them are genotyped in n2 = N − n1 individuals in stage 2.
Cost depends on c1, the stage 1 per-individual sequencing cost, and on c2, the stage 2 per-individual, per-marker genotyping cost.
CE is defined as "Power" / Cost, where "Power" is estimated by the probability that a functional variant falls within the 95% credible set.
E.g. if N = 5000, n1 = 500, c1 = $1000, n2 = 4500, m2 = 100, and c2 = $0.50, then the total two-stage cost is $500,000 + $225,000 = $725,000, compared to a one-stage cost of $5 million.
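The cost arithmetic in the example can be checked with a short sketch. The function names are hypothetical. The one-stage cost is taken to mean sequencing all N individuals at c1, which matches the $5 million figure.

```python
def two_stage_cost(n1, c1, n2, m2, c2):
    """Stage 1 sequencing of n1 individuals plus stage 2 genotyping
    of m2 selected variants in n2 individuals."""
    return n1 * c1 + n2 * m2 * c2

def one_stage_cost(n_total, c1):
    """Cost of sequencing everyone (assumed meaning of 'one-stage')."""
    return n_total * c1

def cost_efficiency(power, cost):
    """CE = "Power" / Cost, as defined on the slide."""
    return power / cost

two_stage = two_stage_cost(n1=500, c1=1000, n2=4500, m2=100, c2=0.50)
one_stage = one_stage_cost(n_total=5000, c1=1000)
# two_stage == 725000.0 and one_stage == 5000000
```

With these unit costs the two-stage design costs about 14.5% of the one-stage design, so it is more cost-efficient whenever its "Power" exceeds 14.5% of the one-stage "Power".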

Comments and Discussion
Incorporating Bayesian genetic model selection is worthwhile
Selection of informative individuals for expensive data collection can be a useful strategy in statistical genetic design and analysis
The simulations confirm the intuition that the efficiency of the tag-stratified sampling strategy increases with tag-seq correlation
Winner's curse effects propagate from the GWAS, but are more complicated
Cost-efficiency of a two-stage design depends on the relative costs of sequencing versus genotyping: will it remain practical?
Analysis of the sequence data was limited to low frequency and common variants; extensions to rare variants are needed
Other design options: trait-dependent sampling
How to conduct joint Bayesian inference for stages 1 and 2?

Acknowledgements
Co-Authors:
  Zhijian Chen, STAGE Post-doctoral Fellow
  Radu Craiu, Dept of Statistical Sciences
Thanks to Laura Faye and Andrew Paterson for helpful discussions, and to referees for improvements to the paper.
To appear in Genetic Epidemiology
Funding

Thanks

Simulation Results Summary
In stage 1, a total of m variants are sequenced in n1 = 500 individuals, with equal strata sampling (ES) and an additive genetic model.
Size is the number m2 of sequence SNPs in the 95% credible set (% or count).
P(Select) is the probability the functional variant is selected into the credible set.
P(Rank) is the probability the functional variant is top ranked in the credible set.

GWAS Sample Size
