A coalescent computational platform to predict strength of association for clinical samples Gabor T. Marth Department of Biology, Boston College

Presentation transcript:

A coalescent computational platform to predict strength of association for clinical samples. Gabor T. Marth, Department of Biology, Boston College. Genomic studies and the HapMap, March 15-18, 2005, Oxford, United Kingdom

Focal questions about the HapMap
1. Required marker density
2. How to quantify the strength of allelic association in a genome region
3. How to choose tagging SNPs
4. How general the answers to these questions are among different human populations
(Figure panels: CEPH European samples; Yoruban samples)

Across samples from a single population? (random 60-chromosome subsets of 120 CEPH chromosomes from 60 independent individuals)
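The subsetting described on this slide can be sketched in Python. This is a minimal illustration rather than the procedure used to produce the slides: the 0/1 haplotype encoding, the function names, the toy data, and the fixed seeds are all assumptions.

```python
import random

def random_chromosome_subsets(chromosomes, subset_size, n_subsets, seed=0):
    """Draw n_subsets random subsets (no chromosome repeated within a subset)."""
    rng = random.Random(seed)
    return [rng.sample(chromosomes, subset_size) for _ in range(n_subsets)]

def allele_frequency(haplotypes, site):
    """Frequency of allele 1 at one marker."""
    return sum(h[site] for h in haplotypes) / len(haplotypes)

# Toy stand-in for 120 phased chromosomes at 3 markers (made-up data).
toy_rng = random.Random(1)
chromosomes = [tuple(toy_rng.randint(0, 1) for _ in range(3)) for _ in range(120)]

# Five random 60-chromosome subsets and the allele frequency at marker 0 in each.
subsets = random_chromosome_subsets(chromosomes, subset_size=60, n_subsets=5)
frequencies = [allele_frequency(s, site=0) for s in subsets]
```

Comparing `frequencies` (or any association statistic) across the subsets gives a first impression of how much a summary can move between samples drawn from one population.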

Possible consequence for marker performance: Markers selected based on the allele structure of the HapMap reference samples may not work well in another set of samples, such as those used for a clinical study.

How to assess sample-to-sample variability?
1. Understand fundamental characteristics of a given genome region, e.g. by estimating the local recombination rate from the data (McVean et al., Science)
2. Experimentally genotype additional sets of samples, and compare association structure across consecutive sets directly
3. A desirable alternative would be to generate such additional sets by computational means

Towards a marker selection tool
1. Select markers (tag SNPs) with standard methods
2. Generate computational samples
3. Test the performance of markers across consecutive sets of computational samples
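The testing step can be sketched as follows. The slides do not name an association measure or a threshold, so the use of pairwise r² with a cutoff of 0.8, the 0/1 haplotype encoding, and all function names are assumptions made for illustration.

```python
def r_squared(haps, i, j):
    """Pairwise r^2 between biallelic markers i and j (alleles coded 0/1)."""
    n = len(haps)
    p_i = sum(h[i] for h in haps) / n
    p_j = sum(h[j] for h in haps) / n
    p_ij = sum(1 for h in haps if h[i] == 1 and h[j] == 1) / n
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:  # at least one marker is monomorphic in this sample
        return 0.0
    d = p_ij - p_i * p_j
    return d * d / denom

def fraction_captured(haps, tags, threshold=0.8):
    """Fraction of markers that are tags or reach r^2 >= threshold with some tag."""
    m = len(haps[0])
    captured = sum(
        1 for s in range(m)
        if s in tags or any(r_squared(haps, s, t) >= threshold for t in tags)
    )
    return captured / m

def performance_across_sets(sample_sets, tags, threshold=0.8):
    """Evaluate one fixed tag set on every sample set."""
    return [fraction_captured(haps, tags, threshold) for haps in sample_sets]

# Made-up data: markers 0 and 1 are perfectly associated in set_a but not in set_b.
set_a = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
set_b = [(0, 0, 0), (0, 1, 0), (1, 0, 1), (1, 1, 1)]
results = performance_across_sets([set_a, set_b], tags=[0, 2])
```

In this toy case the tags chosen on `set_a` capture every marker there but miss marker 1 in `set_b`, which is the kind of drop in performance the tool is meant to expose.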

Generating additional computational haplotypes
1. Generate a pair of haplotype sets with Coalescent genealogies. This “models” that the two sets are “related” to each other by being drawn from a single population.
2. Enforce data-relevance by requiring that the first set reproduces the observed haplotype structure of the HapMap reference samples. Calculate the “degree of relevance” as the data likelihood (the probability that the genealogy does produce the observed haplotypes).
3. Use the second haplotype set, induced by the same mutations, as our computational samples.
4. In subsequent statistics, weight each such set proportional to the data likelihood calculated in step 2.
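Steps 1 to 3 can be illustrated with a small rejection-style sketch. It is not the authors' software: it assumes a standard Kingman coalescent without recombination, a fixed number of mutations (one per marker) placed on branches in proportion to branch length, and an exact match of haplotype counts as the data-relevance test. All names and the toy data are hypothetical.

```python
import random
from collections import Counter

def simulate_paired_sets(n_ref, n_new, k_sites, rng):
    """Simulate one genealogy shared by both sets and drop k_sites mutations on it.

    Returns (reference_set, new_set); each haplotype is a tuple of 0/1 alleles
    (0 = ancestral, 1 = derived).
    """
    n = n_ref + n_new
    # Each active lineage: (sampled chromosomes below it, time it was created).
    active = [(frozenset([i]), 0.0) for i in range(n)]
    branches = []  # (descendant chromosomes, branch length)
    now = 0.0
    while len(active) > 1:
        k = len(active)
        now += rng.expovariate(k * (k - 1) / 2)  # waiting time to next coalescence
        a, b = rng.sample(range(k), 2)           # the two lineages that merge
        (desc_a, start_a), (desc_b, start_b) = active[a], active[b]
        branches.append((desc_a, now - start_a))
        branches.append((desc_b, now - start_b))
        active = [x for idx, x in enumerate(active) if idx not in (a, b)]
        active.append((desc_a | desc_b, now))
    # Assumption: exactly k_sites mutations, each on a branch chosen with
    # probability proportional to its length (one mutation per marker).
    lengths = [length for _, length in branches]
    mutated = rng.choices(branches, weights=lengths, k=k_sites)
    haps = [tuple(int(i in desc) for desc, _ in mutated) for i in range(n)]
    return haps[:n_ref], haps[n_ref:]

def data_relevant_sets(observed, n_new, n_sets, rng, max_tries=200_000):
    """Keep the second set whenever the first set reproduces the observed haplotypes."""
    target = Counter(observed)
    kept, tries = [], 0
    while len(kept) < n_sets and tries < max_tries:
        tries += 1
        ref, new = simulate_paired_sets(len(observed), n_new, len(observed[0]), rng)
        if Counter(ref) == target:  # same haplotypes with the same counts
            kept.append(new)
    return kept, tries

# Made-up 2-site reference data for 4 chromosomes.
observed = [(0, 0), (0, 0), (1, 0), (1, 1)]
computational_sets, tries = data_relevant_sets(
    observed, n_new=4, n_sets=5, rng=random.Random(1)
)
```

The ratio `len(computational_sets) / tries` shows how many genealogies are discarded per match. The likelihood weighting of step 4 is not implemented in this sketch.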

Generating computational samples. Problem: The efficiency of generating data-relevant genealogies (and therefore additional sample sets) with standard Coalescent tools is very low, even for modest sample size (N) and number of markers (M). Despite serious efforts with various approaches (e.g. importance sampling), efficient generation of such genealogies is an unsolved problem. We propose a method to generate “approximate” M-marker haplotypes by composing consecutive, overlapping sets of data-relevant K-site haplotypes (for small K).

Approximating M-site haplotypes as composites of overlapping K-site haplotypes
1. Generate K-site sets
2. Build M-site composites

Piecing together neighboring K-site sets: the hope is that the constraint at overlapping markers preserves long-range marker association.

Building composite haplotypes
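One plausible way to piece neighboring K-site sets together is sketched below. The slides do not spell out the joining rule, so this version assumes that consecutive windows are shifted by one marker and that each composite haplotype is extended by a not-yet-used K-site haplotype agreeing with it on the K-1 overlapping markers. Function names and data are hypothetical.

```python
import random
from collections import Counter, defaultdict

def k_site_windows(haplotypes, k):
    """Cut M-site haplotypes into overlapping K-site sets (window shift = 1 marker)."""
    m = len(haplotypes[0])
    return [[tuple(h[s:s + k]) for h in haplotypes] for s in range(m - k + 1)]

def build_composite(window_sets, rng):
    """Stitch consecutive K-site sets into one set of M-site composite haplotypes."""
    composite = [tuple(h) for h in window_sets[0]]
    k = len(composite[0])
    for window in window_sets[1:]:
        # Group the next window's haplotypes by their first K-1 alleles.
        pool = defaultdict(list)
        for h in window:
            pool[tuple(h[:k - 1])].append(h[k - 1])
        for alleles in pool.values():
            rng.shuffle(alleles)  # random pairing among compatible haplotypes
        extended = []
        for c in composite:
            overlap = c[len(c) - (k - 1):]  # last K-1 alleles of the composite
            if not pool[overlap]:
                raise ValueError("no unused K-site haplotype matches the overlap")
            extended.append(c + (pool[overlap].pop(),))
        composite = extended
    return composite

# Made-up 5-marker data; the windows here are exact 3-site sub-haplotypes,
# so the overlaps are guaranteed to be consistent.
data = [(0, 0, 1, 0, 1), (0, 1, 1, 0, 0), (1, 1, 0, 1, 0),
        (0, 0, 1, 1, 1), (1, 0, 0, 0, 1), (0, 1, 1, 0, 0)]
windows = k_site_windows(data, k=3)
composite = build_composite(windows, random.Random(0))
```

By construction every 3-site window of `composite` has the same haplotype counts as the data, while associations between markers more than K-1 positions apart are only preserved to the extent that the overlap constraint carries them.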

Initial results: 3-site composite haplotypes. (Figure panels: a typical 3-site composite; 30 CEPH HapMap reference individuals, 60 chromosomes)

3-site composite vs. data

3-site composites: the “best case”. The “best-case” 3-site scenario is a composite of exact 3-site sub-haplotypes. (Figure labels: “short-range”, “long-range”)

Variability across sets: The purpose of the composite haplotype sets is to model sample variance across consecutive data sets. But the variability across the composite haplotype sets is compounded by the inherent loss of long-range association when 3-site haplotypes are used.

4-site composite haplotypes. (Figure panel: 4-site composite)

“Best-case” 4-site composites: composite of exact 4-site sub-haplotypes

Variability across 4-site composites is comparable to the variability across data sets.

Technical/algorithmic improvements
1. Un-phased genotypes (figure example: genotypes (AC)(CG)(AT)(CT); haplotypes A G A C and C C T T)
2. Markers with unknown ancestral state (figure example: A/C ?)
3. Dealing with uninformative markers
Also: taking into account the local recombination rate

Software engineering aspects: efficiency Currently, we run fresh Coalescent simulations at each K-site (several hours per region). This discards most Coalescent genealogies as irrelevant. Total # genotyped SNPs is ~ 1 million -> 1 million different K-sites to match. Any given Coalescent genealogy is likely to match one or more of these. Haplotype sets resulting from matches can be loaded into, stored in, and retrieved from a database efficiently. 4 HapMap populations x 1 million K-sites x 1,000 comp sets x 50 bytes < 200 Gigabytes
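The storage estimate on this slide can be checked with a few lines of arithmetic. The variable names are illustrative; the four factors are the ones quoted on the slide.

```python
populations = 4              # HapMap populations
k_sites = 1_000_000          # one K-site per genotyped SNP
sets_per_k_site = 1_000      # computational sets stored per K-site
bytes_per_set = 50           # storage per haplotype set

total_bytes = populations * k_sites * sets_per_k_site * bytes_per_set
total_gb = total_bytes / 10**9    # decimal gigabytes
total_gib = total_bytes / 2**30   # binary gigabytes (GiB)
```

The product is 2 × 10^11 bytes, i.e. 200 GB in decimal units or roughly 186 GiB in binary units, so the "< 200 Gigabytes" figure holds when gigabytes are counted in powers of two.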

Acknowledgements: Eric Tsung, Aaron Quinlan, Ike Unsal, Eva Czabarka (Dept. Mathematics, William & Mary)

Testing markers with composite sets

Using the HapMap
1. Genotype a set of reference samples
2. Compute strength of association
3. Select a smaller set of markers that capture most of the information present in the complete set of markers
4. Use these markers in clinical studies
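Steps 2 and 3 can be sketched as below. The slides only refer to "standard methods", so the choices here are assumptions: pairwise r² as the strength of association, a threshold of 0.8, and a simple greedy selection in which each new tag is the marker that covers the most still-untagged markers. Haplotypes are coded as tuples of 0/1 alleles and all names are hypothetical.

```python
def r_squared(haps, i, j):
    """Pairwise r^2 between biallelic markers i and j (alleles coded 0/1)."""
    n = len(haps)
    p_i = sum(h[i] for h in haps) / n
    p_j = sum(h[j] for h in haps) / n
    p_ij = sum(1 for h in haps if h[i] == 1 and h[j] == 1) / n
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:  # at least one marker is monomorphic in this sample
        return 0.0
    d = p_ij - p_i * p_j
    return d * d / denom

def greedy_tags(haps, threshold=0.8):
    """Pick tags until every marker is a tag or has r^2 >= threshold with a tag."""
    untagged = set(range(len(haps[0])))
    tags = []
    while untagged:
        def coverage(s):
            return sum(1 for t in untagged
                       if t == s or r_squared(haps, s, t) >= threshold)
        # Ties are broken in favor of the lowest marker index.
        best = max(sorted(untagged), key=coverage)
        tags.append(best)
        untagged -= {t for t in untagged
                     if t == best or r_squared(haps, best, t) >= threshold}
    return tags

# Made-up reference haplotypes: markers 0 and 1 always agree, marker 2 is independent.
reference = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
tags = greedy_tags(reference)
```

For this toy data the sketch keeps markers 0 and 2 and drops marker 1, since marker 1 carries no information beyond marker 0 in the reference samples.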

Allele structure varies among populations. (Figure panels: CEPH European samples; Yoruban samples)

Data probability for composite haplotypes (motivated by composite likelihood methods for recombination rate estimation, e.g. by Hudson, Clark, Wall):
Pr(composite) = Pr(K-site 1) × Pr(K-site 1 ~ K-site 2) × Pr(K-site 2) × Pr(K-site 2 ~ K-site 3) × Pr(K-site 3)
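The product above is easiest to handle in log space, since the individual factors are small. The sketch below assumes the pattern extends to any number of K-sites (one Pr(K-site i) factor per window and one Pr(K-site i ~ K-site i+1) factor per neighboring pair); the function name and the numeric values are made up for illustration.

```python
import math

def composite_log_probability(site_probs, join_probs):
    """Log of Pr(composite): sum of log Pr(K-site i) and log Pr(K-site i ~ K-site i+1)."""
    if len(join_probs) != len(site_probs) - 1:
        raise ValueError("need exactly one join term per pair of neighboring K-sites")
    return (sum(math.log(p) for p in site_probs)
            + sum(math.log(p) for p in join_probs))

# Hypothetical values for a composite built from three K-sites.
site_probs = [0.01, 0.02, 0.005]   # Pr(K-site 1), Pr(K-site 2), Pr(K-site 3)
join_probs = [0.5, 0.25]           # Pr(K-site 1 ~ K-site 2), Pr(K-site 2 ~ K-site 3)
log_p = composite_log_probability(site_probs, join_probs)
```

Exponentiating `log_p` recovers the plain product, and differences of such log values between composite sets can serve as relative weights.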

Generating K-site haplotypes: about 1 match per 100 to 10,000 Coalescent genealogies (K = 3, 4). (Figure label: reference data)

Example: CFTR gene (Hinds et al., Science, 2005)

4-site composite haplotypes. (Figure panels: 4-site composite #1; 4-site composite #2; HapMap data)

4-site composites vs. data

Why should this work? Tease apart two questions: (1) to what degree K-site composites preserve long-range correlations between markers (really, the quality of the approximation), and (2) the variability across different sets (what we are interested in).

Example: 4-site approximation. (Figure panels: 4-site composite #1; 4-site composite #2; 4-site composite #3; 4-site composite #4)