A coalescent computational platform to predict strength of association for clinical samples Gabor T. Marth Department of Biology, Boston College

Presentation transcript:

A coalescent computational platform to predict strength of association for clinical samples. Gabor T. Marth, Department of Biology, Boston College. Genomic studies and the HapMap, March 15-18, 2005, Oxford, United Kingdom.

Focal questions about the HapMap
1. Required marker density
2. How to quantify the strength of allelic association in a genome region
3. How to choose tagging SNPs
4. How general the answers to these questions are among different human populations
(Figure panels: CEPH European samples; Yoruban samples)

Across samples from a single population? (random 60-chromosome subsets of 120 CEPH chromosomes from 60 independent individuals)

Consequence for marker performance: markers selected based on the allele structure of the HapMap reference samples may not work well in another set of samples, such as those used for a clinical study.

How to assess sample-to-sample variability?
1. Understanding intrinsic properties of a given genome region, e.g. estimating local recombination rate from the HapMap data (McVean et al., Science).
2. Experimentally genotype additional sets of samples, and compare association structure across consecutive sets directly.
3. It would be a desirable alternative to generate such additional sets with computational means.

Towards a marker selection tool
1. select markers (tag SNPs) with standard methods
2. generate computational samples for this genome region
3. test the performance of markers across consecutive sets of computational samples

Generating additional computational haplotypes
1. Generate a pair of haplotype sets with Coalescent genealogies. This “models” that the two sets are “related” to each other by being drawn from a single population.
2. Only accept the pair if the first set reproduces the observed haplotype structure of the HapMap reference samples. This enforces relevance to the observed genotype data in the specific region. Calculate the data likelihood (the probability that the genealogy does produce the observed haplotypes).
3. Use the second haplotype set, induced by the same mutations, as our computational samples.
4. In subsequent statistics, weight each such set in proportion to the data likelihood calculated in step 2.
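The accept/reject idea in this slide can be illustrated with a small Python sketch. It is not the authors' implementation: it assumes a plain Kingman coalescent without recombination, exactly one mutation per site placed on a branch in proportion to its length, alleles coded 0 (ancestral) and 1 (derived), and it replaces likelihood weighting with a hard accept/reject on an exact match of haplotype counts. The function names (`simulate_tree`, `drop_mutations`, `paired_sets`) are hypothetical.

```python
import random
from collections import Counter

def simulate_tree(n_leaves, rng):
    """Kingman coalescent; returns (branch_length, leaves_below) for every branch."""
    lineages = [frozenset([i]) for i in range(n_leaves)]
    born = {lin: 0.0 for lin in lineages}  # time at which each lineage appeared
    t = 0.0
    branches = []
    while len(lineages) > 1:
        k = len(lineages)
        t += rng.expovariate(k * (k - 1) / 2.0)  # waiting time to next coalescence
        a, b = rng.sample(lineages, 2)
        for lin in (a, b):
            branches.append((t - born[lin], lin))
            lineages.remove(lin)
        merged = a | b
        lineages.append(merged)
        born[merged] = t
    return branches

def drop_mutations(branches, n_leaves, n_sites, rng):
    """Assumption: one mutation per site, on a branch chosen by branch length."""
    lengths = [length for length, _ in branches]
    haps = [[0] * n_sites for _ in range(n_leaves)]
    for site in range(n_sites):
        _, carriers = rng.choices(branches, weights=lengths, k=1)[0]
        for leaf in carriers:
            haps[leaf][site] = 1
    return [tuple(h) for h in haps]

def paired_sets(observed, n_second, rng, max_tries=100_000):
    """Return a second haplotype set from a genealogy whose first set matches `observed`."""
    n_first, n_sites = len(observed), len(observed[0])
    target = Counter(observed)
    for _ in range(max_tries):
        branches = simulate_tree(n_first + n_second, rng)
        haps = drop_mutations(branches, n_first + n_second, n_sites, rng)
        if Counter(haps[:n_first]) == target:  # first set reproduces the data
            return haps[n_first:]              # second set shares the same mutations
    return None

observed = [(0, 0), (0, 0), (1, 0), (1, 1)]   # toy 2-site data, 4 chromosomes
second = paired_sets(observed, 4, random.Random(1))
print(second)
```

Because the sketch has no recombination, observed data that fail the four-gamete test can never be accepted. The rejection rate also rises quickly with the number of chromosomes and sites, which is the efficiency problem the next slide describes.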

Generating computational samples. Problem: the efficiency of generating data-relevant genealogies (and therefore additional sample sets) with standard Coalescent tools is very low, even for modest sample size (N) and number of markers (M). Despite serious efforts with various approaches (e.g. importance sampling), efficient generation of such genealogies is an unsolved problem. We are developing a method to generate “approximative” M-marker haplotypes by composing consecutive, overlapping sets of data-relevant K-site haplotypes (for small K). Motivation: composite likelihood approaches to recombination rate estimation by Hudson, Clark, Wall, and others.

Approximating M-site haplotypes as composites of overlapping K-site haplotypes
1. generate K-site sets
2. build M-site composites

Piecing together neighboring K-site sets: the hope is that the constraint at overlapping markers preserves long-range marker association.

Building composite haplotypes A composite haplotype is built from a complete path through the (M-K+1) K-sites.
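As an illustration of a "complete path" through the M-K+1 windows, the Python sketch below chains K-site haplotypes that agree on their K-1 overlapping markers. The matching rule (uniform choice among compatible haplotypes) and the names `sliding_windows` and `build_composite` are assumptions made for this example; the slides do not specify how a path is chosen.

```python
import random

def sliding_windows(haplotypes, k):
    """Cut M-site haplotypes into M-K+1 sets of K-site haplotypes."""
    m = len(haplotypes[0])
    return [[h[i:i + k] for h in haplotypes] for i in range(m - k + 1)]

def build_composite(windows, rng):
    """Walk through all windows, extending by one marker per step.

    At each step, pick a K-site haplotype whose first K-1 alleles equal the
    last K-1 alleles of the current one. Returns None at a dead end.
    """
    current = rng.choice(windows[0])
    composite = list(current)
    for haps in windows[1:]:
        k = len(current)
        compatible = [h for h in haps if h[:k - 1] == current[1:]]
        if not compatible:
            return None
        current = rng.choice(compatible)
        composite.append(current[-1])  # only the new, non-overlapping marker
    return tuple(composite)

full = [(0, 0, 1, 1, 0), (1, 0, 1, 0, 0), (0, 1, 0, 1, 1)]  # toy data, M = 5
windows = sliding_windows(full, 3)                           # K = 3 -> 3 windows
composite = build_composite(windows, random.Random(0))
print(composite)
```

A composite built this way can be a mosaic of several original haplotypes: markers more than K-1 positions apart are linked only indirectly through the chain of overlaps, so some long-range association may be lost.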

Initial results: 3-site composite haplotypes. Figure: a typical 3-site composite; 30 CEPH HapMap reference individuals (60 chr); Hinds et al., Science, 2005.

3-site composite vs. data

3-site composites: the “best case” (figure labels: “short-range”, “long-range”; step shown: 1. generate K-site sets)

Variability across sets: the purpose of the composite haplotype sets is to model sample variance across consecutive data sets. But the variability across the composite haplotype sets is compounded by the inherent loss of long-range association when 3-site haplotypes are used.

4-site composite haplotypes (figure: a 4-site composite)

“Best-case” 4-site composites: composite of exact 4-site sub-haplotypes

Variability across 4-site composites is comparable to the variability across data sets.

Software engineering aspects: efficiency. To do larger-scale testing we must first improve the efficiency of generating composite sets. Currently, we run fresh Coalescent runs at each K-site (several hours per region). The total number of genotyped SNPs is ~1 million, so there are ~1 million different K-sites to match. Any given Coalescent genealogy is likely to match one or more of these. Computational haplotype sets can be stored in a database efficiently: 4 HapMap populations x 1 million K-sites x 1,000 computational sets x 50 bytes ≈ 200 gigabytes.
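The storage estimate on this slide can be checked with a few lines of Python. The variable names are illustrative; the four input figures are the ones quoted on the slide.

```python
populations = 4             # HapMap populations
k_sites = 1_000_000         # one K-site per genotyped SNP (approximate)
sets_per_k_site = 1_000     # computational haplotype sets stored per K-site
bytes_per_set = 50          # storage per set, as quoted on the slide

total_bytes = populations * k_sites * sets_per_k_site * bytes_per_set
total_gb = total_bytes / 10**9    # decimal gigabytes
total_gib = total_bytes / 2**30   # binary gibibytes

print(f"{total_bytes:,} bytes = {total_gb:.0f} GB = {total_gib:.1f} GiB")
```

The product is 2 x 10^11 bytes: 200 GB in decimal units, or about 186 GiB in binary units.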

Technical/algorithmic improvements
1. un-phased genotypes (example: genotypes (AC)(CG)(AT)(CT), haplotypes A G A C and C C T T)
2. markers with unknown ancestral state (example: AC ?)
3. dealing with uninformative markers
Also: taking into account local recombination rate
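To show why un-phased genotypes complicate matching, the Python sketch below enumerates every haplotype pair consistent with the example genotype (AC)(CG)(AT)(CT). It reads the slide's example as four heterozygous sites. The function name `phase_resolutions` is hypothetical, and the slides do not say how the platform will handle phasing.

```python
from itertools import product

def phase_resolutions(genotype):
    """List all unordered haplotype pairs consistent with an un-phased genotype.

    `genotype` is a list of two-character strings, one allele pair per site.
    """
    pairs = set()
    # At each site, either allele can sit on the first haplotype.
    for choice in product(*[(g[0] + g[1], g[1] + g[0]) for g in genotype]):
        hap1 = "".join(c[0] for c in choice)
        hap2 = "".join(c[1] for c in choice)
        pairs.add(tuple(sorted((hap1, hap2))))  # ignore the order within a pair
    return sorted(pairs)

resolutions = phase_resolutions(["AC", "CG", "AT", "CT"])
for hap1, hap2 in resolutions:
    print(hap1, hap2)
```

With h heterozygous sites there are 2^(h-1) distinct resolutions, so the four-site example has eight, one of which is the pair AGAC / CCTT shown on the slide.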

Acknowledgements: Eric Tsung, Aaron Quinlan, Ike Unsal, Eva Czabarka (Dept. Mathematics, William & Mary)