A coalescent computational platform to predict strength of association for clinical samples Gabor T. Marth Department of Biology, Boston College

marth@bc.edu. Genomic studies and the HapMap, March 15-18, 2005, Oxford, United Kingdom

Focal questions about the HapMap
1. Required marker density
2. How to quantify the strength of allelic association in a genome region
3. How to choose tagging SNPs
4. How general the answers to these questions are among different human populations
(Figure panels: CEPH European samples; Yoruban samples)

Across samples from a single population? (random 60-chromosome subsets of 120 CEPH chromosomes from 60 independent individuals)
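The subsampling described on this slide can be sketched in a few lines. This is a minimal illustration, not the analysis behind the slide: the haplotypes are synthetic 0/1 strings, the statistic is the standard pairwise r² (the slide does not say which association measure was plotted), and the names `r_squared` and `subset_r2` are hypothetical.

```python
import random

def r_squared(haps, i, j):
    """Pairwise r^2 between biallelic sites i and j of 0/1 haplotype strings."""
    n = len(haps)
    p_i = sum(h[i] == "1" for h in haps) / n
    p_j = sum(h[j] == "1" for h in haps) / n
    p_ij = sum(h[i] == "1" and h[j] == "1" for h in haps) / n
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:
        return float("nan")  # at least one site is monomorphic in this sample
    d = p_ij - p_i * p_j
    return d * d / denom

def subset_r2(haps, i, j, subset_size=60, n_subsets=100, seed=0):
    """r^2 for sites i and j in random subsets drawn without replacement."""
    rng = random.Random(seed)
    return [r_squared(rng.sample(haps, subset_size), i, j)
            for _ in range(n_subsets)]

# Synthetic stand-in for 120 chromosomes typed at two markers (assumption).
toy_haps = ["11"] * 50 + ["00"] * 50 + ["10"] * 10 + ["01"] * 10
full_value = r_squared(toy_haps, 0, 1)
subset_values = subset_r2(toy_haps, 0, 1)
print(full_value, min(subset_values), max(subset_values))
```

The spread between the smallest and largest subset value gives a rough sense of how much a pairwise association estimate can move between 60-chromosome samples of the same population.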

Consequence for marker performance: Markers selected based on the allele structure of the HapMap reference samples may not work well in another set of samples, such as those used for a clinical study.

How to assess sample-to-sample variability?
1. Understanding intrinsic properties of a given genome region, e.g. estimating local recombination rate from the HapMap data (McVean et al. Science 2004)
2. Experimentally genotype additional sets of samples, and compare association structure across consecutive sets directly
3. It would be a desirable alternative to generate such additional sets with computational means

Towards a marker selection tool
1. select markers (tag SNPs) with standard methods
2. generate computational samples for this genome region
3. test the performance of markers across consecutive sets of computational samples

Generating additional computational haplotypes
1. Generate a pair of haplotype sets with Coalescent genealogies. This “models” that the two sets are “related” to each other by being drawn from a single population.
2. Only accept the pair if the first set reproduces the observed haplotype structure of the HapMap reference samples. This enforces relevance to the observed genotype data in the specific region. Calculate the data likelihood (the probability that the genealogy does produce the observed haplotypes).
3. Use the second haplotype set, induced by the same mutations, as our computational samples.
4. In subsequent statistics, weight each such set in proportion to the data likelihood calculated in step 2.
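The accept-and-weight logic of steps 2 to 4 can be sketched as follows. This is an illustration under stated assumptions, not the platform's code: the coalescent simulation of step 1 is not shown, so `pairs` is a hypothetical list of already simulated (first set, second set, likelihood) triples, and "reproduces the observed haplotype structure" is read here as "has the same haplotype counts as the observed data".

```python
from collections import Counter

def weighted_statistic(pairs, observed, statistic):
    """Likelihood-weighted mean of `statistic` over accepted second sets.

    pairs     : iterable of (first_set, second_set, likelihood) triples
    observed  : list of observed haplotype strings (the reference samples)
    statistic : function mapping a haplotype set to a number
    """
    target = Counter(observed)
    weight_sum = 0.0
    value_sum = 0.0
    for first_set, second_set, likelihood in pairs:
        if Counter(first_set) != target:
            continue  # step 2: reject pairs that do not reproduce the data
        weight_sum += likelihood                          # step 4: weight
        value_sum += likelihood * statistic(second_set)   # step 3: second set
    if weight_sum == 0:
        raise ValueError("no simulated pair reproduced the observed haplotypes")
    return value_sum / weight_sum

# Toy inputs (all values are made up for illustration).
observed = ["00", "00", "11", "11"]
pairs = [
    (["11", "00", "11", "00"], ["11", "11", "11", "00"], 0.3),  # accepted
    (["00", "01", "11", "11"], ["00", "00", "00", "00"], 0.9),  # rejected
    (["00", "00", "11", "11"], ["11", "00", "00", "00"], 0.1),  # accepted
]

def freq_11(hap_set):
    """Frequency of the haplotype 11 in a set."""
    return hap_set.count("11") / len(hap_set)

print(weighted_statistic(pairs, observed, freq_11))
```

Rejection on an exact match is what makes the approach inefficient for larger N and M, since the chance that a random genealogy reproduces the data shrinks quickly.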

Generating computational samples. Problem: The efficiency of generating data-relevant genealogies (and therefore additional sample sets) with standard Coalescent tools is very low, even for modest sample size (N) and number of markers (M). Despite serious efforts with various approaches (e.g. importance sampling), efficient generation of such genealogies is an unsolved problem. We are developing a method to generate “approximative” M-marker haplotypes by composing consecutive, overlapping sets of data-relevant K-site haplotypes (for small K). Motivation comes from composite likelihood approaches to recombination rate estimation by Hudson, Clark, Wall, and others.

Approximating M-site haplotypes as composites of overlapping K-site haplotypes
1. generate K-site sets
2. build M-site composites

Piecing together neighboring K-site sets (figure: two neighboring sets of 3-site haplotypes, each listing 000 through 111, joined at their overlapping markers). The hope is that the constraint at overlapping markers preserves long-range marker association.

Building composite haplotypes A composite haplotype is built from a complete path through the (M-K+1) K-sites.
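One way to read "a complete path through the (M-K+1) K-sites" is sketched below. It is an assumption-laden illustration rather than the method used for the results: each K-site set is taken to be a list of N haplotype strings for sites w to w+K-1, neighboring sets overlap by K-1 sites, and a composite is extended by randomly picking an unused haplotype from the next set that agrees on the overlap. The function names are hypothetical.

```python
import random
from collections import defaultdict

def extend(composites, next_set, k, rng):
    """Add one site to every composite using the next overlapping K-site set."""
    pool = defaultdict(list)
    for hap in next_set:
        pool[hap[:k - 1]].append(hap)  # index by the (K-1)-site overlap
    for group in pool.values():
        rng.shuffle(group)             # random pairing within each overlap class
    extended = []
    for comp in composites:
        key = comp[-(k - 1):]
        if not pool[key]:
            raise ValueError("K-site sets disagree at the overlap " + key)
        extended.append(comp + pool[key].pop()[-1])
    return extended

def build_composites(k_site_sets, k, seed=0):
    """Chain M-K+1 overlapping K-site sets into M-site composite haplotypes."""
    if k < 2:
        raise ValueError("k must be at least 2 so that neighbors overlap")
    rng = random.Random(seed)
    composites = list(k_site_sets[0])
    for next_set in k_site_sets[1:]:
        composites = extend(composites, next_set, k, rng)
    return composites

# Toy example: "best case" 3-site sets cut from known 5-site haplotypes.
true_haps = ["00000", "00110", "11011", "11011", "01010", "10101"]
sets_3 = [[h[w:w + 3] for h in true_haps] for w in range(5 - 3 + 1)]
print(build_composites(sets_3, 3))
```

Every K-site window of the resulting composites has exactly the haplotype counts of the corresponding input set, while association beyond K sites is only preserved to the extent that the overlap constraint carries it.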

Initial results: 3-site composite haplotypes. Figure panels: a typical 3-site composite; 30 CEPH HapMap reference individuals (60 chr), Hinds et al. Science, 2005

3-site composite vs. data

3-site composites: the “best case” (figure labels: “short-range”, “long-range”; 1. generate K-site sets)

Variability across sets: The purpose of the composite haplotype sets is to model sample variance across consecutive data sets. But the variability across the composite haplotype sets is compounded by the inherent loss of long-range association when 3-sites are used.

4-site composite haplotypes 4-site composite

“Best-case” 4-site composites: composite of exact 4-site sub-haplotypes

Variability across 4-site composites is comparable to the variability across data sets.

Software engineering aspects: efficiency. To do larger-scale testing we must first improve the efficiency of generating composite sets. Currently, we run fresh Coalescent runs at each K-site (several hours per region). The total number of genotyped SNPs is ~1 million, so there are ~1 million different K-sites to match. Any given Coalescent genealogy is likely to match one or more of these. Computational haplotype sets can be databased efficiently: 4 HapMap populations x 1 million K-sites x 1,000 comp sets x 50 bytes < 200 Gigabytes
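The storage estimate can be checked with a few lines of arithmetic. The figures are the ones on the slide; the function name `storage_bytes` is hypothetical.

```python
def storage_bytes(populations=4, k_sites=1_000_000,
                  sets_per_k_site=1_000, bytes_per_set=50):
    """Total bytes needed to store every computational haplotype set."""
    return populations * k_sites * sets_per_k_site * bytes_per_set

total = storage_bytes()
# 2e11 bytes: 200 GB in decimal units, about 186 GiB in binary units.
print(total, total / 10**9, total / 2**30)
```

The product comes to 2 x 10^11 bytes, so the slide's bound holds when Gigabytes are counted in binary units (2^30 bytes).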

Technical/algorithmic improvements
1. un-phased genotypes, e.g. (AC)(CG)(AT)(CT) resolved as A G A C / C C T T
2. markers with unknown ancestral state, e.g. AC ?
3. dealing with uninformative markers
4. taking into account local recombination rate
(Figure: example haplotypes 01101000010101110, 11101000001010101, 11101000010101110, 01101000010101110)
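To illustrate why un-phased genotypes are listed as an open issue, the sketch below enumerates every haplotype pair consistent with a set of un-phased genotypes. It only shows the size of the problem and is not a description of how the platform resolves phase; the function name `phasings` is hypothetical.

```python
from itertools import product

def phasings(genotypes):
    """List all unordered haplotype pairs consistent with unphased genotypes.

    genotypes : list of two-letter strings, one per site, e.g. ["AC", "CG"].
    The orientation of the first heterozygous site is fixed so that each
    unordered pair is listed once.
    """
    het = [i for i, (a, b) in enumerate(genotypes) if a != b]
    results = []
    for flips in product([False, True], repeat=max(len(het) - 1, 0)):
        flip_at = dict(zip(het[1:], flips))
        hap1, hap2 = [], []
        for i, (a, b) in enumerate(genotypes):
            if flip_at.get(i, False):
                a, b = b, a  # swap which haplotype carries which allele
            hap1.append(a)
            hap2.append(b)
        results.append(("".join(hap1), "".join(hap2)))
    return results

# The four genotypes shown on the slide.
pairs_4 = phasings(["AC", "CG", "AT", "CT"])
print(len(pairs_4), pairs_4)
```

With h heterozygous sites there are 2^(h-1) distinct resolutions, and the pair shown on the slide (AGAC / CCTT) is one of the eight for these four genotypes.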

Acknowledgements: Eric Tsung, Aaron Quinlan, Ike Unsal, Eva Czabarka (Dept. Mathematics, William & Mary)
