Introduction to Haplotype Estimation Stat/Biostat 550.

The Haplotype Problem Suppose we genotype individuals at a number of tightly linked SNPs. ACGCCTTTGCGC GAACCCCCAGGC

The Haplotype Problem What do the types on the two chromosomes look like?

Haplotypes: who cares? LD mapping: increase power? LD mapping: decrease genotyping? Evolutionary studies: selection, recombination, gene conversion, population structure,… Many people, for many different reasons…

The Haplotype Problem – potential solutions: molecular methods; collecting family data; statistical methods for population data.

The Simplest Case What do the types on the two chromosomes look like?

The Next Simplest Case What do the types on the two chromosomes look like?

The first difficult case… What do the types on the two chromosomes look like?
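One way to see why the cases get harder is to count the haplotype pairs that are consistent with a genotype. The sketch below assumes (the slides do not say so explicitly) that the simplest cases are genotypes with zero or one heterozygous site and that the first difficult case has two. The 0/1/2 genotype coding (copies of allele "1" at each SNP) and the function name `compatible_pairs` are illustrative choices, not taken from the slides.

```python
from itertools import product


def compatible_pairs(genotype):
    """Return all unordered haplotype pairs consistent with a genotype.

    genotype: sequence of 0/1/2 values (copies of allele 1 at each SNP).
    Haplotypes are tuples of 0/1 alleles.
    """
    het = [i for i, g in enumerate(genotype) if g == 1]
    # Homozygous sites are fixed (0 -> allele 0, 2 -> allele 1).
    base = [g // 2 for g in genotype]
    if not het:
        hap = tuple(base)
        return [(hap, hap)]
    pairs = []
    # Pin the first heterozygous site to allele 0 on haplotype a,
    # so that each unordered pair is listed exactly once.
    for bits in product((0, 1), repeat=len(het) - 1):
        a, b = list(base), list(base)
        for site, bit in zip(het, (0,) + bits):
            a[site], b[site] = bit, 1 - bit
        pairs.append((tuple(a), tuple(b)))
    return pairs


for g in [(0, 2, 0), (0, 1, 2), (1, 1, 0), (1, 1, 1)]:
    print(g, len(compatible_pairs(g)))
```

With k heterozygous sites (k ≥ 1) there are 2^(k-1) compatible pairs, so the genotype is unambiguous only when k is 0 or 1.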

Clark’s Method (1990) Idea: use information obtained from other individuals in the population to determine the most probable haplotype pair.

Is it this configuration? [Figure: one candidate haplotype configuration]

…or this one? [Figure: an alternative configuration]

This one is more probable.

Clark’s Method (Clark, 1990) Identify the unambiguous individuals. Make a list of “known” haplotypes from them. Go through the list and check whether each ambiguous individual can be made up from a “known” haplotype plus another “complementary” haplotype. If so, add the complementary haplotype to the list of “known” haplotypes.
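The steps above can be sketched roughly as follows. This is a minimal illustration under assumptions, not Clark's original implementation: genotypes are coded 0/1/2 per SNP (copies of allele "1"), and the names `complement` and `clark` are hypothetical.

```python
def complement(genotype, hap):
    """Return the haplotype that pairs with `hap` to give `genotype`,
    or None if `hap` is incompatible with the genotype."""
    other = tuple(g - h for g, h in zip(genotype, hap))
    return other if all(x in (0, 1) for x in other) else None


def clark(genotypes):
    """Greedy resolution in the spirit of Clark's method.

    Returns (resolved, known): a dict mapping individual index to a
    haplotype pair, and the final list of "known" haplotypes.
    """
    known, resolved = [], {}
    # Step 1: unambiguous individuals have at most one heterozygous site.
    for i, g in enumerate(genotypes):
        if sum(x == 1 for x in g) <= 1:
            a = tuple(x // 2 for x in g)  # a heterozygous site gets 0 here
            b = complement(g, a)          # ... and 1 on the other haplotype
            resolved[i] = (a, b)
            for h in (a, b):
                if h not in known:
                    known.append(h)
    # Step 2: repeatedly try to explain ambiguous individuals as
    # a known haplotype plus its complement (exact matches only).
    changed = True
    while changed:
        changed = False
        for i, g in enumerate(genotypes):
            if i in resolved:
                continue
            for h in known:
                c = complement(g, h)
                if c is not None:
                    resolved[i] = (h, c)
                    if c not in known:
                        known.append(c)
                    changed = True
                    break
    return resolved, known


resolved, known = clark([(0, 0, 0), (2, 2, 0), (1, 1, 0), (1, 1, 2)])
print(resolved)
print(known)
```

In this toy input the last individual, `(1, 1, 2)`, is left unresolved because no known haplotype is compatible with it. The first compatible known haplotype always wins, so the result can depend on the order of both the individuals and the list.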

Clark’s Method [Figure: list of known haplotypes]

Clark’s Method: Problem 1 [Figure: list of known haplotypes] The answer depends on the order in which the list is considered, and frequency information is ignored.
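A small illustration of the order dependence, under the same assumed 0/1/2 genotype coding (the function name `clark_candidates` and the toy data are hypothetical): a single ambiguous genotype can be compatible with more than one known haplotype, and an exact-match pass simply takes whichever it meets first.

```python
def clark_candidates(genotype, known):
    """List every (known haplotype, complement) resolution of a genotype."""
    out = []
    for h in known:
        c = tuple(g - x for g, x in zip(genotype, h))
        if all(v in (0, 1) for v in c):  # h is compatible with the genotype
            out.append((h, c))
    return out


known = [(0, 0, 0), (1, 1, 0)]
for h, c in clark_candidates((1, 1, 1), known):
    print(h, "+", c)
```

Both resolutions are exact matches, so which one is chosen depends only on the order of `known`; how often each known haplotype has been seen plays no role.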

Clark’s Method: Problem 2 [Figure: list of known haplotypes] The algorithm can fail to resolve all haplotypes, because it looks only for exact matches.

Clark’s Algorithm: Summary Results may depend on the order in which individuals are considered. Frequency information is ignored. May fail to resolve all haplotypes. Fails to assess uncertainty. Looks only for exact matches. Fast and intuitive(?).

Maximum Likelihood (EM Algorithm) Idea: find haplotype frequencies (f_1, …, f_N) to maximise the probability of the observed genotype data (g_1, …, g_n).
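A minimal sketch of such an EM iteration, assuming Hardy-Weinberg equilibrium (so an unordered pair of distinct haplotypes a, b has probability 2·f_a·f_b), genotypes coded 0/1/2 per SNP, and a uniform starting point. The function names, the fixed iteration count, and the toy data are illustrative assumptions, not details from the slides.

```python
from collections import defaultdict
from itertools import product


def compatible_pairs(genotype):
    """All unordered haplotype pairs consistent with a 0/1/2 genotype."""
    het = [i for i, g in enumerate(genotype) if g == 1]
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        a = [g // 2 for g in genotype]
        for site, bit in zip(het, bits):
            a[site] = bit
        b = [g - x for g, x in zip(genotype, a)]
        pairs.add(tuple(sorted((tuple(a), tuple(b)))))
    return sorted(pairs)


def em_haplotype_freqs(genotypes, n_iter=50):
    """Estimate haplotype frequencies by EM under Hardy-Weinberg."""
    pairs = [compatible_pairs(g) for g in genotypes]
    haps = sorted({h for ps in pairs for p in ps for h in p})
    freq = {h: 1.0 / len(haps) for h in haps}  # uniform start
    for _ in range(n_iter):
        counts = defaultdict(float)
        for ps in pairs:
            # E-step: posterior weight of each compatible pair.
            w = [(1 if a == b else 2) * freq[a] * freq[b] for a, b in ps]
            total = sum(w)
            for (a, b), wi in zip(ps, w):
                counts[a] += wi / total
                counts[b] += wi / total
        # M-step: expected haplotype counts over 2n chromosomes.
        n_chrom = 2 * len(genotypes)
        freq = {h: counts[h] / n_chrom for h in haps}
    return freq


freq = em_haplotype_freqs([(0, 0), (0, 0), (2, 2), (1, 1)])
for hap, f in freq.items():
    print(hap, round(f, 4))
```

Enumerating every compatible pair grows exponentially with the number of heterozygous sites, which is one reason this approach becomes expensive for many SNPs.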

Bayesian version Modify Clark’s algorithm: Replace the single pass through the data with an iterative scheme. Allow for uncertainty in resolution. Use frequency information. The resulting “naïve Gibbs sampler” produces results similar to EM (Stephens, Smith and Donnelly 2001).

Example [Figure: list of known haplotypes] One haplotype matches 1 known; the other does not match any. Assigned moderate probability.

Example [Figure: list of known haplotypes] One haplotype matches 3 known; the other does not match any. Assigned higher probability.

Example [Figure: list of known haplotypes] Does not match any. Assigned low probability.
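The three examples can be mimicked with a simple weighting rule. The sketch below reads “matches 1 known” and “matches 3 known” as matching a haplotype seen once or three times in the current list, which is an interpretation of the slides. It weights each candidate pair by the product of (count + alpha) for its two haplotypes, as a Dirichlet-style prior would for a pair of distinct haplotypes; the pseudo-count `alpha`, the function names, and the toy data are all assumptions.

```python
import random
from collections import Counter


def pair_weights(candidate_pairs, known_counts, alpha=0.1):
    """Unnormalised probability of each candidate haplotype pair.

    known_counts: Counter of haplotypes currently assigned to the
    other individuals. alpha is an assumed pseudo-count that keeps
    unseen haplotypes possible.
    """
    return [
        (known_counts[a] + alpha) * (known_counts[b] + alpha)
        for a, b in candidate_pairs
    ]


def gibbs_update(candidate_pairs, known_counts, alpha=0.1, rng=random):
    """Sample one pair for an individual, given everyone else's haplotypes."""
    weights = pair_weights(candidate_pairs, known_counts, alpha)
    return rng.choices(candidate_pairs, weights=weights, k=1)[0]


known_counts = Counter({(0, 0, 0): 1, (1, 1, 0): 3})
candidates = [
    ((0, 0, 0), (1, 1, 1)),  # one haplotype seen once, other unseen
    ((1, 1, 0), (0, 0, 1)),  # one haplotype seen three times, other unseen
    ((0, 1, 0), (1, 0, 1)),  # neither haplotype seen
]
print(pair_weights(candidates, known_counts))
print(gibbs_update(candidates, known_counts))
```

Repeating this update over all ambiguous individuals, with the counts refreshed each time, gives an iterative scheme that uses frequency information and reports uncertainty rather than a single answer.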

Problems with EM/naïve Gibbs Potentially (very) large number of parameters to estimate, leading to inaccurate estimates. Can be time-consuming for large problems. Can “converge” to poor local optima (alleviated by multiple runs).

Further modification Take into account “near misses”, as well as exact matches. (PHASE v1.0: Stephens, Smith and Donnelly 2001)

Example [Figure: list of known haplotypes] One haplotype matches 1 known; the other differs by 2 from 3 known.

Example [Figure: list of known haplotypes] One haplotype matches 3 known; the other differs by 2 from 1 known.

Example [Figure: list of known haplotypes] One haplotype differs by 1 from 3 known; the other differs by 1 from 1 known. How to balance these possibilities?

The key question What is the conditional distribution of the next haplotype, given a set of known haplotypes?

Example [Figure: a set of known haplotypes] Given the above haplotypes, what would you expect the next haplotype to look like?

Qualitative answer The next haplotype will likely differ by a small number of mutations (possibly 0 mutations) from a (randomly-chosen) existing haplotype. Use theory (Ewens sampling formula; coalescent theory) to roughly quantify the distribution of the “small number”.
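The qualitative answer can be turned into a simple sampler: copy a randomly chosen existing haplotype, then apply a geometrically distributed number of mutations. This is only a sketch in the spirit of that description. The mutation parameter `theta`, the continuation probability theta / (n + theta), and the rule that each mutation flips one uniformly chosen biallelic site are simplifying assumptions, and the function name is hypothetical.

```python
import random


def sample_next_haplotype(known, theta=1.0, rng=random):
    """Draw a plausible next haplotype given a list of known ones.

    known: list of 0/1 tuples, with repeats reflecting how often each
    haplotype has been seen. theta is an assumed scaled mutation rate.
    """
    n = len(known)
    # Copy an existing haplotype, chosen in proportion to its frequency.
    hap = list(rng.choice(known))
    # Geometric number of mutations (possibly zero): keep mutating
    # with probability theta / (n + theta).
    while rng.random() < theta / (n + theta):
        site = rng.randrange(len(hap))
        hap[site] = 1 - hap[site]  # flip the allele at one random site
    return tuple(hap)


known = [(0, 0, 0, 0), (0, 0, 0, 0), (0, 0, 0, 0), (1, 1, 0, 1)]
rng = random.Random(1)
print([sample_next_haplotype(known, theta=0.5, rng=rng) for _ in range(5)])
```

Because the continuation probability shrinks as n grows, larger samples make exact or near-exact copies of existing haplotypes more likely, which is how “near misses” receive sensible weight relative to exact matches.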

Comparisons on simulated data

Problems Time-consuming for large problems. Can “converge” to poor local optima. Ignores recombination (decay of LD with distance). How should uncertainty in haplotype estimates be treated?

… to be continued.
