1 Haplotyping Algorithms Qunyuan Zhang Division of Statistical Genomics GEMS Course M21-621 Computational Statistical Genetics Mar. 29, 2012 https://dsgweb.wustl.edu/qunyuan/presentations/Haplotyping_GEMS_2012.ppt.


1 1 Haplotyping Algorithms Qunyuan Zhang Division of Statistical Genomics GEMS Course M21-621 Computational Statistical Genetics Mar. 29, 2012 https://dsgweb.wustl.edu/qunyuan/presentations/Haplotyping_GEMS_2012.ppt

2 2 Questions WHAT is a haplotype? WHY study haplotypes? WHY use algorithms for haplotyping? HOW? (Data, Hypotheses, Algorithms)

3 3 WHAT is Haplotype? A haplotype (Greek haploos = simple) is a combination of alleles at multiple linked loci that are transmitted together. Haplotype may refer to as few as two loci or to an entire chromosome depending on the number of recombination events that have occurred between a given set of loci. The term haplotype is a portmanteau of "haploid genotype." In a second meaning, haplotype is a set of single nucleotide polymorphisms (SNPs) on a single chromatid that are statistically associated. It is thought that these associations, and the identification of a few alleles of a haplotype block, can unambiguously identify all other polymorphic sites in its region. Such information is very valuable for investigating the genetics behind common diseases, and is collected by the International HapMap Project. From http://en.wikipedia.org/wiki/Haplotype

4 4 Haplotype = Genotype of Haploid
Haplotype C G + Haplotype T A => Genotype CT GA
Haplotypes Ab//aB => Genotype Aa Bb
Haplotypes AB//ab => Genotype Aa Bb

5 5 WHY Study Haplotype? An efficient way to represent genetic variation/polymorphism, useful in genomics, population genetics, and genetic epidemiology: population evolution; LD analysis; missing genotype imputation; IBD estimation; tag marker (SNP) selection; multi-locus linkage & association; …

6 6 WHY use algorithms in haplotyping? Most current molecular genotyping techniques mix DNA pieces from the two homologous chromosomes and provide only diploid genotypes (mixtures of haplotypes): genotype (AaBb) => haplotypes (Ab//aB or AB//ab)? Some molecular techniques can measure haplotypes directly, but they are expensive (money, labor, time, …), especially for genome-wide studies. So, at least for now, we need algorithms …

7 7 Ambiguity of Haplotype Haplotypic ambiguity/uncertainty arises when ≥2 markers/loci are heterozygous and their genetic phase is unknown
Genotype      Haplotypes
AA BB         AB//AB
Aa bb         Ab//ab
Aa Bb         Ab//aB or AB//ab
Aa Bb Cc      ABC//abc, ABc//abC, AbC//aBc or Abc//aBC
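The ambiguity table can be reproduced by brute-force enumeration. The following is a minimal sketch, assuming each locus is written as a two-character string such as "Aa"; the function name `haplotype_pairs` is a hypothetical helper, not part of the slides.

```python
from itertools import product

def haplotype_pairs(genotype):
    """Enumerate the unordered haplotype pairs compatible with a genotype.

    genotype: list of 2-character strings, one per locus, e.g. ["Aa", "Bb"]
    (an assumed encoding chosen for this illustration).
    """
    pairs = set()
    # For every locus, choose which of its two alleles goes on haplotype 1;
    # the other allele goes on haplotype 2.
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = "".join(locus[c] for locus, c in zip(genotype, choice))
        h2 = "".join(locus[1 - c] for locus, c in zip(genotype, choice))
        # Sorting makes (h1, h2) and (h2, h1) count as the same pair.
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)

print(haplotype_pairs(["Aa", "bb"]))        # one heterozygous locus: no ambiguity
print(haplotype_pairs(["Aa", "Bb"]))        # two heterozygous loci: two pairs
print(haplotype_pairs(["Aa", "Bb", "Cc"]))  # three heterozygous loci: four pairs
```

With h heterozygous loci the function returns 2^(h-1) pairs (one pair when h is 0 or 1), which is why ambiguity starts at two heterozygous loci.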

8 8 Rule-based Approaches (Parsimony & Phylogeny) Search for an optimal set of haplotypes that satisfies some specific rules

9 9 Parsimony Approaches: Clark's Algorithm
1. List all unambiguous haplotypes
2. Resolve ambiguous individuals one by one using the listed haplotypes
3. If only half-resolved, add the new haplotype to the list
4. Repeat 2 & 3
5. Stop when no more individuals can be resolved
Example: haplotype list ABC, abc, abC, Abc; AaBbCC => ABC//abC; AABbCc => ABC//Abc; continue … until no more individuals can be resolved
Parsimony rules: maximum resolution of genotypes and/or minimum set of haplotypes
Clark, 1990, Mol. Biol. Evol., 7(2): 111-122
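The steps of Clark's algorithm can be sketched in a few lines. This is a simplified illustration, not Clark's original implementation: it assumes biallelic loci written as two-character strings, scans individuals in input order, and accepts the first compatible known haplotype. The name `clark_phase` is hypothetical.

```python
def clark_phase(genotypes):
    """Greedy Clark-style phasing (illustrative sketch).

    genotypes: list of genotypes, each a list of 2-character loci, e.g. ["Aa", "Bb", "CC"].
    Returns (resolved, known): resolved maps individual index -> haplotype pair,
    known is the haplotype list in the order haplotypes were discovered.
    """
    known, resolved = [], {}

    # Step 1: individuals with at most one heterozygous locus are unambiguous.
    for i, g in enumerate(genotypes):
        if sum(locus[0] != locus[1] for locus in g) <= 1:
            h1 = "".join(locus[0] for locus in g)
            h2 = "".join(locus[1] for locus in g)
            resolved[i] = (h1, h2)
            for h in (h1, h2):
                if h not in known:
                    known.append(h)

    # Steps 2-5: keep resolving ambiguous individuals until nothing changes.
    changed = True
    while changed:
        changed = False
        for i, g in enumerate(genotypes):
            if i in resolved:
                continue
            for h in known:
                # A known haplotype fits if each of its alleles occurs at that locus.
                if all(allele in locus for allele, locus in zip(h, g)):
                    # The partner haplotype carries the other allele at every locus.
                    partner = "".join(
                        locus[1] if allele == locus[0] else locus[0]
                        for allele, locus in zip(h, g)
                    )
                    resolved[i] = (h, partner)
                    if partner not in known:
                        known.append(partner)  # step 3: extend the list
                    changed = True
                    break
    return resolved, known

genotypes = [["AA", "BB", "CC"], ["aa", "bb", "cc"],
             ["Aa", "Bb", "CC"], ["AA", "Bb", "Cc"]]
resolved, known = clark_phase(genotypes)
print(resolved[2], resolved[3], known)
```

Because the procedure is greedy, the result can depend on the order in which individuals and haplotypes are tried, and some individuals may remain unresolved.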

10 10 Phylogeny Approaches Perfect Phylogeny Haplotype (PPH): given a set of genotypes, find a set of explaining haplotypes that defines a perfect phylogeny. Rule: coalescent rule (no recombination; infinite-site mutation, i.e. each site mutates at most once). D. Gusfield. 2002. Proc. of the 6th Annual Inter. Conf. on Res. In Comput. Mol. Biology, p166–175.

11 11 Probability-based Approaches (EM & MCMC) Calculate probability of haplotype, conditional on genotypes. Pr(H|G)=?

12 12 Data Structure for Haplotyping
Genotypes (rows: subjects 1, 2, 3, …; columns: loci A, B, C, …):
G_{1,A} G_{1,B} G_{1,C} …
G_{2,A} G_{2,B} G_{2,C} …
G_{3,A} G_{3,B} G_{3,C} …
…
Figure labels: Genotypes; Haplotypes; Linkage (loci A, B, C); Genetic Relationship (subjects); Gene/haplotype frequencies; HWE, LD

13 13 HWE & LD
Hardy-Weinberg Equilibrium (HWE) vs. Hardy-Weinberg Disequilibrium (HWD)
HWE: random combination of alleles from the same locus
Under HWE, allele freq. determines genotype freq.: Pr(AA)=Pr(A)*Pr(A), Pr(aa)=Pr(a)*Pr(a), Pr(Aa)=2*Pr(A)*Pr(a)
Linkage Equilibrium (LE) vs. Linkage Disequilibrium (LD)
LE: random combination of alleles from different loci
LD: association between alleles from different loci
Under LE, allele freq. determines haplotype freq.: Pr(ABC)=Pr(A)*Pr(B)*Pr(C)
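The two product rules on this slide translate directly into code. A minimal sketch, with hypothetical function names and biallelic loci assumed:

```python
def hwe_genotype_freqs(p_A):
    """Genotype frequencies at one biallelic locus under HWE, from Pr(A)."""
    p_a = 1.0 - p_A
    return {"AA": p_A * p_A, "Aa": 2 * p_A * p_a, "aa": p_a * p_a}

def le_haplotype_freq(allele_freqs):
    """Haplotype frequency under LE: the product of its allele frequencies."""
    freq = 1.0
    for p in allele_freqs:
        freq *= p
    return freq

print(hwe_genotype_freqs(0.5))             # Pr(AA), Pr(Aa), Pr(aa)
print(le_haplotype_freq([0.5, 0.5, 0.5]))  # Pr(ABC) with every allele freq. 0.5
```

When observed genotype or haplotype frequencies depart from these products, the data show HWD or LD respectively.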

14 14 Genetic Relationship (R) & Linkage (r)
Recombination rate (r): r=0, complete linkage; 0<r<0.5, incomplete linkage; r=0.5, no linkage
[Pedigree figure: genotypes AaBb, AABB, aabb; haplotype labels "AB//ab or aB//Ab", "AB//ab (if r=0)", "AB//ab, Ab//aB (if r>0)"]

15 15 Haplotyping & Conditional Probability
AaBB: Pr(AB//aB)=1
AABb: Pr(AB//Ab)=1
AaBb: Pr(AB//ab)=0.5, Pr(Ab//aB)=0.5
Population sample:
AABB, aabb, AABB, aabb, AABB, AABb, aabb
AaBB, aabb, AABB, AABB, AABB, AABB, aabb
aabb, AABB, AABB, AABB, AaBb, AABB, aabb
aabb, AABB, AABB, aabb, AABB, aabb, AABB …
For AaBb in this sample: Pr(AB//ab)=Pr(Ab//aB)=0.5? HWE or HWD? LD or LE? P(H|G, R, r)=? P(H|G)=?

16 16 EM Algorithm for unrelated individuals Pr(H|G,F)=?
Excoffier et al., 1995, Mol. Biol. Evol., 12(5): 921-927
Hawley et al., 1995, J Hered., 86:409-411 (software: HAPLO)
Given Pr(AB)=0.25, Pr(Ab)=0.25, Pr(aB)=0.25, Pr(ab)=0.25 OR Pr(AB)=0.01, Pr(Ab)=0.49, Pr(aB)=0.49, Pr(ab)=0.01:
for genotype AaBb, Pr(AB//ab)=? Pr(Ab//aB)=?
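The question on this slide can be worked through numerically. The sketch below assumes HWE, so that an unordered haplotype pair (a, b) with a ≠ b has prior probability 2*Pr(a)*Pr(b); the function name `phase_posterior` is hypothetical.

```python
def phase_posterior(freqs):
    """Posterior phase probabilities for a double heterozygote AaBb.

    freqs: dict of haplotype frequencies with keys "AB", "Ab", "aB", "ab".
    Returns (Pr(AB//ab | AaBb, F), Pr(Ab//aB | AaBb, F)), assuming HWE.
    """
    cis = 2 * freqs["AB"] * freqs["ab"]    # prior prob. of the pair AB//ab
    trans = 2 * freqs["Ab"] * freqs["aB"]  # prior prob. of the pair Ab//aB
    total = cis + trans                    # Pr(G = AaBb | F)
    return cis / total, trans / total

equal = {"AB": 0.25, "Ab": 0.25, "aB": 0.25, "ab": 0.25}
skewed = {"AB": 0.01, "Ab": 0.49, "aB": 0.49, "ab": 0.01}
print(phase_posterior(equal))   # both phases equally likely
print(phase_posterior(skewed))  # Ab//aB is far more likely
```

With equal frequencies the two phases are equally likely; with the skewed frequencies Pr(AB//ab) = 0.0001/0.2402, roughly 0.0004, so knowing F nearly resolves the phase.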

17 17 Likelihood: L(G|F) Joint likelihood of G given F (G: genotypes; H: haplotypes; F: haplotype frequencies). Terms: probability of the k-th individual's G given F & HWE; haplotype-genotype compatibility index of the k-th individual. F=? => Max. L(G|F)
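The formula on this slide did not survive extraction. A standard way to write a likelihood matching the labels above is sketched here; the symbols n (number of individuals), f_a (frequency of haplotype a), and c_k(a,b) (compatibility index of the k-th individual) are assumed notation, and the sums run over ordered haplotype pairs.

```latex
L(G \mid F) = \prod_{k=1}^{n} \Pr(G_k \mid F),
\qquad
\Pr(G_k \mid F) = \sum_{a} \sum_{b} c_k(a,b)\, f_a f_b ,
\qquad
c_k(a,b) =
\begin{cases}
1 & \text{if the pair } (a,b) \text{ is compatible with } G_k \\
0 & \text{otherwise}
\end{cases}
```

The product f_a f_b is where HWE enters: the two haplotypes of an individual are treated as independent draws from F.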

18 18 EM Algorithm: Maximum Likelihood Estimation of Haplotype Freq.
Derivation labels: Lagrange multiplier; partial derivative equations; prior
EM recursion: Expectation (E), Maximization (M), E … M, E, M …
z=1 if haplotype i is in the pair (a,b), otherwise z=0
c=1 if the pair (a,b) is compatible with G, otherwise c=0
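The E and M steps can be sketched as follows. This is a minimal illustration under HWE for unrelated individuals with biallelic loci, not the code of any of the cited programs; it enumerates all compatible pairs, so it is only usable for a handful of loci. The names `ordered_pairs` and `em_haplotype_freqs`, the uniform starting point, and the fixed iteration count are all assumptions.

```python
from itertools import product

def ordered_pairs(genotype):
    """All ordered haplotype pairs (a, b) compatible with a genotype (c = 1)."""
    pairs = set()
    for choice in product((0, 1), repeat=len(genotype)):
        a = "".join(locus[c] for locus, c in zip(genotype, choice))
        b = "".join(locus[1 - c] for locus, c in zip(genotype, choice))
        pairs.add((a, b))
    return sorted(pairs)

def em_haplotype_freqs(genotypes, n_iter=50):
    """Estimate haplotype frequencies F by EM, starting from uniform frequencies."""
    options = [ordered_pairs(g) for g in genotypes]
    haps = sorted({h for opts in options for pair in opts for h in pair})
    f = {h: 1.0 / len(haps) for h in haps}
    for _ in range(n_iter):
        counts = dict.fromkeys(haps, 0.0)
        for opts in options:
            # E step: posterior weight of each compatible pair given current F.
            weights = [f[a] * f[b] for a, b in opts]
            total = sum(weights)
            for (a, b), w in zip(opts, weights):
                counts[a] += w / total
                counts[b] += w / total
        # M step: expected haplotype counts divided by the 2n sampled chromosomes.
        f = {h: counts[h] / (2 * len(genotypes)) for h in haps}
    return f

sample = [["AA", "BB"]] * 2 + [["aa", "bb"]] * 2 + [["Aa", "Bb"]]
print(em_haplotype_freqs(sample))
```

In this toy sample the double heterozygote is pulled toward AB//ab because AB and ab are already common, so the estimates for Ab and aB shrink toward zero.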

19 19 Posterior Probability of Haplotype Prior Prob. Posterior Prob.

20 20 Limitation of EM Algorithm For a diploid (2n) organism, a genotype with L heterozygous markers may involve 2^L possible haplotypes, so EM is impractical for large L. Only suitable for a small number of loci, 2~12. When L=20, 2^L = 1,048,576 … a large space of F. Subsetting approaches (partition-ligation, block partitioning, etc.) have been used to reduce the computational burden …

21 21 MCMC (Markov Chain Monte Carlo) Algorithm for unrelated individuals, by sampling from Pr(H|G,F) Stephens et al., 2001, Am. J. Hum. Genet., 68:978-989 (software: PHASE)

22 22 Markov Chain: MCMC Estimation
1. Random sampling based on Pr(H_k|G,H_-k), where H_-k denotes the haplotypes of all individuals other than k
2. Repeat many times
3. After getting close to the stationary distribution of P(H|G), collect samples
4. Average over samples
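The sampling loop can be illustrated with a deliberately simplified Gibbs sampler. This sketch is not PHASE: it replaces PHASE's coalescent-based conditional with a pseudo-count (Dirichlet-style) prior on haplotype frequencies, and every name and default here (`gibbs_phase`, `alpha`, `burn_in`, `n_sweeps`) is an assumption made for illustration.

```python
import random
from collections import Counter
from itertools import product

def ordered_pairs(genotype):
    """All ordered haplotype pairs compatible with a genotype of 2-character loci."""
    pairs = set()
    for choice in product((0, 1), repeat=len(genotype)):
        a = "".join(locus[c] for locus, c in zip(genotype, choice))
        b = "".join(locus[1 - c] for locus, c in zip(genotype, choice))
        pairs.add((a, b))
    return sorted(pairs)

def gibbs_phase(genotypes, n_sweeps=200, burn_in=50, alpha=0.1, seed=0):
    """Return, per individual, the sampled frequency of each unordered haplotype pair."""
    rng = random.Random(seed)
    options = [ordered_pairs(g) for g in genotypes]
    current = [rng.choice(opts) for opts in options]        # random starting phases
    counts = Counter(h for pair in current for h in pair)   # haplotype counts in H
    tallies = [Counter() for _ in genotypes]
    for sweep in range(n_sweeps):
        for k, opts in enumerate(options):
            for h in current[k]:                            # remove k: condition on H_-k
                counts[h] -= 1
            weights = []
            for a, b in opts:
                w_a = counts[a] + alpha
                w_b = counts[b] + alpha + (1 if a == b else 0)  # b is drawn after a
                weights.append(w_a * w_b)
            current[k] = rng.choices(opts, weights=weights)[0]
            for h in current[k]:                            # put k back with its new phase
                counts[h] += 1
            if sweep >= burn_in:                            # collect after burn-in
                tallies[k][tuple(sorted(current[k]))] += 1
    kept = n_sweeps - burn_in
    return [{pair: c / kept for pair, c in t.items()} for t in tallies]

sample = [["AA", "BB"]] * 3 + [["aa", "bb"]] * 3 + [["Aa", "Bb"]]
print(gibbs_phase(sample)[6])  # sampled phase frequencies for the AaBb individual
```

Averaging the collected samples gives an empirical posterior for each individual's haplotype pair; unambiguous individuals always return their single pair with frequency 1.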

23 23 Transition Probability Add the newly constructed haplotype to list H, pick G_{k+1} … Coalescent hypothesis, mutation rate, M haplotypes. Subsetting loci to reduce time.

24 24 EM vs. MCMC
EM: Search F, Max. L(G|F); Haplo. freq. => Haplo. construction; Maximum likelihood approach; "Analytical" posterior distribution; Fewer loci; Convergence: local maximum
MCMC: Sample from Pr(H|G,F); Haplo. construction => Haplo. freq.; Sampling approach; "Empirical" posterior distribution; More loci; Better convergence: whole parameter space (more computer time)

25 25 EM Algorithm for family data (no recombination, r=0) Pr(H_{fam.}|G,R,F)=?
Rohde et al., 2001, Human Mutation, 17: 289-295 (software: HAPLO)
Becher et al., 2004, Genetic Epidemiology, 27:21-32 (software: FAMHAP)
O’Connell, 2000, Genetic Epidemiology, 19(Suppl 1):S64-S70 (software: ZAPLO)

26 26 Haplotype Configuration of Family
[Figure: family genotypes (AaBb) and their possible haplotype configurations (AB//ab, Ab//aB)]
Recombinant configurations are impossible (r=0) or have very low probability (r nearly 0) and are ignored

27 27 EM Algorithm Haplotype Freq. Estimation using Nuclear Families Tips: Only use parents to calculate haplotype freq. (f). Use parents' + children's info to determine compatibility (c).

28 28 EM Algorithm Haplotype Freq. Estimation for General Pedigrees Tips: Only use founders to calculate haplotype freq. (f). Use all members (founders & non-founders) to determine compatibility (c). Discard the cases with too small probabilities to save time.

29 29 Posterior Probability of Haplotype Configuration Dad Mom

30 30 A Middle Summary … Subject-oriented Algorithms: joint prob./likelihood built individual by individual (unrelated) or family by family (r=0), across loci A, B, C. Large/General Pedigree & Allowing Recombination (r>0)?

31 31 Next … Locus-oriented Algorithm (Lander-Green): joint prob./likelihood built locus by locus (A, B, C, …) for a pedigree. For Large/General Pedigree Data & Allowing Recombination (r>0)

32 32 Inheritance Vector (V) of a pedigree Lander & Green, 1987, Proc. Natl. Acad. Sci., 84: 2363-2367 Kruglyak et al., 1996, Am. J. Hum. Genet., 58:1347-1363 (software: GENEHUNTER) Abecasis et al., 2005, Am. J. Hum. Genet., 77:754-767(software: MERLIN) Sobel et al., 1996, Am. J. Hum. Genet., 58:1323-1337 (software: SIMWALK2) Prob. A

33 33 Inheritance Vector & Haplotype 5: AaBb 1101 AB//ab 1101 1101 Ab//aB 1111

34 34 Lander-Green Algorithm (one pedigree; loci A, B, C, …)
Hidden status (inheritance vectors): V_A -> V_B -> V_C -> …
Transition Prob. = f(r): Pr(V_B|V_A), Pr(V_C|V_B), …, Pr(V_{t+1}|V_t)
Observations (genotypes): G_A, G_B, G_C, …
Emission Prob.: Pr(G_A|V_A), Pr(G_B|V_B), Pr(G_C|V_C), …
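The hidden Markov structure of the Lander-Green algorithm can be sketched as below. The assumptions: inheritance vectors are strings of 0/1 bits (one bit per meiosis), a single recombination rate r applies to every meiosis between adjacent loci, all inheritance vectors are equally likely at the first locus, and the emission probabilities Pr(G_t|V_t) are supplied by the caller. The function names are hypothetical.

```python
from itertools import product

def transition_prob(v_from, v_to, r):
    """Pr(V_{t+1} = v_to | V_t = v_from): each differing bit is one recombination."""
    d = sum(x != y for x, y in zip(v_from, v_to))  # Hamming distance
    return r ** d * (1 - r) ** (len(v_from) - d)

def forward_likelihood(emissions, r):
    """Likelihood of the genotypes at all loci, summed over inheritance vectors.

    emissions: one dict per locus mapping every inheritance vector to Pr(G_t | V_t).
    """
    n_bits = len(next(iter(emissions[0])))
    vectors = ["".join(bits) for bits in product("01", repeat=n_bits)]
    # Uniform prior over inheritance vectors at the first locus.
    alpha = {v: emissions[0][v] / len(vectors) for v in vectors}
    for em in emissions[1:]:
        # Forward recursion: move one locus to the right, then weight by the emission.
        alpha = {
            w: em[w] * sum(alpha[v] * transition_prob(v, w, r) for v in vectors)
            for w in vectors
        }
    return sum(alpha.values())

print(transition_prob("1101", "1111", 0.1))  # one recombination among four meioses
```

The state space has 2^m vectors for m meioses, so the cost grows linearly with the number of loci but exponentially with pedigree size, the mirror image of the subject-oriented methods.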

35 35 Lander-Green Algorithm Based (or Similar) Approaches
Kruglyak et al., 1996, Am. J. Hum. Genet., 58:1347-1363 (software: GENEHUNTER): Viterbi algorithm, the best haplotype configuration
Sobel et al., 1996, Am. J. Hum. Genet., 58:1323-1337 (software: SIMWALK2): MCMC, annealing & Metropolis process
Abecasis et al., 2005, Am. J. Hum. Genet., 77:754-767 (software: MERLIN): allowing LD & marker cluster/block

36 Haplotyping based on sequencing data (can be done for an individual subject, with no population data) 36

37 Rationale 37 Bansal et al. Genome Res. 2008 August; 18(8): 1336–1346.

38 Data Structure 38 Bansal et al. Genome Res. 2008 August; 18(8): 1336–1346.

39 Algorithms 39 Bansal et al. Genome Res. 2008 August; 18(8): 1336–1346. ML, or MCMC when the H space is huge

40 Prob(sequence | haplotype) 40 Bansal et al. Genome Res. 2008 August; 18(8): 1336–1346.
Figure labels: observed sequence; haplotype; sequencing/mapping error
Match indicator (for the j-th variant site of the i-th fragment): =1 if the observed sequence X matches the assumed haplotype, =0 otherwise
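The match indicator and error term combine into a per-fragment probability. The sketch below is an illustration in the spirit of this slide, not the exact model of Bansal et al.: it assumes biallelic sites coded "0"/"1", one shared error probability for all sites, and that a fragment is equally likely to come from either of the two complementary haplotypes. All names are hypothetical.

```python
def fragment_prob(fragment, haplotype, error):
    """Pr(observed fragment | it was read from this haplotype).

    fragment: dict mapping variant-site index j -> observed allele ("0" or "1").
    haplotype: string of "0"/"1" alleles, one per variant site.
    error: assumed per-site sequencing/mapping error probability.
    """
    prob = 1.0
    for j, allele in fragment.items():
        # Match indicator = 1 contributes (1 - error); a mismatch contributes error.
        prob *= (1 - error) if allele == haplotype[j] else error
    return prob

def fragment_likelihood(fragment, haplotype, error):
    """Average over the haplotype and its complement (the other chromosome)."""
    complement = "".join("1" if a == "0" else "0" for a in haplotype)
    return 0.5 * (fragment_prob(fragment, haplotype, error)
                  + fragment_prob(fragment, complement, error))

print(fragment_likelihood({0: "0", 1: "1"}, "01", 0.1))
```

Multiplying this quantity over all fragments gives a likelihood for a candidate haplotype, which can then be maximized or sampled from.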

41 Markov Chain 41 Bansal et al. Genome Res. 2008 August; 18(8): 1336–1346. Sampling H from.

42 42 Practices
(1) If a child's genotype at 4 loci is AaBbCcDD, list all possible haplotype pairs of the child and calculate the probability of each pair, given no extra information.
(2) If you know that his/her father's genotype is also AaBbCcDD and the mother's is AaBbCCDD, list all possible haplotype configurations of the family and calculate the probability of each configuration. (Assume recombination rate r=0)
(3) If you know the population haplotype frequencies below: ABCD(0.2), ABcD(0.1), AbcD(0.1), aBCD(0.1), aBcD(0.2), abcD(0.3), calculate the posterior probabilities in (1).
Within a week, send your answers to (E-mail: qunyuan@wustl.edu)

