University of Connecticut

Presentation on theme: "University of Connecticut"— Presentation transcript:

1 University of Connecticut
Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology. Ion Mandoiu, University of Connecticut

2 Outline
HMM model of haplotype diversity. Applications: phasing, error detection, imputation, genotype calling from low-coverage sequencing data. Conclusions.

3 Single Nucleotide Polymorphisms
Main form of variation between individual genomes: single nucleotide polymorphisms (SNPs). High density in the human genome: about 1 × 10^7 SNPs out of a total of 3 × 10^9 base pairs.
… ataggtccCtatttcgcgcCgtatacacgggActata …
… ataggtccGtatttcgcgcCgtatacacgggTctata …
… ataggtccCtatttcgcgcCgtatacacgggTctata …

4 Haplotypes and Genotypes
Diploids: two homologous copies of each autosomal chromosome, one inherited from the mother and one from the father. Haplotype: description of SNP alleles on a chromosome, encoded as a 0/1 vector (0 for the major allele, 1 for the minor allele). Genotype: description of alleles on both chromosomes, encoded as a 0/1/2 vector: 0 if both chromosomes carry the major allele, 1 if both carry the minor allele, 2 if the chromosomes carry different alleles. Each individual genotype is thus the combination of two haplotypes.
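The 0/1/2 encoding above can be illustrated with a short Python sketch. The function name is hypothetical and haplotypes are assumed to be given as strings of '0' and '1'; the input values are the h1 and h2 shown on the Genotype Phasing slide.

```python
def genotype_from_haplotypes(h1, h2):
    """Combine two 0/1 haplotype strings into a 0/1/2 genotype string."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNPs")
    # Equal alleles keep their value (0 or 1); different alleles become 2.
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


print(genotype_from_haplotypes("0010111", "0010010"))  # prints 0010212
```

The mapping is many-to-one: different haplotype pairs can produce the same genotype, which is why phasing is a non-trivial inference problem.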

5 Sources of Haplotype Diversity: Mutation
The International HapMap Consortium. A Haplotype Map of the Human Genome. Nature 437,

6 Sources of Haplotype Diversity: Recombination

7 Haplotype Structure in Human Populations

8 HMM Model of Haplotype Frequencies
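[Figure: HMM with hidden founder states F1, F2, …, Fn emitting observed alleles H1, H2, …, Hn]
Fi = founder haplotype at locus i, Hi = observed allele at locus i. P(Fi), P(Fi | Fi-1) and P(Hi | Fi) are estimated from reference genotype or haplotype data. For a given haplotype h, P(H=h|M) can be computed in O(nK^2) time using the forward algorithm, where n is the number of SNPs and K the number of founder haplotypes. Similar models were proposed in [Schwartz 04, Rastas et al. 05, Kimmel & Shamir 05, Scheet & Stephens 06].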
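A minimal sketch of the forward computation of P(H=h|M) in Python follows. The array layout (init, trans, emit) and the function name are assumptions made for illustration, not the layout used by the author's software.

```python
def haplotype_probability(h, init, trans, emit):
    """P(H = h | M) via the forward algorithm, in O(n K^2) time.

    h              : haplotype as a string or list of 0/1 alleles (length n)
    init[k]        : P(F_1 = k)
    trans[i][j][k] : P(founder k at locus i+1 | founder j at locus i), 0-based i
    emit[i][k][a]  : P(allele a at locus i | founder k at locus i)
    """
    K = len(init)
    # alpha[k] = P(h_1..h_i, F_i = k), initialised at the first locus.
    alpha = [init[k] * emit[0][k][int(h[0])] for k in range(K)]
    for i in range(1, len(h)):
        a = int(h[i])
        alpha = [
            sum(alpha[j] * trans[i - 1][j][k] for j in range(K)) * emit[i][k][a]
            for k in range(K)
        ]
    # Summing out the last founder state gives the haplotype probability.
    return sum(alpha)
```

Each locus costs K^2 multiplications, which gives the O(nK^2) bound. For long haplotypes a practical implementation would rescale alpha or work in log space to avoid underflow.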

9 Outline
HMM model of haplotype diversity. Applications: phasing, error detection, imputation, genotype calling from low-coverage sequencing data. Conclusions.

10 Genotype Phasing
Which pair of haplotypes explains genotype g: 0010212? Haplotypes shown: h1: 0010111, h2: 0010010, h3: 0010011.

11 Maximum Likelihood Genotype Phasing
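[Figure: two founder chains F1, F2, …, Fn and F'1, F'2, …, F'n emitting haplotype alleles H1, …, Hn and H'1, …, H'n, which combine into genotypes G1, G2, …, Gn]
Maximum likelihood genotype phasing: given g, find (h1,h2) = argmax_{h1+h2=g} P(h1|M)P(h2|M)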
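To make the objective concrete, here is a brute-force Python sketch that enumerates every haplotype pair compatible with a genotype and keeps the pair with the largest product of probabilities. The callable hap_prob is a hypothetical stand-in for P(h|M). The enumeration is exponential in the number of heterozygous sites, so this only illustrates the definition and is not a practical phasing method.

```python
from itertools import product


def compatible_pairs(g):
    """Yield each unordered pair (h1, h2) of 0/1 strings with h1 + h2 = g."""
    het = [i for i, c in enumerate(g) if c == "2"]
    if not het:
        yield g, g
        return
    # Fix the first heterozygous site to 0 in h1 so each pair appears once.
    for bits in product("01", repeat=len(het) - 1):
        h1, h2 = list(g), list(g)
        for i, b in zip(het, ("0",) + bits):
            h1[i] = b
            h2[i] = "1" if b == "0" else "0"
        yield "".join(h1), "".join(h2)


def ml_phasing(g, hap_prob):
    """Return the compatible pair maximising hap_prob(h1) * hap_prob(h2)."""
    return max(compatible_pairs(g), key=lambda p: hap_prob(p[0]) * hap_prob(p[1]))


print(list(compatible_pairs("0010212")))
```

For g = 0010212 there are two heterozygous sites and therefore two candidate pairs; a genotype with k heterozygous sites has 2^(k-1) of them.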

12 Computational Complexity
[KMP08] Cannot approximate max_{h1+h2=g} P(h1|M)P(h2|M) within a factor of O(n^(1/2-ε)), unless ZPP=NP. [Rastas et al.] give Viterbi and random sampling based heuristics that yield phasing accuracy comparable to the best existing methods (PHASE).

13 Outline
HMM model of haplotype diversity. Applications: phasing, error detection, imputation, genotype calling from low-coverage sequencing data. Conclusions.

14 Genotyping Errors
A real problem despite advances in technology and typing algorithms: 1.1% of 20 million dbSNP genotypes typed multiple times are inconsistent [Zaitlen et al. 2005]. Systematic errors (e.g., assay failure) are typically detected by departure from HWE [Hosking et al. 2004]. In pedigrees, some errors are detected as Mendelian Inconsistencies (MIs). Many errors remain undetected: as much as 70% of errors are Mendelian consistent for mother/father/child trios [Gordon et al. 1999].

15 Likelihood Sensitivity Approach to Error Detection in Trios
[Figure: trio diagram (mother, father, child) labelled with haplotypes h1, h2, h3, h4]
Likelihood of best phasing for original trio T

16 Likelihood Sensitivity Approach to Error Detection in Trios
[Figure: trio diagram (mother, father, child) labelled with haplotypes h’1, h’2, h’3, h’4, with one genotype marked "?"]
Likelihood of best phasing for modified trio T’ compared with likelihood of best phasing for original trio T

17 Likelihood Sensitivity Approach to Error Detection in Trios
[Figure: trio diagram (mother, father, child) with one genotype marked "?"]
A large change in likelihood suggests a likely error. Flag the genotype as an error if L(T’)/L(T) > R, where R is the detection threshold (e.g., R = 10^4).
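The flagging rule can be sketched in Python as follows. Working with base-10 log-likelihoods is an assumption made here for numerical stability (trio likelihoods over many SNPs are very small numbers); the function and parameter names are hypothetical.

```python
import math


def is_flagged(log10_lik_original, log10_lik_modified, threshold=1e4):
    """Flag a genotype when L(T') / L(T) > R, with likelihoods given as log10 values."""
    # log10(L(T') / L(T)) = log10 L(T') - log10 L(T)
    return log10_lik_modified - log10_lik_original > math.log10(threshold)


# A ratio of 10^5 exceeds R = 10^4, a ratio of 10^3 does not.
print(is_flagged(-50.0, -45.0), is_flagged(-50.0, -47.0))
```

Raising the threshold R trades sensitivity for fewer false positives.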

18 Alternate Likelihood Functions
[KMP08] Cannot approximate L(T) within O(n^(1/4-ε)), unless ZPP=NP. Efficiently computable likelihood functions: Viterbi probability, probability of Viterbi haplotypes, total trio probability.

19 Comparison of Alternative Likelihood Functions (1% Random Allele Errors)

20 Log-Likelihood Ratio Distribution
FPs caused by same-locus errors in parents

21 “Combined” Detection Method
Compute 4 likelihood ratios: trio, mother-child duo, father-child duo, child (unrelated). Flag as error if all ratios are above the detection threshold.
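A small Python sketch of the combined rule, assuming the four likelihood ratios have already been computed and are passed in a dictionary with hypothetical key names:

```python
def combined_flag(ratios, threshold=1e4):
    """Flag a genotype only if every likelihood ratio exceeds the threshold.

    ratios: dict with keys such as 'trio', 'mother_child', 'father_child', 'child'.
    """
    return all(r > threshold for r in ratios.values())


ratios = {"trio": 3e6, "mother_child": 5e4, "father_child": 2e3, "child": 8e5}
print(combined_flag(ratios))  # prints False: the father-child ratio is below 10^4
```

Requiring agreement of all four ratios makes the test stricter than the trio ratio alone, which is one way to reduce false positives of the kind caused by same-locus errors in the parents.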

22 Comparison with FAMHAP (Children)

23 Comparison with FAMHAP (Parents)

24 Outline
HMM model of haplotype diversity. Applications: phasing, error detection, imputation, genotype calling from low-coverage sequencing data. Conclusions.

25 Genome-Wide Association Studies
Powerful method for finding genes associated with complex human diseases. A large number of markers (SNPs) are typed in cases and controls. Disease-causal SNPs are unlikely to be typed directly. Significant statistical power is gained by imputing untyped HapMap genotypes [WTCCC’07].

26 HMM Based Genotype Imputation
Train the HMM using haplotypes from a related HapMap population or a small cohort typed at high density. Compute the probability of each missing genotype given the typed genotype data. gi is imputed as

27 Experimental Results
Estimates of the allele 0 frequency based on imputation vs. Illumina 15k

28 Experimental Results
Accuracy and missing data rate for imputed genotypes at different thresholds

29 Outline
HMM model of haplotype diversity. Applications: phasing, error detection, imputation, genotype calling from low-coverage sequencing data. Conclusions.

30 Ultra-High Throughput Sequencing
New massively parallel sequencing technologies deliver orders of magnitude higher throughput compared to Sanger sequencing.
Roche / 454 Genome Sequencer FLX: 100 Mb/run, 400bp reads
Illumina / Solexa Genome Analyzer 1G: 1000 Mb/run, 35bp reads
Applied Biosystems SOLiD: 3000 Mb/run, 25-35bp reads
-SBS: Sequencing by Synthesis
-SBL: Sequencing by Ligation
-Challenges in genome assembly: the short read lengths and absence of paired ends make it difficult for assembly software to disambiguate repeat regions, resulting in fragmented assemblies.
-New types of sequencing error: in 454 data these include incorrect estimates of homopolymer lengths, ‘transposition-like’ insertions (a base identical to a nearby homopolymer is inserted in a nearby nonadjacent location) and errors caused by multiple templates attached to the same bead.

31 Probabilistic Model
[Figure: hierarchical-factorial HMM with founder chains F1, F2, …, Fn and F'1, F'2, …, F'n, haplotype alleles H1, …, Hn and H'1, …, H'n, genotypes G1, G2, …, Gn, and read sets Ri,1, …, Ri,c at each locus i]
The Hierarchical-Factorial HMM we used to describe haplotype sequences is similar to models recently proposed by others. At the core of the model are two regular HMMs representing haplotype frequencies in the populations of origin of the sequenced individual’s parents. Under this model each haplotype in the population is viewed as a mosaic formed as a result of historical recombination among a set of K founder haplotypes. Increasing the number of founders allows more paths to describe a haplotype sequence, but at the cost of a longer run time. Training of this HMM is done via the Baum-Welch algorithm, based on haplotypes inferred from a panel representing the population of origin of each parent.

32 Model Training
Initial founder probabilities P(f1), P(f’1), transition probabilities P(fi+1|fi), P(f’i+1|f’i), and emission probabilities P(hi|fi), P(h’i|f’i) are trained using the Baum-Welch algorithm from haplotypes inferred from the populations of origin for mother/father. P(gi|hi,h’i) is set to 1 if hi+h’i=gi and to 0 otherwise. Read probabilities are expressed in terms of the probability that read r has an error at locus i, from which the conditional probabilities for sets of reads are derived.
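The slide's formulas for read probabilities are not reproduced here, so the following Python sketch shows one common way such a term can be written, stated as an assumption rather than as the author's definition: reads at a locus are treated as independent, the per-read error probability is derived from a Phred base quality score (error = 10^(-Q/10)), and an erroneous read is assumed to report the other allele. All function names are hypothetical.

```python
def phred_to_error(q):
    """Convert a Phred quality score Q to an error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


def read_set_likelihood(reads, g):
    """P(reads at one locus | genotype g), assuming independent reads.

    reads : list of (allele, phred_quality) pairs with allele in {0, 1}
    g     : 0 (both major), 1 (both minor) or 2 (heterozygous)
    """
    lik = 1.0
    for allele, q in reads:
        eps = phred_to_error(q)
        if g == 2:
            # Each chromosome is sampled with probability 1/2; one carries the
            # observed allele (read correct), the other does not (read in error).
            lik *= 0.5 * (1 - eps) + 0.5 * eps
        else:
            lik *= (1 - eps) if allele == g else eps
    return lik


reads = [(0, 20), (0, 20)]  # two reads showing the major allele, Q = 20
print([read_set_likelihood(reads, g) for g in (0, 1, 2)])
```

With only two reads the heterozygous genotype remains fairly plausible under this model, which is why low-coverage data benefits from the population-level haplotype model.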

33 Multilocus Genotyping Problem
GIVEN: shotgun read sets r=(r1, r2, … , rn); base quality scores; HMMs for populations of origin for mother/father.
FIND: multilocus genotype g*=(g*1,g*2,…,g*n) with maximum posterior probability, i.e., g*=argmax_g P(g | r).
NOTE: computing P(g|r) is NP-hard.

34 Posterior Decoding Algorithm
For each i = 1..n, compute the genotype gi with maximum posterior probability P(gi | r), and return the resulting multilocus genotype. The required joint probabilities can be computed using a forward-backward algorithm. Direct implementation gives O(m + nK^4) time, where m = number of reads, n = number of SNPs, and K = number of founder haplotypes in the HMMs. Runtime is reduced to O(m + nK^3) using a speed-up idea similar to [Rastas et al. 08, Kennedy et al. 08].
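The final per-locus step of posterior decoding can be sketched in Python as below. The sketch assumes the forward-backward pass has already produced, for each locus i, the joint probabilities P(gi = g, r) for g in {0, 1, 2}; the input layout and function name are hypothetical, and the forward-backward computation itself is not shown.

```python
def posterior_decode(joint):
    """Pick the most probable genotype at each locus.

    joint[i][g] = P(g_i = g, r) for g in (0, 1, 2).
    Returns a list of (genotype, posterior probability) pairs.
    """
    calls = []
    for probs in joint:
        total = sum(probs)  # P(r), identical in theory at every locus
        best = max(range(len(probs)), key=probs.__getitem__)
        calls.append((best, probs[best] / total))
    return calls


joint = [[0.06, 0.01, 0.03], [0.001, 0.002, 0.007]]
print([g for g, _ in posterior_decode(joint)])  # prints [0, 2]
```

Returning the posterior alongside each call allows low-confidence calls to be filtered with a threshold afterwards.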

35 Genotyping Accuracy on Watson Reads

36 Outline
HMM model of haplotype diversity. Applications: phasing, error detection, imputation, genotype calling from low-coverage sequencing data. Conclusions.

37 Conclusions
The HMM model of haplotype diversity provides a powerful framework for addressing central problems in population genetics and genetic epidemiology. It enables significant improvements in accuracy by exploiting the high amount of linkage disequilibrium in human populations. Despite hardness results, heuristics such as posterior or Viterbi decoding perform well in practice. Highly scalable runtime (linear in #SNPs and #individuals/reads). Software available at

38 Acknowledgements
Sanjiv Dinakar, Jorge Duitama, Yözen Hernández, Justin Kennedy, Bogdan Pasaniuc. NSF funding (awards IIS and DBI )

