Hidden Markov Models of Haplotype Diversity and Applications in Genetic Epidemiology
Ion Mandoiu, University of Connecticut
Outline
- HMM model of haplotype diversity
- Applications: phasing, error detection, imputation, genotype calling from low-coverage sequencing data
- Conclusions
Single Nucleotide Polymorphisms
- Main form of variation between individual genomes: single nucleotide polymorphisms (SNPs)
- High density in the human genome: $1 \times 10^7$ SNPs out of a total of $3 \times 10^9$ base pairs
… ataggtccCtatttcgcgcCgtatacacgggActata …
… ataggtccGtatttcgcgcCgtatacacgggTctata …
… ataggtccCtatttcgcgcCgtatacacgggTctata …
Haplotypes and Genotypes
- Diploids: two homologous copies of each autosomal chromosome, one inherited from the mother and one from the father
- Haplotype: description of SNP alleles on a chromosome; 0/1 vector: 0 for major allele, 1 for minor
- Genotype: description of alleles on both chromosomes; 0/1/2 vector: 0 (1) if both chromosomes contain the major (minor) allele, 2 if the chromosomes contain different alleles
- Example: haplotypes 011100110 + 001000010 give genotype 021200210 (two haplotypes per individual genotype)
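The 0/1/2 encoding above can be illustrated with a minimal Python sketch. The function name `genotype_from_haplotypes` is illustrative only and does not come from the talk or its software.

```python
def genotype_from_haplotypes(h1, h2):
    """Combine two 0/1 haplotype strings into a 0/1/2 genotype string."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    # Equal alleles keep their value (0 or 1); different alleles become 2.
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


# The example from the slide.
print(genotype_from_haplotypes("011100110", "001000010"))  # 021200210
```

The mapping is many-to-one: a genotype with k heterozygous (2) positions is consistent with 2^(k-1) unordered haplotype pairs for k ≥ 1, which is what makes phasing a nontrivial inference problem.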
Sources of Haplotype Diversity: Mutation The International HapMap Consortium. A Haplotype Map of the Human Genome. Nature 437, 1299-1320. 2005.
Sources of Haplotype Diversity: Recombination
Haplotype Structure in Human Populations
HMM Model of Haplotype Frequencies
[Figure: chain of founder states F1, F2, …, Fn, each emitting the observed allele H1, H2, …, Hn]
- Fi = founder haplotype at locus i, Hi = observed allele at locus i
- P(Fi), P(Fi | Fi-1) and P(Hi | Fi) estimated from reference genotype or haplotype data
- For a given haplotype h, P(H=h|M) can be computed in $O(nK^2)$ time using the forward algorithm, where K is the number of founder haplotypes
- Similar models proposed in [Schwartz 04, Rastas et al. 05, Kimmel&Shamir 05, Scheet&Stephens 06]
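The forward computation of P(H=h|M) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the parameter layout (`init`, `trans`, `emit`) and the toy numbers are assumptions made for the example.

```python
from itertools import product


def haplotype_probability(h, init, trans, emit):
    """Forward algorithm: returns P(H = h | M) in O(n K^2) time.

    h              : sequence of 0/1 alleles, one per locus
    init[k]        : P(F = k) at the first locus
    trans[i][j][k] : P(F at locus i+1 = k | F at locus i = j), 0-based i
    emit[i][k][a]  : P(H at locus i = a | F at locus i = k)
    """
    n, K = len(h), len(init)
    forward = [init[k] * emit[0][k][h[0]] for k in range(K)]
    for i in range(1, n):
        # K sums of K terms per locus gives the O(n K^2) total.
        forward = [
            emit[i][k][h[i]] * sum(forward[j] * trans[i - 1][j][k] for j in range(K))
            for k in range(K)
        ]
    return sum(forward)


# Toy model (hypothetical numbers): K = 2 founders, n = 3 loci.
init = [0.6, 0.4]
trans = [[[0.9, 0.1], [0.2, 0.8]]] * 2
emit = [[[0.95, 0.05], [0.10, 0.90]]] * 3

total = sum(haplotype_probability(h, init, trans, emit) for h in product((0, 1), repeat=3))
print(round(total, 10))  # probabilities of all 2^3 haplotypes sum to 1
```

For long haplotypes the forward values underflow in floating point, so a practical version would rescale per locus or work in log space.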
Genotype Phasing
g: 0010212 → which haplotype pair?
Candidate haplotypes: h1: 0010111, h2: 0010010, h3: 0010011
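Maximum Likelihood Genotype Phasing
[Figure: two founder chains F1, F2, …, Fn and F'1, F'2, …, F'n emit haplotype alleles H1, H2, …, Hn and H'1, H'2, …, H'n, which jointly determine genotypes G1, G2, …, Gn]
Maximum likelihood genotype phasing: given g, find $(h_1, h_2) = \arg\max_{h_1 + h_2 = g} P(h_1 \mid M)\, P(h_2 \mid M)$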
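To make the objective concrete, here is a brute-force sketch that enumerates every haplotype pair compatible with g and keeps the pair maximizing P(h1|M)P(h2|M). It assumes a caller-supplied function `hap_prob` standing in for P(h|M). The frequency table below is hypothetical, and the haplotype 0010110 is simply the complement of h3 with respect to g.

```python
from itertools import product


def ml_phasing(g, hap_prob):
    """Exhaustive argmax over h1 + h2 = g of hap_prob(h1) * hap_prob(h2).

    g is a 0/1/2 genotype string; time is exponential in the number of 2s.
    """
    het = [i for i, x in enumerate(g) if x == "2"]
    best_pair, best_score = None, -1.0
    for bits in product("01", repeat=len(het)):
        h1, h2 = list(g), list(g)
        for i, b in zip(het, bits):
            # At a heterozygous site the two haplotypes carry opposite alleles.
            h1[i], h2[i] = b, ("1" if b == "0" else "0")
        h1, h2 = "".join(h1), "".join(h2)
        score = hap_prob(h1) * hap_prob(h2)
        if score > best_score:
            best_pair, best_score = (h1, h2), score
    return best_pair, best_score


# Hypothetical haplotype probabilities standing in for P(h | M).
freq = {"0010111": 0.4, "0010010": 0.3, "0010011": 0.2, "0010110": 0.1}
pair, score = ml_phasing("0010212", lambda h: freq.get(h, 0.0))
print(sorted(pair))  # ['0010010', '0010111']
```

The enumeration visits 2^k assignments for k heterozygous sites, so it is only usable on very short genotypes.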
Computational Complexity
- [KMP08] Cannot approximate $\max_{h_1 + h_2 = g} P(h_1 \mid M)\, P(h_2 \mid M)$ within a factor of $O(n^{1/2 - \varepsilon})$, unless ZPP=NP
- [Rastas et al.] give Viterbi and random sampling based heuristics that yield phasing accuracy comparable to the best existing methods (PHASE)
Genotyping Errors
- A real problem despite advances in technology and typing algorithms: 1.1% of 20 million dbSNP genotypes typed multiple times are inconsistent [Zaitlen et al. 2005]
- Systematic errors (e.g., assay failure) are typically detected by departure from Hardy-Weinberg equilibrium (HWE) [Hosking et al. 2004]
- In pedigrees, some errors are detected as Mendelian inconsistencies (MIs)
- Many errors remain undetected: as many as 70% of errors are Mendelian consistent for mother/father/child trios [Gordon et al. 1999]
Likelihood Sensitivity Approach to Error Detection in Trios
Genotypes: Mother 0 1 2 1 0 2, Father 0 2 2 1 0 2, Child 0 2 2 1 0 2
Best phasing for original trio T:
- Mother: h1 = 0 1 1 1 0 0, h2 = 0 1 0 1 0 1
- Father: h3 = 0 0 0 1 0 1, h4 = 0 1 1 1 0 0
- Child: h1 = 0 1 1 1 0 0, h3 = 0 0 0 1 0 1
L(T) = likelihood of best phasing for original trio T
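Likelihood Sensitivity Approach to Error Detection in Trios
Genotypes: Mother 0 1 2 1 0 2, Father 0 2 2 1 0 2; the child genotype under test is changed to give modified trio T’
Best phasing for modified trio T’:
- Mother: h’1 = 0 1 0 1 0 1, h’2 = 0 1 1 1 0 0
- Father: h’3 = 0 0 0 1 0 0, h’4 = 0 1 1 1 0 1
- Child: h’1 = 0 1 0 1 0 1, h’3 = 0 0 0 1 0 0
How does the likelihood of the best phasing for modified trio T’ compare with the likelihood of the best phasing for original trio T?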
Likelihood Sensitivity Approach to Error Detection in Trios
Genotypes: Mother 0 1 2 1 0 2, Father 0 2 2 1 0 2, Child 0 2 2 1 0 2, with one genotype under test (?)
- Large change in likelihood suggests likely error
- Flag genotype as an error if $L(T')/L(T) > R$, where R is the detection threshold (e.g., $R = 10^4$)
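The flagging rule can be sketched in a few lines. Because trio likelihoods over many SNPs are extremely small numbers, this sketch assumes the likelihoods are supplied as log10 values and compares their difference with log10(R). The function name and argument layout are illustrative, not taken from the authors' software.

```python
import math


def flag_genotype(log10_l_original, log10_l_modified, R=1e4):
    """Return True if L(T') / L(T) > R, evaluated in log10 space."""
    return (log10_l_modified - log10_l_original) > math.log10(R)


# Hypothetical log-likelihoods: the modified trio is 10^5 times more likely.
print(flag_genotype(-30.0, -25.0))  # True
# Here the ratio is only 10^3, below the threshold R = 10^4.
print(flag_genotype(-30.0, -27.0))  # False
```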
Alternate Likelihood Functions
- [KMP08] Cannot approximate L(T) within $O(n^{1/4 - \varepsilon})$, unless ZPP=NP
- Efficiently computable likelihood functions: Viterbi probability, probability of Viterbi haplotypes, total trio probability
Comparison of Alternative Likelihood Functions (1% Random Allele Errors)
Log-Likelihood Ratio Distribution FPs caused by same-locus errors in parents
“Combined” Detection Method
- Compute 4 likelihood ratios: trio, mother-child duo, father-child duo, child (unrelated)
- Flag as error if all ratios are above the detection threshold
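A minimal sketch of the combined rule follows, assuming the four ratios have already been computed and are passed as log10 values in a dictionary. The key names are hypothetical labels chosen for this example.

```python
import math

REQUIRED = ("trio", "mother_child", "father_child", "child")


def combined_flag(log10_ratios, R=1e4):
    """Flag a genotype only if all four log10 likelihood ratios exceed log10(R)."""
    missing = [name for name in REQUIRED if name not in log10_ratios]
    if missing:
        raise KeyError(f"missing likelihood ratios: {missing}")
    threshold = math.log10(R)
    return all(log10_ratios[name] > threshold for name in REQUIRED)


# Hypothetical ratios: the father-child duo does not support an error call.
ratios = {"trio": 6.1, "mother_child": 5.2, "father_child": 1.3, "child": 4.8}
print(combined_flag(ratios))  # False
```

Requiring agreement across all four views trades some sensitivity for fewer false positives, since a single large ratio is not enough to flag a genotype.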
Comparison with FAMHAP (Children)
Comparison with FAMHAP (Parents)
Genome-Wide Association Studies
- Powerful method for finding genes associated with complex human diseases
- Large number of markers (SNPs) typed in cases and controls
- Disease causal SNPs unlikely to be typed directly
- Significant statistical power gained by imputing untyped HapMap genotypes [WTCCC’07]
HMM Based Genotype Imputation
- Train the HMM using haplotypes from a related HapMap population or a small cohort typed at high density
- Compute the probability of the missing genotypes given the typed genotype data [equation not preserved]
- gi is imputed as [equation not preserved]
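The imputation rule itself did not survive in this transcript, so the following is only a sketch of one common choice and should not be read as the authors' formula. It assumes the HMM has already produced a posterior distribution over the values 0/1/2 for a missing genotype, calls the most probable value, and leaves the genotype missing when that probability falls below a confidence threshold. The function name and the threshold parameter are hypothetical.

```python
def impute_call(posterior, threshold=0.0):
    """Call the most probable genotype, or None if its posterior is below threshold.

    posterior: dict mapping genotype value (0, 1, 2) to its posterior probability.
    """
    best = max(posterior, key=posterior.get)
    return best if posterior[best] >= threshold else None


# Hypothetical posterior for one untyped SNP.
post = {0: 0.70, 1: 0.05, 2: 0.25}
print(impute_call(post, threshold=0.5))  # 0
print(impute_call(post, threshold=0.9))  # None
```

Raising the threshold leaves more genotypes uncalled in exchange for higher accuracy among the calls that remain.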
Experimental Results Estimates of the allele 0 frequency based on Imputation vs. Illumina 15k
Experimental Results Accuracy and missing data rate for imputed genotypes at different thresholds
Ultra-High Throughput Sequencing
New massively parallel sequencing technologies deliver orders of magnitude higher throughput compared to Sanger sequencing
- SBS: Sequencing by Synthesis
- SBL: Sequencing by Ligation
- Challenges in genome assembly: short read lengths and the absence of paired ends make it difficult for assembly software to disambiguate repeat regions, resulting in fragmented assemblies
- New types of sequencing errors: in 454, these include incorrect estimates of homopolymer lengths, ‘transposition-like’ insertions (a base identical to a nearby homopolymer is inserted in a nearby nonadjacent location) and errors caused by multiple templates attached to the same bead
Platforms:
- Roche / 454 Genome Sequencer FLX: 100 Mb/run, 400bp reads
- Illumina / Solexa Genome Analyzer 1G: 1000 Mb/run, 35bp reads
- Applied Biosystems SOLiD: 3000 Mb/run, 25-35bp reads
Probabilistic Model
[Figure: two founder chains F1, F2, …, Fn and F'1, F'2, …, F'n emit haplotype alleles H1, H2, …, Hn and H'1, H'2, …, H'n, which determine genotypes G1, G2, …, Gn; each genotype Gi generates the reads Ri,1, …, Ri,ci covering locus i]
The hierarchical-factorial HMM we used to describe haplotype sequences is similar to models recently proposed by others. At the core of the model are two regular HMMs representing haplotype frequencies in the populations of origin of the sequenced individual’s parents. Under this model, each haplotype in the population is viewed as a mosaic formed by historical recombination among a set of K founder haplotypes. Increasing the number of founders allows more paths to describe a haplotype sequence, but at the cost of a longer run time. Each HMM is trained with the Baum-Welch algorithm on haplotypes inferred from a panel representing the population of origin of the corresponding parent.
Model Training
- Initial founder probabilities P(f1), P(f’1), transition probabilities P(fi+1|fi), P(f’i+1|f’i), and emission probabilities P(hi|fi), P(h’i|f’i) trained using the Baum-Welch algorithm from haplotypes inferred from the populations of origin for mother/father
- P(gi|hi,h’i) set to 1 if hi+h’i=gi and to 0 otherwise
- Conditional probabilities for sets of reads are given by [equation not preserved], which depends on the probability that read r has an error at locus i
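The read-likelihood formula is not preserved in this transcript, so the sketch below is an assumption-laden illustration rather than the authors' equation. It assumes each read reports allele 0 or 1 at the locus with a Phred base quality Q, that the per-read error probability is 10^(-Q/10) (the standard Phred definition), that reads are independent, and that a heterozygous genotype is sampled from either chromosome with probability 1/2. All function names are hypothetical.

```python
def phred_to_error(q):
    """Standard Phred scale: error probability = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def reads_likelihood(reads, genotype):
    """P(reads at one locus | genotype) under the assumptions stated above.

    reads    : list of (allele, phred_quality) with allele in {0, 1}
    genotype : 0 or 1 for homozygous major/minor, 2 for heterozygous
    """
    p = 1.0
    for allele, q in reads:
        eps = phred_to_error(q)
        if genotype == 2:
            # 0.5 * (1 - eps) + 0.5 * eps = 0.5, whichever allele is observed.
            p *= 0.5
        else:
            p *= (1 - eps) if allele == genotype else eps
    return p


# Hypothetical reads at one SNP: two reads with allele 0, one with allele 1.
reads = [(0, 20), (0, 30), (1, 20)]
for g in (0, 1, 2):
    print(g, reads_likelihood(reads, g))
```

With these three reads the heterozygous genotype has the highest likelihood, because either homozygous genotype must explain at least one read as a sequencing error.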
Multilocus Genotyping Problem
GIVEN:
- Shotgun read sets r=(r1, r2, … , rn)
- Base quality scores
- HMMs for populations of origin for mother/father
FIND:
- Multilocus genotype g*=(g*1,g*2,…,g*n) with maximum posterior probability, i.e., $g^* = \arg\max_g P(g \mid r)$
NOTE: maximizing P(g|r) is NP-hard…
Posterior Decoding Algorithm
- For each i = 1..n, compute $g^*_i = \arg\max_{g_i} P(g_i \mid r) = \arg\max_{g_i} P(g_i, r)$
- Return $g^* = (g^*_1, g^*_2, \ldots, g^*_n)$
- Joint probabilities can be computed using a forward-backward algorithm [equation not preserved]
- Direct implementation gives $O(m + nK^4)$ time, where m = number of reads, n = number of SNPs, K = number of founder haplotypes in the HMMs
- Runtime reduced to $O(m + nK^3)$ using a speed-up idea similar to [Rastas et al. 08, Kennedy et al. 08]
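Posterior decoding with forward-backward can be illustrated on a deliberately simplified model. The sketch below uses a plain single-chain, time-homogeneous HMM and picks the most probable hidden state at each position. It is not the hierarchical-factorial model of the talk, where the hidden state at each locus is a pair of founders. The parameter names and toy numbers are assumptions.

```python
def posterior_decode(obs, init, trans, emit):
    """Per-position argmax of P(S_i = k | obs) via forward-backward.

    init[k]     : P(S_1 = k)
    trans[j][k] : P(S_{i+1} = k | S_i = j)
    emit[k][o]  : P(O_i = o | S_i = k)
    """
    n, K = len(obs), len(init)
    fwd = [[0.0] * K for _ in range(n)]
    bwd = [[1.0] * K for _ in range(n)]
    for k in range(K):
        fwd[0][k] = init[k] * emit[k][obs[0]]
    for i in range(1, n):
        for k in range(K):
            fwd[i][k] = emit[k][obs[i]] * sum(fwd[i - 1][j] * trans[j][k] for j in range(K))
    for i in range(n - 2, -1, -1):
        for j in range(K):
            bwd[i][j] = sum(trans[j][k] * emit[k][obs[i + 1]] * bwd[i + 1][k] for k in range(K))
    # fwd[i][k] * bwd[i][k] equals P(S_i = k, obs); normalizing does not change the argmax.
    return [max(range(K), key=lambda k: fwd[i][k] * bwd[i][k]) for i in range(n)]


# Toy two-state model (hypothetical numbers).
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [[0.9, 0.1], [0.1, 0.9]]
print(posterior_decode([0, 0, 1, 1], init, trans, emit))  # [0, 0, 1, 1]
```

In the factorial setting each locus has K^2 joint founder states, so a direct transition step touches K^2 × K^2 = K^4 state pairs per locus, which is where the nK^4 term in the slide comes from.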
Genotyping Accuracy on Watson Reads
Conclusions
- HMM model of haplotype diversity provides a powerful framework for addressing central problems in population genetics and genetic epidemiology
- Enables significant improvements in accuracy by exploiting the high amount of linkage disequilibrium in human populations
- Despite hardness results, heuristics such as posterior or Viterbi decoding perform well in practice
- Highly scalable runtime (linear in #SNPs and #individuals/reads)
- Software available at http://www.engr.uconn.edu/~ion/SOFT/
Acknowledgements Sanjiv Dinakar, Jorge Duitama, Yözen Hernández, Justin Kennedy, Bogdan Pasaniuc NSF funding (awards IIS-0546457 and DBI-0543365)