Presentation is loading. Please wait.

Presentation is loading. Please wait.

Algorithms for Genotype and Haplotype Inference from Low- Coverage Short Sequencing Reads Ion Mandoiu Computer Science and Engineering Department University.

Similar presentations


Presentation on theme: "Algorithms for Genotype and Haplotype Inference from Low- Coverage Short Sequencing Reads Ion Mandoiu Computer Science and Engineering Department University."— Presentation transcript:

1 Algorithms for Genotype and Haplotype Inference from Low- Coverage Short Sequencing Reads Ion Mandoiu Computer Science and Engineering Department University of Connecticut Joint work with S. Dinakar, J. Duitama, Y. Hernández, J. Kennedy, and Y. Wu

2 Outline Introduction Single SNP Genotype Calling Multilocus Genotyping Problem Experimental Results Conclusion

3 Illumina Genome Analyzer II 35-50bp reads 1.5Gb/2.5 day run Roche/454 FLX Titanium 400bp reads 400Mb/10h run ABI SOLiD 2.0 25-35bp reads 3-4Gb/6 day run Recent massively parallel sequencing technologies deliver orders of magnitude higher throughput compared to classic Sanger sequencing Ultra-high throughput DNA sequencing Helicos HeliScope 25-55bp reads >1Gb/day

4 UHTS enable personal genomics C.Venter Sanger@7.5x J. Watson 454@7.4x NA18507 Illumina@36x SOLiD@12x

5 Sequencing provides single-base resolution of genetic variation (SNPs, CNVs, genome rearrangements) However, interpretation requires determination of both alleles at variable loci This is limited by coverage depth due to random nature of shotgun sequencing For the Venter and Watson genomes (both sequenced at ~7.5x average coverage), comparison with SNP genotyping chips has shown only ~75% accuracy for sequencing based calls of heterozygous SNPs [Levy et al 07, Wheeler et al 08] Challenges for medical applications of sequencing

6 Allele coverage for heterozygous SNPs (Watson 454 @ 5.85x avg. coverage)

7 Allele coverage for heterozygous SNPs (Watson 454 @ 2.93x avg. coverage)

8 Allele coverage for heterozygous SNPs (Watson 454 @ 1.46x avg. coverage)

9 Allele coverage for heterozygous SNPs (Watson 454 @ 0.73x avg. coverage)

10 Allele coverage for heterozygous SNPs (Watson 454 @ 0.37x avg. coverage)

11 All prior genotype calling methods are based on allele coverage [Levy et al 07] and [Wheeler et al 08] require that each allele be covered by at least 2 reads in order to be called Combined with hypothesis testing based on the binomial distribution when calling hets Binomial probability for the observed number of 0 and 1 alleles must be at least 0.01 [Wendl&Wilson 08] generalize coverage methods to allow an arbitrary minimum allele coverage k Estimate that as much as 21x coverage will be required for sequencing of normal tissue samples based on idealized theory that “neglects any heuristic inputs” Prior work

12 We propose methods incorporating additional sources of information: Quality scores reflecting uncertainty in sequencing data Allele/genotype frequency and linkage disequilibrium (LD) info extracted from a reference panel such as Hapmap Experimental results show significantly improved genotyping accuracy Do heuristic inputs help?

13 Outline Introduction Single SNP Genotype Calling Multilocus Genotyping Problem Experimental Results Conclusion

14 Biallelic SNPs: 0 = major allele, 1 = minor allele SNP genotypes: 0/2 = homozygous major/minor, 1=heterozygous Inferred genotypes Mapped reads with allele 0 Mapped reads with allele 1 012100120 Sequencing errors Basic notations

15 Let r i denote the set of mapped reads covering SNP locus i and c i =| r i | For a read r in r i, r(i) denotes the allele observed at locus i If q r(i) is the phred quality score of r(i), the probability that r(i) is incorrect is given by Incorporating base call uncertainty Probability of observing read set r i conditional on G i :

16 Applying Bayes’ formula: Where are genotype frequencies inferred from a representative panel Single SNP genotype calling

17 Outline Introduction Single SNP Genotype Calling Multilocus Genotyping Problem Experimental Results Conclusion

18 Haplotype structure in human populations

19 F i = founder haplotype at locus i, H i = observed allele at locus i For given haplotype h, P(H=h|M) can be computed in O(nK 2 ) using forward algorithm Similar models proposed in [Schwartz 04, Rastas et al. 05, Kennedy et al. 07, Kimmel&Shamir 05, Scheet&Stephens 06] HMM model of haplotype frequencies F1F1 F2F2 FnFn … H1H1 H2H2 HnHn

20 F1F1 F2F2 FnFn … H1H1 H2H2 HnHn G1G1 G2G2 GnGn …R 1,1 R 2,1 F' 1 F' 2 F' n … H' 1 H' 2 H' n R 1,c …R 2,c …R n,1 R n,c 1 2 n HF-HMM for multilocus genotype inference P(f1), P(f’1), P(fi+1|fi), P(f’i+1|f’i), P(hi|fi), P(h’i|f’i) trained using Baum-Welch algorithm on haplotypes inferred from the populations of origin for mother/father

21 F1F1 F2F2 FnFn … H1H1 H2H2 HnHn G1G1 G2G2 GnGn …R 1,1 R 2,1 F' 1 F' 2 F' n … H' 1 H' 2 H' n R 1,c …R 2,c …R n,1 R n,c 1 2 n HF-HMM for multilocus genotype inference P(gi|hi,h’i) set to 1 if h+h’i=gi and to 0 otherwise

22 F1F1 F2F2 FnFn … H1H1 H2H2 HnHn G1G1 G2G2 GnGn …R 1,1 R 2,1 F' 1 F' 2 F' n … H' 1 H' 2 H' n R 1,c …R 2,c …R n,1 R n,c 1 2 n HF-HMM for multilocus genotype inference

23 GIVEN: Shotgun read sets r=(r 1, r 2, …, r n ) Quality scores Trained HMM models representing LD in populations of origin for mother/father FIND: Multilocus genotype g*=(g* 1,g* 2,…,g* n ) with maximum posterior probability, i.e., g*=argmax g P(g | r ) Multilocus genotyping problem Bad news: max g P(g | r) cannot be approximated within unless ZPP=NP

24 Posterior decoding algorithm 1. For each i = 1..n, compute 2. Return

25 fifi … hihi gigi … r 1,1 r i,1 f’ i … h’ i r 1,c … r i,c …R n,1 R n,c 1 i n … … Forward-backward computation of posterior probabilities

26 fifi … hihi gigi … r 1,1 r i,1 f’ i … h’ i r 1,c … r i,c …R n,1 R n,c 1 i n … … Forward-backward computation of posterior probabilities

27 fifi … hihi gigi … r 1,1 r i,1 f’ i … h’ i r 1,c … r i,c …R n,1 R n,c 1 i n … … Forward-backward computation of posterior probabilities

28 fifi … hihi gigi … r 1,1 r i,1 f’ i … h’ i r 1,c … r i,c …R n,1 R n,c 1 i n … … Forward-backward computation of posterior probabilities

29 fifi … hihi gigi … r 1,1 r i,1 f’ i … h’ i r 1,c … r i,c …R n,1 R n,c 1 i n … … Forward-backward computation of posterior probabilities

30 Runtime Direct recurrences for computing forward probabilities: Runtime reduced to O(m+nK 3 ) by reusing common terms: where

31 Outline Introduction Single SNP Genotype Calling Multilocus Genotyping Problem Experimental Results Conclusion

32 >gi|88943037|ref|NT_113796.1|Hs1_111515 Homo sapiens chromosome 1 genomic contig, reference assembly GAATTCTGTGAAAGCCTGTAGCTATAAAAAAATGTTGAGCCATAAATACCATCAGAAATAACAAAGGGAG CTTTGAAGTATTCTGAGACTTGTAGGAAGGTGAAGTAAATATCTAATATAATTGTAACAAGTAGTGCTTG GATTGTATGTTTTTGATTATTTTTTGTTAGGCTGTGATGGGCTCAAGTAATTGAAATTCCTGATGCAAGT AATACAGATGGATTCAGGAGAGGTACTTCCAGGGGGTCAAGGGGAGAAATACCTGTTGGGGGTCAATGCC CTCCTAATTCTGGAGTAGGGGCTAGGCTAGAATGGTAGAATGCTCAAAAGAATCCAGCGAAGAGGAATAT TTCTGAGATAATAAATAGGACTGTCCCATATTGGAGGCCTTTTTGAACAGTTGTTGTATGGTGACCCTGA AATGTACTTTCTCAGATACAGAACACCCTTGGTCAATTGAATACAGATCAATCACTTTAAGTAAGCTAAG TCCTTACTAAATTGATGAGACTTAAACCCATGAAAACTTAACAGCTAAACTCCCTAGTCAACTGGTTTGA ATCTACTTCTCCAGCAGCTGGGGGAAAAAAGGTGAGAGAAGCAGGATTGAAGCTGCTTCTTTGAATTTAC >gi|88943037|ref|NT_113796.1|Hs1_111515 Homo sapiens chromosome 1 genomic contig, reference assembly GAATTCTGTGAAAGCCTGTAGCTATAAAAAAATGTTGAGCCATAAATACCATCAGAAATAACAAAGGGAG CTTTGAAGTATTCTGAGACTTGTAGGAAGGTGAAGTAAATATCTAATATAATTGTAACAAGTAGTGCTTG GATTGTATGTTTTTGATTATTTTTTGTTAGGCTGTGATGGGCTCAAGTAATTGAAATTCCTGATGCAAGT AATACAGATGGATTCAGGAGAGGTACTTCCAGGGGGTCAAGGGGAGAAATACCTGTTGGGGGTCAATGCC CTCCTAATTCTGGAGTAGGGGCTAGGCTAGAATGGTAGAATGCTCAAAAGAATCCAGCGAAGAGGAATAT TTCTGAGATAATAAATAGGACTGTCCCATATTGGAGGCCTTTTTGAACAGTTGTTGTATGGTGACCCTGA AATGTACTTTCTCAGATACAGAACACCCTTGGTCAATTGAATACAGATCAATCACTTTAAGTAAGCTAAG TCCTTACTAAATTGATGAGACTTAAACCCATGAAAACTTAACAGCTAAACTCCCTAGTCAACTGGTTTGA ATCTACTTCTCCAGCAGCTGGGGGAAAAAAGGTGAGAGAAGCAGGATTGAAGCTGCTTCTTTGAATTTAC >gnl|ti|1779718824 name:EI1W3PE02ILQXT 28 28 28 28 26 28 28 40 34 14 44 36 23 13 2 27 42 35 21 7 27 42 35 21 6 28 43 36 22 10 27 42 35 20 6 28 43 36 22 9 28 43 36 22 9 28 44 36 24 14 4 28 28 28 27 28 26 26 35 26 40 34 18 3 28 28 28 27 33 24 26 28 28 28 40 33 14 28 36 27 26 26 37 29 28 28 28 28 27 28 28 28 37 28 27 27 28 36 28 37 28 28 28 27 28 28 28 24 28 28 27 28 28 37 29 36 27 27 28 27 28 33 23 28 33 23 28 36 27 33 23 28 35 25 28 28 36 27 36 27 28 28 28 24 28 37 29 28 19 28 26 37 29 26 39 33 13 37 28 28 28 21 24 28 27 41 34 15 28 36 27 26 28 24 35 27 28 40 34 15 >gnl|ti|1779718824 name:EI1W3PE02ILQXT 28 28 28 28 26 28 28 40 34 14 44 36 23 13 2 27 42 35 21 7 27 42 35 21 6 28 43 36 22 10 27 42 35 20 6 28 43 36 22 9 28 43 36 22 9 28 44 36 24 14 4 28 28 28 27 28 26 26 35 26 40 34 18 3 28 28 28 27 33 24 26 28 28 28 40 33 14 28 36 27 26 26 37 29 28 28 28 28 27 28 28 28 37 28 27 27 28 36 28 37 28 28 28 27 28 28 28 24 28 28 27 28 28 37 29 36 27 27 28 27 28 33 23 28 33 23 28 36 27 33 23 28 35 25 28 28 36 27 36 27 28 28 28 24 28 37 29 28 19 28 26 37 29 26 39 33 13 37 28 28 28 21 24 28 27 41 34 15 28 36 27 26 28 24 35 27 28 40 34 15 >gnl|ti|1779718824 name:EI1W3PE02ILQXT TCAGTGAGGGTTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTTGAGACAGAATTTTGCTCTT GTCGCCCAGGCTGGTGTGCAGTGGTGCAACCTCAGCTCACTGCAACCTCTGCCTCCAGGTTCAAGCAATT CTCTGCCTCAGCCTCCCAAGTAGCTGGGATTACAGGCGGGCGCCACCACGCCCAGCTAATTTTGTATTGT TAGTAAAGATGGGGTTTCACTACGTTGGCTGAGCTGTTCTCGAACTCCTGACCTCAAATGAC >gnl|ti|1779718825 name:EI1W3PE02GTXK0 TCAGAATACCTGTTGCCCATTTTTATATGTTCCTTGGAGAAATGTCAATTCAGAGCTTTTGCTCAGCTTT TAATATGTTTATTTGTTTTGCTGCTGTTGAGTTGTACAATGTTGGGGAAAACAGTCGCACAACACCCGGC AGGTACTTTGAGTCTGGGGGAGACAAAGGAGTTAGAAAGAGAGAGAATAAGCACTTAAAAGGCGGGTCCA GGGGGCCCGAGCATCGGAGGGTTGCTCATGGCCCACAGTTGTCAGGCTCCACCTAATTAAATGGTTTACA >gnl|ti|1779718824 name:EI1W3PE02ILQXT TCAGTGAGGGTTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTTGAGACAGAATTTTGCTCTT GTCGCCCAGGCTGGTGTGCAGTGGTGCAACCTCAGCTCACTGCAACCTCTGCCTCCAGGTTCAAGCAATT CTCTGCCTCAGCCTCCCAAGTAGCTGGGATTACAGGCGGGCGCCACCACGCCCAGCTAATTTTGTATTGT TAGTAAAGATGGGGTTTCACTACGTTGGCTGAGCTGTTCTCGAACTCCTGACCTCAAATGAC >gnl|ti|1779718825 name:EI1W3PE02GTXK0 TCAGAATACCTGTTGCCCATTTTTATATGTTCCTTGGAGAAATGTCAATTCAGAGCTTTTGCTCAGCTTT TAATATGTTTATTTGTTTTGCTGCTGTTGAGTTGTACAATGTTGGGGAAAACAGTCGCACAACACCCGGC AGGTACTTTGAGTCTGGGGGAGACAAAGGAGTTAGAAAGAGAGAGAATAAGCACTTAAAAGGCGGGTCCA GGGGGCCCGAGCATCGGAGGGTTGCTCATGGCCCACAGTTGTCAGGCTCCACCTAATTAAATGGTTTACA Mapped reads Hapmap genotypes or haplotypes 90 209342 16 F 0 0 2110001?0100210010011002122201210211?1221220212000 18 F 15 16 21100012010021001001100?100201?10111110111?0212000 15 M 0 0 21120010012001201001120010110101011111011110212000 7 M 0 0 2110001001000200122110001111011100111?121210222000 8 F 0 0 011202100120022012211200101101210211122111?0120000 12 F 9 10 21100010010002001221100010110111001112121210220000 9 M 0 0 011?001?012002201221120010?10121021112211110120000 11 M 7 8 21100210010002001221100012110111001112121210222000 90 209342 16 F 0 0 2110001?0100210010011002122201210211?1221220212000 18 F 15 16 21100012010021001001100?100201?10111110111?0212000 15 M 0 0 21120010012001201001120010110101011111011110212000 7 M 0 0 2110001001000200122110001111011100111?121210222000 8 F 0 0 011202100120022012211200101101210211122111?0120000 12 F 9 10 21100010010002001221100010110111001112121210220000 9 M 0 0 011?001?012002201221120010?10121021112211110120000 11 M 7 8 21100210010002001221100012110111001112121210222000 90 209342 16 F 0 0 2110001?0100210010011002122201210211?1221220212000 18 F 15 16 21100012010021001001100?100201?10111110111?0212000 15 M 0 0 21120010012001201001120010110101011111011110212000 7 M 0 0 2110001001000200122110001111011100111?121210222000 8 F 0 0 011202100120022012211200101101210211122111?0120000 12 F 9 10 21100010010002001221100010110111001112121210220000 9 M 0 0 011?001?012002201221120010?10121021112211110120000 11 M 7 8 21100210010002001221100012110111001112121210222000 Reference genome sequence >gi|88943037|ref|NT_113796.1|Hs1_111515 Homo sapiens chromosome 1 genomic contig, reference assembly GAATTCTGTGAAAGCCTGTAGCTATAAAAAAATGTTGAGCCATAAATACCATCAGAAATAACAAAGGGAG CTTTGAAGTATTCTGAGACTTGTAGGAAGGTGAAGTAAATATCTAATATAATTGTAACAAGTAGTGCTTG GATTGTATGTTTTTGATTATTTTTTGTTAGGCTGTGATGGGCTCAAGTAATTGAAATTCCTGATGCAAGT AATACAGATGGATTCAGGAGAGGTACTTCCAGGGGGTCAAGGGGAGAAATACCTGTTGGGGGTCAATGCC CTCCTAATTCTGGAGTAGGGGCTAGGCTAGAATGGTAGAATGCTCAAAAGAATCCAGCGAAGAGGAATAT TTCTGAGATAATAAATAGGACTGTCCCATATTGGAGGCCTTTTTGAACAGTTGTTGTATGGTGACCCTGA AATGTACTTTCTCAGATACAGAACACCCTTGGTCAATTGAATACAGATCAATCACTTTAAGTAAGCTAAG TCCTTACTAAATTGATGAGACTTAAACCCATGAAAACTTAACAGCTAAACTCCCTAGTCAACTGGTTTGA ATCTACTTCTCCAGCAGCTGGGGGAAAAAAGGTGAGAGAAGCAGGATTGAAGCTGCTTCTTTGAATTTAC … … … …… … … >gnl|ti|1779718824 name:EI1W3PE02ILQXT TCAGTGAGGGTTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTTGAGACAGAATTTTGCTCTT GTCGCCCAGGCTGGTGTGCAGTGGTGCAACCTCAGCTCACTGCAACCTCTGCCTCCAGGTTCAAGCAATT CTCTGCCTCAGCCTCCCAAGTAGCTGGGATTACAGGCGGGCGCCACCACGCCCAGCTAATTTTGTATTGT TAGTAAAGATGGGGTTTCACTACGTTGGCTGAGCTGTTCTCGAACTCCTGACCTCAAATGAC >gnl|ti|1779718825 name:EI1W3PE02GTXK0 TCAGAATACCTGTTGCCCATTTTTATATGTTCCTTGGAGAAATGTCAATTCAGAGCTTTTGCTCAGCTTT TAATATGTTTATTTGTTTTGCTGCTGTTGAGTTGTACAATGTTGGGGAAAACAGTCGCACAACACCCGGC AGGTACTTTGAGTCTGGGGGAGACAAAGGAGTTAGAAAGAGAGAGAATAAGCACTTAAAAGGCGGGTCCA GGGGGCCCGAGCATCGGAGGGTTGCTCATGGCCCACAGTTGTCAGGCTCCACCTAATTAAATGGTTTACA >gnl|ti|1779718824 name:EI1W3PE02ILQXT 28 28 28 28 26 28 28 40 34 14 44 36 23 13 2 27 42 35 21 7 27 42 35 21 6 28 43 36 22 10 27 42 35 20 6 28 43 36 22 9 28 43 36 22 9 28 44 36 24 14 4 28 28 28 27 28 26 26 35 26 40 34 18 3 28 28 28 27 33 24 26 28 28 28 40 33 14 28 36 27 26 26 37 29 28 28 28 28 27 28 28 28 37 28 27 27 28 36 28 37 28 28 28 27 28 28 28 24 28 28 27 28 28 37 29 36 27 27 28 27 28 33 23 28 33 23 28 36 27 33 23 28 35 25 28 28 36 27 36 27 28 28 28 24 28 37 29 28 19 28 26 37 29 26 39 33 13 37 28 28 28 21 24 28 27 41 34 15 28 36 27 26 28 24 35 27 28 40 34 15 Read sequences Quality scores SNP genotype calls rs12095710 T T 9.988139e-01 rs12127179 C T 9.986735e-01 rs11800791 G G 9.977713e-01 rs11578310 G G 9.980062e-01 rs1287622 G G 8.644588e-01 rs11804808 C C 9.977779e-01 rs17471528 A G 5.236099e-01 rs11804835 C C 9.977759e-01 rs11804836 C C 9.977925e-01 rs1287623 G G 9.646510e-01 rs13374307 G G 9.989084e-01 rs12122008 G G 5.121655e-01 rs17431341 A C 5.290652e-01 rs881635 G G 9.978737e-01 rs9700130 A A 9.989940e-01 rs11121600 A A 6.160199e-01 rs12121542 A A 5.555713e-01 rs11121605 T T 8.387705e-01 rs12563779 G G 9.982776e-01 rs11121607 C G 5.639239e-01 rs11121608 G T 5.452936e-01 rs12029742 G G 9.973527e-01 rs562118 C C 9.738776e-01 rs12133533 A C 9.956655e-01 rs11121648 G G 9.077355e-01 rs9662691 C C 9.988648e-01 rs11805141 C C 9.928786e-01 rs1287635 C C 6.113270e-01 Pipeline for LD-Based Genotype Calling

33 Datasets Watson Sequencing data: 74.4 million 454 reads (average length 265bp) Reference panel: CEU genotypes from Hapmap r23a phased using the ENT algorithm [Gusev et al. 08] Ground truth: duplicate Affymetrix 500k SNP genotypes NA18507 (Illumina & SOLiD) Sequencing data: 525 million Illumina reads (36bp, paired) and 764 million SOLiD reads (24 - 44bp, unpaired) Reference panel: YRI haplotypes from Hapmap r22 excluding NA18507 haplotypes Ground truth: Hapmap r22 genotypes

34 Mapping statistics Dataset Raw reads Raw sequence Mapped reads Test SNPs Avg. mapped SNP cov. Watson74.2M19.7Gb 49.8M (67%) 443K5.85x NA18507 Illumina 525M18.9Gb 397M (78%) 2.85M6.10x NA18507 SOLiD 764M21.15Gb 324M (42%) 2.85M3.21x

35 Concordance vs. avg. coverage (Watson 454 reads)

36 Tradeoff with call rate (5.85x Watson 454 reads, homo SNPs)

37 Tradeoff with call rate (5.85x Watson 454 reads, het SNPs)

38 Concordance vs. avg. coverage for NA18507 (Illumina & SOLiD reads)

39 Posterior decoding algorithm has scalable running time and yields significant improvements in genotyping calling accuracy Improvement depends on the coverage depth (higher at lower coverage), e.g., accuracy achieved by previously proposed binomial test at 5-6x average coverage is achieved by HMM-based posterior decoding algorithm using less than 1/4 of the reads Open source code available at http://dna.engr.uconn.edu/software/GeneSeq/ http://dna.engr.uconn.edu/software/GeneSeq/ Ongoing work Extension to population sequencing data (removing need for reference panels) Haplotype reconstruction Conclusions & ongoing work

40 Acknowledgments Work supported in part by NSF awards IIS-0546457 and DBI-0543365 to IM and IIS-0803440 to YW. SD and YH performed this research as part of the Summer REU program “Bio-Grid Initiatives for Interdisciplinary Research and Education" funded by NSF award CCF-0755373.

41 Mapping Procedure 454 reads mapped on human genome build 36.3 using the NUCMER tool of the MUMmer package [Kurtz et al 04] with default parameters Additional filtering: at least 90% of the read length matched to the genome, no more than 10 errors (mismatches or indels) Reads meeting above conditions at multiple genome positions (likely coming from genomic repeats) were discarded Illumina and SOLiD reads mapped using MAQ [Li et al 08] with default parameters For reads mapped at multiple positions MAQ returns best position (breaking ties arbitrarily) together with mapping confidence We filtered bad alignments and discarded paired end reads that are not mapped in pairs using the “submap -p” command

42 Recombination rate effects (NA18507 Illumina)

43 Coverage effects (NA18507 Illumina)


Download ppt "Algorithms for Genotype and Haplotype Inference from Low- Coverage Short Sequencing Reads Ion Mandoiu Computer Science and Engineering Department University."

Similar presentations


Ads by Google