Genotype and Haplotype Reconstruction from Low-Coverage Short Sequencing Reads
Ion Mandoiu, Computer Science and Engineering Department, University of Connecticut


Genotype and Haplotype Reconstruction from Low-Coverage Short Sequencing Reads
Ion Mandoiu, Computer Science and Engineering Department, University of Connecticut
Joint work with S. Dinakar, J. Duitama, Y. Hernández, J. Kennedy, and Y. Wu

Outline
- Introduction
- Single SNP Genotype Calling
- Multilocus Genotyping Problem
- Experimental Results
- Conclusion

Ultra-high throughput DNA sequencing
Recent massively parallel sequencing technologies deliver orders of magnitude higher throughput compared to classic Sanger sequencing:
- Illumina Genome Analyzer II: 35-75bp reads, 2-3Gb/2 day run
- Roche/454 FLX Titanium: 400bp reads, Mb/10h run
- ABI SOLiD: bp reads, 5-7.5Gb/3.5-7 day run
- Helicos HeliScope: 25-55bp reads, ~2.5Gb/day

UHTS enables personal genomics
[Figure: individual genomes sequenced with UHTS: C. Venter, J. Watson, NA18507]

Challenges for medical applications of sequencing
- Sequencing can potentially provide all genetic variations (SNPs, CNVs, genome rearrangements) at single-base resolution…
- However, medical use requires determination of both alleles (genotype) at variable loci
- Accurate genotype calling is limited by coverage depth due to the random nature of shotgun sequencing
- For the Venter and Watson genomes (both sequenced at ~7.5x average coverage), comparison with SNP genotyping chips has shown only ~75% accuracy for sequencing-based calls of heterozygous SNPs [Levy et al 07, Wheeler et al 08]
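To make the coverage-depth limitation concrete, here is a minimal sketch. It assumes (this is not stated in the slides) that the number of reads covering each allele of a heterozygous SNP is Poisson-distributed with mean equal to half the average coverage, and that the two alleles are sampled independently. The function name and the default threshold of 2 reads per allele are illustrative choices.

```python
import math

def prob_both_alleles_covered(avg_coverage, min_reads=2):
    """Probability that both alleles of a het SNP are each seen in >= min_reads reads.

    Assumption: reads per allele ~ Poisson(avg_coverage / 2), alleles independent.
    """
    lam = avg_coverage / 2.0
    # P(X < min_reads) for X ~ Poisson(lam)
    below = sum(math.exp(-lam) * lam ** k / math.factorial(k)
                for k in range(min_reads))
    return (1.0 - below) ** 2

# Probability of seeing both alleles at least twice, for a few coverage levels
for c in (1.0, 2.5, 5.0, 7.5, 20.0):
    print(f"{c:5.1f}x  {prob_both_alleles_covered(c):.3f}")
```

Under this idealized model, a sizable fraction of heterozygous loci never shows both alleles often enough to be called when the average coverage is in the single digits.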

Allele coverage for heterozygous SNPs (Watson 5.85x avg. coverage)

Allele coverage for heterozygous SNPs (Watson 2.93x avg. coverage)

Allele coverage for heterozygous SNPs (Watson 1.46x avg. coverage)

Allele coverage for heterozygous SNPs (Watson 0.73x avg. coverage)

Allele coverage for heterozygous SNPs (Watson 0.37x avg. coverage)

Prior work
- Most prior genotype calling methods are based on allele coverage
- [Levy et al 07] and [Wheeler et al 08] require that each allele be covered by at least 2 reads in order to be called
  - Combined with hypothesis testing based on the binomial distribution when calling hets
  - Binomial probability for the observed number of alleles must be at least 0.01
- [Wendl&Wilson 08] generalize coverage methods to allow an arbitrary minimum allele coverage k
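The coverage-based rule described above can be sketched as follows. This is an illustrative reading, not the code of the cited papers: the function name is hypothetical, the binomial test is assumed to use the point probability of the observed allele split under a fair 50/50 draw, and a locus that fails the test is assumed to be left uncalled.

```python
from math import comb

def call_by_allele_coverage(n0, n1, min_cov=2, min_binom_p=0.01):
    """Return genotype 0, 1 or 2, or None for "no call".

    n0, n1: number of reads showing the major (0) and minor (1) allele.
    Assumptions: an allele counts as present when covered by >= min_cov
    reads; the het test uses the Binomial(n0 + n1, 0.5) point probability.
    """
    has0, has1 = n0 >= min_cov, n1 >= min_cov
    if has0 and has1:
        n = n0 + n1
        p = comb(n, n1) * 0.5 ** n
        return 1 if p >= min_binom_p else None
    if has0:
        return 0
    if has1:
        return 2
    return None

print(call_by_allele_coverage(3, 3))   # balanced alleles
print(call_by_allele_coverage(5, 0))   # only the major allele observed
print(call_by_allele_coverage(20, 2))  # strongly unbalanced alleles
```

Setting `min_cov` to another value corresponds to the arbitrary minimum allele coverage k mentioned in the last bullet.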

Prior work (contd.)
- MAQ [Li,Ruan&Durbin 08]
  - Widely used read mapping program
  - Single SNP genotype calling incorporating read mapping confidence and quality scores
  - Mostly tuned for de novo SNP discovery…

What coverage is required?
- [Wendl&Wilson 08] estimate that 21x coverage will be required for sequencing of normal tissue samples, based on idealized theory that “neglects any heuristic inputs”

Do heuristic inputs help?
- We propose methods incorporating additional sources of information extracted from a reference panel such as Hapmap:
  - Allele/genotype frequencies
  - Linkage disequilibrium
- Experimental results show significantly improved genotyping accuracy


Basic assumptions
- Known SNP positions
- Biallelic SNPs: 0 = major allele, 1 = minor allele
- SNP genotypes: 0/2 = homozygous major/minor, 1 = heterozygous

Incorporating base call and read mapping uncertainty
- $r_i$ = set of mapped reads covering SNP locus $i$
- For each read $r$ in $r_i$:
  - $r(i)$ = the allele observed at locus $i$
  - $\varepsilon_{r(i)} = 10^{-q_{r(i)}/10}$ = probability that $r(i)$ is incorrect, where $q_{r(i)}$ is the phred quality score of $r(i)$
  - $m_r$ = mapping confidence of $r$
[Figure: mapped reads with allele 0, mapped reads with allele 1, sequencing errors, inferred genotypes]


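Single SNP genotype calling
Applying Bayes’ formula:
$$P(G_i = g \mid r_i) = \frac{P(r_i \mid G_i = g)\,P(G_i = g)}{\sum_{g' \in \{0,1,2\}} P(r_i \mid G_i = g')\,P(G_i = g')}$$
where $P(G_i = g)$ are genotype frequencies inferred from a representative panel.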
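A minimal sketch of single SNP calling with Bayes' formula follows. The read likelihood used here is an assumption, since the slide's formula is not preserved: under a homozygous genotype a read shows the matching allele with probability 1 - eps and the other allele with probability eps, and under a heterozygous genotype either allele is observed with probability 1/2. Mapping confidence is ignored for brevity, and all function names are hypothetical.

```python
def phred_to_error(q):
    """Phred quality score q -> probability that the base call is wrong."""
    return 10.0 ** (-q / 10.0)

def single_snp_posterior(reads, genotype_freqs):
    """Posterior over genotypes 0, 1, 2 at one SNP locus.

    reads: list of (allele, phred_quality) pairs, allele in {0, 1}.
    genotype_freqs: prior (P(G=0), P(G=1), P(G=2)), e.g. panel frequencies.
    """
    lik = [1.0, 1.0, 1.0]
    for allele, q in reads:
        eps = phred_to_error(q)
        lik[0] *= (1.0 - eps) if allele == 0 else eps  # homozygous major
        lik[1] *= 0.5                                  # heterozygous
        lik[2] *= (1.0 - eps) if allele == 1 else eps  # homozygous minor
    joint = [l * p for l, p in zip(lik, genotype_freqs)]
    total = sum(joint)
    return [j / total for j in joint]

def call_single_snp(reads, genotype_freqs):
    """Genotype with the highest posterior probability."""
    post = single_snp_posterior(reads, genotype_freqs)
    return max(range(3), key=lambda g: post[g])

reads = [(0, 30), (0, 30), (1, 30)]  # two major-allele reads, one minor
print(single_snp_posterior(reads, (0.25, 0.5, 0.25)))
print(call_single_snp(reads, (0.25, 0.5, 0.25)))
```

With no reads at a locus the posterior equals the prior, so the panel genotype frequencies decide the call.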


Haplotype structure in human populations

HMM model of haplotype frequencies
- Similar models proposed in [Schwartz 04, Rastas et al. 05, Kennedy et al. 07, Kimmel&Shamir 05, Scheet&Stephens 06]

Graphical Model Representation
Random variables:
- $F_i$ = founder haplotype at locus $i$
- $H_i$ = observed allele at locus $i$
For a fully specified model $M$ and a given haplotype $h$, $P(H=h \mid M)$ can be computed in $O(nK^2)$ time using the forward algorithm, where $n$ = #SNPs and $K$ = #founders.
[Figure: chain of founder nodes $F_1, F_2, \dots, F_n$ with allele nodes $H_1, H_2, \dots, H_n$]
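The forward computation of the haplotype probability can be sketched as below. The parameter layout (`init`, `trans`, `emit` as nested lists) and the toy numbers are assumptions made for illustration; only the recurrence itself is the standard HMM forward algorithm.

```python
def haplotype_probability(h, init, trans, emit):
    """P(H = h | M) via the forward algorithm in O(n K^2) time.

    h:     list of n alleles (0 or 1)
    init:  init[k] = P(F_1 = k)
    trans: trans[i][j][k] = P(founder k at locus i+1 | founder j at locus i),
           for 0-based loci i = 0..n-2
    emit:  emit[i][k][a] = P(allele a at locus i | founder k at locus i)
    """
    n, K = len(h), len(init)
    alpha = [init[k] * emit[0][k][h[0]] for k in range(K)]
    for i in range(1, n):
        alpha = [
            emit[i][k][h[i]] * sum(alpha[j] * trans[i - 1][j][k] for j in range(K))
            for k in range(K)
        ]
    return sum(alpha)

# Toy model with K = 2 founders and n = 2 SNPs (made-up numbers)
toy_init = [0.6, 0.4]
toy_trans = [[[0.9, 0.1], [0.2, 0.8]]]
toy_emit = [[[0.95, 0.05], [0.1, 0.9]],
            [[0.8, 0.2], [0.3, 0.7]]]
print(haplotype_probability([0, 1], toy_init, toy_trans, toy_emit))
```

Each locus costs K^2 multiplications (every current founder sums over every previous founder), which gives the O(nK^2) total.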

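HF-HMM for multilocus genotype inference
- $P(f_1)$, $P(f'_1)$, $P(f_{i+1} \mid f_i)$, $P(f'_{i+1} \mid f'_i)$, $P(h_i \mid f_i)$, $P(h'_i \mid f'_i)$ are trained using the Baum-Welch algorithm on haplotypes inferred from the populations of origin for mother/father
[Figure: graphical model with founder nodes $F_1, \dots, F_n$ and $F'_1, \dots, F'_n$, allele nodes $H_1, \dots, H_n$ and $H'_1, \dots, H'_n$, genotype nodes $G_1, \dots, G_n$, and read nodes $R_{i,1}, \dots, R_{i,c_i}$ for each locus $i$]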

HF-HMM for multilocus genotype inference
- $P(g_i \mid h_i, h'_i)$ is set to 1 if $h_i + h'_i = g_i$ and to 0 otherwise


Multilocus genotyping problem
GIVEN:
- Shotgun read sets $r = (r_1, r_2, \dots, r_n)$
- Trained HMM models representing LD in populations of origin for mother/father
- Quality scores & read mapping confidence values
FIND:
- Multilocus genotype $g^* = (g^*_1, g^*_2, \dots, g^*_n)$ with maximum posterior probability, i.e., $g^* = \arg\max_g P(g \mid r)$

Computational complexity
- Theorem: $\max_g P(g \mid r)$ cannot be approximated within unless ZPP=NP
- Idea: reduction from the clique problem

Posterior decoding algorithm
1. For each $i = 1..n$, compute $g^*_i = \arg\max_{g_i} P(g_i \mid r)$
2. Return $g^* = (g^*_1, \dots, g^*_n)$
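The two steps of posterior decoding can be sketched with a forward-backward pass over pairs of founder states. This is an illustrative implementation under stated assumptions, not the authors' code: the per-locus read likelihoods `read_lik[i][g]` are taken as precomputed inputs, each parental model is a hypothetical `(init, trans, emit)` tuple of nested lists, the direct $O(nK^4)$ recurrences are used, and no rescaling is applied, so it only suits short toy inputs.

```python
def posterior_decode(read_lik, model_m, model_f):
    """Per-locus argmax of P(g_i | r) for a two-chain founder HMM.

    read_lik[i][g] = P(r_i | G_i = g) for g in {0, 1, 2}.
    model = (init, trans, emit) with init[k], trans[i][j][k] (locus i -> i+1)
    and emit[i][k][a] = P(allele a | founder k) at locus i.
    """
    init_m, trans_m, emit_m = model_m
    init_f, trans_f, emit_f = model_f
    n, K = len(read_lik), len(init_m)
    states = [(a, b) for a in range(K) for b in range(K)]

    def geno_probs(i, a, b):
        # P(G_i = g | founders a, b), using g = h + h'
        pa, pb = emit_m[i][a], emit_f[i][b]
        return [pa[0] * pb[0], pa[0] * pb[1] + pa[1] * pb[0], pa[1] * pb[1]]

    def emission(i, a, b):
        # P(r_i | founders a, b), summing over the genotype at locus i
        pg = geno_probs(i, a, b)
        return sum(pg[g] * read_lik[i][g] for g in range(3))

    def step(i, a, b, c, d):
        # transition of the founder pair (a, b) -> (c, d) between loci i, i+1
        return trans_m[i][a][c] * trans_f[i][b][d]

    # fwd[i][a, b] = P(r_1..r_i, founders (a, b) at locus i)
    fwd = [{(a, b): init_m[a] * init_f[b] * emission(0, a, b) for a, b in states}]
    for i in range(1, n):
        prev = fwd[-1]
        fwd.append({
            (c, d): emission(i, c, d)
            * sum(prev[a, b] * step(i - 1, a, b, c, d) for a, b in states)
            for c, d in states
        })
    # bwd[i][a, b] = P(r_{i+1}..r_n | founders (a, b) at locus i)
    bwd = [None] * n
    bwd[n - 1] = {s: 1.0 for s in states}
    for i in range(n - 2, -1, -1):
        nxt = bwd[i + 1]
        bwd[i] = {
            (a, b): sum(step(i, a, b, c, d) * emission(i + 1, c, d) * nxt[c, d]
                        for c, d in states)
            for a, b in states
        }
    calls = []
    for i in range(n):
        score = [0.0, 0.0, 0.0]  # proportional to P(g_i, r)
        for a, b in states:
            e = emission(i, a, b)
            if e == 0.0:
                continue
            w = fwd[i][a, b] * bwd[i][a, b] / e  # drop locus-i emission
            pg = geno_probs(i, a, b)
            for g in range(3):
                score[g] += w * pg[g] * read_lik[i][g]
        calls.append(max(range(3), key=lambda g: score[g]))
    return calls
```

A locus with no reads (`read_lik[i] = [1, 1, 1]`) still receives a call, driven by the founder states that the reads at neighboring loci make likely; this is how linkage disequilibrium enters the decision.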

Forward-backward computation
[Figure: HF-HMM graphical model around locus $i$, with nodes $f_i$, $f'_i$, $h_i$, $h'_i$, $g_i$ and the reads at loci $1, \dots, i, \dots, n$]


Runtime
- Direct recurrences for computing forward probabilities
- Runtime reduced to $O(nK^3)$ by reusing common terms
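The slide's recurrences are not preserved in this transcript, so the following is only one standard way of reusing common terms; it may differ from the authors' formulation. It assumes the forward table is indexed by a pair of founder states and that the two founder chains transition independently. The function names are hypothetical. Summing out one chain first turns the per-locus transition cost from $K^4$ into $K^3$.

```python
def transition_direct(alpha, tm, tf):
    """O(K^4) step: new[c][d] = sum over a, b of alpha[a][b] * tm[a][c] * tf[b][d]."""
    K = len(alpha)
    return [[sum(alpha[a][b] * tm[a][c] * tf[b][d]
                 for a in range(K) for b in range(K))
             for d in range(K)] for c in range(K)]

def transition_factored(alpha, tm, tf):
    """O(K^3) step: sum out b first, then reuse the partial sums for every c."""
    K = len(alpha)
    # partial[a][d] = sum_b alpha[a][b] * tf[b][d]
    partial = [[sum(alpha[a][b] * tf[b][d] for b in range(K))
                for d in range(K)] for a in range(K)]
    # new[c][d] = sum_a tm[a][c] * partial[a][d]
    return [[sum(tm[a][c] * partial[a][d] for a in range(K))
             for d in range(K)] for c in range(K)]
```

Both functions return the same table; multiplying each entry by the locus emission term afterwards costs only $K^2$, so the overall runtime is $O(nK^3)$.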


Pipeline for LD-Based Genotype Calling
[Figure: pipeline diagram with panels labeled Reference genome sequence, Read sequences, Quality scores, Mapped reads & confidence values, Hapmap haplotypes, and SNP genotype calls]

Datasets
Watson
- Sequencing data: 74.4 million 454 reads (of million reads used in [Wheeler et al 08])
- Reference panel: CEU genotypes from Hapmap r23a phased using the ENT algorithm [Gusev et al. 08]
- Ground truth: duplicate Affymetrix 500k SNP genotypes

Datasets (contd.)
NA18507 (Illumina & SOLiD)
- Sequencing data: 525 million Illumina reads (36bp, paired) and 764 million SOLiD reads ( bp, unpaired)
- Reference panel: YRI haplotypes from Hapmap r22 excluding NA18507 haplotypes
- Ground truth: Hapmap r22 genotypes

Mapping Procedure
- 454 reads mapped on human genome build 36.3 using the NUCMER tool of the MUMmer package [Kurtz et al 04] with default parameters
  - Additional filtering: at least 90% of the read length matched to the genome, no more than 10 errors (mismatches or indels)
  - Reads meeting the above conditions at multiple genome positions (likely coming from genomic repeats) were discarded
- Illumina and SOLiD reads mapped using MAQ [Li,Ruan&Durbin 08] with default parameters
  - For reads mapped at multiple positions, MAQ returns the best position (breaking ties arbitrarily) together with a mapping confidence
  - We filtered bad alignments and discarded paired-end reads that are not mapped in pairs using the “submap -p” command
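The additional filtering applied to the 454 alignments can be sketched as a small post-processing step. The record layout (dictionaries with `read_id`, `read_len`, `matched_len`, `errors`, `pos` keys) and the function name are hypothetical; they do not correspond to NUCMER's actual output format, which would need to be parsed into such records first.

```python
from collections import Counter

def filter_alignments(alignments, min_match_frac=0.9, max_errors=10):
    """Keep alignments that pass the quality thresholds and are unique per read.

    alignments: list of dicts with hypothetical keys
    'read_id', 'read_len', 'matched_len', 'errors', 'pos'.
    """
    # Thresholds: >= 90% of the read length matched, <= 10 mismatches/indels
    passing = [a for a in alignments
               if a["matched_len"] >= min_match_frac * a["read_len"]
               and a["errors"] <= max_errors]
    # Discard reads that pass the thresholds at more than one genome position
    hits = Counter(a["read_id"] for a in passing)
    return [a for a in passing if hits[a["read_id"]] == 1]
```

Uniqueness is counted only among alignments that pass the thresholds, matching the statement that reads meeting the conditions at multiple positions were discarded.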

Mapping statistics

Dataset            Raw reads   Raw sequence   Mapped reads   Test SNPs   Avg. mapped SNP cov.
Watson             74.2M       19.7Gb         49.8M (67%)    443K        5.85x
NA18507 Illumina   525M        18.9Gb         397M (78%)     2.85M       6.10x
NA18507 SOLiD      764M        21.15Gb        324M (42%)     2.85M       3.21x

Concordance vs. avg. coverage (Watson 454 reads)

Tradeoff with call rate (5.85x Watson 454 reads, homo SNPs)

Tradeoff with call rate (5.85x Watson 454 reads, het SNPs)

Concordance vs. avg. coverage for NA18507 (Illumina & SOLiD reads)

Effect of local recombination rate (NA18507 Illumina)

Effect of SNP coverage (NA18507 Illumina)

Conclusions
- Posterior decoding algorithm has scalable running time and yields significant improvements in genotype calling accuracy
  - Improvement depends on the coverage depth (higher at lower coverage), e.g., the accuracy achieved by the previously proposed binomial test at 5-6x average coverage is achieved by the HMM-based posterior decoding algorithm using less than 1/4 of the reads
- Open source code available at
- LD-based genotype calling increasingly attractive as reference panels improve (denser, more samples, more populations)
  - Allows sequencing larger populations for the same cost

Ongoing work
- Haplotype reconstruction
  - Promising preliminary results using Viterbi-like algorithm based on HF-HMM
- Extension to population sequencing data
  - Removes need for reference panels!
- Integrated read mapping, SNP identification, and haplotype reconstruction
  - EM algorithm that iteratively refines two full haplotype sequences and read mapping probabilities
  - Integrates read data with LD info available for known SNPs
  - Takes advantage of reads overlapping multiple SNP loci
  - Allows reconstruction of complete sequences for CNVs
- Reconstruction of complex haplotype spectra
  - mRNA isoforms, quasispecies

Acknowledgments
- Work supported in part by NSF awards IIS and DBI to IM and IIS to YW.
- SD and YH performed this research as part of the Summer REU program “Bio-Grid Initiatives for Interdisciplinary Research and Education” funded by NSF award CCF