FuzzyPath Assemblies: from Bacterial to Mammalian Genomes and Zebrafish Finishing
Zemin Ning
The Wellcome Trust Sanger Institute
Assembly Strategy
[Diagram] Genome/Chromosome; forward-reverse paired reads (known dist ~500 bp); Solexa reads assembler to extend long reads of 1-2 Kb; capillary reads assembler (Phrap/Phusion)
Kmer Extension & Repeat Junctions
Handling of Single Base Variations
Fuzzy Kmers: Number of Mismatches between Two Kmers
ACGTAACTAACAGTT
ACGTAACTCACAGTT
ACGTAACT ACAGTT
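The mismatch count between the two kmers on this slide can be illustrated with a short sketch. It is only a position-by-position comparison; the function names and the `max_mismatches` threshold are hypothetical and are not taken from FuzzyPath itself.

```python
def count_mismatches(kmer_a: str, kmer_b: str) -> int:
    """Count positions at which two equal-length kmers differ."""
    if len(kmer_a) != len(kmer_b):
        raise ValueError("kmers must have the same length")
    return sum(1 for a, b in zip(kmer_a, kmer_b) if a != b)


def is_fuzzy_match(kmer_a: str, kmer_b: str, max_mismatches: int = 1) -> bool:
    """Treat two kmers as the same 'fuzzy' kmer if they differ at few positions.

    max_mismatches is an illustrative threshold, not a FuzzyPath parameter.
    """
    return count_mismatches(kmer_a, kmer_b) <= max_mismatches


# The two kmers shown on the slide differ only at the ninth base (A vs C).
print(count_mismatches("ACGTAACTAACAGTT", "ACGTAACTCACAGTT"))  # 1
```

With a threshold of one mismatch, the two slide kmers would be treated as a single fuzzy kmer, which is one way a single base variation could be tolerated during extension.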
Kmer Extension & Repeat Junctions
Means to handle repeats:
- Base quality
- Read pair
- Fuzzy kmers
- Closely related reference or Sanger reads
Pileup of other reads (454, Sanger, etc.) at a repeat junction; consensus
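A minimal sketch of kmer extension that stops at a repeat junction is given below. It uses exact kmers only and ignores base quality, read pairs and fuzzy matching, so it should be read as an illustration of the idea and not as the FuzzyPath algorithm. The function names, status strings and `max_len` guard are assumptions made for the example.

```python
from collections import defaultdict


def build_successors(reads, k):
    """Map each kmer to the set of bases that follow it in any read."""
    successors = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            successors[read[i:i + k]].add(read[i + k])
    return successors


def extend(seed, successors, k, max_len=10000):
    """Extend a seed one base at a time until the path becomes ambiguous.

    Returns the extended sequence and the reason extension stopped:
    'junction' (more than one possible next base, e.g. the exit of a repeat),
    'dead_end' (no read continues the path) or 'max_len' (hypothetical
    safety limit that also prevents endless loops in circular paths).
    """
    contig = seed
    while len(contig) < max_len:
        next_bases = successors.get(contig[-k:], set())
        if len(next_bases) == 0:
            return contig, "dead_end"
        if len(next_bases) > 1:
            return contig, "junction"
        contig += next(iter(next_bases))
    return contig, "max_len"


# Two toy reads share the kmer ACGT but continue differently after it.
reads = ["TTGACGTCA", "GGAACGTGC"]
table = build_successors(reads, k=4)
print(extend("TTGA", table, k=4))  # ('TTGACGT', 'junction')
```

At a junction, the means listed on the slide (base quality, read pairs, fuzzy kmers, or a pileup of 454 or Sanger reads) would be needed to choose the correct branch; the sketch simply stops there.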
Pileup of Solexa and 454 Reads
S. suis P1/7 Solexa/454 Assembly
Solexa reads:
- Number of reads: 3,084,185
- Finished genome size: 2,007,491 bp
- Read length: 39 and 36 bp
- Estimated read coverage: ~55X
- Number of 454 reads: 100,000
- Read coverage of 454: 10X
Assembly features (contig stats):
- Total number of contigs: 73
- Total bases of contigs: 1,999,817 bp
- N50 contig size: 62,508
- Largest contig: 162,190
- Average contig size: 27,394
- Contig coverage over the genome: ~99%
- Contig extension errors: 2
- Mis-assembly errors: 3
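The contig statistics reported on these slides (N50, largest and average contig size) can be computed from a list of contig lengths as sketched below. The contig lengths in the example are made up for illustration, since the slides give only summary figures; the function names are likewise hypothetical.

```python
def n50(lengths):
    """Return the length L such that contigs of length >= L hold half the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")


def contig_stats(lengths):
    """Summarise a set of contig lengths in the same terms as the slides."""
    return {
        "contigs": len(lengths),
        "total_bases": sum(lengths),
        "n50": n50(lengths),
        "largest": max(lengths),
        "average": sum(lengths) // len(lengths),  # truncated to whole bases
    }


# Hypothetical contig lengths, not data from the talk.
print(contig_stats([100, 80, 60, 40, 20]))
```

As a consistency check on the S. suis figures, 1,999,817 bases in 73 contigs gives an average of 27,394 bp after truncation, which matches the value on the slide.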
Salmonella Senftenberg Solexa Assembly from Paired-End Reads
Solexa reads:
- Number of reads: 6,000,000
- Finished genome size: ~4.8 Mbp
- Read length: 2x37 bp
- Estimated read coverage: ~92.5X
- Insert size: 170/ bp
Assembly features (contig stats):
                               Solexa       454
Total number of contigs:       75           390
Total bases of contigs:        4.80 Mbp     4.77 Mb
N50 contig size:               139,353      25,702
Largest contig:                395,600      62,040
Average contig size:           63,969       12,224
Contig coverage on genome:     ~99.8%       99.4%
Contig extension errors:       0
Mis-assembly errors:           0            4
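The estimated read coverage can be reproduced with a simple calculation. The sketch below assumes that "Number of reads" on the slide counts read pairs, so each entry contributes two reads of the stated length; that interpretation is inferred from the numbers and is not stated in the talk. Under this assumption the Salmonella figures (6,000,000 pairs of 2x37 bp over a ~4.8 Mbp genome) give 92.5X.

```python
def estimated_coverage(num_pairs, read_length, genome_size, reads_per_pair=2):
    """Estimate sequencing depth as total sequenced bases / genome size.

    Assumes num_pairs counts read pairs (an inference from the slide figures).
    """
    return num_pairs * reads_per_pair * read_length / genome_size


# Salmonella figures from the slide: 6,000,000 x (2 x 37 bp) over ~4.8 Mbp.
print(estimated_coverage(6_000_000, 37, 4_800_000))  # 92.5
```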
Extremely GC Biased Genomes
Columns: library; organism; read length; Mb sequence generated; genome size (Mb); mean coverage
library         organism               read length
PCR-free        B. pertussis ST24      2 x
PCR-free        E. coli 042            2 x
PCR-free        P. falciparum 3D7      2 x
PCR-free        B. pertussis ST24      2 x
PCR-free        P. falciparum 3D7      2 x
PCR-free        E. coli 042            2 x
standard-245    P. falciparum 3D7      2 x
standard-368    P. falciparum 3D7      2 x
standard-851    P. falciparum 3D7      2 x
standard-883    P. falciparum clin     2 x
GC: 68.0% 50.5% 19.0% 50.8% 19.0% 68.0% 19.0%
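For reference, the GC percentages quoted for these genomes are the fraction of G and C bases in the sequence. A minimal sketch of that calculation follows; the function name is hypothetical and ambiguous bases such as N are simply ignored, which is an assumption of this example.

```python
def gc_fraction(sequence: str) -> float:
    """Return the fraction of G and C among the A, C, G, T bases of a sequence."""
    seq = sequence.upper()
    counted = sum(seq.count(base) for base in "ACGT")  # ignore N and other symbols
    if counted == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return (seq.count("G") + seq.count("C")) / counted


# Toy sequences, not genome data from the talk.
print(f"{gc_fraction('GGCC'):.1%}")
print(f"{gc_fraction('ATATATGC'):.1%}")
```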
Malaria 3D7 Assemblies
Solexa reads (read length):    2x36 bp      2x76 bp
Number of reads:               14.0m        9.77m
Finished genome size:          23 Mbp       23 Mbp
Estimated read coverage:       43x          64x
Insert size:                   170 bp       170 bp
Assembly features:
Total number of contigs:       26,
Total bases of contigs:        19.2 Mbp     21.1 Mb
N50 contig size:
Largest contig:
Average contig size:
Contig coverage on genome:     ~83.5%       91.7%
Contig extension errors:       ?            ?
Mis-assembly errors:           ?            ?
E. coli strain 042 Assembly
Solexa reads:
- Number of reads: 7,055,348
- Finished genome size: 5.35 Mbp
- Read length: 2x36 bp
- Estimated read coverage: ~95X
- Insert size: 170/ bp
Assembly features (contig stats):
- Total number of contigs: 168
- Total bases of contigs: 5.19 Mbp
- N50 contig size: 85,886
- Largest contig: 337,768
- Average contig size: 30,886
- Contig coverage over the genome: ~99%
- Contig extension errors: 1
- Mis-assembly errors: 2
Mouse Chromosome 17 Assembly
Solexa reads:
- Number of reads: 86.5 million
- Finished genome size: 95.2 Mbp
- Read length: 2x36 bp
- Estimated read coverage: ~65X
- Insert size: 120/ bp
Assembly features (contig stats):
- Total number of contigs: 55,802
- Total bases of contigs: 75.8 Mbp
- N50 contig size: 2,322
- Largest contig: 17,859
- Average contig size: 1,358
- Contig coverage over the genome: ~80%
- Contig extension errors: ?
- Mis-assembly errors: ?
Pooled Clones: Zfish 9, Pig 3
Clone Name   Length (bp)   Finished   Cloning Vector      Species      Capillary Data Pathway
zH117H                     Yes        pTARBAC2.1          D. rerio     /nfs/repository/d0012/zH117H1
zH141B                     Yes        pTARBAC2.1          D. rerio     /nfs/repository/d0012/zH141B18
zH151M                     Yes        pTARBAC2.1          D. rerio     /nfs/repository/d0014/zH151M17
zH117E                     Yes        pTARBAC2.1          D. rerio     /nfs/repository/d0015/zH117E7
zH137D                     Yes        pTARBAC2.1          D. rerio     /nfs/repository/d0023/zH137D22
zH97A                      Yes        pTARBAC2.1          D. rerio     /nfs/repository/d0027/zH97A24
zH146D                     Yes        pTARBAC2.1          D. rerio     /nfs/repository/d0040/zH146D21
zH140N                     Yes        pTARBAC2.1          D. rerio     /nfs/repository/d0013/zH140N19
zH147D                     Yes        pTARBAC2.1          D. rerio     /nfs/repository/d0011/zH147D24
bE2F                       Yes        pTARBAC1.3_BamHI    S. scrofa    /nfs/repository/d0027/bE2F11
bE156J                     Yes        pTARBAC1.3_BamHI    S. scrofa    /nfs/repository/d0041/bE156J20
bE240L       *             No         pTARBAC1.3_BamHI    S. scrofa    /nfs/repository/d0012/bE240L11
* Finished length may be shorter or longer once complete
Mapping of Solexa Reads On the Reference
[Diagram] Genome/Chromosome; paired Solexa reads (insert ~300 bp); Solexa assembly (FuzzyPath) giving extended long reads of 1-2 Kb; fishing WGS reads (Phusion); WGS reads 5X; combined reads; assembly (Phusion or Phrap)
Zfish and “Pig” Clone Assemblies
Solexa reads:
- Number of reads: 4.3 million
- Finished genome size: 1.72 Mbp
- Read length: 2x36 bp
- Estimated read coverage: ~180X
- Insert size: 260/ bp
- Zfish DH reads: 12,539
Assembly features (contig stats):
               Solexa       Hybrid_Ctg   Hybrid_Super
N contigs:
Bases:         1.25 Mbp     1.68 Mbp     1.69 Mbp
N50 size:      4,975        25,817       74,598
Largest:       23,906       79,730       144,808
Averaged:      2,513        11,072       17,815
Coverage:      ~72.6%       ~73%         ~73%
Errors:        ?            ?            ?
Acknowledgements: Yong Gu, James Bonfield, Helen Beasley, Siobhan Whitehead, Daniel Turner, Michael Quail, Tony Cox, Richard Durbin