FuzzyPath Assemblies - from Bacterial to Mammalian Genomes and Zebrafish Finishing. Zemin Ning, The Wellcome Trust Sanger Institute.


1 FuzzyPath Assemblies - from Bacterial to Mammalian Genomes and Zebrafish Finishing. Zemin Ning, The Wellcome Trust Sanger Institute

2 Assembly Strategy: forward-reverse paired Solexa reads (30-70 bp each, known distance ~500 bp); Solexa reads assembler to extend to long reads of 1-2 Kb; capillary reads assembler (Phrap/Phusion); Genome/Chromosome

3 Kmer Extension & Repeat Junctions

4 Handling of Single Base Variations

5 Fuzzy Kmers: Number of Mismatches between Two Kmers
Kmer 1:  ACGTAACTAACAGTT   00 01 10 11 00 00 01 11 00 00 01 00 10 11 11
Kmer 2:  ACGTAACTCACAGTT   00 01 10 11 00 00 01 11 01 00 01 00 10 11 11
XOR:     ACGTAACT ACAGTT   00 00 00 00 00 00 00 00 01 00 00 00 00 00 00
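The mismatch count on this slide can be reproduced with a few lines of Python. This is a minimal sketch, assuming the two-bit encoding visible in the slide (A=00, C=01, G=10, T=11); the function names `encode_kmer` and `count_mismatches` are hypothetical and are not taken from FuzzyPath itself.

```python
# Two-bit base encoding as shown on the slide.
ENCODING = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode_kmer(kmer):
    """Pack a kmer into an integer, two bits per base, first base highest."""
    value = 0
    for base in kmer:
        value = (value << 2) | ENCODING[base]
    return value


def count_mismatches(kmer_a, kmer_b):
    """Count positions where two equal-length kmers differ, using XOR."""
    if len(kmer_a) != len(kmer_b):
        raise ValueError("kmers must have the same length")
    k = len(kmer_a)
    # XOR leaves a non-zero two-bit group wherever the bases differ.
    diff = encode_kmer(kmer_a) ^ encode_kmer(kmer_b)
    # Mask with the bit pattern 0101...01 (k groups) after folding the high
    # bit of each group onto its low bit, so each mismatch sets exactly one bit.
    low_bits = (4 ** k - 1) // 3
    collapsed = (diff | (diff >> 1)) & low_bits
    return bin(collapsed).count("1")


print(count_mismatches("ACGTAACTAACAGTT", "ACGTAACTCACAGTT"))
```

For the two kmers on the slide, the XOR is non-zero only in the ninth two-bit group, so the count is 1. Two kmers could then be treated as a "fuzzy" match when this count is at or below some chosen threshold; the threshold itself is not stated in the slides.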

6 Kmer Extension & Repeat Junctions
Means to handle repeats:
- Base quality
- Read pair
- Fuzzy kmers
- Closely related reference
- 454 or Sanger reads
Figure: pileup of other reads (454, Sanger, etc.) at a repeat junction, with consensus
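To make the idea of a repeat junction concrete, here is a toy sketch of greedy kmer extension. It is an illustration under simplifying assumptions (error-free reads, exact kmer matches, forward strand only) and is not FuzzyPath's actual algorithm; `build_successors` and `extend` are hypothetical names. Extension continues while exactly one next base is observed and stops where the reads disagree, which is the point at which the means listed above would be needed to pick a path.

```python
from collections import defaultdict


def build_successors(reads, k):
    """Map each kmer to the set of bases seen immediately after it in any read."""
    successors = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            successors[read[i:i + k]].add(read[i + k])
    return successors


def extend(seed, successors, k, max_len=10_000):
    """Extend seed (length >= k) one base at a time until a junction or dead end."""
    contig = seed
    seen = {seed[-k:]}
    while len(contig) < max_len:
        next_bases = successors.get(contig[-k:], set())
        if len(next_bases) != 1:
            # 0 bases: no read continues; 2+ bases: a repeat junction.
            break
        contig += next(iter(next_bases))
        tail = contig[-k:]
        if tail in seen:
            # Guard against cycling forever through a repeat.
            break
        seen.add(tail)
    return contig


reads = ["ACGTAC", "GTACCT", "GTACGG"]
print(extend("ACGT", build_successors(reads, 4), 4))
```

In this toy input the kmer GTAC is followed by C in one read and by G in another, so extension from ACGT stops at ACGTAC.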

7 Pileup of Solexa and 454 Reads

8 S. suis P1/7 Solexa/454 Assembly
Solexa reads:
  Number of reads: 3,084,185
  Finished genome size: 2,007,491 bp
  Read length: 39 and 36 bp
  Estimated read coverage: ~55X
  Number of 454 reads: 100,000
  Read coverage of 454: 10X
Assembly features (contig stats):
  Total number of contigs: 73
  Total bases of contigs: 1,999,817 bp
  N50 contig size: 62,508
  Largest contig: 162,190
  Averaged contig size: 27,394
  Contig coverage over the genome: ~99%
  Contig extension errors: 2
  Mis-assembly errors: 3
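The N50 contig size reported in these contig stats follows the usual definition: sort contigs from longest to shortest and report the length of the contig at which the running total first reaches half of all assembled bases. A minimal sketch, assuming only a list of contig lengths (the function name `n50` is illustrative):

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contigs given")


# Toy lengths, not data from the slides.
print(n50([8, 8, 4, 3, 3, 2, 2, 2]))
```

Unlike the averaged contig size (total bases divided by number of contigs), N50 is weighted toward long contigs, which is why the two figures on the slide differ so much.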

9 Salmonella Senftenberg Solexa Assembly from Paired-End Reads
Solexa reads:
  Number of reads: 6,000,000
  Finished genome size: ~4.8 Mbp
  Read length: 2x37 bp
  Estimated read coverage: ~92.5X
  Insert size: 170/50-300 bp
Assembly features (contig stats):
                                  Solexa      454
  Total number of contigs:        75          390
  Total bases of contigs:         4.80 Mbp    4.77 Mbp
  N50 contig size:                139,353     25,702
  Largest contig:                 395,600     62,040
  Averaged contig size:           63,969      12,224
  Contig coverage on genome:      ~99.8%      99.4%
  Contig extension errors:        0
  Mis-assembly errors:            0           4
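The estimated read coverage on these slides can be checked with simple arithmetic. The sketch below assumes that "Number of reads" counts read pairs for the 2xN bp runs; that reading is an inference, although it reproduces the ~92.5X stated for this slide. The function name `paired_read_coverage` is illustrative.

```python
def paired_read_coverage(n_pairs, read_length_bp, genome_size_bp):
    """Sequenced bases divided by genome size, with two reads per pair."""
    return n_pairs * 2 * read_length_bp / genome_size_bp


# Values from the Salmonella slide: 6,000,000 pairs, 2x37 bp, ~4.8 Mbp genome.
print(paired_read_coverage(6_000_000, 37, 4_800_000))
```

The same formula applied to the E. coli 042 figures (7,055,348 pairs, 2x36 bp, 5.35 Mbp) gives roughly 95X, in line with that slide.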

10 Extremely GC Biased Genomes
library        organism             read length   Mb sequence generated   genome size (Mb)   mean coverage
PCR-free       B. pertussis ST24    2 x 76        907                     4.1                221
PCR-free       E. coli 042          2 x 76        573                     5.3                108
PCR-free       P. falciparum 3D7    2 x 76        1486                    23.0               65
PCR-free       B. pertussis ST24    2 x 36        452                     4.1                110
PCR-free       P. falciparum 3D7    2 x 36        1008                    23.0               44
PCR-free       E. coli 042          2 x 36        958                     5.3                181
standard-245   P. falciparum 3D7    2 x 35        2198                    23.0               96
standard-368   P. falciparum 3D7    2 x 35        2628                    23.0               115
standard-851   P. falciparum 3D7    2 x 35        474                     23.0               21
standard-883   P. falciparum clin   2 x 36        3994                    23.0               175
GC values shown on the slide: 68.0%, 50.5%, 19.0%, 50.8%, 19.0%, 68.0%, 19.0%
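The GC percentages quoted on this slide are the fraction of G and C bases in a genome. A minimal sketch of that calculation, assuming a plain sequence string and ignoring ambiguous bases such as N (the function name `gc_fraction` is illustrative):

```python
def gc_fraction(sequence):
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return (seq.count("G") + seq.count("C")) / counted


# Toy sequence, not data from the slides.
print(gc_fraction("ATATATGC"))
```

A genome near 19% GC or 68% GC sits far from the balanced case of 50%, which is what makes these organisms a demanding test for library preparation and assembly.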

11 Malaria 3D7 Assemblies
Solexa reads:                     2x36 bp     2x76 bp
  Number of reads:                14.0m       9.77m
  Finished genome size:           23 Mbp      23 Mbp
  Estimated read coverage:        43x         64x
  Insert size:                    170 bp      170 bp
Assembly features:
  Total number of contigs:        26,926      22,839
  Total bases of contigs:         19.2 Mbp    21.1 Mbp
  N50 contig size:                1,456       1,621
  Largest contig:                 9,106       9,825
  Averaged contig size:           706         923
  Contig coverage on genome:      ~83.5%      91.7%
  Contig extension errors:        ?           ?
  Mis-assembly errors:            ?           ?

12 E. coli strain 042 Assembly
Solexa reads:
  Number of reads: 7,055,348
  Finished genome size: 5.35 Mbp
  Read length: 2x36 bp
  Estimated read coverage: ~95X
  Insert size: 170/50-300 bp
Assembly features (contig stats):
  Total number of contigs: 168
  Total bases of contigs: 5.19 Mbp
  N50 contig size: 85,886
  Largest contig: 337,768
  Averaged contig size: 30,886
  Contig coverage over the genome: ~99%
  Contig extension errors: 1
  Mis-assembly errors: 2

13 Mouse Chromosome 17 Assembly
Solexa reads:
  Number of reads: 86.5 million
  Finished genome size: 95.2 Mbp
  Read length: 2x36 bp
  Estimated read coverage: ~65X
  Insert size: 120/50-200 bp
Assembly features (contig stats):
  Total number of contigs: 55,802
  Total bases of contigs: 75.8 Mbp
  N50 contig size: 2,322
  Largest contig: 17,859
  Averaged contig size: 1,358
  Contig coverage over the genome: ~80%
  Contig extension errors: ?
  Mis-assembly errors: ?

14 Pooled Clones: Zfish 9, Pig 3
Clone Name   Length (bp)   Finished   Cloning Vector      Species     Capillary Data Pathway
zH117H1      129221        Yes        pTARBAC2.1          D. rerio    /nfs/repository/d0012/zH117H1
zH141B18     119622        Yes        pTARBAC2.1          D. rerio    /nfs/repository/d0012/zH141B18
zH151M17     122622        Yes        pTARBAC2.1          D. rerio    /nfs/repository/d0014/zH151M17
zH117E7      139449        Yes        pTARBAC2.1          D. rerio    /nfs/repository/d0015/zH117E7
zH137D22     122615        Yes        pTARBAC2.1          D. rerio    /nfs/repository/d0023/zH137D22
zH97A24      113538        Yes        pTARBAC2.1          D. rerio    /nfs/repository/d0027/zH97A24
zH146D21     109862        Yes        pTARBAC2.1          D. rerio    /nfs/repository/d0040/zH146D21
zH140N19     118794        Yes        pTARBAC2.1          D. rerio    /nfs/repository/d0013/zH140N19
zH147D24     111470        Yes        pTARBAC2.1          D. rerio    /nfs/repository/d0011/zH147D24
bE2F11       170585        Yes        pTARBAC1.3_BamHI    S. scrofa   /nfs/repository/d0027/bE2F11
bE156J20     210831        Yes        pTARBAC1.3_BamHI    S. scrofa   /nfs/repository/d0041/bE156J20
bE240L11     216560*       No         pTARBAC1.3_BamHI    S. scrofa   /nfs/repository/d0012/bE240L11
* Finished length may be shorter or longer once complete

15 Mapping of Solexa Reads On the Reference

16 Diagram labels: Solexa paired reads (30-70 bp each, insert ~300 bp); Solexa assembly (FuzzyPath); extended long reads of 1-2 Kb; WGS reads (5X); fishing of WGS reads (Phusion); combined reads; Phusion or Phrap; Genome/Chromosome assembly

17 Zfish and “Pig” Clone Assemblies
Solexa reads:
  Number of reads: 4.3 million
  Finished genome size: 1.72 Mbp
  Read length: 2x36 bp
  Estimated read coverage: ~180X
  Insert size: 260/50-400 bp
  Zfish DH reads: 12,539
Assembly features (contig stats):
               Solexa     Hybrid_Ctg   Hybrid_Super
  N contigs:   496        152          95
  Bases:       1.25 Mbp   1.68 Mbp     1.69 Mbp
  N50 size:    4,975      25,817       74,598
  Largest:     23,906     79,730       144,808
  Averaged:    2,513      11,072       17,815
  Coverage:    ~72.6%     ~73%         ~73%
  Errors:      ?          ?            ?

18

19

20

21 Acknowledgements: Yong Gu, James Bonfield, Helen Beasley, Siobhan Whitehead, Daniel Turner, Michael Quail, Tony Cox, Richard Durbin

