FuzzyPath Assemblies - from Mixed Solexa/454 Datasets to Extremely GC Biased Genomes Zemin Ning The Wellcome Trust Sanger Institute.


Assembly Strategy
- Genome/Chromosome
- Forward-reverse paired reads at a known distance of ~500 bp
- Solexa reads assembler to extend long reads of 1-2 Kb
- Capillary reads assembler: Phrap/Phusion

Kmer Extension & Repeat Junctions

Handling of Single Base Variations

Fuzzy Kmers: Number of Mismatches between Two Kmers
ACGTAACTAACAGTT
ACGTAACTCACAGTT
ACGTAACT ACAGTT
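The mismatch count between two kmers can be illustrated with a plain Hamming distance. This is a minimal sketch, not the FuzzyPath implementation; the function names and the `max_mismatches` threshold are assumptions introduced for illustration.

```python
def count_mismatches(kmer_a: str, kmer_b: str) -> int:
    """Count positions at which two equal-length kmers differ (Hamming distance)."""
    if len(kmer_a) != len(kmer_b):
        raise ValueError("kmers must have the same length")
    return sum(1 for a, b in zip(kmer_a, kmer_b) if a != b)


def is_fuzzy_match(kmer_a: str, kmer_b: str, max_mismatches: int = 1) -> bool:
    """Hypothetical rule: two kmers are 'fuzzy equal' if they differ at
    no more than max_mismatches positions."""
    return count_mismatches(kmer_a, kmer_b) <= max_mismatches


# The two kmers shown on the slide differ only at the ninth base (A vs C).
print(count_mismatches("ACGTAACTAACAGTT", "ACGTAACTCACAGTT"))  # prints 1
```

With a threshold of one mismatch, the two slide kmers would be treated as the same fuzzy kmer, which is one way a single base variation can be tolerated during extension.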

Kmer Extension & Repeat Junctions
Means to handle repeats:
- Base quality
- Read pair
- Fuzzy kmers
- Closely related reference or Sanger reads
Pileup of other reads like 454, Sanger, etc. at a repeat junction
Consensus
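A toy sketch of kmer extension that stops at a repeat junction is given here. It is a generic greedy extension over exact kmer counts, written under simplifying assumptions (error-free reads, one strand, no base qualities or read pairs); it is not the FuzzyPath algorithm, and `count_kmers`, `extend_right`, and `min_count` are hypothetical names.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every kmer of length k across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def extend_right(seed, kmer_counts, k, min_count=1, max_steps=10000):
    """Extend seed one base at a time to the right.

    Stops at a dead end (no successor kmer), at a repeat junction
    (more than one successor kmer), or when a kmer is revisited.
    """
    if k < 2 or len(seed) < k:
        raise ValueError("need k >= 2 and a seed of at least k bases")
    contig = seed
    seen = {seed[-k:]}
    for _ in range(max_steps):
        suffix = contig[-(k - 1):]
        candidates = [b for b in "ACGT"
                      if kmer_counts.get(suffix + b, 0) >= min_count]
        if len(candidates) != 1:
            break  # dead end or repeat junction
        next_kmer = suffix + candidates[0]
        if next_kmer in seen:
            break  # cycle: the path has re-entered itself
        seen.add(next_kmer)
        contig += candidates[0]
    return contig


# Two reads that share a prefix and then diverge form a junction after "ACGTTG".
counts = count_kmers(["ACGTTGCA", "ACGTTGGG"], k=4)
print(extend_right("ACGT", counts, k=4))
```

At the junction this sketch simply stops. The means listed on the slide (base quality, read pairs, fuzzy kmers, a related reference or longer reads) are the kinds of extra evidence that could be used to choose a branch instead.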

Pileup of Solexa and 454 Reads

S. suis P1/7 Solexa/454 Assembly
Solexa reads:
- Number of reads: 3,084,185
- Finished genome size: 2,007,491 bp
- Read length: 39 and 36 bp
- Estimated read coverage: ~55X
- Number of 454 reads: 100,000
- Read coverage of 454: 10X
Assembly features (contig stats):
- Total number of contigs: 73
- Total bases of contigs: 1,999,817 bp
- N50 contig size: 62,508
- Largest contig: 162,190
- Averaged contig size: 27,394
- Contig coverage over the genome: ~99%
- Contig extension errors: 2
- Mis-assembly errors: 3
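The contig statistics reported on these slides follow standard definitions. The sketch below shows one common way to compute N50 and the averaged contig size; the individual contig lengths of the S. suis assembly are not given in the slides, so the N50 call uses a made-up list of lengths purely for illustration.

```python
def n50(contig_lengths):
    """Return the length L such that contigs of length >= L
    contain at least half of all assembled bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contigs given")


def mean_contig_size(total_bases, n_contigs):
    """Averaged contig size, truncated to whole bases."""
    return total_bases // n_contigs


# Hypothetical contig lengths, not taken from the assembly.
print(n50([2, 3, 4, 5, 6]))

# Figures from the slide: 1,999,817 bp in 73 contigs.
print(mean_contig_size(1_999_817, 73))  # prints 27394
```

The second call reproduces the averaged contig size of 27,394 reported for the S. suis assembly.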

Salmonella senftenberg Solexa Assembly from Paired-End Reads
Solexa reads:
- Number of reads: 6,000,000
- Finished genome size: ~4.8 Mbp
- Read length: 2x37 bp
- Estimated read coverage: ~92.5X
- Insert size: 170/ bp
Assembly features (contig stats), Solexa vs 454:
- Total number of contigs: 75 (Solexa); 390 (454)
- Total bases of contigs: 4.80 Mbp (Solexa); 4.77 Mbp (454)
- N50 contig size: 139,353 (Solexa); 25,702 (454)
- Largest contig: 395,600 (Solexa); 62,040 (454)
- Averaged contig size: 63,969 (Solexa); 12,224 (454)
- Contig coverage on genome: ~99.8% (Solexa); 99.4% (454)
- Contig extension errors: 0
- Mis-assembly errors: 0 (Solexa); 4 (454)
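The estimated read coverage can be checked with simple arithmetic: total sequenced bases divided by genome size. The sketch below assumes that, for the paired-end datasets, "Number of reads" counts read pairs; this is an inference from the numbers on the slides rather than something they state, and the function and parameter names are hypothetical.

```python
def read_coverage(n_fragments, read_length, genome_size, reads_per_fragment=1):
    """Estimated coverage = total sequenced bases / genome size.

    reads_per_fragment is 2 for paired-end data when n_fragments
    counts pairs (an assumption, see the text above).
    """
    return n_fragments * reads_per_fragment * read_length / genome_size


# Salmonella dataset: 6,000,000 pairs of 2x37 bp over ~4.8 Mbp.
print(read_coverage(6_000_000, 37, 4_800_000, reads_per_fragment=2))  # prints 92.5

# E. coli strain 042 dataset: 7,055,348 pairs of 2x36 bp over 5.35 Mbp.
print(round(read_coverage(7_055_348, 36, 5_350_000, reads_per_fragment=2)))  # prints 95
```

Under that assumption the computed values agree with the ~92.5X and ~95X quoted for the two paired-end bacterial datasets.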

E. coli strain 042 Assembly
Solexa reads:
- Number of reads: 7,055,348
- Finished genome size: 5.35 Mbp
- Read length: 2x36 bp
- Estimated read coverage: ~95X
- Insert size: 170/ bp
Assembly features (contig stats):
- Total number of contigs: 168
- Total bases of contigs: 5.19 Mbp
- N50 contig size: 85,886
- Largest contig: 337,768
- Averaged contig size: 30,886
- Contig coverage over the genome: ~99%
- Contig extension errors: 1
- Mis-assembly errors: 2

Salmonella delhi5 Solexa Assembly Guided by a Close Reference
Solexa reads:
- Number of reads: 6,346,317
- Finished genome size: 4.7 Mbp
- Read length: 33 bp
- Estimated read coverage: ~40X
- Shredded reference of SpA: 10X
Assembly features (contig stats):
- Total number of contigs: 66
- Total bases of contigs: 4,615,704 bp
- N50 contig size: 168,793
- Largest contig: 401,700
- Averaged contig size: 69,934
- Contig coverage over the genome: ~98%
- Contig extension errors: 0
- Mis-assembly errors: 2

The Malaria Genome Project

Datasets with Various GC Content
Columns: library; organism; read length; Mb sequence generated; genome size (Mb); mean coverage
- PCR-free; B. pertussis ST24; 2 x
- PCR-free; E. coli 042; 2 x
- PCR-free; P. falciparum 3D7; 2 x
- PCR-free; B. pertussis ST24; 2 x
- PCR-free; P. falciparum 3D7; 2 x
- PCR-free; E. coli 042; 2 x
- standard-245; P. falciparum 3D7; 2 x
- standard-368; P. falciparum 3D7; 2 x
- standard-851; P. falciparum 3D7; 2 x
- standard-883; P. falciparum clin; 2 x
GC: 68.0% 50.5% 19.0% 50.8% 19.0% 68.0% 19.0%

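Malaria 3D7 Assemblies
Solexa reads, 2x36 bp vs 2x76 bp:
- Number of reads: 14.0m (2x36 bp); 9.77m (2x76 bp)
- Finished genome size: 23 Mbp (2x36 bp); 23 Mbp (2x76 bp)
- Estimated read coverage: 43x (2x36 bp); 64x (2x76 bp)
- Insert size: 170 bp (2x36 bp); 170 bp (2x76 bp)
Assembly features:
- Total number of contigs: 26,
- Total bases of contigs: 19.2 Mbp (2x36 bp); 21.1 Mbp (2x76 bp)
- N50 contig size:
- Largest contig:
- Averaged contig size:
- Contig coverage on genome: ~83.5% (2x36 bp); 91.7% (2x76 bp)
- Contig extension errors: ? (2x36 bp); ? (2x76 bp)
- Mis-assembly errors: ? (2x36 bp); ? (2x76 bp)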
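The contig coverage figures for the two malaria assemblies are consistent with dividing total contig bases by the finished genome size. The sketch below assumes that definition, which the slides do not spell out (coverage could equally have been measured by aligning contigs to the reference); the function name is hypothetical.

```python
def contig_coverage_percent(total_contig_bases, genome_size):
    """Contig coverage as a percentage of the finished genome size,
    assuming coverage = total contig bases / genome size."""
    return 100.0 * total_contig_bases / genome_size


# 2x36 bp assembly: 19.2 Mbp of contigs over a 23 Mbp genome.
print(round(contig_coverage_percent(19.2e6, 23e6), 1))  # prints 83.5

# 2x76 bp assembly: 21.1 Mbp of contigs over a 23 Mbp genome.
print(round(contig_coverage_percent(21.1e6, 23e6), 1))  # prints 91.7
```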

Acknowledgements:
- Yong Gu
- Ben Blackburne
- Hannes Ponstingl
- Daniel Turner
- Michael Quail
- Tony Cox
- Richard Durbin