The Wellcome Trust Sanger Institute

Presentation transcript:

Assembly Scaffolding using String Graphs and In Silico Chromosome Assignment. Zemin Ning, The Wellcome Trust Sanger Institute

Phusion2 Assembly Pipeline (flowchart). Box labels: Illumina Reads, Assembly, 2x75 or 2x100bp, Flow-sorting Reads, Map Markers, AGP, contig, Data Process, Mate Pair Reads, BAC Ends, Supercontig, Base Correction, Contigs, Reads Group, Consensus Generation.

Spinner – a scaffolding tool. Spinner uses mate pair data to scaffold contigs. Contigs, and pairs of contigs connected by mate pairs, define a bi-directional graph. Using the expected insert size, an estimate of the gap size can be given for each contig.
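
The gap estimate mentioned on this slide can be sketched as follows. This is a minimal illustration, not Spinner's actual code. It assumes a simplified model that the slides do not spell out: the forward read maps on contig A, the reverse read maps on contig B, and the insert runs from the start of the forward read to the end of the reverse read. All function and parameter names are hypothetical.

```python
from statistics import median


def gap_from_pair(insert_size, len_a, pos_a, end_b):
    """Estimate the gap between contig A and contig B from one mate pair.

    insert_size: expected insert size of the library (assumed known)
    len_a:       length of contig A
    pos_a:       0-based start of the forward read on contig A
    end_b:       0-based exclusive end of the reverse read on contig B
    """
    tail_on_a = len_a - pos_a   # part of the insert that lies on contig A
    head_on_b = end_b           # part of the insert that lies on contig B
    return insert_size - tail_on_a - head_on_b


def estimate_gap(insert_size, len_a, pairs):
    """Combine per-pair estimates; the median is an assumed, robust choice."""
    return median(gap_from_pair(insert_size, len_a, p, e) for p, e in pairs)
```

For example, with a 3000 bp insert, a 10000 bp contig A, a forward read starting at 9000 and a reverse read ending at 500 on contig B, the estimated gap is 3000 - 1000 - 500 = 1500 bp.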

Spinner – removing bad pairs. Spinner seeks to delete spurious connections where possible. Pairs are screened for (a) PCR duplication, (b) cross-biotin and (c) chimeric pairs, etc. If the placement of reads implies a large negative distance between the contigs, the pair is discarded. After merging two contigs, this check is repeated to find more spurious pairs. (The accompanying diagrams mark the max insert length.)
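
A rough sketch of two of the screens described on this slide: dropping duplicate pairs and dropping pairs that imply a large negative distance between contigs. The coordinate model (forward read on contig A, reverse read on contig B), the definition of a duplicate as identical mapping coordinates, and the `max_overlap` threshold are all assumptions made for illustration; the slides do not give Spinner's actual criteria.

```python
def screen_pairs(pairs, insert_size, len_a, max_overlap=200):
    """Return the mate pairs that survive two simple screens.

    pairs:       list of (pos_a, end_b) mapping coordinates
    insert_size: expected insert size of the library
    len_a:       length of contig A
    max_overlap: hypothetical tolerance for a negative gap, in bases
    """
    seen = set()
    kept = []
    for pos_a, end_b in pairs:
        # Identical coordinates are treated here as a PCR duplicate.
        if (pos_a, end_b) in seen:
            continue
        seen.add((pos_a, end_b))
        # Implied distance between the end of A and the start of B.
        gap = insert_size - (len_a - pos_a) - end_b
        # A large negative distance means the pair is inconsistent.
        if gap < -max_overlap:
            continue
        kept.append((pos_a, end_b))
    return kept
```

Because merging two contigs changes contig lengths and coordinates, the same function could simply be called again on the merged contig, which mirrors the repeated check mentioned on the slide.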

Spinner – deciding when to merge. The connection to X with the smallest gap size is merged, as long as neither of these “conflicts” occurs: (1) According to the gap distance estimates and contig length, some alternative B overlaps A. (2) Some alternative B is NOT connected to A. The reverse must ALSO be checked: that there is nothing closer to A than X (and no conflicts with X from A). Conflicts may be resolved by a “strength comparison”. (Diagram labels: A X B; X A B.)
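
The merge rule on this slide can be sketched as below. The sketch reads the slide as follows, which is an interpretation rather than something the slide confirms: X is the contig being extended, A is its neighbour with the smallest gap, and B is any other neighbour of X. Contig orientation and the “strength comparison” are left out, and the `right`/`left` dictionaries and all names are hypothetical.

```python
def overlaps(gap_a, len_a, gap_b, len_b, tolerance=0):
    """True if two contigs placed at these gaps from a shared anchor overlap."""
    start_a, end_a = gap_a, gap_a + len_a
    start_b, end_b = gap_b, gap_b + len_b
    return min(end_a, end_b) - max(start_a, start_b) > tolerance


def closest_without_conflict(anchor, links, length, tolerance=0):
    """Return the smallest-gap neighbour of `anchor`, or None on a conflict.

    links[c] maps each neighbour of c (in one direction) to its gap estimate.
    """
    neighbours = links.get(anchor, {})
    if not neighbours:
        return None
    best = min(neighbours, key=neighbours.get)
    for alt, gap in neighbours.items():
        if alt == best:
            continue
        # Conflict (1): an alternative overlaps the chosen contig.
        if overlaps(neighbours[best], length[best], gap, length[alt], tolerance):
            return None
        # Conflict (2): an alternative is not connected to the chosen contig.
        if alt not in links.get(best, {}):
            return None
    return best


def can_merge(x, right, left, length, tolerance=0):
    """Return the contig to merge after x, or None if any check fails."""
    a = closest_without_conflict(x, right, length, tolerance)
    if a is None:
        return None
    # Reverse check: seen from a, nothing is closer than x and no conflicts.
    if closest_without_conflict(a, left, length, tolerance) != x:
        return None
    return a
```

A usage example: with `right = {"X": {"A": 100, "B": 900}, "A": {"B": 300}}` and `left = {"A": {"X": 100}}`, contig B lies beyond A and is connected to it, so `can_merge("X", right, left, length)` selects A.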

Spinner – still to do. These techniques alone produce useful results. Further stages will be used to resolve repeats, using pairs that “jump over” repeats and graph flow concepts.

Remove Heterozygosity Contigs

Pipeline of Contig Gap Closure

Scaffold Comparisons: SPINNER vs SSPACE
                Genome_Size  SSPACE N50  SSPACE Average  SPINNER N50  SPINNER Average
Assemblathon 1  119 Mb       608Kb       86.8Kb          10Mb         450Kb
Bamboo          2.0 Gb       322Kb       5804            488Kb        7689
Parrot          1.23 Gb      906Kb       4675            1.32Mb       6969
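
The slides do not define the N50 and average figures in this comparison. As a reminder, here is a small sketch using the usual definition of N50: the length of the scaffold at which the running total, taken from the longest scaffold downwards, first reaches half of all assembled bases. The function names are illustrative only.

```python
def n50(lengths):
    """Length L such that scaffolds of length >= L hold at least half the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def average_length(lengths):
    """Mean scaffold length; 0 for an empty assembly."""
    return sum(lengths) / len(lengths) if lengths else 0
```

N50 rewards a few long scaffolds, while the average is pulled down by many short ones, which is why the two columns can differ by orders of magnitude for the same assembly.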

Tasmanian tiger Tasmanian devil Australian Tasmanian

Tasmanian devil facial tumour disease (DFTD): a transmissible cancer characterised by the growth of large tumours on the face, neck and mouth of Tasmanian devils. Transmitted by biting. Commonly metastasises. First observed in 1996. Primarily affects adults >1yr. Death in 4 – 6 months.

Tasmanian devil Tasmanian devil Opossum Wallaby

Devil – Opossum Homology Map Based on Hybridisation Results of Devil Paints onto Opossum Chromosomes. (Figure labels: Opossum, Devil; 1 2 3 4 5 6 7 8 X; 2a 3a 2b 3b.) Opossum chromosome images were taken from Duke et al. 2007, Chromosome Res 15:361-370

Flow cytometry analysis of chromosomal mixture of devil and opossum
Genome size
Chr    Opossum Seq  Opossum FC  Devil
1      748          611         571
2      541          484         610
3      526          483         556
4      430          423         450
5      309          321         341
6      245          296         277
7      263          264
8      308
X      61           116         121
Total  3431         3319        2926
(Figure: flow cytometry profiles for Tasmanian devil and opossum with chromosome peak labels, including a combined 5+8 peak.)

Read mapping coefficient: e = Size_of_Chr/Num_reads_in_lane
Table 1: Run ID, Template names, Number of reads and Chromosome size
Run ID  Chr   Template name  Number of reads  Chromosome size
4972_1  chr1  IL20_4972:1    19.8             571
4967_1  chr2  IL21_4967:1    20.0             610
4971_1  chr3  IL30_4971:1    21.7             556
4964_1  chr4  IL14_4964:1    7.26             450
4969_1  chr5  IL17_4969:1    7.06             341
4969_2  chr6  IL17_4969:2    8.59             277
4969_3  chrx  IL17_4969:3    9.43             122
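
As a worked example of the formula on this slide, the sketch below computes e = Size_of_Chr/Num_reads_in_lane for each lane in Table 1. The numbers are copied from the table in the units it uses (the slide does not state them); the variable and function names are hypothetical.

```python
# (run ID, chromosome, number of reads, chromosome size), as listed in Table 1
lanes = [
    ("4972_1", "chr1", 19.8, 571),
    ("4967_1", "chr2", 20.0, 610),
    ("4971_1", "chr3", 21.7, 556),
    ("4964_1", "chr4", 7.26, 450),
    ("4969_1", "chr5", 7.06, 341),
    ("4969_2", "chr6", 8.59, 277),
    ("4969_3", "chrx", 9.43, 122),
]


def read_mapping_coefficient(chr_size, num_reads):
    """e = Size_of_Chr / Num_reads_in_lane."""
    return chr_size / num_reads


coefficients = {
    chrom: read_mapping_coefficient(size, reads)
    for _run, chrom, reads, size in lanes
}

for chrom, e in coefficients.items():
    print(f"{chrom}: e = {e:.2f}")
```

Lanes with fewer reads relative to chromosome size get a larger coefficient, so one plausible use is to weight read counts from different flow-sorted lanes before comparing them; the slides do not say this explicitly.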

Perfect - Reads from the same library were mapped to the contig

Acceptable - Majority of the reads were from the same library, but there were reads from other libraries

Bad – mis-assembly error. The majority of the reads in one region were from one library, but there is a transition after which we see a new library, i.e. a switch to another chromosome.
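
The three categories above (perfect, acceptable, bad) can be sketched as a simple classifier over the flow-sorted library of each read mapped to a contig. This is an illustration under stated assumptions, not the pipeline's actual rule: the fixed `window` size, the per-window majority vote, and the extra "unassigned" outcome are all hypothetical choices.

```python
from collections import Counter


def majority_library(libs):
    """Return the most common library and its fraction of the reads."""
    lib, count = Counter(libs).most_common(1)[0]
    return lib, count / len(libs)


def classify_contig(reads, window=1000):
    """Classify a contig from its mapped reads.

    reads:  list of (position, library) tuples for one contig
    window: hypothetical bin size in bases for detecting a library switch
    """
    if not reads:
        return "unassigned"
    libs = [lib for _pos, lib in reads]
    if len(set(libs)) == 1:
        return "perfect"          # every read comes from one library
    by_window = {}
    for pos, lib in reads:
        by_window.setdefault(pos // window, []).append(lib)
    window_majorities = {majority_library(v)[0] for v in by_window.values()}
    if len(window_majorities) > 1:
        return "bad"              # the majority library changes along the contig
    _lib, fraction = majority_library(libs)
    return "acceptable" if fraction > 0.5 else "unassigned"
```

A contig flagged as "bad" would be a candidate for breaking at the window where the majority library switches.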

Unassigned contigs were placed by supercontigs using mate pairs

Scaffolds Assigned to Chromosomes using Flow-sorting Data
Chr_ID      Chr_size  Scaffolds_assigned  Bases_assigned (Mb)
Chr1        571       6729                684
Chr2        610       8381                740
Chr3        556       7197                641
Chr4        450       4817                487
Chr5        341       3188                300
Chr6        277       2844                263
Chrx        122       2378                86.6
Unassigned            440                 1.23

Genome Assembly Normal – T. Devil
Solexa reads:
Number of read pairs: 650 Million
Estimated genome size: 3.1 Gb
Read length: 2x100bp
Estimated read coverage: ~40X
Insert size: 410/50-600 bp
Mate pair data: 2k, 4k, 5k, 6k, 8k, 10k
Number of reads clustered: 591 Million
Assembly features (stats):
                            Contigs  Supercontigs
Total number of contigs:    178,711  26,954
Total bases of contigs:     2.95 Gb  3.08 Gb
N50 contig size:            28,921   2,244,460
Largest contig:             214,456  6,014,846
Averaged contig size:       16,511   114,451
Contig coverage on genome:  ~94%     >99%
Ratio of placed PE reads:   ~92%     ?

Devil Tumour Genome Assemblies
Solexa reads:
                            Tumour_87T   Tumour_53T
Number of read pairs:       760 Million  669 Million
Finished genome size:       3.2 Gb       3.2 Gb
Read length:                2x100        2x100
Estimated read coverage:    ~46X         ~40X
Insert size:                300bp        300bp
Number of reads clustered:  635 Million  603 Million
Assembly features (stats):
                            Tumour_87T   Tumour_53T
Total number of contigs:    532,584      612,288
Total bases of contigs:     3.13 Gb      3.14 Gb
N50 contig size:            15,908       14,632
Largest contig:             109,065      170,831
Averaged contig size:       5,882        5,567
Contig coverage on genome:  ~95%         ~95%
Ratio of placed PE reads:   ~92%         ~92%
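
The read coverage figures quoted for the normal and tumour assemblies can be sanity-checked with the standard relation coverage = read pairs x 2 x read length / genome size. The sketch below applies it to the numbers on the two assembly slides. How the slides derived their own estimates is not stated, so small differences from the quoted values are to be expected; the function name is illustrative.

```python
def read_coverage(read_pairs, read_length, genome_size):
    """Sequencing depth from paired reads: pairs * 2 * read length / genome size."""
    return read_pairs * 2 * read_length / genome_size


# (read pairs, read length in bp, genome size in bp), taken from the slides
samples = {
    "Normal": (650e6, 100, 3.1e9),
    "Tumour_87T": (760e6, 100, 3.2e9),
    "Tumour_53T": (669e6, 100, 3.2e9),
}

for name, (pairs, length, size) in samples.items():
    print(f"{name}: ~{read_coverage(pairs, length, size):.1f}X")
```

The results land within a few X of the quoted estimates, which suggests the slide values are rounded or based on slightly different read counts.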

DFTD1 (figure). Labels: K, I, F1, F, G/H, D, F2, E, F, M1, A, J, M2?, M3; chromosomes 1, der1, der2, 3, 4, 5, der5, 6, X.

DFTD2 (figure). Labels: L, K3, M, J, K1/K2, I, J, H, D, F, G, B, M3, M1, M2; chromosomes der6, der5, der1, 1, 2, 3, 4, 5, 6, Xp, Xq.

Acknowledgements: Joe Henson, Elizabeth Murchison, David McBride, Yong Gu, Fengtang Yang, Mike Stratton, Ole Schulz-Trieglaff, Dirk Evers, David Bentley