Phusion2 Assemblies and Indel Confirmation Zemin Ning The Wellcome Trust Sanger Institute.



Outline of the Talk:
- Challenges in genome assemblies from pure Illumina reads
- The Phusion2 pipeline
- The Tasmanian devil genome project
- The Devil genome assembly
- Future work

Challenges in Whole Genome Assembly using Pure Illumina Reads
- Short read length: 2x36; 2x54; 2x75; 2x100
- Large genomes and huge datasets. For human: 100 Gb at 30x
- Repetitive/duplication structures (Alus, LINEs, SVAs): 30-40% in genomes such as human and mouse; 50-60% in rice and other plant genomes
- Tandem repeats: how many copies do they have?
  TATATATATATATATATATATATATATA
  GCGCGCGCGCGCGCGCGCGCGCGCGCGCGCG
  GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
  AGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGTAGT
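The tandem-repeat question can be made concrete with a small sketch. It is not part of the talk: the function names and the copy numbers (14 and 40) are assumptions chosen for illustration. It shows that two arrays of the same unit with different copy numbers contain exactly the same k-mers, so words shorter than the array cannot tell them apart.

```python
def kmers(seq, k):
    """Return the set of all distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def repeat_array(unit, copies):
    """Build a perfect tandem repeat: `unit` repeated `copies` times."""
    return unit * copies


# Hypothetical alleles: the same TA unit with two different copy numbers.
short_allele = repeat_array("TA", 14)
long_allele = repeat_array("TA", 40)

k = 12
# Both alleles yield the same two 12-mers, so k-mer content alone
# carries no information about the number of copies.
print(kmers(short_allele, k) == kmers(long_allele, k))
print(sorted(kmers(long_allele, k)))
```

Resolving the copy number needs evidence that spans the whole array, such as a read or a read pair anchored in unique sequence on both sides.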

De Bruijn vs Read Overlap
[Figure: sequences missing from de Bruijn contigs]

Phusion2 Assembly Pipeline
[Flowchart labels: Solexa Reads (2x75 or 2x100); Data Process; Reads Group; Assembly (PRono, Fuzzypath, Velvet, Phrap); Base Correction; Contigs; Long Insert Reads; PhrapRP; Supercontig]

Repetitive Contig and Read Pairs
[Figure: depth plots for reads grouped by Phusion]

Kmer Word Hashing (K = 12)
Sequence: ATGGCGTGCAGTCCATGTTCGGATCA

Contiguous Base Hash:
  ATGGCGTGCAGT
  TGGCGTGCAGTC
  GGCGTGCAGTCC
  GCGTGCAGTCCA
  CGTGCAGTCCAT

Gap-Hash4x3:
  ATGGGCAGATGT
  TGGCCAGTTGTT
  GGCGAGTCGTTC
  GCGTGTCCTTCG
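The two hashing schemes on this slide can be sketched as follows. The gapped pattern is inferred from the slide's example words, not stated on it: each word takes 4 bases, skips 3, and repeats for 3 blocks, giving 12 sampled bases over an 18-base span. The function and parameter names are illustrative assumptions.

```python
def contiguous_words(seq, k=12):
    """All contiguous k-mers of seq, in order of start position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def gapped_words(seq, block=4, gap=3, blocks=3):
    """Gapped words: `blocks` runs of `block` bases separated by `gap` skipped bases."""
    span = blocks * block + (blocks - 1) * gap  # 18 bases for the 4x3 pattern
    step = block + gap
    words = []
    for start in range(len(seq) - span + 1):
        parts = [seq[start + b * step: start + b * step + block]
                 for b in range(blocks)]
        words.append("".join(parts))
    return words


seq = "ATGGCGTGCAGTCCATGTTCGGATCA"
print(contiguous_words(seq)[:5])
print(gapped_words(seq)[:4])
```

With the sequence from the slide, the first five contiguous words and the first four gapped words match the ones listed there. A gapped word still matches when a base inside one of the skipped positions is wrong, which is one common motivation for spaced words.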

Word use distribution for the mouse sequence data at ~7.5 fold
[Plot: Poisson curve and real data curve, with the useful region marked]
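One reading of the "Useful Region" label, which the slide does not spell out, is that only words whose occurrence count falls in a window around the expected depth are kept: very rare words are likely sequencing errors and very frequent ones are likely repeats. The sketch below illustrates that reading on toy reads. The thresholds, the function names and the use of 7.5 as the Poisson mean are assumptions, not values from the talk.

```python
from collections import Counter
from math import exp, factorial


def kmer_counts(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def poisson_pmf(n, lam):
    """Probability of seeing a word exactly n times if counts were Poisson(lam)."""
    return exp(-lam) * lam ** n / factorial(n)


def useful_kmers(counts, low, high):
    """Keep words whose count lies in the assumed useful window [low, high]."""
    return {word for word, c in counts.items() if low <= c <= high}


reads = ["ACGTAC", "CGTACG"]          # toy data, not from the talk
counts = kmer_counts(reads, 4)
print(sorted(useful_kmers(counts, 2, 2)))

# Expected fraction of words seen exactly n times under the idealised model.
for n in (1, 7, 20):
    print(n, poisson_pmf(n, 7.5))
```

Real data departs from the Poisson curve at both ends, which is why the two curves on the slide differ outside the useful region.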

Sorted List of Each k-Mer and Its Read Indices

  k-mer        Read
  ACAGAAAAGC   10h06.p1c
  ACAGAAAAGC   12a04.q1c
  ACAGAAAAGC   13d01.p1c
  ACAGAAAAGC   16d01.p1c
  ACAGAAAAGC   26g04.p1c
  ACAGAAAAGC   33h02.q1c
  ACAGAAAAGC   37g12.p1c
  ACAGAAAAGC   40d06.p1c
  ACAGAAAAGG   16a02.p1c
  ACAGAAAAGG   20a10.p1c
  ACAGAAAAGG   22a03.p1c
  ACAGAAAAGG   26e12.q1c
  ACAGAAAAGG   30e12.q1c
  ACAGAAAAGG   47a01.p1c

64-bit entry layout labels: High bits, Low bits; field widths: 64 - 2k and 2k
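A minimal sketch of how such a sorted list could be built with one 64-bit integer per entry. It assumes the k-mer occupies the high 2k bits (2 bits per base) and the read index the low 64 - 2k bits, so that a plain integer sort groups identical k-mers together. The slide's labels do not make the field order explicit, and all names here are illustrative.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_kmer(word):
    """Encode a k-mer as an integer using 2 bits per base."""
    value = 0
    for base in word:
        value = (value << 2) | CODE[base]
    return value


def pack(word, read_index):
    """Pack a k-mer (high bits) and a read index (low bits) into 64 bits."""
    index_bits = 64 - 2 * len(word)
    if read_index >= (1 << index_bits):
        raise ValueError("read index does not fit in the low bits")
    return (encode_kmer(word) << index_bits) | read_index


def unpack(packed, k):
    """Recover (k-mer, read index) from a packed 64-bit entry."""
    index_bits = 64 - 2 * k
    code = packed >> index_bits
    read_index = packed & ((1 << index_bits) - 1)
    word = "".join("ACGT"[(code >> (2 * (k - 1 - i))) & 3] for i in range(k))
    return word, read_index


# Hypothetical (k-mer, read index) entries in arbitrary order.
entries = [("ACAGAAAAGG", 5), ("ACAGAAAAGC", 9), ("ACAGAAAAGC", 2)]
packed = sorted(pack(word, idx) for word, idx in entries)
print([unpack(p, 10) for p in packed])
```

Because A, C, G, T are coded in alphabetical order, the integer order equals the lexicographic order of the k-mers, and entries that share a k-mer end up adjacent and sorted by read index.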

Relation Matrix (N x N, rows i and columns j over the reads 1 … N):
R(i,j) – number of kmer words shared between read i and read j

Group 1: (1,2,3,5)
Group 2: (4,6)
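The relation matrix and the grouping step can be sketched as below. This is a toy illustration, not Phusion2's implementation: the reads, k = 4, the threshold of 2 shared words and the union-find grouping are all assumptions. The reads are chosen so that the result reproduces the two groups shown on the slide.

```python
from collections import defaultdict
from itertools import combinations


def relation_matrix(reads, k):
    """Sparse R: R[(i, j)] = number of distinct k-mers shared by reads i < j."""
    index = defaultdict(set)              # k-mer -> set of read indices
    for i, read in enumerate(reads):
        for p in range(len(read) - k + 1):
            index[read[p:p + k]].add(i)
    shared = defaultdict(int)
    for read_set in index.values():
        for i, j in combinations(sorted(read_set), 2):
            shared[(i, j)] += 1
    return shared


def group_reads(n, shared, min_shared):
    """Union reads i and j whenever R(i, j) >= min_shared; return the groups."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (i, j), count in shared.items():
        if count >= min_shared:
            parent[find(i)] = find(j)
    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)
    return sorted(groups.values())


# Toy reads: four overlap along one sequence, two along an unrelated one.
reads = ["ACGTTGCA", "TTGCAAGC", "CAAGCTTA", "GGTCTCTA", "AGCTTAGG", "TCTCTATC"]
R = relation_matrix(reads, k=4)
groups = group_reads(len(reads), R, min_shared=2)
print([[i + 1 for i in g] for g in groups])   # 1-based, as on the slide
```

Storing R sparsely, keyed by read pair, matters in practice: a dense N x N matrix is out of reach for hundreds of millions of reads, whereas pairs that share no word never appear in the dictionary.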

maq ssaha2

Break contigs without read pair coverage
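A minimal sketch of this breaking step, assuming each read pair placed on a contig is summarised as a half-open span from the start of the left read to the end of the right read. The function names and the `min_pairs` threshold are hypothetical and the talk does not give the actual criteria.

```python
def pair_coverage(contig_length, pairs):
    """Number of read pairs spanning each base; pairs are (start, end) half-open."""
    diff = [0] * (contig_length + 1)
    for start, end in pairs:
        diff[start] += 1
        diff[end] -= 1
    coverage, depth = [], 0
    for i in range(contig_length):
        depth += diff[i]
        coverage.append(depth)
    return coverage


def break_contig(contig_length, pairs, min_pairs=1):
    """Return the (start, end) segments that keep at least min_pairs spanning pairs."""
    coverage = pair_coverage(contig_length, pairs)
    segments, seg_start = [], None
    for pos, depth in enumerate(coverage):
        if depth >= min_pairs:
            if seg_start is None:
                seg_start = pos
        elif seg_start is not None:
            segments.append((seg_start, pos))
            seg_start = None
    if seg_start is not None:
        segments.append((seg_start, contig_length))
    return segments


# Toy contig of 100 bases with no pair spanning positions 60-69.
print(break_contig(100, [(0, 40), (20, 60), (70, 100)]))
```

A real pipeline would also have to allow for contig ends, where spanning pairs are naturally scarce, and for pairs whose mates map with the wrong orientation or distance.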

Paired Reads Separated by “NN”

Error Bases Correction

Mis-assembly errors: Contig Breaking

Read Pair Guided Local Assembler
Track read pairs to walk through repetitive regions

Genome Assembly – T. Devil

Solexa reads:
- Number of read pairs: 528 million
- Finished genome size: 3.5 Gb
- Read length: 2x100 bp
- Estimated read coverage: ~30X
- Insert size: 410/ bp
- Mate pair data: 2k, 4k, 5k, 8k, 10k
- Number of reads clustered: 458 million

Assembly features (stats):
                               Contigs       Supercontigs
  Total number of contigs:     1,246,970     792,099
  Total bases of contigs:      3.22 Gb       3.62 Gb
  N50 contig size:             9,642         434,642
  Largest contig:              96,919        4,150,712
  Average contig size:         2,578         4,564
  Contig coverage on genome:   ~95%          >99%
  Ratio of placed PE reads:    ~92%          ?
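For readers unfamiliar with the N50 statistic quoted on these slides, here is a short sketch using the usual definition: the length L such that contigs of length at least L together contain at least half of the assembled bases. The contig lengths in the example are made up.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


# Hypothetical contig lengths; total is 54, and 10 + 9 + 8 = 27 reaches half.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

The average contig size, by contrast, is just total bases divided by the number of contigs, which is why it sits far below the N50 when an assembly contains many small contigs.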

Genome Assembly – Miscanthus

Solexa reads:
- Number of read pairs: 502 million
- Finished genome size: 2.0 Gb
- Read length: 2x76 bp
- Estimated read coverage: ~35X
- Insert size: 410/ bp
- Mate pair data: 5 kb
- Number of reads clustered: 438 million

Assembly features (stats):
                               Contigs       Supercontigs
  Total number of contigs:     2,241,465     2,090,385
  Total bases of contigs:      1.64 Gb       1.92 Gb
  N50 contig size:             4,301         29,076
  Largest contig:              71,161        730,290
  Average contig size:
  Contig coverage on genome:   ~85%          >95%
  Ratio of placed PE reads:    ~82%          ?

Rice Genome Assembly: one of the most difficult genomes on earth?

Solexa reads:
- Number of read pairs: 97.9 million
- Finished genome size: 440 Mb
- Read length: 2x76 bp
- Estimated read coverage: ~33X
- Insert size: 500/ bp
- Number of reads clustered: 81.2 million

Assembly features (contig stats):
- Total number of contigs: 374,713
- Total bases of contigs: 365 Mb
- N50 contig size: 7,639
- Largest contig: 72,321
- Average contig size: 973
- Contig coverage over the genome: ~83%
- Mis-assembly errors: ?
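The estimated read coverage on these slides is consistent with the simple ratio of sequenced bases to genome size. The check below uses the read-pair counts, read lengths and genome sizes quoted for the rice and Tasmanian devil datasets; the function name is illustrative, and the talk may have derived its estimates differently (for example after filtering reads).

```python
def read_coverage(read_pairs, read_length, genome_size):
    """Fold coverage = (pairs x 2 reads x read length) / genome size."""
    return read_pairs * 2 * read_length / genome_size


# Rice: 97.9 million pairs of 2x76 bp on a 440 Mb genome.
print(round(read_coverage(97.9e6, 76, 440e6), 1))
# Tasmanian devil: 528 million pairs of 2x100 bp on a 3.5 Gb genome.
print(round(read_coverage(528e6, 100, 3.5e9), 1))
```

Both values land within about one fold of the quoted ~33X and ~30X.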

Melanoma cell line COLO-829 Paul Edwards, Departments of Pathology and Oncology, University of Cambridge

Human Cancer Genome Assembly – Normal Cell

Solexa reads:
- Number of read pairs: 557 million
- Finished genome size: 3.0 Gb
- Read length: 2x75 bp
- Estimated read coverage: ~25X
- Insert size: 190/ bp
- Number of reads clustered: 458 million

Assembly features (contig stats):
- Total number of contigs: 1,020,346
- Total bases of contigs: Gb
- N50 contig size: 8,344
- Largest contig: 107,613
- Average contig size: 2,659
- Contig coverage over the genome: ~90%
- Mis-assembly errors: ?

Genome Assembly – Tumour Cell

Solexa reads:
- Number of read pairs: 562 million
- Finished genome size: 3.0 Gb
- Read length: 2x75 bp
- Estimated read coverage: ~25X
- Insert size: 190/ bp
- Number of reads clustered: 449 million

Assembly features (contig stats):
- Total number of contigs: 1,249,719
- Total bases of contigs: Gb
- N50 contig size: 6,073
- Largest contig: 72,123
- Average contig size: 2,152
- Contig coverage over the genome: ~90%
- Mis-assembly errors: ?

Plots of the indel/SV size distribution for all events detected by Pindel at single-base resolution. Left: insertions from 1 bp to 60 bp. Right: deletions from 1 bp to 1 Mb.

Assemblies are used to confirm Pindel predictions: (a) deletion is confirmed by aligning two flanking sequences F1 and F2 to the reference; (b) deletion is not found in the reference with flanking sequences; (c) insertion is confirmed.
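Case (a) can be illustrated with a small sketch. It uses exact string matching as a stand-in for real alignment, and the sequences, the function name and the return convention are assumptions made for the example. A deletion counts as confirmed when the flanks F1 and F2 are directly adjacent in the assembled contig but separated in the reference.

```python
def confirm_deletion(reference, contig, f1, f2):
    """Return the deleted length if F1 and F2 abut in the contig but not in
    the reference; return None when the deletion is not supported."""
    ref_f1 = reference.find(f1)
    if ref_f1 == -1:
        return None
    ref_f1_end = ref_f1 + len(f1)
    ref_f2 = reference.find(f2, ref_f1_end)
    ctg_f1 = contig.find(f1)
    if ref_f2 == -1 or ctg_f1 == -1:
        return None
    # In the contig, F2 must start exactly where F1 ends.
    if not contig.startswith(f2, ctg_f1 + len(f1)):
        return None
    gap = ref_f2 - ref_f1_end
    return gap if gap > 0 else None


# Toy sequences: the tumour contig lacks 10 bases present in the reference.
f1, f2 = "GATTACAGG", "CCATGCAAT"
reference = "AC" + f1 + "TTTTTTTTTT" + f2 + "GT"
tumour_contig = "AC" + f1 + f2 + "GT"
normal_contig = reference

print(confirm_deletion(reference, tumour_contig, f1, f2))
print(confirm_deletion(reference, normal_contig, f1, f2))
```

Running the same check against both tumour and normal contigs separates events seen only in the tumour assembly from those present in both.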

The numbers of structural variations (> 100 bp deletions) detected by Pindel and then confirmed by assembly

                                                      Total    %
  Number of deletions > 100 bp detected by Pindel
  Hits to tumour contigs
  Hits to tumour contigs with correct orientation
  Hits to tumour contigs with correct position
  Hits to normal contigs
  Hits to normal contigs with correct orientation
  Hits to normal contigs with correct position

The number of insertion and deletion events detected by Pindel using input of various coverage (1x-70x)

Acknowledgements:
- Elizabeth Murchison
- Erin Pleasance
- Mike Stratton
- Kai Ye
- Dirk Evers
- Ole Schulz-Trieglaff