Fuzzypath – Algorithms, Applications and Future Developments

Slides:



Advertisements
Similar presentations
Accurate Assembly of Maize BACs Patrick S. Schnable Srinivas Aluru Iowa State University.
Advertisements

Graph Algorithms in Bioinformatics. Outline Introduction to Graph Theory Eulerian & Hamiltonian Cycle Problems Benzer Experiment and Interval Graphs DNA.
Introduction to Bioinformatics Algorithms Graph Algorithms in Bioinformatics.
Large Plant Genome Assemblies using Phusion2 Zemin Ning The Wellcome Trust Sanger Institute.
WGS Assembly and Reads Clustering Zemin Ning Production Software Group Informatics Division.
ATG GAG GAA GAA GAT GAA GAG ATC TTA TCG TCT TCC GAT TGC GAC GAT TCC AGC GAT AGT TAC AAG GAT GAT TCT CAA GAT TCT GAA GGA GAA AAC GAT AAC CCT GAG TGC GAA.
DNA Sequencing with Longer Reads Byung G. Kim Computer Science Dept. Univ. of Mass. Lowell
CS273a Lecture 4, Autumn 08, Batzoglou Some Terminology insert a fragment that was incorporated in a circular genome, and can be copied (cloned) vector.
DNA Sequencing Lecture 9, Tuesday April 29, 2003.
CS262 Lecture 11, Win07, Batzoglou Some Terminology insert a fragment that was incorporated in a circular genome, and can be copied (cloned) vector the.
Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.
Assembly.
Sequencing and Assembly Cont’d. CS273a Lecture 5, Win07, Batzoglou Steps to Assemble a Genome 1. Find overlapping reads 4. Derive consensus sequence..ACGATTACAATAGGTT..
Novel multi-platform next generation assembly methods for mammalian genomes The Baylor College of Medicine, Australian Government and University of Connecticut.
Sequencing and Assembly Cont’d. CS273a Lecture 5, Aut08, Batzoglou Steps to Assemble a Genome 1. Find overlapping reads 4. Derive consensus sequence..ACGATTACAATAGGTT..
CS273a Lecture 4, Autumn 08, Batzoglou Hierarchical Sequencing.
Genome Assembly Bonnie Hurwitz Graduate student TMPL.
CS 6030 – Bioinformatics Summer II 2012 Jason Eric Johnson
Introduction to Bioinformatics Algorithms Graph Algorithms in Bioinformatics.
Sequence Assembly: Concepts BMI/CS 576 Sushmita Roy September 2012 BMI/CS 576.
DNA Sequencing (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 8, 2005 ChengXiang Zhai Department of Computer Science University of Illinois,
CS 394C March 19, 2012 Tandy Warnow.
PE-Assembler: De novo assembler using short paired-end reads Pramila Nuwantha Ariyaratne.
Graphs and DNA sequencing CS 466 Saurabh Sinha. Three problems in graph theory.
1 Velvet: Algorithms for De Novo Short Assembly Using De Bruijn Graphs March 12, 2008 Daniel R. Zerbino and Ewan Birney Presenter: Seunghak Lee.
Universidad de los Andes, Bogotá, Colombia, Septiembre 2015  Sequence and annotation of genomes and metagenomes with Galaxy Dr. rer. nat. Diego Mauricio.
Sequence assembly using paired- end short tags Pramila Ariyaratne Genome Institute of Singapore SOC-FOS-SICS Joint Workshop on Computational Analysis of.
394C March 5, 2012 Introduction to Genome Assembly.
Graph Theory And Bioinformatics Jason Wengert. Outline Introduction to Graphs Eulerian Paths & Hamiltonian Cycles Interval Graph & Shape of Genes Sequencing.
Sequence Assembly Fall 2015 BMI/CS 576 Colin Dewey
Sequence Assembly BMI/CS 576 Fall 2010 Colin Dewey.
Metagenomics Assembly Hubert DENISE
FuzzyPath Assemblies - from Mixed Solexa/454 Datasets to Extremely GC Biased Genomes Zemin Ning The Wellcome Trust Sanger Institute.
Cancer Genome Assemblies and Variations between Normal and Tumour Human Cells Zemin Ning The Wellcome Trust Sanger Institute.
Introduction to Bioinformatics Algorithms Graph Algorithms in Bioinformatics.
Assembly of Paired-end Solexa Reads by Kmer Extension using Base Qualities Zemin Ning The Wellcome Trust Sanger Institute.
Genome De Novo Assemblies and Applications in NGS Sequencing Zemin Ning The Wellcome Trust Sanger Institute.
The Genome Assemblies of Tasmanian Devil Zemin Ning The Wellcome Trust Sanger Institute.
Hashing Algorithm and its Applications in Bioinformatics By Zemin Ning Informatics Division The Wellcome Trust Sanger Institute.
FuzzyPath - A Hybrid De novo Assembler using Solexa and 454 Short Reads Zemin Ning The Wellcome Trust Sanger Institute.
The Wellcome Trust Sanger Institute
A new Approach to Fragment Assembly in DNA Sequenceing Fei wu April,24,2006.
COMPUTATIONAL GENOMICS GENOME ASSEMBLY
Chapter 5 Sequence Assembly: Assembling the Human Genome.
Cross_genome: Assembly Scaffolding using Cross-species Synteny Zemin Ning High Performance Assembly.
ALLPATHS: De Novo Assembly of Whole-Genome Shotgun Microreads
Phusion2 Assemblies and Indel Confirmation Zemin Ning The Wellcome Trust Sanger Institute.
Variation Detections and De novo Assemblies from Next-gen Data Zemin Ning The Wellcome Trust Sanger Institute.
Sequence Alignment and Genome Assembly Zemin Ning The Wellcome Trust Sanger Institute.
Performance Profiling of NGS Genome Assembly Algorithms Alex Ropelewski Pittsburgh Supercomputing Center
Graph Algorithms © Jones and Pevzner © Robert Simons
CSCI2950-C Lecture 2 DNA Sequencing and Fragment Assembly
Short reads: 50 to 150 nt (nucleotide)
DNA Sequencing (Lecture for CS498-CXZ Algorithms in Bioinformatics)
CSCI2950-C Genomes, Networks, and Cancer
CSCI2950-C Lecture 3 DNA Sequencing and Fragment Assembly
Phusion2 and The Genome Assembly of Tasmanian Devil
Cross_genome: Assembly Scaffolding using Cross-species Synteny
CS296-5 Genomes, Networks, and Cancer
CAP5510 – Bioinformatics Sequence Assembly
A Hybrid Assembly System in Zebrafish Pooled Clones
Eulerian tours Miles Jones MTThF 8:30-9:50am CSE 4140 August 15, 2016.
Ssaha_pileup - a SNP/indel detection pipeline from new sequencing data
CSE 5290: Algorithms for Bioinformatics Fall 2011
Graph Algorithms in Bioinformatics
Removing Erroneous Connections
Graph Algorithms in Bioinformatics
Graph Algorithms in Bioinformatics
CSE 5290: Algorithms for Bioinformatics Fall 2009
Presentation transcript:

Fuzzypath – Algorithms, Applications and Future Developments Zemin Ning Sequence Assembly and Analysis 1

Outline of the Talk: Sequence Reconstruction and Euler Path Assembly strategy Sequence extension using read pairs, base qualities, fuzzy kmers or longer reads Repeat junctions Installation, data process and running Gap5 - visual inspection for mis-assembly errors Integration into the Phusion pipeline 2

Sequence Repeat Graph Repeat Sequences

Sequence Reconstruction - Hamiltonian path approach S=(ATGCAGGTCC) ATG -> TGC -> GCA -> CAG -> AGG -> GGT -> GTC -> TCC ATG AGG TGC TCC GTC GGT GCA CAG Vertices: k-tuples from the spectrum shown in red (8); Edges: overlapping k-tuples (7); Path: visiting all vertices corresponding to the sequence.

Sequence Reconstruction - Euler path approach ATG -> TGG -> GGC -> GCG -> CGT -> GTG -> TGC -> GCA AT GT CG CA GC TG GG ATGGCGTGCA ATGCGTGGCA Vertices: correspond to (k-I)-tuples (7); Edges: correspond to k-tuples from the spectrum (8); Path: visiting all EDGES corresponding to the sequence. 17

Assembly Strategy forward-reverse paired reads known dist ~500 bp Solexa read assembler to extend short reads to 1-2 kb long reads Genome/Chromosome Capillary reads assembler Phrap/Phusion

Kmer Extension & Walk

Base Quality to Filter Base Errors

Read Pairs in Repeat Junctions

Kmer Extension & Repeat Junctions Pileup of other reads like 454, Sanger etc at a repeat junction Consensus Means to handle repeats: - Base quality - Read pair - Fuzzy kmers - Closely related reference - 454 or Sanger reads

Handling of Repeat Junctions A = A1 + A2 A2 A1 B1 B = B1 + B2 B2

Handling of Single Base Variations B1 = B2 S = A + B1

Fuzzypath Pipeline

Fuzzypath Read File

Fuzzypath Fastq File

Salmonella seftenberg Solexa Assembly from Pair-End Reads Solexa reads: Number of reads: 6,000,000; Finished genome size: ~4.8 Mbp; Read length: 2x37 bp; Estimated read coverage: ~92.5 X; Insert size: 170/50-300 bp; Assembly features: - contig stats Solexa 454 Total number of contigs: 75; 390 Total bases of contigs: 4.80 Mbp 4.77 Mb N50 contig size: 139,353 25,702 Largest contig: 395,600 62,040 Averaged contig size: 63,969 12,224 Contig coverage on genome: ~99.8 % 99.4% Contig extension errors: 0 Mis-assembly errors: 0 4

maq ssaha2

maq ssaha2

maq ssaha2

maq ssaha2

New Phusion Assembler 2x75 or 2x100 Solexa Reads Assembly Reads Group Data Process Long Insert Reads Supercontig Contigs PRono Fuzzypath Phrap Velvet 2x75 or 2x100

Human Assembly – COLO-829 Normal Cell Solexa reads: Number of reads: 557 Million; Finished genome size: 3.0 GB; Read length: 2x75bp; Estimated read coverage: ~25X; Insert size: 190/50-300 bp; Number of reads clustered: 458 Million Assembly features: - contig stats Total number of contigs: 1,040,582; Total bases of contigs: 2.703 Gb N50 contig size: 6,484; Largest contig: 85,595 Averaged contig size: 2,597; Contig coverage over the genome: ~90 %; Mis-assembly errors: ?

Acknowledgements: Yong Gu James Bonfield Heng Li Hannes Ponstingl Daniel Zerbino (EBI) Helen Beasley Siobhan Whitehead Tony Cox