CS 394C March 19, 2012 Tandy Warnow.

Slides:



Advertisements
Similar presentations
CS 336 March 19, 2012 Tandy Warnow.
Advertisements

Lecture 24 Coping with NPC and Unsolvable problems. When a problem is unsolvable, that's generally very bad news: it means there is no general algorithm.
Genome Assembly: a brief introduction
MCB Lecture #9 Sept 23/14 Illumina library preparation, de novo genome assembly.
Next Generation Sequencing, Assembly, and Alignment Methods
Alignment Problem (Optimal) pairwise alignment consists of considering all possible alignments of two sequences and choosing the optimal one. Sub-optimal.
CS273a Lecture 4, Autumn 08, Batzoglou Some Terminology insert a fragment that was incorporated in a circular genome, and can be copied (cloned) vector.
DNA Sequencing Lecture 9, Tuesday April 29, 2003.
Genome Sequence Assembly: Algorithms and Issues Fiona Wong Jan. 22, 2003 ECS 289A.
Class 02: Whole genome sequencing. The seminal papers ``Is Whole Genome Sequencing Feasible?'' ``Whole-Genome DNA.
Genome sequence assembly
CS262 Lecture 11, Win07, Batzoglou Some Terminology insert a fragment that was incorporated in a circular genome, and can be copied (cloned) vector the.
Assembly.
Sequencing and Assembly Cont’d. CS273a Lecture 5, Win07, Batzoglou Steps to Assemble a Genome 1. Find overlapping reads 4. Derive consensus sequence..ACGATTACAATAGGTT..
Novel multi-platform next generation assembly methods for mammalian genomes The Baylor College of Medicine, Australian Government and University of Connecticut.
Sequencing and Assembly Cont’d. CS273a Lecture 5, Aut08, Batzoglou Steps to Assemble a Genome 1. Find overlapping reads 4. Derive consensus sequence..ACGATTACAATAGGTT..
CS273a Lecture 4, Autumn 08, Batzoglou Hierarchical Sequencing.
CS273a Lecture 2, Autumn 10, Batzoglou DNA Sequencing (cont.)
CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics.
Genome sequencing and assembling
Compartmentalized Shotgun Assembly ? ? ? CSA Two stated motivations? ?
Genome sequencing. Vocabulary Bac: Bacterial Artificial Chromosome: cloning vector for yeast Pac, cosmid, fosmid, plasmid: cloning vectors for E. coli.
1 Sequencing and Sequence Assembly --overview of the genome sequenceing process Presented by NIE, Lan CSE497 Feb.24, 2004.
Genome sequencing and assembly Mayo/UIUC Summer Course in Computational Biology Genome sequencing and assembly.
Assembling Genomes BCH364C/391L Systems Biology / Bioinformatics – Spring 2015 Edward Marcotte, Univ of Texas at Austin Edward Marcotte/Univ. of Texas/BCH364C-391L/Spring.
De-novo Assembly Day 4.
How to Build a Horse Megan Smedinghoff.
Physical Mapping of DNA Shanna Terry March 2, 2004.
Mon C222 lecture by Veli Mäkinen Thu C222 study group by VM  Mon C222 exercises by Anna Kuosmanen Algorithms in Molecular Biology, 5.
Todd J. Treangen, Steven L. Salzberg
A hierarchical approach to building contig scaffolds Mihai Pop Dan Kosack Steven L. Salzberg Genome Research 14(1), pp , 2004.
PE-Assembler: De novo assembler using short paired-end reads Pramila Nuwantha Ariyaratne.
Graphs and DNA sequencing CS 466 Saurabh Sinha. Three problems in graph theory.
GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology.
1 Velvet: Algorithms for De Novo Short Assembly Using De Bruijn Graphs March 12, 2008 Daniel R. Zerbino and Ewan Birney Presenter: Seunghak Lee.
June 11, 2013 Intro to Bioinformatics – Assembling a Transcriptome Tom Doak Carrie Ganote National Center for Genome Analysis Support.
Computer Science Research for The Tree of Life Tandy Warnow Department of Computer Sciences University of Texas at Austin.
Next generation sequence data and de novo assembly For human genetics By Jaap van der Heijden.
Meraculous: De Novo Genome Assembly with Short Paired-End Reads
Sequence assembly using paired- end short tags Pramila Ariyaratne Genome Institute of Singapore SOC-FOS-SICS Joint Workshop on Computational Analysis of.
Steps in a genome sequencing project Funding and sequencing strategy source of funding identified / community drive development of sequencing strategy.
P. Tang ( 鄧致剛 ); RRC. Gan ( 甘瑞麒 ); PJ Huang ( 黄栢榕 ) Bioinformatics Center, Chang Gung University. Genome Sequencing Genome Resequencing De novo Genome.
CS/BioE 598AGB: Genome Assembly, part II Tandy Warnow.
Metagenomics Assembly Hubert DENISE
SIZE SELECT SHEAR Shotgun DNA Sequencing (Technology) DNA target sample LIGATE & CLONE Vector End Reads (Mates) SEQUENCE Primer.
Fragment assembly of DNA A typical approach to sequencing long DNA molecules is to sample and then sequence fragments from them.
Problems of Genome Assembly James Yorke and Aleksey Zimin University of Maryland, College Park 1.
RNA Sequence Assembly WEI Xueliang. Overview Sequence Assembly Current Method My Method RNA Assembly To Do.
Human Genome.
Whole Genome Sequencing (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 13, 2005 ChengXiang Zhai Department of Computer Science University of.
CS 173, Lecture B Introduction to Genome Assembly (using Eulerian Graphs) Tandy Warnow.
COMPUTATIONAL GENOMICS GENOME ASSEMBLY
Chapter 5 Sequence Assembly: Assembling the Human Genome.
ALLPATHS: De Novo Assembly of Whole-Genome Shotgun Microreads
CyVerse Workshop Transcriptome Assembly. Overview of work RNA-Seq without a reference genome Generate Sequence QC and Processing Transcriptome Assembly.
Assembly algorithms for next-generation sequencing data
DNA Sequencing Project
Sequence assembly Jose Blanca COMAV institute bioinf.comav.upv.es.
CAP5510 – Bioinformatics Sequence Assembly
COMPUTATIONAL GENOMICS GENOME ASSEMBLY
Jeong-Hyeon Choi, Sun Kim, Haixu Tang, Justen Andrews, Don G. Gilbert
Genome sequence assembly
Introduction to Genome Assembly
Removing Erroneous Connections
CS 598AGB Genome Assembly Tandy Warnow.
Graph Theory.
How to Build a Horse: Final Report
An Eulerian path approach to DNA fragment assembly
Assembling Genomes BCH339N Systems Biology / Bioinformatics – Spring 2016 Edward Marcotte, Univ of Texas at Austin.
Presentation transcript:

CS 394C March 19, 2012 Tandy Warnow

2 2

Shotgun DNA Sequencing DNA target sample SHEAR & SIZE End Reads / Mate Pairs 550bp Not all sequencing technologies produce mate-pairs. Different error models Different read lengths 10,000bp 3

Basic Challenge Given many (millions or billions) of reads, produce a linear (or perhaps circular) genome Issues: Coverage Errors in reads Reads vary from very short (35bp) to quite long (800bp), and are double-stranded Non-uniqueness of solution Running time and memory

Simplest scenario Reads have no error Read are long enough that each appears exactly once in the genome Each read given in the same orientation (all 5’ to 3’, for example)

De novo vs. comparative assembly De novo assembly means you do everything from scratch Comparative assembly means you have a “reference” genome. For example, you want to sequence your own genome, and you have Craig Venter’s genome already sequenced. Or you want to sequence a chimp genome and you have a human already sequenced.

Comparative Assembly Much easier than de novo! Basic idea: Take the reads and map them onto the reference genome (allowing for some small mismatch) Collect all overlapping reads, produce a multiple sequence alignment, and produce consensus sequence

Comparative Assembly Fast Short reads can map to several places (especially if they have errors) Needs close reference genome Repeats are problematic Can be highly accurate even when reads have errors

De Novo assembly Much easier to do with long reads Need very good coverage Generally produces fragmented assemblies Necessary when you don’t have a closely related (and correctly assembled) reference genome

De Novo Assembly paradigms Overlap Graph: overlap-layout-consensus methods greedy (TIGR Assembler, phrap, CAP3...) graph-based (Celera Assembler, Arachne) k-mer graph (especially useful for assembly from short reads) 10 10

Overlap Graph Each read is a node There is a directed edge from u to v if the two reads have sufficient overlap Objective: Find a Hamiltonian Path (for linear genomes) or a Hamiltonian Circuit (for circular genomes)

Paths through graphs and assembly Hamiltonian circuit: visit each node (city) exactly once, returning to the start Genome 12 12

Hamiltonian Path Approach Hamiltonian Path is NP-hard (but good heuristics exist), and can have multiple solutions Dependency on detecting overlap (errors in reads, overlap length) Running time (all-pairs overlap calculation) Repeats Tends to produce fragmented assemblies (contigs)

Example Reads: TAATACTTAGG TAGGCCA GCCAGGAAT GAATAAGCCAA GCCAATTT AATTTGGAAT GGAATTAAGGCAC AGGCACGTTTA CACGTTAGGACCATT GGACCATTTAATACGGAT If minimum overlap is 3, what graph do we get? If minimum overlap is 4, what graph do we get? If minimum overlap is 5, what graph do we get?

Overlap-layout-consensus Main entity: read Relationship between reads: overlap 1 4 7 2 5 8 3 6 9 1 2 3 4 5 6 7 8 9 ACCTGA AGCTGA ACCAGA 1 1 2 3 1 2 3 2 3 2 3 1 3 1 1 3 2 2 15 15

TIGR Assembler/phrap Greedy Build a rough map of fragment overlaps Pick the largest scoring overlap Merge the two fragments Repeat until no more merges can be done 16 16

Challenges for Overlap-Layout-Consensus Computing all-pairs overlap is computationally expensive, especially for NGS datasets, which can have millions of short reads. (The Hamiltonian Path part doesn’t help either) This computational challenge is increased for large genomes (e.g., human) Need something faster!

k-mer graph (aka de Bruijn graph) Vertices are k-mers that appear in some read, and edges defined by overlap of k-1 nucleotides Small values of k produce small graphs Does not require all-pairs overlap calculation! But: loss of information about reads can lead to “chimeric” contigs, and incorrect assemblies Also produces fragmented assemblies (even shorter contigs)

Eulerian Paths An Eulerian path is one that goes through every edge exactly once It is easy to see that if a graph has an Eulerian path, then all but 2 nodes have even degree. The converse is also true, but a bit harder to prove. For directed graphs, the cycle will need to follow the direction of the edges (also called “arcs”). In this case, a graph has an Eulerian path if and only if the indegree(v)=outdegree(v) for all but 2 nodes (x and y), where indegree(x)=outdegree(x)+1, and indegree(y)=outdegree(y)-1.

de Bruijn Graphs are Eulerian If the k-mer set comes from a sequence and every k-mer appears exactly once in the sequence, then the de Bruijn graph has an Eulerian path!

de Bruijn Graph Create the de Bruijn graph for the following string, using k=5 ACATAGGATTCAC Find the Eulerian path Is the Eulerian path unique? Reconstruct the sequence from this path

Using de Bruijn Graphs Given: set of k-mers from a DNA sequence Algorithm: Construct the de Bruijn graph Find an Eulerian path in the graph The path defines a sequence with the same set of k-mers as the original

Using de Bruijn Graphs Given: set of k-mers from a set of reads for a sequence Algorithm: Construct the de Bruijn graph Try to find an Eulerian path in the graph

No matter what Because of Errors in reads Repeats Insufficient coverage the overlap graphs and de Bruijn graphs generally don’t have Hamiltonian paths/circuits or Eulerian paths/circuits This means the first step doesn’t completely assemble the genome

Reads, Contigs, and Scaffolds Reads are what you start with (35bp-800bp) Fragmented assemblies produce contigs that can be kilobases in length Putting contigs together into scaffolds is the next step

? Mate Pairs Give Order & Orientation Contig Assembly without pairs results in contigs whose order and orientation are not known. Contig Consensus (15-30Kbp) Reads ? 2-pair Pairs, especially groups of corroborating ones, link the contigs into scaffolds where the size of gaps is well characterized. Mean & Std.Dev. is known Scaffold 26

Handling repeats Repeat detection Repeat resolution pre-assembly: find fragments that belong to repeats statistically (most existing assemblers) repeat database (RepeatMasker) during assembly: detect "tangles" indicative of repeats (Pevzner, Tang, Waterman 2001) post-assembly: find repetitive regions and potential mis-assemblies. Reputer, RepeatMasker "unhappy" mate-pairs (too close, too far, mis-oriented) Repeat resolution find DNA fragments belonging to the repeat determine correct tiling across the repeat 27 27

Assembly Pipeline Trim & Screen Overlapper Unitiger Scaffolder Find all overlaps  40bp allowing 6% mismatch. Overlapper A B Unitiger implies A B TRUE Scaffolder Repeat Rez I, II OR A B REPEAT-INDUCED 28

Assembly Pipeline Trim & Screen Overlapper Unitiger Scaffolder Compute all overlap consistent sub-assemblies: Unitigs (Uniquely Assembled Contig) Overlapper Unitiger Scaffolder Repeat Rez I, II 29

Assembly Pipeline Trim & Screen Mated reads Overlapper Unitiger Scaffold U-unitigs with confirmed pairs Mated reads Overlapper Unitiger Scaffolder Repeat Rez I, II 30

Assembly Pipeline Trim & Screen Overlapper Unitiger Scaffolder Fill repeat gaps with doubly anchored positive unitigs Overlapper Unitig>0 Unitiger Scaffolder Repeat Rez I, II 31