Presentation is loading. Please wait.

Presentation is loading. Please wait.

De-novo Assembly Day 4.

Similar presentations

Presentation on theme: "De-novo Assembly Day 4."— Presentation transcript:

1 De-novo Assembly Day 4

2 The Challenge Given many (millions or billions) of reads, produce a linear (or perhaps circular) genome Issues: Coverage Errors in reads Reads vary from very short (35bp) to quite long (800bp), and are double-stranded Non-uniqueness of solution Running time and memory

3 In an ideal world… Reads have no error
Reads are long enough and each appears once Each read given in the same orientation (all 5’ to 3’, for example) Contiguous reads have enough overlap so that they can easily be assembled

4 Shotgun DNA Sequencing
DNA target sample SHEAR & SIZE End Reads / Mate Pairs 550bp Not all sequencing technologies produce mate-pairs. Different error models Different read lengths 10,000bp 4

5 Terminology Reads are what you start with (35bp-800bp)
Fragmented assemblies produce contigs Contigs can be put together into scaffolds

6 Reference-based vs de novo assembly
Comparative assembly means you have a “reference” genome Same OR related species Can get weird in bacteria and viruses due to recombination De novo assembly means you do everything from scratch

7 Comparative Assembly Much easier than de novo! Basic idea:
Take the reads and map them onto the reference (allowing for small mismatches) Collect all overlapping reads, produce a multiple sequence alignment, and produce consensus sequence

8 Comparative Assembly Fast
Short reads can map to several places (especially if they have errors) Needs close reference genome Repeats are problematic Can be highly accurate even when reads have errors

9 De Novo assembly Much easier to do with long reads
Need very good coverage Generally produces fragmented assemblies Necessary when you don’t have a closely related (and correctly assembled) reference genome

10 Conceptual steps in de novo assembly
1. Find reads that overlap by a specified number of bases (the k-mer size) 2. Merge overlapping, “good” reads into longer contigs 3. Link contigs to form scaffolds using paired-end information Diagrams from Serafim Batzoglou, Stanford

11 De Novo Assembly paradigms
Overlap-layout-consensus methods k-mer graph (especially useful for assembly from short reads) aka de Bruijn graph 11 11

12 de Bruijn graph has k-mers as nodes connected by reads; assembly involves finding Eulerian path through graph Diagram from Michael Schatz, Cold Spring Harbor

13 de Bruijn graph Vertices are k-mers that appear in some read, and edges defined by overlap of k-1 nucleotides Small values of k produce small graphs Does not require all-pairs overlap calculation! But: loss of information about reads can lead to “chimeric” contigs, and incorrect assemblies Also produces fragmented assemblies (even shorter contigs)

14 de Bruijn Graph use Create the de Bruijn graph for the following string, using k=5 ACATAGGATTCAC Find the Eulerian path Is the Eulerian path unique? Reconstruct the sequence from this path

15 But… Because of Errors in reads Repeats Insufficient coverage de Bruijn graphs generally don’t Eulerian paths/circuits This means the first step doesn’t completely assemble the genome

16 Velvet and SOAPdenovo are two leading assemblers
Velvet is from EMBL-EBI SOAPdenovo is from BGI k-mer size is adjustable parameter Typically it is adjusted to maximize N50 length of scaffolds or contigs N50 length is central measure of distribution weighted by lengths

17 NOTE: QC step slightly different
You still do standard checks with something like FASTQC Read Trimming is different Algorithms/tools can deal with different read lengths Finding overlaps give longer contigs So we never want to sacrifice good reads Solution: remove sequencing adapters trim individual reads as needed

18 Three useful measures for optimization of k-mer length
Mean length: the usual average of the lengths Median length: the length for which half the sequences are shorter & half are longer N50 length: the length that splits the total bases in half, after the lengths are ordered Example values for distribution of contig lengths: Mean length: 627 Median length: 200 N50 length: 1,718 We’ll look at using N50 in in the practical

19 Practical prep Download and install
TRIMMOMATIC: VELVET: Use make ’MAXKMERLENGTH=70’ when compiling Grab practical dataset from course page

Download ppt "De-novo Assembly Day 4."

Similar presentations

Ads by Google