Assembly algorithms for next-generation sequencing data

Jason R. Miller, Sergey Koren, Granger Sutton

OUTLINE
- Introduction
- The Challenges of Assembly
- Graph Algorithms for Assembly:
  - Greedy Graph-Based Assemblers
  - Overlap/Layout/Consensus Assemblers
  - The de Bruijn Graph Approach
- Future lines of questioning

Figure: (A) Sampling from habitat. (B) Filtering particles, typically by size. (C) DNA extraction and lysis. (D) Cloning and library construction. (E) Sequencing the clones into reads. (F) Sequence assembly. Wooley JC, Godzik A, Friedberg I (2010) A Primer on Metagenomics. PLOS Computational Biology 6(2): e

NEXT-GENERATION SEQUENCING
The second-generation machines are characterized by:
- highly parallel operation
- higher yield
- simpler operation
- much lower cost per read
- shorter reads (unfortunately)
Today's machines are commonly referred to as short-read sequencers or next-generation sequencers (NGS).

WHAT IS AN ASSEMBLY?
An assembly is a data structure that maps the sequence data to a putative reconstruction of the target.

DE NOVO ASSEMBLY
De novo assembly refers to reconstruction from scratch, without the aid of external data.

COVERAGE
Coverage of a genome is the mean number of times each nucleotide is sequenced. Thus, 5X coverage means that each nucleotide in the genome is sequenced five times on average.
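
As a rough illustration of this definition, mean coverage can be estimated as the total number of sequenced bases divided by the genome length. The sketch below assumes all reads have the same length; the function names and the example numbers are hypothetical and do not come from the slides.

```python
import math


def mean_coverage(num_reads, read_length, genome_length):
    """Mean number of times each base is sequenced: N * L / G."""
    return num_reads * read_length / genome_length


def reads_needed(target_coverage, read_length, genome_length):
    """Smallest number of reads whose total bases reach the target coverage."""
    return math.ceil(target_coverage * genome_length / read_length)


# Hypothetical numbers: 250,000 reads of 100 bp from a 5 Mbp genome.
print(mean_coverage(250_000, 100, 5_000_000))  # 5.0
print(reads_needed(5, 100, 5_000_000))         # 250000
```

This is only the mean. As the challenges listed in this deck note, real coverage is non-uniform, so individual regions can fall far below it.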

THE CHALLENGES OF ASSEMBLY
- Repeat sequences: genomic regions that share perfect repeats can be indistinguishable.
- Sequencing error: errors can induce false assemblies.
- Non-uniform coverage: very low coverage induces gaps in assemblies, and coverage variability undermines the coverage-based diagnostics and statistical tests designed to detect errors and repeats.

- Computational complexity: larger volumes of data are more costly to process.
- Genomic diversity and variable abundance within populations: assembly reconstructs the most abundant sequences, and coverage is usually incomplete. There is also the danger of assembling sequences from different species into interspecies chimeras.

GREEDY GRAPH-BASED ASSEMBLERS
The greedy algorithms apply one basic operation: given any read or contig, find the read or contig with the largest overlap and merge the two into a new contig. The basic operation is repeated until no more merges are possible.
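
The basic operation can be sketched in a few lines. This is a minimal illustration, not the method of any particular assembler: it assumes error-free reads, exact suffix-to-prefix overlaps, a single strand, and a hypothetical `min_len` cutoff for the shortest overlap worth merging. It uses the example reads from the overlap graph slide.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of contigs with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # no more merges are possible
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


reads = ["ACGCA", "CGCAT", "CATTC", "ATTCG", "TCGCG"]
print(greedy_assemble(reads))  # ['ACGCATTCGCG']
```

Because each merge is chosen locally, a greedy assembler can take a wrong turn at a repeat. The quadratic pair scan here is also only practical for toy inputs.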

OVERLAP/LAYOUT/CONSENSUS ASSEMBLERS
The OLC approach has three phases:
- Overlap: identify all pairs of reads that overlap and build an overlap graph.
- Layout: simplify the overlap graph into an approximate read layout (contigs).
- Consensus: determine the consensus sequence.
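
To make the consensus phase concrete, here is a minimal sketch that takes a finished layout and calls each base by majority vote. It assumes the layout is already known as hypothetical `(offset, read)` pairs and that reads align without gaps; real OLC assemblers typically compute a multiple sequence alignment instead.

```python
from collections import Counter


def consensus(layout):
    """Majority-vote consensus from (offset, read) pairs placed on one contig."""
    length = max(offset + len(read) for offset, read in layout)
    columns = [Counter() for _ in range(length)]
    for offset, read in layout:
        for i, base in enumerate(read):
            columns[offset + i][base] += 1
    # Uncovered columns are reported as 'N'.
    return "".join(col.most_common(1)[0][0] if col else "N" for col in columns)


# Three reads stacked at the same position; the middle one carries an error.
layout = [(0, "ACGTAC"), (0, "ACCTAC"), (0, "ACGTAC")]
print(consensus(layout))  # ACGTAC
```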

OVERLAP GRAPH
Nodes represent the reads. Edges represent overlaps.
Reads: ACGCA CGCAT CATTC ATTCG TCGCG
Finding the correct assembly is cast as a Hamiltonian path problem: finding a path in the graph that visits each vertex exactly once.
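
The following sketch builds the overlap graph for the five example reads and searches for Hamiltonian paths by brute force. The minimum overlap of 3 is an assumption chosen for illustration, and trying every permutation is only feasible for a handful of reads, since the Hamiltonian path problem is NP-complete in general.

```python
from itertools import permutations


def overlap(a, b, min_len):
    """Longest suffix of a matching a prefix of b, or 0 if shorter than min_len."""
    return next((n for n in range(min(len(a), len(b)), min_len - 1, -1)
                 if a[-n:] == b[:n]), 0)


def overlap_graph(reads, min_len=3):
    """Directed edges (a, b) -> overlap length, for every overlapping pair of reads."""
    edges = {}
    for a in reads:
        for b in reads:
            if a != b:
                n = overlap(a, b, min_len)
                if n:
                    edges[(a, b)] = n
    return edges


def hamiltonian_paths(reads, edges):
    """Yield every ordering of the reads in which consecutive reads overlap."""
    for order in permutations(reads):
        if all((a, b) in edges for a, b in zip(order, order[1:])):
            yield order


reads = ["ACGCA", "CGCAT", "CATTC", "ATTCG", "TCGCG"]
edges = overlap_graph(reads)
for path in hamiltonian_paths(reads, edges):
    print(" -> ".join(path))
# ACGCA -> CGCAT -> CATTC -> ATTCG -> TCGCG
```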

K-MER
A “K-mer” is a substring of length K, where K is any positive integer.
R: GGCGATTCATCG
All 3-mers of R: GGC GCG CGA GAT ATT TTC TCA CAT ATC TCG
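
A sequence of length n contains n - K + 1 overlapping K-mers. The short sketch below enumerates them with a sliding window and reproduces the 3-mers of R listed above; the function name `kmers` is chosen for illustration.

```python
def kmers(seq, k):
    """All overlapping substrings of length k, in order of position."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


print(" ".join(kmers("GGCGATTCATCG", 3)))
# GGC GCG CGA GAT ATT TTC TCA CAT ATC TCG
```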

THE DE BRUIJN GRAPH
The de Bruijn graph was developed outside the realm of DNA sequencing to represent strings from a finite alphabet. The nodes represent all possible fixed-length strings. The edges represent suffix-to-prefix perfect overlaps.
A K-mer graph is a form of de Bruijn graph. Its nodes represent all the fixed-length subsequences (K-mers) drawn from the reads. Its edges represent all the fixed-length overlaps between those subsequences.

Example (k=4):
Reads: ATGCTA CTATGC ATGCGT
k-mers: ATGC TGCT GCTA CTAT TATG ATGC TGCG GCGT
[Figure: k-mer graph of these reads with an Eulerian path through it]
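
The example can be reproduced with a small sketch. It follows the K-mer graph description in this deck: nodes are the distinct k-mers of the reads, and an edge joins two k-mers whenever the last k-1 bases of one equal the first k-1 bases of the other. The Eulerian path is found with Hierholzer's algorithm, which is a choice made here and is not named in the slides. The sketch assumes error-free reads, ignores k-mer multiplicity and reverse complements, and does not check that an Eulerian path actually exists.

```python
from collections import defaultdict


def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_kmer_graph(reads, k):
    """Nodes: distinct k-mers. Edge u -> v when u[1:] == v[:-1] (a k-1 overlap)."""
    nodes = sorted({km for read in reads for km in kmers(read, k)})
    by_prefix = defaultdict(list)
    for node in nodes:
        by_prefix[node[:-1]].append(node)
    return {node: list(by_prefix[node[1:]]) for node in nodes}


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes the graph is non-empty and has an Eulerian path."""
    in_deg = defaultdict(int)
    for successors in graph.values():
        for v in successors:
            in_deg[v] += 1
    # Start at the node with one more outgoing than incoming edge, if there is one.
    start = next((u for u in graph if len(graph[u]) - in_deg[u] == 1),
                 next(iter(graph)))
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining[u]:
            stack.append(remaining[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())          # dead end: emit the node
    return path[::-1]


def spell(path):
    """Sequence spelled by a walk: the first k-mer plus the last base of each later one."""
    return path[0] + "".join(node[-1] for node in path[1:])


reads = ["ATGCTA", "CTATGC", "ATGCGT"]
path = eulerian_path(build_kmer_graph(reads, 4))
print(" ".join(path))  # ATGC TGCT GCTA CTAT TATG ATGC TGCG GCGT
print(spell(path))     # ATGCTATGCGT
```

The path visits the node ATGC twice, which is why a repeated k-mer appears in the traversal even though it is a single node in the graph. Many formulations instead place k-mers on the edges and (k-1)-mers on the nodes; the traversal idea is the same.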

DE NOVO ASSEMBLY SOFTWARE
- Greedy assemblers: SSAKE, SHARCGS, VCAKE …
- OLC assemblers: Newbler, CABOG, Edena, Shorty …
- DBG assemblers: Euler, Velvet, AllPaths, ABySS, SOAP …
- Other software: PCAP, LOCAS, MIRA, Taipan, CLC Workbench, SeqMan …

EULER ASSEMBLER – ERROR CORRECTION
The Euler assembler was the first to present this technique using de Bruijn graphs. Before it builds its graph, Euler filters the reads to identify sequencing errors by comparing the K-mer content of each individual read with that of all reads. It distrusts K-mers in an individual read whose frequency across all reads is below a threshold. Euler corrects substitution errors. Finally, it either accepts a fully corrected read or rejects the read.
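
The filtering idea can be illustrated with a deliberately simplified sketch. This is not Euler's actual algorithm: it only tries single-base substitutions, and the names `is_solid` and `correct_read`, the choice k=3, and the threshold of 2 are assumptions made for this example. A read is accepted when every one of its K-mers reaches the frequency threshold across all reads, and rejected (`None`) when no single substitution achieves that.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Frequency of every k-mer across all reads."""
    return Counter(read[i:i + k] for read in reads
                   for i in range(len(read) - k + 1))


def is_solid(read, counts, k, threshold):
    """True when every k-mer of the read occurs at least `threshold` times overall."""
    return all(counts[read[i:i + k]] >= threshold
               for i in range(len(read) - k + 1))


def correct_read(read, counts, k, threshold):
    """Return the read, a single-substitution correction of it, or None to reject it."""
    if is_solid(read, counts, k, threshold):
        return read
    for i in range(len(read)):
        for base in "ACGT":
            if base != read[i]:
                candidate = read[:i] + base + read[i + 1:]
                if is_solid(candidate, counts, k, threshold):
                    return candidate
    return None


# Hypothetical data: five copies of the true read and one copy with an error.
reads = ["CAGGTCT"] * 5 + ["CAGCTCT"]
counts = kmer_counts(reads, 3)
print(correct_read("CAGCTCT", counts, 3, 2))  # CAGGTCT
print(correct_read("CTGCTCT", counts, 3, 2))  # None (two errors, rejected)
```

The threshold matters in practice: set too low, errors in high-coverage regions pass as trusted; set too high, genuine K-mers from low-coverage regions are distrusted.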

Example of a sequencing error: the reads CAGGTCT and CAGCTCT differ at a single base.
[Figure: K-mers of these reads, beginning CAG, AGG]

[Figure: K-mer count profiles when errors are in different parts of the read GCGTATTACGCGTCTGGCCT]

FUTURE LINES OF QUESTIONING
Reads of the future will challenge assembly software in many ways: Almost certainly, data volume will continue to increase while manufacturing cost declines. The next-generation technology will surely be applied to larger genomes, more repetitive sequences, and less homogeneous samples. The quest for more powerful and efficient assembly software remains an area of critical research.

Thank you
