Genome sequencing and assembly
Mayo/UIUC Summer Course in Computational Biology


1

2

3 Genome sequencing and assembly
Mayo/UIUC Summer Course in Computational Biology

4 Session Outline
Planning a genome sequencing project
Assembly strategies and algorithms
Assessing the quality of the assembly
Assessing the quality of the assemblers
Genome annotation

5 Genome sequencing

6 Schematic overview of genome assembly
(a) DNA is collected from the biological sample and sequenced.
(b) The output from the sequencer consists of many billions of short, unordered DNA fragments from random positions in the genome.
(c) The short fragments are compared with each other to discover how they overlap.
(d) The overlap relationships are captured in a large assembly graph, shown as nodes representing k-mers or reads, with edges drawn between overlapping k-mers or reads.
(e) The assembly graph is refined to correct errors and simplified into the initial set of contigs, shown as large ovals connected by edges.
(f) Finally, mates, markers and other long-range information are used to order and orient the initial contigs into large scaffolds, shown as thin black lines connecting the initial contigs.
Schatz et al. Genome Biology :243

7 Planning a genome sequencing project
How large is my genome? How much of it is repetitive, and what is the repeat size distribution? Is a good quality genome of a related species available? What will be my strategy for performing the assembly?

8 How large is my genome?
The size of the genome can be estimated from the ploidy of the organism and the DNA content per cell
This will affect:
  How many reads will be required to attain sufficient coverage (typically 10x to 100x)
  What sequencing technology to use
  What computational resources will be needed
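The link between genome size, coverage and read count can be sketched in a few lines. This is only a back-of-the-envelope illustration: the function name `reads_required` is made up for this example, and it assumes uniform read length and ignores reads lost to quality trimming. The two example calls use the bacterial and vertebrate figures quoted under "Typical sequencing strategies" in this deck.

```python
import math


def reads_required(genome_size_bp: int, coverage: float, read_length_bp: int) -> int:
    """Reads needed so that total sequenced bases equal coverage x genome size.

    Hypothetical helper for illustration; assumes every read has the same length.
    """
    if read_length_bp <= 0:
        raise ValueError("read_length_bp must be positive")
    return math.ceil(genome_size_bp * coverage / read_length_bp)


# 3 Mb bacterial genome, 25x coverage, 500 nt reads
bacterial_reads = reads_required(3_000_000, 25, 500)

# 1 Gb vertebrate genome, 100x coverage, 100 nt reads
vertebrate_reads = reads_required(1_000_000_000, 100, 100)

print(bacterial_reads, vertebrate_reads)
```

The two results (150,000 and 1 billion reads) match the approximate numbers given on the sequencing-strategies slide.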

9 Repetitive sequences
Most common source of assembly errors
If the sequencing technology produces reads longer than the repeat, the impact is much smaller
Most common solution: generate mate pairs with spacing larger than the largest known repeat

10 Assemblies can collapse around repetitive sequences.
R1 and R2, in yellow, represent near-identical copies of the same DNA sequence.
Salzberg S L, and Yorke J A. Bioinformatics 2005;21:

11 Mis-assembly of repetitive sequence
Schatz M C et al. Brief Bioinform 2013;14:

12 Genome(s) from related species
Preferably of good quality, with large reliable scaffolds
Help guide the assembly of the target species
Help verify the completeness of the assembly
Can themselves be improved in some cases
But to be used with caution: can cause errors when architectures are different!

13 Strategies for assembly
The sequencing approaches and assembly strategies are interdependent!
E.g., for bacterial genome assembly, can generate 454 sequence reads and assemble with Newbler, or generate Illumina reads and assemble with Velvet
Optimal sequencing strategies are very different for a SOAPdenovo or an ALLPATHS-LG assembly

14 Typical sequencing strategies
Bacterial genome:
  Shotgun or mate-pair >500 nt reads from 454 machine at 25x coverage (~150,000 reads for a 3 Mb genome), assembly with Newbler
  PacBio CLR sequences at 200x coverage, self-correction and/or hybrid correction, and assembly using Celera Assembler or PBJelly
Vertebrate genome:
  Combination of paired-end (180 nt fragments) and mate-pair (1, 3 and 10 kb libraries) 100 nt reads from Illumina machine at 100x coverage (~1B reads for a 1 Gb genome), assembly with ALLPATHS-LG

15 Mate-pair library preparation from 454 (left) and Illumina (right)

16 Additional useful data
Fosmid libraries
  End sequencing adds long-range contiguity information
  Pooled fosmids (~5000) can often be assembled more efficiently
Moleculo libraries
  New technology acquired by Illumina, allows generation of fully assembled 10 kb sequences
PacBio reads
  Provide 1-3 kb reads, but need parallel coverage by Illumina data for error correction

17 Assembly strategies and algorithms
In all cases, start with cleanup and error correction of raw reads
For long reads (>500 nt), Overlap/Layout/Consensus (OLC) algorithms work best
For short reads, De Bruijn graph-based assemblers are most widely used

18 Cleaning up the data
Trim reads with low quality calls
Remove short reads
Correct errors:
  Find all distinct k-mers (typically k=15) in input data
  Plot coverage distribution
  Correct low-coverage k-mers to match high-coverage
  Part of several assemblers, also stand-alone Quake or khmer programs
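The k-mer counting step behind this kind of error correction can be illustrated as follows. This is a minimal sketch, not how Quake or khmer are implemented: the function names and the fixed `min_count` cutoff are assumptions made for the example, reverse complements are ignored, and real tools choose the cutoff from the coverage distribution and use memory-efficient counting structures.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def coverage_histogram(counts):
    """Map each k-mer multiplicity to the number of distinct k-mers seen that often."""
    return Counter(counts.values())


def weak_kmers(counts, min_count):
    """K-mers seen fewer than min_count times; likely to contain sequencing errors."""
    return {kmer for kmer, n in counts.items() if n < min_count}


# Three error-free copies of a read plus one copy with a single C->G error.
reads = ["ACGTACGT", "ACGTACGT", "ACGTACGT", "ACGTAGGT"]
counts = count_kmers(reads, k=4)
histogram = coverage_histogram(counts)
suspects = weak_kmers(counts, min_count=2)
print(sorted(suspects))
```

In this toy input the three k-mers that overlap the erroneous base each occur once, while all true k-mers occur at least three times, which is the separation that a coverage plot is meant to reveal.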

19 Overlap-layout-consensus
Main entity: read
Relationship between reads: overlap
[Figure: reads numbered 1 to 9, with example sequences ACCTGA, AGCTGA, ACCAGA]

20

21 OLC assembly steps
Calculate overlaps
  Can use BLAST-like method, but finding common k-mers is more efficient
Assemble layout graph, try to simplify graph and remove nodes (reads)
Generate consensus from the alignments between reads (overlaps)
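The overlap step can be illustrated with a naive all-against-all comparison. This sketch only finds exact suffix-prefix matches and compares every pair of reads, which is the slow approach; as noted above, practical assemblers first look for shared k-mers to limit the pairs they compare, and they tolerate mismatches. The function names and the `min_overlap` parameter are assumptions made for the example.

```python
def overlap_length(a, b, min_overlap=3):
    """Length of the longest suffix of a that exactly matches a prefix of b.

    Returns 0 if no match of at least min_overlap bases exists.
    """
    for length in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def overlap_graph(reads, min_overlap=3):
    """Directed overlap graph: (i, j) -> overlap length between read i and read j."""
    edges = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                length = overlap_length(a, b, min_overlap)
                if length:
                    edges[(i, j)] = length
    return edges


reads = ["ACCTGAT", "TGATCCA", "TCCAGGA"]
edges = overlap_graph(reads)
print(edges)
```

For these three reads the graph is a simple chain (read 0 overlaps read 1, which overlaps read 2, each by 4 bases), so the layout is unambiguous; repeats are what turn such chains into tangled graphs.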

22 Some OLC-based assemblers
Celera Assembler with the Best Overlap Graph (CABOG)
  Designed for Sanger sequences, but works with 454 and error-corrected PacBio reads
Newbler, a.k.a. GS de novo Assembler
  Designed for 454 sequences, but works with Sanger reads

23 De Bruijn graphs - concept

24 Converting reads to a De Bruijn graph
[Figure labels: Reads are 7 nt long; Graph with k=3; Deduced sequence (main branch)]
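The conversion of reads into a De Bruijn graph can be sketched as below, using 7 nt reads and k=3 as in the slide. The reads themselves are invented for this example, not taken from the figure. The sketch assumes one common convention (nodes are k-mers, and an edge joins two k-mers that are adjacent in some read and so overlap by k-1 bases); it ignores reverse complements, coverage and error removal.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each k-mer to the set of k-mers that directly follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return graph


def source_nodes(graph):
    """K-mers with outgoing edges but no incoming edge (possible sequence starts)."""
    targets = {t for successors in graph.values() for t in successors}
    return sorted(node for node in graph if node not in targets)


def walk_unbranched(graph, start):
    """Follow edges from start while exactly one successor exists; spell the sequence."""
    sequence, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (node,) = graph[node]
        if node in seen:  # stop on a cycle rather than loop forever
            break
        seen.add(node)
        sequence += node[-1]
    return sequence


# Hypothetical 7 nt reads tiling a 10 nt sequence.
reads = ["ATGGCGT", "TGGCGTG", "GGCGTGC", "GCGTGCA"]
graph = de_bruijn_graph(reads, k=3)
starts = source_nodes(graph)
contig = walk_unbranched(graph, starts[0])
print(contig)
```

Because every 3-mer in this toy sequence is unique, the graph is a single unbranched path and the walk recovers the full sequence; a repeated k-mer would create a branch and the walk would stop there.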

25 DBG implementation in the Velvet assembler

26 Examples of DBG-based assemblers
EULER (P. Pevzner), the first assembler to use DBG
Velvet (D. Zerbino), a popular choice for small genomes
SOAPdenovo (BGI), widely used by BGI and for relatively unstructured assemblies
ALLPATHS-LG, probably the most reliable assembler for large genomes (but with strict input requirements)

27 Anatomy of a WGS Assembly
[Figure labels: Chromosome; STS; STS-mapped scaffolds; Contig; Gap (mean and std. dev. known); Read pair (mates); Consensus; Reads (of several haplotypes); SNPs; External “Reads”]

28 Pairs Give Order & Orientation
Assembly without pairs results in contigs whose order and orientation are not known.
Pairs, especially groups of corroborating ones, link the contigs into scaffolds where the size of gaps is well characterized (mean and std. dev. are known).
[Figure labels: Contig; Consensus (15-30 Kbp); Reads; 2-pair; Scaffold]
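One simple way to see why corroborating pairs give a well-characterized gap is the following sketch. It is an illustration under stated assumptions, not the method of any particular assembler: each linking pair is assumed to come from a library with known mean and standard deviation of insert size, the pairs are treated as independent, and the function name and input layout are invented for the example. Each pair yields one gap estimate (insert size minus the bases of the insert that lie inside the two contigs), and averaging several pairs narrows the uncertainty.

```python
import math
from statistics import mean


def estimate_gap(insert_mean, insert_sd, links):
    """Estimate the gap between two contigs from mate pairs that span it.

    links: one (bases_in_left_contig, bases_in_right_contig) tuple per pair,
    i.e. how much of the insert lies inside each contig.
    Returns (gap estimate, standard error of that estimate).
    """
    if not links:
        raise ValueError("at least one linking pair is required")
    per_pair = [insert_mean - left - right for left, right in links]
    # Assumption: independent pairs, so the error shrinks with sqrt(number of pairs).
    return mean(per_pair), insert_sd / math.sqrt(len(links))


# Hypothetical 3 kb mate-pair library (sd 300 bp) with four corroborating pairs.
links = [(1200, 1300), (1000, 1400), (1500, 1100), (900, 1700)]
gap, gap_se = estimate_gap(3000, 300, links)
print(gap, gap_se)
```

With four pairs the estimate is 475 bp with a standard error of 150 bp, half the uncertainty of a single pair.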

29 Assembly gaps
Sequencing gap: we know the order and orientation of the contigs and have at least one clone spanning the gap
Physical gap: no information known about the adjacent contigs, nor about the DNA spanning the gap
Sequencing gap is "easy"
Physical gap resolution takes more than 1/2 of closure effort
  Multiplex PCR

30 Repeats often split genome into contigs
Contig derived from unique sequences
Reads from multiple repeats collapse into artefactual contig

31 Handling repeats
Repeat detection
  pre-assembly: find fragments that belong to repeats
    statistically (most existing assemblers)
    repeat database (RepeatMasker)
  during assembly: detect "tangles" indicative of repeats (Pevzner, Tang, Waterman 2001)
  post-assembly: find repetitive regions and potential mis-assemblies
    REPuter, RepeatMasker
    "unhappy" mate-pairs (too close, too far, mis-oriented)
Repeat resolution
  find DNA fragments belonging to the repeat
  determine correct tiling across the repeat
  obtain long reads spanning repeats
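The "unhappy" mate-pair check can be sketched as a small classifier. The assumptions here are illustrative only: both reads of a pair map to the same contig, reads have a fixed length, the expected orientation is inward-facing (left read on the + strand, right read on the - strand, after any library-specific flipping), and a pair counts as too close or too far when its span is more than `n_sd` standard deviations from the library mean. The function and parameter names are invented for the example.

```python
def classify_pair(pos1, strand1, pos2, strand2, read_len,
                  insert_mean, insert_sd, n_sd=3):
    """Classify a mapped pair as 'happy', 'too close', 'too far' or 'mis-oriented'.

    pos1 and pos2 are leftmost mapping coordinates; strands are '+' or '-'.
    """
    (left_pos, left_strand), (right_pos, right_strand) = sorted(
        [(pos1, strand1), (pos2, strand2)]
    )
    # Expected layout: left read forward, right read reverse (facing inward).
    if left_strand != "+" or right_strand != "-":
        return "mis-oriented"
    span = right_pos + read_len - left_pos  # outer distance covered by the pair
    if span < insert_mean - n_sd * insert_sd:
        return "too close"
    if span > insert_mean + n_sd * insert_sd:
        return "too far"
    return "happy"


# Hypothetical 3 kb library (sd 300 bp), 100 nt reads: accepted span is 2100-3900 bp.
library = dict(read_len=100, insert_mean=3000, insert_sd=300)
results = [
    classify_pair(1000, "+", 3900, "-", **library),
    classify_pair(1000, "+", 1500, "-", **library),
    classify_pair(1000, "+", 9000, "-", **library),
    classify_pair(1000, "+", 3900, "+", **library),
]
print(results)
```

Clusters of too-close or too-far pairs at one position suggest a collapsed or expanded repeat, while mis-oriented pairs point to inversions or mis-joins.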

32 How good is my assembly?
How much total sequence is in the assembly relative to estimated genome size?
How many pieces, and what is their size distribution?
Are the contigs assembled correctly?
Are the scaffolds connected in the right order / orientation?
How were the repeats handled?
Are all the genes I expected in the assembly?

33 N50: the most common measure of assembly quality
N50 = length of the shortest contig in the smallest set of largest contigs that together make up at least 50% of the total assembly length
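The N50 definition translates directly into code: sort contigs from longest to shortest and add up lengths until half of the total assembly length is reached. The function name and the example lengths are made up for illustration.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total, taken from the longest
    contig downwards, first reaches at least half of the total assembly length."""
    if not contig_lengths:
        raise ValueError("contig_lengths must not be empty")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical assembly: total length 300, so the target is 150.
lengths = [80, 70, 50, 40, 30, 20, 10]
assembly_n50 = n50(lengths)
print(assembly_n50)
```

Here 80 + 70 = 150 reaches half of the total, so the N50 is 70. Note that N50 measures contiguity only; it says nothing about whether the contigs are assembled correctly.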

34 Order and orientation of contigs – more errors in one assembly than in another

35 CEGMA: conserved eukaryotic gene sets
From Ian Korf’s group, UC Davis
Mapping Core Eukaryotic Genes
Coverage is indicative of quality and completeness of assembly

36 Even the best genomes are not perfect

37 There is no such thing as a “perfect” assembler (results from GAGE competition)

38 The computational demands and effectiveness of assemblers are very different

39 Assessing assembly strategies
Assemblathon (UC Davis and UC Santa Cruz)
  Provide challenging datasets to assemble in open competition (synthetic for edition 1, real for edition 2)
  Assess competitor assemblies by many different metrics
  Publish extensive reports
GAGE (U. of Maryland and Johns Hopkins)
  Select datasets associated with known high-quality genomes
  Run a set of open source assemblers with parameter sweeps on these datasets
  Compare the results, publish in scholarly journals with complete documentation of parameters

40 Some advice on running assemblies
Perform parameter sweeps
  Use many different values of key parameters, especially k-mer size for DBG assemblers, and evaluate the output (some assemblers can do this automatically)
Try different subsets of the data
  Sometimes libraries are of poor quality and degrade the quality of the assembly
  Artefacts in the data (e.g. PCR duplicates, homopolymer runs, …) can also badly affect output quality
Try more than one assembler
  There is no such thing as “the best” assembler

41 Genome annotation
A genome sequence is useless without annotation
Three steps in genome annotation:
  Find features not associated with protein-coding genes (e.g. tRNA, rRNA, snRNA, SINE/LINE, miRNA precursors)
  Build models for protein-coding genes, including exons, coding regions, regulatory regions
  Associate biologically relevant information with the genome features and genes

42 Methods for genome annotation
Ab initio, i.e. based on sequence alone
  INFERNAL/Rfam (RNA genes), miRBase (miRNAs), RepeatMasker (repeat families), many gene prediction algorithms (e.g. AUGUSTUS, Glimmer, GeneMark, …)
Evidence-based
  Require transcriptome data for the target organism (the more the better)
  Align cDNA sequences to assembled genome and generate gene models: TopHat/Cufflinks, Scripture

43 Methods for biological annotation
BLAST of gene models against protein databases
  Sequence similarity to known proteins
InterProScan of predicted proteins against databases of protein domains (Pfam, Prosite, HAMAP, PANTHER, …)
Mapping against Gene Ontology terms (BLAST2GO)

44 MAKER, integration framework for genome annotation
MAKER runs many software tools on the assembled genome and collates the outputs See

45 Acknowledgements For this slide deck I “borrowed” figures and slides from many publications, Web pages and presentations by M. Schatz, S. Salzberg, K. Bradnam, K. Krampis, D. Zerbino, J. J. Cook, M. Pop, G. Sutton Thank you!

