Presentation is loading. Please wait.

Presentation is loading. Please wait.

Genome sequence assembly

Similar presentations


Presentation on theme: "Genome sequence assembly"— Presentation transcript:

1

2 Genome sequence assembly
Assembly concepts and methods (some slides courtesy of Mihai Pop)

3 Building a library Break DNA into random fragments (8-10x coverage)
Actual situation

4 Building a library Break DNA into random fragments (8-10x coverage)
Sequence the ends of the fragments Amplify the fragments in a vector Sequence ( ) bases at each end of the fragment

5 Assembling the fragments
NOte that contig orientation/order is not determined

6 Forward-reverse constraints
The sequenced ends are facing towards each other The distance between the two fragments is known (within certain experimental error) Insert F R F R I II R I F II Clone F II R I

7 Building Scaffolds Break DNA into random fragments (8-10x coverage)
Sequence the ends of the fragments Assemble the sequenced ends Build scaffolds We need to determine the relative order/orientation of contigs Using forward-reverse constraints helps

8 Assembly gaps Physical gaps Sequencing gaps
sequencing gap - we know the order and orientation of the contigs and have at least one clone spanning the gap physical gap - no information known about the adjacent contigs, nor about the DNA spanning the gap Sequencing gap is "easy" Physical gap resolution takes more than 1/2 of closure effort Multiplex PCR

9 Unifying view of assembly
Scaffolding

10 Shotgun sequencing statistics

11 Typical contig coverage
Imagine raindrops on a sidewalk

12 Lander-Waterman statistics
L = read length T = minimum detectable overlap G = genome size N = number of reads c = coverage (NL / G) σ = 1 – T/L E(#islands) = Ne-cσ E(island size) = L((ecσ – 1) / c + 1 – σ) contig = island with 2 or more reads

13 Example c N #islands #contigs bases not in any read
Genome size: 1 Mbp Read Length: Detectable overlap: 40 c N #islands #contigs bases not in any read bases not in contigs 1 1,667 655 614 698 367,806 3 5,000 304 250 121 49,787 5 8,334 78 57 20 6,735 8 13,334 7 335

14 Experimental data X coverage # ctgs % > 2X avg ctg size (L-W)
max ctg size # ORFs 1 284 54 1,234 (1,138) 3,337 526 3 597 67 1,794 (4,429) 9,589 1,092 5 548 79 2,495 (21,791) 17,977 1,398 8 495 85 3,294 (302,545) 64,307 1,762 complete 100 1.26 M 1,329 Caveat: numbers based on artificially chopping up the genome of Wolbachia pipientis dMel

15 Read coverage vs. Clone coverage
4 kbp 1 kbp Read coverage = 8X Clone (insert) coverage = 16 2X coverage in BAC-ends implies 100x coverage by BACs (1 BAC clone = approx. 100kbp)

16 Assembly paradigms Overlap-layout-consensus
greedy (TIGR Assembler, phrap, CAP3...) graph-based (Celera Assembler, Arachne) Eulerian path (especially useful for short read sequencing)

17 TIGR Assembler/phrap Greedy Build a rough map of fragment overlaps
Pick the largest scoring overlap Merge the two fragments Repeat until no more merges can be done

18 Overlap-layout-consensus
Main entity: read Relationship between reads: overlap 1 4 7 2 5 8 3 6 9 1 2 3 4 5 6 7 8 9 ACCTGA AGCTGA ACCAGA 1 1 2 3 1 2 3 2 3 2 3 1 3 1 1 3 2 2

19 Paths through graphs and assembly
Hamiltonian circuit: visit each node (city) exactly once, returning to the start Genome

20 Implementation details

21 Overlap between two sequences
overlap (19 bases) overhang (6 bases) …AGCCTAGACCTACAGGATGCGCGGACACGTAGCCAGGAC CAGTACTTGGATGCGCTGACACGTAGCTTATCCGGT… overhang % identity = 18/19 % = 94.7% overlap - region of similarity between regions overhang - un-aligned ends of the sequences The assembler screens merges based on: length of overlap % identity in overlap region maximum overhang size. when a pair of sequences is considered,the two sequences are merged only if they match the criteria

22 All pairs alignment Needed by the assembler
Try all pairs – must consider ~ n2 pairs Smarter solution: only n x coverage (e.g. 8) pairs are possible Build a table of k-mers contained in sequences (single pass through the genome) Generate the pairs from k-mer table (single pass through k-mer table) k-mer

23

24 REPEATS

25 RptA RptB 3 6 9 12 2 5 8 11 1 4 7 10 13 6 4 8 10 2 12 1 13 3 11 5 7 9

26 Non-repetitive overlap graph
1 2 3 4 5,9 7 8 6,10 11 12 13

27 Handling repeats Repeat detection Repeat resolution
pre-assembly: find fragments that belong to repeats statistically (most existing assemblers) repeat database (RepeatMasker) during assembly: detect "tangles" indicative of repeats (Pevzner, Tang, Waterman 2001) post-assembly: find repetitive regions and potential mis-assemblies. Reputer, RepeatMasker "unhappy" mate-pairs (too close, too far, mis-oriented) Repeat resolution find DNA fragments belonging to the repeat determine correct tiling across the repeat

28 Statistical repeat detection
Significant deviations from average coverage flagged as repeats. - frequent k-mers are ignored - “arrival” rate of reads in contigs compared with theoretical value (e.g., 800 bp reads & 8x coverage - reads "arrive" every 100 bp) Problem 1: assumption of uniform distribution of fragments - leads to false positives non-random libraries poor clonability regions Problem 2: repeats with low copy number are missed - leads to false negatives

29 Mis-assembled repeats
excision collapsed tandem rearrangement

30 An assembly puzzle: contradictory data (discovered after publication)
ribosomal RNA repeats, Ames Porton strain “chimeras” “mates” How do you align the green pieces?

31 Puzzle solution Reference: Ames ‘ancestor’ strain
Ames Porton Down strain Tandem duplication

32 Anthrax attack strain history
1981 “Ames” isolate Ft Detrick (Ames ancestor) 1982 Lab B Lab C Lab D Porton Down ? ? ? Plasmids cured ? Attack Strain Porton Strain UC Berkeley Victim 2001 1998 2001 Florida isolate Porton 1 Porton 2

33 Probability of base-calling error
GBX(g) b. anthracis SNP CS-1 Cut of assembly from GBX0130.contig (11 bases) Cut at ungapped consensus offset (from 1), +/- 5 positions: Cut at gapped consensus offset (from 1) 11 positions Ungapped consensus TGAATGCACAC Gapped consensus TGAATGCACAC T G A A T G C A C A C Covering reads: GBZEI27TF TGAATGCACAC GBXEZ08TR TGAATGCACAC GBZDA09TF TGAATGCACAC Summary info: P-value (10^q) Quality Class Coverage depth Homogeneity Cut of assembly 6264 from GBA0117.contig (11 bases) Cut at ungapped consensus offset (from 1), +/- 5 positions: Cut at gapped consensus offset (from 1) 11 positions Ungapped consensus TGAATACACAC Gapped consensus TGAATACACAC T G A A T A C A C A C GBIFW80TF TGAATA-ACAC GBICA33TR TGAATACACAC GBIFQ32TR TGAATACACAC GBICH40TF TGAATACACAC GBICU19TR TGAATACACAC P-value (10^q) Quality Class Coverage depth GBA(a) Probability of base-calling error Not shown


Download ppt "Genome sequence assembly"

Similar presentations


Ads by Google