Genome Assembly: a brief introduction


1 Genome Assembly: a brief introduction
Slides courtesy of Mihai Pop, Art Delcher, and Steven Salzberg

2

3 Mate-Pair Shotgun DNA Sequencing
Figure labels: DNA target sample; shear and size-select (e.g., 10 Kbp ± 8% std. dev.); clone and end-sequence (automated); end reads / mate pairs of 550 bp from a 10,000 bp insert.

4 Shotgun DNA Sequencing (Technology)
Figure labels: DNA target sample; shear; size-select (e.g., 10 Kbp ± 8% std. dev.); ligate into vector and clone; sequence from the primers to obtain 550 bp end reads (mates).

5 Whole Genome Shotgun Sequencing
Collect 10X sequence coverage in a 1-to-1 ratio of two types of read pairs, short (2 Kbp) and long (10 Kbp): ~35 million reads for Human.
Collect another 20X in clone coverage from 50 Kbp end-sequence pairs (BAC 5’ and BAC 3’ ends): ~1.2 million pairs for Human.
Early simulations showed that if repeats were considered black boxes, one could still cover 99.7% of the genome unambiguously.
+ single highly automated process
+ only three library constructions
- assembly is much more difficult

6 Sequencing Factory

7 Celera’s Sequencing Factory (circa 2001)
300 ABI 3700 DNA sequencers
50 production staff
20,000 sq. ft. of wet lab
20,000 sq. ft. of sequencing space
800 tons of A/C (160,000 cfm)
$1 million / year for electrical service
$10 million / month for reagents

8 Human Data (April 2000)
Collected 27.27 million reads = 5.11X coverage
21.04 million are paired (77%) = 10.52 million pairs
Insert size   Pairs     True*   Std. dev.
2 Kbp         5.045 M   98.6%   <6%
10 Kbp        4.401 M   98.6%   <8%
50 Kbp        1.071 M   90.0%   <15%
* validated against finished Chrom. 21 sequence
The clones cover the genome 38.7X.
Data is from 5 individuals (one at roughly 3X, 4 others at 0.5X).

9 Pairs Give Order & Orientation
Assembly without pairs results in contigs whose order and orientation are not known.
Pairs, especially groups of corroborating ones, link the contigs into scaffolds in which the size of each gap is well characterized (its mean and std. dev. are known).
Figure labels: reads; contig consensus (15-30 Kbp); 2-pair; scaffold.
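To illustrate why corroborating pairs characterize a gap well, here is a minimal sketch of a gap estimate from mate pairs that link two contigs. It assumes contig A is followed by contig B in the same orientation, that each link is given as the start of the forward read in A and the end of its mate in B (offsets within each contig), and that pairs from one library are independent. The function name and coordinate convention are hypothetical, not taken from any particular assembler.

```python
import math

def estimate_gap(len_a, links, insert_mean, insert_sd):
    """Estimate the gap between contig A and the contig B that follows it.

    links: list of (pos_in_a, pos_in_b), where pos_in_a is the start of the
    forward read in A and pos_in_b is the end of its mate in B, both as
    offsets from the start of their own contig.
    """
    # Each insert spans: tail of A after the read start + gap + head of B.
    estimates = [
        insert_mean - (len_a - pos_a) - pos_b
        for pos_a, pos_b in links
    ]
    mean_gap = sum(estimates) / len(estimates)
    # With independent pairs, the error of the mean shrinks with the
    # square root of the number of corroborating pairs.
    sd_gap = insert_sd / math.sqrt(len(estimates))
    return mean_gap, sd_gap

# Two pairs from a 2 Kbp library with 6% std. dev. (120 bp).
gap, sd = estimate_gap(20_000, [(19_000, 500), (18_500, 100)], 2_000, 120)
print(gap, sd)
```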

10 Anatomy of a WGS Assembly
Figure labels: chromosome; STS; STS-mapped scaffolds; contig; gap (mean & std. dev. known); read pair (mates); consensus; reads (of several haplotypes); SNPs; external “reads”.

11 Assembly gaps
Sequencing gap: we know the order and orientation of the contigs and have at least one clone spanning the gap.
Physical gap: no information is known about the adjacent contigs, nor about the DNA spanning the gap.
Sequencing gap is "easy".
Physical gap resolution takes more than 1/2 of the closure effort.
Multiplex PCR

12 Shotgun sequencing statistics

13 Typical contig coverage
Imagine raindrops on a sidewalk

14 Lander-Waterman statistics
L = read length
T = minimum detectable overlap
G = genome size
N = number of reads
c = coverage (NL / G)
σ = 1 - T/L
$E(\#\text{islands}) = N e^{-c\sigma}$
$E(\text{island size}) = L\left(\frac{e^{c\sigma} - 1}{c} + 1 - \sigma\right)$
contig = island with 2 or more reads

15 Example
Genome size: 1 Mbp
Read Length:
Detectable overlap: 40
c   N        #islands   #contigs   bases not in any read   bases not in contigs
1   1,667    655        614        698                     367,806
3   5,000    304        250        121                     49,787
5   8,334    78         57         20                      6,735
8   13,334   7                                             335
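The Lander-Waterman expectations are easy to evaluate directly. The sketch below implements the two formulas from the previous slide. The read length of 600 bp is an assumption: the transcript does not give it, but it is the value implied by N = 1,667 reads at c = 1 on a 1 Mbp genome. The function name is hypothetical.

```python
import math

def lander_waterman(G, L, T, c):
    """Return (N, expected #islands, expected island size in bp)."""
    N = c * G / L                       # reads needed for coverage c
    sigma = 1 - T / L
    islands = N * math.exp(-c * sigma)
    island_size = L * ((math.exp(c * sigma) - 1) / c + 1 - sigma)
    return N, islands, island_size

# Assumed read length L = 600 (not stated in the transcript).
for c in (1, 3, 5, 8):
    N, islands, size = lander_waterman(G=1_000_000, L=600, T=40, c=c)
    print(f"c={c}  N={N:,.0f}  islands={islands:,.1f}  island size={size:,.0f} bp")
```

With these inputs the expected island counts truncate to the #islands column of the example (655, 304, 78, 7), which supports the assumed read length.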

16 Experimental data
X coverage   # ctgs   % > 2X   avg ctg size (L-W)   max ctg size   # ORFs
1            284      54       1,234 (1,138)        3,337          526
3            597      67       1,794 (4,429)        9,589          1,092
5            548      79       2,495 (21,791)       17,977         1,398
8            495      85       3,294 (302,545)      64,307         1,762
complete              100      1.26 M                              1,329
Caveat: numbers based on artificially chopping up the genome of Wolbachia pipientis dMel

17 Assembly paradigms
Overlap-layout-consensus
- greedy (TIGR Assembler, phrap, CAP3...)
- graph-based (Celera Assembler, Arachne)
Eulerian path (especially useful for short read sequencing)

18 TIGR Assembler/phrap
Greedy
- Build a rough map of fragment overlaps
- Pick the highest-scoring overlap
- Merge the two fragments
- Repeat until no more merges can be done
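The greedy loop above can be sketched in a few lines. This is a toy illustration, not how TIGR Assembler or phrap are implemented: it scores an overlap simply by the length of an exact suffix-prefix match, ignores sequencing errors and reverse complements, and recomputes all overlaps on every round. The function names and the `min_len` parameter are hypothetical.

```python
def overlap_len(a, b, min_len=3):
    """Length of the longest exact suffix of a that is a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    reads = list(reads)
    while True:
        # Pick the largest overlap among all ordered pairs of fragments.
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap_len(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return reads          # no more merges can be done
        # Merge the two fragments and put the result back in the pool.
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]

contigs = greedy_assemble(["ACCTGAT", "GATTACA", "ACAGGT"])
print(contigs)
```

Because each merge is chosen locally, a repeat that produces a high-scoring false overlap can be merged before the true one, which is the main weakness of the greedy approach.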

19 Overlap-layout-consensus
Main entity: read
Relationship between reads: overlap
Figure labels: reads numbered 1-9; example sequences ACCTGA, AGCTGA, ACCAGA.

20 Paths through graphs and assembly
Hamiltonian circuit: visit each node (city) exactly once, returning to the start.
Figure label: genome.

21 Implementation details

22 Overlap between two sequences
…AGCCTAGACCTACAGGATGCGCGGACACGTAGCCAGGAC
       CAGTACTTGGATGCGCTGACACGTAGCTTATCCGGT…
overlap (19 bases), overhang (6 bases), % identity = 18/19 = 94.7%
overlap: region of similarity between the sequences
overhang: unaligned ends of the sequences
The assembler screens merges based on:
- length of overlap
- % identity in overlap region
- maximum overhang size
When a pair of sequences is considered, the two sequences are merged only if they meet these criteria.
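The three screening criteria can be computed for the example on this slide. The sketch below assumes an ungapped overlap whose start positions in the two sequences are already known, and it only sees the bases visible in the transcript (the ellipses hide the rest). The function names and the thresholds passed to `passes_screen` are hypothetical values for illustration, not the settings of any particular assembler.

```python
def overlap_stats(a, b, a_start, b_start, length):
    """Describe an ungapped overlap of a[a_start:] against b[b_start:]."""
    seg_a = a[a_start:a_start + length]
    seg_b = b[b_start:b_start + length]
    matches = sum(x == y for x, y in zip(seg_a, seg_b))
    # Overhang: unaligned bases on a side where BOTH sequences continue.
    left = min(a_start, b_start)
    right = min(len(a) - a_start - length, len(b) - b_start - length)
    return {
        "length": length,
        "identity": matches / length,
        "left_overhang": left,
        "right_overhang": right,
    }

def passes_screen(stats, min_length, min_identity, max_overhang):
    return (stats["length"] >= min_length
            and stats["identity"] >= min_identity
            and max(stats["left_overhang"], stats["right_overhang"]) <= max_overhang)

a = "AGCCTAGACCTACAGGATGCGCGGACACGTAGCCAGGAC"
b = "CAGTACTTGGATGCGCTGACACGTAGCTTATCCGGT"
stats = overlap_stats(a, b, a_start=14, b_start=8, length=19)
print(stats)
# Hypothetical thresholds chosen only so the toy example passes.
print(passes_screen(stats, min_length=15, min_identity=0.94, max_overhang=10))
```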

23 All pairs alignment
Needed by the assembler
Try all pairs: must consider ~n^2 pairs
Smarter solution: only n × coverage (e.g., 8) pairs are possible
- Build a table of k-mers contained in sequences (single pass through the genome)
- Generate the pairs from the k-mer table (single pass through the k-mer table)
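A minimal sketch of the k-mer table idea follows. It assumes reads are plain strings, uses exact k-mers only, and ignores reverse complements and over-frequent k-mers, all of which a real overlapper has to handle. The function name is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(reads, k):
    """Return the set of read-index pairs that share at least one k-mer."""
    # Pass 1: table from each k-mer to the reads that contain it.
    table = defaultdict(set)
    for read_id, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            table[seq[i:i + k]].add(read_id)
    # Pass 2: every pair of reads listed under the same k-mer is a candidate.
    pairs = set()
    for ids in table.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

reads = ["ACGTACGT", "TACGTTTT", "GGGGCCCC"]
print(candidate_pairs(reads, k=4))
```

Only the candidate pairs are then passed to the (much more expensive) alignment step, so unrelated reads such as the third one here are never compared.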

24

25 Assembly Pipeline
Pipeline stages: Trim & Screen, Overlapper, Unitiger, Scaffolder, Repeat Rez I, II
Overlapper: find all overlaps ≥ 40 bp allowing 6% mismatch.
Figure: an overlap between reads A and B implies either a TRUE overlap or a REPEAT-INDUCED one.

26 Assembly Pipeline
Pipeline stages: Trim & Screen, Overlapper, Unitiger, Scaffolder, Repeat Rez I, II
Unitiger: compute all overlap-consistent sub-assemblies, called unitigs (uniquely assembled contigs).

27 OVERLAP GRAPH
Edge types between reads A and B: regular dovetail, prefix dovetail, suffix dovetail.
E.g.: edges are annotated with deltas of overlaps.

28 The Unitig Reduction
1. Remove “Transitively Inferrable” Overlaps
Figure labels: A, C.
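As a rough illustration of this first reduction step, the sketch below drops an edge A→C whenever some read B gives a two-step path A→B→C. It is a simplification under stated assumptions: the graph is a plain dictionary of successor sets, only two-step paths are checked, and the overlap lengths that a real unitigger would compare for consistency are ignored. The function name is hypothetical.

```python
def remove_transitive_edges(graph):
    """graph maps each read to the set of reads it overlaps on its right."""
    reduced = {node: set(succ) for node, succ in graph.items()}
    for a, succ in graph.items():
        for b in succ:
            for c in graph.get(b, ()):
                # A->C can be inferred from A->B and B->C, so drop it.
                if c in succ and c != b:
                    reduced[a].discard(c)
    return reduced

graph = {"A": {"B", "C"}, "B": {"C"}, "C": set()}
print(remove_transitive_edges(graph))
```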

29 The Unitig Reduction
2. Collapse “Unique Connector” Overlaps
Figure labels: A, B; numeric edge labels 412, 352, 45.
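The second reduction step can be sketched as chain compression. This assumes a "unique connector" is an edge that is the only edge leaving its source and the only edge entering its target, which is one reading of the slide. The graph representation and function name are hypothetical, and a component that is a pure cycle is not reported by this sketch.

```python
def collapse_unique_connectors(graph):
    """Group reads into chains joined by unique-connector edges.

    graph maps each read to the set of reads it overlaps on its right.
    """
    preds = {node: set() for node in graph}
    for a, succ in graph.items():
        for b in succ:
            preds.setdefault(b, set()).add(a)

    def unique_next(a):
        succ = graph.get(a, set())
        if len(succ) == 1:
            (b,) = succ
            if len(preds[b]) == 1:
                return b
        return None

    def unique_prev(b):
        if len(preds[b]) == 1:
            (a,) = preds[b]
            if len(graph.get(a, set())) == 1:
                return a
        return None

    chains = []
    for node in preds:
        if unique_prev(node) is not None:
            continue              # interior of a chain, not its start
        chain = [node]
        while (nxt := unique_next(chain[-1])) is not None:
            chain.append(nxt)
        chains.append(chain)
    return chains

graph = {"A": {"B"}, "B": {"C"}, "C": {"D", "E"}, "D": set(), "E": set()}
print(collapse_unique_connectors(graph))
```

Each returned chain corresponds to one candidate unitig; the branch at C ends the first chain because the assembler cannot tell from overlaps alone which way to continue.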

30 Identifying Unique DNA Stretches
The discriminator statistic is the log-odds ratio of the probability that a unitig is unique DNA versus 2-copy DNA, based on the arrival intervals of its reads.
Figure: distributions of the statistic for unique and for repetitive unitigs; below -10 is “definitely repetitive”, above +10 is “definitely unique”, and in between is “don’t know”.
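One way to make the log-odds ratio concrete is to model read arrivals as a Poisson process. If a unique unitig of a given length is expected to contain m reads, a collapsed 2-copy unitig is expected to contain 2m, and the log of the ratio of the two Poisson probabilities of observing k reads simplifies to m - k ln 2. The sketch below uses that derivation. The use of natural logarithms, the function names, and applying the slide's ±10 cutoffs on this scale are assumptions of the example.

```python
import math

def discriminator(unitig_len, reads_in_unitig, total_reads, genome_len):
    """Log-odds (natural log) that a unitig is unique rather than a
    collapsed 2-copy repeat, assuming Poisson read arrivals."""
    expected = unitig_len * total_reads / genome_len  # reads expected if unique
    # ln[ Pois(k; m) / Pois(k; 2m) ] = m - k * ln 2
    return expected - reads_in_unitig * math.log(2)

def classify(score, threshold=10):
    if score >= threshold:
        return "definitely unique"
    if score <= -threshold:
        return "definitely repetitive"
    return "don't know"

# Hypothetical numbers: 1M reads over a 100 Mbp genome, 10 Kbp unitig.
for k in (100, 200):
    score = discriminator(10_000, k, 1_000_000, 100_000_000)
    print(k, round(score, 1), classify(score))
```

Short unitigs contain too few reads to push the score past either cutoff, which is why they tend to land in the "don't know" band.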

31 Assembly Pipeline
Pipeline stages: Trim & Screen, Overlapper, Unitiger, Scaffolder, Repeat Rez I, II
Scaffolder: scaffold U-unitigs with confirmed pairs.
Figure label: mated reads.

32 Assembly Pipeline
Pipeline stages: Trim & Screen, Overlapper, Unitiger, Scaffolder, Repeat Rez I, II
Repeat Rez I, II: fill repeat gaps with doubly anchored positive unitigs.
Figure label: Unitig>0.

33 REPEATS

34 Handling repeats
Repeat detection
- pre-assembly: find fragments that belong to repeats
  - statistically (most existing assemblers)
  - repeat database (RepeatMasker)
- during assembly: detect "tangles" indicative of repeats (Pevzner, Tang, Waterman 2001)
- post-assembly: find repetitive regions and potential mis-assemblies
  - REPuter, RepeatMasker
  - "unhappy" mate-pairs (too close, too far, mis-oriented)
Repeat resolution
- find DNA fragments belonging to the repeat
- determine correct tiling across the repeat
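The three kinds of "unhappy" mate pair can be told apart with a simple check once both reads are placed in the same contig or scaffold. The sketch below assumes each read is described by a coordinate and a strand, that a happy pair has its forward read to the left of its reverse mate, and that "too close" and "too far" mean more than `n_sd` standard deviations from the library mean. The function name and the default of 3 standard deviations are hypothetical.

```python
def classify_mate_pair(left_pos, left_strand, right_pos, right_strand,
                       insert_mean, insert_sd, n_sd=3):
    """Classify a placed mate pair; left_pos must be <= right_pos."""
    # Happy mates point toward each other: '+' on the left, '-' on the right.
    if (left_strand, right_strand) != ("+", "-"):
        return "mis-oriented"
    span = right_pos - left_pos
    if span < insert_mean - n_sd * insert_sd:
        return "too close"
    if span > insert_mean + n_sd * insert_sd:
        return "too far"
    return "happy"

# 2 Kbp library with a 120 bp standard deviation.
print(classify_mate_pair(1_000, "+", 3_000, "-", 2_000, 120))
print(classify_mate_pair(1_000, "+", 1_500, "-", 2_000, 120))
print(classify_mate_pair(1_000, "+", 4_000, "-", 2_000, 120))
print(classify_mate_pair(1_000, "+", 3_000, "+", 2_000, 120))
```

A cluster of unhappy pairs in one region is more informative than a single one, since individual pairs can be wrong for library reasons alone.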

35 Statistical repeat detection
Significant deviations from average coverage are flagged as repeats.
- frequent k-mers are ignored
- “arrival” rate of reads in contigs is compared with the theoretical value (e.g., 800 bp reads & 8x coverage: reads "arrive" every 100 bp)
Problem 1: the assumption of a uniform distribution of fragments leads to false positives
- non-random libraries
- poor clonability regions
Problem 2: repeats with low copy number are missed, which leads to false negatives
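The arrival-rate comparison can be written out as a short calculation. The expected spacing between read starts is the read length divided by the coverage, which gives the 100 bp of the slide's example. The z-score below is one simple way to measure deviation, assuming Poisson-distributed read counts (variance equal to the mean). It is an illustrative choice, not the statistic used by any particular assembler, and the function names are hypothetical.

```python
import math

def expected_arrival_interval(read_len, coverage):
    """Average distance between consecutive read starts, in bp."""
    return read_len / coverage

def coverage_z_score(contig_len, n_reads, read_len, coverage):
    """Deviation of a contig's read count from the Poisson expectation."""
    expected = contig_len / expected_arrival_interval(read_len, coverage)
    return (n_reads - expected) / math.sqrt(expected)

print(expected_arrival_interval(800, 8))
# A 10 Kbp contig holding twice the expected number of reads.
print(coverage_z_score(10_000, 200, 800, 8))
```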

36 Mis-assembled repeats
Figure panels: excision; collapsed tandem; rearrangement.

