Published by Ross Matthews. Modified over 8 years ago.
1
Genome Research 12:1 (2002), 177-189
2
Assembly algorithm outline
● Input and trimming
● Overlap detection
● Error correction
● Evaluation of alignments
● Identification of paired pairs
● Contig assembly
● Identification of repeat contigs
● Creation of scaffolds
● Filling gaps in scaffolds
● Consensus computation
3
Trimming
● Find the longest contiguous sequence with error less than 5% (using quality values)
● Trim further if any base with Q < 10 is within 12 bases of either end
● Throw away the read if its length is < 50 after trimming
● Identify vector by aligning with E. coli and known cloning vector sequences
● Remove vector from the beginning and/or end of the read
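The quality-based trimming rules can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the ARACHNE code: it assumes the "error" of a stretch is the mean per-base error probability 10^(-Q/10), it removes a low-quality base near an end by trimming just past it, and it leaves out vector removal entirely. The name trim_read and its parameters are hypothetical.

```python
def trim_read(quals, max_error=0.05, low_q=10, end_window=12, min_len=50):
    """Return (start, end) of the kept region (end exclusive), or None if discarded."""
    n = len(quals)
    # Prefix sums of per-base error probabilities (assumed model: 10^(-Q/10)).
    prefix = [0.0]
    for q in quals:
        prefix.append(prefix[-1] + 10 ** (-q / 10))

    # Step 1: longest window whose mean error is below max_error (brute force, O(n^2)).
    start, end = 0, 0
    for i in range(n):
        for j in range(n, i + (end - start), -1):  # only windows longer than the best so far
            if (prefix[j] - prefix[i]) / (j - i) < max_error:
                start, end = i, j
                break

    # Step 2: trim past any base with Q < low_q within end_window bases of either end.
    changed = True
    while changed:
        changed = False
        for i in range(start, min(start + end_window, end)):
            if quals[i] < low_q:
                start, changed = i + 1, True
        for i in range(max(end - end_window, start), end):
            if quals[i] < low_q:
                end, changed = i, True
                break

    # Step 3: discard reads shorter than min_len after trimming.
    return (start, end) if end - start >= min_len else None
```

The brute-force window search is quadratic in read length, which is tolerable for reads of a few hundred bases but would need a smarter scan for a production assembler.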
4
Overlapping
● 24-mer indexing
● Index only 1/2 of all k-mers
  – for (x1,x2) where x1 is the reverse complement of x2, store whichever k-mer is alphabetically first
● Exclude high-copy k-mers
● Create read pairs for all reads that share one or more k-mers
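The indexing scheme above can be illustrated with a small Python sketch. It stores each k-mer under the alphabetically first of itself and its reverse complement, skips k-mers seen in too many reads, and pairs up reads that share a k-mer. The function names and the max_copies threshold are hypothetical, and counting copies per read (rather than per occurrence) is a simplifying assumption.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def canonical(kmer):
    """Return whichever of the k-mer and its reverse complement sorts first."""
    rc = reverse_complement(kmer)
    return kmer if kmer <= rc else rc

def candidate_pairs(reads, k=24, max_copies=100):
    """Return the set of read-index pairs (i, j), i < j, sharing at least one k-mer."""
    index = defaultdict(set)
    for read_id, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            index[canonical(seq[i:i + k])].add(read_id)

    pairs = set()
    for ids in index.values():
        if len(ids) > max_copies:  # high-copy k-mer: likely repeat, skip it
            continue
        ordered = sorted(ids)
        for a in range(len(ordered)):
            for b in range(a + 1, len(ordered)):
                pairs.add((ordered[a], ordered[b]))
    return pairs
```

Because both strands map to the same canonical key, a read and the reverse complement of an overlapping read still land in the same index bucket.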
5
Error correction in reads
● Correct errors using multiple alignment

TAGATTACACAGATTACTGA
TAG-TTACACAGATTATTGA
TAGATTACACAGATTACTGA

Substitution column (base: quality):
  before: C: 20, C: 35, T: 30, C: 35, C: 40
  after:  C: 20, C: 35, C: 0, C: 35, C: 40

Deletion column (base: quality):
  before: A: 15, A: 25, A: 40, A: 25, -
  after:  A: 15, A: 25, A: 40, A: 25, A: 0

● Score alignments
● Accept alignments with good scores
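The column values on this slide suggest that a disagreeing or missing base is replaced by the column's winner and given quality 0. A minimal Python sketch of that idea follows. It assumes the winner is simply the base with the largest summed quality, which is a simplification chosen for illustration; the function name correct_column is hypothetical.

```python
def correct_column(column):
    """column: list of (base, quality) pairs from one alignment column ('-' marks a gap).

    Returns the corrected column: entries that disagree with the
    quality-weighted winner are replaced by (winner, 0).
    """
    support = {}
    for base, qual in column:
        support[base] = support.get(base, 0) + qual
    winner = max(support, key=support.get)
    return [(base, qual) if base == winner else (winner, 0) for base, qual in column]

# The two columns shown on the slide.
substitution = [("C", 20), ("C", 35), ("T", 30), ("C", 35), ("C", 40)]
deletion = [("A", 15), ("A", 25), ("A", 40), ("A", 25), ("-", 0)]
print(correct_column(substitution))
print(correct_column(deletion))
```

Setting the corrected base's quality to 0 keeps it from contributing evidence of its own in later steps.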
6
Evaluation of alignments (pairs)
● Penalty (P) for each mismatching base is the minimum of:
  – the quality scores of the two aligned bases
  – the quality scores of the bases on their immediate left and right
● Penalty score is then 10^(P/10)
● Discard pairs with penalty score > 100
7
Evaluation of alignments (pairs)
● Example:

30 10 20  5 10 30 40 40 35
 A  A  G  T  G  T  C  T  A
 A  A  G  T  G  C  C  T  A
30 10 20 15 20 10 40 40 25

● P = min(10,30) = 10 because of the T-C mismatch
● Penalty score is 10^(P/10) = 10^(10/10) = 10
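The penalty rule can be checked against the example with a short Python sketch. It assumes an ungapped alignment of two equal-length reads and assumes that the penalties of all mismatching bases are summed before comparing with the threshold of 100; both are simplifications, and the function names are hypothetical.

```python
def mismatch_penalty(qual_a, qual_b, pos):
    """10^(P/10), where P is the minimum quality over the two aligned bases
    and their immediate left and right neighbours in both reads."""
    lo = max(pos - 1, 0)
    hi = min(pos + 1, len(qual_a) - 1)
    p = min(min(qual_a[lo:hi + 1]), min(qual_b[lo:hi + 1]))
    return 10 ** (p / 10)

def alignment_penalty(seq_a, qual_a, seq_b, qual_b):
    """Sum of mismatch penalties over an ungapped alignment (summation is an assumption)."""
    return sum(
        mismatch_penalty(qual_a, qual_b, i)
        for i, (x, y) in enumerate(zip(seq_a, seq_b))
        if x != y
    )

# The example from the slide: a single T-C mismatch at the sixth base.
score = alignment_penalty(
    "AAGTGTCTA", [30, 10, 20, 5, 10, 30, 40, 40, 35],
    "AAGTGCCTA", [30, 10, 20, 15, 20, 10, 40, 40, 25],
)
keep_pair = score <= 100
```

Because the penalty grows exponentially with P, a single mismatch between two high-quality bases (P = 30 gives 1000) is enough to reject a pair, while many low-quality mismatches are tolerated.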
8
Serafim Batzoglou et al. Genome Res. 2002; 12: 177-189 Figure 2 Using paired pairs of overlaps to merge reads
9
Contig assembly
● Paired pairs form the initial contigs
● Next, mark repeat boundaries before doing further merging
● Only merge read pairs when they do not cross a repeat boundary
10
Serafim Batzoglou et al. Genome Res. 2002; 12: 177-189
A: Merging across a repeat boundary may cause misassembly. Here, A may be assembled next to D.
B: A potential repeat boundary identified by the divergence of reads x and y, both of which overlap r.
C: Contigs on left and right are created by merging reads up to a repeat boundary. The repeat region would also create a contig, whose coverage would be twice as deep.
D: Sequence errors may cause artificial breaks. Read r “dominates” read y because the neighbors of y are all neighbors of r. “Dominated” reads are eliminated.
11
Serafim Batzoglou et al. Genome Res. 2002; 12: 177-189 Figure 4 Detection of repeat contigs. Contig R is linked to contigs A and B to the right. The distances estimated between R and A and R and B are such that A and B cannot be positioned without substantial overlap between them. If there is no corresponding detected overlap between A and B (if their reads do not overlap), then R is probably a repeat linking to two unique regions to the right.
12
Serafim Batzoglou et al. Genome Res. 2002; 12: 177-189 Figure 5. Scaffold creation and gap filling. Annotation on the slide: these are usually repetitive contigs.
13
Simulation of WGS Data
● Reads selected from a target genome in random locations
● Errors created using realistic quality values and errors from real reads taken from finished BACs done at MIT/Whitehead
● No cloning bias (no large gaps in coverage)
● No long stretches of low quality within reads
● Two data sets created, at 10.3X and 5.15X
● Two libraries, 4kb and 40kb, with 20:1 ratio
14
Making a Simulated Read
Simulated reads have error patterns taken from random real reads.
[Figure: an artificial shotgun read and a real read go into the ERRORIZER, which produces a simulated read.]
18
Human 22, Results of Simulations

Plasmid/Cosmid cov   10X / 0.5X   5X / 0.5X   3X / 0X
N50 contig           353 Kb       15 Kb       2.7 Kb
Mean contig          142 Kb       10.6 Kb     2.0 Kb
N50 scaffold         3 Mb                     4.1 Kb
Avg base qual        41           32          26
% > 2 kb             97.3         91.1        67
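For readers unfamiliar with the N50 statistic reported here: it is the length L such that contigs (or scaffolds) of length at least L together contain at least half of the assembled bases. A minimal Python sketch, with a hypothetical function name:

```python
def n50(lengths):
    """Largest length L such that pieces of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no lengths given")
```

Unlike the mean, N50 is dominated by the long pieces, which is why the table reports both.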
19
Neurospora crassa Genome (Real Data)
40 Mb genome, shotgun sequencing complete (Whitehead Genome Ctr)
Coverage: 1705 contigs, 368 scaffolds, 1% uncovered (of finished BACs)
Evaluated assembly using 1.5Mb of finished BACs
Efficiency: Time: 20 hr, Memory: 9 Gb
Accuracy: < 3 misassemblies compared with 1 Gb of finished sequence
Errors/10^6 letters: Subst. 260, Indel: 164
20
Serafim Batzoglou et al. Genome Res. 2002; 12: 177-189 Figure 6 Types of misassemblies
22
Serafim Batzoglou et al. Genome Res. 2002; 12: 177-189 Figure 8 Merging k-mer hits in the alignment module
23
Serafim Batzoglou et al. Genome Res. 2002; 12: 177-189 Figure 9. Detection of chimeric reads. Reads l1, l2, l3, r1, r2, and r3, and the absence of a read n (having long overlaps on both sides of a point x) suggest that read c may be chimeric, consisting of the juxtaposition of two disparate genomic segments: one corresponding to the part of c before x, and one corresponding to the part of c after x. We call x the point of chimerism of c. Note that reads l3 and r3 extend slightly beyond x, as often happens for real chimeric reads.
24
Serafim Batzoglou et al. Genome Res. 2002; 12: 177-189 Figure 10. Contig assembly. If (a,b) and (a,c) overlap, then (b,c) are expected to overlap. Moreover, one can calculate that shift(b,c) = shift(a,c) - shift(a,b). We detect a repeat boundary toward the right of read a, if there is no overlap (b,c), nor any path of reads x1,..., xk such that (b,x1), (x1,x2),..., (xk,c) are all overlaps, and shift(b,x1) +... + shift(xk,c) approx shift(a,c) - shift(a,b).
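The shift-consistency test in the Figure 10 caption can be sketched in Python. This is an illustration under assumptions, not the published implementation: overlaps are stored in a dictionary mapping an ordered read pair (x, y) to shift(x, y), the path search is a depth-limited traversal, and the tolerance and max_hops parameters and both function names are hypothetical.

```python
def has_consistent_path(shift, b, c, target, tolerance=10, max_hops=3):
    """True if some path b -> x1 -> ... -> c of overlaps has total shift near target."""
    stack = [(b, 0, 0, frozenset([b]))]
    while stack:
        node, total, hops, seen = stack.pop()
        for (x, y), s in shift.items():
            if x != node or y in seen:
                continue
            if y == c:
                if abs(total + s - target) <= tolerance:
                    return True
            elif hops + 1 < max_hops:
                stack.append((y, total + s, hops + 1, seen | {y}))
    return False

def repeat_boundary(a, b, c, shift, tolerance=10, max_hops=3):
    """True if a repeat boundary is suspected to the right of read a."""
    if (b, c) in shift:  # b and c overlap directly, as expected
        return False
    target = shift[(a, c)] - shift[(a, b)]  # expected shift(b, c)
    return not has_consistent_path(shift, b, c, target, tolerance, max_hops)
```

For example, with shift(a,b) = 100 and shift(a,c) = 300, a path b to x to c with shifts 120 and 85 sums to 205, close enough to the expected 200 that no boundary is reported.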
25
Subreads
● After contigs are created, subreads are inserted
  – Subreads can be completely contained in other reads
  – Subreads can also be completely contained within a contig but not within any one read
● Subreads are only inserted if this operation is unambiguous
● Subread insertion improves scaffolding in subsequent steps
  – because it adds new mate-pair links
26
Serafim Batzoglou et al. Genome Res. 2002; 12: 177-189 Figure 11 Consistency of forward-reverse links
27
Pairing contigs for scaffolding
Two scaffolds S1, S2 have distance d(S1,S2) based on the estimated base-pair distance between them (contigs are “singleton scaffolds”).
Priority score is: s(S1,S2) = f(k) - |d(S1,S2)|
f(k) is a heuristic ‘reward’ for the number of links k between S1 and S2.
f(2), f(3), f(4), ... = 50, 875, 1700, 2025, 2350, 2475, 2600, 2625, 2650, 2675
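The priority score can be written directly as a small Python function. The reward table is the one listed on the slide for k = 2 through 11; reusing the last value for larger k is an assumption made here for illustration, and the function names are hypothetical.

```python
# f(2), f(3), ..., f(11) as listed on the slide.
REWARD = [50, 875, 1700, 2025, 2350, 2475, 2600, 2625, 2650, 2675]

def reward(k):
    """Heuristic reward f(k) for k forward-reverse links (k >= 2)."""
    if k < 2:
        raise ValueError("at least 2 links are required")
    return REWARD[min(k, 11) - 2]  # assumption: f(k) stays flat beyond k = 11

def priority(k, distance):
    """s(S1,S2) = f(k) - |d(S1,S2)|."""
    return reward(k) - abs(distance)
```

With these values, two scaffolds joined by only 2 links score positively only if their estimated distance is under 50 bp in absolute value, whereas 4 links tolerate a gap of up to 1700 bp.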
28
Scaffold assembly
1. Create a priority queue Q with all pairs of contigs that are
   a. not repetitive
   b. linked by at least 2 forward-reverse pairs
2. Loop: Merge highest priority pair (S1,S2), creating new scaffold T
3. Remove all pairs in Q containing S1 or S2
4. Create new pairs (T,W) for all scaffolds W that share forward-reverse links with T
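The merge loop above can be sketched with Python's heapq module. This is a simplified illustration, not the ARACHNE procedure: it ranks pairs by link count alone (the distance term of the priority score is left out), ignores contig order and orientation within a scaffold, and removes stale queue entries lazily when they are popped. All names are hypothetical.

```python
import heapq
from itertools import combinations

def link_count(s1, s2, links):
    """Number of forward-reverse links between two scaffolds (tuples of contig ids)."""
    return sum(links.get(frozenset((a, b)), 0) for a in s1 for b in s2)

def assemble_scaffolds(contigs, links, repetitive=frozenset(), min_links=2):
    """links maps frozenset({contig1, contig2}) to a link count."""
    scaffolds = {(c,) for c in contigs if c not in repetitive}

    # Step 1: queue all sufficiently linked pairs (negated count gives a max-heap).
    heap = []
    for s1, s2 in combinations(sorted(scaffolds), 2):
        k = link_count(s1, s2, links)
        if k >= min_links:
            heapq.heappush(heap, (-k, s1, s2))

    while heap:
        # Step 2: take the highest-priority pair.
        _, s1, s2 = heapq.heappop(heap)
        # Step 3 (lazy): skip pairs whose scaffolds were already merged away.
        if s1 not in scaffolds or s2 not in scaffolds:
            continue
        scaffolds -= {s1, s2}
        merged = s1 + s2
        # Step 4: queue new pairs (T, W) for scaffolds W linked to T.
        for w in scaffolds:
            k = link_count(merged, w, links)
            if k >= min_links:
                heapq.heappush(heap, (-k, merged, w))
        scaffolds.add(merged)
    return scaffolds
```

Lazy deletion avoids searching the heap for every pair containing S1 or S2; a stale entry costs one pop and is then discarded.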
29
Scaffold assembly
● The scaffold assembly procedure on the previous slides is run first using only short inserts (less than 10,000 bp)
● The entire procedure is then re-run using all links
30
Serafim Batzoglou et al. Genome Res. 2002; 12: 177-189 Figure 12 Filling gaps in scaffolds. (A) Contigs A and B are connected by a path p of contigs X1,..., Xk. The distance dp(A,B) between A and B (along the path p) is the length of the sequence in the path that does not overlap A or B. (B) Contigs Y1 and Y2 share forward-reverse links with the scaffold S. These links position them in the vicinity of the gap between A and B. Therefore, Y1 and Y2 will be used as possible stepping points in the path closing the gap from A to B.
31
Consensus sequence computation
● All contigs contain reads with approximate positions within those contigs
● Start at the left end of each contig
● Move base-by-base, computing the consensus by a quality-weighted vote
● Switch to another read when:
  – at the end of the current read
  – at a deletion in the current read
  – at a low-quality region in the current read
32
Derive Consensus Sequence
Derive multiple alignment from pairwise read alignments

TAGATTACACAGATTACTGA-TTGATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG-TTACACAGATTATTGACTTCATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGGGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA

Derive each consensus base by weighted voting
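Weighted voting over an alignment can be illustrated with a column-by-column Python sketch. It assumes the reads are already padded to equal length with '-' for gaps and that each position, including gaps, carries a quality weight. It is a plain column vote and does not model the read-switching traversal described on the previous slide; the function name is hypothetical.

```python
from collections import defaultdict

def weighted_consensus(rows, quals):
    """rows: equal-length aligned strings ('-' for gaps).
    quals: one list of per-column weights for each row.
    Returns the consensus string with gap-winning columns dropped."""
    out = []
    for col in range(len(rows[0])):
        votes = defaultdict(int)
        for row, q in zip(rows, quals):
            votes[row[col]] += q[col]
        best = max(votes, key=votes.get)  # ties go to the first symbol seen
        if best != "-":
            out.append(best)
    return "".join(out)
```

Weighting by quality lets one confident base call outvote two doubtful ones, which a simple majority vote would not do.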
33
ARACHNE 2
34
Three major improvements
● Scaffold breaking and re-joining introduced
● Gaps can now be filled by individual reads, not just by contigs
  – This is equivalent to the “stones” method in the Celera Assembler
● Memory usage was reduced fourfold
35
David B. Jaffe et al. Genome Res. 2003; 13: 91-96
Careful joining of scaffolds: minimize “stretching” of mate-pair links
Figure 1. Joining of scaffolds. Three scaffolds (a, b, c) are seen off the end of scaffold s. There are two or more read pair links from s to each of them. Each has an optimal position relative to s, determined by the insert lengths corresponding to the read pairs. However, each insert length has a standard deviation associated to it, and so the positions of a, b, and c relative to s also have standard deviations. Supposing that we allow each of them to slide from their optimal positions by up to 2.5 standard deviations, but that we do not allow overlap between any of the scaffolds, is there more than one possible order for the scaffolds? Among the possible orders, does a always appear first (after s)? If so, we join scaffold s to scaffold a.
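The question posed in the Figure 1 caption can be illustrated with a brute-force Python sketch. It is not the published algorithm: it assumes each candidate scaffold is described by an optimal start position measured from the end of s, a standard deviation, and a length, and it tries every order, placing each scaffold as far left as its sliding window and the previous scaffold allow. The function names are hypothetical.

```python
from itertools import permutations

def feasible(order, slack=2.5):
    """order: list of (optimal_start, std_dev, length), left to right.
    True if the scaffolds fit in this order without overlap, each within
    slack standard deviations of its optimal start."""
    prev_end = 0.0  # assumption: position 0 is the end of scaffold s
    for pos, sd, length in order:
        start = max(pos - slack * sd, prev_end)  # leftmost legal placement
        if start > pos + slack * sd:
            return False
        prev_end = start + length
    return True

def scaffold_to_join(candidates, slack=2.5):
    """candidates: dict name -> (optimal_start, std_dev, length).
    Returns the name that comes first in every feasible order, else None."""
    firsts = set()
    for order in permutations(candidates):
        if feasible([candidates[name] for name in order], slack):
            firsts.add(order[0])
    return firsts.pop() if len(firsts) == 1 else None
```

Enumerating permutations is only practical for the handful of scaffolds that hang off one scaffold end, which is the situation the caption describes.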
36
David B. Jaffe et al. Genome Res. 2003; 13: 91-96
Scaffold breaking: look for regions where clone coverage = 1
Figure 2. A disguised instance where sequence join alone holds together a scaffold. A long scaffold (blue) from one part of the genome subsumes a small foreign inset (red) from a completely different part of the genome, held together by a single point of attachment within a contig (bicolor): in fact only a sequence join ties blue to red. This was not recognized in the version of the code which produced the released mouse assembly (Mouse Genome Sequencing Consortium 2002). Resolution: break at the bicolor juncture, move the red sequence to where it links in another scaffold.
37
David B. Jaffe et al. Genome Res. 2003; 13: 91-96
Scaffold breaking: look for correlated links from the middle of one scaffold to another
Figure 3. Positive breaking of scaffolds. Three correlated links are seen between scaffolds S1 and S2. The spread of the connection between S1 and S2 is, in this case, the lesser of 10 kb and 25 kb, which is 10 kb. Because the positive breaking algorithm as applied to mouse required five links with spread at least 50 kb, this connection would not have been sufficient to break the scaffolds. If it were, the respective scaffolds would have been broken at the exact ends of reads (green bars).
38
Mouse genome assembly
Improved version of ARACHNE assembled the mouse genome
Several heuristics that iteratively:
● Break scaffolds that are suspicious
● Rejoin scaffolds
Size of problem: 32,000,000 reads
Time: 15 days, 1 processor
Memory: 28 Gb
N50 contig size: 16.3 Kb -> 24.8 Kb
N50 scaffold size: 0.27 Mb -> 16.9 Mb