3. Lecture WS 2003/04, Bioinformatics III: Whole Genome Shotgun Assembly


Whole Genome Shotgun Assembly
Two strategies for sequencing:
- clone-by-clone approach
- whole-genome shotgun approach (Celera, Gene Myers)
Shotgun sequencing was introduced by F. Sanger et al. (1977) and has remained the mainstay of genome sequence assembly for nearly 25 years. ED Green, Nat Rev Genet 2, 573 (2001)

Automatic sequencing

Automated Sequencing
Nearly all automatic sequencing is done using the enzymatic dideoxy chain-termination method of Sanger (1977). Fragments are separated by gel electrophoresis and read out via fluorescent dye labels.
Computer analysis of gel images:
- lane tracking: identify gel boundaries
- lane profiling: sum each of the 4 signals across the lane width to create a profile
- trace processing: deconvolute and smooth the signal estimates and reduce noise
- base calling: translate the processed trace into a sequence of bases
The program Phred is the quasi-standard for the last step (base calling).

Base Calling - Phred
B. Ewing, L. Hillier, M.C. Wendl, P. Green. Base-calling of automated sequencer traces using Phred. I. Accuracy assessment. Genome Res 8, (1998).
B. Ewing, P. Green. Base-calling of automated sequencer traces using Phred. II. Error probabilities. Genome Res 8, (1998).
The processed traces are displayed as chromatograms of 4 curves of different colors, each curve representing the signal of 1 of the 4 bases.

Base Calling - Phred
Idealized traces would consist of evenly spaced, nonoverlapping peaks. Real traces deviate from this ideal due to imperfections of the sequencing reactions, of gel electrophoresis, and of trace processing. The first 50 or so peaks and the peaks beyond 500 or so are particularly noisy.
Quality:
- high: no ambiguities
- medium: some ambiguities
- poor: low confidence

Base Calling Algorithm
1. Locate predicted peaks: find the idealized locations of the base peaks using Fourier methods.
2. Locate observed peaks: scan the 4 trace arrays for concave regions satisfying $2\,v(i) \ge v(i+1) + v(i-1)$.
3. Match observed and predicted peaks:
   a) find easy matches
   b) use dynamic programming to align those peaks not matched in a)
   c) match remaining observed peaks that seem to represent genuine bases
4. Find missed peaks.
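Step 2 can be illustrated with a minimal sketch. It assumes a trace is a plain list of signal values for one base channel; the function names are hypothetical and this is not Phred's actual implementation, only the concavity test stated on the slide.

```python
def concave_positions(trace):
    """Indices i where 2*v(i) >= v(i+1) + v(i-1) holds (interior points only)."""
    return [
        i
        for i in range(1, len(trace) - 1)
        if 2 * trace[i] >= trace[i + 1] + trace[i - 1]
    ]


def concave_regions(trace):
    """Group consecutive concave indices into (start, end) runs, both inclusive."""
    regions = []
    for i in concave_positions(trace):
        if regions and regions[-1][1] == i - 1:
            # Extend the current run by one position.
            regions[-1] = (regions[-1][0], i)
        else:
            # Start a new run.
            regions.append((i, i))
    return regions


# Two idealized peaks in a single channel.
trace = [0, 1, 4, 9, 10, 9, 4, 1, 0, 1, 4, 9, 10, 9, 4, 1, 0]
print(concave_regions(trace))
```

Note that a perfectly flat stretch also satisfies the inequality (with equality), so a real base caller would need further criteria, such as a minimum signal height, before accepting a region as a peak.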

Phred quality values
$q = -10 \log_{10}(p)$
where q is the quality value and p is the estimated error probability of a base call.
Examples:
- q = 20 means $p = 10^{-2}$ (1 error in 100 bases)
- q = 40 means $p = 10^{-4}$ (1 error in 10,000 bases)
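The relation between quality value and error probability can be checked with a few lines of Python. The function names below are illustrative and not part of Phred.

```python
import math


def phred_quality(p):
    """Quality value q for an estimated error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def error_probability(q):
    """Inverse mapping: estimated error probability for a quality value q."""
    return 10 ** (-q / 10)


for q in (10, 20, 30, 40):
    p = error_probability(q)
    print(f"q = {q}: p = {p:g} (about 1 error in {round(1 / p)} bases)")
```

Because the scale is logarithmic, every increase of 10 in q corresponds to a tenfold lower error probability.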

Phred
Phred performs several tasks:
a. Reads trace files: compatible with most file formats, including SCF (standard chromatogram format), ABI (373/377/3700), ESD (MegaBACE) and LI-COR.
b. Calls bases: assigns a base to each identified peak, with a lower error rate than the standard base-calling programs.
c. Assigns quality values to the bases: a "Phred value" based on an error rate estimate calculated for each individual base.
d. Creates output files: base calls and quality values are written to output files.

Whole genome assembly: problem description
The goal is to reconstruct an unknown source sequence (the genome) over the alphabet {A, C, G, T}, given many random short segments of the sequence, the shotgun reads.
A read is a contiguous stretch of about 500 nucleotides, taken from a random place in the genome. The orientation of a read is either forward or reverse complement.
Reads contain two kinds of errors: base substitutions and indels. Base substitutions occur with a frequency of ca. 0.5 – 2%. Indels occur roughly 10 times less frequently.
Reads can come from short plasmid inserts (2-12 kb), cosmids (40 kb) or BACs (150 kb).
Batzoglou PhD thesis (2002)
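To make the input of the assembly problem concrete, the sketch below draws reads from random positions of a toy genome in a random orientation. It is a simplified illustration: reads are error-free here (no substitutions or indels), the genome string and all names are made up, and the read length is far shorter than the roughly 500 bases of real reads.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over {A, C, G, T}."""
    return seq.translate(COMPLEMENT)[::-1]


def sample_reads(genome, n_reads, read_len, seed=0):
    """Draw error-free reads from random positions, in random orientation."""
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        read = genome[start:start + read_len]
        if rng.random() < 0.5:
            # Half of the reads come from the opposite strand.
            read = reverse_complement(read)
        reads.append(read)
    return reads


genome = "ACGTTGCAAGGCTTACCGATAGC"
for read in sample_reads(genome, n_reads=5, read_len=8, seed=1):
    print(read)
```

An assembler sees only the list of reads: neither the start positions nor the orientations are known.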

Whole Genome Assemblers
- TIGR Assembler: G.G. Sutton et al., Genome Sci Technol 1, 9-19 (1995)
- PHRAP: P. Green (1996)
- Celera Assembler
- CAP3: X. Huang, A. Madan, Genome Res 9, (1999)
- RePS: J. Wang et al., Genome Res 12, (2002)
- Phusion (Sanger): J.C. Mullikin, Z. Ning, Genome Res 13, (2003)
- Arachne (Whitehead/MIT)
- Euler (UCSD, USC): P.A. Pevzner, H. Tang, M.S. Waterman, RECOMB (2001)
Most assemblers follow the same approach: overlap, layout, consensus.

CAP3 Assembler
1. Removal of poor end regions of reads
2. Computation of overlaps between reads
3. Removal of false overlaps
4. Construction of contigs
5. Construction of multiple sequence alignments and generation of consensus sequences

CAP3: Clipping of Low-Quality Regions
Use base quality values (from Phred) and sequence similarities to compute the 5' and 3' clipping positions of reads.
Definition of good regions of a read:
- any sufficiently long region of high quality values that is similar to a region of another read, OR
- any sufficiently long region that is highly similar to a good high-quality region of another read
Figure: computation of the 5' and 3' clipping positions of read f. Read f has high local similarities to reads g and h. A pair of broken lines shows the start and end positions of a similarity. A thick line indicates the high-quality region of a read.
Huang, Madan, Genome Res 9, 868 (1999)

Celera: compartmentalized shotgun assembler
Uses preliminary data from both human genome assembly projects.
Huson et al. Bioinformatics 17, S132 (2001)

Arachne
Program by Serafim Batzoglou (MIT, PhD thesis 2000).
(i) Create a graph G of overlaps between pairs of reads from the shotgun data.
(ii) Process G in order to construct supercontigs of mapped reads.
Batzoglou et al. Genome Res 12, 177 (2002)

Earmuff links
An important variation of whole-genome shotgun sequencing obtains reads from both ends of an insert, forward and backward. Since inserts are size-selected, the approximate distance between the pair of reads obtained from the two ends of a fragment is known. Such read pairs will be called earmuff links.

Arachne: creation of overlap graph
List of reads R = (r_1, ..., r_N), where N is the number of reads. Each read r_i has length l_i < [...]. If two reads are taken from the endpoints of the same clone (earmuff link), r_i has a link to another read r_j at a specified distance d_ij.
First: create a graph G of overlaps (edges) between pairs of reads (nodes). This requires aligning pairs of reads in R. Since R can be very long, N^2 alignments are infeasible.
Instead, create a table of occurrences of k-mers (strings of length k) in the reads and count the number of k-mer matches for each pair of reads. Then perform pairwise alignments only between pairs of reads that share more than a cutoff number of k-mers.
Batzoglou PhD thesis (2002)

Arachne: table of k-mer occurrences
Find the number of k-mer matches in the forward or reverse complement direction between each pair of reads in R.
(1) Obtain all triplets (r, t, v), where r = read in R, t = index of a k-mer occurring in r, v = direction of occurrence (forward or reverse complement).
(2) Sort the set of triplets according to the k-mer indices t.
(3) Use the sorted list to create a table T of quadruplets (r_i, r_j, f, v), where r_i and r_j are reads that contain at least one common k-mer, v is a direction, and f is the number of k-mers in common between r_i and r_j in direction v.
Batzoglou PhD thesis (2002)
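The three steps can be sketched in memory as follows. This is a simplified illustration, not Arachne's implementation: the k-mer string itself stands in for the k-mer index t, repeated occurrences of a k-mer within one read are counted once, and the table T is a Counter keyed by (r_i, r_j, v) whose value is the count f. All function names are assumptions.

```python
from collections import Counter
from itertools import groupby

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_triplets(reads, k):
    """Steps (1) and (2): triplets (t, r, v), sorted by k-mer t."""
    triplets = set()
    for r, seq in enumerate(reads):
        for t in kmers(seq, k):
            triplets.add((t, r, "F"))  # k-mer occurs in the read as given
        for t in kmers(reverse_complement(seq), k):
            triplets.add((t, r, "R"))  # k-mer occurs in the reverse complement
    return sorted(triplets)


def shared_kmer_table(reads, k):
    """Step (3): table T mapping (r_i, r_j, v) to the number of shared k-mers."""
    table = Counter()
    for _, group in groupby(kmer_triplets(reads, k), key=lambda x: x[0]):
        group = list(group)
        for _, ri, vi in group:
            if vi != "F":
                continue  # fix read r_i in forward direction to avoid double counting
            for _, rj, vj in group:
                if ri < rj:
                    table[(ri, rj, vj)] += 1
    return table


reads = ["ACGTTG", "GTTGCA", "TGCAAC"]  # third read is the reverse complement of the second
for key, f in sorted(shared_kmer_table(reads, 3).items()):
    print(key, f)
```

Sorting brings all occurrences of the same k-mer next to each other, so the pair counts can be accumulated in a single pass over the sorted list without comparing every read against every other read.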

Arachne: table of k-mer occurrences
Here: k = 3. Batzoglou PhD thesis (2002)

Arachne: table of k-mer occurrences
If a k-mer occurs "too often", it is likely part of a repeat sequence, and we should not use it for detecting overlaps.
Implementation:
(1) Find the k-mer occurrences (r, t, v) and sort them into 64 files according to the first three nucleotides of each k-mer.
(2) For i = 1, ..., 64: load file i into memory, sort it according to t, and store the sorted file.
(3) Load the 64 sorted files into memory sequentially and create table T incrementally.
In practice, k = 8 to 24.
Batzoglou PhD thesis (2002)
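Two ingredients of this implementation are easy to sketch: the mapping of a k-mer to one of 64 buckets by its first three nucleotides, and the exclusion of k-mers that occur too often. The code below works in memory instead of on files, and the names and the cutoff value are hypothetical.

```python
from collections import Counter

NUCLEOTIDE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def bucket_index(kmer):
    """Map the first three nucleotides of a k-mer to a bucket number in 0..63."""
    index = 0
    for base in kmer[:3]:
        index = 4 * index + NUCLEOTIDE_CODE[base]
    return index


def frequent_kmers(reads, k, max_count):
    """K-mers occurring more than max_count times over all reads (likely repeats)."""
    counts = Counter(
        seq[i:i + k] for seq in reads for i in range(len(seq) - k + 1)
    )
    return {kmer for kmer, n in counts.items() if n > max_count}


reads = ["ACACACAC", "TTGACACA", "GGGTTTCC"]
print(bucket_index("ACGT"))                    # bucket of a k-mer starting with ACG
print(frequent_kmers(reads, k=4, max_count=3))  # k-mers to ignore for overlap detection
```

Splitting by prefix keeps each sorting job small enough for memory, and because the buckets are disjoint, the sorted buckets can be processed one after another to build T.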

Arachne: pairwise read alignments
Perform pairwise alignments between reads that share more than a cutoff number of k-mers. When k-mers that are too common (occurring more often than a second cutoff) are excluded, it is guaranteed that only O(N) pairwise alignments will be performed.
Only a small number of base substitutions and indels is allowed in the overlapping region of two aligned reads. Use a dynamic programming alignment that disallows deviations of more than a few characters.
Output of the alignment algorithm: for reads r_i, r_j, quadruplets (b_1, b_2, e_1, e_2) of the beginning positions b_1, b_2 and end positions e_1, e_2 of the detected overlap region.
If a significant overlap region is detected, (r_i, r_j, b_1, b_2, e_1, e_2) becomes a link in the overlap graph G.
Batzoglou PhD thesis (2002)
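The idea of a dynamic programming alignment that "disallows deviations of more than a few characters" can be illustrated with a banded edit distance. This is a generic sketch under simplifying assumptions: it aligns two strings end to end rather than detecting an overlap region, uses unit costs, and the band width is a hypothetical parameter. It is not Arachne's alignment routine.

```python
def banded_edit_distance(a, b, band):
    """Edit distance between a and b using only cells with |i - j| <= band.

    Returns None if the length difference alone already exceeds the band.
    """
    inf = float("inf")
    if abs(len(a) - len(b)) > band:
        return None
    # Row 0: aligning the empty prefix of a against prefixes of b.
    prev = {j: j for j in range(min(len(b), band) + 1)}
    for i in range(1, len(a) + 1):
        cur = {}
        for j in range(max(0, i - band), min(len(b), i + band) + 1):
            if j == 0:
                cur[j] = i
                continue
            substitute = prev.get(j - 1, inf) + (a[i - 1] != b[j - 1])
            delete = prev.get(j, inf) + 1
            insert = cur.get(j - 1, inf) + 1
            cur[j] = min(substitute, delete, insert)
        prev = cur
    return prev[len(b)]


print(banded_edit_distance("ACGTACGT", "ACGTTCGT", band=2))  # one substitution
print(banded_edit_distance("ACGTACGT", "ACGACGT", band=2))   # one deletion
```

Restricting the computation to a band of width 2·band + 1 around the diagonal reduces the cost from quadratic to roughly linear in the read length, which matters when very many read pairs have to be aligned.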

Correcting errors in reads
Shown is a portion of a multiple alignment between 5 reads. A base T of quality 30 is aligned to bases C, some of which are of quality greater than 30. The base T is subsequently changed to a base C of quality 30.
Batzoglou et al. Genome Res 12, 177 (2002)
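The example can be reproduced with a small sketch that corrects one column of a multiple alignment. The decision rule used here (strict majority base, supported by at least one base of equal or higher quality) is an assumption made for illustration; the slide does not state Arachne's exact criteria.

```python
from collections import Counter


def correct_column(column):
    """Correct one alignment column given as a list of (base, quality) pairs.

    Illustrative rule (an assumption, not Arachne's documented criterion):
    a base is replaced by the majority base if that base holds a strict
    majority and is supported by a quality at least as high as the base
    being replaced. The corrected base keeps its original quality.
    """
    counts = Counter(base for base, _ in column)
    consensus, votes = counts.most_common(1)[0]
    best_quality = max(q for base, q in column if base == consensus)
    corrected = []
    for base, q in column:
        if base != consensus and votes > len(column) / 2 and best_quality >= q:
            corrected.append((consensus, q))
        else:
            corrected.append((base, q))
    return corrected


column = [("C", 35), ("C", 40), ("T", 30), ("C", 25), ("C", 32)]
print(correct_column(column))
```

As on the slide, the T of quality 30 becomes a C of quality 30, while the other bases are left untouched.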

Partial alignments
3 partial alignments of length k = 6 between a pair of reads coalesce to yield a single full alignment of length 19. Vertical bars denote matching bases, whereas x's denote mismatches. This illustrates the commonly occurring situation where an extended k-mer hit is a full alignment between two reads.
Batzoglou et al. Genome Res 12, 177 (2002)

Ambiguity created by the presence of repeats
In the absence of sequencing errors and repeats, it would be simple to retrieve all retrievable pairwise distances of reads and to construct G. In the presence of repeats, a link between two reads in G does not necessarily imply a true overlap. A "repeat link" is a link in G between two reads that come from different regions of the genome and overlap in a repeated segment.
Batzoglou PhD thesis (2002)

Arachne: processing of overlap graph
Some of the repetition in the genome is efficiently masked before the creation of G by discarding k-mers of high frequency when building T. Furthermore, some heuristic algorithms are used to detect and delete repeat links (not discussed here).
Batzoglou PhD thesis (2002)

Merging contigs
Sequence contigs are formed by merging pairs of reads that can be merged without ambiguity. In practice the situation is much worse than shown here: repeats are not 100% conserved between copies.
Batzoglou PhD thesis (2002)

Sequence contigs
Batzoglou PhD thesis (2002)

Using paired pairs of overlaps to merge reads
Arachne searches for instances of two plasmids of similar insert size with sequence overlaps occurring at both ends: paired pairs.
(A) A paired pair of overlaps. The top two reads are end sequences from one insert, and the bottom two reads are end sequences from another. The two overlaps must not imply too large a discrepancy between the insert lengths.
(B) Initially, the top two pairs of reads are merged. Then the third pair of reads is merged in, based on having an overlap with one of the top two left reads, an overlap with one of the top two right reads, and consistent insert lengths. The bottom pair is similarly merged.
Bottom: collections of paired pairs are merged into contigs, and consensus sequences are formed.
Batzoglou et al. Genome Res 12, 177 (2002)

Detection of repeat contigs
Some of the identified contigs are repeat contigs, in which nearly identical sequences from distinct regions are collapsed together. They can be detected because
(a) repeat contigs usually have an unusually high depth of coverage;
(b) they typically have conflicting links to other contigs.
After marking repeat contigs, the remaining contigs should represent correctly assembled sequence.
Figure: contig R is linked to contigs A and B to the right. The distances estimated between R and A and between R and B are such that A and B cannot be positioned without substantial overlap between them. If there is no corresponding detected overlap between A and B, then R is probably a repeat linking to two unique regions to the right.
Batzoglou et al. Genome Res 12, 177 (2002)
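Criterion (a) can be sketched as a comparison of each contig's depth of coverage with the typical depth. The threshold (twice the median) and all names below are assumptions chosen for illustration; the slides do not give Arachne's actual cutoff.

```python
from statistics import median


def coverage_depth(contig_length, read_lengths):
    """Average number of reads covering each base of the contig."""
    return sum(read_lengths) / contig_length


def flag_repeat_contigs(depths, factor=2.0):
    """Names of contigs whose depth exceeds factor times the median depth."""
    typical = median(depths.values())
    return {name for name, depth in depths.items() if depth > factor * typical}


depths = {"A": 8.0, "B": 7.5, "R": 21.0, "C": 8.5}
print(coverage_depth(1000, [500] * 16))  # 16 reads of 500 bases on a 1 kb contig
print(flag_repeat_contigs(depths))
```

A contig that collapses several copies of a repeat collects the reads of all copies, which is why its depth stands out against the unique contigs.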

Supercontig creation and gap filling
Unmarked contigs = unique contigs. Iteratively merge contigs into supercontigs.
(A) A supercontig is constructed by successively linking pairs of contigs that share at least two forward-reverse links. Here, 3 contigs are joined into one supercontig. The layout now consists of a number of supercontigs with interleaved gaps. Most gaps belong to regions marked as repeat contigs; some correspond to regions with insufficient shotgun reads.
(B) Arachne attempts to fill gaps by using paths of contigs. The first gap in the supercontig shown here is filled with one contig, and the second gap is filled by a path consisting of two contigs.
Batzoglou et al. Genome Res 12, 177 (2002)
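The grouping rule in (A) can be sketched with a union-find structure: contigs end up in the same supercontig if they are connected by pairs sharing at least two forward-reverse links. This sketch only groups contigs; it ignores their order, orientation and the distances implied by the links, all of which a real assembler has to take into account. The names and the input format are hypothetical.

```python
from collections import Counter


def build_supercontigs(contigs, links, min_links=2):
    """Group contigs joined by at least min_links forward-reverse links.

    links is a list of (contig, contig) pairs, one entry per linked read pair.
    """
    parent = {c: c for c in contigs}

    def find(c):
        # Find the representative of c, compressing the path on the way.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    link_counts = Counter(tuple(sorted(pair)) for pair in links)
    for (a, b), n in link_counts.items():
        if n >= min_links:
            parent[find(a)] = find(b)

    groups = {}
    for c in contigs:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(group) for group in groups.values())


contigs = ["c1", "c2", "c3", "c4"]
links = [("c1", "c2"), ("c2", "c1"), ("c2", "c3"), ("c3", "c2"), ("c3", "c4")]
print(build_supercontigs(contigs, links))
```

Requiring two independent links protects against a single chimeric or misplaced read pair joining unrelated contigs; here c4 stays separate because only one link supports it.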

Contig assembly
If (a,b) and (a,c) overlap, then (b,c) are expected to overlap. Moreover, one can calculate that $\mathrm{shift}(b,c) = \mathrm{shift}(a,c) - \mathrm{shift}(a,b)$.
A repeat boundary is detected toward the right of read a if there is no overlap (b,c), nor any path of reads x_1, ..., x_k such that (b,x_1), (x_1,x_2), ..., (x_k,c) are all overlaps and $\mathrm{shift}(b,x_1) + \dots + \mathrm{shift}(x_k,c) \approx \mathrm{shift}(a,c) - \mathrm{shift}(a,b)$.
Batzoglou et al. Genome Res 12, 177 (2002)
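The shift arithmetic can be made concrete with a small worked example. The tolerance that decides what counts as approximately equal is a hypothetical parameter, as are the function names and the dictionary layout used for the overlap shifts.

```python
def expected_shift(shift_ab, shift_ac):
    """shift(b,c) implied by the overlaps (a,b) and (a,c)."""
    return shift_ac - shift_ab


def path_supports_overlap(shifts, path, expected, tolerance=5):
    """Check whether the shifts along a path of overlapping reads add up to
    approximately the expected shift between its first and last read.

    shifts maps an ordered pair of reads (u, v) to shift(u, v).
    """
    total = sum(shifts[(u, v)] for u, v in zip(path, path[1:]))
    return abs(total - expected) <= tolerance


# Read b starts 100 bases and read c starts 200 bases to the right of read a.
target = expected_shift(shift_ab=100, shift_ac=200)
shifts = {("b", "x1"): 40, ("x1", "c"): 62}
print(target)
print(path_supports_overlap(shifts, ["b", "x1", "c"], target))
```

If neither a direct overlap (b,c) nor any such consistent path exists, b and c both overlap a but diverge from each other, which is the signature of a repeat boundary to the right of a.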

Consistency of forward-reverse links
(A) The distance d(A,B) (length of the gap, or negated length of the overlap) between two linked contigs A and B can be estimated using the forward-reverse linked reads between them.
(B) The distance d(B,C) between two contigs B and C that are linked to the same contig A can be estimated from their respective distances to A.
Batzoglou et al. Genome Res 12, 177 (2002)
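Both estimates reduce to simple arithmetic once a coordinate convention is fixed. The sketch below assumes that A lies to the left of B, that each link is described by the insert length, the start of the forward read within A, and the end of the reverse read within B, and, for part (B), that B and C both lie to the right of A with B before C. These conventions and all names are assumptions for illustration.

```python
from statistics import mean


def gap_from_link(insert_len, len_a, start_in_a, end_in_b):
    """Gap between A and B implied by one forward-reverse link.

    The insert covers (len_a - start_in_a) bases of A, then the gap,
    then end_in_b bases of B. A negative result means A and B overlap.
    """
    return insert_len - (len_a - start_in_a) - end_in_b


def estimate_gap(len_a, links):
    """Average the gap estimates over all links (insert_len, start_in_a, end_in_b)."""
    return mean(gap_from_link(ins, len_a, s, e) for ins, s, e in links)


def gap_between_linked(d_ab, len_b, d_ac):
    """d(B,C) from d(A,B) and d(A,C), with B and C right of A and B before C."""
    return d_ac - d_ab - len_b


links = [(4000, 8000, 1500), (4000, 9000, 2600)]
d_ab = estimate_gap(10000, links)
print(d_ab)
print(gap_between_linked(d_ab, len_b=3000, d_ac=4000))
```

Because insert sizes are only approximately known, each link gives a noisy estimate; averaging over several links narrows the uncertainty.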

Types of misassemblies
(A) 3 types of simple minor misassemblies are shown: insertions, deletions, and hanging ends. In all cases, a contiguous segment (of a contig or the genome) of less than 10 kb does not align in the expected location (with the genome or contig).
(B) More misassemblies. First, two pieces of a contig align to distant parts of the genome. Second, adjacent contigs in a supercontig align to distant parts of the genome.
Batzoglou et al. Genome Res 12, 177 (2002)

Filling gaps in supercontigs
(A) Contigs A and B are connected by a path p of contigs X_1, ..., X_k. The distance d_p(A,B) between A and B (along the path p) is the length of the sequence in the path that does not overlap A and B.
(B) Contigs Y_1 and Y_2 share forward-reverse links with the supercontig S. These links position them in the vicinity of the gap between A and B. Therefore, Y_1 and Y_2 will be used as possible stepping points in the path closing the gap from A to B.
Batzoglou et al. Genome Res 12, 177 (2002)

Detection of chimeric reads
Reads l_1, l_2, l_3, r_1, r_2, and r_3, and the absence of a read n (having long overlaps on both sides of a point x) suggest that read c may be chimeric, consisting of the juxtaposition of two disparate genomic segments: one corresponding to the part of c before x, and one corresponding to the part of c after x. Note that reads l_3 and r_3 extend slightly beyond x, as often happens for real chimeric reads.
Batzoglou et al. Genome Res 12, 177 (2002)

Contig Coverage and Read Usage. Batzoglou et al. Genome Res 12, 177 (2002)

Characterization of Contigs. Batzoglou et al. Genome Res 12, 177 (2002)

Characterization of Supercontigs. Batzoglou et al. Genome Res 12, 177 (2002)

Base Pair Accuracy. Batzoglou et al. Genome Res 12, 177 (2002)

Misassemblies. Batzoglou et al. Genome Res 12, 177 (2002)

Computational Performance. Batzoglou et al. Genome Res 12, 177 (2002)

Contig Coverage and Read Usage. Batzoglou et al. Genome Res 12, 177 (2002)

Comparison of different assemblers
You should look out for:
- smallest number of contigs and of misassembled contigs
- highest possible coverage by contigs
- lowest possible coverage by misassembled contigs
Pevzner, Tang, Waterman PNAS 98, 9748 (2001)

There is no error-free assembler to date
Comparative analysis of the EULER, PHRAP, CAP, and TIGR assemblers (NM sequencing project). Every box corresponds to a contig in the NM assembly produced by these programs, with colored boxes corresponding to assembly errors. Boxes in the IDEAL assembly correspond to islands in the read coverage. Boxes of the same color show misassembled contigs. Repeats with similarity higher than 95% are indicated by numbered boxes at the solid line showing the genome.
To check the accuracy of the assembled contigs, we fit each assembled contig into the genomic sequence. Inability to fit a contig into the genomic sequence indicates that the contig is misassembled. For example, PHRAP misassembles 17 contigs in the NM sequencing project, each contig containing from two to four fragments from different parts of the genome. "Biologists 'pay' for these errors at the time-consuming finishing step."
Pevzner, Tang, Waterman PNAS 98, 9748 (2001)

What comes next? Finishing the genome
Usually, the assembly of shotgun data ends with a number of contigs and some remaining gaps. In addition, each contig contains some regions with a high error rate. The goal of the finishing phase is to obtain a single continuous contig with a low error rate.
"Finishers" apply ad hoc rules to decide where additional data is necessary. This additional data may then be generated in experiments using different chemistry or higher coverage.
Autofinish (phrap group) is a program that helps humans decide which new reads to obtain.

Human experts are only rarely needed...
D. Gordon, C. Desmarais, P. Green, Genome Res 11, 614 (2001)