Genome sequence assembly concepts and methods Shih-Jon Wang May 6, 2009.

Genome sequence assembly concepts and methods Shih-Jon Wang May 6, 2009

Assembly Process Overview Assembly algorithms Repeats Scaffolding Phred/Phrap/Consed Assembly pipelines OUTLINE

Assembly process overview

A Genome Sequencing Project

Building a Library Break DNA into random fragments (8- 10x)‏

SHOTGUNs Whole Genome Shotgun Bac-Bac Shotgun Size of inserts: --Bac insert: ~150KB --Fosmid insert: ~30KB --Normal insert: ~3KB

Clone and scaffold (a) Clone inserts are sequenced from both ends, yielding mated sequence reads. (b) A scaffold uses linking information provided by the clone-pairing data to order and orient contiguous sequences, or contigs, in the genome under assembly. Computer 35 (7):47-54

Building a Library Break DNA into random fragments (~10x)‏ -- Amplify the fragments in a vector -- Sequence 800-1000 bases at each end

Assembling the fragments

Break DNA into random fragments Sequence the ends of the fragments Assemble the sequenced ends

Forward-reverse constraints The sequenced ends are facing towards each other The distance between the two fragments is known

Building Scaffolds

Assembly Gaps --sequencing gap: know the order & orientation of the contigs and have at least one clone spanning the gap --physical gap: no information about adjacent contigs, nor about the DNA spanning the gap

Finishing the Project

Unifying View of Assembly

Assembly Algorithms

Assembly Methods Overlap-layout-consensus – greedy (Phrap, CAP3, TIGR...)‏ – graph-based (Euler)‏

Phrap/CAP3 Greedy Build a rough map of fragment overlaps Pick the largest scoring overlap Merge the two fragments Repeat until no more merges can be done !!! IDEAL CASE !!!

Real World Problems Sequencing errors Chimera Repeats Contaminants Polymorphism Orientation

Error Correction

Overlap b/w two sequences

All pairs alignment Try all pairs – must consider ~ n^2 pairs Smarter solution: only n x coverage (e.g. 8) pairs are possible – Build a table of k-mers contained in sequences (single pass through the genome)‏ – Generate the pairs from k-mer table (single pass through k-mer table)‏

Repeats

Repeat sequence The top represents the correct layout of three DNA sequences. The bottom shows a repeat collapsed in a misassembly. Computer 35 (7):47-54

重覆序列 ■ 重覆頻率分 Interspersed repeats Short interspersed element (SINE), eg. Alu <300 bp Long interspersed element (LINE), ca. 5 kb Tandem repeats Satellite DNA Minisat. & Variable number of tandem repeats Microsat.: mono-, di-, tri-, tetra-nucleotide ■ 重覆方向分同向重覆序列反向重覆序列

Repeat detection Pre-assembly: find fragments that belong to repeats statistically (Reps)‏ repeat database (RepeatMasker)‏

Statistical repeat detection Significant deviations from average coverage flagged as repeats. - frequent k-mers are ignored - “arrival” rate of reads in contigs compared with theoretical value (e.g., 800 bp reads & 8x coverage - reads "arrive" every 100 bp)‏ Problem 1: assumption of uniform distribution of fragments - leads to false positives non-random libraries poor clonability regions Problem 2: repeats with low copy number are missed

Scaffolding

Sequencing hierarchy Random sequencing – unrelated reads ~700 pairs Assembly – un-related contigs 5K-10K pairs Scaffolding – unrelated scaffolds 30K~ 50K pairs Finishing/gap closure – completed genomes millions-billions of base-pairs

Definition

Scaffolder output order and orientation of contigs size of gaps between contigs linking evidence: mate-pairs spanning gaps

Clone-mates

Linking information

Hierarchical scaffolding

Ambiguous scaffold

Phred/Phrap/Consed Analysis

What is Phred/Phrap/Consed ? Phred/Phrap/Consed is a worldwide distributed package for: a. Trace file (chromatograms) reading; b. Quality (confidence) assignment to each individual base; c. Vector & repeat sequences identification and masking; d. Sequence assembly; e. Assembly visualization and editing; f. Automatic finishing.

How to deal with the enormous amount of reads generated by the high throughput DNA sequencers?

Phred Genome Research 8: 175-194

Phred Phred is a program that performs several tasks: a.Reads trace files – compatible with most file formats: SCF (standard chromatogram format), ABI, ESD (MegaBACE) and LI-COR. b. Calls bases – attributes a base for each identified peak with a lower errorrate than the staard base calling programs.

Phred c. Assigns quality values to the bases – a “Phred value” based on an error rate estimation calculated for each individual base. d. Creates output files – base calls and quality values are written to output files.

File Directories chromat_dir/ edit_dir/ phd_dir/

Trace File High quality region – no ambiguities (Ns)‏

Trace File Medium quality region – some ambiguities (Ns)‏

Trace File Poor quality region – low confidence

Phred value formula q = - 10 x log10 (p) where q - quality value p - estimated probability error for a base call Examples: q = 20 means p = 10 -2 (1 error in 100 bases)‏ q = 40 means p = 10 -4 (1 error in 10,000 bases)‏

Base Calling phred -id. -p -pd../phd_dir phred -view pf84c05.s1

The structure of a phd file BEGIN_SEQUENCE 01EBV10201A02.g BEGIN_COMMENT CHROMAT_FILE: EBV10201A02.g ABI_THUMBPRINT: PHRED_VERSION: 0.990722.g CALL_METHOD: phred QUALITY_LEVELS:99 TIME: Thu May 24 00:18:58 2001 TRACE_ARRAY_MIN_INDEX: 0 TRACE_ARRAY_MAX_INDEX: 12153 TRIM: CHEM: term DYE: big END_COMMENT BEGIN_DNA t 8 5 c 13 17 a 19 26 c 19 32 t 6 11908 a 6 11921 g 6 11927 t 6 11947 c 6 11953 a 6 11964 g 6 11981 c 4 11994 n 4 12015 c 4 12037 n 4 12044 n 4 12058 n 4 12071 n 4 12085 n 4 12098 n 4 12111 n 4 12124 c 4 12144 n 4 12151 END_DNA END_SEQUENCE t 24 2221 a 24 2232 a 22 2245 a 27 2261 g 25 2272 c 19 2286 c 12 2302 t 19 2314 g 12 2324 g 15 2331 g 19 2346 g 23 2363 t 33 2378 g 36 2390 c 44 2404 c 44 2419 t 39 2433 a 39 2446 a 34 2460 t 35 2470 g 34 2482 t 16 8191 g 19 8200 t 13 8211 c 13 8229 g 4 8241 n 4 8253 c 4 8263 t 10 8276 t 9 8286 c 12 8301 t 16 8313 c 12 8329 c 12 8336 c 15 8343 t 19 8356 c 9 8371 g 13 8386 g 14 8397 a 7 8417 g 9 8427 g 4 8445

phd2fasta phd2fasta program ––converts.phdfiles to sequence in multifasta format ––writes.qualfile (quality scores) for each trace file ––phd2fasta -id../phd_dir -os CLONE.fasta - oq CLONE.fasta.qual Output: ––fasta.seqcontains fastasequences ––fasta.seq.qualcontains quality scores

Vector Sequence Cleaning (1)‏ DNA sequence cleaning: quality trimming and vector removal---Lucy: Lucy Steps: –Read input seq#, seq info, and quality info –Chop off splice sites –Remove vector insert –Produce output seq for fragment assembly.

Vector Sequence Cleaning (2)‏ Restriction on file name: can’t contain any symbol eg. “–” “. “ “_” Lucy major parameters to set up: -vector vector_completeSeq splice_site_file (splice_site_file: 2 splice-site seq before and after the insertion point on the vector)‏ Lucy Output: –identified locations of good/clean region –trim seq without vector, linker, Ns (<3% Ns)‏

splice_site_file ~ 150 bases, 50 bases overlap around splice >PUCsplice.for.begin gattaagttgggtaacgccagggttttcccagtcacgacgttgtaaaacg acggccagtgccaagcttgcatgcctgcaggtcgactctagaggatcccc gggtaccgagctcgaattcgtaatcatggtcatagctgtttcctgtgtga >PUCsplice.for.end acggccagtgccaagcttgcatgcctgcaggtcgactctagaggatcccc gggtaccgagctcgaattcgtaatcatggtcatagctgtttcctgtgtga aattgttatccgctcacaattccacacaacatacgagccggaagcataaa >PUCsplice.rev.begin tttatgcttccggctcgtatgttgtgtggaattgtgagcggataacaatt tcacacaggaaacagctatgaccatgattacgaattcgagctcggtaccc ggggatcctctagagtcgacctgcaggcatgcaagcttggcactggccgt >PUCsplice.rev.end tcacacaggaaacagctatgaccatgattacgaattcgagctcggtaccc ggggatcctctagagtcgacctgcaggcatgcaagcttggcactggccgt cgttttacaacgtcgtgactgggaaaaccctggcgttacccaacttaatc

Cross_match

cross_match -minmatch 12 -penalty -2 - minscore 20 -screen CLONE.fasta /net/share/sequence_pipeline/vector.fasta

Phrap -- Phragment Assembly Program (or Phil’s Revised Assembly Program)‏ Phrap is a program for assembling shotgun DNA sequence dataPhrap is a program for assembling shotgun DNA sequence data Key Features:Key Features: a. Uses the entire read content – no need for trimming.a. Uses the entire read content – no need for trimming. b. User supplied (i.e. Repbase) + internally computed data – better accuracy of assembly in the presence of repeats.b. User supplied (i.e. Repbase) + internally computed data – better accuracy of assembly in the presence of repeats. c. Contig sequence is constituted by a mosaic of the highest quality parts of the reads – it’s not a consensus!c. Contig sequence is constituted by a mosaic of the highest quality parts of the reads – it’s not a consensus!

Phrap --Phrap is a program for assembling shotgun DNA sequence data d. Provides extensive information about assembly – contained in phrap.out, *.ace and *.screen.contigs.qual filesd. Provides extensive information about assembly – contained in phrap.out, *.ace and *.screen.contigs.qual files e. Handles very large datasets – hundreds of thousands of reads are easily manipulated.e. Handles very large datasets – hundreds of thousands of reads are easily manipulated. f. Generate output files – contain some important data and enable visualization by other programsf. Generate output files – contain some important data and enable visualization by other programs

Banded Search

K-mers >GL1234.b1 gattaagttgggtaacgccagggttttcccagtcac… gattaagttgggta attaagttgggtaa ttaagttgggtaac taagttgggtaacg...

Phrap output files *.contigs – fasta file containing the contigs*.contigs – fasta file containing the contigs –Contigs with more than one read –Singletons (single reads with a match to some other contig but that couldn’t be merged consistently to it)‏ *.singlets – fasta file of the singlet reads*.singlets – fasta file of the singlet reads –Reads with no match to other read *.ace – allows for viewing the assembly using Consed*.ace – allows for viewing the assembly using Consed *.view – required for viewing the assembly using Phrapview*.view – required for viewing the assembly using Phrapview

Phrap parameters phrap -new_ace CLONE.fasta.screen >outfile OPTIONS DEFAULT FUNCTION -penalty -2 ↑=>↑Stringency. -gap_init penalty-2. -gap_ext penalty-1 -minmatch* 14 ↑=>↓time↓Matches -bandwidth14 ↓=>↓time↓String. -minscore30 ↑=>↑String. *highly sensitive! bigger genomes bigger value

Phrap parameters OPTIONS DEFAULT FUNCTION -forcelevel0~10 ↓=>↑String. -repeat_stringency 0.95 0 ↑String. -force_high* ↑=>↑String. -revise_greedy** ↓Misassembly -shatter_greedy** ↓ContigLength * Ignore edited high-quality discrepancies **break assembly at weak joins

Phrap parameters OPTIONS DEFAULT FUNCTION -max_subclone_size 5000 F.-R. check -default_qual 15 -preassemble* -group_delim* _ *used together

Consed Genome Research 8: 195-202, 1998

Consed A program for viewing and editing assemblies produced by Phrap Key Features: a. Assembly viewer - allows for visualization of contigs, assembly (aligned reads), quality values of reads and final sequence. b. Trace file viewer – single and multiple trace files can be visualized allowing for comparison of a given sequence in several reads.

Consed A program for viewing and editing assemblies produced by Phrap Key Features: c. Navigation – identify and list regions which are below a given quality threshold, contain high quality discrepancies, single-strand coverage, etc. d. Autofinish – automatic set of functions for: gap closure, improvement of sequence quality, determination of relative orientation of contigs, identification of regions covered by a single read or by reads of a single strand. The program automatically performs primer picking and chooses the templates.

Phred/Phrap/Consed Pipeline Chromat_dir Phd_dir Edit_dir Directories:

Comparison of shotgun sequence data from Wolbachia genome Project Computer 35 (7):47-54

Softwares CAP3 (for EST): http://genome.cs.mtu.edu/cap/cap3.html Phrap (for large genome): http://www.phrap.org --Similar algorithm --Insufficient documentation and support -- Always have to write scripts to parse outputs --NO PERFECT PROGRAM!!!

Questions?? wangsj@yahoo.com

Genome sequence assembly concepts and methods Shih-Jon Wang May 6, 2009.

Similar presentations

Presentation on theme: "Genome sequence assembly concepts and methods Shih-Jon Wang May 6, 2009."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Genome sequence assembly concepts and methods Shih-Jon Wang May 6, 2009.

Similar presentations

Presentation on theme: "Genome sequence assembly concepts and methods Shih-Jon Wang May 6, 2009."— Presentation transcript:

Similar presentations

About project

Feedback