Presentation is loading. Please wait.

Presentation is loading. Please wait.

Genome sequence assembly concepts and methods Shih-Jon Wang May 6, 2009.

Similar presentations


Presentation on theme: "Genome sequence assembly concepts and methods Shih-Jon Wang May 6, 2009."— Presentation transcript:

1 Genome sequence assembly concepts and methods Shih-Jon Wang May 6, 2009

2 Assembly Process Overview Assembly algorithms Repeats Scaffolding Phred/Phrap/Consed Assembly pipelines OUTLINE

3 Assembly process overview

4 A Genome Sequencing Project

5 Building a Library Break DNA into random fragments (8- 10x)‏

6 SHOTGUNs Whole Genome Shotgun Bac-Bac Shotgun Size of inserts: --Bac insert: ~150KB --Fosmid insert: ~30KB --Normal insert: ~3KB

7 Clone and scaffold (a) Clone inserts are sequenced from both ends, yielding mated sequence reads. (b) A scaffold uses linking information provided by the clone-pairing data to order and orient contiguous sequences, or contigs, in the genome under assembly. Computer 35 (7):47-54

8 Building a Library Break DNA into random fragments (~10x)‏ -- Amplify the fragments in a vector -- Sequence 800-1000 bases at each end

9 Assembling the fragments

10 Break DNA into random fragments Sequence the ends of the fragments Assemble the sequenced ends

11 Forward-reverse constraints The sequenced ends are facing towards each other The distance between the two fragments is known

12 Building Scaffolds

13 Assembly Gaps --sequencing gap: know the order & orientation of the contigs and have at least one clone spanning the gap --physical gap: no information about adjacent contigs, nor about the DNA spanning the gap

14 Finishing the Project

15 Unifying View of Assembly

16 Assembly Algorithms

17 Assembly Methods Overlap-layout-consensus – greedy (Phrap, CAP3, TIGR...)‏ – graph-based (Euler)‏

18 Phrap/CAP3 Greedy Build a rough map of fragment overlaps Pick the largest scoring overlap Merge the two fragments Repeat until no more merges can be done !!! IDEAL CASE !!!

19 Real World Problems Sequencing errors Chimera Repeats Contaminants Polymorphism Orientation

20 Error Correction

21 Overlap b/w two sequences

22 All pairs alignment Try all pairs – must consider ~ n^2 pairs Smarter solution: only n x coverage (e.g. 8) pairs are possible – Build a table of k-mers contained in sequences (single pass through the genome)‏ – Generate the pairs from k-mer table (single pass through k-mer table)‏

23 Repeats

24 Repeat sequence The top represents the correct layout of three DNA sequences. The bottom shows a repeat collapsed in a misassembly. Computer 35 (7):47-54

25 重覆序列 ■ 重覆頻率分 Interspersed repeats Short interspersed element (SINE), eg. Alu <300 bp Long interspersed element (LINE), ca. 5 kb Tandem repeats Satellite DNA Minisat. & Variable number of tandem repeats Microsat.: mono-, di-, tri-, tetra-nucleotide ■ 重覆方向分 同向重覆序列 反向重覆序列

26 Repeat detection Pre-assembly: find fragments that belong to repeats statistically (Reps)‏ repeat database (RepeatMasker)‏

27 Statistical repeat detection Significant deviations from average coverage flagged as repeats. - frequent k-mers are ignored - “arrival” rate of reads in contigs compared with theoretical value (e.g., 800 bp reads & 8x coverage - reads "arrive" every 100 bp)‏ Problem 1: assumption of uniform distribution of fragments - leads to false positives non-random libraries poor clonability regions Problem 2: repeats with low copy number are missed

28 Scaffolding

29 Sequencing hierarchy Random sequencing – unrelated reads ~700 pairs Assembly – un-related contigs 5K-10K pairs Scaffolding – unrelated scaffolds 30K~ 50K pairs Finishing/gap closure – completed genomes millions-billions of base-pairs

30 Definition

31 Scaffolder output order and orientation of contigs size of gaps between contigs linking evidence: mate-pairs spanning gaps

32 Clone-mates

33 Linking information

34 Hierarchical scaffolding

35 Ambiguous scaffold

36 Phred/Phrap/Consed Analysis

37 What is Phred/Phrap/Consed ? Phred/Phrap/Consed is a worldwide distributed package for: a. Trace file (chromatograms) reading; b. Quality (confidence) assignment to each individual base; c. Vector & repeat sequences identification and masking; d. Sequence assembly; e. Assembly visualization and editing; f. Automatic finishing.

38 How to deal with the enormous amount of reads generated by the high throughput DNA sequencers?

39 Phred Genome Research 8: 175-194

40 Phred Phred is a program that performs several tasks: a.Reads trace files – compatible with most file formats: SCF (standard chromatogram format), ABI, ESD (MegaBACE) and LI-COR. b. Calls bases – attributes a base for each identified peak with a lower errorrate than the staard base calling programs.

41 Phred c. Assigns quality values to the bases – a “Phred value” based on an error rate estimation calculated for each individual base. d. Creates output files – base calls and quality values are written to output files.

42 File Directories chromat_dir/ edit_dir/ phd_dir/

43 Trace File High quality region – no ambiguities (Ns)‏

44 Trace File Medium quality region – some ambiguities (Ns)‏

45 Trace File Poor quality region – low confidence

46 Phred value formula q = - 10 x log10 (p) where q - quality value p - estimated probability error for a base call Examples: q = 20 means p = 10 -2 (1 error in 100 bases)‏ q = 40 means p = 10 -4 (1 error in 10,000 bases)‏

47 Base Calling phred -id. -p -pd../phd_dir phred -view pf84c05.s1

48 The structure of a phd file BEGIN_SEQUENCE 01EBV10201A02.g BEGIN_COMMENT CHROMAT_FILE: EBV10201A02.g ABI_THUMBPRINT: PHRED_VERSION: 0.990722.g CALL_METHOD: phred QUALITY_LEVELS:99 TIME: Thu May 24 00:18:58 2001 TRACE_ARRAY_MIN_INDEX: 0 TRACE_ARRAY_MAX_INDEX: 12153 TRIM: CHEM: term DYE: big END_COMMENT BEGIN_DNA t 8 5 c 13 17 a 19 26 c 19 32 t 6 11908 a 6 11921 g 6 11927 t 6 11947 c 6 11953 a 6 11964 g 6 11981 c 4 11994 n 4 12015 c 4 12037 n 4 12044 n 4 12058 n 4 12071 n 4 12085 n 4 12098 n 4 12111 n 4 12124 c 4 12144 n 4 12151 END_DNA END_SEQUENCE t 24 2221 a 24 2232 a 22 2245 a 27 2261 g 25 2272 c 19 2286 c 12 2302 t 19 2314 g 12 2324 g 15 2331 g 19 2346 g 23 2363 t 33 2378 g 36 2390 c 44 2404 c 44 2419 t 39 2433 a 39 2446 a 34 2460 t 35 2470 g 34 2482 t 16 8191 g 19 8200 t 13 8211 c 13 8229 g 4 8241 n 4 8253 c 4 8263 t 10 8276 t 9 8286 c 12 8301 t 16 8313 c 12 8329 c 12 8336 c 15 8343 t 19 8356 c 9 8371 g 13 8386 g 14 8397 a 7 8417 g 9 8427 g 4 8445

49 phd2fasta phd2fasta program ––converts.phdfiles to sequence in multifasta format ––writes.qualfile (quality scores) for each trace file ––phd2fasta -id../phd_dir -os CLONE.fasta - oq CLONE.fasta.qual Output: ––fasta.seqcontains fastasequences ––fasta.seq.qualcontains quality scores

50 Vector Sequence Cleaning (1)‏ DNA sequence cleaning: quality trimming and vector removal---Lucy: Lucy Steps: –Read input seq#, seq info, and quality info –Chop off splice sites –Remove vector insert –Produce output seq for fragment assembly.

51 Vector Sequence Cleaning (2)‏ Restriction on file name: can’t contain any symbol eg. “–” “. “ “_” Lucy major parameters to set up: -vector vector_completeSeq splice_site_file (splice_site_file: 2 splice-site seq before and after the insertion point on the vector)‏ Lucy Output: –identified locations of good/clean region –trim seq without vector, linker, Ns (<3% Ns)‏

52 splice_site_file ~ 150 bases, 50 bases overlap around splice >PUCsplice.for.begin gattaagttgggtaacgccagggttttcccagtcacgacgttgtaaaacg acggccagtgccaagcttgcatgcctgcaggtcgactctagaggatcccc gggtaccgagctcgaattcgtaatcatggtcatagctgtttcctgtgtga >PUCsplice.for.end acggccagtgccaagcttgcatgcctgcaggtcgactctagaggatcccc gggtaccgagctcgaattcgtaatcatggtcatagctgtttcctgtgtga aattgttatccgctcacaattccacacaacatacgagccggaagcataaa >PUCsplice.rev.begin tttatgcttccggctcgtatgttgtgtggaattgtgagcggataacaatt tcacacaggaaacagctatgaccatgattacgaattcgagctcggtaccc ggggatcctctagagtcgacctgcaggcatgcaagcttggcactggccgt >PUCsplice.rev.end tcacacaggaaacagctatgaccatgattacgaattcgagctcggtaccc ggggatcctctagagtcgacctgcaggcatgcaagcttggcactggccgt cgttttacaacgtcgtgactgggaaaaccctggcgttacccaacttaatc

53 Cross_match

54 cross_match -minmatch 12 -penalty -2 - minscore 20 -screen CLONE.fasta /net/share/sequence_pipeline/vector.fasta

55 Phrap -- Phragment Assembly Program (or Phil’s Revised Assembly Program)‏ Phrap is a program for assembling shotgun DNA sequence dataPhrap is a program for assembling shotgun DNA sequence data Key Features:Key Features: a. Uses the entire read content – no need for trimming.a. Uses the entire read content – no need for trimming. b. User supplied (i.e. Repbase) + internally computed data – better accuracy of assembly in the presence of repeats.b. User supplied (i.e. Repbase) + internally computed data – better accuracy of assembly in the presence of repeats. c. Contig sequence is constituted by a mosaic of the highest quality parts of the reads – it’s not a consensus!c. Contig sequence is constituted by a mosaic of the highest quality parts of the reads – it’s not a consensus!

56 Phrap --Phrap is a program for assembling shotgun DNA sequence data d. Provides extensive information about assembly – contained in phrap.out, *.ace and *.screen.contigs.qual filesd. Provides extensive information about assembly – contained in phrap.out, *.ace and *.screen.contigs.qual files e. Handles very large datasets – hundreds of thousands of reads are easily manipulated.e. Handles very large datasets – hundreds of thousands of reads are easily manipulated. f. Generate output files – contain some important data and enable visualization by other programsf. Generate output files – contain some important data and enable visualization by other programs

57 Banded Search

58 K-mers >GL1234.b1 gattaagttgggtaacgccagggttttcccagtcac… gattaagttgggta attaagttgggtaa ttaagttgggtaac taagttgggtaacg...

59 Phrap output files *.contigs – fasta file containing the contigs*.contigs – fasta file containing the contigs –Contigs with more than one read –Singletons (single reads with a match to some other contig but that couldn’t be merged consistently to it)‏ *.singlets – fasta file of the singlet reads*.singlets – fasta file of the singlet reads –Reads with no match to other read *.ace – allows for viewing the assembly using Consed*.ace – allows for viewing the assembly using Consed *.view – required for viewing the assembly using Phrapview*.view – required for viewing the assembly using Phrapview

60 Phrap parameters phrap -new_ace CLONE.fasta.screen >outfile OPTIONS DEFAULT FUNCTION -penalty -2 ↑=>↑Stringency. -gap_init penalty-2. -gap_ext penalty-1 -minmatch* 14 ↑=>↓time↓Matches -bandwidth14 ↓=>↓time↓String. -minscore30 ↑=>↑String. *highly sensitive! bigger genomes bigger value

61 Phrap parameters OPTIONS DEFAULT FUNCTION -forcelevel0~10 ↓=>↑String. -repeat_stringency 0.95 0 ↑String. -force_high* ↑=>↑String. -revise_greedy** ↓Misassembly -shatter_greedy** ↓ContigLength * Ignore edited high-quality discrepancies **break assembly at weak joins

62 Phrap parameters OPTIONS DEFAULT FUNCTION -max_subclone_size 5000 F.-R. check -default_qual 15 -preassemble* -group_delim* _ *used together

63 Consed Genome Research 8: 195-202, 1998

64 Consed A program for viewing and editing assemblies produced by Phrap Key Features: a. Assembly viewer - allows for visualization of contigs, assembly (aligned reads), quality values of reads and final sequence. b. Trace file viewer – single and multiple trace files can be visualized allowing for comparison of a given sequence in several reads.

65 Consed A program for viewing and editing assemblies produced by Phrap Key Features: c. Navigation – identify and list regions which are below a given quality threshold, contain high quality discrepancies, single-strand coverage, etc. d. Autofinish – automatic set of functions for: gap closure, improvement of sequence quality, determination of relative orientation of contigs, identification of regions covered by a single read or by reads of a single strand. The program automatically performs primer picking and chooses the templates.

66

67

68

69

70

71

72 Phred/Phrap/Consed Pipeline Chromat_dir Phd_dir Edit_dir Directories:

73 Comparison of shotgun sequence data from Wolbachia genome Project Computer 35 (7):47-54

74

75

76 Softwares CAP3 (for EST): http://genome.cs.mtu.edu/cap/cap3.html Phrap (for large genome): http://www.phrap.org --Similar algorithm --Insufficient documentation and support -- Always have to write scripts to parse outputs --NO PERFECT PROGRAM!!!

77 Questions?? wangsj@yahoo.com


Download ppt "Genome sequence assembly concepts and methods Shih-Jon Wang May 6, 2009."

Similar presentations


Ads by Google