Sequencing and Assembly of the WheatD Genome using BAC Pools A Preliminary Study Daniela Puiu Sept 23rd 2013.


1 Sequencing and Assembly of the WheatD Genome using BAC Pools A Preliminary Study Daniela Puiu Sept 23rd 2013

2 Outline
Wheat genomes
Previous assemblies
Study Goals
Dataset 1 evaluation
Dataset 2 evaluation
Future work

3 Background
8-10K years ago
Emmer/Ancient Wheat + Goat grass = Bread/Common Wheat
Triticum dicoccoides + Aegilops tauschii = Triticum aestivum
AABB + DD = AABBDD

4 Wheat WGS Assemblies
Last 12 months: 3 Wheat assembly papers published in Nature!
WheatABD: hexaploid & very repetitive: HARD TO ASSEMBLE
WheatD: easier? “the assembly represents 83% of the genome, of which 65% comprised transposable elements.”
          publication  group       technology   cvg  assembler
wheatABD  Nature       UK,454,CSH  454          5    Newbler
wheatA    Nature       BGI         HiSeq        90   SOAPdenovo2
wheatD    -            -           GAIIx/HiSeq  60?  SOAPdenovo itt, hierarchical
          #scf M  Max Kb  N50 Kb  AssemblySize Gb  GenomeSize Gb
wheatABD  5.32    21      -       3.8              17.33
wheatA    7.98    1,066   63.69   4.67             4.93
wheatD    0.43    606     57.6    3.31             4.36

5 WheatD WGS Data
42X cvg of fragment reads
25% of the reads shorter than 44bp
Max insert size of 20K
          type      #libs  #readsM  rdLen     insLen        cvg
Illumina  pe        46     1,524    43-149    -             39
          mp_short  22     595      43-89     2,000-2,500   8
          mp_med    15     372      -         5,000-10,000  5
          mp_long   1      77       48        20000         -
          all       84     2,716    -         167-20,000    56
454       se        28     27       44-2,049  -             3
Total               132    2,743    43-2,049  0-20,000      60

6 Study Goals
Improve the WheatD genome assembly!
Collaboration with UC Davis (Jan Dvorak) & USDA
Better genomics => better wheat
Instead of WGS, use the BAC pool strategy
A BAC pool consists of a collection of BACs organized as a Minimum Tiling Path; the BACs cover the whole genome, and the position of the BACs on the chromosomes is known.
Questions:
Q1: How many BACs/pool? 8?
Q2: Overlapping vs non-overlapping BACs?
Q3: Which technology: 454 vs MiSeq? Cost vs quality?
Q4: Which assembler?
Q5: Can we use the WGS reads for additional scaffolding?

7 Q2: Nonoverlapping vs Overlapping?
Example: 5-BAC pools
Nonoverlapping BACs: 1,3,5,8,10
Overlapping BACs: 1,2,3,4,5
Issues:
NonOvl: assignment of contigs to the BACs
Ovl: coverage doubles in the overlapping regions => possible assembly duplications
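The coverage-doubling issue for overlapping pools can be illustrated with a small sketch. It lays 5 BACs along a tiling path and counts how many BACs cover each Kb. The 170 Kb BAC length comes from the slides; the 40 Kb overlap between neighbours and the helper names are assumptions made only for illustration.

```python
BAC_LEN_KB = 170   # from the slides: 1 BAC = 170K
OVERLAP_KB = 40    # hypothetical overlap between neighbouring BACs


def tiling_path(n_bacs, bac_len, overlap):
    """Return (start, end) half-open intervals, in Kb, for consecutive overlapping BACs."""
    step = bac_len - overlap
    return [(i * step, i * step + bac_len) for i in range(n_bacs)]


def depth_profile(intervals):
    """Count how many BACs cover each 1 Kb position of the pooled region."""
    span = max(end for _, end in intervals)
    depth = [0] * span
    for start, end in intervals:
        for pos in range(start, end):
            depth[pos] += 1
    return depth


ovl_bacs = tiling_path(5, BAC_LEN_KB, OVERLAP_KB)
depth = depth_profile(ovl_bacs)
doubled_kb = sum(1 for d in depth if d == 2)
print(f"pool span: {len(depth)} Kb, covered twice: {doubled_kb} Kb")
```

With equal sequencing effort per BAC, the positions covered twice receive roughly double the read depth, which is the pattern an assembler may misread as a duplication.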

8 Dataset 1
Q2: Nonoverlapping vs overlapping BACs?
4 Pool Sets, 8 BACs/pool, 454_se
1 BAC = 170K
8 nonOvl BACs = 1.36M % of the genome
8 ovl BACs = 1.06M % of the genome
320 BACs/pool, 454_pe (~4K inserts) % of the genome
Read length statistics:
type    pool#  BACs/pool  readType  #reads K  max    mean  Sum Mb  cvg
nonovl  102    8          se        79        1,623  640   51      37.5
nonovl  98     8          se        128       1,126  642   82      60.61
nonovl  -      320        pe        1,284     968    221   283     5.2
ovl     295    8          se        81        1,094  553   44      42.3
ovl     1540   8          se        105       1,106  509   53      50.53
ovl     -      320        pe        1,340     935    195   261     4.81
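The cvg column appears to be the read sum divided by the pooled BAC length. A minimal sketch of that arithmetic, assuming the 170 Kb per-BAC figure from this slide and a hypothetical helper name:

```python
BAC_LEN_MB = 0.17  # from the slide: 1 BAC = 170K


def coverage(read_sum_mb, pool_size_mb):
    """Fold coverage = total sequenced bases / total target length (both in Mb)."""
    return read_sum_mb / pool_size_mb


nonovl_pool_mb = 8 * BAC_LEN_MB     # 8 non-overlapping BACs
pe_pool_mb = 320 * BAC_LEN_MB       # 320-BAC paired-end pool

cvg_pool_102 = round(coverage(51, nonovl_pool_mb), 1)   # pool 102, 454_se
cvg_pe_nonovl = round(coverage(283, pe_pool_mb), 1)     # 320-BAC pool, 454_pe
print(cvg_pool_102, cvg_pe_nonovl)
```

These two values match the 37.5 and 5.2 reported in the table. The other rows agree only approximately, presumably because the Sum Mb column is rounded.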

9 Dataset 1 Assembly
Ideal assembly results:
nonoverlapping pools: 8 contigs
overlapping pools: 1 contig
A4: Newbler generated longer scaffolds than Celera, Abyss ...
454_pe selection: alignment based
1 in 40 mates should align to the pool
Alignment metrics: mapping score, length, %id?
Top 2.5%: rdLen=381 alignLen=302 %id=97.5 score=180 : 0 mates align
Mated bwasw alignments: 1 in 3 mates align
Used mated bwasw alignments >= 75bp+ match, 96% identity
poolType  pool#  #ctg2K+  ctgSum Mb  ctgN50 Kb  #scf2K+  scfSum Mb  scfN50 Kb
nonovl    102    44       1.25       46.13      35       1.26       51.37
nonovl    98     49       1.17       38.39      34       1.23       65.88
ovl       295    65       1.00       28.65      15       1.03       556.87
ovl       1540   41       0.64       8.36       20       0.65       85.78
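The paired-end read selection described on this slide can be sketched as a simple threshold filter. The 75 bp match and 96% identity cut-offs are from the slide; the record type, its field names, and the rule that both mates must pass are assumptions for illustration, not the actual bwasw output format or the author's script.

```python
from dataclasses import dataclass

# Thresholds quoted on the slide.
MIN_MATCH_BP = 75
MIN_IDENTITY_PCT = 96.0


@dataclass
class MateAlignment:
    """Hypothetical, simplified alignment record (not the bwasw output format)."""
    pair_id: str
    match_len: int
    pct_identity: float


def passes(aln):
    """True when one mate meets both the length and the identity threshold."""
    return aln.match_len >= MIN_MATCH_BP and aln.pct_identity >= MIN_IDENTITY_PCT


def select_pairs(pairs):
    """Assumption: keep a pair only when both of its mates pass the thresholds."""
    return [left.pair_id for left, right in pairs if passes(left) and passes(right)]


# An 8-BAC pool drawn from a 320-BAC paired-end pool: the "1 in 40" on the slide.
expected_fraction = 8 / 320

pairs = [
    (MateAlignment("p1", 302, 97.5), MateAlignment("p1", 120, 98.0)),
    (MateAlignment("p2", 302, 97.5), MateAlignment("p2", 60, 99.0)),   # one mate too short
    (MateAlignment("p3", 90, 95.0), MateAlignment("p3", 200, 99.0)),   # one mate below 96% identity
]
kept = select_pairs(pairs)
print(kept, expected_fraction)
```

The expected fraction (8/320 = 2.5%) is also why the slide looks at the top 2.5% of alignments.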

10 Dataset 1 Assembly (cont)
A2: Overlapping pools seem to work best

11 Dataset 2
Q3: 454 vs MiSeq?
Q5: Can we use the WGS reads for additional scaffolding?
4 Pool Sets, 8 BACs/pool, both 454 & MiSeq reads
454_se, 454_pe (320 BACs/pool, 4Kbp inserts)
MiSeq_pe (625bp ins)
Read length statistics:
                              454                            MiSeq
pool type  pool#  #BACs/pool  #reads K  rdMea  rdSum M  cvg  #reads K  rdMea  rdSum M  cvg
ovl        240    8           50        580    29       27   737       251    182      171
ovl        168    8           58        533    31       ?    1,075     -      265      250
ovl        145    8           85        545    46       44   1,615     -      399      376
ovl        192    8           108       615    66       63   932       -      230      217

12 Dataset 2 (cont)
Kmer histograms show large variation in coverage
Multiple peaks: expected cvg, overlapping regions, other long repeats?
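A k-mer histogram of the kind referred to here records, for each multiplicity, how many distinct k-mers occur that many times in the reads. The sketch below is a toy version under stated assumptions: the reads and k are made up, and it does not merge reverse-complement k-mers as dedicated k-mer counters usually do.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurrence across all reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_histogram(counts):
    """Map multiplicity -> number of distinct k-mers seen that many times."""
    return Counter(counts.values())


reads = ["ACGTACGT", "CGTACGTA"]   # toy reads, not real data
hist = kmer_histogram(kmer_counts(reads, k=4))
for multiplicity in sorted(hist):
    print(multiplicity, hist[multiplicity])
```

In real data, a peak near the expected coverage corresponds to unique sequence, while peaks at about twice that value would be consistent with the doubled regions of overlapping BACs or with two-copy repeats.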

13 BAC Pool 145 Assembly
[Figure: Assembler N50]
Assembler         #scf 2K+  Max Kb  Sum Mb  N50 Kb
NEWBLER           51        99      1.21    40
NEWBLER.pe        16        244     1.23    122
SOAPdenovo2 K127  22        173     -       134
ABYSS-pe k191     38        138     1.27    83
MaSuRCA           75        129     1.32    42
SPAdes k127       60        127     1.19    33
Velvet            108       71      1.17    15
Discovar          141       78      2.06    28
[Figure: nucmer -l 127, Newbler.pe vs SOAPdenovo2 scaffolds]
A4: Newbler (v2.8) & SOAPdenovo2 (r240, K=127) => longest scaffolds
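The assemblers above are ranked by scaffold N50: the length L such that scaffolds of length L or longer contain at least half of the assembled bases. A minimal sketch of the standard calculation follows; the `min_len` filter mirrors the "2K+" cut-off in the table, and the example lengths are invented.

```python
def n50(lengths, min_len=0):
    """Return the N50 of a list of sequence lengths, ignoring those below min_len."""
    kept = sorted((x for x in lengths if x >= min_len), reverse=True)
    total = sum(kept)
    running = 0
    for length in kept:
        running += length
        if running * 2 >= total:   # reached half of the total assembled length
            return length
    return 0                       # empty input


toy_scaffolds_kb = [2, 3, 4, 5, 6, 7, 8, 9, 10]   # hypothetical lengths in Kb
print(n50(toy_scaffolds_kb))
print(n50(toy_scaffolds_kb + [1, 1, 1], min_len=2))
```

Because N50 depends on the total length, dropping short scaffolds with `min_len` can change the value, so N50 figures are comparable only when the same cut-off is applied to every assembly.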

14 Dataset 2 Assembly
Using additional WGS libraries for scaffolding (2-10K ins, rdLen=89 & 20K ins)
pool  Assembler/Dataset      #scf(2Kb+)  max(Kb)  sum(Kb)  n50(Kb)
240   NEWBLER                37          90       833      29
240   NEWBLER.pe             20          157      840      117
240   SOAPdenovo2            25          288      875      100
240   SOAPdenovo2.wgs.2K     23          350      903      157*
168   NEWBLER                50          115      869      -
168   NEWBLER.pe             -           150      893      67
168   SOAPdenovo2            -           181      932      120
168   SOAPdenovo2.wgs.2K     10          306      937      156*
145   NEWBLER                51          99       1,207    40
145   NEWBLER.pe             16          244      1,226    122
145   SOAPdenovo2            22          173      -        134
145   SOAPdenovo2.wgs.2K     11          336      1,234    148
145   SOAPdenovo2.wgs.5K     13          184      -        137
145   SOAPdenovo2.wgs.10K    -           279      1,257    -
145   SOAPdenovo2.wgs.20K    15          180      1,250    -
145   SOAPdenovo2.wgs.2-20K  -           290      1,254    202*
192   NEWBLER                -           169      921      48
192   NEWBLER.pe             -           202      929      140
192   SOAPdenovo2            18          203      943      102
192   SOAPdenovo2.wgs.2K     12          230      950      188*

15 Dataset 2 Assembly (cont)
A3: MiSeq seems to work better than 454
A5: Using WGS reads => longer scaffolds

16 Prolamine Gene Locus: 2.2Mb
                Begin1(Mb)  End1(Mb)  Begin2(Mb)  End2(Mb)  Len(Kb)  Dist(Mb)
Longest Repeat  1.467       1.477     1.565       1.574     9.6      0.1

17 Future Work
“Re-assemble the 454 data for the 2.2 Mb prolamin gene locus, assembled and validated with the BioNano mapping system.”
Evaluate other scaffolding & finishing tools
Optimal coverage?
How can we further improve the assemblies?
Suggestions?

