Sequencing and Assembly of the WheatD Genome using BAC Pools: A Preliminary Study. Daniela Puiu, Sept 23rd 2013.



Outline
  Wheat genomes
  Previous assemblies
  Study goals
  Dataset 1 evaluation
  Dataset 2 evaluation
  Future work

Background
  8-10K years ago: Emmer/Ancient Wheat + Goat grass = Bread/Common Wheat
  Triticum dicoccoides + Aegilops tauschii = Triticum aestivum
  AABB + DD = AABBDD

Wheat WGS Assemblies
  Last 12 months: 3 wheat assembly papers published in Nature!
  WheatABD: hexaploid & very repetitive: HARD TO ASSEMBLE
  WheatD: easier? "the assembly represents 83% of the genome, of which 65% comprised transposable elements."

  assembly  publication     group       technology   cvg  assembler
  wheatABD  Nature,2012-11  UK,454,CSH  454          5    Newbler
  wheatA    Nature,2013-03  BGI         HiSeq        90   SOAPdenovo2
  wheatD                                GAIIx/HiSeq  60?  SOAPdenovo itt, hierarchical

  assembly  #scf (M)  Max (Kb)  N50 (Kb)  AssemblySize (Gb)  GenomeSize (Gb)
  wheatABD  5.32      21                  3.8                17.33
  wheatA    7.98      1,066     63.69     4.67               4.93
  wheatD    0.43      606       57.6      3.31               4.36

WheatD WGS Data
  42X cvg of fragment reads
  25% of the reads shorter than 44bp
  Max insert size of 20K

  technology  type      #libs  #reads (M)  rdLen (bp)  insLen (bp)   cvg
  Illumina    pe        46     1,524       43-149      167-764       39
              mp_short  22     595         43-89       2,000-2,500   8
              mp_med    15     372                     5,000-10,000  5
              mp_long   1      77          48          20,000
              subtotal  84     2,716                   167-20,000    56
  454         se        28     27          44-2,049                  3
  Total                 132    2,743       43-2,049    0-20,000      60

Study Goals
  Improve the WheatD genome assembly!
  Collaboration with UC Davis (Jan Dvorak) & USDA
  Better genomics => better wheat
  Instead of WGS, use the BAC pool strategy
  A BAC pool is a collection of BACs organized as a minimum tiling path; the BACs cover the whole genome, and the position of the BACs on the chromosomes is known.
  Questions:
    Q1: How many BACs/pool? 8?
    Q2: Overlapping vs non-overlapping BACs?
    Q3: Which technology: 454 vs MiSeq? Cost vs quality?
    Q4: Which assembler?
    Q5: Can we use the WGS reads for additional scaffolding?

Q2: Nonoverlapping vs Overlapping?
  Example: pools of 5 BACs
    Nonoverlapping BACs: 1,3,5,8,10
    Overlapping BACs: 1,2,3,4,5
  Issues:
    NonOvl: assignment of contigs to the BACs
    Ovl: coverage doubles in the overlapping regions => possible assembly duplications

Dataset 1
  Q2: Nonoverlapping vs overlapping BACs?
  4 pool sets, 8 BACs/pool, 454_se
    1 BAC = 170K
    8 nonOvl BACs = 1.36M, 0.031% of the genome
    8 ovl BACs = 1.06M, 0.024% of the genome
  320 BACs/pool, 454_pe (~4K inserts)
    1.24% of the genome
  Read length statistics:

  type    pool#  BACs/pool  readType  #reads (K)  max (bp)  mean (bp)  Sum (Mb)  cvg
  nonovl  102    8          se        79          1,623     640        51        37.5
          98     8          se        128         1,126     642        82        60.61
          .      320        pe        1,284       968       221        283       5.2
  ovl     295    8          se        81          1,094     553        44        42.3
          1540   8          se        105         1,106     509        53        50.53
          .      320        pe        1,340       935       195        261       4.81
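The coverage and genome-share figures on this slide follow from simple arithmetic, sketched below. This is a minimal illustration, not the author's script: the function name `pool_stats` is hypothetical, the 4.36 Gb genome size is taken from the wheatD row of the WGS assemblies slide, and the 1.06 Mb span of the overlapping pool is taken from the slide rather than derived.

```python
GENOME_SIZE_BP = 4.36e9   # wheatD genome size from the WGS assemblies slide
BAC_SIZE_BP = 170_000     # "1 BAC = 170K"


def pool_stats(reads_k, mean_read_len, pool_span_bp):
    """Total sequence (Mb), fold coverage and genome share (%) for one pool.

    reads_k       -- number of reads, in thousands (the "#reads K" column)
    mean_read_len -- mean read length in bp
    pool_span_bp  -- number of bases spanned by the BACs in the pool
    """
    total_bp = reads_k * 1_000 * mean_read_len
    return {
        "sum_mb": total_bp / 1e6,
        "cvg": total_bp / pool_span_bp,
        "genome_pct": 100 * pool_span_bp / GENOME_SIZE_BP,
    }


# Spans of the three pool types on the slide.
nonovl_span = 8 * BAC_SIZE_BP    # 8 nonoverlapping BACs = 1.36 Mb
ovl_span = 1.06e6                # 8 overlapping BACs (value from the slide)
pe_span = 320 * BAC_SIZE_BP      # 320-BAC pool used for the 454_pe library

for label, reads_k, mean_len, span in [
    ("nonovl 102", 79, 640, nonovl_span),
    ("ovl 295", 81, 553, ovl_span),
    ("320-BAC pe", 1_284, 221, pe_span),
]:
    print(label, pool_stats(reads_k, mean_len, span))
```

The results land close to the table values; small differences are expected because the slide rounds the total sequence before dividing by the pool span.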

Dataset 1 Assembly
  Ideal assembly results:
    nonoverlapping pools: 8 contigs
    overlapping pools: 1 contig
  A4: Newbler generated longer scaffolds than Celera, Abyss ...
  454_pe selection: alignment based
    1 in 40 mates should align to the pool
    Alignment metrics: mapping score, length, %id?
    Top 2.5%: rdLen=381 alignLen=302 %id=97.5 score=180: 0 mates align
    Mated bwasw alignments: 1 in 3 mates align
    Used mated bwasw alignments: >= 75bp match, 96% identity

  poolType  pool#  #ctg2K+  ctgSum (Mb)  ctgN50 (Kb)  #scf2K+  scfSum (Mb)  scfN50 (Kb)
  nonovl    102    44       1.25         46.13        35       1.26         51.37
            98     49       1.17         38.39        34       1.23         65.88
  ovl       295    65       1.00         28.65        15       1.03         556.87
            1540   41       0.64         8.36         20       0.65         85.78
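The mate selection rule (keep mated alignments with at least a 75 bp match and 96% identity) could be expressed roughly as follows. This is a sketch under assumptions: the `MateAlignment` record and `select_mated_pairs` function are hypothetical, "mated" is read here as "both mates of a pair pass the thresholds", and parsing of the actual bwasw output is left out.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MateAlignment:
    """One aligned mate (hypothetical record, not a bwasw data structure)."""
    pair_id: str         # identifier shared by both mates of a pair
    mate: int            # 1 or 2
    match_len: int       # aligned length in bp
    pct_identity: float  # percent identity of the alignment


def select_mated_pairs(alignments, min_match=75, min_identity=96.0):
    """Return the IDs of pairs whose two mates both pass the thresholds."""
    passing = defaultdict(set)
    for aln in alignments:
        if aln.match_len >= min_match and aln.pct_identity >= min_identity:
            passing[aln.pair_id].add(aln.mate)
    # A pair is kept only if mate 1 and mate 2 each have a passing alignment.
    return {pair_id for pair_id, mates in passing.items() if mates == {1, 2}}


example = [
    MateAlignment("p1", 1, 302, 97.5),
    MateAlignment("p1", 2, 120, 98.0),
    MateAlignment("p2", 1, 300, 99.0),   # mate 2 is too short
    MateAlignment("p2", 2, 60, 99.0),
    MateAlignment("p3", 1, 200, 94.0),   # identity below threshold
    MateAlignment("p3", 2, 200, 97.0),
]
print(sorted(select_mated_pairs(example)))
```

Requiring both mates to pass is the stricter reading; relaxing it to "either mate" would pull in more pairs from the 320-BAC library at the cost of more off-pool reads.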

Dataset 1 Assembly (cont)
  A2: Overlapping pools seem to work best

Dataset 2
  Q3: 454 vs MiSeq?
  Q5: Can we use the WGS reads for additional scaffolding?
  4 pool sets, 8 BACs/pool, both 454 & MiSeq reads
    454_se, 454_pe (320 BACs/pool, 4Kbp inserts)
    MiSeq_pe (625bp ins)
  Read length statistics:

                           454                                 MiSeq
  pool type  pool#  #BACs  #reads (K)  rdMea  rdSum (M)  cvg   #reads (K)  rdMea  rdSum (M)  cvg
  ovl        240    8      50          580    29         27    737         251    182        171
             168           58          533    31               1,075              265        250 ?
             145           85          545    46         44    1,615              399        376
             192           108         615    66         63    932                230        217

Dataset 2 (cont)
  K-mer histograms show large variation in coverage
  Multiple peaks: expected cvg, overlapping regions, other long repeats?
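For readers unfamiliar with k-mer histograms, the sketch below shows how one is built: count every k-mer in the reads, then count how many distinct k-mers occur once, twice, and so on. It is a toy illustration with hypothetical function names, it does not merge reverse-complement k-mers, and a dedicated k-mer counter would be used on real read sets.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurring in the reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_histogram(counts):
    """Map multiplicity -> number of distinct k-mers seen that many times."""
    return Counter(counts.values())


reads = ["ACGTACGT", "CGTACGTA"]
hist = kmer_histogram(kmer_counts(reads, k=4))
for multiplicity in sorted(hist):
    print(multiplicity, hist[multiplicity])
```

In a pool with uniform coverage the histogram has a single main peak near the expected coverage; regions where BACs overlap, or repeats, show up as extra peaks at higher multiplicities, which is the pattern described on the slide.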

BAC Pool 145 Assembly

  Assembler         #scf 2K+  Max (Kb)  Sum (Mb)  N50 (Kb)
  NEWBLER           51        99        1.21      40
  NEWBLER.pe        16        244       1.23      122
  SOAPdenovo2 K127  22        173                 134
  ABYSS-pe k191     38        138       1.27      83
  MaSuRCA           75        129       1.32      42
  SPAdes k127       60        127       1.19      33
  Velvet            108       71        1.17      15
  Discovar          141       78        2.06      28

  [Figures: assembler N50; nucmer -l 127 alignment of Newbler.pe vs SOAPdenovo2 scaffolds]
  A4: Newbler (v2.8) & SOAPdenovo2 (r240, K=127) => longest scaffolds
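The N50 values used to compare assemblers can be computed as in the sketch below: sort the scaffold lengths from longest to shortest and report the length at which the running total reaches half of the assembly size. The function name and the `min_len` parameter (mirroring the "2K+" cutoff in the table) are illustrative, not taken from the slides.

```python
def n50(lengths, min_len=0):
    """Return the N50 of the sequences that are at least min_len long.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total assembled bases.
    """
    kept = sorted((n for n in lengths if n >= min_len), reverse=True)
    half = sum(kept) / 2
    running = 0
    for n in kept:
        running += n
        if running >= half:
            return n
    return 0  # no sequence passed the length filter


scaffolds = [2, 3, 4, 5, 6]
print(n50(scaffolds))  # 6 + 5 = 11 covers half of the 20 total bases, so this prints 5
```

Because N50 depends on which sequences are included, the same cutoff (here scaffolds of 2 Kb or longer) has to be applied to every assembler before the values are compared.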

Dataset 2 Assembly
  Using additional WGS libraries for scaffolding (2-10K ins, rdLen=89 & 20K ins)

  pool  Assembler/Dataset      #scf(2Kb+)  max(Kb)  sum(Kb)  n50(Kb)
  240   NEWBLER                37          90       833      29
        NEWBLER.pe             20          157      840      117
        SOAPdenovo2            25          288      875      100
        SOAPdenovo2.wgs.2K     23          350      903      157*
  168   NEWBLER                50          115      869
        NEWBLER.pe                         150      893      67
        SOAPdenovo2                        181      932      120
        SOAPdenovo2.wgs.2K     10          306      937      156*
  145   NEWBLER                51          99       1,207    40
        NEWBLER.pe             16          244      1,226    122
        SOAPdenovo2            22          173               134
        SOAPdenovo2.wgs.2K     11          336      1,234    148
        SOAPdenovo2.wgs.5K     13          184               137
        SOAPdenovo2.wgs.10K                279      1,257
        SOAPdenovo2.wgs.20K    15          180      1,250
        SOAPdenovo2.wgs.2-20K              290      1,254    202*
  192   NEWBLER                            169      921      48
        NEWBLER.pe                         202      929      140
        SOAPdenovo2            18          203      943      102
        SOAPdenovo2.wgs.2K     12          230      950      188*

Dataset 2 Assembly (cont)
  A3: MiSeq seems to work better than 454
  A5: Using WGS reads => longer scaffolds

Prolamin Gene Locus: 2.2Mb

                  Begin1(Mb)  End1(Mb)  Begin2(Mb)  End2(Mb)  Len(Kb)  Dist(Mb)
  Longest Repeat  1.467       1.477     1.565       1.574     9.6      0.1

Future Work
  "Re-assemble the 454 data for the 2.2 Mb prolamin gene locus, assembled and validated with the BioNano mapping system."
  Evaluate other scaffolding & finishing tools
  Optimal coverage?
  How can we further improve the assemblies?
  Suggestions?