Presentation is loading. Please wait.

Presentation is loading. Please wait.

Assembly and Annotation of a 22Gb Conifer Genome, Loblolly Pine Jill Wegrzyn Pieter de Jong, Chuck Langley, Dorrie Main, Keithanne Mockaitis, Steven Salzberg,

Similar presentations

Presentation on theme: "Assembly and Annotation of a 22Gb Conifer Genome, Loblolly Pine Jill Wegrzyn Pieter de Jong, Chuck Langley, Dorrie Main, Keithanne Mockaitis, Steven Salzberg,"— Presentation transcript:

1 Assembly and Annotation of a 22Gb Conifer Genome, Loblolly Pine Jill Wegrzyn Pieter de Jong, Chuck Langley, Dorrie Main, Keithanne Mockaitis, Steven Salzberg, Kristian Stevens, Nick Wheeler, Jim Yorke, Aleksey Zimin, David Neale Univ. of Calfornia, Davis; Children’s Hospital of Oakland Research Institute; Indiana Univ.; Washington State Univ.; Univ. of Maryland; and Johns Hopkins Univ.

2 PineRefSeq Goal To provide the benefits of conifer reference genome sequences to the research, management and policy communities. Specific Objectives – Provide a high-quality reference genome sequence of loblolly pine looking toward sugar pine and then Douglas-fir. – Provide a complete transcriptome resource for gene discovery, reference building, and aids to genome assembly – Provide annotation, data integration, and data distribution through Dendrome and TreeGenes databases.

3 The Large, Complex Conifer Genomes Present a Challenge Challenges – The estimated 22 Gigabase loblolly pine genome is 8 times larger than the human genome – Conifer genomes generally possess large gene families (duplicated and divergent copies of a gene), and abundant pseudo-genes. – The vast majority of the genome appears to be repetitive DNA Approaches to Resolving Challenges – Complementary sequencing strategies that seek to reduce complexity through use of actual or functional haploid genomes and reduced size of individual assemblies.

4 Plant Genome Size Comparisons Image Credit: Modified from Daniel Peterson, Mississippi State University 0 5000 10000 15000 20000 25000 30000 35000 40000 0 1000 2000 3000 Arabidopsis Oryza Populus Sorghum Glycine Zea Pseudotsuga menziesii Taxodium distichum Picea abies Picea glauca Pinus taeda Pinus pinaster 1C DNA content (Mb) Pinus lambertiana P. menziesii

5 Existing and Planned Angiosperm Tree Genomes SpeciesGenome Size 1 In Progress With Draft Assemblies Populus trichocarpaBlack cottonwood423 Mbp Populus nigraBlack poplar480 Mbp Eucalyptus grandisRose gum691 Mbp Eucalyptus globulusBlue gum530 Mbp Eucalyptus camaldulensisRiver red gum624 Mbp Corymbia citriodoraLemon-scented gum370 Mbp Betula nanaDwarf birch450 Mbp Fraxinus excelsiorEuropean ash900 Mbp Malus domesticaApple881 Mbp Prunus persicaPeach227 Mbp Citrus sinensisSweet orange319 Mbp Azadirachta indicaNeem363 Mbp In Progress Or Planned Castanea mollissimaChinese Chestnut800 Mbp Quercus roburPedunculate Oak740 Mbp Populus spp and ecotypesVariousvarious

6 Existing and Planned Gymnosperm Tree Genomes SpeciesGenome Size 1 Status Conifers Picea abiesNorway Spruce20,000 Mbp Draft Complete Picea glaucaWhite Spruce22,000 Mbp Draft Complete Pinus taedaLoblolly Pine22,000 Mbp Draft Complete Pinus lambertianaSugar Pine34,000 MbpPending Pseudotsuga menziesiiDouglas-fir18,700 MbpPending Larix sibiricaSiberian Larch12,030 MbpPending Pinus sibiricaSiberian Pine30,000 MbpPending Pinus pinasterMaritime Pine23,810 MbpPending Pinus sylvestrisScots Pine~23,000 MbpPending 1)Genome size: Approximate total size, not completely assembled.

7 Elements of the Conifer Genome Sequencing Project

8 Acquiring the DNA Haploid Haploid megagametophyte tissue 1N Shotgun sequenced Diploid Diploid needle tissue 2N 40 Kb cloned fosmids, pooled and sequenced Figure Credit: Nicholas Wheeler, University of California, Davis

9 Sequencing Strategy 65X12X

10 Technology for De Novo Sequencing of the Conifer Genomes Parallel and Complementary Approaches Max Output: 95 Gigabases Max. paired end reads - 640 million Max Output: 300 Gigabases Max. paired end reads - 3 billion 1 Effectively haploid

11 Sequencing Strategy Today

12 Megagametophyte Whole Genome Shotgun (M- WGS) Not enough haploid DNA in a megagametophyte to implement a complete list of WGS ingredients. Compromise: Obtain DNA for longer insert linking libraries (> 1kbp) from diploid needle tissue. Prepare only short insert Illumina libraries from megagametophye tissue.

13 Each DNA sample is then run on an Agilent Bioanalyzer to determine a preliminary estimate of insert size and coefficient of variation. If within spec, selected DNA samples are converted into Illumina libraries M-WGS Short Insert Libraries Preliminary QC and Size Selection

14 Libraries are subsequently QCed on the Illumina MiSeq M-WGS Short Insert Libraries Library QC and Titration

15 A k-mer Genome Size Estimate How deep to sequence the libraries? Experimentally – hybridization Computationally (WGS) – choose substring of the reads of length k P. taeda genome size ≅ total k-mers in genome total k-mers in P. taeda genome ≅ total k-mers in P. taeda reads expected number of times a genomically unique k-mer is observed in the reads

16 k-mer Genome Size Estimates Loblolly pine Pinus taeda: 31-mers total: 3.736 x 10 11 Expected k-mer depth: 18.11 Estimated genome size: 20.63 GB High Copy 31-mers 1.09% of distinct 31-mers 33% of all 31-mers 24-mers total: : 4.092 x 10 11 Expected k-mer depth: 19.79 Estimated genome size: 20.68 GB Sugar pine Pinus lambertiana: 31-mers total: 2.776 x 10 11 Expected k-mer depth: 8.12 Estimated genome size: 34.19 GB High Copy 31-mers 0.35% of distinct 31-mers 33% of all 31-mers 24-mers total: 3.031 x 10 11 Expected k-mer depth: 8.89 Estimated genome size: 33.98 GB

17 truly large genomes

18 P. taeda Version 0.9 Library Statistics Haploid short insert libraries – 10 short insert libraries 200 - 640bp – 1.4Tbp GA2x, HiSeq, MiSeq sequence – 65 fold coverage Diploid jumping libraries – 47 jumping libraries 1300 – 5500bp – 280Gbp GA2x sequence – 12 fold coverage 13 Fosmid DiTag Libraries

19 Elements of the Conifer Genome Sequencing Project

20 65X coverage in paired ends from a single seed 1/3 in GAIIx, 160-bp overlapping pairs 2/3 in HiSeq, 100-bp pairs 1.7 billion reads from “jumping” libraries from pine needles, diploid DNA

21 Collect jumping reads from same haplotype 1.7 billion jumping reads (4 Kbp) 93 million Di-Tag reads (36 Kbp) Keep only pairs where both reads match haploid DNA Filter: both reads had to be covered by 52-mers from megagametophyte data

22 How to get all these reads into a single assembly run?

23 Recent Assemblers for Illumina Data MSR-CA (Aleksey Zimin, UMD) – Based on Celera assembler – 454, Illumina, and Sanger reads Allpaths-LG SOAPdenovo Velvet ABySS Contrail SGA

24 Two Classes of Assembly Algorithms Overlap-Layout-Consensus (OLC) – Used by most assemblers for previous generation (Sanger) sequencing – Celera Assembler, PCAP, Phusion, Arachne, etc De Bruijn Graph – Used by most assemblers for Illumina data – SOAPdenovo, Allpaths-LG, Velvet, Abyss, etc We use a combined approach that combines the benefits of both OLC and the De Bruijn Graph in our MSR-CA assembler

25 Combine Benefits of OLC and De Bruijn Graph Benefits of OLC – Can deal with variable length reads and reads from different sequencing platforms – Overlaps can be long and thus more reliable – Overlaps do not have to be exact – Can resolve repeats of up to read size Drawbacks of OLC – Computationally intensive, number of overlaps grows quickly with the number of reads and coverage Benefits of DeBruijn Graph – Computationally faster Drawbacks of DeBruijn Graph – Errors in the reads create spurious branches in the graph requiring error correction Max. size of k-mer is limited by the shortest read size All overlaps in the graph are exact overlaps of k-1 bases Repeats of longer than k bases cannot be resolved – Without space consuming side information

26 Consider a read Consider a read CGACTGACCAGATGACCATGACAGATACATGGT stop CGACTGACCAGATGACCATGACAGATACATGGT stop extend 5 GACTGACCAG ATACATGGTA 10 stop extend 3 CGACTGACCA ATACATGGTC 2 Typically Illumina sequencing projects generate data with high coverage (>50x). With 100bp reads this implies that a new read starts on average at least every other base: Typically Illumina sequencing projects generate data with high coverage (>50x). With 100bp reads this implies that a new read starts on average at least every other base: read R extended to super read S read R extended to super read S super read S (red) super read S (red) the other reads extend to the S as well the other reads extend to the S as well Super reads GOAL: Reduce the amount of input data without losing information

27 Super-Reads Compress the Data 100-fold compression 50% of sequence is in super reads > 500 bp Super-read total: 52 Gbp

28 MaSuRCA assembler performance 64-core computer with 1 Terabyte of RAM Time/memory to assemble: 10 days / 800 GB QuORUM error correction: 10 days / 800 GB 11 days / 400 GB Super-reads construction plus filtering: 11 days / 400 GB 60+ days / 450 Gb Contig and scaffold construction: 60+ days / 450 Gb uses CABOG assembler uses CABOG assembler 8 days / 300 Gb Gap filling with super-reads: 8 days / 300 Gb

29 MSR-CA Output Contigs: contiguous sequences that do not appear to be repetitive (may contain internal repeats). These end up in scaffolds. Scaffolds: ordered and oriented collections of contigs, built using mate pair data. A scaffold can consist of just one contig (a "single- contig" scaffold). Degenerate contigs: contigs that appeared to be repeats according to the coverage statistics. Only placed in scaffolds when linked to contigs via mate pairs. Most of them will end up being placed in more than one location, but many will not appear in any scaffold.

30 P. taeda WGS V0.6 (June 2012) Approximately 35X coverage – 7 billion reads (50 million jumping library reads) – Compressed to 377 million Super-reads Total Sequence: 18,321,727,393 bp Total contig sequence: 14,606,783,345 bp N50 1,199bp (9.16 Gbp is contained in contigs of 1199 bp or longer) Total scaffold sequence (with imputed gaps) : 18,428,460,141bp N50 1,230bp (9.21 Gbp is contained in scaffolds of 1230 bp or longer) Degenerate contig sequence 3.8Gb

31 P. taeda WGS V0.8 (January 2013) Approximately 65X coverage – 16 billion reads (1.7 billion jumping library reads) – Compressed to 150 million Super-reads Total Sequence: 22,518,572,092 bp N50 Contig: 7,083bp N50 Scaffold: 15,885 bp

32 P. taeda WGS V0.9 (March 2013) Total Sequence: 20.1 Gbp Total contig sequence: 2.3 Gbp N50 8,200bp (11.6 million) Total scaffold sequence (with imputed gaps) : 17.8 Gbp N50 30,700bp (4.8 million)

33 Ongoing Efforts Improve MSR-CA scaffolding Transcriptome + WGS assembly Fosmid pool sequencing and assembly GBS to anchor and orient scaffolds Sugar pine genome: 35 Gigabases!

34 Elements of the Conifer Genome Sequencing Project

35 Sequencing Strategy Molecular approach to complexity reduction End of summer 2013

36 Fosmid Pooling: Genome partitioning for reduced assembly complexity The immense and complex diploid pine genome can be economically and efficiently partitioned into smaller, functionally haploid, pieces using pools of fosmid clones. Fosmids in a pool should have a combined insert size far less than a haploid genome size; to ensure haploid genome representation. The sequence data obtained from a single fosmid pool may be up to 80 X deep. The sequence data obtained from a pool must be screened for vector and E. coli contamination Ideally: larger clones (BACs) are more desirable, more likely to span repeats

37 Fosmid Sequence Components Haploid fosmids with vector tagged ends Primary coverage from short insert libraries Additional coverage from long insert libraries from equi-molar pool of pools. Fosmid end sequences (diTags) link ends of the assembly and count fosmids in a pool

38 Fosmid Pools Determining the Best Assembler for the Job quartiles AssemblerStatCountQ1Q2Q3N50Sum Allpaths-LG scf 987249977813027126298 14 x 10 6 ctg 1524235560311250910324 14 x 10 6 scf30K+ 24833595356823836130114 9 x 10 6 MSR-CA scf 21625061375922414753 15 x 10 6 ctg 3519503133950006826 14 x 10 6 scf30K+ 13632603350873811930147 5 x 10 6 SOAP scf 325112318549533389 15 x 10 6 ctg 23873761753481515 15 x 10 6 scf30K+ 32233907357663868333389 12 x 10 6 Assembly results for a relatively large pool of approximately 600 P. taeda fosmids

39 Use Cases for Fosmid Pools Assembler Evaluation Repeat Library Construction SNP Identification

40 Genomic Sequence Pinus taeda BACs and Fosmids Combined sequence resource represents roughly 1% of the estimated 22 GB genome

41 Similarity and De Novo Repeat Identification Tandem Repeat Finder (TRF) Homology (Censor against RepBase) Summary of Repbase v17.07 Number of entries: 28,155 Number of species represented: 715 Number of repeat families: 280 Angiosperm entries: 131 Gymnosperm entries (conifer):15 De Novo (REPET/TEannot) Self-alignment (all vs all) with BLAST to find HSPs is followed by clustering with Grouper, Recon, and Piler 3 sets of clusters are aligned with a MSA (MAP) to derive a consensus sequence Structural search runs simultaneously (LTR Harvest) to detect highly diverged LTRs Final Blastclust to cluster potential sequences

42 Tandem Repeats Comparison across sequenced angiosperms and other gymnosperms (partial) Pinus taeda (BAC + Fosmid)Picea glauca (BAC) Taxus mairei (Fosmid) MicroMiniSatMicroMiniSatMicroMiniSat Most frequent period 221123227122224230 Cumulative length 126,254216,194154,8358768433891,4115871,598 Num. of loci 64,74010,5081,2581110132192 Most frequent period (%) 0.05%0.08%0.06%0.33%0.32%0.15%0.09%0.04%0.10% Total cumulative length (bp) 241,8224,323,3612,650,7409256,6031,8643,02415,8755,871 Total (%) 0.09%1.56%0.96%0.35%2.49%0.70%0.19%0.98%0.36% Total tandem content: 2.6% 3.31% of BACs 2.59% of fosmids

43 Homology Search Results Censor (BLAST-style) comparisons against Repbase Partial and Full-length Interspersed Alignments (compared across species) Full-length Alignments Only Full Length Sequences 80-80-80 Rule (Wicker et al. 2007) 80 bp in length 80% identity 80% coverage

44 Summary of Combined Homology and De Novo Approach 88% repetitive (partial and full-length) 29% repetitive (full-length only defined by 80-80-80) – 87% of the full-length content is characterized as LTR retrotransposons Repeats are highly diverged – Only 23% identified by homology for full and partial elements – Repbase contains just 15 (+5) gymnosperm elements – 6,270 novel families discovered with no homology 5,155 are single copy High copy elements are either Gypsy or Copia LTRs Nested repeats common in LTR retrotransposons

45 Novel Repeat Elements Diverged LTRs are annotated as 6,270 novel families Top 400 elements only cover 12% of the combined sequence sets Repeat familyFull-Length CopiesLength (bp) Percent of Sequence Set TPE11591,077,5980.39% PtPiedmont (93122)133969,1090.35% IFG7162956,0180.34% PtOuachita (B4244)47576,8710.21% Corky78469,2860.17% PtCumberland (B4704)67431,4920.16% PtBastrop (82005)38378,6310.14% PtOzark (100900)32378,0200.14% PtAppalachian (212735) 67367,6530.13% PtPineywoods (B6735)68322,6320.12% PtAngelina (217426)24309,2480.11% Gymny24291,4790.11% PtConagree (B3341)50285,8500.10% PtTalladega (215311)33274,8260.10% Total9827,088,7132.56%

46 Novel Repeat Elements MSA with annotations of the novel Copia LTR -PtPineywoods MSA with annotations of the novel Gypsy LTR - PtAppalachian

47 Elements of the Conifer Genome Sequencing Project

48 Loblolly transcriptome from 30 unique RNA collections Carol Loopstra (RNA) and Keithanne Mockaitis (sequencing)

49 Progressive Transcript Profiling Build a useful transcriptome reference early in project:  generate long reads for ease of assembly, scaffolding of existing shorter data  integrate community data into assemblies Vegetative Organs vegetative buds candles stems needles roots Early Stress Signaling Responses cold heat elevated UV compression Reproductive Development megastrobili microstrobili Early Development seeds young seedlings

50 Transcriptome Assembly Considerable variation in de novo transcriptome assemblies – Used a compare and compete methodology to select the final transcripts – Two Trinity versions and Velvet/Oasis (6 different k-mer sizes) – First analysis: Basic clustering methods with 454 and other protein evidence to determine optimal full-length proteins

51 Coding transcripts, clustered outputs by assembler Protein coding loci, estimated from transcript evidence alone (32 sets): 87,602 unique complete 64,610 mapped to the WGS assembly Protein coding loci, estimated from transcript evidence alone (32 sets): 87,602 unique complete 64,610 mapped to the WGS assembly preliminary results, Keithanne Mockaitis

52 Improving Transcriptome Assembly Improved transcript grouping with exon-aware clustering methods Transcript ClassTotal (Improved Clustering)Mapped to genome v0.9Unmapped Primary, complete CDS87,24183,2713970 Alternate, complete CDS642,175 4092 Partial CDS69,04461,1147930 Alternative partial CDS617,248607,4909758 Duplicates/Paralogs Pseudogenes Too much compression of Unigene set? Mapping OccurrencesComplete transcripts mapped 19840 23574 32542 42624 >/=564690

53 Examining Gene Families MyB Transcription Factors Homeobox Transcription Factors

54 Improving the Assembly with Transcriptome Map WGS (v0.9) against the transcripts with nucmer Iteratively compute alignments and merge scaffolds. 12,000+ scaffolds merged during first pass... V1.0

55 Elements of the Conifer Genome Sequencing Project

56 Source: Jiao et al., Ancestral polyploidy in seed plants and angiosperms, Nature, Vol. 473, May 5, 2011

57 Mapping Full-Length Orthologous Proteins Alignments: exonerate protein2genome, heuristic, 70% query coverage, 70% similarity ~220,000 query proteins total Physcomitrella patens: 2,761 out of 25,506 (10.8%) Selaginella moellendorffii: 2,025 out of 16,821 (12.0%) ‘Basal’ angiosperm: – Amborella trichopoda: 4,076 out of 25,347 (16.1%) Angiosperms: – Arabidopsis thaliana: 4,777 out of 27,986 (17.1%) – Populus trichocarpa: 4,023 out of 18,588 (21.6%) – Sorghum bicolor: 3,368 out of 24,122 (18.1%) – Vitis vinifera: 3,833 out of 18,441 (20.8%) – Glycine max: 9,970 out of 52,178 (19.1%) Gymnosperms: Picea: 6,696/11,065 (60.5%) – The majority of these are Picea sitchensis (Ralph et al., 2008) Pinus: 345/426 (81.0%)

58 Mapping Proteins ~220K full-length proteins and CEGMA analysis BLAT/Exonerate with ~220K proteins – Requiring 70% similarity and 70% query coverage, 45,101 proteins aligned to 11,897 unique scaffolds/contigs CEGMA – Examines conserved eukaryotic core genes (KOGS) – 240 full-length and 197 partial proteins (of 458) – 113 full-length proteins of the 248 in the highly conserved category

59 Training MAKER Pinus taeda resources: ADEPT2 Project Clusters Exon Capture (Neves et al. 2013) PineRefSeq Transcriptome 454 Transcriptome (Lorenz et al. 2012) Pinus Resources: TreeGenes UniGenes Whitebark pine (RNASeq) Sugar pine transcriptome (454 + RNASeq) Limber pine transcriptome (RNASeq) Lodgepole pine (454) (Parchman et al. 2010) Longleaf pine (454) (Lorenz et al. 2012) Picea Resources: TreeGenes UniGenes Sitka spruce (Sanger/454) (Ralph et al. 2008) Norway spruce (454) (Chen et al. 2013) Congenie transcriptome (Nysterdt et al. 2013) Norway spruce (454) (Lorenz et al. 2013) White spruce (454) (Rigault et al. 2011)... Just finished at iPlant (TACC) Running on 8,000 cores…

60 WebApollo on TreeGenes Introns Exon conservation highlighted Supporting EST evidence Intron 2: Size 131,138 Intron 3: Size 179,620

61 Gene family conservation An Example of: jcf7180063228536 4.97Mbp scaffold WebApollo on TreeGenes

62 Conifer specific proteins An Example of: jcf7180063228536 WebApollo on TreeGenes

63 Elements of the Conifer Genome Sequencing Project

64 Dendrome Project TreeGenes Database to Distribute Transcriptome and Genome

65 GENome Sequence Annotation Server (GenSAS) Community Annotation

66 GENome Sequence Annotation Server (GenSAS) Opening Screen After the selected programs run, clicking on the green icon will bring users to a map interface List of available programs to run with optional paramaterization Sequence choices from database or uploaded file

67 GENome Sequence Annotation Server (GenSAS) Map Interface Click on features to see more info Quick Navigation Shortcuts Click to jump positions Task Name Export Layer and Save

68 The Maryland Genome Assembly Group featuring co-PD Steven Salzberg and Daniela Puiu (Johns Hopkins U) and co-PD Jim Yorke and Aleksey Zimin (U of Maryland) PD David Neale (r), co-PD Jill Wegrzyn (c), and (l to r) John Liechty, Ben Figueroa, Patrick McGuire, Pedro J. Martinez- Garcia, Hans Vasquez-Gross UC Davis Co-PD Chuck Langley (r) and (l to r) Marc Crepeau, Kristian Stevens, and Charis Cardeno UC Davis (l to r) Co-PD Pieter de Jong, Ann Holtz-Morris, Maxim Koriabine, Boudewijn ten Hallers CHORI BAC/PAC Co-PD Carol Loopstra and Jeff Puryear TAMU Co-PD Keithanne Mockaitis and Zach Smith Indiana U Co-PD Dorrie Main WSU DP SS AZ JY

Download ppt "Assembly and Annotation of a 22Gb Conifer Genome, Loblolly Pine Jill Wegrzyn Pieter de Jong, Chuck Langley, Dorrie Main, Keithanne Mockaitis, Steven Salzberg,"

Similar presentations

Ads by Google