Tomato genome annotation pipeline in Cyrille2


1 Tomato genome annotation pipeline in Cyrille2
Erwin Datema

2 Contents of the annotation pipeline
Annotation on the BAC level
  Gene prediction
  Repeat identification
  Other features
Annotation on the gene level (work in progress)
  blastx vs NCBI’s nr (sequence similarity)
  InterProScan (domain identification)

3 Ab initio gene structure prediction
Ab initio predictors included in the pipeline
  Genscan
  GlimmerHMM (trained on tomato!)
  GeneId (has been trained on Solanaceae)
  SNAP
  Augustus (predicts alternatively spliced variants)

4 Alignment-based gene structure prediction (1)
Transcript alignment (blastn + Sim4)
  SGN tomato UniGenes ( UniGenes)
  SGN potato UniGenes ( UniGenes)
  SGN coffee UniGenes ( UniGenes)
  SGN pepper UniGenes (9,554 UniGenes)
  SGN petunia UniGenes (5,135 UniGenes)
  SGN S. melongena UniGenes (1,841 UniGenes)
  NCBI full-length tomato cDNAs (678 cDNAs)
Protein alignment (tblastn + GeneWise)
  TAIR6 Arabidopsis thaliana proteome ( proteins)
  TIGR4 Oryza sativa proteome ( proteins)
  UniProt Plant division ( proteins)

5 Additional feature prediction
Repeat identification
  Tandem Repeats Finder
  RepeatMasker
    RepBase + ‘default’ features (low complexity, etc.)
    TIGR Solanum lycopersicon repeat library V2
    SGN Solanum lycopersicon UniRepeats
Feature prediction
  tRNAscan-SE
  MarScan
  GeneSplicer
Marker identification (blastn + Sim4)

6 Preliminary results
Annotation of chromosome 6 BACs phase 1, 2 and 3
  632 contigs
Older version of the pipeline
  GlimmerHMM only trained on Arabidopsis
  2 UniGene sets (tomato, potato)
  2 protein sets (Arabidopsis, UniProt plant)
  Protein alignment parameters too strict

7 The genomic landscape of chromosome 6
632 contigs have been annotated
Length of contigs varies between 348 – nt
Average length of nt, median length of nt
Total length of nt
GC content: 29.9% min, 34.1% avg, 42.2% max (sequences longer than nt)
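
Summary figures of this kind (contig count, minimum, average, median and total length, GC content) can be computed directly from a FASTA file of the contigs. The following is a minimal sketch and not the Cyrille2 code: the function names, the `min_gc_length` parameter and the choice to average GC content per contig are assumptions made for illustration.

```python
from statistics import mean, median


def parse_fasta(text):
    """Return {name: sequence} from FASTA-formatted text (headers assumed non-empty)."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # keep only the identifier, drop the description
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


def gc_percent(seq):
    """GC content as a percentage of unambiguous (A/C/G/T) bases."""
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def contig_summary(records, min_gc_length=0):
    """Length statistics over all contigs; GC statistics only over contigs
    longer than min_gc_length (hypothetical cutoff parameter)."""
    if not records:
        raise ValueError("no contigs given")
    lengths = [len(seq) for seq in records.values()]
    gc = [gc_percent(seq) for seq in records.values() if len(seq) > min_gc_length]
    return {
        "contigs": len(lengths),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "mean_length": mean(lengths),
        "median_length": median(lengths),
        "total_length": sum(lengths),
        # Assumption: the average is taken over per-contig GC values,
        # not over the pooled sequence.
        "gc_min": min(gc) if gc else None,
        "gc_mean": mean(gc) if gc else None,
        "gc_max": max(gc) if gc else None,
    }
```

Calling `contig_summary(parse_fasta(open("contigs.fa").read()), min_gc_length=1000)` would give the statistics for a hypothetical file `contigs.fa`; the cutoff value here is only a placeholder.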

8 Ab initio gene prediction
Note: Augustus predictions include up to 3 splice variants per gene
Estimated gene density is 1 gene per 5 kb
~1,200 genes in currently sequenced BACs
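
The two estimates on this slide are linked by simple arithmetic: at 1 gene per 5 kb, about 1,200 genes corresponds to roughly 6 Mb of sequence. A small sketch of that calculation, with hypothetical function names and assuming a uniform density across all contigs:

```python
def estimated_gene_count(total_length_nt, nt_per_gene=5_000):
    """Number of genes expected in total_length_nt at one gene per nt_per_gene."""
    return total_length_nt / nt_per_gene


def implied_sequence_length(gene_count, nt_per_gene=5_000):
    """Sequence length (nt) implied by gene_count at one gene per nt_per_gene."""
    return gene_count * nt_per_gene


# About 1,200 genes at 1 gene per 5 kb implies about 6,000,000 nt (6 Mb).
print(implied_sequence_length(1_200))
```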

9 Transcript alignment-based gene prediction
Tomato UniGenes (derived from ESTs)
  574 hits to the contigs
Potato UniGenes (derived from ESTs)
  631 hits to the contigs

10 Protein alignment-based gene prediction
UniProt Plant proteins
  protein sequences from the plant kingdom
  195 hits to the contigs
Arabidopsis thaliana TAIR6 annotation
  protein sequences
  228 hits to the contigs

11 Repeat density
TIGR Tomato Repeat Library (95 repeats)
  118 regions spanning nt
  Minimum 48 nt, average 449 nt, maximum nt
SGN Tomato UniRepeats (668 repeats)
  2,860 regions spanning nt
  Minimum 10 nt, average 427 nt, maximum nt
Tandem repeats
  1,313 regions spanning nt
  Minimum 24 nt, average 120 nt, maximum nt
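
Per-library figures like these (number of regions, nucleotides spanned, minimum, average and maximum region length) can be derived from a list of annotated intervals. The sketch below is an illustration only: `summarize_regions` is a hypothetical helper, coordinates are assumed to be 1-based and inclusive on a single contig, and overlapping regions are counted once in the span, which may differ from how the slide's totals were obtained.

```python
def summarize_regions(regions):
    """Summarize (start, end) intervals, 1-based inclusive, on one contig."""
    if not regions:
        raise ValueError("no regions given")
    lengths = [end - start + 1 for start, end in regions]

    # Merge overlapping intervals so that each covered base is counted once.
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(regions):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    covered += cur_end - cur_start + 1

    return {
        "regions": len(regions),
        "spanned_nt": covered,
        "min_length": min(lengths),
        "mean_length": sum(lengths) / len(lengths),
        "max_length": max(lengths),
    }
```

For several contigs, the function would be applied per contig and the results combined, since intervals on different contigs must not be merged.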

12 Additional features
74 markers could be aligned
  alignment quality unverified
39 predicted tRNA genes
1,301 predicted MAR/SAR elements

13 Generic Genome Browser (1)

14 Generic Genome Browser (2)

15 Generic Genome Browser (3)

16 Recent work
GeneModelCollector
  Tries to find ‘full’ open reading frames in aligned UniGenes
  Automatic generation of gene predictor training set
  Parameters?
JIGSAW
  Appears not to provide a prediction for every region which contains annotations
  Training?
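
The slide does not say how GeneModelCollector decides that an open reading frame is ‘full’. As one possible reading, the sketch below looks for the longest ORF that runs from an ATG to an in-frame stop codon on the forward strand of a transcript sequence. The function name and the criteria (forward strand only, standard stop codons, no minimum length) are assumptions for illustration, not the tool's actual behaviour.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_full_orf(seq):
    """Return (start, end) of the longest ATG..stop ORF on the forward strand,
    as 0-based half-open coordinates including the stop codon, or None."""
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon since the last stop in this frame
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end)
                start = None  # look for the next ORF in this frame
    return best
```

A transcript with a start codon but no in-frame stop returns `None` here, which matches the idea of keeping only complete ORFs as candidate training genes; a real training-set builder would likely also check the reverse strand and apply a minimum length.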

17 Future Work – Tomato Annotation Pipeline
Gene prediction
  Combining predictions into a single consensus model
  Train individual predictors with recently curated tomato gene set
Automated functional annotation of genes
  “Giving a biological meaning to the nicely colored bars”
  blastx
  InterProScan

18 Future Work – Tomato Genome Browser
Annotation of features
  Meaningful names for features such as genes, marker alignments, blast hits
  More detailed and more readable data when clicking on a feature
Links to external data sources
  NCBI GenBank
  SGN

19 Acknowledgements
Cyrille2 development
  Mark Fiers
  Ate van der Burgt
  Joost de Groot
Tomato BAC sequencing (chromosome 6)
  Greenomics
Supervision
  Willem Stiekema
  Roeland van Ham

