Presentation is loading. Please wait.

Presentation is loading. Please wait.

2 Unité de Biométrie et d’Intelligence Artificielle (UBIA) INRA

Similar presentations

Presentation on theme: "2 Unité de Biométrie et d’Intelligence Artificielle (UBIA) INRA"— Presentation transcript:

1 2 Unité de Biométrie et d’Intelligence Artificielle (UBIA) INRA
Helianthus annuus genome annotation HA412.v1.1.bronze update Sébastien Carrere1 Ludovic Legrand1, Jérôme Gouzy1 Erika Sallet1, Thomas Schiex2 1 Laboratoire des Interactions Plantes Microorganismes (LIPM) INRA/CNRS 2 Unité de Biométrie et d’Intelligence Artificielle (UBIA) INRA I’m going to talk about sunflower genome annotation The tools and results that I'm going to talk about have been mainly driven by Jerome Gouzy’s group at LIPM in Toulouse.

2 Summary Genome annotation Web tools The EuGene gene finder
Sunflower bronze annotation pipeline Annotation summary Web tools Genome Browser Annotation Browser Sequence-based tools I will start with an overview of the annotation process introducing the EuGene software. I’ll focus on specific features and associated pipeline tuning of the sunflower genome and especially the bronze assembly sequence. And I will give you a brief summary of the results. Most of my speech will be dedicated to present you the tools we provide to help you to mine all the annotation data. January 2015 –

3 Annotation of protein coding genes EuGene: an integrative gene finder
Integration of different types of evidences protein similarities, evidences of transcription, etc. Alternative splicing prediction Combiner Integrating predictions from other gene finders (e.g FGENESH) Result in GFF3 format Ensuring interoperability with a large number of software and databases (chado, gbrowse, jbrowse, apollo, etc..) Availability Open source software (Artistic License) Foissac S, Gouzy J, Rombauts S, Mathé C, Amselem J, Sterck L, Van de Peer Y, Rouzé P, Schiex T: Genome Annotation in Plants and Fungi: EuGène as a Model Platform. Current Bioinformatics 2008, 3:87-97. Firstly, let's focus on protein coding genes annotation. We have been developing a software called Eugene in collaboration with Thomas Schiex for now more than ten years. Main current developments are now handled by Erika Sallet. Eugene is an integrative gene finder in many ways: - it integrates a wide range of evidences types : protein similarities, evidences of transcriptions (RNAseq data, ESTs mapping) - But it also manages to integrate different sources of predictions like other alternative gene splicing site predictors, start predictors, gene predictors and so on. Eugene produces GFF3 compliant files that ensures interoperability with the main genome browsers and annotation environments. Eugene (It) is an open source software but requires strong skills to be used in a relevant way. January 2015 –

4 Sunflower EuGene pipeline
EuGene software is part of the EuGene pipeline. Here is an overview of how EuGene pipeline has been adapted to fit with the sunflower bronze genome assembly. Main features of Eugene softwareare conserved: we use similarity evidences against reference databases and transcription evidences using sunflower transcriptomics assemblies available. But a strong specific effort has been carried out to deal with large amount of transposable elements that can lower annotation quality. I’ll will now focus onto this problem. January 2015 –

5 Sunflower EuGene pipeline
Deal with the large amount of transposable elements Reference database contains a lot of TE Need to clean these databases before creating similarities evidences TE and flanking genes were collapsed Need to consider TE Regions as non coding regions Transposable element sequences and related proteins are present in reference databases such as UniProt. We used these datasets to get similarities evidences in the annotation process. So these databases had to be cleaned to remove noisy sequences. Then we used RepBase to identify these known proteins. Second, many transposable elements from sunflower absent in RepBase where still predicted. and we had to deal with an issue which was that these genes where collapsed with flanking genes. That’s why a new dataset of putative transposable element related proteins from sunflower was built and used to determine regions on genome to consider as non coding regions for EuGene. January 2015 –

6 Sunflower EuGene pipeline
Deal with N stretches  configure EuGene to allow gene prediction through 3kb gaps Another tunning of EuGene pipeline was necessary to deal with the assembly gaps. We had to configure EuGene to make it possible to predict genes spanning these gaps, we set a limit of 3 thousands of N’s. January 2015 –

7 Annotation summary 94.33 % of HA412-HO EST are correctly mapped (~90% of XRQ ESTs)  Gene space is covered 90935 protein coding genes 59817 with Full Length Best Hits (spanning 60% of the length of the A. thaliana | SwissProt | Unitprot_plant protein) OR « EST/RNAseq assembly » support 39050  with EST support over 80% of the mRNA 13568  gene models correspond to full length A. thaliana proteins Lettuce Genome Assembly, Structure and Annotation (Maria Jose Truco , PAG 2014): “Genome annotation of the assembled genome using three prediction pipelines postulated a set of 94,556 non-redundant gene models. From those, 41,000 high confidence gene models were identified […] that combines transcriptomic and prediction evidence.” Nomenclature : Ha412v1r1_LGgXXXXXX A month and a half was the time necessary to run blastx analyses against reference databases and provide an okay annotation. The first point I would like to highlight is that the genespace seems to be correctly covered according to EST mapping rate. The Eugene annotation pipeline leads to aroud ninety thousands of protein coding genes. If we apply some quite strict quality criterias such as transcript or similarity evidences, this number is closer to sixty thousands of genes. These values are quite similar to those obtained on the Lettuce genome, so the annotation pipeline seems to be reliable. To name these predicted genes, we choose the following locus tag prefix according to the NCBI recommandations. It includes genome assembly release and linkage group information. Genes are numbered according to their relative position on pseudo molecules. January 2015 –

8 Web Tools
Login / password: see consortium Let's move on to the tools we are deploying to give access in an integrative way to all produced results. We set up a secure web server which gives a single endpoint to all the annotation tools. You can login into this portal using the credentials provided by the consortium. I will now shortly describe all the resources available. January 2015 –

9 Genome Browser Available tracks Query with Assembly
Gene Models (v0 and v1.1) Protein similarities TE predictions (MITE, LTR, BlastX) Ha412-HO RNA-seq libraries mapping Transcript alignments Query with Genomic Region Gene locus tags Transcripts accessions HaT13l or Ha412T4l The first one is the genome browser. We deployed Jbrowse, a new generation open source genome browser, to give you access to the following tracks: genome assembly tracks such as reference sequence and N stretches structural annotation before and after EuGene pipeline tuning protein similarities against reference databases and transposable elements datasets and also tracks with legacy transcriptomics datasets You can query the system using of course genomic region but also by looking for a gene locus tag and, more useful, a transcript accession form legacy transcriptomics datasets January 2015 –

10 Contextual menus Linked resources
On several tracks, you will find contextual menus using the right click event. These menus give you quick access to external resources such as aligned transcripts expression data or annotation. January 2015 –

11 Contextual menus Linked resources
A similar contextual menu is available for the EuGene track. If you click on it you will be redirected to the annotation page of the selected gene. This is what I describe now. January 2015 –

12 Annotation Browser Pre-computed analysis to speed up annotation mining
Full-text search through Automatic functionnal annotation (InterPro based) Blast hits accessions / descriptions (Blastp vs. Uniprot-Plants, B. dystachyon, A. thaliana) InterPro hits accessions (GO terms, database accessions – PFAM, InterPro terms) Exact match Partial match Complex queries Region search In order to speed up annotation mining, we have precomputed and indexed time consuming analyses such as InterPro domains and protein similarities against reference databases. By this way, you should be able to query the system using InterPro or specific databases accessions and terms but also with your favorite arabidopsis or brachypodium gene name or locus tag. The system is ready for exact match searchs, fuzzy queries and more complex queries such as retrieval of the list of loci annotated on a specific region. Now, let’s focus on how results are rendered. January 2015 –

13 Annotation Sheet The top of the annotation sheet displays structural and automatic fonctionnal annotation data. Below, you can find an embedded view of the genome browser with expression data plotted. More detailed structural annotation information are available through the plus icon. And you can directly get access to the different feature sequences of the gene model by clicking these buttons. January 2015 –

14 Annotation Sheet Pre-computed protein families
If you scroll down the page, you will find the protein family analyses results. We used the OrthoMCL program to predict group of homologuous proteins accross sunflower and the two model organisms Arabidopsis thaliana and Brachypodium distachyon. The parameters we used are not too stringent so some families could be really large because of domain conservation. An automatic family annotation is proposed based on the occurrence of interpro terms in the whole family. You can compute a family alignment using MAFFT software just by clicking on this link. January 2015 –

15 Annotation Sheet Pre-computed blast results
5 best hits with e-value < 1e-5 against A. thaliana TAIR10 B. distachyon proteome Uniprot-Plants The center part of the page is dedicated to display protein similarities results. We ran three blastp analyses against plant model organisms and a more exhaustive plante reference database. We kept the five best hits with an expect value threshold of ten minus 5 for each dataset and indexed the results. The full match attributes such as the alignment are available through the plus icon. January 2015 –

16 Annotation Sheet Pre-computed InterPro results
In the same way, we report InterProScan hits with full information behind the plus icon and as well as cross reference to InterPro database. January 2015 –

17 Sequence-based tools Blast server
Finally, if you’ve got a sequence of interest which can be not recovered through its accession, you can look for similar sequence using the blast server that we added in the Sequence Tools section. All blast methods are available and all parameters excluding output format are available on the bottom of the page. All type of object annotated with Eugene are available as target databases, genes, mRNAs, proteins, scaffolds and so on. Results are displayed using the same format as the one described few slides before but with an extra link to the genome browser centered on the matching region or locus. January 2015 –

18 Sequence-based tools Extract-seq
Another usefull tool is the web form for sequence extraction using locus tags. You can also retrieve genomic sequence from a particular region using chromosome accessions and set start, stop and strand information. Full datasets are finally available in fasta format. January 2015 –

19 Sequence-based tools Protein families
The last tool I will talk about is dedicated to retrieve protein families based on the othoMCL analysis. You can query the system using of course sunflower locus tags, but also using arabidopsis or brachypodium ones. Results are displayed in a table with protein member annotations and as I show before you can get family sequences and alignments using these buttons. January 2015 –

20 Next Annotation components (gene structure  genome browser) will be updated/improved as soon as the gold genome assembly is available Sequence datasets (annotations, hits) will be integrated as a new source into the HeliaMine portal Feature request/Bug report To conclude, Tools and results will be updated according to the genome assembly progress and user needs. So, do not hesitate to give us a feed back or report any bug. This new sequence resource I just presented will be used as a new datasource in the HeliaMine portal to be integrated with all other information such as phenotypes/genotypes and so on . Thank you for your attention. January 2015 –

Download ppt "2 Unité de Biométrie et d’Intelligence Artificielle (UBIA) INRA"

Similar presentations

Ads by Google