TAIR: Bringing together data for the global plant biology community Philippe Lamesch Kate Dreher The Arabidopsis Information Resource www.arabidopsis.org.

Outline
o Philippe Lamesch: Introducing TAIR and PMN; TAIR10 genome annotation; TAIR gene confidence ranking; TAIR tools
o Kate Dreher

TAIR: The Arabidopsis Information Resource. Collects, curates and distributes information on Arabidopsis. Information is freely available from arabidopsis.org

Slides available from TAIR

TAIR is used worldwide Visits per month (source: Google Analytics)

TAIR usage worldwide: July 2009-July 2010

What TAIR does: (1) Arabidopsis genome annotation

What TAIR does: (2) manual literature curation Controlled vocabulary annotations Gene Ontology (GO) Plant Ontology (PO) Gene name, symbol Allele, phenotype Summary statement composition

Who we partner with: PMN (Plant Metabolic Network) and PlantCyc A comprehensive plant biochemical pathway database, containing curated information from the literature and computational analyses about the genes, enzymes, compounds, reactions, and pathways involved in primary and secondary metabolism

Who we partner with: ABRC Distribution of biological research materials

A new approach for improving the Arabidopsis genome annotation for TAIR10 The Arabidopsis gene structure confidence ranking Arabidopsis genome annotation

Arabidopsis genome sequenced almost 10 years ago. High-quality sequence with few gaps. TIGR did the initial genome annotation; TAIR took over responsibility in 2005. Current TAIR9 stats: 27,379 protein-coding genes; 4,827 pseudogenes or transposable elements; 1,312 ncRNAs

Genome annotation at TAIR: Add novel genes; update exon/intron structures of existing genes; delete mispredicted genes; merge and split genes; change gene types; add splice-variants

Genome annotation at TAIR: Annotate atypical gene classes: short protein-coding genes; transposable element genes; pseudogenes; uORFs (genes within the UTR of other genes). Add novel genes; update exon/intron structures of existing genes; delete mispredicted genes; merge and split genes; change gene types; add splice-variants

Arabidopsis gene structure annotation: A new approach. TAIR6-TAIR9: Use ESTs and cDNAs and an assembly tool called PASA to improve gene structures. TAIR10: Use new experimental data and new prediction tools to further improve gene structure predictions

Using PASA and ESTs/cDNAs Clustered transcripts NCBI Genome annotation TAIR6-TAIR9

Clustered transcripts Resulting gene model NCBI Using PASA and ESTs/cDNAs Genome annotation TAIR6-TAIR9

Clustered transcripts Resulting gene model Previous gene model NCBI comparison Novel genes New Splice-variants Gene structure updates Using PASA and ESTs/cDNAs Genome annotation TAIR6-TAIR9
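The comparison step on this slide can be sketched roughly as follows. This is not PASA's actual algorithm; it is a minimal illustration that assumes each transcript is a list of exon `(start, end)` pairs on one chromosome and strand, and the function names and category labels are hypothetical.

```python
def introns(exons):
    """Return the intron chain implied by a list of exon (start, end) pairs."""
    exons = sorted(exons)
    return [(left_end, right_start)
            for (_, left_end), (right_start, _) in zip(exons, exons[1:])]


def overlaps(exons_a, exons_b):
    """True if the genomic spans of two exon lists overlap."""
    a_start, a_end = min(s for s, _ in exons_a), max(e for _, e in exons_a)
    b_start, b_end = min(s for s, _ in exons_b), max(e for _, e in exons_b)
    return a_start <= b_end and b_start <= a_end


def classify(assembly, previous_models):
    """Compare one assembled transcript against the previous gene models.

    Hypothetical categories, loosely following the slide:
    no overlapping model -> novel gene candidate;
    same intron chain    -> consistent with a previous model;
    otherwise            -> structure update or new splice variant.
    """
    hits = [m for m in previous_models if overlaps(assembly, m)]
    if not hits:
        return "novel gene candidate"
    if any(introns(assembly) == introns(m) for m in hits):
        return "consistent with previous model"
    return "structure update or new splice variant"


previous = [[(100, 200), (300, 400)]]
print(classify([(90, 200), (300, 420)], previous))   # same introns, longer ends
print(classify([(100, 200), (250, 400)], previous))  # different intron
print(classify([(1000, 1200)], previous))            # no overlapping model
```

Deciding between a structure update and a new splice variant needs further evidence, which is why the sketch leaves those two outcomes as one category.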

Manual annotation at TAIR: Apollo. Screenshot labels: ESTs; cDNAs; radish sequence alignments; Eugene prediction; dicot sequence alignments; monocot sequence alignments; Aceview gene predictions; 2 gene isoforms; short MS peptide

TAIR10: using proteomics and RNA-seq data to improve genome annotation. 4-step process: 1. Mapping RNA-seq & peptides 2. Assembly/gene build 3. Manual review 4. Integration (genome release/GBrowse)

Mapping and Assembly
1. Mapping
o RNA-seq sequences (TopHat (C. Trapnell), SuperSplat (T.C. Mockler))
o Peptides (6-frame translation, spliced exon graph)
2. Assembly approaches
o Augustus (M. Stanke): uses spliced RNA-seq reads and peptides. Aim: identify additional splice-variants, update existing genes
o TAU (T.C. Mockler): uses spliced RNA-seq reads. Aim: identify additional splice-variants
o Cufflinks (C. Trapnell): uses spliced and unspliced RNA-seq data. Aim: identify novel genes
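The 6-frame translation used for peptide mapping can be illustrated with a short sketch. It uses the standard genetic code and an exact substring match; the function names are hypothetical, and peptides that span a splice junction (the "spliced exon graph" case) are not handled here.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Reverse-complement an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translation(seq):
    """Return the translations of frames +1..+3 and -1..-3."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


def find_peptide(seq, peptide):
    """List the frames whose translation contains the peptide."""
    return [name for name, protein in six_frame_translation(seq).items()
            if peptide in protein]


print(six_frame_translation("ATGGCCAAATAA"))
print(find_peptide("ATGGCCAAATAA", "MAK"))
```

A peptide that matches only in a frame or region without an annotated exon is the kind of evidence that can point to a missing exon or an incorrect start codon.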

RNA-seq datasets (Mockler Lab, Ecker Lab): 200 million aligned RNA-seq reads. TopHat, SuperSplat: 203,000 clustered spliced RNA-seq junctions. 145,000 RNA-seq junctions based on >1 read. Input to Augustus
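The "based on >1 read" filter can be sketched as a simple count of how many spliced reads support each junction. The input format here (one `(chromosome, intron_start, intron_end, strand)` tuple per spliced read) and the function name are assumptions for illustration, not the output format of TopHat or SuperSplat.

```python
from collections import Counter


def supported_junctions(read_junctions, min_reads=2):
    """Keep junctions supported by at least `min_reads` spliced reads.

    `read_junctions` holds one (chrom, intron_start, intron_end, strand)
    tuple per spliced read; identical tuples are clustered by counting.
    """
    counts = Counter(read_junctions)
    return {junction: n for junction, n in counts.items() if n >= min_reads}


reads = [
    ("Chr1", 1200, 1350, "+"),
    ("Chr1", 1200, 1350, "+"),
    ("Chr1", 1200, 1350, "+"),
    ("Chr1", 5000, 5100, "-"),  # seen once, so it is filtered out
]
print(supported_junctions(reads))
```

Requiring more than one read trades some sensitivity for fewer junctions caused by alignment errors.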

Augustus gene prediction. Inputs: 145,000 RNA-seq junctions based on >1 read; 260,000 peptides (Baerenfaller et al., Castellana et al.); ESTs & cDNAs; AGI models. 11% of RNA-seq junctions incorporated into Augustus models; 64% of peptide sequences incorporated into Augustus models. Predicted Augustus models: 5461 distinct models, 1596 novel models

Categorisation/Review TAU Models RNA-seq Junctions Augustus Model TAIR confidence rank TAIR Model Peptides (Splice variants, NMD targets) (correction) (colour reflects matching model) Incorrect junction in TAIR model Unsupported exon

Example Augustus update

Example Augustus splice variant

Example 2 Augustus splice variant

Augustus/TAU/Cufflinks. Augustus: incorporates 64% of peptides not contained in TAIR, 11% for RNA-seq junctions; 5461 potential updated genes; 1596 potential novel genes. TAU: 30,083 junctions distinct to Augustus or TAIR models; 10,902 junctions incorporated into 10,491 TAU models. Cufflinks: 367 novel assemblies which fall above the 100 bp #. # TE-filter applied to Augustus and Cufflinks models

Preliminary TAIR 10 Results Novel genes Updated genes Splice-variants B-list Rejects

Preliminary TAIR10 Results. Novel genes: 126; Updated genes: 1182; Splice-variants: 5885 (18% of all loci); B-list: 1586; Rejects: 2318

Gene Confidence Rank: assigns confidence scores to all exons and gene models based on different types of experimental and computational evidence

Assigning A Confidence Rank E1 E4

Full support No support

New and updated tools at TAIR N-Browse GBrowse Synteny viewer

New and updated tools at TAIR: N-Browse (in collaboration with the Kris Gunsalus Lab, NYU). > 7,000 experimental interactions. Interactions curated by TAIR, IntAct & BioGrid. Tutorial at

N-Browse

N-Browse: Finding information about edges (interactions)

N-Browse: How to select and move nodes

N-Browse: How to visualize GO terms from a selected set of nodes

N-Browse: How to load your own file and overlay it with the curated interaction data

N-Browse: How to save your session and export your data

New Tools at TAIR N-Browse GBrowse Synteny viewer

GBrowse Header Main Browser Window Track Menu

Alternative gene annotations: Eugene (transcript, proteins +), Sebastien Aubourg; Gnomon (transcript, proteins), Souvorov (NCBI); Aceview (transcript), Thierry-Mieg (NCBI); Hanada et al. 2007 (3633 predicted genes)

Proteomic Data: High-density Arabidopsis proteome map (Baerenfaller et al., 2008). Incorrect start codon

VISTA plot Gbrowse track

Transcriptome data

Orthologs and Gene Families

Variation

Promoter Elements

Methylation

Decorated Fasta file

New Tools at TAIR N-Browse GBrowse Synteny viewer Data provided by Pedro Pattyn at the University of Ghent

AT5G48000 AT5G48010 AT5G47990

Example 2 Augustus update

Gbrowse