Making best use of TAIR tools and datasets Philippe Lamesch Donghui Li The Arabidopsis Information Resource contact us:

TAIR: The Arabidopsis Information Resource. We collect, curate and distribute information on Arabidopsis. All information is freely available from arabidopsis.org

Outline: Gene structure – Philippe Lamesch; Gene function – Donghui Li; Metabolic pathway – Donghui Li; New tools – Philippe Lamesch

Slides available from TAIR

TAIR is used worldwide Visits per month (source: Google Analytics)

TAIR usage in Asia: June 2009-June 2010

What we do: (1) Arabidopsis genome annotation

What we do: (2) manual literature curation. Controlled vocabulary annotations: Gene Ontology (GO), Plant Ontology (PO). Gene name, symbol. Allele, phenotype. Summary statement composition.

What we do: (3) metabolic pathway curation. AraCyc: a metabolic pathway database for Arabidopsis thaliana that contains information about both predicted and experimentally determined pathways, reactions, compounds, genes and enzymes. PlantCyc and PMN (Plant Metabolic Network).

What we do: (4) work with ABRC to distribute research material

Part I: The Arabidopsis genome annotation. A new approach for improving the Arabidopsis genome annotation; where to find gene structure related data at TAIR; the Arabidopsis gene structure confidence ranking.

Arabidopsis genome annotation: Arabidopsis genome sequenced almost 10 years ago; high quality sequence with few gaps; TIGR did initial genome annotation; TAIR took over responsibility in 2005. Current TAIR9 stats: 27,379 protein coding genes; 4,827 pseudogenes or transposable elements; 1,312 ncRNAs.

Genome annotation at TAIR: add novel genes; update exon/intron structures of existing genes; delete mispredicted genes; merge and split genes; change gene types; add splice-variants.

Genome annotation at TAIR: annotate atypical gene classes: short protein-coding genes; transposable element genes; pseudogenes; uORFs (genes within UTR of other genes).

Arabidopsis gene structure annotation: a new approach. TAIR6-TAIR9: use ESTs and cDNAs and an assembly tool called PASA to improve gene structures. TAIR10: use new experimental data and new prediction tools to further improve gene structure predictions.

Genome annotation TAIR6-TAIR9: using PASA and ESTs/cDNAs. ESTs/cDNAs (NCBI) are clustered into transcripts, which yield a resulting gene model. Comparison with the previous gene model identifies novel genes, new splice-variants and gene structure updates.

Manual annotation at TAIR: Apollo. Evidence shown: ESTs; cDNAs; radish sequence alignments; Eugene prediction; dicot sequence alignments; monocot sequence alignments; Aceview gene predictions; short MS peptide; 2 gene isoforms.

TAIR10: using proteomics and RNA-seq data to improve genome annotation. 4-step process: 1. Mapping RNA-seq reads and peptides 2. Assembly/gene build 3. Manual review 4. Integration (genome release/GBrowse)

Mapping and Assembly
1. Mapping
RNA-seq sequences: TopHat (C. Trapnell), SuperSplat (T.C. Mockler)
Peptides: 6-frame translation, spliced exon graph
2. Assembly approaches
Augustus (M. Stanke): uses spliced RNA-seq reads and peptides. Aim: identify additional splice-variants, update existing genes
TAU (T.C. Mockler): uses spliced RNA-seq reads. Aim: identify additional splice-variants
Cufflinks (C. Trapnell): uses spliced and unspliced RNA-seq data. Aim: identify novel genes
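To make the "6-frame translation" step for peptide mapping concrete, here is a minimal sketch in Python. It assumes the standard genetic code and plain string sequences; the function names are illustrative and are not taken from the TAIR pipeline, and the spliced exon graph approach is not covered.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Translate both strands in all three reading frames."""
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


def frames_containing(peptide, seq):
    """List the frames whose translation contains the peptide."""
    return [
        name
        for name, protein in six_frame_translation(seq).items()
        if peptide in protein
    ]
```

A peptide that matches only a frame outside the annotated coding sequence is the kind of evidence that can point to a missing exon or a mis-annotated start codon.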

Augustus: RNA-seq datasets (Mockler Lab, Ecker Lab); 200 million aligned RNA-seq reads; TopHat, SuperSplat; 203,000 clustered spliced RNA-seq junctions; 145,000 RNA-seq junctions based on >1 read.
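The "based on >1 read" filter can be sketched as a simple count of spliced alignments per junction. This is an illustration only: the tuple layout `(chromosome, intron_start, intron_end, strand)` and the function name are assumptions, not the format used by TopHat or SuperSplat.

```python
from collections import Counter


def supported_junctions(spliced_alignments, min_reads=2):
    """Cluster spliced alignments by junction and keep well-supported ones.

    Each alignment is a hypothetical (chromosome, intron_start, intron_end,
    strand) tuple. min_reads=2 corresponds to "more than one read".
    """
    counts = Counter(spliced_alignments)
    return {junction: n for junction, n in counts.items() if n >= min_reads}


# Three reads support one junction, a single read supports another.
reads = [
    ("Chr1", 1000, 1200, "+"),
    ("Chr1", 1000, 1200, "+"),
    ("Chr1", 1000, 1200, "+"),
    ("Chr1", 5000, 5400, "-"),
]
kept = supported_junctions(reads)
```

Here `kept` retains only the junction seen three times; the singleton junction is dropped.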

Augustus gene prediction. Input: 145,000 RNA-seq junctions based on >1 read; 260,000 peptides (Baerenfaller et al, Castellana et al); ESTs & cDNAs; AGI models. Result: 11% of RNA-seq junctions incorporated into Augustus models; 64% of peptide sequences incorporated into Augustus models. Predicted Augustus models: 5,461 distinct models; 1,596 novel models.

Categorisation/Review [figure]. Tracks: TAU models; RNA-seq junctions; Augustus model; TAIR confidence rank; TAIR model; peptides. Labels: splice variants, NMD targets; correction; colour reflects matching model; incorrect junction in TAIR model; unsupported exon.

Example Augustus update

Example 2 Augustus update

Example Augustus splice variant

Example 2 Augustus splice variant

Augustus/TAU/Cufflinks
Augustus: incorporates 64% of peptides not contained in TAIR, 11% for RNA-seq junctions; 5,461 potential updated genes; 1,596 potential novel genes
TAU: 30,083 junctions distinct to Augustus or TAIR models; 10,902 junctions incorporated into 10,491 TAU models
Cufflinks: 367 novel assemblies which fall above the 100 bp & >15 FPKM filter
TE filter applied to Augustus and Cufflinks models
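The Cufflinks length and expression filter can be sketched as below. The `Assembly` record and its field names are hypothetical, and both thresholds are treated as strict "greater than" comparisons, which is an assumption about how "above the 100 bp & >15 FPKM filter" was applied.

```python
from dataclasses import dataclass


@dataclass
class Assembly:
    """Hypothetical record for one assembled transcript."""
    name: str
    length_bp: int
    fpkm: float


def passes_filter(assembly, min_length_bp=100, min_fpkm=15.0):
    """Keep assemblies longer than min_length_bp with FPKM above min_fpkm."""
    return assembly.length_bp > min_length_bp and assembly.fpkm > min_fpkm


candidates = [
    Assembly("asm1", 850, 42.0),   # long and well expressed: kept
    Assembly("asm2", 90, 120.0),   # too short: dropped
    Assembly("asm3", 400, 3.5),    # low expression: dropped
]
kept = [a.name for a in candidates if passes_filter(a)]
```

Only `asm1` survives in this toy input; the transposable element filter mentioned on the slide would be a separate step.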

Preliminary Results: Augustus/TAU/Cufflinks predicted models are classified into categories: Novel genes 21; Updated genes 812; Splice-variants 2,134; B-list 1,586; Rejects 2,318

Where can you find gene structure data on TAIR? On the gene model page: graphic of exon-intron structure; coordinates of each exon. On GBrowse: graphic display of structure and overlapping evidence data. On the FTP site: GFF files with exact structures of each gene model; files with gene confidence ranking information.

Gene Locus Page

Gene Model Page

GBrowse

GBrowse: Header; Main Browser Window; Track Menu

FTP site
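As an illustration of working with a GFF file from the FTP site, the sketch below collects exon coordinates per gene model. It assumes the nine tab-separated columns of GFF3 and that exon rows carry a `Parent=` attribute naming the gene model; the sample rows and the model ID `MODEL.1` are made up for the example, not copied from a TAIR release.

```python
from collections import defaultdict


def exon_coordinates(gff_lines):
    """Map each gene model ID to its exons as (seqid, start, end, strand)."""
    exons = defaultdict(list)
    for line in gff_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        fields = line.split("\t")
        if len(fields) != 9 or fields[2] != "exon":
            continue
        seqid, _src, _type, start, end, _score, strand, _phase, attributes = fields
        attrs = dict(
            item.split("=", 1) for item in attributes.split(";") if "=" in item
        )
        # An exon shared by several models lists them comma-separated.
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                exons[parent].append((seqid, int(start), int(end), strand))
    for model in exons:
        exons[model].sort(key=lambda exon: exon[1])  # order by start position
    return dict(exons)


# Made-up rows in GFF3 layout, for illustration only.
sample = [
    "##gff-version 3",
    "Chr1\tExample\tmRNA\t100\t900\t.\t+\t.\tID=MODEL.1",
    "Chr1\tExample\texon\t500\t900\t.\t+\t.\tParent=MODEL.1",
    "Chr1\tExample\texon\t100\t300\t.\t+\t.\tParent=MODEL.1",
]
structures = exon_coordinates(sample)
```

For a downloaded file, the same function can be fed an open file handle, for example `exon_coordinates(open(path))` with `path` pointing at the local copy.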

Gene Confidence Rank: assigns confidence scores to all exons and gene models based on different types of experimental and computational evidence

Assigning A Confidence Rank E1 E4

Full support No support

New Tools at TAIR N-Browse GBrowse Synteny viewer

New Tools at TAIR: N-Browse (in collaboration with the Kris Gunsalus Lab, NYU); GBrowse; Synteny viewer

N-Browse

N-Browse: Finding information about edges (interactions)

N-Browse: How to select and move nodes

N-Browse: How to visualize GO terms from a selected set of nodes

N-Browse: How to load your own file and overlay it with the curated interaction data

N-Browse: How to save your session and export your data

Alternative gene annotations: Eugene (transcript, proteins +), Sebastien Aubourg; Gnomon (transcript, proteins), Souvorov (NCBI); Aceview (transcript), Thierry-Mieg (NCBI); Hanada et al 2007 (3,633 predicted genes). Use: identify possible corrections.

Proteomic Data: high-density Arabidopsis proteome map (Baerenfaller et al. 2008). Example: incorrect start codon.

VISTA plot GBrowse track

Transcriptome data

Orthologs and Gene Families

Variation

Promoter Elements

Methylation

Decorated Fasta file

New Tools at TAIR: N-Browse; GBrowse; Synteny viewer. Data provided by Pedro Pattyn at the University of Ghent.

AT5G48000 AT5G48010 AT5G47990

Acknowledgements. Curators: Peifen Zhang, Tanya Berardini, David Swarbreck, Kate Dreher, Rajkumar Sasidharan. Tech Team: Bob Muller, Larry Ploetz, Raymond Chetty, Anjo Chi, Vanessa Kirkup, Cynthia Lee, Tom Meyer, Shanker Singh, Chris Wilks. AraCyc and TAIR PI and Co-PI: Eva Huala, Sue Rhee. Metabolic Pathway Software: Peter Karp and SRI group.

Automated pipeline at TAIR: Program to Assemble Spliced Alignments (PASA). Clustered transcripts (NCBI) give a resulting gene model, which is compared with the previous gene model. Based on a set of rules a decision is made.

Gene structure annotation in Arabidopsis NEW: 282 genes; 1056 exons UPDATED: 1254 models; 1144 exons NEW: 1291 genes; 683 exons UPDATED: 3811 models; 4007 exons NEW: 681 genes; 828 exons UPDATED: 10,792 models and 14,050 exons TAIR6

How do MOD curators annotate genomes? Experimental & computational evidence (ESTs, cDNAs, alternative gene models, short MS peptides, community submissions, …) goes into an automatic pipeline and manual annotation, which together produce the genome annotation.

Manual annotation at different MODs: genome editing tool + evidence set + set of annotation rules. Genome editing tools: Apollo (Arabidopsis, Fly), Aceview (Worm), Zmap/Otterlace (Human), Artemis (Pathogen Project), … Evidence set: nucleotide sequence, short peptides, protein similarity, alternative predictions, … Annotation rules: exon size, intron size, number of UTRs, coding/non-coding ratio, splice-junctions, …

Responsibilities of a gene structure curator: delete wrongly predicted genes; update mispredicted exon-intron structure (using cDNA evidence); annotate splice-variants; annotate atypical gene classes (short protein-coding genes, transposable element genes, pseudogenes, uORFs (genes within UTR of other genes)); define gene type (protein-coding, tRNA, snRNA, snoRNA, rRNA, …).

Categorisation/Review: 17,915 total gene models. Categorise/prioritise (CDS length, Blast similarity, gene confidence rank). [Figure] Tracks: TAU models; RNA-seq junctions; Augustus model; TAIR confidence rank; TAIR model; peptides. Labels: splice variants, NMD targets; correction; colour reflects matching model; incorrect junction in TAIR model; unsupported exon.

Augustus. Junction assembly: raw spliced RNA-seq reads (8,819,162 reads) are clustered into RNA-seq junctions (203,317 junctions). Augustus input: RNA-seq junctions, peptides, ESTs/cDNAs, TAIR models. Provide evidence ranking and bonus scores.

Examples of large-scale community datasets recently integrated into the Arabidopsis annotation: transposable elements (Quesneville Lab); pseudogenes (Gerstein Lab); short MS peptides (Baerenfaller et al, Castellana et al); short genes (Hanada et al).

Model Organism Databases

Augustus results: Augustus models were classified into categories: Novel genes 20; Updated genes 897; Splice-variants 1,826; B-list 1,173; Rejects 3,137

Arabidopsis gene structure annotation: a new approach. TAIR6-TAIR9: ESTs and cDNAs serve as the main source of experimental data used for Arabidopsis genome annotation. cDNAs & ESTs feed automated annotation with PASA (Program To Assemble Spliced Alignments) and manual annotation, producing the annotated Arabidopsis genome. New data sources: MS peptides and RNA-seq data.