Advancing Science with DNA Sequence Data Curation in IMG-ER Natalia Ivanova MGM Workshop May 16, 2012.



Tricky question
What do you need to do data curation in IMG?
a) iPhone
b) PhD in Computer Science
c) supernatural powers
Correct answer: you need an IMG account

What can be curated in IMG-ER?
1. Gene models
a) Add a gene
b) Make a gene a pseudogene or “obsolete” (= delete it)
2. Functional annotations
a) Product names
b) EC numbers
c) Gene symbols
If you believe something else needs to be changed (genome name, taxonomy, etc.), please use the IMG Questions/Comments link.
What can’t be changed: automated assignments to protein families (Pfam, COGs, TIGRfam, InterPro, SEED assignments, KO assignments)

Center point for curation: Gene Cart

Product Name is free text (but see GenBank requirements mesubmit_annotation.html )
Prot Description is free text (goes to “note” in GenBank submission)
EC number and PUBMED ID: see explanation
Notes are free text (go to “note” in GenBank submission)
Gene symbol is the “gene name”, a 4-letter abbreviation; goes to “gene” in GenBank submission
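The field-to-qualifier mapping above can be sketched in a few lines of Python. This is only an illustration, not IMG's export code: the record keys (`product_name`, `prot_description`, `notes`, `gene_symbol`, `ec_number`) and the function name are hypothetical, and sending Product Name to `product` and EC number to `EC_number` is an assumption based on the standard GenBank qualifier names rather than something the slide states. PUBMED ID is left out because the slide does not say where it ends up.

```python
def to_genbank_qualifiers(record):
    """Map a curated gene record (hypothetical keys) to GenBank-style qualifiers."""
    qualifiers = {}
    # Assumption: Product Name and EC number use the standard qualifier names.
    if record.get("product_name"):
        qualifiers["product"] = record["product_name"]
    if record.get("ec_number"):
        qualifiers["EC_number"] = record["ec_number"]
    # From the slide: Gene symbol goes to "gene".
    if record.get("gene_symbol"):
        qualifiers["gene"] = record["gene_symbol"]
    # From the slide: both Prot Description and Notes go to "note".
    notes = [record[key] for key in ("prot_description", "notes") if record.get(key)]
    if notes:
        qualifiers["note"] = "; ".join(notes)
    return qualifiers
```

Because two free-text fields land in the same `note` qualifier, the sketch joins them with a separator; how a real submission combines them is not specified here.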

How to find the genes that need curation?
Two possible scenarios:
a) You have submitted a genome to IMG-ER and want to have the best annotations possible for it (e.g. for GenBank submission)
b) You’re an expert and know everything about a certain pathway or protein family (families) = “community service”

Curation of genome annotations
[Workflow diagram] find genome -> review -> add to Gene Cart -> refine gene set
Find Genomes: Genome Browser, Genome Search
Genome Statistics: w/o enzymes but with candidate KO based enzymes
Compare Gene Annotations: “Hypothetical protein”, but with some evidence; Non-hypothetical protein, but no evidence
Gene Pages: Protein families, Homologs/orthologs, Gene Neighborhoods
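The two review categories named on this slide (“hypothetical protein” with some evidence, and non-hypothetical protein with no evidence) can be sketched as a simple check. This is a minimal illustration under stated assumptions: the gene record layout (`product_name`, `family_hits`) is hypothetical, and treating any protein family hit as “evidence” is a simplification, not a description of how IMG's Compare Gene Annotations tool decides.

```python
def review_category(gene):
    """Return the review category for a gene record, or None if nothing stands out.

    `gene` is a hypothetical dict with:
      - "product_name": the current product name (str)
      - "family_hits": protein family assignments, e.g. Pfam or COG IDs (list)
    """
    is_hypothetical = "hypothetical protein" in gene["product_name"].lower()
    has_evidence = bool(gene["family_hits"])  # assumption: any family hit counts
    if is_hypothetical and has_evidence:
        return "hypothetical protein, but with some evidence"
    if not is_hypothetical and not has_evidence:
        return "non-hypothetical protein, but no evidence"
    return None


def genes_to_review(genes):
    """Keep only the genes that fall into one of the two review categories."""
    return [g for g in genes if review_category(g) is not None]
```

Genes returned by `genes_to_review` are the ones a curator would plausibly add to the Gene Cart for a closer look.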

Why do you want to review annotations?
Most IMG pipelines are optimized for specificity, so they generate few false positives but are more likely to have false negatives.
Compare Annotations
– Product name is a consensus of multiple assignments: BLASTp, TIGRfam, COG, Pfam
– Sources of false negatives are the cutoffs: TIGRfam trusted cutoffs are quite stringent; COG doesn’t have trusted cutoffs; the BLASTp cutoff is 50% identity
Candidate genes with KO annotations: sources of false negatives
– Cutoffs for % identity and alignment length
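To see why a fixed cutoff produces false negatives, consider a small sketch that splits BLASTp hits at the 50% identity threshold mentioned above. The hit record layout and the function name are assumptions made for illustration; the actual IMG pipeline also applies other criteria that are not modeled here.

```python
BLASTP_MIN_IDENTITY = 50.0  # percent identity cutoff quoted on the slide


def split_by_cutoff(hits, min_identity=BLASTP_MIN_IDENTITY):
    """Split hits into (accepted, rejected) by percent identity.

    `hits` is a list of hypothetical dicts with a "percent_identity" key.
    Rejected hits are the candidates for manual review: a true homolog
    just below the cutoff would be a false negative.
    """
    accepted = [h for h in hits if h["percent_identity"] >= min_identity]
    rejected = [h for h in hits if h["percent_identity"] < min_identity]
    return accepted, rejected
```

A hit at, say, 48% identity is dropped even if it is a genuine homolog, which is exactly the kind of gene that manual review is meant to catch.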

Curation of annotation in one genome (or a set of genomes)
a) Your favorite genes (experimental verification, etc.) -> use Find Genes, Gene Search or BLAST
b) “Compare Annotations” on Organism Details page
c) “Candidate genes with KO annotations” on Organism Details page
d) KEGG Pathways (either from Organism Details page or from Find Functions menu)
e) PhyloProfiler

A shortcut for product name/EC number assignments based on KO

Example of a missed gene
Run PhyloProfiler with Deinococcus geothermalis as the query and Deinococcus hopiensis as the target (“with no homologs in”).
Select Dgeo_0119 as the sequence to check whether a homolog of this gene was missed in Deinococcus hopiensis.

Adding missed genes (cont'd)
Use the graphical viewer to check the translation.
Adjust the start if other start codons with a better RBS exist upstream.
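The search for alternative upstream starts can be sketched as follows. This is a simplified illustration, not what the IMG graphical viewer does internally: it assumes a forward-strand gene, the common bacterial start codons ATG, GTG and TTG, and it only lists candidate positions. Judging which candidate has the better RBS is left to the curator.

```python
START_CODONS = {"ATG", "GTG", "TTG"}  # assumption: common bacterial start codons
STOP_CODONS = {"TAA", "TAG", "TGA"}


def upstream_starts(seq, start):
    """List 0-based positions of in-frame start codons upstream of `start`.

    `seq` is the forward-strand DNA sequence and `start` is the 0-based
    position of the current start codon. The scan walks backwards one codon
    at a time and stops at the first in-frame stop codon, because the gene
    cannot be extended past it. Candidates are returned nearest first.
    """
    seq = seq.upper()
    candidates = []
    pos = start - 3
    while pos >= 0:
        codon = seq[pos:pos + 3]
        if codon in STOP_CODONS:
            break
        if codon in START_CODONS:
            candidates.append(pos)
        pos -= 3
    return candidates
```

Each returned position is a possible new start; the region immediately upstream of it would then be inspected for a ribosome binding site.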

Reviewing your annotations
Organism Details page -> Genome Statistics
MyIMG

IMG curation exercises
Go to the link in the usual place:
The first 2 pages are questions without answers; the rest is a cheat sheet.