Genome Annotation: A Protein-centric Perspective.



Protein data contributing to genome annotation
- Gene structure prediction
- Gene function prediction

UniProt
- Collaboration between EBI, SIB and PIR
- Funded mainly by NIH
- Based on the original work on PIR, Swiss-Prot and TrEMBL

UniProt Goals
- High level of annotation
- Minimal redundancy
- High level of integration with other databases
- Complete and up-to-date

UniProt Non-Redundancy Concepts
- UniProt Archive (UniParc): All sequences that are 100% identical over their entire length are merged into a single entry, regardless of species. UniParc represents each protein sequence once and only once, assigning it a unique identifier. UniParc cross-references the accession numbers of the source databases.
- UniProt Knowledgebase: Aims to describe in a single record all protein products derived from a certain gene (or genes, if the translation from different genes in a genome leads to indistinguishable proteins) from a certain species.
- UniProt Nref: Merges sequences automatically across different species.
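The UniParc rule (one entry per distinct sequence, with cross-references back to every source record) can be illustrated with a minimal Python sketch. It is not the UniParc implementation: the `build_archive` function, the `ARCH000001`-style identifiers, and the whitespace and case normalisation are all assumptions made for illustration.

```python
import hashlib


def build_archive(records):
    """Merge 100% identical sequences into single entries.

    records: iterable of (source_db, accession, sequence) tuples.
    Returns a list of entries in first-seen order. The identifier
    scheme below is hypothetical, not the real UniParc scheme.
    """
    archive = {}  # normalised sequence -> entry (dicts keep insertion order)
    for source_db, accession, sequence in records:
        # Assumption: ignore whitespace and letter case when comparing.
        seq = "".join(sequence.split()).upper()
        entry = archive.get(seq)
        if entry is None:
            entry = {
                "id": "ARCH%06d" % (len(archive) + 1),
                "checksum": hashlib.md5(seq.encode("ascii")).hexdigest(),
                "length": len(seq),
                "xrefs": [],
            }
            archive[seq] = entry
        # Cross-reference the source accession, once per source record.
        xref = (source_db, accession)
        if xref not in entry["xrefs"]:
            entry["xrefs"].append(xref)
    return list(archive.values())
```

Species never enters the comparison, so identical sequences from different organisms end up in the same entry, as the slide describes. The entries are keyed on the full sequence rather than on the checksum, so a checksum collision cannot merge two different sequences.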

UniParc 2.2. July ,913,916 unique sequences from 10,422,131 source records
Source databases: DDBJ/EMBL/GenBank, UniProt/Swiss-Prot, UniProt/TrEMBL, PIR-PSD, Ensembl, International Protein Index (IPI), PDB, RefSeq, FlyBase, WormBase, H-Inv, TROME, European Patent Office, United States Patent and Trademark Office and Japan Patent Office.

UniProt Protein DAS Reference Server
- Aristotle – Data Source for the Reference Server
- Creating a Plugin for Thomas Down's DAZZLE Servlet

DAS Infrastructure - Overview
Diagram labels:
- EBI: UniProt, InterPro, Aristotle
- Protein DAS Reference Server: Reference and Annotation from the EBI
- Download every 2 weeks
- Protein DAS Annotation Server (shown twice)
- DAS Client – Connects to Reference Server and zero or more Annotation Servers. Merge duplicate features? Resolve version differences?
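The slide leaves open how a client should merge duplicate features. One possible answer, sketched in Python, is shown here. The `Feature` tuple, the `merge_features` function, and the rule that two features are duplicates when they share segment, type, start and end are all assumptions. The slides do not say how duplicates were actually handled, and resolving version differences is not attempted here.

```python
from collections import namedtuple

# Hypothetical minimal feature record; real DAS features carry more fields.
Feature = namedtuple("Feature", "segment type start end source")


def merge_features(feature_lists):
    """Merge features from a reference server and annotation servers.

    feature_lists: one list of Feature tuples per server.
    Features with the same (segment, type, start, end) are treated as
    duplicates (an assumed rule). Returns a list of (key, sources)
    pairs, sorted by segment and start position.
    """
    merged = {}
    for features in feature_lists:
        for feature in features:
            key = (feature.segment, feature.type, feature.start, feature.end)
            sources = merged.setdefault(key, [])
            # Remember every server that reported this feature.
            if feature.source not in sources:
                sources.append(feature.source)
    return sorted(merged.items(), key=lambda item: (item[0][0], item[0][2]))
```

Keeping the list of contributing servers with each merged feature means a client can still show where an annotation came from after duplicates are collapsed.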

Creating a Plugin for DAZZLE / Aristotle
- Involved linking the Aristotle Java API to the BioJava and DAZZLE Java APIs
- Issues with enabling a useful entry_point command
- Creation of an 'artificial' hierarchy of entry points, based upon sequence length
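As a rough illustration of an 'artificial' hierarchy based on sequence length, the Python sketch below groups sequence identifiers into fixed-width length bins, so that a client can browse bins instead of one flat list of every protein. The function names, the bin width of 100 residues, and the choice of Python are assumptions. The actual plugin was written against the Java APIs, and its binning scheme is not described in the slides.

```python
def length_bin(length, bin_size=100):
    """Return the inclusive (low, high) length range containing `length`."""
    if length < 1:
        raise ValueError("sequence length must be at least 1")
    low = ((length - 1) // bin_size) * bin_size + 1
    return (low, low + bin_size - 1)


def build_entry_point_hierarchy(sequence_lengths, bin_size=100):
    """Group sequence identifiers under length-range parent nodes.

    sequence_lengths: dict mapping sequence id -> sequence length.
    Returns a dict mapping (low, high) -> sorted list of sequence ids,
    with the bins in ascending order.
    """
    hierarchy = {}
    for seq_id, length in sequence_lengths.items():
        hierarchy.setdefault(length_bin(length, bin_size), []).append(seq_id)
    return {bin_range: sorted(ids) for bin_range, ids in sorted(hierarchy.items())}
```

With a bin width of 100, sequences of length 1 to 100 fall under the first parent node, 101 to 200 under the second, and so on.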

Possible Approach to Implementing Local Annotation Servers
- Use GFF format as a simple and accessible primary data source.
- Problem: GFF is not suitable for very large numbers of records, so load the data into a relational database (sticking with SQL-92 to stay as cross-platform as possible).
- Use a standard plugin that allows the 'GFF' data to be read from the relational database.
- From the point of view of the data curators, this process should be transparent: they should be able to work with GFF files without needing to worry about the database structure.
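The GFF-to-database step could look roughly like the following Python sketch, which uses the standard-library `sqlite3` module as a stand-in for whatever relational database a site runs. The table name `gff_feature`, its column names, and the helper functions are hypothetical. The SQL is kept to plain `CREATE TABLE`, `INSERT` and `SELECT` statements in the spirit of the SQL-92 constraint, with `start_pos` and `end_pos` used as column names because `END` is a reserved word.

```python
import sqlite3


def parse_gff(lines):
    """Yield one 9-field tuple per GFF feature line (tab-separated)."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) < 8:
            continue  # not a well-formed feature line
        fields += [""] * (9 - len(fields))  # attributes column is optional
        seqname, source, feature, start, end, score, strand, frame, attrs = fields[:9]
        yield (seqname, source, feature, int(start), int(end),
               score, strand, frame, attrs)


def create_schema(conn):
    """Create the (hypothetical) feature table."""
    conn.execute(
        "CREATE TABLE gff_feature ("
        " seqname VARCHAR(64) NOT NULL, source VARCHAR(64),"
        " feature VARCHAR(64), start_pos INTEGER NOT NULL,"
        " end_pos INTEGER NOT NULL, score VARCHAR(16),"
        " strand CHAR(1), frame CHAR(1), attributes VARCHAR(255))"
    )


def load_gff(conn, lines):
    """Insert all parsed features; return the number of rows loaded."""
    rows = list(parse_gff(lines))
    conn.executemany(
        "INSERT INTO gff_feature VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", rows
    )
    conn.commit()
    return len(rows)


def features_for(conn, seqname):
    """Return (feature, start_pos, end_pos) rows for one sequence."""
    cursor = conn.execute(
        "SELECT feature, start_pos, end_pos FROM gff_feature"
        " WHERE seqname = ? ORDER BY start_pos", (seqname,)
    )
    return cursor.fetchall()
```

Curators would keep editing GFF files, and a scheduled job could call `create_schema` once and `load_gff` on each update, which keeps the database structure out of their way.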

UniProt Protein DAS Server
External page:
DAS server package download page:
The UniProt DAS server itself: