EMBL-EBI Integration of Sequence and 3D structure Databases “The key to Bioinformatics is integration, integration, integration” Bioinformatics: Bringing.

Presentation transcript:

EMBL-EBI: Integration of Sequence and 3D Structure Databases
“The key to Bioinformatics is integration, integration, integration”
Bioinformatics: Bringing it all together (technology feature), M. Chicurel, Nature 419, 751, 2002

“Coordinates by themselves just specify shape and are not necessarily of intrinsic biological value, unless they can be related to other information”
Integrative database analysis in structural genomics, Mark Gerstein, Nature Structural Biology 7, 960, 2000

“The information management challenge for the future will be to develop new ways to acquire, store and retrieve not only biological data per se, but also those data in the context of biological knowledge”
Biological Databases and Informatics Program Announcement, NSF

“Only the development of integrated bioinformatics systems will enable the manipulation of complex biological information”
Editorial, Bioinformatics 18 (12), 1551, 2002

• EMBL-Bank: DNA sequences
• EnsEMBL: human genome gene annotation
• UniProt: protein sequences
• EMSD: macromolecular structure data
• ArrayExpress: microarray expression data

• Integration With UniProt
• eFamily Project
• Future Plans

Integration With UniProt
UniProt (Universal Protein Resource) is the world's most comprehensive catalogue of information on proteins. It is a central repository of protein sequence and function, created by joining the information contained in Swiss-Prot, TrEMBL, and PIR.

MSD / UniProt: Two Different Database Systems
• Agreed common mechanism for exchange of information between UniProt and MSD
• Services
“One of the major benefits of using databases for data storage is for data sharing”

MSD/UniProt Collaboration
• Collaboration between the MSD (Sameer Velankar, Phil McNeil) and UniProt (Virginie Mittard, Daniel Barrell) groups
• Depends upon:
  – Clean UniProt (UNP) cross-references in the DBREF records for each chain (where possible)
  – Clean taxonomy IDs for each PDB chain
  – Taxonomy for PDB Source and UniProt OS must be the same

• Cleanup of the DBREF records in the PDB entries
  – Cleanup of the UniProt cross-references in PDB entries
• Cleanup of Source information
  – NCBI Taxonomy IDs
• Cleanup of the Reference information
• Update UniProt entries
  – Source, Reference, Secondary structure information
• Supply additional information
  – Revision date, experimental method, resolution, R-factor
• Residue-by-residue mapping between MSD and UniProt enables chimaeras to be handled correctly

Sequence Schema

Residue by Residue Mapping to UniProt

PDB   CHAIN  UNP     SERIAL  PDB_RES  PDB_SEQ  UNP_RES  ANNOTATION
1HG1  A      P06608  1       ALA      22       A        NOT OBSERVED
1HG1  A      P06608  2       ASP      23       D        NOT OBSERVED
1HG1  A      P06608  3       LYS      24       K        NOT OBSERVED
1HG1  A      P               LEU      25       L
1HG1  A      P               PRO      26       P
1HG1  A      P               ASN      27       N
1HG1  A      P               ILE      28       I
1HG1  A      P               VAL      29       V
1HG1  A      P               ILE      30       I
1HG1  A      P               LEU      31       L
1HG1  A      P               ALA      32       A
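Rows of this mapping can be handled as typed records. The following is a minimal Python sketch, not the MSD implementation: the `ResidueMapping` class and the helper functions are hypothetical names, the field names simply follow the column headers on the slide, and only the three rows that are complete in the transcript are used.

```python
from dataclasses import dataclass
from typing import List, Optional

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

@dataclass(frozen=True)
class ResidueMapping:  # hypothetical record type, one per table row
    pdb: str
    chain: str
    unp: str
    serial: int
    pdb_res: str
    pdb_seq: int
    unp_res: str
    annotation: Optional[str] = None

# The three complete rows from the slide.
ROWS = [
    ResidueMapping("1HG1", "A", "P06608", 1, "ALA", 22, "A", "NOT OBSERVED"),
    ResidueMapping("1HG1", "A", "P06608", 2, "ASP", 23, "D", "NOT OBSERVED"),
    ResidueMapping("1HG1", "A", "P06608", 3, "LYS", 24, "K", "NOT OBSERVED"),
]

def unobserved(rows: List[ResidueMapping]) -> List[ResidueMapping]:
    """Return the rows flagged as not observed in the structure."""
    return [r for r in rows if r.annotation == "NOT OBSERVED"]

def residues_agree(rows: List[ResidueMapping]) -> bool:
    """Check that each PDB residue name matches the UniProt one-letter code."""
    return all(THREE_TO_ONE.get(r.pdb_res) == r.unp_res for r in rows)
```

A consistency check such as `residues_agree` is one simple way to catch a mapping that has slipped by a residue, since a shifted mapping would usually pair mismatched amino acids.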

Display of Mappings

Integration With IntEnz
• IntEnz is the Integrated relational Enzyme database and is the most up-to-date version of the Enzyme Nomenclature.
• The IntEnz relational database, implemented and supported by the EBI, is the master copy of the Enzyme Nomenclature data.
• MSD uses the UniProt accession code(s) mapped to each chain to link to the IntEnz EC number.
• This is done directly via the MSD and IntEnz Oracle relational databases.
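The link from a PDB chain to an EC number through the UniProt accession is essentially a relational join. The sketch below illustrates the idea with an in-memory SQLite database standing in for the Oracle databases; the table names (`chain_unp`, `unp_ec`), column names and the `ec_numbers` function are assumptions made for illustration and do not reflect the actual MSD or IntEnz schemas. The sample row pairs the 1HG1 chain A accession from the mapping slide with the L-asparaginase EC number.

```python
import sqlite3

# In-memory stand-in for the two relational databases (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE chain_unp (pdb TEXT, chain TEXT, unp TEXT);  -- MSD side
CREATE TABLE unp_ec (unp TEXT, ec TEXT);                  -- IntEnz side
INSERT INTO chain_unp VALUES ('1HG1', 'A', 'P06608');
INSERT INTO unp_ec VALUES ('P06608', '3.5.1.1');
""")

def ec_numbers(pdb, chain):
    """Return the EC numbers reachable from a chain via its UniProt accession(s)."""
    rows = conn.execute(
        "SELECT DISTINCT e.ec "
        "FROM chain_unp c JOIN unp_ec e ON e.unp = c.unp "
        "WHERE c.pdb = ? AND c.chain = ? "
        "ORDER BY e.ec",
        (pdb, chain),
    ).fetchall()
    return [row[0] for row in rows]
```

Because a chain may map to more than one accession, the function returns a list rather than a single value.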

eFamily
The eFamily project is designed to integrate the information contained in five of the major protein databases.

eFamily Core Activities
• To integrate the information contained in the five major protein databases.
• The member databases (CATH, SCOP, MSD, InterPro, and Pfam) contain information describing protein domains.
• For SCOP, CATH and MSD, the data are primarily concerned with 3D structures.
• In InterPro and Pfam, the focus is mainly on the sequences.
• It is often difficult for biologists to navigate from protein sequence to protein structure and back again.
• eFamily aims to provide the scientific community with a coherent and rich view of protein families that allows users to navigate seamlessly between the worlds of protein structure and protein sequence, through improved data resources and integration via grid technologies.

DATA INTEGRATION
[Figure: links between UniProt, InterPro, PROSITE, Pfam, SCOP, CATH, GO and MSD. Link labels: curated; common domains definition; HMM prediction; mapping & curation; mapping per residue; mapping start – end; MSD mapping residues/sequence.]

Complexity of Mappings
Mapping relationships: InterPro to UniProt(s); UniProt to PDB CHAIN(S); CATH/SCOP DOMAIN to PDB CHAIN(S); InterPro to CATH/SCOP; CATH/SCOP DOMAIN to UniProt
• An InterPro entry is a collection of one or more UniProt entries
• Unlike in the PDB, the concept of a CHAIN does not exist in UniProt
• A UniProt entry is always numbered from 1 to N
• PDB SEQRES residue numbering is from 1 to N
• PDB CHAIN (ATOM records) residue numbering is not necessarily 1 to N
• UniProt to PDB mapping can be one to many
• PDB CHAIN to UniProt mapping can be one to many
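Since the mapping is one to many in both directions, a simple lookup table keyed on one side is not enough. A minimal Python sketch of a two-way index follows; all identifiers in `LINKS` are invented placeholders, and `build_indexes` is a hypothetical helper rather than part of any MSD or UniProt tool.

```python
from collections import defaultdict

# (pdb_id, chain, uniprot_accession) links; every identifier here is a placeholder.
LINKS = [
    ("pdb1", "A", "UNP_X"),
    ("pdb1", "B", "UNP_X"),  # one UniProt entry seen in two chains
    ("pdb2", "A", "UNP_Y"),  # chimaeric chain covering two UniProt entries
    ("pdb2", "A", "UNP_Z"),
]

def build_indexes(links):
    """Index the links in both directions, keeping every partner."""
    unp_to_chains = defaultdict(set)
    chain_to_unps = defaultdict(set)
    for pdb, chain, unp in links:
        unp_to_chains[unp].add((pdb, chain))
        chain_to_unps[(pdb, chain)].add(unp)
    return unp_to_chains, chain_to_unps

unp_to_chains, chain_to_unps = build_indexes(LINKS)
```

Storing sets on both sides keeps chimaeric chains and multi-chain entries visible instead of silently overwriting one partner with another.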

MSD-SCOP Mapping for 1cbw
[Table columns: SCOP Domain; PDB Residue Range; Chains; Swiss-Prot Residue Range]

MSD-CATH Mapping for 1cbw
[Table columns: CATH Domains; PDB Residue Range; Chains; Swiss-Prot Residue Range]

MSD-Pfam Mapping for 1cbw
[Table columns: Pfam Domain; PDB Residue Range; Chains; Swiss-Prot Residue Range]

Practical Applications of Database Integration

Mappings Used in Pfam
• Pfam now uses the UniProt-to-structure mapping from the MSD Search Database
• Saves duplication of effort and weeks of compute
• The mapping is used for annotation of alignments
[Figure: Pfam domains highlighted on the structure of RuBisCo (8ruc)]

Mappings Used in InterPro

Mappings Used in SCOP

Comparison of SCOP, CATH and Pfam Domains
SCOP, CATH and Pfam have developed web services for describing their particular domain families. These services can be queried with a protein identifier, protein accession or PDB identifier. The databases use the MSD/UniProt mapping to translate between the sequence and structure domains.

XML & Web Services
The eFamily project has developed an XML schema to describe:
• Domains
• Annotation
• Sequence Alignments
• Structure Alignments
This will be used to provide web services as part of the eFamily project.
More information about the XML schema is available at -
We are also developing a Perl-based API for the eFamily XML, which will be available from the eFamily site as well as via BioPerl.
The MSD residue-by-residue mapping is made available in XML format based on the eFamily schema.
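To give a feel for what serving a residue-by-residue mapping as XML involves, here is a small Python sketch using the standard library. The element and attribute names (`mapping`, `residue`, `pdbSeq`, `pdbRes`, `unpRes`) are invented for this illustration and are not taken from the eFamily schema, which is not reproduced in the slides; `mapping_to_xml` is likewise a hypothetical helper.

```python
import xml.etree.ElementTree as ET

def mapping_to_xml(pdb, chain, unp, residues):
    """Serialise (pdb_seq, pdb_res, unp_res) tuples as XML (made-up element names)."""
    root = ET.Element("mapping", pdb=pdb, chain=chain, unp=unp)
    for pdb_seq, pdb_res, unp_res in residues:
        ET.SubElement(
            root, "residue",
            pdbSeq=str(pdb_seq), pdbRes=pdb_res, unpRes=unp_res,
        )
    return ET.tostring(root, encoding="unicode")

# Two residues from the 1HG1 chain A mapping shown earlier in the deck.
xml_text = mapping_to_xml("1HG1", "A", "P06608", [(22, "ALA", "A"), (23, "ASP", "D")])
```

A consumer can read the document back with `ET.fromstring(xml_text)` and iterate over the `residue` elements.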

Future Plans

Mapping Annotation

Residue Mapping Program 1
• Makes use of the cleaned-up cross-reference and taxonomy data, the SEQRES and ATOM/HETATM records from the PDB, and the sequence from the UniProt entry to align and map each residue.
• Makes connected segments from the PDB ATOM/HETATM records for each chain.
• These segments are then aligned against the SEQRES records, and the alignments for all segments are merged to give the SEQRES-ATOM alignment.
• This enables any unobserved residues to be considered.

Residue Mapping Program 2
• A similar operation is performed on the UniProt sequence and the connected segments from the ATOM/HETATM records to give the UNP-ATOM alignment.
• The SEQRES-ATOM and UNP-ATOM alignments are then merged to give the final alignment.
• This is repeated for each chain in the PDB archive (with a UNP cross-reference).
• The mapping is loaded into the MSD relational database and validated.
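The steps of the residue mapping program can be sketched roughly as follows. This is a simplified Python illustration, not the MSD program: it assumes that "connected" means consecutive residue numbers, it places each segment by exact substring matching instead of a real sequence alignment, and it leaves unobserved residues without a UniProt position. The sequences, residue numbers and function names are all made up for the example.

```python
def segments(atoms):
    """Split [(resnum, one_letter), ...] into runs of consecutive residue numbers."""
    segs, cur = [], []
    for num, aa in atoms:
        if cur and num != cur[-1][0] + 1:
            segs.append(cur)
            cur = []
        cur.append((num, aa))
    if cur:
        segs.append(cur)
    return segs

def align_segments(segs, seq):
    """Map ATOM residue number -> 1-based position in seq, placing segments left to right."""
    mapping, start = {}, 0
    for seg in segs:
        text = "".join(aa for _, aa in seg)
        pos = seq.find(text, start)  # exact match stands in for a real alignment
        if pos < 0:
            raise ValueError(f"segment {text!r} not found in sequence")
        for i, (num, _) in enumerate(seg):
            mapping[num] = pos + i + 1
        start = pos + len(text)
    return mapping

def merge(seqres, seqres_map, unp_map):
    """Merge both alignments into rows of (seqres_pos, atom_resnum, unp_pos)."""
    by_pos = {pos: num for num, pos in seqres_map.items()}
    rows = []
    for pos in range(1, len(seqres) + 1):
        num = by_pos.get(pos)  # None when the residue was not observed
        rows.append((pos, num, unp_map.get(num) if num is not None else None))
    return rows

# Toy chain: a two-residue tag precedes the UniProt sequence, one residue is unobserved.
seqres = "GSMKTAYIAK"
unp = "MKTAYIAKQR"
atoms = [(1, "M"), (2, "K"), (3, "T"), (5, "Y"), (6, "I"), (7, "A"), (8, "K")]

segs = segments(atoms)
rows = merge(seqres, align_segments(segs, seqres), align_segments(segs, unp))
```

Going through the ATOM records as the common reference is what lets the two independently numbered sequences (SEQRES and UniProt) be related, and rows with no ATOM number mark the unobserved residues.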

Integrating data from MSD into CATH
• Protocols have been developed for regular imports of a subset of the MSD data warehouse into a local CATH database set up in Oracle 9i.
• For example, information on the biological unit and on protein-ligand interactions will be integrated to increase functional annotations for CATH domain families.

MSD & CATH Data Exchange
Two-step process of data synchronisation:
• Data are moved from the MSD search database to the CATH-UCL site using a combination of Oracle Export/Import and SQL*Loader utilities.
• Subsequent updates in the MSD database are pushed to the CATH site using an incremental replication mechanism.
• Data from the CATH site are pushed to the MSD site using the same two-step process.
• The two databases are synchronised.
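The two-step pattern (one bulk copy, then incremental updates) can be illustrated with a small Python sketch. Plain dictionaries stand in for the databases, and the change-log format and function names are assumptions made for this example; the real exchange relies on Oracle utilities and replication, which are not modelled here.

```python
def full_export(source):
    """Step 1: bulk copy of every row (stands in for Export/Import or SQL*Loader)."""
    return dict(source)

def apply_increment(target, changes):
    """Step 2: replay only the changes recorded since the last synchronisation."""
    for op, key, value in changes:
        if op == "delete":
            target.pop(key, None)
        else:  # "upsert": insert a new row or overwrite an existing one
            target[key] = value
    return target

# Toy source database and its initial copy at the other site.
msd = {"row1": "a", "row2": "b"}
cath_copy = full_export(msd)

# Later updates at the source, captured as a change log.
changes = [("upsert", "row3", "c"), ("upsert", "row1", "a2"), ("delete", "row2", None)]
apply_increment(msd, changes)        # the source itself changes
apply_increment(cath_copy, changes)  # the same changes are pushed to the copy
```

After the increment is replayed, both sides hold the same rows without a second bulk transfer, which is the point of separating the initial load from later updates.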

• Structure: SCOP, CATH
• Sequence: UniProt (née Swiss-Prot/TrEMBL/PIR), InterPro, GO, Pfam
• Function: IntEnz
• Literature: Medline