EBI is an Outstation of the European Molecular Biology Laboratory. Protein Databank in Europe (PDBe): An Introduction.

Protein Databank in Europe: Introduction
- Based at the European Bioinformatics Institute (EBI), an outstation of the European Molecular Biology Laboratory (EMBL) at Hinxton, UK.
- Started in 1996 with the goal of providing an autonomous structural database capability in Europe.
- The aims of the group are to provide:
  - a deposition site via which macromolecular structures can be added to the PDB (AutoDep) or EM (EMDep);
  - a stable and clean repository of macromolecular structure data;
  - services that allow users to access, search and retrieve structural data.

Protein Databank in Europe (PDBe) group
- Is one of the four sites around the world where 3D structures may be deposited.
- Provides a stable and clean repository of macromolecular structure data.
- Has services that allow users to access, search and retrieve structural data from a single web access point.

Worldwide Protein Data Bank (wwPDB)
- Consists of four sites: RCSB (USA), PDBj (Japan), BMRB (USA) and PDBe.
- The PDB is the single repository of all publicly available macromolecular structures.
- The PDB started in 1971 and now has around 54,000 entries; new entries are added weekly.
- Structures are deposited by experimentalists and the contents are freely available.
- The format of the archive is flat files with a fixed line format, although an improved flat-file format (mmCIF) is available.
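
To make the "fixed line format" concrete: in the legacy PDB flat-file format, each ATOM record is a line whose fields sit in fixed character columns, so a reader slices by position rather than splitting on whitespace. The following is a minimal Python sketch, not PDBe code. The column ranges follow the published PDB format for ATOM records; the names `Atom`, `parse_atom_line` and `format_atom_line` are hypothetical, and the sample coordinates are invented for illustration.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom_line(line: str) -> Atom:
    """Slice one ATOM/HETATM record by its fixed character columns."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    return Atom(
        serial=int(line[6:11]),        # columns 7-11
        name=line[12:16].strip(),      # columns 13-16
        res_name=line[17:20].strip(),  # columns 18-20
        chain_id=line[21],             # column 22
        res_seq=int(line[22:26]),      # columns 23-26
        x=float(line[30:38]),          # columns 31-38
        y=float(line[38:46]),          # columns 39-46
        z=float(line[46:54]),          # columns 47-54
    )


def format_atom_line(atom: Atom) -> str:
    """Hypothetical helper: write the same fields back into their columns."""
    return (
        f"ATOM  {atom.serial:>5} {atom.name:<4} {atom.res_name:>3} "
        f"{atom.chain_id}{atom.res_seq:>4}    "
        f"{atom.x:8.3f}{atom.y:8.3f}{atom.z:8.3f}"
    )


# Illustrative values only; the coordinates are not taken from a real entry.
original = Atom(2567, "N", "PHE", "B", 10, 12.345, -3.21, 45.678)
sample = format_atom_line(original)
atom = parse_atom_line(sample)
print(atom.name, atom.res_name, atom.chain_id, atom.x)
```

Because every field is positional, a single misplaced character shifts the meaning of the rest of the line, which is one reason such files are hard to validate and extend.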

PDBe Tasks
- Deposition site
- Data clean-up
- Database design and implementation
- Retrieve data

Structure Determination
- NMR: high-field spectrometer
- cryo-EM: electron microscope
- X-ray crystallography: synchrotron

Depositions and Curation
- Depositions started June 2002.
- Full deposition site from June […].
- […]% of all submissions via the EBI.
- Closely collaborate with the other wwPDB members for a single unified archive.

AutoDep 4.0
- A structure deposition and archiving system.
- Based on Java/XML technology.
- Available free under license for academic and industry use.
- Easy to install and use for in-house archiving before deposition to the PDB via the PDBe interface.

Disadvantages of flat files
- Macromolecular structures are very complex.
- The existing PDB format is incapable of fully describing even existing structures.
- The format is not readily extensible to cope, for example, with structural genomics data.
- The historical archive is non-uniform and poorly populated.
- Search and retrieval of flat files is difficult and/or inaccurate.

PHENYLALANINE: all looks normal?
ATOM 2567 N   PHE B
ATOM 2568 CA  PHE B
ATOM 2569 C   PHE B
ATOM 2570 O   PHE B
ATOM 2571 CB  PHE B
ATOM 2572 CG  PHE B
ATOM 2573 CD1 PHE B
ATOM 2574 CD2 PHE B
ATOM 2575 CE1 PHE B
ATOM 2576 CE2 PHE B
ATOM 2577 CZ  PHE B

PHENYLALANINE: all looks normal? Not quite: an outlier!

PDBe Curation
- Authentication of source: that the protein is from human and not rabbit, for example!
- Authentication of structure:
  - comparison of structure against raw data;
  - geometry and stereochemistry;
  - provide results back to depositor.
- Validation of correct methodology used: whether X-ray, NMR or EM.
- Conformity to standards: follows PDB format specifications.
- Error checks:
  - consistency checks, to identify simple typos: Homo sapiens and not Homo sapien (single human?);
  - outlier detection, to identify suspect records.
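
The slides do not say how outlier detection is implemented. As a generic illustration of the idea, the Python sketch below flags values that sit far from the median of their peers, using the modified z-score based on the median absolute deviation. The function name `flag_outliers`, the threshold and the bond lengths are all assumptions made for this example, not PDBe's validation procedure or real data.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)  # median absolute deviation
    if mad == 0:
        # All typical values are identical; anything different is suspect.
        return [i for i, v in enumerate(values) if v != med]
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical lengths (in angstroms) of the same bond type across records.
bond_lengths = [1.39, 1.38, 1.40, 1.39, 1.41, 1.38, 1.72, 1.40]
suspect = flag_outliers(bond_lengths)
print(suspect)
```

A median-based score is used here because a single extreme value barely moves the median, whereas it can inflate a mean and standard deviation enough to hide itself.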

Adopt standards
- Use the NCBI taxonomy database to ensure correct organism names.
- Use the UniProt database to ensure correct protein descriptions.
- Enzyme database.
- Annotated ligand information.

What happens when these checks fail?
- Raise the issue with the depositor.
- But the depositor might:
  - be unavailable
  - not be interested
  - not know the answer anyway
  - not be sure about which data have the problem
- The older the entry, the less likely the depositor can/will help.

What is the solution?
- Don't rush to define another format.
- Represent the structure data in a meaningful way (use a data model).

The benefits of a database
- Historically, data have been curated as flat files, with few, if any, checks on the consistency of the archive.
- There are many problems with the legacy files: some can be corrected or at least detected automatically during database loading; many must be manually corrected prior to loading.
- Once loaded, the entire archive can be subjected to various all-against-all comparisons that further enforce uniformity across entries.
- Spelling errors abound, e.g. 23 versions of this humble bug, ESCHERICHIA COLI, including: "$COLI", "COLI", "E. COLI", "ESCHERCHIA COLI", "ESCHERICHI $COLI", "ESCHERICHIA $ COLI", "ESCHERICHIA COLI", "ESCHERICHIA COLI.", "EXCHERICHIA COLI", "EXPRESCHERICHIA COLI".
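
One way such spelling variants could be caught at load time is to compare each free-text name with a controlled list and accept only close matches. The sketch below uses Python's standard `difflib` for this. It is an illustration under stated assumptions, not PDBe's loader: the reference list, the clean-up rules and the cutoff of 0.85 are all hypothetical.

```python
import difflib

# Hypothetical reference list; a real pipeline would draw on a taxonomy database.
REFERENCE_NAMES = ["ESCHERICHIA COLI", "HOMO SAPIENS", "SACCHAROMYCES CEREVISIAE"]


def normalise_organism(raw, cutoff=0.85):
    """Map a free-text organism name to a reference name, or None if unsure."""
    # Upper-case, drop stray '$' and trailing full stops, collapse whitespace.
    cleaned = " ".join(raw.upper().replace("$", " ").rstrip(".").split())
    matches = difflib.get_close_matches(cleaned, REFERENCE_NAMES, n=1, cutoff=cutoff)
    return matches[0] if matches else None


for raw in ["ESCHERCHIA COLI", "EXCHERICHIA COLI", "ESCHERICHIA COLI.", "E. COLI"]:
    print(f"{raw!r} -> {normalise_organism(raw)}")
```

Names that fall below the cutoff, such as the abbreviation "E. COLI", come back as `None` in this sketch; these are the cases that would need manual correction before loading.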

Benefits: ligand nomenclature
- PDBe maintains a curated database of HET compounds, against which legacy data will be compared.
- Ligands are often named inconsistently or even entirely incorrectly, e.g. α-D-mannose (MAN) vs β-D-mannose (BMA).
- Errors are detected using a graph-based structure comparison algorithm.
[Figure labels: Beta, Alpha]
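
The slide does not describe the comparison algorithm itself. To show what "graph-based" means in principle, the Python sketch below treats a ligand as a graph of atoms (nodes labelled by element) and bonds (edges) and tests by backtracking whether two records describe the same connectivity, regardless of atom order or name. Everything here is an assumption made for illustration, including the function name `same_connectivity` and the toy molecules. Note that this sketch ignores bond orders and stereochemistry, so it could not on its own separate α- from β-D-mannose, which share the same connectivity; a real implementation would also need chirality information.

```python
from collections import Counter


def same_connectivity(elements_a, bonds_a, elements_b, bonds_b):
    """True if two molecular graphs match atom-for-atom and bond-for-bond."""
    edges_a = {frozenset(bond) for bond in bonds_a}
    edges_b = {frozenset(bond) for bond in bonds_b}
    if Counter(elements_a) != Counter(elements_b) or len(edges_a) != len(edges_b):
        return False
    n = len(elements_a)

    def extend(mapping):
        # mapping[k] is the atom of B assigned to atom k of A.
        i = len(mapping)  # next atom of A to place
        if i == n:
            return True
        for j in range(n):
            if j in mapping or elements_a[i] != elements_b[j]:
                continue
            # Every bond (or non-bond) to an already placed atom must agree.
            if all(
                (frozenset((i, k)) in edges_a)
                == (frozenset((j, mapping[k])) in edges_b)
                for k in range(i)
            ):
                if extend(mapping + [j]):
                    return True
        return False

    return extend([])


# Heavy atoms only: ethanol written in two atom orders, and dimethyl ether.
ethanol_1 = (["C", "C", "O"], [(0, 1), (1, 2)])   # C-C-O
ethanol_2 = (["O", "C", "C"], [(0, 1), (1, 2)])   # O-C-C
ether = (["C", "O", "C"], [(0, 1), (1, 2)])       # C-O-C

print(same_connectivity(*ethanol_1, *ethanol_2))
print(same_connectivity(*ethanol_1, *ether))
```

The backtracking search is exponential in the worst case, which is acceptable for a toy example of a few atoms but not a statement about how a production system would scale.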

Database organization
[Diagram labels: PDB files, loading, deposition database, reference, derived data, external processes, transformation, search database, SQL query]
- The top-level entity in a structure entry in the deposition database is the assembly, as determined using Protein Interfaces, Surfaces and Assemblies (PISA).
- Every PDB entry in the search database is based on the quaternary structure/assembly as determined using PISA.
- X-ray structures are deposited as asymmetric units without biological context.

The PDBe databases
The PDBe actually consists of two separate databases:
- the deposition database is highly normalized, with thousands of relationships linking some 400 tables; it is the definitive archive for all structural data at PDBe;
- the search database is a much simpler, denormalized database, with data items duplicated and aggregated into 40 much wider tables, making it more amenable to searching and retrieval of data.
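
The contrast between the two databases can be sketched on a toy scale with Python's built-in `sqlite3` module. The schema below is entirely hypothetical (three small tables standing in for some 400, one wide table standing in for 40), and 'XXXX' is a placeholder rather than a real PDB code. It only illustrates the principle: facts are stored once on the normalized side and copied into a wide, join-free table on the search side.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized "deposition" side: each fact is stored once.
    CREATE TABLE entry    (entry_id TEXT PRIMARY KEY, method TEXT NOT NULL);
    CREATE TABLE organism (organism_id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
    CREATE TABLE chain (
        entry_id    TEXT NOT NULL REFERENCES entry(entry_id),
        chain_id    TEXT NOT NULL,
        organism_id INTEGER NOT NULL REFERENCES organism(organism_id),
        PRIMARY KEY (entry_id, chain_id)
    );
    INSERT INTO entry    VALUES ('XXXX', 'X-RAY DIFFRACTION');
    INSERT INTO organism VALUES (1, 'ESCHERICHIA COLI');
    INSERT INTO chain    VALUES ('XXXX', 'A', 1), ('XXXX', 'B', 1);

    -- Denormalized "search" side: one wide table with values repeated per row.
    CREATE TABLE search_chain AS
        SELECT c.entry_id, c.chain_id, e.method, o.name AS organism
        FROM chain c
        JOIN entry e    ON e.entry_id = c.entry_id
        JOIN organism o ON o.organism_id = c.organism_id;
""")

# The same question asked of both sides: with a join, and without.
joined = conn.execute(
    """SELECT c.chain_id, o.name
       FROM chain c JOIN organism o ON o.organism_id = c.organism_id
       WHERE c.entry_id = ? ORDER BY c.chain_id""",
    ("XXXX",),
).fetchall()
flat = conn.execute(
    "SELECT chain_id, organism FROM search_chain WHERE entry_id = ? ORDER BY chain_id",
    ("XXXX",),
).fetchall()
print(flat)
```

The organism name appears once in the normalized tables but once per chain in `search_chain`; that duplication is the price paid for simpler queries.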

Deposition database
The deposition database comprises:
- common reference data, such as amino-acid connectivity, HET group structures, etc.;
- older PDB entries, loaded from legacy files; the schema includes strict constraints, enforcing internal consistency and performing type checking and validation against the reference data;
- new entries, loaded from recent PDB submissions; new entries are loaded on a weekly basis, subject to the same constraints and checks during loading as legacy data.
The top-level entity in a structure entry is the assembly, as determined using the PISA server (PDBePISA).

Search database
- Each data item occurs only once in the deposition database, so data from a single entry are spread across many tables.
- To make searching faster, the data are aggregated into fewer, larger tables.
- Searching the search database requires fewer table joins, making database queries significantly faster and much less complex.
- The top-level entity in a structure entry is the assembly, as determined using the PISA server (PDBePISA).

Derived data
- During “transformation” from the deposition database to the search database, additional derived data are added.
- Numerous processes are run on the deposition data, including:
  - characterization of ligand binding sites;
  - derivation of secondary structure information;
  - mapping data onto other databases such as UniProt, CATH, SCOP, MEROPS, MEDLINE, etc., in order to provide an integrated view of data inside the database.

PDBe services
- PISA biological assemblies
- PDBeChem ligand data
- Electron density visualisation, AstexViewer
- PDBePro, PDBelite
- Fold matching
- PDBeMotif
- Linking to domain data, eFamily
- Sequence mapping, SIFTS

Quaternary Structure
- PISA provides an automated method for the determination of putative protein complexes, derived from PDB entries.
- Crystal symmetry matrices are applied to a protein structure, and possible complexes are detected by consideration of buried surface areas.
- PISA assignments form the basis for the assemblies for a given PDB entry in the PDBe.

PISA: complex divining!
[Diagram labels: ASU contents, expand crystal symmetry, analyze surface and contacts, possible assemblies, best!]
- Loss of accessible surface area > 10% of total surface.
- True complexes also look good!
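
The surface-area criterion quoted on this slide can be written as a small calculation. The Python sketch below assumes that "total surface" means the summed accessible surface area (ASA) of the chains taken in isolation; that reading, the function names and the numbers are assumptions for illustration only. It covers just this one threshold, not the full analysis that PISA performs.

```python
def buried_fraction(asa_isolated, asa_complex):
    """Fraction of accessible surface area lost when the chains associate."""
    total_isolated = sum(asa_isolated)
    if total_isolated <= 0:
        raise ValueError("isolated-chain surface areas must sum to a positive value")
    return (total_isolated - asa_complex) / total_isolated


def passes_surface_criterion(asa_isolated, asa_complex, threshold=0.10):
    """Apply the '>10% of total surface' threshold quoted on the slide."""
    return buried_fraction(asa_isolated, asa_complex) > threshold


# Hypothetical areas in square angstroms for a two-chain candidate assembly.
chains = [10000.0, 10000.0]
print(buried_fraction(chains, 17000.0))
print(passes_surface_criterion(chains, 17000.0))
print(passes_surface_criterion(chains, 19500.0))
```

In this toy case the first candidate buries 15% of the isolated surface and passes, while the second buries 2.5% and would be treated as a crystal contact rather than a complex.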

PISA assemblies
[Figure labels: 1E94, PISA assembly]

Some Implementation Issues
- The PDBe database is large and complex:
  - 50,000+ PDB entries
  - 40+ tables in the warehouse, many very large
  - cross-referenced against SwissProt, PubMed, etc.
- Need to expose as much of the data as possible, without making the interface too complex.
- Tools for different categories of end-user:
  - "Novice" user
  - Experienced user
  - Expert user

PDBe Searches
- Biobar: Mozilla/Netscape toolbar application for searching the MSD
- PDBelite: web form application for searching the MSD
- PDBepro: applet for searching the MSD
- PDBechem: complete collection of all the chemical species and small molecules in the PDB
- EMsearch: search tool for electron microscopy depositions
- PDBefold: Secondary Structure Matching (SSM) tool for protein structure comparison
- PDBesite: active site database search
- PDBemotif: 3D structural motif

Query capabilities in PDBe
- Browsing (click and read)
- Simple search: select records with some constraints (Biobar)
- More elaborate search: select specific fields of some records with constraints on some fields (PDBelite)
- Complex querying: ability to return an answer that results from a "live" computation and was not part of any record of the database (PDBepro)

PDBe provides…
- Clean biological data
- Integrated data
- A single web access point
- Query interfaces for different users (beginner, occasional or expert)
- Interconnected views of the data relating structure, sequence, text and experimental details

A database for all
[Diagram label: Search database]