High Throughput Processing of the Structural Information of the Protein Data Bank Zoltán Szabadka, Vince Grolmusz Department of Computer Science Eötvös.

High Throughput Processing of the Structural Information of the Protein Data Bank Zoltán Szabadka, Vince Grolmusz Department of Computer Science Eötvös University, Budapest

What is wrong with the PDB?
- It is not uniform: each author has a different style.
- It is hard to process automatically:
  - Residue numbering is not always sequential.
  - The chemical symbols of the atoms are often missing.
  - It is not easy to tell how many ligands an entry contains, and chain ids are not used consistently.
  - It is not clearly indicated whether a molecule has missing atoms, or which atoms are missing.
- There is a need for a “front-end” database to the PDB.

Flow of data
[Diagram: Internet -> local PDB mirror (download and check for updates) -> structural decomposition -> database of structure and coordinate data -> SQL query -> test sets of docking algorithms, list of binding sites, statistical information]

What types of molecules are there in a PDB entry?
- Protein chains (P)
- DNA/RNA chains (N)
- Ligands (L)
- Metals and other small ions (I)
- Water molecules (W)

Information stored in the database
- Covalent structure of molecules
- List of components of each entry
- Coordinate data for each atom
- Interactions between molecules

E/R diagram of the database: covalent structure
[E/R diagram; labels: molecule, monomer, atom, bond; relationship: contains; attributes: id, symbol, type, num]

E/R diagram of the database: component structure
[E/R diagram; labels: entry, component, molecule, atom, interaction; relationship: contains; attributes: pdbid, id, (x,y,z), type, length]
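The two E/R diagrams could map onto relational tables roughly as follows. This is only a sketch using Python's built-in sqlite3 module: the slides do not name the database engine or give the actual schema, so every table name, column name and foreign key below is an assumption that merely mirrors the diagram labels.

```python
import sqlite3

# Hypothetical schema: names mirror the diagram labels, the wiring is assumed.
SCHEMA = """
CREATE TABLE molecule (id INTEGER PRIMARY KEY);
CREATE TABLE monomer (
    molecule_id INTEGER NOT NULL REFERENCES molecule(id),
    num         INTEGER NOT NULL,   -- position within the molecule
    id          TEXT    NOT NULL,   -- monomer id, e.g. ALA
    PRIMARY KEY (molecule_id, num)
);
CREATE TABLE atom (
    id          INTEGER PRIMARY KEY,
    molecule_id INTEGER NOT NULL REFERENCES molecule(id),
    symbol      TEXT    NOT NULL    -- chemical symbol
);
CREATE TABLE bond (
    atom1 INTEGER NOT NULL REFERENCES atom(id),
    atom2 INTEGER NOT NULL REFERENCES atom(id),
    type  TEXT
);
CREATE TABLE entry (pdbid TEXT PRIMARY KEY);
CREATE TABLE component (
    id          INTEGER PRIMARY KEY,
    pdbid       TEXT    NOT NULL REFERENCES entry(pdbid),
    molecule_id INTEGER NOT NULL REFERENCES molecule(id)
);
CREATE TABLE coordinate (
    component_id INTEGER NOT NULL REFERENCES component(id),
    atom_id      INTEGER NOT NULL REFERENCES atom(id),
    x REAL, y REAL, z REAL          -- NULL could mark a missing atom
);
CREATE TABLE interaction (
    id         INTEGER PRIMARY KEY,
    component1 INTEGER NOT NULL REFERENCES component(id),
    component2 INTEGER NOT NULL REFERENCES component(id),
    type       TEXT,
    length     REAL
);
"""


def create_schema(conn):
    """Create all tables of the sketch in the given connection."""
    conn.executescript(SCHEMA)


def components_of(conn, pdbid):
    """Return (component id, molecule id) pairs for one entry."""
    cur = conn.execute(
        "SELECT id, molecule_id FROM component WHERE pdbid = ? ORDER BY id",
        (pdbid,),
    )
    return cur.fetchall()
```

Keeping the covalent structure (molecule, monomer, atom, bond) separate from the per-entry data (entry, component, coordinate, interaction) means a molecule that occurs in many entries is described only once.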

PDB file formats  PDB format This is the original PDB file format, it contains data records in separate lines, each with fixed length and format, eg. ATOM, HETATM, SEQRES, CONECT, etc.  mmCIF format This is a relational database description language, a file contains data tables called categories.  XML format The same tables are described by XML tags. The file sizes are huge, a file contains more data tags then data.

Structural units of an entry
The basic structural unit of both the PDB and the mmCIF format is the so-called monomer. It can be a molecule, a molecule fragment or just an atom. Each monomer has a code of at most three letters, called the monomer id, e.g. ALA for alanine, MG for magnesium ion, ACE for acetyl group, or HOH for water. A protein chain consists of many amino acid monomers, each having a sequence number that indicates its position within the chain. Similarly, DNA/RNA chains consist of many nucleic acid monomers. Metals, small ions, water and most ligands are a single monomer with a unique monomer id. The basic problem is that certain ligand molecules consist of two or more monomers, and this information is not always properly annotated in the PDB entries in either format.

mmCIF data categories
- entity: list of molecules in the entry. A molecule can be of three types: polymer, non-polymer and water. Each molecule has an entity id.
- entity_poly: contains the type of polymer entities, e.g. polypeptide(L).
- struct_asym: list of the components in the asymmetric unit. Each component has an asym id and an entity id.
- pdbx_poly_seq_scheme: describes the sequence of monomers in a polymer entity.
- pdbx_nonpoly_scheme: list of the monomers belonging to the non-polymer entities.
- atom_site: coordinate data for the atoms whose positions could be experimentally determined.

Structural decomposition based on the mmCIF format
First we read the list of components in the asymmetric unit. For each component, we read its entity type, and for each polymer entity, its polymer type. Then we read the sequence of monomers for the polymer entities, and the list of monomers belonging to the non-polymer entities. The structure of the monomers is known ‘a priori’ from a file named components.cif, which can be found at RCSB’s web site. So for each monomer, we have a list of atoms, lacking coordinate information. Now we go through the table atom_site, and for each atom, we find the monomer it belongs to and fill in the coordinates for that atom. If an atom of a monomer is not found, it is marked as missing.
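The last two steps (filling in coordinates and marking missing atoms) can be sketched as follows. This is an illustration under simplifying assumptions, not the authors' code: the inputs are plain Python tuples rather than parsed mmCIF tables, and the function name, argument names and key layout are all hypothetical.

```python
def fill_coordinates(monomers, templates, atom_site_rows):
    """Attach coordinates to expected atoms and report the missing ones.

    monomers:       list of (asym_id, seq_num, monomer_id), as would come
                    from the sequence and non-polymer scheme tables
    templates:      dict monomer_id -> list of atom names, as would come
                    from components.cif
    atom_site_rows: iterable of (asym_id, seq_num, atom_name, x, y, z)
    """
    # Step 1: every expected atom starts without coordinates (None).
    coords = {}
    for asym_id, seq_num, monomer_id in monomers:
        for atom_name in templates[monomer_id]:
            coords[(asym_id, seq_num, atom_name)] = None

    # Step 2: walk the coordinate table and fill in the atoms we expected.
    for asym_id, seq_num, atom_name, x, y, z in atom_site_rows:
        key = (asym_id, seq_num, atom_name)
        if key in coords:
            coords[key] = (x, y, z)

    # Step 3: whatever is still None was never observed, so it is missing.
    missing = sorted(key for key, xyz in coords.items() if xyz is None)
    return coords, missing
```

Starting from the template and then consuming atom_site makes missing atoms fall out naturally: they are the template atoms that no coordinate row ever touched.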

Definition of molecule types
- Protein chain: a polymer entity of type “polypeptide(L)” that is at least 10 monomers long.
- DNA/RNA chain: a polymer entity that is at least 5 monomers long and either has type “polydeoxyribonucleotide” or “polyribonucleotide”, or has more than half of its monomers being nucleic acids (monomer id A, C, G, I, T or U).
- Ion: there is a predefined list of monomer ids, containing metals and small ions.
- Water: the monomers of the water entity.
- Ligand: all monomers that do not belong to the above categories form the set of ligand monomers.
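These rules translate almost directly into a classification function. The sketch below is one possible reading: the order in which the rules are tested, the function name and the contents of ION_IDS are assumptions, since the slides give neither the precedence nor the predefined ion list.

```python
NUCLEIC_IDS = {"A", "C", "G", "I", "T", "U"}
NUCLEIC_TYPES = {"polydeoxyribonucleotide", "polyribonucleotide"}
# Hypothetical sample only; the slides do not list the predefined ion ids.
ION_IDS = {"MG", "ZN", "CA", "NA", "CL", "FE"}


def classify(entity_type, poly_type, monomer_ids):
    """Return P, N, I, W or L for one component.

    entity_type: "polymer", "non-polymer" or "water"
    poly_type:   polymer type string, or None for non-polymers
    monomer_ids: list of monomer ids making up the component
    """
    n = len(monomer_ids)
    if entity_type == "water":
        return "W"
    if entity_type == "polymer":
        if poly_type == "polypeptide(L)" and n >= 10:
            return "P"
        nucleic = sum(1 for m in monomer_ids if m in NUCLEIC_IDS)
        if n >= 5 and (poly_type in NUCLEIC_TYPES or 2 * nucleic > n):
            return "N"
        return "L"  # polymers that are too short fall through to ligands
    if n == 1 and monomer_ids[0] in ION_IDS:
        return "I"
    return "L"
```

Under this reading, a five-residue peptide is classified as a ligand, because it fails the length threshold for protein chains and is not nucleic.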

Ligands and binding sites
We define a graph on the atoms that have coordinate data. It has two types of edges:
- covalent: if the distance of the two atoms is less than 1.25 times the sum of their covalent radii
- VdW: if it is not covalent, but the distance of the two atoms is less than the sum of their Van der Waals radii
The graph is built using a 3-dimensional kd-tree in O(n log n) time.
We go through the edges:
- if an edge of covalent type connects two ligand molecules, then they are joined together into one new molecule
- if an edge connects a ligand to a protein chain, then this intermolecular interaction is recorded in the protein-ligand interaction table, marking the binding site of this ligand on the protein surface
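The procedure above can be sketched in pure Python. This is an illustration, not the authors' implementation: the radii are approximate values for four elements only (the slides do not list the radii used), the kd-tree build re-sorts at every level and is therefore O(n log^2 n) rather than O(n log n), and the union-find used to join ligands is a technique chosen here, not one named in the slides. All function names are hypothetical.

```python
import math

# Approximate radii in angstroms, for illustration only.
COVALENT_RADIUS = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66}
VDW_RADIUS = {"H": 1.20, "C": 1.70, "N": 1.55, "O": 1.52}


def build_kdtree(points, idx=None, depth=0):
    """Return a node (point index, axis, left, right), or None if empty."""
    if idx is None:
        idx = list(range(len(points)))
    if not idx:
        return None
    axis = depth % 3
    idx.sort(key=lambda i: points[i][axis])
    mid = len(idx) // 2
    return (idx[mid], axis,
            build_kdtree(points, idx[:mid], depth + 1),
            build_kdtree(points, idx[mid + 1:], depth + 1))


def within(tree, points, q, r, out):
    """Append to out the indices of all points within distance r of q."""
    if tree is None:
        return
    i, axis, left, right = tree
    if math.dist(points[i], q) <= r:
        out.append(i)
    d = q[axis] - points[i][axis]
    if d <= r:    # the query ball reaches the side below the split plane
        within(left, points, q, r, out)
    if d >= -r:   # the query ball reaches the side above the split plane
        within(right, points, q, r, out)


def classify_edges(atoms):
    """atoms: list of (element, (x, y, z)). Returns (i, j, kind) with i < j."""
    points = [p for _, p in atoms]
    tree = build_kdtree(points)
    # Largest distance at which any edge can exist.
    max_r = max(2 * max(VDW_RADIUS.values()),
                2.5 * max(COVALENT_RADIUS.values()))
    edges = []
    for i, (ei, pi) in enumerate(atoms):
        found = []
        within(tree, points, pi, max_r, found)
        for j in found:
            if j <= i:
                continue
            ej = atoms[j][0]
            d = math.dist(pi, points[j])
            if d < 1.25 * (COVALENT_RADIUS[ei] + COVALENT_RADIUS[ej]):
                edges.append((i, j, "covalent"))
            elif d < VDW_RADIUS[ei] + VDW_RADIUS[ej]:
                edges.append((i, j, "vdw"))
    return edges


def merge_ligands(edges, molecule_of, kind_of):
    """Join covalently linked ligands and collect ligand-protein contacts.

    molecule_of: list giving the molecule id of each atom index
    kind_of:     dict molecule id -> "P", "N", "L", "I" or "W"
    Returns (dict ligand id -> merged ligand id, set of (ligand, protein)).
    """
    parent = {m: m for m in kind_of}

    def find(m):
        while parent[m] != m:
            parent[m] = parent[parent[m]]  # path halving
            m = parent[m]
        return m

    for i, j, kind in edges:
        a, b = molecule_of[i], molecule_of[j]
        if kind == "covalent" and kind_of[a] == "L" and kind_of[b] == "L":
            parent[find(a)] = find(b)

    contacts = set()
    for i, j, kind in edges:
        a, b = molecule_of[i], molecule_of[j]
        if {kind_of[a], kind_of[b]} == {"L", "P"}:
            lig, prot = (a, b) if kind_of[a] == "L" else (b, a)
            contacts.add((find(lig), prot))
    merged = {m: find(m) for m in kind_of if kind_of[m] == "L"}
    return merged, contacts
```

The contacts are collected in a second pass so that each one refers to the final, merged ligand rather than to one of its original monomers.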

PDB version: June 6, 2005
- Number of PDB entries: 31,217
- Number of entries processed: 26,445
- Number of protein chains: 59,842
- Number of different sequences: 18,333
- Number of ligands: 53,834
- Number of different ligand molecules: 6,016
- Number of all atoms: 269,237,779
  - Number of atoms in protein chains: 240,243,785
  - Number of atoms in DNA/RNA chains: 7,709,842
  - Number of atoms in ligands and ions: 2,479,339
  - Number of atoms in water: 18,804,813
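As a quick consistency check, the four per-type atom counts add up exactly to the reported total, and protein chains account for about 89% of all atoms. The short script below reproduces that arithmetic using only the figures from the slide; the variable names are arbitrary.

```python
# Atom counts copied from the statistics slide.
atoms_by_type = {
    "protein chains": 240_243_785,
    "DNA/RNA chains": 7_709_842,
    "ligands and ions": 2_479_339,
    "water": 18_804_813,
}
total_atoms = 269_237_779

# The categories partition the atoms, so the counts must sum to the total.
assert sum(atoms_by_type.values()) == total_atoms

# Share of each molecule type, in percent of all atoms.
share = {name: 100.0 * n / total_atoms for name, n in atoms_by_type.items()}
for name, pct in share.items():
    print(f"{name}: {pct:.1f}%")
```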

Distribution of elements in ligands and ions
The distribution of the organic and the most frequent inorganic elements among the ligands and ions. We found 70 different elements.

Distribution of elements in protein chains
There were 17 different elements in the protein chains. The tables show the number of occurrences and, for the non-standard elements, the monomers that contain them.

Distribution of protein monomers
The table shows the distribution of the 20 natural amino acids and selenomethionine in the different chains and in all chains. The other non-standard monomers are listed below.

Protein-Ligand interactions
[Table not reproduced; residual labels: 10gs, A, C]
The table shows the number of protein-ligand interactions, the number of entries they occur in, and the number of different interaction types as more and more conditions are met.

Distribution of missing atoms
The distribution of the number of atoms missing from protein chains in the PDB entries. Note that there are relatively few entries where only a few atoms are missing.

Distribution of missing segments
The distribution of the lengths of missing chain segments at the beginning, in the middle and at the end of the chains. The length is measured in amino acids. Note that in the middle of the chain, typically 4-7 amino acids are missing.

Thank You!