Being a binding site: Characterizing Residue-Composition of Binding Sites on Proteins joint work with Zoltán Szabadka and Gábor Iván, Protein Information.


1 Being a binding site: Characterizing Residue-Composition of Binding Sites on Proteins
Vince Grolmusz
Joint work with Zoltán Szabadka and Gábor Iván
Protein Information Technology Group, Department of Computer Science, Eötvös University, Budapest, Hungary

2 The Protein Data Bank
• It is a collection of the experimentally determined 3D structures of biopolymers and their complexes; today it contains more than 45,000 entries
• Experimental methods include:
  - X-ray diffraction
  - Nuclear magnetic resonance (NMR) spectroscopy
• PDB file formats:
  - pdb format
  - mmCIF format
  - XML format

3 The graph model of molecules
• The molecule is modelled as a graph whose vertices are the atoms and whose edges are the covalent bonds
• Each atom has an atomic number and a formal charge
• Each bond has an order:
  - 0 for coordinate covalent bonds
  - 1, 2 or 3 for single, double and triple bonds, respectively
• Aromatic ring systems are modelled with alternating single and double bonds
• A steric model is a graph model plus 3D coordinates for the atoms
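The graph model on this slide can be sketched as a small data structure. This is only an illustration: the class and method names (`Atom`, `Molecule`, `add_bond`, `is_steric`) are assumptions and are not taken from the RS-PDB implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Atom:
    atomic_number: int
    formal_charge: int = 0
    # 3D coordinates; None when the file gives no position for the atom.
    xyz: Optional[Tuple[float, float, float]] = None


@dataclass
class Molecule:
    # Hypothetical container: atoms keyed by name, bonds keyed by atom pair.
    atoms: Dict[str, Atom] = field(default_factory=dict)
    bonds: Dict[frozenset, int] = field(default_factory=dict)

    def add_atom(self, name, atomic_number, formal_charge=0, xyz=None):
        self.atoms[name] = Atom(atomic_number, formal_charge, xyz)

    def add_bond(self, a, b, order):
        # Order 0 = coordinate covalent bond; 1, 2, 3 = single, double, triple.
        if order not in (0, 1, 2, 3):
            raise ValueError("bond order must be 0, 1, 2 or 3")
        if a not in self.atoms or b not in self.atoms:
            raise KeyError("both atoms must exist before bonding them")
        self.bonds[frozenset((a, b))] = order

    def is_steric(self):
        # A steric model is a graph model in which every atom has coordinates.
        return all(atom.xyz is not None for atom in self.atoms.values())
```

An aromatic ring would be entered here as alternating order-1 and order-2 bonds, as the slide describes.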

4 Main problems
• Given a pdb file, find the steric model of each molecule in it
• Find the molecules that have unrealistic steric models
• Build a searchable database of protein-ligand complexes that fulfil certain additional quality requirements
Our solution: the RS-PDB Database (RS stands for Rich-Structure)

5 Difficulties and solutions
• The two main difficulties with these problems:
  - the basic units of a pdb entry are the residues and HET groups, not the molecules
  - the coordinates of some atoms could not be determined, and these atoms are simply missing from the files
• Therefore the problem cannot be solved for every entry
• We developed a method to process the PDB mmCIF files automatically, created a database with an approximate solution, and marked the places where there are errors or ambiguities

6 HET Group Dictionary
• The basic units of a pdb entry are the residues and HET groups; these will be called monomers
• A monomer can be a molecule or a molecule fragment
• Each monomer has a unique code: ASN, C, MG, NAD, …
• The covalent structures of these monomers are in a separate part of the PDB, the “PDB Chemical Component Dictionary”, formerly called the HET Group Dictionary (HGD)
• We converted the structure descriptions of these monomers to the graph model and put them in our HGD database

7 Processing of an mmCIF file (1) Polymers
• We read all the so-called entities from the file, each of them containing one or more monomers
• Each entity has a type, which can be polymer, non-polymer or water, and each polymer entity has a polymer type
• Next we build the polymers from the monomers, one by one; for example, in the case of proteins:

8 Constructing polypeptide chains: the peptide bond
[Figure: peptide-bond formation between monomers 1 and 2, and between monomers n-1 and n; atom labels include R, N, CA, HA, C, O, H, HN2, HXT and OXT]
When a new amino acid (i.e., a monomer) is added, we remove the atoms OXT and HXT from the end of the chain and the atom HN2 from the new monomer, and add a covalent bond between the atoms C and N. In the case of the amino acid PRO, we remove both HT1 and HT2. If, in the case of a non-standard amino acid (i.e., protein monomer), the above-mentioned atoms are not present, we refuse to build the chain.
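The chain-extension step described on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the chain representation (a dict of atom and bond sets), the function names and the exception type are assumptions, and the special handling of PRO is left out.

```python
class ChainBuildError(Exception):
    """Raised when the atoms needed to form the peptide bond are missing."""


def new_chain(atoms, bonds):
    # Atoms are stored as (monomer_index, atom_name); bonds as frozensets of two atoms.
    return {
        "atoms": {(1, a) for a in atoms},
        "bonds": {frozenset(((1, a), (1, b))) for a, b in bonds},
        "length": 1,
    }


def append_monomer(chain, atoms, bonds):
    last, new = chain["length"], chain["length"] + 1
    new_atoms = {(new, a) for a in atoms}
    old_leaving = {(last, "OXT"), (last, "HXT")}
    new_leaving = {(new, "HN2")}
    # Refuse to build the chain if any atom involved in the step is absent.
    if not (old_leaving | {(last, "C")}) <= chain["atoms"]:
        raise ChainBuildError("chain end lacks C, OXT or HXT")
    if not (new_leaving | {(new, "N")}) <= new_atoms:
        raise ChainBuildError("new monomer lacks N or HN2")
    leaving = old_leaving | new_leaving
    all_bonds = chain["bonds"] | {frozenset(((new, a), (new, b))) for a, b in bonds}
    # Drop the leaving atoms together with every bond that touches them.
    all_bonds = {b for b in all_bonds if not (b & leaving)}
    # Add the peptide bond between C of the old end and N of the new monomer.
    all_bonds.add(frozenset(((last, "C"), (new, "N"))))
    return {
        "atoms": (chain["atoms"] | new_atoms) - leaving,
        "bonds": all_bonds,
        "length": new,
    }
```

Raising an exception mirrors the slide's rule that a non-standard monomer without the expected atoms stops chain construction.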

9
• After the polymers are built, we define three types of polymer molecules:
  - Polypeptide chains (P): >10 monomers long
  - DNA/RNA chains (N): >5 monomers long
  - Polysaccharides (S): >5 monomers long
• The sequence of these polymers gives the graph model of the molecules
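The length thresholds above amount to a small lookup. In this sketch the polymer-type keys (`"polypeptide"`, `"nucleic"`, `"polysaccharide"`) are hypothetical labels chosen for illustration; only the codes P, N, S and the thresholds come from the slide.

```python
# Minimum lengths are exclusive: a chain must be strictly longer than the threshold.
POLYMER_CLASSES = {
    "polypeptide": ("P", 10),
    "nucleic": ("N", 5),
    "polysaccharide": ("S", 5),
}


def classify_polymer(polymer_type, n_monomers):
    """Return 'P', 'N' or 'S', or None if the chain is too short or of unknown type."""
    if polymer_type not in POLYMER_CLASSES:
        return None
    code, threshold = POLYMER_CLASSES[polymer_type]
    return code if n_monomers > threshold else None
```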

10 Processing of an mmCIF file (2) Ligands and their bond graph
• Initially, all monomers not belonging to a polymer are distinct ligands, with their graph models taken from the HGD
• We read all the available atomic coordinates from the mmCIF file to create the (partial) steric models
• We find all pairs of atoms less than 6 Å apart, building a kd-tree for this purpose
• If two atoms from different molecules are within covalent distance, we try to combine their graphs
• If this fails, or the atoms are too close, we record this in a separate database table of bond errors
• Next, crystallization artefacts and “junk” ligands are removed (similarly to the PDBBind database).
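The neighbour search in the third bullet can be illustrated with a small pure-Python kd-tree. This is a sketch under assumptions: the slides do not describe the actual tree implementation, and the function names `build_kdtree`, `query_radius` and `close_pairs` are invented here.

```python
import math


def build_kdtree(points, depth=0):
    """points: list of (index, (x, y, z)); splits on x, y, z in turn."""
    if not points:
        return None
    axis = depth % 3
    points = sorted(points, key=lambda p: p[1][axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def query_radius(node, center, radius, found):
    """Append to `found` the indices of all points closer than `radius` to `center`."""
    if node is None:
        return found
    idx, xyz = node["point"]
    if math.dist(xyz, center) < radius:
        found.append(idx)
    delta = center[node["axis"]] - xyz[node["axis"]]
    near, far = (node["left"], node["right"]) if delta < 0 else (node["right"], node["left"])
    query_radius(near, center, radius, found)
    # The far side can only hold hits if the splitting plane is within the radius.
    if abs(delta) < radius:
        query_radius(far, center, radius, found)
    return found


def close_pairs(coords, cutoff=6.0):
    """Return all index pairs (i, j), i < j, of atoms less than `cutoff` apart."""
    tree = build_kdtree(list(enumerate(coords)))
    pairs = set()
    for i, xyz in enumerate(coords):
        for j in query_radius(tree, xyz, cutoff, []):
            if i < j:
                pairs.add((i, j))
    return pairs
```

The resulting pairs would then be checked against covalent distances to decide whether two graphs should be combined or a bond error recorded.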

11 Database of protein-ligand complexes and binding sites
• A protein-ligand complex consists of a ligand and one or more protein chains that have atoms within van der Waals distance of the ligand; these atoms are painted red in the figure:

12 Getting rid of redundancies
• The PDB is strongly biased towards “popular” or “important” proteins; some chains (e.g., bovine trypsin) are present in more than 100 PDB entries.
• When mapping binding sites in the PDB, redundancies must be dealt with:
  - if ligand X is bound to chain A at the same place in different PDB entries, the site is counted once;
  - if ligand X is bound to chain A at distinct places, the site is counted twice or more.
• Result: 25,000 binding sites -> 19,000 binding sites
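The counting rule can be sketched as a deduplication over a key. The slides do not say how "the same place" is decided; this sketch assumes, purely for illustration, that a place is identified by the set of contacted residue positions on the chain. The entry ids in the usage example are made up.

```python
def deduplicate_sites(sites):
    """sites: iterable of (entry_id, chain_key, ligand_code, contact_positions).

    Returns a dict mapping each unique (chain, ligand, place) key to the
    first entry id in which it was seen.
    """
    unique = {}
    for entry_id, chain_key, ligand_code, contact_positions in sites:
        # Assumption: identical contact-position sets mean "the same place".
        key = (chain_key, ligand_code, frozenset(contact_positions))
        unique.setdefault(key, entry_id)
    return unique


# Hypothetical data: the same site seen in two entries, plus a second site.
sites = [
    ("E001", "chainA", "X", [57, 102, 195]),
    ("E002", "chainA", "X", [195, 57, 102]),
    ("E003", "chainA", "X", [12, 40]),
]
unique = deduplicate_sites(sites)
```

Here `unique` holds two keys: the first two records collapse into one site, and the third counts separately because the ligand binds at a distinct place.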

13 Residues in binding sites
Next, we collect the residues of the protein chains that are close to the ligands: we go through the ligand atoms one by one and find the protein atoms that are closer to them than 1.05 times the sum of the van der Waals radii of the two atoms.
There are no covalently bound ligands; they have already been filtered out.
Next, we identify the residues containing these atoms: for every binding site, a subset of the 20 amino acids is created.
If the same residue appears more than once, we insert it only once into the residue-set: we are interested in the plain appearance of the residue at the binding site.
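The contact criterion and the residue-set construction can be sketched as below. The radii are the commonly quoted Bondi values and are an assumption, since the slides do not say which van der Waals radii were used; the function name and input layout are also illustrative. A brute-force double loop is used here for brevity.

```python
import math

# Assumed van der Waals radii in Å (Bondi values); not taken from the slides.
VDW_RADII = {"H": 1.20, "C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}


def binding_site_residues(ligand_atoms, protein_atoms, factor=1.05):
    """ligand_atoms: list of (element, xyz).
    protein_atoms: list of (residue_name, element, xyz).
    Returns the set of residue names in contact with the ligand.
    """
    residues = set()
    for lig_element, lig_xyz in ligand_atoms:
        for res_name, prot_element, prot_xyz in protein_atoms:
            limit = factor * (VDW_RADII[lig_element] + VDW_RADII[prot_element])
            if math.dist(lig_xyz, prot_xyz) < limit:
                # A set keeps each residue type once, however often it occurs.
                residues.add(res_name)
    return frozenset(residues)
```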

14 Binding site residue frequencies

15 Association rules in residue-sets
• We are interested in implication-like rules such as (ALA,LEU) → (ILE,VAL): if a binding site contains the amino acids alanine and leucine, it will “likely” also contain isoleucine and valine.
• The main attributes of the rules are:
  - support: Prob(ALA,LEU,ILE,VAL)
  - confidence: Prob((ILE,VAL) | (ALA,LEU))
  - lift: Prob(ALA,LEU,ILE,VAL) / (Prob(ILE,VAL) · Prob(ALA,LEU))
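The three attributes can be computed directly from a collection of residue-sets, with probabilities estimated as relative frequencies. This is a minimal sketch; the function name `rule_metrics` and the toy data are assumptions made for illustration.

```python
def rule_metrics(sites, antecedent, consequent):
    """Support, confidence and lift of the rule antecedent -> consequent.

    sites: list of sets of residue codes, one set per binding site.
    """
    x, y = frozenset(antecedent), frozenset(consequent)
    n = len(sites)
    p_x = sum(x <= s for s in sites) / n          # Prob(X)
    p_y = sum(y <= s for s in sites) / n          # Prob(Y)
    p_xy = sum((x | y) <= s for s in sites) / n   # Prob(X and Y)
    if p_x == 0 or p_y == 0:
        raise ValueError("antecedent or consequent never occurs")
    return {
        "support": p_xy,
        "confidence": p_xy / p_x,
        "lift": p_xy / (p_x * p_y),
    }


# Toy residue-sets, invented for the example.
toy_sites = [
    {"ALA", "LEU", "ILE", "VAL"},
    {"ALA", "LEU", "ILE", "VAL", "GLY"},
    {"ALA", "LEU", "GLY"},
    {"GLY", "SER"},
]
metrics = rule_metrics(toy_sites, ["ALA", "LEU"], ["ILE", "VAL"])
```

A lift above 1 means the consequent appears together with the antecedent more often than independence would predict; a lift below 1 means less often.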

16 What is interesting?
• Association rules X → Y, where Y is a very frequently appearing residue-subset, are generally not interesting.
• On the other hand, if Y is infrequent, then the support and the confidence will generally not reach the thresholds required for inclusion in our results.
• For example, Y=GLY appears very frequently, while Y=CYS or Y=TRP appears rarely.
• Association rules with unusually high or unusually low lift, and rules of the form X → Y with high confidence and not-too-high support for Y, are of particular interest. Our next figures visualize such remarkable data.

17 Our first figure…
… was created by deleting all X → GLY association rules for clarity, and including only those rules for which
• the support is at least 7.15% and
• the confidence is at least 0.5 and
• at least one of the following conditions holds:
  a) the confidence is at least 0.8 or
  b) the lift is at least 1.8 or
  c) the lift is at most 0.97 or
  d) the support is at least 24%.
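The selection criteria for the first figure translate into a simple predicate. In this sketch the rule representation (a dict with `consequent`, `support`, `confidence` and `lift` keys, support given as a fraction) is an assumption; only the thresholds come from the slide.

```python
def in_first_figure(rule):
    """Return True if a rule passes the filters listed for the first figure."""
    # All X -> GLY rules are deleted for clarity.
    if rule["consequent"] == frozenset({"GLY"}):
        return False
    # Mandatory minimum support (7.15%) and confidence (0.5).
    if rule["support"] < 0.0715 or rule["confidence"] < 0.5:
        return False
    # At least one of the conditions a) to d) must hold.
    return (
        rule["confidence"] >= 0.8
        or rule["lift"] >= 1.8
        or rule["lift"] <= 0.97
        or rule["support"] >= 0.24
    )
```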

18 [Figure; labelled regions: High-confidence area, Low-lift area, High-support area, High-lift area]

19 Figure 2 contains rules, where…
• all X → GLY association rules are deleted for clarity, and
• the support is at least 7.15% and
• the confidence is at least 0.55 and
• the lift is at least 1.7.

20 All large fan-in stars contain GLY
Here, ALA, the sixth most frequent residue, is present in almost all bases, and THR (threonine), the tenth most frequent residue, appears in the center; all bases have 3 or 4 elements.

21 Conclusions
• We believe that by analysing the residue-composition of the binding sites in a really large and reliable data set, one can identify interesting data patterns, applicable in inhibitor and drug design;
• We think that this work is just one of the first steps in that direction.

22 Thank you very much!

