CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS, TECHNICAL UNIVERSITY OF DENMARK (DTU)
Can protein model accuracy be identified?
Morten Nielsen, CBS, BioCentrum, DTU

1 Can protein model accuracy be identified?
Morten Nielsen, CBS, BioCentrum, DTU
NO!

2 Identification of protein model accuracy
Why is it important?
What is accuracy?
– RMSD, fraction correct, …
Protein model correctness/quality
– Procheck, Whatif, ProsaII, Verify3D
Prediction of protein model accuracy
– ProQ server

3 Why is it so important?
Reliable fold recognition
– P-value, E-value, Z-score, …
– Tells you whether you should believe in the fold!
Alignment (model construction)
– No obvious method to estimate the reliability of an alignment
    Number of gaps, length of gaps
    Amino acids in protein core and loops
– % id is too conservative
    Many low-homology models are accurate, and some high-homology models are wrong
Correct fold, wrong alignment => terrible model
How to gain confidence in a protein model?

4 Model accuracy. Swiss-model
1200 models sharing 25-95% sequence identity with the submitted sequences (www.expasy.ch/swissmod)

5 What is protein model accuracy?
Model quality (correctness)
– Does the model look like a protein?
    Hydrophobic residues in core, hydrophilic on surface
    Backbone geometry (phi/psi angles, bond lengths)
    Amino acid environment
A "correct" (protein-like) model can still be completely wrong

6 Model accuracy
If we know the answer:
Fraction correct = N_c / N
N_c = number correct (d_ij < 4 Å)
Figure: model (blue) and structure (yellow), with the distance d_ij marked
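The fraction-correct measure above can be sketched in a few lines. This is a minimal illustration, not the speaker's code: it assumes that the model and the structure are already superimposed, that residues are paired one-to-one by list position, that N is the number of paired residues, and that the distance is measured between one representative atom per residue (for example Cα). The function name and the toy coordinates are invented for the example.

```python
import math

def fraction_correct(model, structure, cutoff=4.0):
    """Return N_c / N for two equal-length lists of (x, y, z) coordinates in Å.

    Assumption: both lists are already superimposed and paired by position,
    so residue k of the model is compared with residue k of the structure.
    """
    if len(model) != len(structure) or not model:
        raise ValueError("need two non-empty coordinate lists of equal length")
    # A residue counts as correct when it lies within `cutoff` of its partner.
    n_correct = sum(
        1 for m, s in zip(model, structure) if math.dist(m, s) < cutoff
    )
    return n_correct / len(model)

# Toy coordinates (invented for illustration, not taken from a real protein).
structure = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
model = [(0.5, 0.0, 0.0), (3.8, 1.0, 0.0), (7.6, 5.0, 0.0), (11.4, 0.0, 9.0)]

print(fraction_correct(model, structure))  # 0.5 (two of four residues within 4 Å)
```

Unlike RMSD, a single badly placed loop cannot dominate this number, since each residue contributes at most one count.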

7 Amino acid environment
1,000,000 different protein sequences (Swissprot)
10,000 different solved protein structures (PDB)
600 different protein folds
=> Typical amino acid environment

8 Evaluation of model quality
Check for proper protein stereochemistry
– ProCheck (http://biotech.ebi.ac.uk:8400/cgi-bin/sendquery)
    Ramachandran plot, bond lengths, …
– Whatif (http://www.cmbi.kun.nl/gv/servers/WIWWWI)
    Packing quality
– Both are web servers
Fitness of sequence to structure
– ProsaII (http://lore.came.sbg.ac.at/Services/prosa.html)
    Program runs on Linux and Unix
– Verify3D (http://www.doe-mbi.ucla.edu/Services/Verify_3D/)
    Web server

9 ProCheck
Peptide backbone geometry
Bond length
Peptide planes
– Cα N C Cα
Dihedral angles
– … degrees: β strand (~20%)
– … degrees: α helix (~30%)
From: http://garlic.mefos.hr/garlic/commands/dihedrals.html
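For readers who want to see how a backbone dihedral angle is obtained from coordinates, here is a small sketch. It is not ProCheck's implementation; it uses the standard atan2 formulation for four points, and the helper names are made up for this example. For residue i, phi would be computed from the atoms C(i-1), N(i), Cα(i), C(i) and psi from N(i), Cα(i), C(i), N(i+1).

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1-p2 bond."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    # Unit vector along the central bond.
    norm = math.sqrt(_dot(b1, b1))
    b1 = (b1[0] / norm, b1[1] / norm, b1[2] / norm)
    # Remove the components of b0 and b2 that lie along the central bond,
    # leaving their projections onto the plane perpendicular to it.
    v = _sub(b0, tuple(_dot(b0, b1) * c for c in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * c for c in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))

# Toy points (invented): a quarter turn about the z axis.
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))  # 90.0
```

Plotting the (phi, psi) pair of every residue gives the Ramachandran plot that ProCheck reports.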

10 Ramachandran plot
Regions
– B. Beta strand
– A. Right-handed helix
– L. Left-handed helix
Color coding
– White. Disallowed
– Red. Most favorable
– Yellow. Allowed region
Glycines are shown as triangles

11 Find the wrong structure
1RIP: Ribosomal protein
1PLC: Electron transport protein

12 Procheck. Bond length
1plc

14 1plc

15 Find the wrong structure
1RIP: Ribosomal protein
1PLC: Electron transport protein

16 What-if. Fine packing quality
Statistical description of the local chemical environment in high-quality protein structures
– Superimpose tryptophans and find the average local environment. Same for the other amino acids
– Full-atom model
G. Vriend and C. Sander, 1992

17 Example. Model T0133
T0133: CASP5 target
Modeled by X3M (CPHModels-2.0, Lund O., 2002)
RMSD=7.3

18 Model - Fine packing quality
---Residue-----     State  AllAll   BB-BB   BB-SC   SC-BB   SC-SC
-------------------------------------------------------------------------
  1 ILE (  33 )       2    -0.737  -0.462   0.331  -1.312  -0.865
  2 SER (  34 )       2    -0.241   0.209  -0.021  -1.437  -1.421
…..
245 ALA ( 296 )       2    -1.919  -1.770  -1.264   0.000   0.000
246 GLU ( 297 )       3    -1.384  -0.641  -1.400   0.070  -1.132
247 HIS ( 298 )       3    -1.476  -1.211  -1.736  -0.874  -1.427
============================================================
All contacts   : Average = -0.459  Z-score = -3.05
BB-BB contacts : Average = -0.155  Z-score = -1.14
BB-SC contacts : Average = -0.445  Z-score = -2.94
SC-BB contacts : Average = -0.221  Z-score = -1.39
SC-SC contacts : Average = -0.701  Z-score = -4.10
============================================================
Average protein values ("Z-score for all contacts") can be read as follows:
-5.0  Guaranteed wrong structure. Bad structure or poor model
-3.0  Probably bad structure or unrefined model. Doubtful structure or model
-2.0  Structure OK or good model. Good structures
 0.0  Good structures.
 2.0  Good structures. Unusually Good structures
 4.0  Probably a strange model of a perfect helix
Bad model
BB: Backbone
SC: Side chain

19 T0133 structure - Fine packing quality
---Residue-----       State  AllAll   BB-BB   BB-SC   SC-BB   SC-SC
-------------------------------------------------------------------------
 18 ILE (  33 ) A       2     0.781   1.018  -0.116   0.661  -0.291
 19 SER (  34 ) A       2     1.435   1.467   0.077   2.284   0.134
…..
281 ALA ( 296 ) A       2    -2.272  -2.504  -0.404   0.000   0.000
282 GLU ( 297 ) A       2    -0.778  -1.601  -1.256   0.137   1.471
283 HIS ( 298 ) A       3    -0.836  -0.801  -0.948  -1.094   0.351
============================================================
All contacts   : Average =  0.001  Z-score = -0.04
BB-BB contacts : Average = -0.040  Z-score = -0.40
BB-SC contacts : Average =  0.139  Z-score =  0.90
SC-BB contacts : Average = -0.196  Z-score = -1.23
SC-SC contacts : Average = -0.024  Z-score =  0.02
============================================================
Average protein values ("Z-score for all contacts") can be read as follows:
-5.0  Guaranteed wrong structure. Bad structure or poor model
-3.0  Probably bad structure or unrefined model. Doubtful structure or model
-2.0  Structure OK or good model. Good structures
 0.0  Good structures.
 2.0  Good structures. Unusually Good structures
 4.0  Probably a strange model of a perfect helix
Good model

20 ProsaII (Potential of Mean Force)
Likelihood of amino acid packing
For high-quality protein structures, estimate nearest-neighbor counts for all amino acids
Exposure score is …
Method developed by Manfred Sippl, 1993
Works for Cα models
Philosophy
– Hydrophobic residues tend to have many neighbors (buried)
– Hydrophilic residues tend to have fewer neighbors (exposed)
– Finding a hydrophilic amino acid with many nearest neighbors can indicate a wrong model
Sippl, M.J. (1990) J. Mol. Biol. 213, 859-883.
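The neighbor-count statistic behind the exposure potential can be illustrated as follows. This is only a sketch of the counting step and does not reproduce ProsaII or its score: it assumes one representative atom per residue, and the radius, the function names and the toy data are all assumptions chosen for the example.

```python
import math

def neighbor_counts(coords, radius=10.0):
    """Count, for each residue, how many other residues lie within `radius` Å.

    Assumption: `coords` holds one representative atom per residue.
    """
    counts = [0] * len(coords)
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) <= radius:
                counts[i] += 1
                counts[j] += 1
    return counts

def mean_neighbors_by_residue(sequence, coords, radius=10.0):
    """Average neighbor count for each amino acid type in one structure."""
    totals = {}
    for aa, n in zip(sequence, neighbor_counts(coords, radius)):
        total, seen = totals.get(aa, (0, 0))
        totals[aa] = (total + n, seen + 1)
    return {aa: total / seen for aa, (total, seen) in totals.items()}

# Toy data (invented): two leucines packed together, an aspartate far away.
sequence = "LDLK"
coords = [(0.0, 0.0, 0.0), (20.0, 0.0, 0.0), (3.0, 0.0, 0.0), (6.0, 0.0, 0.0)]

print(neighbor_counts(coords, radius=5.0))                      # [1, 0, 2, 1]
print(mean_neighbors_by_residue(sequence, coords, radius=5.0))  # {'L': 1.5, 'D': 0.0, 'K': 1.0}
```

Collected over many high-quality structures, such counts give the typical burial of each amino acid type, against which a model's residues can be compared.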

21 Exposure potential
Exposure potential for D
D is a charged amino acid

22 ProsaII (Potential of Mean Force)
Likelihood of amino acid packing
Philosophy: not all amino acids are friends
– Amino acids of equal charge will not like to be in close contact
– Amino acids with opposite charge will like to be in contact
– Amino acids forming hydrogen bonds or salt bridges will like to be in close contact
Example: if D and E are close in sequence (s=3), then they prefer to be close in distance, d~5.5Å
Sippl, M.J. (1990) J. Mol. Biol. 213, 859-883.

23 Likelihood of amino acid packing

24 Verify 3D (Eisenberg et al. 1997)
Closely related to the ProsaII exposure potential
How well does each amino acid fit its local environment (hydrophobic/hydrophilic)?
– T0133, CASP5 target
– Modeled by X3M (Lund, O., 2002)
– RMSD=7.3
– Red/orange: crystal structure
– Blue/green: model

25 Model T0133. Verify 3D
Sequence has poor match to structure

26 Prediction of model accuracy
Integrate features from
– Model construction
– Protein packing (Procheck)
– Fitness of sequence to model (Verify3D, Whatif)
A model with low % id but high protein packing and fitness values is more likely to be correct than a model with high % id and poor packing and fitness values.

27 ProQ. Prediction of model accuracy
Neural network to identify correct protein models
– B. Wallner and Arne Elofsson, 2003
– http://www.sbc.su.se/~bjorn/ProQ
Input: a PDB structure/model
Output: accuracy measure
– LGscore
– MaxSub score

28 ProQ
Input to neural net
– Atom-atom contacts (C, N, O)
    How often is C in contact with N?
– Residue-residue contacts
    How often is E in contact with D?
– Solvent accessibility surface
    Average exposure of L's
– Secondary structure prediction
    How consistent is the prediction with the model?
Does not include any information about the model construction (% id, P-value, etc.)
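A question such as "how often is E in contact with D?" can be turned into a numeric feature roughly as sketched below. The slides do not give ProQ's actual feature definitions, so the contact cutoff, the one-point-per-residue representation, the function name and the toy data are all assumptions made purely for illustration.

```python
import math
from collections import Counter

def residue_contact_fractions(sequence, coords, cutoff=6.0):
    """Fraction of all residue-residue contacts made by each residue-type pair.

    Assumptions: one representative point per residue, and two residues are
    "in contact" when those points are closer than `cutoff` Å.
    """
    pairs = Counter()
    for i in range(len(sequence)):
        for j in range(i + 1, len(sequence)):
            if math.dist(coords[i], coords[j]) < cutoff:
                # Sort so that (E, D) and (D, E) are counted as the same pair.
                pairs[tuple(sorted((sequence[i], sequence[j])))] += 1
    total = sum(pairs.values())
    return {pair: n / total for pair, n in pairs.items()} if total else {}

# Toy data (invented): three acidic residues close together, one distant leucine.
sequence = "EDLE"
coords = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (20.0, 0.0, 0.0), (0.0, 4.0, 0.0)]

fractions = residue_contact_fractions(sequence, coords)
print(fractions[("D", "E")])  # two of the three contacts are D-E pairs
```

A vector of such fractions, one entry per residue-type pair, is the kind of fixed-length input a neural network can take regardless of protein size.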

29 CASP model T0113

30 Structure 1RIP

31 LiveBench data
11000 models
220 targets
Modeled by Pcons
Incorrect model: LGscore < 1.5, MaxSub < 0.1

32 Conclusions
Correct protein models cannot (yet!) reliably be identified!
Many methods from the protein crystallography world are useful for identifying wrong models
Bad models can, however, pass all filters
ProQ is a first attempt at an "accuracy prediction server"
– Integrates information from many sources
– The future will show whether this approach can provide reliable prediction of model accuracy
ProQ does not include information about the model construction. An integrated method should have this.
Room for improvement

