Leave-cluster-out crossvalidation is appropriate for scoring functions derived on diverse protein datasets

Christian Kramer, Peter Gedeck
Novartis Institutes for Biomedical Research, Basel, Switzerland

Introduction

Empirical rescoring functions for predicting protein-ligand interaction energies can be trained on large, diverse collections of crystal structure geometries augmented with binding data, such as the PDBbind or the BindingMOAD database. A recent publication demonstrated remarkable success in predicting the free energy of interaction from atom counts in a 12 Å radius around the ligand. However, the quality of prediction depends strongly on the composition of the training and validation sets. We suggest a generally applicable validation strategy that is not prone to protein-family recognition pitfalls.

Dataset & methods

Dataset: PDBbind07 was used to reproduce the RF-score results. The PDBbind09 refined set was used to demonstrate leave-cluster-out crossvalidation.

Descriptors: The RFscore descriptors as published by Ballester and Mitchell were used for all models. For every ligand atom type [C, N, O, F, P, S, Cl, Br, I], all protein atoms of each type [C, N, O, S] within a distance of 12 Å are counted and summed, giving 4 × 9 = 36 atom-pair descriptors.

Learning algorithm: The Random Forest as implemented in R was used with default settings.

References

Ballester, P.J. & Mitchell, J.B.O. A machine learning approach to predicting protein-ligand binding affinity with applications to molecular docking. Bioinformatics 26, 1169-1175 (2010).

Cheng, T., Li, X., Li, Y., Liu, Z. & Wang, R. Comparative Assessment of Scoring Functions on a Diverse Test Set. Journal of Chemical Information and Modeling 49, 1079-1093 (2009).

PDBbind core set & RFscore performance

The PDBbind07 core set can be predicted with RMSE = 1.58, R² = 0.59 and R = 0.77. It has been assembled from a clustering of the PDBbind07 database according to BLAST similarities.
The most active, the least active and the complex closest to the average activity have been extracted from each cluster with at least 4 members. This means that for every validation set entry there is at least one entry from the same protein family in the training set.

Leave-cluster-out crossvalidation

The PDBbind09 refined set consists of 1741 complexes in 561 clusters (90% BLAST similarity).

[Figure: Composition of the PDBbind09 database. Distribution of cluster population.]

For the leave-cluster-out crossvalidation we suggest the following clustering scheme: all clusters with more than nine members are kept (A-W). Clusters with four to nine members are united (X), clusters with two and three members are united (Y), and all singletons are united (Z).

[Figure: The PDBbind09 cluster alphabet.]

[Figure: Leave-cluster-out scheme. The complete set is split into a training and a validation set five times (Train 1-5, Validation 1-5), each time with a different cluster left out (Cluster 1 out to Cluster 5 out).]

Multidimensional scaling of the RFscore space shows that complexes from the same protein family indeed cluster. A flexible learning algorithm should therefore be well able to recognize protein family membership.

[Figure: Cluster proximities.]

The range of activities within protein families is smaller than the total range of activities.
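The clustering scheme and the resulting folds can be sketched in a few lines. This is an illustrative Python sketch, not the authors' code (the poster reports using R); the function names `cluster_alphabet` and `leave_cluster_out` are hypothetical, and assigning the letters A-W in order of decreasing cluster size is an assumption.

```python
from collections import Counter, defaultdict
from string import ascii_uppercase


def cluster_alphabet(cluster_ids):
    """Map each complex's raw cluster id to a fold label (A-W, X, Y or Z)."""
    sizes = Counter(cluster_ids)
    # Clusters with more than nine members keep their own letter.
    # Assumption: letters are handed out from the largest cluster downwards.
    large = sorted((c for c, n in sizes.items() if n > 9),
                   key=lambda c: (-sizes[c], str(c)))
    if len(large) > 23:
        raise ValueError("more than 23 large clusters: letters A-W are exhausted")
    letter_of = {c: ascii_uppercase[i] for i, c in enumerate(large)}

    labels = []
    for c in cluster_ids:
        n = sizes[c]
        if n > 9:
            labels.append(letter_of[c])
        elif n >= 4:
            labels.append("X")  # all clusters with four to nine members
        elif n >= 2:
            labels.append("Y")  # all clusters with two or three members
        else:
            labels.append("Z")  # all singletons
    return labels


def leave_cluster_out(labels):
    """Yield (held_out_label, train_indices, validation_indices) per fold."""
    members = defaultdict(list)
    for i, label in enumerate(labels):
        members[label].append(i)
    for label in sorted(members):
        held_out = set(members[label])
        train = [i for i in range(len(labels)) if i not in held_out]
        yield label, train, members[label]


if __name__ == "__main__":
    # Toy data: two large clusters, one mid-sized, one pair, one singleton.
    ids = ["hiv"] * 12 + ["tryp"] * 10 + ["p1"] * 5 + ["p2"] * 2 + ["p3"]
    for label, train, validation in leave_cluster_out(cluster_alphabet(ids)):
        print(label, len(train), len(validation))
```

Each fold trains on every complex outside the held-out label, so no member of the validation cluster is seen during training.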
To avoid predictions that benefit from protein-family membership, we suggest leave-cluster-out crossvalidation.

Biological Target                     Cluster  #samples       R     R²   RMSE
HIV Protease                          A             188    0.11   0.01   1.91
Trypsin                               B              74    0.73   0.53   1.04
Carbonic Anhydrase                    C              57    0.56   0.31   1.68
Thrombin                              D              52    0.37   0.14   2.03
PTP1B (Protein Tyrosine Phosphatase)  E              32    0.63   0.40   1.02
Factor Xa                             F              32    0.19   0.04   1.76
Urokinase                             G              29    0.78   0.61   0.95
Different similar Transporters        H              29   -0.12   0.01   1.17
c-AMP Dependent Kinase (PKA)          I              17    0.54   0.29   1.26
Beta-Glucosidase                      J              17    0.59   0.35   1.13
Antibodies                            K              16    0.58   0.34   1.57
Casein Kinase II                      L              16    0.44   0.19   1.10
Ribonuclease                          M              15    0.18   0.03   1.20
Thermolysin                           N              14    0.68   0.46   1.09
CDK2 Kinase                           O              13    0.64   0.41   1.11
Glutamate receptor 2                  P              13   -0.20   0.04   1.16
P38 Kinase                            Q              13    0.79   0.62   0.59
Beta-secretase 1                      R              12    0.93   0.86   1.51
tRNA-guanine transglycosylase         S              12    0.12   0.01   1.08
Endothiapepsin                        T              11    0.60   0.36   1.34
Alpha-mannosidase 2                   U              10   -0.17   0.03   1.88
Carboxypeptidase A                    V              10    0.78   0.61   1.71
Penicillopepsin                       W              10   -0.42   0.18   2.22
All Clusters with 4-9 complexes       X             387    0.56   0.31   1.63
All Clusters with 2-3 complexes       Y             340    0.53   0.28   1.61
Singletons                            Z             321    0.44   0.19   1.75

Performance for each cluster after leave-cluster-out crossvalidation.

Target specific scoring functions

If crystal structures with corresponding activities are available, target specific scoring functions can be generated. We generated scoring functions within the clusters with standard out-of-bag crossvalidation for the four largest clusters.

Conclusion

The advent of large diverse datasets of protein-ligand complexes makes it possible to generate scoring functions with a QSAR-type fitting procedure.

Global scoring functions must be validated with protein-ligand complexes that stem from protein families that are not present in the training set. Otherwise the validation will look overoptimistic (R² = 0.59 vs R² = 0.21).

Target specific scoring functions can be much more predictive than global scoring functions, even when trained with the same descriptors.

Acknowledgments

CK thanks the Novartis Education Office for a Presidential Postdoc Fellowship.
Table 1: Leave-cluster-out crossvalidation results on the PDBbind09 refined set (per-cluster values are listed above). Average R² = 0.21, average RMSE = 1.60.

Validation set: out-of-bag within cluster vs. cluster left out (four largest clusters).

                                        Out-of-bag within cluster      Cluster left out
Biological Target    Cluster  #samples       R     R²   RMSE           R     R²   RMSE
HIV Protease         A             188    0.67   0.45   1.17        0.11   0.01   1.91
Trypsin              B              74    0.86   0.74   0.71        0.73   0.53   1.04
Carbonic Anhydrase   C              57    0.62   0.38   1.64        0.56   0.31   1.68
Thrombin             D              52    0.69   0.48   1.09        0.37   0.14   2.03
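The three statistics reported per cluster can be computed as in the sketch below. The poster does not state its formulas, so this assumes that R is the Pearson correlation between measured and predicted activities and that R² is its square; the tabulated values are consistent with that reading (for Trypsin, 0.73² ≈ 0.53, and negative R values are paired with positive R²). The function names are hypothetical, and Python is used for illustration only.

```python
import math


def pearson_r(y_true, y_pred):
    """Pearson correlation between measured and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    # Undefined (division by zero) if either series is constant.
    return cov / math.sqrt(var_t * var_p)


def rmse(y_true, y_pred):
    """Root mean squared error of the predictions."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def cluster_metrics(y_true, y_pred):
    """R, R² (assumed here to be the squared Pearson R) and RMSE for one cluster."""
    r = pearson_r(y_true, y_pred)
    return {"R": r, "R2": r * r, "RMSE": rmse(y_true, y_pred)}
```

Under this definition a cluster can show a sizeable R² together with a large RMSE, which is why the tables report both: correlation measures ranking within the cluster, while RMSE also penalizes a systematic offset in the predicted activities.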