Training pKa and logP prediction
Jozsef Szegezdi
Solutions for Cheminformatics
logP calculation models in Marvin

Model          Training set size   Number of parameters
VG             1000                120
KLOP           1700                100
PHYS           10000               110
Weighted       >10000              120
User defined   Variable            <=100

Three models are provided in Marvin. They share the same atom type definitions, taken from Viswanadhan, V. N., et al. J. Chem. Inf. Comput. Sci., 1989, 29, 3, 163-172. Unfortunately, we cannot tell in advance which model will perform better for a molecule that is not included in the training set.
Problems with logP models
Frequently occurring problems in constructing logP models:
- The logP training set is too small
- The logP training set is unrepresentative
- The specification of atom types and interactions is subjective
- The number of logP parameters is restricted in order to preserve predictive power
As a result, the models will lack some interactions and atom types.
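Example for creating a local logP model
[Figure: structures of the 25 training molecules, labeled with the logP values -0.77, -0.31, 0.25, 0.88, 1.51, 2.03, 2.62, 3.00, 3.77, 4.57, 1.29, 1.28, 1.48, 1.23, 1.19, 1.79, -3.24, -0.92, 0.15, 1.46, 0.16, 2.85, -1.76, -1.04, 0.88]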
The figure below shows the logP values of the molecules calculated with the standard weighted method. The principle of the uniformity of nature would suggest that other OH-containing molecules can also be predicted reasonably well by the standard weighted method. Is this true? We test it with the hydroquinone molecule.
Test of standard models
The logP value of hydroquinone is 0.59. The next table summarizes the logP errors of the standard models.

Model          logP calc. - logP exp.
VG             0.88
KLOP           0.75
PHYS           0.68
Weighted       0.77
User defined   ?

The error of the standard models is relatively large. How can one improve the accuracy of the prediction? The prediction error can be reduced by creating a local model, using linear regression on the 25 molecules mentioned above. Command line call for creating the local model:
    cxcalc -T logP -t LOGP -o logPparameters.txt training25.sdf
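The idea of fitting a local model by linear regression can be illustrated with a small sketch. It assumes an additive model in which logP is a sum of atom-type contributions, and it fits those contributions by ordinary least squares. This is not Marvin's implementation: the three atom types, the counts, and the logP values are synthetic, and the function names (fit_contributions, predict, solve_linear_system) are hypothetical.

```python
# Toy illustration of fitting atom-type contributions by least squares.
# All counts and logP values below are synthetic; they are NOT Marvin's
# atom types, parameters, or training data.

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations act on both sides at once.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: atom types are not independent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_contributions(counts, logp):
    """Least-squares fit of logP ~ sum_j counts[i][j] * params[j] (normal equations)."""
    k = len(counts[0])
    xtx = [[sum(row[i] * row[j] for row in counts) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(counts, logp)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(count_row, params):
    """Additive prediction: sum of (atom-type count * contribution)."""
    return sum(n * p for n, p in zip(count_row, params))


# Synthetic training set: each row holds the counts of three hypothetical
# atom types in one molecule.
TRAIN_COUNTS = [
    [1, 0, 1],
    [2, 0, 1],
    [3, 1, 1],
    [4, 0, 2],
    [6, 1, 0],
]
TRUE_PARAMS = [0.5, -0.3, -1.0]  # used only to generate consistent toy data
TRAIN_LOGP = [predict(row, TRUE_PARAMS) for row in TRAIN_COUNTS]

params = fit_contributions(TRAIN_COUNTS, TRAIN_LOGP)
print([round(p, 3) for p in params])
```

Because the toy data are generated from known contributions, the fit recovers them. With real data the fitted values depend on which molecules are in the training set, which is why a local model can assign different values to the same atom type.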
Comparison of the standard and the user model
The figure below shows the logP values of the 25 OH-containing molecules calculated with the user-defined method after logP training.
[Figure: user's model]

Model          n    R2     s      logP error of test molecule (hydroquinone)
Weighted       25   0.96   0.36   0.77
User defined   25   0.99   0.10   0.24

The user-trained local model based on 25 molecules outperforms all of the standard models.
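For readers who want to reproduce fit statistics of this kind, here is a minimal sketch of how R2 and s can be computed from experimental and calculated values. It assumes that R2 is the coefficient of determination and that s is the root-mean-square error; the slides do not state the exact definition of s, so that part is an assumption. The function names and the four data points are hypothetical.

```python
import math


def r_squared(experimental, calculated):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_exp = sum(experimental) / len(experimental)
    ss_res = sum((e - c) ** 2 for e, c in zip(experimental, calculated))
    ss_tot = sum((e - mean_exp) ** 2 for e in experimental)
    return 1.0 - ss_res / ss_tot


def rms_error(experimental, calculated):
    """Root-mean-square error (assumed here to correspond to 's')."""
    ss_res = sum((e - c) ** 2 for e, c in zip(experimental, calculated))
    return math.sqrt(ss_res / len(experimental))


# Synthetic values for illustration only; not taken from the slides.
exp_values = [0.0, 1.0, 2.0, 3.0]
calc_values = [0.1, 0.9, 2.1, 2.9]

print(round(r_squared(exp_values, calc_values), 3))
print(round(rms_error(exp_values, calc_values), 3))
```

A regression standard error that divides by the degrees of freedom (n minus the number of fitted parameters) instead of n would give a somewhat larger s for the same residuals.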
Conclusions
The local model based on 25 molecules is more accurate than any of the standard global models. Depending on the training set, different parameter values will be assigned to the same atom type. This is one of the main characteristics of the user model. A carefully created set of local models must be superior to any large model. We plan to develop a model that combines many local models.
Apparent pKa and ionization%-pH curve
The ionization%-pH curve is drawn in blue for basic centers and in red for acidic centers.
[Figure: example structure labeled with the pKa values 10.28, 4.30, 5.10 and 2.49, with the corresponding ionization%-pH curves]
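The shape of such a curve can be sketched for the simplest case with the Henderson-Hasselbalch relation. The sketch treats a single, isolated ionizable center, which is a simplification: the apparent pKa values of a molecule with several interacting centers, as in the figure, need a full microspecies treatment that is not shown here. The function name percent_ionized is hypothetical, and using 4.30 as the pKa of an acidic center is only an illustrative assumption.

```python
def percent_ionized(ph, pka, kind):
    """Ionized fraction (in %) of one isolated center at a given pH.

    kind = "acid": HA <-> A- + H+   (ionized form is the deprotonated one)
    kind = "base": BH+ <-> B + H+   (ionized form is the protonated one)
    """
    if kind == "acid":
        return 100.0 / (1.0 + 10.0 ** (pka - ph))
    if kind == "base":
        return 100.0 / (1.0 + 10.0 ** (ph - pka))
    raise ValueError("kind must be 'acid' or 'base'")


# Tabulate a curve for a hypothetical acidic center with pKa 4.30.
for ph in range(0, 15, 2):
    print(ph, round(percent_ionized(ph, 4.30, "acid"), 1))
```

At pH equal to the pKa the center is 50% ionized. An acidic center approaches 100% ionization as the pH rises, while a basic center approaches 100% as the pH falls.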
Method for predicting pKa and training
Marvin's prediction model considers:
- partial charges
- polarizability
- the effect of ionizable centers on each other
Training refines the existing parameters for ionizable centers and, at the same time, creates new modifier parameters based on the structures and experimental values specified by the user.
Example for training pKa prediction
[Figure: structures of the 15 training molecules, numbered 1 to 15]
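Experimental vs. calculated pKa values
[Table: pKa 1 and pKa 2 values for molecules 1 to 15; the column boundaries did not survive in the transcript. pKa 1 row: 6.06.700.501.201.503.765.605.04.506.710.634.054.102.844.91. pKa 2 row: -3.01.0-7.58.350.719.01]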
Curating experimental pKa data
The input SDF file may be created in IJC. The training can be run using this command line:
    cxcalc -T pka -o c:/output InputpKadata.sdf
Conclusions
The user-defined pKa model is more accurate than the built-in default model. IJC can be used for curating input data for the training. The new model is only a refinement of the default model, so the training assumes a robust base model, which is provided in Marvin.