Quantum Chemical and Machine Learning Approaches to Property Prediction for Druglike Molecules. Dr John Mitchell, University of St Andrews.

1. Solubility is an important issue in drug discovery and a major source of attrition. This is expensive for the pharma industry. A good model for predicting the solubility of druglike molecules would be very valuable.

How should we approach the prediction/estimation/calculation of the aqueous solubility of druglike molecules? Two (apparently) fundamentally different approaches: theoretical chemistry & informatics.

Theoretical Chemistry: Calculations and simulations based on real physics. Calculations are either quantum mechanical or use parameters derived from quantum mechanics. Attempt to model or simulate reality. Usually low throughput.

Dataset: The thermodynamically most stable polymorph was selected where possible. All have experimental crystal structures. All have experimental logS. 10 have experimental ΔG sub and ΔG hydr (circled in red).

CheqSol Method
● First precipitation: Kinetic Solubility (Not in Equilibrium)
● Thermodynamic Solubility through “Chasing Equilibrium”: Intrinsic Solubility (In Equilibrium)
● We continue “Chasing equilibrium” until a specified number of crossing points have been reached
● A crossing point represents the moment when the solution switches from a saturated solution to a subsaturated solution; no change in pH, gradient zero, no re-dissolving nor precipitating. SOLUTION IS IN EQUILIBRIUM
Supersaturation Factor: SSF = S kin – S 0
8 intrinsic solubility values. Repeatability better than 0.05 log units.
Figure labels: In Solution; Powder; Supersaturated Solution; Subsaturated Solution.
* A. Llinàs, J. C. Burley, K. J. Box, R. C. Glen and J. M. Goodman. Diclofenac solubility: independent determination of the intrinsic solubility of three crystal forms. J. Med. Chem. 2007, 50(5), 979-983.

Thermodynamic Cycle: Crystal → Gas → Solution

Sublimation Free Energy: Crystal → Gas (rigid molecule approximation). Calculating ΔG sub is a standard procedure in crystal structure prediction.

Theoretical method for crystal: ΔG sub from lattice energy & a phonon entropy term; DMACRYS using B3LYP/6-31G** multipoles and FIT repulsion-dispersion potential.

Results for ΔG sub: Lattice energies from DMACRYS with FIT atom-atom model potential and B3LYP/6-31G(d,p) distributed multipoles.

A 46 compound set has a larger error, mostly due to some large outliers. Error statistics vary with dataset.

Thermodynamic Cycle: Crystal → Gas → Solution

Hydration Free Energy: We expected that hydration would be harder to model than sublimation, because the solution has an inexactly known and dynamic structure, both solute and solvent are important, etc.

Theoretical method for aqueous solution: ΔG hydr from Reference Interaction Site Model with Universal Correction (3DRISM-KH/UC).

Reference Interaction Site Model (RISM): Combines features of explicit and implicit solvent models. Solvent density is modelled, but with no explicit molecular coordinates or dynamics. ~45 CPU mins per compound.

Reference Interaction Site Model (RISM) Palmer, D.S., et al., Accurate calculations of the hydration free energies of druglike molecules using the Reference Interaction Site Model. Journal of Chemical Physics, 2010. 133(4): p. 044104-11.

Results for ΔG hyd: Perhaps surprisingly, the error in ΔG hyd is smaller than in ΔG sub.

logS from Thermodynamic Cycle (Crystal → Gas → Solution): Add the two terms, ΔG sub and ΔG hyd, to get ΔG sol and hence logS.
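The addition step can be written out as a short derivation. This is only a sketch under a stated assumption: the solubility S is expressed relative to a reference concentration S° (a symbol introduced here, not taken from the slides), so that any standard-state conversion terms, which the slides do not specify, are absorbed into S°.

```latex
\begin{align}
\Delta G_{\mathrm{sol}} &= \Delta G_{\mathrm{sub}} + \Delta G_{\mathrm{hyd}} \\
\Delta G_{\mathrm{sol}} &= -RT \ln\!\left(\frac{S}{S^{\circ}}\right) \\
\log_{10}\!\left(\frac{S}{S^{\circ}}\right) &= -\frac{\Delta G_{\mathrm{sub}} + \Delta G_{\mathrm{hyd}}}{RT \ln 10}
\end{align}
```

Because the two terms enter as a simple sum, errors in ΔG sub and ΔG hyd both propagate directly into the predicted logS.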

Results for ΔG sol

Conclusions: Solubility from Theory. Must calculate ΔG sub & ΔG hyd separately; RISM is efficient & fairly accurate for ΔG hyd; Experimental data for ΔG sub & ΔG hyd are sparse and errors may be large; Dataset size and composition make comparisons of methods hard; Not yet matched accuracy of informatics.

Informatics and Empirical Models: In general, informatics methods represent phenomena mathematically, but not in a physics-based way. The model relating inputs to output is an empirically parameterised equation or a more elaborate mathematical model. Do not attempt to simulate reality. Usually high throughput.

What Error is Acceptable? For typically diverse sets of druglike molecules, a “good” QSPR will have an RMSE ≈ 0.7 logS units. An RMSE > 1.0 logS unit is probably unacceptable. This corresponds to an error range of 4.0 to 5.7 kJ/mol in ΔG sol.
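The kJ/mol figures follow from the factor RT ln 10 that links a base-10 log unit to a free energy. The sketch below reproduces the conversion, assuming a temperature of 298.15 K (the slides do not state one); the function name is chosen here for illustration.

```python
import math

R_KJ_PER_MOL_K = 8.314462618e-3  # gas constant in kJ/(mol K)


def logs_error_to_kj_per_mol(log_units, temperature_k=298.15):
    """Convert an error in logS (base-10 log units) into a free energy error in kJ/mol."""
    # One log10 unit corresponds to RT * ln(10) in free energy.
    return log_units * R_KJ_PER_MOL_K * temperature_k * math.log(10)


for err in (0.7, 1.0):
    print(f"{err:.1f} logS units -> {logs_error_to_kj_per_mol(err):.1f} kJ/mol")
# 0.7 logS units -> 4.0 kJ/mol
# 1.0 logS units -> 5.7 kJ/mol
```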

What Error is Acceptable? A useless model would have an RMSE close to the SD of the test set logS values: ~ 1.4 logS units; The best possible model would have an RMSE close to the SD resulting from the experimental error in the underlying data: ~ 0.5 logS units?

Machine Learning Method Random Forest

Random Forest: Solubility Results. N train = 658; N test = 300. Test set: RMSE(te) = 0.69; r²(te) = 0.89; Bias(te) = -0.04. Out-of-bag: RMSE(oob) = 0.68; r²(oob) = 0.90; Bias(oob) = 0.01. D. S. Palmer et al., J. Chem. Inf. Model., 47, 150-158 (2007).
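For readers who want to reproduce statistics of this kind, here is a minimal sketch of the three measures quoted. The slides do not define them, so two assumptions are made: bias is taken as the mean signed error (predicted minus observed), and r² as the squared Pearson correlation. The logS values are made up purely for illustration.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(observed, predicted)) / n)


def bias(observed, predicted):
    """Mean signed error, predicted minus observed (sign convention assumed here)."""
    return sum(p - o for o, p in zip(observed, predicted)) / len(observed)


def r_squared(observed, predicted):
    """Squared Pearson correlation of observed and predicted values (assumed definition)."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    ss_o = sum((o - mean_o) ** 2 for o in observed)
    ss_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov ** 2 / (ss_o * ss_p)


# Made-up logS values, not data from the talk.
observed = [-1.0, -2.0, -3.0, -4.0]
predicted = [-1.5, -2.0, -2.5, -4.0]

print(f"RMSE = {rmse(observed, predicted):.2f}")
print(f"r2   = {r_squared(observed, predicted):.2f}")
print(f"Bias = {bias(observed, predicted):.2f}")
```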

Support Vector Machine

SVM: Solubility Results et al., N train = 150 + 50; N test = 87. RMSE(te) = 0.94; r²(te) = 0.79.

100 Compound Cross-Validation Theoretical energies don’t seem to improve descriptor models.

Replicating Solubility Challenge (post hoc). CDK descriptors; N train 94; N test 28.
Method: RF; RF; PLS; SVM
RMSE(te): 1.09; 1.00; 0.89; 1.08
r²(te): 0.39; 0.49; 0.58; 0.41
Correct within 0.5 logS units (out of 28): 10; 12; 12; 13
Although the test dataset is small, it is a standard set.

Conclusions: Solubility from Informatics. Experimental data: errors unknown, but limit possible accuracy of models; CheqSol - step in right direction; Dataset size and composition hinder comparisons of methods; Solubility Challenge - step in right direction.

2. Protein Target Prediction: Which protein does a given molecule bind to? Virtual screening; multiple endpoint drugs (polypharmacology); new targets for existing drugs; prediction of adverse drug reactions (ADR): computational toxicology.

Predicted Protein Targets: Selection of 233 classes from the MDL Drug Data Report (MDDR); ~90,000 molecules; 15 independent 50%/50% splits into training/test sets. Actually we are predicting closely target-related MDDR classes.

Predicted Protein Targets: Cumulative probability of correct prediction within the three top-ranking predictions: 82.1% (±0.5%)

Protein Target Prediction: Given a specific compound, is it possible to predict computationally its biological interactions with protein targets? Very important for in silico screening (time and money efficient) and off-target prediction (side effects). Can be used for identifying substances with performance-enhancing potential. Drug discovery: Predicting promiscuity, Andrew L. Hopkins, Nature 462, 167-168 (12 November 2009), doi:10.1038/462167a

Substances Prohibited in Sports
WADA publishes and maintains a prohibited list of banned compounds, updated every 6 months. Substances are split into three main categories:
Substances prohibited at all times (in and out of competition): S0. Non-Approved Substances; S1. Anabolic Agents; S2. Peptide Hormones, Growth Factors and Related Substances; S3. Beta-2 Agonists; S4. Hormone Antagonists and Modulators; S5. Diuretics and Other Masking Agents.
Substances prohibited in competition: S6. Stimulants; S7. Narcotics; S8. Cannabinoids; S9. Glucocorticosteroids.
Substances prohibited in particular sports: P1. Alcohol, with a violation threshold of 0.10 g/L (Archery, Karate, etc.); P2. Beta-Blockers, prohibited In-Competition only (Bridge, Curling, Darts, Wrestling, Archery, etc.).

Methodology (example: Stanozolol). Database ranking by class: Rank 1, Anabolic Agents; Rank 2, VitaminD; Rank 3, Glucocorticoids.

ChEMBL-Activities: Each compound has experimental data for a number of targets. Activity data are based on IC50, EC50, Ki, Kd, etc. Some activities are just labelled “inactive” or “active”. Each compound can have more than one record for a given target.

Filtering the ChEMBL Families
Each of the 8,845 targets has a number of compounds assigned. Not all compounds have actual data on the target or are active. We filtered each of the families according to rules defining “active” and “inactive”. The rules were decided by visual inspection of distributions.
Rules:
IC50: ≤ 50000 nM active; > 50000 nM inactive
Ki: < 20000 nM active; ≥ 20000 nM inactive
Kd: ≤ 10000 nM active; > 10000 nM inactive
EC50: ≤ 40000 nM active; > 40000 nM inactive
ED50: ≤ 10000 nM active; > 10000 nM inactive
Potency: ≤ 10000 nM active; > 10000 nM inactive
Activity: ≥ 40% active; < 40% inactive
Inhibition: ≥ 45% active; < 45% inactive
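The filtering rules lend themselves to a small lookup table. The sketch below encodes the thresholds exactly as listed on the slide, including the strict inequality used for Ki. The names `RULES` and `classify` are hypothetical, and the code is an illustration rather than the authors' implementation.

```python
import operator

# measure -> (threshold, comparison that marks a record as "active").
# Concentration-type measures are in nM; Activity and Inhibition are percentages.
RULES = {
    "IC50": (50000.0, operator.le),
    "Ki": (20000.0, operator.lt),  # the slide uses a strict "<" for Ki only
    "Kd": (10000.0, operator.le),
    "EC50": (40000.0, operator.le),
    "ED50": (10000.0, operator.le),
    "Potency": (10000.0, operator.le),
    "Activity": (40.0, operator.ge),
    "Inhibition": (45.0, operator.ge),
}


def classify(measure, value):
    """Return 'active' or 'inactive', or None when no rule covers the measure."""
    rule = RULES.get(measure)
    if rule is None:
        return None
    threshold, is_active = rule
    return "active" if is_active(value, threshold) else "inactive"


print(classify("IC50", 12.0))       # active
print(classify("Ki", 20000.0))      # inactive
print(classify("Inhibition", 84))   # active
```

Records that carry only an “active” or “inactive” label, and compounds with several records for the same target, would need separate handling that this sketch does not attempt.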

Example: Distributions of Ki

Refined Families: Filtered families consist of compounds with significant experimental activities against the relevant targets. Many targets have distinct groups of ligands with different scaffolds. This may be because there is more than one binding site, or because different scaffolds can fit the same site. Splitting such a family into smaller groups based on ligand structure will allow us to identify the different sets of ligands.

Refined Families - PFClust: We selected the PFClust algorithm because it is a parameter-free clustering algorithm and does not require any kind of parameter tuning. PFClust: A novel parameter free clustering algorithm. Mavridis L, Nath N, Mitchell JBO. BMC Bioinformatics 2013, 14:213.

Database Refinement
Original Families: 5443 families; 1366460 compounds
After Rule Filtering: 3563 families; 783690 compounds
After Clustering (Refined Families): 19639 families; 616600 compounds
Predicting the protein targets for athletic performance-enhancing substances. Mavridis L, Mitchell JBO. J Cheminformatics 2013, 5:31.

Database Refinement - Validation
Monte Carlo Cross-Validation. The three versions of the database were examined (Original, Filtered and Refined). 10% of each family were randomly removed and used as queries. If the top prediction was the family that the query was a member of, a TP was counted; if not, an FP.
Average Matthews Correlation Coefficient (MCC) and percentage correct for Top Hit (Top Four):
Original: MCC 0.02; 2.58% (6.61%)
Filtered: MCC 0.03; 3.18% (7.21%)
Refined: MCC 0.66; 66.98% (87.25%)
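The validation protocol can be sketched in a few lines. The example shows only the two generic pieces: holding out 10% of each family as queries, and computing the Matthews Correlation Coefficient from confusion counts. The function names are hypothetical, taking at least one query per family is an assumption, and the slides do not say how TN and FN were tallied, so the prediction step itself is left out.

```python
import math
import random


def hold_out_queries(families, fraction=0.10, seed=0):
    """Randomly remove a fraction of each family's members to use as queries.

    Returns (reference, queries): the reduced families, and a list of
    (compound, true_family) pairs.
    """
    rng = random.Random(seed)
    reference, queries = {}, []
    for name, members in families.items():
        shuffled = list(members)
        rng.shuffle(shuffled)
        # Assumption: take at least one query from every family.
        n_query = max(1, round(fraction * len(shuffled)))
        queries.extend((compound, name) for compound in shuffled[:n_query])
        reference[name] = shuffled[n_query:]
    return reference, queries


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient; returns 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Toy families of integer "compound" identifiers, for illustration only.
families = {"A": list(range(20)), "B": list(range(100, 110))}
reference, queries = hold_out_queries(families)
print(len(queries), "queries held out")
print(mcc(tp=50, tn=50, fp=0, fn=0))
```

Repeating the split with different seeds and averaging the resulting MCC values gives the Monte Carlo flavour of the cross-validation.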

P2 - Beta Blockers
20 explicitly prohibited compounds. Every compound, except timolol and levobunolol, gave a strong prediction (PR-Score) for at least one family. Good experimental validation. We see that the majority of the families are Beta-1, 2 & 3 adrenergic receptor ligands, as expected. Other families also generate some interesting results, such as the serotonin 1a receptor, indicated to make off-target interactions with pindolol.
Table columns: Compound; Target; PR-Score; E-Value.
Alprenolol (266195) Cavia porcellus (369) 0.039 LogB/F = −0.158 Carvedilol (723) β-1 adrenergic receptor (3252) 0.032 Ki = 0.81 nM β-2 adrenergic receptor (210) 0.044 Ki = 0.166 nM Pindolol (500) β-2 adrenergic receptor (3754) 0.047 β-3 adrenergic receptor (4031) 0.036 β-1 adrenergic receptor (3252) 0.017 Prediction Ki = 1 nM β-2 adrenergic receptor (210) 0.015 Ki = 0.4 nM β-2 adrenergic receptor (3754) 0.026 β-3 adrenergic receptor (4031) 0.018 Inhibition = 84% Ki = 1 nM Serotonin 1a (5-HT1a) (214) 0.026 Ki = 24 nM Propranolol (27) Sotalol (471) β-2 adrenergic receptor (210) β-3 adrenergic receptor (246) 0.003 0.009 IC50 = 12 nM IC50 = 7200 nM

WADA - P2 Beta Blockers: Carvedilol; Metoprolol

S8 - Cannabinoids
10 explicitly prohibited compounds. 17 refined families, of which 13 are cannabinoid CB1/2 receptors. All compounds show strong predicted affinity to at least one cannabinoid receptor, except cannabivarol. Excellent agreement between PR-scores and experimental results.
Table columns: Compound; Target; PR-Score; E-Value.
Cannabidivarin (−) Cannabinoid CB1 receptor (218) 0.037 Prediction Cannabigerol (497318) Cannabinoid CB2 receptor (253) HL-60 (383) 0.037 0.047 Prediction HU-210 (70625) JWH-018 (561013) Cannabinoid CB1 receptor (3571) Cannabinoid CB2 receptor (5373) Cannabinoid CB1 receptor (218) Cannabinoid CB1 receptor (3571) Cannabinoid CB2 receptor (253) 0.035 0.029 0.002 0.015 0.009 Ki = 0.82 nM a Prediction pKi = 8.7 pKi = 8.045 pKi = 8.2 JWH-073 (−) Isoprenylcysteine carboxyl methyltransferase (4699) MDA-MB-231 (400) Cannabinoid CB1 receptor (218) 0.031 0.030 0.002 Prediction Cannabinoid CB1 receptor (3571) 0.025 Prediction Tetrahydrocannabinol (465) Cannabinoid CB1 receptor (218) 0.037 Ki = 2.9 nM Cannabinoid CB1 receptor (3571) Cannabinoid CB2 receptor (2470) Cannabinoid CB2 receptor (253) Cannabinoid CB2 receptor (5373) 0.037 0.034 0.033 0.049 Ki = 37 nM Ki = 20 nM Ki = 3.3 nM Ki = 9.2 nM

WADA - S8 Cannabinoids: Tetrahydrocannabinol; HU-210; JWH-018/073

Discussion: As for any method, the success of our approach depends on the quality of the underlying data. Our methodology addresses the problem that only a small fraction of possible activities of different molecules against different targets have been assayed. For ChEMBL families that are not well populated, or for protein targets against which too few compounds have been assayed, we cannot make predictions.

Conclusions: Target Prediction. Automated data curation of the ChEMBL families greatly increases the precision of our protein target predictions. Our validations show good correspondence with experiment, 87% having the correct refined family among the top four hits. Across the seven WADA classes considered, we find a combination of expected and unexpected protein targets. Many of the non-obvious predicted targets have biochemically or clinically validated connections with the expected bioactivities.

Thanks SULSA, WADA, BBSRC, SFC Dr Lazaros Mavridis, James McDonagh, Dr Tanja van Mourik, Dr Luna De Ferrari, Neetika Nath (St Andrews) Prof. Maxim Fedorov, Dr David Palmer (Strathclyde) Laura Hughes, Dr Toni Llinas (ex-Cambridge) James Taylor, Simon Hogan, Gregor McInnes, Callum Kirk, William Walton (U/G project)
