Introduction on QSAR and modelling of physico-chemical and biological properties Alessandra Roncaglioni – IRFMN Problems and.


1 Introduction on QSAR and modelling of physico-chemical and biological properties
Alessandra Roncaglioni – IRFMN
aroncaglioni@marionegri.it
Problems and approaches in computational chemistry

2 Outline
History
QSAR/QSPR steps
◦ (Descriptors)
◦ Activity data
◦ Modelling approaches
◦ Validation (OECD principles)
QSPR (Phys-chem properties)
QSAR (Biological activities)
Example (DEMETRA)

3 QSAR postulates
The molecular structure is responsible for all the activities
Similar compounds have similar biological and chemico-physical properties (Meyer 1899)
Hansch analysis (‘70s)
Free Wilson approach (‘70s)
H. Kubinyi. From Narcosis to Hyperspace: The History of QSAR. Quant. Struct.-Act. Relat., 21 (2002) 348-356.

4 Hansch analysis
Applied to congeneric series
$\log(1/C) = a\,\pi + b\,\sigma + c\,E_s + \text{const.}$
where
$C$ = effect concentration
$\pi$ = octanol-water partition coefficient (hydrophobic term)
$\sigma$ = Hammett substituent constant (electronic)
$E_s$ = Taft’s substituent constant (steric)
Linear free energy-related approach
McFarland principle
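As a rough illustration of how the Hansch coefficients a, b, c and the constant can be obtained, the sketch below fits the equation by ordinary least squares in plain Python. The substituent values and the "true" coefficients are invented for the example (they are not measured data), and the helper names `solve_linear_system` and `fit_hansch` are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_hansch(descriptors, activity):
    """Least-squares fit of log(1/C) = a*pi + b*sigma + c*Es + const.

    Returns [a, b, c, const], solved through the normal equations.
    """
    design = [list(row) + [1.0] for row in descriptors]  # add intercept column
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * y for d, y in zip(design, activity)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Invented (pi, sigma, Es) values for six members of a congeneric series.
series = [
    (0.0, 0.0, 0.0),
    (0.5, -0.2, -1.2),
    (0.7, 0.2, -1.0),
    (0.9, 0.3, -1.1),
    (0.0, -0.3, -0.5),
    (-0.3, 0.8, -2.5),
]
# Noise-free activities generated from assumed coefficients, so the fit
# should recover them.
a, b, c, const = 1.2, -0.8, 0.3, 2.0
log_inv_c = [a * pi + b * sigma + c * es + const for pi, sigma, es in series]

fitted = fit_hansch(series, log_inv_c)
print([round(v, 3) for v in fitted])
```

With real data the activities would carry experimental noise, and a numerical library would normally replace the hand-written solver.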

5 Free-Wilson analysis
$\log(1/C) = \sum_i a_i + \mu$
where
$C$ = effect concentration
$a_i$ = contribution per group
$\mu$ = activity of reference compound

6 The old QSAR paradigm
Compounds in the series must be closely related
Same mode of action
Basic biological activity
Small number of “intuitive” properties
Linear relation

7 The old QSAR paradigm
Factors limiting the old paradigm:
Software availability
Calculation of molecular properties
Limited computing power
Costs of hardware and software

8 The new QSAR paradigm
Heterogeneous compound sets
Mixed modes of action
Complex biological endpoints
Large number of properties
Non-linear modelling

9 The new QSAR paradigm
Factors enabling the new paradigm:
Increased computing power
QM calculations
Thousands of descriptors
Cost drop for hardware and software (freeware)

10 Outline
History
QSAR/QSPR steps
◦ (Descriptors)
◦ Activity data
◦ Modelling approaches
◦ Validation (OECD principles)
QSPR (Phys-chem properties)
QSAR (Biological activities)
Example (DEMETRA)

11 QSAR flowchart

12 [Figure: compounds (1, …, n) are described by 2D and 3D descriptors (1, …, m), giving the descriptor matrix D(n,m); the activity A of the compounds is modelled as A = f(D(n,m))]

13 QSAR/QSPR defined by Y data
Quantitative Structure-Property Relationship: physico-chemical or biochemical properties
◦ Boiling point
◦ Partition coefficients (LogP)
◦ Receptor binding
Quantitative Structure-Activity Relationship: interaction with the biota
◦ Toxicity
◦ Metabolism

14 Activity data
Garbage in, garbage out
Quality and quantity of data
◦ Suitable for the purpose?
◦ Intrinsic variability of Y data (particularly for QSAR): examples later on
◦ Chemical domain covered by experimental data
◦ As much data as you can get, especially if using complex models

15 Data need
Data are one of the pillars of the models
The goal is to extract knowledge from these data
If they are too noisy, it is not possible to extract this knowledge
A sufficient number of training data
Keep data variability low
Large number of compounds
[Figure axes: Quality / Accuracy, Nr. of compounds]

16 Modelling steps
Data pre-processing
◦ Scaling of the X block and transformation of the Y block
Variable selection
Application of algorithms to search for the relationship

17 Data pre-processing (I)
Scaling variables
◦ making sure that each descriptor has an equal chance of contributing to the overall analysis
◦ e.g. autoscaling, range scaling
Y transformation

No. | TestSub | Name_Jp | CAS nr | MolWeight | ZM1V, HNar, MSD, GMTIV, SPI, TI1 (unseparated)
1 | 1-001 | Trichlorfon | 52-68-6 | 257,437 | 162,1241,3580,2661336,0749,072-13,982
2 | 1-002 | Dimethoate | 60-51-5 | 229,257 | 148,1981,4850,3241587,7418,253-16,476
3 | 1-003 | Dichlorvos | 62-73-7 | 220,976 | 172,5191,4510,3261380,4447,646-13,908
4 | 1-004 | Malathon | 121-75-5 | 330,358 | 274,1981,5510,266524,18513,757-34,658
5 | 1-005 | Methoprene | 40596-69-8 | 310,471 | 2241,5620,334992217,17-57,135
6 | 1-006 | Propylthiourea | 927-67-3 | 118,201 | 50,4441,4480,431235,6674,166-6,962
7 | 1-007 | 2-Butanone oxime | 96-29-7 | 87,1204 | 721,3850,4142243,484-4,909
8 | 1-008 | Dibromoacetic acid | 631-64-1 | 217,844 | 86,1341,2860,38204,9493,786-4,593
9 | 1-011 | Bis(2-ethylhexyl)adipate | 103-23-1 | 370,566 | 2541,6960,3341465517,037-80,091
10 | 1-013 | Thiram | 137-26-8 | 240,433 | 87,7781,440,338829,4459,053-17,08
11 | 1-015 | Stannane, tributylfluoro- | 1983-10-4 | 309,051 | 88,0081,60,3141243,8898,561-22,079
12 | 1-016 | Methomyl | 16752-77-5 | 162,21 | 148,4441,50,3751128,3336,542-12,904
13 | 1-017 | Aldicarb | 116-06-3 | 190,263 | 158,4441,4850,35116978,357-17,609
14 | 1-018 | Demeton-s-methyl | 919-86-8 | 230,285 | 124,1981,5480,3531163,1857,554-17,685
15 | 1-019 | Citral | 5392-40-5 | 152,233 | 1061,5350,38713997,197-16,06
16 | 1-020 | Disulfiram | 97-77-8 | 296,539 | 103,7781,5480,3071801,44411,43-28,141
17 | 1-021 | 2-Ethyl-1,3-hexanediol | 94-96-2 | 146,227 | 861,50,3388056,412-11,897
18 | 1-022 | Tributyl phosphate | 126-73-8 | 266,314 | 183,3091,6590,3113441,66710,069-32,437
20 | 1-024 | Tris(2-chloroethyl)phosphate | 115-96-8 | 285,49 | 170,1241,60,3141985,0378,561-22,079
22 | 1-026 | Ethylene glycol | 107-21-1 | 62,0678 | 581,3330,5271391,893-2,499
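The two scaling options named on this slide can be sketched as follows. The input values are the first five MolWeight entries of the slide's table (written with decimal points); the function names `autoscale` and `range_scale` are only illustrative, and the sample standard deviation is an assumed choice.

```python
from statistics import mean, stdev


def autoscale(column):
    """Centre a descriptor on its mean and divide by its standard deviation."""
    m, s = mean(column), stdev(column)
    return [(v - m) / s for v in column]


def range_scale(column):
    """Map a descriptor linearly onto the interval [0, 1]."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]


# First five MolWeight values from the table on this slide.
mol_weight = [257.437, 229.257, 220.976, 330.358, 310.471]

auto = autoscale(mol_weight)
ranged = range_scale(mol_weight)
print([round(v, 3) for v in auto])
print([round(v, 3) for v in ranged])
```

After autoscaling every descriptor has mean 0 and unit standard deviation, so a descriptor with values in the hundreds no longer dominates one with values below 1.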

18 Data pre-processing (II)
Variable pruning
◦ Detecting constant variables
◦ Detecting quasi-constant variables
  ▪ This distinguishes informative from non-informative variables
◦ Detecting correlated variables
  ▪ Variables can be grouped into correlation groups, and the variable most correlated with the response is retained
◦ Variables with missing values
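A minimal sketch of the pruning steps listed above, under some assumptions: quasi-constant variables are detected with a simple variance threshold (which is scale dependent), correlated variables are handled greedily rather than by explicit correlation groups, and missing values are not treated. The thresholds, the toy data and the name `prune` are all hypothetical.

```python
from statistics import mean, pvariance


def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def prune(descriptors, response, min_variance=1e-6, max_corr=0.95):
    """Return the names of the descriptors retained after pruning.

    descriptors: dict mapping a descriptor name to its list of values.
    """
    # 1. Drop constant and quasi-constant variables.
    informative = {name: values for name, values in descriptors.items()
                   if pvariance(values) > min_variance}
    # 2. Visit variables from most to least correlated with the response and
    #    keep one only if it is not highly correlated with a retained one.
    ranked = sorted(informative,
                    key=lambda name: abs(pearson(informative[name], response)),
                    reverse=True)
    kept = []
    for name in ranked:
        if all(abs(pearson(informative[name], informative[k])) < max_corr
               for k in kept):
            kept.append(name)
    return kept


toy_descriptors = {
    "const": [1.0, 1.0, 1.0, 1.0, 1.0],
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x1_copy": [2.1, 4.0, 6.2, 8.1, 9.9],   # nearly proportional to x1
    "x2": [5.0, 1.0, 4.0, 2.0, 3.0],
}
toy_response = [1.1, 2.0, 2.9, 4.2, 5.0]

kept = prune(toy_descriptors, toy_response)
print(kept)
```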

19 Variable selection
Reducing dimensions, facilitating data visualization and interpretation
Likely improving prediction performance
Hypothesis driven or statistically driven
Wrappers: use the chosen prediction method to score subsets of features according to their predictive power
Filters: a preprocessing step, independent of the choice of the predictor

20 Variable selection techniques
Principal component analysis (PCA)
Clustering
Self organizing maps (SOM)
Stepwise procedures
◦ Forward selection: features are progressively incorporated into larger and larger subsets
◦ Backward elimination: starts with the set of all features and progressively eliminates the least promising ones
Genetic algorithms
Variable importance/sensitivity

21 Principal component analysis
Keep only those components that possess the largest variation
PCs are orthogonal to each other
Loadings plot

22 Cluster analysis
Process of putting objects into classes, based on similarity
Descriptors in the same cluster are assumed to take similar values for the molecules of the dataset
Many different methods and algorithms
◦ different clustering methods will result in different clusters, with different relationships between them
◦ different algorithms can be used to implement the same method (some may be more efficient than others)

23 Hierarchical and non-hierarchical
A basic distinction is between clustering methods that organise clusters hierarchically, and those that do not

24 Hierarchical agglomerative
The hierarchy is built from the bottom upwards
Several different methods and algorithms
Basic Lance-Williams algorithm (common to all methods) starts with a table of similarities between all pairs of items
◦ at each step the most similar pair of molecules (or previously formed clusters) is merged
◦ until everything is in one big cluster
◦ methods differ in how they determine the similarity between clusters
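The bottom-up merging described on this slide can be sketched with a naive single-linkage procedure. This is not the Lance-Williams update formula itself: distances between clusters are simply recomputed at every step, and single linkage (closest pair of members) is an assumed choice of between-cluster similarity. The points and the name `single_linkage` are hypothetical.

```python
def single_linkage(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    points: list of coordinate tuples. Returns clusters as lists of indices.
    """
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(points[a], points[b])) ** 0.5

    def cluster_dist(c1, c2):
        # Single linkage: distance between the closest pair of members.
        return min(dist(a, b) for a in c1 for b in c2)

    clusters = [[i] for i in range(len(points))]  # every item starts alone
    while len(clusters) > n_clusters:
        pairs = [(cluster_dist(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)                      # most similar pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


toy_points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
clusters = single_linkage(toy_points, n_clusters=3)
print(clusters)
```

Stopping at `n_clusters=1` would build the full hierarchy; swapping `min` for `max` in `cluster_dist` would give complete linkage instead.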

25 Hierarchical divisive
The hierarchy is built from the top downwards
At each step a cluster is chosen to divide, until each cluster has only one member
Various ways of choosing the next cluster to divide
◦ the one with most members
◦ the one with the least similar pair of members
◦ etc.
Various ways of dividing it

26 Non-hierarchical methods
Usually faster than hierarchical
e.g. nearest neighbour methods
◦ the best-known example is the Jarvis-Patrick method
  ▪ identify the top k (e.g. 20) nearest neighbours for each object
  ▪ two objects join the same cluster if they have at least k_min of their top k nearest neighbours in common
◦ tends to produce a few large heterogeneous clusters and a lot of singletons (single-member clusters)
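A small sketch of the Jarvis-Patrick rule stated above, using one-dimensional values for brevity (real descriptor vectors would need a multi-dimensional distance). The optional `require_mutual` flag adds the condition, common in descriptions of the method but not stated on the slide, that the two objects must also appear in each other's neighbour lists. All names and data are hypothetical.

```python
def jarvis_patrick(values, k, k_min, require_mutual=True):
    """Cluster 1-D values with the Jarvis-Patrick shared-neighbour rule."""
    n = len(values)
    # Top-k nearest neighbours of every object (the object itself excluded).
    neighbours = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: abs(values[i] - values[j]))
        neighbours.append(set(others[:k]))

    parent = list(range(n))  # union-find forest used to merge clusters

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            shared = len(neighbours[i] & neighbours[j])
            mutual = i in neighbours[j] and j in neighbours[i]
            if shared >= k_min and (mutual or not require_mutual):
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


toy_values = [0.0, 1.0, 2.0, 3.0, 50.0, 51.0, 52.0, 53.0, 200.0]
with_mutual = jarvis_patrick(toy_values, k=3, k_min=2)
slide_rule_only = jarvis_patrick(toy_values, k=3, k_min=2, require_mutual=False)
print(with_mutual)
print(slide_rule_only)
```

In this toy case the outlier at 200 stays a singleton when mutual neighbourhood is required, but is absorbed into the second cluster when only the shared-neighbour count is used.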

27 Self organizing maps
A SOM is an unsupervised neural network (NN) that condenses the input space into a low-dimensional representation

28 Genetic algorithms
Based on the Darwinian evolutionary theory
◦ individuals in a population of models are crossed over, mutated, then iteratively evaluated against a fitness function which gives a statistical evaluation of the model’s performance
[Flowchart boxes: Initial population, Evaluation of individuals, Cross-over, Mutations, Individual selection, Fitness? (Y/N), End; individuals are shown as bit strings such as 10010111011011]
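The loop in the flowchart can be sketched as a tiny genetic algorithm over bit strings, where each bit says whether a descriptor is included in the model. Everything specific here is an assumption made for illustration: the toy fitness function (in practice it would be a model statistic such as a cross-validated Q²), truncation selection, one-point cross-over, elitism, and a fixed number of generations in place of the "Fitness?" stopping test.

```python
import random

RELEVANT = {0, 3, 5}  # hypothetical indices of the informative descriptors


def toy_fitness(bits):
    """Reward relevant descriptors, penalise every extra one (toy stand-in)."""
    hits = sum(bits[i] for i in RELEVANT)
    extras = sum(b for i, b in enumerate(bits) if i not in RELEVANT)
    return hits - 0.5 * extras


def run_ga(fitness, n_bits, pop_size=20, generations=30, p_mut=0.05, seed=0):
    """Return the best individual and the best fitness seen per generation."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]            # initial population
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)     # evaluation
        history.append(fitness(population[0]))
        parents = population[: pop_size // 2]          # individual selection
        children = [population[0][:]]                  # elitism: keep the best
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randint(1, n_bits - 1)           # one-point cross-over
            child = a[:cut] + b[cut:]
            child = [1 - g if rng.random() < p_mut else g for g in child]
            children.append(child)                     # mutated offspring
        population = children
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


best, history = run_ga(toy_fitness, n_bits=8)
print(best, history[-1])
```

Because the best individual is always copied unchanged into the next generation, the best fitness can never decrease from one generation to the next.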

29 Modelling approaches
SAR: categorical Y, classification
Quantitative SAR: continuous Y, regression

30 Modelling techniques
Multiple Linear Regression
PLS
…
Neural Networks
Classification trees
Discriminant analysis
Fuzzy classification
…

31 Multiple Regression
Linear relationship between Y and several $X_i$ descriptors
$Y = aX_1 + bX_2 + \dots + cX_n + \text{const.}$ (1)
Minimize error by least squares
May include polynomial terms

32 Partial Least Squares
PLS, similarly to PCA, uses orthogonal components of linearly correlated variables, choosing those more closely related to the Y response
[Figure: projection onto scores t1 and t2]

33 Neural networks
Inspired by biological neural networks, NNs are a set of connected nonlinear elements that transform the input I into the output O: O = f(I)

34 The problem of overfitting
Linear fit: y = 0.979x + 0.344, R² = 0.956
Fourth-degree polynomial fit: y = -0.062x⁴ + 1.293x³ - 9.472x² + 29.24x - 27.37, R² = 0.999

35 Solution: validation
[Figure: performances against complexity, for training prediction and validation prediction]

36 Validation criteria
Internal validation - robustness
◦ Cross-validation (LOO, LSO)
◦ Bootstrap
◦ Y scrambling
External validation - prediction ability
◦ Test set
  ▪ representative of training set
◦ Tropsha criteria
Applicability domain

37 Cross validation
Leave One Out
◦ All the data except one compound are used for fitting
◦ Predict the excluded sample
◦ Repeat for all samples
◦ Calculate Q² (or R²cv), similarly to R², on the basis of these predictions
◦ Problem: too optimistic if there are many samples
Leave Many Out
◦ Use larger groups to obtain a more realistic outcome
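The leave-one-out procedure can be sketched for a one-descriptor linear model as below. The formula used, Q² = 1 - PRESS/TSS with TSS taken around the mean of all responses, is one common convention and is assumed here; the data and function names are invented for the example.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for a single descriptor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx


def r_squared(x, y):
    """Goodness of fit of the model built on all the data."""
    slope, intercept = fit_line(x, y)
    my = sum(y) / len(y)
    rss = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    tss = sum((b - my) ** 2 for b in y)
    return 1 - rss / tss


def q_squared_loo(x, y):
    """Leave-one-out Q2: each compound is predicted by a model that never saw it."""
    my = sum(y) / len(y)
    press = 0.0
    for i in range(len(x)):
        x_train, y_train = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        slope, intercept = fit_line(x_train, y_train)
        press += (y[i] - (slope * x[i] + intercept)) ** 2
    tss = sum((b - my) ** 2 for b in y)
    return 1 - press / tss


# Invented LogP and toxicity values for six compounds.
logp = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
tox = [0.9, 1.6, 2.1, 2.9, 3.4, 4.3]

r2 = r_squared(logp, tox)
q2 = q_squared_loo(logp, tox)
print(round(r2, 3), round(q2, 3))
```

For a least-squares model Q² cannot exceed R², and a large gap between the two is a warning sign of overfitting.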

38 Bootstrapping
Bootstrapping simulates what happens when the data set of n objects is randomly resampled
K groups of size n are generated by random selection with repetition, so some objects appear more than once in a group
The model obtained on each group is used to predict the values of the excluded samples
From each bootstrap sample the statistical parameter of interest is calculated
The accuracy estimate is the average of all calculated statistics

39 Y-scrambling
Randomly permute the Y responses several times while keeping the X variables in the same order
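A minimal sketch of Y-scrambling for a one-descriptor model: the responses are shuffled repeatedly and the fit is recomputed each time. If the scrambled models score almost as well as the real one, the original correlation is likely due to chance. The data, the number of rounds and the function names are assumptions for illustration.

```python
import random


def r_squared(x, y):
    """R2 of a one-descriptor least-squares fit (squared Pearson correlation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


def y_scrambling(x, y, n_rounds=100, seed=0):
    """R2 values obtained after randomly permuting y; x keeps its order."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        y_perm = y[:]
        rng.shuffle(y_perm)
        scores.append(r_squared(x, y_perm))
    return scores


# Invented LogP and toxicity values for eight compounds.
logp = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
tox = [0.9, 1.6, 2.1, 2.9, 3.4, 4.3, 4.8, 5.7]

true_r2 = r_squared(logp, tox)
scrambled = y_scrambling(logp, tox)
mean_scrambled = sum(scrambled) / len(scrambled)
print(round(true_r2, 3), round(mean_scrambled, 3))
```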

40 Tropsha criteria*
a) $Q^2 > 0.5$
b) $R^2 > 0.6$
c) $(R^2 - R_0^2)/R^2 < 0.1$ and $0.85 < k < 1.15$, or $(R^2 - R'^2_0)/R^2 < 0.1$ and $0.85 < k' < 1.15$
   ($k$ = slope of the regression line; $R_0^2$ = $R^2$ related to $y = kx$)
d) if (c) is not fulfilled, then $|R_0^2 - R'^2_0| < 0.3$
* A. Golbraikh, M. Shen, Z. Xiao, Y.D. Xiao, K.-H. Lee, A. Tropsha, Rational selection of training and test sets for the development of validated QSAR models, JCAMD, 17 (2003) 241-253.
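The checks on this slide can be sketched as a small function that evaluates each criterion separately on a test set. The exact formulas for k, k', R0² and R'0² vary between publications; the ones below (regressions through the origin of observed on predicted values and vice versa) are one reading and should be treated as an assumption. Q² is taken as an input, since it comes from internal validation, and the data are invented.

```python
def tropsha_checks(observed, predicted, q2):
    """Evaluate criteria a) to d) individually and return them in a dict."""
    n = len(observed)
    mo, mp = sum(observed) / n, sum(predicted) / n
    so = sum((o - mo) ** 2 for o in observed)
    sp = sum((p - mp) ** 2 for p in predicted)
    sop = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    r2 = sop ** 2 / (so * sp)

    cross = sum(o * p for o, p in zip(observed, predicted))
    k = cross / sum(p * p for p in predicted)        # observed = k * predicted
    k_prime = cross / sum(o * o for o in observed)   # predicted = k' * observed
    r2_0 = 1 - sum((o - k * p) ** 2
                   for o, p in zip(observed, predicted)) / so
    r2_0_prime = 1 - sum((p - k_prime * o) ** 2
                         for o, p in zip(observed, predicted)) / sp

    c = (((r2 - r2_0) / r2 < 0.1 and 0.85 < k < 1.15)
         or ((r2 - r2_0_prime) / r2 < 0.1 and 0.85 < k_prime < 1.15))
    return {
        "a": q2 > 0.5,
        "b": r2 > 0.6,
        "c": c,
        "d": abs(r2_0 - r2_0_prime) < 0.3,
        "r2": r2,
        "k": k,
        "k_prime": k_prime,
    }


# Invented observed and predicted activities for five test compounds.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.9, 5.1]
checks = tropsha_checks(observed, predicted, q2=0.8)
print({name: checks[name] for name in "abcd"})
```

The function deliberately reports each criterion on its own rather than a single pass/fail verdict, leaving the combination of c) and d) to the reader.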

41 Applicability domain
The applicability domain of a (Q)SAR model is the response and chemical structure space in which the model makes predictions with a given reliability.*
* Current status of methods for defining the applicability domain of (quantitative) structure-activity relationships. ATLA, 33:1-19, 2005.

42 Applicability domain
[Figure: training data]

43 Applicability domain
[Figure: training data and new compounds]

44 AD assessment
Similarity measures:
Response range (span of activity data)
Chemometric treatment of the descriptor space
Fragment-based approaches

45 Chemometric Methods
Descriptor range-based

46 Chemometric Methods
Descriptor range-based
Geometric methods

47 Chemometric Methods
Descriptor range-based
Geometric methods
Distance-based

48 Chemometric Methods
Descriptor range-based
Geometric methods
Distance-based
Probability density distribution
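Two of the chemometric approaches listed above can be sketched in a few lines: a descriptor range check and a distance-to-centroid check. The descriptors are assumed to be already scaled, the distance threshold (the largest distance found in the training set) is an arbitrary illustrative choice, and the data and function names are hypothetical.

```python
def in_descriptor_range(training, compound):
    """Range-based AD: every descriptor must fall within the training range."""
    lows = [min(column) for column in zip(*training)]
    highs = [max(column) for column in zip(*training)]
    return all(lo <= v <= hi for lo, v, hi in zip(lows, compound, highs))


def within_distance(training, compound, factor=1.0):
    """Distance-based AD: Euclidean distance to the training centroid must not
    exceed factor times the largest distance seen in the training set."""
    n = len(training)
    centroid = [sum(column) / n for column in zip(*training)]

    def dist(point):
        return sum((v - c) ** 2 for v, c in zip(point, centroid)) ** 0.5

    threshold = factor * max(dist(t) for t in training)
    return dist(compound) <= threshold


# Invented, already scaled values of two descriptors for four training compounds.
training = [(0.0, 0.0), (1.0, 0.2), (0.2, 1.0), (1.0, 1.0)]
inside = (0.5, 0.5)
outside = (1.5, 0.5)

print(in_descriptor_range(training, inside), within_distance(training, inside))
print(in_descriptor_range(training, outside), within_distance(training, outside))
```

The range check treats the domain as a box around the training data, while the distance check treats it as a sphere around the centroid, so the two can disagree for compounds near the corners of the box.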

49 AMBIT software
http://ambit.acad.bg/main.php

50 AD assessment
Similarity measures:
Response range (span of activity data)
Chemometric treatment of the descriptor space
Fragment-based approaches

51 Example of AD assessment
[Figure: Test set 1 and Test set 2; % of all compounds in the test set predicted within one or two log units without assessing the AD]

52 Further aspects in AD
Including the model’s characteristics
Terminal node | Node assignment | Misclassification ratio: Training set | Validation set | Test set
4 | 0 | 0.04 | 0.02 | 0
6 | 1 | 0.13 | 0.16 | 0.14
8 | 0 | 0.1 | 0.26 | 0.17
11 | 1 | 0.31* | 0 | 0.25
12 | 0 | 0.05 | 0.14 | 0.25
14 | 1 | 0.47* | 0.67* | 0.25
15 | 0 | 0.14 | 0.19 | 0.13
18 | 1 | 0.32* | 0.33* | 0.55*
19 | 0 | 0.2 | 0 | 0
20 | 0 | 0.06 | 0.2 | 0.17
21 | 1 | 0.44* | 0.2 | 0.17

53 Outline
History
QSAR/QSPR steps
◦ (Descriptors)
◦ Activity data
◦ Modelling approaches
◦ Validation (OECD principles)
QSPR (Phys-chem properties)
QSAR (Biological activities)
Example (DEMETRA)

54 Why?
Large number of existing and new chemicals without a complete (eco)toxicological characterization
Example: http://www.ewg.org/reports/skindeep/ (ingredient search results for tocopheryl acetate, summary of health information)
Constraints: time consuming, expensive, ethical issues

55 REACH: Registration, Evaluation and Authorisation of CHemicals
Enterprises that manufacture or import more than one tonne of a chemical substance per year would be required to register it in a central database
It is estimated that testing the approximately 30,000 existing substances would cost about €2.1 billion in total over the next 11 years
Promotion of non-animal testing

56 REACH: Registration, Evaluation and Authorisation of CHemicals
Additional cost | Use of (Q)SARs, read-across
2.3 billion Euro | Minimal use
1.5 billion Euro | Average use (likely scenario)
1.1 billion Euro | Maximal use
Cost-saving potential: € 800-1130 million
Pedersen et al. (2003). Assessment of additional testing needs under REACH.

57 REACH: Registration, Evaluation and Authorisation of CHemicals
Additional animals | Use of (Q)SARs, read-across
3.9 million | Minimal use
2.6 million | Average use (likely scenario)
2.1 million | Maximal use
Animal-saving potential: 1.3-1.9 million animals
Van der Jagt et al. (2004). Alternative approaches can reduce the use of test animals under REACH.

58 OECD principles for QSAR validation
Efforts to improve transparency and acceptability of in silico methods:
A defined endpoint
An unambiguous algorithm
A defined domain of applicability
Appropriate measures of goodness-of-fit, robustness and predictivity
A mechanistic interpretation, if possible

59 QSPR
Physico-chemical properties
◦ Boiling point
◦ Solubility
◦ Partition coefficients
◦ Viscosity
◦ Hydrophobicity
Biochemical assays

60 Specific aspects of QSPR
In general you can expect more precise models and lower experimental variability
Many properties important for drug design
◦ Biochemical assay – target property
◦ Bioavailability – LogP
◦ Side effects
Many others important for REACH

61 QSAR
Biological activities
◦ Ecotoxicity
◦ Mammalian toxicity (as surrogate of human health)
◦ Carcinogenicity & Mutagenicity
◦ …
◦ & many more

62 Specific aspects of QSAR
Biological variability
Moles vs. weight data
Role of LogP
Mechanistic interpretation

63 Biological variability
Intrinsic variability of toxicological data (LC50)

64 Mole vs. weight units
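To make the mole versus weight point concrete, the sketch below converts an effect concentration from mg/L to mol/L and to the negative logarithm often used as the modelled response. The molecular weights are those listed for ethylene glycol and bis(2-ethylhexyl)adipate in the table of slide 17; the LC50 value of 1 mg/L is invented, and the function names are hypothetical.

```python
import math


def mg_per_l_to_mol_per_l(conc_mg_l, mol_weight):
    """Convert a concentration in mg/L to mol/L (mol_weight in g/mol)."""
    return conc_mg_l / 1000.0 / mol_weight


def p_lc50(conc_mg_l, mol_weight):
    """Negative base-10 logarithm of the molar effect concentration."""
    return -math.log10(mg_per_l_to_mol_per_l(conc_mg_l, mol_weight))


# Same invented LC50 by weight (1 mg/L) for a light and a heavy compound.
light = p_lc50(1.0, 62.0678)    # ethylene glycol molecular weight
heavy = p_lc50(1.0, 370.566)    # bis(2-ethylhexyl)adipate molecular weight
print(round(light, 2), round(heavy, 2))
```

Two compounds that look equally toxic in mg/L differ on the molar scale: fewer molecules of the heavier compound are needed to reach the same effect, so it is the more potent one per molecule.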

65 Role of LogP
Used to model the penetration into the phospholipid membrane
Extremely common because it is easy to interpret

66 Role of LogP
Which is your favourite option?
Tox = 1.32 LogP + 0.23
Tox = 0.55 des1 + 0.36 des2 + 0.29 des3 + 0.64 des4 - 0.47 des5 - 1.56 des6 - 0.53 des7 + 0.27 des8 + 0.55 des9 + 0.50 des10 + 0.23

67 Role of LogP

68 From LogP to direct descriptors
[Figure: two schemes, one linking structure, descriptors, logP and activity, the other linking structure, descriptors and activity directly]

69 Mechanistic interpretation
A priori (experimentally determined, which can be even more complex than the studied endpoint itself), postulated, or a posteriori
Different classification schemes for MOA exist (narcosis, specific reactive modes)

70 Global models …
Training set: n = 422, d = 5, R² = 69.9 %, R²cv = 68.0 %, RMS = 0.77
Test set: n = 141, R² = 71.7 %, RMS = 0.70

71 Global models …
n = 563
d | Log P, E LUMO, MW, Kier&Hall (order 0), Molecular surface area | Log P, E LUMO
R² | 71.1 | 69.5
Q² | 70.7 | 69.3
RMS | 0.74 | 0.76

72 vs mechanistic models …

73 Outline
History
QSAR/QSPR steps
◦ (Descriptors)
◦ Activity data
◦ Modelling approaches
◦ Validation (OECD principles)
QSPR (Phys-chem properties)
QSAR (Biological activities)
Example (DEMETRA)

74 Activity data collection
Identification of the endpoints that can benefit most from QSAR
◦ Costs, test severity, feasibility, etc.
Identification of data sources
◦ Quality, guidelines, protocols
Refinement of the data
◦ Multiple sources comparison, precautionary selection
Data sets: Trout (282), Water flea (263), Oral quail (116), Dietary quail (123), Bee (105), whole data set (398)

75 Modelling process
Endpoints: Trout, Daphnia, Oral quail, Dietary quail, Bee
Individual models: linear models, ANN models
Hybrid system: combining model results

76 Validation
Validation of the hybrid model for Daphnia with new data subsequently identified in the literature
◦ Real “blind” test set
Comparison with expert systems
◦ ECOSAR
◦ Topkat

77 DEMETRA Hs - training set
Daphnia magna training set: NC = 193, R² = 0.80

78 DEMETRA Hs - results on tests
Daphnia magna test sets
EPA test set: NC = 36, R² = 0.80
D-BBA test set: NC = 101, R² = 0.70

79 US EPA ECOSAR predictions
Daphnia magna, ECOSAR, all data: NC = 432, R² = 0.20

80 Topkat predictions
NC = 176, NC (training test) = 31, R² = 0.20

