1. QSAR/QSPR Model development and Validation for successful prediction and interpretation. 8th Iranian Workshop on Chemometrics, IASBS, 7-9 Feb 2009. Mohsen Kompany-Zareh. In the name of GOD.

2. Contents:
- Introduction
- Selwood data set (all descriptors)
- Model development
- Model validation: statistical diagnostics (R², q², RMSEC, RMSEP, RMSECV)
- Internal validation
- QUIK
- Selwood data (a number of descriptors)
- Descriptor selection
- LMO and Jackknife
- Cross model validation
- Bootstrapping
- Training and test set selection
- Leverage

3. Introduction. QSPR/QSAR (quantitative structure-property/activity relationship): a mathematical relation between structural attribute(s) and a property (an activity) of a set of chemicals. Applications:
- Prediction of a property for a variety of chemicals, prior to expensive synthesis and experimental measurement, e.g. to determine the environmental risk of thousands of untested industrial chemicals.
- Description of a mechanism of action for a variety of chemicals.

4. Introduction. [Figure: a QSAR model relates the descriptor matrix X (rows: molec. 1 to molec. 6; columns: Lipoph., LUMO, MW, Surf. Area) to the activity vector y; activities still to be predicted are marked "?".]
Rows of X as shown on the slide:
1.885 120.93 476.92 122.04
2.913 108.77 508.56 150.17
3.312 122.85 554.01 164.08
3.711 123.92 571.26 178.10
2.696 120.49 505.61 156.01
3.106 119.98 518099247.93
Values shown for y: 2.924, 1.992, 1.987, 1.544, 2.079, 1.530

5. Introduction. Data preparation:
1. Collection and cleaning of target property data; selection of accurate, precise and consistent experimental data.
2. Calculation of molecular descriptors for chemicals with acceptable target properties (after optimization of the conformation); more than 3000 descriptors.
Software: DRAGON (Todeschini et al., 2001), ADAPT (Jurs 2002; Stuper and Jurs 1976), OASIS (Mekenyan and Bonchev 1986), CODESSA (Katritzky et al., 1994), Gaussian, …

6. Introduction. Unique numerical representation of molecular structure in terms of a few molecular descriptors that capture salient compositional, electronic and steric attributes. From the very large number of descriptors produced by different software packages, as few explanatory descriptors as possible are kept, for simple interpretation of the model (sometimes by variable selection). [Diagram labels: Structure, Activity, Model.]
Descriptors:
- Topological (edges and vertices)
- Geometric (surface, volume, …)
- Electronic (electron density, local charges)
- Constitutional (#C, #OH, …)
- …

7. Data set. Selwood data: D (31x53), y (31x1). 31 antifilarial antimycin analogues (31 molecules) characterized by 53 physicochemical descriptors. Selwood et al., J Med Chem (1990) 33, 136.
>> load selwood.txt;
>> D = selwood(:,1:end-1);
>> y = selwood(:,end);

8. Model development. Model generation: independent variables are the descriptors; dependent variables are the properties (activities). Model development methods:
- Multiple linear regression (MLR)
- Partial least squares (PLS)
- Artificial neural networks (ANNs)
- k-nearest neighbor
Note: #samples < #descriptors!!

9. Model development. Multiple linear regression, the simplest model: D b = y, so b = D⁺ y (D⁺: pseudoinverse of D).
>> b = D\y;
>> yEST = D*b;
22 of the 53 coefficients are zero!! R² = 1. The model is developed. Application of the model? Validation?
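A minimal sketch of why the perfect fit on slide 9 carries no information, assuming random stand-in data instead of the real Selwood set (the names Drand, yrand, brand and R2rand are illustrative, not from the slides). With more descriptors than molecules, least squares reproduces even a random y exactly:

```matlab
% Random stand-in data (assumption): 31 "molecules", 53 "descriptors"
Drand = randn(31, 53);
yrand = randn(31, 1);        % random "activity": no structure-activity relation at all

brand = Drand \ yrand;       % least-squares solution of the underdetermined system
yfit  = Drand * brand;

% Coefficient of determination of the fit
R2rand = 1 - sum((yrand - yfit).^2) / sum((yrand - mean(yrand)).^2);
fprintf('R2 on random data = %.4f\n', R2rand);
```

The fitted R² is 1 up to rounding even though y is pure noise, which is why the slide asks for validation before the model is used.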

10. Model development. Other statistical diagnostics: coefficient of determination, R². The fraction of the dependent-variable variance explained by a model (e.g. an MLR model). Closer to unity is better. It measures the quality of fit between model-predicted and experimental values and does not reflect predictive power at all.

11. Model development. Many QSPR/QSAR practitioners find the data preparation and model generation steps sufficient to arrive at an acceptable model!! They do not include model validation in model development.
Example: log(1/IGC50) = 0.54 log K_w - 8.90 LUMO - 0.99; n = 11, r² = 0.82, s = 0.28, r²_cv = 0.64. Here n/#descriptors = 11/2 > 5, but r²_cv < r²_fit: unstable model. Schultz et al., Toxicity of Tetrahymena Pyriformis, QSAR 2002 meeting, May 25-29, Ottawa, Canada.

12. Model development. Examples:
- Akers et al., Struc.-tox. relat. for selected halogenated aliphatic chemicals, Environ. Toxicol. Pharm. (1999) 7, 33-39. Claim: the goodness of fit is satisfactory for predictive purposes.
- Benigni et al., QSAR of mutagenic and carcinogenic aromatic amines, Chem. Rev. (2000) 100, 3697-3714: "..use of a limited set of individual parameters with clear mechanistic significance is still the best approach that ensure the optimal comprehension of the results and gives the possibility of performing non-formal validations much superior to those provided by statistics" !!

13. Problem: sometimes a model that fits the training set very accurately performs poorly on validation sets!! Such a model is not reliable!!

14. Model validation. The real utility of a QSAR/QSPR model is its ability to accurately predict the modeled property/activity for new chemicals. Model validation means:
- Quantitative assessment of model robustness and its predictive power.
- Definition of the application domain of the model in the space of the applied chemical descriptors.

15. Model validation: external validation. Division into calibration and test sets: molecules 1, 4, 7, 10, 13, … and 2, 5, 8, 11, 14, … form the calibration set (model development); molecules 3, 6, 9, 12, 15, … form the validation set.
>> calD = [D(1:3:end,:); D(2:3:end,:)];
>> valD = D(3:3:end,:);
>> caly = [y(1:3:end,:); y(2:3:end,:)];
>> valy = y(3:3:end,:);
>> b = calD\caly; % model development
There are many different methods for selecting the members of the training and test sets.

16. Model validation.
>> calyEST = calD*b;
>> valyEST = valD*b; % model validation
R² = 1 for the calibration set, but not a good prediction for the validation set.

17. Model validation.
>> [calDr, calDc] = size(calD); % rows = molecules, columns = descriptors
>> [valDr, valDc] = size(valD);
>> calyEST = calD*b;
>> rmsec = sqrt(((caly-calyEST)'*(caly-calyEST))/calDr) % root mean square error of calibration
>> valyEST = valD*b;
>> rmsep = sqrt(((valy-valyEST)'*(valy-valyEST))/valDr) % root mean square error of prediction (validation)
RMSEC = 2.9396e-014, RMSEP = 2.2940: not a good prediction.
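The split-and-compare procedure of slides 15 to 17 can be tried without the Selwood file. The sketch below assumes random stand-in data (so the numbers will differ from the slides) and introduces a small helper, rmse, that is not part of the original code:

```matlab
% Random stand-in for the Selwood matrix (assumption, not the real data)
D = randn(31, 53);
y = randn(31, 1);

% Every third molecule goes to the validation set, the rest to calibration
calD = [D(1:3:end,:); D(2:3:end,:)];   valD = D(3:3:end,:);
caly = [y(1:3:end,:); y(2:3:end,:)];   valy = y(3:3:end,:);

b = calD \ caly;                        % MLR model from the calibration set only

rmse  = @(obs, est) sqrt(mean((obs - est).^2));   % helper (illustrative name)
rmsec = rmse(caly, calD*b);             % error of calibration
rmsep = rmse(valy, valD*b);             % error of prediction
fprintf('RMSEC = %.3g, RMSEP = %.3g\n', rmsec, rmsep);
```

As on slide 17, RMSEC is at rounding level while RMSEP is many orders of magnitude larger: the calibration error alone says nothing about prediction.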

18. Model validation. A model with high R² could be a poor predictor because of:
- Variable multicollinearity,
- Statistically insignificant model descriptors,
- High-leverage points in the training set.
A regression model with k descriptors and n training-set compounds may be acceptable for validation only if n > 4k and, for each of the k descriptors, pair-wise correlation coefficient < 0.9 and tolerance > 0.1.
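The three acceptance conditions on slide 18 can be checked in a few lines. This is a sketch on a made-up 10 x 2 training matrix; it assumes the usual definition of tolerance, 1 - R_j² for descriptor j regressed on the others, which equals the reciprocal of the j-th diagonal element of the inverse correlation matrix. The slide itself does not spell the definition out.

```matlab
% Illustrative training matrix (assumption): n = 10 compounds, k = 2 descriptors
X = [(1:10)', repmat([1; -1], 5, 1)];
[n, k] = size(X);

R = corrcoef(X);                          % pairwise correlations, columns = descriptors
maxPairCorr = max(max(abs(R - eye(k))));  % largest off-diagonal |r|
tolerance = 1 ./ diag(inv(R));            % tolerance_j = 1 - R_j^2 = 1 / VIF_j

acceptable = (n > 4*k) && (maxPairCorr < 0.9) && all(tolerance > 0.1);
fprintf('max |r| = %.3f, min tolerance = %.3f, acceptable = %d\n', ...
        maxPairCorr, min(tolerance), acceptable);
```

For the full Selwood matrix (31 molecules, 53 descriptors) the first condition already fails, since n > 4k would require more than 212 molecules.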

19. Model validation. Validation strategies:
- Randomization of the modeled property (Y-scrambling).
- Internal validation: training set only.
- External validation: division into training and test sets.
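Y-scrambling is only named on slide 19, so here is a minimal sketch of the idea on synthetic data with a genuine relationship. The data, the number of permutations and the helper r2 are assumptions chosen for illustration: the activities are shuffled, the model is refitted, and the fit of the scrambled models is compared with the fit of the real one.

```matlab
% Synthetic data with a real relationship (assumption): 40 compounds, 2 descriptors
n = 40;
X = [ones(n,1), randn(n, 2)];             % intercept column plus two descriptors
y = X * [0.5; 1; 2] + 0.1 * randn(n, 1);

r2 = @(obs, est) 1 - sum((obs - est).^2) / sum((obs - mean(obs)).^2);
R2true = r2(y, X * (X \ y));              % fit with the true activities

nPerm = 50;
R2scr = zeros(nPerm, 1);
for p = 1:nPerm
    ys = y(randperm(n));                  % scramble the activities
    R2scr(p) = r2(ys, X * (X \ ys));      % refit on scrambled y
end
fprintf('R2 = %.3f, mean scrambled R2 = %.3f\n', R2true, mean(R2scr));
```

A model worth keeping shows a large gap between the two numbers; if scrambled activities fit about as well as the real ones, the original fit is a chance correlation.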

20. Model validation. Predictive power of QSAR models: estimated from a sufficiently large external test set of compounds that were not used in model development.
- Golbraikh et al., Beware of q²!, J Mol Graph Model (2002) 20, 269-276.
- Zefirov et al., QSAR for boiling points of "small" sulfides. Are the "high-quality structure-property-activity regressions" the real high quality QSAR models?, J Chem Inf Comput Sci (2001) 41, 1022-1027.

21. Model validation. [Figure: residual sum of squares (SS) for the training and test sets.]

22. Model validation. [Figure: total-variance sum of squares (SS) for the training and test sets.]

23. Model validation. Train: R² = 1.0000. Test: q² = -8.5220.
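The residual and total sums of squares of slides 21 and 22 combine into the two statistics of slide 23. Written out, under the assumption that the mean in each denominator is taken over the same set as the residuals (the slides do not say which mean is used for the test set):

```latex
R^2 = 1 - \frac{\sum_{i \in \mathrm{train}} (y_i - \hat{y}_i)^2}
               {\sum_{i \in \mathrm{train}} (y_i - \bar{y}_{\mathrm{train}})^2},
\qquad
q^2 = 1 - \frac{\sum_{i \in \mathrm{test}} (y_i - \hat{y}_i)^2}
               {\sum_{i \in \mathrm{test}} (y_i - \bar{y}_{\mathrm{test}})^2}
```

A negative q² means the residual sum of squares exceeds the total sum of squares, i.e. the model predicts the test set worse than a constant equal to the mean would.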

24. Internal validation. Cross-validation (CV), applied to the training set:
- Leave-one-out (LOO) (common)
- Leave-many-out (LMO) (sometimes)
The CV correlation coefficient (q²) is similar to R²!

25. Internal validation. Cross-validation, leave-one-out: uses the training set only. Useful when only a small number of molecules is available.

26. Internal validation. Subsamples (copies of the training set): # subsamples = # molecules.

27. Internal validation. [Figure: LOO subsamples SubTrain1/SubValid1, SubTrain2/SubValid2, SubTrain3/SubValid3, …, SubTrain5/SubValid5, each contributing to cumPRESS; # subsamples = # molecules in the training set.]

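28. Internal validation: LOO CV.
    % X, y: training-set descriptors and activities
    Dr = size(X,1);                          % number of molecules
    valyEST = zeros(Dr,1); press = zeros(Dr,1);
    for i = 1:Dr
        calX = [X(1:i-1,:); X(i+1:Dr,:)];    % leave molecule i out
        valX = X(i,:);
        caly = [y(1:i-1,:); y(i+1:Dr,:)];
        valy = y(i,:);
        b = calX\caly;                       % model without molecule i
        valyEST(i) = valX*b;                 % predict the left-out molecule
        press(i) = (valyEST(i)-valy)^2;
    end
    cumpress = sum(press);
    rmsecv = sqrt(cumpress/Dr);
    q2LOO = 1-((y-valyEST)'*(y-valyEST))/((y-mean(y))'*(y-mean(y)))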

29. Internal validation. q²_LOO = -4.8574, RMSECV = 2.0397.
>> q2ASYMPTOT = 1-(1-R2)*(calDr/(calDr-calDc))^2
q2ASYMPTOT = 1.0000
>> if q2LOO-q2ASYMPTOT<0.005, disp('reject'), end
Result: REJECT. q²_LOO and R² should not be considerably different.

30. Internal validation. q²_LOO > 0.5: many authors consider q²_LOO > 0.5 an indicator of high predictive power of the model and do not evaluate the model on an external test set, or use a test set of only one or two compounds. Examples:
- Cronin et al., The importance of hydrophobicity and … in mechanistically based QSARs for toxicological endpoints, SAR QSAR Environ. Res. (2002) 13, 167-176.
- Moss et al., Q. S. Permeability Relationships for percutaneous absorption, Toxicol. In Vitro (2002) 16, 299-317.
- Suzuki et al., Classification of environ. estrogens by physicochem. properties using PCA and hierarchical cluster analysis, J Chem Inf Comput Sci (2001) 41, 718-726.

31. Internal validation. A small value of q²_LOO or q²_LMO indicates low predictive ability, but the opposite is not necessarily true (a high q²_LOO is necessary but not sufficient). It indicates robustness, but not the predictive ability of the model.

32. Internal validation. It has been shown that there exists no correlation between the LOO cross-validated q²_LOO and the correlation coefficient R² between the predicted and observed activities for an external test set. A high q²_LOO is a necessary condition for a model to have high predictive power, but not a sufficient one.
- Kubinyi et al., Three dimensional quant. similarity-activ. relationships (QSiAR) from SEAL similarity matrices, J Med Chem (1998) 41, 2553-2564.
- Golbraikh et al., Beware of q²!, J Mol Graph Model (2002) 20, 269-276.

33. QUIK. R. Todeschini et al., Detecting bad regression models: multicriteria fitness functions in regression analysis, Anal. Chim. Acta (2004) 515, 199-208. Used to reveal correlation (collinearity) among the independent variables. Based on the multivariate correlation index K.

34. QUIK. Example with 4 correlated descriptors:
M =
     2  1  1  1
     4  2  2  2
     6  3  3  3
     8  4  4  4
    10  5  5  5
y = [10 20 30 40 50]'
>> corr(M)
     1  1  1  1
     1  1  1  1
     1  1  1  1
     1  1  1  1
>> p = size(M,2);
>> CorrEV = svds(corr(M),p);
It seems possible to use svd(M).

35. QUIK.
>> K = sum(abs((CorrEV/sum(CorrEV))-(1/p)))/(2*(p-1)/p);
>> [KM] = QUIK(M) % the same computation as a function
KM = 1.0000 (maximum correlation between descriptors)
>> [KMY] = QUIK([M Y]) % in the presence of the dependent variable
KMY = 1.0000
>> if KMY-KM<0.05, disp('reject'), else, disp('NOT reject'), end
Result: REJECT
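The slides call a QUIK function without listing it. The following is a hedged sketch of what such a function might compute, built only from the K formula on slide 35; the names eigFrac, kIndex and quikReject are hypothetical, and corrcoef plus svd stand in for the slide's corr and svds so that no toolbox is needed.

```matlab
% K index of a data matrix whose columns are variables (formula from slide 35)
eigFrac = @(C) svd(C) / sum(svd(C));      % normalized eigenvalues of a correlation matrix
kIndex  = @(M) sum(abs(eigFrac(corrcoef(M)) - 1/size(M,2))) / (2*(size(M,2)-1)/size(M,2));

% QUIK rule: reject when adding y raises K by less than the threshold dK
quikReject = @(X, y, dK) (kIndex([X y]) - kIndex(X)) < dK;

M = (1:5)' * [2 1 1 1];                   % four fully correlated descriptors (slide 34)
y = (10:10:50)';
KM  = kIndex(M);                          % descriptors only
KMY = kIndex([M y]);                      % descriptors plus response

U  = [1 1; 1 -1; -1 1; -1 -1];            % two uncorrelated descriptors
KU = kIndex(U);
fprintf('KM = %.4f, KMY = %.4f, KU = %.4f\n', KM, KMY, KU);
```

K runs from 0 (no correlation among the columns) to 1 (all columns perfectly correlated), so the slide 34 example sits at the upper limit and the rule rejects it at the 0.05 threshold.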

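36. QUIK. Example with random descriptors:
>> M = rand(4,5)
>> corr(M)
    1.0000   0.5468   0.3863   0.1101   0.6879
    0.5468   1.0000   0.3623  -0.7227   0.0419
    0.3863   0.3623   1.0000   0.1784  -0.3545
    0.1101  -0.7227   0.1784   1.0000   0.2450
    0.6879   0.0419  -0.3545   0.2450   1.0000
y = [1 2 3 4]'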

37. QUIK.
>> [KM] = QUIK(M)
KM = 0.5000
>> [KMY] = QUIK([M Y])
KMY = 0.6000
>> if KMY-KM<0.05, disp('reject'), else, disp('NOT reject'), end
Result: NOT REJECTED

38. QUIK. Selwood data, all descriptors:
>> [KM] = QUIK(calD)
KM = 0.7919
>> [KMY] = QUIK([calD Y])
KMY = 0.7923
>> if KMY-KM<0.03, disp('reject'), else, disp('NOT reject'), end
Result: REJECTED

39. Development of an MLR model using all descriptors is not acceptable. The model can be improved by using a factor-based method, … and by descriptor selection.

40. A number of descriptors. Development of an MLR model using a number of descriptors:
>> D = Dini(:,[51 37 35 38 39 36 15]);
RMSEC = 0.4989, RMSEP = 0.4993: comparable, improved.

41. A number of descriptors. R² = 0.6495, q² = 0.5490: comparable, improved. q²_LOO = 0.2816: NOT REJECTED.

42. A number of descriptors: QUIK.
>> D = Dini(:,[51 37 35 38 39 36 15]);
KX = 0.6384, KXY = 0.5996
>> if KXY-KX<0.03, disp('reject'), else, disp('NOT reject'), end
Result: REJECTED

43. A number of descriptors: QUIK.
>> D = Dini(:,[51 1 38]);
KX = 0.3159, KXY = 0.3953
>> if KXY-KX<0.03, disp('reject'), else, disp('NOT reject'), end
Result: NOT REJECTED

44. Using a proper set of descriptors, improved results can be obtained from MLR. But how can the proper set of descriptors be selected?

45. Descriptor selection:
- Forward selection
- Backward elimination
- Genetic algorithm
- Kohonen map
- SPA
- CWSPA

46. Kohonen map (by Mehdi Vasighi). Selwood data matrix, 53 × 31: rows (descriptors) are used as input for the Kohonen map.
1. Sampling from all regions of the descriptor space.
2. Sampling from regions in which the descriptors have high correlation with Y (activity).

47. Descriptor selection: successive projections algorithm (SPA). SPA is a forward selection method that starts with one variable and incorporates a new one at each iteration, until a specified number N of variables is reached. To minimize the collinearity between the selected descriptors, the criterion for the stepwise selection of variables in SPA is their orthogonality to the previously selected variable. Y. Akhlaghi and M. Kompany-Zareh, Application of RBFNN and successive projections algorithm in a QSAR study of anti-HIV activity of HEPT derivatives, Journal of Chemometrics (2006) 20, 1-12.

48. Descriptor selection. Araujo et al., The successive projections algorithm for variable selection in spectroscopic multicomponent analysis, Chemom. Intell. Lab. Syst. (2001) 57, 65-73. Important parameters:
1. Starting vector
2. N, the maximum number of descriptors
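The projection step described for SPA can be sketched in a short loop. This is an illustration on a made-up 5 x 4 matrix, not the authors' implementation: the variable names are hypothetical, columns are not mean-centered, and the two parameters from slide 48 appear as k0 (starting vector) and N (number of descriptors to select).

```matlab
% Toy descriptor matrix (assumption): column 2 is nearly collinear with column 1
X = [1 2   0 0;
     0 0.1 3 0;
     0 0   0 1;
     0 0   0 0;
     0 0   0 0];
k0 = 1;                                   % starting vector
N  = 3;                                   % number of descriptors to select

P = X;                                    % working copy, projected step by step
sel = zeros(1, N);
sel(1) = k0;
for n = 2:N
    v = P(:, sel(n-1));                   % most recently selected (projected) column
    P = P - v * (v' * P) / (v' * v);      % project all columns orthogonal to v
    norms = sqrt(sum(P.^2, 1));           % what is left of each column
    norms(sel(1:n-1)) = -Inf;             % never pick a column twice
    [~, sel(n)] = max(norms);             % largest remaining norm = least collinear
end
disp(sel)
```

Starting from column 1, the loop skips the nearly collinear column 2 and selects columns 3 and 4, which is the behavior the orthogonality criterion is meant to produce.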

49. Descriptor selection: correlation weighted SPA. A limitation of SPA is that the only criterion for the stepwise selection of variables is their orthogonality to the previously selected variable; the relation of the entered vector, as an independent variable, to the response is not considered. Incorporating into the SPA procedure a form of correlation ranking, in which the variables are weighted by their correlation coefficient with the dependent variable, overcomes this limitation. M. Kompany-Zareh and Y. Akhlaghi, Correlation weighted successive projections algorithm: A QSAR study of anti-HIV activity of HEPT derivatives, J of Chemom (2007) 21, 239-250.
