1 QSAR/QSPR Model Development and Validation for Successful Prediction and Interpretation. 8th Iranian Workshop on Chemometrics, IASBS, 7-9 Feb 2009. Mohsen Kompany-Zareh. In the name of God.
2 Contents: Introduction; Selwood data set (all descriptors); Model development; Model validation; Statistical diagnostics (R2, q2, RMSEC, RMSEP, RMSECV); Internal validation; QUIK; Selwood data (a number of descriptors); Descriptor selection; LMO and jackknife; Cross model validation; Bootstrapping; Training and test set selection; Leverage
3 Introduction. QSPR/QSAR (quantitative structure-property/activity relationship): a mathematical relation between structural attribute(s) and a property (or an activity) of a set of chemicals. Applications: prediction of a property for a variety of chemicals prior to expensive synthesis and experimental measurement; determination of the environmental risk of thousands of untested industrial chemicals; description of a mechanism of action for a variety of chemicals.
4 [Figure: a QSAR model relates the descriptor matrix X (columns: lipophilicity, LUMO, MW, surface area; rows: molecules 1 to 6) to the activity vector y. Two activities are marked "?" and are to be predicted by the model.]
5 Data preparation: 1. Collection and cleaning of target property data: selection of accurate, precise and consistent experimental data. 2. Calculation of molecular descriptors for the chemicals with acceptable target properties (after optimization of the conformation); more than 3000 descriptors are available. Software: DRAGON (Todeschini et al., 2001), ADAPT (Jurs, 2002; Stuper and Jurs, 1976), OASIS (Mekenyan and Bonchev, 1986), CODESSA (Katritzky et al., 1994), Gaussian, …
6 Descriptors: a unique numerical representation of molecular structure in terms of a few molecular descriptors that capture salient compositional, electronic and steric attributes. From the very large number of descriptors produced by different software packages, as few explanatory descriptors as possible are kept, for simple interpretation of the model (sometimes by variable selection). Descriptor types: topological (edges and vertices), geometric (surface, volume, …), electronic (electron density, local charges), constitutional (#C, #OH, …), … [Diagram: structure, descriptors, model, activity.]
7 Data set. Selwood data: D (31×53), y (31×1). 31 antifilarial antimycin analogues characterized by 53 physicochemical descriptors. Selwood, et al. J Med Chem (1990) 33, 136.
>> load selwood.txt;
>> D = selwood(:,1:end-1);   % 31 molecules x 53 descriptors
>> y = selwood(:,end);       % activity is the last column
8 Model development. Model generation: the independent variables are the descriptors; the dependent variables are the properties (activities). Model development methods: multiple linear regression (MLR), partial least squares (PLS), artificial neural networks (ANNs), k-nearest neighbors. Note: #samples < #descriptors!
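9 Multiple linear regression, the simplest model: $D\,b = y$, so $b = D^{+} y$, where $D^{+}$ is the pseudoinverse of D.
>> b = D\y;
>> yEST = D*b;
22 of the 53 coefficients are zero! The system is underdetermined (31 equations, 53 unknowns), and the MATLAB backslash operator returns a basic solution with at most 31 nonzero coefficients. The fit is perfect ($R^2 = 1$). The model is developed, but what about its application? Validation?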
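The perfect fit on the slide above says nothing about the chemistry. A minimal sketch, assuming purely random numbers in place of the Selwood descriptors and activities (all variable values here are synthetic), shows that any matrix with more columns than rows reproduces y exactly:

```matlab
% Synthetic stand-in with the same shape as the Selwood matrix (values are random,
% not real descriptors): more descriptors (columns) than molecules (rows).
n = 31; p = 53;
D = rand(n,p);               % random "descriptors"
y = rand(n,1);               % random "activities", unrelated to D

b = D\y;                     % least-squares solve of an underdetermined system
yEST = D*b;                  % fitted values

% Coefficient of determination of the fit
R2 = 1 - sum((y-yEST).^2)/sum((y-mean(y)).^2);
fprintf('R2 of the fit to random data: %.6f\n', R2);
```

No random seed is set, because the conclusion does not depend on the draw: with 53 free coefficients and only 31 equations, the residuals vanish up to rounding.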
10 Other statistical diagnostics. Coefficient of determination, $R^2$: the fraction of the dependent-variable variance explained by a model (e.g. an MLR model). Closer to unity is better. It measures the quality of fit between model-predicted and experimental values and does not reflect the predictive power at all.
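For reference, the usual definition of the coefficient of determination can be written as follows, where $y_i$ are the experimental values of the training set, $\hat{y}_i$ the fitted values and $\bar{y}$ the mean of the $y_i$ (the symbols are notation chosen here, not taken from the slides):

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

The numerator is the residual sum of squares and the denominator the total sum of squares, both computed on the same compounds that were used to fit the model.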
11 Many QSPR/QSAR practitioners find the data preparation and model generation steps sufficient to arrive at an acceptable model, and do not include model validation in model development. Example: $\log(1/\mathrm{IGC}_{50}) = 0.54\,\log K_{w} - 8.90\,\mathrm{LUMO} - 0.99$; n = 11, $r^2$ = 0.82, s = 0.28, $r^2_{cv}$ = 0.64. Here n/#descriptors = 11/2 > 5, but $r^2_{cv} < r^2_{fit}$: an unstable model. Schultz, et al. Toxicity of Tetrahymena pyriformis, QSAR 2002 meeting, May 25-29, Ottawa, Canada.
12 Example: Akers, et al. Structure-toxicity relationships for selected halogenated aliphatic chemicals, Environ. Toxicol. Pharm. (1999) 7. Claim: the goodness of fit is satisfactory for predictive purposes. Example: Benigni, et al. QSAR of mutagenic and carcinogenic aromatic amines, Chem. Rev. (2000) 100: "...use of a limited set of individual parameters with clear mechanistic significance is still the best approach that ensure the optimal comprehension of the results and gives the possibility of performing non-formal validations much superior to those provided by statistics" !!
13 Problem: sometimes a model that fits the training set very accurately performs poorly on validation sets, so the model is not reliable.
14 Model validation. The real utility of a QSAR/QSPR model is its ability to accurately predict the modeled property/activity for new chemicals. Model validation means: a quantitative assessment of model robustness and predictive power; a definition of the application domain of the model in the space of the chemical descriptors used.
15 External validation: division into calibration and test sets.
>> calD = [D(1:3:end,:); D(2:3:end,:)];   % rows 1,4,7,... and 2,5,8,... (21 molecules)
>> valD = D(3:3:end,:);                    % rows 3,6,9,... (10 molecules)
>> caly = [y(1:3:end,:); y(2:3:end,:)];
>> valy = y(3:3:end,:);
>> b = calD\caly;                          % model development
[Diagram: calD and caly are used for model development; valD and valy are used for validation.]
There are many different methods for selecting the members of the training and test sets.
16
>> calyEST = calD*b;
>> valyEST = valD*b;   % model validation
The calibration fit is perfect ($R^2 = 1$), but the prediction for the validation set is not good.
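17
>> calDr = size(calD,1);   % number of calibration molecules
>> valDr = size(valD,1);   % number of validation molecules
>> calyEST = calD*b;
>> rmsec = sqrt(((caly-calyEST)'*(caly-calyEST))/calDr)   % root mean square error of calibration
>> valyEST = valD*b;
>> rmsep = sqrt(((valy-valyEST)'*(valy-valyEST))/valDr)   % root mean square error of prediction (validation)
RMSEC = 2.9396e-014; RMSEP = [value not shown]. Not a good prediction.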
18 A model with a high $R^2$ could be a poor predictor because of: variable multicollinearity; statistically insignificant model descriptors; high-leverage points in the training set. A regression model with k descriptors and n training set compounds may be acceptable for validation only if n > 4k and, for each of the k descriptors, the pair-wise correlation coefficient is < 0.9 and the tolerance is > 0.1.
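The three acceptance criteria above can be checked mechanically. The sketch below is an illustration on synthetic data, assuming the standard definition of tolerance as $1 - R_j^2$, where $R_j^2$ comes from regressing descriptor j on the remaining descriptors; the variable names are hypothetical.

```matlab
% Synthetic descriptors (not the Selwood data): n compounds, k descriptors
n = 40; k = 4;
X = rand(n,k);
X(:,4) = X(:,1) + 0.01*rand(n,1);   % descriptor 4 nearly duplicates descriptor 1

% Criterion 1: enough compounds per descriptor
enoughCompounds = n > 4*k;

% Criterion 2: largest absolute pair-wise correlation between descriptors
C = corrcoef(X);
offDiag = abs(C - eye(k));          % zero out the unit diagonal
maxPairCorr = max(offDiag(:));

% Criterion 3: tolerance of each descriptor, 1 - R2 of its regression on the others
tol = zeros(1,k);
for j = 1:k
    others = [ones(n,1), X(:,[1:j-1, j+1:k])];   % intercept plus remaining descriptors
    xj = X(:,j);
    fit = others*(others\xj);
    Rj2 = 1 - sum((xj-fit).^2)/sum((xj-mean(xj)).^2);
    tol(j) = 1 - Rj2;
end

acceptable = enoughCompounds && maxPairCorr < 0.9 && all(tol > 0.1);
fprintf('n > 4k: %d, max |r|: %.3f, min tolerance: %.4f, acceptable: %d\n', ...
        enoughCompounds, maxPairCorr, min(tol), acceptable);
```

Because descriptor 4 is built as a near copy of descriptor 1, this descriptor set fails the correlation and tolerance checks even though it has enough compounds.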
19 Validation strategies: randomization of the modeled property (Y-scrambling); internal validation, which uses the training set only; external validation, with division into training and test sets.
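Y-scrambling can be sketched in a few lines. The example below is a hedged illustration on synthetic data, assuming an MLR model with an intercept and 200 random permutations (both choices are assumptions made here): the model is refitted after shuffling the activities, and the $R^2$ of the scrambled fits is compared with that of the real fit.

```matlab
% Synthetic data with a genuine dependence of y on the descriptors
n = 30; k = 3;
X = [ones(n,1), rand(n,k)];                  % intercept column plus k descriptors
y = X*[1; 2; -1; 0.5] + 0.05*randn(n,1);

% Helper: coefficient of determination of a fit
r2 = @(y,yhat) 1 - sum((y-yhat).^2)/sum((y-mean(y)).^2);

R2true = r2(y, X*(X\y));                     % fit to the real activities

nPerm = 200;                                 % number of scrambling rounds (assumption)
R2scr = zeros(nPerm,1);
for i = 1:nPerm
    ys = y(randperm(n));                     % shuffle activities, keep descriptors fixed
    R2scr(i) = r2(ys, X*(X\ys));             % refit on the scrambled activities
end

fprintf('R2 (real y): %.3f, mean R2 (scrambled y): %.3f\n', R2true, mean(R2scr));
```

A model whose $R^2$ on scrambled activities is close to the $R^2$ on the real activities is fitting chance correlations.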
20 Predictive power of QSAR models: it is estimated from a sufficiently large external test set of compounds that were not used in the model development. Golbraikh, et al. Beware of q2!, J Mol Graph Model (2002) 20. Zefirov, et al. QSAR for boiling points of "small" sulfides. Are the "high-quality structure-property-activity regressions" the real high quality QSAR models?, J Chem Inf Comput Sci (2001) 41.
21-23 [Figures: residual sum of squares and total-variance sum of squares for the training and test sets, from which $R^2$ and $q^2$ are computed; numerical values not shown.]
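The quantities on these three slides can be computed as in the following sketch. It uses synthetic data and the same every-third-row split as the earlier slide; referencing the test-set total sum of squares to the training-set mean is an assumption made here, since the slides do not show which mean was used.

```matlab
% Synthetic data: intercept plus two descriptors
n = 30;
X = [ones(n,1), rand(n,2)];
y = X*[0.5; 1.5; -2] + 0.05*randn(n,1);

testIdx  = 3:3:n;                          % every third compound goes to the test set
trainIdx = setdiff(1:n, testIdx);
b = X(trainIdx,:)\y(trainIdx);             % model built on the training set only

% Residual sums of squares
rssTrain = sum((y(trainIdx) - X(trainIdx,:)*b).^2);
rssTest  = sum((y(testIdx)  - X(testIdx,:)*b).^2);

% Total sums of squares (assumption: both are taken about the training-set mean)
yBar = mean(y(trainIdx));
tssTrain = sum((y(trainIdx) - yBar).^2);
tssTest  = sum((y(testIdx)  - yBar).^2);

R2 = 1 - rssTrain/tssTrain;                % quality of fit
q2 = 1 - rssTest/tssTest;                  % external predictive ability
fprintf('R2 (train) = %.3f, q2 (test) = %.3f\n', R2, q2);
```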
24 Internal validation: cross validation (CV), applied to the training set. Leave-one-out (LOO) is common; leave-many-out (LMO) is used sometimes. The cross-validated correlation coefficient, $q^2$, is computed similarly to $R^2$.
25 Internal validation uses the training set only: cross validation, leave-one-out. It is useful when only a small number of molecules is available.
26 Subsamples (copies from the training set): # subsamples = # molecules.
27 [Diagram: each subsample splits the training set into SubTrain i and SubValid i (i = 1, 2, 3, 5 shown); the squared prediction errors are accumulated into cumPRESS.] # subsamples = # molecules in the training set.
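28 LOO CV
>> [Dr,Dc] = size(X);                      % X: descriptor matrix of the training set, y: its activities
>> valyEST = zeros(Dr,1); press = zeros(Dr,1);
>> for i = 1:Dr
     calX = [X(1:i-1,:); X(i+1:Dr,:)];     % leave molecule i out
     valX = X(i,:);
     caly = [y(1:i-1,:); y(i+1:Dr,:)];
     valy = y(i,:);
     b = calX\caly;                        % model from the remaining molecules
     valyEST(i) = valX*b;                  % prediction for the left-out molecule
     press(i) = (valyEST(i)-valy)^2;       % squared prediction error
   end
>> cumpress = sum(press);
>> rmsecv = sqrt(cumpress/Dr);
>> q2LOO = 1-((y-valyEST)'*(y-valyEST))/((y-mean(y))'*(y-mean(y)))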
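In formula form, the quantities computed by the loop above are as follows, where $\hat{y}_{(i)}$ denotes the prediction for molecule i from the model built without it and n is the number of training molecules (Dr in the code); the symbols are notation chosen here:

```latex
\mathrm{PRESS} = \sum_{i=1}^{n} \left( y_i - \hat{y}_{(i)} \right)^2, \qquad
\mathrm{RMSECV} = \sqrt{\frac{\mathrm{PRESS}}{n}}, \qquad
q^2_{LOO} = 1 - \frac{\mathrm{PRESS}}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```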
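29 q2LOO = [value not shown]; RMSECV = [value not shown].
>> [calDr,calDc] = size(calD);   % rows (molecules) and columns (descriptors) of the calibration set
>> q2ASYMPTOT = 1-(1-R2)*(calDr/(calDr-calDc))^2
>> if q2LOO-q2ASYMPTOT<0.005, disp('reject'), end
q2ASYMPTOT = [value not shown]; result: REJECT. q2LOO and R2 should not be considerably different.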
30 $q^2_{LOO} > 0.5$: many authors consider $q^2_{LOO} > 0.5$ an indicator of the high predictive power of a model and do not evaluate the model on an external test set, or use a test set of only one or two compounds. Examples: Cronin, et al. The importance of hydrophobicity and … in mechanistically based QSARs for toxicological endpoints, SAR QSAR Environ. Res. (2002) 13. Moss, et al. Q. S. Permeability Relationships for percutaneous absorption, Toxicol. In Vitro (2002) 16. Suzuki, et al. Classification of environ. estrogens by physicochem. properties using PCA and hierarchical cluster analysis, J Chem Inf Comput Sci (2001) 41.
31 A small value of $q^2_{LOO}$ or $q^2_{LMO}$ indicates low prediction ability, but the opposite is not necessarily true: a high $q^2_{LOO}$ is necessary but not sufficient. It indicates robustness, but not the prediction ability of the model.
32 It has been shown that there exists no correlation between the LOO cross-validated $q^2_{LOO}$ and the correlation coefficient $R^2$ between the predicted and observed activities for an external test set. Kubinyi, et al. Three dimensional quant. similarity-activ. relationships (QSiAR) from SEAL similarity matrices, J Med Chem (1998) 41. Golbraikh, et al. Beware of q2!, J Mol Graph Model (2002) 20. A high $q^2_{LOO}$ is a necessary condition for a model to have high predictive power, but not a sufficient one.
33 QUIK. R. Todeschini, et al. Detecting bad regression models: multicriteria fitness functions in regression analysis, Anal. Chim. Acta (2004) 515. Used to reveal correlation (collinearity) among the independent variables. Based on the multivariate correlation index K.
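34 Example: M contains 4 correlated descriptors [matrix M and vector y not shown].
>> corr(M)
>> p = size(M,2);
>> CorrEV = svds(corr(M),p);   % singular values (eigenvalues) of the correlation matrix
It seems possible to use svd(M).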
35
>> K = sum(abs((CorrEV/sum(CorrEV))-(1/p)))/(2*(p-1)/p);
>> [KM] = QUIK(M)        % QUIK function: K index of the descriptors alone
>> [KMY] = QUIK([M Y])   % K index in the presence of the dependent variable
>> if KMY-KM<0.05, disp('reject'), else, disp('NOT reject'), end
KM = [value not shown] (maximum correlation between descriptors); KMY = [value not shown]; result: REJECT.
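The QUIK function itself is not shown on the slides. A minimal sketch of what it might compute, based only on the K formula above, is given below; `eigFrac` and `kIndex` are hypothetical names, `corrcoef` and `eig` are used in place of `corr` and `svds` so that no toolbox is needed, and the data are synthetic.

```matlab
% Eigenvalue fractions of the correlation matrix of the columns of M
eigFrac = @(M) eig(corrcoef(M)) / sum(eig(corrcoef(M)));
% K index as in the formula on the slide, with p = size(M,2)
kIndex = @(M) sum(abs(eigFrac(M) - 1/size(M,2))) / (2*(size(M,2)-1)/size(M,2));

n = 200;
t = rand(n,1);
Mcol = [t, 2*t + 0.001*rand(n,1), -t + 0.001*rand(n,1)];   % nearly collinear descriptors
Mind = rand(n,3);                                           % unrelated descriptors
yCol = rand(n,1);                                           % response unrelated to Mcol
yInd = sum(Mind,2);                                         % response built from Mind

KMcol = kIndex(Mcol);   KMYcol = kIndex([Mcol yCol]);
KMind = kIndex(Mind);   KMYind = kIndex([Mind yInd]);

deltaK = 0.05;                                % threshold used on the slide
rejectCol = (KMYcol - KMcol) < deltaK;        % QUIK rule: reject if KMY - KM < deltaK
rejectInd = (KMYind - KMind) < deltaK;
fprintf('collinear set:   KM = %.3f, KMY = %.3f, reject = %d\n', KMcol, KMYcol, rejectCol);
fprintf('independent set: KM = %.3f, KMY = %.3f, reject = %d\n', KMind, KMYind, rejectInd);
```

The collinear set has a K index close to 1 and is rejected, because adding the response cannot raise the total correlation any further; the independent set gains correlation once the response is appended and passes the rule.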
36 Example with random descriptors [matrix M and vector y not shown].
>> M = rand(4,5)
>> corr(M)
37
>> [KM] = QUIK(M)
>> [KMY] = QUIK([M Y])
>> if KMY-KM<0.05, disp('reject'), else, disp('NOT reject'), end
KM = [value not shown]; KMY = [value not shown]; result: NOT REJECTED.
38
>> [KM] = QUIK(calD)        % Selwood data, all descriptors
>> [KMY] = QUIK([calD Y])   % Y: activities of the same molecules as in calD
>> if KMY-KM<0.03, disp('reject'), else, disp('NOT reject'), end
KM = [value not shown]; KMY = [value not shown]; result: REJECTED.
39 Developing an MLR model with all descriptors is not acceptable. The model can be improved by using a factor-based method and by descriptor selection.
40 A number of descriptors. Development of an MLR model using a number of descriptors.
>> D = Dini(:,[ ]);
RMSEC = [value not shown]; RMSEP = [value not shown]. Annotations: comparable, improved.
41 $R^2$ = [value not shown]; $q^2$ = [value not shown]. Annotations: comparable, improved. $q^2_{LOO}$ = [value not shown]; result: NOT REJECTED.
42 QUIK.
>> D = Dini(:,[ ]);
>> if KMY-KM<0.03, disp('reject'), else, disp('NOT reject'), end
KX = [value not shown]; KXY = [value not shown]; result: REJECTED.
43 QUIK.
>> D = Dini(:,[ ]);
>> if KMY-KM<0.03, disp('reject'), else, disp('NOT reject'), end
KX = [value not shown]; KXY = [value not shown]; result: NOT REJECTED.
44 With a proper set of descriptors, improved results can be obtained from MLR. But how can the proper set of descriptors be selected?
Kohonen map. The rows (descriptors) of the 53 × 31 Selwood data matrix are used as input for the Kohonen map: 1. Sampling from all regions of the descriptor space. 2. Sampling from regions in which the descriptors have a high correlation with Y (activity). By: Mehdi Vasighi
47 Descriptor Selection. Y. Akhlaghi and M. Kompany-Zareh, Application of RBFNN and successive projections algorithm in a QSAR study of anti-HIV activity of HEPT derivatives, Journal of Chemometrics (2006) 20, 1-12. Successive projections algorithm (SPA): SPA is a forward selection method that starts with one variable and incorporates a new one at each iteration, until a specified number N of variables is reached. To minimize the collinearity between the selected descriptors, the criterion for the stepwise selection is the orthogonality of each candidate variable to the previously selected variables.
Araujo, et al. The successive projections algorithm for variable selection in spectroscopic multicomponent analysis, Chemom. Intell. Lab. Syst. (2001) 57, 65–73. Important parameters: 1. Starting vector. 2. N, the maximum number of descriptors.
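The selection loop of SPA can be sketched as follows. This is an illustrative implementation on synthetic data, assuming mean-centered descriptor columns and a user-chosen starting column (`start`) and number of descriptors (`N`); it is not the code used in the cited papers.

```matlab
% Synthetic descriptor matrix: n molecules, p descriptors
n = 40; p = 8;
X = rand(n,p);
X(:,2) = X(:,1) + 0.01*rand(n,1);        % near copy of descriptor 1; SPA should avoid it

N = 3;                                   % number of descriptors to select
start = 1;                               % starting vector (column index)

Xp = X - repmat(mean(X,1), n, 1);        % mean-center the columns
selected = zeros(1,N);
selected(1) = start;
for m = 2:N
    v = Xp(:,selected(m-1));             % most recently selected (projected) column
    Xp = Xp - v*(v'*Xp)/(v'*v);          % project every column onto the subspace orthogonal to v
    norms = sqrt(sum(Xp.^2,1));          % what is left of each column after projection
    norms(selected(1:m-1)) = -Inf;       % never reselect a column
    [~, selected(m)] = max(norms);       % largest projection = least collinear candidate
end
disp(selected);
```

Because each projection is applied to the already projected matrix, the columns end up orthogonal to all previously selected descriptors, so the near copy in column 2 has almost nothing left after the first projection and is not chosen.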
Correlation weighted SPA. A limitation of SPA is that the only criterion for the stepwise selection of variables is their orthogonality to the previously selected variables; the relation of the entered vector, as an independent variable, to the response is not considered. Incorporating a correlation ranking procedure into SPA, in which the variables are weighted by their correlation coefficient with the dependent variable, overcomes this limitation. M. Kompany-Zareh and Y. Akhlaghi, Correlation weighted successive projections algorithm: A QSAR study of anti-HIV activity of HEPT derivatives, J of Chemom (2007) 21.
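One possible reading of the correlation weighting is sketched below: the projection norm of each candidate is multiplied by the absolute correlation of the original descriptor with the response before the maximum is taken. The exact weighting scheme of the cited paper is not given on the slide, so this form, the fixed weights and all names are assumptions for illustration only; the data are synthetic.

```matlab
% Synthetic data: the response depends mainly on descriptor 5, weakly on descriptor 7
n = 100; p = 8;
X = rand(n,p);
y = 2*X(:,5) - X(:,7) + 0.05*randn(n,1);

N = 3; start = 1;                        % number of descriptors and starting column

Xc = X - repmat(mean(X,1), n, 1);        % mean-centered descriptors
yc = y - mean(y);
% Weights (assumption): |correlation| of each original descriptor with the response
w = abs(yc'*Xc) ./ (norm(yc)*sqrt(sum(Xc.^2,1)));

Xp = Xc;
selected = zeros(1,N);
selected(1) = start;
for m = 2:N
    v = Xp(:,selected(m-1));
    Xp = Xp - v*(v'*Xp)/(v'*v);          % same projection step as in plain SPA
    score = w .* sqrt(sum(Xp.^2,1));     % projection norm weighted by relevance to y
    score(selected(1:m-1)) = -Inf;       % never reselect a column
    [~, selected(m)] = max(score);
end
disp(selected);
```

With this weighting, a descriptor that is orthogonal to the selected set but unrelated to the response no longer outranks one that actually carries information about y.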