Application and Efficacy of Random Forest Method for QSAR Analysis


1 Application and Efficacy of Random Forest Method for QSAR Analysis
presented by Pavel Polishchuk

2 Random Forest – consensus modelling
A Random Forest model is an ensemble of individual decision trees.
Rules for model construction:
1. Each tree is grown on a separate bootstrap sample of the initial training set compounds.
2. At each node, only a small, fixed number of randomly chosen descriptors is considered.
3. Each tree is grown to its maximum depth (no pruning).
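The three rules above can be sketched in a few lines of plain Python. This is a minimal illustration for a regression target, not the implementation used in the presented work: the function names, the squared-error split criterion, and the `mtry` parameter name are assumptions made for the example.

```python
import random

def bootstrap_indices(n, rng):
    """Rule 1: draw n row indices with replacement."""
    return [rng.randrange(n) for _ in range(n)]

def sum_sq(y, rows):
    """Sum of squared deviations from the mean over the given rows."""
    m = sum(y[i] for i in rows) / len(rows)
    return sum((y[i] - m) ** 2 for i in rows)

def best_split(X, y, rows, features):
    """Return the split with the lowest summed squared error, or None."""
    best = None
    for f in features:
        values = sorted({X[i][f] for i in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [i for i in rows if X[i][f] <= t]
            right = [i for i in rows if X[i][f] > t]
            sse = sum_sq(y, left) + sum_sq(y, right)
            if best is None or sse < best[0]:
                best = (sse, f, t, left, right)
    return best

def grow_tree(X, y, rows, mtry, rng):
    """Rule 3: split until a node is pure or cannot be split (no pruning)."""
    mean = sum(y[i] for i in rows) / len(rows)
    if len({y[i] for i in rows}) == 1:
        return mean
    # Rule 2: only mtry randomly chosen descriptors are tried at this node.
    features = rng.sample(range(len(X[0])), mtry)
    split = best_split(X, y, rows, features)
    if split is None:  # the chosen descriptors are constant in this node
        return mean
    _, f, t, left, right = split
    return (f, t, grow_tree(X, y, left, mtry, rng),
            grow_tree(X, y, right, mtry, rng))

def predict_tree(node, x):
    """Walk from the root to a leaf; leaves are plain numbers."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if x[f] <= t else right
    return node

def grow_forest(X, y, n_trees, mtry, seed=0):
    rng = random.Random(seed)
    return [grow_tree(X, y, bootstrap_indices(len(X), rng), mtry, rng)
            for _ in range(n_trees)]

def predict_forest(forest, x):
    """Combined prediction: the average over all trees."""
    return sum(predict_tree(tree, x) for tree in forest) / len(forest)

# Toy data (hypothetical): two descriptors, a step-shaped property.
X = [[float(i), float(2 * i)] for i in range(20)]
y = [0.0 if i < 10 else 1.0 for i in range(20)]
forest = grow_forest(X, y, n_trees=25, mtry=1, seed=1)
print(predict_forest(forest, [2.0, 4.0]), predict_forest(forest, [17.0, 34.0]))
```

For classification the averaging step would typically be replaced by a majority vote over the trees.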

3 Random Forest algorithm
[Diagram: initial dataset → bootstrap samples → Tree1, Tree2, … → combined prediction]

4 Random Forest advantages:
RF models are robust to over-fitting.
No pre-selection of variables is needed.
RF has its own reliable procedure for estimating the predictive ability of a model.
RF models are robust to “noise” in the training dataset.
RF allows the importance of each variable for the target property to be estimated (interpretability of the RF model).
RF can analyze compounds with different mechanisms of action.
RF is very fast and works efficiently with huge datasets.
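The built-in procedure for estimating predictive ability is usually understood to be the out-of-bag (OOB) estimate, which the results slides report as "oob". As a short worked step, assuming each bootstrap sample has the same size N as the training set and is drawn with replacement (the slides do not state this explicitly), the probability that a given compound is left out of one tree's sample is:

```latex
P(\text{compound } i \notin \text{bootstrap sample})
  = \left(1 - \frac{1}{N}\right)^{N}
  \;\xrightarrow{\,N \to \infty\,}\; e^{-1} \approx 0.368
```

Under that assumption roughly 37% of the compounds are unseen by any given tree, so each compound can be predicted using only the trees that never saw it, which gives an internal test of the model without a separate hold-out set.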

5 Several examples of solutions to real QSAR tasks

6 Toxicity of chemical compounds for T. pyriformis#
Toxicity was expressed as the negative logarithm of the concentration causing 50% inhibition of Tetrahymena pyriformis growth (pIGC50).
Diverse datasets:
training set = 644 compounds
test set 1 (ts1) = 339 compounds
test set 2 (ts2) = 110 compounds
Total number of 2D simplex descriptors = 6021
# Zhu, H., et al., J. Chem. Inf. Model., : p

7 Comparison of the RF model with other consensus models
RF model (trees=500, vars=2000)#
Columns: RF# (2D simplex), Consensus PLS, Consensus literature##
R2(ws) 0.99 0.85 0.92
R2(oob) 0.81 ---
R2(ts1) 0.83 0.80
R2(ts2) 0.74 0.69 0.67
MAE(ts1) 0.30 0.33 0.29
MAE(ts2) 0.38 0.41 0.39
MAE = mean absolute error of prediction
# Polischuk, P.G., et al J. Chem. Inf. Model., : p
## Zhu, H., et al., J. Chem. Inf. Model., : p
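For reference, the two statistics in the comparison can be computed as sketched below. The slides do not say which definition of R2 was used; this sketch assumes the common coefficient of determination, 1 - SS_res / SS_tot, and the function names and toy values are illustrative only.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def mean_absolute_error(y_true, y_pred):
    """MAE: average absolute difference between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical observed and predicted values, not taken from the slides.
observed = [1.0, 2.0, 3.0]
predicted = [2.0, 2.0, 2.0]
print(r2_score(observed, predicted), mean_absolute_error(observed, predicted))
```

The same functions apply to the working set, the out-of-bag predictions, and either test set; only the pairs of observed and predicted values change.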

8 Estimation of mutagenic potential of chemical compounds (Ames test)
training set = 4361 compounds
test set = 2181 compounds
Columns: Model, Descriptors, Accuracy (oob), Accuracy (5-fold CV), Accuracy (test set)
2D RF
Simplex + Dragon 0.827 0.823 0.813
Simplex 0.810 0.814
Dragon 0.815 0.803 0.805
Consensus# (32 models) --- 0.828
# Results of a collaboration of 13 scientific groups (not published yet)
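The 5-fold cross-validation and accuracy figures can be illustrated with the sketch below. How the folds were actually built for this study is not stated, so the random shuffling, the seed, and the helper names are assumptions; only the training set size (4361) comes from the slide.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k near-equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def accuracy(y_true, y_pred):
    """Fraction of compounds whose predicted class equals the observed class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

folds = k_fold_indices(4361, 5)
for held_out, test_idx in enumerate(folds):
    # Every fold serves once as the validation part; the rest is used for fitting.
    train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
    # A model would be fitted on train_idx and evaluated on test_idx here.
    assert len(train_idx) + len(test_idx) == 4361
```

Each compound is predicted exactly once by a model that did not see it, and the accuracy over all those predictions is the cross-validated value.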

9 Solution of the water solubility QSPR task#
training set = 2537 compounds
test set = 301 compounds
training set R2 = 0.99
out-of-bag set R2 = 0.88
test set R2 = 0.82
# Kovdienko, N.A., et al. Molecular Informatics, : p

10 Leo Breiman – author of Random Forest
«Random Forest is an example of a tool that is useful in doing analyses of scientific data. But the cleverest algorithms are no substitute for human intelligence and knowledge of the data in the problem. Take the output of random forests not as absolute truth, but as smart computer generated guesses that may be helpful in leading to a deeper understanding of the problem.» ( – )

