Analyze/StripMiner ™ Overview To obtain an idiot’s guide type “analyze > readme.txt” Standard Analyze Scripts Predicting on Blind Data PLS (Please Listen.


1

2 Analyze/StripMiner™ Overview
- To obtain an idiot’s guide type “analyze > readme.txt”
- Standard Analyze Scripts
- Predicting on Blind Data
- PLS (Please Listen to Svante Wold)
- LOO, BOO and n-Fold Cross-Validation
- Error Measures
- Albumin Data Set and Feature Selection
- Bio-Informatics

3 Analyze/StripMiner™ capabilities
- Feature Selection: Sensitivity Analysis; Genetic Algorithms; Correlation GA (GAFEAT); Method specific
- Learning Modes: Bootstrapping; Bagging; Boosting; Leave-one-out cross-validation
- Data Processing: Interface with RECON; Different Scaling Modes; Outlier detection/data cleansing
- Visualization: Correlation Plots; 2-D Sensitivity Plots; Outlier Visualization Plots; Different Scaling Options; Cluster Ranking Plots; Standard ROC curves; Continuous ROC curves
- Modeling: ANN (Neural Networks); SVM (Support Vector Machines); PLS (Partial Least Squares); GA-based regression clustering; PCA regression; Local Learning; Outlier Detection (GAMOL)
- Code Specifics: Tight Classic C-code (< 15000 lines); Script-Based Shell Program; Runs on all Platforms; Ultra Fast
- Use: TransScan, GE, KODAK, Doppler broadening, Macro-Economics Analysis

4 Analyze/StripMiner™ Coding Philosophy
- Standard C code that compiles on all platforms (WINDOWS™ and Linux)
- Supporting visualizations use Java and/or gnuplot
- Flexible GUI with sample problems and demos
- Fastest code possible with efficient memory requirements
- Long history of code use with a variety of users for troubleshooting
- Flexible code based on scripts and operators
- Operates on a numeric file in a standard data mining format

5 Practical Tips for PCA
- The NIPALS algorithm assumes the features are zero centered
- It is standard practice to do a Mahalanobis scaling of the data
- PCA regression does not consider the response data
- The t’s are called the scores
- It is common practice to drop 4-sigma outlier features (if there are many features)
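The tips above can be illustrated with a minimal sketch. It assumes that "Mahalanobis scaling" here means centering each feature and dividing by its sample standard deviation, and it extracts only the first principal component with a textbook NIPALS iteration. The function names `mahalanobis_scale` and `nipals_component` are hypothetical and are not part of Analyze/StripMiner.

```python
import math

def mahalanobis_scale(X):
    """Center each column and divide by its sample standard deviation (assumed meaning)."""
    n, m = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(m)]
    stds = []
    for j in range(m):
        var = sum((row[j] - means[j]) ** 2 for row in X) / (n - 1)
        stds.append(math.sqrt(var) or 1.0)  # constant columns are left unscaled
    return [[(row[j] - means[j]) / stds[j] for j in range(m)] for row in X]

def nipals_component(X, tol=1e-10, max_iter=500):
    """Return (scores t, loadings p) of the first principal component of centered X."""
    n, m = len(X), len(X[0])
    col_ss = [sum(row[j] ** 2 for row in X) for j in range(m)]
    start = max(range(m), key=lambda j: col_ss[j])
    if col_ss[start] == 0.0:
        raise ValueError("all columns are zero after centering")
    t = [row[start] for row in X]  # initial scores: the column with largest sum of squares
    p = [0.0] * m
    for _ in range(max_iter):
        tt = sum(v * v for v in t)
        # Loadings: project the data onto the current scores, then normalize.
        p = [sum(X[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
        norm = math.sqrt(sum(v * v for v in p))
        p = [v / norm for v in p]
        # Scores: project the data onto the normalized loadings.
        t_new = [sum(X[i][j] * p[j] for j in range(m)) for i in range(n)]
        change = sum((a - b) ** 2 for a, b in zip(t, t_new))
        t = t_new
        if change < tol:
            break
    return t, p
```

Note that the response values never enter this computation, which is the sense in which PCA regression does not consider the response data.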

6 StripMiner Script Examples
- PCA visualization (pca.bat)
- Pharma-plot (pharma.bat)
- Prediction for iris with PCA (iris.bat)
- Bootstrap prediction for iris (iris_boo.bat)
- Predicting with an external test set example (iris_ext.bat)
- PLS and ROC curve for iris problem (roc.bat)
- Leave-One-Out PLS for HIV (loo_hiv.bat)
- Feature selection for HIV (prune.bat)
- Starplots (star.bat)

7 File Flow for PCA.bat Script
- num_eg.txt contains the number of principal components (2-10)
- Usually the data are first Mahalanobis scaled (option #-3: “PLS scaling”, data only)
- Files in the flow diagram: num_eg.txt, stats.txt, la_sscala.txt, iris.txt

8 File Flow for pharma.bat Script
- num_eg.txt has to contain a 4 for a pharmaplot
- Use pharmaplot.m for visualization in MATLAB
- Adjust the color setting threshold in pharmaplot.m
- Files in the flow diagram: num_eg.txt, stats.txt, la_sscala.txt, dmatrix.txt, a.txt, pharmaplot

9 File Flow for iris.bat Script: Predicting Class
- Do not use 0 as the random seed in the splitting routine (a seed of 0 preserves the original order)
- The test set is really only for validation purposes (the answer is known)
- Note: descaling from PLS uses the la_sscala.txt file
- Notice the q2, Q2, and RMSE error measures
- Files in the flow diagram: num_eg.txt, stats.txt, la_sscala.txt, a.txt, cmatrix.txt, dmatrix.txt, resultss.xxx, resultss.ttt, results.xxx, results.ttt

10 File Flow for iris_boo.bat Script: Bootstrap Validation for Estimating Prediction Confidence
- We use bootstrap cross-validation (e.g., leave 7 out, 100 times)
- Use the MATLAB script dos_mbotw on results.ttt to display results for the test set
- Use the MATLAB script dos_mbotw on resultss.xxx to display results for the training set
- Notice the q2, Q2, and RMSE error measures
- Files in the flow diagram: num_eg.txt, stats.txt, la_sscala.txt, a.txt, resultss.xxx, resultss.ttt, results.ttt
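The "leave 7 out, 100 times" scheme can be sketched as repeated random hold-out. This is an illustration only: it assumes each round holds out 7 samples drawn without replacement and trains on the rest, and the function name `bootstrap_splits` and its parameters are hypothetical, not the StripMiner splitting routine (whose seed convention may differ from Python's).

```python
import random

def bootstrap_splits(n_samples, n_left_out=7, n_rounds=100, seed=1):
    """Yield (train_indices, test_indices) for repeated random hold-out rounds."""
    rng = random.Random(seed)  # fixed seed makes the splits reproducible
    indices = list(range(n_samples))
    for _ in range(n_rounds):
        test = sorted(rng.sample(indices, n_left_out))  # held-out samples
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test

# Example: 150 samples (the size of the iris data set), 100 rounds of leave-7-out.
splits = list(bootstrap_splits(150))
```

Averaging the test-set error over all rounds, and repeating with different seeds, gives a rough estimate of how much the prediction quality varies with the split.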

11 Error Measure Criteria
For the training set we use:
- RMSE: root mean square error for the training set
- r^2: squared correlation coefficient for the training set
- R^2: PRESS R^2
For the validation/test set we use:
- RMSE: root mean square error for the validation set
- q^2: 1 - r_test^2
- Q^2: PRESS/SD
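A minimal sketch of these measures follows. It assumes that PRESS is the sum of squared prediction errors and that SD denotes the sum of squared deviations of the observed responses from their mean; the slide does not spell out either definition, so both are assumptions. The function names are hypothetical. Under these assumptions, smaller q^2 and Q^2 values indicate better predictions.

```python
import math

def rmse(y, y_hat):
    """Root mean square error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))

def r_squared(y, y_hat):
    """Squared Pearson correlation between observed and predicted values."""
    my, mh = sum(y) / len(y), sum(y_hat) / len(y_hat)
    sxy = sum((a - my) * (b - mh) for a, b in zip(y, y_hat))
    sxx = sum((a - my) ** 2 for a in y)
    syy = sum((b - mh) ** 2 for b in y_hat)
    return sxy * sxy / (sxx * syy)

def little_q2(y_test, y_hat):
    """q^2 = 1 - r_test^2 (0 means perfectly correlated predictions)."""
    return 1.0 - r_squared(y_test, y_hat)

def big_q2(y_test, y_hat):
    """Q^2 = PRESS / SD, with PRESS and SD defined as assumed in the lead-in."""
    mean_y = sum(y_test) / len(y_test)
    press = sum((a - b) ** 2 for a, b in zip(y_test, y_hat))
    sd = sum((a - mean_y) ** 2 for a in y_test)
    return press / sd
```

Unlike q^2, the Q^2 ratio penalizes predictions that are well correlated with the observations but offset or wrongly scaled.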

12 Script for Scaling with an External Test Set
- 3305: scatterplot (Java)
- -3305: scatterplot (gnuplot)
- 3313: errorplot (Java)
- -3313: errorplot (gnuplot)

13

14 Docking Ligands is a Nonlinear Problem

15 PLS, K-PLS, SVM, ANN Feature Selection (data strip mining)

16 Binding affinities to human serum albumin (HSA): log K’hsa
- Gonzalo Colmenarejo, GlaxoSmithKline, J. Med. Chem. 2001, 44, 4370-4378
- 95 molecules, 250-1500+ descriptors
- 84 training, 10 testing (1 left out)
- 551 Wavelet + PEST + MOE descriptors
- Widely different compounds
- Acknowledgements: Sean Ekins (Concurrent), N. Sukumar (Rensselaer)

17 Script for ALBUMIN_LOO.BAT: PLS-LOO Validation for Albumin Data
- PLS-LOO stands for leave-one-out PLS cross-validation
- The training set is in cmatrix.ori and the external validation set is in dmatrix.ori
- The external validation set has –999 or 0 in the activity field
- Note that we create generic labels and that there is a test set
- Notice the dropping of non-changing features and 4-sigma outliers
- Notice the acrobatics for displaying metrics (visualize with dos_mbotw)
- Files in the flow diagram: cmatrix.ori, dmatrix.ori, num_eg.txt, stats.txt, la_sscala.txt, a.txt, results.xxx, results.ttt, sel_lbls.txt, bbmatrixx.txt, bbmatrixxx.txt
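The feature clean-up mentioned above can be sketched as follows. The sketch assumes that a "non-changing" feature is one whose value is identical for every sample and that a "4-sigma outlier feature" is one containing at least one value more than four sample standard deviations from the feature mean. The function names are hypothetical and do not correspond to StripMiner operators.

```python
import math

def keep_feature(column, n_sigma=4.0):
    """Return False for constant columns and for columns with an n-sigma outlier."""
    if max(column) == min(column):
        return False  # non-changing feature carries no information
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
    return all(abs(v - mean) <= n_sigma * sd for v in column)

def select_features(X, labels, n_sigma=4.0):
    """Drop rejected columns from X (list of rows) and from the label list."""
    columns = list(zip(*X))
    kept = [j for j, col in enumerate(columns) if keep_feature(col, n_sigma)]
    return [[row[j] for j in kept] for row in X], [labels[j] for j in kept]

# Example with 30 samples: a constant column, a ramp, and a column with one spike.
X = [[5.0, float(i), 100.0 if i == 0 else 0.0] for i in range(30)]
X_clean, kept_labels = select_features(X, ["const", "ramp", "spike"])
```

With the sample standard deviation, a single value can only exceed four sigma when there are at least 18 samples, so this filter has no effect on very small data sets.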

18

19

20 PLS Feature Selection Script for Albumin Data
- Do several iterative prunings, typically leave 7 out, 100 times
- Use different seeds
- Number of selected features, for example: 400, 300, 200, 150, 120, 100, 80, 60, 50, 45, …
- Files in the flow diagram: aa.pat, bbmatrixx.txt, sel_lbls.txt, select.txt, sel_lbls.txt, aa.pat, aa.tes, bbmatrixx.txt, bbmatrixxx.txt
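The pruning schedule can be sketched as a loop that repeatedly re-scores the surviving features and keeps the top-ranked ones. This is only an illustration of the idea: `sensitivity_fn` is a hypothetical callback standing in for whatever sensitivity estimate the PLS bootstrap runs produce, and `iterative_prune` is not a StripMiner operator.

```python
def iterative_prune(labels, schedule, sensitivity_fn):
    """Shrink the feature list step by step, re-scoring after every pruning.

    labels         -- feature names to start from
    schedule       -- decreasing feature counts, e.g. [400, 300, 200, 150]
    sensitivity_fn -- hypothetical callback: list of names -> {name: score}
    """
    kept = list(labels)
    for target in schedule:
        if target >= len(kept):
            continue  # nothing to prune at this step
        scores = sensitivity_fn(kept)  # re-estimate on the surviving features
        ranked = sorted(kept, key=lambda name: scores[name], reverse=True)
        kept = ranked[:target]
    return kept

# Toy example: the "sensitivity" of feature fN is simply N.
def toy_sensitivity(names):
    return {name: int(name[1:]) for name in names}

selected = iterative_prune([f"f{i}" for i in range(10)], [8, 5, 3], toy_sensitivity)
```

Re-scoring after each step matters because the sensitivity of a feature can change once correlated features have been removed.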

21

22 STARPLOT.BAT: Starplot for Selected Features for Albumin
- First generate bbmatrixxx.txt, which contains all sensitivities for (e.g.) 30 bootstraps, using PLS bootstrap option 33
- Generate starplot.txt from bbmatrixxx.txt using option 3320
- Use the MATLAB routine starplot.m (operates on starplot.txt and sel_lbls.txt)
- Files in the flow diagram: sel_lbls.txt, aa.pat, bbmatrixxx.txt, sel_lbls.txt, starplot.txt, starplot

23

