Presentation on theme: "Statistics evaluation and graphics"— Presentation transcript:
1 Statistics evaluation and graphics with ChemAxon tools and Statistica and WEKA: towards QSPR and QSAR development
Tobias Kind, FiehnLab at UC Davis Genome Center, November 2006
Free Academic Licenses for JChem and Instant JChem provided by ChemAxon
Academic License for Statistica Dataminer provided by Statsoft
GNU general public license for WEKA provided by WEKA Machine Learning Project
Technical presentation. See notes and comments for deeper discussion. ChemAxon, Fiehnlab (fiehnlab.ucdavis.edu)
Statistics - QSPR/QSAR - with JChem and Statistica and WEKA and Yale
2 Metabolomics - The science of the small molecules
Compound classes: sugars, amino acids, steroids, fatty acids, lipids, phospholipids, organic acids ...
Molecules under investigation. Visit us!
(Figure: 3D model of a molecule with surface plot)
3 Techniques and tools
Analytical techniques (LC-MS, GC-MS, FT-MS, NMR, IR): Liquid Chromatography LC-MS, Gas Chromatography GC-MS
Bioinformatics and Cheminformatics
Statistics (Statistica Dataminer)
Open Source Tools
4 ChemAxon JChem now has PCA and PLS
PCA - principal component analysis
PLS - partial least squares
Create a new library with the JChem Manager GUI (test case here: fingerprints).
Extract the fingerprints and do dimension reduction with principal component analysis (PCA) using the command line tool PCA.bat or pca.sh.
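To make "dimension reduction with PCA" concrete, here is a minimal pure-Python sketch of PCA by the NIPALS iteration. This is an assumption for illustration only: the slides do not say which algorithm JChem uses internally (the tool's help text mentions a "maximal error during the iteration", which only hints at some iterative scheme). The function names and the `max_err` default are hypothetical.

```python
import math

def center_columns(rows):
    """Subtract each column's mean so every column averages to zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def nipals_pca(rows, n_components=2, max_err=1e-10, max_iter=1000):
    """Return (scores, loadings, eigenvalues) of mean-centered data.

    scores:   one list per component, one value per row (molecule)
    loadings: one unit-length list per component, one value per column
    """
    x = center_columns(rows)
    n, m = len(x), len(x[0])
    scores, loadings, eigenvalues = [], [], []
    for _ in range(n_components):
        # Start from the column with the largest sum of squares.
        t = list(max(zip(*x), key=lambda c: sum(v * v for v in c)))
        if sum(v * v for v in t) == 0.0:
            break  # nothing left to explain
        for _ in range(max_iter):
            tt = sum(v * v for v in t)
            # Loading: project every column onto the current score vector.
            p = [sum(x[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
            norm = math.sqrt(sum(v * v for v in p))
            p = [v / norm for v in p]
            # New score: project every row onto the loading vector.
            t_new = [sum(a * b for a, b in zip(row, p)) for row in x]
            change = sum((a - b) ** 2 for a, b in zip(t_new, t))
            t = t_new
            if change < max_err:
                break
        # Deflate: remove the component just found from the data.
        x = [[x[i][j] - t[i] * p[j] for j in range(m)] for i in range(n)]
        scores.append(t)
        loadings.append(p)
        eigenvalues.append(sum(v * v for v in t) / (n - 1))
    return scores, loadings, eigenvalues
```

Calling `nipals_pca(rows, n_components=4)` on a list of 16-value fingerprint rows would give four score vectors, i.e. the compressed representation of each molecule. Pure Python is far too slow for 250k rows; optimized matrix routines are what one would use in practice.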
5 ChemAxon JChem Principal Component Analysis (PCA)
Start PCA by getting information from the DB (here Access, but it can be Oracle, Derby, MySQL). Test case: chemicals from the NCI DB.
PCA can be done from any descriptor: chemical fingerprints, BCUT etc. This is just a simple example made from the 16 standard fingerprints. Be sure to select only the descriptors you want (and not the molecule ID).
Run with database input (run-pca.bat):
PCA -d "sun.jdbc.odbc.JdbcOdbcDriver" -u "jdbc:odbc:jchem-z" -l test -p test -P JChemProperties -q "SELECT cd_fp1, cd_fp2, cd_fp3, cd_fp4, cd_fp5, cd_fp6, cd_fp7, cd_fp8, cd_fp9, cd_fp10, cd_fp11, cd_fp12, cd_fp13, cd_fp14, cd_fp15, cd_fp16 FROM nci99 WHERE cd_id <= " -o PCA-scores.txt -t PCA-Eigenvalues.txt
TimeThis : Command Line : run-pca.bat
TimeThis : Start Time : Mon Nov 27 17:02:
TimeThis : End Time : Mon Nov 27 17:19:
TimeThis : Elapsed Time : 00:17:49.812
Test system: AMD Dual Opteron 2.8 GHz, 2.8 GByte RAM; WinXP 32 bit
---
Run with file input:
TimeThis : Command Line : pca -i test-25kx16.txt -o PCA250k-scores-external.txt -t PCA250k-eigen-external.txt
TimeThis : Start Time : Mon Nov 27 22:24:
TimeThis : End Time : Mon Nov 27 22:40:
TimeThis : Elapsed Time : 00:15:45.375
----
Z:\>pca -h
PCA 3.2, (C) ChemAxon Ltd.
Principal Component Analisis.
Usage:
pca [options]
General options:
-h --help                     this help message
-d --driver <JDBC driver>     JDBC driver
-u --dburl <url>              URL of database
-l --login <login>            login name
-p --password <password>      password
-s --saveconf                 save settings into "C:\Documents and Settings\Tobi\chemaxon\.jchem"
-m --meancenter               Don't autoscale just mean center data
-s --noStandardize            Don't mean center and autoscale the data
-e --maxerr                   maximal error during the iteration
Input options (default: standard input):
-i --input <path>             input file
-q --query <sql>              SQL query string for reading input (database input)
Output options (default: standard output):
-o --scoreOutput <filepath>   output file path for principal components scores (text file output)
-t --infoOutput <filepath>    output file path for Eigenvalues, Cumulated variance ... (text file output)
Problems here:
A) JDBC extraction is not tuned: DB extraction of the values takes nearly 2 minutes.
B) PCA calculation time is too long: 15 minutes for a matrix x 16.
The current PCA algorithm needs to be changed, it is very inefficient (faster matrix routines exist for Java).
Database extraction time with Statistica: 8 seconds. The same PCA with Statistica is finished in 1 second (no joke, that's a factor of 1:900).
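The help text distinguishes between mean centering only and full autoscaling. A minimal sketch of the two preprocessing steps follows, assuming autoscaling means "subtract the column mean, then divide by the column's sample standard deviation"; whether JChem divides by n-1 or n is not stated in the slides, and the function names are hypothetical.

```python
import statistics

def mean_center(rows):
    """Subtract the column mean from every value (no rescaling)."""
    means = [statistics.fmean(col) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def autoscale(rows):
    """Mean-center each column and divide by its sample standard deviation."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.stdev(c) for c in cols]
    # A constant column has sd == 0; keep it at zero instead of dividing.
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, sds)]
        for row in rows
    ]
```

Mean centering keeps the original units, so columns with large variance dominate the PCA; autoscaling gives every descriptor unit variance, which matters when descriptors live on very different scales.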
6 JChem PCA output
The text output contains: Eigenvalues, % and Cumulated variance (in rows); Loadings (in rows); PCA scores.
Compared with Statistica, the JChem PCA results matrix is inverted and the values are multiplied by -1.
Problem: currently no graphics. But multivariate statistics lives from graphics.
The following simple graphic examples are made with Statistica or WEKA via DB query.
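A factor of -1 between two packages is not an error in itself: the sign of a principal component is arbitrary, so both outputs describe the same axes. A small sketch for bringing two outputs into the same orientation before comparing them is shown below. It assumes "inverted" means the matrix is transposed (rows and columns swapped), which the slide does not spell out; the helper names are hypothetical.

```python
def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*matrix)]

def align_signs(reference, other):
    """Flip each component in `other` that points opposite to `reference`.

    Both arguments are lists of components; each component is a list of
    scores (one value per molecule) in the same molecule order.
    """
    aligned = []
    for ref_comp, comp in zip(reference, other):
        dot = sum(a * b for a, b in zip(ref_comp, comp))
        aligned.append([-v for v in comp] if dot < 0 else list(comp))
    return aligned
```

After `align_signs(statistica_scores, transpose(jchem_scores))` the two score sets can be compared value by value.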
7 The following slides show "what could be" in the future, or "what can be done" right now. Check the pretty comprehensive statistics link
8 Machine Learning and statistics tools
(Figure panels: PLS, Machine Learning (KNN), Feature selection, Tree model, Neural Network, Cluster Analysis, Response curves)
We use Statistica Dataminer as a comprehensive statistics work tool. WEKA or YALE are free but not (yet :-) as powerful as the Statistica Dataminer.
Try MATLAB if you want to die from command line sickness, BUT MATLAB is very fast and compiled for each specific CPU.
9 Connection of a JChem molecule DB via JDBC with Statistica
For Oracle and Apache Derby and multi-core CPU speed with JChem calculations check here:
Check my other presentation for JChem (See )
Or copy this
Or this (not mine)
Time for query + copy of 4,000,000 values (250k molecules x 16 fingerprints) = 8 seconds.
Test system: JChem 3.2 with MS Access, Statistica Dataminer 7.1, Dual Opteron 2.8 GHz
10 Statistica with JChem data
Statistica has built-in functions for most (if not all) statistical routines.
It's far more comfortable than R or MATLAB; R has only a command line.
Try also YALE or WEKA.
11 PCA Scree plot - determine optimal factors to retain
(Figure: scree plot with a visible step, Statistica Dataminer 7.1)
Four factors can be retained. The 16-dimensional space can be compressed into a 4-dimensional space. (The scree plot is not optimal here.)
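When the scree plot shows no clean step, a common numerical companion is the percent and cumulated variance per component, computed from the eigenvalues. The sketch below is hedged: the cut-off percentage is an arbitrary choice made for illustration, not a rule from the slides, and the function names are hypothetical.

```python
from itertools import accumulate

def explained_variance(eigenvalues):
    """Return (percent, cumulated percent) of variance per component."""
    total = sum(eigenvalues)
    percent = [100.0 * ev / total for ev in eigenvalues]
    return percent, list(accumulate(percent))

def components_for(cumulated, target_percent):
    """Smallest number of leading components that reaches the target."""
    for k, c in enumerate(cumulated, start=1):
        if c >= target_percent:
            return k
    return len(cumulated)
```

Feeding in the eigenvalues from the PCA info output and asking, for example, `components_for(cumulated, 80.0)` gives a second opinion on how many factors to retain.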
12 PCA Loadings plot - which variables are influential?
(Figure: loadings plot, Statistica Dataminer 7.1)
Which of the 16 fingerprints are similar? Those that "cluster" together are similar (fp_11 and fp_14). The variables fp_5 and fp_16 influence factor 1 in the same way. Variables inside or near the center (0,0) have no discrimination power.
Remember PCA is not a cluster analysis! If you want to cluster loadings you have to put the loadings output into a cluster analysis.
13 PCA Scores plot - picture of the reduced dimensionality
(Figure: scores plot, Statistica Dataminer 7.1)
The 16 fingerprints are compressed into 2D. We can use other high-dimensional descriptors for enhanced examples. Cases (molecules) which "cluster" together may have the same properties or functional groups (depending on input). Here we see that the KOW molecule set covers the whole NCI dataset based on the 16 fps.
14 PCA Scores 3D plot - KOWWIN versus silicon compound test set
(Figure: 3D scores plot, Statistica Dataminer 7.1)
The 16 fingerprints are compressed into 3D. The KOWWIN test set does not cover the whole molecule space of important silicon-containing molecules. You can also do an Overlap Analysis (compare two databases) within the all-new Instant-JChem.
15 Statistica - Random Forest machine learning
1024-DIM CF descriptor space. Statistica generates all graphical output + SQL code.
Chemical fingerprint descriptors generated with JChem GenerateMD:
Z:\>timethis "generatemd c 10k-test.smi -T -2 -k CF >10k-fp.txt"
TimeThis : Command Line : generatemd c 10k-test.smi -T -2 -k CF >10k-fp.txt
TimeThis : Start Time : Wed Nov 29 20:35:
TimeThis : End Time : Wed Nov 29 20:35:
TimeThis : Elapsed Time : 00:00:05.421
GenerateMD performance: 1800 molecules/second for the 1024-dimensional fp, on Dual Opteron 2.8 GHz (one core used only).
------
Miklos:
Chemical fingerprint generation: 500/s
Pharmacophore fingerprint generation: calculated 80/s, rule-based 200/s
Screening: 12000/s
Optimization: 10 s/metric
Hardware/software environment: P4 3 GHz, 1 GB RAM, Red Hat Linux 9, Java 1.4.2
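The quoted rate of about 1800 molecules/second can be checked from the TimeThis output. The sketch assumes that 10k-test.smi holds 10,000 molecules (the file name suggests it, the slide does not say so); the helper names are hypothetical.

```python
def elapsed_seconds(hms):
    """Convert a TimeThis 'HH:MM:SS.mmm' string to seconds."""
    hours, minutes, seconds = hms.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)

def throughput(n_items, hms):
    """Items processed per second for a given elapsed time."""
    return n_items / elapsed_seconds(hms)

# Assumption: 10,000 molecules in the input file.
rate = throughput(10_000, "00:00:05.421")
```

The same helper converts the PCA timings to seconds, e.g. `elapsed_seconds("00:15:45.375")`, which is where the roughly 900-fold gap to a 1-second run comes from.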
16 CART tree method for QSPR and QSAR
Classification trees, boosting trees, random forest, regression trees, honest trees and adaptive trees - lots of wood and forests - did you hear about them?
That's no joke, check out scholar.google.com
17 Other machine learning techniques from Statistica Dataminer we use
Most of them work for classification and regression.
#   Model class                       Specific model
1   Generalized Linear Models (GLM)   General Discriminant Analysis
2   Generalized Linear Models (GLM)   Binary logit (logistic) regression
3   Generalized Linear Models (GLM)   Binary probit regression
4   Nonlinear model                   Multivariate adaptive regression splines (MARS)
5   Tree models                       Standard Classification Trees (CART)
6   Tree models                       Standard General Chi-square Automatic Interaction Detector (CHAID)
7   Tree models                       Exhaustive CHAID
8   Tree models                       Boosting classification trees
9   Neural Networks                   Multilayer Perceptron neural network (MLP)
10  Neural Networks                   Radial Basis Function neural network (RBF)
11  Machine Learning                  Support Vector Machines (SVM)
12  Machine Learning                  Naive Bayes classifier
13  Machine Learning                  k-Nearest Neighbors (KNN)
More than functions available
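To give a feeling for the simplest entry in the list, here is a minimal k-Nearest Neighbors classifier in pure Python. It is a generic textbook sketch, not Statistica's implementation; the Euclidean distance, the default k and the class labels in the usage note are assumptions for illustration.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Predict the label of `query` by majority vote of its k nearest rows."""
    # Pair each training row's Euclidean distance to the query with its label.
    neighbours = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]
```

With descriptor rows as `train_x` and hypothetical class labels such as "active" / "inactive" as `train_y`, `knn_predict(train_x, train_y, new_row)` classifies a new molecule from its neighbours in descriptor space.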
18 Now with the open source data mining tool WEKA
Easy: enter DB URL, enter SQL statement, import data. (Screenshot labels: URL, SQL, Data; yellow = OK.)
For MS Access: create the data source from the ADMIN tools (JDBC driver, add DSN file, create DB); or use Oracle settings.
File databaseutils.props in the WEKA root DIR:
jdbcDriver=sun.jdbc.odbc.JdbcOdbcDriver
jdbcURL=jdbc:odbc:jchem-z
SQL:
SELECT silicon.`cd_fp1`, silicon.`cd_fp2`, silicon.`cd_fp3`, silicon.`cd_fp4`, silicon.`cd_fp5`, silicon.`cd_fp6`, silicon.`cd_fp7`, silicon.`cd_fp8`, silicon.`cd_fp9`, silicon.`cd_fp10`, silicon.`cd_fp11`, silicon.`cd_fp12`, silicon.`cd_fp13`, silicon.`cd_fp14`, silicon.`cd_fp15`, silicon.`cd_fp16` FROM `Z:\access-DB\silicon`.`silicon` silicon
Try free AquaStudio for SQL!
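Typing sixteen column names by hand is error-prone, so the SELECT statement can be generated. The sketch below builds the query and runs it against an in-memory SQLite table that merely stands in for the JChem table; SQLite, the sample values and the helper name are assumptions for illustration, not the ODBC/Access setup used in the slides.

```python
import sqlite3

def fingerprint_query(table, n_fp=16):
    """Build a SELECT for cd_fp1..cd_fpN, leaving out the molecule ID.

    The table name is inserted as-is, so only pass trusted names.
    """
    columns = ", ".join(f"cd_fp{i}" for i in range(1, n_fp + 1))
    return f"SELECT {columns} FROM {table}"

# Stand-in database: one molecule with made-up fingerprint values.
conn = sqlite3.connect(":memory:")
column_defs = ", ".join(f"cd_fp{i} INTEGER" for i in range(1, 17))
conn.execute(f"CREATE TABLE silicon (cd_id INTEGER PRIMARY KEY, {column_defs})")
placeholders = ", ".join(["?"] * 17)
conn.execute(f"INSERT INTO silicon VALUES ({placeholders})", [1] + list(range(100, 116)))

rows = conn.execute(fingerprint_query("silicon")).fetchall()
conn.close()
```

The generated string can be pasted into the WEKA SQL field or passed to the JChem PCA tool's query option; only the descriptor columns come back, not cd_id.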
19 WEKA - Machine learning algorithms in Java
20 WEKA – fingerprint visualization Data matrix 22,000x16
21 Conclusions regarding statistics:
JChem PCA and PLS output (Eigenvalues, scores, loadings) is provided only as a text file. More univariate and multivariate tools are needed.
JChem PCA and PLS results must have graphical output. (They must.)
JChem PCA must be made faster (factor ) by using math routines.
Integration into Instant-JChem would be good, or ChemAxon could provide enhanced bundled statistics tools.
Currently a JDBC query from JChem to other statistical packages like WEKA or Statistica or R or MATLAB or YALE is perfect. Each package works best in the field it was designed for.
MATLAB, R and YALE database connections (JDBC or ODBC) are not shown here.
That's it. Thanks