
Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang.


1

2 Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models Xue Z Wang

3 The Background
TOXICITY values are not known!
- 26 million distinct organic and inorganic chemicals are known
- > 80,000 are in commercial production
- Combinatorial chemistry adds more than 1 million new compounds to the library every year
- In the UK, > 10,000 are evaluated for possible production every year
- Toxicity evaluation is the biggest cost factor

4 What is toxicity?
"The dose makes the poison" - Paracelsus (1493-1541)
Toxicity Endpoints: EC50, LC50, ...

5 Toxicity tests are expensive, time consuming and disliked by many people
In Silico Toxicity Prediction: SAR & QSAR, (Quantitative) Structure-Activity Relationships
Example systems: TOPKAT, DEREK, MultiCASE

6 SAR & QSARs: e.g. neural networks, PLS, expert systems
Toxicity endpoints: Daphnia magna EC50, carcinogenicity, mutagenicity, rat oral LD50, mouse inhalation LC50, skin sensitisation, eye irritancy
Descriptors (physicochemical, biological, structural; obtained by molecular modelling): molecular weight, HOMO, LUMO, heat of formation, Log D at pH 2, 7.4 and 10, dipole moment, polarisability, total energy, molecular volume, ... (the number of descriptors trades off against cost and time)
HOMO: highest occupied molecular orbital; LUMO: lowest unoccupied molecular orbital

7 Aims of Research
- An integrated data mining environment (IDME) for in silico toxicity prediction
- A decision tree induction technique for eco-toxicity modelling
- In silico techniques for mixture toxicity prediction

8 Why a Data Mining System for In Silico Toxicity Prediction?
Existing systems:
- Unknown confidence level of prediction
- Extrapolation
- Models built from small datasets
- Fixed descriptors
- May not cover the endpoint required
- Users' own data resources, often commercially sensitive, are not fully exploited

9 Data Mining: Discover Useful Information and Knowledge from Data
Data: records of numerical data, symbols, images, documents
Data mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data; more importantly, better understanding
(Pyramid: Data, Information, Knowledge, Decision; volume decreases while value increases going up)
Knowledge: rules (IF .. THEN ..), cause-effect relationships, decision trees, patterns (abnormal vs normal operation), predictive equations, ...

10 Clustering, Classification, Conceptual Clustering, Inductive Learning, Dependency Modelling, Summarisation, Regression, Case-based Learning

11 e.g. Dependency Modelling or Link Analysis (diagram showing discovered links between variables x1, x2 and x3)

12 Data pre-processing
- Wavelet for on-line signal feature extraction and dimension reduction
- Fuzzy approach for dynamic trend interpretation
Clustering / classification
- Supervised classification: BPNN; fuzzy set covering approach
- Unsupervised classification: ART2 (adaptive resonance theory); AutoClass; PCA
Dependency modelling
- Bayesian networks
- Fuzzy SDG (signed directed graph)
- Decision trees
Others
- Automatic rule extraction from data using Fuzzy-NN and Fuzzy SDG
- Visualisation

13 Modern control systems
Cost due to PREVENTABLE abnormal operations: e.g. $20 billion per year in the petrochemical industry
Fault detection & diagnosis is very complex: sensor faults, equipment faults, control loops, interaction of variables, ...

14

15 Yussel's work: Process Operational Safety Envelopes (defined between a start point and an end point)
Loss Prevention in the Process Industries, 2002

16 Integrated Data Mining Environment
Data import: Excel, ASCII files, database, XML
Data pre-processing: scaling, missing values, outlier identification, feature extraction, descriptor calculation (toxicity)
Data mining toolbox: regression, PCA & ICA, ART2 networks, Kohonen networks, k-nearest neighbour, fuzzy c-means, decision trees and rules, feedforward neural networks (FFNN), summary statistics, visualisation
Discovery validation: statistical significance, results for training and test sets
Results presentation: graphs, tables, ASCII files

17 User Interface

18 Quantitative Structure Activity Relationship
75 organic compounds with 1094 descriptors and endpoint Log(1/EC50) to Vibrio fischeri (Zhao et al., QSAR 17(2), 1998)
Log(1/EC50) = a + b·Vx (fitted by regression; reported with r² and MSE)
Vx: McGowan's characteristic volume
r²: Pearson's correlation coefficient
q²: leave-one-out cross-validated correlation coefficient
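
The r² and leave-one-out q² statistics quoted above can be reproduced with a short script. This is a minimal sketch for a one-descriptor regression; the arrays below are illustrative placeholders, not Zhao et al.'s compounds:

```python
import numpy as np

def loo_q2(x, y):
    """Leave-one-out cross-validated correlation (q^2) for a
    one-descriptor least-squares model y ~ a + b*x."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i                 # hold out compound i
        b, a = np.polyfit(x[mask], y[mask], 1)   # refit on the rest
        preds[i] = a + b * x[i]                  # predict the held-out value
    ss_res = np.sum((y - preds) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Illustrative data only: Vx (McGowan volume) vs Log(1/EC50)
vx = np.array([0.59, 0.87, 1.02, 1.31, 1.45, 1.76])
log_inv_ec50 = np.array([0.8, 1.4, 1.7, 2.3, 2.6, 3.1])

r2 = np.corrcoef(vx, log_inv_ec50)[0, 1] ** 2    # Pearson r^2
print(f"r2 = {r2:.3f}, q2 = {loo_q2(vx, log_inv_ec50):.3f}")
```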

19 Principal Component Analysis

20 Clustering in IDME

21 Multidimensional Visualisation

22 Feedforward neural networks: the input layer takes principal component scores PC 1, PC 2, PC 3, ..., PC m; one hidden layer; the output layer predicts Log(1/EC50)
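
A minimal sketch of this architecture, assuming scikit-learn as a stand-in for the IDME implementation; the number of retained components, the layer size and the random data are illustrative, not the study's settings:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(75, 1094))   # placeholder descriptor matrix (75 compounds)
y = rng.normal(size=75)           # placeholder Log(1/EC50) values

# Project the descriptors onto m principal components, then fit a
# single-hidden-layer feedforward network on the component scores.
model = make_pipeline(
    PCA(n_components=10),
    MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000, random_state=0),
)
model.fit(X, y)
print(model.predict(X[:3]))       # predicted Log(1/EC50) for 3 compounds
```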

23 FFNN Results graph

24 QSAR Model for Mixture Toxicity Prediction
TRAINING: similar constituents / dissimilar constituents
TESTING: similar constituents / dissimilar constituents

25 Why Inductive Data Mining for In Silico Toxicity Prediction?
- Lack of knowledge on which descriptors are important to toxicity endpoints (feature selection)
- Expert systems rely on subjective knowledge obtained from human experts
- Linear vs nonlinear
- Black-box models

26 What is inductive learning?
Aims at developing a qualitative causal language for grouping data patterns into clusters
Outputs decision trees or production rules: explicit and transparent

27 Statistical Methods: data driven, quantitative; black-box, human knowledge not used
Neural Networks: data driven, quantitative, nonlinear, easy setup; black-box, human knowledge not used
Expert Systems: human expert knowledge, transparent and causal; knowledge subjective, data not used, often qualitative
Inductive DM: combines the advantages of ESs, SMs & NNs; qualitative & quantitative, nonlinear; data & human knowledge used; knowledge transparent and causal; needs more research on continuous valued output and dynamics / interactions

28 Discretization techniques:
- C5.0: binary discretization by information entropy (Quinlan 1986 & 1993)
- LERS (Learning from Examples using Rough Sets, Grzymala-Busse 1997)
- Probability distribution histogram
- Equal width interval
- KEX (Knowledge EXplorer, Berka & Bruha 1998)
- CN4 (Berka & Bruha 1998)
- Chi2 (Liu & Setiono 1995; Kerber 1992)
Methods tested: C5.0, LERS_C5.0, Histogram_C5.0, EQI_C5.0, KEX_chi_C5.0, KEX_fre_C5.0, KEX_fuzzy_C5.0, CN4_C5.0, Chi2_C5.0
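
As an illustration of the simplest scheme in this list, a minimal sketch of equal width interval discretization of a continuous endpoint; the function name, bin count and sample values are illustrative, not from the IDME code:

```python
import numpy as np

def equal_width_bins(values, n_bins=4):
    """Discretise continuous endpoint values into equal-width
    interval classes 1..n_bins."""
    lo, hi = values.min(), values.max()
    edges = np.linspace(lo, hi, n_bins + 1)        # n_bins equal intervals
    # np.digitize returns 1..n_bins for values inside the range; clip so the
    # maximum value falls into the top class rather than class n_bins + 1.
    classes = np.clip(np.digitize(values, edges), 1, n_bins)
    return classes, edges

toxicity = np.array([0.2, 0.9, 1.5, 2.1, 2.8, 3.6, 4.4])
classes, edges = equal_width_bins(toxicity)
print(classes, edges)
```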

29 Decision Tree Generation Based on Genetic Programming
Traditional tree generation methods rely on greedy search and can miss potential models.
Genetic Algorithm: an optimisation approach that can effectively avoid local minima and simultaneously evaluate many solutions; GAs have been used in decision tree generation to decide the splitting points and attributes to be used whilst growing a tree.
Genetic (evolutionary) programming not only simultaneously evaluates many solutions and avoids local minima, but also does not require parameter encoding into fixed-length vectors called chromosomes; it is based on direct application of the GA to tree structures.

30 Genetic Computation
(1) Generate a population of solutions
(2) Repeat steps (i) and (ii) until the stop criteria are satisfied:
(i) calculate the fitness function value for each solution candidate
(ii) perform crossover and mutation to generate the next generation
(3) The best solution over all generations is regarded as the solution
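
A generic sketch of this loop; the four callables are problem-specific (here they would operate on decision trees), a fixed generation budget stands in for the stop criteria, and plain truncation selection stands in for the tournament selection described later. All names are illustrative:

```python
import random

def evolve(init_population, fitness, crossover, mutate,
           generations=50, mutation_rate=0.5):
    """Skeleton of the evolutionary computation loop: evaluate, select,
    recombine, mutate; return the best solution seen in any generation."""
    population = init_population()            # step (1); needs >= 4 members
    best = max(population, key=fitness)
    for _ in range(generations):              # step (2), fixed budget
        scored = sorted(population, key=fitness, reverse=True)
        if fitness(scored[0]) > fitness(best):
            best = scored[0]                  # step (3): best of all generations
        parents = scored[: len(scored) // 2]  # truncation selection (simplified)
        children = []
        while len(children) < len(population):
            a, b = random.sample(parents, 2)
            child = crossover(a, b)
            if random.random() < mutation_rate:
                child = mutate(child)
            children.append(child)
        population = children
    return best
```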

31 Crossover
In genetic algorithms, crossover exchanges segments of fixed-length chromosomes; in genetic (evolutionary) programming / EPTree, crossover exchanges subtrees between two parent trees to produce offspring.

32 
1. Divide data into training and test sets
2. Generate the 1st population of trees:
- randomly choose a row (i.e. a compound) and a column (i.e. a descriptor) from the molecules-by-descriptors table
- use the value of that slot, s, to split: the left child takes the data points with selected attribute values <= s, the right child those with values > s
DeLisle & Dixon, J Chem Inf Comput Sci 44 (2004); Buontempo, Wang et al., J Chem Inf Comput Sci 45 (2005)

33 
- If a child would not cover enough rows (e.g. 10% of the training rows), another combination is tried.
- A child node becomes a leaf node if pure (i.e. all the rows covered are in the same class) or near pure, whilst the other nodes grow children.
- When all nodes either have two children or are leaf nodes, the tree is fully grown and added to the first generation.
- A leaf node is assigned the class label corresponding to the majority class of the points partitioned there.
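
A minimal sketch of this tree-growing scheme, using a plain dict per node; the 10% minimum-coverage check follows the slide above, while the "near pure" stopping rule is approximated here by a size threshold, and all other names and limits are illustrative:

```python
import random
from collections import Counter

MIN_FRACTION = 0.10   # each child must cover >= 10% of the training rows

def grow_tree(rows, labels, n_total, max_tries=20):
    """Grow one random tree for the initial population: pick a random
    compound (row) and descriptor (column), split on that value, and
    stop at pure (or too-small-to-split) nodes."""
    counts = Counter(labels)
    majority = counts.most_common(1)[0][0]     # majority class at this node
    min_rows = max(1, int(MIN_FRACTION * n_total))
    if len(counts) == 1 or len(rows) < 2 * min_rows:
        return {"leaf": majority}              # pure, or cannot split further
    for _ in range(max_tries):                 # try other combinations if a
        col = random.randrange(len(rows[0]))   # child would be too small
        s = random.choice(rows)[col]           # split value from a random row
        left = [i for i in range(len(rows)) if rows[i][col] <= s]
        right = [i for i in range(len(rows)) if rows[i][col] > s]
        if min(len(left), len(right)) >= min_rows:
            return {"col": col, "split": s,
                    "left": grow_tree([rows[i] for i in left],
                                      [labels[i] for i in left], n_total),
                    "right": grow_tree([rows[i] for i in right],
                                       [labels[i] for i in right], n_total)}
    return {"leaf": majority}                  # no acceptable split found
```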

34 
3. Crossover and mutation:
- Tournament: randomly select a group of trees, e.g. 16
- Calculate fitness values
- Generate the first parent
- Similarly generate the second parent
- Crossover to generate a child
- Generate other children
- Select a percentage for mutation
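
A sketch of tournament selection and subtree crossover over the dict-based trees from the growing sketch above; the tournament size of 16 follows the parameter settings reported later, and the other details are illustrative:

```python
import copy
import random

def all_subtrees(tree, out=None):
    """Collect (parent, key) slots for every subtree of a dict-based tree."""
    if out is None:
        out = []
    for key in ("left", "right"):
        if key in tree:
            out.append((tree, key))
            all_subtrees(tree[key], out)
    return out

def tournament(population, fitness, size=16):
    """The fittest of `size` randomly sampled trees becomes a parent."""
    return max(random.sample(population, size), key=fitness)

def crossover(parent_a, parent_b):
    """Replace a random subtree of a copy of parent_a with a random
    subtree copied out of parent_b; both parents survive unchanged."""
    child = copy.deepcopy(parent_a)
    slots = all_subtrees(child)
    if not slots:                              # parent_a is a single leaf
        return child
    target, key = random.choice(slots)
    donor_slots = all_subtrees(parent_b)
    if donor_slots:
        d_parent, d_key = random.choice(donor_slots)
        target[key] = copy.deepcopy(d_parent[d_key])
    else:
        target[key] = copy.deepcopy(parent_b)  # donor is a single leaf
    return child
```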

35 Mutation Methods
- Random change of the split point (i.e. choosing a different row's value for the current attribute)
- Choosing a new attribute whilst keeping the same row
- Choosing a new attribute and a new row
- Re-growing part of the tree
- If no improvement in accuracy for k generations, the trees generated were mutated
- ...
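
A sketch of the first three mutation moves on the same dict-based trees; re-growing part of the tree and the stagnation trigger are omitted for brevity, and the move names are illustrative:

```python
import random

def mutate(tree, rows):
    """Mutate a random internal node: new split value for the current
    descriptor, new descriptor for the same row, or both."""
    nodes, stack = [], [tree]
    while stack:                               # gather internal (split) nodes
        node = stack.pop()
        if "leaf" not in node:
            nodes.append(node)
            stack.extend([node["left"], node["right"]])
    if not nodes:                              # tree is a single leaf
        return tree
    node = random.choice(nodes)
    move = random.choice(["new_value", "new_attribute", "both"])
    if move in ("new_attribute", "both"):
        node["col"] = random.randrange(len(rows[0]))       # new descriptor
    if move in ("new_value", "both"):
        node["split"] = random.choice(rows)[node["col"]]   # another row's value
    return tree
```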

36 Two Data Sets
Data Set 1: concentration lethal to 50% of the population, LC50 (modelled as 1/Log(LC50)), for Vibrio fischeri, a bioluminescent bacterium; 75 compounds, 1069 molecular descriptors
Data Set 2: concentration affecting 50% of the population, EC50, for the alga Chlorella vulgaris, measured by the disappearance of fluorescein diacetate; 80 compounds, 1150 descriptors

37 600 trees were grown in each generation, with 16 trees competing in each tournament to select trees for crossover; 66.7% were mutated for the bacterial dataset and 50% for the algae dataset.
For each dataset, the endpoint values were discretised into four classes (Class 1 to Class 4 ranges spanning the observed minimum to maximum), separately for the bacteria and algae data.

38 Evolutionary Programming Results: Dataset 1 (bacteria data)
91.7% training accuracy (60 cases); 73.3% test accuracy (15 cases)
Splitting variables include: highest eigenvalue of Burden matrix weighted by atomic mass (split value 2.15); lowest eigenvalue of Burden matrix weighted by van der Waals volume; self-returning walk count; Cl attached to C2 (sp3); distance degree index; summed atomic weights of angular scattering function; R autocorrelation of lag 7 weighted by atomic mass
Leaves: Class 1 (12/12), Class 2 (5/6), Class 2 (7/8), Class 3 (8/8), Class 3 (6/7), Class 4 (7/7), Class 4 (5/6), Class 4 (5/6)

39 Decision Tree Using C5.0 for the Same Data (dataset 1, bacteria data)
88.3% training accuracy (60 cases); 60.0% test accuracy (15 cases)
Splitting variables include: valence connectivity index; Cl attached to C1 (sp2); H autocorrelation lag 5 weighted by atomic mass; summed atomic weights of angular scattering function; gravitational index
Leaves: Class 1 (13/14), Class 2 (11/12), Class 3 (5/6), Class 3 (7/7), Class 4 (3/6), Class 4 (14/15)

40 Evolutionary Programming Results: Dataset 2 (algae data; GP tree, generation 9)
Training: 92.2%; test: 81.3%
Splitting variables include: self-returning walk count; H autocorrelation of lag 2 weighted by Sanderson electronegativities; molecular multiple path count; solvation connectivity index; 2nd component symmetry directional WHIM index weighted by van der Waals volume
Leaves: Class 1 (16/16), Class 2 (14/15), Class 3 (6/8), Class 3 (9/10), Class 4 (6/7), Class 4 (8/8)

41 Decision Tree Using See5.0 for the Same Data (dataset 2, algae data)
Training: 90.6%; test: 75.0%
Splitting variables: Broto-Moreau autocorrelation of topological structure lag 4 weighted by atomic mass; total accessibility index weighted by van der Waals volume; max eigenvalue of Burden matrix weighted by van der Waals volume
Leaves: Class 1 (16/16), Class 2 (15/16), Class 3 (15/20), Class 4 (12/12)

42 Summary of Results
Data set 1 - Bacteria data:
GP method: tree size 8, training accuracy 91.7%, test accuracy 73.3%
C5.0: tree size 6, training accuracy 88.3%, test accuracy 60.0%
Data set 2 - Algae data:
GP method: tree size 6, training accuracy 92.2%, test accuracy 81.3%
C5.0: tree size 4, training accuracy 90.6%, test accuracy 75.0%

43 Comparison of Test Accuracy for See5.0 and GP Trees Having the Same Training Accuracy
Data Set 1 - Bacteria data:
GP (Generation 31): tree size 8, training accuracy 88.3%, test accuracy 73.3%
C5.0: tree size 6, training accuracy 88.3%, test accuracy 60.0%
Data Set 2 - Algae data:
GP (Generation 9): tree size 6, training accuracy 90.6%, test accuracy 87.5%
C5.0: tree size 4, training accuracy 90.6%, test accuracy 75.0%

44 Application to Wastewater Treatment Plant Data
Flow: inflow, screening, grit removal, primary settler (primary treatment), aeration tank, secondary settler (secondary treatment), outflow

45 Plant sections: input, pre-treatment (screws), primary treatment (primary settler), secondary treatment (aeration tanks, secondary settler), sludge line, output
Data corresponding to 527 days of operation, 38 variables

46 Decision tree for prediction of suspended solids in effluents (training data)
Total no. of observations = 470; training accuracy: 99.8%; test accuracy: 93.0%; leaf nodes = 20
Leaf classes: L = Low, N = Normal, H = High
Splitting variables include SS-P, DQO-D, DBO-D, RD-DBO-G, ZN-E, PH-D, RD-DQO-S, SSV-P and DBO-SS
SS-P: input SS to primary settler; DQO-D: input COD to secondary settler; DBO-D: input BOD to secondary settler; PH-D: input pH to secondary settler; SSV-P: input volatile SS to primary settler

47 Using all the data of 527 days
No. of observations = 527; accuracy = 99.25%; leaf nodes = 18
Leaf classes: L = Low, N = Normal, H = High
Splitting variables include DBO-E, SS-P, RD-DQO-S, RD-SS-G, DBO-D, SED-P, RD-SS-P, PH-D, PH-P and COND-S

48 Final Remarks
- An integrated data mining prototype system for toxicity prediction of chemicals and mixtures has been developed
- An evaluation of current inductive data mining approaches to toxicity prediction has been conducted
- A new methodology for inductive data mining based on a novel use of genetic programming is proposed, giving promising results in three case studies

49 On-going Work
1) Adaptive Discretization of End-point Values through Simultaneous Mutation of the Output
(Figure: best training accuracy in each generation for the trees grown for the algae data using the SSRD; the 2-class trees no longer dominate and very accurate 3-class trees have been found)
SSRD: sum of squared differences in rank

50 Future Work
2) Extend the Method to Model Trees & Fuzzy Model Tree Generation
Rule 1: if antecedent one applies, with degree μ1 = μ1,1 × μ1,2 × ... × μ1,9, then y1 = c1,0 + c1,1·PC1 + ... + c1,9·PC9 (a linear function of the principal component scores; coefficients omitted)
Rule 2: if antecedent two applies, with degree μ2 = μ2,1 × μ2,2 × ... × μ2,9, then y2 = c2,0 + c2,1·PC1 + ... + c2,9·PC9
Final output: crisp value (μ1·y1 + μ2·y2) / (μ1 + μ2), where μi = μi,1 × μi,2 × ... × μi,10
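
A minimal sketch of this weighted-average (Takagi-Sugeno style) rule combination; the membership degrees and consequent coefficients below are illustrative placeholders, since the fitted values do not survive in the transcript:

```python
import numpy as np

def fuzzy_model_tree_output(memberships, coefficients, pcs):
    """Weighted-average defuzzification: mu_i is the product of the
    per-antecedent membership degrees, y_i a linear function of the
    principal-component scores; output = sum(mu_i*y_i) / sum(mu_i)."""
    mus, ys = [], []
    for degrees, coefs in zip(memberships, coefficients):
        mus.append(np.prod(degrees))                   # mu_i = prod of mu_i,j
        ys.append(coefs[0] + np.dot(coefs[1:], pcs))   # y_i = c_i,0 + c_i · PC
    mus, ys = np.array(mus), np.array(ys)
    return float(np.dot(mus, ys) / mus.sum())          # crisp output

# Illustrative numbers only:
pcs = np.array([0.5, -1.2, 0.3])                       # PC scores
memberships = [[0.9, 0.7, 0.8], [0.2, 0.5, 0.6]]       # per-rule degrees
coefficients = [np.array([1.0, 0.4, -0.1, 0.2]),       # rule 1 consequent
                np.array([0.3, -0.2, 0.5, 0.1])]       # rule 2 consequent
print(fuzzy_model_tree_output(memberships, coefficients, pcs))
```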

51 Fuzzy Membership Functions Used in Rules

52 Future Work
3) Extend the Method to Mixture Toxicity Prediction
TRAINING: similar constituents / dissimilar constituents
TESTING: similar constituents / dissimilar constituents

53 Acknowledgements
Crystal Faraday Partnership on Green Technology
AstraZeneca Brixham Environmental Laboratory
NERC Centre for Ecology and Hydrology
F V Buontempo, M Mwense, A Young, D Osborn

54 Types of molecular descriptors (type: definition; examples)
Constitutional: physical description of the compound; e.g. molecular weight, atom counts
Topological: 2D descriptors taken from the molecular graph; e.g. Wiener index, Balaban index
Walk counts: obtained from molecular graphs; e.g. total walk count
Burden eigenvalues (BCUT): eigenvalues of the adjacency matrix, weighting the diagonals by atom weights, reflecting the topology of the whole compound; weighted by atomic mass, volume, electronegativity or polarizability
Galvez topological charge indices: describe charge transfer between pairs of atoms, calculated from the eigenvalues of the adjacency matrix; e.g. topological and mean charge index of various orders
2D autocorrelation: sum of the atom weights of the terminal atoms of all the paths of a given length (lag); e.g. Moreau, Moran, and Geary autocorrelations

55 Charge descriptors: charges estimated by quantum molecular methods; e.g. total positive charge, dipole index
Aromaticity indices: estimated from the geometrical distance between aromatically bonded atoms; e.g. harmonic oscillator model of aromaticity
Randic molecular profiles: derived from distance distribution moments of the geometry matrix; e.g. molecular profile, shape profile
Geometrical descriptors: conformation-dependent, based on molecular geometry; e.g. 3D Wiener index, gravitational index

56 Radial distribution function descriptors: obtained from radial basis functions centred at different distances; unweighted or weighted by atomic mass, volume, electronegativity or polarizability
3D Molecule Representation of Structure based on Electron diffraction (MoRSE): calculated by summing atomic weights viewed by different angular scattering functions
GEometry, Topology, and Atom Weights AssemblY (GETAWAY): calculated from the leverage matrix, representing the influence of each atom in determining the shape of the molecule, obtained from centred atomic coordinates

57 Weighted holistic invariant molecular (WHIM): statistical indices calculated from the atoms projected onto 3 principal components from a weighted covariance matrix of atomic coordinates; unweighted or weighted by atomic mass, volume, electronegativity, polarizability or electrotopological state
Functional groups: counts of various atoms and functional groups; e.g. primary carbons, aliphatic ethers
Atom-centred fragments: from 120 atom-centred fragments defined by Ghose-Crippen; e.g. Cl-086, Cl attached to C1 (sp3)
Various others: unsaturation index (number of non-single bonds); Hy (a function of the count of hydrophilic groups); aromaticity ratio (aromatic bonds / total number of bonds in an H-depleted molecule); Ghose-Crippen molecular refractivity; fragment-based polar surface area

