A Study on Feature Selection for Toxicity Prediction*


1 A Study on Feature Selection for Toxicity Prediction*
Gongde Guo (1), Daniel Neagu (1) and Mark Cronin (2)
(1) Department of Computing, University of Bradford
(2) School of Pharmacy and Chemistry, Liverpool John Moores University
* EPSRC Project: PYTHIA – Predictive Toxicology Knowledge Representation and Processing Tool based on a Hybrid Intelligent Systems Approach, Grant Reference: GR/T02508/01

2 Outline of Presentation
Predictive Toxicology
Feature Selection Methods
Relief Family: Relief, ReliefF
kNNMFS Feature Selection
Evaluation Criteria
Toxicity Dataset: Phenols
Evaluation I: Toxicity
Evaluation II: Mechanism of Action
Conclusions

3 Predictive Toxicology
The goal of predictive toxicology is to describe the relations between the chemical structure of a molecule and biological and toxicological processes (Structure-Activity Relationships, SAR), and to use these relations to predict the behaviour of new, untested chemical compounds. Predictive toxicology data mining comprises data preparation; data reduction (including feature selection); data modelling; prediction (classification, regression); and evaluation of the results, followed by further knowledge discovery tasks.
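As an illustration only (not from the original slides), these steps can be strung together with scikit-learn; the descriptor matrix X and toxicity vector y below are placeholders:

    # A minimal sketch of the predictive-toxicology data mining pipeline,
    # assuming scikit-learn; X and y are placeholder data, not the study's.
    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.feature_selection import SelectKBest, f_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    X = np.random.rand(250, 173)   # placeholder: 250 compounds x 173 descriptors
    y = np.random.rand(250)        # placeholder: toxicity endpoint

    pipeline = Pipeline([
        ("scale", StandardScaler()),                  # data preparation
        ("select", SelectKBest(f_regression, k=35)),  # data reduction / feature selection
        ("model", LinearRegression()),                # data modelling and prediction
    ])
    scores = cross_val_score(pipeline, X, y, cv=10, scoring="r2")  # evaluation
    print(scores.mean())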

4 Feature Selection Methods
Feature selection is the process of identifying and removing as much of the irrelevant and redundant information as possible. Seven feature selection methods (Witten and Frank, 2000) are involved in our study (rough analogues of some of them are sketched below):
GR – Gain Ratio feature evaluator;
IG – Information Gain ranking filter;
Chi – Chi-squared ranking filter;
ReliefF – ReliefF feature selector;
SVM – SVM feature evaluator;
CS – Consistency Subset evaluator;
CFS – Correlation-based Feature Selection.
In this work, however, we focus on the drawbacks of the ReliefF method and propose the kNNMFS feature selection method.
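The evaluators above come from the Weka toolkit; as a rough sketch (an assumption for illustration, not the study's actual setup), the IG and Chi rankers have scikit-learn analogues:

    # Rough scikit-learn analogues of two rankers above (an assumption, not
    # the Weka setup used in the study): information gain ~ mutual
    # information; chi-squared requires non-negative inputs.
    import numpy as np
    from sklearn.feature_selection import mutual_info_classif, chi2
    from sklearn.preprocessing import MinMaxScaler

    X = np.random.rand(250, 173)            # placeholder descriptors
    y = np.random.randint(0, 4, size=250)   # placeholder mechanism-of-action labels

    ig_scores = mutual_info_classif(X, y)                     # ~ IG ranking filter
    chi_scores, _ = chi2(MinMaxScaler().fit_transform(X), y)  # Chi ranking filter
    top20 = np.argsort(ig_scores)[::-1][:20]                  # top-20 descriptor indices
    print(top20)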

5 Relief Feature Selection Method
The Relief algorithm works by randomly sampling an instance and locating its nearest neighbour from the same class (the nearest hit) and from the opposite class (the nearest miss). The feature values of these neighbours are compared to those of the sampled instance and used to update the relevance score of each feature. Two open issues, illustrated in the slide's diagram: with K=1, a single noisy hit or miss can distort the scores, and it is unclear how to choose the m sampled instances.

6 Relief Feature Selection Method
Algorithm Relief
Input: for each training instance, a vector of attribute values and the class value
Output: the vector W of estimations of the qualities of attributes

set all weights W[A_i] := 0.0, i = 1, 2, ..., p;
for j := 1 to m do begin
    randomly select an instance X_j;
    find its nearest hit H_j and nearest miss M_j;
    for k := 1 to p do
        W[A_k] := W[A_k] - diff(A_k, X_j, H_j)/m + diff(A_k, X_j, M_j)/m;
end;
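A direct Python rendering of this pseudocode, assuming numeric features scaled to [0, 1] so that diff() reduces to a plain absolute difference:

    # Relief as pseudocoded above; features assumed scaled to [0, 1].
    import numpy as np

    def relief(X, y, m):
        n, p = X.shape
        w = np.zeros(p)                                     # W[A_i] := 0.0
        rng = np.random.default_rng(0)
        for _ in range(m):
            j = rng.integers(n)                             # randomly select X_j
            d = np.abs(X - X[j]).sum(axis=1)                # Manhattan distances
            d[j] = np.inf                                   # exclude X_j itself
            hit = np.where(y == y[j], d, np.inf).argmin()   # nearest hit H_j
            miss = np.where(y != y[j], d, np.inf).argmin()  # nearest miss M_j
            w += (np.abs(X[j] - X[miss]) - np.abs(X[j] - X[hit])) / m
        return w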

7 ReliefF Feature Selection Method
ReliefF extends Relief by using the K nearest hits and misses (K=3 in the slide's diagram) instead of a single neighbour, averaging their contributions so that one noisy instance no longer dominates the weights. Open issues remain: how should K be chosen, and how should the m sampled instances be chosen?

8 ReliefF Feature Selection Method
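The slide presented the ReliefF pseudocode as a figure; a minimal sketch of the standard formulation (k nearest hits, plus k nearest misses per other class weighted by the class priors) follows:

    # A minimal ReliefF sketch (standard formulation, reconstructed here;
    # the slide's figure is not available). Features assumed scaled to
    # [0, 1]; each class assumed to have at least k+1 members.
    import numpy as np

    def relieff(X, y, m, k):
        n, p = X.shape
        w = np.zeros(p)
        classes, counts = np.unique(y, return_counts=True)
        prior = dict(zip(classes, counts / n))
        rng = np.random.default_rng(0)
        for _ in range(m):
            j = rng.integers(n)
            d = np.abs(X - X[j]).sum(axis=1)
            d[j] = np.inf
            hits = np.argsort(np.where(y == y[j], d, np.inf))[:k]   # k nearest hits
            w -= np.abs(X[j] - X[hits]).mean(axis=0) / m
            for c in classes:                                       # k nearest misses per class
                if c == y[j]:
                    continue
                misses = np.argsort(np.where(y == c, d, np.inf))[:k]
                pc = prior[c] / (1.0 - prior[y[j]])                 # prior weighting
                w += pc * np.abs(X[j] - X[misses]).mean(axis=0) / m
        return w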

9 kNN Model-based Classification Method (Guo et al, 2003)
The basic idea of the kNN model-based classification method is to find a set of more meaningful representatives of the complete dataset to serve as the basis for further classification. kNNModel generates a set of optimal representatives by learning inductively from the dataset.

10 An Example of kNNModel
Each representative d_i is represented in the form <Cls(d_i), Sim(d_i), Num(d_i), Rep(d_i)>, where the fields are, respectively:
Cls(d_i) – the class label of d_i;
Sim(d_i) – the similarity of d_i to the furthest instance among the instances covered by its neighbourhood N_i;
Num(d_i) – the number of instances covered by N_i;
Rep(d_i) – a representation of the instance d_i.
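A minimal sketch of one such representative as a Python structure; the field names simply mirror the tuple above:

    # One kNNModel representative as described on this slide;
    # the example values are placeholders.
    from typing import NamedTuple
    import numpy as np

    class Representative(NamedTuple):
        cls: int          # Cls(d_i): class label of d_i
        sim: float        # Sim(d_i): similarity to the furthest covered instance
        num: int          # Num(d_i): number of instances covered by N_i
        rep: np.ndarray   # Rep(d_i): the descriptor vector of d_i

    r = Representative(cls=1, sim=0.82, num=17, rep=np.zeros(173))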

11 kNNMFS: kNN Model-based Feature Selection
kNNMFS takes the output of kNNModel as seeds for further feature selection. Given a new instance, kNNMFS finds the nearest representative of each class and then directly uses the inductive information of each representative generated by kNNModel for the feature weight calculation. Unlike in ReliefF, the k in our algorithm is not fixed: its value depends on the number of instances covered by each nearest representative used in the weight calculation. Similarly, the M in kNNMFS is the number of representatives output by kNNModel.

12 kNNMFS Feature Selection Method
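The slide presented the kNNMFS pseudocode as a figure; the following is a hedged sketch inferred from the description on the previous slide, not the authors' exact algorithm:

    # A hedged sketch of the kNNMFS weight update, inferred from the
    # preceding description (not the authors' pseudocode): each kNNModel
    # representative is a seed, num (a representative's coverage) takes
    # the role of ReliefF's fixed k, and M is the number of representatives.
    from typing import NamedTuple
    import numpy as np

    class Representative(NamedTuple):   # as sketched for slide 10
        cls: int
        sim: float
        num: int
        rep: np.ndarray

    def knnmfs(reps, n_features):
        w = np.zeros(n_features)
        M = len(reps)                   # M = number of representatives (seeds)
        for r in reps:
            dist = lambda s: np.abs(r.rep - s.rep).sum()
            hits = [s for s in reps if s.cls == r.cls and s is not r]
            misses = [s for s in reps if s.cls != r.cls]
            if not hits or not misses:
                continue
            h, ms = min(hits, key=dist), min(misses, key=dist)
            # coverage-weighted Relief-style update: representatives that
            # cover more instances contribute more to the feature weights
            w += (ms.num * np.abs(r.rep - ms.rep)
                  - h.num * np.abs(r.rep - h.rep)) / M
        return w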

13 Toxicity Dataset: Phenols
The phenols dataset was collected from the TETRATOX database (Schultz, 1997) and contains 250 compounds. A total of 173 descriptors were calculated for each compound using different software tools, e.g. ACD/Labs, Chem-X and TSAR. These descriptors represent the physico-chemical, structural and topological properties relevant to toxicity. Some features are irrelevant to, or correlate poorly with, the class label (the slide showed scatter plots of the descriptors CX-EMP20 and TS_QuadXX against toxicity as examples).

14 Evaluation Measures for Continuous Class Value Prediction
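The slide's formulas appeared as a figure; the measures reported in Table 1 are the standard ones for numeric prediction in Weka (Witten and Frank, 2000), sketched here with p the predicted and a the actual values:

    # Standard numeric-prediction measures (as reported by Weka),
    # reconstructed here; p = predicted values, a = actual values.
    import numpy as np

    def evaluate(p, a):
        cc = np.corrcoef(p, a)[0, 1]              # correlation coefficient (CC)
        mae = np.abs(p - a).mean()                # mean absolute error (MAE)
        rmse = np.sqrt(((p - a) ** 2).mean())     # root mean squared error (RMSE)
        rae = np.abs(p - a).sum() / np.abs(a - a.mean()).sum()   # relative absolute error (RAE)
        rrse = np.sqrt(((p - a) ** 2).sum()
                       / ((a - a.mean()) ** 2).sum())            # root relative squared error (RRSE)
        return cc, mae, rmse, rae, rrse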

15 Evaluation Using Linear Regression
Endpoint I: Toxicity
Table 1. Performance of the linear regression algorithm on different phenols subsets

FSM       NSF   CC       MAE      RMSE     RAE   RRSE
Phenols   173   0.8039   0.3993   0.5427   —     —
MostU     12    0.7543   0.4088   0.5454   —     —
GR        20    0.7722   0.4083   0.5291   —     —
IG        —     0.7662   0.3942   0.5325   —     —
Chi       —     0.7570   0.4065   0.5439   —     —
ReliefF   —     0.8353   0.3455   0.4568   —     —
SVM       —     0.8239   0.3564   0.4697   —     —
CS        13    0.7702   0.3982   0.5292   —     —
CFS       7     0.8049   0.3681   0.4908   —     —
kNNMFS    35    0.8627   0.3150   0.4226   —     —
(FSM = feature selection method; NSF = number of selected features; dashes mark cells whose values were lost in the transcript.)

16 Endpoint II: Mechanism of Action
Table 2. Performance of the wkNN algorithm (k=5, 10-fold cross-validation) on different phenols subsets

FSM       NSF   Avg. accuracy (%)   Variance   Deviation
GR        20    89.32               1.70       1.31
IG        —     89.08               1.21       1.10
Chi       —     88.68               0.50       0.71
ReliefF   —     91.40               1.32       1.15
SVM       —     91.80               0.40       0.63
CS        13    89.40               0.76       0.87
CFS       7     80.76               1.26       1.12
kNNMFS    35    93.24               0.44       0.67
Phenols   173   86.24               0.43       0.66

17 Conclusion and Future Research Directions
Using a kNN model as the seed selector chooses a set of more meaningful representatives to replace the original data for feature selection.
kNNMFS applies a more reasonable difference-function calculation, based on the inductive information carried by each representative obtained from kNNModel.
kNNMFS obtains better performance on the subsets of the phenols dataset, for both endpoints.
Future work: investigate the effectiveness of choosing boundary data or cluster-centre data as seeds for kNNMFS, and carry out more comprehensive experiments on benchmark datasets.

18 References
Witten, I.H. and Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco (2000)
Guo, G., Wang, H., Bell, D. et al.: kNN Model-based Approach in Classification. In: Proc. of CoopIS/DOA/ODBASE 2003, LNCS 2888, Springer-Verlag (2003)
Schultz, T.W.: TETRATOX: The Tetrahymena pyriformis Population Growth Impairment Endpoint – A Surrogate for Fish Lethality. Toxicol. Methods 7 (1997)

19 Thank you very much!

