Prediction of Molecular Bioactivity for Drug Design Experiences from the KDD Cup 2001 competition Sunita Sarawagi, IITB

Sunita Sarawagi, IITB
http://www.it.iitb.ac.in/~sunita
Joint work with:
- B. Anuradha, IITB
- Anand Janakiraman, IITB
- Jayant Haritsa, IISc

The dataset
- Provided by DuPont Pharmaceuticals
- Activity of compounds binding to thrombin
- Library of compounds:
  - 1909 known molecules (42 actively binding thrombin)
  - 139,351 binary features describing the 3-D structure of each compound
  - 636 new compounds with unknown capacity to bind to thrombin

Sample data
0,1,0,0,0,0,… …,0,0,0,0,0,0,I
0,0,0,0,0,0,… …,0,0,0,0,0,1,I
0,0,0,0,0,0,… …,0,0,0,0,0,0,I
0,1,0,0,0,1,… …,0,1,0,0,0,1,A
0,1,0,0,0,1,… …,0,1,0,0,1,1,?
0,1,1,0,0,1,… …,0,1,1,0,0,1,?

Challenges
- Large number of binary features, far fewer training instances: about 140,000 features vs. about 2,000 instances
- Highly skewed classes: 1867 inactives, 42 actives
- Varying degrees of correlation among features
- Differences between the training and test distributions

Steps
- Familiarization with the data
  - Data has noise: four identical records (all 0s) with different labels
  - Many more 0s than 1s
  - Number of 1s significantly higher for actives (A) than for inactives (I)
- Feature selection
- Build classifiers
- Combine classifiers
- Incorporate unlabeled test instances

First step: feature selection
- Most commercial classifiers cannot handle 140,000 features even with 1 GB memory
- Entropy-based individual feature selection
  - Does not handle redundant attributes
- Step-wise feature selection
  - Too brittle
- Top entropy attribute with a "1" in each active compound
  - Exploits the small count of actives
- Want all important groups of redundant attributes
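Entropy-based individual feature selection can be sketched as ranking each binary feature by its information gain against the class label. This is a minimal illustration, not the code used for the competition; the function names and the toy data are assumptions. It also shows the weakness noted on the slide: a redundant copy of a good feature ranks just as high as the original.

```python
import math

def entropy(pos, neg):
    """Binary entropy (in bits) of a set with `pos` actives and `neg` inactives."""
    total = pos + neg
    if total == 0 or pos == 0 or neg == 0:
        return 0.0
    p = pos / total
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def information_gain(column, labels):
    """Reduction in label entropy from splitting on one binary feature."""
    n = len(labels)
    pos = sum(labels)
    gain = entropy(pos, n - pos)
    for value in (0, 1):
        subset = [y for x, y in zip(column, labels) if x == value]
        sub_pos = sum(subset)
        gain -= len(subset) / n * entropy(sub_pos, len(subset) - sub_pos)
    return gain

def rank_features(rows, labels, top_k):
    """Indices of the top_k features by individual information gain."""
    n_features = len(rows[0])
    gains = [(information_gain([r[j] for r in rows], labels), j)
             for j in range(n_features)]
    gains.sort(key=lambda g: (-g[0], g[1]))  # best gain first, ties by index
    return [j for _, j in gains[:top_k]]

# Toy data (hypothetical): label 1 = active, 0 = inactive.
# Feature 0 matches the label, feature 1 duplicates feature 0, feature 2 is constant.
rows = [[1, 1, 0], [1, 1, 0], [0, 0, 0], [0, 0, 0]]
labels = [1, 1, 0, 0]
top = rank_features(rows, labels, top_k=2)
print(top)
```

Both redundant features are selected here, which is why a per-feature ranking alone cannot tell groups of redundant attributes apart.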

Building classifiers
- Partition training data using stratified sampling
  - Two-thirds training data
  - One-third validation data
- Classification methods attempted
  - Decision tree classifiers
  - Naïve Bayes
  - SVMs
  - Hand-crafted clustering/nearest-neighbor hybrid
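A stratified two-thirds/one-third partition keeps the ratio of actives to inactives the same in both parts, which matters with only 42 actives. The sketch below is one possible way to do it with the standard library; the function name, the seed, and the rounding rule are assumptions.

```python
import random

def stratified_split(labels, train_fraction=2 / 3, seed=0):
    """Split record indices so each class keeps the same train/validation ratio."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train, validation = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        validation.extend(indices[cut:])
    return sorted(train), sorted(validation)

# Class counts from the dataset slide: 42 actives, 1867 inactives.
labels = ["A"] * 42 + ["I"] * 1867
train_idx, valid_idx = stratified_split(labels)
valid_actives = sum(1 for i in valid_idx if labels[i] == "A")
print(len(train_idx), len(valid_idx), valid_actives)
```

With these counts the validation part receives 14 of the 42 actives, so every error on an active moves the validation estimate noticeably.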

Decision Tree C4.5
[Figure: C4.5 decision tree. Tests: f25144 = 1, f80106 = 1, f26913 = 1, f135832 = 1, f137567 = 1, f88235 = 1. Leaves: I (338/6), A (2), A (3), A (4), A (5), A (10).]
Confusion matrix:
      A    I
  A   3    7
  I   1  459

Naïve Bayes
- Data characteristics very similar to text: lots of features, sparse data, few ones
- Naïve Bayes has been found very effective for text classification
- Accuracy: all actives misclassified!
Confusion matrix:
      A    I
  A   0   10
  I   1  459

Support vector machines
- Have received a lot of attention recently
- Require tuning: which kernel, what parameters?
- Several freely available packages: SVMTorch
- Accuracy: slightly worse than decision trees

Hand-crafted hybrid
- Find features such that actives cluster together under an appropriate distance measure
[Figure: records plotted over features f_i and f_j. Legend: training active, training inactive, test record.]

Incremental Feature Selection
- Pick features one by one that result in maximum clustering of the actives and maximum separation from the inactives
- Objective function: maximum separation between the centroids of the actives and inactives
- Distance function: matching ones
- Careful selection of training actives
- Accuracy: 100%, 493 features
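The greedy selection described above can be sketched as follows. The slide does not give the exact objective, so this sketch assumes squared Euclidean distance between the class centroids as the separation measure, and a nearest-neighbor rule on the count of matching ones for classification. All names and the toy records are hypothetical.

```python
def centroid(rows, features):
    """Per-feature fraction of ones over `rows`, restricted to `features`."""
    return [sum(r[f] for r in rows) / len(rows) for f in features]

def separation(actives, inactives, features):
    """Assumed objective: squared distance between the two class centroids."""
    ca = centroid(actives, features)
    ci = centroid(inactives, features)
    return sum((a - b) ** 2 for a, b in zip(ca, ci))

def greedy_select(actives, inactives, n_features, k):
    """Add one feature at a time, each time the one that separates best."""
    selected = []
    remaining = set(range(n_features))
    while remaining and len(selected) < k:
        best = max(sorted(remaining),
                   key=lambda f: separation(actives, inactives, selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

def matching_ones(x, y, features):
    """Similarity: number of selected features where both records have a 1."""
    return sum(1 for f in features if x[f] == 1 and y[f] == 1)

def classify(record, actives, inactives, features):
    """Label by the most similar training record; ties go to inactive."""
    best_a = max(matching_ones(record, a, features) for a in actives)
    best_i = max(matching_ones(record, i, features) for i in inactives)
    return "A" if best_a > best_i else "I"

# Toy records (hypothetical), four binary features each.
actives = [[1, 1, 0, 0], [1, 0, 0, 1], [1, 1, 0, 1]]
inactives = [[0, 0, 1, 0], [0, 0, 1, 1], [0, 0, 0, 0], [0, 1, 1, 0]]
selected = greedy_select(actives, inactives, n_features=4, k=2)
print(selected, classify([1, 0, 0, 1], actives, inactives, selected))
```

Because the assumed objective is additive over features, the greedy loop here gives the same result as ranking features once; an objective that scores the subset as a whole would make the one-by-one order matter.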

Final approach
- Test data: significantly denser
  - Methods like SVM, NB, clustering-based will not generalize
  - Preferred a distribution-independent method
- Ensemble of decision trees
  - On disjoint attributes (unconventional)
- Semi-supervised training
  - Introduce feedback from the test data in multiple rounds

Building tree ensembles
- Initially picked ~20,000 features based on entropy
- More than one tree, to take care of the large feature space
- Build a tree, remove the features it used, and build the next tree on the remaining features
- Repeat while accuracy on validation data does not drop
- All groups of redundant features exploited
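The loop of building a tree, removing the features it used, and building the next tree can be sketched with a very small entropy-based tree learner. This is a stand-in for C4.5, not a reimplementation of it; the stopping rule (stop when a new tree scores lower on validation data than the previous one) and all names are assumptions.

```python
import math

def entropy(labels):
    """Binary entropy (bits) of a list of 0/1 labels."""
    p = sum(labels) / len(labels)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def build_tree(rows, labels, features, used):
    """Grow a tree on `features`; record every tested feature in `used`."""
    active_fraction = sum(labels) / len(labels)
    # Only features that take both values here can split the data.
    splitting = [f for f in features if 0 < sum(r[f] for r in rows) < len(rows)]
    if active_fraction in (0.0, 1.0) or not splitting:
        return active_fraction  # leaf: fraction of actives

    def split_entropy(f):
        left = [y for r, y in zip(rows, labels) if r[f] == 0]
        right = [y for r, y in zip(rows, labels) if r[f] == 1]
        return (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)

    best = min(splitting, key=split_entropy)
    if split_entropy(best) >= entropy(labels):
        return active_fraction  # no feature helps: stop
    used.add(best)
    rest = [f for f in features if f != best]
    branches = []
    for value in (0, 1):
        part = [(r, y) for r, y in zip(rows, labels) if r[best] == value]
        branches.append(build_tree([r for r, _ in part], [y for _, y in part], rest, used))
    return (best, branches[0], branches[1])

def predict(tree, row):
    """Follow the tests down to a leaf and return its fraction of actives."""
    while isinstance(tree, tuple):
        feature, if_zero, if_one = tree
        tree = if_one if row[feature] == 1 else if_zero
    return tree

def accuracy(tree, rows, labels):
    hits = sum((predict(tree, r) >= 0.5) == (y == 1) for r, y in zip(rows, labels))
    return hits / len(labels)

def build_disjoint_ensemble(rows, labels, valid_rows, valid_labels, features, max_trees=10):
    """Each tree is built only on features no earlier tree has used."""
    remaining = list(features)
    trees, previous = [], None
    while remaining and len(trees) < max_trees:
        used = set()
        tree = build_tree(rows, labels, remaining, used)
        score = accuracy(tree, valid_rows, valid_labels)
        if not used or (previous is not None and score < previous):
            break
        trees.append((tree, sorted(used)))
        previous = score
        remaining = [f for f in remaining if f not in used]
    return trees

# Toy data (hypothetical): features 0 and 1 are redundant copies of the label.
rows = [[1, 1, 0], [1, 1, 1], [0, 0, 0], [0, 0, 1]]
labels = [1, 1, 0, 0]
trees = build_disjoint_ensemble(rows, labels, rows, labels, features=[0, 1, 2])
print([used for _, used in trees])
```

On the toy data the first tree uses feature 0 and the second uses its redundant copy, feature 1, so both members of the redundant group end up in the ensemble.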

Incorporating unlabeled instances
- Augment training data with "sure" test instances
- Re-train another ensemble of trees using the same method
- Include more unlabeled instances with sure predictions
- Repeat a few more times...
- How to capture drift?
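The rounds described above follow the general shape of a self-training loop. The sketch below assumes hypothetical `fit` and `score` callables and fixed `low`/`high` cut-offs for what counts as a "sure" prediction; the toy model (mean number of ones per class) merely stands in for the tree ensemble.

```python
def self_train(train, unlabeled, fit, score, low, high, max_rounds=3):
    """train: list of (row, label) pairs; unlabeled: list of rows.
    Rows scored at or above `high` are added as actives (1),
    rows at or below `low` as inactives (0); the rest stay pending."""
    labeled = list(train)
    pending = list(unlabeled)
    for _ in range(max_rounds):
        model = fit(labeled)
        sure, unsure = [], []
        for row in pending:
            s = score(model, row)
            if s >= high:
                sure.append((row, 1))
            elif s <= low:
                sure.append((row, 0))
            else:
                unsure.append(row)
        if not sure:
            break  # nothing more can be added
        labeled.extend(sure)
        pending = unsure
    return fit(labeled), labeled, pending

def fit(labeled):
    """Toy model (hypothetical): mean number of ones per class."""
    ones_active = [sum(row) for row, y in labeled if y == 1]
    ones_inactive = [sum(row) for row, y in labeled if y == 0]
    return (sum(ones_active) / len(ones_active),
            sum(ones_inactive) / len(ones_inactive))

def score(model, row):
    """Position of the row's count of ones between the class means, clipped to [0, 1]."""
    mean_active, mean_inactive = model
    if mean_active == mean_inactive:
        return 0.5
    raw = (sum(row) - mean_inactive) / (mean_active - mean_inactive)
    return min(1.0, max(0.0, raw))

train = [([1, 1, 1, 1], 1), ([0, 0, 0, 0], 0), ([1, 0, 0, 0], 0)]
unlabeled = [[1, 1, 1, 0], [0, 0, 0, 0], [1, 1, 0, 0]]
model, labeled, pending = self_train(train, unlabeled, fit, score, low=0.2, high=0.7)
print(len(labeled), pending)
```

In the toy run two test rows are absorbed in the first round and one stays undecided, mirroring the situation where some test records remain unassigned after the last round.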

Capturing drift
- Solution: validate with independent data
- Be sure to include only correctly labeled data
- First approach: same prediction by all trees
  - On validation data, found errors in this scheme
  - Pruning not a solution
- Weighted prediction by each tree
  - Weight: fraction of actives
  - Pick the right threshold using validation data
- Stop when no more unlabeled data can be added
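One way to read the weighted scheme: each tree votes with the fraction of actives in the leaf a record reaches, the votes are averaged, and the cut-off for a "sure" active is the lowest score at which the validation data shows no mistakes. The sketch below follows that reading; the averaging, the threshold rule, and the vote values are assumptions, not details from the slides.

```python
def ensemble_score(leaf_fractions):
    """Average of the per-tree votes (fraction of actives in the reached leaf)."""
    return sum(leaf_fractions) / len(leaf_fractions)

def pick_sure_threshold(scores, labels):
    """Lowest cut-off such that every validation record at or above it is active.
    Returns None if even the top-scoring record is inactive."""
    best = None
    for t in sorted(set(scores), reverse=True):
        if all(y == 1 for s, y in zip(scores, labels) if s >= t):
            best = t
        else:
            break
    return best

# Hypothetical votes of four trees for five validation records.
votes = [
    [1.0, 1.0, 0.75, 0.75],
    [1.0, 0.5, 0.75, 0.25],
    [0.75, 0.5, 0.5, 0.25],
    [0.0, 0.25, 0.0, 0.25],
    [0.0, 0.0, 0.0, 0.0],
]
labels = [1, 1, 0, 0, 0]  # 1 = active, 0 = inactive
scores = [ensemble_score(v) for v in votes]
threshold = pick_sure_threshold(scores, labels)
print(scores, threshold)
```

The third record shows why the threshold is tuned on validation data: three of its four trees lean towards active, yet it is inactive, and the chosen cut-off keeps it out of the "sure" set.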

Final state
- Three rounds, each with about 6 trees
- Unlabeled data included: 126 actives and 311 inactives
- Remaining 200 in confusion
- Use a meta-learner on validation data to pick the final criterion
  - Sum of scores times number of trees claiming active
- Several other last-minute hacks

Outcome
- Winning entry
  - Weighted: 68.4%
  - Accuracy: 70.03%
[Figure: results chart with a "Home Team" label]
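The two figures differ because of the class skew. The sketch below assumes that "weighted" means the average of the per-class accuracies (accuracy on actives and accuracy on inactives, equally weighted); the slides do not spell out the definition, and the toy labels are hypothetical.

```python
def plain_accuracy(actual, predicted):
    """Fraction of records labeled correctly."""
    return sum(1 for a, p in zip(actual, predicted) if a == p) / len(actual)

def weighted_accuracy(actual, predicted, classes=("A", "I")):
    """Mean of per-class accuracies, so each class counts equally."""
    per_class = []
    for cls in classes:
        idx = [i for i, y in enumerate(actual) if y == cls]
        correct = sum(1 for i in idx if predicted[i] == cls)
        per_class.append(correct / len(idx))
    return sum(per_class) / len(per_class)

# A classifier that calls everything inactive looks fine on plain accuracy only.
actual = ["A"] * 2 + ["I"] * 8
predicted = ["I"] * 10
print(plain_accuracy(actual, predicted), weighted_accuracy(actual, predicted))
```

Under this definition, missing every active costs half of the score no matter how rare actives are.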

Winner’s method
- Pre-processing: feature subset selection using mutual information (200 of 139,351 features)
- Learning Bayesian network models of different complexity (2 to 12 features)
- Choosing a model (ROC area, model complexity)

Postmortem: Was all this necessary?
- Without semi-supervised learning:
  - Single decision tree: 49%
  - 6-tree ensemble on training data alone:
    - Majority: 57%
    - Confidence weighted: 63%
- With unlabeled data: 64.3%

Lessons learnt
- Products: need tools that scale in the number of features
- Research problems:
  - Classifiers that are not tied to distribution similarity with the training data
  - A more principled way of including unlabeled instances
