
1 Active Learning Strategies for Compound Screening. Megon Walker¹ and Simon Kasif¹,². ¹Bioinformatics Program, Boston University; ²Department of Biomedical Engineering, Boston University. 229th ACS National Meeting, March 13-17, 2005, San Diego, CA.

2 Outline
- Introduction to active learning for compound screening
- Objectives and performance criteria
- Algorithms and procedures
- Thrombin dataset results
- Preliminary conclusions

3 Introduction: drug discovery
- Drug discovery is an iterative process.
- Goal: identify many target-binding compounds with minimal screening iterations.
[Figure: binary compounds-by-features matrix; iterative cycle of compounds, descriptors, selection, and screening]

4 Introduction: supervised learning
- Input: a data set with positive and negative examples.
- Output: a classifier that assigns +1 to each positive example and -1 to each negative example.
- Standard learning: the classifier trains on a static training set (train, then test).
- Active learning: the classifier chooses data points for its training set, "requesting" their labels over iterative rounds of training and testing (see the loop sketch below).
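The slide describes active learning only in prose; the following is a minimal Python sketch of such a loop. The helper names `select_batch` and `train_classifier`, the batch size, and the number of rounds are hypothetical placeholders, not the authors' implementation.

```python
# Minimal active-learning loop: start from a random batch, train on labeled
# compounds, then let the classifier "request" the labels of the next batch.
import numpy as np

def active_learning(X, y_oracle, select_batch, train_classifier,
                    batch_size=20, n_rounds=10, rng=None):
    rng = rng or np.random.default_rng(0)
    n = len(y_oracle)
    labeled = list(rng.choice(n, size=batch_size, replace=False))  # 1st batch chosen at random
    clf = None
    for _ in range(n_rounds):
        clf = train_classifier(X[labeled], y_oracle[labeled])      # train on labeled compounds only
        unlabeled = sorted(set(range(n)) - set(labeled))
        if not unlabeled:
            break
        query = select_batch(clf, X, unlabeled, batch_size)        # classifier "requests" these labels
        labeled.extend(query)
    return clf, labeled
```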

5 Introduction: active learning & compound screening
- Mamitsuka et al. Proceedings of the Fifteenth International Conference on Machine Learning, 1998: 1-9.
- Warmuth et al. J. Chem. Inf. Comput. Sci. 2003, 43: 667-673.
[Figure: 1st and 2nd query rounds; in each round, classifiers #1 and #2 are trained on the compounds labeled active (A) or inactive (I) so far, the remaining compounds stay unlabeled, and predictions are made on the test compounds]

6 Objectives
- Exploitation: hit performance; enrichment factor (EF).
- Exploration: an accurate model of activity; sensitivity.

7 Methods: datasets
- 632 DuPont thrombin-targeting compounds: 149 actives, 483 inactives.
- A binary feature vector for each compound: shape-based and pharmacophore features, 139,351 features in total.
- Retrospective data.
- 200 features selected by mutual information (MI) with respect to the activity labels; mean MI = 0.126 (see the sketch after this list).
References:
1. Warmuth et al. J. Chem. Inf. Comput. Sci. 2003, 43(2): 667-673.
2. Eksterowicz et al. J. Mol. Graph. Model. 2002, 20(6): 469-477.
3. Putta et al. J. Chem. Inf. Comput. Sci. 2002, 42(5): 1230-1240.
4. KDD Cup 2001. http://www.cs.wisc.edu/~dpage/kddcup2001/
[Workflow diagram: Start → input data files → pick training and testing data for the next round of cross validation → select the 1st batch randomly or by a chemist; later batches by sample selection (P(active), uncertainty, density) → query training-set batch labels → train a classifier committee on labeled training-set subsamples → predict compound labels by committee weighted majority vote → repeat until all training-set labels are queried and cross validation is complete → accuracy and performance statistics → End]
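As a rough illustration of the MI-based feature selection mentioned above, here is a short sketch. The use of scikit-learn's `mutual_info_score`, the variable names, and the random example data are assumptions; the authors' actual selection code is not shown in the slides.

```python
# Rank binary features by mutual information with the activity labels and keep the top k.
import numpy as np
from sklearn.metrics import mutual_info_score

def select_features_by_mi(X, y, k=200):
    """X: (n_compounds, n_features) binary matrix; y: activity labels (1 = active, 0 = inactive)."""
    mi = np.array([mutual_info_score(X[:, j], y) for j in range(X.shape[1])])
    top = np.argsort(mi)[::-1][:k]          # indices of the k most informative features
    return top, mi[top]

# Illustrative example with random binary data:
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(632, 1000))
y = rng.integers(0, 2, size=632)
idx, scores = select_features_by_mi(X, y, k=200)
print(idx[:5], scores.mean())
```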

8 Methods: cross validation
- 5x cross validation.
[Figure: fold layout; in each round four folds of compounds are used for training and one for testing, rotating across rounds. Workflow diagram repeated from slide 7.]
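A minimal sketch of how a 5-fold split could be set up; the use of scikit-learn's `KFold`, the shuffling, and the seed are assumptions rather than part of the original procedure.

```python
# Rotate through five folds: four for (active-learning) training, one for testing.
import numpy as np
from sklearn.model_selection import KFold

n_compounds = 632
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for round_no, (train_idx, test_idx) in enumerate(kf.split(np.arange(n_compounds)), start=1):
    print(f"round {round_no}: {len(train_idx)} training compounds, {len(test_idx)} testing compounds")
```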

9 Methods: perceptron
- Given: a binary input vector x, a weight vector w, a threshold value T, a learning rate n, and a classification t ∈ {+1, -1}.
- TEST: predict +1 (active) if w · x > T, otherwise -1 (inactive).
- TRAIN: if classified correctly, do nothing; if misclassified, update w ← w + n · t · x.
[Workflow diagram repeated from slide 7]
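A small Python sketch of the perceptron rules listed on the slide. The class and parameter defaults are illustrative; the slide's exact formulas were images, so the standard perceptron test and update rules are assumed here.

```python
import numpy as np

class Perceptron:
    def __init__(self, n_features, T=0.0, n=1.0):
        self.w = np.zeros(n_features)   # weight vector
        self.T = T                      # threshold value
        self.n = n                      # learning rate

    def test(self, x):
        # Predict +1 (active) if the weighted sum exceeds the threshold, else -1.
        x = np.asarray(x, dtype=float)
        return 1 if self.w @ x > self.T else -1

    def train(self, x, t):
        # If classified correctly, do nothing; if misclassified, move the weights
        # toward the correct label: w <- w + n * t * x.
        x = np.asarray(x, dtype=float)
        if self.test(x) != t:
            self.w += self.n * t * x
```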

10 Methods: classifier committees
- Bagging: uniform sampling distribution.
- Boosting: compounds misclassified by classifier #1 are more likely to be resampled for classifier #2 (see the sketch after this list).
[Figure: classifiers #1 and #2 trained on different subsamples of the labeled (A/I) compounds; remaining compounds are not labeled; predictions are made on the test compounds. Workflow diagram repeated from slide 7.]
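A schematic sketch of building a committee by resampling the labeled training set, illustrating the bagging and boosting idea on the slide. It assumes scikit-learn-style members with `fit`/`predict` and a hypothetical `make_classifier` factory; the boosting weight update (doubling the weight of misclassified compounds) is a simplification, not the authors' exact scheme.

```python
import numpy as np

def build_committee(X, y, make_classifier, n_members=10, boost=False, rng=None):
    rng = rng or np.random.default_rng(0)
    n = len(y)
    weights = np.ones(n) / n            # bagging: uniform sampling distribution
    committee = []
    for _ in range(n_members):
        idx = rng.choice(n, size=n, replace=True, p=weights)   # resample the labeled set
        clf = make_classifier()
        clf.fit(X[idx], y[idx])
        committee.append(clf)
        if boost:
            # Boosting-like step: misclassified compounds become more likely to be resampled.
            wrong = clf.predict(X) != y
            weights[wrong] *= 2.0
            weights /= weights.sum()
    return committee
```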

11 Methods: weighted voting
- A weighted vote of all classifiers predicts each compound's activity label.
[Workflow diagram repeated from slide 7]
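A minimal sketch of the weighted-majority vote. How the per-classifier weights are derived is not stated on the slide, so they are taken as given inputs here; committee members are assumed to expose a `predict` method returning +1/-1 labels.

```python
import numpy as np

def weighted_vote(committee, weights, X):
    # Each member votes +1/-1 per compound; votes are weighted, summed, and thresholded.
    votes = np.array([w * clf.predict(X) for clf, w in zip(committee, weights)])
    return np.where(votes.sum(axis=0) > 0, 1, -1)
```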

12 Methods: sample selection strategies
- P(active): select the compounds predicted active with the highest probability by the committee.
- Uncertainty: select the compounds on which the committee disagrees most strongly.
- Density with respect to actives: select the compounds most similar to previously labeled or predicted actives, using the Tanimoto similarity metric. Given compound bitstrings A and B, with a = number of bits on in A, b = number of bits on in B, and c = number of bits on in both A and B, the Tanimoto similarity is c / (a + b - c) (see the sketch after this list).
[Workflow diagram repeated from slide 7]
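A short sketch of the Tanimoto similarity between two binary compound fingerprints, matching the a/b/c definitions above; the example values are illustrative only.

```python
import numpy as np

def tanimoto(A, B):
    A, B = np.asarray(A, dtype=bool), np.asarray(B, dtype=bool)
    a, b = A.sum(), B.sum()            # bits on in A, bits on in B
    c = np.logical_and(A, B).sum()     # bits on in both A and B
    denom = a + b - c
    return c / denom if denom > 0 else 0.0

# Example: a = 3, b = 3, c = 2 -> 2 / 4 = 0.5
print(tanimoto([0, 1, 1, 1], [1, 0, 1, 1]))
```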

13 Methods: performance criteria
- Hit performance
- Enrichment factor (EF)
- Sensitivity
[Workflow diagram repeated from slide 7]
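The slide's formulas are not reproduced in the transcript; the sketch below uses the standard definitions of enrichment factor and sensitivity, which are assumed (not confirmed) to match the authors' usage. Hit performance is typically the cumulative count of actives found as screening proceeds.

```python
# Standard definitions (assumed): EF = (hits / screened) / (total actives / total compounds),
# sensitivity = TP / (TP + FN).
def enrichment_factor(hits, screened, total_actives, total_compounds):
    return (hits / screened) / (total_actives / total_compounds)

def sensitivity(tp, fn):
    return tp / (tp + fn)

# Illustrative example: 50 actives found after screening 100 of the 632 thrombin
# compounds (149 actives in total) gives EF of about 2.12.
print(enrichment_factor(50, 100, 149, 632), sensitivity(50, 99))
```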

14 Results: hit performance

15 Results: sensitivity
- Uncertainty sampling: highest testing-set sensitivity initially.
- No significant increase in testing-set sensitivity.

16 Results: bagging vs. boosting
- Boosting: training-set true positives climb faster and converge higher.
- Boosting overfits to the training data.

17 Results: # classifiers

18 Conclusions
- Sample selection
- Bagging vs. boosting
- Committee vs. single classifier
- Testing-set sensitivity
- Trade-off between exploration and exploitation

