Classifying the WADA 2005 Prohibited List Using the CDK Fingerprint Ed Cannon & John Mitchell www-mitchell.ch.cam.ac.uk/


1 Classifying the WADA 2005 Prohibited List Using the CDK Fingerprint Ed Cannon & John Mitchell www-mitchell.ch.cam.ac.uk/

2 Classifying the WADA Prohibited List Aims & Background. Methods. Data. Results. Conclusions.

3 Aims & Background

4 Much drug abuse in sport involves novel compounds such as the designer steroid tetrahydrogestrinone (THG).

5 Aims & Background Hence the World Anti-Doping Agency (WADA) prohibits classes of bioactivity as well as specific molecules. Analogues are prohibited under the "similar chemical structure or similar biological effect(s)" criterion.

6 WADA Prohibited Classes Anabolic Agents (S1); Hormones and Related Substances (S2); Beta-2-agonists (S3); Anti-estrogenic Agents (S4); Diuretics and Masking Agents (S5); Stimulants (S6); Narcotics (S7); Cannabinoids (S8); Glucocorticoids (S9); Alcohol (P1); Beta Blockers (P2).

7 Predicting Bioactivities We seek to predict whether a molecule exhibits one of these bioactivities. Such a classifier would be powerful as an in silico pre-filter for experimental methods such as assays.

8 Methods

9 Chemical Space Use descriptor-based fingerprints to locate molecules in chemical space. Similar Property Principle suggests molecules close together in chemical space often share common bioactivity.

10 Machine Learning Use Machine Learning classification algorithms to predict bioactivity from location of molecules in chemical space. Random Forest. k-Nearest Neighbours.

11 Fingerprints CDK (Chemistry Development Kit) fingerprint. Unity 2D. MACCS key. MOE 2D (2004). Typed Atom Distance. Typed Graph Distance.

12 CDK Fingerprint CDK fingerprint resembles Daylight. All bond paths up to a length of 6 are generated. A hashing function is used to map these paths onto a fingerprint of 1024 bits.
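The path-and-hash idea on this slide can be illustrated with a minimal sketch. This is not the CDK implementation: the molecule representation (a list of atom symbols plus a list of bonded index pairs), the function names, the use of CRC32 as the hash, and the reading of "length" as a number of bonds are all assumptions made for illustration, and bond orders are ignored.

```python
import zlib

def enumerate_paths(atoms, bonds, max_bonds=6):
    """Return the set of linear atom paths containing up to max_bonds bonds."""
    neighbours = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    paths = set()

    def walk(path):
        labels = [atoms[i] for i in path]
        # A path and its reverse describe the same path; keep one canonical form.
        paths.add(min("-".join(labels), "-".join(reversed(labels))))
        if len(path) - 1 == max_bonds:
            return
        for nxt in neighbours[path[-1]]:
            if nxt not in path:  # never visit an atom twice in one path
                walk(path + [nxt])

    for start in range(len(atoms)):
        walk([start])
    return paths

def path_fingerprint(atoms, bonds, n_bits=1024, max_bonds=6):
    """Hash every path onto one position of an n_bits-long bit vector."""
    bits = [0] * n_bits
    for path in enumerate_paths(atoms, bonds, max_bonds):
        # CRC32 is an illustrative stand-in for the real hashing function.
        bits[zlib.crc32(path.encode("utf-8")) % n_bits] = 1
    return bits

# Heavy atoms of ethanol: C-C-O
ethanol_atoms, ethanol_bonds = ["C", "C", "O"], [(0, 1), (1, 2)]
ethanol = path_fingerprint(ethanol_atoms, ethanol_bonds)
```

Because different paths can hash to the same position, the number of set bits is at most the number of distinct paths.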

13 Unity 2D Fingerprint Unity is similar to CDK, but based on substructures rather than just paths. Substructures present in the molecule are enumerated. A hashing function is used to map these substructures onto a fingerprint of 992 bits.

14 Classification Algorithms Random Forest (RF). k-Nearest Neighbours (k-NN).

15 Random Forest Decision tree based learner. Each tree is grown on a bootstrap sample of the data. Number of trees in forest (ntree). Number of descriptors tried at each node (mtry). Each tree predicts label of molecule. Majority vote = class label of molecule.
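The ingredients listed on this slide (bootstrap sampling, ntree, mtry, majority vote) can be sketched in plain Python for binary fingerprints. This is a toy illustration, not the software used in the study: the Gini split criterion, the depth limit, the function names and the toy data are assumptions, and class labels must not be tuples because tuples mark internal nodes.

```python
import random
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def grow_tree(X, y, mtry, rng, depth=0, max_depth=8):
    # Stop when the node is pure or the (assumed) depth limit is reached.
    if len(set(y)) == 1 or depth == max_depth:
        return Counter(y).most_common(1)[0][0]
    best = None
    # Only mtry randomly chosen descriptors are tried at this node.
    for f in rng.sample(range(len(X[0])), mtry):
        on = [i for i, row in enumerate(X) if row[f] == 1]
        off = [i for i, row in enumerate(X) if row[f] == 0]
        if not on or not off:
            continue  # this bit does not split the node
        score = (len(on) * gini([y[i] for i in on])
                 + len(off) * gini([y[i] for i in off])) / len(y)
        if best is None or score < best[0]:
            best = (score, f, on, off)
    if best is None:
        return Counter(y).most_common(1)[0][0]
    _, f, on, off = best
    return (f,
            grow_tree([X[i] for i in off], [y[i] for i in off], mtry, rng, depth + 1, max_depth),
            grow_tree([X[i] for i in on], [y[i] for i in on], mtry, rng, depth + 1, max_depth))

def tree_predict(tree, x):
    while isinstance(tree, tuple):  # internal node: (bit index, off branch, on branch)
        f, off_branch, on_branch = tree
        tree = on_branch if x[f] == 1 else off_branch
    return tree

def random_forest(X, y, ntree=100, mtry=2, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(ntree):
        idx = [rng.randrange(len(X)) for _ in X]  # bootstrap sample, with replacement
        forest.append(grow_tree([X[i] for i in idx], [y[i] for i in idx], mtry, rng))
    return forest

def forest_predict(forest, x):
    # Each tree votes; the majority label is the prediction.
    return Counter(tree_predict(t, x) for t in forest).most_common(1)[0][0]

# Toy data: all 4-bit fingerprints, where the label is simply bit 0.
toy_X = [[(i >> b) & 1 for b in range(4)] for i in range(16)]
toy_y = [row[0] for row in toy_X]
toy_forest = random_forest(toy_X, toy_y, ntree=51, mtry=2, seed=1)
```

Individual trees can be wrong when the informative bit is never among the mtry candidates on a path; the vote across many trees is what makes the ensemble reliable.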

16 Random Forest [Figure: a single decision tree. Nodes split on A > x1 or A < x1, B > x2 or B < x2, and C > x3 or C < x3; leaves give the decision Yes or No.] A Random Forest contains many such trees.

18 k-Nearest Neighbours Instance based learner. Take a query point x and find the closest k points from the training set to it using Euclidean distance in descriptor space. k is a variable describing the number of neighbours to be considered. Class of x determined by majority vote of class labels of k neighbours. Ties broken randomly (only occurs for even k).
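The procedure on this slide can be sketched directly: rank the training points by Euclidean distance to the query, take the k closest, and vote. The function name, the toy coordinates and the class labels below are hypothetical, and the default tie-breaking generator is seeded only to keep the example reproducible.

```python
import math
import random
from collections import Counter

def knn_predict(train_X, train_y, x, k=1, rng=None):
    """Predict the class of x by majority vote among its k nearest training points."""
    rng = rng or random.Random(0)  # pass your own generator for unseeded tie-breaking
    # Rank training points by Euclidean distance in descriptor space.
    ranked = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in ranked[:k])
    top = max(votes.values())
    winners = [label for label, n in votes.items() if n == top]
    # A single winner is returned directly; ties are broken at random.
    return winners[0] if len(winners) == 1 else rng.choice(winners)

# Hypothetical two-descriptor training set with two classes.
demo_X = [[0, 0], [0, 1], [5, 5], [6, 5]]
demo_y = ["allowed", "allowed", "S6", "S6"]
nearest_label = knn_predict(demo_X, demo_y, [5, 6], k=1)
```

With two classes, a tie can only arise when k is even, which is why odd k is the safer choice.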

19 k-Nearest Neighbours

21 k-Nearest Neighbours Local method. Uses only a very small number of near neighbours to make its prediction. Suitable for predicting activity classes with multiple clusters in chemical space. Therefore good for WADA classes with multiple receptors.

22 Performance Measure Matthews Correlation Coefficient: Range: -1 ≤ MCC ≤ 1; Balance between predicting positives & negatives.
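For reference, the standard definition of the Matthews Correlation Coefficient in terms of confusion-matrix counts is MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). A minimal sketch follows; the function name and the convention of returning 0 when the denominator vanishes are assumptions.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from true/false positive/negative counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0 when a row or column of the confusion matrix is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Perfect prediction gives 1, perfectly inverted prediction gives -1, and a classifier no better than chance gives 0.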

23 Data

24 The Dataset 5245 molecules (5235 for CDK). Molecules taken from WADA banned list and from corresponding activity classes in MDDR. 367 explicitly allowed substances.

25 Data by Class WADA Class: Number of Molecules. S1: 47; S2: 272; S3: 367; S4: 928; S5: 1000; S6: 804; S7: 195; S8: 1000; S9: 26; P2: 239; Allowed: 367.

26 Fivefold Cross-validation We test for membership of each prohibited class separately. All calculations use 5-fold cv. This uses {80% molecules training set; 20% test set} repeated 5 times so that each molecule is in exactly 1 test set.
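The splitting scheme described here can be sketched as follows. The function name, the shuffling seed, and the choice of a plain (unstratified) split are assumptions; the slides do not say whether the folds were stratified by class.

```python
import random

def five_fold_splits(n_items, n_folds=5, seed=0):
    """Yield (train, test) index lists so that each item is in exactly one test set."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds roughly equal folds.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train, test

# With the 5245 molecules of the dataset, each test fold holds 1049 molecules.
splits = list(five_fold_splits(5245))
```

Each round trains on four folds (80%) and tests on the remaining one (20%).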

27 False Positives False Positives arise in two ways: (1) A molecule predicted positive on an incorrect activity class; (2) An explicitly allowed molecule predicted positive.
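The two sources of false positives can be tallied separately when testing membership of a single class. This is a sketch only: the function name, the label strings and the example predictions are hypothetical.

```python
from collections import Counter

def false_positive_sources(true_labels, predicted_positive, target_class):
    """Count false positives for target_class, split by where they come from."""
    counts = Counter()
    for label, positive in zip(true_labels, predicted_positive):
        if not positive or label == target_class:
            continue  # negatives and true positives are not false positives
        # (1) member of a different prohibited class, or (2) explicitly allowed
        counts["allowed" if label == "Allowed" else "other_class"] += 1
    return counts

# Hypothetical predictions for the S1 test.
fp_counts = false_positive_sources(
    ["S1", "S6", "Allowed", "S1", "P2"],
    [True, True, True, False, False],
    "S1",
)
```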

28 Results

29 Results: Random Forest Aggregated over 10 classes

30 Unity and CDK > MACCS > others.

31 100 trees sufficient; little improvement with more.

32 Results: k-Nearest Neighbours Aggregated over 10 classes

34 Unity and CDK > MACCS > others.

35 k = 1 best; poor performance at k = 2 due to ties. MCC falls off with increasing k.

36 k = 1 best; poor performance at k = 2 due to ties. MCC falls off with increasing k. Unity CDK.

37 Results: Comparison Recall v Precision Aggregated over 10 classes [Plot with axes Recall and Precision]

38 RF gives higher precision, k-NN higher recall.
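The trade-off can be made concrete with the usual definitions: precision = TP / (TP + FP) and recall = TP / (TP + FN). The counts in the example are invented purely to illustrate the pattern and are not results from the study.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Invented counts: a cautious classifier versus a more permissive one.
cautious = precision_recall(tp=80, fp=5, fn=20)
permissive = precision_recall(tp=95, fp=20, fn=5)
```

A cautious classifier flags fewer molecules, so it makes fewer false positives (higher precision) but misses more actives (lower recall); a permissive one shows the reverse.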

39 Results: Comparison Analysed by class

40 Classes vary in difficulty of prediction; independent of classification algorithm.

41 Conclusions

42 Can successfully predict active molecules (MCC 0.83). Unity and CDK > MACCS > others. RF & k-NN give similar MCC. k-NN gives higher recall. RF gives higher precision; RF is less likely to produce false positives.

43 Conclusions RF results vary little with ntree. k-NN results best for k = 1. Performance decreases at higher k. Odd k avoids problems with ties (k = 2 is worse than k = 3). Activity classes show consistent prediction difficulty pattern.

44 Acknowledgements Andreas Bender (Novartis Institutes). David Palmer (Unilever Centre). Unilever.

45 www-mitchell.ch.cam.ac.uk/

46 No significant correlation overall between class size and difficulty of prediction, though the smallest class, S9, is hardest to predict.

48 tetrahydrogestrinone (THG) gestrinone trenbolone

