Jacqueline M. Hughes-Oliver Department of Statistics

1 Analysis of High-Dimensional Structure-Activity Screening Datasets Using the Optimal Bit String Tree
Jacqueline M. Hughes-Oliver
Department of Statistics, North Carolina State University
*Joint with Ke Zhang, GSK, and Stan Young, NISS
This work was funded by the National Institutes of Health through the NIH Roadmap for Medical Research, Grant 1 P20 HG
Information on the Molecular Libraries Roadmap Initiative can be obtained from

2 Blackwell-Tapia - November 2008
Outline
Background
Recursive partitioning
OBSTree
Simulation study
Screening for monoamine oxidase inhibitors
Summary

3 Background
Estimate a function such that based on where Preferably, costs more than

5 Background – Structure-Activity Relationship (SAR)
Willett, Barnard, Downs (1998 JCICS)
Molecular descriptors: Carhart atom pairs
Atom type, distance, atom type, e.g., C(2,1)-04-C(3,1)
Binary descriptors: few turned on

6 Recursive Partitioning
[Figure: tree with True/False branches and splits X3=1 and X27=1]
Splitting variable chosen to optimize “purity measure”
Search space: size p
Need definitions for:
  search space
  purity measure, splitting criterion
  stopping criterion
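A single recursive-partitioning step can be sketched as below. This is only an illustration: the slide does not name the purity measure, so Gini impurity is assumed here, and `gini` and `best_single_split` are hypothetical names.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 for an empty or pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_single_split(X, y):
    """Scan all p binary descriptors; return (column, weighted child impurity) of the purest split."""
    n, p = len(X), len(X[0])
    best = (None, float("inf"))
    for j in range(p):  # search space of size p
        on = [y[i] for i in range(n) if X[i][j] == 1]
        off = [y[i] for i in range(n) if X[i][j] == 0]
        score = (len(on) * gini(on) + len(off) * gini(off)) / n
        if score < best[1]:
            best = (j, score)
    return best

# Toy data (made up): descriptor 0 separates the two classes perfectly.
X = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 0]]
y = [1, 1, 0, 0]
print(best_single_split(X, y))
```

Applying the same search again inside each child node gives the recursive tree.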

7 Recursive Partitioning: Rules are complex
[Figure: tree with splits on descriptors 1 2 3 17 18 9 19 6 15 16 11 5 8 4 12 13 7]
Are all splits necessary for the activity mechanism?
Does an early split impede identification of other mechanisms?

8 Recursive Partitioning: Focus of Study
Need definitions for:
  Search space
  Purity measure, splitting criterion
  Stopping rule
Binary Formal Inference-Based Recursive Modeling (BFIRM): Cho, Shen, Hermsmeier (2000, JCICS)
  Rank predictors according to F-test
  Combine important predictors to form splitting variable
  Result is better QSAR rules
Recursive Partitioning/Simulated Annealing (RP/SA): Blower et al. (2002, JCICS)
  Best single predictor not necessarily best in combination
Tree Harvesting: Yuan, Chipman, Welch (2006 tech report)
  “Trim” bits off each terminal node

9 Recursive Partitioning: RP/SA
Splitting variables are based on a combination of K predictors
Features are always present:
Search space of size
Uses simulated annealing (stochastic optimization)
K is held fixed for all splits, and is assumed known

10 OBSTree
Splitting variables are based on a combination of K predictors
Combine approaches of BFIRM and RP/SA
Features can be present or absent: chromosome selection
Search space of size
Uses simulated annealing + weighted sampling + trimming
“K” can vary from split to split, and is assumed unknown
Uses a penalty entropy splitting criterion
Usual stopping criteria applied, including cross validation
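The search-space formulas did not survive in this transcript. As a rough sketch only, assume that a split requiring K features to be present has C(p, K) candidates, and that letting each feature be present or absent multiplies this by 2^K. Both counts are assumptions, not figures from the slides. The value p = 500 matches the number of descriptors in the simulation study.

```python
from math import comb

def present_only_count(p, K):
    """Assumed count: choose K of p descriptors, all required to be present."""
    return comb(p, K)

def present_or_absent_count(p, K):
    """Assumed count: choose K of p descriptors, each required present or absent."""
    return comb(p, K) * 2 ** K

p = 500
for K in (1, 2, 5):
    print(K, present_only_count(p, K), present_or_absent_count(p, K))
```

Under these assumptions the counts grow far too fast for exhaustive search, which fits the slides' use of simulated annealing.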

11 OBSTree: Flowchart
Pre-OBSTree Setup
  Remove unary descriptors
  Determine Singly Important group
  Specify parameters
Descriptor Pool -> RP -> Singly Important Descriptors, General Descriptors
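One way to draw K descriptors with weights is sketched below. The slides do not say how the weights are set or how sampling is done, so this assumes that Singly Important descriptors simply get a larger weight than General ones. It uses the Efraimidis-Spirakis key method for weighted sampling without replacement. All indices and weights are hypothetical.

```python
import random

def weighted_sample(descriptors, weights, k, rng=random):
    """Draw k distinct descriptors; a larger weight means a higher chance of selection.

    Efraimidis-Spirakis: give each item the key u ** (1 / w) and keep the k largest keys.
    """
    keyed = [(rng.random() ** (1.0 / w), d) for d, w in zip(descriptors, weights)]
    keyed.sort(reverse=True)
    return [d for _, d in keyed[:k]]

singly_important = [3, 27]           # hypothetical descriptor indices
general = list(range(100, 110))      # hypothetical descriptor indices
pool = singly_important + general
weights = [5.0] * len(singly_important) + [1.0] * len(general)  # assumed weights

rng = random.Random(0)
print(weighted_sample(pool, weights, k=4, rng=rng))
```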

12 OBSTree: Flowchart
Pre-OBSTree Setup
  Remove unary descriptors
  Determine Singly Important group
  Specify parameters
Initialize split at next depth: depth=depth+1
  Select a set of K descriptors (X0) using WSS
  Determine best chromosome x0 of initial X0
Decision: depth=d or node size<2min or Ymax=0 or Ybar>M-1
  Yes: Form last terminal node. STOP
  No: SA to determine “optimal” (XA, xA) for split using WSS
Trim
  Check 2^K-1 subsets of current (XA, xA)
  Report best trimmed version as (X*, x*)
Form terminal node X*=x*? Yes / No
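The trimming step can be illustrated by enumerating every non-empty subset of the K (descriptor, bit) pairs. This is a sketch under assumptions: the slides do not say how ties are broken, so smaller subsets win here, and `toy_score` is a hypothetical stand-in for the real splitting criterion.

```python
from itertools import combinations

def trim(descriptors, chromosome, score):
    """Evaluate all 2^K - 1 non-empty subsets of (descriptor, bit) pairs; return the best.

    `score` maps a tuple of (descriptor, bit) pairs to a number, lower is better.
    Ties go to the smaller subset, so bits that do not help are dropped.
    """
    pairs = list(zip(descriptors, chromosome))
    best, best_key = None, None
    for size in range(1, len(pairs) + 1):
        for subset in combinations(pairs, size):
            key = (score(subset), size)
            if best_key is None or key < best_key:
                best, best_key = subset, key
    return best

# Toy criterion (hypothetical): only "descriptor 1 present" and "descriptor 2 absent" matter.
def toy_score(subset):
    needed = {(1, 1), (2, 0)}
    return len(needed - set(subset))

print(trim([1, 2, 3, 4], [1, 0, 1, 1], toy_score))
```

Exhaustive enumeration is affordable only because K is small. The expensive search over the full descriptor pool is left to simulated annealing.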

13 OBSTree: Splitting Criterion
Node has N compounds
Class i has proportion pi in the node, with a total of ni in the node
Entropy (node impurity):
Penalty Entropy (penalize unwanted category)
Problem: Entropy=0 (perfect) when a class of junk compounds is identified
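The problem named on this slide can be seen with ordinary Shannon entropy, assumed here as the impurity measure with p_i = n_i / N. The node counts are made up. The penalty entropy formula itself is not in the transcript, so it is not reproduced.

```python
import math

def entropy(counts):
    """Shannon entropy of a node from its class counts n_i, using p_i = n_i / N."""
    N = sum(counts)
    return sum((n / N) * math.log(N / n) for n in counts if n > 0)

junk_node = [40, 0, 0, 0]     # hypothetical node: only inactive (junk) compounds
active_node = [0, 0, 0, 33]   # hypothetical node: only highly active compounds
mixed_node = [20, 5, 5, 10]   # hypothetical mixed node

print(entropy(junk_node) == entropy(active_node))  # both pure nodes look equally "perfect"
print(entropy(mixed_node) > 0)
```

Plain entropy scores a pure junk node the same as a pure active node, which is why a criterion that penalizes the unwanted category is needed.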

14 OBSTree: Stopping Criteria
Maximum depth d
The most active compound is junk
The node size is less than 2j (j is the minimum node size)
5-fold cross-validation, e.g., choose depth d if # correct classifications levels off at depth d
Accept H0: p_{d+1} = 0 for p_{d+1} = sensitivity between depths d and d+1
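The first three rules are simple enough to sketch as a single check. The function name and arguments are hypothetical, class 0 is assumed to be the junk category, and the cross-validation and hypothesis-test rules are left out.

```python
def should_stop(depth, responses, max_depth, min_node_size):
    """Return True if any of the three simple stopping rules fires for this node."""
    if depth >= max_depth:                   # maximum depth d reached
        return True
    if len(responses) < 2 * min_node_size:   # too small to split into two valid children
        return True
    if max(responses, default=0) == 0:       # most active compound is junk (class 0 assumed)
        return True
    return False

print(should_stop(depth=2, responses=[0, 0, 3, 1, 0, 2, 0, 0, 0, 0], max_depth=5, min_node_size=5))
print(should_stop(depth=2, responses=[0] * 12, max_depth=5, min_node_size=5))
```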

15 Simulation Study
1000 compounds, 500 binary descriptors
Four active groups (20 compounds per group): 8% active
Activity mechanisms, with potency and descriptor set:
  I    potency 3   descriptors 1 2 3 4 5
  II   potency 3   descriptors 5 6 7 8 9
  III  potency 2   descriptors 3 11 12 13 17
  IV   potency 1   descriptors 15 16 17 18 19

16 Simulation Study: Standard RP Tree
[Figure: standard RP tree with splits on descriptors 1 2 3 17 18 9 19 6 15 16 11 5 8 4 12 13 7; surviving terminal-node labels: 5 compounds of compounds of 0 7 compounds of 3]

17 Simulation Study: Sample OBSTree
1,2,3,4,5/1,0,1,0,1 3
15,16,17,18,19/1,1,0,1,1 1
5,6,7,8,9/0,1,1,1,1 3
3,11,12,13,17/1,1,1,1,1 2
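Reading each split as "descriptor set/chromosome", a compound satisfies a split when every listed descriptor has exactly the required bit. The sketch below assumes that reading. The helper name and the toy compounds are hypothetical.

```python
def matches(compound_bits, descriptors, chromosome):
    """True if the compound has exactly the required bit at every listed descriptor.

    `compound_bits` is the set of descriptor indices that are switched on for the compound.
    """
    return all((d in compound_bits) == bool(bit) for d, bit in zip(descriptors, chromosome))

rule = ([1, 2, 3, 4, 5], [1, 0, 1, 0, 1])   # the split 1,2,3,4,5/1,0,1,0,1

print(matches({1, 3, 5, 40}, *rule))   # has 1, 3, 5 and lacks 2, 4
print(matches({1, 2, 3, 5}, *rule))    # descriptor 2 is present, so the rule fails
```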

18 Simulation Study: 5-fold Cross-validation
OBSTree: Actual Accuracy 1 2 3 Prediction 918 5 99.5% 20 100% 35 94.6% Hit 99.7% 87.5% Overall Accuracy: 99.3%
RP: Actual Accuracy 1 2 3 Prediction 910 34 96.3% 19 86.4% 20 100% 7 6 46.2% Hit 98.9% 95% 15% Overall Accuracy: 93.5%

19 Simulation Study: Sensitivity Analysis
K, descriptor set size
  K >7 perfectly found all mechanisms
  K =7 perfectly found all but one mechanism
Basic tree parameters
  Min node size is 5
SA parameters
  Initial temperature
  Minimum temperature
  Temperature reduction rate
  # transitions at a given temperature
  # failures to accept new point before increasing transition counter
Sampling weights in WSS
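To show where the SA parameters listed above enter, here is a generic simulated-annealing loop for minimization. It is a textbook-style sketch, not the OBSTree implementation. The parameter names, default values, and the geometric cooling schedule are all assumptions.

```python
import math
import random

def simulated_annealing(initial, neighbor, cost, t_init=1.0, t_min=1e-3,
                        rate=0.9, n_transitions=50, max_failures=20, rng=None):
    """Generic simulated annealing (minimization) exposing the five SA parameters."""
    rng = rng or random.Random()
    current, current_cost = initial, cost(initial)
    best, best_cost = current, current_cost
    t = t_init                                  # initial temperature
    while t > t_min:                            # minimum temperature
        transitions = failures = 0
        while transitions < n_transitions:      # transitions at a given temperature
            candidate = neighbor(current, rng)
            candidate_cost = cost(candidate)
            delta = candidate_cost - current_cost
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                current, current_cost = candidate, candidate_cost
                transitions += 1
                failures = 0
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
            else:
                failures += 1
                if failures >= max_failures:    # too many rejections: count a transition anyway
                    transitions += 1
                    failures = 0
        t *= rate                               # temperature reduction rate
    return best, best_cost

# Toy problem (hypothetical): minimize (x - 3)^2 over the integers with +/-1 moves.
best, best_cost = simulated_annealing(
    initial=20,
    neighbor=lambda x, rng: x + rng.choice((-1, 1)),
    cost=lambda x: (x - 3) ** 2,
    rng=random.Random(1),
)
print(best, best_cost)
```

In OBSTree the state would be a descriptor set with its chromosome, and the neighbor move would draw replacement descriptors by weighted sampling. The slides do not give those details.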

20 Screening to Identify MAO Inhibitors
Neuronal MAO deactivates neurotransmitters
Pargyline, an MAO inhibitor, was used to treat depression
MAO inhibitors no longer used due to toxicity & interactions
Abbott Laboratories dataset of MAO inhibitors
  Brown & Martin (1996 JCICS), 1646 chemically diverse compounds
  1380 binary 2D atom-pair descriptors
  Response variable: 0, 1, 2, 3 (ordered data) [1358/114/86/88]
  Category 3 has 2 well-known mechanisms: Rusinko et al. (1999 JCICS)

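21
[Figure: trees fitted to the MAO data by OBSTree, RP/SA, and RP; splits and node counts in slide order]
OBSTree: 81,177,579,183/1,1,1,0 with node 0/0/0/33; 1,579,1184,809/1,1,1,0 with node 0/0/0/15
RP/SA: 184,721,879 with node 0/1/0/6; 32,572,844 with node 1/0/1/26
RP: 704 1184 81 9/2/1/2 2/0/0/32 65 2/0/5/24 183 959/85/55/18 99/1/0/0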

22 MAO: Activity Mechanism I
“Irreversible binding to flavin cofactor of MAO”
Pargyline-like compounds
Typical features of pargyline-like compounds
  A triple bond
  A tertiary nitrogen
  An aromatic ring
1st terminal node of OBSTree: 81 183 177 579 0/0/0/33 1
Highest active terminal node of RP: 81 704 2/0/0/32 1
1st terminal node of RP/SA: 184 721 879 0/1/0/6 1

23 MAO: Activity Mechanism I
Compound 1: Pargyline, y=3, has 579 & 81 & 177 but not 183
Compound 2: y=0, has feature 183 so violates OBSTree
Compound 3: y=0, falls in active node from RP
Compound 4: y=0, falls in active node from RP and RP/SA

24 MAO: Activity Mechanism II
“Binding to active site”
-N-N-C(=O)- is a hydrazine feature that can be hydrolyzed to bind protein (MAO) as a nonselective, irreversible inhibitor
[Figure: chemical structure annotated with atom-pair descriptors]
1: C(1,0)-3-C(1,0)
579: C(2,1)-3-C(3,1)
1184: N(2,0)-2-N(2,0)

25 Absent Descriptor (809: C(3,1)-4-Br)
[Figure: two chemical structures, labeled Activity=3 and Activity=2, with the atom pair C(3,1)-4-Br marked]

26 Summary
OBSTree: new RP algorithm for obtaining simplified output
  Model presence and absence of molecular features
  Combination size is data-driven, varies over splits
  Penalty entropy splitting criterion for one-sided purity
  Weighted sampling during optimization allows prior information
  Simpler verification of QSAR
Standard RP and RP/SA are special cases of OBSTree
Output is not deterministic
As with any RP output, care should be taken when interpreting the results
  Can miss highly correlated but important predictors
  Different trees provide similar partitions of the data
  Because of hard thresholding, predictions are highly variable
Computationally intensive!

27 Acknowledgements
Atina Brooks, North Carolina State University
Jiajun Liu, Merck
Haojun Ouyang, North Carolina State University
Abbott Laboratories
Jack Liu, OmicSoft
Jun Feng, NIH
GoldenHelix

