1 Analysis of High-Dimensional Structure-Activity Screening Datasets Using the Optimal Bit String Tree
Jacqueline M. Hughes-Oliver, Department of Statistics, North Carolina State University (NCSU)
*Joint with Ke Zhang, GSK, and Stan Young, NISS
This work was funded by the National Institutes of Health through the NIH Roadmap for Medical Research, Grant 1 P20 HG. Information on the Molecular Libraries Roadmap Initiative can be obtained from [link not preserved].

2 Outline (Blackwell-Tapia, November 2008)
Background
Recursive partitioning
OBSTree
Simulation study
Screening for monoamine oxidase inhibitors
Summary

3 Background
Estimate a function such that [formula not preserved], based on [formula not preserved], where [formula not preserved]
Preferably, [symbol not preserved] costs more than [symbol not preserved]

5 Background: Structure-Activity Relationship (SAR)
Willett, Barnard, Downs (1998 JCICS)
Molecular descriptors: Carhart atom pairs
– Atom type, distance, atom type, e.g., C(2,1)-04-C(3,1)
– Binary descriptors, few turned on
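
The atom-pair encoding on this slide can be sketched in a few lines. This is only an illustration, assuming each compound has already been reduced to a set of atom-pair strings; the compound names and the `to_bit_matrix` helper are hypothetical, and the atom pairs are borrowed from later slides of the deck.

```python
def to_bit_matrix(compounds):
    """Turn {name: set of atom-pair strings} into sorted column names and 0/1 rows."""
    # One column per distinct atom pair seen anywhere in the collection.
    columns = sorted({pair for pairs in compounds.values() for pair in pairs})
    rows = {
        name: [1 if col in pairs else 0 for col in columns]
        for name, pairs in compounds.items()
    }
    return columns, rows


# Hypothetical compounds described by Carhart-style atom pairs
# (atom type, bond-path distance, atom type).
compounds = {
    "cmpd_a": {"C(2,1)-3-C(3,1)", "N(2,0)-2-N(2,0)"},
    "cmpd_b": {"C(1,0)-3-C(1,0)"},
    "cmpd_c": {"C(2,1)-3-C(3,1)", "C(3,1)-4-Br"},
}

columns, rows = to_bit_matrix(compounds)
for name, bits in rows.items():
    print(name, bits)
```

With thousands of possible atom pairs and only a handful present per compound, most bits in each row are 0, which is the "few turned on" property noted on the slide.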

6 Recursive Partitioning
[Tree diagram: binary splits on X3 = 1 and X27 = 1, each with True/False branches]
Splitting variable chosen to optimize purity measure
Search space: size p
Need definitions for:
– search space
– purity measure, splitting criterion
– stopping criterion
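
A single step of standard recursive partitioning, scanning all p binary descriptors for the purest split, might look like the sketch below. Shannon entropy is assumed as the purity measure, and the function names and toy data are illustrative rather than taken from the talk.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels; 0.0 for an empty list."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())


def best_single_split(X, y):
    """Scan all p binary descriptors; return (index, weighted child entropy)."""
    n, p = len(X), len(X[0])
    best = None
    for j in range(p):
        left = [y[i] for i in range(n) if X[i][j] == 1]   # "True" branch
        right = [y[i] for i in range(n) if X[i][j] == 0]  # "False" branch
        if not left or not right:
            continue  # descriptor does not split this node
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if best is None or score < best[1]:
            best = (j, score)
    return best


# Toy data (made up): descriptor 1 separates the classes perfectly.
X = [[1, 1, 0], [0, 1, 1], [1, 0, 0], [0, 0, 1], [1, 0, 1]]
y = [3, 3, 0, 0, 0]
print(best_single_split(X, y))
```

The search visits each descriptor once, which is the "search space: size p" noted on the slide.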

7 Recursive Partitioning: Rules are complex
Are all splits necessary for the activity mechanism?
Does an early split impede identification of other mechanisms?

8 Recursive Partitioning: Focus of Study
Need definitions for:
– Search space
– Purity measure, splitting criterion
– Stopping rule
Binary Formal Inference-Based Recursive Modeling (BFIRM), Cho, Shen, Hermsmeier (2000, JCICS)
– Rank predictors according to F-test
– Combine important predictors to form splitting variable
– Result is better QSAR rules
Recursive Partitioning/Simulated Annealing (RP/SA), Blower et al. (2002, JCICS)
– Best single predictor not necessarily best in combination
Tree Harvesting, Yuan, Chipman, Welch (2006 tech report)
– Trim bits off each terminal node

9 Recursive Partitioning: RP/SA
Splitting variables are based on a combination of K predictors
Features are always present: [formula not preserved]
Search space of size [formula not preserved]
Uses simulated annealing (stochastic optimization)
K is held fixed for all splits, and is assumed known

10 OBSTree
Splitting variables are based on a combination of K predictors
Combine approaches of BFIRM and RP/SA
Features can be present or absent: chromosome selection
Search space of size [formula not preserved]
Uses simulated annealing + weighted sampling + trimming
K can change for all splits, and is assumed unknown
Uses a penalty entropy splitting criterion
Usual stopping criteria applied, including cross validation
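
The search-space formulas for RP/SA and OBSTree did not survive in the transcript. The sketch below uses the natural combinatorial counts as an assumption: p candidates for a single-descriptor split, C(p, K) descriptor sets when all K features must be present, and C(p, K) * 2^K set/chromosome pairs when each feature may be present or absent. These counts are not confirmed by the slides, and `search_space_sizes` is a hypothetical helper.

```python
from math import comb


def search_space_sizes(p, K):
    """Candidate splits per node under three search strategies (assumed counts)."""
    return {
        "single descriptor (standard RP)": p,
        "K descriptors, all present (RP/SA)": comb(p, K),
        "K descriptors, present or absent (OBSTree)": comb(p, K) * 2 ** K,
    }


# p = 1380 matches the MAO descriptor count; K = 4 is an arbitrary choice.
for name, size in search_space_sizes(p=1380, K=4).items():
    print(f"{name}: {size:,}")
```

Counts of this size rule out exhaustive enumeration, which is consistent with both methods relying on simulated annealing. Because K itself can vary in OBSTree, its real search space would be larger still than the fixed-K count shown here.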

11 OBSTree: Flowchart
Pre-OBSTree Setup
– Remove unary descriptors
– Determine Singly Important group
– Specify parameters
[Diagram: RP divides the Descriptor Pool into Singly Important Descriptors and General Descriptors]
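
One way to picture weighted sampling over the two descriptor groups is to draw K distinct descriptors with higher weight on the singly important ones. The slides do not give the actual weights or sampling scheme, so the 5:1 ratio, the group sizes, and the `weighted_sample` helper below are all assumptions.

```python
import random


def weighted_sample(descriptors, weights, K, rng):
    """Draw K distinct descriptors, each draw proportional to remaining weights."""
    pool = list(descriptors)
    w = list(weights)
    chosen = []
    for _ in range(K):
        pick = rng.choices(range(len(pool)), weights=w, k=1)[0]
        chosen.append(pool.pop(pick))  # remove so it cannot be drawn again
        w.pop(pick)
    return chosen


# Hypothetical setup: descriptors 0-4 are "singly important", 5-19 "general".
descriptors = list(range(20))
weights = [5.0 if d < 5 else 1.0 for d in descriptors]  # 5:1 ratio is assumed

rng = random.Random(0)
print(weighted_sample(descriptors, weights, K=4, rng=rng))
```

Raising the weight on a group is a simple way to feed prior information into the stochastic search without excluding the remaining descriptors.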

12 OBSTree: Flowchart
Pre-OBSTree Setup
– Remove unary descriptors
– Determine Singly Important group
– Specify parameters
Initialize split at next depth: depth = depth + 1
Stopping check: depth = d, or node size < 2min, or Ymax = 0, or Ybar > M-1?
– Yes: form last terminal node. STOP
– No: continue
Select a set of K descriptors (X0) using WSS
Determine best chromosome x0 of initial X0
SA to determine optimal (XA, xA) for split using WSS
Trim
– Check 2^K - 1 subsets of current (XA, xA)
– Report best trimmed version as (X*, x*)
X* = x*? (Yes/No decision): form terminal node
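
The trimming step, checking all 2^K - 1 non-empty subsets of the current descriptor set and chromosome, can be sketched as below. The scoring function here is a stand-in (fraction of matched compounds that are active), not the penalty entropy criterion used by OBSTree; the helper names and toy data are hypothetical.

```python
from itertools import combinations


def matches(bits, rule):
    """rule maps descriptor index -> required bit (1 = present, 0 = absent)."""
    return all(bits[j] == v for j, v in rule.items())


def trim(rule, X, y, score):
    """Score all 2^K - 1 non-empty sub-rules; ties go to the shorter rule."""
    best_rule, best_score = None, None
    items = sorted(rule.items())
    for size in range(1, len(items) + 1):  # shortest rules first
        for subset in combinations(items, size):
            candidate = dict(subset)
            s = score(candidate, X, y)
            if best_score is None or s > best_score:  # strict: keeps shorter on ties
                best_rule, best_score = candidate, s
    return best_rule, best_score


def active_purity(rule, X, y):
    """Stand-in criterion: fraction of matched compounds that are active."""
    hit = [y[i] for i in range(len(X)) if matches(X[i], rule)]
    return sum(1 for v in hit if v > 0) / len(hit) if hit else 0.0


# Toy data (made up): activity needs bit 0 present and bit 2 absent; bit 1 is noise.
X = [[1, 0, 0], [1, 1, 0], [1, 1, 1], [0, 1, 0], [1, 0, 1], [0, 0, 0]]
y = [3, 3, 0, 0, 0, 0]
print(trim({0: 1, 1: 1, 2: 0}, X, y, active_purity))
```

In this toy case the three-descriptor rule and the trimmed two-descriptor rule score the same, so the shorter rule is reported. That mirrors the purpose of trimming: dropping bits that do not contribute to the split.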

13 OBSTree: Splitting Criterion
Node has N compounds
Class i has proportion p_i in the node, with a total of n_i in the node
Entropy (node impurity): [formula not preserved]
Penalty Entropy (penalize unwanted category): [formula not preserved]
Problem: Entropy = 0 (perfect) when a class of junk compounds is identified
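
The problem stated on this slide is easy to reproduce numerically. The sketch below assumes the standard Shannon form of entropy with p_i = n_i / N, since the slide's own formula is missing. The `penalized_entropy` function is a purely hypothetical variant written to show the idea of penalizing an unwanted category. It is not the authors' penalty entropy definition.

```python
from math import log2


def entropy(counts):
    """Shannon entropy (bits) from per-class counts n_i, using p_i = n_i / N."""
    N = sum(counts)
    return sum((n / N) * log2(N / n) for n in counts if n > 0)


def penalized_entropy(counts, junk_class=0, penalty=1.0):
    """Hypothetical variant: add a penalty proportional to the junk share."""
    N = sum(counts)
    return entropy(counts) + penalty * counts[junk_class] / N


# Counts are ordered by class 0, 1, 2, 3; class 0 plays the role of junk.
pure_junk = [40, 0, 0, 0]
pure_active = [0, 0, 0, 40]
print(entropy(pure_junk), entropy(pure_active))  # both 0.0
print(penalized_entropy(pure_junk), penalized_entropy(pure_active))
```

Plain entropy cannot tell a node of pure junk from a node of pure actives, so a search that minimizes it may happily isolate inactive compounds. Any one-sided criterion has to break that symmetry, which is what the penalty term stands in for here.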

14 OBSTree: Stopping Criteria
Maximum depth d
The most active compound is junk
The node size is less than 2j (j is the minimum node size)
5-fold cross-validation, e.g., choose depth d if
– # correct classifications levels off at depth d
– Accept H0: [symbol not preserved]_(d+1) = 0, where [symbol not preserved]_(d+1) = sensitivity between depths d and d+1
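
The first three criteria on this slide amount to a simple per-node check, sketched below. The function name and arguments are hypothetical, class 0 is assumed to be the junk (inactive) category and the lowest activity level, and the cross-validation rule is not covered.

```python
def should_stop(depth, labels, max_depth, min_node_size, junk_class=0):
    """Return the first stopping reason that applies to a non-empty node, or None."""
    if depth >= max_depth:
        return "maximum depth reached"
    if max(labels) == junk_class:
        return "most active compound is junk"
    if len(labels) < 2 * min_node_size:
        # Fewer than 2j compounds cannot yield two children of size >= j.
        return "node too small to split into two valid children"
    return None


print(should_stop(depth=2, labels=[0, 0, 3, 1, 0, 2, 0, 0, 0, 0],
                  max_depth=5, min_node_size=5))
print(should_stop(depth=2, labels=[0, 0, 0], max_depth=5, min_node_size=5))
```

The 2j threshold follows from the minimum node size: a node with fewer than 2j compounds has no split that leaves at least j compounds on both sides.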

15 Simulation Study
[count not preserved] compounds, 500 binary descriptors
Four active groups (20 compounds per group), 8% active
[Table: Activity Mechanisms (I, II, III, IV) by Potency and Descriptor Sets and Chromosomes; cell values not preserved]

16 Simulation Study: Standard RP Tree
[Tree diagram; node labels only partly preserved: "5 compounds of ...", "... compounds of 0", "7 compounds of 3"]

17 Simulation Study: Sample OBSTree
[Tree diagram; splits shown as descriptor set/chromosome, labels only partly preserved]
– 1,2,3,4,5 / 1,0,1,0,[not preserved]
– [not preserved],16,17,18,19 / 1,1,0,1,1
– 5,6,7,8,9 / 0,1,1,1,1
– 3,11,12,13,17 / 1,1,1,1,1

18 Simulation Study: 5-fold Cross-validation
[Two tables of Prediction by Actual class (0, 1, 2, 3), each with an Accuracy column and a Hit row; cell counts and accuracy values not preserved]
OBSTree
– Hit: 99.7%, 100%, 87.5% (one value not preserved)
– Overall Accuracy: 99.3%
RP
– Hit: 98.9%, 95%, 100%, 15%
– Overall Accuracy: 93.5%

19 Simulation Study: Sensitivity Analysis
K, descriptor set size
– K > 7 perfectly found all mechanisms
– K = 7 perfectly found all but one mechanism
Basic tree parameters
– Min node size is 5
SA parameters
– Initial temperature
– Minimum temperature
– Temperature reduction rate
– # transitions at a given temperature
– # failures to accept new point before increasing transition counter
– Sampling weights in WSS
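
The SA parameters listed on this slide map onto a generic simulated annealing loop, sketched below with geometric cooling. This is a textbook-style illustration rather than the OBSTree implementation: the default values, the toy cost function, and the neighbor move are all assumptions, and the "failures to accept" counter and WSS weights are left out for brevity.

```python
import math
import random


def anneal(cost, initial, neighbor, rng,
           t_init=1.0, t_min=0.01, cooling=0.9, n_transitions=50):
    """Generic simulated annealing (minimization) with geometric cooling."""
    state, state_cost = initial, cost(initial)
    best, best_cost = state, state_cost
    t = t_init
    while t > t_min:
        for _ in range(n_transitions):
            cand = neighbor(state, rng)
            cand_cost = cost(cand)
            delta = cand_cost - state_cost
            # Always accept improvements; accept worse moves with prob exp(-delta/t).
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                state, state_cost = cand, cand_cost
                if state_cost < best_cost:
                    best, best_cost = state, state_cost
        t *= cooling  # temperature reduction rate
    return best, best_cost


# Toy problem (made up): recover a hidden 3-descriptor rule among P = 30 bits.
P, TARGET = 30, {(4, 1), (11, 0), (23, 1)}


def cost(state):
    return len(TARGET - set(state))  # target (descriptor, bit) pairs still missing


def neighbor(state, rng):
    """Replace one (descriptor, bit) pair, keeping descriptors distinct."""
    state = list(state)
    i = rng.randrange(len(state))
    used = {d for k, (d, _) in enumerate(state) if k != i}
    d = rng.choice([d for d in range(P) if d not in used])
    state[i] = (d, rng.randint(0, 1))
    return tuple(state)


rng = random.Random(0)
start = ((0, 1), (1, 1), (2, 1))
print(anneal(cost, start, neighbor, rng))
```

A higher initial temperature or slower reduction rate lets the search accept more uphill moves early on, at the price of more evaluations. Because the search is stochastic, repeated runs can return different rules.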

20 Screening to Identify MAO Inhibitors
Neuronal MAO deactivates neurotransmitters
Pargyline, an MAO inhibitor, was used to treat depression
MAO inhibitors no longer used due to toxicity & interactions
Abbott Laboratories dataset of MAO inhibitors
– Brown & Martin (1996 JCICS), 1646 chemically diverse compounds
– 1380 binary 2D atom-pair descriptors
– Response variable: 0, 1, 2, 3 (ordered data) [1358/114/86/88]
– Category 3 has 2 well-known mechanisms, Rusinko et al. (1999 JCICS)

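21 [Tree diagrams for OBSTree, RP/SA, and RP on the MAO data; node counts are listed per class as 0/1/2/3, and labels are only partly preserved]
– Splits with chromosomes: 81,177,579,183/1,1,1,0 and 1,579,1184,809/1,1,1,0
– Other split labels: 32,572,[not preserved]; [not preserved],721,879; 183
– Node counts: [not preserved]/1/0/6, 1/0/1/26, 0/0/0/33, 0/0/0/15, 2/0/0/32, 2/0/5/24, 9/2/1/[not preserved], 959/85/55/18, 99/1/0/0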

22 MAO: Activity Mechanism I
Irreversible binding to flavin cofactor of MAO
Pargyline-like compounds
Typical features of pargyline-like compounds
– A triple bond
– A tertiary nitrogen
– An aromatic ring
[Figure: 1st terminal node of OBSTree, highest active terminal node of RP, and 1st terminal node of RP/SA; node counts only partly preserved]

23 MAO: Activity Mechanism I
Compound 1: Pargyline, y=3, has 579 & 81 & 177 but not 183
Compound 2: y=0, has feature 183 so violates OBSTree
Compound 3: y=0, falls in active node from RP
Compound 4: y=0, falls in active node from RP and RP/SA
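
The comparison of Compounds 1 and 2 is a direct check of a descriptor set against its chromosome. The sketch below uses the split 81,177,579,183 / 1,1,1,0 that appears on the tree slide. The `satisfies` helper is hypothetical, each compound is reduced to the descriptors relevant to this rule, and Compound 2's bits other than 183 are an assumption.

```python
def satisfies(features, descriptors, chromosome):
    """True if each descriptor's presence (1) or absence (0) matches the chromosome."""
    return all((d in features) == bool(bit)
               for d, bit in zip(descriptors, chromosome))


# OBSTree split taken from the tree slide: 81,177,579,183 / 1,1,1,0.
descriptors, chromosome = (81, 177, 579, 183), (1, 1, 1, 0)

pargyline = {81, 177, 579}        # has 579, 81, 177 but not 183
compound_2 = {81, 177, 579, 183}  # assumed to share 81/177/579; has 183

print(satisfies(pargyline, descriptors, chromosome))   # True
print(satisfies(compound_2, descriptors, chromosome))  # False
```

The final 0 in the chromosome is what excludes Compound 2. A rule built only from present features cannot express that exclusion.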

24 MAO: Activity Mechanism II
"Binding to active site"
– -N-N-C(=O)- is a hydrazine feature that can be hydrolyzed to bind protein (MAO) as a nonselective, irreversible inhibitor
Descriptors
– 1: C(1,0)-3-C(1,0)
– 579: C(2,1)-3-C(3,1)
– 1184: N(2,0)-2-N(2,0)
[Chemical structure drawing not preserved]

25 Absent Descriptor (809: C(3,1)-4-Br)
[Two chemical structures compared, one with Activity=3 and one with Activity=2; the C(3,1)-4-Br atom pair is marked]

26 Summary
OBSTree: new RP algorithm for obtaining simplified output
– Model presence and absence of molecular features
– Combination size is data-driven, varies over splits
– Penalty entropy splitting criterion for one-sided purity
– Weighted sampling during optimization allows prior information
Simpler verification of QSAR
Standard RP and RP/SA are special cases of OBSTree
Output is not deterministic
As with any RP output, care should be taken when interpreting the results
– Can miss highly correlated but important predictors
– Different trees provide similar partitions of the data
– Because of hard thresholding, predictions are highly variable
Computationally intensive!

27 Acknowledgements
Atina Brooks, North Carolina State University
Jiajun Liu, Merck
Haojun Ouyang, North Carolina State University
Abbott Laboratories
Jack Liu, OmicSoft
Jun Feng, NIH
GoldenHelix
