Presentation on theme: "Jacqueline M. Hughes-Oliver Department of Statistics"— Presentation transcript:
1 Analysis of High-Dimensional Structure-Activity Screening Datasets Using the Optimal Bit String Tree
Jacqueline M. Hughes-Oliver
Department of Statistics
North Carolina State University
*joint with Ke Zhang, GSK and Stan Young, NISS
This work was funded by the National Institutes of Health through the NIH Roadmap for Medical Research, Grant 1 P20 HG. Information on the Molecular Libraries Roadmap Initiative can be obtained from
2 Outline (Blackwell-Tapia - November 2008)
Background
Recursive partitioning
OBSTree
Simulation study
Screening for monoamine oxidase inhibitors
Summary
3 Background
Estimate a function [equation] such that [equation], based on [equation], where [equation]
Preferably, [equation] costs more than [equation]
7 Recursive Partitioning: Rules are complex
[Figure: recursive partitioning tree with numbered descriptor splits]
Are all splits necessary for the activity mechanism?
Does an early split impede identification of other mechanisms?
8 Recursive Partitioning: Focus of Study
Need definitions for:
  Search space
  Purity measure, splitting criterion
  Stopping rule
Binary Formal Inference-Based Recursive Modeling (BFIRM), Cho, Shen, Hermsmeier (2000, JCICS)
  Rank predictors according to F-test
  Combine important predictors to form splitting variable
  Result is better QSAR rules
Recursive Partitioning/Simulated Annealing (RP/SA), Blower et al. (2002, JCICS)
  Best single predictor not necessarily best in combination
Tree Harvesting, Yuan, Chipman, Welch (2006 tech report)
  “Trim” bits off each terminal node
9 Recursive Partitioning: RP/SA
Splitting variables are based on a combination of K predictors
Features are always present: search space of size [equation]
Uses simulated annealing (stochastic optimization)
K is held fixed for all splits, and is assumed known
10 OBSTree
Splitting variables are based on a combination of K predictors
Combine approaches of BFIRM and RP/SA
Features can be present or absent (chromosome selection): search space of size [equation]
Uses simulated annealing + weighted sampling + trimming
“K” can change for all splits, and is assumed unknown
Uses a penalty entropy splitting criterion
Usual stopping criteria applied, including cross validation
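One way to compare the two search spaces is to count candidate splitting variables directly. The sketch below assumes a split picks K of P descriptors, that RP/SA requires all K to be present, and that OBSTree lets each chosen descriptor be required present (1) or absent (0). The function names and the values of P and K are illustrative and do not come from the slides.

```python
from math import comb

def n_splits_present_only(p: int, k: int) -> int:
    """Count K-subsets of P descriptors (every chosen feature must be present)."""
    return comb(p, k)

def n_splits_present_or_absent(p: int, k: int) -> int:
    """Count K-subsets times the 2^K present/absent patterns (chromosomes)."""
    return comb(p, k) * 2 ** k

# Hypothetical sizes, chosen only to show the growth in the search space.
P, K = 500, 4
print(n_splits_present_only(P, K))
print(n_splits_present_or_absent(P, K))
```

Under these assumptions, allowing absence multiplies the number of candidates by 2^K, which is one reason an exhaustive search is replaced by stochastic optimization.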
11 OBSTree: Flowchart
Pre-OBSTree Setup
  Remove unary descriptors
  Determine Singly Important group
  Specify parameters
Descriptor Pool, divided via RP into:
  Singly Important Descriptors
  General Descriptors
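The setup steps can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: it assumes a unary descriptor is one that takes a single value across all compounds, and that weighted sampling draws distinct descriptors with probability proportional to a user-supplied weight. The descriptor names, the weights, and both helper functions are hypothetical.

```python
import random

def remove_unary(columns):
    """Drop descriptors that take only one value across all compounds."""
    return {name: col for name, col in columns.items() if len(set(col)) > 1}

def weighted_sample(names, weights, k, rng):
    """Draw k distinct descriptors, each draw proportional to its weight."""
    names, weights = list(names), list(weights)
    chosen = []
    for _ in range(k):
        i = rng.choices(range(len(names)), weights=weights)[0]
        chosen.append(names.pop(i))  # remove so it cannot be drawn twice
        weights.pop(i)
    return chosen

# Toy descriptor pool: "d1" is unary and is removed before sampling.
pool = remove_unary({"d1": [0, 0, 0, 0], "d2": [1, 0, 1, 0],
                     "d3": [0, 1, 1, 0], "d4": [1, 1, 0, 0]})
singly_important = {"d2"}  # hypothetical membership of the singly important group
weights = [3.0 if name in singly_important else 1.0 for name in pool]
print(weighted_sample(pool, weights, k=2, rng=random.Random(0)))
```

Giving the singly important group a larger weight is one way prior information could steer the search; the actual weights are a tuning choice.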
12 OBSTree: Flowchart
Pre-OBSTree Setup
  Remove unary descriptors
  Determine Singly Important group
  Specify parameters
Initialize split at next depth: depth = depth + 1
  Draw a set of K descriptors (X0) using WSS
  Determine best chromosome x0 of initial X0
Decision: depth = d, or node size < 2 min, or Ymax = 0, or Ybar > M-1?
  Yes: form last terminal node. STOP
  No: SA to determine “optimal” (XA, xA) for split using WSS
Trim
  Check 2^K - 1 subsets of current (XA, xA)
  Report best trimmed version as (X*, x*)
Split on X* = x*? (Yes / No)
  Form terminal node
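The trimming step checks every nonempty subset of the current K-descriptor rule, which gives 2^K - 1 candidates. A minimal sketch follows, assuming a rule is a tuple of (descriptor, required value) pairs and that lower scores are better. The `score` callable and the tie-break toward shorter rules are assumptions for illustration.

```python
from itertools import combinations

def nonempty_subsets(rule):
    """Yield every nonempty subset of a rule of (descriptor, value) pairs."""
    for size in range(1, len(rule) + 1):
        yield from combinations(rule, size)

def trim(rule, score):
    """Return the lowest-scoring subset; ties go to the shorter rule."""
    return min(nonempty_subsets(rule), key=lambda sub: (score(sub), len(sub)))

# A K = 4 rule: three descriptors required present, one required absent.
rule = ((81, 1), (177, 1), (579, 1), (183, 0))
print(sum(1 for _ in nonempty_subsets(rule)))  # 2**4 - 1 candidates

# Toy score (hypothetical): only the "183 absent" condition matters.
toy_score = lambda sub: 0 if (183, 0) in sub else 1
print(trim(rule, toy_score))
```

In a real run the score would be the splitting criterion evaluated on the node's compounds, so each of the 2^K - 1 candidates costs one pass over the data.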
13 OBSTree: Splitting Criterion
Node has N compounds
Class i has proportion p_i in the node, with a total of n_i in the node
Entropy (node impurity): [equation]
Penalty Entropy (penalize unwanted category): [equation]
Problem: Entropy = 0 (perfect) when a class of junk compounds is identified
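The problem with plain entropy can be seen numerically. The sketch below uses the standard Shannon entropy of the class proportions p_i (base 2 is an arbitrary choice here) and assumes counts are listed by response category with category 0 as the inactive "junk" class. It does not attempt the penalty entropy, whose exact form is not given in the transcript.

```python
from math import log2

def entropy(counts):
    """Shannon entropy (base 2) of a node, from its per-class counts n_i."""
    n = sum(counts)
    return sum(-(c / n) * log2(c / n) for c in counts if c > 0)

# Counts per response category 0/1/2/3 (category 0 assumed to be "junk").
pure_junk = [30, 0, 0, 0]
pure_active = [0, 0, 0, 33]
mixed = [15, 0, 0, 15]

print(entropy(pure_junk))    # 0.0
print(entropy(pure_active))  # 0.0
print(entropy(mixed))        # 1.0
```

A node of pure junk and a node of pure actives both score 0, so plain entropy rewards isolating inactive compounds just as much as isolating active ones. That is the motivation for a one-sided, penalized criterion.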
14 OBSTree: Stopping Criteria
Maximum depth d
The most active compound is junk
The node size is less than 2j (j is the minimum node size)
5-fold cross-validation, e.g., choose depth d if
  # correct classifications levels off at depth d
  Accept H0: p_{d+1} = 0 for p_{d+1} = sensitivity between depths d and d+1
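The first three criteria are simple checks on the current node. A minimal sketch, assuming responses are ordered category labels with 0 meaning junk; the cross-validation rule is left out because it needs a fitted tree at each depth. The function and parameter names are illustrative.

```python
def should_stop(depth, responses, max_depth, min_node_size):
    """Apply the three node-level stopping rules to one node."""
    if depth >= max_depth:                   # maximum depth d reached
        return True
    if len(responses) < 2 * min_node_size:   # node size below 2j
        return True
    if max(responses, default=0) == 0:       # most active compound is junk
        return True
    return False

print(should_stop(depth=1, responses=[0] * 19 + [3], max_depth=5, min_node_size=5))
print(should_stop(depth=1, responses=[0] * 20, max_depth=5, min_node_size=5))
```

The 2j threshold reflects that a node with fewer than 2j compounds cannot be split into two children that both meet the minimum size j.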
15 Simulation Study
1000 compounds, 500 binary descriptors
Four active groups (20 compounds per group), 8% active
Table: Activity Mechanisms, Potency, Descriptor Sets and Chromosomes
  I: 3 1 2 4 5
  II: 6 7 8 9
  III: 11 12 13 17
  IV: 15 16 18 19
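A dataset with the same dimensions can be generated as follows. This is a sketch under strong assumptions, not the authors' simulation design: the descriptor sets loosely follow the numbers on the slide, every chromosome is taken to be all-present, background bits are fair coin flips, inactive compounds are resampled until they match no mechanism, and potency is not modeled.

```python
import random

# Assumed mechanisms: descriptor index -> required value (all-present is a guess).
MECHANISMS = {
    "I": {d: 1 for d in (1, 2, 3, 4, 5)},
    "II": {d: 1 for d in (6, 7, 8, 9)},
    "III": {d: 1 for d in (11, 12, 13, 17)},
    "IV": {d: 1 for d in (15, 16, 18, 19)},
}

def matches(x, rule):
    """True if compound x (list of 0/1, descriptors numbered from 1) satisfies rule."""
    return all(x[d - 1] == v for d, v in rule.items())

def simulate(n=1000, p=500, group_size=20, seed=0):
    rng = random.Random(seed)
    X, y = [], []
    for name, rule in MECHANISMS.items():
        for _ in range(group_size):
            x = [rng.getrandbits(1) for _ in range(p)]
            for d, v in rule.items():
                x[d - 1] = v          # force the mechanism's chromosome
            X.append(x)
            y.append(name)
    while len(X) < n:                 # inactive compounds: reject accidental matches
        x = [rng.getrandbits(1) for _ in range(p)]
        if not any(matches(x, r) for r in MECHANISMS.values()):
            X.append(x)
            y.append(None)
    return X, y

X, y = simulate()
print(len(X), len(X[0]), sum(label is not None for label in y) / len(y))
```

The rejection step matters: with fair coin flips, a random compound would satisfy a four-descriptor rule about 1 time in 16, which would blur the 8% active rate.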
16 Simulation Study: Standard RP Tree
[Figure: standard RP tree for the simulated data with numbered descriptor splits; surviving node annotations: "5 compounds of", "compounds of 0", "7 compounds of 3"]
19 Simulation Study: Sensitivity Analysis
K, descriptor set size
  K > 7 perfectly found all mechanisms
  K = 7 perfectly found all but one mechanism
Basic tree parameters
  Min node size is 5
SA parameters
  Initial temperature
  Minimum temperature
  Temperature reduction rate
  # transitions at a given temperature
  # failures to accept new point before increasing transition counter
Sampling weights in WSS
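To show where the SA parameters enter, here is a generic simulated annealing loop with geometric cooling. It is a textbook sketch rather than the OBSTree search itself: the state is a toy 4-bit chromosome, the cost is the Hamming distance to a hypothetical target, and the failure counter from the slide is omitted. All names and default values are assumptions.

```python
import math
import random

def anneal(initial, cost, neighbor, t_init=1.0, t_min=1e-3, rate=0.9,
           n_transitions=50, seed=0):
    """Minimize cost(state) by simulated annealing with geometric cooling."""
    rng = random.Random(seed)
    current, current_cost = initial, cost(initial)
    best, best_cost = current, current_cost
    t = t_init                              # initial temperature
    while t > t_min:                        # minimum temperature
        for _ in range(n_transitions):      # transitions at this temperature
            cand = neighbor(current, rng)
            delta = cost(cand) - current_cost
            # Always accept improvements; accept worse moves with prob exp(-delta/t).
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                current, current_cost = cand, current_cost + delta
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
        t *= rate                           # temperature reduction rate
    return best, best_cost

TARGET = (1, 1, 1, 0)  # hypothetical "optimal" chromosome

def hamming_cost(state):
    return sum(a != b for a, b in zip(state, TARGET))

def flip_one_bit(state, rng):
    i = rng.randrange(len(state))
    return state[:i] + (1 - state[i],) + state[i + 1:]

best, best_cost = anneal((0, 0, 0, 0), hamming_cost, flip_one_bit)
print(best, best_cost)
```

A sensitivity analysis would rerun the search while varying `t_init`, `t_min`, `rate`, and `n_transitions` and compare the trees obtained.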
20 Screening to Identify MAO Inhibitors
Neuronal MAO deactivates neurotransmitters
Pargyline, an MAO inhibitor, was used to treat depression
MAO inhibitors no longer used due to toxicity & interactions
Abbott Laboratories dataset of MAO inhibitors, Brown & Martin (1996 JCICS)
  1646 chemically diverse compounds
  1380 binary 2D atom-pair descriptors
  Response variable: 0, 1, 2, 3 (ordered data) [1358/114/86/88]
  Category 3 has 2 well-known mechanisms, Rusinko et al. (1999 JCICS)
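21 [Figure: first splits of the OBSTree, RP/SA, and RP trees for the MAO data, with node counts by response category 0/1/2/3]
OBSTree: descriptors 81, 177, 579, 183 with chromosome 1,1,1,0 (terminal node 0/0/0/33); then descriptors 1, 579, 1184, 809 with chromosome 1,1,1,0 (terminal node 0/0/0/15)
RP/SA: descriptors 184, 721, 879 (terminal node 0/1/0/61); then descriptors 32, 572, 844 (terminal node 1/0/1/26)
RP: single-descriptor splits; highest active terminal node 2/0/0/32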
22 MAO: Activity Mechanism I
“Irreversible binding to flavin cofactor of MAO”
Pargyline-like compounds
Typical features of pargyline-like compounds
  A triple bond
  A tertiary nitrogen
  An aromatic ring
1st terminal node of OBSTree: descriptors 81, 183, 177, 579; class counts 0/0/0/33
Highest active terminal node of RP: class counts 2/0/0/32
1st terminal node of RP/SA: descriptors 184, 721, 879; class counts 0/1/0/61
23 MAO: Activity Mechanism I
Compound 1: Pargyline, y=3, has 579 & 81 & 177 but not 183
Compound 2: y=0, has feature 183 so violates OBSTree
Compound 3: y=0, falls in active node from RP
Compound 4: y=0, falls in active node from RP and RP/SA
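The comparison of compounds 1 and 2 amounts to checking a presence/absence rule against each compound's feature set. A minimal sketch, using the first OBSTree rule (81, 177, and 579 present, 183 absent). The feature set given for compound 2 is hypothetical beyond the stated fact that it has feature 183.

```python
def satisfies(features, rule):
    """features: set of descriptor ids present; rule: descriptor -> required 0/1."""
    return all((d in features) == bool(v) for d, v in rule.items())

OBSTREE_RULE = {81: 1, 177: 1, 579: 1, 183: 0}

pargyline = {81, 177, 579}        # has 579 & 81 & 177 but not 183
compound_2 = {81, 177, 579, 183}  # assumed features; only "has 183" is stated

print(satisfies(pargyline, OBSTREE_RULE))   # True
print(satisfies(compound_2, OBSTREE_RULE))  # False
```

A rule that can require absence excludes compound 2, which a presence-only rule on 81, 177, and 579 would admit.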
24 MAO: Activity Mechanism II
“Binding to active site”
-N-N-C(=O)- is a hydrazine feature that can be hydrolyzed to bind protein (MAO) as a nonselective, irreversible inhibitor
[Figure: example structure annotated with atom-pair descriptors]
  579: C(2,1)-3-C(3,1)
  1: C(1,0)-3-C(1,0)
  1184: N(2,0)-2-N(2,0)
26 Summary
OBSTree: new RP algorithm for obtaining simplified output
Model presence and absence of molecular features
Combination size is data-driven, varies over splits
Penalty entropy splitting criterion for one-sided purity
Weighted sampling during optimization allows prior information
Simpler verification of QSAR
Standard RP and RP/SA are special cases of OBSTree
Output is not deterministic
As with any RP output, care should be taken when interpreting the results
Can miss highly correlated but important predictors
Different trees provide similar partitions of the data
Because of hard thresholding, predictions are highly variable
Computationally intensive!
27 Acknowledgements
Atina Brooks, North Carolina State University
Jiajun Liu, Merck
Haojun Ouyang, North Carolina State University
Abbott Laboratories
Jack Liu, OmicSoft
Jun Feng, NIH
GoldenHelix