Fast Descriptor Calculation for Combinatorial Libraries Geoff Downs & John Barnard Sheffield, UK.

Descriptor Generation for Combinatorial Libraries
- Need to calculate structure descriptors for large virtual libraries
  - Subset selection is better in property space than in reactant (precursor) space
- Full library enumeration, followed by descriptor calculation for single molecules, can be slow
- Direct analysis of the Markush representation of the library
  - can offer order-of-magnitude speedups
  - gives accurately calculated descriptors for all molecules in the library

Markush Structures
- Scaffold plus R-groups
- Each R-group alternative (n_i) shown once
- Convenient for input and display
- Markush is O(Σ n_i)
  - 1 core + 100 R1 + 100 R2 + 100 R3 = 301
- Enumeration is O(Π n_i)
  - 1 core × 100 R1 × 100 R2 × 100 R3 = 1,000,000
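The size comparison on this slide can be checked with a few lines of Python. This is only an arithmetic sketch; the variable names are illustrative and not taken from the authors' software.

```python
from math import prod

# Number of alternatives at each R-group position (values from the slide).
r_group_counts = [100, 100, 100]

# Markush representation: one core plus each alternative stored once.
markush_size = 1 + sum(r_group_counts)

# Full enumeration: one product for every combination of alternatives.
enumerated_size = prod(r_group_counts)

print(markush_size, enumerated_size)
```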

Direct Analysis of Markush
- Avoid multiple analysis of common/repeated parts
- Time and space advantages
  - do as much work as possible in Σ n_i
  - work in Π n_i only when absolutely necessary
- Generate partial descriptors from individual building blocks, and overlaps between them
- Combine these using appropriate logic to form full descriptors for individual products
- Applicable where descriptors are “additive” in nature

Two-stage Descriptor Generation from Markush Structures
1. Analyse core and R-group alternatives
   - Build intermediate representation of “partial descriptors”
   - Some partial descriptors may involve overlap between core and R-group(s)
   - O(Σ n_i) [Sigma Phase]
2. Assemble “full” descriptor for each individual molecule in library
   - Usually simple addition, concatenation or logical OR of partial descriptors
   - O(Π n_i) [Pi Phase]
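The two stages might be sketched as follows, assuming a purely additive descriptor with no overlap terms. The function names `sigma_phase` and `pi_phase`, the toy building blocks and the heavy-atom-count descriptor are all hypothetical choices for illustration, not the authors' implementation.

```python
from itertools import product

def sigma_phase(core, r_groups, partial):
    """Compute one partial descriptor per building block: O(sum n_i) calls."""
    core_part = partial(core)
    r_parts = [[partial(alt) for alt in alts] for alts in r_groups]
    return core_part, r_parts

def pi_phase(core_part, r_parts, combine):
    """Assemble a full descriptor for every product: O(prod n_i) combinations."""
    for choice in product(*r_parts):
        yield combine(core_part, choice)

# Toy data (hypothetical): each building block is a list of element symbols.
core = ["C", "C", "N", "O"]
r_groups = [
    [["H"], ["C"], ["Cl"]],   # R1 alternatives
    [["H"], ["Br"]],          # R2 alternatives
]

# Additive descriptor used for the demo: heavy-atom count.
heavy_atoms = lambda block: sum(1 for atom in block if atom != "H")

core_part, r_parts = sigma_phase(core, r_groups, heavy_atoms)
totals = list(pi_phase(core_part, r_parts, lambda c, parts: c + sum(parts)))
print(totals)
```

The descriptor function is called 1 + 3 + 2 = 6 times, while the assembly loop runs over 3 × 2 = 6 products; with 100 alternatives per position the same ratio becomes 301 calls against 1,000,000 cheap additions.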

Descriptors from Markush
- Previously described structure fingerprint generation
  - based on dictionary of predefined fragments
  - “Partial” fingerprints for relevant building blocks ORed together for each specific structure
- More recent work on calculation of property values
  - “Lipinski” properties
  - topological indices
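ORing partial fingerprints could look roughly like the sketch below. The bit patterns are made up for illustration (bit k stands for a hypothetical dictionary fragment k), and fragments that span two building blocks, which would need their own overlap fingerprints, are ignored here.

```python
from itertools import product

# Hypothetical partial fingerprints stored as Python integers used as bit sets.
core_fp = 0b0000_0111
r1_fps = {"H": 0b0000_0000, "CH3": 0b0000_1000}
r2_fps = {"Cl": 0b0001_0000, "OH": 0b0010_0100}

# Full fingerprint of each product = bitwise OR of its partial fingerprints.
library_fps = {
    (r1, r2): core_fp | fp1 | fp2
    for (r1, fp1), (r2, fp2) in product(r1_fps.items(), r2_fps.items())
}

for key, fp in library_fps.items():
    print(key, format(fp, "08b"))
```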

Markush Analysis Software

Internal Markush Representation
- Data structure held in memory only while needed for analysis
  - Separate building blocks (“partial structures”) with logical relationships
  - Several (non-independent) substituent groups may be included in a single structural variable
- Can be built from various input formats
  - “Markush-type” input (e.g. RGfile, cSLN) imported directly
  - Generic reaction and precursor input is more complex
- Representation may be “optimised” for efficient processing

Reaction/Precursor input
- Build Markush incrementally, one reaction step at a time
- Each step modifies core and adds an R-group (clipped reagents)
- Input modules based on Daylight reaction toolkit implemented
  - http://www.daylight.com/meetings/mug00/Barnard
- Module based on Accord SDK under development

SMILES Enumeration
- Markush analysis can be used for fast enumeration of non-canonical SMILES for library members
- Based on SMILES trick: “C1.C1” is equivalent to “CC”
  - dot separates the two carbon atoms
  - “ring closure” numerals join them up again
- Sigma phase: Generate Partial SMILES for each Partial Structure (PS)
  - Use unsatisfied ring closure numeral for bonds outside the Partial Structure
- Pi phase: Concatenate Partial SMILES from each relevant PS

SMILES Enumeration
core                            R1        R2
O=C%12Nc1ccc(ccc1)C(=O)OC%11.   C%12.     [H]%11
O=C%12Nc1ccc(ccc1)C(=O)OC%11.   C%12.     Cl%11
O=C%12Nc1ccc(ccc1)C(=O)OC%11.   C%12Br.   [H]%11
O=C%12Nc1ccc(ccc1)C(=O)OC%11.   C%12Br.   Cl%11
- Sigma phase: fast generation of partial SMILES
  - < 0.04s for 100 × 100 × 100 = 1M benzodiazepine library
- Pi phase: simple concatenation of partial SMILES
  - 38,775 structures per sec (SGI R10k)
  - Producing canonical SMILES slows down enumeration by factor of 45
    o Each individual molecule must be separately canonicalised
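A minimal sketch of the Pi-phase concatenation, assuming the partial SMILES have already been generated. The small acetamide-like core and its R-group alternatives are hypothetical, chosen only to keep the strings short, and `enumerate_smiles` is an illustrative name.

```python
from itertools import product

# Hypothetical partial SMILES. %11 and %12 are ring-closure numerals left
# "open" in the core; each R-group closes exactly one of them.
core = "O=C%11N%12"
r1 = ["C%11", "CC%11"]      # alternatives bonded to the carbonyl carbon
r2 = ["[H]%12", "C%12"]     # alternatives bonded to the nitrogen

def enumerate_smiles(core, *r_groups):
    """Yield one dot-joined, non-canonical SMILES per library member."""
    for choice in product(*r_groups):
        yield ".".join((core, *choice))

library = list(enumerate_smiles(core, r1, r2))
for smiles in library:
    print(smiles)
```

Each output string is a valid but non-canonical SMILES: the dots split the building blocks apart and the matching numerals bond them together again, so no graph manipulation is needed during enumeration.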

Lipinski Property Generation
- Molecular weight
  - trivial addition of partial molecular weights
- Count of aromatic rings
  - addition of partial counts – optimisation of internal representation ensures that aromatic rings are not split between building blocks
- Hydrogen bond donor/acceptor counts
  - Partial counts may depend on combination of more than one R-group (e.g. where H is an alternative)
  - “Overlap” terms (combinations of building blocks) may need to be included in addition
  - HBD/HBA definitions can be customised
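An overlap term for the donor count might be handled as in the sketch below. The scenario is assumed for illustration: a core with one N-H donor of its own and an R1 position attached to a second core nitrogen, so that choosing H as the R1 alternative creates an extra N-H donor that belongs to neither building block alone. All counts and names are hypothetical.

```python
# Partial hydrogen-bond-donor counts (hypothetical values).
core_hbd = 1                                  # one N-H already inside the core
r1_hbd = {"H": 0, "CH3": 0, "OH": 1}          # donors inside each R1 alternative

# Overlap term: depends on the core attachment atom AND the chosen R1.
r1_overlap_hbd = {"H": 1}                     # N + H gives an extra N-H donor

def hbd(r1):
    """Full donor count = core part + R-group part + overlap part."""
    return core_hbd + r1_hbd[r1] + r1_overlap_hbd.get(r1, 0)

for alt in r1_hbd:
    print(alt, hbd(alt))
```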

Lipinski Property Generation
- Rotatable bond counts
  - Some complexities for bonds between core and R-group
  - RB = any single bond except to a terminal atom (H, Cl etc.) or a terminal group (CH3, NO2 etc.)
  - In example, R1 to ring single bond is not rotatable when R1 is CH3 or R2 and R3 are identical terminal atoms

Lipinski Property Generation
- logP
  - Used SlogP atom-contribution method
    o Wildman & Crippen, JCICS 1999, 39, 868-873
    o 68 atom types (+ 4 supplemental) defined as SMARTS patterns
  - Atom types redefined as BCI Fragment Dictionary, e.g. [CH3][(N,O,S,P,F,Cl,Br,I)] => C as X
    o 797 fragments (644 AA + 26 AS + 131 direct assignment)
    o Charged N,O and intermediate C, X, Y atom types
  - Some atom types require examination of neighbouring building blocks

Lipinski Generation Timings
- 100 × 100 × 100 = 1M benzodiazepine library
- SGI R10000
- Sigma phase: calculation of partial property values
  - <0.04s
- Pi phase: assembly and output of full property values
  - 95.59s for all 1M molecules
  - 10,461 molecules/s

Topological Index Generation
- Many topological indices are based on summing the terms for small parts of structure
- Simple extra calculation needed at end for some indices
- Several implemented (others under development)
  o Kier Chi connectivity indices; any order
  o Counts of different subgraph types; any order
  o Kier Kappa and Phi shape indices
  o Zagreb index
  o (Wiener index and Balaban (JX, JY) indices)
- Hosoya Index not amenable to Markush approach
  o Requires analysis of full molecule

Kier Index Generation
- Sigma Phase
  - Identify all subgraphs up to n bonds (n is maximum index order)
  - Count number of subgraphs of different types, and calculate contributions to Chi indices
- Pi Phase
  - Sum appropriate subgraph counts and index contributions for each molecule
  - Kappa and Phi shape indices calculated from low-order subgraph counts
- Sigma phase is significantly slower than for Lipinski properties and fingerprints
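To show why such indices are "additive", the sketch below computes two of them for single molecules given as hydrogen-suppressed bond lists. It assumes the simple (non-valence) first-order chi index and the first Zagreb index; the slides do not say which variants were implemented, and this is not the Markush-based code. Each bond or atom contributes an independent term, which is what allows terms inside a building block to be computed once.

```python
from math import sqrt

def degrees(bonds):
    """Map each atom to its number of heavy-atom neighbours."""
    deg = {}
    for a, b in bonds:
        deg[a] = deg.get(a, 0) + 1
        deg[b] = deg.get(b, 0) + 1
    return deg

def chi1(bonds):
    """First-order simple chi: sum over bonds of 1/sqrt(deg_i * deg_j)."""
    deg = degrees(bonds)
    return sum(1.0 / sqrt(deg[a] * deg[b]) for a, b in bonds)

def zagreb_m1(bonds):
    """First Zagreb index: sum over atoms of the squared degree."""
    return sum(d * d for d in degrees(bonds).values())

# Hydrogen-suppressed graphs as lists of (atom, atom) bonds.
isobutane = [(0, 1), (0, 2), (0, 3)]
n_butane = [(0, 1), (1, 2), (2, 3)]

print(chi1(isobutane), zagreb_m1(isobutane))
print(chi1(n_butane), zagreb_m1(n_butane))
```

Bonds at an attachment point depend on the degree of atoms in two building blocks, so in a Markush setting those terms would presumably be handled as overlap contributions.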

Chi Index Sigma Phase Timings

Slowdown at higher orders – number of subgraphs
- Exponential increase in number of subgraphs at higher orders
  - Also a problem when handling specific structures
- Subgraph Types
  - P (Path) – nodes have 1 or 2 connections
  - C (Cluster) – nodes have 1 or 3 connections
  - PC (Path/Cluster) – nodes have 1, 2 or 3 connections
  - CH (Chain) – subgraph contains a ring
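Counting the simplest subgraph type, P (Path), can be sketched with a depth-first search over a single hydrogen-suppressed molecule. The function name and the adjacency-dictionary input are assumptions for illustration; cluster, path/cluster and chain subgraphs need more elaborate enumeration and are not covered.

```python
def count_paths(adj, order):
    """Count simple paths containing `order` bonds (order >= 1)."""
    if order < 1:
        raise ValueError("order must be at least 1")

    def extend(path, remaining):
        # Grow the path one bond at a time without revisiting atoms.
        if remaining == 0:
            return 1
        return sum(
            extend(path + [nbr], remaining - 1)
            for nbr in adj[path[-1]]
            if nbr not in path
        )

    # Every path is discovered once from each end, hence the halving.
    return sum(extend([start], order) for start in adj) // 2

n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

for name, adj in (("n-butane", n_butane), ("isobutane", isobutane)):
    print(name, [count_paths(adj, m) for m in (1, 2, 3)])
```

The search visits every partial path, so its cost grows with the number of subgraphs found, which is the slowdown the slide describes.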

Explosion in Number of Subgraphs

Slowdown at higher orders – number of R-groups
- Higher order subgraphs can involve core and multiple R-groups
  - Order 6 PC can involve all three R-groups
- Depends on how well-separated R-groups are

Speeding-up Kier Index Generation
- Limit maximum order for subgraph counts and Kier connectivity indices
- Avoid identifying PC/CH subgraphs if these indices are not required
  - not available yet
  - some complications as subgraphs can change type as bonds are added

Clustering Library Members
- Previously described clustering of library members on the basis of fingerprints
- Lipinski properties and topological indices can also be used as basis for clustering
  - Descriptors are re-generated from partial descriptors as needed (Pi phase) and need not be stored
- K-means relocation method needs O(N) time
  - Non-hierarchical clustering method
  - Produces high-quality clusters
  - User specifies required number of clusters
  - Results can depend on random selection of cluster seeds
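The idea of clustering without storing descriptors might be sketched as below. This is a plain batch k-means written for illustration, not necessarily the relocation variant the authors used; the `kmeans` function, the one-dimensional toy descriptor and the partial values are all hypothetical. Each pass over the library is O(N), and every descriptor is rebuilt from partial values when it is needed.

```python
import random
from itertools import product

def kmeans(keys, descriptor, k, iterations=10, seed=0):
    """Cluster `keys`, calling descriptor(key) on demand instead of storing vectors."""
    rng = random.Random(seed)
    # Random seed selection: results can depend on this choice.
    centroids = [descriptor(key) for key in rng.sample(keys, k)]
    assignment = {}
    for _ in range(iterations):
        sums = [[0.0] * len(centroids[0]) for _ in range(k)]
        counts = [0] * k
        for key in keys:                       # one O(N) pass per iteration
            vec = descriptor(key)              # regenerated, never stored
            best = min(
                range(k),
                key=lambda c: sum((v - m) ** 2 for v, m in zip(vec, centroids[c])),
            )
            assignment[key] = best
            counts[best] += 1
            for i, v in enumerate(vec):
                sums[best][i] += v
        # Move each centroid to the mean of its members (keep it if empty).
        centroids = [
            [s / counts[c] for s in sums[c]] if counts[c] else centroids[c]
            for c in range(k)
        ]
    return assignment

# Hypothetical partial property values for two R-group positions.
r1_vals = [0.0, 0.1, 10.0, 10.1]
r2_vals = [0.0, 0.2]
keys = list(product(range(len(r1_vals)), range(len(r2_vals))))
descriptor = lambda key: (r1_vals[key[0]] + r2_vals[key[1]],)

labels = kmeans(keys, descriptor, k=2)
print(labels)
```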

Current Work: Library Overlap
- Work in progress to identify the overlap between combinatorial libraries
- Identify specific compounds in common
  - expressed as another Markush structure
- “Brute force” algorithm would
  - Fully enumerate libraries involved
  - Compare lists of (e.g.) canonical SMILES for common members

Library Overlap
- Markush algorithm originally designed for structure search in chemical patents
  - uses “reduced graph” representation of Markush
    o avoids “segmentation problem” (different boundaries between R-group and scaffold)
  - eliminates non-matching parts very rapidly
  - slower (atom-by-atom) check to confirm matches
    o worst case is matching library against itself
- Implementation in software toolkit form
  - can be incorporated into users’ software
  - could form basis for Markush Registration/Search system

Potential Future Work: 3D Conformation Generation
- Preliminary discussions with Gasteiger group (Univ. Erlangen) on linking Markush approach with CORINA
- CORINA works by
  - separating cyclic and acyclic components
  - establishing conformation for each independently
  - linking them back together
  - checking and adjusting for steric crowding
- Some analogies with Markush approach
  - First two steps are equivalent to Sigma phase
  - Last two steps are equivalent to Pi phase

References
- Barnard, J. M.; Downs, G. M.; von Scholley-Pfab, A.; Brown, R. D. “Use of Markush structure analysis techniques for descriptor generation and clustering of large combinatorial libraries.” J. Mol. Graph. Modelling 2000, 18 (4/5), 452-463
- Reactions → Markush (Daylight MUG00 meeting): http://www.daylight.com/meetings/mug00/Barnard
- P.S. we are recruiting too… http://www.bci1.demon.co.uk
Copyright © Barnard Chemical Information Ltd., 2001
