1 Chemical Structure Representation and Search Systems Lecture 5. Nov 13, 2003 John Barnard Barnard Chemical Information Ltd Chemical Informatics Software.

1 Chemical Structure Representation and Search Systems
Lecture 5. Nov 13, 2003
John Barnard
Barnard Chemical Information Ltd
Chemical Informatics Software & Consultancy Services
Sheffield, UK

2 Lecture 5: Topics to be Covered
• Reaction searching
  o atom-atom mapping
  o Maximal Common Substructure search
• 3D substructure search
• Searching Markush structures in patents
  o nature and origin of Markush structures
  o fragment codes
  o topological systems (MARPAT, Markush DARC)

3 Searching Chemical Reactions
• each database entry contains several molecules
  - reactants
  - products
  - catalysts
  - solvents etc.
• may want query substructure confined to one of these
  - can be done by assigning role indicator to each molecule
• but role indicators are not enough on their own for a useful reaction search system

4 Reaction search
• Query: [structure diagram not included in transcript]

5 Reaction search
• Query: [structure diagram not included in transcript]
• “Hit”: [structure diagram not included in transcript]
• We didn’t get what we wanted because the hydroxyl in the product did not involve the same oxygen as the ketone in the reactant
• We need to “map” the atoms between the reactant and product

6 Atom mapping
• atoms on each side of the reaction can be numbered to show which corresponds to which
  - similar mappings can be used in the query
• automatic assignment of atom mapping is very important in reaction indexing systems
  - problem is obviously related to finding a graph isomorphism between reactant and product sides
  - except that the two sides are NOT isomorphic

7 Maximal common subgraph
• atoms and bonds in red (in the slide diagram) represent the largest subgraph that is common to both sides
  - all these atoms have same neighbours on both sides
  - none of these bonds are made or broken
• remaining atoms and bonds represent reaction site
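
Once an atom-atom mapping is available, the reaction site described above can be read off by comparing the bond lists of the two sides. The following is a minimal sketch, assuming each side is given as a dictionary from a pair of atom-map numbers to a bond order; the function name, the data layout and the example molecule (a ketone reduced to an alcohol, heavy atoms only) are illustrative assumptions, not taken from the lecture.

```python
def reaction_centre(reactant_bonds, product_bonds):
    """Compare mapped bond lists and return (broken, made, changed, centre_atoms).

    Each argument maps frozenset({map_a, map_b}) -> bond order (assumed layout).
    """
    broken = {b for b in reactant_bonds if b not in product_bonds}
    made = {b for b in product_bonds if b not in reactant_bonds}
    changed = {b for b in reactant_bonds
               if b in product_bonds and reactant_bonds[b] != product_bonds[b]}
    # Atoms touched by any broken, made or changed bond form the reaction site.
    centre_atoms = set().union(*broken, *made, *changed)
    return broken, made, changed, centre_atoms


# Hypothetical example: C1-C2(=O3)-C4 reduced to C1-C2(-O3)-C4, same oxygen.
reactant = {frozenset({1, 2}): 1, frozenset({2, 3}): 2, frozenset({2, 4}): 1}
product = {frozenset({1, 2}): 1, frozenset({2, 3}): 1, frozenset({2, 4}): 1}

broken, made, changed, centre = reaction_centre(reactant, product)
print(sorted(centre))  # prints [2, 3]
```

Everything outside `centre` belongs to the common subgraph, which is why a reaction query can insist that a mapped atom (here the oxygen, map number 3) is the same atom on both sides.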

8 Maximal common subgraph
• Finding the MCS between two graphs is an NP-complete problem
  - even worse than subgraph isomorphism because you don’t know in advance how big the subgraph will be
  - exhaustive backtracking is prohibitively slow
  - the best algorithms find an approximate solution (i.e. a large, but not necessarily maximal, subgraph)
  - tricks can be used to determine an upper bound for the size of the MCS (so you can stop looking when you’ve found one of this size)
  - new algorithm published 2002
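
The lecture does not say which upper-bound tricks it has in mind. One very simple bound, shown here only as an illustration, is that a common subgraph cannot contain more atoms of a given element than the poorer of the two molecules has. The function name and the list-of-element-symbols input are assumptions made for this sketch.

```python
from collections import Counter


def mcs_upper_bound(elements_a, elements_b):
    """Upper bound on the number of atoms in any common subgraph.

    For each element, at most min(count in A, count in B) atoms can be matched.
    """
    counts_a, counts_b = Counter(elements_a), Counter(elements_b)
    return sum(min(n, counts_b[e]) for e, n in counts_a.items())


# Heavy atoms of ethanol and acetic acid (hypothetical example input).
ethanol = ["C", "C", "O"]
acetic_acid = ["C", "C", "O", "O"]
bound = mcs_upper_bound(ethanol, acetic_acid)
print(bound)  # prints 3
```

A search that finds a common subgraph with exactly `bound` atoms can stop immediately; tighter bounds (for example ones that also compare bond or neighbourhood counts) prune more, at the cost of more bookkeeping.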

9 Applications of MCS
• MCS algorithms can be applied to things other than atom-atom mapping in reactions
  - structural similarity between molecules
    o size of MCS (relative to size of molecules) can be used as measure of similarity of molecules
  - approximate match searches
    o search for molecules containing at least 80% of query substructure
  - multiple maximal common substructure
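
The slide does not fix a formula for "size of MCS relative to size of molecules". A common choice, used here as an assumption, is a Tanimoto-style ratio of the MCS size to the size of the union of the two molecules; the 80% approximate-match test is sketched alongside it. Sizes are atom counts and both function names are hypothetical.

```python
def mcs_similarity(n_mcs, n_a, n_b):
    """Tanimoto-style similarity: |MCS| / (|A| + |B| - |MCS|), in [0, 1]."""
    return n_mcs / (n_a + n_b - n_mcs)


def approximate_match(n_mcs, n_query, threshold=0.8):
    """True if the MCS covers at least `threshold` of the query substructure."""
    return n_mcs / n_query >= threshold


# Hypothetical sizes: molecules of 8 and 10 atoms sharing a 6-atom MCS.
print(mcs_similarity(6, 8, 10))   # prints 0.5
print(approximate_match(8, 10))   # prints True
```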

10 Multiple MCS
• largest substructure common to whole set of molecules
  - can be used to extract “core” for a Markush structure
  - might represent features important for biological activity
  - even more difficult than MCS of two molecules
    o unfortunately it doesn’t work to find MCS of first two, and then MCS between that and the third, etc.

11 3-D substructure search
• Analogous to 2-D substructure search
  - need to find atoms in correct spatial orientation relative to each other
    o some fuzziness (tolerance) permitted in distance values
  - query can be defined as a group of atoms, with specified interatomic distances
    o sometimes called a pharmacophore
  - both query and database structures can be shown as topological graphs in which the nodes are atoms, but the edges are interatomic distances

12 3-D substructure searching
• the interatomic distances are the labels on the edges
• graph is fully-connected (an edge between every pair of nodes)
• the graph edges do not correspond to bonds in the molecule
• matching is then a process of subgraph isomorphism between such graphs
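
The matching described above can be sketched with a deliberately naive search: try every assignment of query nodes to database atoms and keep those where the atom types agree and every query distance is reproduced within a tolerance. This brute-force loop is only an illustration of the distance-labelled graph idea, not the algorithm used by any real system; the function name, data layout, tolerance and coordinates are assumptions.

```python
from itertools import permutations
from math import dist


def find_3d_matches(query_types, query_dists, atoms, tol=0.5):
    """Return all mappings (tuples of atom indices) that satisfy the query.

    query_types: element symbol for each query node.
    query_dists: {(i, j): required distance between query nodes i and j}.
    atoms:       list of (element, (x, y, z)) for the database structure.
    """
    hits = []
    for mapping in permutations(range(len(atoms)), len(query_types)):
        # Node labels (atom types) must agree.
        if any(atoms[a][0] != t for a, t in zip(mapping, query_types)):
            continue
        # Edge labels (distances) must agree within the tolerance.
        if all(abs(dist(atoms[mapping[i]][1], atoms[mapping[j]][1]) - d) <= tol
               for (i, j), d in query_dists.items()):
            hits.append(mapping)
    return hits


# Hypothetical three-point query: N...O 3.0, N...O 4.0, O...O 5.0 (arbitrary units).
query_types = ["N", "O", "O"]
query_dists = {(0, 1): 3.0, (0, 2): 4.0, (1, 2): 5.0}
molecule = [("N", (0.0, 0.0, 0.0)), ("O", (3.0, 0.0, 0.0)),
            ("O", (0.0, 4.0, 0.0)), ("C", (1.0, 1.0, 1.0))]

hits = find_3d_matches(query_types, query_dists, molecule)
print(hits)  # prints [(0, 1, 2)]
```

The cost grows factorially with query size, which is why practical systems put a screening stage in front of a proper subgraph isomorphism or clique-detection step.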

13 3D substructure searching
• subgraph isomorphism involving fully-connected graphs is computationally more demanding than for 2D substructure search
  - Ullmann’s algorithm performs well
  - other approaches (e.g. clique detection) have also been used
• fingerprint-like screening stages can also be applied in the search, based on 3D-fragments such as 3-point pharmacophores
  - screens based on torsion and valence angles have also been used
Willett, P. Three-Dimensional Chemical Structure Handling. Wiley: New York (1991)

14 Chemical patents
• Contract between inventor and State to encourage innovation
  - Inventor reveals nature of invention
  - State grants protected monopoly over its exploitation for limited period
• Invention must be novel, useful and non-obvious
  - new ways of making compounds
  - new compounds with useful properties (therapeutic uses)
• Essential for success of pharmaceutical industry
• Knowledge of existing patents (prior art) essential to avoid fruitless development

15 Chemical patents
• May claim single product or process
• More usually claim class of products or processes
  - to ensure protection for closely-related compounds etc.
• Very broad claims can disguise true nature of invention
  - But may claim compounds which lack claimed activity
  - Nested series of claims (A, preferably B, more preferably C etc.) can provide “fallback” positions
• Extremely broad claims have become more common as Patent Offices moved to publication before examination
Sibley, J. F. “Too broad generic disclosures: a problem for all” J. Chem. Inf. Comput. Sci. 1991, 31 (1), 5-8

16 R1-X-R36
• R1 is a substituted or unsubstituted, mono-, di- or polycyclic, aromatic or non-aromatic carbocyclic or heterocyclic ring system, or…
• X is a single or double bond, substituted or unsubstituted heteroatom, or substituted carbon atom, or substituted or unsubstituted chain of two or more carbon atoms and/or heteroatoms…
• R36 is substituted or unsubstituted asymmetrical heterocyclic ring system having at least 3 nitrogens…
[Structure 32 from Claim 105 of PCT Application 8704321, claimed as novel]

17 The patent explosion
• Originally only granted patents published.
• Belgium (1950s), Netherlands (1964) and EPO (1978) moved to publishing all patent applications.
• Rapid publication makes information available very quickly.
• Huge number of patents, many low quality, insufficient or incorrect details, no novelty.
• Less work for patent examiners but greater problems for retrieval systems.

18 Structural information in chemical patents
• Uses mixture of:
  - 2D structure diagrams
  - linear formulae (e.g. “C2H5”, “EtOH”)
  - specific nomenclature (e.g. “phenyl”, “isopropyl”)
  - generic nomenclature (e.g. “alkyl”, “heteroaryl”)
  - non-structural expressions (e.g. “pharmaceutically acceptable cation”, “group known in the art”)
• Many machine readable systems just show structural information as free text and images

19 Specific Structures from Patents
• Several databases contain specific molecules claimed in patents
  - Chemical Abstracts Registry
  - Derwent Registry
  - MDL announced major new database Nov 2003
    o will include reactions, molecules and Markush display
    o http://www.mdl.com/company/news/press_releases/2003/pr_patentdb_07nov03.jsp

20 Markush Structures
• also known as “Generic Structures” or “R-group Structures”
• chemical structures involving variable parts

21 Markush Structures
• compact representation of a set or class of specific compounds with common structural features
• used in
  - chemical patents
  - query structures in substructure search systems
  - Quantitative Structure-Activity Relationship (QSAR) analysis
    o class of related compounds with activity data
  - combinatorial libraries
    o rapid synthesis of large numbers of related compounds
  - legislation (controlled drugs, chemical weapons)

22 Variability in Markush Structures
• s-variation (substituent variation)
  - list of alternative values for an R-group
• p-variation (position variation)
  - variable point of attachment
• f-variation (frequency variation)
  - multiple occurrence of groups
• h-variation (homology variation)
  - generically described group (e.g. “alkyl”)
  - potentially infinite set of specific alternatives

23 Types of variation
• substituent variation: R1 is methyl or ethyl
• homology variation: R2 is alkyl
• position variation: R3 is amino
• frequency variation: m is 1-3

24 Types of Markush structure
[figure not included in transcript]

25 Markush Structures
• Compact representation for sets of molecules
  - common parts shown once only
• Can be considered as formal “grammar” for generating valid molecules (“sentences”)
• Enumeration of coverage usually impractical and often impossible (infinite sets)
• Appropriate algorithms for handling take advantage of Markush representation:
  - Avoid enumeration (especially infinite sets)
  - Compare finite grammars rather than infinite sets of valid sentences
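
To see why enumeration is usually avoided, consider a toy Markush structure with substituent variation only. The number of specific compounds it covers is the product of the numbers of alternatives, so it can be counted without generating a single molecule. The R-group names and values below are made up for illustration, and the sketch ignores nested R-groups and symmetry-equivalent duplicates.

```python
from itertools import product
from math import prod

# Hypothetical R-group definitions (s-variation only).
r_groups = {
    "R1": ["methyl", "ethyl"],
    "R2": ["H", "F", "Cl", "Br"],
    "R3": ["amino", "nitro", "hydroxy"],
}


def coverage_size(r_groups):
    """Count the specific compounds covered, without enumerating them."""
    return prod(len(values) for values in r_groups.values())


def enumerate_members(r_groups):
    """Lazily generate every combination of R-group values."""
    names = list(r_groups)
    for combo in product(*(r_groups[n] for n in names)):
        yield dict(zip(names, combo))


print(coverage_size(r_groups))  # prints 24
```

Ten R-groups with ten alternatives each already cover 10^10 compounds, and a single homology-variant value such as "alkyl" makes the set infinite, so `enumerate_members` is only usable for small, fully specific cases.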

26 Dr Eugene A. Markush
• born Budapest, Hungary, c. 1888
• migrated to USA, 1913 (Citizen, 1920)
• Founded Pharma Chemical Corporation (NJ), 1919
• Filed US patent 1506316 on pyrazolone dyes, 9 January 1924, using expression “where R is a group selected from...” to circumvent USPTO “rule against ‘or’ ”
• died New York, 21 April 1968

27 Markush storage and retrieval
• Early systems (1950s, 1960s) developed in-house by pharmaceutical companies/consortiums
• High costs of patent abstracting and technical difficulties with automation shifted development to specialist companies
• Fragmentation code systems superseded by topological (structure graphics) systems

28 Fragmentation Codes
• Structural features (ring systems, functional groups, etc.) used as indexing terms
• Structural relationships usually lost
  - all alternatives tend to be “over-coded”
  - retrieved structures include many “false drops” (“ballast”)
• Codes originally assigned manually
  - Now usually generated (semi-)automatically from graphical input
  - Queries also generated automatically
• Some codes use “closed” set of terms (periodically revised)
• Others are “open-ended”
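
The over-coding and false-drop problem can be made concrete with a small sketch. A Markush record is indexed with the union of the terms for all its alternatives, so a query can match terms that never occur together in any one covered compound. The term names and the record below are invented for illustration and do not come from any real fragmentation code.

```python
from itertools import product


def over_coded_terms(scaffold_terms, r_groups):
    """Index terms for a Markush record: scaffold plus every alternative."""
    terms = set(scaffold_terms)
    for alternatives in r_groups.values():
        for alt in alternatives:
            terms |= alt
    return terms


def code_match(query_terms, index_terms):
    """Fragment-code retrieval: every query term appears in the index."""
    return query_terms <= index_terms


def true_match(query_terms, scaffold_terms, r_groups):
    """Ground truth: some single combination of alternatives has all terms."""
    for combo in product(*r_groups.values()):
        if query_terms <= set(scaffold_terms).union(*combo):
            return True
    return False


# Hypothetical record: R1 is phenyl OR carboxyl, never both at once.
scaffold = {"pyridine ring"}
r_groups = {"R1": [{"benzene ring"}, {"carboxylic acid"}]}
query = {"benzene ring", "carboxylic acid"}

retrieved = code_match(query, over_coded_terms(scaffold, r_groups))
relevant = true_match(query, scaffold, r_groups)
print(retrieved, relevant)  # prints True False
```

The record is retrieved although no covered compound satisfies the query, which is exactly a false drop. The `true_match` check relies on enumeration and is shown only to expose the error, not as a practical search method.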

29 Fragmentation Codes
• Derwent World Patent Index Chemical Code
  - Closed code with about one thousand terms
  - Large comprehensive backfile (from early 1960s)
  - Available for online searching (Questel)
• IFI/Plenum Code
  - Open-ended code
  - Used for “CLAIMS” database (U.S. patents)
  - Available for online searching (STN)
    o no graphical interface

30 Fragmentation Codes
• GREMAS code
  - Very sophisticated open-ended code
  - Private collaboration between (mainly) German pharmaceutical companies
  - Good retrieval performance
  - Input discontinued in early 1990s
  - Backfile (from 1950s) still searched at a few companies

31 Graphical (“topological”) systems
• Development started in early 1980s
• Intended to supplement graphical substructure search systems for specific structures
  - MACCS, CAS Online, DARC, etc.
• User draws graphical (sub)structure query
• System displays graphical Markush structure hits
• Two commercial systems implemented
  - available for online searching only
  - each with its own database
  - no “in-house” systems or databases

32 Markush DARC
• Joint development of
  - Questel SA (software and online host)
  - Derwent Information Ltd (WPIM database)
  - INPI (French Patent Office) (PHARMSEARCH database)
• Integrated database (“Merged Markush File”) now available
  - http://www.inpi.fr/inpi/mms/index.htm
  - Extension forwards (Derwent) and backwards (INPI)

33 MARPAT
• software and database from Chemical Abstracts Service
• available online via STN International
  - http://www.cas.org/CASFILES/marpat.html
• integrated with CA Registry database of specific compounds
• Proposal to allow Derwent database to be searched with MARPAT software dropped in mid 1990s for commercial reasons

34 The Markush Problem
• Representation
  - Mixture of structures and text
  - Generic (h-variant) expressions
  - Vagueness (“where by X we mean…”)
• Search
  - The “translation” problem
    o Specific groups (e.g. tert. butyl) must be matched against generic expressions (e.g. 1-6C alkyl)
  - The “segmentation” problem
    o Boundaries between scaffold and R-groups may not coincide in query and database structures

35 Matching Markush Structures
• Translation and Segmentation problems coincide to make it difficult to spot matching structures

36 Sheffield University Research
• Extended project (1979-1994) on Markush structure storage and retrieval
• designed external (GENSAL) and internal (ECTR) storage formats
  o parameter lists for homology-variant groups
• developed novel matching algorithms based around graph isomorphism
  o “reduced graph” concept
• influenced development of commercial systems
  o independent work also done at CAS, Derwent and Questel
Downs and Barnard, J. Documentation, 1998, 54 (1), 106-120

37 GENSAL
• formalised version of language used in patent specifications
• design analogous to programming language
• lexical elements include
  - structure diagrams
  - specific and generic chemical nomenclature
  - substitution operators
  - position/multiplicity values
• GENSAL Interpreter program (compiler) generates internal representation
  - based on “partial” connection tables with links between them

38 GENSAL example
[figure not included in transcript]

39 Parameter Lists
• Represent generic (“homology-variant”) expressions by set of permitted numerical ranges for structural parameters
• e.g. “alkyl”:
  - 1-n carbon atoms
  - 0 heteroatoms
  - 0 double or triple bonds
  - 0-n branch points
  - 0 rings
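
A parameter list of this kind maps naturally onto a table of (minimum, maximum) ranges, and testing a specific group against a generic expression becomes a range check on each parameter. The sketch below is an assumed encoding for illustration, not the ECTR format: the key names, the `restrict` helper and the counts given for the example groups are all hypothetical.

```python
INF = float("inf")

# "alkyl" as on the slide: 1-n carbons, no heteroatoms, no multiple bonds,
# 0-n branch points, no rings.
ALKYL = {
    "carbons": (1, INF),
    "heteroatoms": (0, 0),
    "multiple_bonds": (0, 0),
    "branch_points": (0, INF),
    "rings": (0, 0),
}


def restrict(ranges, **overrides):
    """Derive a narrower generic term, e.g. 1-6C alkyl, from a broader one."""
    return {**ranges, **overrides}


def within(params, ranges):
    """True if every parameter of a specific group lies in its permitted range."""
    return all(lo <= params[key] <= hi for key, (lo, hi) in ranges.items())


alkyl_1_6 = restrict(ALKYL, carbons=(1, 6))

tert_butyl = {"carbons": 4, "heteroatoms": 0, "multiple_bonds": 0,
              "branch_points": 1, "rings": 0}
cyclohexyl = {"carbons": 6, "heteroatoms": 0, "multiple_bonds": 0,
              "branch_points": 0, "rings": 1}

print(within(tert_butyl, alkyl_1_6))  # prints True
print(within(cyclohexyl, alkyl_1_6))  # prints False
```

Two generic terms can be compared the same way by intersecting their ranges, which avoids enumerating the (possibly infinite) set of groups each one covers.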

40 Reduced Graphs
• connected groups of atoms “collapsed” to form a single node of the reduced graph
  - atoms in the same ring system (R)
  - optionally branched carbon chains (C)
  - connected acyclic heteroatoms (Z)
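
The collapsing step can be sketched as a union-find over the bonds. The sketch assumes ring perception has already been done, so each atom and bond carries an "in ring" flag, and it merges ring atoms only across ring bonds so that two ring systems joined by a chain bond stay separate. The input layout and function name are assumptions, and the exact node-assignment rules of the Sheffield system may differ from this simplification.

```python
def reduce_graph(atoms, bonds):
    """Collapse a heavy-atom graph into R / C / Z nodes.

    atoms: {index: (element, in_ring)}
    bonds: list of (a, b, bond_in_ring)
    Returns (nodes, edges): nodes maps a representative atom index to
    (node_type, set_of_atoms); edges is a set of frozensets of representatives.
    """
    def node_type(i):
        element, in_ring = atoms[i]
        if in_ring:
            return "R"
        return "C" if element == "C" else "Z"

    parent = {i: i for i in atoms}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b, ring_bond in bonds:
        ta, tb = node_type(a), node_type(b)
        # Merge neighbours of the same type; ring atoms only via ring bonds.
        if ta == tb and (ta != "R" or ring_bond):
            parent[find(a)] = find(b)

    nodes = {}
    for i in atoms:
        nodes.setdefault(find(i), (node_type(i), set()))[1].add(i)
    edges = {frozenset((find(a), find(b)))
             for a, b, _ in bonds if find(a) != find(b)}
    return nodes, edges


# Hypothetical input: 4-ethylphenol heavy atoms (ring 0-5, ethyl 6-7, oxygen 8).
atoms = {i: ("C", True) for i in range(6)}
atoms.update({6: ("C", False), 7: ("C", False), 8: ("O", False)})
bonds = [(i, (i + 1) % 6, True) for i in range(6)]
bonds += [(0, 6, False), (6, 7, False), (3, 8, False)]

nodes, edges = reduce_graph(atoms, bonds)
print(sorted(t for t, _ in nodes.values()))  # prints ['C', 'R', 'Z']
```

Nine atoms reduce to three nodes (ring, chain, heteroatom) joined by two edges, and each node can then carry a parameter list describing what was collapsed into it.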

41 Reduced Graphs
• boundaries between nodes are non-arbitrary
  - thus provides solution to segmentation problem
• each node can be described by a parameter list
• homology-variant groups can also be represented as reduced graph nodes with parameter lists
  - thus provides solution to translation problem:
    o first identify isomorphism between reduced graphs
    o if parameter lists match can do atom-by-atom match on original atoms in specific groups, if necessary

42 Design of Commercial Systems
• Sheffield system never implemented commercially
• Ideas incorporated into both Markush DARC and MARPAT
  - also used by BCI Ltd. in various projects
• Other ideas developed independently
  - both systems have patent protection
• Basic concepts parallel those developed at Sheffield
Barnard, J. M. “A comparison of different approaches to Markush structure handling” JCICS, 1991, 31 (1), 64-67
Berks, A. “Current state of the art of Markush topological search systems”, World Patent Information, 2001, 23, 5-13

43 Markush DARC
• Specific groups shown as structure diagrams
  - Rather clunky display (one R-group at a time)
• Generic groups shown as “superatoms”
  - e.g. CHK = alkyl, HEF = fused heterocycle
  - qualitative attributes used in searching
  - quantitative parameters (texnotes) available for display
• reduced graph concepts used in atom-by-atom search stage

44 Markush DARC Display
[figure not included in transcript]

45 MARPAT
• Part of CASLink substructure search system on STN
• Input and display uses text and graphics similar to GENSAL
• Generic Group Nodes with quantitative attributes (not fully implemented for search)

46 MARPAT Generic Group Nodes
• GGN definitions imply reduced graph concept
• “Spin-off” GGNs generated for specific groups to allow specific-generic matching (“translation”)

47 MARPAT Display
MSTR 1
G1 = N, CH
G2 = H, X, SC,Cl
DER: or acid addition salts
MPL: Claim 1

48 Conclusions from Lecture 5
• Chemical reaction search requires atom-atom mapping between reactant and product
  - Maximal Common Subgraph algorithms can be used
• 3D substructure search uses interatomic distances as edge labels in fully-connected graphs
• Markush structures pose particular problems to structure search systems
  - extremely broad classes
  - homology-variant (generic) expressions
  - segmentation between R-groups
• Two publicly-available Markush search systems for chemical patents
  - Markush DARC and MARPAT

49 Further Reading
• Chen, L.; Nourse, J. G.; Christie, B. D.; Leland, B. A.; Grier, D. L. “Over 20 years of reaction access from MDL: a novel reaction substructure search system”. J. Chem. Inf. Comput. Sci. 2002, 42, 1296-1310.
• “Representation and manipulation of 3D molecular structures”. Chapter 2 (pp. 27-52) in A. R. Leach and V. J. Gillet, An Introduction to Chemoinformatics, Dordrecht: Kluwer, 2003.
• Berks, A. H. “Current state of the art of Markush topological search systems”. In J. Gasteiger (ed.) Handbook of Chemoinformatics: From Data to Knowledge, Vol 2, pp. 885-903, Wiley-VCH, 2003.

50 Lecture 6: Topics to be Covered
• Similarity searching
  - similarity search vs. substructure search
  - similarity and distance metrics
  - different types of descriptor for similarity search
  - choice of descriptors
• The drug discovery process

