1 Chemical Structure Representation and Search Systems Lecture 5. Nov 13, 2003 John Barnard Barnard Chemical Information Ltd Chemical Informatics Software.


1 Chemical Structure Representation and Search Systems Lecture 5. Nov 13, 2003 John Barnard Barnard Chemical Information Ltd Chemical Informatics Software & Consultancy Services Sheffield, UK

2 Lecture 5: Topics to be Covered
- Reaction searching
  o atom-atom mapping
  o Maximal Common Substructure search
- 3D substructure search
- Searching Markush structures in patents
  o nature and origin of Markush structures
  o fragment codes
  o topological systems (MARPAT, Markush DARC)

3 Searching Chemical Reactions
- each database entry contains several molecules
  - reactants
  - products
  - catalysts
  - solvents etc.
- may want query substructure confined to one of these
  - can be done by assigning role indicator to each molecule
- but role indicators are not enough on their own for a useful reaction search system

4 Reaction search  Query:

5 Reaction search
- Query:
- “Hit”:
- We didn’t get what we wanted because the hydroxyl in the product did not involve the same oxygen as the ketone in the reactant
- We need to “map” the atoms between the reactant and product

6 Atom mapping
- atoms on each side of the reaction can be numbered to show which corresponds to which
  - similar mappings can be used in the query
- automatic assignment of atom mapping is very important in reaction indexing systems
  - problem is obviously related to finding a graph isomorphism between reactant and product sides
  - except that the two sides are NOT isomorphic
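
The role of atom mapping in the reaction query of the previous slides can be illustrated with a minimal Python sketch. Everything here is hypothetical: the `Atom` class, the role labels and the map numbers are invented for illustration, and a real system would derive roles from substructure matching rather than from hand-written labels.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Atom:
    element: str
    role: str     # hypothetical functional-group label, e.g. "ketone O"
    map_no: int   # atom-atom mapping number shared across the reaction arrow

def mapped_pairs(reactant, product, reactant_role, product_role):
    """Pairs of atoms that play the requested roles AND carry the same map number."""
    return [(r, p) for r in reactant for p in product
            if r.role == reactant_role
            and p.role == product_role
            and r.map_no == p.map_no]

# Wanted reaction: the ketone oxygen (map 2) becomes the hydroxyl oxygen.
reduction_reactant = [Atom("C", "carbonyl C", 1), Atom("O", "ketone O", 2)]
reduction_product = [Atom("C", "carbinol C", 1), Atom("O", "hydroxyl O", 2)]

# Unwanted "hit": a ketone is present and a hydroxyl is formed,
# but the hydroxyl oxygen (map 5) is a different atom from the ketone oxygen.
other_reactant = [Atom("O", "ketone O", 2), Atom("O", "ester O", 5)]
other_product = [Atom("O", "ketone O", 2), Atom("O", "hydroxyl O", 5)]

wanted = mapped_pairs(reduction_reactant, reduction_product, "ketone O", "hydroxyl O")
unwanted = mapped_pairs(other_reactant, other_product, "ketone O", "hydroxyl O")
```

With role indicators alone both reactions would be retrieved; requiring equal map numbers keeps only the first.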

7 Maximal common subgraph
- atoms and bonds in red represent the largest subgraph that is common to both sides
  - all these atoms have same neighbours on both sides
  - none of these bonds are made or broken
- remaining atoms and bonds represent reaction site

8 Maximal common subgraph
- Finding the MCS between two graphs is an NP-complete problem
  - even worse than subgraph isomorphism because you don’t know in advance how big the subgraph will be
  - exhaustive backtracking is prohibitively slow
  - the best algorithms find an approximate solution (i.e. a large, but not necessarily maximal, subgraph)
  - tricks can be used to determine an upper bound for the size of the MCS (so you can stop looking when you’ve found one of this size)
  - new algorithm published 2002

9 Applications of MCS
- MCS algorithms can be applied to other things than atom-atom mapping in reactions
  - structural similarity between molecules
    o size of MCS (relative to size of molecules) can be used as measure of similarity of molecules
  - approximate match searches
    o search for molecules containing at least 80% of query substructure
  - multiple maximal common substructure
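
As a rough illustration of MCS-based similarity, the sketch below finds the MCS size of two tiny molecules by exhaustive backtracking and turns it into a score. Several choices are assumptions rather than anything stated in the lecture: the MCS is taken as an induced common subgraph that need not be connected, only heavy atoms are used, bond orders are ignored, and the Tanimoto-style formula |MCS| / (|A| + |B| - |MCS|) is just one possible way of relating MCS size to molecule size. Exhaustive search like this is only workable for very small graphs.

```python
def mcs_size(labels_a, bonds_a, labels_b, bonds_b):
    """Atom count of the largest common induced subgraph (exhaustive backtracking)."""
    best = 0

    def extend(i, mapping):
        nonlocal best
        best = max(best, len(mapping))
        # Simple upper bound: stop if mapping every remaining atom cannot beat best.
        if i == len(labels_a) or len(mapping) + (len(labels_a) - i) <= best:
            return
        for j in range(len(labels_b)):
            if j in mapping.values() or labels_a[i] != labels_b[j]:
                continue
            # Bonded/non-bonded status must agree with every atom already mapped.
            if all((frozenset((i, k)) in bonds_a) == (frozenset((j, m)) in bonds_b)
                   for k, m in mapping.items()):
                mapping[i] = j
                extend(i + 1, mapping)
                del mapping[i]
        extend(i + 1, mapping)  # also try leaving atom i unmapped

    extend(0, {})
    return best

def mcs_similarity(labels_a, bonds_a, labels_b, bonds_b):
    """Tanimoto-style score from MCS size (an assumed formula, see lead-in)."""
    common = mcs_size(labels_a, bonds_a, labels_b, bonds_b)
    return common / (len(labels_a) + len(labels_b) - common)

def bond_set(pairs):
    return {frozenset(p) for p in pairs}

# Heavy-atom graphs: ethanol C-C-O, 1-propanol C-C-C-O, dimethyl ether C-O-C.
ethanol = (["C", "C", "O"], bond_set([(0, 1), (1, 2)]))
propanol = (["C", "C", "C", "O"], bond_set([(0, 1), (1, 2), (2, 3)]))
ether = (["C", "O", "C"], bond_set([(0, 1), (1, 2)]))

sim_ethanol_propanol = mcs_similarity(*ethanol, *propanol)
sim_ethanol_ether = mcs_similarity(*ethanol, *ether)
```

An approximate-match search of the kind mentioned above could, under the same assumptions, accept a database molecule when `mcs_size(...)` is at least 80% of the number of query atoms.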

10 Multiple MCS
- largest substructure common to whole set of molecules
  - can be used to extract “core” for a Markush structure
  - might represent features important for biological activity
  - even more difficult than MCS of two molecules
    o unfortunately it doesn’t work to find MCS of first two, and then MCS between that and the third, etc.

11 3-D substructure search
- Analogous to 2-D substructure search
  - need to find atoms in correct spatial orientation relative to each other
    o some fuzziness (tolerance) permitted in distance values
  - query can be defined as a group of atoms, with specified interatomic distances
    o sometimes called a pharmacophore
  - both query and database structures can be shown as topological graphs in which the nodes are atoms, but the edges are interatomic distances

12 3-D substructure searching
- the interatomic distances are the labels on the edges
- graph is fully-connected (an edge between every pair of nodes)
- the graph edges do not correspond to bonds in the molecule
- matching is then a process of subgraph isomorphism between such graphs
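
The distance-labelled matching described above can be sketched in Python as a brute-force search. This is not Ullmann's algorithm or any production method; the function name, the data layout, the coordinates and the 0.5 tolerance are all assumptions made for illustration.

```python
from itertools import permutations
from math import dist

def match_3d(query_labels, query_dist, atoms, tol=0.5):
    """Return the first tuple of atom indices matching the query, or None.

    query_labels: element label of each query point
    query_dist:   {(i, j): distance} for pairs of query points
    atoms:        list of (label, (x, y, z)) for one database structure
    """
    n = len(query_labels)
    for combo in permutations(range(len(atoms)), n):
        # Node labels must agree.
        if any(atoms[a][0] != query_labels[q] for q, a in enumerate(combo)):
            continue
        # Every edge label (interatomic distance) must agree within the tolerance.
        if all(abs(dist(atoms[combo[i]][1], atoms[combo[j]][1]) - d) <= tol
               for (i, j), d in query_dist.items()):
            return combo
    return None

# Hypothetical 3-point pharmacophore: one N and two O atoms.
labels = ["N", "O", "O"]
distances = {(0, 1): 3.0, (0, 2): 5.0, (1, 2): 4.0}

# Hypothetical database structure (coordinates invented for the example).
structure = [
    ("C", (9.0, 9.0, 9.0)),
    ("O", (3.0, 0.0, 0.0)),
    ("N", (0.0, 0.0, 0.0)),
    ("O", (3.0, 4.0, 0.0)),
]

hit = match_3d(labels, distances, structure)
miss = match_3d(labels, {(0, 1): 10.0, (0, 2): 5.0, (1, 2): 4.0}, structure)
```

Because every pair of query points carries a distance constraint, the query graph is fully connected, which is why each candidate assignment has to be checked against all pairs.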

13 3D substructure searching
- subgraph isomorphism involving fully-connected graphs is computationally more demanding than for 2D substructure search
  - Ullmann’s algorithm performs well
  - other approaches (e.g. clique detection) have also been used
- fingerprint-like screening stages can also be applied in the search, based on 3D-fragments such as 3-point pharmacophores
  - screens based on torsion and valence angles have also been used
Willett, P. Three-Dimensional Chemical Structure Handling. Wiley: New York (1991)

14 Chemical patents
- Contract between inventor and State to encourage innovation
  - Inventor reveals nature of invention
  - State grants protected monopoly over its exploitation for limited period
- Invention must be novel, useful and non-obvious
  - new ways of making compounds
  - new compounds with useful properties (therapeutic uses)
- Essential for success of pharmaceutical industry
- Knowledge of existing patents (prior art) essential to avoid fruitless development

15 Chemical patents
- May claim single product or process
- More usually claim class of products or processes to ensure protection for closely-related compounds etc.
- Very broad claims can disguise true nature of invention
  - But may claim compounds which lack claimed activity
  - Nested series of claims (A, preferably B, more preferably C etc.) can provide “fallback” positions
- Extremely broad claims have become more common as Patent Offices moved to publication before examination
Sibley, J. F. “Too broad generic disclosures: a problem for all” J. Chem. Inf. Comput. Sci. 1991, 31 (1) 5-8

16 R1-X-R36
- R1 is a substituted or unsubstituted, mono-, di- or polycyclic, aromatic or non-aromatic carbocyclic or heterocyclic ring system, or…
- X is a single or double bond, substituted or unsubstituted heteroatom, or substituted carbon atom, or substituted or unsubstituted chain of two or more carbon atoms and/or heteroatoms…
- R36 is substituted or unsubstituted asymmetrical heterocyclic ring system having at least 3 nitrogens…
[Structure 32 from Claim 105 of PCT Application , claimed as novel]

17 The patent explosion
- Originally only granted patents published.
- Belgium (1950s), Netherlands (1964) and EPO (1978) -> publishing all patent applications.
- Rapid publication makes information available very quickly.
- Huge number of patents, many low quality, insufficient or incorrect details, no novelty.
- Less work for patent examiners but greater problems for retrieval systems.

18 Structural information in chemical patents
- Uses mixture of:
  - 2D structure diagrams
  - linear formulae (e.g. “C2H5”, “EtOH”)
  - specific nomenclature (e.g. “phenyl”, “isopropyl”)
  - generic nomenclature (e.g. “alkyl”, “heteroaryl”)
  - non-structural expressions (e.g. “pharmaceutically acceptable cation”, “group known in the art”)
- Many machine readable systems just show structural information as free text and images

19 Specific Structures from Patents
- Several databases contain specific molecules claimed in patents
  - Chemical Abstracts Registry
  - Derwent Registry
  - MDL announced major new database Nov 2003
    o will include reactions, molecules and Markush display
    o /pr_patentdb_07nov03.jsp

20 Markush Structures
- also known as “Generic Structures” or “R-group Structures”
- chemical structures involving variable parts

21 Markush Structures
- compact representation of a set or class of specific compounds with common structural features
- used in
  - chemical patents
  - query structures in substructure search systems
  - Quantitative Structure-Activity Relationship (QSAR) analysis
    o class of related compounds with activity data
  - combinatorial libraries
    o rapid synthesis of large numbers of related compounds
  - legislation (controlled drugs, chemical weapons)

22 Variability in Markush Structures
- s-variation (substituent variation)
  - list of alternative values for an R-group
- p-variation (position variation)
  - variable point of attachment
- f-variation (frequency variation)
  - multiple occurrence of groups
- h-variation (homology variation)
  - generically described group (e.g. “alkyl”)
  - potentially infinite set of specific alternatives

23 Types of variation
- substituent variation: R1 is methyl or ethyl
- homology variation: R2 is alkyl
- position variation: R3 is amino
- frequency variation: m is 1-3
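
To make the finite kinds of variation concrete, the following Python sketch enumerates the specific members of a toy Markush structure. The dictionary layout and the three ring positions allowed for R3 are assumptions for illustration; the R1 and m values follow the example above. Homology variation (R2 is alkyl) is deliberately left out, because an unbounded "alkyl" has no finite list of values to iterate over.

```python
from itertools import product

# Toy Markush definition; keys and the R3 positions are hypothetical.
markush = {
    "R1": ["methyl", "ethyl"],   # substituent variation
    "R3_position": [2, 3, 4],    # position variation: where the amino group attaches
    "m": [1, 2, 3],              # frequency variation
}

def enumerate_members(variables):
    """Yield one dict per specific combination of variable values."""
    names = list(variables)
    for values in product(*(variables[name] for name in names)):
        yield dict(zip(names, values))

members = list(enumerate_members(markush))
member_count = len(members)  # 2 * 3 * 3 combinations
```

The count is the product of the numbers of alternatives, so it grows multiplicatively with every added variable, and it becomes unbounded as soon as a homology-variant group is included.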

24 Types of Markush structure

25 Markush Structures
- Compact representation for sets of molecules
  - common parts shown once only
- Can be considered as formal “grammar” for generating valid molecules (“sentences”)
- Enumeration of coverage usually impractical and often impossible (infinite sets)
- Appropriate algorithms for handling take advantage of Markush representation:
  - Avoid enumeration (especially infinite sets)
  - Compare finite grammars rather than infinite sets of valid sentences

26 Dr Eugene A. Markush
- born Budapest, Hungary, c
- migrated to USA, 1913 (Citizen, 1920)
- Founded Pharma Chemical Corporation (NJ), 1919
- Filed US patent on pyrazolone dyes, 9 January 1924, using expression “where R is a group selected from...” to circumvent USPTO “rule against ‘or’ ”
- died New York, 21 April 1968

27 Markush storage and retrieval
- Early systems (1950s, 1960s) developed in-house by pharmaceutical companies/consortiums
- High costs of patent abstracting and technical difficulties with automation shifted development to specialist companies
- Fragmentation code systems superseded by topological (structure graphics) systems

28 Fragmentation Codes
- Structural features (ring systems, functional groups, etc.) used as indexing terms
- Structural relationships usually lost
  - all alternatives tend to be “over-coded”
  - retrieved structures include many “false drops” (“ballast”)
- Codes originally assigned manually
  - Now usually generated (semi-)automatically from graphical input
  - Queries also generated automatically
- Some codes use “closed” set of terms (periodically revised)
- Others are “open-ended”
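
Why fragment codes produce false drops can be shown with a small Python sketch. The index below is entirely hypothetical (invented document ids and code terms, not taken from any real fragmentation code); it treats each document as an unordered set of terms, which is exactly how the structural relationships get lost.

```python
# Hypothetical index: document id -> set of fragment code terms.
index = {
    "patent_A": {"pyridine ring", "chloro", "phenyl ring"},  # chloro is on the phenyl ring
    "patent_B": {"pyridine ring", "chloro"},                 # chloro is on the pyridine ring
    "patent_C": {"phenyl ring", "hydroxy"},
}

def code_search(query_codes, index):
    """Return ids of documents whose code set contains every query code."""
    return sorted(doc for doc, codes in index.items() if query_codes <= codes)

# Query intended to mean "a chloropyridine".
hits = code_search({"pyridine ring", "chloro"}, index)
```

Both patent_A and patent_B are retrieved, although only patent_B has the chlorine on the pyridine ring; patent_A is a false drop because the codes record which features are present but not how they are connected.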

29 Fragmentation Codes
- Derwent World Patent Index Chemical Code
  - Closed code with about one thousand terms
  - Large comprehensive backfile (from early 1960s)
  - Available for online searching (Questel)
- IFI/Plenum Code
  - Open-ended code
  - Used for “CLAIMS” database (U.S. patents)
  - Available for online searching (STN)
    o no graphical interface

30 Fragmentation Codes
- GREMAS code
  - Very sophisticated open-ended code
  - Private collaboration between (mainly) German pharmaceutical companies
  - Good retrieval performance
  - Input discontinued in early 1990s
  - Backfile (from 1950s) still searched at a few companies

31 Graphical (“topological”) systems
- Development started in early 1980s
- Intended to supplement graphical substructure search systems for specific structures
  - MACCS, CAS Online, DARC, etc.
- User draws graphical (sub)structure query
- System displays graphical Markush structure hits
- Two commercial systems implemented
  - available for online searching only
  - each with its own database
  - no “in-house” systems or databases

32 Markush DARC
- Joint development of
  - Questel SA (software and online host)
  - Derwent Information Ltd (WPIM database)
  - INPI (French Patent Office) (PHARMSEARCH database)
- Integrated database (“Merged Markush File”) now available
  - Extension forwards (Derwent) and backwards (INPI)

33 MARPAT
- software and database from Chemical Abstracts Service
- available online via STN International
- integrated with CA Registry database of specific compounds
- Proposal to allow Derwent database to be searched with MARPAT software dropped in mid 1990s for commercial reasons

34 The Markush Problem
- Representation
  - Mixture of structures and text
  - Generic (h-variant) expressions
  - Vagueness (“where by X we mean…”)
- Search
  - The “translation” problem
    o Specific groups (e.g. tert. butyl) must be matched against generic expressions (e.g. 1-6C alkyl)
  - The “segmentation” problem
    o Boundaries between scaffold and R-groups may not coincide in query and database structures

35 Matching Markush Structures
- Translation and Segmentation problems combine to make it difficult to spot matching structures

36 Sheffield University Research
- Extended project ( ) on Markush structure storage and retrieval
  - designed external (GENSAL) and internal (ECTR) storage formats
    o parameter lists for homology-variant groups
  - developed novel matching algorithms based around graph isomorphism
    o “reduced graph” concept
  - influenced development of commercial systems
    o independent work also done at CAS, Derwent and Questel
Downs and Barnard, J. Documentation, 1998, 54 (1),

37 GENSAL
- formalised version of language used in patent specifications
- design analogous to programming language
- lexical elements include
  - structure diagrams
  - specific and generic chemical nomenclature
  - substitution operators
  - position/multiplicity values
- GENSAL Interpreter program (compiler) generates internal representation based on “partial” connection tables with links between them

38 GENSAL example

39 Parameter Lists
- Represent generic (“homology-variant”) expressions by set of permitted numerical ranges for structural parameters
  - e.g. “alkyl”:
    - 1-n carbon atoms
    - 0 heteroatoms
    - 0 double or triple bonds
    - 0-n branch points
    - 0 rings
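
A parameter list of this kind can be sketched in Python as a dictionary of (minimum, maximum) ranges, with `None` standing for the open upper limit "n". The dictionary keys, the `satisfies` helper and the hand-counted parameters of the specific groups are assumptions for illustration, not the ECTR format.

```python
# "alkyl" as permitted ranges; None means no upper limit ("n").
ALKYL = {
    "carbons": (1, None),
    "heteroatoms": (0, 0),
    "multiple_bonds": (0, 0),
    "branch_points": (0, None),
    "rings": (0, 0),
}

# "1-6C alkyl": the same list with a bounded carbon count.
C1_6_ALKYL = dict(ALKYL, carbons=(1, 6))

def satisfies(group, param_list):
    """True if every parameter of the specific group lies within the permitted range."""
    for name, (low, high) in param_list.items():
        value = group[name]
        if value < low or (high is not None and value > high):
            return False
    return True

# Specific groups described by hand-counted parameters.
tert_butyl = {"carbons": 4, "heteroatoms": 0, "multiple_bonds": 0, "branch_points": 1, "rings": 0}
n_heptyl = {"carbons": 7, "heteroatoms": 0, "multiple_bonds": 0, "branch_points": 0, "rings": 0}
vinyl = {"carbons": 2, "heteroatoms": 0, "multiple_bonds": 1, "branch_points": 0, "rings": 0}
```

Here `satisfies(tert_butyl, C1_6_ALKYL)` holds, which is the kind of specific-to-generic comparison the "translation" problem calls for, while `n_heptyl` fits only the unbounded list and `vinyl` fits neither.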

40 Reduced Graphs
- connected groups of atoms “collapsed” to form a single node of the reduced graph
  - atoms in the same ring system (R)
  - optionally branched carbon chains (C)
  - connected acyclic heteroatoms (Z)
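
The collapsing step can be sketched in Python with a small union-find over heavy atoms. The input format (ring flags supplied on atoms and bonds), the function name and the example molecule are assumptions for illustration; ring perception itself is not attempted here.

```python
def reduced_graph(atoms, bonds):
    """Collapse a heavy-atom graph into R / C / Z nodes.

    atoms: list of (element, in_ring)
    bonds: list of (a, b, is_ring_bond)
    Returns (node_types, edges), with nodes numbered by their lowest atom index.
    """
    kind = ["R" if in_ring else ("C" if element == "C" else "Z")
            for element, in_ring in atoms]
    parent = list(range(len(atoms)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    # Merge bonded atoms of the same kind; ring atoms merge only across ring
    # bonds, so two ring systems joined by a chain bond stay separate nodes.
    for a, b, ring_bond in bonds:
        if kind[a] == kind[b] and (kind[a] != "R" or ring_bond):
            parent[find(a)] = find(b)

    lowest = {}
    for i in range(len(atoms)):
        lowest.setdefault(find(i), i)  # first (lowest) atom index seen per group
    roots = sorted(lowest, key=lowest.get)
    node_of = {root: n for n, root in enumerate(roots)}
    node_types = [kind[root] for root in roots]
    edges = {tuple(sorted((node_of[find(a)], node_of[find(b)])))
             for a, b, _ in bonds if find(a) != find(b)}
    return node_types, edges

# Benzyl methyl ether, heavy atoms only: ring carbons 0-5, CH2 6, O 7, CH3 8.
atoms = [("C", True)] * 6 + [("C", False), ("O", False), ("C", False)]
bonds = [(0, 1, True), (1, 2, True), (2, 3, True), (3, 4, True), (4, 5, True),
         (5, 0, True), (0, 6, False), (6, 7, False), (7, 8, False)]

node_types, edges = reduced_graph(atoms, bonds)
```

For this molecule the nine heavy atoms collapse to the four-node chain R-C-Z-C.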

41 Reduced Graphs
- boundaries between nodes are non-arbitrary
  - thus provides solution to segmentation problem
- each node can be described by a parameter list
- homology-variant groups can also be represented as reduced graph nodes with parameter lists
  - thus provides solution to translation problem:
    o first identify isomorphism between reduced graphs
    o if parameter lists match can do atom-by-atom match on original atoms in specific groups, if necessary

42 Design of Commercial Systems
- Sheffield system never implemented commercially
- Ideas incorporated into both Markush DARC and MARPAT
  - also used by BCI Ltd. in various projects
- Other ideas developed independently
  - both systems have patent protection
- Basic concepts parallel those developed at Sheffield
Barnard, J. M. “A comparison of different approaches to Markush structure handling” JCICS, 1991, 31 (1),
Berks, A. “Current state of the art of Markush topological search systems”, World Patent Information, 2001,

43 Markush DARC
- Specific groups shown as structure diagrams
  - Rather clunky display (one R-group at a time)
- Generic groups shown as “superatoms”
  - e.g. CHK = alkyl, HEF = fused heterocycle
  - qualitative attributes used in searching
  - quantitative parameters (texnotes) available for display
- reduced graph concepts used in atom-by-atom search stage

44 Markush DARC Display

45 MARPAT
- Part of CASLink substructure search system on STN
- Input and display uses text and graphics similar to GENSAL
- Generic Group Nodes with quantitative attributes (not fully implemented for search)

46 MARPAT Generic Group Nodes
- GGN definitions imply reduced graph concept
- “Spin-off” GGNs generated for specific groups to allow specific-generic matching (“translation”)

47 MARPAT Display
MSTR 1
G1 = N, CH
G2 = H, X, SC, Cl
DER: or acid addition salts
MPL: Claim 1

48 Conclusions from Lecture 5
- Chemical reaction search requires atom-atom mapping between reactant and product
  - Maximal Common Subgraph algorithms can be used
- 3D substructure search uses interatomic distances as edge labels in fully-connected graphs
- Markush structures pose particular problems to structure search systems
  - extremely broad classes
  - homology-variant (generic) expressions
  - segmentation between R-groups
- Two publicly-available Markush search systems for chemical patents
  - Markush DARC and MARPAT

49 Further Reading
- Chen, L.; Nourse, J. G.; Christie, B. D.; Leland, B. A.; Grier, D. L. “Over 20 years of reaction access from MDL: a novel reaction substructure search system”. J. Chem. Inf. Comput. Sci. 2002, 42,
- “Representation and manipulation of 3D molecular structures”. Chapter 2 (pp ) in A. R. Leach and V. J. Gillet, An Introduction to Chemoinformatics, Dordrecht: Kluwer, 2003
- Berks, A. H. “Current state of the art of Markush topological search systems”. In J. Gasteiger (ed.) Handbook of Chemoinformatics: From Data to Knowledge, Vol 2, pp , Wiley-VCH, 2003

50 Lecture 6: Topics to be Covered
- Similarity searching
  - similarity search vs. substructure search
  - similarity and distance metrics
  - different types of descriptor for similarity search
  - choice of descriptors
- The drug discovery process