
Representation and searching of molecules in chemical patents. Presented at IRF Symposium 2007, 8th November 2007, Vienna. Peter Willett, University of.


1 Representation and searching of molecules in chemical patents
  Presented at IRF Symposium 2007, 8th November 2007, Vienna
  Peter Willett, University of Sheffield, UK

2 Overview of talk
  - Introduction
  - Processing of patent structures
    - Specific molecules
    - Generic molecules
  - Processing of non-structural information

3 The pharmaceutical industry
  - Finds, develops and markets new drugs that:
    - Can be used against previously untreatable diseases, or that are better than current drugs
    - Have commercial value sufficient to meet the discovery and development costs
  - Drug discovery is a vastly complex, multi-disciplinary task that is inherently very risky and costly (10 years and $1.5B to bring a new drug to market)
  - Informatics methods are increasingly used to enhance the cost-effectiveness of pharmaceutical research

4 Chemical patents
  - Patents and journal articles are the two most important sources of information on new molecules
  - Chemical Abstracts Service now processes ca. 1M new documents a year and has records for ca. 30M specific small molecules
  - Central role of the chemical structure diagram
  - Other types of information
    - Synthetic details (yield, catalysts)
    - Physical properties (melting point, spectra, solubility)
    - Property of interest (lowers cholesterol level, increases viscosity)

5 Types of molecule in patents
  - Specific molecules
    - Individual substances, or groups of closely related substances in which the exact nature of the relationship is explicitly defined
  - Generic molecules (Markush structures)
    - A class of substances, defined either explicitly or implicitly
    - A generic claim may cover far more molecules than have actually been synthesised and tested (or even more than can possibly exist)

6 Specific molecules
  - Molecules in chemical databases (patent or otherwise) are represented by graphs
    - The nodes and edges of a graph denote the atoms and bonds of a molecule
    - Can be extended to encode molecular 3D structures
    - A searchable representation, not just an image
  - The graph representation means that graph isomorphism algorithms can be used for searching
    - Graph isomorphism (exact match)
    - Subgraph isomorphism (partial match)
    - Maximum common subgraph isomorphism (best match)
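
The graph view of a molecule and the partial-match (subgraph isomorphism) search can be sketched as follows. This is a minimal illustration, not the algorithm used by any particular database system: the `Molecule` class, the function names and the hydrogen-suppressed encoding are all assumptions made for the example, and the search is a plain backtracking procedure with none of the refinements a production system would need.

```python
from typing import Dict, List, Optional, Set, Tuple


class Molecule:
    """Labelled graph: atoms are nodes, bonds are edges (hydrogens omitted)."""

    def __init__(self, atoms: List[str], bonds: List[Tuple[int, int, int]]):
        self.atoms = atoms  # element symbol for each node
        # adjacency map: atom index -> {neighbour index: bond order}
        self.adj: Dict[int, Dict[int, int]] = {i: {} for i in range(len(atoms))}
        for a, b, order in bonds:
            self.adj[a][b] = order
            self.adj[b][a] = order

    def bond_count(self) -> int:
        return sum(len(nbrs) for nbrs in self.adj.values()) // 2


def find_substructure(query: Molecule, target: Molecule) -> Optional[Dict[int, int]]:
    """Return a mapping {query atom: target atom}, or None if there is no match."""
    mapping: Dict[int, int] = {}
    used: Set[int] = set()

    def compatible(q: int, t: int) -> bool:
        if query.atoms[q] != target.atoms[t]:
            return False
        # Each already-mapped neighbour of q must be bonded to t with the same order.
        for q_nb, order in query.adj[q].items():
            if q_nb in mapping and target.adj[t].get(mapping[q_nb]) != order:
                return False
        return True

    def extend(q: int) -> bool:
        if q == len(query.atoms):
            return True  # every query atom has been placed
        for t in range(len(target.atoms)):
            if t not in used and compatible(q, t):
                mapping[q] = t
                used.add(t)
                if extend(q + 1):
                    return True
                del mapping[q]  # backtrack
                used.discard(t)
        return False

    return dict(mapping) if extend(0) else None


def is_exact_match(a: Molecule, b: Molecule) -> bool:
    """Graph isomorphism: equal sizes plus a substructure match."""
    return (len(a.atoms) == len(b.atoms)
            and a.bond_count() == b.bond_count()
            and find_substructure(a, b) is not None)


ethanol = Molecule(["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
c_o_single = Molecule(["C", "O"], [(0, 1, 1)])
c_o_double = Molecule(["C", "O"], [(0, 1, 2)])

print(find_substructure(c_o_single, ethanol))  # {0: 1, 1: 2}
print(find_substructure(c_o_double, ethanol))  # None
```

The backtracking search has exponential worst-case cost, which is why a cheap pre-filter is normally applied before any atom-by-atom matching.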

7 2D substructure searching

8 2D similarity searching

9 Substructure and similarity searching
  - The two standard modes of database access in chemoinformatics
  - Effective but computationally expensive, owing to the algorithmic complexity of graph operations
  - Significant increase in efficiency by means of fragment bit-strings ("screening")
  - Allows interactive searching of multi-million compound databases
  - 3D searching is also feasible but far more time-consuming

10
  - Each bit in the fingerprint (or fragment bit-string) represents one molecular fragment; the typical length is ~1000 bits
  - The fingerprint records the presence/absence (1/0) of each fragment in that molecule
  - A query structure and a database structure can be compared in terms of the bits (i.e., fragment substructures) that they have in common
  - Can be regarded as the chemoinformatics analogue of a text signature
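
A small sketch of how fragment bit-strings support both access modes. The six-entry fragment dictionary, the function names and the hand-assigned fragment lists are hypothetical; real systems derive fragments from the structure graph and use dictionaries of around a thousand bits. The Tanimoto coefficient is used here as one common way of scoring the bits two fingerprints have in common; the slides do not name a specific coefficient.

```python
from typing import Iterable

# Hypothetical fragment dictionary: one bit position per fragment.
FRAGMENTS = ["C-C", "C-O", "C=O", "C-N", "c:c", "C-Cl"]


def fingerprint(fragments_present: Iterable[str]) -> int:
    """Encode presence/absence of each dictionary fragment as bits of an int."""
    bits = 0
    for frag in fragments_present:
        bits |= 1 << FRAGMENTS.index(frag)
    return bits


def passes_screen(query_fp: int, mol_fp: int) -> bool:
    """Substructure screen: every bit set in the query must be set in the molecule.

    A molecule that fails cannot contain the query; one that passes still
    needs a full subgraph isomorphism check.
    """
    return (query_fp & mol_fp) == query_fp


def tanimoto(fp_a: int, fp_b: int) -> float:
    """Similarity: bits in common divided by bits set in either fingerprint."""
    common = bin(fp_a & fp_b).count("1")
    either = bin(fp_a | fp_b).count("1")
    return common / either if either else 0.0


ethanol = fingerprint(["C-C", "C-O"])
acetic_acid = fingerprint(["C-C", "C-O", "C=O"])
carbonyl_query = fingerprint(["C=O"])

print(passes_screen(carbonyl_query, acetic_acid))  # True
print(passes_screen(carbonyl_query, ethanol))      # False
print(round(tanimoto(ethanol, acetic_acid), 3))    # 0.667
```

Because the screen is a single bitwise AND per database entry, it can discard most of a multi-million compound file before any graph matching is attempted.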

11 Generic molecules
  - The Markush structure provides a simple and compact way of representing sets of specifics with common structural features
  - Typically two parts
    - Invariant part: often a common ring scaffold
    - Varying parts: a range of possible substructures, often at a range of possible positions

12

13 Simple example: 192 specifics
  R  = 2-chlorophenyl or 2,3-dichlorophenyl
  R1 = CH3
  R2 = C2H5
  n  = 2
  R3 = H or CH3
  R4 = C-O-R5 or C-S-R6 or S-O-R7
  R5 = H or NHCH3 or NHCH2CONH2 or 2-pyridon-5-yl
  R6 = NH2 or C(=NHCN)NHCH3
  R7 = NH2 or NHCH3 or NH-cyclopentyl or 2-thienyl or 8-quinolyl or 2-(4-methylpiperazin-1-yl)pyrid-5-yl
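
The way nested R-group definitions multiply out can be sketched by enumerating the alternatives listed on this slide. The sketch assumes each R-group occurs exactly once in the scaffold. Under that assumption the listed alternatives give 2 x 2 x 12 = 48 combinations; the slide's total of 192 depends on the structure diagram, which is not part of this transcript, so the remaining factor of four is left unexplained here rather than guessed.

```python
from itertools import product

# R-group alternatives transcribed from the slide (R1, R2 and n are single-valued).
R = ["2-chlorophenyl", "2,3-dichlorophenyl"]
R3 = ["H", "CH3"]
R5 = ["H", "NHCH3", "NHCH2CONH2", "2-pyridon-5-yl"]
R6 = ["NH2", "C(=NHCN)NHCH3"]
R7 = ["NH2", "NHCH3", "NH-cyclopentyl", "2-thienyl", "8-quinolyl",
      "2-(4-methylpiperazin-1-yl)pyrid-5-yl"]

# R4 is nested: each of its three alternatives carries its own inner R-group,
# so its options add up (4 + 2 + 6 = 12) rather than multiply.
R4 = (["C-O-" + r5 for r5 in R5]
      + ["C-S-" + r6 for r6 in R6]
      + ["S-O-" + r7 for r7 in R7])


def enumerate_specifics():
    """All (R, R3, R4) combinations, assuming one occurrence of each group."""
    return list(product(R, R3, R4))


specifics = enumerate_specifics()
print(len(R4))         # 12
print(len(specifics))  # 48
```

Even this small claim shows why explicit enumeration is avoided in practice: alternatives within a group add, but independent groups multiply.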

14 Types of variation
  - Substituent variation: R1 is H or Cl
  - Homology variation: R3 is 1-3 carbon alkyl
  - Position variation: R2 is F or Cl
  - Frequency variation: n is 2-4

15 Complexity of generic structures
  - Types of variation can be nested
  - There may be relations between different parts of the claim
    - R1 = methyl, ethyl or phenyl
    - n = 1-6 if R1 = phenyl, else n = 1, 2
  - Parts of the claim may not be defined explicitly
    - Optionally substituted by an N-containing group
    - R1 = alkyl (1-4) or aryl
    - R1 = any electron-withdrawing group
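
The conditional relation on this slide (the allowed range of n depends on the choice of R1) can be expressed as a dependent enumeration. This is an illustrative sketch only; the names `R1_OPTIONS`, `allowed_n` and `enumerate_claim` are hypothetical.

```python
from typing import List, Tuple

R1_OPTIONS = ["methyl", "ethyl", "phenyl"]


def allowed_n(r1: str) -> range:
    """n = 1-6 if R1 is phenyl, otherwise n = 1 or 2."""
    return range(1, 7) if r1 == "phenyl" else range(1, 3)


def enumerate_claim() -> List[Tuple[str, int]]:
    """All (R1, n) pairs permitted by the claim."""
    return [(r1, n) for r1 in R1_OPTIONS for n in allowed_n(r1)]


pairs = enumerate_claim()
print(len(pairs))  # 10, compared with 3 * 6 = 18 if n were independent of R1
```

Implicit definitions such as "any electron-withdrawing group" cannot be handled this way, since the set of alternatives is not listed and has to be recognised by a rule rather than enumerated.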

16 Substructure searching of generics
  - Graph-based representations are again used, with three levels of search so as to minimise the number of structures that need to be enumerated
    - Screen search
    - Reduced graph search
    - Subgraph isomorphism search
  - The CAS MARPAT system exemplifies this approach (www.cas.org/expertise/cascontent/marpat.html)
    - Ca. 750K searchable Markush structures from ca. 300K patents

17 Markush structure and a reduced graph

18 Processing non-structural information
  - Long-standing interest in using NLP to extract facts from the chemical literature
  - Early work
    - CAS (melting points, reaction yields)
    - Sheffield (patent citations, chemical names)
  - CLIDE at Leeds
    - OCR of synthetic chemistry literature to identify reaction sequences and associated information (http://www.chem.leeds.ac.uk/ICAMS/new_web; http://www.simbiosys.ca/clide/index.html/)
  - OSCAR at Cambridge
    - Recognises chemical names, boiling points, peaks in mass spectra, refractive indices, optical rotations, synthetic yields
    - Suggested for use as an editorial checker for journals (http://wwmm.ch.cam.ac.uk/wikis/wwmm/images/4/44/CompLife_2006.pdf)
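
As a rough illustration of what extracting facts from text involves, the sketch below pulls melting points and reaction yields out of a sentence with regular expressions. It is far simpler than the systems named above and does not reproduce any of their methods; the patterns, the function name and the sample sentence are all invented for the example.

```python
import re
from typing import Dict, List

# "m.p. 132-134 °C", "mp 95 °C", "melting point: 95 °C"
MP_PATTERN = re.compile(
    r"\b(?:m\.?p\.?|melting point)\s*(?:of|:|=)?\s*"
    r"(\d+(?:\.\d+)?)(?:\s*[-\u2013]\s*(\d+(?:\.\d+)?))?\s*°\s*C",
    re.IGNORECASE,
)

# "78% yield" or "yield: 78%"
YIELD_PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)\s*%\s*yield|yield(?:\s+of)?\s*[:=]?\s*(\d+(?:\.\d+)?)\s*%",
    re.IGNORECASE,
)


def extract_facts(text: str) -> Dict[str, List]:
    """Return melting-point ranges (low, high) in °C and yields in percent."""
    facts: Dict[str, List] = {"melting_points": [], "yields": []}
    for m in MP_PATTERN.finditer(text):
        low = float(m.group(1))
        high = float(m.group(2)) if m.group(2) else low
        facts["melting_points"].append((low, high))
    for m in YIELD_PATTERN.finditer(text):
        facts["yields"].append(float(m.group(1) or m.group(2)))
    return facts


sample = ("The title compound was obtained as white crystals "
          "(78% yield), m.p. 132-134 °C.")
print(extract_facts(sample))
# {'melting_points': [(132.0, 134.0)], 'yields': [78.0]}
```

Patterns like these are brittle: they miss unusual phrasings and cannot tell which compound a value belongs to, which is part of why dedicated chemical NLP tools were developed.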

19 References
  Chemoinformatics in general
  - Gasteiger, J. and Engel, T. (Eds.), Chemoinformatics: A Textbook, Weinheim, Wiley-VCH (2003).
  - Leach, A.R. and Gillet, V.J., An Introduction to Chemoinformatics, 2nd edition, Dordrecht, Kluwer (2007).
  - Paris, C.G., "Chemical structure handling by computer", Annual Review of Information Science and Technology, 32, 271-337 (1997).
  Structures in chemical patents
  - Berks, A.H., "Current state of the art of Markush topological search systems", World Patent Information, 23, 5-13 (2001).
  - Downs, G.M. and Barnard, J.M., "Chemical patents and structural information - the Sheffield research in context", Journal of Documentation, 54, 106-120 (1998).

