Representation and searching of molecules in chemical patents
Presented at IRF Symposium 2007, 8th November 2007, Vienna
Peter Willett, University of Sheffield, UK

Overview of talk
- Introduction
- Processing of patent structures
  - Specific molecules
  - Generic molecules
- Processing of non-structural information

The pharmaceutical industry
- Finds, develops and markets new drugs that
  - Can be used against previously untreatable diseases or that are better than current drugs
  - Have commercial value sufficient to meet the discovery and development costs
- Drug discovery is a vastly complex, multidisciplinary task that
  - Is inherently very risky and costly: about 10 years and $1.5B to bring a new drug to market
- Increasing use of informatics methods to enhance the cost-effectiveness of pharmaceutical research

Chemical patents
- Patents and journal articles are the two most important sources of information on new molecules
  - Chemical Abstracts Service now processes ca. 1M new documents a year and has records for ca. 30M specific small molecules
- Central role of the chemical structure diagram
- Other types of information
  - Synthetic details (yield, catalysts)
  - Physical properties (melting point, spectra, solubility)
  - Property of interest (lowers cholesterol level, increases viscosity)

Types of molecule in patents
- Specific molecules
  - Individual substances or groups of closely related substances in which the exact nature of the relationship is explicitly defined
- Generic molecules (Markush structures)
  - Class of substances, this class being defined either explicitly or implicitly
  - A generic claim may cover far more molecules than have actually been synthesised and tested (or even more than can possibly exist)

Specific molecules
- Molecules in chemical databases (patent or otherwise) are represented by graphs
  - Nodes and edges of a graph denote the atoms and bonds of a molecule
  - Can be extended to encode molecular 3D structures
  - A searchable representation, not just an image
- Graph representation means that graph isomorphism algorithms can be used for searching
  - Graph isomorphism (exact match)
  - Subgraph isomorphism (partial match)
  - Maximum common subgraph isomorphism (best match)
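
To make the graph view concrete, here is a minimal sketch of a substructure (subgraph isomorphism) search in plain Python. The data layout (a list of atom symbols plus a dictionary of bond orders), the function names and the example molecules are assumptions made for illustration; production systems use far more refined algorithms and richer atom and bond typing.

```python
from typing import Dict, List, Optional, Tuple

# A molecule as a labelled graph: atom symbols indexed by position, and
# bonds as a mapping from an (i, j) atom-index pair to a bond order.
# Hydrogens are left implicit in this toy representation.
Molecule = Tuple[List[str], Dict[Tuple[int, int], int]]


def bond_order(mol: Molecule, i: int, j: int) -> Optional[int]:
    """Return the order of the bond between atoms i and j, or None."""
    bonds = mol[1]
    return bonds.get((i, j), bonds.get((j, i)))


def find_substructure(query: Molecule, target: Molecule) -> Optional[Dict[int, int]]:
    """Backtracking subgraph isomorphism: map every query atom onto a
    distinct target atom so that atom symbols and bond orders agree."""
    q_atoms, t_atoms = query[0], target[0]

    def extend(mapping: Dict[int, int]) -> Optional[Dict[int, int]]:
        q = len(mapping)  # query atoms are placed in index order
        if q == len(q_atoms):
            return dict(mapping)
        for t in range(len(t_atoms)):
            if t in mapping.values() or q_atoms[q] != t_atoms[t]:
                continue
            # Each query bond to an already-placed atom must also be
            # present, with the same order, in the target.
            consistent = all(
                bond_order(query, q, p) is None
                or bond_order(query, q, p) == bond_order(target, t, mapping[p])
                for p in mapping
            )
            if consistent:
                mapping[q] = t
                found = extend(mapping)
                if found is not None:
                    return found
                del mapping[q]  # backtrack
        return None

    return extend({})


# Hypothetical examples: ethanol (C-C-O) and two two-atom queries.
ethanol = (["C", "C", "O"], {(0, 1): 1, (1, 2): 1})
c_o_single = (["C", "O"], {(0, 1): 1})
carbonyl = (["C", "O"], {(0, 1): 2})

print(find_substructure(c_o_single, ethanol))  # {0: 1, 1: 2}
print(find_substructure(carbonyl, ethanol))    # None
```

The worst-case cost of this backtracking grows exponentially with query size, which is the inefficiency that motivates screening before the graph match.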

2D substructure searching

2D similarity searching

Substructure and similarity searching
- The two standard modes of database access in chemoinformatics
- Effective but highly inefficient, owing to algorithmic complexity of graph operations
- Significant increase in efficiency by means of fragment bit-strings (“screening”)
  - Allows interactive searching of multi-million compound databases
- 3D also feasible but far more time-consuming
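
The screening step can be sketched as a bitwise test: a database molecule can only contain the query substructure if every bit set for the query is also set for the molecule. The bit assignments and the three database entries below are invented for illustration, and Python integers stand in for the bit-strings.

```python
from typing import Dict, List

# Hypothetical 4-bit screen: bit 0 = C-C, bit 1 = C-O, bit 2 = C=O, bit 3 = C-N.
DATABASE: Dict[str, int] = {
    "ethanol": 0b0011,      # C-C, C-O
    "acetic acid": 0b0111,  # C-C, C-O, C=O
    "acetamide": 0b1101,    # C-C, C=O, C-N
}


def passes_screen(query_bits: int, molecule_bits: int) -> bool:
    """True if every bit set in the query is also set in the molecule."""
    return (query_bits & molecule_bits) == query_bits


def screen(query_bits: int, database: Dict[str, int]) -> List[str]:
    """Return the names of molecules that survive the screen; only these
    need to go on to the expensive subgraph isomorphism match."""
    return [name for name, bits in database.items()
            if passes_screen(query_bits, bits)]


# Query containing both a C-O and a C=O fragment.
print(screen(0b0110, DATABASE))  # ['acetic acid']
```

A screen can produce false positives (the fragments are present but not connected as in the query), never false negatives, so the graph match is still needed on the survivors.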

- Each bit in the fingerprint (or fragment bit-string) represents one molecular fragment. Typical length is ~1000 bits
- The fingerprint records the presence/absence (1/0) of each fragment in that molecule
- A query structure and a database structure can be compared in terms of the bits (i.e., fragment substructures) that they have in common
- Can be regarded as the chemoinformatics analogue of a text signature
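
As an illustration of comparing two structures by their shared bits, the sketch below builds dictionary-based fingerprints and scores them with the Tanimoto coefficient. The slide does not name a similarity coefficient, so the choice of Tanimoto is an assumption (it is a common one), and the six-fragment dictionary and the fragment sets are invented; a real fingerprint would have on the order of 1000 bits.

```python
from typing import Iterable, List

# Hypothetical fragment dictionary: bit i records the presence of FRAGMENTS[i].
FRAGMENTS: List[str] = ["C-C", "C-O", "C=O", "C-N", "c:c", "O-H"]


def fingerprint(present: Iterable[str]) -> int:
    """Set bit i to 1 if FRAGMENTS[i] occurs in the molecule."""
    present = set(present)
    bits = 0
    for i, fragment in enumerate(FRAGMENTS):
        if fragment in present:
            bits |= 1 << i
    return bits


def common_bits(a: int, b: int) -> int:
    """Number of fragments that two fingerprints have in common."""
    return bin(a & b).count("1")


def tanimoto(a: int, b: int) -> float:
    """Shared bits divided by the bits set in either fingerprint."""
    union = bin(a | b).count("1")
    return common_bits(a, b) / union if union else 0.0


ethanol_fp = fingerprint({"C-C", "C-O", "O-H"})
acetic_acid_fp = fingerprint({"C-C", "C-O", "C=O", "O-H"})

print(common_bits(ethanol_fp, acetic_acid_fp))  # 3
print(tanimoto(ethanol_fp, acetic_acid_fp))     # 0.75
```

Ranking a database by this score against a query fingerprint gives a similarity search; no graph matching is involved at all.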

Generic molecules
- The Markush structure provides a simple and compact way of representing sets of specifics with common structural features
- Typically two parts
  - Invariant part
    - Often a common ring scaffold
  - Varying parts
    - Range of possible substructures, often at a range of possible positions

Simple example: 192 specifics
- R = 2-chlorophenyl or 2,3-dichlorophenyl
- R1 = CH3
- R2 = C2H5
- n = 2
- R3 = H or CH3
- R4 = C-O-R5 or C-S-R6 or S-O-R7
- R5 = H or NHCH3 or NHCH2CONH2 or 2-pyridon-5-yl
- R6 = NH2 or C(=NHCN)NHCH3
- R7 = NH2 or NHCH3 or NH-cyclopentyl or 2-thienyl or 8-quinolyl or 2-(4-methylpiperazin-1-yl)pyrid-5-yl

Types of variation
- Substituent variation
  - R1 is H or Cl
- Homology variation
  - R3 is 1-3 carbon alkyl
- Position variation
  - R2 is F or Cl
- Frequency variation
  - n is 2-4

Complexity of generic structures
- Types of variation can be nested
- May be relations between different parts of the claim
  - R1 = methyl, ethyl or phenyl
  - n = 1-6 if R1 = phenyl, else n = 1, 2
- Parts of the claim may not be defined explicitly
  - Optionally substituted by an N-containing group
  - R1 = alkyl (1-4) or aryl
  - R1 = any electron-withdrawing group
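
The conditional relation on this slide (the range of n depends on the choice of R1) can be enumerated directly. The sketch below only lists the permitted (R1, n) pairs; the function names are illustrative, and a real system would go on to build a structure for each pair.

```python
from typing import List, Tuple

R1_OPTIONS: List[str] = ["methyl", "ethyl", "phenyl"]


def allowed_n(r1: str) -> range:
    """n = 1-6 if R1 is phenyl, otherwise n = 1 or 2."""
    return range(1, 7) if r1 == "phenyl" else range(1, 3)


def enumerate_assignments() -> List[Tuple[str, int]]:
    """List every (R1, n) combination that the claim permits."""
    return [(r1, n) for r1 in R1_OPTIONS for n in allowed_n(r1)]


assignments = enumerate_assignments()
print(len(assignments))  # 10, i.e. 2 + 2 + 6
```

Without the condition the same variables would give 3 x 6 = 18 combinations, so ignoring relations between parts of a claim over-generates specifics.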

Substructure searching of generics
- Graph-based representations are again used, with three levels of search so as to minimise the numbers of structures that need to be enumerated
  - Screen search
  - Reduced graph search
  - Subgraph isomorphism search
- The CAS MARPAT system exemplifies this approach
  - Ca. 750K searchable Markush structures from ca. 300K patents

Markush structure and a reduced graph
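
A reduced graph collapses groups of atoms into single nodes so that structures can be compared at a coarser level before any atom-by-atom matching. The sketch below uses one simple scheme, assumed here for illustration: connected ring atoms become an "R" node and connected non-ring atoms become an "N" node. The slides do not specify which reduction scheme MARPAT uses, and the function name and example molecule are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def reduced_graph(atoms: List[str],
                  bonds: List[Tuple[int, int]]) -> Tuple[List[str], List[Tuple[int, int]]]:
    """Collapse a molecular graph into ring (R) and non-ring (N) nodes."""
    n = len(atoms)
    adj: Dict[int, Set[int]] = {i: set() for i in range(n)}
    for a, b in bonds:
        adj[a].add(b)
        adj[b].add(a)

    def still_connected(a: int, b: int) -> bool:
        """True if b is reachable from a without using the bond a-b itself."""
        seen, stack = {a}, [a]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if (x, y) in ((a, b), (b, a)) or y in seen:
                    continue
                seen.add(y)
                stack.append(y)
        return b in seen

    # A bond lies in a ring if removing it leaves its two ends connected.
    ring_bonds = {frozenset(bd) for bd in bonds if still_connected(*bd)}
    ring_atoms = {a for bd in ring_bonds for a in bd}

    def internal(a: int, b: int) -> bool:
        """Bonds that stay inside one reduced node: ring bonds, or bonds
        between two non-ring atoms."""
        if frozenset((a, b)) in ring_bonds:
            return True
        return a not in ring_atoms and b not in ring_atoms

    node_of: Dict[int, int] = {}
    labels: List[str] = []
    for start in range(n):
        if start in node_of:
            continue
        node_id = len(labels)
        labels.append("R" if start in ring_atoms else "N")
        node_of[start] = node_id
        stack = [start]
        while stack:  # flood-fill over internal bonds only
            x = stack.pop()
            for y in adj[x]:
                if y not in node_of and internal(x, y):
                    node_of[y] = node_id
                    stack.append(y)

    edges = sorted({tuple(sorted((node_of[a], node_of[b])))
                    for a, b in bonds if node_of[a] != node_of[b]})
    return labels, edges


# Diphenylmethane skeleton: ring 0-5, CH2 at atom 6, ring 7-12.
atoms = ["C"] * 13
ring_a = [(i, (i + 1) % 6) for i in range(6)]
ring_b = [(7 + i, 7 + (i + 1) % 6) for i in range(6)]
bonds = ring_a + [(0, 6), (6, 7)] + ring_b

print(reduced_graph(atoms, bonds))  # (['R', 'N', 'R'], [(0, 1), (1, 2)])
```

Thirteen atoms reduce to three nodes here, which is why a reduced graph search is a cheap intermediate filter between the bit-string screen and the full subgraph isomorphism search.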

Processing non-structural information
- Long-standing interest in using NLP to extract facts from the chemical literature
- Early work
  - CAS (melting points, reaction yields)
  - Sheffield (patent citations, chemical names)
- CLIDE at Leeds
  - OCR of synthetic chemistry literature to identify reaction sequences and associated information
- OSCAR at Cambridge
  - Recognises chemical names, boiling points, peaks in mass spectra, refractive indices, optical rotations, synthetic yields
  - Suggested for use as an editorial checker for journals ( 06.pdf)

References
- Chemoinformatics in general
  - Gasteiger, J. and Engel, T. (Eds.), Chemoinformatics: A textbook, Weinheim, Wiley-VCH (2003).
  - Leach, A.R. and Gillet, V.J., An Introduction to Chemoinformatics, Dordrecht, Kluwer, 2nd edition (2007).
  - Paris, C.G., “Chemical structure handling by computer”, Annual Review of Information Science and Technology, 32, (1997).
- Structures in chemical patents
  - Berks, A.H., “Current state of the art of Markush topological search systems”, World Patent Information, 23, 5-13 (2001).
  - Downs, G.M. and Barnard, J.M., “Chemical patents and structural information - the Sheffield research in context”, Journal of Documentation, 54, (1998).