from scientific literature Principal Scientist (Chemoinformatics)


1 ChemEngine: An automated chemical data harvesting tool for molecular inventory and chemical computing from scientific literature
M. Karthikeyan, Ph.D., Principal Scientist (Chemoinformatics), CSIR-National Chemical Laboratory, Pune, India

2 Open Source in Chemoinformatics
ChemXtreme (JCIM 2005), GoogleAPI (MSDS)
ChemStar (JCIM 2008), (PubChem 18m)
ChemRobot (2011 * Denver), (Img2Str)
ChemInfoCloud (2014 * San Francisco)
ChemScreener (CCHTS 2015), Virtual Screening
ChemEngine (2016 * San Diego), Poster, CINF
Why Open Source?

3 ChemEngine
Digital access to chemical journals has made a vast array of molecular information available in supplementary material files in PDF format. Extracting this molecular information from PDF documents, however, is a daunting task. Here we present an approach to harvest 3D molecular data from the supporting information of scientific research articles, which is normally available from publishers' resources. To demonstrate the feasibility of extracting truly computable molecules from PDF files quickly and efficiently, we have developed an application, ChemEngine. The program recognizes textual patterns in the supplementary data and generates standard molecular structure data (bond matrix, atomic coordinates) that can be passed automatically to a multitude of computational processes.

4 ChemEngine
The methodology has been demonstrated through three case studies on different formats of coordinate data stored in supplementary information files, in which ChemEngine selectively harvested the atomic coordinates and interpreted them as molecules with high accuracy. The reusability of the extracted coordinate data was demonstrated by computing single point energies (SPEs) that were in close agreement with the original computed data provided with the articles. It is envisaged that the methodology will enable large-scale conversion of molecular information from supplementary PDF files into a collection of ready-to-compute molecular data, creating an automated workflow for advanced computational processes.
Software: the source code and project file are free to download from the SourceForge site.
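As a rough illustration of what "ready-to-compute" could mean for the single point energy step, the sketch below writes harvested coordinates into a Gaussian-style input file. The function name and the default route line "# HF/6-31G(d) SP" are placeholders chosen for this example, not values from the slides; in practice the level of theory would have to match the one used in the original article.

```python
def gaussian_sp_input(atoms, title, route="# HF/6-31G(d) SP",
                      charge=0, multiplicity=1):
    """Build a Gaussian-style single point input from (symbol, x, y, z) tuples.

    The route line is a hypothetical placeholder, not taken from the slides.
    """
    # Layout: route section, blank line, title, blank line,
    # charge and multiplicity, one line per atom, trailing blank line.
    lines = [route, "", title, "", f"{charge} {multiplicity}"]
    lines += [f"{s} {x:.6f} {y:.6f} {z:.6f}" for s, x, y, z in atoms]
    return "\n".join(lines) + "\n\n"


if __name__ == "__main__":
    # Made-up two-atom fragment, for illustration only.
    print(gaussian_sp_input([("O", 0.0, 0.0, 0.0), ("H", 0.0, 0.0, 0.96)],
                            "harvested molecule"))
```

Keeping the route line as a parameter lets one harvested geometry be resubmitted at whatever level of theory the source article reports.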

5 Re-usable Chemical Structures
When chemical structures are stored in a truly computable format with atoms and bond matrices (vector format, Cartesian coordinates), they can be processed electronically for computational and informatics purposes. However, when the files are transformed or stored as PDF (Portable Document Format), which is typically used for the convenience of printing and reading, the valuable and re-usable molecular data is lost: it ends up buried in the scientific literature as documents and is seldom used for further computational studies.

6 Image to Structure
Transforming raster images into vector graphics and then identifying the pixel information associated with the atoms and bonds of a molecule is a cumbersome job. Tools have also been developed to harvest molecular data from web camera and scanned images, in which the raster graphics data is transformed into vector graphics to retrieve the atom and bond information needed to generate truly computable and re-usable chemical structures. Several attempts have been made to convert molecular images into a truly computable format, such as ChemRobot, OSRA, ChemReader and CLiDE, but only partial success has been achieved.

7 ChemEngine FREE !

8 ChemEngine
We have developed an application, ChemEngine, that reads files stored in PDF format and generates computable molecular structures. To demonstrate the efficiency of the program, supporting material data files in three different formats, freely available from the publishers, were selected, and the data was parsed using PDF readers to extract the textual information.

9 The Challenge
Supporting material files contain more than molecular data: brief descriptions of molecules, computed data, plots, page numbers, document information, manuscript bibliographic details and so on, all in a single PDF document. This makes harvesting the molecular data extremely difficult, because all of these have to be excluded while parsing the PDF file.
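One way to picture the exclusion step is to scan the extracted text line by line and keep only runs of consecutive coordinate-like lines, treating everything else as a separator. The sketch below is a simplified illustration under stated assumptions: the pattern COORD_LINE, the function split_into_blocks and the min_atoms threshold are hypothetical and stricter than the patterns ChemEngine itself uses.

```python
import re

# Hypothetical stand-in pattern (not ChemEngine's own): an element symbol
# followed by three whitespace-separated decimal numbers.
COORD_LINE = re.compile(
    r"^([A-Z][a-z]?)\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)$"
)


def split_into_blocks(lines, min_atoms=2):
    """Group consecutive coordinate lines into molecules; drop all other text."""
    blocks, current = [], []
    for raw in lines:
        match = COORD_LINE.match(raw.strip())
        if match:
            symbol, x, y, z = match.groups()
            current.append((symbol, float(x), float(y), float(z)))
            continue
        # A non-coordinate line (title, page number, blank) ends the block.
        if len(current) >= min_atoms:
            blocks.append(current)
        current = []
    if len(current) >= min_atoms:
        blocks.append(current)
    return blocks
```

Requiring a minimum number of consecutive matches is a cheap guard against stray lines of ordinary text that happen to look like a coordinate row.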

10 General Workflow

11 General Workflow
Diagram stages: Input Data Format; Pattern Recognition; Regular Expression; Generate Coordinates

12 Pattern Recognition
Pseudo code:
(Co-ordinate Text).matches("Regular Expression Pattern with Delimiter Definition");
For example:
Delimiter: Comma
String_Data.matches("^[A-Za-z0-9]{1,2}\\,[0]{0,1}[\\,]{0,1}-{0,1}.{1,2}[0-9]{1,10}\\,-{0,1}.{1,2}[0-9]{1,10}.{1,}")
Delimiter: Space
String_Data.matches("^[A-Za-z0-9]{1,2}\\s+[0]{0,1}[\\s+]{0,1}-{0,1}.{1,2}[0-9]{0,10}\\s+-{0,1}.{1,2}[0-9]{0,10}.{1,}")
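The two patterns above can be tried out directly. The sketch below transcribes them into Python, where the doubled backslashes of the string literals become single backslashes in raw strings; a whole-string match is used, on the assumption that the pseudo code's matches() call tests the entire line. The function name is_coordinate_line and the sample rows are made up for this example.

```python
import re

# Patterns transcribed from the slide; "\\," and "\\s" become "\," and "\s".
COMMA_PATTERN = re.compile(
    r"^[A-Za-z0-9]{1,2}\,[0]{0,1}[\,]{0,1}-{0,1}.{1,2}[0-9]{1,10}"
    r"\,-{0,1}.{1,2}[0-9]{1,10}.{1,}"
)
SPACE_PATTERN = re.compile(
    r"^[A-Za-z0-9]{1,2}\s+[0]{0,1}[\s+]{0,1}-{0,1}.{1,2}[0-9]{0,10}"
    r"\s+-{0,1}.{1,2}[0-9]{0,10}.{1,}"
)


def is_coordinate_line(line, delimiter):
    """Return True if the whole line matches the pattern for the delimiter."""
    pattern = COMMA_PATTERN if delimiter == "comma" else SPACE_PATTERN
    # fullmatch: the entire line must match, not just a prefix (assumption
    # about how the pseudo code's matches() behaves).
    return pattern.fullmatch(line.strip()) is not None


if __name__ == "__main__":
    print(is_coordinate_line("C,0,-1.234567,0.456789,2.000000", "comma"))
    print(is_coordinate_line("C 0.000000 1.234567 -0.500000", "space"))
    print(is_coordinate_line("SUPPORTING INFORMATION", "space"))
```

Both patterns are deliberately permissive (for example, `.{1,2}` accepts any characters before the digits), so in practice they act as a first-pass filter rather than a strict validator.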

13 Harvesting Molecular Data
Figure labels: Bond matrix data (Computed); Coordinates Data (From PDF) (To be Processed); Non-Molecular Data (To be Excluded)
Bond matrix example: Mol_0 1 C1 H; Mol_0 2 C1 H; Mol_0 3 C1 H; Mol_0 4 C1 S; Mol_0 5 S5 H
Coordinates example (atom column only): C H H H S H
Non-molecular example: S1 SUPPORTING INFORMATION Thiol-Ene Click Chemistry: Computational and Kinetic Analysis of the Influence of Alkene Functionality. Brian H. Northrop* and Roderick N. Coffey, Department of Chemistry, Wesleyan University, Middletown, Connecticut

14 ChemEngine Pattern Recognition
Figure: input rows of the form "AS X Y Z" or "AN X Y Z", with Delimiter (Space) or Delimiter (Comma) (for example, rows beginning "C,0,"), pass through ChemEngine Pattern Recognition to give a coordinate matrix (AN X Y Z), an optional CT (C1 S, C1 H, C1 H) and a 3D molecule.
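The normalization step in this figure can be sketched as follows, assuming that AS and AN stand for atomic symbol and atomic number (the transcript does not spell this out). The helper names parse_atom and to_xyz and the five-element lookup table are illustrative only. The output uses the common XYZ text layout: atom count, comment line, then one "symbol x y z" row per atom.

```python
# Small lookup for illustration only; a real table would cover all elements.
SYMBOL_BY_NUMBER = {1: "H", 6: "C", 7: "N", 8: "O", 16: "S"}


def parse_atom(line, delimiter):
    """Turn one space- or comma-delimited row into (symbol, x, y, z)."""
    sep = "," if delimiter == "comma" else None  # None splits on whitespace
    fields = [f.strip() for f in line.strip().split(sep)]
    fields = [f for f in fields if f]
    label = fields[0]
    # Take the last three fields so an extra column ("C,0,x,y,z") is skipped.
    x, y, z = (float(v) for v in fields[-3:])
    # Assumption: a numeric label is an atomic number (AN), else a symbol (AS).
    symbol = SYMBOL_BY_NUMBER[int(label)] if label.isdigit() else label
    return symbol, x, y, z


def to_xyz(atoms, title=""):
    """Serialize atoms as XYZ text: count, comment line, then coordinates."""
    rows = [str(len(atoms)), title]
    rows += [f"{s:<2} {x:12.6f} {y:12.6f} {z:12.6f}" for s, x, y, z in atoms]
    return "\n".join(rows) + "\n"
```

For example, `to_xyz([parse_atom("C,0,-1.5,0.25,2.0", "comma")], "Mol_0")` yields a one-atom XYZ block that most molecular viewers and converters can read.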

15 Bond Recognition
Figure, panels (a) to (c): atoms A1 and A2, radii r1 and r2 (r1' and r2' in the later panels), length l1, and a 0.35 Å value.
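The transcript does not preserve how the radii and the 0.35 Å value are combined, so the following is only a minimal sketch of a common distance-based rule, not necessarily ChemEngine's: two atoms are treated as bonded when their separation is at most the sum of their covalent radii plus a tolerance. Using 0.35 Å as that tolerance is an assumption, and the radii table and the function find_bonds are illustrative.

```python
import math
from itertools import combinations

# Approximate single-bond covalent radii in angstroms (illustrative subset).
COVALENT_RADIUS = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "S": 1.05}

# Assumption: the 0.35 angstrom value on the slide acts as a tolerance.
TOLERANCE = 0.35


def find_bonds(atoms, tolerance=TOLERANCE):
    """Return index pairs (i, j) of atoms closer than r_i + r_j + tolerance.

    atoms is a list of (symbol, x, y, z) tuples.
    """
    bonds = []
    for (i, a), (j, b) in combinations(enumerate(atoms), 2):
        distance = math.dist(a[1:], b[1:])
        limit = COVALENT_RADIUS[a[0]] + COVALENT_RADIUS[b[0]] + tolerance
        if distance <= limit:
            bonds.append((i, j))
    return bonds
```

The pairwise loop is quadratic in the number of atoms, which is fine for the small molecules typical of supporting information files.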

16 Compare the Bond Distance
Figure: ChemEngine compared with Gaussian 09

17 Computed Conformational Energy

18 Case Studies
Entry | Case study | N (molecules) | Regular expression pattern | Format & delimiter
1 | Epoxide formation from sulfur ylides and aldehydes | 29 | ^[A-Za-z0-9]{1,2}\\s+-{0,1}.{1,2}[0-9]{1,8}\\s+-{0,1}.{1,2}[0-9]{1,8}.{1,} | PDF, Space
2 | Thiol-ene click chemistry | 115 | | Text
3 | Design of tetra(arenediyl)bis(allyl) derivatives for Cope rearrangement transition states | 55 | ^[A-Za-z0-9]{1,2}\\,[0]{0,1}[\\,]{0,1}-{0,1}.{1,2}[0-9]{1,10}\\,-{0,1}.{1,2}[0-9]{1,10}.{1,} | Comma

19 ChemEngine

20 Harvesting Molecular Data

21 Graphical User Interface

22

23 Acknowledgement
The Director, CSIR-National Chemical Laboratory, Pune
Department of Science and Technology, for the International Travel Grant
ACS-CINF Division
SourceForge

24 Thank You

