Towards knowledge representation improvements in chemistry Evan Bolton, Ph.D. Mar. 15, 2016 National Center for Biotechnology Information.


1 Towards knowledge representation improvements in chemistry Evan Bolton, Ph.D. Mar. 15, 2016 National Center for Biotechnology Information

2 MOST CHEMISTRY KNOWLEDGE IS LOCKED UP IN TEXT. Premise Image credits: https://play.google.com/store/apps/details?id=com.uc.addon.web2pdf http://ideasuccessnetwork.com/idea-discovery-article-how-good-idea-mushroomed/ http://xk2.ahu.cn/ http://libraryschool.libguidescms.com/content.php?pid=682172 https://www.freshrelevance.com/blog/real-time-marketing-report-for-may-2014

3 Simplistic science workflow: Do science, Publish papers, Search papers, Read papers. Computers help, yet we rely on humans who abstract out keywords, article gist, data, etc. (CAS, Medline, ChEMBL, etc.). Image credits: http://www.how-to-draw-funny-cartoons.com/cartoon-scientist.html http://computertutorinc.net/computer-maintenance-safety-tips/

4 Computer aided abstraction. BioCreative V (2015): Named entity recognition (NER) is now as good as a human at finding chemical names, gene names, and disease names in a text corpus. Natural Language Processing (NLP) attempts to “read” text with a computer with human-like understanding … and is getting there. Image credits: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/ner/ http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/ http://www.intechopen.com/books/theory-and-applications-for-advanced-text-mining/biomedical-named-entity-recognition-a-survey-of-machine-learning-tools http://www.slideshare.net/NextMoveSoftware/leadmine-a-grammar-and-dictionary-driven-approach-to-chemical-entity-recognition

5 Effects of computer understanding you can experience

6 HOW DO WE BRING IT ALL TOGETHER? Knowledge representation helps to provide “information about the world in a form that a computer system can utilize to solve complex tasks”… https://en.wikipedia.org/wiki/Knowledge_representation_and_reasoning Image credit: http://artint.info/html/ArtInt_8.html

7 Chemical information is everywhere now

8 PubChem data complexity
Many links between large record collections:
- ~210M Substances linked to ~85M Compounds
- ~85M Compounds linked to ~85M Compounds
- ~225M Bioactivities linked to ~3M Substances
- ~225M Bioactivities linked to ~2M Compounds
- ~225M Bioactivities linked to ~1M BioAssays
- ~11M PMIDs linked to ~100K Compounds
- ~3M Patents linked to ~30M Substances
- ~3M Patents linked to ~15M Compounds
Sparse and dense data and/or linking.
New types of data, links, and metadata on a regular basis.
With so many links, how does one ensure they are accurate and relevant? How does one improve the specificity of links?

9 Chemical information is not easy
How do you describe a chemical substance?
- No standards; metadata associated with chemical representation (e.g., purity)
How do you describe a chemical mixture?
- No standards (InChI?), often free-form text
How do you describe a bioactivity?
- Emerging standards, not widely adopted: Minimum Information About a Bioactive Entity (MIABE)
How do you draw a chemical structure?
- IUPAC Graphical Representation Standards for Chemical Structure Diagrams: not widely adopted, large pre-existing corpus
Most chemistry information is for humans, not machines.
Chemists invent arbitrary conventions to communicate chemical information.
Image credit: http://www.ivyroses.com/Chemistry/GCSE/What-is-a-substance.php

10 Chemical information is not designed for computers. As a chemist, you can understand and recognize that this picture is the chemical acetone. You can put a chemical name or registry identifier next to it (Acetone, 67-64-1). Is this not good enough? There are many names for a structure. The computer ‘sees’ a binary image, not a structure. Almost all chemical information is geared towards human understanding in the form of text and images.

11 Computer understanding of chemical information. Give a computer a chemical structure in a form that a (normal) human cannot understand. A computer can make the image from the structure. A computer can associate information with the structure. A computer can generate other key information from the structure. Example: CC(=O)C, Acetone, 67-64-1, propan-2-one, 58.07914 g/mol. Computer understanding can help provide human understanding. If the computer understands, we can leverage it for search, analysis, and more.
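To make "generate other key information from structure" concrete, here is a minimal sketch that derives the atom counts and molecular weight of acetone from its SMILES string CC(=O)C. It is an illustration only, not how PubChem computes properties: the parser handles just C, N, and O atoms, explicit bond symbols, and parentheses for branches (no rings, charges, aromatic atoms, or bracket atoms), and the atomic weights are an assumption, chosen because they reproduce the 58.07914 g/mol shown on the slide.

```python
# Assumed atomic weights (g/mol); chosen to reproduce the slide's value.
ATOMIC_WEIGHT = {"C": 12.0107, "H": 1.00794, "N": 14.0067, "O": 15.9994}
# Default valences used to infer implicit hydrogens.
VALENCE = {"C": 4, "N": 3, "O": 2}
BOND_ORDER = {"-": 1, "=": 2, "#": 3}


def atom_counts(smiles):
    """Count atoms (including implicit H) for a tiny SMILES subset."""
    atoms = []    # element symbol of each heavy atom
    used = []     # total bond order already used by each heavy atom
    stack = []    # atoms to return to when a branch closes
    prev = None   # index of the atom the next atom bonds to
    order = 1     # bond order for the next bond (single by default)
    for ch in smiles:
        if ch in VALENCE:
            atoms.append(ch)
            used.append(0)
            idx = len(atoms) - 1
            if prev is not None:
                used[prev] += order
                used[idx] += order
            prev, order = idx, 1
        elif ch in BOND_ORDER:
            order = BOND_ORDER[ch]
        elif ch == "(":
            stack.append(prev)
        elif ch == ")":
            prev = stack.pop()
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
    counts = {}
    for symbol, bonds in zip(atoms, used):
        counts[symbol] = counts.get(symbol, 0) + 1
        # Remaining valence is filled with implicit hydrogens.
        counts["H"] = counts.get("H", 0) + VALENCE[symbol] - bonds
    return counts


def molecular_weight(counts):
    """Sum atomic weights over the atom counts."""
    return sum(ATOMIC_WEIGHT[symbol] * n for symbol, n in counts.items())


acetone = atom_counts("CC(=O)C")
acetone_mw = molecular_weight(acetone)
print(acetone, round(acetone_mw, 5))
```

For real work a cheminformatics toolkit with a full SMILES parser would be used; the point here is only that a line notation, unlike an image, carries enough structure for a program to compute with.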

12 Chemical structure is not enough
Chemical information is a bit of a mess and can be rather nuanced:
- Names, names, and more names (200M+ in PubChem). Some standard names are not open and cannot be used/verified without $$$.
- Name/structure associations vary by use case (many overlapping):
  Acetic acid vs. acetic acid trihydrate
  Formaldehyde (gas) vs. formalin (liquid, 40% formaldehyde w/ water)
  Sulfuric acid: SO3 (gas) vs. H2SO4 (liquid)
  Glucose: L/D, ring open/closed (f/p), alpha/beta/both vs. glucose monohydrate
- Large corpus in the ‘wild’, with data-source-dependent nuances
- Verify with primary source(s) prior to information use, i.e., is this the form of the chemical I care about?
Go below 32% formaldehyde in water and it is non-flammable.
The same chemical structure can represent many things, most of which are use case dependent.
Users are not happy to see overlapping data on different forms of the same chemical.

13 Many data sources of relevant information: SDS Search and Product Safety Center (http://www.sigmaaldrich.com/); SIRI MSDS Index; Biological Safety Data Sheets; CHEMINDEX FREE on the WEB!; International Chemical Safety Cards (ICSC); Right to Know Hazardous Substance Fact Sheets; Pathogen Safety Data Sheets and Risk Assessment; NIOSH Pocket Guide to Chemical Hazards; TOXLINE: A TOXNET DATABASE; SDS and Chemical Information from Manufacturers; ChemIDplus: A TOXNET DATABASE; https://www.dol.gov/; http://www.csb.gov/. Does each organization (or scientist) use their own favorite data source(s)? Do these various data sources provide consistent information (gaps, errors)? How do their decisions change with different information (or lack of it)?

14 Chemical data issues are fundamental. Many answers to “How do I describe my substance and its data?” Information is geared towards humans. Need computer understanding of information. Chemical structure representation is insufficient. Publicly available chemical information is heavily fragmented. Lots of data links (use case dependent, relevancy, etc.).

15 WHAT ARE WE DOING ABOUT IT?

16 Community drive towards knowledge representation. Dear Colleagues, We would like to thank you for taking the time to participate in our first meeting to address chemical ontologies (CO). Below please find some notes / minutes from the meeting held in Basel, Switzerland, October 2, 2015. The purpose was to explore using computers to ingest machine-readable forms of molecules and to generate molecular attributes (descriptors). For example, ingesting a SMILES string and producing a set of triples that describe the molecule [molecule X “is a” ketone; “is a” amino acid; “is a” steroid; etc.]. The output of which would provide the basis of a chemical ontology to be used for classification purposes as well as for input for downstream operations such as knowledge graphs, data mining, chemical text mining and cognitive computing experiments. Historically these operations were performed manually or semi-automated; however, it is desirable to have a computer process for large scale processing to meet current day demands resulting from computer curation of the scientific literature. To date, two programs have been developed to accomplish this objective: one at OntoChem, a German informatics company, and another (ClassyFire) at the University of Alberta. While both programs produce reasonable output, there are differences that could lead to non-conformity in the resulting ontologies. One motivation for the workshop was anticipation that all parties will benefit from common standards for a computer-derived chemical ontology. Overall we believe that a Chemical Ontology can make contributions when it comes to answering scientifically relevant questions. Workshop hosted by Fatma Oezdemir-Zaech, Novartis Pharma AG

17 Things connect thru ontologies A common vocabulary for researchers who need to share information in a domain. Including machine-interpretable definitions of basic concepts in the domain and relations among them. Slide courtesy of Stephen Boyer, IBM

18 A Goal for Building Chemical Ontologies: Step 1) Establish Molecular Attributes; Step 2) Establish Functional Attributes; Step 3) Explore Integration & Predication. Slide courtesy of Stephen Boyer, IBM

19 There are many types of “Attributes”. Physical Attributes, examples: Molecular Weight, Melting Point, Boiling Point, etc. Molecular Attributes, examples: Steroid, Prostaglandin, Amino Acid, Alkene, Imidazole. Functional Attributes, examples: Anti-Inflammatory, Explosive, Refrigerant, Pesticide. Slide courtesy of Stephen Boyer, IBM

20 Consider this molecule Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta

21 Carboxylic acid Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta

22 Carboxylic acid Benzoic acid Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta

23 Carboxylic acid Benzoic acid Phenol Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta

24 Carboxylic acid Benzoic acid Phenol Hydroxy group Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta

25 Carboxylic acid Benzoic acid Phenol Hydroxy group Azo Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta

26 Carboxylic acid Benzoic acid Phenol Hydroxy group Azo Pyridine Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta

27 Carboxylic acid Benzoic acid Phenol Hydroxy group Azo Pyridine Sulfone Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta

28 Carboxylic acid Benzoic acid Phenol Hydroxy group Azo Pyridine Sulfone Sulfonamide Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta

29 Carboxylic acid Benzoic acid Phenol Hydroxy group Azo Pyridine Sulfone Sulfonamide Azobenzene Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta

30 Carboxylic acid Benzoic acid Phenol Hydroxy group Azo Pyridine Sulfone Sulfonamide Azobenzene Benzene Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta

31 Carboxylic acid Benzoic acid Phenol Hydroxy group Azo Pyridine Sulfone Sulfonamide Azobenzene Benzene. Molecular Attributes (Labels): Is a Benzoic acid; Is a Carboxylic acid; Is a Carbonyl cpd; Is a Phenol; Is an Azobenzene; Is an Azo compound; Is a Sulfone; Is a Sulfonamide; Is a Pyridine; Is a Benzene; Is a Hydroxy. Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta

32 Carboxylic acid Benzoic acid Phenol Hydroxy group Azo Pyridine Sulfone Sulfonamide Azobenzene Benzene. Functional Attributes: Is used for the treatment of Crohn's disease; Is used for the treatment of rheumatoid arthritis; Is used for the treatment of ulcerative colitis. Molecular Attributes (Labels): Is a Benzoic acid; Is a Carboxylic acid; Is a Carbonyl cpd; Is a Phenol; Is an Azobenzene; Is an Azo compound; Is a Sulfone; Is a Sulfonamide; Is a Pyridine; Is a Benzene; Is a Hydroxy. Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta
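The molecular and functional attributes on this slide map naturally onto the subject/predicate/object triples described in the workshop notes. The sketch below shows one possible in-memory form, assuming plain Python tuples; the subject name "molecule X", the helper names, and the spelled-out label "carbonyl compound" are illustrative choices, and a real system would more likely use RDF with ontology identifiers instead of free-text labels.

```python
# Placeholder subject; the slide does not name the molecule.
MOLECULE = "molecule X"

# Labels taken from the slide ("Carbonyl cpd" spelled out).
MOLECULAR_LABELS = [
    "benzoic acid", "carboxylic acid", "carbonyl compound", "phenol",
    "azobenzene", "azo compound", "sulfone", "sulfonamide",
    "pyridine", "benzene", "hydroxy",
]
FUNCTIONAL_USES = [
    "Crohn's disease", "rheumatoid arthritis", "ulcerative colitis",
]


def build_triples(subject, labels, uses):
    """Return (subject, predicate, object) triples for one molecule."""
    triples = [(subject, "is a", label) for label in labels]
    triples += [
        (subject, "is used for the treatment of", use) for use in uses
    ]
    return triples


def objects_of(triples, subject, predicate):
    """List the objects attached to a subject by a given predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


triples = build_triples(MOLECULE, MOLECULAR_LABELS, FUNCTIONAL_USES)
print(objects_of(triples, MOLECULE, "is used for the treatment of"))
```

Keeping both attribute types in one triple list is what lets a later query combine them, for example "which sulfonamides are used to treat ulcerative colitis".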

33 SMILES Name Molecular Attributes / Labels Molecular Attributes / labels Operation A Operation B Structure Name Slide courtesy of Stephen Boyer, IBM

34 SMILES Name Molecular Labels Molecular Labels Operation A ClassyFire Structure Name Molecular Labels OntoChem Molecular Labels Other Ex : Derwent Frag Codes, ChEBI, MeSH Normalization Process – compatible with rules of chemical nomenclature Slide courtesy of Stephen Boyer, IBM

35 Name Molecular Descriptors / Attributes / Labels 2-[(1S,2S,4R,8S,9S,11S,12R,13S,19S)-12,19-difluoro-11-hydroxy-6,6,9,13-tetramethyl-16-oxo-5,7-dioxapentacyclo[10.8.0.0^{2,9}.0^{4,8}.0^{13,18}]icosa-14,17-dien-8-yl]-2-oxoethyl acetate Slide courtesy of Stephen Boyer, IBM

36 SMILES String: [H][C@@]12C[C@@]3([H])[C@]4([H])C[C@]([H])(F)C5=CC(=O)C=C[C@]5(C)[C@@]4(F)[C@@H](O)C[C@]3(C)[C@@]1(OC(C)(C)O2)C(=O)COC(C)=O
ClassyFire: Halogenated steroids (6); Fluorohydrins (7); Halohydrins (7); 1,3-dioxolanes (9); 11-beta-hydroxysteroids (9); Dioxolanes (9); 3-oxo delta-1,4-steroids (10); Alpha-acyloxy ketones (10); Delta-1,4-steroids (10); 11-hydroxysteroids (12); Gluco/mineralocorticoids, progestogens and derivatives (13); Pregnane steroids (13); 20-oxosteroids (15); Acetate salts (22); 3-oxosteroids (26); Oxosteroids (27); Carboxylic acid salts (30); Hydroxysteroids (32); Cyclic ketones (45); Alpha amino acid amides (73); Pyrrolidines (80); D-alpha-amino acids (85); Cyclic ketones (45); Acetals (50); Steroids and steroid derivatives (51); Alkyl fluorides (53); Alkyl halides (67); Cyclic alcohols and derivatives (86); Ketones (101); Organofluorides (128); Carboxylic acid esters (139); Secondary alcohols (187); Oxacyclic compounds (192); Lipids and lipid-like molecules (209); Organohalogen compounds (272); Ethers (393); Alcohols and polyols (395); Carboxylic acid derivatives (423); Carboxylic acids and derivatives (548); Carbonyl compounds (598); Organic acids and derivatives (633); Organoheterocyclic compounds (651); Organooxygen compounds (856); Organic compounds (978); Chemical entities (989); Hydrocarbon derivatives (995)
OntoChem: 17-deoxy-prednisolones (6); halohydrins (6); prednisolones (6); ethanoic acid esters (20); methyl esters (20); acetals (37); alkyl fluorides (56); cyclic ketones (61); natural product derivatives (92); fluorine compounds (126); alkene derivatives (172); polycyclic compounds (184); oxacyclic compounds (190); secondary alcohols (202); carboxylic acids (249); formic acid derivatives (559); lipophilic molecules (642); lipinski molecules (785); bioavailable molecules (867); oxygen compounds (891); small molecules (949); carbon compounds (974); hetero compounds (978)
Green means likely synonym between ontologies. Slide courtesy of Stephen Boyer, IBM
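A first pass at finding labels that the two classifiers share can be sketched as a case-insensitive comparison of label names. The code below uses a short excerpt of the two lists above; the function names are hypothetical, and the number in parentheses is carried along without interpretation because the slide does not say what it means. Exact-name matching is only a starting point: it cannot see true synonyms with different spellings, such as olefin versus alkene.

```python
import re

# Excerpts copied from the ClassyFire and OntoChem lists on the slide.
CLASSYFIRE = (
    "Halogenated steroids (6); Fluorohydrins (7); Halohydrins (7); "
    "Cyclic ketones (45); Acetals (50); Alkyl fluorides (53); "
    "Secondary alcohols (187); Oxacyclic compounds (192)"
)
ONTOCHEM = (
    "halohydrins (6); prednisolones (6); acetals (37); "
    "alkyl fluorides (56); cyclic ketones (61); "
    "oxacyclic compounds (190); secondary alcohols (202); "
    "carboxylic acids (249)"
)

# "Label name (number)" with optional surrounding whitespace.
_LABEL = re.compile(r"\s*(.+?)\s*\((\d+)\)\s*$")


def parse_labels(text):
    """Map each case-folded label name to its parenthesized number."""
    labels = {}
    for item in text.split(";"):
        match = _LABEL.match(item)
        if match:
            labels[match.group(1).casefold()] = int(match.group(2))
    return labels


def shared_labels(a, b):
    """Label names present in both parsed label sets, sorted."""
    return sorted(set(a) & set(b))


classyfire = parse_labels(CLASSYFIRE)
ontochem = parse_labels(ONTOCHEM)
common = shared_labels(classyfire, ontochem)
print(common)
```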

37 It’s an olefin. It’s an alkene. Slide courtesy of Yannick Djoumbou & David Wishart / Drugbank team / U of Alberta

38 Classifications in PubChem. Enables drill down to chemical lists with a classified feature of interest. Allows interesting cross comparison between classification systems, for example, comparing MeSH with ChEBI. Of the 38 benzaldehydes in ChEBI, how many are within the set of 128 benzaldehydes in MeSH? Only 20 between the two are in common. Only 19 of 20 are considered aldehydes. Only 17 of 19 are considered benzaldehydes!

39 Computational vs. Manual Classification. Humans make mistakes: this includes programming mistakes, classification mistakes, and encoding mistakes. Manual classifications only handle a small number of chemical structures: automated classification can help to extend classifiers to all known chemicals. Harmonization of terminology? Can they speak a common language? Image credit: https://commons.wikimedia.org/wiki/File:Brueghel-tower-of-babel.jpg

40 Benzene boiling point case study Benzene Coal tar

41 Computational harmonization of chemical concepts [comparison of MeSH, WHO ILO ICSC, OSHA concepts]
Gasoline: ILO 1400 is for gasoline. There are three names [Gasoline, Benzin, 86290-81-5]. The corresponding MeSH concept is M0008998 with single name ‘Gasoline’. The ILO term ‘86290-81-5’ is not in MeSH. The other ILO term ‘Benzin’ is located in another concept M0045647 with names ‘naphtha, O3L624621X, 8030-30-6, benzin, benzine, petroleum ether’.
Naphtha: OSHA 706 is for naphtha. There are several names [Naptha, Naptha (coal tar), high solvent naptha, crude solvent coal tar naptha, 8030-30-6]. This maps to MeSH concept M0045647, with names [naphtha, O3L624621X, 8030-30-6, benzin, benzine, petroleum ether]. There is a second record for naphtha from OSHA, 707, which is a narrower concept than OSHA 706 above. OSHA 707 has several names [naptha, petroleum ether, petroleum spirit, painters naphtha, refined solvent naptha, 8032-32-4, ligroin]. These names map to the MeSH concept M0045647 via ‘petroleum ether’ but also M0060581 with names [ligroin, 8032-32-4].
Harmonization suggestions to improve MeSH:
1. Reg# ‘86290-81-5’ to be added to the gasoline concept M0008998.
2. M0060581 to be a related concept (narrower) to M0045647.
3. Name ‘petroleum ether’ to be moved from M0045647 to M0060581.
4. Names [petroleum spirit, painters naphtha, refined solvent naptha] to be added to M0060581.
5. Name [phenyl hydride] to be added to M0002332.
6. M0008998 to be a related concept (narrower) to M0045647.
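The name-based mapping described on this slide can be sketched as a case-insensitive lookup from each source record's names into an index of MeSH concept names. The data below is copied from the slide, with names kept as spelled there; the function names and dictionary layout are assumptions made for illustration, not the method actually used in PubChem.

```python
# MeSH concepts and their names, as listed on the slide.
MESH_CONCEPTS = {
    "M0008998": ["Gasoline"],
    "M0045647": ["naphtha", "O3L624621X", "8030-30-6", "benzin",
                 "benzine", "petroleum ether"],
    "M0060581": ["ligroin", "8032-32-4"],
}

# ILO / OSHA records and their names, spelled as on the slide.
SOURCE_RECORDS = {
    "ILO 1400": ["Gasoline", "Benzin", "86290-81-5"],
    "OSHA 706": ["Naptha", "Naptha (coal tar)", "high solvent naptha",
                 "crude solvent coal tar naptha", "8030-30-6"],
    "OSHA 707": ["naptha", "petroleum ether", "petroleum spirit",
                 "painters naphtha", "refined solvent naptha",
                 "8032-32-4", "ligroin"],
}


def build_name_index(concepts):
    """Map each case-folded name to the set of concepts that use it."""
    index = {}
    for concept_id, names in concepts.items():
        for name in names:
            index.setdefault(name.casefold(), set()).add(concept_id)
    return index


def map_record(names, index):
    """Return ({concept: [matching names]}, [names with no concept])."""
    matched, unmatched = {}, []
    for name in names:
        concept_ids = index.get(name.casefold())
        if concept_ids:
            for concept_id in sorted(concept_ids):
                matched.setdefault(concept_id, []).append(name)
        else:
            unmatched.append(name)
    return matched, unmatched


index = build_name_index(MESH_CONCEPTS)
results = {rec: map_record(names, index)
           for rec, names in SOURCE_RECORDS.items()}
for record, (matched, unmatched) in results.items():
    print(record, matched, unmatched)
```

A record that lands in more than one concept (ILO 1400, OSHA 707) or leaves names unmatched (‘86290-81-5’) is exactly the kind of case that produces the harmonization suggestions listed above. Exact matching also shows why spelling variants such as ‘naptha’ versus ‘naphtha’ need separate normalization.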

42 Summary. Knowledge representation in chemistry enables computer understanding of the domain. Community efforts are underway to help build and harmonize chemical ontologies. There are many opportunities to improve the quality, quantity, variety, relevancy, and integration of (open) chemical data and chemical knowledge.

43 PubChem Crew … Evan Bolton, Jie Chen, Tiejun Cheng, Gang Fu, Renata Geer, Asta Gindulyte, Lianyi Han, Jane He, Siqian He, Sunghwan Kim, Ben Shoemaker, Paul Thiessen, Jiyao Wang, Yanli Wang, Bo Yu, Leonid Zaslavsky, Jian Zhang. Steve Bryant (PI). Special thanks to the NCBI Help Desk, especially Rana Morris, and past PubChem group members.

44 Special thanks
Chemical Ontology Collaborators (including Stephen Boyer, Yannick Djoumbou, Lutz Weber)
PubChemRDF Collaborators
- Especially: Janna Hastings (European Bioinformatics Institute), Michel Dumontier (Stanford University), Colin Batchelor (Royal Society of Chemistry), Egon Willighagen (Maastricht University), Stephan Schurer, Uma Vempati, and Hande Küçük (U. of Miami)
- NLM Linked Data Infrastructure Working Group (MeSH RDF)
Chemical Health and Safety collaborators
- Especially: Leah McEwen (Cornell U.), Ralph Stuart (Keene State College), Ye Li (U. of Michigan)
Software collaborators
- NextMove Software (Roger Sayle, Daniel Lowe, Noel O’Boyle, John May)
- Xemistry GmbH (Wolf D. Ihlenfeldt)
- OpenEye Scientific Software
All PubChem Contributors and Collaborators
This research was supported [in part] by the Intramural Research Program of the NIH, National Library of Medicine.

45 Have any questions?

