Approaches for extraction and “digital chromatography” of chemical data: A perspective from the RSC.

Presentation transcript:

Approaches for extraction and “digital chromatography” of chemical data: A perspective from the RSC

Overview
- Introduction
  - What data can we consider?
  - What are the challenges?
  - What data and sources does the RSC have?
  - Experimental Data Checker
- Case studies
  - Project Prospect
  - Chair forms of sugars/cyclohexanes

Traditional Chromatography Images taken from: /viewthread.php?tid=3960&page=3 chromatography

Why Digital Chromatography?
- Usable information is mixed in with description and analysis
  - Makes it difficult to find
- Despite our best efforts, there is still a lot of ambiguous or plain wrong/unusable chemical information
- Why?
  - Human error
  - Processing errors
  - Incorrect usage of data generation/extraction
  - Style over meaning
  - Data not generated with reuse in mind
  - Data generated for humans

Style/Layout vs Meaning
- Structures drawn to illustrate more than just the identity
- Data not generated with reuse in mind
- Author practices
- Mixed 2D and perspective representations
- Unintentional definition of stereochemistry

Data generated for humans
- Separated/orphaned information, including Markush structures and information passed by reference

What chemical data can we consider?
- Chemistry is especially challenging because of its wide range of data types
  - Numeric data
  - Names
  - Structures
  - Terminology
- These span very different subject areas: organic, inorganic, physical
  - Meanings/interpretations are not perfectly aligned
- Application of standards can be challenging
- Drawing conventions are documented but not used

What chemical data and sources does the RSC have?

A beginning: helping chemists review their own work

Amphidinoketide I
To a solution of……. …. Amphidinoketide I was isolated as a …….. [α]D25 −17.6 (c 0.085, CH2Cl2); Rf = 0.61 (1:1 hexane:ethyl acetate); νmax (CHCl3)/cm−1 (CO), (CO), (CO), (CC), ; 1H NMR (CD2Cl2, 500 MHz) δH 6.08 (1H, t, J = 1.3 Hz, 3-CHC), 5.82 (1H, ddt, J = 16.9, 10.2, 6.7 Hz, 19-CHCH2), 4.99 (1H, m (17.1 Hz), 20-CHA), 4.92 (1H, m (10.2 Hz), 20-CHB), 3.05 (1H, dd, J = 17.9, 9.3 Hz, 8-CHA), 3.00–2.90 (3H, m, 9-CHCH3, 11-CHA, 12-CHCH3), 2.72–2.64 (2H, m, 5-CHA, 6-CHA), 2.62–2.55 (2H, m, 5-CHB, 6-CHB), 2.51–2.45 (3H, m, 8-CHB, 11-CHB, 14-CHA), 2.33 (1H, dd, J = 16.9, 7.4 Hz, 14-CHB), 2.09 (3H, s, 21-CH3), 2.05–1.99 (2H, m, 18-CH2), 1.99–1.96 (1H, m, 15-CHCH3), 1.88 (3H, s, 1-CH3), 1.39–1.25 (3H, 17-CH2, 16-CHA), 1.14–1.10 (1H, m, 16-CHB), 1.07 (3H, d, J = 7.0 Hz, 22-CH3), 1.05 (3H, d, J = 7.2 Hz, 23-CH3), 0.87 (3H, d, J = 6.7 Hz, 24-CH3); 13C NMR (CD2Cl2, 125 MHz) δC (13-CO), (10-CO), (7-CO), (4-CO), (2-CCH), (19-CHCH2), (3-CHC), (20-CH2C), (14-CH2), (11-CH2), (8-CH2), (9-CHCH3), (12-CHCH3), (5-CH2), (16-CH2), (6-CH2), (18-CH2), (15-CHCH3), (1-CH3), (17-CH2), (21-CH3), (24-CH3), (22 or 23-CH3), (22 or 23-CH3); HRMS (ESI) Calculated for C24H38O , found (MNa+). (9R, 12R, 15S)-1 had [α]D (c 0.245, CH2Cl2).
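The slides do not say how the Experimental Data Checker works internally. As a rough illustration of the kind of self-consistency check a chemist could run on a report like the one above, the sketch below adds up the 1H NMR integrals and compares the total with the hydrogen count of a molecular formula. The function names and the ethyl acetate test data are assumptions made for this example, not part of any RSC tool.

```python
import re


def count_nmr_protons(report: str) -> int:
    """Sum the integrals written as '(nH,' in a 1H NMR peak list."""
    return sum(int(n) for n in re.findall(r"\((\d+)\s*H\s*,", report))


def hydrogens_in_formula(formula: str) -> int:
    """Count H atoms in a simple molecular formula such as 'C4H8O2'."""
    total = 0
    # Tokenise into (element symbol, count) so that 'Hg' or 'He' is not read as H.
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if symbol == "H":
            total += int(count) if count else 1
    return total


def proton_count_matches(report: str, formula: str) -> bool:
    """True when the summed integrals equal the formula's hydrogen count."""
    return count_nmr_protons(report) == hydrogens_in_formula(formula)


# Illustrative data (ethyl acetate); not taken from the slides.
report = "δH 4.12 (2H, q, J = 7.1 Hz), 2.05 (3H, s), 1.26 (3H, t, J = 7.1 Hz)"
print(proton_count_matches(report, "C4H8O2"))
```

Applied by hand to the Amphidinoketide I peak list above, the nineteen integrals sum to 38, which agrees with the H38 in the reported formula. A fuller checker would also need to handle exchangeable protons, mixtures and formulae with brackets, none of which this sketch attempts.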

Case study 1: Project Prospect

What is Prospect?
[Architecture diagram. Layers: input layer, tool layer, information layer, output layer. Labelled elements: RSC XML, author CDX files, OSCAR, ontologies, InChI–name pairs (from ChemSpider), enhanced RSC XML, Prospect database, RSS, enhanced HTML, InChI–name pairs (in ChemSpider), better ontologies, visible output.]
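To make the role of the name–InChI pairs more concrete, here is a minimal sketch of turning plain text into enhanced HTML by dictionary lookup. It is not how OSCAR works (OSCAR is a named entity recogniser, and the slide gives no implementation detail); a plain dictionary match is swapped in for illustration. The `mark_up` function, the `chem` class name and the `data-inchi` attribute are all hypothetical.

```python
import html
import re

# Small illustrative dictionary; keys must be lower case.
NAME_TO_INCHI = {
    "benzene": "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H",
    "ethanol": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
}


def mark_up(text: str, pairs: dict[str, str] = NAME_TO_INCHI) -> str:
    """Wrap each known chemical name in a span that carries its InChI."""
    # Try longer names first so that a short name never pre-empts a longer one.
    names = sorted(pairs, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b", re.IGNORECASE
    )

    def wrap(match: re.Match) -> str:
        inchi = pairs[match.group(1).lower()]
        return (
            f'<span class="chem" data-inchi="{html.escape(inchi)}">'
            f"{match.group(1)}</span>"
        )

    # Escape the source text before inserting any markup of our own.
    return pattern.sub(wrap, html.escape(text))


print(mark_up("Recrystallised from ethanol."))
```

A dictionary lookup like this only finds names it already knows, which is one reason a recogniser such as OSCAR sits in the tool layer.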

People and machines
People:
- Can understand narratives.
- Can interpret pictures.
- Can reason about three-dimensional objects.
- Can do a high-quality job.
Machines:
- Can’t understand narratives.
- Can’t interpret pictures.
- Not able to infer 3D structure from 2D without cues.
- Can do a lower-quality, but still useful job.

Case study 2: The chair representation issue
InChI=1S/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2
WQZGKKKJIJFFOK-UHFFFAOYSA-N
5 stereocentres = 2^5 = 32 stereoisomers (32 structures)
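The count on this slide follows from treating each stereocentre as an independent two-way choice. The short sketch below enumerates those choices; it assumes there is no symmetry that would make two label sets describe the same molecule, and the function name is illustrative.

```python
from itertools import product


def stereo_assignments(n_centres: int) -> list[tuple[str, ...]]:
    """All R/S label combinations for n independent stereocentres."""
    return list(product("RS", repeat=n_centres))


assignments = stereo_assignments(5)
print(len(assignments))  # 2**5 combinations
print(assignments[0], assignments[-1])
```

Because the stereo-free InChI shown on the slide is shared by every one of these label sets, a drawing that loses or mis-states its stereochemistry cannot be told apart from its isomers by that identifier.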

Case study 2: Chair forms of hexacycles. What could go wrong?

How do we “fix” chair representations?
How we normalize them:
1. Identify 6-membered rings (Indigo)
2. Identify what sort of ring it is
3. Map atoms onto a standard structure (e.g. beta-D-glucopyranose)
4. Tidy
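Steps 1 and 2 can be sketched on a bare molecular graph. The slide says Indigo is used for ring detection; to avoid guessing at Indigo's API, the sketch below uses a plain-Python depth-first search instead. The function names, the ring categories and the atom numbering of the example hexopyranose are assumptions made for illustration.

```python
def six_membered_rings(bonds: dict[int, set[int]]) -> list[tuple[int, ...]]:
    """Return one simple 6-cycle per distinct atom set, in ring order."""
    rings: dict[frozenset, tuple[int, ...]] = {}

    def walk(path: list[int]) -> None:
        start, last = path[0], path[-1]
        if len(path) == 6:
            if start in bonds[last]:  # the sixth atom closes back onto the first
                rings.setdefault(frozenset(path), tuple(path))
            return
        for nxt in bonds[last]:
            # Start each ring from its lowest-numbered atom to limit duplicates.
            if nxt > start and nxt not in path:
                walk(path + [nxt])

    for atom in bonds:
        walk([atom])
    return list(rings.values())


def classify_ring(ring: tuple[int, ...], elements: dict[int, str]) -> str:
    """Very coarse ring typing by element composition only."""
    symbols = sorted(elements[i] for i in ring)
    if symbols == ["C"] * 6:
        return "carbocycle (cyclohexane-like)"
    if symbols == ["C"] * 5 + ["O"]:
        return "oxane (pyranose-like)"
    return "other"


# Heavy-atom graph of a hexopyranose; the numbering is arbitrary.
elements = {i: "C" for i in range(1, 7)} | {i: "O" for i in range(7, 13)}
bond_list = [(7, 1), (1, 2), (2, 3), (3, 8), (3, 4), (4, 9),
             (4, 5), (5, 10), (5, 6), (6, 11), (6, 12), (12, 2)]
bonds: dict[int, set[int]] = {i: set() for i in elements}
for a, b in bond_list:
    bonds[a].add(b)
    bonds[b].add(a)

for ring in six_membered_rings(bonds):
    print(sorted(ring), classify_ring(ring, elements))
```

Steps 3 and 4, mapping onto a standard structure and tidying the layout, depend on the drawing coordinates and are not attempted here.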

The future: “The digester”
Ability to:
- Reconnect R-groups
- Expand abbreviations
- Expand brackets
- Link structures with reference IDs
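As a toy illustration of the first two abilities, the sketch below expands abbreviated substituent labels into SMILES fragments and reconnects them to a scaffold to give one structure per compound ID. It is not the digester itself, which the slide describes only as future work. The template syntax (`{R1}`), the abbreviation table and the compound IDs are all hypothetical.

```python
# Hypothetical table mapping drawn labels to SMILES fragments.
# Ring-containing fragments are left out: their ring-closure digits could
# clash with those of the scaffold in a plain text substitution like this.
ABBREVIATION_SMILES = {"Me": "C", "Et": "CC", "OMe": "OC", "OEt": "OCC"}


def expand(label: str) -> str:
    """Expand a known abbreviation; otherwise assume the label is already SMILES."""
    return ABBREVIATION_SMILES.get(label, label)


def reconnect_r_groups(scaffold: str, r_groups: dict[str, str]) -> str:
    """Substitute each {R1}-style placeholder with its expanded fragment."""
    return scaffold.format(**{name: expand(label) for name, label in r_groups.items()})


# A Markush-style scaffold and the R-group table that accompanies it.
scaffold = "c1ccc(cc1){R1}"
table = {"1a": {"R1": "OMe"}, "1b": {"R1": "Cl"}, "1c": {"R1": "Et"}}

compounds = {cid: reconnect_r_groups(scaffold, groups) for cid, groups in table.items()}
for cid, smiles in compounds.items():
    print(cid, smiles)
```

Keying the result by compound ID is what lets each expanded structure be linked back to the reference ID used in the text.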

Other examples that we didn’t mention in case studies
- CIF data importer
- Structure Validation and Standardisation
  - (Thurs Aug 23, 9:15 am, Marriott Downtown, Franklin Hall 6)
- Work on creation of ontologies: RXNO, CMO
  - Also collaborating on: ChEBI ontology, GO, SO
- Collaboration with Utopia to enable Prospect mark-up of PDFs

Summary
- Many data sharing practices are based on:
  - Traditional print articles
  - Consumption of data by humans only
- This poses issues for publishers and users alike
- The RSC is developing innovative solutions to address some of these problems
  - Chemical structures are challenging
  - There are limitations to what machine methods can achieve
  - Need to educate authors to think differently

Acknowledgements
- Colin Batchelor - Development and Technical work
- Jeff White & Aileen Day
- Richard Kidd, Graham McCann and Will Russell
- RSC ICT staff

Thank you