The collection, curation and modeling of Open Melting Point measurements. August 26, 2011, 5th Meeting on U.S. Government Chemical Databases and Open Chemistry.


1 The collection, curation and modeling of Open Melting Point measurements. August 26, 2011, 5th Meeting on U.S. Government Chemical Databases and Open Chemistry. Jean-Claude Bradley (Department of Chemistry, Drexel University), Andrew Lang (Department of Mathematics, Oral Roberts University), Antony Williams (ChemSpider, Royal Society of Chemistry)

2 The Problem of Data Quality in Chemistry: lack of provenance; reliance on a system of “trusted sources” (CRC Handbook, Merck Index, chemical vendor catalogs such as Sigma-Aldrich, peer-reviewed journals). In the case of melting points:

3 Strategy for the curation of melting points. Using technology, we can begin to replace the “trusted source” model with one based on transparency and provenance:
1. Rely on redundancy when possible
2. Provide the maximum level of provenance when necessary (Open Notebook Science)
3. Adhere to Open Data, Open Descriptors and Open Algorithms for measurements and modeling

4 The Chemical Information Validation Sheet 567 curated and referenced measurements from Fall 2010 Chemical Information Retrieval course

5 Investigating the m.p. inconsistencies of EGCG

6 Most popular data sources

7 Alfa Aesar donates melting points to the public

8 Open Melting Point Explorer

9 Outliers: MDPI dataset and EPA/PhysProp (which also donated all of its data to the public)

10 Outliers for ethanol: Alfa Aesar and Oxford MSDS

11 Inconsistencies and SMILES problems within MDPI dataset

12 MDPI Dataset labeled with High Trust Level

13 EPA/PHYSPROP Structure Errors (Incorrect Valence): 2315 out of 43543 structures contained pentavalent nitrogens

14 EPA/PHYSPROP Errors: Structure displayed is for the neutral compound dopamine but the associated CAS Number and chemical name in the file are for the hydrobromide salt.

15 Common errors in datasets
1. multiple melting points for the same compound in the same database
2. stereochemistry issues
3. sign inversion
4. conversion errors (Kelvin/Celsius, Fahrenheit/Celsius)
5. bad SMILES (non-rendering)
6. salts associated with SMILES for the free base
7. boiling point reported as melting point
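Some of the error types listed above (sign inversion and unit conversion errors) leave a recognizable numeric signature when two measurements of the same compound are compared. The following is a minimal sketch of such a check, not the curation procedure used in the talk. The function names, the 1 °C tolerance and the sample values are assumptions chosen for illustration.

```python
from itertools import combinations

TOLERANCE_C = 1.0  # assumed agreement tolerance, in degrees Celsius


def classify_discrepancy(a, b, tol=TOLERANCE_C):
    """Label a suspicious pair of melting points (nominally in deg C).

    Returns None when the two values agree within the tolerance.
    """
    lo, hi = sorted((a, b))
    if hi - lo <= tol:
        return None  # values agree
    if abs(a + b) <= tol:
        return "sign inversion"
    if abs((hi - lo) - 273.15) <= tol:
        return "Kelvin/Celsius"
    # One value may be the Fahrenheit reading of the other.
    if abs(a * 9 / 5 + 32 - b) <= tol or abs(b * 9 / 5 + 32 - a) <= tol:
        return "Fahrenheit/Celsius"
    return "unexplained"


def flag_pairs(values, tol=TOLERANCE_C):
    """Return (a, b, label) for every disagreeing pair among the values."""
    flagged = []
    for a, b in combinations(values, 2):
        label = classify_discrepancy(a, b, tol)
        if label is not None:
            flagged.append((a, b, label))
    return flagged


if __name__ == "__main__":
    # Hypothetical measurements for one compound, for illustration only.
    for a, b, label in flag_pairs([-114.1, 114.1, -114.5]):
        print(a, b, label)
```

A label of "unexplained" only means that none of the simple signatures matched; stereochemistry, salt/free-base and boiling-point mix-ups still need structure-level inspection.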

16 Open melting point datasets. Double+ validated: 2706 compounds (7413 highly curated measurements; range: 0.01-5 °C; compounds that had at least one chiral center, possessed cis/trans isomerism, or were inorganic or a salt were removed). Entire dataset: 19933 unique compounds (27684 measurements; no inorganics or salts)
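One way to read the "double+ validated" criterion is: keep a compound only if it has at least two measurements and the spread between the highest and lowest value lies between 0.01 and 5 °C. The sketch below implements that reading. The interpretation itself, the function name, the use of the mean as the representative value and the sample data are all assumptions, not details confirmed by the slides.

```python
from collections import defaultdict


def double_validated(measurements, min_range=0.01, max_range=5.0):
    """Select compounds with two or more measurements in close agreement.

    measurements: iterable of (smiles, melting_point_celsius) pairs.
    Returns a dict mapping SMILES to the mean of its measurements.
    """
    by_compound = defaultdict(list)
    for smiles, mp in measurements:
        by_compound[smiles].append(mp)

    selected = {}
    for smiles, values in by_compound.items():
        if len(values) < 2:
            continue  # a single source cannot be cross-checked
        spread = max(values) - min(values)
        # Assumed reading of "range: 0.01-5 C": a spread of exactly zero
        # is excluded, a spread above 5 deg C is treated as disagreement.
        if min_range <= spread <= max_range:
            selected[smiles] = sum(values) / len(values)
    return selected


if __name__ == "__main__":
    # Hypothetical sample values, for illustration only.
    sample = [
        ("CCO", -114.1), ("CCO", -114.5),       # agree within 5 deg C
        ("c1ccccc1", 5.5),                      # single measurement
        ("CC(=O)O", 16.6), ("CC(=O)O", 30.0),   # spread too large
    ]
    print(double_validated(sample))
```

Under this reading, identical values from two sources are not counted as independent confirmation, which would be consistent with one source having copied the other; that rationale is a guess.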

17 Open Models with Open Data Using Open Descriptors (CDK)

18 Modeling Results
Model  Training set  Test set (TS)  Descriptors  TS AAE  TS RMSE  TS R2
1      2205          500            132 2D       29.51   40.91    0.82
1      2204          500            170 2D/3D    29.52   40.79    0.83
2      16015         500            137 2D       26.62   36.35    0.86
3      16015         3500           137 2D       29.36   40.18    0.81
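The test-set statistics reported in the modeling results (AAE, RMSE and R2) can be computed from paired observed and predicted melting points. The sketch below shows the standard definitions. The slides do not say whether R2 is the coefficient of determination or a squared correlation coefficient, so the former is assumed here; the function names and sample values are illustrative.

```python
import math


def aae(observed, predicted):
    """Average absolute error between observed and predicted values."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical melting points in deg C, for illustration only.
    observed = [10.0, 20.0, 30.0]
    predicted = [12.0, 18.0, 33.0]
    print(aae(observed, predicted))
    print(rmse(observed, predicted))
    print(r_squared(observed, predicted))
```

RMSE penalizes large individual errors more heavily than AAE, which is why the RMSE values in the results are consistently larger than the AAE values for the same model.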

19 Melting point prediction service

20 Melting point predictions and measurements on iPhone/iPad (Alex Clark)

21 Publication of double+ validated melting point dataset to Nature Precedings and LuLu

22 For all Formats of ONS Projects

23 Open Melting Point Datasets Currently 20,000 compounds with Open MPs

24 Some melting points can’t be resolved only with literature: 4-benzyltoluene

25 Motivation: Faster Science, Better Science

26 Open Lab Notebook page measuring the melting point of 4-benzyltoluene

27 Using melting point for temperature dependent solubility prediction

28 Crowdsourcing Solubility Data

29 Integration of Multiple Web Services to Recommend Solvents for Reactions

30

31 All ONS web services

32 Google Apps Scripts web services

33 Google Apps Scripts for conveniently exploring melting point data

34 Straight chain carboxylic acids from 1 to 10 carbons; straight chain alcohols from 1 to 10 carbons; comparison of model with triple validated measurements

35 Cyclic primary amines from 3 to 6 carbons (cyclobutylamine flagged for validation – only single source available)

36 Google Apps Scripts for planning reactions and creating schemes

37 Open Melting Points in Supplementary Data Pages of Wikipedia (Martin Walker)

38 Conclusions. For science to progress quickly, there is great benefit in moving away from a “trusted source” model to one based on transparency and data provenance. Open Notebook Science offers an efficient way to make research transparent and discoverable.

