Smart Storage for Physical Properties
Or: How on Earth Do We Store This Stuff?
Kieron Taylor, with Jeremy Frey and Jonathan Essex
What makes up chemical data?
● Numbers - big, small, precise and vague
● Circumstances - How hot? What pressure?
● Assumptions
  – This is pretty pure, let's say it's pure
  – Standard conditions? More or less
  – That peak on the spectrum isn't important
Using the Data: QSPR
● Take lots of data
● Magical statistics occur
● Validate results
● Predictive model
So What is Real Data Like?
Bad - take the commercial Physprop Database
Can we handle these melting points?
Let's Make a Database
● One data source is not enough
● Good(?) data isn't free
● Different sources have varied styles of content
● Most database software is not suited to data mining
● We cannot simply plumb these varied sources for data; we must reconcile them to make sensible statistics
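As a rough illustration of what "reconcile" might mean in practice, the sketch below normalises two records of the same melting point that arrive in different styles. The record layout, the source names "source A" and "source B", and the helper names `to_celsius` and `reconcile` are assumptions made for this example; only the -31 C melting point of cyclohexanone comes from the slides, and the Kelvin record is simply that value converted.

```python
# Hypothetical records, styled differently, describing the same quantity.
RAW = [
    {"property": "melting point", "value": "-31", "units": "C", "source": "source A"},
    {"property": "Melting Point ", "value": "242.15", "units": "K", "source": "source B"},
]


def to_celsius(value, units):
    """Convert a temperature to degrees Celsius from a few common unit spellings."""
    units = units.strip().lower()
    if units in ("c", "degc", "celsius"):
        return value
    if units in ("k", "kelvin"):
        return value - 273.15
    if units in ("f", "degf", "fahrenheit"):
        return (value - 32.0) * 5.0 / 9.0
    raise ValueError(f"unknown temperature unit: {units!r}")


def reconcile(record):
    """Return a copy of the record with a normalised name, numeric value and units."""
    return {
        "property": record["property"].strip().lower(),
        "value": to_celsius(float(record["value"]), record["units"]),
        "units": "C",
        "source": record["source"],
    }


reconciled = [reconcile(r) for r in RAW]
for row in reconciled:
    print(row)
```

Once both records share a property name and a unit, they can be compared or pooled; the source field is carried through so that nothing is lost about where each value came from.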
Relational Design
For one molecule: Cyclohexanone

Property       Value   Units
Solubility     2500    mg/L
Melting point  -31     C
Boiling point  155.4   C

Property       Value   Error    Units  Source
Solubility     2500    +/-50    mg/L   Physprop
               2650    +/-60    mg/L   Our lab
Melting point  -31     +/-0.1   C      Detherm
Boiling point  155.4   +/-0.5   C      Merck Index

Property       Value   Error    Units  Source       Method      Author
Solubility     2500    +/-50    mg/L   Physprop     Laboratory  ...
               2650    +/-60    mg/L   Southampton  Simulation  Me
Melting point  -31     +/-0.1   C      Detherm      Laboratory  ...
Boiling point  155.4   +/-0.5   C      Merck Index  Laboratory  ...

Arbitrary numbers of points are hard to store in relational databases.
We're not done yet: we still have to account for multiple experimental conditions, statements of validity and molecules.
Provenance = Senary relational model?

Property       Value   Error    Units  Source       Method        Author  Note
Solubility     2500    +/-50    mg/L   Physprop     Laboratory    ...
               2650    +/-60    mg/L   Southampton  Simulation    Me      Superseded
               2599    +/-25    mg/L   Southampton  Simulation B  Me
Melting point  -31     +/-0.1   C      Detherm      Laboratory    ...
Boiling point  155.4   +/-0.5   C      Merck Index  Laboratory    ...     Decomposing
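To make the widening table concrete, here is a minimal sketch of one way it could be held in a relational store, using Python's built-in sqlite3 module with one row per measurement. The table name `measurement` and its column names are assumptions for this example, not the schema used in the talk; the rows are the cyclohexanone values shown on the slide, with the Author column left out.

```python
import sqlite3

# In-memory database, so the sketch leaves nothing behind on disk.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE measurement (
        id       INTEGER PRIMARY KEY,
        molecule TEXT NOT NULL,
        property TEXT NOT NULL,
        value    REAL NOT NULL,
        error    REAL,
        units    TEXT NOT NULL,
        source   TEXT,
        method   TEXT,
        note     TEXT
    )
""")

# Values taken from the slide; every row repeats the molecule and property.
ROWS = [
    ("Cyclohexanone", "Solubility", 2500, 50, "mg/L", "Physprop", "Laboratory", None),
    ("Cyclohexanone", "Solubility", 2650, 60, "mg/L", "Southampton", "Simulation", "Superseded"),
    ("Cyclohexanone", "Solubility", 2599, 25, "mg/L", "Southampton", "Simulation B", None),
    ("Cyclohexanone", "Melting point", -31, 0.1, "C", "Detherm", "Laboratory", None),
    ("Cyclohexanone", "Boiling point", 155.4, 0.5, "C", "Merck Index", "Laboratory", "Decomposing"),
]
conn.executemany(
    "INSERT INTO measurement "
    "(molecule, property, value, error, units, source, method, note) "
    "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    ROWS,
)

# Several values for one property come back as several rows.
solubility = conn.execute(
    "SELECT value, error, source, method FROM measurement "
    "WHERE molecule = ? AND property = ? ORDER BY value",
    ("Cyclohexanone", "Solubility"),
).fetchall()
for row in solubility:
    print(row)
```

This handles several values per property, but each further kind of information (a variable number of experimental conditions per measurement, for instance) would need another table and another join, which is the growth in complexity the slide is pointing at.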
RDF Triplestore is the Solution
● RDF describes trees and networks of entities
● Data of this complexity lends itself well to a tree representation
● RDF trees enable additional clever things
● Triplestores provide persistent RDF models
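The tree idea can be sketched without any RDF library by writing each statement as a (subject, predicate, object) tuple. This is only an illustration of the data model: the node identifiers (`mol:cyclohexanone`, `m1`, `m2`) and predicate names (`hasMeasurement`, `property`, `source` and so on) are invented for this example and are not the vocabulary used with 3store.

```python
# Each statement is one (subject, predicate, object) triple.
# A measurement is a node of its own, so it can carry any number of facts.
TRIPLES = {
    ("mol:cyclohexanone", "hasMeasurement", "m1"),
    ("m1", "property", "Solubility"),
    ("m1", "value", 2500),
    ("m1", "error", 50),
    ("m1", "units", "mg/L"),
    ("m1", "source", "Physprop"),
    ("m1", "method", "Laboratory"),
    ("mol:cyclohexanone", "hasMeasurement", "m2"),
    ("m2", "property", "Boiling point"),
    ("m2", "value", 155.4),
    ("m2", "error", 0.5),
    ("m2", "units", "C"),
    ("m2", "source", "Merck Index"),
    ("m2", "note", "Decomposing"),
}


def objects(subject, predicate):
    """All objects linked from a subject by a given predicate, sorted by text."""
    return sorted(
        (o for s, p, o in TRIPLES if s == subject and p == predicate), key=str
    )


def describe(node):
    """Collect every predicate/object pair attached directly to a node."""
    return {p: o for s, p, o in TRIPLES if s == node}


for m in objects("mol:cyclohexanone", "hasMeasurement"):
    print(m, describe(m))
```

Note that `m2` has a note and no method while `m1` has a method and no note: adding a new kind of fact is just another triple, with no change to a table definition.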
What can we do with this?
● Store almost any chemical data as normal
● Track the where, when and how of each and every data point
● Filter values by whether they are real or simulated, old or new, from a particular source, or produced by a particular person
● Bolt on RDF schemas such as FOAF and our units system
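Filtering by provenance can be sketched as pattern matching over triples. The example below is a plain-Python stand-in for what a triplestore query would do, not the query interface of 3store; the identifiers `m1` to `m3`, the predicate names, and the helpers `match` and `select` are assumptions for illustration, while the values are taken from the cyclohexanone table.

```python
# A few measurements from the cyclohexanone table, as (subject, predicate, object).
TRIPLES = [
    ("m1", "property", "Solubility"),
    ("m1", "value", 2500),
    ("m1", "source", "Physprop"),
    ("m1", "method", "Laboratory"),
    ("m2", "property", "Solubility"),
    ("m2", "value", 2599),
    ("m2", "source", "Southampton"),
    ("m2", "method", "Simulation B"),
    ("m3", "property", "Melting point"),
    ("m3", "value", -31),
    ("m3", "source", "Detherm"),
    ("m3", "method", "Laboratory"),
]


def match(s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        (ts, tp, to)
        for ts, tp, to in TRIPLES
        if (s is None or ts == s)
        and (p is None or tp == p)
        and (o is None or to == o)
    ]


def select(**wanted):
    """Sorted ids of measurements that carry every requested predicate=object pair."""
    ids = None
    for predicate, obj in wanted.items():
        found = {s for s, _, _ in match(p=predicate, o=obj)}
        ids = found if ids is None else ids & found
    return sorted(ids or [])


print(select(property="Solubility", method="Laboratory"))  # ['m1']
print(select(source="Southampton"))                        # ['m2']
print(select(method="Laboratory"))                         # ['m1', 'm3']
```

Each keyword narrows the set of measurements, so "laboratory solubility values only" or "everything from one source" is the same operation with different arguments.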
What have we done with this?
Thanks to:
● AKT and Steve Harris for 3store
● Rob Gledhill for web tech and discussion
● Perl for s/ / /g