The CLARION Project
For the Infrastructure for Integration in Structural Sciences (I2S2) meeting, Rutherford Labs, 11th February 2010
CLARION: Chemical Laboratory Repository In/Organic Notebooks
Principal Investigator: Peter Murray-Rust
Co-Investigator: Jim Downing
Project Team: Nick Day, Sam Adams, Brian Brooks
Unilever Centre, Department of Chemistry, University of Cambridge
CLARION overview
Diagram labels: Internal Scientist; External Scientist; ELN (IDBS); Crystallography Files (CIF); NMR files; Publications database; EmMa Embargo Mgr; EmMa user interface; Data Loader; JUMBO converters; CML, RDF; CHEM-1 repository; Data Releaser; CHEM-0 repository; RDF triplestores; SPARQL interface; CLARION query app
1. Scientist collects data and stores it in a variety of locations
2. EmMa is notified about the new content
3. Scientist specifies the release conditions for the data
4. Timer waits until release conditions are met
5. Data is moved into CHEM-1 repository and (at some time) into CHEM-0 repository
7. Repository queried by scientists
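Steps 3 to 5 of the overview (release conditions, a timer, then the move into a repository) can be sketched roughly as follows. This is a minimal illustration, not the EmMa implementation: the `EmbargoRecord` class, its fields, the `release_due` function and the example URLs are all hypothetical, and the only release condition modelled is a date.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class EmbargoRecord:
    """Hypothetical record for one data item the embargo manager knows about."""
    data_url: str                              # HTTP reference to the data, not a copy
    release_after: Optional[datetime] = None   # release condition set by the scientist
    released: bool = False


def release_due(records, now):
    """Mark and return every record whose release condition has been met."""
    due = []
    for rec in records:
        if rec.released or rec.release_after is None:
            continue  # already released, or no condition specified yet
        if now >= rec.release_after:
            rec.released = True
            due.append(rec)
    return due


# Usage: a periodic job (the "timer" in step 4) would call release_due().
records = [
    EmbargoRecord("http://example.org/a.cif", datetime(2010, 3, 1)),
    EmbargoRecord("http://example.org/b.cif", datetime(2010, 9, 1)),
    EmbargoRecord("http://example.org/c.cif"),  # no condition set yet
]
for rec in release_due(records, datetime(2010, 6, 1)):
    print("release", rec.data_url)
```

A record without a release condition is never released by the timer, which matches the idea that the scientist must specify the conditions first.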
CLARION architecture
Blue boxes indicate logical machine environments.
Diagram labels: ELN server; File Feed; ELN Feed; Lensfield Loader; ELN Data; Files; CHEM-0/1 repository; Atom Feed; Jetty webserver; cron jobs; Java Adapter; Atom Feed; ELN API; Jetty webserver; cron jobs; Java Data Handler; Atom Feed; Atom Feed Reader; GUI client; Adapter; Release Manager; Jetty webserver; Java; H2db for metadata; JUMBO converters; Ontologies: ChemAxiom ORE ORE Chem Expt; Jetty webserver; Java & Clojure; CML; RDF Triplestore; Chemical Structure index; Jetty webserver; Java; SPARQL; SOAP; CLARION repository; Sesame; Chemicx; Embargo Manager (EmMa); Query System
Design principles used:
Decoupling through standard web interfaces (HTTP, Atom)
Avoid data duplication (by using HTTP references unless a copy is required)
Don't do manually that which can be done automatically
Manual semantification as early as possible
Automatic semantification as late as possible
Give the ability to undo an action during a grace period rather than asking for confirmation
EmMa's role:
Adds metadata
Defines embargo release conditions
Is the gatekeeper for metadata quality
Is the gatekeeper for security (trust, authentication, authorisation)
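The last design principle (undo during a grace period instead of a confirmation prompt) could look something like the sketch below. The `PendingAction` class, its method names and the 24-hour default are assumptions made for illustration; the slides do not say how long the grace period is or how EmMa implements it.

```python
from datetime import datetime, timedelta


class PendingAction:
    """Hypothetical action that takes effect only after a grace period, unless undone."""

    def __init__(self, description, requested_at, grace=timedelta(hours=24)):
        self.description = description
        self.commit_at = requested_at + grace  # moment the action becomes final
        self.undone = False

    def undo(self, now):
        """Cancel the action; succeeds only while the grace period is still open."""
        if now < self.commit_at:
            self.undone = True
            return True
        return False

    def is_committed(self, now):
        """True once the grace period has passed without an undo."""
        return not self.undone and now >= self.commit_at


# Usage: the user is never asked "are you sure?", but can change their mind.
requested = datetime(2010, 2, 11, 9, 0)
action = PendingAction("release dataset", requested, grace=timedelta(hours=1))
print(action.undo(requested + timedelta(minutes=30)))        # within the grace period
print(action.is_committed(requested + timedelta(hours=2)))   # undone, so never committed
```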
Scientists are presented with data records, to which they add metadata and then set embargo release conditions.
Diagram labels: Sources; EmMa; Data Loader; Repository
CLARION development stages & timings
[Timeline figure: months Jan to Dec, with Stages 1, 2 and 3 marked]
Stage 1: First data-feed into EmMa
Atom feeds from file stores
EmMa feed-readers
EmMa user review tool
EmMa output Atom feeds
Stage 2: Basic functionality to store first data-type into repository
Lensfield reads EmMa feeds
Process data to CML
Process CML to RDF
Store triples into triple-store
Indexing of chemical structures
Stage 3: Basic querying functionality
Authentication & authorisation
Pilot users loading data
V1 query tool
Data stored in RDF and chemical structures indexed.
System in use by pilot users & simple query interface for SSS & RDF queries.
Querying by outside users.
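The Stage 2 step "Process CML to RDF" can be illustrated with a toy conversion from a CML molecule to subject-predicate-object triples. This is only a sketch: CLARION itself uses the JUMBO converters and its own ontologies, whereas the `EX` vocabulary, the `cml_to_triples` function and the sample molecule below are hypothetical stand-ins.

```python
import xml.etree.ElementTree as ET

CML_NS = "{http://www.xml-cml.org/schema}"   # CML schema namespace
EX = "http://example.org/clarion/"           # hypothetical vocabulary, for illustration only

SAMPLE_CML = """<molecule xmlns="http://www.xml-cml.org/schema" id="m1">
  <atomArray>
    <atom id="a1" elementType="O"/>
    <atom id="a2" elementType="H"/>
    <atom id="a3" elementType="H"/>
  </atomArray>
</molecule>"""


def cml_to_triples(cml_text):
    """Return (subject, predicate, object) tuples describing the atoms of one molecule."""
    root = ET.fromstring(cml_text)
    mol_uri = EX + "molecule/" + root.get("id")
    triples = []
    for atom in root.iter(CML_NS + "atom"):
        atom_uri = mol_uri + "/atom/" + atom.get("id")
        triples.append((mol_uri, EX + "hasAtom", atom_uri))
        triples.append((atom_uri, EX + "elementType", atom.get("elementType")))
    return triples


for triple in cml_to_triples(SAMPLE_CML):
    print(triple)
```

In a real pipeline the tuples would be written to a triple store rather than printed.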
EmMa: A general tool for controlling data release between systems
Diagram labels: ?; ISIS; ELN; XRay; NMR; Etc; EmMa; Pump; PubChem; PDB; Chem-1; Chem-0; NCS; eCrystals; Atom feed; Public Atom feed; Private Atom feed; Fully semantified data (RDF); Original data plus basic metadata
How EmMa could facilitate data release in collaborating institutions
Diagram labels: Institution A EmMa; Institution B EmMa; Rutherford neutron; Private repository; Public repository
Events:
1. Scientist sends sample to Rutherford
2. Rutherford stores data locally and sends a copy back to the scientist
3. Institution's EmMa is informed about new data
4. Scientist specifies data release conditions
5. Release conditions reached, data released to public repository
6. Rutherford monitors the institution's Atom feed, detects that data is released
7. Rutherford makes data visible in their own public-access repository
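Event 6 (a facility monitoring an institution's Atom feed for newly released data) can be sketched with the standard library alone. The feed content, entry id and URLs below are invented for the example, and `new_entries` is a hypothetical helper; a real monitor would fetch the feed over HTTP on a schedule and persist the set of ids it has already seen.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace (RFC 4287)

# Invented sample feed standing in for an institution's public release feed.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Released data</title>
  <entry>
    <id>urn:example:dataset-1</id>
    <title>Dataset 1</title>
    <link href="http://example.org/data/dataset-1.cif"/>
  </entry>
</feed>"""


def new_entries(feed_xml, seen_ids):
    """Return (id, href) for entries not seen before, and remember their ids."""
    root = ET.fromstring(feed_xml)
    fresh = []
    for entry in root.iter(ATOM + "entry"):
        entry_id = entry.findtext(ATOM + "id")
        if entry_id in seen_ids:
            continue  # already handled on an earlier poll
        seen_ids.add(entry_id)
        link = entry.find(ATOM + "link")
        href = link.get("href") if link is not None else None
        fresh.append((entry_id, href))
    return fresh


seen = set()
print(new_entries(SAMPLE_FEED, seen))  # first poll reports the entry
print(new_entries(SAMPLE_FEED, seen))  # second poll reports nothing new
```

Because the monitor only follows the link in each entry, it works with HTTP references rather than copies of the data.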