The Problem
We are all doing this many times… Pfizer, AZ, GSK, Merck, n
ChEMBL, DrugBank, Gene Ontology, WikiPathways, UniProt, ChemSpider, UMLS, ConceptWiki, ChEBI, TrialTrove, GVKBio, GeneGo, TR Integrity
– “Find me compounds that inhibit targets in NFkB pathway assayed in only functional assays with a potency <1 μM”
– “Let me compare MW, logP and PSA for known oxidoreductase inhibitors”
– “What is the selectivity profile of known p38 inhibitors?”
Research Questions
Number, sum, Nr of 1: Question
15 129: All oxidoreductase inhibitors active <100nM in both human and mouse.
18 148: Given compound X, what is its predicted secondary pharmacology? What are the on- and off-target safety concerns for a compound? What is the evidence and how reliable is that evidence (journal impact factor, KOL) for findings associated with a compound?
24 138: Given a target, find me all actives against that target. Find/predict polypharmacology of actives. Determine ADMET profile of actives.
32 138: For a given interaction profile, give me compounds similar to it.
37 138: The current Factor Xa lead series is characterised by substructure X. Retrieve all bioactivity data in serine protease assays for molecules that contain substructure X.
38 138: Retrieve all experimental and clinical data for a given list of compounds defined by their chemical structure (with options to match stereochemistry or not).
41 138: A project is considering Protein Kinase C Alpha (PRKCA) as a target. What are all the compounds known to modulate the target directly? What are the compounds that may modulate the target directly? I.e. return all compounds active in assays where the resolution is at least at the level of the target family (i.e. PKC), both from structured assay databases and the literature.
44 138: Give me all active compounds on a given target with the relevant assay data.
46 138: Give me the compound(s) which hit most specifically the multiple targets in a given pathway (disease).
59 148: Identify all known protein-protein interaction inhibitors.
Semantic Resources – Data sets
814,535,923 triples, 569 predicates
Semantic Resources – Mappings
18 million mappings
Semantic Resources – Summary
Types of semantic resources
– RDF datasets
– Mappings
– Vocabularies: MeSH, UMLS, NCIM
– Hierarchies are essential, e.g. ChEBI, Enzyme classification
rdfs:subClassOf reasoning is essential
Data Integration
Global-as-View
– Flexible global schema defined by domain experts.
– Simple result structures, as flat as possible.
Load data into RDF store
– A named graph for each dataset.
– Inferred triples computed offline and asserted directly.
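To illustrate "inferred triples computed offline and asserted directly", here is a minimal Python sketch that materialises the transitive closure of rdfs:subClassOf for one named graph before loading. The function names, the `ex:` class identifiers and the graph URI are hypothetical, and the actual Open PHACTS loading pipeline may work quite differently.

```python
from collections import defaultdict

SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"


def subclass_closure(triples):
    """Return the subClassOf triples implied by transitivity but not yet asserted."""
    parents = defaultdict(set)
    for s, p, o in triples:
        if p == SUBCLASS_OF:
            parents[s].add(o)

    inferred = set()
    for cls in list(parents):
        # Walk up the hierarchy from each class, collecting every ancestor.
        seen, stack = set(), list(parents[cls])
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(parents.get(node, ()))
        # Keep only ancestors that are not already direct parents.
        for ancestor in seen - parents[cls] - {cls}:
            inferred.add((cls, SUBCLASS_OF, ancestor))
    return inferred


def inferred_quads(quads, graph):
    """Compute inferred triples for one named graph and tag them with that graph."""
    triples = [(s, p, o) for s, p, o, g in quads if g == graph]
    return {(s, p, o, graph) for s, p, o in subclass_closure(triples)}


# Hypothetical hierarchy: A is a subclass of B, B of C, C of D.
GRAPH = "http://example.org/graph/classes"
quads = [
    ("ex:A", SUBCLASS_OF, "ex:B", GRAPH),
    ("ex:B", SUBCLASS_OF, "ex:C", GRAPH),
    ("ex:C", SUBCLASS_OF, "ex:D", GRAPH),
]
new_quads = inferred_quads(quads, GRAPH)
```

Asserting the closure up front means a query for "all members of class D" can be answered with a plain triple pattern, without reasoning at query time.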
Data Integration
Define use cases and view templates
– View templates: required and optional information.
– Domain experts trust some datasets over others for specific properties.
– Results without a core set of properties are meaningless. The rest are optional.
– Integration in OpenPHACTS amounts to collating trusted information from the available sources.
Convert view templates to SPARQL queries
– CONSTRUCT for projection onto global view.
– Preserve provenance through void:inDataset.
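The conversion from a view template to a CONSTRUCT query could look roughly like the Python sketch below. The template structure, the field names, the example predicates and the assumption that each dataset lives in a named graph with the same URI as the dataset are all hypothetical; only the ideas (required versus optional properties, CONSTRUCT projection, void:inDataset provenance) come from the slide.

```python
VOID = "http://rdfs.org/ns/void#"


def build_construct(template):
    """Turn a list of field descriptions into a SPARQL CONSTRUCT query string."""
    construct_lines, where_lines = [], []
    # Required fields first, so the core pattern is matched before any OPTIONAL.
    for field in sorted(template, key=lambda f: not f["required"]):
        name = field["name"]
        source = f"?src_{name}"
        triple = f"?item <{field['predicate']}> ?{name} ."
        construct_lines.append(f"  {triple}")
        # The source variable is only bound when the pattern matched,
        # so provenance is emitted only for values that were actually found.
        construct_lines.append(f"  ?item void:inDataset {source} .")
        block = (
            f"VALUES {source} {{ <{field['dataset']}> }} "
            f"GRAPH {source} {{ {triple} }}"
        )
        if field["required"]:
            where_lines.append(f"  {{ {block} }}")
        else:
            where_lines.append(f"  OPTIONAL {{ {block} }}")
    return "\n".join(
        [f"PREFIX void: <{VOID}>", "CONSTRUCT {", *construct_lines, "}",
         "WHERE {", *where_lines, "}"]
    )


# Hypothetical view template: logP is optional, molecular weight is required.
compound_view = [
    {"name": "logp", "predicate": "http://example.org/logP",
     "dataset": "http://example.org/dataset/b", "required": False},
    {"name": "mw", "predicate": "http://example.org/molecularWeight",
     "dataset": "http://example.org/dataset/a", "required": True},
]
query = build_construct(compound_view)
```

Choosing the dataset per property in the template is where the domain experts' trust in particular sources would be encoded.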
Linked Data API
No public SPARQL endpoint is available for OpenPHACTS
– Writing good SPARQL queries has a steep learning curve.
– Queries that are not well formed can cause instability.
Instead, SPARQL queries are embedded into Linked Data API (LDA) endpoints.
– Enables API developers to fine-tune query performance.
– RESTful.
– Multiple formats (JSON, Turtle, TSV, XML and more).
– Pagination.
– HTTP parameters are mapped to RDF properties and SPARQL variables, giving the ability to specify elaborate filtering and sorting conditions.
– Two-level caching of URLs served: raw RDF results, and results formatted according to the “_format” parameter.
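As a rough illustration of how HTTP parameters might be mapped onto SPARQL variables, filters, sorting and pagination, consider the Python sketch below. It is not the Puelia or Open PHACTS implementation: the function names, the example URL and the parameter conventions used here (`_page`, `_pageSize`, `_sort`, `min-`/`max-` prefixes) are assumptions made for the example.

```python
from urllib.parse import parse_qs, urlparse


def sparql_literal(value):
    """Quote a string as a SPARQL literal, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def variable(name):
    """Map a parameter name to a SPARQL variable, rejecting unsafe names."""
    if not name.isidentifier():
        raise ValueError(f"unsupported parameter name: {name!r}")
    return f"?{name}"


def request_to_sparql(url, base_pattern, default_page_size=10):
    params = {k: v[0] for k, v in parse_qs(urlparse(url).query).items()}
    page = int(params.get("_page", 1))
    size = int(params.get("_pageSize", default_page_size))

    filters = []
    for name, value in sorted(params.items()):
        if name.startswith("_"):
            continue  # reserved parameters such as _page, _sort, _format
        if name.startswith("min-"):
            filters.append(f"FILTER({variable(name[4:])} >= {float(value)})")
        elif name.startswith("max-"):
            filters.append(f"FILTER({variable(name[4:])} <= {float(value)})")
        else:
            filters.append(
                f"FILTER(STR({variable(name)}) = {sparql_literal(value)})"
            )

    order = ""
    sort = params.get("_sort")
    if sort:
        direction, key = ("DESC", sort[1:]) if sort.startswith("-") else ("ASC", sort)
        order = f" ORDER BY {direction}({variable(key)})"

    body = " ".join([base_pattern, *filters])
    offset = (page - 1) * size
    return f"SELECT * WHERE {{ {body} }}{order} LIMIT {size} OFFSET {offset}"


# Hypothetical endpoint URL and graph pattern.
example_url = (
    "http://example.org/api/pharmacology?activity_type=IC50"
    "&max-activity_value=1000&_sort=activity_value&_page=3&_pageSize=25&_format=json"
)
pattern = (
    "?assay <http://example.org/activity_type> ?activity_type ; "
    "<http://example.org/activity_value> ?activity_value ."
)
example_query = request_to_sparql(example_url, pattern)
```

Because the query text is generated on the server, callers never submit raw SPARQL, which is what lets the API developers control query shape and performance.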
Linked Data API
First public release today!
– http://dev.openphacts.org
– http://explorer.openphacts.org
LDA: http://code.google.com/p/linked-data-api/
– PHP implementation (Puelia): http://code.google.com/p/puelia-php/
– OpenPHACTS extension: https://github.com/openphacts/OPS_LinkedDataApi
Identity Mapping Service (IMS)
Instance-level mappings are held externally to the RDF store.
Driven by the University of Manchester
– http://openphacts.cs.man.ac.uk:9090/OPS-IMS/
Injected into queries at run-time.
Better performance (when co-located).
Provision of flexible mappings: scientific lenses.
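One way to picture run-time injection of mappings is the Python sketch below: equivalent URIs for an input identifier are looked up outside the RDF store and spliced into the query as a VALUES clause. The mapping table, the lens names, the `#MAPPINGS#` placeholder and the use of a transitive closure over mappings are all assumptions for illustration, not the IMS design.

```python
from collections import deque

# Hypothetical mapping table: (URI, URI, kind of equivalence).
MAPPINGS = [
    ("http://example.org/chembl/C1", "http://example.org/chemspider/S1", "exact"),
    ("http://example.org/chemspider/S1", "http://example.org/drugbank/D1", "exact"),
    ("http://example.org/chembl/C1", "http://example.org/chembl/C1-salt", "parent-salt"),
]

# A lens is modelled here as the set of mapping kinds it accepts.
LENSES = {"strict": {"exact"}, "relaxed": {"exact", "parent-salt"}}


def equivalents(uri, lens):
    """Return every URI reachable from `uri` through mappings the lens accepts."""
    allowed = LENSES[lens]
    neighbours = {}
    for a, b, kind in MAPPINGS:
        if kind in allowed:
            neighbours.setdefault(a, set()).add(b)
            neighbours.setdefault(b, set()).add(a)
    seen, queue = {uri}, deque([uri])
    while queue:
        for nxt in neighbours.get(queue.popleft(), ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return sorted(seen)


def inject_mappings(query, var, uri, lens):
    """Replace the placeholder in a query template with a VALUES clause."""
    values = " ".join(f"<{u}>" for u in equivalents(uri, lens))
    return query.replace("#MAPPINGS#", f"VALUES {var} {{ {values} }}")


template = "SELECT ?p ?o WHERE { #MAPPINGS# ?compound ?p ?o . }"
strict_query = inject_mappings(
    template, "?compound", "http://example.org/chembl/C1", "strict"
)
```

Switching the lens changes which identifiers count as "the same" for a given question without touching the data in the RDF store.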
Identity Resolution Service (IRS)
Text-to-URI functionality.
Driven by ConceptWiki
– http://ops.conceptwiki.org/web-ws/
Manually curated.
(Sub)string search is performed externally to the RDF store.
Optimising SPARQL query performance
Syntactic variations of the same query vary wildly in evaluation cost.
– Jorge Pérez, Marcelo Arenas, and Claudio Gutierrez. Semantics and complexity of SPARQL. ACM Transactions on Database Systems (TODS) 34, no. 3 (2009): 16.
Evaluated the utility of 5 heuristics:
1. Minimise OPTIONAL patterns.
2. Use named GRAPHs to localise patterns.
3. Replace connected triple patterns with sequence paths.
4. Use aggregates to reduce the effects of cartesian products.
5. Compare the performance of different syntax for specifying alternative URIs.
Antonis Loizou and Paul Groth, On the Formulation of Performant SPARQL Queries. Preprint: http://arxiv.org/abs/1304.0567
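To make heuristics 3 and 5 concrete, the Python sketch below generates a sequence path and three syntactic variants for specifying alternative URIs (UNION, FILTER ... IN, VALUES). The helper names and example URIs are hypothetical, and the sketch makes no claim about which variant is fastest; that depends on the store and the data, which is why the variants would need to be timed against each other.

```python
def with_union(pattern, uris):
    """One copy of the pattern per URI, combined with UNION."""
    return " UNION ".join("{ " + pattern.format(s=f"<{u}>") + " }" for u in uris)


def with_filter(pattern, uris, var="?s"):
    """A single pattern over a variable, restricted with FILTER ... IN."""
    listed = ", ".join(f"<{u}>" for u in uris)
    return f"{pattern.format(s=var)} FILTER({var} IN ({listed}))"


def with_values(pattern, uris, var="?s"):
    """A single pattern over a variable, with candidates supplied by VALUES."""
    listed = " ".join(f"<{u}>" for u in uris)
    return f"VALUES {var} {{ {listed} }} {pattern.format(s=var)}"


def sequence_path(subject, predicates, obj):
    """Collapse a chain of connected triple patterns into one sequence path."""
    path = "/".join(f"<{p}>" for p in predicates)
    return f"{subject} {path} {obj} ."


# Hypothetical pattern and URIs; {s} marks where the alternative URIs go.
pattern = "{s} <http://example.org/p> ?o ."
uris = ["http://example.org/a", "http://example.org/b"]

union_variant = with_union(pattern, uris)
filter_variant = with_filter(pattern, uris)
values_variant = with_values(pattern, uris)
path_variant = sequence_path(
    "?compound", ["http://example.org/hasAssay", "http://example.org/onTarget"], "?target"
)
```

All three alternative-URI variants return the same solutions for this pattern, so they can be swapped freely while measuring evaluation time.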
Performance overview:
Method                      # of results   Time (s)
Compound Pharmacology       1 259          1.732
                            95             1.578
Compound Information        1              1.885
Compound Count Datapoints   1              1.803
Target Pharmacology         1 072          11.093
                            20             1.127
Target Information          1              34.759 [1.56, 242.079]
Target Count Datapoints     1              1.464
Enzyme Class Pharmacology   586 107        350.343
ChEBI Class Pharmacology    300 724        371.526
Lessons learnt:
Invest in hardware
– Development environment: 400GB RAM, 1.5TB SSD, 16 cores @ 1.8GHz.
Not all RDF stores are created equal.
Not all RDF datasets are created equal.
Not all SPARQL queries are created equal.
Performance is a function of:
– SPARQL syntax
– Dataset schema
– Data size
– Loading strategy/named GRAPHs
– RDF store
Open PHACTS Information
http://www.openphacts.org
@Open_PHACTS
Publications
– Overview paper: Williams, A.J., Harland, L., Groth, P., Pettifer, S., Chichester, C., Willighagen, E.L., Evelo, C.T., Blomberg, N., Ecker, G., Goble, C., Mons, B.: Open PHACTS: Semantic interoperability for drug discovery. Drug Discovery Today. 17, 1188–1198 (2012).
– Technical approach: Gray, A.J.G., Groth, P., Loizou, A., et al.: Applying linked data approaches to pharmacology: Architectural decisions and implementation. Semantic Web. (2012).
– SPARQL heuristics: Antonis Loizou and Paul Groth, On the Formulation of Performant SPARQL Queries. Preprint: http://arxiv.org/abs/1304.0567