Webinar: Wed 13th May 2015. UniChem. Jon Chambers and Anne Hersey, ChEMBL group, The European Bioinformatics Institute, part of the European Molecular Biology Laboratory (EMBL-EBI).

Presentation transcript:

Webinar: Wed 13th May 2015. UniChem. Jon Chambers and Anne Hersey, ChEMBL group, The European Bioinformatics Institute, part of the European Molecular Biology Laboratory (EMBL-EBI). An Introduction to UniChem: EMBL-EBI’s mapping tool for small molecule database identifiers.

UniChem Webinar: 13th May 2015
- What is UniChem? Basic use of UniChem (web service and web page).
- Background: why was UniChem developed? What problem does it solve?
- Requirements and features: schema, data normalization, loading rules, etc.
- Current content: sources, downloads, stats, analyses.
- Connectivity Search.
- Q and A


A ChEMBL Compound Report Card

Compound Cross-references on a Compound Report Card… Cross-references to the same molecule in other resources. Automatically maintained via UniChem web services. Other resources can make use of this same functionality.

REST Web services.


UniChem query results. LR = Last Release when Assignment was current. UCI = UniChem Identifier


EBI resources containing small molecule data.
Example identifiers shown on the slide: ‘CHEMBL12’, ‘49575’, ‘DZP’, ‘ECBD..??’, ‘diazepam’, ‘SCHEMBL21442’.
- Many resources, each with very different user-bases.
- New resources predicted to be developed/adopted in future.
- How can chemistry-centric users make use of all these data?
- Links between resources allow each resource to evolve independently.
- But maintenance is manual and time-consuming, and a duplication of effort.

Advantages of the UniChem model.
- All EBI DBs share the maintenance overhead of creating links to each other.
- All EBI DBs share the benefits of maintained links to external resources.
- The ‘mapping service’ could be opened for use by external users.


Essential requirements for UniChem.
- Create cross-referencing of chemical structures and their identifiers between databases.
- Fast (i.e. capable of producing mappings ‘on the fly’ during a web page load, via a web service call).
- Low maintenance.
- Up to date.
- Archive and track changes to ‘id-to-structure’ assignments over time.

Standard InChI used as the normalizing mechanism.
InChI (International Chemical Identifier):
- Non-proprietary, free.
- Not a registry system.
- Designed for printed and electronic data sources.
- Hashed representation (the InChIKey) aids ‘private’ querying.

InChI (International Chemical Identifier). InChIKey: 27 characters long, e.g. MGDTEJBDJOHWYU-UHTGSUKQAC-N. The first 14 characters (MGDTEJBDJOHWYU) are the ‘connectivity block’, aka ‘First InChIKey Hash Block’ (FIKHB), shown in blue on the slide.
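The layout described on this slide can be sketched in a few lines of Python. This is only an illustration: the function names `split_inchikey` and `same_connectivity` are made up for this example and are not part of UniChem.

```python
import re
from typing import NamedTuple

# InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter (27 characters).
_INCHIKEY_RE = re.compile(r"^([A-Z]{14})-([A-Z]{10})-([A-Z])$")


class InChIKeyParts(NamedTuple):
    fikhb: str         # First InChIKey Hash Block (the 'connectivity block')
    second_block: str  # second block of ten characters
    last_char: str     # final single-character block


def split_inchikey(key: str) -> InChIKeyParts:
    """Split an InChIKey into its three hyphen-separated blocks."""
    match = _INCHIKEY_RE.match(key.strip())
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    return InChIKeyParts(*match.groups())


def same_connectivity(key_a: str, key_b: str) -> bool:
    """True when two InChIKeys share the same FIKHB."""
    return split_inchikey(key_a).fikhb == split_inchikey(key_b).fikhb


if __name__ == "__main__":
    print(split_inchikey("MGDTEJBDJOHWYU-UHTGSUKQAC-N"))
```

Comparing only the first block is what makes a connectivity-level match possible without looking at the full InChI string.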

UniChem Schema.
- UC_STRUCTURE: UCI (PK), STANDARDINCHI, STANDARDINCHIKEY
- UC_XREF: UCI (FK, PK), SRC_ID (FK, PK), SRC_COMPOUND_ID (PK), ASSIGNMENT, LAST_REL_CURRENT
- UC_SOURCE: SRC_ID (PK), NAME, DESCRIPTION, CURRENT_RELEASE_U, etc
- UC_RELEASE: SRC_ID (PK), RELEASE_U (PK), SRC_RELEASE_NUMBER, SRC_RELEASE_DATE, etc
Slide annotations: ‘Entries here are immutable’; SRC_COMPOUND_ID e.g. CHEMBL12; ASSIGNMENT is 1 or 0.
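To make the four tables concrete, here is a minimal sketch of the schema in SQLite. Table and column names follow the slide; everything else is an assumption for illustration: the column types, the constraints, the `create_schema` helper, and the use of SQLite itself (the downloads slides mention Oracle dumps).

```python
import sqlite3

# Assumed column types and constraints; only the names come from the slide.
SCHEMA = """
CREATE TABLE uc_source (
    src_id            INTEGER PRIMARY KEY,
    name              TEXT NOT NULL,
    description       TEXT,
    current_release_u INTEGER
);
CREATE TABLE uc_release (
    src_id             INTEGER NOT NULL REFERENCES uc_source(src_id),
    release_u          INTEGER NOT NULL,
    src_release_number TEXT,
    src_release_date   TEXT,
    PRIMARY KEY (src_id, release_u)
);
CREATE TABLE uc_structure (
    uci              INTEGER PRIMARY KEY,
    standardinchi    TEXT NOT NULL,
    standardinchikey TEXT NOT NULL
);
CREATE TABLE uc_xref (
    uci              INTEGER NOT NULL REFERENCES uc_structure(uci),
    src_id           INTEGER NOT NULL REFERENCES uc_source(src_id),
    src_compound_id  TEXT NOT NULL,               -- e.g. CHEMBL12
    assignment       INTEGER NOT NULL CHECK (assignment IN (0, 1)),
    last_rel_current INTEGER,
    PRIMARY KEY (uci, src_id, src_compound_id)
);
"""


def create_schema() -> sqlite3.Connection:
    """Create the four tables in a throwaway in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


if __name__ == "__main__":
    conn = create_schema()
    for (name,) in conn.execute("SELECT name FROM sqlite_master WHERE type='table'"):
        print(name)
```

The composite primary key on UC_XREF lets one source identifier point at several structures over time, with ASSIGNMENT marking which rows are current.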

UniChem tracks historical assignments.
- Data Release No.1 from Source ‘S’: cpd123 -> InChiX
- Data Release No.2 from Source ‘S’: cpd123 -> InChiY
- Data Release No.3 from Source ‘S’ (latest): cpd123 -> InChiZ
UniChem will record that in this particular source, the id ‘cpd123’:
- was last assigned to InChiX on Release No.1, but is not currently assigned to this structure;
- was last assigned to InChiY on Release No.2, but is not currently assigned to this structure;
- is currently assigned to InChiZ.
i.e. UniChem keeps a record of current AND obsolete assignments.
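The bookkeeping on this slide can be imitated with a small in-memory sketch. It is not UniChem's loader: `load_release` and the dictionary layout are hypothetical, and the field names merely echo the ASSIGNMENT and LAST_REL_CURRENT columns of the schema slide.

```python
from typing import Dict, Iterable, Tuple

Pair = Tuple[str, str]  # (source compound id, structure)


def load_release(xrefs: Dict[Pair, dict], release_no: int,
                 records: Iterable[Pair]) -> None:
    """Apply one data release from a single source to the xref store."""
    current = set(records)
    # Pairs present in this release are current as of this release.
    for pair in current:
        xrefs[pair] = {"assignment": 1, "last_rel_current": release_no}
    # Pairs seen before but absent now become obsolete; the release in
    # which they were last current is left untouched.
    for pair, row in xrefs.items():
        if pair not in current:
            row["assignment"] = 0


if __name__ == "__main__":
    store: Dict[Pair, dict] = {}
    load_release(store, 1, [("cpd123", "InChiX")])
    load_release(store, 2, [("cpd123", "InChiY")])
    load_release(store, 3, [("cpd123", "InChiZ")])
    for pair, row in store.items():
        print(pair, row)
```

Because obsolete rows are flagged rather than deleted, a query can still answer which structure an identifier pointed to in an earlier release.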

UniChem deals with ‘multiple assignments’.
- Multiple ids from a particular source assigned to a single InChI: cpd123, cpd456, cpd789 -> InChiX
- …and a single id from a particular source assigned to multiple InChIs: cpd123 -> InChiX, InChiY, InChiZ
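One consequence of multiple assignments is that mappings between two sources become many-to-many. The sketch below joins two sources on a shared structure identifier; the function `map_sources`, the tuple layout, and the sample ids are all invented for illustration.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

Xref = Tuple[int, int, str]  # (uci, src_id, src_compound_id), current rows only


def map_sources(xrefs: Iterable[Xref], from_src: int,
                to_src: int) -> List[Tuple[str, str]]:
    """Pair every id in from_src with every id in to_src sharing a UCI."""
    by_uci = defaultdict(lambda: defaultdict(set))
    for uci, src_id, compound_id in xrefs:
        by_uci[uci][src_id].add(compound_id)

    pairs = set()
    for per_source in by_uci.values():
        for left in per_source.get(from_src, ()):
            for right in per_source.get(to_src, ()):
                pairs.add((left, right))
    return sorted(pairs)


if __name__ == "__main__":
    rows = [
        (1, 1, "cpdA"), (1, 1, "cpdB"),  # two ids in source 1, same structure
        (1, 2, "X1"),                    # one id in source 2, same structure
        (2, 1, "cpdC"),                  # structure present only in source 1
    ]
    print(map_sources(rows, from_src=1, to_src=2))
```

Here both `cpdA` and `cpdB` map to `X1`, which is the situation the slide describes for several ids assigned to one InChI.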

Loading Rules. Records are not loaded if:
- There is a mismatch between the InChI and the InChIKey, i.e. where the InChIKey calculated by UniChem from the InChI provided by the source does not exactly match the InChIKey provided by the source.
- The Standard InChI supplied is greater than 2000 characters long.
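The two rules can be written as a small filter. This is a sketch under stated assumptions: `rejection_reason` is a hypothetical name, and the InChIKey calculation is passed in as `compute_key` because the slide does not say which software UniChem uses for it.

```python
from typing import Callable, Optional

MAX_INCHI_LENGTH = 2000  # limit stated on the slide


def rejection_reason(inchi: str, supplied_key: str,
                     compute_key: Callable[[str], str]) -> Optional[str]:
    """Return why a record would be skipped, or None if it may be loaded.

    compute_key is a caller-supplied function that derives an InChIKey
    from a Standard InChI (an assumption of this sketch).
    """
    if len(inchi) > MAX_INCHI_LENGTH:
        return "inchi_too_long"
    if compute_key(inchi) != supplied_key:
        return "inchikey_mismatch"
    return None


if __name__ == "__main__":
    # Stand-in key function for demonstration only; not a real hash.
    fake_keys = {"InChI=1S/placeholder": "AAAAAAAAAAAAAA-BBBBBBBBBB-N"}
    print(rejection_reason("InChI=1S/placeholder",
                           "AAAAAAAAAAAAAA-BBBBBBBBBB-N",
                           lambda inchi: fake_keys.get(inchi, "")))
```

Checking the length first avoids hashing very large structures that would be rejected anyway; that ordering is a choice of this sketch, not something the slide specifies.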

Automated Loading and Release.
- Source-specific downloaders and parsers convert each source into a common format.
- A single loader loads the common format.
- Overall process controlled by crontab (timings optimized for each DB to capture latest releases asap).
- Weekly release process produces the production release, incl. downloads, mapping files, etc.


Top Level stats.

Stats.

Sources.


Downloads. ftp://ftp.ebi.ac.uk/pub/databases/chembl/UniChem/

Downloads on the UniChem ftp site …

Oracle Dumps on the UniChem ftp site … Release number == UDRI

Contents of a single Release directory…

Downloads on the UniChem ftp site …


Whole Source Mapping Downloads – Files containing all id mappings between two sources.

An example of a whole source mapping file, e.g. src3src15.txt [PDBe and SureChEMBL].
[Table: two columns, From src:'3' and To src:'15'; each row pairs a PDBe ligand id (e.g. SX2, FM9, HHH, PU7, ACK) with a SureChEMBL ‘SCHEMBL…’ id; 8719 records.]
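A file of this shape is easy to read programmatically. The sketch below assumes one header line followed by tab-separated pairs; the delimiter, the function name `read_mapping`, and the sample ids are assumptions, since the slide shows only the two column headers.

```python
import csv
from typing import Dict, Iterable, List


def read_mapping(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Read 'from id <TAB> to id' rows into a dict of lists.

    A list is used per key because one id can map to several ids.
    """
    reader = csv.reader(lines, delimiter="\t")
    next(reader, None)  # skip the header row
    mapping: Dict[str, List[str]] = {}
    for row in reader:
        if len(row) != 2:
            continue  # ignore blank or malformed lines
        mapping.setdefault(row[0], []).append(row[1])
    return mapping


if __name__ == "__main__":
    sample = [
        "From src:'3'\tTo src:'15'\n",
        "AAA\tSCHEMBL_EXAMPLE_1\n",  # made-up ids, not real data
        "AAA\tSCHEMBL_EXAMPLE_2\n",
        "BBB\tSCHEMBL_EXAMPLE_3\n",
    ]
    print(read_mapping(sample))
```

To use it on a downloaded file, pass an open text file handle in place of the sample list.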

Analyses. Various analyses run on the current UniChem content, using ‘Structural Identity’ defined in one of 3 ways:
- FULIK = the Full InChIKey.
- FIKHB = First InChIKey Hash Block (commonly called ‘the connectivity layer’ of the InChIKey).
- SCFIB = Separated Single Components of FIKHB.

Structures by Source Numbers of ‘structures’ contributed by each source, and of these, how many are unique to the source…

Overlaps between Sources Numbers of ‘structures’ which ‘overlap’ between pairs of sources…
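The counts behind these two analyses can be sketched with plain sets. The helper names and the source labels ‘A’ and ‘B’ are hypothetical; the `identity` argument switches between the full InChIKey (FULIK) and the first hash block (FIKHB). SCFIB is left out because it needs the mixture components, which cannot be obtained from the key alone.

```python
from itertools import combinations
from typing import Callable, Dict, Iterable, Set, Tuple


def structures_by_source(records: Iterable[Tuple[str, str]],
                         identity: Callable[[str], str] = lambda key: key
                         ) -> Dict[str, Set[str]]:
    """Group (source, inchikey) records into one set of structures per source."""
    per_source: Dict[str, Set[str]] = {}
    for source, key in records:
        per_source.setdefault(source, set()).add(identity(key))
    return per_source


def unique_counts(per_source: Dict[str, Set[str]]) -> Dict[str, Tuple[int, int]]:
    """For each source: (structures contributed, structures found nowhere else)."""
    result = {}
    for source, keys in per_source.items():
        others = set().union(*(k for s, k in per_source.items() if s != source))
        result[source] = (len(keys), len(keys - others))
    return result


def overlap_counts(per_source: Dict[str, Set[str]]) -> Dict[Tuple[str, str], int]:
    """Number of shared structures for every pair of sources."""
    return {(a, b): len(per_source[a] & per_source[b])
            for a, b in combinations(sorted(per_source), 2)}


if __name__ == "__main__":
    records = [
        ("A", "AHOUBRCZNHFOSL-WMLDXEAASA-N"),
        ("B", "AHOUBRCZNHFOSL-YOEHRIQHSA-N"),
        ("A", "BLGXFZZNTVWLAY-SCYLSFHTSA-N"),
    ]
    fikhb = lambda key: key.split("-")[0]
    print(overlap_counts(structures_by_source(records)))         # FULIK
    print(overlap_counts(structures_by_source(records, fikhb)))  # FIKHB
```

With the two paroxetine keys from the later slides, the sources share nothing at the FULIK level but share one structure at the FIKHB level, which is why the choice of identity changes the reported overlap.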


UniChem Connectivity Search.
An advanced use of UniChem which permits searching across UniChem data sources for molecules with the same molecular skeleton as the query, but which may exist in:
- different stereochemical and isotopic forms;
- different salt forms or mixtures.
Funded by FP7 Capacities Specific Programme, grant agreement no

Connectivity Based Searching in UniChem.
- Standard UniChem links are created only on the basis of identical InChIKeys.
- Aim: create links on the basis of common connectivity (but differing elsewhere: stereochemistry, isotopic composition, etc).
- Requirements:
  - Fast (has to be created dynamically).
  - Identify ‘relationships’ between molecules (e.g. “has same connectivity …and is isotopic variant of”).
  - Link between cpds with common connectivity within mixtures/salts.
  - Generic / Flexible / Customizable.

Alternative views of molecular equivalence.
- Molecules that many scientists would consider equivalent in the context of their particular field (e.g. pharmacology, docking, etc.) are often depicted differently across different resources.
- Frequently, these depictions have different Standard InChIs and so cannot be integrated by simply matching on Standard InChIKey.
- Examples…

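Isotopic differences. CP-99994, an NK1 antagonist, as held in two sources (PubChem CID and CHEMBL entries):
- DTQNEFOKTXXQKV-XRLBDJASSA-N
- DTQNEFOKTXXQKV-HKUYNNGSSA-N
NB: First InChIKey Hash Block (FIKHB), DTQNEFOKTXXQKV, in blue on the slide; it is identical in both keys.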

Example of stereochemical differences. Paroxetine in two different sources:
- AHOUBRCZNHFOSL-WMLDXEAASA-N
- AHOUBRCZNHFOSL-YOEHRIQHSA-N
Incorrectly drawn, or valid stereoisomeric forms?
NB: First InChIKey Hash Block (FIKHB), AHOUBRCZNHFOSL, in blue on the slide; it is identical in both keys.

Links between mixtures / salts?
- Yohimbine HCl (Antagonil in ‘Selleck’): PIPZGJSEDRMUAW-VJDCAHTMSA-N
- Yohimbine (CHEMBL15245 in ChEMBL): BLGXFZZNTVWLAY-SCYLSFHTSA-N

Co_Amoxiclav (Amoxicillin + Clavulanic acid): QJVHTELASVOWBE-AGNWQMPPSA-N
InChI=1S/C16H19N3O5S.C8H9NO5/c1-16(2)11(15(23)24)19- 13(22)10(14(19)25-16)18-12(21)9(17) (20)6-4-7; (8(12)13)9-5(11)3-6(9)14-4/h3-6,9-11,14,20H,17H2,1- 2H3,(H,18,21)(H,23,24);1,6-7,10H,2-3H2,(H,12,13)/b;4-1-/t9-,10-,11+,14-;6-,7-/m11/s1


Links between mixtures / salts?
- Yohimbine HCl: PIPZGJSEDRMUAW-VJDCAHTMSA-N
- Yohimbine: BLGXFZZNTVWLAY-SCYLSFHTSA-N
…Yes, but parsing of the InChI required first. Yohimbine HCl separates into its components:
- Yohimbine: BLGXFZZNTVWLAY-SCYLSFHTSA-N
- Hydrochloride: VEXZGXHMUGYJMC-UHFFFAOYSA-N

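UniChem Schema, with additions for ‘Connectivity Search’ (shown in green on the slide).
- UC_STRUCTURE: UCI (PK), STANDARDINCHI, STANDARDINCHIKEY, FIKHB
- UC_XREF: UCI (FK, PK), SRC_ID (FK, PK), SRC_COMPOUND_ID (PK), ASSIGNMENT, LAST_REL_CURRENT
- UC_SOURCE: SRC_ID (PK), NAME, DESCRIPTION, CURRENT_RELEASE_U, etc
- UC_RELEASE: SRC_ID (PK), RELEASE_U (PK), SRC_RELEASE_NUMBER, SRC_RELEASE_DATE, etc
- UC_FIKHB_HIERARCHY: PARENT, CHILD
Slide annotations: SRC_COMPOUND_ID e.g. CHEMBL12; ASSIGNMENT is 1 or 0. Compared with the earlier schema slide, the additions are the FIKHB column in UC_STRUCTURE and the UC_FIKHB_HIERARCHY table.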

Links between combinations of stereoisomers, isotopic variants, in mixtures / salts…
Query: Yohimbine (CHEMBL15245 in ChEMBL), BLGXFZZNTVWLAY-SCYLSFHTSA-N
- Yohimbine HCl, PIPZGJSEDRMUAW-VJDCAHTMSA-N: …is a component of…
- Rauwolscine Oxalate, XIIDGINYXKOJGX-ZKKXXTDSSA-N, and Rauwolscine HCl, PIPZGJSEDRMUAW-ZKKXXTDSSA-N: …is a component of… AND …is stereoisomer of…
- Tritiated Rauwolscine, BLGXFZZNTVWLAY-XDGRAVGFSA-N: …is isotopic variant of… AND …is stereoisomer of…
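A rough sketch of how such links could be found from the FIKHB column and a parent/child table follows. It is not UniChem's implementation: `connectivity_search` is a made-up name, and the sketch assumes that PARENT holds the FIKHB of a mixture or salt and CHILD the FIKHB of one of its components. It does not try to tell stereoisomers from isotopic variants, since that needs the InChI layers rather than the key.

```python
from typing import Dict, Iterable, List, Set, Tuple


def first_block(key: str) -> str:
    """Return the First InChIKey Hash Block (FIKHB) of an InChIKey."""
    return key.split("-")[0]


def connectivity_search(query_key: str, stored_keys: Iterable[str],
                        hierarchy: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Find stored keys related to the query by connectivity.

    hierarchy maps a parent FIKHB (assumed: mixture or salt) to the set of
    child FIKHBs (assumed: its separated components).
    """
    query_block = first_block(query_key)
    hits = []
    for key in stored_keys:
        block = first_block(key)
        if key == query_key:
            hits.append((key, "identical InChIKey"))
        elif block == query_block:
            hits.append((key, "same connectivity, differs in other layers"))
        elif query_block in hierarchy.get(block, set()):
            hits.append((key, "query connectivity is a component of this entry"))
    return hits


if __name__ == "__main__":
    stored = [
        "BLGXFZZNTVWLAY-SCYLSFHTSA-N",  # Yohimbine
        "PIPZGJSEDRMUAW-VJDCAHTMSA-N",  # Yohimbine HCl
        "PIPZGJSEDRMUAW-ZKKXXTDSSA-N",  # Rauwolscine HCl
        "BLGXFZZNTVWLAY-XDGRAVGFSA-N",  # tritiated Rauwolscine
        "AHOUBRCZNHFOSL-WMLDXEAASA-N",  # Paroxetine (unrelated)
    ]
    salts = {"PIPZGJSEDRMUAW": {"BLGXFZZNTVWLAY", "VEXZGXHMUGYJMC"}}
    for hit in connectivity_search("BLGXFZZNTVWLAY-SCYLSFHTSA-N", stored, salts):
        print(hit)
```

Storing the hierarchy at the FIKHB level means the salt and mixture lookup is a simple key comparison at query time, which fits the requirement that results be created dynamically.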

Refining ‘Connectivity Search’ to show salts and mixtures. Select radio button ‘4’ of Option C.

Connectivity Search Results Page.

Connectivity Search Web Services

Connectivity Search Web service query results

Connectivity Search in ChEMBL

Train Online

Acknowledgements.
- ChEMBL: John Overington, Anne Hersey, Anna Gaulton, Mark Davies, Louisa Bellis, George Papadatos, Shaun McGlinchey, Jon Chambers
- ChEBI: Chris Steinbeck, Janna Hastings
- PDBe: Sameer Velankar
- Atlas: Robert Petryszak
- Training: Tom Hancocks, Richard Grandison


Future webinars:
- 20th May: ChEMBL walkthrough
- 27th May: Sequence searching (*3pm UK time)
- 3rd June: UniProt, accessing protein data programmatically
- 10th June: MyChEMBL walkthrough
- 17th June: ChEMBL Web Services
All 4:00pm UK time unless stated. For details see: ebi-training-webinar-series-2015

__END__

Mapping imprecision.
Example of multiple ids from a source assigned to a single Standard InChI (InChI=1S/C10H6N4O2/ c (13-10 …): alloxazine and isoalloxazine.
Mappings generated, ChEMBL -> ChEBI and ChEBI -> ChEMBL: CHEMBL > > CHEMBL68500 CHEMBL > > CHEMBL68500