George Papadatos Mark Davies

George Papadatos Mark Davies surechembl-help@ebi.ac.uk
SureChEMBL webinar George Papadatos Mark Davies

Outline
SureChEMBL
Coverage and content
Capabilities
Future plans
Interface demo

ChEMBL: Data for drug discovery
1. Scientific facts: compound, assay/target and bioactivity data (slide example: thrombin protein sequence; Ki = 4.5 nM; APTT = 11 min)
2. Organization, integration, curation and standardization of pharmacology data
3. Insight, tools and resources for translational drug discovery

Why is searching chemical patents useful?
Infringement search to avoid areas of valid patent protection (freedom to operate)
Search for industrial profiles and research directions (competitive intelligence)
State-of-the-art / novelty / prior-art search
Search for citations and key references
Patents are comprehensive: compounds, scaffolds, reactions, biological targets, diseases, indications
Most of the knowledge in chemical patents will never appear anywhere else
Average time lag between patent and journal publication: 3 years
Traditionally a closed and commercial field
Example questions:
  Find novel claimed EGFR inhibitors published in 2014
  Novel heterocyclic scaffolds / reaction schemes

SureChEMBL

SureChem becomes SureChEMBL
December 2013: EMBL-EBI acquired SureChem, a leading chemistry patent mining product, from Digital Science (Macmillan Group)
SureChem was not aligned with core future academic business
Existing SureChem user base
  Free (SureChemOpen)
  Paying (SureChemPro + API)
EMBL-EBI supported existing licensees during the transition
EMBL-EBI provides an ongoing, free and open resource to the entire community
Rebranded as SureChEMBL
  Before: free users had limited access
  Now: full functionality for everyone

SureChEMBL data pipeline
SureChEMBL system
Patent offices: WO; EP applications and granted; US applications and granted; JP abstracts
Processed patents (IFI Claims)
OCR
Entity recognition (SureChem IP), e.g. 1-[4-ethoxy-3-(6,7-dihydro-1-methyl-7-oxo-3-propyl-1H-pyrazolo[4,3-d]pyrimidin-5-yl)phenylsulfonyl]-4-methylpiperazine
Name to structure (five methods)
Image to structure (one method)
Chemistry database
Patent PDFs (service)
Application server, serving users and the API
Dynamic nature: completely automated with no manual interaction (to be compared with ChEMBL)

SureChEMBL chemistry data coverage
Structures from text: 1976 onwards
  Title, abstract, claims, description
  IUPAC, trivial, drug names, etc.
  SureChem chemical entity recognition (proprietary algorithm)
  Name-to-structure conversion: ACD/Labs, ChemAxon, OpenEye, OPSIN, PerkinElmer
Structures from images: 2007 onwards
  CLiDE image-to-structure conversion
USPTO has offered 'Complex Work Units' (CWUs) since 2001
  CWU file types include MOL and CDX
  CWUs processed as part of the pipeline: 2007 onwards

SureChEMBL data content (11/03/2015)
16,261,347 unique compounds
13,274,991 chemically annotated patents
~80,000 novel compounds extracted from ~50,000 new patents monthly
2–7 days for a published patent to be chemically annotated and searchable in SureChEMBL
SureChEMBL provides search access to all patents (~120M), not just chemically annotated ones

EMBL-EBI chemistry resources
Resource | Content | Size
Atlas | Ligand-induced transcript response | 750
PDBe | Ligand structures from structurally defined protein complexes | 15K
ChEBI | Nomenclature of primary and secondary metabolites; chemical ontology | 24K
ChEMBL | Bioactivity data from literature and depositions | 1.5M
SureChEMBL | Chemical structures from patent literature | ~16M
3rd party data | ZINC, PubChem, ThomsonPharma DOTF, IUPHAR, DrugBank, KEGG, NIH NCC, eMolecules, FDA SRS, PharmGKB, Selleck, … | ~60M
UniChem | InChI-based chemical resolver (full + relaxed 'lenses') | >80M
Interfaces: RDF and REST API
Negative novelty checking

SureChEMBL data access I
UniChem ("Universal Compound Resolver")
  Weekly updates
  Web service lookup
  Connectivity search
FTP download
  Quarterly updates
  All SureChEMBL compounds in SDF and CSV format
  Raw data: no further filtering or manual curation
  ftp://ftp.ebi.ac.uk/pub/databases/chembl/SureChEMBL/
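The FTP download route can be scripted. The following is only a sketch, assuming the server at the address on the slide still accepts anonymous FTP logins and that the directory layout has not changed; the helper names (split_ftp_url, list_directory) are made up for illustration, and no file names are assumed.

```python
from ftplib import FTP
from typing import List, Tuple
from urllib.parse import urlparse

# Address taken from the slide; whether it is still served is an assumption.
SURECHEMBL_FTP = "ftp://ftp.ebi.ac.uk/pub/databases/chembl/SureChEMBL/"


def split_ftp_url(url: str) -> Tuple[str, str]:
    """Split an ftp:// URL into (host, directory path)."""
    parts = urlparse(url)
    if parts.scheme != "ftp" or not parts.hostname:
        raise ValueError(f"not an ftp URL: {url!r}")
    return parts.hostname, parts.path


def list_directory(url: str = SURECHEMBL_FTP, timeout: float = 30.0) -> List[str]:
    """Return the names in the remote directory (performs a network call)."""
    host, path = split_ftp_url(url)
    with FTP(host, timeout=timeout) as ftp:
        ftp.login()      # anonymous login
        ftp.cwd(path)    # move to the SureChEMBL directory
        return ftp.nlst()
```

Calling list_directory() shows what is currently published; individual files could then be fetched with FTP.retrbinary. The data are raw, so any filtering or curation has to happen after download.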

SureChEMBL data access II
PubChem
  SureChEMBL data source
  Quarterly updates
Data feed client
  Creates a local replica of SureChEMBL
  Updates daily
  own-replica-of.html

Can we have everything? Cost Quality Time
Usually you can have 2 of the 3 factors
We do not have access to commercial patent chemistry products such as SciFinder

Common sources of errors
Small, poor quality images
OCR errors in names (OCR is done by IFI). There is an OCR correction step, but it cannot fix all errors, e.g. ‘2,6-Difluoro-Λ/-{1 -r(4-iodo-2-methylphenyl)methvn-1 H-pyrazol-3- vDbenzamide’
Reliability is better for US patents due to the inclusion of mol files
Missing double bonds, rings opened, failure to extract the molecule from an image

Bioactivity data extraction?
Compounds Target/Assay Bioactivity I know what many of you are thinking

Markush structure extraction?
-alkyl -aryl -heteroaryl -heterocyclyl -cycloalkyl ….
Accurate automated Markush extraction is difficult
We welcome ideas and collaborations

Future plans
Full compound-patent map
  Flat file FTP download
  Coming in March
  Regular updates
  Also available in UniChem
OpenPHACTS ENSO project
  Biological entity extraction and annotation
  Proteins, genes and diseases
  Ontology mapping and semantic integration

SureChEMBL Interface

Homepage
Search by keyword
Search by chemical structure (sketch compound)
Search by SMILES, MOL, SMARTS, name
Search by patent number
Filter by authority (US, EP, WO and JP)
Filter by document section (title, claims, abstract, description and images)
Chemical search type filter (substructure, similarity, identical)
Filter by date
Filter by MW
Help

Keyword-based search Uses Boolean operators and Lucene query fields
Example searches:
  roche OR novartis
  sterili?e
  kinase*
  pfizer C07D
  "kinase inhibitor"
  pn:WO A1
  pa:(bayer OR genentech OR merck) AND desc:(chemotherap* AND ("phosphoinositide kinase"~0.8 OR Pi3K))
Logical operators (AND, OR) are written in UPPER case
Lucene field names are all lower case, e.g. pn (patent number), pa (assignee or applicant)
? represents one and only one character
* represents 0 to n characters
Quotes retrieve documents where the phrase exists exactly as entered
To find words that are within a specific proximity to one another, use the tilde "~" symbol at the end of a phrase
Complex searches can be formed by combining these elements
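The two wildcard rules can be tried offline. This is a minimal sketch that uses Python's fnmatch module as a stand-in: its ? and * have the same "exactly one character" and "0 to n characters" meaning, but it is not Lucene and does not reproduce tokenisation or anything else the SureChEMBL index does. The helper name matches is made up for illustration.

```python
from fnmatch import fnmatchcase


def matches(term: str, pattern: str) -> bool:
    """Check one term against a ?/* wildcard pattern, ignoring case.

    Stand-in only: fnmatch also gives '[' a special meaning, which the
    slide does not mention, so avoid brackets in patterns here.
    """
    return fnmatchcase(term.lower(), pattern.lower())


# '?' must consume exactly one character.
print(matches("sterilize", "sterili?e"))   # both spellings match
print(matches("sterilise", "sterili?e"))
print(matches("sterilie", "sterili?e"))    # one character short

# '*' may consume zero or more characters.
print(matches("kinase", "kinase*"))
print(matches("kinases", "kinase*"))
```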

Lucene field | Description | Indexed data | Sample query
scpn | SureChEMBL Patent Number (SCPN) | EP B1 | scpn:EP B1
pn | publication number | EP B1 | pn:ep b1
pd | publication date | | pd:
an | application number | EP A | an:EP A
ad | application date | | ad:
pri | priority(ies) | DE A | pri:"DE A "
pridate | all priority dates | | pridate:
pdyear | publication year | 2013 | pdyear:2013
ds | designated states | DE, GB | ds:(DE OR GB OR FR), ds:FR
pctpn | PCT publication number | WO A2 | pctpn:WO A2
pctpd | PCT publication date | | pctpd:
pctan | PCT application number | US W | pctan:US W
pctad | PCT application date | | pctad:
relan | related application number | Division of application No. 12/159,232 | relan:US
relad | related application date | Jun 26, 2008 | relad:
ic | IPCR | C, C08, C08K, C08K0005 | ic:C
cpc | CPC | C, C07, C07D, C07D0471, C07D047104 | cpc:C07D
ecla | ECLA | C07D487/10 | ecla:C07D487/10
uc | US class | 29 | uc:029
inv | inventor(s) | schmidt hans-werner | inv:("schmidt hans" AND thelakkat)
apl | applicant | Sony International (Europe) GmbH | apl:sony
asg | assignee | SIEMENS AKTIENGESELLSCHAFT | asg:siemens
pa | assignee(s) or applicant(s) (apl or asg) | see apl and asg above | pa:sony
cor | correspondent | Dr Roger Brooks | cor:"Dr Roger Brooks"
agt | agents | Pohlman, Sandra M | agt:"Pohlman, Sandra M"
pcit | patent citations | EP B1 | pcit:EP B1
ncit | non-patent citations | TANG C W: "Two-layer organic photovoltaic cell" | ncit:(tang AND "Two-layer organic photovoltaic cell")
ttl | title in English, French and German | Sonnenenergiesystem | ttl:("solar energy" OR "énergie solaire" OR Sonnen*)
ab | abstract in English, French and German | |
desc | description in English, French and German | |
clm | claims in English, French and German | |
text | abstract or description or claims in English, French or German | |
pnlang | publication language | EN FR DE PT NO RU NL SV FI TR IS and more | pnlang:(NO OR FI OR SV)

Fielded keyword search
Filter by document section Lucene Field Name pn: publication number pa: assignee(s) or applicant(s) Logical operators
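Fielded queries are plain strings, so they can be assembled programmatically before being pasted into the search box. The sketch below rebuilds the complex example from the keyword-search slide; the helper names (field, any_of, all_of, phrase) are invented for illustration and are not part of any SureChEMBL or Lucene library.

```python
def field(name: str, expr: str) -> str:
    """Prefix an expression with a field name; field names are lower case."""
    return f"{name.lower()}:{expr}"


def any_of(*terms: str) -> str:
    """Join terms with the upper-case OR operator inside parentheses."""
    return "(" + " OR ".join(terms) + ")"


def all_of(*terms: str) -> str:
    """Join terms with the upper-case AND operator inside parentheses."""
    return "(" + " AND ".join(terms) + ")"


def phrase(text: str) -> str:
    """Wrap text in straight double quotes for an exact-phrase match."""
    return f'"{text}"'


# Assignee/applicant is one of three companies, and the description
# mentions chemotherapy together with a PI3K-related term.
query = " AND ".join([
    field("pa", any_of("bayer", "genentech", "merck")),
    field("desc", all_of(
        "chemotherap*",
        any_of(phrase("phosphoinositide kinase") + "~0.8", "Pi3K"),
    )),
])
print(query)
```

Keeping the operators inside the helpers avoids the easy mistake of typing a lower-case "and" or "or", which would be read as an ordinary keyword.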

SureChEMBL Patent Numbers (SCPN)
Standardised format used to search the system
Format: CC-PATNO-KK (country code, number, kind code), e.g. WO A2
Batch conversion available via a link on the interface homepage
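The CC-PATNO-KK layout is easy to validate before submitting a search. The sketch below assumes a two-letter country code, a digits-only number and a kind code of one letter plus an optional digit; real numbers may not all fit that pattern, and the number used in the usage line is hypothetical, not a real publication. The batch conversion tool linked from the homepage remains the authoritative route.

```python
import re
from typing import NamedTuple


class Scpn(NamedTuple):
    country: str  # CC, e.g. WO, EP, US, JP
    number: str   # PATNO
    kind: str     # KK, e.g. A1, A2, B1


# Assumed pattern: CC-digits-KK. This is an illustration, not the official rule.
_SCPN_RE = re.compile(r"^([A-Z]{2})-(\d+)-([A-Z]\d?)$")


def parse_scpn(text: str) -> Scpn:
    """Split a CC-PATNO-KK string into its three parts."""
    match = _SCPN_RE.match(text.strip().upper())
    if match is None:
        raise ValueError(f"not in CC-PATNO-KK form: {text!r}")
    return Scpn(*match.groups())


def format_scpn(country: str, number: str, kind: str) -> str:
    """Join country code, number and kind code with hyphens."""
    return f"{country.upper()}-{number}-{kind.upper()}"


# Hypothetical number, used only to show the round trip.
parts = parse_scpn("wo-2014000001-a2")
print(parts, format_scpn(*parts))
```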

Keyword searches return documents

Patent family members

Export patent chemistry
Property range filters Count filters Go to ‘My Exports’ to download CSV or XML

Patent view - Front page

Patent view - Claims

Chemical entities in patent
Click on blue highlighted text to see chemical info box

Patent view - Tools Export chemistry for document or family
Access to source document PDF

Chemistry-based searching
Structure sketch (two sketchers available); filter by MW range; filter by document section
Types of search:
Substructure: Chemists are most often interested in substructure search, that is, whether a target structure contains the query structure within it. Note: if special molecular features are present in the query (e.g. stereochemistry, charge), only targets containing the feature are considered hits. However, if a feature is missing from the query, it is not checked, and targets without that feature may appear as hits.
Similarity: A similarity structure search looks for target structures that are similar to the query structure. The similarity concept implemented is based on hashed binary chemical fingerprints compared using the Tanimoto metric. That is to say, the presence of molecular features is recorded for both the query and the target, and the two records are then compared using a standard formula. Note: see Tanimoto Coefficient and Fingerprint Generation for a complete description of the concept and method. Note: if you choose Similarity as your search type, you will be prompted to provide a Tanimoto coefficient between 0.5 and 0.95.
Major Match: A Major Match structure search finds molecules that are equal in size to the query structure, with additional fragments or heavy atoms allowed. This search type is useful because it ignores salts or solvents that accompany the main structure in the target.
Basic: A Basic structure search finds molecules that are equal in size to the query structure. No additional fragments or heavy atoms are allowed. By default, molecular features are evaluated in the same way as described above for substructure search.
Identical: In an Identical structure search, all molecular features need to be equal (e.g. a non-stereo query will only match a non-stereo target).
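The "standard formula" for similarity is the Tanimoto coefficient: the number of features shared by query and target divided by the number of features present in either. The sketch below works on fingerprints given as sets of "on" bit positions. The slides do not say which fingerprints SureChEMBL uses, so the bit sets and compound labels here are invented purely to show the arithmetic and the 0.5 to 0.95 threshold range.

```python
from typing import Dict, List, Set, Tuple


def tanimoto(fp_a: Set[int], fp_b: Set[int]) -> float:
    """Tanimoto coefficient: shared bits / bits set in either fingerprint."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def similar_hits(
    query: Set[int], targets: Dict[str, Set[int]], threshold: float
) -> List[Tuple[str, float]]:
    """Return (label, score) pairs at or above the threshold, best first."""
    if not 0.5 <= threshold <= 0.95:
        raise ValueError("threshold must be between 0.5 and 0.95")
    scored = [(label, tanimoto(query, fp)) for label, fp in targets.items()]
    hits = [(label, score) for label, score in scored if score >= threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Made-up fingerprints: each integer is the position of a set bit.
query_fp = {1, 2, 3, 4}
library = {
    "cmpd_a": {1, 2, 3, 4},  # identical bits
    "cmpd_b": {1, 2, 3, 5},  # 3 shared bits, 5 bits in the union
    "cmpd_c": {1, 9},        # 1 shared bit, 5 bits in the union
}
print(similar_hits(query_fp, library, threshold=0.5))
```

Raising the threshold towards 0.95 narrows the result to near-identical fingerprints, while 0.5 admits more distant analogues.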

Chemistry searches return structures
Tautomers are registered as different structures, unlike in ChEMBL – this will likely change in future

Review chemistry hits

Compound report page
UniChem integration: on-the-fly integration with ~81M structures from 28 data sources

Review patent documents for chemistry

Review patent documents for chemistry: families

SureChEMBL knowledge base

SureChEMBL support

ChEMBL blog

Summary
SureChEMBL is a free resource
It complements ChEMBL with chemistry from the patent corpus
Developments are ongoing

Acknowledgements
ChEMBL team: John Overington, Jon Chambers, George Papadatos, Mark Davies, Nathan Dedman, Anna Gaulton
Digital Science: Nicko Goncharoff, James Siddle, Richard Koks
Open PHACTS consortium
Funding:
  Innovative Medicines Initiative Joint Undertaking, grant agreement no (Open PHACTS)
  Wellcome Trust Strategic Award for Chemogenomics, WT086151/Z/08/Z
  European Molecular Biology Laboratory
  European Commission FP7 Capacities Specific Programme, grant agreement no (BioMedBridges)
Software:

Future webinars:
25th March: UniProt: Exploring protein sequence and functional information
8th April: Introduction to ENA
22nd April: Ensembl Tools
6th May: Reactome: Exploring biological pathways
All at 4:00pm GMT
For details see: ebi-training-webinar-series-2015

myChEMBL Example

What is myChEMBL? A Virtual Machine, preloaded with…
A complete version of the ChEMBL database
Chemical structure searching
GUI & web services for accessing the database
A suite of chemoinformatics and data analysis tools
Tutorials on a range of topics
  Using ChEMBL data
  Chemoinformatics, machine learning, etc.
Completely free and open

myChEMBL: Applications
Centralised resource
  VM shareable across the local network
  Local access to standardised tools, services and data
Application development
  Sandboxed VM, all source code available
Learning
  Lowers the 'activation barrier' with pre-installed tools and examples
Teaching, training and dissemination
  IPython notebooks and KNIME
  2nd Prize at the ACS Teach-Discover-Treat competition
Open Data combined with Open Tools on an Open OS
An all-in-one platform to learn and practise chemoinformatics
myChEMBL uses exclusively free and open source tools and libraries, so it removes the expensive licensing costs often associated with similar applications.
myChEMBL runs locally behind a firewall, so the typical concerns about submitting sensitive data to web-based applications do not apply.
The source code is available for all myChEMBL applications, so developers can use it as a starting point for applications they wish to develop in the future.
Because it offers interactive, web- and GUI-based tools, myChEMBL requires no prior programming experience or knowledge.
myChEMBL provides a versatile platform for learning chemical data mining and cheminformatics in an intuitive and straightforward way. The combination of data with relevant pre-installed tools effectively lowers the 'activation barrier' and shifts the focus to hands-on programming and learning.
Used in our annual Wellcome Trust courses, myChEMBL is a proven resource for training scientists in the essential tools of chemoinformatics and computer-aided drug discovery.

myChEMBL LaunchPad

SureChEMBL and myChEMBL
More: Download: ftp://ftp.ebi.ac.uk/pub/databases/chembl/VM/myChEMBL/current/

SureChEMBL and myChEMBL
