Web Services as Integrators of Public Chemistry Databases Gary Wiggins School of Informatics Indiana University
Interdisciplinary Science “The boundary between the known and the unknown is where science flourishes.” --Michael Shermer “There are known knowns… There are known unknowns… There are unknown unknowns…” --Donald Rumsfeld BUT…. What about UNKNOWN knowns?
zzzzzCAS “In an industry first, Chemical Abstracts Service (CAS) has unveiled a revolutionary new literature searching tool which will permit scientists to search and retrieve the world’s chemical literature—including patents and obscure technical reports—in their sleep.” --Author unknown
Overview of the Talk Introduction to Web Services Public Databases and Data Repositories for Chemistry Projects Underway or Planned at Indiana University
Web Services Introduction What are “Web Services”? A distributed invocation system built on Grid computing Independent of platform and programming language Built on existing Web standards A service oriented architecture with Interfaces based on Internet protocols Messages in XML (except for binary data attachments)
Service Oriented Architecture (SOA) Goal is to achieve loose coupling among interacting software agents Define service: a unit of work done by a service provider to achieve desired end results for a service consumer Both provider and consumer are roles played by software agents on behalf of their owners.
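The provider and consumer roles above can be sketched in a few lines of Python. This is only an illustration of loose coupling, not a real Web service: the function names, the dictionary message format, and the tiny weight table are all assumptions made for the example.

```python
from typing import Callable, Dict

# A "service" here is a named unit of work: request message in, response message out.
Service = Callable[[Dict[str, str]], Dict[str, str]]


def molecular_weight_service(request: Dict[str, str]) -> Dict[str, str]:
    """Hypothetical provider: looks up a molecular weight in a small table."""
    weights = {"water": "18.015", "ethanol": "46.069"}  # g/mol, illustrative table
    name = request["name"]
    return {"name": name, "molecular_weight": weights.get(name, "unknown")}


def consumer(service: Service, name: str) -> str:
    """Consumer depends only on the message contract, not on the provider's code."""
    response = service({"name": name})
    return response["molecular_weight"]


print(consumer(molecular_weight_service, "water"))
```

Because the consumer sees only the request and response messages, the provider could be swapped for a remote implementation without changing the consumer.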
Web Services Architectures Individual services are registered globally Broken down into individual services with inputs and outputs specified Services are published Services are requested Open registry, publishing, and requesting
Service-Oriented Architecture From Curcin et al. DDT, 2005, 10(12),867
Web Services for Science Invisible Services, Semantic Web, and Grid Easy-to-use tools for any scientist High throughput, resource intensive computing done for low cost/resources Shared community Collaborations between labs and fields Shared data Shared tools
e-Science and the Grid 1 e-Science: Major UK Program global collaboration in key areas of science and the next generation of infrastructure that will enable it reflects growing importance of international laboratories, satellites and sensors and their integrated analysis by distributed teams total investment of some £200M over the five-year period from 2001 to 2006 CyberInfrastructure: the analogous US initiative Grid Technology: supports e-Science & Cyberinfrastructure
e-Science and the Grid 2
Basic Architectures: Servlets/CGI and Web Services [Diagram of two architectures. Servlets/CGI: Browser, HTTP GET/POST, Web Server, JDBC, DB or MPI application. Web Services: Browser, Web Server, and GUI Client; SOAP and WSDL; Web Server, JDBC, DB or MPI application.]
When To Use Web Services? Applications do not have severe restrictions on reliability and speed. Two or more organizations need to cooperate. One needs to write an application that uses another’s service. Services can be upgraded independently of clients. Services can be easily expressed with simple request/response semantics and simple state.
Web Services Benefits Web services provide a clean separation between a capability and its user interface. Increase in productivity Increase in flexibility Rapid return on investment Integration across multiple applications
Web Services Advantages Output in human- and computer-readable formats I/O formats based on standard Internet protocols Resources accessible server to server allow automated I/O Integration based on specific services: you select services or data needed without downloading the entire data set
Web Services Advantages Description protocols provide details of service provided and interface components Semantic Web standards increase efficiency Use a central registry and standardized description of services Quality and status of the information is dynamically available
Web Services Drawbacks Based on new technologies Time and commitment required to learn Standards still in a state of flux Issues with quality of data (and, for chemistry, quantity of open data), security, and privacy
Components of Web Services Protocols WSDL SOAP UDDI XML as a basis for the protocols Ontologies OWL: Web Ontology Language Semantic Web
WSDL: Web Services Description Language Describes a service’s interface to clients Services register themselves with Web Services WSDL describes how to contact and interact with services I/O, operations and messages to aid interaction with client
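To make the idea of an interface description concrete, here is a minimal sketch that reads the operations out of a small WSDL 1.1 document using only the Python standard library. The `MolWeight` service, its target namespace, and its message names are invented for illustration; only the WSDL namespace URI is real.

```python
import xml.etree.ElementTree as ET

WSDL_NS = {"wsdl": "http://schemas.xmlsoap.org/wsdl/"}  # WSDL 1.1 namespace

# Hypothetical, heavily trimmed WSDL document (no bindings or service element).
EXAMPLE_WSDL = """
<definitions name="MolWeight"
    targetNamespace="http://example.org/molweight"
    xmlns:tns="http://example.org/molweight"
    xmlns="http://schemas.xmlsoap.org/wsdl/">
  <message name="GetWeightRequest"/>
  <message name="GetWeightResponse"/>
  <portType name="MolWeightPort">
    <operation name="GetWeight">
      <input message="tns:GetWeightRequest"/>
      <output message="tns:GetWeightResponse"/>
    </operation>
  </portType>
</definitions>
"""


def list_operations(wsdl_text: str) -> list:
    """Return (portType, operation, input message, output message) tuples."""
    root = ET.fromstring(wsdl_text.strip())
    operations = []
    for port_type in root.findall("wsdl:portType", WSDL_NS):
        for op in port_type.findall("wsdl:operation", WSDL_NS):
            inp = op.find("wsdl:input", WSDL_NS)
            out = op.find("wsdl:output", WSDL_NS)
            operations.append((
                port_type.get("name"),
                op.get("name"),
                inp.get("message") if inp is not None else None,
                out.get("message") if out is not None else None,
            ))
    return operations


print(list_operations(EXAMPLE_WSDL))
```

A client toolkit would normally generate stubs from such a description; the sketch only shows that the operations and their messages are machine-readable.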
SOAP: Simple Object Access Protocol Maps abstract WSDL service descriptions to concrete implementations Flexible protocol to communicate information between server and server or client and server using XML Supports Remote Procedure Calls Allows layers (security, authentication, transactions) over the basic SOAP elements
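As a rough illustration of an RPC-style SOAP message, the sketch below builds a SOAP 1.1 envelope and reads it back with the Python standard library. The service namespace, the `GetWeight` operation, and its parameter are assumptions; no headers, faults, or transport are shown.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "http://example.org/molweight"  # hypothetical service namespace


def build_request(operation: str, params: dict) -> str:
    """Wrap an operation call and its parameters in Envelope/Body elements."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for key, value in params.items():
        ET.SubElement(call, f"{{{SERVICE_NS}}}{key}").text = value
    return ET.tostring(envelope, encoding="unicode")


def parse_request(xml_text: str) -> tuple:
    """Recover the operation name and parameters on the receiving side."""
    envelope = ET.fromstring(xml_text)
    body = envelope.find(f"{{{SOAP_ENV}}}Body")
    call = body[0]  # first child of Body is the operation element
    operation = call.tag.split("}", 1)[1]
    params = {child.tag.split("}", 1)[1]: child.text for child in call}
    return operation, params


message = build_request("GetWeight", {"name": "isatin"})
print(message)
print(parse_request(message))
```

In practice the message would be carried over HTTP or SMTP, and extra layers such as security would travel in a SOAP Header element, which this sketch omits.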
UDDI: Universal Description, Discovery, and Integration Provides ways for clients and services to interact with other services Uses XML Defines the means of access, e.g., URL Defines services hosted by an entity Business-oriented tags Uses SOAP for communicating
XML and Web services XML lends itself to distributed computing: It’s just a data description. Platform, programming language independent Web Services Description Language (WSDL) Describes how to invoke a service Can bind to SOAP, other protocols for actual invocation Simple Object Access Protocol (SOAP) Wire protocol extension for conveying RPC calls Can be carried over HTTP, SMTP
RDF: Resource Description Framework A standard for statements describing resources RDF statements consist of a Subject: the resource being described Predicate: the property ascribed to the resource Object: the entity to which the resource bears that property An alternative to UDDI: said to be more flexible and extensible than UDDI
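The subject, predicate, object structure can be sketched as plain Python tuples with a simple pattern match. The `ex:` prefix and the predicates `ex:hasFormula` and `ex:listedIn` are a made-up vocabulary for illustration; a real application would use full URIs and an RDF library.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical statements about one compound.
TRIPLES: List[Triple] = [
    ("ex:isatin", "ex:hasFormula", "C8H5NO2"),
    ("ex:isatin", "ex:listedIn", "ex:ChemExper"),
    ("ex:isatin", "ex:listedIn", "ex:HIC-Up"),
]


def match(triples: List[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> List[Triple]:
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)]


# Which databases list isatin?
print([o for _, _, o in match(TRIPLES, "ex:isatin", "ex:listedIn")])
```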
RDFS: Resource Description Framework Schema RDF statements ascribe properties to resources, but RDF provides no means for describing those properties. RDFS defines classes and properties that are used to describe classes, properties, and other resources. RDFS vocabulary descriptions are written in RDF. RDFS provides a means of specifying a vocabulary for a particular domain.
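One thing a domain vocabulary buys is simple inference. The sketch below applies two RDFS rules over plain tuples: `rdfs:subClassOf` is transitive, and an instance of a subclass is also an instance of its superclasses. The `ex:` class names are hypothetical, and the fixed-point loop is a naive illustration rather than how an RDF reasoner is implemented.

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

# Hypothetical domain vocabulary plus one instance statement.
TRIPLES = {
    ("ex:IRSpectrum", SUBCLASS, "ex:Spectrum"),
    ("ex:Spectrum", SUBCLASS, "ex:Measurement"),
    ("ex:isatin_ir", RDF_TYPE, "ex:IRSpectrum"),
}


def rdfs_closure(triples: set) -> set:
    """Repeat the two rules until no new triples appear."""
    inferred = set(triples)
    while True:
        new = set()
        for s, p, o in inferred:
            if p != SUBCLASS:
                continue
            for s2, p2, o2 in inferred:
                # Transitivity: s subClassOf o, o subClassOf o2.
                if p2 == SUBCLASS and s2 == o:
                    new.add((s, SUBCLASS, o2))
                # Inheritance: s2 is an instance of s, s subClassOf o.
                if p2 == RDF_TYPE and o2 == s:
                    new.add((s2, RDF_TYPE, o))
        if new <= inferred:
            return inferred
        inferred |= new


closure = rdfs_closure(TRIPLES)
print(("ex:isatin_ir", RDF_TYPE, "ex:Measurement") in closure)
```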
OWL: Web Ontology Language Builds on RDF and RDFS and adds a means for richer descriptions of properties and classes Disjoint classes Cardinality of classes Characteristics of relations, like symmetry
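Two of the richer descriptions listed above, symmetric relations and disjoint classes, can be mimicked over plain tuples. This is a sketch of the ideas only, not OWL syntax or an OWL reasoner; the `ex:` names are hypothetical, and cardinality is not covered.

```python
RDF_TYPE = "rdf:type"

# Hypothetical facts; the second compound is deliberately mis-classified.
FACTS = {
    ("ex:compoundA", "ex:isEnantiomerOf", "ex:compoundB"),
    ("ex:compoundA", RDF_TYPE, "ex:Organic"),
    ("ex:compoundB", RDF_TYPE, "ex:Organic"),
    ("ex:compoundB", RDF_TYPE, "ex:Inorganic"),
}
SYMMETRIC = {"ex:isEnantiomerOf"}
DISJOINT = [("ex:Organic", "ex:Inorganic")]


def apply_symmetry(triples: set, symmetric_props: set) -> set:
    """If p is symmetric and (s, p, o) holds, then (o, p, s) holds too."""
    return set(triples) | {(o, p, s) for s, p, o in triples
                           if p in symmetric_props}


def disjoint_violations(triples: set, disjoint_pairs: list) -> list:
    """List resources typed with both members of a disjoint class pair."""
    types = {}
    for s, p, o in triples:
        if p == RDF_TYPE:
            types.setdefault(s, set()).add(o)
    return sorted(s for s, classes in types.items()
                  if any(a in classes and b in classes
                         for a, b in disjoint_pairs))


expanded = apply_symmetry(FACTS, SYMMETRIC)
print(("ex:compoundB", "ex:isEnantiomerOf", "ex:compoundA") in expanded)
print(disjoint_violations(FACTS, DISJOINT))
```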
Standards for Web Services Business Process Execution Language for Web Services (BPEL4WS) Semantic Markup for Web Services (OWL-S) Web Service Modeling Ontology (WSMO)
Standards Setting Bodies: OASIS OASIS: Organization for the Advancement of Structured Information Standards ebXML: e-business XML UDDI: Universal Description, Discovery and Integration Global Grid Forum community of users, developers, and vendors leading the global standardization effort for grid computing
Standards Setting Bodies: W3C W3C: World Wide Web Consortium OWL: Web Ontology Language RDF/RDFS: Resource Description Framework/Schema SOAP: Simple Object Access Protocol URI/URL/URN: Uniform Resource Identifier/Locator/Name WSDL: Web Services Description Language XML: eXtensible Markup Language
Web Services Integration Projects: Biosciences myGrid BIOPIPE BioMOBY
Web Services for Chemistry: Problems Performance and scalability Proprietary data Competition from high-performance desktop applications -- Geoff Hutchison, it’s a puzzle blog, ALSO: Lack of a substantial body of trustworthy Open Access databases Non-standard chemical data formats (over 40 in regular use and requiring normalization to one another)
Necessary Ingredients in Chemistry Chemical communities to assemble Open Access databases Well-defined quality assurance procedures performed by distributed peer-review systems Software underlying the databases needs to be open source.
BlueObelisk.org A group of chemists, programmers, and informaticians working collaboratively on projects such as: Chemistry Development Kit (CDK) JChemPaint Jmol JUMBO NMRShiftDB Octet Open Babel QSAR World Wide Molecular Matrix (WWMM)
Components of the Semantic Web for Chemistry XML – eXtensible Markup Language RDF – Resource Description Framework RSS – Rich Site Summary Dublin Core – allows metadata-based newsfeeds OWL – for ontologies BPEL4WS – for workflow and web services Murray-Rust et al. Org. Biomol. Chem. 2004, 2,
Chemistry Databases on the Web Marc Nicklaus lists 37 databases as of October 2001 Must have structure searching and at least 100 molecules SoaringBear’s List has 15 databases
Institutional Repositories NARSTO Quality Systems Science Center Pollutant species in the troposphere over North America Part of the Carbon Dioxide Information Analysis Center at ORNL NARSTO Data and Information Sharing Tool
Public Databases Developmental Therapeutics Program/NCI Some assay data for download Structures for over 200,000 compounds ZINC and other screening databases NIST computational chemistry database Environmental fate and exposure databases
Other Public Databases 1 ChemExper Chemical Directory > 200,000 substances; > 10,000 IR spectra HIC-Up; Hetero-Compound Identification Centre – Uppsala 5384 substances as of 1/15/ Chemicals with Pharmaceutical Activity; a 3D Structural Database 400 3D structures
Isatin in ChemExper
Isatin ChemExper IR Spectrum
Isatin from HIC-Up
Other Public Databases 2 Cheminformatics.org 41 data sets in 9 categories as of 8/18/05 WebReactions
Other Public Databases 3 MolTable MatWeb Materials Property Data Spectral Database for Organic Compounds (SDBS) Over 32,000 compounds Has EI-MS, FT-IR, 1H NMR, 13C NMR, Raman, ESR NMRShiftDB (Christoph Steinbeck) 14,753 structures as of 8/19/05 Features peer-reviewed submission of data sets
Comment on Link to PubChem “It is great to know that - with PubChem - there will be something like a single point of entry for people interested in structures and their properties. We are looking forward to link our datasets to PubChem structures.” --Christoph Steinbeck, commenting on the NMRShiftDB, CHMINF-L, 19 April 2005
Other Public Databases: Commercial Teasers FTIRsearch.com (Thermo Electron) Demo file of 575 spectra from 87,000 in the full database ChemACX 30 of >350 suppliers catalog data Sunset Molecular Discovery, LLC Wombat (World of Molecular BioAcTivity) 117,007 entries with over 230,000 biological activities Wombat PK Database for Clinical Pharmacokinetics: 643 substances with 4668 measurements Three sample files from Wombat containing 341 Histamine-1 receptor antagonists
Indiana University Existing Projects System for the Integration of Bioinformatics Services (SIBIOS) PlatCom: A Platform for Computational Comparative Genomics Reciprocal Net
Indiana University Planned Projects Design of a Grid-based distributed data architecture Development of tools for HTS data analysis and virtual screening Database for quantum mechanical simulation data Chemical prototype projects Novel routes to enzymatic reaction mechanisms Mechanism-based drug design Data-inquiry-based development of new methods in natural product synthesis
Community Grids Lab
Open Systems Lab
Web Services Future Depends on Adoption of standards Incorporation of WS in current and newly developed applications Security, privacy, quality of data issues Development of WS tools and resources for e-Science
“Tripos Embraces Web Services” Goal: to make it easier to incorporate Tripos’ high-throughput workflows Tripos’ Service-Oriented Informatics (SOI) In partnership with SciTegic: Web services ensure software compatibility between the companies’ products.
Provocative Thought “While much bioscience is published with the knowledge that machines will be expected to understand at least part of it, almost all chemistry is published purely for humans to read.” Murray-Rust et al. Org. Biomol. Chem. 2004, 2, 3201.
Closing quote “The future of chemistry depends on the automated analysis of chemical knowledge, combining disparate data sources in a single resource, such as the World-Wide Molecular Matrix, which can be analysed using computational techniques to assess and build on these data.” Townsend et al. Org. Biomol. Chem. 2004, 2, 3299.
Acknowledgements Christoph Steinbeck Geoffrey Fox, Marlon Pierce Peter Murray-Rust Michael D. Jensen, Timothy B. Patrick, Joyce A. Mitchell Elsevier
Bibliography
Coles, Simon J.; Day, Nick E.; Murray-Rust, Peter; Rzepa, Henry S.; Zhang, Yong. “Enhancement of the chemical semantic web through InChIfication.” Organic & Biomolecular Chemistry 2005, 3,
Curcin, Vera; Ghanem, Moustafa; Guo, Yike. “Web services in the life sciences.” Drug Discovery Today 2005, 10(12),
Day, Michael. Metadata: Mapping between metadata formats. (accessed 8/6/2005)
De Roure, D.; Jennings, N. R.; Shadbolt, N. R. “The Semantic Grid: Past, present and future.” Proceedings of the IEEE 2005, 93(3),
Jensen, Michael; Patrick, Timothy B.; Mitchell, Joyce A. “Web services for bioinformatics.” PSB 2004 Tutorial
Murray-Rust, Peter S.; Mitchell, John B.O.; Rzepa, Henry S. “Communication and re-use of chemical information in bioscience.” BMC Bioinformatics 2005, 6, 180.
Murray-Rust, Peter; Mitchell, John B.O.; Rzepa, Henry S. “Chemistry in bioinformatics.” BMC Bioinformatics 2005, 6,
Murray-Rust, Peter; Rzepa, Henry S.; Tyrrell, Simon M.; Zhang, Yong. “Representation and use of chemistry in the global electronic age.” Organic & Biomolecular Chemistry 2004, 2,
Salamone, Salvatore. “Tripos embraces Web Services.” Bio-IT World 2005, 4(8), 8.
Shermer, Michael. “Rumsfeld’s Wisdom.” Scientific American 2005, 293(3), 38.
Townsend, Joe A.; Adams, Sam E.; Waudby, Christopher A.; de Souza, Vanessa K.; Goodman, Jonathan M.; Murray-Rust, Peter. “Chemical documents: machine understanding and automated information extraction.” Organic & Biomolecular Chemistry 2004, 2,
World Wide Molecular Matrix.
Summary Report – W3C Workshop on Semantic Web for Life Sciences, October 27-28,