Building a Chemical Informatics Grid Marlon Pierce Community Grids Laboratory Indiana University
Chemical Informatics as a Grid Application. Chemical informatics is the application of information technology to problems in chemistry. Example problems: managing data in large-scale drug discovery and molecular modeling. Building blocks (chemical informatics resources): chemical databases maintained by various groups (NIH PubChem, NIH DTP); application codes, both commercial and open source (data mining and clustering, quantum chemistry and molecular modeling); visualization tools; and web resources (journal articles, etc.). A Chemical Informatics Grid will need to integrate these into a common, loosely coupled, distributed computing environment.
Problem: Connecting It Together. The problem is defining an architecture for tying all of these pieces into a distributed computing system: a Grid. How can we combine application codes, web resources, and databases to solve a particular science problem? Specifically, how do we build a runtime environment that can connect the distributed services we need to solve an interesting problem? For academic and government researchers, how can we do all of this in an open fashion? Data and services can come from anywhere. That is, we must avoid proprietary infrastructure, although individual pieces may be commercial.
NIH Roadmap for Medical Research (http://nihroadmap.nih.gov/). The NIH recognizes chemical and biological information management as critical to medical research. Federally funded high-throughput screening (HTS) centers run 100-200 HTS assays per year on small molecules, with hundreds of thousands of small molecules analyzed. The data are published and publicly available through the NIH PubChem online database. What do you do with all of this data? That is, how can you create an extensible toolbox of services that can be combined into interesting applications for clustering, mining, and modeling the data?
The Solution, Part I: Web Services. Web Services provide the means for wrapping databases, applications, web scavengers, etc., with programming interfaces. WSDL definitions specify how to write clients that talk with databases, applications, etc. Web Service messaging is done through SOAP. Discovery is handled by services such as UDDI, MDS, and so on. Many toolkits are available: Axis, .NET, gSOAP, SOAP::Lite, etc. Web Services can be combined with each other into workflows, where a workflow corresponds to a use case scenario. More about this later.
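To make the messaging layer concrete, here is a minimal sketch of building a SOAP 1.1 request envelope with only the Python standard library. The operation name `makebitsGenerate` comes from the BCI service table later in the talk; the service namespace `urn:example:bci` and the parameter name `smiString` are hypothetical placeholders and are not taken from any real WSDL.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:bci"  # hypothetical service namespace (assumption)


def build_request(operation, params):
    """Return a SOAP 1.1 envelope (as a string) calling `operation` with `params`."""
    ET.register_namespace("soapenv", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    # The operation element lives in the service's own namespace.
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, name).text = value
    return ET.tostring(envelope, encoding="unicode")


# "smiString" is an assumed parameter name; "CCO" is the SMILES for ethanol.
request_xml = build_request("makebitsGenerate", {"smiString": "CCO"})
print(request_xml)
```

In practice a toolkit such as Axis generates this plumbing from the WSDL, so client code rarely builds envelopes by hand; the sketch only shows what travels over the wire.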
Basic Architectures: Servlets/CGI and Web Services. [Diagram comparing the two architectures. Labels: Browser, Web Server, HTTP GET/POST, DB, JDBC, SOAP, GUI Client, WSDL.]
Solution Part II: Grid Resources. Many Grid tools provide powerful backend services. Globus: uniform, secure access to computing resources (like TeraGrid), including file management, resource allocation management, etc. Condor: job scheduling on computer clusters and collections. SRB: data grid access. OGSA-DAI: uniform Grid interface to databases. These have Web Service as well as other interfaces (or equivalently, protocols).
Solution, Part III: Domain-Specific Tools and Standards (More Services). For Chemical Informatics, we have a number of tools and standards. Chemical string representations: SMILES, InChI. Chemical Markup Language (CML): an XML language for describing and exchanging data. JUMBO 5: a CML parser and library. Glue tools and applications: Chemistry Development Kit (CDK), OpenBabel. These are the basis for building interoperable Chemical Informatics Web Services. Analogous situations exist for other domains: astronomy, geosciences, biology/bioinformatics.
Solution Part IV: Workflows. Workflow engines allow you to connect services together into interesting composite applications. This allows you to directly encode your scientific use case scenario as a graph of interacting services. There are many workflow tools; we'll briefly cover these later. General guidance is to build web services first and then use workflow tools on top of these services. Don't get married to a particular workflow technology yet, unless someone pays you.
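As a rough illustration of encoding a use case as a graph of services, the sketch below describes a workflow as a dependency graph and derives a valid execution order with the standard-library `graphlib` module (Python 3.9+). The node names are illustrative labels loosely based on tools mentioned elsewhere in the talk. The dependencies are assumptions for the example and do not describe how any particular workflow engine works.

```python
from graphlib import TopologicalSorter

# Each service maps to the set of services whose output it consumes.
# Names and edges are illustrative assumptions, not a real deployment.
workflow = {
    "dtp_query": set(),                  # retrieve compounds and screening data
    "filter": {"dtp_query"},             # drop unwanted compounds
    "toxtree": {"filter"},               # flag potential toxicity
    "divkmeans": {"filter"},             # cluster on structure
    "votable": {"toxtree", "divkmeans"}, # merge results into one table
    "voplot": {"votable"},               # visualize
}

# static_order() yields every node after all of its dependencies.
order = list(TopologicalSorter(workflow).static_order())
print(order)
```

Because the graph only names services and their inputs, the same description could be handed to different workflow tools, which fits the advice to build services first and choose the engine later.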
Solution Part V: User Interfaces. Web Services allow you to cleanly separate user interfaces from backend services, following the model-view-controller pattern for web applications. Client environments include Grid and web service scripting environments, desktop tools like Taverna and Kepler, and portlet-based Web portal systems. Typically, desktop tools like Taverna are used by power users to define interesting workflows. Portals are for running canned workflows.
Wrapping Science Applications as Services. Science Grid services typically must wrap legacy applications written in C or Fortran. You must handle such problems as: specifying several input and output files (these may need to be staged in); launching executables and monitoring their progress; and specifying environment variables. Often these applications also have shell scripts that perform miscellaneous tasks. How do you convert this to WSDL? Or (equivalently) how do you automatically generate the XML job description for WS-GRAM?
Our Solution: Apache Ant Services. We've found Apache Ant to be very useful for wrapping services. It can call executables and set environment variables, has lots of useful built-in shell-like tasks, and is extensible (you can write your own tasks). You develop build scripts to run your application. You can easily call Ant from other Java programs, so you just write a wrapper service. We use both blocking (hold the connection until return) and non-blocking (suitable for long-running codes) versions. In the non-blocking case, a Context web service is used for callbacks.
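The blocking and non-blocking styles can be sketched as follows. This is not the authors' Java/Ant implementation: it is a minimal Python analogue, assuming the wrapped code is an ordinary command line, with a plain callback function standing in for the Context web service. The function names and the `BCI_HOME` variable are hypothetical.

```python
import os
import subprocess
import sys
import threading


def run_blocking(command, extra_env=None):
    """Run the command and hold the caller until it finishes."""
    env = {**os.environ, **(extra_env or {})}  # add per-job environment variables
    completed = subprocess.run(command, env=env, capture_output=True, text=True)
    return completed.returncode, completed.stdout


def run_non_blocking(command, on_done, extra_env=None):
    """Start the command in a background thread and return immediately.

    on_done(returncode, stdout) is called on completion; it stands in for
    a callback to a context service (an assumption of this sketch).
    """
    def worker():
        on_done(*run_blocking(command, extra_env))

    thread = threading.Thread(target=worker)
    thread.start()
    return thread


# Stand-in for a real invocation such as ["ant", "-f", "build.xml", "run"].
command = [sys.executable, "-c", "import os; print(os.environ['BCI_HOME'])"]
print(run_blocking(command, {"BCI_HOME": "/opt/bci"}))  # hypothetical variable

job = run_non_blocking(command, lambda rc, out: print("finished:", rc),
                       {"BCI_HOME": "/opt/bci"})
job.join()  # a real service would return to the client instead of joining
```

The non-blocking form suits long-running codes because the client connection is released as soon as the job starts, and the result is delivered later through the callback.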
Flow Chart of the BCI Web Service: SMILES to Partitioned Clusters. [Flow chart. Generating fingerprints: a SMILES string is passed to Makebits, with the default dictionary, to produce a fingerprint (*.scn). Clustering fingerprints: DivKmeans produces a cluster hierarchy (*.dkm); the two steps together are labeled "SMILES to DKM". Generating the best levels: Optclus. Extracting individual cluster partitions: RNNclus, producing an extracted cluster hierarchy (*.clu). A one-column process and a merge process then produce a new SMILES string.]
BCI Clustering Service Methods
Service method | Description | Input | Output
makebitsGenerate | Generate fingerprints from a SMILES structure | SMIstring | Fingerprint string
divkmGenerate | Cluster fingerprints with Divkmeans | SCNstring | Clustered hierarchy
smile2dkm | Makebits + divkm | SMIstring | Clustered hierarchy
optclusGenerate | Generate the best levels in a hierarchy | DKMstring | Best partition cluster level
rnnclusGenerate | Extract individual cluster partitions | DKMstring | Individual cluster partitions
smile2ClusterPartitioned | Generate a new SMILES structure with an extra column | SMIstring | New SMILES structure
All Services Great and Small. Like most Grids, a Chemical Informatics Grid will have the classic styles. Data Grid Services: these provide access to data sources like PubChem. Execution Grid Services: these are used for running cluster analysis programs, molecular modeling codes, etc., on TeraGrid and similar places. But we also need many additional services: handling format conversions (for example, between InChI and SMILES); shipping and manipulating tabular data; determining toxicity of compounds; and generating batch 2D images. So one of our core activities is to build lots of services.
VOTables: Handling Tabular Data. VOTable was developed by the Virtual Observatory community for encoding astronomy data. The VOTable format is an XML representation of tabular data (here, data coming from BCI, the NIH DTP databases, and so on). VOTable-compatible tools have already been built; we just inherit them. SAVOT and JAVOT, Java parser APIs for VOTable, allow us to easily build VOTable-based applications: Web Services, spreadsheets, and plotting applications. VOPlot and TopCat are two such tools.
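For readers unfamiliar with the format, the sketch below writes a small table using the core VOTable element layout (VOTABLE, RESOURCE, TABLE, FIELD, DATA, TABLEDATA, TR, TD) with the Python standard library. It is a simplified illustration that omits the XML namespace and most optional metadata. The column names follow the Mass and PSA plot described in the talk, and the numeric values are placeholders, not real measurements.

```python
import xml.etree.ElementTree as ET


def rows_to_votable(fields, rows):
    """Serialize rows into a minimal VOTable document.

    fields: list of (name, datatype) pairs, e.g. ("Mass", "double")
    rows:   list of tuples, one value per field
    """
    votable = ET.Element("VOTABLE", version="1.1")
    table = ET.SubElement(ET.SubElement(votable, "RESOURCE"), "TABLE")
    for name, datatype in fields:
        attrs = {"name": name, "datatype": datatype}
        if datatype == "char":
            attrs["arraysize"] = "*"  # variable-length string column
        ET.SubElement(table, "FIELD", attrs)
    tabledata = ET.SubElement(ET.SubElement(table, "DATA"), "TABLEDATA")
    for row in rows:
        tr = ET.SubElement(tabledata, "TR")
        for value in row:
            ET.SubElement(tr, "TD").text = str(value)
    return ET.tostring(votable, encoding="unicode")


fields = [("SMILES", "char"), ("Mass", "double"), ("PSA", "double")]
rows = [("CCO", 1.0, 2.0), ("c1ccccc1", 3.0, 4.0)]  # placeholder values only
votable_xml = rows_to_votable(fields, rows)
print(votable_xml)
```

A file produced this way is the kind of input that VOTable-aware tools are designed to read, although real deployments would normally use an existing parser library rather than hand-written serialization.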
[Figure] mrtd1.txt: SMILES representation of chemical compounds along with their properties.
[Figure] Votable.xml: XML representation of the mrtd1.txt file.
[Figure] VOPlot application displaying the generated votable.xml file: graph plotted with Mass on the x-axis and PSA on the y-axis.
Other Uses for VOTables. VOTable is a useful intermediate format for exchanging data between databases. Simple example: exchanging data between VARUNA databases. Each student in the Baik group maintains his or her own copy (for sandbox purposes) and often needs to import or export individual data sets. It is also good for storing intermediate results in workflows. The value is not the format itself, but the fact that the XML can be manipulated programmatically: union, subset, and intersection operations.
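The union, subset, and intersection operations mentioned above can be sketched on tables held in memory as lists of row dictionaries, as one might have after parsing two VOTables. Matching rows on an `id` column is an assumption of this example, as are the function names and the toy rows.

```python
def index_by(rows, key="id"):
    """Map each row's key value to the row itself."""
    return {row[key]: row for row in rows}


def union(a, b, key="id"):
    """Rows present in either table; on a key clash the row from `a` wins."""
    merged = index_by(a, key)
    for k, row in index_by(b, key).items():
        merged.setdefault(k, row)
    return list(merged.values())


def intersection(a, b, key="id"):
    """Rows of `a` whose key also appears in `b`."""
    keys_b = set(index_by(b, key))
    return [row for row in a if row[key] in keys_b]


def subset(rows, predicate):
    """Rows that satisfy the given predicate."""
    return [row for row in rows if predicate(row)]


# Toy sandbox tables, invented purely for illustration.
sandbox_a = [{"id": 1, "smiles": "CCO"}, {"id": 2, "smiles": "CCN"}]
sandbox_b = [{"id": 2, "smiles": "CCN"}, {"id": 3, "smiles": "CCC"}]

print(union(sandbox_a, sandbox_b))
print(intersection(sandbox_a, sandbox_b))
print(subset(sandbox_a, lambda row: row["smiles"].endswith("O")))
```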
More Services: WWMM Services
Service | Description | Input | Output
InChIGoogle | Search an InChI structure through Google | inchiBasic type | Search result in HTML format
InChIServer | Generate InChI | version format | An InChI structure
OpenBabelServer | Transform a chemical format to another using Open Babel | format inputData outputData options | Converted chemical structure string
CMLRSSServer | Generate CMLRSS feed from CML data | mol, title description link, source | Converted CMLRSS feed of CML data
CDK-Based Services
Common Substructure: Calculates the common substructure between two molecules.
CDKsim: Takes two SMILES and evaluates the Tanimoto coefficient (ratio of intersection to union of their fingerprints).
CDKdesc: Calculates a variety of molecular and atomic descriptors for QSAR modeling.
CDKws: Fingerprint generation.
CDKsdg: Creates a JPEG of the compound's 2D structure.
CDKStruct3D: Generates 3D coordinates of a molecule from its SMILES.
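The Tanimoto coefficient described for CDKsim is simple to state in code. The sketch below represents each fingerprint as a set of "on" bit positions and is not the CDK implementation. Returning 0.0 when both fingerprints are empty is a convention chosen for this example.

```python
def tanimoto(fp_a, fp_b):
    """Ratio of shared on-bits to total distinct on-bits of two fingerprints."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        return 0.0  # both fingerprints empty: convention assumed here
    return len(a & b) / len(union)


# Two toy fingerprints sharing bits 2 and 3 out of four distinct bits.
print(tanimoto({1, 2, 3}, {2, 3, 4}))  # 2 shared / 4 total = 0.5
```

Identical fingerprints score 1.0 and fingerprints with no bits in common score 0.0, which is why the coefficient is a convenient similarity measure for structure searches and clustering.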
ToxTree Service. The Threshold of Toxicological Concern (TTC) establishes a level of exposure for all chemicals below which there would be no appreciable risk to human health. ToxTree implements the Cramer Decision Tree approach to estimate TTC. We have converted this into a service that uses SMILES as input. Note that the GUI must be separated from the library for it to be a service. See http://ecb.jrc.it/QSAR/home.php?CONTENU=/QSAR/qsar_tools/qsar_tools_toxtree.php
OSCAR3 Service. OSCAR3 is a tool for shallow, chemistry-specific natural language parsing of chemical documents (e.g., journal articles). It identifies (or attempts to identify): chemical names (singular nouns, plurals, verbs, etc., as well as formulae and acronyms); chemical data (spectra, melting/boiling points, yields, etc., in experimental sections); and other entities (things like N(5)-C(3) and so on). There is a larger effort in this area, SciBorg (http://www.cl.cam.ac.uk/~aac10/escience/sciborg.html). OSCAR3 (like ToxTree) can potentially be productively, pleasingly parallelized. It also has potentially very interesting workflows. See http://wwmm.ch.cam.ac.uk/wikis/wwmm/index.php/Oscar3
[Workflow diagram. Labels: PubMed Query Service; Extract abstracts; OSCAR3; Extract SMILES; 3D Structure Generator; Create initial 3D structures; MM Applications; GAMESS, MOPAC; Quantum Chemistry DB; Refined 3D structures; QM Chemistry Info; Clustering Tools; Other Cheminfo Services.]
Use Cases and Workflows. Putting data and clustering together in a distributed environment.
Workflow, Services, and Science. Web Services work best as simple stateless services, with no implicit input, output, or interdependency of methods. Services must be composed into interesting applications. This is called workflow. A good workflow is composed of independent services and completely specifies an interesting science problem.
Finding compound-protein relationships. A protein implicated in tumor growth is supplied to the docking program (in this case, HSP90 taken from the PDB 1Y4 complex). A 2D structure is supplied as input to the similarity search (in this case, the bound ligand extracted from the PDB 1Y4 complex). The workflow employs our local NIH DTP database service to search 200,000 compounds tested in human tumor cellular assays for structures similar to the ligand. Client portlets are used to browse these structures. Similar structures are filtered for druggability and are automatically passed to the OpenEye FRED docking program for docking into the target protein. Once docking is complete, the user visualizes the high-scoring docked structures in a portlet using the JMOL applet. Correlation of docking results and biological fingerprints across the human tumor cell lines can help identify potential mechanisms of action of DTP compounds.
HTS data organization & flagging. A tumor cell line is selected. The activity results for all the compounds in the DTP database in the given range are extracted from the PostgreSQL database. OpenEye FILTER is used to calculate biological and chemical properties of the compounds that are related to their potential effectiveness as drugs. The compounds are clustered on chemical structure similarity, to group similar compounds together. The compounds, along with property and cluster information, are converted to VOTable format and displayed in VOPlot.
Use Case: Which of these hits should I follow up? An HTS experiment has produced 10,000 possible hits out of a screening set of 2 million compounds. A chemist on the project wants to know which series of compounds are the most promising for follow-up, based on: series selection (cluster analysis); structure-activity relationships (modal fingerprints/stigmata); chemical and pharmacokinetic properties (mitools, ChemAxon); compound history (gNova / PostgreSQL); patentability (BCI Markush handling software); toxicity; and synthetic feasibility. All of this also requires visualization tools!
A Workflow Scenario: HTS Data Organization and Flagging. This workflow demonstrates how screening data can be flagged and organized for human analysis. The compounds and data values for a particular screen are retrieved from the NIH DTP database and are then filtered to remove compounds with reactive groups, etc. A tumor cell line is selected. The activity results for all the compounds in the DTP database in the given range are extracted from the PostgreSQL database. OpenEye FILTER is used to calculate biological and chemical properties of the compounds that are related to their potential effectiveness as drugs. ToxTree is used to flag the potential toxicities of compounds. Divkmeans is used to add a column of cluster numbers. Finally, the results are visualized using VOPlot and the 2D viewer applet.
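The scenario above is essentially a pipeline in which each service either removes rows from a table or adds a column to it. The following sketch shows that shape only: the lambdas are stand-ins for the real service calls (OpenEye FILTER, ToxTree, Divkmeans), and the rows, column names, and helper names are invented for illustration.

```python
def keep_if(predicate):
    """Build a step that keeps only the rows satisfying `predicate`."""
    def step(rows):
        return [row for row in rows if predicate(row)]
    return step


def add_column(name, compute):
    """Build a step that adds column `name`, computed per row, without mutating input."""
    def step(rows):
        return [{**row, name: compute(row)} for row in rows]
    return step


def run_workflow(rows, steps):
    """Apply each step in order, passing the table from one to the next."""
    for step in steps:
        rows = step(rows)
    return rows


# Toy input table; the values are placeholders, not screening data.
compounds = [
    {"id": "c1", "smiles": "CCO", "reactive": False},
    {"id": "c2", "smiles": "CC(=O)Cl", "reactive": True},
    {"id": "c3", "smiles": "c1ccccc1", "reactive": False},
]

steps = [
    keep_if(lambda row: not row["reactive"]),          # stand-in for reactive-group filtering
    add_column("tox_flag", lambda row: "unchecked"),   # stand-in for a ToxTree service call
    add_column("cluster", lambda row: 0),              # stand-in for Divkmeans cluster numbers
]

flagged = run_workflow(compounds, steps)
print(flagged)
```

Keeping each step a pure function of the table mirrors the stateless-service guidance given earlier in the talk and makes it easy to swap a stand-in for a real remote call.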
Example plots of our workflow output using VOPlot and VOTables
[Workflow diagram] Cluster the compounds in the NIH DTP database by chemical structure, then choose representative compounds from the clusters and dock them into PDB protein files of interest. Services (with the tool behind each): NIH Database Service (PostgreSQL CHORD); Fingerprint Generator (BCI Makebits); Cluster Analysis (BCI Divkmeans); Table Management (VOTables); Plot Visualizer (VOPlot); Docking Selector (script); 2D-3D (OpenEye OMEGA); Docking (OpenEye FRED); 3D Visualizer (JMOL); PDB Database Service. Data passed between services: SMILES + ID; Fingerprints; SMILES + ID + Data; Cluster Membership; SMILES + ID + Cluster # + Data; MOL File; PDB Structure + Box; Docked Complex.
Use Case: Are there any good ligands for my target? A chemist is working on a project involving a particular protein target, and wants to know about: any newly published compounds which might fit the protein receptor site (gNova / PostgreSQL, PubChem search, FRED Docking); any published 3D structures of the protein or of protein-ligand complexes (PDB search); any interactions of compounds with other proteins (gNova / PostgreSQL, PubChem search); and any information published on the protein target (journal text search).
Use Case: Who else is working on these structures? A chemist is working on a chemical series for a particular project and wants to know: if anyone publishes anything using the same or related compounds (PubChem search); any new compounds added to the corporate collection which are similar or related (gNova CHORD / PostgreSQL); if any patents are submitted that might overlap the compounds he is working on (BCI Markush handling software); any pharmacological or toxicological results for those or related compounds (gNova CHORD / PostgreSQL, MiToolkit); and the results for any other projects for which those compounds were screened (gNova CHORD / PostgreSQL, PubChem search).