Indiana University School of David Wild – Joint IU, Michigan, Lilly Meeting, October 2006. Page 1 Smart Mining Interfaces, Workflows, and Data Mining the.


1 Smart Mining Interfaces, Workflows, and Data Mining the NCI DTP Dataset
David Wild
Joint IU / Michigan / Lilly Meeting, Indianapolis, August 2006

2 Acknowledgements
Xiao Dong - HTCL database & mining, web services, workflows, smart mining interfaces
Rajarshi Guha - smart mining interfaces, workflows, web services
Geoffrey/Marlon's lab: Smitha Ajay, Sima Patel, Jake Kim - web services, workflows, portlet interfaces
Others: Junguk Hur, Chris Mueller, Huijun Wang …
Funding from Microsoft eScience and CICC

3 Outline
Smart mining of drug discovery information - our approach to connecting scientists with the information they need
Application of smart mining to post-HTS chemistry analysis
Current interface-level projects
Examining the DTP HTCL dataset as a standard for multi-screen data mining and as a surrogate for HTS

4 Classic approach to designing tools for chemists
Select the computational chemistry methods that seem to be the most useful for drug discovery (or whatever), such as docking and similarity searching
Have computational chemists / modelers figure out how to dumb down the methods for ordinary chemists
Wrap a pretty web interface around the command line tools that run on a Unix server
Tell the chemists to use the tools using their browsers
Result…
– A few chemists use the tools a few times
– Clever tools don't necessarily directly meet needs, and simple needs can be complex to answer

5 A better idea…
Use interviews and follow-up techniques (e.g. Contextual Design and Interaction Design) to understand the workflows of chemists
Design tools using paper prototyping involving personas (or actual chemists)
Develop tools
Go through several iterations of usability testing with real scientists and real-life problems

6 Contextual Inquiry
Session in which an observer watches scientists do their work in their natural environment
Observer can ask questions, clarify understanding, etc.
Helps to record the session on a tape recorder
Want to see them do real work, but it helps if it is related to the software
From the tape, build sequence, flow, artifact, culture and physical models
Helps in understanding the scientists' work, and in building personas and identifying breakdowns in processes

7 Example sequence model
Intent: Try to improve the activity of the current XXX-1 Kinase structure
Intent: See what other similar structures are in ISIS
– Log onto machine and go into ISIS
– Unsure which database to choose
– Chooses Master1 database
– Draws in structure
– Unable to specify aromaticity correctly
– Does similarity search
– Finds 8 compounds which look similar
Intent: Try docking these molecules using web-based docking program
– Opens web browser
– Goes to docking program by choosing bookmark
– Browser says page cannot be found

8 Interaction Design (Cooper)
Personas stereotype the people who will be using the software
Select primary personas, and create a customized interface for each one
Define the goals of the primary personas, then develop scenarios that reveal the way these goals are reached
Example persona: Wallace is an engineer. He is aged 45, and lives by himself with his dog Gromit. He enjoys using new gadgets, and likes using his inventiveness to make new things, with differing degrees of success. He is confident using Microsoft Word and Excel, although he sometimes becomes frustrated with these packages. He wants the computer to help him design a rocket to fly him and Gromit to the moon, as he likes cheese and believes the moon to be made of very fine cheese.

9 Usability testing
Monitor real people using the software you have written
Find breakdowns, recurring problems
Measure a usability score for the software
Make changes and see if the score improves

10 Usability of 2D structure drawing tools
Key difference between sequential and random drawers
Huge difference in intuitiveness
Key factor: how badly you can mess things up
Marvin Sketch JME > ChemDraw >> ISIS Draw

11 This approach is better but…
Still centered on the tool instead of the information
Doesn't solve the problem of each tool only solving one problem
Not flexible enough for an environment where scientists have constantly changing needs for information which is complex to retrieve (or compute):
– Do we have this compound in-house? Where can I get it?
– I want to know if anyone else does something with structures like these
– I want to improve the activity of these compounds, what directions should I take?
– I wonder if there are any protein targets this compound might bind to other than the one in the project I'm working on?
– I'm worried there might be a degradation problem with these compounds

12 Simple questions can be complex to answer…
SCIENTIST: These compounds look promising from their HTS results. Should I commit some chemistry resources to following them up?
Oracle Database (HTS): Compounds were tested against related assays and showed activity, including selectivity within target families
Oracle Database (Genomics): None of these compounds have been tested in a microarray assay
Computation: The information in the structures and known activity data is good enough to create a QSAR model with a confidence of 75%
External Database (Patent): Some structures with a similarity > 0.75 to these appear to be covered by a patent held by a competitor
Computation: All the compounds pass the Lipinski Rule of Five and toxicity filters
Excel Spreadsheet (Toxicity): One of the compounds was previously tested for toxicology and was found to have no liver toxicity
Word Document (Chemistry): Several of the compounds had been followed up in a previous project, and solubility problems prevented further development
Journal Article: A recent journal article reported the effectiveness of some compounds in a related series against a target in the same family
Word Document (Marketing): A report by a team in Marketing casts doubt on whether the market for this target is big enough to make development cost-effective

13 An even better idea?
Develop web services around as much chemoinformatics computation and as many data sources as possible
Develop web service workflows for as many real-life workflows as we can, based on contextual design interviews (and other sources)
Develop generic smart interface components which are able to match what people express they want to do with what workflows and services are available, and even create workflows to meet needs
This is a kind of scattergun approach, with the onus on the workflows and interface to make it useful
Related to the lab of the future - based on information, not tools
Very closely linked to LEAD

14 3-layer model
Interaction Layer
– Purpose: Interactive software for creative access and exploitation of information by humans
– Technologies: Microsoft Smart Clients, portlets, Java applets, email and browser clients, visualization technologies
Aggregation Layer
– Purpose: Workflows and data schemas customized for particular domains, applications and users
– Technologies: BPEL, Taverna and other workflow modeling tools, aggregate web services
Web service layer
– Purpose: Comprehensive data and computation provision including storage, calculation, semantics and meta-data exposed as web services
– Technologies: Apache web services, SOAP wrappers, WSDL, UDDI, XML, Microsoft .NET

15 Kinds of interface-level interactions
Passive user / active computation
– Mainly for information and computation requests, and single-stream or summary results
– Natural language interface through email
– RSS, web and .NET tools
– Graphical workflow generation
– May lead to active user interaction as described below
Active user / passive computation or retrieval
– Facilitates direct interaction with workflows, services and information
– Permits analysis and interpretation of multi-stream, interactive results
– Multi-stream portlets
– Custom multi-stream desktop tools (including .NET)
– Visual SAR
– Flagging and annotation
Active user / active computation
– A conversation between the scientist and the computer

16 Example email natural language interface

17 Email response after triggering events occur

18 Natural Language Interface Stage 1: Matching requests to existing workflows
All existing workflows are given descriptions which reflect the kinds of words people would likely use if requesting them
When a request is made, workflows are ranked by text similarity between the request and the descriptions
If the similarity is less than a cutoff, it is determined that no match can be found (and thus another workflow should be written!)
Requests may be parsed by a standard syntactic analyzer
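The ranking-with-cutoff idea on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the slide does not say which text similarity measure is used, so bag-of-words cosine similarity is swapped in here, and the workflow names, descriptions, and the cutoff of 0.3 are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical workflow names and descriptions, invented for illustration.
WORKFLOWS = {
    "similarity_search": "find compounds with similar structures to a query structure",
    "docking": "dock compounds into a protein target and report docking scores",
    "toxicity_filter": "flag compounds with predicted toxicity problems",
}


def tokenize(text):
    """Lowercase the text and count its alphanumeric word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def match_workflow(request, workflows=WORKFLOWS, cutoff=0.3):
    """Rank workflows by similarity to the request.

    Returns (best_name, ranked); best_name is None when the top score
    falls below the (assumed) cutoff, i.e. no existing workflow matches.
    """
    req = tokenize(request)
    ranked = sorted(
        ((cosine(req, tokenize(desc)), name) for name, desc in workflows.items()),
        reverse=True,
    )
    best_score, best_name = ranked[0]
    return (best_name if best_score >= cutoff else None), ranked


if __name__ == "__main__":
    name, ranked = match_workflow("dock these compounds into my protein target")
    print(name, ranked)
```

A request that returns `None` is the case the slide describes as a signal that another workflow should be written.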

19 Natural Language Interface Stage 2: On-the-fly workflow creation
Requests can be formatted in a "do this THEN do this THEN do this" fashion
These requests are parsed, and used to attempt to create a workflow on the fly
Possible existing parsing software includes Python NLP modules and Infocom Z-code (or similar)…
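A minimal sketch of the "THEN" request format, assuming a simple keyword lookup in place of real natural language parsing. The service registry and its keywords are hypothetical, and the slide does not describe how steps are actually mapped to services.

```python
import re

# Hypothetical registry mapping a keyword to a service name (illustration only).
SERVICES = {
    "cluster": "clustering_service",
    "dock": "docking_service",
    "filter": "filter_service",
    "similar": "similarity_search_service",
}


def parse_request(request):
    """Split a request into ordered steps on the literal keyword THEN."""
    return [step.strip() for step in re.split(r"\bTHEN\b", request) if step.strip()]


def build_workflow(request, services=SERVICES):
    """Map each step to the first service whose keyword appears in it.

    Raises ValueError when a step matches no known service, which is
    where a fuller system would fall back to asking the user.
    """
    workflow = []
    for step in parse_request(request):
        lowered = step.lower()
        service = next(
            (name for keyword, name in services.items() if keyword in lowered),
            None,
        )
        if service is None:
            raise ValueError(f"no service matches step: {step!r}")
        workflow.append((step, service))
    return workflow


if __name__ == "__main__":
    request = (
        "find similar structures THEN filter for drug-likeness "
        "THEN dock into the target"
    )
    for step, service in build_workflow(request):
        print(f"{service}: {step}")
```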

20 Desktop tool for multi-stream analysis

21 Multi-stream portlet interface

22 PubChemSR .NET desktop search tool (Junguk Hur)
http://darwin.informatics.indiana.edu/juhur/Tools/PubChemSR

23 VisualiSAR - modal fingerprints
Structural differences related to activity are projected onto actual 2D structures.
See http://www.daylight.com/meetings/mug99/Wild/Mug99.html

24 Visual Similarity Matrices
Student: Christopher Mueller
Data: NCI Compound Database - compounds with positive AIDS screens
Panels: Original (curated), Breadth-first Search, Degree, Sloan's Algorithm
Visual Similarity Matrices display large, graph-based data sets in a compact form. The axes are labeled with the data items (vertices) and a dot indicates a relation (edge) between two data items. Different vertex orderings can reveal information about the data.
Additional details are displayed as property plots. Here, the different computed properties are displayed along with the main matrix.
In order to generate similarity matrices and orderings in a reasonable time (minutes instead of days), we are developing parallel and high-performance libraries that take advantage of modern processor and system architectures. These include optimized SIMD for AltiVec (PowerPC) and SSE (Intel) and parallel algorithms for multiprocessor environments.
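To make the effect of a vertex ordering concrete, here is a small sketch of the breadth-first ordering named on the slide, applied to a toy graph. The graph, the function names, and the text rendering are all invented for illustration; this is not the group's library, and it ignores the performance work the slide describes.

```python
from collections import deque


def make_adjacency(n, edges):
    """Build a symmetric 0/1 adjacency matrix from an edge list."""
    adj = [[0] * n for _ in range(n)]
    for a, b in edges:
        adj[a][b] = adj[b][a] = 1
    return adj


def bfs_order(adjacency):
    """Return a vertex order from breadth-first search.

    A new search is started from every vertex not yet reached, so
    disconnected components end up in contiguous runs.
    """
    n = len(adjacency)
    seen = [False] * n
    order = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in range(n):
                if adjacency[v][w] and not seen[w]:
                    seen[w] = True
                    queue.append(w)
    return order


def reorder(adjacency, order):
    """Permute rows and columns of the matrix by the given vertex order."""
    return [[adjacency[i][j] for j in order] for i in order]


def render(adjacency):
    """Draw the matrix as text: '#' marks an edge, '.' marks no edge."""
    return "\n".join("".join("#" if x else "." for x in row) for row in adjacency)


if __name__ == "__main__":
    # Toy graph with two components whose vertices are interleaved.
    adj = make_adjacency(6, [(0, 3), (3, 5), (1, 4), (4, 2)])
    print(render(adj))
    print()
    print(render(reorder(adj, bfs_order(adj))))
```

In the reordered matrix the two components appear as separate blocks along the diagonal, which is the kind of structure a different ordering can reveal.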

25 VoPlot

26 Chemistry Decision Support after PubChem data submission
MLSCN submits HTS data to PubChem
Data is stored in PubChem
PubChem interfaces to workflows via SOAP
Workflows perform different kinds of analysis on the MLSCN data, including SAR, clustering, literature searching, protein searching, toxicity testing, etc…
End-user applications and interfaces utilize the information streams from the workflows for human interaction with the data and analysis

27 Simple HTS follow-up workflow
Presented at 222nd ACS, Chicago, 2001
See http://www.lib.uchicago.edu/cinf/222nm/presentations/222nm050.pdf

28 Example HTS workflow: organization & flagging
A biological screen is selected. The activity results for all the compounds are extracted from the database (currently using the DTP Tumor Cell Line database)
The compounds are clustered on chemical structure similarity, to group similar compounds together
The compounds, along with property and cluster information, are converted to VOTable format and displayed in VOPlot
OpenEye FILTER is used to calculate biological and chemical properties of the compounds that are related to their potential effectiveness as drugs

29 Example of workflow output - LogP vs GI50
Plotting XLogP against GI50 can help identify highly active compounds with good logP profiles (1 - 4 range)
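The selection this plot supports can also be written as a simple filter. The sketch below is illustrative only: the compound records, the field names, the micromolar units, and the 1.0 µM activity cutoff are all assumptions, and only the 1 to 4 logP range comes from the slide.

```python
def select_candidates(compounds, logp_range=(1.0, 4.0), gi50_cutoff_um=1.0):
    """Keep compounds inside the logP range whose GI50 is at or below the cutoff.

    A lower GI50 means a more potent compound, so "highly active" is
    modeled here as GI50 <= gi50_cutoff_um (an assumed threshold).
    """
    low, high = logp_range
    return [
        c
        for c in compounds
        if low <= c["xlogp"] <= high and c["gi50_um"] <= gi50_cutoff_um
    ]


if __name__ == "__main__":
    # Made-up records for illustration; not taken from the DTP dataset.
    sample = [
        {"id": "cmpd-A", "xlogp": 2.3, "gi50_um": 0.4},
        {"id": "cmpd-B", "xlogp": 5.8, "gi50_um": 0.1},
        {"id": "cmpd-C", "xlogp": 3.1, "gi50_um": 25.0},
    ]
    for compound in select_candidates(sample):
        print(compound["id"])
```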

30 Example of workflow output - Cluster # vs GI50
Plotting Cluster against GI50 can help identify groups of highly active, structurally similar compounds, and also clusters which might yield good QSAR information

31 Example HTS workflow: finding cell-protein relationships
A protein implicated in tumor growth with a known ligand is selected (in this case HSP90, taken from the PDB 1Y4 complex)
The screening data from a cellular HTS assay is similarity searched for compounds with 2D structures similar to the ligand
Similar structures to the ligand can be browsed using client portlets
Similar structures are filtered for druggability, converted to 3D, and automatically passed to the OpenEye FRED docking program for docking into the target protein
Once docking is complete, the user visualizes the high-scoring docked structures in a portlet using the Jmol applet
Docking results and activity patterns are fed into R services for building activity models and correlations
– Least squares regression
– Random forests
– Neural nets

32 Example workflow output - docked complexes
Example output of most similar compounds to PDB 1Y4 complex ligands docked into the target protein using OpenEye FRED
– NSC_ID 685478: docking score -29.74
– NSC_ID 685477: docking score -35.51
– NSC_ID 719175: docking score -30.78
– NSC_ID 725806: docking score -32.15

33 DTP Human Tumor Cell Line Data Mining
Collaboration with Melanie Wu at the School of Informatics
257,547 compounds, 44,653 with 60-cell line screening data (GI50)
Local PostgreSQL database with gNova CHORD cartridge for substructure and similarity searching, exposed as a web service
Aim to build on existing published data mining research on this dataset, doing forms of data mining that are made easier by using web services
Learned so far
– Most previous research has used small compound subsets (~4000 compounds), and generally falls into organization (SOM, clustering, etc) or correlation of structure, activity and/or expression
– There is little that has approached the dataset as a whole (as it is in 2006)
– Correlations of structure, activity and expression are limited in scope (cf. e.g. association rule mining)
– The 44,653 compounds with screening data are extremely drug-like
Evaluating the set as a standard to use as a surrogate for multi-screen HTS (at least secondary screening data)
Aim to apply the latest data mining methods to the whole set
First step in an Active User / Active Computation Oncology Information Portal?

34 Sample property profiles (hydrogen bond acceptors)

35 Mean inter-molecular similarity
Mean similarity by dataset
– TCL: 0.3047
– MRTD: 0.3199
– PubChem subsets: 0.3605
Most-similar HTCL compounds to MRTD: 348/1220 > 0.8
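For readers unfamiliar with these statistics, the sketch below shows one way such numbers can be computed. It assumes Tanimoto similarity over fingerprints represented as sets of "on" bit positions; the slide does not state which similarity measure or fingerprints were used, and the second function reflects only one possible reading of the "348/1220 > 0.8" figure (query compounds whose nearest neighbor in a reference set exceeds a threshold).

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def mean_pairwise_similarity(fingerprints):
    """Mean Tanimoto similarity over all distinct pairs in one dataset."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        raise ValueError("need at least two fingerprints")
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


def count_with_close_neighbor(queries, reference, threshold=0.8):
    """Count queries whose most similar reference fingerprint exceeds the threshold."""
    return sum(
        1
        for q in queries
        if max((tanimoto(q, r) for r in reference), default=0.0) > threshold
    )


if __name__ == "__main__":
    # Toy fingerprints invented for illustration.
    dataset = [{1, 2, 3}, {2, 3, 4}, {5}]
    print(mean_pairwise_similarity(dataset))
    print(count_with_close_neighbor([{1, 2, 3}, {9}], dataset))
```

The all-pairs mean is quadratic in the number of compounds, so a dataset of tens of thousands of structures would in practice call for sampling or an optimized implementation.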

36 Current activities
More workflows, more services
Identification of key customers, for contextual inquiry sessions
Advancement of portlet interfaces
Development of first natural language email interfaces
Contextualizing information (predictions, SAR, flags, annotations, text mining results, etc) for inclusion in these interfaces
Further characterization of DTP HTCL dataset, particularly how similar the screens are to HTS screens
Other things:
– Lab of the future
– Distributed Drug Discovery Database
– OSCAR-3 derivatives
– Clustering of PubChem

37 MPI Parallel Divkmeans clustering of PubChem
AVIDD Linux cluster, 5,273,852 structures (PubChem Compound, Nov 2005)

