Indiana University School of David Wild – Research Overview April 2006. Page 1 Research Update, April 2006 David Wild Assistant Professor of Chemical Informatics.


1 Research Update, April 2006
David Wild, Assistant Professor of Chemical Informatics
Indiana University School of Informatics, Bloomington
djwild @ indiana.edu

2 Overview
Smart mining of drug discovery information
–Project goals
–Workflow examples & demonstrations
–Collaborations with scientists
–Workflow interoperability
Data mining of the DTP tumor cell line dataset
Fast clustering of PubChem using Divisive K-means & Linux clusters
Distributed Drug Discovery for neglected diseases
Visualization & end-user layer tools
Usability of chemical informatics tools
Collaboration areas with Peter Murray-Rust group

3 Smart mining of drug discovery information
Technique for making the large volumes and diverse sources of chemical & related information manageable for scientists
Observation: many information needs of scientists are straightforward to state, but complex and time-consuming to implement
This project aims to match information needs with use-cases and workflows of web services, along with imaginative human interfaces
Supported by a Microsoft eScience grant

4 3-layer model
Interaction Layer
–Purpose: Interactive software for creative access and exploitation of information by humans
–Technologies: Microsoft Smart Clients, portlets, Java applets, email and browser clients, visualization technologies
Aggregation Layer
–Purpose: Workflows and data schemas customized for particular domains, applications and users
–Technologies: BPEL, Taverna and other workflow modeling tools, aggregate web services
Web service layer
–Purpose: Comprehensive data and computation provision including storage, calculation, semantics and meta-data exposed as web services
–Technologies: Apache web services, SOAP wrappers, WSDL, UDDI, XML, Microsoft .NET

5 [Diagram: agent / smart client, use-case script, aggregate and atomic services]
Request from Human Interface
AGENT / SMART CLIENT
–Parse request
–Select appropriate use cases and/or web service(s)
–Schedule as necessary
USE-CASE SCRIPT
–Invoke New Structure Service
–Convert structures to 3D
–Dock results & protein file
–Extract any hits
–Return links for visualization
New Structure Service (aggregate service)
–Search online databases for recent structures
–Search local databases for recent structures
–Merge results
Atomic services (WSDL, SOAP)
–Online database (e.g. PubChem)
–Local database
–3D docking tool
–2D-3D converter
–3D visualizer
UDDI (?)
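The use-case script on this slide can be read as a simple pipeline over web-service calls. The following is a minimal Python sketch of that reading, not the project's implementation: every function is a hypothetical local stub standing in for a SOAP service, and the structures, scores, cutoff and link format are invented for illustration.

```python
from typing import Dict, List

# Hypothetical stand-ins for the atomic services; a real deployment would
# call SOAP endpoints described by WSDL instead of local functions.
def search_online(since: str) -> List[str]:
    return ["CCO", "c1ccccc1"]

def search_local(since: str) -> List[str]:
    return ["c1ccccc1", "CC(=O)O"]

def new_structure_service(since: str) -> List[str]:
    """Aggregate service: query both sources and merge, dropping duplicates."""
    merged: List[str] = []
    for smiles in search_online(since) + search_local(since):
        if smiles not in merged:
            merged.append(smiles)
    return merged

def convert_to_3d(smiles: str) -> Dict[str, str]:
    # Placeholder for a 2D-3D converter service.
    return {"smiles": smiles, "conformer": "3D:" + smiles}

def dock(structure: Dict[str, str], protein_file: str) -> float:
    # Made-up score (string length) so the example is deterministic.
    return float(len(structure["smiles"]))

def use_case_script(since: str, protein_file: str, cutoff: float) -> List[str]:
    """Invoke the new-structure service, convert, dock, keep hits, return links."""
    links = []
    for smiles in new_structure_service(since):
        structure = convert_to_3d(smiles)
        score = dock(structure, protein_file)
        if score >= cutoff:  # "extract any hits"
            links.append("viewer?structure=" + smiles)
    return links

links = use_case_script("2006-04-01", "target.pdb", cutoff=7.0)
```

Keeping the aggregate service separate from the script mirrors the slide's distinction between atomic services, aggregate services and use-case scripts: the script only sequences calls and never talks to a database directly.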


9 Web services implemented
Database Services
–Local DTP Tumor Cell Line Database
–PDB Ligand Database
–Distributed Drug Discovery Database
OpenEye
–FRED Docking
–FILTER Property Calculation and Filtering
–OMEGA 2D-3D Conversion
BCI
–Various BCI Clustering services
VOTables
InChIGoogle
InChiServer
CMLRSSServer
CDK Web services
Open Babel

10 [Workflow example: docking and similarity search against the DTP data]
A protein implicated in tumor growth is supplied to the docking program (in this case HSP90 taken from the PDB 1Y4 complex)
A 2D structure is supplied for input into the similarity search (in this case, the extracted bound ligand from the PDB 1Y4 complex)
The workflow employs our local NIH DTP database service to search 200,000 compounds tested in human tumor cellular assays for structures similar to the ligand. Client portlets are used to browse these structures
Similar structures are filtered for druggability, and are automatically passed to the OpenEye FRED docking program for docking into the target protein
Once docking is complete, the user visualizes the high-scoring docked structures in a portlet using the Jmol applet
Correlation of docking results and biological fingerprints across the human tumor cell lines can help identify potential mechanisms of action of DTP compounds

11 Workflow interoperability
Taverna SCUFL / BPEL conversion
–Working with Beth Plale & Dennis Gannon at IU Computer Science
Use of developing data standards for Chemical Informatics
–CML & InChI
–XML meta data
Interoperability of Taverna with other workflow systems
Use of workflows in experiment execution environments
–See http://www.extreme.indiana.edu/portals/index.shtml

12 DTP Tumor Cell Line Data Mining
Collaboration with Melanie Wu, Database & Data Mining expert at the School of Informatics
Local PostgreSQL database exposed as a web service
Building on existing published data mining research on this dataset
Current projects:
–Comparing compound clusterings based on structure (MACCS keys) and bioprint (vector of screening results)
–Investigating fingerprint and bioprint correlations with MOAs of ~100 compounds (correlation is definitely found)
–Application of workflows to associate docking results with screening results
–Collaboration with Dr. Faming Zhang at IU Department of Chemistry for mining of kinase-related information
Next projects:
–Correlation of structural and gene expression information (without naïve combination of screen & gene information)
–Application of COMPARE
–Integration into a wider oncology information system

13 Database architecture
Using PostgreSQL database with gNova CHORD for structure & fingerprint searching, exposed as a web service
Compound table contains ~200,000 compounds with SMILES, ID, properties and MACCS keys
Screen tables contain GI50 / LD50 / TGI values; gene expression table is in development
Can search on a mix of structure and numeric / categorical data
Active research into optimizing searching efficiency
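Searching on a mix of structure and numeric data can be pictured as a fingerprint similarity filter combined with a numeric predicate. The sketch below is an in-memory Python illustration only: it does not use PostgreSQL or gNova CHORD, fingerprints are modeled as sets of on-bit positions, and the rows, field names and thresholds are hypothetical.

```python
from typing import Dict, FrozenSet, List

def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)

# Invented example rows: an ID, a toy key set, and a toy screening value.
compounds: List[Dict] = [
    {"id": 1, "keys": frozenset({1, 2, 3, 4}), "gi50": 5.2},
    {"id": 2, "keys": frozenset({1, 2, 3, 9}), "gi50": 7.1},
    {"id": 3, "keys": frozenset({7, 8}), "gi50": 8.0},
]

def search(query: FrozenSet[int], min_sim: float, min_gi50: float) -> List[int]:
    """Return IDs that are similar to the query AND pass the numeric filter."""
    return [
        row["id"]
        for row in compounds
        if tanimoto(query, row["keys"]) >= min_sim and row["gi50"] >= min_gi50
    ]

hits = search(frozenset({1, 2, 3, 4}), min_sim=0.5, min_gi50=6.0)
```

In a database setting the same two conditions would sit in one query, which is where the order of evaluation starts to matter for search efficiency.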

14 Cluster Analysis and Chemical Informatics
Used for organizing datasets into chemical series, to build predictive models, or to select representative compounds
Organizational usage has not been as well studied as the other two, but see
–Wild, D.J., Blankley, C.J. Comparison of 2D Fingerprint Types and Hierarchy Level Selection Methods for Structural Grouping using Ward's Clustering, Journal of Chemical Information and Computer Sciences, 2000, 40, 155-162.
Essentially helping large datasets become manageable
Methods used:
–Jarvis-Patrick and variants: O(N^2), single partition
–Ward's method: hierarchical, regarded as best, but at least O(N^2)
–K-means: < O(N^2), requires a preset number of clusters, a little messy
–Sphere-exclusion (Butina): fast, simple, similar to JP
–Kohonen network: clusters arranged in 2D grid, ideal for visualization

15 Limitations of Ward's for large datasets (>1m)
Best algorithms have O(N^2) time requirement (RNN)
Requires random access to fingerprints
–hence substantial memory requirements (O(N))
Problem of selection of best partition
–can select desired number of clusters
Easily hit 4GB memory addressing limit on 32-bit machines
–Approximately 2m compounds
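The 4GB limit at roughly 2m compounds implies a per-compound memory budget, which the short calculation below works out. It assumes, purely for illustration, that the whole 32-bit address space is divided evenly among compounds; the slide does not say how memory is actually split between fingerprints and other data structures.

```python
ADDRESS_SPACE_BYTES = 2 ** 32   # 4 GiB addressable on a 32-bit machine
COMPOUNDS = 2_000_000           # "approximately 2m compounds" from the slide

# Implied budget if all addressable memory were shared evenly (an assumption).
bytes_per_compound = ADDRESS_SPACE_BYTES / COMPOUNDS
bits_per_compound = bytes_per_compound * 8
```

Under that assumption the budget comes to a little over 2 KB per compound, which is the scale at which fingerprints plus per-item bookkeeping exhaust the address space.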

16 Divisive K-means Clustering
New hierarchical divisive method
–Hierarchy built from top down, instead of bottom up
–Divide complete dataset into two clusters
–Continue dividing until all items are singletons
–Each binary division done using K-means method
–Originally proposed for document clustering
Bisecting K-means
–Steinbach, Karypis and Kumar (Univ. Minnesota) http://www-users.cs.umn.edu/~karypis/publications/Papers/PDF/doccluster.pdf
–Found to be more effective than agglomerative methods
–Forms more uniformly-sized clusters at given level
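The divisive scheme described on this slide can be sketched in a few lines of Python. This is a toy illustration, not the BCI Divkmeans program: it works on small coordinate tuples rather than fingerprints, always splits the largest cluster, uses an arbitrary deterministic seed choice, and stops at a requested number of clusters instead of dividing down to singletons.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]

def sq_dist(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points: Sequence[Point]) -> Point:
    return tuple(sum(col) / len(points) for col in zip(*points))

def two_means(points: Sequence[Point], iterations: int = 20):
    """One binary division using K-means with k=2.

    Seeds are the first point and the point farthest from it
    (an arbitrary choice made here for determinism).
    """
    centers = [points[0], max(points, key=lambda p: sq_dist(p, points[0]))]
    groups: Tuple[List[Point], List[Point]] = (list(points), [])
    for _ in range(iterations):
        groups = ([], [])
        for p in points:
            nearer = 0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
            groups[nearer].append(p)
        if not groups[0] or not groups[1]:
            break
        new_centers = [centroid(groups[0]), centroid(groups[1])]
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return groups

def bisecting_kmeans(points: Sequence[Point], n_clusters: int) -> List[List[Point]]:
    """Top-down clustering: repeatedly split the largest cluster in two."""
    clusters = [list(points)]
    while len(clusters) < n_clusters:
        target = max(clusters, key=len)  # selection by size
        if len(target) < 2:
            break
        clusters.remove(target)
        left, right = two_means(target)
        if not left or not right:  # e.g. all points identical
            left, right = target[:1], target[1:]
        clusters.extend([left, right])
    return clusters

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters = bisecting_kmeans(data, n_clusters=2)
```

Selecting the next cluster by size is only one of several possible criteria; swapping the `key` passed to `max` is where variance or diameter would plug in.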

17 BCI Divkmeans
Several options for detailed operation
Selection of next cluster for division
–size, variance, diameter
–affects selection of partitions from hierarchy, not shape of hierarchy
Options within each K-means division step
–distance measure
–choice of seeds
–batch-mode or continuous update of centroids
–termination criterion
Have developed MPI parallel version for Linux clusters / grids in conjunction with BCI (now Digital Chemistry)
For more information, see the Barnard and Engels talks at: http://cisrg.shef.ac.uk/shef2004/conference.htm
Now available as a web service at IU (along with other BCI programs)

18 Comparative execution times
[Chart: NCI subsets, 2.2 GHz Intel Celeron processor; labeled times 7h 27m, 3h 06m, 2h 25m, 44m]

19 MPI Parallel Divkmeans clustering of PubChem
AVIDD Linux cluster, 5,273,852 structures (PubChem Compound, Nov 2005)

20 Distributed Drug Discovery
Project run by Dr. Bill Scott at IUPUI
Tackling neglected diseases using distributed chemistry (while educating undergraduates about combinatorial chemistry)
Each student makes 4 compounds on cheap equipment. Each class will typically make around 60 compounds
Many universities participating around the world
Reaction transformations, virtual and made compounds stored in a PostgreSQL database exposed as a web service
This information can then be drawn into our workflows. For example, searches for similar compounds can be done on PubChem, the Tumor Cell Line database, etc.

21 Distributed Drug Discovery
William L. Scott: A Distributed Drug Discovery Concept to Search for Developing World Disease Drug Leads

22 Visualization and end-user tools
PubChemSR
2D structure visualizer using CDK
VoPlot
VisualiSAR - modal fingerprints
Similarity Matrix Visualization
General approaches to end user tools
–Portlets and .NET
–Usability & Contextual Design

23 PubChemSR (Junguk Hur) http://darwin.informatics.indiana.edu/juhur/Tools/PubChemSR

24 Simple 2D viewer applet (using CDK) - David Jiao

25 VoPlot

26 VisualiSAR - modal fingerprints
With a nod to Edward Tufte. See http://www.daylight.com/meetings/mug99/Wild/Mug99.html

27 Visual Similarity Matrices
Student: Christopher Mueller
[Figure panels: Original (curated), Breadth-first Search, Degree, Sloan's Algorithm]
Data: NCI Compound Database - Compounds with positive AIDS screens
Visual Similarity Matrices display large, graph-based data sets in a compact form. The axes are labeled with the data items (vertices) and a dot indicates a relation (edge) between two data items. Different vertex orderings can reveal information about the data.
Additional details are displayed as property plots. Here, the different computed properties are displayed along with the main matrix.
In order to generate similarity matrices and orderings in a reasonable time (minutes instead of days), we are developing parallel and high-performance libraries that take advantage of modern processor and system architectures. These include optimized SIMD for AltiVec (PowerPC) and SSE (Intel) and parallel algorithms for multiprocessor environments.
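The effect of vertex ordering on such a matrix can be illustrated with a small Python sketch. It is a toy example under stated assumptions, not the high-performance library mentioned on the slide: the graph is a hypothetical five-vertex adjacency map (edges standing for pairs judged similar), and only the breadth-first and degree orderings are shown.

```python
from collections import deque
from typing import Dict, List, Set

def bfs_order(adjacency: Dict[str, Set[str]]) -> List[str]:
    """Order vertices by breadth-first search, restarting for each component."""
    order: List[str] = []
    seen: Set[str] = set()
    for start in sorted(adjacency):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        while queue:
            vertex = queue.popleft()
            order.append(vertex)
            for neighbour in sorted(adjacency[vertex]):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
    return order

def degree_order(adjacency: Dict[str, Set[str]]) -> List[str]:
    """Order vertices by decreasing degree, ties broken by name."""
    return sorted(adjacency, key=lambda v: (-len(adjacency[v]), v))

def dot_matrix(adjacency: Dict[str, Set[str]], order: List[str]) -> List[str]:
    """One text row per vertex: '*' marks an edge, '.' marks none."""
    return [
        "".join("*" if col in adjacency[row] else "." for col in order)
        for row in order
    ]

# Hypothetical graph with two connected components.
graph = {"a": {"c"}, "b": {"d"}, "c": {"a", "e"}, "d": {"b"}, "e": {"c"}}
order = bfs_order(graph)
rows = dot_matrix(graph, order)
```

With the breadth-first ordering the two connected components end up as separate blocks along the diagonal, which is the kind of structure a different ordering of the same data can hide.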

28 General approaches to end-user tools
Main interface-level vehicle should be portlets, allowing reuse and interchangeability
Other interfaces, such as .NET clients, email and RSS interfaces will also be investigated
No matter how clever the smarts underneath, the overriding factor in usefulness will be the quality of scientists' interaction with the system
Contextual Design, Interaction Design (Cooper) and Usability Studies have proven effective in designing the right interfaces for the right people in chemical informatics [collaboration with HCI?]
Possibility of multiple interfaces for different groups of people (Cooper's primary personas)
Don't assume the browser interface – email / NLP ?
Start with the basics
–2D chemical structure drawing (input)
–Visualization of large numbers of chemical structures in 2D
–3D chemical structure visualization
Current project is looking at usability of online chemical databases (including PubChem)

29 Usability of 2D structure drawing tools
Key difference between sequential and random drawers
Huge difference in intuitiveness
Key factor: how badly you can mess things up
Marvin Sketch JME > ChemDraw >> ISIS Draw

30 Cambridge-Indiana Collaboration
Weekly Access Grid meetings
Bringing together areas of expertise in the UK and USA
Applying OSCAR text mining to NIH data
Looking toward joint presentations & publications

31 Cambridge-Indiana Collaboration

32 Contributors
My students
–Xiao Dong
–Huijung Wang
–Jason Lee
–Junguk Hur
–David Jiao
–Usha Cheemakurthi
–Waiping Kam
Geoffrey's group at CGL
–Marlon Pierce
–Jake Kim
–Sima Patel
–Smitha Ajay
Others
–Gary Wiggins
–Melanie Wu
–Dennis Gannon
–Beth Plale
–Rajarshi Guha
–Peter Murray-Rust
–Peter Corbett
–Dan Zaharevitz

