Indiana University School of Informatics, David Wild – Research Overview, April 2006

Research Update, April 2006
David Wild
Assistant Professor of Chemical Informatics
Indiana University School of Informatics, Bloomington
indiana.edu

Overview
Smart mining of drug discovery information
  – Project goals
  – Workflow examples & demonstrations
  – Collaborations with scientists
  – Workflow interoperability
Data mining of the DTP tumor cell line dataset
Fast clustering of PubChem using Divisive K-means & Linux clusters
Distributed Drug Discovery for neglected diseases
Visualization & end-user layer tools
Usability of chemical informatics tools
Collaboration areas with the Peter Murray-Rust group

Smart mining of drug discovery information
A technique for making the large volumes and diverse sources of chemical and related information manageable for scientists
Observation: many information needs of scientists are straightforward to state, but complex and time-consuming to implement
This project aims to match information needs with use cases and workflows of web services, along with imaginative human interfaces
Supported by a Microsoft eScience grant

3-layer model
Layer | Purpose | Technologies
Interaction layer | Interactive software for creative access and exploitation of information by humans | Microsoft Smart Clients, portlets, Java applets, and browser clients, visualization technologies
Aggregation layer | Workflows and data schemas customized for particular domains, applications and users | BPEL, Taverna and other workflow modeling tools, aggregate web services
Web service layer | Comprehensive data and computation provision including storage, calculation, semantics and meta-data exposed as web services | Apache web services, SOAP wrappers, WSDL, UDDI, XML, Microsoft .NET

Agent / smart client (handles the request from the human interface):
  – Parse request
  – Select appropriate use cases and/or web service(s)
  – Schedule as necessary
Use-case script:
  – Invoke New Structure Service
  – Convert structures to 3D
  – Dock results & protein file
  – Extract any hits
  – Return links for visualization
Aggregate services: New Structure Service
  – Search online databases for recent structures
  – Search local databases for recent structures
  – Merge results
Atomic services: online database (e.g. PubChem), local database, 3D docking tool, 2D-3D converter, 3D visualizer
Protocols shown: WSDL, SOAP, UDDI (?)
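
The use-case script on this slide can be sketched as a chain of service calls. The following is a minimal illustration only: every function below is a hypothetical local stub standing in for a remote web service, the docking score is a placeholder, and the `viewer?id=` link format is invented for the example.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Structure:
    ident: str
    smiles: str
    has_3d: bool = False
    score: Optional[float] = None


# --- atomic services (hypothetical stubs standing in for remote calls) ---
def search_online_db() -> List[Structure]:
    return [Structure("online-1", "CCO"), Structure("online-2", "c1ccccc1O")]


def search_local_db() -> List[Structure]:
    return [Structure("local-1", "CCO"), Structure("local-2", "CC(=O)Nc1ccccc1")]


def convert_to_3d(s: Structure) -> Structure:
    return replace(s, has_3d=True)


def dock(s: Structure, protein_file: str) -> Structure:
    # Placeholder score; a real docking service would compute this.
    return replace(s, score=-float(len(s.smiles)))


# --- aggregate service: search both sources, then merge ---
def new_structure_service() -> List[Structure]:
    merged = {}
    for s in search_online_db() + search_local_db():
        merged.setdefault(s.smiles, s)  # keep one entry per SMILES string
    return list(merged.values())


# --- use-case script: the five steps listed on the slide ---
def use_case_script(protein_file: str, score_cutoff: float) -> List[str]:
    structures = new_structure_service()
    structures_3d = [convert_to_3d(s) for s in structures]
    docked = [dock(s, protein_file) for s in structures_3d]
    hits = [s for s in docked if s.score is not None and s.score <= score_cutoff]
    return [f"viewer?id={s.ident}" for s in hits]


links = use_case_script("target.pdb", score_cutoff=-5.0)
```

Keeping each step behind its own function mirrors the split between atomic and aggregate services: the script only composes calls, so any stub could be swapped for a real service client without changing the workflow.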

Web services implemented
Database services
  – Local DTP Tumor Cell Line Database
  – PDB Ligand Database
  – Distributed Drug Discovery Database
OpenEye
  – FRED docking
  – FILTER property calculation and filtering
  – OMEGA 2D-3D conversion
BCI
  – Various BCI clustering services
VOTables
InChIGoogle
InChiServer
CMLRSSServer
CDK web services
Open Babel

A protein implicated in tumor growth is supplied to the docking program (in this case HSP90 taken from the PDB 1Y4 complex).
A 2D structure is supplied for input into the similarity search (in this case, the extracted bound ligand from the PDB 1Y4 complex).
The workflow employs our local NIH DTP database service to search 200,000 compounds tested in human tumor cellular assays for structures similar to the ligand. Client portlets are used to browse these structures.
Similar structures are filtered for druggability, and are automatically passed to the OpenEye FRED docking program for docking into the target protein.
Once docking is complete, the user visualizes the high-scoring docked structures in a portlet using the Jmol applet.
Correlation of docking results and biological fingerprints across the human tumor cell lines can help identify potential mechanisms of action of DTP compounds.

Workflow interoperability
Taverna SCUFL-BPEL conversion
  – Working with Beth Plale & Dennis Gannon at IU Computer Science
Use of developing data standards for chemical informatics
  – CML & InChI
  – XML metadata
Interoperability of Taverna with other workflow systems
Use of workflows in experiment execution environments
  – See

DTP Tumor Cell Line Data Mining
Collaboration with Melanie Wu, database & data mining expert at the School of Informatics
Local PostgreSQL database exposed as a web service
Building on existing published data mining research on this dataset
Current projects:
  – Comparing compound clusterings based on structure (MACCS keys) and bioprint (vector of screening results)
  – Investigating fingerprint and bioprint correlations with MOAs of ~100 compounds (correlation is definitely found)
  – Application of workflows to associate docking results with screening results
  – Collaboration with Dr. Faming Zhang at IU Department of Chemistry for mining of kinase-related information
Next projects:
  – Correlation of structural and gene expression information (without naïve combination of screen & gene information)
  – Application of COMPARE
  – Integration into a wider oncology information system
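
To make the structure-versus-bioprint comparison concrete, here is a small sketch of the two similarity measures it would typically rest on. The choice of measures is an assumption, since the slide does not name them: Tanimoto similarity for key-based fingerprints (represented here as sets of "on" bit positions) and Pearson correlation for bioprints (vectors of screening results). All data values are invented for illustration.

```python
from math import sqrt
from typing import Sequence, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Shared 'on' bits divided by the total distinct 'on' bits."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length screening vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    vx = sum((xi - mx) ** 2 for xi in x)
    vy = sum((yi - my) ** 2 for yi in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant vector carries no correlation information
    return cov / sqrt(vx * vy)


# Toy data: key positions set for two compounds, and their screening vectors.
keys_a, keys_b = {1, 5, 9, 20}, {1, 5, 9, 33}
bioprint_a = [5.0, 6.0, 7.0]
bioprint_b = [5.5, 6.5, 7.5]

structural_similarity = tanimoto(keys_a, keys_b)
bioprint_similarity = pearson(bioprint_a, bioprint_b)
```

Computing both values for every compound pair gives two similarity matrices over the same compounds, which is one way the structure-based and bioprint-based clusterings could be compared.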

Database architecture
Using a PostgreSQL database with gNova CHORD for structure & fingerprint searching, exposed as a web service
Compound table contains ~200,000 compounds with SMILES, ID, properties and MACCS keys
Screen tables contain GI50/LD50/TGI values; a gene expression table is in development
Can search on a mix of structure and numeric / categorical data
Active research into optimizing searching efficiency

Cluster Analysis and Chemical Informatics
Used for organizing datasets into chemical series, to build predictive models, or to select representative compounds
Organizational usage has not been as well studied as the other two, but see
  – Wild, D.J., Blankley, C.J. Comparison of 2D Fingerprint Types and Hierarchy Level Selection Methods for Structural Grouping using Ward's Clustering, Journal of Chemical Information and Computer Sciences, 2000, 40,
Essentially helping large datasets become manageable
Methods used:
  – Jarvis-Patrick and variants: O(N^2), single partition
  – Ward's method: hierarchical, regarded as best, but at least O(N^2)
  – K-means: < O(N^2), requires a set number of clusters, a little messy
  – Sphere-exclusion (Butina): fast, simple, similar to JP
  – Kohonen network: clusters arranged in 2D grid, ideal for visualization
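
As an illustration of the simplest method in the list, the sketch below implements a sphere-exclusion clustering in the style usually attributed to Butina. It is not the code used in this work. Fingerprints are modelled as sets of "on" bit positions, and the Tanimoto threshold and the toy data are assumptions.

```python
from typing import List, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def sphere_exclusion(fps: List[Set[int]], threshold: float) -> List[List[int]]:
    """Return clusters as lists of indices; the first index is the centroid."""
    n = len(fps)
    # Neighbour list of each compound (always including itself).
    neighbours = [
        {i} | {j for j in range(n) if tanimoto(fps[i], fps[j]) >= threshold}
        for i in range(n)
    ]
    # Visit compounds with the most neighbours first.
    order = sorted(range(n), key=lambda i: (-len(neighbours[i]), i))
    assigned: Set[int] = set()
    clusters: List[List[int]] = []
    for i in order:
        if i in assigned:
            continue
        members = sorted(neighbours[i] - assigned)
        clusters.append([i] + [m for m in members if m != i])
        assigned.update(members)  # exclude everything inside this sphere
    return clusters


fingerprints = [
    {1, 2, 3, 4},
    {1, 2, 3, 5},
    {1, 2, 3, 4, 5},
    {10, 11, 12},
    {10, 11, 13},
]
clusters = sphere_exclusion(fingerprints, threshold=0.55)
```

Building the neighbour lists still takes O(N^2) similarity calculations in this naive form; the method is simple because it makes a single pass over the sorted list and produces one partition.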

Limitations of Ward's for large datasets (>1m)
Best algorithms have O(N^2) time requirement (RNN)
Requires random access to fingerprints
  – hence substantial memory requirements (O(N))
Problem of selection of best partition
  – can select desired number of clusters
Easily hit 4GB memory addressing limit on 32-bit machines
  – Approximately 2m compounds
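
The two scaling figures on this slide can be checked with back-of-envelope arithmetic. The sketch assumes the 32-bit limit is 2^32 bytes and that it is reached at about 2 million compounds, as stated. The resulting per-compound figure is an implied total budget (fingerprint plus any bookkeeping), not a measured value.

```python
ADDRESS_LIMIT_BYTES = 2 ** 32      # 4 GB addressable on a 32-bit machine
N_COMPOUNDS = 2_000_000            # approximate ceiling quoted on the slide

# Memory is O(N): the implied budget per compound is limit / N.
bytes_per_compound = ADDRESS_LIMIT_BYTES / N_COMPOUNDS   # about 2147 bytes

# Time is O(N^2): number of distinct compound pairs at that size.
pair_count = N_COMPOUNDS * (N_COMPOUNDS - 1) // 2        # about 2 x 10^12
```

Doubling the dataset therefore doubles the memory needed but quadruples the number of pairwise comparisons, which is why the time requirement tends to bite before the memory limit does.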

Divisive K-means Clustering
New hierarchical divisive method
  – Hierarchy built from top down, instead of bottom up
  – Divide complete dataset into two clusters
  – Continue dividing until all items are singletons
  – Each binary division done using K-means method
  – Originally proposed for document clustering
Bisecting K-means
  – Steinbach, Karypis and Kumar (Univ. Minnesota) users.cs.umn.edu/~karypis/publications/Papers/PDF/doccluster.pdf
  – Found to be more effective than agglomerative methods
  – Forms more uniformly-sized clusters at given level
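
The top-down procedure described above can be sketched in a few lines. This is a toy illustration on small numeric vectors, not the BCI implementation. The seeding rule (first point plus the point farthest from it), the squared Euclidean distance and the iteration cap are all assumptions chosen to keep the example deterministic.

```python
from typing import List, Sequence, Tuple, Union

Point = Tuple[float, ...]
Tree = Union[Point, list]  # a leaf is a point; an internal node is [left, right]


def sq_dist(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points: Sequence[Point]) -> Point:
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def two_means(points: List[Point], max_iter: int = 20):
    """One binary division: K-means with k = 2."""
    seed_a = points[0]
    seed_b = max(points, key=lambda p: sq_dist(p, seed_a))
    centers = [seed_a, seed_b]
    groups = ([], [])
    for _ in range(max_iter):
        groups = ([], [])
        for p in points:
            nearer_a = sq_dist(p, centers[0]) <= sq_dist(p, centers[1])
            groups[0 if nearer_a else 1].append(p)
        if not groups[0] or not groups[1]:
            break  # degenerate split, handled by the caller
        new_centers = [centroid(groups[0]), centroid(groups[1])]
        if new_centers == centers:
            break  # assignments have stabilised
        centers = new_centers
    return groups


def divisive_kmeans(points: List[Point]) -> Tree:
    """Divide from the top down until every item is a singleton."""
    if len(points) == 1:
        return points[0]
    left, right = two_means(points)
    if not left or not right:  # e.g. identical points: split arbitrarily
        left, right = points[:1], points[1:]
    return [divisive_kmeans(left), divisive_kmeans(right)]


data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
tree = divisive_kmeans(data)
```

Each division only runs K-means on the items in one cluster, so the work shrinks as the hierarchy deepens. This is the property that makes the approach attractive for datasets too large for agglomerative methods.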

BCI Divkmeans
Several options for detailed operation
Selection of next cluster for division
  – size, variance, diameter
  – affects selection of partitions from hierarchy, not shape of hierarchy
Options within each K-means division step
  – distance measure
  – choice of seeds
  – batch-mode or continuous update of centroids
  – termination criterion
Have developed MPI parallel version for Linux clusters / grids in conjunction with BCI (now Digital Chemistry)
For more information, see Barnard and Engels talks at:
Now available as a web service at IU (along with other BCI programs)
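
The three selection criteria named above can give different answers for the same set of clusters, as the following sketch shows. The definitions used here (variance as mean squared distance to the centroid, diameter as the largest pairwise Euclidean distance) are common choices but are assumptions; the slide does not define them, and the function names are hypothetical.

```python
from itertools import combinations
from math import sqrt
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, ...]
Cluster = List[Point]


def sq_dist(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def size(cluster: Cluster) -> float:
    return float(len(cluster))


def variance(cluster: Cluster) -> float:
    n = len(cluster)
    center = tuple(sum(c) / n for c in zip(*cluster))
    return sum(sq_dist(p, center) for p in cluster) / n


def diameter(cluster: Cluster) -> float:
    return max((sqrt(sq_dist(a, b)) for a, b in combinations(cluster, 2)),
               default=0.0)


CRITERIA: Dict[str, Callable[[Cluster], float]] = {
    "size": size, "variance": variance, "diameter": diameter,
}


def next_cluster_to_split(clusters: List[Cluster], criterion: str) -> int:
    """Index of the cluster that scores highest on the chosen criterion."""
    score = CRITERIA[criterion]
    return max(range(len(clusters)), key=lambda i: score(clusters[i]))


clusters = [
    [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)],  # many, compact
    [(0.0, 0.0), (10.0, 0.0)],                                     # two, far apart
    [(0.0, 0.0), (6.0, 0.0), (6.0, 0.0), (12.0, 0.0)],             # widest span
]
choices = {name: next_cluster_to_split(clusters, name) for name in CRITERIA}
```

Because every cluster is eventually divided down to singletons, the criterion changes only the order of the divisions, and therefore which partition is obtained when the hierarchy is cut at a given number of clusters.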

Comparative execution times
[Chart: NCI subsets, 2.2 GHz Intel Celeron processor. Times shown: 7h 27m, 3h 06m, 2h 25m, 44m]

MPI Parallel Divkmeans clustering of PubChem
AVIDD Linux cluster, 5,273,852 structures (PubChem Compound, Nov 2005)

Distributed Drug Discovery
Project run by Dr. Bill Scott at IUPUI
Tackling neglected diseases using distributed chemistry (while educating undergraduates about combinatorial chemistry)
Each student makes 4 compounds on cheap equipment. Each class will typically make around 60 compounds.
Many universities participating around the world
Reaction transformations, virtual and made compounds stored in a PostgreSQL database exposed as a web service
This information can then be drawn into our workflows. For example, searches for similar compounds can be done on PubChem, the Tumor Cell Line database, etc.

Distributed Drug Discovery
William L. Scott: A Distributed Drug Discovery Concept to Search for Developing World Disease Drug Leads

Visualization and end-user tools
PubChemSR
2D structure visualizer using CDK
VoPlot
VisualiSAR - modal fingerprints
Similarity Matrix Visualization
General approaches to end-user tools
  – Portlets and .NET
  – Usability & Contextual Design

PubChemSR (Junguk Hur)

Simple 2D viewer applet (using CDK) - David Jiao

VoPlot

VisualiSAR - modal fingerprints
With a nod to Edward Tufte. See

Visual Similarity Matrices
Visual Similarity Matrices display large, graph-based data sets in a compact form. The axes are labeled with the data items (vertices) and a dot indicates a relation (edge) between two data items. Different vertex orderings can reveal information about the data.
Orderings shown: Original (curated), Breadth-first Search, Degree, Sloan's Algorithm
Data: NCI Compound Database - compounds with positive AIDS screens
Additional details are displayed as property plots. Here, the different computed properties are displayed along with the main matrix.
In order to generate similarity matrices and orderings in a reasonable time (minutes instead of days), we are developing parallel and high-performance libraries that take advantage of modern processor and system architectures. These include optimized SIMD for AltiVec (PowerPC) and SSE (Intel) and parallel algorithms for multiprocessor environments.
Student: Christopher Mueller
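
The effect of vertex ordering on such a matrix can be shown with a very small example. The sketch below is illustrative only and unrelated to the high-performance libraries mentioned above. It computes a breadth-first ordering of a toy graph and prints the matrix as text, with `*` marking an edge. The graph and the function names are invented for the example.

```python
from collections import deque
from typing import Dict, List

Graph = Dict[int, List[int]]


def bfs_order(adj: Graph) -> List[int]:
    """Breadth-first vertex ordering, restarting for each unvisited component."""
    seen, order = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in sorted(adj[v]):
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
    return order


def matrix_rows(adj: Graph, order: List[int]) -> List[str]:
    """One text row per vertex, in the given order; '*' marks an edge."""
    pos = {v: i for i, v in enumerate(order)}
    rows = []
    for v in order:
        cols = {pos[w] for w in adj[v]}
        rows.append("".join("*" if j in cols else "." for j in range(len(order))))
    return rows


# Toy graph: a chain 0-3-1-4 and a separate pair 2-5.
graph: Graph = {0: [3], 1: [3, 4], 2: [5], 3: [0, 1], 4: [1], 5: [2]}

original = matrix_rows(graph, sorted(graph))
order = bfs_order(graph)
reordered = matrix_rows(graph, order)
```

In the original ordering the dots are scattered across the matrix. After the breadth-first reordering every dot sits next to the diagonal, so the chain and the separate pair become visible as two distinct bands.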

General approaches to end-user tools
Main interface-level vehicle should be portlets, allowing reuse and interchangeability
Other interfaces, such as .NET clients and RSS interfaces, will also be investigated
No matter how clever the smarts underneath, the overriding factor in usefulness will be the quality of scientists' interaction with the system
Contextual Design, Interaction Design (Cooper) and Usability Studies have proven effective in designing the right interfaces for the right people in chemical informatics [collaboration with HCI?]
Possibility of multiple interfaces for different groups of people (Cooper's primary personas)
Don't assume the browser interface – / NLP ?
Start with the basics
  – 2D chemical structure drawing (input)
  – Visualization of large numbers of chemical structures in 2D
  – 3D chemical structure visualization
Current project is looking at usability of online chemical databases (including PubChem)

Usability of 2D structure drawing tools
Key difference between sequential and random drawers
Huge difference in intuitiveness
Key factor: how badly you can mess things up
Marvin Sketch JME > ChemDraw >> ISIS Draw

Cambridge-Indiana Collaboration
Weekly Access Grid meetings
Bringing together areas of expertise in the UK and USA
Applying OSCAR text mining to NIH data
Looking toward joint presentations & publications

Contributors
My students
  – Xiao Dong
  – Huijung Wang
  – Jason Lee
  – Junguk Hur
  – David Jiao
  – Usha Cheemakurthi
  – Waiping Kam
Geoffrey's group at CGL
  – Marlon Pierce
  – Jake Kim
  – Sima Patel
  – Smitha Ajay
Others
  – Gary Wiggins
  – Melanie Wu
  – Dennis Gannon
  – Beth Plale
  – Rajarshi Guha
  – Peter Murray-Rust
  – Peter Corbett
  – Dan Zaharevitz