Indiana University School of David Wild – ECCR Meeting, October 2005. Page 1 Chemical Informatics & Cyberinfrastructure Collaboratory Cheminformatics Aspects:


Page 1: Chemical Informatics & Cyberinfrastructure Collaboratory
Cheminformatics Aspects: HTS Data Analysis & Virtual Screening
David J. Wild, Visiting Assistant Professor, Indiana University School of Informatics

Page 2: About Me
Ph.D. and postdoc in Peter Willett’s Lab (Sheffield): parallel 2D and 3D similarity algorithms.
Postdoc then Senior Scientist at Parke-Davis, Ann Arbor (now Pfizer), researching and developing chemoinformatics tools for bench chemists & modelers.
Led collaborations with Tripos and Bioreason for development of HTS analysis software (SAR Navigator, ClassPharmer).
Left in 2002 to form Wild Ideas Consulting and take up adjunct position at University of Michigan.
Visiting Assistant Professor at Indiana since August.
Permanent position starting fall.
Now run research group focused on handling large and diverse sources of chemical information.
More at

Page 3: “Cheminformatics” context of CICC proposal
Development of user-centered tools for query, organization, navigation and analysis of large chemical HTS datasets (specifically PubChem and its subsets), including:
–Rapid organization of large datasets (cluster analysis)
–Intuitive interfaces for navigation and analysis
–Virtual screening
–Standardization of data exchange formats
–Data mining of SAR across multiple screens
Or: allowing scientists to ask the right questions and have them answered effectively

Page 4: Thoughts relating to PubChem HTS analysis (and more widely applicable)
Scientists’ questions are probably not going to be conceptually complex, but finding the answers can currently be very time consuming and/or complex (for a human)
–“which of the 10,000 hits from this screen are most promising for follow-up?”
–“who else is working on similar chemical structures to these?”
–“are there any compounds in PubChem (or elsewhere) that might bind to the active site of this protein I just resolved?”
–“do any compounds related to this one exhibit toxic side effects?”
We need to figure out just what the questions are! (Contextual Inquiry, Use cases)
Answers are often “stale” after a short period of time; questions need to be re-answered as new information is generated
Almost all current systems are passive, and follow the (web) browsing model
Existing approaches do not scale up well

Page 5: Use Case. Which of these hits should I follow up?
An HTS experiment has produced 10,000 possible hits out of a screening set of 2m compounds. A chemist on the project wants to know what the most promising series of compounds for follow-up are, based on:
–Series selection: BCI cluster analysis
–Structure-activity relationships: lots of methods
–Chemical and pharmacokinetic properties: MiTools, ChemAxon
–Compound history: gNova / PostgreSQL / PubChem search
–Patentability: BCI Markush handling software
–Toxicity
–Synthetic feasibility
–+ requires visualization tools!

Page 6: Use Case. Are there any good ligands for my target?
A chemist is working on a project involving a particular protein target, and wants to know:
–Any newly published compounds which might fit the protein receptor site: gNova / PostgreSQL, PubChem search, FRED Docking
–Any published 3D structures of the protein or of protein-ligand complexes: PDB search
–Any interactions of compounds with other proteins: gNova / PostgreSQL, PubChem search
–Any information published on the protein target: Journal text search

Page 7
Interaction Layer
Purpose: Software for information access and storage by humans, including email, browsing tools and “push” tools
Tools: Web browsers, email clients, RSS aggregators, JMol, JME
Aggregation Layer
Purpose: Software, intelligent agents and data schemas customized for particular domains, applications and users
Tools: BPEL, Microsoft Smart Client
Interface Layer
Purpose: Common interfaces to the data layer; may be several for different kinds of information
Tools: Apache web services, SOAP wrappers, WSDL, UDDI, XML, Microsoft .NET
Data Layer
Purpose: Comprehensive data provision including storage, calculation, semantics and meta-data, probably in multiple systems
Tools: MySQL, PostgreSQL, gNova Cartridge, chemoinformatics calculation programs; data from NCI, ZINC
Wild, D.J., Strategies for Using Information Effectively in Early-stage Drug Discovery, in Ekins, S. (ed), Computer Applications in Pharmaceutical Research and Development, submitted July 2005

Page 8
Interaction Layer
Purpose: Software for information access and storage by humans, including email, browsing tools and “push” tools
Tools: Web browsers, email clients, RSS aggregators, JMol, JME
Aggregation Layer
Purpose: Software, intelligent agents and data schemas customized for particular domains, applications and users
Tools: BPEL, Microsoft Smart Client
Interface Layer
Purpose: Common interfaces to the data layer; may be several for different kinds of information
Tools: Apache web services, SOAP wrappers, WSDL, UDDI, XML, Microsoft .NET
Data Layer
Purpose: Comprehensive data provision including storage, calculation, semantics and meta-data, probably in multiple systems
Tools: MySQL, PostgreSQL, gNova Cartridge, chemoinformatics calculation programs; data from NCI, ZINC
Wild, D.J., Strategies for Using Information Effectively in Early-stage Drug Discovery, in Ekins, S. (ed), Computer Applications in Pharmaceutical Research and Development, submitted July 2005
Overlay labels: atomic web services; databases & tools; knowledge mgt.; interfaces / grid portal

Page 9 (architecture diagram)
Request from Human Interface: “find me all the structures that fit the enclosed protein for the next three months”
AGENT / SMART CLIENT
–Parse request
–Select appropriate use cases and/or web service(s)
–Schedule as necessary
USE-CASE SCRIPT
–Invoke New Structure Service
–Convert structures to 3D
–Dock results & protein file
–Extract any hits
–Return links for visualization
Aggregate services: New Structure Service
–Search online databases for recent structures
–Search local databases for recent structures
–Merge Results
Atomic services: Online database (e.g. PubChem), Local database, 3D Docking Tool, 2D-3D converter, 3D visualizer
UDDI, WSDL, SOAP
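The use-case script on this slide can be sketched as a short pipeline over atomic services. This is only an illustration under stated assumptions: every function name, the `DockingResult` type, and the "lower score is better" cutoff are hypothetical, and the atomic services are passed in as plain Python callables rather than invoked over SOAP/WSDL as the diagram intends.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class DockingResult:
    structure_id: str
    score: float  # assumption for this sketch: lower scores are better


def new_structure_service(sources: Iterable[Callable[[], List[str]]]) -> List[str]:
    """Aggregate service: query each source, then merge results without duplicates."""
    merged: List[str] = []
    seen = set()
    for search in sources:
        for structure_id in search():
            if structure_id not in seen:
                seen.add(structure_id)
                merged.append(structure_id)
    return merged


def run_use_case(
    sources: Iterable[Callable[[], List[str]]],
    convert_to_3d: Callable[[str], str],
    dock: Callable[[str, str], float],
    protein_file: str,
    score_cutoff: float,
) -> List[DockingResult]:
    """Use-case script: find new structures, convert to 3D, dock, keep the hits."""
    hits = []
    for structure_id in new_structure_service(sources):
        model_3d = convert_to_3d(structure_id)   # 2D-3D converter (atomic service)
        score = dock(model_3d, protein_file)     # 3D docking tool (atomic service)
        if score <= score_cutoff:                # "extract any hits"
            hits.append(DockingResult(structure_id, score))
    return sorted(hits, key=lambda result: result.score)
```

Passing the services in as callables keeps the script independent of how each one is reached, which is roughly the separation between aggregate and atomic services that the diagram draws.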

Page 10: Visualization & interface level tools
No matter how clever the smarts underneath, the overriding factor in usefulness will be the quality of scientists’ interaction with the system
Several metaphors in existence for looking at large amounts of 2D structural information: 2D plot (SAR Navigator), “spreadsheet” views (Accord, etc), enhanced spreadsheets (ClassPharmer, ChemTK), Kohonen maps, TreeMaps
Contextual Design, Interaction Design (Cooper) and Usability Studies have proven effective in designing the right interfaces for the right people in chemical informatics, and deserve investigation for future use in this project (in collaboration with HCI colleagues on the project)
Possibility of multiple interfaces for different people groups (Cooper’s “primary personas”)
Don’t assume the browser interface: email / nat. lang. proc?
Start with the basics
–2D chemical structure drawing (input)
–Visualization of large numbers of chemical structures in 2D
–3D chemical structure visualization
Planning on evaluation of NLP, email, RSS, etc. as well as browser-based interfaces
Interface tools will be developed in a grid portal environment using portlet technology

Page 11: Visualization methods for datasets & clusters
Partitions
–Spreadsheets
–Enhanced Spreadsheets
–2D or 3D plots
Hierarchies
–Dendrograms
–Tree Maps
–Hyperbolic Maps

Page 12: Usability of 2D structure drawing tools
Key difference between “sequential” and “random” drawers
Huge difference in intuitiveness
Key factor: how badly you can mess things up
Marvin Sketch ≈ JME > ChemDraw >> ISIS Draw

Page 13: Next Steps
Develop realistic use-cases based on as much information about potential users as we can muster
Work with other members of CICC to define Grid architecture (services required and their interfaces) by integrating requirements of different aspects of Cheminformatics
Implement some web services that are likely to be employed in use cases
–Rapid dataset search and organization
  –Search of PubChem (SOAP interface already exists)
  –Search of local gNova / PostgreSQL database
  –Clustering using BCI (Digital Chemistry) Divisive K-Means
  –BCI Markush searching
–Interface tools for navigation and analysis
  –Integration with Spotfire ChemTK (or other spreadsheet-metaphor product)
  –Develop entirely new interface tools (usability studies)
–Virtual Screening
  –Molecular docking with OpenEye FRED
  –Property calculation with Molinspiration / ChemAxon
  –PDB Search (EMBL)
  –Activity prediction modules (Molinspiration / RP / SVMs etc)

Page 14: Supplemental Slides


Page 17: Use Case #1. Are there any good ligands for my target?
A chemist is working on a project involving a particular protein target, and wants to know:
–Any newly published compounds which might fit the protein receptor site
–Any published 3D structures of the protein or of protein-ligand complexes
–Any interactions of compounds with other proteins
–Any information published on the protein target

Page 18: Use Case #1. Are there any good ligands for my target?
A chemist is working on a project involving a particular protein target, and wants to know:
–Any newly published compounds which might fit the protein receptor site: gNova / PostgreSQL, PubChem search, FRED Docking
–Any published 3D structures of the protein or of protein-ligand complexes: PDB search
–Any interactions of compounds with other proteins: gNova / PostgreSQL, PubChem search
–Any information published on the protein target: Journal text search

Page 19: Use Case #2. Who else is working on these structures?
A chemist is working on a chemical series for a particular project and wants to know:
–If anyone publishes anything using the same or related compounds
–Any new compounds added to the corporate collection which are similar or related
–If any patents are submitted that might overlap the compounds he is working on
–Any pharmacological or toxicological results for those or related compounds
–The results for any other projects for which those compounds were screened

Page 20: Use Case #2. Who else is working on these structures?
A chemist is working on a chemical series for a particular project and wants to know:
–If anyone publishes anything using the same or related compounds: ~ PubChem search
–Any new compounds added to the corporate collection which are similar or related: gNova CHORD / PostgreSQL
–If any patents are submitted that might overlap the compounds he is working on: ~ BCI Markush handling software
–Any pharmacological or toxicological results for those or related compounds: gNova CHORD / PostgreSQL, MiToolkit
–The results for any other projects for which those compounds were screened: gNova CHORD / PostgreSQL, PubChem search

Page 21: Use Case, PubChem. Which of these hits should I follow up?
An MLI HTS experiment has produced 10,000 possible hits out of a screening set of 2m compounds. A chemist at another laboratory wants to know if there are any interesting active series she might want to pursue, based on:
–Structure-activity relationships
–Chemical and pharmacokinetic properties
–Compound history
–Patentability
–Toxicity
–Synthetic feasibility

Page 22: Use Case, PubChem. Which of these hits should I follow up?
An HTS experiment has produced 10,000 possible hits out of a screening set of 2m compounds. A chemist on the project wants to know what the most promising series of compounds for follow-up are, based on:
–Series selection: BCI cluster analysis
–Structure-activity relationships: lots of methods
–Chemical and pharmacokinetic properties: MiTools, ChemAxon
–Compound history: gNova / PostgreSQL / PubChem search
–Patentability: BCI Markush handling software
–Toxicity
–Synthetic feasibility
–+ requires visualization tools!

Page 23: Cluster Analysis and Chemical Informatics
Used for organizing datasets into chemical series, to build predictive models, or to select representative compounds
Organizational usage has not been as well studied as the other two, but see
–Wild, D.J., Blankley, C.J. Comparison of 2D Fingerprint Types and Hierarchy Level Selection Methods for Structural Grouping using Wards Clustering, Journal of Chemical Information and Computer Sciences, 2000, 40,
Essentially helping large datasets become manageable
Methods used:
–Jarvis-Patrick and variants: O(N^2), single partition
–Ward’s method: hierarchical, regarded as best, but at least O(N^2)
–K-means: < O(N^2), requires a set number of clusters, a little “messy”
–Sphere-exclusion (Butina): fast, simple, similar to JP
–Kohonen network: clusters arranged in 2D grid, ideal for visualization
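As an illustration of the simplest method in the list, here is a minimal sketch of Butina-style sphere-exclusion clustering over 2D fingerprints. It assumes fingerprints are represented as sets of on-bit positions and compared with the Tanimoto coefficient; the function names and the threshold parameter are hypothetical, and the neighbour-list step here is a naive O(N^2) loop rather than an optimized implementation.

```python
from typing import FrozenSet, List, Sequence


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def sphere_exclusion(fps: Sequence[FrozenSet[int]], threshold: float) -> List[List[int]]:
    """Butina-style clustering; each cluster is a list of indices, centroid first."""
    n = len(fps)
    # Neighbour lists: all compounds at or above the similarity threshold.
    neighbours = [
        [j for j in range(n) if j != i and tanimoto(fps[i], fps[j]) >= threshold]
        for i in range(n)
    ]
    # Visit candidate centroids from most to fewest neighbours.
    order = sorted(range(n), key=lambda i: len(neighbours[i]), reverse=True)
    assigned = [False] * n
    clusters: List[List[int]] = []
    for i in order:
        if assigned[i]:
            continue
        # The centroid claims every neighbour not already taken by an earlier sphere.
        members = [i] + [j for j in neighbours[i] if not assigned[j]]
        for j in members:
            assigned[j] = True
        clusters.append(members)
    return clusters
```

The result is a single partition, and compounds with no unassigned neighbours end up as singletons.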

Page 24: Limitations of Ward’s method for large datasets (>1m)
Best algorithms have O(N^2) time requirement (RNN)
Requires random access to fingerprints
–hence substantial memory requirements (O(N))
Problem of selection of best partition
–can select desired number of clusters
Easily hit 4GB memory addressing limit on 32 bit machines
–Approximately 2m compounds

Page 25: Scaling up clustering methods
Parallelisation
–Clustering algorithms can be adapted for multiple processors
–Some algorithms more appropriate than others for particular architectures
–Ward’s has been parallelized for shared memory machines, but overhead considerable
New methods and algorithms
–Divisive (“bisecting”) K-means method
–Hierarchical Divisive
–Approx. O(N log N)

Page 26: Divisive K-means Clustering
New hierarchical divisive method
–Hierarchy built from top down, instead of bottom up
–Divide complete dataset into two clusters
–Continue dividing until all items are singletons
–Each binary division done using K-means method
–Originally proposed for document clustering
“Bisecting K-means”
–Steinbach, Karypis and Kumar (Univ. Minnesota) users.cs.umn.edu/~karypis/publications/Papers/PDF/doccluster.pdf
–Found to be more effective than agglomerative methods
–Forms more uniformly-sized clusters at given level
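The top-down procedure described on this slide can be sketched in a few lines. This is a toy illustration, not the BCI implementation: it assumes points are plain numeric tuples compared by squared Euclidean distance (real fingerprint clustering would likely use a different distance), uses batch centroid updates with random seeds, and all names (`Node`, `two_means`, `divisive_kmeans`) are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Point = Sequence[float]


@dataclass
class Node:
    members: List[Point]
    children: Tuple["Node", ...] = ()  # empty for a singleton leaf


def _centroid(points: List[Point]) -> List[float]:
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def _sq_dist(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_means(points: List[Point], rng: random.Random, max_iter: int = 20):
    """One binary division: K-means with K=2 and batch centroid updates."""
    first, second = rng.sample(range(len(points)), 2)
    centres = [list(points[first]), list(points[second])]
    left: List[Point] = []
    right: List[Point] = []
    for _ in range(max_iter):
        new_left, new_right = [], []
        for p in points:
            nearer_left = _sq_dist(p, centres[0]) <= _sq_dist(p, centres[1])
            (new_left if nearer_left else new_right).append(p)
        if not new_left or not new_right:
            # Degenerate split (e.g. duplicate points): fall back to halving.
            half = len(points) // 2
            return points[:half], points[half:]
        if new_left == left and new_right == right:
            break  # assignments stopped changing
        left, right = new_left, new_right
        centres = [_centroid(left), _centroid(right)]
    return left, right


def divisive_kmeans(points: List[Point], rng: random.Random) -> Node:
    """Build the hierarchy from the top down until every leaf is a singleton."""
    if len(points) == 1:
        return Node(points)
    left, right = two_means(points, rng)
    return Node(points, (divisive_kmeans(left, rng), divisive_kmeans(right, rng)))
```

For example, `divisive_kmeans([(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)], random.Random(0))` returns a root whose two children hold the two well-separated pairs. The recursion here always divides every cluster; choosing which cluster to divide next only matters when a flat partition is read off the hierarchy.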

Page 27: BCI Divkmeans
Several options for detailed operation
Selection of next cluster for division
–size, variance, diameter
–affects selection of partitions from hierarchy, not shape of hierarchy
Options within each K-means division step
–distance measure
–choice of seeds
–batch-mode or continuous update of centroids
–termination criterion
Have developed parallel version for Linux clusters / grids in conjunction with BCI
For more information, see Barnard and Engels talks at:
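The three selection criteria named on this slide can be illustrated with a small sketch. The definitions below are assumptions made for illustration (the slide does not say how the BCI program defines variance or diameter): size is the member count, variance is the mean squared Euclidean distance to the centroid, and diameter is the largest pairwise distance. All function names are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence

Point = Sequence[float]
Cluster = List[Point]


def _sq_dist(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def size(cluster: Cluster) -> float:
    """Number of members."""
    return float(len(cluster))


def variance(cluster: Cluster) -> float:
    """Mean squared distance of members from the cluster centroid."""
    dim = len(cluster[0])
    centre = [sum(p[d] for p in cluster) / len(cluster) for d in range(dim)]
    return sum(_sq_dist(p, centre) for p in cluster) / len(cluster)


def diameter(cluster: Cluster) -> float:
    """Largest pairwise distance between members (quadratic in cluster size)."""
    return max(
        (math.sqrt(_sq_dist(a, b)) for i, a in enumerate(cluster) for b in cluster[i + 1:]),
        default=0.0,
    )


CRITERIA: Dict[str, Callable[[Cluster], float]] = {
    "size": size,
    "variance": variance,
    "diameter": diameter,
}


def pick_next(clusters: List[Cluster], criterion: str = "size") -> int:
    """Index of the cluster to divide next under the chosen criterion."""
    score = CRITERIA[criterion]
    return max(range(len(clusters)), key=lambda i: score(clusters[i]))
```

With a compact five-member cluster and a two-member cluster whose points are far apart, `"size"` picks the first while `"variance"` and `"diameter"` pick the second, which is how the criterion changes the order of divisions without changing what each division produces.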

Page 28: Comparative execution times
NCI subsets, 2.2 GHz Intel Celeron processor
Chart labels: 7h 27m, 3h 06m, 2h 25m, 44m

Page 29: Clustering a 1 million compound dataset on a 2.2 GHz Celeron Desktop Machine
Method: Time*, Memory Usage
K-Means (10,000 clusters): 3½ days, 95 MB
Divisive K-means: 7 days, 65 MB
Divisive K-means (Parallel, 4 machines incl. 1.7 GHz Pentium M): 16½ hours, ~50 MB
* Time for a single run may vary due to different selection of seeds. Runtimes can be shortened e.g. by using a max. number of iterations or a % relocation cutoff.
Results from AVIDD clusters & Teragrid coming soon.

Page 30: Divisive K-means: Conclusions
Much faster than Ward’s, speed comparable to K-means, suitable for very large datasets (millions)
–Time requirements approximately O(N log N)
–Current implementation can cluster 1m compounds in under a week on a low-power desktop PC
–Cluster 1m compounds in a few hours with a 4-node parallel Linux cluster
Better balance of cluster sizes than Ward’s or K-means
Visual inspection of clusters suggests better assembly of compound series than other methods
Better clustering of actives together than previously-studied methods
Memory requirements minimal
Experiments using AVIDD cluster and Teragrid forthcoming (50+ nodes)

Page 31: Visualization & interface level tools
No matter how clever the smarts underneath, the overriding factor in usefulness will be the quality of scientists’ interaction with the system
Contextual Design, Interaction Design (Cooper) and Usability Studies have proven effective in designing the right interfaces for the right people in chemical informatics [collaboration with HCI?]
Possibility of multiple interfaces for different people groups (Cooper’s “primary personas”)
Don’t assume the browser interface: email / NLP?
Start with the basics
–2D chemical structure drawing (input)
–Visualization of large numbers of chemical structures in 2D
–3D chemical structure visualization
Planning on evaluation of NLP, email, RSS, etc. as well as browser-based interfaces



Page 36: VisualiSAR – with a nod to Edward Tufte. See

Page 37: Tree Maps – very Tufte-esque

Page 38: External support
ECCR grant ($500,000)
–20% Co-PI with Fox for development of web services for HTS data organization and visualization
–May lead to $5m/5 years grant for full center
Applied for Microsoft Smart Clients for eScience grant ($50,000)
–Including Marlon Pierce in the Community Grids lab
Peter Murray-Rust group, Cambridge: offering expertise and assistance with web services
IO-Informatics: provision of Sentient software and consulting
BCI: clustering, structure enumeration & toolkit, consulting
OpenEye: a range of calculation tools, FRED docking
Molinspiration: MiTools Toolkit
gNova: CHORD chemical database system
Possible financial support from company in the UK

Page 39: Technology
Perl SOAP::Lite
–Will be used for initial web service development
–Doesn’t really implement WSDL & UDDI
Apache Axis & Tomcat
–Deploy WSDL for web services
BPEL4WS (Business Process Execution Language)
–For aggregation of web services
– bpel/ bpel/
Microsoft .NET & C#
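To make the SOAP layer concrete, here is a minimal sketch of the kind of request envelope a client would send to one of these web services. It is written in Python with the standard library purely for illustration (the slide names Perl SOAP::Lite for the actual work). The envelope namespace is the standard SOAP 1.1 one; the service namespace `urn:example:cicc`, the operation `clusterCompounds` and its parameters are hypothetical.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace


def build_soap_request(service_ns: str, operation: str, params: dict) -> bytes:
    """Serialise a SOAP 1.1 request envelope for a single operation call."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    # The operation element lives in the service's own namespace.
    call = ET.SubElement(body, f"{{{service_ns}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, name).text = str(value)
    return ET.tostring(envelope, encoding="utf-8")


if __name__ == "__main__":
    # Hypothetical service namespace, operation name and parameters.
    request = build_soap_request(
        "urn:example:cicc", "clusterCompounds", {"datasetId": "demo", "maxClusters": 10}
    )
    print(request.decode("utf-8"))
```

The sketch stops at building the message; posting it to an endpoint and reading a WSDL description are left out because no service addresses are given in the slides.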

Page 40: Current activities
Core activities
–Development of use-cases
–Development of initial web services (Perl SOAP::Lite)
–Use of Taverna to prototype use-case scripts
Basic research on future components
–Organizing large amounts of chemical information for human consumption
  –Development of very fast parallel clustering techniques, to be exposed as web services
–Selection of interface-level tools for basic interaction
  –Chemical structure drawing, display
  –Investigation of email, NLP, RSS, and browser interfaces
–Interface-level tools for visualization, navigation and analysis
  –Cluster and dataset visualization, natural language interfaces

Page 41: Sentient, an alternative approach to managing heterogeneous data sources
Collaboration with IO-Informatics (along with Cornell and UCSD) for the investigation of service-oriented architectures in life sciences research using Sentient software
Aim to integrate several sources of information relating to Alzheimer’s Disease (brain imaging, morphology, gene expression) so that cross-dataset biomarkers can be identified
Sentient uses Intelligent Multidimensional Objects (IMOs) to define and query data sources and the tools used to access them
Still a browsing approach, but with a layer of coherence and “intelligence”
Hope to expand to include chemistry data
Can also be used as an interface-level tool


Page 44: Conclusions so far
Effective exploitation of large volumes and diverse sources of chemical information is a critical problem to solve, with a potentially huge impact on the drug discovery process
Most information needs of chemists and drug discovery scientists are conceptually straightforward, but complex (for them) to implement
All of the technology is now in place to implement many of these information need “use-cases”: the four level model using service-oriented architectures together with smart clients looks like a neat way of doing this
The aggregation and interface levels offer the most challenges
In conjunction with grid computing, rapid and effective organization and visualization of large chemical datasets is feasible in a web service environment
Some pieces are missing:
–Chemical structure search of journals (wait for InChI)
–Automated patent searching
–Effective dataset organization
–Effective interfaces, especially visualization of large numbers of 2D structures (we’re working on it!)