Discovery Net : A UK e-Science Pilot Project for Grid-based Knowledge Discovery Services Patrick Wendel Imperial College, London Data Mining and Exploration.

1 Discovery Net : A UK e-Science Pilot Project for Grid-based Knowledge Discovery Services Patrick Wendel Imperial College, London Data Mining and Exploration Middleware for Distributed and Grid Computing, September 18-19, 2003

2 Why Discovery Net? Data Challenge: distributed, heterogeneous and large-scale data sets; novel and real-time data sources. Resource Challenge: novel specialised data analysis components/services continually being published or made available; computational resources provided. Information Challenge: data cleaning, normalisation and calibration; new data needs to be related to existing data. Knowledge Challenge: collaborative, interactive and people-intensive; result interpretation and validation in relation to existing knowledge; knowledge sharing is key.

3 What is Discovery Net? Goal: construct an infrastructure for global-wide knowledge discovery services. Key technologies: Grid and distributed computing; workflow and service composition; data mining and visualisation; data access and information structuring; high-throughput screening devices (real-time).

4 Discovery Net: Unifying the World’s Knowledge Data Integration: Dynamic Real Time Construction of “Data Grids” Application Integration: Component and Service-based Integration People Integration: Global-wide Discovery Groupware Knowledge Integration: Multi-subjects and Multi-modality Integrative Analysis to Cross Validate and Annotate Related Discovery Work

5 What is Discovery Net: Using Distributed Resources. [Diagram] Scientific information: literature, databases, operational data, images, instrument data. Scientific discovery: real-time integration, dynamic application integration, workflow construction, interactive visual analysis.

6 Discovery Net Layer Model (Life Science Application). High-performance and Grid-enabled transfer protocols (GSI-FTP, DSTP..). Grid-enabled infrastructure (GSI). Deployment: Web/Grid Services, OGSA. D-Net Clients: end-user applications and user interface allowing scientists to construct and drive knowledge discovery activities. D-Net Middleware: provides execution logic for distributed knowledge discovery and access to distributed resources. Computation & Data Resources: distributed databases, compute servers and scientific devices.

7 A Knowledge Grid based on D-Net Servers. Several types of clients for different usage (from thin web client to participating client). Current implementation based on Java distributed objects (EJB), moving towards Web/Grid services, but with deployment and API access through standard Web/Grid services. Goal: Plug & Play data sources, analysis components and knowledge discovery processes.

8 Discovery Process Management. Workflow-based service composition. The data-flow approach fits the knowledge discovery process. Allows scientists to develop processes. Towards a standard workflow representation for discovery informatics: Discovery Process Markup Language (DPML). DPML contains component data-flow graphs, but also records collaboration information (user, changes) and execution constraints (location, parameterisation). It becomes a key intellectual property: discovery processes can be stored, reused, audited, refined and deployed in various forms. Figure: D-Net workflow for genome annotation, with 16 services executing across the Internet.
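
The data-flow idea on this slide can be illustrated with a minimal sketch. It is not DPML and not the D-Net API: the `Node` structure, its `location` and `author` fields, and `run_workflow` are hypothetical names chosen only to show a component graph that also carries collaboration and execution-constraint metadata, executed in dependency order.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Node:
    """One component in a data-flow graph (hypothetical structure, not DPML)."""
    name: str
    func: Callable[..., object]                       # the analysis step
    inputs: List[str] = field(default_factory=list)   # names of upstream nodes
    location: str = "anywhere"                        # execution constraint (assumed field)
    author: str = "unknown"                           # collaboration info (assumed field)


def run_workflow(nodes: List[Node]) -> Dict[str, object]:
    """Run every node once all of its inputs are available (topological order)."""
    by_name = {n.name: n for n in nodes}
    pending = {n.name: len(n.inputs) for n in nodes}
    consumers: Dict[str, List[str]] = {n.name: [] for n in nodes}
    for n in nodes:
        for src in n.inputs:
            consumers[src].append(n.name)

    ready = deque(name for name, count in pending.items() if count == 0)
    results: Dict[str, object] = {}
    while ready:
        name = ready.popleft()
        node = by_name[name]
        # Each node receives the results of its upstream nodes, in declared order.
        results[name] = node.func(*(results[src] for src in node.inputs))
        for nxt in consumers[name]:
            pending[nxt] -= 1
            if pending[nxt] == 0:
                ready.append(nxt)

    if len(results) != len(nodes):
        raise ValueError("workflow contains a cycle")
    return results


# Toy workflow: load a sequence, then compute its GC fraction in two branches.
workflow = [
    Node("load", lambda: "ACGTTGCA", author="alice"),
    Node("gc_count", lambda seq: sum(base in "GC" for base in seq), inputs=["load"]),
    Node("length", len, inputs=["load"]),
    Node("gc_fraction", lambda gc, n: gc / n, inputs=["gc_count", "length"]),
]
results = run_workflow(workflow)
```

Because the graph and its metadata are plain data, the same description could be stored, audited or re-run later, which is the property the slide attributes to DPML.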

9 InfoGrid: Dynamic Data Integration. [Diagram] Integrative analysis across Chemistry, Gene, Protein / Targets, Biological Screening, Clinical and Journals. Data shown: Sequence, Structure, Location, Function…; Activity, Protocols, Toxicology, Metabolic Pathways…; Sequence, Expression, Function…; Structures, Libraries, Catalogues, Synthetic pathways…; Journals, Project Reports, Patents…; Trials, Patients… Dynamic Data Integration = on-demand access to heterogeneous data sources + information structuring. Towards a dynamic information integration methodology: Specialised information source access: InfoGrid allows users to register, locate and connect to various specialised information sources. On-the-fly integration: InfoGrid allows users to build their own integration structure on the fly (worst case: proprietary protocol/format; best case: JDBC/HTTP-XML-XPath/Web Service). Easy maintenance: wrappers/drivers for new data sources can be added through a clean API.
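
The register / locate / connect pattern and the wrapper idea can be sketched as follows. This is an illustration under stated assumptions, not InfoGrid itself: `SourceWrapper`, `CsvSource`, `SqlSource`, `SourceRegistry` and `join_on` are hypothetical names, a CSV string stands in for a proprietary format, and an in-memory SQLite database stands in for JDBC-style access. The gene and pathway values are placeholders.

```python
import csv
import io
import sqlite3
from typing import Dict, Iterable, Iterator, List, Protocol


class SourceWrapper(Protocol):
    """Common interface every wrapper exposes (hypothetical, not the D-Net API)."""

    def rows(self) -> Iterator[dict]: ...


class CsvSource:
    """Wrapper for delimited text, standing in for a proprietary format."""

    def __init__(self, text: str) -> None:
        self.text = text

    def rows(self) -> Iterator[dict]:
        yield from csv.DictReader(io.StringIO(self.text))


class SqlSource:
    """Wrapper for a relational source, standing in for JDBC-style access."""

    def __init__(self, conn: sqlite3.Connection, query: str) -> None:
        self.conn = conn
        self.query = query

    def rows(self) -> Iterator[dict]:
        cursor = self.conn.execute(self.query)
        columns = [col[0] for col in cursor.description]
        for record in cursor:
            yield dict(zip(columns, record))


class SourceRegistry:
    """Register, locate and connect to sources by name and tag."""

    def __init__(self) -> None:
        self._wrappers: Dict[str, SourceWrapper] = {}
        self._tags: Dict[str, set] = {}

    def register(self, name: str, wrapper: SourceWrapper, tags: Iterable[str]) -> None:
        self._wrappers[name] = wrapper
        self._tags[name] = set(tags)

    def locate(self, tag: str) -> List[str]:
        return sorted(name for name, tags in self._tags.items() if tag in tags)

    def connect(self, name: str) -> SourceWrapper:
        return self._wrappers[name]


def join_on(left: Iterable[dict], right: Iterable[dict], key: str) -> Iterator[dict]:
    """On-the-fly integration: inner join of two row streams on a shared key."""
    index = {row[key]: row for row in right}
    for row in left:
        match = index.get(row[key])
        if match is not None:
            yield {**row, **match}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pathways (gene_id TEXT, pathway TEXT)")
conn.executemany(
    "INSERT INTO pathways VALUES (?, ?)",
    [("g1", "pathway_a"), ("g3", "pathway_b")],
)

registry = SourceRegistry()
registry.register("genes_csv", CsvSource("gene_id,name\ng1,alpha\ng2,beta\n"), tags=["gene"])
registry.register(
    "pathways_db",
    SqlSource(conn, "SELECT gene_id, pathway FROM pathways"),
    tags=["gene", "pathway"],
)

gene_sources = registry.locate("gene")
merged = list(
    join_on(
        registry.connect("genes_csv").rows(),
        registry.connect("pathways_db").rows(),
        "gene_id",
    )
)
```

Adding a new kind of source only requires another class with a `rows()` method, which is one plausible reading of "wrappers/drivers can be added through a clean API".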

10 Dynamic Application Integration Services. Dynamic Application Integration = on-demand access and composition of remote analysis components. Towards a dynamic component integration: Component service: allows users to register, locate and remotely execute components (Java component interface or Web Service port type). Execution service: allows users to control the execution of components in distributed environments. Easy maintenance: new components can be added through a clean API. Diagram labels: Regression, Clustering, Classification, Gene function prediction, Homology Search, Promoter Prediction, D-NET API.
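
A rough sketch of the two services named on this slide is given below. Everything in it is an assumption made for illustration: `ComponentService`, `ExecutionService` and their methods are hypothetical, and a local thread pool stands in for remote execution, which the real system would perform through Java component interfaces or Web Service port types.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, List


class ComponentService:
    """Register and locate analysis components by category (hypothetical API)."""

    def __init__(self) -> None:
        self._components: Dict[str, Callable] = {}
        self._categories: Dict[str, str] = {}

    def register(self, name: str, category: str, func: Callable) -> None:
        self._components[name] = func
        self._categories[name] = category

    def locate(self, category: str) -> List[str]:
        return sorted(n for n, c in self._categories.items() if c == category)

    def get(self, name: str) -> Callable:
        return self._components[name]


class ExecutionService:
    """Submit components for execution and query their state (hypothetical API)."""

    def __init__(self, components: ComponentService, workers: int = 2) -> None:
        self._components = components
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._jobs: Dict[int, Future] = {}

    def submit(self, name: str, *args) -> int:
        job_id = len(self._jobs) + 1
        self._jobs[job_id] = self._pool.submit(self._components.get(name), *args)
        return job_id

    def status(self, job_id: int) -> str:
        return "done" if self._jobs[job_id].done() else "running"

    def result(self, job_id: int):
        # Blocks until the component has finished.
        return self._jobs[job_id].result()

    def shutdown(self) -> None:
        self._pool.shutdown(wait=True)


def threshold_classify(values, cutoff):
    """Toy classification component: label each value against a cutoff."""
    return ["high" if v >= cutoff else "low" for v in values]


components = ComponentService()
components.register("mean", "statistics", lambda values: sum(values) / len(values))
components.register("threshold", "classification", threshold_classify)

execution = ExecutionService(components)
mean_job = execution.submit("mean", [2.0, 4.0, 6.0])
class_job = execution.submit("threshold", [1, 5, 9], 5)
mean_value = execution.result(mean_job)
labels = execution.result(class_job)
final_status = execution.status(class_job)
execution.shutdown()
```

Keeping registration separate from execution means a new component becomes available to every client as soon as it is registered, without changes to the execution side.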

11 Discovery Deployment. Discovery Deployment = on-demand rapid application construction and publishing. Towards a dynamic deployment of knowledge discovery procedures: Deployment engine: allows users to build and publish applications based on DPML code coordinating remotely executed components, as a Web page, Web/Grid Service or command-line tool. Easy maintenance: new discovery procedures are described in DPML, a standardised representation of “composed” discovery procedures. Storage & reporting servers: allow users to share DPML procedures and to generate workflow audit reports. Diagram labels: Discovery Service, Batch processing, Report, Discovery Component, Discovery Process in DPML.

12 Knowledge Integration & Interpretation. Dynamic Knowledge Interpretation = cross-reference and verify analysis results against background knowledge. Towards a knowledge integration framework: multi-subject data analysis. Specialised client interfaces: interactive analysis and dynamic component interaction. Result annotation, structuring and storage: information source query, result browsing, sharing and markup. Life science example application: sequence analysis, text mining, genetic analysis, pathway analysis.

13 Workflow execution. Component execution location resolution: User list of known resources. A component can explicitly require to be executed on a particular resource. A component can choose from a set of proposed resources (and could use Grid resource information systems and network weather information to determine where to go). For unconstrained components, a simple “near the data” execution policy applies: if there is a single input data location, execute there; otherwise fall back to the original execution location. Allows usual DPKD workflows to be designed. Handles data management and transfer (serialisation, Java based, FTP based).
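
The resolution rules on this slide can be written down as a short decision function. This is a sketch under assumptions, not the D-Net implementation: the `Component` fields and `resolve_location` are hypothetical names, and "single input data location" is read here as "all inputs live on the same resource".

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Component:
    """Placement-related view of a workflow component (hypothetical structure)."""
    name: str
    required_resource: Optional[str] = None                   # explicit requirement
    chooser: Optional[Callable[[Sequence[str]], str]] = None  # picks from proposals


def resolve_location(
    component: Component,
    known_resources: Sequence[str],
    input_locations: Sequence[str],
    origin: str,
) -> str:
    """Decide where a component runs, following the three cases on the slide."""
    # Case 1: the component explicitly requires a particular resource.
    if component.required_resource is not None:
        if component.required_resource not in known_resources:
            raise LookupError(f"unknown resource: {component.required_resource}")
        return component.required_resource

    # Case 2: the component chooses from the proposed set of resources.
    # A real chooser could consult Grid information or network weather services.
    if component.chooser is not None:
        return component.chooser(known_resources)

    # Case 3: unconstrained component, "near the data" policy.
    distinct = set(input_locations)
    if len(distinct) == 1:
        return distinct.pop()
    return origin  # fall back to the original execution location


resources = ["cluster_a", "cluster_b", "desktop"]

pinned = resolve_location(
    Component("blast", required_resource="cluster_a"), resources, ["cluster_b"], "desktop"
)
chosen = resolve_location(
    Component("cluster", chooser=lambda proposed: sorted(proposed)[-1]),
    resources, ["cluster_a"], "desktop",
)
near_data = resolve_location(Component("filter"), resources, ["cluster_b", "cluster_b"], "desktop")
fallback = resolve_location(Component("merge"), resources, ["cluster_a", "cluster_b"], "desktop")
```

The ordering of the cases matters: an explicit requirement overrides the component's own choice, and the data-locality rule only applies when neither is present.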

14 Discovery Net and Grid technologies. Cluster/Campus Grid level: partial or complete workflow execution on Condor / SGE; task farming on a subset of the workflow. Global Grid: GSI integration (Java CoG Kit); GSI-FTP transfer functionality (Java CoG Kit); OGSA Grid Service access to functionalities (GT3); potential use of GRIS or NWS in component implementation. Globus scheduler? Unicore? SRB?
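
Task farming, as mentioned for the cluster level, means splitting the input of one workflow step into independent pieces and running the same task on each. The sketch below only illustrates that pattern: a local thread pool stands in for a Condor or SGE pool, and `farm`, `split` and `count_gc` are hypothetical helper names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def farm(task: Callable[[T], R], partitions: Sequence[T], workers: int = 4) -> List[R]:
    """Run the same task on every partition; results keep the partition order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(task, partitions))


def split(sequence: str, size: int) -> List[str]:
    """Cut a sequence into consecutive chunks of at most `size` characters."""
    return [sequence[i:i + size] for i in range(0, len(sequence), size)]


def count_gc(chunk: str) -> int:
    """Toy per-chunk task: count G and C bases."""
    return sum(base in "GC" for base in chunk)


sequence = "ACGT" * 6          # 24 bases, 12 of them G or C
chunks = split(sequence, 5)    # chunk sizes: 5, 5, 5, 5, 4
partial_counts = farm(count_gc, chunks)
total_gc = sum(partial_counts)
```

Only steps whose work divides cleanly by partition suit this pattern, which matches the slide's wording about farming a subset of the workflow rather than all of it.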

15 Discovery Net Application Testbeds. Life Science: gene sequencing, protein chips. High-throughput real-time genome annotation testbed: analyse and interpret new sequences using existing distributed bioinformatics tools and databases. Environmental Modelling: pollution sensors (GUSTO): SO2, benzene,.. High-throughput real-time pollution monitoring testbed: analyse and interpret time-resolved correlations among remote stations, and with other environmental data sets. Geo-hazard Prediction: multi-spectral, multi-temporal satellite imagery. Real-time geo-hazard prediction testbed: analyse and interpret satellite images with other data sets to generate thematic knowledge. Image caption: GUSTO units with wireless connectivity.

16 Case Study: SC2002 HPC Challenge. D-Net based Global Collaborative Real-Time Genome Annotation (15 DBs, 21 applications). [Diagram] Input: high-throughput sequencers. Nucleotide-level annotation: identify genes, gene markers, tRNAs, rRNAs, non-translated RNAs, regulatory regions, repetitive elements, segmental duplication, SNP variations, literature references, ….. Tools shown: blast, genscan, Repeat Masker, grail, E-PCR. Protein-level annotation: identify functional characterisation, homologues, domain, 3-D structure, fold prediction, secondary structure, literature references, …..; classify proteins into protein families. Tools shown: 3D-PSSM, blast, Motif Search, PFAM, DSC, predator, InterPro, SMART, SWISS-PROT. Process-level annotation: relate to cell cycle, metabolism, drugs, biological process, cell death, embryogenesis, literature references, …..; ontologies, pathway maps. Tools shown: GeneMaps, AmiGO, GenNav, virtual chip. Other labels: Identify Organism, Chromosomes, Organism’s DNA. Databases shown: NCBI, EMBL, TIGR, SNP, GO, CSNDB, GK, KEGG.

17 Nucleotide Annotation Workflows: How It Works. Download sequence from Reference Server. Save to Distributed Annotation Server. Interactive Editor & Visualisation. Execute distributed annotation workflow. Sources shown: NCBI, EMBL, TIGR, SNP, InterPro, SMART, SWISS-PROT, GO, KEGG. 1800 clicks, 500 Web accesses, 200 copy/paste operations: 3 weeks' work in 1 workflow and a few seconds of execution.

18 Conclusion and Future Work. Towards an open integration platform that enables scientists to conduct their KD activities. Several levels of integration are required. Enable use of available resources. Evolution towards cost model integration (performance, value, QoS). Semantic-based service retrieval and composition. Other useful standards? (OGSA-DAI?)

