OGSA-DAI Data Access and Integration for the Grid Neil Chue Hong
2 Overview
Motivation
Goals
Partners
Features
Projects
Further information
Overview and demo of FirstDIG/INWA
3 OGSA-DAI Motivation
Entering an age of data
  – Data explosion
      – CERN: the LHC will generate 1 GB/s = 10 PB/y
      – VLBA (NRAO) generates 1 GB/s today
      – Pixar generates 100 TB per movie
  – Storage getting cheaper
Data stored in many different ways
  – Data resources
      – Relational databases
      – XML databases
      – Flat files
Need ways to facilitate
  – Data discovery
  – Data access
  – Data integration
Empower e-Business and e-Science
  – The Grid is a vehicle for achieving this
4 Goals for OGSA-DAI
Aim to deliver application mechanisms that:
  – Meet the data requirements of Grid applications
      – Functionality, performance and reliability
      – Reduce development cost of data-centric Grid applications
      – Provide consistent interfaces to data resources
  – Are acceptable to and supportable by database providers
      – Trustable, imposed demand is acceptable, etc.
      – Provide a standard framework that satisfies standard requirements
A base for developing higher-level services
  – Data federation
  – Distributed query processing
  – Data mining
  – Data visualisation
5 Integration Scenario
A patient moves hospital
Sources: DB2, Oracle, CSV file
  – Data A: (PID, name, address, DOB)
  – Data B: (PID, first_contact)
  – Data C: (PID, first_name, last_name, address, first_contact, DOB)
Result: amalgamated patient record
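A minimal sketch of what "amalgamating" these three layouts could look like in plain Java. It is not OGSA-DAI code. The target layout (PID, name, address, first_contact, DOB) with the name held as a single "First Last" field is an assumption, as are the class and method names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch only; field names follow the slide, everything else is assumed. */
class PatientRecordSketch {

    /** Join a row from A with the row from B that has the same PID. */
    static Map<String, String> fromAandB(Map<String, String> a, Map<String, String> b) {
        if (!a.get("PID").equals(b.get("PID"))) {
            throw new IllegalArgumentException("PID mismatch: rows describe different patients");
        }
        Map<String, String> target = new LinkedHashMap<>();
        target.put("PID", a.get("PID"));
        target.put("name", a.get("name"));
        target.put("address", a.get("address"));
        target.put("first_contact", b.get("first_contact"));
        target.put("DOB", a.get("DOB"));
        return target;
    }

    /** Map a row from C onto the same layout by combining its two name fields. */
    static Map<String, String> fromC(Map<String, String> c) {
        Map<String, String> target = new LinkedHashMap<>();
        target.put("PID", c.get("PID"));
        target.put("name", c.get("first_name") + " " + c.get("last_name"));
        target.put("address", c.get("address"));
        target.put("first_contact", c.get("first_contact"));
        target.put("DOB", c.get("DOB"));
        return target;
    }

    /** Helper: build a row from alternating keys and values. */
    static Map<String, String> row(String... keysAndValues) {
        Map<String, String> m = new LinkedHashMap<>();
        for (int i = 0; i + 1 < keysAndValues.length; i += 2) {
            m.put(keysAndValues[i], keysAndValues[i + 1]);
        }
        return m;
    }

    public static void main(String[] args) {
        // Made-up sample rows, for illustration only.
        Map<String, String> a = row("PID", "7", "name", "Jane Doe",
                "address", "1 High St", "DOB", "1970-01-01");
        Map<String, String> b = row("PID", "7", "first_contact", "2004-02-03");
        System.out.println(fromAandB(a, b));
    }
}
```

In a real deployment the rows would come from the three data resources; here they are hand-built maps so the mapping logic can be read in isolation.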
6 Why OGSA-DAI?
Why use OGSA-DAI over JDBC?
  – Language independence at the client end
      – Do not need to use Java
  – Platform independence
      – Do not have to worry about connection technology and drivers
  – Can handle XML and file resources
  – Can embed additional functionality at the service end
      – Transformations, compression, third-party delivery
      – Avoiding unnecessary data movement
  – Provision of metadata is powerful
  – Usefulness of the Registry for service discovery
      – Dynamic service binding process
  – The quickest way to make data accessible on the Grid
      – Installation and configuration of OGSA-DAI is fast and straightforward
7 Project Partners
Powered by ….
Funded by the Grid Core Programme
OGSA-DAI
  – £3 million, 18 months, from Feb 2002
  – Three major releases, three interim releases
DAIT (DAI-Two)
  – Keep the OGSA-DAI brand name
  – £1.5 million, 24 months, from Oct 2003
  – Four major releases
GGF DAIS WG
  – Strong involvement
  – Standardise the interfaces
  – OGSA-DAI to be a reference implementation
8 Core features
An extensible framework for building applications
  – Supports relational, XML and some file resources
      – MySQL, Oracle, DB2, SQL Server, Postgres, Xindice, CSV, EMBL
  – Supports various delivery options
      – SOAP, FTP, GridFTP, HTTP, files, inter-service
  – Supports various transforms
      – XSLT, ZIP, GZip
  – Supports message-level security using X.509 certificates
  – Client Toolkit library for application developers
  – Comprehensive documentation and tutorials
Third production release is coming in November
  – OGSI/GT3 based
  – Also previews of WS-I and WS-RF/GT4 releases
9 Activities are the drivers
Express a task to be performed by a GDS (Grid Data Service)
Three broad classes of activities:
  – Statement
  – Transformation
  – Delivery
Extensible:
  – Easy to add new functionality
  – Does not require modification to the service interface
  – Extensions operate within the OGSA-DAI framework
Functionality:
  – Implemented at the service
  – Works where the data is (no need to move data back)
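To make the three classes concrete, here is a hedged sketch of a request as a chain of activities that runs entirely on the service side. The `Activity` interface and the `statement`, `toUpperCase`, `deliverTo` and `perform` names are hypothetical, invented for this illustration; they are not the OGSA-DAI classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch of the activity idea; these are NOT OGSA-DAI classes. */
class ActivitySketch {

    /** One unit of work: consumes the previous activity's rows, produces new rows. */
    interface Activity {
        List<String> run(List<String> input);
    }

    /** Statement-style activity: stands in for a query by emitting fixed rows. */
    static Activity statement(List<String> rows) {
        return input -> new ArrayList<>(rows);
    }

    /** Transformation-style activity: rewrites each row where the data lives. */
    static Activity toUpperCase() {
        return input -> {
            List<String> out = new ArrayList<>();
            for (String row : input) {
                out.add(row.toUpperCase(Locale.ROOT));
            }
            return out;
        };
    }

    /** Delivery-style activity: copies the rows to a destination (here, a list). */
    static Activity deliverTo(List<String> sink) {
        return input -> {
            sink.addAll(input);
            return input;
        };
    }

    /** Runs the activities in order, feeding each one the previous output. */
    static List<String> perform(List<Activity> request) {
        List<String> data = new ArrayList<>();
        for (Activity activity : request) {
            data = activity.run(data);
        }
        return data;
    }

    public static void main(String[] args) {
        List<String> sink = new ArrayList<>();
        perform(Arrays.asList(
                statement(Arrays.asList("alice", "bob")),
                toUpperCase(),
                deliverTo(sink)));
        System.out.println(sink); // [ALICE, BOB]
    }
}
```

Adding a new kind of activity here means writing another `Activity`; `perform` does not change, which mirrors the point that extensions do not require changes to the service interface.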
10 OGSA-DAI Deck
11 Client Toolkit
Why? Nobody wants to write XML!
A programming API which makes writing applications easier
  – Now: Java
  – Next: Perl, C, C#?, ML!?

    // Create a query
    SQLQuery query = new SQLQuery(SQLQueryString);
    ActivityRequest request = new ActivityRequest();
    request.addActivity(query);

    // Perform the query
    Response response = gds.perform(request);

    // Display the result
    ResultSet rs = query.getResultSet();
    displayResultSet(rs, 1);
12 Project classification
[Diagram: projects using OGSA-DAI, grouped as Biological Sciences, Physical Sciences, Commercial Applications and Computer Sciences. Projects shown: FirstDIG, INWA, Bridges, AstroGrid, BioSimGrid, BioGrid, eDiaMoND, myGrid, ODD-Genes, N2Grid, GEON, MCS, IU RGBench, OGSA Web-DB, GeneGrid, GridMiner]
13 e-Digital MammOgraphy National Database
Built a prototype of a national database of mammographic images in support of the UK Breast Screening Programme
Employ Grid technologies to facilitate this process
14 [Architecture diagram. Sites: UCL, KCL, UED, CHU. Labels: DB2 Content Manager (x4), DB2 Federation, OGSA-DAI, Database, Files, Core Services (x4), Training Services, Core API, Training API, Core & Training API, Data Load, Training App, Training Application]
15 eDiaMoND
Findings:
  – OGSA-DAI provides a flexible framework
  – Dynamically configure the system through discovery
  – Activities can operate with different levels of granularity
  – Federation can be introduced at various levels
  – Extended Activities to access IBM DB2 Content Manager
16 GeneGrid
Grid-Based Framework for Bioinformatics
  – Virtual Bioinformatics Laboratory
  – Integration of Existing Technologies & Data Sets
  – Gene Study in Silico
  – Develop Specialist Data Sets
  – Grid Services for Commercial or 3rd Party Use
Data resources as XML collections (Xindice), flat files and relational databases (MySQL)
  – OGSA-DAI plus custom extensions
  – Beta testers for file-based activities
17 GeneGrid Architecture
[Architecture diagram. Labels: GeneGrid Portal; GeneGrid Environment; GeneGrid Workflow Manager Service; GeneGrid Process Manager Service; GeneGrid Application Management Registry; GeneGrid Data Manager Registry; GeneGrid Workflow Definition; GeneGrid Workflow Status; GeneGrid Input & Results Parameters; GAM Service (x3); GDM Service (x3); tools: iGAP, TMHMM, Blast, SignalP, mpiBlast; databases: EMBL Database, SwissProt Database, SwissProt DB, EMBL DB; sites: SDSC, BeSC, EBI]
18 Distributed Query Processing
Queries mapped to algebraic expressions for evaluation
Parallelism represented by partitioning queries
  – Use exchange operators
Prototype available from:
[Query plan diagram. Operators shown: table_scan (protein); table_scan termID=S92 (proteinTerm); reduce; hash_join (proteinId); op_call (Blast); reduce; exchange. Partition labels: 3,4 12]
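The role of an exchange operator can be sketched as follows: rows are routed to partitions by hashing the join key, and each partition then runs its own local hash join. This is an illustrative single-process sketch, not the prototype's code; the class name, method names and the row layout (join key in column 0, one payload value in column 1) are assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch of partitioned (exchange-based) hash join; not the prototype's code. */
class PartitionedHashJoinSketch {

    /** Exchange: route each row to a partition chosen by hashing its join key (column 0). */
    static List<List<String[]>> exchange(List<String[]> rows, int partitions) {
        List<List<String[]>> out = new ArrayList<>();
        for (int i = 0; i < partitions; i++) {
            out.add(new ArrayList<String[]>());
        }
        for (String[] row : rows) {
            out.get(Math.floorMod(row[0].hashCode(), partitions)).add(row);
        }
        return out;
    }

    /** Local hash join on column 0: build a table from the left input, probe with the right. */
    static List<String> hashJoin(List<String[]> left, List<String[]> right) {
        Map<String, List<String[]>> table = new HashMap<>();
        for (String[] l : left) {
            table.computeIfAbsent(l[0], k -> new ArrayList<String[]>()).add(l);
        }
        List<String> joined = new ArrayList<>();
        for (String[] r : right) {
            List<String[]> matches = table.get(r[0]);
            if (matches == null) {
                continue;
            }
            for (String[] l : matches) {
                joined.add(l[0] + "," + l[1] + "," + r[1]);
            }
        }
        return joined;
    }

    /** Partition both inputs the same way, join each partition, and merge the results. */
    static List<String> partitionedJoin(List<String[]> left, List<String[]> right, int partitions) {
        List<List<String[]>> leftParts = exchange(left, partitions);
        List<List<String[]>> rightParts = exchange(right, partitions);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < partitions; i++) {
            // In a distributed setting each iteration could run on a different node.
            result.addAll(hashJoin(leftParts.get(i), rightParts.get(i)));
        }
        Collections.sort(result); // fixed order so results are comparable across partition counts
        return result;
    }
}
```

Because both inputs are partitioned with the same hash function, rows with equal keys always land in the same partition, so the merged result matches what a single unpartitioned join would return.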
19 GridMiner
Test application area: medical
  – Traumatic brain injury treatment
  – Predicting the outcome of seriously ill patients
  – Analytical part focuses on data mining and On-Line Analytical Processing (OLAP)
Target:
  – Provide tools to discover and access relevant knowledge and information from different distributed and heterogeneous data sources
  – Building on and extending OGSA-DAI
20 GridMiner Scenario
Heterogeneities:
  – Name in A is "First Last" (the target format)
  – Name in C has to be combined
Distribution:
  – 3 data sources
21 Future work
Architecture review
  – Better concurrency model
  – Better AAA framework
  – Better definition of extensibility points
      – Security, activities, dynamic configuration, mobile code, …
Improved support for
  – WS-Security profiles
  – Stored procedures
  – Data transport
  – XQuery
  – Database-specific datatypes and SQL
Additionally
  – JDBC and ODBC driver for OGSA-DAI
  – Contribution process
22 Further information
The OGSA-DAI Project Site:
The DAIS-WG site:
OGSA-DAI Users Mailing list
  – General discussion on grid DAI matters
Formal support for OGSA-DAI releases
OGSA-DAI training courses
23 Project Membership
[Organisation chart. Roles: Principal Investigators; Project Manager; Programme Management Board Chair; Technical Review Board Chair; Research Team; IBM Dissemination Team; EPCC Team; IBM Development Team. Names shown: Charaka, Tom, Mike, Ally, Amy, Mario, Malcolm, Kostas, Norman, Paul, Neil, Andy, Simon, Dave, Patrick, Neil]
24 The End Questions?
25 INWA Objectives
Innovation Node Western Australia
  – Informing Business & Regional Policy: Grid-enabled fusion of global data and local knowledge
Project
  – Run from Nov Aug 2004
  – Involved 10 partners (6 UK + 4 Australia)
Aim
  – Data mine commercially sensitive data
  – Security an absolute MUST
  – Employ Grid technologies
  – Need access to data and computational resources
Demonstrator using:
  – OGSA-DAI
      – Incorporate data resources
  – Sun DCG's TOG (Transfer-queue Over Globus)
      – Handle job submission to analyse micro array data
26 INWA
[Deployment diagram. Sites: Curtin, Australia and EPCC, UK. Labels: Grid Engine (x2), Bank (x2), Telco (x2), OGSA-DAI, TOG, Data Browser. Data sets: Telco data, Bank data, Australian property, UK property]
27 INWA: Lessons Learned
Performing data integration:
  – TimeZone date problems
Security issues:
  – Bugs in JavaCoG in GT3
  – OGSA-DAI could not switch security for Grid data transfers
  – TOG had no security option
  – All of these have been fixed
Middleware not mature enough for commercial deployment
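The slide does not say what the time zone problem was. One common form, sketched here purely as an assumption, is that the same instant falls on different calendar dates at the two sites, so records keyed by local date fail to line up. The class and method names are hypothetical; the zone IDs are standard tz database names chosen to match the two sites' regions.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneId;

/** Illustrative sketch: one instant, two different local dates. */
class TimeZoneDateSketch {

    /** The calendar date on which the given instant falls in the given zone. */
    static LocalDate localDate(Instant instant, String zoneId) {
        return instant.atZone(ZoneId.of(zoneId)).toLocalDate();
    }

    public static void main(String[] args) {
        Instant t = Instant.parse("2004-03-01T20:00:00Z");
        System.out.println(localDate(t, "Europe/London"));   // 2004-03-01
        System.out.println(localDate(t, "Australia/Perth")); // 2004-03-02
    }
}
```

Storing and exchanging timestamps as instants (or with an explicit offset), and converting to a local date only for display, is one way to avoid this class of mismatch.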
28 Bridges: Biomedical Research Informatics Delivered by Grid Enabled Services
Want a Grid-enabled front end to their software
Want to do a comparative evaluation between
  – IBM's Information Integrator
  – OGSA-DAI
29 Bridges: Data Sources
Edinburgh, Glasgow, Leicester, Oxford, MRC/Imperial, Eindhoven, Maastricht
30 MGICSV IBM Information Integrator MGICSV OGSA-DAI Client
31 FirstDIG
Data mining with the First Transport Group, UK
  – Example: When buses are more than 10 minutes late there is an 82% chance that revenue drops by at least 10%
[Diagram labels: OGSA-DAI, OGSA-DAI Client Application, Data Mining Application]
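A figure like "82% chance" reads as the confidence of a rule of the form "late by more than 10 minutes implies revenue drop of at least 10%". The sketch below shows how such a confidence could be computed from paired observations. It is an assumption about the kind of calculation involved, not FirstDIG's actual method, and the class, method and parameter names are hypothetical.

```java
/** Illustrative sketch of rule confidence; not the FirstDIG data mining code. */
class RuleConfidenceSketch {

    /**
     * Among observations where the bus was more than 10 minutes late, returns the
     * fraction in which revenue changed by -10% or worse.
     */
    static double confidence(double[] minutesLate, double[] revenueChangePct) {
        if (minutesLate.length != revenueChangePct.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        int late = 0;
        int lateAndDrop = 0;
        for (int i = 0; i < minutesLate.length; i++) {
            if (minutesLate[i] > 10.0) {
                late++;
                if (revenueChangePct[i] <= -10.0) {
                    lateAndDrop++;
                }
            }
        }
        if (late == 0) {
            throw new IllegalArgumentException("no observations with lateness above 10 minutes");
        }
        return (double) lateAndDrop / late;
    }
}
```

A mined rule would normally be reported with its support (how many observations satisfy the condition) as well as its confidence, since a high confidence over very few late buses says little.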
32 EdSkyQuery-G Sky Data Sky Data Sky Data Sky Data
33 [Diagram labels: PostgreSQL, MySQL, Xindice, DB2, Oracle, Oracle Federation, DB2, DB2 Federation, Scratch DB (x3), Data Service (x3)]
34 OGSA-DAI Downloads
R4: 690 downloads since May 04
  – Actual user downloads, not search engine crawlers
  – Does not include downloads as part of GT3.2 releases
Total of 838 registered users (7/10/04)

Version (release date)   Downloads
R1.0 (Jan 03)            104
R1.5 (Feb 03)            108
R2.0 (Apr 03)            250
R2.5 (Jun 03)            291
R3.0 (Jul 03)            792
R3.1 (Feb 04)            630
Total (including R4)     2865

Downloads by Country – OGSA-DAI R4.0
Country          Share
United Kingdom   21%
China            26%
United States    13%
Japan            5%
Unknown          7%
Germany          5%
Italy            5%
Austria          2%
Australia        2%
France           3%
Taiwan           2%
35 Users Group
A separate independent body to engage with users and feed back to developers
  – Chair: Prof. Beth Plale of Indiana University
Twice-yearly meetings