Download presentation
Presentation is loading. Please wait.
1
Institute for Software Science – University of ViennaP.Brezany 1 Databases and the Grid Peter Brezany Institute für Scientific Computing University of Vienna Diplomarbeit: Mobile Computing (Dr. Karin Hummel)
2
Institute for Software Science – University of ViennaP.Brezany 2 Introduction This lecture part outlines how databases can be integrated into the Grid. Most existing and proposed applications are file-based. Consequently, there has been little work on how databases can be made available on the Grid for access by distributed applications. If the Grid is to support a wider range of applications, then database integration into the Grid will become important. E.g. many applications in the life and earth sciences, and many business applications are heavily dependent on databases. The database integration into the Grid is not possible just by adopting or adapting the existing Grid services that handle files, as databases offer much richer operations (e.g. queries and transac- tions), and there is much greater heterogeneity between different database management systems than there is between different file systems. There are major differences between database paradigms (e.g. object and relational) – within one paradigm different database products (e.g. Oracle and DB2) vary in their functionality and interfaces.
3
Institute for Software Science – University of ViennaP.Brezany 3 Introduction (2) Existing DB systems do not offer Grid integration. They are however result of many hundreds of person-years of effort that allows them to provide a wide range of functionality, valuable programming interfaces and tools, and important properties such as security, performance, dependability (Verlässlichkeit), and (new development) autonomous behaviour. Because the above attributes will be required by Grid applications, scientists strongly believe that building new Grid- enabled database management systems (DBMSs) from scratch is both unrealistic and a waste of effort. Instead we must consider how to integrate existing DBMSs into the Grid.
4
Institute for Software Science – University of ViennaP.Brezany 4 Basic Grid Database Requirements If database data is to be accessible in Grid applications, then a Database System must support the relevant, existing, existing and emerging, Grid standards; for example the Grid Security Infrastructure (GSI). These standards should bear in mind the requirements of databases. A typical application may consist of a computation that queries one or more databases and carries out further analysis on the retrieved data. If the compute and database systems all conform to the same standards for security, accounting, performance monitoring and scheduling etc. then the task of building such applications will be greatly simplified. Note: A Database System (DBS) is created, using a DBMS, to manage a specific database. The DBS includes any associated application software. Many Grid applications will need to utilize more than one DBS. To reduce the effort required to achieve this, federated databases use a layer of middleware running on top of autonomous databases, to present applications with some degree of integration. Other requirements of applications on Grid-enabled DBMS: scalability, handling unpredictable usage, metadata-driven access, multiple database federation, etc.
5
Institute for Software Science – University of ViennaP.Brezany 5 Integrating Databases into the Grid Two possible approaches: Grid-enabled version of JDBC/ODBC - The core set of functionality offered by JDBC/ODBC does not include a number of operations to fulfill Grid database requirements. Service-based approach shown in the next slide, with a service wrapper placed between the Grid and the DBS
6
Institute for Software Science – University of ViennaP.Brezany 6 Integrating Databases into the Grid (2)
7
Institute for Software Science – University of ViennaP.Brezany 7 Federating Database Systems Across the Grid
8
Institute for Software Science – University of ViennaP.Brezany 8 Grid Database Access
9
Institute for Software Science – University of ViennaP.Brezany 9 Grid Database Access With OGSA-DAI GDS gets a query via Perform Document GDS Engine process specified activities GDS returns results
10
Institute for Software Science – University of ViennaP.Brezany 10 Grid Database Access With OGSA-DAI
11
Institute for Software Science – University of ViennaP.Brezany 11 Grid Data Mediation Service (our own development): Federation of Databases - Architecture
12
Institute for Software Science – University of ViennaP.Brezany 12 GDMS – Example Scenario Heterogeneities: –Name in A is „Alexander Wöhrer“ –Name in C has to be combined Distribution: –3 data sources
13
Institute for Software Science – University of ViennaP.Brezany 13 Literature Sources http://www.cs.man.ac.uk/grid-db/ http://www.gridminer.org/
14
Institute for Software Science – University of ViennaP.Brezany 14 Distributed Query Processing on the Grid A representative query over bioninformatics resources is used as an example. The query accesses 2 databases: (1) the Gene Ontology database GO (www.geneontology.org) stored in a MySQL (www.www.geneontology.org mysql.com) RDBMS; and (2) GIMS, a genome database running on a Polar parallel object DB server. The query also calls a local installation of the BLAST sequence similarity program (www.ncbi.nlm.nih.gov/BLAST/) which, givenwww.ncbi.nlm.nih.gov/BLAST/ a protein sequence, returns a set of structures containing protein Ids and similarity scores. The query identifies proteins that are similar to the human proteins with the GO term 8372.
15
Institute for Software Science – University of ViennaP.Brezany 15 Distributed Query Processing on the Grid select struct(A:c.proteinID, B:Blast(c.sequence)) from c in proteins, d in proteinTerms where d.termID="8372" and c.proteinID=d.proteinID;
16
Institute for Software Science – University of ViennaP.Brezany 16 Distributed Query Processing on the Grid In the query, protein is a class extent in GIMS, while proteinTerm is a table in GO. Therefore, as illustrated in the figure, the infrastructure initiates 2 subqueries: one on GIMS, and the other on GO. The results of these subqueries are then joined in a computation running on the Grid. Finally, each protein in the result is used as a parameter to the call to BLAST. One key opportunity created on the Grid is the flexibility it offers on resource allocation decisions. In our example, machines need to be found to run both the join operator, and the operation call. If it is a danger that the join will be the bottleneck in the query, than it could be allocated to a system with large amounts of main memory so as to reduce IO costs asociated with the management of intermediate results.
17
Institute for Software Science – University of ViennaP.Brezany 17 Distributed Query Processing on the Grid Further, a parallel algorithm could be used to implement the join, so a set of machines acquired on the Grid could each contribute to its execution. Similarly, the BLAST calls could be speeded-up by allocating a set of machines, each of which can run BLAST on a subset of the proteins. The information needed to make these resource allocation decisions comes from 2 sources: (1) the query optimizer estimates the cost of executing each part of a query and so identifies performance critical operations; (2) the Globus Grid infrastructure provides information on available resources.
Similar presentations
© 2025 SlidePlayer.com Inc.
All rights reserved.