unibasel
Toward Replication in Grids for Digital Libraries with Freshness and Correctness Guarantees*
Fuat Akal, Heiko Schuldt and Hans-Jörg Schek
University of Basel, Computer Science Department
Bernoullistr. 16, CH-4056 Basel, Switzerland
3rd VLDB Workshop on Data Management in Grids, Wien, Austria, 23 September 2007
* The work has been partly supported by the EU in the 6th framework programme within the project DILIGENT (contract No. IST ).

Example Scenario
Satellite pictures of the Mediterranean Sea are continuously taken and stored as complex documents in a Digital Library (DL).
A typical activity is to generate periodical reports.
[Figure: Earth Observation documents with image features, storage properties, and metadata as XML documents (sample entry MER_RR__2P MER … tagged World, Europe, Bigger_Europe, Smaller_Europe, Mediterranean, Iberia, North_Atlantic, Africa, North_Africa, Middle_East, Portugal, ...). Access via simple Boolean queries and image similarity queries.]

Watching the Environment Closely
Monitoring of the Mediterranean Sea
There are some busy oil terminals in the region
–Oil tankers are constantly crossing the sea
–Potential oil spill into the sea
Scientist 1 in Athens, Greece: "I am interested in Greek coasts as of last week"
Scientist 2 in Antalya, Turkey: "Fresh Turkish water please"
Both are extremely concerned about the environment!
[Figure: Earth Observation data (satellite images, metadata, image features, ...) held in the Data Grid and accessed by both scientists.]

Desired Replica Management in the Grid
Assumption: All data is collected at a single node, e.g. ESA in Italy
Automatic selection of the best replica from the user's location
Replication at a higher level, e.g. collections, subcollections
Dynamic decision on when/where to create replicas, e.g. when sn 1 becomes a hot spot
Freshness and correctness guarantees on accessed data are ensured, e.g. "I want up-to-date data"
Scientists may also 1) write back their reports and/or 2) create versions of documents or annotate them
A sophisticated replication mechanism is required!
[Figure: Data Grid in which storage node 0 holds the Entire Mediterranean collection (satellite images, metadata, image features, ...) and sn 1, sn 2 and sn 3 hold replicas of Turkish Coasts and Greek Coasts. Users are Scientist 1 in Athens, Greece, Scientist 2 in Antalya, Turkey, and Scientist 3 in Thessaloniki, Greece; a "Create Replica" arrow adds a further Greek Coasts replica.]

Outline
Digital Library built atop a grid middleware
–Rich variety, structure, volume of data, e.g. traditional documents, complex multimedia objects
–Simple Boolean queries as well as sophisticated multi-feature similarity queries
–Consistent access to up-to-date data may be essential
Rest of the talk is...
–Replication in a DB Cluster
–Transition from a DB cluster to the Grid
–DILIGENT Replication Architecture
–Conclusions and Outlook

Replication in a DB Cluster (PDBREP)
Available replication solutions for grid environments do not meet all of the desired properties just mentioned, e.g. freshness and correctness.
In our previous work [VLDB2005], we devised a replication protocol for database clusters named PDBREP.
–It already provides some properties of what we call desired replica management in the Grid, e.g. freshness, higher replication granularity.
Our approach in this work is to start with this protocol and adapt it to the grid.
PDBREP stands for PowerDB Replication, which was a project conducted at ETH Zurich, partially supported by Microsoft.

Replication in a DB Cluster (PDBREP)
[Figure: PDBREP architecture. Updates (U: update(a)) and queries (Q: query(a, b, fr)) enter through the Coordination Middleware. Updates (U: w(a)) run on the Update Node(s) and are recorded in the Global Log. A Continuous Update Broadcast delivers the changes to the Local Update Queue of each Read-only Node. Read-only nodes hold different subsets of the data objects (e.g. a,c or a,b,c,d) and answer queries (Q: r(b), r(a)) by distributed query execution.]
Read-only nodes apply queued changes in two ways:
–Continuous Update Propagation Transactions (only when the node is idle)
–Refresh Transactions (on-demand)
fr: freshness requirement, e.g. "I am fine with 2 minutes old data", "I want fresh data", etc.

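The interplay of the local update queue, idle-time propagation, and on-demand refresh can be illustrated with a minimal sketch. It is not the PDBREP implementation: the class `ReadOnlyNode`, its method names, and the use of integer commit timestamps as the freshness requirement are assumptions made only for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ReadOnlyNode:
    """Hypothetical read-only replica with a local update queue."""
    data: dict = field(default_factory=dict)
    applied_ts: int = 0  # commit timestamp of the last applied update
    queue: deque = field(default_factory=deque)  # pending (commit_ts, key, value)

    def enqueue(self, commit_ts, key, value):
        """Receive a change in global-log order; it is queued, not applied."""
        self.queue.append((commit_ts, key, value))

    def _apply_next(self):
        """Apply the oldest queued update to the local copy."""
        ts, key, value = self.queue.popleft()
        self.data[key] = value
        self.applied_ts = ts

    def propagate_when_idle(self, batch=1):
        """Propagation transaction: apply up to `batch` updates while idle."""
        applied = 0
        while self.queue and applied < batch:
            self._apply_next()
            applied += 1
        return applied

    def read(self, keys, required_ts):
        """Refresh transaction on demand, then read.

        Queued updates with commit_ts <= required_ts are applied first,
        so the query sees data at least as fresh as it asked for.
        """
        while self.queue and self.queue[0][0] <= required_ts:
            self._apply_next()
        return {k: self.data.get(k) for k in keys}


node = ReadOnlyNode()
node.enqueue(1, "a", "v1")
node.enqueue(2, "b", "v1")
node.enqueue(3, "a", "v2")

node.propagate_when_idle()                    # applies only the update with ts 1
stale_ok = node.read(["a", "b"], required_ts=2)  # refreshes up to ts 2
fresh = node.read(["a"], required_ts=3)          # refreshes up to ts 3
```

A query with a loose freshness requirement is served without applying every pending update, while a query asking for fresh data first drains the queue up to the required timestamp.
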
Transition to the Grid
We still distinguish update and read-only nodes
Potentially several update nodes
–We still assume that all updates are serialized into a global log
Broadcast of updates is not feasible; replicas subscribe to changes instead
Service-Oriented Architecture
More nodes, which are heterogeneous
Failures are more likely to happen
[Figure: Updates and queries pass through the Coordination Middleware to the Update Node(s), the Global Log, and the Read-only Nodes.]

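Replacing broadcast with subscription can be sketched as follows. This is only an illustration under assumptions: the class `GlobalLog`, its `subscribe`/`append` methods, and the callback-based delivery are hypothetical and not taken from the DILIGENT code base.

```python
from collections import defaultdict


class GlobalLog:
    """Hypothetical global log that serializes updates and notifies subscribers."""

    def __init__(self):
        self.entries = []                     # (seq, dataset, payload) in serial order
        self.subscribers = defaultdict(list)  # dataset name -> delivery callbacks

    def subscribe(self, dataset, callback):
        """A replica registers interest in changes to one dataset."""
        self.subscribers[dataset].append(callback)

    def append(self, dataset, payload):
        """Serialize an update and deliver it only to interested replicas."""
        seq = len(self.entries) + 1
        entry = (seq, dataset, payload)
        self.entries.append(entry)
        for deliver in self.subscribers[dataset]:
            deliver(entry)
        return seq


log = GlobalLog()
sn2_queue = []  # stands in for the local update queue of a storage node
log.subscribe("Greek Coasts", sn2_queue.append)

log.append("Turkish Coasts", "image-1")  # not delivered to sn2_queue
log.append("Greek Coasts", "image-2")    # delivered to sn2_queue
```

Each replica receives only the changes for the data it holds, while the sequence numbers preserve the single serialization order assumed on the slide.
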
Replication Granularity
The unit of replication is called a DataSet (DS)
–A DataSet can be a collection of documents, a subcollection, or as small as a single document.
–Rule-based definition: information on a specific region, documents not older than 30 days, documents created between date1 and date2, etc.
[Figure: A collection of satellite images and its metadata, divided into Subcollection 1 and Subcollection 2; DataSet 1 and DS 2 are marked over the regions Entire Mediterranean, Turkish Coasts, and Greek Coasts.]

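A rule-based DataSet definition can be pictured as a conjunction of predicates over document metadata. The sketch below is an assumption-laden illustration: the `Document` fields, the rule helpers, and the `dataset` function are hypothetical names, not part of DILIGENT.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Document:
    """Hypothetical metadata record for one stored document."""
    doc_id: str
    region: str
    created: date


def in_region(region):
    """Rule: information on a specific region."""
    return lambda doc, today: doc.region == region


def not_older_than(days):
    """Rule: documents not older than `days` days."""
    return lambda doc, today: today - doc.created <= timedelta(days=days)


def created_between(start, end):
    """Rule: documents created between two dates (inclusive)."""
    return lambda doc, today: start <= doc.created <= end


def dataset(collection, rules, today):
    """A DataSet is every document that satisfies all of its rules."""
    return [d for d in collection if all(rule(d, today) for rule in rules)]


today = date(2007, 9, 23)
collection = [
    Document("d1", "Greek Coasts", date(2007, 9, 20)),
    Document("d2", "Greek Coasts", date(2007, 7, 1)),
    Document("d3", "Turkish Coasts", date(2007, 9, 21)),
]

recent_greek = dataset(
    collection, [in_region("Greek Coasts"), not_older_than(30)], today
)
```

Because a DataSet is defined by rules and not by an explicit list, its membership changes as new documents arrive or old ones age out.
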
DILIGENT Grid Replication Architecture
[Figure: storage nodes sn 1 to sn 5 holding replicas of DataSets DS 1 to DS 4, a Client, the services RMS, RSS and FTS, and an Access History. Storage Node 4 is shown with its Update Queue for DS 4 (entries of the form TS x, W x, DS y, ...), labeled with subscription and continuous propagation.]
Replica Catalog (DataSet : storage nodes)
DS 1 : 1
DS 2 : 2,3
DS 3 : 5
DS 4 : 4
Freshness Repository
DS 1 : …  DS 2 : …, …  DS 3 : …  DS 4 : …
Load Repository
SN 1 : 50%
SN 2 : 25%
SN 3 : 60%
SN 4 : 30%
SN 5 : 50%
Request flow
(1) Read(DS 2 (x), DS 4 (y), 0.6)
(2.1) Locate bestReplicas, followed by (2.2) and (2.3)
(3) Read Data
(4) Log

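One way to read the "Locate bestReplicas" step is as a lookup that combines the three repositories. The sketch below uses the catalog and load values shown on the slide. Everything else is an assumption: the freshness values are invented because the slide's values are not recoverable, the third argument of Read (0.6) is interpreted as a freshness requirement, and the policy (least-loaded replica among those fresh enough, else refresh the least-loaded one) is only one plausible choice.

```python
# Replica Catalog and Load Repository values as shown on the slide.
REPLICA_CATALOG = {"DS 1": [1], "DS 2": [2, 3], "DS 3": [5], "DS 4": [4]}
LOAD = {1: 0.50, 2: 0.25, 3: 0.60, 4: 0.30, 5: 0.50}

# HYPOTHETICAL freshness values in [0, 1], keyed by (dataset, storage node).
FRESHNESS = {("DS 2", 2): 0.5, ("DS 2", 3): 0.9, ("DS 4", 4): 0.7}


def best_replica(dataset, min_freshness):
    """Return (storage_node, needs_refresh) for one dataset.

    Assumed policy: prefer the least-loaded replica that already meets
    the freshness requirement; if none does, pick the least-loaded
    replica and flag it for an on-demand refresh.
    """
    candidates = REPLICA_CATALOG.get(dataset, [])
    if not candidates:
        raise LookupError(f"no replica registered for {dataset}")
    fresh = [
        n for n in candidates
        if FRESHNESS.get((dataset, n), 0.0) >= min_freshness
    ]
    pool = fresh or candidates
    node = min(pool, key=lambda n: LOAD[n])
    return node, node not in fresh


def locate(datasets, min_freshness):
    """Resolve every dataset named in a read request."""
    return {ds: best_replica(ds, min_freshness) for ds in datasets}


plan = locate(["DS 2", "DS 4"], 0.6)
```

With these invented freshness values, the request from step (1) is routed to storage node 3 for DS 2, although node 2 carries less load, because node 2 does not meet the requirement.
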
Conclusions & Outlook
We presented the first steps of our ongoing work, whose ultimate goal is a fully integrated and self-managing replication subsystem for the Grid
We want to adapt an existing database replication mechanism, i.e. PDBREP, from database clusters to data grids
This looks feasible:
–Infrastructure-related assumptions, such as broadcasting changes to replicas, can easily be replaced by a subscription mechanism
–The additional components of the envisioned architecture that facilitate query scheduling can be included in PDBREP without major changes
Implementation of the DILIGENT replication on top of gLite is still ongoing

Thank you! Questions?

References
1. DILIGENT: A DIgital Library Infrastructure on Grid ENabled Technology. IST
2. F. Akal, C. Türker, H.-J. Schek, Y. Breitbart, T. Grabs, and L. Veen. Fine-Grained Replication and Scheduling with Freshness and Correctness Guarantees. In VLDB, pages 565–576, 2005.