Download presentation
Presentation is loading. Please wait.
Published byKyle Howe Modified over 10 years ago
1
Lehrstuhl Informatik III: Datenbanksysteme Community Training: Partitioning Schemes in Good Shape for Federated Data Grids Tobias Scholl, Richard Kuntschke, Angelika Reiser, Alfons Kemper 3 rd IEEE International Conference on e-Science and Grid Computing Bangalore, India December 10 th – 13 th 2007
2
2Community Training Lehrstuhl Informatik III: Datenbanksysteme The AstroGrid-D Project German Astronomy Community Grid http://www.gac-grid.org/ funded by the German Ministry of Education and Research part of the D-Grid
3
3Community Training Lehrstuhl Informatik III: Datenbanksysteme The Multiwavelength Milky Way http://adc.gsfc.nasa.gov/mw/
4
4Community Training Lehrstuhl Informatik III: Datenbanksysteme Data-intensive e-Science Applications Many e-science application areas astrophysics geosciences climatology Combination of various, globally distributed data sources Increasing popularity (within community and public domain): more users Scalability issues with current approaches
5
5Community Training Lehrstuhl Informatik III: Datenbanksysteme Up-coming Data-intensive Applications LOFAR LHC Data rates Terabytes a day/night Petabytes a year LOFAR LSST Pan-STARRS LHC
6
6Community Training Lehrstuhl Informatik III: Datenbanksysteme Current Sharing in Data Grids Images from: Data-Intensive Grid Computing by Srikumar Venugopal Data autonomy Policies allow partners to access data Each institution ensures Availability (replication) Scalability Various organizational structures: Centralized Hierarchical Federated Hybrid
7
7Community Training Lehrstuhl Informatik III: Datenbanksysteme Community-Driven Data Distribution
8
8Community Training Lehrstuhl Informatik III: Datenbanksysteme The Training Phase Extract data from the archives Compute partitioning schemes Compare different partitioning schemes Standard Quadtrees Median-based Quadtrees Zones
9
9Community Training Lehrstuhl Informatik III: Datenbanksysteme Quadtrees Well-known index structure Recursive decomposition Adaptive to data resolution
10
10Community Training Lehrstuhl Informatik III: Datenbanksysteme Quadtree-based Schemes: Splitting Variants Center splitting Always bisects each dimension congruent subareas Splitting points stored or computed Median heuristics Compute median for each dimension independently O(n) median algorithm Splitting points stored
11
11Community Training Lehrstuhl Informatik III: Datenbanksysteme The Zones Index (J. Gray, M. Nieto-Santisteban, A. Szalay) Index structure for databases Specific spatial clustering in zones Optimized for points-in-region queries self-match, cross-match queries Equi-distant partitioning Declination coordinate Zone(ra, dec) = floor((dec + 90.0) / h) Implemented directly in SQL Image by J. Gray et al.
12
12Community Training Lehrstuhl Informatik III: Datenbanksysteme Evaluation Setup 2 data sets: skewed and uniform Size of data sample: 0.01%, 0.1%, 1%, 10% Number of partitions: 4, 16, 64, 256, 1024, 4096, 8192, 16384, 32768, 65536, 131072 (2 4 – 2 17 )
13
13Community Training Lehrstuhl Informatik III: Datenbanksysteme Skewed Training Data (D skew )
14
14Community Training Lehrstuhl Informatik III: Datenbanksysteme Comparing Partitioning Schemes Duration Average data population Variance in partition population Empty partitions Size of the training set Baseline comparison
15
15Community Training Lehrstuhl Informatik III: Datenbanksysteme Average Data Population 10% training sample, D skew avg(# objects in partitions) # objects in biggest partition
16
16Community Training Lehrstuhl Informatik III: Datenbanksysteme Evolution of the Partitioning Scheme 4096 partitions8192 partitions16384 partitions
17
17Community Training Lehrstuhl Informatik III: Datenbanksysteme Empty Leaves 10% training sample, D skew
18
18Community Training Lehrstuhl Informatik III: Datenbanksysteme Size of the Training Set 0.1% training sample, D skew # objects in training set # partitions to be created
19
19Community Training Lehrstuhl Informatik III: Datenbanksysteme Standard Quadtree vs. Median Heuristics
20
20Community Training Lehrstuhl Informatik III: Datenbanksysteme Evaluation – Summary Quadtrees: good adaption to data distribution Quadtrees: Trade-off between data load balancing and uniformly shaped regions Median-based heuristics: best data load balancing even for skewed data sets Zone Index: good for uniform data sets Training set needs to be sufficiently large in order not to artificially create empty partitions
21
21Community Training Lehrstuhl Informatik III: Datenbanksysteme HiSbase Peer-to-Peer layer assigns data partitions to peers Higher flexibility New peers are integrated seamlessly
22
22Community Training Lehrstuhl Informatik III: Datenbanksysteme Query Submission (at Peer d) Determine relevant regions: [1,3] Select coordinator: Region 1 Send CoordinateQuery- message to id1: {[1,3], SQL} Message gets routed to Peer a. 0 1 2 3 4 5 6
23
23Community Training Lehrstuhl Informatik III: Datenbanksysteme Query Coordination (at Peer a) Peer a is coordinator Contact relevant regions Collect intermediate results Send complete result to client select …
24
24Community Training Lehrstuhl Informatik III: Datenbanksysteme Prototype Implementation Java-based prototype FreePastry library (Pastry implementation) Rice University MPI-SWS Presentations at BTW 2007, Aachen, Germany VLDB 2007, Vienna, Austria Deployed in various settings LAN WAN (AstroGrid-D, PlanetLab)
25
25Community Training Lehrstuhl Informatik III: Datenbanksysteme Summary Training phase Community-driven Data Grids Domain-specific partitioning scheme Partitioning scheme supports Data skew Region-based queries Framework for comparing partitioning schemes Various measures with regard to data load balancing
26
26Community Training Lehrstuhl Informatik III: Datenbanksysteme Ongoing Work Database-driven comparison 0.1% of a Petabyte still is 1 Terabyte Feasibility of median-based techniques Workload-aware data partitioning Heterogeneous data nodes
27
27Community Training Lehrstuhl Informatik III: Datenbanksysteme Get in Touch Database systems group, TU München Web site: http://www-db.in.tum.de E-mail: scholl@in.tum.de HiSbase http://www-db.in.tum.de/research/projects/hisbase/ Data stream management Grid-based Data Stream Processing in e-Science (e-Science '06) http://www.gac-grid.de/project-products/Software/ DataStreamManagement.html
28
28Community Training Lehrstuhl Informatik III: Datenbanksysteme AstroGrid-D Research Demo Finding Galaxy Clusters using Grid Computing Technology Room: Banquet Wednesday 3:30 pm to 6:00 pm
Similar presentations
© 2024 SlidePlayer.com Inc.
All rights reserved.