1 Workload-Aware Data Partitioning in Community-Driven Data Grids
Tobias Scholl, Bernhard Bauer, Jessica Müller, Benjamin Gufler, Angelika Reiser, and Alfons Kemper
Department of Computer Science, Technische Universität München, Germany

2 Technische Universität München, EDBT 2009 – Workload-Aware Data Partitioning
Should I Split or Replicate?
Many challenges and opportunities in e-science for database research
- High-throughput data management
- Correlation of distributed data sources
Community-driven data grids
- Dealing with data skew and query hot spots
- Workload-awareness by employing a cost model during partitioning

3 Query Load Balancing via Partitioning

4 Query Load Balancing via Partitioning

5 Query Load Balancing via Partitioning

6 Query Load Balancing via Partitioning

7 Query Load Balancing via Replication

8 Query Load Balancing via Replication

9 Query Load Balancing via Replication

10 The AstroGrid-D Project
German Astronomy Community Grid
Funded by the German Ministry of Education and Research
Part of D-Grid

11 Upcoming Data-Intensive Applications
Alex Szalay, Jim Gray (Nature, 2006): Science in an exponential world
Data rates
- Terabytes a day/night
- Petabytes a year
Projects shown: LHC, LSST, LOFAR, Pan-STARRS

12 The Multiwavelength Milky Way

13 Research Challenges
Directly deal with Terabyte/Petabyte-scale data sets
Integrate with existing community infrastructures
High throughput for growing user communities

14 Current Sharing in Data Grids
Data autonomy
Policies allow partners to access data
Each institution ensures
- Availability (replication)
- Scalability
Various organizational structures [Venugopal et al. 2006]:
- Centralized
- Hierarchical
- Federated
- Hybrid

15 Community-Driven Data Grids (HiSbase)

16 Distribute by Region, not by Archive!

17 Distribute by Region, not by Archive!

18 Distribute by Region, not by Archive!

19 Distribute by Region, not by Archive!

20 Mapping Data to Nodes

21 Workload-Aware Training Phase
Incorporate query traces during training phase
Base partitioning scheme on
- Data load
- Query load
Challenges
- Balance query load without losing data load balancing
- Approximate real query hot spots from query sample
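A training phase driven by both data load and query load can be illustrated with a small sketch. The slides do not describe the actual training algorithm, so everything here is an assumption made for illustration: the `Region` class, the greedy "split the heaviest region at its midpoint" strategy, and the use of query centers as the query sample are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Axis-aligned rectangle, half-open: [x0, x1) x [y0, y1). Hypothetical helper."""
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, point):
        x, y = point
        return self.x0 <= x < self.x1 and self.y0 <= y < self.y1

    def split(self):
        """Halve the region along its longer side."""
        if (self.x1 - self.x0) >= (self.y1 - self.y0):
            mid = (self.x0 + self.x1) / 2
            return (Region(self.x0, self.y0, mid, self.y1),
                    Region(mid, self.y0, self.x1, self.y1))
        mid = (self.y0 + self.y1) / 2
        return (Region(self.x0, self.y0, self.x1, mid),
                Region(self.x0, mid, self.x1, self.y1))


def count(region, points):
    """Number of points that fall inside the region."""
    return sum(1 for p in points if region.contains(p))


def train(root, data, queries, weight, n_partitions):
    """Greedy training (assumed strategy): split the heaviest region until
    n_partitions regions exist. `weight` maps (#objects, #queries) to a number."""
    regions = [root]
    while len(regions) < n_partitions:
        heaviest = max(
            regions,
            key=lambda r: weight(count(r, data), count(r, queries)),
        )
        regions.remove(heaviest)
        regions.extend(heaviest.split())
    return regions
```

Passing `lambda d, q: d` as `weight` gives a query-unaware scheme, while `lambda d, q: d * q` concentrates small regions around a query hot spot. The challenge named on the slide shows up directly: a weight that looks only at queries would stop balancing the data load.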

22 Dealing with Query Hot Spots
Query skew triggered by increased interest in particular subsets of the data
Two well-known query load balancing techniques:
- Data partitioning
- Data replication
Finding trade-offs between both

23 When to Split (Partition) or to Replicate
Considers partition characteristics
- Amount of data (few/many data points)
- Number of queries (few/many queries)
- Extent of regions and queries (small/big queries)
Decision table (cell alignment lost in extraction):
Columns: Few Queries (Small, Big) | Many Queries (Small, Big)
Few data points: SPLIT, REPLICATE
Many data points: SPLIT, REPLICATE
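The three partition characteristics can be combined into a simple decision rule. The sketch below is not the authors' cost model: the transcript does not preserve which table cell holds which action, so the mapping is an assumption. It follows the one rule the deck states explicitly ("Replicate when many big queries") and otherwise splits loaded regions. The thresholds, the `KEEP` outcome, and the function name are hypothetical.

```python
def split_or_replicate(n_points, n_queries, avg_query_extent, region_extent,
                       many_points=1000, many_queries=100):
    """Return "REPLICATE", "SPLIT", or "KEEP" for one region.

    Assumed rule: many big queries -> replicate; otherwise a region with
    many data points or many queries is split. Thresholds are illustrative.
    """
    # A query counts as "big" when it covers at least the whole region.
    big_queries = avg_query_extent >= region_extent
    if n_queries >= many_queries and big_queries:
        return "REPLICATE"
    if n_points >= many_points or n_queries >= many_queries:
        return "SPLIT"
    return "KEEP"
```

One plausible reading of the rule: a big query touches every piece of a split region anyway, so splitting adds communication without spreading the load, whereas a replica can absorb whole queries.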

24 Region Weight Functions
Data only (#objects in a region)
Queries only (#queries in a region)
Scaled queries
- Approximate real extent of hot spot
- Avoid overfitting to training query set
Heat of a region (#objects * #queries)
Extents of regions and queries
- Replicate when many big queries
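The count-based weight functions translate directly into code. In the sketch below, the heat formula (#objects * #queries) is taken from the slide; the function names are hypothetical, and `scale_query` assumes that the scaling factor applies to the area of a rectangular query around its center, which the slides do not specify. The extent-based weight is left out because the transcript gives no formula for it.

```python
import math


def weight_data(n_objects, n_queries):
    """Data only: number of objects in the region."""
    return n_objects


def weight_queries(n_objects, n_queries):
    """Queries only: number of queries in the region."""
    return n_queries


def weight_heat(n_objects, n_queries):
    """Heat of a region: #objects * #queries."""
    return n_objects * n_queries


def scale_query(rect, factor):
    """Enlarge a query rectangle (x0, y0, x1, y1) around its center.

    Assumption: `factor` scales the area, so each side grows by sqrt(factor).
    """
    x0, y0, x1, y1 = rect
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    s = math.sqrt(factor)
    half_w = (x1 - x0) / 2 * s
    half_h = (y1 - y0) / 2 * s
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)
```

Counting queries after scaling them spreads each sampled query over its neighborhood, which is one way to approximate the real extent of a hot spot from a small training sample.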

25 Evaluation
Weight functions: data, heat, extent
Data sets (observational, simulation)
Workloads (SDSS query log, synthetic)
Partitioning Scheme Properties
- Load distribution
- Communication overhead
Throughput Measurements
- Distributed setup
- FreePastry simulator

26 Load Distribution
Uniform data set from the Millennium simulation
Workload with extreme hot spot
In the following:
- 1024 partitions
- Heat of a region (#data * #queries)
- Normalized across all partitioning schemes
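Normalizing heat across schemes makes the load-distribution plots comparable. The slides do not say how the normalization is done; the sketch below assumes division by the largest region heat observed in any scheme, and the function name and input layout are hypothetical.

```python
def normalized_heat(schemes):
    """Compute per-region heat and normalize it across all schemes.

    `schemes` maps a scheme name to a list of (#data, #queries) pairs,
    one pair per region. Assumption: normalize by the global maximum heat.
    """
    heat = {
        name: [n_data * n_queries for n_data, n_queries in regions]
        for name, regions in schemes.items()
    }
    peak = max((h for values in heat.values() for h in values), default=0)
    if peak == 0:
        # No region received any query load; every normalized heat is zero.
        return {name: [0.0 for _ in values] for name, values in heat.items()}
    return {name: [h / peak for h in values] for name, values in heat.items()}
```

With a shared scale, a scheme that leaves one region at 1.0 and the rest near 0.0 visibly concentrates the hot spot, while a scheme whose regions all stay well below 1.0 has spread it out.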

27 Query-unaware Training

28 Training with Scaled Queries (scaled 50x)

29 Training with Scaled Queries (scaled 400x)

30 Heat-based, Extent-based Training

31 Communication Overhead for P obs

32 Throughput for P obs

33 Load Balancing During Runtime
Complement workload-aware partitioning with runtime load balancing
Short-term peaks
- Master-slave approach
- Load monitoring
Long-term trends
- Based on load monitoring
- Histogram evolution

34 Related Work
On-line load balancing
Hundreds of thousands to millions of nodes
Reacting fast
Treating objects individually
HiSbase

35 Should I Split or Replicate?
Many challenges and opportunities in e-science for database research
- High-throughput data management
- Correlation of distributed data sources
Community-driven data grids
- Dealing with data skew and query hot spots
- Workload-awareness by employing a cost model during partitioning

36 Get in Touch
Database systems group, TU München
- Web site:
The HiSbase project
- http://www-db.in.tum.de/research/projects/hisbase/
Thank You for Your Attention

37 Queries Intersecting Multiple Regions

38 Regions Without Queries

39 Throughput for P obs (300 nodes, sim.)

40 Throughput for P obs (1000 nodes, sim.)

41 Throughput (Region-Uniform Queries)

