
1 Skew Handling in Aggregate Streaming Queries on GPUs
Georgios Koutsoumpakis (1), Iakovos Koutsoumpakis (1) and Anastasios Gounaris (2)
(1) Uppsala University, Sweden
(2) Aristotle University of Thessaloniki, Greece

2 Talk Outline
1. Setting of our work
2. Our load-balancing framework
3. Load balancing techniques
4. Experimental results
5. Conclusions and future work
ADMS@VLDB 2013, Anastasios Gounaris

3 Target applications
Data-intensive continuous aggregate queries (CQs)
– E.g., continuously report the average share price of each company in all European stock markets.
– They form the basis of many online analysis tasks.
– They implicitly assume a (possibly infinite) data stream.

4 Scalability requirements
CQs may be CPU-intensive due to the
– sheer amount of data;
– possibly complex aggregate tasks.
CQs may also be memory-intensive.
– E.g., continuously report the median share price of each company in all European stock markets in the last 10000 seconds.
– We need to keep all the values within a (sliding) window of appropriate size.
The standard solution is parallelism.
– Partitioned parallelism has been widely investigated and used for CQs.

5 Imbalance problems
In partitioned parallelism, each group is allocated to a distinct processing unit (PU).
If the workload is predictable, we can allocate an equal amount of work to each PU.
But often, it is not!
– E.g., continuously report the median size of the messages originating from each IP address, taking into account the last 10000 messages.
Skew problems arise when groups incur different amounts of workload.

6 Our Goal
Parallelise CQs on GPUs using CUDA.
Balance the load on-the-fly.
– Revise the assignment of groups to PUs.

7 A brief note on CUDA
CUDA stands for "Compute Unified Device Architecture".
It is a general-purpose programming model that makes it easy to run batches of threads on the GPU.
The GPU acts as a dedicated, super-threaded, massively data-parallel co-processor.
A CUDA program alternates between serial host code and parallel device kernels:
Serial code (host) ...
Parallel kernel (device): KernelA<<<nBlk, nTid>>>(args);
Serial code (host)
Parallel kernel (device): KernelB<<<nBlk, nTid>>>(args);
The material of this slide is from David Kirk/NVIDIA and Wen-mei W. Hwu.

8 Talk Outline (repeated from slide 2)

9 Main rationale
Data arrive continuously and are buffered in batches, which are processed in iterations.
CPU responsibilities:
– Prepare the data so that the GPU can access memory in a coalesced manner.
– Detect and correct imbalances.
GPU responsibilities:
– Perform the actual data processing.

10 Mappings on the CPU
We assume a fixed number of threads.
– Each group is fully processed by a single GPU thread.
We keep 2 hashmaps, for the group-to-thread and the thread-to-group mappings:

Group   Thread
1       0
2       2
3       0
4       1
5       2
…       …

Thread  Groups
0       1, 3
1       4
2       2, 5
3       …
4       …
…       …
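The two mappings can be kept consistent by updating both in a single operation. The following is a minimal C++ sketch of that idea, not the authors' implementation: the names GroupThreadMap, assign and slide10Example are hypothetical, and the example values are the ones shown on the slide.

```cpp
#include <algorithm>
#include <unordered_map>
#include <vector>

// Hypothetical helper: keeps the group-to-thread and thread-to-group
// hashmaps in sync whenever a group is (re)assigned.
struct GroupThreadMap {
    std::unordered_map<int, int> groupToThread;
    std::unordered_map<int, std::vector<int>> threadToGroups;

    void assign(int group, int thread) {
        auto it = groupToThread.find(group);
        if (it != groupToThread.end()) {
            // The group already has an owner: detach it from the old thread.
            std::vector<int>& old = threadToGroups[it->second];
            old.erase(std::remove(old.begin(), old.end(), group), old.end());
        }
        groupToThread[group] = thread;
        threadToGroups[thread].push_back(group);
    }
};

// Builds the assignment shown on the slide:
// groups 1 and 3 on thread 0, group 4 on thread 1, groups 2 and 5 on thread 2.
GroupThreadMap slide10Example() {
    GroupThreadMap m;
    m.assign(1, 0);
    m.assign(2, 2);
    m.assign(3, 0);
    m.assign(4, 1);
    m.assign(5, 2);
    return m;
}
```

Moving a group during load balancing is then a single call such as m.assign(5, 1), which removes group 5 from thread 2 and appends it to thread 1.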

11 Operations on the CPU
Data stream: (id:5, attr:1) (id:2, attr:4) (id:3, attr:1) (id:1, attr:5) (id:2, attr:2) (id:6, attr:1) …
Repeat for every batch:
1. Copy the next batch of the streaming data to a new matrix and count the number of tuples of each thread.
   Data matrix: (id:5, attr:1) (id:2, attr:4) (id:3, attr:1) (id:1, attr:5) (id:2, attr:2) (id:6, attr:1)
2. Reorder the data so that the groups of the same thread are together, and create the matrix threadDataIndicator.
   Reordered data matrix:
   thread0: (id:3, attr:1) (id:1, attr:5)
   thread1: (id:2, attr:4) (id:2, attr:2)
   thread2: (id:5, attr:1) (id:6, attr:1)
   threadDataIndicator: 0 2 4
3. Check/correct imbalances.
4. Copy the data to the GPU and launch the kernel.
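The count and reorder steps amount to a counting sort keyed on the owning thread. The sketch below is one way to write it in C++, under two assumptions that the slide suggests but does not state: threadDataIndicator holds the start offset of each thread's slice, and tuples keep their arrival order within a thread. Tuple, ReorderedBatch and reorderBatch are hypothetical names.

```cpp
#include <unordered_map>
#include <vector>

struct Tuple {
    int id;    // group identifier
    int attr;  // value to aggregate
};

struct ReorderedBatch {
    std::vector<Tuple> data;               // tuples laid out thread by thread
    std::vector<int> threadDataIndicator;  // assumed: start offset of each thread's slice
};

ReorderedBatch reorderBatch(const std::vector<Tuple>& batch,
                            const std::unordered_map<int, int>& groupToThread,
                            int numThreads) {
    // Pass 1: count the number of tuples of each thread.
    std::vector<int> counts(numThreads, 0);
    for (const Tuple& t : batch) {
        ++counts[groupToThread.at(t.id)];
    }

    // An exclusive prefix sum of the counts gives the start offsets.
    ReorderedBatch out;
    out.threadDataIndicator.assign(numThreads, 0);
    for (int i = 1; i < numThreads; ++i) {
        out.threadDataIndicator[i] = out.threadDataIndicator[i - 1] + counts[i - 1];
    }

    // Pass 2: scatter every tuple into its thread's slice, preserving order.
    std::vector<int> cursor = out.threadDataIndicator;
    out.data.resize(batch.size());
    for (const Tuple& t : batch) {
        out.data[cursor[groupToThread.at(t.id)]++] = t;
    }
    return out;
}
```

With the six tuples of the slide and groups 1 and 3 on thread 0, group 2 on thread 1, and groups 5 and 6 on thread 2, this yields the reordered matrix shown above and threadDataIndicator = 0 2 4.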

12 Data on the GPU
Copied from the CPU:
– Reordered data matrix:
  thread0: (id:3, attr:1) (id:1, attr:5)
  thread1: (id:2, attr:4) (id:2, attr:2)
  thread2: (id:5, attr:1) (id:6, attr:1)
– threadDataIndicator: 0 2 4
Maintained on the GPU:
– Windows, one row per group (groups 1 to 6; cell boundaries were lost in the transcript): 246839 | 3136390 | 1693410 | 6718211 | 763481 | 248602
– nextPos, one entry per group (groups 1 to 6): 0 1 5 3 5 2
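The Windows matrix and the nextPos array suggest one circular buffer per group, with nextPos marking the slot that the next tuple overwrites. The C++ sketch below emulates that layout on the host for illustration only. It assumes a row-major matrix and an average aggregate; the filled counter and all names are hypothetical additions that do not appear on the slide.

```cpp
#include <vector>

// Host-side emulation of the per-group sliding windows kept on the GPU.
struct GroupWindows {
    int windowSize;
    std::vector<float> values;  // numGroups x windowSize, row-major
    std::vector<int> nextPos;   // per group: slot that the next value overwrites
    std::vector<int> filled;    // per group: number of valid slots (assumption)

    GroupWindows(int numGroups, int w)
        : windowSize(w),
          values(numGroups * w, 0.0f),
          nextPos(numGroups, 0),
          filled(numGroups, 0) {}

    // Overwrites the oldest slot of the group's window and advances nextPos.
    void insert(int group, float v) {
        values[group * windowSize + nextPos[group]] = v;
        nextPos[group] = (nextPos[group] + 1) % windowSize;
        if (filled[group] < windowSize) {
            ++filled[group];
        }
    }

    // Average over the valid part of the group's window (0 if it is empty).
    float average(int group) const {
        if (filled[group] == 0) {
            return 0.0f;
        }
        float sum = 0.0f;
        for (int i = 0; i < filled[group]; ++i) {
            sum += values[group * windowSize + i];
        }
        return sum / filled[group];
    }
};
```

In the framework described here, each GPU thread would apply such updates to the windows of its own groups, reading its slice of the reordered data matrix through threadDataIndicator.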

13 Talk Outline (repeated from slide 2)

14 Operations on the CPU (same diagram as slide 11)

15 Load balancing algorithms - 1
The algorithms try to smooth the differences between the workloads of the threads.
We use two heaps in order to detect tmax and tmin, the most and the least loaded thread, in O(1).
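A max-heap and a min-heap over the per-thread loads expose the two extremes at their tops, which is what makes the lookup O(1). The C++ sketch below illustrates this under the assumption that tmax and tmin denote the most and the least loaded thread. LoadHeaps is a hypothetical name, and for simplicity the heaps are rebuilt from the current loads rather than updated in place, which the slides do not specify.

```cpp
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Two heaps over (load, thread index) pairs: the top of the max-heap is the
// most loaded thread, the top of the min-heap the least loaded one.
struct LoadHeaps {
    using Entry = std::pair<int, int>;  // (load, thread index)

    std::priority_queue<Entry> maxHeap;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> minHeap;

    explicit LoadHeaps(const std::vector<int>& loadPerThread) {
        for (int t = 0; t < static_cast<int>(loadPerThread.size()); ++t) {
            maxHeap.push({loadPerThread[t], t});
            minHeap.push({loadPerThread[t], t});
        }
    }

    int tmax() const { return maxHeap.top().second; }  // O(1)
    int tmin() const { return minHeap.top().second; }  // O(1)

    // Load difference of the most imbalanced pair of threads.
    int imbalance() const { return maxHeap.top().first - minHeap.top().first; }
};
```

For per-thread loads 5, 9, 2 and 7, tmax() is thread 1, tmin() is thread 2, and the imbalance between them is 7.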

16 Load balancing algorithms - 2
getFirst simply chooses the first group upon detection of the most imbalanced pair.
checkAll examines all the groups of the most loaded thread and moves the biggest one.
probCheck makes a probabilistic choice of the biggest group in the most loaded thread.
bestBalance examines all the groups of the most loaded thread and moves the one that leads to the smallest difference in the workload.
shift allows moves of groups only to neighboring threads.
– E.g., the first group of thread 14 can be moved only to thread 13.
shiftLocal does not detect tmax/tmin and checks only adjacent threads.
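Three of these heuristics differ only in which group of the most loaded thread they pick, which a short C++ sketch can make concrete. It is an illustration rather than the authors' code: the Group struct and the function signatures are hypothetical, each function returns an index into the group list of the most loaded thread (or -1), and the rule that bestBalance moves nothing when no move reduces the difference is an added assumption.

```cpp
#include <cstdlib>
#include <vector>

struct Group {
    int id;
    int load;  // e.g. number of tuples of this group in the current batch
};

// getFirst: take the first group of the most loaded thread.
int getFirst(const std::vector<Group>& groups) {
    return groups.empty() ? -1 : 0;
}

// checkAll: scan all groups of the most loaded thread, take the biggest one.
int checkAll(const std::vector<Group>& groups) {
    int best = -1;
    for (int i = 0; i < static_cast<int>(groups.size()); ++i) {
        if (best < 0 || groups[i].load > groups[best].load) {
            best = i;
        }
    }
    return best;
}

// bestBalance: take the group whose move from the most loaded to the least
// loaded thread leaves the smallest load difference between the two threads.
int bestBalance(const std::vector<Group>& groups, int loadMax, int loadMin) {
    int best = -1;
    int bestDiff = loadMax - loadMin;  // assumption: only accept improving moves
    for (int i = 0; i < static_cast<int>(groups.size()); ++i) {
        int diff = std::abs((loadMax - groups[i].load) - (loadMin + groups[i].load));
        if (diff < bestDiff) {
            best = i;
            bestDiff = diff;
        }
    }
    return best;
}
```

For a most loaded thread holding groups with loads 10, 40 and 25 (total 75) and a least loaded thread with load 15, getFirst picks the group with load 10, checkAll the one with load 40, and bestBalance the one with load 25, which leaves the two threads at 50 and 40.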

17 Experimental setting
Two systems used.
– PC1: Intel Core2 Duo E6750 CPU at 2.66 GHz; NVidia 460GTX (GF104) graphics processor at 810 MHz on a PCIe v2.0 x16 slot (5 GB/s transfer rate).
– PC2: Intel P4 550 CPU at 3.4 GHz; NVidia 550GTX Ti (GF116) at 910 MHz on a PCIe v1.1 x16 slot (2.5 GB/s transfer rate).
Three datasets.
– DS1: no imbalance.
– DS2: high imbalance; group sizes follow a zipf distribution.
– DS3: low imbalance; group sizes follow a zipf distribution, but groups are randomly permuted.
Fixed parameters:
– Block size is fixed to 256 threads.
– Batch size is fixed to 50K tuples.
– Window size is 100 and there are always 40K groups.

18 Impact of imbalance
[Figure: PC1 without load balancing, time to process 100M tuples (2K iterations); grid size = 4]

19 High imbalance
[Figure: two panels, grid size = 4 and grid size = 64]
Speedups of up to 4.27 are observed.
Increasing the grid size seems to work, but it is not always applicable!
Simple heuristics perform similarly to (if not better than) the most sophisticated ones.
Less sophisticated and approximate load balancing techniques are more appropriate for GPGPU.
– Basically because they require less computational effort for the balancing itself.

20 Low imbalance
[Figure: two panels, grid size = 4 and grid size = 64]
No technique is actually effective.

21 Talk Outline (repeated from slide 2)

22 Summary
In this work we presented:
1. A GPGPU load balancing framework.
2. Load balancing algorithms.
Lessons learnt:
– Load imbalances can lead to serious performance degradation.
– Under high imbalance, we achieved speedups of more than 4 times.
– Load balancing techniques need not be very sophisticated.
– Small imbalances cannot be tackled.

23 Future Work - Points not considered
Dynamically varying the grid/block/batch size.
Investigation in light of the most recent dynamic parallelism extensions in Kepler architectures.
Handling of cases where the GPU capacity is lower than the data arrival rate.
– Use of approximate/load shedding techniques.

24 Thank you!
… and apologies to all reviewers whose comments have not been addressed due to tight time constraints.

25 Back-up slides - Overheads
[Figure: two panels, grid size = 4 and grid size = 64]
For grid size 4, the CPU operations are (almost) fully hidden.


