NUMA-aware algorithms: the case of data shuffling
Yinan Li*, Ippokratis Pandis, Rene Mueller, Vijayshankar Raman, Guy Lohman
*University of Wisconsin - Madison; IBM


1 NUMA-aware algorithms: the case of data shuffling
Yinan Li*, Ippokratis Pandis, Rene Mueller, Vijayshankar Raman, Guy Lohman
*University of Wisconsin - Madison; IBM Almaden Research Center

2 Hardware is a moving target
- Different degrees of parallelism, # sockets and memory hierarchies
- Different types of CPUs (SSE, out-of-order vs in-order, 2- vs 4- vs 8-way SMT, …), storage technologies …
- Very difficult to optimize & maintain data management code for every HW platform
[Figure: example platforms: 2-socket, 4-socket (a), 8-socket, 4-socket (b); Intel-based, POWER-based, Cloud]

3 NUMA effects => underutilize RAM bandwidth
[Figure: sockets with attached memory, connected by QPI links]
Access type | Bandwidth, seq. mem access (12 threads) | Latency, data-dependent random access (1 thread)
Local memory access | 24.7 GB/s | 340 cycles/access (~150 ns/access)
Remote memory, 1 hop | 10.9 GB/s | 420 cycles/access (~185 ns/access)
Remote memory, 2 hops | 10.9 GB/s | 520 cycles/access (~230 ns/access)
Remote memory, 2 hops with cross traffic | 5.3 GB/s | 530 cycles/access (~235 ns/access)
- Sequential accesses are not the final solution
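To make the gap in the table concrete, the short sketch below divides each row by the local-access row. The numbers are copied from the slide; the dictionary layout and the function name `relative_to_local` are illustrative assumptions, not part of the talk.

```python
# Measurements copied from the table on this slide:
# (sequential bandwidth in GB/s, random-access latency in cycles).
measurements = {
    "local": (24.7, 340),
    "remote, 1 hop": (10.9, 420),
    "remote, 2 hops": (10.9, 520),
    "remote, 2 hops, cross traffic": (5.3, 530),
}


def relative_to_local(table):
    """Return {access type: (bandwidth loss factor, latency growth factor)}."""
    local_bw, local_lat = table["local"]
    return {
        name: (local_bw / bw, lat / local_lat)
        for name, (bw, lat) in table.items()
    }


ratios = relative_to_local(measurements)
for name, (bw_factor, lat_factor) in ratios.items():
    print(f"{name:32s} bandwidth /{bw_factor:.2f}  latency x{lat_factor:.2f}")
```

Under these numbers, sequential bandwidth drops by roughly 2.3x for remote memory and by roughly 4.7x with cross traffic, while latency grows by only about 1.2x to 1.6x. That is one way to read the slide's remark that sequential accesses alone are not the final solution.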

4 Use case: data shuffling
- Each of the N threads needs to send data to the N-1 other threads
- Common operation:
  - Sort-merge join
  - Partitioned aggregation
  - MapReduce
    - Both Map and Reduce shuffle data
  - Scatter/gather
- Ignoring NUMA leaves performance on the table

5 NUMA-aware data mgmt. operations
- Tons of work on SMPs & NUMA 1.0
- Sort-merge join [Albutiu et al. VLDB 2012]
  - Favor sequential accesses over random probes
- OLTP on HW Islands [Porobic et al. VLDB 2012]
  - Should we treat multisocket multicores as a cluster?
- There are many different data operations that need similar optimizations

6 Need for primitives
- Kernels used frequently in data management operations
  - E.g. sorting, hashing, data shuffling, …
- Highly optimized software solutions
  - Similar to BLAS
  - Optimized by skilled devs per new HW platform
- Hardware-based solutions
  - Database machines 2.0 (see Bionic DBMSs talk this afternoon)
  - If a kernel is very important, it can be burnt into HW
  - Expensive, but orders of magnitude more efficient (perf., energy)
  - Companies like IBM and Oracle can do vertical engineering

7 Outline
- Introduction
- NUMA 2.0 and related work
- Data shuffling
  - Ring shuffling
  - Thread migration
- Evaluation
- Conclusions

8 Data shuffling & naïve implementation
- Each of the N threads produces N-1 partitions, one for every other thread
- Each thread needs to read its partitions
  - N * (N-1) transfers
- Assume uniform sizes of partitions
[Figure: partition layout before and after the shuffle]
- Naïve implementation
  - Each thread acting autonomously: for (thread=0; thread …
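The loop on this slide is cut off in the transcript, so the following is only a sketch of what an uncoordinated shuffle could look like. It assumes every thread walks the source threads in the same order 0, 1, ..., N-1 and that threads advance roughly in lock-step. The function names are hypothetical, and the code models the access schedule only, not the actual memory copies.

```python
from collections import Counter


def naive_schedule(n_threads):
    """Every thread visits the sources in the same order 0..N-1 (assumed)."""
    return {t: list(range(n_threads)) for t in range(n_threads)}


def count_transfers(schedule):
    """Count thread-to-thread transfers; a thread's own slot is not a transfer."""
    return sum(
        1 for t, order in schedule.items() for src in order if src != t
    )


def max_readers_per_step(schedule):
    """For each step, the largest number of threads reading one source at once."""
    n_steps = len(next(iter(schedule.values())))
    worst = []
    for step in range(n_steps):
        readers = Counter(
            order[step] for t, order in schedule.items() if order[step] != t
        )
        worst.append(max(readers.values(), default=0))
    return worst


schedule = naive_schedule(8)
print(count_transfers(schedule))       # N * (N-1) transfers
print(max_readers_per_step(schedule))  # readers piling onto one source per step
```

With 8 threads the sketch reports 56 transfers, matching N * (N-1), and at every step all 7 other threads read from the same source thread. Under the stated lock-step assumption, that concentrates the load on a single memory path while the others sit idle.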

9 Shuffling naively in a NUMA system
[Figure: naïve uncoordinated shuffling, steps 1, 2, 3, …, threads T0 to T7; usage of QPI and memory paths]
- BUT we bought 4 memory channels and 6 QPIs
- Need to orchestrate threads/transfers to utilize the rest
[Chart labels: max mem. BW of 1 channel; aggr. BW of all channels]

10 Ring shuffling
- Devise a global schedule and all threads follow it
  - Inner ring: partitions ordered by thread number, socket; stationary
  - Outer ring: threads ordered by socket, thread number; rotates
- Can be executed in lock-step or loosely
- Needs:
  - Thread binding & synchronization
  - Control location of mem. allocations
[Figure: two concentric rings; outer ring of threads s0.t0, s0.t1, s1.t0, s1.t1, s2.t0, s2.t1, s3.t0, s3.t1; inner ring of partitions s0.p0, s1.p0, s2.p0, s3.p0, s0.p1, s1.p1, s2.p2, s2.p3]
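The two rings described on this slide can be sketched as a small schedule generator. This is an illustration under assumptions, not the authors' implementation: the inner ring lists partition slots ordered by thread number and then socket, the outer ring lists threads ordered by socket and then thread number, and the outer ring is assumed to advance by exactly one position per step. All function names are hypothetical, and a slot `(s, p)` is taken to mean the data held by producer `p` on socket `s`.

```python
from collections import Counter


def inner_ring(n_sockets, threads_per_socket):
    """Partition slots ordered by thread number, then socket (stationary)."""
    return [(s, p) for p in range(threads_per_socket) for s in range(n_sockets)]


def outer_ring(n_sockets, threads_per_socket):
    """Threads ordered by socket, then thread number (this ring rotates)."""
    return [(s, t) for s in range(n_sockets) for t in range(threads_per_socket)]


def ring_schedule(n_sockets, threads_per_socket):
    """Return one dict per step, mapping thread (s, t) to partition slot (s, p)."""
    inner = inner_ring(n_sockets, threads_per_socket)
    outer = outer_ring(n_sockets, threads_per_socket)
    n = len(inner)
    steps = []
    for step in range(n):
        # Rotate the outer ring by one position per step (assumed step size).
        steps.append({outer[i]: inner[(i + step) % n] for i in range(n)})
    return steps


def readers_per_socket(step_assignment):
    """How many threads read from each socket's memory in one step."""
    return Counter(slot[0] for slot in step_assignment.values())


steps = ring_schedule(n_sockets=4, threads_per_socket=2)
for number, assignment in enumerate(steps[:2], start=1):
    print(number, assignment, dict(readers_per_socket(assignment)))
```

In this sketch every step is a one-to-one assignment of threads to partition slots, so each socket's memory serves exactly `threads_per_socket` readers per step, and after N steps every thread has visited every slot. Whether the steps run in lock-step or loosely, and how threads and allocations are pinned, is left out here; the slide lists those as separate requirements.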

11 Ring shuffling in action
[Figure: ring shuffling, steps 1, 2, 3, …, threads T0 to T7; usage of QPI and memory paths]
- Orchestrated traffic utilizes underlying QPI network
[Chart label: aggr. BW of all channels]

12 Thread migration instead of shuffling
- Move computation to the data instead of shuffling the data
- Convert accesses to local memory reads
- Choice of migrating only the thread or thread + state
- But both are very sensitive to the amount of thread state
[Chart label: aggr. BW of all channels]
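The sensitivity to thread state can be illustrated with a deliberately simple byte-count model. This is an assumption made for illustration, not a model from the talk: it treats every transfer as remote, ignores migration overhead other than moving the state, and assumes uniform partition and state sizes. The function names and the example sizes are hypothetical.

```python
def bytes_moved_by_shuffling(n_threads, partition_bytes):
    """Each thread pulls one partition from each of the other N-1 threads."""
    return n_threads * (n_threads - 1) * partition_bytes


def bytes_moved_by_migration(n_threads, state_bytes):
    """Each thread visits the other N-1 locations and carries its state along."""
    return n_threads * (n_threads - 1) * state_bytes


def migration_moves_less(n_threads, partition_bytes, state_bytes):
    """True when carrying the state costs fewer bytes than moving the data."""
    return bytes_moved_by_migration(
        n_threads, state_bytes
    ) < bytes_moved_by_shuffling(n_threads, partition_bytes)


# Hypothetical sizes: 1 MiB partitions, 4 KiB versus 16 MiB of thread state.
print(migration_moves_less(8, 1 << 20, 4 << 10))
print(migration_moves_less(8, 1 << 20, 16 << 20))
```

In this model migration only pays off while the state a thread carries is smaller than the partition it would otherwise fetch, which is consistent with the slide's caveat about the amount of thread state.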

13 Outline
- Introduction
- NUMA 2.0 and related work
- Data shuffling
- Evaluation
- Conclusions

14 Shuffling benchmark – peak bandwidth
[Chart: peak shuffling bandwidth on two machines, with speedup annotations 3x, ~4x and 2x]
- IBM x sockets x 8 cores, Intel X7560 Nehalem-EX, fully connected QPI
- IBM x sockets x 8 cores, Intel X7560 Nehalem-EX

15 Exploiting ring shuffling in joins
- Implemented the algorithm of Albutiu et al.
  - Sort-merge-based join implementation
- Small overall perf. improvement because the join is dominated by sorting

16 Shuffling vs migration for aggregation
- Partitioning-based aggregation
- Potential of thread migration when the thread state is small

17 Conclusions
- Hardware is a moving target
- Need for primitives for data management operations
  - Highly optimized SW or HW implementations
  - BLAS for DBMSs
- Data shuffling can be up to 3x faster if NUMA-aware
  - Needs binding of memory allocations, thread scheduling …
  - Potential of thread migration
- Improved overall performance of optimized joins and aggregations
- Continue investigating primitives, their implementation and exploitation
- Looking for motivated summer interns!
Questions?

18 Backup slides

19 Shuffling data - scalability
[Chart: shuffling scalability; machine label: IBM x sockets x 8 cores, fully connected QPI]

20 Shuffling vs migration for aggregation - breakdown
- Partitioning-based aggregation

21 Naïve vs ring shuffling
[Figure: naïve uncoordinated shuffling next to coordinated shuffling, iterations 1, 2, 3, …, threads T0 to T7; usage of QPI and memory paths]

