
1 Massively Parallel Sort-Merge Joins (MPSM) in Main Memory Multi-Core Database Systems. Martina Albutiu, Alfons Kemper, and Thomas Neumann, Technische Universität München

2 Hardware trends: massive processing parallelism, huge main memory, and non-uniform memory access (NUMA). Our server: 4 CPUs, 32 cores, 1 TB RAM, 4 NUMA partitions.

3 Main memory database systems: VoltDB, SAP HANA, MonetDB. HyPer*: real-time business intelligence queries on transactional data (A. Kemper, T. Neumann, ICDE 2011). MPSM targets efficient OLAP query processing. * http://www-db.in.tum.de/HyPer/

4 How to exploit these hardware trends? Parallelize algorithms and exploit fast main memory access (Kim, Sedlar, Chhugani: Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs, VLDB 2009; Blanas, Li, Patel: Design and Evaluation of Main Memory Hash Join Algorithms for Multi-core CPUs, SIGMOD 2011) AND be aware of fast local vs. slow remote NUMA access.

5 How much difference does NUMA make? [Chart: 100% scaled execution times -- sort: 7440 ms local vs. 22756 ms remote; partitioning: 837 ms sequential vs. 417344 ms synchronized; merge join with sequential read: 1000 ms local vs. 12946 ms remote]

6 Ignoring NUMA ... [Diagram: a single shared hash table spread across NUMA partitions 1-4, accessed concurrently by cores 1-8]

7 The three NUMA commandments. C1: Thou shalt not write thy neighbor's memory randomly -- chunk the data, redistribute, and then sort/work on your data locally. C2: Thou shalt read thy neighbor's memory only sequentially -- let the prefetcher hide the remote access latency. C3: Thou shalt not wait for thy neighbors -- don't use fine-grained latching or locking and avoid synchronization points of parallel threads.

8 Basic idea of MPSM: the private input R and the public input S are each split into chunks, one private R chunk and one public S chunk per worker. [Diagram: R chunks and S chunks]

9 Basic idea of MPSM. C1: work locally -- sort the R chunks and S chunks locally. C2: access the neighbors' data only sequentially. C3: work independently -- each worker sorts and then merge joins without synchronization. [Diagram: each worker sorts its private R chunk and its public S chunk locally, then merge joins (MJ) its sorted R chunk with the sorted S chunks]
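
To make the phase structure concrete, here is a minimal single-machine sketch in C++. It is not the authors' implementation: all names (mpsm_join, by_key, etc.) are illustrative, and the range partitioning of R shown on the following slides is omitted, so each R run is joined against the full S runs. Each worker first sorts its own R and S chunks locally (C1, C3), and in the second phase scans the other workers' sorted S runs only sequentially (C2).

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <thread>
    #include <utility>
    #include <vector>

    struct Tuple { uint64_t key; uint64_t payload; };

    // Hypothetical sketch of the MPSM phases: local sort, then sequential merge join.
    // results must contain one output vector per worker.
    void mpsm_join(std::vector<std::vector<Tuple>>& r_chunks,
                   std::vector<std::vector<Tuple>>& s_chunks,
                   std::vector<std::vector<std::pair<Tuple, Tuple>>>& results) {
        auto by_key = [](const Tuple& a, const Tuple& b) { return a.key < b.key; };
        const size_t workers = r_chunks.size();

        // Phase 1 (C1/C3): every worker sorts its own R and S chunk locally,
        // without touching any other worker's memory.
        std::vector<std::thread> threads;
        for (size_t w = 0; w < workers; ++w) {
            threads.emplace_back([&, w] {
                std::sort(r_chunks[w].begin(), r_chunks[w].end(), by_key);
                std::sort(s_chunks[w].begin(), s_chunks[w].end(), by_key);
            });
        }
        for (auto& t : threads) t.join();  // only a coarse phase barrier, no fine-grained sync
        threads.clear();

        // Phase 2 (C2): each worker merge joins its local R run against every S run;
        // remote S runs are scanned strictly sequentially.
        for (size_t w = 0; w < workers; ++w) {
            threads.emplace_back([&, w] {
                for (const auto& s_run : s_chunks) {
                    size_t i = 0, j = 0;
                    while (i < r_chunks[w].size() && j < s_run.size()) {
                        if (r_chunks[w][i].key < s_run[j].key) ++i;
                        else if (s_run[j].key < r_chunks[w][i].key) ++j;
                        else {
                            // emit all S matches for this R tuple's key
                            for (size_t k = j; k < s_run.size() && s_run[k].key == r_chunks[w][i].key; ++k)
                                results[w].push_back({r_chunks[w][i], s_run[k]});
                            ++i;
                        }
                    }
                }
            });
        }
        for (auto& t : threads) t.join();
    }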

10 Range partitioning of private input R: to constrain merge join work and to provide scalability in the number of parallel workers.

11 Range partitioning of private input R: to constrain merge join work and to provide scalability in the number of parallel workers. [Diagram: the R chunks are range partitioned into range partitioned R chunks]

12 Range partitioning of private input R: to constrain merge join work and to provide scalability in the number of parallel workers. S is implicitly partitioned by sorting. [Diagram: the range partitioned R chunks and the S chunks are each sorted locally]

13 Range partitioning of private input R: to constrain merge join work and to provide scalability in the number of parallel workers. S is implicitly partitioned. [Diagram: after sorting, each worker merge joins (MJ) its range partitioned R chunk with only the relevant parts of the sorted S chunks]
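
Because R was range partitioned, each worker's sorted R run covers only one key range, so within every sorted S run it only has to scan the subrange that can contain join partners. A hedged sketch (function name and bounds handling are assumptions, not the paper's code) that locates this subrange with binary search and then merge joins it:

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct Tuple { uint64_t key; uint64_t payload; };

    // Merge join a range-partitioned, sorted R run covering keys in [lo, hi)
    // against one sorted S run, touching only the relevant part of S.
    template <typename Emit>
    void join_relevant_part(const std::vector<Tuple>& r_run,
                            const std::vector<Tuple>& s_run,
                            uint64_t lo, uint64_t hi,   // key range of this R partition
                            Emit emit) {
        auto key_less = [](const Tuple& t, uint64_t k) { return t.key < k; };
        // Skip everything in S below the partition's key range, stop at its upper bound.
        auto s_begin = std::lower_bound(s_run.begin(), s_run.end(), lo, key_less);
        auto s_end   = std::lower_bound(s_begin, s_run.end(), hi, key_less);

        size_t i = 0;
        for (auto s = s_begin; i < r_run.size() && s != s_end; ) {
            if (r_run[i].key < s->key) ++i;
            else if (s->key < r_run[i].key) ++s;
            else {
                for (auto m = s; m != s_end && m->key == r_run[i].key; ++m)
                    emit(r_run[i], *m);   // all matches for this key
                ++i;
            }
        }
    }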

14 Range partitioning of private input R is time efficient (branch-free, comparison-free, synchronization-free) and space efficient (densely packed, in-place), by using radix-clustering and precomputed target partitions to scatter the data to.

15 Range partitioning of private input. [Worked example: worker W1's chunk contains the keys 9 19 7 3 21 1 17, worker W2's chunk contains 2 23 4 31 8 20 26. Keys are radix-clustered on the most significant bit of the 5-bit key (e.g., 7 = 00111 < 16; 17 = 10001, 19 = 10011 >= 16; 2 = 00010 < 16), giving histograms W1: 4 keys < 16 and 3 keys >= 16, W2: 3 keys < 16 and 4 keys >= 16, and prefix sums W1: 0 / 0 and W2: 4 / 3 as write positions in the target partitions]

16 Range partitioning of private input. [Worked example continued: using the prefix sums, each worker scatters its keys into the precomputed target partitions, so that all keys < 16 land densely packed in the first range partition and all keys >= 16 in the second, without any synchronization between W1 and W2]
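
A minimal sketch of this branch-free, comparison-free style of partitioning (single worker shown; names and the 1-bit radix are assumptions chosen to match the slide's split at key 16; in MPSM the prefix sums of all workers are combined so that the workers scatter into disjoint regions of shared target partitions): build a histogram over the radix of each key, turn it into per-partition write positions via an exclusive prefix sum, then scatter every tuple to its precomputed slot.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct Tuple { uint64_t key; uint64_t payload; };

    // Radix-cluster one worker's chunk into 2^bits range partitions.
    // The histogram/prefix-sum pass precomputes every tuple's target slot,
    // so the scatter pass needs no key comparisons or synchronization.
    std::vector<Tuple> radix_partition(const std::vector<Tuple>& chunk,
                                       unsigned bits, unsigned shift) {
        const size_t fanout = size_t{1} << bits;
        const uint64_t mask = fanout - 1;

        // 1. Histogram: count tuples per radix value.
        std::vector<size_t> hist(fanout, 0);
        for (const Tuple& t : chunk) ++hist[(t.key >> shift) & mask];

        // 2. Exclusive prefix sum: starting write position of each partition.
        std::vector<size_t> pos(fanout, 0);
        for (size_t p = 1; p < fanout; ++p) pos[p] = pos[p - 1] + hist[p - 1];

        // 3. Scatter: copy every tuple to its precomputed, densely packed slot.
        std::vector<Tuple> out(chunk.size());
        for (const Tuple& t : chunk) out[pos[(t.key >> shift) & mask]++] = t;
        return out;
    }

    // For the slide's example: bits = 1, shift = 4 splits the 5-bit keys at 16.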

17 Skew resilience of MPSM. Location skew is implicitly handled. Distribution skew: dynamically computed partition bounds, determined based on the global data distributions of R and S, with cost balancing for sorting R and joining R and S. [Diagram: sample chunks S1, S2, R1, R2 with skewed key distributions]

18 Skew resilience, step 1: global S data distribution. Local equi-height histograms (obtained for free) are combined into a Cumulative Distribution Function (CDF). [Diagram: the sorted S chunks S1 and S2 are combined into a CDF of # tuples over key value]
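
One way to combine the local equi-height histograms into a global CDF is sketched below. This is a simplification under the assumption that each worker reports its bucket boundary keys together with a fixed bucket size; the exact representation in the paper may differ. The idea: merge the boundary keys of all workers and accumulate the bucket sizes, yielding the cumulative number of S tuples up to each key value.

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <utility>
    #include <vector>

    // One equi-height histogram per worker: bucket upper-boundary keys plus the
    // (constant) number of tuples per bucket. Assumed representation, for illustration.
    struct EquiHeightHist {
        std::vector<uint64_t> boundaries;  // sorted upper bounds of the buckets
        size_t bucket_size;                // tuples per bucket
    };

    // Combine the local histograms into a global CDF: for each boundary key, the
    // (approximate) number of S tuples with key <= that boundary.
    std::vector<std::pair<uint64_t, size_t>> build_cdf(const std::vector<EquiHeightHist>& hists) {
        std::vector<std::pair<uint64_t, size_t>> points;  // (key, tuples in that bucket)
        for (const auto& h : hists)
            for (uint64_t b : h.boundaries) points.push_back({b, h.bucket_size});
        std::sort(points.begin(), points.end());          // merge all boundaries by key
        size_t cumulative = 0;
        for (auto& p : points) { cumulative += p.second; p.second = cumulative; }
        return points;                                     // (key value, cumulative # tuples)
    }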

19 Skew resilience, step 2: global R data distribution. Local equi-width histograms as before, but more fine-grained. [Worked example: chunk R1 = 13 4 2 31 20 8 6 yields the histogram <8: 3, [8,16): 2, [16,24): 1, >=24: 1, computed from the radix bits of the keys (e.g., 8 = 01000, 2 = 00010)]

20 Skew resilience, step 3: compute splitters so that the overall workloads are balanced*: greedily combine buckets, thereby balancing each thread's cost for sorting R and for joining R and S. [Diagram: the fine-grained R histogram (buckets 3, 2, 1, 1) is combined with the S CDF to determine the splitters] * Ross and Cieslewicz: Optimal Splitters for Database Partitioning with Size Bounds. ICDT 2009
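
A hedged sketch of this greedy splitter computation (the bucket representation, the linear cost model, and all names here are illustrative assumptions; the paper builds on Ross and Cieslewicz's splitter framework): walk over the fine-grained R histogram buckets, charge each candidate partition the R tuples it must sort plus the S tuples the CDF assigns to the same key range, and close a partition as soon as it reaches the average cost per worker.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // One fine-grained bucket of the global R histogram, together with the number
    // of S tuples falling into the same key range (looked up in the S CDF).
    struct Bucket {
        uint64_t upper_key;  // upper boundary of the bucket's key range
        size_t r_tuples;     // R tuples in this range (from the R histogram)
        size_t s_tuples;     // S tuples in this range (from the S CDF)
    };

    // Greedily combine buckets into `workers` partitions so that every worker gets
    // roughly the same sort-R plus join-R-with-S cost (illustrative linear cost model).
    std::vector<uint64_t> compute_splitters(const std::vector<Bucket>& buckets,
                                            size_t workers) {
        double total = 0;
        for (const Bucket& b : buckets) total += b.r_tuples + b.s_tuples;
        const double target = total / workers;   // average cost per worker

        std::vector<uint64_t> splitters;
        double current = 0;
        for (const Bucket& b : buckets) {
            current += b.r_tuples + b.s_tuples;
            if (current >= target && splitters.size() + 1 < workers) {
                splitters.push_back(b.upper_key);  // close the current partition here
                current = 0;
            }
        }
        return splitters;  // workers - 1 key values separating the range partitions
    }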

21 Performance evaluation. MPSM performance in a nutshell: 160 million records joined per second; 27 billion records joined in less than 3 minutes; scales linearly with the number of cores. Platform: Linux server, 1 TB RAM, 32 cores. Benchmark: join tables R and S with schema {[joinkey: 64 bit, payload: 64 bit]}, dataset sizes ranging from 50 GB to 400 GB.

22 Execution time comparison of MPSM, Vectorwise (VW), and the hash join of Blanas et al.* 32 workers, |R| = 1600 million tuples (25 GB), varying size of S. * S. Blanas, Y. Li, and J. M. Patel: Design and Evaluation of Main Memory Hash Join Algorithms for Multi-core CPUs. SIGMOD 2011

23 Scalability in the number of cores: MPSM and Vectorwise (VW), |R| = 1600 million tuples (25 GB), |S| = 4 x |R|.

24 Location skew. Location skew in R has no effect because of repartitioning. Location skew in S: in the extreme case, all join partners of Ri are found in only one Sj (either local or remote).

25 Distribution skew: anti-correlated data. R: 80% of the join keys are within the 20% high end of the key domain; S: 80% of the join keys are within the 20% low end of the key domain. [Charts: execution times without balanced partitioning vs. with balanced partitioning]

26 Distribution skew: anti-correlated data (continued).

27 Conclusions. MPSM is a sort-based parallel join algorithm. MPSM is NUMA-aware & NUMA-oblivious. MPSM is space efficient (works in-place). MPSM scales linearly in the number of cores. MPSM is skew resilient. MPSM outperforms Vectorwise (4X) and Blanas et al.'s hash join (18X). MPSM is adaptable for disk-based processing (see details in the paper).

28 Massively Parallel Sort-Merge Joins (MPSM) in Main Memory Multi-Core Database Systems. Martina Albutiu, Alfons Kemper, and Thomas Neumann, Technische Universität München. THANK YOU FOR YOUR ATTENTION!

