Multi-Way Hash Join Effectiveness, M.Sc. Thesis, Michael Henderson, Supervisor: Dr. Ramon Lawrence


2 Multi-Way Hash Join Effectiveness
M.Sc. Thesis
Michael Henderson
Supervisor: Dr. Ramon Lawrence

3 Outline
Motivation
Database Terminology
Background
Joins
Multi-Way Joins
Thesis Questions
Experimental Results
Conclusions

4 Motivation
Data is everywhere
Governments collect data on citizens
Facebook collects data on over 1 billion people
Wal-Mart and Target collect sales data on all their customers
The goal is to make answering the big questions
  – Possible
  – Faster

5 Database Terminology: Relations (Tables)
Part relation
partkey  name    retailprice
1        Box     0.50
2        Hat     25.00
3        Bottle  2.50
Lineitem relation
linenumber  partkey  quantity  saleprice
1           1        1         0.50
2           1        1         0.50
3           2        3         22.50
4           3        15        2.50
A row is a tuple, a column is an attribute, and the column headings are the attribute names
The tables are related through their partkey attributes

6 Database Terminology II: SQL
Structured Query Language (SQL)
Used to ask the database questions about the data
Standardized
Example: SQL for retrieving all rows from the Part table
SELECT * FROM Part;

7 Database Terminology III: Join
Joins are used to combine the data in database tables
Joins are slow
We want joins to be faster

8 Background

9 What Makes Queries Slow?
All the data must be read to give an accurate answer
Data is usually much larger than what can fit in memory
Operations such as filtering, ordering, and joins are costly
A join is especially costly
  – May need to compare every row of one table with every row of the other: O(n^2)
  – May need to perform many slow disk operations (I/Os)

10 Background: Example Join Query
SQL: SELECT * FROM Part p, Lineitem l WHERE p.partkey = l.partkey;
Relational algebra: Part ⋈ Lineitem on p.partkey = l.partkey
Join results
partkey  name    retailprice  linenumber  partkey  quantity  saleprice
1        Box     0.50         1           1        1         0.50
1        Box     0.50         2           1        1         0.50
2        Hat     25.00        3           2        3         22.50
3        Bottle  2.50         4           3        15        2.50

11 Nested Loop Join
[Animation: each Part row is compared against every Lineitem row; pairs with matching partkey are appended to Results]
Results
partkey  name    retailprice  linenumber  partkey  quantity  saleprice
1        Box     0.50         1           1        1         0.50
1        Box     0.50         2           1        1         0.50
2        Hat     25.00        3           2        3         22.50
3        Bottle  2.50         4           3        15        2.50
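
The nested loop join animated on this slide can be sketched in a few lines of Python. This is only an illustrative sketch, not the thesis implementation: the function name `nested_loop_join` and the tuple layout are assumptions, and the 0.50 sale prices on the first two Lineitem rows are read from the example tables, where the slide extraction is partly unclear.

```python
# Rows taken from the Part and Lineitem example tables.
# Assumption: the first two Lineitem rows both have saleprice 0.50.
part = [
    (1, "Box", 0.50),      # (partkey, name, retailprice)
    (2, "Hat", 25.00),
    (3, "Bottle", 2.50),
]
lineitem = [
    (1, 1, 1, 0.50),       # (linenumber, partkey, quantity, saleprice)
    (2, 1, 1, 0.50),
    (3, 2, 3, 22.50),
    (4, 3, 15, 2.50),
]


def nested_loop_join(outer, inner, outer_key, inner_key):
    """Compare every outer row with every inner row: O(n * m) comparisons."""
    results = []
    for o in outer:
        for i in inner:
            if o[outer_key] == i[inner_key]:
                results.append(o + i)  # concatenate the two matching tuples
    return results


joined = nested_loop_join(part, lineitem, outer_key=0, inner_key=1)
for row in joined:
    print(row)
```

With 3 Part rows and 4 Lineitem rows the loop performs 12 comparisons to produce 4 result rows, which is the quadratic behaviour the earlier slide refers to.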

12 Dynamic Hash Join
Part is split into three partitions
Hash function: partition = ((partkey - 1) mod 3) + 1
  partkey 1: ((1 - 1) mod 3) + 1 = 1
  partkey 2: ((2 - 1) mod 3) + 1 = 2
  partkey 3: ((3 - 1) mod 3) + 1 = 3
Part1 (partkey, name, retailprice): (1, Box, 0.50)
Part2 (partkey, name, retailprice): (2, Hat, 25.00)
Part3 (partkey, name, retailprice): (3, Bottle, 2.50)
[Diagram labels: Three Part Partitions; Saved to disk]

13 Dynamic Hash Join
Lineitem is split into three partitions with the same hash function: partition = ((partkey - 1) mod 3) + 1
Part1 (partkey, name, retailprice): (1, Box, 0.50)
Lineitem (linenumber, partkey, quantity, saleprice): (1, 1, 1, 0.50), (2, 1, 1, 0.50), (3, 2, 3, 22.50), (4, 3, 15, 2.50)
[Animation: Lineitem rows 1 and 2 (partkey 1) are matched with Part1 and added to Results; rows 3 and 4 go to partitions Lineitem2 and Lineitem3]
Results so far
partkey  name  retailprice  linenumber  partkey  quantity  saleprice
1        Box   0.50         1           1        1         0.50
1        Box   0.50         2           1        1         0.50

14 Dynamic Hash Join
Part2 (partkey, name, retailprice): (2, Hat, 25.00)
Lineitem2 (linenumber, partkey, quantity, saleprice): (3, 2, 3, 22.50)
[Animation: each remaining Part partition is joined with the matching Lineitem partition]
Results
partkey  name    retailprice  linenumber  partkey  quantity  saleprice
1        Box     0.50         1           1        1         0.50
1        Box     0.50         2           1        1         0.50
2        Hat     25.00        3           2        3         22.50
3        Bottle  2.50         4           3        15        2.50
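
The partition-then-join idea on the Dynamic Hash Join slides can be sketched as follows. This is a simplified partitioned hash join, not a full dynamic hash join: it keeps every partition in a Python list and does not model memory limits or flushing partitions to disk. The names `partition_of`, `partition_rows`, and `partitioned_hash_join` are hypothetical, and the 0.50 sale prices on the first two Lineitem rows are an assumption read from the example tables.

```python
from collections import defaultdict

NUM_PARTITIONS = 3

part = [(1, "Box", 0.50), (2, "Hat", 25.00), (3, "Bottle", 2.50)]
lineitem = [(1, 1, 1, 0.50), (2, 1, 1, 0.50), (3, 2, 3, 22.50), (4, 3, 15, 2.50)]


def partition_of(key):
    # Hash function from the slides: ((key - 1) mod 3) + 1
    return ((key - 1) % NUM_PARTITIONS) + 1


def partition_rows(rows, key_index):
    """Split rows into partitions by hashing the join key."""
    parts = defaultdict(list)
    for row in rows:
        parts[partition_of(row[key_index])].append(row)
    return parts


def partitioned_hash_join(build, probe, build_key, probe_key):
    build_parts = partition_rows(build, build_key)
    probe_parts = partition_rows(probe, probe_key)
    results = []
    for p in range(1, NUM_PARTITIONS + 1):
        # Build an in-memory hash table for one build partition at a time.
        table = defaultdict(list)
        for b in build_parts.get(p, []):
            table[b[build_key]].append(b)
        # Probe it with the probe partition that has the same number.
        for r in probe_parts.get(p, []):
            for b in table.get(r[probe_key], []):
                results.append(b + r)
    return results


joined = partitioned_hash_join(part, lineitem, build_key=0, probe_key=1)
for row in joined:
    print(row)
```

Because both inputs use the same hash function on partkey, rows that can match always land in partitions with the same number, so each pair of partitions can be joined independently.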

15 Join Three Tables
SELECT A.a_key, B.b_key, C.c_key FROM A, B, C WHERE A.a_key = B.a_key AND A.a_key = C.a_key;
[Diagrams: a Left Deep Plan and a Right Deep Plan, each built from two binary joins over A, B, and C with the predicates A.a_key = B.a_key and A.a_key = C.a_key]

16 Multi-way Hash Joins
Join multiple relations at the same time
Share memory across the entire join
Produce a result by combining tuples from all relations
Do not have to repartition intermediate results
Fewer disk operations
[Diagram: Multi-way Plan, a single join of A, B, and C on A.a_key = B.a_key and A.a_key = C.a_key]

17 Hash Teams
Multi-way hash join
Hash Teams joins relations on a common attribute

18 Hash Teams Example
SELECT A.a_key, B.b_key, C.c_key FROM A, B, C WHERE A.a_key = B.a_key AND A.a_key = C.a_key;
A (a_key): 1, 2, 3
B (b_key, a_key): (1, 1), (2, 2), (3, 3), (4, 1), (5, 2)
C (c_key, a_key): (1, 3), (2, 1), (3, 2), (4, 2), (5, 1)

19 Partitioning A and B
Hash function: partition = ((a_key - 1) mod 3) + 1
  a_key 1: ((1 - 1) mod 3) + 1 = 1
  a_key 2: ((2 - 1) mod 3) + 1 = 2
  a_key 3: ((3 - 1) mod 3) + 1 = 3
A partitions (a_key): A1 = {1}, A2 = {2}, A3 = {3}

20 Partitioning A and B
B is partitioned on a_key with the same hash function
B partitions (b_key, a_key): B1 = {(1, 1), (4, 1)}, B2 = {(2, 2), (5, 2)}, B3 = {(3, 3)}

21 Processing C
Hash function: partition = ((a_key - 1) mod 3) + 1
In memory: A1 = {1}, B1 = {(1, 1), (4, 1)}
C (c_key, a_key): (1, 3), (2, 1), (3, 2), (4, 2), (5, 1)
[Animation: C tuples with a_key 1 probe A1 and B1 and produce results immediately; the other C tuples are written to disk partitions C2 and C3]

22 Processing C
Disk partitions (c_key, a_key): C2 = {(3, 2), (4, 2)}, C3 = {(1, 3)}
Results
a_key  b_key  c_key
1      1      2
1      4      2
1      1      5
1      4      5
2      2      3
2      5      3
2      2      4
2      5      4
3      3      1
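
The Hash Teams example on the preceding slides can be reproduced with a short sketch. It is an illustration under simplifying assumptions rather than the thesis code: all partitions are kept in memory, a_key is assumed to be unique in A (as it is in the example), and the name `hash_team_join` is hypothetical.

```python
from collections import defaultdict

A = [1, 2, 3]                                   # a_key
B = [(1, 1), (2, 2), (3, 3), (4, 1), (5, 2)]    # (b_key, a_key)
C = [(1, 3), (2, 1), (3, 2), (4, 2), (5, 1)]    # (c_key, a_key)


def partition_of(a_key, n=3):
    # Hash function from the slides: ((a_key - 1) mod 3) + 1
    return ((a_key - 1) % n) + 1


def hash_team_join(a_rows, b_rows, c_rows, n=3):
    """Three-way join on the shared attribute a_key, one partition at a time."""
    a_parts, b_parts, c_parts = defaultdict(list), defaultdict(list), defaultdict(list)
    for a_key in a_rows:
        a_parts[partition_of(a_key, n)].append(a_key)
    for row in b_rows:
        b_parts[partition_of(row[1], n)].append(row)
    for row in c_rows:
        c_parts[partition_of(row[1], n)].append(row)

    results = []
    for p in range(1, n + 1):
        a_keys = set(a_parts[p])            # assumes a_key is unique in A
        b_by_key = defaultdict(list)
        for b_key, a_key in b_parts[p]:
            b_by_key[a_key].append(b_key)
        # Each C tuple probes A and B together; no intermediate result is
        # partitioned again between the two joins.
        for c_key, a_key in c_parts[p]:
            if a_key in a_keys:
                for b_key in b_by_key.get(a_key, []):
                    results.append((a_key, b_key, c_key))
    return results


results = hash_team_join(A, B, C)
for row in results:
    print(row)
```

All three relations are partitioned once on the same attribute, which is what lets a single pass over each partition produce the final (a_key, b_key, c_key) tuples.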

23 Generalized Hash Teams (GHT)
Extends Hash Teams
Does not need the join attributes to be the same
Uses indirect partitioning
Needs an in-memory map to indirectly join relations

24 GHT Partition Maps
Use join memory
Use a bitmap to approximate the mapping and reduce memory requirements
Need a bitmap for each partition
Bitmaps introduce mapping errors that cause tuples to be mapped to multiple partitions (false drops)
False drops add I/O and processing cost

25 GHT Example
SELECT c.custkey, o.orderkey, l.partkey FROM Customer c, Orders o, Lineitem l WHERE c.custkey = o.custkey AND o.orderkey = l.orderkey;
Customer (custkey): 1, 2, 3
Orders (orderkey, custkey): (1, 1), (2, 2), (3, 3), (4, 1), (5, 2)
Lineitem (orderkey, partkey): (1, 1), (1, 2), (2, 3), (2, 4), (3, 1), (3, 8), (4, 5), (4, 6), (5, 4)

26 GHT Customer Partitions
Hash function: partition = ((custkey - 1) mod 3) + 1
Customer partitions (custkey): Customer 1 = {1}, Customer 2 = {2}, Customer 3 = {3}

27 Orders Partitions and Bitmap
Hash function: partition = ((custkey - 1) mod 3) + 1
Bitmap index = (orderkey + 1) mod 4
Orders partitions (orderkey, custkey): Orders 1 = {(1, 1), (4, 1)}, Orders 2 = {(2, 2), (5, 2)}, Orders 3 = {(3, 3)}
[Animation: each bitmap starts as 0 0 0 0; every Orders tuple sets the bit at its index in the bitmap of its partition]
Final bitmaps (index 0 to 3): B1 = 0 1 1 0, B2 = 0 0 1 1, B3 = 1 0 0 0

28 Orders Partitions and Bitmap
Index  B1  B2  B3
0      0   0   1
1      1   0   0
2      1   1   0
3      0   1   0

29 Lineitem Partitions with False Drops
Bitmap index = (orderkey + 1) mod 4
Bitmaps (index 0 to 3): B1 = 0 1 1 0, B2 = 0 0 1 1, B3 = 1 0 0 0
Lineitem partitions (orderkey, partkey)
Lineitem 1 = {(1, 1), (1, 2), (4, 5), (4, 6), (5, 4)}
Lineitem 2 = {(1, 1), (1, 2), (2, 3), (2, 4), (5, 4)}
Lineitem 3 = {(3, 1), (3, 8)}
False drops: (5, 4) in Lineitem 1; (1, 1) and (1, 2) in Lineitem 2

30 Joining the Partitions
Customer 1 (custkey): 1
Orders 1 (orderkey, custkey): (1, 1), (4, 1)
Lineitem 1 (orderkey, partkey): (1, 1), (1, 2), (4, 5), (4, 6), (5, 4)
[Animation: each Lineitem 1 tuple probes Orders 1 and Customer 1; (5, 4) finds no match and is discarded as a false drop]
Results
custkey  orderkey  partkey
1        1         1
1        1         2
1        4         5
1        4         6
2        2         3
2        2         4
2        5         4
3        3         1
3        3         8
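
The bitmap-based indirect partitioning in the GHT example can be sketched as below. This is a minimal in-memory illustration, not the thesis implementation: the variable names are hypothetical, the bitmaps are plain Python lists of 0/1 values, and no disk I/O is modelled. It uses the hash function and bitmap index function stated on the slides.

```python
from collections import defaultdict

customer = [1, 2, 3]                                   # custkey
orders = [(1, 1), (2, 2), (3, 3), (4, 1), (5, 2)]      # (orderkey, custkey)
lineitem = [(1, 1), (1, 2), (2, 3), (2, 4), (3, 1),
            (3, 8), (4, 5), (4, 6), (5, 4)]            # (orderkey, partkey)

N, BITS = 3, 4


def partition_of(custkey):
    return ((custkey - 1) % N) + 1      # hash function from the slides


def bit_index(orderkey):
    return (orderkey + 1) % BITS        # bitmap index from the slides


# Customer and Orders are partitioned directly on custkey; each Orders
# tuple also sets one bit in the bitmap of its partition.
customer_parts = defaultdict(set)
for custkey in customer:
    customer_parts[partition_of(custkey)].add(custkey)

orders_parts = defaultdict(list)
bitmaps = {p: [0] * BITS for p in range(1, N + 1)}
for orderkey, custkey in orders:
    p = partition_of(custkey)
    orders_parts[p].append((orderkey, custkey))
    bitmaps[p][bit_index(orderkey)] = 1

# Lineitem has no custkey, so it is partitioned indirectly: a tuple goes to
# every partition whose bitmap has the bit for its orderkey set.
lineitem_parts = defaultdict(list)
for row in lineitem:
    for p in range(1, N + 1):
        if bitmaps[p][bit_index(row[0])]:
            lineitem_parts[p].append(row)

# Join partition by partition; tuples with no matching order are false drops.
results, false_drops = [], []
for p in range(1, N + 1):
    cust_of_order = {o: c for o, c in orders_parts[p]}
    for orderkey, partkey in lineitem_parts[p]:
        custkey = cust_of_order.get(orderkey)
        if custkey is not None and custkey in customer_parts[p]:
            results.append((custkey, orderkey, partkey))
        else:
            false_drops.append((p, orderkey, partkey))

print(bitmaps)
print(results)
print(false_drops)
```

Orders 1 and 5 share bit index 2 but belong to different partitions, so their Lineitem tuples are written to both partitions 1 and 2. The three extra copies are the false drops, and each one costs additional writes and probes without contributing a result.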

31 SHARP
Limited to star joins
  – Looks like a star
  – All tables related to a central table
[Diagram: central table Fact (key, a_key, b_key, c_key, d_key, e_key) linked to A (a_key, data), B (b_key, data), C (c_key, data), D (d_key, data), and E (e_key, data)]

32 SHARP Example
SELECT * FROM Customer c, Product p, Saleitem s WHERE c.id = s.c_id AND p.id = s.p_id;
Customer (id, name): (1, Bob), (2, Joe), (3, Greg), (4, Susan)
Product (id, name): (1, Hammer), (2, Drill), (3, Screwdriver), (4, Scissors), (5, Toolbox), (6, Knife)
Saleitem (c_id, p_id): (1, 1), (1, 2), (2, 3), (2, 6), (3, 1), (3, 5), (2, 5), (4, 1), (3, 6)

33 SHARP Example Partitions
Hash function: partition = ((id - 1) mod 2) + 1
Customer 1 (id, name): (1, Bob), (3, Greg)
Customer 2 (id, name): (2, Joe), (4, Susan)

34 SHARP Example Partitions
Hash function: partition = ((id - 1) mod 3) + 1
Product 1 (id, name): (1, Hammer), (4, Scissors)
Product 2 (id, name): (2, Drill), (5, Toolbox)
Product 3 (id, name): (3, Screwdriver), (6, Knife)

35 SHARP Example Partitions
Saleitem is partitioned on both c_id (c_id mod 2 = 1 or 0) and p_id (p_id mod 3 = 1, 2, or 0)
Saleitem 1,1 (c_id, p_id): (1, 1), (3, 1)
Saleitem 1,2 (c_id, p_id): (1, 2), (3, 5)
Saleitem 1,3 (c_id, p_id): (3, 6)
Saleitem 2,1 (c_id, p_id): (4, 1)
Saleitem 2,2 (c_id, p_id): (2, 5)
Saleitem 2,3 (c_id, p_id): (2, 3), (2, 6)

36 SHARP Partition Combinations
Customer 1, Product 1, and Saleitem 1,1
Customer 1, Product 2, and Saleitem 1,2
Customer 1, Product 3, and Saleitem 1,3
Customer 2, Product 1, and Saleitem 2,1
Customer 2, Product 2, and Saleitem 2,2
Customer 2, Product 3, and Saleitem 2,3
For each partition i of Customer
  For each partition j of Product
    probe with partition i,j of Saleitem
    output matches between Customer i, Product j, and Saleitem i,j

37 SHARP Join
Customer 1 (id, name): (1, Bob), (3, Greg)
Product 1 (id, name): (1, Hammer), (4, Scissors)
Saleitem 1,1 (c_id, p_id): (1, 1), (3, 1)
Results so far (c_id, c_name, p_id, p_name): (1, Bob, 1, Hammer), (3, Greg, 1, Hammer)

38 SHARP Join
Customer 1 (id, name): (1, Bob), (3, Greg)
Product 2 (id, name): (2, Drill), (5, Toolbox)
Saleitem 1,2 (c_id, p_id): (1, 2), (3, 5)
Results
c_id  c_name  p_id  p_name
1     Bob     1     Hammer
3     Greg    1     Hammer
1     Bob     2     Drill
3     Greg    5     Toolbox
3     Greg    6     Knife
4     Susan   1     Hammer
2     Joe     5     Toolbox
2     Joe     3     Screwdriver
2     Joe     6     Knife
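
The SHARP example on the preceding slides can be sketched as follows. This is an illustrative in-memory version under stated assumptions, not the thesis implementation: the helper name `part_no` is hypothetical, ids are assumed unique within Customer and Product, and partitions are dictionaries and lists rather than disk files.

```python
from collections import defaultdict

customer = [(1, "Bob"), (2, "Joe"), (3, "Greg"), (4, "Susan")]
product = [(1, "Hammer"), (2, "Drill"), (3, "Screwdriver"),
           (4, "Scissors"), (5, "Toolbox"), (6, "Knife")]
saleitem = [(1, 1), (1, 2), (2, 3), (2, 6), (3, 1),
            (3, 5), (2, 5), (4, 1), (3, 6)]            # (c_id, p_id)

CUSTOMER_PARTS, PRODUCT_PARTS = 2, 3


def part_no(key, n):
    return ((key - 1) % n) + 1      # hash function form used on the slides


# Each dimension table is partitioned on its own key.
cust_parts = defaultdict(dict)
for cid, name in customer:
    cust_parts[part_no(cid, CUSTOMER_PARTS)][cid] = name

prod_parts = defaultdict(dict)
for pid, name in product:
    prod_parts[part_no(pid, PRODUCT_PARTS)][pid] = name

# The central table is partitioned on both foreign keys at once, giving
# one partition per (Customer partition, Product partition) pair.
sale_parts = defaultdict(list)
for c_id, p_id in saleitem:
    key = (part_no(c_id, CUSTOMER_PARTS), part_no(p_id, PRODUCT_PARTS))
    sale_parts[key].append((c_id, p_id))

results = []
for i in range(1, CUSTOMER_PARTS + 1):          # each Customer partition
    for j in range(1, PRODUCT_PARTS + 1):       # each Product partition
        for c_id, p_id in sale_parts[(i, j)]:   # probe with Saleitem i,j
            if c_id in cust_parts[i] and p_id in prod_parts[j]:
                results.append((c_id, cust_parts[i][c_id],
                                p_id, prod_parts[j][p_id]))

for row in results:
    print(row)
```

With 2 Customer partitions and 3 Product partitions there are 6 partition combinations, and every Saleitem tuple is read in exactly one of them.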

39 Multi-Way Join Summary
Algorithm: Relevant Queries
Hash Teams: Any query performing an inner join on identical attributes in all relations.
Generalized Hash Teams: Any query performing an inner join on direct and indirect attributes. Requires extra memory for indirect queries.
SHARP: Only star queries.

40 Thesis Questions
The study seeks to answer the following questions:
Q1: Does Hash Teams provide an advantage over dynamic hash join (DHJ)?
Q2: Does Generalized Hash Teams provide an advantage over DHJ?
Q3: Does SHARP provide an advantage over DHJ?
Q4: Should these algorithms be implemented in a relational database system in addition to the existing binary join algorithms?

41 Multi-Way Join Implementation
Performance is implementation dependent
Multiple implementations were created
  – PostgreSQL http://www.postgresql.org/
  – Standalone C++
  – Verified the results in another environment

42 Experimental Results

43 PostgreSQL Results
All experiments compared the multi-way join against PostgreSQL’s built-in hash join, Hybrid Hash Join (HHJ)
Data was based on 10 GB TPC-H benchmark data
  – Generated using Microsoft’s TPC-H generator
  – ftp.research.microsoft.com/users/viveknar/tpcdskew

44 TPC-H Relations
Relation  Tuple Size  Number of Tuples  Relation Size
Customer  194 bytes   1.5 million       284 MB
Supplier  184 bytes   100,000           18 MB
Part      173 bytes   2 million         323 MB
Orders    147 bytes   15 million        2097 MB
PartSupp  182 bytes   8 million         1392 MB
Lineitem  162 bytes   60 million        9270 MB

45 Hash Teams in PostgreSQL
Performed 3-way join on the Orders relation using direct partitioning

46 Generalized Hash Teams in PostgreSQL
Indirect partitioning with a join on Customer, Orders, and Lineitem
Tested using multiple mappers
  – Bitmap
  – Exact

47 Generalized Hash Teams in PostgreSQL

48 SHARP in PostgreSQL
Star join using Part, Orders, and Lineitem

49 Standalone C++ Results
Uses same TPC-H data as the PostgreSQL experiments

50 Standalone C++ Hash Teams
Performed 3-way join on the Orders relation using direct partitioning

51 Standalone C++ Generalized Hash Teams
Indirect partitioning with a join on Customer, Orders, and Lineitem
Tested using bitmap mapper
Tested GHT by
  – Not counting mapper memory
  – Counting mapper memory for small memory sizes
  – Varying the amount of memory available for the mapper

52 GHT Map Memory Not Counted

53 GHT at Small Memory Sizes

54 GHT and Bitmap Size

55 Standalone C++ SHARP
Star join using Part, Orders, and Lineitem

56 Conclusions

57 Thesis Questions
Q1: Does Hash Teams provide an advantage over DHJ?
Q2: Does Generalized Hash Teams provide an advantage over DHJ?
Q3: Does SHARP provide an advantage over DHJ?
Q4: Should these algorithms be implemented in a relational database system in addition to the existing binary join algorithms?

58 Does Hash Teams provide an advantage over DHJ?
Yes
  – Performs fewer I/Os than DHJ
  – Evaluates queries faster
  – Uses memory more efficiently
  – Performs fewer partitioning steps
Queries that can use Hash Teams are very limited in practice
In many cases a traditional sort-merge join would be more efficient
Hash Teams is much more complex to implement and maintain

59 Does Generalized Hash Teams provide an advantage over DHJ?
Sometimes
  – When GHT performs fewer I/Os
Performance is poor when there are many false drops
Much more complex than DHJ or Hash Teams
Mapper can hurt performance

60 Does SHARP provide an advantage over DHJ?
Yes
  – Performs fewer I/Os
  – Evaluates queries faster
  – Uses memory more efficiently
  – Performs fewer partitioning steps
Limited to star queries
More complex to implement and maintain

61 Should these algorithms be implemented in a relational database system?
Hash Teams should not be implemented.
  – Too limited in use
  – Microsoft removed support for Hash Teams from SQL Server 2003
Generalized Hash Teams should not be implemented.
  – GHT can be much slower than DHJ
  – Mapper makes GHT much more complex to implement and maintain
SHARP should be implemented.
  – Shows a significant performance advantage
  – Star queries are commonly used in data warehousing

62 Future Work
Experiments with the algorithms on different data sets
Experiments with larger numbers of relations
Extend Hash Teams and GHT implementations to support GROUP BY to see if it makes them more useful

63 Thank You

64 Appendix

65 TPC-H Relations
http://www.tpc.org/tpch/

