Multi-Way Hash Join Effectiveness M.Sc Thesis Michael Henderson Supervisor Dr. Ramon Lawrence 2.

Slides:

Advertisements

Similar presentations

Equality Join R X R.A=S.B S : : Relation R M PagesN Pages Relation S Pr records per page Ps records per page.

Advertisements

Query Execution, Concluded Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems November 18, 2003 Some slide content may.

Improving Hash Join Performance By Exploiting Intrinsic Data Skew by Bryce Cutt supervised by Dr. Ramon Lawrence.

Query Optimization Dr. Karen C. Davis Professor School of Electronic and Computing Systems School of Computing Sciences and Informatics.

Multidimensional Data

Advanced Databases: Lecture 2 Query Optimization (I) 1 Query Optimization (introduction to query processing) Advanced Databases By Dr. Akhtar Ali.

University of Konstanz Advances in Database Query Processing Sahak Maloyan Avoiding Sorting and Grouping In Processing Queries Sahak Maloyan.

Multidimensional Data. Many applications of databases are "geographic" = 2dimensional data. Others involve large numbers of dimensions. Example: data.

Query Evaluation. An SQL query and its RA equiv. Employees (sin INT, ename VARCHAR(20), rating INT, age REAL) Maintenances (sin INT, planeId INT, day.

Early Hash Join: A Configurable Algorithm for the Efficient and Early Production of Join Results Ramon Lawrence University of Iowa

Query Evaluation. SQL to ERA SQL queries are translated into extended relational algebra. Query evaluation plans are represented as trees of relational.

Query processing and optimization. Advanced DatabasesQuery processing and optimization2 Definitions Query processing –translation of query into low-level.

1  Simple Nested Loops Join:  Block Nested Loops Join  Index Nested Loops Join  Sort Merge Join  Hash Join  Hybrid Hash Join Evaluation of Relational.

SPRING 2004CENG 3521 Join Algorithms Chapter 14. SPRING 2004CENG 3522 Schema for Examples Similar to old schema; rname added for variations. Reserves:

Query Optimization 3 Cost Estimation R&G, Chapters 12, 13, 14 Lecture 15.

Introduction to Database Systems 1 Join Algorithms Query Processing: Lecture 1.

Evaluation of Relational Operations. Relational Operations v We will consider how to implement: – Selection ( ) Selects a subset of rows from relation.

Chapter 8 Physical Database Design. McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Outline Overview of Physical Database.

Multidimensional Data Many applications of databases are ``geographic'' = 2dimensional data. Others involve large numbers of dimensions. Example: data.

1DBTest2008. Motivation Background Relational Data Warehousing (DW) SQL Server 2008 Starjoin improvement Testing Challenge Extending Enterprise-class.

SQL Server 2005 Performance Enhancements for Large Queries Joe Chang

Relational Database Performance CSCI 6442 Copyright 2013, David C. Roberts, all rights reserved.

Early Hash Join: A Configurable Algorithm for the Efficient and Early Production of Join Results Ramon Lawrence University of Iowa

ICOM 6005 – Database Management Systems Design Dr. Manuel Rodríguez-Martínez Electrical and Computer Engineering Department Lecture 14 – Join Processing.

CS 345: Topics in Data Warehousing Tuesday, October 19, 2004.

Physical Database Design & Performance. Optimizing for Query Performance For DBs with high retrieval traffic as compared to maintenance traffic, optimizing.

Approximate Encoding for Direct Access and Query Processing over Compressed Bitmaps Tan Apaydin – The Ohio State University Guadalupe Canahuate – The Ohio.

1 CPS216: Advanced Database Systems Notes 04: Operators for Data Access Shivnath Babu.

Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters Hung-chih Yang(Yahoo!), Ali Dasdan(Yahoo!), Ruey-Lung Hsiao(UCLA), D. Stott Parker(UCLA)

Query Processing. Steps in Query Processing Validate and translate the query –Good syntax. –All referenced relations exist. –Translate the SQL to relational.

@ Carnegie Mellon Databases Inspector Joins Shimin Chen Phillip B. Gibbons Todd C. Mowry Anastassia Ailamaki 2 Carnegie Mellon University Intel Research.

Relational Operator Evaluation. Overview Index Nested Loops Join If there is an index on the join column of one relation (say S), can make it the inner.

RELATIONAL JOIN Advanced Data Structures. Equality Joins With One Join Column External Sorting 2 SELECT * FROM Reserves R1, Sailors S1 WHERE R1.sid=S1.sid.

Implementing Natural Joins, R. Ramakrishnan and J. Gehrke with corrections by Christoph F. Eick 1 Implementing Natural Joins.

12.1Database System Concepts - 6 th Edition Chapter 12: Query Processing Overview Measures of Query Cost Selection Operation Join Operation Sorting 、 Other.

Set Containment Joins: The Good, The Bad and The Ugly Karthikeyan Ramasamy Jointly With Jignesh Patel, Jeffrey F. Naughton and Raghav Kaushik.

Parallel Execution Plans Joe Chang

Lecture 5 Cost Estimation and Data Access Methods.

CS 345: Topics in Data Warehousing Tuesday, October 19, 2004.

1 Chapter 10 Joins and Subqueries. 2 Joins & Subqueries Joins – Methods to combine data from multiple tables – Optimizer information can be limited based.

TPC-H Studies Joe Chang

Chapter 12 Query Processing. Query Processing n Selection Operation n Sorting n Join Operation n Other Operations n Evaluation of Expressions 2.

Exploiting Asynchronous IO using the Asynchronous Iterator Model Suresh Iyengar * S. Sudarshan Santosh Kumar # Raja Agrawal & IIT Bombay Current affiliations:

Buffer-pool aware Query Optimization Ravishankar Ramamurthy David DeWitt University of Wisconsin, Madison.

Lecture 24 Query Execution Monday, November 28, 2005.

CS4432: Database Systems II Query Processing- Part 2.

SqlExam1Review.ppt EXAM - 1. SQL stands for -- Structured Query Language Putting a manual database on a computer ensures? Data is more current Data is.

Chapter 8 Physical Database Design. Outline Overview of Physical Database Design Inputs of Physical Database Design File Structures Query Optimization.

CPSC 404, Laks V.S. Lakshmanan1 Evaluation of Relational Operations – Join Chapter 14 Ramakrishnan and Gehrke (Section 14.4)

Generalized Hash Teams for Join and Group-By Alfons Kemper Donald Kossmann Christian Wiesner Universität Passau Germany.

32nd International Conference on Very Large Data Bases September , 2006 Seoul, Korea Efficient Detection of Empty Result Queries Gang Luo IBM T.J.

Page 1 A Platform for Scalable One-pass Analytics using MapReduce Boduo Li, E. Mazur, Y. Diao, A. McGregor, P. Shenoy SIGMOD 2011 IDS Fall Seminar 2011.

Query Execution. Where are we? File organizations: sorted, hashed, heaps. Indexes: hash index, B+-tree Indexes can be clustered or not. Data can be stored.

CS 440 Database Management Systems Lecture 5: Query Processing 1.

Computing & Information Sciences Kansas State University Wednesday, 08 Nov 2006CIS 560: Database System Concepts Lecture 32 of 42 Monday, 06 November 2006.

Query Processing – Implementing Set Operations and Joins Chap. 19.

Relational Operator Evaluation. overview Projection Two steps –Remove unwanted attributes –Eliminate any duplicate tuples The expensive part is removing.

Implementation of Database Systems, Jarek Gryz1 Evaluation of Relational Operations Chapter 12, Part A.

CS 540 Database Management Systems

Query Execution Query compiler Execution engine Index/record mgr. Buffer manager Storage manager storage User/ Application Query update Query execution.

Alon Levy 1 Relational Operations v We will consider how to implement: – Selection ( ) Selects a subset of rows from relation. – Projection ( ) Deletes.

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Evaluation of Relational Operations Chapter 14, Part A (Joins)

1 Overview of Query Evaluation Chapter Outline  Query Optimization Overview  Algorithm for Relational Operations.

CS4432: Database Systems II Query Processing- Part 1 1.

Diving into Query Execution Plans ED POLLACK AUTOTASK CORPORATION DATABASE OPTIMIZATION ENGINEER.

Evaluation of Relational Operations

Cse 344 April 25th – Disk i/o.

Physical Join Operators

CPSC-310 Database Systems

Implementation of Relational Operations

Presentation transcript:

Multi-Way Hash Join Effectiveness M.Sc Thesis Michael Henderson Supervisor Dr. Ramon Lawrence 2

Outline Motivation Database Terminology Background Joins Multi-Way Joins Thesis Questions Experimental Results Conclusions 3

Motivation Data is everywhere Governments collect data on citizens Facebook collects data on over 1 billion people Wal-Mart and Target collect sales data on all their customers The goal is to make answering the big questions –Possible –Faster 4

Database Terminology: Relations (Tables) PartLineitem partkeynameretailpricelinenumberpartkeyquantitysaleprice 1Box Hat Bottle Part Relation Tuple/Row Attribute/Column Lineitem Relation The tables are related through their partkey attributes Attribute Names

Database Terminology II: SQL Structured Query Language Used to ask the database questions about the data Standardized Example: SQL for retrieving all rows from the part table 6 SELECT * FROM Part;

Database Terminology III: Join Joins are used to combine the data in database tables Joins are slow We want joins to be faster 7

Background 8

What Makes Queries Slow? All the data must be read to give an accurate answer Data is usually much larger than what can fit in memory Operations such as filtering, ordering, and joins are costly A join is especially costly –May need to match every row in two tables. O(n 2 ) –May need to perform many slow disk operations (I/Os) 9

Background: Example Join Query 10 SELECT * FROM Part p, Lineitem l WHERE p.partkey = l.partkey; Part Lineitem p.partkey = l.partkey Results partkeynameretailpricelinenumberpartkeyquantitysaleprice 1Box Box Hat Bottle SQL Relational Algebra Join Results

Results partkeynameretailpricelinenumberpartkeyquantitysaleprice Nested Loop Join 11 PartLineitem partkeynameretailpricelinenumberpartkeyquantitysaleprice 1Box Hat Bottle Results partkeynameretailpricelinenumberpartkeyquantitysaleprice 1Box Box Hat Bottle Box Box Box

Dynamic Hash Join 12 Part partkeynameretailprice 1Box0.50 2Hat Bottle2.50 3Bottle2.50 2Hat Box0.50 Part1 partkeynameretailprice Part2 partkeynameretailprice Part3 partkeynameretailprice Three Part Partitions Hash Function : partition = (partkey - 1 mod 3) + 1 = (1 - 1 mod 3) + 1 = 1 = (2 - 1 mod 3) + 1 = 2 = (3 - 1 mod 3) + 1 = 3 Saved to disk

Part1 partkeynameretailprice 1Box0.50 Results partkeynameretailpricelinenumberpartkeyquantitysaleprice Dynamic Hash Join 13 Lineitem linenumberpartkeyquantitysaleprice Box Box Lineitem1 linenumberpartkeyquantitysaleprice Lineitem2 linenumberpartkeyquantitysaleprice Lineitem3 linenumberpartkeyquantitysaleprice Three Lineitem Partitions Hash Function: partition = (partkey - 1 mod 3) + 1 = (1 - 1 mod 3) + 1 = 1 = (2 - 1 mod 3) + 1 = 2 = (3 - 1 mod 3) + 1 = 3

Part2 partkeynameretailprice 2Hat25.00 Results partkeynameretailpricelinenumberpartkeyquantitysaleprice 1Box Box Dynamic Hash Join 14 Lineitem2 linenumberpartkeyquantitysaleprice Results partkeynameretailpricelinenumberpartkeyquantitysaleprice 1Box Box Hat Bottle Hat

Join Three Tables 15 SELECT A.a_key, B.b_key, C.c_key FROM A, B, C WHERE A.a_key = B.a_key AND A.a_key = C.a_key; A B A.a_key = B.a_key C A.a_key = C.a_key A B A.a_key = B.a_key C A.a_key = C.a_key Left Deep Plan Right Deep Plan

Multi-way Hash Joins Join multiple relations at the same time Shares memory across the entire join Produces a result by combining tuples from all relations Do not have to repartition intermediate results Less disk operations 16 A B A.a_key = B.a_key and A.a_key = C.a_key C Multi-way Plan

Hash Teams Multi-way hash join Hash teams joins relations on a common attribute 17

Hash Teams Example ABC a_keyb_keya_keyc_keya_key SELECT A.a_key, B.b_key, C.c_key FROM A, B, C WHERE A.a_key = B.a_key AND A.a_key = C.a_key;

Partitioning A and B 19 A1A1 a_key 1 Partitions AB a_keyb_keya_key Hash Function: partition = (a_key - 1 mod 3) + 1 = (1 - 1 mod 3) + 1 = 1 = (2 - 1 mod 3) + 1 = 2 = (3 - 1 mod 3) + 1 = 3 1 A2A2 a_key 2 A3A3 3 A1A1 A2A2 A3A3 2 3 B1B1 b_keya_key B2B2 b_keya_key B3B3 b_keya_key

Partitioning A and B 20 A1A1 a_key 1 Partitions AB a_keyb_keya_key Hash Function: partition = (a_key - 1 mod 3) + 1 = (1 - 1 mod 3) + 1 = 1 = (2 - 1 mod 3) + 1 = 2 = (3 - 1 mod 3) + 1 = 3 A2A2 a_key 2 A3A B1B1 b_keya_key B2B2 b_keya_key B3B3 b_keya_key 33 B1B1 b_keya_key B2B2 b_keya_key B3B3 b_keya_key

Processing C 21 A1A1 a_key 1 Disk Partitions Hash Function: partition = (a_key - 1 mod 3) + 1 B1B1 b_keya_key B1B1 b_keya_key C c_keya_key C2C2 c_keya_key C3C3 c_keya_key Results a_keyb_keyc_key

Processing C 22 A1A1 a_key 1 Disk Partitions Hash Function: partition = (a_key - 1 mod 3) + 1 B1B1 b_keya_key B1B1 b_keya_key C c_keya_key C2C2 c_keya_key C3C3 c_keya_key Results a_keyb_keyc_key Results a_keyb_keyc_key

Generalized Hash Teams (GHT) Extends Hash Teams Does not need the join attributes to be the same Uses indirect partitioning Needs an in-memory map to indirectly join relations 23

GHT Partition Maps Uses join memory Use a bitmap to approximate mapping to reduce memory requirements Needs a bitmap for each partition Bitmaps introduce mapping errors that cause tuples to be mapped to multiple partitions (false drops) False drops add I/O and Processing cost 24

GHT Example 25 SELECT c.custkey, o.orderkey, l.partkey FROM Customer c, Orders o, Lineitem l WHERE c.custkey = o.custkey AND o.orderkey = l.orderkey; Customer custkey Orders orderkeycustkey Lineitem orderkeypartkey

GHT Customer Partitions Customer 1 Customer 2 Customer 3 custkey Hash Function: partition = (custkey - 1 mod 3) + 1

Orders Partitions and Bitmap 27 Orders 1 orderkeycustkey Orders 2 orderkeycustkey Orders 2 orderkeycustkey Orders 3 orderkeycustkey 33 Orders 3 orderkeycustkey Orders 1 orderkeycustkey Orders orderkeycustkey B1B B2B B3B Index = (orderkey +1) mod 4 B1B B1B B2B B2B B3B Hash Function: partition = (custkey - 1 mod 3) + 1

Orders Partitions and Bitmap 28 B1B1 B2B2 B3B B1B B2B B3B

Lineitem Partitions with False Drops 29 Lineitem 1 orderkeypartkey Lineitem 2 orderkeypartkey Lineitem 3 orderkeypartkey Lineitem 1 orderkeypartkey Lineitem 2 orderkeypartkey Lineitem 3 orderkeypartkey Lineitem orderkeypartkey B1B B2B B3B Index = (orderkey +1) mod False Drop

30 Lineitem 1 orderkeypartkey Joining the Partitions Customer 1 custkey 1 Orders 1 orderkeycustkey Results custkeyorderkeypartkey Results custkeyorderkeypartkey False Drop

SHARP Limited to star joins –Looks like a star –All tables related to a central table 31 Fact keya_keyb_keyc_keyd_keye_key A a_keydata C c_keydata B b_keydata E e_keydata D d_keydata

SHARP Example CustomerProductSaleitem idnameidnamec_idp_id 1Bob1Hammer11 2Joe2Drill12 3Greg3Screwdriver23 4Susan4Scissors26 5Toolbox31 6Knife SELECT * FROM Customer c, Product p, Saleitem s WHERE c.id = s.c_id AND p.id = s.p_id;

SHARP Example Partitions 33 Customer idname 1Bob 2Joe 3Greg 4Susan Customer 1 idname 1Bob 3Greg Customer 1 idname Customer 2 idname 2Joe 4Susan Customer 2 idname 1Bob 2Joe 3Greg 4Susan Hash Function: partition = (id - 1 mod 2) + 1

SHARP Example Partitions 34 Product idname 1Hammer 2Drill 3Screwdriver 4Scissors 5Toolbox 6Knife Product 1 idname 1Hammer 4Scissors Product 2 idname 2Drill 5Toolbox Product 3 idname 3Screwdriver 6Knife 1Hammer Product 1 idname Product 2 idname Product 3 idname 2Drill 3Screwdriver 4Scissors 5Toolbox 6Knife Hash Function: partition = (id - 1 mod 3) + 1

SHARP Example Partitions 35 Saleitem c_idp_id Saleitem 1,1 c_idp_id Saleitem 1,1 c_idp_id Saleitem 1,2 c_idp_id Saleitem 1,2 c_idp_id Saleitem 1,3 c_idp_id 36 Saleitem 1,3 c_idp_id Saleitem 2,1 c_idp_id 41 Saleitem 2,1 c_idp_id Saleitem 2,2 c_idp_id 25 Saleitem 2,2 c_idp_id Saleitem 2,3 c_idp_id Saleitem 2,3 c_idp_id c_id mod 2 = 1c_id mod 2 = 0 p_id mod 3 = 1 p_id mod 3 = 2 p_id mod 3 = 0

SHARP Partition Combinations Customer 1, Product 1, and Saleitem 1,1 Customer 1, Product 2, and Saleitem 1,2 Customer 1, Product 3, and Saleitem 1,3 Customer 2, Product 2, and Saleitem 2,1 Customer 2, Product 2, and Saleitem 2,2 Customer 2, Product 3, and Saleitem 2,3 36 For each partition i of Customer For each partition j of Product probe with partition i,j of Saleitem output matches between Customer i, Product j, and Saleitem i,j

Results c_idc_namep_idp_name SHARP Join 37 Saleitem 1,1 c_idp_id Product 1 idname 1Hammer 4Scissors Customer 1 idname 1Bob 3Greg Hammer 1Bob 3Greg 1Hammer

Results c_idc_namep_idp_name 1Bob1Hammer 3Greg1Hammer 1Bob2Drill 3Greg5Toolbox 3Greg6Knife 4Susan1Hammer 2Joe5Toolbox 2Joe3Screwdriver 2Joe6Knife Results c_idc_namep_idp_name 1Bob1Hammer 3Greg1Hammer SHARP Join 38 Saleitem 1,2 c_idp_id Product 2 idname 2Drill 5Toolbox Customer 1 idname 1Bob 3Greg 35 2Drill 1Bob 3Greg 5Toolbox 12

Multi-Way Join Summary AlgorithmRelevant Queries Hash TeamsAny query performing an inner join on identical attributes in all relations. Generalized Hash Teams Any query performing an inner join on direct and indirect attributes. Requires extra memory for indirect queries. SHARPOnly star queries. 39

Thesis Questions The study seeks to answer the following questions: Q1: Does Hash Teams provide an advantage over DHJ? Q2: Does Generalized Hash Teams provide an advantage over DHJ? Q3: Does SHARP provide an advantage over DHJ? Q4: Should these algorithms be implemented in a relational database system in addition to the existing binary join algorithms? 40

Multi-Way Join Implementation Performance is implementation dependent Multiple implementations were created –PostgreSQL –Standalone C++ –Verified the results in another environment 41

Experimental Results 42

PostgreSQL Results All experiments were performed by comparing the multi-way join against the built-in hash join Hybrid Hash Join (HHJ) Data was based on 10GB TPC-H benchmark data –Generated using Microsoft’s TPC-H generator –ftp.research.microsoft.com/users/viveknar/tpcdskew 43

TPC-H Relations RelationTuple SizeNumber of TuplesRelation Size Customer194 Bytes1.5 Million284 MB Supplier184 Bytes100,00018 MB Part173 Bytes2 Million323 MB Orders147 Bytes15 Million2097 MB PartSup182 Bytes8 Million1392 MB Lineitem162 Bytes60 Million9270 MB 44

Hash Teams in PostgreSQL Performed 3-way join on the Orders relation using direct partitioning 45

Generalized Hash Teams in PostgreSQL Indirect partitioning with a join on Customer, Orders, and Lineitem Tested using multiple mappers –Bitmap –Exact 46

Generalized Hash Teams in PostgreSQL 47

SHARP in PostgreSQL Star join using Part, Orders, and Lineitem 48

Standalone C++ Results Uses same TPC-H data as the PostgreSQL experiments 49

Standalone C++ Hash Teams Performed 3-way join on the Orders relation using direct partitioning 50

Standalone C++ Generalized Hash Teams Indirect partitioning with a join on Customer, Orders, and Lineitem Tested using bitmap mapper Tested GHT by –Not counting mapper memory –Counting mapper memory for small memory sizes –Varying the amount of memory available for the mapper 51

GHT Map Memory Not Counted 52

GHT at Small Memory Sizes 53

GHT and Bitmap Size 54

Standalone C++ SHARP Star join using Part, Orders, and Lineitem 55

Conclusions 56

Thesis Questions Q1: Does Hash Teams provide an advantage over DHJ? Q2: Does Generalized Hash Teams provide an advantage over DHJ? Q3: Does SHARP provide an advantage over DHJ? Q4: Should these algorithms be implemented in a relational database system in addition to the existing binary join algorithms? 57

Does Hash Teams provide an advantage over DHJ? Yes –Performs fewer I/Os than DHJ –Evaluates Queries Faster –Uses memory more efficiently –Performs fewer partitioning steps Queries that can use Hash Teams are very limited in practice. In many cases a traditional sort-merge join would be more efficient Hash Teams is much more complex to implement and maintain 58

Does Generalized Hash Teams provide an advantage over DHJ? Sometimes –When GHT performs fewer I/Os Performance is bad when there are a lot of false drops Much more complex than DHJ or Hash Teams Mapper can hurt performance 59

Does SHARP provide an advantage over DHJ? Yes –Performs fewer I/Os –Evaluates queries quicker –Uses memory more efficiently –Fewer partitioning steps Limited to star queries More complex to implement and maintain 60

Should these algorithms be implemented in a relational database system? Hash teams should not be implemented. –Too limited in use –Microsoft removed support for Hash Teams from SQL Server 2003 Generalized Hash Teams should not be implemented. –GHT can be much slower than DHJ –Mapper makes GHT much more complex to implement and maintain SHARP should be implemented. –Shows a significant performance advantage –Star queries are commonly used in data warehousing 61

Future Work Experiments with the algorithms on different data sets Experiments with larger numbers of relations Extend Hash Teams and GHT implementations to support GROUP BY to see if it makes them more useful 62

Thank You 63

Appendix 64

TPC-H Relations 65