Improving Hash Join Performance By Exploiting Intrinsic Data Skew by Bryce Cutt supervised by Dr. Ramon Lawrence.



Introduction
- Databases are part of our lives
- Hash join is a core database algorithm
  o Very I/O intensive for large databases
    - Queries may take hours
  o Any performance improvement is significant
- Real datasets contain skew
  o Skew is when some values occur more frequently than others
  o Skew can greatly reduce hash join performance
- Skew is traditionally considered a bad thing for join algorithms
  o Algorithms try to mitigate the negative effects of skew
- Adapt hash join
  o No longer just mitigate
  o Use foreknowledge of skew
    - Improve performance

Relational Model Definitions

Example Relations
- Build relation: Part
- Probe relation: Purchase

DHJ Algorithm Build Phase
- Hash function: modulo 5

DHJ Algorithm Build Phase, cont.
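The build phase can be sketched in Python as follows, assuming DHJ refers to a dynamic hash join that spills partitions on demand. The slides give only the hash function (modulo 5). The tuple-count memory limit, the largest-partition victim policy, the `partid` key, and the dictionaries standing in for disk files are assumptions made for illustration.

```python
from collections import defaultdict

NUM_PARTITIONS = 5  # from the slide: hash function is modulo 5


def build_phase(build_tuples, key, memory_limit):
    """Partition build tuples, freezing partitions when memory runs out.

    memory_limit is counted in tuples (an assumption; real systems count bytes).
    """
    in_memory = defaultdict(list)  # partition id -> tuples kept in memory
    on_disk = defaultdict(list)    # partition id -> tuples "written to disk"
    frozen = set()                 # partitions that no longer accept memory tuples
    resident = 0                   # number of tuples currently in memory

    for t in build_tuples:
        p = t[key] % NUM_PARTITIONS
        if p in frozen:
            on_disk[p].append(t)   # frozen partition: tuple goes straight to disk
            continue
        in_memory[p].append(t)
        resident += 1
        while resident > memory_limit:
            # Hypothetical victim policy: freeze the largest in-memory partition.
            victim = max(in_memory, key=lambda q: len(in_memory[q]))
            spilled = in_memory.pop(victim)
            on_disk[victim].extend(spilled)
            frozen.add(victim)
            resident -= len(spilled)

    return dict(in_memory), dict(on_disk)
```

For example, `build_phase([{"partid": i} for i in range(1, 11)], "partid", memory_limit=6)` leaves some partitions in memory and freezes the rest, so that no more than six tuples are resident at the end.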

Probe Relation

DHJ Algorithm Probe Phase

DHJ Algorithm Probe Phase, cont.
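A minimal sketch of the probe phase, under the same assumptions as a standard dynamic hash join: probe tuples whose partition is still in memory are joined immediately, while the rest are set aside for later. The attribute names (`partid`, `name`, `qty`) and the use of Python lists in place of disk files are hypothetical.

```python
NUM_PARTITIONS = 5  # from the slide: hash function is modulo 5


def probe_phase(probe_tuples, key, in_memory, frozen_ids):
    """Join probe tuples against memory-resident build partitions.

    in_memory:  partition id -> list of build tuples still in memory
    frozen_ids: ids of build partitions that were flushed to disk
    """
    # Hash table over the build tuples that are still in memory.
    table = {}
    for tuples in in_memory.values():
        for b in tuples:
            table.setdefault(b[key], []).append(b)

    results = []
    spilled_probe = {p: [] for p in frozen_ids}  # stand-in for probe files on disk
    for t in probe_tuples:
        p = t[key] % NUM_PARTITIONS
        if p in spilled_probe:
            spilled_probe[p].append(t)           # matching build side is on disk
        else:
            for b in table.get(t[key], []):
                results.append((b, t))           # join result produced right away
    return results, spilled_probe
```

Every probe tuple that lands in `spilled_probe` costs extra I/O, since it is written out and read back during cleanup.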

DHJ Algorithm Cleanup Phase

DHJ Algorithm Cleanup Phase, cont.
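The cleanup phase can be sketched as below, assuming it follows the usual pattern of joining each frozen build partition with the probe partition of the same id. The dictionaries are stand-ins for on-disk partition files, and the function and parameter names are hypothetical.

```python
def cleanup_phase(build_on_disk, probe_on_disk, key):
    """Join each frozen build partition with its matching probe partition.

    Both arguments map partition id -> list of tuples "read back from disk".
    """
    results = []
    for p, build_tuples in build_on_disk.items():
        # Load one build partition into an in-memory hash table.
        table = {}
        for b in build_tuples:
            table.setdefault(b[key], []).append(b)
        # Stream the probe partition with the same id against it.
        for t in probe_on_disk.get(p, []):
            for b in table.get(t[key], []):
                results.append((b, t))
    return results
```

Only one build partition needs to be in memory at a time, which is why the approach works when the whole build relation does not fit.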

Skewed Probe Relation

Statistics and Hash Joins
- Modern database systems maintain statistics such as histograms for query optimization
- What if hash join could use the statistics to choose the best build tuples to keep in memory?
  o Does not have to generate its own statistics
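As a rough illustration of choosing the best build tuples from existing statistics, the sketch below ranks join-key values by their estimated frequency in the probe relation and keeps as many as fit in a budget. The value-to-frequency mapping, the frequencies themselves, and the budget counted in distinct values are assumptions; real histograms are usually bucketed and would need an extra lookup step.

```python
def choose_privileged_values(value_freqs, budget):
    """Return the join-key values worth pinning in memory.

    value_freqs: join-key value -> estimated fraction of probe tuples
                 carrying that value (hypothetical statistics format)
    budget:      how many distinct values fit in the reserved memory
    """
    ranked = sorted(value_freqs.items(), key=lambda kv: kv[1], reverse=True)
    return {value for value, _ in ranked[:budget]}


# Hypothetical frequencies for the partid attribute of the probe relation.
freqs = {1: 0.05, 2: 0.40, 3: 0.30, 4: 0.05}
best = choose_privileged_values(freqs, budget=2)
```

With these made-up frequencies, the two selected values cover 70% of the probe tuples while occupying only two slots in memory.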

Histojoin Algorithm General Idea
- Same basic form as DHJ
- Determines best build tuples from histogram
  o In this case the tuples with partid 2 and 3
- Create partitions for the best build tuples
  o In addition to regular partitions
  o Freeze regular partitions first
- Perform a highly optimized multi-stage check
  o To determine the partition tuples belong in

Histojoin Algorithm Build Phase

Histojoin Algorithm Probe Phase
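The sketch below illustrates the general idea in Python and is not the author's implementation. It assumes a two-stage check (membership in a set of frequent values, then the ordinary modulo-5 hash) and a privileged partition that is never frozen. The names `PRIVILEGED`, `partition_of`, and `histojoin_probe` are hypothetical.

```python
NUM_PARTITIONS = 5
PRIVILEGED = "privileged"  # hypothetical label for the frequent-value partition


def partition_of(value, privileged_values):
    """Two-stage check: frequent values first, then the regular hash."""
    if value in privileged_values:
        return PRIVILEGED
    return value % NUM_PARTITIONS


def histojoin_probe(build_tuples, probe_tuples, key, privileged_values, frozen_ids):
    """Return (immediate join results, probe tuples deferred to cleanup).

    frozen_ids holds the regular partitions that were flushed to disk;
    the privileged partition is assumed to stay in memory throughout.
    """
    table = {}
    for b in build_tuples:
        p = partition_of(b[key], privileged_values)
        if p == PRIVILEGED or p not in frozen_ids:
            table.setdefault(b[key], []).append(b)

    results, deferred = [], []
    for t in probe_tuples:
        p = partition_of(t[key], privileged_values)
        if p == PRIVILEGED or p not in frozen_ids:
            results.extend((b, t) for b in table.get(t[key], []))
        else:
            deferred.append(t)  # would be written to disk in a real system
    return results, deferred


parts = [{"partid": i} for i in range(1, 6)]
purchases = [{"partid": v} for v in (2, 2, 2, 3, 3, 1, 4, 2)]
frozen = {1, 2, 3, 4}

with_hist = histojoin_probe(parts, purchases, "partid", {2, 3}, frozen)
without_hist = histojoin_probe(parts, purchases, "partid", set(), frozen)
```

In this toy input, pinning partid 2 and 3 lets six of the eight probe tuples join immediately, whereas with no privileged values all eight are deferred to the cleanup phase.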

Implementation Details
- Avoided in algorithm description
  o General enough to fit any database system
- But ultimately important
  o Core of the algorithm is implementation specific
- Implemented in
  o Stand-alone Java app
    - Optimistic implementation
  o PostgreSQL
    - HHJ
    - Conservative implementation

Inaccurate Statistics
- Selections
- Multi-join plans
  o Sampling
  o SITs
- Handling dependent on implementation
  o PostgreSQL conservative memory usage

Experimental Results
- TPC-H
  o Database commonly used to test database system performance
  o Skewed versions
  o 1GB dataset used in Java tests
  o 10GB dataset used in PostgreSQL tests

Experimental Results, cont.
- Java, Lineitem/Part, skewed, 1GB
  o Approx. 20% faster

Experimental Results, cont.
- Java, Lineitem/Part, high skew, 1GB
  o Approx. 60% faster

Experimental Results, cont.
- Java, Various Joins, Percent Improvement, 1GB
  o Approx. 20% for skewed and 60% for high skew

Experimental Results, cont.
- Java, Lineitem/Part, Inaccurate Histogram, 1GB

Experimental Results, cont.
- Java, Lineitem/Part/Supplier, high skew, 1GB
  o Approx. 75% faster

Experimental Results, cont.
- PostgreSQL, Lineitem/Part, skewed, 10GB
  o Approx. 10% faster

Experimental Results, cont.
- PostgreSQL, Lineitem/Part, high skew, 10GB
  o Approx. 60% faster

Experimental Results, cont.
- PostgreSQL, Various Joins, Percent Improvement, 10GB
  o 5-10% for skewed and 50-60% for high skew

Conclusion
- Histojoin
  o Significantly outperforms standard hash joins in the presence of skew
- Smart implementation mitigates pitfalls
- Two papers have been published from this work
- PostgreSQL patch currently in review
  o Will be used by millions of users

Thank you
- Thank you, Dr. Lawrence