Optimizing Query Execution Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems January 26, 2005 Content on hashing.

Presentation transcript:

Optimizing Query Execution Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems January 26, 2005 Content on hashing and sorting courtesy Ramakrishnan & Gehrke

2 Administrivia
• My office hour – moved up ½ hour to 2:00-3:00 on Tuesdays
• Next week:
  - Some initial suggestions for the project proposal
  - Scheduling of the deadline for your midterm report

3 Today’s Trivia Question

4 Query Execution
What are the goals?
• Logical vs. physical plans – what are the differences?
• Some considerations in building execution engines:
  - Efficiency – minimize copying, comparisons
  - Scheduling – make standard code-paths fast
  - Data layout – how to optimize cache behavior, buffer management, distributed execution, etc.

5 Speeding Operations over Data
Three general data organization techniques:
• Indexing
  - Associative lookup & synopses
  - Both for selection and projection
  - “Inner” loop of nested loops join
  - … And anywhere sorted data is useful…
• Sorting
• Hashing

6 Speeding Operations over Data
Three general data organization techniques:
• Indexing
• Sorting
• Hashing

7 General External Merge Sort
• To sort a file with N pages using B buffer pages:
  - Pass 0: use B buffer pages. Produce $\lceil N/B \rceil$ sorted runs of B pages each
  - Pass 1, 2, …, etc.: merge B-1 runs
• Number of passes: $1 + \lceil \log_{B-1} \lceil N/B \rceil \rceil$
• Cost = 2N * (# of passes)
[Figure: B main-memory buffers (INPUT 1, INPUT 2, …, INPUT B-1, and OUTPUT) sitting between the input runs on disk and the output run on disk]
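The pass count and I/O cost above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original slides; the function name `external_sort_cost` is made up for illustration, and it counts passes with an integer loop rather than a floating-point logarithm to avoid rounding surprises.

```python
import math

def external_sort_cost(n_pages, n_buffers):
    """Return (passes, page_ios) for sorting n_pages with n_buffers buffer pages.

    Hypothetical helper: it only evaluates the formulas on the slide.
    """
    if n_buffers < 3:
        raise ValueError("need at least 3 buffer pages to merge")
    # Pass 0 sorts B pages at a time, producing ceil(N/B) runs.
    runs = math.ceil(n_pages / n_buffers)
    passes = 1
    # Each later pass merges up to B-1 runs into one.
    while runs > 1:
        runs = math.ceil(runs / (n_buffers - 1))
        passes += 1
    # Every pass reads and writes each page once: 2N I/Os per pass.
    return passes, 2 * n_pages * passes

print(external_sort_cost(108, 5))
```

With N = 108 and B = 5, pass 0 produces 22 runs, and three merge passes reduce them to 6, 2, and 1, giving 4 passes in total.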

8 Applicability of Sort Techniques
• Aggregation
  - Duplicate removal as an instance of aggregation
  - XML nesting as an instance of aggregation
• Join, semi-join, and intersection

9 Merge Join
• Requires data sorted by join attributes
  Merge and join sorted files, reading sequentially a block at a time
  - Maintain two file pointers
  - While tuple at R < tuple at S, advance R (and vice versa)
  - While tuples match, output all possible pairings
  - Maintain a “last in sequence” pointer
• Preserves sorted order of “outer” relation
• Cost: b(R) + b(S) plus sort costs, if necessary
  In practice, approximately linear, 3 (b(R) + b(S))
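The two-pointer procedure can be sketched over in-memory lists. This is an illustrative sketch only: `merge_join`, `key_r`, and `key_s` are assumed names, and the lists stand in for sorted files that a real engine would read a block at a time. The `j_end` index plays the role of the “last in sequence” pointer.

```python
def merge_join(r, s, key_r, key_s):
    """Join two lists that are already sorted on their join keys."""
    out = []
    i = j = 0
    while i < len(r) and j < len(s):
        kr, ks = key_r(r[i]), key_s(s[j])
        if kr < ks:
            i += 1                      # R is behind: advance R
        elif kr > ks:
            j += 1                      # S is behind: advance S
        else:
            # Find the end of the run of S tuples sharing this key.
            j_end = j
            while j_end < len(s) and key_s(s[j_end]) == kr:
                j_end += 1
            # Pair every R tuple with this key against the whole S run.
            while i < len(r) and key_r(r[i]) == kr:
                for t in s[j:j_end]:
                    out.append((r[i], t))
                i += 1
            j = j_end
    return out

r = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
s = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
pairs = merge_join(r, s, lambda t: t[0], lambda t: t[0])
```

Output is produced in the order of `r`, which is the sense in which the sorted order of the outer relation is preserved.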

10 Hashing
Several types of hashing:
• Static hashing
• Extendible hashing
• Consistent hashing (used in P2P; we’ll see later)

11 Static Hashing
• Fixed number of buckets (and pages); overflow when necessary
• h(k) mod N = bucket to which data entry with key k belongs
• Downside: long overflow chains
[Figure: key → h → h(key) mod N → primary bucket pages 0 … N-1, each followed by a chain of overflow pages]
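A toy model makes the overflow-chain downside concrete. This is a sketch under stated assumptions: the class `StaticHashFile` and its methods are hypothetical, Python lists stand in for disk pages, and Python's built-in `hash` stands in for h.

```python
class StaticHashFile:
    """Toy static hash file: N fixed buckets, each a chain of pages."""

    def __init__(self, n_buckets, page_capacity):
        self.n = n_buckets
        self.cap = page_capacity
        # Each bucket starts with one empty primary page; later pages are overflow.
        self.buckets = [[[]] for _ in range(n_buckets)]

    def _bucket(self, key):
        return hash(key) % self.n       # h(k) mod N

    def insert(self, key, value):
        chain = self.buckets[self._bucket(key)]
        if len(chain[-1]) >= self.cap:
            chain.append([])            # last page full: add an overflow page
        chain[-1].append((key, value))

    def lookup(self, key):
        # A lookup may have to scan the entire chain of its bucket.
        chain = self.buckets[self._bucket(key)]
        return [v for page in chain for (k, v) in page if k == key]

    def chain_length(self, key):
        return len(self.buckets[self._bucket(key)])

f = StaticHashFile(n_buckets=4, page_capacity=2)
for k in (0, 4, 8, 12, 16):             # all map to bucket 0
    f.insert(k, "v%d" % k)
```

Because the number of buckets never changes, skewed keys like these keep lengthening a single chain, and every lookup in that bucket pays for it.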

12 Extendible Hashing
If a bucket becomes full, split it in half
• Use a directory of pointers to buckets; double the directory, splitting just the bucket that overflowed
• Directory much smaller than file, so doubling it is much cheaper
• Only one page of data entries is split
• Trick lies in how hash function is adjusted!

13 Insert h(r)=20 (Causes Doubling)
[Figure: directory (with global depth) and buckets (with local depths) before and after inserting 20*. Bucket A overflows and splits into Bucket A and Bucket A2 (“split image” of Bucket A), and the directory doubles. Bucket contents after the insert: A = 32*, 16*; B = 1*, 5*, 21*, 13*; C = 10*; D = 15*, 7*, 19*; A2 = 4*, 12*, 20*]
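The directory-doubling step can be sketched as follows. This is an illustrative sketch, not the course's code: the class names are made up, the directory is indexed by the low-order `global_depth` bits of Python's built-in `hash`, and keys are stored without payloads. It assumes distinct keys; more than `capacity` keys with identical hash bits would make `insert` recurse forever.

```python
class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.keys = []

class ExtendibleHash:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]

    def _slot(self, key):
        # Use the last global_depth bits of the hash value.
        return hash(key) & ((1 << self.global_depth) - 1)

    def lookup(self, key):
        return key in self.directory[self._slot(key)].keys

    def insert(self, key):
        bucket = self.directory[self._slot(key)]
        if len(bucket.keys) < self.capacity:
            bucket.keys.append(key)
            return
        if bucket.local_depth == self.global_depth:
            # Double the directory: slot i + 2^d starts as a copy of slot i.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        # Split only the overflowing bucket, on the next hash bit.
        bit = 1 << bucket.local_depth
        bucket.local_depth += 1
        image = Bucket(bucket.local_depth)
        for i, b in enumerate(self.directory):
            if b is bucket and (i & bit):
                self.directory[i] = image
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys:
            self.directory[self._slot(k)].keys.append(k)
        self.insert(key)                # retry; may trigger another split

h = ExtendibleHash(capacity=4)
for k in (4, 12, 32, 16, 1, 5, 21, 13, 10, 15, 7, 19, 20):
    h.insert(k)
```

Inserting 20 last finds Bucket A (keys 4, 12, 32, 16) full with local depth equal to global depth, so the directory doubles and only that one bucket is split.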

14 Relevance of Hashing Techniques
• Hash indices use extendible hashing
• Uses of static hashing:
  - Aggregation
  - Intersection
  - Joins
Why isn’t extendible hashing used in hash joins – only as a disk indexing technique?

15 Hash Join
Read entire inner relation into hash table (join attributes as key)
For each tuple from outer, look up in hash table & join
• Not fully pipelined
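The build and probe phases can be sketched as a small generator. This is a sketch assuming the inner relation fits in memory; `hash_join`, `key_inner`, and `key_outer` are illustrative names, not from the slides.

```python
from collections import defaultdict

def hash_join(inner, outer, key_inner, key_outer):
    """Yield (inner_tuple, outer_tuple) pairs with equal join keys."""
    table = defaultdict(list)
    # Build phase: the whole inner relation is consumed before any output,
    # which is why the operator is not fully pipelined.
    for t in inner:
        table[key_inner(t)].append(t)
    # Probe phase: outer tuples stream through one at a time.
    for o in outer:
        for t in table.get(key_outer(o), ()):
            yield (t, o)

inner = [(1, "a"), (2, "b"), (2, "c")]
outer = [(2, "x"), (3, "y"), (1, "z")]
result = list(hash_join(inner, outer, lambda t: t[0], lambda t: t[0]))
```

Results come out in the order of the outer relation, since only the probe side streams.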

16 Running out of Memory
• Prevention: First partition the data by value into memory-sized groups
  Partition both relations in the same way, write to files
  Recursively join the partitions
• Resolution: Similar, but do when hash tables full
  Split hash table into files along bucket boundaries
  Partition remaining data in same way
  Recursively join partitions with diff. hash fn!
• Hybrid hash join: flush “lazily” a few buckets at a time
• Cost: <= 3 * (b(R) + b(S))
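The “prevention” strategy can be sketched with lists standing in for partition files. This is an illustrative sketch: `partition`, `partitioned_hash_join`, `mem_tuples` (a stand-in for the memory budget), and the recursion cap of 8 levels are all assumptions. Mixing the recursion level into the hash gives a different hash function at each level, as the slide requires.

```python
from collections import defaultdict

def partition(tuples, key, n_parts, level):
    """Split tuples into n_parts lists; level varies the hash function."""
    parts = [[] for _ in range(n_parts)]
    for t in tuples:
        parts[hash((level, key(t))) % n_parts].append(t)
    return parts

def partitioned_hash_join(r, s, key_r, key_s, mem_tuples, n_parts=4, level=0):
    # Base case: R fits in "memory" (or we stop recursing on heavy skew).
    if len(r) <= mem_tuples or level >= 8:
        table = defaultdict(list)
        for t in r:
            table[key_r(t)].append(t)
        return [(t, u) for u in s for t in table.get(key_s(u), ())]
    # Partition both relations the same way, then join matching partitions.
    out = []
    r_parts = partition(r, key_r, n_parts, level)
    s_parts = partition(s, key_s, n_parts, level)
    for rp, sp in zip(r_parts, s_parts):
        out.extend(partitioned_hash_join(
            rp, sp, key_r, key_s, mem_tuples, n_parts, level + 1))
    return out

r = [(i % 5, "r%d" % i) for i in range(20)]
s = [(i % 7, "s%d" % i) for i in range(14)]
joined = partitioned_hash_join(r, s, lambda t: t[0], lambda t: t[0], mem_tuples=3)
```

With a single level of partitioning, each relation is read once, written once as partitions, and read once more for the join, which is where the bound of 3 * (b(R) + b(S)) comes from.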

17 The Duality of Hash and Sort
Different means of partitioning and merging data when comparisons are necessary:
• Break on physical rule (mem size) in sorting
  - Merge on logical step, the merge
• Break on logical rule (hash val) in hashing
  - Combine using physical step (concat)
• When larger-than-memory sorting is necessary and multiple operators use the same key, we can make all operators work on the same in-memory portion of data at the same time
• Can we do this with hashing? Hash teams (Graefe)

18 What If I Want to Distribute Query Processing?
• Where do I put the data in the first place (or do I have a choice)?
• How do we get data from point A -> point B?
• What about delays?
• What about “binding patterns”?
  - Looks kind of like an index join with a sargable predicate

19 Pipelined Hash Join Useful for Joining Web Sources
• Two hash tables
• As a tuple comes in, add to the appropriate side & join with opposite table
• Fully pipelined, adaptive to source data rates
• Can handle overflow as with hash join
• Needs more memory
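The two-table scheme can be sketched as an event-driven class. This is a minimal sketch: `PipelinedHashJoin`, `on_left`, and `on_right` are assumed names, and overflow handling is left out.

```python
from collections import defaultdict

class PipelinedHashJoin:
    """Symmetric join: each arriving tuple is stored, then probes the other side."""

    def __init__(self):
        self.left = defaultdict(list)
        self.right = defaultdict(list)

    def on_left(self, key, tup):
        self.left[key].append(tup)
        # Join with everything seen so far from the right source.
        return [(tup, r) for r in self.right.get(key, ())]

    def on_right(self, key, tup):
        self.right[key].append(tup)
        # Join with everything seen so far from the left source.
        return [(l, tup) for l in self.left.get(key, ())]

j = PipelinedHashJoin()
j.on_left(1, "a")          # no match yet
j.on_right(1, "x")         # matches the stored left tuple
j.on_left(1, "b")          # matches the stored right tuple
```

Each matching pair is emitted exactly once, when the later of its two tuples arrives, so output can begin before either source is exhausted. The price is that both inputs are kept in memory.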

20 The Dependent Join
• Take attributes from left and feed to the right source as input/filter
• Important in data integration
• Simple method: for each tuple from left, send to right source, get data back, join
• More complex:
  - Hash “cache” of attributes & mappings
  - Don’t send attribute already seen
  - Bloom joins (use bit-vectors to reduce traffic)
[Figure: Join A.x = B.y over sources A and B, with x values passed from A to B]
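The cached variant can be sketched as follows. This is an illustrative sketch: `dependent_join` and `fetch_right` are hypothetical names, and `fetch_right` stands for whatever call retrieves matching tuples from the remote right-hand source given a binding value.

```python
def dependent_join(left, left_key, fetch_right):
    """Feed each left binding to the right source, caching answers by value."""
    cache = {}
    for l in left:
        k = left_key(l)
        if k not in cache:
            # Only contact the right source for bindings not seen before.
            cache[k] = list(fetch_right(k))
        for r in cache[k]:
            yield (l, r)

# A stand-in for a remote source that answers lookups on B.y.
calls = []
b_data = {1: ["b1", "b2"], 2: ["b3"]}

def fetch_right(y):
    calls.append(y)
    return b_data.get(y, [])

a = [(1, "a1"), (2, "a2"), (1, "a3"), (3, "a4")]
result = list(dependent_join(a, lambda t: t[0], fetch_right))
```

Here the right source is contacted three times for four left tuples, because the binding 1 repeats and is served from the cache.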

21 Wrap-Up of Execution
Query execution is all about engineering for efficiency
• O(1) and O(lg n) algorithms wherever possible
• Avoid looking at or copying data wherever possible
• Note that larger-than-memory is of paramount importance
  - Should that be so in today’s world?
As we’ve seen it so far, it’s all about pipelining things through as fast as possible
But may also need to consider other axes:
• Adaptivity/flexibility – may sometimes need this
• Information flow – to the optimizer, the runtime system

22 Query Optimization
• Challenge: pick the query execution plan that has minimum cost
• Sources of cost:
  - Interactions with other work
  - Size of intermediate results
  - Choices of algorithms, access methods
  - Mismatch between I/O, CPU rates
  - Data properties – skew, order, placement
• Strategy: Estimate the cost of every query plan, find cheapest
• Given:
  - Some notion of CPU, disk speeds
  - Cost model for every operator
  - Some information about tables and data

23 The General Model of Optimization
• Given an AST of a query:
  - Build a logical query plan (tree of query algebraic operations)
  - Transform into “better” logical plan
  - Convert into a physical query plan (includes strategies for executing operations)

24 Which Operators Need Significant Optimization Decisions?
• We typically make the following assumptions:
  - All predicates are evaluated as early as possible
  - All data is projected away as early as possible
• As a general rule, those that produce intermediate state or are blocking:
  - Joins
  - Aggregation
  - Sorting
• By choosing a join ordering, we’re automatically choosing where selections and projections are pushed – why is this so?

25 The Basic Model: System-R
• Breaks a query into its blocks, separately optimizes them
• Focuses strictly on joins (and only a few kinds) in dynamic programming enumeration
  - Principle of optimality: best k-way join includes best (k-1)-way join
• Use simple table statistics when available, based on indices; “magic numbers” where unavailable
• Heuristics
  - Push “sargable” selects, projects as low as possible
  - Cartesian products after joins
  - Left-linear trees only: $n \cdot 2^{n-1}$ cost-est. operations
  - Grouping last
• Extra “interesting orders” dimension
  - Grouping, ordering, join attributes
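The dynamic-programming enumeration over left-linear trees can be sketched in a few lines. This is a toy under loudly stated assumptions, not System-R's actual cost model: the function name, the cardinality and selectivity inputs, and the cost measure (sum of intermediate result sizes) are all invented for illustration, and interesting orders and the deferral of Cartesian products are ignored.

```python
from itertools import combinations

def best_left_deep_plan(card, sel):
    """card: {relation: rows}; sel: {frozenset({a, b}): join selectivity}.

    Returns (cost, rows, join_order) for the cheapest left-deep plan,
    where cost is the assumed sum of intermediate result sizes.
    """
    rels = sorted(card)
    # best[S] holds the cheapest plan found for the set of relations S.
    best = {frozenset([r]): (0.0, float(card[r]), (r,)) for r in rels}
    for k in range(2, len(rels) + 1):
        for subset in combinations(rels, k):
            s = frozenset(subset)
            for r in subset:            # r is the relation joined last
                rest = s - {r}
                cost, rows, order = best[rest]   # principle of optimality
                f = 1.0
                for o in rest:
                    f *= sel.get(frozenset((o, r)), 1.0)
                out_rows = rows * card[r] * f
                cand = (cost + out_rows, out_rows, order + (r,))
                if s not in best or cand[0] < best[s][0]:
                    best[s] = cand
    return best[frozenset(rels)]

card = {"A": 1000, "B": 10, "C": 100}
sel = {frozenset(("A", "B")): 0.01, frozenset(("B", "C")): 0.1}
cost, rows, order = best_left_deep_plan(card, sel)
```

Each of the 2^n subsets is extended by at most n choices of last relation, and only the best plan per subset is kept. In this example the plan never starts by joining A with C, since no predicate connects them and the Cartesian product is expensive.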

26 Next Time: Beyond System-R
• Cross-query-block optimizations
  - e.g., push a selection predicate from one block to another
• Better statistics
• More general kinds of optimizations
  - Optimization of aggregation operations
  - Different cost and data models, e.g., OO, XML
  - Additional joins, e.g., “containment joins”
• Can we build an extensible architecture for this?
  - Logical, physical, and logical-to-physical transformations
  - Enforcers
• Alternative search strategies
  - Left-deep plans aren’t always optimal
  - Perhaps we can prune more efficiently

27 Upcoming Readings
For Monday:
• Read EXODUS and Starburst papers
• Write one review contrasting the two on the major issues