Query Optimization 3 Cost Estimation R&G, Chapters 12, 13, 14 Lecture 15.

Administrivia Homework 2 Due Tonight –Remember you have 4 slip days for the course Homeworks 3 & 4 available later this week –3 is written assignment, deals with optimization, due before midterm 2 –4 is programming assignment, implementing query processing, due after Spring Break Midterm 2 is 3/22, 2 weeks from Thursday

Review: Query Processing Queries start out as SQL Database translates SQL to one or more Relational Algebra plans Plan is a tree of operations, with access path for each Access path is how each operator gets tuples –If working directly on table, can use scan, index –Some operators, like sort-merge join, or group-by, need tuples sorted –Often, operators pipelined, getting tuples that are output from earlier operators in the tree Database estimates cost for various plans, chooses least expensive

Today Review costs for: –Sorting –Selection –Projection –Joins Re-examine Hashing for: –Projection –Joins

General External Merge Sort More than 3 buffer pages. How can we utilize them? To sort a file with N pages using B buffer pages: –Pass 0: use B buffer pages. Produce sorted runs of B pages each. –Pass 1, 2, …, etc.: merge B-1 runs. [Figure: B main-memory buffers sit between the input disk and the output disk; B-1 input buffers (INPUT 1, INPUT 2, ..., INPUT B-1) feed one OUTPUT buffer.]

Cost of External Merge Sort Minimum amount of memory: 3 pages –Initial runs of 3 pages –Then 2-way merge of sorted runs (2 pages for inputs, one for outputs) –# of passes: $1 + \lceil \log_2 (N/3) \rceil$ With more memory, fewer passes –With B pages, # of passes: $1 + \lceil \log_{B-1} (N/B) \rceil$ I/O Cost = 2N * (# of passes)
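The pass-count formula above can be checked with a small calculator. This is a minimal sketch, not code from the lecture: the function names are made up for illustration, and the loop counts merge passes with integer arithmetic instead of evaluating the logarithm, which gives the same result without floating-point rounding.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b."""
    return -(-a // b)


def num_passes(n_pages: int, n_buffers: int) -> int:
    """Passes of external merge sort: 1 + ceil(log_{B-1}(N/B))."""
    if n_buffers < 3:
        raise ValueError("external merge sort needs at least 3 buffer pages")
    runs = ceil_div(n_pages, n_buffers)  # sorted runs produced by pass 0
    passes = 1
    while runs > 1:
        # Each merge pass combines up to B-1 runs into one.
        runs = ceil_div(runs, n_buffers - 1)
        passes += 1
    return passes


def sort_io_cost(n_pages: int, n_buffers: int) -> int:
    """I/O cost = 2N * (# of passes): every pass reads and writes each page."""
    return 2 * n_pages * num_passes(n_pages, n_buffers)
```

For a 108-page file and 5 buffer pages, `num_passes(108, 5)` is 4 and `sort_io_cost(108, 5)` is 864 I/Os.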

External Sort Example E.g., with 5 buffer pages, to sort 108 page file: –Pass 0: 22 sorted runs of 5 pages each (last run is only 3 pages) Now, do four-way (B-1) merges –Pass 1: 6 sorted runs of 20 pages each (last run is only 8 pages) –Pass 2: 2 sorted runs, 80 pages and 28 pages –Pass 3: Sorted file of 108 pages
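The run sizes in this example can be reproduced by simulating the passes. The sketch below tracks only run lengths in pages, not actual data, and the function name is hypothetical.

```python
def run_sizes_per_pass(n_pages: int, n_buffers: int) -> list[list[int]]:
    """Return, for each pass, the list of sorted-run lengths (in pages)."""
    if n_buffers < 3:
        raise ValueError("need at least 3 buffer pages")
    # Pass 0: fill all B buffers, sort, write out a run of (up to) B pages.
    runs = [min(n_buffers, n_pages - start)
            for start in range(0, n_pages, n_buffers)]
    history = [runs]
    fan_in = n_buffers - 1  # B-1 input buffers, 1 output buffer
    while len(runs) > 1:
        # Merge consecutive groups of B-1 runs into one longer run each.
        runs = [sum(runs[i:i + fan_in]) for i in range(0, len(runs), fan_in)]
        history.append(runs)
    return history


for pass_no, runs in enumerate(run_sizes_per_pass(108, 5)):
    print(f"Pass {pass_no}: {len(runs)} run(s), sizes {runs}")
```

With 108 pages and 5 buffers this yields 22 runs, then 6 runs (five of 20 pages and one of 8), then runs of 80 and 28 pages, then a single 108-page run, matching the slide.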

Using B+ Trees for Sorting Scenario: –Table to be sorted has B+ tree index on sorting column(s). Idea: –Retrieve records in order by traversing leaf pages. Is this a good idea? Cases to consider: –B+ tree is clustered: Good idea! –B+ tree is not clustered: Could be a very bad idea! I/O Cost –Clustered tree: ~ 1.5N –Unclustered tree: 1 I/O per tuple, worst case!

Selections: “age < 20”, “fname = Bob”, etc No index –Do sequential scan over all tuples –Cost: N I/Os Sorted data –Do binary search –Cost: $\log_2 N$ I/Os Clustered B-Tree –Cost: 2 or 3 I/Os to find first record + 1 I/O for each qualifying page Unclustered B-Tree –Cost: 2 or 3 I/Os to find first RID + ~1 I/O for each qualifying tuple Clustered Hash Index –Cost: ~1.2 I/Os to find bucket, all tuples inside Unclustered Hash Index –Cost: ~1.2 I/Os to find bucket, + ~1 I/O for each matching tuple
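These rules of thumb translate directly into small cost estimators. The sketch below is illustrative only: the function names are invented here, the B-tree descent is assumed to cost 3 I/Os (the slide says 2 or 3), and the 1.2 I/O bucket lookup is the slide's approximation.

```python
import math


def scan_cost(n_pages: int) -> int:
    """No index: read every page."""
    return n_pages


def sorted_file_cost(n_pages: int) -> int:
    """Sorted data: binary search over the pages."""
    return math.ceil(math.log2(n_pages))


def clustered_btree_cost(matching_pages: int, descent: int = 3) -> int:
    """Clustered B-tree: descend once, then 1 I/O per qualifying page."""
    return descent + matching_pages


def unclustered_btree_cost(matching_tuples: int, descent: int = 3) -> int:
    """Unclustered B-tree: descend once, then ~1 I/O per qualifying tuple."""
    return descent + matching_tuples


def clustered_hash_cost(bucket_lookup: float = 1.2) -> float:
    """Clustered hash index: all matches are inside the bucket."""
    return bucket_lookup


def unclustered_hash_cost(matching_tuples: int,
                          bucket_lookup: float = 1.2) -> float:
    """Unclustered hash index: bucket lookup + ~1 I/O per matching tuple."""
    return bucket_lookup + matching_tuples
```

An optimizer would evaluate whichever of these apply to the available indexes and keep the cheapest; for instance, 500 qualifying tuples through an unclustered B-tree cost about 503 I/Os, which is worse than scanning a 100-page table.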

Selection Exercise Sal > 100 –Sequential Scan: 100 I/Os –Btree on sal? Unclustered, 503 I/Os –BTree on ? Can’t use it Age = 25 –Sequential Scan: 100 I/Os –Hash on age: 1.2 I/O to get to bucket 20 matching tuples, 1 I/O for each –BTree : ~3 I/Os to find leaf + #matching pages 20 matching tuples, clustered, ~ 2 I/Os Age > 20 –Sequential Scan: 100 I/Os –Hash on age? Not with range query. –BTree : ~3 I/Os to find leaf + #matching pages (Age > 20) is 90% of pages, or ~90*1.5 = 135 I/Os

Selection Exercise (cont) Eid = 1000 –Sequential Scan: ~50 I/Os (avg) –Hash on eid: ~1.2 I/Os to find bucket, 1 I/O to get record Sal > 100 and age < 30 –Sequential Scan: 100 I/Os –Btree : ~ 3 I/Os to find leaf, 30% of pages match, so 30*1.5 = 45 I/Os

Projection Expensive when eliminating duplicates Can do this via: –Sorting: cost no more than external sort Cheaper if you project columns in initial pass, since more projected tuples fit in each page. –Hashing: build a hash table, duplicates will end up in the same bucket

An Alternative to Sorting: Remove duplicates with Hashing! Idea: –Many of the things we use sort for don’t exploit the order of the sorted data –e.g.: removing duplicates in DISTINCT –e.g.: finding matches in JOIN Often good enough to match all tuples with equal values Hashing does this! –And may be cheaper than sorting! (Hmmm…!) –But how to do it for data sets bigger than memory??

General Idea Two phases: –Partition: use a hash function h to split tuples into partitions on disk. Key property: all matches live in the same partition. –ReHash: for each partition on disk, build a main-memory hash table using a hash function h2

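Two Phases [Figure: Partition phase: the original relation is read from disk through one INPUT buffer, and hash function h routes each tuple to one of B-1 OUTPUT buffers, producing B-1 partitions on disk. Rehash phase: each partition R_i (<= B pages) is read back from disk into the B main-memory buffers, where hash fn h2 builds a hash table that produces the result.]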

Duplicate Elimination using Hashing Read one bucket at a time; for each group of identical tuples, output one. [Figure: same two-phase diagram as before. Partition: original relation on disk, one INPUT buffer, hash function h, B-1 OUTPUT buffers, B-1 partitions on disk. Rehash: hash table for partition R_i (<= B pages) built in the B main-memory buffers with hash fn h2, producing the result.]
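The two phases can be sketched in a few lines. This is an in-memory illustration under loud assumptions: Python lists stand in for the on-disk partitions, the built-in `hash` plays the role of h, a `set` plays the role of the h2 hash table, and the function name is made up.

```python
def hash_distinct(tuples, n_buffers):
    """Duplicate elimination by partitioning, then rehashing each partition."""
    n_parts = n_buffers - 1  # one input buffer, B-1 output buffers
    partitions = [[] for _ in range(n_parts)]

    # Partition phase: h sends equal tuples to the same partition.
    for t in tuples:
        partitions[hash(t) % n_parts].append(t)

    # Rehash phase: handle one partition at a time, assuming each fits
    # in memory. The set acts as the in-memory hash table built with h2.
    result = []
    for part in partitions:
        seen = set()
        for t in part:
            if t not in seen:  # first tuple of each group of duplicates
                seen.add(t)
                result.append(t)
    return result


rows = [("bob", 20), ("ann", 31), ("bob", 20), ("cy", 20), ("ann", 31)]
print(sorted(hash_distinct(rows, n_buffers=5)))
```

Because duplicates always hash to the same partition, no comparison across partitions is ever needed, which is what lets each partition be processed independently.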

Hashing in Detail Two phases, two hash functions First pass: partition into (B-1) buckets E.g., B = 5 pages, h(x) is the two low-order bits of x [Figure: an input file is read through memory and split into output files, one per bucket.]

Memory Requirement and I/O Costs If we can hash in two passes -> cost is 4N How big of a table can we hash in two passes? –B-1 “partitions” result from Phase 0 –Each should be no more than B pages in size –Answer: B(B-1). Said differently: We can hash a table of size N pages in about $\sqrt{N}$ pages of memory –Note: assumes hash function distributes records evenly! Have a bigger table? Recursive partitioning!

How does this compare with external sorting?

Memory Requirement for External Sorting How big of a table can we sort in two passes? –Each “sorted run” after Phase 0 is of size B –Can merge up to B-1 sorted runs in Phase 1 –Answer: B(B-1). Said differently: We can sort a table of size N pages in about $\sqrt{N}$ pages of memory Have a bigger table? Additional merge passes!
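The two-pass bound B(B-1) is the same for hashing and for sorting, so one helper covers both. The sketch below uses hypothetical function names and, for hashing, inherits the slide's assumption that the hash function spreads records evenly.

```python
import math


def max_two_pass_pages(n_buffers: int) -> int:
    """Largest table (in pages) that two passes can handle: B * (B - 1)."""
    return n_buffers * (n_buffers - 1)


def min_buffers_two_pass(n_pages: int) -> int:
    """Smallest B with B * (B - 1) >= N, which is roughly sqrt(N)."""
    if n_pages < 1:
        raise ValueError("table must have at least one page")
    b = math.isqrt(n_pages)  # b * (b - 1) < N here, so search upward
    while b * (b - 1) < n_pages:
        b += 1
    return b


print(max_two_pass_pages(5))            # 20
print(min_buffers_two_pass(1_000_000))  # 1001
```

A million-page table therefore needs only about a thousand buffer pages for a two-pass sort or hash.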

So which is better ?? Based on our simple analysis: –Same memory requirement for 2 passes –Same IO cost Digging deeper … Sorting pros: –Great if input already sorted (or almost sorted) –Great if need output to be sorted anyway –Not sensitive to “data skew” or “bad” hash functions Hashing pros: –Highly parallelizable (will discuss later in semester) –Can exploit extra memory to reduce # IOs (stay tuned…)

Nested Loops Joins R, with M pages, joins S, with N pages ($p_R$ = records per page of R) Nested Loops –Simple nested loops Insanely inefficient: $M + p_R \cdot M \cdot N$ –Paged nested loops – only 3 pages of memory: $M + M \cdot N$ –Blocked nested loops – B pages of memory: $M + \lceil M/(B-2) \rceil \cdot N$ If R fits in memory ($M \le B-2$), cost is only M + N –Index nested loops: $M + p_R \cdot M \cdot$ (index cost) Only good if M is very small
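The four cost formulas are easy to compare side by side. In the sketch below the function names and the sample sizes (M = 1000, N = 500, 100 records per page of R) are illustrative assumptions, not figures from the lecture.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b."""
    return -(-a // b)


def simple_nlj_cost(m: int, n: int, p_r: int) -> int:
    """Scan S once for every tuple of R: M + p_R * M * N."""
    return m + p_r * m * n


def paged_nlj_cost(m: int, n: int) -> int:
    """Scan S once for every page of R: M + M * N."""
    return m + m * n


def blocked_nlj_cost(m: int, n: int, n_buffers: int) -> int:
    """Scan S once per block of B-2 pages of R: M + ceil(M/(B-2)) * N."""
    return m + ceil_div(m, n_buffers - 2) * n


def index_nlj_cost(m: int, p_r: int, index_cost: float) -> float:
    """Probe an index on S for every tuple of R: M + p_R * M * index cost."""
    return m + p_r * m * index_cost


M, N, P_R = 1000, 500, 100  # hypothetical relation sizes
print(simple_nlj_cost(M, N, P_R))    # 50001000
print(paged_nlj_cost(M, N))          # 501000
print(blocked_nlj_cost(M, N, 102))   # 6000
print(blocked_nlj_cost(M, N, 1002))  # 1500, R fits in memory: M + N
```

Each step up in buffer use cuts the number of scans of S, which is the term that dominates the cost.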

Sort-Merge Join Simple case: –Sort both tables on join column –Merge –Cost: external sort cost + merge cost: $2M(1 + \lceil \log_{B-1} (M/B) \rceil) + 2N(1 + \lceil \log_{B-1} (N/B) \rceil) + M + N$ Optimized case: –If we have enough memory, do final merge and join in same pass. This avoids the final write pass of the sort and the read pass of the merge –Can we merge on the 2nd pass? Only if #runs from 1st pass < B –#runs for R is $\lceil M/B \rceil$. #runs for S is $\lceil N/B \rceil$. Total #runs ≈ (M+N)/B –Can merge on 2nd pass if $(M+N)/B < B$, or $M+N < B^2$ –Cost: 3(M+N)
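Both cost cases can be computed directly. This is a sketch with invented function names; the optimized variant returns `None` when the M + N < B^2 condition fails, which is a convention chosen here rather than anything the lecture specifies.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b."""
    return -(-a // b)


def sort_passes(n_pages: int, n_buffers: int) -> int:
    """Passes of external merge sort: 1 + ceil(log_{B-1}(N/B))."""
    if n_buffers < 3:
        raise ValueError("need at least 3 buffer pages")
    runs = ceil_div(n_pages, n_buffers)
    passes = 1
    while runs > 1:
        runs = ceil_div(runs, n_buffers - 1)
        passes += 1
    return passes


def simple_smj_cost(m: int, n: int, n_buffers: int) -> int:
    """Fully sort both inputs, then read both once to merge."""
    return (2 * m * sort_passes(m, n_buffers)
            + 2 * n * sort_passes(n, n_buffers)
            + m + n)


def optimized_smj_cost(m: int, n: int, n_buffers: int):
    """3(M+N) when the final merge and the join share a pass, else None."""
    if m + n < n_buffers * n_buffers:
        return 3 * (m + n)
    return None


print(simple_smj_cost(1000, 500, 100))     # 7500
print(optimized_smj_cost(1000, 500, 100))  # 4500
```

With M = 1000, N = 500 and B = 100 (illustrative numbers), folding the join into the last merge pass saves 3000 I/Os.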

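Hash Join [Figure: Partitioning phase: each original relation is read from disk through one INPUT buffer, and hash function h splits it across B-1 OUTPUT buffers into B-1 partitions on disk. Matching phase: for each pair of partitions of R & S, a hash table for partition Ri (B-2 pages) is built in memory with hash fn h2, partition Si is streamed through an input buffer, and matches leave through an output buffer as the join result.]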

Cost of Hash Join Partitioning phase: read+write both relations ⇒ 2(|R|+|S|) I/Os Matching phase: read both relations ⇒ |R|+|S| I/Os Total cost of 2-pass hash join = 3(|R|+|S|) Q: what is cost of 2-pass sort-merge join? Q: how much memory needed for 2-pass sort join? Q: how much memory needed for 2-pass hash join?
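The partition-then-match scheme behind this cost can be sketched as follows. It is an in-memory illustration under stated assumptions: lists stand in for on-disk partitions, the built-in `hash` stands in for h, a dictionary stands in for the h2 hash table, tuples are joined on positional keys, and all names are hypothetical.

```python
from collections import defaultdict


def partitioned_hash_join(r, s, r_key, s_key, n_buffers):
    """Equality join of r and s on r[r_key] == s[s_key]."""
    n_parts = n_buffers - 1
    r_parts = [[] for _ in range(n_parts)]
    s_parts = [[] for _ in range(n_parts)]

    # Partitioning phase: the same h is applied to both relations, so
    # matching tuples always land in partitions with the same index.
    for t in r:
        r_parts[hash(t[r_key]) % n_parts].append(t)
    for u in s:
        s_parts[hash(u[s_key]) % n_parts].append(u)

    # Matching phase: build a table on R_i, then probe it with S_i.
    result = []
    for r_part, s_part in zip(r_parts, s_parts):
        table = defaultdict(list)
        for t in r_part:
            table[t[r_key]].append(t)
        for u in s_part:
            for t in table.get(u[s_key], []):
                result.append(t + u)
    return result


R = [(1, "a"), (2, "b"), (2, "c")]
S = [(2, "x"), (3, "y")]
print(sorted(partitioned_hash_join(R, S, 0, 0, n_buffers=4)))
```

In a disk-based setting, the partitioning loops correspond to the read+write of both relations and the matching loop to the final read, which is where the 3(|R|+|S|) total comes from.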

An important optimization to hashing Have B memory buffers. Want to hash relation of size N. [Figure: # passes and cost plotted against N: 1 pass (cost N) up to N = B, 2 passes (cost 3N) up to N = B^2.] If B < N < B^2, will have unused memory …

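Hybrid Hashing Idea: keep one of the hash buckets in memory! [Figure: the original relation is read from disk through one INPUT buffer; hash function h sends tuples either to B-k OUTPUT buffers, which are written to B-k partitions on disk, or to a k-buffer in-memory hash table built with hash function h3.] Q: how do we choose the value of k?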

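Cost reduction due to hybrid hashing [Figure: the same plot of # passes and cost against N as before, redrawn for hybrid hashing; N is marked at B and B^2, # passes at 1 and 2, and cost at N and 3N.]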

Summary: Hashing vs. Sorting Sorting pros: –Good if input already sorted, or need output sorted –Not sensitive to data skew or bad hash functions Hashing pros: –Often cheaper due to hybrid hashing –For join: # passes depends on size of smaller relation –Highly parallelizable

Summary Several alternative evaluation algorithms for each operator. Query evaluated by converting to a tree of operators and evaluating the operators in the tree. Must understand query optimization in order to fully understand the performance impact of a given database design (relations, indexes) on a workload (set of queries). Two parts to optimizing a query: – Consider a set of alternative plans. Must prune search space; typically, left-deep plans only. – Must estimate cost of each plan that is considered. Must estimate size of result and cost for each plan node. Key issues: Statistics, indexes, operator implementations.