CMU SCS 15-415/615Faloutsos/Pavlo1 Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications C. Faloutsos & A. Pavlo Lecture #13: Query.

Slides:



Advertisements
Similar presentations
CMU SCS Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications C. Faloutsos – A. Pavlo Lecture#13: Query Evaluation.
Advertisements

Equality Join R X R.A=S.B S : : Relation R M PagesN Pages Relation S Pr records per page Ps records per page.
CS 245Notes 71 CS 245: Database System Principles Notes 7: Query Optimization Hector Garcia-Molina.
Implementation of Relational Operations (Part 2) R&G - Chapters 12 and 14.
Implementation of relational operations
1 Lecture 23: Query Execution Friday, March 4, 2005.
Database Management Systems 3ed, R. Ramakrishnan and Johannes Gehrke1 Evaluation of Relational Operations: Other Techniques Chapter 14, Part B.
Implementation of Other Relational Algebra Operators, R. Ramakrishnan and J. Gehrke1 Implementation of other Relational Algebra Operators Chapter 12.
Database Management Systems, R. Ramakrishnan and Johannes Gehrke1 Evaluation of Relational Operations: Other Techniques Chapter 12, Part B.
Database Management Systems, R. Ramakrishnan and Johannes Gehrke1 Evaluation of Relational Operations: Other Techniques Chapter 12, Part B.
Query Evaluation. An SQL query and its RA equiv. Employees (sin INT, ename VARCHAR(20), rating INT, age REAL) Maintenances (sin INT, planeId INT, day.
SPRING 2004CENG 3521 Query Evaluation Chapters 12, 14.
1 Relational Query Optimization Module 5, Lecture 2.
Midterm Review Spring Overview Sorting Hashing Selections Joins.
Lecture 24: Query Execution Monday, November 20, 2000.
1  Simple Nested Loops Join:  Block Nested Loops Join  Index Nested Loops Join  Sort Merge Join  Hash Join  Hybrid Hash Join Evaluation of Relational.
Unary Query Processing Operators CS 186, Spring 2006 Background for Homework 2.
1 Optimization - Selection. 2 The Selection Operation Table: Reserves(sid, bid, day, agent) A page (block) can hold 100 Reserves tuples There are 1,000.
Unary Query Processing Operators Not in the Textbook!
Query Optimization 3 Cost Estimation R&G, Chapters 12, 13, 14 Lecture 15.
1 Evaluation of Relational Operations: Other Techniques Chapter 12, Part B.
CSCI 5708: Query Processing I Pusheng Zhang University of Minnesota Feb 3, 2004.
CS 4432query processing - lecture 171 CS4432: Database Systems II Lecture #17 Join Processing Algorithms (cont). Professor Elke A. Rundensteiner.
Evaluation of Relational Operations. Relational Operations v We will consider how to implement: – Selection ( ) Selects a subset of rows from relation.
1 Relational Operators. 2 Outline Logical/physical operators Cost parameters and sorting One-pass algorithms Nested-loop joins Two-pass algorithms.
CMU SCS Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications C. Faloutsos – A. Pavlo Lecture#14: Implementation of Relational Operations.
ICOM 6005 – Database Management Systems Design Dr. Manuel Rodríguez-Martínez Electrical and Computer Engineering Department Lecture 14 – Join Processing.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Query Evaluation Chapter 12: Overview.
COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization.
©Silberschatz, Korth and Sudarshan13.1Database System Concepts Chapter 13: Query Processing Overview Measures of Query Cost Selection Operation Sorting.
CPSC 404, Laks V.S. Lakshmanan1 Evaluation of Relational Operations: Other Operations Chapter 14 Ramakrishnan & Gehrke (Sections ; )
Relational Operator Evaluation. Overview Index Nested Loops Join If there is an index on the join column of one relation (say S), can make it the inner.
6.830 Lecture 6 9/28/2015 Cost Estimation and Indexing.
Lecture 24 Query Execution Monday, November 28, 2005.
CPSC 404, Laks V.S. Lakshmanan1 Evaluation of Relational Operations – Join Chapter 14 Ramakrishnan and Gehrke (Section 14.4)
Query Processing CS 405G Introduction to Database Systems.
Lecture 17: Query Execution Tuesday, February 28, 2001.
Query Execution. Where are we? File organizations: sorted, hashed, heaps. Indexes: hash index, B+-tree Indexes can be clustered or not. Data can be stored.
ICOM 6005 – Database Management Systems Design Dr. Manuel Rodríguez-Martínez Electrical and Computer Engineering Department Lecture 12 – Introduction to.
CS 440 Database Management Systems Lecture 5: Query Processing 1.
Relational Operator Evaluation. overview Projection Two steps –Remove unwanted attributes –Eliminate any duplicate tuples The expensive part is removing.
Query Execution Query compiler Execution engine Index/record mgr. Buffer manager Storage manager storage User/ Application Query update Query execution.
SQL and Query Execution for Aggregation. Example Instances Reserves Sailors Boats.
1 Lecture 23: Query Execution Monday, November 26, 2001.
1 Overview of Query Evaluation Chapter Outline  Query Optimization Overview  Algorithm for Relational Operations.
Database Applications (15-415) DBMS Internals- Part VIII Lecture 19, March 29, 2016 Mohammad Hammoud.
Database Applications (15-415) DBMS Internals- Part IX Lecture 20, March 31, 2016 Mohammad Hammoud.
Database Applications (15-415) DBMS Internals- Part VIII Lecture 17, Oct 30, 2016 Mohammad Hammoud.
Database Applications (15-415) DBMS Internals- Part VII Lecture 16, October 25, 2016 Mohammad Hammoud.
Evaluation of Relational Operations
Database Management Systems (CS 564)
Evaluation of Relational Operations: Other Operations
Implementation of Relational Operations (Part 2)
C. Faloutsos Query Optimization – part 1
CS222P: Principles of Data Management Notes #11 Selection, Projection
Lecture#12: External Sorting (R&G, Ch13)
Database Applications (15-415) DBMS Internals- Part VII Lecture 19, March 27, 2018 Mohammad Hammoud.
Faloutsos/Pavlo C. Faloutsos – A. Pavlo Lecture#13: Query Evaluation
Database Applications (15-415) DBMS Internals- Part IX Lecture 21, April 1, 2018 Mohammad Hammoud.
Implementation of Relational Operations
Lecture 13: Query Execution
CS222: Principles of Data Management Notes #11 Selection, Projection
Evaluation of Relational Operations: Other Techniques
Lecture 22: Query Execution
C. Faloutsos Query Optimization – part 2
Evaluation of Relational Operations: Other Techniques
CS222/CS122C: Principles of Data Management UCI, Fall 2018 Notes #10 Selection, Projection Instructor: Chen Li.
Lecture 24: Query Execution
Lecture 20: Query Execution
Presentation transcript:

CMU SCS /615Faloutsos/Pavlo1 Carnegie Mellon Univ. Dept. of Computer Science /615 – DB Applications C. Faloutsos & A. Pavlo Lecture #13: Query Evaluation (R&G ch. 12 and 14)

CMU SCS /615Faloutsos/Pavlo2 Cost-based Query Sub-System Query Parser Query Optimizer Plan Generator Plan Cost Estimator Query Plan Evaluator Catalog Manager Schema Statistics Select * From Blah B Where B.blah = blah Queries

CMU SCS /615Faloutsos/Pavlo3 Outline (12.1) Catalog (12.2) Intro to Operator Evaluation (12.3) Algo’s for Relational Operations (12.6) Typical Q-optimizer (14.3.2) Hashing

CMU SCS /615Faloutsos/Pavlo4 Cost-based Query Sub-System Query Parser Query Optimizer Plan Generator Plan Cost Estimator Query Plan Evaluator Catalog Manager Schema Statistics Select * From Blah B Where B.blah = blah Queries

CMU SCS /615Faloutsos/Pavlo5 Schema What would you store? How?

CMU SCS /615Faloutsos/Pavlo6 Schema What would you store? A: info about tables, attributes, indices, users How? A: in tables! eg., –Attribute_Cat (attr_name: string, rel_name: string; type: string; position: integer)

CMU SCS /615Faloutsos/Pavlo7 Statistics Why do we need them? What would you store?

CMU SCS /615Faloutsos/Pavlo8 Statistics Why do we need them? A: To estimate cost of query plans What would you store? –NTuples(R): # records for table R –NPages(R): # pages for R –NKeys(I): # distinct key values for index I –INPages(I): # pages for index I –IHeight(I): # levels for I –ILow(I), IHigh(I): range of values for I –...

CMU SCS /615Faloutsos/Pavlo9 Outline (12.1) Catalog (12.2) Intro to Operator Evaluation (12.3) Algo’s for Relational Operations (12.6) Typical Q-optimizer (14.3.2) Hashing

CMU SCS /615Faloutsos/Pavlo10 Operator evaluation 3 methods we’ll see often:

CMU SCS /615Faloutsos/Pavlo11 Operator evaluation 3 methods we’ll see often: indexing iteration (= seq. scanning) partitioning (sorting and hashing)

CMU SCS /615Faloutsos/Pavlo12 ``Access Path’’ Eg., index (tree, or hash), or scanning Selectivity of an access path: –% of pages we retrieve eg., selectivity of a hash index, on range query: 100% (no reduction!)

CMU SCS /615Faloutsos/Pavlo13 Outline (12.1) Catalog (12.2) Intro to Operator Evaluation (12.3) Algo’s for Relational Operations (12.6) Typical Q-optimizer (14.3.2) Hashing

CMU SCS /615Faloutsos/Pavlo14 Algorithms selection: projection join group by order by

CMU SCS /615Faloutsos/Pavlo15 Algorithms selection: scan; index projection (dup. elim.): join group by order by

CMU SCS /615Faloutsos/Pavlo16 Algorithms selection: scan; index projection (dup. elim.): hashing; sorting join group by order by

CMU SCS /615Faloutsos/Pavlo17 Algorithms selection: scan; index projection (dup. elim.): hashing; sorting join: many ways (loops, sort-merge, etc) group by order by

CMU SCS /615Faloutsos/Pavlo18 Algorithms selection: scan; index projection (dup. elim.): hashing; sorting join: many ways (loops, sort-merge, etc) group by: hashing; sorting order by: sorting

CMU SCS /615Faloutsos/Pavlo19 Outline (12.1) Catalog (12.2) Intro to Operator Evaluation (12.3) Algo’s for Relational Operations (12.6) Typical Q-optimizer (14.3.2) Hashing

CMU SCS /615Faloutsos/Pavlo20 Q-opt steps bring query in internal form (eg., parse tree) … into ‘canonical form’ (syntactic q-opt) generate alt. plans estimate cost; pick best

CMU SCS /615Faloutsos/Pavlo21 Q-opt - example select name from STUDENT, TAKES where c-id=‘415’ and STUDENT.ssn=TAKES.ssn STUDENT TAKES  

CMU SCS /615Faloutsos/Pavlo22 Q-opt - example STUDENT TAKES   STUDENT TAKES   Canonical form

CMU SCS /615Faloutsos/Pavlo23 Q-opt - example STUDENT TAKES   Index; seq scan Hash join; merge join; nested loops;

CMU SCS /615Faloutsos/Pavlo24 Outline (12.1) Catalog (12.2) Intro to Operator Evaluation (12.3) Algo’s for Relational Operations (12.6) Typical Q-optimizer (14.3.2) Hashing

CMU SCS /615Faloutsos/Pavlo25 Grouping; Duplicate Elimination select distinct ssn from TAKES (Q1: what does it do, in English?) Q2: how to execute it?

CMU SCS /615Faloutsos/Pavlo26 An Alternative to Sorting: Hashing! Idea: –maybe we don’t need the order of the sorted data –e.g.: forming groups in GROUP BY –e.g.: removing duplicates in DISTINCT Hashing does this! –And may be cheaper than sorting! (why?) –But what if table doesn’t fit in memory??

CMU SCS /615Faloutsos/Pavlo27 General Idea Two phases: –Phase1: Partition: use a hash function h p to split tuples into partitions on disk. We know that all matches live in the same partition. Partitions are “spilled” to disk via output buffers

CMU SCS /615Faloutsos/Pavlo28 Two Phases Partition: B main memory buffers Disk Original Relation OUTPUT 2 INPUT 1 hash function h p B-1 Partitions 1 2 B-1...

CMU SCS /615Faloutsos/Pavlo29 General Idea Two phases: –Phase 2: ReHash: for each partition on disk (assuming it fits in memory) read it into memory and build a main-memory hash table based on a hash function h r Then go through each bucket of this hash table to bring together matching tuples

CMU SCS /615Faloutsos/Pavlo30 Two Phases Rehash: Partitions Hash table for partition R i (k i <= B pages) B main memory buffers Disk hash fn hrhr 1 B-1 B RiRi

CMU SCS /615Faloutsos/Pavlo31 Analysis How big of a table can we hash using this approach? –B-1 “spill partitions” in Phase 1 –Each should be no more than B blocks big

CMU SCS /615Faloutsos/Pavlo32 Analysis How big of a table can we hash using this approach? –B-1 “spill partitions” in Phase 1 –Each should be no more than B blocks big –Answer: B(B-1). ie., a table of N blocks needs about sqrt(N) buffers –What assumption do we make?

CMU SCS /615Faloutsos/Pavlo33 Analysis How big of a table can we hash using this approach? –B-1 “spill partitions” in Phase 1 –Each should be no more than B blocks big –Answer: B(B-1). ie., a table of N blocks needs about sqrt(N) buffers –Note: assumes hash function distributes records evenly! use a ‘fudge factor’ f >1 for that: we need B ~ sqrt( f * N)

CMU SCS /615Faloutsos/Pavlo34 Analysis Have a bigger table? Recursive partitioning! –In the ReHash phase, if a partition b is bigger than B, then recurse: –pretend that b is a table we need to hash, run the Partitioning phase on b, and then the ReHash phase on each of its (sub)partitions

CMU SCS /615Faloutsos/Pavlo35 Recursive partitioning B main memory buffers Disk Original Relation OUTPUT 2 INPUT 1 hash function h p B-1 Partitions 1 2 B-1... partition b > B * Hash table for partition R i (k i <= B pages) B main memory buffers hash fn hrhr 1 B-1 B PHASE 1*PHASE 2

CMU SCS /615Faloutsos/Pavlo36 Real story Partition + Rehash Performance is very slow! What could have gone wrong? break

CMU SCS /615Faloutsos/Pavlo37 Real story Partition + Rehash Performance is very slow! What could have gone wrong? Hint: some buckets are empty; some others are way over-full. break

CMU SCS /615Faloutsos/Pavlo38 Hashing vs. Sorting Which one needs more buffers?

CMU SCS /615Faloutsos/Pavlo39 Hashing vs. Sorting Recall: can hash a table of size N blocks in sqrt(N) space How big of a table can we sort in 2 passes? –Get N/B sorted runs after Pass 0 –Can merge all runs in Pass 1 if N/B ≤ B-1 Thus, we (roughly) require: N ≤ B 2 We can sort a table of size N blocks in about space sqrt(N) –Same as hashing!

CMU SCS /615Faloutsos/Pavlo40 Hashing vs. Sorting Choice of sorting vs. hashing is subtle and depends on optimizations done in each case … –Already discussed some optimizations for sorting:

CMU SCS /615Faloutsos/Pavlo41 Hashing vs. Sorting Choice of sorting vs. hashing is subtle and depends on optimizations done in each case … –Already discussed some optimizations for sorting: (Heapsort in Pass 0 for longer runs) Chunk I/O into large blocks to amortize seek+RD costs Double-buffering to overlap CPU and I/O –Another optimization when using sorting for aggregation: “Early aggregation” of records in sorted runs –We will discuss some optimizations for hashing next…

CMU SCS /615Faloutsos/Pavlo42 Hashing: We Can Do Better! Combine the summarization into the hashing process - How? HashAgg

CMU SCS /615Faloutsos/Pavlo43 Hashing: We Can Do Better! Combine the summarization into the hashing process - How? –During the ReHash phase, don’t store tuples, store pairs of the form –When we want to insert a new tuple into the hash table If we find a matching GroupVals, just update the RunningVals appropriately Else insert a new pair HashAgg select ssn, sum (credits) from takes group by ssn (groupVal, runningVal) (12345, 12) (45678, 18)

CMU SCS /615Faloutsos/Pavlo44 Hashing: We Can Do Better! Combine the summarization into the hashing process What’s the benefit? –Q: How many entries will we have to handle? –A: Number of distinct values of GroupVals columns Not the number of tuples!! –Also probably “narrower” than the tuples HashAgg

CMU SCS /615Faloutsos/Pavlo45 Even Better: Hybrid Hashing What if B > sqrt(N)? e.g., N=10,000, B=200 B=100 (actually, 101) would be enough for 2 passes How could we use the extra 100 buffers? B main memory buffers Disk Original Relation OUTPUT 2 INPUT 1 hphp 100 Partitions

CMU SCS /615Faloutsos/Pavlo46 Even Better: Hybrid Hashing What if B > sqrt(N)? e.g., N=10,000, B=200 B=100 (actually, 101) would be enough for 2 passes How could we use the extra 100 buffers? B main memory buffers Disk Original Relation OUTPUT 2 INPUT 1 hphp 100 Partitions A: 1ph for first partition; 2 for all others

CMU SCS /615Faloutsos/Pavlo47 Even Better: Hybrid Hashing Idea: hybrid! … keep 1st partition in memory during phase 1! –Output its stuff at the end of Phase 1. Partition 1 B main memory buffers Disk Original Relation OUTPUT 3 INPUT 2 hphp 100 Partitions hrhr 100-buffer hashtable 1 100

CMU SCS /615Faloutsos/Pavlo48 Even Better: Hybrid Hashing What if B=300? (and N=10,000, again) i.e., 200 extra buffers?

CMU SCS /615Faloutsos/Pavlo49 Even Better: Hybrid Hashing What if B=300? (and N=10,000, again) i.e., 200 extra buffers? A: keep the first 2 partitions in main memory

CMU SCS /615Faloutsos/Pavlo50 Even Better: Hybrid Hashing What if B=150? (and N=10,000, again) i.e., 50 extra buffers?

CMU SCS /615Faloutsos/Pavlo51 Even Better: Hybrid Hashing What if B=150? (and N=10,000, again) i.e., 50 extra buffers? A: keep half of the first bucket in memory

CMU SCS /615Faloutsos/Pavlo52 Hybrid hashing can be used together with the summarization idea

CMU SCS /615Faloutsos/Pavlo53 Source: G. Graefe. ACM Computing Surveys, 25(2). Hashing vs. Sorting revisited Notes: (1) based on analytical (not empirical) evaluation (2) numbers for sort do not reflect heapsort optimization (3) assumes even distribution of hash buckets

CMU SCS /615Faloutsos/Pavlo54 So, hashing’s better … right? Any caveats?

CMU SCS /615Faloutsos/Pavlo55 So, hashing’s better … right? Any caveats? A1: sorting is better on non-uniform data A2:... and when sorted output is required later. Hashing vs. sorting: Commercial systems use either or both

CMU SCS /615Faloutsos/Pavlo56 Summary Query processing architecture: –Query optimizer translates SQL to a query plan = graph of iterators –Query executor “interprets” the plan Hashing is a useful alternative to sorting for dup. elim / group-by –Both are valuable techniques for a DBMS