CMU SCS Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 - DB Applications C. Faloutsos – A. Pavlo Lecture#13: Query Evaluation.

Presentation transcript:

CMU SCS Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 - DB Applications C. Faloutsos – A. Pavlo Lecture#13: Query Evaluation

Today's Class
– Catalog (12.1)
– Intro to Operator Evaluation (12.2-3)
– Typical Query Optimizer (12.6)
– Projection: Sorting vs. Hashing (14.3.2)

Cost-based Query Sub-System
[Figure: components of the cost-based query sub-system: Query Parser; Query Optimizer (Plan Generator, Plan Cost Estimator); Catalog Manager (Schema, Statistics); Query Plan Evaluator. Example input under "Queries": SELECT * FROM Blah B WHERE B.blah = blah]

Catalog: Schema
What would you store?
– Info about tables, attributes, indices, users
How?
– In tables!
  Attribute_Cat (attr_name: string, rel_name: string, type: string, position: integer)
See the INFORMATION_SCHEMA discussion from Lecture #7.

Catalog: Statistics
Why do we need them?
– To estimate cost of query plans
What would you store?
– NTuples(R): # records for table R
– NPages(R): # pages for R
– NKeys(I): # distinct key values for index I
– INPages(I): # pages for index I
– IHeight(I): # levels for I
– ILow(I), IHigh(I): range of values for I
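As a rough illustration of how these statistics could feed a cost estimate, the sketch below stores them in two small records and derives two textbook-style estimates. The class names, the function names, and the uniform-distribution assumption are illustrative choices, not something the slides prescribe.

```python
from dataclasses import dataclass


@dataclass
class TableStats:
    ntuples: int   # NTuples(R)
    npages: int    # NPages(R)


@dataclass
class IndexStats:
    nkeys: int     # NKeys(I)
    inpages: int   # INPages(I)
    iheight: int   # IHeight(I)
    ilow: float    # ILow(I)
    ihigh: float   # IHigh(I)


def equality_result_size(table: TableStats, index: IndexStats) -> float:
    """Expected tuples matching `attr = constant`.

    Assumption: values are spread uniformly over the distinct keys.
    """
    return table.ntuples / index.nkeys


def range_fraction(index: IndexStats, lo: float, hi: float) -> float:
    """Fraction of the key range [ILow, IHigh] covered by [lo, hi].

    Assumption: keys are spread uniformly over the range.
    """
    lo = max(lo, index.ilow)
    hi = min(hi, index.ihigh)
    if hi <= lo:
        return 0.0
    return (hi - lo) / (index.ihigh - index.ilow)


# Hypothetical numbers, for illustration only.
r = TableStats(ntuples=10_000, npages=500)
i = IndexStats(nkeys=100, inpages=50, iheight=2, ilow=0, ihigh=5000)
matching = equality_result_size(r, i)
covered = range_fraction(i, 1000, 5000)
```

With these made-up numbers, an equality predicate is estimated to match 100 tuples and the range 1000 to 5000 covers 80% of the key range.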

Query Plan Example
SELECT cname, amt
FROM customer, account
WHERE customer.acctno = account.acctno
AND account.amt > 1000

Relational Algebra: π_{cname, amt}(σ_{amt>1000}(customer ⋈ account))

Query plan tree (root first):
π cname, amt
  σ amt>1000
    ⋈ acctno=acctno
      CUSTOMER
      ACCOUNT

Annotations on the plan: File Scan on each base table, Nested Loop for the join, “On-the-fly” for the operators above the join.

The output of each operator is the input to the next operator.
Each operator iterates over its input and performs some task.
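The "each operator iterates over its input" idea can be sketched with Python generators. This is a minimal illustration rather than how any particular DBMS implements it; the operator function names and the sample rows (including the customer names) are made up for the example.

```python
def file_scan(table):
    """Leaf operator: yield every row of a table, one at a time."""
    for row in table:
        yield row


def nested_loop_join(outer, inner_table, key):
    """For each outer row, rescan the inner table and emit matching pairs."""
    for o in outer:
        for i in file_scan(inner_table):
            if o[key] == i[key]:
                yield {**o, **i}


def select(rows, predicate):
    """Pass through only the rows that satisfy the predicate (on the fly)."""
    for row in rows:
        if predicate(row):
            yield row


def project(rows, columns):
    """Keep only the requested columns (on the fly)."""
    for row in rows:
        yield tuple(row[c] for c in columns)


# Hypothetical sample data.
customer = [
    {"cname": "Ann", "acctno": "A-123"},
    {"cname": "Bob", "acctno": "A-456"},
    {"cname": "Cy", "acctno": "A-789"},
]
account = [
    {"acctno": "A-123", "amt": 1800},
    {"acctno": "A-456", "amt": 900},
    {"acctno": "A-789", "amt": 2000},
]

# The plan tree, written inside-out: scan -> join -> select -> project.
plan = project(
    select(
        nested_loop_join(file_scan(customer), account, "acctno"),
        lambda r: r["amt"] > 1000,
    ),
    ["cname", "amt"],
)
result = list(plan)
```

Nothing runs until `list(plan)` pulls rows from the root; each pull propagates down the tree, which is the pipelining behavior the slide describes.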

Operator Evaluation
Several algorithms are available for different relational operators.
Each has its own performance trade-offs.
The goal of the query optimizer is to choose the one that has the lowest “cost”.
Next Class: How the DBMS finds the best plan.

Operator Execution Strategies
– Indexing
– Iteration (= seq. scanning)
– Partitioning (sorting and hashing)

Access Paths
How the DBMS retrieves tuples from a table for a query plan.
– File Scan (aka Sequential Scan)
– Index Scan (Tree, Hash, List, …)
Selectivity of an access path:
– % of pages we retrieve
– e.g., selectivity of a hash index on a range query: 100% (no reduction!)

Operator Algorithms
– Selection: file scan; index scan
– Projection: hashing; sorting
– Join: many ways (loops, sort-merge, etc.)
– Group By: hashing; sorting
– Order By: sorting
Some of these are covered today, the rest next class.

Query Optimization
– Bring the query into internal form (e.g., parse tree)
– … into “canonical form” (syntactic q-opt)
– Generate alternative plans.
– Estimate cost for each plan.
– Pick the best one.

Query Plan Example
SELECT cname, amt
FROM customer, account
WHERE customer.acctno = account.acctno
AND account.amt > 1000

[Figure: two plan trees for this query, each built from the same operators: scans of CUSTOMER and ACCOUNT, ⋈ acctno=acctno, σ amt>1000, π cname, amt.]

Duplicate Elimination
SELECT DISTINCT bname
FROM account
WHERE amt > 1000

What does it do, in English?
How to execute it?

π_{DISTINCT bname}(σ_{amt>1000}(account))
Not technically correct because RA doesn’t have “DISTINCT”

Duplicate Elimination
SELECT DISTINCT bname
FROM account
WHERE amt > 1000

Query plan tree (root first):
π DISTINCT bname
  σ amt>1000
    ACCOUNT

Two Choices:
– Sorting
– Hashing

Sorting Projection
Plan: ACCOUNT → Filter (σ amt>1000) → Remove Columns (π bname) → Sort → Eliminate Dupes

ACCOUNT (every row passes the filter amt > 1000):
acctno  bname     amt
A-123   Redwood   1800
A-789   Downtown  2000
A-123   Perry     1500
A-456   Downtown  1300

After Remove Columns:
bname
Redwood
Downtown
Perry
Downtown

After Sort and Eliminate Dupes (the repeated Downtown is crossed out):
bname
Downtown
Perry
Redwood
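The four steps of the sorting approach can be traced in a few lines. This is only a sketch: an in-memory `list.sort()` stands in for the external sort a DBMS would use on large inputs, and the function name is made up for the example.

```python
# Rows of the ACCOUNT table from the slide: (acctno, bname, amt).
account = [
    ("A-123", "Redwood", 1800),
    ("A-789", "Downtown", 2000),
    ("A-123", "Perry", 1500),
    ("A-456", "Downtown", 1300),
]


def sort_distinct_bname(rows, min_amt):
    # Filter: keep rows with amt > min_amt.
    filtered = [r for r in rows if r[2] > min_amt]
    # Remove columns: keep only bname.
    bnames = [r[1] for r in filtered]
    # Sort: duplicates become adjacent.
    bnames.sort()
    # Eliminate dupes: one pass, comparing each value with the last one kept.
    out = []
    for b in bnames:
        if not out or out[-1] != b:
            out.append(b)
    return out


distinct_bnames = sort_distinct_bname(account, 1000)
# distinct_bnames == ["Downtown", "Perry", "Redwood"]
```

Because sorting puts equal values next to each other, duplicate elimination needs only the previously emitted value, not a lookup structure.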

Alternative to Sorting: Hashing!
What if we don’t need the order of the sorted data?
– Forming groups in GROUP BY
– Removing duplicates in DISTINCT
Hashing does this!
– And may be cheaper than sorting! (why?)
– But what if the table doesn’t fit in memory?

Hashing Projection
Populate an ephemeral hash table as we iterate over a table.
For each record, check whether there is already an entry in the hash table:
– DISTINCT: Discard duplicate.
– GROUP BY: Perform aggregate computation.
Two phase approach.

Phase 1: Partition
Use a hash function h1 to split tuples into partitions on disk.
– We know that all matches live in the same partition.
– Partitions are “spilled” to disk via output buffers.
Assume that we have B buffers.

Phase 1: Partition
[Figure: ACCOUNT tuples pass through Filter (σ amt>1000) and Remove Columns (π bname), giving the bname values Redwood, Downtown, Perry, Downtown; hash function h1 then distributes them into B-1 partitions.]

Phase 2: ReHash
For each partition on disk:
– Read it into memory and build an in-memory hash table based on a hash function h2
– Then go through each bucket of this hash table to bring together matching tuples
This assumes that each partition fits in memory.

Phase 2: ReHash
[Figure: each partition from Phase 1 is read back and hashed with h2 into a hash table (key → value: XXX → Downtown, YYY → Redwood, ZZZ → Perry); Eliminate Dupes then yields bname = Downtown, Perry, Redwood.]
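The two phases can be sketched as follows. This is a simplified, in-memory illustration: Python lists stand in for partitions spilled to disk, a byte-sum function is an arbitrary stand-in for h1, and Python's built-in `set` hashing plays the role of h2. All function names are made up for the example.

```python
def h1(value):
    """Stand-in partitioning hash: sum of the UTF-8 bytes of the value."""
    return sum(value.encode())


def partition(values, num_partitions):
    """Phase 1: route every value to one of the partitions using h1."""
    parts = [[] for _ in range(num_partitions)]
    for v in values:
        parts[h1(v) % num_partitions].append(v)
    return parts


def rehash_distinct(part):
    """Phase 2: build an in-memory hash table for one partition."""
    seen = set()  # the set's own hashing acts as h2
    out = []
    for v in part:
        if v not in seen:  # DISTINCT: discard duplicates
            seen.add(v)
            out.append(v)
    return out


def hash_distinct(values, b):
    """Two-phase DISTINCT with B buffers: B-1 partitions, then rehash each."""
    result = []
    for part in partition(values, b - 1):
        result.extend(rehash_distinct(part))
    return result


bnames = ["Redwood", "Downtown", "Perry", "Downtown"]
distinct = hash_distinct(bnames, b=4)
```

Both copies of Downtown hash to the same partition under h1, so each partition can be deduplicated independently and the results simply concatenated. The output order depends on the partitioning, unlike the sorted approach.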

Analysis
How big of a table can we hash using this approach?
– B-1 “spill partitions” in Phase 1
– Each should be no more than B blocks big
– Answer: B∙(B-1). A table of N blocks needs about sqrt(N) buffers
What assumption do we make?
– Assumes the hash function distributes records evenly! Use a “fudge factor” f > 1 for that: we need B ~ sqrt(f∙N)
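The bound above follows in two lines; the worked numbers afterwards use an assumed fudge factor chosen only to make the arithmetic clean.

```latex
\begin{align*}
N &\le (B-1)\cdot B \approx B^2
  && \text{($B-1$ partitions, each at most $B$ blocks)}\\
B &\gtrsim \sqrt{N}\\
f\cdot N &\le (B-1)\cdot B \;\Longrightarrow\; B \approx \sqrt{f\cdot N}
  && \text{(fudge factor $f>1$ for uneven partitions)}
\end{align*}
```

For example, with N = 10,000 blocks and an assumed f = 1.44, f∙N = 14,400, so B ≈ 120; taking B = 121 gives B∙(B-1) = 14,520 ≥ 14,400.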

Analysis
Have a bigger table? Recursive partitioning!
– In the ReHash phase, if a partition i is bigger than B, then recurse.
– Pretend that i is a table we need to hash, run the Partitioning phase on i, and then the ReHash phase on each of its (sub)partitions

Recursive Partitioning
[Figure: h1 partitions the input; the overflowing bucket is hashed again with a different function h1'; each resulting partition is then rehashed with h2.]
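A minimal sketch of recursive partitioning for DISTINCT, under stated assumptions: `capacity` counts tuples rather than blocks, lists stand in for on-disk partitions, and salting Python's `hash` with the recursion depth plays the role of switching from h1 to h1'. The `max_depth` guard is an addition for this example, so that a partition made of many copies of one value (which no hash function can split) does not recurse forever.

```python
def recursive_hash_distinct(values, capacity, fanout, depth=0, max_depth=8):
    """DISTINCT via partitioning, recursing on partitions that are too big."""
    values = list(values)
    if len(values) <= capacity or depth >= max_depth:
        # ReHash phase: the partition is small enough (or cannot be split
        # further), so deduplicate it with an in-memory hash table.
        seen = set()
        out = []
        for v in values:
            if v not in seen:
                seen.add(v)
                out.append(v)
        return out

    # Partition phase: a depth-dependent hash gives a fresh function per level.
    partitions = [[] for _ in range(fanout)]
    for v in values:
        partitions[hash((depth, v)) % fanout].append(v)

    result = []
    for part in partitions:
        result.extend(
            recursive_hash_distinct(part, capacity, fanout, depth + 1, max_depth)
        )
    return result


# 1000 tuples over 50 distinct values, far more than fits in `capacity`.
data = [i % 50 for i in range(1000)]
distinct_values = recursive_hash_distinct(data, capacity=8, fanout=4)
```

Every copy of a value follows the same path down the recursion, so each distinct value is emitted exactly once.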

Real Story
Partition + Rehash
Performance is very slow!
What could have gone wrong?
Hint: some buckets are empty; some others are way over-full.

Hashing vs. Sorting
Which one needs more buffers?

Hashing vs. Sorting
Recall: we can hash a table of N blocks in about sqrt(N) space.
How big of a table can we sort in 2 passes?
– Get N/B sorted runs after Pass 0
– Can merge all runs in Pass 1 if N/B ≤ B-1
Thus, we (roughly) require: N ≤ B²
We can sort a table of N blocks in about sqrt(N) space
– Same as hashing!
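The two conditions can be checked numerically. The helper names below are made up for this sketch, and the search simply tries increasing buffer counts until each condition holds.

```python
import math


def can_sort_in_two_passes(n_blocks, b):
    """Pass 0 makes ceil(N/B) runs; Pass 1 can merge at most B-1 of them."""
    runs = math.ceil(n_blocks / b)
    return runs <= b - 1


def can_hash_in_two_phases(n_blocks, b, fudge=1.0):
    """B-1 partitions of at most B blocks each must hold fudge * N blocks."""
    return fudge * n_blocks <= b * (b - 1)


def min_buffers(condition, n_blocks):
    """Smallest B (starting from 2) for which the condition holds."""
    b = 2
    while not condition(n_blocks, b):
        b += 1
    return b


n = 10_000
sort_b = min_buffers(can_sort_in_two_passes, n)
hash_b = min_buffers(can_hash_in_two_phases, n)
```

For N = 10,000 blocks both searches stop at B = 101, just above sqrt(N) = 100, which matches the "same as hashing" conclusion.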

Hashing vs. Sorting
Choice of sorting vs. hashing is subtle and depends on optimizations done in each case
Already discussed optimizations for sorting:
– Heapsort in Pass 0 for longer runs
– Chunk I/O into large blocks to amortize seek + rotational-delay costs
– Double-buffering to overlap CPU and I/O
Another optimization when using sorting for aggregation:
– “Early aggregation” of records in sorted runs
Let’s look at some optimizations for hashing next…

Hashing: We Can Do Better!
Combine the summarization into the hashing process. How?

Hashing: We Can Do Better!
During the ReHash phase, store pairs of the form <GroupKey, RunningVal>
When we want to insert a new tuple into the hash table:
– If we find a matching GroupKey, just update the RunningVal appropriately
– Else insert a new <GroupKey, RunningVal>

Hashing Aggregation
SELECT acctno, SUM(amt)
FROM account
GROUP BY acctno

[Figure: partitions from Phase 1 are read back and hashed with h2 into a hash table (key → value) holding running totals; the final result has columns acctno and SUM(amt).]
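A minimal sketch of the running-total idea for this query, using a Python `dict` as the hash table of <GroupKey, RunningVal> pairs. The rows reuse the acctno and amt values from the ACCOUNT table shown earlier; the function name is made up for the example.

```python
def hash_aggregate_sum(rows):
    """GROUP BY key with SUM(value), keeping one running total per group."""
    running = {}  # GroupKey -> RunningVal
    for group_key, value in rows:
        if group_key in running:
            # Matching GroupKey found: update the running value in place.
            running[group_key] += value
        else:
            # First tuple of this group: insert a new pair.
            running[group_key] = value
    return running


# (acctno, amt) pairs from the ACCOUNT table.
rows = [
    ("A-123", 1800),
    ("A-789", 2000),
    ("A-123", 1500),
    ("A-456", 1300),
]
totals = hash_aggregate_sum(rows)
# totals == {"A-123": 3300, "A-789": 2000, "A-456": 1300}
```

The table holds three entries for four input tuples: its size tracks the number of distinct group keys, not the number of tuples.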

Hashing Aggregation
What’s the benefit?
How many entries will we have to handle?
– Number of distinct values of the GroupKey columns
– Not the number of tuples!!
– Also probably “narrower” than the tuples

So, hashing is better…right?
Any caveats?
– A1: Sorting is better on non-uniform data
– A2: ... and when sorted output is required later.
Hashing vs. sorting:
– Commercial systems use either or both

Summary
Query processing architecture:
– Query optimizer translates SQL to a query plan = graph of iterators
– Query executor “interprets” the plan
Hashing is a useful alternative to sorting for duplicate elimination / group-by
– Both are valuable techniques for a DBMS

Next Class