CS 540 Database Management Systems


CS 540 Database Management Systems Lecture 7: Query Processing

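DBMS Architecture. [Figure: layered architecture diagram. Users, web forms, applications, and the DBA submit queries and transactions. The query side is a stack of Query Parser, Query Rewriter, Query Optimizer, Query Executor, Files & Access Methods, Buffer Manager, and Storage Manager over Storage; the transaction side has the Transaction Manager, Logging & Recovery, and Lock Manager; Buffers and Lock Tables sit in main memory. Part of the stack is marked as today's lecture.]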

Query Execution Plans. Example query: SELECT B.manf FROM Beers B, Sells S WHERE B.name = S.beer AND S.price < 20. [Figure: plan tree with a projection on manf at the root, a selection price < 20 below it, and a join on name = beer over Beers and Sells; the tree is annotated with nested loops, table scan, and index scan.] Query plan: logical plan (declarative); physical plan (procedural), which consists of a procedural implementation of each logical operator and the scheduling of operations.

Logical versus physical operators Logical operators Relational Algebra Operators Join, Selection, Projection, Union, … Physical operators Algorithms to implement logical operators. Hash join, nested loop join, … More than one physical operator for each logical operator

Communication between operators: iterator model Each physical operator implements three functions: Open: initializes the data structures. GetNext: returns the next tuple in the result. Close: ends the operation and frees the resources. It enables pipelining
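
The iterator interface above can be sketched in a few lines of Python. This is a minimal illustration, not a real engine: the class names `TableScan` and `Select`, the helper `run`, and the sample `sells` rows are assumptions made for the example, and an in-memory list stands in for a table on disk.

```python
class TableScan:
    """Leaf operator: returns the rows of an in-memory list one at a time."""

    def __init__(self, rows):
        self.rows = rows
        self.pos = None

    def open(self):
        self.pos = 0  # initialize the scan state

    def get_next(self):
        if self.pos >= len(self.rows):
            return None  # None signals the end of the stream
        row = self.rows[self.pos]
        self.pos += 1
        return row

    def close(self):
        self.pos = None  # release the scan state


class Select:
    """Selection operator: pulls rows from its child and keeps those that match."""

    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def open(self):
        self.child.open()

    def get_next(self):
        while True:
            row = self.child.get_next()
            if row is None or self.predicate(row):
                return row

    def close(self):
        self.child.close()


def run(plan):
    """Drive a plan from the root: open, pull until exhausted, close."""
    plan.open()
    out = []
    while (row := plan.get_next()) is not None:
        out.append(row)
    plan.close()
    return out


# Hypothetical Sells rows: (beer, price)
sells = [("Bud", 15), ("Duvel", 25), ("Coors", 12)]
cheap = run(Select(TableScan(sells), lambda r: r[1] < 20))
print(cheap)
```

Because `Select` asks its child for one row at a time, no intermediate result is materialized; this is the pipelining the slide refers to.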

Physical operators. Logical operator: selection, which reads either all tuples of relation R or only the tuples that satisfy some predicate. Table-scan: R resides in secondary storage; read its blocks one by one. Index-scan: if there is an index on R, use the index to find the relevant blocks; this is usually more efficient when few tuples qualify. There are other operators for join, union, group by, ... Join is the most important one and is the focus of our lecture.

Internal memory join algorithms (both relations fit in main memory). Nested-loop join: check every record in R against every record in S; time = O(|R||S|). Sort-merge join: sort R and S, then merge; time = O(|S| log |S|) (if |R| < |S|). Hash join: build a hash table for R; for every record in S, probe the hash table; time = O(|S|).
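
As a rough illustration of the in-memory hash join, the sketch below builds a hash table on R and probes it with each record of S. The function name `hash_join`, the key-extractor parameters, and the sample Beers and Sells rows are assumptions made for the example.

```python
from collections import defaultdict


def hash_join(r, s, r_key, s_key):
    """Build a hash table on r, then probe it once per row of s."""
    table = defaultdict(list)
    for r_row in r:                      # build phase: one pass over r
        table[r_key(r_row)].append(r_row)
    return [
        (r_row, s_row)
        for s_row in s                   # probe phase: one pass over s
        for r_row in table.get(s_key(s_row), [])
    ]


# Hypothetical rows: Beers(name, manf) and Sells(beer, price)
beers = [("Bud", "AB"), ("Duvel", "Moortgat")]
sells = [("Bud", 15), ("Coors", 12), ("Bud", 18)]
joined = hash_join(beers, sells, lambda b: b[0], lambda s: s[0])
print(joined)
```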

External memory join algorithms. At least one relation does not fit into main memory, so I/O access is the dominant cost. B(R): number of blocks of R. |R| or T(R): number of tuples in R. Memory requirement M: number of blocks that fit in main memory. Example: the internal memory join algorithms cost B(R) + B(S) I/Os, since each relation is read once. We do not consider the cost of writing the output; the results may be pipelined and never written to disk.

Nested-loop join of R and S. For each block of R, and for each tuple r in the block: for each block of S, and for each tuple s in the block: output rs if the join condition evaluates to true over r and s. R is called the outer table; S is called the inner table. Cost: B(R) + |R| · B(S). Memory requirement: 4 (if double buffering is used). Block-based nested-loop join: for each block of R, and for each block of S: for each r in the R block, and for each s in the S block: … Cost: B(R) + B(R) · B(S). Memory requirement: 4 (if double buffering is used).

Improving nested-loop join. Use up the available memory buffers M: read M - 2 blocks from R at a time, then read the blocks of S one by one and join their tuples with the R tuples in main memory. Cost: B(R) + ⌈B(R) / (M - 2)⌉ · B(S), or almost B(R) · B(S) / M. Memory requirement: M.
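
To compare the three nested-loop variants, the cost formulas from this slide and the previous one can be evaluated directly. The function names and the sample sizes below are illustrative assumptions, not figures from the lecture.

```python
def tuple_nlj_cost(b_r, t_r, b_s):
    """Tuple-based nested loop: B(R) + |R| * B(S)."""
    return b_r + t_r * b_s


def block_nlj_cost(b_r, b_s):
    """Block-based nested loop: B(R) + B(R) * B(S)."""
    return b_r + b_r * b_s


def chunked_nlj_cost(b_r, b_s, m):
    """Nested loop reading M - 2 blocks of R at a time:
    B(R) + ceil(B(R) / (M - 2)) * B(S)."""
    chunks = -(-b_r // (m - 2))  # integer ceiling division
    return b_r + chunks * b_s


# Hypothetical sizes: B(R) = 1000, |R| = 10000, B(S) = 500, M = 102
costs = (
    tuple_nlj_cost(1000, 10000, 500),
    block_nlj_cost(1000, 500),
    chunked_nlj_cost(1000, 500, 102),
)
print(costs)
```

With these sizes, using all M buffers cuts the number of scans of S from 1000 (block-based) to 10.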

Index-based (zig-zag) join. Join R and S on R.A = S.B. Use ordered indexes (e.g., B+ trees) over R.A and S.B to join the relations; use existing indexes or build new ones. Similar to sort-merge join without the sorting step. When the current values differ, use the larger value to probe the other index.
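
A rough sketch of the zig-zag idea follows. It assumes sorted Python lists of join-attribute values as stand-ins for the leaf levels of the two ordered indexes, and uses `bisect_left` to play the role of an index probe; the function name `zigzag_join` and the sample keys are hypothetical.

```python
from bisect import bisect_left


def zigzag_join(r_keys, s_keys):
    """Join two sorted key lists, skipping ahead by probing with the larger value.

    Returns the matching key once per joining pair.
    """
    out = []
    i = j = 0
    while i < len(r_keys) and j < len(s_keys):
        a, b = r_keys[i], s_keys[j]
        if a < b:
            i = bisect_left(r_keys, b, i)  # probe R's index with the larger value b
        elif a > b:
            j = bisect_left(s_keys, a, j)  # probe S's index with the larger value a
        else:
            # Find the run of equal keys on each side and emit every pair.
            i_end = i
            while i_end < len(r_keys) and r_keys[i_end] == a:
                i_end += 1
            j_end = j
            while j_end < len(s_keys) and s_keys[j_end] == a:
                j_end += 1
            out.extend([a] * ((i_end - i) * (j_end - j)))
            i, j = i_end, j_end
    return out


matches = zigzag_join([1, 3, 5, 5, 9], [2, 5, 7, 9, 9])
print(matches)
```

Unlike a plain merge, the probe can skip over long stretches of non-matching keys without touching them.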

Two-pass, multi-way merge sort. Problem: sort relation R that does not fit in main memory. Phase 1: read R in groups of M blocks, sort each group, and write it as a run of size M on disk. [Figure: relation R, with B(R) blocks on disk, is read into M main-memory buffers and written back to disk as sorted runs.]

Two-pass, multi-way merge sort. Phase 2: merge up to M - 1 runs at a time and write the results to disk. Read one block from each run, and keep one block for the output. [Figure: the runs on disk feed M - 1 input buffers in main memory; one output buffer writes the sorted relation R back to disk.]

Two-pass, multi-way merge sort. Cost: 2B(R) in the first pass + B(R) in the second pass. Memory requirement: M, with B(R) <= M (M - 1), or simply B(R) <= M^2. How can we improve this bound?

General multi-way merge sort. Pass 0: read M blocks of R at a time, sort them, and write out a level-0 run; there are ⌈B(R) / M⌉ level-0 sorted runs. Pass i: merge (M - 1) level-(i-1) runs at a time, and write out a level-i run; (M - 1) memory blocks are used for input and 1 to buffer output. # of level-i runs = ⌈# of level-(i-1) runs / (M - 1)⌉. The final pass produces 1 sorted run.

Example of general multi-way merge sort Input: 1, 7, 4, 5, 2, 8, 9, 6, 3, 0 Each block holds one number, and memory has 3 blocks Pass 0 1, 7, 4 ->1, 4, 7 5, 2, 8 -> 2, 5, 8 9, 6, 3 -> 3, 6, 9 0 -> 0 Pass 1 1, 4, 7 + 2, 5, 8 -> 1, 2, 4, 5, 7, 8 3, 6, 9 + 0 -> 0, 3, 6, 9 Pass 2 (final) 1, 2, 4, 5, 7, 8 + 0, 3, 6, 9 -> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
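
The example above can be reproduced with a small simulation. This is a sketch under simplifying assumptions: each list element stands for one block, runs are held as Python lists rather than files, and `external_merge_sort` is a name chosen for the illustration.

```python
from heapq import merge


def external_merge_sort(blocks, m):
    """Simulate multi-way merge sort with m memory blocks.

    Returns the list of runs produced by each pass, starting with pass 0.
    """
    if m < 3:
        raise ValueError("need at least 3 memory blocks to merge 2 runs")
    # Pass 0: sort m blocks at a time into level-0 runs.
    runs = [sorted(blocks[i:i + m]) for i in range(0, len(blocks), m)]
    passes = [runs]
    # Pass i: merge (m - 1) runs at a time until one run remains.
    while len(runs) > 1:
        runs = [
            list(merge(*runs[i:i + m - 1]))
            for i in range(0, len(runs), m - 1)
        ]
        passes.append(runs)
    return passes


passes = external_merge_sort([1, 7, 4, 5, 2, 8, 9, 6, 3, 0], m=3)
for level, runs in enumerate(passes):
    print(f"Pass {level}: {runs}")
```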

Analysis of multi-way merge sort. Number of passes: ⌈log_(M-1) ⌈B(R) / M⌉⌉ + 1. Cost: #passes · 2 · B(R), since each pass reads the entire relation once and writes it once; subtract B(R) for the final pass, whose output is not written to disk. Simply O(B(R) · log_M B(R)). Memory requirement: M.
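
The pass count and I/O cost can be computed by following the run counts from the previous slides: ⌈B(R) / M⌉ level-0 runs, divided by M - 1 in each later pass. The sketch below uses integer arithmetic instead of a logarithm to avoid rounding issues; the function names and sample sizes are assumptions made for the example.

```python
def sort_passes(b_r, m):
    """Number of passes of multi-way merge sort over b_r blocks with m memory blocks."""
    runs = -(-b_r // m)          # level-0 runs: ceil(B(R) / M)
    passes = 1                   # pass 0
    while runs > 1:
        runs = -(-runs // (m - 1))  # each merge pass combines M - 1 runs
        passes += 1
    return passes


def sort_cost(b_r, m):
    """I/O cost: 2 * B(R) per pass, minus B(R) for not writing the final output."""
    return sort_passes(b_r, m) * 2 * b_r - b_r


# The 10-block, 3-buffer example from the previous slide
print(sort_passes(10, 3), sort_cost(10, 3))
# A hypothetical two-pass case: B(R) = 1000, M = 101
print(sort_passes(1000, 101), sort_cost(1000, 101))
```

In the two-pass case the cost comes out to 3 · B(R), which agrees with the two-pass analysis.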

Sort-merge join algorithm. Sort R and S according to the join attribute, then merge them. r, s = the first tuples in sorted R and S. Repeat until one of R and S is exhausted: if r.A > s.B then s = next tuple in S; else if r.A < s.B then r = next tuple in R; else output all matching tuples, and r, s = next in R and S. Cost: sorting + 2 B(R) + 2 B(S) (join on foreign key and primary key); B(R) · B(S) if everything joins. Memory requirement: B(R) <= M^2, B(S) <= M^2.
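
The merge step above, including the "output all matching tuples" case for duplicate join values, might look like the following in memory. The function name `sort_merge_join`, the key-extractor parameters, and the sample tuples are assumptions made for the illustration; Python's `sorted` stands in for the external sort.

```python
def sort_merge_join(r, s, r_key, s_key):
    """Sort both inputs on the join attribute, then merge them."""
    r = sorted(r, key=r_key)
    s = sorted(s, key=s_key)
    out = []
    i = j = 0
    while i < len(r) and j < len(s):
        a, b = r_key(r[i]), s_key(s[j])
        if a > b:
            j += 1                      # s = next tuple in S
        elif a < b:
            i += 1                      # r = next tuple in R
        else:
            # Locate the group of S tuples that share this join value.
            j_end = j
            while j_end < len(s) and s_key(s[j_end]) == a:
                j_end += 1
            # Pair every R tuple with this value against the whole S group.
            while i < len(r) and r_key(r[i]) == a:
                out.extend((r[i], s_row) for s_row in s[j:j_end])
                i += 1
            j = j_end
    return out


# Hypothetical tuples: (join value, label)
r_rows = [(3, "r3"), (1, "r1"), (2, "r2a"), (2, "r2b")]
s_rows = [(2, "s2a"), (4, "s4"), (2, "s2b"), (3, "s3")]
pairs = sort_merge_join(r_rows, s_rows, lambda t: t[0], lambda t: t[0])
print(pairs)
```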

Optimized sort-merge join algorithm. Combine the join with the merge phase of the sort. Sort R and S into M runs (overall) of size M on disk, then merge and join the tuples in one pass. [Figure: runs of R and S on disk are merged in main memory, and the two merged streams feed the join.]

Optimized two-pass sort-merge join algorithm. Cost: 3B(R) + 3B(S). Memory requirement: B(R) + B(S) <= M^2, because we merge them in one pass. More efficient, but with a stricter memory requirement.

Hash join algorithm. Idea: partition each relation into buckets by hashing their join attributes, and consider only corresponding buckets from R and S. If tuples of R and S are not assigned to corresponding buckets, they do not join. [Figure: relation R is read from disk through an input buffer, and a hash function distributes its tuples across M - 1 bucket buffers in main memory, which are written back to disk.]

(Partitioned) hash join of R and S. Step 1: hash S into M buckets and send all buckets to disk. Step 2: hash R into M buckets and send all buckets to disk. Step 3: join corresponding buckets. If tuples of R and S are not assigned to corresponding buckets, they do not join.

Hash join. Partition both relations using hash function h: R tuples in partition i will only match S tuples in partition i. Read in a partition of R, hash it using h2 (<> h!). Scan the matching partition of S and search for matches. [Figure, partitioning phase: the original relation is read from disk through one input buffer, and hash function h distributes its tuples across M - 1 output buckets. Figure, join phase: with M main-memory buffers, a hash table for bucket Si (< M - 1 pages) is built using h2, an input buffer reads Ri, and an output buffer collects the join result.]
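
A minimal in-memory simulation of the two phases is sketched below. It assumes Python's built-in `hash` modulo the bucket count as the partitioning function h, and a Python dictionary as the second hash function h2; lists stand in for the bucket files on disk. The names `partition` and `partitioned_hash_join` and the sample rows are hypothetical.

```python
from collections import defaultdict


def partition(rows, key, n_buckets):
    """Partitioning phase: distribute rows into buckets by h(key)."""
    buckets = [[] for _ in range(n_buckets)]
    for row in rows:
        buckets[hash(key(row)) % n_buckets].append(row)
    return buckets


def partitioned_hash_join(r, s, r_key, s_key, n_buckets):
    """Partition both inputs with the same h, then join matching buckets."""
    r_parts = partition(r, r_key, n_buckets)
    s_parts = partition(s, s_key, n_buckets)
    out = []
    for r_i, s_i in zip(r_parts, s_parts):
        # Join phase: hash table on bucket S_i (the dict acts as h2).
        table = defaultdict(list)
        for s_row in s_i:
            table[s_key(s_row)].append(s_row)
        # Stream bucket R_i past the table and emit matches.
        for r_row in r_i:
            out.extend((r_row, s_row) for s_row in table.get(r_key(r_row), ()))
    return out


# Hypothetical rows: (join value, label)
r_rows = [(k, f"r{k}") for k in range(10)]
s_rows = [(k, f"s{k}") for k in range(5, 15)]
result = partitioned_hash_join(
    r_rows, s_rows, lambda t: t[0], lambda t: t[0], n_buckets=4
)
print(sorted(result))
```

The output order depends on how keys fall into buckets, so the example sorts the result before printing.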

Hash join. Cost: 3 B(R) + 3 B(S). Memory requirement: the smaller bucket must fit in main memory. Let min(B(R), B(S)) = B(S); then B(S) / (M - 1) <= M, roughly B(S) <= M^2. [Figure: join phase with M main-memory buffers, holding bucket Si (< M - 1 pages), an input buffer for Ri, and an output buffer for the join result.]

Hybrid Hash join When partitioning S, keep the records of the first bucket in memory as a hash table; When partitioning R, for records of the first bucket, probe the hash table directly; Saving: no need to write R1 and S1 to disk or read them back to memory.

Handle partition overflow. Overflow on disk: an S partition is larger than the memory size (note: we do not care about the size of R partitions, since only the S partition must fit in memory). Solution: recursive partitioning. Overflow in memory: the in-memory hash table of S becomes too large. Solution: revise the partitioning scheme and keep a smaller partition in memory.

Hash-based versus sort-based join. Hash join needs a smaller amount of main memory: sqrt(min(B(R), B(S))) < sqrt(B(R) + B(S)). Hash join wins if the relations have different sizes. Hash join performance depends on the quality of hashing; it may be hard to generate balanced buckets for hash join. Sort-based join wins if the relations are already in sorted order. Sort-based join generates sorted results, which is useful when there is an ORDER BY in the query or when the following operators need sorted input. Sort-based join can handle inequality join predicates.
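
The memory comparison can be made concrete with the approximate bounds from the earlier slides: B(S) <= M^2 for hash join (with S the smaller relation) and B(R) + B(S) <= M^2 for the optimized sort-merge join. The function names and relation sizes below are assumptions made for the example.

```python
from math import isqrt


def ceil_sqrt(n):
    """Smallest integer m with m * m >= n."""
    root = isqrt(n)
    return root if root * root == n else root + 1


def min_memory_hash_join(b_r, b_s):
    """Approximate minimum M for hash join: min(B(R), B(S)) <= M^2."""
    return ceil_sqrt(min(b_r, b_s))


def min_memory_sort_merge_join(b_r, b_s):
    """Approximate minimum M for optimized sort-merge join: B(R) + B(S) <= M^2."""
    return ceil_sqrt(b_r + b_s)


# Hypothetical sizes: a large R joined with a small S
print(min_memory_hash_join(10000, 100), min_memory_sort_merge_join(10000, 100))
# Hypothetical sizes: two relations of equal size
print(min_memory_hash_join(5000, 5000), min_memory_sort_merge_join(5000, 5000))
```

The gap is widest when the relations differ greatly in size, which is the case where the slide says hash join wins.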

Duality of Sort and Hash Divide-and-conquer paradigm Sorting: physical division, logical combination Hashing: logical division, physical combination Handling very large inputs Sorting: multi-level merge Hashing: recursive partitioning

What you should know How nested-loop, zig-zag, sort-merge, and hash join processing algorithms work How to compute their costs and memory requirements What are their advantages and disadvantages

Carry Away Messages Old conclusions/assumptions regularly need to be re-examined because of the change in the world System R abandoned hash join, but the availability of large main memories changed the story Observe changes in the world and question some relevant assumptions In general, changes in the world bring opportunities for innovations, so be alert about any changes How have the needs for data management changed over time? Have the technologies for data management been tracking such changes?