Lecture 24 Query Execution Monday, November 28, 2005.

Slides:



Advertisements
Similar presentations
Equality Join R X R.A=S.B S : : Relation R M PagesN Pages Relation S Pr records per page Ps records per page.
Advertisements

1 Lecture 23: Query Execution Friday, March 4, 2005.
Join Processing in Databases Systems with Large Main Memories
Lecture 13: Query Execution. Where are we? File organizations: sorted, hashed, heaps. Indexes: hash index, B+-tree Indexes can be clustered or not. Data.
CS CS4432: Database Systems II Operator Algorithms Chapter 15.
Database Management Systems 3ed, R. Ramakrishnan and Johannes Gehrke1 Evaluation of Relational Operations: Other Techniques Chapter 14, Part B.
External Sorting CS634 Lecture 10, Mar 5, 2014 Slides based on “Database Management Systems” 3 rd ed, Ramakrishnan and Gehrke.
Nested-Loop joins “one-and-a-half” pass method, since one relation will be read just once. Tuple-Based Nested-loop Join Algorithm: FOR each tuple s in.
CS 245Notes 71 CS 245: Database System Principles Notes 7: Query Optimization Hector Garcia-Molina.
Lecture 24: Query Execution Monday, November 20, 2000.
SPRING 2004CENG 3521 Join Algorithms Chapter 14. SPRING 2004CENG 3522 Schema for Examples Similar to old schema; rname added for variations. Reserves:
Query Execution 15.5 Two-pass Algorithms based on Hashing By Swathi Vegesna.
1 Lecture 22: Query Execution Wednesday, March 2, 2005.
15.5 Two-Pass Algorithms Based on Hashing 115 ChenKuang Yang.
Query Optimization 3 Cost Estimation R&G, Chapters 12, 13, 14 Lecture 15.
Introduction to Database Systems 1 Join Algorithms Query Processing: Lecture 1.
CS 4432query processing - lecture 171 CS4432: Database Systems II Lecture #17 Join Processing Algorithms (cont). Professor Elke A. Rundensteiner.
Sorting and Query Processing Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems November 29, 2005.
1 Relational Operators. 2 Outline Logical/physical operators Cost parameters and sorting One-pass algorithms Nested-loop joins Two-pass algorithms.
CSE 444: Lecture 24 Query Execution Monday, March 7, 2005.
Lecture 11: DMBS Internals
ICOM 6005 – Database Management Systems Design Dr. Manuel Rodríguez-Martínez Electrical and Computer Engineering Department Lecture 14 – Join Processing.
Sorting.
1 External Sorting. 2 Why Sort?  A classic problem in computer science!  Data requested in sorted order  e.g., find students in increasing gpa order.
CPS216: Data-Intensive Computing Systems Query Execution (Sort and Join operators) Shivnath Babu.
1 Database Systems ( 資料庫系統 ) December 7, 2011 Lecture #11.
CSE 544: Relational Operators, Sorting Wednesday, 5/12/2004.
DMBS Internals I. What Should a DBMS Do? Store large amounts of data Process queries efficiently Allow multiple users to access the database concurrently.
CS4432: Database Systems II Query Processing- Part 3 1.
CS411 Database Systems Kazuhiro Minami 11: Query Execution.
Multi pass algorithms. Nested-Loop joins Tuple-Based Nested-loop Join Algorithm: FOR each tuple s in S DO FOR each tuple r in R DO IF r and s join to.
CS4432: Database Systems II Query Processing- Part 2.
CSCE Database Systems Chapter 15: Query Execution 1.
CPSC 404, Laks V.S. Lakshmanan1 Evaluation of Relational Operations – Join Chapter 14 Ramakrishnan and Gehrke (Section 14.4)
Query Processing CS 405G Introduction to Database Systems.
Lecture 17: Query Execution Tuesday, February 28, 2001.
Query Execution. Where are we? File organizations: sorted, hashed, heaps. Indexes: hash index, B+-tree Indexes can be clustered or not. Data can be stored.
CS 440 Database Management Systems Lecture 5: Query Processing 1.
Relational Operator Evaluation. overview Projection Two steps –Remove unwanted attributes –Eliminate any duplicate tuples The expensive part is removing.
CS 540 Database Management Systems
Query Execution Query compiler Execution engine Index/record mgr. Buffer manager Storage manager storage User/ Application Query update Query execution.
DMBS Internals I February 24 th, What Should a DBMS Do? Store large amounts of data Process queries efficiently Allow multiple users to access the.
DMBS Internals I. What Should a DBMS Do? Store large amounts of data Process queries efficiently Allow multiple users to access the database concurrently.
DMBS Architecture May 15 th, Generic Architecture Query compiler/optimizer Execution engine Index/record mgr. Buffer manager Storage manager storage.
Tallahassee, Florida, 2016 COP5725 Advanced Database Systems Query Processing Spring 2016.
1 Lecture 23: Query Execution Monday, November 26, 2001.
1 Lecture 16: Data Storage Wednesday, November 6, 2006.
CS 540 Database Management Systems
CS 440 Database Management Systems
Lecture 16: Data Storage Wednesday, November 6, 2006.
Database Applications (15-415) DBMS Internals- Part VII Lecture 16, October 25, 2016 Mohammad Hammoud.
Evaluation of Relational Operations
Lecture 11: DMBS Internals
Database Management Systems (CS 564)
Implementation of Relational Operations (Part 2)
Selected Topics: External Sorting, Join Algorithms, …
Query Execution Two-pass Algorithms based on Hashing
(Two-Pass Algorithms)
Lecture 24: Query Execution
Lecture 13: Query Execution
Lecture 23: Query Execution
Lecture 22: Query Execution
Sorting We may build an index on the relation, and then use the index to read the relation in sorted order. May lead to one disk block access for each.
CSE 444: Lecture 25 Query Execution
Lecture 22: Query Execution
Lecture 11: B+ Trees and Query Execution
CSE 544: Query Execution Wednesday, 5/12/2004.
Lecture 22: Friday, November 22, 2002.
Lecture 24: Query Execution
Lecture 20: Query Execution
Presentation transcript:

Lecture 24 Query Execution Monday, November 28, 2005

Outline Partitioned-base Hash Algorithms External Sorting Sort-based Algorithms

Two Pass Algorithms Based on Hashing Idea: partition a relation R into buckets, on disk Each bucket has size approx. B(R)/M M main memory buffers Disk Relation R OUTPUT 2 INPUT 1 hash function h M-1 Partitions 1 2 M B(R) Does each bucket fit in main memory ? –Yes if B(R)/M <= M, i.e. B(R) <= M 2

Hash Based Algorithms for  Recall:  (R)  duplicate elimination Step 1. Partition R into buckets Step 2. Apply  to each bucket (may read in main memory) Cost: 3B(R) Assumption:B(R) <= M 2

Hash Based Algorithms for  Recall:  (R)  grouping and aggregation Step 1. Partition R into buckets Step 2. Apply  to each bucket (may read in main memory) Cost: 3B(R) Assumption:B(R) <= M 2

Partitioned Hash Join R |x| S Step 1: –Hash S into M buckets –send all buckets to disk Step 2 –Hash R into M buckets –Send all buckets to disk Step 3 –Join every pair of buckets

Hash-Join Partition both relations using hash fn h: R tuples in partition i will only match S tuples in partition i. v Read in a partition of R, hash it using h2 (<> h!). Scan matching partition of S, search for matches. Partitions of R & S Input buffer for Ri Hash table for partition Si ( < M-1 pages) B main memory buffers Disk Output buffer Disk Join Result hash fn h2 B main memory buffers Disk Original Relation OUTPUT 2 INPUT 1 hash function h M-1 Partitions 1 2 M-1...

Partitioned Hash Join Cost: 3B(R) + 3B(S) Assumption: min(B(R), B(S)) <= M 2

Hybrid Hash Join Algorithm Partition S into k buckets t buckets S 1, …, S t stay in memory k-t buckets S t+1, …, S k to disk Partition R into k buckets –First t buckets join immediately with S –Rest k-t buckets go to disk Finally, join k-t pairs of buckets: (R t+1,S t+1 ), (R t+2,S t+2 ), …, (R k,S k )

Hybrid Join Algorithm How to choose k and t ? –Choose k large but s.t. k <= M –Choose t/k large but s.t. t/k * B(S) <= M –Moreover: t/k * B(S) + k-t <= M Assuming t/k * B(S) >> k-t: t/k = M/B(S)

Hybrid Join Algorithm How many I/Os ? Cost of partitioned hash join: 3B(R) + 3B(S) Hybrid join saves 2 I/Os for a t/k fraction of buckets Hybrid join saves 2t/k(B(R) + B(S)) I/Os Cost: (3-2t/k)(B(R) + B(S)) = (3-2M/B(S))(B(R) + B(S))

Hybrid Join Algorithm Question in class: what is the real advantage of the hybrid algorithm ?

The I/O Model of Computation In main memory: CPU time –Big O notation ! In databases time is dominated by I/O cost –Big O too, but for I/O’s –Often big O becomes a constant Consequence: need to redesign certain algorithms See sorting next

Sorting Problem: sort 1 GB of data with 1MB of RAM. Where we need this: –Data requested in sorted order (ORDER BY) –Needed for grouping operations –First step in sort-merge join algorithm –Duplicate removal –Bulk loading of B+-tree indexes.

2-Way Merge-sort: Requires 3 Buffers in RAM Pass 1: Read a page, sort it, write it. Pass 2, 3, …, etc.: merge two runs, write them Main memory buffers INPUT 1 INPUT 2 OUTPUT Disk Runs of length L Runs of length 2L

Two-Way External Merge Sort Assume block size is B = 4Kb Step 1  runs of length L = 4Kb Step 2  runs of length L = 8Kb Step 3  runs of length L = 16Kb Step 9  runs of length L = 1MB... Step 19  runs of length L = 1GB (why ?) Need 19 iterations over the disk data to sort 1GB

Can We Do Better ? Hint: We have 1MB of main memory, but only used 12KB

Cost Model for Our Analysis B(R): size of R in number of blocks M: Size of main memory (in # of blocks) E.g. block size = 4KB then: –B(R) = –M = 250

External Merge-Sort Phase one: load M bytes in memory, sort –Result: runs of length M bytes ( 1MB ) M bytes of main memory Disk... M

Phase Two Merge M – 1 runs into a new run (250 runs ) Result: runs of length M (M – 1) bytes (250^2) M bytes of main memory Disk... Input M/B Input 1 Input 2.. Output

Phase Three Merge M – 1 runs into a new run Result: runs of length M (M – 1) 2 M bytes of main memory Disk... Input M/B Input 1 Input 2.. Output

Cost of External Merge Sort Number of passes: How much data can we sort with 10MB RAM? –1 pass  10MB data –2 passes  25GB data Can sort everything in 2 or 3 passes !

External Merge Sort The xsort tool in the XML toolkit sorts using this algorithm Can sort 1GB of XML data in about 8 minutes

Two-Pass Algorithms Based on Sorting Assumption: multi-way merge sort needs only two passes Assumption: B(R) <= M 2 Cost for sorting: 3B(R)

Two-Pass Algorithms Based on Sorting Duplicate elimination  (R) Trivial idea: sort first, then eliminate duplicates Step 1: sort chunks of size M, write –cost 2B(R) Step 2: merge M-1 runs, but include each tuple only once –cost B(R) Total cost: 3B(R), Assumption: B(R) <= M 2

Two-Pass Algorithms Based on Sorting Grouping:  a, sum(b) (R) Same as before: sort, then compute the sum(b) for each group of a’s Total cost: 3B(R) Assumption: B(R) <= M 2

Two-Pass Algorithms Based on Sorting R ∪ S x = first(R) y = first(S) While (_______________) do { case x y; } x = first(R) y = first(S) While (_______________) do { case x y; } Complete the program in class:

Two-Pass Algorithms Based on Sorting R ∩ S x = first(R) y = first(S) While (_______________) do { case x < y: case x=y: case x > y; } x = first(R) y = first(S) While (_______________) do { case x < y: case x=y: case x > y; } Complete the program in class:

Two-Pass Algorithms Based on Sorting R - S Complete the program in class: x = first(R) y = first(S) While (_______________) do { case x < y: case x=y: case x > y; } x = first(R) y = first(S) While (_______________) do { case x < y: case x=y: case x > y; }

Two-Pass Algorithms Based on Sorting Binary operations: R ∪ S, R ∩ S, R – S Idea: sort R, sort S, then do the right thing A closer look: –Step 1: split R into runs of size M, then split S into runs of size M. Cost: 2B(R) + 2B(S) –Step 2: merge M/2 runs from R; merge M/2 runs from S; ouput a tuple on a case by cases basis Total cost: 3B(R)+3B(S) Assumption: B(R)+B(S)<= M 2

Two-Pass Algorithms Based on Sorting R |x| R.A =S.B S x = first(R) y = first(S) While (_______________) do { case x.A < y.B: case x.A=y.B: case x.A > y.B; } x = first(R) y = first(S) While (_______________) do { case x.A < y.B: case x.A=y.B: case x.A > y.B; } Complete the program in class: R(A,C) sorted on A S(B,D) sorted on B

Two-Pass Algorithms Based on Sorting Join R |x| S Start by sorting both R and S on the join attribute: –Cost: 4B(R)+4B(S) (because need to write to disk) Read both relations in sorted order, match tuples –Cost: B(R)+B(S) Difficulty: many tuples in R may match many in S –If at least one set of tuples fits in M, we are OK –Otherwise need nested loop, higher cost Total cost: 5B(R)+5B(S) Assumption: B(R) <= M 2, B(S) <= M 2

Two-Pass Algorithms Based on Sorting Join R |x| S If the number of tuples in R matching those in S is small (or vice versa) we can compute the join during the merge phase Total cost: 3B(R)+3B(S) Assumption: B(R) + B(S) <= M 2