Lecture 22: Friday, November 22, 2002.

Slides:



Advertisements
Similar presentations
Query Execution, Concluded Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems November 18, 2003 Some slide content may.
Advertisements

1 Lecture 23: Query Execution Friday, March 4, 2005.
CS CS4432: Database Systems II Operator Algorithms Chapter 15.
Database Management Systems 3ed, R. Ramakrishnan and Johannes Gehrke1 Evaluation of Relational Operations: Other Techniques Chapter 14, Part B.
Database Management Systems, R. Ramakrishnan and Johannes Gehrke1 Evaluation of Relational Operations: Other Techniques Chapter 12, Part B.
Nested-Loop joins “one-and-a-half” pass method, since one relation will be read just once. Tuple-Based Nested-loop Join Algorithm: FOR each tuple s in.
Lecture 24: Query Execution Monday, November 20, 2000.
1 Lecture 22: Query Execution Wednesday, March 2, 2005.
15.5 Two-Pass Algorithms Based on Hashing 115 ChenKuang Yang.
Query Execution :Nested-Loop Joins Rohit Deshmukh ID 120 CS-257 Rohit Deshmukh ID 120 CS-257.
Query Optimization 3 Cost Estimation R&G, Chapters 12, 13, 14 Lecture 15.
Introduction to Database Systems 1 Join Algorithms Query Processing: Lecture 1.
1 Evaluation of Relational Operations: Other Techniques Chapter 12, Part B.
1 Query Processing: The Basics Chapter Topics How does DBMS compute the result of a SQL queries? The most often executed operations: –Sort –Projection,
CS 4432query processing - lecture 171 CS4432: Database Systems II Lecture #17 Join Processing Algorithms (cont). Professor Elke A. Rundensteiner.
1 Relational Operators. 2 Outline Logical/physical operators Cost parameters and sorting One-pass algorithms Nested-loop joins Two-pass algorithms.
CSE 444: Lecture 24 Query Execution Monday, March 7, 2005.
ICOM 6005 – Database Management Systems Design Dr. Manuel Rodríguez-Martínez Electrical and Computer Engineering Department Lecture 14 – Join Processing.
1 Lecture 25 Friday, November 30, Outline Query execution –Two pass algorithms based on indexes (6.7) Query optimization –From SQL to logical.
CS4432: Database Systems II Query Processing- Part 3 1.
CS411 Database Systems Kazuhiro Minami 11: Query Execution.
Lecture 24 Query Execution Monday, November 28, 2005.
Multi pass algorithms. Nested-Loop joins Tuple-Based Nested-loop Join Algorithm: FOR each tuple s in S DO FOR each tuple r in R DO IF r and s join to.
CS4432: Database Systems II Query Processing- Part 2.
CSCE Database Systems Chapter 15: Query Execution 1.
CPSC 404, Laks V.S. Lakshmanan1 Evaluation of Relational Operations – Join Chapter 14 Ramakrishnan and Gehrke (Section 14.4)
Lecture 17: Query Execution Tuesday, February 28, 2001.
CS 440 Database Management Systems Lecture 5: Query Processing 1.
Relational Operator Evaluation. overview Projection Two steps –Remove unwanted attributes –Eliminate any duplicate tuples The expensive part is removing.
CS 540 Database Management Systems
Tallahassee, Florida, 2016 COP5725 Advanced Database Systems Query Processing Spring 2016.
1 Lecture 23: Query Execution Monday, November 26, 2001.
CS 540 Database Management Systems
CS 440 Database Management Systems
Database Applications (15-415) DBMS Internals- Part VII Lecture 16, October 25, 2016 Mohammad Hammoud.
Evaluation of Relational Operations
15.5 Two-Pass Algorithms Based on Hashing
Evaluation of Relational Operations: Other Operations
Cse 344 April 25th – Disk i/o.
Implementation of Relational Operations (Part 2)
Relational Operations
Database Applications (15-415) DBMS Internals- Part VII Lecture 19, March 27, 2018 Mohammad Hammoud.
Example R1 R2 over common attribute C
Query Execution Two-pass Algorithms based on Hashing
(Two-Pass Algorithms)
April 27th – Cost Estimation
Chapters 15 and 16b: Query Optimization
Lecture 2- Query Processing (continued)
Lecture 25: Query Execution
Lecture 27: Optimizations
Lecture 24: Query Execution
Lecture 25: Query Optimization
Lecture 13: Query Execution
CS505: Intermediate Topics in Database Systems
Query Execution Index Based Algorithms (15.6)
Lecture 23: Query Execution
Data-Intensive Computing Systems Query Execution (Sort and Join operators) Shivnath Babu.
Evaluation of Relational Operations: Other Techniques
Overview of Query Evaluation: JOINS
Lecture 22: Query Execution
Sorting We may build an index on the relation, and then use the index to read the relation in sorted order. May lead to one disk block access for each.
CSE 444: Lecture 25 Query Execution
Lecture 22: Query Execution
External Sorting Sorting is used in implementing many relational operations Problem: Relations are typically large, do not fit in main memory So cannot.
Lecture 11: B+ Trees and Query Execution
CSE 544: Query Execution Wednesday, 5/12/2004.
Lecture 23: Monday, November 25, 2002.
Evaluation of Relational Operations: Other Techniques
Lecture 24: Query Execution
Lecture 20: Query Execution
Presentation transcript:

Lecture 22: Friday, November 22, 2002

Outline Query execution: 15.1 – 15.5

Two-Pass Algorithms Based on Sorting Recall: multi-way merge sort needs only two passes ! Assumption: B(R) <= M2 Cost for sorting: 3B(R)

Two-Pass Algorithms Based on Sorting Duplicate elimination d(R) Trivial idea: sort first, then eliminate duplicates Step 1: sort chunks of size M, write cost 2B(R) Step 2: merge M-1 runs, but include each tuple only once cost B(R) Total cost: 3B(R), Assumption: B(R) <= M2

Two-Pass Algorithms Based on Sorting Grouping: ga, sum(b) (R) Same as before: sort, then compute the sum(b) for each group of a’s Total cost: 3B(R) Assumption: B(R) <= M2

Two-Pass Algorithms Based on Sorting x = first(R) y = first(S) While (_______________) do { case x < y: output(x) x = next(R) case x=y: case x > y; } R ∪ S Complete the program in class:

Two-Pass Algorithms Based on Sorting x = first(R) y = first(S) While (_______________) do { case x < y: case x=y: case x > y; } R ∩ S Complete the program in class:

Two-Pass Algorithms Based on Sorting x = first(R) y = first(S) While (_______________) do { case x < y: case x=y: case x > y; } R - S Complete the program in class:

Two-Pass Algorithms Based on Sorting Binary operations: R ∪ S, R ∩ S, R – S Idea: sort R, sort S, then do the right thing A closer look: Step 1: split R into runs of size M, then split S into runs of size M. Cost: 2B(R) + 2B(S) Step 2: merge M/2 runs from R; merge M/2 runs from S; ouput a tuple on a case by cases basis Total cost: 3B(R)+3B(S) Assumption: B(R)+B(S)<= M2

Two-Pass Algorithms Based on Sorting R(A,C), S(B,D) x = first(R) y = first(S) While (_______________) do { case x.A < y.B: case x.A=y.B: case x.A > y.B; } R ⋈R.A =S.B S Complete the program in class:

Two-Pass Algorithms Based on Sorting Join R ⋈ S Start by sorting both R and S on the join attribute: Cost: 4B(R)+4B(S) (because need to write to disk) Read both relations in sorted order, match tuples Cost: B(R)+B(S) Difficulty: many tuples in R may match many in S If at least one set of tuples fits in M, we are OK Otherwise need nested loop, higher cost Total cost: 5B(R)+5B(S) Assumption: B(R) <= M2, B(S) <= M2

Two-Pass Algorithms Based on Sorting Join R ⋈ S If the number of tuples in R matching those in S is small (or vice versa) we can compute the join during the merge phase Total cost: 3B(R)+3B(S) Assumption: B(R) + B(S) <= M2

Two Pass Algorithms Based on Hashing Idea: partition a relation R into buckets, on disk Each bucket has size approx. B(R)/M Does each bucket fit in main memory ? Yes if B(R)/M <= M, i.e. B(R) <= M2 M main memory buffers Disk Relation R OUTPUT 2 INPUT 1 hash function h M-1 Partitions . . . 1 2 B(R)

Hash Based Algorithms for d Recall: d(R) = duplicate elimination Step 1. Partition R into buckets Step 2. Apply d to each bucket (may read in main memory) Cost: 3B(R) Assumption:B(R) <= M2

Hash Based Algorithms for g Recall: g(R) = grouping and aggregation Step 1. Partition R into buckets Step 2. Apply g to each bucket (may read in main memory) Cost: 3B(R) Assumption:B(R) <= M2

Partitioned Hash Join R ⋈ S Step 1: Step 2 Step 3 Hash S into M buckets send all buckets to disk Step 2 Hash R into M buckets Send all buckets to disk Step 3 Join every pair of buckets

Hash table for partition Hash-Join B main memory buffers Disk Original Relation OUTPUT 2 INPUT 1 hash function h M-1 Partitions . . . Partition both relations using hash fn h: R tuples in partition i will only match S tuples in partition i. Partitions of R & S Input buffer for Ri Hash table for partition Si ( < M-1 pages) B main memory buffers Disk Output buffer Join Result hash fn h2 Read in a partition of R, hash it using h2 (<> h!). Scan matching partition of S, search for matches. 14

Partitioned Hash Join Cost: 3B(R) + 3B(S) Assumption: min(B(R), B(S)) <= M2

Hybrid Hash Join Algorithm Partition S into k buckets But keep first bucket S1 in memory, k-1 buckets to disk Partition R into k buckets First bucket R1 is joined immediately with S1 Other k-1 buckets go to disk Finally, join k-1 pairs of buckets: (R2,S2), (R3,S3), …, (Rk,Sk)

Hybrid Join Algorithm How big should we choose k ? Average bucket size for S is B(S)/k Need to fit B(S)/k + (k-1) blocks in memory B(S)/k + (k-1) <= M k slightly smaller than B(S)/M

Hybrid Join Algorithm How many I/Os ? Recall: cost of partitioned hash join: 3B(R) + 3B(S) Now we save 2 disk operations for one bucket Recall there are k buckets Hence we save 2/k(B(R) + B(S)) Cost: (3-2/k)(B(R) + B(S)) = (3-2M/B(S))(B(R) + B(S))

Hybrid Join Algorithm Question in class: what is the real advantage of the hybrid algorithm ?

Indexed Based Algorithms Recall that in a clustered index all tuples with the same value of the key are clustered on as few blocks as possible Note: book uses another term: “clustering index”. Difference is minor… a a a a a a a a a a

Index Based Selection Selection on equality: sa=v(R) Clustered index on a: cost B(R)/V(R,a) Unclustered index on a: cost T(R)/V(R,a)

Index Based Selection Example: B(R) = 2000, T(R) = 100,000, V(R, a) = 20, compute the cost of sa=v(R) Cost of table scan: If R is clustered: B(R) = 2000 I/Os If R is unclustered: T(R) = 100,000 I/Os Cost of index based selection: If index is clustered: B(R)/V(R,a) = 100 If index is unclustered: T(R)/V(R,a) = 5000 Notice: when V(R,a) is small, then unclustered index is useless

Index Based Join R ⋈ S Assume S has an index on the join attribute Iterate over R, for each tuple fetch corresponding tuple(s) from S Assume R is clustered. Cost: If index is clustered: B(R) + T(R)B(S)/V(S,a) If index is unclustered: B(R) + T(R)T(S)/V(S,a)

Index Based Join Assume both R and S have a sorted index (B+ tree) on the join attribute Then perform a merge join (called zig-zag join) Cost: B(R) + B(S)

Questions in Class B(Product), B(Company) are large Which join method would you use ? Consider: 10 bozos v.s. 100…0 10 cool companies v.s. 100…00 SELECT Product.name, Company.city FROM Product, Company WHERE Product.maker = Company.name and Product.category = ‘bozo’ and Company.rating = ‘cool’