Sidharth Mishra Dr. T.Y. Lin CS 257 Section 1 MH 222 SJSU - Fall 2016

15.2 One-Pass Algorithms

Key questions:
How should we execute each of the individual steps of a logical query plan?
What is a One-Pass algorithm?
How does the One-Pass algorithm work for different operators?

How to execute each of the individual steps of a logical query plan?
Each step of the plan is an operation such as a join, a selection, or a grouping. Join, selection, projection, etc. are operators. Algorithms for the operators are broadly classified into 3 classes:
Sorting-based (to be covered in Section 15.4)
Hash-based (to be covered in Sections 15.5 and 20.1)
Index-based (to be covered in Section 15.6)

How to execute each of the individual steps of a logical query plan?
Algorithms for operators are divided into 3 “degrees” of difficulty and cost: One-Pass algorithms (covered in this Section) Two-Pass algorithms (covered in Sections and 15.5) Multi-Pass algorithms (covered in Section 15.8)

What is a One-Pass algorithm?
It is an algorithm that reads the data from disk only once. Usually, the algorithm requires that at least 1 of the arguments of the operator fits in main memory. Exceptions: the Selection and Projection operators.

How does the One-Pass algorithm work for different operators?
Operators are classified into 3 broad groups:
Tuple-at-a-time, unary operations
Full-relation, unary operations
Full-relation, binary operations

How does the One-Pass algorithm work for Tuple-at-a-Time, Unary operations?
Selection (σ) and Projection (∏). They don’t require an entire relation, or even a large part of it, in memory at once. (For this reason they are exceptions to the usual One-Pass requirement that an argument fit in memory.) Read one block at a time, use 1 main-memory buffer, and produce output.

2 buffers = 1 input buffer and 1 output buffer. Read the blocks of R, one at a time, into the input buffer. Perform the operation on each tuple (keep or discard it for a selection, trim it for a projection). Move the selected or projected tuples into the output buffer.

Space requirements: M >= 1, for the input buffer only, regardless of B. Note - The output buffer is not counted as needed space because it might be functioning as the I/P buffer for another operation or sending data to the end-user.

Disk I/O requirements: These depend on how R is stored initially. If R is initially on disk, the cost is the time taken for a table scan or an index scan of R. Typically, the cost is B if R is clustered and T if R is not clustered.
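
The tuple-at-a-time scheme above can be sketched as follows. This is a minimal illustration, not code from the slides: the name one_pass_select_project and the list-of-lists representation of disk blocks are assumptions made for the example. Each iteration of the outer loop stands for one block read into the single input buffer.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Row = Tuple          # a tuple of attribute values
Block = List[Row]    # hypothetical model of one disk block


def one_pass_select_project(
    blocks: Iterable[Block],
    predicate: Callable[[Row], bool],
    columns: List[int],
) -> Iterator[Row]:
    """Apply a selection and then a projection, one block at a time."""
    for block in blocks:              # one disk I/O per block (clustered R)
        for row in block:             # tuple-at-a-time processing
            if predicate(row):        # selection: keep or discard
                # projection: keep only the requested attribute positions
                yield tuple(row[i] for i in columns)


# Two blocks of R(id, name, score); keep rows with score >= 20, project (id, name).
R = [[(1, "a", 10), (2, "b", 20)], [(3, "c", 30)]]
result = list(one_pass_select_project(R, lambda r: r[2] >= 20, [0, 1]))
```

Because the function is a generator, only the current block needs to be held at any time, which mirrors the M >= 1 requirement.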

How does the One-Pass algorithm work for Full Relation, Unary operations?
These one-argument operations require seeing all or most of the tuples in memory at once. One-Pass algorithms are therefore applicable only to relations whose size is approximately M (the number of main-memory buffers available) or less. The operators in this group are Grouping (Ɣ) and Duplicate-Elimination (δ).

Duplicate-Elimination Operator
Buffer layout (3 buffers shown): 1 buffer is used for the incoming tuples; the other buffers store a copy of every tuple seen so far.

One-Pass algorithm for Duplicate-Elimination operator (δ)
We read in each block of R one at a time, and for each tuple we decide between two cases:
It is the first time we have seen this tuple: copy it to the O/P buffer.
We have seen this tuple before: don’t copy it to the O/P buffer.

One-Pass algorithm for Duplicate-Elimination operator (δ)
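Naive data-structure (list) scenario: with n tuples held in memory in a list, each new tuple takes processor time ∝ n, so the complete operation takes time ∝ n^2. A hash table or a balanced BST can be used instead; these introduce some space overhead, but the overhead is small compared to the space for storing the tuples. Memory requirement: B(δ(R)) ≤ M - 1, usually approximated as B(δ(R)) ≤ M [∵ only M - 1 buffers are available to hold the distinct tuples; 1 buffer is the input buffer]. In general, we cannot compute the size of δ(R) without computing δ(R) itself.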
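
A minimal sketch of the duplicate-elimination pass, assuming tuples are hashable Python tuples and that the set of distinct tuples fits in memory. The function name one_pass_distinct is hypothetical. A Python set plays the role of the hash table mentioned above, so each membership test takes expected constant time instead of a scan over a list.

```python
from typing import Iterable, Iterator, List, Set, Tuple

Row = Tuple
Block = List[Row]


def one_pass_distinct(blocks: Iterable[Block]) -> Iterator[Row]:
    """Emit each distinct tuple of R the first time it is seen."""
    seen: Set[Row] = set()       # models the M - 1 buffers of seen tuples
    for block in blocks:         # one block at a time in the input buffer
        for row in block:
            if row not in seen:  # first occurrence: remember it and output it
                seen.add(row)
                yield row
            # otherwise: seen before, so it is not copied to the output


R = [[(1,), (2,)], [(1,), (3,), (2,)]]
distinct = list(one_pass_distinct(R))
```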

One-Pass algorithm for Grouping operator (ƔL)
The list L gives us zero or more grouping attributes and one or more aggregated attributes. Scan the tuples of R one block at a time, creating one entry in main memory for each group, i.e., for each distinct value of the grouping attributes. The entry for a group consists of the values of the grouping attributes and the accumulated value(s) for the aggregation(s). When all tuples of R have been read into the input buffer and have contributed to the aggregation(s) for their group, the output is produced by writing one tuple per group. NOTE - Until the last tuple has been seen, we can’t begin to create output for the Ɣ operation.

One-Pass algorithm for Grouping operator (ƔL) - Aggregate operations
MIN(a) or MAX(a) aggregate: Record the minimum or maximum value, respectively, of attribute ‘a’ seen for any tuple in the group so far. Change this min or max, if appropriate, each time a tuple of the group is seen.
COUNT aggregation: Add one for each tuple of the group that is seen.
SUM(a) aggregation: Add the value of ‘a’ to the accumulated sum for its group. [a != NULL]
AVG(a) aggregation (hard case): Maintain 2 accumulations: the count of the number of tuples in the group (computed as for the COUNT aggregation), and the accumulated sum of the attribute values of these tuples (computed as for the SUM aggregation). After all tuples of R are seen, the quotient of sum and count is the average.
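
The accumulation rules above can be sketched as follows. This is an illustrative example under stated assumptions: the name one_pass_group and the dictionary of accumulators are not from the slides, and NULL is modelled as Python None. As an assumption in the SQL style, NULL values of ‘a’ are skipped for MIN, MAX, SUM, and AVG, so AVG divides by the count of non-NULL values rather than by the COUNT of all tuples.

```python
from typing import Dict, Iterable, List, Tuple

Row = Tuple
Block = List[Row]


def one_pass_group(
    blocks: Iterable[Block], group_cols: List[int], agg_col: int
) -> List[Tuple]:
    """Return (group key..., MIN, MAX, COUNT, SUM, AVG) for each group."""
    groups: Dict[Tuple, dict] = {}   # one in-memory entry per group
    for block in blocks:
        for row in block:
            key = tuple(row[i] for i in group_cols)
            acc = groups.setdefault(
                key,
                {"min": None, "max": None, "count": 0, "sum": None, "non_null": 0},
            )
            acc["count"] += 1        # COUNT: one per tuple of the group
            a = row[agg_col]
            if a is None:            # assumption: NULLs do not feed MIN/MAX/SUM/AVG
                continue
            acc["min"] = a if acc["min"] is None else min(acc["min"], a)
            acc["max"] = a if acc["max"] is None else max(acc["max"], a)
            acc["sum"] = a if acc["sum"] is None else acc["sum"] + a
            acc["non_null"] += 1
    # Output can only be produced now, after the last tuple has been seen.
    out = []
    for key, acc in groups.items():
        avg = acc["sum"] / acc["non_null"] if acc["non_null"] else None
        out.append(key + (acc["min"], acc["max"], acc["count"], acc["sum"], avg))
    return out


R = [[("x", 5), ("y", 1)], [("x", 3), ("y", None), ("x", 10)]]
grouped = one_pass_group(R, group_cols=[0], agg_col=1)
```

Only the accumulators are kept per group, not the tuples themselves, which is why the memory needed depends on the number of groups rather than on B.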

Why does the One-Pass algorithm for Grouping operator (ƔL) not fit Iterator framework?
We can’t produce output before the last tuple is seen, so the entire grouping has to be done by Open() before the first tuple can be retrieved by GetNext(). The main-memory data-structure used should be able to find the entry for each group, given values for the grouping attributes. Hash-tables or balanced trees are commonly used. The search key for these structures is the grouping attributes only. I/O’s needed = B(R) (Clustered). I/O’s needed = T(R) (Non-Clustered). The number of memory buffers required, M, is not related to B in any simple way, although typically M is less than B.
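
A small sketch of why grouping is a blocking operator in the iterator framework. The class GroupCountIterator and its method names open, get_next, and close are hypothetical Python stand-ins for Open(), GetNext(), and Close(); only a COUNT per group is computed to keep the example short. All the work happens in open(), and get_next() merely walks over the finished groups.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

Row = Tuple
Block = List[Row]


class GroupCountIterator:
    """Iterator-style COUNT(*) per group; grouping is completed in open()."""

    def __init__(self, blocks: Iterable[Block], group_col: int) -> None:
        self._blocks = blocks
        self._group_col = group_col
        self._results: Optional[Iterator[Tuple]] = None

    def open(self) -> None:
        counts: Dict[object, int] = {}   # search key: grouping attribute only
        for block in self._blocks:       # the whole of R is consumed here
            for row in block:
                key = row[self._group_col]
                counts[key] = counts.get(key, 0) + 1
        self._results = iter(counts.items())

    def get_next(self) -> Optional[Tuple]:
        # Returns None when no groups remain (stands in for "not found").
        return next(self._results, None)

    def close(self) -> None:
        self._results = None


it = GroupCountIterator([[("a",), ("b",)], [("a",)]], group_col=0)
it.open()
first, second, third = it.get_next(), it.get_next(), it.get_next()
it.close()
```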

How does the One-Pass algorithm work for Full Relation, Binary operations?
Binary operations discussed in the book:
Union (has bag and set variants)
Intersection (has bag and set variants)
Difference (has bag and set variants)
Product
Join (Natural Join)
Equi-Joins can be implemented the same way as the natural join after the attributes are renamed appropriately. Theta-Joins can be implemented as a product or equi-join followed by a selection for the conditions that cannot be expressed in an equi-join.

How does the One-Pass algorithm work for Full Relation, Binary operations?
Bag union can be achieved using M = 1 regardless of the sizes of R and S. The other operations require the smaller of R and S to be in memory, together with a data-structure that supports fast inserts and searches. Hash-tables and balanced trees are commonly used. Approximate requirement for the other operations: min(B(R), B(S)) ≤ M. This is a memory bound, so it is counted in blocks whether or not the relations are clustered on disk; clustering affects only the disk I/O cost (B for clustered, T for non-clustered). 1 buffer is used to read the blocks of the larger relation, and approximately M buffers are needed to store the blocks of the smaller relation in its main-memory data-structure.

One-Pass algorithm for Union operation
Bag and Set variants of Union [∪B and ∪S]:
For R ∪B S - Copy each tuple of R to the O/P buffer and then copy every tuple of S to the O/P buffer. Number of disk I/O’s = B(R) + B(S) [Clustered]. Number of disk I/O’s = T(R) + T(S) [Non-Clustered]. Can be achieved using M = 1 regardless of the sizes of R and S.
For R ∪S S - Assuming R is the larger of the relations, read S into M - 1 buffers, build a search structure whose search key is the entire tuple, and copy every tuple of S to the O/P buffer. Then read each block of R into the Mth buffer, one at a time. For each tuple t of R, check whether t is in S; if not, copy t to the O/P buffer, else skip t.
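
Both union variants described above can be sketched as follows, assuming that R and S are passed as lists of blocks and that, for the set variant, R and S are themselves sets (no internal duplicates). The function names bag_union and set_union are illustrative only.

```python
from typing import Iterable, Iterator, List, Set, Tuple

Row = Tuple
Block = List[Row]


def bag_union(r_blocks: Iterable[Block], s_blocks: Iterable[Block]) -> Iterator[Row]:
    """R bag-union S: stream R, then stream S; one buffer is enough."""
    for blocks in (r_blocks, s_blocks):
        for block in blocks:
            yield from block


def set_union(r_blocks: Iterable[Block], s_blocks: Iterable[Block]) -> Iterator[Row]:
    """R set-union S, with S (the smaller relation) held in memory."""
    in_s: Set[Row] = set()           # search structure keyed on the whole tuple
    for block in s_blocks:
        for t in block:
            if t not in in_s:        # guard in case S is not duplicate-free
                in_s.add(t)
                yield t              # every tuple of S goes to the output
    for block in r_blocks:           # R is read one block at a time
        for t in block:
            if t not in in_s:        # output t only if it is not already in S
                yield t


R = [[(1,), (2,)], [(3,)]]
S = [[(2,), (4,)]]
bag_result = list(bag_union(R, S))
set_result = list(set_union(R, S))
```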

One-Pass algorithm for Intersection operation
Bag and Set variants of the Intersection operation [∩B and ∩S]:
R ∩B S: Read S into M - 1 buffers and associate a count with each distinct tuple of S. Multiple copies of a tuple t are not stored individually; the structure looks like {(t, c), …}. Read in each block of R, one at a time, and for each tuple t of R, check whether t occurs in S. If not, skip t. If it does and the count of t is > 0, output t and decrement the count by 1. If the count of t = 0, skip t. Space assumption: B(S) ≤ M.
R ∩S S: Read S into M - 1 buffers and build a search structure with full tuples as the search key. Read each block of R, and for each tuple t of R, check whether t is also in S. If yes, copy t to the O/P buffer, else skip t.
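
The bag-intersection bookkeeping with (t, c) pairs can be sketched with a collections.Counter, which stores each distinct tuple of S once alongside its count. The function name bag_intersection and the block representation are assumptions for this example.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple

Row = Tuple
Block = List[Row]


def bag_intersection(
    r_blocks: Iterable[Block], s_blocks: Iterable[Block]
) -> Iterator[Row]:
    """R bag-intersect S, with S held in memory as {tuple: count}."""
    counts = Counter(t for block in s_blocks for t in block)
    for block in r_blocks:               # R is read one block at a time
        for t in block:
            if counts.get(t, 0) > 0:     # t still has unmatched copies in S
                counts[t] -= 1
                yield t
            # otherwise t is absent from S or its count is exhausted: skip


R = [[(1,), (1,), (2,)], [(1,), (3,)]]
S = [[(1,), (1,), (3,), (3,)]]
common = list(bag_intersection(R, S))
```

Here (1,) appears three times in R but only twice in S, so it is output twice, the minimum of the two multiplicities.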

One-Pass algorithm for Difference operation
Bag and Set variants of the Difference operation [-B and -S]:
Set difference, R -S S ≠ S -S R (not commutative): In both cases, read S into M - 1 buffers and build a search structure with the full tuple as the search key.
R -S S :: Read in each tuple t from R and check whether t is in S. If yes, skip t, else copy it to the O/P buffer.
S -S R :: Read in each tuple t from R and check whether t is in S. If yes, delete t from the copy of S in memory, else skip t. After all of R has been read, copy the remaining tuples of S to the O/P buffer.
Bag difference, R -B S ≠ S -B R (not commutative): In both cases, read S into M - 1 buffers and associate a count with each distinct tuple.
S -B R :: Read each tuple t from R and check whether t occurs in S; if yes, decrement its associated count (while the count is > 0). At the end, copy each tuple in main memory whose associated count is > 0 to the O/P buffer, as many times as that count.
R -B S :: Read each tuple t from R and check whether it occurs in S. If yes, look at the current count c associated with t. If c = 0, copy t to the O/P buffer. If c > 0, don’t copy t but decrement c by 1. If no, copy t to the O/P buffer.
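
The two bag-difference directions can be sketched as follows. In both, S is the relation held in memory as counts, as in the description above. The function names are hypothetical, and the block representation is an assumption for the example.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple

Row = Tuple
Block = List[Row]


def bag_difference_r_minus_s(
    r_blocks: Iterable[Block], s_blocks: Iterable[Block]
) -> Iterator[Row]:
    """R -B S: each copy of t in S cancels one copy of t in R."""
    counts = Counter(t for block in s_blocks for t in block)
    for block in r_blocks:
        for t in block:
            if counts.get(t, 0) > 0:
                counts[t] -= 1       # cancelled by a copy in S: not output
            else:
                yield t              # t not in S, or its count reached 0


def bag_difference_s_minus_r(
    r_blocks: Iterable[Block], s_blocks: Iterable[Block]
) -> Iterator[Row]:
    """S -B R: output is produced only after all of R has been read."""
    counts = Counter(t for block in s_blocks for t in block)
    for block in r_blocks:
        for t in block:
            if counts.get(t, 0) > 0:
                counts[t] -= 1       # one copy of t in S is cancelled
    for t, c in counts.items():
        for _ in range(c):           # emit t as many times as its count
            yield t


R = [[(1,), (1,), (1,), (2,)]]
S = [[(1,), (3,), (3,)]]
r_minus_s = list(bag_difference_r_minus_s(R, S))
s_minus_r = list(bag_difference_s_minus_r(R, S))
```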

One-Pass algorithm for Product operation
Read S into M - 1 buffers of memory. NO SPECIAL DATA STRUCTURE IS NEEDED. Read each block of R, and for each tuple t of R, concatenate t with each tuple of S in memory and copy each result to the O/P buffer as it is formed. This algorithm may take a considerable amount of processor time per tuple of R, because every tuple of R is paired with every tuple of S; the output has T(R) × T(S) tuples in total.
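
A minimal sketch of the product, assuming S fits in memory as a plain list (no search structure is needed, as noted above). The name one_pass_product is illustrative.

```python
from typing import Iterable, Iterator, List, Tuple

Row = Tuple
Block = List[Row]


def one_pass_product(
    r_blocks: Iterable[Block], s_blocks: Iterable[Block]
) -> Iterator[Row]:
    """R x S with S held entirely in memory and R streamed block by block."""
    s_tuples = [s for block in s_blocks for s in block]  # the M - 1 buffers
    for block in r_blocks:           # the Mth buffer holds one block of R
        for t in block:
            for s in s_tuples:       # pair t with every tuple of S
                yield t + s          # concatenate and output immediately


R = [[(1,), (2,)]]
S = [[("a",), ("b",)]]
pairs = list(one_pass_product(R, S))
```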

One-Pass algorithm for Natural Join
We assume R(X,Y) and S(Y,Z) are being joined, where Y represents all the attributes in common, X represents all attributes in R but not in S, and Z represents all attributes in S but not in R. Assuming S is the smaller of the relations:
Read all tuples of S and form them into a main-memory search structure with the attributes of Y as the search key. Use M - 1 blocks of memory for this.
Read each block of R into the Mth buffer/block. For each tuple t of R, find the tuples of S that agree with t on all attributes of Y, using the search structure. For each matching tuple of S, form a tuple by concatenating it with t, and move the resulting tuple to the O/P buffer.
Takes B(R) + B(S) [Clustered] and T(R) + T(S) [Non-Clustered] disk I/O’s to read the operands. Works as long as B(S) ≤ M - 1 or, approximately, B(S) ≤ M.
An Equi-Join checks for equality in the same way, although the attributes can have different names. A Theta-Join is an equi-join or product followed by a selection operation.
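
The join steps above can be sketched as a hash join in which a dictionary keyed on the Y attributes serves as the search structure. This is an illustration under stated assumptions: the function name one_pass_natural_join and the parameters r_y and s_y (the positions of the Y attributes in R and in S) are not from the slides. The Y values are kept once in each output tuple, taken from R.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Tuple
Block = List[Row]


def one_pass_natural_join(
    r_blocks: Iterable[Block],
    s_blocks: Iterable[Block],
    r_y: List[int],
    s_y: List[int],
) -> Iterator[Row]:
    """Join R(X,Y) with S(Y,Z); S is the smaller relation, held in memory."""
    index: Dict[Tuple, List[Row]] = defaultdict(list)
    for block in s_blocks:                   # build phase: read all of S
        for s in block:
            index[tuple(s[i] for i in s_y)].append(s)
    s_y_positions = set(s_y)
    for block in r_blocks:                   # probe phase: one block of R at a time
        for r in block:
            key = tuple(r[i] for i in r_y)
            for s in index.get(key, ()):     # tuples of S agreeing with r on Y
                z = tuple(v for i, v in enumerate(s) if i not in s_y_positions)
                yield r + z                  # X and Y from r, then Z from s


R = [[("x1", 1), ("x2", 2)], [("x3", 1)]]    # R(X, Y)
S = [[(1, "z1"), (1, "z2"), (3, "z3")]]      # S(Y, Z)
joined = list(one_pass_natural_join(R, S, r_y=[1], s_y=[0]))
```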

References
Hector Garcia-Molina, Jeffrey D. Ullman, Jennifer Widom. Database Systems: The Complete Book. Chapter 15, Section 15.2, One-Pass Algorithms.
