Advanced Database Systems: DBS CB, 2nd Edition

Query Processing & Optimization (Ch. 16)

Outline
- Database Logical and Physical Operators Overview
- Query Processing Overview
  - Parsing
  - Logical Query Plan
  - Estimate Result Size
  - Alternative Physical Plans
  - Estimate Costs (with and without indexes)
- Query Optimization Overview
  - Relational Algebra Optimization level
  - Query Plan Optimization level
    - Estimate Cost: (i) estimate size of result, (ii) estimate # of IOs
    - Generate and Compare Plans

Database Logical and Physical Operators Overview

Logical Operators Overview
Logical operators correspond to what the SQL programmer writes:
  BarInfo := Sells ⋈ Bars
corresponds to
  SELECT * FROM Sells, Bars WHERE Sells.bar = Bars.bar;
- Optimizer: the query optimizer is free to decide the join order, because the join operation is both commutative and associative. It is also free to choose which physical operator implements a given logical join.
- Physical join operators: the optimizer can choose mainly among 3 physical operators to implement a logical join. These physical joins are not visible to the SQL programmer.

Logical Operators Overview
- Union [R1 := R2 ∪ R3], intersection [R1 := R2 ∩ R3], and difference [R1 := R2 − R3]: the usual set operations, but both operands must have the same relation schema
- Selection [R1 := σ_C(R2)]: picking certain rows
- Projection [R1 := π_L(R2)]: picking certain columns
- Extended projection [R1 := π_{A+B→C, A, A}(R2)]
- Products [R3 := R1 × R2] and joins [R1 := R2 ⋈ R3]: compositions of relations
- Renaming [R1 := ρ_{R1(A1,…,An)}(R2)] of relations and attributes
- Grouping [R1 := γ_L(R2)]: SELECT Customer, SUM(OrderPrice), MIN(OrderDate) FROM Orders GROUP BY Customer;
- Sort [τ_{OrderDate}(Orders)]: SELECT * FROM Orders ORDER BY OrderDate ASC|DESC;
- Duplicate elimination [R1 := δ(R2)]

Join (Logical Join) Operators
- Inner join:
  - Cross join (×): Cartesian product
  - Equi-join (WHERE R1.col1 = R2.col2): cross join restricted by equality predicates only
  - Natural join (⋈): equi-join on all attributes the two relations have in common; the result schema is the union of the attributes of the two relations, with each common attribute appearing once
  - Theta join (⋈_C): cross join restricted by an arbitrary boolean-valued condition C
- Outer join:
  - Left outer join (left join): every tuple of the left relation is joined with each right tuple that satisfies the condition; if no right tuple matches, return one tuple with the left side and NULLs for the right-side attributes
  - Right outer join (right join): the mirror image of the left join
  - Full outer join (full join): union of the left join and the right join
- Self join: joining a table to itself

Physical Operators (Ch 15.1)
Physical join operators:
- Nested join: every tuple of the outer relation is tested against the inner table
- Merge join: efficient if both tables are already sorted on the join attribute
- Hash join: only used for equi-joins
Other physical operators:
- Sort
- Scan
- …

Relational Algebra on Bags vs. Sets
- A bag (or multiset) is like a set, but an element may appear more than once
- SQL is a bag language; some operations (like projection) are more efficient on bags than on sets
- If an element appears n times in R1 and m times in R2, it appears min(n, m) times in the bag intersection and max(0, n − m) times in the bag difference R1 − R2
- Some, but not all, algebraic laws that hold for sets also hold for bags
  - R1 ∪ R2 = R2 ∪ R1: the commutative law holds for bags
  - S ∪ S = S: the idempotent law does not hold for bags, e.g. {1} ∪ {1} = {1,1} ≠ {1}
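The bag rules on this slide can be tried out with a small sketch. It assumes Python's `collections.Counter` as a stand-in for a bag and uses the sum-of-counts definition of bag union; the helper names and the sample bags are illustrative only, not part of the course material.

```python
from collections import Counter

def bag_union(r, s):
    """Bag union: an element seen n times in r and m times in s appears n + m times."""
    return Counter(r) + Counter(s)

def bag_intersection(r, s):
    """Bag intersection: each element appears min(n, m) times."""
    return Counter(r) & Counter(s)

def bag_difference(r, s):
    """Bag difference r - s: each element appears max(0, n - m) times."""
    return Counter(r) - Counter(s)

# Hypothetical sample bags.
r = ["a", "a", "b"]
s = ["a", "b", "b", "c"]

print(bag_union(r, s))
print(bag_intersection(r, s))
print(bag_difference(r, s))

# Commutativity holds for bags, idempotence does not.
print(bag_union(r, s) == bag_union(s, r))
print(bag_union([1], [1]) == Counter([1]))
```

`Counter` drops elements whose count falls to zero or below, which matches the max(0, n − m) rule for difference.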

Query Processing Overview

Example query:
  SELECT B, D FROM R, S WHERE R.A = "c" AND S.E = 2 AND R.C = S.C;

  R | A  B  C        S | C   D  E
      a  …  …            10  x  2
      b  …  …            …   y  2
      c  2  10           …   z  2
      d  …  …            …   x  1
      e  …  …            …   y  3

  Answer | B  D
           2  x

How do we execute a query? One idea:
- Do Cartesian product
- Select tuples
- Do projection

  R × S | R.A  R.B  R.C  S.C  S.D  S.E
          a    …    …    …    x    2
          a    …    …    …    y    2
          ⋮
          c    2    10   10   x    2     ← Bingo! Got one...

Relational Algebra can be used to describe plans
Plan I (operator tree, root first):
  π_{B,D}
    σ_{R.A="c" ∧ S.E=2 ∧ R.C=S.C}
      ×
        R
        S
OR: π_{B,D} [ σ_{R.A="c" ∧ S.E=2 ∧ R.C=S.C} (R × S) ]

Another idea: Plan II (operator tree, root first):
  π_{B,D}
    ⋈ (natural join)
      σ_{R.A="c"} (R)
      σ_{S.E=2} (S)

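Plan II on the example data: σ_{R.A="c"}(R) keeps only the R tuple with A = c, and σ_{S.E=2}(S) keeps the three S tuples with E = 2 (those with D = x, y, z). The natural join then matches the surviving tuples on C.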

Yet another idea: Plan III, use the R.A and S.C indexes
(1) Use the R.A index to select R tuples with R.A = "c"
(2) For each R.C value found, use the S.C index to find matching tuples
(3) Eliminate S tuples with S.E ≠ 2
(4) Join matching R, S tuples, project the B, D attributes, and place in result

Plan III on the example data:
- Index I1 on R.A: lookup A = "c" returns the R tuple <c,2,10>
- Index I2 on S.C: lookup C = 10 returns the S tuple <10,x,2>
- Check E = 2? Yes, so output <2,x>
- Next R tuple: <c,7,15>
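To compare Plan I and Plan III side by side, here is a rough sketch in Python. Only the tuples <c,2,10>, <c,7,15> and <10,x,2> come from the slides; every other row is made up for illustration, and plain dictionaries stand in for the R.A and S.C indexes.

```python
from collections import defaultdict

# R(A, B, C) and S(C, D, E). Rows other than (c,2,10), (c,7,15), (10,x,2)
# are hypothetical filler.
R = [("a", 1, 10), ("b", 1, 20), ("c", 2, 10), ("c", 7, 15), ("d", 2, 35)]
S = [(10, "x", 2), (20, "y", 2), (15, "z", 1), (40, "x", 1)]

def build_index(rows, col):
    """Map each value of column `col` to the rows holding it (a stand-in for an index)."""
    index = defaultdict(list)
    for row in rows:
        index[row[col]].append(row)
    return index

def plan_one(r_rows, s_rows):
    """Plan I: Cartesian product, then selection, then projection on B, D."""
    return [(b, d)
            for (a, b, c) in r_rows
            for (c2, d, e) in s_rows
            if a == "c" and e == 2 and c == c2]

def plan_three(r_rows, s_rows):
    """Plan III: R.A index, then S.C index, then filter on E, then project B, D."""
    r_by_a = build_index(r_rows, 0)
    s_by_c = build_index(s_rows, 0)
    result = []
    for (_, b, c) in r_by_a.get("c", []):        # step (1)
        for (_, d, e) in s_by_c.get(c, []):      # step (2)
            if e == 2:                           # step (3)
                result.append((b, d))            # step (4)
    return result

print(plan_one(R, S))
print(plan_three(R, S))
```

Both plans return the same answer; the difference is that Plan I inspects every pair of tuples while Plan III only touches the tuples the indexes return.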

Query processing pipeline:
SQL query → parse → parse tree → convert → logical query plan → apply laws → "improved" l.q.p. → estimate result sizes (using statistics) → l.q.p. + sizes → consider physical plans → {P1, P2, …} → estimate costs → {(P1,C1), (P2,C2), ...} → pick best → Pi → execute → answer

Example: SQL Query
  SELECT title
  FROM StarsIn
  WHERE starName IN (
    SELECT name
    FROM MovieStar
    WHERE birthdate LIKE '%1960'
  );
(Find the movies with stars born in 1960)

Parse tree for the example query:
<SFW>
  SELECT <SelList>: <Attribute> title
  FROM <FromList>: <RelName> StarsIn
  WHERE <Condition>: <Tuple> IN <Query>
    <Tuple>: <Attribute> starName
    <Query>: ( <Query> ), which expands to <SFW>
      SELECT <SelList>: <Attribute> name
      FROM <FromList>: <RelName> MovieStar
      WHERE <Condition>: <Attribute> LIKE <Pattern>
        <Attribute> birthDate
        <Pattern> '%1960'

Generating Relational Algebra
  π_{title}
    σ
      StarsIn
      <condition>: <tuple> IN (subquery)
        <tuple>: <attribute> starName
        subquery: π_{name} ( σ_{birthdate LIKE '%1960'} (MovieStar) )
An expression using a two-argument σ, midway between a parse tree and relational algebra

Logical Query Plan
  π_{title}
    σ_{starName=name}
      ×
        StarsIn
        π_{name}
          σ_{birthdate LIKE '%1960'}
            MovieStar
Applying the rule for IN conditions

Improved Logical Query Plan
  π_{title}
    ⋈_{starName=name}
      StarsIn
      π_{name}
        σ_{birthdate LIKE '%1960'}
          MovieStar
An improvement on the previous logical query plan: the selection and the product are merged into a join
Question: Push projection to StarsIn?

Estimate Result Size (# of tuples)
Need the expected size of each intermediate result in the plan: the σ over MovieStar, the π above it, and the join with StarsIn

One Physical Plan
  Hash join
    SEQ scan of StarsIn
    Index scan of MovieStar
Parameters: select condition, ...

Estimate Costs
L.Q.P. → physical plans P1, P2, …, Pn with costs C1, C2, …, Cn → pick best!

Query Optimization Overview

Relational Algebra Optimization Level
Transformation rules (preserve equivalence)
What are good transformations?
Rules: natural joins & cross products & union
  R ⋈ S = S ⋈ R
  (R ⋈ S) ⋈ T = R ⋈ (S ⋈ T)
- Carry attribute names in results, so order is not important
- Can also write as trees, e.g. the tree for (R ⋈ S) ⋈ T versus the tree for R ⋈ (S ⋈ T)

Rules: Join, Cartesian Product, and Union
  R ⋈ S = S ⋈ R          (R ⋈ S) ⋈ T = R ⋈ (S ⋈ T)
  R × S = S × R          (R × S) × T = R × (S × T)
  R ∪ S = S ∪ R          R ∪ (S ∪ T) = (R ∪ S) ∪ T

Rules: Select
  σ_{p1 ∧ p2}(R) = σ_{p1}[σ_{p2}(R)]
  σ_{p1 ∨ p2}(R) = [σ_{p1}(R)] ∪ [σ_{p2}(R)]
Bags vs. Sets
  R = {a,a,b,b,b,c}
  S = {b,b,c,c,d}
  R ∪ S = ?
- Option 1 (SUM): R ∪ S = {a,a,b,b,b,b,b,c,c,c,d}
- Option 2 (MAX): R ∪ S = {a,a,b,b,b,c,c,d}

Option 2 (MAX) makes this rule work: σ_{p1 ∨ p2}(R) = σ_{p1}(R) ∪ σ_{p2}(R)
Example: R = {a,a,b,b,b,c}; P1 is satisfied by a, b; P2 is satisfied by b, c
  σ_{p1 ∨ p2}(R) = {a,a,b,b,b,c}
  σ_{p1}(R) = {a,a,b,b,b}
  σ_{p2}(R) = {b,b,b,c}
  σ_{p1}(R) ∪ σ_{p2}(R) = {a,a,b,b,b,c}
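The SUM and MAX options can be checked on this slide's example with a short sketch. It assumes `collections.Counter` as the bag representation, where `+` adds counts (SUM) and `|` takes the larger count (MAX); the `select` helper is a hypothetical name.

```python
from collections import Counter

def select(bag, pred):
    """Bag selection: keep every copy of each element that satisfies pred."""
    return Counter({value: count for value, count in bag.items() if pred(value)})

R = Counter("aabbbc")            # R = {a,a,b,b,b,c}

def p1(v):
    return v in "ab"             # P1 is satisfied by a, b

def p2(v):
    return v in "bc"             # P2 is satisfied by b, c

lhs = select(R, lambda v: p1(v) or p2(v))     # selection on p1 OR p2
max_union = select(R, p1) | select(R, p2)     # Option 2: MAX of the counts
sum_union = select(R, p1) + select(R, p2)     # Option 1: SUM of the counts

print(sorted(lhs.elements()))
print(sorted(max_union.elements()))
print(sorted(sum_union.elements()))
```

With SUM, every b is counted once per predicate it satisfies, which is why the rule breaks for bags under that definition.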

"SUM" option makes more sense: use the "SUM" option for bag unions, and accept that some rules cannot be used for bags
Example: Senators(……), Reps(……)
  T1 = π_{yr,state}(Senators); T2 = π_{yr,state}(Reps)

  T1 | Yr  State        T2 | Yr  State
       97  CA                …   CA
       99  CA                …   CA
       98  AZ                …   CA
  Union?

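Rule: Projection
Let X = set of attributes, Y = set of attributes, XY = X ∪ Y
  π_{xy}(R) = π_x[π_y(R)] does not hold in general: the right-hand side is defined only when X ⊆ Y, and it then equals π_x(R). The valid cascade is π_x(R) = π_x[π_{xy}(R)].
Rule: Selection + Join (σ + ⋈)
Let p = predicate with only R attributes, q = predicate with only S attributes, m = predicate with both R and S attributes
  σ_p(R ⋈ S) = [σ_p(R)] ⋈ S
  σ_q(R ⋈ S) = R ⋈ [σ_q(S)]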

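Derived rules (proof of one, rest is for homework):
  σ_{p ∧ q}(R ⋈ S) = [σ_p(R)] ⋈ [σ_q(S)]
  σ_{p ∧ q ∧ m}(R ⋈ S) = σ_m[(σ_p R) ⋈ (σ_q S)]
  σ_{p ∨ q}(R ⋈ S) = [(σ_p R) ⋈ S] ∪ [R ⋈ (σ_q S)]
Proof of the first rule: σ_{p ∧ q}(R ⋈ S) = σ_p[σ_q(R ⋈ S)] = σ_p[R ⋈ σ_q(S)] = [σ_p(R)] ⋈ [σ_q(S)]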

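Rule: π, σ combined
Let x = subset of R attributes, z = attributes in predicate P (subset of R attributes)
  π_x[σ_p(R)] = π_x{σ_p[π_{xz}(R)]}
The inner projection keeps z as well as x so that P can still be evaluated; the outer π_x then removes the extra attributes.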

Rule: π, ⋈ combined
Let x = subset of R attributes, y = subset of S attributes, z = intersection of R and S attributes
  π_{xy}(R ⋈ S) = π_{xy}{[π_{xz}(R)] ⋈ [π_{yz}(S)]}
  π_{xy}{σ_p(R ⋈ S)} = π_{xy}{σ_p[π_{xz'}(R) ⋈ π_{yz'}(S)]}
where z' = z ∪ {attributes used in P}

Rule: σ, ∪ combined
  σ_p(R ∪ S) = σ_p(R) ∪ σ_p(S)
  σ_p(R − S) = σ_p(R) − σ_p(S)

Good Transformations!
  σ_{p1 ∧ p2}(R) → σ_{p1}[σ_{p2}(R)]
  σ_p(R ⋈ S) → [σ_p(R)] ⋈ S
  R ⋈ S → S ⋈ R
  π_x[σ_p(R)] → π_x{σ_p[π_{xz}(R)]}

Conventional Wisdom: do projection early
Example: R(A,B,C,D,E), x = {E}, P: (A=3) ∧ (B="cat")
  π_x{σ_p(R)}   vs.   π_E{σ_p{π_{ABE}(R)}}

But, what if we have A, B indexes?
- Use the B index for B = "cat" and the A index for A = 3
- Intersect pointers to get pointers to matching tuples
Bottom line:
- Early selection is usually good
- No transformation is always good
- Transformations vs. good transformations

Query Plan Optimization Level: Estimate Cost
Estimate cost of a query plan:
- Estimate the size of results
- Estimate number of IOs
Estimate the size of results: keep statistics for relation R
- T(R): # of tuples in R
- S(R): # of bytes in each R tuple
- B(R): # of blocks to hold all R tuples
- V(R, A): # of distinct values in R for attribute A

Example: relation R with attributes
- A: 20 byte string
- B: 4 byte integer
- C: 8 byte date
- D: 5 byte string

  R | A    B  C   D
      cat  1  10  a
      cat  1  20  b
      dog  1  30  …
      dog  1  40  c
      bat  1  50  d

  T(R) = 5          S(R) = 37
  V(R,A) = 3        V(R,C) = 5
  V(R,B) = 1        V(R,D) = 4

Size estimate for W = R1 × R2:
  T(W) = T(R1) × T(R2)
  S(W) = S(R1) + S(R2)
Size estimate for W = σ_{A=a}(R):
  S(W) = S(R)
  T(W) = ?

Example, Assumption 1: values in the select expression Z = val are uniformly distributed over the V(R,Z) possible values.
For the example relation R, with V(R,A)=3, V(R,B)=1, V(R,C)=5, V(R,D)=4:
  W = σ_{Z=val}(R), where Z = A or B or C or D
  T(W) = T(R) / V(R,Z)

Example, alternative assumption: values in the select expression Z = val are uniformly distributed over a domain with DOM(R,Z) values.
For the example relation R:
  V(R,A)=3, DOM(R,A)=10
  V(R,B)=1, DOM(R,B)=10
  V(R,C)=5, DOM(R,C)=10
  V(R,D)=4, DOM(R,D)=10
  W = σ_{Z=val}(R)
  T(W) = T(R) / DOM(R,Z)

Selection Cardinality: SC(R,A) = average # of records that satisfy an equality condition on R.A
  SC(R,A) = T(R) / V(R,A)       (assumption 1)
  SC(R,A) = T(R) / DOM(R,A)     (alternative assumption)

Estimate values in range (W = σ_{Z ≥ val}(R)):
Example: R has attribute Z with Min = …, Max = 20, V(R,Z) = 10, and W = σ_{Z ≥ 15}(R)
  f = fraction of the range [Min, Max] that satisfies the condition
  T(W) = f × T(R)

Equivalently:
  f × V(R,Z) = number of distinct values in the range
  T(W) = [f × V(R,Z)] × T(R) / V(R,Z) = f × T(R)
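The selection estimates above can be written as a couple of small functions. This is only a sketch: the function names are made up, T(R) = 5 is read off the five-row example relation, and the range example assumes an integer attribute with Min = 1 and a "Z ≥ val" condition, neither of which is stated on the slide.

```python
def equality_selection_size(t_r, distinct_or_domain):
    """T(W) for W = sigma_{Z=val}(R): T(R)/V(R,Z), or T(R)/DOM(R,Z) under the alternative assumption."""
    return t_r / distinct_or_domain

def range_fraction(min_val, max_val, val):
    """Fraction f of the integer range [min_val, max_val] that satisfies Z >= val."""
    val = max(val, min_val)
    if val > max_val:
        return 0.0
    return (max_val - val + 1) / (max_val - min_val + 1)

T_R = 5                                   # assumed from the example relation
V = {"A": 3, "B": 1, "C": 5, "D": 4}

for attr, distinct in V.items():
    print(attr, equality_selection_size(T_R, distinct))   # assumption 1

print(equality_selection_size(T_R, 10))                   # alternative: DOM(R,Z) = 10

f = range_fraction(1, 20, 15)             # Min = 1 is an assumption
print(f, f * T_R)                         # f and T(W) = f x T(R)
```

The estimates are averages, so fractional tuple counts such as 0.5 are expected.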

Size estimate for W = R1 ⋈ R2:
Let X = attributes of R1, Y = attributes of R2
Case 1: X ∩ Y = ∅: same as R1 × R2
Case 2: W = R1 ⋈ R2 with X ∩ Y = A, e.g. R1(A, B, C) and R2(A, D)

Assumption: containment of value sets
  V(R1,A) ≤ V(R2,A) ⇒ every A value in R1 is in R2
  V(R2,A) ≤ V(R1,A) ⇒ every A value in R2 is in R1
Computing T(W) when V(R1,A) ≤ V(R2,A), for R1(A, B, C) and R2(A, D):
- Take 1 tuple of R1 and match it against R2
- 1 tuple matches with T(R2) / V(R2,A) tuples of R2
- So T(W) = [T(R2) / V(R2,A)] × T(R1)

For A as the common attribute [W = R1 ⋈ R2]:
  V(R1,A) ≤ V(R2,A): T(W) = T(R2) × T(R1) / V(R2,A)
  V(R2,A) ≤ V(R1,A): T(W) = T(R2) × T(R1) / V(R1,A)
In general: T(W) = T(R2) × T(R1) / max{ V(R1,A), V(R2,A) }

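Case 2 with the alternative assumption: values are uniform over the domain, for R1(A, B, C) and R2(A, D)
- One R1 tuple matches T(R2) / DOM(R2,A) tuples of R2
- So T(W) = T(R2) × T(R1) / DOM(R2,A) = T(R2) × T(R1) / DOM(R1,A), assuming DOM(R1,A) and DOM(R2,A) are the same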

In all cases: S(W) = S(R1) + S(R2) − S(A), where S(A) is the size of attribute A
Using similar ideas, we can estimate the size of:
- π_{AB}(R) ….. Sec [16.4.2]
- σ_{A=a ∧ B=b}(R) …. Sec [16.4.3]
- R ⋈ S with common attributes A, B, C ….. Sec [16.4.5]
- Union, intersection, diff, … Sec [16.4.7]

For complex expressions, we need intermediate T, S, V results. For example:
  W = [σ_{A=a}(R1)] ⋈ R2
Treat σ_{A=a}(R1) as a relation U:
  T(U) = T(R1) / V(R1,A)
  S(U) = S(R1)
Also need V(U, *) !!

To estimate the different Vs: assume U = σ_{A=a}(R1), and say R1 has attributes A, B, C, D
  V(U, A) = …
  V(U, B) = …
  V(U, C) = …
  V(U, D) = …

Example: U = σ_{A=a}(R1), with V(R1,A)=3, V(R1,B)=1, V(R1,C)=5, V(R1,D)=3

  R1 | A    B  C   D
       cat  1  10  …
       cat  1  20  …
       dog  1  30  …
       dog  1  40  …
       bat  1  50  …

  V(U,A) = 1
  V(U,B) = 1
  V(U,C) = T(R1) / V(R1,A)
  V(U,D) ... somewhere in between, i.e., between 1 & # of rows in U

For join U = R1(A,B) ⋈ R2(A,C):
  V(U,A) = min{ V(R1,A), V(R2,A) }
  V(U,B) = V(R1,B)
  V(U,C) = V(R2,C)
[called "preservation of value sets" in section 7.4.4]
Example: Z = R1(A,B) ⋈ R2(B,C) ⋈ R3(C,D)
  R1: T(R1) = 1000, V(R1,A) = 50, V(R1,B) = 100
  R2: T(R2) = 2000, V(R2,B) = 200, V(R2,C) = 300
  R3: T(R3) = …, V(R3,C) = 90, V(R3,D) = 500

Partial results: U = R1 ⋈ R2
  T(U) = T(R1) × T(R2) / max{ V(R1,B), V(R2,B) } = 1000 × 2000 / 200 = 10,000
  V(U,A) = V(R1,A) = 50
  V(U,B) = min{ V(R1,B), V(R2,B) } = V(R1,B) = 100
  V(U,C) = V(R2,C) = 300

Z = U ⋈ R3
  T(Z) = T(U) × T(R3) / max{ V(U,C), V(R3,C) } = (1000 × 2000 / 200) × T(R3) / 300
  V(Z,A) = V(U,A) = 50
  V(Z,B) = V(U,B) = 100
  V(Z,C) = min{ V(U,C), V(R3,C) } = V(R3,C) = 90
  V(Z,D) = V(R3,D) = 500
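The join-size formula and the value-set rules can be chained mechanically, as in the following sketch. The `Stats` class and `estimate_join` function are hypothetical names. T(R1) = 1000, T(R2) = 2000 and the V values follow the example; T(R3) = 3000 is an assumed figure, chosen only so that the code runs.

```python
from dataclasses import dataclass

@dataclass
class Stats:
    tuples: float      # T(R)
    distinct: dict     # attribute name -> V(R, attribute)

def estimate_join(left, right, attr):
    """Estimate statistics of the natural join of left and right on one common attribute."""
    v_left, v_right = left.distinct[attr], right.distinct[attr]
    tuples = left.tuples * right.tuples / max(v_left, v_right)
    distinct = {**left.distinct, **right.distinct}   # non-join attributes keep their V
    distinct[attr] = min(v_left, v_right)            # join attribute: the smaller V
    return Stats(tuples, distinct)

r1 = Stats(1000, {"A": 50, "B": 100})
r2 = Stats(2000, {"B": 200, "C": 300})
r3 = Stats(3000, {"C": 90, "D": 500})    # T(R3) = 3000 is an assumption

u = estimate_join(r1, r2, "B")
z = estimate_join(u, r3, "C")

print(u.tuples, u.distinct)
print(z.tuples, z.distinct)
```

Because each join returns a full `Stats` object, the output of one estimate can be fed directly into the next, which is what a multi-way join estimate needs.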

Histogram: number of tuples in R with A value in a given range (bar chart with counts on a 10 to 40 scale over the A ranges 10, 20, 30, 40)
  σ_{A=val}(R) = ?

Summary:
- Estimating size of results is an art
- Don't forget: statistics must be kept up-to-date (cost?)

Query Plan Optimization Level: Generate and Compare Plans
Generate plans → pruning (discard some plans) → estimate cost of each remaining plan → select: pick the minimum-cost plan

To generate plans consider:
- Transforming the relational algebra expression (e.g. order of joins)
- Use of existing indexes
- Building indexes or sorting on the fly
- Implementation details:
  - Join algorithm
  - Memory management
  - Parallel processing

Estimating IO: count the # of disk blocks that must be read (or written) to execute the query plan
To estimate cost, we need to handle extra parameters:
- B(R) = # of blocks containing tuples of R
- f(R) = max # of tuples of R per block
- M = # of memory blocks available
- HT(i) = # of levels in index i
- LB(i) = # of leaf blocks in index i

Clustering Index: an index that allows tuples to be read in an order that corresponds to their physical order; very useful for operations that involve many tuples
(Figure: an index on A pointing to tuples stored in the order 10, 15, 17, 19, 35, 37)

Example: R1 ⋈ R2 over common attribute C
- T(R1) = 10,000
- T(R2) = 5,000
- S(R1) = S(R2) = 1/10 block
- Memory available = 101 blocks
Metric: # of IOs (ignoring writing of result)
Caution: this metric ignores
- CPU costs
- timing
- double buffering requirements

Options:
- Transformations: R1 ⋈ R2, R2 ⋈ R1
- Join algorithms:
  - Iteration or nested join (nested loops)
  - Merge join
  - Join with index
  - Hash join

Nested Join (conceptually):
  for each r ∈ R1 do
    for each s ∈ R2 do
      if r.C = s.C then output r,s pair

Merge Join (conceptually):
  (1) if R1 and R2 not sorted, sort them
  (2) i ← 1; j ← 1;
      while (i ≤ T(R1)) ∧ (j ≤ T(R2)) do
        if R1[i].C = R2[j].C then outputTuples
        else if R1[i].C > R2[j].C then j ← j+1
        else if R1[i].C < R2[j].C then i ← i+1

Procedure outputTuples:
  while (i ≤ T(R1)) ∧ (R1[i].C = R2[j].C) do
    [ jj ← j;
      while (jj ≤ T(R2)) ∧ (R1[i].C = R2[jj].C) do
        [ output pair R1[i], R2[jj];
          jj ← jj+1 ]
      i ← i+1 ]
Example: a trace table with columns i, R1[i].C, R2[j].C, j, in which R2[6].C = 50 and R2[7].C = 52
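The nested join and merge join pseudocode can be turned into runnable Python roughly as follows. This is a sketch under assumptions: tuples are plain Python tuples whose first field is the join attribute C, indexing is 0-based instead of 1-based, and the sample relations are invented for the demonstration.

```python
def nested_loop_join(r1, r2, key=lambda t: t[0]):
    """Test every R1 tuple against every R2 tuple."""
    return [(r, s) for r in r1 for s in r2 if key(r) == key(s)]

def merge_join(r1, r2, key=lambda t: t[0]):
    """Sort both inputs on the join attribute, then merge them in one pass."""
    r1 = sorted(r1, key=key)
    r2 = sorted(r2, key=key)
    out = []
    i = j = 0
    while i < len(r1) and j < len(r2):
        k1, k2 = key(r1[i]), key(r2[j])
        if k1 > k2:
            j += 1
        elif k1 < k2:
            i += 1
        else:
            # Equal keys: find the run of R2 tuples with this key once ...
            j_end = j
            while j_end < len(r2) and key(r2[j_end]) == k1:
                j_end += 1
            # ... and pair it with every R1 tuple carrying the same key.
            while i < len(r1) and key(r1[i]) == k1:
                out.extend((r1[i], r2[jj]) for jj in range(j, j_end))
                i += 1
            j = j_end
    return out

# Hypothetical data: (C, payload).
R1 = [(10, "a"), (20, "b"), (20, "c"), (30, "d"), (40, "e")]
R2 = [(5, "p"), (20, "q"), (20, "r"), (30, "s"), (30, "t"), (50, "u"), (52, "v")]

print(merge_join(R1, R2))
```

Skipping `j` past the matched run after the inner loop is a small shortcut compared with the pseudocode, which lets the main loop advance `j` one step at a time; the output is the same.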

Join with index (conceptually):
  for each r ∈ R1 do
    [ X ← index(R2, C, r.C)
      for each s ∈ X do
        output r,s pair ]
Assume an index on R2.C
Note: X ← index(rel, attr, value) means X = set of rel tuples with attr = value

Hash Join (conceptually):
- Hash function h, range 0 → k
- Buckets for R1: G0, G1, ... Gk
- Buckets for R2: H0, H1, ... Hk
Algorithm:
(1) Hash R1 tuples into G buckets
(2) Hash R2 tuples into H buckets
(3) For i = 0 to k do: match tuples in Gi and Hi buckets

Example: hash even/odd
- Even bucket: R1 values 2, 4, 8, plus the even R2 values
- Odd bucket: R1 values 3, 5, 9, plus the odd R2 values
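A bucket-based hash join in the spirit of the even/odd example might look like the sketch below. It assumes integer join values and `hash(value) % k` as the hash function; the R1 values match the buckets shown above, while the R2 values are invented for the demonstration.

```python
def hash_join(r1, r2, k=2, key=lambda t: t):
    """Partition both inputs into k buckets, then match tuples bucket by bucket."""
    g = [[] for _ in range(k)]             # buckets G0..G(k-1) for R1
    h = [[] for _ in range(k)]             # buckets H0..H(k-1) for R2
    for t in r1:
        g[hash(key(t)) % k].append(t)
    for t in r2:
        h[hash(key(t)) % k].append(t)

    out = []
    for g_i, h_i in zip(g, h):
        # Build an in-memory table on the R1 bucket, probe with the R2 bucket.
        table = {}
        for t in g_i:
            table.setdefault(key(t), []).append(t)
        for s in h_i:
            for t in table.get(key(s), []):
                out.append((t, s))
    return out

R1 = [2, 4, 3, 5, 8, 9]
R2 = [5, 4, 12, 3, 13, 8, 11, 14]          # hypothetical values

print(sorted(hash_join(R1, R2)))
```

With k = 2 the hash function is just even/odd: tuples in different buckets can never match, so each bucket pair is joined independently.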

Factors that affect performance:
- Are the tuples of a relation stored physically together?
- Are the relations sorted by the join attribute?
- Do indexes exist?

Example 1(a): Iteration Join R1 ⋈ R2, relations not contiguous
Recall: T(R1) = 10,000, T(R2) = 5,000, S(R1) = S(R2) = 1/10 block, MEM = 101 blocks
Cost: for each R1 tuple: [Read tuple + Read R2]
Total IO = 10,000 × [1 + 5,000] = 50,010,000 IOs

Can we do better? Use our memory more efficiently:
- Read 100 blocks of R1
- Read all of R2 (using 1 block) + join
- Repeat until done
Cost: for each R1 chunk
- Read R1 chunk: 100 blocks × 10 tuples/block = 1,000 tuples (IOs)
- Read all of R2 = 5,000 tuples (IOs)
- Total IO per chunk = 1,000 + 5,000 = 6,000 tuples (IOs)
Total cost = (# of R1 chunks) × (# of IOs per chunk) = (10,000 / 1,000) × 6,000 = 60,000 IOs

Can we still do better? Reverse the join order (R2 ⋈ R1):
- Read 100 blocks of R2
- Read all of R1 (using 1 block) + join
- Repeat until done
Cost: for each R2 chunk
- Read R2 chunk: 100 blocks × 10 tuples/block = 1,000 tuples (IOs)
- Read all of R1 = 10,000 tuples (IOs)
- Total IO per chunk = 1,000 + 10,000 = 11,000 tuples (IOs)
Total cost = (# of R2 chunks) × (# of IOs per chunk) = (5,000 / 1,000) × 11,000 = 55,000 IOs

Example 1(b): Iteration Join R2 ⋈ R1, relations are contiguous
- Read 100 blocks of R2
- Read all of R1 (using 1 block) + join
- Repeat until done
Cost: for each R2 chunk
- Read R2 chunk: 100 blocks (IOs) = 1,000 tuples
- Read all of R1 = 1,000 blocks (IOs)
- Total IO per chunk = 100 + 1,000 = 1,100 blocks (IOs)
Total cost = (# of R2 chunks) × (# of IOs per chunk) = 5 chunks × 1,100 = 5,500 IOs
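The chunked iteration-join costs can be reproduced with a small calculator. The function name and parameters are hypothetical; the sketch assumes one memory block is reserved for the inner relation, and that a non-contiguous relation costs one IO per tuple while a contiguous one costs one IO per block.

```python
import math

def iteration_join_io(outer_tuples, inner_tuples, tuples_per_block,
                      memory_blocks, contiguous):
    """IOs for a chunked nested-loop join: read the outer once, the inner once per chunk."""
    chunk_tuples = (memory_blocks - 1) * tuples_per_block
    chunks = math.ceil(outer_tuples / chunk_tuples)
    if contiguous:
        outer_io = math.ceil(outer_tuples / tuples_per_block)
        inner_io = math.ceil(inner_tuples / tuples_per_block)
    else:
        outer_io = outer_tuples        # one IO per tuple
        inner_io = inner_tuples
    return outer_io + chunks * inner_io

T_R1, T_R2, PER_BLOCK, MEM = 10_000, 5_000, 10, 101

print(iteration_join_io(T_R1, T_R2, PER_BLOCK, MEM, contiguous=False))  # R1 outer
print(iteration_join_io(T_R2, T_R1, PER_BLOCK, MEM, contiguous=False))  # R2 outer
print(iteration_join_io(T_R2, T_R1, PER_BLOCK, MEM, contiguous=True))   # Example 1(b)
```

Putting the smaller relation on the outside reduces the number of passes over the inner relation, which is why reversing the join order helps.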

Example 1(c): Merge Join R1 ⋈ R2, both R1 and R2 ordered by C; relations are contiguous
Total cost = Read R1 cost + Read R2 cost = 1,000 + 500 = 1,500 IOs

Example 1(d): Merge Join R1 ⋈ R2, R1 and R2 not ordered by C; relations are contiguous
Need to sort R1, R2 first … How? One way is merge sort. For each 100-block chunk of R:
- Read chunk
- Sort in memory
- Write to disk as a sorted chunk

Then read all sorted chunks + merge + write out the sorted file
Cost of sort: each tuple is read, written, read, written, so
- Sort cost for R1 = 4 × 1,000 = 4,000 IOs
- Sort cost for R2 = 4 × 500 = 2,000 IOs
- Total sort cost = 4,000 + 2,000 = 6,000 IOs
Total sort + merge join cost = sort cost + join cost = 6,000 + 1,500 = 7,500 IOs
But iteration cost = 5,500 IOs, so sorting for a merge join does not pay off here!

But say R1 = 10,000 blocks, contiguous & not ordered, and R2 = 5,000 blocks, contiguous & not ordered
- Iterate: (5,000 / 100) × (100 + 10,000) = 50 × 10,100 = 505,000 IOs
- Merge join: 4 × (10,000 + 5,000) [sort] + (10,000 + 5,000) [merge/join] = 5 × 15,000 = 75,000 IOs
Sort + merge join (relations not ordered) → 75,000 IOs WINS!
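The trade-off between iteration and sort + merge join can be recomputed for any relation sizes with a sketch like this one. Both function names are hypothetical; sizes are in blocks, relations are assumed contiguous, and the sort is assumed to read and write every block twice.

```python
import math

def sort_merge_join_io(b1, b2, already_sorted=False):
    """Merge join cost in IOs; sorting costs 4 IOs per block when the inputs are unsorted."""
    sort_io = 0 if already_sorted else 4 * (b1 + b2)
    return sort_io + (b1 + b2)

def iteration_join_io_blocks(outer_blocks, inner_blocks, memory_blocks):
    """Chunked nested-loop join on contiguous relations, sizes given in blocks."""
    chunks = math.ceil(outer_blocks / (memory_blocks - 1))
    return outer_blocks + chunks * inner_blocks

# Small relations: R1 = 1,000 blocks, R2 = 500 blocks, 101 memory blocks.
print(sort_merge_join_io(1_000, 500), iteration_join_io_blocks(500, 1_000, 101))

# Large relations: R1 = 10,000 blocks, R2 = 5,000 blocks.
print(sort_merge_join_io(10_000, 5_000), iteration_join_io_blocks(5_000, 10_000, 101))
```

Sort + merge join grows linearly with the relation sizes, while iteration grows with their product divided by the memory size, so the winner flips as the relations get larger.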

How much memory do we need for merge sort?
E.g., say we have 10 memory blocks: R1 (1,000 blocks) is split into 100 chunks of 10 blocks, and to merge 100 chunks we need 100 blocks!
In general, say k blocks in memory and x blocks for the relation to sort:
- # chunks = (x/k)
- size of chunk = k
- But # chunks ≤ buffers available for merge, so (x/k) ≤ k, or k² ≥ x, or k ≥ √x

In our example:
- R1 is 1,000 blocks, k ≥ 31.62
- R2 is 500 blocks, k ≥ 22.36
- Need at least 32 buffers
Can we improve on merge join? Hint: do we really need the fully sorted files step? Join directly from the sorted runs of R1 and R2?
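The buffer requirement k ≥ √x can be computed exactly with integer arithmetic. The helper below is a hypothetical sketch that returns the smallest integer k with k² ≥ x.

```python
import math

def min_sort_buffers(x_blocks):
    """Smallest k with k * k >= x_blocks, i.e. enough buffers to merge all sorted chunks in one pass."""
    k = math.isqrt(x_blocks)
    if k * k < x_blocks:
        k += 1
    return k

print(min_sort_buffers(1_000))   # R1
print(min_sort_buffers(500))     # R2
print(max(min_sort_buffers(1_000), min_sort_buffers(500)))
```

Using `math.isqrt` avoids floating-point rounding when x is a perfect square or very large.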

Cost of improved merge join:
C = read R1 + write R1 into "sorted runs" + read R2 + write R2 into "sorted runs" + join
  = 2,000 + 1,000 + 1,500 = 4,500 IOs
Memory requirements?

Example 1(e): Index Join
- Assume R1.C index exists; 2 levels
- Assume R2 contiguous, unordered
- Assume R1.C index fits in memory
Cost:
- Read R2: 500 IOs
- For each R2 tuple:
  - probe index: free
  - if match, read R1 tuple: 1 IO

What is the expected # of matching tuples?
(a) Say R1.C is a primary key and R2.C is a foreign key; then expected matches = 1
(b) Say V(R1,C) = 5,000 and T(R1) = 10,000; with the uniform assumption, expect = 10,000 / 5,000 = 2
(c) Say DOM(R1,C) = 1,000,000 and T(R1) = 10,000; with the alternate assumption, expect = 10,000 / 1,000,000 = 1/100
Total cost with index join:
(a) Total cost = 500 + 5,000 × (1) × 1 = 5,500
(b) Total cost = 500 + 5,000 × (2) × 1 = 10,500
(c) Total cost = 500 + 5,000 × (1/100) × 1 = 550

What if the index does not fit in memory?
Example: say the R1.C index is 201 blocks
- Keep root + 99 leaf nodes in memory
- Expected cost of each probe is E = (0) × 99/200 + (1) × 101/200 ≈ 0.5 IO

Total cost including probes:
  = 500 + 5,000 × [probe + get records]
  = 500 + 5,000 × [0.5 + 2]     (uniform assumption)
  = 500 + 12,500 = 13,000 IOs   (case b)
For case (c):
  = 500 + 5,000 × [0.5 × 1 + (1/100) × 1]
  = 500 + 2,500 + 50 = 3,050 IOs
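All five index-join totals follow one formula, sketched below: read the outer relation once, then pay a probe cost plus one IO per expected match for each outer tuple. The function and parameter names are hypothetical, and the probe cost is rounded to 0.5 IO as on the slide.

```python
def index_join_io(outer_blocks, outer_tuples, expected_matches,
                  probe_io=0.0, fetch_io=1.0):
    """Read the outer relation, then probe the index and fetch matches for each outer tuple."""
    return outer_blocks + outer_tuples * (probe_io + expected_matches * fetch_io)

def expected_probe_io(leaf_blocks, leaves_in_memory):
    """Expected IOs per probe when only some of the leaf blocks are cached."""
    return (leaf_blocks - leaves_in_memory) / leaf_blocks

B_R2, T_R2 = 500, 5_000

# Index fits in memory: probes are free.
for matches in (1, 2, 1 / 100):
    print(index_join_io(B_R2, T_R2, matches))

# Index of 201 blocks (root + 200 leaves), 99 leaves cached.
print(expected_probe_io(200, 99))
for matches in (2, 1 / 100):
    print(index_join_io(B_R2, T_R2, matches, probe_io=0.5))
```

The expected number of matches dominates the result, which is why the same plan ranges from 550 to 13,000 IOs depending on the assumption about R1.C.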

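So far:
  Not contiguous:
    Iterate R2 ⋈ R1 ........ 55,000 (best)
    Merge join ............. _________
    Sort + merge join ...... _________
    R1.C index ............. _________
    R2.C index ............. _________
  Contiguous:
    Iterate R2 ⋈ R1 ........ 5,500
    Merge join ............. 1,500
    Sort + merge join ...... 7,500 → 4,500
    R1.C index ............. 5,500 → 3,050 → 550
    R2.C index ............. _________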

Example 1(f): Hash Join
- R1, R2 contiguous (un-ordered)
- Use 100 buckets
- Read R1, hash, + write buckets (100 buckets of about 10 blocks each for R1)

Example 1(f): Hash Join (Contd.)
- Same for R2
- Read one R1 bucket; build memory hash table
- Read corresponding R2 bucket + hash probe (one block of R2 at a time)
- Then repeat for all buckets

Example 1(f): Hash Join (Contd.)
Cost:
- "Bucketize": read R1 + write; read R2 + write
- Join: read R1, R2
Total cost = 3 × [1,000 + 500] = 4,500 IOs
Note: this is an approximation since buckets will vary in size and we have to round up to blocks

Example 1(f): Hash Join (Contd.)
Minimum memory requirements:
- Size of R1 bucket = (x/k), where k = number of buckets (or memory buffers) and x = number of R1 blocks
- Since each bucket must fit in memory, (x/k) < k, i.e. k > √x
- Needs: k+1 total memory buffers

Example 1(f): Hash Join (Contd.)
Keep some buckets in memory (called hybrid hash-join):
- R1 = 1,000 blocks; # R1 buckets = k = 33, so each R1 bucket = 31 blocks
- Keep 2 buckets (G0, G1) in memory + 1 block for each of the remaining buckets (33 − 2 = 31 buffers)
Memory use for R1:
  G0 .................. 31 buffers
  G1 .................. 31 buffers
  Output buffers ...... 33 − 2 = 31 buffers
  R1 input ............ 1 buffer
  Total ............... 94 buffers
6 buffers to spare!!

Example 1(f): Hash Join (Contd.)
Next bucketize R2:
- R2 = 500 blocks; with 33 buckets, each R2 bucket = 500/33 ≈ 16 blocks
- Two of the R2 buckets are joined immediately with G0, G1 (i.e., one pass)
- The remaining 33 − 2 = 31 R2 buckets (16 blocks each) are written to disk, next to the 31 R1 buckets (31 blocks each)

Example 1(f): Hash Join (Contd.)
Finally join the remaining buckets. For each bucket pair:
- Read one of the buckets into memory (one full R2 bucket, 16 blocks)
- Join with the second bucket (read through one R1 buffer), writing answers to the output

Example 1(f): Hash Join (Contd.)
Cost:
- Bucketize R1 = 1,000 + 31 × 31 = 1,961 IOs for R + W
- To bucketize R2, only write 31 buckets: so, cost = 500 + 31 × 16 = 996 IOs for R + W
- To compute the join (2 buckets already done): read 31 × 31 + 31 × 16 = 1,457 IOs
- Ignore writing the output of the join
Total cost = 1,961 + 996 + 1,457 = 4,414 IOs
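The hybrid hash-join arithmetic can be parameterized as in the sketch below. The function names are hypothetical; bucket sizes are rounded up to whole blocks as in the example, and writing the join result is ignored.

```python
import math

def plain_hash_join_io(b1, b2):
    """Bucketize (read + write) both relations, then read both again to join."""
    return 3 * (b1 + b2)

def hybrid_hash_join_io(b1, b2, buckets, in_memory):
    """Hash join where `in_memory` of the R1 buckets never leave memory."""
    r1_bucket = math.ceil(b1 / buckets)          # blocks per R1 bucket
    r2_bucket = math.ceil(b2 / buckets)          # blocks per R2 bucket
    spilled = buckets - in_memory                # bucket pairs that go to disk
    bucketize_r1 = b1 + spilled * r1_bucket      # read R1, write spilled buckets
    bucketize_r2 = b2 + spilled * r2_bucket      # read R2, write spilled buckets
    join = spilled * (r1_bucket + r2_bucket)     # read spilled pairs back
    return bucketize_r1 + bucketize_r2 + join

print(plain_hash_join_io(1_000, 500))
print(hybrid_hash_join_io(1_000, 500, buckets=33, in_memory=2))
```

Each bucket kept in memory saves one write and one read of that bucket for both relations, so the saving grows with the number of buffers to spare.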

How many buckets in memory?
- # of buckets k > √x, where x = # of blocks in the larger table; k = # of buckets
- Needs: k+1 total memory buffers
- (Figure: memory holding buckets G0 and G1 while R1 is read in, OR ... ?)

Another hash join trick:
- Only write <val,ptr> pairs into buckets
- When we get a match in the join phase, we must fetch the tuples
Cost:
- Increases the number of entries per block, hence reduces the needed memory buffers
- Reduces the number of buckets
- Only when we get a match in the join phase do we fetch tuples (additional IO)

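So far (contiguous):
  Iterate ..................... 5,500
  Merge join .................. _____
  Sort + merge join ........... 7,500
  R1.C index .................. 5,500 → 550
  R2.C index .................. _____
  Build R.C index ............. _____
  Build S.C index ............. _____
  Hash join ................... 4,500+
    with trick, R1 first ...... 4,414
    with trick, R2 first ...... _____
  Hash join, pointers ......... 1,600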

Summary:
- Iteration is OK for "small" relations (relative to memory size)
- For equi-joins where relations are not sorted and no indexes exist, hash join is usually the best
- Sort + merge join is good for non-equi-joins (e.g., R1.C > R2.C)
- If relations are already sorted, use merge join
- If an index exists, it could be useful (depends on expected result size)
- After generating plans that use different join algorithms, compare the costs of the possible plans
