Advanced Database Systems: DBS CB, 2nd Edition


1 Advanced Database Systems: DBS CB, 2nd Edition
Query Processing & Optimizations Ch Ch. 16

2 Outline
- Database Logical and Physical Operators Overview
- Query Processing Overview: Parsing; Logical Query Plan; Estimate Result Size; Alternative Physical Plans; Estimate Costs (with and without indexes)
- Query Optimization Overview
  - Relational Algebra Optimization level
  - Query Plan Optimization level: Estimate Cost ((i) estimate size of result, (ii) estimate # of IOs); Generate and Compare Plans

3 Database Logical and Physical Operators Overview

4 Database Logical and Physical Operators Overview
Logical operators correspond to what the SQL programmer writes. For example, BarInfo := Sells ⋈ Bars corresponds to:
SELECT * FROM Sells, Bars WHERE Sells.bar = Bars.bar;
Optimizer: the query optimizer is free to choose the join order, because the join operation is both commutative and associative. It is also free to choose the physical operator that implements a given logical join.
Physical join operators: there are mainly three physical operators the optimizer can select to implement a logical join. They are not visible to the SQL programmer.

5 Database Logical and Physical Operators Overview
Logical Operators Overview
- Union [R1 := R2 ∪ R3], intersection [R1 := R2 ∩ R3], and difference [R1 := R2 − R3]: the usual set operations, but both operands must have the same relation schema
- Selection [R1 := σ_C(R2)]: picking certain rows
- Projection [R1 := π_L(R2)]: picking certain columns
- Extended projection [R1 := π_{A+B→C, A, A}(R2)]
- Products [R3 := R1 × R2] and joins [R1 := R2 ⋈ R3]: compositions of relations
- Renaming [R1 := ρ_{R1(A1,…,An)}(R2)] of relations and attributes
- Grouping [R1 := γ_L(R2)]: SELECT Customer, SUM(OrderPrice), MIN(OrderDate) FROM Orders GROUP BY Customer;
- Sort [τ_OrderDate(Orders)]: SELECT * FROM Orders ORDER BY OrderDate ASC|DESC;
- Duplicate elimination [R1 := δ(R2)]

6 Database Logical and Physical Operators Overview
Join (Logical Join) Operators
Inner Join:
- Cross Join (×): Cartesian product
- Equi-Join (where R1.col1 = R2.col2): cross join restricted by equality predicates only
- Natural Join (⋈): equi-join on all attributes the two relations have in common; the result schema is the union of the attributes of the two relations
- Theta Join (⋈_C): cross join followed by a selection on an arbitrary boolean-valued condition C
Outer Join:
- Left Outer Join (left join): every tuple of the left relation is joined with each matching tuple of the right relation; if none matches the condition, return a tuple with the left side and NULLs for the right-side attributes
- Right Outer Join (right join): the mirror image of the left join
- Full Outer Join (full join): union of left join and right join
Self Join: joining a table to itself

7 Database Logical and Physical Operators Overview
Physical Operators (Ch 15.1)
Physical join operators:
- Nested Join: every outer element is tested against the inner table
- Merge Join: efficient if both tables are already sorted on the join attribute
- Hash Join: only used for equi-joins
Sort Scan

8 Database Logical and Physical Operators Overview
Relational Algebra on Bags vs. Sets
- A bag (or multiset) is like a set, but an element may appear more than once
- SQL is a bag language; some operations (like projection) are more efficient on bags than on sets
- If a tuple appears n times in R and m times in S, it appears min(n, m) times in the bag intersection R ∩ S and max(0, n − m) times in the bag difference R − S
- Some, but not all, algebraic laws that hold for sets also hold for bags:
  R1 ∪ R2 = R2 ∪ R1: the commutative law holds for bags
  {1} ∪ {1} = {1,1} ≠ {1}: the idempotent law S ∪ S = S does not hold for bags
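The bag rules on this slide can be tried out with a small sketch. It assumes bags are modeled with Python's `collections.Counter` (element mapped to multiplicity); the two sample bags are illustrative, not taken from this slide.

```python
from collections import Counter

# Bags modeled as Counters: element -> multiplicity (illustrative data).
r = Counter({"a": 2, "b": 3, "c": 1})   # {a,a,b,b,b,c}
s = Counter({"b": 2, "c": 2, "d": 1})   # {b,b,c,c,d}

bag_union = r + s          # multiplicities add: n + m
bag_intersection = r & s   # min(n, m)
bag_difference = r - s     # max(0, n - m); zero counts are dropped

# Commutativity of union holds for bags ...
commutative = (r + s) == (s + r)
# ... but idempotence (S ∪ S = S) does not.
one = Counter([1])
idempotent = (one + one) == one
```

Here `commutative` is true and `idempotent` is false, matching the two laws listed above.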

9 Query Processing Overview

10 Query Processing Overview
SELECT B, D FROM R, S WHERE R.A = “c” ∧ S.E = 2 ∧ R.C = S.C;
[Table: sample instances of R(A, B, C) and S(C, D, E), five tuples each, and the answer relation over (B, D)]

11 Query Processing Overview
How do we execute a query? One idea:
- Do Cartesian product
- Select tuples
- Do projection
[Table: R × S with columns R.A, R.B, R.C, S.C, S.D, S.E; scanning its rows until a tuple satisfies the predicate ("Bingo! Got one...")]

12 Query Processing Overview
Relational algebra can be used to describe plans.
Plan I, as a tree (top to bottom): π_{B,D}; σ_{R.A=“c” ∧ S.E=2 ∧ R.C=S.C}; ×; leaves R and S
Or, as an expression: π_{B,D} [ σ_{R.A=“c” ∧ S.E=2 ∧ R.C=S.C} (R × S) ]

13 Query Processing Overview
Another idea B,D sR.A = “c” sS.E = 2 R S Plan II natural join

14 Query Processing Overview
[Table: R(A, B, C) and S(C, D, E) alongside the intermediate results σ(R) and σ(S) that Plan II joins]

15 Query Processing Overview
Yet another idea (Plan III): use R.A and S.C indexes
(1) Use R.A index to select R tuples with R.A = “c”
(2) For each R.C value found, use S.C index to find matching tuples
(3) Eliminate S tuples with S.E ≠ 2
(4) Join matching R, S tuples, project B, D attributes and place in result

16 Query Processing Overview
[Figure: Plan III on the sample data. Index I1 on R.A finds <c,2,10> for A = “c”; index I2 on S.C finds <10,x,2>; check E = 2; output <2,x>; next tuple: <c,7,15>]

17 Query Processing Overview
SQL query → parse → parse tree → convert → logical query plan → apply laws → “improved” l.q.p. → estimate result sizes (using statistics) → l.q.p. + sizes → consider physical plans → {P1, P2, …} → estimate costs → {(P1,C1), (P2,C2), ...} → pick best → Pi → execute → answer

18 Query Processing Overview
Example: SQL Query SELECT title FROM StarsIn WHERE starName IN ( SELECT name FROM MovieStar WHERE birthdate LIKE ‘%1960’ ); (Find the movies with stars born in 1960)

19 Query Processing Overview
Parse tree:
<SFW>
  SELECT <SelList>: <Attribute> title
  FROM <FromList>: <RelName> StarsIn
  WHERE <Condition>: <Tuple> IN <Query>
    <Tuple>: <Attribute> starName
    <Query>: ( <Query> )
      <SFW>
        SELECT <SelList>: <Attribute> name
        FROM <FromList>: <RelName> MovieStar
        WHERE <Condition>: <Attribute> LIKE <Pattern>
          <Attribute>: birthDate
          <Pattern>: ‘%1960’

20 Query Processing Overview
Generating Relational Algebra
π_title( σ_<condition>( StarsIn ) ), where <condition> is: <tuple> IN π_name( σ_{birthdate LIKE ‘%1960’}(MovieStar) ) and <tuple> is the <attribute> starName
An expression using a two-argument σ, midway between a parse tree and relational algebra

21 Query Processing Overview
Logical Query Plan title starName=name StarsIn name birthdate LIKE ‘%1960’ MovieStar Applying the rule for IN conditions

22 Query Processing Overview
Improved Logical Query Plan
π_title( StarsIn ⋈_{starName=name} π_name( σ_{birthdate LIKE ‘%1960’}(MovieStar) ) )
An improvement on the logical query plan of the previous slide
Question: push projection to StarsIn?

23 Query Processing Overview
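Estimate Result Size (# of tuples)
[Figure: the logical query plan over StarsIn and MovieStar; the expected size is needed for the outputs of its σ and π operators and of the join]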

24 Query Processing Overview
One Physical Plan
[Figure: a hash join whose inputs are a SEQ scan of StarsIn and an index scan of MovieStar; operators carry parameters (select condition, ...)]

25 Query Processing Overview
Estimate Costs: from the logical query plan (L.Q.P), generate physical plans P1, P2, …, Pn with estimated costs C1, C2, …, Cn. Pick best!

26 Query Optimization Overview

27 Relational Algebra Optimization Level
Transformation rules (preserve equivalence). What are good transformations?
Rules: natural joins & cross products & union
R ⋈ S = S ⋈ R
(R ⋈ S) ⋈ T = R ⋈ (S ⋈ T)
Attribute names are carried in results, so order is not important
Can also write as trees, e.g. the join tree of (R ⋈ S) ⋈ T versus that of R ⋈ (S ⋈ T)

28 Relational Algebra Optimization Level
Rules: Join, Cartesian Product, and Union
R ⋈ S = S ⋈ R;  (R ⋈ S) ⋈ T = R ⋈ (S ⋈ T)
R × S = S × R;  (R × S) × T = R × (S × T)
R ∪ S = S ∪ R;  R ∪ (S ∪ T) = (R ∪ S) ∪ T

29 Relational Algebra Optimization Level
Rules: Select
σ_{p1 ∧ p2}(R) = σ_p1[ σ_p2(R) ]
σ_{p1 ∨ p2}(R) = [ σ_p1(R) ] ∪ [ σ_p2(R) ]
Bags vs. Sets: R = {a,a,b,b,b,c}, S = {b,b,c,c,d}, R ∪ S = ?
Option 1 (SUM): R ∪ S = {a,a,b,b,b,b,b,c,c,c,d}
Option 2 (MAX): R ∪ S = {a,a,b,b,b,c,c,d}

30 Relational Algebra Optimization Level
Option 2 (MAX) makes this rule work: σ_{p1 ∨ p2}(R) = σ_p1(R) ∪ σ_p2(R)
Example: R = {a,a,b,b,b,c}; P1 satisfied by a, b; P2 satisfied by b, c
σ_{p1 ∨ p2}(R) = {a,a,b,b,b,c}
σ_p1(R) = {a,a,b,b,b}
σ_p2(R) = {b,b,b,c}
σ_p1(R) ∪ σ_p2(R) = {a,a,b,b,b,c}
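The example can be checked mechanically. This sketch again assumes bags are represented as `collections.Counter` objects, where `|` takes the maximum multiplicity (the MAX option) and `+` adds multiplicities (the SUM option); the helper `select` is a name introduced here for illustration.

```python
from collections import Counter

R = Counter("aabbbc")                       # bag {a,a,b,b,b,c}
p1 = lambda x: x in "ab"                    # P1 satisfied by a, b
p2 = lambda x: x in "bc"                    # P2 satisfied by b, c

def select(bag, pred):
    """Bag selection: keep each qualifying element with its multiplicity."""
    return Counter({x: n for x, n in bag.items() if pred(x)})

lhs = select(R, lambda x: p1(x) or p2(x))   # σ_{p1 ∨ p2}(R)
max_union = select(R, p1) | select(R, p2)   # MAX option
sum_union = select(R, p1) + select(R, p2)   # SUM option
```

Under MAX the two sides agree; under SUM the element b is counted six times, so the rule fails.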

31 Relational Algebra Optimization Level
Sum option makes more sense: use the “SUM” option for bag unions. As a consequence, some rules cannot be used for bags.
Example: Senators(……), Reps(……); T1 = π_{yr,state}(Senators); T2 = π_{yr,state}(Reps)
[Table: T1(Yr, State) = (97, CA), (99, CA), (98, AZ); T2(Yr, State) holds three CA tuples. Union?]

32 Relational Algebra Optimization Level
Rule: Projection
Let X = set of attributes, Y = set of attributes, XY = X ∪ Y
π_XY(R) = π_X[ π_Y(R) ] is not a valid rule: π_X[ π_Y(R) ] is only defined when X ⊆ Y, and then it equals π_X(R)
Rule: Selection + Join (σ + ⋈)
Let p = predicate with only R attributes, q = predicate with only S attributes, m = predicate with attributes from both R and S
σ_p(R ⋈ S) = [ σ_p(R) ] ⋈ S
σ_q(R ⋈ S) = R ⋈ [ σ_q(S) ]

33 Relational Algebra Optimization Level
Derived rules (proof of one, rest is for homework):
σ_{p ∧ q}(R ⋈ S) = [ σ_p(R) ] ⋈ [ σ_q(S) ]
σ_{p ∧ q ∧ m}(R ⋈ S) = σ_m[ (σ_p R) ⋈ (σ_q S) ]
σ_{p ∨ q}(R ⋈ S) = [ (σ_p R) ⋈ S ] ∪ [ R ⋈ (σ_q S) ]

34 Relational Algebra Optimization Level
Rule: π, σ combined
Let x = subset of R attributes, z = attributes in predicate P (subset of R attributes)
π_x[ σ_p(R) ] = π_x{ σ_p[ π_xz(R) ] }

35 Relational Algebra Optimization Level
Rule: π, ⋈ combined
Let x = subset of R attributes, y = subset of S attributes, z = intersection of R and S attributes
π_xy(R ⋈ S) = π_xy{ [ π_xz(R) ] ⋈ [ π_yz(S) ] }
π_xy{ σ_p(R ⋈ S) } = π_xy{ σ_p[ π_xz’(R) ⋈ π_yz’(S) ] }, where z’ = z ∪ {attributes used in P}

36 Relational Algebra Optimization Level
Rule: σ, ∪ combined
σ_p(R ∪ S) = σ_p(R) ∪ σ_p(S)
σ_p(R − S) = σ_p(R) − σ_p(S)

37 Relational Algebra Optimization Level
Good Transformations!
σ_{p1 ∧ p2}(R) → σ_p1[ σ_p2(R) ]
σ_p(R ⋈ S) → [ σ_p(R) ] ⋈ S
R ⋈ S → S ⋈ R
π_x[ σ_p(R) ] → π_x{ σ_p[ π_xz(R) ] }

38 Relational Algebra Optimization Level
Conventional wisdom: do projection early
Example: R(A,B,C,D,E), x = {E}, P: (A=3) ∧ (B=“cat”)
π_x{ σ_p(R) } vs. π_E{ σ_p{ π_ABE(R) } }

39 Relational Algebra Optimization Level
But what if we have A, B indexes? Use the B = “cat” index and the A = 3 index, then intersect pointers to get pointers to matching tuples.
Bottom line:
- Early selection is usually good
- No transformation is always good
- Transformations vs. good transformations

40 Query Plan Optimization Level: Estimate Cost
Estimate cost of a query plan:
- Estimate the size of results
- Estimate number of IOs
Estimate the size of results: keep statistics for relation R
- T(R): # tuples in R
- S(R): # of bytes in each R tuple
- B(R): # of blocks to hold all R tuples
- V(R, A): # of distinct values in R for attribute A

41 Query Plan Optimization Level: Estimate Cost
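Example: relation R with attributes A: 20 byte string, B: 4 byte integer, C: 8 byte date, D: 5 byte string
[Table: five R tuples; A ∈ {cat, dog, bat}, B = 1, C ∈ {10, 20, 30, 40, 50}, D ∈ {a, b, c, d}]
T(R) = 5, S(R) = 20 + 4 + 8 + 5 = 37, V(R,A) = 3, V(R,B) = 1, V(R,C) = 5, V(R,D) = 4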

42 Query Plan Optimization Level: Estimate Cost
Size estimate for W = R1 × R2: T(W) = T(R1) × T(R2); S(W) = S(R1) + S(R2)
Size estimate for W = σ_{A=a}(R): S(W) = S(R); T(W) = ?

43 Query Plan Optimization Level: Estimate Cost
Example (Assumption 1): values in the select expression Z = val are uniformly distributed over the V(R,Z) possible values.
R: V(R,A)=3, V(R,B)=1, V(R,C)=5, V(R,D)=4 [Table: the sample tuples of R]
W = σ_{Z=val}(R), where Z = A or B or C or D
T(W) = T(R) / V(R,Z)

44 Query Plan Optimization Level: Estimate Cost
Example (alternative assumption): values in the select expression Z = val are uniformly distributed over a domain with DOM(R,Z) values.
R: V(R,A)=3, DOM(R,A)=10; V(R,B)=1, DOM(R,B)=10; V(R,C)=5, DOM(R,C)=10; V(R,D)=4, DOM(R,D)=10 [Table: the sample tuples of R]
W = σ_{Z=val}(R)
T(W) = T(R) / DOM(R,Z)

45 Query Plan Optimization Level: Estimate Cost
Selection cardinality: SC(R,A) = average # records that satisfy an equality condition on R.A
SC(R,A) = T(R) / V(R,A) under the first assumption, or T(R) / DOM(R,A) under the alternative assumption
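A small sketch of the two selection-cardinality estimates. It assumes T(R) = 5 for the sample relation (the tuple count is an assumption of this sketch) and reuses the V and DOM statistics quoted on the preceding slides; `eq_selection_size` is a hypothetical helper name.

```python
def eq_selection_size(t_r, distinct_or_domain):
    """Estimated T(W) for W = σ_{Z=val}(R): T(R)/V(R,Z) or T(R)/DOM(R,Z)."""
    return t_r / distinct_or_domain

T_R = 5                                   # assumed tuple count of the sample relation
V = {"A": 3, "B": 1, "C": 5, "D": 4}      # distinct values per attribute
DOM = 10                                  # domain size, same for every attribute

# Estimate under "uniform over existing values" for each attribute.
by_distinct = {z: eq_selection_size(T_R, v) for z, v in V.items()}
# Estimate under "uniform over the domain" (identical for all attributes here).
by_domain = eq_selection_size(T_R, DOM)
```

The two assumptions can differ a lot: for attribute B the first estimate returns every tuple, while the domain-based estimate returns half a tuple on average.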

46 Query Plan Optimization Level: Estimate Cost
Estimate values in range (W = σ_{Z ≥ val}(R)):
Example: attribute Z of R has V(R,Z) = 10 and Max = 20; W = σ_{Z ≥ 15}(R)
f = fraction of the value range [Min, Max] that satisfies the predicate
T(W) = f × T(R)

47 Query Plan Optimization Level: Estimate Cost
Equivalently: f × V(R,Z) = number of distinct values in the range, so
T(W) = [f × V(R,Z)] × T(R) / V(R,Z) = f × T(R)

48 Query Plan Optimization Level: Estimate Cost
Size estimate for W = R1 ⋈ R2: let X = attributes of R1, Y = attributes of R2
Case 1: X ∩ Y = ∅: same as R1 × R2
Case 2: W = R1 ⋈ R2 with X ∩ Y = A, e.g. R1(A, B, C) and R2(A, D)

49 Query Plan Optimization Level: Estimate Cost
Assumption: containment of value sets
V(R1,A) ≤ V(R2,A) ⇒ every A value in R1 is in R2
V(R2,A) ≤ V(R1,A) ⇒ every A value in R2 is in R1
When V(R1,A) ≤ V(R2,A), with R1(A, B, C) and R2(A, D): take 1 tuple of R1; it matches T(R2) / V(R2,A) tuples of R2, so
T(W) = T(R2) × T(R1) / V(R2,A)

50 Query Plan Optimization Level: Estimate Cost
For A as common attribute [W = R1 ⋈ R2]:
V(R1,A) ≤ V(R2,A): T(W) = T(R2) × T(R1) / V(R2,A)
V(R2,A) ≤ V(R1,A): T(W) = T(R2) × T(R1) / V(R1,A)
In general: T(W) = T(R2) × T(R1) / max{ V(R1,A), V(R2,A) }

51 Query Plan Optimization Level: Estimate Cost
Case 2 (alternative assumption): values are uniform over the domain, with R1(A, B, C) and R2(A, D)
One R1 tuple matches T(R2) / DOM(R2,A) tuples of R2, so
T(W) = T(R2) × T(R1) / DOM(R2,A) = T(R2) × T(R1) / DOM(R1,A) (assuming both domains are the same)

52 Query Plan Optimization Level: Estimate Cost
In all cases: S(W) = S(R1) + S(R2) − S(A), where S(A) is the size of attribute A
Using similar ideas, we can estimate the size of:
- π_{A,B}(R): Sec [16.4.2]
- σ_{A=a ∧ B=b}(R): Sec [16.4.3]
- R ⋈ S with common attributes A, B, C: Sec [16.4.5]
- Union, intersection, difference, …: Sec [16.4.7]

53 Query Plan Optimization Level: Estimate Cost
For complex expressions, we need intermediate T, S, V results. For example:
W = [ σ_{A=a}(R1) ] ⋈ R2
Treat σ_{A=a}(R1) as a relation U: T(U) = T(R1) / V(R1,A); S(U) = S(R1)
Also need V(U, *) !!

54 Query Plan Optimization Level: Estimate Cost
To estimate the different Vs: assume U = σ_{A=a}(R1) and say R1 has attributes A, B, C, D
V(U, A) = … V(U, B) = … V(U, C) = … V(U, D) = …

55 Query Plan Optimization Level: Estimate Cost
Example: V(R1,A)=3, V(R1,B)=1, V(R1,C)=5, V(R1,D)=3; U = σ_{A=a}(R1) [Table: the sample tuples of R1]
V(U,A) = 1
V(U,B) = 1
V(U,C) = T(R1) / V(R1,A)
V(U,D) = ... somewhere in between, i.e., between 1 and # of rows in U

56 Query Plan Optimization Level: Estimate Cost
For join U = R1(A,B) ⋈ R2(A,C):
V(U,A) = min{ V(R1,A), V(R2,A) }
V(U,B) = V(R1,B)
V(U,C) = V(R2,C)
[called “preservation of value sets” in section 7.4.4]
Example: Z = R1(A,B) ⋈ R2(B,C) ⋈ R3(C,D)
R1: T(R1) = 1000, V(R1,A) = 50, V(R1,B) = 100
R2: T(R2) = 2000, V(R2,B) = 200, V(R2,C) = 300
R3: T(R3) = …, V(R3,C) = 90, V(R3,D) = 500

57 Query Plan Optimization Level: Estimate Cost
Partial results: U = R1 ⋈ R2
T(U) = T(R1) × T(R2) / max{ V(R1,B), V(R2,B) } = 1000 × 2000 / 200
V(U,A) = V(R1,A) = 50
V(U,B) = min(V(R1,B), V(R2,B)) = V(R1,B) = 100
V(U,C) = V(R2,C) = 300

58 Query Plan Optimization Level: Estimate Cost
Z = U ⋈ R3
T(Z) = T(U) × T(R3) / max{ V(U,C), V(R3,C) } = (1000 × 2000 / 200) × T(R3) / 300
V(Z,A) = V(U,A) = 50
V(Z,B) = V(U,B) = 100
V(Z,C) = min{ V(U,C), V(R3,C) } = V(R3,C) = 90
V(Z,D) = V(R3,D) = 500
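The chained estimate can be sketched as below, taking T(R1) = 1000 and T(R2) = 2000 with the V statistics quoted in the example. T(R3) is not given in the text, so it is left as a parameter; any value passed to `t_z` is a placeholder, and `join_size` is a helper name introduced for this sketch.

```python
def join_size(t1, t2, v1, v2):
    """T(R1 ⋈ R2) on one common attribute, assuming containment of value sets."""
    return t1 * t2 / max(v1, v2)

# U = R1 ⋈ R2 on B, with V(R1,B) = 100 and V(R2,B) = 200.
t_u = join_size(1000, 2000, 100, 200)

# Distinct-value estimates carried over to U.
v_u = {"A": 50, "B": min(100, 200), "C": 300}

def t_z(t_r3):
    """Z = U ⋈ R3 on C, with V(U,C) = 300 and V(R3,C) = 90; t_r3 is supplied by the caller."""
    return join_size(t_u, t_r3, v_u["C"], 90)
```

For instance, `t_z(300)` evaluates the formula for a hypothetical R3 of 300 tuples; the real figure depends on the actual T(R3).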

59 Query Plan Optimization Level: Estimate Cost
Histogram
[Figure: histogram of the number of tuples in R with A value in a given range; both axes are marked 10, 20, 30, 40]
σ_{A=val}(R) = ?

60 Query Plan Optimization Level: Estimate Cost
Summary: Estimating size of results is an art Don’t forget: Statistics must be kept up-to-date (cost?)

61 Query Plan Optimization Level: Generate and Compare Plans
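Generate plans → prune → estimate cost → select (pick min)
[Figure: a set of candidate plans, some pruned (marked ×), the rest costed, and the minimum-cost plan picked]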

62 Query Plan Optimization Level: Generate and Compare Plans
To generate plans consider: Transforming relational algebra expression (e.g. order of joins) Use of existing indexes Building indexes or sorting on the fly Implementation details: - Join algorithm - Memory management - Parallel processing

63 Query Plan Optimization Level: Generate and Compare Plans
Estimating IO: count # of disk blocks that must be read (or written) to execute the query plan
To estimate cost, we need to handle extra parameters:
- B(R) = # of blocks containing tuples of R
- f(R) = max # of tuples of R per block
- M = # memory blocks available
- HT(i) = # levels in index i
- LB(i) = # of leaf blocks in index i

64 Query Plan Optimization Level: Generate and Compare Plans
Clustering index: index that allows tuples to be read in an order that corresponds to physical order; very useful for operations that involve many tuples
[Figure: index on A whose entries 10, 15, 17, 19, 35, 37 point to tuples stored in the same order]

65 Query Plan Optimization Level: Generate and Compare Plans
Example: R1 ⋈ R2 over common attribute C
T(R1) = 10,000; T(R2) = 5,000; S(R1) = S(R2) = 1/10 block; memory available = 101 blocks
Metric: # of IOs (ignoring writing of result)
Caution: this ignores CPU costs, timing, and double buffering requirements

66 Query Plan Optimization Level: Generate and Compare Plans
Options:
Transformations: R1 ⋈ R2, R2 ⋈ R1
Join algorithms:
- Iteration or nested join (nested loops)
- Merge join
- Join with index
- Hash join

67 Query Plan Optimization Level: Generate and Compare Plans
Nested Join (conceptually):
for each r ∈ R1 do
  for each s ∈ R2 do
    if r.C = s.C then output r,s pair
Merge Join (conceptually):
(1) if R1 and R2 not sorted, sort them
(2) i ← 1; j ← 1;
    While (i ≤ T(R1)) ∧ (j ≤ T(R2)) do
      if R1{ i }.C = R2{ j }.C then outputTuples
      else if R1{ i }.C > R2{ j }.C then j ← j+1
      else if R1{ i }.C < R2{ j }.C then i ← i+1

68 Query Plan Optimization Level: Generate and Compare Plans
Procedure Output-Tuples:
While (i ≤ T(R1)) ∧ (R1[ i ].C = R2[ j ].C) do
  [ jj ← j;
    while (jj ≤ T(R2)) ∧ (R1[ i ].C = R2[ jj ].C) do
      [ output pair R1[ i ], R2[ jj ]; jj ← jj+1 ]
    i ← i+1 ]
[Table: example trace with columns i, R1[i].C, R2[j].C, j; the last R2 entries are 50 at j = 6 and 52 at j = 7]
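The two conceptual algorithms can be written as a runnable sketch. It assumes relations are Python lists of tuples and that the join attribute is extracted by a `key` function; both function names are introduced here for illustration, and the merge join handles runs of duplicate keys the way Output-Tuples does.

```python
def nested_loop_join(r1, r2, key=lambda t: t[0]):
    """Test every R1 tuple against every R2 tuple."""
    return [(r, s) for r in r1 for s in r2 if key(r) == key(s)]

def merge_join(r1, r2, key=lambda t: t[0]):
    """Sort both inputs on the join key, then merge; returns matching (r, s) pairs."""
    r1, r2 = sorted(r1, key=key), sorted(r2, key=key)
    out, i, j = [], 0, 0
    while i < len(r1) and j < len(r2):
        a, b = key(r1[i]), key(r2[j])
        if a < b:
            i += 1
        elif a > b:
            j += 1
        else:
            # Find the end of the R2 run with this key value.
            j_end = j
            while j_end < len(r2) and key(r2[j_end]) == a:
                j_end += 1
            # Pair every R1 tuple in its run with every R2 tuple in the run.
            while i < len(r1) and key(r1[i]) == a:
                out.extend((r1[i], s) for s in r2[j:j_end])
                i += 1
            j = j_end
    return out

# Illustrative C values (assumed data), one-column tuples.
R1 = [(10,), (20,), (20,), (30,), (40,)]
R2 = [(5,), (20,), (20,), (30,), (30,), (50,), (52,)]
pairs = merge_join(R1, R2)
```

On the same input both functions produce the same multiset of pairs; the merge join reads each sorted input once instead of rescanning R2 per R1 tuple.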

69 Query Plan Optimization Level: Generate and Compare Plans
Join with index (conceptually), assuming an index on R2.C:
For each r ∈ R1 do
  [ X ← index(R2, C, r.C)
    for each s ∈ X do output r,s pair ]
Note: X ← index(rel, attr, value) means X = set of rel tuples with attr = value
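A minimal in-memory sketch of the index join. The "index" here is just a Python dictionary from attribute value to matching tuples, standing in for a real disk-based index; tuples are dictionaries, and `build_index` and `index_join` are hypothetical helper names.

```python
from collections import defaultdict

def build_index(rel, attr):
    """Stand-in for an index on rel.attr: value -> list of tuples with that value."""
    index = defaultdict(list)
    for t in rel:
        index[t[attr]].append(t)
    return index

def index_join(r1, r2_index, attr):
    """For each r in R1, probe the R2 index with r's join value and emit the pairs."""
    return [(r, s) for r in r1 for s in r2_index.get(r[attr], [])]

# Illustrative data (assumed values).
R1 = [{"C": 1, "x": "a"}, {"C": 2, "x": "b"}]
R2 = [{"C": 2, "y": "p"}, {"C": 2, "y": "q"}, {"C": 3, "y": "r"}]
result = index_join(R1, build_index(R2, "C"), "C")
```

Only R1 is scanned; R2 tuples are reached through the probe, which is why the expected number of matches per probe drives the cost.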

70 Query Plan Optimization Level: Generate and Compare Plans
Hash Join (conceptually):
Hash function h, range 0 → k
Buckets for R1: G0, G1, ... Gk
Buckets for R2: H0, H1, ... Hk
Algorithm:
(1) Hash R1 tuples into G buckets
(2) Hash R2 tuples into H buckets
(3) For i = 0 to k do: match tuples in Gi and Hi buckets
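The three steps can be sketched in memory as follows. This version uses `k` buckets numbered 0 to k−1 (a simplification of the 0 → k range above), Python's built-in `hash` as h, and lists of tuples as relations; the sample values are assumed.

```python
def hash_join(r1, r2, k=4, key=lambda t: t[0]):
    """Partition both inputs into k buckets, then join matching bucket pairs."""
    g = [[] for _ in range(k)]          # buckets for R1
    h = [[] for _ in range(k)]          # buckets for R2
    for t in r1:
        g[hash(key(t)) % k].append(t)
    for t in r2:
        h[hash(key(t)) % k].append(t)
    out = []
    for gi, hi in zip(g, h):
        table = {}
        for t in gi:                    # build an in-memory table on the R1 bucket
            table.setdefault(key(t), []).append(t)
        for s in hi:                    # probe it with the corresponding R2 bucket
            out.extend((r, s) for r in table.get(key(s), []))
    return out

# Even/odd partitioning corresponds to k = 2 (illustrative values).
R1 = [(2,), (4,), (3,), (5,), (8,), (9,)]
R2 = [(5,), (4,), (12,), (3,), (13,), (8,), (11,), (14,)]
matches = hash_join(R1, R2, k=2)
```

Tuples with equal join values always land in buckets with the same index, so only bucket pairs (Gi, Hi) ever need to be compared.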

71 Query Plan Optimization Level: Generate and Compare Plans
Example: hash even/odd
[Figure: tuples of R1 and R2 partitioned into an even bucket and an odd bucket; R1's even bucket holds 2, 4, 8 and its odd bucket holds 3, 5, 9]

72 Query Plan Optimization Level: Generate and Compare Plans
Factors that affect performance: Tuples of relation stored physically together? Relations sorted by join attribute? Indexes exist?

73 Query Plan Optimization Level: Generate and Compare Plans
Example 1(a): Iteration Join R1 ⋈ R2, relations not contiguous
Recall: T(R1) = 10,000; T(R2) = 5,000; S(R1) = S(R2) = 1/10 block; MEM = 101 blocks
Cost: for each R1 tuple: [read tuple + read R2]
Total IO = 10,000 × [1 + 5,000] = 50,010,000 IOs

74 Query Plan Optimization Level: Generate and Compare Plans
Can we do better? Use our memory more efficiently:
- Read 100 blocks of R1
- Read all of R2 (using 1 block) + join
- Repeat until done
Cost for each R1 chunk:
- Read R1 chunk: 100 blocks × 10 tuples/block = 1,000 tuples (IOs)
- Read all of R2: 5,000 tuples (IOs)
- Total per chunk = 1,000 + 5,000 = 6,000 IOs
Total cost = (# of R1 chunks) × (# of IOs per chunk) = (10,000 / 1,000) × 6,000 = 60,000 IOs

75 Query Plan Optimization Level: Generate and Compare Plans
Can we still do better? Reverse the join order (R2 ⋈ R1):
- Read 100 blocks of R2
- Read all of R1 (using 1 block) + join
- Repeat until done
Cost for each R2 chunk:
- Read R2 chunk: 100 blocks × 10 tuples/block = 1,000 tuples (IOs)
- Read all of R1: 10,000 tuples (IOs)
- Total per chunk = 1,000 + 10,000 = 11,000 IOs
Total cost = (# of R2 chunks) × (# of IOs per chunk) = (5,000 / 1,000) × 11,000 = 55,000 IOs

76 Query Plan Optimization Level: Generate and Compare Plans
Example 1(b): Iteration Join R2 ⋈ R1, relations are contiguous
- Read 100 blocks of R2
- Read all of R1 (using 1 block) + join
- Repeat until done
Cost for each R2 chunk:
- Read R2 chunk: 100 blocks (IOs) = 1,000 tuples
- Read all of R1: 1,000 blocks (IOs)
- Total per chunk = 100 + 1,000 = 1,100 blocks (IOs)
Total cost = (# of R2 chunks) × (# of IOs per chunk) = 5 chunks × 1,100 = 5,500 IOs
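The iteration-join arithmetic from Examples 1(a) and 1(b) can be reproduced with a short sketch. The constants come from the example; the function names and the split of memory into a 100-block chunk plus one block for the other relation are how this sketch models it.

```python
import math

T_R1, T_R2 = 10_000, 5_000      # tuples
TUPLES_PER_BLOCK = 10           # S(R) = 1/10 block
CHUNK_BLOCKS = 100              # 101 memory blocks: 100 for the chunk, 1 for the other relation

def tuple_at_a_time(t_outer, t_inner):
    """Not contiguous, no chunking: one IO per tuple, inner relation reread per outer tuple."""
    return t_outer * (1 + t_inner)

def chunked(t_outer, t_inner, contiguous):
    """Read the outer relation in 100-block chunks; scan the inner relation once per chunk."""
    chunk_tuples = CHUNK_BLOCKS * TUPLES_PER_BLOCK
    chunks = math.ceil(t_outer / chunk_tuples)
    if contiguous:
        per_chunk = CHUNK_BLOCKS + math.ceil(t_inner / TUPLES_PER_BLOCK)   # block IOs
    else:
        per_chunk = chunk_tuples + t_inner                                 # one IO per tuple
    return chunks * per_chunk

costs = {
    "1(a) tuple at a time": tuple_at_a_time(T_R1, T_R2),
    "1(a) chunked, R1 outer": chunked(T_R1, T_R2, contiguous=False),
    "1(a) chunked, R2 outer": chunked(T_R2, T_R1, contiguous=False),
    "1(b) contiguous, R2 outer": chunked(T_R2, T_R1, contiguous=True),
}
```

Putting the smaller relation in the outer loop reduces the number of chunks, and contiguous storage turns tuple IOs into block IOs.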

77 Query Plan Optimization Level: Generate and Compare Plans
Example 1(c): Merge Join R1 ⋈ R2, both R1 and R2 ordered by C; relations are contiguous
Total cost = read R1 cost + read R2 cost = 1,000 + 500 = 1,500 IOs
[Figure: one block of R1 and one block of R2 in memory at a time while both files are scanned]

78 Query Plan Optimization Level: Generate and Compare Plans
Example 1(d): Merge Join R1 ⋈ R2, R1 and R2 not ordered by C; relations are contiguous
Need to sort R1, R2 first … How? One way is merge sort:
For each 100-block chunk of R: read chunk → sort in memory → write to disk
[Figure: R1 and R2 pass through memory chunk by chunk and are written back as sorted chunks]

79 Query Plan Optimization Level: Generate and Compare Plans
Then: read all chunks + merge + write out [Figure: sorted chunks merged through memory into sorted files]
Cost of sort: each tuple is read, written, read, written, so
Sort cost for R1 = 4 × 1,000 = 4,000 IOs
Sort cost for R2 = 4 × 500 = 2,000 IOs
Total sort cost = 4,000 + 2,000 = 6,000 IOs
Total sort + merge join cost = sort cost + join cost = 6,000 + 1,500 = 7,500 IOs
But iteration cost = 5,500 IOs, so sorting followed by merge join does not pay off!

80 Query Plan Optimization Level: Generate and Compare Plans
But say R1 = 10,000 blocks, contiguous and not ordered, and R2 = 5,000 blocks, contiguous and not ordered:
Iterate: (5,000 / 100) × (100 + 10,000) = 50 × 10,100 = 505,000 IOs
Merge join: sort 4 × (10,000 + 5,000) + merge/join (10,000 + 5,000) = 5 × 15,000 = 75,000 IOs
Merge join with sort (relations not ordered) wins at 75,000 IOs!
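To see where the crossover comes from, the two cost formulas can be put side by side. This is a sketch under the slide's assumptions (contiguous relations, 100-block chunks, sort costing four IOs per block); the function names are introduced here.

```python
def sort_merge_cost(b1, b2, already_sorted=False):
    """Block IOs for merge join; sorting reads and writes every block twice."""
    sort = 0 if already_sorted else 4 * (b1 + b2)
    return sort + (b1 + b2)

def iterate_cost(b_outer, b_inner, chunk=100):
    """Block IOs for the chunked iteration join on contiguous relations."""
    chunks = -(-b_outer // chunk)            # ceiling division
    return chunks * (chunk + b_inner)

small = (iterate_cost(500, 1_000), sort_merge_cost(1_000, 500))
large = (iterate_cost(5_000, 10_000), sort_merge_cost(10_000, 5_000))
```

Iteration cost grows with the product of the relation sizes, while sort plus merge grows with their sum, which is why the larger relations flip the comparison.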

81 Query Plan Optimization Level: Generate and Compare Plans
How much memory do we need for merge sort?
E.g.: say we have 10 memory blocks. R1 is then split into 100 chunks, and to merge 100 chunks we need 100 blocks!
In general: say k blocks in memory, x blocks for the relation to sort
# chunks = (x/k); size of chunk = k
But # chunks ≤ buffers available for merge, so (x/k) ≤ k, or k² ≥ x, or k ≥ √x

82 Query Plan Optimization Level: Generate and Compare Plans
In our example: R1 is 1000 blocks, k ≥ 31.62; R2 is 500 blocks, k ≥ 22.36 ⇒ need at least 32 buffers
Can we improve on merge join? Hint: do we really need the fully sorted files step?
[Figure: the sorted runs of R1 and R2 fed directly into the join]
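The buffer bound k ≥ √x can be computed exactly with integer arithmetic. A minimal sketch; `min_sort_buffers` is a name made up for this example.

```python
import math

def min_sort_buffers(x_blocks):
    """Smallest integer k with k * k >= x_blocks, i.e. k >= sqrt(x)."""
    k = math.isqrt(x_blocks)
    return k if k * k == x_blocks else k + 1

needed = max(min_sort_buffers(1_000), min_sort_buffers(500))
```

With the example sizes this gives 32 buffers for R1 and 23 for R2, so the larger relation determines the requirement.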

83 Query Plan Optimization Level: Generate and Compare Plans
Cost of improved merge join:
C = read R1 + write R1 into “sorted runs” + read R2 + write R2 into “sorted runs” + join
  = 2,000 + 1,000 + 1,500 = 4,500 IOs
Memory requirements?

84 Query Plan Optimization Level: Generate and Compare Plans
Example 1(e): Index Join
- Assume R1.C index exists; 2 levels
- Assume R2 contiguous, unordered
- Assume R1.C index fits in memory
Cost: reading R2: 500 IOs; then for each R2 tuple:
- probe index: free
- if match, read R1 tuple: 1 IO

85 Query Plan Optimization Level: Generate and Compare Plans
What is the expected # of matching tuples?
(a) Say R1.C is primary key, R2.C is foreign key: expected matches = 1
(b) Say V(R1,C) = 5,000, T(R1) = 10,000: with the uniform assumption, expected matches = 10,000 / 5,000 = 2
(c) Say DOM(R1,C) = 1,000,000, T(R1) = 10,000: with the alternate assumption, expected matches = 10,000 / 1,000,000 = 1/100
Total cost with index join:
(a) Total cost = 500 + 5,000 × (1) × 1 = 5,500
(b) Total cost = 500 + 5,000 × (2) × 1 = 10,500
(c) Total cost = 500 + 5,000 × (1/100) × 1 = 550

86 Query Plan Optimization Level: Generate and Compare Plans
What if the index does not fit in memory?
Example: say the R1.C index is 201 blocks. Keep the root + 99 leaf nodes in memory.
Expected cost of each probe is E = (0) × 99/200 + (1) × 101/200 ≈ 0.5 IO

87 Query Plan Optimization Level: Generate and Compare Plans
Total cost including probes:
= 500 + 5,000 × [probe + get records]
Case (b), uniform assumption: = 500 + 5,000 × [0.5 + 2] = 500 + 12,500 = 13,000 IOs
Case (c): = 500 + 5,000 × [0.5 × 1 + (1/100) × 1] = 500 + 2,500 + 50 = 3,050 IOs
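The index-join figures follow one formula: read R2 once, then pay for a probe and the expected matching fetches per R2 tuple. The sketch below assumes B(R2) = 500 blocks and T(R2) = 5,000 tuples as in the example; the function names are illustrative.

```python
def probe_cost(leaf_blocks, leaves_in_memory):
    """Expected IOs per probe: 0 if the needed leaf is cached, 1 otherwise."""
    return (leaf_blocks - leaves_in_memory) / leaf_blocks

def index_join_cost(b_r2, t_r2, matches, probe_io=0.0):
    """Read R2 once; per R2 tuple, probe the R1.C index and fetch the matches (1 IO each)."""
    return b_r2 + t_r2 * (probe_io + matches * 1)

# Index fits in memory: probes are free.
in_memory = [index_join_cost(500, 5_000, m) for m in (1, 2, 1 / 100)]

# 201-block index, root + 99 leaves cached: about 0.5 IO per probe (exactly 0.505).
exact_probe = probe_cost(200, 99)
case_b = index_join_cost(500, 5_000, 2, probe_io=0.5)
case_c = index_join_cost(500, 5_000, 1 / 100, probe_io=0.5)
```

The slide rounds the probe cost to 0.5; using the unrounded 0.505 would raise each total by 25 IOs.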

88 Query Plan Optimization Level: Generate and Compare Plans
So far:
Not contiguous:
- Iterate R2 ⋈ R1: 55,000 (best)
- Merge join: _________
- Sort + merge join: _________
- R1.C index: _________
- R2.C index: _________
Contiguous:
- Iterate R2 ⋈ R1: 5,500
- Merge join: 1,500
- Sort + merge join: 7,500 → 4,500
- R1.C index: … → 3,050 → 550
- R2.C index: ________

89 Query Plan Optimization Level: Generate and Compare Plans
Example 1(f): Hash Join
- R1, R2 contiguous (un-ordered)
- Use 100 buckets
- Read R1, hash, + write buckets
[Figure: R1 streamed through memory into 100 buckets of about 10 blocks each]

90 Query Plan Optimization Level: Generate and Compare Plans
Example 1(f): Hash Join (Contd.)
- Same for R2
- Read one R1 bucket; build memory hash table
- Read corresponding R2 bucket + hash probe
- Then repeat for all buckets
[Figure: one R1 bucket held in memory while the matching R2 bucket is read one block at a time]

91 Query Plan Optimization Level: Generate and Compare Plans
Example 1(f): Hash Join (Contd.)
Cost:
- “Bucketize”: read R1 + write; read R2 + write
- Join: read R1, R2
Total cost = 3 × [1,000 + 500] = 4,500 IOs
Note: this is an approximation, since buckets will vary in size and we have to round up to blocks

92 Query Plan Optimization Level: Generate and Compare Plans
Example 1(f): Hash Join (Contd.)
Minimum memory requirements:
Max. size of R1 bucket = (x/k), where k = number of buckets (or memory buffers) and x = number of R1 blocks
Since (x/k) < k, we get k > √x; needs k+1 total memory buffers

93 Query Plan Optimization Level: Generate and Compare Plans
Example 1(f): Hash Join (Contd.)
Keep some buckets in memory (called hybrid hash-join):
R1 = 1000 blocks → # R1 buckets = k = 33 buckets, each R1 bucket = 31 blocks
Keep 2 buckets in memory + 1 block for each of the remaining buckets (33 − 2 = 31 buffers)
Memory use for R1:
- G0: 31 buffers
- G1: 31 buffers
- Output buffers: 33 − 2 = 31
- R1 input: 1
- Total: 94 buffers, 6 buffers to spare!!

94 Query Plan Optimization Level: Generate and Compare Plans
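Example 1(f): Hash Join (Contd.)
Next, bucketize R2:
R2 = 500 blocks → size of each R2 bucket = 500/33 = 16 blocks/bucket
Two of the R2 buckets are joined immediately with G0, G1 (i.e., in one pass)
[Figure: G0, G1 and the R2 input buffer in memory; the other 33 − 2 = 31 R2 buckets (16 blocks each) are written out next to the 31 R1 buckets (31 blocks each)]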

95 Query Plan Optimization Level: Generate and Compare Plans
Example 1(f): Hash Join (Contd.)
Finally, join the remaining buckets. For each bucket pair:
- Read one of the buckets into memory
- Join with the second bucket
[Figure: memory holds one full R2 bucket (16 blocks) and one R1 buffer; the 31 remaining bucket pairs are processed in turn and the answer is written out]

96 Query Plan Optimization Level: Generate and Compare Plans
Example 1(f): Hash Join (Contd.)
Cost:
- Bucketize R1 = 1,000 + 31 × 31 = 1,961 IOs for R + W
- To bucketize R2, only write 31 buckets, so cost = 500 + 31 × 16 = 996 IOs for R + W
- To compute the join (2 buckets already done), read 31 × 31 + 31 × 16 = 1,457 IOs
- Ignore writing the output of the join
Total cost = 1,961 + 996 + 1,457 = 4,414 IOs
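The hybrid hash-join total can be rebuilt from its parts. This sketch uses the example's sizes as defaults (R1 = 1,000 blocks, R2 = 500 blocks, 33 buckets, 2 kept in memory) and rounds bucket sizes up to whole blocks; `hybrid_hash_cost` is a hypothetical name.

```python
import math

def hybrid_hash_cost(b_r1=1_000, b_r2=500, buckets=33, in_memory=2):
    """Return (bucketize R1, bucketize R2, join) block IOs, ignoring output writes."""
    spilled = buckets - in_memory                   # buckets written to disk
    r1_bucket = math.ceil(b_r1 / buckets)           # blocks per R1 bucket
    r2_bucket = math.ceil(b_r2 / buckets)           # blocks per R2 bucket
    bucketize_r1 = b_r1 + spilled * r1_bucket       # read all of R1, write spilled buckets
    bucketize_r2 = b_r2 + spilled * r2_bucket       # read all of R2, write spilled buckets
    join = spilled * (r1_bucket + r2_bucket)        # re-read the spilled bucket pairs
    return bucketize_r1, bucketize_r2, join

parts = hybrid_hash_cost()
total = sum(parts)
```

Each bucket kept in memory saves one write and one read of that bucket for both relations, which is where the gain over the plain 4,500-IO hash join comes from.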

97 Query Plan Optimization Level: Generate and Compare Plans
How many buckets in memory?
# of buckets k > √x, where x = # of blocks in the larger table and k = # buckets
Needs: k+1 total memory buffers
[Figure: memory holding one bucket (G0) or two buckets (G0, G1) plus the R1 input buffer. Which is better?]

98 Query Plan Optimization Level: Generate and Compare Plans
Another hash join trick:
- Only write <val,ptr> pairs into buckets
- When we get a match in the join phase, must fetch tuples
Cost:
- More pairs fit per block, which reduces the needed memory buffers and the number of buckets
- Tuples are fetched only when we get a match in the join phase (additional IO)

99 Query Plan Optimization Level: Generate and Compare Plans
So far (contiguous):
- Iterate: 5,500
- Merge join: ……
- Sort + merge join: 7,500
- R1.C index: … → 550
- R2.C index: ……
- Build R.C index: ……
- Build S.C index: ……
- Hash join: 4,500+
  - with trick, R1 first: 4,414
  - with trick, R2 first: ……
- Hash join, pointers: 1,600

100 Query Plan Optimization Level: Generate and Compare Plans
Summary:
- Iteration is OK for “small” relations (relative to memory size)
- For equi-join, where relations are not sorted and no indexes exist, hash join is usually the best
- Sort + merge join is good for non-equi-join (e.g., R1.C > R2.C)
- If relations are already sorted, use merge join
- If an index exists, it could be useful (depends on expected result size)
- After generating plans that use different join algorithms, compare the costs of the possible plans

101 END

