Estimating the Cost of Operations

Suppose we have parsed a query and transformed it into a logical query plan (lqp). Also suppose all possible transformations have been applied to construct the “preferred” lqp. Now, what is involved in transforming the lqp into a physical plan? For this, we need a cost estimate to choose:
– An order and grouping for commutative and associative operations.
– An algorithm for each operation in the lqp (nested-loop join, hash join, etc.).
– Additional operations such as scanning, sorting, etc.
– The way to pass the parameters from one operator to another.

Why estimate the costs?
[Figure: an initial logical query plan and two candidates for the best logical query plan.]
Which one to choose?

Estimating Costs
How can we estimate the number of tuples in an intermediate relation? We don’t want to execute the query in order to learn the costs, so we need to estimate them. Rules about estimation formulas:
1. Give (reasonably) accurate estimates.
2. Be easy to compute.
3. Be consistent (the size estimate should not depend on the algorithm).

Estimating the cost of operations
Review of notation:
– B(R) = number of blocks of R
– T(R) = number of tuples of R
– V(R,a) = number of different values in the a-column of R
– V(R,L) = number of different L-values in R (L is a list of attributes)

Projection
The size of a projection is the only one we can compute exactly.
1. Projection π retains duplicates, so the number of tuples in the result is the same as in the input.
2. Result tuples are usually shorter than the input tuples.

Selection
Let S = σ_{A=c}(R). We can estimate the size of the result as T(S) = T(R) / V(R,A).
Let S = σ_{A<c}(R). On average, T(S) would be T(R)/2, but inequality queries tend to ask for a small part of the range, so the usual rule of thumb is T(S) = T(R)/3.
Let S = σ_{A≠c}(R). Then an estimate is T(S) = T(R) * (V(R,A)-1)/V(R,A), or simply T(S) = T(R).
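
The three selection rules above can be sketched in Python as follows. The function names are hypothetical, and the 1/3 factor is the rule of thumb from the slide, not a derived value.

```python
def est_eq(t_r: float, v_ra: int) -> float:
    """T(S) for S = sigma_{A=c}(R): T(R) / V(R,A)."""
    return t_r / v_ra


def est_lt(t_r: float) -> float:
    """T(S) for S = sigma_{A<c}(R): rule of thumb T(R) / 3."""
    return t_r / 3


def est_ne(t_r: float, v_ra: int) -> float:
    """T(S) for S = sigma_{A != c}(R): T(R) * (V(R,A) - 1) / V(R,A)."""
    return t_r * (v_ra - 1) / v_ra


# Illustrative numbers (assumed): T(R) = 10,000 and V(R,A) = 50.
print(est_eq(10_000, 50))
print(est_lt(10_000))
print(est_ne(10_000, 50))
```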

Selection...
Let S = σ_{C AND D}(R) = σ_C(σ_D(R)). First estimate T(σ_D(R)) and then use this to estimate T(S).
Example: S = σ_{a=10 AND b<20}(R), with T(R) = 10,000 and V(R,a) = 50.
T(S) = (1/50) * (1/3) * T(R) ≈ 67
Note: Watch for contradictory selections like σ_{a=10 AND a>20}(R), whose result is empty.
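
A minimal sketch of the AND rule, treating the cascade of selections as a product of selectivity factors. The helper name `est_and` is hypothetical, and the sketch does not detect contradictory conditions; those have to be recognized separately.

```python
def est_and(t_r: float, selectivities: list[float]) -> float:
    """Apply each condition's selectivity factor in turn to T(R)."""
    estimate = t_r
    for factor in selectivities:
        estimate *= factor
    return estimate


# a = 10 has selectivity 1/V(R,a) = 1/50; b < 20 uses the 1/3 rule of thumb.
t_s = est_and(10_000, [1 / 50, 1 / 3])
print(round(t_s))  # 67
```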

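Selection...
Let S = σ_{C OR D}(R). Simple estimate: T(S) = T(σ_C(R)) + T(σ_D(R)).
Problem: it’s possible that T(S) ≥ T(R)!
A more accurate estimate. Let:
– T(R) = n,
– m_1 = size of selection on C, and
– m_2 = size of selection on D.
Then T(S) = n(1 - (1 - m_1/n)(1 - m_2/n)).
Why? 1 - m_1/n is the fraction of tuples that fail C and 1 - m_2/n is the fraction that fail D. Assuming C and D are independent, their product is the fraction that fail both, and all remaining tuples satisfy C OR D.
Example: S = σ_{a=10 OR b<20}(R), with T(R) = 10,000 and V(R,a) = 50.
Simple estimate: T(S) = 200 + 3333 = 3533
More accurate: T(S) = 3466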
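
Both OR estimates can be checked with a short Python sketch. The function names are hypothetical; m_1 and m_2 are taken as whole tuple counts (200 and 3333), which reproduces the figures on the slide.

```python
def est_or_simple(m1: float, m2: float) -> float:
    """Sum of the two selection sizes; may exceed T(R)."""
    return m1 + m2


def est_or(n: float, m1: float, m2: float) -> float:
    """n * (1 - (1 - m1/n) * (1 - m2/n)); never exceeds n."""
    return n * (1 - (1 - m1 / n) * (1 - m2 / n))


n = 10_000
m1 = n // 50  # size of the selection a = 10
m2 = n // 3   # size of the selection b < 20, by the 1/3 rule of thumb

print(est_or_simple(m1, m2))    # 3533
print(int(est_or(n, m1, m2)))   # 3466
```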

Natural Join R(X,Y) ⋈ S(Y,Z)
Anything could happen! Extremes:
– No tuples join: T(R ⋈ S) = 0
– Y is the key of S and a foreign key of R (i.e., R.Y refers to S.Y): T(R ⋈ S) = T(R)
– All tuples join, i.e., every tuple of R and of S has the same Y-value a: T(R ⋈ S) = T(R)*T(S)

Two (simplifying) assumptions
Containment of value sets: if V(R,Y) ≤ V(S,Y), then every Y-value in R is assumed to occur as a Y-value in S.
– When can this happen? For example, when Y is a foreign key in R and a key in S.
Preservation of value sets: if A is an attribute of R but not of S, then it is assumed that V(R ⋈ S, A) = V(R, A).
– This may be violated when there are dangling tuples in R.
– There is no violation when Y is a foreign key in R and a key in S.

Natural Join size estimation
Let R(X,Y) and S(Y,Z), and suppose Y is a single attribute. What is T(R ⋈ S)?
Let r be a tuple in R and s be a tuple in S. What’s the probability that r and s join?
Suppose V(R,Y) ≤ V(S,Y). By the containment of value sets we infer that every Y-value in R appears in S. So the tuple r of R is sure to match some tuples of S, but what’s the probability that it matches s? It’s 1/V(S,Y). So T(S)/V(S,Y) tuples of S would match tuple r. Hence, T(R ⋈ S) = T(R)*T(S)/V(S,Y).
By similar reasoning, for the case when V(S,Y) ≤ V(R,Y), we get T(R ⋈ S) = T(R)*T(S)/V(R,Y).
Summarizing, we have as an estimate: T(R ⋈ S) = T(R)*T(S)/max{V(R,Y),V(S,Y)}

Example:
R(a,b), T(R)=1000, V(R,b)=20
S(b,c), T(S)=2000, V(S,b)=50, V(S,c)=100
U(c,d), T(U)=5000, V(U,c)=500
Estimate the size of R ⋈ S ⋈ U.
T(R ⋈ S) = 1000*2000 / 50 = 40,000
T((R ⋈ S) ⋈ U) = 40,000 * 5000 / 500 = 400,000
T(S ⋈ U) = 2000*5000 / 500 = 20,000
T(R ⋈ (S ⋈ U)) = 1000*20,000 / 50 = 400,000
The equality of the results is not a coincidence.
Note 1: the estimate of the final result should not depend on the evaluation order.
Note 2: intermediate results can be of different sizes.
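
The example can be replayed with a small Python sketch of the single-attribute join estimate. The function name `est_join` is hypothetical; the V values passed for the intermediate relations rely on the preservation-of-value-sets assumption (V(R ⋈ S, c) = V(S,c) = 100 and V(S ⋈ U, b) = V(S,b) = 50).

```python
def est_join(t_r: float, t_s: float, v_r_y: int, v_s_y: int) -> float:
    """T(R join S) = T(R) * T(S) / max(V(R,Y), V(S,Y))."""
    return t_r * t_s / max(v_r_y, v_s_y)


# Order 1: (R join S) join U
rs = est_join(1000, 2000, 20, 50)      # join on b
rs_u = est_join(rs, 5000, 100, 500)    # join on c, with V(R join S, c) = 100

# Order 2: R join (S join U)
su = est_join(2000, 5000, 100, 500)    # join on c
r_su = est_join(1000, su, 20, 50)      # join on b, with V(S join U, b) = 50

print(rs, su)       # intermediate sizes differ: 40000.0 20000.0
print(rs_u, r_su)   # final estimates agree: 400000.0 400000.0
```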

Natural join with multiple join attributes
R(x,y_1,y_2) ⋈ S(y_1,y_2,z)
T(R ⋈ S) = T(R)*T(S)/(m_1*m_2), where
m_1 = max{V(R,y_1),V(S,y_1)}
m_2 = max{V(R,y_2),V(S,y_2)}
Why? Let r be a tuple in R and s be a tuple in S. What’s the probability that r and s agree on y_1? From the previous reasoning, it’s 1/max{V(R,y_1),V(S,y_1)} = 1/m_1. Similarly, what’s the probability that r and s agree on y_2? It’s 1/max{V(R,y_2),V(S,y_2)} = 1/m_2. Assuming that agreements on y_1 and y_2 are independent, we estimate: T(R ⋈ S) = T(R)*T(S)/(m_1*m_2)
Example:
T(R)=1000, V(R,b)=20, V(R,c)=100
T(S)=2000, V(S,d)=50, V(S,e)=50
R(a,b,c) ⋈_{R.b=S.d AND R.c=S.e} S(d,e,f)
T(R ⋈ S) = (1000*2000)/(50*100) = 400
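
A minimal sketch of the multi-attribute formula, assuming independence between the join attributes as the slide does. The function name and the list-of-pairs argument are hypothetical conveniences.

```python
from math import prod


def est_join_multi(t_r: float, t_s: float, v_pairs: list[tuple[int, int]]) -> float:
    """T(R) * T(S) divided by the product of max(V(R,y_i), V(S,y_i))."""
    return t_r * t_s / prod(max(v_r, v_s) for v_r, v_s in v_pairs)


# Pairs are (V(R,b), V(S,d)) and (V(R,c), V(S,e)) from the example.
print(est_join_multi(1000, 2000, [(20, 50), (100, 50)]))  # 400.0
```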

Another example (one of the previous):
R(a,b), T(R)=1000, V(R,b)=20
S(b,c), T(S)=2000, V(S,b)=50, V(S,c)=100
U(c,d), T(U)=5000, V(U,c)=500
Estimate the size of R ⋈ S ⋈ U.
Observe that R ⋈ S ⋈ U = (R × U) ⋈ S, since R and U have no attribute in common.
T(R × U) = 1000*5000 = 5,000,000
Note that the number of b’s in the product is 20 (=V(R,b)), and the number of c’s is 500 (=V(U,c)).
T((R × U) ⋈ S) = 5,000,000 * 2000 / (50 * 500) = 400,000

Size estimates for other operations
Cartesian product: T(R × S) = T(R) * T(S)
Bag union: sum of the sizes.
Set union: the larger plus half the smaller. Why? Because a set union can be as large as the sum of the sizes or as small as the larger of the two arguments. Something in the middle is suggested.
Intersection: half the smaller. Why? Because an intersection can be as small as 0 or as large as the size of the smaller argument. Something in the middle is suggested.
Difference: T(R - S) = T(R) - 1/2*T(S), because the result can be between T(R) - T(S) and T(R). Something in the middle is suggested.
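
These "something in the middle" rules are simple enough to write down directly. The sketch below uses hypothetical function names and made-up sizes (T(R) = 1000, T(S) = 400) purely for illustration.

```python
def est_set_union(t_r: float, t_s: float) -> float:
    """The larger argument plus half the smaller."""
    return max(t_r, t_s) + min(t_r, t_s) / 2


def est_intersection(t_r: float, t_s: float) -> float:
    """Half the smaller argument."""
    return min(t_r, t_s) / 2


def est_difference(t_r: float, t_s: float) -> float:
    """T(R - S) = T(R) - T(S)/2, as given on the slide."""
    return t_r - t_s / 2


# Assumed sizes, not taken from the slides.
print(est_set_union(1000, 400))     # 1200.0
print(est_intersection(1000, 400))  # 200.0
print(est_difference(1000, 400))    # 800.0
```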

Size estimates for other operations
Duplicate elimination δ(R(a_1,...,a_n)): the size ranges from 1 to T(R).
T(δ(R)) = V(R,[a_1,...,a_n]), if available (but usually it is not).
Otherwise, T(δ(R)) = min[V(R,a_1)*...*V(R,a_n), 1/2*T(R)] is suggested. Why?
– V(R,a_1)*...*V(R,a_n) is the upper limit on the number of distinct tuples that could exist.
– 1/2*T(R) is used because the size can be as small as 1 or as big as T(R).
Grouping and aggregation: similar to δ, but only with respect to the grouping attributes.
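
A sketch of the duplicate-elimination rule, with the exact V(R,[a_1,...,a_n]) passed in when it happens to be known. The function name, the optional `v_joint` parameter, and the sample numbers are all assumptions for illustration.

```python
from math import prod
from typing import Optional


def est_distinct(t_r: float, v_per_attr: list[int], v_joint: Optional[int] = None) -> float:
    """Estimate T(delta(R)).

    Uses V(R,[a_1..a_n]) if given, else min(product of V(R,a_i), T(R)/2).
    """
    if v_joint is not None:
        return v_joint
    return min(prod(v_per_attr), t_r / 2)


# Assumed statistics: T(R) = 1000.
print(est_distinct(1000, [10, 20]))    # product 200 is the smaller bound
print(est_distinct(1000, [50, 100]))   # T(R)/2 = 500.0 is the smaller bound
```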

Computing the statistics
Computation of statistics is triggered automatically or manually.
The T(R)’s and V(R,A)’s are just aggregation queries (COUNT queries). However, they are expensive to compute.
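
To make the "just COUNT queries" remark concrete, here is a small sketch using Python's built-in sqlite3 module. The table R(a, b) and its four rows are invented for illustration; a real system would run such queries over the full relation, which is what makes them expensive.

```python
import sqlite3

# Hypothetical relation R(a, b) held in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (a INTEGER, b INTEGER)")
conn.executemany(
    "INSERT INTO R VALUES (?, ?)",
    [(1, 10), (1, 20), (2, 10), (3, 30)],
)

# T(R): number of tuples.
t_r = conn.execute("SELECT COUNT(*) FROM R").fetchone()[0]
# V(R,a): number of different values in the a-column.
v_r_a = conn.execute("SELECT COUNT(DISTINCT a) FROM R").fetchone()[0]
# V(R,[a,b]): number of different (a, b) combinations.
v_r_ab = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT a, b FROM R)"
).fetchone()[0]

print(t_r, v_r_a, v_r_ab)  # 4 3 4
```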

Incremental computation of statistics
Maintaining T(R): Add 1 for every insertion and subtract 1 for every deletion.
– What’s the problem?
If there is a B-Tree on any attribute of R, then just keep track of the B-Tree blocks and infer the approximate size of the relation.
When does this require effort? Only when the B-Tree blocks change, which is relatively rare compared with the rate of insertions and deletions.

Incremental computation of statistics
Maintaining V(R,A):
If there is an index on attribute A of a relation R, then:
– On insert into R, we must find the A-value for the new tuple in the index anyway, and so we can determine whether there is already such a value for A. If not, increment V(R,A).
– On deletion…
If there isn’t an index on A, the system could in effect create a rudimentary index by keeping a data structure (e.g. a B-Tree) that holds every value of A.
Final option: sample the relation.
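
The "rudimentary index" idea might look like the sketch below, which keeps a per-value count in a hash-based Counter instead of a B-Tree. The class name is hypothetical, and the deletion handling is an assumption of this sketch (decrement V(R,A) when the last tuple with a value disappears); the slide leaves that case open.

```python
from collections import Counter


class DistinctTracker:
    """Maintains V(R,A) incrementally from a stream of inserts and deletes."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()  # A-value -> number of tuples with it
        self.v = 0                        # current V(R,A)

    def insert(self, a_value) -> None:
        if self.counts[a_value] == 0:     # value not seen before
            self.v += 1
        self.counts[a_value] += 1

    def delete(self, a_value) -> None:
        # Assumed behaviour: not specified on the slide.
        if self.counts[a_value] == 0:
            raise KeyError(a_value)
        self.counts[a_value] -= 1
        if self.counts[a_value] == 0:     # last tuple with this value is gone
            del self.counts[a_value]
            self.v -= 1


tracker = DistinctTracker()
for value in (5, 5, 7):
    tracker.insert(value)
print(tracker.v)  # 2
tracker.delete(5)
tracker.delete(5)
print(tracker.v)  # 1
```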

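Histograms
Equal width: one count per band of values (figure labels: 1-2, 3-4, 4-5, 6-7, 8-9).
Most frequent values: one count per frequent value, plus one count for the rest (figure labels: 4, 7, rest).
Advantage: more accurate estimate of the join size.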

Example (most freq. values histogram)
Estimate U = R(a,b) ⋈ S(b,c)
V(R,b) = 14. Histogram for R.b: 0:150, 1:200, 5:100, rest: 550
V(S,b) = 13. Histogram for S.b: 0:100, 1:80, 2:70, rest: 250
Tuples in U:
– on 0: 100*150 = 15,000
– on 1: 200*80 = 16,000
– on 2: 70 * (550/(14-3)) = 3500
– on 5: 100 * (250/(13-3)) = 2500
– on the 9 other values: 9*(550/11)*(250/10) = 9*1250
Total T(U) = 15000 + 16000 + 3500 + 2500 + 9*1250 = 48,250
Simple estimate (equal occurrence assumption): T(U) = 1000*500/14 = 35,714
Why 9 other values? S has 13-3 = 10 values outside its histogram, and one of them (5) was already counted. Because V(S,b) < V(R,b), the containment of value sets assumption says that the 9 values of S we have not considered yet are all in R as well.
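
The bookkeeping in this example can be sketched as a small Python function. The function name and argument layout are hypothetical, and the sketch assumes, as the example does, that a value frequent in one relation also occurs in the other and that the remaining values are shared according to containment of value sets.

```python
def est_join_mfv(hist_r: dict, rest_r: float, v_r: int,
                 hist_s: dict, rest_s: float, v_s: int) -> float:
    """Join-size estimate from two most-frequent-values histograms."""
    # Average number of tuples per value outside each histogram.
    avg_r = rest_r / (v_r - len(hist_r))
    avg_s = rest_s / (v_s - len(hist_s))

    total = 0.0
    # Values that are frequent in at least one relation.
    for value in hist_r.keys() | hist_s.keys():
        total += hist_r.get(value, avg_r) * hist_s.get(value, avg_s)

    # Values frequent in neither relation: the side with fewer of them
    # determines how many are shared (containment of value sets).
    others_r = v_r - len(hist_r) - len(hist_s.keys() - hist_r.keys())
    others_s = v_s - len(hist_s) - len(hist_r.keys() - hist_s.keys())
    total += min(others_r, others_s) * avg_r * avg_s
    return total


t_u = est_join_mfv({0: 150, 1: 200, 5: 100}, 550, 14,
                   {0: 100, 1: 80, 2: 70}, 250, 13)
print(t_u)                      # 48250.0
print(round(1000 * 500 / 14))   # simple estimate: 35714
```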

Example (equal width histogram)
Schemas: Jan(day,temp), July(day,temp)
Query: Find the pairs of days in January and July that had the same temperature.
    SELECT Jan.day, July.day
    FROM Jan, July
    WHERE Jan.temp = July.temp;
The size of the join on each band is T1*T2/Width:
– On band 40-49: 10*5/10 = 5
– On band 50-59: 5*20/10 = 10
The size of the result is thus 5+10 = 15.
Without using the histogram we would estimate the size as
– 245*245/100 ≈ 600 !!
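
A minimal sketch of the per-band computation. The function name is hypothetical, and only the two bands whose counts appear in the example are included; the counts for the other bands are not given in the transcript, so they are left out rather than guessed.

```python
def est_join_equal_width(hist_1: dict, hist_2: dict, width: int) -> float:
    """Sum T1 * T2 / width over the bands present in both histograms."""
    return sum(
        count * hist_2[band] / width
        for band, count in hist_1.items()
        if band in hist_2
    )


# Band counts taken from the two products shown in the example.
jan = {"40-49": 10, "50-59": 5}
july = {"40-49": 5, "50-59": 20}

with_histogram = est_join_equal_width(jan, july, width=10)
without_histogram = 245 * 245 / 100

print(with_histogram)             # 15.0
print(round(without_histogram))   # 600
```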
