QUERY PROCESSING AND OPTIMIZATION

Overview
- SQL is a declarative language:
  - Specifies the outcome of the computation without defining any flow of control
- Requires the DBMS to select an execution plan
- Allows optimizations

Sample query
  SELECT C, E FROM R, S WHERE R.B = "z" AND S.F = 30 AND R.A = S.D

The two tables

R:
A  B  C
1  x  100
2  y  200
3  z  300
4  y  400

S:
D  E  F
1  u  10
3  v  30
5  w  30

First execution plan
- Can use relational algebra to express an execution plan
- Could be:
  - Cartesian product: R × S
  - Selection: σ R.B = "z" ∧ S.F = 30 ∧ R.A = S.D (R × S)
  - Projection: π C, E (σ R.B = "z" ∧ S.F = 30 ∧ R.A = S.D (R × S))

Graphical representation (evaluated bottom-up)
  R × S  →  σ R.B = "z" ∧ S.F = 30 ∧ R.A = S.D  →  π C, E

R × S:
A  B  C    D  E  F
1  x  100  1  u  10
1  x  100  3  v  30
1  x  100  5  w  30
2  y  200  1  u  10
2  y  200  3  v  30
2  y  200  5  w  30
3  z  300  1  u  10
3  z  300  3  v  30
3  z  300  5  w  30
4  y  400  1  u  10
4  y  400  3  v  30
4  y  400  5  w  30

Second execution plan
- Selection: σ B = "z" (R) and σ F = 30 (S)
- Join: σ B = "z" (R) ⋈ R.A=S.D σ F = 30 (S)
- Projection: π C, E (…)

After the selections

σ B = "z" (R):
A  B  C
3  z  300

σ F = 30 (S):
D  E  F
3  v  30
5  w  30

σ B = "z" (R) ⋈ R.A=S.D σ F = 30 (S):
A  B  C    D  E  F
3  z  300  3  v  30

Discussion
- Second plan
  - First extracts the relevant rows of tables R and S
  - Uses a more efficient join:
      for each row in σ B = "z" (R):
          for each row in σ F = 30 (S):
              if R.A = S.D:
                  include rows in result
  - Note that the inner loop searches the smaller temporary table σ F = 30 (S)
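The two plans can be compared on the sample tables with a minimal Python sketch. The function names `plan_one` and `plan_two` and the list-of-dicts table layout are assumptions made for illustration; each function also returns the number of row pairs it had to examine.

```python
# Sample tables from the slides, stored as lists of dicts (an assumed layout).
R = [
    {"A": 1, "B": "x", "C": 100},
    {"A": 2, "B": "y", "C": 200},
    {"A": 3, "B": "z", "C": 300},
    {"A": 4, "B": "y", "C": 400},
]
S = [
    {"D": 1, "E": "u", "F": 10},
    {"D": 3, "E": "v", "F": 30},
    {"D": 5, "E": "w", "F": 30},
]

def plan_one(R, S):
    """Cartesian product first, then selection, then projection."""
    product = [{**r, **s} for r in R for s in S]
    selected = [t for t in product
                if t["B"] == "z" and t["F"] == 30 and t["A"] == t["D"]]
    return [(t["C"], t["E"]) for t in selected], len(product)

def plan_two(R, S):
    """Selections first, then a nested-loop join on the reduced tables."""
    r_sel = [r for r in R if r["B"] == "z"]
    s_sel = [s for s in S if s["F"] == 30]
    joined = [{**r, **s} for r in r_sel for s in s_sel if r["A"] == s["D"]]
    return [(t["C"], t["E"]) for t in joined], len(r_sel) * len(s_sel)

result_one, pairs_one = plan_one(R, S)
result_two, pairs_two = plan_two(R, S)
```

Both plans return the same answer, but the first examines every pair in R × S while the second only pairs the rows that survive the selections.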

More generally
- Exclude as quickly as possible:
  - Irrelevant rows
  - Irrelevant attributes
- Most important when the involved tables reside on different hosts (distributed DBMS)
- Whenever possible, ensure that inner join loops search tables that can reside in main memory

Caching considerations
- Cannot rely on LRU to achieve that
  - Will keep in memory recently accessed pages of all tables
- Must keep
  - All pages of the table inside the inner loop
  - No pages of the other table
- Can either
  - Let the DBMS manage the cache
  - Use a scan-tolerant cache algorithm (ARC)

A third plan
- Find the rows of R where B = "z"
- Using an index on S.D, find the rows of S where S.D matches R.A for the rows where R.B = "z"
- Include each pair of rows in the join
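A rough sketch of this index-based plan, assuming the index on S.D is an in-memory dictionary from D values to rows. The helpers `build_index` and `third_plan` are hypothetical names, and applying the remaining S.F = 30 filter of the sample query after the join is also an assumption.

```python
from collections import defaultdict

R = [
    {"A": 1, "B": "x", "C": 100},
    {"A": 2, "B": "y", "C": 200},
    {"A": 3, "B": "z", "C": 300},
    {"A": 4, "B": "y", "C": 400},
]
S = [
    {"D": 1, "E": "u", "F": 10},
    {"D": 3, "E": "v", "F": 30},
    {"D": 5, "E": "w", "F": 30},
]

def build_index(table, attribute):
    """Map each value of the attribute to the rows that contain it."""
    index = defaultdict(list)
    for row in table:
        index[row[attribute]].append(row)
    return index

def third_plan(R, index_on_d):
    """Scan R for B = "z", then probe the index on S.D with each R.A."""
    joined = []
    for r in R:
        if r["B"] != "z":
            continue
        for s in index_on_d.get(r["A"], []):
            joined.append({**r, **s})
    return joined

index_on_d = build_index(S, "D")
joined = third_plan(R, index_on_d)
# Remaining condition and projection of the sample query.
answer = [(t["C"], t["E"]) for t in joined if t["F"] == 30]
```

The index probe replaces the inner loop over S, so only the matching rows of S are ever touched.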

Processing a query (I)
- Parse the query
- Convert the query parse tree into a logical query plan (LQP)
- Apply equivalence rules (laws) and try to improve upon the extant LQP
- Estimate result sizes

Processing a query (II)
- Consider possible physical plans
- Estimate their cost
- Select the best
- Execute it
- Given the high cost of query processing, it makes sense to evaluate various alternatives

Example (from [GM])
  SELECT title
  FROM StarsIn
  WHERE starName IN (
      SELECT name
      FROM MovieStar
      WHERE birthdate LIKE '%1960'
  );

Relational algebra plan
  π title
    σ
      StarsIn
      <condition>: starName IN
        π name
          σ birthdate LIKE '%1960'
            MovieStar
  Fig. 7.15: An expression using a two-argument σ, midway between a parse tree and relational algebra

Logical query plan
  π title
    σ starName = name
      ×
        StarsIn
        π name
          σ birthdate LIKE '%1960'
            MovieStar
  Fig. 7.18: Applying the rule for IN conditions
- The Cartesian product could indicate a brute-force solution

Estimating result sizes
- Need the expected size of each intermediate result
  [Figure: logical query plan over StarsIn and MovieStar]

Estimate the costs of each option
- From the logical query plan, consider the candidate plans P1, P2, …, Pn
- Estimate their costs C1, C2, …, Cn
- Pick the best!

Query optimization
- At two levels
  - Relational algebra level: uses equivalence rules
  - Detailed query plan level:
    - Takes into account result sizes
    - Considers DB organization: how it is stored, presence and types of indexes, …

Result sizes do matter
- Consider the Cartesian product
  - Very costly when its two operands are large tables
  - Less true when the tables are small

Equivalence rules for joins
- R ⋈ S = S ⋈ R
- (R ⋈ S) ⋈ T = R ⋈ (S ⋈ T)
- Column order does not matter because the columns have labels

Rules for product and union
- Equivalence rules for Cartesian product:
  - R × S = S × R
  - (R × S) × T = R × (S × T)
- Equivalence rules for union:
  - R ∪ S = S ∪ R
  - (R ∪ S) ∪ T = R ∪ (S ∪ T)
- Column order does not matter because the columns have labels

Rules for selections and unions
- Equivalence rules for selection:
  - σ p1 ∧ p2 (R) = σ p1 (σ p2 (R))
  - σ p1 ∨ p2 (R) = σ p1 (R) ∪ σ p2 (R)
- Equivalence rules for union:
  - R ∪ S = S ∪ R
  - (R ∪ S) ∪ T = R ∪ (S ∪ T)

Combining selections and joins
- If predicate p only involves attributes of R not used in the join: σ p (R ⋈ S) = σ p (R) ⋈ S
- If predicate q only involves attributes of S not used in the join: σ q (R ⋈ S) = R ⋈ σ q (S)
- Warning: π p1, p2 (R) is NOT the same as π p1 (π p2 (R))

Combining selection and joins
- σ p ∧ q (R ⋈ S) = σ p (R) ⋈ σ q (S)
- σ p ∧ q ∧ m (R ⋈ S) = σ m [σ p (R) ⋈ σ q (S)]
- σ p ∨ q (R ⋈ S) = [σ p (R) ⋈ S] ∪ [R ⋈ σ q (S)]

Combining projections and selections
- Let
  - x be a subset of R attributes
  - z the set of attributes of R used in predicate p
- then π x [σ p (R)] = π x [σ p [π xz (R)]]
- We can only eliminate attributes that are not used in the selection predicate!

Combining projections and joins
- Let
  - x be a subset of R attributes
  - y a subset of S attributes
  - z the common attributes of R and S
- then π xy (R ⋈ S) = π xy {[π xz (R)] ⋈ [π yz (S)]}

Combining projections, selections and joins
- Let
  - x, y, z be...
  - z' the union of z and the attributes used in predicate p
- then π xy {σ p (R ⋈ S)} = π xy {σ p ([π xz' (R)] ⋈ [π yz' (S)])}

Combining selections, projections and Cartesian product
- Rules are similar
- Just replace the join operator by the Cartesian product operator
  - Keep in mind that a join is a restricted Cartesian product

Combining selections and unions
- σ p (R ∪ S) = σ p (R) ∪ σ p (S)
- σ p (R − S) = σ p (R) − S = σ p (R) − σ p (S)

Finding the most promising transformations
- σ p1 ∧ p2 (R) → σ p1 [σ p2 (R)]
  - Use successive selections
- σ p (R ⋈ S) → [σ p (R)] ⋈ S
  - Do selections before joins
- R ⋈ S → S ⋈ R
- π x [σ p (R)] → π x {σ p [π xz (R)]}
  - Do projections before selection

First heuristics
- Do projections early
  - Example from [GM]: given R(A, B, C, D, E) and the select predicate P: (A = 3) ∧ (B = "cat")
  - Seems a good idea to replace π x {σ p (R)} by π E {σ p {π ABE (R)}}
- What if we have indexes?

Same example with indexes
- Assume attribute A is indexed
  - Use the index to locate all tuples where A = 3
  - Select the tuples where B = "cat"
  - Do the projections
- In other words, π x {σ p (R)} is the best solution

Second heuristics
- Do selections early
  - Especially if we can use indexes
- But no heuristic is always right

Estimating cost of query plans
- Requires
  - Estimating sizes of the results
  - Estimating the number of I/O operations
- We will generally assume that the cost of a query plan is dominated by the number of tuples being read or written

Estimating result sizes
- Relevant data are
  - T(R): number of tuples in R
  - S(R): size of each tuple of R (in bytes)
  - B(R): number of blocks required to store R
  - V(R, A): number of distinct values for attribute A in R

Example
Relation R:
Owner  Pet  Vax date
Alice  Cat  3/2/15
Alice  Cat  3/2/15
Bob    Dog  10/8/14
Bob    Dog  10/8/15
Carol  Dog  11/9/14
Carol  Cat  12/7/14
- T(R) = 6
- Assuming dates take 8 bytes and strings 20 bytes, S(R) = 48 bytes
- B(R) = 1 block
- V(R, Owner) = 3, V(R, Pet) = 2, V(R, Vax date) = 5

Estimating cost of W = R1 × R2
- T(W) = T(R1)×T(R2)
- S(W) = S(R1) + S(R2)
- Obvious!

Estimating cost of W = σ A=a (R)
- S(W) = S(R)
- T(W) = T(R)/V(R, A)
  - but this assumes that the values of A are uniformly distributed over all the tuples

Example
- W = σ Owner = "Bob" (R)
- As T(R) = 6 and V(R, Owner) = 3, T(W) = 6/3 = 2
Relation R:
Owner  Pet  Vax date
Alice  Cat  3/2/15
Alice  Cat  3/2/15
Bob    Dog  10/8/14
Bob    Dog  10/8/15
Carol  Dog  11/9/14
Carol  Cat  12/7/14

Making another assumption
- Assume now that the values in the select expression Z = val are uniformly distributed over all possible V(R, Z) values
- If W = σ Z=val (R), then T(W) = T(R)/V(R, Z)

Estimating sizes of range queries
- Attribute Z of table R has 50 possible values ranging from 1 to 100
- If W = σ Z > 80 (R), what is T(W)?
- Assuming the values of Z are uniformly distributed over [1, 100]: T(W) = T(R)×(100 – 80)/(100 – 1 + 1) = 0.2×T(R)

Explanation
- T(W) = T(R)×(Query_Range/Value_Range)
- If the query had been W = σ Z ≥ 80 (R), T(W) would have been T(R)×(100 – 80 + 1)/(100 – 1 + 1) = 0.21×T(R)
  - 21 possible values
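The selection estimates above can be written as two small functions. This is a sketch under the slides' uniformity assumption; the function names and the value T(R) = 1,000 used in the range examples are hypothetical.

```python
def estimate_equality(t_r, v_r_a):
    """T(W) for W = sigma A=a (R): T(R) / V(R, A), assuming uniform values."""
    return t_r / v_r_a

def estimate_range(t_r, low, high, q_low, q_high):
    """T(W) for a range selection on an integer attribute.

    [low, high] is the inclusive range of possible values and
    [q_low, q_high] the inclusive range of values accepted by the query.
    """
    value_range = high - low + 1
    query_range = q_high - q_low + 1
    return t_r * query_range / value_range

# Owner = "Bob" on the six-tuple pet relation with three distinct owners.
bob_estimate = estimate_equality(6, 3)

# Z > 80 accepts the values 81..100; Z >= 80 accepts 80..100.
T_R = 1000  # hypothetical table size
greater_than_80 = estimate_range(T_R, 1, 100, 81, 100)
at_least_80 = estimate_range(T_R, 1, 100, 80, 100)
```

The strict comparison covers 20 of the 100 values and the non-strict one covers 21, which is where the factors 0.2 and 0.21 come from.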

Estimating the size of R ⋈ S queries
- We consider R(X, Y) ⋈ S(Y, Z)
- Special cases:
  - R and S have disjoint values for Y: T(R ⋈ S) = 0
  - Y is the key of S and a foreign key in R: T(R ⋈ S) = T(R)
  - Almost all tuples of R and S have the same value for Y: T(R ⋈ S) = T(R)×T(S)

Estimating the size of R ⋈ S queries
- General case: will assume
  - Containment of values: if V(R, Y) ≤ V(S, Y), then all values of Y in R are also in S
  - Preservation of value sets: if A is an attribute of R that is not in S, then V(R ⋈ S, A) = V(R, A)

Estimating the size of R ⋈ S queries
- If V(R, Y) ≤ V(S, Y)
  - Every Y value of R is present in S
  - On average, a given tuple of R is likely to match T(S)/V(S, Y) tuples of S
  - R has T(R) tuples
  - T(R ⋈ S) = T(R)×T(S)/V(S, Y)

Estimating the size of R ⋈ S queries
- If V(R, Y) ≥ V(S, Y)
  - Every Y value of S is present in R
  - On average, a given tuple of S is likely to match T(R)/V(R, Y) tuples of R
  - S has T(S) tuples
  - T(R ⋈ S) = T(R)×T(S)/V(R, Y)

Estimating the size of R ⋈ S queries
- In general: T(R ⋈ S) = T(R)×T(S)/max(V(R, Y), V(S, Y))

An example (I)
- Finding all employees who live in a city where the company has a plant:
  - EMPLOYEE(EID, NAME, …, CITY)
  - PLANT(PLANTID, …, CITY)
- Either
    SELECT E.NAME FROM EMPLOYEE E, PLANT P WHERE E.CITY = P.CITY
  or
    SELECT EMPLOYEE.NAME FROM EMPLOYEE JOIN PLANT ON EMPLOYEE.CITY = PLANT.CITY

An example (II)
- Assume
  - T(E) = 5,000, V(E, CITY) = 100
  - T(P) = 200, V(P, CITY) = 50
- T(E ⋈ P) = T(E)×T(P)/max(V(E, CITY), V(P, CITY)) = 5,000×200/max(100, 50) = 1,000,000/100 = 10,000

Estimating the size of multiple joins
- R ⋈ S ⋈ U with R(A, B), S(B, C), U(C, D)
  - T(R) = 1,000, T(S) = 2,000, T(U) = 5,000
  - V(R, B) = 20, V(S, B) = 50
  - V(S, C) = 100, V(U, C) = 500
- Left to right:
  - T(R ⋈ S) = 2,000,000/max(20, 50) = 40,000
  - T(R ⋈ S ⋈ U) = 200,000,000/max(100, 500) = 400,000
- Right to left:
  - T(S ⋈ U) = 10,000,000/max(100, 500) = 20,000
  - T(R ⋈ S ⋈ U) = 20,000,000/max(20, 50) = 400,000
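The join-size formula and the two evaluation orders can be checked with a short sketch. The function name `estimate_join` is an assumption; the statistics are the ones given on the slide, and the intermediate results reuse V(S, C) and V(S, B) by preservation of value sets.

```python
def estimate_join(t_r, t_s, v_r_y, v_s_y):
    """T(R join S) = T(R) * T(S) / max(V(R, Y), V(S, Y))."""
    return t_r * t_s / max(v_r_y, v_s_y)

T_R, T_S, T_U = 1000, 2000, 5000
V_R_B, V_S_B = 20, 50
V_S_C, V_U_C = 100, 500

# Left to right: (R join S) join U.
t_rs = estimate_join(T_R, T_S, V_R_B, V_S_B)
left_to_right = estimate_join(t_rs, T_U, V_S_C, V_U_C)

# Right to left: R join (S join U).
t_su = estimate_join(T_S, T_U, V_S_C, V_U_C)
right_to_left = estimate_join(T_R, t_su, V_R_B, V_S_B)
```

Both orders give the same final estimate, but the intermediate results differ in size (40,000 against 20,000 tuples), which is what matters when ordering joins.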

Estimating the size of multicondition joins
- R(X, y1, y2, …) ⋈ S(y1, y2, …, Z)
- If V(R, y1) ≤ V(S, y1) and V(R, y2) ≤ V(S, y2) …
  - Every value of R is present in S
  - On average, a given tuple of R is likely to match T(S)/(V(S, y1)×V(S, y2) …) tuples of S
  - R has T(R) tuples
  - T(R ⋈ S) = T(R)×T(S)/(V(S, y1)×V(S, y2) …)

Multicondition join
- R(X, y1, y2, …) ⋈ S(y1, y2, …, Z)
- In general: T(R ⋈ S) = T(R)×T(S)/[max(V(R, y1), V(S, y1))×max(V(R, y2), V(S, y2))×…]

Multicondition join
- R(X, y1, y2, …) ⋈ S(y1, y2, …, Z)
- If V(R, y1) ≤ V(S, y1) and V(R, y2) ≤ V(S, y2) …
  - Every value of R is present in S
  - On average, a given tuple of R is likely to match T(S)/(V(S, y1)×V(S, y2)) tuples of S
  - R has T(R) tuples
  - T(R ⋈ S) = T(R)×T(S)/(V(S, y1)×V(S, y2))

Estimating the size of unions
- For a bag union: T(R ∪ S) = T(R) + T(S) (exact)
- For a regular union:
  - If the relations are disjoint: T(R ∪ S) = T(R) + T(S)
  - If one relation contains the other: T(R ∪ S) = max(T(R), T(S))
  - Estimate: T(R ∪ S) = (max(T(R), T(S)) + T(R) + T(S))/2
    - We take the average!

Estimating the size of intersections
- If the relations are disjoint: T(R ∩ S) = 0
- If one relation contains the other: T(R ∩ S) = min(T(R), T(S))
- Estimate: T(R ∩ S) = min(T(R), T(S))/2
  - We take the average!

Estimating the size of set differences
- If the relations are disjoint: T(R − S) = T(R)
- If relation R contains relation S: T(R − S) = T(R) − T(S)
- Estimate: T(R − S) = (2T(R) − T(S))/2
  - We take the average!
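Each of the three set-operation estimates is the midpoint between its two extreme cases (disjoint relations, and one relation containing the other). A minimal sketch, with hypothetical function names and hypothetical sizes T(R) = 1,000 and T(S) = 400:

```python
def estimate_union(t_r, t_s):
    """Midpoint between max(T(R), T(S)) and T(R) + T(S)."""
    return (max(t_r, t_s) + t_r + t_s) / 2

def estimate_intersection(t_r, t_s):
    """Midpoint between 0 and min(T(R), T(S))."""
    return min(t_r, t_s) / 2

def estimate_difference(t_r, t_s):
    """Midpoint between T(R) - T(S) and T(R), i.e. T(R) - T(S)/2."""
    return (t_r + (t_r - t_s)) / 2

T_R, T_S = 1000, 400  # hypothetical sizes
union_size = estimate_union(T_R, T_S)
intersection_size = estimate_intersection(T_R, T_S)
difference_size = estimate_difference(T_R, T_S)
```

With these sizes the union lies between 1,000 and 1,400 tuples, the intersection between 0 and 400, and the difference between 600 and 1,000.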

Estimating the cost of eliminating duplicates
- δ(R)
  - If all tuples are duplicates: T(δ(R)) = 1
  - If no tuples are duplicates: T(δ(R)) = T(R)
  - Estimate: T(δ(R)) = T(R)/2
- If R(a1, a2, …) and we know the V(R, ai): T(δ(R)) = Π i V(R, ai)

Collecting statistics
- Can explicitly request statistics
- Maintain them incrementally
- Can collect histograms
  - Give an idea of how the data are distributed
  - Not all patrons borrow an equal number of books

The Zipf distribution (I)
- Empirical distribution
  - Verified for many cases
- Ranks items by frequency/popularity
- If f is the probability of accessing/using the most popular item in the list (rank 1)
  - The probability of accessing/using the second most popular item will be close to f/2
  - The probability … third most popular item will be close to f/3

The Zipf distribution (II)
  [Figure: plot of the Zipf distribution, not recoverable from the transcript]

The Zipf distribution (III)
- Can adjust the slope of the curve by adding an exponent
- If f is the probability of accessing/using the most popular item in the list (rank 1)
  - The probability of accessing/using the n-th ranked item in the list will be close to f/n^i
  - i = ½ seems to be a good choice
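A small sketch of the Zipf probabilities with an adjustable exponent. The function name is hypothetical, and normalizing so that the probabilities over a finite list sum to 1 is an assumption not stated on the slides.

```python
def zipf_probabilities(n_items, exponent=1.0):
    """Return the access probabilities of items ranked 1..n_items.

    The item of rank n gets a weight 1 / n**exponent; weights are then
    normalized so that they sum to 1 (an assumption for finite lists).
    """
    weights = [1 / rank ** exponent for rank in range(1, n_items + 1)]
    total = sum(weights)
    return [w / total for w in weights]

classic = zipf_probabilities(4)          # f, f/2, f/3, f/4
flatter = zipf_probabilities(4, 0.5)     # f, f/sqrt(2), f/sqrt(3), f/2
```

With exponent 1 the second item is half as likely as the first; with exponent ½ the curve is flatter, and it takes the fourth item to reach half the probability of the first.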

Example (I)
- A library uses two tables to keep track of its books
  - Book(BID, Title, Authors, Loaned, Due)
  - Patron(PID, Name, Phone, Address)
- The Loaned attribute of a book is equal to
  - The PID of the patron who borrowed the book
  - Zero if the book is on the shelves

Example (II)
- We want to find the titles of all books currently loaned to "E. Chambas"
  - T(Book) = 5,000, V(Book, Loaned) = 500
  - T(Patron) = 500, V(Patron, Name) = 500

First plan
- X = Book ⋈ Loaned = PID Patron
- Y = σ Name = "E. Chambas" (X)
- Since PID is the key of Patron and assuming that Loaned were a foreign key in Book:
  - T(X) = T(Book) = 5,000 (all books are borrowed!)
  - T(Y) = T(X)/V(Patron, Name) = 5,000/500 = 10

Second plan
- X = σ Name = "E. Chambas" (Patron)
- Y = Book ⋈ Loaned = PID X
- T(X) = T(Patron)/V(Patron, Name) = 500/500 = 1
- T(Y) = T(Book)×T(X)/V(Book, Loaned) = 5,000×1/500 = 10

Comparing the two plans (I)
- Comparison based on the number of tuples created by the plan minus the number of tuples constituting the answer
  - The latter should be the same for all correct plans
- For the same reason, we do not consider the number of tuples being read

Comparing the two plans (II)
- Cost of first plan: 5,000
- Cost of second plan: 1

An example (I)
- Finding all employees who live in a city where the company has a plant:
  - EMPLOYEE(EID, NAME, …, CITY)
  - PLANT(PLANTID, …, CITY)
- Assume
  - T(E) = 5,000, V(E, CITY) = 100, V(E, NAME) = 5,000
  - T(P) = 200, V(P, CITY) = 50

A first plan
- X = E ⋈ E.CITY = P.CITY P
- Y = π E.NAME (X)
- T(X) = T(E)×T(P)/max(V(E, CITY), V(P, CITY)) = 5,000×200/max(100, 50) = 1,000,000/100 = 10,000
- T(Y) = 10,000 (not possible!)

A second plan
- X = π P.CITY (P)
- Y = δ(X)
- Z = E ⋈ E.CITY = Y.CITY Y
- U = π E.NAME (Z)
- T(X) = T(P) = 200
- T(Y) = V(X, CITY) = V(P, CITY) = 50
- T(Z) = T(E)×T(Y)/max(V(E, CITY), V(Y, CITY)) = 5,000×50/max(100, 50) = 2,500
- T(U) = T(Z) = 2,500

Comparing the two plans Here it pays off to eliminate duplicates early
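The estimates for the two employee/plant plans can be reproduced with the join-size formula. This is a sketch; `estimate_join` and the variable names are assumptions, and the statistics are those given in the example.

```python
def estimate_join(t_r, t_s, v_r_y, v_s_y):
    """T(R join S) = T(R) * T(S) / max(V(R, Y), V(S, Y))."""
    return t_r * t_s / max(v_r_y, v_s_y)

T_E, V_E_CITY = 5000, 100
T_P, V_P_CITY = 200, 50

# First plan: join EMPLOYEE and PLANT directly, then project the names.
first_plan_join = estimate_join(T_E, T_P, V_E_CITY, V_P_CITY)

# Second plan: project PLANT on CITY, remove duplicates, then join.
t_x = T_P            # projection keeps every tuple
t_y = V_P_CITY       # one tuple per distinct city after duplicate removal
second_plan_join = estimate_join(T_E, t_y, V_E_CITY, V_P_CITY)
```

The first plan's estimate exceeds the number of employees because each employee is matched once per plant in the same city; removing duplicate cities first keeps the join result at or below T(E).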

Example [GM]
- We have R(a, b) and S(b, c)
- We want δ(σ a = "a" (R ⋈ S))
- We know
  - T(R) = 5,000, T(S) = 2,000
  - V(R, a) = 50, V(R, b) = 100
  - V(S, b) = 200, V(S, c) = 100

First plan
- X1 = σ a = "a" (R), X2 = X1 ⋈ S, X3 = δ(X2)
- T(X1) = T(R)/V(R, a) = 5,000/50 = 100
- T(X2) = T(X1)×T(S)/max(V(R, b), V(S, b)) = 100×2,000/max(100, 200) = 1,000
- T(X3) = min(…, T(X2)/2) = 500 (doesn't count)

Second plan
- X1 = δ(R), X2 = δ(S), X3 = σ a = "a" (X1), X4 = X3 ⋈ X2
- T(X1) = min(V(R, a)×V(R, b), T(R)/2) = min(50×100, 5,000/2) = 2,500
- T(X2) = min(V(S, b)×V(S, c), T(S)/2) = min(200×100, 2,000/2) = 1,000
- T(X3) = T(X1)/V(R, a) = 2,500/50 = 50
- T(X4) = T(X3)×T(X2)/max(V(R, b), V(S, b)) = 50×1,000/max(100, 200) = 250

Comparing the two plans Here it did not pay off to eliminate duplicates early

A hybrid plan
- X1 = σ a = "a" (R), X2 = δ(X1), X3 = X2 ⋈ S, X4 = δ(X3)
- T(X1) = T(R)/V(R, a) = 5,000/50 = 100
- T(X2) = min(V(R, b), T(X1)/2) = min(100, 50) = 50
- T(X3) = T(X2)×T(S)/max(V(R, b), V(S, b)) = 50×2,000/max(100, 200) = 500
- T(X4) = min(…, T(X3)/2) = 250

Comparing the two best plans Reducing the sizes of the tables in a join is a good idea if we can do it on the cheap
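The three plans for the [GM] example can be compared numerically. In this sketch the helper names are hypothetical, and the plan totals are derived here (they are not given on the slides) by summing the estimated sizes of the intermediate results and leaving out the final result, following the comparison rule stated earlier.

```python
from math import prod

def select_eq(t, v):
    """Equality selection: T / V."""
    return t / v

def join(t1, t2, v1, v2):
    """Join on one attribute: T1 * T2 / max(V1, V2)."""
    return t1 * t2 / max(v1, v2)

def distinct(t, *value_counts):
    """Duplicate elimination: min(product of the V's, T / 2)."""
    return min(prod(value_counts), t / 2)

T_R, T_S = 5000, 2000
V_R_A, V_R_B, V_S_B, V_S_C = 50, 100, 200, 100

# First plan: select, join, then delta (the final result is not counted).
f1 = select_eq(T_R, V_R_A)
f2 = join(f1, T_S, V_R_B, V_S_B)
first_cost = f1 + f2

# Second plan: delta on both tables, select, then join (final).
s1 = distinct(T_R, V_R_A, V_R_B)
s2 = distinct(T_S, V_S_B, V_S_C)
s3 = select_eq(s1, V_R_A)
second_cost = s1 + s2 + s3

# Hybrid plan: select, delta, join, then delta (final).
h1 = select_eq(T_R, V_R_A)
h2 = distinct(h1, V_R_B)
h3 = join(h2, T_S, V_R_B, V_S_B)
hybrid_cost = h1 + h2 + h3
```

Under these assumptions the hybrid plan creates the fewest intermediate tuples, because its early duplicate elimination runs on the 100 tuples left by the selection rather than on the full tables.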

Ordering joins
- Join methods are often asymmetric, so cost(R ⋈ S) ≠ cost(S ⋈ R)
- Useful to build a join tree
- A simple greedy algorithm will work well:
  - Start with the pair of relations whose estimated join size will be the smallest
  - Find among the other relations the one that would produce the smallest estimated size when joined to the current tree
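One way to sketch the greedy algorithm, using the R, S, U statistics from the multiple-join example. The representation of a relation as a (tuple count, {attribute: distinct values}) pair, the function names, and the rule that a join attribute keeps the smaller of the two V values (containment of values) are all assumptions of this sketch.

```python
from itertools import combinations

def estimate_join(left, right):
    """Estimate the size and value counts of the join of two relations.

    Each relation is a pair (tuple_count, {attribute: distinct_values}).
    Relations without a common attribute yield a Cartesian product.
    """
    t_l, v_l = left
    t_r, v_r = right
    common = v_l.keys() & v_r.keys()
    size = t_l * t_r
    for attr in common:
        size /= max(v_l[attr], v_r[attr])
    merged = {**v_l, **v_r}
    for attr in common:
        merged[attr] = min(v_l[attr], v_r[attr])  # containment of values
    return size, merged

def greedy_join_order(relations):
    """Return a join order and the estimated size of the final result."""
    names = sorted(relations)
    # Start with the pair whose estimated join size is the smallest.
    first, second = min(
        combinations(names, 2),
        key=lambda p: estimate_join(relations[p[0]], relations[p[1]])[0],
    )
    order = [first, second]
    current = estimate_join(relations[first], relations[second])
    remaining = set(names) - {first, second}
    # Repeatedly add the relation giving the smallest estimated result.
    while remaining:
        best = min(
            sorted(remaining),
            key=lambda n: estimate_join(current, relations[n])[0],
        )
        current = estimate_join(current, relations[best])
        order.append(best)
        remaining.remove(best)
    return order, current[0]

relations = {
    "R": (1000, {"B": 20}),
    "S": (2000, {"B": 50, "C": 100}),
    "U": (5000, {"C": 500}),
}
order, final_size = greedy_join_order(relations)
```

The sketch starts from S ⋈ U, the smallest pairwise estimate, and only then brings in R; it ranks orders by estimated size alone and does not model the asymmetry of the join methods themselves.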

Implementing joins

A. Nested loops
    W = []
    for r in R:
        for s in S:
            if match_found(r, s):
                W.append(r + s)    # concatenated rows
- Number of operations: T(R)×T(S)

The idea
- Try to match every tuple of R with all tuples of S
  [Figure: Table R and Table S]

Optimization
- Assume that the second relation can fit in main memory
  - Read only once
  - Number of reads is T(R) + T(S)

B. Sort and merge
- We sort the two tables using the matching attributes as sorting keys
- Can now find the matches by doing a merge
  - Single-pass process unless we have duplicate matches
  - Number of operations is O(T(R) log T(R)) + O(T(S) log T(S)) + T(R) + T(S), assuming one table does not have potential duplicate matches
  - Great if the tables are already sorted
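A minimal in-memory sketch of a sort-merge join. The function name and the list-of-dicts tables are assumptions; runs of equal keys on both sides are handled by pairing every R row of the run with every S row of the run, which is the case where the merge stops being a single pass.

```python
def sort_merge_join(R, S, r_key, s_key):
    """Join R and S on R[r_key] == S[s_key] by sorting, then merging."""
    r_sorted = sorted(R, key=lambda row: row[r_key])
    s_sorted = sorted(S, key=lambda row: row[s_key])
    result = []
    i = j = 0
    while i < len(r_sorted) and j < len(s_sorted):
        r_val = r_sorted[i][r_key]
        s_val = s_sorted[j][s_key]
        if r_val < s_val:
            i += 1
        elif r_val > s_val:
            j += 1
        else:
            # Find the end of the run of S rows sharing this key.
            j_end = j
            while j_end < len(s_sorted) and s_sorted[j_end][s_key] == r_val:
                j_end += 1
            # Pair every R row of the run with every S row of the run.
            while i < len(r_sorted) and r_sorted[i][r_key] == r_val:
                for s in s_sorted[j:j_end]:
                    result.append({**r_sorted[i], **s})
                i += 1
            j = j_end
    return result

R = [
    {"A": 1, "B": "x", "C": 100},
    {"A": 2, "B": "y", "C": 200},
    {"A": 3, "B": "z", "C": 300},
    {"A": 4, "B": "y", "C": 400},
]
S = [
    {"D": 1, "E": "u", "F": 10},
    {"D": 3, "E": "v", "F": 30},
    {"D": 5, "E": "w", "F": 30},
]
matches = sort_merge_join(R, S, "A", "D")
```

On the sample tables the merge advances each cursor at most once per row, since neither A nor D contains duplicates.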

C. Hashing
- Assume both tables maintain a hash with K entries for the matching attributes
    for i in range(K):
        join all R entries in bucket i with all S entries in the same bucket
- We replace a big join by K smaller joins
- Number of operations will be: K×(T(R)/K)×(T(S)/K) = T(R)×T(S)/K
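A sketch of the bucket-by-bucket join. The function name, the default K = 4 and the use of Python's built-in `hash` to assign rows to buckets are assumptions; a real DBMS would rely on its own hash structures.

```python
def hash_join(R, S, r_key, s_key, k=4):
    """Join R and S on R[r_key] == S[s_key] using k hash buckets."""
    r_buckets = [[] for _ in range(k)]
    s_buckets = [[] for _ in range(k)]
    for r in R:
        r_buckets[hash(r[r_key]) % k].append(r)
    for s in S:
        s_buckets[hash(s[s_key]) % k].append(s)
    result = []
    # Matching rows always fall in the same bucket, so each bucket of R
    # only needs to be compared with the bucket of S having the same index.
    for i in range(k):
        for r in r_buckets[i]:
            for s in s_buckets[i]:
                if r[r_key] == s[s_key]:
                    result.append({**r, **s})
    return result

R = [
    {"A": 1, "B": "x", "C": 100},
    {"A": 2, "B": "y", "C": 200},
    {"A": 3, "B": "z", "C": 300},
    {"A": 4, "B": "y", "C": 400},
]
S = [
    {"D": 1, "E": "u", "F": 10},
    {"D": 3, "E": "v", "F": 30},
    {"D": 5, "E": "w", "F": 30},
]
matches = hash_join(R, S, "A", "D")
```

The equality test inside each bucket is still needed because different key values can share a bucket.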
