CS346: Advanced Databases Graham Cormode Query Planning and Optimization.


1 CS346: Advanced Databases Graham Cormode Query Planning and Optimization

2 Outline
Chapter: “Algorithms for Query Processing and Optimization” in Elmasri and Navathe
Background: “Basic SQL” and “More SQL” in Elmasri and Navathe, or “SQL in ten minutes” etc.
• Query patterns: block (nested) queries
• External sorting
• Techniques for different query operators
  – SELECT, JOIN, PROJECT
• Query optimization: choosing between different query plans
  – Heuristic (rule-based) versus cost-based query optimization

3 Why?
• Understand the separation between query specification and query execution
• See the process of deciding how to execute a query
• Understand the different choices in query computation
• Connection to many data processing tasks
  – Sorting, indexing and joins are ubiquitous in data processing and analytics
• Connects theory and practice
  – Use properties of relational algebra to optimize queries

4 Query Processing and Optimization
• SQL is a high-level declarative language
  – Say what result you want, not how to get it
  – There can be more than one way to process a query to get results
• Basic example: find maximum value of an attribute
  – Simple approach: just scan through all records
  – Use an index (if it exists): jump to end of index to find max value
• Gets more involved when more factors are in play
  – Multiple indexes, complex (nested) queries, joins, ...
• A query plan is the sequence of operations to answer a query
  – Query Processing & Optimization: choosing a good query plan
  – Not necessarily “the best” query plan: may take too long to find

5 From Query to Results (figure; annotation: “We will focus on this part”)

6 Query Representation: SQL and Algebra
• Arbitrary queries can be quite complex
  – How to automatically make query plans?
• Observation: queries are highly structured
  – Basic structures repeated at multiple levels
• Break SQL query into query blocks
  – A query block contains a single SELECT-FROM-WHERE expression
  – Optionally may include GROUP BY and HAVING clauses
  – Nested queries are parsed into separate query blocks
• Convert into relational algebra
  – Relational operators: σ (selection), π (projection), ⋈ (join)
  – Since SQL has aggregation (SUM, MAX), algebra also includes these

7 Example parsing into blocks
Full query:
  SELECT LNAME, FNAME FROM EMPLOYEE WHERE SALARY > (SELECT MAX (SALARY) FROM EMPLOYEE WHERE DNO = 5);
Inner block:
  SELECT MAX (SALARY) FROM EMPLOYEE WHERE DNO = 5
  ℱ MAX SALARY (σ DNO=5 (EMPLOYEE))
Outer block (C stands for the result of the inner block):
  SELECT LNAME, FNAME FROM EMPLOYEE WHERE SALARY > C
  π LNAME, FNAME (σ SALARY>C (EMPLOYEE))

8 Approach to Query Processing
• Divide-and-conquer: break the problem into standard pieces
  – Define different ways to implement each relational operator: SELECT, PROJECT, JOIN
  – Choose how to combine these ways based on estimated costs
  – The set of choices forms the query plan
• We will look at each basic operator in turn
• Start with a more basic primitive: sorting
  – Many operators can be done more efficiently if input is sorted

9 External Sorting
• In algorithms, mostly study in-memory sorting
  – How to sort big data files larger than memory?
• External sorting algorithms
  – External: data resides outside the memory
  – Typically follow a “sort-merge” strategy
    Sort pieces of the file, merge these together to get final result
    Sort blocks as large as possible in memory
  – “External memory model”: measure cost as number of disk accesses
    In-memory operations are (treated as) effectively free
    In contrast to RAM model: measure cost as number of operations

10 Sort-merge algorithm
• Parameters of the sorting procedure:
  – b: number of blocks in the file to be sorted
  – n_B: available buffer space (measured in blocks)
  – n_R: number of runs (pieces) of the file produced
    n_R = ⌈b/n_B⌉: break file into pieces that fit into memory
• Small example: b = 1024, n_B = 5, need 205 runs of 5 blocks
• After first pass, have n_R sorted pieces of n_B blocks: merge them together
  – In-memory mergesort: merge two at a time (two-way merge)
  – External sorting: merge as many as possible
    Limiting factor: smaller of (n_B − 1) and n_R

11 Merge phases
• After one merge phase, now have runs (sorted pieces) of length about n_B^2 blocks
  – Can merge these together, until one sorted file emerges
  – Each merge phase is a “pass”: requires reading the whole data
  – Number of passes is small: about log n_R / log n_B
  – Can be much better than binary merging (log_2 n_R) if n_B is large
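The run and pass counts above can be checked with a short calculation. This is a minimal sketch, not a sorting implementation: the function name `external_sort_plan` is hypothetical, and it assumes each merge step combines min(n_B − 1, n_R) runs, which is the limiting factor stated on the sort-merge slide (so the logarithm base is n_B − 1 rather than n_B).

```python
import math

def external_sort_plan(b: int, n_b: int) -> dict:
    """Estimate initial runs and merge passes for an external sort-merge.

    b   -- number of blocks in the file
    n_b -- number of buffer blocks available in memory
    """
    if n_b < 3:
        raise ValueError("need at least 3 buffers (2 inputs + 1 output) to merge")
    n_r = math.ceil(b / n_b)        # runs produced by the initial sort phase
    d_m = min(n_b - 1, n_r)         # runs merged together in each merge step
    passes, runs = 0, n_r
    while runs > 1:                 # each pass reads and rewrites the whole file
        runs = math.ceil(runs / d_m)
        passes += 1
    return {"runs": n_r, "merge_degree": d_m, "merge_passes": passes}

plan = external_sort_plan(b=1024, n_b=5)
print(plan)
```

For the slide's small example (b = 1024, n_B = 5) the 205 initial runs shrink to 52, 13, 4 and then 1 run, so four merge passes suffice, against eight for two-way merging.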

12 Digression: Sorting numbers
• Considerable effort invested in optimizing sorting data!
• Several competitive sort benchmarks (sortbenchmark.org/)
  – GraySort: Fastest time to sort 100TB (2014: < half an hour)
  – MinuteSort: Most data sorted in 1 minute (3.7TB)
  – JouleSort: Minimum energy to sort data (100K records/joule)
  – PennySort: Amount of data sorted using 1c of system time (300GB), based on depreciating cost of system over 3 years
• Original test: sort 100 million records (100MB)
  – 1985 time: 1 hour
  – By 2000, <1 second
• Current winners are highly distributed, parallel systems

13 SELECT Operator
• SELECT operator: find all records that meet a condition
• Examples:
  – OP1: σ SSN=' ' (EMPLOYEE)
  – OP2: σ DNUMBER>5 (DEPARTMENT)
  – OP3: σ DNO=5 (EMPLOYEE)
  – OP4: σ DNO=5 AND SALARY>30000 AND SEX=F (EMPLOYEE)
  – OP5: σ ESSN= AND PNO=10 (WORKS_ON)
• Simple, right?

14 SELECT operation (basic conditions)
Surprisingly many ways to SELECT records!
• S1 Linear search (brute force):
  – Retrieve every record in the file
  – Test whether the attribute values satisfy the selection condition
• S2 Binary search:
  – If selection condition involves an equality comparison on a key attribute on which the file is ordered, can binary search (e.g. OP1, σ SSN=' ' (EMPLOYEE))
• S3 Use a primary index or hash key to retrieve one record:
  – If selection condition has an equality comparison on a key attribute with a primary index (or a hash key)
  – Use the primary index (or the hash key) to retrieve the record

15 SELECT operation (basic conditions)
• S4 Use a primary index to retrieve multiple records:
  – If comparison condition is >, ≥, <, or ≤ on field with primary index
  – Use index to find record satisfying corresponding equality condition
  – Then retrieve all subsequent records in the (ordered) file
• S5 Use a clustering index to retrieve multiple records:
  – If selection condition involves equality comparison on a non-key attribute with a clustering index
  – Use the index to retrieve records satisfying selection condition
• S6 Use a secondary (B+-tree) index:
  – If indexing field is a (candidate) key, use index to retrieve the match
  – Retrieve multiple records if the indexing field is not a key
  – Can also use index to retrieve records on conditions with >, ≥, <, ≤
  – Range queries: e.g. < Salary <

16 SELECT operation (complex conditions)
Things get more complex when conditions are conjunctions (AND) of several simple conditions
E.g. OP4: σ DNO=5 AND SALARY>30000 AND SEX=F (EMPLOYEE)
• S7 Conjunctive selection:
  – If an attribute in any part of the condition has an access path allowing S2-S6, use that condition to retrieve the records, then check if each retrieved record satisfies the whole condition
• S8 Conjunctive selection using a composite index:
  – If two or more attributes are involved in equality conditions and a composite index (or hash structure) exists on the combined field, we can use it directly

17 SELECT operation
• S9 Conjunctive selection by intersecting record pointers:
  – If there are secondary indexes on all (or some) fields involved in equality comparison conditions in the conjunctive condition
  – Each index can be used to retrieve the record pointers that satisfy each individual condition
  – Intersecting these pointers gives the pointers that satisfy the conjunctive condition, which can then be retrieved
  – If only some conditions have secondary indexes, test each retrieved record against the full condition
  – Not necessary when one condition is on a key value. Why?

18 Choosing a method
• 9 possible selection methods (and counting)
  – How to choose between them?
  – How should DBMS (automatically) choose one to use?
• Easy case: single SELECT condition (OP1, OP2, OP3):
  – Either there exists a suitable access path: use it!
  – Or not: linear scan (S1)
• Harder case: conjunctive SELECT condition (OP4, OP5):
  – If only one method has an access path: use it! (S7)
  – If more than one access path: pick method yielding fewest records
    But how can we know this without actually doing it?
Reference:
  OP1: σ SSN=' ' (EMPLOYEE)
  OP2: σ DNUMBER>5 (DEPARTMENT)
  OP3: σ DNO=5 (EMPLOYEE)
  OP4: σ DNO=5 AND SALARY>30000 AND SEX=F (EMPLOYEE)
  OP5: σ ESSN= AND PNO=10 (WORKS_ON)

19 Selectivity estimation
• DBMS keeps statistics on relations, uses them to estimate costs
• The selectivity of a condition is the fraction of records that satisfy it
  – Zero selectivity: none satisfy the condition
  – Selectivity of 1: every record satisfies the condition
• DBMS estimates selectivity of each part of a condition
  – Equality on key attribute: at most one record can match
  – Equality on non-key attribute with d distinct values: assume a fraction 1/d of records match (uniformity assumption)
  – Range query for a fraction f of possible values: assume selectivity f
• Pick method retrieving fewest records based on estimated selectivity
  – Much work on estimating selectivity based on samples, histograms…
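The estimation rules above can be written down directly. The following is a minimal sketch under the slide's uniformity assumption; the function names are hypothetical, and the statistics (10,000 EMPLOYEE records, 125 distinct Dno values, 2 distinct Sex values) are the ones used in the deck's later cost example. The salary bounds passed to `range_selectivity` are made up for illustration.

```python
from fractions import Fraction

def equality_selectivity(r: int, d: int, is_key: bool = False) -> Fraction:
    """Selectivity of 'attribute = constant' on a file of r records."""
    if is_key:
        return Fraction(1, r)      # at most one record can match
    return Fraction(1, d)          # uniformity assumption over d distinct values

def range_selectivity(low, high, min_value, max_value) -> Fraction:
    """Fraction of the value range [min_value, max_value] covered by [low, high]."""
    return Fraction(high - low, max_value - min_value)

r = 10_000  # number of EMPLOYEE records
estimated_matches = {
    "SSN = const (key)": equality_selectivity(r, d=r, is_key=True) * r,
    "Dno = 5": equality_selectivity(r, d=125) * r,
    "Sex = 'F'": equality_selectivity(r, d=2) * r,
}
# For a conjunctive condition, start from the condition expected to return the fewest records.
most_selective = min(estimated_matches, key=estimated_matches.get)
print(estimated_matches, most_selective)
```

Exact fractions are used only to keep the arithmetic free of rounding; a real optimizer would typically work with floating-point estimates refined by histograms or samples.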

20 Disjunctive selection
• Conditions with OR (disjunctive) are harder to optimize
  – E.g. σ Dno=5 OR Salary > OR Sex = ‘F’ (EMPLOYEE)
  – Results are all records satisfying any of the simple conditions
  – If any of the conditions has no access path, must linear scan
  – If every condition has access path, can use each in turn
    Need to eliminate duplicates (later)

21 The JOIN operation
• JOIN is a costly operation to perform
• Most examples of join are EQUIJOIN (or NATURAL JOIN)
  – Can be two-way join: a join on two files, i.e. R ⋈ A=B S
  – Or multi-way joins: joins involving more than two files, i.e. R ⋈ A=B S ⋈ C=D T
• Examples
  – OP6: EMPLOYEE ⋈ DNO=DNUMBER DEPARTMENT
  – OP7: DEPARTMENT ⋈ MGRSSN=SSN EMPLOYEE

22 Implementing JOIN
As with SELECT, many ways to perform JOIN
• J1 Nested-loop join (brute force):
  – For each record t in R (outer loop):
    Retrieve every record s from S (inner loop)
    Test if the two records satisfy the join condition t[A] = s[B]
• J2 Single-loop join (use an access structure to retrieve matching records):
  – If index (or hash key) exists for one join attribute, say B of S:
    Retrieve each record t in R, one at a time
    Use access structure to retrieve matching records s in S with s[B] = t[A]

23 Implementing JOIN
• J3 Sort-merge join:
  – If R and S are physically sorted (ordered) by value of the join attributes A and B, respectively, sort-merge join is very efficient
  – Both files are scanned in order of the join attributes, matching the records that have the same values for A and B
  – Each file is scanned only once for matching with the other file
    If A and B are both non-key, need to generate all matching pairs
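The merge step of J3, including the non-key case where all matching pairs must be generated, can be sketched in memory as follows. This is an illustration only: records are modelled as Python dicts, the function name and sample data are hypothetical, and a real DBMS would scan sorted disk blocks rather than lists.

```python
def sort_merge_join(r, s, a, b):
    """Equijoin of record lists r and s on r[a] = s[b] using sort then merge."""
    r = sorted(r, key=lambda t: t[a])   # skip if the files are already ordered
    s = sorted(s, key=lambda t: t[b])
    out, i, j = [], 0, 0
    while i < len(r) and j < len(s):
        if r[i][a] < s[j][b]:
            i += 1
        elif r[i][a] > s[j][b]:
            j += 1
        else:
            value = r[i][a]
            # Find the group of records on each side sharing this join value.
            i_end = i
            while i_end < len(r) and r[i_end][a] == value:
                i_end += 1
            j_end = j
            while j_end < len(s) and s[j_end][b] == value:
                j_end += 1
            # Non-key attributes: emit every pairing within the two groups.
            for t in r[i:i_end]:
                for u in s[j:j_end]:
                    out.append({**t, **u})
            i, j = i_end, j_end
    return out

R = [{"X": "x1", "A": 1}, {"X": "x2", "A": 2}, {"X": "x3", "A": 2}]
S = [{"Y": "y1", "B": 2}, {"Y": "y2", "B": 2}, {"Y": "y3", "B": 3}]
joined = sort_merge_join(R, S, "A", "B")
print([(row["X"], row["Y"]) for row in joined])
```

Each input is traversed once by the merge; only groups with duplicate join values are revisited when the pairs are produced.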

24 Implementing JOIN
• J4 Hash-join:
  – The records of R and S are both hashed with the same hash function, using the join attributes A of R and B of S as hash keys
  – A single pass through the file with fewer records (say, R) hashes its records to the hash file buckets
  – A single pass through the other file (S) then hashes each of its records to the appropriate bucket, where the record is combined with all matching records from R

25 JOIN performance
• Nested-loop join can be very expensive
• Example: (OP6) EMPLOYEE ⋈ DNO=DNUMBER DEPARTMENT
  – Suppose n_B = 7 disk blocks are available as file buffers
  – Suppose DEPARTMENT has r_D = 50 records in b_D = 10 disk blocks
  – And EMPLOYEE has r_E = 6000 records in b_E = 2000 disk blocks
• Read as many blocks from the outer loop file as possible (n_B − 2)
  – Read 1 block at a time from inner loop file, look for matches
  – Fill one further block with records in the join, flush to disk when full
• Which file is ‘inner’ and which is ‘outer’ makes a difference

26 Nested Loop Example
• Using EMPLOYEE for the outer loop: EMPLOYEE read once, DEPARTMENT read ⌈b_E/(n_B − 2)⌉ times
  – Read EMPLOYEE once: b_E blocks
  – Read DEPARTMENT ⌈b_E/(n_B − 2)⌉ times: b_D * ⌈b_E/(n_B − 2)⌉ blocks
  – In our example: 2000 + 10 * ⌈2000/5⌉ = 6000 block reads
• Using DEPARTMENT for the outer loop: DEPARTMENT read once, EMPLOYEE read ⌈b_D/(n_B − 2)⌉ times
  – Read DEPARTMENT once: b_D blocks
  – Read EMPLOYEE ⌈b_D/(n_B − 2)⌉ times: b_E * ⌈b_D/(n_B − 2)⌉ blocks
  – In our example: 10 + 2000 * ⌈10/5⌉ = 4010 block reads
• Not counting the cost of writing results to disk
Data: DEPARTMENT: r_D = 50 records, b_D = 10 blocks; EMPLOYEE: r_E = 6000 records, b_E = 2000 blocks
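The two block-read totals can be reproduced with a small helper. This is a sketch of the counting argument only; the function name is hypothetical, and, as on the slide, the cost of writing the result is ignored.

```python
import math

def nested_loop_block_reads(b_outer: int, b_inner: int, n_b: int) -> int:
    """Block reads for a nested-loop join with n_b buffer blocks.

    n_b - 2 buffers hold a chunk of the outer file, one buffer reads the
    inner file block by block, and one buffer collects output.
    """
    chunks = math.ceil(b_outer / (n_b - 2))   # times the inner file is scanned
    return b_outer + chunks * b_inner

b_employee, b_department, n_b = 2000, 10, 7
employee_outer = nested_loop_block_reads(b_employee, b_department, n_b)
department_outer = nested_loop_block_reads(b_department, b_employee, n_b)
print(employee_outer, department_outer)
```

Putting the smaller file in the outer loop wins here because it needs only two chunks, so the large file is scanned twice instead of the small file being scanned 400 times.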

27 JOIN selection factor
• JOIN selection factor: what fraction of records in one file join?
  – Particularly relevant for J2 single-loop join (with access path)
• Consider OP7: DEPARTMENT ⋈ MGRSSN=SSN EMPLOYEE
  – Assume 50 DEPARTMENT records, 6000 EMPLOYEE records
• Option 1: for each EMPLOYEE record, look up in DEPARTMENT
  – Assume a 2-level index on MGRSSN for DEPARTMENT
  – Approximate cost: b_E + (r_E * 3) = 2000 + 6000 * 3 = 20,000 blocks
  – Join selection factor is (50/6000) ≈ 0.008
• Option 2: for each DEPARTMENT record, look up in EMPLOYEE
  – Assume a 4-level index on SSN for EMPLOYEE
  – Approximate cost: b_D + (r_D * 5) = 10 + 50 * 5 = 260 blocks
  – Join selection factor is 1
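A short calculation makes the gap between the two options concrete. This sketch assumes, as the slide's formulas do, that each index lookup costs one block per index level plus one block for the record; the function name is hypothetical.

```python
from fractions import Fraction

def single_loop_join_cost(b_outer: int, r_outer: int, index_levels: int) -> int:
    """Read the outer file once, then do one index lookup per outer record."""
    return b_outer + r_outer * (index_levels + 1)

# Option 1: EMPLOYEE is the outer file, probing a 2-level index on DEPARTMENT.
option1 = single_loop_join_cost(b_outer=2000, r_outer=6000, index_levels=2)
# Option 2: DEPARTMENT is the outer file, probing a 4-level index on EMPLOYEE.
option2 = single_loop_join_cost(b_outer=10, r_outer=50, index_levels=4)

# Fraction of EMPLOYEE records that find a match: only the 50 managers join.
employee_selection_factor = Fraction(50, 6000)
print(option1, option2, float(employee_selection_factor))
```

Option 1 performs 6000 lookups of which only 50 succeed, while in option 2 every DEPARTMENT lookup finds its manager, which is what a join selection factor of 1 expresses.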

28 Sort-merge JOIN efficiency
• Sort-merge JOIN is very efficient if files are already sorted
  – A single pass through each
  – For OP6 and OP7, need b_E + b_D = 2000 + 10 = 2010 block reads
• If the files are not sorted on join attribute, can sort them
  – Cost is estimated as (b_E log b_E + b_D log b_D + b_E + b_D)
  – The base of the logs depends on space available to sort
  – DBMS can easily find the estimated cost

29 Partition hash JOIN
• Hash join is most efficient if one file can be kept in memory
  – Store hash table in memory, then read through larger file
• Partition hash JOIN:
  – A partitioning hash function on join attribute splits each file into M pieces
    Use M blocks as buffers. Read R and S, write out R_1...R_M, S_1...S_M
  – Then join each of the M pieces of both files: join R_i with S_i
    Can use whatever method is preferred
    Nested-loop join is now efficient if one piece fits in memory
  – Records with different partitioning hash values cannot join
• Cost is low if M is chosen suitably and hash function works well
  – Read each file, write out partitions, then read to join: 3 * (b_R + b_S)
  – For OP6, this is 3 * (2000 + 10) = 6030 block accesses
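The two phases of partition hash join can be sketched in memory as below. This is an illustration under simplifying assumptions: partitions are Python lists rather than files on disk, Python's built-in `hash` stands in for the partitioning hash function, and all function names and sample records are hypothetical.

```python
def partition(records, key, m):
    """Split records into m pieces by hashing the join attribute."""
    pieces = [[] for _ in range(m)]
    for rec in records:
        pieces[hash(rec[key]) % m].append(rec)
    return pieces

def partition_hash_join(r, s, a, b, m=4):
    """Equijoin r[a] = s[b]: partition both inputs, then join piece by piece."""
    out = []
    for r_i, s_i in zip(partition(r, a, m), partition(s, b, m)):
        table = {}                      # in-memory hash table on the piece of R
        for t in r_i:
            table.setdefault(t[a], []).append(t)
        for u in s_i:                   # probe with the matching piece of S
            for t in table.get(u[b], ()):
                out.append({**t, **u})
    return out

def partition_hash_join_cost(b_r: int, b_s: int) -> int:
    """Read both files, write the partitions, read the partitions again."""
    return 3 * (b_r + b_s)

DEPARTMENT = [{"Dnumber": 1, "Dname": "HQ"}, {"Dnumber": 5, "Dname": "Research"}]
EMPLOYEE = [{"Lname": "Smith", "Dno": 5}, {"Lname": "Wong", "Dno": 5},
            {"Lname": "Borg", "Dno": 1}, {"Lname": "Ghost", "Dno": 9}]
rows = partition_hash_join(DEPARTMENT, EMPLOYEE, "Dnumber", "Dno")
print(sorted((row["Lname"], row["Dname"]) for row in rows))
print(partition_hash_join_cost(2000, 10))
```

Because both inputs use the same partitioning function, a record in piece i of one file can only match records in piece i of the other, so the pieces are joined independently.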

30 PROJECT operation
• PROJECT operation picks out certain attributes from each record
  – π <attribute list> (R)
  – Why is this not completely trivial?
• Relational semantics do not allow duplicates
  – PROJECT can create duplicates, so need to remove these
  – If attribute list includes a key of the relation, there will be no duplicates
• Two standard approaches to duplicate elimination:
  – Sort the results, then scan through to find duplicates
  – Hash the results: duplicates hash to same bucket, and can be dropped
• SQL does not require elimination of duplicates by default
  – Must be dropped only if the query includes the DISTINCT keyword

31 Set Operations
• Other operations: UNION, INTERSECTION, SET DIFFERENCE
  – Require the input relations to be compatible (same attributes)
  – All can be implemented with variations of Sort-Merge
• UNION, R ∪ S: scan through sorted inputs, copy results to output
  – Suppress duplicate tuples
• INTERSECTION, R ∩ S: scan through sorted inputs
  – Only copy tuples that appear in both inputs
• SET DIFFERENCE (EXCEPT in SQL), R / S:
  – Scan through sorted inputs, keep those in R but not in S
• Can also be implemented via hashing
  – Exercise: how would you use hashing to implement UNION?

32 Aggregate Operators
• Aggregate operators in SQL: MIN, MAX, COUNT, AVG, SUM
• Linear scan works for all
  – Keep track of running sum, count or max/min
• MAX, MIN: Indexes (on target attribute) are helpful
  – Follow the index to find the max/min value
  – E.g. SELECT MAX (SALARY) FROM EMPLOYEE;
• SUM, COUNT, AVG:
  – If there is a dense index on target attribute, compute from index
  – If the index is sparse (e.g. cluster index), need extra information
    E.g. keep the number of corresponding records with index entry
    Needs extra effort to keep this information up to date

33 Aggregate operators and GROUP BY
• Many aggregate queries include a GROUP BY
  – SELECT Dno, AVG(Salary) FROM EMPLOYEE GROUP BY Dno;
  – Apply the aggregate separately to each group of tuples
• Use sorting/hashing on grouping attribute (Dno) to partition
  – Then apply aggregate on group (linear scan)
• If there is a clustering index on the grouping attribute, then records are already partitioned correctly
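The hashing approach to GROUP BY can be sketched for the AVG(Salary) query above. This is an in-memory illustration with made-up employee records; it keeps a running sum and count per group, which is one common way to do it rather than the only one.

```python
from collections import defaultdict

def avg_salary_by_dno(employees):
    """Hash-based GROUP BY Dno with AVG(Salary), in a single linear scan."""
    totals = defaultdict(lambda: [0, 0])    # Dno -> [running sum, running count]
    for e in employees:
        group = totals[e["Dno"]]
        group[0] += e["Salary"]
        group[1] += 1
    return {dno: total / count for dno, (total, count) in totals.items()}

# Hypothetical sample data.
EMPLOYEE = [
    {"Lname": "Smith", "Dno": 5, "Salary": 30000},
    {"Lname": "Wong", "Dno": 5, "Salary": 40000},
    {"Lname": "Zelaya", "Dno": 4, "Salary": 25000},
]
averages = avg_salary_by_dno(EMPLOYEE)
print(averages)
```

The dictionary plays the role of the hash partitioning: every record lands in the bucket for its Dno, and the aggregate is maintained per bucket.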

34 Avoiding hitting the disk: pipelining
• It is convenient to think of each operation in isolation
  – Read input from disk, write results to disk
• But this is slow: very disk intensive
  – Can we avoid creating intermediate (temporary) result files?
• Pipelining: pass results of one operator directly to the next
  – Like unix pipes |
• Create pipelined code: implement compound operations
  – For a 2-way join, combine 2 selections on the input and one projection on the output with the join: eliminates 4 temp files
• DBMS can generate code dynamically to pipeline operations
  – Using ideas from compilers
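Python generators give a compact way to see what pipelining means: each operator pulls records from the one below it, and no intermediate result is ever stored. This is an analogy rather than how a DBMS generates code, and the operator names and sample table are hypothetical.

```python
def scan(table):
    """Leaf operator: produce the records of a base relation one at a time."""
    for row in table:
        yield row

def select(rows, predicate):
    """Pass on only the records that satisfy the condition."""
    for row in rows:
        if predicate(row):
            yield row

def project(rows, attributes):
    """Keep only the listed attributes of each record."""
    for row in rows:
        yield {a: row[a] for a in attributes}

EMPLOYEE = [
    {"Lname": "Smith", "Dno": 5, "Salary": 30000},
    {"Lname": "Wong", "Dno": 5, "Salary": 40000},
    {"Lname": "Zelaya", "Dno": 4, "Salary": 25000},
]

# pi_Lname(sigma_Dno=5(EMPLOYEE)) as one pipeline: no temporary list between operators.
pipeline = project(select(scan(EMPLOYEE), lambda e: e["Dno"] == 5), ["Lname"])
names = list(pipeline)
print(names)
```

Nothing is computed until the final `list` call starts pulling records, at which point each record flows through selection and projection before the next one is read.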

35 Using Heuristics in Query Optimization
• Steps for heuristic optimization:
  – Query parser generates initial internal representation
  – Apply heuristics to optimize the internal representation
  – Query execution plan generated to use available access paths
• General idea: apply operations that reduce result size first
  – E.g., apply SELECT and PROJECT before JOIN or similar operations
  – “Push down” selections (projections) in tree representation
• Will introduce query tree and query graph representations
  – Develop the idea of “equivalent query trees”

36 Query Trees
• Represent a relational algebra expression as a tree structure
  – Input relations are leaf nodes of the tree
  – Internal nodes are (relational) operations
  – Execution proceeds bottom-up: result is the root
• E.g.:
  π PNUMBER, DNUM, LNAME, ADDRESS, BDATE (((σ PLOCATION='Stafford' (PROJECT)) ⋈ DNUM=DNUMBER (DEPARTMENT)) ⋈ MGRSSN=SSN (EMPLOYEE))
  SELECT P.Pnumber, P.Dnum, E.Lname, E.Address, E.Bdate FROM PROJECT AS P, DEPARTMENT AS D, EMPLOYEE AS E WHERE P.Dnum=D.Dnumber AND D.Mgrssn=E.Ssn AND P.Plocation='Stafford';

37 Query trees
• Query trees impose a (partial) ordering on operations
  – (1) must be done before (2) before (3)
• One query can correspond to many relational algebra expressions
  – Hence there can be many different query trees for the same query
  – Query optimization goal: pick a tree that is efficient to execute
• Analogy: multiple mathematical expressions are equivalent
  – E.g. a(x + y) = ax + ay
• Start with an initial query tree derived from the SQL query
  – Often very inefficient: uses cartesian product instead of join
  – Initial query tree never executed, but is easier to optimize
• Query optimizer will use equivalence rules to transform trees

38 (query tree figure) SELECT LNAME FROM EMPLOYEE, WORKS_ON, PROJECT WHERE PNAME = 'AQUARIUS' AND PNUMBER=PNO AND ESSN=SSN AND BDATE > ' ';

39 (query tree figure) SELECT LNAME FROM EMPLOYEE, WORKS_ON, PROJECT WHERE PNAME = 'AQUARIUS' AND PNUMBER=PNO AND ESSN=SSN AND BDATE > ' ';

40 Final query tree (figure) SELECT LNAME FROM EMPLOYEE, WORKS_ON, PROJECT WHERE PNAME = 'AQUARIUS' AND PNUMBER=PNO AND ESSN=SSN AND BDATE > ' ';

41 Transformation rules for relational algebra
• Rules preserve the semantics of queries: same result
  – Not necessarily with the same order of attributes
1. Cascade of σ. Conjunction of selections can be broken up:
   σ c1 AND c2 AND … AND cn (R) ≡ σ c1 (σ c2 (… (σ cn (R)) …))
2. The σ operation is commutative (follows from previous):
   σ c1 (σ c2 (R)) ≡ σ c2 (σ c1 (R))
3. Cascade of π. Can ignore all but the final projection:
   π list1 (π list2 (… (π listn (R)) …)) ≡ π list1 (R)
4. Commuting σ with π. If selection only involves attributes in the projection list, can swap:
   π A1, A2, …, An (σ c (R)) ≡ σ c (π A1, A2, …, An (R))

42 Transformation rules for relational algebra
5. Commutativity of ⋈ (and ×): join (cartesian product) commutes
   R ⋈ c S = S ⋈ c R and R × S = S × R
6. Commuting σ with ⋈ (or ×): if all attributes in selection condition c are in only one relation (say, R) then
   σ c (R ⋈ S) ≡ (σ c (R)) ⋈ S
   If c can be written as (c1 AND c2) where c1 is only on R, and c2 is only on S, then
   σ c (R ⋈ S) ≡ (σ c1 (R)) ⋈ (σ c2 (S))
7. Commuting π with ⋈ (or ×): write projection list L = {A1, ..., An, B1, ..., Bm} where A’s are attributes of R, B’s are attributes of S. If join condition c involves only attributes in L, then:
   π L (R ⋈ c S) ≡ (π A1, ..., An (R)) ⋈ c (π B1, ..., Bm (S))
   Can also allow join on attributes not in L with an extra projection

43 Transformation rules for relational algebra
8. Commutativity of set operators
   Operations ∪ and ∩ are commutative ( / is not)
9. Associativity of ⋈, ×, ∪, ∩
   These operations are individually associative: (R θ S) θ T = R θ (S θ T), where θ is the same op throughout
10. Commuting σ with set operations
   σ commutes with ∪, ∩ and / : σ c (R θ S) ≡ (σ c (R)) θ (σ c (S))
11. The π operation commutes with ∪
   π L (R ∪ S) ≡ (π L (R)) ∪ (π L (S))
12. Converting a (σ, ×) sequence into a ⋈
   If the condition c of selection σ following a × corresponds to a join condition, then: (σ c (R × S)) ≡ (R ⋈ c S)

44 Yet more transformation rules
• Other rules from arithmetic and logic can be applied
• E.g. De Morgan’s laws:
  NOT (c1 AND c2) ≡ (NOT c1) OR (NOT c2)
  NOT (c1 OR c2) ≡ (NOT c1) AND (NOT c2)
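Two of the rules can be checked on toy data: rule 6 (pushing a selection below a join) and rule 12 (a selection over a cartesian product is a join). The sketch below models relations as lists of dicts; the helper names and the sample tuples are hypothetical, and a check on one small instance illustrates the rules but does not prove them.

```python
from itertools import product

def select(rel, cond):
    return [t for t in rel if cond(t)]

def cartesian(r, s):
    return [{**t, **u} for t, u in product(r, s)]

def join(r, s, cond):
    """Theta-join: test the join condition pair by pair."""
    return [{**t, **u} for t in r for u in s if cond({**t, **u})]

def as_set(rel):
    """Order-insensitive view of a relation, for comparing results."""
    return {frozenset(t.items()) for t in rel}

EMPLOYEE = [{"Lname": "Smith", "Dno": 5, "Salary": 30000},
            {"Lname": "Wong", "Dno": 5, "Salary": 40000},
            {"Lname": "Borg", "Dno": 1, "Salary": 55000}]
DEPARTMENT = [{"Dnumber": 1, "Dname": "HQ"}, {"Dnumber": 5, "Dname": "Research"}]

same_dept = lambda t: t["Dno"] == t["Dnumber"]      # join condition
high_salary = lambda t: t["Salary"] > 35000         # involves EMPLOYEE only

# Rule 6: selecting after the join equals joining the already-filtered EMPLOYEE.
late = select(join(EMPLOYEE, DEPARTMENT, same_dept), high_salary)
early = join(select(EMPLOYEE, high_salary), DEPARTMENT, same_dept)

# Rule 12: a selection on a cartesian product equals the corresponding join.
via_product = select(cartesian(EMPLOYEE, DEPARTMENT), same_dept)
via_join = join(EMPLOYEE, DEPARTMENT, same_dept)

print(as_set(late) == as_set(early), as_set(via_product) == as_set(via_join))
```

The pushed-down form feeds two EMPLOYEE tuples into the join instead of three, which is the effect the heuristic optimizer is after when it moves selections down the tree.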

45 Algebraic Optimization
• How to go from rules into an algorithm for optimizing queries?
  – What order to apply rules? Avoid going round in circles…
• Using rule 1, break up SELECTs with ANDs into a cascade
  – Makes it easier to move them around the tree
• Use rules 2, 4, 6, 10 to move SELECTs as far down as possible
  – SELECT reduces data size so do it as soon as possible
• Use rules 5 and 9 (commutativity and associativity of binary ops):
  – Arrange that leaf operations with most restrictive SELECT are first
  – Use selectivity estimates to determine this
  – But avoid creating cartesian products

46 Algebraic Optimization Continued
• Use rule 12 to turn cartesian product + select into join
• Use rules 3, 4, 7, 11 to push PROJECT as far down the tree as possible, creating new PROJECT operations if needed
  – Only attributes needed higher up the tree should remain
• Find subtrees that can be pipelined into a single operation
• The example from earlier follows this sequence of steps
• Main points to remember:
  – First apply operations that reduce the size of intermediate results
  – Perform select and project early, to minimize result size
  – The most restrictive select and join should be done earliest

47 Query Trees to Query Plans
• Now we have a query tree… but we’re not quite done
  – Need to specify a few more points to make the final plan
  – Tree specifies order of operations, not the implementation to use
• E.g. for tree below, DBMS could specify:
  – Use secondary index to search for SELECT on DEPARTMENT
  – Single-loop join algorithm to do the join using an index on Dno
  – Linear scan of the join result to do the project
  – Lastly, specify whether intermediate results are materialized on disk or pipelined direct to next operation

48 Cost-based query optimization
• Cost-based query optimization assigns costs to different strategies
  – Aims to pick the strategy with the lowest (estimated) cost
  – Sits in place of or in addition to heuristic query optimization
• Can sometimes be time-consuming to compare costs
  – Better suited to queries that are compiled (to run multiple times)
• Does not guarantee finding the best (optimal) strategy
  – Poor estimates of costs could pick a suboptimal strategy
  – Not time-effective to consider every possible strategy
• Need cost functions to assign cost to each operation

49 Measuring the cost
Many different dimensions over which cost could be measured:
• Access cost to secondary storage (disk), or I/O cost
  – The (time/number) of disk accesses
• Disk storage cost
  – The size of intermediate files stored on disk
• Computation cost (also known as CPU cost)
  – The time to perform the operations on data in memory
• Memory usage cost: how much memory is needed?
• Communication cost: how much data goes over the network?
For large databases, disk cost dominates
For small databases in memory, CPU cost dominates

50 The DBMS catalog
• The catalog(ue) for a database stores metadata
  – Information about the tables, types of field within the tables
  – Stored as a database itself, and can be queried
  – Information for query optimization can also be stored in a catalog
• Statistics (typically) stored about each database file:
  – Number of records (tuples), r
  – (Average) record size, R
  – Number of file blocks, b
  – Blocking factor, bfr
  – File organization: unordered, ordered, hashed

51 The DBMS catalog
• Information stored about fields in each file:
  – Number of distinct values observed, d
  – (Average) attribute selectivity for an equality condition, sl
    Estimated from number of distinct values
    May keep a histogram if this is very non-uniform
• Information stored about each index
  – Number of levels in each multilevel index, x
  – Number of first-level index blocks, b_I1

52 Example Cost Estimates for SELECT
Some simple functions to estimate number of block transfers
• S1: Linear search (brute force)
  – Simply check every block, so cost C = b, number of blocks
• S2: Binary search on sorted file
  – Access approximately C = log_2 b + ⌈s/bfr⌉ − 1 blocks
  – Search cost + number of blocks that match the criteria
  – Simplifies to log_2 b if equality condition is on a key attribute
• S3: Use hash key or primary index to retrieve a record
  – C = x + 1 to search an x-level index
  – C = 1 if using hashing: just jump directly to disk block
    C is a larger constant (2 or 3) if e.g. extendible hashing

53 Example cost functions for SELECT
• S4: use an ordering index to retrieve multiple records
  – If condition is >, ≥, ≤, < on a key field, assume half the records pass
  – So cost function is C = x + (b/2)
  – Better estimates from using histograms/samples of data
• S5: use clustering index to retrieve multiple records
  – First traverse the index to find the first matching record (x)
  – Then one disk block access for each block of matching records (⌈s/bfr⌉)
  – Total estimated cost is C = x + ⌈s/bfr⌉
• S6: use secondary index (B+-tree)
  – For equality on a key attribute, cost is C = x + 1
  – For equality on non-key attributes, cost is C = x + 1 + s
  – For >, ≥, ≤, < conditions, C = x + f(b_I1 + r) if fraction f of records match

54 Example cost functions for SELECT
• S7 Conjunctive selection (is AND of multiple conditions)
  – Use any of the above methods for one condition
  – Check that the retrieved records match all the required conditions
  – No extra disk accesses for the checks so cost is that of the initial method
  – If multiple indexes exist, can first take intersection of pointers

55 Example using cost functions
• Query optimizer finds possible strategies, estimates cost of each
• Simple example for SELECT
  – EMPLOYEE file has r_E = 10,000 records in b_E = 2000 disk blocks
    Blocking factor bfr = 5
  – Clustering index on Salary has x = 3 levels and average selection cardinality s = 20
  – Secondary index on key attribute SSN with x = 4 levels
  – Secondary index on nonkey attribute Dno with x = 2 levels and b_I1 = 4 first-level blocks. 125 distinct values of Dno
  – Secondary index on Sex with x = 1 level, and d = 2 distinct values

56 SELECT example
• OP1: σ SSN=' ' (EMPLOYEE)
  – S1 Brute force: read on average 1000 blocks (half the file)
  – S6 use index: x + 1 = 4 + 1 = 5 blocks
• OP2: σ DNUMBER>5 (DEPARTMENT)
  – S6: use index on Dno: 2 + (4/2) + (10,000/2) ≈ 5000
    Assumes half the records match, incur cost of reading each block
    Better to use brute force linear scan!
• OP3: σ DNO=5 (EMPLOYEE)
  – S6: use index on Dno, estimate 10000/125 = 80 records
    Cost is 2 + 80 = 82
• OP4: σ DNO=5 AND SALARY>30000 AND SEX=F (EMPLOYEE)
  – Select on Dno: 82; select on Salary: ~1000; select on Sex: ~5000
Data:
  EMPLOYEE: r_E = 10,000 records in b_E = 2000 disk blocks
  Cluster index on Salary: x = 3 levels, selection cardinality s = 20
  Secondary index on key attribute SSN with x = 4 levels
  Secondary index on nonkey attribute Dno with x = 2 levels and b_I1 = 4 first-level blocks. 125 distinct values of Dno
  Secondary index on Sex with x = 1 level, d = 2 distinct values
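The estimates for OP1 to OP3 can be recomputed from the catalog statistics. This is a sketch with hypothetical function names. Two assumptions are worth flagging: the range cost is taken as x + b_I1/2 + r/2 (the f = 1/2 case of the S6 formula), and the non-key equality cost is taken as x + s, which is the figure the worked example arrives at for OP3.

```python
def cost_linear_scan(b: int, key_equality: bool = False) -> int:
    """S1: read the whole file, or half of it on average for equality on a key."""
    return b // 2 if key_equality else b

def cost_index_key_equality(x: int) -> int:
    """S6, equality on a key: descend x index levels, then fetch the record."""
    return x + 1

def cost_index_nonkey_equality(x: int, s: int) -> int:
    """S6, equality on a non-key: descend the index, then one block per match."""
    return x + s

def cost_index_range(x: int, b_i1: int, r: int) -> int:
    """S6, range condition assumed to match half of the records."""
    return x + b_i1 // 2 + r // 2

r_e, b_e = 10_000, 2000
op1 = {"S1": cost_linear_scan(b_e, key_equality=True),
       "S6": cost_index_key_equality(x=4)}
op2 = {"S1": cost_linear_scan(b_e),
       "S6": cost_index_range(x=2, b_i1=4, r=r_e)}
s_dno = r_e // 125                      # expected matches for Dno = 5
op3 = {"S1": cost_linear_scan(b_e),
       "S6": cost_index_nonkey_equality(x=2, s=s_dno)}

for name, costs in (("OP1", op1), ("OP2", op2), ("OP3", op3)):
    print(name, costs, "-> choose", min(costs, key=costs.get))
```

The comparison shows why an index is not always the right choice: for the range condition the secondary index touches roughly one block per matching record, so the plain scan of 2000 blocks is cheaper.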

57 Join Selectivity
• To estimate cost of JOIN, need to know how many tuples result
  – Store as a selectivity ratio: size of join to size of cartesian product
  – Join selectivity js = |(R ⋈ c S)| / |(R × S)| = |(R ⋈ c S)| / (|R| * |S|)
• Join selectivity js takes on values from 0 to 1
  – If no condition c, js = 1, result is just cartesian product
  – If no tuples meet the join condition, js = 0, result is empty
• Special case when c is an equality condition, R.A = S.B
  – If A is a key of R, then |(R ⋈ c S)| can be at most |S| [Why?], so js ≤ 1/|R|
    If B is a foreign key of S then size of join is exactly |S|
  – By symmetry, if B is a key of S, then js ≤ 1/|S|

58 JOIN estimation for equijoins
Estimate cost for |(R ⋈ A=B S)| assuming an estimate for js
• J1: Nested-loop join
  – Suppose R is used for outer loop, blocking factor for result is bfr_RS
  – Join cost C = b_R + (b_R * b_S) + (js * |R| * |S| / bfr_RS)
  – Refines the earlier cost for nested-loop by including output size
  – With n_B buffers, C = b_R + (⌈b_R/(n_B − 2)⌉ * b_S) + (js * |R| * |S| / bfr_RS)
• J3: Sort-merge join
  – If files are already sorted, C = b_R + b_S + (js * |R| * |S| / bfr_RS)
  – If not, add the cost of sorting (estimated as b log b)

59 JOIN estimation for equijoins
• J2: Single-loop join with access path for matching records
  – Use index on B with x levels
  – s_B denotes the average number of tuples matching in S
  – Secondary index: C = b_R + (|R| * (x + s_B)) + (js * |R| * |S| / bfr_RS)
  – Cluster index: C = b_R + (|R| * (x + s_B/bfr_B)) + (js * |R| * |S| / bfr_RS)
    Cluster index means all matching records are sequential
  – Primary index: C = b_R + (|R| * (x + 1)) + (js * |R| * |S| / bfr_RS)
    Primary index implies there can only be one matching record
  – Hash key on B of S: C = b_R + (|R| * h) + (js * |R| * |S| / bfr_RS)
    h is average number of reads to find a record, should be 1 or 2

60 JOIN example
OP6: EMPLOYEE ⋈_{DNO=DNUMBER} DEPARTMENT
Assume js for OP6 is 1/125 (number of depts)
EMPLOYEE file: r_E = 10,000 records in b_E = 2000 blocks
– Secondary index on nonkey attribute Dno: x = 2 levels and b_I1 = 4 first-level blocks; 125 distinct values of Dno
DEPARTMENT file: r_D = 125 records in b_D = 13 blocks
– Primary index on Dnumber with x = 1 level
Size of result is (125 * 10,000)/(125 * 4) = 2500 blocks with bfr_ED = 4
1. J1 Nested loop join with EMPLOYEE as outer loop
– C = b_E + (b_E * b_D) + [output] = 2000 + (2000 * 13) + 2500 = 30,500
2. J1 Nested loop join with DEPARTMENT as outer loop
– C = b_D + (b_E * b_D) + [output] = 13 + (13 * 2000) + 2500 = 28,513
3. J2 Single loop join with EMPLOYEE as outer loop
– C = b_E + (r_E * (x + 1)) + [output] = 2000 + (10,000 * 2) + 2500 = 24,500
4. J2 Single loop join with DEPARTMENT as outer loop
– C = b_D + (r_D * (x + s)) + [output] = 13 + (125 * (2 + 80)) + 2500 = 12,763
• Can do better: DEPARTMENT is small, so keep in memory (case 2)
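The four estimates can be checked mechanically. The sketch below plugs the slide's figures into the J1 and J2 formulas; the variable names are illustrative choices, not notation from the slides.

```python
# Figures from the slide.
b_E, r_E = 2000, 10_000       # EMPLOYEE blocks, records
b_D, r_D = 13, 125            # DEPARTMENT blocks, records
x_dno, s_dno = 2, 80          # secondary index on EMPLOYEE.Dno; 10,000/125 matches
x_dnumber = 1                 # primary index on DEPARTMENT.Dnumber
bfr_ED = 4                    # blocking factor of the result

# js = 1/125, written as an integer division to keep the arithmetic exact.
output = (r_E * r_D) // (125 * bfr_ED)            # 2500 blocks

costs = {
    "1. nested loop, EMPLOYEE outer": b_E + b_E * b_D + output,
    "2. nested loop, DEPARTMENT outer": b_D + b_D * b_E + output,
    "3. single loop, EMPLOYEE outer": b_E + r_E * (x_dnumber + 1) + output,
    "4. single loop, DEPARTMENT outer": b_D + r_D * (x_dno + s_dno) + output,
}

for plan, cost in costs.items():
    print(f"{plan}: {cost:,}")

print("cheapest:", min(costs, key=costs.get))
```

If DEPARTMENT is held entirely in memory, EMPLOYEE needs to be read only once, which under the same assumptions would bring case 2 down to roughly b_D + b_E + [output] = 13 + 2000 + 2500 = 4,513 block accesses.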

61 JOIN ordering
• Commutativity & associativity of join give many equivalent trees
– Exponential growth with number of joins
– Not feasible to estimate all costs
• Restrict to a subset of possible query trees
– Left-deep tree is one where every right child is a base relation
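To give a feel for the growth, the sketch below counts join orders for n relations. It assumes the standard counting results, which are not stated on the slide: n! left-deep orders, and (2(n-1))!/(n-1)! binary join trees of any shape when the two inputs of each join may be swapped.

```python
from math import factorial


def left_deep_orders(n):
    """Left-deep trees over n relations: one per permutation."""
    return factorial(n)


def all_join_trees(n):
    """All binary join trees over n relations, counting R ⋈ S and S ⋈ R separately."""
    return factorial(2 * (n - 1)) // factorial(n - 1)


for n in (2, 3, 4, 6, 8, 10):
    print(n, left_deep_orders(n), all_join_trees(n))
```

Restricting the search to left-deep trees still leaves a factorial number of candidates, but it is a far smaller space than the set of all tree shapes.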

62 Joins and trees
• With left-deep trees, the right child is always the “inner” relation
– When doing a nested loop join, or single loop join
• Left-deep trees are amenable to pipelining
– Standard structure is easy to generate code for
• Base relation on the right means an access path can be used (if present)

63 Query optimization example
Starting point: result of heuristic query optimization
SELECT P.Number, P.Dnum, E.Lname, E.Address, E.Bdate
FROM PROJECT AS P, DEPARTMENT AS D, EMPLOYEE AS E
WHERE P.Dnum = D.Dnumber AND D.Mgrssn = E.Ssn AND P.Plocation = 'Stafford';

64 Query optimization example
Several possible join orders:
– (PROJECT ⋈ DEPARTMENT) ⋈ EMPLOYEE
– (DEPARTMENT ⋈ PROJECT) ⋈ EMPLOYEE
– (DEPARTMENT ⋈ EMPLOYEE) ⋈ PROJECT
– (EMPLOYEE ⋈ DEPARTMENT) ⋈ PROJECT
• Consider the first ordering, (PROJECT ⋈ DEPARTMENT) ⋈ EMPLOYEE
– Compare using an index on the PROJECT table to a linear scan
– Estimate number of matching records with location Stafford
– Use of index requires ~100 block accesses, so better than scan
– Estimate join sizes to get size of temporary files
– Combine with cost of second join to get total estimate (+32 blocks)
• Repeat for other join orders. See textbook for more details
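The four candidate orders listed on this slide are exactly the left-deep orders that never require a cartesian product. The sketch below enumerates them from the query's two join predicates; `avoids_cross_product` is a hypothetical helper written for this illustration, not the algorithm a real optimizer uses.

```python
from itertools import permutations

relations = ["PROJECT", "DEPARTMENT", "EMPLOYEE"]

# Join predicates in the example query: P.Dnum = D.Dnumber and D.Mgrssn = E.Ssn.
joinable = {
    frozenset({"PROJECT", "DEPARTMENT"}),
    frozenset({"DEPARTMENT", "EMPLOYEE"}),
}


def avoids_cross_product(order):
    """True if every relation after the first joins with one already in the plan."""
    seen = [order[0]]
    for rel in order[1:]:
        if not any(frozenset({rel, prev}) in joinable for prev in seen):
            return False
        seen.append(rel)
    return True


orders = [o for o in permutations(relations) if avoids_cross_product(o)]
for first, second, third in orders:
    print(f"({first} ⋈ {second}) ⋈ {third}")
```

The two excluded permutations both start by pairing PROJECT with EMPLOYEE, which share no join condition in this query.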

65 Query Optimization in Oracle
• Oracle: major example of a commercial DBMS
– Largest market share, beating IBM, Microsoft, SAP and Teradata
• Formerly used a rule-based approach
– Ranked a list of 15 possible access path types, from best to worst
• Current approach: cost-based optimization
– Cost aims to be proportional to total query run time
– Combination of costs from I/O, CPU and memory
• Allows developers to include “hints” in the SQL queries
– Tell the optimizer to use a certain join method or access path
• Example of hint: consider an attribute with few distinct values
– Simple assumption would be that all are equally likely
– But if a value in the query is rare, tell (hint) the optimizer to use the index

66 Summary
• Queries follow simple query patterns: block (nested) queries
• Make use of external sorting to help query evaluation
• There are many techniques for different query operators:
– Many different choices for each of SELECT, JOIN, PROJECT
• Query optimization: choosing between different query plans
– Heuristic (rule-based) versus cost-based query optimization
– Used in real-world databases (Oracle, MySQL, MS SQL Server…)
– SQL ‘EXPLAIN’ keyword gives details of the query plan chosen
Chapter: “Algorithms for Query Processing and Optimization” in Elmasri and Navathe
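As a small illustration of inspecting a chosen plan, the sketch below uses SQLite through Python's standard sqlite3 module. SQLite is chosen here only to keep the example self-contained (the slides name Oracle, MySQL and SQL Server); its variant of the keyword is EXPLAIN QUERY PLAN, and the table and index names are invented. The exact wording of the plan text differs between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE employee (ssn TEXT PRIMARY KEY, lname TEXT, dno INTEGER);
    CREATE INDEX idx_employee_dno ON employee (dno);
    """
)


def plan(sql):
    """Return the human-readable plan steps SQLite reports for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]      # last column holds the description


# Equality on the indexed column: the optimizer can use the index.
print(plan("SELECT lname FROM employee WHERE dno = 5"))

# Condition on an unindexed column: a full table scan is the only option.
print(plan("SELECT ssn FROM employee WHERE lname = 'Smith'"))
```

Other systems expose the same idea with different syntax and output formats, so the plan text should be read as a description of the chosen access paths rather than as a portable format.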

