Presentation is loading. Please wait.

Presentation is loading. Please wait.

COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization.

Similar presentations


Presentation on theme: "COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization."— Presentation transcript:

1 COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization

2 22 L10 Query Processing & Optimization Parsing and Translation Evaluation Individual operations Entire operations Optimization Cost-based Heuristic Today’s Agenda

3 33 L10 Query Processing & Optimization Overview Query Processing Basic Steps Parsing and Translation Optimisation Evaluation Source: Silberschatz/Korth/Sudarshan: Database System Concepts, 2002.

4 44 L10 Query Processing & Optimization Basic Steps in Query Processing Parsing and Translation translate the query into its internal form. This is then translated into relational algebra. Parser checks syntax, verifies relations Query Optimization Amongst all equivalent query-evaluation plans choose the one with lowest cost. Query Evaluation Query-execution engine takes a query-evaluation plan, executes that plan, and returns the answers to the query.

5 55 L10 Query Processing & Optimization Parsing and Translation SQL gets translated into relational algebra, which can be shown as expression tree. Operators have one or more input sets and return one (or more) output set(s). Leafs correspond to (base) relations. Example: SELECT name FROM students, enrolled WHERE cid = ‘COMP5138’ students enrolled  cid=‘COMP5138’  nam e Projection Join Selection

6 66 L10 Query Processing & Optimization Parsing and Translation Evaluation Individual operations Entire operations Optimization Cost-based Heuristic Today’s Agenda

7 77 L10 Query Processing & Optimization Query Evaluation Plan Each relational algebra operation can be evaluated using one of several different algorithms Correspondingly, a relational-algebra expression can be evaluated in many ways. Annotated expression specifying detailed evaluation strategy is called an evaluation-plan. Example: We can use an index on balance to find accounts with balance<2500. Or can perform complete relation scan and discard accounts with balance  2500 account  balance  balance2500

8 88 L10 Query Processing & Optimization Expression Tree vs. Query Evaluation Plan Relational algebra operators == logical operators which represent the intermediate results. Physical operators show how query is evaluated. account  balance  balance2500 Project balance Index Range Scan ( account.balance, balance2500) Table Scan (account) Project balance Filter balance2500

9 99 L10 Query Processing & Optimization Measures of Query Costs Query Optimization: Amongst all equivalent evaluation plans choose the one with lowest cost. Cost is generally measured as total elapsed time for answering query Many factors contribute to time cost disk accesses, CPU, or even network communication Typically disk access is the predominant cost, and is also relatively easy to estimate. For simplicity, we just use number of block transfers from disk as the cost measure We ignore the difference in cost between sequential and random I/O for simplicity We also ignore CPU costs for simplicity

10 1010 L10 Query Processing & Optimization Access Path An access path is a method of retrieving tuples: File scan, or index that matches a selection (in the query) A tree index matches (a conjunction of) terms that involve only attributes in a prefix of the search key. E.g., Tree index on matches the selection a=5 AND b=3, and matches a=5 AND b>6, but not b=3. A hash index matches (a conjunction of) terms that has a term attribute = value for every attribute in the search key of the index. E.g., Hash index on matches a=5 AND b=3 AND c=5; but it does not match b=3, or a=5 AND b=3, or a>5 AND b=3 AND c=5.

11 1111 L10 Query Processing & Optimization Selection Find the most selective access path, retrieve tuples using it, and apply any remaining terms that don’t match the index: Most selective access path: An index or file scan that we estimate will require the fewest page I/Os. Table scan – search algorithms that locate and retrieve records that fulfill a selection condition. Cost estimate (number of disk blocks scanned) = b R b R denotes number of blocks containing records from relation R e.g. Oracle: TABLE SCAN, TABLE ACCESS FULL Index scan – search algorithms that use an index selection condition must be on search-key of index. Cost = HT i + number of blocks containing retrieved records (PK: 1 block) HT: number of index level e.g. Oracle: INDEX RANGE SCAN or INDEX UNIQUE SCAN

12 1212 L10 Query Processing & Optimization Projection The expensive part is removing duplicates. SQL systems don’t remove duplicates unless the keyword DISTINCT is specified in a query. Sorting Approach: Sort on and remove duplicates. (Can optimize this by dropping unwanted information while sorting.) Hashing Approach: Hash on to create partitions. Load partitions into memory one at a time, build in-memory hash structure, and eliminate duplicates.

13 1313 L10 Query Processing & Optimization Sorting Importance of sorting SQL queries can specify that the output be sorted SQL and relational algebra can be implemented efficiently if the input are sorted We may build an index on the relation, and then use the index to read the relation in sorted order. May lead to one disk block access for each tuple. For relations that fit in memory, techniques like quick sort can be used. For relations that don’t fit in memory, external sort-merge is a good choice.

14 1414 L10 Query Processing & Optimization External Sort-Merge 1.Create sorted runs. Let i be 0 initially. Repeatedly do the following till the end of the relation: (a) Read M blocks of relation into memory (b) Sort the in-memory blocks (c) Write sorted data to run R i ; increment i. Let the final value of i be N 2.Merge the runs (N-way merge). We assume (for now) that N < M. ① Use N blocks of memory to buffer input runs, and 1 block to buffer output. Read the first block of each run into its buffer page ② repeat 1)Select the first record (in sort order) among all buffer pages 2)Write the record to the output buffer. If output is full, write it to disk. 3)If this is the last records of the input buffer page then read the next block (if any) of the run into the buffer. ③ until all input buffer pages are empty: If i  M, several merge passes are required. In each pass, contiguous groups of M - 1 runs are merged. Let M denote memory size (in pages). 

15 1515 L10 Query Processing & Optimization External Sort-Merge - Example

16 1616 L10 Query Processing & Optimization External Sort-Merge  Cost analysis: Total number of merge passes required: log M–1 (b r /M). Disk accesses for initial run creation as well as in each pass is 2b r for final pass, we don’t count write cost we ignore final write cost for all operations since the output of an operation may be sent to the parent operation without being written to disk Thus total number of disk accesses for external sorting: b r ( 2 log M–1 (b r / M) + 1)

17 1717 L10 Query Processing & Optimization Join Operations Several different algorithms to implement joins Nested-loop join Block nested-loop join Indexed nested-loop join Merge-join Hash-join Choice based on cost estimate Examples use the following information Number of records of students: 1,000 enrolled: 10,000 Number of blocks of students: 100 enrolled: 400

18 1818 L10 Query Processing & Optimization Nested-Loop Join To compute the theta join R  S for each tuple r in R do begin for each tuple s in S do begin if (r,s)=true then add rs to the result end end R is called the outer relation,S the inner relation of the join Requires no indices and can be used with any kind of join condition. Expensive since it examines every pair of tuples in the two relations.

19 1919 L10 Query Processing & Optimization Nested-Loop Join Cost Analysis In the worst case, if there is memory only to hold one block of each relation, the estimated cost is |R|  b S + b R If the smaller relation fits entirely in memory, use that as the inner relation. Reduces cost to b R + b S disk accesses. Example: In the worst case scenario: students as outer relation: 1000  400 + 100 = 400,100 disk accesses enrolled as outer relation: 10000 100 + 400= 1,000,400 disk accesses If smaller relation (students) fits entirely in memory, the cost estimate will be 500 disk accesses. Block nested-loops algorithm (next slide) is preferable.

20 2020 L10 Query Processing & Optimization Block Nested-Loop Join Variant of nested-loop join in which every block of inner relation is paired with every block of outer relation. for each block B R of R do begin for each block B s of S do begin for each tuple r in B R do begin for each tuple s in B s do begin if (r,s)=true then add rs to the result end end end end

21 2121 L10 Query Processing & Optimization Block Nested-Loop Join Cost Analysis Worst case estimate: b R  b S + b R block accesses. Each block in the inner relation S read once for each block in the outer relation (instead of once for each tuple in the outer relation) Best case: b R + b S block accesses. Improvements to nested loop & block nested loop algorithms: In block nested-loop, use M — 2 disk blocks as blocking unit for outer relations, where M = memory size in blocks; use remaining two blocks to buffer inner relation and output Cost = b R / (M-2)  b S + b R If equi-join attribute forms a key on inner relation, stop inner loop on first match Scan inner loop forward and backward alternately, to make use of the blocks remaining in buffer (with LRU replacement) Use index on inner relation if available (next slide)

22 2222 L10 Query Processing & Optimization Indexed Nested-Loop Join Index lookups can replace file scans if join is an equi-join or natural join and an index is available on the inner relation’s join attribute For each tuple r in the outer relation R, use the index to look up tuples in S that satisfy the join condition with tuple R. Worst case: buffer has space for only one page of R, and, for each tuple in R, we perform an index lookup on S. Cost of the join: b R + |R|  c Where c is the cost of traversing index and fetching all matching S tuples for one tuple of R c can be estimated as cost of a single selection on S using the join condition. If an index exists on the appropriate column of each relation, you can choose which relation is the outer one Compare the costs each way

23 2323 L10 Query Processing & Optimization Merge-Join 1.Sort both relations on their join attribute (if not already sorted on the join attributes). 2.Merge the sorted relations to join them Join step is similar to the merge stage of the sort-merge algorithm. Main difference is handling of duplicate values in join attribute Can construct an index just to compute a join. Detailed algorithm in textbook

24 2424 L10 Query Processing & Optimization Merge-Join Cost Analysis Can be used only for equi-joins and natural joins Each block needs to be read only once (assuming all tuples for any given value of the join attributes fit in memory Thus number of block accesses for merge-join is b R + b S + the cost of sorting if relations are unsorted. hybrid merge-join: If one relation is sorted, and the other has a secondary B + -tree index on the join attribute Merge the sorted relation with the leaf entries of the B + -tree. Sort the result on the addresses of the unsorted relation’s tuples Scan the unsorted relation in physical address order and merge with previous result, to replace addresses by the actual tuples Sequential scan more efficient than random lookup

25 2525 L10 Query Processing & Optimization Hash-Join Applicable for equi-joins and natural joins. A hash function h is used to partition tuples of both relations h maps JoinAttrs values to {0, 1,..., n}, where JoinAttrs denotes the common attributes of R and S used in the natural join. R 0, R 1,..., R n denote partitions of R tuples Each tuple r  R is put in partition R i where i = h(r [JoinAttrs]). S 0, S 1..., S n denotes partitions of S tuples Each tuple s  S is put in partition S i, where i = h(s[JoinAttrs]).

26 2626 L10 Query Processing & Optimization Hash-Join (cont’d)

27 2727 L10 Query Processing & Optimization Other Operations Duplicate elimination can be implemented via hashing or sorting. On sorting duplicates will come adjacent to each other, and all but one set of duplicates can be deleted. Hashing is similar – duplicates will come into the same bucket. Aggregation can be implemented in a manner similar to duplicate elimination. Sorting or hashing can be used to bring tuples in the same group together, then the aggregate functions can be applied on each group. Set Operations (UNION, INTERSECT, SET MINUS)

28 2828 L10 Query Processing & Optimization Evaluation of Expressions So far: we have seen algorithms for individual operations Alternatives for evaluating an entire expression tree: Materialization (also: set-at-a-time): simply evaluate one operation at a time. The result of each evaluation is materialized (stored) in a temporary relation for subsequent use. Pipelining (also: tuple-at-a-time): evaluate several operations simultaneously in a pipeline

29 2929 L10 Query Processing & Optimization Materialization Materialized evaluation: evaluate one operation at a time, starting at the lowest-level. Use intermediate results materialized into temporary relations to evaluate next-level operations. E.g., in figure below, compute and store then compute the store its join with customer, and finally compute the projections on customer-name. Materialized evaluation is always applicable Costs can be quite high

30 3030 L10 Query Processing & Optimization Pipelining Pipelined evaluation : evaluate several operations simultaneously, passing the results of one operation on to the next. E.g., in previous expression tree, don’t store result of instead, pass tuples directly to the join. Similarly, don’t store result of join, pass tuples directly to projection. Much cheaper than materialization: no need to store a temporary relation to disk. Pipelining may not always be possible – e.g., sort, hash-join.

31 3131 L10 Query Processing & Optimization Parsing and Translation Evaluation Individual operations Entire operations Optimization Cost-based Heuristic Today’s Agenda

32 3232 L10 Query Processing & Optimization Intro to Query Optimization A relational algebra expression may have many equivalent expressions Example: SELECT BALANCE FROM account WHERE balance < 2500 Can be translated into  balance2500 (  balance (account)) which is equivalent to  balance (  balance2500 (account))

33 3333 L10 Query Processing & Optimization Query Optimization Central Problems: Query is (by definition) declarative, e.g. it does not specify the execution order. But we need an executable plan. The goal of query optimization is less to find the optimal plan, but more to avoid the worst. Time for query optimization adds to total query execution time. Three evaluation plan Interaction of evaluation techniques Cost-based Heuristic

34 3434 L10 Query Processing & Optimization Cost-based Query Optimization Generation of query-evaluation plans for an expression involves several steps: Generating logically equivalent expressions Use equivalence rules to transform an expression into an equivalent one. Annotating resultant expressions to get alternative query plans Choosing the cheapest plan based on estimated cost The overall process is called cost-based optimization. Some systems only have rule-based optimization, i.e. they don‘t take statistical information into account.

35 3535 L10 Query Processing & Optimization Statistic Information in the DB Catalog Need information about the relations and indexes involved. Catalogs typically contain at least: # tuples (NTuples) and # pages (NPages) for each relation. # distinct key values (NKeys) and NPages for each index. Index height, low/high key values (Low/High) for each tree index. Catalogs are updated periodically. Updating whenever data changes is too expensive; lots of approximation anyway, so slight inconsistency ok. More detailed information (e.g., histograms of the values in some field) are sometimes stored.

36 3636 L10 Query Processing & Optimization Heuristic Optimization Cost-based optimization is expensive, even with dynamic programming. Systems may use heuristics to reduce the number of choices that must be made in a cost-based fashion. Heuristic optimization transforms the query-tree by using a set of rules that typically (but not in all cases) improve execution performance: Perform selection early (reduces the number of tuples) Perform projection early (reduces the number of attributes) Perform most restrictive selection and join operations before other similar operations. Some systems use only heuristics, others combine heuristics with partial cost-based optimization.

37 3737 L10 Query Processing & Optimization Transformation Example Query: Find the names of all students who have enrolled in some course with 6 credit points.  name ( credit_points=6 (course (enrolled student))) Transformation using distribution rule 3:  name ( ( credit_points=6 (course)) (enrolled student))) Performing the selection as early as possible reduces the size of the relation to be joined.

38 3838 L10 Query Processing & Optimization Example with Multiple Transformations Query: Find the names of all students which enrolled in a 6 credit_point course, whose grade is a distinction (‘D’).  name ( grade=‘D’  credit_points=6 (course (enrolled student))) Transformation using join association rule:  name ( grade=‘D’  credit_points=6 ((course enrolled) student)) Second form provides an opportunity to apply the “perform selections early” rule, resulting in the subexpression  credit_points=6 (course)  grade=‘D’ (enrolled) Thus a sequence of transformations can be useful

39 3939 L10 Query Processing & Optimization Optimization Strategies Ideally: find the optimal plan. This approach is very expensive in time and space. Typically: Find a good plan, avoiding the worst plan. do not generate all expressions Central Problem: Join Ordering (Join Enumeration) Typical Approach: use dynamic programming for generating join plans restrict to left-deep join trees

40 4040 L10 Query Processing & Optimization Left Deep Join Trees In left-deep join trees, the right-hand-side input for each join is a relation, not the result of an intermediate join.

41 4141 L10 Query Processing & Optimization Dynamic Programming in Optimization To find best join tree for a set of n relations: To find best plan for a set S of n relations, consider all possible plans of the form: S 1 (S – S 1 ) where S 1 is any non-empty subset of S. Recursively compute costs for joining subsets of S to find the cost of each plan. Choose the cheapest of the 2 n – 1 alternatives. When plan for any subset is computed, store it and reuse it when it is required again, instead of recomputing it Dynamic programming E.g. r 1, r 2, r 3, r 4, r 5 (r 1, r 2, r 3 ), (r 4, r 5 ) =12*12=144 (r 1, r 2, r 3 ) and (r 4, r 5 ) = 12+12 =24

42 4242 L10 Query Processing & Optimization Wrap-Up Parsing and Translation Evaluation Individual operations Entire operations Optimization Cost-based Heuristic


Download ppt "COMP 5138 Relational Database Management Systems Semester 2, 2007 Lecture 12 Query Processing and Optimization."

Similar presentations


Ads by Google