ICOM 6005 – Database Management Systems Design Dr. Manuel Rodríguez-Martínez Electrical and Computer Engineering Department Lecture 13 – Query Evaluation.


1 ICOM 6005 – Database Management Systems Design Dr. Manuel Rodríguez-Martínez Electrical and Computer Engineering Department Lecture 13 – Query Evaluation Techniques ©Manuel Rodriguez – All rights reserved

2 Query Evaluation Techniques
Read:
–Chapter 12, sec. 12.1-12.3
–Chapter 13
Purpose:
–Study different algorithms to execute (evaluate) SQL relational operators: selection, projection, joins, aggregates, etc.

3 Selection and SF
A selection can have a where clause in conjunctive normal form (CNF) or disjunctive normal form (DNF).
Selectivity factor (SF) for the CNF case:
–If the where clause is of the form p1 and p2 and … and pn,
–then SF = SFp1 * SFp2 * … * SFpn
–This assumes that all predicates are independent
–Example: Select sid, sname From Students Where gpa = 4.0 AND age “Bob”;

4 Some examples
Queries:
–Q1: Select sname, sage From Students where gpa = 4.0;
–Q2: Select sname, gpa From Students where gpa > 3.50 AND age < 25;
Information for Students:
–Cardinality = 5,000
–# tuples / page = 100
–SF for gpa > 3.50 = 10%
–SF for age < 25 = 90%
Questions: What is the selectivity of each predicate? What about the whole where clause? Which predicate should go first?
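The questions on this slide can be worked through numerically. The sketch below is only an illustration: it assumes the predicates are independent, and the helper names `order_predicates` and `conjunction_sf` are made up for this example.

```python
def order_predicates(predicates):
    """Sort (text, sf) pairs so the most selective predicate (smallest SF) is first."""
    return sorted(predicates, key=lambda p: p[1])


def conjunction_sf(predicates):
    """SF of p1 AND ... AND pn under the independence assumption."""
    sf = 1.0
    for _, predicate_sf in predicates:
        sf *= predicate_sf
    return sf


# Numbers taken from the slide for query Q2.
CARDINALITY = 5000
predicates = [("age < 25", 0.90), ("gpa > 3.50", 0.10)]

ordered = order_predicates(predicates)          # gpa > 3.50 goes first
where_sf = conjunction_sf(predicates)           # about 0.09
expected_tuples = round(CARDINALITY * where_sf) # about 450 tuples
```

Evaluating the most selective predicate first means the second predicate is only checked on the roughly 10% of tuples that survive the first.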

5 The Role of Sorting
Sorting plays a pivotal role in the implementation of relational operators and in choosing an access path.
Idea: if an operator needs to pass several times over the tuples of a table, sorting might speed things up.
Examples:
–Sorting for duplicate elimination in projections
–Sort-merge join
–Sorting for aggregation
–Sorting for order by clauses
Question: since a table R can hold gigabytes of data, how do we sort it?
–Answer: external sorting

6 External Sorting
[Figure: two-way external sort of a 4-page input file]
–Input file: 3,4 | 5,2 | 8,1 | 2,10
–Sort pages (Quicksort): 3,4 | 2,5 | 1,8 | 2,10
–Merge-sort into 2-page blocks (runs): 2,3 4,5 | 1,2 8,10
–Merge-sort into a 4-page run: 1,2 2,3 4,5 8,10

7 Progression of the algorithm
Assume the file has N = 2^k pages.
–Pass 0: produces 2^k sorted runs of 1 page
–Pass 1: produces 2^(k-1) sorted runs of 2 pages
–Pass 2: produces 2^(k-2) sorted runs of 4 pages
–Pass 3: produces 2^(k-3) sorted runs of 8 pages
–…
–Pass k: produces 1 sorted run of 2^k pages
The costs of external sorting:
–2 I/Os per page per pass (1 read, 1 write)
–Number of passes: ⌈log2 N⌉ + 1, where N is the # of pages
–Total cost: 2N * (⌈log2 N⌉ + 1)
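The progression above can be tabulated with a few lines of code. This is a sketch only; the function names `two_way_passes` and `two_way_cost` are invented for the illustration, and the count is done with integers to avoid rounding issues in the logarithm.

```python
def two_way_passes(n_pages):
    """Return [(pass_no, n_runs, max_run_size_in_pages)] for two-way external sort."""
    pass_no, n_runs, run_size = 0, n_pages, 1
    progression = [(pass_no, n_runs, run_size)]
    while n_runs > 1:
        n_runs = (n_runs + 1) // 2               # pairs of runs are merged
        run_size = min(run_size * 2, n_pages)    # runs double in size each pass
        pass_no += 1
        progression.append((pass_no, n_runs, run_size))
    return progression


def two_way_cost(n_pages):
    """2 I/Os (1 read, 1 write) per page per pass."""
    return 2 * n_pages * len(two_way_passes(n_pages))


print(two_way_passes(8))   # 8 runs of 1 page down to 1 run of 8 pages
print(two_way_cost(8))     # 2 * 8 * (3 + 1) I/Os
```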

8 The idea behind External Sorting
Phase I: sort each page in memory using an in-memory sorting algorithm
–Often Quicksort is used
–Requires 1 pass over the relation
–Each sorted page is called a 1-page run
–A run is a collection of pages with sorted tuples, stored as a file
Phase II: merge sort
1. Let i = 1
2. Merge pairs of runs of size i to build runs of size 2i
3. When all runs of size i have been consumed: if only one run remains, finish; else set i = 2i and go to step 2
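The two phases can be simulated in memory. In this sketch a "page" is just a Python list of keys and a "run" is a sorted list; a real implementation would read and write pages of temporary files. The function names are hypothetical.

```python
def merge_runs(run_a, run_b):
    """Merge two sorted runs into one sorted run."""
    merged, i, j = [], 0, 0
    while i < len(run_a) and j < len(run_b):
        if run_a[i] <= run_b[j]:
            merged.append(run_a[i])
            i += 1
        else:
            merged.append(run_b[j])
            j += 1
    merged.extend(run_a[i:])
    merged.extend(run_b[j:])
    return merged


def two_way_external_sort(pages):
    # Phase I: sort each page on its own, giving 1-page runs.
    runs = [sorted(page) for page in pages]
    # Phase II: merge pairs of runs until a single run remains.
    while len(runs) > 1:
        next_runs = []
        for k in range(0, len(runs), 2):
            if k + 1 < len(runs):
                next_runs.append(merge_runs(runs[k], runs[k + 1]))
            else:
                next_runs.append(runs[k])  # odd run out is carried to the next pass
        runs = next_runs
    return runs[0] if runs else []


print(two_way_external_sort([[3, 4], [5, 2], [8, 1], [2, 10]]))
```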

9 Implementing External Sorting
Each run is stored in a temporary file.
In the worst case, you need three buffer pages:
–2 pages for input
–1 page for output
Usage of the pages:
–One holds a page of tuples from a run A
–Another holds a page of tuples from a run B
–The third holds the merged, sorted output page, which becomes part of a new run C whose size is twice that of runs A and B
In practice, you have B pages available:
–B-1 are used to input runs
–1 is used for output

10 Three buffer pages
[Figure: buffer pages with two input pages (Input 1, Input 2) and one output page]
This is the minimal barebones scheme.

11 Reality: B buffer pages
[Figure: buffer pages with B-1 input pages (Input 1, Input 2, …, Input B-1) and one output page]

12 Sorting with B buffers: Pass 0
[Figure: first run produced on pass 0]
–Input file: 3,4 | 5,2 | 8,1 | 2,10 | 7,21 | 9,6
–Buffer pages: 5,2 | 2,10 | 3,4 | 8,1
–Sort & merge, 1st run on pass 0: 1,2 2,3 4,5 8,10

13 Sorting with B buffers: Pass 0 (cont.)
[Figure: second run produced on pass 0]
–Input file: 3,4 | 5,2 | 8,1 | 2,10 | 7,21 | 9,6
–Buffer pages: 7,21 | 9,6
–1st run: 1,2 2,3 4,5 8,10
–Sort & merge, 2nd run on pass 0: 6,7 9,21

14 Sketch of Algorithm
Pass 0:
–Read B pages at a time
–Sort them in memory (e.g., quicksort or heapsort)
–Write a run of size B to disk. This produces ⌈N/B⌉ runs, where N = NPages(R) for relation R
Passes 1, 2, …, K:
1. Use B-1 buffers to read a page from each of B-1 runs of size i
2. Do a (B-1)-way merge to produce a run of size (B-1)*i. Each page is first kept in the output buffer, then written
3. Repeat (1)-(2) until all runs of the current size have been merged
4. If only one run is left, exit
5. Go to (1)
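A minimal in-memory sketch of this algorithm follows. It assumes pages are Python lists and uses the standard library's `heapq.merge` for the (B-1)-way merge; the function name `external_sort` and its parameters are illustrative, and real code would stream pages through buffers rather than hold whole runs in memory.

```python
import heapq


def external_sort(pages, n_buffers):
    """Simulate external merge sort with n_buffers buffer pages (n_buffers >= 3)."""
    if n_buffers < 3:
        raise ValueError("need at least 2 input buffers and 1 output buffer")

    # Pass 0: read n_buffers pages at a time, sort them together, write one run.
    runs = []
    for start in range(0, len(pages), n_buffers):
        chunk = [key for page in pages[start:start + n_buffers] for key in page]
        runs.append(sorted(chunk))

    # Passes 1..K: merge up to n_buffers - 1 runs at a time.
    fan_in = n_buffers - 1
    while len(runs) > 1:
        runs = [
            list(heapq.merge(*runs[k:k + fan_in]))
            for k in range(0, len(runs), fan_in)
        ]
    return runs[0] if runs else []


pages = [[3, 4], [5, 2], [8, 1], [2, 10], [7, 21], [9, 6]]
print(external_sort(pages, n_buffers=3))
```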

15 Leveraging on buffers
By using B buffers, the first pass builds ⌈N/B⌉ runs, where N is NPages(R) for a relation R
–Originally we had N runs …
Moreover, the total number of passes needed to sort is decreased by doing (B-1)-way merging operations.
The cost of sorting a relation R with N pages using B buffers (1 for output, B-1 for input) is:
–I/O Cost = 2 * N * (⌈log_(B-1) ⌈N/B⌉⌉ + 1)
–Must do (B-1)-way merging operations
Example: for table R, N = 10,000, B = 5
–I/O Cost = 2 * 10,000 * (⌈log_4 ⌈10,000/5⌉⌉ + 1) = 140,000 I/Os
–I/O Cost with 3 buffers = 2 * 10,000 * (⌈log_2 10,000⌉ + 1) = 300,000 I/Os
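The cost formula is easy to check with integer arithmetic. The helpers `ceil_log` and `external_sort_cost` below are names chosen for this sketch. Note that the slide's 3-buffer figure uses the two-way formula from the barebones scheme (1-page initial runs), so it is computed separately here.

```python
def ceil_log(value, base):
    """Smallest k >= 0 such that base**k >= value (integer arithmetic only)."""
    k, power = 0, 1
    while power < value:
        power *= base
        k += 1
    return k


def external_sort_cost(n_pages, n_buffers):
    """2 * N * (ceil(log_{B-1}(ceil(N / B))) + 1) I/Os."""
    initial_runs = -(-n_pages // n_buffers)   # ceil(N / B)
    passes = ceil_log(initial_runs, n_buffers - 1) + 1
    return 2 * n_pages * passes


N = 10000
cost_b5 = external_sort_cost(N, 5)              # 2000 runs, 6 merge passes + pass 0
cost_two_way = 2 * N * (ceil_log(N, 2) + 1)     # barebones scheme with 1-page runs
print(cost_b5, cost_two_way)
```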

16 Implementing Selection Operator
The selection operator can be implemented via:
–File Scan: unsorted data, sorted data
–Index Scan: B+-tree, Hash Tree
Each access path has very different costs.
File scans fetch the data and then apply the predicates.
Index scans combine the search with predicate evaluation
–The search on the index is an implicit predicate evaluation

17 Scenario
Queries:
–Q1: Select sname, sage From Students where gpa = 4.0;
–Q2: Select sname, gpa From Students where gpa > 3.50 AND age < 25;
Information for Students:
–Cardinality = 5,000
–# tuples / page = 100
–SF for gpa > 3.50 = 10%
–SF for age < 25 = 90%
Questions: What is the selectivity of each predicate? What about the whole where clause? Which predicate should go first?

18 Costs for Selection Access Paths
No index and data not sorted for table R; the predicate is Attr op value
–Algorithm: read each tuple, evaluate the predicates in the where clause on it, and emit the results to the next operator or to the output stream
–Cost: read the entire relation R, NPages(R) I/Os
–Scenario: heap file with no index defined

19 Costs for Selection Access Paths (2)
Data sorted on the attribute in a predicate Attr op value
–Algorithm: do a binary search to find the first tuple that matches and return it; scan subsequent pages while tuples match the predicate, returning these tuples; stop when a tuple no longer matches the condition
–Cost: (log2(NPages(R)) + # of pages with matches) I/Os
–If the data is already sorted we get a break!
Note: sorting would require reading the whole table. Do not sort the data just to run a selection!
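The binary-search-then-scan idea can be sketched on an in-memory sorted list using Python's standard `bisect` module. The function `select_sorted` and the inclusive range predicate are assumptions made for the illustration; a real system would binary-search over pages of the sorted file.

```python
import bisect


def select_sorted(sorted_keys, low, high):
    """Return the keys k with low <= k <= high from a sorted list."""
    # Binary search for the first key that can match.
    start = bisect.bisect_left(sorted_keys, low)
    matches = []
    # Scan forward while keys still match; stop at the first one that does not.
    for key in sorted_keys[start:]:
        if key > high:
            break
        matches.append(key)
    return matches


print(select_sorted([1, 2, 2, 3, 5, 8, 10], 2, 5))
```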

20 Cost for Selections with Hash Index
Hash index and predicate Attr = value
–Algorithm: probe the hash table to find the page with matching values
–Clustered: fetch the records and read any overflow pages. Cost: (2 + # of overflow pages) I/Os
–Unclustered: fetch the data entries, fetch the data pages, and do the same for overflow pages. Cost: variable; in the worst case each match forces us to read a different page in the data file
Often we do not use an unclustered index for selections; use selectivity to make a good guess.

21 Selections using a B+-tree
When a selection has a where clause with a predicate of the form Attr op value:
–Clustered B+-tree: best strategy to use (if available). Note that a hash index is better if op is equality
–Unclustered B+-tree: depends on the SF of the predicate
–Algorithm: search the tree to find the first data entry that points to qualifying tuples (tuples that pass the where condition); scan the data entries to find and retrieve qualifying tuples; return these to the next operator; end when tuples that do not match the condition first appear

22 Selections using a B+-tree (2)
Costs:
–Clustered: # of visited index pages + # of visited data entries. The first item is usually 3-4 (depends on where the root is located). Estimate: 3 + # of visited data entries (assuming the root is in the buffer pool). The # of visited data entries can be estimated from the selectivity of the predicate
–Unclustered: # of visited index pages + # of visited data entries + # of file data pages read. Worst case: 1 file data page is read for each tuple. Bad for range queries
Unclustered is good when the selectivity factor is small
–Just a few pages are expected to be read

23 Cost estimates for Selections with B+ trees Clustered Unclustered

24 Some issues …
On an equality predicate, if the B+ tree index is clustered it is very probable that 1 page has all the data records
–Since the # of visited index pages is often 3-4 I/Os, the whole operation can be done in 4-5 I/Os!
For any selection, if the index is unclustered, the worst case is that a file data page is read for each tuple found in the index.
Typically, the DBMS first determines the page ids of the pages to read from an unclustered B+ tree. Then it sorts the page ids, reads the pages in order, and fetches the tuples from them
–This prevents a given page from being read more than once!
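The page-id sorting trick can be illustrated with a small sketch. It assumes a rid is a `(page_id, slot)` pair and that `read_page` is a caller-supplied function returning the tuples of a page; both are hypothetical conventions for this example, not the API of any particular DBMS.

```python
def fetch_by_rids(rids, read_page):
    """Fetch tuples for the given rids, reading each data page at most once."""
    # Group the requested slots by page id.
    slots_by_page = {}
    for page_id, slot in rids:
        slots_by_page.setdefault(page_id, []).append(slot)

    results = []
    # Visit pages in page-id order so every page is read exactly once.
    for page_id in sorted(slots_by_page):
        page = read_page(page_id)
        for slot in sorted(slots_by_page[page_id]):
            results.append(page[slot])
    return results


# Toy heap file with three pages and a read counter.
heap_file = {0: ["a", "b"], 1: ["c", "d"], 2: ["e", "f"]}
pages_read = []


def read_page(page_id):
    pages_read.append(page_id)
    return heap_file[page_id]


rows = fetch_by_rids([(2, 0), (0, 1), (2, 1), (0, 0)], read_page)
print(rows, pages_read)
```

Four rids are requested, but only two page reads happen because the rids fall on two distinct pages.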

25 Some issues on selectivity …
If an index I matches a predicate of the form Attr = value, the SF of the predicate is computed as
–SF = 1 / NKeys(I)
For an and expression, P1 and P2:
–SF = SFp1 * SFp2
For an or expression, P1 or P2:
–SF = SFp1 + SFp2 – SFp1 * SFp2
For a not expression, not P1:
–SF = 1 – SFp1
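These rules translate directly into small helper functions. The function names are invented for this sketch, and the and/or rules assume the predicates are independent.

```python
def sf_equality(n_keys):
    """SF of Attr = value when a matching index has n_keys distinct keys."""
    return 1 / n_keys


def sf_and(sf_p1, sf_p2):
    return sf_p1 * sf_p2


def sf_or(sf_p1, sf_p2):
    # Subtract the overlap so tuples passing both predicates are counted once.
    return sf_p1 + sf_p2 - sf_p1 * sf_p2


def sf_not(sf_p1):
    return 1 - sf_p1


print(sf_equality(40))     # index with 40 distinct keys
print(sf_and(0.10, 0.90))  # roughly 0.09
print(sf_or(0.10, 0.90))   # roughly 0.91
print(sf_not(0.40))        # roughly 0.60
```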

26 Example
Queries:
–Q1: Select sid, sname from Students where gpa = 4.0;
–Q2: Select sname, gpa, sage from Students where age < 20;
NTuples(Students): 10,000
Indices:
–Clustered hash index on sid: NKeys(I) = 10,000, NPages(I) = 500
–Unclustered B+-tree on age: NKeys(I) = 40, NPages(I) = 31, NTPages(I) = 333
Selectivity factors:
–gpa = 4.0: 10%
–age < 20: 40%
What should be the strategy to solve each of the queries above?

27 Evaluation with unclustered B+ tree index
Suppose query is of the form
If we have an unclustered B+ tree index I that matches predicate p, we could use it
–Need to see if it is cheaper than a full scan of the table
Estimated cost will be:

28 General Selection Conditions
A DBMS often deals with where clauses in two forms:
–The where clause has no disjunctions (no OR clauses)
–The where clause has an OR condition
Each predicate in the where clause has a free form:
–Attr op value
–Attr1 op Attr2
–Attr1 op func1(…) // embedded function call
–Func1(Attr1, …, Attrn) // function call is the predicate
DBMSs excel at optimizing where clauses with no disjunctions
–They drop the ball when an OR appears in the where clause

29 Query with predicates in CNF
Condition: Attr1 op value AND Attr2 op value …
The DBMS estimates the selectivity of each predicate and checks index availability.
The most selective predicate is evaluated first
–The rest are evaluated in order of selectivity
Alternatively, if indices exist for a subset of the conditions:
–Each such condition is evaluated separately
–Results are then intersected
–Any remaining predicates are then evaluated

30 Evaluation with rid and intersection
Algorithm:
–Use the index for each condition in the query that has an index match
–For each index used, store the rids of the tuples that pass the condition evaluated with the index
–Find the rids in the intersection
–Retrieve only the tuples with a rid in this intersection
–Apply the rest of the predicates to find the final answer
This alternative is good when indices are unclustered. Why?

31 Evaluating the intersection
Option 1:
–Sort the rid files on rid
–Do a merge operation to find the rids that appear in all files
–These are the records that pass the set of predicates being evaluated with the indices
–Pages are then read only once! Why?
Option 2:
–Build a clustered index on the rid for each file
–Compute an index join on the rids
–The rids that result are the rids of tuples that pass the selection evaluation
A similar technique is used to evaluate materialized views.
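Option 1 can be sketched as a sort followed by a merge-style intersection. The example assumes rids are `(page_id, slot)` tuples held in Python lists (one list per index used); the function names are illustrative.

```python
def intersect_sorted(rids_a, rids_b):
    """Merge-style intersection of two sorted, duplicate-free rid lists."""
    i = j = 0
    common = []
    while i < len(rids_a) and j < len(rids_b):
        if rids_a[i] == rids_b[j]:
            common.append(rids_a[i])
            i += 1
            j += 1
        elif rids_a[i] < rids_b[j]:
            i += 1
        else:
            j += 1
    return common


def intersect_rid_lists(rid_lists):
    """Rids that appear in every list, returned in rid (page) order."""
    if not rid_lists:
        return []
    current = sorted(set(rid_lists[0]))
    for rids in rid_lists[1:]:
        current = intersect_sorted(current, sorted(set(rids)))
    return current


from_index_1 = [(1, 0), (2, 3), (5, 1)]
from_index_2 = [(5, 1), (1, 0), (7, 2)]
print(intersect_rid_lists([from_index_1, from_index_2]))
```

Because the result comes out in rid order, the tuples can be fetched page by page, so each data page is read at most once.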

32 Query with predicate in DNF
Strategy 1:
–File scan and evaluate the where clause
–Cost: read the entire relation
Strategy 2:
–Separate each term in the where clause into a separate query
–Union the results
–Cost: sum of the costs of the individual operations
Example:
–Select sid, sname from Students where age = 20 OR gpa > 3.50;
–becomes the union of:
–Select sid, sname from Students where age = 20;
–Select sid, sname from Students where gpa > 3.50;

33 More on strategy 2
If the where clause is of the form:
–Attr1 op value1 AND Attr2 op value2 OR Attr3 op value3
–The strategy is to use any indices to help evaluate Attr3 op value3, then apply the predicate Attr1 op value1 AND Attr2 op value2 to the result
If the where clause is of the form:
–Attr1 op value1 OR Attr2 op value2 OR Attr3 op value3
–The strategy is to use an available index to evaluate each predicate separately
–Then take the union of the results
This can be done with rids: store the rids, sort them, and then fetch the tuple for each rid. Fetch each tuple only once.
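The rid-based union for an all-OR where clause might look like the sketch below. It assumes each index lookup returns a list of `(page_id, slot)` rids; `union_rids` is a made-up name for the illustration.

```python
def union_rids(rid_lists):
    """Union of the rid lists, deduplicated and sorted so each tuple is fetched once."""
    return sorted(set().union(*rid_lists))


# Hypothetical rids returned by two index lookups (age = 20, gpa > 3.50).
rids_age = [(2, 0), (1, 1)]
rids_gpa = [(1, 1), (3, 0)]
print(union_rids([rids_age, rids_gpa]))
```

The rid `(1, 1)` satisfies both predicates but appears once in the result, so the corresponding tuple is fetched only once.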

34 Evaluation of Projections
The main issue is duplicate elimination.
–If no duplicate elimination is needed, just scan the tuples and project the attributes
–If a selection was applied first, simply project the tuples coming from the selection operator
–Examples: Select sid, sname from Students; Select sid from Students where gpa = 4.0;
Duplicate elimination makes things harder
–Need to project the tuples
–Remove duplicates from this result set
Strategies for duplicate elimination:
–Partition via sorting (using external sorting)
–Partition via hashing
–Indexing on projected attributes

35 Projections via Sorting
Strategy:
1. Scan relation R and project tuples to get the desired attributes
2. Store the projected tuples into a temporary relation T
3. Sort the tuples of T. The sort key is the set of all attributes in the tuple
4. Scan the sorted tuples, compare adjacent tuples, and keep only one copy of each repeated tuple (discard the others)
Cost estimates:
–Step 1: NPages(R) I/Os
–Step 2: NPages(T) I/Os
–Step 3: O(N log N), where N = NPages(T)
–Step 4: NPages(T) I/Os
–Computational complexity of the algorithm: O(N log N), N = NPages(T)
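The four steps can be sketched in memory as follows. Rows are assumed to be Python dictionaries and Python's built-in sort stands in for external sorting; `project_distinct_sorted` is a name chosen for this example.

```python
def project_distinct_sorted(rows, attrs):
    """Projection with duplicate elimination via sorting."""
    # Steps 1-2: project each row onto the desired attributes (temporary relation T).
    projected = [tuple(row[attr] for attr in attrs) for row in rows]
    # Step 3: sort on all projected attributes.
    projected.sort()
    # Step 4: duplicates are now adjacent, so keep a tuple only if it differs
    # from the last one kept.
    result = []
    for tup in projected:
        if not result or result[-1] != tup:
            result.append(tup)
    return result


reserves = [
    {"sid": 1, "bid": 10, "day": "mon"},
    {"sid": 2, "bid": 10, "day": "tue"},
    {"sid": 1, "bid": 10, "day": "wed"},
]
print(project_distinct_sorted(reserves, ["sid", "bid"]))
```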

36 Example: Evaluation via Sorting
Query: Select distinct R.sid, R.bid from Reserves R;
Reserves:
–NTuples(R) = 100,000
–Tuple size = 100 bytes
–Page size = 4096 bytes
–Buffers for sorting = 6
–Size of projected tuples: 16 bytes
How much does it cost to evaluate this projection?
–Step 1 = ⌈100,000 / ⌊4096/100⌋⌉ I/Os = 2,500 I/Os
–Step 2 = ⌈100,000 / ⌊4096/16⌋⌉ I/Os = 391 I/Os
–Step 3 = 2 * 391 * (⌈log_5 ⌈391/6⌉⌉ + 1) = 2 * 391 * 4 = 3,128 I/Os
–Step 4 = 391 I/Os
–Total = 2,500 + 391 + 3,128 + 391 = 6,410 I/Os
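The arithmetic of this example can be reproduced step by step. The helpers `ceil_div` and `ceil_log` are written just for this sketch and use integer arithmetic only.

```python
def ceil_div(a, b):
    """Ceiling of a / b for positive integers."""
    return -(-a // b)


def ceil_log(value, base):
    """Smallest k >= 0 such that base**k >= value."""
    k, power = 0, 1
    while power < value:
        power *= base
        k += 1
    return k


N_TUPLES = 100_000
PAGE_SIZE = 4096
TUPLE_SIZE = 100
PROJECTED_SIZE = 16
BUFFERS = 6

step1 = ceil_div(N_TUPLES, PAGE_SIZE // TUPLE_SIZE)        # scan R: 40 tuples per page
pages_t = ceil_div(N_TUPLES, PAGE_SIZE // PROJECTED_SIZE)  # 256 projected tuples per page
step2 = pages_t                                            # write T
step3 = 2 * pages_t * (ceil_log(ceil_div(pages_t, BUFFERS), BUFFERS - 1) + 1)  # sort T
step4 = pages_t                                            # final scan of sorted T
total = step1 + step2 + step3 + step4
print(step1, step2, step3, step4, total)
```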

37 Projections via Hashing
If we have a lot of memory buffers, hashing provides an attractive alternative
–Modern DBMS servers have gigabytes of RAM
Suppose we have B buffers to use for projections. The general idea is as follows:
–Have B-1 buckets resident on disk (B-1 temporary files on disk)
–Use 1 buffer to keep tuples read from relation R
–Use a hash function to hash each tuple to a bucket
–Use 1 buffer per bucket to keep hashed tuples in memory. Flush the page to the disk-resident bucket only when the page is full. This forms an in-memory hash table
–Remove duplicates at the buckets on disk

38 Partition Phase (Phase I)
[Figure: tuples of the input relation are read through one input buffer, passed through hash function H, and routed through the output buffers to the disk-resident partitions]
Build B-1 disk-resident partitions of variable size.

39 Duplicate Elimination Phase (Phase II)
[Figure: the input partitions are taken one at a time, passed through hash function H, and the results without duplicates are written to the output]
Build a hash table in memory with the whole tuple as key, and remove duplicates from each bucket.
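The two phases can be simulated with in-memory lists standing in for the disk-resident partitions. This is a sketch under that assumption; Python's built-in `hash` plays the role of H, a `set` plays the role of the per-partition hash table, and `project_distinct_hashed` is a hypothetical name.

```python
def project_distinct_hashed(rows, attrs, n_buffers):
    """Projection with duplicate elimination via hash partitioning."""
    n_partitions = n_buffers - 1  # one buffer is reserved for input

    # Phase I: project each tuple and route it to a partition with the hash function.
    partitions = [[] for _ in range(n_partitions)]
    for row in rows:
        tup = tuple(row[attr] for attr in attrs)
        partitions[hash(tup) % n_partitions].append(tup)

    # Phase II: handle one partition at a time, using the whole tuple as the key.
    # Identical tuples always land in the same partition, so this removes all duplicates.
    result = []
    for partition in partitions:
        seen = set()
        for tup in partition:
            if tup not in seen:
                seen.add(tup)
                result.append(tup)
    return result


reserves = [
    {"sid": 1, "bid": 10, "day": "mon"},
    {"sid": 2, "bid": 10, "day": "tue"},
    {"sid": 1, "bid": 10, "day": "wed"},
]
print(sorted(project_distinct_hashed(reserves, ["sid", "bid"], n_buffers=4)))
```

Unlike the sorting approach, the output order depends on the hash function, which is why the usage example sorts the result before printing.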

40 Some Issues
The hash function should distribute tuples uniformly.
This option for projection evaluation is very memory consuming
–It should only be used if enough buffers are available
How many buffers are enough?
–Let B be the number of buffers to use
–We have B-1 partitions
–Let T be the number of pages with tuples after projecting table R
–We have T/(B-1) pages in each partition
–The hash table size will be (T/(B-1)) * f, where f is a fudge factor to compensate for the extra space needed
–Thus, each partition's hash table fits in memory when B ≥ f * T/(B-1), i.e., approximately B > sqrt(f * T)

