1
1 Optimization - Selection
2
2 The Selection Operation
- Table: Reserves(sid, bid, day, agent)
- A page (block) can hold 100 Reserves tuples
- There are 1,000 pages
- Compute the query: SELECT * FROM Reserves R WHERE R.agent = 'Joe'
3
3 General Problem
- Compute σ_{A op c}(R), where
  - A is an attribute of R
  - op is an operator, such as =, <, etc.
  - c is a constant
- We assume that there are M pages in R.
4
4 No Index, Unsorted Data
- We can compute the selection by scanning the entire relation.
- Cost: M I/Os (for the query on Reserves, cost = 1,000)
5
5 No Index, Sorted Data
- Suppose that the file in which R is stored is physically sorted on A.
- Then we can use binary search to find the first tuple matching the search condition, and scan from there to find all additional tuples that match.
- Cost: log2(M) + #pages with tuples from the result
- For the query on Reserves: log2(1,000) ≈ 10, so the cost is 10 + #pages with tuples from the result.
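The two no-index costs can be compared with a small calculation. This is only a sketch: the function names are illustrative, and the binary search is assumed to cost ceil(log2(M)) page reads.

```python
import math


def scan_cost(num_pages: int) -> int:
    """Unsorted file, no index: every page of R must be read."""
    return num_pages


def sorted_file_cost(num_pages: int, result_pages: int) -> int:
    """Sorted file: binary search for the first match, then read the result pages.

    Assumption: the binary search costs ceil(log2(M)) page reads.
    """
    return math.ceil(math.log2(num_pages)) + result_pages


# Reserves has M = 1,000 pages.
print(scan_cost(1000))            # 1000
print(sorted_file_cost(1000, 1))  # 11: 10 for the search plus 1 result page
```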
6
6 B+ Tree Index
- Suppose that there is a B+ Tree index available on attribute A of R.
- Search the tree to find the first index entry that points to a tuple satisfying the condition.
- Scan the leaf pages of the index to find all entries whose key value satisfies the condition.
- Retrieve the satisfying tuples from the file.
7
7 Example
- Selection condition: rname < 'C%'
- Assume that names are uniformly distributed with respect to first letter.
- About 2/26, taken here as roughly 10%, of the tuples match the condition: 10,000 tuples = 100 pages.
- If the B+ Tree is clustered, we can return them in 100 I/Os (plus a few to traverse the tree).
- Otherwise, up to 10,000 I/Os may be needed!
  - Can you improve the number of I/Os needed?
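The gap between the clustered and unclustered case can be made concrete with a rough estimator. This is a sketch under stated assumptions: `tree_ios=3` stands in for the "few" I/Os needed to traverse the tree, the cost of scanning index leaf pages is ignored, and the unclustered case is the worst case of one page fetch per matching tuple.

```python
def btree_selection_cost(matching_tuples: int, tuples_per_page: int,
                         clustered: bool, tree_ios: int = 3) -> int:
    """Rough I/O estimate for a range selection through a B+ tree index.

    tree_ios is an assumed value for "a few I/Os to traverse the tree".
    """
    if clustered:
        # Matching tuples sit on consecutive data pages (ceiling division).
        data_ios = -(-matching_tuples // tuples_per_page)
    else:
        # Worst case: every matching tuple lives on a different page.
        data_ios = matching_tuples
    return tree_ios + data_ios


print(btree_selection_cost(10_000, 100, clustered=True))   # 103
print(btree_selection_cost(10_000, 100, clustered=False))  # 10003
```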
8
8 Hash Index, Equality Selection
- Find the appropriate bucket (about 1-2 I/Os).
- Retrieve the qualifying tuples from R.
  - Time depends on whether the index is clustered.
- Example: Consider the condition agent = 'Joe'. Suppose there is an unclustered hash index on agent, and suppose there are 100 reservations made by Joe.
- Cost: between 1 and 100 I/Os to retrieve the tuples, depending on how many different pages they are on.
9
9 Optimization - Join
10
10 The Join Operation
Compute the query: SELECT * FROM Reserves R, Sailors S WHERE R.sid = S.sid
Relation   Number of Pages   Tuples per Page
R          1000 (M)          100 (pR)
S          500 (N)           80 (pS)
11
11 Block Nested Loops Join
Suppose there are B buffer pages.
foreach block of B-2 pages of R do
    foreach page of S do {
        for all matching in-memory pairs r, s:
            add (r, s) to result
    }
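The loop structure of the block nested loops join can be sketched in Python. This is an in-memory illustration, not a DBMS implementation: a "page" is modelled as a list of tuples, and the function name and the `r_key`/`s_key` parameters (positions of the join attribute) are hypothetical.

```python
def block_nested_loops_join(r_pages, s_pages, buffer_pages, r_key, s_key):
    """Join R and S, holding B-2 pages of the outer relation R at a time.

    One buffer page is reserved for reading S and one for output,
    which leaves buffer_pages - 2 pages for each block of R.
    """
    if buffer_pages < 3:
        raise ValueError("need at least 3 buffer pages")
    block_size = buffer_pages - 2
    result = []
    for start in range(0, len(r_pages), block_size):
        # Load the next block of R pages into memory.
        block = [r for page in r_pages[start:start + block_size] for r in page]
        # Scan all of S once per block of R.
        for s_page in s_pages:
            for s in s_page:
                for r in block:
                    if r[r_key] == s[s_key]:
                        result.append(r + s)
    return result


r_pages = [[(1, "a")], [(2, "b")], [(3, "c")]]
s_pages = [[(2, "x"), (3, "y")], [(3, "z")]]
print(block_nested_loops_join(r_pages, s_pages, buffer_pages=3, r_key=0, s_key=0))
```

With `buffer_pages=3` each block holds a single page of R, so S is scanned three times in this toy run.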
12
12 Cost Analysis
- R is read once: M
- S is read ceil(M/(B-2)) times: ceil(M/(B-2))*N
- Total: M + ceil(M/(B-2))*N
- For our Sailors, Reserves example (B = 102):
  - Reserves is the outer relation: Cost = 1,000 + (1,000/100)*500 = 6,000 I/Os
  - Sailors is the outer relation: Cost = 500 + (500/100)*1,000 = 5,500 I/Os
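The cost formula M + ceil(M/(B-2))*N is easy to check for both choices of outer relation. The function name below is illustrative.

```python
import math


def bnl_cost(outer_pages: int, inner_pages: int, buffer_pages: int) -> int:
    """I/O cost of a block nested loops join: M + ceil(M / (B - 2)) * N."""
    inner_scans = math.ceil(outer_pages / (buffer_pages - 2))
    return outer_pages + inner_scans * inner_pages


print(bnl_cost(1000, 500, 102))  # Reserves as outer: 6000
print(bnl_cost(500, 1000, 102))  # Sailors as outer: 5500
```

Putting the smaller relation on the outside reduces the number of scans of the inner relation, which is why Sailors as the outer relation is cheaper here.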
13
13 Index Nested Loops Join
Suppose there is an index on the join attribute of S.
foreach tuple r of R
    foreach tuple s of S where r_i = s_j
        add (r, s) to result
We find the matching tuples s using the index!
14
14 Cost Analysis
1. Search the index. If the index is a B+ Tree, about 2-4 I/Os to find the leaf. If the index is a hash index, about 1-2 I/Os to find the bucket.
2. Retrieve the tuples. Time depends on whether the index is clustered. If so, the cost is usually 1 I/O per r tuple. Otherwise, it could be 1 I/O per matching s tuple.
15
15 Example (1)
- Hash index on sid of Sailors (about 1.2 I/Os to find the bucket).
- sid is a key in Sailors, so there is at most one matching tuple (actually, exactly 1: why?)
- Scanning Reserves: 1,000 I/Os
- There are 100 * 1,000 = 100,000 tuples in Reserves.
- For each tuple, search the index (1.2) and retrieve the page (1).
- Total time: 1,000 + 100,000 * 2.2 = 221,000 I/Os
16
16 Example (2)
- Hash index on sid of Reserves.
- Scanning Sailors: 500 I/Os
- There are 80 * 500 = 40,000 tuples in Sailors.
- For each tuple, search the index (1.2) and retrieve the matching pages (how many?).
- There are 100,000 reservations for 40,000 sailors. Assuming a uniform distribution of reservations, each Sailors tuple matches about 2.5 Reserves tuples.
- If the index is unclustered, they may be on different pages, so retrieval costs about 2.5 I/Os per Sailors tuple.
- Total: 500 + 40,000*(1.2 + 2.5) = 148,500 I/Os
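Both index nested loops examples follow the same pattern: scan the outer relation once, then pay a probe plus a fetch for each outer tuple. A minimal sketch, with illustrative parameter names and the result rounded to whole I/Os:

```python
def inl_cost(outer_pages: int, outer_tuples_per_page: int,
             probe_ios: float, fetch_ios: float) -> int:
    """I/O cost of an index nested loops join.

    probe_ios: average I/Os to search the index for one outer tuple.
    fetch_ios: average I/Os to retrieve the matching inner tuples.
    """
    outer_tuples = outer_pages * outer_tuples_per_page
    return outer_pages + round(outer_tuples * (probe_ios + fetch_ios))


# Reserves outer, hash index on Sailors.sid (exactly one match per tuple).
print(inl_cost(1000, 100, probe_ios=1.2, fetch_ios=1.0))  # 221000
# Sailors outer, unclustered hash index on Reserves.sid (about 2.5 matches).
print(inl_cost(500, 80, probe_ios=1.2, fetch_ios=2.5))    # 148500
```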
17
17 Sort-Merge Join
- Sort both relations on the join attribute. This creates "partitions" according to the join attribute.
- Join the relations while merging them. Tuples in corresponding partitions are joined.
- Cost depends on whether partitions are large and are therefore scanned multiple times.
18
18 Sailors
sid  sname   rating  age
22   dustin  7       45
28   yuppy   9       35
31   lubber  8       55
36   lubber  6       36
44   guppy   5       35
58   rusty   10      35
Reserves
sid  bid  day      agent
28   103  12/4/96  Joe
28   103  11/3/96  Frank
31   101  10/2/96  Joe
31   102  12/7/96  Sam
31   101  13/7/96  Sam
58   103  22/6/96  Frank
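The merge step can be traced on the sample Sailors and Reserves instances from this slide. The following is an in-memory sketch: the function name and the key-position parameters are illustrative, the rows are transcribed from the slide, and Python's built-in sort stands in for an external sort.

```python
def sort_merge_join(r, s, r_key, s_key):
    """Sort both inputs on the join attribute, then merge matching partitions."""
    r = sorted(r, key=lambda t: t[r_key])
    s = sorted(s, key=lambda t: t[s_key])
    result = []
    i = j = 0
    while i < len(r) and j < len(s):
        if r[i][r_key] < s[j][s_key]:
            i += 1
        elif r[i][r_key] > s[j][s_key]:
            j += 1
        else:
            key = r[i][r_key]
            # Find the end of the current partition in each relation.
            i_end = i
            while i_end < len(r) and r[i_end][r_key] == key:
                i_end += 1
            j_end = j
            while j_end < len(s) and s[j_end][s_key] == key:
                j_end += 1
            # Join every pair of tuples in the two partitions.
            for a in r[i:i_end]:
                for b in s[j:j_end]:
                    result.append(a + b)
            i, j = i_end, j_end
    return result


SAILORS = [(22, "dustin", 7, 45), (28, "yuppy", 9, 35), (31, "lubber", 8, 55),
           (36, "lubber", 6, 36), (44, "guppy", 5, 35), (58, "rusty", 10, 35)]
RESERVES = [(28, 103, "12/4/96", "Joe"), (28, 103, "11/3/96", "Frank"),
            (31, 101, "10/2/96", "Joe"), (31, 102, "12/7/96", "Sam"),
            (31, 101, "13/7/96", "Sam"), (58, 103, "22/6/96", "Frank")]

joined = sort_merge_join(RESERVES, SAILORS, r_key=0, s_key=0)
for row in joined:
    print(row)
```

Sailors 22, 36 and 44 have no reservations, so they are skipped during the merge; each of the six Reserves tuples finds exactly one Sailors partner.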
19
19 Cost Analysis
- Sort R: O(M log M)
- Sort S: O(N log N)
- Merge: M + N if partitions aren't scanned multiple times. Otherwise, the worst case is M*N.
- Cost: O(M + N + M log M + N log N)
20
20 Example
- Sort Reserves: 1,000 log 1,000 ≈ 10,000 (actually better with a good algorithm for external sorting)
- Sort Sailors: 500 log 500 ≈ 4,500
- Merge: 1,000 + 500 = 1,500
- Total: 1,500 + 10,000 + 4,500 = 16,000 (actually about 7,500 if sorting is done well)
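The two totals can be reproduced with a short calculation. The first function follows the M log M estimate with base-2 logarithms. The second is an assumption about what "sorting done well" means: an external sort that finishes in two passes, each pass reading and writing every page once. That assumption does give the 7,500 figure, but the slide does not state how it was derived.

```python
import math


def naive_sort_merge_cost(m: int, n: int) -> float:
    """M log2 M + N log2 N for the sorts, plus M + N for the merge."""
    return m * math.log2(m) + n * math.log2(n) + m + n


def two_pass_sort_merge_cost(m: int, n: int) -> int:
    """Assumed two-pass external sort: 2 passes * (read + write) per page."""
    sort_cost = 2 * 2 * m + 2 * 2 * n
    return sort_cost + m + n


print(round(naive_sort_merge_cost(1000, 500)))  # close to the slide's 16,000
print(two_pass_sort_merge_cost(1000, 500))      # 7500
```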
21
21 Hash Join
// Partition R into k partitions
foreach tuple r in R do
    read r and add it to buffer page h(r_i)    // flush the page when it fills
// Partition S into k partitions
foreach tuple s in S do
    read s and add it to buffer page h(s_j)    // flush the page when it fills
for l = 1..k
    // Build in-memory hash table for R_l using h2
    foreach tuple r in R_l do
        read r and insert into hash table with h2
    foreach tuple s in S_l do
        read s and probe table using h2
        output matching pairs
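The two phases of the hash join can be sketched in Python. This is an in-memory illustration only: partitions are plain lists rather than files flushed page by page, `hash(...) % k` plays the role of h, and a Python dict plays the role of the second hash function h2. The function and parameter names are hypothetical.

```python
from collections import defaultdict


def hash_join(r, s, r_key, s_key, k=4):
    """Partition both inputs with h, then build and probe per partition."""
    # Phase 1: partition R and S into k partitions using h.
    r_parts = [[] for _ in range(k)]
    s_parts = [[] for _ in range(k)]
    for t in r:
        r_parts[hash(t[r_key]) % k].append(t)
    for t in s:
        s_parts[hash(t[s_key]) % k].append(t)

    # Phase 2: for each partition, build a table on R_l and probe it with S_l.
    result = []
    for r_part, s_part in zip(r_parts, s_parts):
        table = defaultdict(list)  # stands in for the h2 hash table
        for t in r_part:
            table[t[r_key]].append(t)
        for t in s_part:
            for match in table.get(t[s_key], []):
                result.append(match + t)
    return result


r = [(28, 103), (31, 101), (31, 102), (58, 103)]
s = [(22, "dustin"), (28, "yuppy"), (31, "lubber"), (58, "rusty")]
print(sorted(hash_join(r, s, r_key=0, s_key=0)))
```

Tuples with equal join values always hash to the same partition number, so only corresponding partitions R_l and S_l need to be compared.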
22
22 Cost Analysis
- Partition phase: read R and S once and write them once: 2(M + N)
- Second phase: read each partition once, assuming that it fits into memory: M + N
- Cost: 3(M + N)
- In our example: 3(1,000 + 500) = 4,500 I/Os
23
23 Estimating Result Sizes
24
24 Picking a Query Plan
- Suppose we want to find the natural join of Reserves, Sailors, and Boats.
- The 2 options that appear the best are (ignoring the order within a single join):
  - (Sailors ⋈ Reserves) ⋈ Boats
  - Sailors ⋈ (Reserves ⋈ Boats)
- We would like intermediate results to be as small as possible. Which is better?
25
25 Analyzing Result Sizes
- In order to answer the question on the previous slide, we must be able to estimate the sizes of (Sailors ⋈ Reserves) and (Reserves ⋈ Boats).
- The DBMS stores statistics about the relations and indexes.
- They are updated periodically (not every time the underlying relations are modified).
26
26 Statistics Maintained by DBMS
- Cardinality: number of tuples NTuples(R) in each relation R
- Size: number of pages NPages(R) in each relation R
- Index Cardinality: number of distinct key values NKeys(I) for each index I
- Index Size: number of pages INPages(I) in each index I
- Index Height: number of non-leaf levels IHeight(I) in each B+ Tree index I
- Index Range: the minimum value ILow(I) and maximum value IHigh(I) for each index I
27
27 Estimating Result Sizes
Consider: SELECT attribute-list FROM relation-list WHERE term_1 AND ... AND term_n
- The maximum number of tuples is the product of the cardinalities of the relations in the FROM clause.
- The WHERE clause associates a reduction factor with each term.
- Estimated result size: maximum size times the product of the reduction factors.
28
28 Estimating Reduction Factors
- column = value: 1/NKeys(I) if there is an index I on column. This assumes a uniform distribution. Otherwise, System R assumes 1/10.
- column1 = column2: 1/Max(NKeys(I1), NKeys(I2)) if there is an index I1 on column1 and an index I2 on column2. If only one column has an index I, use 1/NKeys(I). Otherwise, use 1/10.
- column > value: (High(I) - value)/(High(I) - Low(I)) if there is an index I on column.
29
29 Example
- Cardinality(R) = 1,000 * 100 = 100,000
- Cardinality(S) = 500 * 80 = 40,000
- NKeys(Index on R.agent) = 100
- High(Index on Rating) = 10, Low = 0
- Query: SELECT * FROM Reserves R, Sailors S WHERE R.sid = S.sid AND S.rating > 3 AND R.agent = 'Joe'
30
30 Example (cont.)
- Maximum cardinality: 100,000 * 40,000
- Reduction factor of R.sid = S.sid: 1/40,000
- Reduction factor of S.rating > 3: (10-3)/(10-0) = 7/10
- Reduction factor of R.agent = 'Joe': 1/100
- Total estimated size: 100,000 * 40,000 * (1/40,000) * (7/10) * (1/100) = 700
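The estimate can be reproduced with the reduction-factor rules from the earlier slide. The helper names below are illustrative, and exact fractions are used so that the product comes out as a whole number. The 1/40,000 factor for the join term assumes that the larger NKeys value among the sid indexes is 40,000, which is what the slide's figure implies.

```python
from fractions import Fraction
from math import prod


def rf_equals_value(nkeys=None):
    """column = value: 1/NKeys(I) with an index, else the 1/10 default."""
    return Fraction(1, nkeys) if nkeys else Fraction(1, 10)


def rf_equals_column(nkeys1=None, nkeys2=None):
    """column1 = column2: 1/max(NKeys) over the available indexes, else 1/10."""
    known = [n for n in (nkeys1, nkeys2) if n]
    return Fraction(1, max(known)) if known else Fraction(1, 10)


def rf_greater_than(value, low, high):
    """column > value: (High - value) / (High - Low)."""
    return Fraction(high - value, high - low)


def estimate_size(cardinalities, reduction_factors):
    """Maximum size times the product of the reduction factors."""
    return prod(cardinalities) * prod(reduction_factors)


factors = [
    rf_equals_column(nkeys1=40_000),   # R.sid = S.sid
    rf_greater_than(3, low=0, high=10),  # S.rating > 3
    rf_equals_value(nkeys=100),        # R.agent = 'Joe'
]
print(estimate_size([100_000, 40_000], factors))  # 700
```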