1 Optimization Recap and examples
2 Optimization introduction For every SQL expression, there are many possible ways of implementation. The different alternatives could result in huge run-time differences. Our aim is to introduce the basic hardware used, and optimization principles
3 Disk-Memory-CPU Delete from Sailors where sid=90 DISK sailors Reserves Main Memory CPU
4 Hardware Recap The DB is kept on the Disk. The Disk is divided into BLOCKS Any processing of the information occurs in the Main Memory. Therefore, a block which we want to access has to be brought from the Disk to the memory, and perhaps written back. Blocks are read/written from/to the Disk as single units. The time of reading/writing a block to/from the disk is an I/O operation, and takes a lot of time.
5 Hardware Recap We assume a constant time for each Disk access, and that only disk access define the run time. We do not consider writing to the disk Every table in the DB is stored as a File (on the Disk), which is a ‘bunch of Blocks’. We will deal with files that are ‘heap-sorted’, i.e., there is no order in the file tuples Every block contains many tuples, each of them has a Record ID (RID), which states its location: (number of block, number of tuple within the block)
6 SID SNAME ratingage 1923Joe Phil Boe Bill Paul Jim Vicky954 RID Block 1 Block 2 Block 3 (1,1) (1,2) (1,3) (2,1) (2,2) (2,3) (3,1)
7 SID SNAME ratingage 1923Joe Phil Boe Bill Paul Jim Vicky954 B blocks t tuples Q: What would be the cost of the following queries? Select * from sailors Select * from sailors where sname= ‘ Jim ’ Select * from sailors where rating>4 Answer: B
8 Indexes on files An Index on a table is an additional file which helps access the data fast. An index holds ‘data entries’ to the table file The index can have the structure of a B+ Tree, or a hash function.
9 Tree index on sname of sailors ‘ A ’ -> ’ M ’ B1 ‘ N ’ -> ’ Z ’ B2 ‘ N ’ -> ’ T ’ L3 ‘ U ’ -> ’ Z ’ L4 ‘ A ’ -> ’ G ’ L1 ‘ H ’ -> ’ M ’ L2 Root block Leaf blocks Branch blocks ‘ Bill ’ (2,1) ‘ Boe ’ (1,3) ‘ Vicky ’ (3,1) … ‘ Paul ’ (2,2) ‘ Phil ’ (1,2) ‘ Jim ’ (2,3) ‘ Joe ’ (1,1) B1 L4L3 L2L1 B2 SIDSNAMEratingage 1923Joe Phil Boe Bill Paul Jim Vicky954
10 Tree index The tree is kept balanced The tree entries are always ordered The leaves point to the exact location of tuples Getting to the leaf is typically 2-3 I/O Each leaf points to the next/previous leaf A Clustered index means that the index and the table are ordered by the same attribute
11 Tree index on sname of sailors ‘ Bill ’ (2,1) ‘ Boe ’ (1,3) ‘ Phil ’ (1,2) … ‘ Joe ’ (2,2) ‘ Joe ’ (3,1) ‘ Jim ’ (2,3) ‘ Joe ’ (1,1) B1 L4L3 L2L1 B2 SIDSNAMEratingage 1923Joe Phil Boe Bill Joe Jim Joe954 How would the following queries be processed? Select * from sailors where sname= ‘ Joe ’ Select * from sailors Select * from sailors where sname> ’ J ’ Notice: index is not clustered
12 Tree index on sname of sailors ‘ Bill ’ (1,1) ‘ Boe ’ (1,2) ‘ Phil ’ (3,1) … ‘ Joe ’ (2,2) ‘ Joe ’ (2,3) ‘ Jim ’ (1,1) ‘ Joe ’ (2,1) B1 L4L3 L2L1 B2 SIDSNAMEratingage 1226Bill Boe Jim Joe Joe Joe Phil941 How would the following queries be processed? Select * from sailors where sname= ‘ Joe ’ Select * from sailors Select * from sailors where sname> ’ J ’ Notice: index is clustered
13 Hash index Works in a similar way, but using a hash function instead of a tree Works only for equality conditions Average of 1.2 I/O to get to the tuple location
14 Natural Join We want to compute Naïve algorithm: SELECT * FROM Reserves R, Sailors S WHERE R.sid = S.sid Foreach tuple r in R Foreach tuple s in S if r.sid=s.sid add r,s to result Cost: B R +t R *B S Running example data t R =5000 t S =10, tuples per block 12 buffer pages = *200=1,000,100
15 Natural Join We want to compute We have 4 optional algorithms: 1.Block Nested Loops Join 2.Index Nested Loops Join 3.Sort Merge Join 4.Hash Join SELECT * FROM Reserves R, Sailors S WHERE R.sid = S.sid This is assuming there is not enough space in the memory for the smaller of the 2 relations+2
16 Block Nested Loop Join Suppose there are B available blocks in the memory, B R blocks of relation R, and B S blocks of relations S, and B R <B S Until all blocks of R have been read: –Read B-2 blocks of R –Read all blocks of S (one by one), and write the result Run time: B R + B S * ceil(B R /(B-2)) = *100/10=2,100
17 Index Nested Loop Suppose there is an index on sid of Sailors Until all blocks of R have been read: –Read a block of R –For each tuple in the block, use the index of S to locate the matching tuples in S. We mark the time it takes to read the tuples in S that match a single tuple in R as X. Run time: B R + t R *X If the index is clustered, X=2-4 If it is not clustered, we evaluate X. = *3=15,100
18 Q: So when would we typically choose to use an index-nested loop over block- nested? A: Look at the inequality…
19 Sort-Merge Join Sort both relations on the join column Join them according to the join algorithm: sidbiddayagent /4/96Joe /3/96Frank /2/96Joe /7/96Sam /7/96Sam /6/96Frank sidsnameratingage 22dustin745 28yuppy935 31lubber855 36lubber636 44guppy535 58rusty1035
20 Run time of Sort-Merge M,N: number of blocks of the relations Sorting: MlogM+NlogN Merging: N+M if no partition is scanned twice. Total: MlogM+NlogN+N+M Especially good if one or both of the relations are already sorted. = 100*7+200* =2,600
21Question Suppose: tuple size= 100 bytes number of tuples (employees)=3,000 Page size=1000 bytes You have an unclustered index on Hobby. You know that 50 employees collect stamps. Would you use the index? And for 1,000 stamp-lovers? SELECT E.dno FROM Employees E WHERE E.hobby=‘stamps’
22 Question 2 Length of tuples, Number of tuples –Emp: 20 bytes, 20,000 tuples –Dept: 40 bytes, 5000 tuples Pages contain 4000 bytes; 12 buffer pages Which algorithm would you use if there is an unclustered tree index on E.eid? And clustered? SELECT E.ename FROM Employees E, Departments D WHERE E.eid=D.eid