CS522 Advanced database Systems Huiping Guo Department of Computer Science California State University, Los Angeles 3. Overview of data storage and indexing (2)
3. Overview(2) CS522_S Outline Comparison of file organizations Choice of indices Index only plans
3. Overview(2) CS522_S File organizations and operation cost Which file organization is good? m Each file organization makes certain operations efficient but other operations expensive Compare the costs of some simple operations for some basic file organizations m Example table Employees: (name, age, sal) m Assume files and indices are organized according to the composite search key m All selection operations are specified on age and sal Goal m Emphasize the importance of an appropriate file organization
Operations Scan m Fetch all records in the file to memory (buffer pool) Search with equality selection m Fetch all records that satisfy an equality selections m Locate qualifying records Search with range selection m Fetch all records that satisfy a range selection Insert a record m Indentify the page in the file into which the new record must be inserted m Fetch that page from disk m Modify it to include the new record m Write back the modified page Delete a record 3. Overview(2) CS522_S16 3-4
3. Overview(2) CS522_S16 Cost Model for Our Analysis B: The number of data pages R: Number of records per page D: (Average) time to read or write disk page C: (Average) time to process a record) Measuring number of page I/O’s ignores gains of pre-fetching a sequence of pages; thus, even I/O cost is only approximated. Average-case analysis; based on several simplistic assumptions. 3-5
3. Overview(2) CS522_S16 Comparing File Organizations Heap files (random order; insert at eof) Sorted files, sorted on Clustered B+ tree file, Alternative (1), search key Heap file with unclustered B + tree index on search key Heap file with unclustered hash index on search key 3-6
3. Overview(2) CS522_S16 Heap files Scan m Cost: B(D+RC) Search with equality selection m Assumption: exactly one match. m On average, half of the file is scanned m Cost: 0.5B(D+RC) Search with range selection m The entire must be scanned m Cost: B(D+RC) 3-7
3. Overview(2) CS522_S Heap files Insert m Fetch the last page, add the record and write the page back m Cost: 2D+C Delete m Find the record, remove the record from the page and write the modified page back. m Cost: (search cost) + D + C
3. Overview(2) CS522_S Sorted files Scan m Cost: B(D+RC) Search with equality selection m Assumption: equality search matches the sort order m Locate the page: log 2 B (binary search) m Find the record: log 2 R m Cost: Dlog 2 B+C log 2 R Search with range selection m Assume that the range selection mathches the composite key m Cost: cost of search + cost of retrieving matching records
3. Overview(2) CS522_S Sorted files Insert m Assumption: the new record belongs in the middle of the file m Need to fetch the latter half of the file and write them back after adding the new record m Cost: cost of searching the position of the new record+2*(0.5B(D+RC)) Delete m Cost: same as that for an insert
3. Overview(2) CS522_S Clustered B+ tree, alternative 1 Page occupancy m The pages are usually at bout 67% occupancy m The number of data pages is about 1.5B Scan m Cost: 1.5B(D+RC) Search with equality selection m Assumption: each index node has F children m Cost of finding the page containing the desired record: Dlog F 1.5B m Cost of find the desired record: Clog 2 R m Cost: Dlog F 1.5B + Clog 2 R + the cost of reading all the qualifying records
3. Overview(2) CS522_S Clustered B+ tree, alternative 1 Search with range selection m Cost: cost of locating the first qualifying record + cost of retrieving other qualifying records Insert: m Assumption: the leaf page has sufficient space for the new record m Find the correct leaf page, add the record and write it back m Cost: Dlog F 1.5B + Clog 2 R + D Delete: m Same as that for an insert
3. Overview(2) CS522_S Heap file with unclustered tree index: alternative 2 Assumption m a data entry is 0.1 the size of a data record m 67% occupancy of leaf pages m Then the number of leaf pages: 0.1(1.5B) = 0.15B m The number of data entries in a page: 10(0.67R) = 6.7R Scan m If use index Each data entry may point to a different data page Cost of read all data entries: 0.15B(D+6.7RC) Cost of read all records: BR(D+C) m Better ignoring the index and Scanning the data file directly
3. Overview(2) CS522_S Heap file with unclustered tree index: alternative 2 Search with equality selection m Cost of locating the right leaf page: Dlog F 0.15B m Cost of locating the right data entry: Clog 2 6.7R m Cost: Dlog F 0.15B+Clog 2 6.7R+D Search with range selection m When should we use index? If 10% percent of data records satisfy the selection condition, we are better off ignoring the index
3. Overview(2) CS522_S Heap file with unclustered tree index: alternative 2 Insert m Insert the record to data file: 2D+C m Insert the corresponding data entry to the index file: DlogF0.15B+Clog 2 6.7R+D m Cost: 2D+C+DlogF0.15B+Clog 2 6.7R+D Delete m Same as that for an insert
3. Overview(2) CS522_S Heap file with unclustered hash index: alternative 2 Assumptions m No overflow pages m A data entry is 0.1 the size the a data record m Pages are kept at 80% occupancy Leave space for future insertions Minimize overflows as the file expands m Then, the number of pages for data entries: 1.25(0.1B)=0.125B m The number of data entries in a page: 10(0.8R)=8R Scan m Don’t use hash index
3. Overview(2) CS522_S Heap file with unclustered hash index: alternative 2 Search with equality selection m Suppose the cost of hash calculation is H m Assumptions The bucket only consists of one page. D We find the data entry after scanning half the entries: 0.5(8RC) = 4RC m Fetch the data record for the data file: D m Cost: H+2D+4RC Search with range selection m The hash index doesn’t help
3. Overview(2) CS522_S Heap file with unclustered hash index: alternative 2 Insert m Insert the record in the data file: 2D+C m Insert the data entry to the index file: H+2D+C m Cost: H+4D+2C Delete m Find the data entry in the index and locate the data record in the data file: H+2D+4RC m Write out the modified index page and data page: 2D m Cost: H+4D+4RC
3. Overview(2) CS522_S Comparison of I/O costs A heap file has good storage efficiency and supports fast scanning and insertion of records, however, it’s slow for searches and deletions A sorted file also offers good storage efficiency, but insertion and deletion of records is slow. Searches are faster than in heap files A clustered file offers all the advantages of a sorted file and supports inserts and deletes efficiently. Searches are even faster than in sorted files
3. Overview(2) CS522_S Comparison of I/O costs Unclustered tree and hash indexes offer fast searches, insertion and deletion, but scans and range searches with many matches are slow. Hash indexes are a litter faster on equality searches, but they don’t support range queries In summary, no one file organization is uniformly superior to all situations
Indices and performance tuning How to choose indices to improve performance in a database system? Workload m Mix of queries and update operation 3. Overview(2) CS522_S
3. Overview(2) CS522_S16 Understanding the Workload For each query in the workload m Which relations does it access? m Which attributes are involved in selection/joint condition m How selective are these conditions likely to be? For each update in the workload m Which attributes are involved in selection/joint condition m How selective are these conditions likely to be? m The type of update (insert/delete/update), and the attributes that are affected 3-22
3. Overview(2) CS522_S16 Choice of Indices What indices should we create? m Which relations should have indices? m What field(s) should be the search key? m Should we build several indices? For each index, what kind of index should it be? m Clustered/Unclustered? m Hash/tree? 3-23
3. Overview(2) CS522_S16 Index Selection Guidelines Attributes in WHERE clause are candidates for index keys. m Exact match condition suggests hash index. m Range query suggests tree index. Clustering is especially useful for range queries; can also help on equality queries if there are many duplicates. Multi-attribute search keys should be considered when a WHERE clause contains several conditions. m Order of attributes is important for range queries. m Such indexes can sometimes enable index-only strategies for important queries. For index-only strategies, clustering is not important! Try to choose indexes that benefit as many queries as possible. Since only one index can be clustered per relation, choose it based on important queries that would benefit the most from clustering. 3-24
Example B+tree index on E.age Is the index worthwhile m Depends on selectivity of the condition Is the index useful m Depends on whether the index is clustered 3. Overview(2) CS522_S SELECT E.dno FROM Emp E WHERE E.age>40
Example Consider the GROUP BY query. If many tuples have E.age > 10, using E.age index and sorting the retrieved tuples may be costly. Clustered E.dno index may be better! 3. Overview(2) CS522_S SELECT E.dno, COUNT(*) FROM Emp E WHERE E.age>10 GROUP BY E.dno
3. Overview(2) CS522_S Example SELECT E.dno, COUNT(*) FROM Emp E WHERE E.hobby=Stamps GROUP BY E.dno Clustering on E.hobby helps Clustering is also important for an index on a search key that does not include a candidate key
3. Overview(2) CS522_S Index with composite search keys Composite Search Keys m Search on a combination of fields. m Equality query Each field in the search key is bound to a constant m Range query Not all fields in the search key are bound to constants Ex: age=20, any value is acceptable for sal fields
3. Overview(2) CS522_S Examples of composite key indexes sue1375 bob cal joe nameagesal 12,20 12,10 11,80 13,75 20,12 10,12 75,13 80, Data records sorted by name Data entries in index sorted by Data entries sorted by
3. Overview(2) CS522_S Examples of composite key indexes (Cont.) To retrieve Emp records with m age=30 AND sal=4000 Index on m 20<age<30 AND 3000<sal<5000 Clustered tree index on or m Age=30 AND 3000<sal<5000 Clustered index on or Clustered index on ?
3. Overview(2) CS522_S16 Index-Only Plans A number of queries can be answered without retrieving any tuples from one or more of the relations involved if a suitable index is available. SELECT E.dno, COUNT (*) FROM Emp E GROUP BY E.dno SELECT E.dno, MIN (E.sal) FROM Emp E GROUP BY E.dno Tree index! 3-31
3. Overview(2) CS522_S16 Index-Only Plans (Contd.) Index-only plans are possible if the key is or we have a tree index with key m Which is better? m What if we consider the second query? SELECT E.dno, COUNT (*) FROM Emp E WHERE E.age=30 GROUP BY E.dno SELECT E.dno, COUNT (*) FROM Emp E WHERE E.age>30 GROUP BY E.dno 3-32