Presentation is loading. Please wait.

Presentation is loading. Please wait.

CS522 Advanced database Systems Huiping Guo Department of Computer Science California State University, Los Angeles 3. Overview of data storage and indexing.

Similar presentations


Presentation on theme: "CS522 Advanced database Systems Huiping Guo Department of Computer Science California State University, Los Angeles 3. Overview of data storage and indexing."— Presentation transcript:

1 CS522 Advanced database Systems Huiping Guo Department of Computer Science California State University, Los Angeles 3. Overview of data storage and indexing (2)

2 3. Overview(2) CS522_S16 3-2 Outline  Comparison of file organizations  Choice of indices  Index only plans

3 3. Overview(2) CS522_S16 3-3 File organizations and operation cost  Which file organization is good? m Each file organization makes certain operations efficient but other operations expensive  Compare the costs of some simple operations for some basic file organizations m Example table Employees: (name, age, sal) m Assume files and indices are organized according to the composite search key m All selection operations are specified on age and sal  Goal m Emphasize the importance of an appropriate file organization

4 Operations  Scan m Fetch all records in the file to memory (buffer pool)  Search with equality selection m Fetch all records that satisfy an equality selections m Locate qualifying records  Search with range selection m Fetch all records that satisfy a range selection  Insert a record m Indentify the page in the file into which the new record must be inserted m Fetch that page from disk m Modify it to include the new record m Write back the modified page  Delete a record 3. Overview(2) CS522_S16 3-4

5 3. Overview(2) CS522_S16 Cost Model for Our Analysis  B: The number of data pages  R: Number of records per page  D: (Average) time to read or write disk page  C: (Average) time to process a record)  Measuring number of page I/O’s ignores gains of pre-fetching a sequence of pages; thus, even I/O cost is only approximated.  Average-case analysis; based on several simplistic assumptions. 3-5

6 3. Overview(2) CS522_S16 Comparing File Organizations  Heap files (random order; insert at eof)  Sorted files, sorted on  Clustered B+ tree file, Alternative (1), search key  Heap file with unclustered B + tree index on search key  Heap file with unclustered hash index on search key 3-6

7 3. Overview(2) CS522_S16 Heap files  Scan m Cost: B(D+RC)  Search with equality selection m Assumption: exactly one match. m On average, half of the file is scanned m Cost: 0.5B(D+RC)  Search with range selection m The entire must be scanned m Cost: B(D+RC) 3-7

8 3. Overview(2) CS522_S16 3-8 Heap files  Insert m Fetch the last page, add the record and write the page back m Cost: 2D+C  Delete m Find the record, remove the record from the page and write the modified page back. m Cost: (search cost) + D + C

9 3. Overview(2) CS522_S16 3-9 Sorted files  Scan m Cost: B(D+RC)  Search with equality selection m Assumption: equality search matches the sort order m Locate the page: log 2 B (binary search) m Find the record: log 2 R m Cost: Dlog 2 B+C log 2 R  Search with range selection m Assume that the range selection mathches the composite key m Cost: cost of search + cost of retrieving matching records

10 3. Overview(2) CS522_S16 3-10 Sorted files  Insert m Assumption: the new record belongs in the middle of the file m Need to fetch the latter half of the file and write them back after adding the new record m Cost: cost of searching the position of the new record+2*(0.5B(D+RC))  Delete m Cost: same as that for an insert

11 3. Overview(2) CS522_S16 3-11 Clustered B+ tree, alternative 1  Page occupancy m The pages are usually at bout 67% occupancy m The number of data pages is about 1.5B  Scan m Cost: 1.5B(D+RC)  Search with equality selection m Assumption: each index node has F children m Cost of finding the page containing the desired record: Dlog F 1.5B m Cost of find the desired record: Clog 2 R m Cost: Dlog F 1.5B + Clog 2 R + the cost of reading all the qualifying records

12 3. Overview(2) CS522_S16 3-12 Clustered B+ tree, alternative 1  Search with range selection m Cost: cost of locating the first qualifying record + cost of retrieving other qualifying records  Insert: m Assumption: the leaf page has sufficient space for the new record m Find the correct leaf page, add the record and write it back m Cost: Dlog F 1.5B + Clog 2 R + D  Delete: m Same as that for an insert

13 3. Overview(2) CS522_S16 3-13 Heap file with unclustered tree index: alternative 2  Assumption m a data entry is 0.1 the size of a data record m 67% occupancy of leaf pages m Then the number of leaf pages: 0.1(1.5B) = 0.15B m The number of data entries in a page: 10(0.67R) = 6.7R  Scan m If use index Each data entry may point to a different data page Cost of read all data entries: 0.15B(D+6.7RC) Cost of read all records: BR(D+C) m Better ignoring the index and Scanning the data file directly

14 3. Overview(2) CS522_S16 3-14 Heap file with unclustered tree index: alternative 2  Search with equality selection m Cost of locating the right leaf page: Dlog F 0.15B m Cost of locating the right data entry: Clog 2 6.7R m Cost: Dlog F 0.15B+Clog 2 6.7R+D  Search with range selection m When should we use index? If 10% percent of data records satisfy the selection condition, we are better off ignoring the index

15 3. Overview(2) CS522_S16 3-15 Heap file with unclustered tree index: alternative 2  Insert m Insert the record to data file: 2D+C m Insert the corresponding data entry to the index file: DlogF0.15B+Clog 2 6.7R+D m Cost: 2D+C+DlogF0.15B+Clog 2 6.7R+D  Delete m Same as that for an insert

16 3. Overview(2) CS522_S16 3-16 Heap file with unclustered hash index: alternative 2  Assumptions m No overflow pages m A data entry is 0.1 the size the a data record m Pages are kept at 80% occupancy Leave space for future insertions Minimize overflows as the file expands m Then, the number of pages for data entries: 1.25(0.1B)=0.125B m The number of data entries in a page: 10(0.8R)=8R  Scan m Don’t use hash index

17 3. Overview(2) CS522_S16 3-17 Heap file with unclustered hash index: alternative 2  Search with equality selection m Suppose the cost of hash calculation is H m Assumptions The bucket only consists of one page. D We find the data entry after scanning half the entries: 0.5(8RC) = 4RC m Fetch the data record for the data file: D m Cost: H+2D+4RC  Search with range selection m The hash index doesn’t help

18 3. Overview(2) CS522_S16 3-18 Heap file with unclustered hash index: alternative 2  Insert m Insert the record in the data file: 2D+C m Insert the data entry to the index file: H+2D+C m Cost: H+4D+2C  Delete m Find the data entry in the index and locate the data record in the data file: H+2D+4RC m Write out the modified index page and data page: 2D m Cost: H+4D+4RC

19 3. Overview(2) CS522_S16 3-19 Comparison of I/O costs  A heap file has good storage efficiency and supports fast scanning and insertion of records, however, it’s slow for searches and deletions  A sorted file also offers good storage efficiency, but insertion and deletion of records is slow. Searches are faster than in heap files  A clustered file offers all the advantages of a sorted file and supports inserts and deletes efficiently. Searches are even faster than in sorted files

20 3. Overview(2) CS522_S16 3-20 Comparison of I/O costs  Unclustered tree and hash indexes offer fast searches, insertion and deletion, but scans and range searches with many matches are slow. Hash indexes are a litter faster on equality searches, but they don’t support range queries  In summary, no one file organization is uniformly superior to all situations

21 Indices and performance tuning  How to choose indices to improve performance in a database system?  Workload m Mix of queries and update operation 3. Overview(2) CS522_S16 3-21

22 3. Overview(2) CS522_S16 Understanding the Workload  For each query in the workload m Which relations does it access? m Which attributes are involved in selection/joint condition m How selective are these conditions likely to be?  For each update in the workload m Which attributes are involved in selection/joint condition m How selective are these conditions likely to be? m The type of update (insert/delete/update), and the attributes that are affected 3-22

23 3. Overview(2) CS522_S16 Choice of Indices  What indices should we create? m Which relations should have indices? m What field(s) should be the search key? m Should we build several indices?  For each index, what kind of index should it be? m Clustered/Unclustered? m Hash/tree? 3-23

24 3. Overview(2) CS522_S16 Index Selection Guidelines  Attributes in WHERE clause are candidates for index keys. m Exact match condition suggests hash index. m Range query suggests tree index. Clustering is especially useful for range queries; can also help on equality queries if there are many duplicates.  Multi-attribute search keys should be considered when a WHERE clause contains several conditions. m Order of attributes is important for range queries. m Such indexes can sometimes enable index-only strategies for important queries. For index-only strategies, clustering is not important!  Try to choose indexes that benefit as many queries as possible. Since only one index can be clustered per relation, choose it based on important queries that would benefit the most from clustering. 3-24

25 Example  B+tree index on E.age  Is the index worthwhile m Depends on selectivity of the condition  Is the index useful m Depends on whether the index is clustered 3. Overview(2) CS522_S16 3-25 SELECT E.dno FROM Emp E WHERE E.age>40

26 Example  Consider the GROUP BY query.  If many tuples have E.age > 10, using E.age index and sorting the retrieved tuples may be costly.  Clustered E.dno index may be better! 3. Overview(2) CS522_S16 3-26 SELECT E.dno, COUNT(*) FROM Emp E WHERE E.age>10 GROUP BY E.dno

27 3. Overview(2) CS522_S16 3-27 Example SELECT E.dno, COUNT(*) FROM Emp E WHERE E.hobby=Stamps GROUP BY E.dno  Clustering on E.hobby helps  Clustering is also important for an index on a search key that does not include a candidate key

28 3. Overview(2) CS522_S16 3-28 Index with composite search keys  Composite Search Keys m Search on a combination of fields. m Equality query Each field in the search key is bound to a constant m Range query Not all fields in the search key are bound to constants Ex: age=20, any value is acceptable for sal fields

29 3. Overview(2) CS522_S16 3-29 Examples of composite key indexes sue1375 bob cal joe12 10 20 8011 12 nameagesal 12,20 12,10 11,80 13,75 20,12 10,12 75,13 80,11 11 12 13 10 20 75 80 Data records sorted by name Data entries in index sorted by Data entries sorted by

30 3. Overview(2) CS522_S16 3-30 Examples of composite key indexes (Cont.) To retrieve Emp records with m age=30 AND sal=4000 Index on m 20<age<30 AND 3000<sal<5000 Clustered tree index on or m Age=30 AND 3000<sal<5000 Clustered index on or Clustered index on ?

31 3. Overview(2) CS522_S16 Index-Only Plans  A number of queries can be answered without retrieving any tuples from one or more of the relations involved if a suitable index is available. SELECT E.dno, COUNT (*) FROM Emp E GROUP BY E.dno SELECT E.dno, MIN (E.sal) FROM Emp E GROUP BY E.dno Tree index! 3-31

32 3. Overview(2) CS522_S16 Index-Only Plans (Contd.)  Index-only plans are possible if the key is or we have a tree index with key m Which is better? m What if we consider the second query? SELECT E.dno, COUNT (*) FROM Emp E WHERE E.age=30 GROUP BY E.dno SELECT E.dno, COUNT (*) FROM Emp E WHERE E.age>30 GROUP BY E.dno 3-32


Download ppt "CS522 Advanced database Systems Huiping Guo Department of Computer Science California State University, Los Angeles 3. Overview of data storage and indexing."

Similar presentations


Ads by Google