Hashing and Indexing John Ortiz
Lecture 18Hashing and Indexing2 Overview How do we retrieve data records given a key value? Sequential search (O(N)); binary search (O(log_2 N)); hashing; conventional (simple) indexing; B+ tree indexing; more advanced hashing & indexing
Lecture 18Hashing and Indexing3 Hashing A hash function maps a key value to a bucket address where the record can be found. Good for queries with an equality condition A = v. Typical complexity: O(1). [Figure: key -> h(key) -> buckets (typically 1 disk block each)]
Lecture 18Hashing and Indexing4 Hash Table What is stored in the buckets? Option 1: the actual records (the hash table is the file). Option 2: key values and pointers to records in a separate data file. Should we sort keys within buckets? [Figure: key -> h(key) -> hash table bucket -> record in data file]
Lecture 18Hashing and Indexing5 Hash Functions A perfect hash function would distribute key values evenly across buckets (very difficult to come by). A good hash function distributes them approximately at random. Ex: if key = 'x1 x2 … xn' is an n-byte character string and the table has b buckets, h(key) = (x1 + x2 + … + xn) mod b. Many other choices exist (see Knuth Vol. 3). Collisions (bucket overflow) may have to be handled, typically by chaining overflow blocks. With K mod M, 0 <= result <= M-1.
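The byte-sum hash and overflow chaining described on this slide can be sketched in a few lines of Python. This is only an illustration: the class name ChainedHashTable, the bucket_capacity parameter, and the in-memory lists standing in for disk blocks are assumptions, not part of the lecture.

```python
def h(key: str, b: int) -> int:
    """h(key) = (x1 + x2 + ... + xn) mod b, summing the bytes of the key."""
    return sum(key.encode("utf-8")) % b


class ChainedHashTable:
    """Hypothetical in-memory model: each bucket is a chain of fixed-size blocks."""

    def __init__(self, num_buckets: int, bucket_capacity: int) -> None:
        self.b = num_buckets
        self.capacity = bucket_capacity
        # One primary (initially empty) block per bucket.
        self.buckets = [[[]] for _ in range(num_buckets)]

    def insert(self, key: str, record) -> None:
        chain = self.buckets[h(key, self.b)]
        if len(chain[-1]) >= self.capacity:
            chain.append([])  # last block is full: chain an overflow block
        chain[-1].append((key, record))

    def lookup(self, key: str) -> list:
        # Scan the primary block and every overflow block of one bucket.
        chain = self.buckets[h(key, self.b)]
        return [rec for block in chain for k, rec in block if k == key]


table = ChainedHashTable(num_buckets=7, bucket_capacity=1)
table.insert("ab", "record-1")
table.insert("ba", "record-2")  # same byte sum as "ab", so it collides
```

Because addition is commutative, any two keys that are permutations of each other collide under this hash, which is one reason the slide points to other choices.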
Lecture 18Hashing and Indexing6 Loading Factor Loading factor = (# keys loaded) / (total # keys that fit). Try to keep the loading factor between 50% and 80%. If < 50%, space is wasted. If > 80%, overflows become significant; how significant depends on how good the hash function is and on the # of keys per bucket.
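As a rough illustration of the formula, the snippet below computes a loading factor and the number of buckets needed to stay at a target factor. The function names and the sample numbers are assumptions made up for this sketch.

```python
import math


def loading_factor(keys_loaded: int, num_buckets: int, keys_per_bucket: int) -> float:
    """# keys loaded / total # keys that fit."""
    return keys_loaded / (num_buckets * keys_per_bucket)


def buckets_needed(num_keys: int, keys_per_bucket: int, target: float = 0.8) -> int:
    """Smallest bucket count that keeps the loading factor at or below target."""
    return math.ceil(num_keys / (keys_per_bucket * target))


lf = loading_factor(keys_loaded=650, num_buckets=100, keys_per_bucket=10)
in_recommended_range = 0.5 <= lf <= 0.8
```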
Lecture 18Hashing and Indexing8 Terms Data file: contains blocks of data records. Index file: contains blocks of index entries. Index entry: an (index key, address) pair. Index key: not necessarily a key of a relation. Address: of the record with that key value (may be a block or record address). Search key: the index value used for a search.
Lecture 18Hashing and Indexing9 Types of Indexes What type of index key is used? Primary indexes; clustering indexes; secondary indexes; multilevel indexes; dynamic indexes (B-trees, B+ trees). Does every record have an index entry? Dense index vs. sparse index. Consideration: how does the index handle updates? Insertions? Deletions?
Lecture 18Hashing and Indexing10 Primary Index Index key is the primary key. Sparse: one entry per data block. Data file is sorted on the index key. Can be a B+ tree as well. [Figure: sparse index (entries 10, 30, 50, 70, 90, 110, …) pointing to the blocks of a data file sorted on the index key]
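A sparse primary index lookup can be sketched as a binary search over the first key of each block, followed by a scan of that one block. The two-records-per-block layout and the function names below are assumptions chosen to resemble the figure.

```python
from bisect import bisect_right

# Data file sorted on the index key; each inner list models one disk block.
blocks = [[10, 20], [30, 40], [50, 60], [70, 80], [90, 100]]


def build_sparse_index(data_blocks):
    """One index entry per data block: the first (smallest) key in the block."""
    return [block[0] for block in data_blocks]


def lookup(index_keys, data_blocks, key):
    """Return the block number holding key, or None if the key is absent."""
    pos = bisect_right(index_keys, key) - 1  # last entry with index key <= key
    if pos < 0:
        return None  # key is smaller than every key in the file
    return pos if key in data_blocks[pos] else None


index_keys = build_sparse_index(blocks)
```

Only one data block is read per lookup, because the sort order guarantees the key cannot be anywhere else.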
Lecture 18Hashing and Indexing11 Primary Index Example 1, p.159 Ordered file, r = 30,000 records. Block size B = 1024 bytes/block. Record length R = 100 bytes/record, unspanned. Bfr = floor(B/R) records/block. # blocks b = ??? (use units to determine the formula!). Binary search on the data file takes ~log_2 b block accesses, where b is the # of blocks (not the block size B). Compare to search with a primary index, see p.159.
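One way to work through the exercise is to let the units guide the arithmetic, as the slide suggests. The variable names below are assumptions; only the values r, B, and R come from the slide.

```python
import math

r = 30_000   # records in the ordered file
B = 1024     # bytes per block
R = 100      # bytes per record (unspanned: records do not cross blocks)

bfr = B // R                   # records per block: (bytes/block) / (bytes/record)
b = math.ceil(r / bfr)         # blocks: records / (records/block)
binary_search_accesses = math.ceil(math.log2(b))  # block accesses without an index
```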
Lecture 18Hashing and Indexing12 Clustering Index Index key is not a key of the relation (may have duplicate values). Sparse: one entry per distinct value. Data file is sorted on the index key. Can be a B+ tree. [Figure: index with one entry per distinct value (10, 20, 30, 40, 45, …) pointing into a data file sorted on the index key, with duplicate values stored together]
Lecture 18Hashing and Indexing13 Secondary Index Index key: can be key or non-key Pointers point to record (not block) Dense: one entry per record (sparse at higher levels) Data file not sorted on index key! Can be B+ tree
Lecture 18Hashing and Indexing14 Secondary Index [Figure: unsorted data file (50, 30, 70, 20, 40, 80, 10, 100, 60, 90); dense first-level index (10, 20, 30, 40, 50, 60, 70, …); sparse high-level index (10, 50, 90, …)]
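The two-level structure in the figure can be sketched as follows. The data file order is taken from the figure; the choice of 4 entries per index block is an assumption that happens to reproduce the high-level keys 10, 50, 90, and the function name find is hypothetical.

```python
from bisect import bisect_right

# Unsorted data file; a record's "address" is its position in the list.
data_file = [50, 30, 70, 20, 40, 80, 10, 100, 60, 90]

# Dense first level: one (key, record pointer) entry per record, sorted by key.
dense = sorted((key, pos) for pos, key in enumerate(data_file))

ENTRIES_PER_BLOCK = 4  # assumed index block capacity
index_blocks = [dense[i:i + ENTRIES_PER_BLOCK]
                for i in range(0, len(dense), ENTRIES_PER_BLOCK)]

# Sparse high level: the first key of each first-level index block.
top_level = [block[0][0] for block in index_blocks]


def find(key):
    """Return the record position for key, or None if it is not in the file."""
    b = bisect_right(top_level, key) - 1
    if b < 0:
        return None
    for k, pos in index_blocks[b]:
        if k == key:
            return pos
    return None
```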
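Lecture 18Hashing and Indexing15 Secondary Index for Non-Key Option 1: dense index with one entry per record, so duplicate key values are repeated in the index. Problem: excess overhead in disk space! [Figure: data file with duplicate values (10, 20, 40, 20, 40, 10, 40, 10, 40, 30) and a dense index file repeating each value]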
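Lecture 18Hashing and Indexing16 Secondary Index for Non-Key Option 2: reserve multiple pointers per index entry. Problem: variable-size records in the index! [Figure: data file (10, 20, 40, 20, 40, 10, 40, 10, 40, 30) and an index file with one entry per distinct value (10, 20, 30, 40), each holding several pointers]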
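Lecture 18Hashing and Indexing17 Secondary Index for Non-Key Option 3: use pointer buckets. Each index entry points to a bucket that holds the record pointers for that value. [Figure: data file (10, 20, 40, 20, 40, 10, 40, 10, 40, 30); index file (10, 20, 30, 40, 50, 60, …); buckets of record pointers between the index and the data file]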
Lecture 18Hashing and Indexing18 Advantage of Pointer Buckets Assume the following indexes on EMP(name, dept, floor, ...): name: primary; dept: secondary; floor: secondary. To find employees in the Toy dept on the 2nd floor: (1) find the pointers for the Toy dept; (2) find the pointers for the 2nd floor; (3) intersect these sets of pointers; (4) retrieve the records.
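The four steps above can be sketched with sets of record positions standing in for pointer buckets. The sample EMP rows and the helper names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical EMP(name, dept, floor) records; the list position is the record pointer.
emp = [
    ("Ann", "Toy", 2),
    ("Bob", "Toy", 1),
    ("Cat", "Shoe", 2),
    ("Dan", "Toy", 2),
]


def build_pointer_buckets(records, field):
    """Secondary index: each distinct value maps to a bucket of record pointers."""
    buckets = defaultdict(set)
    for pointer, record in enumerate(records):
        buckets[record[field]].add(pointer)
    return buckets


dept_index = build_pointer_buckets(emp, 1)
floor_index = build_pointer_buckets(emp, 2)


def find_employees(dept, floor):
    # Intersect the two buckets first, then fetch only the matching records.
    pointers = dept_index.get(dept, set()) & floor_index.get(floor, set())
    return [emp[p][0] for p in sorted(pointers)]
```

The benefit is that the intersection happens on pointers alone, so data blocks are read only for records that satisfy both conditions.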
Lecture 18Hashing and Indexing19 Simple Multilevel Index [Figure: data file sorted on the index key; index level 1 (10, 30, 50, 70, 90, 110, 130, 150, 170, 190, 210, 230) pointing to data blocks; index level 2 (10, 90, 170, 250, 330, 410, 490, 570) pointing to level 1 index blocks]
Lecture 18Hashing and Indexing20 Tree-Structured Indexes Problem of simple indexes: binary search is still expensive. Why not build indexes as multiple level trees? Nodes of tree are blocks Types of tree-structured indexes ISAM (Indexed Sequential Access Method): static structure; B+ tree: dynamic, adjusts gracefully under inserts and deletes.
Lecture 18Hashing and Indexing21 B+ Tree Index The index file is organized as a B+ tree. Height-balanced. Nodes are blocks of index keys and pointers. Order P: the maximum # of pointers that fit in a node. Nodes are at least 50% full. Supports efficient updates.
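Lecture 18Hashing and Indexing22 B+ Tree Index Example P = 4. [Figure: index file organized as a B+ tree; root key 100; second-level keys 30 and 120, 150, 180; leaf keys 3, 5, 11, 30, 35, 100, 101, 110, 120, 130, 150, 156, 179, 180, 200; leaf pointers point to data records/blocks]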
Lecture 18Hashing and Indexing23 Internal Nodes The root must have at least 2 pointers. Other internal nodes must have at least P/2 pointers, where P is the order of the B+ tree. A node with k keys has k+1 pointers. Keys are sorted. Node layout: p_0 a_1 p_1 … a_i p_i a_(i+1) … a_k p_k, where p_0 leads to keys with key < a_1, p_i leads to keys with a_i <= key < a_(i+1), and p_k leads to keys with key >= a_k.
Lecture 18Hashing and Indexing24 Leaf Nodes All leaf (external) nodes are at the same level. A leaf must have at least P/2 keys, unless it is the only node in the tree. Keys are sorted. Each leaf has a block pointer to the next leaf node; its other pointers (block or record pointers) point to data records. Leaf layout: a_1 p_1 … a_i p_i … a_k p_k, followed by the pointer to the next leaf.
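The routing rule for internal nodes and the next-leaf pointers can be sketched as a search-only B+ tree. The tree below reuses the keys of the earlier P = 4 example, but the grouping of keys into leaves, the record ids, and all class and function names are assumptions made for this sketch; insertion and deletion are not shown.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Optional


@dataclass
class Leaf:
    keys: list                     # a_1 .. a_k, sorted
    rids: list                     # one record pointer per key
    next: Optional["Leaf"] = None  # block pointer to the next leaf


@dataclass
class Internal:
    keys: list      # a_1 .. a_k, sorted
    children: list  # p_0 .. p_k (k + 1 pointers)


def find_leaf(node, key):
    # bisect_right returns i such that a_i <= key < a_(i+1), so follow p_i.
    while isinstance(node, Internal):
        node = node.children[bisect_right(node.keys, key)]
    return node


def search(root, key):
    leaf = find_leaf(root, key)
    for k, rid in zip(leaf.keys, leaf.rids):
        if k == key:
            return rid
    return None


def range_scan(root, lo, hi):
    """Collect keys in [lo, hi] by walking the leaf chain from the leaf for lo."""
    leaf, out = find_leaf(root, lo), []
    while leaf is not None:
        for k in leaf.keys:
            if k > hi:
                return out
            if k >= lo:
                out.append(k)
        leaf = leaf.next
    return out


# Assumed leaf grouping for the example keys.
groups = [[3, 5, 11], [30, 35], [100, 101, 110], [120, 130], [150, 156, 179], [180, 200]]
leaves = [Leaf(g, [f"r{k}" for k in g]) for g in groups]
for left, right in zip(leaves, leaves[1:]):
    left.next = right
root = Internal([100], [Internal([30], leaves[:2]),
                        Internal([120, 150, 180], leaves[2:])])
```

An equality search touches one node per level, while a range query descends once and then follows the leaf chain.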
Lecture 18Hashing and Indexing25 An Example File: Employees(SSN, Name, Dept, Age, Phone). Attribute sizes in bytes: SSN (9), Name (25), Dept (4), Age (4), Phone (10). Block (page) size = 1024 bytes. # of records: 30,000, packed unspanned. What is the file size in pages? Tuple size = 9 + 25 + 4 + 4 + 10 = 52 bytes. bf_Employee = floor(1024 / 52) = 19 records/page (block). b_Employee = ceil(30,000 / 19) = 1,579 pages (blocks).
Lecture 18Hashing and Indexing26 Example: B+ Tree Primary Index Pointer size = 4 bytes. Nodes are 70% full. How big is a B+ tree primary index on SSN? Order: P = floor((1024 + 9) / (9 + 4)) = 79. Average # pointers per node = 79 × 0.7 = 55.3, rounded up to 56. # pointers per internal node = 56. # index entries per leaf node = 55. # index entries = 1579 (one per page). # leaf nodes = ceil(1579 / 55) = 29. # nodes at next level = ceil(29 / 56) = 1.
Lecture 18Hashing and Indexing27 Example: B+ Tree Primary Index Total # of levels = 2 Total # of index nodes (pages) = 30 To answer the query: select * from Employees where SSN=123456789; # of page I/Os = 3 (2 index pages + 1 data page).
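The primary index numbers can be re-derived with a short script. It follows the rounding used on the slides (round the 70% fill up, then use one fewer key than pointers in a leaf); the function and variable names are assumptions.

```python
import math

BLOCK, KEY, PTR, FILL = 1024, 9, 4, 0.7   # bytes, bytes, bytes, node occupancy
TUPLE = 9 + 25 + 4 + 4 + 10               # Employees record size in bytes

records_per_page = BLOCK // TUPLE                    # 19
data_pages = math.ceil(30_000 / records_per_page)    # 1579

# P pointers and P - 1 keys must fit in a block: P*PTR + (P-1)*KEY <= BLOCK.
order = (BLOCK + KEY) // (KEY + PTR)                 # 79
internal_fanout = math.ceil(order * FILL)            # 56 pointers per internal node
leaf_entries = internal_fanout - 1                   # 55 entries per leaf


def nodes_per_level(num_entries, leaf_capacity, fanout):
    """Node counts from the leaf level up to the root."""
    levels = [math.ceil(num_entries / leaf_capacity)]
    while levels[-1] > 1:
        levels.append(math.ceil(levels[-1] / fanout))
    return levels


# Sparse primary index: one index entry per data page.
levels = nodes_per_level(data_pages, leaf_entries, internal_fanout)
page_ios = len(levels) + 1   # one index page per level plus one data page
```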
Lecture 18Hashing and Indexing28 Example: B+ Tree Secondary Index B+ tree secondary index on Dept. # of distinct values = 1000. Assume a dense index, so # of index entries = 30,000. Size of index entry = 8 bytes (4-byte Dept + 4-byte pointer). Order: P = floor((1024 + 4) / (4 + 4)) = 128. Assume nodes are 70% full: an internal node has 90 pointers and a leaf node has 89 keys.
Lecture 18Hashing and Indexing29 Example: B+ Tree Secondary Index # of leaf nodes = ceil(30,000 / 89) = 338. # of nodes at 2nd level = ceil(338 / 90) = 4. # of nodes at 3rd level = ceil(4 / 90) = 1. Total # of levels = 3. Total # of pages = 343. # records per distinct value = 30 (assume each is on a different page). To find all employees with "Dept = x", # of page I/Os = 33 (3 pages of index + 30 pages of data).
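The query cost for the secondary index can be checked the same way. The sketch assumes, as the slide does, that the 30 matching records sit on 30 different data pages and that the matching index entries are found with one page read per index level; variable names are hypothetical.

```python
import math

BLOCK, KEY, PTR, FILL = 1024, 4, 4, 0.7
NUM_RECORDS, DISTINCT_VALUES = 30_000, 1000

order = (BLOCK + KEY) // (KEY + PTR)        # 128
internal_fanout = math.ceil(order * FILL)   # 90 pointers per internal node
leaf_keys = internal_fanout - 1             # 89 keys per leaf

# Dense index: one entry per record. Count nodes level by level up to the root.
levels = [math.ceil(NUM_RECORDS / leaf_keys)]
while levels[-1] > 1:
    levels.append(math.ceil(levels[-1] / internal_fanout))

records_per_value = NUM_RECORDS // DISTINCT_VALUES   # 30
# Worst case for an unclustered index: every match is on its own data page.
page_ios = len(levels) + records_per_value
```

The data page reads dominate the cost here, which is why a secondary index on a non-key attribute helps far less than the primary index in the previous example.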
Lecture 18Hashing and Indexing30 Indexes in SQL Create a secondary index: create index Salary_Index on Employees(Salary); Create an index on a key: create unique index SSN_Index on Employees(SSN); Drop an index: drop index Salary_Index;
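The three statements can be tried out with Python's built-in sqlite3 module. This is a sketch against SQLite rather than the Oracle system mentioned later in the lecture, and the Employees table here is a cut-down, assumed schema that includes a Salary column so the slide's statements apply.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema for illustration only.
conn.execute("CREATE TABLE Employees (SSN TEXT, Name TEXT, Dept INTEGER, Salary INTEGER)")

conn.execute("CREATE INDEX Salary_Index ON Employees(Salary)")     # secondary index
conn.execute("CREATE UNIQUE INDEX SSN_Index ON Employees(SSN)")    # index on a key

conn.execute("INSERT INTO Employees VALUES ('123456789', 'Ann', 1, 50000)")
try:
    # The unique index rejects a second row with the same SSN.
    conn.execute("INSERT INTO Employees VALUES ('123456789', 'Bob', 2, 40000)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True


def index_names():
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
    return sorted(row[0] for row in rows)


before_drop = index_names()
conn.execute("DROP INDEX Salary_Index")
after_drop = index_names()
```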
Lecture 18Hashing and Indexing31 Summary Hashing is very efficient, but is effective only when the search condition is an equality. Indexing is effective for range selection as well as equality selection. Simple indexing is good for small files. ISAM is good if updates are infrequent. B+ tree is a dynamic structure: inserts/deletes leave the B+ tree height-balanced; O(log_P N) cost. Typically has 3 or 4 levels for large files.
Lecture 18Hashing and Indexing32 Summary (Contd.) B+ tree (contd.): almost always better than maintaining a sorted file (no sorting, no global moving). Nodes are typically 67%-70% full on average. Usually preferable to ISAM, modulo locking considerations; adjusts to growth gracefully. Oracle automatically creates an index for primary key attribute(s) and unique attribute(s). It is not possible to specify different types of index or hashing using SQL. Many other types of indexes …
Lecture 18Hashing and Indexing33 Look Ahead Next topic: Query Processing and Optimization Read textbook: Chapter 18