Hashing and Indexing John Ortiz.

Overview
How do we retrieve data records with a given key value?
- Sequential search: O(N)
- Binary search: O(log2 N)
- Hashing
- Conventional (simple) indexing
- B+ tree indexing
- More advanced hashing and indexing
Lecture 18 Hashing and Indexing

Hashing
key → h(key) → bucket (typically 1 disk block)
- A hash function maps a key value to the address of the bucket where the record can be found.
- Good for queries with an equality condition A = v.
- Typical complexity: O(1).

Hash Table
What is stored in the buckets?
- Option 1: the actual records (the hash table is the file).
- Option 2: key values and pointers to records in a separate data file.
[Figure: key → h(key) → hash table bucket → record in the data file]
Should we sort the keys within a bucket?

Hash Functions
- A perfect hash function would distribute key values evenly across the buckets; such functions are very difficult to come by.
- A good hash function distributes keys as if at random.
- Example: if the key is an n-byte character string 'x1 x2 … xn' and the table has b buckets, then h(key) = (x1 + x2 + … + xn) mod b.
- With K mod M, the result satisfies 0 <= result <= M-1.
- Many other choices exist (see Knuth, Vol. 3).
- Collisions (bucket overflow) may have to be handled, typically by chaining overflow blocks.
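The additive hash function above can be sketched in a few lines of Python. This is only an illustration: the function name `h` follows the slide, while the sample keys and the bucket count are made up.

```python
def h(key: str, b: int) -> int:
    """Sum the bytes of the key and reduce the sum modulo the bucket count b."""
    return sum(key.encode("utf-8")) % b


# Distribute a few made-up keys over b = 4 buckets.
b = 4
sample_keys = ["ann", "bob", "carl", "dee"]
buckets = [[] for _ in range(b)]
for key in sample_keys:
    buckets[h(key, b)].append(key)

# Every bucket address lies in 0 .. b-1, as with K mod M in general.
in_range = all(0 <= h(k, b) < b for k in sample_keys)
```

Because addition is commutative, this function sends every permutation of the same characters (for example "ab" and "ba") to the same bucket, which is one reason other hash functions are often preferred.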

Try to keep the loading factor between 50% and 80%.
- Below 50%, space is wasted.
- Above 80%, overflows become significant; how significant depends on how good the hash function is and on the number of keys per bucket.
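To make the loading-factor guideline concrete, here is a minimal sketch of a static hash file with chained overflow. It assumes that the loading factor means stored keys divided by the total number of slots in the primary buckets; the class `HashFile`, its methods, and the sample keys are hypothetical names chosen for this illustration.

```python
class HashFile:
    """Static hash file with b primary buckets of fixed capacity.

    Keys that do not fit in their primary bucket go to a per-bucket
    overflow list, standing in for a chain of overflow blocks.
    """

    def __init__(self, b: int, capacity: int):
        self.b = b
        self.capacity = capacity
        self.primary = [[] for _ in range(b)]
        self.overflow = [[] for _ in range(b)]

    def insert(self, key: int) -> None:
        i = key % self.b
        if len(self.primary[i]) < self.capacity:
            self.primary[i].append(key)
        else:
            self.overflow[i].append(key)

    def lookup(self, key: int) -> bool:
        i = key % self.b
        return key in self.primary[i] or key in self.overflow[i]

    def load_factor(self) -> float:
        # Assumed definition: stored keys / slots in the primary buckets.
        stored = sum(len(p) for p in self.primary) + self.overflow_count()
        return stored / (self.b * self.capacity)

    def overflow_count(self) -> int:
        return sum(len(o) for o in self.overflow)


hf = HashFile(b=4, capacity=2)
for k in range(6):          # keys 0..5 spread over the four buckets
    hf.insert(k)
load_before = hf.load_factor()
hf.insert(8)                # bucket 0 already holds 0 and 4, so 8 overflows
load_after = hf.load_factor()
```

In this toy run the file sits at 75% before the last insert and at 87.5% after it, and the first overflow appears exactly when the file passes the 80% mark.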

Indexing
[Figure: an index file whose entries point to records in a data file; example lookup: find the record with key 60.]

Terms
- Data file: contains blocks of data records.
- Index file: contains blocks of index entries.
- Index entry: <index key, address>.
- Index key: not necessarily a key of a relation.
- Address: location of the record with a given key value (may be a block address or a record address).
- Search key: the index key value used for a search.

Types of Indexes
What type of index key is used?
- Primary indexes
- Clustering indexes
- Secondary indexes
- Multilevel indexes
- Dynamic indexes: B-trees, B+-trees
Does every record have an index entry?
- Dense index, sparse index
Consideration: how does the index handle updates, insertions, and deletions?

Primary Index
- The index key is the primary key.
- Sparse: one index entry per data block.
- The data file is sorted on the index key.
- Can be organized as a B+ tree as well.
[Figure: a sparse index with one entry per block, pointing into a data file sorted on the key.]
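A lookup through a sparse primary index can be sketched as follows. The block layout (two records per block) and the helper name `find` are assumptions made for this illustration, and Python lists stand in for disk blocks.

```python
from bisect import bisect_right

# Data file sorted on the key, two records per block (made-up layout).
data_blocks = [[10, 20], [30, 40], [50, 60], [70, 80], [90, 100]]

# Sparse index: one entry per data block, holding the first key of the block.
index_keys = [block[0] for block in data_blocks]


def find(key: int):
    """Return (block number, key) if the key is in the file, else None."""
    # Position of the last index entry whose key is <= the search key.
    pos = bisect_right(index_keys, key) - 1
    if pos < 0:
        return None                 # smaller than every key in the file
    if key in data_blocks[pos]:     # one data block is read and scanned
        return pos, key
    return None
```

Only the index and a single data block are touched per lookup, which works because the data file is sorted on the index key.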

Primary Index Example 1, p.159
- Ordered file with r = 30,000 records
- Block size B = 1024 bytes/block
- Unspanned organization, record length R = 100 bytes/record
- Blocking factor bfr = floor(B/R) records/block
- # blocks b = ??? (use units to determine the formula!)
- Binary search on the data file takes about log2 b block accesses, where b is the number of blocks.
- Compare with a search through the primary index; see p.159.
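One way to work through the arithmetic of this example is the short sketch below. It assumes the usual unit analysis, blocks = records / (records per block), rounded up; the variable names follow the slide.

```python
import math

r = 30_000   # records in the ordered file
B = 1024     # block size in bytes
R = 100      # record length in bytes (unspanned)

bfr = B // R                          # records per block
b = math.ceil(r / bfr)                # blocks needed for the data file
binary_search_accesses = math.ceil(math.log2(b))   # block accesses for binary search
```

With these inputs, bfr is 10 records per block, the file occupies 3,000 blocks, and a binary search on the data file needs about 12 block accesses.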

Clustering Index
- The index key is not a key of the relation (it may have duplicate values).
- Sparse: one index entry per distinct value.
- The data file is sorted on the index key.
- Can be organized as a B+ tree.
[Figure: an index with one entry per distinct value, pointing into a data file sorted on the index key.]

Secondary Index
- The index key can be a key or a non-key attribute.
- Pointers point to records (not blocks).
- Dense: one index entry per record (sparse at higher levels).
- The data file is not sorted on the index key!
- Can be organized as a B+ tree.

Secondary Index
[Figure: a dense secondary index over an unsorted data file (50, 30, 70, 20, 40, 80, 10, 100, 60, 90), with a sparse higher index level (10, 50, 90, ...) on top.]

Secondary Index for Non-Key
Option 1: Dense index, with one index entry per record, so duplicate key values are repeated in the index file.
Problem: excess overhead in disk space!

Secondary Index for Non-Key
Option 2: Reserve multiple pointers per index entry, one for each record with that key value.
Problem: variable-size records in the index!

Secondary Index for Non-Key
Option 3: Use pointer buckets. Each index entry points to a bucket that holds the pointers to the records with that key value.

Assume the following indexes on EMP(name, dept, floor, ...):
- Name: primary
- Dept: secondary
- Floor: secondary
To find the employees in the Toy dept on the 2nd floor:
1. Find the pointers for the Toy dept.
2. Find the pointers for the 2nd floor.
3. Intersect these two sets of pointers.
4. Retrieve the records.
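The four steps above can be sketched with pointer buckets modelled as Python sets. The EMP rows, the use of list positions as record pointers, and the helper `build_secondary_index` are all hypothetical choices for this illustration.

```python
# Hypothetical EMP records; a record pointer is simply the list position.
emp = [
    {"name": "Ann",  "dept": "Toy",  "floor": 2},
    {"name": "Bob",  "dept": "Toy",  "floor": 1},
    {"name": "Carl", "dept": "Shoe", "floor": 2},
    {"name": "Dee",  "dept": "Toy",  "floor": 2},
]


def build_secondary_index(records, attribute):
    """Map each attribute value to a bucket (set) of record pointers."""
    index = {}
    for pointer, record in enumerate(records):
        index.setdefault(record[attribute], set()).add(pointer)
    return index


dept_index = build_secondary_index(emp, "dept")
floor_index = build_secondary_index(emp, "floor")

# Steps 1-3: fetch both pointer buckets and intersect them.
pointers = dept_index.get("Toy", set()) & floor_index.get(2, set())

# Step 4: only the surviving pointers cause record accesses.
result = [emp[p]["name"] for p in sorted(pointers)]
```

The intersection works on pointers alone, so records that match only one of the two conditions are never read from the data file.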

Simple Multiple level Index
[Figure: index level 2 (10, 90, 170, 250, 330, 410, 490, 570) is a sparse index on index level 1 (10, 30, 50, ..., 230), which is in turn a sparse index on the data file.]

Tree-Structured Indexes
- Problem with simple indexes: binary search is still expensive.
- Why not build indexes as multiple-level trees? Nodes of the tree are blocks.
- Types of tree-structured indexes:
  - ISAM (Indexed Sequential Access Method): a static structure.
  - B+ tree: dynamic; adjusts gracefully under inserts and deletes.

B+ Tree Index
- The index file is organized as a B+ tree.
- Height-balanced.
- Nodes are blocks of index keys and pointers.
- Order P: the maximum number of pointers that fit in a node.
- Nodes are at least 50% full.
- Supports efficient updates.

B+ Tree Index Example
P = 4
[Figure: the index file as a B+ tree with root key 100; other keys shown: 30, 120, 150, 180 and 3, 5, 11, 35, 101, 110, 130, 156, 179, 200. Leaf pointers point to data records/blocks.]

Internal Nodes
Node layout: p0 a1 p1 … ai pi ai+1 … ak pk
- p0 points to the subtree with key < a1.
- pi points to the subtree with ai <= key < ai+1.
- pk points to the subtree with key >= ak.
- A node with k keys has k+1 pointers.
- The root must have at least 2 pointers.
- Other internal nodes must have at least P/2 pointers, where P is the order of the B+ tree.
- Keys are sorted.
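The rule for choosing a pointer inside an internal node amounts to a binary search over the sorted keys. The sketch below assumes the keys of one node are held in a Python list; `child_index` is a hypothetical helper name.

```python
from bisect import bisect_right


def child_index(keys, search_key):
    """Return i such that pointer p_i leads to the subtree for search_key.

    keys is the sorted list [a1, ..., ak]; pointers are numbered p0 .. pk.
    p0 covers key < a1, p_i covers a_i <= key < a_(i+1), p_k covers key >= a_k.
    """
    # bisect_right counts how many keys are <= search_key, which is
    # exactly the index of the pointer to follow.
    return bisect_right(keys, search_key)


node_keys = [30, 100, 150]   # made-up node with k = 3 keys and 4 pointers
```

For example, `child_index(node_keys, 20)` selects p0, while a search key equal to a stored key, such as 100, follows the pointer to its right (p2), matching the condition ai <= key < ai+1.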

Leaf Nodes
Node layout: a1 p1 … ai pi … ak pk p, where p1 … pk point to data records and p points to the next leaf node.
- All leaf (external) nodes are at the same level.
- A leaf must have k >= P/2 keys, unless it is the only node in the tree.
- Keys are sorted.
- A leaf has a block pointer to the next leaf node (the other pointers can be block or record pointers).
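The next-leaf pointer is what makes range queries cheap: once the first qualifying leaf is reached, the scan simply follows the chain. The sketch below is a simplified illustration with made-up leaf contents; the class `Leaf` and the helpers `link` and `range_scan` are hypothetical, and the scan starts at the leftmost leaf instead of descending from the root.

```python
class Leaf:
    def __init__(self, keys):
        self.keys = keys     # sorted keys stored in this leaf
        self.next = None     # pointer to the next leaf, or None at the end


def link(leaves):
    """Chain the leaves left to right and return the first one."""
    for left, right in zip(leaves, leaves[1:]):
        left.next = right
    return leaves[0]


def range_scan(start_leaf, low, high):
    """Collect the keys k with low <= k <= high by walking the leaf chain."""
    out = []
    leaf = start_leaf
    while leaf is not None:
        for k in leaf.keys:
            if k > high:
                return out   # keys are sorted, so nothing further qualifies
            if k >= low:
                out.append(k)
        leaf = leaf.next
    return out


first = link([Leaf([3, 5, 11]), Leaf([30, 35]), Leaf([100, 101, 110]),
              Leaf([120, 130]), Leaf([150, 156, 179]), Leaf([180, 200])])
```

Calling `range_scan(first, 30, 120)` visits four leaves and stops as soon as it sees 130, without touching the internal nodes again.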

An Example
File: Employees(SSN, Name, Dept, Age, Phone)
- Attribute sizes in bytes: SSN (9), Name (25), Dept (4), Age (4), Phone (10).
- Block (page) size = 1024 bytes.
- # of records: 30,000, packed unspanned.
What is the file size in pages?
- Tuple size = 9 + 25 + 4 + 4 + 10 = 52 bytes
- bfEmployee = floor(1024 / 52) = 19 records/page (block)
- bEmployee = ceil(30,000 / 19) = 1,579 pages (blocks)

Example: B+ Tree Primary Index
- Pointer size = 4 bytes
- Nodes are 70% full
How big is a B+ tree primary index on SSN?
- Order: P = floor((1024 + 9) / (9 + 4)) = 79
- Average order = 79 * 0.7 = 55.3, rounded up to 56 pointers
- # pointers per internal node = 56
- # index entries per leaf node = 55
- # index entries = 1,579 (one per page)
- # leaf nodes = ceil(1,579 / 55) = 29
- # nodes at the next level = ceil(29 / 56) = 1

Example: B+ Tree Primary Index
- Total # of levels = 2
- Total # of index nodes (pages) = 29 + 1 = 30
- To answer the query
    select * from Employees where SSN = … ;
  # of page I/Os = 3 (2 index pages + 1 data page).

Example: B+ Tree Secondary Index
- B+ tree secondary index on Dept
- # of distinct values = 1,000
- Assume a dense index: # of index entries = 30,000
- Size of an index entry = 8 bytes (4-byte Dept + 4-byte pointer)
- Order: P = floor((1024 + 4) / (4 + 4)) = 128
- Assume nodes are 70% full: 128 * 0.7 = 89.6, so an internal node has 90 pointers and a leaf node has 89 keys.

Example: B+ Tree Secondary Index
- # of leaf nodes = ceil(30,000 / 89) = 338
- # of nodes at the 2nd level = ceil(338 / 90) = 4
- # of nodes at the 3rd level = ceil(4 / 90) = 1
- Total # of levels = 3
- Total # of pages = 338 + 4 + 1 = 343
- # of records per distinct value = 30,000 / 1,000 = 30 (assume each is on a different page)
- To find all employees with "Dept = x", # of page I/Os = 33 (3 index pages + 30 data pages)

Indexes in SQL
- Create a secondary index:
    create index Salary_Index on Employees(Salary);
- Create an index on a key:
    create unique index SSN_Index on Employees(SSN);
- Drop an index:
    drop index Salary_Index;
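These statements can be tried out with Python's built-in sqlite3 module, as in the sketch below. SQLite is used here only as a convenient stand-in, and the table definition and the inserted rows are made up; other database systems may differ in details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (SSN TEXT, Name TEXT, Salary INTEGER)")

# Secondary index on a non-key attribute.
conn.execute("CREATE INDEX Salary_Index ON Employees(Salary)")
# Unique index on a key: duplicate SSN values are rejected.
conn.execute("CREATE UNIQUE INDEX SSN_Index ON Employees(SSN)")

conn.execute("INSERT INTO Employees VALUES ('111223333', 'Ann', 50000)")
try:
    conn.execute("INSERT INTO Employees VALUES ('111223333', 'Bob', 60000)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

conn.execute("DROP INDEX Salary_Index")

# List the indexes that remain after the drop.
index_names = sorted(
    row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index'"))
```

The second insert fails because the unique index enforces the key constraint, and after the drop only SSN_Index remains in the catalog.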

Summary
- Hashing is very efficient, but it is effective only when the search condition is an equality.
- Indexing is effective for range selections as well as equality selections.
- Simple indexing is good for small files.
- ISAM is good if updates are infrequent.
- The B+ tree is a dynamic structure: inserts and deletes leave the tree height-balanced, at O(logP N) cost. It typically has 3 or 4 levels for large files.

Summary (Contd.)
- A B+ tree is almost always better than maintaining a sorted file (no sorting, no global moving of records).
- Nodes are typically 67%-70% full on average.
- Usually preferable to ISAM, modulo locking considerations; adjusts to growth gracefully.
- Oracle automatically creates an index for primary key attribute(s) and unique attribute(s).
- It is not possible to specify different types of index or hashing using SQL.
- Many other types of indexes exist.

Look Ahead
- Next topic: Query Processing and Optimization
- Read textbook: Chapter 18
