Hashing and Indexing. John Ortiz. Lecture 18.


1 Hashing and Indexing John Ortiz

2 Overview
• How to retrieve data records with a key value?
  • Sequential search (O(N))
  • Binary search (O(log_2 N))
  • Hashing
  • Conventional (simple) indexing
  • B+ tree indexing
  • More advanced hashing & indexing

3 Hashing
• A hash function maps a key value to a bucket address where the record can be found
• Good for queries with condition A = v
• Typical complexity O(1)
[Figure: key → h(key) → buckets (typically 1 disk block each)]

4 Hash Table
• What is stored in buckets?
  • Option 1: Actual records (hash table is the file)
  • Option 2: Key values and pointers to records
• Should we sort keys in buckets?
[Figure: key → h(key) → hash table; each entry holds a key and a pointer to the record in the data file]

5 Hash Functions
• Perfect hash functions distribute key values evenly across buckets (very difficult to come by). Good hash functions produce a random-looking distribution.
• Ex: Key = ‘x1 x2 … xn’ is an n-byte character string and the table has b buckets: h(key) = (x1 + x2 + … + xn) mod b
• Many other choices (see Knuth Vol. 3)
• May have to handle collisions (bucket overflow), typically by chaining overflow blocks
• With K mod M, 0 <= result <= M-1
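The additive hash above can be sketched in a few lines. This is only an illustration: the function name `h` follows the slide, while `build_table`, the sample keys, and the use of a Python list to stand in for a bucket plus its overflow chain are assumptions made for the example.

```python
def h(key: str, b: int) -> int:
    """Additive hash: sum the bytes of the key, then reduce modulo b buckets."""
    return sum(key.encode("utf-8")) % b


def build_table(keys, b):
    """Place each key in bucket h(key). A Python list stands in for a bucket
    together with its chain of overflow blocks (an assumption for brevity)."""
    buckets = [[] for _ in range(b)]
    for key in keys:
        buckets[h(key, b)].append(key)
    return buckets


# Hypothetical sample keys; "ab" and "ba" have the same byte sum, so they collide.
table = build_table(["ab", "ba", "cat", "dog"], b=7)
```

Because addition ignores byte order, any two anagrams land in the same bucket, which is one reason this simple function falls short of an even distribution.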

6 Loading Factor
• Loading factor = # keys loaded / total # keys that fit
• Try to keep loading factor between 50% & 80%
  • If < 50%, wasting space
  • If > 80%, overflows become significant; how significant depends on how good the hash function is and on # keys per bucket
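As a small worked illustration of the ratio, the sketch below computes a loading factor and checks it against the 50% to 80% guideline. The function names and the sample numbers are hypothetical.

```python
def loading_factor(keys_loaded: int, num_buckets: int, keys_per_bucket: int) -> float:
    """# keys loaded divided by the total # keys that fit in the table."""
    return keys_loaded / (num_buckets * keys_per_bucket)


def assess(lf: float) -> str:
    """Apply the 50%-80% rule of thumb from the slide."""
    if lf < 0.5:
        return "wasting space"
    if lf > 0.8:
        return "overflows likely significant"
    return "within target range"


# Hypothetical table: 100 buckets holding 10 keys each, 700 keys loaded.
lf = loading_factor(700, num_buckets=100, keys_per_bucket=10)
status = assess(lf)
```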

7 Indexing
[Figure: an index file with pointers into a data file; example: find the record with key 60]

8 Terms
• Data file: contains blocks of data records
• Index file: contains blocks of index entries
• Index entry: an index key paired with an address
  • Index key: not necessarily a key of a relation
  • Address: for record with a key value (may be block or record address)
• Search key: index value used for a search

9 Types of Indexes
• What type of index key is used?
  • Primary indexes
  • Clustering indexes
  • Secondary indexes
• Multilevel indexes
  • Dynamic indexes, B-trees, B+ trees
• Does every record have an index entry?
  • Dense index, sparse index
• Consideration: How does it handle updates?
  • Insertions? Deletions?

10 Primary Index
• Index key is the primary key
• Sparse: one entry per data block
• Data file sorted on index key
• Can be B+ tree as well
[Figure: index entries pointing to blocks of the data file]
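A sparse primary index lookup can be sketched as follows, assuming each index entry stores the first key of a data block. The block contents and the helper names `build_sparse_index` and `lookup` are hypothetical; in-memory lists stand in for disk blocks.

```python
from bisect import bisect_right


def build_sparse_index(blocks):
    """One index entry per data block: the first (smallest) key in that block."""
    return [block[0][0] for block in blocks]


def lookup(blocks, index, key):
    """Binary-search the index for the last entry <= key, then scan that block."""
    pos = bisect_right(index, key) - 1
    if pos < 0:
        return None  # key is smaller than every key in the file
    for k, record in blocks[pos]:
        if k == key:
            return record
    return None


# Hypothetical data file sorted on the index key, two records per block.
blocks = [
    [(10, "rec10"), (20, "rec20")],
    [(30, "rec30"), (40, "rec40")],
    [(50, "rec50"), (60, "rec60")],
]
index = build_sparse_index(blocks)
found = lookup(blocks, index, 60)
```

The sparse index only works because the data file is sorted on the index key: a key that is absent from the index can still be located by scanning the one block whose range covers it.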

11 Primary Index
• Example 1, p.159
• Ordered file, r = 30,000 records
• Block size B = 1024 bytes/block
• Unspanned
• Record length R = 100 bytes/record
• Bfr = floor(B/R) records/block
• # blocks b = ??? (use units to determine formula!)
• Binary search on the data file takes ~log_2 b block accesses, where b is the number of blocks
• Compare to search with primary index, see p.159
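One way to work through the numbers of this example is sketched below. The variable names are chosen for the sketch; the formula for the number of blocks follows from the units (records divided by records per block), rounded up because the file is unspanned.

```python
import math

r = 30_000   # records in the ordered file
B = 1024     # block size in bytes
R = 100      # record length in bytes

bfr = B // R                                       # records per block (unspanned)
num_blocks = math.ceil(r / bfr)                    # records / (records per block)
binary_search_accesses = math.ceil(math.log2(num_blocks))
```

With these inputs the file occupies 3,000 blocks and a binary search on the data file touches about 12 of them.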

12 Clustering Index
• Index key is not a key of the relation (may have duplicate values)
• Sparse: one entry per distinct value
• Data file sorted on index key
• Can be B+ tree
[Figure: index entries pointing into the data file]

13 Secondary Index
• Index key: can be key or non-key
• Pointers point to record (not block)
• Dense: one entry per record (sparse at higher levels)
• Data file not sorted on index key!
• Can be B+ tree

14 Secondary Index
[Figure: data file, the index on it, and a sparse high-level index on top]

15 Secondary Index for Non-Key
• Option 1: Dense index
  • Problem: excess overhead (disk space)
[Figure: data file and index file]

16 Secondary Index for Non-Key
• Option 2: Reserve multiple pointers
  • Problem: variable-size records in index!
[Figure: data file and index file]

17 Secondary Index for Non-Key
• Option 3: Use pointer buckets
[Figure: data file, index file, and buckets of pointers between them]

18 Advantage of Pointer Buckets
• Assume following indexes on EMP(name, dept, floor, ...)
  • Name: primary
  • Dept: secondary
  • Floor: secondary
• Find employees in Toy dept on 2nd floor
  • Find pointers for Toy dept
  • Find pointers for 2nd floor
  • Intersect these sets of pointers
  • Retrieve records
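The four steps above can be sketched with sets of record pointers. Everything concrete here is hypothetical: the sample employees, the use of list positions as record pointers, and the helper name `find_employees`.

```python
# Hypothetical EMP data file; a record's position in the list is its pointer.
emp = [
    ("Ann", "Toy", 2),
    ("Bob", "Shoe", 2),
    ("Cy", "Toy", 1),
    ("Di", "Toy", 2),
]

# Secondary indexes: each index value maps to a bucket (set) of record pointers.
dept_index = {}
floor_index = {}
for ptr, (_, dept, floor) in enumerate(emp):
    dept_index.setdefault(dept, set()).add(ptr)
    floor_index.setdefault(floor, set()).add(ptr)


def find_employees(dept, floor):
    """Intersect the two pointer buckets first, then fetch only matching records."""
    pointers = dept_index.get(dept, set()) & floor_index.get(floor, set())
    return [emp[ptr] for ptr in sorted(pointers)]


toy_on_second = find_employees("Toy", 2)
```

The saving is that only records satisfying both conditions are read from the data file; the intersection happens entirely on pointers.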

19 Simple Multiple Level Index
[Figure: a data file with index level 1 above it and a further index level above that]

20 Tree-Structured Indexes
• Problem of simple indexes: binary search is still expensive. Why not build indexes as multiple level trees?
  • Nodes of tree are blocks
• Types of tree-structured indexes
  • ISAM (Indexed Sequential Access Method): static structure
  • B+ tree: dynamic, adjusts gracefully under inserts and deletes

21 B+ Tree Index
• The index file is organized as a B+ tree
  • Height-balanced
  • Nodes are blocks of index keys and pointers
  • Order P: max # of pointers that fit in a node
  • Nodes are at least 50% full
  • Supports efficient updates

22 B+ Tree Index Example
• P = …
[Figure: B+ tree index file with the root at the top; leaf entries point to data records/blocks]

23 Internal Nodes
• The root must have at least 2 pointers
• Other internal nodes must have at least ⌈P/2⌉ pointers, where P is the order of the B+ tree
• A node with k keys has k+1 pointers
• Keys are sorted
• Node layout: p0, a1, p1, …, ai, pi, a(i+1), …, ak, pk
  • p0 leads to keys with key < a1
  • pi leads to keys with ai <= key < a(i+1)
  • pk leads to keys with key >= ak

24 Leaf Nodes
• All external (leaf) nodes are at the same level
• Must have at least ⌈P/2⌉ keys, unless it is the only node in the tree
• Keys are sorted
• Has a block pointer to the next leaf node (the other pointers can be block or record pointers)
• Node layout: a1, p1, …, ai, pi, …, ak, pk, p(next); each pi points to data records, p(next) points to the next leaf node
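A minimal search sketch over this node layout follows, assuming in-memory objects in place of disk blocks. The class names `Internal` and `Leaf`, the helper functions, and the tiny sample tree are all hypothetical; inserts, deletes, and the minimum-fill rules are not modelled.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class Leaf:
    keys: List[Any]                # a1 .. ak, sorted
    pointers: List[Any]            # pointers[i] locates the record for keys[i]
    next: Optional["Leaf"] = None  # block pointer to the next leaf


@dataclass
class Internal:
    keys: List[Any]      # a1 .. ak, sorted
    children: List[Any]  # p0 .. pk, one more pointer than keys


def find_leaf(node, key):
    """Descend from the root: follow p0 if key < a1, pi if ai <= key < a(i+1)."""
    while isinstance(node, Internal):
        node = node.children[bisect_right(node.keys, key)]
    return node


def search(root, key):
    """Equality search: return the record pointer for key, or None."""
    leaf = find_leaf(root, key)
    i = bisect_left(leaf.keys, key)
    if i < len(leaf.keys) and leaf.keys[i] == key:
        return leaf.pointers[i]
    return None


def range_scan(root, lo, hi):
    """Range search: find the leaf for lo, then walk the next-leaf pointers."""
    leaf = find_leaf(root, lo)
    out = []
    while leaf is not None:
        for k, p in zip(leaf.keys, leaf.pointers):
            if k > hi:
                return out
            if k >= lo:
                out.append(p)
        leaf = leaf.next
    return out


# Hypothetical two-level tree: one root over two chained leaves.
leaf2 = Leaf([20, 30], ["r20", "r30"])
leaf1 = Leaf([5, 10], ["r5", "r10"], next=leaf2)
root = Internal([20], [leaf1, leaf2])
```

The next-leaf pointer is what makes range queries cheap: after one descent from the root, the scan proceeds sideways through the leaves without revisiting internal nodes.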

25 An Example
• File: Employees(SSN, Name, Dept, Age, Phone)
• Attribute sizes in bytes: SSN (9), Name (25), Dept (4), Age (4), Phone (10)
• Block (page) size = 1024 bytes
• # of records: 30,000, packed unspanned
• What is the file size in pages?
  • Tuple size = 9 + 25 + 4 + 4 + 10 = 52 bytes
  • bf_Employee = ⌊1024 / 52⌋ = 19 records/page (block)
  • b_Employee = ⌈30,000 / 19⌉ = 1,579 pages (blocks)
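The same file-size arithmetic, written out as a short check. The variable names are chosen for this sketch.

```python
import math

attribute_sizes = {"SSN": 9, "Name": 25, "Dept": 4, "Age": 4, "Phone": 10}
block_size = 1024      # bytes per page
num_records = 30_000

tuple_size = sum(attribute_sizes.values())               # bytes per record
blocking_factor = block_size // tuple_size               # whole records per page (unspanned)
file_pages = math.ceil(num_records / blocking_factor)    # pages needed for the file
```

The floor in the blocking factor reflects unspanned packing (a record never straddles two pages), and the ceiling in the page count accounts for the partly filled last page.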

26 Example: B+ Tree Primary Index
• Pointer size = 4 bytes
• Nodes are 70% full
• How big is a B+ tree primary index on SSN?
  • Order: P = ⌊(1024 + 9) / (9 + 4)⌋ = 79
  • Average order = ⌈79 × 0.7⌉ = 56 pointers
  • # pointers of internal nodes = 56
  • # index entries in leaf node = 55
  • # index entries = 1,579 (one per page)
  • # leaf nodes = ⌈1,579 / 55⌉ = 29
  • # nodes at next level = ⌈29 / 56⌉ = 1

27 Example: B+ Tree Primary Index
• Total # of levels = 2
• Total # of index nodes (pages) = 30
• To answer the query: select * from Employees where SSN = …;
  • # of page I/Os = 3 (2 index pages + 1 data page)

28 Example: B+ Tree Secondary Index
• B+ tree secondary index on Dept
• # of distinct values = 1000
• Assume a dense index
  • # of index entries = 30,000 (one per record)
  • Size of index entry = 8 bytes (4-byte Dept + 4-byte pointer)
  • Order: P = ⌊(1024 + 4) / (4 + 4)⌋ = 128
• Assume nodes are 70% full
  • Internal node has 90 pointers
  • Leaf node has 89 keys

29 Example: B+ Tree Secondary Index
• # of leaf nodes = ⌈30,000 / 89⌉ = 338
• # of nodes at 2nd level = ⌈338 / 90⌉ = 4
• # of nodes at 3rd level = ⌈4 / 90⌉ = 1
• Total # of levels = 3
• Total # of pages = 343
• # records per distinct value = 30 (each on a different page)
• To find all employees of “Dept = x”, # of page I/O = 33 (3 pages of index + 30 pages of data)
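Both sizing examples follow the same recipe, which can be sketched as below. The helper names `order` and `level_sizes` are hypothetical, and the sketch assumes that a node with P pointers holds P-1 keys and that a leaf holds one entry fewer than the average fanout, matching the numbers on the slides.

```python
import math


def order(block_size, key_size, ptr_size):
    """Largest P such that P pointers and P-1 keys fit in one block:
    P*ptr + (P-1)*key <= block  =>  P <= (block + key) / (key + ptr)."""
    return (block_size + key_size) // (key_size + ptr_size)


def level_sizes(num_entries, order_p, fill=0.7):
    """Node counts per level, leaves first, assuming nodes are `fill` full."""
    fanout = math.ceil(order_p * fill)   # average pointers per internal node
    leaf_capacity = fanout - 1           # average index entries per leaf
    levels = [math.ceil(num_entries / leaf_capacity)]
    while levels[-1] > 1:
        levels.append(math.ceil(levels[-1] / fanout))
    return levels


# Primary index on SSN: 9-byte key, one entry per data page.
primary = level_sizes(1579, order(1024, key_size=9, ptr_size=4))
# Dense secondary index on Dept: 4-byte key, one entry per record.
secondary = level_sizes(30_000, order(1024, key_size=4, ptr_size=4))

primary_lookup_io = len(primary) + 1       # index pages + 1 data page
secondary_lookup_io = len(secondary) + 30  # index pages + 30 data pages
```

The secondary lookup costs far more than the primary one even though the tree is only one level taller, because each of the 30 matching records is assumed to sit on a different data page.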

30 Indexes in SQL
• Create a secondary index
    create index Salary_Index on Employees(Salary);
• Create an index on a key
    create unique index SSN_Index on Employees(SSN);
• Drop an index
    drop index Salary_Index;
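These statements can be tried out against an in-memory SQLite database through Python's standard `sqlite3` module. The sketch below is only an illustration: the slides refer to Oracle, the table definition (with a Salary column) and the sample rows are assumptions, and SQLite's catalog table `sqlite_master` is specific to SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table definition; no constraints, so SQLite creates no automatic indexes.
conn.execute("create table Employees (SSN text, Name text, Salary integer)")

conn.execute("create index Salary_Index on Employees(Salary)")
conn.execute("create unique index SSN_Index on Employees(SSN)")

conn.execute("insert into Employees values ('123456789', 'Ann', 50000)")
try:
    # Same SSN again: the unique index should reject this row.
    conn.execute("insert into Employees values ('123456789', 'Bob', 60000)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

conn.execute("drop index Salary_Index")
remaining = [row[0] for row in conn.execute(
    "select name from sqlite_master where type = 'index' order by name")]
```

The unique index doubles as a constraint: besides speeding up lookups on SSN, it makes the second insert fail.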

31 Summary
• Hashing is very efficient, but is effective only when the search condition is equality
• Indexing is effective for range selection as well as equality selection
• Simple indexing is good for small files
• ISAM is good if updates are infrequent
• B+ tree is a dynamic structure
  • Inserts/deletes leave B+ tree height-balanced; O(log_P N) cost
  • Typically has 3 or 4 levels for large files
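The contrast between equality and range selection can be illustrated with a dictionary standing in for a hash index and a sorted list standing in for an ordered index. The sample records and function names are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Hypothetical (age, name) records.
records = [(31, "Ann"), (25, "Bob"), (42, "Cy"), (25, "Di"), (37, "Ed")]

# Hash-style index: direct access by exact key, no ordering between keys.
hash_index = {}
for age, name in records:
    hash_index.setdefault(age, []).append(name)

# Ordered index: entries sorted on the key, so neighbours in key order are adjacent.
ordered = sorted(records)
ordered_keys = [age for age, _ in ordered]


def equal(age):
    """Equality selection: a single hash probe."""
    return hash_index.get(age, [])


def in_range(lo, hi):
    """Range selection: two binary searches, then one contiguous slice."""
    start = bisect_left(ordered_keys, lo)
    end = bisect_right(ordered_keys, hi)
    return [name for _, name in ordered[start:end]]
```

Answering `in_range` from `hash_index` alone would require probing every possible key in the range or scanning every bucket, which is why hashing is reserved for equality conditions.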

32 Summary (Contd.)
• A B+ tree is almost always better than maintaining a sorted file (no sorting, no global moving)
  • Typically 67%-70% full on average
  • Usually preferable to ISAM, modulo locking considerations; adjusts to growth gracefully
• Oracle automatically creates an index for primary key attribute(s) and unique attribute(s)
• It is not possible to specify different types of index or hashing using SQL
• Many other types of indexes …

33 Look Ahead
• Next topic: Query Processing and Optimization
• Read textbook:
  • Chapter 18

