ICOM 6005 – Database Management Systems Design Dr. Manuel Rodríguez-Martínez Electrical and Computer Engineering Department Lecture 11 – Hash-based Indexing.

Slides:



Advertisements
Similar presentations
External Memory Hashing. Model of Computation Data stored on disk(s) Minimum transfer unit: a page = b bytes or B records (or block) N records -> N/B.
Advertisements

ICOM 6005 – Database Management Systems Design Dr. Manuel Rodríguez-Martínez Electrical and Computer Engineering Department Lecture 8 – File Structures.
CS4432: Database Systems II Hash Indexing 1. Hash-Based Indexes Adaptation of main memory hash tables Support equality searches No range searches 2.
Hash-Based Indexes Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY.
Hash-based Indexes CS 186, Spring 2006 Lecture 7 R &G Chapter 11 HASH, x. There is no definition for this word -- nobody knows what hash is. Ambrose Bierce,
1 Hash-Based Indexes Module 4, Lecture 3. 2 Introduction As for any index, 3 alternatives for data entries k* : – Data record with key value k – –Choice.
Hashing. CENG 3512 Motivation The primary goal is to locate the desired record in a single access of disk. – Sequential search: O(N) – B+ trees: O(log.
Hash-Based Indexes The slides for this text are organized into chapters. This lecture covers Chapter 10. Chapter 1: Introduction to Database Systems Chapter.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Hash-Based Indexes Chapter 11.
Quick Review of Apr 10 material B+-Tree File Organization –similar to B+-tree index –leaf nodes store records, not pointers to records stored in an original.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Hash-Based Indexes Chapter 11.
CPSC 404, Laks V.S. Lakshmanan1 Hash-Based Indexes Chapter 11 Ramakrishnan & Gehrke (Sections )
Chapter 11 (3 rd Edition) Hash-Based Indexes Xuemin COMP9315: Database Systems Implementation.
6. Files of (horizontal) Records
Copyright 2003Curt Hill Hash indexes Are they better or worse than a B+Tree?
©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree.
CST203-2 Database Management Systems Lecture 7. Disadvantages on index structure: We must access an index structure to locate data, or must use binary.
Hash Indexes: Chap. 11 CS634 Lecture 6, Feb
ICOM 4035 – Data Structures Lecture 12 – Hashtables & Map ADT Manuel Rodriguez Martinez Electrical and Computer Engineering University of Puerto Rico,
Index tuning Hash Index. overview Introduction Hash-based indexes are best for equality selections. –Can efficiently support index nested joins –Cannot.
ICS 421 Spring 2010 Indexing (2) Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa 2/23/20101Lipyeow Lim.
1 Hash-Based Indexes Yanlei Diao UMass Amherst Feb 22, 2006 Slides Courtesy of R. Ramakrishnan and J. Gehrke.
B+-tree and Hashing.
Spring 2003 ECE569 Lecture ECE 569 Database System Engineering Spring 2003 Yanyong Zhang
1 Hash-Based Indexes Chapter Introduction  Hash-based indexes are best for equality selections. Cannot support range searches.  Static and dynamic.
FALL 2004CENG 3511 Hashing Reference: Chapters: 11,12.
1 Hash-Based Indexes Chapter Introduction : Hash-based Indexes  Best for equality selections.  Cannot support range searches.  Static and dynamic.
Spring 2004 ECE569 Lecture ECE 569 Database System Engineering Spring 2004 Yanyong Zhang
ICOM 6005 – Database Management Systems Design Dr. Manuel Rodríguez-Martínez Electrical and Computer Engineering Department Lecture 14 – Join Processing.
Hashing and Hash-Based Index. Selection Queries Yes! Hashing  static hashing  dynamic hashing B+-tree is perfect, but.... to answer a selection query.
1 Database Systems ( 資料庫系統 ) November 8, 2004 Lecture #9 By Hao-hua Chu ( 朱浩華 )
Database Management 7. course. Reminder Disk and RAM RAID Levels Disk space management Buffering Heap files Page formats Record formats.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Hash-Based Indexes Chapter 11 Modified by Donghui Zhang Jan 30, 2006.
Introduction to Database, Fall 2004/Melikyan1 Hash-Based Indexes Chapter 10.
1.1 CS220 Database Systems Indexing: Hashing Slides courtesy G. Kollios Boston University via UC Berkeley.
Static Hashing (using overflow for collision managment e.g., h(key) mod M h key Primary bucket pages 1 0 M-1 Overflow pages(as separate link list) Overflow.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Indexed Sequential Access Method.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Hash-Based Indexes Chapter 10.
COSC 2007 Data Structures II Chapter 13 Advanced Implementation of Tables IV.
ICOM 5016 – Introduction to Database Systems Lecture 13- File Structures Dr. Bienvenido Vélez Electrical and Computer Engineering Department Slides by.
B-Trees, Part 2 Hash-Based Indexes R&G Chapter 10 Lecture 10.
1 Database Systems ( 資料庫系統 ) November 28, 2005 Lecture #9.
Chapter 5 Record Storage and Primary File Organizations
Database Applications (15-415) DBMS Internals- Part IV Lecture 15, March 13, 2016 Mohammad Hammoud.
Database Management 7. course. Reminder Disk and RAM RAID Levels Disk space management Buffering Heap files Page formats Record formats.
COP Introduction to Database Structures
Dynamic Hashing (Chapter 12)
Are they better or worse than a B+Tree?
Hash-Based Indexes Chapter 11
Hashing CENG 351.
Database Management Systems (CS 564)
Introduction to Database Systems
External Memory Hashing
CS222: Principles of Data Management Notes #8 Static Hashing, Extendible Hashing, Linear Hashing Instructor: Chen Li.
Hash-Based Indexes R&G Chapter 10 Lecture 18
Hash-Based Indexes Chapter 10
CS222P: Principles of Data Management Notes #8 Static Hashing, Extendible Hashing, Linear Hashing Instructor: Chen Li.
Hashing.
Hash-Based Indexes Chapter 11
Index tuning Hash Index.
Advance Database System
Database Systems (資料庫系統)
LINEAR HASHING E0 261 Jayant Haritsa Computer Science and Automation
ICOM 5016 – Introduction to Database Systems
Index tuning Hash Index.
Hash-Based Indexes Chapter 11
Chapter 11 Instructor: Xin Zhang
ICOM 5016 – Introduction to Database Systems
CS222/CS122C: Principles of Data Management UCI, Fall 2018 Notes #07 Static Hashing, Extendible Hashing, Linear Hashing Instructor: Chen Li.
Presentation transcript:

ICOM 6005 – Database Management Systems Design Dr. Manuel Rodríguez-Martínez Electrical and Computer Engineering Department Lecture 11 – Hash-based Indexing ©Manuel Rodriguez – All rights reserved

ICOM 6005Dr. Manuel Rodriguez Martinez2 Hash-Based Indexing Read –New Book: Chapter 11 Hash methods can be used for index files to support efficient searches by equality –Often require 1 to 2 I/O operation Three type of hashing schemes –Static Hashing –Extensible Hashing –Linear Hashing In practice, commercial DBMS use hashing indexing for temporary calculations –Aggregation and joins –Tree-based indices are use as actual indices on relations

ICOM 6005Dr. Manuel Rodriguez Martinez3 General hashing approach Hash File for a relation R has N pages –Each page is called a bucket –Buckets are numbered from 0 to N – 1 –If a bucket gets full, an overflow page must be chained to it Each record t in R has a search key k –Can be made out of 1 or more attributes –Ex. Studens(sid, name, login, age, gpa) Search key: age attribute A hash function is used to map the search key k of a record t in R to bucket number [0, N-1] –Hash function should distribute records uniformly –Record is searched inside the bucket

ICOM 6005Dr. Manuel Rodriguez Martinez4 Hash Index (clustered) 121JilNY$ BobNY$ PatWI$ BillLA$ NedSJ$ AlSF$ NedNY$ TimMIA$4000 H() Account attribute

ICOM 6005Dr. Manuel Rodriguez Martinez5 Hash Index (Unclustered) 121JilNY$ BobNY$ PatWI$ BillLA$ NedSJ$ AlSF$ NedNY$ TimMIA$4000 H() LA NY MIA SJ SF WI city

ICOM 6005Dr. Manuel Rodriguez Martinez6 Static Hashing: Scheme … … … N -1 h Search Key k … Primary BucketsOverflow Pages

ICOM 6005Dr. Manuel Rodriguez Martinez7 Static Hashing: Issues Number of primary buckets is fixed at file creation Hash function maps key to a bucket number Typical hash function –H(k) = a*k + b –Bucket number = h(k) mod N Key can be int or char –Char – each character is mapped to ASCII, and all value are added to get an integer –Parameters a and b are choose to tune the distribution of values (i.e. need to play with this values to get them right …) When a primary bucket get full, need to create an overflow page and chain it to primary bucket.

ICOM 6005Dr. Manuel Rodriguez Martinez8 Static hashing: Operations Search for key value k: –Hash k to find the bucket, call this bucket B –Search records in B to find the one(s) with key k –If records are found clustered, the data record is there: Cost: 1 I/O unclustered, need to fetch the actual data page: Cost 2 I/Os –If records are not found, need to search in overflow pages (if there are any) Clustered: Cost: (1 + number of pages searched) * I/O Unclustered: Cost: (2 + number of pages searched) * I/O The more overflow page you have, the worst the performance get –Need to keep overflow pages to 1 or 2, but rarely gets done!

ICOM 6005Dr. Manuel Rodriguez Martinez9 Static Hashing: Operations (cont.) Insert (or Update) record with key k –Hash k to find bucket, call this bucket B –If bucket has room clustered, write data record there: Cost: 2 I/Os –Read page, then write it back updated unclustered, write record to actual data page: Cost 4 I/Os –If bucket is full, write to overflow page (create one if needed) Clustered: Cost: (2 + number of pages searched) * I/O Unclustered: Cost: (4 + number of pages searched) * I/O Delete costs are the same, since we need to write page back to disk Again, overflow pages make performance bad as the number of records increases

ICOM 6005Dr. Manuel Rodriguez Martinez10 Extensible hashing Allows the number of buckets to grow or shrink Hash function hashes to slots in a directory –Slots store the page id of the bucket –Directory can be kept in buffer pool –Directory can have hundreds or thousand of slots to buckets When a bucket gets full –Create a new bucket and split records between the new and full bucket Redistributes the data Hash function still works!!! –Overflow page is need only if you have many duplicate records

ICOM 6005Dr. Manuel Rodriguez Martinez11 Binary Pattern Hashing Technique Hash function will map search key to binary pattern –Ex h(51) = Last d bits in the pattern are taken as bucket number! –Ex. If d = 2, then h(51) = will yield bucket number 11, which is 3 in binary Thus, 51 goes to bucket 3 The number of d of bits used to hash the search key is called the depth Two types of depths –Bucket depth Number of bits need to hash value to a given bucket –File depth Largest depth of any bucket in the hash file

ICOM 6005Dr. Manuel Rodriguez Martinez12 Example Extensible Hashing Index 2 Directory Page Bucket A Bucket B Bucket C Bucket D Hashing on an int attribute H(4) = 100 d = 2 gives slot 00. For value 4 Bucket is then found from the slot.

ICOM 6005Dr. Manuel Rodriguez Martinez13 The use of the depth Depth tells us the number of bits that we need to use to pick a bucket –Ex. H(4) = 100, d = 2, tell us to use 00 to identify slot. This would be slot 00. Directory has a global depth –Used to hash key to proper slot Each bucket has a local depth –Used when bucket need to be slipt Let us see what happens when we need to insert the value 22 into the hash index –H(22) = 10110, d =2

ICOM 6005Dr. Manuel Rodriguez Martinez14 Hash File before the Insert 22 2 Directory Page Bucket A Bucket B Bucket C Bucket D

ICOM 6005Dr. Manuel Rodriguez Martinez15 Hash File After Inserting 22 2 Directory Page Bucket A Bucket B Bucket C Bucket D H(22) = d = 2 gives slot 10. This bucket has space

ICOM 6005Dr. Manuel Rodriguez Martinez16 The issue of a full bucket: Insert 20 2 Directory Page Bucket A Bucket B Bucket C Bucket D H(20) = d = 2 gives slot 00. This bucket is Full

ICOM 6005Dr. Manuel Rodriguez Martinez17 Splitting a bucket A full bucket gets split into two buckets –Their directory slots are called corresponding elements These buckets have the same hash value at the current depth d But at depth d + 1, they differ by 1 bit –one has a 1 at bit position d + 1 –the other has a 0 at bit position d + 1 Example: –Bucket A is split into two buckets: bucket A and bucket A2 –Bucket A, d = 2, has value 00, but at d = 3 becomes 000 –Bucket A2, d = 2, has value 00, but at d = 3 becomes 100

ICOM 6005Dr. Manuel Rodriguez Martinez18 Splitting a bucket (cont.) The values in the original bucket A and the new value to be inserted get distributed into buckets A and A2. The hash function now increment the local depth of the buckets A and A2 to be d + 1 Now, the keys are hashed to buckets using d + 1 bits Recall that bucket A had: 4, 12, 32, 16, and d = 2 –We wanted to insert 20 Now d becomes 3, we get the following hashing: –4 = 100 H(4) = 100 –12 = 1100H(12) = 100 –32 = H(32) = 000 –16 = 10000H(16) = 000 –20 = 10100H(20) = 100

ICOM 6005Dr. Manuel Rodriguez Martinez19 Corresponding elements & buckets Bucket A Bucket A Bucket A2 Operation insert 20 After Split Original bucket ASplit image of bucket A splitting Local depth is changed from 2 to 3 Need 3 bits for hashing Hash = 00 Hash = 000 Hash = 100

ICOM 6005Dr. Manuel Rodriguez Martinez20 Expanding the Directory 2 Directory Page Bucket A Bucket B Bucket C Bucket D Bucket A2 Number of slots is not enough. Need to double size of the directory and increase Global depth

ICOM 6005Dr. Manuel Rodriguez Martinez21 Expanded Hash Index Directory Page Bucket A Bucket B Bucket C Bucket D Bucket A

ICOM 6005Dr. Manuel Rodriguez Martinez22 Some Issues Some corresponding elements point to the same bucket –This means the bucket has not been split Not all splits operations cause the directory to be doubled. Each bucket has a local depth –If depth of bucket = global depth – 1, then splitting this bucket will not cause a doubling in directory Doubling only occurs when –Bucket is full and cannot fit another insertion –Bucket has same local depth as global depth

ICOM 6005Dr. Manuel Rodriguez Martinez23 Cost estimates for operations Assume directory is in buffer pool Search for equality –Clustered – 1 I/O –Un-clustered – 2 I/O Erase –Clustered – 2 I/Os –Unclustered – 4 I/Os Insert (No splitting) –Clustered – 2 I/Os –Unclustered – 4 I/Os Insert (Splitting) –Clustered – 4 I/Os –Unclustered – 8 I/Os Overflow pages will be needed when you have lots of values with the same search key k

ICOM 6005Dr. Manuel Rodriguez Martinez24 Tradeoffs of Extensible Hashing Advantages –Can gracefully adapt to insertion and deletions –Limits the number of overflow pages –Hash function is easy to implement No need for complex prime number computations Disadvantages –Directory can grow large when we have billions of records –Also, when we have skewed data distributions Lots of values go to same bucket Lots of empty buckets, a few one have all the data Overflow pages due to collisions (values that hash to same bucket) –Too much doubly in the size of the directory

ICOM 6005Dr. Manuel Rodriguez Martinez25 Linear Hashing Dynamic hashing technique –No need for directory –Limits overflow pages due to collisions –Splitting of buckets is done in a more lazy fashion Idea is to have a family of hash functions –h 0, h 1, h 2, … –Each function has a range twice as big as the predecesor –If h i maps to M buckets, h i+1 maps to 2M buckets –This is used when more buckets are needed Switch from current h i to h i+1 if we need to grow number of buckets beyond current M (we double number of buckets to 2M)

ICOM 6005Dr. Manuel Rodriguez Martinez26 Building the family of hash functions General form is –h i (key) = h(key) mod (2 i N) –h(key) acts as the base function –h(key) is the same as for extensible hashing Looks as the bit pattern in the value If N is a power of 2, and d 0 is the number of bit to represent N, then d i gives the number of bits used by function h i –d i = d 0 + i

ICOM 6005Dr. Manuel Rodriguez Martinez27 General Scheme Hash index file as an associated round number –Called Level At round number Level we use hash functions –h Level and h Level + 1 We keep track of the next bucket to be split –Buckets are split in a round robin fashion Every bucket eventually gets splits … Index file has tree types of buckets –Buckets that were split in this round –Buckets that are yet to be split –Buckets created by splits in this round

ICOM 6005Dr. Manuel Rodriguez Martinez28 Organization of Index Hash File Next bucket to be split Split in this round Original Buckets in this Level h Level used here Buckets not split New buckets resulting from splits

ICOM 6005Dr. Manuel Rodriguez Martinez29 Example scenario h1h0 Next=0