CPSC 8620: Database Management System Design. Notes 6: Hashing and More



6/8/2016. Copyright notice: Materials and lecture notes in this course are adapted from various sources, including the authors of the textbook and references, the Internet, etc. The copyright belongs to the original authors. Thanks!

Hashed Files
- Hashing for disk files is called external hashing.
- The file blocks are divided into M equal-sized buckets, numbered bucket 0, bucket 1, ..., bucket M-1.
  - Typically, a bucket corresponds to one (or a fixed number of) disk blocks.
- One of the file fields is designated as the hash key of the file.
- The record with hash key value K is stored in bucket i, where i = h(K) and h is the hashing function.
- Search is very efficient on the hash key.
- Collisions occur when a new record hashes to a bucket that is already full.
  - An overflow file is kept for storing such records.
  - Overflow records that hash to each bucket can be linked together.

Hashing: key → h(key) → buckets (typically 1 disk block each)

Two alternatives
(1) key → h(key) → records
(2) key → h(key) → index → record
Alt (2) is for a “secondary” search key.

Example hash function
- Key = ‘x1 x2 … xn’, an n-byte character string
- Have b buckets
- h: add x1 + x2 + … + xn, then compute the sum modulo b

This may not be the best function. Read Knuth Vol. 3 if you really need to select a good function.
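The byte-sum function above can be sketched in a few lines. This is only an illustration: the function name `byte_sum_hash` and the choice of UTF-8 to turn characters into bytes are assumptions, not part of the notes.

```python
def byte_sum_hash(key: str, b: int) -> int:
    """Add the byte values x1 + x2 + ... + xn of the key, then reduce modulo b."""
    if b <= 0:
        raise ValueError("number of buckets b must be positive")
    # Iterating over a bytes object yields integers, so sum() adds the bytes.
    return sum(key.encode("utf-8")) % b


# Keys that are permutations of each other always collide under this function.
same_bucket = byte_sum_hash("abc", 4) == byte_sum_hash("cab", 4)
```

The collision between permuted keys is one concrete reason the function "may not be the best".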

Next: an example to illustrate inserts, overflows, and deletes

EXAMPLE: 2 records/bucket
INSERT: h(a) = 1, h(b) = 2, h(c) = 1, h(d) = …
[Figure: buckets holding d, a, c, b]
Then h(e) = 1: bucket 1 already holds a and c, so e goes into an overflow block.

EXAMPLE: deletion
[Figure: buckets and overflow blocks holding records a, b, c, d, e, f, g]
Delete: e, f
After a delete, maybe move “g” up.
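The insert and delete examples can be replayed with a small in-memory model. This is a sketch under assumptions: the class `StaticHashFile` is hypothetical, h(d) = 0 is assumed because the transcript does not give it, and "moving a record up" is modeled by compacting the whole chain after a delete.

```python
class StaticHashFile:
    """M fixed buckets; each bucket is a chain of blocks of up to `capacity` keys."""

    def __init__(self, num_buckets, capacity, hash_fn):
        self.capacity = capacity
        self.hash_fn = hash_fn
        # buckets[i][0] is the primary block; later blocks are the overflow chain.
        self.buckets = [[[]] for _ in range(num_buckets)]

    def insert(self, key):
        chain = self.buckets[self.hash_fn(key)]
        for block in chain:
            if len(block) < self.capacity:
                block.append(key)
                return
        chain.append([key])  # every block is full: add an overflow block

    def lookup(self, key):
        return any(key in block for block in self.buckets[self.hash_fn(key)])

    def delete(self, key):
        chain = self.buckets[self.hash_fn(key)]
        for block in chain:
            if key in block:
                block.remove(key)
                break
        else:
            return False  # key not present
        # Move records up: repack the chain so no block before the last has gaps.
        keys = [k for block in chain for k in block]
        cap = self.capacity
        chain[:] = [keys[i:i + cap] for i in range(0, len(keys), cap)] or [[]]
        return True


# Hash values from the example; h(d) = 0 is an assumption.
H = {"a": 1, "b": 2, "c": 1, "d": 0, "e": 1}
f = StaticHashFile(num_buckets=4, capacity=2, hash_fn=lambda k: H[k])
for key in "abcde":
    f.insert(key)
```

After the five inserts, bucket 1 holds a and c in its primary block and e in an overflow block; deleting a pulls e back into the primary block.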

Overflows
There are numerous methods for collision resolution, including the following:
- Open addressing: Proceeding from the occupied position specified by the hash address, the program checks the subsequent positions in order until an unused (empty) position is found.
- Chaining: Various overflow locations are kept, usually by extending the array with a number of overflow positions. In addition, a pointer field is added to each record location. A collision is resolved by placing the new record in an unused overflow location and setting the pointer of the occupied hash address location to the address of that overflow location.
- Multiple hashing: The program applies a second hash function if the first results in a collision. If another collision results, the program uses open addressing, or applies a third hash function and then uses open addressing if necessary.
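Open addressing, the first method above, can be sketched as follows. The function names are hypothetical, the table is a plain Python list with `None` marking an unused position, and wrapping around at the end of the table is an assumption the notes do not state.

```python
def probe_insert(table, key, h):
    """Start at h(key) and check subsequent positions until an unused one is found."""
    n = len(table)
    start = h(key) % n
    for step in range(n):
        pos = (start + step) % n  # wrap around at the end (assumption)
        if table[pos] is None:
            table[pos] = key
            return pos
    raise RuntimeError("table is full")


def probe_lookup(table, key, h):
    """Follow the same probe sequence; an unused position means the key is absent."""
    n = len(table)
    start = h(key) % n
    for step in range(n):
        pos = (start + step) % n
        if table[pos] is None:
            return None
        if table[pos] == key:
            return pos
    return None


table = [None] * 5
positions = [probe_insert(table, k, lambda x: x) for k in (3, 8, 13)]
```

All three keys hash to position 3, so the second and third are pushed to the next free positions. Deletion is left out on purpose: simply clearing a slot would break the probe sequence for keys stored after it.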

Rule of thumb: Try to keep space utilization between 50% and 80%.
Utilization = (# keys used) / (total # keys that fit)
- If < 50%, wasting space
- If > 80%, overflows become significant; how significant depends on how good the hash function is and on # keys/bucket
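A tiny check of the rule of thumb, as a sketch. The function names and the wording of the returned messages are assumptions; the 50% and 80% bounds come from the rule itself.

```python
def utilization(keys_used, num_buckets, keys_per_bucket):
    """Utilization = keys used / total keys that fit."""
    return keys_used / (num_buckets * keys_per_bucket)


def rule_of_thumb(u, low=0.5, high=0.8):
    if u < low:
        return "wasting space"
    if u > high:
        return "overflows significant"
    return "ok"


u = utilization(keys_used=5, num_buckets=4, keys_per_bucket=2)
```

With 5 keys in 4 buckets of 2 keys each, the file is 62.5% full, which is inside the recommended band.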

Hashed Files
- The hash function h should distribute the records uniformly among the buckets.
  - Otherwise, search time will increase because many overflow records will exist.
- Main disadvantages of static external hashing:
  - The fixed number of buckets M is a problem if the number of records in the file grows or shrinks.
  - Ordered access on the hash key is quite inefficient (requires sorting the records).

How do we cope with growth?
- Overflows and reorganizations
- Dynamic hashing
  - Extensible
  - Linear

Dynamic and Extensible Hashed Files
- Hashing techniques are adapted to allow the number of file records to grow and shrink dynamically.
- These techniques include dynamic hashing, extensible hashing, and linear hashing.
- Both dynamic and extensible hashing use the binary representation of the hash value h(K) to access a directory.
  - In dynamic hashing the directory is a binary tree.
  - In extensible hashing the directory is an array of size 2^d, where d is called the global depth.
- Records are distributed among buckets based on the values of the leading bits of their hash values.

Extensible hashing: two ideas
(a) Use i of the b bits output by the hash function h(K); i grows over time.
(b) Use a directory: h(K)[i] → directory entry → bucket

Example: h(k) is 4 bits; 2 keys/bucket
Insert 1010
[Figure: the target bucket is full, so it is split and the directory is doubled.]
New directory: i = 2

Example continued
Insert: …
Insert: 1001
[Figure: the target bucket is full again, so it is split and the directory is doubled.]
New directory: i = 3
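The insert procedure (split a full bucket, doubling the directory when the bucket already uses all i bits) can be sketched as below. Everything here is illustrative: the class names are hypothetical, hash values are passed in directly as b-bit integers, and the keys other than 1010 and 1001 are assumed, since the transcript lost them. A bucket that has used all b bits is allowed to exceed its capacity, as a stand-in for an overflow chain.

```python
class Bucket:
    def __init__(self, depth):
        self.depth = depth  # number of leading bits shared by all keys here
        self.keys = []


class ExtensibleHash:
    def __init__(self, b=4, capacity=2):
        self.b = b                # bits output by the hash function
        self.capacity = capacity  # keys per bucket
        self.i = 1                # leading bits currently used by the directory
        self.directory = [Bucket(1), Bucket(1)]

    def _index(self, h):
        return h >> (self.b - self.i)  # h(K)[i]: the leading i bits

    def lookup(self, h):
        return h in self.directory[self._index(h)].keys

    def insert(self, h):
        while True:
            bucket = self.directory[self._index(h)]
            if len(bucket.keys) < self.capacity or bucket.depth == self.b:
                bucket.keys.append(h)  # room, or no bits left to split on
                return
            if bucket.depth == self.i:
                # Double the directory: old entry j becomes entries 2j and 2j+1.
                self.directory = [bk for bk in self.directory for _ in (0, 1)]
                self.i += 1
            self._split(bucket)

    def _split(self, bucket):
        bucket.depth += 1
        sibling = Bucket(bucket.depth)
        shift = self.b - bucket.depth
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys:
            (sibling if (k >> shift) & 1 else bucket).keys.append(k)
        # Repoint the directory entries whose newly used bit is 1.
        for j, bk in enumerate(self.directory):
            if bk is bucket and (j >> (self.i - bucket.depth)) & 1:
                self.directory[j] = sibling


t = ExtensibleHash(b=4, capacity=2)
for h in (0b0001, 0b1001, 0b1100):  # assumed starting contents
    t.insert(h)
t.insert(0b1010)
depth_after_1010 = t.i
for h in (0b0111, 0b0000):          # assumed intermediate inserts
    t.insert(h)
t.insert(0b1001)
```

With these assumed keys, inserting 1010 grows the directory to i = 2 and the later insert of 1001 grows it to i = 3, matching the two directory doublings in the example.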

Extensible hashing: deletion
Two options:
- No merging of blocks
- Merge blocks and cut the directory if possible (reverse of the insert procedure)

Deletion example: run through the insert example in reverse!

Note: Still need overflow chains
Example: many records with duplicate keys. Insert 1100. [Figure: what happens if we split]
Solution: overflow chains. Insert 1100: add an overflow block.

Summary: Extensible hashing
(+) Can handle growing files
    - with less wasted space
    - with no full reorganizations
(-) Indirection (not bad if the directory is in memory)
(-) Directory doubles in size

Dynamic Hashing
- In dynamic hashing the addresses of the buckets are either the n high-order bits or the n − 1 high-order bits, depending on the total number of keys belonging to the respective bucket.
- Dynamic hashing maintains a tree-structured directory with two types of nodes:
  - Internal nodes, which have two pointers: a left pointer corresponding to the 0 bit (in the hashed address) and a right pointer corresponding to the 1 bit.
  - Leaf nodes, which hold a pointer to the actual bucket with records.

Linear hashing
Another dynamic hashing scheme. Two ideas:
(a) Use the i low-order bits of the b-bit hash; i grows.
(b) File grows linearly.

Example: b = 4 bits, i = 2, 2 keys/bucket
m = 01 (max used block); the buckets beyond m are future growth buckets.
Rule: If h(k)[i] ≤ m, then look at bucket h(k)[i]; else, look at bucket h(k)[i] − 2^(i−1).
Insert 0101. Note that we can have overflow chains!
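The lookup rule translates directly into bit arithmetic. A minimal sketch, with a hypothetical function name; hash values are given as integers.

```python
def linear_hash_bucket(h, i, m):
    """Apply the rule: use the i low-order bits unless that bucket does not exist yet."""
    a = h & ((1 << i) - 1)      # h(k)[i]: the i low-order bits
    if a <= m:
        return a
    return a - (1 << (i - 1))   # drop the top bit: a - 2^(i-1)


# With i = 2 and m = 01, as in the example:
bucket_0101 = linear_hash_bucket(0b0101, i=2, m=0b01)
bucket_1111 = linear_hash_bucket(0b1111, i=2, m=0b01)
bucket_1010 = linear_hash_bucket(0b1010, i=2, m=0b01)
```

Key 0101 ends in 01, which exists, so it goes to bucket 01. Keys ending in 11 or 10 would belong to buckets that are not allocated yet, so they fall back to buckets 01 and 00.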

Note: In the textbook, n is used instead of m, with n = m + 1. Here m = 01 (max used block), so n = 10.

Example, continued: b = 4 bits, i = 2, 2 keys/bucket, m = 01 (max used block)
Insert 0101, followed by further inserts.
[Figure sequence: the file grows one bucket at a time into the future growth buckets.]

Example continued: How to grow beyond this?
m = 11 (max used block), i = 2
With m = 11, every 2-bit address is in use, so i must be increased to 3 before m can grow further.

When do we expand the file?
- Keep track of U = (# used slots) / (total # of slots)
- If U > threshold, then increase m (and maybe i)
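Putting the lookup rule and the expansion test together gives a compact model of linear hashing. This is a sketch under assumptions: the class `LinearHash` and the default threshold of 0.8 are not from the notes, hash values are passed in directly, and a Python list that grows past `capacity` stands in for an overflow chain.

```python
class LinearHash:
    def __init__(self, capacity=2, threshold=0.8):
        self.capacity = capacity
        self.threshold = threshold
        self.i = 1               # low-order bits in use
        self.m = 1               # max used bucket number
        self.buckets = [[], []]
        self.count = 0

    def _addr(self, h):
        a = h & ((1 << self.i) - 1)
        return a if a <= self.m else a - (1 << (self.i - 1))

    def lookup(self, h):
        return h in self.buckets[self._addr(h)]

    def insert(self, h):
        self.buckets[self._addr(h)].append(h)
        self.count += 1
        # U = used slots / total slots
        if self.count / (len(self.buckets) * self.capacity) > self.threshold:
            self._grow()

    def _grow(self):
        self.m += 1
        if self.m >= (1 << self.i):  # ran out of i-bit addresses
            self.i += 1
        self.buckets.append([])
        # The new bucket m takes keys from the bucket that differs only in the top bit.
        partner = self.m - (1 << (self.i - 1))
        old = self.buckets[partner]
        self.buckets[partner] = [k for k in old if self._addr(k) == partner]
        self.buckets[self.m] = [k for k in old if self._addr(k) == self.m]


lh = LinearHash()
for h in range(16):
    lh.insert(h)
```

Note that the bucket that gets split is chosen by m, not by which bucket overflowed; that is why the file grows linearly and why overflow chains can still appear.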

Summary: Linear hashing
(+) Can handle growing files
    - with less wasted space
    - with no full reorganizations
(+) No indirection like extensible hashing
(-) Can still have overflow chains

Example: BAD CASE
Some buckets are very full while others are very empty. Relieving the full buckets would need m to move all the way to them, which would waste space.

Summary: Hashing
- How it works
- Dynamic hashing
  - Extensible
  - Linear

Next:
- Indexing vs hashing
- Index definition in SQL
- Multiple key access

Indexing vs Hashing
- Hashing is good for probes given a key, e.g., SELECT … FROM R WHERE R.A = 5
- Indexing (including B-trees) is good for range searches, e.g., SELECT … FROM R WHERE R.A > 5

Index definition in SQL
- CREATE INDEX name ON rel (attr)
- CREATE UNIQUE INDEX name ON rel (attr)   (defines a candidate key)
- DROP INDEX name

Note: CANNOT SPECIFY TYPE OF INDEX (e.g., B-tree, hashing, …) OR PARAMETERS (e.g., load factor, size of hash, ...) ... at least in SQL ...

Note: ATTRIBUTE LIST ⇒ MULTIKEY INDEX (next), e.g., CREATE INDEX foo ON R(A,B,C)
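The index statements can be tried out with Python's built-in `sqlite3` module. This is a sketch: the table layout, the index names `r_a` and `r_b`, and the sample rows are made up for illustration; only `foo ON R(A,B,C)` comes from the note above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (A INTEGER, B INTEGER, C INTEGER)")

conn.execute("CREATE INDEX r_a ON R (A)")
conn.execute("CREATE UNIQUE INDEX r_b ON R (B)")   # B becomes a candidate key
conn.execute("CREATE INDEX foo ON R (A, B, C)")    # attribute list: multikey index

conn.executemany("INSERT INTO R VALUES (?, ?, ?)", [(5, 1, 10), (7, 2, 20)])

# A second row with B = 1 violates the unique index.
try:
    conn.execute("INSERT INTO R VALUES (9, 1, 30)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

conn.execute("DROP INDEX r_a")
remaining = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index' ORDER BY name")]
```

None of the statements says what kind of structure backs the index, which is the point of the note above: the choice is left to the system.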

Multi-key Index
Motivation: Find records where DEPT = “Toy” AND SAL > 50k
- Strategy I: Use one index, say Dept. Get all Dept = “Toy” records and check their salary.
- Strategy II: Use 2 indexes; manipulate pointers. Intersect the pointers for Dept = “Toy” with the pointers for Sal > 50k.
- Strategy III: Multiple key index. One idea: an index I1 on the first attribute whose entries point to indexes I2, I3, … on the second attribute.
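Strategies I and II can be compared on a toy table. This is a sketch: the employee records are hypothetical, salaries are in thousands, and each index is modeled as a dictionary from attribute value to a set of record ids (the "pointers").

```python
from collections import defaultdict

# Hypothetical records (name, dept, sal in thousands); not from the slides.
employees = [
    ("Joe", "Toy", 60),
    ("Ann", "Toy", 40),
    ("Bob", "Sales", 70),
    ("Eve", "Toy", 55),
]


def build_index(records, field):
    """Map each value of the given field to the set of record ids holding it."""
    index = defaultdict(set)
    for rid, rec in enumerate(records):
        index[rec[field]].add(rid)
    return index


dept_index = build_index(employees, 1)
sal_index = build_index(employees, 2)


def strategy_one(dept, min_sal):
    """Use the Dept index only, then fetch each record and check its salary."""
    return {rid for rid in dept_index.get(dept, set())
            if employees[rid][2] > min_sal}


def strategy_two(dept, min_sal):
    """Collect pointers from both indexes and intersect them before fetching."""
    sal_ptrs = set()
    for sal, rids in sal_index.items():
        if sal > min_sal:
            sal_ptrs |= rids
    return dept_index.get(dept, set()) & sal_ptrs


toy_over_50 = strategy_two("Toy", 50)
```

Both strategies return the same record ids; they differ in how many records have to be fetched before the answer is known.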

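Example
Example record: Name = Joe, DEPT = Sales, SAL = 15k
[Figure: a Dept index with entries Art, Sales, Toy; each entry points to a salary index, with salary values such as 10k, 15k, 17k, 21k and 12k, 15k, 19k, which in turn points to the records.]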

For which queries is this index good?
- Find RECs with Dept = “Sales” ∧ SAL = 20k
- Find RECs with Dept = “Sales” ∧ SAL > 20k
- Find RECs with Dept = “Sales”
- Find RECs with SAL = 20k
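The four queries can be tried against a small model of the nested index. This is a sketch: the class `MultiKeyIndex` is hypothetical, the inner salary index is a sorted list searched with `bisect`, and the sample entries are made up.

```python
import bisect


class MultiKeyIndex:
    """Dept index whose entries each lead to a sorted salary index."""

    def __init__(self):
        self.by_dept = {}  # dept -> sorted list of (sal, record id)

    def add(self, dept, sal, rid):
        bisect.insort(self.by_dept.setdefault(dept, []), (sal, rid))

    def dept_and_sal_eq(self, dept, sal):
        entries = self.by_dept.get(dept, [])
        lo = bisect.bisect_left(entries, (sal, -1))
        hi = bisect.bisect_right(entries, (sal, float("inf")))
        return [rid for _, rid in entries[lo:hi]]

    def dept_and_sal_gt(self, dept, sal):
        entries = self.by_dept.get(dept, [])
        start = bisect.bisect_right(entries, (sal, float("inf")))
        return [rid for _, rid in entries[start:]]

    def dept_only(self, dept):
        return [rid for _, rid in self.by_dept.get(dept, [])]

    def sal_only(self, sal):
        # No dept given: every department's salary index has to be probed.
        return [rid for dept in self.by_dept
                for rid in self.dept_and_sal_eq(dept, sal)]


idx = MultiKeyIndex()
for rid, (dept, sal) in enumerate(
        [("Sales", 15), ("Sales", 21), ("Toy", 15), ("Art", 12)]):
    idx.add(dept, sal, rid)
```

The first three queries touch a single Dept entry and then a contiguous run of its salary index. The last query (SAL alone) gets no help from the outer index, which is why `sal_only` loops over every department.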

Interesting application: Geographic Data
DATA: x, y, ...

Queries:
- What city is at a given point?
- What is within 5 miles of a given point?
- Which is the closest point to a given point?

Example
[Figure: points a through o in the x-y plane, partitioned step by step into regions, and the search tree built over them with the points at its leaves. Only the labels 5 and 15 survive from the figure.]
- Search points near f
- Search points near b

Queries
- Find points with Yi > 20
- Find points with Xi < 5
- Find points “close” to i = …
- Find points “close” to b = …

Many types of geographic index structures have been suggested:
- kd-trees (very similar to what we described here)
- Quad trees
- R-trees
- ...
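A two-dimensional kd-tree with a rectangular range search can be sketched as follows. This is the textbook variant that stores one point at every node and alternates between x and y at each level; it is not necessarily the exact structure drawn in the example, where points sit at the leaves. The sample points are made up.

```python
def build_kd(points, depth=0):
    """Split on x at even depths and on y at odd depths, using the median point."""
    if not points:
        return None
    axis = depth % 2
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {
        "point": pts[mid],
        "axis": axis,
        "left": build_kd(pts[:mid], depth + 1),       # values <= split value
        "right": build_kd(pts[mid + 1:], depth + 1),  # values >= split value
    }


def range_search(node, lo, hi, out=None):
    """Collect points p with lo[d] <= p[d] <= hi[d] in both dimensions."""
    if out is None:
        out = []
    if node is None:
        return out
    p, axis = node["point"], node["axis"]
    if all(lo[d] <= p[d] <= hi[d] for d in (0, 1)):
        out.append(p)
    if lo[axis] <= p[axis]:   # the range can reach into the left subtree
        range_search(node["left"], lo, hi, out)
    if p[axis] <= hi[axis]:   # the range can reach into the right subtree
        range_search(node["right"], lo, hi, out)
    return out


points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kd(points)
found = sorted(range_search(tree, lo=(3, 1), hi=(8, 5)))
```

A "points close to p" query can be approximated by a range search on a small box around p, followed by a distance check on the candidates.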

Two more types of multi-key indexes:
- Grid
- Partitioned hash

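Grid Index
[Figure: a two-dimensional grid with Key 1 values V1, V2, …, Vn along one side and Key 2 values X1, X2, …, Xn along the other. Each cell leads to the records with that combination, e.g., to records with key1 = V3, key2 = X2.]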

CLAIM: Can quickly find records with
- key1 = Vi ∧ key2 = Xj
- key1 = Vi
- key2 = Xj
And also ranges, e.g., key1 ≥ Vi ∧ key2 < Xj

How do we find entry i,j in a linear structure?
[Figure: grid entries laid out one after another at positions S+0, S+1, S+2, S+3, S+4, …, S+9]
pos(i, j) = S + iN + j, with max number of i values N = 4
- Issue: Cells must be the same size, and N must be constant!
- Issue: Some cells may overflow, some may be sparse...
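The position formula is easy to check in code. A sketch only: it assumes j ranges over 0..N-1, so that consecutive values of j are adjacent and each value of i owns a run of N positions, and it takes S = 0 as the default start.

```python
def pos(i, j, S=0, N=4):
    """Position of grid entry (i, j) in the linear layout: S + i*N + j."""
    if not 0 <= j < N:
        raise ValueError("j must be in range(N)")
    return S + i * N + j


first_row = [pos(0, j) for j in range(4)]
```

Entry (1, 0) lands at S+4, right after the four entries for i = 0, and entry (2, 1) lands at S+9. The formula only works because every cell has the same size and N never changes, which is exactly the first issue noted above.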

Solution: Use indirection
[Figure: a grid with rows V1 to V4 and columns X1 to X3 whose cells point to separate buckets.]
* The grid contains only pointers to buckets.

With indirection:
- The grid can be regular without wasting space.
- We do have the price of indirection.

Can also index the grid on value ranges
[Figure: a Salary × Dept grid addressed through linear scales.]
Linear scale for Dept: Toy → 1, Sales → 2, Personnel → 3
Linear scale for Salary: 0-20K → 1, 20K-50K → 2, 50K-∞ → 3
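The linear scales can be modeled with a dictionary for Dept and a sorted list of boundaries for Salary. This is a sketch: cells are numbered from 0 rather than 1, a salary exactly on a boundary is placed in the higher range (an arbitrary choice), and the sample records are made up.

```python
import bisect

DEPT_SCALE = {"Toy": 0, "Sales": 1, "Personnel": 2}  # linear scale for Dept
SAL_BOUNDS = [20_000, 50_000]                        # 0-20K | 20K-50K | 50K and up

# 3 x 3 grid; each cell is a bucket (list) of records.
grid = [[[] for _ in range(3)] for _ in range(3)]


def cell(dept, sal):
    """Translate attribute values into grid coordinates via the linear scales."""
    return DEPT_SCALE[dept], bisect.bisect_right(SAL_BOUNDS, sal)


def insert(record):
    _, dept, sal = record
    row, col = cell(dept, sal)
    grid[row][col].append(record)


def find(dept, min_sal):
    """Dept = dept AND Sal > min_sal: scan only the cells that can qualify."""
    row = DEPT_SCALE[dept]
    first_col = bisect.bisect_right(SAL_BOUNDS, min_sal)
    return [r for col in range(first_col, 3)
            for r in grid[row][col] if r[2] > min_sal]


for rec in [("Joe", "Toy", 60_000), ("Ann", "Toy", 15_000), ("Bob", "Sales", 70_000)]:
    insert(rec)
```

The final filter inside `find` is still needed, because the cell that contains the boundary value also holds records that do not satisfy the range.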

Grid files
(+) Good for multiple-key search
(-) Space, management overhead (nothing is free)
(-) Need partitioning ranges that evenly split keys

Partitioned hash function
Idea: hash Key1 with h1 and Key2 with h2; the two results together form the bucket address.

Example
h1(toy) = 0, h1(sales) = 1, h1(art) = …
h2(10k) = 01, h2(20k) = 11, h2(30k) = 01, h2(40k) = 00
Buckets: 000, 001, 010, 011, 100, 101, 110, 111
- Insert: …
- Find Emp. with Dept. = Sales ∧ Sal = 40k
- Find Emp. with Sal = 30k: look here [buckets marked in the figure]
- Find Emp. with Dept. = Sales: look here [buckets marked in the figure]
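A partitioned hash function can be sketched with two lookup tables standing in for h1 and h2. The assumptions are significant: the hash values follow one reading of the garbled example (one bit from Dept, two bits from Sal, with the trailing digits on each token treated as bucket labels), the Dept bit is placed first in the bucket address, and the employee names are made up.

```python
H1 = {"toy": 0b0, "sales": 0b1}                             # 1 bit from Dept
H2 = {"10k": 0b01, "20k": 0b11, "30k": 0b01, "40k": 0b00}   # 2 bits from Sal
H2_BITS = 2


def bucket(dept, sal):
    """Bucket address = h1(dept) followed by h2(sal)."""
    return (H1[dept] << H2_BITS) | H2[sal]


def buckets_for_dept(dept):
    """Only Dept known: the h2 bits are free, so 4 buckets must be searched."""
    return [(H1[dept] << H2_BITS) | low for low in range(1 << H2_BITS)]


def buckets_for_sal(sal):
    """Only Sal known: the h1 bit is free, so 2 buckets must be searched."""
    return [(high << H2_BITS) | H2[sal] for high in (0, 1)]


table = [[] for _ in range(8)]


def insert(name, dept, sal):
    table[bucket(dept, sal)].append((name, dept, sal))


def find(dept=None, sal=None):
    if dept is not None and sal is not None:
        candidates = [bucket(dept, sal)]
    elif dept is not None:
        candidates = buckets_for_dept(dept)
    elif sal is not None:
        candidates = buckets_for_sal(sal)
    else:
        candidates = range(8)
    return [r for b in candidates for r in table[b]
            if (dept is None or r[1] == dept) and (sal is None or r[2] == sal)]


insert("Fred", "toy", "10k")
insert("Joe", "sales", "10k")
insert("Sue", "sales", "40k")
```

Under these assumptions, a query on both attributes touches one bucket, a query on Sal = 30k touches buckets 001 and 101, and a query on Dept = Sales touches buckets 100 through 111. Since 10k and 30k share the same h2 value, the final equality check inside `find` is what filters out the 10k records.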