Are they better or worse than a B+Tree?

Presentation transcript:

Hash indexes: Are they better or worse than a B+Tree? Copyright 2003 Curt Hill

Tasks: Consider the basics of hashing. Consider how this applies to indexing schemes. Consider variations. Consider the hash join.

Basics of hashing: Internal hashing uses a table in memory, where a bucket usually holds one entry; it is covered in a separate presentation, Hashing.ppt. External hashing lives on disk, where a bucket is a page and holds multiple entries.

External Hashing: The hash function takes the key and computes an integer. How is this integer used? Direct file: the key is an integer. Directory of heap file: works well if one directory page can hold the correct number of page ids. Lookup table: converts the integer to a page id number.

How does it work? The hash function takes the key and computes a page number. Search that page for the correct data page. Access the data. Even for very large indices the number of accesses can still be quite small.

Diagram: a column of hash buckets, numbered up to N-1, each pointing to pages of data.

Example: Assume 1 million records with 4 records to a page, which gives 250,000 pages of data. Assume a key and pointer take 24 bytes and a page is 512 bytes, so 21 keys and pointers fit in a page. Assume buckets are ¾ full: about 67,000 buckets with 15 keys each. The hash function computes a value in the range 0 to 66,999. Without collisions it takes two accesses to get the data (one for the bucket page, one for the data page).
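The arithmetic in this example can be checked with a short sketch. The function name and parameter names are illustrative, not from the slides; note that the exact bucket count comes out as 66,667, which the slide rounds up to 67,000.

```python
import math


def size_hash_index(records=1_000_000, records_per_data_page=4,
                    page_bytes=512, entry_bytes=24, fill_factor=0.75):
    """Reproduce the sizing arithmetic of the example (all names are illustrative)."""
    data_pages = records // records_per_data_page
    # Whole (key, pointer) entries that fit in one bucket page.
    entries_per_page = page_bytes // entry_bytes
    # Entries actually stored per bucket at the assumed fill factor.
    entries_per_bucket = int(entries_per_page * fill_factor)
    # Buckets needed to hold one entry per record.
    buckets = math.ceil(records / entries_per_bucket)
    return {
        "data_pages": data_pages,
        "entries_per_page": entries_per_page,
        "entries_per_bucket": entries_per_bucket,
        "buckets": buckets,
    }


sizes = size_hash_index()
print(sizes)
```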

Static and Dynamic: Static hashing works well until the buckets fill up. Then a bucket requires an overflow bucket. Searching the original and overflow pages increases the number of accesses, and performance drops. Dynamic hashing involves techniques where the sizes may grow gracefully.
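A minimal sketch of why overflow hurts static hashing, assuming integer keys, a simple modulo hash, and in-memory lists standing in for disk pages. The class and method names are hypothetical; the point is only that a lookup must read every page in a bucket's chain until it finds the key.

```python
class StaticHashFile:
    """Toy static hash file: a fixed number of buckets, each a chain of pages."""

    def __init__(self, num_buckets, page_capacity):
        self.num_buckets = num_buckets
        self.page_capacity = page_capacity
        # chains[b][0] is the primary page; later pages are overflow pages.
        self.chains = [[[]] for _ in range(num_buckets)]

    def insert(self, key):
        chain = self.chains[key % self.num_buckets]
        if len(chain[-1]) >= self.page_capacity:
            chain.append([])          # bucket is full: add an overflow page
        chain[-1].append(key)

    def lookup(self, key):
        """Return (found, pages_read) for an equality search."""
        chain = self.chains[key % self.num_buckets]
        for pages_read, page in enumerate(chain, start=1):
            if key in page:
                return True, pages_read
        return False, len(chain)


f = StaticHashFile(num_buckets=2, page_capacity=2)
for k in (0, 2, 4):                   # all three keys hash to bucket 0
    f.insert(k)
print(f.lookup(0), f.lookup(4), f.lookup(6))
```

Here key 4 lands on an overflow page, so finding it costs two page reads instead of one, and a failed search in that bucket reads the whole chain.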

Extendible/Extensible Hashing: A mechanism for altering the size of the hash table without the usual pain. The common strategy for internal hashes is to double the hash table and rehash each entry; this is too expensive for an index. Instead we do incremental doubling of the buckets and index, which spreads the cost nicely.

Scheme: We generate a hash in a range much larger than we need, typically modulo some large prime number. Use only the bottom so many bits of that result to select the bucket. Start the process with just one bit. We also have the notion of global and local depth.

First example: Numbers shown are hash function output. The index has global level 1.
0 (bit pattern ends in 0) -> local level 1: 2 4 8 14
1 (bit pattern ends in 1) -> local level 1: 5 7 9 11

Splitting a bucket: The next insertion will overfill a bucket. The exact action depends on the local and global levels. If the local level = global level: add one to the global level (number of bits), double the index, add one more bit, double the bucket, and distribute the values between the two. If the local level < global level, only double the bucket.

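Bucket and Index Split: Add 3, which splits bucket 1 into 01 and 11.
Before (global level 1):
0 -> local level 1: 2 4 8 14
1 -> local level 1: 5 7 9 11
After (global level 2):
00 -> local level 1: 2 4 8 14
01 -> local level 2: 5 9
10 -> local level 1: 2 4 8 14 (same bucket as 00)
11 -> local level 2: 3 7 11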

Continued Insertions: Notice that there were two pointers to the unsplit bucket. An insertion into a bucket that has a lower level than the global level splits only the bucket, not the index. It separates the two pointers.

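Bucket Only Split: Add 10, which splits the bucket shared by 00 and 10.
Before (global level 2):
00 -> local level 1: 2 4 8 14
01 -> local level 2: 5 9
10 -> local level 1: 2 4 8 14 (same bucket as 00)
11 -> local level 2: 3 7 11
After (global level 2):
00 -> local level 2: 4 8
01 -> local level 2: 5 9
10 -> local level 2: 2 10 14
11 -> local level 2: 3 7 11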
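The two kinds of split can be traced with a small sketch. It assumes a bucket capacity of 4 (inferred from the example), treats each inserted number as the hash function output, and keeps everything in memory; the class and method names are hypothetical. It is an illustration of the idea rather than a production index, and it would recurse without end if more than `capacity` identical hash values were inserted.

```python
class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.keys = []


class ExtendibleHash:
    """Toy extendible hash keyed directly on hash values (low-order bits)."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]   # index entries 0 and 1

    def insert(self, h):
        bucket = self.directory[h & ((1 << self.global_depth) - 1)]
        if len(bucket.keys) < self.capacity:
            bucket.keys.append(h)
            return
        if bucket.local_depth == self.global_depth:
            # Index split: double the directory; the new half repeats the old
            # pointers, so every bucket is now referenced by two entries.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        # Bucket split: use one more bit to divide the keys between two buckets.
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth)
        new_bit = 1 << (bucket.local_depth - 1)
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys:
            (sibling if k & new_bit else bucket).keys.append(k)
        # Separate the pointers: entries with the new bit set go to the sibling.
        for slot, b in enumerate(self.directory):
            if b is bucket and slot & new_bit:
                self.directory[slot] = sibling
        self.insert(h)   # retry now that there is room (may split again)

    def snapshot(self):
        """Map each index entry (as a bit string) to its bucket's sorted keys."""
        return {format(slot, "b").zfill(self.global_depth): sorted(b.keys)
                for slot, b in enumerate(self.directory)}


table = ExtendibleHash(capacity=4)
for h in (2, 4, 8, 14, 5, 7, 9, 11):
    table.insert(h)
table.insert(3)    # local level = global level: index and bucket both split
table.insert(10)   # local level < global level: only the bucket splits
print(table.global_depth, table.snapshot())
```

Doubling the directory by concatenating it with itself works here because buckets are selected by the low-order bits: entries `s` and `s + 2^d` share the same bottom `d` bits.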

Extensible Hashing: When the index exceeds one page, the upper so many bits may be checked so that the entire index is not searched. The mechanism is different from a tree, but the net effect is not that much different. The index may grow smoothly without changes to the hash function or drastic rewriting.

Not without problems: When the index is doubled, extra work is added to that insertion. When the index will not fit in memory, substantial I/O occurs. When the number of records per block is small, we can end up with much larger global levels than needed. Suppose there are 2 records per block and 3 records whose keys agree in the last 20 bits: the global level reaches at least 20, even when most local levels are in the 1-5 range.

Linear Hashing: A different scheme whose mechanism differs from extensible hashing, though the two share some properties. Splits are added incrementally, with some flexibility about when they occur. Like extensible hashing, we use the bottom so many bits of a larger hash function. Buckets are split round robin. Overflow buckets are used but may later be consumed.

Linear Hashing Numbers: N is the number of buckets, not always a power of two. I is the number of bits of the hash function in use. R is the number of records in the structure. M is the hash result, with $0 \le M < 2^I$. M may be larger or smaller than N; if $M \ge N$ we use $M - 2^{I-1}$.
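The bucket-selection rule can be written as a few lines of code. This is a sketch with a hypothetical function name; it assumes buckets are numbered 0 to N-1 and that N lies between 2^(I-1) + 1 and 2^I, so subtracting 2^(I-1) always lands on an existing bucket.

```python
def bucket_for(hash_value, i, n):
    """Pick the bucket for hash_value when i bits are in use and n buckets exist."""
    m = hash_value & ((1 << i) - 1)   # M: the bottom i bits of the hash
    if m >= n:                        # that bucket does not exist yet
        m -= 1 << (i - 1)             # clear the top used bit: M - 2^(i-1)
    return m


print(bucket_for(0b0111, i=2, n=3))   # bottom bits 11, no bucket 11 yet
print(bucket_for(0b1010, i=2, n=3))   # bottom bits 10, bucket exists
```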

Adding a bucket: Any strategy can be used to determine when a bucket is added. Adding a bucket increases N. For example, add one when the ratio of records to buckets crosses a threshold, or when a bucket is forced to add an overflow bucket.

Linear Example: I = 1, R = 3, N = 2, R/N = 1.5. Split when R/N > 1.7.
0 -> 0000 1010
1 -> 1111

Adding an Item: When an item is added it is put in the proper bucket. If it does not fit, add an overflow bucket. If the R/N threshold is crossed, add a new bucket. This causes the corresponding bucket to be redistributed over the two buckets. The number of hash bits used may be increased.

Insert 0101: Adding 0101 to bucket 1 increases the R/N ratio past the threshold, so a bucket is added.
Before (I = 1, R = 3, N = 2):
0 -> 0000 1010
1 -> 1111 (add 0101 to this)
After (I = 2, R = 4, N = 3):
00 -> 0000
01 -> 0101 1111
10 -> 1010

Explanation: The bucket that was added to was not split; it was not its turn. Buckets are split in round robin fashion.

Insert 0111: Add 0111. No split; an overflow is used.
Before (I = 2, R = 4, N = 3, R/N = 1.3):
00 -> 0000
01 -> 0101 1111
10 -> 1010
After (I = 2, R = 5, N = 3, R/N = 1.67):
00 -> 0000
01 -> 0101 1111, overflow: 0111
10 -> 1010

Insertion and Searching: The hash function is now using two bits, which gives it four possibilities, but there are only three buckets. If the hash result $M < N$, just use that bucket. If the hash result $M \ge N$, subtract $2^{I-1}$ from M. In the last case 0111 was inserted: its last two bits are 11, but there is not yet an 11 bucket, so 10 was subtracted from it. Searching uses the same type of scheme.

Insert 1100: After the insert, I = 2, R = 6, N = 4, R/N = 1.5.
Before:
00 -> 0000
01 -> 0101 1111, overflow: 0111
10 -> 1010
After:
00 -> 0000 1100
01 -> 0101
10 -> 1010
11 -> 0111 1111
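The whole linear hashing walk-through can be replayed with a short sketch. It assumes the split rule used in the example (add a bucket when R/N exceeds 1.7) and treats each inserted bit string as the hash function output. Overflow pages are not modelled separately: each bucket is one Python list standing for the primary page plus any overflow chain. The class and method names are hypothetical.

```python
class LinearHash:
    """Toy linear hash table keyed directly on hash values."""

    def __init__(self, threshold=1.7):
        self.threshold = threshold
        self.i = 1                 # I: number of hash bits in use
        self.r = 0                 # R: number of records stored
        self.buckets = [[], []]    # N is len(self.buckets)

    def _address(self, h):
        m = h & ((1 << self.i) - 1)
        if m >= len(self.buckets):
            m -= 1 << (self.i - 1)
        return m

    def insert(self, h):
        self.buckets[self._address(h)].append(h)
        self.r += 1
        if self.r / len(self.buckets) > self.threshold:
            self._add_bucket()

    def _add_bucket(self):
        n = len(self.buckets)
        if n == 1 << self.i:       # all i-bit addresses are taken: use one more bit
            self.i += 1
        # The new bucket n takes records from the bucket that differs from it
        # only in the top used bit; this is the round robin split order.
        source = n - (1 << (self.i - 1))
        self.buckets.append([])
        old, self.buckets[source] = self.buckets[source], []
        for h in old:
            self.buckets[self._address(h)].append(h)


lh = LinearHash(threshold=1.7)
for h in (0b0000, 0b1010, 0b1111):
    lh.insert(h)
lh.insert(0b0101)   # R/N = 2.0 > 1.7: bucket 10 is added, bucket 0 is split
lh.insert(0b0111)   # bottom bits 11 map to bucket 01; R/N = 1.67, no split
lh.insert(0b1100)   # R/N = 2.0 > 1.7: bucket 11 is added, bucket 01 is split
print(lh.i, lh.r, [[format(h, "04b") for h in b] for b in lh.buckets])
```

Note that the bucket receiving the insert is not necessarily the one that gets split: in the first split above, 0101 went into bucket 1 but bucket 0 was redistributed.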

Dynamic Hashing Summary: Linear hashing lacks the index of extensible hashing. There are similarities: a hash function where only the bottom so many bits are used, gradual splits, and quick lookups.

Joins: If both files are sorted on the join field, the previously mentioned zipper join is used. However, if the join field is not the primary key, sorting the relation on this field may be expensive. This is especially so if the outer join is larger than an inner join, or if the number of joined records is small compared to either relation size.

Hash Join: Recall that a Cartesian product makes all possible combinations of records from two relations. This could mean a number of reads on the order of the product of the two block counts, which is exactly what we want to avoid. A hash join partitions the two relations into pieces based on a hash function, then joins only the partitions that reacted similarly to the hash function. Of course, this only works on equi-joins.

Hash Join Process: Hash the smaller of the two files on the join field. Read in the other file and hash each key into a bucket; the only candidates for equality are there. Produce the output: smaller but still substantial.
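The process can be sketched in memory as follows. This assumes the smaller relation's hash table fits in memory, represents rows as dictionaries, and uses a Python `dict` as the bucket structure; the function name, the column names, and the sample rows are all made up for illustration.

```python
from collections import defaultdict


def hash_join(smaller, larger, key):
    """Equi-join two lists of dict rows on the column named `key`."""
    # Build phase: hash the smaller relation on the join field.
    buckets = defaultdict(list)
    for row in smaller:
        buckets[row[key]].append(row)
    # Probe phase: read the other relation once; the only candidates for
    # equality are the rows in the bucket that its key hashes to.
    for row in larger:
        for match in buckets.get(row[key], []):
            yield {**match, **row}


departments = [{"dept_id": 1, "dept": "Math"}, {"dept_id": 2, "dept": "CS"}]
employees = [
    {"name": "Ann", "dept_id": 2},
    {"name": "Bob", "dept_id": 3},
    {"name": "Cy", "dept_id": 2},
]
joined = list(hash_join(departments, employees, "dept_id"))
print(joined)
```

Each relation is read once, rather than pairing every record of one with every record of the other as a Cartesian product would.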