FALL 2005 CENG 351 Data Management and File Structures 1 Hashing.



Hashing: Introduction
- Idea: reference items in a table directly by doing arithmetic operations that transform keys into table addresses.
- Steps:
  1. Compute a hash function that transforms the search key into a table address.
  2. Apply collision resolution to deal with keys that hash to the same table address.
- Hashing is a good example of a time-space tradeoff.
- Hashing is a classical computer science problem.

Motivation
- The primary goal is to locate the desired record in a single disk access.
  - Sequential search: $O(N)$
  - B+ trees: $O(\log_k N)$
  - Hashing: $O(1)$
- In hashing, the key of a record is transformed into an address and the record is stored at that address.
- Hash-based indexes are best for equality selections. They cannot support range searches.
- Static and dynamic hashing techniques exist.

Hash-based Index
- Data entries are kept in buckets (an abstract term).
- Each bucket is a collection of one primary page and zero or more overflow pages.
- Given a search key value k, we can find the bucket where the data entry k* is stored as follows:
  - Use a hash function, denoted by h.
  - The value of h(k) is the address of the desired bucket. h(k) should distribute the search key values uniformly over the collection of buckets.

Hashing: Introduction
[Figure: example hash function, K * 100 rounded to the nearest integer.]

Design Factors
- Bucket size: the number of records that can be held at the same address.
- Loading factor: the ratio of the number of records put in the file to the total capacity of the buckets.
- Hash function: should distribute the keys evenly among the addresses.
- Overflow resolution technique.

Hash Functions
- Key mod N: N is the size of the table; it is better if N is prime.
- Folding: e.g. 123|456|789: add the parts and take the result mod N.
- Truncation: e.g. map to a table of 1000 addresses by picking 3 digits of the key.
- Squaring: square the key and then truncate.
- Radix conversion: e.g. treat the key as a base-11 number and truncate if necessary.
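The five families above can be sketched in a few lines of Python. This is only an illustration: the function names are made up here, and the choices of which digits to keep (the last three for truncation, the middle three for squaring) are assumptions, since the slide does not say which digits are picked.

```python
def mod_hash(key: int, n: int) -> int:
    """Key mod N, where N is the table size (preferably prime)."""
    return key % n


def folding_hash(key: int, n: int, chunk_digits: int = 3) -> int:
    """Split the decimal digits into chunks, add the chunks, take mod N."""
    digits = str(key)
    chunks = [int(digits[i:i + chunk_digits])
              for i in range(0, len(digits), chunk_digits)]
    return sum(chunks) % n


def truncation_hash(key: int, n_digits: int = 3) -> int:
    """Keep the last n_digits decimal digits (assumed choice of digits)."""
    return key % 10 ** n_digits


def squaring_hash(key: int, n_digits: int = 3) -> int:
    """Square the key, then keep n_digits from the middle (assumed choice)."""
    squared = str(key * key)
    start = max((len(squared) - n_digits) // 2, 0)
    return int(squared[start:start + n_digits])


def radix_hash(key: int, n: int, radix: int = 11) -> int:
    """Read the key's decimal digits as a base-11 number, then reduce mod N."""
    value = 0
    for digit in str(key):
        value = value * radix + int(digit)
    return value % n
```

For the folding example on the slide, `folding_hash(123456789, 1000)` adds 123 + 456 + 789 = 1368 and reduces it to 368.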

Static Hashing
- Primary area: the number of primary pages is fixed (say M buckets); they are allocated sequentially and never de-allocated.
- A simple hash function: h(k) = f(k) mod M
- Overflow area: disjoint from the primary area. It keeps buckets that hold records whose key maps to a full bucket.
- Adding the address of an overflow bucket to a primary area bucket is called chaining.
- A collision does not cause a problem as long as there is still room in the mapped bucket. Overflow occurs during insertion when a record is hashed to a bucket that is already full.

Example
- Assume f(k) = k. Let M = 5. So, h(k) = k mod 5.
- Bucket factor = 3 records.
[Figure: primary area and overflow area.]
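A minimal in-memory sketch of this configuration (M = 5, bucket factor 3, h(k) = k mod 5) may help make chaining concrete. The class and method names are hypothetical, pages are modelled as Python lists rather than disk blocks, and each page visited during a search is counted as one read.

```python
class StaticHashFile:
    """Static hashing with chained overflow pages (illustrative only)."""

    def __init__(self, num_buckets: int = 5, bucket_factor: int = 3):
        self.M = num_buckets      # number of primary buckets, fixed
        self.m = bucket_factor    # records per page
        # Each bucket is a chain of pages; page 0 is the primary page.
        self.buckets = [[[]] for _ in range(self.M)]

    def h(self, key: int) -> int:
        return key % self.M

    def insert(self, key: int) -> None:
        chain = self.buckets[self.h(key)]
        if len(chain[-1]) == self.m:   # last page is full: overflow
            chain.append([])           # chain a new overflow page
        chain[-1].append(key)          # new records go at the end of the chain

    def search(self, key: int):
        """Return (found, pages_read)."""
        chain = self.buckets[self.h(key)]
        for pages_read, page in enumerate(chain, start=1):
            if key in page:
                return True, pages_read
        return False, len(chain)


f = StaticHashFile()
for k in (0, 5, 10, 15, 7):
    f.insert(k)
```

Keys 0, 5 and 10 fill the primary page of bucket 0, so 15 lands in an overflow page and costs a second page read to find.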

Load Factor (Packing density)
- To limit the amount of overflow, we allocate more space to the primary area than we need (i.e. the primary area will be, say, 70% full).
- Load factor: $f = \frac{\text{number of records in the file}}{\text{number of spaces in the primary area}} = \frac{n}{M \cdot m}$
- Here m is the blocking factor (Bfr) and M is the number of blocks.
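As a quick numeric check of the definition, the ratio can be computed directly. The function name is an assumption, and the sample numbers are chosen only for illustration.

```python
def load_factor(n: int, M: int, m: int) -> float:
    """f = n / (M * m): records stored over record slots in the primary area."""
    return n / (M * m)


# 70 records in 20 primary blocks of 5 records each: the primary area is 70% full.
example = load_factor(70, 20, 5)
```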

Effects of f and m
- Performance can be enhanced by the choice of bucket size and load factor.
- In general, a smaller load factor means less overflow and a faster fetch time, but more wasted space.
- A larger m means less overflow in general, but a slower fetch.

Insertion and Deletion
- Insertion: new records are inserted at the end of the chain.
- Deletion: two ways are possible:
  1. Mark the record as deleted.
  2. Consolidate sparse buckets when deleting records.
- In the second approach:
  - When a record is deleted, fill its place with the last record in the chain of the current bucket.
  - Deallocate the last bucket when it becomes empty.

Problem of Static Hashing
- The main problem with static hashing is that the number of buckets is fixed:
  - Long overflow chains can develop and degrade performance.
  - On the other hand, if a file shrinks greatly, a lot of bucket space will be wasted.
- Other hashing techniques allow the hash index to grow and shrink dynamically. These include:
  - linear hashing
  - extendible hashing

Linear Hashing
- It maintains a constant load factor and thus avoids reorganization.
- It does so by incrementally adding new buckets to the primary area.
- In linear hashing, the last bits of the hash number are used for placing the records.

Example
[Figure: records placed by the last 3 bits of the hash number; insert 13, 21, 37; f = 15/24 = 63%.]

Insertion of records
- To expand the table, split an existing bucket identified by k digits into two buckets that use the last k+1 digits.

Expanding the table
[Figure: table expansion, with the boundary value marked.]

[Figure: boundary value, 37, k = 3.]
- Hash # 1000: uses the last 4 digits.
- Hash # 1101: uses the last 3 digits.

Fetching a record
- Apply the hash function.
- Look at the last k digits.
  - If the value is less than the boundary value, the location is in the bucket area labeled with the last k+1 digits.
  - Otherwise, it is in the bucket area labeled with the last k digits.
- Follow overflow chains as with static hashing.
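The address rule above is compact enough to write as a single function. This is a sketch under the assumption that "digits" means binary digits and that hash numbers are plain integers; the function name is hypothetical.

```python
def linear_hash_address(hash_value: int, k: int, boundary: int) -> int:
    """Return the primary bucket address for a hash number.

    Buckets whose k-bit address is below the boundary have already been
    split, so they are addressed with the last k+1 bits instead.
    """
    last_k = hash_value & ((1 << k) - 1)
    if last_k < boundary:
        return hash_value & ((1 << (k + 1)) - 1)
    return last_k


# With k = 3 and a boundary of 1, hash number 1000 (binary) ends in 000,
# which is below the boundary, so its last 4 bits are used.
# Hash number 1101 ends in 101, so only its last 3 bits are used.
a = linear_hash_address(0b1000, k=3, boundary=1)
b = linear_hash_address(0b1101, k=3, boundary=1)
```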

Insertion
- Search for the correct bucket into which to place the new record.
- If the bucket is full, allocate a new overflow bucket.
- If there are now f*m records more than needed for the given f:
  - Add one more bucket to the primary area.
  - Distribute the records from the bucket chain at the boundary value between the original bucket and the new primary area bucket.
  - Add 1 to the boundary value.
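The insertion steps can be sketched as a small in-memory class. Several simplifications are assumed here and are not from the slides: keys are used directly as hash numbers, each bucket is one Python list standing in for the primary page plus its overflow chain, and k is increased once every bucket of the current round has been split.

```python
class LinearHashFile:
    """Linear hashing insertion with boundary-driven splits (illustrative)."""

    def __init__(self, k: int = 1, bucket_factor: int = 2, load_factor: float = 0.5):
        self.k = k                  # number of address bits for unsplit buckets
        self.boundary = 0           # next bucket to split
        self.m = bucket_factor
        self.f = load_factor
        self.n = 0                  # records in the file
        self.buckets = [[] for _ in range(1 << k)]

    def address(self, key: int) -> int:
        last_k = key & ((1 << self.k) - 1)
        if last_k < self.boundary:                      # already split
            return key & ((1 << (self.k + 1)) - 1)
        return last_k

    def insert(self, key: int) -> None:
        self.buckets[self.address(key)].append(key)
        self.n += 1
        # Expand while there are f*m more records than the target load allows.
        while self.n - self.f * self.m * len(self.buckets) >= self.f * self.m:
            self._split()

    def _split(self) -> None:
        old = self.boundary
        records, self.buckets[old] = self.buckets[old], []
        self.buckets.append([])     # new primary bucket at address old + 2**k
        self.boundary += 1          # add 1 to the boundary value
        for key in records:         # redistribute using the last k+1 bits
            self.buckets[self.address(key)].append(key)
        if self.boundary == 1 << self.k:   # round complete
            self.k += 1
            self.boundary = 0


lh = LinearHashFile()
for key in range(5):
    lh.insert(key)
```

With f = 0.5 and m = 2, every record past the target adds one primary bucket, so the load factor stays at 0.5 as the file grows.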

Deletion
- Read in a chain of records.
- Replace the deleted record with the last record in the chain.
- If the last overflow bucket becomes empty, deallocate it.
- When the number of records is f * m less than the number needed for f, contract the primary area by one bucket.
Compressing the table is the exact opposite of expanding it:
- Keep track of the total number of records in the file and the number of buckets in the primary area.
- When we have f * m fewer records than needed, consolidate the last bucket with the bucket that shares the same last k digits.

Extendible Hashing
- Extendible hashing does not have chains of buckets, contrary to linear hashing.
- The hash value indexes an index table whose entries hold pointers to the data buckets.
- The number of entries in the index table is $2^i$, where i is the number of bits used for indexing.
- Records with the same first i bits are placed in the same bucket.
- If the maximum content of any bucket is reached, a new bucket is added if the corresponding index entry is free.
- If the maximum content of any bucket is reached and a new bucket cannot be added according to the first i bits, the table size is doubled (see Fig. 6.9 and 6.10).

Extendible Hashing: Overflow
- When i = j and overflow occurs, the index table is doubled, where j is the level of the index and i is the level of the data buckets.
- Successive doubling of the index table may happen in the case of a poor hashing function.
- The main disadvantage of extendible hashing is that the index table may grow too big and too sparse.
- The best case: there are as many buckets as index table entries.
- If the index table fits into memory, access to a record requires only one disk access.

Extendible Hashing
- The hash function returns b bits.
- Only the first i bits (the prefix) are used to hash the item.
- There are $2^i$ entries in the bucket address table.
- Let $i_j$ be the length of the common hash prefix for data bucket j. Then $2^{(i - i_j)}$ entries in the bucket address table point to j.
[Figure: bucket address table with hash prefix length i, pointing to data buckets bucket1, bucket2 and bucket3 with common hash prefix lengths i1, i2 and i3.]

Splitting a bucket: Case 1
- Splitting (Case 1: $i_j = i$)
  - Only one entry in the bucket address table points to data bucket j.
  - i++; split data bucket j into j and z; set $i_j = i_z = i$; rehash all items previously in j.

Splitting: Case 2
- Splitting (Case 2: $i_j < i$)
  - More than one entry in the bucket address table points to data bucket j.
  - Split data bucket j into j and z; set $i_j = i_z = i_j + 1$; adjust the pointers that previously pointed to j so that they point to j and z; rehash all items previously in j.

Extendible Hashing: Example
- Suppose the hash function is h(x) = x mod 8 and each bucket can hold at most two records. Show the extendible hash structure after inserting 1, 4, 5, 7, 8, 2, 20.

Example
[Figure: the extendible hash structure after inserting 1, 4, 5, 7, 8, 2, 20.]
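The figure for this example did not survive in the transcript, so the sketch below re-runs the insertions with the two splitting cases described earlier. It assumes the table is indexed by the first i bits of a 3-bit hash value and starts with i = 0 and a single bucket. The class names are hypothetical, and raising an error when no hash bits are left is a simplification added here. The resulting layout may differ from the original figure if the slide started from a different initial table.

```python
class Bucket:
    def __init__(self, depth: int):
        self.depth = depth   # i_j: length of the common hash prefix
        self.items = []


class ExtendibleHash:
    def __init__(self, hash_bits: int = 3, capacity: int = 2):
        self.b = hash_bits
        self.capacity = capacity
        self.i = 0                    # prefix bits used by the table
        self.table = [Bucket(0)]      # bucket address table, 2**i entries

    def _index(self, key: int) -> int:
        h = key % (1 << self.b)       # h(x) = x mod 8 when b = 3
        return h >> (self.b - self.i) # first i bits of the hash value

    def insert(self, key: int) -> None:
        bucket = self.table[self._index(key)]
        if len(bucket.items) < self.capacity:
            bucket.items.append(key)
            return
        if bucket.depth == self.b:
            raise OverflowError("bucket full and no hash bits left")
        if bucket.depth == self.i:    # Case 1: double the table first
            self.table = [b for b in self.table for _ in range(2)]
            self.i += 1
        # Case 2 (and the rest of Case 1): split bucket j into j and z.
        bucket.depth += 1
        new = Bucket(bucket.depth)
        old_items, bucket.items = bucket.items, []
        for idx, b in enumerate(self.table):
            # Entries whose next prefix bit is 1 now point to the new bucket.
            if b is bucket and (idx >> (self.i - bucket.depth)) & 1:
                self.table[idx] = new
        for k in old_items:           # rehash all items previously in j
            self.table[self._index(k)].items.append(k)
        self.insert(key)              # retry; may trigger another split


eh = ExtendibleHash()
for key in (1, 4, 5, 7, 8, 2, 20):
    eh.insert(key)
layout = [b.items for b in eh.table]
```

Under these assumptions the table ends with i = 3 and eight entries: 000 and 001 share the bucket holding 1 and 8, 010 and 011 share the bucket holding 2, 100 points to the bucket holding 4 and 20, 101 to the bucket holding 5, and 110 and 111 share the bucket holding 7.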

Comments on Extendible Hashing
- If the directory fits in memory, an equality search takes one disk access.
- A typical example: a 100 MB file with 100-byte records and a page (bucket) size of 4K contains 1,000,000 records (as data entries) but only 25,000 [= 1,000,000 / (4,000 / 100)] directory elements, so chances are high that the directory will fit in memory.
- If the distribution of hash values is skewed (e.g., a large number of search key values all hash to the same bucket), the directory can grow very large.
- This kind of skew must be avoided with a well-tuned hash function.
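The arithmetic in the typical example can be reproduced directly. The function name is an assumption, and the page size is taken as 4,000 bytes because that is the rounding the slide uses.

```python
def directory_elements(file_bytes: int, record_bytes: int, page_bytes: int) -> int:
    """Estimate directory size as one element per full page of records."""
    records = file_bytes // record_bytes           # 1,000,000 records
    records_per_page = page_bytes // record_bytes  # 40 records per page
    return records // records_per_page


elements = directory_elements(100 * 10**6, 100, 4_000)
```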

Comments on Extendible Hashing
- Delete: if removing a data entry makes a bucket empty, the bucket can be merged with its "buddy" bucket.
- If each directory element points to the same bucket as its split image, the directory can be halved.

Summary, so far
- Hash-based indexes: best for equality searches, cannot support range searches.
- Static hashing can lead to long overflow chains.
- Extendible hashing avoids overflow pages by splitting a full bucket when a new data entry is to be added to it.
  - A directory keeps track of the buckets and doubles periodically.
  - The directory can get large with skewed data, and it costs additional I/O if it does not fit in main memory.

Time considerations for Hashing: Reading Assignment

Simple Hashing: Search
- No-overflow case, i.e. a direct hit: $T_F = s + r + dtt$
- Overflow, successful search, for an average chain length of x: $T_{Fs} = s + r + dtt + (x/2)(s + r + dtt)$
- Overflow, unsuccessful search, for an average chain length of x: $T_{Fu} = s + r + dtt + x(s + r + dtt)$
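The three fetch-time formulas differ only in how many extra block accesses the chain adds, which a short sketch makes visible. The slide does not define its symbols, so reading s, r and dtt as seek time, rotational latency and data transfer time is an assumption, as are the function names and the sample numbers.

```python
def block_access(s: float, r: float, dtt: float) -> float:
    """Cost of one random block access: s + r + dtt."""
    return s + r + dtt


def t_f(s: float, r: float, dtt: float) -> float:
    """Direct hit, no overflow."""
    return block_access(s, r, dtt)


def t_fs(s: float, r: float, dtt: float, x: float) -> float:
    """Successful search: on average half of the chain (x/2) is read."""
    return block_access(s, r, dtt) * (1 + x / 2)


def t_fu(s: float, r: float, dtt: float, x: float) -> float:
    """Unsuccessful search: the whole chain (x) is read."""
    return block_access(s, r, dtt) * (1 + x)


# Hypothetical timings in milliseconds and an average chain length of 2.
times = (t_f(10, 5, 1), t_fs(10, 5, 1, 2), t_fu(10, 5, 1, 2))
```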

Deletion
- Deletion from a hashed file: delete without consolidation, i.e. mark the record as deleted, which requires occasional reorganization.
- $T_D = T_{Fs} + 2r$, where x/2 of the chain is involved in $T_{Fs}$ (successful search).

Delete with consolidation
- Delete with consolidation requires the entire chain to be read in and, after being updated, written out. This does not require any reorganization.
- The approximate formula: $T_D = T_{Fu} + 2r$, where the full chain (x blocks) is involved in $T_{Fu}$.

Insertion
- Assuming that insertion involves modifying the last bucket, just like deletion: $T_I = T_{Fu} + 2r$

Sequential Access
- In hashing, each record is randomly placed and randomly accessed, which makes sequential record processing very slow: $T_X = n \cdot T_F$


Linear hashing: Review-1
- A bucket overflow causes a split to take place.
- Splitting starts from the first bucket address. It causes k+1 bits to be used for the split buckets and k bits for the others.
- E.g., if the bucket whose 3-bit address is 010 is split, the records ending in 0010 go to one of the resulting buckets (which now have 4-bit addresses) and the records ending in 1010 go to the other.

Linear hashing: Review-2
- The address of the next bucket to split, known as the boundary value, needs to be recorded in the file header.
- If the value of the last k bits is less than the boundary value, use k+1 bits to find the address of the related bucket.
- Split buckets become part of the primary area.
- For access time considerations, Knuth's tables in Fig 6.2 in the textbook by Salzberg can be used.
- E.g., for blocking factor m = 50 and load factor f = 75%, the average access times for the successful and unsuccessful cases can be stated as:
  - $T_{Fs} = 1.05 (s + r + dtt)$
  - $T_{Fu} = 1.27 (s + r + dtt)$

Linear hashing: Insertion-1
- Find the correct bucket in which to place the new record.
- If the bucket is full, allocate a new bucket, place the new record in it, and link the new bucket to the related chain.
- If the load factor f is now out of balance, add one more bucket to the primary area and split the records in the boundary bucket with the new bucket.
- Increment the boundary bucket address.

Linear hashing: Insertion-2
- An insertion formula needs to consider all the timing aspects caused by the above algorithm:
  $T_I = (T_{Fu} + 2r) + \frac{1}{m}(s + r + dtt) + \frac{1}{f \cdot m}\left[(s + r + dtt) + 2r + (s + r + dtt)\right]$
- The first term, $(T_{Fu} + 2r)$, is for finding the bucket to insert into and writing it back to the disk.
- The second term is for linking a new bucket.
- The third term is for expanding the primary area, with probability $1/(f \cdot m)$, by adding a new bucket and reading and writing back the boundary bucket.
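To see how small the second and third terms are next to the first, the formula can be evaluated term by term. The function name and the timings (in milliseconds) are assumptions for illustration only. The factor 1.27 is the unsuccessful-search figure the slides quote for m = 50 and f = 75%.

```python
def linear_hash_insert_time(s, r, dtt, m, f, t_fu):
    """Return the three terms of T_I and their sum."""
    access = s + r + dtt
    find_and_write = t_fu + 2 * r                       # locate bucket, write back
    link_new_bucket = access / m                        # overflow bucket, prob. 1/m
    expand_primary = (access + 2 * r + access) / (f * m)  # split, prob. 1/(f*m)
    total = find_and_write + link_new_bucket + expand_primary
    return find_and_write, link_new_bucket, expand_primary, total


s, r, dtt = 10.0, 5.0, 1.0          # hypothetical timings
terms = linear_hash_insert_time(s, r, dtt, m=50, f=0.75,
                                t_fu=1.27 * (s + r + dtt))
```

With these numbers the first term is about 30.3 ms, while linking and expansion together add less than 1.5 ms.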

Linear hashing: Deletion
- Every delete may cause the linked buckets to be read and rewritten, or even deleted.
- Deletion may also cause contraction of the primary area if f falls below what it should be. The last buckets are consolidated with the buckets sharing the same last k bits.
- A rough estimate for the deletion time: $T_D = T_{Fu} + 2r$, where the other contributing terms and factors are ignored because they are small compared to these two terms.

Linear hashing: Sequential reading in record order
The hash function does not preserve order. Therefore, reading each record is an independent search, giving
T_X = n*1.05*(s + r + dtt) for the f = 75%, m = 50 case.
Note that there is no provision about which blocks or buckets to allocate to an indexed file. There may be a small disk-space allocation directory in the file header to indicate which disk extents are allocated to the file.

Linear hashing: Reorganizing
It is possible to reorganize a hashed file, which costs:
T_reorg = n*(T_F + 2r) = n*(s + r + dtt + 2r)
The amount of space used by a hashed file is n/(f*m) buckets.
The number of bits (k) required to form the hash table for M buckets satisfies:
2^k <= M < 2^(k+1), or k <= log2 M < k+1
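The three quantities on this slide can be computed directly, as in the short sketch below. The function names are hypothetical, f is assumed to be a fraction, and rounding the space up to a whole number of buckets is an added assumption, not something the slide states.

```python
import math


def reorg_time(n, s, r, dtt):
    """T_reorg = n * (T_F + 2r), with T_F = s + r + dtt."""
    return n * (s + r + dtt + 2 * r)


def buckets_used(n, f, m):
    """Space n / (f * m), rounded up to whole buckets (an assumption)."""
    return math.ceil(n / (f * m))


def address_bits(M):
    """The k with 2**k <= M < 2**(k + 1), for an integer M >= 1."""
    return M.bit_length() - 1
```

For example, under these assumptions a file of n = 100000 records with f = 0.75 and m = 50 needs buckets_used(100000, 0.75, 50) buckets, and address_bits applied to that count gives the k of the inequality above.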

Redundant slides

Predicting the distribution of records
In the random case, one can approximate the collision probability p(x), which is the probability that a given address will have x records assigned to it.
With n records and N addresses, x can take the values 0, 1, 2, 3, etc.; the bigger x is, the smaller the probability.
Rather than treating p(x) as a probability, it can be taken as the proportion of the addresses having x logical records assigned to them by hashing.
Thus, p(0) is the proportion of the addresses with 0 records assigned to them. Given p(0), the expected number of addresses with no assignments is N*p(0).
Let n/N also represent f, the load factor (or packing density).
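The slide does not say which approximation is used for p(x). A common choice for randomly distributed keys, assumed here, is the Poisson distribution with mean f = n/N, that is p(x) = f^x * e^(-f) / x!. The function names below are hypothetical.

```python
import math


def p(x, n, N):
    """Poisson approximation (assumed) of the proportion of addresses
    that receive exactly x of the n records, given N addresses."""
    f = n / N  # load factor (packing density)
    return (f ** x) * math.exp(-f) / math.factorial(x)


def expected_empty_addresses(n, N):
    """Expected number of addresses with no records: N * p(0)."""
    return N * p(0, n, N)
```

Under this approximation, with n = N the proportion of empty addresses is p(0) = e^(-1), a little over a third of all addresses.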

Collision resolution
Collisions can be resolved by chaining. A simple (separate) chaining method allows overflow from a full bucket to a bucket in a separate overflow area.
Usually N = M*m, where M is the number of buckets and m is the blocking factor.
Given a bucket size and the load factor, one can compute the average number of accesses (a) required to fetch a record. For example, for m=10 and f=70%, a=1.201; for m=50 and f=70%, a= (see the textbook for computed values of a for a given f and m).

Bounded Extendible hashing
A buffer of fixed size (tuned to the size of the available memory) is designated to serve as the index. The index table grows according to the extendible hashing principle; however, the index computed from a record key is assumed not to fall outside the size of this table.
The file grows by doubling the data block size when an overflow occurs. Thus, the size of the bucket is a power of 2.

Bounded Extendible hashing: Hash key digits
As an example, the digits of a hashed key may have the following meanings:
First y digits: the corresponding block address in the index area.
Next z digits: the correct record index entry in the located index block.
Next w digits: the offset, from the beginning of the bucket, of the correct block containing the record. This is normally log2 m bits, where m is the number of blocks in the bucket.
To avoid sparsely populated buckets, chaining may be allowed to a certain extent…
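The field layout described above can be illustrated by slicing a hashed key into three bit fields. This is a sketch under assumptions the slide does not state: the digits are taken to be binary, the hashed key is taken to be exactly y + z + w bits wide, and "first" is read as most significant. The function name is hypothetical.

```python
def split_hashed_key(h, y, z, w):
    """Split a (y + z + w)-bit hashed key into its three fields.

    Returns (index_block, index_entry, block_offset):
      index_block  : first y bits, block address in the index area
      index_entry  : next z bits, entry within the located index block
      block_offset : next w bits, offset of the block inside the bucket
    """
    total = y + z + w
    if not 0 <= h < (1 << total):
        raise ValueError("hashed key does not fit in y + z + w bits")
    index_block = h >> (z + w)
    index_entry = (h >> w) & ((1 << z) - 1)
    block_offset = h & ((1 << w) - 1)
    return index_block, index_entry, block_offset
```

For instance, with y = 3, z = 4 and w = 2 (a bucket of m = 4 blocks, so log2 m = 2), the key 0b101011011 splits into index block 0b101, index entry 0b0110 and block offset 0b11.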