CPSC 335 Computer Science University of Calgary Canada.

CPSC 335 Dr. Marina Gavrilova Computer Science University of Calgary Canada.

Outline: Coalesced Hashing; Variants; Brent’s Method; Binary Tree; Comparison of various methods

Coalesced Hashing Coalesced hashing is a collision resolution method that uses pointers to connect the elements of a synonym chain. A hybrid of separate chaining and open addressing. Linked lists within the hash table handle collisions. This strategy is effective, efficient and very easy to implement.

Coalesced Hashing Coalesced hashing obtains its name from what occurs when we attempt to insert a record with a home address that is already occupied by a record from a chain with a different home address. This situation would occur, for example, if we attempted to insert a record with a home address of s into the hash table. What occurs is that the two chains with records having different home addresses coalesce or grow together.

Coalesced Hashing In the figure to the right, the records with keys X, D, and Y were inserted in the given order into the hash table. A, B, C, and D form one set of synonyms, and X and Y form another set. When X is inserted into the table with coalescing, it must be inserted at the end of the chain that it is coalescing with. Instead of needing only one probe to retrieve X, three are needed. The greater the coalescing, the longer the probe chain will be, and as a result retrieval performance will be degraded. When record D is then added, it must be inserted at the end of the coalesced chains, so locating D requires passing over record X from the other chain. Synonym chain with coalescing (the shaded portion indicates the part of the chain in which coalescing has occurred; the thin line represents the insertions on the synonym chain with r as its home address; the thick line represents the insertions on the chain with s as its home address).

Coalesced Hashing Algorithm for Coalesced Hashing Coalesced hashing originated with Williams [1] and is also referred to as direct chaining.
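The algorithm itself did not survive in this transcript, so the following is only a minimal sketch of late-insertion coalesced hashing without a cellar. It assumes integer keys, a hash function of key mod table size, and overflow slots taken from the bottom of the table; the class name `LischTable` and its methods are hypothetical names chosen for illustration.

```python
class LischTable:
    """Sketch of coalesced hashing with late insertion and no cellar."""

    def __init__(self, size):
        self.size = size
        self.keys = [None] * size    # stored keys; None marks an empty slot
        self.links = [None] * size   # index of the next record in the chain
        self.free = size - 1         # overflow slots are taken from the bottom

    def _home(self, key):
        return key % self.size       # assumed hash function (integer keys)

    def insert(self, key):
        pos = self._home(key)
        if self.keys[pos] is None:   # home address is free: store there
            self.keys[pos] = key
            return pos
        # Walk to the end of whatever chain passes through the home address.
        while True:
            if self.keys[pos] == key:
                raise KeyError(f"duplicate key {key}")
            if self.links[pos] is None:
                break
            pos = self.links[pos]
        # Find the bottom-most empty slot.
        while self.free >= 0 and self.keys[self.free] is not None:
            self.free -= 1
        if self.free < 0:
            raise OverflowError("table is full")
        self.keys[self.free] = key
        self.links[pos] = self.free  # link the new record at the chain's end
        return self.free

    def probes(self, key):
        """Number of probes needed to find key, or None if it is absent."""
        pos, count = self._home(key), 1
        while pos is not None and self.keys[pos] is not None:
            if self.keys[pos] == key:
                return count
            pos = self.links[pos]
            count += 1
        return None


table = LischTable(11)
for k in [27, 18, 29, 28, 39, 13, 16, 21, 40]:
    table.insert(k)
```

In this run, key 21 has home address 10, which is already occupied by the overflow record 29 from the chain that starts at address 7, so the two chains coalesce. Key 40 then needs four probes even though its own synonyms (18, 29, 40) number only three.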

Variants Many suggestions have been made for reducing the coalescing of probe chains and thereby lowering the number of retrieval probes, which in turn improves performance. The variants may be classified in three ways: (1) the table organization (whether or not a separate overflow area is used); (2) the manner of linking a colliding item into a chain; (3) the manner of choosing unoccupied locations.

Variants Coalescing may be reduced by modifying the table organization. Instead of allocating the entire table space for both overflow records and home address records, the table is divided into a primary area and an overflow area (the cellar). The primary area is the address space that the hash function maps into. The overflow or cellar area contains only overflow records. The address factor is the ratio of the primary area to the total table size: address factor = primary area / total table size.

Variants For a fixed amount of storage, as the address factor decreases, the cellar size increases. This reduces coalescing, but because the primary area becomes smaller, the number of collisions increases. More collisions mean more items requiring multiple retrieval probes. Vitter [2] determined that an address factor of 0.86 yields nearly optimal retrieval performance for most load factors.
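As a small worked example of the address factor, the sketch below splits a fixed number of slots into a primary area and a cellar. The function names, the 1000-slot table, and the key-mod-primary-size hash function are assumptions made for illustration only.

```python
def split_table(total_slots, address_factor):
    """Return (primary_size, cellar_size) for a table of total_slots."""
    primary = round(total_slots * address_factor)
    return primary, total_slots - primary


def home_address(key, primary_size):
    # Assumed hash function: only primary-area addresses are ever produced,
    # so the cellar can hold nothing but overflow records.
    return key % primary_size


primary, cellar = split_table(1000, 0.86)
```

With an address factor of 0.86, a 1000-slot table has 860 primary slots and a 140-slot cellar. Lowering the factor enlarges the cellar but leaves fewer addresses for the hash function to map into.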

Variants LISCH: The algorithm given in slide 6 is called Late Insertion Standard Coalesced Hashing (LISCH) since new records are inserted at the end of a probe chain. The ‘Standard’ in the name refers to the lack of a cellar. The variant of that algorithm that uses a cellar is called LICH, Late Insertion Coalesced Hashing.

Variants Another way of varying the insertion algorithm is to change the way in which we choose an unoccupied location. The unoccupied locations are always chosen from the bottom of the storage area, but the number of collisions is increased in this way. Hsaio [3] suggests REISCH (‘R’ stands for ‘Random’), in which a random unoccupied location is chosen for the new insertion. REISCH gives only a 1% improvement over EISCH (Early Insertion Standard Coalesced Hashing). BLISCH (‘B’ signifies ‘Bidirectional’) chooses the overflow location for a collision insertion by alternating the selection between the top and bottom of the table. In DCWC (Direct Chaining Without Coalescing), a record that is not stored at its home address is moved when a new record hashes to the location it occupies, so chains never coalesce.
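The three ways of choosing an unoccupied location described above can be sketched as interchangeable selection routines. This is only an illustration: the table is assumed to be a Python list with None marking an empty slot, all names are hypothetical, and starting the bidirectional alternation at the bottom is an arbitrary choice.

```python
import random


def pick_bottom(keys):
    """Standard choice: the empty slot closest to the bottom of the table."""
    for pos in range(len(keys) - 1, -1, -1):
        if keys[pos] is None:
            return pos
    return None  # table is full


def pick_random(keys, rng=random):
    """Random choice: any empty slot, picked uniformly at random."""
    empty = [pos for pos, k in enumerate(keys) if k is None]
    return rng.choice(empty) if empty else None


class BidirectionalPicker:
    """Bidirectional choice: alternate between the bottom and the top."""

    def __init__(self):
        self.from_top = False  # assumption: the first overflow goes to the bottom

    def pick(self, keys):
        n = len(keys)
        order = range(n) if self.from_top else range(n - 1, -1, -1)
        self.from_top = not self.from_top
        for pos in order:
            if keys[pos] is None:
                return pos
        return None  # table is full
```

Any of these could replace the bottom-up scan in a coalesced hashing insertion routine; only the placement of overflow records changes, not the chaining itself.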

Variants Table 1: Mean number of probes for successful lookup (n = 997) for variants of Coalesced Hashing

Brent’s Method Dynamic collision resolution methods are methods in which an item once stored may be moved. With these methods, any item may be moved, not only those records which are not stored at their home addresses. These methods require additional processing when inserting a record into the table but reduce the number of probes needed for retrieval. The justification for this additional processing is that we usually insert an item into a table only once but retrieve it many times.

Brent’s Method The Primary Probe Chain of a record is the sequence of locations visited during the insertion or retrieval of the record. The sequence of positions visited when attempting to move a record from the primary probe chain is called the Secondary Probe Chain. We want to minimize the total number of probes for both the item being inserted and the items already in the table. This strategy assumes an equal likelihood of any of the items being retrieved.

Brent’s Method Brent’s method is the first of several dynamic collision resolution methods. In each of them, moving a previously stored item to achieve a reduction in the retrieval probes is considered. In the figure, the solid vertical line represents the primary probe chain and the horizontal lines represent the secondary probe chains. The q value along the primary probe chain is the increment for the item being inserted, whereas the qi’s along the secondary probe chains represent the increments associated with the items being moved. Figure: Brent’s method, probe chains, and their order of processing

Brent’s Method The subscript i gives the number of probes needed to retrieve the item being inserted along its primary probe chain. The subscript j gives the number of additional probes needed to retrieve the item being moved along its secondary probe chain. To minimize the number of retrieval probes, (i+j) is minimized. When two combinations give the same value of (i+j), we arbitrarily choose the one with the smaller i. When we can no longer achieve a reduction in the number of retrieval probes, we should stop attempting to move an item.

Brent’s Method Let s be the number of probes required to retrieve an item if nothing is moved. We then try all combinations of (i+j) < s such that we minimize (i+j). On equality, since there would be no reduction in the number of probes, no movement would occur.

Coalesced Hashing Algorithm for insertion into a file using Brent’s method
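The insertion algorithm was an image and is not reproduced in this transcript. What follows is a hedged sketch of Brent's method as described on the preceding slides, not the original listing. It assumes integer keys, a prime table size, and linear quotient probing (home address = key mod size, increment = quotient mod size, with an increment of 0 replaced by 1); the class and method names are hypothetical.

```python
class BrentTable:
    """Sketch of Brent's method over linear quotient (double) hashing."""

    def __init__(self, size):
        self.size = size                  # assumed prime, so chains visit every slot
        self.slots = [None] * size

    def _home(self, key):
        return key % self.size

    def _incr(self, key):
        q = (key // self.size) % self.size
        return q if q != 0 else 1         # an increment of 0 would never advance

    def insert(self, key):
        pos, q = self._home(key), self._incr(key)
        primary = []                      # occupied slots on the primary probe chain
        while self.slots[pos] is not None:
            if self.slots[pos] == key:
                raise KeyError(f"duplicate key {key}")
            primary.append(pos)
            if len(primary) == self.size:
                raise OverflowError("table is full")
            pos = (pos + q) % self.size
        s = len(primary) + 1              # probes needed if nothing is moved
        # Try every (i, j) with i + j < s, smallest i + j first, smaller i on ties.
        for total in range(2, s):
            for i in range(1, total):
                j = total - i
                at = primary[i - 1]       # i-th slot on the primary chain
                resident = self.slots[at]
                target = (at + j * self._incr(resident)) % self.size
                if self.slots[target] is None:
                    self.slots[target] = resident   # move resident j probes further
                    self.slots[at] = key            # new key now needs only i probes
                    return at
        self.slots[pos] = key             # no move helps: use the first empty slot
        return pos

    def probes(self, key):
        """Number of probes needed to find key, or None if it is absent."""
        pos, q = self._home(key), self._incr(key)
        for count in range(1, self.size + 1):
            if self.slots[pos] is None:
                return None
            if self.slots[pos] == key:
                return count
            pos = (pos + q) % self.size
        return None


table = BrentTable(11)
for k in [27, 18, 29, 28, 39, 13, 16]:
    table.insert(k)
```

In this run, inserting 16 without any move would take s = 6 probes. The first combination that reaches an empty slot is i = 1, j = 3: key 27 is moved three steps along its own chain to address 0, and 16 takes address 5, for a combined cost of 4 probes instead of 6.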

Binary Tree A question that is often asked when considering Brent’s collision resolution method is, “If it is a good idea to move an item on a primary probe chain, why not carry this concept one step further and move items from secondary and subsequent probe chains?” Two features of the binary tree collision resolution method make it worth considering: It needs fewer retrieval probes than Brent’s method. Perhaps more importantly, it illustrates the importance of choosing an appropriate data structure in order to be able to solve a problem effectively.

Binary Tree The binary tree collision resolution method uses a binary tree structure to determine when to move an item and where to move it. A binary tree is appropriate since there are essentially two choices at each probable storage address: continue to the next address along the probe chain of the item being inserted, or move the item stored at that address to the next position on its probe chain. A left branch in the binary tree signifies the continue option and a right branch the move option.

Binary Tree The binary decision tree is generated in a breadth-first fashion, from the top down and left to right, as shown in the figure. The binary tree is used only as a control mechanism in deciding where to store an item and is not used for storing records. A different binary tree is constructed for each insertion of a record. By moving items from secondary and subsequent probe chains, a placement of records is achieved that further reduces the average number of retrieval probes when compared with Brent’s method. Figure: Binary decision tree
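The breadth-first construction of the decision tree can be sketched as follows. This is an illustrative reading of the description above rather than the original algorithm: it assumes distinct non-negative integer keys, a prime table size, and linear quotient probing, and all names are hypothetical. The check that skips addresses already on the path from the root is an added safeguard of this sketch, not something stated in the slides.

```python
from collections import deque


def incr(key, size):
    q = (key // size) % size
    return q if q else 1                   # an increment of 0 would never advance


def on_path(nodes, idx, addr):
    """True if addr already appears on the path from the root to node idx."""
    while idx is not None:
        if nodes[idx][0] == addr:
            return True
        idx = nodes[idx][2]
    return False


def binary_tree_insert(slots, key):
    size = len(slots)
    if None not in slots:
        raise OverflowError("table is full")
    # Each node: (address, key seeking a slot, parent index, is_right_child).
    nodes = [(key % size, key, None, False)]
    queue = deque([0])
    while queue:
        idx = queue.popleft()              # breadth first, left to right
        addr, k, _, _ = nodes[idx]
        resident = slots[addr]
        if resident is None:
            break                          # first empty address found
        children = (
            # left: continue along the probe chain of the key being placed
            ((addr + incr(k, size)) % size, k, False),
            # right: move the resident to the next slot on its own chain
            ((addr + incr(resident, size)) % size, resident, True),
        )
        for child_addr, child_key, is_right in children:
            if not on_path(nodes, idx, child_addr):
                nodes.append((child_addr, child_key, idx, is_right))
                queue.append(len(nodes) - 1)
    else:
        raise OverflowError("no placement found")
    # Store at the empty address, then apply the moves back up to the root.
    addr, k, parent, is_right = nodes[idx]
    slots[addr] = k
    while parent is not None:
        p_addr, p_key, p_parent, p_is_right = nodes[parent]
        if is_right:                       # the resident of p_addr was moved out
            slots[p_addr] = p_key
        parent, is_right = p_parent, p_is_right


slots = [None] * 11
for k in [27, 18, 29, 28, 39, 13, 16]:
    binary_tree_insert(slots, k)
```

In this run, inserting 16 takes the path continue, move, continue: 16 is stored at address 6, its second probe position, and key 39, which had been placed there, is moved two steps along its own chain to address 1. Because the tree can double in width at every level, this sketch is only practical for small or lightly loaded tables.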

Comparison Table 2: Comparison of mean number of probes for successful lookup (n = 997, by packing factor). Table 2 provides the average number of retrieval probes for successful searches on a table of 997 records with a uniform distribution of keys.

Comparison Figure 5 graphically displays the performance data for all methods except for computed chaining with a 2-bit link field. Performance of collision resolution methods

Comparison Note the wide variance in performance at packing factors of 90 percent or more. The result for computed chaining with a 20 percent packing factor is less than that for DCWC (Direct Chaining Without Coalescing).

Comparison Table 3: Search, relocation and storage comparisons The above table offers additional useful comparison criteria. The successful search criteria give the minimum and maximum number of probes necessary to retrieve an item.

Comparison Table 3: Search, relocation and storage comparisons. The range for worst-case performance varies from ln n to n. Although the worst-case performance for locating a record with both LISCH and computed chaining is n, their typical performance would be better, because only the records of one chain need to be searched.

Comparison What is the best method? There is no single method that is best for all purposes. The method that provides the lowest average number of probes, and thus the best performance in general, is DCWC. The method with the second lowest average number of retrieval probes is computed chaining. When no coalescing occurs, LISCH behaves like DCWC and performs better than computed chaining. If storage is somewhat scarce, computed chaining will then have an advantage over DCWC.

Comparison Table 4: Advantages, disadvantages, and when to use various collision resolution methods