Comp 335 File Structures Hashing.



What is Hashing?
A process used with record files that tries to achieve O(1) (i.e., constant-time) access to a record's location in the file.
An algorithm called a hash function (h) is given a primary key as input; its output is the location of the record within the file: h(key) = address.

Hashing Example
Assume you want to store 5,000 data records in a file, and you want it to be a hashed file for quick access. Each record is fixed in length, and the primary key for each record is an employee number that is 8 digits long.
A common hash function is modulo arithmetic:
h(key) = key mod n; n = 5000
h(82461792) = 82461792 mod 5000 = 1792
The address (RRN) of the record with this key is 1792.
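The modulo example above can be checked with a few lines of Python. This is only a sketch: the function name `mod_hash` is made up for illustration, and the key is assumed to already be an integer.

```python
def mod_hash(key: int, n: int) -> int:
    """Map an integer key to a relative record number (RRN) in 0..n-1."""
    return key % n


# The employee number from the slide, hashed into a 5,000-address file.
print(mod_hash(82461792, 5000))  # 1792
```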

Other Hashing Methods: Folding
Folding extracts certain groupings from the key and then adds or multiplies the groupings in some fashion to form the hash address.
Example: Key = "BISON", Address Space = 101
Step 1: Get the ASCII value of each character in the string: B(66), I(73), S(83), O(79), N(78)
Step 2: Add the values at even index positions (0, 2, 4): 66 + 83 + 78 = 227
Step 3: Add the values at odd index positions (1, 3): 73 + 79 = 152
Step 4: Multiply the results: 227 * 152 = 34504
Step 5: Take the result modulo the address space: 34504 mod 101 = 63 (hash address)
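The folding steps for "BISON" can be sketched as follows. The name `fold_hash` is hypothetical, and the grouping rule (even versus odd zero-based positions, then multiply) is simply the one used in this example; other folding schemes group the key differently.

```python
def fold_hash(key: str, address_space: int) -> int:
    """Fold a string key: sum even-index and odd-index character codes,
    multiply the two sums, then reduce modulo the address space."""
    codes = [ord(ch) for ch in key]   # Step 1: ASCII values
    even_sum = sum(codes[0::2])       # Step 2: positions 0, 2, 4, ...
    odd_sum = sum(codes[1::2])        # Step 3: positions 1, 3, ...
    product = even_sum * odd_sum      # Step 4
    return product % address_space    # Step 5


print(fold_hash("BISON", 101))  # 63
```

One consequence of multiplying the two sums: a one-character key has an odd-position sum of 0 and therefore always hashes to address 0.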

Other Hashing Methods: Mid-Square
Square the numeric form of the key and extract some of the digits from the middle of the square.
Example: Assume the address space is 1000.
Key (4-digit integer) = 2973
2973 * 2973 = 8838729
Extract the middle three digits = 387 (hash address)
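A minimal sketch of the mid-square method, assuming the "middle" is taken as the centered run of digits in the decimal form of the square (the slide does not say how to break ties when the run cannot be centered exactly). The name `mid_square_hash` is illustrative.

```python
def mid_square_hash(key: int, digits: int = 3) -> int:
    """Square the key and return `digits` digits from the middle of the square."""
    squared = str(key * key)
    # Start index of a centered window; clamp at 0 for very short squares.
    start = max(0, (len(squared) - digits) // 2)
    return int(squared[start:start + digits])


print(mid_square_hash(2973))  # 2973 * 2973 = 8838729, middle digits 387
```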

Other Hashing Methods: Radix Transformation
Convert the key to a different base and then use modulo arithmetic.
Example: Address space is 100.
Key is 453 (base 10)
Conversion: 382 (base 11)
382 mod 100 = 82 (hash address)
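The radix transformation can be sketched as below, using the key 453, whose base-11 representation is 382. The function names are hypothetical, and reading the converted digits back as a decimal number is an assumption about how the digit string is turned into an integer before the modulo step (a base-11 digit of 10 simply carries).

```python
def to_base(value: int, base: int) -> list:
    """Return the digits of a non-negative integer in the given base,
    most significant digit first."""
    digits = []
    while value > 0:
        digits.append(value % base)
        value //= base
    return digits[::-1] or [0]


def radix_hash(key: int, base: int = 11, address_space: int = 100) -> int:
    """Convert key to `base`, read the digits as a decimal number,
    then reduce modulo the address space."""
    as_decimal = 0
    for d in to_base(key, base):
        as_decimal = as_decimal * 10 + d
    return as_decimal % address_space


print(to_base(453, 11))  # [3, 8, 2]
print(radix_hash(453))   # 382 mod 100 = 82
```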

Other Hashing Methods: Multiplicative Function
Multiply the key by some constant less than one; the hash function returns some of the digits of the fractional part of the result.
Example: Address space = 1000
Key (5-digit integer): 82165
Multiplier: 0.39731
82165 * 0.39731 = 32644.97615
The first three digits of the fractional part are the hash address = 976
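A short sketch of the multiplicative method, assuming the address space is a power of ten so that "the first k digits of the fraction" equals the fraction scaled by the address space and truncated. The name `multiplicative_hash` is illustrative, and ordinary floating-point arithmetic is used, which is adequate for this example but can misround when the fraction lies extremely close to a digit boundary.

```python
import math


def multiplicative_hash(key: int, multiplier: float = 0.39731,
                        address_space: int = 1000) -> int:
    """Multiply the key by a constant below one and use the leading
    digits of the fractional part as the address."""
    product = key * multiplier
    fraction = product - math.floor(product)  # keep only the fractional part
    return int(fraction * address_space)      # e.g. 0.97615 -> 976


print(multiplicative_hash(82165))  # 976
```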

Major Problem with Hashing
Given a random set of keys and a hash function (h), it is highly probable that some keys in the set will be hash synonyms. In other words, the same hash function output can be obtained from different keys in the set.
A hashing algorithm can yield three different types of address distributions:
Perfect: no synonyms for a given set of keys; the probability of obtaining a perfect distribution from a large set of unknown keys is very, very low (textbook: 1 out of $10^{120,000}$)
Random: "few" synonyms generated; what we strive for!
Scud: many synonyms generated
If the set of keys is known beforehand, it is possible to generate a perfect hashing algorithm (Pearson, Cichelli).

Collisions
When two or more keys hash to the same address, this is called a collision. Collisions have to be accounted for with random hashing algorithms.
The handling of collisions becomes a critical issue in the overall search efficiency of a given file. Remember that each probe in a search could mean a disk access.

Decreasing the Probability of Collisions
Increase the address space: a common technique; allocate more addresses in the file than there are records to store. This can greatly decrease the probability of collisions, assuming the hashing algorithm is random. The obvious disadvantage is wasted space.
Place more than one record at an address: this is commonly referred to as using buckets. A single address can store an array of records. This has been shown to increase search efficiency.

Collision Resolution
Even if you have tried to decrease the probability of collisions, they still can and will happen.
Ways to resolve collisions:
Linear Probing
Double Hashing
Prime area with overflow
Chaining

Linear Probing
If a key hashes to an address that is already occupied or full, search the address space linearly until the first free space is found.
Easy to implement; however, this technique can lead to poor search efficiency.
It can take home addresses away from other keys, resulting in more collision handling.
It can also take many accesses to determine that a key does not exist.
Deletion needs care: simply emptying a slot can cut off the probe path to keys stored beyond it, so a deleted slot should be marked as deleted rather than reset to empty.
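Linear probing, including the deletion problem, can be sketched with an in-memory list standing in for the file's address space. Everything here is an assumption for illustration: the class name, the use of `key % size` as the hash function, one record per address, and the tombstone marker `DELETED` used to keep probe paths intact after a delete.

```python
EMPTY = None
DELETED = object()  # tombstone: slot was used once, so searches must keep probing


class LinearProbingTable:
    def __init__(self, size: int):
        self.size = size
        self.slots = [EMPTY] * size

    def insert(self, key: int) -> int:
        """Store key at its home address or the next free slot; return the slot."""
        first_free = None
        for step in range(self.size):
            i = (key % self.size + step) % self.size
            slot = self.slots[i]
            if slot is EMPTY:
                if first_free is None:
                    first_free = i
                break                      # key cannot be stored beyond an empty slot
            if slot is DELETED:
                if first_free is None:
                    first_free = i         # remember it, but keep checking for a duplicate
            elif slot == key:
                return i                   # already present
        if first_free is None:
            raise OverflowError("address space is full")
        self.slots[first_free] = key
        return first_free

    def search(self, key: int):
        """Return the slot holding key, or None if it is not in the table."""
        for step in range(self.size):
            i = (key % self.size + step) % self.size
            slot = self.slots[i]
            if slot is EMPTY:
                return None                # a never-used slot ends the probe path
            if slot is not DELETED and slot == key:
                return i
        return None

    def delete(self, key: int) -> bool:
        i = self.search(key)
        if i is None:
            return False
        self.slots[i] = DELETED            # mark, do not empty
        return True
```

With a 7-slot table, keys 10, 17 and 24 all have home address 3 and land in slots 3, 4 and 5. After deleting 17, a search for 24 still succeeds because slot 4 holds a tombstone rather than an empty marker.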

Double Hashing
Upon a collision, the key is re-hashed using a different algorithm; the result determines the increment to use when searching for an open address.
The same problems exist as with linear probing.
Research has shown that this technique gives better performance than linear probing.
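A minimal sketch of double hashing follows. The two hash functions are assumptions chosen for illustration: `key % size` for the home address and `1 + key % (size - 2)` for the increment, with a prime table size so that every increment visits all addresses. The function names are hypothetical.

```python
def probe_sequence(key: int, size: int) -> list:
    """Addresses visited for key: home address first, then fixed-size jumps.
    Assumes size is prime and greater than 2."""
    home = key % size
    step = 1 + key % (size - 2)   # second hash; never 0, always less than size
    return [(home + i * step) % size for i in range(size)]


def insert(slots: list, key: int) -> int:
    """Place key in the first open address along its probe sequence."""
    for i in probe_sequence(key, len(slots)):
        if slots[i] is None:
            slots[i] = key
            return i
    raise OverflowError("address space is full")


slots = [None] * 7
for k in (10, 17, 24):            # all three share home address 3
    print(k, "->", insert(slots, k))
```

Keys with the same home address usually get different increments, so they do not pile up in neighboring slots the way they do under linear probing.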

Prime area with Overflow
Usually used with buckets. A bucket holds x records in the prime address space and also contains a pointer to an overflow area of the file, which is entry-sequenced.
This pointer points to the bucket's first overflow record, and each overflow record contains a pointer to the next overflow record.
This is a common technique and gives excellent search efficiency.
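The prime-area-with-overflow layout can be sketched in memory as below. This is an illustration under stated assumptions, not the file format from the course: the class name is made up, `key % n_buckets` stands in for the hash function, list indices play the role of file pointers, and -1 marks "no next record". `search` returns the number of accesses, counting the bucket as one access and each overflow record as one more.

```python
class BucketFile:
    def __init__(self, n_buckets: int, capacity: int):
        self.capacity = capacity
        # Prime area: each bucket has room for `capacity` keys plus one overflow pointer.
        self.prime = [{"records": [], "overflow": -1} for _ in range(n_buckets)]
        # Overflow area: entry-sequenced list of [key, next_index] pairs.
        self.overflow = []

    def insert(self, key: int) -> None:
        bucket = self.prime[key % len(self.prime)]
        if len(bucket["records"]) < self.capacity:
            bucket["records"].append(key)
            return
        self.overflow.append([key, -1])          # append at the end of the overflow area
        new_index = len(self.overflow) - 1
        if bucket["overflow"] == -1:
            bucket["overflow"] = new_index       # first overflow record for this bucket
        else:
            i = bucket["overflow"]
            while self.overflow[i][1] != -1:     # walk to the end of the chain
                i = self.overflow[i][1]
            self.overflow[i][1] = new_index

    def search(self, key: int):
        """Return the number of accesses needed to find key, or None if absent."""
        bucket = self.prime[key % len(self.prime)]
        accesses = 1
        if key in bucket["records"]:
            return accesses
        i = bucket["overflow"]
        while i != -1:
            accesses += 1
            if self.overflow[i][0] == key:
                return accesses
            i = self.overflow[i][1]
        return None
```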

Chaining
The file consists of a hash table, which is simply an array of pointers.
When a key is hashed, the result is an index into the hash table. At this location is a pointer to the first record that has this hash address.
All the records with that hash address are then "chained" together as a linked list.
The data record portion of the file can be entry-sequenced.
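Chaining can be sketched with two lists: one for the hash table of pointers and one for the entry-sequenced data records. The class name, the use of `key % table_size` as the hash function, list indices as pointers, -1 as the null pointer, and inserting new records at the head of the chain are all assumptions made for this illustration.

```python
class ChainedFile:
    def __init__(self, table_size: int):
        self.table = [-1] * table_size   # hash table: index of the first record per address
        self.records = []                # entry-sequenced data: (key, next_index) pairs

    def insert(self, key: int) -> int:
        """Append the record and link it at the head of its chain; return its index."""
        slot = key % len(self.table)
        self.records.append((key, self.table[slot]))  # new record points at the old head
        self.table[slot] = len(self.records) - 1
        return self.table[slot]

    def search(self, key: int):
        """Follow the chain for key's hash address; return the record index or None."""
        i = self.table[key % len(self.table)]
        while i != -1:
            stored_key, next_index = self.records[i]
            if stored_key == key:
                return i
            i = next_index
        return None
```

Because records are only ever appended, the data portion stays in arrival order; only the pointers encode which records share a hash address.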

Hash Address Distributions
Assuming you have a random hash function, the Poisson function can be used to compute various probabilities, such as:
How many empty hash slots will there be?
What percentage of the time will access to a key take more than one access to find it?
What is the probability that a certain hash address will have x keys assigned to it?

Poisson Function
$p(x) = \frac{(r/n)^x \, e^{-r/n}}{x!}$
n: the address space (number of addresses)
r: number of keys to hash
x: number of records assigned to a given address
r/n: packing density (load factor)

Poisson Function Example
Assume 1,000 records are to be hashed into a 1,000-address hashed file. What is the probability that a given address will have two keys hashed to it?
$p(2) = \frac{(1000/1000)^2 \, e^{-1000/1000}}{2!} = \frac{e^{-1}}{2} = \frac{.368}{2} = .184$
1,000 (number of addresses) * .184 = 184
Therefore approximately 184 addresses will each have 2 keys hashed to them. Assuming each address holds a single record, these addresses account for 184 overflow records (one per address); addresses that receive three or more keys add further overflow records.

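Poisson Function Example
Assume 1,000 records are to be hashed into a 1,500-address hashed file. What is the probability that a given address will have two keys hashed to it?
$p(2) = \frac{(1000/1500)^2 \, e^{-1000/1500}}{2!} \approx \frac{(.67)^2 \, e^{-.67}}{2} = \frac{(.449)(.512)}{2} = \frac{.230}{2} = .115$
1,500 (number of addresses) * .115 = 172.5 (173)
Therefore approximately 173 addresses will each have 2 keys hashed to them. Assuming each address holds a single record, these addresses account for 173 overflow records (one per address); addresses that receive three or more keys add further overflow records. (Carrying r/n = 2/3 without rounding gives p(2) of about .114, or about 171 addresses.)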
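Both Poisson examples can be reproduced with a short script. The function names are hypothetical. `expected_overflow` is an added illustration rather than something from the slides: it assumes one record per address, so the records that stay in the prime area equal the number of non-empty addresses, n * (1 - p(0)), and every other record overflows.

```python
import math


def poisson(x: int, r: int, n: int) -> float:
    """Probability that a given address receives exactly x of r keys
    hashed at random into n addresses."""
    load = r / n                                   # packing density
    return load ** x * math.exp(-load) / math.factorial(x)


def expected_overflow(r: int, n: int) -> float:
    """Expected overflow records when each address holds one record."""
    return r - n * (1 - poisson(0, r, n))


for r, n in [(1000, 1000), (1000, 1500)]:
    p2 = poisson(2, r, n)
    print(f"r={r}, n={n}: p(2)={p2:.3f}, "
          f"addresses with 2 keys={n * p2:.0f}, "
          f"empty addresses={n * poisson(0, r, n):.0f}, "
          f"total overflow={expected_overflow(r, n):.0f}")
```

The same function answers the other questions on the distribution slide: `n * poisson(0, r, n)` estimates the number of empty addresses, and `expected_overflow(r, n) / r` estimates the fraction of records that need more than one access.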