Searching: Self Organizing Structures and Hashing

CS 400/600 – Data Structures

Searching
- Records contain information and keys: <k1, I1>, <k2, I2>, …, <kn, In>.
- A search finds all records with key value K; it may be successful or unsuccessful.
- Range query: find all records with key values between Klow and Khigh.

Searching Sorted Arrays
- Previously we determined the expected search cost in terms of the probability of a failed search, p0, and the probability p of finding the record in each slot. (The cost formula itself appeared as a figure on the original slide.)

Self Organization
- 80/20 rule: in many applications, 80% of the accesses reference 20% of the records.
- If we sort the records by the frequency with which they are accessed, a linear search through the array can be efficient.
- Since we don't know the actual access pattern in advance, we use heuristics to order the array.

Reorder Heuristics
- Count: keep an access count for each record and sort by count.
  Doesn't react well to changes in access frequency over time.
- Move-to-front: move a record to the front of the list when it is accessed (see the sketch after this list).
  Responds better to dynamic changes.
- Transpose: swap a record with its predecessor (move it one step toward the front) on access.
  Pathological case: alternately accessing the last and next-to-last records just swaps them back and forth forever.
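A minimal sketch (assumed, not from the slides) of a list using the move-to-front heuristic:

    #include <list>

    template <typename T>
    class SelfOrganizingList {
    public:
        void insert(const T& item) { items.push_back(item); }   // simple, fast insert

        // Linear search; on a hit, splice the node to the front of the list.
        bool find(const T& key) {
            for (auto it = items.begin(); it != items.end(); ++it) {
                if (*it == key) {
                    items.splice(items.begin(), items, it);      // move-to-front
                    return true;
                }
            }
            return false;
        }

    private:
        std::list<T> items;
    };

Frequently accessed records drift toward the front, so under an 80/20 access pattern most searches terminate after only a few comparisons.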

Analysis of Self Organizing Lists
- Search is slower than in search trees or sorted lists.
- Insertion is fast.
- Simple to implement.
- Very efficient for small lists.

Hashing
- Use a hash function h that maps a key k to a slot in the hash table HT: HT[h(k)] = record.
- The number of slots in the hash table is M, so 0 ≤ h(k) ≤ M-1.
- Simple case: when the unique keys are integers, we might use h(k) = k % M (see the sketch below).
- Issues: achieving an even distribution of h(k), and collision resolution.
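As a minimal sketch (assumed, not taken from the slides), the simple integer case might look like this, with no collision handling yet:

    #include <optional>
    #include <string>
    #include <vector>

    struct Record { int key; std::string data; };

    int h(int k, int M) { return k % M; }              // simple integer hash

    int main() {
        const int M = 11;                              // number of slots
        std::vector<std::optional<Record>> HT(M);      // HT[h(k)] holds the record
        Record r{42, "payload"};
        HT[h(r.key, M)] = r;                           // 42 % 11 == 9, so slot 9
        bool found = HT[h(42, M)].has_value();         // O(1) lookup by key
        (void)found;
    }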

Hash Function Distribution
- Should depend on all bits of the key.
  Counterexample: h(k) = k % 8 uses only the last 3 bits of the key.
- Should distribute keys evenly among the slots to minimize collisions.
- Two possibilities:
  - We know nothing about the distribution of keys: aim for a uniform distribution over the slots.
  - We know something about the keys. Example: English words rarely start with Z or K.

Example Hash Functions
- Mid-square: square the key, then take the middle r bits for a table with 2^r slots.
- Folding for strings: sum the ASCII values of the characters in the string.
  The order of the characters doesn't matter (not good).
- ELFhash:

    int ELFhash(char* key) {
        unsigned long h = 0;
        while (*key) {
            h = (h << 4) + *key++;
            unsigned long g = h & 0xF0000000L;
            if (g) h ^= g >> 24;
            h &= ~g;
        }
        return h % M;
    }

Open Hashing
- What do we do when collisions occur?
- Open hashing (separate chaining) treats each hash table slot as a bin: a list of all records that hash there.
- With a good hash function we expect about N/M elements in each list (a sketch follows).
- Effective for a hash table in memory, but difficult to implement efficiently on disk.
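A minimal sketch of open hashing with chaining (assumed, not from the slides): each slot holds a std::list of keys, and insert/search work on the chain at slot h(k).

    #include <algorithm>
    #include <list>
    #include <vector>

    class ChainedHashTable {
    public:
        explicit ChainedHashTable(int slots) : table(slots) {}

        void insert(int key) {
            table[h(key)].push_back(key);            // append to the slot's chain
        }

        bool search(int key) const {
            const std::list<int>& chain = table[h(key)];
            return std::find(chain.begin(), chain.end(), key) != chain.end();
        }

    private:
        int h(int key) const { return key % static_cast<int>(table.size()); }
        std::vector<std::list<int>> table;           // one bin (chain) per slot
    };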

Bucket Hashing
- Divide the hash table slots into buckets (for example, 8 slots per bucket).
- The hash function maps keys to buckets.
- Records that do not fit in their bucket go to a global overflow bucket.
- Becomes inefficient when the overflow bucket is very full.
- Variation: map the key to a home slot as though there were no bucketing, then check the rest of the bucket.
[Figure: a bucketed hash table holding keys 1000, 9530, 9877, 2007, 3013, and 9879 in buckets 1-4, with 1057 placed in the overflow bucket.]

Closed Hashing
- Closed hashing stores all records directly in the hash table. Bucket hashing is a type of closed hashing.
- Each record i has a home position h(ki).
- If another record occupies i's home position, then another slot must be found to store i.
- The new slot is found by a collision resolution policy.
- Search must follow the same policy to find records that are not in their home slots.

Collision Resolution
- During insertion, the goal of collision resolution is to find a free slot in the table.
- Probe sequence: the series of slots visited during insert/search by following a collision resolution policy.
- Let b0 = h(K), and let (b0, b1, …) be the series of slots making up the probe sequence.

Insertion

    // Insert e into hash table HT
    template <class Key, class Elem, class KEComp, class EEComp>
    bool hashdict<Key, Elem, KEComp, EEComp>::
    hashInsert(const Elem& e) {
        int home;                                  // Home position for e
        int pos = home = h(getkey(e));             // Initial position
        for (int i = 1; !EEComp::eq(EMPTY, HT[pos]); i++) {
            pos = (home + p(getkey(e), i)) % M;    // Next slot on the probe sequence
            if (EEComp::eq(e, HT[pos]))
                return false;                      // Duplicate
        }
        HT[pos] = e;                               // Insert e
        return true;
    }

Search

    // Search for the record with key K
    template <class Key, class Elem, class KEComp, class EEComp>
    bool hashdict<Key, Elem, KEComp, EEComp>::
    hashSearch(const Key& K, Elem& e) const {
        int home;                                  // Home position for K
        int pos = home = h(K);                     // Initial position
        for (int i = 1;
             !KEComp::eq(K, HT[pos]) && !EEComp::eq(EMPTY, HT[pos]);
             i++)
            pos = (home + p(K, i)) % M;            // Next slot on the probe sequence
        if (KEComp::eq(K, HT[pos])) {              // Found it
            e = HT[pos];
            return true;
        }
        else
            return false;                          // K not in hash table
    }

Probe Function
- Look carefully at the probe function p():
    pos = (home + p(getkey(e), i)) % M;
- Each time p() is called, it generates a value to be added to the home position to produce the next slot to be examined.
- p() is a function of both the element's key value and the number of steps taken along the probe sequence.
- Not all probe functions use both parameters.

Linear Probing
- Use the probe function p(K, i) = i.
- Linear probing simply goes to the next slot in the table; past the bottom, it wraps around to the top.
- To avoid an infinite loop, one slot in the table must always remain empty.
(A sketch of linear-probing insertion follows.)
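A minimal sketch (assumed, not from the slides) of linear probing as a standalone insert routine for integer keys, with the probe function written out explicitly:

    #include <optional>
    #include <stdexcept>
    #include <vector>

    int p_linear(int /*K*/, int i) { return i; }               // linear probing ignores the key

    // Insert key into an open-addressed table; returns the slot used.
    int linearProbeInsert(std::vector<std::optional<int>>& HT, int key) {
        const int M = static_cast<int>(HT.size());
        int home = key % M;
        int pos = home;
        for (int i = 1; HT[pos].has_value(); i++) {
            if (i > M) throw std::runtime_error("table full");  // keep at least one slot empty
            pos = (home + p_linear(key, i)) % M;                 // next slot, wrapping around
        }
        HT[pos] = key;
        return pos;
    }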

Linear Probing Example
- Primary clustering: records tend to cluster in the table under linear probing, since the probabilities of which slot is used next are not the same for all slots.
- For the 11-slot table in part (a) of the original figure: prob(3) = 4/11, prob(4) = 1/11, prob(5) = 1/11, prob(6) = 1/11, prob(10) = 4/11.
- For part (b): prob(3) = 8/11, and prob(4), prob(5), prob(6) = 1/11 each.
- Ideally: equal probability for each slot at all times.

Improved Linear Probing
- Instead of going to the next slot, skip ahead by some constant c.
- Warning: pick M and c carefully. The probe sequence should cycle through all slots of the table, so pick c relatively prime to M.
- Example of a bad choice: with c = 2 and M = 10 we have effectively created two hash tables (the even slots and the odd slots).
- There is still some clustering. Example: with c = 2, h(k1) = 3 and h(k2) = 5, the probe sequences for k1 and k2 are linked together.

Pseudo-random Probing
- Ideally, for any two keys k1 and k2, the probe sequences should diverge.
- An ideal probe function would select the next value in the probe sequence at random. We can't do this, because a later search could not retrace the same probe sequence.
- Instead, select a random permutation of the numbers from 1 to M-1 (a sketch follows):
    Perm = [r1, r2, r3, …, rM-1]
    p(K, i) = Perm[i-1]
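A minimal sketch (assumed, not from the slides) of building the permutation once for the table and using it as the probe function:

    #include <algorithm>
    #include <numeric>
    #include <random>
    #include <vector>

    // Build a random permutation of 1..M-1, shared by every key in the table.
    std::vector<int> makePerm(int M, unsigned seed = 42) {
        std::vector<int> perm(M - 1);
        std::iota(perm.begin(), perm.end(), 1);            // 1, 2, ..., M-1
        std::shuffle(perm.begin(), perm.end(), std::mt19937(seed));
        return perm;
    }

    // Pseudo-random probe function: the i-th step is the i-th permutation entry.
    int p_pseudo(const std::vector<int>& perm, int i) { return perm[i - 1]; }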

Pseudo-random probe example
- Hash table size M = 101, Perm = [2, 5, 32, …], h(k1) = 30, h(k2) = 28.
- Probe sequence for k1: 30, 32, 35, 62
- Probe sequence for k2: 28, 30, 33, 60
- Although the sequences temporarily converge (both visit slot 30), they quickly diverge again afterwards.

Quadratic probing
- Probe function: p(K, i) = i^2
- Example: M = 101, h(k1) = 30, h(k2) = 29.
- Probe sequence for k1: 30, 31, 34, 39
- Probe sequence for k2: 29, 30, 33, 38
- Eliminates primary clustering, but doesn't guarantee that every slot in the hash table is in the probe sequence for every key.

Secondary Clustering
- Pseudo-random (and quadratic) probing eliminates primary clustering.
- But if two keys hash to the same slot, they still follow the same probe sequence. This is called secondary clustering.
- To avoid secondary clustering, the probe sequence must be a function of the original key value, not just the home position.
- None of the probe functions we have looked at so far use K in any way.

Double hashing
- One way to get a probe sequence that depends on K is to use linear probing, but with a skip constant that is different for each K.
- We can use a second hash function to get the constant: p(K, i) = i * h2(K), where h2 is another hash function (a sketch follows).
- Example: hash table of size M = 101; h(k1) = 30, h(k2) = 28, h(k3) = 30; h2(k1) = 2, h2(k2) = 5, h2(k3) = 5.
- Probe sequence for k1: 30, 32, 34, 36
- Probe sequence for k2: 28, 33, 38, 43
- Probe sequence for k3: 30, 35, 40, 45
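A minimal sketch (assumed, not from the slides) of a double-hashing probe function, with h2 chosen so that it never returns 0 and the probe always advances:

    // Primary hash: home position in a table of M slots.
    int h1(int k, int M) { return k % M; }

    // Secondary hash: step size in 1..M-1 (a common choice when M is prime).
    int h2(int k, int M) { return 1 + (k % (M - 1)); }

    // Double-hashing probe function: offset added at step i.
    int p_double(int k, int i, int M) { return i * h2(k, M); }

    // Slot visited at step i of the probe sequence for key k.
    int slotAt(int k, int i, int M) { return (h1(k, M) + p_double(k, i, M)) % M; }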

How do we pick the two hash functions?
- A good implementation of double hashing should ensure that all values of the second hash function are relatively prime to M.
- If M is prime, then h2() can return any number from 1 to M-1.
- If M is 2^m, then any odd number between 1 and M will do (see the sketch below).
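As an assumed illustration (not from the slides), for the power-of-two case the step can simply be forced to be odd, since any odd number is relatively prime to 2^m:

    // M = 2^m slots: derive the step from the high bits of the key and force it odd.
    int h2_powerOfTwo(int k, int M) { return (2 * (k / M) + 1) % M; }   // always odd, never 0

For example, with M = 16, keys 3, 19, and 40 would get steps 1, 3, and 5 respectively.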

How fast is hashing?
- When a record is found in its home position, search takes O(1) time.
- As the table fills, the probability of a collision increases.
- Define the load factor of a table as α = N/M, where N is the number of records currently in the table.

Analysis of hashing
- When inserting a record, the probability that the home position is occupied is simply α = N/M.
- The probability that the home position and the next slot probed are both occupied is
    (N/M) * ((N-1)/(M-1))
- and, in general, the probability of i collisions (i occupied slots probed in a row) is
    (N(N-1)…(N-i+1)) / (M(M-1)…(M-i+1))

Analysis of hashing (2)
- When N and M are large compared to i, this value is approximated by (N/M)^i = α^i.
- The expected number of probes for an insertion (or an unsuccessful search) is then roughly
    1 + α + α² + α³ + …
- which is approximately 1 / (1 - α).
- This is a theoretical best case, in which no clustering occurs.
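For example (a quick calculation, not on the original slide): at a load factor of α = 0.5 the expected number of probes is about 1/(1 - 0.5) = 2, while at α = 0.9 it grows to 1/(1 - 0.9) = 10, so keeping the table well below full pays off.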

Hashing Performance
[Figure: expected number of accesses plotted against the load factor α, comparing the no-clustering theoretical bound with linear probing, which suffers from heavy clustering.]

Deletion
- Deleting a record must not hinder later searches: remember, we stop searching along the probe sequence when we find an empty slot.
- We also do not want to make positions in the hash table unusable because of deletion.

Tombstones (1)
- Both of these problems can be resolved by placing a special mark, called a tombstone, in place of the deleted record.
- A tombstone will not stop a search, but its slot can be reused for future insertions. (A sketch follows.)
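A minimal sketch (assumed, not from the slides) of tombstone handling in an open-addressed table of integer keys, using linear probing for simplicity; duplicate checking on insert is omitted for brevity:

    #include <vector>

    enum SlotState { SLOT_EMPTY, SLOT_OCCUPIED, SLOT_TOMBSTONE };
    struct Slot { SlotState state = SLOT_EMPTY; int key = 0; };

    class OpenTable {
    public:
        explicit OpenTable(int M) : HT(M) {}

        bool search(int key) const { return findSlot(key) >= 0; }

        bool remove(int key) {
            int pos = findSlot(key);
            if (pos < 0) return false;
            HT[pos].state = SLOT_TOMBSTONE;           // mark, don't empty: later searches probe past it
            return true;
        }

        void insert(int key) {
            int M = (int)HT.size(), home = key % M;
            for (int i = 0; i < M; i++) {
                int pos = (home + i) % M;             // linear probing
                if (HT[pos].state != SLOT_OCCUPIED) { // empty OR tombstone slots are reusable
                    HT[pos].state = SLOT_OCCUPIED;
                    HT[pos].key = key;
                    return;
                }
            }
        }

    private:
        // Follow the probe sequence; a tombstone does not stop the search, an empty slot does.
        int findSlot(int key) const {
            int M = (int)HT.size(), home = key % M;
            for (int i = 0; i < M; i++) {
                int pos = (home + i) % M;
                if (HT[pos].state == SLOT_EMPTY) return -1;
                if (HT[pos].state == SLOT_OCCUPIED && HT[pos].key == key) return pos;
            }
            return -1;
        }

        std::vector<Slot> HT;
    };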

Tombstones (2)
- Unfortunately, tombstones add to the average path length.
- Possible solutions:
  - Local reorganizations to try to shorten the average path length.
  - Periodically rehash the table (for example, by order of most frequently accessed record).