Chapter 2.5: Dictionaries and Hash Tables



Dictionary ADT (§2.5.1)

The dictionary ADT models a searchable collection of key-element items. The main operations of a dictionary are searching, inserting, and deleting items. Multiple items with the same key are allowed.

Applications:
- address book
- credit card authorization
- mapping host names (e.g., cs16.net) to internet addresses

Dictionary ADT methods:
- findElement(k): if the dictionary has an item with key k, returns its element; else returns the special element NO_SUCH_KEY
- insertItem(k, o): inserts item (k, o) into the dictionary
- removeElement(k): if the dictionary has an item with key k, removes it from the dictionary and returns its element; else returns the special element NO_SUCH_KEY
- size(), isEmpty()
- keys(), elements()

Log File (§2.5.1)

A log file is a dictionary implemented by means of an unsorted sequence: we store the items of the dictionary in a sequence (based on a doubly-linked list or a circular array), in arbitrary order.

Performance:
- insertItem takes O(1) time, since we can insert the new item at the beginning or at the end of the sequence
- findElement and removeElement take O(n) time, since in the worst case (the item is not found) we traverse the entire sequence looking for an item with the given key

The log file is effective only for dictionaries of small size, or for dictionaries in which insertions are the most common operation while searches and removals are rare (e.g., a historical record of logins to a workstation).

Lookup Table

A lookup table is a dictionary implemented with a sorted sequence: we store the items of the dictionary in an array-based sequence, sorted by key, using an external comparator for the keys.

Performance:
- findElement takes O(log n) time, using binary search
- insertItem takes O(n) time, since in the worst case we have to shift O(n) items to make room for the new item
- removeElement takes O(n) time, since in the worst case we have to shift O(n) items to close the gap after the removal

Effective for small dictionaries, or for dictionaries where searches are common but insertions and deletions are rare (e.g., credit card authorizations).

Hashing (§2.5.2)

- Application: word occurrence statistics (operations: insert, find)
- Dictionary: insert, delete, find
- Are O(log n) comparisons necessary? (No.)
- Hashing, basic plan:
  - create a big array for the items to be stored
  - use a function to compute a storage location from the key (a hash function)
  - a collision resolution scheme is necessary

Hash Table Example

A simple hash function: treat the key as a large integer K and let h(K) = K mod M, where M is the table size; let M be a prime number.

Example: suppose we have M = 101 buckets in the hash table. 'abcd' interpreted as a 32-bit integer is 0x61626364 in hex; converted to decimal it's 1633837924, and 1633837924 % 101 = 11. Thus h('abcd') = 11: store the key at location 11. 'dcba' hashes to 57. But 'abbc' also hashes to 57, a collision. What to do? If you have billions of possible keys and hundreds of buckets, lots of collisions are inevitable!
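The slide's arithmetic can be reproduced with a short sketch (an illustration, not part of the original slides), reading each 4-character key as a 32-bit big-endian integer and reducing mod M = 101:

```c
#include <stdint.h>

/* h(K) = K mod M with M = 101 (prime), where a 4-character key is
   interpreted as a 32-bit big-endian integer, as in the slide. */
enum { M = 101 };

unsigned hash4(const char *key) {
    uint32_t k = 0;
    for (int i = 0; i < 4; i++)
        k = (k << 8) | (unsigned char)key[i];   /* "abcd" -> 0x61626364 */
    return k % M;
}
```

With this, hash4("abcd") gives 11, while hash4("dcba") and hash4("abbc") both land in bucket 57, the collision the slide describes.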

Hash Functions (§2.5.3)

A hash function is usually specified as the composition of two functions:
- hash code map h1: keys -> integers
- compression map h2: integers -> [0, N-1]

The hash code map is applied first, and the compression map is applied next to the result, i.e., h(x) = h2(h1(x)). The goal of the hash function is to "disperse" the keys in an apparently random way.

Hash Code Maps (§2.5.3)

- Memory address: interpret the memory address of the key as an integer
- Integer cast: interpret the bits of the key as an integer (for short keys)
- Component sum: partition the bits of the key into chunks (e.g., 16 or 32 bits) and sum them, ignoring overflows (for long keys)
- Polynomial accumulation: like component sum, but multiply each term by 1, z, z^2, z^3, ..., i.e., evaluate
      p(z) = a_0 + a_1 z + a_2 z^2 + ... + a_{n-1} z^{n-1}
  at a fixed value z, ignoring overflows. p(z) can be evaluated in O(n) time using Horner's rule, each term computed from the previous in O(1) time:
      p_0(z) = a_{n-1}
      p_i(z) = a_{n-i-1} + z * p_{i-1}(z)    (i = 1, 2, ..., n-1)
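As a sketch, polynomial accumulation with Horner's rule over a string key looks like the following (the radix z is a parameter here; any fixed value works, and the choice below in the test is purely illustrative):

```c
#include <stdint.h>

/* Polynomial accumulation evaluated by Horner's rule: one multiply
   and one add per character. Unsigned arithmetic wraps mod 2^32,
   which plays the role of "ignoring overflows". */
uint32_t poly_hash(const char *s, uint32_t z) {
    uint32_t h = 0;
    for (; *s != '\0'; s++)
        h = h * z + (unsigned char)*s;   /* Horner step: h = z*h + next coefficient */
    return h;
}
```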

Compression Maps (§2.5.4)

- Division: h2(y) = y mod N. The size N of the hash table is usually chosen to be a prime; the reason has to do with number theory and is beyond the scope of this course.
- Multiply, Add and Divide (MAD): h2(y) = (a*y + b) mod N, where a and b are nonnegative integers such that a mod N != 0. Otherwise, every integer would map to the same value b.
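Both compression maps are one-liners; a minimal sketch (the constants used in the test below are arbitrary illustrations):

```c
/* Division method: h2(y) = y mod N. */
unsigned long div_compress(unsigned long y, unsigned long N) {
    return y % N;
}

/* MAD method: h2(y) = (a*y + b) mod N, requiring a mod N != 0.
   If a mod N were 0, every input would compress to b mod N. */
unsigned long mad_compress(unsigned long y, unsigned long a,
                           unsigned long b, unsigned long N) {
    return (a * y + b) % N;
}
```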

Hashing Strings

How to compute h('aVeryLongVariableName')? Use Horner's method: process the characters left to right, at each step multiplying the accumulated hash by the radix 256, adding the next character, and reducing mod 101. To scramble the keys further, replace 256 with 117:

    int hash(char *v, int M) {
        int h, a = 117;
        for (h = 0; *v; v++)
            h = (a*h + *v) % M;
        return h;
    }

Collisions (§2.5.5)

How likely are collisions? By the birthday paradox, with M buckets we expect the first collision after about sqrt(pi*M/2) insertions, roughly 1.25 sqrt(M). [1.25 sqrt(365) is about 24.] Experiment: generate random numbers ... collision at the 13th number, as predicted. What to do about collisions?

Collision Resolution: Chaining

- Build a linked list for each bucket
- Linear search within each list
- Simple, practical, widely used
- Cuts search time by a factor of M over sequential search
- But requires extra memory outside of the table
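A minimal separate-chaining sketch (assuming integer keys and an illustrative bucket count of M = 101; this is not the book's code):

```c
#include <stdlib.h>

#define M 101                       /* number of buckets (illustrative) */

typedef struct node { int key; struct node *next; } node;
static node *table[M];              /* one linked list per bucket */

/* O(1) insert: push the new item onto the front of its bucket's list. */
void chain_insert(int key) {
    node *n = malloc(sizeof *n);
    n->key = key;
    n->next = table[key % M];
    table[key % M] = n;
}

/* Linear search within one bucket's list. */
int chain_find(int key) {
    for (node *n = table[key % M]; n != NULL; n = n->next)
        if (n->key == key)
            return 1;
    return 0;
}
```

Keys 7 and 108 collide (both hash to bucket 7), and the list in that bucket keeps both.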

Chaining 2

- Insertion time? O(1)
- Average search cost, successful search? O(N/2M)
- Average search cost, unsuccessful? O(N/M)
- For large M: constant average search time
- Worst case: N ("probabilistically unlikely")
- Keep the lists sorted? Insert time becomes O(N/2M); unsuccessful search time O(N/2M)

Linear Probing (§2.5.5)

Alternatively, keep everything in the same table:
- Insert: upon collision, scan forward for a free spot
- Search: probe the same sequence; if you reach a free spot, the search fails
- Runtime? Still O(1) if the table is sparse
- But as the table fills, clustering occurs
- Skipping c spots instead of 1 doesn't help...
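A linear-probing sketch (integer keys, with 0 reserved to mark an empty slot, and the table assumed never completely full):

```c
#define TSIZE 13                    /* table size (illustrative) */
static int slot[TSIZE];             /* 0 = empty; keys must be nonzero */

/* Insert: start at h(k) = k mod TSIZE and scan forward for a free spot. */
int probe_insert(int key) {
    int i = key % TSIZE;
    while (slot[i] != 0 && slot[i] != key)
        i = (i + 1) % TSIZE;        /* collision: try the next slot */
    slot[i] = key;
    return i;
}

/* Search probes the same sequence; an empty slot means "not present". */
int probe_find(int key) {
    int i = key % TSIZE;
    while (slot[i] != 0) {
        if (slot[i] == key)
            return i;
        i = (i + 1) % TSIZE;
    }
    return -1;
}
```

Keys 5 and 18 both hash to 5; the second lands in slot 6, the start of a cluster.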

Clustering

- Long clusters tend to get longer
- Precise analysis is difficult
- Theorem (Knuth):
  - Insert cost: approx. (1 + 1/(1 - N/M)^2)/2 (50% full -> 2.5 probes; 80% full -> 13 probes)
  - Search (hit) cost: approx. (1 + 1/(1 - N/M))/2 (50% full -> 1.5 probes; 80% full -> 3 probes)
  - Search (miss): same as insert
- Too slow when the table gets 70-80% full
- How to reduce/avoid clustering?

Double Hashing

- Use a second hash function to compute the probe increment sequence
- Analysis is extremely difficult; behaves about like the ideal (random probing)
- Theorem (Guibas-Szemerédi):
  - Insert: approx. 1 + 1/(1 - N/M)
  - Search hit: approx. ln(1 + N/M)/(N/M)
  - Search miss: same as insert
- Not too slow until the table is about 90% full

Example of Double Hashing

Consider a hash table storing integer keys that handles collisions with double hashing:
- N = 13
- h(k) = k mod 13
- d(k) = 7 - (k mod 7)
Insert keys 18, 41, 22, 44, 59, 32, 31, 73, in this order.
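The example can be traced in code (a sketch; 0 marks an empty cell, which is safe here since all the example keys are nonzero):

```c
#define N 13
static int cell[N];                        /* 0 = empty */

/* h(k) = k mod 13 gives the start; the secondary hash
   d(k) = 7 - (k mod 7) gives the step, so probe j lands
   at (h(k) + j*d(k)) mod N. */
int dh_insert(int k) {
    int h = k % N, d = 7 - (k % 7);
    for (int j = 0; j < N; j++) {
        int i = (h + j * d) % N;
        if (cell[i] == 0) {
            cell[i] = k;
            return i;
        }
    }
    return -1;                             /* table full */
}
```

Inserting 18, 41, 22, 44, 59, 32, 31, 73 in order leaves 18 in cell 5, 41 in 2, 22 in 9, 44 in 10 (after colliding at 5 with step d(44) = 5), 59 in 7, 32 in 6, 31 in 0 (after probing 5 and 9 with step d(31) = 4), and 73 in 8.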

Dynamic Hash Tables

Suppose you are building a symbol table for a compiler: how big should you make the hash table? If you don't know in advance how big a table to make, grow the table when it "fills" (e.g., reaches 50% full):
- Make a new table of twice the size
- Make a new hash function
- Re-hash all of the items into the new table
- Dispose of the old table
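The grow-and-rehash steps above can be sketched as follows (integer keys, linear probing, 0 = empty slot; an illustration, not an actual compiler symbol table):

```c
#include <stdlib.h>

typedef struct { int *slot; int cap, n; } table;   /* 0 = empty slot */

static void place(int *slot, int cap, int key) {   /* linear probing */
    int i = key % cap;
    while (slot[i] != 0)
        i = (i + 1) % cap;
    slot[i] = key;
}

/* Make a table of twice the size, re-hash every stored item into it,
   then dispose of the old array. */
static void grow(table *t) {
    int newcap = 2 * t->cap;
    int *fresh = calloc(newcap, sizeof *fresh);
    for (int i = 0; i < t->cap; i++)
        if (t->slot[i] != 0)
            place(fresh, newcap, t->slot[i]);
    free(t->slot);
    t->slot = fresh;
    t->cap = newcap;
}

void dyn_insert(table *t, int key) {
    if (2 * (t->n + 1) > t->cap)    /* grow before passing 50% full */
        grow(t);
    place(t->slot, t->cap, key);
    t->n++;
}
```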

Table Growing Analysis

- Worst-case insertion: Θ(n), to re-hash all items
- Can we make any better statements? Average case? O(1), since insertions n through 2n cost O(n) (on average) for the insertions and O(2n) (on average) for rehashing: O(n) total (with 3x the constant)
- Amortized analysis? The result above is actually an amortized result for the rehashing: any sequence of j insertions into an empty table has O(j) cost for insertions and O(2j) for rehashing. Or, think of it as billing 3 time units for each insertion, storing 2 in the bank, and withdrawing them later to pay for rehashing.

Separate Chaining vs. Double Hashing

Assume the same amount of space for keys and links (use pointers for long or variable-length keys).

- Separate chaining: 1M buckets, 4M keys, 4M links in nodes: 9M words total; average search time 2
- Double hashing in the same space: 4M items, 9M buckets in the table; average search time 1/(1 - 4/9) = 1.8, i.e., 10% faster
- Double hashing in the same time: 4M items, average search time 2; space needed: 8M words (1/(1 - 4/8) = 2), i.e., 11% less space

Deletion

- How to implement delete() with separate chaining? Simply unlink the unwanted item. Runtime? Same as search()
- How to implement delete() with linear probing? You can't just erase the item. (Why not?) Either re-hash the entire cluster, or mark the slot as deleted
- How to delete() with double hashing? Re-hashing a cluster doesn't work (which "cluster"?), so mark the slot as deleted, and every so often re-hash the entire table to prune the "dead wood"
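A sketch of the mark-as-deleted idea under open addressing (integer keys; 0 = never used, -1 = tombstone, both values assumed unused as keys). It also shows why erasing outright is wrong: a search only stops at a never-used slot, so a tombstone keeps probe chains intact.

```c
#define M 13
static int cells[M];                /* 0 = empty, -1 = deleted (tombstone) */

int oa_insert(int key) {            /* linear probing; reuses tombstones */
    int i = key % M;
    while (cells[i] != 0 && cells[i] != -1)
        i = (i + 1) % M;
    cells[i] = key;
    return i;
}

/* Search skips over tombstones: only a never-used slot ends it.
   Punching a real hole would "lose" keys placed beyond it. */
int oa_find(int key) {
    int i = key % M;
    while (cells[i] != 0) {
        if (cells[i] == key)
            return i;
        i = (i + 1) % M;
    }
    return -1;
}

void oa_delete(int key) {
    int i = oa_find(key);
    if (i >= 0)
        cells[i] = -1;              /* tombstone, not a hole */
}
```

After inserting 5 and 18 (which collide at slot 5) and then deleting 5, the key 18 is still reachable because the search walks through the tombstone.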

Comparisons

Separate chaining advantages:
- Idiot-proof (degrades gracefully)
- No large chunks of memory needed (but is this better?)

Why use hashing?
- Fastest dictionary implementation: constant-time search and insert, on average
- Easy to implement; built into many environments

Why not use hashing?
- No performance guarantees
- Uses extra space
- Doesn't support pred, succ, sort, etc.: there is no notion of order

Where did perl "hashes" get their name?

Hashing Summary

- Separate chaining: easiest to deploy
- Linear probing: fastest (but takes more memory)
- Double hashing: least memory (but takes more time to compute the second hash function)
- Dynamic (grow): handles any number of inserts at less than 3x time
- Curious use of hashing: the early unix spell checker (back in the days of 3M machines...)

[Timing table: construction and search-miss costs for chaining, probing, double hashing, and growing tables, for key counts from 5k up; the cell values were lost in transcription.]

File tamper test Problem: you want to guarantee that a file hasn’t been tampered with. How?

Password verification Problem: you run a website. You need to verify people’s logins. How? Possible techniques? Ethics?

Cache filenames

Problem: caching FlexScores.

    $outfileName = $outDirectory . "/" . $cfg->hymn . "-" . $cfg->instrument;
    if ($custom)
        $outfileName .= "-" . substr(md5($cfg->asXML()), 0, 6);
    $outfileNameLy = $outfileName . ".ly";

Turing Test You want to generate random text that sounds like it makes sense. How?

Hash Tables in C#

System.Collections provides basic data structures: Hashtable, ArrayList. (See handout.)

GUI Programming in C#

Microsoft has provided a number of different GUI programming environments over the years:
- MFC: Microsoft Foundation Classes. C++, legacy
- Windows Forms: .NET wrapper for access to the native Windows interface
- WPF: Windows Presentation Foundation. .NET managed code, primarily C# and VB; an XML file defines the user interface; better support for media, animation, etc.