March 23 & 28, 2000 Csci 2111: Data and File Structures, Week 10, Lectures 1 & 2: Hashing

2 What is Hashing? A hash function is a function h(K) that transforms a key K into an address. Hashing is like indexing in that it involves associating a key with a relative record address. Hashing, however, differs from indexing in two important ways:
–With hashing, there is no obvious connection between the key and the location.
–With hashing, two different keys may be transformed to the same address.
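As a toy illustration (not taken from the lecture), suppose the keys are integers and the file has N = 11 record slots; a minimal hash function could simply take the key modulo N. The keys and table size below are made up.

# A minimal sketch, assuming integer keys and N = 11 record slots.
N = 11

def h(key):
    """Transform a key K into a relative record address in 0..N-1."""
    return key % N

print(h(1021))   # 9  -- no obvious connection between key and address
print(h(2397))   # 10
print(h(1032))   # 9  -- a different key can land on the same address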

3 Collisions When two different keys produce the same address, there is a collision. The keys involved are called synonyms. Coming up with a hashing function that avoids collisions is extremely difficult; it is best to simply find ways to deal with them. Possible solutions:
–Spread out the records.
–Use extra memory.
–Put more than one record at a single address.

4 A Simple Hashing Algorithm
Step 1: Represent the key in numerical form.
Step 2: Fold and add.
Step 3: Divide by a prime number and use the remainder as the address.
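A sketch of the three steps in Python, assuming string keys, 2-character folding, and a table of 101 addresses; the chunk size and table size are illustrative choices, not the only possible scheme.

TABLE_SIZE = 101          # a prime number of addresses

def hash_key(key):
    # Step 1: represent the key in numerical form (ASCII codes).
    codes = [ord(c) for c in key.upper()]
    # Step 2: fold and add -- combine the codes two at a time so that
    # every character influences the result.
    total = 0
    for i in range(0, len(codes), 2):
        total += codes[i] * 100 + (codes[i + 1] if i + 1 < len(codes) else 0)
    # Step 3: divide by a prime number and use the remainder as the address.
    return total % TABLE_SIZE

print(hash_key("LOWELL"))  # 86, an address in 0..100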

5 Hashing Functions and Record Distributions Records can be distributed among addresses in different ways: there may be (a) no synonyms (uniform distribution); (b) only synonyms (worst case); (c) a few synonyms (happens with random distributions). Purely uniform distributions are difficult to obtain and may not be worth searching for. Random distributions can be easily derived, but they are not perfect since they may generate a fair number of synonyms. We want better hashing methods.

6 Some Other Hashing Methods Though there is no hash function that guarantees better-than-random distributions in all cases, by taking into consideration the keys that are being hashed, certain improvements are possible. Here are some methods that are potentially better than random:
–Examine keys for a pattern.
–Fold parts of the key.
–Divide the key by a number.
–Square the key and take the middle.
–Radix transformation.
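As a rough illustration, two of these methods, squaring the key and taking the middle, and dividing the key by a number, might look as follows; the key values, table sizes, and digit positions are made-up assumptions.

def mid_square(key, num_addresses=1000):
    """Square the key and take the middle digits as the address."""
    squared = str(key * key).zfill(12)
    middle = squared[len(squared) // 2 - 2 : len(squared) // 2 + 1]  # 3 middle digits
    return int(middle) % num_addresses

def divide(key, num_addresses=997):
    """Divide the key by a number (ideally prime) and keep the remainder."""
    return key % num_addresses

print(mid_square(4150), divide(4150))   # 172 162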

7 Predicting the Distribution of Records When using a random distribution, we can use a number of mathematical tools, chief among them the Poisson distribution, to obtain conservative estimates of how our hashing function is likely to behave.

8 Predicting Collisions for a Full File Suppose you have a hashing function that you believe will distribute records randomly and you want to store 10,000 records in 10,000 addresses. How many addresses do you expect to have no records assigned to them? How many addresses should have one, two, and three records assigned respectively? How can we reduce the number of overflow records?
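A back-of-the-envelope answer, assuming the records really are spread as if at random so that the Poisson estimate p(x) = ((r/N)^x * e^(-r/N)) / x! applies:

import math

r, N = 10_000, 10_000          # records and addresses
density = r / N                # here exactly 1.0

def p(x):
    """Poisson estimate of the fraction of addresses receiving x records."""
    return (density ** x) * math.exp(-density) / math.factorial(x)

for x in range(4):
    print(f"addresses with {x} record(s): about {N * p(x):.0f}")
# roughly 3679 empty, 3679 with one record, 1839 with two, 613 with three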

9 Increasing Memory Space I Reducing collisions can be done by choosing a good hashing function or by using extra memory. The question asked here is: how much extra memory should be used to obtain a given rate of collision reduction? Definition: Packing density refers to the ratio of the number of records to be stored (r) to the number of available spaces (N). The packing density gives a measure of the amount of space in a file that is actually used.

10 Increasing Memory Space II The Poisson distribution allows us to predict the number of collisions that are likely to occur given a certain packing density. We use the Poisson distribution to answer the following questions:
–How many addresses should have no records assigned to them?
–How many addresses should have exactly one record assigned (no synonym)?
–How many addresses should have one record plus one or more synonyms?
–Assuming that only one record can be assigned to each home address, how many overflow records can be expected?
–What percentage of records should be overflow records?
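To make the questions concrete, the sketch below evaluates the same Poisson estimate for an assumed packing density of 0.5 and 1,000 addresses; both values are only examples.

import math

N = 1000                        # available addresses (illustrative)
density = 0.5                   # packing density r/N
r = int(density * N)            # records to be stored

def p(x):
    return (density ** x) * math.exp(-density) / math.factorial(x)

empty = N * p(0)                              # no records assigned
exactly_one = N * p(1)                        # one record, no synonyms
with_synonyms = N * (1 - p(0) - p(1))         # one record plus synonyms
# With one record per home address, every record beyond the first overflows:
overflow = sum(N * p(x) * (x - 1) for x in range(2, 20))

print(f"empty: {empty:.0f}, single: {exactly_one:.0f}, "
      f"with synonyms: {with_synonyms:.0f}, overflow: {overflow:.0f} "
      f"({100 * overflow / r:.1f}% of records)")   # about 21% overflow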

11 Collision Resolution by Progressive Overflow How do we deal with records that cannot fit into their home address? A simple approach: Progressive Overflow, also known as Linear Probing. If a key k1 hashes into the same address a1 as another key k2, then look for the first available address a2 following a1 and place k1 in a2. If the end of the address space is reached, wrap around to the beginning. When searching for a key that is not in the file, if the address space is not full, then either an empty address will be reached or the search will come back to where it began.
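A minimal in-memory sketch of progressive overflow; the table size, the hash function, and the use of a Python list in place of disk addresses are illustrative assumptions.

TABLE_SIZE = 11
table = [None] * TABLE_SIZE                 # None marks an empty address

def insert(key):
    addr = key % TABLE_SIZE                 # home address
    for step in range(TABLE_SIZE):
        slot = (addr + step) % TABLE_SIZE   # wrap around at the end
        if table[slot] is None:
            table[slot] = key
            return
    raise RuntimeError("address space is full")

def search(key):
    addr = key % TABLE_SIZE
    for step in range(TABLE_SIZE):
        slot = (addr + step) % TABLE_SIZE
        if table[slot] is None:             # empty address: key is not in the file
            return None
        if table[slot] == key:
            return slot
    return None                             # came back to where the search began

for k in (20, 31, 42):                      # all three hash to home address 9
    insert(k)
print(search(31), search(99))               # 10 None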

12 Search Length when using Progressive Overflow Progressive Overflow causes extra searches and thus extra disk accesses. If there are many collisions, then many records will be far from “home”. Definitions: Search length refers to the number of accesses required to retrieve a record from secondary memory. The average search length is the average number of times you can expect to have to access the disk to retrieve a record. Average search length = (Total search length)/(Total number of records)
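For example, with made-up numbers, five records whose retrievals take 1, 1, 2, 1, and 5 accesses give an average search length of 2:

search_lengths = [1, 1, 2, 1, 5]                    # accesses needed per record (illustrative)
print(sum(search_lengths) / len(search_lengths))    # 2.0 accesses on average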

13 Storing More than One Record per Address: Buckets Definition: A bucket describes a block of records sharing the same address that is retrieved in one disk access. When a record is to be stored or retrieved, its home bucket address is determined by hashing. When a bucket is filled, we still have to worry about the record overflow problem, but this occurs much less often than when each address can hold only one record.
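A small sketch of the idea, assuming an in-memory list of buckets stands in for blocks on disk and that each bucket holds at most three records; both are illustrative choices.

NUM_BUCKETS = 7
BUCKET_SIZE = 3                       # records per bucket (illustrative)
buckets = [[] for _ in range(NUM_BUCKETS)]

def store(key):
    """Place the record in its home bucket; report overflow if the bucket is full."""
    bucket = buckets[key % NUM_BUCKETS]
    if len(bucket) < BUCKET_SIZE:
        bucket.append(key)
        return True
    return False                      # overflow: must be handled separately

def retrieve(key):
    """One 'disk access' reads the whole home bucket; then search within it."""
    return key in buckets[key % NUM_BUCKETS]

for k in (10, 17, 24, 31):            # all four hash to bucket 3
    print(k, store(k))                # the fourth one overflows
print(retrieve(17))                   # True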

14 Effect of Buckets on Performance To compute how densely packed a file is, we need to consider 1) the number of addresses N (buckets), 2) the number of records we can put at each address b (the bucket size), and 3) the number of records r. Then: Packing Density = r/(bN). Though the packing density does not change when halving the number of addresses and doubling the size of the buckets, the expected number of overflows decreases dramatically.
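The claim can be checked with the same Poisson estimate used earlier; the record and address counts below, and the helper expected_overflow, are illustrative rather than the lecture's.

import math

def expected_overflow(r, N, b):
    """Expected number of overflow records when N buckets of size b hold r records."""
    mean = r / N                      # records per bucket on average
    p = lambda x: (mean ** x) * math.exp(-mean) / math.factorial(x)
    return N * sum((x - b) * p(x) for x in range(b + 1, 25))

r = 500
print(expected_overflow(r, 1000, 1))  # about 107 overflow records
print(expected_overflow(r, 500, 2))   # about 52, at the same packing density 0.5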

15 Making Deletions Deleting a record from a hashed file is more complicated than adding one, for two reasons:
–The slot freed by the deletion must not be allowed to hinder later searches.
–It should be possible to reuse the freed slot for later additions.
To deal with deletions we use tombstones, i.e., a marker indicating that a record once lived there but no longer does. Tombstones solve both of the problems caused by deletion. Insertion of records is slightly different when using tombstones.
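A sketch of how tombstones fit into a progressive-overflow table like the earlier one; the marker value "#" and the table size are arbitrary choices.

TABLE_SIZE = 11
EMPTY, TOMBSTONE = None, "#"
table = [EMPTY] * TABLE_SIZE

def probe(key):
    """Yield the slots to examine for this key, in progressive-overflow order."""
    home = key % TABLE_SIZE
    for step in range(TABLE_SIZE):
        yield (home + step) % TABLE_SIZE

def insert(key):
    for slot in probe(key):
        if table[slot] is EMPTY or table[slot] == TOMBSTONE:  # freed slots are reused
            table[slot] = key
            return

def search(key):
    for slot in probe(key):
        if table[slot] is EMPTY:          # truly empty: stop searching
            return None
        if table[slot] == key:            # tombstones are skipped, not stopped at
            return slot
    return None

def delete(key):
    slot = search(key)
    if slot is not None:
        table[slot] = TOMBSTONE           # mark the slot so later searches keep going

insert(20)
insert(31)                                # synonym of 20, stored one slot further
delete(20)
print(search(31))                         # still found at slot 10, thanks to the tombstone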

16 Effects of Deletions and Additions on Performance After a large number of deletions and additions have taken place, one can expect to find many tombstones occupying places that could be occupied by records whose home addresses precede them but that are stored after them. This degrades average search lengths. There are three types of solutions for dealing with this problem: a) local reorganization during deletions; b) global reorganization when the average search length becomes too large; c) use of a different collision resolution algorithm.

17 Other Collision Resolution Techniques There are a few variations on random hashing that may improve performance:
–Double Hashing: When an overflow occurs, use a second hashing function to map the record to its overflow location.
–Chained Progressive Overflow: Like progressive overflow, except that synonyms are linked together with pointers.
–Chaining with a Separate Overflow Area: Like chained progressive overflow, except that overflow records do not occupy home addresses.
–Scatter Tables: The hash file contains no records, only pointers to records; i.e., it is an index.
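As an example of the first variation, the sketch below generates a double-hashing probe sequence; the second hash function h2 is one common textbook choice, not necessarily the one used in the lecture.

TABLE_SIZE = 11                      # prime, so every address gets visited

def h1(key):
    return key % TABLE_SIZE          # home address

def h2(key):
    return 1 + key % (TABLE_SIZE - 2)    # step size, never zero

def probe_sequence(key):
    """Addresses tried for this key when overflows occur."""
    return [(h1(key) + i * h2(key)) % TABLE_SIZE for i in range(TABLE_SIZE)]

# Two synonyms share home address 9 but follow different overflow paths:
print(probe_sequence(20))    # [9, 1, 4, 7, 10, 2, 5, 8, 0, 3, 6]
print(probe_sequence(31))    # [9, 3, 8, 2, 7, 1, 6, 0, 5, 10, 4]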

18 Pattern of Record Access If we have some information about what records get accessed most often, we can optimize their location so that these records will have short search lengths. By doing this, we try to decrease the effective average search length even if the nominal average search length remains the same. This principle is related to the one used in Huffman encoding.